This repository contains practical examples and exercises for data cleaning using Python's Pandas library.
Data cleaning is a crucial step in the data analysis pipeline. This project demonstrates common data cleaning techniques including handling missing values, removing duplicates, standardizing formats, and detecting outliers.
- practicals.py - Hands-on data cleaning examples covering:
  - String cleaning (trimming whitespace)
  - Currency format standardization
  - Missing value imputation using median
  - Data type conversions
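
As an illustration of the data type conversions item, the sketch below turns date strings written in several formats into a single datetime column. The sample values and the `candidate_formats` list are assumptions made for this example; they are not taken from practicals.py, which may handle dates differently.

```python
import pandas as pd

# Hypothetical Join_Date values in mixed formats (not from practicals.py).
join_raw = pd.Series(["2021-01-15", "15/02/2020", "March 3, 2019", "unknown"])

# Assumed list of formats to try, in order.
candidate_formats = ["%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y"]

# Parse with the first format; strings that do not match become NaT.
join_date = pd.to_datetime(join_raw, format=candidate_formats[0], errors="coerce")

# Fill the remaining NaT entries using each of the other formats in turn.
for fmt in candidate_formats[1:]:
    parsed = pd.to_datetime(join_raw, format=fmt, errors="coerce")
    join_date = join_date.fillna(parsed)

print(join_date)
```

Trying explicit formats one at a time avoids guessing whether an ambiguous value such as 03/04/2020 is day-first or month-first. Values that match none of the formats stay as NaT and can be reviewed separately.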
The main dataset includes employee information with the following fields:
- Name - Employee names (with various formatting issues)
- Join_Date - Joining dates (in multiple formats)
- Salary - Salary information (with currency symbols and missing values)
- Age - Age (includes outliers for demonstration)
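
To make the field descriptions concrete, here is a small sketch of a DataFrame with the same four columns and the kinds of problems listed above. The rows are invented for illustration and are not the actual data used in practicals.py.

```python
import pandas as pd

# Hypothetical sample rows; the real dataset in practicals.py may differ.
df = pd.DataFrame({
    "Name": ["  Alice ", "Bob  ", " Charlie"],                   # stray whitespace
    "Join_Date": ["2021-01-15", "15/02/2020", "March 3, 2019"],  # mixed formats
    "Salary": ["$50,000", None, "62000"],                        # symbols and a gap
    "Age": [29, 34, 250],                                        # 250 is an outlier
})

# Inspect column types and count missing values before cleaning anything.
print(df.dtypes)
missing_per_column = df.isna().sum()
print(missing_per_column)
```

Checking `dtypes` first shows which columns pandas read as text, which is usually the first hint that a numeric or date column needs cleaning.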
    df['Name'] = df['Name'].str.strip()

Removes leading and trailing whitespace from names.
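
Note that `str.strip()` only touches the ends of each string. The sketch below, using made-up names, shows that behavior and one possible way to also collapse repeated spaces inside a name. The extra `str.replace` step is an addition for illustration and is not part of the snippet above.

```python
import pandas as pd

# Hypothetical names with leading, trailing, and internal whitespace.
names = pd.Series(["  Alice Smith ", "Bob   Jones", "\tCarol Lee\n"])

# strip() removes whitespace (spaces, tabs, newlines) at both ends only.
stripped = names.str.strip()

# Optional extra step: collapse runs of internal whitespace to one space.
normalized = stripped.str.replace(r"\s+", " ", regex=True)

print(stripped.tolist())    # ['Alice Smith', 'Bob   Jones', 'Carol Lee']
print(normalized.tolist())  # ['Alice Smith', 'Bob Jones', 'Carol Lee']
```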
    df['Salary'] = df['Salary'].astype(str).str.replace(r'[$,]','', regex=True)
    df['Salary'] = pd.to_numeric(df['Salary'], errors='coerce')

Removes dollar signs and thousands separators, then converts the column to a numeric type. With errors='coerce', values that cannot be parsed become NaN instead of raising an error.
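
Because `errors='coerce'` silently turns bad values into NaN, it can help to check which original entries failed to parse. The following sketch applies the same two steps to invented salary strings and then lists the entries that were present but unparseable. The sample values and the `unparsed` check are assumptions added for this example.

```python
import pandas as pd

# Hypothetical raw salary values, including a missing entry and a bad string.
salary_raw = pd.Series(["$50,000", "62000", None, "N/A", "$1,250.50"])

# Same steps as above: drop "$" and "," characters, then convert to numbers.
cleaned = salary_raw.astype(str).str.replace(r"[$,]", "", regex=True)
salary = pd.to_numeric(cleaned, errors="coerce")

# Entries that had a value originally but could not be converted.
unparsed = salary_raw[salary.isna() & salary_raw.notna()]

print(salary)
print(unparsed)
```

Here the missing entry and the string "N/A" both end up as NaN, but only "N/A" appears in `unparsed`, which separates genuinely missing data from formatting problems.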
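    median_val = df['Salary'].median()
    df['Salary'] = df['Salary'].fillna(median_val)

Fills missing salary values with the median. The median is less sensitive to extreme values than the mean, so the imputed values are not pulled toward unusually high or low salaries.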
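
The difference between median and mean imputation is easiest to see with one extreme value in the column. This is a minimal sketch with invented salaries, not data from practicals.py.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries with one missing value and one extreme value.
salary = pd.Series([40000.0, 50000.0, np.nan, 60000.0, 1000000.0])

# median() and mean() both skip NaN by default.
median_val = salary.median()  # 55000.0
mean_val = salary.mean()      # 287500.0

# Fill the gap with the median, as in the snippet above.
filled = salary.fillna(median_val)

print(median_val, mean_val)
print(filled)
```

The single very large salary drags the mean far above the typical values, while the median stays between the two middle salaries.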
- Python 3.x
- pandas
- numpy
    pip install pandas numpy

Run the practical examples:

    python practicals.py

- Always inspect data for inconsistencies before analysis
- Use appropriate imputation methods for missing values
- Handle data type conversions carefully with error handling
- Strip whitespace and standardize formats for consistency
- Be aware of outliers in your dataset
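
For the last point, one common way to flag outliers is the interquartile range (IQR) rule. The sketch below applies it to invented ages. The 1.5 multiplier is a conventional choice rather than something taken from practicals.py, and the repository may detect outliers differently.

```python
import pandas as pd

# Hypothetical ages; 250 is an obviously invalid value.
age = pd.Series([25, 29, 31, 34, 38, 41, 250])

# Quartiles and interquartile range.
q1 = age.quantile(0.25)
q3 = age.quantile(0.75)
iqr = q3 - q1

# Conventional IQR fences: values beyond 1.5 * IQR from the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = age[(age < lower) | (age > upper)]
print(outliers.tolist())  # [250]
```

Flagged values still need a decision: an age of 250 is clearly an entry error, but in other columns an extreme value may be legitimate and worth keeping.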
This project is for educational purposes.