# 04 - Data Cleaning

## Introduction

Real-world data is often messy: it can contain missing values, duplicate rows, incorrect data types, and inconsistent formatting. Cleaning it is a crucial step before analysis, because these problems can silently distort counts, averages, and other results.

## What You'll Learn

- Identifying missing values
- Handling missing values (drop, fill)
- Finding and removing duplicates
- Converting data types
- Renaming columns
- Reordering columns


In [1]:
import pandas as pd
import numpy as np

# Create a DataFrame with missing values and duplicates
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Bob'],
    'Age': [25, 30, None, 28, 32, 30],
    'City': ['New York', 'London', 'Tokyo', None, 'Sydney', 'London'],
    'Salary': [50000, 60000, 70000, 55000, None, 60000],
    'Department': ['IT', 'Sales', 'IT', 'Marketing', 'IT', 'Sales']
})

print("Sample DataFrame with missing values:")
print(df)
print(f"\nShape: {df.shape}")


Sample DataFrame with missing values:
      Name   Age      City   Salary Department
0    Alice  25.0  New York  50000.0         IT
1      Bob  30.0    London  60000.0      Sales
2  Charlie   NaN     Tokyo  70000.0         IT
3    Diana  28.0      None  55000.0  Marketing
4      Eve  32.0    Sydney      NaN         IT
5      Bob  30.0    London  60000.0      Sales

Shape: (6, 5)


## Identifying Missing Values

First, you need to identify where missing values are in your data.


In [2]:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum()}")


Missing values per column:
Name          0
Age           1
City          1
Salary        1
Department    0
dtype: int64

Total missing values: 3


In [3]:
# Check which rows have missing values
print("Rows with missing values:")
print(df[df.isnull().any(axis=1)])


Rows with missing values:
      Name   Age    City   Salary Department
2  Charlie   NaN   Tokyo  70000.0         IT
3    Diana  28.0    None  55000.0  Marketing
4      Eve  32.0  Sydney      NaN         IT
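
`isnull()` only detects values that pandas already treats as missing (`NaN`, `None`, `NaT`). Exported data often encodes gaps as placeholder text instead. The following is a minimal sketch, assuming hypothetical placeholders `''` and `'N/A'` and a made-up `raw` DataFrame (neither appears in the sample data above):

```python
import numpy as np
import pandas as pd

# Hypothetical raw data where gaps are encoded as placeholder strings
raw = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'City': ['New York', '', 'N/A'],
})

# isnull() does not recognise the placeholders as missing
hidden_before = int(raw['City'].isnull().sum())

# Convert the placeholders to real missing values first
cleaned = raw.replace(['', 'N/A'], np.nan)
hidden_after = int(cleaned['City'].isnull().sum())

print(f"Missing before: {hidden_before}, after: {hidden_after}")
```

Which strings count as placeholders depends on the data source, so the list passed to `replace()` is something to adapt rather than a fixed rule.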


In [4]:
# Percentage of missing values
print("Percentage of missing values:")
print((df.isnull().sum() / len(df)) * 100)


Percentage of missing values:
Name           0.000000
Age           16.666667
City          16.666667
Salary        16.666667
Department     0.000000
dtype: float64
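
The count and percentage checks above can be combined into one summary table. This is a sketch on a small hypothetical `sample` DataFrame. It relies on the fact that `isnull().mean()` gives the fraction of missing values per column:

```python
import pandas as pd

# Hypothetical data, smaller than the tutorial's df
sample = pd.DataFrame({
    'Age': [25, 30, None, 28],
    'City': ['New York', None, 'Tokyo', None],
})

# One row per column: missing count and missing percentage
summary = pd.DataFrame({
    'missing': sample.isnull().sum(),
    'percent': sample.isnull().mean() * 100,
}).sort_values('percent', ascending=False)

print(summary)
```

Sorting by percentage puts the most incomplete columns first, which helps when deciding whether to drop or fill them.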


## Handling Missing Values - Dropping

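You can drop rows or columns with missing values using `dropna()`. By default it drops every row that contains at least one missing value and returns a new DataFrame, leaving the original unchanged.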


In [5]:
# Drop rows with any missing values
df_dropped = df.dropna()
print("After dropping rows with missing values:")
print(df_dropped)
print(f"\nOriginal shape: {df.shape}, New shape: {df_dropped.shape}")


After dropping rows with missing values:
    Name   Age      City   Salary Department
0  Alice  25.0  New York  50000.0         IT
1    Bob  30.0    London  60000.0      Sales
5    Bob  30.0    London  60000.0      Sales

Original shape: (6, 5), New shape: (3, 5)


In [6]:
# Drop rows where all values are missing (the sample has no such rows, so nothing is removed)
df_dropped_all = df.dropna(how='all')
print("After dropping rows where ALL values are missing:")
print(df_dropped_all)


After dropping rows where ALL values are missing:
      Name   Age      City   Salary Department
0    Alice  25.0  New York  50000.0         IT
1      Bob  30.0    London  60000.0      Sales
2  Charlie   NaN     Tokyo  70000.0         IT
3    Diana  28.0      None  55000.0  Marketing
4      Eve  32.0    Sydney      NaN         IT
5      Bob  30.0    London  60000.0      Sales


In [7]:
# Drop columns with missing values
df_dropped_cols = df.dropna(axis=1)
print("After dropping columns with missing values:")
print(df_dropped_cols)


After dropping columns with missing values:
      Name Department
0    Alice         IT
1      Bob      Sales
2  Charlie         IT
3    Diana  Marketing
4      Eve         IT
5      Bob      Sales
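
Dropping on any missing value is often too aggressive. As a sketch, `dropna()` also accepts `subset` (only look at certain columns) and `thresh` (minimum number of non-missing values a row must have). The `people` DataFrame below is hypothetical:

```python
import pandas as pd

# Hypothetical data with gaps in different columns
people = pd.DataFrame({
    'Name': ['Alice', 'Charlie', 'Diana', 'Eve'],
    'Age': [25, None, 28, None],
    'City': ['New York', 'Tokyo', None, None],
})

# Drop rows only when Age is missing; gaps in City are tolerated
has_age = people.dropna(subset=['Age'])

# Keep rows that have at least 2 non-missing values
mostly_complete = people.dropna(thresh=2)

print(has_age)
print(mostly_complete)
```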


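## Handling Missing Values - Filling

Instead of dropping data, you can replace missing values using `fillna()`.


In [8]: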
# Fill missing values with a specific value
df_filled = df.fillna(0)
print("After filling with 0:")
print(df_filled)


After filling with 0:
      Name   Age      City   Salary Department
0    Alice  25.0  New York  50000.0         IT
1      Bob  30.0    London  60000.0      Sales
2  Charlie   0.0     Tokyo  70000.0         IT
3    Diana  28.0         0  55000.0  Marketing
4      Eve  32.0    Sydney      0.0         IT
5      Bob  30.0    London  60000.0      Sales
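
Note that filling the whole DataFrame with `0` also puts a `0` in the text column `City`. One possible way to avoid that, sketched here on a hypothetical `staff` DataFrame, is to restrict the fill to numeric columns with `select_dtypes()`:

```python
import pandas as pd

# Hypothetical data with one numeric and one text column containing gaps
staff = pd.DataFrame({
    'Name': ['Charlie', 'Diana', 'Eve'],
    'Age': [None, 28.0, 32.0],
    'City': ['Tokyo', None, 'Sydney'],
})

filled = staff.copy()

# Fill only the numeric columns; text columns keep their missing values
numeric_cols = filled.select_dtypes(include='number').columns
filled[numeric_cols] = filled[numeric_cols].fillna(0)

print(filled)
```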


In [9]:
# Fill missing values with column-specific values
df_filled_specific = df.fillna({
    'Age': df['Age'].mean(),  # Fill Age with mean
    'City': 'Unknown',         # Fill City with 'Unknown'
    'Salary': df['Salary'].median()  # Fill Salary with median
})
print("After filling with specific values:")
print(df_filled_specific)


After filling with specific values:
      Name   Age      City   Salary Department
0    Alice  25.0  New York  50000.0         IT
1      Bob  30.0    London  60000.0      Sales
2  Charlie  29.0     Tokyo  70000.0         IT
3    Diana  28.0   Unknown  55000.0  Marketing
4      Eve  32.0    Sydney  60000.0         IT
5      Bob  30.0    London  60000.0      Sales
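
A single mean or median for the whole column ignores differences between groups. As an illustration only (this is a variation, not the method used above), missing salaries could be filled with the median of each department using `groupby()` and `transform()`. The `staff` DataFrame is hypothetical:

```python
import pandas as pd

# Hypothetical data: one missing salary per department
staff = pd.DataFrame({
    'Department': ['IT', 'IT', 'IT', 'Sales', 'Sales'],
    'Salary': [50000.0, 70000.0, None, 60000.0, None],
})

# Median salary of each row's own department, aligned to the original index
dept_median = staff.groupby('Department')['Salary'].transform('median')

# Fill each gap with its department's median
staff['Salary'] = staff['Salary'].fillna(dept_median)

print(staff)
```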


In [10]:
# Forward fill (use previous value)
# fillna(method='ffill') is deprecated in recent pandas; use ffill() instead
df_ffill = df.ffill()
print("After forward fill:")
print(df_ffill)


After forward fill:
      Name   Age      City   Salary Department
0    Alice  25.0  New York  50000.0         IT
1      Bob  30.0    London  60000.0      Sales
2  Charlie  30.0     Tokyo  70000.0         IT
3    Diana  28.0     Tokyo  55000.0  Marketing
4      Eve  32.0    Sydney  55000.0         IT
5      Bob  30.0    London  60000.0      Sales


In [11]:
# Backward fill (use next value)
# fillna(method='bfill') is deprecated in recent pandas; use bfill() instead
df_bfill = df.bfill()
print("After backward fill:")
print(df_bfill)


After backward fill:
      Name   Age      City   Salary Department
0    Alice  25.0  New York  50000.0         IT
1      Bob  30.0    London  60000.0      Sales
2  Charlie  28.0     Tokyo  70000.0         IT
3    Diana  28.0    Sydney  55000.0  Marketing
4      Eve  32.0    Sydney  60000.0         IT
5      Bob  30.0    London  60000.0      Sales
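
Forward and backward fill need a neighbouring value in the fill direction, and without a cap they copy a value across arbitrarily long gaps. A small sketch using the `ffill()` and `bfill()` methods with their `limit` parameter on a hypothetical `readings` series:

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading gap and a two-value gap
readings = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

# The leading NaN has no previous value, so forward fill leaves it alone
forward = readings.ffill()

# limit=1 fills at most one consecutive missing value per gap
forward_limited = readings.ffill(limit=1)

# Backward fill uses the next valid value, so the leading NaN is filled
backward = readings.bfill()

print(forward.tolist())
print(forward_limited.tolist())
print(backward.tolist())
```

These methods assume that row order is meaningful (for example, time-ordered data). On unordered rows, as in the employee sample, the filled values are arbitrary.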


## Finding and Removing Duplicates

Duplicate rows can inflate counts and skew aggregates such as sums and means. `duplicated()` flags repeated rows and `drop_duplicates()` removes them.


In [12]:
# Check for duplicate rows
print("Number of duplicate rows:")
print(df.duplicated().sum())
print("\nDuplicate rows:")
print(df[df.duplicated()])


Number of duplicate rows:
1

Duplicate rows:
  Name   Age    City   Salary Department
5  Bob  30.0  London  60000.0      Sales


In [13]:
# Remove duplicate rows (keep first occurrence)
df_no_duplicates = df.drop_duplicates()
print("After removing duplicates:")
print(df_no_duplicates)
print(f"\nOriginal shape: {df.shape}, New shape: {df_no_duplicates.shape}")


After removing duplicates:
      Name   Age      City   Salary Department
0    Alice  25.0  New York  50000.0         IT
1      Bob  30.0    London  60000.0      Sales
2  Charlie   NaN     Tokyo  70000.0         IT
3    Diana  28.0      None  55000.0  Marketing
4      Eve  32.0    Sydney      NaN         IT

Original shape: (6, 5), New shape: (5, 5)


In [14]:
# Remove duplicates based on specific columns
df_no_dup_cols = df.drop_duplicates(subset=['Name', 'Age'])
print("After removing duplicates based on Name and Age:")
print(df_no_dup_cols)


After removing duplicates based on Name and Age:
      Name   Age      City   Salary Department
0    Alice  25.0  New York  50000.0         IT
1      Bob  30.0    London  60000.0      Sales
2  Charlie   NaN     Tokyo  70000.0         IT
3    Diana  28.0      None  55000.0  Marketing
4      Eve  32.0    Sydney      NaN         IT
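
Both `duplicated()` and `drop_duplicates()` take a `keep` parameter that controls which copy counts as the original. The sketch below uses a hypothetical `people` DataFrame to compare the options:

```python
import pandas as pd

# Hypothetical data with one repeated row
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Bob', 'Eve'],
    'City': ['New York', 'London', 'London', 'Sydney'],
})

# keep=False marks every member of a duplicate group, useful for inspection
all_copies = people[people.duplicated(keep=False)]

# keep='last' retains the last occurrence instead of the first
keep_last = people.drop_duplicates(keep='last')

# keep=False drops every row that has a duplicate
drop_all = people.drop_duplicates(keep=False)

print(all_copies.index.tolist())
print(keep_last.index.tolist())
print(drop_all.index.tolist())
```

The surviving rows keep their original index labels. Chaining `.reset_index(drop=True)` renumbers them if a clean 0..n-1 index is wanted.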


## Converting Data Types

Sometimes columns have incorrect data types. You can convert them using `astype()` or `pd.to_numeric()`.


In [15]:
# Create DataFrame with incorrect data types
df_types = pd.DataFrame({
    'Age': ['25', '30', '35'],  # String instead of int
    'Salary': ['50000', '60000', '70000'],  # String instead of int
    'Active': ['True', 'False', 'True']  # String instead of bool
})

print("Original data types:")
print(df_types.dtypes)
print("\nDataFrame:")
print(df_types)


Original data types:
Age       object
Salary    object
Active    object
dtype: object

DataFrame:
  Age Salary Active
0  25  50000   True
1  30  60000  False
2  35  70000   True


In [16]:
# Convert data types
df_types['Age'] = df_types['Age'].astype(int)
df_types['Salary'] = pd.to_numeric(df_types['Salary'])
# astype(bool) would turn every non-empty string (including 'False') into True,
# so map the text values to booleans explicitly
df_types['Active'] = df_types['Active'].map({'True': True, 'False': False})

print("After conversion:")
print(df_types.dtypes)
print("\nDataFrame:")
print(df_types)


After conversion:
Age       int64
Salary    int64
Active     bool
dtype: object

DataFrame:
   Age  Salary  Active
0   25   50000    True
1   30   60000   False
2   35   70000    True
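
The conversions above assume every string is a clean number. With messy input, `astype(int)` and the default `pd.to_numeric()` raise an error on the first bad value. A sketch of a more tolerant approach, using a hypothetical `raw_salary` series and `errors='coerce'`:

```python
import pandas as pd

# Hypothetical column with a thousands separator and a non-numeric entry
raw_salary = pd.Series(['50000', '60,000', 'unknown', '70000'])

# errors='coerce' turns anything unparseable into NaN instead of raising
coerced = pd.to_numeric(raw_salary, errors='coerce')

# Removing the separator first recovers '60,000'; 'unknown' still becomes NaN
no_commas = raw_salary.str.replace(',', '', regex=False)
salary = pd.to_numeric(no_commas, errors='coerce')

print(coerced.tolist())
print(salary.tolist())
```

Coercion hides bad values as `NaN`, so it is worth checking `isnull().sum()` afterwards to see how many entries failed to parse.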


## Renaming Columns

You can rename columns using `rename()`.


In [17]:
# Rename columns
df_renamed = df.rename(columns={'Name': 'Employee_Name', 'Age': 'Employee_Age'})
print("After renaming:")
print(df_renamed.columns.tolist())


After renaming:
['Employee_Name', 'Employee_Age', 'City', 'Salary', 'Department']
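
When many columns need the same kind of fix, renaming them one by one is tedious. As a sketch, string methods on `df.columns` can normalise all names at once. The messy column names below are hypothetical:

```python
import pandas as pd

# Hypothetical DataFrame with inconsistent column names
messy = pd.DataFrame(columns=[' Employee Name ', 'AGE', 'Home City'])

# Strip whitespace, lowercase, and replace spaces with underscores
messy.columns = (
    messy.columns
    .str.strip()
    .str.lower()
    .str.replace(' ', '_', regex=False)
)

print(messy.columns.tolist())
```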


## Reordering Columns

You can reorder columns by selecting them in the desired order.


In [18]:
# Reorder columns
df_reordered = df[['Name', 'Department', 'Age', 'Salary', 'City']]
print("After reordering columns:")
print(df_reordered.head())


After reordering columns:
      Name Department   Age   Salary      City
0    Alice         IT  25.0  50000.0  New York
1      Bob      Sales  30.0  60000.0    London
2  Charlie         IT   NaN  70000.0     Tokyo
3    Diana  Marketing  28.0  55000.0      None
4      Eve         IT  32.0      NaN    Sydney
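
Listing every column by hand becomes error-prone on wide DataFrames. One possible alternative, sketched here on a hypothetical `staff` DataFrame, is to name only the columns that should come first and build the rest of the order from `df.columns`:

```python
import pandas as pd

# Hypothetical DataFrame
staff = pd.DataFrame({'Name': ['Alice'], 'Age': [25], 'Department': ['IT']})

# Columns to move to the front; all others keep their relative order
first = ['Department']
rest = [col for col in staff.columns if col not in first]
reordered = staff[first + rest]

print(reordered.columns.tolist())
```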


## Summary

In this notebook, you learned:
- ✅ How to identify missing values
- ✅ How to drop rows/columns with missing values
- ✅ How to fill missing values with various strategies
- ✅ How to find and remove duplicates
- ✅ How to convert data types
- ✅ How to rename and reorder columns

**Next:** Learn data transformation in `05_data_transformation.ipynb`
