# Data Cleaning in Pandas

This notebook covers essential data cleaning techniques in Pandas, including handling missing values, duplicates, and data type conversions.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np

print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

Pandas version: 2.2.3
NumPy version: 2.2.4


## Sample Data with Missing Values

Let's create a DataFrame with missing values to demonstrate cleaning techniques.

In [2]:
# Create DataFrame with missing values
data = {
    'Name': ['Alice', 'Bob', np.nan, 'Diana', 'Eve'],
    'Age': [25, np.nan, 35, 28, np.nan],
    'City': ['New York', 'London', np.nan, 'Tokyo', 'Sydney'],
    'Salary': [50000, 60000, 70000, np.nan, 65000],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR']
}

df = pd.DataFrame(data)
print("DataFrame with missing values:")
print(df)

print("\nMissing value counts:")
print(df.isnull().sum())

DataFrame with missing values:
    Name   Age      City   Salary Department
0  Alice  25.0  New York  50000.0         HR
1    Bob   NaN    London  60000.0         IT
2    NaN  35.0       NaN  70000.0    Finance
3  Diana  28.0     Tokyo      NaN         IT
4    Eve   NaN    Sydney  65000.0         HR

Missing value counts:
Name          1
Age           2
City          1
Salary        1
Department    0
dtype: int64
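
Raw counts are harder to compare when columns differ in length or importance, so a share per column can help. The sketch below rebuilds the same sample data so it runs on its own; the names `missing_share` and `rows_with_gaps` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same sample data as the cell above, rebuilt so this block is self-contained
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', np.nan, 'Diana', 'Eve'],
    'Age': [25, np.nan, 35, 28, np.nan],
    'City': ['New York', 'London', np.nan, 'Tokyo', 'Sydney'],
    'Salary': [50000, 60000, 70000, np.nan, 65000],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR'],
})

# isna() performs the same check as isnull(); the mean of the
# resulting booleans is the fraction of missing values per column
missing_share = df.isna().mean()
print(missing_share)

# Select only the rows that contain at least one missing value
rows_with_gaps = df[df.isna().any(axis=1)]
print(rows_with_gaps)
```

Inspecting the affected rows before choosing a strategy makes it easier to judge whether dropping or filling is the safer option.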


## Handling Missing Values

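Pandas offers three main strategies for missing values: dropping them with `dropna()`, filling them with `fillna()`, or estimating them with `interpolate()`. The cell below demonstrates dropping and filling.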

In [3]:
# Drop rows with missing values
print("Drop rows with any missing values:")
df_dropped = df.dropna()
print(df_dropped)

# Drop columns with missing values
print("\nDrop columns with any missing values:")
df_dropped_cols = df.dropna(axis=1)
print(df_dropped_cols)

# Fill missing values with a specific value
print("\nFill missing values with 'Unknown' for Name and City:")
df_filled = df.copy()
df_filled['Name'] = df_filled['Name'].fillna('Unknown')
df_filled['City'] = df_filled['City'].fillna('Unknown')
print(df_filled)

# Fill numeric missing values with mean
print("\nFill numeric missing values with column mean:")
df_filled_mean = df.copy()
df_filled_mean['Age'] = df_filled_mean['Age'].fillna(df_filled_mean['Age'].mean())
df_filled_mean['Salary'] = df_filled_mean['Salary'].fillna(df_filled_mean['Salary'].mean())
print(df_filled_mean)

Drop rows with any missing values:
    Name   Age      City   Salary Department
0  Alice  25.0  New York  50000.0         HR

Drop columns with any missing values:
  Department
0         HR
1         IT
2    Finance
3         IT
4         HR

Fill missing values with 'Unknown' for Name and City:
      Name   Age      City   Salary Department
0    Alice  25.0  New York  50000.0         HR
1      Bob   NaN    London  60000.0         IT
2  Unknown  35.0   Unknown  70000.0    Finance
3    Diana  28.0     Tokyo      NaN         IT
4      Eve   NaN    Sydney  65000.0         HR

Fill numeric missing values with column mean:
    Name        Age      City   Salary Department
0  Alice  25.000000  New York  50000.0         HR
1    Bob  29.333333    London  60000.0         IT
2    NaN  35.000000       NaN  70000.0    Finance
3  Diana  28.000000     Tokyo  61250.0         IT
4    Eve  29.333333    Sydney  65000.0         HR
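
Two variations are not shown in the cell above: interpolation, and dropping rows more selectively. The following is a minimal sketch on the same sample values, assuming the row order is meaningful enough for linear interpolation to make sense (for unordered records such as these it usually is not, so treat it as a syntax illustration only). The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', np.nan, 'Diana', 'Eve'],
    'Age': [25, np.nan, 35, 28, np.nan],
    'City': ['New York', 'London', np.nan, 'Tokyo', 'Sydney'],
    'Salary': [50000, 60000, 70000, np.nan, 65000],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR'],
})

# Linear interpolation; limit_area='inside' fills only gaps that have
# valid values on both sides, so the trailing NaN is left untouched
age_interp = df['Age'].interpolate(limit_area='inside')
print(age_interp)

# Drop a row only when one of the listed columns is missing
df_subset = df.dropna(subset=['Age', 'Salary'])
print(df_subset)

# Keep rows that have at least 4 non-missing values
df_thresh = df.dropna(thresh=4)
print(df_thresh)
```

Compared with a plain `dropna()`, which keeps a single row here, `subset` and `thresh` discard far less data.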


## Handling Duplicates

Pandas identifies duplicate rows with `duplicated()`, which flags every row that repeats an earlier one, and removes them with `drop_duplicates()`.

In [4]:
# Create DataFrame with duplicates
duplicate_data = {
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'Age': [25, 30, 25, 35, 30],
    'Department': ['HR', 'IT', 'HR', 'Finance', 'IT']
}

df_dup = pd.DataFrame(duplicate_data)
print("DataFrame with duplicates:")
print(df_dup)

# Check for duplicates
print("\nDuplicate rows:")
print(df_dup.duplicated())

# Remove duplicates
print("\nRemove duplicates:")
df_no_dup = df_dup.drop_duplicates()
print(df_no_dup)

# Remove duplicates based on specific columns
print("\nRemove duplicates based on 'Name' column:")
df_no_dup_name = df_dup.drop_duplicates(subset=['Name'])
print(df_no_dup_name)

DataFrame with duplicates:
      Name  Age Department
0    Alice   25         HR
1      Bob   30         IT
2    Alice   25         HR
3  Charlie   35    Finance
4      Bob   30         IT

Duplicate rows:
0    False
1    False
2     True
3    False
4     True
dtype: bool

Remove duplicates:
      Name  Age Department
0    Alice   25         HR
1      Bob   30         IT
3  Charlie   35    Finance

Remove duplicates based on 'Name' column:
      Name  Age Department
0    Alice   25         HR
1      Bob   30         IT
3  Charlie   35    Finance
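
By default both methods treat the first occurrence as the original. The `keep` parameter changes that, as in this short sketch on the same data (the variable names are illustrative):

```python
import pandas as pd

df_dup = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'Age': [25, 30, 25, 35, 30],
    'Department': ['HR', 'IT', 'HR', 'Finance', 'IT'],
})

# Count the rows that repeat an earlier row
n_dupes = df_dup.duplicated().sum()
print(f"Duplicate rows: {n_dupes}")

# keep=False flags every member of a duplicate group, originals included
all_copies = df_dup[df_dup.duplicated(keep=False)]
print(all_copies)

# keep='last' retains the final occurrence instead of the first
last_kept = df_dup.drop_duplicates(keep='last')
print(last_kept)
```

Viewing all copies side by side with `keep=False` is a useful check before deleting anything, especially when `subset` is used and the remaining columns may differ.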


## Data Type Conversion

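Pandas converts column data types with the `astype()` method. Note that `astype(bool)` does not parse text: any non-empty string, including `'False'`, becomes `True`.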

In [5]:
# Create DataFrame with mixed data types
mixed_data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': ['25', '30', '35'],  # String instead of int
    'Salary': ['50000.0', '60000.5', '70000.0'],  # String instead of float
    'Is_Manager': ['True', 'False', 'True']  # String instead of bool
}

df_types = pd.DataFrame(mixed_data)
print("DataFrame with mixed types:")
print(df_types)
print("\nData types:")
print(df_types.dtypes)

# Convert data types
df_converted = df_types.copy()
df_converted['Age'] = df_converted['Age'].astype(int)
df_converted['Salary'] = df_converted['Salary'].astype(float)
# astype(bool) would turn every non-empty string, including 'False', into True,
# so map the text values to booleans explicitly
df_converted['Is_Manager'] = df_converted['Is_Manager'].map({'True': True, 'False': False})

print("\nAfter type conversion:")
print(df_converted)
print("\nData types after conversion:")
print(df_converted.dtypes)

DataFrame with mixed types:
      Name Age   Salary Is_Manager
0    Alice  25  50000.0       True
1      Bob  30  60000.5      False
2  Charlie  35  70000.0       True

Data types:
Name          object
Age           object
Salary        object
Is_Manager    object
dtype: object

After type conversion:
      Name  Age   Salary  Is_Manager
0    Alice   25  50000.0        True
1      Bob   30  60000.5       False
2  Charlie   35  70000.0        True

Data types after conversion:
Name           object
Age             int64
Salary        float64
Is_Manager       bool
dtype: object
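
`astype()` raises an error as soon as one value cannot be converted, which is common in uncleaned data. A hedged sketch of one alternative, `pd.to_numeric` with `errors='coerce'`, follows; the `raw` series and its `'unknown'` entry are made up for illustration.

```python
import pandas as pd

# Hypothetical column in which one entry is not a number
raw = pd.Series(['25', '30', 'unknown', '35'])

# astype(int) fails on the first value it cannot parse
try:
    raw.astype(int)
except ValueError as exc:
    print(f"astype(int) failed: {exc}")

# errors='coerce' converts unparseable entries to NaN instead of raising
ages = pd.to_numeric(raw, errors='coerce')
print(ages)

# The nullable 'Int64' dtype keeps whole numbers while allowing missing values
ages_int = ages.astype('Int64')
print(ages_int)
```

Coercing turns bad entries into missing values, so they can then be handled with the `dropna()` or `fillna()` techniques from earlier in the notebook.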


## Summary

You have learned essential data cleaning techniques in Pandas:

- **Handling Missing Values**: Detecting them with `isnull()`, then removing or filling them with `dropna()` and `fillna()`
- **Removing Duplicates**: Using `duplicated()` and `drop_duplicates()`
- **Data Type Conversion**: Using `astype()` to convert between data types

These techniques are crucial for preparing data for analysis and ensuring data quality.