# Data Cleaning in Python

This notebook demonstrates common data cleaning techniques using pandas, including handling missing values, removing duplicates, fixing inconsistent formats, and detecting/removing outliers.

In [1]:
# Import required libraries
import pandas as pd
import numpy as np

## Sample Data

Let's create a sample DataFrame with missing values, duplicates, inconsistent formats, and outliers.

In [2]:
# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'bob', 'David', 'Eve', 'Frank', 'Alice', None],
    'Age': [25, 30, np.nan, 30, 22, 120, 28, 25, 27],
    'City': ['New York', 'Los Angeles', 'new york', 'Los Angeles', 'Chicago', 'Chicago', None, 'New York', 'Boston'],
    'Score': [85, 90, 88, 90, np.nan, 95, 70, 85, 100]
}
df = pd.DataFrame(data)
df

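|   | Name | Age | City | Score |
|---|---|---|---|---|
| 0 | Alice | 25.0 | New York | 85.0 |
| 1 | Bob | 30.0 | Los Angeles | 90.0 |
| 2 | Charlie | NaN | new york | 88.0 |
| 3 | bob | 30.0 | Los Angeles | 90.0 |
| 4 | David | 22.0 | Chicago | NaN |
| 5 | Eve | 120.0 | Chicago | 95.0 |
| 6 | Frank | 28.0 | None | 70.0 |
| 7 | Alice | 25.0 | New York | 85.0 |
| 8 | None | 27.0 | Boston | 100.0 |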


## Handling Missing Values

We can detect and handle missing values using pandas methods such as `isnull()`, `dropna()`, and `fillna()`.

In [3]:
# Check for missing values
df.isnull().sum()

# Fill missing 'Age' with the mean and 'Score' with the median.
# Assign the result back instead of calling fillna(..., inplace=True) on a
# column selection: that pattern is deprecated and stops modifying df in pandas 3.0.
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Score'] = df['Score'].fillna(df['Score'].median())

# Fill missing 'Name' and 'City' with 'Unknown'
df['Name'] = df['Name'].fillna('Unknown')
df['City'] = df['City'].fillna('Unknown')
df

|   | Name | Age | City | Score |
|---|---|---|---|---|
| 0 | Alice | 25.0 | New York | 85.0 |
| 1 | Bob | 30.0 | Los Angeles | 90.0 |
| 2 | Charlie | 38.375 | new york | 88.0 |
| 3 | bob | 30.0 | Los Angeles | 90.0 |
| 4 | David | 22.0 | Chicago | 89.0 |
| 5 | Eve | 120.0 | Chicago | 95.0 |
| 6 | Frank | 28.0 | Unknown | 70.0 |
| 7 | Alice | 25.0 | New York | 85.0 |
| 8 | Unknown | 27.0 | Boston | 100.0 |
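
The cell above fills every gap, but the `dropna()` method mentioned earlier is not shown. The following is a minimal sketch of that alternative, using a small hypothetical frame (`sample` and its values are made up for illustration and are not the notebook's `df`). Whether dropping or filling is appropriate depends on how much data you can afford to lose.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame used only for this illustration
sample = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "Dana"],
    "Age": [25, np.nan, 27, 31],
    "Score": [85, 90, np.nan, np.nan],
})

# Drop every row that has at least one missing value
complete_rows = sample.dropna()

# Drop a row only when a specific column is missing
has_name = sample.dropna(subset=["Name"])

# Keep rows that have at least two non-missing values
mostly_complete = sample.dropna(thresh=2)

print(len(complete_rows), len(has_name), len(mostly_complete))  # 1 3 3
```

None of these calls modifies `sample`; each returns a new DataFrame.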


## Removing Duplicates

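The `drop_duplicates()` method removes rows that exactly repeat an earlier row. Here it drops the second `Alice` row (index 7); the `bob` row at index 3 is kept because its name differs in case from `Bob`.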

In [4]:
# Remove duplicate rows
df = df.drop_duplicates()
df

|   | Name | Age | City | Score |
|---|---|---|---|---|
| 0 | Alice | 25.0 | New York | 85.0 |
| 1 | Bob | 30.0 | Los Angeles | 90.0 |
| 2 | Charlie | 38.375 | new york | 88.0 |
| 3 | bob | 30.0 | Los Angeles | 90.0 |
| 4 | David | 22.0 | Chicago | 89.0 |
| 5 | Eve | 120.0 | Chicago | 95.0 |
| 6 | Frank | 28.0 | Unknown | 70.0 |
| 8 | Unknown | 27.0 | Boston | 100.0 |
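
Because `drop_duplicates()` compares values exactly, rows that differ only in letter case or stray whitespace survive. One possible way to catch them, sketched here on a hypothetical `people` frame (not the notebook's `df`), is to compute duplicates on a normalized copy while keeping the original rows.

```python
import pandas as pd

# Hypothetical mini-frame used only for this illustration
people = pd.DataFrame({
    "Name": ["Bob", "bob", "Alice"],
    "City": ["Los Angeles", "Los Angeles", "New York"],
})

# Exact comparison: "Bob" and "bob" count as different rows
exact = people.drop_duplicates()

# Build a normalized copy (trimmed, lowercased) just for the comparison
normalized = people.apply(lambda col: col.str.strip().str.lower())

# Keep the first row of each group of normalized duplicates
deduped = people[~normalized.duplicated()]

print(len(exact), len(deduped))  # 3 2
```

An alternative with the same effect is to standardize the text columns first and only then call `drop_duplicates()`.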


## Fixing Inconsistent Formats

We can standardize text data (e.g., names, cities) by converting it to a consistent case with pandas string methods; here we use title case.

In [5]:
# Standardize 'Name' and 'City' columns to title case.
# assign() returns a new DataFrame, which avoids SettingWithCopyWarning.
df = df.assign(
    Name=df['Name'].str.title(),
    City=df['City'].str.title(),
)
df


|   | Name | Age | City | Score |
|---|---|---|---|---|
| 0 | Alice | 25.0 | New York | 85.0 |
| 1 | Bob | 30.0 | Los Angeles | 90.0 |
| 2 | Charlie | 38.375 | New York | 88.0 |
| 3 | Bob | 30.0 | Los Angeles | 90.0 |
| 4 | David | 22.0 | Chicago | 89.0 |
| 5 | Eve | 120.0 | Chicago | 95.0 |
| 6 | Frank | 28.0 | Unknown | 70.0 |
| 8 | Unknown | 27.0 | Boston | 100.0 |
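
Case is not the only source of inconsistency; leading or trailing whitespace is another common one. The sketch below uses a hypothetical `cities` Series (its values are made up for illustration) to show trimming and title-casing chained together, with `value_counts()` as a quick check that the variants have collapsed into one label.

```python
import pandas as pd

# Hypothetical values showing mixed case, stray whitespace, and a missing entry
cities = pd.Series(["New York", " new york", "NEW YORK ", "chicago", None])

# Trim whitespace first, then apply title case; missing values stay missing
cleaned = cities.str.strip().str.title()

# value_counts() ignores missing values by default
print(cleaned.value_counts())
```

After cleaning, the three spellings of New York share a single label and the missing entry is left untouched.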


## Detecting and Handling Outliers

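We can detect outliers using statistical methods such as the IQR (interquartile range) method, which flags any value more than 1.5 × IQR below the first quartile (Q1) or above the third quartile (Q3).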

In [6]:
# Detect outliers in 'Age' using the IQR method
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1

# Define outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter out outliers
df = df[(df['Age'] >= lower_bound) & (df['Age'] <= upper_bound)]
df

|   | Name | Age | City | Score |
|---|---|---|---|---|
| 0 | Alice | 25.0 | New York | 85.0 |
| 1 | Bob | 30.0 | Los Angeles | 90.0 |
| 2 | Charlie | 38.375 | New York | 88.0 |
| 3 | Bob | 30.0 | Los Angeles | 90.0 |
| 4 | David | 22.0 | Chicago | 89.0 |
| 6 | Frank | 28.0 | Unknown | 70.0 |
| 8 | Unknown | 27.0 | Boston | 100.0 |
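
The same bounds calculation can be wrapped in a small helper so it is reusable across columns. This is only a sketch: `iqr_bounds` and its `k` parameter are hypothetical names, and the `ages` values are the `Age` column as it stands before the filtering step above. It also shows two alternatives to dropping rows, namely flagging outliers and clipping them to the bounds.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) bounds at k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Age values from the notebook before the outlier filter is applied
ages = pd.Series([25, 30, 38.375, 30, 22, 120, 28, 27])

lower, upper = iqr_bounds(ages)

# Option 1: flag outliers without removing them
is_outlier = ~ages.between(lower, upper)

# Option 2: clip extreme values to the bounds instead of dropping rows
clipped = ages.clip(lower=lower, upper=upper)

print(ages[is_outlier].tolist())  # [120.0]
```

Flagging keeps the row available for manual review, while clipping keeps the row count unchanged at the cost of altering the value.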


## Summary

In this notebook, we demonstrated how to clean data by handling missing values, removing duplicates, fixing inconsistent formats, and detecting/removing outliers using pandas.
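
The order of the steps affects the result. In the walkthrough, the mean `Age` of 38.375 was computed while the outlier 120 was still present, and the `bob` row survived deduplication because text was standardized afterwards. As a hedged sketch of one alternative ordering (not the only reasonable one), the hypothetical `clean` helper below normalizes text, removes duplicates, filters outliers, and only then imputes. It uses the median for `Age` rather than the mean, which is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: normalize, deduplicate, filter outliers, impute."""
    out = df.copy()

    # 1. Normalize text so case or whitespace variants compare equal
    for col in ["Name", "City"]:
        out[col] = out[col].str.strip().str.title()

    # 2. Remove exact duplicates (now including former case variants)
    out = out.drop_duplicates()

    # 3. Drop 'Age' outliers by IQR, keeping rows where 'Age' is missing
    q1, q3 = out["Age"].quantile([0.25, 0.75])
    iqr = q3 - q1
    in_range = out["Age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    out = out[in_range | out["Age"].isna()]

    # 4. Impute last, so the fill values are not skewed by outliers
    out = out.fillna({
        "Age": out["Age"].median(),
        "Score": out["Score"].median(),
        "Name": "Unknown",
        "City": "Unknown",
    })
    return out.reset_index(drop=True)

# Same sample data as at the top of the notebook
raw = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "bob", "David", "Eve", "Frank", "Alice", None],
    "Age": [25, 30, np.nan, 30, 22, 120, 28, 25, 27],
    "City": ["New York", "Los Angeles", "new york", "Los Angeles", "Chicago",
             "Chicago", None, "New York", "Boston"],
    "Score": [85, 90, 88, 90, np.nan, 95, 70, 85, 100],
})

result = clean(raw)
print(result)
```

With this ordering the duplicate `Bob` row is removed and the imputed `Age` is no longer inflated by the outlier.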