<a href="https://colab.research.google.com/github/MehrdadJalali-AI/Data_Management/blob/main/data_cleaning_process.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Cleaning Process

## Definition


**Data Cleaning** is the process of identifying and correcting or removing inaccurate, incomplete, or irrelevant data from a dataset.


## Why is Data Cleaning Critical?


1. **Improves Data Quality**: Ensures data is accurate, complete, and consistent.
2. **Enhances Decision-making**: Leads to more reliable analysis and conclusions.
3. **Prevents Errors**: Reduces bias, inconsistencies, and misinterpretations during analysis.


## Example Dataset with Issues


| **Customer ID** | **Name**       | **Age** | **Email**              | **Purchase Amount** |
|------------------|----------------|---------|------------------------|----------------------|
| 1001             | John Doe       | 35      | john.doe@email.com     | 150.00              |
| 1002             | Jane Smith     | NULL    | jane.smith@email.com   | 200.00              |
| 1003             | Alice Johnson  | Thirty  | alice.johnson#email    | 175.00              |
| 1004             | Bob Brown      | 42      | bob.brown@email.com    | NULL                |
| 1005             | John Doe       | 35      | john.doe@email.com     | 150.00              |


## Goals of Data Cleaning


1. **Correct Inaccuracies**:
   - Convert `"Thirty"` to `30`.
   - Correct invalid email formats.
2. **Handle Missing Values**:
   - Impute or remove rows with missing values.
3. **Remove Duplicates**:
   - Ensure each record is unique.
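
As a rough illustration of the first goal, the sketch below turns a spelled-out age such as `"Thirty"` into a number. The lookup table `WORD_TO_NUMBER` and the helper `parse_age` are hypothetical names introduced only for this example, and the table covers just a handful of words; real data would likely need a fuller mapping or a dedicated parsing library.

```python
# Hypothetical lookup table (an assumption for this sketch); extend as needed.
WORD_TO_NUMBER = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50}

def parse_age(value):
    """Return the age as a float, or None when it cannot be interpreted."""
    if value is None:
        return None
    if isinstance(value, str):
        word = value.strip().lower()
        if word in WORD_TO_NUMBER:
            return float(WORD_TO_NUMBER[word])
    try:
        return float(value)
    except (TypeError, ValueError):
        # Unrecognised text is left as missing rather than guessed.
        return None

raw_ages = [35, None, "Thirty", 42, 35]
parsed_ages = [parse_age(age) for age in raw_ages]
print(parsed_ages)  # [35.0, None, 30.0, 42.0, 35.0]
```

Applying a conversion like this before any imputation keeps a recoverable value (`30`) from being treated as missing.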


In [1]:

# Step 1: Load Dataset
import pandas as pd

# Example dataset with issues
data = {
    "Customer ID": [1001, 1002, 1003, 1004, 1005],
    "Name": ["John Doe", "Jane Smith", "Alice Johnson", "Bob Brown", "John Doe"],
    "Age": [35, None, "Thirty", 42, 35],
    "Email": ["john.doe@email.com", "jane.smith@email.com", "alice.johnson#email", "bob.brown@email.com", "john.doe@email.com"],
    "Purchase Amount": [150.0, 200.0, 175.0, None, 150.0]
}

df = pd.DataFrame(data)
df


Unnamed: 0,Customer ID,Name,Age,Email,Purchase Amount
0,1001,John Doe,35,john.doe@email.com,150.0
1,1002,Jane Smith,,jane.smith@email.com,200.0
2,1003,Alice Johnson,Thirty,alice.johnson#email,175.0
3,1004,Bob Brown,42,bob.brown@email.com,
4,1005,John Doe,35,john.doe@email.com,150.0


In [2]:

# Step 2: Identify Missing Values
missing_values = df.isnull().sum()
missing_values


Unnamed: 0,0
Customer ID,0
Name,0
Age,1
Email,0
Purchase Amount,1
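
Note that the count above reports only one missing `Age`, because the text value `"Thirty"` is not null. The following sketch, which rebuilds a two-column subset of the example data so that it runs on its own, shows one possible way to express missingness as a percentage and to surface non-numeric entries that a plain null check overlooks. The variable names are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [35, None, "Thirty", 42, 35],
    "Purchase Amount": [150.0, 200.0, 175.0, None, 150.0],
})

# Share of missing entries per column, as a percentage.
missing_pct = df.isna().mean() * 100

# Rows that contain at least one missing value.
rows_with_gaps = df[df.isna().any(axis=1)]

# Entries that are missing OR cannot be parsed as numbers (e.g. "Thirty").
unusable_ages = pd.to_numeric(df["Age"], errors="coerce").isna().sum()

print(missing_pct)
print(rows_with_gaps)
print("Unusable 'Age' entries:", unusable_ages)
```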


In [3]:

# Step 3: Handle Missing Values
# Convert 'Age' to numeric; non-numeric entries such as "Thirty" become NaN,
# so they are imputed below rather than converted to 30
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
# Impute the median for missing 'Age' (assign the result back instead of
# calling fillna(..., inplace=True) on a column selection)
df['Age'] = df['Age'].fillna(df['Age'].median())
# Replace missing 'Purchase Amount' with 0
df['Purchase Amount'] = df['Purchase Amount'].fillna(0)
df


Unnamed: 0,Customer ID,Name,Age,Email,Purchase Amount
0,1001,John Doe,35.0,john.doe@email.com,150.0
1,1002,Jane Smith,35.0,jane.smith@email.com,200.0
2,1003,Alice Johnson,35.0,alice.johnson#email,175.0
3,1004,Bob Brown,42.0,bob.brown@email.com,0.0
4,1005,John Doe,35.0,john.doe@email.com,150.0
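
Imputation is only one of the options listed in the goals. As a hedged sketch of two alternatives, the example below either records which ages were imputed in an extra flag column or drops incomplete rows altogether. The column name `Age Imputed` and the variables `flagged` and `dropped` are assumptions made for this illustration, and the data is a reduced copy of the example dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "Customer ID": [1001, 1002, 1003, 1004, 1005],
    "Age": [35, None, "Thirty", 42, 35],
    "Purchase Amount": [150.0, 200.0, 175.0, None, 150.0],
})
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")

# Option A: keep every row, but remember which ages were filled in.
flagged = df.copy()
flagged["Age Imputed"] = flagged["Age"].isna()
flagged["Age"] = flagged["Age"].fillna(flagged["Age"].median())

# Option B: remove rows that lack a value in any required column.
dropped = df.dropna(subset=["Age", "Purchase Amount"])

print(flagged)
print(dropped)
```

Option A preserves the sample size while keeping the imputation visible to later analysis; Option B avoids invented values but, on this small dataset, leaves only two rows.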


In [4]:

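# Step 4: Correct Invalid Formats
# Validate 'Email' format using a regex; invalid addresses are replaced with None
import re

EMAIL_PATTERN = re.compile(r'^\S+@\S+\.\S+$')

def validate_email(email):
    # Non-string values (e.g. None or NaN) count as invalid instead of raising TypeError
    return isinstance(email, str) and bool(EMAIL_PATTERN.match(email))

df['Email'] = df['Email'].apply(lambda x: x if validate_email(x) else None)
df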


Unnamed: 0,Customer ID,Name,Age,Email,Purchase Amount
0,1001,John Doe,35.0,john.doe@email.com,150.0
1,1002,Jane Smith,35.0,jane.smith@email.com,200.0
2,1003,Alice Johnson,35.0,,175.0
3,1004,Bob Brown,42.0,bob.brown@email.com,0.0
4,1005,John Doe,35.0,john.doe@email.com,150.0
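
Before overwriting invalid addresses, it can help to list them for review. The sketch below uses the vectorized `Series.str.fullmatch` with the same simple pattern as the step above. The sample series and the names `is_valid` and `invalid` are assumptions for this example, and the pattern is a loose format check, not a full email validator.

```python
import pandas as pd

emails = pd.Series([
    "john.doe@email.com",
    "alice.johnson#email",
    None,
])

pattern = r"\S+@\S+\.\S+"

# na=False treats missing entries as invalid instead of propagating NaN.
is_valid = emails.str.fullmatch(pattern, na=False)

# Entries that would be set to None by the cleaning step.
invalid = emails[~is_valid]
print(invalid)
```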


In [5]:

# Step 5: Remove Duplicates
# Rows 1001 and 1005 differ only in 'Customer ID', so a whole-row comparison
# would keep both; compare the remaining columns instead
df = df.drop_duplicates(subset=["Name", "Age", "Email", "Purchase Amount"])
df


Unnamed: 0,Customer ID,Name,Age,Email,Purchase Amount
0,1001,John Doe,35.0,john.doe@email.com,150.0
1,1002,Jane Smith,35.0,jane.smith@email.com,200.0
2,1003,Alice Johnson,35.0,,175.0
3,1004,Bob Brown,42.0,bob.brown@email.com,0.0
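
Customers 1001 and 1005 share every field except `Customer ID`, so an exact whole-row comparison treats them as distinct records. The sketch below shows one way to inspect such near-duplicates before deciding what to drop. The choice of `key_columns` is an assumption about which fields identify a customer, and the data is a reduced copy of the example dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "Customer ID": [1001, 1002, 1005],
    "Name": ["John Doe", "Jane Smith", "John Doe"],
    "Email": ["john.doe@email.com", "jane.smith@email.com", "john.doe@email.com"],
})

# Whole-row comparison finds nothing because the IDs differ.
exact_duplicates = int(df.duplicated().sum())

# Assumed identifying columns for this sketch.
key_columns = ["Name", "Email"]

# keep=False marks every member of a duplicate group, not only the repeats.
duplicate_groups = df[df.duplicated(subset=key_columns, keep=False)]

# Keep the first occurrence of each group.
deduplicated = df.drop_duplicates(subset=key_columns, keep="first")

print(exact_duplicates)
print(duplicate_groups)
print(deduplicated)
```

Reviewing `duplicate_groups` first makes it easier to confirm that the matching rows really describe the same customer before any of them is removed.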
