
# Steps in Data Cleaning

## 1. Data Collection


**Definition**: Gather raw data from various sources, such as databases, APIs, or files.
- This step involves consolidating data into a unified format for analysis.
- Challenges include inconsistent formats, incomplete records, and integration issues.
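
As a rough illustration of consolidating sources into one format, the sketch below reads a CSV export and a JSON payload and aligns them to a single schema. The inline strings, the source column names (`customer_id`, `full_name`, `age`), and the `Source` tag are hypothetical and stand in for real files or API responses.

```python
import io
import json

import pandas as pd

# Hypothetical source 1: a CSV export that uses its own column names.
csv_text = "customer_id,full_name,age\n1001,John Doe,35\n1002,Jane Smith,\n"

# Hypothetical source 2: a JSON payload, as an API might return it.
json_text = (
    '[{"CustomerID": 1003, "Name": "Alice Johnson", "Age": "Thirty"},'
    ' {"CustomerID": 1004, "Name": "Bob Brown", "Age": 42}]'
)

# Rename the CSV columns so both sources share one schema.
csv_df = pd.read_csv(io.StringIO(csv_text)).rename(
    columns={"customer_id": "CustomerID", "full_name": "Name", "age": "Age"}
)
json_df = pd.DataFrame(json.loads(json_text))

# Record where each row came from so later issues can be traced back.
csv_df["Source"] = "csv"
json_df["Source"] = "api"

combined = pd.concat([csv_df, json_df], ignore_index=True)
print(combined)
```

Keeping a provenance column is optional, but it tends to make format problems easier to attribute to a specific source.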


## 2. Data Profiling


**Definition**: Analyze data to detect errors, inconsistencies, and patterns.
- Helps identify missing values, duplicate records, and incorrect formats.
- Tools such as `ydata-profiling` (formerly `pandas-profiling`) or `Great Expectations` can generate detailed reports.
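
Where a dedicated profiling tool is not available, the same checks can be approximated with plain pandas. This is a minimal sketch on a hypothetical sample; the `profile` and `duplicate_count` names are illustrative and not part of any library.

```python
import pandas as pd

# Hypothetical sample: one missing age, one badly typed age, one duplicated row.
df = pd.DataFrame({
    "CustomerID": [1001, 1002, 1003, 1003],
    "Name": ["John Doe", "Jane Smith", "Alice Johnson", "Alice Johnson"],
    "Age": [35, None, "Thirty", "Thirty"],
})

# One row per column: storage type, missing count, number of distinct values.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "missing": df.isna().sum(),
    "unique": df.nunique(),
})

# Rows that exactly repeat an earlier row.
duplicate_count = int(df.duplicated().sum())

print(profile)
print("Duplicate rows:", duplicate_count)
```

A numeric-looking column reported with a non-numeric dtype is usually a hint that it contains text entries.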


## 3. Data Transformation


**Definition**: Apply appropriate cleaning techniques to address issues identified in the profiling step.
- Examples include:
  - Standardization: Ensuring uniform formats (e.g., date formats).
  - Normalization: Adjusting values to a common scale.
  - Missing Value Treatment: Filling or removing missing data.
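
The three techniques above can be sketched on a small hypothetical frame. The `SignupDate` and `Income` columns, the day/month/year input format, and the choice of median imputation and min-max scaling are assumptions made for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "SignupDate": ["15/01/2024", "03/02/2024", "28/02/2024"],
    "Income": [30000.0, None, 90000.0],
})

# Standardization: parse day/month/year strings and emit ISO 8601 dates.
df["SignupDate"] = pd.to_datetime(
    df["SignupDate"], format="%d/%m/%Y"
).dt.strftime("%Y-%m-%d")

# Missing value treatment: fill gaps with the column median.
df["Income"] = df["Income"].fillna(df["Income"].median())

# Normalization: min-max scaling to the range [0, 1].
# This assumes the column is not constant; otherwise the divisor is zero.
income_min, income_max = df["Income"].min(), df["Income"].max()
df["IncomeScaled"] = (df["Income"] - income_min) / (income_max - income_min)

print(df)
```

Imputing before scaling keeps the filled value inside the observed range, so the scaled column stays within [0, 1].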


## 4. Data Validation


**Definition**: Ensure data adheres to predefined rules and standards.
- Validate against schema definitions, patterns, and required domains.
- Use automated tools to enforce rules and generate validation reports.
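
One lightweight way to express such rules, short of adopting a validation framework, is a dictionary of boolean checks. The sketch below assumes a hypothetical frame with `CustomerID`, `Email`, and `Age` columns; the rule names and the simplified email pattern are illustrative, not a standard.

```python
import pandas as pd

df = pd.DataFrame({
    "CustomerID": [1001, 1002, 1002],
    "Email": ["john@example.com", "not-an-email", "jane@example.com"],
    "Age": [35.0, 150.0, 42.0],
})

# Each rule yields a boolean Series: True means the row passes.
rules = {
    "customer_id_unique": ~df["CustomerID"].duplicated(keep=False),
    "email_pattern": df["Email"].str.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "age_in_range": df["Age"].between(0, 120),
}

report = pd.DataFrame(rules)

# Validation report: number of failing rows per rule.
failures = (~report).sum()

# Rows that satisfy every rule.
valid_rows = df[report.all(axis=1)]

print(failures)
print(valid_rows)
```

Keeping the per-rule results in one frame makes it possible to report why a row was rejected, not only that it was.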


## 5. Final Output


**Definition**: Deliver clean, structured data ready for analysis.
- The final output is consistent, free of the errors detected in the earlier steps, and meets the requirements of the intended analysis.
- It can be passed on to advanced analysis or integrated into downstream systems.
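
Delivery often means writing the cleaned table to a file or stream. The sketch below is one possible approach, using an in-memory buffer in place of a real file path and a hypothetical two-row `cleaned` frame. Passing `index=False` keeps the row index from being written as an extra unnamed column.

```python
import io

import pandas as pd

# Hypothetical cleaned result.
cleaned = pd.DataFrame({
    "CustomerID": [1001, 1002],
    "Name": ["John Doe", "Jane Smith"],
    "Age": [35.0, 38.5],
})

# Write without the row index, then read back to confirm the schema survives.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.dtypes)
print(reloaded)
```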


In [1]:

# Step 1: Data Collection
# Simulate data collection from different sources
import pandas as pd

data_source1 = {"CustomerID": [1001, 1002], "Name": ["John Doe", "Jane Smith"], "Age": [35, None]}
data_source2 = {"CustomerID": [1003, 1004], "Name": ["Alice Johnson", "Bob Brown"], "Age": ["Thirty", 42]}

df1 = pd.DataFrame(data_source1)
df2 = pd.DataFrame(data_source2)

# Combine data from sources
raw_data = pd.concat([df1, df2], ignore_index=True)
raw_data


| | CustomerID | Name | Age |
|---|---|---|---|
| 0 | 1001 | John Doe | 35.0 |
| 1 | 1002 | Jane Smith | NaN |
| 2 | 1003 | Alice Johnson | Thirty |
| 3 | 1004 | Bob Brown | 42 |


In [2]:

# Step 2: Data Profiling
# Detect missing values and data type inconsistencies
missing_values = raw_data.isnull().sum()
data_types = raw_data.dtypes

print("Missing Values:\n", missing_values)
print("Data Types:\n", data_types)


Missing Values:
 CustomerID    0
Name          0
Age           1
dtype: int64
Data Types:
 CustomerID     int64
Name          object
Age           object
dtype: object
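
An `object` dtype on `Age` suggests the column holds non-numeric entries. A minimal sketch of how they might be located follows; it rebuilds the same four rows so that it runs on its own, and the `as_number` and `not_numeric` names are illustrative.

```python
import pandas as pd

# Same four rows as the combined frame above, rebuilt to be self-contained.
raw_data = pd.DataFrame({
    "CustomerID": [1001, 1002, 1003, 1004],
    "Name": ["John Doe", "Jane Smith", "Alice Johnson", "Bob Brown"],
    "Age": [35, None, "Thirty", 42],
})

# Try to read every entry as a number; anything unparseable becomes NaN.
as_number = pd.to_numeric(raw_data["Age"], errors="coerce")

# Present in the original but NaN after coercion: a non-numeric entry.
not_numeric = raw_data["Age"].notna() & as_number.isna()

print(raw_data[not_numeric])
```

Listing these rows before coercing them makes it clear which values will later be treated as missing.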


In [3]:

# Step 3: Data Transformation
# Handle missing values and correct data types
raw_data['Age'] = pd.to_numeric(raw_data['Age'], errors='coerce')  # Convert 'Age' to numeric; non-numeric entries such as 'Thirty' become NaN
raw_data['Age'] = raw_data['Age'].fillna(raw_data['Age'].median())  # Fill missing 'Age' with the median; assign back instead of chained inplace=True
raw_data


| | CustomerID | Name | Age |
|---|---|---|---|
| 0 | 1001 | John Doe | 35.0 |
| 1 | 1002 | Jane Smith | 38.5 |
| 2 | 1003 | Alice Johnson | 38.5 |
| 3 | 1004 | Bob Brown | 42.0 |


In [4]:

# Step 4: Data Validation
# Check that all ages are within a valid range (e.g., 0-120)
valid_ages = raw_data['Age'].between(0, 120)
invalid_rows = raw_data[~valid_ages]

if invalid_rows.empty:
    print("All rows are valid.")
else:
    print("Invalid Rows:\n", invalid_rows)


All rows are valid.


In [5]:

# Step 5: Final Output
# The cleaned and validated dataset is ready for analysis
cleaned_data = raw_data[valid_ages]  # Keep only valid rows
cleaned_data


| | CustomerID | Name | Age |
|---|---|---|---|
| 0 | 1001 | John Doe | 35.0 |
| 1 | 1002 | Jane Smith | 38.5 |
| 2 | 1003 | Alice Johnson | 38.5 |
| 3 | 1004 | Bob Brown | 42.0 |
