# **Automated Data Cleaning Pipeline**

**Author**: Abdul Qadeer
<br>
*AI & Data Science Consultant | AI Researcher | Delivering Scalable AI Solutions for Real-World Challenges*  
**Date**: September 2024  
**Contact**: itsabdulqadeer.55@gmail.com  
**LinkedIn**: [Abdul Qadeer](https://www.linkedin.com/in/abdulqadeer99/)

---

This notebook automates common data cleaning tasks: handling missing values, encoding categorical variables, detecting and removing outliers, scaling numeric features, and saving the result. The pipeline is designed as a reusable starting point for preprocessing tabular datasets and is demonstrated here on the Pima Indians Diabetes dataset.
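
To make the "reusable" idea concrete, the numeric steps of the notebook could be wrapped in a single function. The following is only a minimal sketch under stated assumptions: `clean_numeric`, its parameters, and the toy data are hypothetical and do not appear in the notebook's cells.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler


def clean_numeric(df, numeric_cols, z_threshold=3.0):
    """Impute, drop outlier rows, and standardize the given numeric columns.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    out = df.copy()
    # 1. Fill NaN values with the column mean
    out[numeric_cols] = SimpleImputer(strategy='mean').fit_transform(out[numeric_cols])
    # 2. Keep rows whose absolute z-score is within the threshold in every column
    z = np.abs(stats.zscore(out[numeric_cols].to_numpy(dtype=float), axis=0))
    out = out[(z <= z_threshold).all(axis=1)].copy()
    # 3. Standardize to zero mean and unit variance
    out[numeric_cols] = StandardScaler().fit_transform(out[numeric_cols])
    return out


# Toy data (made up) with one missing value
demo = pd.DataFrame({'a': [1.0, 2.0, np.nan, 4.0], 'b': [10.0, 20.0, 30.0, 40.0]})
cleaned = clean_numeric(demo, ['a', 'b'])
print(cleaned)
```

The function works on a copy, so the input frame is left untouched and the same call can be reused on other datasets by passing a different column list.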


In [13]:
# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.preprocessing import StandardScaler
from scipy import stats

# Loading a sample dataset (Replace with your dataset)
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
# Defining column names for this dataset
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
df = pd.read_csv(url, names=columns)

# Displaying first few rows of the dataset
print("Original Dataset:\n", df.head())

Original Dataset:
    Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  


In [14]:
# The PyPI package is named scikit-learn; 'sklearn' is a deprecated alias and should not be installed
%pip install -U scikit-learn

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


# ---------------------
# 1. Handling Missing Data
# ---------------------


In [15]:

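# Checking for missing values (NaN) in every column.
# Note: this dataset records missing measurements as 0 rather than NaN,
# so isnull() reports none and the mean imputation below leaves the values unchanged.
print("\nChecking Missing Data:\n", df.isnull().sum())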

# Using SimpleImputer to fill missing values with the mean for numerical columns
imputer = SimpleImputer(strategy='mean')
df[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']] = imputer.fit_transform(df[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']])

print("\nDataset after Imputation (Filling Missing Data):\n", df.head())


Checking Missing Data:
 Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

Dataset after Imputation (Filling Missing Data):
    Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6    148.0           72.0           35.0      0.0  33.6   
1            1     85.0           66.0           29.0      0.0  26.6   
2            8    183.0           64.0            0.0      0.0  23.3   
3            1     89.0           66.0           23.0     94.0  28.1   
4            0    137.0           40.0           35.0    168.0  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  
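
Because `isnull()` only detects NaN, zero placeholders pass through the imputer untouched. One possible way to handle this, shown as a sketch on a made-up toy frame, is to convert the zeros to NaN first. The column list `zero_as_missing` is an assumption; which columns use 0 as a placeholder has to be decided per dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame mimicking the zero-as-missing convention (values are made up)
toy = pd.DataFrame({'Glucose': [148.0, 0.0, 183.0, 89.0],
                    'BMI': [33.6, 26.6, 0.0, 28.1]})

# Assumed list of columns in which 0 means "not measured"
zero_as_missing = ['Glucose', 'BMI']
toy[zero_as_missing] = toy[zero_as_missing].replace(0, np.nan)
print(toy.isnull().sum())  # now reports one missing value per column

# Mean imputation fills the NaN cells with the mean of the observed values
imputer = SimpleImputer(strategy='mean')
toy[zero_as_missing] = imputer.fit_transform(toy[zero_as_missing])
print(toy)
```

In this toy example the missing `Glucose` entry becomes 140.0, the mean of 148, 183, and 89.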

# ---------------------
# 2. Encoding Categorical Variables
# ---------------------


In [16]:

# Encoding the 'Outcome' column (binary classification) using Label Encoding.
# 'Outcome' is already stored as 0/1 in this dataset, so the encoded values are identical.
label_encoder = LabelEncoder()
df['Outcome'] = label_encoder.fit_transform(df['Outcome'])

print("\nDataset after Encoding 'Outcome' Column:\n", df.head())


Dataset after Encoding 'Outcome' Column:
    Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6    148.0           72.0           35.0      0.0  33.6   
1            1     85.0           66.0           29.0      0.0  26.6   
2            8    183.0           64.0            0.0      0.0  23.3   
3            1     89.0           66.0           23.0     94.0  28.1   
4            0    137.0           40.0           35.0    168.0  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  


# ---------------------
# 3. Outlier Detection and Removal
# ---------------------


In [17]:

# Detecting outliers using Z-Score for selected columns
z_scores = np.abs(stats.zscore(df[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'Age']]))
outliers = (z_scores > 3)

# Displaying the number of rows that contain at least one outlier
print("\nNumber of Outliers Detected:\n", outliers.any(axis=1).sum())

# Removing rows with outliers (i.e., where any feature has |z-score| > 3)
df_cleaned = df[~outliers.any(axis=1)]

print("\nDataset after Outlier Removal:\n", df_cleaned.head())


Number of Outliers Detected:
 69

Dataset after Outlier Removal:
    Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6    148.0           72.0           35.0      0.0  33.6   
1            1     85.0           66.0           29.0      0.0  26.6   
2            8    183.0           64.0            0.0      0.0  23.3   
3            1     89.0           66.0           23.0     94.0  28.1   
4            0    137.0           40.0           35.0    168.0  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  
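
The same z-score rule can be packaged as a small function with a configurable threshold. This is a sketch only: `zscore_outlier_mask` is a hypothetical name, and the random toy data with one injected extreme value is made up.

```python
import numpy as np
import pandas as pd
from scipy import stats


def zscore_outlier_mask(df, cols, threshold=3.0):
    """Return a boolean Series that is True for rows with any |z| > threshold."""
    z = np.abs(stats.zscore(df[cols].to_numpy(dtype=float), axis=0))
    return pd.Series((z > threshold).any(axis=1), index=df.index)


rng = np.random.default_rng(0)
toy = pd.DataFrame({'x': rng.normal(size=200), 'y': rng.normal(size=200)})
toy.loc[0, 'x'] = 25.0  # inject one obvious outlier

mask = zscore_outlier_mask(toy, ['x', 'y'])
toy_cleaned = toy[~mask].copy()  # .copy() gives an independent frame for later edits
print("Rows flagged:", mask.sum(), "| rows kept:", len(toy_cleaned))
```

Returning a mask instead of the filtered frame makes it possible to inspect the flagged rows before dropping them.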


# ---------------------
# 4. Scaling/Normalization
# ---------------------

In [18]:

# Standardizing numeric features (zero mean, unit variance); copy first to avoid SettingWithCopyWarning
numeric_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'Age']
df_cleaned = df_cleaned.copy()
scaler = StandardScaler()
df_cleaned[numeric_cols] = scaler.fit_transform(df_cleaned[numeric_cols])

print("\nDataset after Scaling:\n", df_cleaned.head())


Dataset after Scaling:
    Pregnancies   Glucose  BloodPressure  SkinThickness   Insulin       BMI  \
0            6  0.915971      -0.020204       0.920562 -0.807617  0.204965   
1            1 -1.181180      -0.507197       0.529631 -0.807617 -0.851603   
2            8  2.081054      -0.669528      -1.359871 -0.807617 -1.349700   
3            1 -1.048028      -0.507197       0.138700  0.235928 -0.625196   
4            0  0.549801      -2.617499       0.920562  1.057442  1.638879   

   DiabetesPedigreeFunction       Age  Outcome  
0                     0.627  1.478354        1  
1                     0.351 -0.188229        0  
2                     0.672 -0.100515        1  
3                     0.167 -1.065379        0  
4                     2.288 -0.012800        1  
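
As a quick sanity check on what `StandardScaler` does, the sketch below verifies on a small made-up frame that each scaled column has mean 0 and standard deviation 1, and that `inverse_transform` recovers the original values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy values (made up) for two numeric features
toy = pd.DataFrame({'Glucose': [85.0, 89.0, 137.0, 148.0],
                    'Age': [31.0, 21.0, 33.0, 50.0]})

scaler = StandardScaler()
scaled = scaler.fit_transform(toy)  # returns a NumPy array

print("Column means:", scaled.mean(axis=0))  # approximately 0
print("Column stds: ", scaled.std(axis=0))   # approximately 1

# The fitted scaler can map scaled values back to the original units
restored = scaler.inverse_transform(scaled)
print("Round trip matches:", np.allclose(restored, toy.to_numpy()))
```

Keeping the fitted `scaler` object around is useful when predictions or plots need to be reported in the original units.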


# ---------------------
# 5. Saving the Dataset
# ---------------------


In [19]:

# Saving the cleaned dataset to a CSV file
df_cleaned.to_csv('cleaned_data.csv', index=False)
print("\nCleaned dataset saved as 'cleaned_data.csv'")



Cleaned dataset saved as 'cleaned_data.csv'
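
To confirm that a saved file can be read back with the same columns and values, a round-trip check along these lines may help. It is a sketch on a made-up toy frame and writes to a temporary directory rather than the working directory.

```python
import os
import tempfile
import pandas as pd

# Toy frame (made up) standing in for the cleaned dataset
toy = pd.DataFrame({'Glucose': [0.92, -1.18], 'Outcome': [1, 0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cleaned_data.csv')
    toy.to_csv(path, index=False)   # index=False avoids writing an extra index column
    reloaded = pd.read_csv(path)

print("Same shape:  ", reloaded.shape == toy.shape)
print("Same columns:", list(reloaded.columns) == list(toy.columns))
```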
