
Data cleaning involves several steps. This notebook walks through a sample data cleaning process using the Titanic dataset.

**STEP 1:** Import libraries and load the dataset

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv("Titanic-Dataset.csv")
df.info()
df.head()

**STEP 2:** Checking for duplicate rows

In [None]:
df.duplicated().sum()  # No duplicates here
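
If the count above were nonzero, the repeated rows could be removed with `drop_duplicates`. The sketch below uses a small made-up frame (`toy` is hypothetical and not part of the Titanic data) to show both the count and the removal:

```python
import pandas as pd

# Hypothetical toy data: the last row repeats the first one exactly
toy = pd.DataFrame({
    "PassengerId": [1, 2, 1],
    "Age": [22.0, 38.0, 22.0],
})

# duplicated() marks every row that repeats an earlier row
n_duplicates = toy.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each row by default
deduped = toy.drop_duplicates().reset_index(drop=True)

print("Duplicate rows:", n_duplicates)
print("Shape after removal:", deduped.shape)
```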

**STEP 3:** Identifying column datatypes

In [None]:
cat_col = [col for col in df.columns if df[col].dtype == 'object']
num_col = [col for col in df.columns if df[col].dtype != 'object']

print("Categorical columns: ", cat_col)
print("Numerical columns: ", num_col)
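
An alternative way to split the columns is `select_dtypes`, which does not depend on text columns being stored as the `object` dtype. This is a sketch on a hypothetical toy frame, not the notebook's own method:

```python
import pandas as pd

# Hypothetical toy frame with one text column and two numeric columns
toy = pd.DataFrame({
    "Sex": ["male", "female"],
    "Age": [22.0, 38.0],
    "Pclass": [3, 1],
})

# "number" matches every integer and float dtype
num_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Categorical columns:", cat_cols)
print("Numerical columns:", num_cols)
```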

**STEP 4:** Counting unique values in the categorical columns

In [None]:
df[cat_col].nunique()

**STEP 5:** Calculate the percentage of missing values in each column

In [None]:
round(df.isnull().sum() / df.shape[0] * 100, 2)  # e.g. for "Age": (177 / 891) * 100, i.e. (null rows / total rows) * 100
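
The same percentage can be obtained from the mean of the boolean null mask, since the mean of `True`/`False` values is the fraction of `True`. A minimal sketch on made-up data (`toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 2 of the 4 ages are missing, no fares are missing
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isnull() gives True/False per cell; mean() turns that into a fraction per column
missing_pct = (toy.isnull().mean() * 100).round(2)

print(missing_pct)
```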

**STEP 6:** Drop irrelevant columns and columns with many missing values, then handle the remaining nulls

In [None]:
df1 = df.drop(columns=['Name', 'Ticket', 'Cabin'])    # Drop irrelevant or mostly missing columns
df1 = df1.dropna(subset=['Embarked'])                 # Drop rows where 'Embarked' is missing
df1['Age'] = df1['Age'].fillna(df1['Age'].mean())     # Fill missing ages with the mean age (assign back; inplace on a column selection is unreliable)
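
The mean is pulled toward extreme values, so the median is a common alternative for imputation. The sketch below compares the two on a hypothetical toy column; it illustrates the trade-off and is not the choice made in this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages with one missing value and one extreme value (80)
toy = pd.DataFrame({"Age": [20.0, 22.0, np.nan, 24.0, 80.0]})

# Mean of the known ages: (20 + 22 + 24 + 80) / 4
mean_filled = toy["Age"].fillna(toy["Age"].mean())

# Median of the known ages: midpoint of 22 and 24
median_filled = toy["Age"].fillna(toy["Age"].median())

print("Filled with mean:  ", mean_filled[2])
print("Filled with median:", median_filled[2])
```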

**STEP 7:** Detect outliers with box plot

In [None]:
import matplotlib.pyplot as plt

plt.boxplot(df1['Age'],vert=False)
plt.xlabel('Age')
plt.ylabel('Variable')
plt.title("Age distribution box plot")
plt.show()
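
With matplotlib's default settings, a box plot draws as individual points the values lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. Those limits can be computed directly; this is a sketch on hypothetical ages, and the names `lower_fence` and `upper_fence` are illustrative:

```python
import pandas as pd

# Hypothetical ages with one very low and one very high value
ages = pd.Series([2.0, 20.0, 22.0, 24.0, 26.0, 28.0, 30.0, 75.0])

q1 = ages.quantile(0.25)   # first quartile
q3 = ages.quantile(0.75)   # third quartile
iqr = q3 - q1              # interquartile range

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are the ones a box plot would flag
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print("Fences:", lower_fence, upper_fence)
print("Flagged values:", outliers.tolist())
```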

**STEP 8:** Calculate outlier boundaries and remove the outliers

In [None]:
mean = df1['Age'].mean()
std = df1['Age'].std()

lower_bound = mean - 2 * std
upper_bound = mean + 2 * std

print("Lower bound:",lower_bound)
print("Upper bound:",upper_bound)

df2 = df1[(df1['Age']>=lower_bound) & (df1['Age']<=upper_bound)]
display(df2)

**STEP 9:** Check again for any remaining missing data

In [None]:
df3 = df2.fillna({'Age': df2['Age'].mean()})  # Fill any remaining missing ages only, using the current mean age
df3.isnull().sum()

**STEP 10:** Recalculate outlier bounds and remove outliers from the updated data

In [None]:
mean = df3['Age'].mean()
std = df3['Age'].std()

upper_bound = mean + 2 * std
lower_bound = mean - 2 * std

df4 = df3[(df3['Age']>=lower_bound) & (df3['Age']<=upper_bound)]
display(df4)
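
Steps 8 and 10 apply the same mean ± 2 standard deviations filter twice, so the logic could be wrapped in a small helper. The function name `filter_within_std` and the toy data are hypothetical; this is a sketch of the same rule, not code from the notebook:

```python
import pandas as pd

def filter_within_std(frame, column, n_std=2.0):
    """Keep rows whose `column` value lies within n_std standard deviations of the mean."""
    mean = frame[column].mean()
    std = frame[column].std()
    lower, upper = mean - n_std * std, mean + n_std * std
    # between() is inclusive on both ends, matching the >= and <= comparisons
    return frame[frame[column].between(lower, upper)]

# Hypothetical ages with one extreme value (90)
toy = pd.DataFrame({"Age": [20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 90.0]})

once = filter_within_std(toy, "Age")    # first pass removes the extreme value
twice = filter_within_std(once, "Age")  # second pass uses the recalculated bounds

print(len(toy), len(once), len(twice))
```

Because the mean and standard deviation change after each pass, repeating the filter can remove additional rows on real data even when a toy example like this one stays unchanged on the second pass.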

**STEP 11:** Data validation and verification (separating features and target)

In [None]:
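# Feature columns (X) and target column (Y)
# Note: this uses df3, so the rows filtered out in df4 (STEP 10) are still included
X = df3[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
Y = df3['Survived']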
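
Validation can also include explicit sanity checks on the cleaned values. The rules below (allowed classes, non-negative ages and fares, binary target) are assumptions chosen for illustration, and `toy` is a hypothetical stand-in for the cleaned frame:

```python
import pandas as pd

# Hypothetical stand-in for a cleaned Titanic-style frame
toy = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
    "Survived": [0, 1, 1],
})

# Each entry maps a rule description to whether the data satisfies it
checks = {
    "no missing values": not toy.isnull().any().any(),
    "Pclass is 1, 2 or 3": toy["Pclass"].isin([1, 2, 3]).all(),
    "Sex is male or female": toy["Sex"].isin(["male", "female"]).all(),
    "Age is non-negative": (toy["Age"] >= 0).all(),
    "Fare is non-negative": (toy["Fare"] >= 0).all(),
    "Survived is 0 or 1": toy["Survived"].isin([0, 1]).all(),
}

failed = [name for name, ok in checks.items() if not ok]
print("Failed checks:", failed)
```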

**STEP 12:** Data formatting (here using Min-Max scaling)

In [None]:
from sklearn.preprocessing import MinMaxScaler

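scaler = MinMaxScaler(feature_range=(0, 1))
num_col_ = [col for col in X.columns if X[col].dtype != 'object']  # Numeric feature columns only
x1 = X.copy()                                                      # Work on a copy so X and df3 stay unchanged
x1[num_col_] = scaler.fit_transform(x1[num_col_])                  # Rescale each numeric column to [0, 1]
x1.head()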
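
For the range (0, 1), Min-Max scaling maps each value x in a column to (x - min) / (max - min). The sketch below applies that formula by hand with pandas on hypothetical data, to make the transformation concrete:

```python
import pandas as pd

# Hypothetical toy features
toy = pd.DataFrame({
    "Age": [10.0, 20.0, 40.0],
    "Fare": [0.0, 50.0, 100.0],
})

# Column-wise (x - min) / (max - min); pandas aligns the min/max Series by column name
scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled)
```

In this manual version, a column where every value is identical has max - min equal to 0, so the division yields NaN for that column.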