
* Data preprocessing is the first step in any data analysis or machine learning pipeline.

* It involves cleaning, transforming, and organizing raw data so that it is accurate, consistent, and ready for modeling.




STEPS IN DATA PREPROCESSING

**STEP 1:** Import libraries and load dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import seaborn as sns

df = pd.read_csv("diabetes.csv")
df.head()

**STEP 2:** Inspect data structure and check missing values

In [None]:
df.info()  # Column names, dtypes, and non-null counts
df.isnull().sum()  # Number of missing (NaN) values per column
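`isnull()` only counts values stored as NaN. Some datasets record a missing measurement as 0 instead, and that check will not catch it. The sketch below uses a small made-up frame to show one way to count such zeros and convert them to NaN. The column names `Glucose` and `BMI`, the values, and the idea that 0 means "missing" are assumptions for illustration, not something confirmed for `diabetes.csv`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); 0 stands in for a missing measurement.
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BMI": [33.6, 26.6, 0.0, 28.1],
    "Outcome": [1, 0, 1, 0],
})

# Columns where 0 is assumed to be a placeholder rather than a real reading.
# Outcome is left out on purpose: 0 is a valid class label there.
zero_as_missing = ["Glucose", "BMI"]

# isnull() reports nothing, because the gaps are stored as 0, not NaN.
nan_before = toy.isnull().sum()

# Count the suspicious zeros, then mark them as NaN so pandas treats them as missing.
zero_counts = (toy[zero_as_missing] == 0).sum()
toy[zero_as_missing] = toy[zero_as_missing].replace(0, np.nan)
nan_after = toy.isnull().sum()

print(zero_counts)
print(nan_after)
```

Whether a zero is a real value or a placeholder depends on the column, so the list of columns should be chosen by inspecting the data rather than applied blindly.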

**STEP 3:** Statistical summary and visualizing outliers (using boxplot)

In [None]:
# print() is needed here: a notebook only auto-displays a cell's last expression
print(df.describe())  # Statistical summary of each numeric column

# One horizontal boxplot per column, stacked vertically
fig, axs = plt.subplots(len(df.columns), 1, figsize=(7, 18), dpi=95)

for i, col in enumerate(df.columns):
    axs[i].boxplot(df[col], vert=False)
    axs[i].set_ylabel(col)
plt.tight_layout()
plt.show()

**STEP 4:** Removing outliers using Inter-Quartile Range (IQR) method

In [None]:
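# First and third quartiles of the Insulin column
q1, q3 = np.percentile(df['Insulin'], [25, 75])
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are treated as outliers
upper_bound = q3 + 1.5 * iqr
lower_bound = q1 - 1.5 * iqr

# Keep only the rows whose Insulin value lies within the bounds
cleaned_df = df[(df['Insulin'] >= lower_bound) & (df['Insulin'] <= upper_bound)]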
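The same IQR rule can be wrapped in a helper and applied to several columns at once. The sketch below is one possible way to do that; the function names `iqr_bounds` and `drop_iqr_outliers` and the toy values are hypothetical, chosen only for illustration.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_iqr_outliers(frame, columns, k=1.5):
    """Keep rows whose values fall inside the fences for every listed column."""
    mask = pd.Series(True, index=frame.index)
    for col in columns:
        lower, upper = iqr_bounds(frame[col], k)
        # between() is inclusive on both ends; NaN values evaluate to False
        mask &= frame[col].between(lower, upper)
    return frame[mask]

# Toy data (made-up values) with one obvious Insulin outlier.
toy = pd.DataFrame({
    "Insulin": [10, 12, 11, 13, 12, 300],
    "Age": [21, 25, 30, 28, 24, 26],
})

filtered = drop_iqr_outliers(toy, ["Insulin", "Age"])
print(len(toy), "->", len(filtered))
```

Note that each extra column can only remove more rows, so filtering on many columns at once may shrink a small dataset considerably.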

**STEP 5:** Correlation analysis

In [None]:
corr = cleaned_df.corr()  # Pairwise correlation matrix of the columns
print(corr)
plt.figure()
sns.heatmap(corr, annot=True, cmap='coolwarm')  # Heatmap of the correlation matrix
plt.show()
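A heatmap gives the overall picture, but it is often useful to read off how strongly each feature correlates with the target. The sketch below ranks features by the absolute value of their correlation with `Outcome`; the toy frame and its values are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical values), not taken from diabetes.csv.
toy = pd.DataFrame({
    "Glucose": [85, 90, 140, 150, 160, 95],
    "Age": [30, 45, 25, 50, 33, 41],
    "Outcome": [0, 0, 1, 1, 1, 0],
})

corr = toy.corr()

# Take the target's column, drop its self-correlation (always 1),
# and sort by magnitude so strong negative correlations rank high too.
target_corr = corr["Outcome"].drop("Outcome").sort_values(key=abs, ascending=False)
print(target_corr)
```

Correlation only captures linear relationships, so a low value here does not by itself mean a feature is useless.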


**STEP 6:** Visualize target variable distribution (balanced or not)

In [None]:
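# Outcome is the target variable; plot its class distribution to check balance.
# value_counts() orders classes by frequency, so sort by class value to keep
# the labels aligned. Assumes Outcome is coded 0 = non-diabetic, 1 = diabetic.
counts = df['Outcome'].value_counts().sort_index()

plt.pie(counts, labels=['Non-diabetic', 'Diabetic'], autopct='%.0f%%')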
plt.title("Target variable distribution")
plt.show()
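The class balance can also be checked numerically. The sketch below computes per-class counts, shares, and a simple majority-to-minority ratio on a made-up `Outcome` series; the values are hypothetical.

```python
import pandas as pd

# Toy target column (made-up labels): six of class 0, two of class 1.
outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="Outcome")

# sort_index() orders the result by class label rather than by frequency.
counts = outcome.value_counts().sort_index()
shares = outcome.value_counts(normalize=True).sort_index()

# Ratio of the largest class to the smallest; 1.0 means perfectly balanced.
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares)
print(imbalance_ratio)
```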

**STEP 7:** Separate features and target variable

In [None]:
# Use the outlier-filtered data from Step 4
x = cleaned_df.drop(columns=['Outcome'])  # Feature variables
y = cleaned_df['Outcome']  # Target variable

**STEP 8:** Feature scaling (Normalization and Standardization)

In [None]:
scaler = MinMaxScaler()   # Normalization: rescales each feature to the range [0, 1]
X_normalized = scaler.fit_transform(x)
print(X_normalized[:5])


In [None]:
scaler = StandardScaler()   # Standardization: rescales each feature to zero mean and unit variance
X_standardized = scaler.fit_transform(x)
print(X_standardized[:5])
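To make the two transformations concrete, the sketch below applies the underlying formulas by hand with NumPy on a tiny made-up array: min-max scaling is `(x - min) / (max - min)` per column, and standardization is `(x - mean) / std` per column. It uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses; the array values are hypothetical.

```python
import numpy as np

# Toy feature matrix (made-up values): 3 samples, 2 features on different scales.
X = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 500.0],
])

# Min-max normalization: map each column onto [0, 1].
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_minmax = (X - col_min) / (col_max - col_min)

# Standardization (z-score): zero mean and unit variance per column.
# np.std defaults to ddof=0, i.e. the population standard deviation.
col_mean = X.mean(axis=0)
col_std = X.std(axis=0)
X_zscore = (X - col_mean) / col_std

print(X_minmax)
print(X_zscore)
```

A constant column would cause a division by zero in this hand-written version, so it is a sketch of the formulas rather than a replacement for the scikit-learn scalers. When the data is later split into training and test sets, the scaling statistics are usually computed on the training portion only and then reused on the test portion.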