Data preprocessing is the first step in any data analysis or machine learning pipeline. It involves cleaning, transforming, and organizing raw data so that it is accurate, consistent, and ready for modeling.

### Step 1: Import Libraries and Load Dataset

pandas handles tabular data manipulation, and numpy handles numerical computation.

In [None]:
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv("load_data.csv")


### Step 2: Inspect Data Structure and Check Missing Values

In [None]:
# Print a concise summary of the DataFrame, including data types and non-null counts for each column.
df.info()
# Count the missing values in each column; the result is a Series indexed by column name.
df.isnull().sum()
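
To make the missing-value check concrete, here is a minimal sketch on a small hand-made table. The `toy` DataFrame and its values are hypothetical stand-ins, not the contents of `load_data.csv`. It also shows the share of missing values per column, which is often easier to judge than raw counts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (an assumption for illustration only)
toy = pd.DataFrame({
    "Income": [5000.0, np.nan, 4200.0, 6100.0],
    "Education": ["btech", "mtech", None, "phd"],
})

# Number of missing values in each column
missing_counts = toy.isnull().sum()
# Percentage of missing values in each column
missing_share = toy.isnull().mean() * 100

print(missing_counts)
print(missing_share)
```

Each column here has one missing entry out of four rows, so both report 25% missing.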

### Step 3: Fix Missing Values

In [None]:
# Strategy 1: Drop rows with missing values
df_dropped = df.dropna()
# This removes every row that contains at least one missing value.

# Strategy 2: Fill missing values with the mean (numerical columns)
df_filled_mean = df.fillna(df.mean(numeric_only=True))
# This replaces missing values in numerical columns with the mean of that column.
# numeric_only=True skips text columns, which would otherwise raise an error in recent pandas versions.
# If the data is roughly normally distributed, the mean is a good measure of central tendency. Mean imputation keeps the column mean unchanged, although it reduces the column's variance.

# Strategy 3: Fill missing values with the median (numerical columns)
df_filled_median = df.fillna(df.median(numeric_only=True))
# This replaces missing values in numerical columns with the median of that column.
# If the data is skewed or contains outliers, the median is a better measure of central tendency, and imputing with it limits the influence of outliers on the filled-in values.

# Strategy 4: Fill missing values with the mode (categorical columns)
df_filled_mode = df.fillna(df.mode().iloc[0])
# This replaces missing values with the mode (most frequent value) of that column.
# Note: as written, it applies to every column, numerical ones included.
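
In practice the strategies are often combined per column type rather than applied to the whole DataFrame at once. The following is a minimal sketch of that idea, assuming a hypothetical `toy` table with one skewed numeric column and one categorical column; the median/mode split is one reasonable choice, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (an assumption for illustration only)
toy = pd.DataFrame({
    "Income": [4000.0, np.nan, 5000.0, 90000.0],   # skewed by one large value
    "Education": ["btech", "btech", None, "phd"],
})

imputed = toy.copy()
num_cols = imputed.select_dtypes(include="number").columns
cat_cols = imputed.select_dtypes(exclude="number").columns

# Numerical columns: fill with the column median
imputed[num_cols] = imputed[num_cols].fillna(imputed[num_cols].median())

# Categorical columns: fill with the column mode (most frequent value)
for col in cat_cols:
    imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

print(imputed)
```

The missing income becomes 5000.0 (the median of the three observed values) instead of being pulled toward 90000, and the missing education becomes "btech".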

### Step 4: Outlier Detection and Removal

An outlier is a data point that differs significantly from other observations; it is an anomaly.
We detect outliers using the IQR (interquartile range) method.
IQR = Q3 (75th percentile) - Q1 (25th percentile).
Any value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR is treated as an outlier.

In [None]:
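# Restrict the check to numeric columns; quantiles are not defined for text columns
num = df.select_dtypes(include="number")
# Calculate Q1 and Q3
Q1 = num.quantile(0.25)
Q3 = num.quantile(0.75)
IQR = Q3 - Q1
# Define outlier thresholds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Drop every row that has an outlier in at least one numeric column.
# .copy() makes the result an independent DataFrame, so later column assignments do not trigger SettingWithCopyWarning.
df_no_outliers = df[~((num < lower_bound) | (num > upper_bound)).any(axis=1)].copy()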
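
A small worked example may help show how the IQR thresholds behave. This sketch uses a hypothetical Series of eight values with one obvious outlier; the numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical values (an assumption for illustration only); 95 is the obvious outlier
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])

q1 = values.quantile(0.25)   # 12.75 with pandas' default linear interpolation
q3 = values.quantile(0.75)   # 16.5
iqr = q3 - q1                # 3.75

lower = q1 - 1.5 * iqr       # 7.125
upper = q3 + 1.5 * iqr       # 22.125

# Keep only the values inside [lower, upper]
kept = values[(values >= lower) & (values <= upper)]
print(kept.tolist())
```

Only 95 falls outside the range 7.125 to 22.125, so the other seven values are kept.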


### Step 5: Feature Scaling: Normalization and Standardization


The bias problem: many machine learning models become biased towards features with larger numeric values.
The model does not know how important each feature is (e.g., that "Job Stability" is just as important as "Income"). It only sees numbers.

Income: 5000, Job stability: 2.5 -> the model treats income as if it were 2000 times more important than job stability.
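
One way to see this effect is through distances, which models such as k-nearest neighbours rely on. The sketch below uses three hypothetical applicants given as (income, job stability) pairs; the values are assumptions chosen for illustration.

```python
import math

# Hypothetical applicants as (income, job_stability) pairs
a = (5000.0, 2.5)
b = (5100.0, 2.5)   # small relative change in income, same stability
c = (5000.0, 9.5)   # same income, very different stability

dist_ab = math.dist(a, b)   # 100.0
dist_ac = math.dist(a, c)   # 7.0

print(dist_ab, dist_ac)
```

On unscaled data, `a` looks far closer to `c` than to `b`, even though a 2% change in income is arguably a smaller difference than a jump in stability from 2.5 to 9.5. The income column dominates simply because its numbers are larger.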

Solution: normalization (min-max scaling):
We rescale every column so that its values fall between 0 and 1.
New_value = (value - min) / (max - min)


In [None]:
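from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

# Scalers expect 2D input, so select the column with double brackets; ravel() flattens the result back to 1D
df_no_outliers['Income'] = scaler.fit_transform(df_no_outliers[['Income']]).ravel()
# fit_transform(): learns min/max from the data and applies the scaling.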
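
The min-max formula can also be applied by hand, which makes it easy to check what the scaler is doing. This is a minimal sketch on a hypothetical `income` Series; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical income values (an assumption for illustration only)
income = pd.Series([2000.0, 3500.0, 5000.0, 8000.0])

# New_value = (value - min) / (max - min)
scaled = (income - income.min()) / (income.max() - income.min())

print(scaled.tolist())   # [0.0, 0.25, 0.5, 1.0]
```

The smallest value maps to 0, the largest to 1, and everything else lands proportionally in between.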


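A widely used alternative is standardization.
It transforms features to have mean = 0 and standard deviation = 1, which is useful for roughly normally distributed features.
It is less distorted by outliers than min-max scaling, because a single extreme value no longer squeezes the remaining values into a narrow part of the [0, 1] range.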

In [None]:
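from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Scale numeric columns only; StandardScaler cannot handle text columns
scaled_data = scaler.fit_transform(df_no_outliers.select_dtypes(include="number"))
# fit_transform(): learns mean/std from the data and applies the scaling; the result is a NumPy array.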
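
Standardization can be reproduced by hand as z = (value - mean) / std. The sketch below uses a hypothetical array whose mean and standard deviation come out as round numbers. It uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses internally.

```python
import numpy as np

# Hypothetical values (an assumption for illustration only)
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = values.mean()   # 5.0
std = values.std()     # 2.0 (population standard deviation, ddof=0)

# z-score: how many standard deviations each value lies from the mean
z = (values - mean) / std

print(z.tolist())      # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

After the transformation the values have mean 0 and standard deviation 1, but unlike min-max scaling they are not confined to a fixed range.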

### Step 6: Encoding Categorical Data

The problem: machine learning models are mathematical equations. They can multiply, divide, and subtract numbers, but they cannot understand text.
We must translate text (categorical data) into numbers.

Label Encoding: assigns each category a unique integer. It implies an order (ranking) among categories.
Cons: introduces an implicit order, which is unwanted for nominal data like gender.

The solution: One-hot Encoding.
One-hot Encoding:
Instead of ranking categories, it converts them into binary columns, with each column representing a separate category.
Cons: can cause high dimensionality and sparse data when a feature has many categories.



In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

df['Education'] = le.fit_transform(df['Education'])
# LabelEncoder assigns integers in sorted (alphabetical) order: btech -> 0, mtech -> 1, phd -> 2
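
When the categories have a real ranking, an explicit mapping makes the intended order visible instead of relying on alphabetical sorting. This is a minimal sketch using plain pandas; the `education` Series and the `order` dictionary are hypothetical and chosen for illustration.

```python
import pandas as pd

# Hypothetical education values (an assumption for illustration only)
education = pd.Series(["phd", "btech", "mtech", "btech"])

# Explicit ranking chosen by hand, lowest to highest
order = {"btech": 0, "mtech": 1, "phd": 2}

# Replace each category with its rank
encoded = education.map(order)

print(encoded.tolist())   # [2, 0, 1, 0]
```

Any category missing from `order` would become NaN, which makes unexpected values easy to spot.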



In [None]:
df = pd.get_dummies(df, columns=['gender'], drop_first=True)
# get_dummies creates a binary column for each category in the 'gender' column.
# Ex: 'gender' has two categories (male, female). Without drop_first, this would create 'gender_female' (1 for female, 0 for male) and 'gender_male' (1 for male, 0 for female).
# With drop_first=True, the first category ('gender_female') is dropped, so only 'gender_male' remains.

# drop_first=True avoids the dummy variable trap, which occurs when one dummy variable can be perfectly predicted from the others. Dropping one category prevents this multicollinearity in the model.
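
The effect of `drop_first` is easiest to see side by side. This sketch runs `pd.get_dummies` on a hypothetical four-row `toy` table, once with and once without the flag.

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration only)
toy = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

# One binary column per category
full = pd.get_dummies(toy, columns=["gender"])
# Same encoding, but the first category (alphabetically) is dropped
reduced = pd.get_dummies(toy, columns=["gender"], drop_first=True)

print(list(full.columns))      # ['gender_female', 'gender_male']
print(list(reduced.columns))   # ['gender_male']
```

With two categories, the single remaining column carries all the information: a row that is not male must be female.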
