# Data Preprocessing

* Data preprocessing means converting raw data into a clean, structured, model-ready form. Real-world data is almost always messy, so preprocessing is a necessary step.

* Why we use it in AIML:

    - Machine learning algorithms need clean, consistent data. If the data contains missing values, outliers, unscaled features, or inconsistently labelled categories, the model learns the wrong patterns.

In [3]:
import pandas as pd

df = pd.read_csv("sample.csv")

df = df.drop_duplicates()
# drop duplicate rows so the model does not learn from redundant records
df.head()

Unnamed: 0,Index,User Id,First Name,Last Name,Sex,Email,Phone,Date of birth,Job Title
0,1,88F7B33d2bcf9f5,Shelby,Terrell,,elijah57@example.net,001-084-906-7849x73518,1945-10-26,Games developer
1,2,f90cD3E76f1A9b9,Phillip,Summers,,bethany14@example.com,214.112.6044x4913,1910-03-24,Phytotherapist
2,3,DbeAb8CcdfeFC2c,Kristine,Travis,Male,bthompson@example.com,277.609.7938,1992-07-02,Homeopath
3,4,A31Bee3c201ef58,Yesenia,Martinez,Male,kaitlinkaiser@example.com,584.094.6111,2017-08-03,Market researcher
4,5,1bA7A3dc874da3c,Lori,Todd,Male,buchananmanuel@example.net,689-207-3558x7233,1938-12-01,Veterinary surgeon
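
The first two rows of the output above have an empty `Sex` field, so it helps to count missing values and duplicates before cleaning. The following is a minimal sketch on a hypothetical toy DataFrame (`toy` and its rows are assumptions for illustration, not the contents of `sample.csv`):

```python
import pandas as pd

# Hypothetical toy data; the column names only mirror the output above.
toy = pd.DataFrame({
    "First Name": ["Shelby", "Phillip", "Kristine", "Kristine"],
    "Sex": [None, None, "Male", "Male"],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Number of rows that exactly repeat an earlier row
duplicate_rows = int(toy.duplicated().sum())

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
```

Checking these counts first makes it easier to decide whether to drop rows, fill values, or leave a column as it is.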


# Train–Test Split

* The train-test split technique divides the dataset into two parts:

    - Training set: the data given to the model so it can learn patterns

    - Testing set: the data treated as unseen, so that the model's real-world performance can be estimated

* If we do not split, the model is evaluated on the same data it was trained on. The score then looks overly optimistic, and overfitting goes unnoticed.

* Why we use it:

    - The model's objective is to predict unseen real-world data. A train-test split ensures that the evaluation is fair.

    

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X , y , test_size=0.2)

# hold out 20% of the data for testing so that the evaluation is unbiased
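
The cell above assumes that `X` and `y` already exist. A self-contained sketch on synthetic data might look like the following; the arrays, the `random_state` value, and the use of `stratify` are assumptions added for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): 10 samples, 2 features.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)  # balanced binary labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # hold out 20% of the rows for testing
    random_state=42,   # fixed seed so the split is reproducible
    stratify=y,        # keep the class ratio the same in both parts
)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes results repeatable between runs, and `stratify=y` is mainly useful for classification targets with imbalanced classes.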

# Scaling

* Scaling means bringing numerical features onto a common, uniform scale.

### (1) StandardScaler

* StandardScaler transforms each numeric feature so that

    - mean = 0

    - standard deviation = 1

* In other words, each value expresses how far above or below the average it is, measured in standard deviations.

* Why in AIML:

    - Some algorithms, such as Logistic Regression, SVM, and Neural Networks, are sensitive to input magnitude. If one feature has a small value range and another a large one, the large-range feature dominates and the model becomes biased toward it.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# bring the features onto a standard scale so that no feature dominates because of its units
# e.g. Age and Salary end up on a comparable scale, so the model can weight them fairly
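
As a rough sketch of what StandardScaler computes, the example below compares it with the formula z = (x - mean) / std applied by hand. The Age and Salary values are hypothetical, and fitting on the training rows only is an assumption added here to keep test data out of the statistics:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical Age and Salary columns, for illustration only.
X = np.array([
    [25, 50000], [30, 60000], [27, 55000], [22, 52000],
    [28, 58000], [35, 61000], [41, 54000], [38, 57000],
], dtype=float)

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training mean/std

# The same result by hand, using the population standard deviation (ddof=0)
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(X_train_scaled, manual))
```

Calling `transform` (not `fit_transform`) on the test rows means the test set is scaled with statistics learned from the training set alone.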

### (2) MinMaxScaler

* MinMaxScaler compresses values into the range 0 to 1.
The formula is: (x - min) / (max - min)

* Why in AIML :

    - Algorithms such as Neural networks, KNN, and KMeans work well with bounded ranges. Note that this scaling does not reduce the impact of outliers: a single extreme value becomes the min or max and squeezes all other values into a narrow band.

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
# normalize the values to the 0-1 range so that training is more stable
# every value now lies between 0 and 1
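
A small sketch of the formula (x - min) / (max - min), checked against MinMaxScaler. The salary values are the ones used in the Example section of this notebook; the variable names are assumptions for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One column of salaries, including the extreme value 700000
salary = np.array([[50000.0], [60000.0], [55000.0], [700000.0], [58000.0]])

scaled = MinMaxScaler().fit_transform(salary)

# The same transformation written out by hand
col_min = salary.min(axis=0)
col_max = salary.max(axis=0)
manual = (salary - col_min) / (col_max - col_min)

print(scaled.ravel())
```

Because 700000 defines the maximum, the four ordinary salaries all land below 0.02, which illustrates how sensitive this scaler is to extreme values.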

# Encoding Categorical Data

### a) OneHot Encoding

* OneHot encoding converts categorical text into multiple binary columns.
Each category gets its own column, in which membership is represented as 0/1.

* Why in AIML:

    - Most ML algorithms cannot work with raw text. OneHot encoding ensures that no artificial order is assumed between categories.

* Runnable Example:
    - `get_dummies` has not been run on the data above, but the concept is the same.

* Comment:

    - Gender = Male, Female can be converted into a numeric binary form.

In [None]:
pd.get_dummies(df['Sex'])
# convert the gender column into binary columns (Male, Female)
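
A self-contained sketch of the same call on a hypothetical column (`df_demo` is an assumption for illustration). It also shows how a missing value is handled, since the `Sex` column above contains empty entries:

```python
import pandas as pd

# Hypothetical column with one missing value
df_demo = pd.DataFrame({"Sex": ["Male", "Female", "Female", None, "Male"]})

# One 0/1 column per category; a missing value becomes a row of zeros
encoded = pd.get_dummies(df_demo["Sex"], dtype=int)

# dummy_na=True adds an explicit indicator column for missing values
with_na = pd.get_dummies(df_demo["Sex"], dummy_na=True, dtype=int)

print(encoded)
print(with_na)
```

Whether to keep missing values as an all-zero row or as their own indicator column depends on whether "missing" carries information in the dataset.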


### b) Label Encoding

* LabelEncoder assigns a unique integer to each category. It is simple and efficient, but it introduces an artificial order between the categories.

* Why in AIML:

    - Tree-based algorithms (Decision Tree, RandomForest, XGBoost) split on thresholds and are much less affected by this arbitrary order, so label encoding usually works well for them. For linear or distance-based models, the fake order can mislead the model.

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['City'] = le.fit_transform(df['City'])
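
A minimal sketch of how the integers are assigned, using the city names from the Example section of this notebook. LabelEncoder sorts the categories and numbers them from 0, and the mapping can be reversed with `inverse_transform`:

```python
from sklearn.preprocessing import LabelEncoder

cities = ["Mumbai", "Delhi", "Mumbai", "Chennai", "Delhi"]

le = LabelEncoder()
codes = le.fit_transform(cities)

# classes_ holds the sorted categories; the index of each is its code
print(list(le.classes_))
print(codes.tolist())

# Map the integer codes back to the original labels
decoded = le.inverse_transform(codes)
print(list(decoded))
```

scikit-learn documents LabelEncoder as intended for target labels; for input feature columns, `OrdinalEncoder` from the same module is the usual counterpart.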

# Outlier Detection Basics

* Outliers are values that differ extremely from the rest of the data.

* Example: in the Salary column, 700000 is a clear outlier.

* If they are not detected or handled, the model can be disproportionately influenced by those values.

* Common methods:

    - Z score

    - IQR method

    - Boxplot visualization

* Why in AIML:

    - Outliers make training unstable. In regression models in particular, performance can degrade heavily.

In [None]:
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1

outliers = df[(df['Salary'] < Q1 - 1.5 * IQR) | (df['Salary'] > Q3 + 1.5 * IQR)]
# detect extreme salary values with the IQR method
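
The Z score method listed above can be sketched in a similar way. The salary values and the threshold of 3 are assumptions for illustration (3 is a common convention, not a fixed rule):

```python
import numpy as np

# Hypothetical salaries with one extreme value
salaries = np.array([50000, 52000, 55000, 58000, 60000, 61000,
                     54000, 57000, 53000, 59000, 56000, 700000], dtype=float)

# Z score: distance from the mean in units of (population) standard deviation
z_scores = (salaries - salaries.mean()) / salaries.std()

# Flag values whose absolute z score exceeds the chosen threshold
threshold = 3
outliers = salaries[np.abs(z_scores) > threshold]

print(outliers)
```

One caveat: with n values and the population standard deviation, no z score can exceed sqrt(n - 1), so on a 5-row dataset a threshold of 3 can never fire. The IQR method is often the safer choice for very small samples.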

# Pipeline Basics

* A Pipeline is a combined workflow in which preprocessing and model training run as one sequence.
You can chain steps such as:

* scaling → encoding → model

* Why in AIML:

    - It avoids data leakage, because transformers are fitted on the training data only

    - The same transformations are applied to both train and test data

    - The code stays clean and maintainable
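
One possible way to express scaling → encoding → model in scikit-learn is a `Pipeline` wrapped around a `ColumnTransformer`. The dataset, the `Purchased` target, and the choice of `LogisticRegression` below are all assumptions for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical dataset: two numeric features, one categorical, binary target
df_demo = pd.DataFrame({
    "Age": [25, 30, 27, 22, 28, 35, 41, 38, 45, 50, 33, 29],
    "Salary": [30000, 32000, 35000, 28000, 40000, 60000,
               72000, 65000, 80000, 90000, 58000, 36000],
    "City": ["Mumbai", "Delhi", "Mumbai", "Chennai", "Delhi", "Mumbai",
             "Delhi", "Chennai", "Mumbai", "Delhi", "Chennai", "Mumbai"],
    "Purchased": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0],
})

X = df_demo[["Age", "Salary", "City"]]
y = df_demo["Purchased"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Scale numeric columns and one-hot encode the categorical column
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Age", "Salary"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["City"]),
])

# Chain preprocessing and the model into a single estimator
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression()),
])

model.fit(X_train, y_train)          # transformers are fitted on training rows only
predictions = model.predict(X_test)  # the same fitted transformations are reused
print(predictions)
```

Because the scaler and encoder live inside the pipeline, calling `fit` never sees the test rows, which is how the leakage point above is enforced in practice.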

# Example

In [12]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
import numpy as np

# Sample dataset
df = pd.DataFrame({
    "Age": [25, 30, None, 22, 28],
    "Salary": [50000, 60000, 55000, 700000, 58000],  # contains an outlier (700000)
    "Gender": ["Male", "Female", "Female", None, "Male"],
    "City": ["Mumbai", "Delhi", "Mumbai", "Chennai", "Delhi"]
})

output = {}

# Handling Missing Values
df_filled = df.copy()
df_filled["Age"] = df_filled["Age"].fillna(df_filled["Age"].median())
df_filled["Gender"] = df_filled["Gender"].fillna("Unknown")
output["filled_df"] = df_filled

# Train-test split
X = df_filled[["Age", "Salary"]]
y = df_filled["City"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
output["train_test_shapes"] = (X_train.shape, X_test.shape)

# StandardScaler
scaler_std = StandardScaler()
X_std = scaler_std.fit_transform(X)
output["standard_scaled"] = X_std


# MinMaxScaler
scaler_mm = MinMaxScaler()
X_mm = scaler_mm.fit_transform(X)
output["minmax_scaled"] = X_mm

# Label Encoding
le = LabelEncoder()
df_filled["City_encoded"] = le.fit_transform(df_filled["City"])
output["label_encoded"] = df_filled[["City", "City_encoded"]]


# Outlier detection using IQR
Q1 = df_filled["Salary"].quantile(0.25)
Q3 = df_filled["Salary"].quantile(0.75)
IQR = Q3 - Q1
outliers = df_filled[(df_filled["Salary"] < Q1 - 1.5 * IQR) |
                     (df_filled["Salary"] > Q3 + 1.5 * IQR)]
output["outliers"] = outliers

output


{'filled_df':     Age  Salary   Gender     City  City_encoded
 0  25.0   50000     Male   Mumbai             2
 1  30.0   60000   Female    Delhi             1
 2  26.5   55000   Female   Mumbai             2
 3  22.0  700000  Unknown  Chennai             0
 4  28.0   58000     Male    Delhi             1,
 'train_test_shapes': ((4, 2), (1, 2)),
 'standard_scaled': array([[-0.47918636, -0.52226814],
        [ 1.36383809, -0.48346664],
        [ 0.07372098, -0.50286739],
        [-1.58500103,  1.99982911],
        [ 0.62662831, -0.49122694]]),
 'minmax_scaled': array([[0.375     , 0.        ],
        [1.        , 0.01538462],
        [0.5625    , 0.00769231],
        [0.        , 1.        ],
        [0.75      , 0.01230769]]),
 'label_encoded':       City  City_encoded
 0   Mumbai             2
 1    Delhi             1
 2   Mumbai             2
 3  Chennai             0
 4    Delhi             1,
 'outliers':     Age  Salary   Gender     City  City_encoded
 3  22.0  700000  Unknown  