# Data Preprocessing

Data preprocessing is the foundation of machine learning.<br> Raw data is messy, but models need clean, consistent, numerical data to learn patterns.

1. Ordinal categorical -> Numeric 

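Example: Low < Medium < High. Convert these categories to numbers that keep that order.

- Preserve the order when encoding (for example Low=0, Medium=1, High=2).
- Don't use one-hot encoding for ordinal features, as it discards the order information.

Use: OrdinalEncoder from sklearn (pass the category order explicitly via categories=, otherwise the categories are sorted alphabetically)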
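
A minimal sketch of ordinal encoding with OrdinalEncoder. The toy DataFrame and the column name education_encoded are assumptions made for illustration; only the Low/Medium/High categories come from the notes above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column using the Low/Medium/High example.
df = pd.DataFrame({"education": ["Low", "Medium", "High", "Medium", "Low"]})

# List the categories from lowest to highest. Without this, the encoder
# sorts them alphabetically (High, Low, Medium) and the real order is lost.
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])

# fit_transform expects a 2-D input and returns a 2-D float array,
# so ravel() flattens it back into a single column.
df["education_encoded"] = encoder.fit_transform(df[["education"]]).ravel()

print(df)
```

The encoded values start at 0, so Low becomes 0, Medium 1, and High 2.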

2. Binary Nominal -> 0/1

Example: gender = male or female

- A simple mapping works; the two values have no ranking.
- Keep the mapping consistent across the whole dataset.

Usage example: encode ["Yes", "No"] -> [1, 0]
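
The Yes/No mapping above can be sketched with a plain dictionary and Series.map. The column name subscribed is a hypothetical placeholder, not something from the notes.

```python
import pandas as pd

# Hypothetical binary column for illustration.
df = pd.DataFrame({"subscribed": ["Yes", "No", "No", "Yes"]})

# One dictionary, reused everywhere, keeps the encoding consistent.
binary_map = {"Yes": 1, "No": 0}
df["subscribed_encoded"] = df["subscribed"].map(binary_map)

# Any value missing from the mapping becomes NaN, so count them as a check.
unmapped = int(df["subscribed_encoded"].isna().sum())

print(df)
print("Unmapped values:", unmapped)
```

Checking for NaN after mapping is a cheap way to catch spelling variants such as "yes" or "Y" that the dictionary does not cover.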

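3. Nominal Categorical -> Numeric (one-hot)

Example: port = S, C, Q; city = Delhi, Mumbai, Bangalore

- One-hot encoding gives each category its own 0/1 column, which prevents a false ordering.
- Use drop_first=True to avoid the dummy variable trap (perfectly collinear dummy columns).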

Use: pd.get_dummies()
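
A short sketch of one-hot encoding with pd.get_dummies, using the city example. The toy DataFrame is an assumption; dtype=int is an optional choice made here so the output shows 0/1 instead of True/False.

```python
import pandas as pd

# Hypothetical toy column using the city example.
df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Bangalore", "Delhi"]})

# drop_first=True removes the first category in sorted order (Bangalore),
# which then acts as the baseline: a row with all zeros means Bangalore.
encoded = pd.get_dummies(df, columns=["city"], drop_first=True, dtype=int)

print(encoded)
```

With three cities, two dummy columns are enough to identify every category, which is why dropping one loses no information.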

4. Standardization vs. Normalization

Standardization - rescales data to mean=0, std=1 -> (x - mean) / std
    - Often preferred for algorithms that are sensitive to feature scale or work best with centered, roughly Gaussian features (Logistic Regression, SVM, PCA)

Normalization - rescales values into the range [0, 1] -> (x - min) / (max - min)
    - Often used with distance-based algorithms (KNN) and with neural networks

Use: StandardScaler or MinMaxScaler from sklearn
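
To compare the two rescalings side by side, here is a small sketch on a single numeric column. The salary values are made-up illustration data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical salary column; scalers expect a 2-D array (rows x features).
salary = np.array([[20000.0], [35000.0], [50000.0], [30000.0], [25000.0]])

# Standardization: (x - mean) / std, giving mean 0 and standard deviation 1.
salary_std = StandardScaler().fit_transform(salary)

# Normalization: (x - min) / (max - min), giving values in [0, 1].
salary_norm = MinMaxScaler().fit_transform(salary)

print("standardized mean:", salary_std.mean(), "std:", salary_std.std())
print("normalized min:", salary_norm.min(), "max:", salary_norm.max())
```

In a real workflow the scaler would usually be fit on the training split only and then applied to the test split with transform, so that test data does not influence the scaling.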

5. Train-Test Split

Definition: Split the dataset into a training set (the model learns from it) and a test set (the model is evaluated on it).

Rule of thumb: 70-80% train, 20-30% test.

Tip: Use train_test_split(X, y, test_size=0.2, random_state=42). Setting random_state makes the shuffle reproducible.

Exercise: Split the Iris dataset 80/20.
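
One possible way to approach the Iris exercise is sketched here. The stratify=y argument is an optional addition that is not part of the tip above; it keeps the class proportions the same in both splits.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the features and labels as NumPy arrays (150 samples, 4 features).
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("train:", X_train.shape, "test:", X_test.shape)
```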

✅ Solution Code (worked example of steps 1-5 on a small dataset)
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split

# Step 1: Create dataset
data = {
    'education': ['Low','Medium','High','Medium','Low'],
    'gender': ['Male','Female','Female','Male','Male'],
    'city': ['Delhi','Mumbai','Chennai','Delhi','Mumbai'],
    'salary': [20000, 35000, 50000, 30000, 25000],
    'target': [0,1,1,0,0]   # Example classification label
}

df = pd.DataFrame(data)
print("Original Data:\n", df)

# Step 2: Ordinal encode education
edu_map = {'Low':1, 'Medium':2, 'High':3}
df['education'] = df['education'].map(edu_map)

# Step 3: Binary encode gender
df['gender'] = df['gender'].map({'Male':0, 'Female':1})

# Step 4: One-hot encode city
df = pd.get_dummies(df, columns=['city'], drop_first=True)

print("\nEncoded Data:\n", df)

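# Step 5: Standardize & Normalize salary
# Note: the scalers are fit on the full dataset here for simplicity; in a real
# pipeline, fit them on the training split only to avoid leaking test data.
scaler_std = StandardScaler()
scaler_norm = MinMaxScaler()

# fit_transform returns a 2-D array of shape (n, 1); ravel() flattens it
df['salary_std'] = scaler_std.fit_transform(df[['salary']]).ravel()
df['salary_norm'] = scaler_norm.fit_transform(df[['salary']]).ravel()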

print("\nAfter Scaling:\n", df[['salary','salary_std','salary_norm']])

# Step 6: Train-test split
X = df.drop(columns=['target'])
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("\nX_train:\n", X_train)
print("\nX_test:\n", X_test)


6. Types of ML

Supervised Learning: Uses labeled data. (Regression, Classification)

Unsupervised Learning: Uses unlabeled data. (Clustering, Dimensionality reduction)

Semi-supervised Learning: Few labeled + many unlabeled samples.

Reinforcement Learning: Agent learns by trial/error with rewards.

📌 Short Assignment

Task:

- Create a small dataset with:
    - education (Low/Medium/High) → ordinal
    - gender (Male/Female) → binary nominal
    - city (Delhi/Mumbai/Chennai) → nominal categorical
    - salary (numeric)
- Convert all categorical features properly.
- Standardize and normalize salary.
- Split into train-test sets (80/20).


Example dataset:
data = {
    'education': ['Low','Medium','High','Medium','Low'],
    'gender': ['Male','Female','Female','Male','Male'],
    'city': ['Delhi','Mumbai','Chennai','Delhi','Mumbai'],
    'salary': [20000, 35000, 50000, 30000, 25000],
    'target': [0,1,1,0,0]   # Example classification label
}
