## **Data Transformation**

After cleaning, our data is accurate and consistent. But there’s one more challenge: most machine learning models expect numeric inputs, and many work better when features are on comparable scales.

Data transformation is about making all features numerical, consistent, and ready for training a model.

### **Encoding Categorical Values**

**Why do we encode?**

Most ML models operate on numbers and can’t work with text directly. For example, the “workclass” column holds values like “Private” and “Self-Employed”, which we need to turn into numbers.

**How do we encode?**

- Label Encoding → Each category gets an integer (e.g., Private = 0, Self-Employed = 1). The integers imply an order, so this is best for ordered categories.

- One-Hot Encoding → Creates new columns for each category (e.g., “Private”, “Self-Employed”), with 1s and 0s. Best for non-ordered categories.


Encoding ensures categorical data can be included in the model without losing meaning.
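
To make the difference between the two encodings concrete, here is a minimal sketch on a toy column. The values are illustrative and not taken from the dataset; `toy`, `label_encoded`, and `one_hot` are hypothetical names introduced only for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column with three categories (illustrative values only)
toy = pd.DataFrame(
    {"workclass": ["Private", "Self-Employed", "Private", "Government"]}
)

# Label encoding: one integer per category.
# LabelEncoder assigns integers in sorted (alphabetical) order of the labels.
le = LabelEncoder()
label_encoded = le.fit_transform(toy["workclass"])
print(dict(zip(le.classes_, range(len(le.classes_)))))
print(label_encoded)

# One-hot encoding: one 0/1 column per category, exactly one 1 per row
one_hot = pd.get_dummies(toy["workclass"], dtype=int)
print(one_hot)
```

Note that label encoding yields a single column whose integers carry an implied order, while one-hot encoding yields one column per category and implies no order.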

In [1]:
import pandas as pd
df_adult = pd.read_csv('data/adult_cleaned.csv')

In [None]:
from sklearn.preprocessing import LabelEncoder

categorical_cols = ['workclass', 'education', 'marital-status', 'relationship',
                    'race', 'sex', 'native-country', 'class', 'occupation']

# Fit one encoder per column and keep it, so the integer codes
# can be mapped back to the original labels with inverse_transform
encoders = {}
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    df_adult[col] = encoders[col].fit_transform(df_adult[col])


In [None]:
df_adult.head()

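| Unnamed: 0 | age | workclass | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 3 | 1 | 7 | 4 | 6 | 3 | 2 | 1 | 0.0 | 0.0 | 40 | 38 | 0 |
| 1 | 38 | 3 | 11 | 9 | 2 | 4 | 0 | 4 | 1 | 0.0 | 0.0 | 50 | 38 | 0 |
| 2 | 28 | 1 | 7 | 12 | 2 | 10 | 0 | 4 | 1 | 0.0 | 0.0 | 40 | 38 | 1 |
| 3 | 44 | 3 | 15 | 10 | 2 | 6 | 0 | 2 | 1 | 8.947546 | 0.0 | 40 | 38 | 1 |
| 4 | 18 | 7 | 15 | 10 | 4 | 14 | 3 | 4 | 0 | 0.0 | 0.0 | 30 | 38 | 0 |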


The dataset is now purely numerical, in a format that can be fed into a model.

### **Train Test Split**

**Why split the data?**

In machine learning, we want to train our model on one set of data and test it on a separate set to see how well it generalizes to new, unseen data. If we test on the same data we train on, the model might memorize the training examples instead of learning patterns that generalize, giving misleadingly high accuracy.

**What we are doing here:**

- X_train and y_train → used to fit the model.

- X_test and y_test → used to evaluate the model’s performance.

We also stratify by the target so the class distribution stays consistent in both sets.


In [None]:
from sklearn.model_selection import train_test_split

X = df_adult.drop('class', axis=1)  # Features
y = df_adult['class']               # Target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
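
As a quick check of what `stratify` does, the sketch below splits a synthetic, imbalanced target and compares the share of the positive class in each split. The data is made up for illustration (75% class 0, 25% class 1), and the names `X_toy`, `y_toy`, `y_tr`, and `y_te` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 750 samples of class 0, 250 of class 1
y_toy = np.array([0] * 750 + [1] * 250)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify=y_toy, both splits keep roughly the same share of class 1
print("train share of class 1:", y_tr.mean())
print("test share of class 1: ", y_te.mean())
```

Without `stratify`, the two shares can drift apart by chance, which matters most when one class is rare.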


Now that we have separate training and testing sets, we can safely apply **transformations** like scaling and normalization.
These transformations will be fitted on the **training set only**, and then applied to the test set — this prevents the model from “seeing” test data during training.

### **Scaling and Normalization**
Machine learning models work better when features are on a similar scale. If one feature (e.g., capital-gain) has values in the thousands, while another (e.g., education-num) ranges only from 1–16, the model may give more importance to the larger-scale feature—even if it’s not actually more important.

To prevent this, we use scaling and normalization techniques.

- Scaling → Rescales features to a specific range (e.g., 0–1).

- Standardization → Centers features around mean = 0 and std = 1.

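- Normalization → Rescales each sample (row) to unit norm (useful when only the direction of a sample matters, e.g., cosine-similarity-based methods). The term is also used loosely for Min-Max scaling, as in the method names below.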
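
Unit-norm normalization is the one transformation in this list that works per row rather than per column. A minimal sketch, assuming scikit-learn's `Normalizer` with the L2 norm and a made-up two-row array (`X_toy` and `X_unit` are hypothetical names):

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two samples pointing in the same direction with different magnitudes
X_toy = np.array([[3.0, 4.0],
                  [6.0, 8.0]])

# Each row is divided by its own L2 norm (5 and 10 here)
X_unit = Normalizer(norm="l2").fit_transform(X_toy)
print(X_unit)

# Every row now has length 1
row_norms = np.linalg.norm(X_unit, axis=1)
print(row_norms)
```

Both rows end up identical because they differ only in magnitude, which is exactly the information this transformation discards.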


Two common methods are covered below.

**Key point:**

- Always fit the scaler/normalizer on the training data only.

- Then apply the same transformation to the test data.

- This avoids data leakage, ensuring your model doesn’t “peek” at the test set.
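
The fit-on-train, transform-on-test pattern can be sketched on a tiny made-up feature. The values and the names `X_tr`, `X_te`, and `scaler` are illustrative assumptions, not part of the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data (illustrative values)
X_tr = np.array([[10.0], [20.0], [30.0]])
X_te = np.array([[15.0], [40.0]])

scaler = MinMaxScaler()
scaler.fit(X_tr)                       # learns min=10 and max=30 from training data only
X_te_scaled = scaler.transform(X_te)   # reuses the training statistics

print(scaler.data_min_, scaler.data_max_)
print(X_te_scaled.ravel())
```

The test value 40 lies above the training maximum, so it maps to 1.5 rather than into the 0–1 range. That is expected: the scaler has never seen the test data, which is the behavior we want.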

### **Method 1: Min-Max Normalization**

Min-Max scaling compresses all feature values into a fixed range, usually 0 to 1.

$$X' = \frac{X - X_{min}}{X_{max} - X_{min}}$$

**Useful when:**
- Data has a known fixed boundary (e.g., percentages, pixel values in images).

- The algorithm relies on distances or gradients (e.g., KNN, neural networks).

In [None]:
from sklearn.preprocessing import MinMaxScaler

minmax = MinMaxScaler()
X_train_minmax = minmax.fit_transform(X_train)
X_test_minmax = minmax.transform(X_test)
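
To connect the formula to the scaler, the sketch below applies it by hand with NumPy on a small made-up array and compares the result with `MinMaxScaler`. The array and the names `X_toy`, `manual`, and `sk` are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features on very different scales (illustrative values)
X_toy = np.array([[1.0, 200.0],
                  [5.0, 400.0],
                  [9.0, 1000.0]])

# X' = (X - X_min) / (X_max - X_min), computed per column
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
manual = (X_toy - col_min) / (col_max - col_min)

sk = MinMaxScaler().fit_transform(X_toy)

print(manual)
print(np.allclose(manual, sk))
```

Each column is rescaled independently, so both features end up spanning 0 to 1 regardless of their original units.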


### **Method 2: Z-Score Normalization**
Z-score standardization shifts and rescales each feature, without changing the shape of its distribution, so that on the training data:

Mean = 0

Standard Deviation = 1

$$X' = \frac{X - \mu}{ \sigma }$$

**Useful when:**

- Data is roughly normally distributed (bell-shaped).

- The algorithm is sensitive to feature scale or works best with centered inputs (e.g., Logistic Regression, SVM, PCA).

- Features don’t have a clear upper/lower bound.

In [None]:
from sklearn.preprocessing import StandardScaler

standard = StandardScaler()
X_train_standard = standard.fit_transform(X_train)
X_test_standard = standard.transform(X_test)
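
As a check on the formula, here is a minimal sketch that computes the z-score by hand on a made-up feature and compares it with `StandardScaler`. The values and the names `X_toy`, `mu`, `sigma`, `manual`, and `sk` are illustrative assumptions. One detail worth knowing: `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One toy feature (illustrative values): mean 5, population std 2
X_toy = np.array([[2.0], [4.0], [4.0], [4.0], [5.0], [5.0], [7.0], [9.0]])

mu = X_toy.mean(axis=0)      # per-column mean
sigma = X_toy.std(axis=0)    # per-column population std (ddof=0)
manual = (X_toy - mu) / sigma

sk = StandardScaler().fit_transform(X_toy)

print(mu, sigma)
print(np.allclose(manual, sk))
print(manual.mean(), manual.std())
```

After the transformation the feature has mean 0 and standard deviation 1, but the relative spacing of the values is unchanged.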