# **7-Day Data Transformation Course**

# **Course:** Machine Learning

# **Day 2:** Numerical Feature Transformation

# **Student**: Muhammad Shafiq

-----------------------------------

### **Topics Covered Today**

| Topic                                      | Covered Today |
| ------------------------------------------ | ------------- |
| Why we scale/transform numbers             | ✅             |
| Visualizing numeric feature distribution   | ✅             |
| StandardScaler, MinMaxScaler, RobustScaler | ✅             |
| Log transformation for skewed data         | ✅             |
| Fitting transformers on train data only    | ✅             |
| Real code + before-after visualization     | ✅             |


### **Load Dataset and Explore**

In [None]:
import pandas as pd

url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/AmesHousing.csv"
df = pd.read_csv(url)

print(df.shape)
df.head()

### **Identify Numeric Features**

In [None]:
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
print("Numeric Columns: ", numeric_cols)
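
The dtype list above only matches 64-bit columns. As a small sketch on a made-up toy frame (the column names and values here are hypothetical, not taken from the Ames data), `include="number"` also picks up other numeric widths such as `int32`:

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; values are made up
toy = pd.DataFrame({
    "area": np.array([800, 1200], dtype="int32"),   # 32-bit integer column
    "price": [100000.0, 150000.0],                  # float64 column
    "street": ["Pave", "Grvl"],                     # non-numeric column
})

# Explicit 64-bit dtypes skip the int32 column
narrow = toy.select_dtypes(include=["int64", "float64"]).columns.tolist()

# "number" matches every numeric dtype
broad = toy.select_dtypes(include="number").columns.tolist()

print(narrow)  # ['price']
print(broad)   # ['area', 'price']
```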


### **Visualize Before Scaling**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(df['GrLivArea'], kde=True)  # column name spelled as in the cells below
plt.title("Before scaling: GrLivArea")
plt.show()

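### **Scaling**

Numeric features in the same dataset can span very different ranges, so models that rely on distances or gradients can be dominated by the features with the largest values: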

| Feature      | Range            |
| ------------ | ---------------- |
| LotArea      | 3,000 – 20,000   |
| BedroomAbvGr | 0 – 10           |
| SalePrice    | 35,000 – 750,000 |
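
To make the effect of these ranges concrete, here is a minimal sketch with three made-up houses (the numbers are hypothetical). Before scaling, the Euclidean distance is driven almost entirely by `LotArea`; after standardization, a difference in bedrooms counts about as much as a difference in lot size.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical houses: [LotArea, BedroomAbvGr]; values are made up
houses = np.array([
    [8000.0, 2.0],
    [8100.0, 5.0],   # similar lot, very different bedroom count
    [9500.0, 2.0],   # different lot, same bedroom count
])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Raw distances: the lot-size difference swamps the bedroom difference
raw_d01 = dist(houses[0], houses[1])
raw_d02 = dist(houses[0], houses[2])

# Distances after standardizing each column
scaled = StandardScaler().fit_transform(houses)
scaled_d01 = dist(scaled[0], scaled[1])
scaled_d02 = dist(scaled[0], scaled[2])

print(f"raw:    d(0,1)={raw_d01:.2f}  d(0,2)={raw_d02:.2f}")
print(f"scaled: d(0,1)={scaled_d01:.2f}  d(0,2)={scaled_d02:.2f}")
```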


### **Apply Standard Scaler**

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fitted on the full dataset here for illustration only; in a real project, fit on the training split only
df['GrLivArea_scaled'] = scaler.fit_transform(df[['GrLivArea']])

# Compare visually
fig, axs = plt.subplots(1, 2, figsize=(12, 4))
sns.histplot(df['GrLivArea'], kde=True, ax=axs[0])
axs[0].set_title('Original GrLivArea')
sns.histplot(df['GrLivArea_scaled'], kde=True, ax=axs[1])
axs[1].set_title('Standard Scaled GrLivArea')
plt.show()
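
As a quick check of what `StandardScaler` computes, the sketch below compares it with the manual formula `(x - mean) / std` on a few made-up area values. Note that the scaler uses the population standard deviation, which matches NumPy's default `ddof=0`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up living areas, shaped (n_samples, 1) as scikit-learn expects
x = np.array([[800.0], [1200.0], [1500.0], [2000.0], [4000.0]])

scaled = StandardScaler().fit_transform(x)

# Manual z-score: subtract the mean, divide by the population std (ddof=0)
manual = (x - x.mean()) / x.std()

print(np.allclose(scaled, manual))   # True
print(scaled.mean(), scaled.std())   # approximately 0 and 1
```

Standardization shifts and rescales the values but does not change the shape of the distribution, so a skewed feature stays skewed.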


### **MinMax Scaler**

In [None]:
from sklearn.preprocessing import MinMaxScaler

minmax = MinMaxScaler()
df['GrLivArea_minmax'] = minmax.fit_transform(df[['GrLivArea']])

df[['GrLivArea', 'GrLivArea_scaled', 'GrLivArea_minmax']].head()


By default, `MinMaxScaler` rescales values into the range 0 to 1, which is useful for models that expect bounded inputs, such as neural networks.
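
The min-max rescaling can be verified by hand with `(x - min) / (max - min)`. This is a minimal sketch on made-up values, not on the Ames column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up living areas, shaped (n_samples, 1)
x = np.array([[800.0], [1200.0], [1500.0], [2000.0], [4000.0]])

scaled = MinMaxScaler().fit_transform(x)

# Manual min-max formula
manual = (x - x.min()) / (x.max() - x.min())

print(np.allclose(scaled, manual))   # True
print(scaled.min(), scaled.max())    # 0.0 1.0
```

Because the minimum and maximum define the scale, a single extreme value compresses all the other points toward 0.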

### **Robust Scaler**

In [None]:
from sklearn.preprocessing import RobustScaler

robust = RobustScaler()
df['GrLivArea_robust'] = robust.fit_transform(df[['GrLivArea']])
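
With its default settings, `RobustScaler` centers on the median and divides by the interquartile range (IQR). The sketch below, on made-up values, checks that formula and then makes the largest value ten times bigger: the robust-scaled values of the other points stay the same, while the standard-scaled values shift.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Made-up living areas; the second array has a much more extreme outlier
x_a = np.array([[800.0], [1200.0], [1500.0], [2000.0], [4000.0]])
x_b = np.array([[800.0], [1200.0], [1500.0], [2000.0], [40000.0]])

# Manual robust scaling: (x - median) / (Q3 - Q1)
iqr = np.percentile(x_a, 75) - np.percentile(x_a, 25)
manual = (x_a - np.median(x_a)) / iqr

robust_a = RobustScaler().fit_transform(x_a)
robust_b = RobustScaler().fit_transform(x_b)
std_a = StandardScaler().fit_transform(x_a)
std_b = StandardScaler().fit_transform(x_b)

print(np.allclose(robust_a, manual))            # True
print(np.allclose(robust_a[:4], robust_b[:4]))  # True: median and IQR are unchanged
print(np.allclose(std_a[:4], std_b[:4]))        # False: mean and std moved
```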


### **Log Transformation (for Skewed Distributions)**

In [None]:
import numpy as np

df['GrLivArea_log'] = np.log1p(df['GrLivArea'])  # log(x+1)

sns.histplot(df['GrLivArea_log'], kde=True)
plt.title("Log-Transformed GrLivArea")
plt.show()
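
The histogram can be backed up with a number. As a sketch on synthetic right-skewed data (a log-normal sample generated here, not the Ames column), skewness drops sharply after `np.log1p`, and `np.expm1` recovers the original values:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample standing in for a feature such as living area
rng = np.random.default_rng(0)
x = rng.lognormal(mean=7.0, sigma=0.5, size=1000)

logged = np.log1p(x)                      # log(x + 1), safe when x contains zeros

skew_before = pd.Series(x).skew()
skew_after = pd.Series(logged).skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")

# expm1 is the inverse of log1p, so the transformation is reversible
restored = np.expm1(logged)
print(np.allclose(restored, x))           # True
```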


### **Fit Only on Training Data**

In [None]:
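from sklearn.model_selection import train_test_split

# In real projects, split first (the 80/20 ratio here is an arbitrary choice)
X_train, X_test = train_test_split(df[['GrLivArea']], test_size=0.2, random_state=42)

scaler = StandardScaler()
scaler.fit(X_train[['GrLivArea']])                        # learn mean/std from train only
X_train_scaled = scaler.transform(X_train[['GrLivArea']])
X_test_scaled = scaler.transform(X_test[['GrLivArea']])   # reuse the train statistics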
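
One common way to enforce this rule is to wrap the scaler and the model in a scikit-learn `Pipeline`, so the scaler is fitted only on whatever data is passed to `fit`. The sketch below uses synthetic data and a `LinearRegression` model purely as placeholders; neither comes from the course material.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic feature and target, standing in for living area and sale price
rng = np.random.default_rng(42)
X = rng.lognormal(mean=7.2, sigma=0.4, size=(200, 1))
y = 100.0 * X[:, 0] + rng.normal(0, 5000, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training split only
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])
pipe.fit(X_train, y_train)

# The learned mean comes from the training rows, not from the full data
learned_mean = pipe.named_steps["scaler"].mean_[0]
print(np.isclose(learned_mean, X_train.mean()))   # True

# Scoring applies the train-fitted scaler to the test split automatically
r2 = pipe.score(X_test, y_test)
print(f"test R^2: {r2:.3f}")
```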


### **✅ Assignment**

- List the top 5 numeric columns from the Ames dataset.

- Visualize their distributions.

- Try 2 different transformations per column.

- Comment: which scaler worked best and why?