### **Chunk 2: Data Splitting & Basic Preprocessing**

#### **1. Concept Introduction**

-   **Data Splitting Revisited**: `train_test_split` is great for a quick validation, but a single split can be lucky or unlucky. The gold standard is **cross-validation**, where we split the data into multiple "folds" and train/test the model multiple times. This gives a much more robust estimate of model performance. We'll dive deep into this later, but for now, know that `train_test_split` is our primary tool for fast iteration.

-   **Why Preprocessing Matters**: Imagine you're building a model to predict house prices, and you have two features: `number_of_rooms` (from 1 to 10) and `square_footage` (from 500 to 5000). An algorithm that uses distance (like K-Nearest Neighbors) will be completely dominated by `square_footage`. A change of 100 sq ft will seem vastly more important than a change of 2 rooms, even if that isn't true. **Feature scaling** fixes this by putting all features on a common scale.

-   **Core Preprocessing Tools**:
    1.  `StandardScaler`: The most common scaler. It transforms each feature to have a mean of 0 and a standard deviation of 1. It does not require normally distributed data, though it works best when features are roughly bell-shaped. Because the mean and standard deviation are themselves pulled by extreme values, it is sensitive to outliers.
    2.  `MinMaxScaler`: Scales features to a fixed range, usually 0 to 1. It's useful when you need bounded values. It is even more sensitive to outliers, because a single extreme value sets the minimum or maximum and squeezes the rest of the data into a narrow band.
    3.  `SimpleImputer`: Real-world data is often missing values (represented as `NaN`). Most scikit-learn estimators raise an error if you feed them `NaN`s. `SimpleImputer` is a basic strategy to fill them in, for example, with the mean, median, or most frequent value of the column.
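
The house-price example above can be checked with a few lines of plain Python. This is only a sketch: the three houses are hypothetical, and the min-max ranges (1 to 10 rooms, 500 to 5000 sq ft) are taken from the text rather than learned from data.

```python
import math

# Hypothetical houses: (number_of_rooms, square_footage)
house_a = (3, 1500)
house_b = (5, 1500)  # two more rooms, same area
house_c = (3, 1600)  # same rooms, 100 sq ft more

# Raw Euclidean distances: square footage dominates.
raw_ab = math.dist(house_a, house_b)
raw_ac = math.dist(house_a, house_c)
print(f"raw:    2 rooms -> {raw_ab:.1f}, 100 sq ft -> {raw_ac:.1f}")

def min_max(value, lo, hi):
    """Map value from [lo, hi] onto [0, 1]."""
    return (value - lo) / (hi - lo)

def scale(house):
    rooms, sqft = house
    return (min_max(rooms, 1, 10), min_max(sqft, 500, 5000))

# After scaling, the 2-room difference is the larger one.
scaled_ab = math.dist(scale(house_a), scale(house_b))
scaled_ac = math.dist(scale(house_a), scale(house_c))
print(f"scaled: 2 rooms -> {scaled_ab:.3f}, 100 sq ft -> {scaled_ac:.3f}")
```

On the raw features, a 100 sq ft change looks 50 times larger than a 2-room change; after min-max scaling the ordering flips.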

The most critical rule of preprocessing is: **You must learn the scaling parameters (like the mean and standard deviation) from the training data ONLY.** Applying `.fit()` to the entire dataset before splitting is a form of **data leakage** and will give you an artificially optimistic performance estimate.
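
One way to follow this rule during cross-validation is to wrap the scaler and the model in a scikit-learn `Pipeline`, so the scaler is re-fitted on the training folds only in each round. The sketch below is an illustration, not part of the original notebook: it uses synthetic data, and `X_demo`, `y_demo`, and `pipe` are made-up names.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): three features on very different scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 1000.0])
y_demo = (X_demo[:, 0] > 0).astype(int)  # only the small-scale feature matters

# The pipeline fits the scaler on each training fold, then applies it
# to the held-out fold, so no test-fold statistics leak into training.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(pipe, X_demo, y_demo, cv=5)

print("Fold accuracies:", np.round(scores, 3))
print(f"Mean accuracy: {scores.mean():.3f}")
```

Passing the whole pipeline to `cross_val_score` is what keeps the scaling step inside each fold; scaling `X_demo` first and then cross-validating would reintroduce the leak.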

#### **2. Dataset EDA: The Wine Quality Dataset**

This dataset contains chemical properties of red wines and a quality score from 3 to 8. It's great for our purpose because the features are all numeric but have very different scales. Our goal will be to classify a wine as "good" (quality > 5) or "bad" (quality <= 5).


In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_openml

# Set plot style
sns.set_style("whitegrid")

**Load Data**

In [None]:
# Fetch the Wine Quality dataset from OpenML
wine = fetch_openml(name='wine-quality-red',
            version='active',
            as_frame=True,
            parser='auto')
df = wine.frame

# Dataset info
print("Dataset info")
df.info()

In [None]:
df

**Create Target Variable**

For this example, let's make it a binary classification problem:
quality > 5 is 'good' (1), otherwise 'bad' (0).

In [None]:
df['class'] = pd.to_numeric(df['class'], errors='coerce')  # convert to numeric
df['quality_binary'] = (df['class'] > 5).astype(int)
df = df.drop('class', axis=1) # Drop the original multiclass target


In [None]:
df

**Basic Statistics**


In [None]:
df.describe()

In [None]:
# First 5 rows
df.head()

In [None]:
# Check for any missing values
df.isnull().sum()

In [None]:
# Target Variable Distribution
plt.figure(figsize=(6,4))
sns.countplot(x='quality_binary', data=df)
plt.title("Distribution of Wine Quality (0=Bad, 1=Good)")
plt.show()

In [None]:
# Feature Distribution (Histogram)
# This clearly shows the different scales and distributions of each feature
df.drop('quality_binary', axis=1).hist(figsize=(15, 12),
                                       bins=30,
                                       edgecolor='black')
plt.suptitle('Histogram of Feature Distributions', y=0.92)
plt.show()

In [None]:
# Correlation Matrix Heatmap
plt.figure(figsize=(12,10))
crlm = df.corr()
sns.heatmap(crlm, annot=True, cmap='viridis', fmt='.2f')
plt.title('Correlation Matrix of Wine Features')
plt.show()

#### **3. Minimal Working Example: The Power of Scaling**

Let's see the dramatic effect of scaling on a K-Nearest Neighbors (KNN) model, which is highly sensitive to feature scales.

In [None]:
# Imports, Data, Splitting
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Define features (X) and target (y)
X = df.drop('quality_binary', axis=1)
y = df['quality_binary']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42,
                                                    stratify=y)




**Attempt 1: Without Scaling**

In [None]:
knn_raw = KNeighborsClassifier(n_neighbors=5)  # 5 is the default; set explicitly to show it can be changed
knn_raw.fit(X_train, y_train)
y_pred_raw = knn_raw.predict(X_test)
accuracy_raw = accuracy_score(y_pred=y_pred_raw, y_true=y_test)

print(f"KNN Accuracy WITHOUT Scaling: {accuracy_raw * 100:.2f}%")

**Attempt 2: With Proper Scaling**

In [None]:
# Create and fit the scaler ON THE TRAINING DATA ONLY
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Fit and transform on the train set

# Transform the test data using the FITTED scaler
# We must only use .transform() here to prevent data leakage from the test set.
X_test_scaled = scaler.transform(X_test)

# Train and evaluate a new KNN model on the SCALED DATA
knn_normal = KNeighborsClassifier()
knn_normal.fit(X_train_scaled, y_train)
y_pred_scaled = knn_normal.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

print(f"KNN Accuracy WITH Scaling: {accuracy_scaled * 100:.2f}%")

Just by adding this one extra step, we boosted the accuracy to 74%.

This is why scaling is essential, not optional, for distance-based models like KNN.

#### **4. Variations: `MinMaxScaler` and `SimpleImputer`**


In [None]:
# Using MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
min_max = MinMaxScaler()
X_train_mm = min_max.fit_transform(X_train)
X_test_mm  = min_max.transform(X_test)

print("First 5 rows of MinMaxScaler-transformed data:")
X_train_mm[:5]
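
One consequence of fitting `MinMaxScaler` on the training set only is that test values can land outside the 0 to 1 range. The toy arrays below are invented for illustration (`train_col`, `test_col`, and `mm` are not from the notebook).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data (assumption): training values span 1..5.
train_col = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
# The test set contains values below and above the training range.
test_col = np.array([[0.0], [3.0], [7.0]])

mm = MinMaxScaler()
train_scaled = mm.fit_transform(train_col)  # learns min=1, max=5
test_scaled = mm.transform(test_col)        # reuses the training min/max

print("train:", train_scaled.ravel())  # [0.   0.25 0.5  0.75 1.  ]
print("test: ", test_scaled.ravel())   # [-0.25  0.5   1.5 ]
```

This is expected behaviour, not a bug: the test set is scaled with the training statistics, so unseen extremes fall outside the learned range.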

In [None]:
# Using SimpleImputer
# Let's artificially add some missing values to demonstrate
from sklearn.impute import SimpleImputer

X_train_nan = X_train.copy()
# Set 10% of values in the 'pH' column to NaN (seeded for reproducibility)
rng = np.random.default_rng(42)
nan_indices = rng.choice(X_train_nan.index,
                         size=int(len(X_train_nan) * 0.1),
                         replace=False)
X_train_nan.loc[nan_indices, 'pH'] = np.nan

print(f"Missing pH values before imputation: {X_train_nan['pH'].isnull().sum()}")

# 1. Create imputer to fill with the mean
imputer = SimpleImputer(strategy='mean')

# 2. Fit on the Training Data and Transform it
X_train_impu = imputer.fit_transform(X_train_nan)

# 3. Use the same fitted imputer to transform the test set
X_test_impu = imputer.transform(X_test)

print(f"Missing values after imputation (on a new array): {np.isnan(X_train_impu).sum()}")
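
To see what the imputer actually learns, you can inspect its `statistics_` attribute after fitting. The sketch below uses a tiny invented array and the `median` strategy instead of the notebook's `mean`; `train`, `test`, and `imp` are illustrative names.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy training data (assumption) with one missing value per column.
train = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, np.nan],
                  [5.0, 30.0]])
# A test row where both values are missing.
test = np.array([[np.nan, np.nan]])

imp = SimpleImputer(strategy="median")
imp.fit(train)  # medians are computed from the non-missing training values

print("Learned fill values:", imp.statistics_)  # [ 3. 20.]

# The test row is filled with the TRAINING medians.
filled = imp.transform(test)
print("Filled test row:", filled)  # [[ 3. 20.]]
```

As with scalers, the fill values come from the training data and are simply reused on anything transformed later.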


#### **5. Common Pitfalls**

**1. Leaking Data With the Scaler (CRITICAL MISTAKE):** Never, ever fit your scaler on the whole dataset.

In [None]:
# WRONG: the scaler sees every row, including those that end up in the test set
scaler_leaker = StandardScaler()
X_leaky = scaler_leaker.fit_transform(X)  # Fit on train AND test data
X_train_l, X_test_l, y_train_l, y_test_l = train_test_split(X_leaky, y)

# Results will be overly optimistic and will not generalize
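
The leak can be made visible by comparing what the scaler learns in each case. The sketch below uses random synthetic data rather than the wine dataset, and `X_toy`, `leaky`, and `clean` are made-up names.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 100 rows, 2 features.
rng = np.random.default_rng(42)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
y_toy = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

leaky = StandardScaler().fit(X_toy)  # statistics include the test rows
clean = StandardScaler().fit(X_tr)   # statistics from training rows only

print("Mean learned from ALL rows:  ", leaky.mean_)
print("Mean learned from TRAIN rows:", clean.mean_)
```

The two sets of means differ because the leaky scaler has absorbed the test rows. The gap is often numerically small, but it still means the test set is no longer truly unseen.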

**2. Fitting the Scaler on the Test Set :** A more subtle but equally wrong mistake. This also leaks information.

In [None]:
scaler_wrong = StandardScaler()
X_train_w = scaler_wrong.fit_transform(X_train)
# WRONG: fit_transform re-learns the mean/std from the test set.
X_test_w = scaler_wrong.fit_transform(X_test)
# Correct would be: X_test_w = scaler_wrong.transform(X_test)
print("Never use fit_transform on the test set")

**3. Forgetting to Scale New Data:** When you deploy your model, every new instance you predict on, even a single one, must also be scaled using the `scaler` you saved from your training process.
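
A minimal sketch of that workflow, assuming the scaler is persisted with `joblib` (one common choice; the notebook does not say how it is saved). The data, the file name `scaler.joblib`, and the names `scaler_demo`, `model_demo`, and `new_instance` are all hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic training data (assumption): three features on different scales.
rng = np.random.default_rng(0)
X_fit = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0])
y_fit = (X_fit[:, 0] > 0).astype(int)

# Training time: fit the scaler, train the model on scaled data.
scaler_demo = StandardScaler().fit(X_fit)
model_demo = KNeighborsClassifier().fit(scaler_demo.transform(X_fit), y_fit)

# Save the fitted scaler, then load it back as a deployed service would.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scaler.joblib")
    joblib.dump(scaler_demo, path)
    loaded_scaler = joblib.load(path)

# Prediction time: a single new instance must be 2D (1 row, 3 features)
# and must go through .transform(), never .fit_transform().
new_instance = np.array([[0.5, -3.0, 120.0]])
new_scaled = loaded_scaler.transform(new_instance)
prediction = model_demo.predict(new_scaled)
print("Scaled instance:", new_scaled)
print("Prediction:", prediction)
```

In practice the model itself would be saved the same way, or both steps would be bundled into one pipeline object so they cannot get out of sync.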


### Congrats on writing the code by hand. It takes dedication and self-control not to copy and paste code. You've earned the right to move on to Chunk 3: First Supervised Learning Models.