## 🔧 Why Preprocessing is Essential

Many machine learning algorithms — especially **linear models**, **SVMs**, and **neural networks** — are sensitive to the **scale of the input features**.

📌 For example:
- If one feature ranges from `0 to 1` and another ranges from `0 to 1,000,000`, the algorithm might incorrectly assume that the second feature is more important, just because it has a larger scale.

✅ To address this:
- We **normalize** (rescale values to the range 0 to 1) or **standardize** (rescale to mean = 0, std = 1) our data so that no feature dominates the learning process simply because of its units.
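
To make the two options concrete, here is a minimal NumPy sketch of both formulas. The `salary` values are hypothetical and chosen only for illustration; scikit-learn's scalers, used later in this notebook, do the same arithmetic per column.

```python
import numpy as np

# Hypothetical feature values, chosen only for illustration
salary = np.array([50_000.0, 60_000.0, 80_000.0, 110_000.0])

# Normalization (min-max): rescale to the [0, 1] range
salary_normalized = (salary - salary.min()) / (salary.max() - salary.min())

# Standardization (z-score): subtract the mean, divide by the standard deviation
salary_standardized = (salary - salary.mean()) / salary.std()

print(salary_normalized)                                      # smallest value -> 0, largest -> 1
print(salary_standardized.mean(), salary_standardized.std())  # approximately 0 and 1
```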

---

Additionally, most machine learning algorithms require all inputs to be **numeric**.

🧠 That means:
- Categorical variables like `"Male"/"Female"` or `"New York"/"Chicago"` must be **converted to numbers** using encoding techniques such as:
  - **One-Hot Encoding**
  - **Label Encoding** (in scikit-learn, `LabelEncoder` is intended for the target `y`, not for input features)
  - **Ordinal Encoding**
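
As a rough illustration of how two of these techniques differ, the sketch below encodes a column with a natural order using `OrdinalEncoder` and a column without one using `OneHotEncoder`. The `sizes` and `cities` arrays are hypothetical examples, not part of this notebook's dataset.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical columns, invented for this illustration
sizes = np.array([["small"], ["large"], ["medium"], ["small"]], dtype=object)
cities = np.array([["New York"], ["Chicago"], ["New York"], ["Los Angeles"]], dtype=object)

# Ordinal encoding: pass the order explicitly so that small < medium < large
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
sizes_encoded = ordinal.fit_transform(sizes)

# One-hot encoding: one binary column per city, with no implied order
onehot = OneHotEncoder()
cities_encoded = onehot.fit_transform(cities).toarray()  # default output is sparse

print(sizes_encoded.ravel())   # one integer code per row, following the given order
print(onehot.categories_[0])   # learned categories, sorted alphabetically
print(cities_encoded)
```

Ordinal codes suit categories with a meaningful order; for unordered categories such as cities, one-hot encoding avoids implying that one city is "greater" than another.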

---

⚠️ **Without preprocessing**, our models might:
- Make wrong assumptions about feature importance
- Fail to process non-numeric data
- Yield poor performance or convergence issues


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Create a small toy dataset that mixes numeric and categorical features
data = {
    'age': [25, 30, 35, 40, 45, 50, 55, 60],
    'salary': [50000, 54000, 60000, 68000, 75000, 80000, 90000, 110000],
    'city': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles', 'Chicago', 'New York', 'Chicago'],
    'purchased': [0, 1, 0, 0, 1, 1, 1, 1]
}
df = pd.DataFrame(data)

X = df[['age', 'salary', 'city']]
y = df['purchased']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

## 🔄 Transformers and Preprocessing

Scikit-learn's preprocessing tools follow the same estimator API as models: you call `fit` first, and then **`transform`** where a model would use `predict`.

### 📌 Core Methods

- **`fit(data)`**  
  The transformer "learns" the necessary parameters from the data.  
  - Example:  
    - `StandardScaler` learns the **mean** and **standard deviation**.
    - `OneHotEncoder` learns the **unique categories**.
  - ⚠️ You should **only fit on the training data**.

- **`transform(data)`**  
  Applies the learned transformation to the data.  
  ✅ Use this on **both training and testing data**.

- **`fit_transform(data)`**  
  A shortcut that performs both `fit` and `transform`.  
  ✅ Use only on **training data**.
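
The three methods can be sketched on a tiny example. The `train` and `test` arrays below are hypothetical single-feature data, split by hand only to show which statistics the scaler uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data, split by hand for illustration
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[40.0]])

scaler = StandardScaler()
scaler.fit(train)                     # learns mean_ and scale_ from the training data only
print(scaler.mean_, scaler.scale_)    # statistics of `train`, not of `test`

train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)  # reuses the training mean and standard deviation

# fit_transform(train) gives the same result as fit(train) followed by transform(train)
same = np.allclose(StandardScaler().fit_transform(train), train_scaled)
print(same)
```

Because the test value lies outside the training range, its scaled value is well above the largest scaled training value; that is expected, since the scaler never saw it during `fit`.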

---

## A. 🔢 Scaling Numerical Features

Scaling ensures that features contribute equally to the model by bringing them to the same scale.

### 🧪 `StandardScaler`
- **Standardizes** features by removing the mean and scaling to unit variance.
- The resulting distribution will have:
  - Mean = 0
  - Standard Deviation = 1
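- Works best when features are roughly bell-shaped. It does not bound values to a fixed range, and outliers still influence the learned mean and standard deviation.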

### 📏 `MinMaxScaler`
- Scales features to a **specific range**, typically [0, 1].
- Preserves the **shape** of the original distribution.
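- Useful when features have **varying units or scales** and a bounded range is wanted. It is sensitive to outliers, because the minimum and maximum define the scale.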
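
The notebook's code cell below demonstrates `StandardScaler`; for comparison, here is a minimal sketch of `MinMaxScaler`. It reuses the first four age and salary values from the toy dataset, and the extra row `[45, 75000]` stands in for unseen data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Ages and salaries from the first four rows of the toy dataset
X_num = np.array([[25, 50000], [30, 54000], [35, 60000], [40, 68000]], dtype=float)

minmax = MinMaxScaler()                  # default feature_range is (0, 1)
X_minmax = minmax.fit_transform(X_num)

print(X_minmax)                          # each column now spans 0 to 1
print(minmax.data_min_, minmax.data_max_)

# Values outside the range seen during fit are not clipped by default
new_scaled = minmax.transform(np.array([[45.0, 75000.0]]))
print(new_scaled)                        # both entries exceed 1
```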

---

✅ Scaling is especially important for models like:
- Linear models trained with regularization or gradient descent (e.g. Ridge, Lasso, Logistic Regression)
- SVM
- KNN
- Neural Networks

Without scaling, features with larger magnitudes can **dominate** the learning process.
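
A small sketch of what "dominate" means for a distance-based model such as KNN: it measures how much each feature contributes to the squared Euclidean distance between two rows, before and after standardization. The three `[age, salary]` rows are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Three hypothetical people: columns are [age, salary]
X = np.array([[25.0, 50_000.0],
              [60.0, 50_500.0],
              [26.0, 54_000.0]])

# Share of the squared distance between person 0 and person 1 due to each feature
diff_raw = X[0] - X[1]
share_raw = diff_raw**2 / np.sum(diff_raw**2)

X_scaled = StandardScaler().fit_transform(X)
diff_scaled = X_scaled[0] - X_scaled[1]
share_scaled = diff_scaled**2 / np.sum(diff_scaled**2)

print(share_raw)     # the 35-year age gap contributes under 1% of the raw distance
print(share_scaled)  # after scaling, the age gap accounts for most of the distance
```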


In [2]:
# Separate numerical and categorical columns for now
X_train_num = X_train[['age', 'salary']]
X_test_num = X_test[['age', 'salary']]

# --- StandardScaler ---
scaler = StandardScaler()
# Fit on the training data ONLY to learn the mean and std dev
scaler.fit(X_train_num)
# Transform both train and test data
X_train_scaled = scaler.transform(X_train_num)
X_test_scaled = scaler.transform(X_test_num)

print("--- Standard Scaled Training Data ---")
print(X_train_scaled)
print(f"Mean: {X_train_scaled.mean(axis=0)}") # Should be close to 0
print(f"Std Dev: {X_train_scaled.std(axis=0)}")   # Should be close to 1

--- Standard Scaled Training Data ---
[[-1.55563492 -1.28917835]
 [ 1.41421356  1.74418248]
 [-0.70710678 -0.78361822]
 [ 0.14142136 -0.02527801]
 [-0.28284271 -0.3791701 ]
 [ 0.98994949  0.7330622 ]]
Mean: [-2.03540888e-16  0.00000000e+00]
Std Dev: [1. 1.]


## B. 🧩 Encoding Categorical Features

When working with categorical data, we need to convert it into a numeric format because most machine learning models can’t handle strings directly.

### 🎯 OneHotEncoder

- Converts categorical variables into a **"one-hot" encoded format**.
- It creates a **new binary column** for each category.
- Each row will have a `1` in the column corresponding to its category and `0` elsewhere.
- This is the **preferred method** for **nominal** (unordered) categorical variables.

#### ✅ Example

| City        | Chicago | Los Angeles | New York |
|-------------|---------|-------------|----------|
| New York    | 0       | 0           | 1        |
| Los Angeles | 0       | 1           | 0        |
| Chicago     | 1       | 0           | 0        |

---

### 🛠 Why One-Hot Encoding?

- Prevents the model from assuming **ordinal relationships** between categories.
- Keeps distance-based algorithms like **KNN** or **SVM** from being misled.

⚠️ Note:  
OneHotEncoder creates one column per unique value, so a high-cardinality feature can add hundreds or thousands of sparse columns. This inflates dimensionality and contributes to the **curse of dimensionality**, so use it carefully with high-cardinality features.
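
To see how the column count grows, the sketch below one-hot encodes a hypothetical ID-like column with up to 200 distinct values. It relies on the encoder's default sparse output, which stores only the non-zero entries.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical high-cardinality column: 1,000 rows drawn from up to 200 distinct IDs
rng = np.random.default_rng(0)
ids = rng.integers(0, 200, size=1000).astype(str).astype(object).reshape(-1, 1)

encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(ids)   # SciPy sparse matrix by default

n_unique = len(encoder.categories_[0])
print(encoded.shape)                   # (1000, n_unique): one column per distinct value
print(encoded.nnz)                     # one non-zero entry per row

# With handle_unknown="ignore", a category never seen during fit becomes an all-zero row
unseen = encoder.transform(np.array([["not-an-id"]], dtype=object))
print(unseen.toarray().sum())
```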


In [3]:
# Separate the categorical column
X_train_cat = X_train[['city']]
X_test_cat = X_test[['city']]

# --- OneHotEncoder ---
# handle_unknown='ignore' encodes categories that were not seen during fit as all zeros instead of raising an error
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
# Fit on training data to learn the categories ('New York', 'Los Angeles', 'Chicago')
ohe.fit(X_train_cat)
# Transform both
X_train_ohe = ohe.transform(X_train_cat)
X_test_ohe = ohe.transform(X_test_cat)

print("\n--- One-Hot Encoded Training Data ---")
# The columns correspond to the learned categories: ['Chicago', 'Los Angeles', 'New York']
print(X_train_ohe)


--- One-Hot Encoded Training Data ---
[[0. 0. 1.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]


- This manual process of separating columns, fitting, and transforming is tedious and prone to errors. This is why we use Pipelines.

## 🔗 2. Pipelines: Streamlining the Workflow

A **Pipeline** is a way to **chain multiple steps together** — like data transformations (scaling, encoding) and the final machine learning model — into **one single object**.

### 🛠 How It Works

- When you call `.fit()` on a pipeline:
  - It fits all the transformers (like scalers or encoders) in order.
  - Then it fits the final estimator (your model).

- When you call `.predict()`:
  - It first transforms the input data using the fitted transformers.
  - Then it makes predictions using the trained model.
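
The two bullet lists above can be checked with a short sketch that compares a pipeline against the equivalent manual steps. The data is synthetic and the step names `scaler` and `clf` are arbitrary labels chosen for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two features on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(40, 10, 100), rng.normal(70_000, 15_000, 100)])
y = (X[:, 0] + X[:, 1] / 1_000 > 110).astype(int)

X_train, X_test, y_train = X[:80], X[80:], y[:80]

# Pipeline: fit() fits the scaler, transforms X_train, then fits the classifier
pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X_train, y_train)

# Manual equivalent of the same two steps
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression().fit(scaler.transform(X_train), y_train)

# predict() transforms X_test with the fitted scaler before calling the classifier
pipe_pred = pipe.predict(X_test)
manual_pred = clf.predict(scaler.transform(X_test))
print(np.array_equal(pipe_pred, manual_pred))
```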

---

### ✅ Why Pipelines are Useful

- 🔒 **Helps prevent data leakage**  
  (Example: the scaler is fitted on the training data only, so test-set statistics never influence it.)

- 🧹 **Cleaner and simpler code**  
  No need to manually fit and transform step-by-step.

- 📦 **Reusable and exportable**  
  You can save the entire pipeline and load it later to make predictions on new data.
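
One common way to save a fitted pipeline is `joblib`, which is installed alongside scikit-learn. The sketch below is a minimal round trip under assumed conditions: the file name `model_pipeline.joblib`, the temporary directory, and the labels are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny illustrative dataset: [age, salary] with hypothetical labels
X = np.array([[25, 50000], [30, 54000], [45, 75000], [60, 110000]], dtype=float)
y = np.array([0, 0, 1, 1])

pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model_pipeline.joblib")  # hypothetical file name
    joblib.dump(pipe, path)         # saves the scaler and the classifier together
    restored = joblib.load(path)    # only load files from sources you trust

same_predictions = np.array_equal(pipe.predict(X), restored.predict(X))
print(same_predictions)
```

The saved file is generally tied to the scikit-learn version that produced it, so it is worth recording that version next to the file.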

---

### 🔧 Need to Transform Different Columns Differently?

We use **`ColumnTransformer`** when:
- Some columns need scaling (numeric).
- Some columns need encoding (categorical).

It lets us **apply different preprocessing steps to different columns** in a structured way — all inside the pipeline.


In [4]:
from sklearn.compose import ColumnTransformer

# Redefine our full train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# 1. Define the preprocessing steps for numerical and categorical features
numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

# 2. Create the preprocessor object with ColumnTransformer
# Each entry is a (name, transformer, columns) tuple, so the right
# transformer is applied to the right columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, ['age', 'salary']),
        ('cat', categorical_transformer, ['city'])
    ])

# 3. Create the full pipeline
# It has two steps: 'preprocessor' and 'classifier'
model_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                 ('classifier', LogisticRegression())])

# 4. Now, train the entire pipeline on the raw training data!
model_pipeline.fit(X_train, y_train)
print("\n--- Pipeline has been trained! ---")

# 5. Make predictions and evaluate
# The pipeline handles all the transformations for X_test automatically
accuracy = model_pipeline.score(X_test, y_test)
print(f"\nPipeline Accuracy on Test Set: {accuracy:.4f}")

# You can even make predictions on new, raw data
new_data = pd.DataFrame([{'age': 38, 'salary': 72000, 'city': 'Los Angeles'}])
prediction = model_pipeline.predict(new_data)
print(f"\nPrediction for new data point: {'Purchased' if prediction[0] else 'Not Purchased'}")


--- Pipeline has been trained! ---

Pipeline Accuracy on Test Set: 0.5000

Prediction for new data point: Not Purchased


## Exercises

**1. Manual Preprocessing:**
- Use the Titanic dataset (`sns.load_dataset('titanic')`).
- Create X from the columns pclass, sex, age, and fare. Create y from the survived column.
- Handle the missing age values by filling them with the median.
- Perform a train-test split (test_size=0.3, random_state=42).
- Fit each transformer on the training set only, then transform both the training and testing sets:
    - One-hot encode the sex column.
    - Scale the age and fare columns using StandardScaler.
    - Combine the preprocessed numerical and categorical features back into a final X_train_processed and X_test_processed. (Hint: use np.hstack).
- Train a LogisticRegression model on the processed training data and evaluate its accuracy on the processed test data.

In [17]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

titanic_df=sns.load_dataset('titanic')

features= ['pclass', 'sex', 'age', 'fare']
target = 'survived'
X = titanic_df[features].copy()  # work on a copy to avoid pandas' SettingWithCopyWarning
y = titanic_df[target]

# Handle the missing age values by filling them with the median
age_median = X['age'].median()
X['age'] = X['age'].fillna(age_median)

# Perform a train-test split (test_size=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.3, random_state= 42, stratify =y)

X_train = X_train.copy()
X_test = X_test.copy()

print(f"\nTraining set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}\n")

categorical_features= ['sex']
numerical_features = ['age', 'fare', 'pclass']

# One-hot encode the sex column
print("Applying One-Hot Encoding to the 'sex' column...")
ohe =OneHotEncoder(sparse_output = False, handle_unknown='ignore')

X_train_sex_encoded = ohe.fit_transform(X_train[categorical_features])
X_test_sex_encoded = ohe.transform(X_test[categorical_features])
print(f"One-Hot Encoded feature names:{ohe.get_feature_names_out()}")

# Scale the numeric columns with StandardScaler (age and fare, plus pclass, which is treated as numeric here)
scaler= StandardScaler()

X_train_numerical_scaled = scaler.fit_transform(X_train[numerical_features])
X_test_numerical_scaled =scaler.transform(X_test[numerical_features])

# Combine the preprocessed numerical and categorical features back into a final X_train_processed and X_test_processed
X_train_processed = np.hstack((X_train_numerical_scaled, X_train_sex_encoded))
X_test_processed = np.hstack((X_test_numerical_scaled, X_test_sex_encoded))

print(f"\nFinal processed training set shape: {X_train_processed.shape}")
print(f"Final processed testing set shape: {X_test_processed.shape}\n")

# Train a LogisticRegression model on the processed training data and evaluate its accuracy on the processed test data
model = LogisticRegression(random_state= 42)
model.fit(X_train_processed, y_train)

y_pred= model.predict(X_test_processed)
accuracy= accuracy_score(y_test, y_pred)
print("\n--- Model Evaluation ---")
print(f"Accuracy on the processed test data: {accuracy:.4f}")


Training set shape: (623, 4)
Testing set shape: (268, 4)

Applying One-Hot Encoding to the 'sex' column...
One-Hot Encoded feature names:['sex_female' 'sex_male']

Final processed training set shape: (623, 5)
Final processed testing set shape: (268, 5)


--- Model Evaluation ---
Accuracy on the processed test data: 0.7836


**2. Building a Pipeline (The "Amaze Factor" Way):**
- Using the same titanic data and the same X and y as Exercise 1.
- This time, do not manually fill the missing age values. We will handle this in the pipeline.
- Perform a train-test split on the raw data.
- Create a pipeline that performs the following steps in order:
1. A ColumnTransformer that:
    - Applies a SimpleImputer(strategy='median') and a StandardScaler to the numerical columns (age, fare). (Hint: you'll need to create a numeric_pipeline using Pipeline for this).
    - Applies a OneHotEncoder to the categorical columns (pclass, sex).
2. A final LogisticRegression classifier.
- Train this single pipeline object on your raw training data.
- Evaluate the pipeline's accuracy on the raw test data.
- In a Markdown cell, compare the simplicity and robustness of the pipeline approach (Exercise 2) to the manual approach (Exercise 1).

In [24]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

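# Re-create X from the raw data so the missing age values are still present
# (Exercise 1 filled them in place); the pipeline's imputer handles them instead
X = titanic_df[features].copy()

# Perform a train-test split (test_size=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)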

print(f"\nRaw training set shape: {X_train.shape}")
print(f"Raw testing set shape: {X_test.shape}\n")

numerical_features= ['age', 'fare']
categorical_features = ['pclass', 'sex']

numeric_transformer= Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer= Pipeline(steps=[
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor=ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

model_pipeline= Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42))
])

model_pipeline.fit(X_train, y_train)
y_pred= model_pipeline.predict(X_test)

accuracy=accuracy_score(y_test, y_pred)
print(f"Accuracy of the pipeline on the test data: {accuracy:.4f}")


Raw training set shape: (623, 4)
Raw testing set shape: (268, 4)

Accuracy of the pipeline on the test data: 0.7873


- **Simplicity:** The manual approach is messy with many separate steps. A pipeline organizes everything into one clean, readable object.

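- **Safety:** It's easy to make mistakes and "leak" test data during manual preprocessing, which leads to misleadingly high accuracy scores. A pipeline that is fitted only on the training data learns every preprocessing statistic from that data alone, which removes this class of mistake and gives a more honest evaluation.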

- **Robustness:** A pipeline is a single, reusable object that's easy to save, modify, and use in a real application. The manual process is brittle and hard to manage.

In short, the pipeline approach is the professional standard because it's cleaner, safer, and more reliable.
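
The safety point also carries over to cross-validation. In the sketch below, `cross_val_score` refits the whole pipeline on each training fold, so the imputer, scaler, and encoder never see that fold's validation rows. The DataFrame is synthetic: it borrows Titanic-like column names, but its values are randomly generated for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (not the real Titanic dataset)
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "age": rng.normal(30, 12, n),
    "fare": rng.exponential(30, n),
    "sex": rng.choice(["male", "female"], n),
})
df.loc[rng.choice(n, 30, replace=False), "age"] = np.nan  # inject missing ages
y = ((df["sex"] == "female") | (df["fare"] > 60)).astype(int)

numeric = Pipeline([("imputer", SimpleImputer(strategy="median")),
                    ("scaler", StandardScaler())])
preprocessor = ColumnTransformer([
    ("num", numeric, ["age", "fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
])
pipe = Pipeline([("preprocessor", preprocessor),
                 ("classifier", LogisticRegression())])

# Each of the 5 folds fits the preprocessing steps on its own training portion only
scores = cross_val_score(pipe, df, y, cv=5)
print(scores.mean())
```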