# Integrating PCA in Pipelines - Lab

## Introduction

In a previous section, you learned how to use pipelines in scikit-learn to chain preprocessing steps and supervised learning algorithms into a single manageable object. In this lab, you will integrate PCA along with classifiers in the pipeline. 

## Objectives

In this lab you will: 

- Integrate PCA in scikit-learn pipelines 

## The Data Science Workflow

You will be following the data science workflow:

1. Initial data inspection, exploratory data analysis, and cleaning
2. Feature engineering and selection
3. Create a baseline model
4. Create a machine learning pipeline and compare results with the baseline model
5. Interpret the model and draw conclusions

## Initial data inspection, exploratory data analysis, and cleaning

You'll use a dataset created by the Otto group, which was also used in a [Kaggle competition](https://www.kaggle.com/c/otto-group-product-classification-challenge/data). The description of the dataset is as follows:

The Otto Group is one of the world’s biggest e-commerce companies, with subsidiaries in more than 20 countries, including Crate & Barrel (USA), Otto.de (Germany) and 3 Suisses (France). They are selling millions of products worldwide every day, with several thousand products being added to their product line.

A consistent analysis of the performance of their products is crucial. However, due to their global infrastructure, many identical products get classified differently. Therefore, the quality of product analysis depends heavily on the ability to accurately cluster similar products. The better the classification, the more insights the Otto Group can generate about their product range.

In this lab, you'll use a dataset containing:
- A column `id`, which is an anonymous id unique to a product
- 93 columns `feat_1`, `feat_2`, ..., `feat_93`, which are the various features of a product
- A column `target`, which is the class of a product



The dataset is stored in the `'otto_group.csv'` file. Import this file into a DataFrame called `data`, and then: 

- Check for missing values 
- Check the distribution of columns 
- ... and any other things that come to your mind to explore the data 

In [None]:
# Your code here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import io

In [None]:
# 1. Load the dataset
data = pd.read_csv("otto_group.csv")

In [None]:
# 2. Column types and non-null counts
# data.info() prints its report directly, so no buffer or print call is needed
data.info()

In [None]:
# 3. Missing values
data.isna().sum()

In [None]:
# 4. Statistical summary
data.describe()

In [None]:
# 5. Target distribution
# Only the last expression of a cell is displayed, so print the counts explicitly
print(data["target"].value_counts())

plt.figure(figsize=(10,5))
sns.countplot(data=data, x="target")
plt.title("Target Class Distribution")
plt.xticks(rotation=45)
plt.show()

In [None]:
# 6. Correlation heatmap (sampled)
# Define the feature columns here so this cell does not depend on a later one
feature_cols = [col for col in data.columns if col.startswith("feat_")]
sample = data[feature_cols].sample(2000, random_state=42)
plt.figure(figsize=(10,8))
sns.heatmap(sample.corr(), cmap="coolwarm")
plt.title("Correlation Heatmap (Sampled)")
plt.show()

If you look at histograms of all the features (plotted in the next cell), you can tell that the data are zero-inflated: most variables contain mostly zeros, with some higher values here and there. The features are not normally distributed, but for most machine learning techniques this is not an issue. 

In [None]:
# 7. Feature histograms
feature_cols = [col for col in data.columns if col.startswith("feat_")]
data[feature_cols].hist(figsize=(20,20), bins=20)
plt.suptitle("Feature Distributions", y=1.02)
plt.show()

Because there are so many zeros, most values above zero will seem to be outliers. The safe decision for this data is to not delete any outliers and see what happens. The data are sparse, so the few high values may be highly informative. Moreover, without an intuitive meaning for each of the features, we don't know if a value of ~260 is actually an outlier.

In [None]:
# Since the features are zero-inflated and we lack domain knowledge,
# we will NOT remove any outliers.

# This cell intentionally performs no outlier filtering.
# We simply continue with the full dataset.

data_no_outlier_removal = data.copy()

data_no_outlier_removal.head()
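
The zero-inflation discussed above can also be quantified rather than read off histograms. The following is a minimal sketch on synthetic Poisson counts; the `toy` DataFrame is a hypothetical stand-in, not the Otto data. On the real dataset, the same two lines would be applied to `data[feature_cols]`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Otto features (an assumption for illustration):
# Poisson counts with a small mean are mostly zero.
rng = np.random.default_rng(42)
toy = pd.DataFrame(
    rng.poisson(lam=0.3, size=(1000, 5)),
    columns=[f"feat_{i}" for i in range(1, 6)],
)

# Fraction of entries equal to zero, per feature
zero_fraction = (toy == 0).mean()

# Put the zero fraction next to the largest observed value of each feature
summary = pd.DataFrame({"zero_fraction": zero_fraction, "max": toy.max()})
summary
```

A high zero fraction combined with a large maximum is the pattern that makes nonzero values look like outliers even when they carry signal.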

## Feature engineering and selection with PCA

Have a look at the correlation structure of your features using a [heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html).

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Select feature columns only
feature_cols = [col for col in data.columns if col.startswith("feat_")]
X = data[feature_cols]

# Use a sample to avoid a huge heatmap
corr_sample = X.sample(2000, random_state=42)

plt.figure(figsize=(12,10))
sns.heatmap(corr_sample.corr(), cmap="coolwarm")
plt.title("Correlation Heatmap of Features (Sampled)")
plt.show()

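Use PCA to reduce the features to the smallest number of principal components that still retains 80% of the explained variance. Note that PCA does not pick a subset of the original columns; each component is a linear combination of all of them.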

In [None]:
# Standardize the features, then fit PCA so that 80% of the variance is retained
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# PCA keeping 80% variance
pca = PCA(n_components=0.8, random_state=42)
X_pca = pca.fit_transform(X_scaled)

# Display number of components chosen
pca.n_components_

In [None]:
plt.figure(figsize=(8,5))
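# Plot against 1-based component counts so the x-axis matches its label
cumulative_variance = pca.explained_variance_ratio_.cumsum()
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker="o")
plt.axhline(0.80, color="red", linestyle="--")
plt.title("Cumulative Explained Variance")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")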
plt.show()
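
To see how a float `n_components` maps to a component count, it can help to fit PCA with all components and find the threshold by hand. This is a minimal sketch on synthetic data from `make_classification` (an assumption for illustration, not the Otto dataset); names such as `X_toy` and `pca_full` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration, not the Otto dataset)
X_toy, _ = make_classification(
    n_samples=500, n_features=20, n_informative=8, n_redundant=4, random_state=42
)
X_toy_scaled = StandardScaler().fit_transform(X_toy)

# Fit PCA with all components to get the full variance spectrum
pca_full = PCA().fit(X_toy_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance exceeds 0.8
n_manual = int(np.searchsorted(cumulative, 0.8, side="right")) + 1

# PCA with a float n_components should arrive at the same count
pca_80 = PCA(n_components=0.8).fit(X_toy_scaled)
n_manual, pca_80.n_components_
```

Fitting the full spectrum once also shows how much the component count would change for other thresholds, such as 0.9 or 0.95, without refitting.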

## Create a train-test split with a test size of 40%

This is a relatively big dataset, so you can assign 40% to the test set. Set the `random_state` to 42. 

In [None]:
from sklearn.model_selection import train_test_split

# Features and target
feature_cols = [col for col in data.columns if col.startswith("feat_")]
X = data[feature_cols]
y = data["target"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.40,
    random_state=42,
    stratify=y  # helps maintain class balance
)

X_train.shape, X_test.shape
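
The effect of `stratify=y` can be checked by comparing class proportions across the splits. The sketch below uses hypothetical imbalanced labels (`y_toy`) rather than the Otto targets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70% "a", 20% "b", 10% "c"
y_toy = pd.Series(["a"] * 700 + ["b"] * 200 + ["c"] * 100)
X_toy = pd.DataFrame({"x": range(len(y_toy))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.40, random_state=42, stratify=y_toy
)

# Class proportions in each split should closely match the full data
proportions = pd.DataFrame({
    "full": y_toy.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
})
proportions
```

Without `stratify`, the proportions would still be similar on average, but rare classes could end up under-represented in one of the splits.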


## Create a baseline model

Create your baseline model *in a pipeline setting*. In the pipeline: 

- Your first step will be to standardize your features and use PCA to reduce them to the number of components that retains 80% of the explained variance (as you saw before)
- Your second step will be to build a basic logistic regression model 

Make sure to fit the model using the training set and test the result by obtaining the accuracy using the test set. Set the `random_state` to 123. 

In [None]:
# Your code here
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
# 1. Build the pipeline
baseline_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.8, random_state=123)),
    ('logreg', LogisticRegression(max_iter=1000, random_state=123))
])

In [None]:
# 2. Fit the model on the training set
baseline_pipeline.fit(X_train, y_train)

In [None]:
# 3. Predict on the test set
y_pred_baseline = baseline_pipeline.predict(X_test)

In [None]:
# 4. Compute accuracy
accuracy_score(y_test, y_pred_baseline)
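
Once a pipeline is fitted, its individual steps can be inspected through `named_steps`, for example to check how many components PCA kept. The following is a minimal sketch on synthetic data (an assumption for illustration); `toy_pipeline` and the other names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data (an assumption, not the Otto dataset)
X_toy, y_toy = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.40, random_state=42, stratify=y_toy
)

toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.8)),
    ("logreg", LogisticRegression(max_iter=1000, random_state=123)),
])
toy_pipeline.fit(X_tr, y_tr)

# Number of components PCA kept, learned from the training split only
n_kept = toy_pipeline.named_steps["pca"].n_components_

# score() runs the test data through every step, then reports accuracy
test_accuracy = toy_pipeline.score(X_te, y_te)
n_kept, test_accuracy
```

Because the scaler and PCA are fitted inside the pipeline, their parameters come from the training data alone, which avoids leaking test-set information into the preprocessing.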

## Create a pipeline consisting of a linear SVM, a simple decision tree, and a simple random forest classifier

Repeat the above, but now create three different pipelines:
- One for a standard linear SVM
- One for a default decision tree
- One for a random forest classifier

In [None]:
# Your code here
# ⏰ This cell may take several minutes to run
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [None]:
svm_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.8, random_state=123)),
    ('svm', LinearSVC(random_state=123, max_iter=5000))
])

svm_pipeline.fit(X_train, y_train)
y_pred_svm = svm_pipeline.predict(X_test)
accuracy_score(y_test, y_pred_svm)

In [None]:
tree_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.8, random_state=123)),
    ('tree', DecisionTreeClassifier(random_state=123))
])

tree_pipeline.fit(X_train, y_train)
y_pred_tree = tree_pipeline.predict(X_test)
accuracy_score(y_test, y_pred_tree)

In [None]:
forest_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.8, random_state=123)),
    ('forest', RandomForestClassifier(random_state=123))
])

forest_pipeline.fit(X_train, y_train)
y_pred_forest = forest_pipeline.predict(X_test)
accuracy_score(y_test, y_pred_forest)
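
Since the three pipelines differ only in their final step, they could also be built in a loop. This is one possible sketch, run on synthetic data (an assumption for illustration) with a smaller forest to keep it fast; the `classifiers` and `scores` dictionaries are hypothetical helpers.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data (an assumption, not the Otto dataset)
X_toy, y_toy = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.40, random_state=42, stratify=y_toy
)

classifiers = {
    "svm": LinearSVC(random_state=123, max_iter=5000),
    "tree": DecisionTreeClassifier(random_state=123),
    "forest": RandomForestClassifier(n_estimators=50, random_state=123),
}

scores = {}
for name, clf in classifiers.items():
    # Same preprocessing for every model; only the last step changes
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=0.8)),
        (name, clf),
    ])
    pipe.fit(X_tr, y_tr)
    scores[name] = pipe.score(X_te, y_te)  # accuracy on the held-out split

scores
```

Collecting the accuracies in one dictionary makes it easier to compare the models side by side with the baseline.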

## Pipeline with grid search

Construct two pipelines with grid search:
- one for random forests - try to have around 40 different models
- one for the AdaBoost algorithm 

### Random Forest pipeline with grid search

In [None]:
# Your code here 
# imports

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
import pandas as pd

In [None]:
# Your code here
# ⏰ This cell may take a long time to run!
rf_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.8, random_state=123)),
    ('rf', RandomForestClassifier(random_state=123))
])


In [None]:
param_grid_rf = {
    'rf__n_estimators': [100, 300],          # 2
    'rf__max_depth': [None, 10],             # 2
    'rf__min_samples_split': [2, 5],         # 2
    'rf__min_samples_leaf': [1, 2],          # 2
    'rf__max_features': ['sqrt', 'log2']     # 2
}
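
Before launching a grid search, it can be useful to count how many candidate models the grid defines. A minimal sketch using scikit-learn's `ParameterGrid` on the same grid as above (the variable names `n_candidates` and `n_fits` are hypothetical):

```python
from sklearn.model_selection import ParameterGrid

param_grid_rf = {
    "rf__n_estimators": [100, 300],
    "rf__max_depth": [None, 10],
    "rf__min_samples_split": [2, 5],
    "rf__min_samples_leaf": [1, 2],
    "rf__max_features": ["sqrt", "log2"],
}

# Number of distinct parameter combinations: 2 * 2 * 2 * 2 * 2
n_candidates = len(ParameterGrid(param_grid_rf))

# With 5-fold cross-validation, each candidate is fitted 5 times
n_fits = n_candidates * 5
n_candidates, n_fits  # (32, 160)
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the full training set.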

Use your grid search object along with `.cv_results_` to get the full result overview:

In [None]:
# Your code here 
# ⏰ This cell may take a long time to run!

rf_grid = GridSearchCV(
    estimator=rf_pipeline,
    param_grid=param_grid_rf,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

rf_grid.fit(X_train, y_train)

In [None]:
rf_results = pd.DataFrame(rf_grid.cv_results_)
rf_results
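
The full `cv_results_` table has many columns, so it is often easier to read after selecting a few and sorting by rank. The sketch below runs a small grid search on synthetic data (an assumption for illustration) with a decision tree instead of the random forest pipeline, purely to keep it quick; the same column selection should apply to `rf_results`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a tiny grid (assumptions, chosen to run in seconds)
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=42)
toy_grid = GridSearchCV(
    DecisionTreeClassifier(random_state=123),
    param_grid={"max_depth": [2, 4, None], "min_samples_leaf": [1, 5]},
    cv=3,
    scoring="accuracy",
)
toy_grid.fit(X_toy, y_toy)

results = pd.DataFrame(toy_grid.cv_results_)

# Keep the most informative columns and list the best candidates first
columns = ["params", "mean_test_score", "std_test_score",
           "rank_test_score", "mean_fit_time"]
top = results[columns].sort_values("rank_test_score").head()
top
```

The best row should agree with `toy_grid.best_params_` and `toy_grid.best_score_`, which give the same information without building the table.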

### AdaBoost

In [None]:
# Your code here
# ⏰ This cell may take several minutes to run
from sklearn.ensemble import AdaBoostClassifier

# Create pipeline: scaler -> PCA -> AdaBoost
ada_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.8, random_state=123)),
    ('ada', AdaBoostClassifier(random_state=123))
])

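# Define grid: 3 * 3 * 2 = 18 parameter combinations
# Note: on newer scikit-learn versions 'SAMME.R' is deprecated or no longer
# accepted; if it raises an error, keep only 'SAMME' in the list below
param_grid_ada = {
    'ada__n_estimators': [50, 100, 200],
    'ada__learning_rate': [0.5, 1.0, 1.5],
    'ada__algorithm': ['SAMME', 'SAMME.R']
}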

# Grid search
ada_grid = GridSearchCV(
    estimator=ada_pipeline,
    param_grid=param_grid_ada,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

# Fit model (may take several minutes)
ada_grid.fit(X_train, y_train)


Use your grid search object along with `.cv_results_` to get the full result overview: 

In [None]:
# Your code here 
# View full results
ada_results = pd.DataFrame(ada_grid.cv_results_)
ada_results

### Level-up (Optional): SVM pipeline with grid search 

As extra level-up work, construct a pipeline with grid search for support vector machines. 
* Make sure your grid isn't too big. You'll see it takes quite a while to fit SVMs with non-linear kernel functions!

In [None]:
# Your code here
# ⏰ This cell may take a very long time to run!
from sklearn.svm import SVC

# Pipeline: scaler -> PCA -> SVM
svm_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.8, random_state=123)),
    ('svc', SVC(random_state=123))
])

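# Small parameter grid for speed: 3 * 2 * 2 = 12 combinations
# Note: gamma is ignored by the linear kernel, so each linear setting
# is fitted once per gamma value with identical results
param_grid_svm = {
    'svc__C': [0.1, 1, 10],
    'svc__kernel': ['linear', 'rbf'],   # linear + RBF
    'svc__gamma': ['scale', 'auto']     # only used by the RBF kernel
}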

svm_grid = GridSearchCV(
    estimator=svm_pipeline,
    param_grid=param_grid_svm,
    cv=3,                # smaller CV for speed
    scoring='accuracy',
    n_jobs=-1
)

# Fit model (may take a very long time!)
svm_grid.fit(X_train, y_train)
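
One way to keep an SVM grid small is to pass `param_grid` as a list of dictionaries, so that `gamma` is only varied for the kernel that uses it. This is a sketch of that alternative, not the grid used in this solution; `single_grid` and `split_grid` are hypothetical names.

```python
from sklearn.model_selection import ParameterGrid

# One dictionary: every kernel is combined with every gamma value
single_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"],
    "svc__gamma": ["scale", "auto"],
}

# List of dictionaries: gamma is only searched for the RBF kernel
split_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10],
     "svc__gamma": ["scale", "auto"]},
]

n_single = len(ParameterGrid(single_grid))
n_split = len(ParameterGrid(split_grid))
n_single, n_split  # (12, 9)
```

`GridSearchCV` accepts the list form directly as `param_grid`, and the three fits it saves per fold are the redundant linear-kernel ones.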

Use your grid search object along with `.cv_results_` to get the full result overview: 

In [None]:
# Your code here 
# SVM full results
svm_results = pd.DataFrame(svm_grid.cv_results_)
svm_results

## Note

Note that this solution is only one of many options. The results in the Random Forest and AdaBoost models show that there is a lot of improvement possible by tuning the hyperparameters further, so make sure to explore this yourself!

## Summary 

Great! You've gotten a lot of practice in using PCA in pipelines. Which algorithm would you choose and why?