# Lab 13 - Pipelines and Imbalanced Data
- **Author:** Satej Soman, Suraj R. Nair
- **Date:** April 16, 2025
- **Course:** INFO 251: Applied machine learning

## Learning Goals:

- Build pipelines using sklearn
- Implement methods to handle imbalanced data
- Get to know the [Imbalanced-learn package](https://imbalanced-learn.org/stable/index.html)


Today, we'll be working with an extract of US Census data (1994). Our goal is to predict whether or not an individual makes over 50K USD (Kohavi, 1996).

[Dataset documentation](https://www.openml.org/search?type=data&sort=runs&id=179&status=active)

Reference: adapted from [this imblearn tutorial](https://imbalanced-learn.org/stable/auto_examples/applications/plot_impact_imbalanced_classes.html)

# 1. Load data

In [5]:
from sklearn.datasets import fetch_openml
import pandas as pd

df, y = fetch_openml("adult", version=2, as_frame=True, return_X_y=True)
df = df.drop(columns=["fnlwgt", "education-num"])

In [6]:
classes_count = y.value_counts()
classes_count

class
<=50K    37155
>50K     11687
Name: count, dtype: int64

In [8]:
from imblearn.datasets import make_imbalance

Let's make the classes even more imbalanced.

In [9]:
ratio = 10 ## Feel free to tinker with this
df_res, y_res = make_imbalance(
    df,
    y,
    sampling_strategy={classes_count.idxmin(): classes_count.max() // ratio},
)
y_res_clean = (y_res == ">50K")*1
y_res_clean.value_counts()

class
0    37155
1     3715
Name: count, dtype: int64

## A. Dummy Classifier

The [DummyClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html) in sklearn makes predictions that ignore the input features. It serves as a useful (and naive) baseline.

In [10]:
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

dummy_clf = DummyClassifier(strategy="most_frequent") ## Predicts the most frequent class
scoring = ["accuracy"]
dummy_cv_result = cross_validate(dummy_clf, df_res, y_res, scoring=scoring)
print(f"Dummy Accuracy: {dummy_cv_result['test_accuracy'].mean():.3f}")

Dummy Accuracy: 0.909


Let's create some variables to store the results from this and all following experiments.

Before we continue, intuition check -- what are the precision and recall for this dummy classifier?

In [11]:
index = ["Dummy Classifier"]
scores = {'Accuracy':[dummy_cv_result['test_accuracy'].mean()],
          'Precision':[ratio/(ratio + 1)], ## FILL IN PRECISION
          'Recall':[1]} ## FILL IN RECALL
pd.DataFrame(scores, index)

| | Accuracy | Precision | Recall |
|---|---|---|---|
| Dummy Classifier | 0.909102 | 0.909091 | 1 |


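These statistics are artificially high: they reflect the class imbalance rather than any real predictive skill. The precision and recall above treat the majority class (`<=50K`) as positive, whereas the rows added in later sections treat `>50K` as positive.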
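To make the intuition check concrete, the sketch below derives precision and recall by hand from confusion-matrix counts, using the class counts printed earlier (37155 and 3715). The helper name `precision_recall` is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here for illustration.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision and recall from confusion-matrix counts.

    Returns 0.0 for a ratio whose denominator is zero (a convention chosen here).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


n_majority, n_minority = 37155, 3715  # class counts after make_imbalance

# The dummy classifier always predicts the majority class (<=50K).
# Case 1: treat the majority class as "positive".
p_maj, r_maj = precision_recall(tp=n_majority, fp=n_minority, fn=0)

# Case 2: treat the minority class (>50K) as "positive".
# No positives are ever predicted, so there are no true or false positives.
p_min, r_min = precision_recall(tp=0, fp=0, fn=n_minority)

print(f"majority as positive: precision={p_maj:.3f}, recall={r_maj:.3f}")
print(f"minority as positive: precision={p_min:.3f}, recall={r_min:.3f}")
```

Which class counts as "positive" changes the answer completely, so it is worth fixing that choice before comparing models.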

## B. Logistic Regression

Let's start with a linear model. Since we're going to repeat a lot of our pre-processing steps, we can build a pipeline to simplify things.

The sklearn [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) makes it easy to chain / link together several steps which can be cross-validated together while setting different parameters.

In [12]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Specify how to handle numeric variables
num_pipe = make_pipeline(
    StandardScaler(),
    SimpleImputer(strategy="mean", add_indicator=True)
)

#Specify how to handle categorical variables
cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore"),
)


from sklearn.compose import make_column_selector as selector
from sklearn.compose import make_column_transformer

# Send numeric columns to the numeric pipeline, and categorical columns to the categorical pipeline
preprocessor_linear = make_column_transformer(
    (num_pipe, selector(dtype_include="number")),
    (cat_pipe, selector(dtype_include="category"))
)


from sklearn.linear_model import LogisticRegression

## Add in details of your model
lr_clf = make_pipeline(preprocessor_linear, LogisticRegression(max_iter=1000))

## 5 fold cross-validation
lr_cv_result = cross_validate(lr_clf,
                              df_res,
                              y_res_clean,
                              scoring=["accuracy", "precision", "recall"])

In [13]:
## Store results in the dictionary from above
index += ["Logistic regression (LR)"]

scores["Accuracy"].append(lr_cv_result["test_accuracy"].mean())
scores["Precision"].append(lr_cv_result["test_precision"].mean())
scores["Recall"].append(lr_cv_result["test_recall"].mean())

pd.DataFrame(scores, index)

| | Accuracy | Precision | Recall |
|---|---|---|---|
| Dummy Classifier | 0.909102 | 0.909091 | 1.0 |
| Logistic regression (LR) | 0.926866 | 0.727426 | 0.313594 |


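Discussion: what does the logistic (S-shaped) curve look like here? Is a vanilla classifier useful? Accuracy barely improves on the dummy baseline, and recall is only about 0.31, so most of the minority class is missed.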


## C. Logistic Regression with class weights

Many classifiers in `scikit-learn` have a `class_weight` parameter, which influences the computation of the loss/criterion by applying different penalties to misclassified examples from the minority and majority classes.

`class_weight="balanced"`: the weight applied to each class is inversely proportional to its frequency, computed as `n_samples / (n_classes * np.bincount(y))`.
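As a rough illustration of what "balanced" means, the sketch below applies the formula from the scikit-learn documentation, `n_samples / (n_classes * class_count)`, to the class counts shown earlier. The function name `balanced_weights` is hypothetical and the code is a plain-Python stand-in, not scikit-learn's own implementation.

```python
from collections import Counter


def balanced_weights(labels):
    """Weight for each class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Same class counts as y_res_clean above: 37155 negatives, 3715 positives.
labels = [0] * 37155 + [1] * 3715
weights = balanced_weights(labels)
print({cls: round(w, 3) for cls, w in weights.items()})
```

With these weights, each class contributes the same total weight to the loss, so a misclassified minority example costs roughly ten times as much as a misclassified majority example.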

In [14]:
## Update the pipeline parameters
lr_clf.set_params(logisticregression__class_weight="balanced")

#Cross-validation
lr_cv_result_w = cross_validate(lr_clf, df_res, y_res_clean, scoring=["accuracy", "precision", "recall"])


# Save the scores
index += ["LR + class weights"]
scores["Accuracy"].append(lr_cv_result_w["test_accuracy"].mean())
scores["Precision"].append(lr_cv_result_w["test_precision"].mean())
scores["Recall"].append(lr_cv_result_w["test_recall"].mean())

df_scores = pd.DataFrame(scores, index=index)
df_scores

| | Accuracy | Precision | Recall |
|---|---|---|---|
| Dummy Classifier | 0.909102 | 0.909091 | 1.0 |
| Logistic regression (LR) | 0.926866 | 0.727426 | 0.313594 |
| LR + class weights | 0.798116 | 0.289704 | 0.840646 |


## D. Custom Loss Functions

In satellite imagery, for example, many of the potential inputs do not have a useful label, but we may end up passing them to the model during training because of our sampling mechanism. We can create a custom loss function that ignores those labels.


![satellite imagery with labeled classes](satellite_imagery.png)

In [15]:
# example loss function:

import torch

def mse(y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
    loss = (y_pred - y_true)**2 
    return loss.mean()

def mse_custom(y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
    # Keep only examples with a valid label (0 or 1); ignore everything else (e.g. -1)
    idx = (y_true == 0) | (y_true == 1)
    return mse(y_pred[idx], y_true[idx])

y_true = torch.Tensor([1, 0, 0, -1, 0, -1, 0])
y_pred = torch.Tensor([1, 0, 1,  1, 0,  1, 0])

original_loss = mse(y_pred, y_true)
class_sensitive_loss = mse_custom(y_pred, y_true)
print("basic mse:", original_loss, type(original_loss))
print("custom mse:", class_sensitive_loss, type(class_sensitive_loss))



basic mse: tensor(1.2857) <class 'torch.Tensor'>
custom mse: tensor(0.2000) <class 'torch.Tensor'>


You can build your own loss functions. Custom loss functions are much easier to use in PyTorch, Keras, etc. than to implement in scikit-learn.

We can use this custom function in a typical `PyTorch` training loop. 

```python
    # Zero your gradients for every batch!
    optimizer.zero_grad()

    # Make predictions for this batch
    outputs = model(inputs)

    # Compute the loss and its gradients
    loss = mse_custom(outputs, labels)
    loss.backward()

    # Adjust learning weights
    optimizer.step()
```

Any instance of `torch.Tensor` has a `backward()` method (as does any implementation of `torch.autograd.Function`).

References: [1](https://discuss.pytorch.org/t/custom-loss-functions/29387), [2](https://discuss.pytorch.org/t/from-where-does-the-backward-method-come-in-custom-loss-functions/89416), [3](https://pytorch.org/tutorials/beginner/examples_autograd/two_layer_net_custom_function.html)

### _How is this related to the class weights we just discussed?_

This is another form of class weighting: ignoring a label is equivalent to giving it a weight of zero in the loss.
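One way to see the connection is to write the loss as a weighted mean: masking a label is then the same as assigning it a weight of zero, while class weighting assigns non-zero weights that differ by class. The sketch below is framework-free plain Python. The function name `weighted_mse` and the class weights `{0: 1.0, 1: 4.0}` are hypothetical values chosen only for illustration.

```python
def weighted_mse(y_pred, y_true, weights):
    """Weighted mean squared error, normalised by the sum of the weights."""
    total = sum(w * (p - t) ** 2 for p, t, w in zip(y_pred, y_true, weights))
    return total / sum(weights)


# Same toy data as the torch example above; -1 marks an unlabeled example.
y_true = [1, 0, 0, -1, 0, -1, 0]
y_pred = [1, 0, 1, 1, 0, 1, 0]

# Weight 0 for unlabeled examples, weight 1 otherwise: equivalent to masking.
mask_weights = [1.0 if t in (0, 1) else 0.0 for t in y_true]
masked = weighted_mse(y_pred, y_true, mask_weights)

# Hypothetical class weights: still ignore -1, but up-weight class 1.
class_weight = {0: 1.0, 1: 4.0, -1: 0.0}
per_example = [class_weight[t] for t in y_true]
weighted = weighted_mse(y_pred, y_true, per_example)

print("masked mse:", masked)
print("class-weighted mse:", weighted)
```

The masked value matches the `custom mse` of 0.2 printed above, since only the five labeled examples contribute.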

## E. Resampling (Under/ Over)

`imbalanced-learn` provides some samplers to handle resampling. Here, we'll examine 1) Undersampling, and 2) SMOTE.
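Before using the library samplers, it may help to see the two ideas in miniature. The sketch below is a plain-Python illustration, not the `imbalanced-learn` implementation: `random_undersample` and `interpolate` are hypothetical helpers, the toy data is made up, and the pair of minority points is picked by hand, whereas SMOTE itself interpolates between a minority sample and one of its nearest minority-class neighbours.

```python
import random


def random_undersample(X, y, seed=0):
    """Keep all minority rows and an equally sized random subset of majority rows.

    Assumes binary labels with 0 as the majority class and 1 as the minority.
    """
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [X[i] for i in kept], [y[i] for i in kept]


def interpolate(a, b, rng):
    """SMOTE-style synthetic point: a random point on the segment from a to b."""
    lam = rng.random()  # uniform in [0, 1)
    return [ai + lam * (bi - ai) for ai, bi in zip(a, b)]


# Toy data: 9 majority rows followed by 3 minority rows.
X = [[float(i), float(i % 3)] for i in range(12)]
y = [0] * 9 + [1] * 3

# Undersampling: discard majority rows until the classes are the same size.
X_under, y_under = random_undersample(X, y)
print("after undersampling:", y_under.count(0), "majority,", y_under.count(1), "minority")

# Oversampling: create a new minority point between two existing minority points.
rng = random.Random(0)
synthetic = interpolate(X[9], X[10], rng)
print("synthetic minority point:", synthetic)
```

Undersampling throws information away, while SMOTE-style oversampling adds points that were never observed. Both change only the training data the classifier sees.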

In [16]:
from imblearn.pipeline import make_pipeline as make_pipeline_with_sampler
from imblearn.under_sampling import RandomUnderSampler

##### UNDER SAMPLING
## Add the undersampling step to our pipeline
lr_clf = make_pipeline_with_sampler(
    preprocessor_linear,
    RandomUnderSampler(random_state=23),
    LogisticRegression(max_iter=1000),
)

## Cross-validation
lr_cv_result_s = cross_validate(lr_clf, df_res, y_res_clean, scoring = ['accuracy', 'precision', 'recall'])


## Save scores
index += ["LR + Under sampling"]
scores["Accuracy"].append(lr_cv_result_s["test_accuracy"].mean())
scores["Precision"].append(lr_cv_result_s["test_precision"].mean())
scores["Recall"].append(lr_cv_result_s["test_recall"].mean())

df_scores = pd.DataFrame(scores, index=index)
df_scores

| | Accuracy | Precision | Recall |
|---|---|---|---|
| Dummy Classifier | 0.909102 | 0.909091 | 1.0 |
| Logistic regression (LR) | 0.926866 | 0.727426 | 0.313594 |
| LR + class weights | 0.798116 | 0.289704 | 0.840646 |
| LR + Under sampling | 0.794593 | 0.286344 | 0.843338 |


In [17]:
##### SMOTE


from imblearn.over_sampling import SMOTE


### Add SMOTE to the pipeline
lr_clf = make_pipeline_with_sampler(
    preprocessor_linear,
    SMOTE(random_state=42),
    LogisticRegression(max_iter=1000),
)

### Cross-validate
lr_cv_result_smote = cross_validate(lr_clf, df_res, y_res_clean, scoring = ['accuracy', 'precision', 'recall'])


### Store scores
index += ["LR + SMOTE"]
scores["Accuracy"].append(lr_cv_result_smote["test_accuracy"].mean())
scores["Precision"].append(lr_cv_result_smote["test_precision"].mean())
scores["Recall"].append(lr_cv_result_smote["test_recall"].mean())

df_scores = pd.DataFrame(scores, index=index)
df_scores

| | Accuracy | Precision | Recall |
|---|---|---|---|
| Dummy Classifier | 0.909102 | 0.909091 | 1.0 |
| Logistic regression (LR) | 0.926866 | 0.727426 | 0.313594 |
| LR + class weights | 0.798116 | 0.289704 | 0.840646 |
| LR + Under sampling | 0.794593 | 0.286344 | 0.843338 |
| LR + SMOTE | 0.800685 | 0.2911 | 0.830686 |


# YOUR TURN

Your task is to replicate the workflow above, focusing on a random-forest classifier.

Steps:

1. Start by building the pre-processing pipeline
2. Build and evaluate:
   * Baseline random-forest classifier
   * with class weights = 'balanced'
   * with under-sampling
   * with over-sampling
3. Assess and compare performance across all models

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OrdinalEncoder

num_pipe = SimpleImputer(strategy="mean", add_indicator=True)
cat_pipe = make_pipeline() ##### COMPLETE THIS STEP

preprocessor_RF = make_column_transformer(

) ### FILL OUT THE make column transformer function

rf_pipeline = make_pipeline(
    preprocessor_RF, RandomForestClassifier(random_state=23)
)

RF_SCORING = ["accuracy", "precision", "recall"]

### A. Random Forest Baseline

In [None]:


cv_result = cross_validate() ### TO COMPLETE



In [None]:
## Evaluate Performance
index = ["Random forest (RF)"]
scores = {'Accuracy':[],'Precision':[], 'Recall':[]}
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Precision"].append(cv_result["test_precision"].mean())
scores["Recall"].append(cv_result["test_recall"].mean())

df_scores = pd.DataFrame(scores, index=index)
df_scores

### B. Balanced Class Weights

In [None]:
### RANDOM FOREST with class weights

rf_pipeline.set_params() #### COMPLETE THIS STEP

cv_result = cross_validate() #### COMPLETE THIS STEP



In [None]:
#### Evaluate Performance
index += ["RF (Class Weights)"]
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Precision"].append(cv_result["test_precision"].mean())
scores["Recall"].append(cv_result["test_recall"].mean())

df_scores = pd.DataFrame(scores, index=index)
df_scores

### C. Under Sampling

In [None]:
#### COMPLETE THIS STEP
rf_pipeline = make_pipeline_with_sampler(

)

In [None]:
index += ["RF + Under sampling"]
cv_result = cross_validate(rf_pipeline, df_res, y_res_clean, scoring=RF_SCORING)
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Precision"].append(cv_result["test_precision"].mean())
scores["Recall"].append(cv_result["test_recall"].mean())


df_scores = pd.DataFrame(scores, index=index)
df_scores

### D. Over Sampling (SMOTE)

In [None]:
#### COMPLETE THIS
rf_pipeline = make_pipeline_with_sampler(

)

In [None]:
index += ["RF + SMOTE"]
cv_result = cross_validate(rf_pipeline, df_res, y_res_clean, scoring=RF_SCORING)
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Precision"].append(cv_result["test_precision"].mean())
scores["Recall"].append(cv_result["test_recall"].mean())


df_scores = pd.DataFrame(scores, index=index)
df_scores