# Lab work 2: Logistic regression

This notebook builds on the fourth lecture of Foundations of Machine Learning. We'll focus on the logistic regression model and how to deal with class imbalance.

Important note: the steps shown here are not always the most efficient or the most "industry-approved." Their main purpose is pedagogical. So don't panic if something looks suboptimal—it's meant to be.

If you have questions (theoretical or practical), don't hesitate to bug your lecturer.

We will try to accurately predict whether a star observation is actually a [pulsar](https://en.wikipedia.org/wiki/Pulsar), using this [dataset](https://www.kaggle.com/datasets/colearninglounge/predicting-pulsar-starintermediate). Let's first load the dataset.

In [None]:
import pandas as pd

df = pd.read_csv("NB4 - Pulsars.csv")
df.head()

Since this is a classification problem, our first step is to check how imbalanced the classes are.

**Task**: Measure the class imbalance in the dataset by calculating how often the target takes the value of interest (in this case, 1).

That means any model with accuracy below 90% is actually worse than a trivial model that always predicts "not a pulsar."
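
To make the baseline argument concrete, here is a minimal sketch on made-up labels. The 10% positive rate is an assumption chosen for illustration, not the rate measured on the pulsar dataset.

```python
import numpy as np

# Synthetic labels (assumption for illustration): 10 positives, 90 negatives.
y_true = np.array([1] * 10 + [0] * 90)

# Trivial model: always predict the majority class (0, "not a pulsar").
y_pred = np.zeros_like(y_true)

# Accuracy: fraction of correct predictions.
accuracy = (y_pred == y_true).mean()
# Recall: fraction of actual positives that were predicted positive.
recall = (y_pred[y_true == 1] == 1).mean()

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall: {recall:.2f}")
```

The trivial model reaches a high accuracy while detecting no positive at all, which is why accuracy alone is misleading on imbalanced data.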

But the challenge doesn't stop at measuring performance—there's also a problem with how we split the data:

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop(columns=["target"])
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

imbalance_rate_train = 100 * y_train.mean()
print(f"Interest class rate: {imbalance_rate_train:.2f}% (train)")
imbalance_rate_test = 100 * y_test.mean()
print(f"Interest class rate: {imbalance_rate_test:.2f}% (test)")

The [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split) function splits the dataset randomly, so nothing guarantees that each split keeps the same class proportions as the full dataset.
To fix this, we can use the *stratify* parameter.

**Task** : Modify the code above to include the *stratify* parameter, then check the result and conclude.
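
As a hint, the sketch below shows the effect of *stratify* on synthetic data. The arrays `X_demo` and `y_demo` and the 10% positive rate are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption for illustration): 10% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 100 + [0] * 900)

# stratify=y_demo asks for the same class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

print(f"Positive rate: {100 * y_tr.mean():.2f}% (train)")
print(f"Positive rate: {100 * y_te.mean():.2f}% (test)")
```

With stratification, both rates should match the rate of the full dataset up to rounding.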

## Data preparation

As always, we start with the bread and butter of data science: data preparation.

**Task** : Inspect the dataset with the [`describe`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) method, since every column is numerical.

**Task** : The code below uses the [`scatter_matrix`](https://pandas.pydata.org/docs/reference/api/pandas.plotting.scatter_matrix.html) function. Run it and interpret its output for our problem.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns; sns.set_style(style="whitegrid")

X = df.drop(columns=["target"])
pd.plotting.scatter_matrix(X, figsize=(12, 6), alpha=0.3)
plt.show()

This figure gives us plenty of insights, but it doesn't show the actual distribution of the two classes.

**Task**: Using the function provided below, re-explore the relationships between the variables and the target.

In [None]:
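def scatter_classification(df, target_column, column_1, column_2, figsize=(12, 6), **kwargs):
    """Scatter plot of column_1 against column_2, with one color per class of target_column."""
    plt.figure(figsize=figsize)
    palette = sns.color_palette()
    for index, value in enumerate(df[target_column].unique()):
        # Select the rows of the current class through the target_column argument.
        subset = df.loc[df[target_column] == value]
        # Label with the class value itself, not with its position in unique().
        plt.scatter(subset[column_1], subset[column_2], color=palette[index], label=f"Class {value}", **kwargs)
    plt.title(f"{column_1} vs {column_2}")
    plt.xlabel(column_1)
    plt.ylabel(column_2)
    plt.legend()
    plt.show()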

We clearly see that Skewness = Excess_kurtosis ** 2, so the two variables carry redundant information. Therefore, we decide to remove the Excess_kurtosis variable from our work.
Also, given the names of the columns *Mean* and *Std*, we decide to compute the *Z-Score* variable, defined as the ratio between the two.

**Task** : Implement the changes described above.
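
A possible shape for these changes, sketched on a toy DataFrame. The values are made up, the column names are taken from the text above and may differ in the actual file, and taking the ratio as *Mean* divided by *Std* is an assumption.

```python
import pandas as pd

# Toy data with the column names mentioned above; the values are made up.
toy = pd.DataFrame({
    "Mean": [100.0, 120.0, 80.0],
    "Std": [50.0, 40.0, 20.0],
    "Excess_kurtosis": [0.5, 1.0, 2.0],
    "Skewness": [0.25, 1.0, 4.0],
})

# Assumption: the ratio is taken as Mean divided by Std.
toy["Z-Score"] = toy["Mean"] / toy["Std"]

# Drop the redundant column.
toy = toy.drop(columns=["Excess_kurtosis"])
print(toy)
```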

Now it's time to model!

## Modeling and pipeline

Logistic regression learns its coefficients using variants of gradient descent. Standardizing the features therefore helps stabilize training.
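
For intuition, standardization can be written by hand in a few lines. This is a sketch on a made-up matrix: each column is centered on its mean and divided by its standard deviation.

```python
import numpy as np

# Made-up feature matrix with two columns on very different scales.
X_raw = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0], [4.0, 4000.0]])

# Standardization: subtract each column's mean, divide by its standard deviation.
mean = X_raw.mean(axis=0)
std = X_raw.std(axis=0)
X_scaled = (X_raw - mean) / std

print(X_scaled.mean(axis=0))  # close to 0 for every column
print(X_scaled.std(axis=0))   # close to 1 for every column
```

After this transformation, both columns live on the same scale, so no single feature dominates the gradient.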

**Task** : Split the data carefully with the [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split) function, then use the [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) class and train a [logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).
Then, display performance metrics such as accuracy, precision, recall and F1-Score. One can use the [`precision_recall_fscore_support`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html) function.

The performance is already quite good, as expected! Currently, we perform two steps when training:
1. Learn the scaling parameters and standardize the inputs
2. Train the model

And two steps when predicting:
1. Standardize the input using the scaling learned from training
2. Make predictions
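
Written out by hand, the two workflows above could look like the sketch below. The synthetic data and the names `X_demo`, `y_demo`, `scaler` and `model` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption for illustration): the label depends on the first feature.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(400, 2))
y_demo = (X_demo[:, 0] > 5.0).astype(int)
X_tr, X_te = X_demo[:300], X_demo[300:]
y_tr, y_te = y_demo[:300], y_demo[300:]

# Training: learn the scaling on the training set only, then fit the model.
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
model = LogisticRegression()
model.fit(X_tr_scaled, y_tr)

# Prediction: reuse the scaling learned during training (transform, not fit_transform).
X_te_scaled = scaler.transform(X_te)
y_hat = model.predict(X_te_scaled)
print(f"Test accuracy: {(y_hat == y_te).mean():.4f}")
```

Calling `fit_transform` on the test set instead of `transform` would leak test statistics into the preprocessing, which is the mistake to avoid.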

These two workflows are very similar, so we can combine them using a [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline) class. The goal is to simplify the code, like this:

In [None]:
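from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Chain the scaler and the model: fit() learns both on the training set only.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logistic", LogisticRegression())
])

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

accuracy = accuracy_score(y_true=y_test, y_pred=y_pred)
# Named f1 to avoid shadowing sklearn.metrics.f1_score.
precision, recall, f1, _ = precision_recall_fscore_support(y_true=y_test, y_pred=y_pred, average="binary")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")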

We highly encourage using the `Pipeline` class to avoid data leakage when applying the `StandardScaler`.

## Choosing the right threshold

Logistic regression outputs a score that can be interpreted as a probability. By default, this score is converted into class labels (0 or 1) using a threshold of 0.5.
But what if 0.5 isn't the best threshold for optimizing the F1-score?

**Task** : Compute the precision, recall and F1-Score for several thresholds. Then make a plot highlighting where the best performance is achieved.
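
One possible starting point is the sketch below, which sweeps thresholds on synthetic data. The data, the threshold grid and the variable names are assumptions made for illustration, and the plotting part is left out.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption for illustration), roughly 10% positives.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=5, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)
model = LogisticRegression().fit(X_tr, y_tr)

# Probability of the positive class for each test observation.
scores = model.predict_proba(X_te)[:, 1]

thresholds = np.linspace(0.05, 0.95, 19)
f1_values = []
for threshold in thresholds:
    # Predict 1 when the score reaches the threshold, 0 otherwise.
    y_hat = (scores >= threshold).astype(int)
    _, _, f1, _ = precision_recall_fscore_support(
        y_te, y_hat, average="binary", zero_division=0
    )
    f1_values.append(f1)

best = int(np.argmax(f1_values))
print(f"Best threshold: {thresholds[best]:.2f} (F1 = {f1_values[best]:.4f})")
```

Strictly speaking, the threshold should be chosen on a validation set rather than on the test set; the test set is used here only to keep the sketch short.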

We're going to dive deeper into some topics in the rest of this notebook, so let's first wrap up the useful code.

**Task** : Define a function `train_experiment` that fits a model on a training dataset, tests it against a test dataset, and displays the metric curves we've just written.

## Handle imbalance

So far, aside from splitting the data and measuring performance, class imbalance hasn't caused much trouble because the problem seems relatively simple.
However, in general, imbalanced datasets often require resampling techniques. We'll cover some of these approaches.

The first method is **random under-sampling**. This technique randomly removes observations from the majority class until the desired ratio between the majority and minority classes is reached.
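
To see the mechanism without any extra library, here is a hand-rolled sketch in NumPy on synthetic data. The target ratio of 1:1 and all names are assumptions made for illustration; this is not the imbalanced-learn implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption for illustration): 900 negatives, 100 positives.
y_demo = np.array([0] * 900 + [1] * 100)
X_demo = rng.normal(size=(1000, 2))

minority_idx = np.flatnonzero(y_demo == 1)
majority_idx = np.flatnonzero(y_demo == 0)

# Keep a random subset of the majority class, as large as the minority class.
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
keep = np.concatenate([minority_idx, kept_majority])

X_under, y_under = X_demo[keep], y_demo[keep]
print(f"Rows before: {len(y_demo)}, after: {len(y_under)}")
print(f"Positive rate after: {y_under.mean():.2f}")
```

The price of balancing this way is visible in the row count: most of the majority class is thrown away.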

**Task** : Using the [`RandomUnderSampler`](https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.RandomUnderSampler.html) class, balance the training dataset and measure the number of rows before and after. Then, fit as usual and display performance metrics and visuals.

The second method works in the opposite direction: we duplicate observations from the minority class until we reach the desired ratio. This is called **random over-sampling**.
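
The same idea can be sketched by hand in NumPy on synthetic data. The target ratio of 1:1 and all names are assumptions made for illustration; this is not the imbalanced-learn implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption for illustration): 900 negatives, 100 positives.
y_demo = np.array([0] * 900 + [1] * 100)
X_demo = rng.normal(size=(1000, 2))

minority_idx = np.flatnonzero(y_demo == 1)
n_majority = int((y_demo == 0).sum())

# Draw minority rows with replacement until both classes have the same size.
extra = rng.choice(minority_idx, size=n_majority - len(minority_idx), replace=True)
keep = np.concatenate([np.arange(len(y_demo)), extra])

X_over, y_over = X_demo[keep], y_demo[keep]
print(f"Rows before: {len(y_demo)}, after: {len(y_over)}")
print(f"Positive rate after: {y_over.mean():.2f}")
```

No information is lost here, but the added rows are exact copies, so the model sees the same minority observations several times.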

**Task** : Using the [`RandomOverSampler`](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.RandomOverSampler.html) class, balance the training dataset and measure the number of rows before and after. Then, fit as usual and display performance metrics and visuals.

Another method, also in the over-sampling family, is [**SMOTE**](https://www.jair.org/index.php/jair/article/view/10302/24590) (Synthetic Minority Over-sampling Technique). Instead of simply duplicating samples, SMOTE creates new observations by interpolating between existing minority-class examples and their neighbors. For a more visual explanation, one can look at [these slides](https://github.com/theo-lq/Conferences/blob/main/M2%20IASD%20Exec%20-%20ML%20LCLF/Support%202024.pdf) (in French).
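
The interpolation step can be illustrated with a simplified sketch in NumPy. The helper `synthetic_point`, the choice `k = 5` and the data are hypothetical, and this is not the imbalanced-learn implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up minority-class observations in two dimensions.
X_minority = rng.normal(size=(20, 2))
k = 5  # number of neighbors considered (hypothetical choice)


def synthetic_point(X, index, k, rng):
    """Create one synthetic observation from X[index] and one of its k nearest neighbors."""
    distances = np.linalg.norm(X - X[index], axis=1)
    # Sort by distance and skip position 0, which is the point itself.
    neighbors = np.argsort(distances)[1:k + 1]
    neighbor = X[rng.choice(neighbors)]
    # Random position on the segment between the point and its neighbor.
    gap = rng.uniform(0.0, 1.0)
    return X[index] + gap * (neighbor - X[index])


new_points = np.array([
    synthetic_point(X_minority, rng.integers(len(X_minority)), k, rng)
    for _ in range(10)
])
print(new_points.shape)
```

Each new observation lies on a segment between two existing minority points, so it stays inside the region already occupied by the minority class.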

**Task** : Using the [`SMOTE`](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html) class, balance the training dataset and measure the number of rows before and after. Then, fit as usual and display performance metrics and visuals.

For our dataset and setup, balancing the dataset didn't improve performance much, but it did affect the model's confidence and calibration.

Another approach, which doesn't require changing the dataset, is to incorporate class imbalance directly into the training loss.
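
As a rough illustration, the sketch below computes class weights with the "balanced" heuristic documented by scikit-learn, `n_samples / (n_classes * class_count)`, and plugs them into a weighted log-loss. The function `weighted_log_loss`, its normalization and the predicted probabilities are assumptions made for illustration, not the exact loss scikit-learn optimizes.

```python
import numpy as np

# Synthetic labels (assumption for illustration): 900 negatives, 100 positives.
y_demo = np.array([0] * 900 + [1] * 100)

# "balanced" heuristic: n_samples / (n_classes * class_count).
counts = np.bincount(y_demo)
weights = len(y_demo) / (len(counts) * counts)
print(weights)


def weighted_log_loss(y_true, proba, class_weights):
    """Log-loss where each observation is weighted by the weight of its class."""
    sample_weights = class_weights[y_true]
    losses = -(y_true * np.log(proba) + (1 - y_true) * np.log(1 - proba))
    return np.sum(sample_weights * losses) / np.sum(sample_weights)


# Made-up predictions: the model outputs 0.1 for every observation.
proba = np.full(len(y_demo), 0.1)
print(weighted_log_loss(y_demo, proba, weights))
```

With these weights, an error on a minority observation costs more than an error on a majority observation, so the model can no longer ignore the minority class cheaply.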

**Task** : Using the *class_weight* parameter in the [`LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) class, fit a model and display performance metrics and visuals.

## Polynomial features and hyperparameter tuning

As we saw in Session 2, polynomial features can help linear models capture more complex relationships. It's worth trying them again here.
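
As a reminder of what the expansion produces, here is a minimal sketch on a single made-up row with two features. The choices `degree=2` and `include_bias=False` are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One made-up observation with two features, a = 2 and b = 3.
X_small = np.array([[2.0, 3.0]])

# Degree 2 without the constant column: a, b, a^2, a*b, b^2.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_small)
print(X_poly)
```

Two input features become five, so the number of columns grows quickly with the degree and the number of features.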

**Task** : Using the [`PolynomialFeatures`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html#sklearn.preprocessing.PolynomialFeatures) class, process the training set and transform the test set. Then, fit as usual and display performance metrics and visuals.

It's a bit better, but there's still room for improvement. Now it's your turn to experiment!