# Lab work 2: Logistic regression

This notebook builds on the fourth lecture of Foundations of Machine Learning. We'll focus on the logistic regression model and how to handle class imbalance.

Important note: the steps shown here are not always the most efficient or the most “industry-approved.” Their main purpose is pedagogical. So don't panic if something looks suboptimal—it's meant to be.

If you have questions (theoretical or practical), don't hesitate to bug your lecturer.

We will try to predict accurately whether a star observation is actually a [pulsar](https://en.wikipedia.org/wiki/Pulsar), using this [dataset](https://www.kaggle.com/datasets/colearninglounge/predicting-pulsar-starintermediate). Let's first load the dataset.

In [None]:
import pandas as pd

df = pd.read_csv("NB4 - Pulsars.csv")
df.head()

Since we are working on a classification problem, we first need to measure how imbalanced the classes are.

**Task**: Measure class imbalance on the dataset by computing how often the target equals the class of interest (here 1).
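As a minimal sketch of the idea, the snippet below measures the positive rate on synthetic labels (not the pulsar data; the 10% share is chosen for illustration) and compares it with a baseline that always predicts the majority class. The names `y_demo`, `X_dummy` and `baseline` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# Synthetic labels for illustration only: 1 = class of interest, 0 = everything else.
y_demo = pd.Series([0] * 90 + [1] * 10, name="target")

# With 0/1 labels, the mean is the share of the positive class.
positive_rate = y_demo.mean()
print(f"Interest class rate: {100 * positive_rate:.2f}%")

# Baseline that always predicts the most frequent class; it ignores the features.
X_dummy = np.zeros((len(y_demo), 1))
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_demo)
baseline_accuracy = baseline.score(X_dummy, y_demo)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

On such labels the baseline accuracy is simply one minus the positive rate, which is why accuracy alone is a weak yardstick on imbalanced data.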

Therefore, a model with an accuracy below 90% performs worse than a model that predicts "it is not a pulsar" every time. Beyond the difficulty of measuring performance, class imbalance also causes a problem in the splitting process:

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop(columns=["target"])
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

imbalance_rate_train = 100 * y_train.mean()
print(f"Interest class rate: {imbalance_rate_train:.2f}% (train)")
imbalance_rate_test = 100 * y_test.mean()
print(f"Interest class rate: {imbalance_rate_test:.2f}% (test)")

The [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split) function splits the dataset randomly, so it does not guarantee that the class proportions are preserved across splits.
To overcome this, one can use the *stratify* parameter.
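A small sketch of stratified splitting, run on synthetic data rather than the pulsar dataset (the arrays `X_demo` and `y_demo` are made up for the example): passing the labels to `stratify` asks `train_test_split` to keep the class proportions in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: 1000 rows, 10% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 100 + [0] * 900)

# stratify=y_demo keeps the positive rate (nearly) identical in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)
print(f"train: {100 * y_tr.mean():.2f}%  test: {100 * y_te.mean():.2f}%")
```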

**Task**: Modify the code above to use the *stratify* parameter and conclude.

## Data preparation

As always, the bread and butter of data science: data preparation.

**Task**: Inspect the dataset with the [`describe`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) method, since every column is numerical.

**Task**: The code below uses the [`scatter_matrix`](https://pandas.pydata.org/docs/reference/api/pandas.plotting.scatter_matrix.html) function. Interpret its output for our problem.

This figure gives a lot of insight, but it does not show how the two classes are distributed.

**Task**: Using the function below, explore again the different relationships between the variables and the target.

The plots clearly show that Skewness = Excess_kurtosis ** 2. We therefore remove the Excess_kurtosis variable from our work.
Also, given the columns named *Mean* and *Std*, we compute a *Z-Score* variable defined as the ratio between the two.
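The two changes can be sketched on a tiny hypothetical table. The values are invented, the column names are taken from the text above, and the direction of the ratio (`Mean / Std`) is an assumption, since the text only says "the ratio between the two".

```python
import pandas as pd

# Tiny hypothetical table reusing the column names mentioned in the text.
df_demo = pd.DataFrame({
    "Mean": [100.0, 120.0, 80.0],
    "Std": [50.0, 40.0, 20.0],
    "Excess_kurtosis": [0.5, 1.0, 2.0],
    "Skewness": [0.25, 1.0, 4.0],
    "target": [0, 0, 1],
})

# Drop the redundant column (drop returns a new DataFrame).
df_prepared = df_demo.drop(columns=["Excess_kurtosis"])

# Add the ratio feature; Mean / Std is the assumed direction.
df_prepared["Z-Score"] = df_prepared["Mean"] / df_prepared["Std"]
print(df_prepared)
```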

**Task**: Implement the changes detailed above.

Now it is time to model!

## Modelling and pipeline

Logistic regression learns its coefficients with iterative, gradient-based optimization. Standardizing the features therefore helps stabilize training.

**Task**: After carefully splitting the data with the [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split) function, standardize the features with the [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) class and train a [logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).
Then, display performance metrics such as accuracy, precision, recall and F1-Score. One can use the [`precision_recall_fscore_support`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html) function.

The performance is already quite good, as expected! Training takes two steps:
1. Fit the scaler on the training inputs, then standardize them
2. Train the model

Prediction also takes two steps:
1. Standardize the inputs, using the statistics learned during training
2. Predict
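Written out by hand, the two training steps and the two prediction steps above could look like the following sketch. It runs on synthetic data from `make_classification` (an assumption made to keep the example self-contained), not on the pulsar dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (about 10% positives), for illustration only.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=5, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

# Training: learn mean and standard deviation on the training set only, then fit.
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
model = LogisticRegression().fit(X_tr_scaled, y_tr)

# Prediction: reuse the training statistics, never refit the scaler on the test set.
X_te_scaled = scaler.transform(X_te)
y_pred = model.predict(X_te_scaled)
```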

These two use cases are very similar: they can be combined using the [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline) class. Its goal is to simplify the code, as follows:

In [None]:
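from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Chain scaling and classification so they are fitted and applied together.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logistic", LogisticRegression())
])

# fit() fits the scaler on X_train, then the model on the scaled X_train.
pipe.fit(X_train, y_train)
# predict() scales X_test with the training statistics before predicting.
y_pred = pipe.predict(X_test)

accuracy = accuracy_score(y_true=y_test, y_pred=y_pred)
# Named `f1` so that sklearn.metrics.f1_score is not shadowed if it is imported.
precision, recall, f1, _ = precision_recall_fscore_support(y_true=y_test, y_pred=y_pred, average="binary")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")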

We highly encourage using the `Pipeline` class: because the `StandardScaler` is fitted on the training data only, it prevents data leakage.
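One situation where this matters is cross-validation. The sketch below, on synthetic data (an assumption for the example), passes a whole pipeline to `cross_val_score`, so the scaler is refitted on the training part of each fold and never sees the held-out part.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data, for illustration only.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, weights=[0.9, 0.1], random_state=0
)

pipe_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("logistic", LogisticRegression()),
])

# Each of the 5 folds clones the pipeline and fits it on that fold's training part only.
scores = cross_val_score(pipe_demo, X_demo, y_demo, cv=5, scoring="f1")
print(scores.round(3))
```

Scaling the full dataset before cross-validating would instead let test-fold statistics leak into training.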

## Choosing the right threshold

We know that a logistic regression natively outputs a score interpreted as a probability. This score is then converted into a class (0 / 1) using the threshold 0.5.
What if this is not the best threshold for the F1-Score?
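To make the role of the threshold concrete, here is a minimal sketch on synthetic data (not the pulsar dataset). It reads the class-1 probability from `predict_proba` and applies two thresholds by hand; the value 0.3 is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data, for illustration only.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, weights=[0.9, 0.1], random_state=0
)
model = LogisticRegression().fit(X_demo, y_demo)

# Column 1 of predict_proba holds the estimated probability of class 1.
proba = model.predict_proba(X_demo)[:, 1]

# Matches predict() for a binary model, up to ties at exactly 0.5.
default_pred = (proba > 0.5).astype(int)
# A lower threshold can only flag the same rows or more rows as positive.
custom_pred = (proba >= 0.3).astype(int)

print(default_pred.sum(), custom_pred.sum())
```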

**Task**: Compute the precision, recall and F1-Score for several thresholds. Then make a plot highlighting where the best performance is achieved.

We are going to dig deeper into several topics in the rest of this notebook, so let's wrap the useful code up.

**Task**: Define a function `train_experiment` that fits a model on a training dataset, tests it against a test dataset and displays the metric curves we have just written.
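One possible shape for such a helper is sketched below. The signature and the `thresholds` argument are hypothetical, and the sketch returns a table of metrics instead of drawing the curves, so plotting remains to be added. It assumes a binary problem and a model exposing `predict_proba`.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def train_experiment(model, X_train, y_train, X_test, y_test, thresholds=None):
    """Fit `model`, then score the test set at several thresholds (no plotting)."""
    if thresholds is None:
        thresholds = np.linspace(0.05, 0.95, 19)
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    rows = []
    for threshold in thresholds:
        y_pred = (proba >= threshold).astype(int)
        # zero_division=0 avoids warnings when no positive is predicted.
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_test, y_pred, average="binary", zero_division=0
        )
        rows.append({"threshold": threshold, "precision": precision,
                     "recall": recall, "f1": f1})
    return pd.DataFrame(rows)


# Usage on synthetic data, for illustration only.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=5, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)
pipe_demo = Pipeline([("scaler", StandardScaler()), ("logistic", LogisticRegression())])
metrics = train_experiment(pipe_demo, X_tr, y_tr, X_te, y_te)
best = metrics.loc[metrics["f1"].idxmax()]
print(best)
```

Returning a DataFrame keeps the helper easy to reuse: the same table can feed a plot or a comparison between experiments.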

## Handling imbalance

So far, apart from splitting the sets and measuring performance, class imbalance does not seem to bother us much, as the problem appears to be *simple*. Yet this type of problem may call for **resampling**. We are going to cover a few resampling methods.

The first is **random under-sampling**. This method randomly drops observations from the majority class until the desired ratio between the majority and minority classes is reached.
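The mechanism can be sketched with plain NumPy on synthetic data. This is a hand-rolled illustration, not a library implementation, and the 1:1 target ratio is an assumption made for the example.

```python
import numpy as np

# Synthetic data for illustration: 100 minority rows, 900 majority rows.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 100 + [0] * 900)

minority_idx = np.flatnonzero(y_demo == 1)
majority_idx = np.flatnonzero(y_demo == 0)

# Keep a random subset of the majority class, as large as the minority class (1:1).
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
keep = np.concatenate([minority_idx, kept_majority])

X_under, y_under = X_demo[keep], y_demo[keep]
print(len(y_demo), "->", len(y_under))
```

Only the training set should be resampled this way; the test set keeps its natural class ratio so that the metrics stay meaningful.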

**Task**: Using the [`RandomUnderSampler`](https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.RandomUnderSampler.html) class, balance the training dataset and measure the number of rows before and after. Then, fit as usual and display performance metrics and visuals.

The second method is the opposite: we duplicate observations from the minority class until the desired ratio is reached. This is called **random over-sampling**.
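Here is a hand-rolled NumPy sketch of that duplication on synthetic data. It is an illustration of the mechanism rather than a library implementation, and the 1:1 target ratio is again an assumption.

```python
import numpy as np

# Synthetic data for illustration: 100 minority rows, 900 majority rows.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 100 + [0] * 900)

minority_idx = np.flatnonzero(y_demo == 1)
n_majority = int((y_demo == 0).sum())

# Draw minority rows with replacement until both classes have the same size.
n_extra = n_majority - len(minority_idx)
extra_minority = rng.choice(minority_idx, size=n_extra, replace=True)
keep = np.concatenate([np.arange(len(y_demo)), extra_minority])

X_over, y_over = X_demo[keep], y_demo[keep]
print(len(y_demo), "->", len(y_over))
```

Every added row is an exact copy of an existing minority row, so no new information enters the dataset.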

**Task**: Using the [`RandomOverSampler`](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.RandomOverSampler.html) class, balance the training dataset and measure the number of rows before and after. Then, fit as usual and display performance metrics and visuals.

One last method, which belongs to the over-sampling category, is [**SMOTE**](https://www.jair.org/index.php/jair/article/view/10302/24590) (Synthetic Minority Over-sampling Technique). The algorithm **creates** new minority observations by interpolating between an existing minority observation and one of its nearest minority-class neighbors. For a more visual explanation, one can look at [these slides](https://github.com/theo-lq/Conferences/blob/main/M2%20IASD%20Exec%20-%20ML%20LCLF/Support%202024.pdf) (in French).
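The core interpolation step can be sketched as follows. This is a simplified illustration on synthetic points, not the imbalanced-learn implementation; the values `k = 5` and `n_new = 30` are arbitrary choices.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic minority-class points, for illustration only.
rng = np.random.default_rng(0)
X_minority = rng.normal(size=(20, 2))

k = 5
# Ask for k + 1 neighbors because each point is its own nearest neighbor.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
_, neighbor_idx = nn.kneighbors(X_minority)

n_new = 30
# Pick random minority points, then one of the k true neighbors of each.
base = rng.integers(0, len(X_minority), size=n_new)
chosen = neighbor_idx[base, rng.integers(1, k + 1, size=n_new)]
# Random position along the segment between the point and its neighbor.
gap = rng.random((n_new, 1))

X_synthetic = X_minority[base] + gap * (X_minority[chosen] - X_minority[base])
print(X_synthetic.shape)
```

Unlike random over-sampling, the new rows are not copies: each one lies on the segment between two existing minority points.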

**Task**: Using the [`SMOTE`](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html) class, balance the training dataset and measure the number of rows before and after. Then, fit as usual and display performance metrics and visuals.

For our dataset and our work, balancing the dataset did not improve performance, but it changed the model's confidence and calibration.

Another method, which does not rely on balancing the dataset, is to account for class imbalance directly in the training loss.
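A common way to do this is to give each class a weight in the loss. The sketch below shows the weights produced by scikit-learn's `"balanced"` heuristic on synthetic labels (10% positives, chosen for illustration) and checks them against the formula `n_samples / (n_classes * np.bincount(y))`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels for illustration: 90 negatives, 10 positives.
y_demo = np.array([0] * 90 + [1] * 10)
classes = np.unique(y_demo)

# Weights chosen by the "balanced" heuristic.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

# Same weights computed by hand: rarer classes get larger weights.
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))
print(dict(zip(classes.tolist(), weights.round(3).tolist())))
```

With these weights, an error on a positive example costs nine times as much as an error on a negative one, which offsets the 9:1 imbalance.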

**Task**: Using the *class_weight* parameter in the [`LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) class, fit a model and display performance metrics and visuals.

## Polynomial features and hyperparameter tuning

As we saw in session 2, polynomial features can help linear models handle complex relationships. We should try them again.

**Task**: Using the [`PolynomialFeatures`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html#sklearn.preprocessing.PolynomialFeatures) class, fit and transform the training set, then transform the test set. Then, fit as usual and display performance metrics and visuals.

It is a bit better! There is still some work to do!