# <font color=#023F7C> **Machine Learning and Deep Learning**</font>

---



<font color=#023F7C>**Hi! PARIS DataBootcamp 2024 🚀**</font> <br>


<img src = https://www.hi-paris.fr/wp-content/uploads/2020/09/logo-hi-paris-retina.png width = "300" height = "200" >


**Name**: ...           <br>
**School**: ...         <br>
**Group number**: ...    <br>
**Track**: ...           <br>
**Teaching Assistants**: Thibault Porssut and Farouk Kadri, Machine Learning Research Engineer @ Hi! PARIS

# **1. Important guidelines**


The RUL (Remaining Useful Life) refers to the remaining lifespan of an aircraft engine before failure. In the context of the NASA C-MAPSS dataset, this term represents the number of operational cycles an engine can still perform before breaking down. **The goal of this practical session** is to build a Machine Learning model that can **predict this value using sensor data collected during previous cycles.**

To achieve this goal, you are provided with three datasets: train_FD001.txt, test_FD001.txt, and RUL_FD001.txt, originating from the NASA C-MAPSS dataset. <br>
These datasets contain sensor measurements and operational settings for multiple aircraft engines recorded over a series of operational cycles. Each engine is run until it fails, and the sensor data capture the degradation process over time. <br>
The file RUL_FD001.txt contains the actual Remaining Useful Life (RUL) values for the engines in the test set at the point where the test data end.

The variable to predict is RUL, which represents the number of operational cycles remaining before the engine fails.
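
As an illustration of this definition, here is a minimal sketch of how RUL could be derived for run-to-failure training data: the last recorded cycle of each engine is taken as its failure point, so RUL is that last cycle minus the current cycle. The values are made up, and the column names `engine_unit_number` and `time_cycles` are borrowed from the feature list used later in this notebook; the preprocessed files you load below may already contain their own target column.

```python
import pandas as pd

# Toy run-to-failure data (made-up values): engine 1 fails after 4 cycles, engine 2 after 3.
toy = pd.DataFrame({
    "engine_unit_number": [1, 1, 1, 1, 2, 2, 2],
    "time_cycles":        [1, 2, 3, 4, 1, 2, 3],
})

# Last observed cycle of each engine, broadcast back to every row of that engine.
last_cycle = toy.groupby("engine_unit_number")["time_cycles"].transform("max")

# RUL = number of cycles left before the last recorded cycle of the engine.
toy["RUL"] = last_cycle - toy["time_cycles"]
print(toy)
```

For the test set the engines are not run to failure, which is why the true values come from a separate file.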

**<font size='5'><u>How to work on this notebook</u>**</font> <br>
The notebook is split in two parts: Machine Learning and Deep Learning.

- **Beginner track**: You only have to complete the Machine Learning and Deep Learning sections.
- **Intermediate track**: Please complete Machine Learning, Deep Learning and the optional section.

**<font size='5'><u>Bootcamp deliverables</u>**</font> <br>

Here are the two deadlines for the bootcamp deliverables:
- <u>**Friday 12:30 PM**</u>: <br> Send us the "Machine Learning and Deep Learning" notebooks (no need to send us data cleaning) <br>
    - **Each group member should send their own notebooks** (we won't accept one notebook per group)
    - Don't forget to complete the start of the notebook with your information (name, school, group number and track)
    
- <u>**Friday 2:30 PM**</u>: <br>Send us the group slides <br>
    - You can send us a single PowerPoint per group (no need to send us one per group member)
    - Don't forget to add your group number as well as who is in your group (name, school and track) to the slides

Send both the notebooks and the slides to `data-event@hi-paris.fr`

**<font size='5'><u>Need help ? 🙏</u>**</font> <br>

Later in the week, we will upload the `Machine_Learning_Beginner_DB2025.ipynb` notebook to the Machine Learning course (Beginner track) on HFactory, for those who need help with the Machine Learning part.

**Don't hesitate to ask questions to the bootcamp organizers/staff members if you need help.**


# **2. Machine Learning**

Let's start by importing the libraries needed for this notebook.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time

Now load the train and test datasets using `pd.read_csv()`

In [3]:
train_path =  '02_df_train_FD001_wo_nan_denoised.csv'
test_path = '02_df_test_FD001_wo_nan_denoised.csv'

# Train and test data.
train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)

print("Train shape:", train_df.shape)
print("Test shape:", test_df.shape)

Train shape: (18816, 28)
Test shape: (12039, 28)


### **2.1 Data preprocessing**

In [None]:
# Preprocessing tools
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.preprocessing import StandardScaler, MinMaxScaler


In [None]:
def replace_with_gaussian_bins(df, cols):
    """
    Replace numerical columns with 7-category Gaussian bins based on mean ± n*std.

    Parameters:
    df   : DataFrame
    cols : list of column names to transform

    Returns:
    DataFrame with the same structure but selected numeric columns replaced with categorical bins
    """
    df_out = df.copy()

    labels = [
        "Extremely Low",
        "Very Low",
        "Low",
        "Normal",
        "High",
        "Very High",
        "Extremely High"
    ]

    for col in cols:
        mu = df_out[col].mean()
        sigma = df_out[col].std()

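        # Bin edges; each comment names the bin that ends at that edge.
        bins = [
            -np.inf,
            mu - 2*sigma,     # upper edge of "Extremely Low"
            mu - 1*sigma,     # upper edge of "Very Low"
            mu - 0.5*sigma,   # upper edge of "Low"
            mu + 0.5*sigma,   # upper edge of "Normal"
            mu + 1*sigma,     # upper edge of "High"
            mu + 2*sigma,     # upper edge of "Very High"
            np.inf            # upper edge of "Extremely High"
        ]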

        df_out[col] = pd.cut(df_out[col], bins=bins, labels=labels, include_lowest=True)

    return df_out

In [None]:
train_df=replace_with_gaussian_bins(train_df,["HPC outlet temperature (°C)"])
test_df=replace_with_gaussian_bins(test_df,["HPC outlet temperature (°C)"])
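
Note that the cell above computes the mean and standard deviation separately on each split, so the same temperature can land in different bins in train and test. One possible alternative, sketched below on made-up numbers, is to compute the bin edges on the training data and reuse them on the test data. The helper names `gaussian_bin_edges` and `apply_bins` are hypothetical and not part of this notebook.

```python
import numpy as np
import pandas as pd

LABELS = ["Extremely Low", "Very Low", "Low", "Normal",
          "High", "Very High", "Extremely High"]

def gaussian_bin_edges(series):
    """Return the 8 edges (7 bins) built from the mean and std of `series`."""
    mu, sigma = series.mean(), series.std()
    offsets = [-2, -1, -0.5, 0.5, 1, 2]
    return [-np.inf] + [mu + k * sigma for k in offsets] + [np.inf]

def apply_bins(series, edges):
    """Cut `series` with precomputed edges so that every split shares the same bins."""
    return pd.cut(series, bins=edges, labels=LABELS)

# Made-up temperatures standing in for a sensor column.
rng = np.random.default_rng(0)
train_temp = pd.Series(rng.normal(loc=600.0, scale=5.0, size=1000))
test_temp = pd.Series(rng.normal(loc=602.0, scale=5.0, size=200))

edges = gaussian_bin_edges(train_temp)        # statistics come from train only
train_binned = apply_bins(train_temp, edges)
test_binned = apply_bins(test_temp, edges)    # the same edges are reused on test
print(test_binned.value_counts().sort_index())
```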


In [None]:


features = ['engine_unit_number',
  'time_cycles',
  'LPC outlet temperature (°C)',
  "HPC outlet temperature (°C)",
  "LPT outlet temperature (°C)",
  "HPC outlet pressure (bar)",
  "Physical core speed (rpm)",
  "Fuel flow / Ps30 (kg/s/bar)",
  "Corrected fan speed (rpm)",
  "Bypass ratio (dimensionless)",
  "Bleed enthalpy (kJ/kg)",
  "HPT coolant bleed flow (kg/s)"]



X_train = train_df[features].copy()
y_train = train_df["RUL_class"].copy()

X_test = test_df[features].copy()
y_test = test_df["RUL_class"].copy()


print("Train rows:", X_train.shape[0], "Test rows:", X_test.shape[0])

**<u>Question 2.1.1:</u>** <br>**Transform the categorical variables in each split with `OneHotEncoder`.** <br>


The column names `OneHotEncoder` creates can be accessed with `.get_feature_names_out()`. <br>
Go to this [page](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) for more info on how to use scikit-learn's `OneHotEncoder` class. <br>

*Don't forget, data preprocessing is only applied to the feature variables in the case of binary classification !*


However, for ordinal features, integer (label) encoding is often preferred over one-hot encoding, because such features have a meaningful, intrinsic order (e.g., Low < Medium < High). Mapping the categories to integers that follow this order preserves the natural ranking and lets models interpret the progression between categories. In contrast, one-hot encoding treats each category as unrelated, discarding the ordinal relationship and increasing the number of features, which can reduce efficiency without adding value.

**<u>Question 2.1.2:</u>** <br>**Transform the categorical variables with `LabelEncoder`.** <br>


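`LabelEncoder` does not create new columns: it replaces each category with an integer code, and the learned categories can be accessed with `.classes_`. Note that the codes are assigned in sorted (alphabetical) order, so check that the resulting integers match the intended ranking. <br>
Go to this [page](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) for more info on how to use scikit-learn's `LabelEncoder` class. <br>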

*Don't forget, data preprocessing is only applied to the feature variables in the case of binary classification !*
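
The sketch below, on a made-up column, shows what the integer codes look like. `LabelEncoder` sorts the categories before numbering them, so the codes do not necessarily follow the intended ranking. `OrdinalEncoder` is not mentioned in this notebook; it is shown here only as one possible way to impose an explicit order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

order = ["Low", "Normal", "High"]   # intended ranking (made-up example)
toy = pd.DataFrame({"temp_bin": ["Low", "Normal", "High", "Normal"]})

# LabelEncoder sorts the categories: High -> 0, Low -> 1, Normal -> 2.
label_enc = LabelEncoder()
label_codes = label_enc.fit_transform(toy["temp_bin"])
print(list(label_enc.classes_), label_codes)

# OrdinalEncoder with explicit categories keeps Low < Normal < High.
ordinal_enc = OrdinalEncoder(categories=[order])
ordinal_codes = ordinal_enc.fit_transform(toy[["temp_bin"]]).ravel()
print(ordinal_codes)
```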

**<u>Question 2.1.3</u>**: <br>
**Scale the continuous variables using either `StandardScaler` (standardization) or `MinMaxScaler` (normalization).**<br>
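
To show the difference between the two scalers, here is a small sketch on made-up values. In both cases the scaler is fitted on the training split and only applied to the test split, so test values can fall outside the training range.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Made-up continuous feature (a single column).
toy_train = np.array([[10.0], [20.0], [30.0]])
toy_test = np.array([[20.0], [40.0]])

# Standardization: subtract the train mean, divide by the train standard deviation.
std_scaler = StandardScaler().fit(toy_train)
test_std = std_scaler.transform(toy_test)

# Normalization: map the train minimum to 0 and the train maximum to 1.
mm_scaler = MinMaxScaler().fit(toy_train)
test_mm = mm_scaler.transform(toy_test)

print(test_std.ravel())
print(test_mm.ravel())  # the second value exceeds 1 because 40 is above the train maximum
```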





### **2.2 Model training and evaluation**
Now that our dataset has been preprocessed, we can use it to train Machine Learning models.





In [None]:
# Metrics for evaluation
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, roc_curve, auc, precision_recall_curve,
    classification_report, roc_auc_score, PrecisionRecallDisplay, average_precision_score
)

# hyperparameter tuning
from sklearn.model_selection import GridSearchCV

# Classification algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import ConfusionMatrixDisplay

# Plotting & Style
import seaborn as sns
sns.set(style="whitegrid")


**Train three models of your choice** (Logistic Regression, K nearest neighbor, Decision Tree,...) **using scikit-learn's `.fit()` method. <br>**

<u>Help</u>: Train these models on the training set (`X_train` and `y_train`).

*Advice: try a Random Forest.*



**Why classification instead of regression?**

Although the Remaining Useful Life (RUL) is a continuous variable, here we approach the problem as classification because:

- Classification simplifies decision-making by focusing on whether an engine is likely to fail soon (within 30 cycles) rather than predicting the exact remaining cycles.
- Regression models on RUL often struggle to be robust due to noise and variability in the sensor data.
- Classification allows for more stable, actionable predictions aligned with maintenance needs: "replace soon" vs "safe".

<u>**Question 2.2.1**:</u> <br>  Choose any three models to try. Which ones do you pick ?

<u>**Question 2.2.2**:</u> <br> Train the FIRST selected model using `.fit(X_train_scaled, y_train)`.

<u>**Question 2.2.3**:</u> <br> Train the SECOND and THIRD selected models.

<u>**Question 2.2.4**:</u> <br> FIRST model: predict on TRAIN/TEST, then print Test Accuracy, Test Recall, Test F1.
If our goal is to catch failures, which metric should we prioritize?

*Help: The `.predict()` method should be applied to the features of the test set (`X_test`); see the scikit-learn [getting started page](https://scikit-learn.org/stable/getting_started.html).* <br>
*The metrics to use are `accuracy_score()`, `recall_score()` and `f1_score()`; all available metrics are listed on this [page](https://scikit-learn.org/stable/api/sklearn.metrics.html).*
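
The fit / predict / score pattern could look like the sketch below. It uses synthetic data from `make_classification` as a stand-in for the real features, and Logistic Regression is only one possible choice of model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data (about 10% positives) standing in for the real features.
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

# Train on the training split only.
model = LogisticRegression(max_iter=1000)
model.fit(X_tr, y_tr)

# Predict hard labels on the test features.
y_pred = model.predict(X_te)

test_accuracy = accuracy_score(y_te, y_pred)
test_recall = recall_score(y_te, y_pred)   # share of true positives that were caught
test_f1 = f1_score(y_te, y_pred)
print(f"Accuracy: {test_accuracy:.3f}  Recall: {test_recall:.3f}  F1: {test_f1:.3f}")
```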





<u>**Question 2.2.5**:</u> <br> Repeat for the SECOND model.

<u>**Question 2.2.6**:</u> <br> Repeat for the THIRD model.

You may notice that we emphasize Recall. Recall answers: “of all the true failures, how many did the model catch?” On imbalanced datasets, a model can show high accuracy while still missing many failures (predicting the majority class most of the time). That means lots of false negatives. When the goal is failure detection, Recall is the right metric because it prioritizes catching positives (fewer missed failures), even if that sometimes increases false alarms. Keep in mind there’s a trade-off: higher Recall can lower Precision, so we pick the threshold (or settings) that gives the best compromise for our use case.

- If Accuracy is high but Recall is low, we are missing failures (false negatives).
- Since the goal is to detect failures, prioritize Recall.
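
A small numeric example makes the point concrete. With made-up labels where only 5 engines out of 100 fail, a "model" that always predicts the majority class reaches 95% accuracy while catching no failure at all.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 95 healthy engines (0) and 5 failures (1): made-up labels.
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print("Accuracy:", accuracy, "Recall:", recall)
```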

<u>**Question 2.2.7**:</u> <br> Find the best model by TEST Recall among the three and show its TEST confusion matrix.

*help: Look at recall_score() and confusion_matrix() [page](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)*
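
One way to organise this comparison is sketched below with made-up labels and predictions. The names `model_a`, `model_b` and `model_c` are placeholders for your three trained models.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Made-up test labels and predictions from three hypothetical models.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
predictions = {
    "model_a": np.array([0, 0, 0, 0, 1, 0, 0, 0, 1, 0]),
    "model_b": np.array([0, 1, 0, 0, 1, 1, 1, 0, 0, 0]),
    "model_c": np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0]),
}

# Test Recall of each model, then the name of the best one.
recalls = {name: recall_score(y_true, pred) for name, pred in predictions.items()}
best_name = max(recalls, key=recalls.get)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, predictions[best_name])
tn, fp, fn, tp = cm.ravel()
print(best_name, recalls[best_name])
print(cm)
```

The matrix can then be passed to `ConfusionMatrixDisplay` (imported earlier) if you prefer a plot.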

<u>**Question 2.2.8**:</u> <br> Plot ROC and Precision–Recall curves for the BEST model (needs probabilities).
*help: Look at roc_curve() [page](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html) and precision_recall_curve() [page](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html)*
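
Both curves are computed from scores rather than hard labels, which is why probabilities are needed. A possible sketch on synthetic data follows; the Random Forest and the data are stand-ins for your own best model and test set.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc, precision_recall_curve, average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data standing in for the real features.
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Probability of the positive class (column 1 of predict_proba).
y_score = model.predict_proba(X_te)[:, 1]

fpr, tpr, _ = roc_curve(y_te, y_score)
precision, recall, _ = precision_recall_curve(y_te, y_score)
roc_auc = auc(fpr, tpr)
avg_precision = average_precision_score(y_te, y_score)

# In a notebook, the figure renders at the end of the cell.
fig, (ax_roc, ax_pr) = plt.subplots(1, 2, figsize=(10, 4))
ax_roc.plot(fpr, tpr, label=f"AUC = {roc_auc:.2f}")
ax_roc.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance level
ax_roc.set(xlabel="False positive rate", ylabel="True positive rate", title="ROC curve")
ax_roc.legend()
ax_pr.plot(recall, precision, label=f"AP = {avg_precision:.2f}")
ax_pr.set(xlabel="Recall", ylabel="Precision", title="Precision-Recall curve")
ax_pr.legend()
```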

<u>**Question 2.2.9**:</u> <br> Pick the decision threshold with the highest Recall, ONLY for the best model (picked above).
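
Taken literally, a threshold of 0 always gives a Recall of 1, since every engine is then flagged as failing. One reasonable reading, sketched below on made-up scores, is to pick the threshold with the highest Recall among those that keep Precision above a minimum. The value `MIN_PRECISION = 0.7` is an assumption for illustration, not a requirement of this notebook.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, precision_score, recall_score

# Made-up true labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.70, 0.90])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

MIN_PRECISION = 0.7  # hypothetical constraint; adapt it to your use case

# precision and recall have one more entry than thresholds, so drop their last point.
acceptable = precision[:-1] >= MIN_PRECISION
best_idx = np.argmax(np.where(acceptable, recall[:-1], -1.0))
best_threshold = thresholds[best_idx]

# Turn probabilities into labels with the chosen threshold.
y_pred = (y_score >= best_threshold).astype(int)
chosen_recall = recall_score(y_true, y_pred)
chosen_precision = precision_score(y_true, y_pred)
print(best_threshold, chosen_recall, chosen_precision)
```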

<u>**Question 2.2.10**:</u> <br> Tiny GridSearch to favor Recall.

*Help: Try GridSearchCV [page](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) on the best model you selected (by Recall) to find hyperparameters that improve Recall.<br>
Report the best hyperparameters (.best_params_) and the best cross-validated Recall (.best_score_). Then re-fit the tuned model and print the Test Recall.*

Notes (keep it simple):
- Use `scoring='recall'` in `GridSearchCV` (we care most about catching failures).
- Keep the grid tiny (2–3 values per parameter) to avoid long runs; set `cv=3`.
- Compare against your baseline Test Recall (before tuning). One line is enough.

*(Optional) If you tried different decision thresholds earlier (e.g., 0.4 / 0.5 / 0.6), also report Test Recall using your chosen threshold with the tuned model.*
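
A tiny grid search along these lines might look as follows. The data are synthetic, and the Random Forest and the parameter values in `param_grid` are illustrative assumptions; use your own best model and a grid that suits it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced data standing in for the real features.
X, y = make_classification(n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

# Tiny grid: two values per parameter (illustrative choices).
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
    "class_weight": [None, "balanced"],
}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="recall",   # select hyperparameters by cross-validated Recall
    cv=3,
)
search.fit(X_tr, y_tr)

print("Best params:", search.best_params_)
print("Best CV recall:", round(search.best_score_, 3))

# With the default refit=True, best_estimator_ is already re-fitted on the whole training split.
tuned_recall = recall_score(y_te, search.best_estimator_.predict(X_te))
print("Test recall (tuned):", round(tuned_recall, 3))
```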

# **3. Explainability with shap**

The `shap` library (SHapley Additive exPlanations) is a Python library used for explaining the output of machine learning models. <br> It provides a unified framework for interpreting complex models and understanding the contributions of individual features to model predictions. <br>

Shap is particularly useful for understanding black-box models like boosting, random forests, and deep neural networks, among others. <br>
It can also be used with any classification model.
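
To make "contribution of individual features" concrete without running the `shap` library, here is a small sketch for one special case: a linear model whose features are treated as independent. In that case the SHAP value of a feature has a closed form, coefficient × (feature value − feature mean). The data and the regression model are made up; for other model types the `shap` explainers are needed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data with three features and a linear target plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

x = X[0]                                  # the row we want to explain
baseline = model.predict(X).mean()        # average prediction over the background data

# Closed-form contributions for a linear model with independent features.
contributions = model.coef_ * (x - X.mean(axis=0))

# Additivity: baseline + sum of contributions equals the prediction for x.
prediction = model.predict(x.reshape(1, -1))[0]
print(contributions, baseline + contributions.sum(), prediction)
```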

**Let's install and import the shap library.**

In [None]:
!pip install shap

In [None]:
import shap
np.bool = bool  # compatibility shim: older shap versions still reference the np.bool alias removed in NumPy 1.24

Shap is very heavy and takes a long time to compute. <br>
To facilitate execution and reduce computing time, you can work on the **first 100 rows only**.

*Note: You can use either the train features (X_train) or the test features (X_test) to compute shap values*

In [None]:
df_shap = X_train_scaled.iloc[:100]

<u>**Question 3.1**:</u> <br>
**Create an object `explainer` that can compute shap values.** <br>

*<u>Help</u>: You can use `shap.Explainer` for any trained classification model as input.* <br>
*For tree based models, you can use `shap.TreeExplainer`*.


**Now, compute the shap values of a model with `explainer.shap_values`.** <br>
If it takes too much time, you can reduce to 100-500 values.

<u>**Question 3.2**:</u> <br>
**Display the summary plot of shap values with `shap.summary_plot(..., plot_type="bar")`.**

*Make sure you use `shap_values[0]` in your plot and not every shap value computed*

<u>**Question 3.3**:</u> <br>
**Use the same shap plot as previously but replace `plot_type="bar"` with `plot_type="dot"`.** <br>
**And add the data you used to compute shap_values in `features=...`.**

**Explain what you have understood about this plot**:
- **Which variables are important in terms of explainability ?**
- **How do the values of the important variables affect predictions ?**

# **4. Deep Learning**

We will start by importing one of Python's Deep Learning libraries `tensorflow`/`keras`.

In [None]:
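import tensorflow as tf  # needed for the tf.keras.* names used in the notes below
from tensorflow.keras import layers
from keras.models import Sequential
from keras.layers import Dense, Dropout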

In the following cells, `dataset` should be the dataset you transformed during data preprocessing (`OneHotEncoder`, `LabelEncoder`, ...). <br>
You should run the following cells to prepare the data to train a Deep Learning model.

<u>**Question 4.1**:</u> <br>
**Make a small neural network model using `tensorflow`/`keras`, and print the accuracy**

*Note: You can use the following elements to train the neural network* <br>
- *`tf.keras.Sequential`*
- *`layers.Dense(INTEGER, activation='relu')`*
- *`tf.keras.losses.BinaryCrossentropy`*
- *`model.compile(optimizer='adam', ...)`*
- *`model.fit` with `epochs` ≈ 10*
- *`model.evaluate`*
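
Put together, these elements could be assembled as in the sketch below. It runs on made-up data, and the layer size (16 units), batch size and train/test split are arbitrary assumptions; replace the toy arrays with your preprocessed features and binary target.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Made-up binary classification data standing in for the preprocessed dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")
X_tr, X_te, y_tr, y_te = X[:400], X[400:], y[:400], y[400:]

# Small fully connected network with a sigmoid output for binary classification.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(X.shape[1],)),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

model.compile(
    optimizer="adam",
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=["accuracy"],
)

# Train for about 10 epochs, then evaluate on the held-out rows.
model.fit(X_tr, y_tr, epochs=10, batch_size=32, verbose=0)
loss, accuracy = model.evaluate(X_te, y_te, verbose=0)
print(f"Test accuracy: {accuracy:.3f}")
```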