# Machine Learning Project

 1) data curation: how you choose the features (X) and the target (Y)
 2) exploratory data analysis (including visualization and correlation matrix)
 3) univariate results (and meta analysis)
 4) multivariate results (and meta analysis)
 5) benchmark linear/logistic regressions (including higher-order polynomials and/or interaction terms)
 6) one machine learning algorithm (e.g., random forests or boosting)
 7) k-fold cross-validation 
 8) performance evaluation (R-squared, AUROC, etc.)
 9) key features (dimension reduction and feature selection techniques if necessary)
 10) synthetic interpretation of results

## Introduction

For my project I chose an **Airplane Engine Dataset**. The train dataset has **27** columns and **20631** entries; the test dataset contains **11939** entries.
The dataset contains the following columns:
- id: Engine ID, 
- cycle: Cycle number,
- setting1-3: Engine setting 1-3,
- s1-s21: Sensor measurements s1-s21,
- Y: Binary target label indicating engine swap (1: needs swapping, 0: does not need swapping).

## Importing libraries

In [3]:
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
from mpl_toolkits.mplot3d import Axes3D  # needed for projection='3d' on older matplotlib versions
from sklearn.linear_model import LinearRegression

## Load and clear up dataset

### 1) Load and edit the Dataset

In [4]:
# Load the dataset
dataset_train = pd.read_csv("PM_train.csv")
dataset_test = pd.read_csv("PM_test.csv")

# Label the last recorded cycle of each engine (the row where 'cycle' equals that engine's maximum) with Y = 1
dataset_train['Y'] = (dataset_train['cycle'] == dataset_train.groupby('id')['cycle'].transform('max')).astype(int)
dataset_test['Y'] = (dataset_test['cycle'] == dataset_test.groupby('id')['cycle'].transform('max')).astype(int)

# Define selected columns, including 'Y'
selected_columns = ['id', 'cycle', 'setting1', 'setting2', 's2', 's3', 's4', 's6', 's7', 's8', 's9', 's11', 's12', 's13', 's14', 's15', 's17', 's20', 's21', 'Y']

# Select the subset of data
selected_dataset_train = dataset_train[selected_columns]
selected_dataset_test = dataset_test[selected_columns]

# Sort by 'id' and 'cycle' to ensure the data is in the correct order
dataset_train = dataset_train.sort_values(['id', 'cycle'])
dataset_test = dataset_test.sort_values(['id', 'cycle'])

# Group by 'id' and select the last 10 cycles for each 'id'
last_10_cycles_train = dataset_train.groupby('id').tail(10)
last_10_cycles_test = dataset_test.groupby('id').tail(10)
# Further filter the selected columns for this subset
selected_last_10_cycles_train = last_10_cycles_train[selected_columns]
selected_last_10_cycles_test = last_10_cycles_test[selected_columns]

# Display the result
selected_last_10_cycles_train.head()
selected_last_10_cycles_test.head()

# Define the features (X) and target (Y)
X_train = selected_dataset_train.drop(columns=['Y', 'id', 'cycle'])  # Drop 'Y', 'id', and 'cycle' if they are not part of the model
Y_train = selected_dataset_train['Y']
X_test = selected_dataset_test.drop(columns=['Y', 'id', 'cycle'])  # Drop 'Y', 'id', and 'cycle' if they are not part of the model
Y_test = selected_dataset_test['Y']

# Alternatively, if using only the last 10 cycles:
X_last_10_train = selected_last_10_cycles_train.drop(columns=['Y', 'id', 'cycle'])
Y_last_10_train = selected_last_10_cycles_train['Y']
X_last_10_test = selected_last_10_cycles_test.drop(columns=['Y', 'id', 'cycle'])
Y_last_10_test = selected_last_10_cycles_test['Y']


After visualizing the dataset, I decided to keep the following columns (`id` and `cycle` are only used for sorting and grouping and are dropped before modelling; `Y` is the target):
- id: Engine ID,
- cycle: Cycle number,
- setting1: Engine setting 1,
- setting2: Engine setting 2,
- s2: Sensor measurement 2,
- s3: Sensor measurement 3,
- s4: Sensor measurement 4,
- s6: Sensor measurement 6,
- s7: Sensor measurement 7,
- s8: Sensor measurement 8,
- s9: Sensor measurement 9,
- s11: Sensor measurement 11,
- s12: Sensor measurement 12,
- s13: Sensor measurement 13,
- s14: Sensor measurement 14,
- s15: Sensor measurement 15,
- s17: Sensor measurement 17,
- s20: Sensor measurement 20,
- s21: Sensor measurement 21,
- Y: Binary target label indicating engine swap (1: needs swapping, 0: does not need swapping). I created this label myself by marking the last recorded cycle of each engine, i.e. the last row before the `id` goes up by one.

I kept these columns because they are the most relevant for predicting the target label `Y`: the remaining columns (`setting3`, `s1`, `s5`, `s10`, `s16`, `s18`, `s19`) contain null values or do not change over the cycles, so they carry no usable information.
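
A quick way to check which columns never change is to count the distinct values per column. The sketch below uses a made-up toy frame instead of `PM_train.csv`; the column values and the variable names `toy`, `constant_columns` and `informative_columns` are assumptions for illustration only.

```python
import pandas as pd

# Toy frame standing in for PM_train.csv (all values are made up)
toy = pd.DataFrame({
    "id":       [1, 1, 1, 2, 2, 2],
    "cycle":    [1, 2, 3, 1, 2, 3],
    "setting3": [100.0] * 6,    # never changes
    "s1":       [518.67] * 6,   # never changes
    "s2":       [641.8, 642.1, 642.4, 641.9, 642.3, 642.6],
})

# A column with a single distinct value cannot help separate Y = 0 from Y = 1.
# dropna=False counts NaN as a value, so an all-null column is also flagged.
n_unique = toy.nunique(dropna=False)
constant_columns = n_unique[n_unique <= 1].index.tolist()
informative_columns = [c for c in toy.columns if c not in constant_columns]

print("Constant columns:   ", constant_columns)
print("Informative columns:", informative_columns)
```

Running the same check on the real train set should reproduce the list of dropped columns, if they are indeed constant there.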
 

### 2.1) Visualization of the whole Dataset

In [5]:
# # Visualization 
# selected_dataset_train.head()
# selected_dataset_train.info()
# 
# for column in selected_dataset_train.columns:
#     if column not in ['id', 'Y']:  # Skip 'id' and 'Y' for this general visualization
#         plt.figure(figsize=(10, 5))
#         plt.title(f"Distribution of {column}")
#         plt.plot(selected_dataset_train['cycle'], selected_dataset_train[column], alpha=0.7)
#         plt.xlabel("Cycle")
#         plt.ylabel(column)
#         plt.grid(True, alpha=0.5)
#         plt.show()


In this part I visualized the whole dataset by plotting each column against the cycle number, to get a rough overview of the data.

### 2.2) Visualization of the last 10 cycles

In [6]:
# selected_last_10_cycles_train.head()
# selected_last_10_cycles_train.info()
# 
# for column in selected_last_10_cycles_train.columns:
#     if column not in ['id', 'Y']:  # Skip 'id' and 'Y' for this general visualization
#         plt.figure(figsize=(10, 5))
#         plt.title(f"Distribution of {column}")
#         plt.plot(selected_last_10_cycles_train['cycle'], selected_last_10_cycles_train[column], alpha=0.7)
#         plt.xlabel("Cycle")
#         plt.ylabel(column)
#         plt.grid(True, alpha=0.5)
#         plt.show()

In this part I visualized only the last 10 cycles of each engine, again plotting each column against the cycle number, to get a narrower view of the data shortly before the swap.
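
Restricting the data to the last cycles also changes the class balance, because every engine contributes exactly one `Y = 1` row. The following sketch illustrates this on made-up engine lengths; the helper `positive_share` is hypothetical and not part of the project code.

```python
import pandas as pd

# Toy data: 3 engines with 12, 15 and 20 recorded cycles (made-up lengths)
lengths = {1: 12, 2: 15, 3: 20}
toy = pd.DataFrame(
    [(eid, c) for eid, n in lengths.items() for c in range(1, n + 1)],
    columns=["id", "cycle"],
)

# Same labelling rule as in section 1: the last cycle of each engine gets Y = 1
toy["Y"] = (toy["cycle"] == toy.groupby("id")["cycle"].transform("max")).astype(int)


def positive_share(df, n_last):
    """Share of Y = 1 rows when only the last `n_last` cycles per engine are kept."""
    window = df.sort_values(["id", "cycle"]).groupby("id").tail(n_last)
    return window["Y"].mean()


share_all = toy["Y"].mean()          # 3 positives out of 47 rows
share_10 = positive_share(toy, 10)   # 3 positives out of 30 rows
share_5 = positive_share(toy, 5)     # 3 positives out of 15 rows

print(f"all cycles: {share_all:.3f}, last 10: {share_10:.3f}, last 5: {share_5:.3f}")
```

With a window of `n` cycles the positive share is `1 / n` (as long as every engine has at least `n` cycles), so a shorter window gives a more balanced but smaller training set.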

### 3) Scatter Plot Matrix

In [7]:
# # Heatmap
# plt.figure(figsize=(12, 10))
# correlation_matrix = selected_last_10_cycles_train.corr()  # Compute the correlation matrix
# sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
# plt.title('Correlation Matrix Heatmap (Last 10 Cycles)')
# plt.show()
# 
# # Scatter plot matrix
# numeric_columns = [col for col in selected_columns if col not in ['id', 'Y', 'cycle']]
# sns.pairplot(selected_last_10_cycles_train, vars=numeric_columns[:20])  # [:20] keeps all 17 numeric columns; use a smaller slice for a more readable plot
# plt.suptitle('Scatter Plot Matrix (Last 10 Cycles)', y=1.02)
# plt.show()
# 
# # 3D Scatter Plot
# fig = plt.figure(figsize=(10, 7))
# ax = fig.add_subplot(111, projection='3d')
# ax.scatter(
#     selected_last_10_cycles_train['setting1'],
#     selected_last_10_cycles_train['s2'],
#     selected_last_10_cycles_train['s3'],
#     c='b', marker='o'
# )
# 
# # Label the axes
# ax.set_xlabel('Setting 1')
# ax.set_ylabel('S2')
# ax.set_zlabel('S3')
# plt.title('3D Scatter Plot (Last 10 Cycles)')
# plt.show()

The correlation matrix and the scatter plot matrix show how strongly the columns are correlated with each other and how their values are distributed.
This process helped a lot in finding the sweet spot for the number of cycles to use before the engine gets swapped.
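
Besides reading the heatmap by eye, strongly correlated column pairs can also be listed programmatically. This is a minimal sketch on synthetic columns; the names `s_a`, `s_b`, `s_c` and the cut-off of 0.95 are assumptions, not values taken from the engine data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Made-up sensor columns: 's_b' is almost a rescaled copy of 's_a', 's_c' is independent noise
toy = pd.DataFrame({
    "s_a": base,
    "s_b": base * 2.0 + rng.normal(scale=0.01, size=200),
    "s_c": rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is skipped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.95  # hypothetical cut-off
pairs = upper.stack()
high_pairs = pairs[pairs > threshold].index.tolist()

print(high_pairs)
```

One column of each listed pair is a candidate for removal, since it adds little information beyond its partner.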

### 4) Logistic Regression

In [16]:
# Logistic Regression Example
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Step 1: Define features and target
# Use the selected columns prepared in section 1; 'id' and 'cycle' are not model inputs
features = selected_dataset_train.drop(columns=['Y', 'id', 'cycle'])
target = selected_dataset_train['Y']

# Step 2: Train-Test Split
# Note: this overwrites the X_train / X_test defined in section 1
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Step 3: Initialize and Train Logistic Regression Model
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)

# Step 4: Predict and Evaluate
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Accuracy and Classification Report
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

# AUROC Score
auroc = roc_auc_score(y_test, y_pred_proba)
print("AUROC:", auroc)

# Plot ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f"Logistic Regression (AUROC = {auroc:.2f})")
plt.plot([0, 1], [0, 1], 'k--', label="Random Guess")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend(loc="best")
plt.grid()
plt.show()
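
The project outline asks for k-fold cross-validation, and the positive class is rare in this data, so a single random split can give an unstable score. Below is a minimal sketch of stratified 5-fold cross-validation for a scaled, class-weighted logistic regression. It runs on synthetic data from `make_classification`, not on the engine dataset, and all parameter values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the engine data (roughly 5 % positives)
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=4, n_redundant=2,
    n_clusters_per_class=1, weights=[0.95, 0.05], random_state=42,
)

# Scaling lives inside the pipeline, so it is re-fitted on the training part of every fold
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)

# Stratified folds keep the share of positives similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")

print("AUROC per fold:", scores.round(3))
print("Mean AUROC:", round(scores.mean(), 3))
```

On the real data, rows of the same engine are not independent. Splitting by engine `id`, for example with `GroupKFold`, might therefore give a more honest estimate than splitting by row.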

In [9]:
# # from sklearn.metrics import accuracy_score, classification_report, r2_score
# # from sklearn.linear_model import LinearRegression
# # 
# # # Fit the regression model
# # model = LinearRegression()
# # # model.fit(X_train, Y_train)
# # model.fit(X_last_10_train, Y_last_10_train)
# # 
# # # Define the threshold for binary classification
# # threshold = 1
# # 
# # # Predict using the regression model
# # y_pred = model.predict(X_last_10_test)
# # 
# # print(y_pred.min(), y_pred.max())
# # 
# # # Convert regression outputs and true labels to binary
# # y_test_binary = selected_last_10_cycles_test['Y']
# # y_pred_binary = (y_pred <= threshold).astype(int)
# # 
# # # Evaluate the model
# # accuracy = accuracy_score(y_test_binary, y_pred_binary)
# # 
# # print(f'Accuracy: {accuracy}')
# # print(classification_report(y_test_binary, y_pred_binary))
# # print(f'R^2 {r2_score(y_test_binary, y_pred_binary)}')
# 
# from sklearn.metrics import accuracy_score, classification_report, r2_score
# from sklearn.linear_model import LogisticRegression
# 
# # Fit the logistic regression model
# model = LogisticRegression()
# model.fit(X_last_10_train, Y_last_10_train)
# 
# # Predict using the logistic regression model
# y_pred_proba = model.predict_proba(X_last_10_test)[:, 1]  # Get probability for the positive class
# 
# # Define the threshold for binary classification
# threshold = 0.5
# 
# # Convert probabilities to binary predictions
# y_pred_binary = (y_pred_proba >= threshold).astype(int)
# 
# # True labels for evaluation
# y_test_binary = selected_last_10_cycles_test['Y']
# 
# # Evaluate the model
# accuracy = accuracy_score(y_test_binary, y_pred_binary)
# 
# print(f'Accuracy: {accuracy}')
# print(classification_report(y_test_binary, y_pred_binary))
# print(f'R^2 {r2_score(y_test_binary, y_pred_binary)}')
# 
# from sklearn.metrics import accuracy_score, classification_report, r2_score
# from sklearn.linear_model import LogisticRegression
# from sklearn.preprocessing import StandardScaler
# from sklearn.utils.class_weight import compute_class_weight
# import numpy as np
# 
# # Scale the data
# scaler = StandardScaler()
# X_last_10_train_scaled = scaler.fit_transform(X_last_10_train)
# X_last_10_test_scaled = scaler.transform(X_last_10_test)
# 
# # Compute class weights to handle imbalance
# class_weights = compute_class_weight('balanced', classes=np.unique(Y_last_10_train), y=Y_last_10_train)
# class_weight_dict = {i: class_weights[i] for i in range(len(class_weights))}
# 
# # Fit the logistic regression model with increased iterations and class weights
# model = LogisticRegression(max_iter=500, class_weight=class_weight_dict)
# model.fit(X_last_10_train_scaled, Y_last_10_train)
# 
# # Predict probabilities for the positive class
# y_pred_proba = model.predict_proba(X_last_10_test_scaled)[:, 1]
# 
# # Define the threshold for binary classification
# threshold = 0.5
# y_pred_binary = (y_pred_proba >= threshold).astype(int)
# 
# # True labels for evaluation
# y_test_binary = selected_last_10_cycles_test['Y']
# 
# # Evaluate the model
# accuracy = accuracy_score(y_test_binary, y_pred_binary)
# print(f'Accuracy: {accuracy}')
# print(classification_report(y_test_binary, y_pred_binary, zero_division=1))
# print(f'R^2: {r2_score(y_test_binary, y_pred_binary)}')
# 
# 
# from sklearn.metrics import roc_auc_score, roc_curve
# import matplotlib.pyplot as plt
# 
# # Predict probabilities for the positive class
# y_pred_proba = model.predict_proba(X_last_10_test_scaled)[:, 1]
# 
# # Calculate AUROC
# auroc = roc_auc_score(y_test_binary, y_pred_proba)
# print(f'AUROC: {auroc}')
# 
# # Plot ROC curve
# fpr, tpr, thresholds = roc_curve(y_test_binary, y_pred_proba)
# plt.figure(figsize=(8, 6))
# plt.plot(fpr, tpr, label=f'Logistic Regression (AUROC = {auroc:.2f})')
# plt.plot([0, 1], [0, 1], 'k--', label='Random Guess')
# plt.xlabel('False Positive Rate')
# plt.ylabel('True Positive Rate')
# plt.title('Receiver Operating Characteristic (ROC) Curve')
# plt.legend(loc='best')
# plt.show()


### 5.1) Random Forest

In [10]:
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.model_selection import train_test_split, cross_val_score
# from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix, roc_curve
# 
# # Step 1: Create target labels indicating engine swap (1: needs swapping, 0: does not need swapping)
# train_data = dataset_train.copy()
# threshold = 10  # Define the threshold for remaining cycles
# train_data['RUL'] = train_data.groupby('id')['cycle'].transform('max') - train_data['cycle']
# train_data['Y'] = (train_data['RUL'] <= threshold).astype(int)
# 
# # Step 2: Define features (X) and target (Y)
# features = train_data.columns.difference(['id', 'cycle', 'RUL', 'Y'])
# X = train_data[features]
# y = train_data['Y']
# 
# # Step 3: Train-Test Split
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 
# # Step 4: Random Forest Classifier with Cross-Validation
# rf = RandomForestClassifier(random_state=42, n_estimators=100)
# cv_scores = cross_val_score(rf, X_train, y_train, cv=5, scoring='roc_auc')
# 
# # Fit the model
# rf.fit(X_train, y_train)
# 
# # Evaluate the model on the test set
# y_pred = rf.predict(X_test)
# y_pred_proba = rf.predict_proba(X_test)[:, 1]
# 
# # Performance Metrics
# classification_report_result = classification_report(y_test, y_pred)
# roc_auc = roc_auc_score(y_test, y_pred_proba)
# 
# # Print results
# print("Classification Report:\n", classification_report_result)
# print("ROC-AUC Score:", roc_auc)
# print("Mean Cross-Validation AUC:", cv_scores.mean())
# 
# # Generate the confusion matrix
# conf_matrix_rf = confusion_matrix(y_test, y_pred)
# 
# # Extract TP, FP, FN, TN from the confusion matrix
# TN_nn, FP_nn, FN_nn, TP_nn = conf_matrix_rf.ravel()
# 
# # Print the TP/FP breakdown
# print(f"True Positives (TP): {TP_nn}")
# print(f"False Positives (FP): {FP_nn}")
# print(f"True Negatives (TN): {TN_nn}")
# print(f"False Negatives (FN): {FN_nn}")
# 
# # Optionally, display the confusion matrix
# print("\nConfusion Matrix:")
# print(conf_matrix_rf)
# 
# auroc = roc_auc_score(y_test, y_pred_proba)
# print(f"\nAUROC: {auroc}")
# 
# # Plot the ROC curve
# fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
# plt.figure(figsize=(8, 6))
# plt.plot(fpr, tpr, label=f"AUROC = {auroc:.2f}")
# plt.plot([0, 1], [0, 1], 'k--', label="Random Guessing")
# plt.xlabel("False Positive Rate")
# plt.ylabel("True Positive Rate")
# plt.title("ROC Curve")
# plt.legend(loc="lower right")
# plt.grid()
# plt.show()

In [11]:
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.model_selection import train_test_split, cross_val_score
# from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix, roc_curve
# 
# 
# 
# rf = RandomForestClassifier(random_state=42, n_estimators=100)
# cv_scores = cross_val_score(rf, X_last_10_train, Y_last_10_train, cv=5, scoring='roc_auc')
# 
# rf.fit(X_last_10_train, Y_last_10_train)
# 
# y_pred = rf.predict(X_last_10_test)
# y_pred_proba = rf.predict_proba(X_last_10_test)[:, 1]
# 
# # Performance Metrics
# classification_report_result = classification_report(Y_last_10_test, y_pred)
# roc_auc = roc_auc_score(Y_last_10_test, y_pred_proba)
# 
# # Print results
# print("Classification Report:\n", classification_report_result)
# print("ROC-AUC Score:", roc_auc)
# print("Mean Cross-Validation AUC:", cv_scores.mean())
# 
# # Generate the confusion matrix
# conf_matrix_rf = confusion_matrix(Y_last_10_test, y_pred)
# 
# # Extract TP, FP, FN, TN from the confusion matrix
# TN_nn, FP_nn, FN_nn, TP_nn = conf_matrix_rf.ravel()
# 
# # Print the TP/FP breakdown
# print(f"True Positives (TP): {TP_nn}")
# print(f"False Positives (FP): {FP_nn}")
# print(f"True Negatives (TN): {TN_nn}")
# print(f"False Negatives (FN): {FN_nn}")
# 
# # Optionally, display the confusion matrix
# print("\nConfusion Matrix:")
# print(conf_matrix_rf)
# 
# auroc = roc_auc_score(Y_last_10_test, y_pred_proba)
# print(f"\nAUROC: {auroc}")
# 
# # Plot the ROC curve
# fpr, tpr, thresholds = roc_curve(Y_last_10_test, y_pred_proba)
# plt.figure(figsize=(8, 6))
# plt.plot(fpr, tpr, label=f"AUROC = {auroc:.2f}")
# plt.plot([0, 1], [0, 1], 'k--', label="Random Guessing")
# plt.xlabel("False Positive Rate")
# plt.ylabel("True Positive Rate")
# plt.title("ROC Curve")
# plt.legend(loc="lower right")
# plt.grid()
# plt.show()
# 


In [12]:
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.model_selection import train_test_split, cross_val_score
# from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix, roc_curve
# 
# # Step 1: Create target labels indicating engine swap (1: needs swapping, 0: does not need swapping)
# threshold = 10  # Define the threshold for remaining cycles
# 
# # train_data = dataset_train.copy()
# # train_data['RUL'] = train_data.groupby('id')['cycle'].transform('max') - train_data['cycle']
# # train_data['Y'] = (train_data['RUL'] <= threshold).astype(int)
# 
# # Step 2: Define features (X) and target (Y)
# # features = train_data.columns.difference(['id',  'cycle', 'RUL', 'Y'])
# # X = train_data[features]
# # y = train_data['Y']
# 
# # Step 3: Train-Test Split
# # X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 
# # Step 4: Random Forest Classifier with Cross-Validation
# rf = RandomForestClassifier(random_state=42, n_estimators=100)
# cv_scores = cross_val_score(rf, X_train, Y_train, cv=5, scoring='roc_auc')
# 
# # Fit the model
# rf.fit(X_train, Y_train)
# 
# # Evaluate the model on the test set
# y_pred = rf.predict(X_test)
# y_pred_proba = rf.predict_proba(X_test)[:, 1]
# 
# # Performance Metrics
# classification_report_result = classification_report(Y_test, y_pred)
# roc_auc = roc_auc_score(Y_test, y_pred_proba)
# 
# # Print results
# print("Classification Report:\n", classification_report_result)
# print("ROC-AUC Score:", roc_auc)
# print("Mean Cross-Validation AUC:", cv_scores.mean())
# 
# # Generate the confusion matrix
# conf_matrix_rf = confusion_matrix(Y_test, y_pred)
# 
# # Extract TP, FP, FN, TN from the confusion matrix
# TN_nn, FP_nn, FN_nn, TP_nn = conf_matrix_rf.ravel()
# 
# # Print the TP/FP breakdown
# print(f"True Positives (TP): {TP_nn}")
# print(f"False Positives (FP): {FP_nn}")
# print(f"True Negatives (TN): {TN_nn}")
# print(f"False Negatives (FN): {FN_nn}")
# 
# # Optionally, display the confusion matrix
# print("\nConfusion Matrix:")
# print(conf_matrix_rf)
# 
# auroc = roc_auc_score(Y_test, y_pred_proba)
# print(f"\nAUROC: {auroc}")
# 
# # Plot the ROC curve
# fpr, tpr, thresholds = roc_curve(Y_test, y_pred_proba)
# plt.figure(figsize=(8, 6))
# plt.plot(fpr, tpr, label=f"AUROC = {auroc:.2f}")
# plt.plot([0, 1], [0, 1], 'k--', label="Random Guessing")
# plt.xlabel("False Positive Rate")
# plt.ylabel("True Positive Rate")
# plt.title("ROC Curve")
# plt.legend(loc="lower right")
# plt.grid()
# plt.show()
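
A fitted random forest can also be used to rank the features, which addresses the "key features" item of the outline. The sketch below shows the idea on synthetic data in which only one made-up column (`s_signal`) determines the label. It is an illustration under these assumptions, not a result for the engine data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 500

# Made-up features: only 's_signal' drives the label, the other two are pure noise
X = pd.DataFrame({
    "s_noise_1": rng.normal(size=n),
    "s_signal": rng.normal(size=n),
    "s_noise_2": rng.normal(size=n),
})
y = (X["s_signal"] > 0.5).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Impurity-based importances sum to 1; a higher value means the feature produced better splits
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can be biased towards features with many distinct values, so they are best read as a rough ranking rather than an exact measure.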

### 5.2) Gradient Boosting

In [13]:
# from xgboost import XGBClassifier
# from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix, roc_curve
# from sklearn.model_selection import cross_val_score, train_test_split
# from sklearn.preprocessing import StandardScaler
# 
# # Step 1: Scale the Features (Gradient Boosting may not require scaling, but it can help in some cases)
# scaler = StandardScaler()
# X_train_scaled = scaler.fit_transform(X_train)
# X_test_scaled = scaler.transform(X_test)
# 
# # Step 2: Define the Gradient Boosting Classifier
# xgb = XGBClassifier(
#     n_estimators=100,          # Number of trees
#     learning_rate=0.1,         # Step size for each iteration
#     max_depth=3,               # Maximum tree depth
#     subsample=0.8,             # Fraction of samples used for training each tree
#     colsample_bytree=0.8,      # Fraction of features used for training each tree
#     random_state=42            # For reproducibility
# )
# 
# # Step 3: Cross-Validation
# cv_scores = cross_val_score(xgb, X_train_scaled, y_train, cv=5, scoring='roc_auc')
# 
# # Step 4: Train the Model
# xgb.fit(X_train_scaled, y_train)
# 
# # Step 5: Make Predictions
# y_pred = xgb.predict(X_test_scaled)
# y_pred_proba = xgb.predict_proba(X_test_scaled)[:, 1]
# 
# # Step 6: Evaluate the Model
# print("Classification Report:\n", classification_report(y_test, y_pred))
# print("ROC-AUC Score:", roc_auc_score(y_test, y_pred_proba))
# print("Mean Cross-Validation AUC:", cv_scores.mean())
# 
# # Generate the confusion matrix
# conf_matrix_gb = confusion_matrix(y_test, y_pred)
# 
# # Extract TP, FP, FN, TN from the confusion matrix
# TN_nn, FP_nn, FN_nn, TP_nn = conf_matrix_gb.ravel()
# 
# # Print the TP/FP breakdown
# print(f"True Positives (TP): {TP_nn}")
# print(f"False Positives (FP): {FP_nn}")
# print(f"True Negatives (TN): {TN_nn}")
# print(f"False Negatives (FN): {FN_nn}")
# 
# # Optionally, display the confusion matrix
# print("\nConfusion Matrix:")
# print(conf_matrix_gb)
# 
# auroc = roc_auc_score(y_test, y_pred_proba)
# print(f"\nAUROC: {auroc}")
# 
# # Plot the ROC curve
# fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
# plt.figure(figsize=(8, 6))
# plt.plot(fpr, tpr, label=f"AUROC = {auroc:.2f}")
# plt.plot([0, 1], [0, 1], 'k--', label="Random Guessing")
# plt.xlabel("False Positive Rate")
# plt.ylabel("True Positive Rate")
# plt.title("ROC Curve")
# plt.legend(loc="lower right")
# plt.grid()
# plt.show()

### 5.3) Neural Networks (Test)

In [14]:
# import tensorflow as tf
# from tensorflow.keras.models import Sequential
# from tensorflow.keras.layers import Dense, Dropout
# from tensorflow.keras.callbacks import EarlyStopping
# from sklearn.preprocessing import StandardScaler
# from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix, roc_curve
# 
# # Step 1: Scale the Features
# scaler = StandardScaler()
# X_train_scaled = scaler.fit_transform(X_train)
# X_test_scaled = scaler.transform(X_test)
# 
# # Step 2: Build the Neural Network
# model = Sequential([
#     Dense(64, activation='relu', input_dim=X_train_scaled.shape[1]),
#     Dropout(0.3),  # Add dropout for regularization
#     Dense(32, activation='relu'),
#     Dropout(0.3),
#     Dense(1, activation='sigmoid')  # Output layer with sigmoid for binary classification
# ])
# 
# # Compile the model
# model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# 
# # Step 3: Train the Model
# early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
# 
# history = model.fit(
#     X_train_scaled, y_train,
#     validation_split=0.2,
#     epochs=100,
#     batch_size=32,
#     callbacks=[early_stopping],
#     verbose=1
# )
# 
# # Step 4: Evaluate the Model
# y_pred_proba_nn = model.predict(X_test_scaled)
# y_pred_nn = (y_pred_proba_nn > 0.5).astype(int)
# 
# # Performance Metrics
# print("Classification Report:\n", classification_report(y_test, y_pred_nn))
# print("ROC-AUC Score:", roc_auc_score(y_test, y_pred_proba_nn))
# 
# # Generate the confusion matrix
# conf_matrix_nn = confusion_matrix(y_test, y_pred_nn)
# 
# # Extract TP, FP, FN, TN from the confusion matrix
# TN_nn, FP_nn, FN_nn, TP_nn = conf_matrix_nn.ravel()
# 
# # Print the TP/FP breakdown
# print(f"True Positives (TP): {TP_nn}")
# print(f"False Positives (FP): {FP_nn}")
# print(f"True Negatives (TN): {TN_nn}")
# print(f"False Negatives (FN): {FN_nn}")
# 
# # Optionally, display the confusion matrix
# print("\nConfusion Matrix:")
# print(conf_matrix_nn)
# 
# auroc = roc_auc_score(y_test, y_pred_proba_nn)
# print(f"AUROC: {auroc}")
# 
# # Plot the ROC curve
# fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba_nn)
# plt.figure(figsize=(8, 6))
# plt.plot(fpr, tpr, label=f"AUROC = {auroc:.2f}")
# plt.plot([0, 1], [0, 1], 'k--', label="Random Guessing")
# plt.xlabel("False Positive Rate")
# plt.ylabel("True Positive Rate")
# plt.title("ROC Curve")
# plt.legend(loc="lower right")
# plt.grid()
# plt.show()

### 5.4) Ensemble Learning (Test)

In [15]:
# from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
# from sklearn.model_selection import train_test_split
# from sklearn.metrics import roc_auc_score, classification_report, accuracy_score, roc_curve
# import matplotlib.pyplot as plt
# 
# # Example: Assuming X (features) and y (target) are already defined.
# # Split the data
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# 
# # Random Forest
# rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
# rf_model.fit(X_train, y_train)
# rf_predictions = rf_model.predict(X_test)
# rf_probabilities = rf_model.predict_proba(X_test)[:, 1]  # Probabilities for the positive class
# 
# # Gradient Boosting
# gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
# gb_model.fit(X_train, y_train)
# gb_predictions = gb_model.predict(X_test)
# gb_probabilities = gb_model.predict_proba(X_test)[:, 1]  # Probabilities for the positive class
# 
# # Evaluate the models
# rf_auroc = roc_auc_score(y_test, rf_probabilities)
# gb_auroc = roc_auc_score(y_test, gb_probabilities)
# 
# print("Random Forest AUROC:", rf_auroc)
# print("Gradient Boosting AUROC:", gb_auroc)
# 
# # Optional: Classification reports and accuracy
# print("\nRandom Forest Classification Report:\n", classification_report(y_test, rf_predictions))
# print("\nGradient Boosting Classification Report:\n", classification_report(y_test, gb_predictions))
# 
# # Plot the ROC curves
# rf_fpr, rf_tpr, _ = roc_curve(y_test, rf_probabilities)
# gb_fpr, gb_tpr, _ = roc_curve(y_test, gb_probabilities)
# 
# plt.figure(figsize=(8, 6))
# plt.plot(rf_fpr, rf_tpr, label=f"Random Forest (AUROC = {rf_auroc:.2f})")
# plt.plot(gb_fpr, gb_tpr, label=f"Gradient Boosting (AUROC = {gb_auroc:.2f})")
# plt.plot([0, 1], [0, 1], 'k--', label="Random Guessing")
# plt.xlabel("False Positive Rate")
# plt.ylabel("True Positive Rate")
# plt.title("ROC Curves")
# plt.legend()
# plt.show()
# 
# from sklearn.metrics import confusion_matrix
# 
# # Compute confusion matrices for each model
# rf_confusion = confusion_matrix(y_test, rf_predictions)
# gb_confusion = confusion_matrix(y_test, gb_predictions)
# 
# # Extract TP, TN, FP, FN for Random Forest
# rf_tn, rf_fp, rf_fn, rf_tp = rf_confusion.ravel()
# 
# # Extract TP, TN, FP, FN for Gradient Boosting
# gb_tn, gb_fp, gb_fn, gb_tp = gb_confusion.ravel()
# 
# # Display results
# print("Random Forest Confusion Matrix:")
# print(rf_confusion)
# print(f"True Positives (TP): {rf_tp}")
# print(f"True Negatives (TN): {rf_tn}")
# print(f"False Positives (FP): {rf_fp}")
# print(f"False Negatives (FN): {rf_fn}\n")
# 
# print("Gradient Boosting Confusion Matrix:")
# print(gb_confusion)
# print(f"True Positives (TP): {gb_tp}")
# print(f"True Negatives (TN): {gb_tn}")
# print(f"False Positives (FP): {gb_fp}")
# print(f"False Negatives (FN): {gb_fn}")
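
The cells above all unpack the confusion matrix with `ravel()`. As a small worked example of how the four counts relate to precision, recall and F1, here is a sketch on ten made-up labels (these are not model outputs from this project):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels and predictions for ten samples
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# For binary labels {0, 1}, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # share of predicted swaps that were real swaps
recall = tp / (tp + fn)      # share of real swaps that were detected
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")

# The hand-computed values agree with scikit-learn's own helpers
assert abs(precision - precision_score(y_true, y_pred)) < 1e-12
assert abs(recall - recall_score(y_true, y_pred)) < 1e-12
```

For a maintenance setting, recall on the `Y = 1` class is often the more important number, because a missed swap (FN) is usually costlier than an unnecessary inspection (FP).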