# 04 - Model Insights

This notebook draws insights from the trained models: performance analysis, key takeaways, and recommendations for improvement.

- PCA reduced the Gradient Boosting model's performance (accuracy fell from 0.96 to 0.85), which suggests the projection discarded information the classifier relied on.
- Feature engineering and feature selection improved model performance, with Gradient Boosting as the top performer.


In [None]:
from _Model_Comparator_class import modelComparator
import pandas as pd

feature_selec_df = pd.read_parquet('data/feature_selec_df.parquet')

In [85]:
model_to_test = ['gradient_boosting PCA']
model_comparator_feature_selec = modelComparator(feature_selec_df)
model_comparator_feature_selec.compare_models(model_to_test, 10, 25)

{'Best Params': {'model__n_estimators': 220, 'model__learning_rate': 0.018936332533954676, 'model__max_depth': 5, 'model__subsample': 0.6565243115041418, 'model__random_state': 42}, 'Recall': 0.9272727272727272, 'Precision': 0.796875, 'Accuracy': 0.8454545454545455, 'F1 Score': 0.8571428571428571, 'AUC ROC': 0.8454545454545455, 'Model': 'gradient_boosting PCA', 'Processing Time (s)': 1625.2171738147736}


| | Best Params | Recall | Precision | Accuracy | F1 Score | AUC ROC | Model | Processing Time (s) |
|---|---|---|---|---|---|---|---|---|
| 0 | {'model__n_estimators': 220, 'model__learning_... | 0.927273 | 0.796875 | 0.845455 | 0.857143 | 0.845455 | gradient_boosting PCA | 1625.217174 |


## 1. Key Indicators

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Get the best model from model_comparator_feature_selec
best_model = model_comparator_feature_selec.best_model

# Ensure that the test set is not used in any part of the model training or hyperparameter tuning process
X_test = model_comparator_feature_selec.X_test
y_test = model_comparator_feature_selec.y_test

# Predict probabilities for the test set
y_pred_proba = best_model.predict_proba(X_test)[:, 1]

# Calculate the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

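# Calculate the AUC from the predicted probabilities, the same scores used for the curve.
# (An AUC computed from hard 0/1 predictions collapses to balanced accuracy.)
auc = roc_auc_score(y_test, y_pred_proba)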

# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random guess')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='best')
plt.show()

The ROC curve shows strong performance, with an AUC of 0.96. The curve stays close to the top-left corner, so the model reaches a high true positive rate at a low false positive rate and separates the positive and negative classes well. The dashed diagonal marks a random guess (AUC = 0.5); the model's curve lies well above it.
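To make the AUC figure concrete, the sketch below computes it by hand: it sweeps a threshold over the scores, collects (FPR, TPR) points, and integrates with the trapezoid rule. The toy labels and probabilities are hypothetical and not taken from this notebook, and the helper names `roc_points` and `trapezoid_auc` are introduced here for illustration only. The sketch also shows that feeding hard 0/1 predictions instead of probabilities reduces the AUC to balanced accuracy. That may be why the AUC ROC and accuracy reported for `gradient_boosting PCA` above are identical, though this is an inference rather than something the notebook confirms.

```python
def roc_points(y_true, scores):
    """Sweep a threshold over every distinct score and collect (FPR, TPR) points."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_auc(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical toy data, not taken from the notebook.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
proba = [0.1, 0.2, 0.45, 0.6, 0.4, 0.55, 0.8, 0.9]
hard = [1 if p >= 0.5 else 0 for p in proba]  # thresholded predictions

auc_from_proba = trapezoid_auc(roc_points(y_true, proba))
auc_from_hard = trapezoid_auc(roc_points(y_true, hard))

# Balanced accuracy = mean of true positive rate and true negative rate.
n_pos = sum(y_true)
n_neg = len(y_true) - n_pos
tpr = sum(1 for y, h in zip(y_true, hard) if y == 1 and h == 1) / n_pos
tnr = sum(1 for y, h in zip(y_true, hard) if y == 0 and h == 0) / n_neg
balanced_accuracy = (tpr + tnr) / 2

print(auc_from_proba, auc_from_hard, balanced_accuracy)
```

On this toy data the probability-based AUC is 0.8125, while the hard-label AUC and the balanced accuracy are both 0.75. The ranking information in the probabilities is what the ROC curve measures, so thresholding first throws part of it away.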

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
import matplotlib.pyplot as plt

# Predict probabilities for the test set, selecting only the positive class
y_pred_proba = best_model.predict_proba(X_test)[:, 1]

# Convert probabilities to binary predictions (using a threshold of 0.5)
y_pred = (y_pred_proba >= 0.5).astype(int)

# Generate the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Display the confusion matrix
ConfusionMatrixDisplay(conf_matrix).plot(cmap="Blues")
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

# Print the classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))



The confusion matrix and classification report show balanced performance across both classes. Of the 2,200 test instances, the model produces 1051 true negatives and 1060 true positives, against 49 false positives and 40 false negatives.

The classification report confirms this: both classes reach precision, recall, and F1-scores of approximately 0.96, so neither precision nor recall is being traded off against the other. Overall accuracy on the test set is 96%.

About 4% of test instances are misclassified, and the errors are split almost evenly between false positives and false negatives, so the model shows no obvious bias toward either class.
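The figures in the classification report can be checked directly from the four confusion-matrix counts quoted above. The following is a minimal sketch using only those counts; the variable names are chosen here for illustration.

```python
# Counts quoted from the confusion matrix above.
tn, fp, fn, tp = 1051, 49, 40, 1060
total = tn + fp + fn + tp

accuracy = (tp + tn) / total

# Positive class (label 1).
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# Negative class (label 0): the roles of the counts are mirrored.
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)
f1_neg = 2 * precision_neg * recall_neg / (precision_neg + recall_neg)

print(f"total={total} accuracy={accuracy:.6f}")
print(f"class 1: precision={precision_pos:.4f} recall={recall_pos:.4f} f1={f1_pos:.4f}")
print(f"class 0: precision={precision_neg:.4f} recall={recall_neg:.4f} f1={f1_neg:.4f}")
```

Every per-class metric rounds to 0.96, and the accuracy of 2111/2200 matches the 0.959545 reported for the model without PCA.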

## 2. Explorations

In [None]:
model_to_test = ['random_forest + xgboost + gradient_boosting', 'gradient_boosting PCA']
model_comparator_feature_selec = modelComparator(feature_selec_df)
model_comparator_feature_selec.compare_models(model_to_test, 10, 20)


Applying PCA (Principal Component Analysis) noticeably degraded the Gradient Boosting model. Without PCA, the model reached an accuracy of 0.96 (0.959545) and an AUC ROC of 0.96; with PCA, both dropped to 0.85.

This decline indicates that the PCA projection did not preserve the information the classifier needed. PCA keeps the directions of highest variance, which can reduce noise and computational cost, but variance is not the same as predictive value: a low-variance component can still carry most of the class signal. Here, discarding such components likely removed patterns that Gradient Boosting relied on, which reduced its predictive capability.
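The mechanism can be illustrated with a small synthetic example. This is a sketch on made-up data, not a reproduction of the notebook's pipeline. One feature has large variance but no relation to the label, while the other has tiny variance and determines the label exactly. To stay dependency-free, the sketch computes the first principal component of the 2x2 covariance matrix in closed form rather than through a PCA library, and `best_threshold_accuracy` is a hypothetical helper that scores the best single-threshold classifier on one feature.

```python
import math

# Synthetic data (assumption): x1 is high-variance noise, x2 is a low-variance signal.
n = 200
y = [i % 2 for i in range(n)]
x1 = [-10.0 + 20.0 * i / (n - 1) for i in range(n)]
x2 = [0.1 if label == 1 else -0.1 for label in y]

# Center the features and build the 2x2 sample covariance matrix.
m1, m2 = sum(x1) / n, sum(x2) / n
c1 = [a - m1 for a in x1]
c2 = [b - m2 for b in x2]
var1 = sum(a * a for a in c1) / (n - 1)
var2 = sum(b * b for b in c2) / (n - 1)
cov12 = sum(a * b for a, b in zip(c1, c2)) / (n - 1)

# Leading eigenpair of a symmetric 2x2 matrix, in closed form.
theta = 0.5 * math.atan2(2.0 * cov12, var1 - var2)
pc1 = (math.cos(theta), math.sin(theta))
lead_eig = (var1 + var2) / 2.0 + math.hypot((var1 - var2) / 2.0, cov12)
explained = lead_eig / (var1 + var2)  # share of variance kept by one component

# Project onto the first principal component (reduces 2 features to 1).
z = [a * pc1[0] + b * pc1[1] for a, b in zip(c1, c2)]


def best_threshold_accuracy(feature, labels):
    """Accuracy of the best single-threshold classifier, allowing either polarity."""
    best = 0.0
    for t in set(feature):
        hits = sum((f >= t) == (lab == 1) for f, lab in zip(feature, labels))
        acc = hits / len(labels)
        best = max(best, acc, 1.0 - acc)
    return best


print(f"variance explained by PC1: {explained:.4f}")
print(f"best accuracy on raw x2:   {best_threshold_accuracy(x2, y):.3f}")
print(f"best accuracy on PC1:      {best_threshold_accuracy(z, y):.3f}")
```

In this construction the single retained component explains more than 99% of the variance, yet a threshold on it does barely better than chance, while the discarded low-variance feature separates the classes perfectly. Real data is rarely this extreme, but the same effect in milder form is consistent with the drop observed here.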

## Conclusion

The modeling process began with an evaluation of several algorithms. In the initial results, Gradient Boosting achieved an AUC ROC of 0.87 and XGBoost 0.86, with Random Forest and Logistic Regression performing similarly.

After feature engineering, performance improved across all models. Gradient Boosting in particular reached an accuracy of 0.94 and an AUC ROC of 0.94.

Feature selection then raised Gradient Boosting's accuracy and AUC ROC to 0.96, which suggests that a narrower feature set let the model capture the underlying patterns in the data more effectively.

PCA was then applied to reduce dimensionality further, but accuracy and AUC ROC both fell to 0.85. The projection appears to have removed information needed for classification, which underlines the importance of retaining the key features.

In conclusion, iterative feature engineering and selection produced substantial gains in model performance, particularly for Gradient Boosting, which emerged as the top-performing algorithm.