# Model Evaluation

### Original Data Set

In [5]:
# Import necessary libraries
import pandas as pd

# Create the table for model comparison based on the provided results for the original dataset
data_original_dataset = {
    'Model': ['Logistic Regression', 'Random Forest', 'XGBoost'],
    'Accuracy': [0.53, 0.80, 0.71],
    'Precision (Class 0)': [0.81, 0.80, 0.80],
    'Recall (Class 0)': [0.55, 1.00, 0.85],
    'F1-Score (Class 0)': [0.65, 0.89, 0.82],
    'Precision (Class 1)': [0.21, 0.00, 0.19],
    'Recall (Class 1)': [0.49, 0.00, 0.14],
    'F1-Score (Class 1)': [0.30, 0.00, 0.16]
}

# Create DataFrame
df_original_dataset_results = pd.DataFrame(data_original_dataset)

# Display the DataFrame
df_original_dataset_results


| | Model | Accuracy | Precision (Class 0) | Recall (Class 0) | F1-Score (Class 0) | Precision (Class 1) | Recall (Class 1) | F1-Score (Class 1) |
|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.53 | 0.81 | 0.55 | 0.65 | 0.21 | 0.49 | 0.30 |
| 1 | Random Forest | 0.80 | 0.80 | 1.00 | 0.89 | 0.00 | 0.00 | 0.00 |
| 2 | XGBoost | 0.71 | 0.80 | 0.85 | 0.82 | 0.19 | 0.14 | 0.16 |


**Conclusion:** 
- **Logistic Regression is the most balanced** model but still struggles to accurately predict defaulters, with an overall low performance for Class 1.
- **Random Forest** performs very well for non-defaulters (Class 0) but **completely fails to identify defaulters (Class 1)**. This is a significant problem, as identifying defaulters is critical.
- **XGBoost** offers a better balance than Random Forest, though its performance on defaulters is still weak.
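
The Random Forest row can be reproduced by a model that never predicts Class 1. The sketch below is a worked example under an assumed test set of 800 non-defaulters and 200 defaulters (the real counts are not given here); `per_class_metrics` is an illustrative helper, not code from the notebook.

```python
def per_class_metrics(tp, fp, fn):
    """Return (precision, recall, f1) for one class from its confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical test set (assumption): 800 non-defaulters, 200 defaulters.
n_class0, n_class1 = 800, 200

# A model that predicts Class 0 for every row.
# Class 0: all 800 are true positives, the 200 defaulters are false positives.
p0, r0, f0 = per_class_metrics(tp=n_class0, fp=n_class1, fn=0)
# Class 1: nothing is ever predicted as Class 1, so all 200 are false negatives.
p1, r1, f1_1 = per_class_metrics(tp=0, fp=0, fn=n_class1)
accuracy = n_class0 / (n_class0 + n_class1)

print(f"accuracy={accuracy:.2f}")
print(f"class 0: precision={p0:.2f} recall={r0:.2f} f1={f0:.2f}")
print(f"class 1: precision={p1:.2f} recall={r1:.2f} f1={f1_1:.2f}")
```

Rounded to two decimals, these values match the Random Forest row (0.80 accuracy, 0.80 / 1.00 / 0.89 for Class 0, zeros for Class 1), which is why accuracy alone is misleading on this data.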

### After Applying SMOTE


In [9]:
# Import necessary libraries
import pandas as pd

# Create the table for model comparison based on the provided results after applying SMOTE
data_smote = {
    'Model': ['Logistic Regression', 'Random Forest', 'XGBoost'],
    'Accuracy': [0.65, 0.80, 0.71],
    'Precision (Class 0)': [0.80, 0.80, 0.80],
    'Recall (Class 0)': [0.76, 1.00, 0.85],
    'F1-Score (Class 0)': [0.78, 0.89, 0.83],
    'Precision (Class 1)': [0.19, 0.00, 0.21],
    'Recall (Class 1)': [0.23, 0.00, 0.15],
    'F1-Score (Class 1)': [0.21, 0.00, 0.18]
}

# Create a DataFrame
df_smote_results = pd.DataFrame(data_smote)

# Display the DataFrame
df_smote_results

| | Model | Accuracy | Precision (Class 0) | Recall (Class 0) | F1-Score (Class 0) | Precision (Class 1) | Recall (Class 1) | F1-Score (Class 1) |
|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.65 | 0.80 | 0.76 | 0.78 | 0.19 | 0.23 | 0.21 |
| 1 | Random Forest | 0.80 | 0.80 | 1.00 | 0.89 | 0.00 | 0.00 | 0.00 |
| 2 | XGBoost | 0.71 | 0.80 | 0.85 | 0.83 | 0.21 | 0.15 | 0.18 |


**Conclusion**

- **Random Forest** has the highest accuracy but fails to identify defaulters, making it unsuitable when Class 1 (defaulters) is important.

- **Logistic Regression** is more balanced, providing slightly better recall and F1-score for defaulters than the other models, but still needs improvement.

- **XGBoost** offers better results than Random Forest for defaulters, but overall performance remains weak.
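
SMOTE creates synthetic minority rows by interpolating between a minority sample and one of its nearest minority neighbours. The following is a simplified NumPy sketch of that idea, not the library implementation the notebook may have used; `smote_like`, `k`, and the toy points are illustrative assumptions.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances among the minority samples.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    # Pick a base point and one of its k nearest neighbours for each new sample.
    base = rng.integers(0, len(X_min), size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    # Place the new point at a random position on the segment between them.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# Hypothetical minority-class features (assumption): five 2-D points.
X_min = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5], [1.8, 1.6]])
synthetic = smote_like(X_min, n_new=6)
print(synthetic.shape)
```

Because every synthetic point lies on a segment between two existing minority points, the method adds no information from outside the region the minority class already occupies.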

### After Applying Random Under Sampling

In [17]:
# Import necessary libraries
import pandas as pd

# Create the table for model comparison based on the provided results
data_undersampling_full = {
    'Model': ['Logistic Regression', 'Random Forest', 'XGBoost'],
    'Accuracy': [0.52, 0.51, 0.48],
    'Precision (Class 0)': [0.80, 0.79, 0.78],
    'Recall (Class 0)': [0.54, 0.53, 0.48],
    'F1-Score (Class 0)': [0.64, 0.64, 0.60],
    'Precision (Class 1)': [0.20, 0.19, 0.18],
    'Recall (Class 1)': [0.47, 0.43, 0.45],
    'F1-Score (Class 1)': [0.28, 0.26, 0.25]
}

# Create a DataFrame
df_undersampling_full_results = pd.DataFrame(data_undersampling_full)

# Display the DataFrame
df_undersampling_full_results


| | Model | Accuracy | Precision (Class 0) | Recall (Class 0) | F1-Score (Class 0) | Precision (Class 1) | Recall (Class 1) | F1-Score (Class 1) |
|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.52 | 0.80 | 0.54 | 0.64 | 0.20 | 0.47 | 0.28 |
| 1 | Random Forest | 0.51 | 0.79 | 0.53 | 0.64 | 0.19 | 0.43 | 0.26 |
| 2 | XGBoost | 0.48 | 0.78 | 0.48 | 0.60 | 0.18 | 0.45 | 0.25 |


**Conclusion:**

- Logistic Regression performs best overall at balancing accuracy and performance across both classes. It still does not perform well for defaulters (Class 1), but it has the highest accuracy (0.52), the highest recall for defaulters (0.47), and the best results for non-defaulters (Class 0).

- Random Forest has slightly lower recall for defaulters (Class 1) than Logistic Regression (0.43 vs. 0.47), but its overall performance is very close.
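
Random under sampling balances the classes by discarding majority-class rows. A minimal pandas sketch, assuming a target column named `default` and a toy frame that stands in for the real data (both are illustrative, not taken from the notebook):

```python
import pandas as pd

def random_undersample(df, target, seed=42):
    """Downsample every class to the size of the smallest class."""
    n_min = df[target].value_counts().min()
    return df.groupby(target).sample(n=n_min, random_state=seed)

# Hypothetical toy data (assumption): 8 non-defaulters, 2 defaulters.
toy = pd.DataFrame({
    "loan_amount": range(10),
    "default": [0] * 8 + [1] * 2,
})

balanced = random_undersample(toy, target="default")
print(balanced["default"].value_counts().to_dict())
```

In this toy case six of the eight majority rows are thrown away, so the models see far less data than before.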

### After Applying Random Over Sampling

In [23]:
# Results Comparison Among the models

import pandas as pd

# Create the data for model comparison after random oversampling
data = {
    'Model': ['Logistic Regression', 'XGBoost', 'Random Forest'],
    'Accuracy': [0.54, 0.89, 0.97],
    'Precision (Class 0)': [0.56, 0.95, 0.97],
    'Recall (Class 0)': [0.54, 0.82, 0.98],
    'F1-Score (Class 0)': [0.55, 0.88, 0.97],
    'Precision (Class 1)': [0.53, 0.83, 0.98],
    'Recall (Class 1)': [0.54, 0.96, 0.97],
    'F1-Score (Class 1)': [0.54, 0.89, 0.97]
}

# Create a DataFrame
df_results = pd.DataFrame(data)

# Display the DataFrame
df_results

| | Model | Accuracy | Precision (Class 0) | Recall (Class 0) | F1-Score (Class 0) | Precision (Class 1) | Recall (Class 1) | F1-Score (Class 1) |
|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.54 | 0.56 | 0.54 | 0.55 | 0.53 | 0.54 | 0.54 |
| 1 | XGBoost | 0.89 | 0.95 | 0.82 | 0.88 | 0.83 | 0.96 | 0.89 |
| 2 | Random Forest | 0.97 | 0.97 | 0.98 | 0.97 | 0.98 | 0.97 | 0.97 |


**Conclusion**

- **Logistic Regression** *underperformed* with an accuracy of 0.54. It struggled to accurately distinguish defaulters, showing balanced but low precision and recall for both classes.
- **XGBoost** *performed well*, achieving an **accuracy of 0.89**. It demonstrated strong precision for non-defaulters and excellent recall for defaulters, making it a good candidate for identifying defaulters while maintaining overall performance.
- **Random Forest** was the best performer with an **accuracy of 0.97**, showing **near-perfect precision and recall for both defaulters and non-defaulters**. It effectively handled the imbalanced data and provides a robust solution.
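
Random over sampling duplicates minority rows, so the order of splitting and resampling matters when scores are meant to reflect unseen data. The notebook does not show where the resampling happened; the sketch below assumes the test set is held out first and only the training rows are resampled. The `row_id` and `default` columns and the toy frame are hypothetical.

```python
import pandas as pd

def random_oversample(train, target, seed=42):
    """Resample every class with replacement up to the size of the largest class."""
    n_max = train[target].value_counts().max()
    return train.groupby(target).sample(n=n_max, replace=True, random_state=seed)

# Hypothetical toy data (assumption): 16 non-defaulters, 4 defaulters.
toy = pd.DataFrame({
    "row_id": range(20),
    "default": [0] * 16 + [1] * 4,
})

# Hold out 25% of each class first, then resample only the remaining rows.
test = toy.groupby("default").sample(frac=0.25, random_state=0)
train = toy.drop(test.index)
train_balanced = random_oversample(train, target="default")

# Rows shared between the resampled training data and the held-out test set.
overlap = set(train_balanced["row_id"]) & set(test["row_id"])
print(train_balanced["default"].value_counts().to_dict(), len(overlap))
```

With this ordering the test set keeps the original class ratio and shares no rows with the training data, so copies of a test row cannot be memorised during training.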



## Discussion

#### Why Random Forest Is the Best-Performing Model:
- **Ensemble Learning:** Random Forest combines the predictions of multiple decision trees to make a final prediction, leading to better generalization and performance.
- **Handling Class Imbalance:** It can handle imbalanced classes well, especially when `class_weight='balanced'` is used, because the model then weights each class inversely to its frequency.
- **Overfitting Control:** Random Forest reduces overfitting by averaging multiple decision trees, improving the overall performance on unseen data.
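
For reference, scikit-learn documents the `class_weight='balanced'` weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula in plain Python; the helper name and the assumed 800/200 label split are illustrative.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical labels (assumption): 800 non-defaulters, 200 defaulters.
labels = [0] * 800 + [1] * 200
weights = balanced_class_weights(labels)
print(weights)  # {0: 0.625, 1: 2.5}
```

Under these assumed counts, an error on a defaulter costs four times as much as an error on a non-defaulter.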
  

#### Why XGBoost Performed Well:

- XGBoost can **handle complex relationships**, **manage class imbalance effectively**, and **limit overfitting**. Its excellent recall for defaulters makes it a strong choice for the problem of predicting loan defaults.
  

#### Why Logistic Regression Couldn't Perform Well:

- **Imbalanced Dataset:** Logistic Regression struggles with imbalanced data where one class significantly outweighs the other. In this dataset, there are far more non-defaulters (Class 0) than defaulters (Class 1), causing the model to focus on predicting the majority class (Class 0) while largely ignoring the minority class (Class 1).

- **Linear Model:** Logistic Regression is a linear model: it assumes a linear relationship between the input features and the log-odds of the target. If the relationship between the features (e.g., loan amount, interest rate) and the target (default status) is more complex, Logistic Regression may not capture it well.

- **No Feature Interactions:** Logistic Regression does not automatically account for interactions between features unless they are explicitly modeled. In contrast, models like Random Forest and XGBoost naturally capture interactions between features, making them better suited for complex datasets.

- **Sensitivity to Outliers:** Logistic Regression can be sensitive to outliers, which might affect its performance if outliers exist in this dataset.

- **Scaling Issues:** If features were not properly scaled (especially numerical ones like loan amount, interest rate), Logistic Regression may struggle to converge or properly separate the classes.

Addressing these issues, such as balancing the dataset or feature scaling, could improve Logistic Regression’s performance, but it may still fall short compared to the more flexible tree-based models.
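
As an illustration of the feature-scaling step, the sketch below standardizes two numeric columns using statistics from the training rows only. The column names and values are hypothetical, chosen to mirror the loan amount and interest rate examples above.

```python
import pandas as pd

# Hypothetical training and test features (assumption).
train = pd.DataFrame({
    "loan_amount": [5000.0, 12000.0, 20000.0, 35000.0],
    "interest_rate": [7.5, 11.0, 13.5, 18.0],
})
test = pd.DataFrame({"loan_amount": [15000.0], "interest_rate": [12.0]})

# Fit the scaling statistics on the training data only.
mean = train.mean()
std = train.std(ddof=0)

# Apply the same transformation to both sets.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.round(2))
print(test_scaled.round(2))
```

After this step both columns have zero mean and unit variance on the training data, so neither dominates the optimization simply because of its units.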

### Why SVM and Decision Tree Were Not Chosen for Model Building in This Case

- **Decision Tree**:
    - Prone to overfitting with complex data, making it less robust compared to ensemble methods.
    - Can be unstable as small changes in data lead to drastically different results.
    - Has no built-in regularization unless depth limits or pruning are applied, which can result in poor generalization.

- **SVM**:
    - Computationally expensive for large datasets and harder to scale.
    - Struggles with imbalanced data unless special techniques like adjusting class weights are applied.
    - Tuning for non-linear relationships (using kernel tricks) can be complex and less effective compared to tree-based models.
    - Less interpretable than decision tree-based methods.

*XGBoost and Random Forest were better suited because they handle class imbalance, capture complex patterns, prevent overfitting, and scale efficiently with large datasets, leading to their stronger performance.*


## Conclusion

Compared with Logistic Regression, *Random Forest* and *XGBoost* are *tree-based models* that can handle *non-linear relationships* and *imbalanced data* better, which explains their **superior performance**.

**Random Forest** is likely the **optimal choice** based on its exceptional performance, with **XGBoost** also being a **strong alternative**.