# Machine Learning Model Evaluation Documentation

---

## 1. Introduction

This documentation provides a comprehensive analysis of various machine learning models applied to a dataset, with the objective of identifying the best-performing model based on key metrics such as ROC AUC, accuracy, precision, recall, and F1-score. The models evaluated include:

- K-Nearest Neighbors (KNN)
- Decision Tree
- Random Forest
- Gradient Boosting
- AdaBoost
- Naive Bayes
- Linear Discriminant Analysis (LDA)
- XGBoost
- LightGBM
- CatBoost
- Support Vector Machine (SVM)
- Multilayer Perceptron (MLP)

### 1.1 System and Environment Information

This project was conducted on a high-performance machine with the following specifications:

- **Hardware Model:** Dell Inc. Precision T7610
- **Memory:** 64.0 GiB
- **Processor:** Intel® Xeon® E5-2650 0 × 32
- **Graphics:** NVIDIA GeForce GTX 1080 Ti
- **Disk Capacity:** 1.5 TB
- **Operating System:** Ubuntu 24.04 LTS
- **Kernel Version:** Linux 6.8.0-40-generic
- **CUDA Version:** 12.0 (with a CUDA driver version of 12.2)

Given this hardware, particularly the NVIDIA GeForce GTX 1080 Ti GPU, we used GPU acceleration to speed up model training. The cuML library from RAPIDS AI was used for models that have GPU implementations, which significantly reduced training time.

For models that do not natively support GPU acceleration, we utilized Intel’s scikit-learn acceleration (Intel® Extension for Scikit-learn*) to speed up the learning process. This environment was set up using Anaconda, and a dedicated Conda environment named `cuML_GPU` was created to manage the dependencies and tools required for this project.

### 1.2 Why GPU Acceleration?

Training machine learning models can be computationally intensive, particularly with large datasets and complex models. By using GPU acceleration with cuML, we were able to:

1. **Reduce Training Time:** The parallel processing power of the GPU allows for faster computations compared to traditional CPU-based training.
2. **Handle Larger Datasets:** The GPU's high memory bandwidth and parallel compute keep training times practical as datasets grow, provided the working set fits in GPU memory (11 GB on the GTX 1080 Ti).
3. **Improve Model Efficiency:** Faster training times allow for more iterative experimentation, leading to better-tuned models.

However, not all models can take full advantage of GPU acceleration. For these models, the Intel® Extension for Scikit-learn* was used to accelerate training on the CPU, ensuring that even CPU-bound models run efficiently.
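
The GPU-first, CPU-fallback policy described above can be sketched as a small helper. This is a minimal illustration rather than the project's actual setup code: the function name `pick_backend` is hypothetical, and it only checks whether the `cuml` and `sklearnex` packages can be found in the active environment.

```python
from importlib.util import find_spec


def pick_backend() -> str:
    """Return the name of the fastest backend that is installed (hypothetical helper)."""
    if find_spec("cuml") is not None:
        return "cuml"        # RAPIDS cuML: GPU implementations
    if find_spec("sklearnex") is not None:
        return "sklearnex"   # Intel Extension for Scikit-learn: CPU acceleration
    return "sklearn"         # stock scikit-learn as the last resort


backend = pick_backend()
print(f"Selected backend: {backend}")
```

When `sklearnex` is available, calling its `patch_sklearn()` function before importing scikit-learn estimators switches the supported ones to the accelerated implementations; unsupported estimators keep running the stock code.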

---

## 2. Model Evaluation

### 2.1 K-Nearest Neighbors (KNN) Model

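```python
# Import necessary libraries
from cuml.neighbors import KNeighborsClassifier as cuKNeighborsClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)

# Scale the features; the scaler is fitted on the training set only to avoid leakage
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the KNN model with n_neighbors=7 using GPU
knn = cuKNeighborsClassifier(n_neighbors=7)
knn.fit(X_train_scaled, y_train)

# Predict the test set results
y_pred = knn.predict(X_test_scaled)

# Evaluate the model on class probabilities rather than hard labels
# (multi_class="ovr" assumes a multiclass target, as the micro/macro AUC columns in Section 3 suggest)
y_proba = knn.predict_proba(X_test_scaled)
roc_auc = roc_auc_score(y_test, y_proba, multi_class="ovr")
```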

**Explanation:**

- **Data Preprocessing:** The dataset is split into training and testing sets. The features are standardized with `StandardScaler` (zero mean and unit variance per feature), which is essential for distance-based algorithms like KNN.
- **Model Training:** The KNN model is trained with `n_neighbors` set to 7, meaning the algorithm will consider the 7 nearest neighbors to make predictions. We used the cuML implementation of KNN to leverage GPU acceleration.
- **Prediction and Evaluation:** The model predicts the outcomes for the test set, and the ROC AUC score is calculated to evaluate its performance.
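
To make the role of scaling concrete, the sketch below implements standardization and a majority-vote KNN in plain Python. It is an illustration only: the toy data, the helper names, and `k=3` (instead of the 7 neighbors used above) are assumptions chosen to keep the example small.

```python
from collections import Counter
from math import dist
from statistics import mean, pstdev


def fit_scaler(rows):
    """Per-feature mean and population standard deviation of the training rows."""
    cols = list(zip(*rows))
    return [mean(c) for c in cols], [pstdev(c) or 1.0 for c in cols]


def transform(rows, means, stds):
    """Standardize each feature with the statistics learned by fit_scaler."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Toy data (illustrative only): the second feature has a much larger scale.
train_X = [[1.0, 1000.0], [1.2, 5000.0], [0.9, 3000.0],
           [3.0, 1100.0], [3.1, 1300.0], [2.9, 900.0]]
train_y = ["a", "a", "a", "b", "b", "b"]
query = [1.1, 1150.0]

means, stds = fit_scaler(train_X)               # statistics come from the training set only
scaled_train = transform(train_X, means, stds)
scaled_query = transform([query], means, stds)[0]

raw_prediction = knn_predict(train_X, train_y, query, k=3)
scaled_prediction = knn_predict(scaled_train, train_y, scaled_query, k=3)
print(raw_prediction, scaled_prediction)
```

Without scaling, the large-valued second feature dominates the Euclidean distance and the query is assigned to class `b`; after standardization both features contribute and the vote goes to class `a`.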

#### Results Interpretation:

- **Confusion Matrix:**
  
  ![KNN Confusion Matrix](best_code_ml_work/KNN/output.png)
  
  - The confusion matrix shows the number of true positives, true negatives, false positives, and false negatives. The matrix indicates that the model correctly classified a significant portion of the instances, but there are still some misclassifications.

- **ROC Curve:**
  
  ![KNN ROC Curve](best_code_ml_work/KNN/output1.png)
  
  - The ROC curve illustrates the trade-off between the true positive rate (sensitivity) and the false positive rate. A high area under the ROC curve (AUC) suggests that the model performs well in distinguishing between the classes.

- **Precision-Recall Curve:**
  
  ![KNN Precision-Recall Curve](best_code_ml_work/KNN/output2.png)
  
  - This curve highlights the precision at different levels of recall. The KNN model maintains a good balance between precision and recall, suggesting it performs well across different thresholds.
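
The area under the ROC curve can also be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way for the binary case, using made-up scores; the function name is hypothetical, and the project itself reports multiclass averages.

```python
def roc_auc_binary(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1, 1]
scores = [0.1, 0.6, 0.4, 0.7, 0.9]   # predicted probability of the positive class
hard_labels = [0, 1, 0, 1, 1]        # the same scores thresholded at 0.5

auc_from_scores = roc_auc_binary(y_true, scores)
auc_from_labels = roc_auc_binary(y_true, hard_labels)
print(f"{auc_from_scores:.3f} vs {auc_from_labels:.3f}")
```

Thresholding the scores into hard labels discards the ranking information, which is why AUC should be computed from probabilities or decision scores rather than from predicted classes.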

#### Cross-Validation Scores:

The KNN model was also evaluated using 5-fold cross-validation to ensure robustness. The following metrics were recorded:

| Metric              | Mean Score | Standard Deviation |
|---------------------|------------|--------------------|
| Accuracy            | 0.823      | 0.015              |
| Precision           | 0.824      | 0.018              |
| Recall              | 0.822      | 0.016              |
| F1-Score            | 0.821      | 0.017              |
| ROC AUC             | 0.968      | 0.010              |
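
The mean and standard deviation columns summarize one score per fold. As a rough sketch of that procedure, the code below builds five contiguous folds and aggregates a list of per-fold scores; the scores are hypothetical placeholders, not the project's results, and the population standard deviation is an assumption about how the spread was computed.

```python
from statistics import mean, pstdev


def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def summarize(fold_scores):
    """Mean and population standard deviation across folds."""
    return mean(fold_scores), pstdev(fold_scores)


folds = list(kfold_indices(n_samples=12, n_splits=5))

# Hypothetical per-fold accuracies, for illustration only.
example_scores = [0.80, 0.82, 0.84, 0.82, 0.82]
avg, spread = summarize(example_scores)
print(f"{avg:.3f} +/- {spread:.3f}")
```

Every sample lands in exactly one test fold, so each score is measured on data the model did not see during that fold's training.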

#### Conclusion:
- The KNN model with 7 neighbors shows a good balance between accuracy, precision, and recall. With cuML GPU acceleration, training took 0.076 s, but prediction on the test set took 3.287 s (see the comparison table in Section 3), so inference latency should be checked before using it in real-time applications.

---

### 2.2 Decision Tree Model

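```python
# Import necessary libraries
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

# Train the Decision Tree model with max_depth=20
tree = DecisionTreeClassifier(max_depth=20, random_state=42)
tree.fit(X_train_scaled, y_train)

# Predict the test set results
y_pred_tree = tree.predict(X_test_scaled)

# Evaluate the model on class probabilities rather than hard labels
# (multi_class="ovr" assumes a multiclass target, as the micro/macro AUC columns in Section 3 suggest)
roc_auc_tree = roc_auc_score(y_test, tree.predict_proba(X_test_scaled), multi_class="ovr")
```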

**Explanation:**

- **Model Training:** The Decision Tree model is trained with `max_depth` set to 20, which caps how deep the tree can grow and so limits overfitting. This model was run using Intel’s accelerated scikit-learn to improve training efficiency on the CPU.
- **Prediction and Evaluation:** The model predicts the outcomes for the test set, and the ROC AUC score is calculated to evaluate its performance.
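
How strongly `max_depth` constrains a tree depends on its value. A quick calculation, assuming a binary tree as scikit-learn builds, shows the upper bound on the number of leaves at a few depths (the helper name is hypothetical):

```python
def max_leaves(max_depth: int) -> int:
    """Upper bound on the number of leaves in a binary tree of the given depth."""
    return 2 ** max_depth


for depth in (5, 10, 20):
    print(f"max_depth={depth:>2}: at most {max_leaves(depth):,} leaves")
```

At depth 20 the bound exceeds one million leaves, so this setting only restrains the tree when unrestricted growth would go deeper; smaller values or pruning regularize more strongly.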

#### Results Interpretation:

- **Confusion Matrix:**
  
  ![Decision Tree Confusion Matrix](best_code_ml_work/Decision_Tree/output.png)
  
  - The confusion matrix indicates that the Decision Tree model correctly classifies most instances. However, deeper analysis might reveal areas where the model could be improved, such as pruning or adjusting the tree's depth.

- **ROC Curve:**
  
  ![Decision Tree ROC Curve](best_code_ml_work/Decision_Tree/output1.png)
  
  - The ROC curve shows that the Decision Tree model performs well, with a high true positive rate.

- **Precision-Recall Curve:**
  
  ![Decision Tree Precision-Recall Curve](best_code_ml_work/Decision_Tree/output2.png)
  
  - The Decision Tree model's precision-recall curve indicates a strong performance, with a good balance between precision and recall.

#### Cross-Validation Scores:

The Decision Tree model was also evaluated using 5-fold cross-validation. The metrics were recorded as follows:

| Metric              | Mean Score | Standard Deviation |
|---------------------|------------|--------------------|
| Accuracy            | 0.850      | 0.014              |
| Precision           | 0.853      | 0.015              |
| Recall              | 0.849      | 0.014              |
| F1-Score            | 0.847      | 0.016              |
| ROC AUC             | 0.982      | 0.008              |

#### Conclusion:
- The Decision Tree model offers high accuracy and strong performance metrics. Intel’s scikit-learn acceleration improved the training time, making it a robust choice when GPU acceleration is not available.

---


### 2.3 Random Forest Model

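```python
# Import necessary libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Train the Random Forest model with 100 estimators and max_depth=20
rf = RandomForestClassifier(n_estimators=100, max_depth=20, random_state=42)
rf.fit(X_train_scaled, y_train)

# Predict the test set results
y_pred_rf = rf.predict(X_test_scaled)

# Evaluate the model on class probabilities rather than hard labels
# (multi_class="ovr" assumes a multiclass target, as the micro/macro AUC columns in Section 3 suggest)
roc_auc_rf = roc_auc_score(y_test, rf.predict_proba(X_test_scaled), multi_class="ovr")
```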


**Explanation:**

- **Model Training:** The Random Forest model is trained with 100 trees (`n_estimators=100`) and a maximum depth of 20. Random Forest is an ensemble method that averages multiple decision trees to improve accuracy and reduce overfitting. Intel’s accelerated scikit-learn was used for this model.
- **Prediction and Evaluation:** The model predicts the outcomes for the test set, and the ROC AUC score is calculated to evaluate its performance.
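
The averaging step can be sketched in a few lines. scikit-learn's `RandomForestClassifier` averages the class probabilities predicted by its trees; the per-tree outputs below are hypothetical values for a single sample, and the helper names are illustrative.

```python
from collections import Counter


def forest_predict_proba(per_tree_probas):
    """Average the class-probability vectors produced by each tree (soft voting)."""
    n_trees = len(per_tree_probas)
    n_classes = len(per_tree_probas[0])
    return [sum(p[c] for p in per_tree_probas) / n_trees for c in range(n_classes)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical outputs of three trees for one sample over three classes.
tree_outputs = [[0.9, 0.1, 0.0], [0.4, 0.6, 0.0], [0.3, 0.5, 0.2]]

avg_proba = forest_predict_proba(tree_outputs)
soft_label = argmax(avg_proba)                              # class with the highest mean probability
hard_label = Counter(argmax(p) for p in tree_outputs).most_common(1)[0][0]
print(soft_label, hard_label)
```

Here two of the three trees favor class 1, but the first tree is confident enough that the averaged probabilities favor class 0, which shows how averaging differs from a plain majority vote.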

#### Results Interpretation:

- **Confusion Matrix:**
  
  ![Random Forest Confusion Matrix](best_code_ml_work/Random_Forest/output.png)
  
  - The Random Forest model shows a strong performance with most instances correctly classified. The ensemble method helps to mitigate overfitting, resulting in a robust model.

- **ROC Curve:**
  
  ![Random Forest ROC Curve](best_code_ml_work/Random_Forest/output1.png)
  
  - The ROC curve indicates that the Random Forest model has a high true positive rate, with a large area under the curve, suggesting excellent performance.

- **Precision-Recall Curve:**
  
  ![Random Forest Precision-Recall Curve](best_code_ml_work/Random_Forest/output2.png)
  
  - The precision-recall curve reflects a strong balance, with high precision and recall across different thresholds.

#### Cross-Validation Scores:

The Random Forest model was also evaluated using 5-fold cross-validation. The metrics were recorded as follows:

| Metric              | Mean Score | Standard Deviation |
|---------------------|------------|--------------------|
| Accuracy            | 0.847      | 0.012              |
| Precision           | 0.849      | 0.013              |
| Recall              | 0.846      | 0.014              |
| F1-Score            | 0.845      | 0.013              |
| ROC AUC             | 0.982      | 0.007              |

#### Conclusion:
- The Random Forest model provides strong performance across all metrics, thanks to its ensemble approach. Intel’s acceleration improved the training time, making it a viable option even without GPU support.

---

### 2.4 Gradient Boosting Model


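```python
# Import necessary libraries
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Train the Gradient Boosting model with 100 estimators and learning_rate=0.1
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb.fit(X_train_scaled, y_train)

# Predict the test set results
y_pred_gb = gb.predict(X_test_scaled)

# Evaluate the model on class probabilities rather than hard labels
# (multi_class="ovr" assumes a multiclass target, as the micro/macro AUC columns in Section 3 suggest)
roc_auc_gb = roc_auc_score(y_test, gb.predict_proba(X_test_scaled), multi_class="ovr")
```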

**Explanation:**

- **Model Training:** The Gradient Boosting model is trained with 100 trees (`n_estimators=100`) and a learning rate of 0.1. Gradient Boosting builds trees sequentially, focusing on correcting the errors of previous models, making it a powerful ensemble method. Intel’s accelerated scikit-learn was used for this model.
- **Prediction and Evaluation:** The model predicts the outcomes for the test set, and the ROC AUC score is calculated to evaluate its performance.
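
The sequential error-correcting idea is easiest to see in a stripped-down setting. The sketch below swaps the classification loss for squared-error regression on one-dimensional toy data and uses single-split stumps as the weak learners; these simplifications, the data, and the function names are assumptions made for illustration.

```python
def fit_stump(xs, residuals):
    """Find the single-threshold split that minimizes squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_estimators=100, learning_rate=0.1):
    """Fit stumps to residuals one after another; return the training MSE per round."""
    preds = [sum(ys) / len(ys)] * len(ys)                 # start from the mean
    history = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]    # errors of the current ensemble
        stump = fit_stump(xs, residuals)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return history


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
mse_history = boost(xs, ys)
print(f"MSE after round 1: {mse_history[0]:.4f}, after round 100: {mse_history[-1]:.6f}")
```

Each round fits only what the previous rounds left unexplained, and the learning rate scales every correction, which is why a small rate such as 0.1 needs many estimators.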

#### Results Interpretation:

- **Confusion Matrix:**
  
  ![Gradient Boosting Confusion Matrix](best_code_ml_work/Gradient_Boosting/output.png)
  
  - The Gradient Boosting model correctly classifies most instances, and the confusion matrix shows fewer misclassifications compared to simpler models.

- **ROC Curve:**
  
  ![Gradient Boosting ROC Curve](best_code_ml_work/Gradient_Boosting/output1.png)
  
  - The ROC curve indicates that the Gradient Boosting model performs exceptionally well, with a high true positive rate and a large area under the curve.

- **Precision-Recall Curve:**
  
  ![Gradient Boosting Precision-Recall Curve](best_code_ml_work/Gradient_Boosting/output2.png)
  
  - The precision-recall curve demonstrates that the Gradient Boosting model maintains a high level of precision and recall across different thresholds.

#### Cross-Validation Scores:

The Gradient Boosting model was also evaluated using 5-fold cross-validation. The metrics were recorded as follows:

| Metric              | Mean Score | Standard Deviation |
|---------------------|------------|--------------------|
| Accuracy            | 0.846      | 0.013              |
| Precision           | 0.848      | 0.014              |
| Recall              | 0.844      | 0.013              |
| F1-Score            | 0.843      | 0.014              |
| ROC AUC             | 0.982      | 0.008              |

#### Conclusion:
- The Gradient Boosting model is highly effective, offering strong predictive performance and robustness. Its sequential learning approach makes it well-suited for handling complex datasets, resulting in high accuracy, precision, and recall. The use of Intel’s scikit-learn acceleration improved the training time on the CPU.

---

### 2.5 AdaBoost Model

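```python
# Import necessary libraries
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score

# Train the AdaBoost model with 100 estimators and learning_rate=0.1
ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
ada.fit(X_train_scaled, y_train)

# Predict the test set results
y_pred_ada = ada.predict(X_test_scaled)

# Evaluate the model on class probabilities rather than hard labels
# (multi_class="ovr" assumes a multiclass target, as the micro/macro AUC columns in Section 3 suggest)
roc_auc_ada = roc_auc_score(y_test, ada.predict_proba(X_test_scaled), multi_class="ovr")
```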

**Explanation:**

- **Model Training:** The AdaBoost model is trained with 100 weak learners (`n_estimators=100`; scikit-learn uses depth-1 decision trees by default) and a learning rate of 0.1. AdaBoost is an ensemble method that increases the weights of incorrectly classified instances after each round, so later learners focus on the hard cases. Intel’s accelerated scikit-learn was used for this model.
- **Prediction and Evaluation:** The model predicts the outcomes for the test set, and the ROC AUC score is calculated to evaluate its performance.
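
The reweighting step can be written out for the classic binary form of AdaBoost. This is a simplified sketch: scikit-learn's multiclass variant uses a related but different formula, and the function name and five-sample example are hypothetical.

```python
from math import exp, log


def adaboost_round(weights, is_correct, learning_rate=1.0):
    """One reweighting step of classic binary AdaBoost."""
    err = sum(w for w, ok in zip(weights, is_correct) if not ok) / sum(weights)
    if not 0.0 < err < 1.0:
        raise ValueError("weighted error must lie strictly between 0 and 1")
    alpha = learning_rate * 0.5 * log((1.0 - err) / err)   # vote weight of this learner
    updated = [w * exp(-alpha if ok else alpha) for w, ok in zip(weights, is_correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


weights = [0.2] * 5                              # uniform start over five samples
is_correct = [True, True, True, True, False]     # the weak learner misses the last sample

alpha, new_weights = adaboost_round(weights, is_correct)
slow_alpha, slow_weights = adaboost_round(weights, is_correct, learning_rate=0.1)
print(new_weights[-1], slow_weights[-1])
```

With a learning rate of 1.0 the misclassified sample's weight jumps from 0.2 to 0.5 in one round; with 0.1 it rises only slightly, so many more rounds are needed before the ensemble concentrates on hard cases.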

#### Results Interpretation:

- **Confusion Matrix:**
  
  ![AdaBoost Confusion Matrix](best_code_ml_work/AdaBoost/output.png)
  
  - The confusion matrix shows that the AdaBoost model misclassifies far more instances than the other ensemble methods. With accuracy near 0.31, this more likely indicates underfitting (for example, a learning rate of 0.1 being too low for 100 weak learners) than overfitting.

- **ROC Curve:**
  
  ![AdaBoost ROC Curve](best_code_ml_work/AdaBoost/output1.png)
  
  - The ROC curve indicates that the AdaBoost model has a lower true positive rate compared to other models, reflecting its weaker performance.

- **Precision-Recall Curve:**
  
  ![AdaBoost Precision-Recall Curve](best_code_ml_work/AdaBoost/output2.png)
  
  - The precision-recall curve highlights that AdaBoost maintains moderate precision and recall, but may not be as strong as other ensemble methods.

#### Cross-Validation Scores:

The AdaBoost model was also evaluated using 5-fold cross-validation. The metrics were recorded as follows:

| Metric              | Mean Score | Standard Deviation |
|---------------------|------------|--------------------|
| Accuracy            | 0.306      | 0.023              |
| Precision           | 0.341      | 0.027              |
| Recall              | 0.306      | 0.026              |
| F1-Score            | 0.289      | 0.025              |
| ROC AUC             | 0.823      | 0.015              |

#### Conclusion:
- The AdaBoost model, while effective in some scenarios, may not be as robust as other ensemble methods like Random Forest or Gradient Boosting. It struggles with misclassifications, indicating that further tuning or alternative approaches may be necessary to improve its performance. Intel’s acceleration was used to speed up the training process.

---

### 2.6 Naive Bayes Model

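```python
# Import necessary libraries
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import GaussianNB

# Train the Naive Bayes model
nb = GaussianNB()
nb.fit(X_train_scaled, y_train)

# Predict the test set results
y_pred_nb = nb.predict(X_test_scaled)

# Evaluate the model on class probabilities rather than hard labels
# (multi_class="ovr" assumes a multiclass target, as the micro/macro AUC columns in Section 3 suggest)
roc_auc_nb = roc_auc_score(y_test, nb.predict_proba(X_test_scaled), multi_class="ovr")
```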

**Explanation:**

- **Model Training:** The Naive Bayes model is a simple probabilistic classifier based on Bayes' theorem. It assumes independence between features, making it particularly useful for high-dimensional data. Intel’s accelerated scikit-learn was used for this model.
- **Prediction and Evaluation:** The model predicts the outcomes for the test set, and the ROC AUC score is calculated to evaluate its performance.
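
The independence assumption means the class score is just the log prior plus a sum of per-feature log densities. The sketch below spells that out for Gaussian features; it is a simplified illustration with toy data, and the fixed `var_floor` is a hypothetical stand-in for the variance smoothing that scikit-learn applies.

```python
from math import log, pi
from statistics import mean, pvariance


def fit_gnb(X, y, var_floor=1e-9):
    """Estimate each class's prior and its per-feature mean and variance."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        cols = list(zip(*rows))
        model[label] = (
            len(rows) / len(X),
            [mean(c) for c in cols],
            [pvariance(c) + var_floor for c in cols],
        )
    return model


def log_posterior(model, x):
    """Unnormalized log posterior: log prior plus the sum of per-feature log densities."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        log_likelihood = sum(
            -0.5 * log(2 * pi * v) - (xi - m) ** 2 / (2 * v)
            for xi, m, v in zip(x, means, variances)
        )
        scores[label] = log(prior) + log_likelihood
    return scores


# Toy data, for illustration only.
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 0.9], [3.2, 1.1], [2.8, 1.0]]
y = [0, 0, 0, 1, 1, 1]

model = fit_gnb(X, y)
scores = log_posterior(model, [1.1, 2.0])
prediction = max(scores, key=scores.get)
print(prediction)
```

Because each feature contributes its own term, correlated features are effectively counted more than once, which is one way the independence assumption can hurt accuracy.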

#### Results Interpretation:

- **Confusion Matrix:**
  
  ![Naive Bayes Confusion Matrix](best_code_ml_work/Naive_Bayes/output.png)
  
  - The confusion matrix shows that Naive Bayes struggles with classification accuracy, likely due to its strong assumptions of feature independence.

- **ROC Curve:**
  
  ![Naive Bayes ROC Curve](best_code_ml_work/Naive_Bayes/output1.png)
  
  - The ROC curve indicates that Naive Bayes has a lower true positive rate, reflecting its limitations in handling complex data.

- **Precision-Recall Curve:**
  
  ![Naive Bayes Precision-Recall Curve](best_code_ml_work/Naive_Bayes/output2.png)
  
  - The precision-recall curve suggests that Naive Bayes maintains moderate precision and recall, but may not perform well on more complex datasets.

#### Cross-Validation Scores:

The Naive Bayes model was also evaluated using 5-fold cross-validation. The metrics were recorded as follows:

| Metric              | Mean Score | Standard Deviation |
|---------------------|------------|--------------------|
| Accuracy            | 0.308      | 0.021              |
| Precision           | 0.602      | 0.029              |
| Recall              | 0.308      | 0.022              |
| F1-Score            | 0.265      | 0.020              |
| ROC AUC             | 0.861      | 0.014              |

#### Conclusion:
- Naive Bayes is a simple and fast model that can work well in certain scenarios, particularly with high-dimensional or sparse data. However, its strong assumptions and lower performance metrics indicate that it may not be the best choice for more complex datasets. Intel’s acceleration was used to speed up the training process.

---

### 2.7 Linear Discriminant Analysis (LDA) Model

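```python
# Import necessary libraries
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score

# Train the LDA model
lda = LinearDiscriminantAnalysis()
lda.fit(X_train_scaled, y_train)

# Predict the test set results
y_pred_lda = lda.predict(X_test_scaled)

# Evaluate the model on class probabilities rather than hard labels
# (multi_class="ovr" assumes a multiclass target, as the micro/macro AUC columns in Section 3 suggest)
roc_auc_lda = roc_auc_score(y_test, lda.predict_proba(X_test_scaled), multi_class="ovr")
```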

**Explanation:**

- **Model Training:** The Linear Discriminant Analysis (LDA) model is a linear classifier that projects data onto a lower-dimensional space while maximizing the separation between classes. Intel’s accelerated scikit-learn was used for this model.
- **Prediction and Evaluation:** The model predicts the outcomes for the test set, and the ROC AUC score is calculated to evaluate its performance.

#### Results Interpretation:

- **Confusion Matrix:**
  
  ![LDA Confusion Matrix](<best_code_ml_work/Linear Discriminant Analysis (LDA)/output.png>)
  
  - The confusion matrix shows that LDA separates the classes moderately well: it makes fewer misclassifications than AdaBoost and Naive Bayes but more than the tree-based models.

- **ROC Curve:**
  
  ![LDA ROC Curve](<best_code_ml_work/Linear Discriminant Analysis (LDA)/output1.png>)
  
  - The ROC curve indicates that LDA has a high true positive rate, reflecting its strength as a linear classifier.

- **Precision-Recall Curve:**
  
  ![LDA Precision-Recall Curve](<best_code_ml_work/Linear Discriminant Analysis (LDA)/output2.png>)
  
  - The precision-recall curve suggests that LDA maintains a good balance between precision and recall, making it effective for linearly separable data.

#### Cross-Validation Scores:

The LDA model was also evaluated using 5-fold cross-validation. The metrics were recorded as follows:

| Metric              | Mean Score | Standard Deviation |
|---------------------|------------|--------------------|
| Accuracy            | 0.667      | 0.023              |
| Precision           | 0.706      | 0.027              |
| Recall              | 0.667      | 0.026              |
| F1-Score            | 0.663      | 0.025              |
| ROC AUC             | 0.933      | 0.012              |

#### Conclusion:
- Linear Discriminant Analysis is a fast linear classifier that reached moderate accuracy (0.667) with a high ROC AUC (0.933) on this dataset. Its performance degrades when the classes are not linearly separable, which may explain the gap to the tree-based models. Intel’s acceleration was used to speed up the training process.

---


### 2.8 XGBoost Model

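```python
# Import necessary libraries
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

# Train the XGBoost model with 100 estimators and learning_rate=0.1
xgb = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=6, random_state=42)
xgb.fit(X_train_scaled, y_train)

# Predict the test set results
y_pred_xgb = xgb.predict(X_test_scaled)

# Evaluate the model on class probabilities rather than hard labels
# (multi_class="ovr" assumes a multiclass target, as the micro/macro AUC columns in Section 3 suggest)
roc_auc_xgb = roc_auc_score(y_test, xgb.predict_proba(X_test_scaled), multi_class="ovr")
```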

**Explanation:**

- **Model Training:** The XGBoost model is trained with 100 trees (`n_estimators=100`), a learning rate of 0.1, and a maximum depth of 6. XGBoost is an optimized implementation of gradient boosting that is highly efficient and powerful.
- **Prediction and Evaluation:** The model predicts the outcomes for the test set, and the ROC AUC score is calculated to evaluate its performance.

#### Results Interpretation:

- **Confusion Matrix:**
  
  ![XGBoost Confusion Matrix](best_code_ml_work/XGBoost/output.png)
  
  - The XGBoost model performs well, with most instances correctly classified. The confusion matrix shows a strong predictive capability.

- **ROC Curve:**
  
  ![XGBoost ROC Curve](best_code_ml_work/XGBoost/output1.png)
  
  - The ROC curve indicates that XGBoost has a high true positive rate, with a large area under the curve, suggesting excellent performance.

- **Precision-Recall Curve:**
  
  ![XGBoost Precision-Recall Curve](best_code_ml_work/XGBoost/output2.png)
  
  - The precision-recall curve shows that XGBoost maintains high precision and recall, making it one of the best-performing models.

#### Cross-Validation Scores:

The XGBoost model was also evaluated using 5-fold cross-validation. The metrics were recorded as follows:

| Metric              | Mean Score | Standard Deviation |
|---------------------|------------|--------------------|
| Accuracy            | 0.848      | 0.014              |
| Precision           | 0.861      | 0.016              |
| Recall              | 0.848      | 0.015              |
| F1-Score            | 0.843      | 0.014              |
| ROC AUC             | 0.982      | 0.007              |

#### Conclusion:
- XGBoost is a highly effective model with excellent performance metrics, particularly well-suited for complex datasets. Its optimized implementation of gradient boosting allows it to achieve high accuracy, precision, and recall. The GPU acceleration available in XGBoost made the training process even faster.

---

### 2.9 LightGBM Model

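```python
# Import necessary libraries
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score

# Train the LightGBM model with 100 estimators and learning_rate=0.1
lgbm = LGBMClassifier(n_estimators=100, learning_rate=0.1, num_leaves=31, random_state=42)
lgbm.fit(X_train_scaled, y_train)

# Predict the test set results
y_pred_lgbm = lgbm.predict(X_test_scaled)

# Evaluate the model on class probabilities rather than hard labels
# (multi_class="ovr" assumes a multiclass target, as the micro/macro AUC columns in Section 3 suggest)
roc_auc_lgbm = roc_auc_score(y_test, lgbm.predict_proba(X_test_scaled), multi_class="ovr")
```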


**Explanation:**

- **Model Training:** The LightGBM model is trained with 100 trees (`n_estimators=100`), a learning rate of 0.1, and 31 leaves. LightGBM is a gradient boosting framework that uses tree-based learning algorithms and is known for its efficiency and speed.
- **Prediction and Evaluation:** The model predicts the outcomes for the test set, and the ROC AUC score is calculated to evaluate its performance.

#### Results Interpretation:

- **Confusion Matrix:**
  
  ![LightGBM Confusion Matrix](best_code_ml_work/LightGBM/output.png)
  
  - The LightGBM model performs exceptionally well, with most instances correctly classified. The confusion matrix reflects its strong predictive capability.

- **ROC Curve:**
  
  ![LightGBM ROC Curve](best_code_ml_work/LightGBM/output1.png)
  
  - The ROC curve indicates that LightGBM has a high true positive rate, with a large area under the curve, suggesting top-tier performance.

- **Precision-Recall Curve:**
  
  ![LightGBM Precision-Recall Curve](best_code_ml_work/LightGBM/output2.png)
  
  - The precision-recall curve shows that LightGBM maintains high precision and recall, making it one of the best-performing models.

#### Cross-Validation Scores:

The LightGBM model was also evaluated using 5-fold cross-validation. The metrics were recorded as follows:

| Metric              | Mean Score | Standard Deviation |
|---------------------|------------|--------------------|
| Accuracy            | 0.848      | 0.014              |
| Precision           | 0.860      | 0.015              |
| Recall              | 0.848      | 0.014              |
| F1-Score            | 0.844      | 0.014              |
| ROC AUC             | 0.982      | 0.007              |

#### Conclusion:
- LightGBM is a highly efficient and effective model with excellent performance metrics. Its fast training and testing times, coupled with high accuracy, precision, and recall, make it one of the top choices for gradient boosting models. The use of GPU acceleration in LightGBM further enhanced its performance.

---

### 2.10 CatBoost Model

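```python
# Import necessary libraries
from catboost import CatBoostClassifier
from sklearn.metrics import roc_auc_score

# Train the CatBoost model with 100 estimators and learning_rate=0.1
cat = CatBoostClassifier(n_estimators=100, learning_rate=0.1, depth=6, random_state=42, verbose=0)
cat.fit(X_train_scaled, y_train)

# Predict the test set results
y_pred_cat = cat.predict(X_test_scaled)

# Evaluate the model on class probabilities rather than hard labels
# (multi_class="ovr" assumes a multiclass target, as the micro/macro AUC columns in Section 3 suggest)
roc_auc_cat = roc_auc_score(y_test, cat.predict_proba(X_test_scaled), multi_class="ovr")
```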

**Explanation:**

- **Model Training:** The CatBoost model is trained with 100 trees (`n_estimators=100`), a learning rate of 0.1, and a depth of 6. CatBoost is known for handling categorical data effectively and offering robust performance.
- **Prediction and Evaluation:** The model predicts the outcomes for the test set, and the ROC AUC score is calculated to evaluate its performance.

#### Results Interpretation:

- **Confusion Matrix:**
  
  ![CatBoost Confusion Matrix](best_code_ml_work/CatBoost/output.png)
  
  - The CatBoost model shows strong performance, with the confusion matrix indicating high accuracy and correct classifications.

- **ROC Curve:**
  
  ![CatBoost ROC Curve](best_code_ml_work/CatBoost/output1.png)
  
  - The ROC curve demonstrates that CatBoost has a high true positive rate, with a large area under the curve, indicating excellent performance.

- **Precision-Recall Curve:**
  
  ![CatBoost Precision-Recall Curve](best_code_ml_work/CatBoost/output2.png)
  
  - The precision-recall curve confirms that CatBoost maintains high precision and recall, making it a robust model for complex data.

#### Cross-Validation Scores:

The CatBoost model was also evaluated using 5-fold cross-validation. The metrics were recorded as follows:

| Metric              | Mean Score | Standard Deviation |
|---------------------|------------|--------------------|
| Accuracy            | 0.842      | 0.013              |
| Precision           | 0.863      | 0.014              |
| Recall              | 0.842      | 0.013              |
| F1-Score            | 0.835      | 0.014              |
| ROC AUC             | 0.981      | 0.007              |

#### Conclusion:
- CatBoost is an effective model, particularly strong in handling categorical data and offering high performance metrics. Its ability to maintain high accuracy, precision, and recall makes it a strong competitor among gradient boosting models.

---

### 2.11 Support Vector Machine (SVM) Model

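```python
# Import necessary libraries
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

# Train the SVM model with RBF kernel and C=1
svm = SVC(kernel='rbf', C=1, probability=True, random_state=42)
svm.fit(X_train_scaled, y_train)

# Predict the test set results
y_pred_svm = svm.predict(X_test_scaled)

# Evaluate the model on class probabilities (available because probability=True)
# (multi_class="ovr" assumes a multiclass target, as the micro/macro AUC columns in Section 3 suggest)
roc_auc_svm = roc_auc_score(y_test, svm.predict_proba(X_test_scaled), multi_class="ovr")
```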

**Explanation:**

- **Model Training:** The Support Vector Machine (SVM) model is trained with a radial basis function (RBF) kernel and a regularization parameter `C=1`. SVM is effective in high-dimensional spaces and works well for both linear and non-linear classification. Intel’s accelerated scikit-learn was used for this model.
- **Prediction and Evaluation:** The model predicts the outcomes for the test set, and the ROC AUC score is calculated to evaluate its performance.
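
The RBF kernel itself is a one-line similarity function. The sketch below evaluates it for a few point pairs; the `gamma` value is chosen arbitrarily for illustration and is not the value used in the experiments (scikit-learn derives its default from the data).

```python
from math import exp


def rbf_kernel(x, y, gamma):
    """Similarity that decays with squared Euclidean distance: exp(-gamma * ||x - y||^2)."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


same = rbf_kernel([0.0, 0.0], [0.0, 0.0], gamma=0.5)
near = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)
far = rbf_kernel([0.0, 0.0], [3.0, 3.0], gamma=0.5)
print(same, near, far)
```

Identical points have similarity 1, and the value falls toward 0 as points move apart; because the kernel depends on raw distances, the feature scaling applied earlier matters for SVM as much as for KNN.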

#### Results Interpretation:

- **Confusion Matrix:**
  
  ![SVM Confusion Matrix](best_code_ml_work/SVM/output.png)
  
  - The SVM model shows good performance, with the confusion matrix indicating high accuracy and correct classifications.

- **ROC Curve:**
  
  ![SVM ROC Curve](best_code_ml_work/SVM/output1.png)
  
  - The ROC curve shows that SVM has a high true positive rate, with a large area under the curve, indicating solid performance.

- **Precision-Recall Curve:**
  
  ![SVM Precision-Recall Curve](best_code_ml_work/SVM/output2.png)
  
  - The precision-recall curve confirms that SVM maintains high precision and recall, making it a reliable model for both linear and non-linear data.

#### Cross-Validation Scores:

The SVM model was also evaluated using 5-fold cross-validation. The metrics were recorded as follows:

| Metric              | Mean Score | Standard Deviation |
|---------------------|------------|--------------------|
| Accuracy            | 0.782      | 0.017              |
| Precision           | 0.813      | 0.019              |
| Recall              | 0.782      | 0.018              |
| F1-Score            | 0.813      | 0.017              |
| ROC AUC             | 0.945      | 0.012              |

#### Conclusion:
- The SVM model is a powerful tool for classification, particularly well-suited for high-dimensional data. Its strong performance metrics make it a reliable choice for various classification tasks. Intel’s acceleration was used to speed up the training process.

---

### 2.12 Multilayer Perceptron (MLP) Model

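```python
# Import necessary libraries
from sklearn.metrics import roc_auc_score
from sklearn.neural_network import MLPClassifier

# Train the MLP model with a single hidden layer of 100 neurons
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=42)
mlp.fit(X_train_scaled, y_train)

# Predict the test set results
y_pred_mlp = mlp.predict(X_test_scaled)

# Evaluate the model on class probabilities rather than hard labels
# (multi_class="ovr" assumes a multiclass target, as the micro/macro AUC columns in Section 3 suggest)
roc_auc_mlp = roc_auc_score(y_test, mlp.predict_proba(X_test_scaled), multi_class="ovr")
```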

**Explanation:**

- **Model Training:** The Multilayer Perceptron (MLP) model is a feedforward artificial neural network with one hidden layer of 100 neurons. MLP is a flexible model that can capture non-linear relationships in the data. Intel’s accelerated scikit-learn was used for this model.
- **Prediction and Evaluation:** The model predicts the outcomes for the test set, and the ROC AUC score is calculated to evaluate its performance.
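
A forward pass through a one-hidden-layer network can be sketched without any library. The example below assumes ReLU hidden units and a softmax output, which are scikit-learn's defaults for multiclass `MLPClassifier`; the tiny weight matrices are hypothetical and far smaller than the 100-neuron layer used above.

```python
from math import exp


def dense(v, weights, biases):
    """Affine layer; weights[j] holds the incoming weights of output unit j."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, biases)]


def relu(v):
    """Element-wise max(0, x)."""
    return [max(0.0, x) for x in v]


def softmax(v):
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    m = max(v)
    exps = [exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, W1, b1, W2, b2):
    """Input -> hidden layer with ReLU -> output layer with softmax."""
    return softmax(dense(relu(dense(x, W1, b1)), W2, b2))


# Hypothetical tiny network: 2 inputs, 3 hidden units, 2 classes.
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, -1.0], [-1.0, 0.0, 1.0]]
b2 = [0.0, 0.0]

probs = forward([2.0, 1.0], W1, b1, W2, b2)
print(probs)
```

The ReLU between the two affine layers is what lets the network represent non-linear decision boundaries; without it the two layers would collapse into a single linear map.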

#### Results Interpretation:

- **Confusion Matrix:**
  
  ![MLP Confusion Matrix](best_code_ml_work/Multilayer_Perceptron_(MLP)/output.png)
  
  - The MLP model shows strong performance, with the confusion matrix indicating high accuracy and correct classifications.

- **ROC Curve:**
  
  ![MLP ROC Curve](best_code_ml_work/Multilayer_Perceptron_(MLP)/output1.png)
  
  - The ROC curve indicates that MLP has a high true positive rate, with a large area under the curve, suggesting strong performance.

- **Precision-Recall Curve:**
  
  ![MLP Precision-Recall Curve](best_code_ml_work/Multilayer_Perceptron_(MLP)/output2.png)
  
  - The precision-recall curve demonstrates that MLP maintains high precision and recall, making it effective in capturing complex relationships in the data.

#### Cross-Validation Scores:

The MLP model was also evaluated using 5-fold cross-validation. The metrics were recorded as follows:

| Metric              | Mean Score | Standard Deviation |
|---------------------|------------|--------------------|
| Accuracy            | 0.824      | 0.013              |
| Precision           | 0.839      | 0.014              |
| Recall              | 0.824      | 0.015              |
| F1-Score            | 0.839      | 0.014              |
| ROC AUC             | 0.976      | 0.010              |

#### Conclusion:
- The Multilayer Perceptron (MLP) model is a flexible and powerful model capable of capturing non-linear relationships. Its strong performance metrics make it an excellent choice for complex classification tasks. Intel’s acceleration was used to speed up the training process.

---


## 3. Comparison Table

| Model           | ROC AUC (Micro) | ROC AUC (Macro) | Accuracy | Precision | Recall | F1-Score | Training Time (s) | Testing Time (s) |
|-----------------|----------------|----------------|----------|-----------|--------|----------|-------------------|------------------|
| KNN             | 0.972          | 0.967          | 0.825    | 0.826     | 0.825  | 0.825    | 0.076             | 3.287            |
| Decision Tree   | 0.988          | 0.982          | 0.851    | 0.862     | 0.851  | 0.848    | 1.202             | 0.028            |
| Random Forest   | 0.988          | 0.982          | 0.847    | 0.858     | 0.847  | 0.844    | 6.249             | 0.193            |
| Gradient Boosting | 0.987        | 0.982          | 0.846    | 0.859     | 0.847  | 0.844    | 534.405           | 1.617            |
| AdaBoost        | 0.825          | 0.823          | 0.306    | 0.341     | 0.306  | 0.289    | 63.227            | 4.729            |
| Naive Bayes     | 0.790          | 0.861          | 0.308    | 0.602     | 0.308  | 0.265    | 0.307             | 0.156            |
| LDA             | 0.934          | 0.933          | 0.667    | 0.706     | 0.667  | 0.663    | 2.118             | 0.076            |
| XGBoost         | 0.988          | 0.982          | 0.848    | 0.861     | 0.848  | 0.843    | 71.106            | 0.159            |
| LightGBM        | 0.988          | 0.982          | 0.848    | 0.860     | 0.848  | 0.844    | 7.161             | 0.762            |
| CatBoost        | 0.987          | 0.981          | 0.842    | 0.863     | 0.842  | 0.835    | 13.963            | 0.050            |
| SVM             | 0.963          | 0.945          | 0.782    | 0.813     | 0.782  | 0.813    | 0.028             | 0.159            |
| MLP             | 0.983          | 0.976          | 0.824    | 0.839     | 0.824  | 0.839    | 0.193             | 0.762            |
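
The table reports both micro- and macro-averaged ROC AUC. The difference between the two averaging schemes is easiest to see with a count-based metric, so the sketch below uses precision on a small imbalanced example; the labels are made up, and the same pooling-versus-averaging distinction applies to ROC AUC.

```python
def per_class_counts(y_true, y_pred, labels):
    """True-positive and false-positive counts for each class."""
    counts = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        counts[c] = (tp, fp)
    return counts


def macro_precision(counts):
    """Unweighted mean of per-class precision: every class counts equally."""
    per_class = [tp / (tp + fp) if tp + fp else 0.0 for tp, fp in counts.values()]
    return sum(per_class) / len(per_class)


def micro_precision(counts):
    """Precision over pooled counts: every prediction counts equally."""
    total_tp = sum(tp for tp, _ in counts.values())
    total_fp = sum(fp for _, fp in counts.values())
    return total_tp / (total_tp + total_fp)


# Made-up imbalanced example: eight samples of class 0, two of class 1.
y_true = [0] * 8 + [1, 1]
y_pred = [0] * 8 + [0, 1]

counts = per_class_counts(y_true, y_pred, labels=[0, 1])
macro = macro_precision(counts)
micro = micro_precision(counts)
print(f"macro={macro:.4f} micro={micro:.4f}")
```

Macro averaging gives rare classes the same influence as common ones, whereas micro averaging is dominated by the frequent classes, so a gap between the two columns hints at uneven per-class performance.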

## 4. Best Model Selection

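Based on the comparison table, **LightGBM** emerges as the best model. It ties with Decision Tree, Random Forest, and XGBoost for the highest ROC AUC (0.988 micro, 0.982 macro), and its accuracy (0.848) and F1-score (0.844) are within 0.004 of the best values in the table. Its training time (7.161 s) is about a tenth of XGBoost's (71.106 s) and far below Gradient Boosting's (534.405 s), making it a balanced choice.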
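
Several models share the top ROC AUC values, so it helps to list the tied candidates alongside their secondary metrics. The sketch below does this with values copied from the comparison table; the `Row` type and the choice of columns are illustrative assumptions, not the selection procedure actually used.

```python
from collections import namedtuple

Row = namedtuple("Row", "model auc_micro auc_macro accuracy f1 train_s")

# Values copied from the comparison table in Section 3.
rows = [
    Row("KNN", 0.972, 0.967, 0.825, 0.825, 0.076),
    Row("Decision Tree", 0.988, 0.982, 0.851, 0.848, 1.202),
    Row("Random Forest", 0.988, 0.982, 0.847, 0.844, 6.249),
    Row("Gradient Boosting", 0.987, 0.982, 0.846, 0.844, 534.405),
    Row("AdaBoost", 0.825, 0.823, 0.306, 0.289, 63.227),
    Row("Naive Bayes", 0.790, 0.861, 0.308, 0.265, 0.307),
    Row("LDA", 0.934, 0.933, 0.667, 0.663, 2.118),
    Row("XGBoost", 0.988, 0.982, 0.848, 0.843, 71.106),
    Row("LightGBM", 0.988, 0.982, 0.848, 0.844, 7.161),
    Row("CatBoost", 0.987, 0.981, 0.842, 0.835, 13.963),
    Row("SVM", 0.963, 0.945, 0.782, 0.813, 0.028),
    Row("MLP", 0.983, 0.976, 0.824, 0.839, 0.193),
]

# Find the best (micro, macro) ROC AUC pair and every model that reaches it.
best_auc = max((r.auc_micro, r.auc_macro) for r in rows)
tied = [r for r in rows if (r.auc_micro, r.auc_macro) == best_auc]

for r in tied:
    print(f"{r.model:<14} accuracy={r.accuracy:.3f} f1={r.f1:.3f} train={r.train_s:.3f}s")
```

Listing the tied models makes the trade-offs explicit: the final choice then rests on secondary criteria such as accuracy, F1-score, and training or testing time.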

---

## 5. Conclusion

This detailed documentation helps to understand each model's strengths and weaknesses. **LightGBM** is recommended as the best model, but future improvements could involve ensemble methods or further hyperparameter tuning.