# Evaluate XGBoost and RandomForest ML models

## Notebook Purpose

This notebook evaluates pre-trained **Random Forest** and **XGBoost** models on the waste class. For each model it reports the confusion counts (TP, FP, TN, FN) together with accuracy, precision, recall, and F1-score.


## Requirements for Using the Notebook

To successfully use this notebook, the following paths and configurations are required:

### 1. Data Files
Ensure CSV files are prepared using the methodology described in:
- [Creating CSV files from Planet training data](https://gitlab.inf.elte.hu/gislab/waste-detection/-/wikis/Creating-CSV-files-from-Planet-training-data)
- Or a similar script designed for Sentinel data, which provides more spectral bands.

### 2. Model Checkpoints
The notebook loads the models from the following paths:
- **XGBoost Model Path:** `models/xgboost_model_sampled.sav`
- **Random Forest Model Path:** `models/with_extra_unknowns.sav`
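
Before running the cells below, it can help to confirm that the required files are in place. The following is a minimal sketch and not part of the original notebook: the helper name `find_missing_files` is hypothetical, and the paths are the ones this notebook reads, relative to the working directory.

```python
from pathlib import Path

# Paths taken from this notebook; adjust them if your layout differs.
REQUIRED_FILES = [
    "test_data/kiskore_images.csv",
    "test_data/drina_images.csv",
    "models/xgboost_model_sampled.sav",
    "models/with_extra_unknowns.sav",
]


def find_missing_files(paths, base_dir="."):
    """Return the subset of `paths` that does not exist under `base_dir`."""
    base = Path(base_dir)
    return [p for p in paths if not (base / p).is_file()]


missing = find_missing_files(REQUIRED_FILES)
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files found.")
```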

### Imports

In [None]:
import pandas as pd
import numpy as np
import joblib
import json

### Dataset for Random Forest and XGBoost

The datasets `kiskore_data` and `drina_data` were created based on the methodology described in the [Waste Detection Wiki](https://gitlab.inf.elte.hu/gislab/waste-detection/-/wikis/Creating-CSV-files-from-Planet-training-data).  

These CSV files (`kiskore_images.csv` and `drina_images.csv`) were generated from Planet training data and are used for testing purposes.


In [None]:
kiskore_data = pd.read_csv("test_data/kiskore_images.csv", delimiter=";")
drina_data = pd.read_csv("test_data/drina_images.csv", delimiter=";")

In [None]:
# If working with data from Sentinel or other sources with additional spectral bands,
# modify the `bands` list to include the desired band names.

bands = ['BLUE', 'GREEN', 'RED', 'NIR', 'PI', 'NDWI', 'NDVI', 'RNDVI', 'SR']

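# Feature matrices and class codes (column `COD`) for each test site.
X_kiskore = kiskore_data[bands]
y_kiskore = kiskore_data['COD'].values

X_drina = drina_data[bands]
y_drina = drina_data['COD'].values

# Stack both sites into a single evaluation set.
X_merged = pd.concat([X_kiskore, X_drina], axis=0)
y_merged = np.concatenate([y_kiskore, y_drina])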
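
Because the metrics later in the notebook are computed for a single target class, it is worth checking how the labels are distributed before evaluating. Here is a minimal sketch using only the standard library. `class_distribution` is a hypothetical helper, and the short label list merely stands in for `y_merged`.

```python
from collections import Counter


def class_distribution(labels):
    """Return {class_label: (count, share)} sorted by class label."""
    counts = Counter(int(label) for label in labels)
    total = sum(counts.values())
    return {
        label: (count, count / total)
        for label, count in sorted(counts.items())
    }


# Illustrative labels only; in the notebook, pass `y_merged` instead.
example_labels = [100, 100, 100, 200, 300, 100]
for label, (count, share) in class_distribution(example_labels).items():
    print(f"class {label}: {count} samples ({share:.1%})")
```

A strongly imbalanced distribution means that accuracy alone says little, so precision and recall for the target class are the more informative numbers.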

### Load models (Random Forest, XGBoost)

#### XGBoost Model Training

The XGBoost model was trained based on the methodology described in the [Waste Detection Wiki](https://gitlab.inf.elte.hu/gislab/waste-detection/-/wikis/Training-a-model).  

However, the current model performs poorly because its hyperparameters were not tuned and no further adjustments were made. Training a new model with tuned settings and using it for predictions would be more effective.


In [32]:
xgboost_model_path = "models/xgboost_model_sampled.sav"
xgboost_model = joblib.load(xgboost_model_path)

In [10]:
rf_model_path = 'models/with_extra_unknowns.sav'
rf_model = joblib.load(rf_model_path)

### Create predictions

#### RF and XGBoost

In [17]:
rf_predictions_merged = rf_model.predict(X_merged)

In [None]:
# This code prepares the ground truth labels and predictions from a Random Forest model
# for binary classification. Specifically, it focuses on identifying a target class 
# (e.g., waste class label = 100) by converting multi-class labels and predictions 
# into binary format, where 1 represents the target class and 0 represents all others.
waste_class_label = 100

rf_y_pred_merged = rf_predictions_merged.astype(int)
rf_y_true_merged = y_merged

rf_y_true_bin_merged = np.where(rf_y_true_merged == waste_class_label, 1, 0)
rf_y_pred_bin_merged = np.where(rf_y_pred_merged == waste_class_label, 1, 0)

In [33]:
xgboost_predictions_merged = xgboost_model.predict(X_merged)

In [None]:
class_labels = [100, 200, 300, 400, 500]
label_mapping = {label: idx for idx, label in enumerate(class_labels)}
inverse_label_mapping = {v: k for k, v in label_mapping.items()}
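
The cell above builds a mapping between the original class codes and consecutive indices, presumably because the XGBoost model was trained on index-encoded labels (an assumption, since the training code is not shown here). A small sketch of the round trip, using made-up predictions:

```python
class_labels = [100, 200, 300, 400, 500]
label_mapping = {label: idx for idx, label in enumerate(class_labels)}
inverse_label_mapping = {idx: label for label, idx in label_mapping.items()}

# Hypothetical model output: class indices in the range 0..4.
predicted_indices = [0, 3, 1, 0]

# Convert the indices back to the original class codes.
predicted_labels = [inverse_label_mapping[idx] for idx in predicted_indices]
print(predicted_labels)  # [100, 400, 200, 100]

# Encoding and then decoding returns the original codes.
assert [inverse_label_mapping[label_mapping[c]] for c in class_labels] == class_labels
```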

In [None]:
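# As with the Random Forest predictions: map the predicted class indices back to
# the original class codes, then binarize against the waste class.
xgboost_y_pred_merged = np.array([inverse_label_mapping[idx] for idx in xgboost_predictions_merged])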
xgboost_y_true_merged = y_merged

xgboost_y_true_bin_merged = np.where(xgboost_y_true_merged == waste_class_label, 1, 0)
xgboost_y_pred_bin_merged = np.where(xgboost_y_pred_merged == waste_class_label, 1, 0)

### Metrics

#### RF and XGBoost

In [None]:
TP_rf = np.sum((rf_y_pred_bin_merged == 1) & (rf_y_true_bin_merged == 1))
FP_rf = np.sum((rf_y_pred_bin_merged == 1) & (rf_y_true_bin_merged == 0))
TN_rf = np.sum((rf_y_pred_bin_merged == 0) & (rf_y_true_bin_merged == 0))
FN_rf = np.sum((rf_y_pred_bin_merged == 0) & (rf_y_true_bin_merged == 1))

accuracy_rf = (TP_rf + TN_rf) / (TP_rf + TN_rf + FP_rf + FN_rf)
precision_rf = TP_rf / (TP_rf + FP_rf) if (TP_rf + FP_rf) > 0 else 0
recall_rf = TP_rf / (TP_rf + FN_rf) if (TP_rf + FN_rf) > 0 else 0
f1_score_rf = 2 * (precision_rf * recall_rf) / (precision_rf + recall_rf) if (precision_rf + recall_rf) > 0 else 0

print(f"\nRandom Forest performance for the waste class ({waste_class_label}):")
print(f"True Positives (TP): {TP_rf}")
print(f"False Positives (FP): {FP_rf}")
print(f"True Negatives (TN): {TN_rf}")
print(f"False Negatives (FN): {FN_rf}")
print(f"Accuracy: {accuracy_rf:.4f}")
print(f"Precision: {precision_rf:.4f}")
print(f"Recall: {recall_rf:.4f}")
print(f"F1-score: {f1_score_rf:.4f}")


Random Forest performance for the waste class (100):
True Positives (TP): 4528
False Positives (FP): 0
True Negatives (TN): 5155
False Negatives (FN): 13073
Accuracy: 0.4255
Precision: 1.0000
Recall: 0.2573
F1-score: 0.4092


In [None]:
TP_xgboost = np.sum((xgboost_y_pred_bin_merged == 1) & (xgboost_y_true_bin_merged == 1))
FP_xgboost = np.sum((xgboost_y_pred_bin_merged == 1) & (xgboost_y_true_bin_merged == 0))
TN_xgboost = np.sum((xgboost_y_pred_bin_merged == 0) & (xgboost_y_true_bin_merged == 0))
FN_xgboost = np.sum((xgboost_y_pred_bin_merged == 0) & (xgboost_y_true_bin_merged == 1))

accuracy_xgb = (TP_xgboost + TN_xgboost) / (TP_xgboost + TN_xgboost + FP_xgboost + FN_xgboost)
precision_xgb = TP_xgboost / (TP_xgboost + FP_xgboost) if (TP_xgboost + FP_xgboost) > 0 else 0
recall_xgb = TP_xgboost / (TP_xgboost + FN_xgboost) if (TP_xgboost + FN_xgboost) > 0 else 0
f1_score_xgb = 2 * (precision_xgb * recall_xgb) / (precision_xgb + recall_xgb) if (precision_xgb + recall_xgb) > 0 else 0

print(f"\nXGBoost performance for the waste class ({waste_class_label}):")
print(f"True Positives (TP): {TP_xgboost}")
print(f"False Positives (FP): {FP_xgboost}")
print(f"True Negatives (TN): {TN_xgboost}")
print(f"False Negatives (FN): {FN_xgboost}")
print(f"Accuracy: {accuracy_xgb:.4f}")
print(f"Precision: {precision_xgb:.4f}")
print(f"Recall: {recall_xgb:.4f}")
print(f"F1-score: {f1_score_xgb:.4f}")


XGBoost performance for the waste class (100):
True Positives (TP): 293
False Positives (FP): 0
True Negatives (TN): 5155
False Negatives (FN): 17308
Accuracy: 0.2394
Precision: 1.0000
Recall: 0.0166
F1-score: 0.0327
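
The reported scores can be re-derived from the confusion counts printed above. The sketch below does this with a hypothetical helper, `binary_metrics`, and also computes the share of waste samples in the test set, which is the accuracy a trivial classifier would reach by always predicting the waste class.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) > 0
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total if total > 0 else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Confusion counts copied from the outputs above.
reported = {
    "Random Forest": dict(tp=4528, fp=0, tn=5155, fn=13073),
    "XGBoost": dict(tp=293, fp=0, tn=5155, fn=17308),
}

for name, counts in reported.items():
    scores = binary_metrics(**counts)
    print(name, {k: round(v, 4) for k, v in scores.items()})

# Share of waste samples: accuracy of always predicting the waste class.
rf = reported["Random Forest"]
waste_share = (rf["tp"] + rf["fn"]) / sum(rf.values())
print(f"Waste share of the test set: {waste_share:.4f}")
```

Both models produce no false positives, so precision is 1.0, but they miss most waste samples. Since roughly 77% of the test samples belong to the waste class, the low recall pulls accuracy well below that trivial baseline.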


### Summary

In [None]:
metrics_xgb = {
    'True Positives (TP)': TP_xgboost,
    'False Positives (FP)': FP_xgboost,
    'True Negatives (TN)': TN_xgboost,
    'False Negatives (FN)': FN_xgboost,
    'Accuracy': accuracy_xgb,
    'Precision': precision_xgb,
    'Recall': recall_xgb,
    'F1-score': f1_score_xgb,
}

metrics_rf = {
    'True Positives (TP)': TP_rf,
    'False Positives (FP)': FP_rf,
    'True Negatives (TN)': TN_rf,
    'False Negatives (FN)': FN_rf,
    'Accuracy': accuracy_rf,
    'Precision': precision_rf,
    'Recall': recall_rf,
    'F1-score': f1_score_rf
}

#### Save to JSON

In [None]:
def convert_numpy_types(obj):
    """Convert NumPy scalars and arrays to plain Python types for json.dump."""
    if isinstance(obj, np.integer):
        return int(obj)
    elif isinstance(obj, np.floating):
        return float(obj)
    elif isinstance(obj, np.ndarray):
        return obj.tolist()
    else:
        return obj


results = {
    'XGBoost': metrics_xgb,
    'Random Forest': metrics_rf,
}

with open('model_metrics_ml.json', 'w', encoding='utf-8') as f:
    json.dump(results, f, ensure_ascii=False, indent=4, default=convert_numpy_types)
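
To reuse the saved metrics elsewhere, the file can be read back with the standard `json` module. This is a minimal sketch, assuming the file keeps the structure written above (model name mapped to a dictionary of metrics). `format_comparison` is a hypothetical helper, and the sample values are illustrative, not the notebook's results.

```python
import json


def format_comparison(results, keys=("Accuracy", "Precision", "Recall", "F1-score")):
    """Return one text line per model with the selected metrics."""
    lines = []
    for model_name, metrics in results.items():
        scores = ", ".join(f"{key}={metrics[key]:.4f}" for key in keys)
        lines.append(f"{model_name}: {scores}")
    return lines


# In practice, load the saved file instead:
#   with open("model_metrics_ml.json", encoding="utf-8") as f:
#       results = json.load(f)
# Illustrative stand-in with the same structure:
sample = json.loads(
    '{"XGBoost": {"Accuracy": 0.5, "Precision": 1.0, '
    '"Recall": 0.25, "F1-score": 0.4}}'
)
for line in format_comparison(sample):
    print(line)
```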