# <center>Machine Learning Project</center>

** **
## <center>*03.12 - Ensemble*</center>

** **

The team members are:
- Ana Farinha - 20211514
- Francisco Capontes - 20211692
- Sofia Gomes - 20240848
- Rui Lourenço - 2021639



In [43]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.model_selection import StratifiedKFold
import time

from utils import *
from utils_feature_selection import check_performace
from utils_dicts import *


from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import StackingClassifier, VotingClassifier

## <span style="color:salmon"> 1. Import Dataset </span> 

In [45]:
# Load the preprocessed training data, indexed by claim identifier
train_df = pd.read_csv('../preprocessed_data/train_data.csv', index_col='Claim Identifier')

In [46]:
# Load the preprocessed test data, indexed by claim identifier
test_df = pd.read_csv('../preprocessed_data/test_data.csv', index_col='Claim Identifier')

In [47]:
# Choose the feature subset: essential_features, reduced_features, or [] to keep every feature
feature_selection = []

In [48]:
# Percentage of missing values per column; only columns with gaps are printed
missing_percentage = train_df.isna().sum() / len(train_df) * 100
for col, percent in missing_percentage.items():
    if percent != 0:
        print(f"{col}: {percent:.2f}% missing values")

Age at Injury: 0.40% missing values
Average Weekly Wage: 63.41% missing values
Birth Year: 0.40% missing values
Industry Code: 1.73% missing values
WCIO Cause of Injury Code: 2.72% missing values
WCIO Nature of Injury Code: 2.72% missing values
WCIO Part Of Body Code: 2.98% missing values
Zip Code: 4.99% missing values
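
The same check can be written without an explicit loop. The sketch below runs on a small made-up frame (`demo_df` and its columns are hypothetical, not part of the project data) and relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_df
demo_df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [1, 2, 3, 4],
    "c": [np.nan, np.nan, 1.0, 2.0],
})

# isna().mean() gives the fraction of missing values per column
missing_pct = demo_df.isna().mean().mul(100)

# Keep only columns that actually have gaps, worst first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct.round(2))
```

Sorting the result makes it easier to spot columns, such as Average Weekly Wage here, where most values are missing.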


## <span style="color:salmon"> 2. Prepare Dataset </span> 

Define `y` as the target column, "Claim Injury Type Encoded", and `X` as all remaining columns.

In [49]:
X = train_df.drop(["Claim Injury Type Encoded"], axis = 1)
y = train_df["Claim Injury Type Encoded"]

In [50]:
X_train, X_val, y_train, y_val = train_test_split(X,y, test_size = 0.25, stratify = y, shuffle = True)
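
The `stratify = y` argument matters because the classes are heavily imbalanced (the reports further down show 73 PTD rows against 218,021 NON-COMP rows in the training split). The sketch below uses synthetic labels to illustrate that a stratified split keeps class proportions nearly identical on both sides. All names ending in `_demo`, the class probabilities, and the `random_state` value are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (hypothetical data, not the claims dataset)
rng = np.random.default_rng(0)
y_demo = rng.choice(3, size=4000, p=[0.90, 0.08, 0.02])
X_demo = rng.normal(size=(4000, 2))

# Same split settings as the notebook, plus a fixed seed for repeatability
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, shuffle=True, random_state=0
)

# Class proportions on each side of the split
prop_tr = np.bincount(y_tr, minlength=3) / len(y_tr)
prop_va = np.bincount(y_va, minlength=3) / len(y_va)
print("train:", prop_tr.round(3))
print("val:  ", prop_va.round(3))
```

Passing a `random_state`, as in this sketch, would also make the notebook's split repeatable between runs; the original call leaves it unset.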

In [51]:
X_train, X_val = apply_frequency_encoding(X_train, X_val)
NA_imputer(X_train,X_val)
create_new_features(X_train,X_val)

In [52]:
# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_train[numerical_features])
X_train[numerical_features] = scaler.transform(X_train[numerical_features])
X_val[numerical_features] = scaler.transform(X_val[numerical_features])

In [53]:
# "Average Weekly Wage" is always dropped
drop_list = ["Average Weekly Wage"]
# With a non-empty feature_selection, also drop every column outside it
if feature_selection:
    for col in X.columns:
        if col not in feature_selection:
            drop_list.append(col)
X_train = X_train.drop(drop_list, axis=1)
X_val = X_val.drop(drop_list, axis=1)

In [54]:
# XGBoost
xgb_model = XGBClassifier(num_class=8,
        objective="multi:softmax",
        eval_metric="merror",
        verbosity=0)  # XGBClassifier's logging parameter is `verbosity`, not `verbose`
xgb_model.fit(X_train, y_train)

In [55]:
# LightGBM
lgbm_model = LGBMClassifier()
lgbm_model.fit(X_train, y_train)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.008667 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1916
[LightGBM] [Info] Number of data points in the train set: 430006, number of used features: 60
[LightGBM] [Info] Start training from score -3.832603
[LightGBM] [Info] Start training from score -0.679208
[LightGBM] [Info] Start training from score -2.119445
[LightGBM] [Info] Start training from score -1.351777
[LightGBM] [Info] Start training from score -2.475127
[LightGBM] [Info] Start training from score -4.915762
[LightGBM] [Info] Start training from score -8.681095
[LightGBM] [Info] Start training from score -7.133824


In [56]:
# CatBoost
catboost_model = CatBoostClassifier(verbose=0)
catboost_model.fit(X_train, y_train)

<catboost.core.CatBoostClassifier at 0x3cda43770>

In [57]:
# Gradient Boosting
gb_model = GradientBoostingClassifier()
gb_model.fit(X_train, y_train)

In [58]:
xgb_y_train_pred = xgb_model.predict(X_train)

In [59]:
lgbm_y_train_pred = lgbm_model.predict(X_train)

In [60]:
catboost_y_train_pred = catboost_model.predict(X_train)

In [61]:
gb_y_train_pred = gb_model.predict(X_train)

In [62]:
xgb_y_val_pred = xgb_model.predict(X_val)

In [63]:
lgbm_y_val_pred = lgbm_model.predict(X_val)

In [64]:
catboost_y_val_pred = catboost_model.predict(X_val)

In [65]:
gb_y_val_pred = gb_model.predict(X_val)

In [66]:
class_mapping = {
    0:'1. CANCELLED', 
    1:'2. NON-COMP',
    2:'3. MED ONLY', 
    3:'4. TEMPORARY',
    4:'5. PPD SCH LOSS', 
    5:'6. PPD NSL', 
    6:'7. PTD', 
    7:'8. DEATH'
}

# Use the values from class_mapping as the target names
target_names = list(class_mapping.values())

Compute the confusion matrix and classification report for each model, first on the training set and then on the validation set.

In [67]:
# Evaluate the XGBoost model on the training set
print("Confusion Matrix:")
print(confusion_matrix(y_train, xgb_y_train_pred))
print("\nClassification Report:")
print(classification_report(y_train, xgb_y_train_pred, target_names=target_names))

Confusion Matrix:
[[  5238   3460    133    457     23      0      0      0]
 [  1355 210055    961   5259    389      0      0      2]
 [    40  29524   5315  14458   2304      0      0      2]
 [    69  25152   1907  78296   5833     15      0      5]
 [     4   1330    478  11772  22600      2      0      0]
 [     0      6     33   2504    364    245      0      0]
 [     0      0      0      0      0      0     73      0]
 [     0      2      1      5      0      0      0    335]]

Classification Report:
                 precision    recall  f1-score   support

   1. CANCELLED       0.78      0.56      0.65      9311
    2. NON-COMP       0.78      0.96      0.86    218021
    3. MED ONLY       0.60      0.10      0.18     51643
   4. TEMPORARY       0.69      0.70      0.70    111277
5. PPD SCH LOSS       0.72      0.62      0.67     36186
     6. PPD NSL       0.94      0.08      0.14      3152
         7. PTD       1.00      1.00      1.00        73
       8. DEATH       0.97  
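
The per-class figures in the report follow directly from the confusion matrix: rows are true classes and columns are predicted classes, so recall divides the diagonal by the row sums and precision divides it by the column sums. The sketch below recomputes them with NumPy from the XGBoost training matrix printed above; the variable names are illustrative.

```python
import numpy as np

# XGBoost training confusion matrix, copied from the output above
cm = np.array([
    [5238,   3460,  133,   457,    23,   0,  0,   0],
    [1355, 210055,  961,  5259,   389,   0,  0,   2],
    [  40,  29524, 5315, 14458,  2304,   0,  0,   2],
    [  69,  25152, 1907, 78296,  5833,  15,  0,   5],
    [   4,   1330,  478, 11772, 22600,   2,  0,   0],
    [   0,      6,   33,  2504,   364, 245,  0,   0],
    [   0,      0,    0,     0,     0,   0, 73,   0],
    [   0,      2,    1,     5,     0,   0,  0, 335],
])

true_positives = np.diag(cm)
support = cm.sum(axis=1)          # rows: number of true samples per class
predicted_total = cm.sum(axis=0)  # columns: number of predictions per class

recall = true_positives / support
precision = true_positives / predicted_total
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

for i in range(len(cm)):
    print(f"class {i}: precision={precision[i]:.2f} recall={recall[i]:.2f} f1={f1[i]:.2f}")
print(f"accuracy={accuracy:.4f}")
```

Reading the matrix this way also shows where the errors go: most misclassified MED ONLY claims (row 2) are predicted as NON-COMP or TEMPORARY.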

In [68]:
# Evaluate the LightGBM model on the training set
print("Confusion Matrix:")
print(confusion_matrix(y_train, lgbm_y_train_pred))
print("\nClassification Report:")
print(classification_report(y_train, lgbm_y_train_pred, target_names=target_names))

Confusion Matrix:
[[  4807   3800    169    464     41     14      4     12]
 [  1504 207085   1791   6689    574    255     72     51]
 [    74  29715   4502  14709   2532     58     16     37]
 [   251  26055   2872  74484   7172    304     26    113]
 [    63   1490    793  12471  21257     58     13     41]
 [     9      9     58   2361    410    302      0      3]
 [     1      1      1     64      4      0      0      2]
 [     3     40     11    126      6      1      0    156]]

Classification Report:
                 precision    recall  f1-score   support

   1. CANCELLED       0.72      0.52      0.60      9311
    2. NON-COMP       0.77      0.95      0.85    218021
    3. MED ONLY       0.44      0.09      0.15     51643
   4. TEMPORARY       0.67      0.67      0.67    111277
5. PPD SCH LOSS       0.66      0.59      0.62     36186
     6. PPD NSL       0.30      0.10      0.15      3152
         7. PTD       0.00      0.00      0.00        73
       8. DEATH       0.38  

In [69]:
# Evaluate the CatBoost model on the training set
print("Confusion Matrix:")
print(confusion_matrix(y_train, catboost_y_train_pred))
print("\nClassification Report:")
print(classification_report(y_train, catboost_y_train_pred, target_names=target_names))

Confusion Matrix:
[[  4852   3811    150    466     31      0      0      1]
 [  1343 209946    991   5392    342      0      0      7]
 [    53  29718   5065  14531   2270      0      0      6]
 [    62  25309   1956  78110   5817      5      0     18]
 [     5   1297    537  12227  22118      2      0      0]
 [     0      8     32   2639    389     84      0      0]
 [     0      0      0     17      2      0     54      0]
 [     3      9      3     43      2      0      0    283]]

Classification Report:
                 precision    recall  f1-score   support

   1. CANCELLED       0.77      0.52      0.62      9311
    2. NON-COMP       0.78      0.96      0.86    218021
    3. MED ONLY       0.58      0.10      0.17     51643
   4. TEMPORARY       0.69      0.70      0.70    111277
5. PPD SCH LOSS       0.71      0.61      0.66     36186
     6. PPD NSL       0.92      0.03      0.05      3152
         7. PTD       1.00      0.74      0.85        73
       8. DEATH       0.90  

In [70]:
# Evaluate the Gradient Boosting model on the training set
print("Confusion Matrix:")
print(confusion_matrix(y_train, gb_y_train_pred))
print("\nClassification Report:")
print(classification_report(y_train, gb_y_train_pred, target_names=target_names))

Confusion Matrix:
[[  4616   3960    166    526     37      0      0      6]
 [  1685 209515   1160   5139    499      0      0     23]
 [    55  30884   3410  14722   2550      0      4     18]
 [    65  28605   2004  74328   6218      3     11     43]
 [    11   1651    493  14187  19843      0      1      0]
 [     0      3     22   2693    418     15      1      0]
 [     0      0      0     60     10      0      3      0]
 [     6     58      7    157     13      0      2    100]]

Classification Report:
                 precision    recall  f1-score   support

   1. CANCELLED       0.72      0.50      0.59      9311
    2. NON-COMP       0.76      0.96      0.85    218021
    3. MED ONLY       0.47      0.07      0.12     51643
   4. TEMPORARY       0.66      0.67      0.67    111277
5. PPD SCH LOSS       0.67      0.55      0.60     36186
     6. PPD NSL       0.83      0.00      0.01      3152
         7. PTD       0.14      0.04      0.06        73
       8. DEATH       0.53  

In [71]:
# Evaluate the XGBoost model on the validation set
print("Confusion Matrix:")
print(confusion_matrix(y_val, xgb_y_val_pred))
print("\nClassification Report:")
print(classification_report(y_val, xgb_y_val_pred, target_names=target_names))

Confusion Matrix:
[[ 1533  1326    48   185    11     0     0     0]
 [  480 69544   436  2064   145     0     0     5]
 [   12  9980  1340  5008   868     2     1     4]
 [   20  8445   841 25605  2153    11     0    18]
 [    2   431   203  4344  7078     4     0     0]
 [    0     4    17   898   129     3     0     0]
 [    0     0     1    22     1     0     0     0]
 [    1    11     7    55     1     0     0    39]]

Classification Report:
                 precision    recall  f1-score   support

   1. CANCELLED       0.75      0.49      0.60      3103
    2. NON-COMP       0.77      0.96      0.86     72674
    3. MED ONLY       0.46      0.08      0.13     17215
   4. TEMPORARY       0.67      0.69      0.68     37093
5. PPD SCH LOSS       0.68      0.59      0.63     12062
     6. PPD NSL       0.15      0.00      0.01      1051
         7. PTD       0.00      0.00      0.00        24
       8. DEATH       0.59      0.34      0.43       114

       accuracy                   

In [72]:
# Evaluate the LightGBM model on the validation set
print("Confusion Matrix:")
print(confusion_matrix(y_val, lgbm_y_val_pred))
print("\nClassification Report:")
print(classification_report(y_val, lgbm_y_val_pred, target_names=target_names))

Confusion Matrix:
[[ 1460  1365    53   196    17     5     1     6]
 [  532 68667   700  2352   210   105    42    66]
 [   26  9975  1348  4894   909    23     9    31]
 [  116  8580  1036 24569  2465   210    22    95]
 [   19   478   283  4297  6906    45    16    18]
 [    5     4    20   836   145    38     1     2]
 [    0     1     1    17     5     0     0     0]
 [    2    11     5    56     4     1     0    35]]

Classification Report:
                 precision    recall  f1-score   support

   1. CANCELLED       0.68      0.47      0.55      3103
    2. NON-COMP       0.77      0.94      0.85     72674
    3. MED ONLY       0.39      0.08      0.13     17215
   4. TEMPORARY       0.66      0.66      0.66     37093
5. PPD SCH LOSS       0.65      0.57      0.61     12062
     6. PPD NSL       0.09      0.04      0.05      1051
         7. PTD       0.00      0.00      0.00        24
       8. DEATH       0.14      0.31      0.19       114

       accuracy                   

In [73]:
# Evaluate the CatBoost model on the validation set
print("Confusion Matrix:")
print(confusion_matrix(y_val, catboost_y_val_pred))
print("\nClassification Report:")
print(classification_report(y_val, catboost_y_val_pred, target_names=target_names))

Confusion Matrix:
[[ 1501  1359    42   192     7     0     0     2]
 [  473 69602   415  2032   148     0     0     4]
 [   12 10001  1357  5025   818     0     0     2]
 [   19  8398   811 25744  2086    12     0    23]
 [    0   414   209  4401  7038     0     0     0]
 [    0     4    11   910   125     1     0     0]
 [    0     0     0    21     3     0     0     0]
 [    0    12     3    55     0     0     0    44]]

Classification Report:
                 precision    recall  f1-score   support

   1. CANCELLED       0.75      0.48      0.59      3103
    2. NON-COMP       0.78      0.96      0.86     72674
    3. MED ONLY       0.48      0.08      0.14     17215
   4. TEMPORARY       0.67      0.69      0.68     37093
5. PPD SCH LOSS       0.69      0.58      0.63     12062
     6. PPD NSL       0.08      0.00      0.00      1051
         7. PTD       0.00      0.00      0.00        24
       8. DEATH       0.59      0.39      0.47       114

       accuracy                   

In [74]:
# Evaluate the Gradient Boosting model on the validation set
print("Confusion Matrix:")
print(confusion_matrix(y_val, gb_y_val_pred))
print("\nClassification Report:")
print(classification_report(y_val, gb_y_val_pred, target_names=target_names))

Confusion Matrix:
[[ 1480  1358    58   199     8     0     0     0]
 [  523 69785   423  1764   172     0     0     7]
 [   11 10315  1081  4933   870     0     1     4]
 [   18  9357   670 24900  2118     2     1    27]
 [    2   535   172  4798  6550     2     1     2]
 [    0     1    11   918   120     1     0     0]
 [    0     0     0    19     5     0     0     0]
 [    1    20     7    55     8     0     0    23]]

Classification Report:
                 precision    recall  f1-score   support

   1. CANCELLED       0.73      0.48      0.58      3103
    2. NON-COMP       0.76      0.96      0.85     72674
    3. MED ONLY       0.45      0.06      0.11     17215
   4. TEMPORARY       0.66      0.67      0.67     37093
5. PPD SCH LOSS       0.66      0.54      0.60     12062
     6. PPD NSL       0.20      0.00      0.00      1051
         7. PTD       0.00      0.00      0.00        24
       8. DEATH       0.37      0.20      0.26       114

       accuracy                   

## <span style="color:salmon"> 4. Test Ensemble </span> 

In [76]:
# Define the base models
base_models = [
    ("xgb", xgb_model),
    ("lgbm", lgbm_model),
    ("catboost", catboost_model),
    ("gb", gb_model),
]

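# Stacking Classifier
# The base models are cloned and refitted inside fit(); the logistic regression
# meta-learner is trained on their 5-fold cross-validated predictions
stacking_model = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    cv=5
)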

# Train stacking model
stacking_model.fit(X_train, y_train)

# Evaluate
y_pred = stacking_model.predict(X_val)
print("Accuracy:", accuracy_score(y_val, y_pred))

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.007794 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1916
[LightGBM] [Info] Number of data points in the train set: 430006, number of used features: 60
[LightGBM] [Info] Start training from score -3.832603
[LightGBM] [Info] Start training from score -0.679208
[LightGBM] [Info] Start training from score -2.119445
[LightGBM] [Info] Start training from score -1.351777
[LightGBM] [Info] Start training from score -2.475127
[LightGBM] [Info] Start training from score -4.915762
[LightGBM] [Info] Start training from score -8.681095
[LightGBM] [Info] Start training from score -7.133824
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.008605 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_
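
The repeated LightGBM logs above are consistent with how `StackingClassifier` trains: each base model is refitted once on the full training data and once per cross-validation fold, and the meta-learner only ever sees the out-of-fold predictions. The sketch below shows the same mechanics on a small synthetic problem. The dataset, the two lightweight base models, and `max_iter=1000` are assumptions chosen so that it runs quickly; they are not the notebook's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small synthetic 3-class problem (hypothetical stand-in for the claims data)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=3, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

demo_stack = StackingClassifier(
    estimators=[
        ("gb", GradientBoostingClassifier(n_estimators=20, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
demo_stack.fit(X_tr, y_tr)

# transform() exposes the meta-features fed to the final estimator:
# one block of class probabilities per base model
meta_features = demo_stack.transform(X_va)
print(meta_features.shape)
print("validation accuracy:", demo_stack.score(X_va, y_va))
```

With eight classes and four base models, the meta-learner in the notebook would receive 32 probability columns under the same logic.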

In [78]:
# Get individual predictions
xgb_pred = xgb_model.predict_proba(X_val)
lgbm_pred = lgbm_model.predict_proba(X_val)
catboost_pred = catboost_model.predict_proba(X_val)
gb_pred = gb_model.predict_proba(X_val)

# Weighted average of probabilities
weights = [0.25, 0.25, 0.25, 0.25]  # Adjust weights if necessary
final_pred = (weights[0] * xgb_pred +
              weights[1] * lgbm_pred +
              weights[2] * catboost_pred +
              weights[3] * gb_pred)

# Convert probabilities to class predictions
final_class_pred = np.argmax(final_pred, axis=1)

print("Accuracy:", accuracy_score(y_val, final_class_pred))


Accuracy: 0.7321189373220963
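
The weighted average above is a manual form of soft voting. The sketch below repeats the arithmetic on tiny made-up probability arrays (all values and the unequal `demo_weights` are hypothetical) to show how the weights can change the winning class.

```python
import numpy as np

# Hypothetical class probabilities from three models for two samples
proba_a = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
proba_b = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
proba_c = np.array([[0.2, 0.7, 0.1], [0.3, 0.3, 0.4]])

# Stack to shape (n_models, n_samples, n_classes) and average over the model axis
stacked = np.stack([proba_a, proba_b, proba_c])
demo_weights = np.array([0.5, 0.3, 0.2])
blended = np.average(stacked, axis=0, weights=demo_weights)

# argmax returns a column index; it equals the class label only when
# labels are 0..n_classes-1 in order, as with the encoded target here
blended_classes = blended.argmax(axis=1)
print(blended.round(2))
print(blended_classes)
```

`np.average` normalizes by the sum of the weights, so the blended rows still sum to 1 even if the weights do not.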


In [81]:
# Voting Classifier
# Re-scale all columns (the numerical features were already standardized above)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

# Train the ensemble
voting_model = VotingClassifier(
    estimators=base_models,
    voting="hard"  # Use "hard" for majority voting or "soft" for averaged probabilities
)
voting_model.fit(X_train_scaled, y_train)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.012722 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1929
[LightGBM] [Info] Number of data points in the train set: 430006, number of used features: 60
[LightGBM] [Info] Start training from score -3.832603
[LightGBM] [Info] Start training from score -0.679208
[LightGBM] [Info] Start training from score -2.119445
[LightGBM] [Info] Start training from score -1.351777
[LightGBM] [Info] Start training from score -2.475127
[LightGBM] [Info] Start training from score -4.915762
[LightGBM] [Info] Start training from score -8.681095
[LightGBM] [Info] Start training from score -7.133824


In [83]:
# Predict on the same scaled features used for fitting, and score against the labels
y_pred = voting_model.predict(X_val_scaled)
print("Accuracy:", accuracy_score(y_val, y_pred))

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 2 dimensions. The detected shape was (4, 143336) + inhomogeneous part.
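
The shape in this error, (4, 143336), matches four estimators by 143,336 validation rows, which suggests the failure happens while hard voting stacks the label predictions of the base models. One plausible cause, stated here as an assumption rather than something verified against this run, is that one model returns its labels as a column vector of shape (n, 1) while the others return shape (n,); CatBoost's multiclass `predict` is known to behave this way. The sketch below reproduces the shape mismatch with toy arrays and shows a manual majority vote after flattening. All arrays and names are hypothetical.

```python
import numpy as np

# Hypothetical label predictions from three models for four samples
flat_pred_1 = np.array([0, 1, 1, 2])            # shape (4,)
flat_pred_2 = np.array([0, 1, 2, 2])            # shape (4,)
column_pred = np.array([[0], [1], [1], [1]])    # shape (4, 1), column vector

preds = [flat_pred_1, flat_pred_2, column_pred]
print([p.shape for p in preds])  # mixed shapes cannot form one regular array

# Flatten every prediction to 1-D before stacking: (n_samples, n_models)
votes = np.asarray([np.ravel(p) for p in preds]).T

# Majority vote per sample; ties resolve to the smallest class label
n_classes = 3
majority = np.apply_along_axis(
    lambda row: np.bincount(row, minlength=n_classes).argmax(), axis=1, arr=votes
)
print(majority)
```

In the notebook itself, two options that avoid the mismatch would be `voting="soft"`, which combines `predict_proba` outputs instead of labels, or keeping the manual probability average computed earlier, which already gives a working soft-voting result.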