# <center>Machine Learning Project</center>

** **
## <center>*03.9 - XGBoosted*</center>

** **

The members of the team are:
- Ana Farinha - 20211514
- Francisco Capontes - 20211692
- Sofia Gomes - 20240848
- Rui Lourenço - 2021639



In [1]:
# Import libraries
import pandas as pd
import numpy as np

# Train/validation split, scaling, and evaluation metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from xgboost import XGBClassifier

from sklearn.model_selection import StratifiedKFold
import time

from utils import *
from utils_feature_selection import check_performace
from utils_dicts import *

import warnings
warnings.filterwarnings('ignore')

## <span style="color:salmon"> 1. Import Dataset </span> 

In [2]:
# Import the training dataset
train_df = pd.read_csv('preprocessed_data/train_data.csv', index_col="Claim Identifier")

In [3]:
# Import the test dataset
test_df = pd.read_csv('./preprocessed_data/test_data.csv', index_col = 'Claim Identifier')

In [4]:
# Define Feature Selection: essential_features, reduced_features or []
feature_selection = []

## <span style="color:salmon"> 2. Prepare Dataset </span> 

Define `y` as the target column, "Claim Injury Type Encoded", and `X` as all the remaining columns.

In [5]:
X = train_df.drop(["Claim Injury Type Encoded"], axis = 1)
y = train_df["Claim Injury Type Encoded"]

## <span style="color:salmon"> 3. XGBoosted </span> 

In [6]:
config = {'max_depth': 6, 'learning_rate': 0.1, 'n_estimators': 200}

In [7]:
model = XGBClassifier(
        max_depth=config["max_depth"],
        learning_rate=config["learning_rate"],
        n_estimators=config["n_estimators"],
        verbosity=0,  # XGBoost's logging parameter is `verbosity`, not `verbose`
    )

In [8]:
check_performace(model,X,y,numerical_features,essential_features,n_folds = 5)

Fold 1 train F1 score: 0.3611
Fold 1 validation F1 score: 0.3118
------------------------------
Fold 2 train F1 score: 0.3576
Fold 2 validation F1 score: 0.3131
------------------------------
Fold 3 train F1 score: 0.3499
Fold 3 validation F1 score: 0.3128
------------------------------
Fold 4 train F1 score: 0.3521
Fold 4 validation F1 score: 0.3142
------------------------------
Fold 5 train F1 score: 0.3472
Fold 5 validation F1 score: 0.3169
------------------------------
Average Train F1 score: 0.3535796536243594
Average Validation F1 score: 0.31376557955550777


In [9]:
check_performace(model,X,y,numerical_features,reduced_features,n_folds = 5)

Fold 1 train F1 score: 0.4727
Fold 1 validation F1 score: 0.3654
------------------------------
Fold 2 train F1 score: 0.5003
Fold 2 validation F1 score: 0.3740
------------------------------
Fold 3 train F1 score: 0.4918
Fold 3 validation F1 score: 0.3719
------------------------------
Fold 4 train F1 score: 0.5003
Fold 4 validation F1 score: 0.3674
------------------------------
Fold 5 train F1 score: 0.4942
Fold 5 validation F1 score: 0.3644
------------------------------
Average Train F1 score: 0.49185220194843843
Average Validation F1 score: 0.3686070909952829


In [10]:
check_performace(model,X,y,numerical_features,[],n_folds = 5)

Fold 1 train F1 score: 0.6245
Fold 1 validation F1 score: 0.4413
------------------------------
Fold 2 train F1 score: 0.6201
Fold 2 validation F1 score: 0.4284
------------------------------
Fold 3 train F1 score: 0.6264
Fold 3 validation F1 score: 0.4346
------------------------------
Fold 4 train F1 score: 0.6142
Fold 4 validation F1 score: 0.4473
------------------------------
Fold 5 train F1 score: 0.6204
Fold 5 validation F1 score: 0.4448
------------------------------
Average Train F1 score: 0.6211168856198206
Average Validation F1 score: 0.43929690557796564


#### <span style="color:salmon"> 3.1  Evaluate the model </span> 


In [11]:
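# Note: test_size=0.75 holds out 75% of the rows for validation, so the model is fit on the remaining 25%
X_train, X_val, y_train, y_val = train_test_split(X,y, test_size = 0.75, stratify = y, shuffle = True)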

In [12]:
X_train, X_val = apply_frequency_encoding(X_train, X_val)
NA_imputer(X_train,X_val)
create_new_features(X_train,X_val)
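
`apply_frequency_encoding` comes from `utils` and its implementation is not shown here. The sketch below illustrates one common form of frequency encoding, assuming each category is replaced by its relative frequency in the training split and that categories unseen in training map to 0. The function name `frequency_encode` and the `County` column are hypothetical.

```python
import pandas as pd


def frequency_encode(train, other, columns):
    """Hypothetical helper: encode categories by their training-set frequency."""
    train, other = train.copy(), other.copy()
    for col in columns:
        # Frequencies are learned from the training split only, to avoid leakage
        freq = train[col].value_counts(normalize=True)
        train[col] = train[col].map(freq)
        # Categories never seen in training have no frequency; use 0.0 (assumption)
        other[col] = other[col].map(freq).fillna(0.0)
    return train, other


train_demo = pd.DataFrame({"County": ["A", "A", "B", "C"]})
val_demo = pd.DataFrame({"County": ["A", "D"]})
train_enc, val_enc = frequency_encode(train_demo, val_demo, ["County"])
print(train_enc["County"].tolist())
print(val_enc["County"].tolist())
```

Learning the mapping on the training split and only applying it to the validation split mirrors how the scaler is fitted in the next cell.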

In [13]:
scaler = StandardScaler().fit(X_train[numerical_features])
X_train[numerical_features]  = scaler.transform(X_train[numerical_features])
X_val[numerical_features]  = scaler.transform(X_val[numerical_features])  

In [14]:
drop_list = ["Average Weekly Wage"]
if feature_selection != []:
    for col in X.columns:
        if col not in feature_selection:
            drop_list.append(col)
X_train = X_train.drop(drop_list, axis=1)
X_val = X_val.drop(drop_list, axis=1)
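
The same drop logic is repeated later for the test set, so it could be wrapped in a small helper. This is only a sketch: the name `select_columns` and the demo column names are hypothetical, and it assumes the behaviour of the cell above (always drop "Average Weekly Wage", and keep only the selected features when the selection is non-empty).

```python
import pandas as pd


def select_columns(df, keep, always_drop=("Average Weekly Wage",)):
    """Hypothetical helper: drop fixed columns, plus anything not in `keep`."""
    drop = [col for col in always_drop if col in df.columns]
    if keep:  # an empty selection means "use all remaining columns"
        drop += [col for col in df.columns if col not in keep and col not in drop]
    return df.drop(columns=drop)


demo = pd.DataFrame({"Average Weekly Wage": [1.0], "Age": [30], "Region": [2]})
print(list(select_columns(demo, []).columns))
print(list(select_columns(demo, ["Age"]).columns))
```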

In [15]:
model.fit(X_train, y_train)

In [16]:
y_train_pred = model.predict(X_train)

In [17]:
y_val_pred = model.predict(X_val)

In [18]:
class_mapping = {
    0:'1. CANCELLED', 
    1:'2. NON-COMP',
    2:'3. MED ONLY', 
    3:'4. TEMPORARY',
    4:'5. PPD SCH LOSS', 
    5:'6. PPD NSL', 
    6:'7. PTD', 
    7:'8. DEATH'
}

# Use the values from class_mapping as the target names
target_names = list(class_mapping.values())

Compute the confusion matrix and the classification report, first on the training set and then on the validation set.

In [19]:
# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_train, y_train_pred))
print("\nClassification Report:")
print(classification_report(y_train, y_train_pred, target_names=target_names))

Confusion Matrix:
[[ 1716  1297    59    29     2     1     0     0]
 [  432 71609   274   326    33     0     0     0]
 [    5  8903  2252  5392   662     0     0     0]
 [   11  1839   438 33165  1637     1     0     1]
 [    0   159   123  3732  8048     0     0     0]
 [    0     0     9   727   115   200     0     0]
 [    0     0     0     0     0     0    24     0]
 [    0     0     0     1     0     0     0   113]]

Classification Report:
                 precision    recall  f1-score   support

   1. CANCELLED       0.79      0.55      0.65      3104
    2. NON-COMP       0.85      0.99      0.92     72674
    3. MED ONLY       0.71      0.13      0.22     17214
   4. TEMPORARY       0.76      0.89      0.82     37092
5. PPD SCH LOSS       0.77      0.67      0.71     12062
     6. PPD NSL       0.99      0.19      0.32      1051
         7. PTD       1.00      1.00      1.00        24
       8. DEATH       0.99      0.99      0.99       114

       accuracy                   
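
To make the link between the two printed outputs explicit, the per-class figures in the classification report can be recomputed from the training confusion matrix above: rows are true classes and columns are predicted classes, so recall divides the diagonal by the row sums and precision divides it by the column sums. The variable names below are illustrative only.

```python
import numpy as np

# Training confusion matrix copied from the output above
cm = np.array([
    [1716,  1297,   59,    29,    2,   1,  0,   0],
    [ 432, 71609,  274,   326,   33,   0,  0,   0],
    [   5,  8903, 2252,  5392,  662,   0,  0,   0],
    [  11,  1839,  438, 33165, 1637,   1,  0,   1],
    [   0,   159,  123,  3732, 8048,   0,  0,   0],
    [   0,     0,    9,   727,  115, 200,  0,   0],
    [   0,     0,    0,     0,    0,   0, 24,   0],
    [   0,     0,    0,     1,    0,   0,  0, 113],
])

true_positives = np.diag(cm)
support = cm.sum(axis=1)          # rows: number of true samples per class
predicted = cm.sum(axis=0)        # columns: number of predictions per class

recall = true_positives / support
precision = true_positives / predicted
f1 = 2 * precision * recall / (precision + recall)

# Index 2 is "3. MED ONLY": most of its rows are predicted as NON-COMP or TEMPORARY
print(round(precision[2], 2), round(recall[2], 2), round(f1[2], 2))
```

For "3. MED ONLY", only 2252 of the 17214 true samples sit on the diagonal, which is why its recall is so low even though its precision is reasonable.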

In [20]:
# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_val, y_val_pred))
print("\nClassification Report:")
print(classification_report(y_val, y_val_pred, target_names=target_names))


Confusion Matrix:
[[  4544   4379    169    199     17      0      0      2]
 [  1494 213544   1413   1406    155      0      0      9]
 [    56  27316   4427  17661   2171      2      0     11]
 [    63   6019   2159  96479   6480     35      0     43]
 [     2    501    622  13878  21172     11      0      0]
 [     0     10     50   2651    428     13      0      0]
 [     0      0      2     65      6      0      0      0]
 [     1     44     14    197      3      0      0     84]]

Classification Report:
                 precision    recall  f1-score   support

   1. CANCELLED       0.74      0.49      0.59      9310
    2. NON-COMP       0.85      0.98      0.91    218021
    3. MED ONLY       0.50      0.09      0.15     51644
   4. TEMPORARY       0.73      0.87      0.79    111278
5. PPD SCH LOSS       0.70      0.59      0.64     36186
     6. PPD NSL       0.21      0.00      0.01      3152
         7. PTD       0.00      0.00      0.00        73
       8. DEATH       0.56  

## <span style="color:salmon"> 4. Test Predictions </span> 

Make test predictions:

In [21]:
X, test_df = apply_frequency_encoding(X, test_df)
NA_imputer(X, test_df)
create_new_features(X, test_df)

In [22]:
scaler = StandardScaler().fit(X[numerical_features])
X[numerical_features]  = scaler.transform(X[numerical_features])
test_df[numerical_features]  = scaler.transform(test_df[numerical_features])  

In [23]:
drop_list = ["Average Weekly Wage"]
if feature_selection != []:
    for col in X.columns:
        if col not in feature_selection:
            drop_list.append(col)
test_df = test_df.drop(drop_list, axis=1)

In [24]:
# Make test predictions
y_test_pred = model.predict(test_df)
y_test_pred = y_test_pred.ravel()

In [25]:
y_test_final = np.array([class_mapping[i] for i in y_test_pred])

In [26]:
test_id = test_df.index

In [27]:
submission_df = pd.DataFrame({
    'Claim Identifier': test_id,
    'Claim Injury Type': y_test_final
})

In [28]:
# Change to True to write a new submission file
if False:
    version = version_control()
    submission_df.to_csv(f'./submissions/Group49_Version{version:02}.csv', index=False)