# <center>Machine Learning Project</center>

** **
## <center>*03.3 - Gradient Boost*</center>

** **

The members of the `team` are:
- Ana Farinha - 20211514
- Francisco Capontes - 20211692
- Sofia Gomes - 20240848
- Rui Lourenço - 2021639



In [1]:
# Import libraries
import pandas as pd
import numpy as np

# Model selection, estimators, preprocessing and metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import StratifiedKFold
import time

from utils import *
from utils_feature_selection import check_performace
from utils_dicts import *

import warnings
warnings.filterwarnings('ignore')

random_state=68+1

## <span style="color:salmon"> 1. Import Dataset </span> 

In [2]:
# Import the training dataset
train_df = pd.read_csv('preprocessed_data/train_data.csv', index_col="Claim Identifier")

In [3]:
# Import the test dataset
test_df = pd.read_csv('./preprocessed_data/test_data.csv', index_col = 'Claim Identifier')

In [4]:
# Define Feature Selection: essential_features, reduced_features or []
feature_selection = []

## <span style="color:salmon"> 2. Prepare Dataset </span> 

Define `y` as the target, "Claim Injury Type Encoded", and `X` as all the other columns.

In [5]:
X = train_df.drop(["Claim Injury Type Encoded"], axis = 1)
y = train_df["Claim Injury Type Encoded"]

## <span style="color:salmon"> 3. Gradient Boosting</span> 

In [6]:
# Decision tree with balanced class weights to compensate for the class imbalance
model = DecisionTreeClassifier(
    max_depth=None,
    min_samples_leaf=10,
    class_weight='balanced',
    random_state=random_state,
)

In [7]:
check_performace(model,X,y,numerical_features,essential_features,n_folds = 5, random_state=random_state)

Fold 1 train F1 score: 0.3973
Fold 1 validation F1 score: 0.3071
------------------------------
Fold 2 train F1 score: 0.3979
Fold 2 validation F1 score: 0.3090
------------------------------
Fold 3 train F1 score: 0.4006
Fold 3 validation F1 score: 0.3113
------------------------------
Fold 4 train F1 score: 0.4008
Fold 4 validation F1 score: 0.3099
------------------------------
Fold 5 train F1 score: 0.3987
Fold 5 validation F1 score: 0.3097
------------------------------
Average Train F1 score: 0.3990708023492851
Average Validation F1 score: 0.30938762251411256


In [8]:
check_performace(model,X,y,numerical_features,reduced_features,n_folds = 5, random_state=random_state)

Fold 1 train F1 score: 0.4923
Fold 1 validation F1 score: 0.3038
------------------------------
Fold 2 train F1 score: 0.4922
Fold 2 validation F1 score: 0.3033
------------------------------
Fold 3 train F1 score: 0.4911
Fold 3 validation F1 score: 0.3024
------------------------------
Fold 4 train F1 score: 0.4900
Fold 4 validation F1 score: 0.2976
------------------------------
Fold 5 train F1 score: 0.4940
Fold 5 validation F1 score: 0.3031
------------------------------
Average Train F1 score: 0.4919281047438123
Average Validation F1 score: 0.3020322293756747


In [9]:
check_performace(model,X,y,numerical_features,[],n_folds = 5, random_state=random_state)

Fold 1 train F1 score: 0.5409
Fold 1 validation F1 score: 0.3329
------------------------------
Fold 2 train F1 score: 0.5364
Fold 2 validation F1 score: 0.3266
------------------------------
Fold 3 train F1 score: 0.5352
Fold 3 validation F1 score: 0.3264
------------------------------
Fold 4 train F1 score: 0.5395
Fold 4 validation F1 score: 0.3235
------------------------------
Fold 5 train F1 score: 0.5388
Fold 5 validation F1 score: 0.3228
------------------------------
Average Train F1 score: 0.5381845547752319
Average Validation F1 score: 0.32644032975882115


#### <span style="color:salmon"> 3.1  Evaluate the model </span> 


In [10]:
X_train, X_val, y_train, y_val = train_test_split(X,y, test_size = 0.25, stratify = y, shuffle = True, random_state=random_state)
X_train_to_preprocess=X_train.copy()

In [11]:
remove_outliers(X_train)
X_train, X_val = apply_frequency_encoding(X_train, X_val)
NA_imputer(X_train,X_val)
create_new_features(X_train,X_val)

In [12]:
scaler = StandardScaler().fit(X_train[numerical_features])
X_train[numerical_features]  = scaler.transform(X_train[numerical_features])
X_val[numerical_features]  = scaler.transform(X_val[numerical_features])  
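The scaler above is fitted on the training split only and then reused to transform the validation split, so no statistics from the validation data leak into training. A minimal sketch of the same pattern on toy data follows. The column names and values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy columns and values, not taken from the claims dataset
toy_numerical = ["age", "wage"]
toy_train = pd.DataFrame({"age": [20.0, 30.0, 40.0], "wage": [100.0, 200.0, 300.0]})
toy_val = pd.DataFrame({"age": [50.0], "wage": [200.0]})

# Learn mean and standard deviation from the training split only
toy_scaler = StandardScaler().fit(toy_train[toy_numerical])

# Apply the same training statistics to both splits
toy_train[toy_numerical] = toy_scaler.transform(toy_train[toy_numerical])
toy_val[toy_numerical] = toy_scaler.transform(toy_val[toy_numerical])

print(toy_train)
print(toy_val)
```

After scaling, the training columns have zero mean, while the validation values are expressed relative to the training mean and standard deviation.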

In [13]:
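# Always drop "Average Weekly Wage"; if a feature subset is selected,
# also drop every column that is not part of it
drop_list = ["Average Weekly Wage"]
if feature_selection:
    drop_list += [col for col in X.columns
                  if col not in feature_selection and col not in drop_list]
X_train = X_train.drop(drop_list, axis=1)
X_val = X_val.drop(drop_list, axis=1)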

In [14]:
model.fit(X_train, y_train)

In [15]:
y_train_pred = model.predict(X_train)

In [16]:
y_val_pred = model.predict(X_val)

In [17]:
class_mapping = {
    0:'1. CANCELLED', 
    1:'2. NON-COMP',
    2:'3. MED ONLY', 
    3:'4. TEMPORARY',
    4:'5. PPD SCH LOSS', 
    5:'6. PPD NSL', 
    6:'7. PTD', 
    7:'8. DEATH'
}

# Use the values from class_mapping as the target names
target_names = list(class_mapping.values())

Compute the confusion matrix and the classification report to evaluate the predictions, first on the training set and then on the validation set.

In [18]:
# Evaluate the model on the training set
print("Confusion Matrix:")
print(confusion_matrix(y_train, y_train_pred))
print("\nClassification Report:")
print(classification_report(y_train, y_train_pred, target_names=target_names))

Confusion Matrix:
[[  9267      0     25      5     22     19      0     18]
 [ 12560 154877  35916  11073   3301    287     23    272]
 [  1650   3403  37687   4186   3518   1077     33    126]
 [  3072   9229  15287  60969  15112   6998    236    477]
 [   340    206   1214   2049  30511   1778     47     65]
 [     0      0      0      0      2   3147      6      3]
 [     0      0      0      0      0      0     73      0]
 [     0      0      0      0      0      0      0    352]]

Classification Report:
                 precision    recall  f1-score   support

   1. CANCELLED       0.34      0.99      0.51      9356
    2. NON-COMP       0.92      0.71      0.80    218309
    3. MED ONLY       0.42      0.73      0.53     51680
   4. TEMPORARY       0.78      0.55      0.64    111380
5. PPD SCH LOSS       0.58      0.84      0.69     36210
     6. PPD NSL       0.24      1.00      0.38      3158
         7. PTD       0.17      1.00      0.30        73
       8. DEATH       0.27  

In [19]:
# Evaluate the model on the validation set
print("Confusion Matrix:")
print(confusion_matrix(y_val, y_val_pred))
print("\nClassification Report:")
print(classification_report(y_val, y_val_pred, target_names=target_names))

Confusion Matrix:
[[ 2098   534   251   124    81    11     2    18]
 [ 4794 45474 16402  4773  1119   101     1   105]
 [  893  5143  5491  3367  1836   425    14    57]
 [ 1106  4231  7028 15188  6602  2719    86   167]
 [  188   141  1101  2154  7454   964    32    36]
 [   15     0    80   402   266   276     9     5]
 [    1     0     4     6     6     5     2     0]
 [   23     4     8    11     9     2     0    61]]

Classification Report:
                 precision    recall  f1-score   support

   1. CANCELLED       0.23      0.67      0.34      3119
    2. NON-COMP       0.82      0.62      0.71     72769
    3. MED ONLY       0.18      0.32      0.23     17226
   4. TEMPORARY       0.58      0.41      0.48     37127
5. PPD SCH LOSS       0.43      0.62      0.51     12070
     6. PPD NSL       0.06      0.26      0.10      1053
         7. PTD       0.01      0.08      0.02        24
       8. DEATH       0.14      0.52      0.22       118

       accuracy                   
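The per-class figures in the validation report can be recomputed from the validation confusion matrix, where rows are true classes and columns are predicted classes. The sketch below copies that matrix from the output above and derives support, recall, precision and F1 with NumPy. The macro average is included as an assumption about how the classes are combined.

```python
import numpy as np

# Validation confusion matrix copied from the output above
# (rows = true class, columns = predicted class)
cm = np.array([
    [2098,   534,   251,   124,   81,   11,  2,  18],
    [4794, 45474, 16402,  4773, 1119,  101,  1, 105],
    [ 893,  5143,  5491,  3367, 1836,  425, 14,  57],
    [1106,  4231,  7028, 15188, 6602, 2719, 86, 167],
    [ 188,   141,  1101,  2154, 7454,  964, 32,  36],
    [  15,     0,    80,   402,  266,  276,  9,   5],
    [   1,     0,     4,     6,    6,    5,  2,   0],
    [  23,     4,     8,    11,    9,    2,  0,  61],
])

true_pos = np.diag(cm)            # correctly classified samples per class
support = cm.sum(axis=1)          # number of true samples per class (row sums)
predicted = cm.sum(axis=0)        # number of predictions per class (column sums)

recall = true_pos / support       # share of each true class that was recovered
precision = true_pos / predicted  # share of each predicted class that was correct
f1 = 2 * precision * recall / (precision + recall)
macro_f1 = f1.mean()              # unweighted mean over classes (assumed averaging)

for cls, (p, r, f, s) in enumerate(zip(precision, recall, f1, support)):
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f:.2f} support={s}")
print(f"macro F1: {macro_f1:.4f}")
```

Reading the matrix this way makes it clear that the rare classes are hurt mostly by low precision: many samples from the large classes are predicted as rare ones.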

## <span style="color:salmon"> 4. Test Predictions </span> 

Make test predictions:

In [20]:
remove_outliers(X_train_to_preprocess)
X_train_to_preprocess, test_df = apply_frequency_encoding(X_train_to_preprocess, test_df)
NA_imputer(X_train_to_preprocess, test_df)
create_new_features(X_train_to_preprocess, test_df)

In [21]:
test_df[numerical_features]  = scaler.transform(test_df[numerical_features])  

In [22]:
test_df = test_df.drop(drop_list, axis=1)

In [23]:
# Make test predictions
y_test_pred = model.predict(test_df)
y_test_pred = y_test_pred.ravel()

In [24]:
y_test_final = np.array([class_mapping[i] for i in y_test_pred])

In [25]:
# The claim identifiers are stored in the index of the test set
test_id = test_df.index

In [None]:
submission_df = pd.DataFrame({
    'Claim Identifier': test_id,
    'Claim Injury Type': y_test_final
})
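The submission pairs each claim identifier with its decoded class label. Since the test set is loaded with `index_col = 'Claim Identifier'`, the identifiers are available from the DataFrame index. The sketch below illustrates this on toy data. The `toy_` names, identifiers and predictions are made up, and the mapping is only a subset of `class_mapping`.

```python
import numpy as np
import pandas as pd

# Subset of class_mapping, used only for this illustration
toy_mapping = {0: "1. CANCELLED", 1: "2. NON-COMP"}

# Hypothetical test set indexed by claim identifier
toy_test = pd.DataFrame(
    {"feature": [0.1, 0.2, 0.3]},
    index=pd.Index([101, 102, 103], name="Claim Identifier"),
)
toy_pred = np.array([1, 0, 1])  # made-up encoded predictions

# Decode the predictions and pair them with the identifiers from the index
toy_submission = pd.DataFrame({
    "Claim Identifier": toy_test.index,
    "Claim Injury Type": [toy_mapping[i] for i in toy_pred],
})
print(toy_submission)
```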

In [None]:
save_submission = False  # set to True to write the submission file
if save_submission:
    version = version_control()
    submission_df.to_csv(f'./submissions/Group49_Version{version:02}.csv', index=False)