# <center>Machine Learning Project</center>

** **
## <center>*03.1 - CatBoosted*</center>

** **

The team members are:
- Ana Farinha - 20211514
- Francisco Capontes - 20211692
- Sofia Gomes - 20240848
- Rui Lourenço - 2021639



In [1]:
# Import libraries
import pandas as pd
import numpy as np

# Data splitting, baseline model, scaling and evaluation utilities
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score

from catboost import CatBoostClassifier

from sklearn.model_selection import StratifiedKFold
import time

from utils import *
from utils_feature_selection import check_performace
from utils_dicts import *

import warnings
warnings.filterwarnings('ignore')

random_state=68+1

## <span style="color:salmon"> 1. Import Dataset </span> 

In [2]:
# Import the preprocessed training data
train_df = pd.read_csv('preprocessed_data/train_data.csv', index_col="Claim Identifier")

In [3]:
# Import the preprocessed test data
test_df = pd.read_csv('./preprocessed_data/test_data.csv', index_col = 'Claim Identifier')

In [4]:
# Choose the feature subset: essential_features, reduced_features, or [] to keep all features
feature_selection = []

## <span style="color:salmon"> 2. Prepare Dataset </span> 

Define `y` as the target column, `Claim Injury Type Encoded`, and `X` as all the remaining columns.

In [5]:
X = train_df.drop(["Claim Injury Type Encoded"], axis = 1)
y = train_df["Claim Injury Type Encoded"]

## <span style="color:salmon"> 3. CatBoosted </span> 

In [6]:
config = {'iterations': 500, 'learning_rate': 0.16, 'depth': 6, 'l2_leaf_reg': 9.9}

In [7]:
ignore = """
model = CatBoostClassifier(
        iterations=config["iterations"],
        learning_rate=config["learning_rate"],
        depth=config["depth"],
        l2_leaf_reg=config["l2_leaf_reg"],
        loss_function="MultiClass", 
        eval_metric="MultiClass",  
        custom_metric=['F1'], 
        verbose=0
    )
"""

In [8]:
model = CatBoostClassifier(custom_metric=['F1'], verbose=0)

In [10]:
check_performace(model,X,y,numerical_features,essential_features,n_folds = 5, random_state=random_state)

Fold 1 train F1 score: 0.3520
Fold 1 validation F1 score: 0.3194
------------------------------
Fold 2 train F1 score: 0.3531
Fold 2 validation F1 score: 0.3151
------------------------------
Fold 3 train F1 score: 0.3589
Fold 3 validation F1 score: 0.3158
------------------------------
Fold 4 train F1 score: 0.3521
Fold 4 validation F1 score: 0.3182
------------------------------
Fold 5 train F1 score: 0.3467
Fold 5 validation F1 score: 0.3121
------------------------------
Average Train F1 score: 0.3525908445502652
Average Validation F1 score: 0.3161110278408993


In [12]:
check_performace(model,X,y,numerical_features,reduced_features,n_folds = 5, random_state=random_state)

Fold 1 train F1 score: 0.5029
Fold 1 validation F1 score: 0.3713
------------------------------
Fold 2 train F1 score: 0.5123
Fold 2 validation F1 score: 0.3752
------------------------------
Fold 3 train F1 score: 0.5074
Fold 3 validation F1 score: 0.3587
------------------------------
Fold 4 train F1 score: 0.5147
Fold 4 validation F1 score: 0.3666
------------------------------
Fold 5 train F1 score: 0.5105
Fold 5 validation F1 score: 0.3640
------------------------------
Average Train F1 score: 0.5095813009139549
Average Validation F1 score: 0.36718083267181045


In [14]:
check_performace(model,X,y,numerical_features,[],n_folds = 5, random_state=random_state)

Fold 1 train F1 score: 0.5877
Fold 1 validation F1 score: 0.4126
------------------------------
Fold 2 train F1 score: 0.5835
Fold 2 validation F1 score: 0.4136
------------------------------
Fold 3 train F1 score: 0.5866
Fold 3 validation F1 score: 0.4138
------------------------------
Fold 4 train F1 score: 0.5942
Fold 4 validation F1 score: 0.4162
------------------------------
Fold 5 train F1 score: 0.5885
Fold 5 validation F1 score: 0.4042
------------------------------
Average Train F1 score: 0.5881011773911614
Average Validation F1 score: 0.4120814287515409
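
`check_performace` is imported from `utils_feature_selection` and its body is not shown in this notebook. As a rough sketch of the kind of loop that produces per-fold train and validation F1 scores like those above, the block below assumes stratified folds and macro-averaged F1 (both are assumptions, not confirmed by the source). The helper name `cross_validate_f1`, the synthetic data and the decision tree stand-in are hypothetical and only keep the example light.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier


def cross_validate_f1(model, X, y, n_folds=5, random_state=69):
    """Return per-fold train and validation macro F1 scores.

    Hypothetical helper: the fold strategy and the averaging mode are
    assumptions, not taken from utils_feature_selection.check_performace.
    """
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=random_state)
    train_scores, val_scores = [], []
    for train_idx, val_idx in skf.split(X, y):
        X_tr, X_va = X[train_idx], X[val_idx]
        y_tr, y_va = y[train_idx], y[val_idx]
        model.fit(X_tr, y_tr)  # refit on the training part of each fold
        train_scores.append(f1_score(y_tr, model.predict(X_tr), average="macro"))
        val_scores.append(f1_score(y_va, model.predict(X_va), average="macro"))
    return train_scores, val_scores


# Synthetic stand-in data (not the claims dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
train_f1, val_f1 = cross_validate_f1(
    DecisionTreeClassifier(max_depth=4, random_state=0), X_demo, y_demo
)
print(f"Average Train F1 score: {np.mean(train_f1):.4f}")
print(f"Average Validation F1 score: {np.mean(val_f1):.4f}")
```

Reporting the train score next to the validation score for each fold makes the gap between the two visible, which is how the overfitting in the runs above shows up.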


#### <span style="color:salmon"> 3.1  Evaluate the model </span> 


In [15]:
X_train, X_val, y_train, y_val = train_test_split(X,y, test_size = 0.20, stratify = y, shuffle = True, random_state=random_state)
X_train_to_preprocess=X_train.copy()

In [16]:
remove_outliers(X_train)
X_train, X_val = apply_frequency_encoding(X_train, X_val)
NA_imputer(X_train,X_val)
create_new_features(X_train,X_val)
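
`apply_frequency_encoding` comes from `utils` and its implementation is not shown here. The sketch below illustrates what frequency encoding generally means, assuming frequencies are learned on the training split only and unseen categories map to 0. The function `frequency_encode` and the column `Region` are hypothetical names used for illustration.

```python
import pandas as pd


def frequency_encode(train, val, columns):
    """Replace each category with its relative frequency in the training data.

    Hypothetical stand-in for utils.apply_frequency_encoding, whose real
    implementation is not shown in this notebook.
    """
    train, val = train.copy(), val.copy()
    for col in columns:
        freq = train[col].value_counts(normalize=True)  # learned on train only
        train[col] = train[col].map(freq)
        # Categories never seen during training get frequency 0 (an assumption)
        val[col] = val[col].map(freq).fillna(0.0)
    return train, val


train_demo = pd.DataFrame({"Region": ["A", "A", "B", "A"]})
val_demo = pd.DataFrame({"Region": ["B", "C"]})
train_enc, val_enc = frequency_encode(train_demo, val_demo, ["Region"])
print(train_enc["Region"].tolist())  # [0.75, 0.75, 0.25, 0.75]
print(val_enc["Region"].tolist())    # [0.25, 0.0]
```

Learning the frequencies from the training split alone keeps validation information out of the features, mirroring how the scaler is fitted in the next cell.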

In [17]:
scaler = StandardScaler().fit(X_train[numerical_features])
X_train[numerical_features]  = scaler.transform(X_train[numerical_features])
X_val[numerical_features]  = scaler.transform(X_val[numerical_features])  
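
The scaler is fitted on the training split and then reused for the validation split, so validation values are standardized with the training mean and standard deviation. A minimal check of that behavior on toy arrays (the names `X_tr`, `X_va` and `scaler_demo` are illustrative only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_tr = np.array([[1.0], [2.0], [3.0], [4.0]])
X_va = np.array([[2.5], [10.0]])

# Fit on the training data only, then transform the validation data
scaler_demo = StandardScaler().fit(X_tr)
val_scaled = scaler_demo.transform(X_va)

# Same computation by hand, using training statistics
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)  # population standard deviation (ddof=0)
manual = (X_va - mu) / sigma

print(np.allclose(val_scaled, manual))  # True
```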

In [18]:
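# "Average Weekly Wage" is always dropped; when a feature subset is selected,
# every column outside that subset is dropped as well
drop_list = ["Average Weekly Wage"]
if feature_selection:
    drop_list += [col for col in X_train.columns
                  if col not in feature_selection and col not in drop_list]
X_train = X_train.drop(drop_list, axis=1)
X_val = X_val.drop(drop_list, axis=1)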

In [19]:
model.fit(X_train, y_train)

<catboost.core.CatBoostClassifier at 0x22507138450>

In [20]:
y_train_pred = model.predict(X_train)

In [21]:
y_val_pred = model.predict(X_val)

In [22]:
class_mapping = {
    0:'1. CANCELLED', 
    1:'2. NON-COMP',
    2:'3. MED ONLY', 
    3:'4. TEMPORARY',
    4:'5. PPD SCH LOSS', 
    5:'6. PPD NSL', 
    6:'7. PTD', 
    7:'8. DEATH'
}

# Use the values from class_mapping as the target names
target_names = list(class_mapping.values())

Compute the confusion matrix and the classification report to evaluate the classifier, first on the training set and then on the validation set.

In [23]:
# Evaluate the model on the training set
print("Confusion Matrix:")
print(confusion_matrix(y_train, y_train_pred))
print("\nClassification Report:")
print(classification_report(y_train, y_train_pred, target_names=target_names))

Confusion Matrix:
[[ 1778  1141    44   134     7     0     0     0]
 [  339 70361   258  1607   109     0     0     0]
 [   19  9863  2045  4578   709     0     0     0]
 [   14  8179   496 26811  1592     0     0     0]
 [    1   405   150  3716  7790     0     0     0]
 [    0     2    10   795   103   141     0     0]
 [    0     0     0     0     0     0    24     0]
 [    0     0     0     0     0     0     0   114]]

Classification Report:
                 precision    recall  f1-score   support

   1. CANCELLED       0.83      0.57      0.68      3104
    2. NON-COMP       0.78      0.97      0.87     72674
    3. MED ONLY       0.68      0.12      0.20     17214
   4. TEMPORARY       0.71      0.72      0.72     37092
5. PPD SCH LOSS       0.76      0.65      0.70     12062
     6. PPD NSL       1.00      0.13      0.24      1051
         7. PTD       1.00      1.00      1.00        24
       8. DEATH       1.00      1.00      1.00       114

       accuracy                   
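
The per-class figures in the classification report follow directly from the confusion matrix: recall divides the diagonal entry by its row sum (the support), and precision divides it by its column sum. As a worked illustration, the sketch below recomputes them from the training confusion matrix printed above. The helper name `per_class_metrics` is hypothetical, and the macro average at the end is an unweighted mean of the per-class F1 values.

```python
# Training confusion matrix copied from the output above
# (rows = true class, columns = predicted class)
cm = [
    [1778, 1141, 44, 134, 7, 0, 0, 0],
    [339, 70361, 258, 1607, 109, 0, 0, 0],
    [19, 9863, 2045, 4578, 709, 0, 0, 0],
    [14, 8179, 496, 26811, 1592, 0, 0, 0],
    [1, 405, 150, 3716, 7790, 0, 0, 0],
    [0, 2, 10, 795, 103, 141, 0, 0],
    [0, 0, 0, 0, 0, 0, 24, 0],
    [0, 0, 0, 0, 0, 0, 0, 114],
]
labels = [
    "1. CANCELLED", "2. NON-COMP", "3. MED ONLY", "4. TEMPORARY",
    "5. PPD SCH LOSS", "6. PPD NSL", "7. PTD", "8. DEATH",
]


def per_class_metrics(matrix):
    """Return (precision, recall, f1, support) for every class."""
    n = len(matrix)
    results = []
    for k in range(n):
        tp = matrix[k][k]
        support = sum(matrix[k])                         # samples truly in class k
        predicted = sum(matrix[i][k] for i in range(n))  # samples predicted as class k
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1, support))
    return results


metrics = per_class_metrics(cm)
for name, (p, r, f1, s) in zip(labels, metrics):
    print(f"{name:>15}  {p:.2f}  {r:.2f}  {f1:.2f}  {s}")

macro_f1 = sum(m[2] for m in metrics) / len(metrics)
print(f"macro F1: {macro_f1:.4f}")
```

For `3. MED ONLY`, for instance, 2045 of the 17214 true cases are recovered (recall 0.12), while 2045 of the 3003 predictions for that class are correct (precision 0.68), matching the report.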

In [24]:
# Evaluate the model on the validation set
print("Confusion Matrix:")
print(confusion_matrix(y_val, y_val_pred))
print("\nClassification Report:")
print(classification_report(y_val, y_val_pred, target_names=target_names))

Confusion Matrix:
[[  4434   4171    151    514     37      0      0      3]
 [  1491 208888   1189   6011    424      0      0     18]
 [    63  30111   3972  15135   2358      0      0      5]
 [    78  25810   2404  76331   6599     21      0     35]
 [     4   1355    698  13409  20715      5      0      0]
 [     0      7     35   2668    431     11      0      0]
 [     0      0      1     65      6      1      0      0]
 [     5     53     10    205      4      0      0     66]]

Classification Report:
                 precision    recall  f1-score   support

   1. CANCELLED       0.73      0.48      0.58      9310
    2. NON-COMP       0.77      0.96      0.86    218021
    3. MED ONLY       0.47      0.08      0.13     51644
   4. TEMPORARY       0.67      0.69      0.68    111278
5. PPD SCH LOSS       0.68      0.57      0.62     36186
     6. PPD NSL       0.29      0.00      0.01      3152
         7. PTD       0.00      0.00      0.00        73
       8. DEATH       0.52  

## <span style="color:salmon"> 4. Test Predictions </span> 

In [25]:
remove_outliers(X_train_to_preprocess)
X_train_to_preprocess, test_df = apply_frequency_encoding(X_train_to_preprocess, test_df)
NA_imputer(X_train_to_preprocess, test_df)
create_new_features(X_train_to_preprocess, test_df)

In [26]:
test_df[numerical_features]  = scaler.transform(test_df[numerical_features])  

In [None]:
test_df = test_df.drop(drop_list, axis=1)

In [28]:
# Make test predictions
y_test_pred = model.predict(test_df)
y_test_pred = y_test_pred.ravel()

In [29]:
y_test_final = np.array([class_mapping[i] for i in y_test_pred])
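
The two cells above flatten the predictions and map each encoded class back to its label. A small sketch of that step on made-up values, assuming the model returns predictions as an `(n_samples, 1)` column, which the `ravel()` call suggests (`raw_pred` and `mapping_demo` are hypothetical):

```python
import numpy as np

# Hypothetical raw predictions shaped (n_samples, 1)
raw_pred = np.array([[1], [3], [0]])
flat_pred = raw_pred.ravel()  # flatten to shape (3,)

# Subset of the notebook's class_mapping, repeated here to keep the block self-contained
mapping_demo = {0: "1. CANCELLED", 1: "2. NON-COMP", 3: "4. TEMPORARY"}

# Look up the label of every encoded prediction
decoded = np.array([mapping_demo[int(i)] for i in flat_pred])
print(decoded.tolist())  # ['2. NON-COMP', '4. TEMPORARY', '1. CANCELLED']
```

Without the flattening step, each element of the loop would be a one-element array rather than a scalar, which cannot be used as a dictionary key.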

In [30]:
test_id = test_df.index

In [31]:
submission_df = pd.DataFrame({
    'Claim Identifier': test_id,
    'Claim Injury Type': y_test_final
})

In [32]:
if False:
    version = version_control()
    submission_df.to_csv(f'./submissions/Group49_Version{version:02}.csv', index=False)