# <center>Machine Learning Project</center>

** **
## <center>*04.1 - CatBoosted*</center>

** **

The members of the team are:
- Ana Farinha - 20211514
- Francisco Capontes - 20211692
- Sofia Gomes - 20240848
- Rui Lourenço - 2021639



In [1]:
# Import libraries
import time
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Model selection, preprocessing and metrics
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
)

from catboost import CatBoostClassifier

# Project helper modules
from utils import *
from utils_feature_selection import check_performace
from utils_dicts import *

warnings.filterwarnings('ignore')

# Seed shared by every split and model
random_state = 68 + 1

## <span style="color:salmon"> 1. Import Dataset </span> 

In [2]:
# Import the training dataset
train_df = pd.read_csv('./preprocessed_data/train_data.csv', index_col='Claim Identifier')

In [3]:
# Import the test dataset
test_df = pd.read_csv('./preprocessed_data/test_data.csv', index_col='Claim Identifier')

In [4]:
# Define Feature Selection: essential_features, reduced_features or [] (No Feature Selection)
feature_selection = []

Define `y` as the target, `Claim Injury Type Encoded`, and `X` as all the remaining columns.

In [5]:
X = train_df.drop(["Claim Injury Type Encoded"], axis = 1)
y = train_df["Claim Injury Type Encoded"]

## <span style="color:salmon"> 2. CatBoosted </span> 

#### <span style="color:salmon"> 2.1 Model K-fold cross-validation </span> 

In [6]:
config_1 = {
    "iterations": 300,
    "learning_rate": 0.7,
    "depth": 6,
    "l2_leaf_reg": 6,
    "bagging_temperature": 0.7,
    "random_state": random_state,
}

In [7]:
model = CatBoostClassifier(
        iterations=config_1["iterations"],
        learning_rate=config_1["learning_rate"],
        depth=config_1["depth"],
        l2_leaf_reg=config_1["l2_leaf_reg"],
        bagging_temperature=config_1["bagging_temperature"],
        # -------------------
        random_state = config_1["random_state"],
        custom_metric='F1', 
        early_stopping_rounds=50,
        verbose=0
)

In [8]:
check_performace(model,X,y,numerical_features,essential_features,n_folds = 3, random_state=random_state)

Fold 1 train F1 score: 0.3720
Fold 1 validation F1 score: 0.3107
------------------------------
Fold 2 train F1 score: 0.3678
Fold 2 validation F1 score: 0.3192
------------------------------
Fold 3 train F1 score: 0.3651
Fold 3 validation F1 score: 0.3149
------------------------------
Average Train F1 score: 0.3682958001583308
Average Validation F1 score: 0.3149497495591473
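`check_performace` lives in `utils_feature_selection` and is not shown in this notebook. As a rough sketch of the kind of check it reports (per-fold train and validation F1 over stratified folds), the loop below uses only scikit-learn. It is an assumption about the general approach, not the project's implementation: the function name `cross_validate_macro_f1`, the macro averaging, the synthetic data and the `LogisticRegression` stand-in are all illustrative, and the real helper also applies the preprocessing steps inside each fold.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold


def cross_validate_macro_f1(model, X, y, n_folds=3, random_state=69):
    """Return a list of (train_f1, validation_f1) pairs, one per fold.

    Hypothetical helper: macro-averaged F1 is an assumption here.
    """
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=random_state)
    scores = []
    for train_idx, val_idx in skf.split(X, y):
        fold_model = clone(model)  # fresh, unfitted copy for every fold
        fold_model.fit(X[train_idx], y[train_idx])
        train_f1 = f1_score(y[train_idx], fold_model.predict(X[train_idx]), average="macro")
        val_f1 = f1_score(y[val_idx], fold_model.predict(X[val_idx]), average="macro")
        scores.append((train_f1, val_f1))
    return scores


# Synthetic stand-in data (not the claims dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
fold_scores = cross_validate_macro_f1(LogisticRegression(max_iter=1000), X_demo, y_demo)

for fold, (train_f1, val_f1) in enumerate(fold_scores, start=1):
    print(f"Fold {fold} train F1 score: {train_f1:.4f}")
    print(f"Fold {fold} validation F1 score: {val_f1:.4f}")
print("Average Validation F1 score:", np.mean([val for _, val in fold_scores]))
```

Comparing the train and validation columns fold by fold gives a quick read on overfitting: a gap that widens from one configuration to the next suggests the extra capacity is fitting noise.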


In [9]:
# Using reduced_features gave worse results than both no feature selection and
# essential_features, so this run is left commented out
#check_performace(model,X,y,numerical_features,reduced_features,n_folds = 3, random_state=random_state)

In [10]:
config_2 = {
    "iterations": 1000,
    "learning_rate": 0.11,
    "depth": 6,
    "l2_leaf_reg": 5,
    "bagging_temperature": 0.4,
    "random_state": random_state,
}

In [11]:
model = CatBoostClassifier(
        iterations=config_2["iterations"],
        learning_rate=config_2["learning_rate"],
        depth=config_2["depth"],
        l2_leaf_reg=config_2["l2_leaf_reg"],
        bagging_temperature=config_2["bagging_temperature"],
        # -------------------
        random_state = config_2["random_state"],
        custom_metric='F1', 
        early_stopping_rounds=50,
        verbose=0
)

In [12]:
check_performace(model,X,y,numerical_features,[],n_folds = 3, random_state=random_state)

Fold 1 train F1 score: 0.5592
Fold 1 validation F1 score: 0.4140
------------------------------
Fold 2 train F1 score: 0.5660
Fold 2 validation F1 score: 0.4141
------------------------------
Fold 3 train F1 score: 0.5608
Fold 3 validation F1 score: 0.4046
------------------------------
Average Train F1 score: 0.5619988502534625
Average Validation F1 score: 0.41089690988965594


#### <span style="color:salmon"> 2.2  Train the model </span> 

In [13]:
selected_config = config_2

In [14]:
model = CatBoostClassifier(
        iterations=selected_config["iterations"],
        learning_rate=selected_config["learning_rate"],
        depth=selected_config["depth"],
        l2_leaf_reg=selected_config["l2_leaf_reg"],
        bagging_temperature=selected_config["bagging_temperature"],
        # -------------------
        random_state = selected_config["random_state"],
        custom_metric='F1', 
        early_stopping_rounds=50,
        verbose=0
)
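The three `CatBoostClassifier(...)` cells in this notebook differ only in the config dictionary they read from. One possible way to avoid the repetition is to merge each config with the shared settings and unpack the result. The helper name `catboost_params` is hypothetical; the keys are the same keyword arguments already passed to the constructor above.

```python
def catboost_params(config, **overrides):
    """Merge a tuning config with the settings shared by every run.

    Hypothetical helper: the fixed values mirror the constructor calls above.
    """
    fixed = {"custom_metric": "F1", "early_stopping_rounds": 50, "verbose": 0}
    return {**config, **fixed, **overrides}


# Same values as config_2, repeated here so the sketch is self-contained
config_demo = {
    "iterations": 1000,
    "learning_rate": 0.11,
    "depth": 6,
    "l2_leaf_reg": 5,
    "bagging_temperature": 0.4,
    "random_state": 69,
}
params = catboost_params(config_demo)
print(params)

# With catboost installed, the model would then be built as:
# model = CatBoostClassifier(**params)
```

Keeping the tuned values and the fixed values in separate dictionaries makes it easier to see which settings changed between `config_1` and `config_2`.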

In [15]:
X_train, X_val, y_train, y_val = train_test_split(X,y, test_size = 0.20, stratify = y, shuffle = True, random_state=random_state)
X_train_to_preprocess=X_train.copy()

In [16]:
remove_outliers(X_train)
X_train, X_val = apply_frequency_encoding(X_train, X_val)
NA_imputer(X_train,X_val)
create_new_features(X_train,X_val)
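`apply_frequency_encoding` is defined in `utils` and is not shown in this notebook. As an illustration of the general technique only, and not of the project's helper, frequency encoding replaces each category with its relative frequency in the training split and reuses that mapping on the other split, so no information from validation or test data leaks into the encoding. The function name `frequency_encode`, the `County` column and the choice to map unseen categories to 0 are all assumptions.

```python
import pandas as pd


def frequency_encode(train, other, columns):
    """Encode categories by their relative frequency in `train`.

    Hypothetical sketch: categories unseen in `train` are mapped to 0.0.
    """
    train, other = train.copy(), other.copy()
    for col in columns:
        freq = train[col].value_counts(normalize=True)  # learned on train only
        train[col] = train[col].map(freq)
        other[col] = other[col].map(freq).fillna(0.0)
    return train, other


# Toy data with an illustrative column name
train_demo = pd.DataFrame({"County": ["A", "A", "B", "C"]})
val_demo = pd.DataFrame({"County": ["A", "D"]})

train_enc, val_enc = frequency_encode(train_demo, val_demo, ["County"])
print(train_enc)
print(val_enc)
```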

In [17]:
scaler = StandardScaler().fit(X_train[numerical_features])
X_train[numerical_features]  = scaler.transform(X_train[numerical_features])
X_val[numerical_features]  = scaler.transform(X_val[numerical_features])  

In [18]:
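# "Average Weekly Wage" is always dropped; when a feature selection is set,
# every column outside that selection is dropped as well
drop_list = ["Average Weekly Wage"]
if feature_selection:
    drop_list += [
        col for col in X_train.columns
        if col not in feature_selection and col not in drop_list
    ]
X_train = X_train.drop(drop_list, axis=1)
X_val = X_val.drop(drop_list, axis=1)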

In [19]:
# No eval_set is passed here, so early_stopping_rounds has no validation data to monitor
model.fit(X_train, y_train)

<catboost.core.CatBoostClassifier at 0x16e19ec17d0>

In [20]:
y_train_pred = model.predict(X_train)

In [21]:
y_val_pred = model.predict(X_val)

#### <span style="color:salmon"> 2.3  Model Results </span> 

In [22]:
class_mapping = {
    0:'1. CANCELLED', 
    1:'2. NON-COMP',
    2:'3. MED ONLY', 
    3:'4. TEMPORARY',
    4:'5. PPD SCH LOSS', 
    5:'6. PPD NSL', 
    6:'7. PTD', 
    7:'8. DEATH'
}

# Use the values from class_mapping as the target names
target_names = list(class_mapping.values())

Compute the confusion matrix and the classification report on the training and validation sets to evaluate the classifier.

In [23]:
# Evaluate the model on the training set
print("Confusion Matrix:")
print(confusion_matrix(y_train, y_train_pred))
print("\nClassification Report:")
print(classification_report(y_train, y_train_pred, target_names=target_names))

Confusion Matrix:
[[  5168   4115    150    515     29      0      0      3]
 [  1446 224008   1106   5887    406      0      0      9]
 [    55  31721   5312  15564   2465      1      0      7]
 [    65  26688   2194  83590   6240      2      0     26]
 [     4   1341    600  13053  23624      2      0      0]
 [     0      8     34   2848    429     50      0      0]
 [     0      0      0     52      2      0     24      0]
 [     2     25      4     78      2      0      0    265]]

Classification Report:
                 precision    recall  f1-score   support

   1. CANCELLED       0.77      0.52      0.62      9980
    2. NON-COMP       0.78      0.96      0.86    232862
    3. MED ONLY       0.57      0.10      0.16     55125
   4. TEMPORARY       0.69      0.70      0.70    118805
5. PPD SCH LOSS       0.71      0.61      0.66     38624
     6. PPD NSL       0.91      0.01      0.03      3369
         7. PTD       1.00      0.31      0.47        78
       8. DEATH       0.85  

In [24]:
# Evaluate the model on the validation set
print("Confusion Matrix:")
print(confusion_matrix(y_val, y_val_pred))
print("\nClassification Report:")
print(classification_report(y_val, y_val_pred, target_names=target_names))

Confusion Matrix:
[[ 1229  1088    37   130    10     0     0     1]
 [  396 55856   313  1532   115     0     0     4]
 [   20  8011  1122  4002   624     2     0     0]
 [   15  6896   679 20389  1703     6     0    14]
 [    2   360   164  3427  5703     0     0     0]
 [    0     3     7   718   110     4     0     0]
 [    0     0     0    16     3     0     0     0]
 [    3    10     1    49     1     0     0    30]]

Classification Report:
                 precision    recall  f1-score   support

   1. CANCELLED       0.74      0.49      0.59      2495
    2. NON-COMP       0.77      0.96      0.86     58216
    3. MED ONLY       0.48      0.08      0.14     13781
   4. TEMPORARY       0.67      0.69      0.68     29702
5. PPD SCH LOSS       0.69      0.59      0.64      9656
     6. PPD NSL       0.33      0.00      0.01       842
         7. PTD       0.00      0.00      0.00        19
       8. DEATH       0.61      0.32      0.42        94

       accuracy                   
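The per-class figures in the classification report can be recovered directly from the confusion matrix: recall divides each diagonal entry by its row sum (the support), and precision divides it by its column sum. The sketch below does this with NumPy for the validation matrix printed above; the variable names are illustrative, and classes that are never predicted (such as `7. PTD` here) are given a precision of 0 to avoid dividing by zero.

```python
import numpy as np

# Validation confusion matrix copied from the output above
# (rows = true class, columns = predicted class)
cm = np.array([
    [1229,  1088,   37,   130,   10, 0, 0,  1],
    [ 396, 55856,  313,  1532,  115, 0, 0,  4],
    [  20,  8011, 1122,  4002,  624, 2, 0,  0],
    [  15,  6896,  679, 20389, 1703, 6, 0, 14],
    [   2,   360,  164,  3427, 5703, 0, 0,  0],
    [   0,     3,    7,   718,  110, 4, 0,  0],
    [   0,     0,    0,    16,    3, 0, 0,  0],
    [   3,    10,    1,    49,    1, 0, 0, 30],
])

tp = np.diag(cm).astype(float)  # correctly classified samples per class
support = cm.sum(axis=1)        # true samples per class (row sums)
predicted = cm.sum(axis=0)      # predicted samples per class (column sums)

# Guarded divisions: leave 0.0 wherever the denominator is zero
recall = np.divide(tp, support, out=np.zeros_like(tp), where=support > 0)
precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
denom = precision + recall
f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
macro_f1 = f1.mean()

for label, p, r, f in zip(range(1, 9), precision, recall, f1):
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"macro F1: {macro_f1:.4f}")
```

Because the macro average weights all eight classes equally, the near-zero scores of the rare classes (`6. PPD NSL`, `7. PTD`) pull it well below the F1 of the dominant `2. NON-COMP` class.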

## <span style="color:salmon"> 3. Test Predictions </span> 

In [25]:
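# Repeat the preprocessing on an untouched copy of the training split,
# this time applying it to the test set instead of the validation set
remove_outliers(X_train_to_preprocess)
X_train_to_preprocess, test_df = apply_frequency_encoding(X_train_to_preprocess, test_df)
NA_imputer(X_train_to_preprocess, test_df)
create_new_features(X_train_to_preprocess, test_df)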

In [26]:
# Reuse the scaler fitted on the training split
test_df[numerical_features] = scaler.transform(test_df[numerical_features])

In [27]:
test_df = test_df.drop(drop_list, axis=1)

In [28]:
y_test_pred = model.predict(test_df)
y_test_pred = y_test_pred.ravel()

In [29]:
# Map the encoded predictions back to the original class labels
y_test_final = np.array([class_mapping[i] for i in y_test_pred])

In [30]:
test_id = test_df.index

In [31]:
submission_df = pd.DataFrame({
    'Claim Identifier': test_id,
    'Claim Injury Type': y_test_final
})

In [32]:
# Change the condition to True to write a new versioned submission file
if False:
    version = version_control()
    submission_df.to_csv(f'./submissions/Group49_Version{version:02}.csv', index=False)