# <center>Machine Learning Project</center>

** **
## <center>*04.3 - Decision Tree*</center>

** **

The members of the `team` are:
- Ana Farinha - 20211514
- Francisco Capontes - 20211692
- Sofia Gomes - 20240848
- Rui Lourenço - 2021639



In [1]:
# Import libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report

from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score

from sklearn.tree import DecisionTreeClassifier

from utils import *
from utils_feature_selection import check_performace
from utils_dicts import *

import warnings
warnings.filterwarnings('ignore')

random_state=68+1

## <span style="color:salmon"> 1. Import Dataset </span> 

In [2]:
# Import training dataset
train_df = pd.read_csv('preprocessed_data/train_data.csv', index_col="Claim Identifier")

In [3]:
# Import test dataset
test_df = pd.read_csv('./preprocessed_data/test_data.csv', index_col = 'Claim Identifier')

In [4]:
# Define Feature Selection: essential_features, reduced_features or [] (No Feature Selection)
feature_selection = []

Define `y` as the target column, "Claim Injury Type Encoded", and `X` as all the remaining columns.

In [5]:
X = train_df.drop(["Claim Injury Type Encoded"], axis = 1)
y = train_df["Claim Injury Type Encoded"]

## <span style="color:salmon"> 2. Decision Tree</span> 

A decision tree is a supervised learning model with a tree-like structure: each internal node splits the data into subsets based on the value of a feature, and each leaf assigns a class.
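To make the splitting idea concrete, here is a minimal sketch on toy data. The data and the names `X_toy`, `y_toy`, `feature_a` and `feature_b` are hypothetical and unrelated to the claims dataset; the sketch only shows how a fitted tree turns feature thresholds into rules.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (hypothetical): two features and a binary label.
# feature_a separates the classes cleanly, feature_b does not.
X_toy = [[1, 0], [2, 0], [3, 1], [4, 1], [5, 0], [6, 1]]
y_toy = [0, 0, 0, 1, 1, 1]

toy_tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
toy_tree.fit(X_toy, y_toy)

# Print the learned split rules as indented text
print(export_text(toy_tree, feature_names=["feature_a", "feature_b"]))
```

Because a single threshold on `feature_a` already yields pure subsets, the tree stops after one split even though `max_depth=2` would allow more.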

#### <span style="color:salmon"> 2.1  Model K-fold cross validation </span> 

In [6]:
config = {
        "min_samples_split": 10,
        "min_samples_leaf": 4,
        "max_depth": 20,
        "criterion": "entropy",
        "random_state": random_state,
    }
# The same config is used both without feature selection and with only the essential features

In [7]:
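# Note: config["criterion"] is not passed here, so the tree uses
# scikit-learn's default criterion ("gini") rather than "entropy".
model = DecisionTreeClassifier(
    max_depth=config["max_depth"],
    min_samples_leaf=config["min_samples_leaf"],
    min_samples_split=config["min_samples_split"],
    class_weight='balanced',
    random_state=config["random_state"])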
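The cell above builds the model from individual config entries and does not forward `criterion`, so the `"entropy"` setting in the config has no effect. One possible way to avoid this kind of omission is to unpack the whole dictionary. The sketch below uses the hypothetical names `config_sketch` and `model_sketch`, with `69` standing in for `random_state` (68 + 1). Switching the criterion changes the fitted tree, so the scores reported in this notebook would not necessarily be reproduced.

```python
from sklearn.tree import DecisionTreeClassifier

# Same hyperparameters as the config cell; 69 stands in for random_state (68 + 1)
config_sketch = {
    "min_samples_split": 10,
    "min_samples_leaf": 4,
    "max_depth": 20,
    "criterion": "entropy",
    "random_state": 69,
}

# Unpacking forwards every key of the dictionary, including "criterion"
model_sketch = DecisionTreeClassifier(class_weight="balanced", **config_sketch)

# Inspect the parameter the estimator will actually use
print(model_sketch.get_params()["criterion"])
```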

In [8]:
check_performace(model,X,y,numerical_features,essential_features,n_folds = 3, random_state=random_state)

Fold 1 train F1 score: 0.4339
Fold 1 validation F1 score: 0.3111
------------------------------
Fold 2 train F1 score: 0.4435
Fold 2 validation F1 score: 0.3101
------------------------------
Fold 3 train F1 score: 0.4222
Fold 3 validation F1 score: 0.3133
------------------------------
Average Train F1 score: 0.43322364365949934
Average Validation F1 score: 0.31153899307516203


In [9]:
check_performace(model,X,y,numerical_features,[],n_folds = 3, random_state=random_state)

Fold 1 train F1 score: 0.5215
Fold 1 validation F1 score: 0.3394
------------------------------
Fold 2 train F1 score: 0.5258
Fold 2 validation F1 score: 0.3459
------------------------------
Fold 3 train F1 score: 0.5112
Fold 3 validation F1 score: 0.3344
------------------------------
Average Train F1 score: 0.5195017243939507
Average Validation F1 score: 0.3398999730257672
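In both runs the training F1 score is well above the validation F1 score, which suggests the tree overfits at this depth. `check_performace` lives in `utils_feature_selection` and is not shown in this notebook, so the following is only a rough sketch of what a stratified K-fold evaluation of this kind could look like. It runs on synthetic data, omits the preprocessing steps, and assumes macro-averaged F1; none of these choices are confirmed by the source.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): three imbalanced classes
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3,
    weights=[0.7, 0.2, 0.1], random_state=69,
)

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=69)
train_scores, val_scores = [], []

for fold, (train_idx, val_idx) in enumerate(skf.split(X_demo, y_demo), start=1):
    clf = DecisionTreeClassifier(
        max_depth=20, min_samples_leaf=4, min_samples_split=10,
        class_weight="balanced", random_state=69,
    )
    clf.fit(X_demo[train_idx], y_demo[train_idx])

    # Macro averaging (assumption) weights every class equally
    train_f1 = f1_score(y_demo[train_idx], clf.predict(X_demo[train_idx]), average="macro")
    val_f1 = f1_score(y_demo[val_idx], clf.predict(X_demo[val_idx]), average="macro")
    train_scores.append(train_f1)
    val_scores.append(val_f1)
    print(f"Fold {fold} train F1 score: {train_f1:.4f}")
    print(f"Fold {fold} validation F1 score: {val_f1:.4f}")

print("Average Train F1 score:", np.mean(train_scores))
print("Average Validation F1 score:", np.mean(val_scores))
```

In a real pipeline, any fitted preprocessing (outlier removal, encoding, imputation, scaling) would be fitted inside each fold on the training indices only, to keep validation data out of the fit.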


#### <span style="color:salmon"> 2.2  Train the model </span> 

In [10]:
selected_config = {
        "min_samples_split": 10,
        "min_samples_leaf": 4,
        "max_depth": 20,
        "criterion": "entropy",
        "random_state": random_state,
    }

In [11]:
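# Note: selected_config["criterion"] is not passed here, so the tree uses
# scikit-learn's default criterion ("gini") rather than "entropy".
model = DecisionTreeClassifier(
    max_depth=selected_config["max_depth"],
    min_samples_leaf=selected_config["min_samples_leaf"],
    min_samples_split=selected_config["min_samples_split"],
    class_weight='balanced',
    random_state=selected_config["random_state"])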

In [12]:
X_train, X_val, y_train, y_val = train_test_split(X,y, test_size = 0.2, stratify = y, shuffle = True, random_state=random_state)
X_train_to_preprocess=X_train.copy()

In [13]:
remove_outliers(X_train)
X_train, X_val = apply_frequency_encoding(X_train, X_val)
NA_imputer(X_train,X_val)
create_new_features(X_train,X_val)

In [14]:
scaler = StandardScaler().fit(X_train[numerical_features])
X_train[numerical_features]  = scaler.transform(X_train[numerical_features])
X_val[numerical_features]  = scaler.transform(X_val[numerical_features])  

In [15]:
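# Drop every column that is not in the selected feature list
# (nothing is dropped when feature_selection is empty)
drop_list = []
if feature_selection != []:
    drop_list = [col for col in X_train.columns if col not in feature_selection]
X_train = X_train.drop(drop_list, axis=1)
X_val = X_val.drop(drop_list, axis=1)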

In [16]:
model.fit(X_train, y_train)

In [17]:
y_train_pred = model.predict(X_train)

In [18]:
y_val_pred = model.predict(X_val)

#### <span style="color:salmon"> 2.3  Model Results </span> 

In [19]:
class_mapping = {
    0:'1. CANCELLED', 
    1:'2. NON-COMP',
    2:'3. MED ONLY', 
    3:'4. TEMPORARY',
    4:'5. PPD SCH LOSS', 
    5:'6. PPD NSL', 
    6:'7. PTD', 
    7:'8. DEATH'
}

# Use the values from class_mapping as the target names
target_names = list(class_mapping.values())

Compute the confusion matrix and the classification report, first on the training set and then on the validation set, to evaluate the classifier per class.

In [20]:
# Evaluate the model on the training set
print("Confusion Matrix:")
print(confusion_matrix(y_train, y_train_pred))
print("\nClassification Report:")
print(classification_report(y_train, y_train_pred, target_names=target_names))

Confusion Matrix:
[[  9502    340     31     11     25      7      1     63]
 [ 10359 183745  28179   6797   2977    228     39    538]
 [  1084  16110  28808   3845   3944    997     66    271]
 [  1955  15514  14220  62316  16304   7225    600    671]
 [   215    318    969   1548  33041   2237    192    104]
 [     0      0      1      3     38   3307     18      2]
 [     0      0      0      0      0      0     78      0]
 [     0      0      0      0      0      0      0    376]]

Classification Report:
                 precision    recall  f1-score   support

   1. CANCELLED       0.41      0.95      0.57      9980
    2. NON-COMP       0.85      0.79      0.82    232862
    3. MED ONLY       0.40      0.52      0.45     55125
   4. TEMPORARY       0.84      0.52      0.64    118805
5. PPD SCH LOSS       0.59      0.86      0.70     38624
     6. PPD NSL       0.24      0.98      0.38      3369
         7. PTD       0.08      1.00      0.15        78
       8. DEATH       0.19  

In [21]:
# Evaluate the model on the validation set
print("Confusion Matrix:")
print(confusion_matrix(y_val, y_val_pred))
print("\nClassification Report:")
print(classification_report(y_val, y_val_pred, target_names=target_names))

Confusion Matrix:
[[ 1598   505   181   108    74    10     0    19]
 [ 3092 43846  8088  2160   811    71     9   139]
 [  506  4966  3997  2440  1471   316     9    76]
 [  585  4270  5081 11857  5406  2153   180   170]
 [  109   156   809  1577  6152   775    44    34]
 [    8     2    47   308   233   226    16     2]
 [    1     0     5     5     2     5     1     0]
 [    7     4     4    13     7     2     0    57]]

Classification Report:
                 precision    recall  f1-score   support

   1. CANCELLED       0.27      0.64      0.38      2495
    2. NON-COMP       0.82      0.75      0.78     58216
    3. MED ONLY       0.22      0.29      0.25     13781
   4. TEMPORARY       0.64      0.40      0.49     29702
5. PPD SCH LOSS       0.43      0.64      0.52      9656
     6. PPD NSL       0.06      0.27      0.10       842
         7. PTD       0.00      0.05      0.01        19
       8. DEATH       0.11      0.61      0.19        94

       accuracy                   
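Since the printed report is cut off, the per-class and overall metrics can be recomputed directly from the validation confusion matrix printed above. The sketch below copies that matrix by hand and assumes the scikit-learn convention of rows as true classes and columns as predicted classes; the variable names are illustrative.

```python
import numpy as np

# Validation confusion matrix copied from the output above
# (rows = true class, columns = predicted class)
cm = np.array([
    [1598,   505,  181,   108,   74,   10,   0,  19],
    [3092, 43846, 8088,  2160,  811,   71,   9, 139],
    [ 506,  4966, 3997,  2440, 1471,  316,   9,  76],
    [ 585,  4270, 5081, 11857, 5406, 2153, 180, 170],
    [ 109,   156,  809,  1577, 6152,  775,  44,  34],
    [   8,     2,   47,   308,  233,  226,  16,   2],
    [   1,     0,    5,     5,    2,    5,   1,   0],
    [   7,     4,    4,    13,    7,    2,   0,  57],
])

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)  # column sums = predicted counts
recall = true_pos / cm.sum(axis=1)     # row sums = support per class
f1 = 2 * precision * recall / (precision + recall)

accuracy = true_pos.sum() / cm.sum()
macro_f1 = f1.mean()  # unweighted mean over the 8 classes

for i, (p, r, f) in enumerate(zip(precision, recall, f1), start=1):
    print(f"class {i}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f} macro F1={macro_f1:.2f}")
```

The row sums reproduce the `support` column of the report, which is a quick way to confirm the matrix was copied correctly.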

## <span style="color:salmon"> 3. Test Predictions </span> 

Apply the same preprocessing to the test set, then make predictions:

In [22]:
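# Re-run the preprocessing on the untouched copy of the training split,
# this time paired with the test set instead of the validation set
remove_outliers(X_train_to_preprocess)
X_train_to_preprocess, test_df = apply_frequency_encoding(X_train_to_preprocess, test_df)
NA_imputer(X_train_to_preprocess, test_df)
create_new_features(X_train_to_preprocess, test_df)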

In [23]:
test_df[numerical_features]  = scaler.transform(test_df[numerical_features])  

In [24]:
test_df = test_df.drop(drop_list, axis=1)

In [25]:
y_test_pred = model.predict(test_df)
y_test_pred = y_test_pred.ravel()

In [26]:
y_test_final = np.array([class_mapping[i] for i in y_test_pred])
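As an alternative to the list comprehension, the decoding step can be written with `pandas.Series.map`, which returns `NaN` for unmapped codes instead of raising a `KeyError` and so makes them easy to check for. This is a small sketch with hypothetical predictions (`preds_sketch`), not the notebook's actual output.

```python
import numpy as np
import pandas as pd

# Same mapping as class_mapping in the results section
class_mapping_sketch = {
    0: '1. CANCELLED', 1: '2. NON-COMP', 2: '3. MED ONLY', 3: '4. TEMPORARY',
    4: '5. PPD SCH LOSS', 5: '6. PPD NSL', 6: '7. PTD', 7: '8. DEATH',
}

# Hypothetical encoded predictions
preds_sketch = np.array([1, 3, 0, 7])

# Vectorized decoding; codes missing from the mapping would become NaN
labels_sketch = pd.Series(preds_sketch).map(class_mapping_sketch)

# Guard against codes that are not in the mapping
assert not labels_sketch.isna().any()
print(labels_sketch.tolist())
```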

In [27]:
test_id = test_df.index

In [28]:
submission_df = pd.DataFrame({
    'Claim Identifier': test_id,
    'Claim Injury Type': y_test_final
})

In [29]:
if False:
    version = version_control()
    submission_df.to_csv(f'./submissions/Group49_Version{version:02}.csv', index=False)