# <center>Machine Learning Project</center>

** **
## <center>*03.4 - Naïve Bayes*</center>

** **

The members of the `team` are:
- Ana Farinha - 20211514
- Francisco Capontes - 20211692
- Sofia Gomes - 20240848
- Rui Lourenço - 2021639



In [1]:
# Import libraries
import pandas as pd
import numpy as np

# Splitting, preprocessing and evaluation utilities
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.naive_bayes import GaussianNB

from sklearn.model_selection import StratifiedKFold
import time

from utils import *
from utils_feature_selection import check_performace
from utils_dicts import *

import warnings
warnings.filterwarnings('ignore')

random_state=68+1

## <span style="color:salmon"> 1. Import Dataset </span> 

In [2]:
# Import training dataset
train_df = pd.read_csv('preprocessed_data/train_data.csv', index_col="Claim Identifier")

In [3]:
# Import test dataset
test_df = pd.read_csv('./preprocessed_data/test_data.csv', index_col = 'Claim Identifier')

In [4]:
# Define Feature Selection: essential_features, reduced_features or []
feature_selection = []

## <span style="color:salmon"> 2. Prepare Dataset </span> 

Define `y` as the target column, "Claim Injury Type Encoded", and `X` as all the remaining columns.

In [5]:
X = train_df.drop(["Claim Injury Type Encoded"], axis = 1)
y = train_df["Claim Injury Type Encoded"]

## <span style="color:salmon"> 3. Naïve Bayes </span> 

In [6]:
model = GaussianNB()

In [7]:
check_performace(model,X,y,numerical_features,essential_features,n_folds = 5, random_state=random_state)

Fold 1 train F1 score: 0.2563
Fold 1 validation F1 score: 0.2553
------------------------------
Fold 2 train F1 score: 0.2311
Fold 2 validation F1 score: 0.2303
------------------------------
Fold 3 train F1 score: 0.2600
Fold 3 validation F1 score: 0.2590
------------------------------
Fold 4 train F1 score: 0.2575
Fold 4 validation F1 score: 0.2549
------------------------------
Fold 5 train F1 score: 0.2563
Fold 5 validation F1 score: 0.2550
------------------------------
Average Train F1 score: 0.2522294442421754
Average Validation F1 score: 0.25091159741995694


In [8]:
check_performace(model,X,y,numerical_features,reduced_features,n_folds = 5, random_state=random_state)

Fold 1 train F1 score: 0.2404
Fold 1 validation F1 score: 0.2390
------------------------------
Fold 2 train F1 score: 0.2335
Fold 2 validation F1 score: 0.2350
------------------------------
Fold 3 train F1 score: 0.2438
Fold 3 validation F1 score: 0.2421
------------------------------
Fold 4 train F1 score: 0.2399
Fold 4 validation F1 score: 0.2398
------------------------------
Fold 5 train F1 score: 0.2407
Fold 5 validation F1 score: 0.2391
------------------------------
Average Train F1 score: 0.23965430152953832
Average Validation F1 score: 0.23901496925651702


In [9]:
check_performace(model,X,y,numerical_features,[],n_folds = 5, random_state=random_state)

Fold 1 train F1 score: 0.1478
Fold 1 validation F1 score: 0.1460
------------------------------
Fold 2 train F1 score: 0.1419
Fold 2 validation F1 score: 0.1412
------------------------------
Fold 3 train F1 score: 0.1428
Fold 3 validation F1 score: 0.1443
------------------------------
Fold 4 train F1 score: 0.1519
Fold 4 validation F1 score: 0.1508
------------------------------
Fold 5 train F1 score: 0.1387
Fold 5 validation F1 score: 0.1395
------------------------------
Average Train F1 score: 0.14461760078839864
Average Validation F1 score: 0.1443770210906031


#### <span style="color:salmon"> 3.1  Evaluate the model </span> 


In [10]:
X_train, X_val, y_train, y_val = train_test_split(X,y, test_size = 0.2, stratify = y, shuffle = True, random_state=random_state)
X_train_to_preprocess=X_train.copy()

In [11]:
remove_outliers(X_train)
X_train, X_val = apply_frequency_encoding(X_train, X_val)
NA_imputer(X_train,X_val)
create_new_features(X_train,X_val)

In [12]:
scaler = StandardScaler().fit(X_train[numerical_features])
X_train[numerical_features]  = scaler.transform(X_train[numerical_features])
X_val[numerical_features]  = scaler.transform(X_val[numerical_features]) 
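
The scaler is fitted on the training split only and then reused on the validation split, so validation statistics never influence the transformation. A small sketch on made-up numbers (the column name `age` and the values are hypothetical) shows the effect:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data standing in for a numerical feature
train_demo = pd.DataFrame({"age": [20.0, 30.0, 40.0]})
val_demo = pd.DataFrame({"age": [30.0, 50.0]})

# Mean and standard deviation are learned from the training rows only
scaler_demo = StandardScaler().fit(train_demo[["age"]])
train_scaled = scaler_demo.transform(train_demo[["age"]])
val_scaled = scaler_demo.transform(val_demo[["age"]])

# The validation value 30.0 equals the training mean, so it maps to 0;
# 50.0 lies outside the training range and maps to a value above 2.
print(train_scaled.ravel(), val_scaled.ravel())
```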

In [13]:
drop_list = ["Average Weekly Wage"]
if feature_selection != []:
    for col in X.columns:
        if col not in feature_selection:
            drop_list.append(col)
X_train = X_train.drop(drop_list, axis=1)
X_val = X_val.drop(drop_list, axis=1)

In [14]:
model.fit(X_train, y_train)

In [15]:
y_train_pred = model.predict(X_train)

In [16]:
y_val_pred = model.predict(X_val)

In [17]:
class_mapping = {
    0:'1. CANCELLED', 
    1:'2. NON-COMP',
    2:'3. MED ONLY', 
    3:'4. TEMPORARY',
    4:'5. PPD SCH LOSS', 
    5:'6. PPD NSL', 
    6:'7. PTD', 
    7:'8. DEATH'
}

# Use the values from class_mapping as the target names
target_names = list(class_mapping.values())

Compute the confusion matrix and the classification report to assess per-class performance, first on the training set and then on the validation set.

In [18]:
# Evaluate the model on the training set
print("Confusion Matrix:")
print(confusion_matrix(y_train, y_train_pred))
print("\nClassification Report:")
print(classification_report(y_train, y_train_pred, target_names=target_names))

Confusion Matrix:
[[  5639    221   2182    142    495     33    876    392]
 [  5658  15619 187147   1823  12581    279   5227   4528]
 [   314    463  30840    286   6311    426  11682   4803]
 [  1459    674  39892   1332  15339   1875  52412   5822]
 [    66     10   2148    114  11974    773  21906   1633]
 [     1      0      8      3     38    113   3146     60]
 [     0      0      0      0      0      0     78      0]
 [     1      0      0      0      0      0    320     55]]

Classification Report:
                 precision    recall  f1-score   support

   1. CANCELLED       0.43      0.57      0.49      9980
    2. NON-COMP       0.92      0.07      0.13    232862
    3. MED ONLY       0.12      0.56      0.19     55125
   4. TEMPORARY       0.36      0.01      0.02    118805
5. PPD SCH LOSS       0.26      0.31      0.28     38624
     6. PPD NSL       0.03      0.03      0.03      3369
         7. PTD       0.00      1.00      0.00        78
       8. DEATH       0.00  

In [19]:
# Evaluate the model on the validation set
print("Confusion Matrix:")
print(confusion_matrix(y_val, y_val_pred))
print("\nClassification Report:")
print(classification_report(y_val, y_val_pred, target_names=target_names))

Confusion Matrix:
[[ 1391    59   570    36   114    10   218    97]
 [ 1394  3940 46801   424  3154    43  1296  1164]
 [   74   134  7620    57  1624   118  2914  1240]
 [  378   203 10001   314  3837   427 13086  1456]
 [   20     1   517    27  3085   177  5403   426]
 [    0     0     7     1     5    31   787    11]
 [    1     0     0     0     0     1    15     2]
 [    0     0     0     0     0     1    84     9]]

Classification Report:
                 precision    recall  f1-score   support

   1. CANCELLED       0.43      0.56      0.48      2495
    2. NON-COMP       0.91      0.07      0.13     58216
    3. MED ONLY       0.12      0.55      0.19     13781
   4. TEMPORARY       0.37      0.01      0.02     29702
5. PPD SCH LOSS       0.26      0.32      0.29      9656
     6. PPD NSL       0.04      0.04      0.04       842
         7. PTD       0.00      0.79      0.00        19
       8. DEATH       0.00      0.10      0.00        94

       accuracy                   
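
The per-class figures in the report can be recomputed directly from the confusion matrix, which helps when reading rows such as `7. PTD` (high recall, near-zero precision). The sketch below copies the validation matrix printed above and derives precision, recall, F1, macro F1 and accuracy with NumPy; the variable names are illustrative only.

```python
import numpy as np

# Validation confusion matrix copied from the output above
# (rows = true class, columns = predicted class)
cm = np.array([
    [1391,   59,   570,  36,  114,  10,   218,   97],
    [1394, 3940, 46801, 424, 3154,  43,  1296, 1164],
    [  74,  134,  7620,  57, 1624, 118,  2914, 1240],
    [ 378,  203, 10001, 314, 3837, 427, 13086, 1456],
    [  20,    1,   517,  27, 3085, 177,  5403,  426],
    [   0,    0,     7,   1,    5,  31,   787,   11],
    [   1,    0,     0,   0,    0,   1,    15,    2],
    [   0,    0,     0,   0,    0,   1,    84,    9],
])

true_pos = np.diag(cm)      # correctly classified samples per class
support = cm.sum(axis=1)    # number of true samples per class
predicted = cm.sum(axis=0)  # number of predictions per class

recall = true_pos / support
# Guard against classes that are never predicted (division by zero)
precision = np.divide(true_pos, predicted,
                      out=np.zeros(len(cm)), where=predicted > 0)
f1 = np.divide(2 * precision * recall, precision + recall,
               out=np.zeros(len(cm)), where=(precision + recall) > 0)

macro_f1 = f1.mean()                 # unweighted mean over the 8 classes
accuracy = true_pos.sum() / cm.sum()

for name, p, r, f in zip(range(1, 9), precision, recall, f1):
    print(f"class {name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Most samples are assigned to `3. MED ONLY` and `7. PTD`, which explains the low recall for the two largest classes, `2. NON-COMP` and `4. TEMPORARY`.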

## <span style="color:salmon"> 4. Test Predictions </span> 

Make test predictions:

In [20]:
remove_outliers(X_train_to_preprocess)
X_train_to_preprocess, test_df = apply_frequency_encoding(X_train_to_preprocess, test_df)
NA_imputer(X_train_to_preprocess, test_df)
create_new_features(X_train_to_preprocess, test_df)

In [21]:
test_df[numerical_features]  = scaler.transform(test_df[numerical_features])  

In [22]:
test_df = test_df.drop(drop_list, axis=1)

In [23]:
# Make test predictions
y_test_pred = model.predict(test_df)
y_test_pred = y_test_pred.ravel()

In [24]:
y_test_final = np.array([class_mapping[i] for i in y_test_pred])

In [25]:
test_id = test_df.index

In [26]:
submission_df = pd.DataFrame({
    'Claim Identifier': test_id,
    'Claim Injury Type': y_test_final
})

In [27]:
# Switch to True to write a new versioned submission file
if False:
    version = version_control()
    submission_df.to_csv(f'./submissions/Group49_Version{version:02}.csv', index=False)