# <center>Machine Learning Project</center>

** **
## <center>*04.4 - Naïve Bayes*</center>

** **

Team members:
- Ana Farinha - 20211514
- Francisco Capontes - 20211692
- Sofia Gomes - 20240848
- Rui Lourenço - 2021639



In [1]:
# Import libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
)

from sklearn.naive_bayes import GaussianNB

from utils import *
from utils_feature_selection import check_performace
from utils_dicts import *

import warnings
warnings.filterwarnings('ignore')

random_state=68+1

## <span style="color:salmon"> 1. Import Dataset </span> 

In [2]:
# Import the training dataset
train_df = pd.read_csv('preprocessed_data/train_data.csv', index_col="Claim Identifier")

In [3]:
# Import the test dataset
test_df = pd.read_csv('./preprocessed_data/test_data.csv', index_col = 'Claim Identifier')

In [4]:
# Define Feature Selection: essential_features, reduced_features or [] (No Feature Selection)
feature_selection = essential_features

Define `y` as the target column, `Claim Injury Type Encoded`, and `X` as all remaining columns.

In [5]:
X = train_df.drop(["Claim Injury Type Encoded"], axis = 1)
y = train_df["Claim Injury Type Encoded"]

## <span style="color:salmon"> 2. Naïve Bayes </span> 

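Naïve Bayes is a probabilistic classifier based on Bayes' theorem. It assumes the features are conditionally independent given the class, and the Gaussian variant used here (`GaussianNB`) models each continuous feature with a per-class normal distribution.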
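
To make the independence assumption concrete, here is a minimal pure-Python sketch of the Gaussian Naïve Bayes decision rule: for each class, add the log-prior to the sum of per-feature Gaussian log-likelihoods, then pick the class with the highest total. The helper names (`fit_gaussian_nb`, `predict_gaussian_nb`), the toy data, and the `1e-9` variance floor are illustrative assumptions. This is not how `sklearn.naive_bayes.GaussianNB` is implemented internally, only the same idea in a few lines.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a prior and per-feature mean/variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    params = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))  # one tuple of values per feature
        means = [sum(col) / n for col in columns]
        # Population variance plus a small floor (assumed value) to avoid
        # dividing by zero when a feature is constant within a class.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        params[label] = (n / len(X), means, variances)
    return params


def predict_gaussian_nb(params, row):
    """Return the class with the highest log-posterior for one sample."""

    def log_posterior(prior, means, variances):
        # Independence assumption: per-feature log-likelihoods simply add up.
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances)
        )
        return math.log(prior) + log_likelihood

    return max(params, key=lambda label: log_posterior(*params[label]))


# Toy data: two features, two well-separated classes (illustrative only).
X_toy = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 0.9], [3.2, 1.1], [2.9, 1.0]]
y_toy = [0, 0, 0, 1, 1, 1]

params = fit_gaussian_nb(X_toy, y_toy)
print(predict_gaussian_nb(params, [1.1, 2.0]))  # 0
print(predict_gaussian_nb(params, [3.1, 1.0]))  # 1
```

Working in log space avoids numerical underflow when many small per-feature probabilities are multiplied together.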

#### <span style="color:salmon"> 2.1  Model K-fold cross validation </span> 

In [6]:
model = GaussianNB()

In [7]:
check_performace(model,X,y,numerical_features,essential_features,n_folds = 3, random_state=random_state)

Fold 1 train F1 score: 0.2356
Fold 1 validation F1 score: 0.2313
------------------------------
Fold 2 train F1 score: 0.2601
Fold 2 validation F1 score: 0.2588
------------------------------
Fold 3 train F1 score: 0.2568
Fold 3 validation F1 score: 0.2554
------------------------------
Average Train F1 score: 0.25083770071696004
Average Validation F1 score: 0.2485143872630937


In [8]:
# With reduced_features, the average validation F1 score falls between the scores for no feature selection and for the essential features only
#check_performace(model,X,y,numerical_features,reduced_features,n_folds = 3, random_state=random_state)

In [9]:
check_performace(model,X,y,numerical_features,[],n_folds = 3, random_state=random_state)

Fold 1 train F1 score: 0.1461
Fold 1 validation F1 score: 0.1447
------------------------------
Fold 2 train F1 score: 0.1501
Fold 2 validation F1 score: 0.1511
------------------------------
Fold 3 train F1 score: 0.1388
Fold 3 validation F1 score: 0.1388
------------------------------
Average Train F1 score: 0.14500918632274737
Average Validation F1 score: 0.1448632889002067
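
The body of `check_performace` lives in `utils_feature_selection` and is not shown in this notebook. As a rough sketch of what a K-fold check of this kind can look like, the example below runs a 3-fold loop on synthetic data and fits the scaler inside each fold. The stratified splitting, the macro-averaged F1, the synthetic dataset, and the names `X_demo`, `y_demo`, `scaler_demo`, and `clf` are all assumptions made for illustration, not a description of the actual helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the claims data (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=69
)

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=69)
val_scores = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X_demo, y_demo), start=1):
    # Fit the scaler on the training fold only, so that no validation
    # statistics leak into the preprocessing step.
    scaler_demo = StandardScaler().fit(X_demo[train_idx])
    X_tr = scaler_demo.transform(X_demo[train_idx])
    X_va = scaler_demo.transform(X_demo[val_idx])

    clf = GaussianNB().fit(X_tr, y_demo[train_idx])
    # Macro averaging weights every class equally (assumed metric).
    score = f1_score(y_demo[val_idx], clf.predict(X_va), average="macro")
    val_scores.append(score)
    print(f"Fold {fold} validation F1 score: {score:.4f}")

print("Average Validation F1 score:", np.mean(val_scores))
```

Stratified folds keep the class proportions roughly constant across splits, which matters for a target as imbalanced as the one in this project.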


#### <span style="color:salmon"> 2.2  Train the model </span> 

In [10]:
model = GaussianNB()

In [11]:
X_train, X_val, y_train, y_val = train_test_split(X,y, test_size = 0.2, stratify = y, shuffle = True, random_state=random_state)
X_train_to_preprocess=X_train.copy()

In [12]:
remove_outliers(X_train)
X_train, X_val = apply_frequency_encoding(X_train, X_val)
NA_imputer(X_train,X_val)
create_new_features(X_train,X_val)

In [13]:
scaler = StandardScaler().fit(X_train[numerical_features])
X_train[numerical_features]  = scaler.transform(X_train[numerical_features])
X_val[numerical_features]  = scaler.transform(X_val[numerical_features]) 

In [14]:
# Collect every column outside the selected feature set;
# an empty feature_selection keeps all columns.
drop_list = []
if feature_selection != []:
    drop_list = [col for col in X_train.columns if col not in feature_selection]
X_train = X_train.drop(drop_list, axis=1)
X_val = X_val.drop(drop_list, axis=1)

In [15]:
model.fit(X_train, y_train)

In [16]:
y_train_pred = model.predict(X_train)

In [17]:
y_val_pred = model.predict(X_val)

#### <span style="color:salmon"> 2.3  Model Results </span> 

In [18]:
class_mapping = {
    0:'1. CANCELLED', 
    1:'2. NON-COMP',
    2:'3. MED ONLY', 
    3:'4. TEMPORARY',
    4:'5. PPD SCH LOSS', 
    5:'6. PPD NSL', 
    6:'7. PTD', 
    7:'8. DEATH'
}

# Use the values from class_mapping as the target names
target_names = list(class_mapping.values())

Compute the confusion matrix and the classification report, first on the training set and then on the validation set, to evaluate the classifier.

In [19]:
# Evaluate the model on the training set
print("Confusion Matrix:")
print(confusion_matrix(y_train, y_train_pred))
print("\nClassification Report:")
print(classification_report(y_train, y_train_pred, target_names=target_names))

Confusion Matrix:
[[  5551   2790    477     76      2      0   1040     44]
 [  6790 209467   5202   1041     67     26   9984    285]
 [   183  31321   3664   5503    899     44  13323    188]
 [  1584  26832   5116  21700   2489    134  60563    387]
 [    31    801    872   2454   1729     37  32631     69]
 [     2      7     16     63     41      8   3232      0]
 [     0      0      0      0      0      0     78      0]
 [     0     27     16     19      2      0    294     18]]

Classification Report:
                 precision    recall  f1-score   support

   1. CANCELLED       0.39      0.56      0.46      9980
    2. NON-COMP       0.77      0.90      0.83    232862
    3. MED ONLY       0.24      0.07      0.10     55125
   4. TEMPORARY       0.70      0.18      0.29    118805
5. PPD SCH LOSS       0.33      0.04      0.08     38624
     6. PPD NSL       0.03      0.00      0.00      3369
         7. PTD       0.00      1.00      0.00        78
       8. DEATH       0.02  

In [20]:
# Evaluate the model on the validation set
print("Confusion Matrix:")
print(confusion_matrix(y_val, y_val_pred))
print("\nClassification Report:")
print(classification_report(y_val, y_val_pred, target_names=target_names))

Confusion Matrix:
[[ 1374   696   146    16     1     0   252    10]
 [ 1696 52477  1253   275    18     6  2426    65]
 [   52  7864   946  1341   213    10  3299    56]
 [  390  6756  1328  5449   646    31 14986   116]
 [    6   215   215   616   428    10  8144    22]
 [    0     0     5    13     9     1   813     1]
 [    1     0     0     1     0     0    17     0]
 [    0     5     4     2     1     0    76     6]]

Classification Report:
                 precision    recall  f1-score   support

   1. CANCELLED       0.39      0.55      0.46      2495
    2. NON-COMP       0.77      0.90      0.83     58216
    3. MED ONLY       0.24      0.07      0.11     13781
   4. TEMPORARY       0.71      0.18      0.29     29702
5. PPD SCH LOSS       0.33      0.04      0.08      9656
     6. PPD NSL       0.02      0.00      0.00       842
         7. PTD       0.00      0.89      0.00        19
       8. DEATH       0.02      0.06      0.03        94

       accuracy                   
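
The per-class figures in the report can be recomputed directly from the confusion matrix, which also explains the `7. PTD` row: its recall is high while its precision rounds to zero, because the model assigns 30,013 validation rows to that class against a support of only 19. The sketch below uses the validation matrix printed above. The helper `safe_div` and the variable names are illustrative, and the macro average is computed here as the plain mean of the per-class F1 scores.

```python
# Validation confusion matrix copied from the output above
# (rows = true class, columns = predicted class).
cm = [
    [1374, 696, 146, 16, 1, 0, 252, 10],
    [1696, 52477, 1253, 275, 18, 6, 2426, 65],
    [52, 7864, 946, 1341, 213, 10, 3299, 56],
    [390, 6756, 1328, 5449, 646, 31, 14986, 116],
    [6, 215, 215, 616, 428, 10, 8144, 22],
    [0, 0, 5, 13, 9, 1, 813, 1],
    [1, 0, 0, 1, 0, 0, 17, 0],
    [0, 5, 4, 2, 1, 0, 76, 6],
]
labels = [
    "1. CANCELLED", "2. NON-COMP", "3. MED ONLY", "4. TEMPORARY",
    "5. PPD SCH LOSS", "6. PPD NSL", "7. PTD", "8. DEATH",
]


def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


support = [sum(row) for row in cm]          # true samples per class
predicted = [sum(col) for col in zip(*cm)]  # predictions per class

n_classes = len(cm)
recall = [safe_div(cm[i][i], support[i]) for i in range(n_classes)]
precision = [safe_div(cm[i][i], predicted[i]) for i in range(n_classes)]
f1 = [safe_div(2 * p * r, p + r) for p, r in zip(precision, recall)]
macro_f1 = sum(f1) / n_classes

for name, p, r, f, s, n_pred in zip(labels, precision, recall, f1, support, predicted):
    print(f"{name:>15}  precision={p:.2f}  recall={r:.2f}  f1={f:.2f}  "
          f"support={s}  predicted={n_pred}")
print(f"Macro F1: {macro_f1:.4f}")
```

Comparing the `support` and `predicted` columns side by side is a quick way to spot classes the model over-predicts or under-predicts.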

## <span style="color:salmon"> 3. Test Predictions </span> 

In [21]:
remove_outliers(X_train_to_preprocess)
X_train_to_preprocess, test_df = apply_frequency_encoding(X_train_to_preprocess, test_df)
NA_imputer(X_train_to_preprocess, test_df)
create_new_features(X_train_to_preprocess, test_df)

In [22]:
test_df[numerical_features]  = scaler.transform(test_df[numerical_features])  

In [23]:
test_df = test_df.drop(drop_list, axis=1)

In [24]:
# Make test predictions
y_test_pred = model.predict(test_df)
y_test_pred = y_test_pred.ravel()

In [25]:
y_test_final = np.array([class_mapping[i] for i in y_test_pred])

In [26]:
test_id = test_df.index

In [27]:
submission_df = pd.DataFrame({
    'Claim Identifier': test_id,
    'Claim Injury Type': y_test_final
})

In [28]:
# Change to True to write a new submission file
if False:
    version = version_control()
    submission_df.to_csv(f'./submissions/Group49_Version{version:02}.csv', index=False)