# <center>Machine Learning Project</center>

** **
## <center>*03.11 - Light Gradient Boost*</center>

** **

The members of the `team` are:
- Ana Farinha - 20211514
- Francisco Capontes - 20211692
- Sofia Gomes - 20240848
- Rui Lourenço - 2021639



In [1]:
# Import libraries
import pandas as pd
import numpy as np

# Splitting, scaling, and evaluation utilities
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
import lightgbm as lgb

from sklearn.model_selection import StratifiedKFold
import time

from utils import *
from utils_feature_selection import check_performace
from utils_dicts import *

import warnings
warnings.filterwarnings('ignore')

## <span style="color:salmon"> 1. Import Dataset </span> 

In [2]:
# Import dataset
train_df = pd.read_csv('preprocessed_data/train_data.csv', index_col="Claim Identifier")

In [43]:
# Import dataset
test_df = pd.read_csv('./preprocessed_data/test_data.csv', index_col = 'Claim Identifier')

In [4]:
# Define Feature Selection: essential_features, reduced_features or []
feature_selection = []

## <span style="color:salmon"> 2. Prepare Dataset </span> 

Define `y` as the target column, "Claim Injury Type Encoded", and `X` as all the remaining columns.

In [5]:
X = train_df.drop(["Claim Injury Type Encoded"], axis = 1)
y = train_df["Claim Injury Type Encoded"]

## <span style="color:salmon"> 3. LightGBM </span> 

In [8]:
model = lgb.LGBMClassifier(
    verbose=-1
)

In [9]:
check_performace(model,X,y,numerical_features,essential_features,n_folds = 5)

Fold 1 train F1 score: 0.3451
Fold 1 validation F1 score: 0.3145
------------------------------
Fold 2 train F1 score: 0.3599
Fold 2 validation F1 score: 0.3167
------------------------------
Fold 3 train F1 score: 0.3253
Fold 3 validation F1 score: 0.2971
------------------------------
Fold 4 train F1 score: 0.3437
Fold 4 validation F1 score: 0.3184
------------------------------
Fold 5 train F1 score: 0.3203
Fold 5 validation F1 score: 0.3015
------------------------------
Average Train F1 score: 0.33884674807526743
Average Validation F1 score: 0.3096306143004636
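The per-fold scores above come from `check_performace`, which lives in `utils_feature_selection` and is not shown in this notebook. As a rough illustration only, the sketch below shows one way such a stratified K-fold loop could be written. The function name `cross_validate_f1`, the macro averaging of F1, the synthetic data, and the decision-tree stand-in model are all assumptions; the real helper also receives `numerical_features` and a feature list, so it presumably preprocesses inside each fold, which is omitted here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier


def cross_validate_f1(model, X, y, n_folds=5, seed=0):
    """Hypothetical helper: per-fold (train, validation) macro-F1 scores."""
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    scores = []
    for fold, (train_idx, val_idx) in enumerate(skf.split(X, y), start=1):
        X_tr, X_va = X[train_idx], X[val_idx]
        y_tr, y_va = y[train_idx], y[val_idx]
        # Refit on the training part of the fold only
        model.fit(X_tr, y_tr)
        # Macro averaging is an assumption about the original metric
        train_f1 = f1_score(y_tr, model.predict(X_tr), average="macro")
        val_f1 = f1_score(y_va, model.predict(X_va), average="macro")
        print(f"Fold {fold} train F1 score: {train_f1:.4f}")
        print(f"Fold {fold} validation F1 score: {val_f1:.4f}")
        scores.append((train_f1, val_f1))
    print("Average Validation F1 score:", np.mean([v for _, v in scores]))
    return scores


# Synthetic stand-in data; the notebook uses the claims dataset instead
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)
demo_scores = cross_validate_f1(
    DecisionTreeClassifier(max_depth=3, random_state=0), X_demo, y_demo
)
```

Stratifying the folds keeps the class proportions similar across splits, which matters here because several injury types are very rare.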


In [10]:
check_performace(model,X,y,numerical_features,reduced_features,n_folds = 5)

Fold 1 train F1 score: 0.3658
Fold 1 validation F1 score: 0.3313
------------------------------
Fold 2 train F1 score: 0.3665
Fold 2 validation F1 score: 0.3309
------------------------------
Fold 3 train F1 score: 0.3908
Fold 3 validation F1 score: 0.3377
------------------------------
Fold 4 train F1 score: 0.3814
Fold 4 validation F1 score: 0.3360
------------------------------
Fold 5 train F1 score: 0.3773
Fold 5 validation F1 score: 0.3352
------------------------------
Average Train F1 score: 0.3763747413987285
Average Validation F1 score: 0.3342087379453134


In [11]:
check_performace(model,X,y,numerical_features,[],n_folds = 5)

Fold 1 train F1 score: 0.4301
Fold 1 validation F1 score: 0.3728
------------------------------
Fold 2 train F1 score: 0.4045
Fold 2 validation F1 score: 0.3691
------------------------------
Fold 3 train F1 score: 0.3882
Fold 3 validation F1 score: 0.3645
------------------------------
Fold 4 train F1 score: 0.3986
Fold 4 validation F1 score: 0.3652
------------------------------
Fold 5 train F1 score: 0.4121
Fold 5 validation F1 score: 0.3712
------------------------------
Average Train F1 score: 0.40673581351977794
Average Validation F1 score: 0.36856011189390914
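To compare the three runs side by side, the averages printed above can be collected into a small table. The snippet below only copies the reported numbers and computes the train/validation gap; the dictionary keys are labels chosen for this illustration (the empty list `[]` is shown as "all features").

```python
# Average F1 scores copied from the three check_performace runs above
results = {
    "essential_features": (0.33884674807526743, 0.3096306143004636),
    "reduced_features": (0.3763747413987285, 0.3342087379453134),
    "all features": (0.40673581351977794, 0.36856011189390914),
}

# Gap between train and validation F1 as a rough overfitting indicator
gaps = {name: train - val for name, (train, val) in results.items()}

# Feature set with the highest average validation F1
best = max(results, key=lambda name: results[name][1])

for name, (train, val) in results.items():
    print(f"{name:<20} train={train:.4f}  val={val:.4f}  gap={gaps[name]:.4f}")
print("Best validation F1:", best)
```

On these numbers the full feature set has the highest validation score, which is consistent with `feature_selection = []` being used for the rest of the notebook.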


#### <span style="color:salmon"> 3.1  Evaluate the model </span> 


In [12]:
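# Note: test_size = 0.75 holds out 75% of the rows for validation, so the model is fit on only 25% of the data
X_train, X_val, y_train, y_val = train_test_split(X,y, test_size = 0.75, stratify = y, shuffle = True)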

In [13]:
X_train, X_val = apply_frequency_encoding(X_train, X_val)
NA_imputer(X_train,X_val)
create_new_features(X_train,X_val)
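`apply_frequency_encoding` comes from `utils` and its implementation is not shown. As a hedged sketch of the general technique, the hypothetical function below replaces each category with its relative frequency in the training split and reuses that mapping for the validation split, so no validation information leaks into the encoding. The function name, the fill value of 0 for unseen categories, and the county values are assumptions made for illustration.

```python
import pandas as pd


def frequency_encode(train, val, columns):
    """Hypothetical sketch: encode categories by their training-set frequency."""
    train, val = train.copy(), val.copy()
    for col in columns:
        # Relative frequency of each category, learned from the training data only
        freq = train[col].value_counts(normalize=True)
        train[col] = train[col].map(freq)
        # Categories never seen in training get frequency 0 (an assumption)
        val[col] = val[col].map(freq).fillna(0.0)
    return train, val


# Made-up values for a column name that appears in the dataset
train_demo = pd.DataFrame({"County of Injury": ["KINGS", "KINGS", "QUEENS", "ERIE"]})
val_demo = pd.DataFrame({"County of Injury": ["KINGS", "BRONX"]})

train_enc, val_enc = frequency_encode(train_demo, val_demo, ["County of Injury"])
print(train_enc)
print(val_enc)
```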

In [14]:
scaler = StandardScaler().fit(X_train[numerical_features])
X_train[numerical_features]  = scaler.transform(X_train[numerical_features])
X_val[numerical_features]  = scaler.transform(X_val[numerical_features]) 

In [25]:
# Always drop "Average Weekly Wage"; with a feature subset selected, also drop every column outside it
drop_list = ["Average Weekly Wage"]
if feature_selection:
    drop_list += [col for col in X.columns if col not in feature_selection]
X_train = X_train.drop(drop_list, axis=1)
X_val = X_val.drop(drop_list, axis=1)

In [26]:
model.fit(X_train, y_train)

In [27]:
y_train_pred = model.predict(X_train)

In [28]:
y_val_pred = model.predict(X_val)

In [29]:
class_mapping = {
    0:'1. CANCELLED', 
    1:'2. NON-COMP',
    2:'3. MED ONLY', 
    3:'4. TEMPORARY',
    4:'5. PPD SCH LOSS', 
    5:'6. PPD NSL', 
    6:'7. PTD', 
    7:'8. DEATH'
}

# Use the values from class_mapping as the target names
target_names = list(class_mapping.values())

Compute the confusion matrix and the classification report, first on the training set and then on the validation set.

In [30]:
# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_train, y_train_pred))
print("\nClassification Report:")
print(classification_report(y_train, y_train_pred, target_names=target_names))

Confusion Matrix:
[[ 1784  1158    24   121     7     0     6     4]
 [  383 68945   883  2204   170     1    42    46]
 [   16  9845  1757  4743   795     7    33    18]
 [   32  8658   897 25194  2141    57    77    36]
 [    4   453   202  3772  7576    17    31     7]
 [    1     1     7   595   108   334     4     1]
 [    0     2     0    20     1     0     1     0]
 [    1    15    10    71     4     0     0    13]]

Classification Report:
                 precision    recall  f1-score   support

   1. CANCELLED       0.80      0.57      0.67      3104
    2. NON-COMP       0.77      0.95      0.85     72674
    3. MED ONLY       0.46      0.10      0.17     17214
   4. TEMPORARY       0.69      0.68      0.68     37092
5. PPD SCH LOSS       0.70      0.63      0.66     12062
     6. PPD NSL       0.80      0.32      0.46      1051
         7. PTD       0.01      0.04      0.01        24
       8. DEATH       0.10      0.11      0.11       114

       accuracy                   
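The per-class figures in the report can be recovered directly from the confusion matrix: rows are true classes and columns are predicted classes, so recall divides the diagonal by the row sums and precision divides it by the column sums. The sketch below redoes that arithmetic with NumPy on the training matrix printed above; the variable names are chosen for this illustration.

```python
import numpy as np

# Training confusion matrix copied from the output above
# (rows = true class, columns = predicted class)
cm = np.array([
    [1784,  1158,   24,   121,    7,   0,  6,  4],
    [ 383, 68945,  883,  2204,  170,   1, 42, 46],
    [  16,  9845, 1757,  4743,  795,   7, 33, 18],
    [  32,  8658,  897, 25194, 2141,  57, 77, 36],
    [   4,   453,  202,  3772, 7576,  17, 31,  7],
    [   1,     1,    7,   595,  108, 334,  4,  1],
    [   0,     2,    0,    20,    1,   0,  1,  0],
    [   1,    15,   10,    71,    4,   0,  0, 13],
])

true_positives = np.diag(cm)
support = cm.sum(axis=1)          # number of true samples per class
predicted_counts = cm.sum(axis=0) # number of predictions per class

recall = true_positives / support
precision = true_positives / predicted_counts
f1 = 2 * precision * recall / (precision + recall)

accuracy = true_positives.sum() / cm.sum()
macro_f1 = f1.mean()  # unweighted mean over the eight classes

for i in range(len(cm)):
    print(f"class {i}: precision={precision[i]:.2f} recall={recall[i]:.2f} "
          f"f1={f1[i]:.2f} support={support[i]}")
print(f"accuracy={accuracy:.4f}  macro F1={macro_f1:.4f}")
```

Because the macro F1 weights every class equally, the near-zero scores on rare classes such as PTD and DEATH pull it well below the accuracy.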

In [20]:
# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_val, y_val_pred))
print("\nClassification Report:")
print(classification_report(y_val, y_val_pred, target_names=target_names))


Confusion Matrix:
[[  4315   4505    198    204     24      3      8     53]
 [  1636 211387   1554   2436    203     23    172    610]
 [   174  27200   4079  17572   2262     49    144    164]
 [   274   6450   2291  93453   7393    578    628    211]
 [    91    605    606  13455  21040    112    222     55]
 [     2     13     51   2471    456    124     34      1]
 [     0      0      0     59      7      4      3      0]
 [     5     50     19    238     12      0      0     19]]

Classification Report:
                 precision    recall  f1-score   support

   1. CANCELLED       0.66      0.46      0.55      9310
    2. NON-COMP       0.84      0.97      0.90    218021
    3. MED ONLY       0.46      0.08      0.13     51644
   4. TEMPORARY       0.72      0.84      0.78    111278
5. PPD SCH LOSS       0.67      0.58      0.62     36186
     6. PPD NSL       0.14      0.04      0.06      3152
         7. PTD       0.00      0.04      0.00        73
       8. DEATH       0.02  

## <span style="color:salmon"> 4. Test Predictions </span> 

Make test predictions:

In [31]:
X, test_df = apply_frequency_encoding(X, test_df)
NA_imputer(X, test_df)
create_new_features(X, test_df)

KeyError: 'County of Injury'

In [22]:
scaler = StandardScaler().fit(X[numerical_features])
X[numerical_features]  = scaler.transform(X[numerical_features])
test_df[numerical_features]  = scaler.transform(test_df[numerical_features])  

In [23]:
# Always drop "Average Weekly Wage"; with a feature subset selected, also drop every column outside it
drop_list = ["Average Weekly Wage"]
if feature_selection:
    drop_list += [col for col in X.columns if col not in feature_selection]
test_df = test_df.drop(drop_list, axis=1)

In [32]:
# Make test predictions
y_test_pred = model.predict(test_df)
y_test_pred = y_test_pred.ravel()

ValueError: Number of features of the model must match the input. Model n_features_ is 64 and input n_features is 66

In [40]:
for col in test_df.columns:
    if col not in X.columns:
        print(col)

Agreement Reached
WCB Decision
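The check above shows that the test set carries two columns the training features do not. One possible way to handle such a mismatch, sketched below under assumptions, is to select the training columns from the test frame in the training order and report what was left out. The helper name `align_to_training_columns` and the tiny example frames are hypothetical; whether these two columns should be dropped from the test set or kept in the training set is a decision for the preprocessing step.

```python
import pandas as pd


def align_to_training_columns(train, test):
    """Hypothetical helper: keep only the training columns, in training order."""
    extra = [col for col in test.columns if col not in train.columns]
    missing = [col for col in train.columns if col not in test.columns]
    if missing:
        # A column the model was fit on cannot be invented at prediction time
        raise ValueError(f"Test set is missing training columns: {missing}")
    return test[train.columns.tolist()], extra


# Tiny made-up frames using column names from the dataset
train_demo = pd.DataFrame({"Age at Injury": [31, 46], "IME-4 Count": [0, 2]})
test_demo = pd.DataFrame({
    "IME-4 Count": [1],
    "Agreement Reached": [0.0],
    "WCB Decision": [0.0],
    "Age at Injury": [52],
})

aligned, dropped = align_to_training_columns(train_demo, test_demo)
print("Dropped from test:", dropped)
print(aligned)
```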


In [48]:
test_df["Agreement Reached"]

Claim Identifier
5393875    0.0
5393091    1.0
5393889    0.0
5393887    0.0
5393863    0.0
          ... 
6165265    0.0
6165285    0.0
6165506    0.0
6165339    0.0
6165075    0.0
Name: Agreement Reached, Length: 573342, dtype: float64

In [45]:
test_df.columns

Index(['Accident Date', 'Age at Injury', 'Average Weekly Wage', 'Birth Year',
       'County of Injury', 'District Name', 'IME-4 Count', 'Industry Code',
       'WCIO Cause of Injury Code', 'WCIO Nature of Injury Code',
       'WCIO Part Of Body Code', 'Zip Code', 'Agreement Reached',
       'WCB Decision', 'Number of Dependents', 'Known Accident Date',
       'Known Assembly Date', 'Known C-2 Date', 'Known C-3 Date',
       'Known First Hearing Date', 'Known Age at Injury', 'Known Birth Year',
       'Accident Date_Year', 'Accident Date_Month', 'Accident Date_Day',
       'Accident Date_DayOfWeek', 'Assembly Date_Year', 'Assembly Date_Month',
       'Assembly Date_Day', 'Assembly Date_DayOfWeek', 'C-2 Date_Year',
       'C-2 Date_Month', 'C-2 Date_Day', 'C-2 Date_DayOfWeek', 'C-3 Date_Year',
       'C-3 Date_Month', 'C-3 Date_Day', 'C-3 Date_DayOfWeek',
       'First Hearing Date_Year', 'First Hearing Date_Month',
       'First Hearing Date_Day', 'First Hearing Date_DayOfWeek',
      

In [None]:
y_test_final = np.array([class_mapping[i] for i in y_test_pred])

In [None]:
test_id = test_df.index

In [None]:
submission_df = pd.DataFrame({
    'Claim Identifier': test_id,
    'Claim Injury Type': y_test_final
})

In [None]:
if False:
    version = version_control()
    submission_df.to_csv(f'./submissions/Group49_Version{version:02}.csv', index=False)