# <center>Machine Learning Project</center>

** **
## <center>*06 - Predictive Model*</center>

** **

The team members are:
- Ana Farinha - 20211514
- Francisco Capontes - 20211692
- Sofia Gomes - 20240848
- Rui Lourenço - 2021639


| Model              | Best Parameters  | Feature Selection | Average Validation Macro F1-Score | Validation Macro F1-Score |
|--------------------|--------------------------------------|---|---------------|---------------|
| CatBoostClassifier | `{"iterations": 1000, "learning_rate": 0.11, "depth": 6, "l2_leaf_reg": 5, "bagging_temperature": 0.4}`| `None` | `0.41` | `0.42` |
| XGBoostClassifier  | `{"n_estimators": 200, "learning_rate": 0.2, "max_depth": 7, "subsample": 0.9, "colsample_bytree": 0.9, "gamma": 0.3}`| `None` | `0.40`|`0.42`|
| Decision Trees     | `{"min_samples_split": 10, "min_samples_leaf": 4, "max_depth": 20, "criterion": "entropy"}`| `Essential Features` | `0.33`| `0.34`|
| Naive Bayes        | `Default Parameters`| `Essential Features` | `0.24`| `0.23`|
| StackEnsemble      | `CatBoost Config 1, XGBoost Config 1, Decision Trees, Default Parameters`     | `Essential Features` | `0.30`| `0.31`|
| VotingEnsemble     | `CatBoost Config 2, XGBoost Config 2, Decision Trees, Default Parameters` | `None` | `0.41`| `0.42`|
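
The scores in the table are macro-averaged F1: F1 is computed per class and the per-class values are averaged with equal weight, so rare classes count as much as frequent ones. A minimal pure-Python sketch on made-up labels (the toy data and the helper name `macro_f1` are illustrative assumptions, not the project's evaluation code):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); guard against division by zero
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy, imbalanced example: the rare class 2 is never predicted
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 0, 1, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy = {accuracy:.3f}, macro F1 = {score:.3f}")
```

On this toy data, accuracy is well above macro F1 because the missed minority class pulls the average down.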

In [1]:
# Mention that we will use CatBoost or XGBoost

In [2]:
# Import libraries
import pandas as pd
import numpy as np

# Splitting, scaling, and evaluation utilities
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.metrics import accuracy_score, f1_score, precision_score

from catboost import CatBoostClassifier
from xgboost import XGBClassifier

from sklearn.model_selection import StratifiedKFold
import time

from utils import *
from utils_feature_selection import check_performace
from utils_dicts import *

import warnings
warnings.filterwarnings('ignore')

random_state = 69

In [3]:
# Import datasets
train_df = pd.read_csv('./preprocessed_data/train_data.csv', index_col='Claim Identifier')
test_df = pd.read_csv('./preprocessed_data/test_data.csv', index_col='Claim Identifier')

In [4]:
# Feature selection: essential_features, reduced_features, or [] to keep all features
feature_selection = []

Define `y` as the target column, `Claim Injury Type Encoded`, and `X` as all remaining columns.

In [5]:
X = train_df.drop(["Claim Injury Type Encoded"], axis = 1)
y = train_df["Claim Injury Type Encoded"]

In [6]:
# XGBoost hyperparameters (the values listed in the results table above)
config = {
    "n_estimators": 200,
    "learning_rate": 0.2,
    "max_depth": 7,
    "subsample": 0.9,
    "colsample_bytree": 0.9,
    "gamma": 0.3,
    "random_state": 69,
}

In [7]:
model = XGBClassifier(
    # Tuned hyperparameters
    n_estimators=config["n_estimators"],
    learning_rate=config["learning_rate"],
    max_depth=config["max_depth"],
    subsample=config["subsample"],
    colsample_bytree=config["colsample_bytree"],
    gamma=config["gamma"],
    # Fixed settings for the 8-class target
    objective="multi:softmax",
    num_class=8,
    eval_metric="merror",
    random_state=config["random_state"],
    verbosity=0,
)

In [8]:
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, stratify=y, shuffle=True, random_state=69
)
# Untouched copy of the training features, used later to preprocess the test set
X_train_to_preprocess = X_train.copy()
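
The split above passes `stratify=y`, which keeps the class proportions of the target the same in the training and validation parts. A small sketch on synthetic data (the toy arrays are assumptions for illustration only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 80 samples of class 0, 20 of class 1
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.20, stratify=y_toy, shuffle=True, random_state=69
)

# Both parts keep the original 80/20 class ratio
train_counts = np.bincount(y_tr)
val_counts = np.bincount(y_va)
print(train_counts, val_counts)
```

With classes as rare as `7. PTD` in this dataset, stratifying helps ensure every class appears in the validation set.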

In [9]:
remove_outliers(X_train)
X_train, X_val = apply_frequency_encoding(X_train, X_val)
NA_imputer(X_train,X_val)
create_new_features(X_train,X_val)
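
`apply_frequency_encoding` comes from `utils` and its implementation is not shown here. As a rough idea of what frequency encoding usually does, the sketch below learns category frequencies on the training data and applies them to a second dataset. The function name `frequency_encode_sketch`, the `county` column, and the choice to map unseen categories to 0 are all assumptions, not the project's actual helper:

```python
import pandas as pd


def frequency_encode_sketch(train, other, columns):
    """Replace each category with its relative frequency in `train` (hypothetical helper)."""
    train, other = train.copy(), other.copy()
    for col in columns:
        freqs = train[col].value_counts(normalize=True)
        train[col] = train[col].map(freqs)
        # Categories unseen in training get frequency 0 (an assumption)
        other[col] = other[col].map(freqs).fillna(0.0)
    return train, other


train_toy = pd.DataFrame({"county": ["A", "A", "B", "A"]})
val_toy = pd.DataFrame({"county": ["B", "C"]})

train_enc, val_enc = frequency_encode_sketch(train_toy, val_toy, ["county"])
print(train_enc["county"].tolist())
print(val_enc["county"].tolist())
```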

In [10]:
# Same preprocessing steps for the test set, using the untouched copy of the training features
remove_outliers(X_train_to_preprocess)
X_train_to_preprocess, test_df = apply_frequency_encoding(X_train_to_preprocess, test_df)
NA_imputer(X_train_to_preprocess, test_df)
create_new_features(X_train_to_preprocess, test_df)

In [11]:
# Fit the scaler on the training data only, then apply it to validation and test
scaler = StandardScaler().fit(X_train[numerical_features])
X_train[numerical_features] = scaler.transform(X_train[numerical_features])
X_val[numerical_features] = scaler.transform(X_val[numerical_features])
test_df[numerical_features] = scaler.transform(test_df[numerical_features])
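
The scaler is fitted on the training features and then reused for the validation and test sets, so no statistics from held-out data reach the model. A toy sketch of the same pattern (the `age` column and its values are made up):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

train_toy = pd.DataFrame({"age": [20.0, 30.0, 40.0]})
val_toy = pd.DataFrame({"age": [30.0, 50.0]})

# Mean and standard deviation come from the training data only
scaler_toy = StandardScaler().fit(train_toy[["age"]])
train_toy[["age"]] = scaler_toy.transform(train_toy[["age"]])
val_toy[["age"]] = scaler_toy.transform(val_toy[["age"]])

print(scaler_toy.mean_, scaler_toy.scale_)
print(val_toy["age"].tolist())
```

The validation value 30 maps to 0 because it equals the training mean, not because of anything computed on the validation set.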

In [12]:
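# "Average Weekly Wage" is always dropped; when a feature subset is selected,
# every column outside that subset is dropped as well
drop_list = ["Average Weekly Wage"]
if feature_selection:
    drop_list += [col for col in X_train.columns
                  if col not in feature_selection and col not in drop_list]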

In [13]:
X_train = X_train.drop(drop_list, axis=1)
X_val = X_val.drop(drop_list, axis=1)
test_df = test_df.drop(drop_list, axis=1)
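
The two cells above can be read as: always drop `Average Weekly Wage`, and when a feature subset is chosen, also drop everything outside it. A self-contained sketch of that logic on a toy frame (the helper `build_drop_list` and the columns `Age`, `Gender`, and `Zip` are hypothetical):

```python
import pandas as pd


def build_drop_list(columns, feature_selection, always_drop=("Average Weekly Wage",)):
    """Columns to drop: the always-dropped ones plus anything outside the selection."""
    drop = list(always_drop)
    if feature_selection:
        drop += [c for c in columns if c not in feature_selection and c not in drop]
    return drop


toy = pd.DataFrame(
    {"Age": [30], "Gender": [1], "Average Weekly Wage": [500.0], "Zip": [10001]}
)

keep_all = build_drop_list(toy.columns, [])             # no feature selection
subset = build_drop_list(toy.columns, ["Age", "Gender"])  # keep only a subset
remaining = toy.drop(subset, axis=1).columns.tolist()
print(keep_all, subset, remaining)
```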

In [14]:
model.fit(X_train, y_train)

In [15]:
y_train_pred = model.predict(X_train)
y_val_pred = model.predict(X_val)

In [16]:
class_mapping = {
    0:'1. CANCELLED', 
    1:'2. NON-COMP',
    2:'3. MED ONLY', 
    3:'4. TEMPORARY',
    4:'5. PPD SCH LOSS', 
    5:'6. PPD NSL', 
    6:'7. PTD', 
    7:'8. DEATH'
}

# Use the values from class_mapping as the target names
target_names = list(class_mapping.values())

In [17]:
# Report on the training set
print("\nClassification Report:")
print(classification_report(y_train, y_train_pred, target_names=target_names))
# Report on the validation set
print("\nClassification Report:")
print(classification_report(y_val, y_val_pred, target_names=target_names))


Classification Report:
                 precision    recall  f1-score   support

   1. CANCELLED       0.85      0.65      0.73      9980
    2. NON-COMP       0.79      0.97      0.87    232862
    3. MED ONLY       0.73      0.15      0.25     55125
   4. TEMPORARY       0.73      0.73      0.73    118805
5. PPD SCH LOSS       0.77      0.69      0.73     38624
     6. PPD NSL       0.99      0.34      0.51      3369
         7. PTD       1.00      0.97      0.99        78
       8. DEATH       0.99      1.00      1.00       376

       accuracy                           0.77    459219
      macro avg       0.86      0.69      0.73    459219
   weighted avg       0.77      0.77      0.74    459219


Classification Report:
                 precision    recall  f1-score   support

   1. CANCELLED       0.73      0.50      0.59      2495
    2. NON-COMP       0.78      0.96      0.86     58216
    3. MED ONLY       0.46      0.09      0.14     13781
   4. TEMPORARY       0.67      0.69

In [18]:
y_test_pred = model.predict(test_df)
y_test_pred = y_test_pred.ravel()
y_test_final = np.array([class_mapping[i] for i in y_test_pred])
test_id = test_df.index

In [19]:
submission_df = pd.DataFrame({
    'Claim Identifier': test_id,
    'Claim Injury Type': y_test_final
})
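
The last two cells decode the integer predictions back into label strings and pair them with the claim identifiers. A toy version of the same steps, assuming made-up IDs and predictions and only a subset of the label mapping:

```python
import numpy as np
import pandas as pd

# Subset of the label mapping, for illustration
toy_mapping = {0: "1. CANCELLED", 1: "2. NON-COMP", 2: "3. MED ONLY"}

toy_ids = pd.Index([1001, 1002, 1003], name="Claim Identifier")  # made-up IDs
toy_pred = np.array([1, 0, 2])  # made-up encoded predictions

toy_submission = pd.DataFrame({
    "Claim Identifier": toy_ids,
    "Claim Injury Type": [toy_mapping[int(i)] for i in toy_pred],
})
print(toy_submission)
```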

In [20]:
if False:
    # Best Score V20
    version = version_control()
    submission_df.to_csv(f'./submissions/Group49_Version{version:02}.csv', index=False)