# <center>Machine Learning Project</center>

** **
## <center>*06 - Predictive Model*</center>

** **

The team members are:
- Ana Farinha - 20211514
- Francisco Capontes - 20211692
- Sofia Gomes - 20240848
- Rui Lourenço - 2021639


| Model              | Best Parameters  | Feature Selection | Average Validation Macro F1 Score | Validation Macro F1-Score |
|--------------------|--------------------------------------|---|---------------|---------------|
| CatBoostClassifier | `{"iterations": 1000, "learning_rate": 0.11, "depth": 6, "l2_leaf_reg": 5, "bagging_temperature": 0.4}`| `None` | `0.41` | `0.42` |
| XGBoostClassifier  | `{"n_estimators": 200, "learning_rate": 0.2, "max_depth": 7, "subsample": 0.9, "colsample_bytree": 0.9, "gamma": 0.3}`| `None` | `0.42`|`0.42`|
| Decision Trees     | `{"min_samples_split": 10, "min_samples_leaf": 4, "max_depth": 20, "criterion": "entropy"}`| `Essential Features` | `0.33`| `0.34`|
| Naive Bayes        | `Default Parameters`| `Essential Features` | `0.24`| `0.23`|
| StackEnsemble      | `CatBoost Config 1, XGBoost Config 1, Decision Trees, Default Parameters`     | `Essential Features` | `0.30`| `0.31`|
| VotingEnsemble     | `CatBoost Config 2, XGBoost Config 2, Decision Trees, Default Parameters` | `None` | `0.41`| `0.42`|
| XGBoostClassifier with k-fold | `{"n_estimators": 200, "learning_rate": 0.2, "max_depth": 7, "subsample": 0.9, "colsample_bytree": 0.9, "gamma": 0.3}`| `None` | `0.44`|`0.45`|
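
The scores in the table are macro-averaged F1: the F1 of each class is computed separately and the per-class values are averaged with equal weight, so a rare class counts as much as a frequent one. The sketch below illustrates this on made-up labels (not project data). `macro_f1` is a hypothetical helper written only for this illustration; it is meant to mirror what `f1_score(..., average="macro")` computes in the notebook.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels: class 0 is frequent, class 2 is rare and never predicted.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy = {accuracy:.2f}, macro F1 = {score:.2f}")
# prints: accuracy = 0.80, macro F1 = 0.55
```

In this toy case the missed rare class pulls macro F1 well below accuracy, which is why the metric suits an imbalanced multi-class target.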

In [1]:
# Note: state that we will use CatBoost or XGBoost

In [2]:
# Import libraries
import pandas as pd
import numpy as np

# Splitting, preprocessing and evaluation utilities
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.metrics import accuracy_score, f1_score, precision_score

from catboost import CatBoostClassifier
from xgboost import XGBClassifier

from sklearn.model_selection import StratifiedKFold
import time

from utils import *
from utils_feature_selection import check_performace
from utils_dicts import *

import warnings
warnings.filterwarnings('ignore')

# Seed shared by the cross-validation splitter and the model
random_state = 68 + 1

In [3]:
# Import datasets
train_df = pd.read_csv('./preprocessed_data/train_data.csv', index_col='Claim Identifier')
test_df = pd.read_csv('./preprocessed_data/test_data.csv', index_col='Claim Identifier')

In [4]:
# Define Feature Selection: essential_features, reduced_features or []
feature_selection = []

Define `y` as the target, `Claim Injury Type Encoded`, and `X` as all the remaining columns.

In [5]:
X = train_df.drop(["Claim Injury Type Encoded"], axis = 1)
y = train_df["Claim Injury Type Encoded"]

In [6]:
config = {
    "n_estimators": 200,
    "learning_rate": 0.2,
    "max_depth": 7,
    "subsample": 0.9,
    "colsample_bytree": 0.9,
    "gamma": 0.3,
    "random_state": random_state,
}

In [7]:
model = XGBClassifier(
    # Tuned hyperparameters
    n_estimators=config["n_estimators"],
    learning_rate=config["learning_rate"],
    max_depth=config["max_depth"],
    subsample=config["subsample"],
    colsample_bytree=config["colsample_bytree"],
    gamma=config["gamma"],
    # Fixed settings
    objective="multi:softmax",   # multi-class classification
    num_class=8,                 # one per entry in class_mapping
    eval_metric="merror",        # multi-class error rate
    random_state=config["random_state"],
    verbosity=0,
)

In [8]:
class_mapping = {
    0:'1. CANCELLED', 
    1:'2. NON-COMP',
    2:'3. MED ONLY', 
    3:'4. TEMPORARY',
    4:'5. PPD SCH LOSS', 
    5:'6. PPD NSL', 
    6:'7. PTD', 
    7:'8. DEATH'
}

In [9]:
kf = StratifiedKFold(n_splits=6, shuffle=True, random_state=random_state)
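
`StratifiedKFold` keeps the class proportions of the full data set roughly equal in every fold, which matters when some classes are rare. A minimal sketch on synthetic labels (the class sizes 60/30/12 and the seed are made up for illustration and are not the project data):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced labels (illustration only): 60 / 30 / 12 samples per class.
y_toy = np.repeat([0, 1, 2], [60, 30, 12])
X_toy = np.zeros((len(y_toy), 1))  # feature values do not affect the split

skf = StratifiedKFold(n_splits=6, shuffle=True, random_state=0)
fold_counts = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X_toy, y_toy)):
    # Count how many samples of each class landed in this validation fold
    counts = np.bincount(y_toy[val_idx], minlength=3)
    fold_counts.append(counts.tolist())
    print(f"Fold {fold + 1}: validation class counts = {counts.tolist()}")
```

Because each class size here is divisible by six, every validation fold receives the same 10/5/2 class counts; with real data the counts differ by at most one sample per class.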

In [10]:
# Accumulator for the test-set class probabilities, summed over folds
test_preds = np.zeros((len(test_df), len(class_mapping)))

for fold, (train_index, val_index) in enumerate(kf.split(X, y)):
    print(f"Processing Fold {fold + 1}...")
    
    # Split data
    X_train, X_val = X.iloc[train_index], X.iloc[val_index]
    y_train, y_val = y.iloc[train_index], y.iloc[val_index]
    # Working copies used to preprocess the test set alongside this fold's training data
    test_temp = test_df.copy()
    train_temp = X_train.copy()
    
    # Preprocess the train/validation pair
    remove_outliers(X_train)
    X_train, X_val = apply_frequency_encoding(X_train, X_val)
    NA_imputer(X_train, X_val)
    create_new_features(X_train, X_val)

    # Run the same pipeline on the train/test pair, so the test set is
    # processed together with this fold's training data
    remove_outliers(train_temp)
    train_temp, test_temp = apply_frequency_encoding(train_temp, test_temp)
    NA_imputer(train_temp, test_temp)
    create_new_features(train_temp, test_temp)

    # Fit the scaler on the training fold only, then apply it to every split
    scaler = StandardScaler().fit(X_train[numerical_features])
    X_train[numerical_features] = scaler.transform(X_train[numerical_features])
    X_val[numerical_features] = scaler.transform(X_val[numerical_features])
    test_temp[numerical_features] = scaler.transform(test_temp[numerical_features])

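    # Columns to exclude: "Average Weekly Wage" plus anything outside feature_selection.
    # NOTE: drop_list is built here but is not applied to X_train, X_val or test_temp below.
    drop_list = ["Average Weekly Wage"]
    if feature_selection != []:
        for col in X_train.columns:
            if col not in feature_selection:
                drop_list.append(col)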
        
    # Train model
    model = XGBClassifier(
                    n_estimators=config["n_estimators"],        
                    learning_rate=config["learning_rate"],      
                    max_depth=config["max_depth"],                          
                    subsample=config["subsample"],              
                    colsample_bytree=config["colsample_bytree"],
                    gamma=config["gamma"],                     
                    # --------------
                    objective="multi:softmax",                  
                    num_class=8,                                
                    eval_metric="merror",   
                    random_state = config["random_state"],                                      
                    verbosity=0                                 
                )
    model.fit(X_train,y_train)
    
    pred_train = model.predict(X_train)
    pred_val = model.predict(X_val)
    print(f"F1-score for Train:{f1_score(y_train, pred_train, average='macro')}")
    print(f"F1-score for Validation:{f1_score(y_val, pred_val, average='macro')}")
    test_preds += model.predict_proba(test_temp)

Processing Fold 1...
F1-score for Train:0.7501261784970161
F1-score for Validation:0.44153261334859856
Processing Fold 2...
F1-score for Train:0.7546852082250131
F1-score for Validation:0.44955224571018143
Processing Fold 3...
F1-score for Train:0.7536295748033681
F1-score for Validation:0.45918596276937595
Processing Fold 4...
F1-score for Train:0.7534874940592244
F1-score for Validation:0.45592605378838613
Processing Fold 5...
F1-score for Train:0.7514196877001905
F1-score for Validation:0.4486102210985014
Processing Fold 6...
F1-score for Train:0.7507163207378577
F1-score for Validation:0.44319580387169905
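
The per-fold scores printed above can be summarised by their mean and spread. The sketch below copies the twelve values from that output; the variable names are chosen here for illustration.

```python
from statistics import mean, stdev

# Macro F1 per fold, copied from the output above.
train_f1 = [
    0.7501261784970161,
    0.7546852082250131,
    0.7536295748033681,
    0.7534874940592244,
    0.7514196877001905,
    0.7507163207378577,
]
val_f1 = [
    0.44153261334859856,
    0.44955224571018143,
    0.45918596276937595,
    0.45592605378838613,
    0.4486102210985014,
    0.44319580387169905,
]

mean_train = mean(train_f1)
mean_val = mean(val_f1)
print(f"mean train F1      = {mean_train:.4f}")
print(f"mean validation F1 = {mean_val:.4f} (std {stdev(val_f1):.4f})")
print(f"train - validation = {mean_train - mean_val:.4f}")
```

The validation mean summarises the cross-validated performance, while the difference between the two means gives a rough indication of how much the model overfits the training folds.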


In [11]:
test_id = test_df.index

In [12]:
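# Average the summed probabilities over the folds and pick the most likely class
final_test_preds = np.argmax(test_preds / kf.get_n_splits(), axis=1)
submission_df = pd.DataFrame({
    'Claim Identifier': test_id,
    'Claim Injury Type': final_test_preds
})
# Convert encoded class indices back to the original labels
submission_df["Claim Injury Type"] = submission_df["Claim Injury Type"].replace(class_mapping)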
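
The final prediction is a form of soft voting: each fold's model contributes class probabilities, the probabilities are averaged, and the class with the highest mean probability wins. A minimal sketch with made-up probabilities (three folds, two test rows, three classes; `toy_mapping` and all numbers are hypothetical):

```python
import numpy as np

# Hypothetical class probabilities from 3 fold models for 2 test rows and 3 classes.
fold_probas = [
    np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]),
    np.array([[0.4, 0.5, 0.1], [0.1, 0.3, 0.6]]),
    np.array([[0.5, 0.4, 0.1], [0.2, 0.3, 0.5]]),
]

summed = np.zeros_like(fold_probas[0])
for proba in fold_probas:
    summed += proba  # same accumulation pattern as `test_preds += ...` in the loop

mean_proba = summed / len(fold_probas)
labels = np.argmax(mean_proba, axis=1)

# Map encoded indices back to readable names (toy names, not the real classes).
toy_mapping = {0: "A", 1: "B", 2: "C"}
names = [toy_mapping[i] for i in labels.tolist()]
print(labels.tolist(), names)
# prints: [0, 2] ['A', 'C']
```

Individual folds disagree here (the second fold alone would pick class 1 for the first row), but the averaged probabilities settle on a single class per row. Dividing by the number of folds does not change the `argmax`; it only rescales the sums back to probabilities.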

In [13]:
# Guarded so that re-running the notebook does not write a new submission file
if False:
    # Best Score V27
    version = version_control()
    submission_df.to_csv(f'./submissions/Group49_Version{version:02}.csv', index=False)