# <center>Machine Learning Project</center>

** **
## <center>*06 - Predictive Model*</center>

** **

The members of the `team` are:
- Ana Farinha - 20211514
- Francisco Capontes - 20211692
- Sofia Gomes - 20240848
- Rui Lourenço - 2021639


| Model              | Best Parameters  | Feature Selection | Average Validation Macro F1-Score | Validation Macro F1-Score |
|--------------------|--------------------------------------|---|---------------|---------------|
| CatBoostClassifier | `{"iterations": 1000, "learning_rate": 0.11, "depth": 6, "l2_leaf_reg": 5, "bagging_temperature": 0.4}`| `None` | `0.41` | `0.42` |
| XGBoostClassifier  | `{"n_estimators": 200, "learning_rate": 0.2, "max_depth": 7, "subsample": 0.9, "colsample_bytree": 0.9, "gamma": 0.3}`| `None` | `0.42`|`0.42`|
| Decision Trees     | `{"min_samples_split": 10, "min_samples_leaf": 4, "max_depth": 20, "criterion": "entropy"}`| `Essential Features` | `0.33`| `0.34`|
| Naive Bayes        | `Default Parameters`| `Essential Features` | `0.24`| `0.23`|
| StackEnsemble      | `CatBoost Config 1, XGBoost Config 1, Decision Trees, Default Parameters`     | `Essential Features` | `0.30`| `0.31`|
| VotingEnsemble     | `CatBoost Config 2, XGBoost Config 2, Decision Trees, Default Parameters` | `None` | `0.41`| `0.42`|
| XGBoostClassifier With kfold | `{"n_estimators": 200, "learning_rate": 0.2, "max_depth": 7, "subsample": 0.9, "colsample_bytree": 0.9, "gamma": 0.3}`| `None` | `0.44`|`0.45`|

After many iterations of preprocessing, modeling, and grid search, we found that XGBoostClassifier was slightly more consistent than the other models, which is why the rest of this notebook uses it.

In [35]:
# Import libraries
import pandas as pd
import numpy as np

#make the split here
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score

from catboost import CatBoostClassifier
from xgboost import XGBClassifier

from sklearn.model_selection import StratifiedKFold
import joblib  # used below to save the fitted scalers
import time

from utils import *
from utils_feature_selection import check_performace
from utils_dicts import *

import warnings
warnings.filterwarnings('ignore')

random_state=68+1

## <span style="color:salmon"> 1. Import Dataset </span> 

In [36]:
# Import datasets
train_df = pd.read_csv('preprocessed_data/train_data.csv', index_col="Claim Identifier")
test_df = pd.read_csv('./preprocessed_data/test_data.csv', index_col = 'Claim Identifier')

In [37]:
# Define Feature Selection: essential_features, reduced_features or []
feature_selection = []

Define `y` as the target, `Claim Injury Type Encoded`, and `X` as all the other columns.

In [38]:
X = train_df.drop(["Claim Injury Type Encoded"], axis = 1)
y = train_df["Claim Injury Type Encoded"]

## <span style="color:salmon"> 2. Model Training</span> 

Defining the configuration for the model and the class mapping for the target

In [39]:
config = {
    "n_estimators": 200,
    "learning_rate": 0.2,
    "max_depth": 7,
    "subsample": 0.9,
    "colsample_bytree": 0.9,
    "gamma": 0.3,
    "random_state": random_state
}


In [40]:
class_mapping = {
    0:'1. CANCELLED', 
    1:'2. NON-COMP',
    2:'3. MED ONLY', 
    3:'4. TEMPORARY',
    4:'5. PPD SCH LOSS', 
    5:'6. PPD NSL', 
    6:'7. PTD', 
    7:'8. DEATH'
}

Instead of training the final XGBoost classification model on a single train_test_split, we implemented a pipeline that uses Stratified K-fold to create 6 versions (one per fold) of the same model, each trained on a different segment of the training dataset.
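
As a minimal sketch of what stratification provides, the snippet below runs `StratifiedKFold` with 6 splits on a made-up imbalanced target (the toy arrays and the `_toy` names are assumptions for illustration, not the project data) and prints the class counts of each validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up imbalanced target: 60 samples of class 0, 30 of class 1, 12 of class 2.
y_toy = np.array([0] * 60 + [1] * 30 + [2] * 12)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

kf_toy = StratifiedKFold(n_splits=6, shuffle=True, random_state=0)

fold_class_counts = []
for fold, (train_idx, val_idx) in enumerate(kf_toy.split(X_toy, y_toy)):
    # Count how many samples of each class landed in this validation fold.
    counts = np.bincount(y_toy[val_idx], minlength=3).tolist()
    fold_class_counts.append(counts)
    print(f"Fold {fold + 1}: validation class counts = {counts}")
```

Every validation fold keeps the same class proportions as the full target, which matters when some classes are rare and the metric is macro F1.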

For each fold, the pipeline first splits the data into a training set and a validation set, then creates copies of the training set and `test_df` so that the test data can be preprocessed alongside that fold's training data.
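
The idea of learning preprocessing statistics from the training fold only can be sketched with `StandardScaler` on synthetic data. The arrays below are randomly generated assumptions. The project's own helpers (`remove_outliers`, `apply_frequency_encoding`, `NA_imputer`, `create_new_features`) live in `utils` and are not reproduced here.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for one fold's training and validation features.
rng = np.random.default_rng(0)
X_train_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
X_val_toy = rng.normal(loc=50.0, scale=10.0, size=(50, 2))

# Fit on the training fold only, then reuse the same statistics for validation.
scaler_toy = StandardScaler().fit(X_train_toy)
X_train_scaled = scaler_toy.transform(X_train_toy)
X_val_scaled = scaler_toy.transform(X_val_toy)

print("Train mean after scaling:", X_train_scaled.mean(axis=0))
print("Validation mean after scaling:", X_val_scaled.mean(axis=0))
```

The training columns end up with zero mean and unit variance, while the validation columns are only approximately centered, because their own statistics were never used.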

The scalers and models from the pipeline were saved in order to be used on the GrantApp.
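
A hedged sketch of how a saved scaler could be reloaded later, for example inside an application: it fits a scaler on made-up numbers, writes it to a temporary directory with `joblib.dump`, and reads it back with `joblib.load`. The file location and the toy data are assumptions. How the GrantApp actually loads its artifacts is not shown in this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training data; the column means are 2.0 and 20.0.
X_fit = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaler_saved = StandardScaler().fit(X_fit)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical location; the notebook itself writes one scaler file per fold.
    path = os.path.join(tmp_dir, "Scaler_0.pkl")
    joblib.dump(scaler_saved, path)
    scaler_loaded = joblib.load(path)

# The reloaded scaler applies the statistics learned at fit time.
X_new = np.array([[2.0, 20.0]])
print(scaler_loaded.transform(X_new))
```

A model written with `save_model` can be restored in a similar way by calling `load_model` on a fresh `XGBClassifier`; that step is left out here to keep the sketch free of extra dependencies.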

In [41]:
kf = StratifiedKFold(n_splits=6, shuffle=True, random_state=random_state)

In [None]:
test_preds = np.zeros((len(test_df), len(class_mapping)))
avg_train = []
avg_val = []

for fold, (train_index, val_index) in enumerate(kf.split(X, y)):
    print(f"Processing Fold {fold + 1}...")
    
    # Split data
    X_train, X_val = X.iloc[train_index], X.iloc[val_index]
    y_train, y_val = y.iloc[train_index], y.iloc[val_index]
    test_temp = test_df.copy()
    train_temp = X_train.copy()
    
    # Preprocess X_train and X_Val
    remove_outliers(X_train)
    X_train, X_val = apply_frequency_encoding(X_train, X_val, True, fold=fold)
    NA_imputer(X_train,X_val, True, fold=fold)
    create_new_features(X_train,X_val)
    
    # Preprocess Test_df
    remove_outliers(train_temp)
    train_temp, test_temp = apply_frequency_encoding(train_temp, test_temp)
    NA_imputer(train_temp, test_temp)
    create_new_features(train_temp, test_temp)

    scaler = StandardScaler().fit(X_train[numerical_features])
    # Save Scaler
    joblib.dump(scaler, f'./OthersPipeline/Scaler_{fold}.pkl')
    X_train[numerical_features]  = scaler.transform(X_train[numerical_features])
    X_val[numerical_features]  = scaler.transform(X_val[numerical_features])  
    test_temp[numerical_features]  = scaler.transform(test_temp[numerical_features])  

    # Keep only the selected features; an empty selection keeps every column
    drop_list = []
    if feature_selection:
        drop_list = [col for col in X_train.columns if col not in feature_selection]

    X_train = X_train.drop(drop_list, axis=1)
    X_val = X_val.drop(drop_list, axis=1)
    test_temp = test_temp.drop(drop_list, axis=1)
        
    # Train model
    model = XGBClassifier(
            n_estimators=config["n_estimators"],        
            learning_rate=config["learning_rate"],      
            max_depth=config["max_depth"],                          
            subsample=config["subsample"],              
            colsample_bytree=config["colsample_bytree"],
            gamma=config["gamma"],                     
            objective="multi:softmax",                  
            num_class=8,                                
            eval_metric="merror",   
            random_state=config["random_state"],                                      
            verbosity=0                                 
        )
    model.fit(X_train,y_train)
    # Save Model
    model.save_model(f"./OpenEnded/Model_{fold}.json")
    
    pred_train = model.predict(X_train)
    pred_val = model.predict(X_val)
    
    f1_train = f1_score(y_train, pred_train, average='macro')
    f1_val = f1_score(y_val, pred_val, average='macro')

    avg_train.append(f1_train)
    avg_val.append(f1_val)

    print(f"Fold {fold + 1} train F1 score: {f1_train:.4f}")
    print(f"Fold {fold + 1} validation F1 score: {f1_val:.4f}")
    print(f"------------------------------")
    
    test_preds += model.predict_proba(test_temp)

print(f"Average Train F1 score: {sum(avg_train)/len(avg_train)}")
print(f"Average Validation F1 score: {sum(avg_val)/len(avg_val)}")

Processing Fold 1...

Fold 1 train F1 score: 0.7488
Fold 1 validation F1 score: 0.4336
------------------------------
Processing Fold 2...

Fold 2 train F1 score: 0.7505
Fold 2 validation F1 score: 0.4446
------------------------------
Processing Fold 3...

Fold 3 train F1 score: 0.7552
Fold 3 validation F1 score: 0.4567
------------------------------
Processing Fold 4...



KeyboardInterrupt: 

## <span style="color:salmon"> 3. Test Predictions </span> 

To calculate the **final prediction**, we first averaged the test probabilities predicted by each fold's model over the total number of folds and then used **np.argmax** to select the class with the highest averaged probability.
The final step was to create a dataframe with the index of test_df as **Claim Identifier** and with **Claim Injury Type** obtained by mapping the **final prediction** back to its label through **class_mapping**.
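
The averaging and `np.argmax` step can be illustrated on a tiny made-up example with three folds, two test rows, and three classes. The probabilities and the `class_mapping_toy` labels are assumptions for illustration only.

```python
import numpy as np

# Hypothetical label mapping, analogous in shape to class_mapping.
class_mapping_toy = {0: "A", 1: "B", 2: "C"}

# Made-up predicted probabilities from three fold models (rows = test samples).
fold_probs = [
    np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]),
    np.array([[0.4, 0.5, 0.1], [0.1, 0.3, 0.6]]),
    np.array([[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]]),
]

# Accumulate the probabilities fold by fold, as the training loop does.
summed = np.zeros((2, 3))
for probs in fold_probs:
    summed += probs

# Average over the folds, pick the most probable class, and map it to a label.
avg_probs = summed / len(fold_probs)
pred_idx = np.argmax(avg_probs, axis=1)
pred_labels = [class_mapping_toy[int(i)] for i in pred_idx]
print(pred_labels)  # ['A', 'C']
```

In this toy case the second fold alone would assign class `B` to the first row, but the averaged probabilities favor `A`, which shows how averaging smooths out disagreement between individual fold models.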

In [9]:
test_id = test_df.index

In [10]:
final_test_preds = np.argmax(test_preds / kf.get_n_splits(), axis=1)
submission_df = pd.DataFrame({
    'Claim Identifier': test_id,
    'Claim Injury Type': final_test_preds
})
submission_df["Claim Injury Type"] = submission_df["Claim Injury Type"].replace(class_mapping)

In [11]:
if False:
    version = version_control()
    submission_df.to_csv(f'./submissions/Group49_Version{version:02}.csv', index=False)