<h1 style="text-align: center; color: #4A90E2; font-family: 'Arial', sans-serif; font-size: 36px; text-shadow: 2px 2px #D1D1D1;">
    Model Optimization (MO) for Workers' Compensation Claims
</h1>
<hr style="border: 2px solid #4A90E2;">

<h2 style="text-align: center; color: #4A90E2; font-family: 'Arial', sans-serif; font-size: 36px; text-shadow: 2px 2px #D1D1D1;">Required Imports</h2>

<hr style="border: 2px solid #4A90E2;">

<h3 style="color: #4A90E2; font-family: 'Arial', sans-serif; font-size: 24px; text-shadow: 2px 2px #D1D1D1;">Package Descriptions</h3>
<ul style="font-family: 'Arial', sans-serif;">
    <li><strong>pandas</strong>: For data manipulation and analysis, enabling easy reading and handling of dataframes.</li>
    <li><strong>numpy</strong>: For efficient numerical operations and array manipulation.</li>
    <li><strong>matplotlib.pyplot</strong>: To create data visualizations and plots.</li>
    <li><strong>seaborn</strong>: For generating attractive and informative statistical visualizations.</li>
    <li><strong>missingno</strong>: For visualizing and analyzing missing data, helping to better understand data quality.</li>
</ul>


In [12]:
import pandas as pd # type: ignore
import numpy as np # type: ignore
import matplotlib.pyplot as plt # type: ignore
import seaborn as sns # type: ignore
import missingno as msng # type: ignore
import sys # type: ignore
import os # type: ignore

from scipy import stats # type: ignore
from sklearn.preprocessing import StandardScaler

from imblearn.over_sampling import SMOTE # type: ignore
from keras.utils import to_categorical


sys.path.append(os.path.abspath("../utils"))
from meta_model_train import meta_model_rf, meta_model_xgbc, meta_model_et, meta_model_cb
from neural_network import neural_network
from plots import plot_training_history, plot_confusion_matrix
from predicitons_csv import save_predictions_to_csv
from save_models import save_model

import warnings
warnings.filterwarnings("ignore")

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


<h2 style="text-align: center; color: #4A90E2; font-family: 'Arial', sans-serif; font-size: 36px; text-shadow: 1px 1px #D1D1D1;">
    Data Loading
</h2>
<hr style="border: 1px solid #4A90E2;">

<p style="font-size: 18px; line-height: 1.6; font-family: 'Arial', sans-serif;">
    This section handles loading the dataset into the environment for further processing. Using <strong>pandas</strong>, we load the data into a structured dataframe, allowing for easy manipulation, exploration, and analysis throughout the project.
</p>


In [13]:
path = "../data/"

X_train = pd.read_csv(path + "X_train_post_FS.csv")
X_val = pd.read_csv(path + "X_val_post_FS.csv")

y_train = pd.read_csv(path + "y_train_post_FS.csv")
y_val = pd.read_csv(path + 'y_val_post_FS.csv')

data_test = pd.read_csv(path + "data_test_post_FS.csv")

data = pd.read_csv(path + "Claim_Injury_Type_mapping.csv")

In [14]:
mapping_dict = dict(zip(data['Encoded Value'], data['Claim Injury Type']))

<h2 style="text-align: center; color: #4A90E2; font-family: 'Arial', sans-serif; font-size: 36px; text-shadow: 1px 1px #D1D1D1;">
    Data Scaling
</h2>
<hr style="border: 1px solid #4A90E2;">

<p style="font-size: 18px; line-height: 1.6; font-family: 'Arial', sans-serif;">
    In this section, data scaling is applied to normalize the features, so that all variables contribute equally to the model regardless of their original scales. This step is crucial for algorithms sensitive to feature magnitudes, such as regression and distance-based models.
</p>



<p style="font-size: 18px; line-height: 1.6; font-family: 'Arial', sans-serif;">
    Data scaling is an essential preprocessing step to standardize the range of independent variables. By scaling features, we ensure that all variables contribute equally to model training, avoiding bias toward features with larger scales. Scaling is particularly important for algorithms sensitive to feature magnitudes, such as gradient-based methods and distance-based models.
</p>
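
<p style="font-size: 18px; line-height: 1.6; font-family: 'Arial', sans-serif;">
    As a minimal sketch of what standardization does, the example below reproduces the fit-on-train, transform-everywhere pattern with plain NumPy on toy data. The arrays <code>X_train_toy</code> and <code>X_val_toy</code> and their values are hypothetical and are not part of the project data.
</p>

```python
import numpy as np

# Toy data: two features on very different scales (hypothetical values)
X_train_toy = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]])
X_val_toy = np.array([[2.0, 4000.0]])

# "Fit": learn the per-feature mean and standard deviation from the training split only
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # population std (ddof=0), the same convention StandardScaler uses

# "Transform": apply the training statistics to every split
X_train_toy_scaled = (X_train_toy - mean) / std
X_val_toy_scaled = (X_val_toy - mean) / std

print(X_train_toy_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_train_toy_scaled.std(axis=0))   # approximately 1 for each feature
print(X_val_toy_scaled)                 # validation values may fall outside the training range
```

<p style="font-size: 18px; line-height: 1.6; font-family: 'Arial', sans-serif;">
    Reusing the training mean and standard deviation for the validation and test sets keeps information from those sets out of the preprocessing step.
</p>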


In [15]:
# Initialize the scaler
scaler = StandardScaler()

In [16]:
# Fit the scaler on the training data (this step calculates the mean and std dev of X_train only)
scaler.fit(X_train)

In [17]:
# The scaler was already fit on X_train above, so only the transform is needed here
X_train_scaled = scaler.transform(X_train)
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)

In [18]:
# Transform the validation data using the same scaler (without fitting)
X_val_scaled = scaler.transform(X_val)
X_val_scaled = pd.DataFrame(X_val_scaled, columns=X_val.columns)

In [19]:
# Scale the test data with the same fitted scaler (without refitting)
X_test_scaled = scaler.transform(data_test)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=data_test.columns)

<h2 style="text-align: center; color: #4A90E2; font-family: 'Arial', sans-serif; font-size: 36px; text-shadow: 1px 1px #D1D1D1;">
    Model Selection
</h2>
<hr style="border: 1px solid #4A90E2;">

<p style="font-size: 18px; line-height: 1.6; font-family: 'Arial', sans-serif;">
    This section trains and compares candidate models for predicting the Claim Injury Type of workers' compensation claims. Four base classifiers (CatBoost, Random Forest, XGBoost, and Extra Tree) are trained with K-fold cross-validation on a SMOTE-balanced training set, and a neural network is then trained on each model's out-of-fold predictions. Two ensembles, weighted averaging and stacking, combine the XGBoost, Random Forest, and Extra Tree predictions.
</p>


In [20]:
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)
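
<p style="font-size: 18px; line-height: 1.6; font-family: 'Arial', sans-serif;">
    SMOTE balances the classes by creating synthetic minority samples that lie between an existing minority sample and one of its nearest minority-class neighbours. The sketch below illustrates only that interpolation step with NumPy. It is not the <code>imblearn</code> implementation, and <code>X_minority</code> and <code>synthetic_sample</code> are hypothetical names used for illustration.
</p>

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class samples with two features
X_minority = np.array([[1.0, 1.0], [1.5, 1.2], [0.8, 1.4], [1.2, 0.9]])

def synthetic_sample(X, i, k, rng):
    """Interpolate between sample i and one of its k nearest neighbours in X."""
    distances = np.linalg.norm(X - X[i], axis=1)
    neighbours = np.argsort(distances)[1:k + 1]  # position 0 is the sample itself
    j = rng.choice(neighbours)
    lam = rng.random()  # position along the segment, in [0, 1)
    return X[i] + lam * (X[j] - X[i])

# One synthetic point per original minority sample
new_points = np.array(
    [synthetic_sample(X_minority, i, k=2, rng=rng) for i in range(len(X_minority))]
)
print(new_points.shape)  # (4, 2)
```

<p style="font-size: 18px; line-height: 1.6; font-family: 'Arial', sans-serif;">
    Because each synthetic point is a convex combination of two real samples, it stays inside the region already covered by the minority class. In this notebook the resampling is applied to the training set only, so the validation and test sets keep their original class distribution.
</p>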

In [21]:
def meta_model_run(model, X_train_resampled=None, y_train_resampled=None, data_test_FS=None, n_splits=5):
    """Train the chosen base model with K-fold CV and return its out-of-fold and test predictions."""
    if X_train_resampled is None or y_train_resampled is None or data_test_FS is None:
        raise ValueError("Training and test data must be provided.")

    # Map each model name to its training helper from ../utils/meta_model_train
    trainers = {
        "CatBoost": meta_model_cb,
        "RandomForest": meta_model_rf,
        "XGBoost": meta_model_xgbc,
        "ExtraTree": meta_model_et,
    }
    if model not in trainers:
        raise ValueError(f"Invalid model. Choose one of: {', '.join(trainers)}.")

    # Each helper returns (models, f1_scores, oof_predictions, test_predictions)
    _, _, oof_predictions, test_predictions = trainers[model](
        X_train_resampled, y_train_resampled, data_test_FS, n_splits_n=n_splits
    )

    return oof_predictions, test_predictions
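
<p style="font-size: 18px; line-height: 1.6; font-family: 'Arial', sans-serif;">
    The <code>meta_model_*</code> helpers are defined outside this notebook, so their internals are not shown here. The names <code>oof_predictions</code>, <code>test_predictions</code>, and <code>n_splits</code> suggest the usual K-fold out-of-fold scheme, which the sketch below illustrates under that assumption. The function <code>kfold_oof</code>, the stand-in classifier <code>centroid_proba</code>, and the toy data are all hypothetical, and the real helpers may differ (for example by stratifying the folds).
</p>

```python
import numpy as np

def centroid_proba(X_fit, y_fit, X_new, n_classes):
    """Hypothetical stand-in classifier: softmax over negative distances to class centroids."""
    centroids = np.array([X_fit[y_fit == c].mean(axis=0) for c in range(n_classes)])
    dists = np.linalg.norm(X_new[:, None, :] - centroids[None, :, :], axis=2)
    scores = np.exp(-dists)
    return scores / scores.sum(axis=1, keepdims=True)

def kfold_oof(X, y, X_test, n_classes, n_splits=5, seed=42):
    """Return out-of-fold probabilities for X and fold-averaged probabilities for X_test."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_splits)
    oof = np.zeros((len(X), n_classes))
    test = np.zeros((len(X_test), n_classes))
    for val_idx in folds:
        train_mask = np.ones(len(X), dtype=bool)
        train_mask[val_idx] = False
        # Each row's out-of-fold prediction comes from a model that never saw that row
        oof[val_idx] = centroid_proba(X[train_mask], y[train_mask], X[val_idx], n_classes)
        # Test predictions are averaged over the models trained on each fold
        test += centroid_proba(X[train_mask], y[train_mask], X_test, n_classes) / n_splits
    return oof, test

# Toy data: two well-separated classes (hypothetical)
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
y_toy = np.array([0] * 20 + [1] * 20)
X_test_toy = np.array([[0.0, 0.0], [5.0, 5.0]])

oof_toy, test_toy = kfold_oof(X_toy, y_toy, X_test_toy, n_classes=2)
print(oof_toy.shape, test_toy.shape)  # (40, 2) (2, 2)
```

<p style="font-size: 18px; line-height: 1.6; font-family: 'Arial', sans-serif;">
    Out-of-fold predictions matter here because the neural network is trained on them: feeding it predictions made on rows the base model had already seen would make the base model look more accurate than it is.
</p>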


<h3 style="color: #4A90E2; font-family: 'Arial', sans-serif; font-size: 32px; text-shadow: 2px 2px #D1D1D1;">Claim Injury Type Prediction (Without Agreement Reached and WCB Decision)</h3>

<h3 style="color: #4A90E2; font-family: 'Arial', sans-serif; font-size: 28px; text-shadow: 2px 2px #D1D1D1;">CatBoost Classifier</h3>

In [None]:
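# Note: the base models are trained on scaled features, while data_test is the unscaled test set.
# Unless the meta_model_* helpers scale their inputs internally, X_test_scaled may be the intended argument.
oof_predictions, test_predictions = meta_model_run(
    model="CatBoost",
    X_train_resampled=X_train_resampled,
    y_train_resampled=y_train_resampled,
    data_test_FS=data_test
)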


Training Fold 1...


In [204]:
nn_model, X_nn_train, X_nn_val, y_nn_val, y_nn_train, early_stopping, reduce_lr = neural_network(oof_predictions, y_train_resampled)

history = nn_model.fit(
    X_nn_train, y_nn_train,
    validation_data=(X_nn_val, y_nn_val),
    epochs=10,
    batch_size=64,
    callbacks=[early_stopping, reduce_lr],
    verbose=1
)

save_model(nn_model, "CB_NN_model")

In [205]:
plot_training_history(history)

In [206]:
plot_confusion_matrix(model=nn_model, X_val=X_nn_val, y_val=y_nn_val,
    class_mapping=mapping_dict, title="Confusion Matrix - Validation Set"
)

In [207]:
save_predictions_to_csv(
    model=nn_model,
    test_data=test_predictions,
    claim_ids=data_test["Claim Identifier"],
    class_mapping=mapping_dict,
    output_path="../predictions/group_40_KFold_CatBoost_NN_predictions.csv"
)


In [208]:
# Check the distribution of predicted classes in the saved file
predictions_data = pd.read_csv('../predictions/group_40_KFold_CatBoost_NN_predictions.csv')
values = predictions_data['Claim Injury Type'].value_counts()
values

<h3 style="color: #4A90E2; font-family: 'Arial', sans-serif; font-size: 28px; text-shadow: 2px 2px #D1D1D1;">Random Forest Classifier</h3>

In [None]:
oof_predictions_rf, test_predictions_rf = meta_model_run(
    model="RandomForest",
    X_train_resampled=X_train_resampled,
    y_train_resampled=y_train_resampled,
    data_test_FS=data_test,
    n_splits=5
)

In [None]:
nn_model_rf, X_nn_train, X_nn_val, y_nn_val, y_nn_train, early_stopping, reduce_lr = neural_network(oof_predictions_rf, y_train_resampled)

history = nn_model_rf.fit(
    X_nn_train, y_nn_train,
    validation_data=(X_nn_val, y_nn_val),
    epochs=10,
    batch_size=64,
    callbacks=[early_stopping, reduce_lr],
    verbose=1
)

save_model(nn_model_rf, "RF_NN_model")

In [None]:
plot_training_history(history)

In [None]:
plot_confusion_matrix(model=nn_model_rf, X_val=X_nn_val, y_val=y_nn_val,
    class_mapping=mapping_dict, title="Confusion Matrix - Validation Set"
)

In [None]:
save_predictions_to_csv(
    model=nn_model_rf,
    test_data=test_predictions_rf,
    claim_ids=data_test["Claim Identifier"],
    class_mapping=mapping_dict,
    output_path="../predictions/group_40_KFold_RF_NN_predictions.csv"
)


In [None]:
# Check the distribution of predicted classes in the saved file
predictions_data = pd.read_csv('../predictions/group_40_KFold_RF_NN_predictions.csv')
values = predictions_data['Claim Injury Type'].value_counts()
values

<h3 style="color: #4A90E2; font-family: 'Arial', sans-serif; font-size: 28px; text-shadow: 2px 2px #D1D1D1;">XGBoost Classifier</h3>

In [None]:
oof_predictions_xgbc, test_predictions_xgbc = meta_model_run(
    model="XGBoost",
    X_train_resampled=X_train_resampled,
    y_train_resampled=y_train_resampled,
    data_test_FS=data_test,
    n_splits=5
)

In [None]:
nn_model_xgbc, X_nn_train, X_nn_val, y_nn_val, y_nn_train, early_stopping, reduce_lr = neural_network(oof_predictions_xgbc, y_train_resampled)

history = nn_model_xgbc.fit(
    X_nn_train, y_nn_train,
    validation_data=(X_nn_val, y_nn_val),
    epochs=10,
    batch_size=64,
    callbacks=[early_stopping, reduce_lr],
    verbose=1
)

save_model(nn_model_xgbc, "XGBoost_NN_model")

In [None]:
plot_training_history(history)

In [None]:
plot_confusion_matrix(model=nn_model_xgbc, X_val=X_nn_val, y_val=y_nn_val,
    class_mapping=mapping_dict, title="Confusion Matrix - Validation Set"
)

In [None]:
save_predictions_to_csv(
    model=nn_model_xgbc,
    test_data=test_predictions_xgbc,
    claim_ids=data_test["Claim Identifier"],
    class_mapping=mapping_dict,
    output_path="../predictions/group_40_KFold_XGBC_NN_predictions.csv"
)


In [None]:
# Check the distribution of predicted classes in the saved file
predictions_data = pd.read_csv('../predictions/group_40_KFold_XGBC_NN_predictions.csv')
values = predictions_data['Claim Injury Type'].value_counts()
values

<h3 style="color: #4A90E2; font-family: 'Arial', sans-serif; font-size: 28px; text-shadow: 2px 2px #D1D1D1;">Extra Tree Classifier</h3>

In [None]:
oof_predictions_et, test_predictions_et = meta_model_run(
    model="ExtraTree",
    X_train_resampled=X_train_resampled,
    y_train_resampled=y_train_resampled,
    data_test_FS=data_test,
    n_splits=5
)

In [None]:
nn_model_et, X_nn_train, X_nn_val, y_nn_val, y_nn_train, early_stopping, reduce_lr = neural_network(oof_predictions_et, y_train_resampled)

history = nn_model_et.fit(
    X_nn_train, y_nn_train,
    validation_data=(X_nn_val, y_nn_val),
    epochs=10,
    batch_size=64,
    callbacks=[early_stopping, reduce_lr],
    verbose=1
)

save_model(nn_model_et, "ExtraTree_NN_model")

In [None]:
plot_training_history(history)

In [None]:
plot_confusion_matrix(model=nn_model_et, X_val=X_nn_val, y_val=y_nn_val,
    class_mapping=mapping_dict, title="Confusion Matrix - Validation Set"
)

In [None]:
save_predictions_to_csv(
    model=nn_model_et,
    test_data=test_predictions_et,
    claim_ids=data_test["Claim Identifier"],
    class_mapping=mapping_dict,
    output_path="../predictions/group_40_KFold_ExtraTree_NN_predictions.csv"
)


In [None]:
# Check the distribution of predicted classes in the saved file
predictions_data = pd.read_csv('../predictions/group_40_KFold_ExtraTree_NN_predictions.csv')
values = predictions_data['Claim Injury Type'].value_counts()
values

<h3 style="color: #4A90E2; font-family: 'Arial', sans-serif; font-size: 28px; text-shadow: 2px 2px #D1D1D1;">Ensemble Learning</h3>

<h5 style="color: #4A90E2; font-family: 'Arial', sans-serif; font-size: 24px; text-shadow: 2px 2px #D1D1D1;">Weighted Averaging Ensemble</h5>


In [179]:
# Compute the ensemble weights from each model's performance
# F1 scores without the neural network
f1_xgbc = 0.88726
f1_rf = 0.91618
f1_extratree = 0.9288

# F1 scores with the neural network
f1_xgbc_nn = 0.89
f1_rf_nn = 0.93
f1_extratree_nn = 0.94

total_f1 = f1_xgbc + f1_rf + f1_extratree
total_f1_nn = f1_xgbc_nn + f1_rf_nn + f1_extratree_nn

# Each weight averages a model's share of the total F1 with and without the neural network.
# Order: [XGBoost, Random Forest, Extra Tree]; the three weights sum to 1.
weights = [
    ((f1_xgbc / total_f1) + (f1_xgbc_nn / total_f1_nn)) / 2,
    ((f1_rf / total_f1) + (f1_rf_nn / total_f1_nn)) / 2,
    ((f1_extratree / total_f1) + (f1_extratree_nn / total_f1_nn)) / 2,
]

In [180]:
ensemble_test_predictions = (
    weights[0] * test_predictions_xgbc +
    weights[1] * test_predictions_rf +
    weights[2] * test_predictions_et
)

ensemble_oof_predictions = (
    weights[0] * oof_predictions_xgbc +
    weights[1] * oof_predictions_rf +
    weights[2] * oof_predictions_et
)
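
<p style="font-size: 18px; line-height: 1.6; font-family: 'Arial', sans-serif;">
    To make the weighted average concrete, the sketch below applies the same idea to small toy arrays. It assumes that each model's predictions are class-probability matrices of identical shape, which is not confirmed by this notebook. The arrays <code>p_a</code>, <code>p_b</code>, <code>p_c</code> and the F1 values are hypothetical.
</p>

```python
import numpy as np

# Hypothetical class probabilities from three models: 2 samples, 3 classes
p_a = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]])
p_b = np.array([[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]])
p_c = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])

# Hypothetical F1 scores, normalised so that the weights sum to 1
f1 = np.array([0.89, 0.93, 0.94])
w = f1 / f1.sum()

# Weighted average of the probability matrices
p_ens = w[0] * p_a + w[1] * p_b + w[2] * p_c

print(w)                     # nearly uniform, because the F1 scores are close
print(p_ens.sum(axis=1))     # each row still sums to 1 (up to floating point)
print(p_ens.argmax(axis=1))  # predicted class index per sample
```

<p style="font-size: 18px; line-height: 1.6; font-family: 'Arial', sans-serif;">
    When the F1 scores are this close to each other, the weights are almost equal and the result is close to a plain average of the models.
</p>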

In [181]:
final_test_predictions = ensemble_test_predictions
final_oof_predictions = ensemble_oof_predictions

In [None]:
nn_model_weight_ensemble, X_nn_train, X_nn_val, y_nn_val, y_nn_train, early_stopping, reduce_lr = neural_network(final_oof_predictions, y_train_resampled)

history = nn_model_weight_ensemble.fit(
    X_nn_train, y_nn_train,
    validation_data=(X_nn_val, y_nn_val),
    epochs=10,
    batch_size=64,
    callbacks=[early_stopping, reduce_lr],
    verbose=1
)

save_model(nn_model_weight_ensemble, "WeightedEnsemble_NN_model")

In [None]:
plot_training_history(history)

In [None]:
plot_confusion_matrix(model=nn_model_weight_ensemble, X_val=X_nn_val, y_val=y_nn_val,
    class_mapping=mapping_dict, title="Confusion Matrix - Validation Set"
)

In [None]:
save_predictions_to_csv(
    model=nn_model_weight_ensemble,
    test_data=final_test_predictions,
    claim_ids=data_test["Claim Identifier"],
    class_mapping=mapping_dict,
    output_path="../predictions/group_40_KFold_Ensemble_Weight_NN_predictions.csv"
)


In [None]:
# Check the distribution of predicted classes in the saved file
predictions_data = pd.read_csv('../predictions/group_40_KFold_Ensemble_Weight_NN_predictions.csv')
values = predictions_data['Claim Injury Type'].value_counts()
values

<h5 style="color: #4A90E2; font-family: 'Arial', sans-serif; font-size: 24px; text-shadow: 2px 2px #D1D1D1;">Stacking Ensemble</h5>


In [187]:
X_meta_train = np.column_stack((oof_predictions_xgbc, oof_predictions_rf, oof_predictions_et))
X_meta_test = np.column_stack((test_predictions_xgbc, test_predictions_rf, test_predictions_et))
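
<p style="font-size: 18px; line-height: 1.6; font-family: 'Arial', sans-serif;">
    Unlike weighted averaging, stacking keeps every base model's outputs as separate input features for the meta-learner. The sketch below shows the resulting shape on toy data, assuming each base model outputs one probability per class. The arrays <code>oof_a</code>, <code>oof_b</code>, and <code>oof_c</code> are hypothetical.
</p>

```python
import numpy as np

# Hypothetical out-of-fold probabilities from three base models: 4 samples, 3 classes each
n_samples, n_classes = 4, 3
rng = np.random.default_rng(0)
oof_a, oof_b, oof_c = (rng.dirichlet(np.ones(n_classes), size=n_samples) for _ in range(3))

# Place the three prediction matrices side by side as meta-features
X_meta_toy = np.column_stack((oof_a, oof_b, oof_c))

print(X_meta_toy.shape)  # (4, 9): one row per sample, 3 models x 3 classes columns
# The first n_classes columns are exactly the first model's predictions
print(np.array_equal(X_meta_toy[:, :n_classes], oof_a))  # True
```

<p style="font-size: 18px; line-height: 1.6; font-family: 'Arial', sans-serif;">
    The meta-learner can then learn how much to trust each model for each class, at the cost of a wider input than the averaged predictions.
</p>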

In [None]:
nn_model_stacking_ensemble, X_nn_train, X_nn_val, y_nn_val, y_nn_train, early_stopping, reduce_lr = neural_network(X_meta_train, y_train_resampled)

history = nn_model_stacking_ensemble.fit(
    X_nn_train, y_nn_train,
    validation_data=(X_nn_val, y_nn_val),
    epochs=10,
    batch_size=64,
    callbacks=[early_stopping, reduce_lr],
    verbose=1
)

save_model(nn_model_stacking_ensemble, "StackingEnsemble_NN_model")

In [None]:
plot_training_history(history)

In [None]:
plot_confusion_matrix(model=nn_model_stacking_ensemble, X_val=X_nn_val, y_val=y_nn_val,
    class_mapping=mapping_dict, title="Confusion Matrix - Validation Set"
)

In [None]:
save_predictions_to_csv(
    model=nn_model_stacking_ensemble,
    test_data=X_meta_test,
    claim_ids=data_test["Claim Identifier"],
    class_mapping=mapping_dict,
    output_path="../predictions/group_40_KFold_Ensemble_Stacking_NN_predictions.csv"
)


In [None]:
# Check the distribution of predicted classes in the saved file
predictions_data = pd.read_csv('../predictions/group_40_KFold_Ensemble_Stacking_NN_predictions.csv')
values = predictions_data['Claim Injury Type'].value_counts()
values