# Documentation

**Application of Machine Learning Algorithms for Predicting Core Temperature at the End of a 10km Run**

* This project applies machine learning models to predict core temperature at the end of a self-paced 10 km run.

* The dataset utilized in this study originates from the article by [Andrade et al. (2023)](https://pmc.ncbi.nlm.nih.gov/articles/PMC10988464/), with the raw data made available by the authors at [https://doi.org/10.6084/m9.figshare.21508239](https://doi.org/10.6084/m9.figshare.21508239).

* Before modeling, the original dataset was pre-processed: variables not used in the models were removed, variable data types were converted to more suitable formats, invalid characters were eliminated, and decimal commas were replaced with periods. These transformations are documented in the **pre-processamento.ipynb** notebook, located at **https://github.com/leprogramar/projeto1-doc/blob/main/pre-processamento.ipynb**. From the pre-processed data, three datasets were generated, each replicating one of the variable sets proposed by the original authors: Dataset 1 (10 variables), Dataset 2 (8 variables), and Dataset 3 (5 variables).

* The primary objective of this work is to tune and evaluate the performance of the **Decision Tree, Random Forest, XGBoost, and LASSO Regression** algorithms on the three datasets.

* The code below fits the four machine learning algorithms to **Dataset 1**.

* **Predictor Variables** - WBGT, Running speed, Initial core, Body mass, Tcore-Tskin, Tskin mean, Sweat rate, VO2max, HR, ΔBM%

* **Predicted Variable** - EndTcore
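
The pre-processing steps listed above (dropping unused variables, stripping invalid characters, and replacing decimal commas with periods) can be sketched with pandas as follows. This is a minimal illustration, not the contents of the project's notebook: the sample values, the `Notes` column, and the `clean_numeric` helper are hypothetical.

```python
import pandas as pd

# Hypothetical raw rows: decimal commas, stray characters, and an unused column.
raw = pd.DataFrame({
    "EndTCORE_C": ["39,2", "38,9*", " 39,5"],
    "WBGT_C": ["25,1", "18,4", "30,0"],
    "Notes": ["a", "b", "c"],  # stands in for a variable not used in the models
})

def clean_numeric(series: pd.Series) -> pd.Series:
    """Swap decimal commas for periods, drop other non-numeric characters, cast to float."""
    cleaned = (
        series.astype(str)
        .str.replace(",", ".", regex=False)
        .str.replace(r"[^0-9.\-]", "", regex=True)
    )
    return pd.to_numeric(cleaned, errors="coerce")

# Drop the unused column, then clean every remaining column.
df_clean = raw.drop(columns=["Notes"]).apply(clean_numeric)
print(df_clean.dtypes)
print(df_clean)
```

Using `errors="coerce"` turns anything that still cannot be parsed into `NaN`, which makes leftover problems easy to spot with `df_clean.isna().sum()`.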

# Import Python libraries

## Standard

In [17]:
import pandas as pd #data manipulation
import numpy as np #numerical operations
import seaborn as sns #plotting
import matplotlib.pyplot as plt #plotting
from sklearn.metrics import mean_squared_error, r2_score #evaluation metrics
from sklearn.model_selection import train_test_split, GridSearchCV #train/test split and hyperparameter search

## Decision tree

In [18]:
from sklearn.tree import DecisionTreeRegressor, plot_tree #decision tree model and tree graphic

## Random forest

In [19]:
from sklearn.ensemble import RandomForestRegressor #random forest model

## XGBoost

In [1]:
#pip install xgboost # install xgboost

In [2]:
import xgboost as xgb #xgboost model

## LASSO regression

In [None]:
from sklearn.linear_model import Lasso # import Lasso model
from sklearn.preprocessing import StandardScaler # import StandardScaler for standardization
from sklearn.pipeline import Pipeline # import Pipeline

# Import database

In [None]:
df = pd.read_csv("/home/lafise/Desktop/Samuel/leticiaag/trabalho1/dados_modelo_1.csv") # dataset used for the decision tree
df2 = pd.read_csv("/home/lafise/Desktop/Samuel/leticiaag/trabalho1/dados_modelo_1.csv") # dataset used for the random forest
df3 = pd.read_csv("/home/lafise/Desktop/Samuel/leticiaag/trabalho1/dados_modelo_1.csv") # dataset used for XGBoost
df4 = pd.read_csv("/home/lafise/Desktop/Samuel/leticiaag/trabalho1/dados_modelo_1.csv") # dataset used for the LASSO regression

# Decision tree model

## Pre-processing - decision tree

In [None]:
X1 = df.drop("EndTCORE_C", axis=1) #separating the predictor variables
y1 = df["EndTCORE_C"] #separating the target variable

In [None]:
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, y1, test_size=0.20, random_state=42) #splitting the data into training (80%) and testing (20%) sets
#the 80/20 split was chosen because 70/30 and 80/10/10 splits left too little training data
#to build the model, leading to underfitting. To mitigate overfitting with the 80/20 split, K-fold cross-validation was used.
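
The comment above mentions K-fold sampling. As a rough sketch of what 5-fold cross-validation does with the training portion, the example below splits a synthetic training set into five folds and checks that every row is used for validation exactly once. The data are randomly generated placeholders, not the study's dataset.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))   # synthetic stand-in: 100 rows, 10 predictors
y = rng.normal(size=100)

# Same 80/20 hold-out split as in the notebook.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

kfold = KFold(n_splits=5)  # unshuffled 5-fold split of the training rows
validation_counts = np.zeros(len(X_train), dtype=int)
fold_sizes = []
for train_idx, val_idx in kfold.split(X_train):
    validation_counts[val_idx] += 1          # count how often each row is held out
    fold_sizes.append((len(train_idx), len(val_idx)))

print(fold_sizes)
print(validation_counts.min(), validation_counts.max())
```

The test set stays untouched throughout, so it remains available for a final, independent evaluation.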

In [None]:
#hyperparameter optimization
param_grid = {
    'max_depth': [5, 10, 15, 18, 20], #maximum tree depth
    'min_samples_leaf': [3, 5, 9, 10], #minimum number of samples in a leaf
    'min_samples_split': [3, 5, 8, 10], #minimum number of samples required to split a node
    'ccp_alpha': [0.00, 0.02, 0.03] #cost-complexity pruning parameter
}

In [None]:
dt_regressor_base = DecisionTreeRegressor(random_state=42) #building decision tree instance, used random_state=42 for reproducibility

In [None]:
grid_search = GridSearchCV(estimator=dt_regressor_base, #base estimator
                           param_grid=param_grid, #hyperparameter grid
                           cv=5, #5-fold cross-validation
                           scoring='neg_mean_squared_error', #metric used to select the best model
                           n_jobs=-1, #use all available processor cores
                           verbose=1) #print progress messages
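
To get a sense of the cost of this search before running it, the grid size can be counted with scikit-learn's `ParameterGrid`. The sketch below copies the decision tree grid defined above; each candidate is fitted once per fold, and `GridSearchCV` then refits the best candidate once on the full training set (its default `refit=True` behaviour).

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'max_depth': [5, 10, 15, 18, 20],
    'min_samples_leaf': [3, 5, 9, 10],
    'min_samples_split': [3, 5, 8, 10],
    'ccp_alpha': [0.00, 0.02, 0.03],
}

n_candidates = len(ParameterGrid(param_grid))  # 5 * 4 * 4 * 3 combinations
n_folds = 5
n_fits = n_candidates * n_folds                # cross-validation fits, excluding the final refit

print(n_candidates)  # 240
print(n_fits)        # 1200
```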

## Training model - decision tree

In [None]:
grid_search.fit(X_train1, y_train1) #training models with training data

In [None]:
best_dt_regressor = grid_search.best_estimator_ #model refitted with the best hyperparameters
print(best_dt_regressor) #show the selected model and its hyperparameters
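
Because the search uses `scoring='neg_mean_squared_error'`, `best_score_` holds the negated mean MSE across folds, so its sign has to be flipped before taking a square root. A small sketch on synthetic data follows; the data and the reduced grid are placeholders, not the study's.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the real training data.
X, y = make_regression(n_samples=120, n_features=10, noise=5.0, random_state=42)

search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=42),
    param_grid={'max_depth': [3, 5], 'min_samples_leaf': [3, 5]},
    cv=5,
    scoring='neg_mean_squared_error',
)
search.fit(X, y)

cv_mse = -search.best_score_   # best_score_ is negative MSE, so flip the sign
cv_rmse = np.sqrt(cv_mse)      # RMSE derived from the mean cross-validated MSE
print(search.best_params_)     # dict with the selected hyperparameters
print(f"Cross-validated RMSE: {cv_rmse:.2f}")
```

Comparing this cross-validated figure with the training and test metrics gives an extra check on how well the selected model generalizes.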

## Evaluating the model - Decision tree

In [None]:
mse_train = mean_squared_error(y_train1, best_dt_regressor.predict(X_train1)) #MSE calculation
rmse_train = np.sqrt(mse_train) #RMSE calculation
r2_train = r2_score(y_train1, best_dt_regressor.predict(X_train1)) #R² calculation
print('==========================================================')
print(f"-------- Model Evaluation Metrics - TRAINING Data --------")
print('----------------------------------------------------------')
print(f"Mean Squared Error (MSE): {mse_train:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse_train:.2f}")
print(f"R-squared (R²): {r2_train:.2f}")

mse_optimized = mean_squared_error(y_test1, best_dt_regressor.predict(X_test1)) #MSE calculation
rmse_optimized = np.sqrt(mse_optimized) #RMSE calculation 
r2_optimized = r2_score(y_test1, best_dt_regressor.predict(X_test1)) #R² calculation
print('==========================================================')
print(f"---------- Model Evaluation Metrics - TEST Data ----------")
print('----------------------------------------------------------')
print(f"Mean Squared Error (MSE): {mse_optimized:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse_optimized:.2f}")
print(f"R-squared (R²): {r2_optimized:.2f}")
print('==========================================================')
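
The same three metrics are computed for every model and every split in this notebook. One way to avoid repeating the block is a small helper; `report_metrics` below is a hypothetical name, and the arrays are made-up values used only to show the call.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def report_metrics(y_true, y_pred, label):
    """Print and return MSE, RMSE and R² for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    print(f"{label}: MSE={mse:.2f}  RMSE={rmse:.2f}  R²={r2:.2f}")
    return {"mse": mse, "rmse": rmse, "r2": r2}

# Made-up values purely to demonstrate the call.
y_true = np.array([39.0, 39.5, 38.8, 40.1])
y_pred = np.array([39.1, 39.3, 38.9, 39.9])
metrics = report_metrics(y_true, y_pred, "example")
```

In this notebook the helper could be called as `report_metrics(y_train1, best_dt_regressor.predict(X_train1), "train")` and again with the test split.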

# Random forest model

## Pre-processing - random forest

In [None]:
X2 = df2.drop("EndTCORE_C", axis=1) #separating the predictor variables
y2 = df2["EndTCORE_C"] #separating the target variable

In [None]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=0.20, random_state=42) #splitting the data into training (80%) and testing (20%) sets

In [None]:
#hyperparameter optimization
param_grid_rf = {
    'n_estimators': [100, 150, 200], #number of trees in the forest
    'max_depth': [3, 4, 5, 8], #maximum depth of each tree
    'min_samples_leaf': [10, 8, 12], #minimum number of samples in a leaf
    'min_samples_split': [10, 8, 12], #minimum number of samples required to split a node
    'max_features': ['log2'], #number of features considered at each split
    'ccp_alpha': [0.00, 0.001, 0.002, 0.003] #cost-complexity pruning parameter
}
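
One property of this grid is worth checking. A node can only be split if both children keep at least `min_samples_leaf` samples, so a node needs at least `2 * min_samples_leaf` samples to be split at all. With `min_samples_leaf` of 8 or more, the `min_samples_split` values 8, 10 and 12 all fall below that threshold and should not change the fitted trees. The sketch below checks this on synthetic data with a single decision tree, a simplification assumed here for speed; the data are placeholders.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=42)

predictions = []
for min_split in [8, 10, 12]:
    tree = DecisionTreeRegressor(
        min_samples_leaf=8,           # smallest leaf size in the grid
        min_samples_split=min_split,  # always below 2 * min_samples_leaf = 16
        random_state=42,
    )
    tree.fit(X, y)
    predictions.append(tree.predict(X))

# True when all three settings produce the same fitted tree.
same = all(np.allclose(predictions[0], p) for p in predictions[1:])
print(same)
```

If the check holds for the forest as well, the `min_samples_split` axis multiplies the number of fits by three without adding distinct candidates; values above twice the leaf size would be needed for it to have an effect.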

In [None]:
rf_regressor_base = RandomForestRegressor(random_state=42)#building random forest instance, used random_state=42 for reproducibility

In [None]:
grid_search_rf = GridSearchCV(estimator=rf_regressor_base, #base estimator
                              param_grid=param_grid_rf, #hyperparameter grid
                              cv=5, #5-fold cross-validation
                              scoring='neg_mean_squared_error', #metric used to select the best model
                              n_jobs=-1, #use all available processor cores
                              verbose=1) #print progress messages

## Training model - random forest

In [None]:
grid_search_rf.fit(X_train2, y_train2) #training models with training data

In [None]:
best_params_rf = grid_search_rf.best_params_ #best hyperparameters found by the search
best_rf_regressor = grid_search_rf.best_estimator_ #model refitted with the best hyperparameters

## Evaluating model - random forest

In [None]:
y_pred_rf_optimized = best_rf_regressor.predict(X_test2)
y_pred_rf_train = best_rf_regressor.predict(X_train2)

In [None]:
mse_rf_train = mean_squared_error(y_train2, y_pred_rf_train) #MSE calculation
rmse_rf_train = np.sqrt(mse_rf_train) #RMSE calculation
r2_rf_train = r2_score(y_train2, y_pred_rf_train) #R² calculation
print('========================================================================')
print(f"\n--- OPTIMIZED Random Forest Model Evaluation Metrics (TRAINING) ---")
print('------------------------------------------------------------------------')
print(f"Mean Squared Error (MSE): {mse_rf_train:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse_rf_train:.2f}")
print(f"R-squared (R²): {r2_rf_train:.2f}")

mse_rf_optimized = mean_squared_error(y_test2, y_pred_rf_optimized) #MSE calculation
rmse_rf_optimized = np.sqrt(mse_rf_optimized) #RMSE calculation
r2_rf_optimized = r2_score(y_test2, y_pred_rf_optimized) #R² calculation
print('=======================================================================')
print(f"\n--- OPTIMIZED Random Forest Model Evaluation Metrics (TEST) ---")
print('-----------------------------------------------------------------------')
print(f"Mean Squared Error (MSE): {mse_rf_optimized:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse_rf_optimized:.2f}")
print(f"R-squared (R²): {r2_rf_optimized:.2f}")
print('=======================================================================')

# XGBoost

## Pre-processing - xgboost

In [None]:
X3 = df3.drop("EndTCORE_C", axis=1) #separating the predictor variables
y3 = df3["EndTCORE_C"] #separating the target variable

In [None]:
X_train3, X_test3, y_train3, y_test3 = train_test_split(X3, y3, test_size=0.20, random_state=42) #splitting the data into training (80%) and testing (20%) sets

In [None]:
#hyperparameter optimization
param_grid_xgb = {
    'n_estimators': [90, 100], #number of boosted trees
    'learning_rate': [0.01, 0.02, 0.05], #learning rate (shrinkage applied at each boosting step)
    'max_depth': [2, 3, 5], #maximum depth of each tree
    'subsample': [0.5, 0.6, 0.7], #fraction of training rows sampled for each tree
    'colsample_bytree': [0.4, 0.5], #fraction of columns sampled for each tree
    'gamma': [0.3, 0.4, 0.5], #minimum loss reduction required to make a split
    'min_child_weight': [6, 7, 8] #minimum sum of instance weight needed in a child
}

In [None]:
xgb_regressor_base = xgb.XGBRegressor(random_state=42, eval_metric='rmse') #building the XGBoost model instance, with random_state=42 for reproducibility

In [None]:
grid_search_xgb = GridSearchCV(estimator=xgb_regressor_base, #base estimator
                               param_grid=param_grid_xgb, #hyperparameter grid
                               cv=5, #5-fold cross-validation
                               scoring='neg_mean_squared_error', #selects the candidate with the lowest MSE
                               n_jobs=-1, #use all available processor cores
                               verbose=1) #print progress messages

## Training model - xgboost

In [None]:
grid_search_xgb.fit(X_train3, y_train3) #training the model on the training data

In [None]:
best_params_xgb = grid_search_xgb.best_params_
best_xgb_regressor = grid_search_xgb.best_estimator_

In [None]:
print(best_xgb_regressor)
print(best_params_xgb)

## Evaluating model - xgboost

In [None]:
y_pred_xgb_optimized = best_xgb_regressor.predict(X_test3)
y_pred_xgb_train = best_xgb_regressor.predict(X_train3)

In [None]:
mse_xgb_train = mean_squared_error(y_train3, y_pred_xgb_train) #MSE calculation
rmse_xgb_train = np.sqrt(mse_xgb_train) #RMSE calculation
r2_xgb_train = r2_score(y_train3, y_pred_xgb_train) #R² calculation
print('========================================================================')
print(f"\n--- OPTIMIZED XGBoost Model Evaluation Metrics (TRAINING) ---")
print('------------------------------------------------------------------------')
print(f"Mean Squared Error (MSE): {mse_xgb_train:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse_xgb_train:.2f}")
print(f"R-squared (R²): {r2_xgb_train:.2f}")

mse_xgb_optimized = mean_squared_error(y_test3, y_pred_xgb_optimized) #MSE calculation
rmse_xgb_optimized = np.sqrt(mse_xgb_optimized) #RMSE calculation
r2_xgb_optimized = r2_score(y_test3, y_pred_xgb_optimized)#R² calculation
print('========================================================================')
print(f"\n--- OPTIMIZED XGBoost Model Evaluation Metrics (TEST) ---")
print('------------------------------------------------------------------------')
print(f"Mean Squared Error (MSE): {mse_xgb_optimized:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse_xgb_optimized:.2f}")
print(f"R-squared (R²): {r2_xgb_optimized:.2f}")
print('========================================================================')

# LASSO regression

## Pre-processing - lasso regression

In [None]:
X4 = df4.drop("EndTCORE_C", axis=1) #separating the predictor variables
y4 = df4["EndTCORE_C"] #separating the target variable

In [None]:
X_train4, X_test4, y_train4, y_test4 = train_test_split(X4, y4, test_size=0.30, random_state=42) #splitting the data into training (70%) and testing (30%) sets; note that test_size=0.30 here differs from the 80/20 split used for the other three models

In [None]:
pipeline = Pipeline([
    ('scaler', StandardScaler()), #standardizes the features
    ('lasso', Lasso(random_state=42, max_iter=2000)) #applies the LASSO regression
])


In [None]:
param_grid_lasso = {
    'lasso__alpha': np.logspace(-4, 2, 7) #7 alpha values on a logarithmic scale, from 1e-4 to 1e2
}
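
For reference, `np.logspace(-4, 2, 7)` produces seven candidate values spaced one decade apart, which can be confirmed directly:

```python
import numpy as np

# 7 points from 10**-4 to 10**2, evenly spaced in log10
alphas = np.logspace(-4, 2, 7)
print(alphas.tolist())

# Consecutive values differ by a factor of 10.
ratios = alphas[1:] / alphas[:-1]
print(ratios)
```

Larger `alpha` values apply a stronger L1 penalty and shrink more coefficients to exactly zero, so a grid this coarse mainly locates the right order of magnitude.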

In [None]:
#hyperparameter optimization
grid_search_lasso = GridSearchCV(estimator=pipeline, #pipeline estimator
                                 param_grid=param_grid_lasso, #hyperparameter grid
                                 cv=5, #5-fold cross-validation
                                 scoring='neg_mean_squared_error', #metric used to select the best model
                                 n_jobs=-1, #use all available processor cores
                                 verbose=1) #print progress messages

## Training model - lasso regression

In [None]:
grid_search_lasso.fit(X_train4, y_train4) #training the model based on the training data

In [None]:
best_params_lasso = grid_search_lasso.best_params_  # best hyperparameters found by the search
best_lasso_regressor_pipeline = grid_search_lasso.best_estimator_ # pipeline refitted with the best hyperparameters
print(best_lasso_regressor_pipeline)
print(best_params_lasso)

## Evaluating model - lasso regression

In [None]:
y_pred_lasso_optimized = best_lasso_regressor_pipeline.predict(X_test4)
y_pred_lasso_train = best_lasso_regressor_pipeline.predict(X_train4)

In [None]:
mse_lasso_train = mean_squared_error(y_train4, y_pred_lasso_train) #MSE calculation
rmse_lasso_train = np.sqrt(mse_lasso_train) #RMSE calculation
r2_lasso_train = r2_score(y_train4, y_pred_lasso_train) #R² calculation
print('============================================================================')
print(f"\n--- OPTIMIZED Lasso Regression Model Evaluation Metrics (TRAINING) ----")
print('----------------------------------------------------------------------------')
print(f"Mean Squared Error (MSE): {mse_lasso_train:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse_lasso_train:.2f}")
print(f"R-squared (R²): {r2_lasso_train:.2f}")
print('============================================================================')
mse_lasso_optimized = mean_squared_error(y_test4, y_pred_lasso_optimized) #MSE calculation
rmse_lasso_optimized = np.sqrt(mse_lasso_optimized) #RMSE calculation
r2_lasso_optimized = r2_score(y_test4, y_pred_lasso_optimized) #R² calculation
print(f"\n--- OPTIMIZED Lasso Regression Model Evaluation Metrics (TEST) -----")
print('----------------------------------------------------------------------------')
print(f"Mean Squared Error (MSE): {mse_lasso_optimized:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse_lasso_optimized:.2f}")
print(f"R-squared (R²): {r2_lasso_optimized:.2f}")
print('============================================================================')

# Summary figure

In [None]:
# Mapping from dataset column names to display labels
mapa_de_nomes = {
    'Heart_rate_bpm': 'Heart rate',
    'WBGT_C': 'WBGT',
    'Initial_TCORE_C': 'Initial TCORE',
    'VO2MAX_mLkg1min1': 'VO2 Max',
    'Mean_TSKIN_C': 'Mean TSKIN',
    'Speed_kmh1': 'Speed',
    'Delta_mass_': 'Delta mass',
    'Sweat_rate_Lh1': 'Sweat rate',
    'Body_mass_kg': 'Body mass',
    'TCORE__TSKIN_C': 'TCORE-TSKIN'
}

# data for graphic A (Random forest)
importances_rf = (best_rf_regressor.feature_importances_) * 100
feature_names_rf = X2.columns
rf_importance_df = pd.DataFrame({
    'Feature': feature_names_rf,
    'Importance': importances_rf
}).sort_values(by='Importance', ascending=False)
rf_importance_df['Feature'] = rf_importance_df['Feature'].map(mapa_de_nomes)

# data for graphic B (XGBoost)
importances_xgb = (best_xgb_regressor.feature_importances_) * 100
feature_names_xgb = X3.columns
xgb_importance_df = pd.DataFrame({
    'Feature': feature_names_xgb,
    'Importance': importances_xgb
}).sort_values(by='Importance', ascending=False).round(2)
xgb_importance_df['Feature'] = xgb_importance_df['Feature'].map(mapa_de_nomes)

# data for graphic C (LASSO)
coefficients_lasso = best_lasso_regressor_pipeline.named_steps['lasso'].coef_
feature_names_lasso = X4.columns
lasso_coef_df = pd.DataFrame({
    'Feature': feature_names_lasso,
    'Coefficient': coefficients_lasso
}).sort_values(by='Coefficient', key=abs, ascending=False)
lasso_coef_nonzero_df = lasso_coef_df[lasso_coef_df['Coefficient'] != 0].copy() # copy so the assignment below does not trigger a pandas SettingWithCopyWarning
lasso_coef_nonzero_df['Feature'] = lasso_coef_nonzero_df['Feature'].map(mapa_de_nomes)

fig, axes = plt.subplots(1, 3, figsize=(22, 6))

# --- graphic A: Random Forest ---
sns.barplot(x='Importance', y='Feature', data=rf_importance_df, edgecolor='black', color='gray', width=0.8, alpha=0.7, ax=axes[0])
axes[0].set_title('Figure A: Dataset 1 (Random Forest)', fontsize=14, pad=20)
axes[0].set_xlabel('Importance (%)', fontsize=12)
axes[0].set_ylabel('Predictor Variable', fontsize=12)
axes[0].spines['top'].set_visible(False)
axes[0].spines['right'].set_visible(False)
axes[0].set_xlim(0, 25)
axes[0].set_xticks(np.arange(0, 26, 2.5))

# --- graphic B: XGBoost ---
sns.barplot(x='Importance', y='Feature', data=xgb_importance_df, edgecolor='black', color='gray', width=0.8, alpha=0.7, ax=axes[1])
axes[1].set_title('Figure B: Dataset 1 (XGBoost)', fontsize=14, pad=20)
axes[1].set_xlabel('Importance (%)', fontsize=12)
axes[1].set_ylabel('') # remove the y-axis label
axes[1].spines['top'].set_visible(False)
axes[1].spines['right'].set_visible(False)
axes[1].set_xlim(0, 20)
axes[1].set_xticks(np.arange(0, 22, 2))


# --- graphic C: LASSO ---
sns.barplot(x='Coefficient', y='Feature', data=lasso_coef_nonzero_df, edgecolor='black', color='gray', width=0.8, alpha=0.7, ax=axes[2])
axes[2].set_title('Figure C: Dataset 1 (LASSO)', fontsize=14, pad=20)
axes[2].set_xlabel('Coefficient Value', fontsize=12)
axes[2].set_ylabel('') # remove the y-axis label
axes[2].axvline(0, color='black', linewidth=0.8)
axes[2].set_xlim(-0.3, 3.0)
axes[2].set_xticks(np.arange(-0.3, 3.5, 0.5))
axes[2].spines['top'].set_visible(False)
axes[2].spines['right'].set_visible(False)

plt.tight_layout()

#plt.savefig('graficos_dataset1_modelos_comparativo1.png', dpi=300, bbox_inches='tight') #save graphic

plt.show()
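
The LASSO coefficients plotted in panel C come from a pipeline that standardizes the predictors first, so each coefficient is expressed per standard deviation of its predictor rather than per original unit. If coefficients in original units are wanted, they can be recovered from the scaler's `scale_` and `mean_` attributes. A sketch on synthetic data follows; the data and the `alpha=0.1` setting are placeholders, not values from this study.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=150, n_features=5, noise=3.0, random_state=42)
X = X * np.array([1.0, 10.0, 0.1, 5.0, 2.0]) + 3.0  # give the predictors different scales

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('lasso', Lasso(alpha=0.1, max_iter=2000)),
])
pipe.fit(X, y)

scaler = pipe.named_steps['scaler']
lasso = pipe.named_steps['lasso']

# Model: y = b0 + sum_j b_j * (x_j - mean_j) / scale_j, rearranged to original units.
coef_original = lasso.coef_ / scaler.scale_
intercept_original = lasso.intercept_ - np.sum(lasso.coef_ * scaler.mean_ / scaler.scale_)

# The rescaled coefficients should reproduce the pipeline's predictions.
manual_pred = X @ coef_original + intercept_original
print(np.allclose(manual_pred, pipe.predict(X)))
```

Standardized coefficients remain the better choice for comparing predictors against each other, which is what the plot does; the original-unit version is mainly useful for writing the fitted equation out.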

# Model by features