# Machine Learning Final Project

----------------------------------------------------------

## Exploratory analysis


In [320]:
import pandas as pd
import numpy as np
from ydata_profiling import ProfileReport

In [321]:
df = pd.read_csv('osteoporosis.csv')
df.head()

Unnamed: 0,Id,Age,Gender,Hormonal Changes,Family History,Race/Ethnicity,Body Weight,Calcium Intake,Vitamin D Intake,Physical Activity,Smoking,Alcohol Consumption,Medical Conditions,Medications,Prior Fractures,Osteoporosis
0,104866,69,Female,Normal,Yes,Asian,Underweight,Low,Sufficient,Sedentary,Yes,Moderate,Rheumatoid Arthritis,Corticosteroids,Yes,1
1,101999,32,Female,Normal,Yes,Asian,Underweight,Low,Sufficient,Sedentary,No,,,,Yes,1
2,106567,89,Female,Postmenopausal,No,Caucasian,Normal,Adequate,Sufficient,Active,No,Moderate,Hyperthyroidism,Corticosteroids,No,1
3,102316,78,Female,Normal,No,Caucasian,Underweight,Adequate,Insufficient,Sedentary,Yes,,Rheumatoid Arthritis,Corticosteroids,No,1
4,101944,38,Male,Postmenopausal,Yes,African American,Normal,Low,Sufficient,Active,Yes,,Rheumatoid Arthritis,,Yes,1


In [322]:
df.shape

(1958, 16)

In [323]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1958 entries, 0 to 1957
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Id                   1958 non-null   int64 
 1   Age                  1958 non-null   int64 
 2   Gender               1958 non-null   object
 3   Hormonal Changes     1958 non-null   object
 4   Family History       1958 non-null   object
 5   Race/Ethnicity       1958 non-null   object
 6   Body Weight          1958 non-null   object
 7   Calcium Intake       1958 non-null   object
 8   Vitamin D Intake     1958 non-null   object
 9   Physical Activity    1958 non-null   object
 10  Smoking              1958 non-null   object
 11  Alcohol Consumption  1958 non-null   object
 12  Medical Conditions   1958 non-null   object
 13  Medications          1958 non-null   object
 14  Prior Fractures      1958 non-null   object
 15  Osteoporosis         1958 non-null   int64 
dtypes: int

### Null value analysis

In [324]:
df.isna().sum()

Id                     0
Age                    0
Gender                 0
Hormonal Changes       0
Family History         0
Race/Ethnicity         0
Body Weight            0
Calcium Intake         0
Vitamin D Intake       0
Physical Activity      0
Smoking                0
Alcohol Consumption    0
Medical Conditions     0
Medications            0
Prior Fractures        0
Osteoporosis           0
dtype: int64

In this case, the chosen dataset has no null values.

### Unique value counts

In [325]:
df.nunique()

Id                     1749
Age                      73
Gender                    2
Hormonal Changes          2
Family History            2
Race/Ethnicity            3
Body Weight               2
Calcium Intake            2
Vitamin D Intake          2
Physical Activity         2
Smoking                   2
Alcohol Consumption       2
Medical Conditions        3
Medications               2
Prior Fractures           2
Osteoporosis              2
dtype: int64

Apart from the integer columns, we have many categorical variables with few unique values. We can apply one-hot encoding to these variables.
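As a small illustration of what one-hot encoding does to low-cardinality columns like these, the sketch below encodes a toy frame. The three rows are hypothetical; only the column names and category labels are borrowed from the dataset.

```python
import pandas as pd

# Toy frame with hypothetical rows; column names borrowed from the dataset.
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female"],
    "Race/Ethnicity": ["Asian", "Caucasian", "African American"],
})

# drop_first=True drops the first level (in sorted order) of each column,
# so a variable with k levels becomes k-1 indicator columns.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

Dropping one level per variable avoids perfectly collinear indicator columns, which matters mostly for linear models such as logistic regression.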

### Class balance analysis

In [326]:
df['Osteoporosis'].value_counts()

1    979
0    979
Name: Osteoporosis, dtype: int64

The dataset is balanced, with 50% of the samples in each class, so class
imbalance will not distort the metrics. In my view, the best metric for
evaluating the models will probably be Recall, since we want to minimize
false negatives.
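For reference, recall is the fraction of actual positives that the model finds: recall = TP / (TP + FN). The following minimal sketch computes it by hand on hypothetical labels (the `recall` helper and the toy lists are illustrative, not part of the notebook).

```python
def recall(y_true, y_pred, positive=1):
    """Recall = TP / (TP + FN) for the given positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical labels: 4 positives, of which the model finds 3.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
print(recall(y_true, y_pred))
```

The false positive in the last position does not affect recall at all, which is why recall on its own says nothing about how many healthy patients are flagged.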

## Preprocessing

----------------------------------------------------------
### OneHot Encoding

In [327]:
listDummies= df.select_dtypes(include='object').columns.to_list()

In [328]:
# One-hot encoding of the categorical columns
oneHot = pd.get_dummies(df[listDummies], drop_first=True)
df = df.drop(listDummies, axis=1)
df = pd.concat([df, oneHot], axis=1)
df.head()

Unnamed: 0,Id,Age,Osteoporosis,Gender_Male,Hormonal Changes_Postmenopausal,Family History_Yes,Race/Ethnicity_Asian,Race/Ethnicity_Caucasian,Body Weight_Underweight,Calcium Intake_Low,Vitamin D Intake_Sufficient,Physical Activity_Sedentary,Smoking_Yes,Alcohol Consumption_None,Medical Conditions_None,Medical Conditions_Rheumatoid Arthritis,Medications_None,Prior Fractures_Yes
0,104866,69,1,0,0,1,1,0,1,1,1,1,1,0,0,1,0,1
1,101999,32,1,0,0,1,1,0,1,1,1,1,0,1,1,0,1,1
2,106567,89,1,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0
3,102316,78,1,0,0,0,0,1,1,0,0,1,1,1,0,1,0,0
4,101944,38,1,1,1,1,0,0,0,1,1,0,1,1,0,1,1,1


In [329]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve, RocCurveDisplay
from sklearn.metrics import recall_score

In [330]:
Y = df['Osteoporosis']

# Define the input variables (features)
# Note: 'Id' is still part of X here; it is a record identifier rather than a predictor
X = df.drop(['Osteoporosis'], axis=1)

### Splitting the data into training and test sets

In [331]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.3, random_state=15)
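The split above is not stratified, and the classification reports further down show 313 negatives and 275 positives in the test set. If an exactly balanced test set were wanted, one option is the `stratify` argument of `train_test_split`. The sketch below shows this on synthetic stand-in data (`X_demo` and `y_demo` are hypothetical), not on the notebook's `X` and `Y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 100 rows, perfectly balanced labels.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 50 + [1] * 50)

# stratify=y_demo keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=15, stratify=y_demo
)
print(np.bincount(y_tr), np.bincount(y_te))
```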


## Randomized hyperparameter search

----------------------------------------------------------
### Hyperparameter spaces

In [332]:
from scipy.stats import randint, uniform

param_dist_xgb = {
    'n_estimators': randint(50, 300),
    'learning_rate': uniform(0.01, 0.6),
    'subsample': uniform(0.3, 0.7),
    'max_depth': randint(3, 10),
    'colsample_bytree': uniform(0.5, 0.5),
    'min_child_weight': randint(1, 6)
}

param_dist_rf = {
    'n_estimators': randint(100, 500),
    'max_features': ['sqrt', 'log2', None],  # Updated to remove 'auto'
    'max_depth': randint(3, 20),
    'min_samples_split': randint(2, 11),
    'min_samples_leaf': randint(1, 11)
}

param_dist_lr = {
    'C': uniform(0.01, 10),
    'penalty': ['l2'],  
    'solver': ['liblinear', 'saga'], 
    'max_iter': [10000]  
}
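When reading these search spaces, it helps to recall how SciPy parameterizes the distributions: `uniform(loc, scale)` covers the interval from `loc` to `loc + scale`, and `randint(low, high)` excludes `high`. So `uniform(0.01, 0.6)` spans 0.01 to 0.61 and `uniform(0.3, 0.7)` spans 0.3 to 1.0. A quick sampling check, with hypothetical variable names:

```python
from scipy.stats import randint, uniform

# uniform(loc, scale) samples from [loc, loc + scale], not [loc, scale].
lr_samples = uniform(0.01, 0.6).rvs(size=1000, random_state=0)
print(lr_samples.min(), lr_samples.max())

# randint(low, high) samples integers in [low, high), so `high` is excluded.
depth_samples = randint(3, 10).rvs(size=1000, random_state=0)
print(depth_samples.min(), depth_samples.max())
```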




### Choosing the scorer

In [333]:
from sklearn.metrics import make_scorer, f1_score, roc_auc_score, average_precision_score, recall_score
metric_function = recall_score  # Choose the metric to use
scorer = make_scorer(metric_function, pos_label=1)  # pos_label=1 selects the positive class
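Metrics such as ROC AUC and average precision are computed from predicted scores rather than hard class labels, so they may not drop into the `make_scorer(metric_function, pos_label=1)` pattern unchanged. One hedged alternative is to pass scikit-learn's built-in scorer names as the `scoring` argument. The sketch below tries the four metrics used later in this notebook on synthetic data; `X_demo`, `y_demo` and `results` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data, only for illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Built-in scorer names: 'roc_auc' and 'average_precision' use predicted
# scores, while 'recall' and 'f1' use the predicted labels.
results = {}
for scoring in ["recall", "f1", "roc_auc", "average_precision"]:
    scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo,
                             cv=5, scoring=scoring)
    results[scoring] = scores.mean()
    print(f"{scoring}: {results[scoring]:.2f}")
```

The same strings can be passed as `scoring=` to `RandomizedSearchCV`.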

### Training the Random Forest model

The number of iterations was chosen so that each model takes approximately the same time to train.
The number of folds was set to 5, since it is a common value that is not too high.
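The cost of each search is driven by the number of cross-validation fits, which is `n_iter * cv`. Using the `n_iter` and `cv` values from the three searches in this notebook, the arithmetic is:

```python
# (model name, n_iter, cv) as used in the searches in this notebook
searches = [
    ("Random Forest", 12, 5),
    ("XGBoost", 58, 5),
    ("Logistic Regression", 7, 5),
]

# Each sampled candidate is fitted once per fold.
total_fits = {name: n_iter * cv for name, n_iter, cv in searches}
for name, fits in total_fits.items():
    print(f"{name}: {fits} fits")
```

These totals match the "Fitting 5 folds for each of ..." lines in the verbose output. With the default `refit=True`, `RandomizedSearchCV` also performs one extra fit of the best candidate on the full training set.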

In [334]:
import time
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

# Configure RandomizedSearchCV for RandomForest
search_rf = RandomizedSearchCV(RandomForestClassifier(), param_distributions=param_dist_rf,
                               n_iter=12, cv=5, random_state=42, scoring=scorer, verbose=2)

# Run the randomized search
start_time = time.time()
search_rf.fit(X_train, Y_train)
end_time = time.time()

# Total execution time
execution_time = end_time - start_time

print(f"Execution Time: {execution_time:.2f} seconds")
print(f"Random Forest best score: {search_rf.best_score_:.2f}")
print(f"Best parameters found: {search_rf.best_params_}")


Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] END max_depth=9, max_features=sqrt, min_samples_leaf=8, min_samples_split=6, n_estimators=202; total time=   0.2s
[CV] END max_depth=9, max_features=sqrt, min_samples_leaf=8, min_samples_split=6, n_estimators=202; total time=   0.2s
[CV] END max_depth=9, max_features=sqrt, min_samples_leaf=8, min_samples_split=6, n_estimators=202; total time=   0.2s
[CV] END max_depth=9, max_features=sqrt, min_samples_leaf=8, min_samples_split=6, n_estimators=202; total time=   0.3s
[CV] END max_depth=9, max_features=sqrt, min_samples_leaf=8, min_samples_split=6, n_estimators=202; total time=   0.3s
[CV] END max_depth=13, max_features=None, min_samples_leaf=8, min_samples_split=6, n_estimators=199; total time=   0.4s
[CV] END max_depth=13, max_features=None, min_samples_leaf=8, min_samples_split=6, n_estimators=199; total time=   0.4s
[CV] END max_depth=13, max_features=None, min_samples_leaf=8, min_samples_split=6, n_estimators=199; tot

### Training the XGBoost model

In [335]:
from xgboost import XGBClassifier

search_xgb = RandomizedSearchCV(XGBClassifier(), param_distributions=param_dist_xgb,
                               n_iter=58, cv=5, random_state=42, scoring=scorer, verbose=2)

start_time = time.time()
search_xgb.fit(X_train, Y_train)
end_time = time.time()

# Total execution time
execution_time = end_time - start_time

print(f"Execution Time: {execution_time:.2f} seconds")
print(f"XGBoost best score: {search_xgb.best_score_:.2f}")
print(f"Best parameters found: {search_xgb.best_params_}")

Fitting 5 folds for each of 58 candidates, totalling 290 fits
[CV] END colsample_bytree=0.6872700594236812, learning_rate=0.5804285838459496, max_depth=5, min_child_weight=5, n_estimators=70, subsample=0.40921304830970556; total time=   0.0s
[CV] END colsample_bytree=0.6872700594236812, learning_rate=0.5804285838459496, max_depth=5, min_child_weight=5, n_estimators=70, subsample=0.40921304830970556; total time=   0.0s
[CV] END colsample_bytree=0.6872700594236812, learning_rate=0.5804285838459496, max_depth=5, min_child_weight=5, n_estimators=70, subsample=0.40921304830970556; total time=   0.0s
[CV] END colsample_bytree=0.6872700594236812, learning_rate=0.5804285838459496, max_depth=5, min_child_weight=5, n_estimators=70, subsample=0.40921304830970556; total time=   0.0s
[CV] END colsample_bytree=0.6872700594236812, learning_rate=0.5804285838459496, max_depth=5, min_child_weight=5, n_estimators=70, subsample=0.40921304830970556; total time=   0.0s
[CV] END colsample_bytree=0.5779972601

### Training the Logistic Regression model

In [336]:
from sklearn.linear_model import LogisticRegression

search_lr = RandomizedSearchCV(LogisticRegression(), param_distributions=param_dist_lr,
                               n_iter=7, cv=5, random_state=42, scoring=scorer, verbose=2)



start_time = time.time()
search_lr.fit(X_train, Y_train)
end_time = time.time()

# Total execution time
execution_time = end_time - start_time

print(f"Execution Time: {execution_time:.2f} seconds")
print(f"Logistic Regression best score: {search_lr.best_score_:.2f}")
print(f"Best parameters found: {search_lr.best_params_}")

Fitting 5 folds for each of 7 candidates, totalling 35 fits
[CV] END C=3.7554011884736247, max_iter=10000, penalty=l2, solver=liblinear; total time=   0.0s
[CV] END C=3.7554011884736247, max_iter=10000, penalty=l2, solver=liblinear; total time=   0.0s
[CV] END C=3.7554011884736247, max_iter=10000, penalty=l2, solver=liblinear; total time=   0.0s
[CV] END C=3.7554011884736247, max_iter=10000, penalty=l2, solver=liblinear; total time=   0.0s
[CV] END C=3.7554011884736247, max_iter=10000, penalty=l2, solver=liblinear; total time=   0.0s
[CV] END C=1.844347898661638, max_iter=10000, penalty=l2, solver=saga; total time=   1.6s
[CV] END C=1.844347898661638, max_iter=10000, penalty=l2, solver=saga; total time=   1.6s
[CV] END C=1.844347898661638, max_iter=10000, penalty=l2, solver=saga; total time=   1.5s
[CV] END C=1.844347898661638, max_iter=10000, penalty=l2, solver=saga; total time=   1.5s
[CV] END C=1.844347898661638, max_iter=10000, penalty=l2, solver=saga; total time=   1.5s
[CV] END C

## Evaluating the models on the test set

----------------------------------------------------------

Finding the best probability threshold for each model and obtaining the classification report.

In [337]:
from sklearn.metrics import recall_score

def find_best_threshold(y_true, y_probas, score_func):
    thresholds = np.linspace(0.01, 0.99, 100)
    best_score = 0
    best_thresh = 0.5
    for thresh in thresholds:
        y_pred = (y_probas > thresh).astype(int)
        score = score_func(y_true, y_pred)
        if score > best_score:
            best_score = score
            best_thresh = thresh
    return best_thresh, best_score

# List of models and metrics
model_list = [search_xgb, search_rf, search_lr]
model_names = ['XGBoost', 'Random Forest', 'Logistic Regression']
metric_name = metric_function.__name__.split('_')[0]  # Capture the metric name for printing

for model, name in zip(model_list, model_names):
    y_test_proba = model.best_estimator_.predict_proba(X_test)
    # Check whether the second column needs to be selected
    if y_test_proba.ndim == 2 and y_test_proba.shape[1] > 1:
        y_test_proba = y_test_proba[:, 1]
    best_thresh, best_metric = find_best_threshold(Y_test, y_test_proba, metric_function)
    print(f"{name} - Best Threshold: {best_thresh}, Best {metric_name.capitalize()}: {best_metric:.4f}")



XGBoost - Best Threshold: 0.01, Best Recall: 1.0000
Random Forest - Best Threshold: 0.01, Best Recall: 1.0000
Logistic Regression - Best Threshold: 0.01, Best Recall: 1.0000
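All three models land on the lowest threshold (0.01) with a recall of 1.0 because recall can only stay the same or rise as the threshold is lowered: at 0.01 almost every sample is predicted positive. Maximizing recall alone therefore does not yield an informative threshold. The sketch below, on synthetic scores rather than the notebook's models, contrasts this with choosing the threshold by F1, which also penalizes false positives. The helper `recall_and_f1` and all data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic scores: positives tend to score higher than negatives.
y_true = np.array([0] * 200 + [1] * 200)
y_score = np.clip(np.concatenate([rng.normal(0.35, 0.15, 200),
                                  rng.normal(0.65, 0.15, 200)]), 0, 1)

def recall_and_f1(y_true, y_pred):
    """Return (recall, F1) for the positive class from 0/1 arrays."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return recall, f1

thresholds = np.linspace(0.01, 0.99, 99)
results = [recall_and_f1(y_true, (y_score > t).astype(int)) for t in thresholds]
recalls = np.array([r for r, _ in results])
f1s = np.array([f for _, f in results])

# Recall is non-increasing in the threshold, so its maximum sits at the lowest one.
best_by_recall = thresholds[np.argmax(recalls)]
best_by_f1 = thresholds[np.argmax(f1s)]
print(f"best by recall: {best_by_recall:.2f}, best by F1: {best_by_f1:.2f}")
```

Note also that a threshold tuned on the test set gives an optimistic estimate; a separate validation split would be the more careful choice.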


In [338]:
# Assuming the following search objects have been configured and fitted:
# search_xgb, search_rf, search_lr

model_searches = [
    ("XGBoost", search_xgb),
    ("Random Forest", search_rf),
    ("Logistic Regression", search_lr)
]

# Test each model and generate the classification report
for model_name, search in model_searches:
    best_model = search.best_estimator_
    y_pred = best_model.predict(X_test)  # Use .predict() to get the predicted classes directly
    report = classification_report(Y_test, y_pred)
    print(f"Classification Report for {model_name}:\n{report}\n")


Classification Report for XGBoost:
              precision    recall  f1-score   support

           0       0.87      0.99      0.93       313
           1       0.99      0.83      0.90       275

    accuracy                           0.91       588
   macro avg       0.93      0.91      0.91       588
weighted avg       0.92      0.91      0.91       588


Classification Report for Random Forest:
              precision    recall  f1-score   support

           0       0.85      1.00      0.92       313
           1       1.00      0.80      0.89       275

    accuracy                           0.91       588
   macro avg       0.93      0.90      0.91       588
weighted avg       0.92      0.91      0.91       588


Classification Report for Logistic Regression:
              precision    recall  f1-score   support

           0       0.82      0.95      0.88       313
           1       0.93      0.76      0.83       275

    accuracy                           0.86       588
   

## Classification report for each model

----------------------------------------------------------

### Model evaluation with Recall scoring

----------------------------------------------------------
Classification Report for XGBoost:
              precision    recall  f1-score   support

           0       0.86      0.90      0.88       313
           1       0.88      0.84      0.86       275

    accuracy                           0.87       588
   macro avg       0.87      0.87      0.87       588
weighted avg       0.87      0.87      0.87       588


Classification Report for Random Forest:
              precision    recall  f1-score   support

           0       0.87      0.99      0.93       313
           1       0.98      0.83      0.90       275

    accuracy                           0.91       588
   macro avg       0.93      0.91      0.91       588
weighted avg       0.92      0.91      0.91       588


Classification Report for Logistic Regression:
              precision    recall  f1-score   support

           0       0.81      0.84      0.83       313
           1       0.81      0.78      0.80       275

    accuracy                           0.81       588
   macro avg       0.81      0.81      0.81       588
weighted avg       0.81      0.81      0.81       588




### Model evaluation with ROC AUC scoring

----------------------------------------------------------
Classification Report for XGBoost:
              precision    recall  f1-score   support

           0       0.85      0.86      0.85       313
           1       0.84      0.83      0.83       275

    accuracy                           0.84       588
   macro avg       0.84      0.84      0.84       588
weighted avg       0.84      0.84      0.84       588


Classification Report for Random Forest:
              precision    recall  f1-score   support

           0       0.81      0.98      0.88       313
           1       0.97      0.73      0.83       275

    accuracy                           0.86       588
   macro avg       0.89      0.86      0.86       588
weighted avg       0.88      0.86      0.86       588


Classification Report for Logistic Regression:
              precision    recall  f1-score   support

           0       0.81      0.84      0.83       313
           1       0.81      0.78      0.80       275

    accuracy                           0.81       588
   macro avg       0.81      0.81      0.81       588
weighted avg       0.81      0.81      0.81       588



### Model evaluation with F1 scoring

----------------------------------------------------------
Classification Report for XGBoost:
              precision    recall  f1-score   support

           0       0.87      0.99      0.93       313
           1       0.99      0.83      0.90       275

    accuracy                           0.91       588
   macro avg       0.93      0.91      0.91       588
weighted avg       0.92      0.91      0.91       588


Classification Report for Random Forest:
              precision    recall  f1-score   support

           0       0.87      0.99      0.93       313
           1       0.98      0.83      0.90       275

    accuracy                           0.91       588
   macro avg       0.93      0.91      0.91       588
weighted avg       0.92      0.91      0.91       588


Classification Report for Logistic Regression:
              precision    recall  f1-score   support

           0       0.82      0.95      0.88       313
           1       0.93      0.76      0.83       275

    accuracy                           0.86       588
   macro avg       0.87      0.85      0.86       588
weighted avg       0.87      0.86      0.86       588



### Model evaluation with Average Precision scoring

----------------------------------------------------------
Classification Report for XGBoost:
              precision    recall  f1-score   support

           0       0.87      0.99      0.93       313
           1       0.99      0.83      0.90       275

    accuracy                           0.91       588
   macro avg       0.93      0.91      0.91       588
weighted avg       0.92      0.91      0.91       588

Classification Report for Random Forest:
              precision    recall  f1-score   support

           0       0.86      1.00      0.93       313
           1       1.00      0.82      0.90       275

    accuracy                           0.91       588
   macro avg       0.93      0.91      0.91       588
weighted avg       0.93      0.91      0.91       588


Classification Report for Logistic Regression:
              precision    recall  f1-score   support

           0       0.82      0.95      0.88       313
           1       0.93      0.76      0.83       275

    accuracy                           0.86       588
   macro avg       0.87      0.85      0.86       588
weighted avg       0.87      0.86      0.86       588



## Conclusion

----------------------------------------------------------
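Judging by the results, the Random Forest model's test accuracy was better than or equal to that of the other models regardless of the scoring metric used in the search. Positive-class recall is the main exception: XGBoost was ahead when the search was scored with ROC AUC (0.83 vs. 0.73) and marginally ahead with Recall (0.84 vs. 0.83).
Of course, the execution time was roughly equalized across the searches, so with more time the results could be different.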

However, for our case, both Random Forest and XGBoost had very close results when using the F1 score, which gave one of the best results for positive-class recall (0.83).


Note: the logistic regression model had worse results on all metrics, probably because it is a simpler and less flexible model.
In addition, it seems to me that setting pos_label to 1 in make_scorer is not working, since the recall results are very low, which does not make sense for a balanced dataset.
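One point worth checking here is that `pos_label=1` is already the default for `recall_score`, so passing it explicitly should not change the score. A positive-class recall around 0.8 alongside a precision near 1.0, as in the reports above, is also what a conservative classifier produces at the default 0.5 cutoff, even on balanced data. A minimal sketch on hypothetical labels:

```python
from sklearn.metrics import recall_score

y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # balanced toy labels
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # cautious classifier: no false positives

# pos_label defaults to 1 for binary targets, so both calls agree.
default_recall = recall_score(y_true, y_pred)
explicit_recall = recall_score(y_true, y_pred, pos_label=1)
print(default_recall, explicit_recall)
```

Here precision is perfect (no false positives) while recall is limited by the two missed positives, so a modest recall does not by itself indicate a scorer problem.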