# Model Research

Given the nature of the dataset, the following models will be tested:
- RandomForestClassifier: chosen for its suitability for multiclass classification, its robustness to overfitting, and its ability to cope with imbalanced data.
- XGBClassifier: chosen for the accuracy that gradient boosting typically provides, its multiclass support through the multi:softmax objective, and its scalability.
- AdaBoostClassifier: chosen because it focuses on hard-to-classify samples, which can help with difficult classes, and because it handles multiclass problems simply through the SAMME algorithm.

A BaggingClassifier over decision trees is also trained below for comparison.

In [19]:
# import needed libraries
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, precision_recall_curve
from xgboost import XGBClassifier
# split data set
from sklearn.model_selection import train_test_split
from helpers import basic_eda as beda

In [2]:
# import dataset
curated_data = pd.read_csv('../data/curated/curated_records.csv')
curated_data.head()

Unnamed: 0,age,sex,chest_pain_type,resting_blood_pressure,serum_cholesterol,fasting_blood_sugar_g_120,resting_egc_results,maximum_heart_rate,exersize_induced_angina,st_depression,slope_of_peak_exercise_st_segment,major_vessels,thalassemia,target
0,63,male,typical angina,145,233,True,left_ventricular_hypertrophy,150,False,2.3,downsloping,0.0,fixed_defect,0
1,67,male,asymptomatic,160,286,False,left_ventricular_hypertrophy,108,True,1.5,flat,3.0,normal,2
2,67,male,asymptomatic,120,229,False,left_ventricular_hypertrophy,129,True,2.6,flat,2.0,reversible_defect,1
3,37,male,non-anginal pain,130,250,False,normal,187,False,3.5,downsloping,0.0,normal,0
4,41,female,atypical angina,130,204,False,left_ventricular_hypertrophy,172,False,1.4,upsloping,0.0,normal,0


In [3]:
# inspect column names and dtypes to decide which columns need encoding
curated_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 297 entries, 0 to 296
Data columns (total 14 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   age                                297 non-null    int64  
 1   sex                                297 non-null    object 
 2   chest_pain_type                    297 non-null    object 
 3   resting_blood_pressure             297 non-null    int64  
 4   serum_cholesterol                  297 non-null    int64  
 5   fasting_blood_sugar_g_120          297 non-null    bool   
 6   resting_egc_results                297 non-null    object 
 7   maximum_heart_rate                 297 non-null    int64  
 8   exersize_induced_angina            297 non-null    bool   
 9   st_depression                      297 non-null    float64
 10  slope_of_peak_exercise_st_segment  297 non-null    object 
 11  major_vessels                      297 non-null    float64

In [4]:
dummy_columns = ['sex', 'chest_pain_type', 'fasting_blood_sugar_g_120', 'resting_egc_results', 'exersize_induced_angina', 'slope_of_peak_exercise_st_segment', 'thalassemia']

In [5]:
# one-hot encode the categorical variables, dropping the first level of each
dummy_data = pd.get_dummies(curated_data,
                            columns=dummy_columns,
                            drop_first=True,
                            dtype=int)

In [6]:
dummy_data.shape

(297, 19)
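
To illustrate why the encoded frame has fewer columns than a full one-hot encoding would produce, here is a minimal sketch on a toy frame. The three rows and the shortened column name `slope` are made up for illustration and are not taken from the curated dataset. With `drop_first=True`, pandas drops the first (alphabetically sorted) level of each encoded column, so a column with k levels yields k - 1 indicator columns.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'age': [63, 67, 41],
    'sex': ['male', 'male', 'female'],
    'slope': ['downsloping', 'flat', 'upsloping'],
})

# Encode the two categorical columns; the first level of each is dropped
encoded = pd.get_dummies(toy,
                         columns=['sex', 'slope'],
                         drop_first=True,
                         dtype=int)

print(list(encoded.columns))
print(encoded.shape)
```

Here `sex` (two levels) contributes one column and `slope` (three levels) contributes two, so the toy frame ends up with four columns including `age`.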

In [7]:
train_validation, test = train_test_split(dummy_data, 
                                          test_size=0.2, 
                                          random_state=42,
                                          stratify=dummy_data['target']
                                          )

In [8]:
train, validation = train_test_split(train_validation,
                                     test_size=0.25, 
                                     random_state=42,
                                     stratify=train_validation['target'] 
                                     )

In [9]:
train.shape

(177, 19)
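
The training shape above follows from how the two splits compose: 20% of all rows go to the test set, then 25% of the remainder goes to validation. A minimal sketch of that arithmetic, assuming the test share is rounded up to a whole row (an assumption that is consistent with the 177 training rows reported above). `split_sizes` is a hypothetical helper, not part of the notebook or of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), rounding the test share up to a whole row."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

# First split: hold out 20% of the 297 rows for testing
n_train_validation, n_test = split_sizes(297, 0.2)

# Second split: hold out 25% of the remainder for validation
n_train, n_validation = split_sizes(n_train_validation, 0.25)

print(n_train, n_validation, n_test)
print(round(n_train / 297, 3), round(n_validation / 297, 3), round(n_test / 297, 3))
```

Under that assumption the result is roughly a 60/20/20 split into training, validation, and test sets.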

In [10]:
# separate features and target
X_train, Y_train = beda.separate_target(train, 'target')
X_validation, Y_validation = beda.separate_target(validation, 'target')
X_test, Y_test = beda.separate_target(test, 'target')


In [None]:
# bagging over shallow decision trees, each trained on a bootstrap sample
baggin_clf = BaggingClassifier(estimator=DecisionTreeClassifier(max_features='sqrt',
                                                                max_depth=5,
                                                                max_leaf_nodes=5),
                               n_estimators=1000,
                               n_jobs=-1,
                               random_state=42,
                               bootstrap=True)

# random forest with the same number of trees and depth limit
forest_clf = RandomForestClassifier(n_estimators=1000,
                                    max_depth=5,
                                    random_state=42,
                                    n_jobs=-1)
# AdaBoost over shallow decision trees with a small learning rate
ada_clf = AdaBoostClassifier(n_estimators=100,
                            learning_rate=0.01,
                            estimator=DecisionTreeClassifier(max_features='sqrt',
                                                            max_leaf_nodes=10,
                                                            max_depth=5,
                                                            random_state=42,
                                                            ),
                            random_state=42)
# gradient-boosted trees with the multiclass softmax objective
xgb_clf = XGBClassifier(n_estimators=1000,
                        max_depth=5,
                        random_state=42,
                        n_jobs=-1,
                        objective='multi:softmax',
                        learning_rate=0.01)
# fit all models on the training split only
forest_clf.fit(X_train, Y_train)
ada_clf.fit(X_train, Y_train)
xgb_clf.fit(X_train, Y_train)
baggin_clf.fit(X_train, Y_train)

In [31]:
# evaluate each model on the validation split
forest_data = {'name': 'Random Forest',
               'model': forest_clf,
               'X_validation': X_validation,
               'Y_validation': Y_validation}
prediction_forest = beda.evaluate_model(**forest_data)

ada_data = {'name': 'AdaBoost',
            'model': ada_clf,
            'X_validation': X_validation,
            'Y_validation': Y_validation}
prediction_ada = beda.evaluate_model(**ada_data)

xgb_data = {'name': 'XGBoost',
            'model': xgb_clf,
            'X_validation': X_validation,
            'Y_validation': Y_validation}
prediction_xgb = beda.evaluate_model(**xgb_data)

bagging_data = {'name': 'Bagging',
                'model': baggin_clf,
                'X_validation': X_validation,
                'Y_validation': Y_validation}
prediction_bagging = beda.evaluate_model(**bagging_data)

.:EVALUATING MODEL: Random Forest:.
Predicciones
Predicción clase 0: 42
Predicción clase 1: 6
Predicción clase 2: 8
Predicción clase 3: 4
Métricas
F1 Score: 0.5275364134187664
F1 Score (macro): 0.3117521799874742
Precision: 0.4990079365079365
Precision (macro): 0.3261904761904762
Recall: 0.5833333333333334
Recall (macro): 0.3199675324675324
Score: 0.5833333333333334
.:EVALUATING MODEL: AdaBoost:.
Predicciones
Predicción clase 0: 41
Predicción clase 1: 10
Predicción clase 2: 6
Predicción clase 3: 3
Métricas
F1 Score: 0.5204318664592636
F1 Score (macro): 0.29420041146068543
Precision: 0.50579945799458
Precision (macro): 0.3330081300813008
Recall: 0.5666666666666667
Recall (macro): 0.2913961038961038
Score: 0.5666666666666667
.:EVALUATING MODEL: XGBoost:.
Predicciones
Predicción clase 0: 34
Predicción clase 1: 11
Predicción clase 2: 7
Predicción clase 3: 7
Predicción clase 4: 1
Métricas
F1 Score: 0.5457070707070707
F1 Score (macro): 0.4203463203463203
Precision: 0.5588235294117647
Precisi
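
The gap between each metric and its macro variant is worth unpacking. In the Random Forest and AdaBoost blocks the unlabelled Recall equals Score, which is what a support-weighted average would give. This is an inference from the printed numbers, not something the helper's code confirms. The sketch below uses made-up labels and plain Python to show how macro and weighted averaging diverge when minority classes are predicted poorly.

```python
from collections import Counter

# Made-up imbalanced labels, for illustration only
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

support = Counter(y_true)  # number of true samples per class

# Per-class recall: correctly predicted samples of a class / samples of that class
recalls = {}
for label in sorted(support):
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    recalls[label] = hits / support[label]

# Macro average: every class counts equally, regardless of its size
macro = sum(recalls.values()) / len(recalls)

# Weighted average: each class counts in proportion to its support
weighted = sum(recalls[label] * support[label] for label in recalls) / len(y_true)

# Plain accuracy, for comparison with the weighted recall
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

print(recalls)
print(macro, weighted, accuracy)
```

Weighted recall always coincides with accuracy, while the macro average is pulled down by the classes the model misses, so the macro figures are the more informative ones for an imbalanced target.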

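Based on the validation metrics shown above, XGBoost performs best on the imbalanced dataset: its macro F1 is about 0.42, versus 0.31 for Random Forest and 0.29 for AdaBoost, and it is the only one of these three that predicts class 4. Next, I'll try an oversampling algorithm to improve the models' performance on the minority classes.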
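
As a rough idea of what oversampling does, here is a minimal sketch of naive random oversampling in plain Python. The notebook does not say which algorithm will be used, so this is only an assumed illustration. `random_oversample` is a hypothetical helper and the data is made up. Resampling of this kind should be applied to the training split only, so that the validation and test sets keep the original class distribution.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen minority-class rows until all classes match the largest one."""
    rng = random.Random(seed)

    # Group rows by class label
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Draw extra rows with replacement to reach the target count
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Made-up training data: class 0 dominates
rows = [[63, 145], [67, 160], [67, 120], [37, 130], [41, 130], [56, 120]]
labels = [0, 0, 0, 0, 1, 2]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Because the added rows are exact duplicates, this approach can encourage overfitting to the minority samples. Synthetic approaches that interpolate between neighbours are a common alternative.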
