#### **Data dictionary**
| Variable         | Type            | Description                                                                          |
| ---------------- | --------------- | ------------------------------------------------------------------------------------ |
| HeartDisease     | categorical     | Whether the person has been diagnosed with heart disease. (Yes/No)                   |
| BMI              | numeric (float) | Body mass index.                                                                     |
| Smoking          | categorical     | Whether the person smokes. (Yes/No)                                                  |
| AlcoholDrinking  | categorical     | Whether the person drinks alcohol in excess. (Yes/No)                                |
| Stroke           | categorical     | Whether the person has had a stroke. (Yes/No)                                        |
| PhysicalHealth   | numeric (float) | Number of days in the past month on which physical health was not good.              |
| MentalHealth     | numeric (float) | Number of days in the past month on which mental health was not good.                |
| DiffWalking      | categorical     | Whether the person has difficulty walking or climbing stairs. (Yes/No)               |
| Sex              | categorical     | Sex of the person. (Male/Female)                                                     |
| AgeCategory      | categorical     | Age category of the person (e.g. 55-59, 80 or older, etc.)                           |
| Race             | categorical     | Race or ethnic group (White, Black, etc.)                                            |
| Diabetic         | categorical     | Diabetic status of the person (Yes; No; Yes (during pregnancy); No, borderline diabetes) |
| PhysicalActivity | categorical     | Whether the person did physical activity in the past 30 days. (Yes/No)               |
| GenHealth        | categorical     | Self-assessed general health (Excellent, Very good, Good, Fair, Poor)                |
| SleepTime        | numeric (float) | Average hours of sleep per night.                                                    |
| Asthma           | categorical     | Whether the person has been diagnosed with asthma. (Yes/No)                          |
| KidneyDisease    | categorical     | Whether the person has kidney disease. (Yes/No)                                      |
| SkinCancer       | categorical     | Whether the person has had skin cancer. (Yes/No)                                     |


**Library imports**

In [22]:
### Load the libraries
# Library for mathematical and statistical operations
import numpy as np
# Library for data handling
import pandas as pd
# Library for 2D plots
import matplotlib.pyplot as plt

# Encoder for converting categorical labels to integers
from sklearn.preprocessing import LabelEncoder
# Utility for balancing the data
from sklearn.utils import resample
# Decision tree, used for feature selection and for prediction
from sklearn.tree import DecisionTreeClassifier
# SVM, for the support vector classifier model
from sklearn.svm import SVC
# Naive Bayes model
from sklearn.naive_bayes import GaussianNB
# Train/test split
from sklearn.model_selection import train_test_split
# Classification metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, precision_score, recall_score, f1_score

# Regression metrics
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Standardizes features using their mean and standard deviation
from sklearn.preprocessing import StandardScaler
# Grid search
from sklearn.model_selection import GridSearchCV

# Ignore warnings
import warnings
warnings.simplefilter('ignore')

**Loading the data**

In [23]:
### Load the data
data = pd.read_csv("heart_2020_cleaned.csv", sep=",")
data

Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer
0,No,16.60,Yes,No,No,3.0,30.0,No,Female,55-59,White,Yes,Yes,Very good,5.0,Yes,No,Yes
1,No,20.34,No,No,Yes,0.0,0.0,No,Female,80 or older,White,No,Yes,Very good,7.0,No,No,No
2,No,26.58,Yes,No,No,20.0,30.0,No,Male,65-69,White,Yes,Yes,Fair,8.0,Yes,No,No
3,No,24.21,No,No,No,0.0,0.0,No,Female,75-79,White,No,No,Good,6.0,No,No,Yes
4,No,23.71,No,No,No,28.0,0.0,Yes,Female,40-44,White,No,Yes,Very good,8.0,No,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
319790,Yes,27.41,Yes,No,No,7.0,0.0,Yes,Male,60-64,Hispanic,Yes,No,Fair,6.0,Yes,No,No
319791,No,29.84,Yes,No,No,0.0,0.0,No,Male,35-39,Hispanic,No,Yes,Very good,5.0,Yes,No,No
319792,No,24.24,No,No,No,0.0,0.0,No,Female,45-49,Hispanic,No,Yes,Good,6.0,No,No,No
319793,No,32.81,No,No,No,0.0,0.0,No,Female,25-29,Hispanic,No,No,Good,12.0,No,No,No


**Preprocessing**

In [24]:
### Make sure the numeric columns are stored as numbers ###
data['PhysicalHealth'] = pd.to_numeric(data['PhysicalHealth'])
data['MentalHealth'] = pd.to_numeric(data['MentalHealth'])
data['SleepTime'] = pd.to_numeric(data['SleepTime'])

In [25]:
### Encode the categorical string columns as integers
le = LabelEncoder()

data['HeartDisease'] = le.fit_transform( data['HeartDisease'] )
data['Smoking'] = le.fit_transform( data['Smoking'] )
data['AlcoholDrinking'] = le.fit_transform( data['AlcoholDrinking'] )

data['Stroke'] = le.fit_transform( data['Stroke'] )
data['DiffWalking'] = le.fit_transform( data['DiffWalking'] )
data['Sex'] = le.fit_transform( data['Sex'] )

data['AgeCategory'] = le.fit_transform( data['AgeCategory'] )
data['Race'] = le.fit_transform( data['Race'] )

data['Diabetic'] = le.fit_transform( data['Diabetic'] )
data['PhysicalActivity'] = le.fit_transform( data['PhysicalActivity'] )
data['GenHealth'] = le.fit_transform( data['GenHealth'] )

data['Asthma'] = le.fit_transform( data['Asthma'] )
data['KidneyDisease'] = le.fit_transform( data['KidneyDisease'] )
data['SkinCancer'] = le.fit_transform( data['SkinCancer'] )

data

Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer
0,0,16.60,1,0,0,3.0,30.0,0,0,7,5,2,1,4,5.0,1,0,1
1,0,20.34,0,0,1,0.0,0.0,0,0,12,5,0,1,4,7.0,0,0,0
2,0,26.58,1,0,0,20.0,30.0,0,1,9,5,2,1,1,8.0,1,0,0
3,0,24.21,0,0,0,0.0,0.0,0,0,11,5,0,0,2,6.0,0,0,1
4,0,23.71,0,0,0,28.0,0.0,1,0,4,5,0,1,4,8.0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
319790,1,27.41,1,0,0,7.0,0.0,1,1,8,3,2,0,1,6.0,1,0,0
319791,0,29.84,1,0,0,0.0,0.0,0,1,3,3,0,1,4,5.0,1,0,0
319792,0,24.24,0,0,0,0.0,0.0,0,0,5,3,0,1,2,6.0,0,0,0
319793,0,32.81,0,0,0,0.0,0.0,0,0,1,3,0,0,2,12.0,0,0,0
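
`LabelEncoder` numbers the classes in sorted order, which is why `Very good` becomes 4 and `Fair` becomes 1 in the table above: the `GenHealth` codes are alphabetical and do not follow the health scale. The following is a minimal sketch of an explicit ordinal mapping as one possible alternative. The names `alphabetical_codes` and `ordinal_codes` are hypothetical and not part of the notebook, and the results reported here were obtained with the alphabetical codes.

```python
# GenHealth labels as listed in the data dictionary, ordered from best to worst.
labels = ["Excellent", "Very good", "Good", "Fair", "Poor"]

# LabelEncoder assigns codes following the sorted class labels;
# sorted() reproduces that ordering for plain strings.
alphabetical_codes = {label: code for code, label in enumerate(sorted(labels))}

# Hypothetical explicit mapping that keeps the ordinal meaning (0 = best, 4 = worst).
ordinal_codes = {label: code for code, label in enumerate(labels)}

print("alphabetical:", alphabetical_codes)
print("ordinal:     ", ordinal_codes)
```

With pandas, such a dictionary could be applied to the raw column with `data['GenHealth'].map(ordinal_codes)` in place of the `LabelEncoder` call.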


In [26]:
### Check for NaN values
print("Column        NaN")
data.isnull().sum()

Column        NaN


Unnamed: 0,0
HeartDisease,0
BMI,0
Smoking,0
AlcoholDrinking,0
Stroke,0
PhysicalHealth,0
MentalHealth,0
DiffWalking,0
Sex,0
AgeCategory,0


In [27]:
### Count the records for each class of the target variable HeartDisease
data['HeartDisease'].value_counts()

HeartDisease,count
0,292422
1,27373


**Undersample**

In [28]:
### Balance the classes by undersampling the majority class
df_cero = data[ data['HeartDisease'] == 0 ]
df_uno = data[ data['HeartDisease'] == 1 ]

# Note: replace=True draws with replacement, so some majority rows may be repeated
df_resample = resample(df_cero,
                       replace=True,
                       n_samples=27373,
                       random_state = 42)

df = pd.concat( [df_resample, df_uno] )

df['HeartDisease'].value_counts()

HeartDisease,count
0,27373
1,27373
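
Undersampling is often done without replacement so that no majority row appears twice. The sketch below shows that variant on a small made-up frame. The `toy` data and the variable names are assumptions for illustration only, not the notebook's data, and the results in this notebook come from the cell above as written.

```python
import pandas as pd
from sklearn.utils import resample

# Toy, imbalanced frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "HeartDisease": [0] * 8 + [1] * 2,
    "BMI": [20.0 + i for i in range(10)],
})

majority = toy[toy["HeartDisease"] == 0]
minority = toy[toy["HeartDisease"] == 1]

# replace=False draws each majority row at most once;
# n_samples is tied to the minority size instead of a hard-coded count.
majority_down = resample(majority,
                         replace=False,
                         n_samples=len(minority),
                         random_state=42)

balanced = pd.concat([majority_down, minority])
counts = balanced["HeartDisease"].value_counts()
print(counts)
```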


**Training and preliminary selection of the best features**

In [29]:
### Split the data into training and test sets
features = ["BMI", "Smoking", "AlcoholDrinking", "Stroke", "PhysicalHealth", "MentalHealth", "DiffWalking",
        "Sex", "AgeCategory", "Race", "Diabetic", "PhysicalActivity", "GenHealth", "SleepTime",
        "Asthma", "KidneyDisease", "SkinCancer"]

# Features
X = df[features]

# Target variable
y = df['HeartDisease'].values

# 80% training and 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state= 42)

**Feature selection process**

In [16]:
### Run the feature selection process
# Define the DT algorithm
dt = DecisionTreeClassifier()

# Train the model
dt.fit(X_train, y_train)

# Generate the prediction
predictDT = dt.predict(X_test)

# Feature importances
importances = dt.feature_importances_
feature_names = X.columns

# Build a DataFrame for easier inspection
feature_importances = pd.DataFrame({'Característica': feature_names, 'Importancia': importances})
feature_importances = feature_importances.sort_values('Importancia', ascending=False)

print(feature_importances)


      Característica  Importancia
0                BMI     0.362299
13         SleepTime     0.101767
8        AgeCategory     0.099237
4     PhysicalHealth     0.071888
5       MentalHealth     0.065411
12         GenHealth     0.044074
9               Race     0.034414
3             Stroke     0.030998
11  PhysicalActivity     0.030059
1            Smoking     0.025754
10          Diabetic     0.025577
6        DiffWalking     0.023191
16        SkinCancer     0.022900
14            Asthma     0.022757
7                Sex     0.015055
15     KidneyDisease     0.014722
2    AlcoholDrinking     0.009897
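
A ranking like this can be turned into a feature list programmatically instead of by hand. The sketch below does so using the importances copied from the table above. The cutoff `k = 6` and the names `importances` and `top_k` are assumptions for illustration. Because the tree is built without a fixed `random_state`, the exact values may differ between runs.

```python
# Importances copied from the printed table above.
importances = {
    "BMI": 0.362299, "SleepTime": 0.101767, "AgeCategory": 0.099237,
    "PhysicalHealth": 0.071888, "MentalHealth": 0.065411, "GenHealth": 0.044074,
    "Race": 0.034414, "Stroke": 0.030998, "PhysicalActivity": 0.030059,
    "Smoking": 0.025754, "Diabetic": 0.025577, "DiffWalking": 0.023191,
    "SkinCancer": 0.022900, "Asthma": 0.022757, "Sex": 0.015055,
    "KidneyDisease": 0.014722, "AlcoholDrinking": 0.009897,
}

k = 6  # hypothetical cutoff, not taken from the notebook

# Sort feature names by importance, highest first, and keep the first k.
top_k = sorted(importances, key=importances.get, reverse=True)[:k]
print(top_k)
```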


**Selection of the best features**

In [30]:
### Select the features used by the models
# Note: this list is the first six columns, not the six top-ranked by the importances above
# (BMI, SleepTime, AgeCategory, PhysicalHealth, MentalHealth, GenHealth); the results below use it as written
features = ["BMI", "Smoking", "AlcoholDrinking", "Stroke", "PhysicalHealth", "MentalHealth"]
# Features
X = df[features]
# Target variable
y = df['HeartDisease'].values

# 80% training and 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state= 42)

**DecisionTree model**

In [31]:
### Generate the prediction
# Select the algorithm
dtc = DecisionTreeClassifier()

# Train the algorithm
dtc.fit(X_train, y_train)

# Generate the prediction
predict = dtc.predict(X_test)

# Print the metrics
print("\n #-----Algoritmo de decision tree-----#")
print(confusion_matrix(y_test, predict))
print(classification_report(y_test, predict))
print("Precisión:", precision_score(y_test, predict))
print("Recall:", recall_score(y_test, predict))
print("F1-Score:", f1_score(y_test, predict))


 #-----Algoritmo de decision tree-----#
[[3648 1786]
 [2602 2914]]
              precision    recall  f1-score   support

           0       0.58      0.67      0.62      5434
           1       0.62      0.53      0.57      5516

    accuracy                           0.60     10950
   macro avg       0.60      0.60      0.60     10950
weighted avg       0.60      0.60      0.60     10950

Precisión: 0.62
Recall: 0.528281363306744
F1-Score: 0.5704776820673454
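
The three scores printed last refer to class 1 and can be recomputed by hand from the confusion matrix, which is a useful sanity check. A minimal sketch using the counts printed above follows (the variable names are illustrative):

```python
# Confusion matrix printed above: rows are true classes, columns are predictions.
tn, fp = 3648, 1786   # true class 0
fn, tp = 2602, 2914   # true class 1

precision = tp / (tp + fp)                           # share of predicted positives that are correct
recall = tp / (tp + fn)                              # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
print("Accuracy:", round(accuracy, 2))
```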


**DT hyperparameters**

In [32]:
# Feature scaling (trees do not need it, but later cells reuse the scaled X_train and X_test)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Classification model
dtc = DecisionTreeClassifier()

# Hyperparameters to tune
param_grid = {
    'criterion': ['gini'],  # 'entropy' is often slightly costlier and tends to give similar results
    'max_depth': [5, 10, 20],  # moderate depths to limit overfitting
    'min_samples_split': [5, 10],  # avoids heavily branched trees
    'min_samples_leaf': [2, 4]     # minimum leaf size, a light regularization
}
# GridSearchCV
grid_search = GridSearchCV(estimator=dtc,
                           param_grid=param_grid,
                           cv=10,
                           scoring='accuracy')

# Training with grid search
searchResults = grid_search.fit(X_train, y_train.ravel())

# Best model
bestModel = searchResults.best_estimator_

print("Best Parameters (GridSearch):", bestModel)
print("-----------------------------------------------------------")

# Train the model with the best parameters
dtc = bestModel
dtc.fit(X_train, y_train)

# Prediction
pred = dtc.predict(X_test)

# Evaluation metrics
print("Accuracy:", round(accuracy_score(y_test, pred), 3))
print("\nClassification Report:\n", classification_report(y_test, pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, pred))
print("Precisión: \n", precision_score(y_test, pred))
print("Recall: \n", recall_score(y_test, pred))
print("F1-Score: \n", f1_score(y_test, pred))


Best Parameters (GridSearch): DecisionTreeClassifier(max_depth=5, min_samples_leaf=2, min_samples_split=5)
-----------------------------------------------------------
Accuracy: 0.65

Classification Report:
               precision    recall  f1-score   support

           0       0.64      0.67      0.66      5434
           1       0.66      0.63      0.64      5516

    accuracy                           0.65     10950
   macro avg       0.65      0.65      0.65     10950
weighted avg       0.65      0.65      0.65     10950

Confusion Matrix:
 [[3654 1780]
 [2054 3462]]
Precisión: 
 0.6604349484929416
Recall: 
 0.6276287164612038
F1-Score: 
 0.6436140546569994


**NB model**

In [33]:
# Scaling is disabled here: X_train and X_test were already scaled in the previous cell
# scaler = StandardScaler()
# X_train = scaler.fit_transform(X_train)
# X_test = scaler.transform(X_test)

# Classification model: Naive Bayes
nb = GaussianNB()

# Training
nb.fit(X_train, y_train)

# Prediction
pred = nb.predict(X_test)

# Results
print("\n#-----Algoritmo de Naive Bayes (GaussianNB)-----#")
print("Accuracy:", round(accuracy_score(y_test, pred), 3))
print("Confusion Matrix:\n", confusion_matrix(y_test, pred))
print("Classification Report:\n", classification_report(y_test, pred))
print("Precisión:", precision_score(y_test, pred, average='weighted'))
print("Recall:", recall_score(y_test, pred, average='weighted'))
print("F1-Score:", f1_score(y_test, pred, average='weighted'))


#-----Algoritmo de Naive Bayes (GaussianNB)-----#
Accuracy: 0.624
Confusion Matrix:
 [[4764  670]
 [3443 2073]]
Classification Report:
               precision    recall  f1-score   support

           0       0.58      0.88      0.70      5434
           1       0.76      0.38      0.50      5516

    accuracy                           0.62     10950
   macro avg       0.67      0.63      0.60     10950
weighted avg       0.67      0.62      0.60     10950

Precisión: 0.66876721464919
Recall: 0.6243835616438356
F1-Score: 0.5995044721322772
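
The decision tree cells call `precision_score`, `recall_score` and `f1_score` with their default binary averaging (class 1 only), while the Naive Bayes cells pass `average='weighted'`, so the final three numbers of the two models are not directly comparable. The sketch below recomputes both variants from the Naive Bayes confusion matrix above (the variable names are illustrative):

```python
# Naive Bayes confusion matrix printed above: rows are true classes, columns are predictions.
tn, fp = 4764, 670    # true class 0
fn, tp = 3443, 2073   # true class 1

support_0 = tn + fp   # number of true class 0 samples
support_1 = fn + tp   # number of true class 1 samples
total = support_0 + support_1

# Per-class precision and recall.
precision_0, precision_1 = tn / (tn + fn), tp / (tp + fp)
recall_0, recall_1 = tn / support_0, tp / support_1

# average='weighted': per-class scores weighted by class support.
weighted_precision = (precision_0 * support_0 + precision_1 * support_1) / total
weighted_recall = (recall_0 * support_0 + recall_1 * support_1) / total

print("Binary (class 1) precision:", precision_1)
print("Binary (class 1) recall:", recall_1)
print("Weighted precision:", weighted_precision)
print("Weighted recall:", weighted_recall)
```

Weighted recall always equals accuracy, which is why it matches the 0.624 reported above; the class 1 recall is considerably lower.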


**NB hyperparameters**

In [34]:
# Feature scaling (X_train and X_test were already standardized above, so this should leave them essentially unchanged)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Classification model
nb = GaussianNB()

# Hyperparameter to tune
param_grid = {
    'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6]
}

# GridSearchCV
grid_search = GridSearchCV(estimator=nb,
                           param_grid=param_grid,
                           cv=10,
                           scoring='accuracy')

# Training with grid search
searchResults = grid_search.fit(X_train, y_train.ravel())

# Best model
bestModel = searchResults.best_estimator_

print("Best Parameters (GridSearch):", searchResults.best_params_)
print("-----------------------------------------------------------")

# Train the model with the best parameters
nb = bestModel
nb.fit(X_train, y_train)

# Prediction
pred = nb.predict(X_test)

# Evaluation metrics
print("Accuracy:", round(accuracy_score(y_test, pred), 3))
print("\nClassification Report:\n", classification_report(y_test, pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, pred))
print("Precisión:\n", precision_score(y_test, pred, average='weighted'))
print("Recall:\n", recall_score(y_test, pred, average='weighted'))
print("F1-Score:\n", f1_score(y_test, pred, average='weighted'))

Best Parameters (GridSearch): {'var_smoothing': 1e-09}
-----------------------------------------------------------
Accuracy: 0.624

Classification Report:
               precision    recall  f1-score   support

           0       0.58      0.88      0.70      5434
           1       0.76      0.38      0.50      5516

    accuracy                           0.62     10950
   macro avg       0.67      0.63      0.60     10950
weighted avg       0.67      0.62      0.60     10950

Confusion Matrix:
 [[4764  670]
 [3443 2073]]
Precisión:
 0.66876721464919
Recall:
 0.6243835616438356
F1-Score:
 0.5995044721322772
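
One way to keep scaling and tuning together is to wrap both steps in a `Pipeline`, so the scaler is refit inside every cross-validation fold. This is a minimal sketch on synthetic data from `make_classification`; the data, the step names `scaler` and `nb`, and the sample size are assumptions for illustration, not the notebook's setup. With the default `refit=True`, `GridSearchCV` already refits the best configuration on the full training set, so no extra `fit` call is needed.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the six selected features (not the heart disease data).
X, y = make_classification(n_samples=400, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

# The scaler and the classifier are fitted together inside each CV fold.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("nb", GaussianNB()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {"nb__var_smoothing": [1e-9, 1e-8, 1e-7, 1e-6]}

search = GridSearchCV(pipe, param_grid=param_grid, cv=10, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print("Test accuracy:", round(test_accuracy, 3))
```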
