Pontificia Universidad Católica de Chile <br>
Departamento de Ciencia de la Computación <br>
IIC2433 - Data Mining
<br>

<center>
    <h2> Data Mining Project</h2>
    <h1> Predicting a Stroke </h1>
    <p>
        Professor Marcelo Mendoza<br>
        First Semester 2023<br>    
        Due date: Tuesday, June 13
    </p>
    <p>Team members: Lucas Aguilera, Claudio Bórquez, Josefa Fernández
    </p>
    <br>
</center>

<br>

---

The libraries used in this project are imported:

In [None]:
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential, Model
from keras.layers import Dense, Embedding, Input
from keras.utils import to_categorical

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.linear_model import LogisticRegression


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# 1-Data Analysis, Preparation and Feature Selection

## Part 1: Loading the data
- The dataset is downloaded from the link provided.

This project uses clinical data on strokes, obtained from Kaggle:

https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset



In [2]:
dframe = pd.read_csv("../data/healthcare-dataset-stroke-data.csv", encoding = "ISO-8859-1")

## Part 2: Data exploration
- The data are explored to understand their structure, features and potential problems.


In [3]:
dframe.shape

(5110, 12)

In [4]:
dframe.head(10)

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
5,56669,Male,81.0,0,0,Yes,Private,Urban,186.21,29.0,formerly smoked,1
6,53882,Male,74.0,1,1,Yes,Private,Rural,70.09,27.4,never smoked,1
7,10434,Female,69.0,0,0,No,Private,Urban,94.39,22.8,never smoked,1
8,27419,Female,59.0,0,0,Yes,Private,Rural,76.15,,Unknown,1
9,60491,Female,78.0,0,0,Yes,Private,Urban,58.57,24.2,Unknown,1


- The data contain 5110 observations with 12 attributes.
- Null values are present in the bmi column.
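A quick way to check both observations, and to see how the target is distributed, is sketched below. The small `sample` frame is a hypothetical stand-in for `dframe`; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dframe: a handful of rows with the same column names.
sample = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0, 26.0],
    "bmi": [36.6, np.nan, 32.5, 34.4, 22.4],
    "stroke": [1, 1, 1, 1, 0],
})

null_counts = sample.isnull().sum()                           # missing values per column
class_share = sample["stroke"].value_counts(normalize=True)   # fraction of rows per label

print(null_counts)
print(class_share)
```

Running the same two calls on the real `dframe` would show how skewed the `stroke` label is before any modelling.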



The dataset is used to predict whether a patient is likely to suffer a stroke based on input parameters such as sex, age, various diseases and smoking status. Each row provides the relevant information for one patient.

The dataset will be used to build a predictive model. It contains the following attributes:
- id: unique identifier
- gender: "Male", "Female" or "Other"
- age: age of the patient
- hypertension: 0 if the patient does not have hypertension, 1 if the patient has hypertension
- heart_disease: 0 if the patient does not have any heart disease, 1 if the patient has a heart disease
- ever_married: "No" or "Yes"
- work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
- Residence_type: "Rural" or "Urban"
- avg_glucose_level: average glucose level in blood
- bmi: body mass index
- smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
- stroke: 1 if the patient had a stroke, 0 if not

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient.

All of this information was taken from Kaggle.

## Part 3: Preprocessing
- The data are cleaned to handle missing values, duplicates and other errors.
- The data are transformed where necessary, for example by encoding categorical variables.

During preprocessing we consider whether the following steps, among others, are needed:
- Removing columns
- Normalizing variables
- Handling null values

Categorical data are handled with One Hot Encoding.

In [5]:
dframe.head(5)  # check the first 5 rows of dframe

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


#### Checking for null values

In [6]:
dframe.isnull().sum()

id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64

### Removing columns
We drop the columns that provide no relevant information for predicting a stroke. Here that is only id, a unique identifier of each patient.

In [7]:
dframe.drop(['id'], axis=1, inplace=True)

### One Hot Encoding
Categorical columns with more than two possible values must be encoded with a one-hot encoder. The columns 'work_type' and 'smoking_status' fall into this group, so we use this method to convert them into numerical variables.

In [8]:
categorical_columns = ['work_type', 'smoking_status']
encoder = OneHotEncoder(sparse_output=False)

encoded_features = encoder.fit_transform(dframe[categorical_columns])

feature_names = encoder.get_feature_names_out(categorical_columns)

dframe_encoded = pd.DataFrame(encoded_features, columns=feature_names)

dframe = pd.concat([dframe_encoded, dframe.drop(categorical_columns, axis=1)], axis=1)
dframe.head()

Unnamed: 0,work_type_Govt_job,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,smoking_status_Unknown,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes,gender,age,hypertension,heart_disease,ever_married,Residence_type,avg_glucose_level,bmi,stroke
0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,Male,67.0,0,1,Yes,Urban,228.69,36.6,1
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,Female,61.0,0,0,Yes,Rural,202.21,,1
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,Male,80.0,0,1,Yes,Rural,105.92,32.5,1
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,Female,49.0,0,0,Yes,Urban,171.23,34.4,1
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,Female,79.0,1,0,Yes,Rural,174.12,24.0,1
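For comparison, the same encoding can be sketched with `pd.get_dummies`, which produces the same style of column names in one call. This is an alternative shown on a hypothetical `toy` frame, not the method used in this notebook. Unlike `OneHotEncoder`, `get_dummies` does not remember the categories it saw, so it is less convenient when new data must be encoded later.

```python
import pandas as pd

# Hypothetical rows using category values that appear in the dataset.
toy = pd.DataFrame({
    "work_type": ["Private", "Self-employed", "children"],
    "smoking_status": ["smokes", "Unknown", "never smoked"],
    "age": [49.0, 59.0, 3.0],
})

# One indicator column per category; the untouched columns are kept as they are.
encoded = pd.get_dummies(toy, columns=["work_type", "smoking_status"], dtype=float)
print(encoded.columns.tolist())
```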


### To avoid problems later when training the models, we remove the value "Other" from "gender"

In [9]:
print(dframe[dframe['gender'] == 'Other'])
# The comparison is case-sensitive: the label in the data is 'Other', not 'other'
rows_to_delete = dframe[dframe['gender'] == 'Other'].index
dframe.drop(rows_to_delete, inplace=True)
print(dframe[dframe['gender'] == 'Other'])  # now prints an empty DataFrame

      work_type_Govt_job  work_type_Never_worked  work_type_Private  \
3116                 0.0                     0.0                1.0   

      work_type_Self-employed  work_type_children  smoking_status_Unknown  \
3116                      0.0                 0.0                     0.0   

      smoking_status_formerly smoked  smoking_status_never smoked  \
3116                             1.0                          0.0   

      smoking_status_smokes gender   age  hypertension  heart_disease  \
3116                    0.0  Other  26.0             0              0   

     ever_married Residence_type  avg_glucose_level   bmi  stroke  
3116           No          Rural             143.33  22.4       0  

### Converting the binary categorical variables to numeric

In [10]:
gender_mapping = {'Male': 0, 'Female': 1}
marry_mapping = {'Yes': 1, 'No': 0}
residence_mapping = {'Urban': 1, 'Rural': 0}


dframe['gender'] = dframe['gender'].map(gender_mapping)
dframe['ever_married'] = dframe['ever_married'].map(marry_mapping)
dframe['Residence_type'] = dframe['Residence_type'].map(residence_mapping)
dframe.head()

Unnamed: 0,work_type_Govt_job,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,smoking_status_Unknown,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes,gender,age,hypertension,heart_disease,ever_married,Residence_type,avg_glucose_level,bmi,stroke
0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,67.0,0,1,1,1,228.69,36.6,1
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,61.0,0,0,1,0,202.21,,1
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,80.0,0,1,1,0,105.92,32.5,1
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,49.0,0,0,1,1,171.23,34.4,1
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,79.0,1,0,1,0,174.12,24.0,1
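One caveat with dictionary-based `map` is that any value missing from the dictionary becomes NaN, which also turns the column into floats. A label left out of the mapping would therefore only disappear at a later `dropna`. The sketch below illustrates this on a hypothetical three-value series.

```python
import pandas as pd

# Hypothetical series with one label that the mapping does not cover.
gender = pd.Series(["Male", "Female", "Other"])
mapped = gender.map({"Male": 0, "Female": 1})

print(mapped)               # the unmapped label shows up as NaN
print(mapped.isna().sum())  # number of values the mapping did not cover
```

Checking `isna().sum()` right after a mapping is a cheap way to catch unexpected categories.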


### Dropping null values

In [11]:
dframe.dropna(inplace=True)
dframe.isnull().sum()

work_type_Govt_job                0
work_type_Never_worked            0
work_type_Private                 0
work_type_Self-employed           0
work_type_children                0
smoking_status_Unknown            0
smoking_status_formerly smoked    0
smoking_status_never smoked       0
smoking_status_smokes             0
gender                            0
age                               0
hypertension                      0
heart_disease                     0
ever_married                      0
Residence_type                    0
avg_glucose_level                 0
bmi                               0
stroke                            0
dtype: int64
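Dropping rows removes every patient with a missing bmi, and with them some of the already scarce positive cases (the notebook reports 249 rows with stroke = 1 in the raw data and 209 after preprocessing). An alternative that keeps those rows is imputation. The sketch below uses median imputation with scikit-learn's `SimpleImputer` on a hypothetical `toy` frame; this is an option to evaluate, not what this notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical bmi column with one missing value.
toy = pd.DataFrame({"bmi": [36.6, np.nan, 32.5, 34.4, 24.0]})

# Replace missing entries with the median of the observed values.
imputer = SimpleImputer(strategy="median")
toy["bmi"] = imputer.fit_transform(toy[["bmi"]]).ravel()
print(toy)
```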

In [12]:
# TODO: analyze outliers and decide whether to drop ever_married

### Normalizing variables


In [13]:
numerical_features = ['age', 'bmi', 'avg_glucose_level']
scaler = MinMaxScaler()
dframe[numerical_features] = scaler.fit_transform(dframe[numerical_features])
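The cell above fits the scaler on the whole dataset before it is split, so the minimum and maximum of the test rows influence the transformation. A common way to avoid this kind of leakage is to fit the scaler on the training rows only. The sketch below shows the pattern on random data; `X_toy` is a hypothetical stand-in for the three numerical columns.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 100, size=(50, 3))  # hypothetical age, bmi, avg_glucose_level

X_tr, X_te = train_test_split(X_toy, test_size=0.3, random_state=42)

scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn min/max from the training rows only
X_te_scaled = scaler.transform(X_te)      # reuse those statistics on the test rows
```

With this pattern the test values may fall slightly outside [0, 1], which is expected.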

# 2-Implementation of data mining algorithms

## Part 0: Train/test split
- The data are split into training and test sets.

In [14]:
X = dframe.drop(['stroke'], axis=1)
y = dframe['stroke']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=42)
X_new_train, X_val, y_new_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=40)
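With a rare positive class, a purely random split can leave the splits with different class proportions. `train_test_split` accepts a `stratify` argument that preserves the proportions. The sketch below is a minimal illustration on hypothetical labels with 5% positives; it is not the split used for the results in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

y_toy = np.array([0] * 95 + [1] * 5)   # hypothetical labels, 5% positives
X_toy = np.arange(100).reshape(-1, 1)  # hypothetical single feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both splits keep the 5% share of positives.
print(y_tr.mean(), y_te.mean())
```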

## Part 1: MLP 
- The MLP (Multilayer Perceptron) algorithm is used to build a prediction model.


In [15]:
classes = np.unique(y_new_train)
print(classes)
print(X_train.shape[1])

[0 1]
17


In [16]:
inputs = Input(shape=(X_train.shape[1],))  # define the input and the layers
dense1 = Dense(128, activation="relu")
dense2 = Dense(64, activation="relu")
dense3 = Dense(len(classes), activation="softmax")

# connect the layers
x = dense1(inputs)
x = dense2(x)
outputs = dense3(x)

In [17]:
model = Model(inputs=inputs, outputs=outputs)  # build the model

model.summary()  # print the summary

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 17)]              0         
                                                                 
 dense (Dense)               (None, 128)               2304      
                                                                 
 dense_1 (Dense)             (None, 64)                8256      
                                                                 
 dense_2 (Dense)             (None, 2)                 130       
                                                                 
Total params: 10,690
Trainable params: 10,690
Non-trainable params: 0
_________________________________________________________________
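The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper `dense_params` below is a hypothetical name used only for this check.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


layer_sizes = [17, 128, 64, 2]  # input width followed by the three Dense layers
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(counts, sum(counts))  # [2304, 8256, 130] 10690
```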


In [18]:
y_new_train_encoded = to_categorical(y_new_train)
y_val_encoded = to_categorical(y_val)
print(y_val_encoded)
print("######")
print(y_new_train_encoded)

[[1. 0.]
 [1. 0.]
 [1. 0.]
 ...
 [1. 0.]
 [1. 0.]
 [1. 0.]]
######
[[1. 0.]
 [1. 0.]
 [1. 0.]
 ...
 [1. 0.]
 [1. 0.]
 [1. 0.]]


In [19]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])  # compile the model with the Adam optimizer
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)  # early-stopping callback with a patience of 3 epochs
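To make the meaning of `patience=3` concrete, the sketch below imitates the stopping rule in plain Python: training stops once the monitored value has failed to improve for three consecutive epochs. It is a simplified approximation of the Keras callback (no `min_delta`, no weight restoring), and `stopping_epoch` and the loss values are hypothetical. Note that the callback above monitors the training loss; since validation data is passed to `fit`, monitoring `val_loss` would be an alternative worth considering.

```python
def stopping_epoch(losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss  # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1    # no improvement in this epoch
            if wait >= patience:
                return epoch
    return None


# Hypothetical loss curve: the best value is reached at epoch 3.
print(stopping_epoch([0.30, 0.25, 0.24, 0.26, 0.25, 0.24]))
```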

## Part 2: Logistic Regression 
- Logistic Regression and GMM (Gaussian Mixture Models) are implemented as part of an extended pipeline to build a prediction model.


In [20]:
# Logistic Regression without GMM
# class_weight='balanced' reweights the classes inversely to their frequency
clf = LogisticRegression(class_weight='balanced')
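As an illustration of what `class_weight='balanced'` does, scikit-learn documents the weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula to the class counts of the cleaned dataset reported later in this notebook (4699 negatives, 209 positives). The model above is fitted on a training subset, so its actual weights differ somewhat; the numbers here are only meant to show the order of magnitude.

```python
import numpy as np

# Class counts of the cleaned dataset (stroke = 0, stroke = 1), used for illustration.
counts = np.array([4699, 209])

# Balanced weights: n_samples / (n_classes * count_per_class)
weights = counts.sum() / (len(counts) * counts)
print(weights)
```

Each positive example ends up weighing roughly twenty times as much as a negative one.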

# 3-Model evaluation

## Part 1: Training the models (MLP)
- The models are trained on the training data and their performance is evaluated on the test data.

In [21]:
model.fit(X_new_train, y_new_train_encoded, batch_size=32, epochs=20, validation_data=(X_val, y_val_encoded), callbacks=[callback])
# finally, fit the model on the training data, using the validation split defined earlier

Epoch 1/20


Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x22710ab4b80>

## Part 2: Evaluation metrics (MLP)
- Appropriate evaluation metrics are used, such as precision, sensitivity, specificity and F1-score.

In [22]:
Y_preds = model.predict(X_test).argmax(axis=-1)  # predict the class with the highest probability

print("Test Accuracy : {}".format(accuracy_score(y_test, Y_preds)))
print("\nClassification Report : ")

# the classes present in each split; this becomes relevant later on
print(np.unique(y_test), np.unique(Y_preds), np.unique(y_new_train), np.unique(y_val))
labels = ['0', '1']  # class labels
print(classification_report(y_test, Y_preds, zero_division=0, target_names=labels))  # classification report

Test Accuracy : 0.9456890699253224

Classification Report : 
[0 1] [0 1] [0 1] [0 1]
              precision    recall  f1-score   support

           0       0.95      0.99      0.97      1401
           1       0.00      0.00      0.00        72

    accuracy                           0.95      1473
   macro avg       0.48      0.50      0.49      1473
weighted avg       0.90      0.95      0.92      1473
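The classification report gives precision, recall (sensitivity) and F1, but not specificity. One way to obtain it is from the confusion matrix, as sketched below on hypothetical labels and predictions; the same lines could be applied to `y_test` and `Y_preds`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])  # hypothetical labels
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])   # hypothetical predictions

# For binary labels the matrix unravels as (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

sensitivity = tp / (tp + fn)  # share of actual positives that were detected
specificity = tn / (tn + fp)  # share of actual negatives that were detected
print(sensitivity, specificity)
```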



## Part 3: Training the models (LR)
- The models are trained on the training data and their performance is evaluated on the test data.

In [23]:
# Train the logistic regression model
clf.fit(X_new_train, y_new_train)

# Predict with the logistic regression model
y_pred = clf.predict(X_test)

## Part 4: Evaluation metrics (LR)
- Appropriate evaluation metrics are used, such as precision, sensitivity, specificity and F1-score.

In [24]:
# Evaluate the logistic regression model
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))


Accuracy: 0.745417515274949
              precision    recall  f1-score   support

           0       0.99      0.74      0.85      1401
           1       0.15      0.86      0.25        72

    accuracy                           0.75      1473
   macro avg       0.57      0.80      0.55      1473
weighted avg       0.95      0.75      0.82      1473



# 5-Analysis of results

## Part 1: Interpretation of results
- Interpret and analyze the results obtained from the different models.

Based on the results, the MLP has 0% precision and recall for class 1 of "stroke". Logistic regression also has low precision for that class (0.15), although it is better than the MLP and reaches a recall of 0.86. This means we are having problems with the classification, and the models are not learning the patterns of the minority class.

The class imbalance in the data is severe.
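The effect of the imbalance on accuracy can be made explicit with a trivial baseline: a classifier that always predicts the majority class. The sketch below uses only the test-set support and the MLP accuracy printed in the reports above.

```python
support = {0: 1401, 1: 72}  # test-set support per class, from the reports above

# Accuracy of always predicting the majority class.
baseline_accuracy = max(support.values()) / sum(support.values())
mlp_accuracy = 0.9456890699253224  # test accuracy printed for the MLP

print(round(baseline_accuracy, 4), mlp_accuracy < baseline_accuracy)
```

The MLP does not beat this baseline, which is why accuracy alone is a poor guide for this problem.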

## Part 2: To improve
- Make adjustments or improvements to the model as needed.

GMM (for data augmentation) or some other data augmentation method, based on the shape of the data.

Prioritize some columns over others (change of weights).

Investigate removing columns.

Try other models and/or change the hyperparameters of the models (activation, optimizer, epochs).

# 6 - Implementing improvements

## Part 1: Handling the class imbalance

To handle the class imbalance we undersample the majority class, here 'stroke' = 0. Each sample of the majority class is combined with the same full set of minority-class rows, 'stroke' = 1. Recall that the raw data had 249 rows with 'stroke' = 1 and 4861 rows with 'stroke' = 0. These counts changed during preprocessing, so to know how many groups we need we first have to check the current balance.

First we split the dataframe by label.

In [25]:
stroke_1 = dframe[dframe['stroke'] == 1]
stroke_0 = dframe[dframe['stroke'] == 0]

print(stroke_1.shape[0], stroke_0.shape[0])

209 4699


As we can see, we now have 209 rows with label 1 and 4699 rows with label 0. An integer division tells us how many groups of 209 rows we can form.

In [26]:
4699//209

22

The idea is to undersample by taking groups of about 209 rows from stroke_0, which gives 22 groups. To avoid any bias or intrinsic order in the original arrangement of the data, we first shuffle the rows of stroke_0 at random.

In [27]:
stroke_0_shuffled = stroke_0.sample(frac=1, random_state=42)

We split into 22 groups.

In [28]:
stroke_0_groups = np.array_split(stroke_0_shuffled, 22)
count = 0
for i in range(22):
    count +=stroke_0_groups[i].shape[0]

print(count)
print(stroke_0_shuffled.shape[0])

4699
4699


Although the groups are not all the same size, they add up to the correct number of rows, so we proceed to concatenate each group of stroke_0 with the same set of stroke_1 rows.
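The group sizes follow from how `np.array_split` distributes the remainder: 4699 = 22 * 213 + 13, so the first 13 groups receive one extra row. A minimal check on a plain index array of the same length:

```python
import numpy as np

# Split 4699 positions into 22 nearly equal groups, as done for stroke_0.
groups = np.array_split(np.arange(4699), 22)
sizes = [len(g) for g in groups]

print(sizes.count(214), sizes.count(213))  # 13 groups of 214 rows, 9 groups of 213
```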

In [29]:
training_sets = []
for group in stroke_0_groups:
    training_set = pd.concat([stroke_1, group])
    training_sets.append(training_set)

In [30]:
print(len(training_sets))

22


In [31]:
for sets in training_sets:
    print(sets['stroke'].value_counts())

stroke
0    214
1    209
Name: count, dtype: int64
stroke
0    214
1    209
Name: count, dtype: int64
stroke
0    214
1    209
Name: count, dtype: int64
stroke
0    214
1    209
Name: count, dtype: int64
stroke
0    214
1    209
Name: count, dtype: int64
stroke
0    214
1    209
Name: count, dtype: int64
stroke
0    214
1    209
Name: count, dtype: int64
stroke
0    214
1    209
Name: count, dtype: int64
stroke
0    214
1    209
Name: count, dtype: int64
stroke
0    214
1    209
Name: count, dtype: int64
stroke
0    214
1    209
Name: count, dtype: int64
stroke
0    214
1    209
Name: count, dtype: int64
stroke
0    214
1    209
Name: count, dtype: int64
stroke
0    213
1    209
Name: count, dtype: int64
stroke
0    213
1    209
Name: count, dtype: int64
stroke
0    213
1    209
Name: count, dtype: int64
stroke
0    213
1    209
Name: count, dtype: int64
stroke
0    213
1    209
Name: count, dtype: int64
stroke
0    213
1    209
Name: count, dtype: int64
stroke
0    213
1    209
Name: 

Overall, each group is balanced across classes, so we now train a separate Logistic Regression model on each of these groups.

In [38]:
logistic_models = []

for training_set in training_sets:
    X = training_set.drop('stroke', axis=1)
    y = training_set['stroke']

    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=42)
    
    clf = LogisticRegression()
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    logistic_models.append(clf)

Accuracy: 0.7086614173228346
              precision    recall  f1-score   support

           0       0.70      0.66      0.68        59
           1       0.72      0.75      0.73        68

    accuracy                           0.71       127
   macro avg       0.71      0.71      0.71       127
weighted avg       0.71      0.71      0.71       127

Accuracy: 0.7086614173228346
              precision    recall  f1-score   support

           0       0.70      0.64      0.67        59
           1       0.71      0.76      0.74        68

    accuracy                           0.71       127
   macro avg       0.71      0.70      0.71       127
weighted avg       0.71      0.71      0.71       127

Accuracy: 0.7401574803149606
              precision    recall  f1-score   support

           0       0.71      0.75      0.73        59
           1       0.77      0.74      0.75        68

    accuracy                           0.74       127
   macro avg       0.74      0.74      0.

The accuracy stays roughly the same for each model, but the precision for class 1 increases considerably. Note that these scores are measured on the balanced subsets, not on the original test split. We now build a model that combines all of these models and gives a final result.

We redefine the full dataset.

In [34]:
X = dframe.drop(['stroke'], axis=1)
y = dframe['stroke']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=42)

We make predictions with each model, now using the same data for all of them, and store each model's predictions to build the ensemble.

## Voting Classifier

In [41]:
model_predictions = []

for model in logistic_models:
    y_pred = model.predict(X_test)
    model_predictions.append(y_pred)

In [53]:
from sklearn.ensemble import VotingClassifier

# Create a list of tuples (name, model) for the VotingClassifier
ensemble_models = [('logistic_' + str(i), model) for i, model in enumerate(logistic_models)]

# Create the VotingClassifier with majority voting
voting_classifier = VotingClassifier(estimators=ensemble_models, voting='hard')
voting_classifier.fit(X_train, y_train)  # Train the ensemble model
ensemble_prediction = voting_classifier.predict(X_test)

In [54]:
print(accuracy_score(y_test, ensemble_prediction))
print(classification_report(y_test, ensemble_prediction))



0.7637795275590551
              precision    recall  f1-score   support

           0       0.72      0.81      0.76        59
           1       0.82      0.72      0.77        68

    accuracy                           0.76       127
   macro avg       0.77      0.77      0.76       127
weighted avg       0.77      0.76      0.76       127
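One point to keep in mind is that `VotingClassifier.fit` clones each estimator and fits the clones on the data passed to `fit`, so models that were previously trained on separate balanced groups are not the ones that end up voting. If the goal is to combine the already fitted models, a majority vote can be computed directly from their stored predictions. The sketch below shows this on synthetic data; the dataset, the subsets and the names `fitted`, `votes` and `majority` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)

# Train three models on different subsets, mimicking the balanced groups.
rng = np.random.default_rng(0)
fitted = []
for _ in range(3):
    idx = rng.choice(200, size=100, replace=False)
    fitted.append(LogisticRegression().fit(X_toy[idx], y_toy[idx]))

# Stack the predictions on held-out rows: shape (n_models, n_samples).
votes = np.array([m.predict(X_toy[200:]) for m in fitted])

# Majority vote: predict 1 when more than half of the models predict 1.
majority = (votes.mean(axis=0) > 0.5).astype(int)
```

With an even number of models, such as the 22 trained above, ties are possible; under the `> 0.5` rule used here they resolve to class 0.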



Let us try other ensemble models.

## Gradient Boosting Classifier

In [64]:
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
meta_classifier = GradientBoostingClassifier()

In [65]:
# Create the stacking ensemble
stacking_clf = StackingClassifier(
    estimators=ensemble_models,
    final_estimator=meta_classifier
)

In [66]:
stacking_clf.fit(X_train, y_train)
y_pred = stacking_clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.7007874015748031
              precision    recall  f1-score   support

           0       0.66      0.75      0.70        59
           1       0.75      0.66      0.70        68

    accuracy                           0.70       127
   macro avg       0.70      0.70      0.70       127
weighted avg       0.71      0.70      0.70       127



## Random Forest

Building on the methods already applied, we now try Decision Tree models instead of Logistic Regression and assemble them into a Random Forest model.

In [73]:
stroke_1 = dframe[dframe['stroke'] == 1]
stroke_0 = dframe[dframe['stroke'] == 0]

print(stroke_1.shape[0], stroke_0.shape[0])

209 4699


In [74]:
stroke_0_shuffled = stroke_0.sample(frac=1, random_state=42)

In [108]:
stroke_0_groups = np.array_split(stroke_0_shuffled, 5)
count = 0
for i in range(5):
    count +=stroke_0_groups[i].shape[0]

print(count)
print(stroke_0_shuffled.shape[0])

4699
4699


In [109]:
training_sets = []
for group in stroke_0_groups:
    training_set = pd.concat([stroke_1, group])
    training_sets.append(training_set)

In [110]:
from sklearn.tree import DecisionTreeClassifier
decision_tree_models = []

for training_set in training_sets:
    X = training_set.drop('stroke', axis=1)
    y = training_set['stroke']

    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=42)
    
    clf = DecisionTreeClassifier()
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    decision_tree_models.append(clf)

Accuracy: 0.7507246376811594
              precision    recall  f1-score   support

           0       0.85      0.84      0.85       285
           1       0.30      0.32      0.31        60

    accuracy                           0.75       345
   macro avg       0.58      0.58      0.58       345
weighted avg       0.76      0.75      0.75       345

Accuracy: 0.7797101449275362
              precision    recall  f1-score   support

           0       0.89      0.84      0.86       285
           1       0.40      0.52      0.45        60

    accuracy                           0.78       345
   macro avg       0.64      0.68      0.66       345
weighted avg       0.81      0.78      0.79       345

Accuracy: 0.7536231884057971
              precision    recall  f1-score   support

           0       0.84      0.86      0.85       285
           1       0.27      0.25      0.26        60

    accuracy                           0.75       345
   macro avg       0.56      0.55      0.

In [120]:
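from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

random_forest = RandomForestClassifier(n_estimators=len(decision_tree_models))
# Note: the fit() call further below trains new trees and overwrites estimators_,
# so the pre-trained decision trees assigned here are not the ones used for prediction.
random_forest.estimators_ = decision_tree_models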

In [112]:
X = dframe.drop(['stroke'], axis=1)
y = dframe['stroke']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=42)

In [113]:
random_forest.fit(X_train, y_train)
y_pred = random_forest.predict(X_test)

In [114]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.9470468431771895
              precision    recall  f1-score   support

           0       0.95      0.99      0.97      1401
           1       0.33      0.08      0.13        72

    accuracy                           0.95      1473
   macro avg       0.64      0.54      0.55      1473
weighted avg       0.92      0.95      0.93      1473



## Using ensembles without undersampling

### AdaBoost

In [121]:
X = dframe.drop(['stroke'], axis=1)
y = dframe['stroke']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=42)

In [122]:
max_depth_values = [2, 3, 4, 5]
range_T = [50, 100, 150]

for max_depth in max_depth_values:
    for t_ in range_T:
        clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=max_depth), n_estimators=t_, random_state=0)
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        print(f"For max_depth = {max_depth} and t = {t_} The accuracy is : {acc}")

For max_depth = 2 and t = 50 The accuracy is : 0.9395790902919212
For max_depth = 2 and t = 100 The accuracy is : 0.9361846571622539
For max_depth = 2 and t = 150 The accuracy is : 0.9341479972844535
For max_depth = 3 and t = 50 The accuracy is : 0.9314324507807196
For max_depth = 3 and t = 100 The accuracy is : 0.9368635437881874
For max_depth = 3 and t = 150 The accuracy is : 0.9416157501697217
For max_depth = 4 and t = 50 The accuracy is : 0.9389002036659878
For max_depth = 4 and t = 100 The accuracy is : 0.9402579769178547
For max_depth = 4 and t = 150 The accuracy is : 0.9470468431771895
For max_depth = 5 and t = 50 The accuracy is : 0.945010183299389
For max_depth = 5 and t = 100 The accuracy is : 0.9497623896809233
For max_depth = 5 and t = 150 The accuracy is : 0.9490835030549898
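Because the test set is about 95% class 0, accuracy barely distinguishes these configurations. A grid like this can instead record the recall and F1 of the minority class next to accuracy. The sketch below shows the pattern on a synthetic imbalanced dataset; the data, the reduced grid and the `results` list are hypothetical and chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with roughly 5% positives, standing in for the stroke data.
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

results = []
for max_depth in [2, 3]:
    clf = AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=max_depth), n_estimators=20, random_state=0
    )
    clf.fit(X_tr, y_tr)
    y_hat = clf.predict(X_te)
    results.append({
        "max_depth": max_depth,
        "accuracy": accuracy_score(y_te, y_hat),
        "recall_1": recall_score(y_te, y_hat, zero_division=0),  # minority-class recall
        "f1_1": f1_score(y_te, y_hat, zero_division=0),          # minority-class F1
    })

for row in results:
    print(row)
```

Selecting the configuration by `recall_1` or `f1_1` rather than accuracy targets the class the project actually cares about.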


In [123]:
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=5), n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.95      1.00      0.97      1401
           1       0.25      0.01      0.03        72

    accuracy                           0.95      1473
   macro avg       0.60      0.51      0.50      1473
weighted avg       0.92      0.95      0.93      1473

