# Logistic Regression: Disease Detection Based on Medical Data

Description

This data set dates from 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach V. It contains 76 attributes, including the predicted attribute, but all published experiments use a subset of 14 of them. The "target" field indicates the presence of heart disease in the patient. It is integer-valued: 0 = no disease and 1 = disease.

The goal of this notebook is to build a logistic regression model that predicts the presence of heart disease from medical data. We use the "heart.csv" dataset.

https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset


In [14]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

In [16]:
# Load the dataset
data = pd.read_csv("datasets/heart/heart.csv")
data.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0


## Data preprocessing

### Initial inspection

In [18]:
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1025 non-null   int64  
 1   sex       1025 non-null   int64  
 2   cp        1025 non-null   int64  
 3   trestbps  1025 non-null   int64  
 4   chol      1025 non-null   int64  
 5   fbs       1025 non-null   int64  
 6   restecg   1025 non-null   int64  
 7   thalach   1025 non-null   int64  
 8   exang     1025 non-null   int64  
 9   oldpeak   1025 non-null   float64
 10  slope     1025 non-null   int64  
 11  ca        1025 non-null   int64  
 12  thal      1025 non-null   int64  
 13  target    1025 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 112.2 KB
None


In [19]:
print(data.describe())

               age          sex           cp     trestbps        chol  \
count  1025.000000  1025.000000  1025.000000  1025.000000  1025.00000   
mean     54.434146     0.695610     0.942439   131.611707   246.00000   
std       9.072290     0.460373     1.029641    17.516718    51.59251   
min      29.000000     0.000000     0.000000    94.000000   126.00000   
25%      48.000000     0.000000     0.000000   120.000000   211.00000   
50%      56.000000     1.000000     1.000000   130.000000   240.00000   
75%      61.000000     1.000000     2.000000   140.000000   275.00000   
max      77.000000     1.000000     3.000000   200.000000   564.00000   

               fbs      restecg      thalach        exang      oldpeak  \
count  1025.000000  1025.000000  1025.000000  1025.000000  1025.000000   
mean      0.149268     0.529756   149.114146     0.336585     1.071512   
std       0.356527     0.527878    23.005724     0.472772     1.175053   
min       0.000000     0.000000    71.000000  

### Encoding categorical variables and scaling numeric variables

In [20]:
# Define features and labels
X = data.drop("target", axis=1)
y = data["target"]

In [38]:
print(X.head())

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  \
0   52    1   0       125   212    0        1      168      0      1.0      2   
1   53    1   0       140   203    1        0      155      1      3.1      0   
2   70    1   0       145   174    0        1      125      1      2.6      0   
3   61    1   0       148   203    0        1      161      0      0.0      2   
4   62    0   0       138   294    1        1      106      0      1.9      1   

   ca  thal  
0   2     3  
1   0     3  
2   0     3  
3   1     3  
4   3     2  


In [39]:
print(y.head())

0    0
1    0
2    0
3    0
4    0
Name: target, dtype: int64


In [40]:
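# Identify numeric and categorical columns by dtype.
# Note: every column here is int64 or float64, so integer-coded categories
# (such as cp or thal) count as numeric and cat_features ends up empty.
num_features = X.select_dtypes(include=["int64", "float64"]).columns
cat_features = X.select_dtypes(include=["object"]).columns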

In [41]:
print("\nNumeric columns:", num_features.tolist())


Numeric columns: ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']


In [42]:
print("Categorical columns:", cat_features.tolist())

Categorical columns: []


In [46]:
# Preprocessing: scale numeric columns and one-hot encode categorical ones
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_features),
        ("cat", OneHotEncoder(), cat_features)
    ],
    remainder="passthrough"
)
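
Because every column in this dataset is stored as an integer or float, dtype-based selection leaves the categorical list empty, and codes such as `cp` or `thal` are scaled as if they were quantities. The following is a minimal sketch of one alternative, not the notebook's method. It assumes that `cp`, `restecg`, `slope`, `ca`, and `thal` are category codes (that grouping, and the names `assumed_cat`, `assumed_num`, and `explicit_preprocessor`, are illustrative), and it reuses the five rows printed by `data.head()` so it runs without the CSV file.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# The five rows printed by data.head() earlier in the notebook
demo = pd.DataFrame({
    "age": [52, 53, 70, 61, 62],
    "sex": [1, 1, 1, 1, 0],
    "cp": [0, 0, 0, 0, 0],
    "trestbps": [125, 140, 145, 148, 138],
    "chol": [212, 203, 174, 203, 294],
    "fbs": [0, 1, 0, 0, 1],
    "restecg": [1, 0, 1, 1, 1],
    "thalach": [168, 155, 125, 161, 106],
    "exang": [0, 1, 1, 0, 0],
    "oldpeak": [1.0, 3.1, 2.6, 0.0, 1.9],
    "slope": [2, 0, 0, 2, 1],
    "ca": [2, 0, 0, 1, 3],
    "thal": [3, 3, 3, 3, 2],
})

# ASSUMPTION: these integer-coded columns are treated as categories
assumed_cat = ["cp", "restecg", "slope", "ca", "thal"]
# ASSUMPTION: these columns are treated as continuous measurements
assumed_num = ["age", "trestbps", "chol", "thalach", "oldpeak"]

explicit_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), assumed_num),
        ("cat", OneHotEncoder(handle_unknown="ignore"), assumed_cat),
    ],
    remainder="passthrough",  # binary flags (sex, fbs, exang) pass through unchanged
    sparse_threshold=0.0,     # always return a dense array
)

demo_transformed = explicit_preprocessor.fit_transform(demo)
print(demo_transformed.shape)
```

Switching to a grouping like this would change the fitted model, so the metrics reported later in the notebook would no longer apply as printed.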

## Splitting the dataset

In [47]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
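
The `stratify=y` argument keeps the class proportions the same in the training and test sets. The sketch below illustrates that effect on synthetic labels (40% class 0, 60% class 1), not on the heart data; `X_demo`, `y_demo`, and the other names are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels, not the heart data: 40 negatives and 60 positives
y_demo = np.array([0] * 40 + [1] * 60)
X_demo = np.arange(100).reshape(-1, 1)

_, _, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, each split keeps the 60% positive share
print("positive share, full :", y_demo.mean())
print("positive share, train:", y_tr_demo.mean())
print("positive share, test :", y_te_demo.mean())
```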

## Building the pipeline with logistic regression

In [48]:
# Build the pipeline
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(random_state=42, max_iter=1000)),
])
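
For a binary problem, the classifier step turns a linear score into a probability with the logistic sigmoid and predicts class 1 when that score is positive. The sketch below checks this relationship on synthetic data from `make_classification`, which stands in for the heart dataset; `demo_model` and the other names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data with 13 features (a stand-in, not heart.csv)
X_syn, y_syn = make_classification(n_samples=200, n_features=13, random_state=42)

demo_model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(random_state=42, max_iter=1000)),
])
demo_model.fit(X_syn, y_syn)

scores = demo_model.decision_function(X_syn)    # linear score w.x + b on scaled inputs
proba = demo_model.predict_proba(X_syn)[:, 1]   # estimated P(class 1)

manual_proba = 1.0 / (1.0 + np.exp(-scores))    # logistic sigmoid of the score
manual_pred = (scores > 0).astype(int)          # score > 0 means probability > 0.5

print(np.allclose(proba, manual_proba))
print((manual_pred == demo_model.predict(X_syn)).all())
```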

## Training the model

In [51]:
# Define the training sample sizes
sizes = [100, 500, len(X_train)]
print(f"Training sample sizes: {sizes}")


Training sample sizes: [100, 500, 820]


In [52]:
# Iterate over the sample sizes. This loop only builds the subsets, so after
# it finishes X_train_subset and y_train_subset hold the last (full) one.
for size in sizes:
    print(f"\nTraining with {size} training samples:")
    X_train_subset = X_train[:size]
    y_train_subset = y_train[:size]


Training with 100 training samples:

Training with 500 training samples:

Training with 820 training samples:


In [57]:
# Train the model on the data subset
model.fit(X_train_subset, y_train_subset)


In [59]:
# Make predictions on the test set
y_pred = model.predict(X_test)
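
Since the fit and predict cells run after the loop over `sizes`, only the last subset (all 820 rows) is actually used for training. To compare sample sizes, fitting and scoring would need to happen inside the loop. The sketch below shows one way to do that on synthetic data of the same shape as the heart dataset; `base_model`, `results`, and the synthetic arrays are hypothetical names, and the scores it prints say nothing about the real data.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for heart.csv: 1025 rows, 13 features, binary target
X_syn, y_syn = make_classification(n_samples=1025, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

base_model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(random_state=42, max_iter=1000)),
])

sizes = [100, 500, len(X_tr)]
results = []
for size in sizes:
    # Fit a fresh copy on the first `size` training rows, then score on the test set
    fitted = clone(base_model).fit(X_tr[:size], y_tr[:size])
    y_hat = fitted.predict(X_te)
    results.append((size, accuracy_score(y_te, y_hat), f1_score(y_te, y_hat)))

for size, acc_k, f1_k in results:
    print(f"{size:>4} samples  accuracy={acc_k:.3f}  f1={f1_k:.3f}")
```

Using `clone` gives each size an unfitted copy of the pipeline, so no state carries over between iterations.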

## Model evaluation

In [60]:
# Compute evaluation metrics
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Model accuracy: {acc:.3f}")
print(f"Model F1 score: {f1:.3f}")

Model accuracy: 0.810
Model F1 score: 0.831


In [61]:
# Show a sample of predictions next to the true labels
print("\nSample predictions:")
print("Predictions:", y_pred[:10])
print("True labels:", y_test[:10].values)


Sample predictions:
Predictions: [0 0 0 1 0 0 1 1 1 0]
True labels: [0 1 0 1 0 0 1 0 1 1]


In [68]:
# Predictions on the test set
y_pred = model.predict(X_test)

In [69]:
# Evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

In [70]:
print(f"Accuracy: {accuracy:.3f}")


Accuracy: 0.810


In [71]:
print(f"F1 Score: {f1:.3f}")


F1 Score: 0.831


In [72]:
print(f"Confusion Matrix:\n{conf_matrix}")

Confusion Matrix:
[[70 30]
 [ 9 96]]
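
The reported accuracy and F1 score can be recomputed directly from this matrix. scikit-learn's `confusion_matrix` puts true labels on the rows and predictions on the columns, so for labels 0 and 1 the cells read `[[TN, FP], [FN, TP]]`. The sketch below is only a cross-check of the printed numbers; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above: rows are true labels, columns are predictions
conf = np.array([[70, 30],
                 [9, 96]])
tn, fp, fn, tp = conf.ravel()

accuracy_cm = (tp + tn) / conf.sum()
precision_cm = tp / (tp + fp)   # share of predicted positives that are correct
recall_cm = tp / (tp + fn)      # share of actual positives that are found
f1_cm = 2 * precision_cm * recall_cm / (precision_cm + recall_cm)

print(f"accuracy={accuracy_cm:.3f} precision={precision_cm:.3f} "
      f"recall={recall_cm:.3f} f1={f1_cm:.3f}")
# accuracy=0.810 precision=0.762 recall=0.914 f1=0.831
```

The breakdown shows that the errors are uneven: 30 healthy patients are flagged as diseased, while 9 diseased patients are missed.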


In [73]:
print("Accuracy: {:.3f}".format(accuracy_score(y_test, y_pred)))

Accuracy: 0.810
