# Statistical Analysis: Chi-Square Test of Independence 🔬

To check whether each categorical variable has a statistically significant association with the target variable `stroke`, we run Pearson's chi-square test of independence.

* **Null hypothesis ($H_0$):** The variable is independent of `stroke` (no association).
* **Alternative hypothesis ($H_1$):** The variable is associated with the occurrence of stroke.
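
As a reminder of what the test measures, the statistic for an $r \times k$ table can be written as below. The notation is standard but assumed here, not taken from the notebook: $O_{ij}$ are the observed counts, $E_{ij}$ the counts expected under $H_0$, $R_i$ and $C_j$ the row and column totals, and $n$ the sample size.

$$
E_{ij} = \frac{R_i \, C_j}{n}, \qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{k} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
\text{dof} = (r-1)(k-1)
$$

A large $\chi^2$ relative to its degrees of freedom gives a small p-value, and $H_0$ is rejected when that p-value falls below the chosen significance level.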

### **1. Environment Setup**

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the cleaned dataset
df = pd.read_csv('dataset/healthcare-dataset-stroke-clean.csv')

# Set the plot style
sns.set_theme(style="white")

# Show the first rows of the dataset
df.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,28.1,never smoked,1
2,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


### **2. Importing the Statistical Tools**
We need `chi2_contingency` from the `scipy.stats` module.

In [2]:
from scipy.stats import chi2_contingency

# List to collect the results
resultados_chi = []

categorical_cols = ['gender', 'hypertension', 'heart_disease', 'ever_married', 
                    'work_type', 'Residence_type', 'smoking_status']

### **3. Running the Test Step by Step**
Rather than looking only at the final p-value, we build the contingency table (observed frequencies) for each variable and run the test on it.
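
To make the link between a contingency table and the test statistic concrete, here is a minimal standard-library sketch that computes the expected counts and the statistic by hand. The helper `chi_square_statistic` and the counts in `toy_table` are hypothetical and only illustrate the arithmetic; they are not the notebook's data.

```python
import math

def chi_square_statistic(observed):
    """Return (chi2, dof, expected) for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof, expected

# Hypothetical counts: rows = exposure (no, yes), columns = stroke (0, 1)
toy_table = [[90, 10], [60, 40]]
chi2, dof, expected = chi_square_statistic(toy_table)

# With one degree of freedom the chi-square tail probability has a closed form
p_value = math.erfc(math.sqrt(chi2 / 2))
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.2e}")
```

Note that `scipy.stats.chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, so its statistic for this table would be slightly smaller than the uncorrected value computed here. The expected counts are also the quantities to inspect when checking the usual rule of thumb that they should not be too small (commonly, at least 5 per cell).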

In [3]:
for col in categorical_cols:
    # Build the contingency table (observed frequencies)
    contingency_table = pd.crosstab(df[col], df['stroke'])

    # Run the chi-square test
    chi2, p_value, dof, expected = chi2_contingency(contingency_table)

    # Decide significance (alpha = 0.05)
    es_significativa = "Yes" if p_value < 0.05 else "No"

    # Note: rounding to 4 decimals displays very small p-values as 0.0
    resultados_chi.append({
        'Variable': col,
        'Chi2 Score': round(chi2, 2),
        'P-Value': round(p_value, 4),
        'Significant?': es_significativa
    })

# Convert to a DataFrame for a tidy view
df_resultados = pd.DataFrame(resultados_chi)
df_resultados.sort_values(by='P-Value', inplace=True)
df_resultados

Unnamed: 0,Variable,Chi2 Score,P-Value,Significant?
1,hypertension,81.57,0.0,Yes
2,heart_disease,90.23,0.0,Yes
3,ever_married,58.87,0.0,Yes
4,work_type,49.16,0.0,Yes
6,smoking_status,29.23,0.0,Yes
5,Residence_type,1.07,0.2998,No
0,gender,0.34,0.5598,No
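
Because the p-values are rounded to four decimals, anything below 0.00005 is displayed as `0.0`, which can be misread as an exact zero. One possible way to report them is to show an upper bound instead, sketched here with a hypothetical helper `format_p_value` that is not part of the notebook:

```python
def format_p_value(p, floor=1e-4):
    """Format a p-value, reporting '< floor' instead of a rounded zero.

    Hypothetical display helper; `floor` is an assumed cut-off.
    """
    if p < floor:
        return f"< {floor:g}"
    return f"{p:.4f}"

# Illustrative values, not taken from the dataset
examples = [9.6e-7, 0.0312, 0.2998]
formatted = [format_p_value(p) for p in examples]
print(formatted)
```

Applying such a helper to the unrounded p-values when building the results table would keep the ranking of the significant variables visible.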


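### **4. Effect Size: Cramér's V**
A significant p-value only shows that an association is detectable; Cramér's V measures how strong that association is.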
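
The `cramers_v` function in the next cell uses a bias-corrected form of the statistic. As a reading aid, the quantities it computes can be written as follows, for a table with $r$ rows, $k$ columns and $n$ observations. The tilde notation is an assumption made here for readability.

$$
\tilde{\varphi}^2 = \max\!\left(0,\; \frac{\chi^2}{n} - \frac{(k-1)(r-1)}{n-1}\right), \qquad
\tilde{r} = r - \frac{(r-1)^2}{n-1}, \qquad
\tilde{k} = k - \frac{(k-1)^2}{n-1}
$$

$$
\tilde{V} = \sqrt{\frac{\tilde{\varphi}^2}{\min(\tilde{k}-1,\; \tilde{r}-1)}}
$$

The value lies between 0 (no association) and 1 (perfect association), and the $\max(0, \cdot)$ term is why a very small $\chi^2$ can yield exactly 0.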

In [4]:
import numpy as np

def cramers_v(contingency_table):
    """Compute the bias-corrected Cramér's V for a contingency table."""
    # chi2_contingency applies Yates' continuity correction to 2x2 tables by default
    chi2 = chi2_contingency(contingency_table)[0]
    n = contingency_table.sum().sum()
    phi2 = chi2 / n
    r, k = contingency_table.shape
    # Bias correction
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))

# Apply it to each categorical variable
resultados_completos = []

for col in categorical_cols:
    tabla = pd.crosstab(df[col], df['stroke'])
    chi2, p, _, _ = chi2_contingency(tabla)
    v_cramer = cramers_v(tabla)

    # The 0.15 and 0.05 cut-offs are this notebook's own thresholds, not a standard scale
    resultados_completos.append({
        'Variable': col,
        'P-Value': round(p, 4),
        "Cramér's V": round(v_cramer, 4),
        'Association': 'Strong' if v_cramer > 0.15 else ('Weak' if v_cramer > 0.05 else 'Negligible')
    })

df_final_cat = pd.DataFrame(resultados_completos).sort_values(by="Cramér's V", ascending=False)
df_final_cat

Unnamed: 0,Variable,P-Value,Cramér's V,Association
2,heart_disease,0.0,0.1322,Weak
1,hypertension,0.0,0.1256,Weak
3,ever_married,0.0,0.1064,Weak
4,work_type,0.0,0.094,Weak
6,smoking_status,0.0,0.0717,Weak
5,Residence_type,0.2998,0.0038,Negligible
0,gender,0.5598,0.0,Negligible
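
The table combines very small p-values with weak associations, which is expected in large samples: the chi-square statistic grows with the sample size while Cramér's V does not. The sketch below illustrates this with hypothetical counts, not the notebook's data. It assumes a 2x2 table, uses the plain statistic without Yates' correction, and uses the uncorrected V, which for a 2x2 table reduces to $\sqrt{\chi^2 / n}$. The helper name `chi2_p_and_v` is made up for this example.

```python
import math

def chi2_p_and_v(table):
    """Return (chi2, p_value, V) for a 2x2 table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = sum(
        (table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i in range(2)
        for j in range(2)
    )
    p_value = math.erfc(math.sqrt(chi2 / 2))  # exact tail for 1 degree of freedom
    v = math.sqrt(chi2 / n)                   # uncorrected Cramér's V for a 2x2 table
    return chi2, p_value, v

# Hypothetical counts: rows = risk factor (no, yes), columns = stroke (0, 1)
base = [[48, 2], [45, 5]]

results = []
for scale in (1, 10, 100):
    # Same proportions, larger sample
    table = [[cell * scale for cell in row] for row in base]
    chi2, p_value, v = chi2_p_and_v(table)
    results.append((scale, chi2, p_value, v))
    print(f"n = {100 * scale:>6}: chi2 = {chi2:8.2f}, p = {p_value:.2e}, V = {v:.4f}")
```

With identical proportions, the p-value shrinks as the sample grows while V stays the same, so the two columns of the results table answer different questions: whether an association is detectable, and how large it is.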
