# Descriptive Statistics: Pima Indians Diabetes
Author: Aaron Cuevas | Date: 30 Oct  
Dataset: `data/diabetes.csv`

In [5]:
import sys
print("PY:", sys.executable)
!{sys.executable} -m pip install -U pip
!{sys.executable} -m pip install pandas numpy matplotlib

PY: /Users/star/.venvs/langevin/bin/python

In [6]:
import sys
!{sys.executable} -m pip install ipykernel
!{sys.executable} -m ipykernel install --user --name "analitica-venv" --display-name "Python (analitica-venv)"

Installed kernelspec analitica-venv in /Users/star/Library/Jupyter/kernels/analitica-venv


In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option("display.max_columns", None)

In [8]:
from pathlib import Path
csv_path = Path("data/diabetes.csv")
assert csv_path.exists(), f"Cannot find {csv_path.resolve()}. Copy the CSV to data/."
df = pd.read_csv(csv_path)
df.head()

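|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |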


In [9]:
print("Rows, Columns:", df.shape)
df.columns.tolist()

Rows, Columns: (768, 9)


['Pregnancies',
 'Glucose',
 'BloodPressure',
 'SkinThickness',
 'Insulin',
 'BMI',
 'DiabetesPedigreeFunction',
 'Age',
 'Outcome']

In [10]:
df.info()
df.isna().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

## Variable notes
All numeric variables are quantitative (continuous or discrete, depending on how they are recorded).
`Outcome` is binary categorical (0 = no diabetes, 1 = diabetes).
In this dataset some zeros stand for missing values (Glucose, BloodPressure, SkinThickness, Insulin, BMI).
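
Before replacing anything, it can help to count how many zeros each affected column holds. The sketch below is only an illustration: `toy` is a hypothetical frame built from the five rows printed by `df.head()` above (a subset of columns, chosen for brevity), not the full CSV.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: values copied from the df.head() output,
# restricted to four columns for brevity.
toy = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Zeros per column: candidates for "missing", not real measurements.
zero_counts = (toy == 0).sum()
print(zero_counts)

# After replacing zeros with NaN, mean/median/std skip those rows.
cleaned = toy.replace(0, np.nan)
print(cleaned.isna().sum())
print(cleaned["Insulin"].mean())
```

On the real data the same two calls would be applied to `df[cols]` for the five columns listed above.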

In [None]:
cols_zero_na = ["Glucose","BloodPressure","SkinThickness","Insulin","BMI"]
df[cols_zero_na] = df[cols_zero_na].replace(0, np.nan)
df.isna().sum()

In [None]:
vars_sel = ["Glucose","BMI","Age","Outcome"]
stats = df.agg({
    "Glucose": ["min","max","mean","median","std"],
    "BMI": ["min","max","mean","median","std"],
    "Age": ["min","max","mean","median","std"],
    "Outcome": ["min","max","mean"]
})
stats

In [None]:
df[vars_sel].corr(numeric_only=True)

In [None]:
for c in ["Glucose","BMI","Age"]:
    # Drop NaN so missing values are not plotted
    df[c].dropna().plot(kind="hist", bins=30, alpha=0.7)
    plt.title(f"Histogram of {c}")
    plt.xlabel(c)
    plt.ylabel("Frequency")
    plt.show()

Next, three queries that use the assigned variables and `Outcome`.

In [12]:
q1 = (df.query("Age >= 50 and Glucose >= 140")
        [["Age","Glucose","BMI","Outcome"]]
        .sort_values("Glucose", ascending=False))
q1.head(10)

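|   | Age | Glucose | BMI | Outcome |
|---|---|---|---|---|
| 8 | 53 | 197 | 30.5 | 1 |
| 579 | 62 | 197 | 34.7 | 1 |
| 206 | 57 | 196 | 37.5 | 1 |
| 498 | 55 | 195 | 25.1 | 1 |
| 489 | 67 | 194 | 26.1 | 0 |
| 319 | 59 | 194 | 23.5 | 1 |
| 759 | 66 | 190 | 35.5 | 1 |
| 13 | 59 | 189 | 30.1 | 1 |
| 546 | 53 | 187 | 43.6 | 1 |
| 186 | 60 | 181 | 30.1 | 1 |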


In [13]:
bins = [0, 99, 125, 500]
labels = ["normoglycemia","prediabetes","hyperglycemia"]
# observed=False keeps every bin in the result and avoids the pandas FutureWarning
q2 = (df.assign(glu_bin=pd.cut(df["Glucose"], bins=bins, labels=labels, include_lowest=True))
        .groupby("glu_bin", observed=False)["Outcome"].mean()
        .rename("Pr(Outcome=1)")).to_frame()
q2

| glu_bin | Pr(Outcome=1) |
|---|---|
| normoglycemia | 0.081218 |
| prediabetes | 0.277372 |
| hyperglycemia | 0.592593 |
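
The zero-replacement cell is shown as `In [None]`, so the table above may have been computed with zeros still present in `Glucose`. With `include_lowest=True`, a value of 0 lands in the first bin. The sketch below uses hypothetical toy values and placeholder labels (`low`, `mid`, `high`), not the real data, to show how much that can shift the first bin.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the real CSV): one Glucose value is a 0 placeholder.
toy = pd.DataFrame({
    "Glucose": [0, 85, 95, 110, 120, 150, 183],
    "Outcome": [1, 0, 0, 0, 1, 1, 1],
})
bins = [0, 99, 125, 500]
labels = ["low", "mid", "high"]  # placeholder labels for this sketch


def rate_by_bin(frame: pd.DataFrame) -> pd.DataFrame:
    """Mean Outcome and row count per glucose bin."""
    binned = pd.cut(frame["Glucose"], bins=bins, labels=labels, include_lowest=True)
    return frame.groupby(binned, observed=False)["Outcome"].agg(["mean", "count"])


# Zeros kept: the 0 is counted in the "low" bin.
raw = rate_by_bin(toy)

# Zeros turned into NaN: pd.cut yields NaN and groupby drops that row.
clean = rate_by_bin(toy.assign(Glucose=toy["Glucose"].replace(0, np.nan)))

print(raw)
print(clean)
```

Comparing the `count` column of both results is a quick way to check whether placeholder zeros were part of a binned summary.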


In [14]:
bmi_bins = [0, 18.5, 25, 30, 100]
bmi_lbls = ["underweight","normal","overweight","obesity"]
# observed=False keeps every bin in the result and avoids the pandas FutureWarning
q3 = (df.assign(bmi_cat=pd.cut(df["BMI"], bins=bmi_bins, labels=bmi_lbls, include_lowest=True))
        .groupby("bmi_cat", observed=False)["Outcome"].agg(["mean","count"]))
q3

| bmi_cat | mean | count |
|---|---|---|
| underweight | 0.133333 | 15 |
| normal | 0.064815 | 108 |
| overweight | 0.244444 | 180 |
| obesity | 0.462366 | 465 |
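
The means and counts in this table are enough to recover how many rows have `Outcome = 1` in each bin, and from there the overall proportion. The sketch below retypes the four rows by hand and labels them with the bin intervals from `bmi_bins`; the names `q3_table`, `positives` and `overall_rate` are illustrative only.

```python
import pandas as pd

# Values copied from the Q3 output above; rows are labelled by bin interval.
q3_table = pd.DataFrame(
    {
        "mean": [0.133333, 0.064815, 0.244444, 0.462366],
        "count": [15, 108, 180, 465],
    },
    index=["[0, 18.5]", "(18.5, 25]", "(25, 30]", "(30, 100]"],
)

# mean * count is the number of rows with Outcome == 1 in each bin
# (rounded because the printed means are truncated to six decimals).
positives = (q3_table["mean"] * q3_table["count"]).round().astype(int)

total_rows = int(q3_table["count"].sum())
overall_rate = positives.sum() / total_rows

print(positives.to_dict())
print(total_rows, round(overall_rate, 4))
```

The counts add up to 768, the full number of rows, so no record was left out of the binning. With `include_lowest=True`, that means any row whose BMI is 0 sits in the first bin.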


### Conclusions
- **Glucose** (continuous quantitative): range [min–max]; mean vs. median suggest (symmetry/right tail). The standard deviation indicates (low/medium/high dispersion).
- **BMI** (continuous quantitative): range [min–max]; right-skewed distribution; possible outliers.
- **Age** (discrete quantitative): range [min–max]; can be grouped by decade.
- **Outcome** (binary categorical): the proportion of 1s is ≈ 34.9% (268 of 768 rows, which follows from the Q3 means and counts; see also stats).  
  `Outcome=1` becomes more frequent in the higher glucose categories (8.1% → 27.7% → 59.3% in Q2) and at higher BMI (6.5% normal, 24.4% overweight, 46.2% obesity in Q3). The underweight bin (13.3%) has only 15 rows, and because the Q3 counts add up to all 768 rows, it also contains any records whose BMI is 0.

In summary: higher glucose and higher BMI are associated with a higher probability of diabetes in this Pima sample.
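
The bracketed placeholders in the conclusions (range, mean vs. median, dispersion) can be filled from a handful of summaries per variable. The sketch below is one possible helper, assuming a hypothetical function name `describe_shape` and hypothetical toy values; the notebook does not define cut-offs for "low/medium/high" dispersion, so the helper only reports the numbers.

```python
import pandas as pd


def describe_shape(s: pd.Series) -> dict:
    """Summaries needed to fill in the conclusion template for one variable."""
    s = s.dropna()
    mean, median, std = s.mean(), s.median(), s.std()
    return {
        "min": s.min(),
        "max": s.max(),
        "mean": mean,
        "median": median,
        "std": std,
        # Coefficient of variation: std relative to the mean, a rough dispersion gauge.
        "cv": std / mean,
        # Sample skewness; positive values point to a right tail.
        "skew": s.skew(),
    }


# Hypothetical right-skewed toy values, not taken from the dataset.
toy = pd.Series([21, 22, 23, 24, 25, 29, 33, 41, 50, 67], name="toy")
summary = describe_shape(toy)
print(summary)
```

On the real data this would be called as `describe_shape(df["Glucose"])`, and likewise for `BMI` and `Age`. A mean clearly above the median together with a positive `skew` supports the "right tail" reading.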