## Linear model

> **Objective:** Train a **Linear Regression** in a **pipeline** (with scaling), using **cross-validation (≥5 folds)**, and report **MAE, MSE, RMSE, R²** as **mean ± standard deviation**, without data leakage.

### 1) Setup

In [3]:
# Reproducibility and libraries
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import KFold, cross_validate
from sklearn.metrics import make_scorer, mean_absolute_error, mean_squared_error, r2_score

In [6]:
# Fix the random seed, load the data, and confirm it loaded correctly
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
df = pd.read_csv('parkinsons+telemonitoring/parkinsons_updrs.data')
display(df.head())
display(df.info())

| | subject# | age | sex | test_time | motor_UPDRS | total_UPDRS | Jitter(%) | Jitter(Abs) | Jitter:RAP | Jitter:PPQ5 | ... | Shimmer(dB) | Shimmer:APQ3 | Shimmer:APQ5 | Shimmer:APQ11 | Shimmer:DDA | NHR | HNR | RPDE | DFA | PPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 72 | 0 | 5.6431 | 28.199 | 34.398 | 0.00662 | 3.4e-05 | 0.00401 | 0.00317 | ... | 0.23 | 0.01438 | 0.01309 | 0.01662 | 0.04314 | 0.01429 | 21.64 | 0.41888 | 0.54842 | 0.16006 |
| 1 | 1 | 72 | 0 | 12.666 | 28.447 | 34.894 | 0.003 | 1.7e-05 | 0.00132 | 0.0015 | ... | 0.179 | 0.00994 | 0.01072 | 0.01689 | 0.02982 | 0.011112 | 27.183 | 0.43493 | 0.56477 | 0.1081 |
| 2 | 1 | 72 | 0 | 19.681 | 28.695 | 35.389 | 0.00481 | 2.5e-05 | 0.00205 | 0.00208 | ... | 0.181 | 0.00734 | 0.00844 | 0.01458 | 0.02202 | 0.02022 | 23.047 | 0.46222 | 0.54405 | 0.21014 |
| 3 | 1 | 72 | 0 | 25.647 | 28.905 | 35.81 | 0.00528 | 2.7e-05 | 0.00191 | 0.00264 | ... | 0.327 | 0.01106 | 0.01265 | 0.01963 | 0.03317 | 0.027837 | 24.445 | 0.4873 | 0.57794 | 0.33277 |
| 4 | 1 | 72 | 0 | 33.642 | 29.187 | 36.375 | 0.00335 | 2e-05 | 0.00093 | 0.0013 | ... | 0.176 | 0.00679 | 0.00929 | 0.01819 | 0.02036 | 0.011625 | 26.126 | 0.47188 | 0.56122 | 0.19361 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5875 entries, 0 to 5874
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   subject#       5875 non-null   int64  
 1   age            5875 non-null   int64  
 2   sex            5875 non-null   int64  
 3   test_time      5875 non-null   float64
 4   motor_UPDRS    5875 non-null   float64
 5   total_UPDRS    5875 non-null   float64
 6   Jitter(%)      5875 non-null   float64
 7   Jitter(Abs)    5875 non-null   float64
 8   Jitter:RAP     5875 non-null   float64
 9   Jitter:PPQ5    5875 non-null   float64
 10  Jitter:DDP     5875 non-null   float64
 11  Shimmer        5875 non-null   float64
 12  Shimmer(dB)    5875 non-null   float64
 13  Shimmer:APQ3   5875 non-null   float64
 14  Shimmer:APQ5   5875 non-null   float64
 15  Shimmer:APQ11  5875 non-null   float64
 16  Shimmer:DDA    5875 non-null   float64
 17  NHR            5875 non-null   float64
 18  HNR     

None

### 2) Preprocessing
Check for nulls and data types

In [8]:
# Make sure the target variable exists (completeness is checked via the null counts below)
assert 'total_UPDRS' in df.columns, "Column 'total_UPDRS' not found in the dataset."

# Confirm the number of rows and columns
print('Rows x columns:', df.shape)

# Count nulls per column
print('\nNulls per column:')
print(df.isna().sum())

Rows x columns: (5875, 22)

Nulls per column:
subject#         0
age              0
sex              0
test_time        0
motor_UPDRS      0
total_UPDRS      0
Jitter(%)        0
Jitter(Abs)      0
Jitter:RAP       0
Jitter:PPQ5      0
Jitter:DDP       0
Shimmer          0
Shimmer(dB)      0
Shimmer:APQ3     0
Shimmer:APQ5     0
Shimmer:APQ11    0
Shimmer:DDA      0
NHR              0
HNR              0
RPDE             0
DFA              0
PPE              0
dtype: int64


#### Define the features and the target

In [None]:
# Features
FEATURES = [
    'age', 'test_time',
    'Jitter(%)', 'Jitter(Abs)', 'Jitter:RAP', 'Jitter:PPQ5', 'Jitter:DDP',
    'Shimmer', 'Shimmer(dB)', 'Shimmer:APQ3', 'Shimmer:APQ5', 'Shimmer:APQ11', 'Shimmer:DDA',
    'NHR', 'HNR', 'RPDE', 'DFA', 'PPE',
    'sex'
]

# Check that every feature and the target column exist before building X and y
missing = [c for c in FEATURES + ['total_UPDRS'] if c not in df.columns]
assert not missing, f"Columns missing from the dataset: {missing}"

X = df[FEATURES].copy()
y = df['total_UPDRS'].astype(float).copy()

### 3) Pipeline (scaling → Linear Regression)

In [None]:
# Define the pipeline: standardize the features, then fit ordinary least squares
lin_pipeline = Pipeline([
    ('scaler', StandardScaler(with_mean=True, with_std=True)),
    ('linreg', LinearRegression())
])

### 4) Cross-validation (no leakage) and metrics

We use **KFold(10)** with `shuffle=True` and a fixed `random_state`. We report **MAE, MSE, RMSE, R²**.


In [11]:
# Define the CV splitter and the metrics (scorers)
kf = KFold(n_splits=10, shuffle=True, random_state=RANDOM_STATE)

# Scorers: with greater_is_better=False the scorer returns the negated error
scoring = {
    'MAE': make_scorer(mean_absolute_error, greater_is_better=False),
    'MSE': make_scorer(mean_squared_error, greater_is_better=False),
    'R2' : make_scorer(r2_score)
}

# Run cross-validation; the whole pipeline is refit on each training fold
cv_results = cross_validate(
    lin_pipeline,
    X, y,
    cv=kf,
    scoring=scoring,
    return_train_score=False,
    n_jobs=-1
)

# Convert to a DataFrame for formatting
res = pd.DataFrame({
    'MAE': -cv_results['test_MAE'],  # flip the sign (the scorer returns negated errors)
    'MSE': -cv_results['test_MSE'],
    'R2' :  cv_results['test_R2']
})
res['RMSE'] = np.sqrt(res['MSE'])

summary = res.agg(['mean','std']).T
summary.columns = ['mean','std']
summary

| | mean | std |
|---|---|---|
| MAE | 8.070563 | 0.297726 |
| MSE | 94.953242 | 6.802704 |
| R2 | 0.170409 | 0.027477 |
| RMSE | 9.738828 | 0.34715 |


In [14]:
def fmt(mean, std):
    return f"{mean:.3f} ± {std:.3f}"

print(f"== Linear Regression (CV={kf.n_splits} folds) ==")
print("MAE :", fmt(summary.loc['MAE','mean'], summary.loc['MAE','std']))
print("MSE :", fmt(summary.loc['MSE','mean'], summary.loc['MSE','std']))
print("RMSE:", fmt(summary.loc['RMSE','mean'], summary.loc['RMSE','std']))
print("R²  :", fmt(summary.loc['R2','mean'],  summary.loc['R2','std']))

== Linear Regression (CV=10 folds) ==
MAE : 8.071 ± 0.298
MSE : 94.953 ± 6.803
RMSE: 9.739 ± 0.347
R²  : 0.170 ± 0.027
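The same four metrics can also be requested through scikit-learn's built-in scorer names instead of `make_scorer`. The sketch below is only an illustration on synthetic stand-in data (`X_demo`, `y_demo`, `kf_demo` and `scoring_builtin` are hypothetical names, not part of the notebook). It also shows why the reported RMSE is the mean of per-fold RMSE values rather than the square root of the mean MSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); in the notebook, X and y come from df.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(scale=1.0, size=300)

pipe = Pipeline([('scaler', StandardScaler()), ('linreg', LinearRegression())])
kf_demo = KFold(n_splits=10, shuffle=True, random_state=42)

# Built-in scorer names; error metrics are negated so that higher is better.
scoring_builtin = {
    'MAE': 'neg_mean_absolute_error',
    'MSE': 'neg_mean_squared_error',
    'RMSE': 'neg_root_mean_squared_error',
    'R2': 'r2',
}
out = cross_validate(pipe, X_demo, y_demo, cv=kf_demo, scoring=scoring_builtin)

rmse_per_fold = -out['test_RMSE']
mse_per_fold = -out['test_MSE']

# Within each fold, RMSE is the square root of that fold's MSE.
print(np.allclose(rmse_per_fold, np.sqrt(mse_per_fold)))

# Across folds, the mean of the RMSEs is at most the square root of the mean MSE.
print(rmse_per_fold.mean(), np.sqrt(mse_per_fold.mean()))
```

This matches the results above: the mean RMSE (9.739) is slightly below the square root of the mean MSE (about 9.744), because the square root is taken per fold before averaging.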


### 5) Sanity check for the linear model

> Check whether linear regression is adequate, that is, whether it beats a dummy predictor.


In [15]:
baseline = DummyRegressor(strategy='median')
base_cv = cross_validate(
    baseline, X, y, cv=kf, scoring=scoring, n_jobs=-1
)
base_res = pd.DataFrame({
    'MAE': -base_cv['test_MAE'],
    'MSE': -base_cv['test_MSE'],
    'R2' :  base_cv['test_R2']
})
base_res['RMSE'] = np.sqrt(base_res['MSE'])
base_summary = base_res.agg(['mean','std']).T
base_summary.columns = ['mean','std']

print("\n== Baseline Dummy (median) (CV=10) ==")
print("MAE :", fmt(base_summary.loc['MAE','mean'], base_summary.loc['MAE','std']))
print("MSE :", fmt(base_summary.loc['MSE','mean'], base_summary.loc['MSE','std']))
print("RMSE:", fmt(base_summary.loc['RMSE','mean'], base_summary.loc['RMSE','std']))
print("R²  :", fmt(base_summary.loc['R2','mean'],  base_summary.loc['R2','std']))


== Baseline Dummy (median) (CV=10) ==
MAE : 8.582 ± 0.321
MSE : 116.595 ± 6.267
RMSE: 10.794 ± 0.293
R²  : -0.019 ± 0.009


### Results
A **Linear Regression** was trained in a pipeline with **StandardScaler** and evaluated with **10-fold cross-validation** (shuffle, `random_state=42`). The average metrics (mean ± std) were:
- **MAE = 8.071 ± 0.298**
- **RMSE = 9.739 ± 0.347**
- **MSE = 94.953 ± 6.803**
- **R² = 0.170 ± 0.027**

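Compared with a **Dummy baseline (median)**, the linear model performed better on every metric: MAE 8.071 vs. 8.582, RMSE 9.739 vs. 10.794, and R² 0.170 vs. -0.019. This suggests that it captures some useful linear relationship between the features and `total_UPDRS`, although an R² of about 0.17 means that most of the variance remains unexplained.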
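Because both models are evaluated with the same `KFold` object (fixed `random_state`), their scores can be compared fold by fold rather than only through the aggregated means. The following is a minimal sketch of that paired comparison on synthetic stand-in data; `X_demo`, `y_demo`, `kf_demo` and `fold_mae` are hypothetical names introduced for illustration, and the numbers it prints are not the notebook's results.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); replace with the notebook's X and y.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=400)

# An integer random_state makes every call to split() yield the same folds.
kf_demo = KFold(n_splits=10, shuffle=True, random_state=42)

def fold_mae(estimator):
    """Return the per-fold MAE (positive values) for an estimator."""
    out = cross_validate(estimator, X_demo, y_demo, cv=kf_demo,
                         scoring='neg_mean_absolute_error')
    return -out['test_score']

lin_mae = fold_mae(Pipeline([('scaler', StandardScaler()),
                             ('linreg', LinearRegression())]))
dummy_mae = fold_mae(DummyRegressor(strategy='median'))

# Positive difference = the linear model has the lower error in that fold.
diff = dummy_mae - lin_mae
print(f"MAE improvement per fold: {diff.mean():.3f} ± {diff.std(ddof=1):.3f}")
print(f"Folds where the linear model wins: {(diff > 0).sum()} of {len(diff)}")
```

A consistent win across folds is stronger evidence than a difference in means alone, since the fold-to-fold variation is shared by both models.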

No transformation that could introduce **data leakage** was applied: the scaling runs **inside the pipeline**, so it is fitted **only on the training data** of each fold.
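To make that behaviour concrete, the sketch below writes out by hand what the pipeline does inside cross-validation: fit the scaler on the training rows of a fold, then apply it to the held-out rows. It uses synthetic stand-in data, and `X_demo`, `y_demo` and `kf_demo` are hypothetical names; it is an illustration of the mechanism, not the notebook's code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); the real X, y come from the notebook.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=200)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# Manual loop: scaler statistics come from the training rows of each fold only.
manual_mae = []
for train_idx, test_idx in kf_demo.split(X_demo):
    scaler = StandardScaler().fit(X_demo[train_idx])
    model = LinearRegression().fit(scaler.transform(X_demo[train_idx]), y_demo[train_idx])
    preds = model.predict(scaler.transform(X_demo[test_idx]))
    manual_mae.append(mean_absolute_error(y_demo[test_idx], preds))
manual_mae = np.array(manual_mae)

# The same evaluation through a Pipeline and cross_validate.
pipe = Pipeline([('scaler', StandardScaler()), ('linreg', LinearRegression())])
cv_out = cross_validate(pipe, X_demo, y_demo, cv=kf_demo,
                        scoring='neg_mean_absolute_error')
pipeline_mae = -cv_out['test_score']

# Both routes give the same per-fold scores.
print(np.allclose(manual_mae, pipeline_mae))
```

The leaky alternative would be to call `StandardScaler().fit_transform` on the full dataset before splitting, which lets the held-out rows influence the scaling statistics.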

### Notes
A deeper analysis of the data is still pending. I have not checked for outliers.

* If there are strong **outliers**, consider **RobustScaler** instead of `StandardScaler`, or remove the outliers.
* `FEATURES` is already defined and will be reused in the feature selection and the curves.
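The `RobustScaler` option from the notes could be tried by swapping the scaler step of the pipeline, as in the sketch below (synthetic stand-in data; `X_demo`, `y_demo`, `kf_demo` and `cv_mae` are hypothetical names). One caveat, stated here as a general property of ordinary least squares with an intercept rather than something taken from the notebook: per-feature centering and rescaling does not change the fitted predictions, so for plain `LinearRegression` the two scalers should give the same cross-validated scores. The choice of scaler would be expected to matter mainly for coefficient interpretation or for regularized models.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler

# Synthetic stand-in data (assumption) with a few extreme feature rows.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 3))
X_demo[:5] *= 20.0  # inject outliers in the features
y_demo = X_demo @ np.array([1.0, 0.5, -1.5]) + rng.normal(scale=1.0, size=300)

kf_demo = KFold(n_splits=10, shuffle=True, random_state=42)

def cv_mae(scaler):
    """Per-fold MAE of scaler -> LinearRegression under the shared folds."""
    pipe = Pipeline([('scaler', scaler), ('linreg', LinearRegression())])
    out = cross_validate(pipe, X_demo, y_demo, cv=kf_demo,
                         scoring='neg_mean_absolute_error')
    return -out['test_score']

mae_standard = cv_mae(StandardScaler())
mae_robust = cv_mae(RobustScaler())

# For unregularized least squares the scores agree up to floating-point error.
print(np.allclose(mae_standard, mae_robust))
```

If outliers in the target or in the residuals turn out to be the real issue, removing them or using a robust loss would be a separate decision from the choice of scaler.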

# Feature selection

In [None]:
# feature selection code goes here