- Predicción de calidad de sueño
    - Target: Quality of Sleep (regresión)

- Modelos: regresión lineal regularizada (Ridge/Lasso), Random Forest Regressor, GBM.
- Features clave: Sleep Duration, Stress Level, Physical Activity Level, AHI_Score, Age.

In [59]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from imblearn.over_sampling import SMOTE


from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import (
    train_test_split, StratifiedKFold, KFold, cross_val_score, GridSearchCV
)
from sklearn.preprocessing import StandardScaler
# Configuración warnings
# ==============================================================================
import warnings
warnings.filterwarnings('ignore')

In [60]:


df = pd.read_csv('../data/combined_sleep_dataset.csv')


In [61]:
# Selección de features y target
features = [
    'Sleep Duration',
    'Stress Level',
    'Physical Activity Level',
    'Age'
]
X = df[features]
y = df['Quality of Sleep']

# Split train / test
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

In [62]:
# Definición de pipelines y grids
models = {
    'Ridge': {
        'pipe': Pipeline([
            ('scaler', StandardScaler()),
            ('ridge', Ridge())
        ]),
        'params': {
            'ridge__alpha': [0.1, 1.0, 10.0, 100.0]
        }
    },
    'RandomForest': {
        'pipe': Pipeline([
            ('scaler', StandardScaler()),
            ('rf', RandomForestRegressor(random_state=42))
        ]),
        'params': {
            'rf__n_estimators': [100, 200],
            'rf__max_depth': [None, 5, 10],
            'rf__min_samples_leaf': [1, 3]
        }
    },
    'XGBoost': {
        'pipe': Pipeline([
            ('scaler', StandardScaler()),
            ('xgb', XGBRegressor(
                objective='reg:squarederror',
                random_state=42,
                eval_metric='rmse',
                use_label_encoder=False
            ))
        ]),
        'params': {
            'xgb__n_estimators': [100, 200],
            'xgb__max_depth': [3, 5],
            'xgb__learning_rate': [0.01, 0.1],
            'xgb__subsample': [0.8, 1.0]
        }
    }
}



In [63]:
# GridSearchCV + evaluación
results = []

for name, m in models.items():
    print(f"\n>> Entrenando {name}…")
    grid = GridSearchCV(
        m['pipe'],
        m['params'],
        cv=3,
        scoring='neg_mean_squared_error',
        n_jobs=-1,
        verbose=0
    )
    grid.fit(X_train, y_train)
    
    best = grid.best_estimator_
    y_pred = best.predict(X_test)
    
    mse  = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    mae  = mean_absolute_error(y_test, y_pred)
    r2   = r2_score(y_test, y_pred)
    
    print(f"  → Mejores params: {grid.best_params_}")
    print(f"  → RMSE: {rmse:.3f}, MAE: {mae:.3f}, R²: {r2:.3f}")
    
    results.append({
        'modelo': name,
        'RMSE': rmse,
        'MAE': mae,
        'R2': r2
    })

# Comparativa final
df_res = pd.DataFrame(results).set_index('modelo')
print("\n==== Comparativa de métricas ====")
print(df_res)


>> Entrenando Ridge…
  → Mejores params: {'ridge__alpha': 10.0}
  → RMSE: 0.361, MAE: 0.255, R²: 0.911

>> Entrenando RandomForest…
  → Mejores params: {'rf__max_depth': None, 'rf__min_samples_leaf': 1, 'rf__n_estimators': 200}
  → RMSE: 0.001, MAE: 0.000, R²: 1.000

>> Entrenando XGBoost…
  → Mejores params: {'xgb__learning_rate': 0.1, 'xgb__max_depth': 5, 'xgb__n_estimators': 200, 'xgb__subsample': 1.0}
  → RMSE: 0.001, MAE: 0.000, R²: 1.000

==== Comparativa de métricas ====
                  RMSE       MAE        R2
modelo                                    
Ridge         0.361197  0.255042  0.911194
RandomForest  0.000848  0.000077  1.000000
XGBoost       0.001098  0.000361  0.999999


### Interpretación de resultados de regresión

1. **Ridge (Regresión lineal regularizada)**  
   - **Mejores parámetros**: α = 10.0  
   - **RMSE: 0.361**  
   - **MAE: 0.255**  
   - **R²: 0.911**  
   > Explica un 91 % de la varianza de la calidad de sueño con un error medio de ≈0.25 puntos. Indica buen balance **sesgo-varianza** sin sobreajuste evidente.

2. **Random Forest Regressor**  
   - **Mejores parámetros**:  
     - nº de árboles = 200  
     - profundidad ilimitada  
     - mínimos 1 muestra por hoja  
   - **RMSE: 0.001**, **MAE: 0.000**, **R²: 1.000**  
   > Resultados casi “perfectos” en el test set: error prácticamente cero y R²=1. Esto casi siempre señala **sobreajuste** o **fuga de información** (alguna variable es, o deriva, directamente de la calidad de sueño).

3. **XGBoost Regressor**  
   - **Mejores parámetros**:  
     - learning_rate = 0.1  
     - max_depth = 5  
     - estimadores = 200  
     - subsample = 1.0  
   - **RMSE: 0.001**, **MAE: 0.000**, **R²: 0.999999**  
   > Mismo caso que Random Forest: perfección en el test, indicativo de **memorizar** los datos de entrenamiento o de un **leak**.

- **Ridge** ofrece un ajuste realista y generalizable.  
- **Random Forest** y **XGBoost** muestran métricas irreales que apuntan a **sobreajuste** o **fuga de información**.  



---

##### Revisar

In [64]:
import pandas as pd
import numpy as np
from sklearn.model_selection import (
    train_test_split, KFold, cross_val_score, GridSearchCV
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor, callback
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# 0. Carga de datos y definición de X, y

features = [
    'Sleep Duration',
    'Stress Level',
    'Physical Activity Level',
    'AHI Score',   # asegúrate que coincide con tu columna real
    'Age'
]
X = df[features]
y = df['Quality of Sleep']

# 1. Detectar fuga de información
features_corr = [f for f in features if f in df.columns]
corr = df[features_corr + ['Quality of Sleep']].corr()['Quality of Sleep'] \
         .abs().sort_values(ascending=False)
print("Correlaciones con Quality of Sleep:\n", corr)

# 2. Validación anidada (nested CV)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('xgb', XGBRegressor(
        objective='reg:squarederror',
        use_label_encoder=False,
        eval_metric='rmse',
        random_state=42
    ))
])
param_grid = {
    'xgb__n_estimators': [100, 200],
    'xgb__max_depth': [3, 5],
    'xgb__learning_rate': [0.05, 0.1]
}
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)
grid = GridSearchCV(
    pipeline, param_grid,
    cv=inner_cv,
    scoring='neg_mean_squared_error',
    n_jobs=-1
)
scores = cross_val_score(
    grid, X, y,
    cv=outer_cv,
    scoring='neg_mean_squared_error',
    n_jobs=-1
)
print("Nested CV RMSE promedio:", np.sqrt(-scores).mean())

# 3. Restringir complejidad de XGBoost
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
xgb_simple = XGBRegressor(
    objective='reg:squarederror',
    n_estimators=100,
    max_depth=3,
    min_child_weight=5,
    gamma=1.0,
    reg_alpha=0.1,
    reg_lambda=1.0,
    learning_rate=0.1,
    random_state=42,
    use_label_encoder=False,
    eval_metric='rmse'
)
xgb_simple.fit(X_train, y_train)
y_pred_simple = xgb_simple.predict(X_test)
print("RMSE simple:", np.sqrt(mean_squared_error(y_test, y_pred_simple)))
print("R² simple:", r2_score(y_test, y_pred_simple))

# 4. Early stopping con callbacks
X_tr2, X_val, y_tr2, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=1
)
xgb_es = XGBRegressor(
    objective='reg:squarederror',
    n_estimators=1000,
    max_depth=3,
    learning_rate=0.05,
    random_state=42,
    use_label_encoder=False,
    eval_metric='rmse'
)
xgb_es.fit(
    X_tr2, y_tr2,
    eval_set=[(X_val, y_val)],
    callbacks=[callback.EarlyStopping(rounds=20, save_best=True)],
    verbose=False
)
y_pred_es = xgb_es.predict(X_test)
print("Iteraciones usadas:", xgb_es.best_iteration)
print("RMSE early stopping:", np.sqrt(mean_squared_error(y_test, y_pred_es)))
print("R² early stopping:", r2_score(y_test, y_pred_es))

# 5. Selección de features por importancia
fi = pd.Series(xgb_simple.feature_importances_, index=features).sort_values(ascending=False)
print("Importancia de features:\n", fi)
plt.figure(figsize=(8,5))
fi.plot.barh()
plt.title('Importancia de Features (XGBoost)')
plt.tight_layout()
plt.show()


KeyError: "['AHI Score'] not in index"