**Dataset description**

This dataset contains 88,489 daily records from several cities, covering environmental variables, weather conditions, and hospital data. It comes from a public dataset (source not specified) and is used to analyze the relationship between air quality and public health.

Main variables:

* city: Name of the city where the data was recorded (text).

* date: Date of the record (stored as object/string).

* aqi: Air Quality Index (AQI), integer.

* pm2_5: Concentration of fine particulate matter (μg/m³).

* pm10: Concentration of inhalable particulate matter (μg/m³).

* no2: Nitrogen dioxide level (μg/m³).

* o3: Ozone level (μg/m³).

* temperature: Ambient temperature in degrees Celsius.

* humidity: Relative humidity, in percent.

* hospital_admissions: Number of daily hospital admissions.

* population_density: Population density of the city (categorical/object).

* hospital_capacity: Total hospital capacity in the city.

This dataset is suited to studying correlations between air pollution and public health outcomes, and supports exploratory analysis and predictive modeling in environmental epidemiology and public health.

Predicting admissions helps plan additional staffing during crisis periods and manage operating costs better.

**Imports**

In [170]:
# 📦 Standard library
import os

# 📊 Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# 🧪 Scikit-learn
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    auc,
    classification_report,
    ConfusionMatrixDisplay,
    confusion_matrix,
    f1_score,
    mean_squared_error,
    precision_score,
    recall_score,
    roc_curve,
    RocCurveDisplay,
    r2_score,
)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import label_binarize, OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# 💾 Model persistence
import joblib

# 📐 NumPy and pandas
import numpy as np
import pandas as pd

In [171]:
path = 'C:\\Users\\gvald\\Desktop\\Proyecto2\\selected_dataset\\dataset_select.csv'
df = pd.read_csv(path)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88489 entries, 0 to 88488
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Unnamed: 0           88489 non-null  int64  
 1   city                 88489 non-null  object 
 2   date                 88489 non-null  object 
 3   aqi                  88489 non-null  int64  
 4   pm2_5                88489 non-null  float64
 5   pm10                 88489 non-null  float64
 6   no2                  88489 non-null  float64
 7   o3                   88489 non-null  float64
 8   temperature          88489 non-null  float64
 9   humidity             88489 non-null  int64  
 10  hospital_admissions  88489 non-null  int64  
 11  population_density   88489 non-null  object 
 12  hospital_capacity    88489 non-null  int64  
dtypes: float64(5), int64(5), object(3)
memory usage: 8.8+ MB


In [172]:
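# Drop the leftover index column from the CSV export.
df = df.drop(columns=['Unnamed: 0'])
# Normalize column names: lowercase, strip whitespace, replace spaces with underscores.
df.columns = df.columns.str.lower().str.strip().str.replace(" ", "_")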

In [173]:
df.describe().T

variable,count,mean,std,min,25%,50%,75%,max
aqi,88489.0,249.370182,144.479132,0.0,124.0,249.0,374.0,499.0
pm2_5,88489.0,35.144951,14.767994,0.0,24.9,35.1,45.2,109.9
pm10,88489.0,50.118654,19.796392,0.0,36.6,50.0,63.5,143.5
no2,88489.0,30.006211,9.963139,0.0,23.3,30.0,36.7,71.4
o3,88489.0,39.978895,12.007258,0.0,31.9,40.0,48.1,93.5
temperature,88489.0,17.522962,12.961024,-5.0,6.4,17.5,28.7,40.0
humidity,88489.0,56.950966,21.629675,20.0,38.0,57.0,76.0,94.0
hospital_admissions,88489.0,8.049385,3.715458,0.0,6.0,8.0,10.0,25.0
hospital_capacity,88489.0,1024.463165,561.978071,50.0,539.0,1026.0,1511.0,1999.0
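
As a quick check of how strongly each numeric variable relates to the target, one could rank the Pearson correlations with `hospital_admissions`. The sketch below uses a small synthetic frame that only mimics the notebook's column names (the values are invented, not taken from the dataset); on the real data the same expression would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df: column names follow the notebook, values are invented.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "aqi": rng.integers(0, 500, n),
    "pm2_5": rng.normal(35, 15, n),
    "temperature": rng.uniform(-5, 40, n),
})
# Assumption for the demo only: admissions depend weakly on pm2_5 plus noise.
demo["hospital_admissions"] = 8 + 0.1 * (demo["pm2_5"] - 35) + rng.normal(0, 3, n)

# Pearson correlation of every numeric column with the target, strongest first.
corr_with_target = (
    demo.corr(numeric_only=True)["hospital_admissions"]
    .drop("hospital_admissions")
    .sort_values(key=np.abs, ascending=False)
)
print(corr_with_target)
```

Correlations close to zero across all columns would be an early sign that a model built on these features will explain little of the variance.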


In [174]:
df.head()

index,city,date,aqi,pm2_5,pm10,no2,o3,temperature,humidity,hospital_admissions,population_density,hospital_capacity
0,Los Angeles,2020-01-01,65,34.0,52.7,2.2,38.5,33.5,33,5,Rural,1337
1,Beijing,2020-01-02,137,33.7,31.5,36.7,27.5,-1.6,32,4,Urban,1545
2,London,2020-01-03,266,43.0,59.6,30.4,57.3,36.4,25,10,Suburban,1539
3,Mexico City,2020-01-04,293,33.7,37.9,12.3,42.7,-1.0,67,10,Urban,552
4,Delhi,2020-01-05,493,50.3,34.8,31.2,35.6,33.5,72,9,Suburban,1631
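
The `date` column is loaded as a string. If calendar effects were of interest, one possible approach (not part of the original notebook) would be to parse it and derive numeric features. The sketch below uses two rows copied from the output above; the new column names `month` and `day_of_week` are illustrative.

```python
import pandas as pd

# Two rows copied from df.head(), reduced to the relevant columns.
demo = pd.DataFrame({
    "city": ["Los Angeles", "Beijing"],
    "date": ["2020-01-01", "2020-01-02"],
})

# Parse the ISO-formatted strings into datetime64 values.
demo["date"] = pd.to_datetime(demo["date"], format="%Y-%m-%d")

# Derive simple calendar features (hypothetical column names).
demo["month"] = demo["date"].dt.month
demo["day_of_week"] = demo["date"].dt.dayofweek  # Monday = 0

print(demo)
```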


**Model training**

# Column transformation

In [175]:
X = df.drop(columns=['hospital_admissions'])
y = df['hospital_admissions']

In [176]:
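# Numeric feature columns.
num_cols = ['aqi', 'pm2_5', 'pm10', 'no2', 'o3', 'temperature', 'humidity', 'hospital_capacity']
# Categorical feature columns (despite the name num_cat).
# 'date' is in neither list, so the ColumnTransformer drops it (remainder='drop').
num_cat = ['city', 'population_density']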

In [177]:
# Train/test split (80% train, 20% test).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# LinearRegression

In [178]:
# ColumnTransformer
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("ord", OneHotEncoder(), num_cat)
])

In [179]:
# Pipeline with linear regression.
pipeline_line = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

In [180]:
print("Columns in X:", X.columns.tolist())

Columns in X: ['city', 'date', 'aqi', 'pm2_5', 'pm10', 'no2', 'o3', 'temperature', 'humidity', 'population_density', 'hospital_capacity']


In [181]:
# Train.
pipeline_line.fit(X_train, y_train)

Pipeline(steps=[('preprocessor', ColumnTransformer(transformers=[('num', StandardScaler(), ...), ('ord', OneHotEncoder(), ...)])), ('regressor', LinearRegression())])

In [182]:
# Predict.
y_pred_line_Regresion = pipeline_line.predict(X_test)

In [183]:
print("Linear Regression")
print("R² score:", round(r2_score(y_test, y_pred_line_Regresion), 2))
print("MSE:", round(mean_squared_error(y_test, y_pred_line_Regresion), 2))

Linear Regression
R² score: 0.16
MSE: 11.58
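
The two metrics are linked by R² = 1 - MSE / Var(y). As a rough consistency check, the sketch below plugs in the reported MSE and the standard deviation of `hospital_admissions` from `df.describe()`. Using the full-dataset standard deviation as a stand-in for the test-set value is an approximation.

```python
import math

mse = 11.58            # reported test MSE for linear regression
target_std = 3.715458  # std of hospital_admissions from df.describe()

rmse = math.sqrt(mse)                  # typical error, in admissions per day
approx_r2 = 1 - mse / target_std ** 2  # R² = 1 - MSE / Var(y)

print(f"RMSE: {rmse:.2f}")
print(f"Approximate R²: {approx_r2:.2f}")
```

An RMSE of about 3.4 admissions against a standard deviation of about 3.7 means the model does only slightly better than always predicting the mean.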


# RandomForestRegressor

In [184]:
preprocessor_forest = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("ord", OneHotEncoder(), num_cat)
])

In [185]:
# Pipeline with random forest regressor.
pipeline_forest = Pipeline([
    ('preprocessor', preprocessor_forest),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])

In [186]:
# Train.
pipeline_forest.fit(X_train, y_train)

Pipeline(steps=[('preprocessor', ColumnTransformer(transformers=[('num', StandardScaler(), ...), ('ord', OneHotEncoder(), ...)])), ('regressor', RandomForestRegressor(random_state=42))])


In [187]:
# Predict.
y_pred_forest = pipeline_forest.predict(X_test)

In [188]:
print("Random Forest Regressor")
print("R² score:", round(r2_score(y_test, y_pred_forest), 2))
print("MSE:", round(mean_squared_error(y_test, y_pred_forest), 2))

Random Forest Regressor
R² score: 0.14
MSE: 11.84


# XGBoost for regression

In [196]:
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split, cross_val_score
import optuna

preprocessor_XGBOOST = ColumnTransformer([
  ("num", "passthrough", num_cols),
  ("ord", OneHotEncoder(), num_cat)
])

In [198]:
# Optuna objective: mean 3-fold cross-validated R² for one set of sampled hyperparameters.
def objective(trial):
  params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 300),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "gamma": trial.suggest_float("gamma", 0, 5),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 10),
        "reg_alpha": trial.suggest_float("reg_alpha", 0, 1),
        "reg_lambda": trial.suggest_float("reg_lambda", 0, 1)
  }

  pipeline = Pipeline([
    ("pp", preprocessor_XGBOOST),
    ("model", XGBRegressor(random_state=42, **params))
  ])

  score = cross_val_score(pipeline, X_train, y_train, cv=3, scoring="r2", n_jobs=-1)
  return score.mean()

In [199]:
# Run the Optuna study (20 trials, maximizing R²) and keep the best hyperparameters.
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)

best_params = study.best_params
print("Best hyperparameters (regression):", best_params)

[I 2025-07-10 16:53:56,967] A new study created in memory with name: no-name-b0b40d78-67d0-45ab-9e40-b8c28030fcf9
[I 2025-07-10 16:53:57,801] Trial 0 finished with value: 0.13916095097859701 and parameters: {'n_estimators': 81, 'max_depth': 8, 'subsample': 0.9351236437277755, 'colsample_bytree': 0.5424147763526191, 'learning_rate': 0.11534723446406316, 'gamma': 0.1758549363535905, 'min_child_weight': 9, 'reg_alpha': 0.5699178272847945, 'reg_lambda': 0.1435720985905472}. Best is trial 0 with value: 0.13916095097859701.
[I 2025-07-10 16:53:58,389] Trial 1 finished with value: 0.1552524964014689 and parameters: {'n_estimators': 64, 'max_depth': 4, 'subsample': 0.8006349840802853, 'colsample_bytree': 0.7101082197627622, 'learning_rate': 0.10912235698684378, 'gamma': 0.9279388325694299, 'min_child_weight': 4, 'reg_alpha': 0.7407732404051127, 'reg_lambda': 0.09872267920063538}. Best is trial 1 with value: 0.1552524964014689.
[I 2025-07-10 16:53:59,373] Trial 2 finished with value: 0.14747983

Best hyperparameters (regression): {'n_estimators': 126, 'max_depth': 3, 'subsample': 0.8409252037181181, 'colsample_bytree': 0.912252284145761, 'learning_rate': 0.08468010852960271, 'gamma': 1.6784509766892888, 'min_child_weight': 7, 'reg_alpha': 0.4680764752606493, 'reg_lambda': 0.2693299440850042}


In [200]:
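# Final pipeline with the tuned hyperparameters.
# Note: this reuses `preprocessor` (with StandardScaler) rather than the preprocessor_XGBOOST
# used during tuning; feature scaling has little effect on tree-based models.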
model = Pipeline([
    ("pp", preprocessor),
    ("model", XGBRegressor(random_state=42, **best_params))
])

In [201]:
# Train.
model.fit(X_train, y_train)

Pipeline(steps=[('pp', ColumnTransformer(transformers=[('num', StandardScaler(), ...), ('ord', OneHotEncoder(), ...)])), ('model', XGBRegressor(...))])


In [202]:
y_pred = model.predict(X_test)

print(f"R² score: {r2_score(y_test, y_pred):.4f}")

R² score: 0.1596


XGBoost reached an R² of 0.1596, essentially matching linear regression (0.16) and ahead of the random forest (0.14). The available features therefore explain only about 16% of the variance in hospital admissions. To improve the predictions, the dataset would need additional columns that are more strongly related to the target.
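
Since `joblib` is imported at the top but not used in this section, a fitted pipeline could be persisted along these lines. This is a sketch on a toy model; the file name `admissions_model.joblib` is hypothetical, and in the notebook the object to save would be `model`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy pipeline standing in for the fitted notebook model.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])
toy_model = Pipeline([("scale", StandardScaler()), ("reg", LinearRegression())])
toy_model.fit(X_toy, y_toy)

# Save to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "admissions_model.joblib")  # hypothetical name
    joblib.dump(toy_model, model_path)
    restored = joblib.load(model_path)

# The restored pipeline should reproduce the original predictions.
same_predictions = np.allclose(toy_model.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```

Saving the whole pipeline, rather than only the regressor, keeps the fitted preprocessing steps together with the model.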