**Dataset description**

This dataset contains 88,489 daily records from different cities, covering environmental variables, weather conditions, and hospital data. The source is a public dataset, used here to analyze the relationship between air quality and public health.

Main variables:

* city: Name of the city where the data was recorded (text).

* date: Date of the record (stored as a string; pandas dtype object).

* aqi: Air quality index (AQI), integer value.

* pm2_5: Concentration of fine particles, PM2.5 (μg/m³).

* pm10: Concentration of inhalable particles, PM10 (μg/m³).

* no2: Nitrogen dioxide level (μg/m³).

* o3: Ozone level (μg/m³).

* temperature: Ambient temperature in degrees Celsius.

* humidity: Relative humidity, in percent.

* hospital_admissions: Number of daily hospital admissions.

* population_density: Population density of the city, a category stored as text (for example Rural, Suburban, Urban).

* hospital_capacity: Total hospital capacity in the city.

This dataset is well suited to studying correlations between air pollution and public health outcomes, and supports exploratory analysis and predictive modeling in environmental epidemiology and public health.

Predicting admissions helps us hire more staff during crisis periods and keep operating costs under control.

**Imports**

In [117]:
# 📦 Standard library
import os

# 📊 Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# 🧪 Scikit-learn
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    auc,
    classification_report,
    ConfusionMatrixDisplay,
    confusion_matrix,
    f1_score,
    mean_squared_error,
    precision_score,
    recall_score,
    roc_curve,
    RocCurveDisplay,
    r2_score,
)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import label_binarize, OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# 💾 Model persistence
import joblib

# 📐 NumPy and pandas
import numpy as np
import pandas as pd

In [118]:
path = 'C:\\Users\\gvald\\Desktop\\Proyecto2\\selected_dataset\\dataset_select.csv'
df = pd.read_csv(path)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88489 entries, 0 to 88488
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Unnamed: 0           88489 non-null  int64  
 1   city                 88489 non-null  object 
 2   date                 88489 non-null  object 
 3   aqi                  88489 non-null  int64  
 4   pm2_5                88489 non-null  float64
 5   pm10                 88489 non-null  float64
 6   no2                  88489 non-null  float64
 7   o3                   88489 non-null  float64
 8   temperature          88489 non-null  float64
 9   humidity             88489 non-null  int64  
 10  hospital_admissions  88489 non-null  int64  
 11  population_density   88489 non-null  object 
 12  hospital_capacity    88489 non-null  int64  
dtypes: float64(5), int64(5), object(3)
memory usage: 8.8+ MB


In [119]:
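# Drop the leftover index column written by a previous CSV export.
df = df.drop(columns=['Unnamed: 0'])
# Normalize column names: lowercase, trimmed, spaces replaced with underscores.
df.columns = df.columns.str.lower().str.strip().str.replace(" ", "_")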
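
The `df.info()` output above shows `date` loaded as a string (dtype object). The following is only a minimal sketch of how it could be parsed, run on a three-row toy frame whose values are copied from `df.head()`. The `sample` frame and the `month` and `day_of_week` columns are illustrative assumptions and are not used elsewhere in this notebook.

```python
import pandas as pd

# Toy frame standing in for the notebook's df (values copied from df.head()).
sample = pd.DataFrame({
    "city": ["Los Angeles", "Beijing", "London"],
    "date": ["2020-01-01", "2020-01-02", "2020-01-03"],
    "population_density": ["Rural", "Urban", "Suburban"],
})

# Parse the ISO date strings into datetime64 values.
sample["date"] = pd.to_datetime(sample["date"], format="%Y-%m-%d")

# Hypothetical calendar features derived from the parsed date.
sample["month"] = sample["date"].dt.month
sample["day_of_week"] = sample["date"].dt.dayofweek  # Monday == 0

# Store the low-cardinality text column as a pandas categorical.
sample["population_density"] = sample["population_density"].astype("category")

print(sample.dtypes)
```

Whether calendar features help depends on the data; they are shown here only to illustrate what a parsed date makes available.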

In [120]:
df.describe().T

column,count,mean,std,min,25%,50%,75%,max
aqi,88489.0,249.370182,144.479132,0.0,124.0,249.0,374.0,499.0
pm2_5,88489.0,35.144951,14.767994,0.0,24.9,35.1,45.2,109.9
pm10,88489.0,50.118654,19.796392,0.0,36.6,50.0,63.5,143.5
no2,88489.0,30.006211,9.963139,0.0,23.3,30.0,36.7,71.4
o3,88489.0,39.978895,12.007258,0.0,31.9,40.0,48.1,93.5
temperature,88489.0,17.522962,12.961024,-5.0,6.4,17.5,28.7,40.0
humidity,88489.0,56.950966,21.629675,20.0,38.0,57.0,76.0,94.0
hospital_admissions,88489.0,8.049385,3.715458,0.0,6.0,8.0,10.0,25.0
hospital_capacity,88489.0,1024.463165,561.978071,50.0,539.0,1026.0,1511.0,1999.0


In [121]:
df.head()

index,city,date,aqi,pm2_5,pm10,no2,o3,temperature,humidity,hospital_admissions,population_density,hospital_capacity
0,Los Angeles,2020-01-01,65,34.0,52.7,2.2,38.5,33.5,33,5,Rural,1337
1,Beijing,2020-01-02,137,33.7,31.5,36.7,27.5,-1.6,32,4,Urban,1545
2,London,2020-01-03,266,43.0,59.6,30.4,57.3,36.4,25,10,Suburban,1539
3,Mexico City,2020-01-04,293,33.7,37.9,12.3,42.7,-1.0,67,10,Urban,552
4,Delhi,2020-01-05,493,50.3,34.8,31.2,35.6,33.5,72,9,Suburban,1631


In [122]:
df['hospital_admissions'].value_counts()

hospital_admissions
7     9496
8     9434
9     9170
6     8613
10    7947
5     7020
11    6583
4     5365
12    4985
3     3690
13    3628
14    2627
2     2523
0     1776
15    1661
1     1568
16    1030
17     642
18     335
19     190
20     105
21      57
22      26
23      10
24       7
25       1
Name: count, dtype: int64

In [123]:
def categorize_admissions(admissions):
    """Map a daily admissions count to a group: 0 (low), 1 (medium), 2 (high)."""
    if admissions <= 5:
        return 0      # Low: 0-5 admissions
    elif admissions <= 10:
        return 1      # Medium: 6-10 admissions
    else:
        return 2      # High: 11+ admissions

In [124]:
df['admissions_group'] = df['hospital_admissions'].apply(categorize_admissions)

In [125]:
print(df['admissions_group'].value_counts())

admissions_group
1    44660
0    21942
2    21887
Name: count, dtype: int64
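
The same binning can be done without a row-by-row `apply`. The sketch below assumes the thresholds used in `categorize_admissions` (5 and 10) and compares `pd.cut` against that function on the values 0 to 25, the range seen in `value_counts()`. The names `admissions`, `groups_cut`, and `groups_apply` are illustrative.

```python
import numpy as np
import pandas as pd


def categorize_admissions(admissions):
    """Same thresholds as the notebook's function."""
    if admissions <= 5:
        return 0
    elif admissions <= 10:
        return 1
    else:
        return 2


# Every admissions count observed in the dataset (0 through 25).
admissions = pd.Series(np.arange(0, 26), name="hospital_admissions")

# pd.cut uses right-inclusive bins by default: (-inf, 5], (5, 10], (10, inf).
groups_cut = pd.cut(
    admissions,
    bins=[-np.inf, 5, 10, np.inf],
    labels=[0, 1, 2],
).astype(int)

# Row-by-row version, for comparison.
groups_apply = admissions.apply(categorize_admissions)

print((groups_cut == groups_apply).all())
```

On a frame of this size the difference in speed is small, so this is mostly a matter of style.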


**Model training**

# Column transformation

In [126]:
X = df.drop(columns=['admissions_group'])
y = df['admissions_group']

In [127]:
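# Numeric columns to scale (this list includes hospital_admissions, the column the target was derived from).
num_cols = ['aqi', 'pm2_5', 'pm10', 'no2', 'o3', 'temperature', 'humidity', 'hospital_capacity', 'hospital_admissions']
# Categorical columns to one-hot encode.
num_cat = ['city', 'population_density']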

In [128]:
# Split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
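
Because `admissions_group` is computed from `hospital_admissions`, and that column is still part of `X`, a model can read the label directly off its inputs. The sketch below illustrates the effect on synthetic data only. The `toy` frame, its distributions, and the `holdout_accuracy` helper are assumptions made for the example and do not reproduce the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 3000

# Synthetic stand-in: pollution features are independent of admissions here.
toy = pd.DataFrame({
    "aqi": rng.integers(0, 500, n),
    "pm2_5": rng.normal(35, 15, n),
    "hospital_admissions": rng.poisson(8, n),
})
# Same thresholds as categorize_admissions: 0-5, 6-10, 11+.
toy["admissions_group"] = pd.cut(
    toy["hospital_admissions"], bins=[-np.inf, 5, 10, np.inf], labels=[0, 1, 2]
).astype(int)


def holdout_accuracy(feature_cols):
    """Fit a scaled logistic regression and return accuracy on a 20% hold-out."""
    y = toy["admissions_group"]
    X_tr, X_te, y_tr, y_te = train_test_split(
        toy[feature_cols], y, test_size=0.2, random_state=42, stratify=y
    )
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(X_tr, y_tr)
    return clf.score(X_te, y_te)


acc_leaky = holdout_accuracy(["aqi", "pm2_5", "hospital_admissions"])
acc_clean = holdout_accuracy(["aqi", "pm2_5"])
print(f"with hospital_admissions:    {acc_leaky:.3f}")
print(f"without hospital_admissions: {acc_clean:.3f}")
```

In the notebook itself, the equivalent change would be to leave `hospital_admissions` out of both `X` and `num_cols`. The scores obtained that way would have to be measured again rather than inferred from this toy example.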

# LogisticRegression

In [129]:
# ColumnTransformer
preprocessor_Regresión = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("ord", OneHotEncoder(), num_cat)
])

In [130]:
# Pipeline with logistic regression.
pipeline_line_Regresion = Pipeline([
    ('preprocessor', preprocessor_Regresión),
    ('regressor', LogisticRegression())
])

In [131]:
# Train.
pipeline_line_Regresion.fit(X_train, y_train)

Output (fitted estimator display, summarized): Pipeline with steps 'preprocessor' (ColumnTransformer with remainder='drop'; 'num' = StandardScaler, 'ord' = OneHotEncoder with handle_unknown='error') and 'regressor' (LogisticRegression with penalty='l2', C=1.0, solver='lbfgs', max_iter=100).


In [132]:
# Prediction.
y_pred_line_Regresion = pipeline_line_Regresion.predict(X_test)

# RandomForestClassifier GridSearchCV

In [136]:
preprocessor_Random = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("ord", OneHotEncoder(), num_cat)
])

In [137]:
# Pipeline with RandomForest
pipeline_forest_regressor = Pipeline(steps=[
    ("preprocessing", preprocessor_Random),
    ("regressor", RandomForestClassifier())
])
forest_params = {
    'regressor__n_estimators': [50, 100, 200],
    'regressor__max_depth': [None, 5, 10],
    'regressor__min_samples_split': [2, 5, 10]
}

forest_grid = GridSearchCV(pipeline_forest_regressor, forest_params, cv=3, scoring="accuracy")
forest_grid.fit(X_train, y_train)

# Evaluation on the held-out test set with the best estimator.
forest_best = forest_grid.best_estimator_
y_pred_forest_GridSearchCV = forest_best.predict(X_test)

In [141]:


print("Logistic Regression")
print("Accuracy:", accuracy_score(y_test, y_pred_line_Regresion))
print("\nClassification report:\n", classification_report(y_test, y_pred_line_Regresion))

print("RandomForestClassifier GridSearchCV")
print("Best parameters:", forest_grid.best_params_)
print("\nClassification report:\n", classification_report(y_test, y_pred_forest_GridSearchCV))

Logistic Regression
Accuracy: 1.0

Classification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      4350
           1       1.00      1.00      1.00      8924
           2       1.00      1.00      1.00      4424

    accuracy                           1.00     17698
   macro avg       1.00      1.00      1.00     17698
weighted avg       1.00      1.00      1.00     17698

RandomForestClassifier GridSearchCV
Best parameters: {'regressor__max_depth': None, 'regressor__min_samples_split': 2, 'regressor__n_estimators': 50}

Classification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      4350
           1       1.00      1.00      1.00      8924
           2       1.00      1.00      1.00      4424

    accuracy                           1.00     17698
   macro avg       1.00      1.00      1.00     17698
weighted avg       1.00      1.00      1.00     17698



# XGBClassifier

In [None]:
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split, cross_val_score
import optuna

preprocessor_XGBOOST = ColumnTransformer([
  ("num", "passthrough", num_cols),
  ("ord", OneHotEncoder(), num_cat)
])

In [None]:
def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
        "max_depth": trial.suggest_int("max_depth", 3, 25),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),  # sampled on a log scale
    }

    pipeline = Pipeline([
        ("pp", preprocessor_XGBOOST),
        ("model", XGBClassifier(
            random_state=42,
            eval_metric="logloss",
            **params
        ))
    ])

    score = cross_val_score(pipeline, X_train, y_train, cv=3, scoring="accuracy", n_jobs=-1)
    return score.mean()

In [None]:
# Hyperparameter search with Optuna: 20 trials, maximizing mean CV accuracy.
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)

best_params = study.best_params
print("Best hyperparameters (classification):", best_params)

[I 2025-07-15 13:47:36,233] A new study created in memory with name: no-name-350ae6dd-7a3d-4565-8f6b-baa6890e15f4
[I 2025-07-15 13:47:41,833] Trial 0 finished with value: 1.0 and parameters: {'n_estimators': 431, 'max_depth': 7, 'subsample': 0.5419339633852364, 'colsample_bytree': 0.5010985646366588, 'learning_rate': 0.25426827159070936}. Best is trial 0 with value: 1.0.
[I 2025-07-15 13:47:47,474] Trial 1 finished with value: 1.0 and parameters: {'n_estimators': 422, 'max_depth': 6, 'subsample': 0.5012136256581762, 'colsample_bytree': 0.574206395897932, 'learning_rate': 0.14055175313173143}. Best is trial 0 with value: 1.0.
[I 2025-07-15 13:47:53,059] Trial 2 finished with value: 1.0 and parameters: {'n_estimators': 487, 'max_depth': 11, 'subsample': 0.9817271930399122, 'colsample_bytree': 0.8868251049051772, 'learning_rate': 0.24546102864857158}. Best is trial 0 with value: 1.0.
[I 2025-07-15 13:47:57,837] Trial 3 finished with value: 1.0 and parameters: {'n_estimators': 175, 'max_de

Best hyperparameters (classification): {'n_estimators': 431, 'max_depth': 7, 'subsample': 0.5419339633852364, 'colsample_bytree': 0.5010985646366588, 'learning_rate': 0.25426827159070936}


In [None]:
# Final pipeline with the best hyperparameters found by Optuna.
# use_label_encoder is omitted: this XGBoost version ignores it and only emits a warning.
model = Pipeline([
  ("pp", preprocessor_XGBOOST),
  ("model", XGBClassifier(random_state=42, eval_metric="logloss", **best_params))
])

In [None]:
# Training.
model.fit(X_train, y_train)


Output (fitted estimator display, summarized): Pipeline with steps 'pp' (ColumnTransformer with remainder='drop'; 'num' and 'ord' transformers, 'ord' = OneHotEncoder with handle_unknown='error') and 'model' (XGBClassifier with objective='multi:softprob', colsample_bytree=0.5010985646366588, enable_categorical=False; the remaining parameters are cut off in the source).


In [None]:
# Evaluation.
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {acc:.4f}")

Test accuracy: 1.0000


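# Best model
 **All three models scored 1.00**
* LogisticRegression
* RandomForestClassifier GridSearchCV
* XGBClassifier

These perfect scores reflect target leakage rather than predictive skill: hospital_admissions, the column admissions_group is derived from, is among the input features, so each model only has to recover the two thresholds (5 and 10). The results therefore do not show how well admissions can be predicted from air quality and weather alone.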