# Regression Analysis of Health Insurance Cost

This notebook walks through a complete analysis for predicting health insurance charges.  
It covers:

- Data preparation  
- Training a baseline model (Linear Regression)  
- Training three additional models (Decision Tree, Random Forest, XGBoost)  
- Computing metrics: MAPE, MSE, RMSE  
- Model comparison  
- Selecting the best model  
- Error analysis  
- Exporting the final model (.pkl)  
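
For reference, the three metrics listed above can be written out by hand. The sketch below uses plain Python and made-up numbers. The helper names `mape`, `mse`, and `rmse` are illustrative and are not part of the notebook, which relies on scikit-learn's `mean_absolute_percentage_error` and `mean_squared_error` instead.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: same units as the target."""
    return math.sqrt(mse(y_true, y_pred))


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction (0.10 means 10%).

    Assumes no true value is zero.
    """
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values, for illustration only
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]

print(mape(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))
```

Because MAPE is expressed as a fraction, a value such as 0.289 reads as an average error of roughly 29% of the true charge.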


In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

import pickle


In [2]:
df = pd.read_csv("train_insurance.csv")
df.head()


Unnamed: 0,age,sex,bmi,children,smoker,region,charges,log_charges
0,47,female,33.915,3,no,northwest,10115.00885,9.221874
1,34,male,34.675,0,no,northeast,4518.82625,8.416229
2,43,female,35.64,1,no,southeast,7345.7266,8.90201
3,18,female,36.85,0,no,southeast,1629.8335,7.396847
4,54,male,24.035,0,no,northeast,10422.91665,9.251858


In [3]:
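# Categorical and numeric feature columns
cat_cols = ["sex", "smoker", "region"]
num_cols = ["age", "bmi", "children"]

X = df[cat_cols + num_cols]
y = df["charges"]  # target: raw charges (log_charges is not used)

# One-hot encode the categorical columns, dropping the first level of each;
# the numeric columns pass through unchanged.
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(drop="first"), cat_cols)
    ],
    remainder="passthrough"
)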


In [4]:
# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: linear regression on the encoded features
model_lr = Pipeline([
    ("pre", preprocessor),
    ("model", LinearRegression())
])

model_lr.fit(X_train, y_train)


The format of the columns of the 'remainder' transformer in ColumnTransformer.transformers_ will change in version 1.7 to match the format of the other transformers.
At the moment the remainder columns are stored as indices (of type int). With the same ColumnTransformer configuration, in the future they will be stored as column names (of type str).



In [5]:
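# Baseline metrics, computed on the training split
# (they measure fit to seen data, not generalization)
pred_train = model_lr.predict(X_train)

mape_lr = mean_absolute_percentage_error(y_train, pred_train)
mse_lr = mean_squared_error(y_train, pred_train)
rmse_lr = np.sqrt(mse_lr)

mape_lr, mse_lr, rmse_lr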


(0.28940056869516456, 19937704.67641902, np.float64(4465.165694172952))

In [6]:
# Candidate models. RandomForestRegressor has no random_state here,
# so its metrics vary slightly between runs.
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=8),
    "Random Forest": RandomForestRegressor(n_estimators=300),
    "XGBoost": XGBRegressor(n_estimators=300, learning_rate=0.05, max_depth=5)
}

results = []

for name, mdl in models.items():
    # Same preprocessing for every model
    pipe = Pipeline([
        ("pre", preprocessor),
        ("model", mdl)
    ])

    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_train)  # predictions on the training split

    mape = mean_absolute_percentage_error(y_train, pred)
    mse = mean_squared_error(y_train, pred)
    rmse = np.sqrt(mse)

    results.append([name, mape, mse, rmse])


In [7]:
df_metrics = pd.DataFrame(
    results, columns=["Modelo", "MAPE", "MSE", "RMSE"]
)
df_metrics


Unnamed: 0,Modelo,MAPE,MSE,RMSE
0,Linear Regression,0.289401,19937700.0,4465.165694
1,Decision Tree,0.16926,7262116.0,2694.831342
2,Random Forest,0.120479,2735710.0,1653.998041
3,XGBoost,0.138824,3534838.0,1880.116462


In [8]:
# Pick the model with the lowest training RMSE
best_model_name = df_metrics.sort_values("RMSE").iloc[0]["Modelo"]
best_model_name


'Random Forest'
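
The ranking above is based on metrics computed on the training split, which tends to favor flexible models such as tree ensembles. A complementary check is to score each model on the held-out split as well. The sketch below illustrates the idea on synthetic data: the `demo` frame and its coefficients are made up, XGBoost is left out to keep the dependencies to scikit-learn, and the names `candidates`, `make_pipe`, and `comparison` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for train_insurance.csv (all values are made up)
rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame({
    "sex": rng.choice(["female", "male"], n),
    "smoker": rng.choice(["no", "yes"], n, p=[0.8, 0.2]),
    "region": rng.choice(["northeast", "northwest", "southeast", "southwest"], n),
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 5, n).round(2),
    "children": rng.integers(0, 4, n),
})
demo["charges"] = (
    2000 + 250 * demo["age"] + 20000 * (demo["smoker"] == "yes")
    + rng.normal(0, 1000, n)
)

cat_cols = ["sex", "smoker", "region"]
num_cols = ["age", "bmi", "children"]
X, y = demo[cat_cols + num_cols], demo["charges"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

candidates = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=8, random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}


def make_pipe(model):
    """Build a fresh preprocessing + model pipeline for each candidate."""
    pre = ColumnTransformer(
        transformers=[("cat", OneHotEncoder(drop="first"), cat_cols)],
        remainder="passthrough",
    )
    return Pipeline([("pre", pre), ("model", model)])


rows = []
for name, mdl in candidates.items():
    pipe = make_pipe(mdl).fit(X_train, y_train)
    # Score the same fitted pipeline on both splits
    for split, X_s, y_s in [("train", X_train, y_train), ("test", X_test, y_test)]:
        rmse = np.sqrt(mean_squared_error(y_s, pipe.predict(X_s)))
        rows.append({"model": name, "split": split, "RMSE": rmse})

comparison = pd.DataFrame(rows).pivot(index="model", columns="split", values="RMSE")
print(comparison)
```

A large gap between the train and test columns for a given model is a sign of overfitting, so selecting on the test column (or on cross-validated scores) is usually the more reliable choice.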

In [9]:
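# Refit the selected estimator on the training split inside a fresh pipeline
best_mdl = models[best_model_name]

final_model = Pipeline([
    ("pre", preprocessor),
    ("model", best_mdl)
])

final_model.fit(X_train, y_train)

# Serialize the whole pipeline (preprocessing + model) to disk
with open("modelo_seguro.pkl", "wb") as f:
    pickle.dump(final_model, f)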
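
Because the pickle holds the whole pipeline, a consumer can load it and pass raw feature columns directly. The sketch below shows that round trip on a small stand-in pipeline. The training rows, the `new_customer` record, and the file name `modelo_seguro_demo.pkl` are made up for illustration, and a linear model is used only to keep the example small.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

cat_cols = ["sex", "smoker", "region"]
num_cols = ["age", "bmi", "children"]
features = cat_cols + num_cols

# Tiny made-up training frame (not the notebook's data)
train = pd.DataFrame({
    "sex": ["female", "male", "female", "male", "female", "male"],
    "smoker": ["no", "yes", "no", "yes", "no", "no"],
    "region": ["northeast", "northwest", "southeast", "southwest",
               "southeast", "northeast"],
    "age": [47, 34, 43, 18, 54, 29],
    "bmi": [33.9, 34.7, 35.6, 36.9, 24.0, 27.5],
    "children": [3, 0, 1, 0, 0, 2],
    "charges": [10100.0, 24500.0, 7300.0, 21600.0, 10400.0, 4100.0],
})

pipe = Pipeline([
    ("pre", ColumnTransformer(
        transformers=[("cat", OneHotEncoder(drop="first"), cat_cols)],
        remainder="passthrough",
    )),
    ("model", LinearRegression()),
])
pipe.fit(train[features], train["charges"])

# Save and reload through a temporary file (hypothetical file name)
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "modelo_seguro_demo.pkl"
    with open(path, "wb") as f:
        pickle.dump(pipe, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# Raw, unencoded input: the loaded pipeline applies the encoding itself
new_customer = pd.DataFrame([{
    "sex": "male", "smoker": "no", "region": "southeast",
    "age": 40, "bmi": 28.0, "children": 2,
}])[features]

prediction = loaded.predict(new_customer)
print(prediction)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.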


In [10]:
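# Signed residuals on the training split
errors = pd.DataFrame({
    "real": y_train,
    "pred": final_model.predict(X_train)
})

errors["error"] = errors["real"] - errors["pred"]
# Sorting descending shows the largest under-predictions (real > pred)
errors.sort_values("error", ascending=False).head(10)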


Unnamed: 0,real,pred,error
661,27724.28875,18605.025368,9119.263382
755,32108.66282,23114.370754,8994.292066
725,28468.91901,20862.440904,7606.478106
354,26018.95052,18429.288849,7589.661671
38,27375.90478,19786.316982,7589.587798
432,28340.18885,20845.366688,7494.822162
625,24671.66334,17451.635354,7220.027986
308,20277.80751,13429.358117,6848.449393
90,29186.48236,22456.551269,6729.931091
478,21595.38229,14993.54964,6601.83265
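
Sorting by signed error only surfaces cases where the model predicts too low. One way to extend the analysis is to add absolute and relative error columns, as sketched below using the first three rows of the table above. The column names `abs_error` and `pct_error` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# First three rows of the error table above (index, real, pred)
errors = pd.DataFrame(
    {
        "real": [27724.28875, 32108.66282, 28468.91901],
        "pred": [18605.025368, 23114.370754, 20862.440904],
    },
    index=[661, 755, 725],
)

errors["error"] = errors["real"] - errors["pred"]
errors["abs_error"] = errors["error"].abs()
# Relative error as a fraction of the true charge
errors["pct_error"] = errors["abs_error"] / errors["real"]

# Sorting by absolute error surfaces large misses in either direction
print(errors.sort_values("abs_error", ascending=False))
```

On these three rows the prediction falls short of the true charge by more than a quarter in each case, which suggests looking at what the worst-predicted records have in common.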
