# In-class workshop - Linear regression

## Case study

We will use cost data from a medical insurance company. The dataset contains information about each policyholder (age, sex, BMI, number of children, smoking status, region) and the costs that person generated for the company.

Some of the costs the insurer incurs are tied to chance events, but certain populations (older people or smokers, for example) are expected to generate higher costs.

Using its historical data, the company wants to estimate the cost of new patients so that it can better adjust its pricing and financing scheme.


### Credits

The medical cost dataset is a synthetic dataset created for the book Machine Learning with R (2nd ed.) by Brett Lantz, Packt Publishing.


In [2]:
import pandas as pd

In [3]:
# Create the pandas DataFrame
patients_df = pd.read_csv('https://github.com/stedy/Machine-Learning-with-R-datasets/blob/master/insurance.csv?raw=true')

### Exploration

In [4]:
patients_df.head()

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [5]:
patients_df.describe()

,age,bmi,children,charges
count,1338.0,1338.0,1338.0,1338.0
mean,39.207025,30.663397,1.094918,13270.422265
std,14.04996,6.098187,1.205493,12110.011237
min,18.0,15.96,0.0,1121.8739
25%,27.0,26.29625,0.0,4740.28715
50%,39.0,30.4,1.0,9382.033
75%,51.0,34.69375,2.0,16639.912515
max,64.0,53.13,5.0,63770.42801


In [9]:
patients_df[['sex', 'region', 'smoker']].value_counts()

sex     region     smoker
female  southwest  no        141
        southeast  no        139
        northwest  no        135
male    southeast  no        134
female  northeast  no        132
male    northwest  no        132
        southwest  no        126
        northeast  no        125
        southeast  yes        55
        northeast  yes        38
        southwest  yes        37
female  southeast  yes        36
        northeast  yes        29
        northwest  yes        29
male    northwest  yes        29
female  southwest  yes        21
Name: count, dtype: int64
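
The case study expects smokers to cost more than non-smokers, and a grouped mean is a quick way to check that during exploration. The sketch below is self-contained: `sample_df` is a hypothetical stand-in built from the five rows shown by `head()` above, so its numbers only illustrate the call and say nothing about the full dataset. On the real data, the same `groupby` would be applied to `patients_df`.

```python
import pandas as pd

# First five rows of the dataset, copied from the head() output above.
# sample_df is an illustrative name, not part of the original notebook.
sample_df = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "smoker": ["yes", "no", "no", "no", "no"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Mean charge per smoking status
mean_charges_by_smoker = sample_df.groupby("smoker")["charges"].mean()
print(mean_charges_by_smoker)
```

With only one smoker in these five rows the comparison is anecdotal; the point is the shape of the query, not the result.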

### Preprocessing

In [11]:
# Handling categorical variables
# Option 1: Manual replacement

patients_df_1 = patients_df.replace({'sex': {'male': 0, 'female': 1}, 
                                     'smoker': {'yes': 0, 'no': 1}, 
                                     'region': {'southwest': 0, 'southeast': 1, 'northwest': 2, 'northeast': 3}})
patients_df_1.head()


,age,sex,bmi,children,smoker,region,charges
0,19,1,27.9,0,0,0,16884.924
1,18,0,33.77,1,1,1,1725.5523
2,28,0,33.0,3,1,1,4449.462
3,33,0,22.705,0,1,2,21984.47061
4,32,0,28.88,0,1,2,3866.8552
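
An alternative to `DataFrame.replace` is to map each column explicitly with `Series.map`. On recent pandas versions, `replace` with string-to-number dictionaries may emit a downcasting `FutureWarning`, and `map` sidesteps it. The sketch below uses the same codes as the cell above on a hypothetical `sample_df` built from the first five rows; `encodings` and `encoded_df` are illustrative names.

```python
import pandas as pd

# Categorical columns of the first five rows shown by head()
sample_df = pd.DataFrame({
    "sex": ["female", "male", "male", "male", "male"],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
})

# Same codes as the manual replacement above
encodings = {
    "sex": {"male": 0, "female": 1},
    "smoker": {"yes": 0, "no": 1},
    "region": {"southwest": 0, "southeast": 1, "northwest": 2, "northeast": 3},
}

encoded_df = sample_df.copy()
for column, mapping in encodings.items():
    # Series.map returns NaN for any category missing from the mapping,
    # which makes typos in the dictionary easy to spot
    encoded_df[column] = sample_df[column].map(mapping)

print(encoded_df)
```

Note that coding `region` as 0 to 3 imposes an order and equal spacing on the regions, which a linear model will treat as meaningful; the one-hot option that follows avoids that assumption.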


In [17]:
# Option 2: Automatic encoding (one-hot)

patients_df_2 = patients_df.copy()
# dtype=int yields 0/1 columns directly, so no boolean-to-integer conversion loop is needed
patients_df_2 = pd.get_dummies(patients_df_2, columns=['sex', 'smoker', 'region'], dtype=int) # , drop_first=True

patients_df_2.head()


,age,bmi,children,charges,sex_female,sex_male,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
0,19,27.9,0,16884.924,1,0,0,1,0,0,0,1
1,18,33.77,1,1725.5523,0,1,1,0,0,0,1,0
2,28,33.0,3,4449.462,0,1,1,0,0,0,1,0
3,33,22.705,0,21984.47061,0,1,1,0,0,1,0,0
4,32,28.88,0,3866.8552,0,1,1,0,0,1,0,0


### Data splitting

In [29]:
# Option 1: Manual split

# Use 70% for training (random split)
train_df = patients_df_1.sample(frac = 0.7, random_state = 42)

rest_df = patients_df_1.drop(train_df.index)

# Use 15% for validation and 15% for test
val_df = rest_df.sample(frac = 0.5, random_state = 42)

test_df = rest_df.drop(val_df.index)

print('train: ', train_df.shape)
print('val: ', val_df.shape)
print('test: ', test_df.shape)

train:  (937, 7)
val:  (200, 7)
test:  (201, 7)


In [30]:
# Option 2: Automatic split
from sklearn.model_selection import train_test_split

patients_df_train, patients_df_test = train_test_split(patients_df_1, test_size = 0.3, random_state = 42)

print('patients_df_train: ', patients_df_train.shape)
print('patients_df_test: ', patients_df_test.shape)

patients_df_train:  (936, 7)
patients_df_test:  (402, 7)
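
The automatic option above only produces train and test sets, whereas the manual option also holds out a validation set. One way to get the same 70/15/15 layout with `train_test_split` is to call it twice, as sketched below. `demo_df` is a hypothetical placeholder frame with the same number of rows as the dataset (1338); in the notebook the first call would receive `patients_df_1` instead.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame with as many rows as the insurance dataset
demo_df = pd.DataFrame({"row_id": np.arange(1338)})

# First split: 70% train, 30% held out
train_part, rest_part = train_test_split(demo_df, test_size=0.3, random_state=42)

# Second split: divide the held-out 30% evenly into validation and test
val_part, test_part = train_test_split(rest_part, test_size=0.5, random_state=42)

print(len(train_part), len(val_part), len(test_part))
```

The two options round differently at the boundary, which is why the manual split keeps 937 training rows while `train_test_split` keeps 936.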


### Model training

In [40]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()

model.fit(patients_df_train[['age', 'sex', 'bmi', 'children', 'smoker', 'region']], patients_df_train[['charges']])
model
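
After fitting, the learned parameters are available as `coef_` and `intercept_`, and pairing the coefficients with the feature names makes them easier to read. The sketch below fits a model on hypothetical noise-free data (`X_demo`, `y_demo`) generated from known coefficients, so the recovered values can be checked by eye. It is an illustration of the attributes, not a result about the insurance data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data: y = 5 + 2*x1 + 3*x2
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
y_demo = 5.0 + 2.0 * X_demo["x1"] + 3.0 * X_demo["x2"]

demo_model = LinearRegression().fit(X_demo, y_demo)

# Label each coefficient with the column it belongs to
coefficients = pd.Series(demo_model.coef_, index=X_demo.columns)
print("intercept:", demo_model.intercept_)
print(coefficients)
```

In the notebook's model the target is passed as a one-column DataFrame (`[['charges']]`), so `model.coef_` has shape `(1, 6)` and the per-feature values are in `model.coef_[0]`.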

In [41]:
charges_predicted = model.predict(patients_df_test[['age', 'sex', 'bmi', 'children', 'smoker', 'region']])
charges_predicted

array([[ 8.93142116e+03],
       [ 7.07090670e+03],
       [ 3.69370805e+04],
       [ 9.59699214e+03],
       [ 2.70083549e+04],
       [ 1.08664849e+04],
       [ 3.74610217e+01],
       [ 1.72228092e+04],
       [ 9.18308115e+02],
       [ 1.13965537e+04],
       [ 2.79154456e+04],
       [ 9.53381323e+03],
       [ 5.18928014e+03],
       [ 3.86124990e+04],
       [ 4.05094490e+04],
       [ 3.72748566e+04],
       [ 1.53562559e+04],
       [ 3.59449407e+04],
       [ 9.10631783e+03],
       [ 3.14429410e+04],
       [ 3.66298253e+03],
       [ 1.00966745e+04],
       [ 2.21091896e+03],
       [ 7.10598084e+03],
       [ 1.13521417e+04],
       [ 1.30231210e+04],
       [ 1.44472857e+04],
       [ 6.12031303e+03],
       [ 9.94564893e+03],
       [ 2.18617424e+03],
       [ 8.91389260e+03],
       [ 1.31869496e+04],
       [ 4.49110116e+03],
       [ 3.30469662e+03],
       [ 4.32885102e+03],
       [ 1.32330189e+04],
       [ 1.67071398e+03],
       [ 8.63238607e+03],
       [ 3.3
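
The predictions above come back as a two-dimensional array with one column because the model was fitted with a one-column DataFrame as target. The sketch below, on hypothetical toy data, contrasts that with fitting on a Series and shows how `ravel()` flattens the result when a one-dimensional array is more convenient.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data following y = 1 + 2*x (illustrative only)
X_demo = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0]})
y_frame = pd.DataFrame({"y": [1.0, 3.0, 5.0, 7.0]})

# Target as a one-column DataFrame: predictions have shape (n, 1)
pred_2d = LinearRegression().fit(X_demo, y_frame).predict(X_demo)

# Target as a Series: predictions have shape (n,)
pred_1d = LinearRegression().fit(X_demo, y_frame["y"]).predict(X_demo)

print(pred_2d.shape, pred_1d.shape)

# Flatten the 2D predictions when a 1D array is needed
flat_predictions = pred_2d.ravel()
```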

### Model validation

In [43]:
from sklearn.metrics import mean_squared_error, r2_score

mean_squared_error(patients_df_test['charges'], charges_predicted)

33805466.89868861
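
The MSE is expressed in squared charge units, which makes it hard to interpret. Taking the square root gives the RMSE, which is in the same units as `charges`. The sketch below applies it to the value reported above (roughly 5,814) and also checks, on hypothetical toy arrays, that `mean_squared_error` matches the average of the squared residuals.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# MSE value reported by the cell above
mse_reported = 33805466.89868861
rmse_reported = np.sqrt(mse_reported)
print("RMSE:", round(float(rmse_reported), 2))

# Toy arrays (illustrative only) to show what the metric computes
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)
print(mse_manual, mse_sklearn)
```

Against a mean charge of about 13,270 (see `describe()` above), an RMSE of this size suggests that individual predictions can still be far off.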

In [44]:
r2_score(patients_df_test['charges'], charges_predicted)

0.7694415927057693
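
An R² of about 0.77 means the model accounts for roughly 77% of the variance of `charges` in the test set. The sketch below reproduces the computation behind `r2_score` on hypothetical toy arrays: one minus the ratio between the residual sum of squares and the total sum of squares around the mean.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy arrays (illustrative only)
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# Residual sum of squares: error left after using the model
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: error of always predicting the mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_pred))
```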