# Project planning

## Business understanding

In this exercise we estimate the medical costs of different patients using a Machine Learning model. The results can be compared with those of classmates in the Kaggle competition.
https://www.kaggle.com/c/estimacin-de-costes-mdicos-sic-ed2-2021/overview

## Data understanding

We are given two CSV files, split into train and test.
They contain the following columns:
- id: identifier column, used as the primary key (PK).
- age: age of the primary beneficiary of the health insurance.
- sex: sex of the policyholder.
- bmi: body mass index.
- children: number of children covered by the health insurance / number of dependents.
- smoker: whether the person smokes.
- region: residential area of the beneficiary of the health insurance.
- charges: medical costs billed to the insurer. This column is absent from the test file.

Next we load the data so it can be processed.

### Importing libraries

In [1]:
import pandas as pd
import numpy as np

from sklearn import datasets
from sklearn import metrics, preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Ridge
from sklearn.linear_model import BayesianRidge
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor


### Loading the dataset

The dataset comes split into two files: the training part in _train.csv_ and the validation part in _test.csv_. We create one variable for each. The ``id`` column is not needed for training; the lines that drop it are commented out, so it is kept here and skipped by position in later cells.

In [2]:
# Training part
dataset_train = pd.read_csv('train.csv')
#dataset_train = dataset_train.drop(columns='id') # The ids are not used for training

# Validation part
dataset_test = pd.read_csv('test.csv')
#dataset_test = dataset_test.drop(columns='id') # The ids are not used for validation

dataset_train.head()
#dataset_test.head()

| | id | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|---|
| 0 | 1229 | 58 | male | 30.305 | 0 | no | northeast | 11938.25595 |
| 1 | 1073 | 54 | female | 28.88 | 2 | no | northeast | 12096.6512 |
| 2 | 768 | 64 | female | 39.7 | 0 | no | southwest | 14319.031 |
| 3 | 606 | 27 | female | 25.175 | 0 | no | northeast | 3558.62025 |
| 4 | 342 | 60 | female | 27.55 | 0 | no | northeast | 13217.0945 |


Three variables are not numeric: ``sex``, ``smoker`` and ``region``. Of these, ``sex`` and ``smoker`` are binary. Next we check which values ``region`` can take, so we know how to prepare it in the next step.

In [3]:
dataset_train['region'].value_counts()

southeast    251
northeast    235
northwest    231
southwest    219
Name: region, dtype: int64

In [4]:
dataset_train['sex'].value_counts()

male      481
female    455
Name: sex, dtype: int64

In [5]:
dataset_train['smoker'].value_counts()

no     733
yes    203
Name: smoker, dtype: int64

In this case ``region`` takes four possible values; it will be handled in the next phase.

## Data preparation

To make all our variables numeric, we pass the three variables mentioned above through a **LabelEncoder**.

In [6]:
le = preprocessing.LabelEncoder()
# Training part
dataset_train['sex'] = le.fit_transform(dataset_train['sex'])
dataset_train['smoker'] = le.fit_transform(dataset_train['smoker'])
dataset_train['region'] = le.fit_transform(dataset_train['region'])

# Validation part
# The encoder is refitted here; the codes match the training set only as long as
# both files contain the same categories (LabelEncoder sorts its classes)
dataset_test['sex'] = le.fit_transform(dataset_test['sex'])
dataset_test['smoker'] = le.fit_transform(dataset_test['smoker'])
dataset_test['region'] = le.fit_transform(dataset_test['region'])

dataset_train.head()

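| | id | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|---|
| 0 | 1229 | 58 | 1 | 30.305 | 0 | 0 | 0 | 11938.25595 |
| 1 | 1073 | 54 | 0 | 28.88 | 2 | 0 | 0 | 12096.6512 |
| 2 | 768 | 64 | 0 | 39.7 | 0 | 0 | 3 | 14319.031 |
| 3 | 606 | 27 | 0 | 25.175 | 0 | 0 | 0 | 3558.62025 |
| 4 | 342 | 60 | 0 | 27.55 | 0 | 0 | 0 | 13217.0945 |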


In [7]:
type(dataset_train)

pandas.core.frame.DataFrame

As shown below, the columns no longer hold text strings; each now holds an integer category code.

In [8]:
dataset_train['region'].value_counts()

2    251
0    235
1    231
3    219
Name: region, dtype: int64

In [9]:
dataset_train['sex'].value_counts()

1    481
0    455
Name: sex, dtype: int64

In [10]:
dataset_train['smoker'].value_counts()

0    733
1    203
Name: smoker, dtype: int64
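
Because ``region`` is a nominal variable, the integer codes 0 to 3 impose an order that a linear model will treat as meaningful. One common alternative, not used in this notebook, is one-hot encoding. The following is a minimal sketch with `pd.get_dummies` on a small made-up frame; the `toy` name and its rows are illustrative only and are not taken from the competition files.

```python
import pandas as pd

# Small hypothetical sample with the four region values seen in the training set
toy = pd.DataFrame({'region': ['northeast', 'southwest', 'southeast', 'northwest']})

# One indicator column per region instead of a single ordinal code
encoded = pd.get_dummies(toy, columns=['region'], dtype=int)
print(encoded.columns.tolist())
```

Each row then has exactly one indicator set to 1, so no region is treated as "larger" than another.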

Next we apply a **MinMaxScaler** to rescale every feature to the range [0, 1], so that no column outweighs another simply because of its scale and the model can be trained properly.
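
As a rough illustration of what the scaler computes, each value `x` is mapped to `(x - min) / (max - min)`. The sketch below applies that formula by hand to the first five ages of the training set. The minimum of 18 and maximum of 64 are assumptions inferred from the scaled table printed further down (In [13]), not values stated in the data description.

```python
import numpy as np

# Ages from the first five rows of the training set
ages = np.array([58, 54, 64, 27, 60], dtype=float)

# Assumed column range (inferred, see the note above)
age_min, age_max = 18.0, 64.0

# Min-max scaling: maps age_min to 0 and age_max to 1
scaled = (ages - age_min) / (age_max - age_min)
print(np.round(scaled, 6))
```

Under that assumption the results agree with the `age` column of the scaled table (0.869565, 0.782609, 1.0, 0.195652, 0.913043).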

In [11]:
dataset_train.iloc[:,1:-1]

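| | age | sex | bmi | children | smoker | region |
|---|---|---|---|---|---|---|
| 0 | 58 | 1 | 30.305 | 0 | 0 | 0 |
| 1 | 54 | 0 | 28.880 | 2 | 0 | 0 |
| 2 | 64 | 0 | 39.700 | 0 | 0 | 3 |
| 3 | 27 | 0 | 25.175 | 0 | 0 | 0 |
| 4 | 60 | 0 | 27.550 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 931 | 60 | 0 | 32.450 | 0 | 1 | 2 |
| 932 | 62 | 0 | 39.160 | 0 | 0 | 2 |
| 933 | 55 | 0 | 29.830 | 0 | 0 | 0 |
| 934 | 20 | 0 | 33.300 | 0 | 0 | 3 |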


In [12]:
# Training part
sc_train = MinMaxScaler()
sc_train.fit(dataset_train.iloc[:,1:-1]) # Fit the scaler, skipping the id column (and the charges target) so they are not normalized
dataset_train[['age','sex','bmi','children','smoker','region']] = sc_train.transform(dataset_train[['age','sex','bmi','children','smoker','region']])

# Validation part
sc_test = MinMaxScaler()
sc_test.fit(dataset_test.iloc[:,1:]) # Fit the scaler, skipping the id column
dataset_test[['age','sex','bmi','children','smoker','region']] = sc_test.transform(dataset_test[['age','sex','bmi','children','smoker','region']])
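
The cell above fits a second scaler on the test file, so the same raw value can map to different scaled values in each set whenever their minima or maxima differ. A sketch of the usual alternative, fitting on the training data only and reusing that scaler for the test data, is shown here on made-up numbers (`train_toy` and `test_toy` are hypothetical frames, not the competition data):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-ins for two columns of the train and test files
train_toy = pd.DataFrame({'age': [18, 40, 64], 'bmi': [16.0, 30.0, 53.0]})
test_toy = pd.DataFrame({'age': [20, 60], 'bmi': [25.0, 35.0]})

scaler = MinMaxScaler().fit(train_toy)     # learn min/max from the training data only
train_scaled = scaler.transform(train_toy)
test_scaled = scaler.transform(test_toy)   # apply the same min/max to the test data
print(test_scaled)
```

With this approach test values are not guaranteed to fall inside [0, 1], but a given age or BMI is always represented by the same number in both sets.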

In [13]:
dataset_train

| | id | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|---|
| 0 | 1229 | 0.869565 | 1.0 | 0.377184 | 0.0 | 0.0 | 0.000000 | 11938.25595 |
| 1 | 1073 | 0.782609 | 0.0 | 0.337341 | 0.4 | 0.0 | 0.000000 | 12096.65120 |
| 2 | 768 | 1.000000 | 0.0 | 0.639871 | 0.0 | 0.0 | 1.000000 | 14319.03100 |
| 3 | 606 | 0.195652 | 0.0 | 0.233748 | 0.0 | 0.0 | 0.000000 | 3558.62025 |
| 4 | 342 | 0.913043 | 0.0 | 0.300154 | 0.0 | 0.0 | 0.000000 | 13217.09450 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 931 | 845 | 0.913043 | 0.0 | 0.437159 | 0.0 | 1.0 | 0.666667 | 45008.95550 |
| 932 | 928 | 0.956522 | 0.0 | 0.624773 | 0.0 | 0.0 | 0.666667 | 13470.80440 |
| 933 | 1091 | 0.804348 | 0.0 | 0.363903 | 0.0 | 0.0 | 0.000000 | 11286.53870 |
| 934 | 1268 | 0.043478 | 0.0 | 0.460925 | 0.0 | 0.0 | 1.000000 | 1880.48700 |


In [14]:
# Another way to apply MinMaxScaler; this variable is not used in this exercise
scaler = preprocessing.MinMaxScaler()
df = pd.DataFrame(scaler.fit_transform(dataset_train.iloc[:,1:]), columns=['age','sex','bmi','children','smoker','region','charges'], index=dataset_train.id)

With the dataset loaded and the variables normalized, we now separate the column to predict, ``charges``, and split the data into the following variables:
- X_train: the feature data used for training.
- Y_train: the values of the target variable for the training data.
- X_test: the feature data used for validation, which does not include the target variable.
- Y_test: the target values for the validation data, used to measure the model's performance. It is not available locally, since `test.csv` has no ``charges`` column.

In [15]:
# dataset_train and dataset_test are already DataFrames (the scaled values were
# assigned back into their columns), so this conversion is only a safeguard
dataset_train = pd.DataFrame(dataset_train)
dataset_test = pd.DataFrame(dataset_test)

# Separate the column to predict
X_train = dataset_train.iloc[:,1:-1] # Skip the id column
Y_train = dataset_train.iloc[:,-1]
X_test = dataset_test # Still includes the id column; it is skipped at prediction time

## Modeling

The data is now clean and normalized. The next step is to choose the type of model to build and the structure that fits best; we will try several to compare results.

The output variable is clearly not categorical, so we rule out _classification_. Since we do have an output variable, we also rule out _clustering_. That leaves **regression** as the most sensible option.
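
One way to compare several regression models is cross-validation. The sketch below scores three linear models from scikit-learn, two of which (`Ridge`, `BayesianRidge`) are imported at the top of this notebook but not used. It runs on synthetic data: `X` and `y` are generated stand-ins for the six scaled features and the target, and the coefficients, `alpha=1.0` and `cv=5` are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for six scaled features and a continuous target
rng = np.random.default_rng(0)
X = rng.random((200, 6))
y = X @ np.array([3.0, 0.1, 2.0, 0.5, 20.0, 0.2]) + rng.normal(0, 0.5, 200)

candidates = {
    'LinearRegression': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'BayesianRidge': BayesianRidge(),
}

# Mean 5-fold cross-validated R2 for each candidate
scores = {name: cross_val_score(est, X, y, cv=5, scoring='r2').mean()
          for name, est in candidates.items()}
for name, score in scores.items():
    print(f'{name}: {score:.3f}')
```

On the real data, `X` and `y` would be replaced by `X_train` and `Y_train` once they are defined.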



### Linear Regression

In [16]:
reg = LinearRegression().fit(X_train, Y_train)
print(reg.score(X_train, Y_train))

0.7496636971119938
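
The score above is R² measured on the same rows used for fitting, so it tends to be optimistic. Since `test.csv` has no ``charges`` column, one option is to hold out part of the training data for validation. This is a sketch on synthetic data: the generated `X` and `y` stand in for `X_train` and `Y_train`, and the 80/20 split and `random_state` are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the six scaled features and the target
rng = np.random.default_rng(0)
X = rng.random((200, 6))
y = X @ np.array([3.0, 0.1, 2.0, 0.5, 20.0, 0.2]) + rng.normal(0, 0.5, 200)

# Keep 20% of the rows aside; the model never sees them during fitting
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_fit, y_fit)
pred = model.predict(X_val)

print('R2 on held-out rows:', r2_score(y_val, pred))
print('RMSE on held-out rows:', np.sqrt(mean_squared_error(y_val, pred)))
```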


## Evaluation

## Uploading the model to Kaggle

In [17]:
!pip install kaggle



In [18]:
X_train

| | age | sex | bmi | children | smoker | region |
|---|---|---|---|---|---|---|
| 0 | 0.869565 | 1.0 | 0.377184 | 0.0 | 0.0 | 0.000000 |
| 1 | 0.782609 | 0.0 | 0.337341 | 0.4 | 0.0 | 0.000000 |
| 2 | 1.000000 | 0.0 | 0.639871 | 0.0 | 0.0 | 1.000000 |
| 3 | 0.195652 | 0.0 | 0.233748 | 0.0 | 0.0 | 0.000000 |
| 4 | 0.913043 | 0.0 | 0.300154 | 0.0 | 0.0 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... |
| 931 | 0.913043 | 0.0 | 0.437159 | 0.0 | 1.0 | 0.666667 |
| 932 | 0.956522 | 0.0 | 0.624773 | 0.0 | 0.0 | 0.666667 |
| 933 | 0.804348 | 0.0 | 0.363903 | 0.0 | 0.0 | 0.000000 |
| 934 | 0.043478 | 0.0 | 0.460925 | 0.0 | 0.0 | 1.000000 |


In [20]:
len(X_test.iloc[:,0:1].to_numpy().flatten())

402

In [23]:
# Use the model to make predictions
predicted_prices = reg.predict(X_test.iloc[:,1:])
# We will look at the predicted prices to ensure we have something sensible.
print(predicted_prices)
#,index=X_test.iloc[:,0:1].to_numpy().flatten()
my_submission = pd.DataFrame({'id': X_test.iloc[:,0:1].to_numpy().flatten(), 'charges': predicted_prices})
# you could use any filename. We choose submission here
my_submission.to_csv('submission.csv', index=False)

[ 7.40737392e+03  4.79878681e+03  1.80380871e+03  3.22211659e+04
  1.36691731e+04  1.43939936e+04  5.50030692e+03  1.24789496e+04
  1.16838785e+04  7.54873613e+03  1.52014767e+04  3.37313342e+04
  3.26213450e+03  3.55453229e+03  7.93869936e+03  1.38792589e+04
  1.29662678e+04  4.80501967e+03  9.00096850e+03  3.38592297e+04
  1.60039289e+04  3.23725099e+03  2.75251785e+04  1.54906986e+04
  1.46435857e+04  1.16223729e+04  5.42728109e+03 -5.11244592e+02
  5.51439703e+03  5.19934300e+03  1.52464534e+04  1.39937100e+04
  2.66278287e+03  6.66500512e+03  2.86749354e+04  1.06467640e+04
  6.79938511e+03  3.00275599e+04  1.15172130e+04  1.33812813e+04
  3.23418232e+03  8.15175694e+03  4.03454899e+03  2.92867873e+04
  7.75707380e+03  1.27460829e+04  1.12377155e+04  1.22739098e+04
  3.41178574e+03  2.83725601e+03  2.94801121e+04  3.54682786e+03
  1.49554186e+04  6.26866757e+03  1.01539805e+04  6.19249953e+03
  3.65554784e+04  3.73502416e+04  7.94226392e+03  3.25345496e+04
  5.12171533e+03  3.14102
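
The printed predictions include a negative value (-5.11244592e+02), which cannot be a valid medical cost; a linear model has no lower bound on its output. One simple option, not applied in this notebook, is to clip predictions at zero before writing the submission. A sketch on three values copied from the output above:

```python
import numpy as np

# Three predictions copied from the printed array, including the negative one
predicted = np.array([7.40737392e+03, -5.11244592e+02, 5.51439703e+03])

# Replace anything below zero with zero; other values are unchanged
clipped = np.clip(predicted, 0, None)
print(clipped)
```

Whether clipping improves the competition score depends on the metric, so it is worth checking against a held-out set first.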

In [24]:
!kaggle competitions submit -c estimacin-de-costes-mdicos-sic-ed2-2021 -f submission.csv -m "Modelo ML con Regresion Lineal v2"

Successfully submitted to Estimación de costes médicos (SIC - Ed.2 - 2021)



  0%|          | 0.00/9.29k [00:00<?, ?B/s]
100%|##########| 9.29k/9.29k [00:01<00:00, 5.55kB/s]
