# Summary
We carried out an exploratory data analysis and tried two kinds of explanatory machine learning models (linear regression and decision trees) to understand which variables have the most influence on the outcome. These models are very helpful because, although they are not the most accurate ones, they are not black boxes, and the knowledge they provide makes it possible to act on the relevant variables to influence the outcome. We also tried a Random Forest, which provides information about feature importance as well.

In [32]:
import warnings
import os
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

warnings.filterwarnings('ignore')

# Exploratory Data Analysis 

In [33]:
data = pd.read_csv('../input/insurance.csv')

In [34]:
data.head()

Let's check if there are any missing values:

In [35]:
data.isnull().any()

Some summary statistics to see if there are outliers:

In [36]:
data.describe()
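
Summary statistics only hint at outliers. A minimal sketch of a more explicit check, assuming the common 1.5 x IQR rule, is shown below. The helper `count_iqr_outliers` and the toy frame are hypothetical and are not part of the original analysis.

```python
import pandas as pd


def count_iqr_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the frame's columns
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Toy frame with made-up values (not the insurance data)
toy = pd.DataFrame({
    "bmi": [22.0, 25.0, 27.0, 30.0, 31.0, 33.0],
    "charges": [1500.0, 2000.0, 2500.0, 3000.0, 3500.0, 60000.0],
})
outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

On the real data the same helper could be called as `count_iqr_outliers(data)`.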

In [37]:
data.info()

These are categorical variables:

In [38]:
data['sex'] = pd.Categorical(data['sex'])
data['smoker'] = pd.Categorical(data['smoker'])
data['region'] = pd.Categorical(data['region'])

In [39]:
data.info()

Let's create some graphs to understand the data:

In [40]:
plt.figure(figsize=(18, 6))
plt.subplot(131)
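# histplot with kde=True replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(data['age'], kde=True).set_title("Age")
plt.subplot(132)
sns.histplot(data['bmi'], kde=True).set_title("Bmi")
plt.subplot(133)
sns.histplot(data['charges'], kde=True).set_title("Charges")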
plt.show()

Bmi looks approximately normally distributed, and charges is right-skewed.
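
The skew can also be quantified rather than read off the plot. The sketch below uses `Series.skew()` on synthetic log-normal values standing in for charges (an assumption made only so the example is self-contained) and shows how a log transform reduces the skew.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for `charges` (not the real data)
rng = np.random.default_rng(0)
toy_charges = pd.Series(rng.lognormal(mean=9.0, sigma=1.0, size=2000))

raw_skew = toy_charges.skew()          # clearly positive for a right-skewed variable
log_skew = np.log(toy_charges).skew()  # close to zero after the log transform

print("skew of raw values: {:0.2f}".format(raw_skew))
print("skew of log values: {:0.2f}".format(log_skew))
```

On the real data the equivalent calls would be `data['charges'].skew()` and `np.log(data['charges']).skew()`.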

In [41]:
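# Restrict to numeric columns: recent pandas versions raise an error on categorical columns
corr = data.select_dtypes(include='number').corr()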

In [42]:
sns.heatmap(corr, annot=True, linewidths=.5, fmt= '.3f', cmap="YlGnBu")
plt.show()

The following graphs will help us to see the relation between variables:

In [43]:
sns.pairplot(data, kind="reg")
plt.show()

It looks like age, bmi and children are positively correlated with charges.

Let's see the influence of the categorical variables (sex, region and smoker):

In [44]:
plt.figure(figsize=(18, 6))
plt.subplot(131)
sns.boxplot(x='sex', y='charges', data=data)
plt.subplot(132)
sns.boxplot(x='region', y='charges', data=data)
plt.subplot(133)
sns.boxplot(x='smoker', y='charges', data=data)
plt.show()

In [45]:
data.groupby('sex')['charges'].mean()

In [46]:
data.groupby('smoker')['charges'].mean()

It looks like men incur higher charges than women, and it is quite clear that smokers incur much higher charges than non-smokers.
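
The group means above look at one variable at a time. One way to check whether the gap between sexes persists within smokers and non-smokers is to tabulate mean charges by both variables at once. The sketch below does this with `pivot_table` on a made-up toy frame, not on the insurance data.

```python
import pandas as pd

# Toy frame with made-up values, using the same column names as the dataset
toy = pd.DataFrame({
    "sex": ["male", "male", "female", "female", "male", "female"],
    "smoker": ["yes", "no", "yes", "no", "no", "no"],
    "charges": [30000.0, 5000.0, 28000.0, 4000.0, 7000.0, 6000.0],
})

# Mean charges for every (sex, smoker) combination
summary = toy.pivot_table(index="sex", columns="smoker", values="charges", aggfunc="mean")
print(summary)
```

Replacing `toy` with `data` would give the corresponding table for the real dataset.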

# Models
We are going to try two different explanatory models to understand the influence of each variable on the outcome (charges).

## Linear regression

In [47]:
import statsmodels.api as sm
from sklearn.preprocessing import MinMaxScaler

In [48]:
y = data['charges']
X = data.drop('charges', axis=1)
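# dtype=float keeps the dummy columns numeric; boolean dummies can make statsmodels fail with an object-dtype error
X = pd.get_dummies(X, drop_first=True, prefix=['sex', 'smoker', 'region'], dtype=float)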

scaler = MinMaxScaler()
X[['age', 'bmi', 'children']] = scaler.fit_transform(X[['age', 'bmi', 'children']])
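
To make the preprocessing concrete, here is a small sketch of what dummy encoding with `drop_first=True` and min-max scaling do. It uses a made-up three-row frame, and the scaling is written out by hand as `(x - min) / (max - min)`, which is the formula `MinMaxScaler` applies with its default range.

```python
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    "age": [18, 41, 64],
    "smoker": pd.Categorical(["no", "yes", "no"]),
    "region": pd.Categorical(["northeast", "southwest", "northwest"]),
})

# drop_first=True drops the first category of each variable ("no", "northeast"),
# which becomes the reference level in the regression
encoded = pd.get_dummies(toy, drop_first=True, dtype=float)

# Min-max scaling maps the smallest value to 0 and the largest to 1
age_min, age_max = encoded["age"].min(), encoded["age"].max()
encoded["age"] = (encoded["age"] - age_min) / (age_max - age_min)
print(encoded)
```

Dropping one level per categorical variable avoids perfect collinearity with the constant term added before fitting.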

In [49]:
np.random.seed(1)
X_2 = sm.add_constant(X)
model_lr = sm.OLS(y, X_2)
linear = model_lr.fit()
print(linear.summary())

We see that the only statistically significant variables (p-value < 0.05) are: smoker, bmi, age and children.

In order of influence, smoker affects the outcome the most, followed by bmi, age and children. Because the numeric features were scaled to the [0, 1] range, the coefficient magnitudes can be compared with each other.

## Trees
We are now going to try a different kind of model: a regression tree. This is not the most accurate model, but it helps to explain the influence of the variables on the outcome.

In [50]:
from sklearn.tree import DecisionTreeRegressor

In [51]:
model = DecisionTreeRegressor(random_state = 100)

In [52]:
y = data['charges']
X = data.drop('charges', axis=1)

In [53]:
X['sex'] = X['sex'].cat.codes
X['smoker'] = X['smoker'].cat.codes
X['region'] = X['region'].cat.codes

In [54]:
model.fit(X, y)

In [55]:
sns.barplot(x=X.columns, 
            y=model.feature_importances_, 
            order=X.columns[np.argsort(model.feature_importances_)[::-1]])
plt.show()

This model produces results similar to those of the linear regression model. The most influential variables are: smoker, bmi and age.
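
A fully grown tree is hard to read directly. One option, sketched below, is to fit a shallow tree and print its rules with `sklearn.tree.export_text`. The data is synthetic and the depth limit of 2 is an arbitrary choice for illustration, not a setting used in the analysis above.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data where smoking dominates the outcome (made-up values)
rng = np.random.default_rng(0)
n = 400
toy_X = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "smoker": rng.integers(0, 2, n),
})
toy_y = 20000.0 * toy_X["smoker"] + 200.0 * toy_X["age"] + rng.normal(0, 500, n)

# A shallow tree keeps the printed rules short enough to read
shallow_tree = DecisionTreeRegressor(max_depth=2, random_state=100)
shallow_tree.fit(toy_X, toy_y)

rules = export_text(shallow_tree, feature_names=list(toy_X.columns))
print(rules)
```

The first split in the printed rules shows which variable the tree considers most useful for separating low and high values.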

## Random Forest
Let's try a more accurate model:

In [100]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor
# sklearn.metrics.regression was a private module and has been removed; import from sklearn.metrics
from sklearn.metrics import mean_squared_error

In [109]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [110]:
model_rf = RandomForestRegressor(n_estimators=200, random_state=1)
scores = cross_val_score(model_rf, X_train, y_train, cv=5, scoring="neg_mean_squared_error")
scores = np.sqrt(-scores)
print("validation RMSE: {:0.4f} (+/- {:0.4f})".format(np.mean(scores), np.std(scores)))

In [112]:
model_rf.fit(X_train, y_train)
y_pred = model_rf.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
rmse
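
An RMSE value is easier to interpret next to a naive baseline. The sketch below compares a model's RMSE with that of always predicting the training mean. The helper `rmse_score` and all the numbers are hypothetical and chosen only to illustrate the comparison.

```python
import numpy as np


def rmse_score(y_true, y_hat):
    """Root mean squared error between two equally sized arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_hat) ** 2)))


# Made-up targets and predictions
toy_y_train = np.array([15.0, 25.0, 35.0])
toy_y_test = np.array([10.0, 20.0, 30.0, 40.0])
toy_y_pred = np.array([12.0, 18.0, 33.0, 39.0])

# Baseline: predict the training mean for every test row
baseline_pred = np.full_like(toy_y_test, toy_y_train.mean())

model_rmse = rmse_score(toy_y_test, toy_y_pred)
baseline_rmse = rmse_score(toy_y_test, baseline_pred)
print("model RMSE: {:0.4f}, baseline RMSE: {:0.4f}".format(model_rmse, baseline_rmse))
```

In the notebook, the baseline would be `np.full(len(y_test), y_train.mean())` compared against `y_test`.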

Let's see how the predictions relate to the actual values on the training set:

In [115]:
y_train_pred = model_rf.predict(X_train)
sns.regplot(x=y_train, y=y_train_pred)
plt.title("Predicted vs Real")
plt.show()

Random Forest is more accurate than the previous models and also helps us understand the importance of each feature. The results we obtain are in line with our previous results:

In [114]:
importances = model_rf.feature_importances_
indices = np.argsort(importances)[::-1]

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices], align="center")
plt.xticks(range(X.shape[1]), X.columns[indices])
plt.xlim([-1, X.shape[1]])
plt.show()
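
Impurity-based importances are computed on the training data and can favour features with many distinct values. A possible cross-check, sketched here on synthetic data, is permutation importance on a held-out set using `sklearn.inspection.permutation_importance`. The column `noise` and all values are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: smoker dominates, age matters less, noise is irrelevant
rng = np.random.default_rng(1)
n = 400
toy_X = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "smoker": rng.integers(0, 2, n),
    "noise": rng.normal(0, 1, n),
})
toy_y = 20000.0 * toy_X["smoker"] + 200.0 * toy_X["age"] + rng.normal(0, 500, n)

X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.2, random_state=1)
forest = RandomForestRegressor(n_estimators=50, random_state=1).fit(X_tr, y_tr)

# Shuffle each feature on the held-out set and measure the drop in score
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=1)
perm_importance = pd.Series(result.importances_mean, index=toy_X.columns)
perm_importance = perm_importance.sort_values(ascending=False)
print(perm_importance)
```

With the notebook's objects, the call would be `permutation_importance(model_rf, X_test, y_test, n_repeats=5, random_state=1)`.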

# Conclusion

Using graphs and two different machine learning models (linear regression and decision trees), we have identified that the personal variables that most affect medical costs are, in this order: smoker, bmi and age. These results are in line with what we obtain when using Random Forest.

As expected, the data confirm that a health insurance company should charge more to people who smoke, have a high BMI and are older.