# **Health Care Cost - Linear Regression**

Rodolfo Camargo de Freitas

## 1 - The data

This data can be found on [Kaggle](https://www.kaggle.com/mirichoi0218/insurance/version/1) and comes from an effort to gather the datasets of the book **Machine Learning With R**, by Brett Lantz, as can be seen in [Machine-Learning-with-R-datasets](https://github.com/stedy/Machine-Learning-with-R-datasets).

There are 1338 rows and 7 columns in the dataset: 6 features plus the target, charges. They are:

- age: age of the primary beneficiary - numeric;
- sex: male or female - string;
- bmi: [Body mass index](https://en.wikipedia.org/wiki/Body_mass_index) ($kg/m^2$) - numeric;
- children: number of children covered by health insurance - numeric;
- smoker: does the beneficiary smoke (yes, no)? - string;
- region: beneficiary's residential area in the US (northeast, southeast, southwest, northwest) - string;
- charges: individual medical costs billed by health insurance - numeric.

In [3]:
import pandas as pd
import matplotlib.pyplot as plt

In [4]:
#loading the dataset
data = pd.read_csv("../input/insurance.csv")

#basic infos
data.info()

#changing data types
for column in ['sex', 'smoker', 'region']:
    data[column] = data[column].astype('category')

#note the memory usage reduction from 73.2 kB to 46.2 kB
data.info()
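
The memory reduction comes from how the `category` dtype stores a column: each distinct label is kept once, and every row holds only a small integer code. The sketch below illustrates this on a made-up column (the `labels` series is hypothetical and is not part of the insurance data), so the byte counts it prints will differ from the ones reported by `data.info()`.

```python
import pandas as pd

# Hypothetical toy column: 1000 rows drawn from only two distinct labels
labels = pd.Series(["yes", "no"] * 500, dtype="object")
as_category = labels.astype("category")

# deep=True also counts the memory held by the Python strings themselves
object_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print("object dtype:  ", object_bytes, "bytes")
print("category dtype:", category_bytes, "bytes")
```

The saving grows with the number of rows and shrinks as the number of distinct labels approaches the number of rows.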

In [5]:
#the numerical features
data.describe()

In [6]:
#the categorical features
data.describe(include='category')

## 2 - Visualizing the data

Plots often give more insight than the statistical summaries above.

In [7]:
from pandas.plotting import scatter_matrix

scatter_matrix(data[['charges','age','bmi', 'children']], alpha=0.3, diagonal='kde')

In [8]:
plt.figure(1)
plt.subplot(2,2,1)
data.groupby(['sex'])['charges'].sum().plot.bar()
plt.subplot(2,2,2)
data.groupby(['smoker'])['charges'].sum().plot.bar()
plt.subplot(2,2,3)
data.groupby(['region'])['charges'].sum().plot.bar()

plt.figure(2)
plt.subplot(2,2,1)
data.groupby(['sex'])['bmi'].sum().plot.bar()
plt.subplot(2,2,2)
data.groupby(['smoker'])['bmi'].sum().plot.bar()
plt.subplot(2,2,3)
data.groupby(['region'])['bmi'].sum().plot.bar()

## 3 - Regression

Can we predict the charges knowing the values of the other 6 features? Which feature is the most important?

First, we will consider only the numerical features. Later, we will analyze the result with the categorical features included.

In [9]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

numerical = ['age','bmi', 'children']
categorical = ['sex', 'smoker', 'region']
X_train, X_test, y_train, y_test = train_test_split(data[numerical],
                                                    data['charges'],
                                                    test_size=0.2,
                                                    random_state=42)

In [10]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)

In [11]:
print("The mean squared error is {:.2f}".format(mean_squared_error(y_test,y_pred)))
print("R2-score: {:.2f}".format(r2_score(y_test,y_pred)))
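
For reference, the two metrics printed above are defined as $\mathrm{MSE} = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$ and $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$. The sketch below computes both by hand with NumPy on made-up numbers; the helper names `mse` and `r2` and the demo arrays are illustrative only and are not taken from the notebook.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

def r2(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up values, only to exercise the two helpers
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.0, 5.0, 8.0, 9.0])

print("MSE:", mse(y_true_demo, y_pred_demo))
print("R2: ", r2(y_true_demo, y_pred_demo))

# A model that always predicts the mean of y_true scores R2 = 0
baseline = np.full_like(y_true_demo, y_true_demo.mean())
print("R2 of the mean baseline:", r2(y_true_demo, baseline))
```

Because the MSE is expressed in squared units of the target, it is hard to read on its own; the R2 score is unitless and compares the model against the mean baseline.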

Let's test each numerical feature individually.

In [12]:
for feature in numerical:
    X_train, X_test, y_train, y_test = train_test_split(data[feature].values.reshape(-1,1),
                                                       data['charges'],
                                                       test_size=0.2,
                                                       random_state=42)
    regressor = LinearRegression()
    regressor.fit(X_train,y_train)
    y_pred = regressor.predict(X_test)
    print("Feature: {}".format(feature))
    print("Mean squared error: {:.2f}".format(mean_squared_error(y_test,y_pred)))
    print("R2-score: {:.2f}".format(r2_score(y_test,y_pred)))
    plt.scatter(X_train,y_train, color='black')
    plt.plot(X_test,y_pred, color='blue')
    plt.ylabel('Charges')
    plt.xlabel(feature)
    plt.show()
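
With a single feature, the least-squares line has a closed form: the slope is the covariance of the feature and the target divided by the variance of the feature, and the intercept makes the line pass through the point of means. The sketch below computes both with NumPy on made-up numbers (the `fit_line` helper and the demo arrays are hypothetical, not values from the dataset) and checks them against `np.polyfit`.

```python
import numpy as np

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one feature."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Made-up ages and charges, only for illustration
x_demo = np.array([20.0, 30.0, 40.0, 50.0, 60.0])
y_demo = np.array([2500.0, 3100.0, 4200.0, 4900.0, 6300.0])

slope, intercept = fit_line(x_demo, y_demo)
print("slope:", slope, "intercept:", intercept)

# np.polyfit with degree 1 solves the same least-squares problem
poly_slope, poly_intercept = np.polyfit(x_demo, y_demo, 1)
print("polyfit:", poly_slope, poly_intercept)
```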

What happens when we add the numerical features one by one into the model?

In [13]:
X_train, X_test, y_train, y_test = train_test_split(data[numerical[0]].values.reshape(-1,1),
                                                       data['charges'],
                                                       test_size=0.2,
                                                       random_state=42)
regressor = LinearRegression()
regressor.fit(X_train,y_train)
y_pred = regressor.predict(X_test)
print("Features: {}".format(numerical[0]))
print("Mean squared error: {:.2f}".format(mean_squared_error(y_test,y_pred)))
print("R2-score: {:.2f}".format(r2_score(y_test,y_pred)))
for i in range(2,4):
    X_train, X_test, y_train, y_test = train_test_split(data[numerical[0:i]],
                                                       data['charges'],
                                                       test_size=0.2,
                                                       random_state=42)
    regressor = LinearRegression()
    regressor.fit(X_train,y_train)
    y_pred = regressor.predict(X_test)
    print("Features: {}".format(numerical[0:i]))
    print("Mean squared error: {:.2f}".format(mean_squared_error(y_test,y_pred)))
    print("R2-score: {:.2f}".format(r2_score(y_test,y_pred)))

Despite the small reduction in the mean squared error, the R2 score indicates that adding the number of children to the model does not improve it.

To add the categorical features, we first need to transform them into numerical features. We will do it as follows.

In [14]:
data = pd.get_dummies(data)
data.head()

As a result, each categorical feature was transformed into one new column per category.
Let's see how the model evolves as each feature is added.
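
The encoding described above can be illustrated on a toy frame. This is a minimal sketch: the `toy` DataFrame is made up, and `drop_first=True` is an optional variation that the notebook itself does not use. It drops one column per categorical feature, since that column is fully determined by the others.

```python
import pandas as pd

# Hypothetical three-row frame with one numerical and one categorical column
toy = pd.DataFrame({"age": [19, 33, 45],
                    "smoker": ["yes", "no", "no"]})
toy["smoker"] = toy["smoker"].astype("category")

# One new indicator column per category, named <column>_<category>
encoded = pd.get_dummies(toy)
print(encoded)

# Optional variation: drop the first category of each encoded column
reduced = pd.get_dummies(toy, drop_first=True)
print(reduced)
```

In the full encoding the indicator columns of one feature always sum to 1 in every row, so one of them carries no extra information.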

In [15]:
features = list(data.columns)
features.remove('charges')

r2_scores = []

X_train, X_test, y_train, y_test = train_test_split(data[features[0]].values.reshape(-1,1),
                                                       data['charges'],
                                                       test_size=0.2,
                                                       random_state=42)
regressor = LinearRegression()
regressor.fit(X_train,y_train)
y_pred = regressor.predict(X_test)
print("Feature added: {}. Total features: {}".format(features[0],len(features[0:1])))
print("Mean squared error: {:.2f}".format(mean_squared_error(y_test,y_pred)))
print("R2-score: {:.2f}".format(r2_score(y_test,y_pred)))
r2_scores.append(r2_score(y_test,y_pred))
for i in range(2,11):
    X_train, X_test, y_train, y_test = train_test_split(data[features[0:i]],
                                                       data['charges'],
                                                       test_size=0.2,
                                                       random_state=42)
    regressor = LinearRegression()
    regressor.fit(X_train,y_train)
    y_pred = regressor.predict(X_test)
    print("Feature added: {}. Total features: {}".format(features[i-1],len(features[0:i])))
    print("Mean squared error: {:.2f}".format(mean_squared_error(y_test,y_pred)))
    print("R2-score: {:.2f}".format(r2_score(y_test,y_pred)))
    r2_scores.append(r2_score(y_test,y_pred))

In [16]:
plt.plot(list(range(1, len(r2_scores) + 1)), r2_scores)
plt.ylabel('R2 score')
plt.xlabel('Number of features')
plt.show()

It seems that an optimal model is the one with the features 'age', 'bmi', 'sex' and 'smoker'.

In [17]:
best = ['age','bmi','sex_male','sex_female','smoker_yes', 'smoker_no']
X_train, X_test, y_train, y_test = train_test_split(data[best],
                                                   data['charges'],
                                                   test_size=0.2,
                                                   random_state=42)
regressor = LinearRegression()
regressor.fit(X_train,y_train)
y_pred = regressor.predict(X_test)

print("Mean squared error: {:.2f}".format(mean_squared_error(y_test,y_pred)))
print("R2-score: {:.2f}".format(r2_score(y_test,y_pred)))
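
Note that the `best` list keeps both indicator columns of each binary feature (`sex_male` with `sex_female`, `smoker_yes` with `smoker_no`), and each pair always sums to 1. The sketch below suggests why this redundancy should not change the predictions of a least-squares fit. It is run on synthetic data, and plain NumPy least squares with an explicit intercept column stands in for `LinearRegression`, which is an assumption about equivalent behaviour rather than a statement about its internals.

```python
import numpy as np

# Synthetic data, not the insurance dataset
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(18, 64, n)
smoker_yes = rng.integers(0, 2, n).astype(float)
smoker_no = 1.0 - smoker_yes  # fully determined by smoker_yes
y = 250 * age + 20000 * smoker_yes + rng.normal(0, 1000, n)

ones = np.ones(n)  # intercept column
X_full = np.column_stack([ones, age, smoker_yes, smoker_no])
X_reduced = np.column_stack([ones, age, smoker_yes])

# lstsq returns the minimum-norm solution when the columns are redundant
coef_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
coef_reduced, *_ = np.linalg.lstsq(X_reduced, y, rcond=None)

pred_full = X_full @ coef_full
pred_reduced = X_reduced @ coef_reduced
rank_full = np.linalg.matrix_rank(X_full)

print("rank of the 4-column design matrix:", rank_full)
print("same predictions:", np.allclose(pred_full, pred_reduced))
```

The predictions agree, but the coefficients of the redundant model are not unique, so they should not be read as the individual effect of each column.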