## Simple Linear Regression

The objective of this notebook is to determine how age affects insurance costs. This will be achieved by using a simple linear regression model.

In [53]:
# Import the libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.style as style
from sklearn.linear_model import LinearRegression


In [54]:
# Import the data
df = pd.read_csv("insurance.csv")

# Show descriptive statistics for the data
df.describe()

|  | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.04996 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 27.0 | 26.29625 | 0.0 | 4740.28715 |
| 50% | 39.0 | 30.4 | 1.0 | 9382.033 |
| 75% | 51.0 | 34.69375 | 2.0 | 16639.912515 |
| max | 64.0 | 53.13 | 5.0 | 63770.42801 |


Below is a scatter plot of the insurance data, which shows a roughly linear relationship between age and charges. The independent variable (age of beneficiary, ranging from 18 to 64) is on the x-axis, and the dependent variable (charges: individual medical costs billed by health insurance) is on the y-axis. The points trend upward with age, indicating a positive correlation between the variables.

We can observe that older people are charged more, but the highest charges apply to smokers, even if they are young.
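
The visual impression that smokers are charged more can also be checked numerically. The following is a minimal sketch, assuming `df` contains the `smoker` and `charges` columns used in the plot; the helper name `mean_by_group` is illustrative and not part of the original notebook.

```python
import pandas as pd

def mean_by_group(df, group_col="smoker", value_col="charges"):
    """Return the mean of value_col for each level of group_col."""
    return df.groupby(group_col)[value_col].mean()

# Usage with the notebook's DataFrame (assumes insurance.csv was loaded as df):
# print(mean_by_group(df))
# print(df["age"].corr(df["charges"]))  # Pearson correlation between age and charges
```

Comparing the two group means alongside the age/charges correlation gives a rough sense of which variable separates the charges more strongly.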

In [55]:
# Plot a scatter graph with age on the x-axis and charges on the y-axis.

# To make the graph more informative, show who is a smoker and who is not.
# Use 538 styling for nice pictures
style.use('fivethirtyeight')
sns.scatterplot(x='age',y='charges',hue='smoker', data=df, palette='deep')
plt.title('Age vs Charges by Smoker',size=18)
plt.xlabel('Age (years)',size=14)
plt.ylabel('Charges ($)',size=14)
plt.show()

In [56]:
# Plot another scatter plot with the best-fit line.
style.use('fivethirtyeight')
sns.lmplot(data = df, x = 'age', y = 'charges')
plt.xlabel('Age (years)',size=14)
plt.ylabel('Charges ($)',size=14)
plt.title('Scatter Plot with Line of Best Fit')
plt.show()

In [57]:
# An alternative scatter graph
style.use('fivethirtyeight')
sns.lmplot(x = 'age', y = 'charges', data=df, hue='smoker', palette='Set1')
plt.xlabel('Age (years)')
plt.ylabel('Charges ($)')
plt.title('Linear Regression Model for Charges by Age and Smoking Habits')
plt.show()

Next, fit a simple linear regression model to the insurance data in order to make predictions.

In [58]:
# Using linear_model.LinearRegression() from sklearn, fit a model with age as the feature and charges as the target.

x = df[['age']].values    # feature matrix of shape (n_samples, 1)
y = df['charges'].values  # target vector
insurance_model = LinearRegression()
insurance_model.fit(x,y)
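
Once the model is fitted, the line it learned can be read from the `coef_` and `intercept_` attributes of `LinearRegression`. The sketch below assumes a model fitted on a single feature, as above; the helper name `describe_line` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def describe_line(model):
    """Return (slope, intercept) of a fitted single-feature LinearRegression."""
    slope = float(np.ravel(model.coef_)[0])
    intercept = float(np.ravel(model.intercept_)[0])
    return slope, intercept

# Usage with the notebook's model (assumes insurance_model has been fitted):
# slope, intercept = describe_line(insurance_model)
# print(f"charges = {intercept:.2f} + {slope:.2f} * age")
```

The slope is the estimated change in charges for each additional year of age, and the intercept is the predicted charge at age zero, which lies outside the observed data and should not be interpreted literally.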

We can compare the model's predictions with the observed y values at the x values we already know. Below, the predictions are shown in red and the observed values in blue.

In [59]:
# Plot the data and model
y_pred = insurance_model.predict(x)
plt.scatter(x,y,color = 'b')
plt.plot(x,y_pred,color = 'r')
plt.xlabel('Age (years)')
plt.ylabel('Charges ($)')
plt.title('Scatter Plot with Line of Best Fit')
plt.show()
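
The gap between predictions and observations can also be summarised with a number rather than judged by eye. This is a minimal sketch using `mean_squared_error` and `r2_score` from `sklearn.metrics`; the helper name `fit_quality` is an assumption for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def fit_quality(y_true, y_pred):
    """Return (rmse, r2) for observed and predicted values."""
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    r2 = float(r2_score(y_true, y_pred))
    return rmse, r2

# Usage with the notebook's arrays (assumes y and y_pred from the cell above):
# rmse, r2 = fit_quality(y, y_pred)
# print(f"RMSE: {rmse:.2f}, R^2: {r2:.3f}")
```

RMSE is expressed in the same units as charges, while R² is the share of variance in charges that the line explains.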

We can use this simple model to estimate how much insurance would charge a 22-year-old. The prediction of $8835.78 is not very accurate: among 22-year-olds in the data, an observed charge is $35585.576 for a male smoker and $2755.02095 for a non-smoker. Gender, BMI, number of children and region all contribute to the charges applied, but according to the scatter plot shown here and additional in-depth analysis (https://docs.google.com/presentation/d/1q4_bmZ_eJHww9Jg6QCkCBd9OaIYBDwRzWyRLz6TmAoY/edit?usp=sharing), smoking is the strongest predictor of charges. To make the prediction more accurate, I would fit the data with smoking status taken into account, as it has the highest impact on medical costs, even though costs also grow with age.

One should also be careful about extrapolating the linear model to predict charges outside of the known data; for example, entering the age of 82 years will generate unrealistic charges.
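
A simple guard against accidental extrapolation is to check whether a requested age lies inside the range the model was trained on. This is a minimal sketch; the helper name `in_training_range` is an assumption for illustration.

```python
import numpy as np

def in_training_range(x_train, x_new):
    """Return a boolean array: True where x_new lies within [min, max] of x_train."""
    x_train = np.asarray(x_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float).ravel()
    return (x_new >= x_train.min()) & (x_new <= x_train.max())

# Usage with the notebook's feature array (assumes x holds the training ages):
# print(in_training_range(x, [22, 82]))
```

With the ages in this data set (18 to 64), 22 falls inside the range and 82 does not, so a prediction for an 82-year-old would be an extrapolation.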

In [60]:
# Predict an unknown value
unk_x = [[22]] 

x_pred = np.append(x, unk_x).reshape(-1,1)
y_pred = insurance_model.predict(x_pred)

plt.scatter(x,y,color = 'b')
plt.plot(x_pred,y_pred,color = 'r')
plt.xlabel('Age (years)')
plt.ylabel('Charges ($)')
plt.title('Linear Regression Model for Charges by Age')
plt.show()

print("Insurance charges should be:", insurance_model.predict(unk_x))

Insurance charges should be: [8835.78261673]
