### Predicting Health Insurance Premiums (Based on Customer Charges)


We use a dataset describing potential health insurance customers, with attributes such as age, smoking status, and BMI. The 'charges' column is the target: the model predicts how much a potential customer may spend on health care. Health insurance companies could use this predicted spending to help determine an appropriate premium.

In [1]:
#Import dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import plotly.express as px
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

In [2]:
#Load csv file into Pandas DataFrame
h_data = pd.read_csv("Resources/Health_insurance.csv")

#View DataFrame
h_data.head()

   age     sex     bmi  children smoker     region      charges
0   19  female    27.9         0    yes  southwest    16884.924
1   18    male   33.77         1     no  southeast    1725.5523
2   28    male    33.0         3     no  southeast     4449.462
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male   28.88         0     no  northwest    3866.8552


In [3]:
#Check if DataFrame contains any null values
h_data.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

In [4]:
#Check attributes
h_data.dtypes

age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object

In [5]:
#Encode Categorical attributes
#Sex Encoded: Male:1, Female:0
#Smoker Encoded: Yes:1, No:0
#Change the 'male' and 'female' to numerical data in the 'sex' column 
h_data['sex'] = h_data['sex'].map({'female': 0, 'male': 1})

#Change the 'yes' and 'no' values to numerical data in the 'smoker' column
h_data['smoker'] = h_data['smoker'].map({'no': 0, 'yes': 1})

print(h_data.head())

   age  sex     bmi  children  smoker     region      charges
0   19    0  27.900         0       1  southwest  16884.92400
1   18    1  33.770         1       0  southeast   1725.55230
2   28    1  33.000         3       0  southeast   4449.46200
3   33    1  22.705         0       0  northwest  21984.47061
4   32    1  28.880         0       0  northwest   3866.85520
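One thing to watch with `Series.map` and a dictionary: any label missing from the dictionary becomes NaN rather than raising an error. The sketch below uses a made-up three-row frame (not the insurance data) to show a quick check that could follow the encoding step.

```python
import pandas as pd

# Toy frame with made-up values; "Yes" (capital Y) is deliberately not in the mapping
toy = pd.DataFrame({"smoker": ["yes", "no", "Yes"]})

encoded = toy["smoker"].map({"no": 0, "yes": 1})

# Labels missing from the dictionary are mapped to NaN, so count them
n_unmapped = int(encoded.isna().sum())
print(n_unmapped)  # 1
```

A non-zero count would suggest inconsistent labels (casing, whitespace) that need cleaning before modelling.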


In [6]:
#Region will be transformed into 4 columns - southwest, southeast, northwest, northeast
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
columnTransformer = ColumnTransformer(transformers = [('encoder',OneHotEncoder(),[5])], remainder="passthrough")
datavalues = columnTransformer.fit_transform(h_data)

In [7]:
#Splitting the dependent and independent variables
X = datavalues[:, :-1]
y = datavalues[:, -1]

print(X.shape)
print(y.shape)

(1338, 9)
(1338,)


In [8]:
#Split the data into training (75%) and test (25%) sets; train_test_split was imported in the first cell
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=11)

In [9]:
print(X_train.shape)
print(X_test.shape)

(1003, 9)
(335, 9)
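The 1003/335 split follows from how `train_test_split` rounds a fractional `test_size`: the test count is rounded up and the training set receives the remaining rows. A short arithmetic sketch of that rule:

```python
import math

n_samples = 1338
test_size = 0.25

# The test count is rounded up; the training set gets whatever is left
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 1003 335
```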


In [21]:
X_test

array([[ 0.   ,  0.   ,  0.   , ..., 34.7  ,  2.   ,  1.   ],
       [ 0.   ,  0.   ,  1.   , ..., 27.72 ,  0.   ,  0.   ],
       [ 1.   ,  0.   ,  0.   , ..., 33.155,  1.   ,  0.   ],
       ...,
       [ 1.   ,  0.   ,  0.   , ..., 25.46 ,  1.   ,  0.   ],
       [ 0.   ,  1.   ,  0.   , ..., 28.5  ,  2.   ,  0.   ],
       [ 0.   ,  1.   ,  0.   , ..., 28.595,  3.   ,  0.   ]])

In [14]:
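#Scale the data (left commented out; tree-based models such as gradient boosting are insensitive to feature scaling)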
#from sklearn.preprocessing import StandardScaler
#st_X = StandardScaler()
#st_Y = StandardScaler()

In [15]:
#Fit and transform the features
#X_train = st_X.fit_transform(X_train)
#X_test = st_X.transform(X_test)

In [16]:
#Fit and transform the target
#y_train = st_Y.fit_transform(y_train.reshape(-1,1))
#y_test = st_Y.transform(y_test.reshape(-1,1))

In [10]:
y_train[:5]

array([ 4529.477  ,  6455.86265,  5397.6167 , 27117.99378, 13126.67745])

In [11]:
#Train the model
from sklearn.ensemble import GradientBoostingRegressor
gbr = GradientBoostingRegressor()
gbr.fit(X_train,y_train)

GradientBoostingRegressor()
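`GradientBoostingRegressor()` is created without a `random_state`, so repeated runs may give slightly different trees and scores. A minimal sketch on synthetic data from `make_regression` (not the insurance data) shows that fixing the seed makes two fits agree; the seed value 11 and the smaller `n_estimators` are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic regression problem with 9 features, standing in for the real data
X_demo, y_demo = make_regression(n_samples=200, n_features=9, noise=10.0, random_state=0)

# Two models with the same seed, fitted on the same data
model_a = GradientBoostingRegressor(n_estimators=50, random_state=11).fit(X_demo, y_demo)
model_b = GradientBoostingRegressor(n_estimators=50, random_state=11).fit(X_demo, y_demo)

# With a fixed seed the predictions should match
same = np.allclose(model_a.predict(X_demo), model_b.predict(X_demo))
print(same)
```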

In [12]:
y_pred = gbr.predict(X_test)

In [20]:
data = pd.DataFrame(data={"Predicted Amount": y_pred})
print(data.head())

   Predicted Amount
0      36294.035593
1       6639.353967
2       8540.784562
3      10391.289534
4       9451.972525


In [25]:
data = pd.DataFrame(data={"Actual Amount": y_test, "Predicted Amount": y_pred})
data.head()

   Actual Amount  Predicted Amount
0      36397.576      36294.035593
1      4415.1588       6639.353967
2     7639.41745       8540.784562
3     8965.79575      10391.289534
4       9563.029       9451.972525
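To make the comparison concrete, the absolute error for each of the five rows shown above can be computed directly. The sketch copies those ten values by hand; five rows are far too few to judge the model, so this only illustrates the calculation.

```python
import numpy as np

# Values copied from the five rows displayed above
actual = np.array([36397.576, 4415.1588, 7639.41745, 8965.79575, 9563.029])
predicted = np.array([36294.035593, 6639.353967, 8540.784562, 10391.289534, 9451.972525])

# Absolute difference between actual and predicted charges, row by row
abs_error = np.abs(actual - predicted)

print(abs_error.round(2))
print(round(float(abs_error.mean()), 2))  # about 953.13
```

The largest miss among these rows is the second one, where the prediction is about 2,224 above the actual charge.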


In [13]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [14]:
print("Mean Squared Error:",mean_squared_error(y_test, y_pred))

Mean Squared Error: 18237626.460028622


In [15]:
print("R Squared Value:", r2_score(y_test, y_pred))

R Squared Value: 0.8696632783741738
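The mean squared error is expressed in squared dollars, which makes it hard to read against the charges themselves. Taking the square root gives the root mean squared error (RMSE) in the same units as `charges`. The sketch below applies that to the MSE value printed earlier.

```python
import math

# MSE value copied from the notebook output
mse = 18237626.460028622

# RMSE is the square root of MSE, in the same units as the target
rmse = math.sqrt(mse)
print(round(rmse, 2))
```

This works out to roughly 4,270, so a typical prediction error is on the order of a few thousand dollars, alongside an R squared of about 0.87.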
