# Objective

- Using the sample data, build a machine learning model that predicts insurance charges from the given features.

**Algorithm**: Polynomial Regression

# Library & Data Import

In [1]:
# Importing packages and libraries
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
%matplotlib inline

In [2]:
# reading & previewing data
# header=0 tells pandas that the first (0th) line of the file holds the column names
df = pd.read_csv('/content/drive/MyDrive/Regression Modelling/My Projects/Insurance Project/insurance.csv', header=0)
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges_thousand
0,19,female,27.9,0,yes,southwest,16.884924
1,18,male,33.77,1,no,southeast,1.725552
2,28,male,33.0,3,no,southeast,4.449462
3,33,male,22.705,0,no,northwest,21.984471
4,32,male,28.88,0,no,northwest,3.866855


## Dummy Variable Creation
- This is done to convert categorical values into numerical/boolean values for model building.

## Note: Avoid the Dummy Variable Trap
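- Always avoid the dummy variable trap: when the model includes an intercept, a full set of dummy columns for one categorical variable is perfectly collinear with it.
- For a categorical variable with N distinct values, keep only N-1 dummy columns. The dropped value becomes the baseline that the remaining dummies are compared against.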
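
As a minimal sketch of the N-1 rule, the snippet below uses the `drop_first=True` option of `pd.get_dummies` on a small made-up frame. The `toy` DataFrame and its values are assumptions for illustration only and are not the insurance data.

```python
import pandas as pd

# Toy data for illustration only; not the insurance dataset.
toy = pd.DataFrame({
    "region": ["southwest", "southeast", "northwest", "northeast"],
    "smoker": ["yes", "no", "no", "yes"],
})

# drop_first=True keeps N-1 dummy columns for each categorical variable.
dummies = pd.get_dummies(toy, drop_first=True)
dummy_columns = list(dummies.columns)
print(dummy_columns)
```

Note that `drop_first=True` drops the first level of each variable in sorted order, so the baseline categories would differ from the ones removed by hand in this notebook (`sex_male`, `smoker_no`, `region_southwest`). The fit is equivalent either way; only the interpretation of the coefficients changes.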

In [3]:
# creating dummy variables for all the categorical variables in the dataset
df = pd.get_dummies(df)

In [4]:
# testing the operations
df.head()

Unnamed: 0,age,bmi,children,charges_thousand,sex_female,sex_male,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
0,19,27.9,0,16.884924,1,0,0,1,0,0,0,1
1,18,33.77,1,1.725552,0,1,1,0,0,0,1,0
2,28,33.0,3,4.449462,0,1,1,0,0,0,1,0
3,33,22.705,0,21.984471,0,1,1,0,0,1,0,0
4,32,28.88,0,3.866855,0,1,1,0,0,1,0,0


In [5]:
# deleting one dummy column per categorical variable (N-1 rule)
del df['sex_male']
del df['smoker_no']
del df['region_southwest']

In [6]:
# testing the operations
df.head()

Unnamed: 0,age,bmi,children,charges_thousand,sex_female,smoker_yes,region_northeast,region_northwest,region_southeast
0,19,27.9,0,16.884924,1,1,0,0,0
1,18,33.77,1,1.725552,0,0,0,0,1
2,28,33.0,3,4.449462,0,0,0,0,1
3,33,22.705,0,21.984471,0,0,0,1,0
4,32,28.88,0,3.866855,0,0,0,1,0


## Declaring the dependent and the independent variables

In [7]:
x = df[['age', 'bmi', 'children', 'sex_female',
       'smoker_yes', 'region_northeast', 'region_northwest',
       'region_southeast']]
y = df['charges_thousand']

## Splitting the dataset into the Training set and Test set

Note: Using an 80:20 Train:Test split

In [8]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)

## Training the Polynomial Regression model on the Training set

In [9]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
poly_reg = PolynomialFeatures(degree = 4)
x_poly = poly_reg.fit_transform(x_train)
regressor = LinearRegression()
regressor.fit(x_poly, y_train)

LinearRegression()
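
To get a feel for how much `PolynomialFeatures(degree = 4)` enlarges the design matrix, the sketch below counts the output columns for 8 input features, the number of columns in `x` above. The all-zeros `dummy_input` is a placeholder assumption; only its shape matters for the count.

```python
import numpy as np
from math import comb
from sklearn.preprocessing import PolynomialFeatures

n_features = 8  # number of columns in x in this notebook
degree = 4      # degree used in the cell above

# Placeholder input: only the shape matters for counting output columns.
dummy_input = np.zeros((1, n_features))
expanded = PolynomialFeatures(degree=degree).fit_transform(dummy_input)

n_expanded = expanded.shape[1]
# Closed form with the bias column included: C(n_features + degree, degree)
n_expected = comb(n_features + degree, degree)
print(n_expanded, n_expected)
```

The count grows quickly with the degree, which is worth keeping in mind when comparing the number of fitted coefficients with the number of training rows.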

## Predicting the Test set results

In [11]:
y_pred = regressor.predict(poly_reg.transform(x_test))

## Evaluating the Model Performance

In [12]:
# Checking the R squared score
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

0.8449061181505393
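
For reference, R squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand and compares it with `r2_score`. The arrays `y_true` and `y_hat` are made-up values for illustration, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up values for illustration; not results from this notebook.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))
```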

## Conclusion

Based on the model's goodness of fit, with all features included the model explains around 84% of the variation in the dependent variable, insurance charges, on the test set.

While this is a large share of the variation, it is advisable to compare other regression algorithms before selecting this model.
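
One possible way to run such a comparison is cross-validated R squared over a few candidate models. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_regression` as a stand-in for the insurance data, and the `candidates` dictionary with its two entries is a hypothetical choice, not a recommendation.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumption); replace with x and y from the notebook.
x_demo, y_demo = make_regression(
    n_samples=200, n_features=4, noise=10.0, random_state=0
)

# Hypothetical candidate models to compare.
candidates = {
    "linear": LinearRegression(),
    "poly_degree_2": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
}

# Mean R squared over 5 cross-validation folds for each candidate.
cv_r2 = {}
for name, model in candidates.items():
    scores = cross_val_score(model, x_demo, y_demo, cv=5, scoring="r2")
    cv_r2[name] = scores.mean()

print(cv_r2)
```

Wrapping `PolynomialFeatures` and `LinearRegression` in a pipeline keeps the feature expansion inside each fold, so every candidate is scored on data it was not fitted on.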