# **Task:**
# Regression Project for Medical Insurance Forecast
To remain profitable, insurance companies must set premiums that follow population trends, even though they have limited information about the people they insure. This makes it necessary to estimate average medical care expenses from trends in population segments such as smokers, drivers, etc.

This regression project uses the Medical Cost Personal Dataset (insurance.csv). The aim is to predict the medical costs billed by health insurance to an individual, given some or all of the independent variables in the dataset. Because the target is a continuous variable, plain regression applies directly (i.e., without the decision boundary used in regression-based classification). Candidate models include multiple linear regression, polynomial regression, and Elastic Net regression.

Exploratory data analysis remains an essential step, even with this few features. It can reveal patterns, such as a lower tendency to smoke among people with children, that guide feature selection and lead to simpler models.

# Importing Necessary Packages

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import pickle



# Importing Dataset

In [2]:
data_set=pd.read_csv("insurance.csv")
data_set.head()

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


# Preprocessing and Analyzing Data Set

In [3]:
data_set.shape

(1338, 7)

In [4]:
data_set.duplicated().sum()

1
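The check above reports one duplicated row, and the notebook keeps it in the data. A minimal sketch of how such a row could be dropped is shown here on a small hypothetical frame (`toy` and its values are illustrative and not taken from insurance.csv):

```python
import pandas as pd

# Hypothetical three-row frame; the last two rows are identical on purpose.
toy = pd.DataFrame({
    "age": [30, 45, 45],
    "bmi": [25.0, 31.5, 31.5],
    "charges": [4000.0, 9000.0, 9000.0],
})

# duplicated() flags rows that repeat an earlier row.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence; reset_index tidies the row labels.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, deduped.shape)
```

Applied to the real data, the same two calls on `data_set` would be expected to leave 1337 rows, assuming the single duplicate reported above is the only one.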

In [5]:
data_set.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

In [6]:
data_set.describe()

,age,bmi,children,charges
count,1338.0,1338.0,1338.0,1338.0
mean,39.207025,30.663397,1.094918,13270.422265
std,14.04996,6.098187,1.205493,12110.011237
min,18.0,15.96,0.0,1121.8739
25%,27.0,26.29625,0.0,4740.28715
50%,39.0,30.4,1.0,9382.033
75%,51.0,34.69375,2.0,16639.912515
max,64.0,53.13,5.0,63770.42801


In [7]:
data_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [8]:
df=data_set.copy()

temp=pd.get_dummies(df[["sex", "smoker", "region"]])
df=df.drop(["sex", "smoker", "region"], axis=1)
df = pd.concat([df, temp], axis=1)
df.head()

,age,bmi,children,charges,sex_female,sex_male,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
0,19,27.9,0,16884.924,True,False,False,True,False,False,False,True
1,18,33.77,1,1725.5523,False,True,True,False,False,False,True,False
2,28,33.0,3,4449.462,False,True,True,False,False,False,True,False
3,33,22.705,0,21984.47061,False,True,True,False,False,True,False,False
4,32,28.88,0,3866.8552,False,True,True,False,False,True,False,False
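Encoding every category produces complementary columns: `sex_female` and `sex_male` always sum to 1, as do the `smoker_*` and `region_*` groups. With an intercept in the model, those groups are perfectly collinear. One common alternative, which this notebook does not use, is `drop_first=True`. The sketch below compares both encodings on a hypothetical three-row frame (`toy` is illustrative only):

```python
import pandas as pd

# Hypothetical categorical columns mirroring the notebook's column names.
toy = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
})

# Full encoding: one column per category, as in the cell above.
full = pd.get_dummies(toy)

# Reduced encoding: the first category of each variable becomes the baseline.
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))

# Each complementary pair in the full encoding sums to 1 on every row.
print((full["sex_female"].astype(int) + full["sex_male"].astype(int)).tolist())
```

With the reduced encoding, each remaining coefficient reads as a difference from the dropped baseline category.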


# Extracting Features

In [9]:
x = df[["age", "bmi", "children", "sex_female", "sex_male", "smoker_no", "smoker_yes", "region_northeast", "region_northwest", "region_southeast", "region_southwest"]]
y = df[["charges"]]


In [10]:
print(x)

      age     bmi  children  sex_female  sex_male  smoker_no  smoker_yes  \
0      19  27.900         0        True     False      False        True   
1      18  33.770         1       False      True       True       False   
2      28  33.000         3       False      True       True       False   
3      33  22.705         0       False      True       True       False   
4      32  28.880         0       False      True       True       False   
...   ...     ...       ...         ...       ...        ...         ...   
1333   50  30.970         3       False      True       True       False   
1334   18  31.920         0        True     False       True       False   
1335   18  36.850         0        True     False       True       False   
1336   21  25.800         0        True     False       True       False   
1337   61  29.070         0        True     False      False        True   

      region_northeast  region_northwest  region_southeast  region_southwest  
0       

In [11]:
print(y)

          charges
0     16884.92400
1      1725.55230
2      4449.46200
3     21984.47061
4      3866.85520
...           ...
1333  10600.54830
1334   2205.98080
1335   1629.83350
1336   2007.94500
1337  29141.36030

[1338 rows x 1 columns]


# Splitting Training and Test Sets

In [12]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test=train_test_split(x,y, test_size=0.2, random_state=2)
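To see what `test_size=0.2` means for 1338 rows, the sketch below runs the same call on placeholder arrays of that length (`X_demo` and `y_demo` are hypothetical stand-ins, not the insurance features) and prints the resulting shapes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the dataset (1338).
X_demo = np.arange(1338 * 2).reshape(1338, 2)
y_demo = np.arange(1338)

# Same arguments as the notebook cell: 20% held out, fixed seed for repeatability.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2
)

print(X_tr.shape, X_te.shape)
```

Because 20% of 1338 is not a whole number, scikit-learn rounds the test count up, so the held-out set has 268 rows and the training set 1070.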

# Multiple Linear Regression Model

In [13]:
from sklearn.linear_model import LinearRegression
model=LinearRegression()

# Training Model
model.fit(x_train,y_train)
# The Intercept
print('Intercept:',model.intercept_)
# The Coefficients
print('Coefficients:', model.coef_)

Intercept: [-349.37038695]
Coefficients: [[   251.22566407    332.82271398    587.9253102      18.56120037
     -18.56120037 -11956.17261513  11956.17261513    527.72812674
     148.53816329   -256.75623287   -419.51005716]]


# Testing the Trained Model

In [14]:
y_hat=model.predict(x_test)
print(y_hat[:10])

[[ 1917.97181268]
 [11986.25940683]
 [10490.48005024]
 [ 2304.12993764]
 [ 8293.50537439]
 [11166.05230839]
 [ 3358.09571616]
 [ 1110.00194483]
 [12035.96686456]
 [ 9458.90891087]]


In [15]:
print(y_test[:10])

          charges
17     2395.17155
1091  11286.53870
273    9617.66245
270    1719.43630
874    8891.13950
790    5662.22500
957   12609.88702
492    2196.47320
1125  14254.60820
794    7209.49180


# Evaluating Model Performance

In [16]:
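from sklearn.metrics import r2_score

# Convert y_test to a NumPy array so the subtraction is purely element-wise.
print("Mean Squared Error (MSE) : %.2f" % np.mean((y_hat - y_test.to_numpy()) ** 2))
# For a regressor, model.score returns R^2, the same quantity as r2_score below.
print('Variance score : %.2f' % model.score(x_test, y_test))
print("R2_Score : %.3f " % r2_score(y_test, y_hat))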

Mean Squared Error (MSE) : 38304871.35
Variance score : 0.74
R2_Score : 0.745 
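The MSE is expressed in squared charge units, which makes it hard to read. Its square root (RMSE) is in the same units as `charges`; for the MSE reported above it comes to about 6,189. The sketch below spells out the formulas behind these metrics in plain NumPy. `regression_metrics` is a hypothetical helper, and it is applied only to the ten test rows printed earlier, so its numbers will differ from the full test-set scores.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE and R^2 for two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))
    rmse = float(np.sqrt(mse))
    ss_res = float(np.sum(residuals ** 2))                 # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"mse": mse, "rmse": rmse, "r2": r2}

# First ten actual and predicted charges, copied from the printed outputs.
y_true_10 = [2395.17155, 11286.5387, 9617.66245, 1719.4363, 8891.1395,
             5662.225, 12609.88702, 2196.4732, 14254.6082, 7209.4918]
y_pred_10 = [1917.97181268, 11986.25940683, 10490.48005024, 2304.12993764,
             8293.50537439, 11166.05230839, 3358.09571616, 1110.00194483,
             12035.96686456, 9458.90891087]

print(regression_metrics(y_true_10, y_pred_10))

# RMSE implied by the MSE reported for the full test set.
rmse_full = float(np.sqrt(38304871.35))
print(round(rmse_full, 2))
```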


# Saving Trained Model

In [17]:
filename="model.pkl"
with open(filename, 'wb') as file:
    pickle.dump(model, file)

print('Model saved as:',filename)

Model saved as: model.pkl
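A saved model is only useful if it can be loaded again. The sketch below shows the round trip with `pickle.load`. To stay self-contained it trains a tiny hypothetical model (`toy_model`, fitted on made-up points along y = 2x + 1) and writes it to a temporary directory instead of reusing the notebook's `model.pkl`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data lying exactly on y = 2x + 1.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])
toy_model = LinearRegression().fit(X_toy, y_toy)

# Save to a temporary location, mirroring the dump call above.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as file:
    pickle.dump(toy_model, file)

# Load the model back and check that it predicts the same values.
with open(path, "rb") as file:
    restored = pickle.load(file)

same_predictions = bool(np.allclose(toy_model.predict(X_toy), restored.predict(X_toy)))
print(same_predictions, restored.predict(np.array([[4.0]])))
```

Pickle files can execute arbitrary code when loaded, so only load files from a trusted source. Loading with the same scikit-learn version that saved the model is the safest choice.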
