We will build a multiple linear regression model for the Medical Cost dataset. The dataset has six independent features (age, sex, BMI (body mass index), children, smoker and region) and one dependent feature, charges. The goal is to predict the individual medical costs billed by health insurance.

# STEP A: DATA PRE-PROCESSING

## STEP 1: IMPORTING LIBRARIES

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## STEP 2: IMPORT DATASET

In [2]:
insurance_data=pd.read_csv("insurance.csv")

In [3]:
insurance_data

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


In [4]:
insurance_data.head()

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


## STEP 3: DIVIDING THE DATASET INTO FEATURE MATRIX (X) AND DEPENDENT VARIABLE VECTOR (Y)
X holds the input features (every column except the last) and Y holds the target variable, charges (the last column).

In [5]:
X = insurance_data.iloc[:,:-1].values
Y = insurance_data.iloc[:, -1].values

In [6]:
X

array([[19, 'female', 27.9, 0, 'yes', 'southwest'],
       [18, 'male', 33.77, 1, 'no', 'southeast'],
       [28, 'male', 33.0, 3, 'no', 'southeast'],
       ...,
       [18, 'female', 36.85, 0, 'no', 'southeast'],
       [21, 'female', 25.8, 0, 'no', 'southwest'],
       [61, 'female', 29.07, 0, 'yes', 'northwest']], dtype=object)

In [7]:
Y

array([16884.924 ,  1725.5523,  4449.462 , ...,  1629.8335,  2007.945 ,
       29141.3603])

## STEP 4: REPLACE MISSING VALUES
The dataset has no missing values: every column reports zero nulls, so nothing needs to be replaced.

In [8]:
insurance_data.isnull().sum() 

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
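
Had the check above reported missing values, one common option would be mean imputation. The following is a minimal sketch on a hypothetical toy frame (the `toy` data and its gaps are invented for illustration and are not part of insurance.csv), using scikit-learn's `SimpleImputer`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with gaps; the real insurance data has none.
toy = pd.DataFrame({
    "age": [19.0, np.nan, 28.0, 33.0],
    "bmi": [27.9, 33.77, np.nan, 22.705],
})

# Replace each missing value with the mean of its own column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(toy)

print(filled)
```

Mean imputation is only one strategy; `strategy="median"` or `strategy="most_frequent"` may suit skewed or categorical columns better.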

## STEP 5: ENCODING CATEGORICAL DATA
Categorical features (sex, smoker and region) must be converted to numbers before the model can use them.
The output, charges, is already numeric, so it needs no encoding.

### ENCODING FEATURE MATRIX USING OneHotEncoder

In [9]:
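from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# One-hot encode columns 1 (sex), 4 (smoker) and 5 (region); the encoded columns come first,
# followed by the remaining columns (age, bmi, children) passed through unchanged.
ct=ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1,4,5])], remainder='passthrough')
X=np.array(ct.fit_transform(X))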

In [10]:
X

array([[1.0, 0.0, 0.0, ..., 19, 27.9, 0],
       [0.0, 1.0, 1.0, ..., 18, 33.77, 1],
       [0.0, 1.0, 1.0, ..., 28, 33.0, 3],
       ...,
       [1.0, 0.0, 1.0, ..., 18, 36.85, 0],
       [1.0, 0.0, 1.0, ..., 21, 25.8, 0],
       [1.0, 0.0, 0.0, ..., 61, 29.07, 0]], dtype=object)
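
Keeping every category produces one-hot columns that sum to 1 within each feature, which makes them perfectly collinear with the intercept (the "dummy variable trap"). A possible variant, sketched here on the first four rows of the dataset, drops one category per feature with `drop="first"`. The name `toy` and the use of column names instead of positions are choices made for this sketch, not part of the original notebook:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# First four rows of the insurance data, used here as a small example.
toy = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "sex": ["female", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705],
    "children": [0, 1, 3, 0],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest"],
})

ct = ColumnTransformer(
    transformers=[
        # drop="first" removes the first category of each feature,
        # so the remaining dummy columns are no longer redundant.
        ("encoder", OneHotEncoder(drop="first"), ["sex", "smoker", "region"]),
    ],
    remainder="passthrough",  # keep age, bmi, children as they are
    sparse_threshold=0,       # always return a dense array
)
X_toy = ct.fit_transform(toy)

print(X_toy.shape)
print(X_toy)
```

On this toy frame only three regions occur, so the result has 1 + 1 + 2 dummy columns plus the three numeric columns. On the full dataset, with four regions, the same setup would give eight columns instead of eleven.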

In [11]:
Y

array([16884.924 ,  1725.5523,  4449.462 , ...,  1629.8335,  2007.945 ,
       29141.3603])

## STEP 6: SPLITTING THE DATASET INTO TRAINING AND TESTING DATASET

In [12]:
from sklearn.model_selection import train_test_split
Xtrain,Xtest,Ytrain,Ytest=train_test_split(X,Y,test_size=0.2,random_state=1)

In [13]:
Xtrain

array([[1.0, 0.0, 1.0, ..., 53, 26.6, 0],
       [0.0, 1.0, 1.0, ..., 53, 21.4, 1],
       [0.0, 1.0, 1.0, ..., 18, 37.29, 0],
       ...,
       [1.0, 0.0, 0.0, ..., 51, 34.96, 2],
       [1.0, 0.0, 0.0, ..., 40, 22.22, 2],
       [0.0, 1.0, 1.0, ..., 57, 27.94, 1]], dtype=object)

In [14]:
Ytrain

array([10355.641 , 10065.413 ,  1141.4451, ..., 44641.1974, 19444.2658,
       11554.2236])

## STEP 7: FEATURE SCALING

In [15]:
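from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
# Scale every column except the last one (children); note that this also rescales the one-hot columns.
Xtrain[:, :-1]=sc.fit_transform(Xtrain[:,:-1])
# Caution: fit_transform re-fits the scaler on the test set, so Xtest is scaled with different
# means and standard deviations than Xtrain. sc.transform would reuse the training statistics.
Xtest[:, :-1]=sc.fit_transform(Xtest[:,:-1])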
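
The usual convention is to fit the scaler on the training set only and then apply the same transformation to the test set, leaving dummy columns untouched. A minimal sketch on synthetic data follows; the arrays `train` and `test`, their column layout `[dummy, age, bmi]` and the random values are assumptions made for this illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

def make_rows(n):
    # Hypothetical stand-in for the encoded data: [dummy, age, bmi]
    return np.column_stack([
        rng.integers(0, 2, n),      # a 0/1 dummy column
        rng.integers(18, 65, n),    # age
        rng.normal(30.0, 6.0, n),   # bmi
    ]).astype(float)

train, test = make_rows(80), make_rows(20)

sc = StandardScaler()
train_scaled, test_scaled = train.copy(), test.copy()

# Learn mean and standard deviation from the training rows only...
train_scaled[:, 1:] = sc.fit_transform(train[:, 1:])
# ...and reuse those same statistics for the test rows.
test_scaled[:, 1:] = sc.transform(test[:, 1:])

print(sc.mean_, sc.scale_)
```

Scaled this way, a given age or BMI maps to the same standardized value in both sets, which is what the fitted coefficients expect.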

# STEP B: BUILDING MULTIPLE LINEAR REGRESSION MODEL

## STEP 1: TRAINING THE MODEL

In [16]:
from sklearn.linear_model import LinearRegression
MultiLR=LinearRegression()
MultiLR.fit(Xtrain, Ytrain)

LinearRegression()

## STEP 2: TESTING THE MODEL

In [17]:
Y_estimate=MultiLR.predict(Xtest)

In [18]:
Xtest

array([[-1.0226443793995803, 1.0226443793995803, 0.502331014967354, ...,
        -1.4658537218273997, 0.8173366199501073, 0],
       [-1.0226443793995803, 1.0226443793995803, 0.502331014967354, ...,
        1.1847596132383391, 0.150902982034023, 0],
       [-1.0226443793995803, 1.0226443793995803, 0.502331014967354, ...,
        0.7662417182279593, 1.0628648023402434, 0],
       ...,
       [-1.0226443793995803, 1.0226443793995803, -1.9907192074632172,
        ..., 0.48722978822103946, 1.3868951927054567, 2],
       [0.977857034316387, -0.977857034316387, 0.502331014967354, ...,
        -1.4658537218273997, 0.9918787632138439, 0],
       [0.977857034316387, -0.977857034316387, 0.502331014967354, ...,
        -0.3498060017997202, -0.7535426694235194, 0]], dtype=object)

In [19]:
Ytest

array([ 1646.4297 , 11353.2276 ,  8798.593  , 10381.4787 ,  2103.08   ,
       38746.3551 ,  9304.7019 , 11658.11505,  3070.8087 , 19539.243  ,
       12629.8967 , 11538.421  ,  6338.0756 ,  7050.642  ,  1137.4697 ,
        8968.33   , 21984.47061,  6414.178  , 28287.89766, 13462.52   ,
        9722.7695 , 40932.4295 ,  8026.6666 ,  8444.474  ,  2203.47185,
        6664.68595,  8606.2174 ,  8283.6807 ,  5375.038  ,  3645.0894 ,
       11674.13   , 11737.84884, 24873.3849 , 33750.2918 , 24180.9335 ,
        9863.4718 , 36837.467  , 17942.106  , 11856.4115 , 39725.51805,
        4349.462  , 11743.9341 , 19749.38338, 12347.172  ,  4931.647  ,
       30259.99556, 27724.28875, 34672.1472 ,  9644.2525 , 14394.39815,
       12557.6053 , 11881.358  ,  2352.96845,  9101.798  , 17178.6824 ,
        3994.1778 , 40941.2854 , 12644.589  , 22395.74424,  1149.3959 ,
        3366.6697 , 13143.33665, 18328.2381 ,  2690.1138 , 12741.16745,
        8765.249  , 10264.4421 , 22192.43711,  2709.24395, 14571

In [20]:
MultiLR.coef_

array([ 3.15539548e+16,  3.15539548e+16, -6.84463584e+15, -6.84463584e+15,
        1.20319794e+17,  1.19696857e+17,  1.23114753e+17,  1.20778958e+17,
        3.60247220e+03,  1.99074840e+03,  3.72633732e+02])

In [21]:
MultiLR.intercept_

12842.042412370789

In [22]:
# Evaluation: MSE (mean squared error); arguments are (y_true, y_pred)
from sklearn.metrics import mean_squared_error
J_mse = mean_squared_error(Ytest, Y_estimate)

# R squared (coefficient of determination) on the test set
R_square = MultiLR.score(Xtest,Ytest)
print('The Mean Square Error(MSE) or J(theta) is: ',J_mse)
print('R square obtained is :',R_square)

The Mean Square Error(MSE) or J(theta) is:  9.15419269994171e+31
R square obtained is : -6.132203562262155e+23
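
To make the two metrics concrete: MSE is the mean of the squared residuals, and R squared is 1 - SS_res / SS_tot, where SS_tot measures the spread of the true values around their mean. R squared is 0 for a model that always predicts the mean and goes negative for anything worse. A small sketch with made-up numbers (the arrays below are hypothetical and unrelated to the insurance data):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot (coefficient of determination)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
good = np.array([1.1, 1.9, 3.2, 3.8])       # close to the truth
mean_only = np.full(4, y_true.mean())       # always predicts the mean
bad = np.array([40.0, -30.0, 25.0, -10.0])  # wildly off

for name, pred in [("good", good), ("mean_only", mean_only), ("bad", bad)]:
    print(name, mse(y_true, pred), r_squared(y_true, pred))
```

The wildly wrong predictions give a strongly negative R squared, which is the same situation as the result above, only on a much smaller scale.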


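Conclusion: the R squared is hugely negative and the MSE is on the order of 1e31, so the model predicts far worse than simply guessing the mean charge. This says more about the preprocessing than about the data. The coefficients of about 1e16 to 1e17 on the eight one-hot columns are the warning sign: keeping every category makes those columns perfectly collinear, so their weights are not uniquely determined and only cancel out on data scaled exactly like the training set. Because the test set was scaled with a separately fitted StandardScaler, that cancellation most likely fails and the predictions explode. Dropping one category per feature and scaling the test set with the scaler fitted on the training set should be tried before drawing any conclusion about how well a linear model fits this dataset.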
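
One way to avoid both the enormous coefficients and the negative R squared seen above is to put the preprocessing and the regressor in a single scikit-learn `Pipeline`, so the scaler and encoder are fitted on the training split only. The sketch below runs on synthetic data: the frame `data` mimics the column names of insurance.csv, but its values and the rule generating `charges` are invented for illustration and are not estimates from the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(1)
n = 500

# Synthetic stand-in with the same column names as insurance.csv.
data = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "sex": rng.choice(["female", "male"], n),
    "bmi": rng.normal(30.0, 6.0, n),
    "children": rng.integers(0, 5, n),
    "smoker": rng.choice(["yes", "no"], n, p=[0.2, 0.8]),
    "region": rng.choice(
        ["northeast", "northwest", "southeast", "southwest"], n
    ),
})
# Made-up generating rule, NOT fitted to the real data.
data["charges"] = (
    250 * data["age"]
    + 300 * data["bmi"]
    + 20000 * (data["smoker"] == "yes")
    + rng.normal(0, 3000, n)
)

X = data.drop(columns="charges")
y = data["charges"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

preprocess = ColumnTransformer([
    # Standardize only the numeric columns.
    ("num", StandardScaler(), ["age", "bmi", "children"]),
    # Drop one category per feature to avoid redundant dummy columns.
    ("cat", OneHotEncoder(drop="first"), ["sex", "smoker", "region"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", LinearRegression()),
])

# fit() learns scaling and encoding from the training split only;
# score() applies that same fitted preprocessing to the test split.
model.fit(X_train, y_train)
r2 = model.score(X_test, y_test)
coefs = model.named_steps["regressor"].coef_

print("R squared on the test split:", r2)
print("Largest absolute coefficient:", np.abs(coefs).max())
```

On the real dataset the same pipeline could be fitted by replacing `data` with `insurance_data`; the resulting scores are not shown here because they have not been computed in this notebook.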