# No. 01: Regression Project on the Medical Cost Personal Dataset


> Objective:
 1. *Apply various regression algorithms to a real-world dataset.*

## Dataset (Medical Cost Personal Datasets)
Columns

**age:** age of primary beneficiary

**sex:** insurance contractor gender, female, male

**bmi:** Body mass index, an objective measure of whether body weight is high or low relative to height, computed as weight divided by height squared (kg/m^2); ideally 18.5 to 24.9

**children:** Number of children covered by health insurance / Number of dependents

**smoker:** whether the beneficiary smokes, yes, no

**region:** the beneficiary's residential area in the US, northeast, southeast, southwest, northwest.

**charges:** Individual medical costs billed by health insurance

In [1]:
import warnings
warnings.filterwarnings("ignore")

### **Import the Libraries**

In [2]:
import numpy as np        
import pandas as pd     
import matplotlib.pyplot as plt 
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR 

### **Dataset**

In [3]:
# Download the data
!wget -O insurance.csv https://www.dropbox.com/s/mwgqgjbmfw0xa5p/insurance.csv?dl=0

--2021-12-20 21:30:54--  https://www.dropbox.com/s/mwgqgjbmfw0xa5p/insurance.csv?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.6.18, 2620:100:601d:18::a27d:512
Connecting to www.dropbox.com (www.dropbox.com)|162.125.6.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/mwgqgjbmfw0xa5p/insurance.csv [following]
--2021-12-20 21:30:54--  https://www.dropbox.com/s/raw/mwgqgjbmfw0xa5p/insurance.csv
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uceeaf3b65c8e92125f8833a527a.dl.dropboxusercontent.com/cd/0/inline/BcP2nYZX8mh7HLO-gWL_M1xv2cnF7xINM1CO_GORUIyr4Xk83SkqBatpyuLek2KXZlXE8su3bq4yWWIuh_-hdgjL3aPXw2ef3mTtmM4Gc5Cz1Pu6MyTGeozmIVmo8InzokBkFK6YWaAEC7YJhFqehcRz/file# [following]
--2021-12-20 21:30:54--  https://uceeaf3b65c8e92125f8833a527a.dl.dropboxusercontent.com/cd/0/inline/BcP2nYZX8mh7HLO-gWL_M1xv2cnF7xINM1CO_GORUIyr4Xk83SkqBatpyuLek2KXZlXE8su3bq4yWWIuh_

In [16]:
# Import the dataset

dataset = pd.read_csv('insurance.csv')
dataset

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


In [17]:
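# Feature columns (.copy() gives an independent frame, so later column assignments do not touch `dataset`)
features = dataset[['age', 'sex', 'bmi', 'children','smoker','region']].copy()
# Target column
target = dataset[['charges']].copy()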

In [None]:
features

### **Label Encoding**

In [19]:
from sklearn.preprocessing import LabelEncoder

In [20]:
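labelencoder_f = LabelEncoder()
# Replace each categorical column with integer codes (categories are numbered in sorted order)
features['sex'] = labelencoder_f.fit_transform(features['sex'])
features['smoker'] = labelencoder_f.fit_transform(features['smoker'])
# 'region' is encoded too: the displayed output shows it as numeric, and the mean imputer used later cannot handle strings
features['region'] = labelencoder_f.fit_transform(features['region'])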

In [None]:
features
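
The mapping that `LabelEncoder` produces can be inspected through its `classes_` attribute. The following is a minimal sketch on a toy frame (the four rows are illustrative, not read from `insurance.csv`); it keeps one encoder per column, which is an assumption of this sketch rather than what the notebook does, so that each mapping stays available afterwards.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with the same categorical columns as the dataset (rows are illustrative).
toy = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

# Fit one encoder per column and keep it, so the mapping can be inspected or inverted later.
encoders = {}
for column in ["sex", "smoker", "region"]:
    encoders[column] = LabelEncoder()
    toy[column] = encoders[column].fit_transform(toy[column])

# classes_ lists the categories in sorted order; the position in that list is the integer code.
sex_mapping = {label: code for code, label in enumerate(encoders["sex"].classes_)}
print(sex_mapping)
print(toy)
```

Because the codes follow sorted order, `region` receives the values 0 to 3 alphabetically. These integers imply an ordering that the regions do not really have, which is worth keeping in mind when interpreting a linear model fitted on them.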

### Taking care of missing values

In [26]:
from sklearn.impute import SimpleImputer

In [None]:
imputer = SimpleImputer(missing_values=np.nan,strategy = "mean") # imputer is a SimpleImputer object; NaN entries are replaced by the column mean
imputer = imputer.fit(features[['age', 'sex', 'bmi', 'children','smoker','region']])

In [None]:
features[['age', 'sex', 'bmi', 'children','smoker','region']]= imputer.transform(features[['age', 'sex', 'bmi', 'children','smoker','region']])

In [None]:
features

Unnamed: 0,age,sex,bmi,children,smoker,region
0,19.0,0.0,27.900,0.0,1.0,3.0
1,18.0,1.0,33.770,1.0,0.0,2.0
2,28.0,1.0,33.000,3.0,0.0,2.0
3,33.0,1.0,22.705,0.0,0.0,1.0
4,32.0,1.0,28.880,0.0,0.0,1.0
...,...,...,...,...,...,...
1333,50.0,1.0,30.970,3.0,0.0,1.0
1334,18.0,0.0,31.920,0.0,0.0,0.0
1335,18.0,0.0,36.850,0.0,0.0,2.0
1336,21.0,0.0,25.800,0.0,0.0,3.0


In [None]:
imputer = SimpleImputer(missing_values=np.nan,strategy = "mean") # a second SimpleImputer, fitted on the target column
imputer = imputer.fit(target[['charges']])

In [None]:
target[['charges']]= imputer.transform(target[['charges']])

In [None]:
target

Unnamed: 0,charges
0,16884.92400
1,1725.55230
2,4449.46200
3,21984.47061
4,3866.85520
...,...
1333,10600.54830
1334,2205.98080
1335,1629.83350
1336,2007.94500
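
Before imputing, it can help to count how many values are actually missing in each column. The sketch below does this on a small toy frame with one deliberately inserted gap (the frame and its values are assumptions for illustration) and then applies the same mean strategy used above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numeric frame with one gap in 'age' (values are illustrative).
toy = pd.DataFrame({
    "age": [19.0, 18.0, np.nan, 33.0],
    "bmi": [27.9, 33.77, 33.0, 22.705],
})

# Count missing entries per column to see whether imputation is needed at all.
missing_before = toy.isna().sum()

# Mean imputation: each NaN is replaced by the mean of the observed values in its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)

print(missing_before)
print(filled)
```

If `isna().sum()` reports zero for every column, the imputation step leaves the values unchanged apart from converting them to floats.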


### **Splitting Dataset**

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Split the dataset into a training set and a test set

X_train,X_test,y_train,y_test=train_test_split(features,target,test_size = 0.2,random_state = 0)
# random_state = 0 fixes the shuffle, so the same split is produced on every run

In [None]:
print(X_train.shape)
print(X_test.shape)

(1070, 6)
(268, 6)
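
The shapes above follow from how `train_test_split` rounds the test share. As a small check, the sketch below uses placeholder arrays (all zeros, an assumption made only to reproduce the row counts) with the same 1338 rows and 6 feature columns.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows and feature columns as the dataset.
X = np.zeros((1338, 6))
y = np.zeros((1338, 1))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 20% of 1338 is 267.6; the test size is rounded up to 268, leaving 1070 training rows.
print(X_train.shape, X_test.shape)
```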


### **Feature Scaling**

In [None]:
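from sklearn.preprocessing import StandardScaler
X_sc = StandardScaler()  # created but never applied: the features stay unscaled
y_sc = StandardScaler()
# Only the target is standardized, so the error metrics computed later are in units of its standard deviation
y_train = y_sc.fit_transform(y_train[['charges']])
y_test = y_sc.transform(y_test[['charges']])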
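
Since the target is standardized, predictions and errors come out in standardized units rather than in the original charges. A fitted `StandardScaler` can map values back with `inverse_transform`. The sketch below shows the round trip on four charges copied from the dataset preview; the variable names `charges`, `scaled`, and `restored` are introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Four charges taken from the first rows of the dataset preview, as a column vector.
charges = np.array([[16884.924], [1725.5523], [4449.462], [21984.47061]])

y_sc = StandardScaler()
scaled = y_sc.fit_transform(charges)        # zero mean, unit variance
restored = y_sc.inverse_transform(scaled)   # back to the original units

print(scaled.ravel())
print(restored.ravel())
```

The same call, applied to a model's predictions reshaped to a column, would express them in the original units of `charges`.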

### Different Types of Regression Algorithms

1. Linear Regression (Univariate or Multivariate)
2. Support Vector Regression
3. Decision Tree Regression
4. Random Forest Regression

### **Multiple Linear Regression**

In [None]:
from sklearn.linear_model import LinearRegression


regressor = LinearRegression()


regressor.fit(X_train,y_train)

LinearRegression()

In [None]:
# predicting the Test set Results
y_pred = regressor.predict(X_test)

In [None]:
regressor.score(X_train,y_train)

0.7368306228430945

In [None]:
y_test

In [None]:
X_test

Unnamed: 0,age,sex,bmi,children,smoker,region
578,52.0,1.0,30.200,1.0,0.0,3.0
610,47.0,0.0,29.370,1.0,0.0,2.0
569,48.0,1.0,40.565,2.0,1.0,1.0
1034,61.0,1.0,38.380,0.0,0.0,1.0
198,51.0,0.0,18.050,0.0,0.0,1.0
...,...,...,...,...,...,...
1084,62.0,0.0,30.495,2.0,0.0,1.0
726,41.0,1.0,28.405,1.0,0.0,1.0
1132,57.0,1.0,40.280,0.0,0.0,0.0
725,30.0,0.0,39.050,3.0,1.0,2.0


In [None]:
y_pred

array([[-1.82397841e-01],
       [-2.85099148e-01],
       [ 2.07069569e+00],
       [ 2.44003975e-01],
       [-5.23689076e-01],
       [-7.71045640e-01],
       [-9.72926790e-01],
       [ 9.03680276e-02],
       [-3.54761076e-01],
       [-4.81831323e-01],
       [-7.22976711e-01],
       [-2.46702450e-01],
       [-3.79509044e-01],
       [-7.57918614e-01],
       [ 1.22019971e+00],
       [-1.80529836e-01],
       [-1.62836802e-01],
       [-5.98833743e-01],
       [-4.19907641e-01],
       [ 1.15573478e+00],
       [ 1.70180511e+00],
       [ 8.90214986e-02],
       [-1.26625563e-01],
       [ 1.60401397e+00],
       [-7.33827716e-01],
       [-3.38672047e-01],
       [-1.00833105e+00],
       [-2.59908886e-01],
       [-7.61123005e-01],
       [-2.37545828e-01],
       [-3.54025144e-01],
       [ 2.25865181e+00],
       [ 1.90146991e-01],
       [ 3.89151803e-02],
       [ 9.63818549e-01],
       [-6.74078969e-01],
       [-2.99239284e-02],
       [ 1.44695638e+00],
       [ 1.6

### **Evaluation Metrics**

1. Mean Absolute Error (MAE)
2. Mean Squared Error (MSE)
3. R-Squared (R²) Score
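
To make the three metrics concrete, the sketch below computes them directly with NumPy on a small made-up example and compares the results with the scikit-learn functions. The arrays `y_true` and `y_hat` are illustrative values, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true values and predictions for illustration.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# MAE: average absolute difference between prediction and truth.
mae = np.mean(np.abs(y_true - y_hat))

# MSE: average squared difference; large errors are penalized more heavily.
mse = np.mean((y_true - y_hat) ** 2)

# R-squared: 1 minus the ratio of residual variation to total variation around the mean.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mae, mean_absolute_error(y_true, y_hat))
print(mse, mean_squared_error(y_true, y_hat))
print(r2, r2_score(y_true, y_hat))
```

Lower MAE and MSE are better, while R-squared is better the closer it is to 1.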

In [None]:
from sklearn.metrics import mean_absolute_error

# MAE

mean_absolute_error(y_test, y_pred)

0.328251006254937

In [None]:
from sklearn.metrics import mean_squared_error

mean_squared_error(y_test, y_pred)

0.22213004284614

In [None]:
from sklearn.metrics import r2_score

r2_score(y_test, y_pred)

0.7998747145449959

In [None]:
regressor.score(X_train,y_train)

0.7368306228430945

### **Support Vector Regression**

In [None]:
# Fitting SVR to the dataset
from sklearn.svm import SVR 

regressor = SVR(kernel = 'linear')
regressor.fit(X_train, y_train.ravel())  # SVR expects a 1-D target; ravel() flattens the (n, 1) array

SVR(kernel='linear')

In [None]:
y_pred = regressor.predict(X_test)

In [None]:
regressor.score(X_train,y_train)

0.6795638290744734

In [None]:
r2_score(y_test, y_pred)

0.7689076024715965

In [None]:
mean_absolute_error(y_test, y_pred)

0.26206449818981736

In [None]:
mean_squared_error(y_test, y_pred)

0.25650214088485607

### **Decision Tree Regression**

In [None]:
from sklearn.tree import DecisionTreeRegressor


regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(X_train,y_train) 


DecisionTreeRegressor(random_state=0)

In [None]:
y_pred = regressor.predict(X_test)
r2_score(y_test, y_pred)

0.6562575965127319

In [None]:
regressor.score(X_train,y_train)

0.9982963931606104

In [None]:
mean_squared_error(y_test, y_pred)

0.3815385679079004

In [None]:
mean_absolute_error(y_test, y_pred)

0.3008404697042962

### **Random Forest Regression**

In [None]:
# Fitting the Random Forest Regression with the dataset
from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(n_estimators = 10,random_state = 0) # n_estimators is the number of decision trees
regressor.fit(X_train, y_train.ravel())  # ravel() flattens the (n, 1) target to the 1-D shape the forest expects

RandomForestRegressor(n_estimators=10, random_state=0)

In [None]:
y_pred = regressor.predict(X_test)
r2_score(y_test, y_pred)

0.8757214770662562

In [None]:
regressor.score(X_train,y_train)

0.9644959129365694

In [None]:
mean_absolute_error(y_test, y_pred)

0.2156241682333011

In [None]:
mean_squared_error(y_test, y_pred)

0.1379435565144235

## Discussion
In this experiment, four regression algorithms were applied to the "Medical Cost Personal Dataset": Linear Regression, Support Vector Regression, Decision Tree Regression, and Random Forest Regression. The libraries were imported first, then the dataset was loaded and preprocessed: the categorical columns were label-encoded, missing values were imputed with the column mean, and the target was standardized. The charges column served as the target and the remaining columns as features. The dataset was then split into training and test sets, with 80% of the rows used for training. After each regressor was trained, it was evaluated on the test set with MAE, MSE, and R-squared. Random Forest Regression gave the best result on all three metrics (test R-squared 0.876, MAE 0.216, MSE 0.138, all measured on the standardized target), ahead of Linear Regression (test R-squared 0.800), Support Vector Regression (0.769), and Decision Tree Regression (0.656). The Decision Tree scored 0.998 on the training set but only 0.656 on the test set, which indicates overfitting.
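
The comparison described above can be sketched as a single loop that fits each regressor and records its training and test R-squared. The sketch runs on synthetic data from `make_regression`, which is an assumption made so that it is self-contained; the `models` and `results` dictionaries are names introduced here, and the scores it prints are not the notebook's results.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption); replace with the prepared features and target.
X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The four regressors with the same settings as in the sections above.
models = {
    "Linear Regression": LinearRegression(),
    "Support Vector Regression": SVR(kernel="linear"),
    "Decision Tree Regression": DecisionTreeRegressor(random_state=0),
    "Random Forest Regression": RandomForestRegressor(n_estimators=10, random_state=0),
}

# Fit each model and record R-squared on both the training and the test set.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = {
        "train_r2": model.score(X_train, y_train),
        "test_r2": r2_score(y_test, model.predict(X_test)),
    }

for name, scores in results.items():
    print(f"{name}: train R2 = {scores['train_r2']:.3f}, test R2 = {scores['test_r2']:.3f}")
```

Placing the training and test scores side by side makes a gap like the Decision Tree's easy to spot.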