**Problem Name:**
*An insurance cost prediction problem using different types of regression algorithms.*

**Objectives:**


1. To learn about different types of data pre-processing, such as encoding categorical data and handling missing data.
2. To learn about feature scaling and different Python libraries.
3. To acquire knowledge about different regression algorithms and how to use them on a dataset to predict the target column.

In [None]:
# Downloading the data
!wget -O insurance.csv https://www.dropbox.com/s/mwgqgjbmfw0xa5p/insurance.csv?dl=0

--2021-12-10 14:00:44--  https://www.dropbox.com/s/mwgqgjbmfw0xa5p/insurance.csv?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.3.18, 2620:100:6018:18::a27d:312
Connecting to www.dropbox.com (www.dropbox.com)|162.125.3.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/mwgqgjbmfw0xa5p/insurance.csv [following]
--2021-12-10 14:00:44--  https://www.dropbox.com/s/raw/mwgqgjbmfw0xa5p/insurance.csv
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc93bc1ee1d087e2cc409ec63073.dl.dropboxusercontent.com/cd/0/inline/Bbm8AVSDkThFAqLzXdhwoy_k4PLuX71b0o-aBv0C4qijDRLBvA5jqUYl6D2P5S881cNChWcpTE8aXyDrqFt5kge1yIdZg2HmZS6f1w9YEb5099eEKUhjnYfuHZTG5ZJXRH1FIiUCx9zn-_xNkoF6_WQE/file# [following]
--2021-12-10 14:00:44--  https://uc93bc1ee1d087e2cc409ec63073.dl.dropboxusercontent.com/cd/0/inline/Bbm8AVSDkThFAqLzXdhwoy_k4PLuX71b0o-aBv0C4qijDRLBvA5jqUYl6D2P5S881cNChWcpTE8aXyDrqF

**Data Pre-processing:** After downloading the regression task dataset, the following pre-processing steps are performed, summarized below:
1. First, all entries of the dataset are viewed to check all the columns.
2. Then, the categorical columns are checked for NaN values using isnull and notnull.
3. The target and feature columns are separated from the dataset.
4. Using the describe function, the summary of a particular feature column is obtained in order to create a new feature column named 'bmi_update'.
5. One column (region) is not taken as a feature column, to find the prediction model performance without that feature column.
6. Both label encoding and one-hot encoding are performed on the categorical feature columns.
7. For training and testing the developed regression model, the data is split into a train set (80%) and a test set (20%).
8. Feature scaling is performed on the train (fit_transform) and test (transform) set data.
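
The missing-value check in step 2 can also be run on every column at once. The following is a minimal sketch on a small, made-up frame (the values in `toy` are hypothetical and only stand in for `insurance.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for insurance.csv (values are made up)
toy = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", None, "male"],
    "region": ["southwest", "southeast", np.nan],
})

# Number of missing entries in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = toy[toy.isnull().any(axis=1)]
print(rows_with_missing)
```

Applied to the real dataset, an all-zero `isnull().sum()` would confirm in one call what the per-column checks below show.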

**Regression Algorithms:** For this task, the following algorithms are utilized:
1. Linear regression algorithm
2. Support Vector regression (kernel=linear)
3. Decision tree algorithm
4. Random forest algorithm (with different hyper-parameter settings)

**Importing the required libraries**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from sklearn.model_selection import train_test_split



**Importing the dataset of the given task**

---



In [None]:
dataset = pd.read_csv('insurance.csv')
dataset   # To show the dataset

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500




**Separating the feature columns and the target column of the dataset**
---



In [None]:
# Creating Feature Columns
features = dataset[['age', 'sex', 'bmi', 'children','smoker']]
# Creating Target Columns
target = dataset[['charges']]

**Showing the first 10 entries of the dataset**

---



In [None]:
dataset.head(10)

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552
5,31,female,25.74,0,no,southeast,3756.6216
6,46,female,33.44,1,no,southeast,8240.5896
7,37,female,27.74,3,no,northwest,7281.5056
8,37,male,29.83,2,no,northeast,6406.4107
9,60,female,25.84,0,no,northwest,28923.13692


**Showing the last 10 entries of the dataset**

---



In [None]:
dataset.tail(10)

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
1328,23,female,24.225,2,no,northeast,22395.74424
1329,52,male,38.6,2,no,southwest,10325.206
1330,57,female,25.74,2,no,southeast,12629.1656
1331,23,female,33.4,0,no,southwest,10795.93733
1332,52,female,44.7,3,no,southwest,11411.685
1333,50,male,30.97,3,no,northwest,10600.5483
1334,18,female,31.92,0,no,northeast,2205.9808
1335,18,female,36.85,0,no,southeast,1629.8335
1336,21,female,25.8,0,no,southwest,2007.945
1337,61,female,29.07,0,yes,northwest,29141.3603


**Checking whether the sex column (categorical data) contains any null value**
---



In [None]:
dataset[dataset.sex.isnull()]

Unnamed: 0,age,sex,bmi,children,smoker,region,charges


**Showing the rows with a non-null value in the sex column**

---



In [None]:
dataset[dataset.sex.notnull()]

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


**Checking whether the region column (categorical data) contains any NaN value**

---



In [None]:
dataset[dataset.region.isnull()]

Unnamed: 0,age,sex,bmi,children,smoker,region,charges


**Summary of feature column bmi**

---



In [None]:
dataset.bmi.describe()

count    1338.000000
mean       30.663397
std         6.098187
min        15.960000
25%        26.296250
50%        30.400000
75%        34.693750
max        53.130000
Name: bmi, dtype: float64

**Summary of feature column age**

---



In [None]:
dataset.age.describe()

count    1338.000000
mean       39.207025
std        14.049960
min        18.000000
25%        27.000000
50%        39.000000
75%        51.000000
max        64.000000
Name: age, dtype: float64

**The value count in region column**

---



In [None]:
dataset.region.value_counts()

southeast    364
northwest    325
southwest    325
northeast    324
Name: region, dtype: int64

Finding the mean of bmi and assigning these values to a new variable

---



In [None]:
dataset1 = pd.read_csv('insurance.csv')
bmi_mean = dataset1.bmi.mean()

**Creating a function named remean_bmi that subtracts the mean from each bmi value, and storing the result in a new column named bmi_update**

---



In [None]:
def remean_bmi(row):
    row = row - bmi_mean
    return row

dataset1['bmi_update']= dataset1.bmi.apply(remean_bmi)
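
The same centering can be written without `apply`, as a vectorized subtraction on the whole column. This is a small sketch on hypothetical values (the frame `toy` is made up) showing that both forms give the same result:

```python
import pandas as pd

# Hypothetical bmi values used only for illustration
toy = pd.DataFrame({"bmi": [27.9, 33.77, 33.0, 22.705]})
bmi_mean = toy["bmi"].mean()

def remean_bmi(value):
    # Subtract the column mean from a single bmi value
    return value - bmi_mean

# Element-by-element version, as in the cell above
via_apply = toy["bmi"].apply(remean_bmi)
# Vectorized version: pandas subtracts the scalar from every entry at once
via_vectorized = toy["bmi"] - bmi_mean

print((via_apply - via_vectorized).abs().max())
```

The vectorized form is usually faster on large columns because it avoids a Python-level function call per row.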

**Showing the dataset1 after the bmi_update column mapping**

---



In [None]:
dataset1 # the new dataset after mapping

Unnamed: 0,age,sex,bmi,children,smoker,region,charges,bmi_update
0,19,female,27.900,0,yes,southwest,16884.92400,-2.763397
1,18,male,33.770,1,no,southeast,1725.55230,3.106603
2,28,male,33.000,3,no,southeast,4449.46200,2.336603
3,33,male,22.705,0,no,northwest,21984.47061,-7.958397
4,32,male,28.880,0,no,northwest,3866.85520,-1.783397
...,...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830,0.306603
1334,18,female,31.920,0,no,northeast,2205.98080,1.256603
1335,18,female,36.850,0,no,southeast,1629.83350,6.186603
1336,21,female,25.800,0,no,southwest,2007.94500,-4.863397


In [None]:
dataset # The original dataset

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


**Label Encoding**


In [None]:
labelencoder_f = LabelEncoder()
# The sex and smoker columns are converted to numeric values
features['sex'] = labelencoder_f.fit_transform(features['sex'])
features['smoker'] = labelencoder_f.fit_transform(features['smoker'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.
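
The warning above is pandas' SettingWithCopyWarning. It is raised when a column is assigned on a DataFrame that was itself obtained by selecting from another DataFrame. One common way to avoid it is to take an explicit `.copy()` of the selection before modifying it. The sketch below uses a hypothetical toy frame, and keeps one encoder per column so that each mapping can be inspected afterwards (the `encoders` dictionary is an illustrative choice, not part of the original notebook):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame standing in for the insurance data
toy = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "sex": ["female", "male", "male", "male"],
    "smoker": ["yes", "no", "no", "no"],
    "charges": [1.0, 2.0, 3.0, 4.0],
})

# .copy() makes features an independent DataFrame, so assigning columns is safe
features = toy[["age", "sex", "smoker"]].copy()

encoders = {}
for column in ["sex", "smoker"]:
    encoder = LabelEncoder()
    features[column] = encoder.fit_transform(features[column])
    encoders[column] = encoder  # kept so the label-to-integer mapping can be looked up

print(features)
# classes_ lists the labels in the order of their integer codes
print(list(encoders["sex"].classes_), list(encoders["smoker"].classes_))
```

LabelEncoder sorts the labels, so here `female` becomes 0, `male` becomes 1, `no` becomes 0 and `yes` becomes 1.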


**Showing the features after label encoding (categorical columns are turned into numerical columns)**

---



In [None]:
features

Unnamed: 0,age,sex,bmi,children,smoker
0,19,female,27.900,0,yes
1,18,male,33.770,1,no
2,28,male,33.000,3,no
3,33,male,22.705,0,no
4,32,male,28.880,0,no
...,...,...,...,...,...
1333,50,male,30.970,3,no
1334,18,female,31.920,0,no
1335,18,female,36.850,0,no
1336,21,female,25.800,0,no


**Using the bmi_update column instead of bmi, the linear regression model is trained and its R² score is shown**

---



In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Creating the feature columns (.copy() avoids pandas' SettingWithCopyWarning on the assignments below)
features = dataset1[['age', 'sex', 'bmi_update', 'children', 'smoker']].copy()
# Creating the target column
target = dataset1[['charges']]

labelencoder_f = LabelEncoder()
# The sex and smoker columns are converted to numeric values
features['sex'] = labelencoder_f.fit_transform(features['sex'])
features['smoker'] = labelencoder_f.fit_transform(features['smoker'])
print(features)
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=0)
# random_state = 0 is selected to get the same split on every run
print(X_train.shape)
print(X_test.shape)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Predicting the test set results using the trained regressor model
y_pred = regressor.predict(X_test)
r2_score(y_test, y_pred)

      age  sex  bmi_update  children  smoker
0      19    0   -2.763397         0       1
1      18    1    3.106603         1       0
2      28    1    2.336603         3       0
3      33    1   -7.958397         0       0
4      32    1   -1.783397         0       0
...   ...  ...         ...       ...     ...
1333   50    1    0.306603         3       0
1334   18    0    1.256603         0       0
1335   18    0    6.186603         0       0
1336   21    0   -4.863397         0       0
1337   61    0   -1.593397         0       1

[1338 rows x 5 columns]
(1070, 5)
(268, 5)




0.7978644236809904
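
Subtracting a constant (here the mean) from one feature is not expected to change the fit of an ordinary least-squares model: the coefficients stay the same and only the intercept shifts. The sketch below illustrates this on synthetic data (the data, seed and coefficients are arbitrary assumptions, not taken from the insurance dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: two features and a noisy linear target
rng = np.random.default_rng(0)
X = rng.normal(loc=30.0, scale=6.0, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=200)

# Center only the first feature, as bmi_update does for bmi
X_centered = X.copy()
X_centered[:, 0] -= X_centered[:, 0].mean()

model_raw = LinearRegression().fit(X, y)
model_centered = LinearRegression().fit(X_centered, y)

r2_raw = r2_score(y, model_raw.predict(X))
r2_centered = r2_score(y, model_centered.predict(X_centered))

print(r2_raw, r2_centered)                              # equal up to rounding
print(model_raw.coef_, model_centered.coef_)            # same slopes
print(model_raw.intercept_, model_centered.intercept_)  # different intercepts
```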

**One-hot encoding**

---



In [None]:
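# Rebuilding the un-encoded feature columns so that the one-hot column names keep the category labels
features = dataset1[['age', 'sex', 'bmi', 'children', 'smoker']].copy()
encoder = OneHotEncoder()
# .toarray() densifies the default sparse output, so no sparse/sparse_output argument
# (whose name depends on the scikit-learn version) is needed
encoded_labels = pd.DataFrame(encoder.fit_transform(features[['smoker', 'sex']]).toarray())
# get_feature_names_out replaces get_feature_names, which newer scikit-learn releases no longer provide
encoded_labels.columns = encoder.get_feature_names_out(['smoker', 'sex'])
dataset = pd.concat([features, encoded_labels], axis=1)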
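
For comparison, pandas can produce the same kind of one-hot columns directly with `pd.get_dummies`. This is a sketch on a hypothetical toy frame, offered as an alternative and not as the method used in this notebook:

```python
import pandas as pd

# Hypothetical toy frame standing in for the feature columns
toy = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
})

# One indicator column per category; the original sex/smoker columns are replaced
encoded = pd.get_dummies(toy, columns=["smoker", "sex"], dtype=float)
print(encoded)

# drop_first=True keeps one column fewer per variable, which removes the
# redundancy between e.g. smoker_no and smoker_yes
encoded_reduced = pd.get_dummies(toy, columns=["smoker", "sex"], dtype=float, drop_first=True)
print(encoded_reduced)
```

Dropping one indicator per variable loses no information, since `smoker_no` is always `1 - smoker_yes`.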



**Showing the dataset after one-hot encoding**

---



In [None]:
dataset

Unnamed: 0,age,sex,bmi,children,smoker,smoker_no,smoker_yes,sex_female,sex_male
0,19,female,27.900,0,yes,0.0,1.0,1.0,0.0
1,18,male,33.770,1,no,1.0,0.0,0.0,1.0
2,28,male,33.000,3,no,1.0,0.0,0.0,1.0
3,33,male,22.705,0,no,1.0,0.0,0.0,1.0
4,32,male,28.880,0,no,1.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,1.0,0.0,0.0,1.0
1334,18,female,31.920,0,no,1.0,0.0,1.0,0.0
1335,18,female,36.850,0,no,1.0,0.0,1.0,0.0
1336,21,female,25.800,0,no,1.0,0.0,1.0,0.0


**Selecting the new features for the regression model (without the region column)**
---



In [None]:
new_features = dataset[['age', 'bmi','children', 'smoker_no','smoker_yes','sex_female','sex_male']]
new_features

Unnamed: 0,age,bmi,children,smoker_no,smoker_yes,sex_female,sex_male
0,19,27.900,0,0.0,1.0,1.0,0.0
1,18,33.770,1,1.0,0.0,0.0,1.0
2,28,33.000,3,1.0,0.0,0.0,1.0
3,33,22.705,0,1.0,0.0,0.0,1.0
4,32,28.880,0,1.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...
1333,50,30.970,3,1.0,0.0,0.0,1.0
1334,18,31.920,0,1.0,0.0,1.0,0.0
1335,18,36.850,0,1.0,0.0,1.0,0.0
1336,21,25.800,0,1.0,0.0,1.0,0.0


**Data splitting for training and testing.**

---



In [None]:
X_train,X_test,y_train,y_test=train_test_split(new_features,target,test_size = 0.2,random_state = 0)
print(X_train.shape)
print(X_test.shape)
X_test
target

(1070, 7)
(268, 7)


Unnamed: 0,charges
0,16884.92400
1,1725.55230
2,4449.46200
3,21984.47061
4,3866.85520
...,...
1333,10600.54830
1334,2205.98080
1335,1629.83350
1336,2007.94500


**Applying feature scaling to the train and test sets to obtain a better prediction model, if possible**

---

In [None]:
from sklearn.preprocessing import StandardScaler
X_sc = StandardScaler()
y_sc = StandardScaler()
X_train_new = X_sc.fit_transform(X_train)
X_test_new = X_sc.transform(X_test)
y_train_new = y_sc.fit_transform(y_train)
y_test_new = y_sc.transform(y_test)
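
The sketch below illustrates what the scalers do, using a few hypothetical rows: the test set is transformed with the mean and standard deviation learned from the train set, and `inverse_transform` maps scaled target values back to the original units. The arrays are made-up stand-ins for the real splits.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for the real train/test splits
X_train = np.array([[19.0, 27.9], [18.0, 33.77], [28.0, 33.0], [33.0, 22.705]])
X_test = np.array([[32.0, 28.88]])
y_train = np.array([[16884.924], [1725.5523], [4449.462], [21984.47061]])

X_sc = StandardScaler()
y_sc = StandardScaler()

# fit_transform learns mean and standard deviation from the train set and scales it
X_train_new = X_sc.fit_transform(X_train)
# transform reuses the train statistics; nothing is learned from the test set
X_test_new = X_sc.transform(X_test)

y_train_new = y_sc.fit_transform(y_train)
# inverse_transform converts scaled values back to the original units
y_back = y_sc.inverse_transform(y_train_new)

print(X_sc.mean_, X_sc.scale_)
print(X_test_new)
```

Predictions made by a model trained on `y_train_new` can be passed through `y_sc.inverse_transform` in the same way to report errors in the original units of `charges`.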

1. Fitting the **linear regression model** for training

---



In [None]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train,y_train)
# Predicting the Test Set Results using the trained regressor model
y_pred = regressor.predict(X_test)

In [None]:
y_test

Unnamed: 0,charges
578,9724.53000
610,8547.69130
569,45702.02235
1034,12950.07120
198,9644.25250
...,...
574,13224.05705
1174,4433.91590
1327,9377.90470
817,3597.59600


Finding the mean absolute error of the prediction

---



In [None]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred)

3939.780806966829

Finding the R² score of the trained model on the test set

---



In [None]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

0.7978644236809904
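
Both metrics can be computed by hand, which makes their meaning explicit: the mean absolute error is the average of |y - ŷ|, and R² is 1 - SS_res / SS_tot. A small sketch with made-up numbers, checked against scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

# Mean absolute error: average size of the prediction errors
mae_manual = np.mean(np.abs(y_true - y_hat))

# R²: one minus residual sum of squares over total sum of squares
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mae_manual, mean_absolute_error(y_true, y_hat))  # both 0.5
print(r2_manual, r2_score(y_true, y_hat))              # both 0.925
```

R² is a score, not an error: values closer to 1 indicate a better fit.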

Applying feature scaling to the train and test sets to obtain a better prediction model, if possible

---



In [None]:
from sklearn.preprocessing import StandardScaler
X_sc = StandardScaler()
y_sc = StandardScaler()
X_train_new = X_sc.fit_transform(X_train)
X_test_new = X_sc.transform(X_test)
y_train_new = y_sc.fit_transform(y_train)
y_test_new = y_sc.transform(y_test)

2. Now, using **support vector regression** on the new train set

---



In [None]:
# Fitting SVR to the task dataset
from sklearn.svm import SVR
regressor = SVR(kernel = 'linear')
regressor.fit(X_train_new,y_train_new)

  y = column_or_1d(y, warn=True)


SVR(kernel='linear')

Predicting on the new (scaled) test set using the trained SVR model

---



In [None]:
y_pred = regressor.predict(X_test_new)
r2_score(y_test_new, y_pred)

0.7652932982951577
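
The manual scaling of features and target can also be bundled with the model itself. The following sketch uses scikit-learn's `make_pipeline` and `TransformedTargetRegressor` on synthetic data from `make_regression` (the data and its parameters are assumptions chosen for illustration); it is one possible arrangement, not the procedure used in this notebook:

```python
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data standing in for the insurance features
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = TransformedTargetRegressor(
    # Features are standardized inside the pipeline before reaching the SVR
    regressor=make_pipeline(StandardScaler(), SVR(kernel="linear")),
    # The target is standardized for training and un-scaled again on predict
    transformer=StandardScaler(),
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)       # predictions in the original target units
score = model.score(X_test, y_test)  # R² on the test set
print(score)
```

With this arrangement the scalers are always fitted on the training data only, and a 1-D target can be passed directly.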

3. Now, using **decision tree regression** on the train set

---



In [None]:
# Fitting the decision tree regression with the task dataset
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(X_train_new,y_train_new)

DecisionTreeRegressor(random_state=0)

Predicting on the test set using the trained decision tree model

---



In [None]:
y_pred = regressor.predict(X_test_new)
r2_score(y_test_new, y_pred)

0.7428540217721777

4. Now, using **random forest regression** on the train set

In [None]:
# Fitting the Random Forest Regression with the task dataset
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 10,random_state = 0) # n_estimators is the number of decision trees
regressor.fit(X_train_new,y_train_new)

  """


RandomForestRegressor(n_estimators=10, random_state=0)

In [None]:
# Fitting the Random Forest Regression with the task dataset
from sklearn.ensemble import RandomForestRegressor
regressor1 = RandomForestRegressor(n_estimators = 8,random_state = 2) # n_estimators is the number of decision trees
regressor1.fit(X_train_new,y_train_new)

  after removing the cwd from sys.path.


RandomForestRegressor(n_estimators=8, random_state=2)

Predicting on the test set using the regressor1 (random forest) model with changed hyper-parameters (n_estimators=8, random_state=2)

---



In [None]:
y_pred = regressor1.predict(X_test_new)
r2_score(y_test_new, y_pred)

0.8714002105702081

Predicting on the test set using the first random forest model (regressor, n_estimators=10, random_state=0)

---



In [None]:
y_pred = regressor.predict(X_test_new)
r2_score(y_test_new, y_pred)

0.8642235685901574
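
The two forests above differ by less than 0.01 in R² on a single test split, so the difference may partly reflect the random seed. A hedged way to compare hyper-parameter settings more robustly is k-fold cross-validation. The sketch below uses synthetic data from `make_regression` (an assumption, as are the candidate values for `n_estimators`):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data used only for illustration
X, y = make_regression(n_samples=300, n_features=5, noise=15.0, random_state=0)

cv_means = {}
for n_trees in (8, 10, 50):
    forest = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    # 5-fold cross-validated R²: each fold serves once as the held-out set
    scores = cross_val_score(forest, X, y, cv=5, scoring="r2")
    cv_means[n_trees] = scores.mean()
    print(n_trees, scores.mean(), scores.std())
```

Comparing the mean score together with its spread across folds gives a better idea of whether one setting is really better than another.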

Using all columns as features except the target column

---



In [None]:
labelencoder_f = LabelEncoder()
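# The sex, smoker and region columns are converted to numeric values
features['sex'] = labelencoder_f.fit_transform(features['sex'])
features['smoker'] = labelencoder_f.fit_transform(features['smoker'])
# Bug fix: region is now encoded from the region column of dataset1. The original line
# encoded features['smoker'] here, which is why region equals smoker in the recorded output below.
features['region'] = labelencoder_f.fit_transform(dataset1['region'])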

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """


In [None]:
features

Unnamed: 0,age,sex,bmi,children,smoker,region
0,19,0,27.900,0,1,1
1,18,1,33.770,1,0,0
2,28,1,33.000,3,0,0
3,33,1,22.705,0,0,0
4,32,1,28.880,0,0,0
...,...,...,...,...,...,...
1333,50,1,30.970,3,0,0
1334,18,0,31.920,0,0,0
1335,18,0,36.850,0,0,0
1336,21,0,25.800,0,0,0


In [None]:
X_train,X_test,y_train,y_test=train_test_split(features,target,test_size = 0.2,random_state = 0)
# random_state = 0 is selected to get the same split on every run
print(X_train.shape)
print(X_test.shape)
X_test
target

(1070, 6)
(268, 6)


Unnamed: 0,charges
0,16884.92400
1,1725.55230
2,4449.46200
3,21984.47061
4,3866.85520
...,...
1333,10600.54830
1334,2205.98080
1335,1629.83350
1336,2007.94500


In [None]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train,y_train)
# Predicting the Test Set Results using the trained regressor model
y_pred = regressor.predict(X_test)
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

0.7978644236809904

In [None]:
# Fitting the Random Forest Regression with the task dataset
from sklearn.ensemble import RandomForestRegressor
regressor1 = RandomForestRegressor(n_estimators = 8,random_state = 2) # n_estimators is the number of decision trees
regressor1.fit(X_train,y_train)

  after removing the cwd from sys.path.


RandomForestRegressor(n_estimators=8, random_state=2)

In [None]:
y_pred = regressor1.predict(X_test)
r2_score(y_test, y_pred)

0.8703298737047637
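
When a column with more than two categories, such as region, is label encoded, the integers follow the alphabetical order of the labels and imply an ordering that has no real meaning. The sketch below shows the mapping on a few hypothetical region values, next to a one-hot alternative built with `pd.get_dummies`:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A few hypothetical region values
region = pd.Series(
    ["southwest", "southeast", "southeast", "northwest", "northeast"], name="region"
)

encoder = LabelEncoder()
codes = encoder.fit_transform(region)
# classes_ is sorted, so the position of a label is its integer code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# One-hot alternative: one indicator column per region, no implied order
region_dummies = pd.get_dummies(region, prefix="region", dtype=float)
print(region_dummies)
```

Tree-based models can often cope with such integer codes, while linear models may be better served by the one-hot form.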

**R² scores of all the regression models together**

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Fitting linear regression to the scaled train set
regressor = LinearRegression()
regressor.fit(X_train_new,y_train_new)
# Predicting the test set results using the trained regressor model
y_pred = regressor.predict(X_test_new)
# mean_absolute_error returns the MAE (here on the standardized target), not the MSE
error=mean_absolute_error(y_test_new, y_pred)
print("For linear regression model(MAE):",error)
score=r2_score(y_test_new, y_pred)
print("For linear regression model(r2 squared):",score)

# Fitting SVR to the task dataset
regressor1 = SVR(kernel = 'linear')
regressor1.fit(X_train_new,y_train_new)
y_pred1 = regressor1.predict(X_test_new)
score1=r2_score(y_test_new, y_pred1)
print("For SVR regression model(r2 squared):",score1)

# Fitting the decision tree regression with the task dataset
regressor2 = DecisionTreeRegressor(random_state = 0)
regressor2.fit(X_train_new,y_train_new)
y_pred2 = regressor2.predict(X_test_new)
score2=r2_score(y_test_new, y_pred2)
print("For decision tree regression model(r2 squared):",score2)

# Fitting the Random Forest Regression with the task dataset
regressor3 = RandomForestRegressor(n_estimators = 10,random_state = 0) # n_estimators is the number of decision trees
regressor3.fit(X_train_new,y_train_new)
y_pred3 = regressor3.predict(X_test_new)
score3=r2_score(y_test_new, y_pred3)
print("For Random forest regression model(r2 squared):",score3)

# Fitting the Random Forest Regression with changed hyper-parameters
regressor4 = RandomForestRegressor(n_estimators = 8,random_state = 2) # n_estimators is the number of decision trees
regressor4.fit(X_train_new,y_train_new)
y_pred4 = regressor4.predict(X_test_new)
score4=r2_score(y_test_new, y_pred4)
print("For random forest tuned regression model(r2 squared):",score4)

For linear regression model(MAE): 0.3290400391484149
For linear regression model(r2 squared): 0.7978644236809906
For SVR regression model(r2 squared): 0.7652932982951577
For decision tree regression model(r2 squared): 0.7428540217721777
For Random forest regression model(r2 squared): 0.8642235685901574
For random forest tuned regression model(r2 squared): 0.8714002105702081


  y = column_or_1d(y, warn=True)


**Results Analysis:**
1. Several prediction models are built from the given task data using different regression algorithms.
2. From the above code, it is observed that the highest R² score (which indicates how well the regression model fits the data) is 0.8714 (87.14%), obtained by the random forest (RF) regression model. The parameters of RF are set as n_estimators=8 and random_state=2. The features (new_features) are selected after one-hot encoding, and feature scaling is applied before training.
3. The prediction model is trained using different selections of features, which are given below:


*   First, without the region column, the other 5 columns (age, sex, bmi, children, smoker) are selected as features. Then, both encoding techniques (label and one-hot) are utilized.

*   Then, the bmi_update column is created with the apply function, which computes (bmi - bmi_mean), and it is used in place of bmi. The R² score of the linear regression model is found to be 79.79%, the same as with the original bmi column, because subtracting a constant from a feature only shifts the intercept of a linear model.
*   Lastly, all the columns except the target column are meant to be used as features. The R² score is found to be 79.79% (linear regression model) and 87.03% (random forest regression model). Note that in the recorded run the region column was filled with the encoded smoker values, so it carried no additional information and the linear regression score is unchanged.

4. Thus, it can be concluded that among the 4 regression algorithms (linear regression, support vector regression with a linear kernel, decision tree and random forest), the random forest regressor (n_estimators=8, random_state=2) obtained the highest performance (87.14%), using the features selected after one-hot encoding without the region column (new_features) and with feature standardization. A likely reason is that a random forest is an ensemble algorithm: each tree is trained with added randomness (on a resampled version of the training data and, where configured, on random subsets of the features), and the regressor averages the predictions of all trees, which tends to reduce the variance of a single decision tree. The margin over the other random forest setting (86.42%) is small and comes from a single train/test split.
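
The averaging step of the random forest can be checked directly: the prediction of a `RandomForestRegressor` is the mean of the predictions of its individual trees, which are available in `estimators_`. A minimal sketch on synthetic data (the data from `make_regression` is an assumption for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data used only for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=8, random_state=2).fit(X, y)

# Predictions of each of the 8 trees for the first 5 samples
per_tree = np.array([tree.predict(X[:5]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print(per_tree.shape)           # (8, 5): one row per tree
print(manual_average)
print(forest.predict(X[:5]))    # matches the manual average
```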




**Discussion:**
1.   In data pre-processing, first of all to check if any missing value is present in the categorical column, isnull and notnull is utilized.

2. To get the summary of a particular column, describe function is utilized.

3. To create the new feature column bmi_update, apply function is used.

4. After one-hot encoding, the new features columns are concatened to the existing columns of the dataset.

5. In train-test splitting of data, random state is selected as 0 to get the same splitting of data to ensure same result after each training.

6. In feature scaling (standardization), the training set is scaled using the fit_transform and the testing set is scaled using only transform.This is due to the distribution change by the different value of mean and standard deviation of the test set.
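
The effect of random_state described in point 5 can be verified with a small sketch: two calls with the same seed return identical splits. The arrays below are hypothetical placeholders for the features and target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical placeholder data: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Same seed twice: the shuffling, and therefore the split, is repeated exactly
split_a = train_test_split(X, y, test_size=0.2, random_state=0)
split_b = train_test_split(X, y, test_size=0.2, random_state=0)

# train_test_split returns X_train, X_test, y_train, y_test in that order
same_test_rows = np.array_equal(split_a[3], split_b[3])
print(same_test_rows)    # True
print(len(split_a[1]))   # 2 test samples, i.e. 20% of 10
```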

