https://www.kaggle.com/code/gloriousc/insurance-forecast-by-using-linear-regression/data

In [1]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.2.1.tar.gz (281.4 MB)
Collecting py4j==0.10.9.3
  Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.1-py2.py3-none-any.whl size=281853642 sha256=e7f2c07504fba8fcc5d809f6d48a97b76ea070d67e9fbc90767d4859ebd0c1fd
  Stored in directory: /root/.cache/pip/wheels/9f/f5/07/7cd8017084dce4e93e84e92efd1e1d5334db05f2e83bcef74f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.3 pyspark-3.2.1


## Introduction

In this article we will predict the insurance charges that a customer would pay for **health insurance**. The whole machine learning pipeline is driven by PySpark's MLlib library.

We will work with a real-world insurance dataset that I downloaded from **[Kaggle](https://www.kaggle.com/code/gloriousc/insurance-forecast-by-using-linear-regression/data)**; the link is provided for your reference. Without any further wait, let's get started.

## Importing all the libraries from PySpark

Our very first step is to import the PySpark libraries required for the machine learning process, from **data preprocessing** through the **model building phase**.

In [2]:
from pyspark.sql import SparkSession
from pyspark.ml.stat import Correlation
import pyspark.sql.functions as func

## Setting up the Spark Session for PySpark package

In this step we create the Spark session object, the entry point through which we can access all the **functions and libraries** that Spark has to offer for the steps involved in the ML pipeline. The **getOrCreate()** call returns the active session if one already exists and starts a new one otherwise.

In [3]:
spark = SparkSession.builder.appName("Insurance cost prediction").getOrCreate()
spark

## Reading the insurance dataset using Pyspark

In this section we will read the dataset using PySpark's **read.csv()** function. Before moving forward, let's learn more about the dataset.

This dataset contains the key features that can help us estimate the health insurance charges likely to be billed to a policyholder.

**Features are as follows:**

1. **Age**: Age of the customer.
2. **Sex**: Gender of the customer.
3. **BMI**: Body Mass Index of the individual.
4. **Children**: Number of children the customer has.
5. **Smoker**: Whether the individual is a smoker or not.
6. **Region**: The region in which the customer lives.

The **Charges** column is our **target (dependent)** variable.

In [4]:
data = spark.read.csv("insurance.csv", inferSchema = True, header = True)
data.show()

+---+------+------+--------+------+---------+-----------+
|age|   sex|   bmi|children|smoker|   region|    charges|
+---+------+------+--------+------+---------+-----------+
| 19|female|  27.9|       0|   yes|southwest|  16884.924|
| 18|  male| 33.77|       1|    no|southeast|  1725.5523|
| 28|  male|  33.0|       3|    no|southeast|   4449.462|
| 33|  male|22.705|       0|    no|northwest|21984.47061|
| 32|  male| 28.88|       0|    no|northwest|  3866.8552|
| 31|female| 25.74|       0|    no|southeast|  3756.6216|
| 46|female| 33.44|       1|    no|southeast|  8240.5896|
| 37|female| 27.74|       3|    no|northwest|  7281.5056|
| 37|  male| 29.83|       2|    no|northeast|  6406.4107|
| 60|female| 25.84|       0|    no|northwest|28923.13692|
| 25|  male| 26.22|       0|    no|northeast|  2721.3208|
| 62|female| 26.29|       0|   yes|southeast| 27808.7251|
| 23|  male|  34.4|       0|    no|southwest|   1826.843|
| 56|female| 39.82|       0|    no|southeast| 11090.7178|
| 27|  male| 4

**Code breakdown:**

As discussed, we read the insurance dataset, which is in **CSV** format, using the **read.csv** function. We set the **inferSchema** parameter to **True** so that Spark detects the actual data type of each column, and we set **header** to **True** so that the first row is treated as the column names.

Finally, we displayed the data using PySpark's **show** method.
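
To make the effect of **inferSchema** more concrete, here is a minimal pure-Python sketch of the idea: try the narrowest numeric type first and fall back to string. It is only an illustration and not Spark's actual inference code; the helper names `infer_type` and `infer_schema` are hypothetical, and the sample rows are copied from the output above.

```python
import csv
import io

# A few rows copied from the data.show() output above.
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33.0,3,no,southeast,4449.462
"""


def infer_type(values):
    """Return 'integer', 'double' or 'string' for a column of raw strings."""
    for type_name, caster in (("integer", int), ("double", float)):
        try:
            for value in values:
                caster(value)  # raises ValueError if the value does not parse
            return type_name
        except ValueError:
            continue
    return "string"


def infer_schema(csv_text):
    """Map each column name to its inferred type; the first row is the header."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {name: infer_type([row[name] for row in rows]) for name in rows[0]}


schema = infer_schema(SAMPLE_CSV)
for column, type_name in schema.items():
    print(f"{column}: {type_name}")
```

Without type inference every column would be read as a string, which is why the numeric columns could not be used directly for statistics or modelling.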

Now it's time to **statistically analyze our dataset**. We will start by extracting the total number of records present in it.

In [5]:
data.count()

1338

**Inference:** From the above output we can state that there are a total of **1338** records in the dataset, which we obtained with the help of the **count** function.

In [6]:
print("Total number of columns are: {}".format(len(data.columns)))
print("And those columns are: {}".format(data.columns))

Total number of columns are: 7
And those columns are: ['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges']


**Inference:** In the above code we gathered information about the columns available for the analysis, and we learned that the dataset has a total of **7** columns as well as their **names**.

Now let's look at the **schema**, i.e. the structure of the dataset, which tells us the **type of each column**.

In [7]:
data.printSchema()

root
 |-- age: integer (nullable = true)
 |-- sex: string (nullable = true)
 |-- bmi: double (nullable = true)
 |-- children: integer (nullable = true)
 |-- smoker: string (nullable = true)
 |-- region: string (nullable = true)
 |-- charges: double (nullable = true)



**Inference:** In the above output, next to each feature we can see its **data type** and also a flag that states whether the column can hold **null values or not**.

**Note:** If you look closely at the result, you can see that the **sex, smoker and region features are in string format**, but they are actually **categorical variables**, so we will convert them in the coming discussion.

Now it's time to look at the summary statistics of the dataset so that we can get information about the **count, mean, standard deviation, minimum and maximum values** of the corresponding features.

In [8]:
data.describe().show()

+-------+------------------+------+------------------+-----------------+------+---------+------------------+
|summary|               age|   sex|               bmi|         children|smoker|   region|           charges|
+-------+------------------+------+------------------+-----------------+------+---------+------------------+
|  count|              1338|  1338|              1338|             1338|  1338|     1338|              1338|
|   mean| 39.20702541106129|  null|30.663396860986538|  1.0949177877429|  null|     null|13270.422265141257|
| stddev|14.049960379216147|  null| 6.098186911679012|1.205492739781914|  null|     null|12110.011236693992|
|    min|                18|female|             15.96|                0|    no|northeast|         1121.8739|
|    max|                64|  male|             53.13|                5|   yes|southwest|       63770.42801|
+-------+------------------+------+------------------+-----------------+------+---------+------------------+



**Inference:** From the above output we can draw multiple inferences. Each feature's **count is the same, i.e. 1338**, which means there are **no null values**. The **maximum age** in the data is **64** while the **minimum** is **18**; similarly, the **maximum** number of **children** is **5** and the **minimum** is **0**, i.e. no child. So we can note that the describe function gives an ample amount of information.

There is one more way to see the records, and it will be familiar to anyone who has previously worked with pandas data processing: the **head** function.
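
As a rough illustration of what these five summary rows mean, the following pure-Python sketch computes the same statistics for a single column. It uses only the ages of the first five records shown earlier, so the numbers will not match the full-dataset output, and the `describe` helper here is a hypothetical stand-in rather than Spark's implementation. To my understanding, the **stddev** row is the sample standard deviation (dividing by n - 1), which is what `statistics.stdev` computes.

```python
import statistics

# Ages from the first five rows of the dataset shown earlier.
ages = [19, 18, 28, 33, 32]


def describe(values):
    """Return the five summary statistics reported for a numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "max": max(values),
    }


summary = describe(ages)
for name, value in summary.items():
    print(f"{name}: {value}")
```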

In [9]:
data.head(5)

[Row(age=19, sex='female', bmi=27.9, children=0, smoker='yes', region='southwest', charges=16884.924),
 Row(age=18, sex='male', bmi=33.77, children=1, smoker='no', region='southeast', charges=1725.5523),
 Row(age=28, sex='male', bmi=33.0, children=3, smoker='no', region='southeast', charges=4449.462),
 Row(age=33, sex='male', bmi=22.705, children=0, smoker='no', region='northwest', charges=21984.47061),
 Row(age=32, sex='male', bmi=28.88, children=0, smoker='no', region='northwest', charges=3866.8552)]

**Inference:** As one can notice, **head** returns a list of **Row** objects, so when we want to analyze the data record by record, i.e. **each tuple**, grabbing rows with the **head** function can be a convenient approach.

## Correlation in variables

Correlation measures the **relationship between two variables** and tells us how strongly they are **positively** or **negatively** related. Knowing which features move together with the target helps us choose inputs that lead to **more accurate predictions**.

In this particular problem statement we will find the correlation between the numeric **independent variables** (continuous/integer ones only) and the **dependent variable**, charges.

In [10]:
age = data.corr("age", "charges")
BMI = data.corr("bmi", "charges")

print("Correlation between Age of the person and charges is : {}".format(age))
print("Correlation between BMI of the person and charges is : {}".format(BMI))

Correlation between Age of the person and charges is : 0.299008193330648
Correlation between BMI of the person and charges is : 0.19834096883362903


**Inference:** The correlation between age and insurance charges is about **0.30**, while the correlation between BMI and insurance charges is about **0.20**. Both are weak positive relationships.

**Note:** We have only used two variables. Logically we have more options, but they are in **string format, so we can't include them in this analysis for now**. You can repeat the same process for them after the conversion step.
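
For readers who want to see what sits behind a correlation value, here is a minimal pure-Python sketch of the Pearson correlation coefficient, which is the measure `DataFrame.corr` computes by default. The `pearson` helper is hypothetical, and the sample uses only the first five records shown earlier, so its result will differ from the full-dataset values above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Age and charges from the first five rows of the dataset.
ages = [19, 18, 28, 33, 32]
charges = [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552]

print(pearson(ages, charges))
```

The result always lies between -1 and 1: values near 1 mean the two columns rise together, values near -1 mean one falls as the other rises, and values near 0 mean there is little linear relationship.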

## String Indexer

In this section of the article we will convert the **string type features to numeric categorical variables** so that our machine learning model can understand them; as we know, ML models only work when all the independent variables are in numerical format.

This is a very important step, and the **StringIndexer** class comes to the rescue.

**Note:** First we will convert all the required fields to categorical variables and then look at the line-by-line explanation.

1. Converting the string type "**sex**" column to a categorical variable using **StringIndexer**.

In [11]:
from pyspark.ml.feature import StringIndexer

category = StringIndexer(inputCol = "sex", outputCol = "gender_categorical")
categorised = category.fit(data).transform(data)

In [12]:
categorised.show()

+---+------+------+--------+------+---------+-----------+------------------+
|age|   sex|   bmi|children|smoker|   region|    charges|gender_categorical|
+---+------+------+--------+------+---------+-----------+------------------+
| 19|female|  27.9|       0|   yes|southwest|  16884.924|               1.0|
| 18|  male| 33.77|       1|    no|southeast|  1725.5523|               0.0|
| 28|  male|  33.0|       3|    no|southeast|   4449.462|               0.0|
| 33|  male|22.705|       0|    no|northwest|21984.47061|               0.0|
| 32|  male| 28.88|       0|    no|northwest|  3866.8552|               0.0|
| 31|female| 25.74|       0|    no|southeast|  3756.6216|               1.0|
| 46|female| 33.44|       1|    no|southeast|  8240.5896|               1.0|
| 37|female| 27.74|       3|    no|northwest|  7281.5056|               1.0|
| 37|  male| 29.83|       2|    no|northeast|  6406.4107|               0.0|
| 60|female| 25.84|       0|    no|northwest|28923.13692|               1.0|

2. Converting the string type "**smoker**" column to a categorical variable using **StringIndexer**.

In [13]:
category = StringIndexer(inputCol = "smoker", outputCol = "smoker_categorical")
# Note: we fit and transform "categorised" instead of "data" because
# the updated data lives in the new DataFrame, i.e. "categorised".
categorised = category.fit(categorised).transform(categorised)

In [14]:
categorised.show()

+---+------+------+--------+------+---------+-----------+------------------+------------------+
|age|   sex|   bmi|children|smoker|   region|    charges|gender_categorical|smoker_categorical|
+---+------+------+--------+------+---------+-----------+------------------+------------------+
| 19|female|  27.9|       0|   yes|southwest|  16884.924|               1.0|               1.0|
| 18|  male| 33.77|       1|    no|southeast|  1725.5523|               0.0|               0.0|
| 28|  male|  33.0|       3|    no|southeast|   4449.462|               0.0|               0.0|
| 33|  male|22.705|       0|    no|northwest|21984.47061|               0.0|               0.0|
| 32|  male| 28.88|       0|    no|northwest|  3866.8552|               0.0|               0.0|
| 31|female| 25.74|       0|    no|southeast|  3756.6216|               1.0|               0.0|
| 46|female| 33.44|       1|    no|southeast|  8240.5896|               1.0|               0.0|
| 37|female| 27.74|       3|    no|north

3. Converting the string type "**region**" column to a categorical variable using **StringIndexer**.


In [15]:
category = StringIndexer(inputCol = "region", outputCol = "region_categorical")
categorised = category.fit(categorised).transform(categorised)

In [16]:
categorised.show()

+---+------+------+--------+------+---------+-----------+------------------+------------------+------------------+
|age|   sex|   bmi|children|smoker|   region|    charges|gender_categorical|smoker_categorical|region_categorical|
+---+------+------+--------+------+---------+-----------+------------------+------------------+------------------+
| 19|female|  27.9|       0|   yes|southwest|  16884.924|               1.0|               1.0|               2.0|
| 18|  male| 33.77|       1|    no|southeast|  1725.5523|               0.0|               0.0|               0.0|
| 28|  male|  33.0|       3|    no|southeast|   4449.462|               0.0|               0.0|               0.0|
| 33|  male|22.705|       0|    no|northwest|21984.47061|               0.0|               0.0|               1.0|
| 32|  male| 28.88|       0|    no|northwest|  3866.8552|               0.0|               0.0|               1.0|
| 31|female| 25.74|       0|    no|southeast|  3756.6216|               1.0|    

**Code breakdown:**

1. First we imported the **StringIndexer** class from the **ml.feature** package of PySpark.
2. Then, to convert the **sex column** to a categorical feature, we created a StringIndexer object, passing the **original feature** as the **inputCol** parameter and the name of the **converted feature** as the **outputCol** parameter. Calling **fit** learns the label-to-index mapping and **transform** adds the new column.
3. Similarly, we did the same for all the columns that needed to be converted to categorical variables.
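
The outputs above show `male` and `no` mapped to 0.0. That is consistent with StringIndexer's default ordering, which, as far as I know, assigns index 0 to the most frequent label (with ties broken alphabetically in recent Spark versions). The pure-Python sketch below mimics that behaviour under those assumptions; `fit_string_index` is a hypothetical helper, not part of PySpark, and the sample is the smoker column of the first ten records.

```python
from collections import Counter


def fit_string_index(labels):
    """Map each label to an index: most frequent first, ties alphabetical."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(index) for index, label in enumerate(ordered)}


# The smoker column of the first ten records shown earlier.
smoker = ["yes", "no", "no", "no", "no", "no", "no", "no", "no", "no"]

mapping = fit_string_index(smoker)          # the "fit" step
encoded = [mapping[value] for value in smoker]  # the "transform" step

print(mapping)
print(encoded)
```

Keep in mind that these indices are just labels. For a column with more than two categories, such as region, a linear model will treat 0.0 to 3.0 as an ordered numeric scale, which is a modelling assumption worth being aware of.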

## Vector Assembler 

As we already know, PySpark's ML algorithms expect **all the features to be combined into a single column**, where each row holds them together as one **vector**.

In [17]:
from pyspark.ml.linalg import Vector
from pyspark.ml.feature import VectorAssembler

**Inference:** We have imported **Vector** from the **linalg** package of ML and **VectorAssembler** from the **feature** package of ML so that we can pile up all the independent fields. Only **VectorAssembler** is used directly in the steps below.

In [18]:
categorised.columns

['age',
 'sex',
 'bmi',
 'children',
 'smoker',
 'region',
 'charges',
 'gender_categorical',
 'smoker_categorical',
 'region_categorical']

**Inference:** The columns in the above output are all the columns in the DataFrame, but we don't need the string ones; instead we will use their indexed versions, which are numeric.

In [36]:
concatenating  = VectorAssembler(inputCols=["age","bmi", "children", "gender_categorical", "smoker_categorical", "region_categorical"],
                                 outputCol="features")
results = concatenating.transform(categorised)

**Inference:** While combining the features, we pass all the relevant column names as a list to the **inputCols** parameter, and then we **transform** the DataFrame to add the new **features** column.

In [37]:
for_model = results.select("features", "charges")
for_model.show(truncate=False)

+-----------------------------+-----------+
|features                     |charges    |
+-----------------------------+-----------+
|[19.0,27.9,0.0,1.0,1.0,2.0]  |16884.924  |
|[18.0,33.77,1.0,0.0,0.0,0.0] |1725.5523  |
|[28.0,33.0,3.0,0.0,0.0,0.0]  |4449.462   |
|[33.0,22.705,0.0,0.0,0.0,1.0]|21984.47061|
|[32.0,28.88,0.0,0.0,0.0,1.0] |3866.8552  |
|[31.0,25.74,0.0,1.0,0.0,0.0] |3756.6216  |
|[46.0,33.44,1.0,1.0,0.0,0.0] |8240.5896  |
|[37.0,27.74,3.0,1.0,0.0,1.0] |7281.5056  |
|[37.0,29.83,2.0,0.0,0.0,3.0] |6406.4107  |
|[60.0,25.84,0.0,1.0,0.0,1.0] |28923.13692|
|[25.0,26.22,0.0,0.0,0.0,3.0] |2721.3208  |
|[62.0,26.29,0.0,1.0,1.0,0.0] |27808.7251 |
|[23.0,34.4,0.0,0.0,0.0,2.0]  |1826.843   |
|[56.0,39.82,0.0,1.0,0.0,0.0] |11090.7178 |
|[27.0,42.13,0.0,0.0,1.0,0.0] |39611.7577 |
|[19.0,24.6,1.0,0.0,0.0,2.0]  |1837.237   |
|[52.0,30.78,1.0,1.0,0.0,3.0] |10797.3362 |
|[23.0,23.845,0.0,0.0,0.0,3.0]|2395.17155 |
|[56.0,40.3,0.0,0.0,0.0,2.0]  |10602.385  |
|[30.0,35.3,0.0,0.0,1.0,2.0]  |3

**Inference:** Here we create a new DataFrame containing only the **features** and **charges** columns, which we will pass to our model for training.

**Note:** In the show function we used **truncate=False**, which means the full contents of each features vector are displayed instead of being cut off.
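
Conceptually, the assembling step just places the chosen column values side by side as floats, in the order given. The following pure-Python sketch shows this for the first record; `assemble` is a hypothetical helper used only for illustration, and the row values are copied from the outputs above.

```python
# The same column order that was passed to VectorAssembler.
INPUT_COLS = [
    "age", "bmi", "children",
    "gender_categorical", "smoker_categorical", "region_categorical",
]


def assemble(row, input_cols):
    """Collect the selected columns of one row into a single list of floats."""
    return [float(row[name]) for name in input_cols]


# First record of the dataset, including the indexed columns.
first_row = {
    "age": 19, "sex": "female", "bmi": 27.9, "children": 0,
    "smoker": "yes", "region": "southwest", "charges": 16884.924,
    "gender_categorical": 1.0, "smoker_categorical": 1.0,
    "region_categorical": 2.0,
}

features = assemble(first_row, INPUT_COLS)
print(features)
```

The order of the input columns matters: position 0 of every vector is age, position 1 is BMI, and so on, which is how the model later ties each coefficient to a feature.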

## Train Test Split

So far we have done each step required before the **model building** phase, so now it's time to split the dataset into **training** and **testing** sets: one will be used for training the model and the other for testing it.

In [25]:
train_data, test_data = for_model.randomSplit([0.7,0.3])

**Inference:** In the above **randomSplit()** call, the weights assign about **70% (0.7)** of the rows to the training data and about **30% (0.3)** to the testing data.

In [26]:
train_data.describe().show()

+-------+------------------+
|summary|           charges|
+-------+------------------+
|  count|               951|
|   mean|13568.140120188222|
| stddev|12425.885456579752|
|    min|         1121.8739|
|    max|       63770.42801|
+-------+------------------+



In [27]:
test_data.describe().show()

+-------+------------------+
|summary|           charges|
+-------+------------------+
|  count|               387|
|   mean|12538.821024444438|
| stddev|11278.423261784697|
|    min|         1149.3959|
|    max|       48885.13561|
+-------+------------------+



**Inference:** The describe function is applied to both splits, and it shows information such as the total count: the training set has **951** records while the test set has **387**.
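
Notice that 951 and 387 are not exactly 70% and 30% of 1338 (those would be about 937 and 401). The weights act as probabilities rather than exact quotas: as I understand it, each row is assigned to a split at random, so the sizes vary from run to run. The pure-Python sketch below illustrates the idea with a hypothetical `random_split` helper; it is not Spark's implementation. In PySpark itself, `randomSplit` also accepts a `seed` argument if you want a reproducible split.

```python
import random


def random_split(rows, train_weight, seed):
    """Send each row to train with probability train_weight, else to test."""
    rng = random.Random(seed)  # seeded generator makes the split repeatable
    train, test = [], []
    for row in rows:
        if rng.random() < train_weight:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(1338))  # stand-ins for the 1338 records
train, test = random_split(rows, 0.7, seed=42)

print(len(train), len(test))
```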

## Model Building

In this phase of the article we will build our model using the **Linear Regression** algorithm. Since **the target, charges, is a continuous value**, this is a regression problem, and linear regression is a natural go-to choice as a first model.

In [28]:
from pyspark.ml.regression import LinearRegression

In [29]:
lr_model = LinearRegression(featuresCol= "features",
                            labelCol="charges")
lr_model

LinearRegression_d7bb227324ac

First we create a **LinearRegression** object, passing in the **features** column that we have already assembled and, as the **label** column, our target variable "**charges**".

In [30]:
training_model = lr_model.fit(train_data)
training_model

LinearRegressionModel: uid=LinearRegression_d7bb227324ac, numFeatures=6

Then the LR object we created is **fit on the training data** that we got from the **randomSplit()** method. In the output one can see that it returned basic information about the model, i.e. the **number of features it holds: 6.**
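
It may help to see what a fitted linear regression model does with those six features. A prediction is the intercept plus a weighted sum of the feature values; in PySpark the learned weights are available as `training_model.coefficients` and `training_model.intercept`. The sketch below uses a hypothetical `predict` helper, and the coefficient and intercept values are placeholders invented for illustration, not the values learned by the model above.

```python
def predict(features, coefficients, intercept):
    """Linear regression prediction: intercept + sum of weight * feature."""
    return intercept + sum(w * x for w, x in zip(coefficients, features))


# PLACEHOLDER parameters for illustration only, NOT the fitted model's values.
coefficients = [250.0, 300.0, 500.0, 0.0, 24000.0, -300.0]
intercept = -12000.0

# Feature vector of the first record: age, bmi, children, gender, smoker, region.
features = [19.0, 27.9, 0.0, 1.0, 1.0, 2.0]

print(predict(features, coefficients, intercept))
```

Because the output is an unbounded weighted sum, a linear model can produce values outside the range seen in training, including negative charges for some records.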

## Model Evaluation

Now that we are done with the model development phase, it's time to evaluate the model and find out whether it is a worthy model or not.

In [31]:
output = training_model.evaluate(train_data)

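The **evaluate** function computes the metrics involved in the **model evaluation** phase so that we can judge the **quality of the model's fit**. Note that here it is called on the training data, so the numbers below describe how well the model fits the data it was trained on; passing **test_data** instead would show how it performs on unseen records.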

In [32]:
print(output.r2)
print(output.meanSquaredError)
print(output.meanAbsoluteError)

0.7477904623279588
38900867.48971935
4324.864277057295


Here are the evaluation metrics printed above:

1. **R-squared:** The proportion of the variance in the target that is explained by the model; here it is about 0.75.
2. **Mean Squared Error:** The average of the squared residuals of the regression fit; squaring magnifies the large errors.
3. **Mean Absolute Error:** The average of the absolute residuals; it is expressed in the same units as charges and is less sensitive to large errors than MSE.
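
To show how these three numbers are defined, here is a small pure-Python sketch that computes them from actual and predicted values. The `regression_metrics` helper and the toy numbers are made up for illustration. The last line takes the square root of the MSE reported above, which gives an error of roughly 6,237 in the same units as charges.

```python
import math


def regression_metrics(actual, predicted):
    """Compute R-squared, MSE and MAE from paired actual/predicted values."""
    n = len(actual)
    residuals = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(r * r for r in residuals)             # unexplained variation
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total variation
    return {
        "r2": 1 - ss_res / ss_tot,
        "mse": ss_res / n,
        "mae": sum(abs(r) for r in residuals) / n,
    }


# Toy values, chosen only to keep the arithmetic easy to follow.
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

metrics = regression_metrics(actual, predicted)
print(metrics)

# Root mean squared error derived from the MSE printed by the model above.
rmse = math.sqrt(38900867.48971935)
print(rmse)
```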

**It's time to make predictions!!**

In [33]:
features_unlabelled_data = test_data.select("features")

In [34]:
final_pred = training_model.transform(features_unlabelled_data)

In [35]:
final_pred.show()

+--------------------+------------------+
|            features|        prediction|
+--------------------+------------------+
|(6,[0,1],[18.0,43...| 6310.418188362031|
|(6,[0,1],[23.0,26...| 1989.316776915317|
|(6,[0,1],[23.0,41...|7156.7732642268475|
|(6,[0,1],[27.0,23...|1817.4615575488478|
|(6,[0,1],[28.0,38...| 7080.368253071751|
|(6,[0,1],[33.0,30...| 5675.188716607967|
|(6,[0,1],[36.0,29...| 6219.912165607651|
|(6,[0,1],[40.0,41...|11215.513433552713|
|(6,[0,1],[41.0,33...|  8727.23539810832|
|(6,[0,1],[41.0,40...|10978.770010436912|
|(6,[0,1],[49.0,35...|  11447.0884482815|
|(6,[0,1],[49.0,36...|11779.282079608669|
|(6,[0,1],[50.0,25...|8146.7815362595775|
|(6,[0,1],[52.0,34...|11585.797458992425|
|(6,[0,1],[53.0,31...|10906.129194107065|
|(6,[0,1],[56.0,34...|12668.895957973038|
|(6,[0,1],[62.0,39...|15972.967064820206|
|(6,[0,1],[63.0,41...|16732.804535685922|
|(6,[0,1],[64.0,40...|16643.702726493306|
|[18.0,20.79,0.0,1...|-689.4691767537097|
+--------------------+------------

**Code breakdown:**
This is the final step, where we make predictions on the test set using the model we have built.

1. Created a DataFrame with the unlabelled data, i.e. only the features column of the test set.
2. Then we transformed the unlabelled data with the trained model's **transform** method, which appends a **prediction** column.
3. Finally, we showed the prediction results, which one can see in the above output.
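
Many rows in the features column above are printed as `(6,[0,1],[...])` rather than as a plain list. This is Spark's sparse vector notation: the vector size, then the positions of the non-zero entries, then their values. The sketch below expands that notation into a dense list; `sparse_to_dense` is a hypothetical helper, and the BMI value 30.5 is invented because the real values are truncated in the output.

```python
def sparse_to_dense(size, indices, values):
    """Expand (size, indices, values) into a full list with zeros filled in."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense


# Hypothetical row: age 18.0 at position 0 and a made-up BMI of 30.5 at position 1.
dense = sparse_to_dense(6, [0, 1], [18.0, 30.5])
print(dense)
```

Read this way, such a row describes a customer with no children whose gender, smoker and region indices are all 0.0, which per the indexed outputs earlier corresponds to male, non-smoker and southeast.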


## Conclusion

Finally, we are able to predict the insurance charges with the help of PySpark's MLlib library. We performed every step from the ground level, i.e. from reading the dataset to making predictions with the evaluated model.

Let's recap what we have learnt so far!

1. First of all, we read the insurance dataset, a real-world dataset from Kaggle.
2. Then we performed the data preprocessing steps, where we got to know the dataset's columns and statistics and converted the string type columns to categorical variables.
3. Then came the model building phase, where we built the Linear Regression model using PySpark's MLlib library.
4. Finally, we evaluated the model and made predictions with it.