
# ***Insurance Premium Prediction using Linear Regression***

In [None]:
import pandas as pd
import numpy as np

In [None]:
ins = pd.read_csv(r"https://raw.githubusercontent.com/ybifoundation/Dataset/112b24fdc5b7cb055a1decf860c896797cfe2b86/Insurance%20Premium.csv")

## ***PreProcessing Data***

In [None]:
ins.head()

Unnamed: 0,ID,Age,Gender,BMI,Children,Smoker,Region,Premium
0,1,19,female,27.9,0,yes,south,16885
1,2,18,male,33.77,1,no,east,1726
2,3,28,male,33.0,3,no,east,4449
3,4,33,male,22.705,0,no,west,21984
4,5,32,male,28.88,0,no,west,3867


In [None]:
ins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   ID        1338 non-null   int64  
 1   Age       1338 non-null   int64  
 2   Gender    1338 non-null   object 
 3   BMI       1338 non-null   float64
 4   Children  1338 non-null   int64  
 5   Smoker    1338 non-null   object 
 6   Region    1338 non-null   object 
 7   Premium   1338 non-null   int64  
dtypes: float64(1), int64(4), object(3)
memory usage: 83.8+ KB


For every dataset, check the following before modelling:

1.   Missing values
2.   Categorical data
3.   Scale effects
4.   Outliers
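
The four checks above can be sketched in a few lines of pandas. This is a minimal illustration, not the notebook's own procedure: the frame `df` is a hypothetical stand-in built from the first five rows printed by `ins.head()`, and the 1.5 × IQR rule used for outliers is one common convention that is assumed here.

```python
import pandas as pd

# Hypothetical stand-in for `ins`: the first five rows printed by ins.head().
df = pd.DataFrame({
    "Age": [19, 18, 28, 33, 32],
    "Gender": ["female", "male", "male", "male", "male"],
    "BMI": [27.9, 33.77, 33.0, 22.705, 28.88],
    "Premium": [16885, 1726, 4449, 21984, 3867],
})

# 1. Missing values: number of nulls in each column.
missing = df.isna().sum()

# 2. Categorical data: non-numeric columns need encoding before modelling.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# 3. Scale effects: compare the ranges of the numeric columns.
numeric = df.select_dtypes(include="number")
ranges = numeric.max() - numeric.min()

# 4. Outliers: flag values outside 1.5 * IQR of the quartiles (assumed rule).
q1 = numeric.quantile(0.25)
q3 = numeric.quantile(0.75)
iqr = q3 - q1
outlier_mask = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
outlier_counts = outlier_mask.sum()

print(missing, categorical_cols, ranges, outlier_counts, sep="\n\n")
```

On the full dataset the same calls would be run on `ins` instead of `df`.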

In [None]:
ins.describe()

Unnamed: 0,ID,Age,BMI,Children,Premium
count,1338.0,1338.0,1338.0,1338.0,1338.0
mean,669.5,39.207025,30.663397,1.094918,13270.414798
std,386.391641,14.04996,6.098187,1.205493,12110.012882
min,1.0,18.0,15.96,0.0,1122.0
25%,335.25,27.0,26.29625,0.0,4740.0
50%,669.5,39.0,30.4,1.0,9382.0
75%,1003.75,51.0,34.69375,2.0,16640.0
max,1338.0,64.0,53.13,5.0,63770.0


In [None]:
ins.describe(include = 'all')
# By default, describe() summarizes only numeric columns; include='all' adds the categorical ones
# The 50% row (second quartile) is the median

Unnamed: 0,ID,Age,Gender,BMI,Children,Smoker,Region,Premium
count,1338.0,1338.0,1338,1338.0,1338.0,1338,1338,1338.0
unique,,,2,,,2,4,
top,,,male,,,no,east,
freq,,,676,,,1064,364,
mean,669.5,39.207025,,30.663397,1.094918,,,13270.414798
std,386.391641,14.04996,,6.098187,1.205493,,,12110.012882
min,1.0,18.0,,15.96,0.0,,,1122.0
25%,335.25,27.0,,26.29625,0.0,,,4740.0
50%,669.5,39.0,,30.4,1.0,,,9382.0
75%,1003.75,51.0,,34.69375,2.0,,,16640.0
max,1338.0,64.0,,53.13,5.0,,,63770.0


In [None]:
ins.columns

Index(['ID', 'Age', 'Gender', 'BMI', 'Children', 'Smoker', 'Region',
       'Premium'],
      dtype='object')

In [None]:
ins[['Gender']].value_counts()

Gender
male      676
female    662
dtype: int64

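# *Types of Data:*


*   Nominal: labels with no inherent order, e.g., name, ID. Arithmetic on nominal data is meaningless; only counting and equality checks make sense.

*   Ordinal: categories with a meaningful order, e.g., good < better < best or cold < mild < hot. The size of the gap between levels is not defined.

*   Interval: e.g., temperature in Celsius or Fahrenheit. There is no absolute zero, so differences are meaningful but ratios are not.

*   Ratio: e.g., temperature in Kelvin. There is an absolute zero, so all arithmetic operations, including ratios, are meaningful.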
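
A short worked example of the interval versus ratio distinction, using temperatures chosen purely for illustration. The helper `celsius_to_kelvin` is a hypothetical name introduced for this sketch.

```python
def celsius_to_kelvin(c):
    """Convert a Celsius temperature to Kelvin."""
    return c + 273.15

t1_c, t2_c = 10.0, 20.0  # illustrative temperatures in Celsius

# Differences agree on both scales, so subtraction is meaningful for interval data.
diff_c = t2_c - t1_c
diff_k = celsius_to_kelvin(t2_c) - celsius_to_kelvin(t1_c)

# Ratios disagree: only the Kelvin ratio (ratio scale) has physical meaning.
ratio_c = t2_c / t1_c
ratio_k = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)

print(diff_c, diff_k)
print(ratio_c, ratio_k)
```

The Celsius ratio suggests that 20 °C is "twice" 10 °C, while the Kelvin ratio shows that the temperature rose by only about 3.5%.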





# *Three types of Encoding:*

If the target variable is categorical, use **LabelEncoder**.

If an independent variable is categorical, use **OrdinalEncoder** when its categories are ordered, and dummy variables (one-hot encoding) when they are not.
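
As a rough sketch of the first two encoders, the snippet below uses scikit-learn's `LabelEncoder` and `OrdinalEncoder` on made-up values. The lists `target` and `quality` are hypothetical and do not come from the insurance dataset.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical categorical target.
target = ["low", "high", "medium", "low"]
le = LabelEncoder()
# LabelEncoder assigns codes in sorted order of the class names:
# high -> 0, low -> 1, medium -> 2.
y_enc = le.fit_transform(target)

# Hypothetical ordered feature; the order is passed explicitly
# so that good < better < best maps to 0 < 1 < 2.
quality = [["better"], ["good"], ["best"]]
oe = OrdinalEncoder(categories=[["good", "better", "best"]])
quality_enc = oe.fit_transform(quality)

print(list(le.classes_), list(y_enc))
print(quality_enc.ravel().tolist())
```

Passing `categories` explicitly matters for ordinal features, because the default ordering is alphabetical and would not respect good < better < best.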





In [None]:
# Encode Gender as male = 0 and female = 1 (an arbitrary but consistent convention)
ins.replace({'Gender' : {'male' : 0, 'female' : 1}},inplace = True)

In [None]:
ins[['Gender']].value_counts()

Gender
0         676
1         662
dtype: int64

In [None]:
ins[['Smoker']].value_counts()

Smoker
no        1064
yes        274
dtype: int64

In [None]:
ins.replace({'Smoker' : {'no' : 0,'yes' : 1}}, inplace = True)

In [None]:
ins[['Smoker']].value_counts()

Smoker
0         1064
1          274
dtype: int64

In [None]:
ins[['Region']].value_counts()

Region
east      364
south     325
west      325
north     324
dtype: int64

In [None]:
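# Note: Region is nominal, so these integer codes impose an arbitrary order that the model treats as numeric
ins.replace({'Region':{'east' : 0,'south' : 1, 'west' : 2,'north' : 3}},inplace = True)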

In [None]:
ins[['Region']].value_counts()

Region
0         364
1         325
2         325
3         324
dtype: int64
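
Because the regions have no natural order, dummy variables are a possible alternative to the integer codes used here. The sketch below is not what this notebook does: it applies `pd.get_dummies` to a hypothetical frame `regions` holding the five `Region` values from the first `ins.head()` output.

```python
import pandas as pd

# Hypothetical frame with the Region values of the first five rows.
regions = pd.DataFrame({"Region": ["south", "east", "east", "west", "west"]})

# One 0/1 column per region; drop_first removes one level (here "east")
# so the remaining columns are not perfectly collinear with the intercept.
dummies = pd.get_dummies(regions, columns=["Region"], drop_first=True, dtype=int)

print(dummies)
```

With this encoding each region gets its own coefficient, measured relative to the dropped level, instead of a single slope across the codes 0 to 3.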

In [None]:
ins.head()

Unnamed: 0,ID,Age,Gender,BMI,Children,Smoker,Region,Premium
0,1,19,1,27.9,0,1,1,16885
1,2,18,0,33.77,1,0,0,1726
2,3,28,0,33.0,3,0,0,4449
3,4,33,0,22.705,0,0,2,21984
4,5,32,0,28.88,0,0,2,3867


In [None]:
ins.sample()
# Returns one randomly selected row

Unnamed: 0,ID,Age,Gender,BMI,Children,Smoker,Region,Premium
228,229,41,1,31.635,1,0,3,7358


## ***Training Data***

In [25]:
# Target variable
y = ins['Premium']

In [26]:
# Features: drop the row identifier and the target
X = ins.drop(['ID', 'Premium'], axis=1)

In [27]:
X

Unnamed: 0,Age,Gender,BMI,Children,Smoker,Region
0,19,1,27.900,0,1,1
1,18,0,33.770,1,0,0
2,28,0,33.000,3,0,0
3,33,0,22.705,0,0,2
4,32,0,28.880,0,0,2
...,...,...,...,...,...,...
1333,50,0,30.970,3,0,2
1334,18,1,31.920,0,0,3
1335,18,1,36.850,0,0,0
1336,21,1,25.800,0,0,1


In [28]:
from sklearn.model_selection import train_test_split

In [29]:
# Hold out 30% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2529)

In [30]:
from sklearn.linear_model import LinearRegression

In [31]:
lr = LinearRegression()

In [32]:
lr.fit(X_train,y_train)

LinearRegression()

In [33]:
X.columns

Index(['Age', 'Gender', 'BMI', 'Children', 'Smoker', 'Region'], dtype='object')

In [34]:
lr.coef_

array([  261.3185468 ,  -135.47210091,   346.38152286,   422.00094529,
       24142.12837949,   471.32252643])

In [35]:
lr.intercept_

-13764.125066748444
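
To make the fitted numbers easier to read, the sketch below pairs each coefficient with its column name and reproduces one prediction by hand as intercept plus the weighted sum of the features. The coefficients and intercept are copied from the outputs above; the variable names (`coef_by_feature`, `x0`, `manual_prediction`) are introduced only for this illustration.

```python
import numpy as np

# Values copied from the lr.coef_ and lr.intercept_ outputs above.
features = ["Age", "Gender", "BMI", "Children", "Smoker", "Region"]
coef = np.array([261.3185468, -135.47210091, 346.38152286,
                 422.00094529, 24142.12837949, 471.32252643])
intercept = -13764.125066748444

# Pair each coefficient with its feature name.
coef_by_feature = dict(zip(features, coef))

# First row of X: Age 19, Gender 1, BMI 27.9, Children 0, Smoker 1, Region 1.
x0 = np.array([19, 1, 27.9, 0, 1, 1])

# Linear regression prediction: intercept + sum(coefficient * feature value).
manual_prediction = intercept + np.dot(coef, x0)

print(coef_by_feature)
print(round(float(manual_prediction), 2))
```

Up to rounding of the printed coefficients, this is the value `lr.predict` would return for that row. The `Smoker` coefficient is by far the largest: holding the other features fixed, the model adds roughly 24,142 to the predicted premium for a smoker.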

In [37]:
y_pred = lr.predict(X_test)

In [38]:
from sklearn.metrics import mean_absolute_error