## **Medical Insurance Cost Prediction + EDA ⚕️**

<img src="https://img.freepik.com/free-vector/killing-destroying-coronavirus-covid-19-concept-background_1017-24424.jpg?t=st=1650470888~exp=1650471488~hmac=7feba0843b06a780a2acdab1268847dde7629aecea74955fd608b43692c52cf7&w=996">

### **Import Packages & Data 🐶**

In [1]:
!pip install plotly --quiet

#### **Packages 🦊**

In [35]:
# Main Library
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

# Machine Learning Library
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Special Library
import missingno as msno

# Seaborn Style
sns.set(color_codes = True)
sns.set_style("white")

#### **Import Dataset 🗺️**

In [3]:
df = pd.read_csv("../input/insurance/insurance.csv")
df.head(5)

### **Data Exploration & Cleaning 🦇**

#### **Data Shape & Structure 💎**

In [4]:
df.shape

#### **Observations** 🦕
* Row count is 1338 ⚔️.
* Column count is 7 🚬.

#### **Data Inspection** 🎭

In [5]:
# Let's inspect the missing values 🐢
data_info= pd.DataFrame()
data_info['Column Names']= df.columns
data_info['Datatype'] = df.dtypes.to_list()
data_info['num_NA']= data_info['Column Names'].apply(lambda x: df[x].isna().sum())
data_info['%_NA']= data_info['Column Names'].apply(lambda x: df[x].isna().mean())
data_info

#### **Observations** 🦕
* Fortunately there are no missing values ⚔️.
* Datatypes are also appropriate 🚬.

In [6]:
# Let's inspect the ratio of male to female data points 💣

def ratioInspect(dl,df):
    
    fig = px.bar(df, x = ["Male","Female"], y = dl, 
                 text_auto = '0.2s',
                 template="plotly_white", 
                title = "Data Distribution on basis of Sex", labels={'x':'Gender', 'y':'Data Points'})
    
    fig.update_layout( width=500,
    height=500,)
    
    return fig.show()

if __name__ == "__main__":
    
    datalist = []
    datalist.append(df[df["sex"] == "male"].shape[0])
    datalist.append(df[df["sex"] == "female"].shape[0])
    
    ratioInspect(datalist,df)

    

#### **Observations** 🦕
* Fortunately, the dataset is fairly balanced between male and female ⚔️.

In [7]:
# Let's inspect how the data points are distributed across regions 💣

def ratioInspect(dl,df):
    
    countData = []
    for i in range(len(dl)):  # iterate over the argument, not the global list
        countData.append(df[df["region"] == dl[i]].shape[0])
        
    
    fig = px.bar(df, x = dl, y = countData, 
                 text_auto = '0.2s',
                 template="plotly_white", 
                title = "Data Distribution on basis of Region", labels={'x':'Region', 'y':'Data Points'})
    
    fig.update_layout( width=500,height=500,)
    
    return fig.show()

if __name__ == "__main__":
    
    datalist = list(df["region"].unique())
    ratioInspect(datalist,df)

#### **Observations** 🦕
* Fortunately, the dataset is fairly balanced across regions ⚔️.
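
The per-category counts behind the two bar charts above can also be read directly with `value_counts`. This is only a minimal sketch: the `sample` frame below is a made-up stand-in for `insurance.csv`, not real data.

```python
import pandas as pd

# Hypothetical sample standing in for the real dataset (illustration only).
sample = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "female", "male"],
    "region": ["southeast", "southwest", "northeast",
               "northwest", "southeast", "southeast"],
})

# Absolute number of rows per category.
sex_counts = sample["sex"].value_counts()

# Share of rows per category (the shares sum to 1.0).
region_share = sample["region"].value_counts(normalize=True)

print(sex_counts)
print(region_share.round(2))
```

On the real data, replacing `sample` with `df` should give the same numbers that the bar charts plot.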

### **EDA - Exploratory Data Analysis 🐙**

#### **Data Correlations 🔮**

In [8]:
# Correlation in Dataset 🐋
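# Restrict to numeric columns: recent pandas versions raise an error on text columns
sns.heatmap(df.select_dtypes("number").corr(), cmap='viridis', linewidths=2)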
plt.savefig('Correlation_Heatmap.png')
plt.show()

#### **Observations** 🦕
* The number of children and charges are positively correlated ⚔️.
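
Since the heatmap has no annotations, the strength of each relationship is hard to read off the colors alone. One way to make it explicit is to print each numeric column's correlation with `charges`. The sketch below uses a hypothetical `toy` frame with invented values, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# Invented values for illustration only; not taken from insurance.csv.
toy = pd.DataFrame({
    "age": [19, 25, 33, 41, 52, 60],
    "bmi": [27.9, 22.5, 30.1, 35.2, 28.4, 31.0],
    "children": [0, 1, 2, 0, 3, 1],
    "charges": [1800.0, 2700.0, 4400.0, 6300.0, 9900.0, 12500.0],
})

# Pearson correlation of every column with the target, strongest first.
corr_with_target = (
    toy.corr()["charges"]
    .drop("charges")            # the target's correlation with itself is always 1
    .sort_values(ascending=False)
)

print(corr_with_target.round(2))
```

Running the same expression on `df.select_dtypes("number")` would rank the real features by how strongly they track `charges`.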

In [9]:
# Summary statistics for the numerical columns 🦏
df.describe()

In [10]:
# Distplot for data distribution 🐹

def dataDist(data):
    
    distList = []
    for i in range(len(data)):
        distList.append(sns.displot(x = df[data[i]], bins = 20, kde = True,color = "blue"))
        sns.despine(offset=5 , trim=True)
    
    return distList

if __name__ == "__main__":
    numData = ["age", "bmi", 'charges']
    r = dataDist(numData)
    
    for i in range(len(numData)):
        r[i]

#### **Observations** 🦕
* The average age is about 39 and the average number of children is about 1 ⚔️.
* Age is roughly uniformly distributed 🛡️.
* BMI is approximately normally distributed, with a few outliers ⛓️.
* Charges have a right-skewed distribution with outliers 🧪.
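
The shape descriptions above are visual judgments. They can be backed up with the sample skewness: values near 0 suggest a symmetric distribution, and clearly positive values suggest a long right tail. A minimal sketch on two invented columns (the `shapes` frame is hypothetical):

```python
import pandas as pd

# Two invented columns: one symmetric, one with a long right tail.
shapes = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
    "right_skewed": [1, 1, 2, 2, 3, 4, 30],
})

# DataFrame.skew() returns the sample skewness of each column.
skewness = shapes.skew()
print(skewness.round(2))
```

Applied to the notebook's data, `df[["age", "bmi", "charges"]].skew()` would give one number per column to compare against the plots.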

#### **Outlier Inspection ⚖️**

In [11]:
fig = make_subplots(rows=1, cols=3)

fig.add_trace(
    go.Box(x=df["age"]),
    row=1, col=1
)

fig.add_trace(
    go.Box(x=df["bmi"]),
    row=1, col=2
)

fig.add_trace(
    go.Box(x=df["charges"]),
    row=1, col=3
)

# Update xaxis properties
fig.update_xaxes(title_text="Age Data", row=1, col=1)
fig.update_xaxes(title_text="BMI Data", row=1, col=2)
fig.update_xaxes(title_text="Charges Data", row=1, col=3)

fig.update_layout(title = "Outlier Inspection", showlegend=False)
fig.show()


#### **Observations** 🦕
* Age data has no outliers ⚔️.
* BMI has a few outliers 🧬.
* Charges have a lot of outliers 🪀.
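
To count the outliers instead of reading them off the box plots, the conventional 1.5 × IQR rule can be applied directly. The helper name `iqr_outliers` and the sample values are assumptions made for this sketch. The quartiles computed here may also differ slightly from the ones Plotly draws.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Invented values with one obvious extreme point.
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 95])
print(iqr_outliers(values))
```

For example, `len(iqr_outliers(df["charges"]))` would give the number of flagged points in the charges column.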

### **Data Preprocessing 🐿️**

#### **Data Nomenclature 🐘**
---------------------------------
##### **Sex Column** 🏈
* male = 0
* female = 1

##### **Smoker Column** 🎿
* yes = 0
* no = 1

##### **Region Column** 🕯️
* southeast = 0
* southwest = 1
* northeast = 2
* northwest = 3

In [14]:
# Encoding "sex" column 🏡
df.replace({'sex': {'male' : 0, 'female' : 1}}, inplace = True)
# Encoding "smoker" column ☘️
df.replace({'smoker': {'yes' : 0, 'no' : 1}}, inplace = True)
# Encoding "region" column ☀️
df.replace({'region': {'southeast' : 0, 'southwest' : 1, 'northeast' : 2, 'northwest' : 3}}, inplace = True)
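
The integer codes for `region` impose an order (southeast < southwest < northeast < northwest), and a linear model treats that order as a numeric magnitude. One common alternative, which this notebook does not use, is one-hot encoding with `pd.get_dummies`. The `regions_demo` frame below is a hypothetical example and leaves `df` untouched.

```python
import pandas as pd

# Hypothetical rows for illustration; the real data lives in df.
regions_demo = pd.DataFrame({
    "age": [19, 33, 45, 52],
    "region": ["southeast", "southwest", "northeast", "northwest"],
})

# One indicator column per region; drop_first removes one category
# (the first in sorted order) to avoid perfectly collinear columns.
encoded = pd.get_dummies(regions_demo, columns=["region"],
                         drop_first=True, dtype=int)

print(encoded.columns.tolist())
```

With this encoding the dropped category (`northeast` here) becomes the baseline, and each remaining column gets its own coefficient.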

### **Recognizing Features & Targets** 🦉

In [21]:
# Splitting features & targets 🌱
# Feature Data 
x = df.iloc[:,0:6]
# Target Data 
y = df.iloc[:,-1] 

### **Data Splitting** 🦅

In [22]:
# Splitting data into train and test datasets 🍂
X_train, X_test, Y_train, Y_test = train_test_split(x,y,test_size=0.2,random_state=2)

### **Model Training & Prediction 🦚**

#### **Model Training** 🚀

In [30]:
# Fit a Linear Regression model on the training data 🛍️
lr = LinearRegression()
# Model Training 📞
lr.fit(X_train, Y_train)

#### **Model Evaluation** 🌄

In [40]:
# Model prediction 💎
pred = lr.predict(X_test)
# R² score to assess how well the model fits the test data 🚧
r2Value = r2_score(Y_test,pred)
print("r2 score value : ",r2Value)

#### **Observations** 🦕
* The model fits the test data reasonably well: the closer `r2_score` is to 1, the more of the variance in charges the model explains.
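
R² from a single split is only one view of model quality. Error metrics in the target's own units (MAE, RMSE) and a cross-validated R² give a fuller picture. The sketch below runs on synthetic data generated on the spot: the names `X_demo` and `y_demo` and the coefficients are assumptions, so the printed values say nothing about the insurance model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression problem (illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2
)
demo_model = LinearRegression().fit(X_tr, y_tr)
y_hat = demo_model.predict(X_te)

# Errors expressed in the same units as the target.
mae = mean_absolute_error(y_te, y_hat)
rmse = np.sqrt(mean_squared_error(y_te, y_hat))

# R² on five different train/validation splits instead of a single one.
cv_r2 = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring="r2")

print("MAE :", round(mae, 3))
print("RMSE:", round(rmse, 3))
print("CV R² mean:", round(cv_r2.mean(), 3))
```

Swapping in `x`, `y`, `Y_test` and `pred` from the cells above would report the same metrics for the insurance model.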

### **Building a Predictive System** 🐋

In [58]:
class Pred:
    
    def __init__ (self):
        self.age = None
        self.sex = None
        self.bmi = None
        self.children = None
        self.smoker = None
        self.region = None
        
    def model(self, age, sex, bmi, children, smoker, region):
        
        self.age = age
        self.sex = sex
        self.bmi = bmi
        self.children = children
        self.smoker = smoker
        self.region = region
        
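        # Build a one-row DataFrame with the training column names so that
        # scikit-learn does not warn about missing feature names.
        data = pd.DataFrame(
            [[self.age, self.sex, self.bmi, self.children, self.smoker, self.region]],
            columns=X_train.columns,
        )
        predCost = lr.predict(data)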
        
        return predCost
    
    
if __name__ == '__main__':
    obj = Pred()
    r = obj.model(37, 0, 22, 3, 0, 3)
    print(r)
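
The call above passes already-encoded integers, which are easy to mix up (for instance, `smoker = 0` means "yes" in this notebook). A small helper can translate readable labels into the codes listed in the Data Nomenclature section. The function name `encode_input` is an assumption made for this sketch.

```python
# Mappings copied from the "Data Nomenclature" section of this notebook.
SEX_CODES = {"male": 0, "female": 1}
SMOKER_CODES = {"yes": 0, "no": 1}
REGION_CODES = {"southeast": 0, "southwest": 1, "northeast": 2, "northwest": 3}

def encode_input(age, sex, bmi, children, smoker, region):
    """Return one feature row in the column order used for training.

    Raises KeyError if a label is not one of the known categories.
    """
    return [
        age,
        SEX_CODES[sex],
        bmi,
        children,
        SMOKER_CODES[smoker],
        REGION_CODES[region],
    ]

row = encode_input(37, "male", 22, 3, "yes", "northwest")
print(row)  # [37, 0, 22, 3, 0, 3]
```

The resulting list matches the arguments used in the example call, so `obj.model(*row)` would produce the same prediction.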
        