# Predicting Insurance Premiums using Machine Learning

---

The problem covered in this notebook is a common one in data science. Suppose we work as analysts at an insurance company: our task is to set appropriate premium charges for our clients, given the indicators (and response variable) in our historical dataset, `insurance.csv`.

---

### Essential Libraries

> NumPy : Library for Numeric Computations in Python  
> Pandas : Library for Data Acquisition and Preparation  
> Matplotlib : Low-level library for Data Visualization  
> Seaborn : Higher-level library for Data Visualization  

In [2]:
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
sb.set()


## Dataset Cleaning and Preprocessing
---


Before we begin, it is important to remember that our goal is to gain insights that could lead to a reliable prediction of 'charges', given the other indicators. With that in mind, we plan our initial steps: converting the columns of the dataset into data types that are suitable for analysis and not misleading.

In [3]:
df = pd.read_csv('insurance.csv.xls')
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


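Null values must be filled before analysis. The `df.info()` output above reports 1338 non-null entries in every column, so the fill below acts as a safeguard. For a discrete numeric column such as `children`, we assume a missing value to mean 0 children, which should not jeopardise the integrity of our dataset.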
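Before filling, it can help to check which columns actually contain nulls. The following is a minimal sketch on a hypothetical toy frame (the `toy` DataFrame and its missing value are assumptions made purely for illustration, not rows from `insurance.csv`). It counts nulls per column and fills only `children`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the insurance data; the NaN is
# introduced here purely for illustration.
toy = pd.DataFrame({
    "age": [19, 18, 28],
    "children": [0, np.nan, 3],
})

# Count missing values per column before deciding how to fill them
null_counts = toy.isnull().sum()
print(null_counts)

# Fill only the 'children' column with 0, then restore an integer dtype
toy["children"] = toy["children"].fillna(0).astype(int)
print(toy)
```

Filling a single column keeps the assumption (missing means zero children) from silently spreading to columns where 0 would not be a sensible default, such as `bmi`.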

In [5]:
# Fill any NULL data points with a standard value of zero (0)
df = df.fillna(0)

# Change the dtypes appropriately for data exploration
df['children'] = df['children'].astype(int)

# Encode each categorical column as integer codes (1, 2, ...),
# assigned in order of first appearance in the column
encoded_columns = {'smoker': 'smoker_bool', 'sex': 'sex_bool', 'region': 'region_cat'}
for source_col, encoded_col in encoded_columns.items():
    value_map = {value: i + 1 for i, value in enumerate(df[source_col].unique())}
    df[encoded_col] = df[source_col].map(value_map)

df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          1338 non-null   int64  
 1   sex          1338 non-null   object 
 2   bmi          1338 non-null   float64
 3   children     1338 non-null   int64  
 4   smoker       1338 non-null   object 
 5   region       1338 non-null   object 
 6   charges      1338 non-null   float64
 7   smoker_bool  1338 non-null   int64  
 8   sex_bool     1338 non-null   int64  
 9   region_cat   1338 non-null   int64  
dtypes: float64(2), int64(5), object(3)
memory usage: 104.7+ KB


Unnamed: 0,age,sex,bmi,children,smoker,region,charges,smoker_bool,sex_bool,region_cat
0,19,female,27.9,0,yes,southwest,16884.924,1,1,1
1,18,male,33.77,1,no,southeast,1725.5523,2,2,2
2,28,male,33.0,3,no,southeast,4449.462,2,2,2
3,33,male,22.705,0,no,northwest,21984.47061,2,2,3
4,32,male,28.88,0,no,northwest,3866.8552,2,2,3



After cleaning the dataset (and replacing the copy at '/Users/kauthar/Desktop/insurance.csv'), it is now ready for further exploration.

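The EXTENDED dataset now consists of 10 columns. The 'smoker_bool', 'sex_bool' and 'region_cat' columns were appended to allow for further analysis below (EDA, correlation, etc.). Note that despite the '_bool' suffix, these columns hold the integer codes 1 and 2 rather than True/False values.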

The new numeric columns can also be used as predictors of our response variable ('charges', dtype float64).
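As a rough illustration of the encoding step, the sketch below rebuilds `region_cat` for the five rows shown in the `df.head()` output using `pd.factorize`, which assigns codes in order of first appearance. It also shows one-hot encoding with `pd.get_dummies` as a possible alternative; this is an assumption on our part rather than something the notebook uses, and it may be worth considering because integer codes imply an ordering between regions that does not really exist.

```python
import pandas as pd

# The 'region' values of the first five rows, as shown in df.head() above
sample = pd.DataFrame({
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
})

# pd.factorize returns 0-based codes in order of first appearance;
# adding 1 matches the 1-based codes produced by the encoding cell above
codes, uniques = pd.factorize(sample["region"])
sample["region_cat"] = codes + 1
print(sample)

# Alternative (not used in this notebook): one indicator column per region,
# which avoids implying that one region is "greater" than another
region_dummies = pd.get_dummies(sample["region"], prefix="region")
print(region_dummies)
```

Which encoding is preferable depends on the model: tree-based models tend to cope with integer codes, while linear models usually benefit from indicator columns.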

