# Data retrieval and Data cleaning

<b> About Dataset </b>
- age: age of primary beneficiary
- sex: gender of the insurance contractor (female, male)
- bmi: body mass index, an objective measure of whether body weight is high or low relative to height, computed as weight divided by height squared (kg / m ^ 2); ideally 18.5 to 24.9
- children: Number of children covered by health insurance / Number of dependents
- smoker: whether the beneficiary smokes (yes, no)
- region: the beneficiary's residential area in the US, northeast, southeast, southwest, northwest.
- charges: Individual medical costs billed by health insurance
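
As a worked illustration of the bmi column's unit (kg / m ^ 2), the short sketch below computes the index from a weight and a height. The helper name `body_mass_index` and the sample values are hypothetical and are not part of the dataset or the notebook.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Return weight divided by height squared, in kg / m^2 (hypothetical helper)."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Made-up example: 70 kg at 1.75 m falls inside the 18.5 to 24.9 range.
example_bmi = body_mass_index(70.0, 1.75)
in_ideal_range = 18.5 <= example_bmi <= 24.9
print(round(example_bmi, 2), in_ideal_range)
```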

<b> To Predict : </b> Insurance Cost



<b> Data Retrieval </b>


In [33]:
# Importing libraries

import pandas as pd
import numpy as np


In [34]:
data = pd.read_csv('insurance.csv')

<b> Data Exploration and Cleaning </b>

In [35]:
data.head()

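| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |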


In [36]:
data.shape

(1338, 7)

In [37]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [38]:
data.describe()

| | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.04996 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 27.0 | 26.29625 | 0.0 | 4740.28715 |
| 50% | 39.0 | 30.4 | 1.0 | 9382.033 |
| 75% | 51.0 | 34.69375 | 2.0 | 16639.912515 |
| max | 64.0 | 53.13 | 5.0 | 63770.42801 |


Checking the fraction of null values in each column

In [39]:
data.isnull().sum()/len(data)

age         0.0
sex         0.0
bmi         0.0
children    0.0
smoker      0.0
region      0.0
charges     0.0
dtype: float64

<b> Inference 1 :</b> The dataset contains no null values in any column
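
The expression `data.isnull().sum()/len(data)` gives the fraction of missing values per column. As a minimal sketch of what a non-zero result looks like, the snippet below runs the equivalent `isna().mean()` on a small made-up frame; the frame `toy` and its columns are illustrative assumptions, not the insurance data.

```python
import numpy as np
import pandas as pd

# Made-up frame: column "a" has one missing value out of four, column "b" has none.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0], "b": ["x", "y", "z", "w"]})

# The mean of the boolean mask equals isnull().sum() / len(toy).
null_fraction = toy.isna().mean()

# Single yes/no answer for the whole frame.
has_nulls = bool(toy.isna().any().any())

print(null_fraction)
print(has_nulls)
```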

In [40]:
# Creating a column list 

column_list = data.columns.tolist()

In [41]:
column_list

['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges']

Checking the data for outliers

In [42]:
# Function that returns the percentage of values in a column lying outside the 1.5 * IQR fences

def detect_outlier_percentage(field):
    q1 = np.quantile(data[field],0.25)
    q3 = np.quantile(data[field],0.75)
    iqr = q3-q1
    threshold = 1.5
    ub = q3+iqr*threshold
    lb = q1-iqr*threshold
    outliers = []
    for i in data[field]:
        if i<lb or i>ub:
            outliers.append(i)
    return len(outliers)/len(data)*100


In [43]:
numerical_features = [feature for feature in data.columns if data[feature].dtype!='O']

In [44]:
numerical_features

['age', 'bmi', 'children', 'charges']

In [47]:
outlier_dict = {}
for i in numerical_features:
    outlier_dict[i] = detect_outlier_percentage(i)


In [48]:
outlier_dict

{'age': 0.0,
 'bmi': 0.672645739910314,
 'children': 0.0,
 'charges': 10.388639760837071}

<b> Inference 2:</b> The age and children columns contain no outliers, and bmi has a very low share of them (about 0.67%). About 10.4% of the values in charges are outliers under the 1.5 * IQR rule.

<b> Objective : </b> We can leave these outliers in place, because their share is small and their effect is expected to be minimal
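
The outlier check used in this section (values below q1 - 1.5 * IQR or above q3 + 1.5 * IQR) can also be written without an explicit loop. The sketch below is one possible vectorized variant, not the notebook's own function; the name `outlier_percentage` and the frame `toy` are assumptions made for illustration.

```python
import pandas as pd


def outlier_percentage(series: pd.Series, threshold: float = 1.5) -> float:
    """Percentage of values outside [q1 - threshold * iqr, q3 + threshold * iqr]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - threshold * iqr
    upper = q3 + threshold * iqr
    # Boolean mask of outliers; its mean is the outlier fraction.
    mask = (series < lower) | (series > upper)
    return float(mask.mean() * 100)


# Made-up data: nine ordinary values and one extreme value.
toy = pd.DataFrame({"value": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]})
toy_outliers = {col: outlier_percentage(toy[col]) for col in toy.columns}
print(toy_outliers)
```

Taking the series as an argument, rather than reading a global `data` frame, lets the same helper be applied to any numeric column or to a filtered subset.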