# Medical Insurance Cost Prediction

The dataset contains several pieces of information about each individual, and we will use them to predict the cost of that individual's medical insurance.

The dataset has information on:

- Age (age) of the individual

- Gender (sex) of the individual (female/male)

- Body Mass Index (bmi) of the individual (ideally 18.5 to 24.9)

- Number of children (children) of the individual

- Smoking habit (smoker) of the individual (yes/no)

- Region (region) of the US where the individual lives (southwest, northwest, southeast, northeast)

- Individual medical costs (charges) billed by health insurance
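
The BMI range quoted in the list can be turned into a simple flag. The following is a minimal sketch, assuming the bounds are treated as inclusive; the `sample` frame and the `bmi_in_ideal_range` column name are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical sample of BMI values, for illustration only
sample = pd.DataFrame({"bmi": [27.900, 33.770, 33.000, 22.705, 28.880]})

# True when bmi falls inside the 18.5 to 24.9 range (both bounds inclusive)
sample["bmi_in_ideal_range"] = sample["bmi"].between(18.5, 24.9)
print(sample)
```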


In [15]:
# Importing all the modules I will need for this project 
import pandas as pd
import numpy as np


# Data Analysis

In [5]:
# Looking at the first rows so we can see what we are working with
df = pd.read_csv('insurance.csv')
print(df.head(5))

   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520


In [6]:
# The number of rows and columns in the dataframe
df.shape

(1338, 7)

In [7]:
# Listing the unique region values, which we will convert to numerical values so we can train the model later
df['region'].unique()

array(['southwest', 'southeast', 'northwest', 'northeast'], dtype=object)

In [8]:
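# Creating a new dataframe with the region encoded as 0/1 indicator columns
# (drop_first=True drops the first category, northeast; dtype=int keeps 0/1 instead of booleans on newer pandas)
region_numbers = pd.get_dummies(df['region'], drop_first=True, dtype=int)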
region_numbers.head(5)

   northwest  southeast  southwest
0          0          0          1
1          0          1          0
2          0          1          0
3          1          0          0
4          1          0          0
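
Only three region columns appear because `drop_first=True` removes the first category in sorted order. A minimal sketch of that behaviour, using just the four region labels rather than the full dataset (the `regions`, `full`, and `reduced` names are illustrative):

```python
import pandas as pd

# The four region labels reported by df['region'].unique()
regions = pd.Series(
    ["southwest", "southeast", "northwest", "northeast"], name="region"
)

# dtype=int keeps 0/1 output on pandas versions that default to booleans
full = pd.get_dummies(regions, dtype=int)
reduced = pd.get_dummies(regions, drop_first=True, dtype=int)

print(full.columns.tolist())     # all four regions, sorted alphabetically
print(reduced.columns.tolist())  # 'northeast' is the dropped category

# The 'northeast' row is encoded as all zeros in the reduced frame
print(reduced.loc[regions == "northeast"])
```

A row with zeros in all three columns therefore stands for the northeast region, so no information is lost by dropping the fourth column.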


In [9]:
# Concatenating the region_numbers dataframe with the original dataframe and removing the region column
df_region = pd.concat([df, region_numbers], axis=1)
df_region.drop(['region'], axis=1, inplace=True)
df_region.head(5)

   age     sex     bmi  children smoker      charges  northwest  southeast  southwest
0   19  female    27.9         0    yes    16884.924          0          0          1
1   18    male   33.77         1     no    1725.5523          0          1          0
2   28    male    33.0         3     no     4449.462          0          1          0
3   33    male  22.705         0     no  21984.47061          1          0          0
4   32    male   28.88         0     no    3866.8552          1          0          0
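
The `sex` and `smoker` columns are still text at this point, and a model will need them as numbers too. One possible way to encode them is a plain 0/1 mapping. This is only a sketch: the `demo` frame is hypothetical, and the choice of which label becomes 1 is an assumption rather than something taken from the notebook.

```python
import pandas as pd

# Hypothetical rows mirroring the sex and smoker columns shown above
demo = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
})

# Explicit mappings; which label maps to 1 is an arbitrary assumption
demo["sex"] = demo["sex"].map({"female": 0, "male": 1})
demo["smoker"] = demo["smoker"].map({"no": 0, "yes": 1})
print(demo)
```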


In [11]:
# Checking for null values in the dataframe; if there are any, we need to drop them
df_region.isnull().sum()

age          0
sex          0
bmi          0
children     0
smoker       0
charges      0
northwest    0
southeast    0
southwest    0
dtype: int64
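
Every count is zero here, so nothing has to be dropped. For reference, a minimal sketch of what the drop would look like if a value were missing, using a hypothetical `demo` frame rather than the insurance data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing bmi value
demo = pd.DataFrame({
    "age": [19, 18, 28],
    "bmi": [27.9, np.nan, 33.0],
})

# Per-column count of missing values
print(demo.isnull().sum())

# Drop any row that contains at least one missing value
clean = demo.dropna()
print(clean.shape)
```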

In [12]:
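# Grouping the dataframe by age and taking the mean of each numeric column
# (numeric_only=True skips the text columns sex and smoker, which recent pandas versions would otherwise reject)
df_age = df_region.groupby('age').mean(numeric_only=True)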
print(df_age)

           bmi  children       charges  northwest  southeast  southwest
age                                                                    
18   31.326159  0.449275   7086.217556   0.000000   0.536232   0.000000
19   28.596912  0.426471   9747.909335   0.500000   0.044118   0.455882
20   30.632759  0.862069  10159.697736   0.241379   0.275862   0.275862
21   28.185714  0.785714   4730.464330   0.250000   0.250000   0.250000
22   31.087679  0.714286  10012.932802   0.250000   0.285714   0.214286
23   31.454464  1.000000  12419.820040   0.250000   0.250000   0.250000
24   29.142679  0.464286  10648.015962   0.250000   0.250000   0.250000
25   29.693929  1.285714   9838.365311   0.250000   0.250000   0.250000
26   29.428929  1.071429   6133.825309   0.250000   0.250000   0.250000
27   29.333571  0.964286  12184.701721   0.214286   0.321429   0.214286
28   29.482143  1.285714   9069.187564   0.214286   0.285714   0.250000
29   29.383148  1.259259  10430.158727   0.259259   0.259259   0
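
In the grouped output, the region columns hold 0/1 indicators, so their mean per age is the share of people of that age who live in that region. A small sketch with made-up numbers (the `demo` frame is hypothetical) makes this easier to see:

```python
import pandas as pd

# Hypothetical rows: four 18-year-olds, two of them in the southeast
demo = pd.DataFrame({
    "age": [18, 18, 18, 18, 19, 19],
    "southeast": [1, 0, 1, 0, 0, 0],
    "charges": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0, 7000.0],
})

# Mean of a 0/1 column within a group equals the fraction of ones in that group
by_age = demo.groupby("age").mean()
print(by_age)
```

Here half of the 18-year-olds are in the southeast, so the `southeast` mean for age 18 is 0.5, while `charges` is an ordinary average.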

In [13]:
# Describing the dataframe to see the mean, std, max, etc.
df.describe()

             age        bmi  children       charges
count     1338.0     1338.0    1338.0        1338.0
mean   39.207025  30.663397  1.094918  13270.422265
std     14.04996   6.098187  1.205493  12110.011237
min         18.0      15.96       0.0     1121.8739
25%         27.0   26.29625       0.0    4740.28715
50%         39.0       30.4       1.0      9382.033
75%         51.0   34.69375       2.0  16639.912515
max         64.0      53.13       5.0   63770.42801
