### Predicting Health Insurance Premiums (Based on Customer Charges)


We are using a dataset that contains information about potential health insurance customers, such as age, smoking history, and BMI. We will use the 'charges' column as the target to predict how much a potential customer may spend on health care. Health insurance companies could use this spending estimate to determine an appropriate premium.

In [13]:
#Import dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import plotly.express as px


In [6]:
#Load csv file into Pandas DataFrame
h_data = pd.read_csv("Resource/Health_insurance.csv")

#View DataFrame
h_data.head()

   age     sex     bmi  children  smoker     region      charges
0   19  female    27.9         0     yes  southwest    16884.924
1   18    male   33.77         1      no  southeast    1725.5523
2   28    male    33.0         3      no  southeast     4449.462
3   33    male  22.705         0      no  northwest  21984.47061
4   32    male   28.88         0      no  northwest    3866.8552


In [7]:
#Check if DataFrame contains any null values
h_data.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

In [11]:
#Check DataFrame info
h_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [10]:
h_data["sex"].value_counts()

male      676
female    662
Name: sex, dtype: int64

In [12]:
h_data["smoker"].value_counts()

no     1064
yes     274
Name: smoker, dtype: int64
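
Raw counts are easier to compare as shares of the whole dataset. The following is a minimal sketch that hard-codes the two counts from the output above into a hypothetical `smoker_counts` Series so it runs without the CSV file; on the full DataFrame, `h_data["smoker"].value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Counts copied from the value_counts() output above (not re-read from the CSV).
smoker_counts = pd.Series({"no": 1064, "yes": 274}, name="smoker")

# Convert the raw counts to percentage shares of all 1338 rows.
smoker_share = (smoker_counts / smoker_counts.sum() * 100).round(1)
print(smoker_share)
```

Roughly one customer in five is a smoker, so the two groups are imbalanced.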

In [26]:
#Create some basic visuals with plotly.express: smokers vs. non-smokers, split by sex
smoker_by_sex_chart = px.histogram(h_data, x = "smoker", color = "sex", title = "Smokers vs. Non-Smokers by Sex")
smoker_by_sex_chart.show()
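
The counts behind a chart like this can also be read as a table. The sketch below is an assumption-laden illustration: it uses only the five rows shown by `h_data.head()` (stored in a hypothetical `sample` DataFrame) so that it runs on its own, and cross-tabulates sex against smoking status with `pd.crosstab`. Passing `h_data["sex"]` and `h_data["smoker"]` instead would tabulate the full dataset.

```python
import pandas as pd

# Illustrative sample: the sex and smoker values of the five rows shown by h_data.head().
sample = pd.DataFrame({
    "sex": ["female", "male", "male", "male", "male"],
    "smoker": ["yes", "no", "no", "no", "no"],
})

# Cross-tabulate sex against smoking status; each cell is a row count.
smoker_by_sex = pd.crosstab(sample["sex"], sample["smoker"])
print(smoker_by_sex)
```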

In [15]:
#Count how many customers there are at each age
h_data["age"].value_counts()

18    69
19    68
50    29
51    29
47    29
46    29
45    29
20    29
48    29
52    29
22    28
49    28
54    28
53    28
21    28
26    28
24    28
25    28
28    28
27    28
23    28
43    27
29    27
30    27
41    27
42    27
44    27
31    27
40    27
32    26
33    26
56    26
34    26
55    26
57    26
37    25
59    25
58    25
36    25
38    25
35    25
39    25
61    23
60    23
63    23
62    23
64    22
Name: age, dtype: int64

In [19]:
#Visualize the age distribution. Ages run from 18 to 64, so use 47 bins (one per year of age)
age_distribution = px.histogram(h_data, x = 'age', nbins = 47, title = 'Distribution of Age')
age_distribution.update_layout(bargap=0.1)
age_distribution.show()
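
The bin count of 47 comes from counting every whole-number age from 18 to 64 inclusive (64 - 18 + 1). As a sketch, the same number can be derived from the data rather than typed by hand; the helper name `one_bin_per_integer` and the `sample_ages` values are hypothetical and chosen only to cover the stated minimum and maximum.

```python
import pandas as pd

def one_bin_per_integer(values: pd.Series) -> int:
    """Return how many whole numbers lie between the min and max, inclusive."""
    return int(values.max() - values.min()) + 1

# Illustrative ages; only the minimum (18) and maximum (64) affect the result.
sample_ages = pd.Series([19, 18, 28, 33, 32, 64])
n_bins = one_bin_per_integer(sample_ages)
print(n_bins)  # 47
```

With the full dataset, `one_bin_per_integer(h_data["age"])` could be passed as `nbins` so the chart stays correct if the age range changes.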

In [23]:
#Use .describe() to see the range of values in the bmi column
h_data["bmi"].describe()
#BMI categories from the National Institutes of Health: https://www.nhlbi.nih.gov/health/educational/lose_wt/BMI/bmi_tbl.pdf
#"Normal" is 19-24, "Overweight" is 25-29, "Obese" is 30-39, "Extreme Obesity" is 40+


count    1338.000000
mean       30.663397
std         6.098187
min        15.960000
25%        26.296250
50%        30.400000
75%        34.693750
max        53.130000
Name: bmi, dtype: float64
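
The NIH ranges quoted in the comment above can be turned into a categorical column with `pd.cut`. This is a sketch under two assumptions that are not in the original notebook: the whole-number ranges are treated as half-open intervals (for example, "Overweight" covers 25 up to but not including 30), and values under 19, which the quoted ranges do not cover, get a placeholder label `"Below 19"`. The sample values are the five BMI readings shown by `h_data.head()`.

```python
import numpy as np
import pandas as pd

# Illustrative sample: the bmi values of the five rows shown by h_data.head().
bmi = pd.Series([27.9, 33.77, 33.0, 22.705, 28.88], name="bmi")

# Bin edges derived from the NIH ranges; "Below 19" is an assumed placeholder label.
edges = [-np.inf, 19, 25, 30, 40, np.inf]
labels = ["Below 19", "Normal", "Overweight", "Obese", "Extreme Obesity"]

# right=False makes each interval include its lower edge and exclude its upper edge.
bmi_category = pd.cut(bmi, bins=edges, labels=labels, right=False)
print(bmi_category.value_counts())
```

Applied to `h_data["bmi"]`, the same call would show how many customers fall into each category; the mean of about 30.66 reported above sits just inside the "Obese" range.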

In [25]:
#Visual for the BMI distribution
bmi_chart = px.histogram(h_data, x = 'bmi', title = 'Distribution of BMI (Body Mass Index)', color_discrete_sequence= ["purple"])
bmi_chart.update_layout(bargap=0.1)
bmi_chart.show()