# U.S. Medical Insurance Costs

Goal:
To analyze, explore and compare the relationship between different variables that contribute to formulating the insurance charges for an individual.

Data used:
Medical Insurance Costs dataset from Kaggle.com
The information points we are given: age, sex, BMI, number of children, smoker?, region within USA, insurance charges

Analysis:
1.  Determine the number of males vs females in the data. Is there an approximate 50-50 split?
2.  What is the average age of the insured individuals?
3.1 Find out the average insurance charge for smokers vs non-smokers.
3.2 Find out the average insurance charge for females vs males.
4.1 Which region has the most insured individuals?
4.2 Which region has the lowest and highest average insurance charges?
5.1 What is the average bmi for smokers vs non-smokers?
5.2 What is the average bmi for females vs males?
6.1 Analyze insurance charges for an individual with no children compared to an individual with 3 or more children
6.2 Find out the average insurance charge for individuals in the healthy BMI range (18.5 - 24.9)



In [52]:
#Import csv library 
import csv

In [53]:
# Create empty lists for the different columns in insurance.csv
ages = []
sexes = []
bmis = []
num_children = []
smoker_status = []
regions = []
ins_charges = []

In [54]:
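# Create helper function to load csv data into lists
def create_list_data(new_list, csv_file, col_name):
    """Append every value of column col_name in csv_file to new_list and return it."""
    with open(csv_file, newline='') as csv_file_info:
        csv_dict = csv.DictReader(csv_file_info)
        # Each row is a dict keyed by the column headers; values are strings
        for row in csv_dict:
            new_list.append(row[col_name])
    return new_list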



In [55]:
create_list_data(ages, 'insurance.csv', 'age')
create_list_data(sexes, 'insurance.csv', 'sex')
create_list_data(bmis, 'insurance.csv', 'bmi')
create_list_data(num_children, 'insurance.csv', 'children')
create_list_data(smoker_status, 'insurance.csv', 'smoker')
create_list_data(regions, 'insurance.csv', 'region')
create_list_data(ins_charges, 'insurance.csv', 'charges')

#Check if a list is created
print("BMI list", bmis[:15])  
print("Sexes:", sexes[:15])  

BMI list ['27.9', '33.77', '33', '22.705', '28.88', '25.74', '33.44', '27.74', '29.83', '25.84', '26.22', '26.29', '34.4', '39.82', '42.13']
Sexes: ['female', 'male', 'male', 'male', 'male', 'female', 'female', 'female', 'male', 'female', 'male', 'female', 'male', 'female', 'male']


Now that the lists are created, I can analyze the data.
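
As an aside, the helper above reopens and rereads the file once per column. A minimal sketch of a single-pass alternative is shown here; `load_columns` is a hypothetical name, and the sample rows are made up so the sketch runs without insurance.csv.

```python
import csv
import io

def load_columns(csv_file_obj):
    """Read every column of an open CSV file in one pass.

    Returns a dict mapping each header name to a list of string values.
    """
    reader = csv.DictReader(csv_file_obj)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns

# Hypothetical sample rows (not taken from the real data set)
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,female,22.5,0,no,northeast,4000.0\n"
    "45,male,31.0,2,yes,southeast,25000.0\n"
)
columns = load_columns(sample_csv)
print(columns["sex"])
print(columns["bmi"])
```

With the real file, the same function could be called as `with open('insurance.csv', newline='') as f: columns = load_columns(f)`. The values are still strings and need converting before any arithmetic.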

1. Determine the number of males vs females in the data. Is there an approximate 50-50 split?

In [63]:
# Total number of individuals in data set:
total_pop = len(sexes)

print("The total number of individuals in our data set is:", total_pop)

The total number of individuals in our data set is: 1338


In [64]:
# There are 1338 individuals in the data set
num_of_females = sexes.count('female')
num_of_males = sexes.count('male')

print("The number of females in the data set is: ", num_of_females)
print("Percentage female:", round(num_of_females/len(sexes)*100, 1))
print("The number of males in the data set is: ", num_of_males)
print("Percentage male:", round(num_of_males/len(sexes)*100, 1))
num_of_females + num_of_males == 1338

The number of females in the data set is:  662
Percentage female: 49.5
The number of males in the data set is:  676
Percentage male: 50.5


True

The data is split almost evenly between the sexes, with about 50.5% males and 49.5% females.

Next, let's determine the average age of the insured individuals.

In [65]:
#First we need to convert each age which is a string into a float
ages = [float(i) for i in ages]

ave_age = sum(ages)/len(ages)
print(f"The average age of the insured individuals is:", round(ave_age,1))

The average age of the insured individuals is: 39.2


At about 39, the average insured individual is fairly young.
Let's see how many males and females aged 50 and over are insured.

In [66]:
male_50_over = 0
female_50_over = 0

for i in range(len(ages)):
    if ages[i] >= 50:
        if sexes[i] == 'female':
            female_50_over += 1
        elif sexes[i] == 'male':
            male_50_over +=1

print(f"The number of females aged 50 and over that are insured: ", female_50_over)    
print(f"The number of males aged 50 and over that are insured: ", male_50_over)           

The number of females aged 50 and over that are insured:  195
The number of males aged 50 and over that are insured:  190


Together that is 385 of the 1338 individuals in the data set (about 29%), which seems quite low.
Is insurance too expensive for older individuals to afford?
Do they feel it is unnecessary at their age?
Or is there some other reason?
These questions would need to be explored using additional data.
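
The mean alone says little about how ages are spread, so it can help to look at the median and the share of individuals aged 50 and over as well. The sketch below uses the standard `statistics` module on a small made-up list; `sample_ages` is hypothetical and is not the notebook's `ages` list.

```python
import statistics

# Hypothetical ages, used only to illustrate the calculation
sample_ages = [19, 23, 31, 38, 45, 52, 58, 64]

mean_age = statistics.mean(sample_ages)
median_age = statistics.median(sample_ages)

# Share of individuals aged 50 and over, as a fraction of the whole list
share_50_plus = sum(1 for age in sample_ages if age >= 50) / len(sample_ages)

print("Mean age:", mean_age)
print("Median age:", median_age)
print("Share aged 50 and over:", round(share_50_plus * 100, 1), "%")
```

The same three lines could be applied to `ages` once it has been converted to numbers.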

Next, let's explore how smoking affects insurance costs.

We'll find out the average insurance charge for smokers vs non-smokers.

In [76]:
# First, let's get the smoker numbers
male_smoker = 0
female_smoker = 0

for i in range(len(sexes)):
    if smoker_status[i] == 'yes':
        if sexes[i] == 'female':
            female_smoker += 1
        elif sexes[i] == 'male':
            male_smoker +=1

print("The number of female smokers that are insured: ", female_smoker)
print("The number of male smokers that are insured: ", male_smoker)
print("Total smokers:", smoker_status.count('yes'))
print("Percentage of smokers in data set:", round(smoker_status.count('yes')/len(smoker_status)*100, 2))

The number of female smokers that are insured:  115
The number of male smokers that are insured:  159
Total smokers: 274
Percentage of smokers in data set: 20.48


In [84]:
#We need to convert each insurance cost from a string into a float
ins_charges = [float(i) for i in ins_charges]

tot_ins_charge_smoker = 0
tot_ins_charge_non_smoker = 0

for i in range(len(ins_charges)):
    if smoker_status[i] == 'yes':
        tot_ins_charge_smoker += ins_charges[i]
    elif smoker_status[i] == 'no':
        tot_ins_charge_non_smoker += ins_charges[i]

ave_ins_charge_smoker = round(tot_ins_charge_smoker/smoker_status.count('yes'),2)
ave_ins_charge_non_smoker = round(tot_ins_charge_non_smoker/smoker_status.count('no'),2)   

print("The average insurance costs for a smoker:", ave_ins_charge_smoker)
print("The average insurance costs for a non-smoker:", ave_ins_charge_non_smoker)
print("The average insurance costs for the total data set:", round(sum(ins_charges)/len(ins_charges), 2))

The average insurance costs for a smoker: 32050.23
The average insurance costs for a non-smoker: 8434.27
The average insurance costs for the total data set: 13270.42


On average, smokers are charged about 3.8 times as much as non-smokers ($32,050.23 vs $8,434.27)!
One way to pay less for insurance -> STOP smoking!
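
To put a number on the gap, the two averages printed above can be compared as a ratio and as a difference. This is only a small arithmetic sketch; the variable names are illustrative and the values are copied from the output above.

```python
# Average charges copied from the output above
ave_smoker = 32050.23
ave_non_smoker = 8434.27

# How many times larger the smokers' average is, and the gap in dollars
ratio = ave_smoker / ave_non_smoker
difference = ave_smoker - ave_non_smoker

print("Smokers are charged", round(ratio, 1), "times as much on average")
print("Average difference in dollars:", round(difference, 2))
```

Note that this compares group averages only; it does not account for differences in age or BMI between the two groups.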

Next, we'll look at the average insurance costs of females vs males.

In [105]:
tot_ins_charge_female = 0
tot_ins_charge_male = 0

for i in range(len(ins_charges)):
    if sexes[i] == 'female':
        tot_ins_charge_female += ins_charges[i]
    elif sexes[i] == 'male':
        tot_ins_charge_male += ins_charges[i]

ave_ins_charge_female = round(tot_ins_charge_female/num_of_females,2)
ave_ins_charge_male = round(tot_ins_charge_male/num_of_males,2)   

print(f"The average insurance costs for a female:",ave_ins_charge_female)
print(f"The average insurance costs for a male:",ave_ins_charge_male)

The average insurance costs for a female: 12569.58
The average insurance costs for a male: 13956.75


In our data set, the average insurance cost for a male is about $1,387 (roughly 11%) higher than for a female.

Next, we will analyze the data by region and determine which region has the most insured individuals.

In [88]:
# First we need to determine how many regions there are in our data set

unique_regions = []
for region in regions:
    if region not in unique_regions:
        unique_regions.append(region)

print(f"Regions list:",unique_regions)        

Regions list: ['southwest', 'southeast', 'northwest', 'northeast']


In [91]:
total_ins_sw = 0
total_ins_se = 0
total_ins_nw = 0
total_ins_ne = 0

for i in range(len(regions)):
    if regions[i] == 'southwest':
        total_ins_sw += 1
    elif regions[i] == 'southeast':
        total_ins_se += 1
    elif regions[i] == 'northwest':
        total_ins_nw += 1  
    elif regions[i] == 'northeast':
        total_ins_ne += 1      

print(str(total_ins_sw) + " individuals in our insurance dataset live in the South West US.")
print(str(total_ins_se) + " individuals in our insurance dataset live in the South East US.")
print(str(total_ins_nw) + " individuals in our insurance dataset live in the North West US.")
print(str(total_ins_ne) + " individuals in our insurance dataset live in the North East US.")

325 individuals in our insurance dataset live in the South West US.
364 individuals in our insurance dataset live in the South East US.
325 individuals in our insurance dataset live in the North West US.
324 individuals in our insurance dataset live in the North East US.


The South East US has the most insured individuals, with 364. This is not much higher than the other regions, which each have 324 or 325 individuals.
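
The counting above can also be written more compactly with `collections.Counter` from the standard library, which avoids one counter variable per region. This is a sketch on a small made-up list; `sample_regions` is hypothetical and stands in for the notebook's `regions` list.

```python
from collections import Counter

# Hypothetical region labels, used only to illustrate the technique
sample_regions = ["southwest", "southeast", "southeast",
                  "northwest", "northeast", "southeast"]

region_counts = Counter(sample_regions)

# most_common() lists regions from the most to the least frequent
for region, count in region_counts.most_common():
    print(str(count) + " individuals live in the " + region + " region.")
```

A side benefit is that the region names no longer need to be known in advance.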

Next, we'll determine which regions have the highest and lowest average insurance charges.

In [93]:
total_ins_charges_sw = 0
total_ins_charges_se = 0
total_ins_charges_nw = 0
total_ins_charges_ne = 0

for i in range(len(regions)):
    if regions[i] == 'southwest':
        total_ins_charges_sw += ins_charges[i]
    elif regions[i] == 'southeast':
        total_ins_charges_se += ins_charges[i]
    elif regions[i] == 'northwest':
        total_ins_charges_nw += ins_charges[i]
    elif regions[i] == 'northeast':
        total_ins_charges_ne += ins_charges[i]

# Compute each average once, after all charges have been summed
ave_ins_charge_sw = total_ins_charges_sw/total_ins_sw
ave_ins_charge_se = total_ins_charges_se/total_ins_se
ave_ins_charge_nw = total_ins_charges_nw/total_ins_nw
ave_ins_charge_ne = total_ins_charges_ne/total_ins_ne

print(f"The average insurance costs for the individuals living in the SW is $", round(ave_ins_charge_sw,2))
print(f"The average insurance costs for the individuals living in the SE is $", round(ave_ins_charge_se,2))
print(f"The average insurance costs for the individuals living in the NW is $", round(ave_ins_charge_nw,2))
print(f"The average insurance costs for the individuals living in the NE is $", round(ave_ins_charge_ne,2))


The average insurance costs for the individuals living in the SW is $ 12346.94
The average insurance costs for the individuals living in the SE is $ 14735.41
The average insurance costs for the individuals living in the NW is $ 12417.58
The average insurance costs for the individuals living in the NE is $ 13406.38


The SE again has the highest average insurance costs.
The SW has the lowest average insurance costs.
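
The averages by smoker status, sex and region all follow the same pattern: sum the charges per group, then divide by the group size. A minimal sketch of a reusable helper follows; `average_by_group` is a hypothetical name and the sample lists are made up.

```python
from collections import defaultdict

def average_by_group(groups, values):
    """Return a dict mapping each group label to the mean of its values."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, value in zip(groups, values):
        totals[group] += value
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Hypothetical sample data, used only to illustrate the technique
sample_regions = ["southwest", "southeast", "southeast", "northwest"]
sample_charges = [1000.0, 3000.0, 5000.0, 2000.0]

averages = average_by_group(sample_regions, sample_charges)
highest = max(averages, key=averages.get)
lowest = min(averages, key=averages.get)

print("Averages by region:", averages)
print("Highest:", highest, "Lowest:", lowest)
```

With the notebook's lists, a call such as `average_by_group(regions, ins_charges)` would give the four regional averages in one step, provided `ins_charges` has already been converted to floats.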

Next, we'll look at the BMIs of smokers vs non-smokers:

In [94]:
#We need to convert each bmi from a string into a float
bmis = [float(i) for i in bmis]

tot_bmis_smoker = 0
tot_bmis_non_smoker = 0

for i in range(len(bmis)):
    if smoker_status[i] == 'yes':
        tot_bmis_smoker += bmis[i]
    elif smoker_status[i] == 'no':
        tot_bmis_non_smoker += bmis[i]

ave_bmis_smoker = round(tot_bmis_smoker/smoker_status.count('yes'),2)
ave_bmis_non_smoker = round(tot_bmis_non_smoker/smoker_status.count('no'),2)   

print("The average BMI for a smoker:", ave_bmis_smoker)
print("The average BMI for a non-smoker:", ave_bmis_non_smoker)


The average BMI for a smoker: 30.71
The average BMI for a non-smoker: 30.65


This is interesting, as the average BMI for smokers vs non-smokers is very similar.

According to the NHS in the UK:

If your BMI is:
below 18.5 – you're in the underweight range
between 18.5 and 24.9 – you're in the healthy weight range
between 25 and 29.9 – you're in the overweight range
between 30 and 39.9 – you're in the obese range

The average BMI of both smokers and non-smokers falls in the obese range, which may also contribute to higher insurance costs.
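
The ranges quoted above can be turned into a small lookup function. This is a sketch under two assumptions that are not in the quoted text: the boundaries are treated as half-open (so a value such as 24.95 counts as healthy), and values of 40 or more get a placeholder label because the list above does not name that range. `bmi_category` is a hypothetical name.

```python
def bmi_category(bmi):
    """Map a BMI value to one of the ranges quoted above."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "healthy"
    if bmi < 30:
        return "overweight"
    if bmi < 40:
        return "obese"
    # Assumption: the quoted list stops at 39.9, so this label is a placeholder
    return "above listed ranges"

# The two averages printed earlier, plus a few values from the BMI list
for value in [30.71, 30.65, 22.705, 27.9, 42.13]:
    print(value, "->", bmi_category(value))
```

Applying the function to every entry of `bmis` would give a category per individual, which could then be counted or averaged like any other group.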

Next, we'll explore the BMIs of females vs males.

In [95]:
tot_bmis_female = 0
tot_bmis_male = 0

for i in range(len(bmis)):
    if sexes[i] == 'female':
        tot_bmis_female += bmis[i]
    elif sexes[i] == 'male':
        tot_bmis_male += bmis[i]

ave_bmis_female = round(tot_bmis_female/num_of_females,2)
ave_bmis_male = round(tot_bmis_male/num_of_males,2)   

print(f"The average BMI for a female is:",ave_bmis_female)
print(f"The average BMI for a male is:",ave_bmis_male)

The average BMI for a female is: 30.38
The average BMI for a male is: 30.94


The males in the data set have a slightly higher average BMI than the females, but both averages are above 30 and hence in the obese range.
Lowering BMI, for example through changes in diet, might reduce insurance costs, a possibility explored further below.

Now, let's analyze average insurance charges for an individual with no children compared to an individual with 3 or more children.

In [103]:
# First convert number of children to float from string
num_children = [float(i) for i in num_children]

# Number of individuals with NO children
num_no_children = num_children.count(0) 

# Number of individuals with 3 or more children
num_3_more_children = 0
for i in range(len(num_children)):
    if num_children[i] >= 3:
        num_3_more_children += 1


print(f"Number of individuals with no children:", num_no_children)   
print(f"Number of individuals with 3 or more children:", num_3_more_children)      


Number of individuals with no children: 574
Number of individuals with 3 or more children: 200


In [104]:
tot_ins_no_children = 0
tot_ins_3_more_children = 0

for i in range(len(ins_charges)):
    if num_children[i] == 0:
        tot_ins_no_children += ins_charges[i]
    elif num_children[i] >= 3:
        tot_ins_3_more_children += ins_charges[i]

ave_ins_no_children = round(tot_ins_no_children/num_no_children,2)
ave_ins_3_more_children = round(tot_ins_3_more_children/num_3_more_children,2)   

print(f"The average insurance costs for an individual with no children:",ave_ins_no_children)
print(f"The average insurance costs for an individual with 3 or more children:",ave_ins_3_more_children)

The average insurance costs for an individual with no children: 12365.98
The average insurance costs for an individual with 3 or more children: 14576.0


We can see that the average insurance cost for an individual with no children is about $2,200 less than for an individual with 3 or more children.

Lastly, let's find out the average insurance charge for an individual in the healthy BMI range (18.5 - 24.9)

In [107]:
num_healthy_bmis = 0
tot_ins_charges_healthy_bmis = 0

for i in range(len(bmis)):
    if 18.5 <= bmis[i] <= 24.9:
        num_healthy_bmis += 1
        tot_ins_charges_healthy_bmis += ins_charges[i]


percent_healthy_bmis = round((num_healthy_bmis/len(bmis))*100, 2)
ave_ins_healthy_bmis = tot_ins_charges_healthy_bmis/num_healthy_bmis

print("The percentage of our data set who have a healthy BMI of between 18.5 and 24.9 is: " + str(percent_healthy_bmis) + "%")
print("The average insurance cost for an individual with a healthy BMI is: $", round(ave_ins_healthy_bmis))

The percentage of our data set who have a healthy BMI of between 18.5 and 24.9 is: 16.59%
The average insurance cost for an individual with a healthy BMI is: $ 10379


Only 16.59% of the data set has a healthy BMI!
Individuals with a healthy BMI are charged about $10,379 on average, which is below the overall average of about $13,270.

Conclusions:

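Of the factors examined here, whether someone smokes is the biggest contributor to high insurance costs: smokers are charged about $32,050 on average, compared with about $8,434 for non-smokers.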