# U.S. Medical Insurance Costs

### Initial Thoughts
The data is organised in a CSV file with the headings: Age, Sex, BMI, Children, Smoker, Region, Charges.
The dataset is clean: no missing data and no noticeable typos. There are 4 numerical variables and 3 categorical ones. Each column represents one characteristic, and each characteristic appears once.


### Scope

1. Average characteristics of people in the dataset (Average Age, Mode location, Average Age of those who have children)
2. May want to investigate what influences the cost of insurance
3. The relationship between smoking and BMI (Average BMI of smokers vs Average BMI of non-smokers)
4. Relationship between sex and BMI (Average BMI of Men vs Average BMI of women)

In [37]:
#Import modules
import csv

In [96]:
### TO USE NUMPY IF REQUIRED
###np_age_list = np.asarray(age_list).astype(float)
###print(np.mean(np_age_list))

39.20702541106129


In [135]:
with open(r"C:\Users\Josh\Documents\Projects\US Medical Costs\insurance.csv", newline = '') as CSVfile:
    # DictReader yields one dict per row, keyed by the column headings
    reader = csv.DictReader(CSVfile)
    age_list = []
    sex_list = []
    bmi_list = []
    children_list = []
    smoker_list = []
    region_list = []
    charges_list = []
    for row in reader:
        age_list.append(row["age"])
        sex_list.append(row["sex"])
        bmi_list.append(row["bmi"])
        children_list.append(row["children"])
        smoker_list.append(row["smoker"])
        region_list.append(row["region"])
        charges_list.append(row["charges"])
   

### Exploring the Dataset

In [112]:
#Number of People
num_people = len(age_list)
print("There are {people} people in this dataset.".format(people = num_people))

There are 1338 people in this dataset.


In [156]:
# Average Age
total_age = 0.0
minimum_age = 100000000
maximum_age = 0
age_list = [int(age) for age in age_list]
for age in age_list:
    total_age += float(age)
    if age>maximum_age:
        maximum_age = age
    if age < minimum_age:
        minimum_age = age
avg_age = total_age/num_people
print("The average age of people in this dataset is {age} years old.".format(age=round(avg_age,2)))
print("The youngest person is {min} years old.".format(min=minimum_age))
print("The oldest person is {max} years old.".format(max = maximum_age))

The average age of people in this dataset is 39.21 years old.
The youngest person is 18 years old.
The oldest person is 64 years old.
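
The manual loop above can be cross-checked with Python's built-ins. This is a minimal sketch on a small made-up list (`sample_ages` and `summarise` are illustrative names, not part of the notebook); with the real data, `age_list` would be passed in after the `int` conversion.

```python
from statistics import mean

# Hypothetical sample values, not taken from insurance.csv.
sample_ages = [19, 18, 28, 33, 32, 31]

def summarise(values):
    """Return (mean rounded to 2 d.p., minimum, maximum) of a non-empty list."""
    return round(mean(values), 2), min(values), max(values)

avg_age_check, youngest, oldest = summarise(sample_ages)
print(avg_age_check, youngest, oldest)
```

The same helper would apply unchanged to the BMI and charges lists once they are converted to `float`.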


In [102]:
#Most common location
unique_locations = {}
for location in region_list:
    if location not in unique_locations:
        unique_locations[location] = 1
    else:
        unique_locations[location] +=1
ranked_locations = sorted([(value, key.title()) for key,value in unique_locations.items()], reverse=True)

print("The most common location of people in this dataset is the {location}, with {people} people.".format(location = ranked_locations[0][-1], people = ranked_locations[0][0]))

The most common location of people in this dataset is the Southeast, with 364 people.
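
The counting loop above could also be written with `collections.Counter` from the standard library. A minimal sketch, assuming a small made-up `sample_regions` list in place of `region_list`:

```python
from collections import Counter

# Hypothetical sample values, not taken from insurance.csv.
sample_regions = ["southwest", "southeast", "southeast",
                  "northwest", "northeast", "southeast"]

# Counter builds the {region: count} dictionary in one step.
region_counts = Counter(sample_regions)

# most_common(1) returns a list holding the single (region, count) pair
# with the highest count.
top_region, top_count = region_counts.most_common(1)[0]
print("Most common region: {} ({} people)".format(top_region.title(), top_count))
```

One difference to be aware of: if two regions tie, `most_common` keeps the one seen first, whereas sorting `(count, name)` tuples in reverse picks the name that is last alphabetically.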


In [159]:
#Average BMI
total_bmi = 0.0
max_bmi = 0.0
min_bmi = 1000000000
bmi_list = [float(bmi) for bmi in bmi_list]
for bmi in bmi_list:
    total_bmi += bmi
    if bmi>max_bmi:
        max_bmi = bmi
    if bmi<min_bmi:
        min_bmi = bmi
avg_bmi = total_bmi/num_people
print("The average BMI of people in this dataset is {bmi}.".format(bmi = round(avg_bmi, 2)))
print("The lowest BMI of a person in this dataset is {bmi}.".format(bmi = min_bmi))
print("The highest BMI of a person in this dataset is {bmi}.".format(bmi=max_bmi))

The average BMI of people in this dataset is 30.66.
The lowest BMI of a person in this dataset is 15.96.
The highest BMI of a person in this dataset is 53.13.


In [114]:
#Number of smokers
num_smokers = 0
for smoker in smoker_list:
    if smoker =="yes":
        num_smokers +=1
num_non_smokers = num_people - num_smokers
perc_smokers = round(num_smokers/num_people * 100, 2)
perc_non_smokers = round(num_non_smokers/num_people * 100, 2)
print("There are {smokers} smokers ({smoker_perc}%) and {non_smokers} non-smokers ({non_perc}%) in this dataset.".format(smokers = num_smokers, non_smokers = num_non_smokers, smoker_perc = perc_smokers, non_perc = perc_non_smokers))

There are 274 smokers (20.48%) and 1064 non-smokers (79.52%) in this dataset.


In [256]:
# Number of children per person
total_children = 0
max_children = 0
num_children = {}
children_list = [int(children) for children in children_list]
for children in sorted(children_list):
    total_children += int(children)
    if children not in num_children:
        num_children[children] = 1
    else:
        num_children[children] += 1

#num_children = sorted([(key, value) for key, value in num_children.items()])
print("Each person has an average of {children} children.".format(children = round(total_children/num_people, 2)))
print("Frequency of different numbers of children: {children}".format(children=num_children))



Each person has an average of 1.09 children.
Frequency of different numbers of children: {0: 574, 1: 324, 2: 240, 3: 157, 4: 25, 5: 18}
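
The scope also lists the average age of those who have children, which is not computed in the cells here. One possible sketch, assuming parallel lists of ages and child counts like `age_list` and `children_list`; the function name and the sample values are hypothetical:

```python
def average_age_of_parents(ages, children_counts):
    """Mean age of people with at least one child, or None if there are none."""
    parent_ages = [age for age, kids in zip(ages, children_counts) if kids > 0]
    if not parent_ages:
        return None
    return sum(parent_ages) / len(parent_ages)

# Hypothetical sample values, not taken from insurance.csv.
sample_ages = [19, 18, 28, 33]
sample_children = [0, 1, 3, 0]
print(average_age_of_parents(sample_ages, sample_children))
```

Returning `None` for an empty group avoids a division by zero if a filtered subset happens to contain nobody with children.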


In [115]:
#Proportions of males to females
num_male = 0
num_female = 0
for sex in sex_list:
    if sex == "male":
        num_male+=1
    else:
        num_female +=1
perc_men = round(num_male/num_people * 100,2)
perc_women = round(num_female/num_people * 100,2)
print("There are {men} males ({mPerc}%) and {females} females ({fPerc}%) in this dataset.".format(men=num_male, females=num_female, mPerc = perc_men, fPerc = perc_women))

There are 676 males (50.52%) and 662 females (49.48%) in this dataset.


In [164]:
#Average insurance cost
total_cost = 0.0
charges_list = [float(charge) for charge in charges_list]
max_charge = 0
min_charge = 1e20
for cost in charges_list:
    total_cost += cost
    if cost > max_charge:
        max_charge = cost
    if cost < min_charge:
        min_charge = cost
avg_cost = round(total_cost/num_people, 2)
print("The average insurance cost is ${cost} per person.".format(cost = avg_cost))
print("The lowest insurance cost is ${cost}.".format(cost = round(min_charge, 2)))
print("The highest insurance cost is ${cost}.".format(cost = round(max_charge, 2)))

The average insurance cost is $13270.42 per person.
The lowest insurance cost is $1121.87.
The highest insurance cost is $63770.43.


In [170]:
#Different Ages
tens_charges = []
twenties_charges = []
thirties_charges = []
forties_charges = []
fifties_charges = []
sixties_charges = []
tens_total = 0.0
twenties_total = 0.0
thirties_total = 0.0
forties_total = 0.0
fifties_total = 0.0
sixties_total = 0.0

for index in range(num_people):
    age = age_list[index]
    cost = charges_list[index]
    if age >=10 and age < 20:
        tens_charges.append(cost)
        tens_total+=cost
    elif age >=20 and age <30:
        twenties_charges.append(cost)
        twenties_total+=cost
    elif age >=30 and age<40:
        thirties_charges.append(cost)
        thirties_total+=cost
    elif age >=40 and age < 50:
        forties_charges.append(cost)
        forties_total+=cost
    elif age >=50 and age <60:
        fifties_charges.append(cost)
        fifties_total +=cost
    elif age>=60 and age < 70:
        sixties_charges.append(cost)
        sixties_total+=cost
    else:
        print("Error with a person aged {}".format(age))
tens_average = round(tens_total/len(tens_charges), 2)
twenties_average = round(twenties_total/len(twenties_charges), 2)
thirties_average = round(thirties_total/len(thirties_charges), 2)
forties_average = round(forties_total/len(forties_charges), 2)
fifties_average = round(fifties_total/len(fifties_charges), 2)
sixties_average = round(sixties_total/len(sixties_charges), 2)
print("The average insurance cost for someone in their 10's is ${}.".format(tens_average))
print("The average insurance cost for someone in their 20's is ${}.".format(twenties_average))
print("The average insurance cost for someone in their 30's is ${}.".format(thirties_average))
print("The average insurance cost for someone in their 40's is ${}.".format(forties_average))
print("The average insurance cost for someone in their 50's is ${}.".format(fifties_average))
print("The average insurance cost for someone in their 60's is ${}.".format(sixties_average))

The average insurance cost for someone in their 10's is $8407.35.
The average insurance cost for someone in their 20's is $9561.75.
The average insurance cost for someone in their 30's is $11738.78.
The average insurance cost for someone in their 40's is $14399.2.
The average insurance cost for someone in their 50's is $16495.23.
The average insurance cost for someone in their 60's is $21248.02.


So we can see that insurance cost increases with age, as we would expect.
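
The per-decade averages can also be computed without one variable per age bracket. A minimal sketch, assuming parallel lists of integer ages and float charges; `average_cost_by_decade` and the sample values are hypothetical and not part of the notebook:

```python
from collections import defaultdict

def average_cost_by_decade(ages, charges):
    """Map each decade (10, 20, ...) to the mean charge of people in that decade."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for age, cost in zip(ages, charges):
        decade = (age // 10) * 10  # e.g. 47 -> 40
        totals[decade] += cost
        counts[decade] += 1
    return {decade: round(totals[decade] / counts[decade], 2)
            for decade in sorted(totals)}

# Hypothetical sample values, not taken from insurance.csv.
sample_ages = [18, 19, 25, 41, 47]
sample_charges = [1000.0, 2000.0, 3000.0, 4000.0, 6000.0]
print(average_cost_by_decade(sample_ages, sample_charges))
```

Because the brackets are derived from the data, a decade with no people simply does not appear in the result, so no division by zero can occur.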

In [173]:
male_charges = []
male_total = 0.0
female_charges = []
female_total = 0.0
for index in range(num_people):
    sex = sex_list[index]
    cost = charges_list[index]
    if sex == "male":
        male_charges.append(cost)
        male_total+=cost
    elif sex == "female":
        female_charges.append(cost)
        female_total+=cost
    else:
        print("Error with index {}: sex is {}.".format(index, sex))
avg_male = round(male_total/len(male_charges), 2)
avg_female = round(female_total/len(female_charges), 2)
print("The average insurance cost for a male is ${}.".format(avg_male))
print("The average insurance cost for a female is ${}.".format(avg_female))

The average insurance cost for a male is $13956.75.
The average insurance cost for a female is $12569.58.


Females have a lower average insurance cost than males! Could this be causal, or are there other variables at work?
Do males have a higher BMI? More children? Smoke more? Are they older? Do they live in different regions?

In [246]:
male_bmi = []
male_bmi_tot = 0.0
female_bmi = []
female_bmi_tot = 0.0
male_kids = []
male_kids_tot = 0.0
female_kids = []
female_kids_tot = 0.0
male_smokers_costs = 0.0
male_smokers = 0
male_nonsmoker_cost = 0.0
female_smokers_costs= 0.0
female_smokers = 0
female_nonsmoker_cost = 0.0
male_age = []
male_age_tot = 0.0
female_age = []
female_age_tot = 0.0
male_region = []
female_region = []
n=0
m=0
for index in range(num_people):
    sex = sex_list[index]
    region = region_list[index]
    bmi = bmi_list[index]
    children = children_list[index]
    age = age_list[index]
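    smoker = smoker_list[index]
    # Look up this person's charge; otherwise `cost` is left over from an earlier cell
    cost = charges_list[index]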
    if sex == 'male':
        male_bmi.append(bmi)
        male_kids.append(children)
        male_age.append(age)
        male_region.append(region)
        male_age_tot+=age
        male_kids_tot+=children
        male_bmi_tot +=bmi
        if smoker == 'yes':
            male_smokers +=1
            male_smokers_costs += cost
            n+=1
        elif smoker == 'no':
            male_nonsmoker_cost += cost
            m+=1
    elif sex == 'female':
        female_bmi.append(bmi)
        female_kids.append(children)
        female_age.append(age)
        female_region.append(region)
        female_age_tot+=age
        female_kids_tot+=children
        female_bmi_tot +=bmi
        if smoker == 'yes':
            female_smokers +=1
            female_smokers_costs += cost
        elif smoker == 'no':
            female_nonsmoker_cost += cost
    else:
        print("Issue occurred at index {}.".format(index))

avg_bmi_m = round(male_bmi_tot/num_male, 2)
avg_age_m = round(male_age_tot/num_male, 2)
avg_children_m = round(male_kids_tot/num_male, 2)
avg_bmi_f = round(female_bmi_tot/num_female, 2)
avg_age_f = round(female_age_tot/num_female, 2)
avg_children_f = round(female_kids_tot/num_female, 2)

unique_locations_males = {}
unique_locations_females = {}

for location in male_region:
    if location not in unique_locations_males:
        unique_locations_males[location] = 1
    else:
        unique_locations_males[location] +=1

for location in female_region:
    if location not in unique_locations_females:
        unique_locations_females[location] = 1
    else:
        unique_locations_females[location] +=1

print("Average BMI for males is {male} and for females is {females}.".format(male = avg_bmi_m, females = avg_bmi_f))
print("Average age for males is {male} and for females is {females}.".format(male = avg_age_m, females = avg_age_f))
print("Average number of children for males is {male} and for females is {females}.".format(male = avg_children_m, females = avg_children_f))
# Build (region, count, percentage) tuples in the same region order for both sexes
location_males = []
for location in unique_locations:
    location_males.append((location.title(), unique_locations_males[location], str(round(unique_locations_males[location]/num_male*100, 2))+'%'))

location_females = []
for location in unique_locations:
    location_females.append((location.title(), unique_locations_females[location], str(round(unique_locations_females[location]/num_female*100, 2))+'%'))

print("Location, number, and percentage for males: {}".format(location_males))
print("Location, number, and percentage for females: {}".format(location_females))

pc_smokers_m = round(male_smokers / num_male * 100, 2) 
pc_smokers_f = round(female_smokers / num_female * 100, 2)
print("{}% of males smoke and {}% of females smoke.".format(pc_smokers_m, pc_smokers_f))

Average BMI for males is 30.94 and for females is 30.38.
Average age for males is 38.92 and for females is 39.5.
Average number of children for males is 1.12 and for females is 1.07.
Location, number, and percentage for males: [('Southwest', 163, '24.11%'), ('Southeast', 189, '27.96%'), ('Northwest', 161, '23.82%'), ('Northeast', 163, '24.11%')]
Location, number, and percentage for females: [('Southwest', 162, '24.47%'), ('Southeast', 175, '26.44%'), ('Northwest', 164, '24.77%'), ('Northeast', 161, '24.32%')]
23.52% of males smoke and 17.37% of females smoke.


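Males and females are very similar in average BMI, age, number of children and region, but a higher proportion of males smoke (23.52% vs 17.37%). Therefore the difference in average cost may be down to more males smoking than females!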

In [255]:
smoker_total = 0.0
nonsmoker_total = 0.0
m_smoker_total = 0.0
f_smoker_total = 0.0
m_smokers = 0
f_smokers = 0
m_nonsmoker_total = 0.0
f_nonsmoker_total = 0.0
m_nonsmokers = 0
f_nonsmokers = 0
for index in range(num_people):
    smoker = smoker_list[index]
    cost = charges_list[index]
    sex = sex_list[index]
    if smoker =='yes':
        smoker_total += cost
        if sex=='male':
            m_smoker_total+=cost
            m_smokers +=1
        elif sex =='female':
            f_smoker_total += cost
            f_smokers +=1
    elif smoker == 'no':
        nonsmoker_total+=cost
        if sex=='male':
            m_nonsmoker_total += cost
            m_nonsmokers+=1
        elif sex == 'female':
            f_nonsmoker_total+=cost
            f_nonsmokers+=1
    else:
        print("Issue arose at index {}.".format(index))
avg_smoker_cost = round(smoker_total/num_smokers, 2)
avg_nonsmoker_cost = round(nonsmoker_total/num_non_smokers, 2)
avg_m_smoker_cost = round(m_smoker_total/m_smokers, 2)
avg_f_smoker_cost = round(f_smoker_total/f_smokers, 2)
avg_m_nonsmoker_cost = round(m_nonsmoker_total/m_nonsmokers, 2)
avg_f_nonsmoker_cost = round(f_nonsmoker_total/f_nonsmokers, 2)
print("The average insurance cost is ${} for a smoker and ${} for a non-smoker.".format(avg_smoker_cost, avg_nonsmoker_cost))
print("The average cost is ${} for a male smoker and ${} for a male non-smoker.".format(avg_m_smoker_cost, avg_m_nonsmoker_cost))
print("The average cost is ${} for a female smoker and ${} for a female non-smoker.".format(avg_f_smoker_cost, avg_f_nonsmoker_cost))

The average insurance cost is $32050.23 for a smoker and $8434.27 for a non-smoker.
The average cost is $33042.01 for a male smoker and $8087.2 for a male non-smoker.
The average cost is $30679.0 for a female smoker and $8762.3 for a female non-smoker.


Smoking appears to lead to a massive increase in insurance costs for males and females alike. However, while the average cost for a male smoker is higher than the average for a female smoker, the average for a male non-smoker is lower than for a female non-smoker!
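
To check how far smoking rates account for the gap between the sexes, each overall average can be rebuilt as a weighted mix of the smoker and non-smoker averages. The sketch below uses the figures printed in this notebook; the smoker counts (159 males, 115 females) are an assumption derived from the rounded percentages, and `blended_average` is a hypothetical helper.

```python
# Figures printed earlier in the notebook.
num_male, num_female = 676, 662
avg_m_smoker, avg_m_nonsmoker = 33042.01, 8087.2
avg_f_smoker, avg_f_nonsmoker = 30679.0, 8762.3

# Assumption: smoker counts derived from the rounded percentages
# (23.52% of 676 and 17.37% of 662); they sum to the 274 smokers reported.
male_smokers, female_smokers = 159, 115

def blended_average(n_smokers, n_total, avg_smoker, avg_nonsmoker):
    """Overall mean cost as a weighted mix of smoker and non-smoker means."""
    share = n_smokers / n_total
    return share * avg_smoker + (1 - share) * avg_nonsmoker

male_avg = blended_average(male_smokers, num_male, avg_m_smoker, avg_m_nonsmoker)
female_avg = blended_average(female_smokers, num_female, avg_f_smoker, avg_f_nonsmoker)

# Counterfactual: male costs, but with the female smoking rate.
male_at_female_rate = blended_average(female_smokers, num_female,
                                      avg_m_smoker, avg_m_nonsmoker)
print(round(male_avg, 2), round(female_avg, 2), round(male_at_female_rate, 2))
```

Under these assumptions the blended figures land within a cent of the overall averages reported earlier ($13956.75 and $12569.58). The counterfactual male average comes out below the female average, which is consistent with the smoking rate, rather than sex itself, driving most of the gap. This is a rough standardisation, not a causal analysis.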

### Investigating smokers

In [260]:
bmi_smoker = 0.0
bmi_nonsmoker = 0.0
age_smoker = 0.0
age_nonsmoker = 0.0
for index in range(num_people):
    smoker = smoker_list[index]
    bmi = bmi_list[index]
    age = age_list[index]
    if smoker=='yes':
        bmi_smoker+=bmi
        age_smoker += age
    elif smoker == 'no':
        bmi_nonsmoker += bmi
        age_nonsmoker += age
    else:
        print("Error on index {}.".format(index))
avg_smoker_bmi = round(bmi_smoker/num_smokers, 2)
avg_nonsmoker_bmi = round(bmi_nonsmoker/num_non_smokers, 2)

avg_smoker_age = round(age_smoker/num_smokers, 2)
avg_nonsmoker_age = round(age_nonsmoker/num_non_smokers, 2)
print("The average BMI is {} for a smoker and {} for a non-smoker.".format(avg_smoker_bmi, avg_nonsmoker_bmi))
print("The average age is {} for smokers and {} for non-smokers.".format(avg_smoker_age, avg_nonsmoker_age))

The average BMI is 30.71 for a smoker and 30.65 for a non-smoker.
The average age is 38.51 for smokers and 39.39 for non-smokers.


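No big difference! Smokers and non-smokers have almost the same average BMI and age, so neither is likely to explain the gap in insurance cost between the two groups.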