# U.S. Medical Insurance Costs

##### In this project, I look at some key attributes of individual insurance policyholders and analyze the accompanying data to determine which attributes are greater factors in determining insurance costs. The questions I aim to answer are as follows:

* Age: Which age group pays the most in medical bills?
* Sex: Does gender have an effect on costs?
* Health: Do smokers typically pay more than non-smokers? What about high vs low BMI?
* Children: What is the average cost increase per additional child?
* Region: Does any region pay significantly more than others? If so, why? Are certain regions more prone to health issues or unhealthy habits?


In [1]:
#import csv library
import csv

### To start, after importing the csv library, empty lists are created in order to store every instance of each attribute

In [2]:
ages = []
sexes = []
bmis = []
nums_of_children = []
smoker_statuses = []
regions = []
costs = []

In [3]:
#defining a function to read through each row from insurance.csv
def populate_lists(list_name, csv_file, col_name):
    with open(csv_file) as csv_info:
        csv_dict = csv.DictReader(csv_info)
        #add each attribute to its designated list
        for row in csv_dict:
            list_name.append(row[col_name])

### Above, a function is created to populate each attribute list. The function reads each row of the provided csv file and appends the value in the given column to the corresponding list defined earlier.

### In the next step below, the function is run for each of the 7 lists. As a result, the lists are now fully populated

In [4]:
populate_lists(ages, 'insurance.csv', 'age')
populate_lists(sexes, 'insurance.csv', 'sex')
populate_lists(bmis, 'insurance.csv', 'bmi')
populate_lists(nums_of_children, 'insurance.csv', 'children')
populate_lists(smoker_statuses, 'insurance.csv', 'smoker')
populate_lists(regions, 'insurance.csv', 'region')
populate_lists(costs, 'insurance.csv', 'charges')
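
### Each call above reopens and rereads the file. As a minimal sketch of an alternative, the columns could be collected in a single pass. The helper name `read_columns` and the two in-memory rows are assumptions for illustration, not part of the project

```python
import csv
import io

# Small in-memory stand-in for insurance.csv (illustrative rows only)
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.92\n"
    "18,male,33.77,1,no,southeast,1725.55\n"
)

def read_columns(csv_file, col_names):
    """Return {column name: list of values} after one pass over the file."""
    columns = {name: [] for name in col_names}
    for row in csv.DictReader(csv_file):
        for name in col_names:
            columns[name].append(row[name])
    return columns

columns = read_columns(sample_csv, ["age", "sex", "charges"])
print(columns["age"])      # ['19', '18']
print(columns["charges"])  # ['16884.92', '1725.55']
```

### As with the lists above, every value is still a string at this point and needs converting before any arithmetic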

### Now that each column from the csv is stored in a list, we can construct a dictionary that numbers the rows starting from 1 and assigns each row its own dictionary

In [5]:
#create a dictionary holding each individual patient's dictionary
patient_dict = {}
#use the list length instead of a hard-coded row count
for i in range(len(ages)):
    patient_dict[i + 1] = {'Age': ages[i], 'Sex': sexes[i], 'BMI': bmis[i], 'Children': nums_of_children[i], 'Smoker': smoker_statuses[i], 'Region': regions[i], 'Charges': costs[i]}
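
### The same row-numbered structure can be sketched with `zip` and `enumerate`, which avoids indexing by position. The `sample_` lists and their values below are made up for illustration and use only three of the seven attributes

```python
# Short illustrative lists standing in for the populated column lists
sample_ages = ['19', '18', '28']
sample_sexes = ['female', 'male', 'male']
sample_costs = ['16884.92', '1725.55', '4449.46']

# zip pairs up the i-th value of every list; enumerate numbers the rows from 1
sample_patients = {}
rows = zip(sample_ages, sample_sexes, sample_costs)
for row_num, (age, sex, cost) in enumerate(rows, start=1):
    sample_patients[row_num] = {'Age': age, 'Sex': sex, 'Charges': cost}

print(len(sample_patients))  # 3
print(sample_patients[1])    # {'Age': '19', 'Sex': 'female', 'Charges': '16884.92'}
```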
    

### With everything organized, we can begin to analyze the data to answer each of the questions introduced. I start below by collecting each unique age from the ages list and sorting the result. This helps determine the range of age groups in the data

In [6]:
#define a list that finds how many unique ages are recorded
unique_ages = []
for age in ages:
    if age not in unique_ages:
        unique_ages.append(age)
#sort list to easily identify age groups
unique_ages = sorted(unique_ages)
print(unique_ages)

['18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64']
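
### A set gives the same list of unique values more directly. A minimal sketch, assuming a made-up `sample_ages` list; converting to `int` first makes the sort numeric rather than alphabetical

```python
# Illustrative ages as they would come out of the csv (strings, with repeats)
sample_ages = ['19', '18', '28', '19', '33', '18']

# A set comprehension drops duplicates; sorted() returns an ordered list
unique_sample_ages = sorted({int(age) for age in sample_ages})
print(unique_sample_ages)  # [18, 19, 28, 33]
```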


### Now I set up a separate dictionary for each age group that captures how many individuals are in that group and what their average charge is

In [7]:

#declare dictionaries for each age group, containing count and average
teens = {'Count': 0, 'Average': 0}
twenties = {'Count': 0, 'Average': 0}
thirties = {'Count': 0, 'Average': 0}
forties = {'Count': 0, 'Average': 0}
fifties = {'Count': 0, 'Average': 0}
sixties = {'Count': 0, 'Average': 0}
teen_total = 0
twenties_total = 0
thirties_total = 0
forties_total = 0
fifties_total = 0
sixties_total = 0
#loop to count individuals in age groups and determine average cost
for i in range(1, 1339):
    current_age = int(patient_dict[i]['Age'])
    if current_age < 20:
        teens['Count'] +=1
        teen_total += float(patient_dict[i]['Charges'])
        teens['Average'] = round((teen_total / teens['Count']), 2)
    elif current_age >= 20 and current_age <30:
        twenties['Count'] += 1
        twenties_total += float(patient_dict[i]['Charges'])
        twenties['Average'] = round((twenties_total / twenties['Count']), 2)
    elif current_age >= 30 and current_age < 40:
        thirties['Count'] += 1
        thirties_total += float(patient_dict[i]['Charges'])
        thirties['Average'] = round((thirties_total / thirties['Count']), 2)
    elif current_age >= 40 and current_age < 50:
        forties['Count'] += 1
        forties_total += float(patient_dict[i]['Charges'])
        forties['Average'] = round((forties_total / forties['Count']), 2)
    elif current_age >= 50 and current_age < 60:
        fifties['Count'] += 1
        fifties_total += float(patient_dict[i]['Charges'])
        fifties['Average'] = round((fifties_total / fifties['Count']), 2)
    elif current_age >= 60:
        sixties['Count'] += 1
        sixties_total += float(patient_dict[i]['Charges'])
        sixties['Average'] = round((sixties_total / sixties['Count']), 2)
#print out each dictionary for results        
print('Ages 18-19:', teens)
print('Ages 20-29:', twenties)
print('Ages 30-39:', thirties)
print('Ages 40-49:', forties)
print('Ages 50-59:', fifties)
print('Ages 60-64:', sixties)


Ages 18-19: {'Count': 137, 'Average': 8407.35}
Ages 20-29: {'Count': 280, 'Average': 9561.75}
Ages 30-39: {'Count': 257, 'Average': 11738.78}
Ages 40-49: {'Count': 279, 'Average': 14399.2}
Ages 50-59: {'Count': 271, 'Average': 16495.23}
Ages 60-64: {'Count': 114, 'Average': 21248.02}


### From these results, we can conclude that average medical charges rise with each age group, with the largest increase (about 4,750 dollars) between the 50s and the 60s
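
### The age grouping above can also be sketched without one branch per decade, using integer division to find the decade. The `decade_label` helper and the (age, charge) pairs are assumptions for illustration, and the average is computed once after the loop rather than on every row

```python
from collections import defaultdict

# (age, charge) pairs with made-up values
sample = [(18, 1000.0), (19, 3000.0), (25, 4000.0), (31, 5000.0), (64, 9000.0)]

def decade_label(age):
    """Map an age to the start of its decade, e.g. 18 -> 10, 25 -> 20."""
    return (age // 10) * 10

totals = defaultdict(float)
counts = defaultdict(int)
for age, charge in sample:
    label = decade_label(age)
    totals[label] += charge
    counts[label] += 1

# Compute each average once, after all rows have been counted
averages = {label: round(totals[label] / counts[label], 2) for label in sorted(totals)}
print(averages)  # {10: 2000.0, 20: 4000.0, 30: 5000.0, 60: 9000.0}
```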

### Using the same idea as above, we can also determine which gender pays more on average

In [8]:
males = {'Count': 0, 'Average': 0}
females = {'Count': 0, 'Average': 0}
male_total = 0
female_total = 0
for i in range(1,1339):
    if patient_dict[i]['Sex'] == 'male':
        males['Count'] += 1
        male_total += float(patient_dict[i]['Charges'])
        males['Average'] = round((male_total / males['Count']), 2)
    elif patient_dict[i]['Sex'] == 'female':
        females['Count'] += 1
        female_total += float(patient_dict[i]['Charges'])
        females['Average'] = round((female_total / females['Count']), 2)

print('Males:', males)
print('Females:', females)


Males: {'Count': 676, 'Average': 13956.75}
Females: {'Count': 662, 'Average': 12569.58}


### From above, we can conclude that males, on average, pay about 11% more in medical costs than females

### Again, we can use the same approach to determine what effect smoking has on medical costs

In [9]:
#define dictionaries and keys with empty values to start
non_smokers = {'Count': 0, 'Total': 0, 'Average': 0}
smokers = {'Count': 0, 'Total': 0, 'Average': 0}
#define loop to check each patient and update totals from above as necessary
for i in range(1,1339):
    if patient_dict[i]['Smoker'] == 'no':
        non_smokers['Count'] += 1
        non_smokers['Total'] += float(patient_dict[i]['Charges'])
        non_smokers['Average'] = round((non_smokers['Total'] / non_smokers['Count']), 2)
    else:
        smokers['Count'] += 1
        smokers['Total'] += float(patient_dict[i]['Charges'])
        smokers['Average'] = round((smokers['Total'] / smokers['Count']), 2)
        
#print to see results 
print('Non-Smokers:', non_smokers)
print('Smokers:', smokers)

Non-Smokers: {'Count': 1064, 'Total': 8974061.468918996, 'Average': 8434.27}
Smokers: {'Count': 274, 'Total': 8781763.52184, 'Average': 32050.23}


In [10]:
#calculate the difference on avg that smokers pay over non-smokers
smoker_additional = round(((smokers['Average'] - non_smokers['Average']) / non_smokers['Average']) * 100, 0)
print(('Smokers, on average, pay {}% more on medical costs than non-smokers').format(smoker_additional))

Smokers, on average, pay 280.0% more on medical costs than non-smokers


### This answers the next question: it is no surprise that smokers pay significantly more on average than non-smokers. Next, we'll check how an individual's BMI may affect their medical costs.

In [11]:
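#check the min and max values of BMI to get a sense of range
#convert to float first, since the csv values are stored as strings
print(max(float(bmi) for bmi in bmis))
print(min(float(bmi) for bmi in bmis))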

53.13
15.96
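
### Values read by `csv.DictReader` are strings, and strings are compared character by character. In general this only matches numeric order when every value has the same number of digits before the decimal point. A small sketch with made-up values shows the difference

```python
# Made-up BMI strings; '9.9' has one digit before the decimal point
sample_bmis = ['9.9', '27.9', '31.5']

# Compared as strings, '9.9' wins because '9' sorts after '2' and '3'
string_max = max(sample_bmis)
print(string_max)  # 9.9

# Compared as numbers, the result is the expected one
numeric_max = max(float(b) for b in sample_bmis)
numeric_min = min(float(b) for b in sample_bmis)
print(numeric_max, numeric_min)  # 31.5 9.9
```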


In [12]:
#define dictionaries over 4 ranges of BMI
bmi_1 = {'Count': 0, 'Total': 0, 'Average': 0}
bmi_2 = {'Count': 0, 'Total': 0, 'Average': 0}
bmi_3 = {'Count': 0, 'Total': 0, 'Average': 0}
bmi_4 = {'Count': 0, 'Total': 0, 'Average': 0}
#loop through each patient to update totals for the BMI groups they are in
for i in range(1,1339):
    if float(patient_dict[i]['BMI']) < 25:
        bmi_1['Count'] += 1
        bmi_1['Total'] += float(patient_dict[i]['Charges'])
        bmi_1['Average'] = round((bmi_1['Total'] / bmi_1['Count']), 2)
    elif float(patient_dict[i]['BMI']) >= 25 and float(patient_dict[i]['BMI']) < 35:
        bmi_2['Count'] += 1
        bmi_2['Total'] += float(patient_dict[i]['Charges'])
        bmi_2['Average'] = round((bmi_2['Total'] / bmi_2['Count']), 2)
    elif float(patient_dict[i]['BMI']) >= 35 and float(patient_dict[i]['BMI']) < 45:
        bmi_3['Count'] += 1
        bmi_3['Total'] += float(patient_dict[i]['Charges'])
        bmi_3['Average'] = round((bmi_3['Total'] / bmi_3['Count']), 2)
    elif float(patient_dict[i]['BMI']) >= 45:
        bmi_4['Count'] += 1
        bmi_4['Total'] += float(patient_dict[i]['Charges'])
        bmi_4['Average'] = round((bmi_4['Total'] / bmi_4['Count']), 2)

#print to check results         
print('BMI less than 25:', bmi_1)
print('BMI between 25 and 34:', bmi_2)
print('BMI between 35 and 44:', bmi_3)
print('BMI 45 and above:', bmi_4)

BMI less than 25: {'Count': 245, 'Total': 2519144.996220001, 'Average': 10282.22}
BMI between 25 and 34: {'Count': 777, 'Total': 9879271.731198989, 'Average': 12714.64}
BMI between 35 and 44: {'Count': 296, 'Total': 5006449.728329997, 'Average': 16913.68}
BMI 45 and above: {'Count': 20, 'Total': 350958.53501, 'Average': 17547.93}


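### From the results above, it does appear that the higher the BMI, the higher the costs. The largest jump in costs (about 4,200 dollars) is from the 25-34 BMI group to the 35-44 group. The group with a BMI of 45 and above contains only 20 patients, so its average is less reliable than the others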
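
### The chain of range checks used for the BMI groups can be sketched with the standard `bisect` module, which finds where a value falls among sorted thresholds. The names `thresholds`, `labels`, and `bmi_group` are assumptions for illustration; the cut-offs 25, 35, and 45 are the ones used above

```python
from bisect import bisect_right

# Same cut-offs as the groups above: <25, 25 to <35, 35 to <45, 45 and up
thresholds = [25, 35, 45]
labels = ['under 25', '25 to under 35', '35 to under 45', '45 and above']

def bmi_group(bmi):
    """Return the label of the group a BMI value falls into."""
    # bisect_right counts how many thresholds are <= bmi
    return labels[bisect_right(thresholds, bmi)]

for value in (24.99, 25.0, 34.99, 45.0):
    print(value, '->', bmi_group(value))
```

### A value exactly on a cut-off, such as 25.0, lands in the higher group, which matches the `>=` comparisons in the loop above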

### Next we take a look at additional costs per child

In [13]:
#check max num of children to get range
#convert to int first, since the csv values are stored as strings
print(max(int(num) for num in nums_of_children))

5


In [14]:
#define dictionaries for each number of children
no_children = {'Count': 0, 'Total': 0, 'Average': 0}
one_child = {'Count': 0, 'Total': 0, 'Average': 0}
two_children = {'Count': 0, 'Total': 0, 'Average': 0}
three_children = {'Count': 0, 'Total': 0, 'Average': 0}
four_children = {'Count': 0, 'Total': 0, 'Average': 0}
five_children = {'Count': 0, 'Total': 0, 'Average': 0}
#define function to loop through patients
def children_cost(child_dict, child_num):
    for i in range(1,1339):
        child_count = int(patient_dict[i]['Children'])
        if child_count == child_num:
            child_dict['Count'] += 1
            child_dict['Total'] += float(patient_dict[i]['Charges'])
            child_dict['Average'] = round((child_dict['Total'] / child_dict['Count']), 2)
    return child_dict

#run the function for each dictionary and print the results
print('No children:', children_cost(no_children, 0))
print('One child:', children_cost(one_child, 1))
print('Two children:', children_cost(two_children, 2))
print('Three children:', children_cost(three_children, 3))
print('Four children:', children_cost(four_children, 4))
print('Five children:', children_cost(five_children, 5))

No children: {'Count': 574, 'Total': 7098069.995338997, 'Average': 12365.98}
One child: {'Count': 324, 'Total': 4124899.673449997, 'Average': 12731.17}
Two children: {'Count': 240, 'Total': 3617655.296149999, 'Average': 15073.56}
Three children: {'Count': 157, 'Total': 2410784.983589999, 'Average': 15355.32}
Four children: {'Count': 25, 'Total': 346266.40777999995, 'Average': 13850.66}
Five children: {'Count': 18, 'Total': 158148.63445, 'Average': 8786.04}


### The average cost rises with each additional child up to three, but not evenly: by about 365 dollars from zero to one child, about 2,342 dollars from one to two, and about 282 dollars from two to three. The pattern breaks down once the child count reaches 4 or more. Those averages are inconclusive, since the sample sizes (25 and 18 patients) are far smaller than the others
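
### The change from one child count to the next can be checked directly from the averages printed above. A minimal sketch; the list name `child_averages` is an assumption, and the values are copied from the output

```python
# Average charges for 0 through 5 children, copied from the printed results
child_averages = [12365.98, 12731.17, 15073.56, 15355.32, 13850.66, 8786.04]

# Pair each average with the next one and take the difference
steps = [round(b - a, 2) for a, b in zip(child_averages, child_averages[1:])]

for child_num, step in enumerate(steps, start=1):
    print('Child', child_num, 'changes the average by', step)
```

### The steps come out to roughly +365, +2,342, +282, -1,505, and -5,065 dollars, so the per-child change varies widely across groups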

### Lastly, we will take a look at which regions are paying the most for their medical bills on average

In [15]:
#first double-check to see which regions are defined in the dataset
unique_regions = []
for region in regions:
    if region not in unique_regions:
        unique_regions.append(region)
print(unique_regions)

['southwest', 'southeast', 'northwest', 'northeast']


In [16]:
#create dictionaries for each region
southwest_reg = {'Count': 0, 'Total': 0, 'Average': 0}
southeast_reg = {'Count': 0, 'Total': 0, 'Average': 0}
northwest_reg = {'Count': 0, 'Total': 0, 'Average': 0}
northeast_reg = {'Count': 0, 'Total': 0, 'Average': 0}
#define function to loop through patients and update region totals
def costs_by_region(region_dict, region):
    for i in range(1,1339):
        current_reg = patient_dict[i]['Region']
        if current_reg == region:
            region_dict['Count'] += 1
            region_dict['Total'] += float(patient_dict[i]['Charges'])
            region_dict['Average'] = round((region_dict['Total'] / region_dict['Count']), 2)
    return region_dict

#run function for each region and print results
print('Southwest Region:', costs_by_region(southwest_reg, 'southwest'))
print('Southeast Region:', costs_by_region(southeast_reg, 'southeast'))
print('Northwest Region:', costs_by_region(northwest_reg, 'northwest'))
print('Northeast Region:', costs_by_region(northeast_reg, 'northeast'))            
    

Southwest Region: {'Count': 325, 'Total': 4012754.647620001, 'Average': 12346.94}
Southeast Region: {'Count': 364, 'Total': 5363689.763290002, 'Average': 14735.41}
Northwest Region: {'Count': 325, 'Total': 4035711.9965399993, 'Average': 12417.58}
Northeast Region: {'Count': 324, 'Total': 4343668.583308999, 'Average': 13406.38}
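
### The function above scans every patient once per region. As a sketch of an alternative, charges can be grouped by region in a single pass and averaged afterwards with `statistics.mean`. The `sample_patients` records and their values are made up for illustration

```python
from collections import defaultdict
from statistics import mean

# Made-up records shaped like the entries of patient_dict
sample_patients = [
    {'Region': 'southwest', 'Charges': '1000.0'},
    {'Region': 'southeast', 'Charges': '5000.0'},
    {'Region': 'southwest', 'Charges': '3000.0'},
]

# One pass: collect the charges of each region in its own list
charges_by_region = defaultdict(list)
for patient in sample_patients:
    charges_by_region[patient['Region']].append(float(patient['Charges']))

# Average each list once all patients have been grouped
region_averages = {
    region: round(mean(charges), 2)
    for region, charges in charges_by_region.items()
}
print(region_averages)  # {'southwest': 2000.0, 'southeast': 5000.0}
```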


### Now that we can see the Southeast region pays more on average than other regions, we can dig a little deeper. Let's take a look at the share of smokers and the average BMI of each region to determine if those could be a factor. We can also check if any region typically has a higher child count than others

In [17]:
#define dictionaries for region stats
sw_stats = {'Smokers': 0, 'Avg BMI': 0, 'Avg Child Count': 0}
se_stats = {'Smokers': 0, 'Avg BMI': 0, 'Avg Child Count': 0}
nw_stats = {'Smokers': 0, 'Avg BMI': 0, 'Avg Child Count': 0}
ne_stats = {'Smokers': 0, 'Avg BMI': 0, 'Avg Child Count': 0}
#define function to loop through patient dictionary and update stats
def stats_by_region(region_dict, region):
    total_bmi = 0
    total_count = 0
    total_child = 0
    smoker_count = 0
    for i in range(1,1339):
        current_reg = patient_dict[i]['Region']
        current_smoke = patient_dict[i]['Smoker']
        current_bmi = float(patient_dict[i]['BMI'])
        current_child = float(patient_dict[i]['Children'])
        if current_reg == region:
            total_count += 1
            total_bmi += current_bmi
            total_child += current_child
            if current_smoke == 'yes':
                smoker_count += 1
    region_dict['Avg BMI'] = round(total_bmi / total_count, 2)
    region_dict['Avg Child Count'] = round(total_child / total_count, 1)
    region_dict['Smokers'] = str(round(float((smoker_count / total_count)*100), 2)) + '%'
    return region_dict
#run function for each region and print results
print('Southwest Stats:', stats_by_region(sw_stats, 'southwest'))
print('Southeast Stats:', stats_by_region(se_stats, 'southeast'))
print('Northwest Stats:', stats_by_region(nw_stats, 'northwest'))
print('Northeast Stats:', stats_by_region(ne_stats, 'northeast'))
            

Southwest Stats: {'Smokers': '17.85%', 'Avg BMI': 30.6, 'Avg Child Count': 1.1}
Southeast Stats: {'Smokers': '25.0%', 'Avg BMI': 33.36, 'Avg Child Count': 1.0}
Northwest Stats: {'Smokers': '17.85%', 'Avg BMI': 29.2, 'Avg Child Count': 1.1}
Northeast Stats: {'Smokers': '20.68%', 'Avg BMI': 29.17, 'Avg Child Count': 1.0}


### As we can see from these results, 25% of the patients from the Southeast region are smokers, which is noticeably higher than the other regions (about 18% to 21%). The average BMI for patients in the Southeast is also higher than the others, at 33.36. This is consistent with the Southeast region paying more on average in medical bills. The average child count is nearly the same for each region (1.0 to 1.1), so this is likely not a factor in determining costs per region