# U.S. Medical Insurance Costs

For this project, a CSV file, **insurance.csv**, will be analyzed. It contains the medical cost for each patient, along with other attributes (i.e. age, sex, bmi).

The goal of this project is to analyze each attribute, learn some key information, and then see if there is any insight that can be obtained from each analyzation

In [299]:
import csv

# Lists for each attribute in Insurance.csv
age_list = []
sexes_list = []
bmi_list = []
children_list = []
smoker_list = []
region_list = []
charges_list = []

We start off by importing the csv library so that we can read from the csv file, **insurance.csv**

Next empty lists are created for each attribute. When gathering each patient's attributes, it will be placed in the corresponding list. This will make analyzing each attribute easier.

In [530]:
# Function that adds values of each attribute from the file to the corresponding list
def read_file(list , attribute,):
    with open('insurance.csv') as insurance_file:
        insurance_reader = csv.DictReader(insurance_file)
        
        # For loop that goes through each row of the file
        for row in insurance_reader:
                list.append(row[attribute])

    return list

# Save values of each attribute to its corresponding list
read_file(age_list , 'age')
read_file(sexes_list , 'sex')
read_file(bmi_list , 'bmi')
read_file(children_list , 'children')
read_file(smoker_list , 'smoker')
read_file(region_list, 'region')
read_file(charges_list , 'charges')

print(age_list[400] + age_list[450])

5139


A function is created to read the csv file. It takes in two parameters, *list* and *attribute*. Using the csv library, the **insurance.csv** file is read, saved as ***insurance_file***. A for loop goes through each row of patient information and adds the *attribute* to the corresponding *list*.

We then call the function for each attribute list.

In [448]:
# Function for finding the average
def average(list, datatype):
    sum = 0
    for i in list:
        sum += datatype(i)
    avg = round(sum / len(list),2)
    return avg

This function helps find the average of the values inside a list. It takes in two parameters, ***list*** and ***datatype***. The reason for including the ***datatype*** is because not all values are whole numbers, some are floats (charges, bmi).

Using a for loop, a ***sum*** variable is incremented with each value in the list. Finally the average, ***avg***, is found by taking the ***sum*** and divding by the length of the ***list***. The function then returns ***avg***.

In [449]:
# Averages of age, bmi, & charges
age_avg = int(average(age_list, int))
bmi_avg = round(average(bmi_list, float),2)
charge_avg = round(average(charges_list, float),2)

print("The average age is: " + str(avg_age))
print("The average bmi is: " + str(bmi_avg))
print("The average charge is: " + str(charge_avg))

The average age is: 39.31891891891892
The average bmi is: 30.66
The average charge is: 13270.42


This shows the average age, bmi, and charge of the insurance record as a whole.

It doesn't show much, but an overall idea can be formed of the data as a whole. The average bmi is on the high end, indicating that many on the insurance records are overweight, which can increase insurance cost.

In [367]:
# Find ration of males to females
def sexes_ratio(sexes):

    # Male, female variables to keep count
    males = 0
    females = 0
    
    # For loop gathering counts for males and females
    for i in sexes:
        if i == 'male':
            males += 1
        elif i == 'female':
            females += 1
    
    # Find ratio of males
    ratio = round(males / females,2)
    
    print("There are " + str(males) + " males")
    print("There are " + str(females) + " females")
    
    if ratio < 1:
        print("There are " + str(ratio) + " less males than females.")
    elif ratio > 1:
        print("There are " + str(ratio) + " more males than females.")
        
    return ratio

sex_ratio = sexes_ratio(sexes_list)
    

There are 2028 males
There are 1986 females
There are 1.02 more males than females


The function sexes_ratio takes in the list ***sexes_list*** as parameter. A for loop then counts the number of instances of ***'male'*** and ***'female'***. Afterwards it finds the ratio of men to women, returning ***ratio***.

There are 1.02 more males than females in the insurance record. This can be expected as it correlates with statistics, showing that the number of males and females can be about the same, but if not, then there are slightly more males than females in a population. 

(Web link of correlating statistic):

https://www.ined.fr/en/everything_about_population/demographic-facts-sheets/faq/more-men-or-women-in-the-world/#:~:text=The%20number%20of%20men%20and,100%20women%20(in%202020).

In [372]:
# Find ratio of smokers to non-smokers
def smoker_ratio(smoker_list):
    smoker = 0
    non_smoker = 0
    
    for i in smoker_list:
        if i == 'yes':
            smoker += 1
        elif i == 'no':
            non_smoker += 1
    
    ratio = round(smoker / non_smoker,2)
    
    print("There are " + str(smoker) + " people who smoke.")
    print("There are " + str(non_smoker) + " people who don't smoke.")
    
    if ratio < 1:
         print("There are " + str(ratio) + " less smokers than non-smokers.")
    elif ratio < 1:
         print("There are " + str(ratio) + " less smokers than non-smokers.")
            
    return ratio

smoker_ratio = smoker_ratio(smoker_list)

There are 1096 people who smoke.
There are 4256 people who don't smoke.
There are 0.26 less smokers than non-smokers.


A function was also created to find the ratio of smokers to non-smokers, taking in ***smoker_list*** as a parameter.
A ***smoker_count*** and ***non_smoker_count*** was created to keep track of the counts.

The function works similarly like the sexes_ratio function, except in this case the values inside the list are strings.

In the for loop, if a value in ***smoker_list*** was *'yes'*, then smoker_count was incremented by 1. If it was *'no"* then ***non_smoker_count*** was incremented by 1. A ***ratio*** of smokers to non-smokers is found and returned by the function.

There are actually less smokers to non-smokers, which correlates with findings that the rates of smoking is decreasing in the US.

(Web link that shows smoking rate in the U.S. is decreasing):

https://www.cdc.gov/tobacco/data_statistics/fact_sheets/adult_data/cig_smoking/index.htm

In [388]:
# Create a list of ages of smokers and non-smokers
ages_of_smokers = []
ages_of_non_smokers = []

# length of list; all list have the same length
length = len(age_list)

#for loop that places ages in corresponding smoking and non-smoking lists
for i in range(0,(length-1)):
    if smoker_list[i] == 'yes':
        ages_of_smokers.append(age_list[i])
    elif smoker_list[i] == 'no':
        ages_of_non_smokers.append(age_list[i])

# Find average age of smokers and non-smokers
avg_age_smoker = average(ages_of_smokers, int)
avg_age_non_smoker = average(ages_of_non_smokers, int)

print("The average age of a smoker is: " + str(avg_age_smoker))
print("The average age of a non-smoker is: " + str(avg_age_non_smoker))

The average age of a smoker is: 38.49
The average age of a non-smoker is: 39.39


Two lists are created, ***ages_of_smokers*** and ***ages_of_non_smokers***. 

A for loop goes through each patient row in the insurance file. It checks if that patient is smokes. If they do, their age is placed in the ***ages_of_smokers***. If not, then their age is placed in ***ages_of_non_smokers***.

The average age of each list is found using the previous ***average*** function.

It's interesting that the average ages of both the smoker and non-smokers are nearly identical, and are close to the average age of the insurance record as a whole. This could indicate that there are outliers or a large amount of values that could be skewing the data. The **age_list** requires further analysis.

In [413]:
# Function that merges values with its corresponding region
def region_values(region_list, values_list, northeast_values, northwest_values, southeast_values, southwest_values ):

# For loop placing each respective value in the corresponding region list    
    for i in range(0,length-1):
        if region_list[i] == 'northeast':
            northeast_values.append(values_list[i])
        elif region_list[i] == 'northwest':
            northwest_values.append(values_list[i])
        elif region_list[i] == 'southeast':
            southeast_values.append(values_list[i])
        elif region_list[i] == 'southwest':
            southwest_values.append(values_list[i])
        

This function takes 6 parameters: region_list, values_list, and the list for each different region.

The purpose of this function is to go through each patient row of that insurance. If a patient is from a particular region, their corresponding attribute (***values_list***) will be placed in the list for that corresponding region.

This allows to anaylze each ***region*** by each ***attribute*** (i.e. average bmi of the southwest). 

In [416]:
# Find average BMI by region

# Create an empty list for each region's BMIs
northeast_bmis = []
northwest_bmis = []
southeast_bmis = []
southwest_bmis = []

region_values(region_list, bmi_list, northeast_bmis,northwest_bmis,southeast_bmis,southwest_bmis)

print(northeast_values)


# Find average BMI for each region
avg_bmis_northeast = round(average(northeast_bmis, float),2)
avg_bmis_northwest = round(average(northwest_bmis, float),2)
avg_bmis_southeast = round(average(southeast_bmis, float),2)
avg_bmis_southwest = round(average(southwest_bmis, float),2)

print("The average BMI in the northeast region is: " + str(avg_bmis_northeast))
print("The average BMI in the northwest region is: " + str(avg_bmis_northwest))
print("The average BMI in the southeast region is: " + str(avg_bmis_southeast))
print("The average BMI in the southwest region is: " + str(avg_bmis_southwest))

[]
The average BMI in the northeast region is: 29.17
The average BMI in the northwest region is: 29.2
The average BMI in the southeast region is: 33.36
The average BMI in the southwest region is: 30.6


Here the BMI for each region is found using the ***region_values*** function.

Afterwards, the average BMI for each region is found using the ***average*** function. 

It's to be expected that the average BMIs in the southern regions are higher than the northern regions. Studies show that the south has higher rates of obesity than the north.

(Web link showing that the south has higher rates of obesity):

https://www.cdc.gov/obesity/data/prevalence-maps.html

In [419]:
# Find average charges in each region

# Empty list for each region
northeast_charges = []
northwest_charges = []
southeast_charges = []
southwest_charges = []

# Calling region_values function to get values for each corresponding list
region_values(region_list, charges_list, northeast_charges, northwest_charges, southeast_charges, southwest_charges)

# Find average charges in each region
avg_charge_northeast = round(average(northeast_charges, float),2)
avg_charge_northwest = round(average(northwest_charges, float),2)
avg_charge_southeast = round(average(southeast_charges, float),2)
avg_charge_southwest = round(average(southwest_charges, float),2)

print("The average charge in the northeast region is: " + str(avg_charge_northeast))
print("The average charge in the northwest region is: " + str(avg_charge_northwest))
print("The average charge in the southeast region is: " + str(avg_charge_southeast))
print("The average charge in the southwest region is: " + str(avg_charge_southwest))

The average charge in the northeast region is: 13406.38
The average charge in the northwest region is: 12404.7
The average charge in the southeast region is: 14735.41
The average charge in the southwest region is: 12346.94


I made a prediction that based on the average BMIs in each region, the average charge in each region would have a reflect their average BMIs. High BMIs would mean higher charges, while lower BMIs would mean lower charges.

Using the same method for finding the average BMIs, the average charge for each region was found.

It correlated almost with my hypothesis. The southeast having a high BMI average, also had a higher average charge. However, the northeast, which had the lowest average BMI, had the second highest BMI.

Perhaps other attibutes in the northeast affected the average charge. It could also be that there are outliers within the northeastern group.

In [650]:
# Function that counts number of smokers in each region
def smoker_count(region_smokers):
    smoker_count = 0
    
    for i in region_smokers:
        if i == 'no':
            continue
        elif i == "yes":
            smoker_count += 1
    return smoker_count


# Empty list for each region of smokers
northeast_smokers = []
northwest_smokers = []
southeast_smokers = []
southwest_smokers = []

# Calling the function region_values to store each value to the corresponding region
region_values(region_list, smoker_list, northeast_smokers, northwest_smokers, southeast_smokers,
              southwest_smokers)

# Getting the smoker_count for each region
NE_smoker_count = smoker_count(northeast_smokers)
NW_smoker_count = smoker_count(northwest_smokers)
SE_smoker_count = smoker_count(southeast_smokers)
SW_smoker_count = smoker_count(southeast_smokers)

print("There are " + str(NE_smoker_count) + ' smokers in the northeast')
print("There are " + str(NW_smoker_count)  + ' smokers in the northwest')
print("There are " + str(SE_smoker_count) + ' smokers in the southeast')
print("There are " + str(SW_smoker_count) + ' smokers in the southwest')


There are 268 smokers in the northeast
There are 231 smokers in the northwest
There are 364 smokers in the southeast
There are 364 smokers in the southwest


As shown, the northern regions have less smokers than the southern regions, which doesn't account for why the northeast has the second highest, average charge. Further anaylsis of each region's attributes is required.

It would be interesting to further analyize the smokers in each region, perhaps seeing the amount in each age bracket. For example, if a large amount of people between the ages 18-20 were smokers in each region, this could indicate anti-smoking PSA's are perhaps not as effective in deterring the youth from smoking.

In [441]:
# Function counting number of children per region
def children_count(region_children):
    children_count = 0
    
    for i in region_children:
        children_count += int(i)

    return children_count

# Empty list for each region of smokers
northeast_children = []
northwest_children = []
southeast_children = []
southwest_children = []

region_values(region_list, children_list, northeast_children, northwest_children, southeast_children,
              southwest_children)

# Getting the children_count for each region
NE_children_count = children_count(northeast_children)
NW_children_count = children_count(northwest_children)
SE_children_count = children_count(southeast_children)
SW_children_count = children_count(southeast_children)

print("There are " + str(NE_children_count) + ' children in the northeast')
print("There are " + str(NW_children_count)  + ' children in the northwest')
print("There are " + str(SE_children_count) + ' children in the southeast')
print("There are " + str(SW_children_count) + ' children in the southwest')

There are 1356 children in the northeast
There are 1492 children in the northwest
There are 1528 children in the southeast
There are 1528 children in the southwest


The number of children in each region were found. There was any significant difference to indicate that the northeast would have a higher average charge.

Also, the number of children in a region doesn't really reflect much on the cost, considering families with more children are more likely to have higher costs than those with 1 child. Perhaps one region has families with larger numbers than another region. It might be beneficial to create a dictionary with the number of children as a key and have the corresponding charges as the value for each key.

In [445]:
# Function counting number of males and females per region
def sexes_count(region_sexes):
    male_count = 0
    female_count = 0
    
    for i in region_sexes:
        if i == 'male':
            male_count += 1
        elif i == 'female':
            female_count += 1
    return male_count, female_count

# Empty list for sexes of each region
northeast_sexes = []
northwest_sexes = []
southeast_sexes = []
southwest_sexes = []

region_values(region_list, sexes_list, northeast_sexes, northwest_sexes, southeast_sexes, southwest_sexes)


# Getting the sexes_count for each region
NE_male_count, NE_female_count = sexes_count(northeast_sexes)
NW_male_count, NW_female_count = sexes_count(northwest_sexes)
SE_male_count, SE_female_count = sexes_count(southeast_sexes)
SW_male_count, SW_female_count = sexes_count(southwest_sexes)

print("There are " + str(NE_male_count) + " and " + str(NE_female_count) + " in the northeast")
print("There are " + str(NW_male_count) + " and " + str(NW_female_count) + " in the northwest")
print("There are " + str(SE_male_count) + " and " + str(SE_female_count) + " in the southeast")
print("There are " + str(SW_male_count) + " and " + str(SW_female_count) + " in the southwest")

There are 652 and 644 in the northeast
There are 644 and 655 in the northwest
There are 756 and 700 in the southeast
There are 652 and 648 in the southwest


There wasn't a hugh significant difference between the number of each sex in a region, which correlates with the ratio of males to females.

Perhaps one could analyze the average charge between the two sexes and see which one has the higher average.

It's also interesting to note that the southeast also has a higher population, which could play a role in why it has a higher average charge.

In [583]:
# Find average age in each region

# Empty list of ages in each region
northeast_ages = []
northwest_ages = []
southeast_ages = []
southwest_ages = []


# Calling region_values function to get values for each corresponding list
region_values(region_list, age_list, northeast_ages, northwest_ages, southeast_ages, southwest_ages)

# Find average charges in each region
avg_age_northeast = int(average(northeast_ages, int))
avg_age_northwest = int(average(northwest_ages, int))
avg_age_southeast = int(average(southeast_ages, int))
avg_age_southwest = int(average(southwest_ages, int))

print("The average age in the northeast region is: " + str(avg_age_northeast))
print("The average age in the northwest region is: " + str(avg_age_northwest))
print("The average age in the southeast region is: " + str(avg_age_southeast))
print("The average age in the southwest region is: " + str(avg_age_southwest))

The average age in the northeast region is: 39
The average age in the northwest region is: 39
The average age in the southeast region is: 38
The average age in the southwest region is: 39


The average age in each region is equivalent or nearly identical to the average bmi of the insurance record as a whole.

It might be best to ensure that the ages are not skewed or contain any outliers within the age_list.

In [691]:
import numpy as np
# Function finding the median and IQR of an age_list
def Median_IQR(age_list):
    # Sorting the age_list from youngest to oldest

    sorted_age_list = sorted(age_list)

    res = [eval(i) for i in sorted_age_list]
    
    #calculate interquartile range 
    q3, q1 = np.percentile(res, [75 ,25])
    iqr = q3 - q1
    
    return q1, q3, iqr

This function using the numpy library to find the Q1, Q3, and IQR.

The IQR can be used to see if there are any outliers within a list of values.

In [706]:
# Function that finds outliers
def outliers(IQR, Q1, Q3, list, datatype):
    outliers = []
    
    Q1_gate = Q1 - (1.5*IQR)

    Q3_gate = Q3 + (1.5*IQR)

    for i in list:
        if datatype(i) < Q1_gate or datatype(i) > Q3_gate:
            if i not in outliers:
                outliers.append(datatype(i))
        else:
            return print("No outliers")
            
    return outliers

This function finds any outliers. Anything below the Q1_gate value or above the Q3_gate value is considered an outlier. If an outlier is found in the list, that value is placed in the outliers list and returned by the function. Otherwise, function returns "No outliers".

In [703]:
# Find average age in each region

# Empty list of ages in each region
northeast_ages = []
northwest_ages = []
southeast_ages = []
southwest_ages = []

# Calling region_values function to get values for each corresponding list
region_values(region_list, age_list, northeast_ages, northwest_ages, southeast_ages, southwest_ages)

# Find average charges in each region
northeast_Q1_age, northeast_Q3_age, northeast_age_IQR = Median_IQR(northeast_ages)
northwest_Q1_age, northwest_Q3_age, northwest_age_IQR = Median_IQR(northwest_ages)
southeast_Q1_age, southeast_Q3_age, southeast_age_IQR = Median_IQR(southeast_ages)
southwest_Q1_age, southwest_Q3_age, southwest_age_IQR = Median_IQR(southwest_ages)

northeast_outliers = outliers(northeast_age_IQR, northeast_Q1_age, northeast_Q3_age, northeast_ages, int)
northwest_outliers = outliers(northwest_age_IQR, northwest_Q1_age, northwest_Q3_age, northwest_ages, int)
southeast_outliers = outliers(southeast_age_IQR, southeast_Q1_age, southeast_Q3_age, southeast_ages, int)
southwest_outliers = outliers(southwest_age_IQR, southwest_Q1_age, southwest_Q3_age, southwest_ages, int)


No outliers
No outliers
No outliers
No outliers


There are no outliers affecting the average, but it's still possible that certain reoccuring values are skewing the data. Finding the mode of ages can show if that is true or not.

In [728]:
# Function that finds mode of ages in each region
def mode(list):
    dictionary = {}
    
    for i in list:
        if i not in dictionary:
            dictionary[i] = 1
        else:
            dictionary[i] += 1
    
    
    for l in dictionary:
        if dictionary[l] == max(dictionary.values()):
            return l, dictionary[l]
    

Function that finds the mode in a list using a dictionary.

It then returns the highest occuring value.

In [730]:
# Find the mode of the ages in each region
northeast_age_mode = mode(northeast_ages)
northwest_age_mode = mode(northwest_ages)
southeast_age_mode = mode(southeast_ages)
southwest_age_mode = mode(southwest_ages)

print(northeast_age_mode)
print(northwest_age_mode)
print(southeast_age_mode)
print(southwest_age_mode)

('18', 128)
('19', 136)
('18', 148)
('19', 124)


It's evident that in each region, there is a disportionate amount of younger people to those that are older.
This may skew the data and may misrepresent certain age groups. 

People at different age groups live different lives. A 19 year old and 45 year old are going to have very different BMIs, charges, number of children, etc.

Perhaps an analysis can be made for certain age groups rather than analyzing all ages as a whole.

In [619]:
# Find the mode of the BMIs in each region
northeast_bmi_mode = mode(northeast_bmis)
northwest_bmi_mode = mode(northwest_bmis)
southeast_bmi_mode = mode(southeast_bmis)
southwest_bmi_mode = mode(southwest_bmis)

print(northeast_bmi_mode)
print(northwest_bmi_mode)
print(southeast_bmi_mode)
print(southwest_bmi_mode)

('32.3', 28)
('28.31', 28)
('38.06', 28)
('34.8', 28)


It's evident that the northeast has more high-level BMIs than the northwest, which had more reoccuring low-level BMIs.
This might explain a bit why the northeast has the second highest average charge.


We see that in the southeast, there is a high number of BMIs that are considered morbidly obese, which will result in higher insurance charges.

It's strange that the southwest had higher levels of BMIs than the northeast, but the southwest's average charge was near the northwest's average charge. This still indicates that other factors in the northeast must be driving the cost. This region requires further anaylsis.



In [644]:
# Find the mode of the BMIs in each region
northeast_smoker_mode = mode(northeast_smokers)
northwest_smoker_mode = mode(northwest_smokers)
southeast_smoker_mode = mode(southeast_smokers)
southwest_smoker_mode = mode(southwest_smokers)

print(northeast_smoker_mode)
print(northwest_smoker_mode)
print(southeast_smoker_mode)
print(southwest_smoker_mode)

('no', 1028)
('no', 1068)
('no', 1092)
('no', 1068)


Again it's evident that there are more non-smokers to smokers, but it can be furher analyzed.

It was shown that in the list of ages for each region, there is a large amount of 18-19 year olds. It would be interesting to analyze that age range to see how many smokers there are. Again, as stated previously, if there were a large amount of smokers prevelant in that age range, it could indicate that smoking could be on the rise again among the youth.

In [720]:
# Find the mode of the BMIs in each region
northeast_children_mode = mode(northeast_children)
northwest_children_mode = mode(northwest_children)
southeast_children_mode = mode(southeast_children)
southwest_children_mode = mode(southwest_children)

print(northeast_children_mode)
print(northwest_children_mode)
print(southeast_children_mode)
print(southwest_children_mode)

{'2': 204, '0': 588, '1': 308, '3': 156, '5': 12, '4': 28}
{'0': 527, '3': 184, '2': 264, '1': 296, '4': 24, '5': 4}
{'1': 380, '3': 140, '0': 628, '2': 264, '4': 20, '5': 24}
{'0': 552, '1': 312, '2': 228, '5': 32, '3': 148, '4': 28}
0
0
0
0


Just like the mode of ages was skewed, we see the mode of the number of children is also skewed. There are a large number of people included without children. This can misrepresent an analysis of patients with children, it's probably best to exclude those without children.

It would also be interesting to see the comparison of the number of children to the charge each patient has.

In [634]:
# Function for creating dictionaries
def create_dictionary(dictionary, attribute, attribute_list):
    dictionary[attribute] = attribute_list
    return dictionary

In [642]:
# Creating dictionaries for each region
northeast_records = {}
northwest_records = {}
southeast_records = {}
southwest_records = {}

# Attribute names intended to be used as keys in the dictionary and arranged in a list to be used in the for loop
attributes = ['age', 'sex', 'bmi', 'children', 'smoker', 'charges']

# List of each region's attribute lists, which will be used in the for loop.
NE_attributes_list = [northeast_ages, northeast_sexes, northeast_bmis, northeast_children,
                      northeast_smokers, northeast_charges]
NW_attributes_list = [northwest_ages, northwest_sexes, northwest_bmis, northwest_children,
                      northwest_smokers, northwest_charges]
SE_attributes_list = [southeast_ages, southeast_sexes, southeast_bmis, southeast_children,
                      southeast_smokers, southeast_charges]
SW_attributes_list = [southwest_ages, southwest_sexes, southwest_bmis, southwest_children,
                      southwest_smokers, southwest_charges]

# For loop created for each region dictionary
for a,b in zip(attributes, NE_attributes_list):
    create_dictionary(northeast_records, a, b)
    
for a,b in zip(attributes, NW_attributes_list):
    create_dictionary(northwest_records, a, b)

for a,b in zip(attributes, SE_attributes_list):
    create_dictionary(southeast_records, a, b)

for a,b in zip(attributes, SW_attributes_list):
    create_dictionary(southwest_records, a, b)


{'age': ['19', '23', '19', '56', '30', '30', '31', '22', '19', '28', '26', '60', '55', '48', '19', '61', '53', '44', '37', '56', '64', '54', '38', '41', '34', '19', '55', '37', '44', '19', '52', '38', '53', '19', '22', '19', '54', '32', '20', '49', '35', '63', '54', '46', '30', '41', '36', '39', '46', '63', '20', '45', '41', '34', '32', '19', '19', '59', '55', '40', '19', '63', '27', '58', '45', '50', '22', '52', '28', '25', '19', '57', '28', '50', '19', '27', '34', '29', '64', '52', '64', '24', '50', '30', '46', '35', '19', '21', '26', '19', '43', '64', '51', '31', '47', '25', '36', '48', '19', '42', '60', '35', '33', '45', '40', '35', '39', '24', '59', '56', '42', '60', '19', '54', '51', '48', '19', '19', '61', '21', '31', '45', '62', '29', '57', '19', '39', '58', '35', '51', '33', '46', '34', '54', '63', '62', '28', '27', '49', '31', '52', '19', '58', '43', '56', '30', '38', '49', '55', '30', '37', '49', '58', '53', '24', '51', '33', '26', '49', '56', '21', '19', '39', '33', '47', '

The ***create_dictionary*** function is used, taking in the parameters ***dictionary***, ***attribute***, ***attribute_list***. The function creates a dictionary of lists. Each attribute is a key, and the value of that key is the corresponding attribute list.

In the case directly above, dictionaries are created for each region using for loops and the ***create_dictionary*** function. This will allow an analysis of particular region if one ever wanted to do so in the future.

In [643]:
# Creating a dictionary of the medical insurance record
insurance_records = {}

attributes = ['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges']
attributes_list = [age_list, sexes_list, bmi_list, children_list, smoker_list, charges_list]


for a,b in zip(attributes, attributes_list):
    create_dictionary(insurance_records, a, b)

print(insurance_records)

{'age': ['19', '18', '28', '33', '32', '31', '46', '37', '37', '60', '25', '62', '23', '56', '27', '19', '52', '23', '56', '30', '60', '30', '18', '34', '37', '59', '63', '55', '23', '31', '22', '18', '19', '63', '28', '19', '62', '26', '35', '60', '24', '31', '41', '37', '38', '55', '18', '28', '60', '36', '18', '21', '48', '36', '40', '58', '58', '18', '53', '34', '43', '25', '64', '28', '20', '19', '61', '40', '40', '28', '27', '31', '53', '58', '44', '57', '29', '21', '22', '41', '31', '45', '22', '48', '37', '45', '57', '56', '46', '55', '21', '53', '59', '35', '64', '28', '54', '55', '56', '38', '41', '30', '18', '61', '34', '20', '19', '26', '29', '63', '54', '55', '37', '21', '52', '60', '58', '29', '49', '37', '44', '18', '20', '44', '47', '26', '19', '52', '32', '38', '59', '61', '53', '19', '20', '22', '19', '22', '54', '22', '34', '26', '34', '29', '30', '29', '46', '51', '53', '19', '35', '48', '32', '42', '40', '44', '48', '18', '30', '50', '42', '18', '54', '32', '37', '

The medical records from ***insurance.csv*** are conviently organized into a dictionary, ***insurance_records***.
Each row in ***insurance_records*** represents a patient in the file. 

It might also be beneficial to organize a dictionary by ages and their corresponding attributes. You could for example, analyze the average bmi for 30 year olds, or the average charge for 64 years.

In [760]:
# Function for finding average insurance charge for smokers in each region
def average_smoker_charge(region_dictionary, smoker_count):
    
    sum_charge = 0
    length = len(region_dictionary['smoker'])
    
    for i in range(0,length-1):
        if region_dictionary['smoker'][i] == 'yes':
            sum_charge += float(region_dictionary['charges'][i])
        
    average = round(sum_charge / smoker_count,2)
    
    return average

In [761]:
NE_smoker_avg = average_smoker_charge(northeast_records, NE_smoker_count)
NW_smoker_avg = average_smoker_charge(northwest_records, NW_smoker_count)
SE_smoker_avg = average_smoker_charge(southeast_records, SE_smoker_count)
SW_smoker_avg = average_smoker_charge(southwest_records, SW_smoker_count)

print("The average insurance charge for smokers in the northeast is: " + str(NE_smoker_avg))
print("The average insurance charge for smokers in the northwest is: " + str(NW_smoker_avg))
print("The average insurance charge for smokers in the southeast is: " + str(SE_smoker_avg))
print("The average insurance charge for smokers in the southwest is: " + str(SW_smoker_avg))

The average insurance charge for smokers in the northeast is: 29673.54
The average insurance charge for smokers in the northwest is: 30196.55
The average insurance charge for smokers in the southeast is: 34845.0
The average insurance charge for smokers in the southwest is: 20567.1


The function ***average_smoker_charge*** is used to find the average insurace charge among smokers in each region.

Recall that the both the southeast and the southwest had an equal number of smokers. However, we see here that the averages differ greatly (almost a $14k difference).

There must be other factors affecting the insurance charge in the regions. However, the two regions barely differed in attributes except for BMI and population size. There could also be values skewing the average insurance charge.

In [759]:
# Function for finding average insurance charge for each sex in each region
def average_sexes_charge(region_dictionary, male_count, female_count):
    
    male_sum = 0
    female_sum = 0
    length_male = 0
    length_female = 0
    
    for i in range(0,len(region_dictionary['sex'])-1):
        if region_dictionary['sex'][i] == 'male':
            male_sum += float(region_dictionary['charges'][i])
            length_male += 1
    
    
    for i in range(0,len(region_dictionary['sex'])-1):
        if region_dictionary['smoker'][i] == 'yes':
            female_sum += float(region_dictionary['charges'][i])
            length_female += 1
    
    average_male = round(male_sum / length_male,2)
    average_female = round(female_sum / length_female,2)
    
    return average_male, average_female

In [749]:
NE_male_avg, NE_female_avg = average_sexes_charge(northeast_records, NE_male_count, NE_female_count)
NW_male_avg, NW_female_avg = average_sexes_charge(northwest_records, NW_male_count, NW_female_count)
SE_male_avg, SE_female_avg = average_sexes_charge(southeast_records, SE_male_count, SE_female_count)
SW_male_avg, SW_female_avg = average_sexes_charge(southwest_records, SW_male_count, SW_female_count)

print("Average charge for males in the northeast was: " + str(NE_male_avg) +
      ". Average charge for females: " + str(NE_female_avg))
print("Average charge for males in the northwest was: " + str(NW_male_avg) +
      ". Average charge for females: " + str(NW_female_avg))
print("Average charge for males in the southeast was: " + str(SE_male_avg) +
      ". Average charge for females: " + str(SE_female_avg))
print("Average charge for males in the southwest was: " + str(SW_male_avg) +
      ". Average charge for females: " + str(SW_female_avg))

Average charge for males in the northeast was: 13854.01. Average charge for females: 29673.54
Average charge for males in the northwest was: 12354.12. Average charge for females: 30196.55
Average charge for males in the southeast was: 15879.62. Average charge for females: 34845.0
Average charge for males in the southwest was: 13412.88. Average charge for females: 32269.06


The average charge for each sex in each region was found using the ***average_sexes_charge*** function.
It's clear that females are charged higher than males. The region with the highest charge for both males and females was the southeast region.

In [845]:
# Function for finding average insurance charge for each age in each region
def age_to_avg(region_dictionary):
    
    age_to_charges = {}
 
    age_count = {}
    
    age_to_avg = {}
    
    min_avg = 1000000
    min_age = 0
    
    max_avg = 0
    max_age = 0
    
    length = len(region_dictionary['age'])
    
    # For loop that creates a dictionary that has the ages as keys and the sum of corresponding charges as values
    # It also creates a dictionary that has ages as keys and their occurences as values
    for i in range(0,length-1):
        if (region_dictionary['age'][i]) not in age_to_charges:
            age_to_charges[region_dictionary['age'][i]] = float(region_dictionary['charges'][i])
            age_count[region_dictionary['age'][i]] = 1
        else:
            age_to_charges[region_dictionary['age'][i]] += float(region_dictionary['charges'][i])
            age_count[region_dictionary['age'][i]] += 1
    
    
    for i in range(0, len(age_to_charges)-1):
        age_to_avg[region_dictionary['age'][i]] = round(float(age_to_charges[region_dictionary['age'][i]]) / 
                                                        int(age_count[region_dictionary['age'][i]]),2)
   
    # For loop that goes through age_to_avg dictionary to find the min and max average with their corresponding ages    
    for i in age_to_avg:
        if min_avg > age_to_avg[i]:
            min_avg = age_to_avg[i]
            min_age = i
        
        if max_avg < (age_to_avg[i]):
            max_avg = age_to_avg[i]
            max_age = i
    
    return max_age, max_avg, min_age, min_avg

In [846]:
NE_max_age_avg = age_to_avg(northeast_records)
NW_max_age_avg = age_to_avg(northwest_records)
SE_max_age_avg = age_to_avg(southeast_records)
SW_max_age_avg = age_to_avg(southwest_records)

print(NE_max_age_avg)
print(NW_max_age_avg)
print(SE_max_age_avg)
print(SW_max_age_avg)

('59', 21787.62, '22', 2952.24)
('60', 22740.42, '26', 3157.81)
('61', 30238.46, '21', 4056.66)
('64', 27669.87, '35', 5298.23)


Here we can see that the older you are, the higher your insurance charge will be, while the younger you are, the lower the charge will be.

Again we see that the southeast has the highest numbers, while also having the highest BMIs out of the regions.

In [729]:
# Find the mode of the BMIs in each region
northeast_charge_mode = mode(northeast_charges)
northwest_charge_mode = mode(northwest_charges)
southeast_charge_mode = mode(southeast_charges)
southwest_charge_mode = mode(southwest_charges)

print(northeast_charge_mode)
print(northwest_charge_mode)
print(southeast_charge_mode)
print(southwest_charge_mode)


northeast_Q1_charge, northeast_Q3_charge, northeast_charge_IQR = Median_IQR(northeast_charges)
northwest_Q1_charge, northwest_Q3_charge, northwest_charge_IQR = Median_IQR(northwest_charges)
southeast_Q1_charge, southeast_Q3_charge, southeast_charge_IQR = Median_IQR(southeast_charges)
southwest_Q1_charge, southwest_Q3_charge, southwest_charge_IQR = Median_IQR(southwest_charges)

northeast_outliers = outliers(northeast_Q1_charge, northeast_Q3_charge, northeast_charge_IQR, northeast_charges, float)
northwest_outliers = outliers(northwest_Q1_charge, northwest_Q3_charge, northwest_charge_IQR, northwest_charges, float)
southeast_outliers = outliers(southeast_Q1_charge, southeast_Q3_charge, southeast_charge_IQR, southeast_charges, float)
southwest_outliers = outliers(southwest_Q1_charge, southwest_Q3_charge, southwest_charge_IQR, southwest_charges, float)

print(northeast_outliers)
print(northwest_outliers)
print(southeast_outliers)
print(southwest_outliers)

('6406.4107', 4)
('1639.5631', 8)
('1725.5523', 4)
('16884.924', 4)
No outliers
No outliers
No outliers
No outliers
None
None
None
None


There are no outliers for the charges within each region. It's interesting to note that for the northwest region, there are 8 occurences of low insurance charges. Lower charges could skew the averages to be lower which could explain why the northwest had lower averages out of the other three regions.

It's evident that BMI has a high influence on the insurance charge.

It was shown that both the southern regions had the same number of smokers, but the southwest had a significantly lower average insurance charge for smokers.

We saw that several times that the average charge for each attribute, in each region, was always highest in the southeast, which also had the highest BMI.

I would predict that if the average charge for the number of children, in each region were taken, we would see again that the southeast would have a higher average insurance charge, out of the other regions.

It's safe to say that a patient who lowers their BMI will see a signficant decrease in their insurance charge. The same can be said for smoking. This is because health risks increase your risks of diseases, which drive up insurance costs.

The most interesting analysis found is that there is a significant difference between the average charge for males versus the average charge for females. It would be fascinating to understand why insurance companies charge more for women. Are insurance companies sexist, or are there health factors that differ between men and women that lead to higher insurance costs?

As you can see, there is still room for further anaylsis.