# U.S. Medical Insurance Costs

Looking at the insurance.csv file in Microsoft Notepad, I can see that each line is a separate entry and that values are separated by commas. There are 7 fields, and I noticed a few changes that could make the data easier to work with.

For example, values in the "smoker" field are "yes" or "no". To calculate insurance costs, it might be easier to have these as 1 or 0. The same applies to the "sex" field. I could change "male" and "female" to 1 and 0 respectively.
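
A minimal sketch of that recoding, assuming the column names shown in the file. The helper `encode_binary_fields`, the two lookup tables and the sample row are hypothetical names used only for illustration.

```python
# Hypothetical lookup tables for the recoding described above.
SMOKER_CODES = {"yes": 1, "no": 0}
SEX_CODES = {"male": 1, "female": 0}

def encode_binary_fields(row):
    """Return a copy of one record with 'smoker' and 'sex' recoded as 1 or 0."""
    encoded = dict(row)  # copy, so the original record is left untouched
    encoded["smoker"] = SMOKER_CODES[row["smoker"]]
    encoded["sex"] = SEX_CODES[row["sex"]]
    return encoded

# Illustrative record in the same shape as one line of insurance.csv.
sample_row = {"age": "19", "sex": "female", "smoker": "yes", "region": "southwest"}
encoded_row = encode_binary_fields(sample_row)
print(encoded_row)
```

The rest of this notebook keeps the original "yes"/"no" and "male"/"female" strings, so this step is optional.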

In [2]:
import csv
import operator as op

**Project Goals:**  
1. Work out whether there is any correlation between a patient's smoking status and other variables.
   * Compare smoking status with age, sex, BMI, and region.
<br/><br/>
2. Work out where the majority of patients in this dataset are from.
   * This will mean grouping records into categories for every possible value of "region", and performing a count for each.
<br/><br/>
3. Work out whether a patient's region bears any correlation to their insurance cost
   * Similarly to how I paired up smoking status with other fields, I will do the same with "region".

In [12]:
original_data_dict = {}
with open("insurance.csv", newline="") as insurance:
    original_data = csv.DictReader(insurance)
    #key each record as "Patient0", "Patient1", ... in file order
    for index, row in enumerate(original_data):
        original_data_dict["Patient{}".format(index)] = row
print("There are {} records in this dataset:\nHere is a sample of the first 10:\n".format(len(original_data_dict)))
#print only the first 10 records
for key, value in list(original_data_dict.items())[:10]:
    print(key, ": ", value, "\n")

There are 1338 records in this dataset:
Here is a sample of the first 10:

Patient0 :  {'age': '19', 'sex': 'female', 'bmi': '27.9', 'children': '0', 'smoker': 'yes', 'region': 'southwest', 'charges': '16884.924'} 

Patient1 :  {'age': '18', 'sex': 'male', 'bmi': '33.77', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '1725.5523'} 

Patient2 :  {'age': '28', 'sex': 'male', 'bmi': '33', 'children': '3', 'smoker': 'no', 'region': 'southeast', 'charges': '4449.462'} 

Patient3 :  {'age': '33', 'sex': 'male', 'bmi': '22.705', 'children': '0', 'smoker': 'no', 'region': 'northwest', 'charges': '21984.47061'} 

Patient4 :  {'age': '32', 'sex': 'male', 'bmi': '28.88', 'children': '0', 'smoker': 'no', 'region': 'northwest', 'charges': '3866.8552'} 

Patient5 :  {'age': '31', 'sex': 'female', 'bmi': '25.74', 'children': '0', 'smoker': 'no', 'region': 'southeast', 'charges': '3756.6216'} 

Patient6 :  {'age': '46', 'sex': 'female', 'bmi': '33.44', 'children': '1', 'smoker': 'n

The numerical values in this dataset are all stored as strings. This could make them difficult to work with, so I'm going to convert them all to integer or float types:

In [4]:
for data in original_data_dict.values(): #converting all numerical strings into int or float.
    data["age"] = int(data.get("age"))
    data["bmi"] = float(data.get("bmi"))
    data["children"] = int(data.get("children"))
    data["charges"] = float(data.get("charges"))
#print(original_data_dict)

I might want to filter my data based on certain parameters. For example, I might want to view only the people in the northeast, or people who are female. I'm going to create a filter function that uses the operator module (I have imported it as op) to return a new dictionary containing only the results that I want.

I will pass in the dictionary, and then a list of lists. The inner lists will be the search criteria, in the format \[parameter, operator, value\]. The parameter is the name of the field I want to refine (e.g. "age"). The operator is a function from the operator module (e.g. op.lt for "less than"), and the value is the value I want to compare my results to.

So for example, if I call filter_dataset(original_data_dict, \[\["age", op.gt, 25\], \["smoker", op.eq, "yes"\]\]), that will give me all the patients aged over 25 who smoke.
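
To make the \[parameter, operator, value\] format concrete, here is a small sketch that applies the same two criteria to a single record. The record is made up for illustration; `op.gt` and `op.eq` are the standard function forms of `>` and `==`.

```python
import operator as op

# One record in the same shape as the converted dataset (values are illustrative).
record = {"age": 31, "smoker": "yes", "region": "southwest"}

# Each criterion is [parameter, operator, value].
criteria = [["age", op.gt, 25], ["smoker", op.eq, "yes"]]

# op.gt(a, b) is the function form of a > b; op.eq(a, b) is a == b.
results = [operator(record[parameter], value) for parameter, operator, value in criteria]
print(results)       # [True, True]
print(all(results))  # True: the record passes only if every criterion holds
```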

In [5]:
def filter_dataset(dataset, povs): #pov is a list like [parameter, operator, value], povs is a list of lists
    rtn = {}
    for k, v in dataset.items():
        #keep the record only if every criterion passes; an empty povs list keeps every record
        if all(operator(v.get(parameter), value) for parameter, operator, value in povs):
            rtn[k] = v
    return rtn

#testing the function:
test = filter_dataset(original_data_dict, [["age", op.gt, 25], ["smoker", op.eq, "yes"]])
rows = 0
for i in test.items():
    if rows < 5:
        print(i)
        rows += 1
    else:
        break

('Patient11', {'age': 62, 'sex': 'female', 'bmi': 26.29, 'children': 0, 'smoker': 'yes', 'region': 'southeast', 'charges': 27808.7251})
('Patient14', {'age': 27, 'sex': 'male', 'bmi': 42.13, 'children': 0, 'smoker': 'yes', 'region': 'southeast', 'charges': 39611.7577})
('Patient19', {'age': 30, 'sex': 'male', 'bmi': 35.3, 'children': 0, 'smoker': 'yes', 'region': 'southwest', 'charges': 36837.467})
('Patient23', {'age': 34, 'sex': 'female', 'bmi': 31.92, 'children': 1, 'smoker': 'yes', 'region': 'northeast', 'charges': 37701.8768})
('Patient29', {'age': 31, 'sex': 'male', 'bmi': 36.3, 'children': 2, 'smoker': 'yes', 'region': 'southwest', 'charges': 38711.0})


In [6]:
print(op.eq(original_data_dict.get("Patient0").get("smoker"), "yes"))

True


I also want a function for calculating averages:

In [7]:
def get_average(dataset, parameter):
    #mean of one numeric field, rounded to 2 decimal places (assumes dataset is not empty)
    total = 0
    count = 0
    for v in dataset.values():
        count += 1
        total += v.get(parameter)
    return round(total / count, 2)

#testing the function:
dataset_sample = {}
rows = 0
for k, v in original_data_dict.items():
    if rows < 5:
        dataset_sample[k] = v
        rows += 1
    else:
        break
print([v.get("age") for v in dataset_sample.values()])
print((19+18+28+33+32)/5)
get_average(dataset_sample, "age")

[19, 18, 28, 33, 32]
26.0


26.0

...and a function for calculating percentages:

In [8]:
def percentage(a, b):
    return round((a / b) * 100, 2)

#testing the function:
print(percentage(50, 100))

50.0


**Task 1: Work out if there is any correlation between a patient's smoking status and other variables:**  
Here is where I will pair up the "smoker" field with other fields, to see if there is any correlation.

**Task 1a: Smoking Status vs. Age:**

In [225]:
#Finding max and min age groups:
all_ages = [value.get("age") for value in original_data_dict.values()]
print(max(all_ages))
print(min(all_ages))

64
18


In [226]:
#Creating lists of age groups, members of each age group, members of each age group who are smokers, and percentages.

age_groups = ["under_20s", "_20s", "_30s", "_40s", "_50s", "_60s"]
age_group_members = [
    filter_dataset(original_data_dict, [["age", op.lt, 20]]),
    filter_dataset(original_data_dict, [["age", op.ge, 20], ["age", op.lt, 30]]),
    filter_dataset(original_data_dict, [["age", op.ge, 30], ["age", op.lt, 40]]),
    filter_dataset(original_data_dict, [["age", op.ge, 40], ["age", op.lt, 50]]),
    filter_dataset(original_data_dict, [["age", op.ge, 50], ["age", op.lt, 60]]),
    filter_dataset(original_data_dict, [["age", op.ge, 60], ["age", op.lt, 70]])
]
age_group_smokers = [
    filter_dataset(original_data_dict, [["age", op.lt, 20], ["smoker", op.eq, "yes"]]),
    filter_dataset(original_data_dict, [["age", op.ge, 20], ["age", op.lt, 30], ["smoker", op.eq, "yes"]]),
    filter_dataset(original_data_dict, [["age", op.ge, 30], ["age", op.lt, 40], ["smoker", op.eq, "yes"]]),
    filter_dataset(original_data_dict, [["age", op.ge, 40], ["age", op.lt, 50], ["smoker", op.eq, "yes"]]),
    filter_dataset(original_data_dict, [["age", op.ge, 50], ["age", op.lt, 60], ["smoker", op.eq, "yes"]]),
    filter_dataset(original_data_dict, [["age", op.ge, 60], ["age", op.lt, 70], ["smoker", op.eq, "yes"]])
]
percentages = [
    percentage(len(age_group_smokers[n]), len(age_group_members[n])) for n in range(len(age_group_members))
]

Now the data needs to be displayed in an easy-to-analyse way:

In [227]:
for i in range(len(age_groups)):
    print("For the age group '{0}', the percentage of people who are smokers is {1}%.".format(age_groups[i], percentages[i]))

For the age group 'under_20s', the percentage of people who are smokers is 21.9%.
For the age group '_20s', the percentage of people who are smokers is 20.0%.
For the age group '_30s', the percentage of people who are smokers is 22.57%.
For the age group '_40s', the percentage of people who are smokers is 22.22%.
For the age group '_50s', the percentage of people who are smokers is 15.13%.
For the age group '_60s', the percentage of people who are smokers is 23.68%.


In [228]:
#It looks like the age group "_50s" is an anomaly. Let's see by how much:
print(round(((21.9 + 20 + 22.57 + 22.22 + 23.68)/ 5) - 15.13, 2))

6.94


So now we have the percentage of people in each age group who smoke.
The only age group that stands out is people in their 50s, whose smoking rate is 6.94 percentage points below the unweighted mean of the other five age groups.
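
The 6.94 figure can be reproduced without typing the percentages into the arithmetic by hand. This is a sketch only: the values are copied from the output above, `gap_from_others` is a hypothetical helper, and the mean is unweighted, so it ignores the fact that the age groups differ in size.

```python
# Smoker percentages per age group, copied from the output above.
smoker_percentages = {
    "under_20s": 21.9,
    "_20s": 20.0,
    "_30s": 22.57,
    "_40s": 22.22,
    "_50s": 15.13,
    "_60s": 23.68,
}

def gap_from_others(percentages, group):
    """Unweighted mean of every other group's percentage, minus the given group's."""
    others = [value for name, value in percentages.items() if name != group]
    return round(sum(others) / len(others) - percentages[group], 2)

gap_50s = gap_from_others(smoker_percentages, "_50s")
print(gap_50s)
```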

**Task 1b: Smoking Status vs. Sex:**

In [229]:
males = filter_dataset(original_data_dict, [["sex", op.eq, "male"]])
male_smokers = filter_dataset(males, [["smoker", op.eq, "yes"]])
females = filter_dataset(original_data_dict, [["sex", op.eq, "female"]])
female_smokers = filter_dataset(females, [["smoker", op.eq, "yes"]])

percent_male_smokers = percentage(len(male_smokers), len(males))
percent_female_smokers = percentage(len(female_smokers), len(females))

print("The percentage of males in our dataset who smoke is {}%".format(percent_male_smokers))
print("The percentage of females in our dataset who smoke is {}%".format(percent_female_smokers))

The percentage of males in our dataset who smoke is 23.52%
The percentage of females in our dataset who smoke is 17.37%


From the above calculations we can see that 23.52% of males in our dataset are smokers, compared to 17.37% of females. Therefore, based only on our dataset, we can say that males are (23.52/17.37 = ) 1.35 times as likely to be smokers as females.
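
As a worked version of that arithmetic, the sketch below computes both the ratio and the difference in percentage points from the two printed percentages. The variable names `ratio` and `difference_points` are introduced here for illustration.

```python
# Percentages copied from the output above.
percent_male_smokers = 23.52
percent_female_smokers = 17.37

# Relative comparison: how many times as likely males are to smoke.
ratio = round(percent_male_smokers / percent_female_smokers, 2)

# Absolute comparison: the gap in percentage points.
difference_points = round(percent_male_smokers - percent_female_smokers, 2)

print(ratio, difference_points)
```

The ratio describes this dataset only; it says nothing about why the two rates differ.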

**Task 1c: Smoking Status vs. BMI:**

In [230]:
#Finding the range of BMI values in our dataset:
bmis = [patient["bmi"] for patient in original_data_dict.values()]
print(max(bmis))
print(min(bmis))

53.13
15.96


In [238]:
bmi_groups = ["15-20", "20-25", "25-30", "30-35", "35-40", "40-45", "45-50", "50-55"]
#the data for BMI is continuous, so where I have written "15-20", read as 15 <= BMI < 20.
bmi_group_members = [
    filter_dataset(original_data_dict, [["bmi", op.ge, 15], ["bmi", op.lt, 20]]),
    filter_dataset(original_data_dict, [["bmi", op.ge, 20], ["bmi", op.lt, 25]]),
    filter_dataset(original_data_dict, [["bmi", op.ge, 25], ["bmi", op.lt, 30]]),
    filter_dataset(original_data_dict, [["bmi", op.ge, 30], ["bmi", op.lt, 35]]),
    filter_dataset(original_data_dict, [["bmi", op.ge, 35], ["bmi", op.lt, 40]]),
    filter_dataset(original_data_dict, [["bmi", op.ge, 40], ["bmi", op.lt, 45]]),
    filter_dataset(original_data_dict, [["bmi", op.ge, 45], ["bmi", op.lt, 50]]),
    filter_dataset(original_data_dict, [["bmi", op.ge, 50], ["bmi", op.lt, 55]]),
]

bmi_group_smokers = [
    filter_dataset(bmi_group_members[i], [["smoker", op.eq, "yes"]]) for i in range(len(bmi_group_members))
]

percentages = [
    percentage(len(bmi_group_smokers[i]), len(bmi_group_members[i])) for i in range(len(bmi_group_members))
]

for i in range(len(bmi_groups)):
    print("The percentage of people in the bmi group {0} who smoke is {1}%.".format(bmi_groups[i], percentages[i]))

The percentage of people in the bmi group 15-20 who smoke is 21.95%.
The percentage of people in the bmi group 20-25 who smoke is 22.55%.
The percentage of people in the bmi group 25-30 who smoke is 19.17%.
The percentage of people in the bmi group 30-35 who smoke is 18.93%.
The percentage of people in the bmi group 35-40 who smoke is 22.22%.
The percentage of people in the bmi group 40-45 who smoke is 22.54%.
The percentage of people in the bmi group 45-50 who smoke is 23.53%.
The percentage of people in the bmi group 50-55 who smoke is 33.33%.


The most interesting BMI group is 50-55, with 33.33% of people in this group being smokers, although with a maximum BMI of 53.13 in the dataset this group may contain very few patients. According to the NHS website, for most adults the healthy BMI range is 18.5 to 24.9. However, this is not the range with the lowest percentage of smokers. That is the 30-35 group, which according to the NHS is in the obese range.

I am not going to speculate about possible reasons for this, because I only have this limited data. So I will just conclude that, based on this data, there is no obvious correlation between BMI and being a smoker.
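
Because a percentage from a very small group can be misleading, it can help to show the group sizes next to the percentages. The sketch below does this with floor division instead of one filter per band. `smoker_share_by_bmi_band` is a hypothetical helper and the records are made up, so the numbers are not results from insurance.csv.

```python
from collections import defaultdict

def smoker_share_by_bmi_band(records, width=5):
    """Group records into BMI bands; return {band: (smokers, total, percent)}."""
    counts = defaultdict(lambda: [0, 0])  # band lower bound -> [smokers, total]
    for record in records:
        lower = int(record["bmi"] // width) * width  # e.g. 27.9 -> 25
        counts[lower][1] += 1
        if record["smoker"] == "yes":
            counts[lower][0] += 1
    return {
        "{}-{}".format(lower, lower + width): (smokers, total, round(smokers / total * 100, 2))
        for lower, (smokers, total) in sorted(counts.items())
    }

# Made-up records for illustration only; not taken from insurance.csv.
example_records = [
    {"bmi": 27.9, "smoker": "yes"},
    {"bmi": 26.1, "smoker": "no"},
    {"bmi": 33.0, "smoker": "no"},
    {"bmi": 51.2, "smoker": "yes"},
]
bands = smoker_share_by_bmi_band(example_records)
print(bands)
```

In this toy data the 50-55 band shows 100% smokers from a single record, which is the kind of result that needs its group size beside it.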

**Task 1d: Smoking Status vs. Region:**

In [239]:
#Find out all the distinct values of "region" in the dataset:
regions = []
for patient in original_data_dict.values():
    if patient.get("region") not in regions:
        regions.append(patient.get("region"))
print(regions)

['southwest', 'southeast', 'northwest', 'northeast']


In [267]:
regions = ["southwest", "southeast", "northwest", "northeast"]
region_members = [
    filter_dataset(original_data_dict, [["region", op.eq, regions[i]]]) for i in range(len(regions))
]

region_smokers = [
    filter_dataset(region_members[i], [["smoker", op.eq, "yes"]]) for i in range(len(region_members))
]

percentages = [
    percentage(len(region_smokers[i]), len(region_members[i])) for i in range(len(region_members))
]

for i in range(len(regions)):
    print("The percentage of people who live in the {0} who smoke is {1}%.".format(regions[i], percentages[i]))

The percentage of people who live in the southwest who smoke is 17.85%.
The percentage of people who live in the southeast who smoke is 25.0%.
The percentage of people who live in the northwest who smoke is 17.85%.
The percentage of people who live in the northeast who smoke is 20.68%.


I thought it was strange that the percentage for "southwest" and "northwest" was exactly the same. I couldn't see a mistake in my code but I wasn't sure, so I ran the following checks:

In [268]:
#Create a dictionary of {region: [smokers, non_smokers, total, percentage]}
region_to_smoking_status = {
    region: [0, 0, 0, 0] for region in regions
}
for patient in original_data_dict.values():
    region_to_smoking_status[patient.get("region")][2] += 1
    if patient.get("smoker") == "yes":
        region_to_smoking_status[patient.get("region")][0] += 1
    else:
        region_to_smoking_status[patient.get("region")][1] += 1
for data in region_to_smoking_status.values():
    data[3] = round((data[0]/ data[2]) * 100, 2)
for region in region_to_smoking_status.items():
    print(region)

('southwest', [58, 267, 325, 17.85])
('southeast', [91, 273, 364, 25.0])
('northwest', [58, 267, 325, 17.85])
('northeast', [67, 257, 324, 20.68])


...It came out exactly the same...

In [269]:
count_sw = 0 #total people in southwest
for patient in original_data_dict.values():
    if patient.get("region") == "southwest":
        count_sw += 1
        
count_sw_smokers = 0 #total smokers in southwest
for patient in original_data_dict.values():
    if patient.get("region") == "southwest" and patient.get("smoker") == "yes":
        count_sw_smokers += 1
        
count_sw_children = 0 #total children in southwest
for patient in original_data_dict.values():
    if patient.get("region") == "southwest":
        count_sw_children += patient.get("children")

count_sw_smokers_30 = 0 #total 30 year old smokers in southwest
for patient in original_data_dict.values():
    if patient.get("region") == "southwest" and patient.get("age") == 30 and patient.get("smoker") == "yes":
        count_sw_smokers_30 += 1

print(count_sw, count_sw_smokers, count_sw_children, count_sw_smokers_30)

325 58 371 3


In [270]:
count_nw = 0 #total people in northwest
for patient in original_data_dict.values():
    if patient.get("region") == "northwest":
        count_nw += 1

count_nw_smokers = 0 #total smokers in northwest
for patient in original_data_dict.values():
    if patient.get("region") == "northwest" and patient.get("smoker") == "yes":
        count_nw_smokers += 1

count_nw_children = 0 #total children in northwest
for patient in original_data_dict.values():
    if patient.get("region") == "northwest":
        count_nw_children += patient.get("children")

count_nw_smokers_30 = 0 #total 30 year old smokers in northwest
for patient in original_data_dict.values():
    if patient.get("region") == "northwest" and patient.get("age") == 30 and patient.get("smoker") == "yes":
        count_nw_smokers_30 += 1

print(count_nw, count_nw_smokers, count_nw_children, count_nw_smokers_30)

325 58 373 3


The only difference is the number of children; the total people, total smokers and total 30-year-old smokers are the same. It could be coincidence, but I thought it was interesting. If this were a real database, it would make me want to check whether the data had been entered correctly.
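
Another independent way to cross-check tallies like these is to count every (region, smoker) pair in one pass with `collections.Counter`, so the totals do not rely on `filter_dataset` at all. This is a sketch on made-up records, not on the real dataset.

```python
from collections import Counter

# Made-up records for illustration only; not taken from insurance.csv.
example_records = [
    {"region": "southwest", "smoker": "yes"},
    {"region": "southwest", "smoker": "no"},
    {"region": "northwest", "smoker": "yes"},
    {"region": "northwest", "smoker": "no"},
    {"region": "northwest", "smoker": "no"},
]

# Count every (region, smoker) combination in a single pass.
pair_counts = Counter((record["region"], record["smoker"]) for record in example_records)

for (region, smoker), count in sorted(pair_counts.items()):
    print(region, smoker, count)
```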

**Task 2: Work out where the majority of patients in the dataset are from**

In [271]:
def region_count(dataset):
    #count how many patients fall into each region
    counts = {
        "southwest": 0,
        "southeast": 0,
        "northwest": 0,
        "northeast": 0
    }
    for patient in dataset.values():
        counts[patient.get("region")] += 1
    print(counts)

region_count(original_data_dict)

{'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324}


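No single region holds a majority. The southeast has the most patients (364 of 1338), and the other three regions are almost level at 324 or 325 each, so people are fairly evenly spread out between regions.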
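
Using the counts printed above, a short sketch can pick out the largest region and express each count as a share of the total. The names `largest_region` and `shares` are introduced here for illustration.

```python
# Counts copied from the output above.
region_counts = {"southwest": 325, "southeast": 364, "northwest": 325, "northeast": 324}

total = sum(region_counts.values())
largest_region = max(region_counts, key=region_counts.get)

# Each region's share of all patients, as a percentage.
shares = {region: round(count / total * 100, 2) for region, count in region_counts.items()}

print(total, largest_region)
print(shares)
```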

**Task 3: Work out if there is a relationship between region and insurance cost**

In [272]:
region_members = [
    filter_dataset(original_data_dict, [["region", op.eq, regions[i]]]) for i in range(len(regions))
]

ave_charges = [
    round(get_average(region_members[i], "charges"), -3) for i in range(len(region_members))
]

for i in range(len(regions)):
    print("The average charge of insurance for people in the {0} is ${1}.".format(regions[i], ave_charges[i]))

The average charge of insurance for people in the southwest is $12000.0.
The average charge of insurance for people in the southeast is $15000.0.
The average charge of insurance for people in the northwest is $12000.0.
The average charge of insurance for people in the northeast is $13000.0.


I rounded all the average charges to 2 significant figures. People in the southeast tend to have the most expensive insurance costs, at around $15000. From earlier, when analysing the percentage of smokers per region, people in the southeast had the highest percentage, at 25%, which is likely to increase insurance cost. However, I wonder if there are any other factors that could be contributing to this, so I'm going to test region against BMI, and region against number of children.
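
One caveat about the rounding: `round(x, -3)` rounds to the nearest thousand, which matches 2 significant figures only for five-digit values such as these averages. A general version is sketched below; `round_sig` is a hypothetical helper, and the two inputs are individual charges taken from the sample records printed earlier.

```python
import math

def round_sig(value, sig=2):
    """Round value to `sig` significant figures."""
    if value == 0:
        return 0.0
    magnitude = math.floor(math.log10(abs(value)))  # e.g. 4 for 16884.924
    return round(value, sig - 1 - magnitude)

# Five-digit value: both approaches agree.
print(round(16884.924, -3), round_sig(16884.924))

# Four-digit value: nearest thousand is no longer 2 significant figures.
print(round(1725.5523, -3), round_sig(1725.5523))
```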

In [273]:
#Average BMI per region:

ave_bmi_per_region = [
    get_average(region_members[i], "bmi") for i in range(len(region_members))
]

for i in range(len(regions)):
    print("The average BMI for people in the {0} is {1}.".format(regions[i], ave_bmi_per_region[i]))

The average BMI for people in the southwest is 30.6.
The average BMI for people in the southeast is 33.36.
The average BMI for people in the northwest is 29.2.
The average BMI for people in the northeast is 29.17.


So people in the southeast also have the highest average BMI, which may also be contributing to the higher insurance costs there.

In [274]:
#Average number of children per region:

ave_children_per_region = [
    get_average(region_members[i], "children") for i in range(len(region_members))
]

for i in range(len(regions)):
    print("The average number of children for people in the {0} is {1}.".format(regions[i], ave_children_per_region[i]))

The average number of children for people in the southwest is 1.14.
The average number of children for people in the southeast is 1.05.
The average number of children for people in the northwest is 1.15.
The average number of children for people in the northeast is 1.05.


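People in the northwest have the highest average number of children (1.15), closely followed by the southwest (1.14); the southeast and the northeast are joint lowest at 1.05. The gap between the highest and lowest is only about 0.1 of a child, so number of children is unlikely to be a major contributor to the southeast's high insurance costs.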
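
To compare the three per-region results side by side, the sketch below takes the averages printed above and reports, for each field, which region is highest and how wide the spread is. The `region_averages` and `summary` names are introduced here for illustration, and the charges are the already-rounded values.

```python
# Per-region averages copied from the outputs above (charges already rounded to 2 s.f.).
region_averages = {
    "charges": {"southwest": 12000.0, "southeast": 15000.0, "northwest": 12000.0, "northeast": 13000.0},
    "bmi": {"southwest": 30.6, "southeast": 33.36, "northwest": 29.2, "northeast": 29.17},
    "children": {"southwest": 1.14, "southeast": 1.05, "northwest": 1.15, "northeast": 1.05},
}

summary = {}
for field, by_region in region_averages.items():
    highest = max(by_region, key=by_region.get)
    spread = round(max(by_region.values()) - min(by_region.values()), 2)
    summary[field] = (highest, spread)
    print("{0}: highest in the {1}, spread {2}".format(field, highest, spread))
```

The southeast is highest for both charges and BMI, while the spread in children is tiny. This is a comparison of group averages only, not a test of correlation.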