# U.S. Medical Insurance Costs

This project analyses data on medical insurance costs in the U.S. It is part of a Codecademy course on data analysis with Python. The data was also provided by Codecademy.

## Importing the data
The data is delivered as a CSV file. In a first step, the data is imported and stored in variables (lists and dictionaries) for evaluation. The csv module is imported so that its DictReader class can read the file.

In [2]:
import csv

all_data = []
with open("insurance.csv", newline='') as insurance_raw_data:
    insurance_dict = csv.DictReader(insurance_raw_data)
    for row in insurance_dict:
        all_data.append(row)

## Tidying the data
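In the following step, the data is checked for irregularities and, if necessary, cleaned for further use.
DictReader does not reject rows that have more or fewer fields than there are fieldnames: surplus values are collected under the restkey key, and missing values are filled with restval (both default to None). Checking for these would reveal malformed rows. As a first step, the first rows are printed for a visual check.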

In [3]:
for i in range(10):
    print(all_data[i])

{'age': '19', 'sex': 'female', 'bmi': '27.9', 'children': '0', 'smoker': 'yes', 'region': 'southwest', 'charges': '16884.924'}
{'age': '18', 'sex': 'male', 'bmi': '33.77', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '1725.5523'}
{'age': '28', 'sex': 'male', 'bmi': '33', 'children': '3', 'smoker': 'no', 'region': 'southeast', 'charges': '4449.462'}
{'age': '33', 'sex': 'male', 'bmi': '22.705', 'children': '0', 'smoker': 'no', 'region': 'northwest', 'charges': '21984.47061'}
{'age': '32', 'sex': 'male', 'bmi': '28.88', 'children': '0', 'smoker': 'no', 'region': 'northwest', 'charges': '3866.8552'}
{'age': '31', 'sex': 'female', 'bmi': '25.74', 'children': '0', 'smoker': 'no', 'region': 'southeast', 'charges': '3756.6216'}
{'age': '46', 'sex': 'female', 'bmi': '33.44', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '8240.5896'}
{'age': '37', 'sex': 'female', 'bmi': '27.74', 'children': '3', 'smoker': 'no', 'region': 'northwest', 'charges': '7281.

The data seems to be in very good condition. There are no empty cells and no problems occurred during the import.
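
The visual check only covers the first rows. As a sketch of how the whole file could be scanned, DictReader can be given a restkey and a restval so that rows with too many or too few fields become visible. The helper name find_irregular_rows, the marker key "_extra" and the small in-memory sample (which contains deliberately broken rows and is not taken from insurance.csv) are assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample with two deliberately broken rows (not real data)
sample = io.StringIO(
    "age,sex,bmi\n"
    "19,female,27.9\n"
    "18,male\n"
    "28,male,33,oops\n"
)


def find_irregular_rows(file_obj):
    """Return (line number, problem) pairs for rows that look malformed."""
    reader = csv.DictReader(file_obj, restkey="_extra", restval=None)
    problems = []
    for row in reader:
        if "_extra" in row:
            # surplus values were collected in a list under the restkey
            problems.append((reader.line_num, "too many fields"))
        elif any(value is None for value in row.values()):
            # missing values were filled with the restval (None)
            problems.append((reader.line_num, "too few fields"))
        elif any(value.strip() == "" for value in row.values()):
            problems.append((reader.line_num, "empty cell"))
    return problems


irregular = find_irregular_rows(sample)
print(irregular)
```

For the real file, the same function could be called inside the with open("insurance.csv", newline='') block; an empty result would support the impression from the visual check.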

## Converting the data to fitting data-types

In [4]:
# To facilitate further analysis and make operations faster, every column of the dataset is put into a separate list
ages = []
sexes = []
bmis = []
children = []
smokers = []
regions = []
charges = []

# The next step will convert the data into fitting data types, as for now they are all still strings
for row in all_data:
    row["age"] = int(row["age"])
    row["bmi"] = float(row["bmi"])
    row["children"] = int(row["children"])
    row["charges"] = float(row["charges"])
    
# now every value from every row is put into its respective list
for row in all_data:
    ages.append(row["age"])
    sexes.append(row["sex"])
    bmis.append(row["bmi"])
    children.append(row["children"])
    smokers.append(row["smoker"])
    regions.append(row["region"])
    charges.append(row["charges"])
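
The conversion and the column extraction could also be written more compactly. The following is only a sketch: the names CONVERTERS and convert_row are hypothetical, and the two sample rows are copied from the printed output above.

```python
# Hypothetical mapping from column name to target type; other columns stay strings
CONVERTERS = {"age": int, "bmi": float, "children": int, "charges": float}


def convert_row(row, converters=CONVERTERS):
    """Return a copy of the row with numeric columns converted."""
    return {key: converters.get(key, str)(value) for key, value in row.items()}


sample_rows = [
    {"age": "19", "sex": "female", "bmi": "27.9", "children": "0",
     "smoker": "yes", "region": "southwest", "charges": "16884.924"},
    {"age": "18", "sex": "male", "bmi": "33.77", "children": "1",
     "smoker": "no", "region": "southeast", "charges": "1725.5523"},
]

typed_rows = [convert_row(row) for row in sample_rows]

# One list per column, built with a list comprehension
sample_ages = [row["age"] for row in typed_rows]
sample_bmis = [row["bmi"] for row in typed_rows]
print(sample_ages, sample_bmis)
```

Unlike the loop above, this variant builds new dictionaries instead of changing the rows in place, so the original strings remain available if they are needed again.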

## Checking data for validity
Since we don't know exactly how the data was originally acquired, we need to check whether it is suitable for drawing conclusions. To do so, we will check whether the dataset is representative of the U.S. population or whether it is biased. Special attention is paid to the number of children and the share of smokers.

### The male / female ratio

In [5]:
total_sets = len(all_data)
male = sexes.count("male")
female = sexes.count("female")

print("""The data contains {total_sets} individual datasets from U.S. Citizens.
Of these, {male} are male and {female} are female.""".format(total_sets = total_sets, male = male, female = female))

The data contains 1338 individual datasets from U.S. Citizens.
Of these, 676 are male and 662 are female.


### Smoker validation
The following function computes the percentage of smokers in the dataset.

In [6]:
def smokers_amount(data_as_list):
    list_len = len(data_as_list)
    count_smokers = 0
    for element in data_as_list:
        if element["smoker"] == "yes":
            count_smokers += 1
    smoker_perc = round(count_smokers * 100 / list_len, 1)
    return smoker_perc

print("The dataset contains {smoker_perc} % smokers.".format(smoker_perc = smokers_amount(all_data)))

The dataset contains 20.5 % smokers.


In 2022, approximately 12.5 % of the adult population of the U.S. smoked, so the dataset seems to overrepresent smokers. Then again, the rate has fallen over the years; it was 20.9 % in 2005. This could mean that the dataset is simply several years old. For more information on U.S. smoking statistics, go to:
https://www.cdc.gov/tobacco/data_statistics/fact_sheets/adult_data/cig_smoking/index.htm

### Average number of Children
The next function computes the average number of children per record.

In [7]:
def average_child_count(data_as_list):
    list_len = len(data_as_list)
    count_children = 0
    for element in data_as_list:
        count_children += element["children"]
    # average = total number of children divided by the number of records
    avg_children = round(count_children / list_len, 1)
    return avg_children

print(average_child_count(all_data))

1.1


### Places of Origin
Where do the individuals in our data live? We will now examine the regions: which regions appear in the data, and how many records are associated with each?

In [8]:
def origins(regions):
    list_of_regions = []
    for i in range(len(regions)):
        if regions[i] not in list_of_regions:
            list_of_regions.append(regions[i])

    region_count = {}
    for region in list_of_regions:
        region_count.update({region: regions.count(region)})
    region_count = sorted(region_count.items(), key=lambda x:x[1], reverse = True)
    return region_count
    
print(origins(regions))

[('southeast', 364), ('southwest', 325), ('northwest', 325), ('northeast', 324)]
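
As a side note, the standard library offers collections.Counter for this kind of tally. The sketch below assumes a hypothetical helper name count_values and uses the regions of the first six rows printed earlier as sample input.

```python
from collections import Counter


def count_values(values):
    """Return (value, count) pairs sorted by count, highest first."""
    return Counter(values).most_common()


# Regions of the first six rows shown in the import check
sample_regions = ["southwest", "southeast", "southeast",
                  "northwest", "northwest", "southeast"]

region_counts = count_values(sample_regions)
print(region_counts)
```

Because the helper only looks at a flat list of values, it could be applied to sexes or smokers in the same way.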


## Getting to know the data
In this chapter we want to have a first look at the data. This means finding maximums, minimums and some averages, but also trying to form useful groupings such as age groups.
#### Who has the highest cost?
Make a list of the ten entries paying the most, sorted by cost.
Ideas for doing so:
- Sort a copy of the list and take the first ten elements.
- Make a copy of the list and pop the highest row into a new list. Repeat ten times.

In [9]:
# sorted() returns a new list, so all_data itself stays unchanged
all_data_sorted = sorted(all_data, key=lambda row: row["charges"], reverse=True)
for i in range(10):
    print(all_data_sorted[i])

{'age': 54, 'sex': 'female', 'bmi': 47.41, 'children': 0, 'smoker': 'yes', 'region': 'southeast', 'charges': 63770.42801}
{'age': 45, 'sex': 'male', 'bmi': 30.36, 'children': 0, 'smoker': 'yes', 'region': 'southeast', 'charges': 62592.87309}
{'age': 52, 'sex': 'male', 'bmi': 34.485, 'children': 3, 'smoker': 'yes', 'region': 'northwest', 'charges': 60021.39897}
{'age': 31, 'sex': 'female', 'bmi': 38.095, 'children': 1, 'smoker': 'yes', 'region': 'northeast', 'charges': 58571.07448}
{'age': 33, 'sex': 'female', 'bmi': 35.53, 'children': 0, 'smoker': 'yes', 'region': 'northwest', 'charges': 55135.40209}
{'age': 60, 'sex': 'male', 'bmi': 32.8, 'children': 0, 'smoker': 'yes', 'region': 'southwest', 'charges': 52590.82939}
{'age': 28, 'sex': 'male', 'bmi': 36.4, 'children': 1, 'smoker': 'yes', 'region': 'southwest', 'charges': 51194.55914}
{'age': 64, 'sex': 'male', 'bmi': 36.96, 'children': 2, 'smoker': 'yes', 'region': 'southeast', 'charges': 49577.6624}
{'age': 59, 'sex': 'male', 'bmi': 4

As expected, those who pay the most are all smokers. Most of them also have a high BMI. While the majority are above the average age, roughly one third are around thirty years old.
To further investigate the cost of smoking, we will now compare the average insurance cost of smokers and non-smokers.

In [10]:
smokers_total_cost = 0.0
count_smokers = 0
for i in range(len(all_data)):
    if all_data[i]["smoker"] == "yes":
        smokers_total_cost += all_data[i]["charges"]
        count_smokers += 1
smoker_average_cost = smokers_total_cost / count_smokers
print("The average smoker has insurance charges of {smoker_average_cost} $.".format(smoker_average_cost = round(smoker_average_cost, 2)))

nonsmokers_total_cost = 0.0
count_nonsmokers = 0
for i in range(len(all_data)):
    if all_data[i]["smoker"] == "no":
        nonsmokers_total_cost += all_data[i]["charges"]
        count_nonsmokers += 1
nonsmoker_average_cost = nonsmokers_total_cost / count_nonsmokers
print("The average non-smoker has insurance charges of {nonsmoker_average_cost} $.".format(nonsmoker_average_cost = round(nonsmoker_average_cost, 2)))

The average smoker has insurance charges of 32050.23 $.
The average non-smoker has insurance charges of 8434.27 $.
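
Going by the two averages printed above, smokers are charged roughly 3.8 times as much as non-smokers (32050.23 / 8434.27). Since both loops differ only in the value they filter on, they could be folded into one helper. This is a sketch with a hypothetical function name, average_charges, and three sample rows reduced to the two columns that matter here.

```python
def average_charges(rows, smoker_flag):
    """Average charges of all rows whose 'smoker' field equals smoker_flag."""
    selected = [row["charges"] for row in rows if row["smoker"] == smoker_flag]
    if not selected:
        return None  # avoid dividing by zero when no row matches
    return sum(selected) / len(selected)


# Values taken from the first three rows of the dataset
sample_rows = [
    {"smoker": "yes", "charges": 16884.924},
    {"smoker": "no", "charges": 1725.5523},
    {"smoker": "no", "charges": 4449.462},
]

smoker_avg = average_charges(sample_rows, "yes")
nonsmoker_avg = average_charges(sample_rows, "no")
print(round(smoker_avg, 2), round(nonsmoker_avg, 2))
```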


#### Who pays the least?

In [11]:
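all_data_sorted = sorted(all_data, key=lambda row: row["charges"])
for i in range(10):
    print(all_data_sorted[i])

# Median: the middle value, or the mean of the two middle values for an even number of records
n = len(all_data_sorted)
middle = n // 2
if n % 2 == 1:
    median_charges = all_data_sorted[middle]["charges"]
else:
    median_charges = (all_data_sorted[middle - 1]["charges"] + all_data_sorted[middle]["charges"]) / 2
print("\nThe median charges are", median_charges)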

{'age': 18, 'sex': 'male', 'bmi': 23.21, 'children': 0, 'smoker': 'no', 'region': 'southeast', 'charges': 1121.8739}
{'age': 18, 'sex': 'male', 'bmi': 30.14, 'children': 0, 'smoker': 'no', 'region': 'southeast', 'charges': 1131.5066}
{'age': 18, 'sex': 'male', 'bmi': 33.33, 'children': 0, 'smoker': 'no', 'region': 'southeast', 'charges': 1135.9407}
{'age': 18, 'sex': 'male', 'bmi': 33.66, 'children': 0, 'smoker': 'no', 'region': 'southeast', 'charges': 1136.3994}
{'age': 18, 'sex': 'male', 'bmi': 34.1, 'children': 0, 'smoker': 'no', 'region': 'southeast', 'charges': 1137.011}
{'age': 18, 'sex': 'male', 'bmi': 34.43, 'children': 0, 'smoker': 'no', 'region': 'southeast', 'charges': 1137.4697}
{'age': 18, 'sex': 'male', 'bmi': 37.29, 'children': 0, 'smoker': 'no', 'region': 'southeast', 'charges': 1141.4451}
{'age': 18, 'sex': 'male', 'bmi': 41.14, 'children': 0, 'smoker': 'no', 'region': 'southeast', 'charges': 1146.7966}
{'age': 18, 'sex': 'male', 'bmi': 43.01, 'children': 0, 'smoker': 

From this we can see that the people paying the least are all young, male non-smokers without children. Not all of them have a low BMI, though. Interestingly, they all live in the southeast.

#### Forming age groups
Another interesting way of looking into the data would be to form age groups. This is left for a later iteration.
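
A possible starting point, as a sketch only: group the records by decade of age and compute the average charges per group. The function name average_charges_by_age_group and the group size of ten years are assumptions; the sample rows are the first five records of the dataset, reduced to age and charges.

```python
from collections import defaultdict


def average_charges_by_age_group(rows, group_size=10):
    """Map age-group labels such as '30-39' to the average charges in that group."""
    totals = defaultdict(lambda: [0.0, 0])  # lower bound -> [sum of charges, count]
    for row in rows:
        lower = (row["age"] // group_size) * group_size
        totals[lower][0] += row["charges"]
        totals[lower][1] += 1
    return {
        f"{lower}-{lower + group_size - 1}": total / count
        for lower, (total, count) in sorted(totals.items())
    }


sample_rows = [
    {"age": 19, "charges": 16884.924},
    {"age": 18, "charges": 1725.5523},
    {"age": 28, "charges": 4449.462},
    {"age": 33, "charges": 21984.47061},
    {"age": 32, "charges": 3866.8552},
]

age_group_averages = average_charges_by_age_group(sample_rows)
for label, average in age_group_averages.items():
    print(label, round(average, 2))
```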

## Questioning the Data
### Who are the smokers?
Since smoking is one factor that increases insurance cost dramatically, it would be interesting to know who the smokers are. Are they predominantly male or female? Since there are fewer smokers today than ten years ago, it would also be interesting to see whether smokers tend to be old or young, and at what age smoking is most common. And do people with children tend to smoke less?

In [12]:
list_of_smokers = []
for element in all_data:
    if element["smoker"] == "yes":
        list_of_smokers.append(element)
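
The questions about sex and children could be approached with one small helper that computes the smoker share per group. This is a sketch under assumptions: the name smoker_share_by is hypothetical, and the sample consists of the first seven records of the dataset, reduced to the relevant columns.

```python
def smoker_share_by(rows, key_func):
    """Percentage of smokers within each group returned by key_func(row)."""
    counts = {}  # group -> (records in group, smokers in group)
    for row in rows:
        group = key_func(row)
        total, smokers_in_group = counts.get(group, (0, 0))
        counts[group] = (total + 1, smokers_in_group + (row["smoker"] == "yes"))
    return {
        group: round(100 * smokers_in_group / total, 1)
        for group, (total, smokers_in_group) in counts.items()
    }


sample_rows = [
    {"sex": "female", "children": 0, "smoker": "yes"},
    {"sex": "male", "children": 1, "smoker": "no"},
    {"sex": "male", "children": 3, "smoker": "no"},
    {"sex": "male", "children": 0, "smoker": "no"},
    {"sex": "male", "children": 0, "smoker": "no"},
    {"sex": "female", "children": 0, "smoker": "no"},
    {"sex": "female", "children": 1, "smoker": "no"},
]

share_by_sex = smoker_share_by(sample_rows, lambda row: row["sex"])
share_by_parenthood = smoker_share_by(sample_rows, lambda row: row["children"] > 0)
print(share_by_sex)
print(share_by_parenthood)
```

Seven rows are far too few to say anything about the population; the sample only shows the shape of the result. Run on all_data, the same calls would give the shares for the whole dataset.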


### Where do most of the smokers come from?
Since we now have a list of the smokers with the same structure as the original data, we can run our functions again, now checking for differences between smokers and non-smokers.

In [13]:
smokers_origins = []
for i in range(len(list_of_smokers)):
    smokers_origins.append(list_of_smokers[i]["region"])
print(origins(smokers_origins))

[('southeast', 91), ('northeast', 67), ('southwest', 58), ('northwest', 58)]


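The largest group of smokers lives in the southeast (91 of 274). Note that the southeast also contributes the most records overall (364). (Is this because this is the traditional area where tobacco was once grown and harvested?)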
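
Raw counts favour regions with more records, so it may help to relate the smoker counts to the size of each region. The sketch below reuses the two lists of counts printed in this notebook; the variable names are chosen for illustration.

```python
# Counts as printed above: all records per region and smokers per region
region_totals = {"southeast": 364, "southwest": 325, "northwest": 325, "northeast": 324}
smoker_counts = {"southeast": 91, "northeast": 67, "southwest": 58, "northwest": 58}

# Share of smokers within each region, in percent
smoker_share = {
    region: round(100 * smoker_counts[region] / total, 1)
    for region, total in region_totals.items()
}
print(smoker_share)
```

By this measure the southeast still has the highest share of smokers (25.0 %), ahead of the northeast (20.7 %) and the two western regions (17.8 % each).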

We can easily do the same thing for high BMIs. In this case, however, we need to group the data first, because the BMI takes too many different values.

### Grouping BMIs
We will group the BMIs according to the WHO standard:
Underweight is less than 18.5. Normal weight ranges from 18.5 up to, but not including, 25. From 25 on, one is considered overweight, and from 30 on obese. Please note that this is a classification used by the WHO. The BMI alone is not really sufficient to indicate health problems. People who do a lot of sport, for example, tend to have a higher BMI because of their muscle mass. Also, being overweight does not automatically lead to health problems.
Nevertheless, we will build four BMI classes:
- Class 1: BMI < 18.5
- Class 2: 18.5 <= BMI < 25
- Class 3: 25 <= BMI < 30
- Class 4: BMI >= 30

In [14]:
bmi_classes = []
for i in range(len(bmis)):
    if bmis[i] < 18.5:
        bmi_classes.append(1)
    elif bmis[i] >= 18.5 and bmis[i] < 25:
        bmi_classes.append(2)
    elif bmis[i] >= 25 and bmis[i] < 30:
        bmi_classes.append(3)
    elif bmis[i] >= 30:
        bmi_classes.append(4)
print("Underweight: ", bmi_classes.count(1))
print("Healthy weight: ", bmi_classes.count(2))
print("Overweight: ", bmi_classes.count(3))
print("Obesity: ", bmi_classes.count(4))


Underweight:  20
Healthy weight:  225
Overweight:  386
Obesity:  707


More than half of the records (707 of 1338, about 53 %) fall into the obesity class! Only a minority (225 records, about 17 %) has a normal weight. The dataset seems to be biased here, as the CDC states that only about 40 % of the American population has obesity. See https://www.cdc.gov/obesity/data/adult.html

#### Obesity and regions

In [15]:
#bmi_regions = list(zip(bmi_classes, regions))
obese_regions = []
for i in range(len(bmi_classes)):
    if bmi_classes[i] == 4:
        obese_regions.append(regions[i])
print(origins(obese_regions))

[('southeast', 243), ('southwest', 173), ('northwest', 148), ('northeast', 143)]


The largest group of people with obesity lives in the southeast (243 of 707). This is quite interesting, as this is also where the most smokers live. We could check whether there is a correlation!
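
One simple way to start such a check, sketched here under assumptions, is a two-by-two table of obesity against smoking, followed by the smoker share in each BMI group. The function names smoking_by_obesity and smoker_share are hypothetical, and the sample rows are the first six records of the dataset, reduced to BMI and smoker status.

```python
def smoking_by_obesity(rows, threshold=30):
    """Count smokers and non-smokers separately for obese and non-obese records."""
    table = {"obese": {"yes": 0, "no": 0}, "not obese": {"yes": 0, "no": 0}}
    for row in rows:
        group = "obese" if row["bmi"] >= threshold else "not obese"
        table[group][row["smoker"]] += 1
    return table


def smoker_share(table):
    """Percentage of smokers per group, or None for an empty group."""
    shares = {}
    for group, counts in table.items():
        total = counts["yes"] + counts["no"]
        shares[group] = round(100 * counts["yes"] / total, 1) if total else None
    return shares


sample_rows = [
    {"bmi": 27.9, "smoker": "yes"},
    {"bmi": 33.77, "smoker": "no"},
    {"bmi": 33.0, "smoker": "no"},
    {"bmi": 22.705, "smoker": "no"},
    {"bmi": 28.88, "smoker": "no"},
    {"bmi": 25.74, "smoker": "no"},
]

obesity_table = smoking_by_obesity(sample_rows)
obesity_shares = smoker_share(obesity_table)
print(obesity_table)
print(obesity_shares)
```

If the smoker share turned out to be similar in both groups when run on all_data, that would speak against a direct link between obesity and smoking, and the regional overlap might simply reflect the size and composition of the southeast sample.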