# U.S. Medical Insurance Costs

In this independent portfolio project, we aim to explore a dataset containing US Medical Insurance Costs (<b>insurance.csv</b>) using Python fundamentals.
We will start by profiling the demographic characteristics within the dataset (e.g. average age, proportion of BMI categories, percentage male vs. female, profile of those with children, geographical region of residence, smoker vs. non-smoker).
We will then explore how these demographic factors potentially affect an individual's medical insurance costs and whether costs differ by key characteristics such as sex, smoking status, region of residence, number of children, age, and BMI.
The goal of the project is to produce a comprehensive overview of the demographics within the dataset and of how different factors potentially affect medical insurance costs.

The first step is to use the csv library to load the CSV data into Python as a list of dictionaries, so that each row of the CSV is a dictionary whose keys are taken from the header of the CSV file.

In [81]:
import csv

with open("insurance.csv", newline="") as csvfile:
    # DictReader yields one dictionary per row, keyed by the header names
    insurance_reader = csv.DictReader(csvfile)
    insurance_data = list(insurance_reader)

#print(insurance_data)

The <b>insurance.csv</b> file contains 7 columns:
- Patient age (`age`)
- Patient sex (`sex`)
- Patient BMI (`bmi`)
- The number of children the patient has (`children`)
- Patient smoking status (`smoker`)
- The region the patient resides in (`region`)
- The patient's medical insurance cost (`charges`)

Note that there are no missing data.
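
The no-missing-data claim can be checked directly. The following is a minimal sketch, assuming rows shaped like those produced by `csv.DictReader`. The helper name `count_missing` and the two made-up sample rows are hypothetical and only stand in for `insurance_data`.

```python
import csv
import io

def count_missing(rows):
    """Return {column: number of rows where the value is None or blank}."""
    missing = {}
    for row in rows:
        for key, value in row.items():
            # DictReader gives None for fields absent from a short row
            # and "" for fields that are present but empty
            is_blank = value is None or str(value).strip() == ""
            missing[key] = missing.get(key, 0) + (1 if is_blank else 0)
    return missing

# Hypothetical two-row sample (made-up values); the second row has an empty bmi
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "40,female,25.0,2,no,northeast,5000.0\n"
    "35,male,,1,yes,southeast,20000.0\n"
)
sample_rows = list(csv.DictReader(sample_csv))
missing_counts = count_missing(sample_rows)
print(missing_counts)
```

Calling `count_missing(insurance_data)` on the real data should return zero for every column if the dataset is complete.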

Now that the data is loaded into Python, we can save the variables we want to analyze later. Each variable is stored as its own list so that any single feature can be accessed quickly.

In [90]:
ages = [int(row["age"]) for row in insurance_data]
sexes = [row["sex"] for row in insurance_data]
bmis = [float(row["bmi"]) for row in insurance_data]
children = [int(row["children"]) for row in insurance_data]
smoking_status = [row["smoker"] for row in insurance_data]
regions = [row["region"] for row in insurance_data]
costs = [float(row["charges"]) for row in insurance_data]

# The raw "yes"/"no" values are not descriptive on their own, so relabel them.
# Using a lookup with a default keeps the cell safe to re-run.
smoker_labels = {"yes": "smoker", "no": "non-smoker"}
smoking_status = [smoker_labels.get(x, x) for x in smoking_status]

Now that the variables are stored as individual lists, we can start analyzing the data.
First, we want to build some functions that will speed up the analyses.
These functions will:
- Calculate the mean of the values in the list
- Count the number of values for a variable in a given list.
- Provide a summary of the proportion of each category in the list as percentages. The summary is returned as a dictionary with each category as a key.
- Provide a total of each category in the list. The total is returned as a dictionary with each category as a key.
- Calculate the mean of a variable within each category of a given stratifier. The means are returned as a dictionary with each category as a key.
- Compare the means of two classes of a stratifier and return the difference as a summary string.

These functions will help us build a summary of the data we have on hand.
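
As a cross-check on the hand-written helpers, the same counts, percentages, and stratified means can be sketched with `collections.Counter` and `statistics.mean` from the standard library. The function names and the small sample lists below are hypothetical and are not part of the original notebook.

```python
from collections import Counter, defaultdict
from statistics import mean

def category_counts(values):
    """Count each category, returned as a dictionary sorted by key."""
    return dict(sorted(Counter(values).items()))

def category_percentages(values):
    """Share of each category as a percentage, rounded to one decimal."""
    total = len(values)
    return {k: round(n / total * 100, 1) for k, n in category_counts(values).items()}

def mean_by_group(values, groups):
    """Mean of `values` within each category of `groups`, in a single pass."""
    grouped = defaultdict(list)
    for value, group in zip(values, groups):
        grouped[group].append(value)
    return {g: round(mean(v), 2) for g, v in sorted(grouped.items())}

# Hypothetical toy data, not taken from insurance.csv
sample_status = ["smoker", "non-smoker", "non-smoker", "smoker", "non-smoker"]
sample_costs = [300.0, 100.0, 120.0, 340.0, 110.0]

print(category_counts(sample_status))
print(category_percentages(sample_status))
print(mean_by_group(sample_costs, sample_status))
```

Grouping in a single pass avoids rescanning the whole list once per category, which matters little at this dataset's size but scales better.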

In [97]:
def list_mean(var_list):
    return round(sum(var_list)/len(var_list), 2)

def count(var, var_list):
    # Avoid shadowing the built-in `list` and the function's own name
    total = 0
    for item in var_list:
        if item == var:
            total += 1
    return total

def list_summary(var_list):
    summary_dict = {}
    for x in var_list:
        if x not in summary_dict:
            summary_dict[x] = round(count(x, var_list) / len(var_list) * 100, 1)
    return dict(sorted(summary_dict.items()))

def count_summary(var_list):
    count_dict = {}
    for x in var_list:
        if x not in count_dict:
            count_dict[x] = count(x, var_list)
    return dict(sorted(count_dict.items()))

def average_by(var, stratifier):
    data = list(zip(var, stratifier))
    average_summary = {}
    for x in stratifier:
        if x not in average_summary:
            stratifier_list = [value for value, group in data if group == x]
            average_summary[x] = list_mean(stratifier_list)
    return dict(sorted(average_summary.items()))

def compare_classes(var, stratifier, class1, class2, var_label):
    mean_list = average_by(var, stratifier)
    difference = round(mean_list[class1] - mean_list[class2], 2)
    return "The difference in mean {} between {} and {} is {}.".format(var_label, class1, class2, difference)

#Print out some summary of the important means from the data
print("The average age of the population in the dataset is {}.".format(list_mean(ages)))
print("The average bmi of the population in the dataset is {}.".format(list_mean(bmis)))
print("The average number of children of the population in the dataset is {}.".format(list_mean(children)))
print("The average annual cost of medical insurance of the population in the dataset is {}.".format(list_mean(costs)))
print()

print(count_summary(sexes)) 
print(count_summary(children))
print(count_summary(smoking_status))
print(count_summary(regions))

print(list_summary(sexes)) 
print(list_summary(children))
print(list_summary(smoking_status))
print(list_summary(regions))

print(average_by(costs, sexes))
print(average_by(costs, children)) 
print(average_by(costs, smoking_status)) 
print(average_by(costs, regions)) 

print(average_by(bmis, sexes))
print(average_by(bmis, children)) 
print(average_by(bmis, smoking_status)) 
print(average_by(bmis, regions))

print(average_by(ages, sexes))
print(average_by(ages, children)) 
print(average_by(ages, smoking_status)) 
print(average_by(ages, regions))
print()

print(compare_classes(costs, regions, "northeast", "southwest", "cost of medical insurance"))
print(compare_classes(bmis, smoking_status, "smoker", "non-smoker", "BMI value"))
print(compare_classes(costs, smoking_status, "smoker", "non-smoker", "cost of medical insurance"))
print(compare_classes(costs, children, 3, 1, "cost of medical insurance"))

The average age of the population in the dataset is 39.21.
The average bmi of the population in the dataset is 30.66.
The average number of children of the population in the dataset is 1.09.
The average annual cost of medical insurance of the population in the dataset is 13270.42.

{'female': 662, 'male': 676}
{0: 574, 1: 324, 2: 240, 3: 157, 4: 25, 5: 18}
{'non-smoker': 1064, 'smoker': 274}
{'northeast': 324, 'northwest': 325, 'southeast': 364, 'southwest': 325}
{'female': 49.5, 'male': 50.5}
{0: 42.9, 1: 24.2, 2: 17.9, 3: 11.7, 4: 1.9, 5: 1.3}
{'non-smoker': 79.5, 'smoker': 20.5}
{'northeast': 24.2, 'northwest': 24.3, 'southeast': 27.2, 'southwest': 24.3}
{'female': 12569.58, 'male': 13956.75}
{0: 12365.98, 1: 12731.17, 2: 15073.56, 3: 15355.32, 4: 13850.66, 5: 8786.04}
{'non-smoker': 8434.27, 'smoker': 32050.23}
{'northeast': 13406.38, 'northwest': 12417.58, 'southeast': 14735.41, 'southwest': 12346.94}
{'female': 30.38, 'male': 30.94}
{0: 30.55, 1: 30.62, 2: 30.98, 3: 30.68, 4: 31.

These functions give us a summary that we can build on in later analyses. At this stage we can produce descriptive statistics only. Significance testing would require considerably more code, so it will come once we add numpy and pandas to our working libraries.
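
To give a sense of what that later step involves, the following is a minimal sketch of Welch's t statistic for comparing two group means, written with the standard library only. It is an illustration under stated assumptions rather than the method this project will use: the function name `welch_t` and the toy samples are hypothetical, and the sketch stops at the test statistic because converting it to a p-value needs a t-distribution, which the standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples with unequal variances."""
    # statistics.variance is the sample variance (n - 1 in the denominator)
    var_a, var_b = variance(sample_a), variance(sample_b)
    standard_error = sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error

# Hypothetical toy samples, not taken from insurance.csv
group_a = [30.0, 32.0, 34.0, 36.0]
group_b = [10.0, 11.0, 12.0, 13.0]
t_stat = welch_t(group_a, group_b)
print(round(t_stat, 2))
```

With the real data, the two samples could be the `costs` values split by `smoking_status`.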