# U.S. Medical Insurance Costs

### Overview

In the U.S., the cost of medical insurance can depend on a number of factors, such as a person's age, sex, body mass index (BMI), number of children, region, and smoking status.

### Scope

In this project, I will use Python 3 fundamentals to analyze a dataset of medical insurance data and determine how various factors affect the cost of medical insurance. Specifically, I will examine how modifiable factors such as smoking status and BMI affect the cost of insurance in order to recommend ways for patients to lower their medical insurance costs. I will also analyze how insurance costs differ by a patient's age, sex, number of children, and region to determine which demographics result in the highest insurance costs.

In [1]:
# import the csv module from the standard library
import csv

For this project, I will import the `csv` module to help read, organize, and analyze the __insurance.csv__ dataset.

In [15]:
with open('insurance.csv') as insurance_file:
    insurance_data = insurance_file.read() 

Examining the contents of the __insurance.csv__ dataset, we can see that it has 7 columns:

* Patient's age
* Patient's sex
* Patient's BMI
* Patient's number of children
* Patient's smoking status
* Patient's regional location
* Patient's medical insurance cost
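The column names can also be read programmatically instead of by eye. The following is a minimal sketch, assuming the header row uses the keys that appear later in this notebook (`age`, `sex`, `bmi`, `children`, `smoker`, `region`, `charges`). The rows in `sample_csv` are made-up stand-ins for the real file.

```python
import csv
import io

# Made-up sample standing in for insurance.csv (illustrative values only)
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,female,22.5,0,no,northeast,4000.00\n"
    "45,male,31.2,2,yes,southeast,30000.00\n"
)

reader = csv.DictReader(sample_csv)
column_names = reader.fieldnames  # taken from the header row
print(column_names)
print(len(column_names))
```

With the real file, `sample_csv` would be replaced by the file object returned from `open('insurance.csv', newline='')`.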

Since the data is currently stored in a CSV file, it is difficult to manipulate and analyze. For easier analysis, I will create empty lists for each of the columns above and store the datapoints of the CSV file in the appropriate lists.

In [3]:
# Create empty lists for each column of data in csv file
ages = []
sexes = []
bmis = []
num_children = []
smoker_status = []
regions = []
charges = []

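Using **DictReader**, I iterated through each row in the dataset and appended each value to its corresponding list. Note that every value is stored as a string, so numeric columns must be converted (for example with `float`) before doing any arithmetic.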

In [4]:
# Read insurance.csv and append each column's values to its list
with open('insurance.csv', newline='') as insurance_file:
    insurance_dict = csv.DictReader(insurance_file)
    for row in insurance_dict:
        ages.append(row['age'])
        sexes.append(row['sex'])
        bmis.append(row['bmi'])
        num_children.append(row['children'])
        smoker_status.append(row['smoker'])
        regions.append(row['region'])
        charges.append(row['charges'])
    

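Using the built-in `len` function, I verified that every list has the same number of entries, which confirms that each row contributed a value to every column. This check does not by itself detect empty values within a column.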

In [5]:
# verify that every column list has the same number of entries
print(len(ages) == len(sexes) == len(bmis) == len(num_children) == len(smoker_status) == len(regions) == len(charges))

True
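Equal list lengths do not rule out blank cells, so a separate check for empty values can be useful. The following is a minimal sketch. `count_missing` and the example lists are hypothetical names introduced for illustration and are not part of the original notebook.

```python
def count_missing(values):
    """Count entries that are None, empty, or whitespace-only strings."""
    return sum(1 for value in values if value is None or value.strip() == "")

# Hypothetical example lists; in the notebook these would be ages, regions, etc.
example_ages = ["19", "", "28"]
example_regions = ["southwest", "southeast", "  "]

print(count_missing(example_ages))
print(count_missing(example_regions))
```

A result of 0 for every column list would support the claim that no datapoints are missing.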


In order to better understand our dataset, I created a function to find the average of a given variable.

In [6]:
def find_average(variable):
    """Return the mean of a list of numeric strings (or numbers)."""
    total = 0
    for x in variable:
        total += float(x)  # values read from the CSV are strings
    return total / len(variable)

Using the function that I created above, I found the average age, BMI, number of children, and insurance cost for the dataset.

In [7]:
print(find_average(ages))
print(find_average(bmis))
print(find_average(num_children))
print(find_average(charges))

39.20702541106129
30.663396860986538
1.0949177877429
13270.422265141257


In our dataset, the patients have the following averages:

* Average age = 39
* Average BMI = 30.66
* Average number of children = 1
* Average insurance cost = $13270.42
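For comparison, the same averages can be computed with the standard library instead of a hand-written loop. This is a sketch of one alternative, not the approach used in this notebook. `find_average_stdlib` is a hypothetical name, and `statistics.fmean` requires Python 3.8 or later.

```python
from statistics import fmean

def find_average_stdlib(values):
    """Average a list of numeric strings, such as values read from a CSV."""
    return fmean([float(v) for v in values])

# Made-up example values for illustration
example_children = ["0", "1", "3", "0"]
print(find_average_stdlib(example_children))
```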

By iterating through the "sexes" list, we can find the number of females vs males in our dataset.

In [8]:
male_count = 0
female_count = 0
for sex in sexes:
    if sex == "male":
        male_count += 1
    else:
        female_count += 1

print("There are {male_count} males in the dataset.".format(male_count = male_count))
print("There are {female_count} females in the dataset.".format(female_count = female_count))

There are 676 males in the dataset.
There are 662 females in the dataset.


We can examine the regions list by iterating through each item and finding the number of patients from each region.

In [9]:
region_count = {}
for region in regions:
    if region not in region_count:
        region_count[region] = 1
    else:
        region_count[region] += 1

print(region_count)

{'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324}
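The same tally can be produced with `collections.Counter` from the standard library, which also sorts categories by frequency. A minimal sketch, using a made-up list in place of the notebook's `regions` list:

```python
from collections import Counter

# Made-up example standing in for the regions list
example_regions = ["southwest", "southeast", "southeast", "northwest"]

example_region_count = Counter(example_regions)
print(example_region_count)
print(example_region_count.most_common(1))  # the most frequent region and its count
```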


We can also compare the number of smokers vs nonsmokers.

In [10]:
smoker_count = 0
nonsmoker_count = 0
for data in smoker_status:
    if data == "yes":
        smoker_count += 1
    else:
        nonsmoker_count += 1

print("There are {smoker_count} smokers in the dataset.".format(smoker_count = smoker_count))
print("There are {nonsmoker_count} nonsmokers in the dataset.".format(nonsmoker_count = nonsmoker_count))

There are 274 smokers in the dataset.
There are 1064 nonsmokers in the dataset.


Now that we have a general overview of the demographics in our dataset, we can analyze how various factors affect the cost of insurance. By creating a function that takes a column name (`cost_factor`) and two values from that column to compare (`factor1` and `factor2`), we can compare the average charges for the two groups.

In [11]:
def compare_costs(cost_factor, factor1, factor2):
    with open("insurance.csv") as f:
        file_dict = csv.DictReader(f)
        total_factor1_charges = 0
        factor1_count = 0
        total_factor2_charges = 0
        factor2_count = 0
        for row in file_dict:
            if row[cost_factor] == factor1:
                total_factor1_charges += float(row['charges'])
                factor1_count += 1
            elif row[cost_factor] == factor2:
                total_factor2_charges += float(row['charges'])
                factor2_count += 1
                
    average_factor1_cost = round((total_factor1_charges / factor1_count), 2)
    average_factor2_cost = round((total_factor2_charges / factor2_count), 2)
             
    print("Average insurance costs if {cost_factor} is {factor1}: ${average_factor1_cost}"
          .format(cost_factor = cost_factor, factor1 = factor1, average_factor1_cost = (average_factor1_cost)))
    print("Average insurance costs if {cost_factor} is {factor2}: ${average_factor2_cost}"
          .format(cost_factor = cost_factor, factor2 = factor2, average_factor2_cost = average_factor2_cost))
    
    cost_difference = round(average_factor1_cost - average_factor2_cost, 2)
                    
    print("The difference in costs depending on {cost_factor} is ${cost_difference}"
          .format(cost_factor = cost_factor, cost_difference = cost_difference))

    # Return the difference so that callers can store it in a variable
    return cost_difference

Using the function created above, we can compare the average costs for male and female patients.

In [12]:
insurance_diff_for_sex = compare_costs('sex', 'male', 'female')


Average insurance costs if sex is male: $13956.75
Average insurance costs if sex is female: $12569.58
The difference in costs depending on sex is $1387.17


On average, male patients pay $1387.17 more for insurance annually than female patients.

We can also use the function to compare costs for smoker status.

In [13]:
insurance_diff_for_smoker_status = compare_costs('smoker', 'yes', 'no')

Average insurance costs if smoker is yes: $32050.23
Average insurance costs if smoker is no: $8434.27
The difference in costs depending on smoker is $23615.96


On average, patients who answer "yes" to smoker status pay $23615.96 more annually compared to patients who answer "no".
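A two-value comparison works for sex and smoker status, but a column such as region has more than two categories. The following sketch shows one way to generalize the idea to any number of categories. `average_charges_by` is a hypothetical helper that is not part of the original notebook, and the sample rows are made up.

```python
import csv
import io
from collections import defaultdict

def average_charges_by(rows, column):
    """Return {category: average charge} for every value found in `column`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        key = row[column]
        totals[key] += float(row["charges"])
        counts[key] += 1
    return {key: round(totals[key] / counts[key], 2) for key in totals}

# Made-up sample standing in for insurance.csv (illustrative values only)
sample_csv = io.StringIO(
    "smoker,region,charges\n"
    "yes,southeast,30000.00\n"
    "no,southeast,5000.00\n"
    "no,northwest,7000.00\n"
)
rows = list(csv.DictReader(sample_csv))

print(average_charges_by(rows, "smoker"))
print(average_charges_by(rows, "region"))
```

Returning a dictionary rather than printing keeps the results available for later comparisons.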

Another factor we can analyze is BMI. BMI stands for body mass index, a value calculated from a person's height and weight that is used as a rough indicator of body fat. BMI is separated into categories:

* BMI < 18.5 is in the underweight range
* BMI 18.5 - 24.9 is in the healthy weight range
* BMI 25.0 - 29.9 is in the overweight range
* BMI >=30.0 is in the obesity range
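The categories above can be expressed as a small function. This is a minimal sketch. `bmi_category` is a hypothetical helper name, and it assumes that values between 24.9 and 25.0 (and between 29.9 and 30.0) belong to the lower category.

```python
def bmi_category(bmi):
    """Map a numeric BMI value to one of the four ranges listed above."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "healthy"
    if bmi < 30:
        return "overweight"
    return "obese"

# Made-up BMI values for illustration
for value in (17.0, 22.3, 27.8, 33.1):
    print(value, bmi_category(value))
```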

With this in mind, we can analyze the difference in insurance costs by BMI.

In [14]:
with open('insurance.csv') as insurance_file:
    insurance_dict = csv.DictReader(insurance_file)
    underweight_count = 0
    healthy_count = 0
    overweight_count = 0
    obese_count = 0
    
    underweight_charges = 0
    healthy_charges = 0
    overweight_charges = 0
    obese_charges = 0
    
    for row in insurance_dict:
        bmi = float(row['bmi'])
        charge = float(row['charges'])
        if bmi < 18.5:
            underweight_count += 1
            underweight_charges += charge
        elif 18.5 <= bmi < 25:
            healthy_count += 1
            healthy_charges += charge
        elif 25 <= bmi < 30:
            overweight_count += 1
            overweight_charges += charge
        else:
            obese_count += 1
            obese_charges += charge
    
    avg_underweight_charges = round(underweight_charges/ underweight_count, 2)
    avg_healthy_charges = round(healthy_charges / healthy_count, 2)
    avg_overweight_charges = round(overweight_charges / overweight_count, 2)
    # Note: no ndigits argument here, so this average is rounded to whole dollars
    avg_obese_charges = round(obese_charges / obese_count)
    
print("The average cost for patients in the underweight range is : $" + str(avg_underweight_charges))
print("The average cost for patients in the healthy range is : $" + str(avg_healthy_charges))
print("The average cost for patients in the overweight range is : $" + str(avg_overweight_charges))
print("The average cost for patients in the obese range is : $" + str(avg_obese_charges))

print("The difference in average cost for obese patients versus overweight patients is: $" + str(avg_obese_charges - avg_overweight_charges))

The average cost for patients in the underweight range is : $8852.2
The average cost for patients in the healthy range is : $10409.34
The average cost for patients in the overweight range is : $10987.51
The average cost for patients in the obese range is : $15552
The difference in average cost for obese patients versus overweight patients is: $4564.49


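From our analysis above, we can see that patients in the obese range pay about $4564.49 more on average than patients in the overweight range. The obese average is rounded to the nearest dollar in the code above, so this difference is accurate only to within about a dollar.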

## Findings and Recommendations

Based on the calculations in this analysis, being a smoker and/or having a higher BMI is associated with significantly higher annual insurance costs. While other factors such as age, sex, number of children, and region can also affect these costs, average insurance costs are about $23600 higher for smokers than for nonsmokers and about $4500 higher for patients in the obese BMI range than for those in the overweight range. Based on these findings, I would recommend that patients with high insurance costs consider quitting smoking or lowering their BMI, if possible, to reduce their annual insurance costs.