# U.S. Medical Insurance Costs

This project examines how various factors relate to medical insurance costs. Using Python fundamentals, we analyze **insurance.csv** from the provided starter ZIP and note where the data could support further analysis.

The analysis covers the following topics:
- The average insurance cost by region
- The smoking rate and average BMI by region, and how they could correlate with insurance costs for those regions
- The potential correlation of sex, number of children, and age with insurance costs for the dataset as a whole

The dataset shows no evidence of missing values, which simplifies preparing it for analysis.
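One way to check for missing values is to count blank or placeholder cells per column. The sketch below is only an assumption about how such a check might look, not the check used in this project. `count_missing` and its `placeholders` argument are hypothetical names, and the sample rows are made up for illustration.

```python
def count_missing(rows, placeholders=("", "na", "n/a", "nan", "null", "none")):
    """Count cells per column that are None, blank, or a common placeholder."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts.setdefault(column, 0)
            # csv.DictReader yields strings, or None when a row has too few fields
            if value is None or str(value).strip().lower() in placeholders:
                counts[column] += 1
    return counts

# Made-up rows for illustration only (not taken from insurance.csv)
sample_rows = [
    {"age": "30", "bmi": "25.0", "smoker": "no"},
    {"age": "", "bmi": "31.2", "smoker": "yes"},
    {"age": "45", "bmi": "NA", "smoker": None},
]
print(count_missing(sample_rows))
```

Run against the full list of row dictionaries, a result of zero for every column would support the claim that nothing is missing.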

In [2]:
#Import CSV library
import csv

In [6]:
### Generate list of dictionaries from CSV file

# Create a list holding one dictionary per row in the file
insurance_db = []
# Open the .csv and load the data into our list
# (newline="" is the setting the csv module documentation recommends)
with open("insurance.csv", newline="") as csv_file:
  insurance = csv.DictReader(csv_file)
  for row in insurance:
    insurance_db.append(row)
print(insurance_db[0])

{'age': '19', 'sex': 'female', 'bmi': '27.9', 'children': '0', 'smoker': 'yes', 'region': 'southwest', 'charges': '16884.924'}


In [87]:
# Generic numerical "average" function
def average(column, dict_list):
  if not dict_list: # avoid dividing by zero on an empty list
    return 0.0
  total = 0
  for row in dict_list:
    value = row[column]
    try:
      total += float(value)
    except (TypeError, ValueError): # input validation against non-numeric columns
      print("error: column not entirely composed of numbers")
      print("first offending value: " + str(value))
      return 0.0
  return total / len(dict_list)

# Functions to return narrowed versions of the database
def narrow(db, column, value):
  def narrower(row):
    return row[column] == value
  return list(filter(narrower, db))

def narrow_numeric(db, column, lower, upper):
  # Keep rows where lower <= value < upper (the upper bound is exclusive)
  def narrower(row):
    float_r = float(row[column])
    return lower <= float_r < upper
  return list(filter(narrower, db))

# Function to get list of unique regions
def regions_list(db):
  regions = []
  for row in db:
    row_r = row["region"]
    if row_r not in regions: 
      regions.append(row_r)
  return regions

In [64]:
average_overall = round(average("charges", insurance_db), 2)
print("The average insurance charge in the dataset overall is $" + str(average_overall) + ".\n")
regionslist = regions_list(insurance_db) # save regions list for future use in this project
regional_dbs = {} # save regional databases for future use in this project
for region in regionslist:
  regional_dbs[region] = narrow(insurance_db, "region", region)
  average_by_region = round(average("charges", regional_dbs[region]), 2)
  print("The average insurance charge for the \"" + region + "\" region is $" + str(average_by_region) + ".")

The average insurance charge in the dataset overall is $13270.42.

The average insurance charge for the "southwest" region is $12346.94.
The average insurance charge for the "southeast" region is $14735.41.
The average insurance charge for the "northwest" region is $12417.58.
The average insurance charge for the "northeast" region is $13406.38.


The file contains four unique regions. The southeast has the highest average insurance charge at $14,735.41, roughly $1,465 above the overall average. Previous data analyses have singled out the South for health problems; [the percentage of adults who have been told by a doctor that they have diabetes tends to be highest in this region.](https://www.kff.org/wp-content/uploads/2016/02/8836-figure-5.png)

In [94]:
average_smoker = round(average("charges", narrow(insurance_db, "smoker", "yes")), 2)
average_nonsmoker = round(average("charges", narrow(insurance_db, "smoker", "no")), 2)
print("The average insurance charge in the dataset for smokers is $" + str(average_smoker) + ", compared to $" + str(average_nonsmoker) + " for nonsmokers.\n")
for region in regionslist:
  smoker_by_region = round(len(narrow(regional_dbs[region], "smoker", "yes")) / len(regional_dbs[region]) * 100, 1) # percentage, rounded last to avoid floating-point noise
  print(str(smoker_by_region) + "% of " + region + " adults are smokers.")
print()
average_bmi = round(average("bmi", insurance_db), 2)
avg_below_bmi = round(average("charges", narrow_numeric(insurance_db, "bmi", 0, average_bmi)), 2)
avg_above_bmi = round(average("charges", narrow_numeric(insurance_db, "bmi", average_bmi, float("inf"))), 2)
print("The average BMI in the dataset is " + str(average_bmi) + ".\n")
print("The average insurance charge in the dataset for those with an above average BMI is $" + str(avg_above_bmi) + ", \ncompared to $" + str(avg_below_bmi) + " for those with a below average BMI.\n")
for region in regionslist:
  bmi_by_region = round(average("bmi", regional_dbs[region]), 2)
  print("The average BMI in the " + region + " region is " + str(bmi_by_region) + ".")

The average insurance charge in the dataset for smokers is $32050.23, compared to $8434.27 for nonsmokers.

17.8% of southwest adults are smokers.
25.0% of southeast adults are smokers.
17.8% of northwest adults are smokers.
20.7% of northeast adults are smokers.

The average BMI in the dataset is 30.66.

The average insurance charge in the dataset for those with an above average BMI is $15801.79, 
compared to $10907.33 for those with a below average BMI.

The average BMI in the southwest region is 30.6.
The average BMI in the southeast region is 33.36.
The average BMI in the northwest region is 29.2.
The average BMI in the northeast region is 29.17.


The southeast has both the highest average BMI (33.36) and the highest smoking rate (25.0%) of the four regions, which is consistent with its higher average charge, although these averages alone do not show cause. Further analysis could be performed as to how *exactly* these factors increase the Southeast's insurance costs, but that is beyond the scope of this project.
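One possible starting point for that further analysis is to average charges within each combination of region and smoking status, so that regions are compared among smokers and nonsmokers separately. The following is a minimal sketch under that assumption. `average_by_groups` is a hypothetical helper that is not part of this notebook, and the sample rows are made up for illustration.

```python
from collections import defaultdict

def average_by_groups(rows, group_columns, value_column):
    """Average value_column for each combination of values in group_columns."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, row count]
    for row in rows:
        key = tuple(row[column] for column in group_columns)
        totals[key][0] += float(row[value_column])
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

# Made-up rows for illustration only (not taken from insurance.csv)
sample_rows = [
    {"region": "southeast", "smoker": "yes", "charges": "30000"},
    {"region": "southeast", "smoker": "no", "charges": "8000"},
    {"region": "southeast", "smoker": "no", "charges": "10000"},
    {"region": "northwest", "smoker": "no", "charges": "9000"},
]
grouped = average_by_groups(sample_rows, ["region", "smoker"], "charges")
for key, mean in sorted(grouped.items()):
    print(key, round(mean, 2))
```

If a region's average stays high among nonsmokers alone, smoking rates by themselves would not account for the regional gap.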

In [117]:
average_male = round(average("charges", narrow(insurance_db, "sex", "male")), 2)
average_female = round(average("charges", narrow(insurance_db, "sex", "female")), 2)
print("The average insurance charge in the dataset for men is $" + str(average_male) + ", compared to $" + str(average_female) + " for women.\n")

for i in range(4):
  average_children = round(average("charges", narrow_numeric(insurance_db, "children", i, i + 1)), 2)
  print("The average insurance charge in the dataset for people with " + str(i) + " child(ren) is $" + str(average_children) + ".")
average_children = round(average("charges", narrow_numeric(insurance_db, "children", 4, float("inf"))), 2)
print("The average insurance charge in the dataset for people with 4 children or more is $" + str(average_children) + ".\n")

range_span = 8
for i in range(6):
  range_num = i * range_span
  bottom = 18 + range_num
  top = 18 + range_span + range_num
  average_age = round(average("charges", narrow_numeric(insurance_db, "age", bottom, top)), 2)
  print("The average insurance charge in the dataset for people aged " + str(bottom) + "-" + str(top) + " is $" + str(average_age) + ".")

The average insurance charge in the dataset for men is $13956.75, compared to $12569.58 for women.

The average insurance charge in the dataset for people with 0 child(ren) is $12365.98.
The average insurance charge in the dataset for people with 1 child(ren) is $12731.17.
The average insurance charge in the dataset for people with 2 child(ren) is $15073.56.
The average insurance charge in the dataset for people with 3 child(ren) is $15355.32.
The average insurance charge in the dataset for people with 4 children or more is $11730.58.

The average insurance charge in the dataset for people aged 18-26 is $9087.02.
The average insurance charge in the dataset for people aged 26-34 is $10267.61.
The average insurance charge in the dataset for people aged 34-42 is $11784.23.
The average insurance charge in the dataset for people aged 42-50 is $15283.89.
The average insurance charge in the dataset for people aged 50-58 is $16519.63.
The average insurance charge in the dataset for people aged

Men have a higher average insurance charge than women ($13,956.75 versus $12,569.58). Average charges rise with the number of children up to three, then fall for people with four or more. Average charges also rise with age across the brackets shown above. Further analysis could be performed as to how exactly these factors increase overall insurance costs, but that is beyond the scope of this project.
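The bucketed averages above suggest trends but do not measure how strong they are. One common way to quantify a linear relationship between two numeric columns is the Pearson correlation coefficient. The sketch below computes it by hand. `pearson` and `column_as_floats` are hypothetical helpers that are not part of this notebook, and the sample rows are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when a column is constant")
    return covariance / (spread_x * spread_y)

def column_as_floats(rows, column):
    """Pull one column out of a list of row dictionaries as floats."""
    return [float(row[column]) for row in rows]

# Made-up rows for illustration only (not taken from insurance.csv)
sample_rows = [
    {"age": "20", "charges": "2000"},
    {"age": "30", "charges": "3000"},
    {"age": "40", "charges": "4000"},
]
r = pearson(column_as_floats(sample_rows, "age"),
            column_as_floats(sample_rows, "charges"))
print(round(r, 3))
```

A value near 1 or -1 indicates a strong linear relationship and a value near 0 a weak one. The coefficient would miss a non-monotonic pattern such as the one seen for number of children.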