# U.S. Medical Insurance Costs:  A Python Story
By Brian Batcheldor
<br>
<br>
How do we make a csv file into a story anyone can understand?  How do we find features within a dataset that can help us draw real conclusions?  Let's look for some ways that Python can help us to understand relational data by looking at the relational data file `insurance.csv`.  First, let's import our data into Python using the csv module and the method `DictReader()`, which will convert it into a list of Python dictionaries.

In [1]:
import csv

medical_data = []

with open('insurance.csv', newline='') as insurance_csv:
    reader = csv.DictReader(insurance_csv)
    for row in reader:
        medical_data.append(row)

print(medical_data[:3])

[{'age': '19', 'sex': 'female', 'bmi': '27.9', 'children': '0', 'smoker': 'yes', 'region': 'southwest', 'charges': '16884.924'}, {'age': '18', 'sex': 'male', 'bmi': '33.77', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '1725.5523'}, {'age': '28', 'sex': 'male', 'bmi': '33', 'children': '3', 'smoker': 'no', 'region': 'southeast', 'charges': '4449.462'}]


## Research Questions and Hypotheses:
In this exercise, I will use the imported dataset to explore the following questions and test the associated hypotheses.

### R1: How does aging affect medical costs in women vs. men?

H1:  Women under 45 will have higher average medical costs than men.

H2:  Men over 45 will have higher average medical costs than women.

### R2:  How do medical costs vary by region?
H3:  People from the Southeastern region will have the highest average medical costs.

H4:  People from the Northwest region will have the lowest average medical costs.

### R3:  What factors have the greatest effect on medical costs?
H4:  Smoking will be the single greatest factor influencing medical costs.

H5:  BMI of 30 or over will have a significant effect on medical costs.

<br>
<br>

## How does aging effect medical costs in men and women?
First, let's address the question of aging in men and women.  We hypothesize that women will have higher average medical costs than men when they are of childbearing are (under 45), and that thereafter, men will have overall higher medical costs.  To test these hypotheses, we will define a function that takes the parameters `min_age`, `max_age` and `sex` and returns the average medical cost.

In [2]:
def average_by_age_and_sex(min_age, max_age, sex):
    count = 0
    total = 0.0
    for row in medical_data:
        age = int(row['age'])
        if age < min_age or age > max_age or row['sex'] != sex:
            continue
        else:
            count += 1
            total += float(row['charges'])
    return total / count    

Now, let's use this function to see the average medical costs for men and women under 45!

In [3]:
women_under_45 = average_by_age_and_sex(0, 45, 'female')
men_under_45 = average_by_age_and_sex(0, 45, 'male')
print('The average medical costs for women under 45 was $%s, while the average costs for men under 45 was $%s' % 
    (women_under_45, men_under_45))

The average medical costs for women under 45 was $10178.259371537897, while the average costs for men under 45 was $11638.001295291382


As you can see, our hypothesis was wrong!  Men under 45 have higher health care costs than women.  Now let's use our function to see the average medical costs for men and women *over* 45.

In [4]:
women_over_45 = average_by_age_and_sex(45, 100, 'female')
men_over_45 = average_by_age_and_sex(45, 100, 'male')
print('The average medical costs for women over 45 was $%s, while the average costs for men under 45 was $%s.' % 
      (women_over_45, men_over_45))

The average medical costs for women over 45 was $16241.536243183511, while the average costs for men under 45 was $17915.267064961827.


Again, the average medical costs for men are higher than women.  Good to know!
<br>
<br>
<br>
<br>

## Medical Costs by Region
Men have higher medical costs than women, but what region of the U.S. has the highest average medical costs?  I notice that several of our questions have to do with average costs.  It would be helpful for us to have a single, reusable function that finds the average medical costs of any dataset.  Let's start there:

In [5]:
def average_costs(dataset):
    total = 0.0
    for row in dataset:
        total += float(row['charges'])
    return total / len(dataset)

print(average_costs(medical_data))

13270.422265141257


Now when we want to drill down in the data, all we need to do is filter the dataset -- keeping just the rows with the characteristic we're looking for -- then use our `average_costs` function to find the average.  A simple way to filter the rows we want is with list comprehensions!  First, let's use a list comprehension to make a dataset with just rows for individuals from the Southeast region.  We'll check our list comprehension trick by printing the first three lines.

In [6]:
southeast_region = [row for row in medical_data if row['region'] == 'southeast']

print(southeast_region[:3])

[{'age': '18', 'sex': 'male', 'bmi': '33.77', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '1725.5523'}, {'age': '28', 'sex': 'male', 'bmi': '33', 'children': '3', 'smoker': 'no', 'region': 'southeast', 'charges': '4449.462'}, {'age': '31', 'sex': 'female', 'bmi': '25.74', 'children': '0', 'smoker': 'no', 'region': 'southeast', 'charges': '3756.6216'}]


It works!  Now, we can find the average medical costs for people from the Southeast.

In [7]:
southeast_costs = average_costs(southeast_region)
print(southeast_costs)

14735.411437609895


Now, let's compare costs with the other regions.  To save time, we will write those handy list comprehensions right into the average cost function thus:

In [8]:
northeast_region = [row for row in medical_data if row['region'] == 'northeast']
southwest_region = [row for row in medical_data if row['region'] == 'southwest']
northwest_region = [row for row in medical_data if row['region'] == 'northwest']

northeast_costs = average_costs(northeast_region)
southwest_costs = average_costs(southwest_region)
northwest_costs = average_costs(northwest_region)
                                
print('Average costs -- Northeast: %s, Southwest: %s, Northwest: %s.' % (northeast_costs, southwest_costs, northwest_costs)) 

Average costs -- Northeast: 13406.3845163858, Southwest: 12346.93737729231, Northwest: 12417.575373969228.


Our h3 hypothesis was correct!  Southeastern folks have the highest average medical costs.  However, our h4 hypothesis has proven to be wrong, as people in the Southwest region have the lowest average medical costs, lower than the Northwest.  Must be that clean desert air.
<br>
<br>
While exploring these first four hypotheses, I noticed that those average medical costs seem somewhat high.  Are our averages being skewed by a few outlying datapoints?  It might be useful to step away from our hypotheses for a while and check to see if our datasets are skewed by comparing the average to the median.  Let's define a function to identify the median medical cost of a dataset.

In [9]:
def median_costs(dataset):
        #duplicate dataset and sort by charges (converted to float)
    sorted_data = sorted(dataset, key=lambda row: float(row['charges']))
        #if dataset is odd, return the charges from the middle row
    if len(dataset) % 2 != 0:
        return float(sorted_data[len(dataset)//2]['charges'])
        #otherwise, average the charges from the two middle rows
    else:
        middle1 = float(sorted_data[len(dataset)//2]['charges'])
        middle2 = float(sorted_data[len(dataset)//2 + 1]['charges'])
        return (middle1 + middle2) / 2

print('Median medical costs is %s, vs average cost of %s.' % (median_costs(medical_data), average_costs(medical_data)))

Median medical costs is 9388.753649999999, vs average cost of 13270.422265141257.


Ouch.  That's a lot of skew.  It appears we will need to be much more careful with how we use averages with this dataset.  In addition to comparing averages, we will need to compare median numbers for various subgroups in order to make sure the skewing data isn't giving us a false impression.  
<br>
Let's examine the median medical costs for each region:

In [10]:
print('Median costs by region -- Northeast: %s, Southeast: %s, Northwest: %s, Southwest: %s' %
      (
          median_costs(northeast_region), 
          median_costs(southeast_region), 
          median_costs(northwest_region), 
          median_costs(southwest_region)
      )
)

Median costs by region -- Northeast: 10089.09465, Southeast: 9341.3033, Northwest: 8965.79575, Southwest: 8798.593


Indeed, our h3 conclusion is now somewhat in doubt.  The median medical costs for the Northeast region is higher than the Southeast.  This median number may be more descriptive than the average when drawing conclusions about our skewed dataset.  We will continue to check any comparisons made with averages against the median.
<br>
<br>
<br>
## What factors have the greatest effect on medical costs?
Everyone knows that smoking is bad for your health.  But is smoking a better predictor of high medical costs than other factors, such as age or BMI?  Let's define a methodology for comparing the predictive power of various demographic categories.
- First, we need to define several boolean conditions to test.  Smoking vs. non-smoking is a simple example; however, other conditions will have to be specified.  For example, what qualifies as a high BMI for the sake of comparison?
- Second, we will need create subsets of our database that are filtered by condition.  Fortunately, we can do this easily with list comprehensions!  We will need subsets for people that meet our conditions and don't meet our conditions.
- Third, we will need to find the averages for our subsets.  We have a function for that!
- Last, we need to divide the positive subset by the negative subset to identify an effect factor.  For example, we will divide the average of smokers by the average of non-smokers to identify the factor by which smoking increases the average costs.
<br>
Now that we have a plan, let's define our conditions!  Smoking vs. non-smoking is an obvious first choice.  
Based on our comparison of men vs. women, sex doesn't appear to have too great of an effect on medical costs.  Lets look at BMI and age.  For BMI, lets look at a few conditions:  BMI of over 25 (considered overweight) vs. under; BMI over 30 (considered obese); and BMI over 40 (considered severely obese).  For age, lets look at the effect on costs of being over 45 (vs. under) and being over 60 (vs. under).  To sum it all up, we will examine the effect on costs of the following conditions:
1. Smoking vs. non-smoking
2. BMI of over 25 vs. under
3. BMI of over 30 vs. under
4. BMI of over 40 vs. under
5. Age over 45 vs. under
6. Age over 60 vs. under
<br>
Let's look at our first condition, smoking vs. non-smoking.  First, we seperate our dataset into subsets for each condition.

In [11]:
smokers = [row for row in medical_data if row['smoker'] == 'yes']
non_smokers = [row for row in medical_data if row['smoker'] == 'no']

Easily done!  Now, we take the average of each and divide the average for smokers by that of non-smokers to find a factor that describes the effect on medical costs.  

In [12]:
smoker_costs = average_costs(smokers)
non_smoker_costs = average_costs(non_smokers)
factor = smoker_costs / non_smoker_costs
print('The average smoker costs %s, and the average non-smoker costs %s, so on average, smoking increases medical costs by a factor of %s.'
      % (smoker_costs, non_smoker_costs, factor))

The average smoker costs 32050.23183153285, and the average non-smoker costs 8434.268297856199, so on average, smoking increases medical costs by a factor of 3.8000014582983206.


Wow!  Average costs for smokers are almost four times as high as non-smokers!  We will test our remaining conditions to see if we can find one that has a greater effect on costs than smoking.  However, our other conditions are highly repetitious with only slight changes in certain variables.  For example, our 3 BMI conditions are virtually the same, except for the BMI cutoff each one uses.  We could save some effort here by defining a function that will automate the repetitive practice of identifying subsets, averaging costs, and finding an effect factor.  This function will take the parameters `dataset` (meaning the full dataset), `category` (as in BMI or age), and `cutoff`, an integer.

In [13]:
def find_effect_on_cost(dataset, category, cutoff):
    under_set = [row for row in dataset if float(row[category]) < cutoff]
    over_set = [row for row in dataset if float(row[category]) >= cutoff]
    return average_costs(over_set) / average_costs(under_set)

No problem!  Now we can use this function to find the "effect factor" of each of our remaining conditions.

In [14]:
overweight_effect = find_effect_on_cost(medical_data, 'bmi', 25)
obese_effect = find_effect_on_cost(medical_data, 'bmi', 30)
severely_obese_effect = find_effect_on_cost(medical_data, 'bmi', 40)
age_over_45 = find_effect_on_cost(medical_data, 'age', 45)
age_over_60 = find_effect_on_cost(medical_data, 'age', 60)

print('Effect factors:')
print ('BMI over 25: %s, BMI over 30: %s, BMI over 40: %s, age over 45: %s, age over 60: %s' %
       (overweight_effect, obese_effect, severely_obese_effect, age_over_45, age_over_60))

Effect factors:
BMI over 25: 1.3557608965994707, BMI over 30: 1.4516351509882732, BMI over 40: 1.289737951611525, age over 45: 1.582714193822603, age over 60: 1.6961224208393453


Our h4 hypothesis appears to be correct!  Smoker's medical costs are proportionally greater than non-smoker's by a significantly greater factor than any other single condition.  However, we will want to check this result against a similar analysis that uses the median, since we have seen how dramatically our data can be skewed by outliers.  Let's quickly define a new function to compare the median costs of people who do and don't meet our conditions.  We can basically copy and paste our previous code and make one small change, indicated below.

In [15]:
def median_effect_on_cost(dataset, category, cutoff):
    under_set = [row for row in dataset if float(row[category]) < cutoff]
    over_set = [row for row in dataset if float(row[category]) >= cutoff]
    return median_costs(over_set) / median_costs(under_set) #<------------Change average_costs to median_costs

Now, lets look at effect factors for the *median* costs and compare them to those of the averages.  Our handy function doesn't work with our smoker data, so we will have to work those out the long way!

In [16]:
median_smoking_effect = median_costs(smokers) / median_costs(non_smokers)
median_smoking_effect

4.702007587709066

Wow!  If anything, our smoker data is skewed in the *opposite* direction as the rest of the set!  Let's quickly use our function to check the rest of the conditions using our `median_effect_on_cost()` function.

In [17]:
median_overweight_effect = median_effect_on_cost(medical_data, 'bmi', 25)
median_obese_effect = median_effect_on_cost(medical_data, 'bmi', 30)
median_severely_obese_effect = median_effect_on_cost(medical_data, 'bmi', 40)
median_age_over_45_effect = median_effect_on_cost(medical_data, 'age', 45)
median_age_over_60_effect = median_effect_on_cost(medical_data, 'age', 60)
print('Effect factors (using median):')
print('BMI over 25: %s, BMI over 30: %s, BMI over 40: %s, age over 45: %s, age over 60: %s' % 
      (median_overweight_effect, median_obese_effect, median_severely_obese_effect, median_age_over_45_effect, 
       median_age_over_60_effect))

Effect factors (using median):
BMI over 25: 1.1142731478941263, BMI over 30: 1.1580078951047805, BMI over 40: 1.0493825846210383, age over 45: 2.150065931221872, age over 60: 1.6719982053637383


It is now clear that smoking has a much greater effect on medical costs than any other factor, so h4 has definitively been proven true.  However, there is still some doubt about h5, our hypothesis that BMI of over 30 would have a significant effect on medical costs.  The average effect factor for BMI over 30 vs. under was around 1.5; however, the median effect was much smaller.  Some of the difficulty with our hypothesis is because the word *significant* is unspecific.  Is 1.5 a signficant effect?  We need a measure of the variability of the data to see if 1.5 is *statistically* significant.  We need a standard deviation!
<br>
Standard deviation is the average distance between the mean of a dataset and each datapoint.  Let's write a function to find the standard deviation of the medical costs within a dataset and test it on a few of our subsets to see what effect factor might be significant!

In [18]:
def costs_standard_deviation(dataset):
    average = average_costs(dataset)
    total = 0
    for row in dataset:
        total += abs(float(row['charges']) - average)
    return total / len(dataset)

Now lets use this function to find the standard deviation for our dataset.

In [19]:
costs_standard_deviation(medical_data)

9091.126581137027

Wow!  The standard deviation is very large compared to the average.  So we were right to be mistrustful of that 1.5 effect factor, since the standard deviation is quite a bit larger than the increase to the average caused by BMI, and the effect factor actually goes *down* as BMI increases, leading us to conclude that there are some outliers to this data set that are causing some odd features.
<br>
<br>
I hope this journey through our U.S. Medical Insurance Costs dataset has been enlightening.  Cheers! 