# U.S. Medical Insurance Costs

## Introduction & Objectives:
This project is about analyzing a dataset of US Medical Insurance Costs.
The ogininal data file  is `insurance.csv`. Variables are:  age, sex, bmi, children, smoker, region, **charges**

// Objectives:
- What is correlated to high medical costs?
- What is correlated to low medical costs?
- Are there weird correlations, like smoking and children? Do parents of many children smoke more?

### Roadmap
0/ [Prep](#prep): import libraries and the CSV file.

1/ [Clean up the data](#prep):
 - Make an indexed version of the data. (This might make it easier to loop through if I can use the id number as an index.)
 - Create sub dictionaries of interesting pairs.

2/ [Exploring Basic facts : Ranges and Counts](#explore):
- number of records,
- ranges of values in some variables, etc...

3/ [Analysis](#analysis)
- Percentages of smokers, regions, children...
- Costs compared to each variable.
- [Building a function to compare any variable to charges](#automate)
- [Average cost for age/BMI, by category](#"ageandbmi")

4/ Conclusions:
This was a fun and interesting project. Obviously on training wheels, but I challenged myself in certain places and learned a lot doing the project.
You can read about it on my blog. <!-- tk:  link here -->

- Smoking is bad and so is aging. No surprise there, but smoking is far worse than anything else.
- High bmi is quite bad, (more than 2x between lowest category and highest, more than youngest to oldest). But what is bmi exactly? And there is a very long tail at the high end, so I would be curious to look at the _median_ instead of the average.
- The difference by region is surprising; a finer analysis might lead to interesting causal canditates.
- Same with the difference between male and females. Why do men and southeastern people pay more?
For region and sex, it would be interesting to explore if the cause is more tied to behavior genetics, environment, etc...

But boy, smoking is _really_ bad.

⚠️ To try:
- Get the median on bmi values, smoking by age groups, smoking by age.
- More research is needed.


## <h3 id="prep" >Prep and Cleaning up the Data</h3>

### Setup
Import csv library
Import csv file
Turn the CSV into a dictionary of everything

### Prepare Data:
I create 3 indexed dictionaries:
- A full indexed dictionary : "indexed_insurance_data"
- A male and female, (indexed) sub-dictionary  "male_insurance_data" And "female_insurance-data"

And 3 new csv files:
- An indexed csv : "indexed_insurance.csv"
- A indexed male and female csv : "male_indexed_insurance.csv" and "female_indexed_insurance.csv"


In [1]:
#### SETUP

import csv
from collections import Counter # this is used much later, but I'm putting it here


### I want to import my file insurance.csv
insurance_data = csv.DictReader(open("insurance.csv"))
# print(insurance_data)                     ## check/debug line


###
### Making The dictionaries
### (Turns out I used indexed_insurance_data for everything)


## I want to have a giant indexed dictionary of all the data
indexed_insurance_data = {}
idd = 1
for row in insurance_data:
    indexed_insurance_data[idd] = row
    idd += 1
#print(indexed_insurance_data)             ## check/debug line


## I want to create a dictionary of all the males.
males_insurance_data = {}
for key, record in indexed_insurance_data.items():
    # print(key, record)                   ## debug line
    if record.get('sex') == 'male':
        males_insurance_data[key] = record
# print(males_insurance_data)

## A Dictionary of females:
females_insurance_data = {}
for key, record in indexed_insurance_data.items():
    # print(key, record) ## debug line
    if record.get('sex') == 'female':
        females_insurance_data[key] = record
# print(females_insurance_data)         ## check/debug line

###
### Making CSV Files
### (this turned out useless.)


### Let's write new csv files: the indexed version, male and female version

# scoping this wide, we'll reuse it.
fields = ['id', 'age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges']



# creates a new csv file with a new column for the id.
with open("indexed_insurance.csv", "w") as output_csv:
    output_writer = csv.DictWriter(output_csv, fieldnames=fields)
    output_writer.writeheader()
    for index, record in indexed_insurance_data.items():
        row = {'id': index}
        row.update(record)
        output_writer.writerow(row)
## Tricky: we first make the rows using all the fields (adding 'id'). Then, we can just write the record, but then we're ## missing the id. The column exists, but the field is empty. So we do this hack: create a row = id: index dictionary. to which we update the rest of the dictionary. We can write the whole row as a dictionary.


file_path = "male_indexed_insurance.csv"
with open(file_path, 'w')as f:
    output_writer = csv.DictWriter(f, fieldnames=fields)
    output_writer.writeheader()
    for index, record in males_insurance_data.items():
        row={'id': index}
        row.update(record)
        output_writer.writerow(row)


file_path = "female_indexed_insurance.csv"
with open(file_path, 'w')as f:
    output_writer = csv.DictWriter(f, fieldnames=fields)
    output_writer.writeheader()
    for index, record in females_insurance_data.items():
        row={'id': index}
        row.update(record)
        output_writer.writerow(row)





### <h3 id="explore" >Explore Data: Basic facts </h3>

Single variables dictionaries: (This proved useless.)
- indexed_charges {id : charges}, indexed_bmi  {id : bmi}, indexed_age  {id : age}

#### Ranges and Counts : I  want to know the ranges of variables like age, and the number of males, smokers, etc...

Let's counting things:
- The number of records in the data set.
- Number of males/females (several ways, to check. That was interesting.)
- Number of smokers, of records per region, of records per number of children, per age...

⚠️ To Try:
- create a new simple dict directly from the csv using dictreader
- create another function that compare any two variable pair, instead of any variable to charges.




In [2]:
###
### Basic Facts
###

### So I want to have some single variable dict
## this gives a dict of id + charges
indexed_charges = {idd: data['charges'] for idd, data in indexed_insurance_data.items()}
# print(indexed_charges)

## this gives a dict of id + bmi
indexed_bmi = {idd : data['bmi'] for idd, data in indexed_insurance_data.items()}
#print(indexed_bmi)

## this gives a dict of id + age
indexed_age = {idd : data['age'] for idd, data in indexed_insurance_data.items()}
#print(indexed_age)

    ## This is another way to do it that I like less.
    # ind_char = {}
    # for index, record in indexed_insurance_data.items():
    #     ind_char[index] = record['charges']
    # #print(ind_char)

    ## This prints the first ten values of the dict. I would like something that simple, but with the key attached
    # range_ten = range(1,11)
    # for i in range_ten:
    #     print(i, indexed_charges[i])



###
### Ranges & Counts
###

### Getting the number of records:
full_range = range(len(indexed_insurance_data))
print("Number of records: " + str(full_range))


## let's count the males and females:
number_of_males = 0
for i in males_insurance_data.keys():
   number_of_males += 1
##print(number_of_males)  ## returns 676

number_of_females = 0
for i in females_insurance_data.keys():
   number_of_females += 1
# #print(number_of_females) ## returns 662

# let's check using the main dictionary
num_fem = 0
num_mal = 0
for record in indexed_insurance_data.values():
        if record.get('sex') == 'female':
            num_fem += 1
        elif record.get('sex') == 'male':
            num_mal += 1
print(f"Females = {num_fem}, Males = {num_mal}")


### Let's count occurrences of some variables
sex_counter = Counter(record['sex'] for record in indexed_insurance_data.values())
#print(sex_counter)

region_counter = Counter(record['region'] for record in indexed_insurance_data.values())
print( 'Region Count: '+ str(region_counter))

smoker_counter = Counter(record['smoker'] for record in indexed_insurance_data.values())
print(f"Smoker Count: Is this person a smoker? No: {smoker_counter['no']}, Yes: {smoker_counter['yes']}")

children_counter = Counter(record['children'] for record in indexed_insurance_data.values())
print('number of children: ' + str(children_counter))

age_counter = Counter(record['age'] for record in indexed_insurance_data.values())
sorted_age_counter = sorted(age_counter.items(), key=lambda item : item[0])
#print('Age counter (sorted): ' + str(sorted_age_counter)) ## this line's output is very long, so I omit it.

bmi_counter = Counter(record['bmi'] for record in indexed_insurance_data.values())
sorted_bmi_counter = sorted(bmi_counter.items(), key=lambda item:item[0])
#print('Bmi counter (sorted): ' + str(sorted_bmi_counter)) ## this line's output is very long, so I omit it.




Number of records: range(0, 1338)
Females = 662, Males = 676
Region Count: Counter({'southeast': 364, 'southwest': 325, 'northwest': 325, 'northeast': 324})
Smoker Count: Is this person a smoker? No: 1064, Yes: 274
number of children: Counter({'0': 574, '1': 324, '2': 240, '3': 157, '4': 25, '5': 18})


###  <h3 id="analysis">Basic analysis or let's calculate some basics: </h3>
- average insurance cost.
- Cost for smoking : 400% increase in insurance cost.
- cost by region (west pays less, north pays a little less.)

In [3]:
###
### Basic Analysis
###


## getting the average cost of insurance.
total_charges = 0
for i in indexed_charges.values():
    total_charges += float(i)
average_charge = total_charges/len(indexed_charges)
#print(average_charge)
print('   ')
print('===')
print("Average insurance charge: " + str(round(average_charge)))
print('===')


## Comparing smokers and non smokers.
smoker_total_charges = 0
smoker_counter = 0
non_smoker_total_charges = 0
non_smoker_counter = 0
for record in indexed_insurance_data.values():
    if record.get('smoker') == 'yes':
        smoker_total_charges += float(record['charges'])
        smoker_counter += 1
    elif record.get('smoker') == 'no':
        non_smoker_total_charges += float(record['charges'])
        non_smoker_counter += 1
smokers_average_charge = smoker_total_charges/smoker_counter
non_smoker_average_charges = non_smoker_total_charges/non_smoker_counter
print('---')
print('Smoking:')

print(f"On average, smokers pay ${round(smokers_average_charge)} for insurance, while non-smokers pay ${round(non_smoker_average_charges)}. This is {round((smokers_average_charge/non_smoker_average_charges))*100} % more  ")


## Comparing charge / regions :
ne_count =  nw_count = se_count = sw_count = 0
ne_total_charge =  nw_total_charge = se_total_charge = sw_total_charge = 0

for record in indexed_insurance_data.values():
    if record.get('region') == 'northeast':
        ne_count += 1
        ne_total_charge += float(record.get('charges'))
    elif record.get('region') == 'northwest':
        nw_count += 1
        nw_total_charge += float(record.get('charges'))
    elif record.get('region') == 'southwest':
        sw_count += 1
        sw_total_charge += float(record.get('charges'))
    elif record.get('region') == 'southeast':
        se_count += 1
        se_total_charge += float(record.get('charges'))


print('---')
print(f"Average charges by person per regions: \n - North West: ${round(nw_total_charge/nw_count)} \n - North East: ${round(ne_total_charge/ne_count)} \n - South West: ${round(sw_total_charge/sw_count)} \n - South East: ${round(se_total_charge/se_count)} ")
print('---')

   
===
Average insurance charge: 13270
===
---
Smoking:
On average, smokers pay $32050 for insurance, while non-smokers pay $8434. This is 400 % more  
---
Average charges by person per regions: 
 - North West: $12418 
 - North East: $13406 
 - South West: $12347 
 - South East: $14735 
---


####  <h4 id="automate"> Automate (some of) the boring stuff: automate comparing charge to any variable.</h4>
The function takes any variable as input and returns the average cost of insurance per category of that variable.

It works but the output become hard to read when the data has too many categories, like bmi and age, two of the most interesting variables.



In [4]:
###
### Create a function to compare charges per children.
###

print('---\n')
def compare_charges_per_var(var):
    total_charge_per_var_value = {}
    count_by_var_values = {}
    for record in indexed_insurance_data.values():
        var_value = record.get(var)
        charge_of_rec = float(record.get('charges'))

        if var_value in total_charge_per_var_value:
            count_by_var_values[var_value] += 1
            total_charge_per_var_value[var_value] += charge_of_rec

        elif var_value not in total_charge_per_var_value:
            count_by_var_values[var_value] = 1
            total_charge_per_var_value[var_value] = charge_of_rec

    average_charge_per_var = {}
    for var_value, total_charge in total_charge_per_var_value.items():
        average_charge_per_var[var_value] = round(total_charge/count_by_var_values[var_value])

    print(f"Average cost of insurance per {var}: {average_charge_per_var}  ")
    return average_charge_per_var ## this line was added later, so I can keep the dict outside the function and do things to it. Unspeakable things. Python things.

###
### Calling the new function:

compare_charges_per_var('children')
compare_charges_per_var('smoker')
compare_charges_per_var('sex')
#compare_charges_per_var('bmi') ## return print is silly long, it needs categories.
#compare_charges_per_var('age') ## return print is silly long, it needs categories.
compare_charges_per_var('region')
pass ## otherwise the last line outputs twice, because of the return. Oddly enough, the output is not formatted the same way the second time. Interesting.


---

Average cost of insurance per children: {'0': 12366, '1': 12731, '3': 15355, '2': 15074, '5': 8786, '4': 13851}  
Average cost of insurance per smoker: {'yes': 32050, 'no': 8434}  
Average cost of insurance per sex: {'female': 12570, 'male': 13957}  
Average cost of insurance per region: {'southwest': 12347, 'southeast': 14735, 'northwest': 12418, 'northeast': 13406}  


####  <h4 id="ageandbmi">Getting the Average cost per age / bmi and group them by category </h4>
First, I initialize an age variable using the above function.
Then I sort records from youngest to oldest to read it better,
Next, I group them by categories, to read it even better.
Finally , we do the same for bmi.


In [5]:
### I'm initializing these variables first, to not make a mess of my output in the next cell. There's gotta be a better way to do this.

## I have a look at the dict we are dealing with.
average_charge_per_age = compare_charges_per_var('age')
average_charge_per_bmi = compare_charges_per_var('bmi')



Average cost of insurance per age: {'19': 9748, '18': 7086, '28': 9069, '33': 12352, '32': 9220, '31': 10197, '46': 14343, '37': 18020, '60': 21979, '25': 9838, '62': 19164, '23': 12420, '56': 15026, '27': 12185, '52': 18256, '30': 12719, '34': 11614, '59': 18896, '63': 19885, '55': 16165, '22': 10013, '26': 6134, '35': 11307, '24': 10648, '41': 9654, '38': 8103, '36': 12204, '21': 4730, '48': 14633, '40': 11772, '58': 13879, '53': 16021, '43': 19267, '64': 23276, '20': 10160, '61': 22024, '44': 15859, '57': 16447, '29': 10430, '45': 14830, '54': 18759, '49': 12696, '47': 17654, '51': 15682, '42': 13061, '50': 15663, '39': 11778}  
Average cost of insurance per bmi: {'27.9': 16885, '33.77': 1700, '33': 6854, '22.705': 12048, '28.88': 8271, '25.74': 7667, '33.44': 8865, '27.74': 13540, '29.83': 14239, '25.84': 14118, '26.22': 8399, '26.29': 27809, '34.4': 11952, '39.82': 5840, '42.13': 24162, '24.6': 7954, '30.78': 21517, '23.845': 10407, '40.3': 10602, '35.3': 22872, '36.005': 13229, '

In [6]:

## This line sorts out a dictionary so it's more readable
sorted_average_charge_per_age = sorted(average_charge_per_age.items(), key=lambda item:item[0])
#print(sorted_average_charge_per_age)
#print(average_charge_per_age)

###
### Average cost by Age
###

### Now, we create a dictionary that has the total charges by category groups.
total_charge_per_age_category = {'18-26': 0, '27-35': 0,'36-44': 0,'45-53': 0,'54-64': 0}
cat_counter = {'18-26': 0, '27-35': 0,'36-44': 0,'45-53': 0,'54-64': 0}

for age, charge in average_charge_per_age.items():
    int_age = int(age)
    if 64 >= int_age >= 54:   ## this is a cleaner syntax the IDE is suggesting. Neat and reads better.
        cat_counter['54-64'] += 1
        total_charge_per_age_category['54-64'] += charge
    elif int_age <= 53 and int_age >= 45:
        cat_counter['45-53'] += 1
        total_charge_per_age_category['45-53'] += charge
    elif int_age <= 44 and int_age >= 36:
        cat_counter['36-44'] += 1
        total_charge_per_age_category['36-44'] += charge
    elif int_age <= 35 and int_age >= 27:
        cat_counter['27-35'] += 1
        total_charge_per_age_category['27-35'] += charge
    elif int_age <= 26 and int_age >= 18:
        cat_counter['18-26'] += 1
        total_charge_per_age_category['18-26'] += charge
#print( f"Total average charge per age category: {total_charge_per_age_category}")

### Now I average the newly created dictionary into a new one.
average_charge_per_person_by_age_category = {'18-26': 0, '27-35': 0,'36-44': 0,'45-53': 0,'54-64': 0}
for cat, cha in total_charge_per_age_category.items():
    average_charge_per_person_by_age_category[cat] = round(cha/cat_counter[cat])
print( f"Average cost of insurance per person by age category: {average_charge_per_person_by_age_category}")



###
### Average for BMI: Let's do the same for BMI
###

# initialize categories of bmi and the counter we'll need.
bmi_categories_total_charges = {'<18': 0, '18-25': 0, '25-30': 0, '30-35': 0, '35-40': 0, '40+': 0}
bmi_categories_counter = {'<18': 0, '18-25': 0, '25-30': 0, '30-35': 0, '35-40': 0, '40+': 0}
average_charge_per_person_by_bmi_category = {'<18': 0, '18-25': 0, '25-30': 0, '30-35': 0, '35-40': 0, '40+': 0}

for bmi, charge in average_charge_per_bmi.items():
    float_bmi = float(bmi)
    if float_bmi < 18 :
        bmi_categories_counter['<18'] += 1
        bmi_categories_total_charges['<18'] += charge
    elif float_bmi >+ 18 and float_bmi < 25:
        bmi_categories_counter['18-25'] += 1
        bmi_categories_total_charges['18-25'] += charge
    elif float_bmi >+ 25 and float_bmi < 30:
        bmi_categories_counter['25-30'] += 1
        bmi_categories_total_charges['25-30'] += charge
    elif float_bmi >+ 30 and float_bmi < 35:
        bmi_categories_counter['30-35'] += 1
        bmi_categories_total_charges['30-35'] += charge
    elif float_bmi >+ 35 and float_bmi < 40:
        bmi_categories_counter['35-40'] += 1
        bmi_categories_total_charges['35-40'] += charge
    elif float_bmi >+ 40:
        bmi_categories_counter['40+'] += 1
        bmi_categories_total_charges['40+'] += charge
#print(bmi_categories_total_charges)
#print(bmi_categories_counter)


for cat, total in bmi_categories_total_charges.items():
    average_charge_per_person_by_bmi_category[cat] = round(total/bmi_categories_counter[cat])
print(f"Average cost of insurance per person by category of BMI: {average_charge_per_person_by_bmi_category}")


Average cost of insurance per person by age category: {'18-26': 8975, '27-35': 11010, '36-44': 13302, '45-53': 15531, '54-64': 18682}
Average cost of insurance per person by category of BMI: {'<18': 7760, '18-25': 10163, '25-30': 11276, '30-35': 14715, '35-40': 16994, '40+': 16940}
Cost of insurance increase with both age and BMI


The end. I think i'm good for now, let's check what the solution is, for inspiration.

Things i've learned:


