# U.S. Medical Insurance Costs

Intro:

## Step 1: Examine the starting dataset

For this project I'm using a dataset provided by Codecademy. As the first step, I'll import the data using the csv library and do some initial exploration of the dataset.

In [1]:
import csv

insurance_data = {}
with open('insurance.csv') as insurance_raw:
    insurance_dict = csv.DictReader(insurance_raw)
    for sub_dict in insurance_dict:
        for key, value in sub_dict.items():
            if key in insurance_data:
                insurance_data[key].append(value)
            else:
                insurance_data[key] = [value]
                
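def explore_column(a_dict, key):
    """Summarize a column: how many values it holds and which Python types appear."""
    types = []
    length = len(a_dict[key])
    for i in a_dict[key]:
        # Record each distinct type once, in order of first appearance.
        if type(i) not in types:
            types.append(type(i))
    return 'The {} column has {} values made up of the following data types: {}'.format(key, length, types)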
    


In [2]:
print(explore_column(insurance_data, 'age'))
print(explore_column(insurance_data, 'sex'))
print(explore_column(insurance_data, 'bmi'))
print(explore_column(insurance_data, 'children'))
print(explore_column(insurance_data, 'smoker'))
print(explore_column(insurance_data, 'region'))
print(explore_column(insurance_data, 'charges'))

The age column has 1338 values made up of the following data types: [<class 'str'>]
The sex column has 1338 values made up of the following data types: [<class 'str'>]
The bmi column has 1338 values made up of the following data types: [<class 'str'>]
The children column has 1338 values made up of the following data types: [<class 'str'>]
The smoker column has 1338 values made up of the following data types: [<class 'str'>]
The region column has 1338 values made up of the following data types: [<class 'str'>]
The charges column has 1338 values made up of the following data types: [<class 'str'>]


Now I've confirmed that the dataset has 1338 entries, not including the header, and that every value is stored as a string. To make analysis easier, I'm going to create a new dictionary that stores the age and children values as integers and the bmi and charges values as floats.
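
To see why string values get in the way, here is a small sketch using made-up sample values (not rows from the dataset). Strings compare character by character and cannot be averaged, whereas converted numbers behave as expected.

```python
# Hypothetical sample values, for illustration only (not taken from insurance.csv).
ages_as_text = ['19', '8', '100']

# Strings compare character by character, so '8' ranks above '100'.
largest_text = max(ages_as_text)

# Converting first gives numeric ordering and allows arithmetic.
ages = [int(value) for value in ages_as_text]
largest_number = max(ages)
average_age = sum(ages) / len(ages)

print(largest_text, largest_number, average_age)
```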

In [3]:
insurance_data_clean = {}

def append_column(new_dict, old_dict, key, desired_type):
    if key not in new_dict:
        new_dict[key] = []
    for item in old_dict[key]:
        new_dict[key].append(desired_type(item))
            
append_column(insurance_data_clean, insurance_data, 'age', int)
append_column(insurance_data_clean, insurance_data, 'sex', str)
append_column(insurance_data_clean, insurance_data, 'bmi', float)
append_column(insurance_data_clean, insurance_data, 'children', int)
append_column(insurance_data_clean, insurance_data, 'smoker', str)
append_column(insurance_data_clean, insurance_data, 'region', str)
append_column(insurance_data_clean, insurance_data, 'charges', float)
        

At this point I have a clean dataset and am ready to begin analysis. Now that I have had a chance to explore the data, I'm going to find an equation that estimates insurance costs from the data provided. I'll do this using a combination of linear regression for the integer and float variables and simpler arithmetic for the categorical string variables. Each variable will be compared to total charges.

## Step 2: Calculate relationship between charges and categorical string variables

First up I am going to tackle sex, smoker, and region to better understand their impact on real-world healthcare costs.

In [4]:
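def calculate_average_charge_conditional(key, condition):
    """Average charge over the rows where insurance_data_clean[key] equals condition."""
    total_charge = 0
    count = 0
    # Walk the column and the charges together instead of tracking an index by hand.
    for value, charge in zip(insurance_data_clean[key], insurance_data_clean['charges']):
        if value == condition:
            total_charge += charge
            count += 1
    # Raises ZeroDivisionError if no row matches the condition.
    return total_charge / count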

In [5]:
print(calculate_average_charge_conditional('sex', 'male'))
print(calculate_average_charge_conditional('sex', 'female'))
print()
print(calculate_average_charge_conditional('sex', 'male') - calculate_average_charge_conditional('sex', 'female'))

13956.751177721886
12569.57884383534

1387.1723338865468


According to the analysis above, men are charged about 1387.17 more on average than women, so the first piece of our healthcare cost equation is ```'sex' * 1387.17```, where sex is equal to 1 for male and 0 for female.

In [6]:
print(calculate_average_charge_conditional('smoker', 'yes'))
print(calculate_average_charge_conditional('smoker', 'no'))
print()
print(calculate_average_charge_conditional('smoker', 'yes') - calculate_average_charge_conditional('smoker', 'no'))

32050.23183153285
8434.268297856199

23615.96353367665


According to the analysis above, smokers are charged significantly more on average than non-smokers: about 23615.96 more. This gives the second piece of our healthcare cost equation, which now looks like this: ``` 'sex' * 1387.17 + 'smoker' * 23615.96 ``` where smoker is 1 for smokers and 0 for non-smokers.
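
The per-category averages used so far can also be collected in a single pass over a column. The sketch below is one way to do that, assuming the same column-oriented layout as `insurance_data_clean`. The `average_charge_by_category` helper and the sample values are hypothetical.

```python
from collections import defaultdict

def average_charge_by_category(data, key):
    """Return {category: mean charge} for one categorical column."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    # Pair each category label with the charge on the same row.
    for category, charge in zip(data[key], data['charges']):
        totals[category] += charge
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in totals}

# Made-up rows for illustration, laid out like insurance_data_clean.
sample = {
    'smoker': ['yes', 'no', 'no', 'yes'],
    'charges': [30000.0, 8000.0, 9000.0, 34000.0],
}
averages = average_charge_by_category(sample, 'smoker')
print(averages)
```

With the notebook's data this could be called as `average_charge_by_category(insurance_data_clean, 'smoker')`.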

In [7]:
region_values = []
for region in insurance_data_clean['region']:
    if region not in region_values:
        region_values.append(region)
        
print(region_values)

['southwest', 'southeast', 'northwest', 'northeast']


In [8]:
print(calculate_average_charge_conditional('region', 'southwest'))
print(calculate_average_charge_conditional('region', 'southeast'))
print(calculate_average_charge_conditional('region', 'northwest'))
print(calculate_average_charge_conditional('region', 'northeast'))

12346.93737729231
14735.411437609895
12417.575373969228
13406.3845163858


In [9]:
region_modifier_key = {
    'southwest': 0,
    'southeast': 2388.47,
    'northwest': 70.64,
    'northeast': 1059.45
}

sex_key = {
    'male': 1,
    'female': 0
}

smoker_key = {
    'yes': 1,
    'no': 0
}

Region is a bit trickier, as there are four possible values, but the regional cost differences can be accounted for using the `region_modifier_key` lookup above, which takes the cheapest region (southwest) as the baseline. The two eastern regions have higher average charges than the two western ones, with the southeast being the most expensive. The equation is now: ```'sex' * 1387.17 + 'smoker' * 23615.96 + region_modifier```
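
The values in `region_modifier_key` appear to be each region's average charge minus the lowest regional average (southwest). A minimal sketch of that derivation, using the four averages printed above; the names `region_averages` and `derived_modifiers` are introduced here for illustration only.

```python
# Regional averages copied from the printed output above.
region_averages = {
    'southwest': 12346.93737729231,
    'southeast': 14735.411437609895,
    'northwest': 12417.575373969228,
    'northeast': 13406.3845163858,
}

# Use the cheapest region as the baseline, so its modifier is 0.
baseline = min(region_averages.values())

# Each modifier is the gap to the baseline, rounded to cents.
derived_modifiers = {
    region: round(average - baseline, 2)
    for region, average in region_averages.items()
}
print(derived_modifiers)
```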

## Step 3: Linear regression for float and int values

Next up are the three numeric variables (age, bmi, and children) that contribute to the ultimate real-world healthcare cost.

In [18]:
age_points = list(zip(insurance_data_clean['age'], insurance_data_clean['charges']))
bmi_points = list(zip(insurance_data_clean['bmi'], insurance_data_clean['charges']))
children_points = list(zip(insurance_data_clean['children'], insurance_data_clean['charges']))

def get_y(m, b, x):
    return m*x + b

possible_ms_age = [m*.1 for m in range(2500, 2800)]
possible_bs_age = [b*.1 for b in range(-32500, -32000)]
possible_ms_bmi = [m*.1 for m in range(1200, 1400)]
possible_bs_bmi = [b*.1 for b in range(54100, 54200)]
possible_ms_children = [m*10 for m in range(100,300)]
possible_bs_children = [b*10 for b in range(800, 1000)]

def calculate_error(m, b, point):
    x_point = point[0]
    y_point = point[1]
    y_value = get_y(m, b, x_point)
    return abs(y_value - y_point)

def calculate_all_error(m, b, points):
    total_error = 0
    for point in points:
        total_error += calculate_error(m, b, point)
    return total_error

def lin_reg(m_list, b_list, datapoints):
    """Brute-force search for the (m, b) pair with the lowest total absolute error."""
    smallest_error = float('inf')
    best_m = 0
    best_b = 0
    for m in m_list:
        for b in b_list:
            # Compute the error once per candidate line rather than twice.
            error = calculate_all_error(m, b, datapoints)
            if error < smallest_error:
                smallest_error = error
                best_m = m
                best_b = b
    print('The smallest possible error is {}, with the best m value as {} and best b value as {}.'.format(smallest_error, best_m, best_b))
            

In [11]:
lin_reg(possible_ms_age, possible_bs_age, age_points)

The smallest possible error is 8975649.727358997, with the best m value as 269.5 and best b value as -3216.6000000000004.
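
`lin_reg` searches a hand-picked grid and scores each candidate line by total absolute error. One way to get a rough idea of where to center such a grid is a closed-form fit. Note that this is a different technique: ordinary least squares minimizes squared error, so on real data its slope and intercept will generally differ from the grid-search result. The `least_squares_fit` helper and the sample points below are hypothetical.

```python
def least_squares_fit(points):
    """Closed-form ordinary least squares fit for y = m*x + b."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Slope is covariance(x, y) divided by variance(x).
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    m = covariance / variance
    # The fitted line always passes through the point of means.
    b = mean_y - m * mean_x
    return m, b

# Made-up points lying exactly on y = 2x + 1.
sample_points = [(1, 3.0), (2, 5.0), (3, 7.0), (4, 9.0)]
m, b = least_squares_fit(sample_points)
print(m, b)
```

With the notebook's data this could be called as `least_squares_fit(age_points)`.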


Adding the age term, the equation becomes ```'sex' * 1387.17 + 'smoker' * 23615.96 + region_modifier + age*269.5 - 3216.6```
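
As a quick way to try out the equation so far, it can be wrapped in a function. This is a sketch only: `estimate_charge` is a hypothetical helper, the coefficients are the ones quoted in the equation above, and the lookup tables repeat the keys defined earlier so that the block runs on its own.

```python
# Lookup tables repeated from the earlier cell so this block is self-contained.
REGION_MODIFIER = {
    'southwest': 0,
    'southeast': 2388.47,
    'northwest': 70.64,
    'northeast': 1059.45,
}
SEX = {'male': 1, 'female': 0}
SMOKER = {'yes': 1, 'no': 0}

def estimate_charge(sex, smoker, region, age):
    """Evaluate the equation built so far (categorical terms plus the age term)."""
    return (
        SEX[sex] * 1387.17
        + SMOKER[smoker] * 23615.96
        + REGION_MODIFIER[region]
        + age * 269.5
        - 3216.6
    )

# Hypothetical person: 40-year-old female non-smoker in the southwest.
example = estimate_charge('female', 'no', 'southwest', 40)
print(example)
```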

In [19]:
lin_reg(possible_ms_bmi, possible_bs_bmi, bmi_points)

The smallest possible error is 11133900.765520982, with the best m value as 129.70000000000002 and best b value as 5419.900000000001.


Adding the bmi term and combining the two intercepts (-3216.6 + 5419.9 = 2203.3), the equation becomes ```'sex' * 1387.17 + 'smoker' * 23615.96 + region_modifier + age*269.5 + bmi*129.7 + 2203.3```

In [17]:
lin_reg(possible_ms_children, possible_bs_children, children_points)

The smallest possible error is 11239562.083321003, with the best m value as 1000 and best b value as 8110.
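
A grid search can only return values inside the ranges it was given, so a best value that sits on the edge of its range suggests the true optimum may lie outside it. Here the best m of 1000 is the smallest value in `possible_ms_children`, and the best b reported for bmi (5419.9) is the largest value in `possible_bs_bmi`. The sketch below flags that situation. `on_grid_edge` is a hypothetical helper, and the grids are copied from the Step 3 cell.

```python
import math

def on_grid_edge(best, grid):
    """Return True when best equals the smallest or largest candidate in grid."""
    return math.isclose(best, min(grid)) or math.isclose(best, max(grid))

# Candidate grids copied from the notebook.
possible_ms_age = [m*.1 for m in range(2500, 2800)]
possible_bs_bmi = [b*.1 for b in range(54100, 54200)]
possible_ms_children = [m*10 for m in range(100, 300)]

# Best values as reported in the printed outputs.
print(on_grid_edge(269.5, possible_ms_age))
print(on_grid_edge(5419.9, possible_bs_bmi))
print(on_grid_edge(1000, possible_ms_children))
```

When the check returns True, widening the corresponding range and re-running `lin_reg` would show whether the error keeps falling.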
