# U.S. Medical Insurance Costs

## Part 1 - Data classification 

File name insurance.csv

Take note of how information is organized. How will this affect how you analyze the data in Python? Is there anything of particular interest to you in the dataset that you want to investigate? Think about these things before you jump into analyzing it.

Columns (7) are: Age, Sex , BMI , number_children , Smoker , Region and Charges

Arrange data in a lists names with the DictReader function in the csv module. 

In [1]:
import csv

ages = []
sexs = []
bmis = []
number_children = []
smokers = []
regions = []
charges = []

with open('insurance.csv', newline='') as users_csv:
  user_reader = csv.DictReader(users_csv)
  for row in user_reader:
    ages.append(int(row['age']))
    sexs.append(row['sex'])
    bmis.append(float(row['bmi']))
    number_children.append(int(row['children']))
    smokers.append(row['smoker'])
    regions.append(row['region'])
    charges.append(float(row['charges']))

 


Now to organize data we use a dictionary containing keys as columns

In [2]:
      
def create_dictionary(ages, sexs, bmis, number_children, smokers, regions, charges):
    list_names = ["age", "sex", "BMI", "Children", "Smoker", "Region", "Charge"]
    complete_list = [ages, sexs, bmis, number_children, smokers, regions, charges]
    patients_dict = {}
    for i in range(len(list_names)):
        patients_dict[list_names[i]] = complete_list[i]
    return patients_dict
patients = create_dictionary(ages, sexs, bmis, number_children, smokers, regions, charges)




## Part 2 - Calculating relevant data 
With the data in DictReader, we find a dictionary with columns as keys and for each value the corresponding data type value. With the data organized in the different lists we can: 
1) Review each data column values
2)  Review data values within different categories for example, whats the average insurance cost of smokers.

Let's start by calculating the number of patients, the average age of patients, the average charge, the number of women and men, the number of smokers and non-smokers and the unique regions.

In [3]:
total_age = 0
total_bmi = 0
total_charge = 0
total_females = 0
total_smokers = 0
different_region = []

for age in ages:
    total_age += int(age)
average_age = total_age / len(ages)
for bmi in bmis:
    total_bmi += float(bmi)
average_bmi = total_bmi / len(bmis)
for charge in charges:
    total_charge += float(charge)
average_charge = total_charge / len(charges)
for sex in sexs:
    if sex == "female":
        total_females += 1
total_males = len(sexs) - total_females
for smoker in smokers:
    if smoker == "yes":
        total_smokers += 1
total_nonsmokers = len(smokers) - total_smokers

for region in regions:
    if region not in different_region:
        different_region.append(region)
        
print("The total number of patients is {patients}.".format(patients=len(ages)))
print("The average age is {Average_age} years.".format(Average_age = round(average_age,1)))
print("The average BMI is {Average_bmi}.".format(Average_bmi = round(average_bmi,1)))
print("The average charge is {Average_charge}.".format(Average_charge = round(average_charge,1)))  
print("The patients that are female are {Total_females} and the total that are males are {Total_males}".format(Total_females = total_females, Total_males = total_males))
print("The patients that are smokers are {Total_smokers} and the total that are non-smoker are {Total_nonsmoker}".format(Total_smokers = total_smokers, Total_nonsmoker = total_nonsmokers))
print(different_region)


The total number of patients is 1338.
The average age is 39.2 years.
The average BMI is 30.7.
The average charge is 13270.4.
The patients that are female are 662 and the total that are males are 676
The patients that are smokers are 274 and the total that are non-smoker are 1064
['southwest', 'southeast', 'northwest', 'northeast']


Now lets create a dictionary containing each unique region with the number of patients in that region.

In [4]:
region_dict = {}
region_counter = 0
for region in different_region:
    region_dict.update({region: 0})
for key in region_dict:
    for region in regions:
        if region == key:
            region_dict[region] += 1
    
    
print(region_dict)

{'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324}


Let's do some more analysis. The scope of the project is to evaluate how different variables affect the insurance cost (charge), meaning we can analyse how indivual variables affect the charge value. Now we can calculate the effect on indicual variables to the change in charge. First let's calculate the average charge for smokers and for non-smokers. And then similarly calculate the average charge for women or men. And finally let's check if there is a difference in charge for each patient's region.

In [5]:
smoking_charge = 0
non_smoking_charge = 0 
female_charge = 0
male_charge = 0
southwest_charge = 0
southeast_charge = 0
northwest_charge = 0
northeast_charge = 0

smoker_charge_tuple = list(zip(smokers, charges)) 
sex_charge_tuple = list(zip(sexs, charges))
region_charge_tuple = list(zip(regions, charges))

for patient in smoker_charge_tuple:
    for value in range(len(patient)):
        if patient[value] == 'yes':
            smoking_charge += float(patient[value+1])
            average_smoking_charge = round(smoking_charge / total_smokers, 1)
            
for patient in smoker_charge_tuple:
    for value in range(len(patient)):
        if patient[value] == 'no':
            non_smoking_charge += float(patient[value+1])
            average_nonsmoking_charge = round(non_smoking_charge / total_nonsmokers, 1) 
            
delta_in_smoker = average_smoking_charge - average_nonsmoking_charge 

for patient in sex_charge_tuple:
    for value in range(len(patient)):
        if patient[value] == 'female':
            female_charge += float(patient[value+1])
            average_female_charge = round(female_charge / total_females, 1)

for patient in sex_charge_tuple:
    for value in range(len(patient)):
        if patient[value] == 'male':
            male_charge += float(patient[value+1])
            average_male_charge = round(male_charge / total_males, 1)
            
delta_in_sex = round(average_female_charge - average_male_charge,1)   

for patient in region_charge_tuple:
    for value in range(len(patient)):
        if patient[value] == 'southwest':
            southwest_charge += float(patient[value+1])
            average_southwest_charge = round(southwest_charge / region_dict['southwest'], 1)
            
for patient in region_charge_tuple:
    for value in range(len(patient)):
        if patient[value] == 'southeast':
            southeast_charge += float(patient[value+1])
            average_southeast_charge = round(southeast_charge / region_dict['southeast'], 1)
            
for patient in region_charge_tuple:
    for value in range(len(patient)):
        if patient[value] == 'northwest':
            northwest_charge += float(patient[value+1])
            average_northwest_charge = round(northwest_charge / region_dict['northwest'], 1)
            
for patient in region_charge_tuple:
    for value in range(len(patient)):
        if patient[value] == 'northeast':
            northeast_charge += float(patient[value+1])
            average_northeast_charge = round(northeast_charge / region_dict['northeast'], 1)
            
average_region_charge_dict = {}
for region in different_region:
    average_region_charge_dict[region] = 0
average_region_charge_dict['southwest'] = average_southwest_charge
average_region_charge_dict['southeast'] = average_southeast_charge
average_region_charge_dict['northwest'] = average_northwest_charge
average_region_charge_dict['northeast'] = average_northeast_charge
    
most_expensive_region_charge = max(average_northeast_charge, average_northwest_charge, average_southeast_charge, average_southwest_charge)
for average_charge in average_region_charge_dict:
    if average_region_charge_dict[average_charge] == most_expensive_region_charge:
        most_expensive_region = average_charge

least_expensive_region_charge = min(average_northeast_charge, average_northwest_charge, average_southeast_charge, average_southwest_charge)
for average_charge in average_region_charge_dict:
    if average_region_charge_dict[average_charge] == least_expensive_region_charge:
        least_expensive_region = average_charge
        
print("The average charge for a smoker is {average_smoking_charge}".format(average_smoking_charge = average_smoking_charge))
print("The average charge for a non- smoker is {average_nonsmoking_charge}".format(average_nonsmoking_charge = average_nonsmoking_charge))
print("The difference in charge for a smoker and a non- smoker is {delta_in_smoker}! So please quit smoking!".format(delta_in_smoker = delta_in_smoker))
print("The average charge for women is {average_female_charge}".format(average_female_charge = average_female_charge))
print("The average charge for men is {average_male_charge}".format(average_male_charge = average_male_charge))
print("The difference in charge for women and men {delta_in_sex}".format(delta_in_sex = delta_in_sex))
print("The average charge for southwest region is {average_southwest_charge}".format(average_southwest_charge = average_southwest_charge))
print("The average charge for southeast region is {average_southeast_charge}".format(average_southeast_charge = average_southeast_charge))
print("The average charge for northwest region is {average_northwest_charge}".format(average_northwest_charge = average_northwest_charge))
print("The average charge for northeast region is {average_northeast_charge}".format(average_northeast_charge = average_northeast_charge))
print("The most expensive region is {most_expensive_region} and the least expensive region is {least_expensive_region}.".format(most_expensive_region = most_expensive_region, least_expensive_region = least_expensive_region)) 

The average charge for a smoker is 32050.2
The average charge for a non- smoker is 8434.3
The difference in charge for a smoker and a non- smoker is 23615.9! So please quit smoking!
The average charge for women is 12569.6
The average charge for men is 13956.8
The difference in charge for women and men -1387.2
The average charge for southwest region is 12346.9
The average charge for southeast region is 14735.4
The average charge for northwest region is 12417.6
The average charge for northeast region is 13406.4
The most expensive region is southeast and the least expensive region is southwest.


Now let's evaluate how the number of children affect the charge value. Lets calculate the average charge for having no children, 1 child, 2 children or 3 children. 

In [6]:
no_children_charge = 0
one_child_charge = 0
two_children_charge = 0
three_child_charge = 0
total_no_children = 0
total_1_child = 0
total_2_children = 0
total_3_children = 0

for child in number_children:
    if child == 0:
        total_no_children += 1
    if child == 1:
        total_1_child += 1
    if child == 2:
        total_2_children += 1
    if child == 3:
        total_3_children += 1
children_charge_tuple = zip(number_children, charges)

for patient in children_charge_tuple:
    for value in range(len(patient)):
        if patient[value] == 0:
            no_children_charge += float(patient[value+1])
            average_nochildren_charge = round(no_children_charge / total_no_children, 1)
        if patient[value] == 1:
            one_child_charge += float(patient[value+1])
            average_one_children_charge = round(one_child_charge / total_1_child, 1)
        if patient[value] == 2:
            two_children_charge += float(patient[value+1])
            average_two_children_charge = round(two_children_charge / total_2_children, 1)
        if patient[value] == 3:
            three_child_charge += float(patient[value+1])
            average_three_children_charge = round(three_child_charge / total_3_children, 1)
print("The average charge for having no child is {average_nochildren_charge}".format(average_nochildren_charge = average_nochildren_charge))
print("The average charge for having 1 children is {average_one_children_charge}".format(average_one_children_charge = average_one_children_charge))
print("The average charge for having 2 children is {average_two_children_charge}".format(average_two_children_charge = average_two_children_charge))  
print("The average charge for having 3 children is {average_three_children_charge}".format(average_three_children_charge = average_three_children_charge))  

            



The average charge for having no child is 12366.0
The average charge for having 1 children is 12731.2
The average charge for having 2 children is 15073.6
The average charge for having 3 children is 15355.3


## Part 3 - Creating a predictive linear model

Lets think over again the scope of the project, we want to evaluate the indiviual effect of each variable on the insurance charge. We could create a linear model that takes into account each variable separately (age, sex, BMI, number of children, smoker and region) to the value of the insurance charge. 
The model should be like:
```
Charge = m1*age + m2*sex + m3*bmi + m4*number_children + m5*region + m6*smoker + aggregateB

```
First define a function that calculate the error of a datapoint


In [7]:
def get_y(m, b, x):
  y = m * x + b
  return y

Now lets create a function to try different m values and b values and see which line produces the least error. To calculate error between a point and a line, he wants a function called calculate_error(), which will take in m, b, and an [x, y] point called point and return the distance between the line and the point.

To find the distance:

Get the x-value from the point and store it in a variable called x_point
Get the y-value from the point and store it in a variable called y_point
Use get_charge() to get the y-value charge value that x_point would be on the line
Find the difference between the y from get_y and y_point
Return the absolute value of the distance (you can use the built-in function abs() to do this)
The distance represents the error between the line y = m*x + b and the point given.

In [8]:
#Write your calculate_error() function here
def calculate_error(m, b, point):
    x_point = point[0]
    y_point = point[1]
    delta_y = abs(get_y(m, b, x_point) - y_point)
    return delta_y
    

Now let's organize the datapoints that we are going to analyse. Let's start by age and charge datapoints. 

In [9]:
datapoints_age_charge = list(zip(ages, charges))

Now let's fit a line to this data, we will need a function called calculate_all_error, which takes m and b that describe a line, and points, a set of data.

calculate_all_error should iterate through each point in points and calculate the error from that point to the line (using calculate_error). It should keep a running total of the error, and then return that total after the loop.

In [10]:
def calculate_all_error(m, b, datapoints):
    total_error = 0
    for point in datapoints:
        point_error = calculate_error(m, b, point)
        total_error += point_error
    return total_error

Now lets define some range for our slope and our start point

In [11]:
minimium_age = min(ages)
maximium_age = max(ages)

minimium_charge = min(charges)
maximium_charges = max(charges)


print("The youngest person in the dataset is {minimium_age}, the oldest is {maximium_age}".format(minimium_age = minimium_age, maximium_age = maximium_age))
print("The minimium charge in the dataset is {minimium_charge} and the maximum charge is {maximium_charges}".format(minimium_charge = minimium_charge, maximium_charges = maximium_charges))


The youngest person in the dataset is 18, the oldest is 64
The minimium charge in the dataset is 1121.8739 and the maximum charge is 63770.42801


Now this information tell us that charge values could go from 1121 up to 63770 in the data set, therefore a good range for our analysis for our b variable could be [500, 100000]. Now lets calculate for our slope range values.

In [12]:
slope_values_age_charge = (maximium_charges - minimium_charge) / (maximium_age - minimium_age)
print(slope_values_age_charge)

1361.925089347826


The possible range values for age could go from 0 (meaning as age value increases the carge value increases) up to 2000. 

Now create a list for the range values of the slope and the b intercept.

In [13]:
possible_slope_charge = [m for m in range(-10, 2010, 10)]

possible_b_charge = [b for b in range(-70000, 70000, 100)]


We are going to find the smallest error. First, we will make every possible y = m*x + b line by pairing all of the possible ms with all of the possible bs. Then, we will see which y = m*x + b line produces the smallest total error with the set of data stored in datapoint.

First, create the variables that we will be optimizing:

smallest_error — this should start at infinity (float("inf")) so that any error we get at first will be smaller than our value of smallest_error
best_m — we can start this at 0
best_b — we can start this at 0
We want to:

Iterate through each element m in possible_ms
For every m value, take every b value in possible_bs
If the value returned from calculate_all_error on this m value, this b value, and datapoints is less than our current smallest_error,
Set best_m and best_b to be these values, and set smallest_error to this error.
By the end of these nested loops, the smallest_error should hold the smallest error we have found, and best_m and best_b should be the values that produced that smallest error value.

In [14]:
def linear_parameters(possible_m, possible_b, datapoints):
    
    smallest_error = float("inf")
    best_m = 0
    best_b = 0 
    for m in possible_m:
        for b in possible_b:
            error = calculate_all_error(m, b, datapoints)
            if error < smallest_error:
                best_m = m
                best_b = b
                smallest_error = error
    return best_m, best_b, smallest_error
age_linear_parameters = linear_parameters(possible_slope_charge, possible_b_charge, datapoints_age_charge)
print(age_linear_parameters)

(270, -3200, 8976110.539059)


This tell us that the best linear slope for the age factor is 270 and a value intercept b of -3200!

Now let's do the same analysis for other variables like BMI


In [17]:
datapoints_bmi_charge = list(zip(bmis, charges))
bmi_linear_parameters = linear_parameters(possible_slope_charge, possible_b_charge, datapoints_bmi_charge)
print(bmi_linear_parameters)


(130, 5400, 11133952.176420998)


The best slope for bmi variable is 130 with an intercept of 5400

Now let's use our function to calculate the children variable parameters


In [19]:
datapoints_children_charge = list(zip(number_children, charges))
children_linear_parameters = linear_parameters(possible_slope_charge, possible_b_charge, datapoints_children_charge)
print(children_linear_parameters)

(120, 9200, 11171209.628321005)


The best slope for children variable is 120 with an intercept of 9200

Now lets calculate for sex variable, we define female as 0 and male as 1 to perform the calculations

In [27]:
gender_numeric = []
possible_smoker_slope_charge = [m for m in range(-1000, 3000, 50)]
for sex in sexs:
    if sex == 'female':
        gender_numeric.append(0)
    elif sex == 'male':
        gender_numeric.append(1)
datapoints_sex_charge = list(zip(gender_numeric, charges))
gender_linear_parameters = linear_parameters(possible_smoker_slope_charge, possible_b_charge, datapoints_sex_charge)
print(gender_linear_parameters)


(-50, 9400, 11173669.488821002)


The best slope for gender is -50  with an intercept of 9400

Now lets do the same analysis for smoking, define non smoker as 0 and smoker as 1

In [28]:
smoking_numeric = []
possible_smoker_slope_charge = [m for m in range(0, 30000, 100)]
for smoker in smokers:
    if smoker == 'no':
        smoking_numeric.append(0)
    elif smoker == 'yes':
        smoking_numeric.append(1)
datapoints_smoker_charge = list(zip(smoking_numeric, charges))
smoker_linear_parameters = linear_parameters(possible_smoker_slope_charge, possible_b_charge, datapoints_smoker_charge)
print(smoker_linear_parameters)

(27200, 7300, 7458472.857479012)


The best slope for smoking is 27200 with an intercept of 7300

Calculate an aggregate B value

In [29]:
aggregate_b = -3200 + 5400 + 9200 + 9400 + 7300
print(aggregate_b)

28100


## Model Definition and validation

Now we completed our model and need to validate it.

Our model is

```
Charge = 270*Age + 130*BMI + 120*Number_children -50*Sex + 27200*Smoker + aggregate b

female gender is value of 0 and male value of 1
Non-smoker is value 0 and smoker value of 1

```
Take the minimium adn maximum values for charge, and lets check the variables for those charge values in order to deduce the best value for aggregate b.


In [48]:
index_atmin_charge = 0
for charge in range(len(charges)):
    if charges[charge] == minimium_charge:
        index_atmin_charge = charge
age_atmin_charge = ages[index_atmin_charge]
children_atmin_charge = number_children[index_atmin_charge]
bmi_atmin_charge = bmis[index_atmin_charge]
smoker_atmin_charge = smokers[index_atmin_charge]
if smoker_atmin_charge == 'no':
    smoker_atmin_charge = 0
elif smoker_atmin_charge =='yes':
    smoker_atmin_charge = 1
gender_atmin_charge = sexs[index_atmin_charge]
if gender_atmin_charge == 'female':
    gender_atmin_charge = 0
elif gender_atmin_charge == 'male':
    gender_atmin_charge = 0
print(age_atmin_charge, bmi_atmin_charge, children_atmin_charge, gender_atmin_charge, smoker_atmin_charge)

index_atmax_charge = 0
for charge in range(len(charges)):
    if charges[charge] == maximium_charges:
        index_atmax_charge = charge
age_atmax_charge = ages[index_atmax_charge]
children_atmax_charge = number_children[index_atmax_charge]
bmi_atmax_charge = bmis[index_atmax_charge]
smoker_atmax_charge = smokers[index_atmax_charge]
if smoker_atmax_charge == 'no':
    smoker_atmax_charge = 0
elif smoker_atmax_charge =='yes':
    smoker_atmax_charge = 1
gender_atmax_charge = sexs[index_atmax_charge]
if gender_atmax_charge == 'female':
    gender_atmax_charge = 0
elif gender_atmax_charge == 'male':
    gender_atmax_charge = 0
print(age_atmax_charge, bmi_atmax_charge, children_atmax_charge, gender_atmax_charge, smoker_atmax_charge)

18 23.21 0 0 0
54 47.41 0 0 1


Lets define a function to calculate the model charge and test it with the min and max parameters

In [53]:
def get_charge_modeled(age, bmi, number_children, sex, smoker):
    charge_modeled = 270*age + 130*bmi + 120*number_children - 50*sex + 27200*smoker 
    return charge_modeled
model_min_charge = get_charge_modeled(age_atmin_charge, bmi_atmin_charge, children_atmin_charge, gender_atmin_charge, smoker_atmin_charge)
model_max_charge = get_charge_modeled(age_atmax_charge, bmi_atmax_charge, children_atmax_charge, gender_atmax_charge, smoker_atmax_charge)
delta_in_min_charge = minimium_charge - model_min_charge
delta_in_max_charge = maximium_charges - model_max_charge

print(delta_in_min_charge, delta_in_max_charge)

7877.3 47943.3
-6755.426100000001 15827.12801


Lets first take our value of aggregate b as the value of delta_in_max_charge

```
Charge = 270*Age + 130*BMI + 120*Number_children -50*Sex + 27200*Smoker + 15800

female gender is value of 0 and male value of 1
Non-smoker is value 0 and smoker value of 1

```

And now create a function to calculate the model charge for all the datapoints.


In [89]:
def get_charges_modeled(age, bmi, number_children, sex, smoker):
    modeled_charges_list = []
    for i in range(len(age)):
        modeled_charges_list.append(270*age[i] + 130*bmi[i] + 120*number_children[i] - 50*sex[i] + 27200*smoker[i]+round(delta_in_max_charge,0)-average_error)
    return modeled_charges_list
        

Lets test the model for all the datapoints.


In [90]:
model_charge_list = get_charges_modeled(ages, bmis, number_children, gender_numeric, smoking_numeric)
rounded_model_charge_list = []
for model_charge in model_charge_list:
    rounded_model_charge_list.append(round(model_charge, 2))

Calculate the differences in the actual charge and the model charge to check if our model b aggregate is accurate


In [91]:
delta_in_value_charge = []
for value in range(len(rounded_model_charge_list)):
    delta_in_value_charge.append(rounded_model_charge_list[value] - charges[value])

    

The difference seems big! Calculate the average error in our model charge vs actual charge 

In [92]:
value_delta_counter = 0
for value in delta_in_value_charge:
    value_delta_counter += value
    average_error = value_delta_counter / len(delta_in_value_charge)
print(average_error)
    

22804.949371630068


Now subtract this fix value to our aggregate b value to reajust our model. And we are done! We have a model to calculate the insurance charge that takes into account the person age, bmi, sex, number of children and smoker/non-smoker. The final linear model is:

```
Charge = 270*Age + 130*BMI + 120*Number_children -50*Sex + 27200*Smoker - 7000 

female gender is value of 0 and male value of 1
Non-smoker is value 0 and smoker value of 1

```
