# U.S. Medical Insurance Costs

### Welcome to my Medical Portfolio Project

Source: 'insurance.csv'

'insurance.csv' includes 1338 datapoints of insurance data for a group of randomly sampled individuals.

Each datapoint includes the following variables: Age, Sex, BMI, Children, Smoker, Region, and Charges.

The goal of my analysis was to determine which variables were most impactful to insurance cost, so that an individual may know how to potentially lower it.

To achieve this, I used both group averages and correlation data, in the form of the R-value and P-value, to filter out any statistically insignificant results and to show which variables have the strongest overall correlation with cost.
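
As a rough illustration of the correlation idea, the sketch below computes Pearson's r by hand for a yes/no variable coded as 0/1. The rows are invented toy values, not taken from 'insurance.csv', and `pearson_r` is a hypothetical helper written for this example; the notebook itself relies on `scipy.stats.pearsonr`, which also returns the P-value.

```python
import math

def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data (invented for illustration only)
smoker = ["yes", "no", "no", "yes", "no"]
charges = [30000.0, 5000.0, 7000.0, 28000.0, 6000.0]

# Code the binary classifier as 1 for "yes" and 0 for "no"
smoker_coded = [1 if s == "yes" else 0 for s in smoker]
r = pearson_r(smoker_coded, charges)
print(round(r, 3))
```

An r close to 1 or -1 means the two lists move together strongly; the sign only reflects which of the two categories was coded as 1.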

Additionally, as smoker status turned out to be an outlier (its effect on average cost is far larger than that of any other variable), I recalculated everything for non-smokers to see how the averages change under the assumption that an individual does not smoke.
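
The effect of restricting the data to non-smokers can be sketched on a handful of rows. The records below are invented for illustration, and `mean_charges` is a hypothetical helper; only the field names "smoker" and "charges" come from the dataset.

```python
# Toy rows (invented for illustration; the real data comes from 'insurance.csv')
rows = [
    {"smoker": "yes", "charges": "30000"},
    {"smoker": "no", "charges": "5000"},
    {"smoker": "no", "charges": "7000"},
    {"smoker": "yes", "charges": "28000"},
]

def mean_charges(records):
    """Average the 'charges' field, which csv.DictReader delivers as strings."""
    return sum(float(r["charges"]) for r in records) / len(records)

# Keep only the rows where the individual does not smoke
non_smokers = [r for r in rows if r["smoker"] == "no"]

overall = mean_charges(rows)
baseline = mean_charges(non_smokers)
print(overall, baseline)
```

Because the baseline average shifts once smokers are removed, every "change from average" figure has to be recomputed against the new baseline rather than reused.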

The code below does the following:

* Reads in the data as two separate lists of dictionaries (one with all data, and one with the smokers removed), adding a separate binary "parent" nominal variable to each row.
* Defines a function to determine the base average insurance charge for a given dataset.
* Defines a function to calculate the average cost of insurance given a specific variable.
* Defines a function to pull the average cost data for all variables and compare it against the base average for the dataset.
* Defines a function to take in any binary nominal variable or any numeric variable and see how it is correlated with the cost.
* Defines a function to check all valid variables and organize the results by how significant the correlation is.
* Lastly, calls the average cost function and the correlation function for both cases (all data and non-smokers).

###### PS: The functions have been built open-ended, meaning it is possible:

* To determine correlation between two variables outside of cost. 
    * For example, the correlation between age and BMI.
    
* To check the average of a value given a specific classifier:
    * For example, we can see the average age of a parent
    * Or the average BMI by region

A couple of examples of this are demonstrated in the final function calls.
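
As a compact sketch of the "average by classifier" idea, the snippet below averages any numeric field within each group of any classifier field. `average_by_group` and the sample rows are hypothetical and only illustrate the pattern; this is not the `average_calculator` function defined later in the notebook.

```python
from collections import defaultdict

def average_by_group(records, value_key, group_key):
    """Return {group: average of value_key}, rounded to 2 decimals."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        group = record[group_key]
        totals[group] += float(record[value_key])
        counts[group] += 1
    return {group: round(totals[group] / counts[group], 2) for group in sorted(totals)}

# Toy rows (invented for illustration only)
sample = [
    {"region": "southeast", "bmi": "33.0"},
    {"region": "southeast", "bmi": "31.0"},
    {"region": "northwest", "bmi": "28.5"},
]
result = average_by_group(sample, "bmi", "region")
print(result)
```

A single pass with two running dictionaries avoids rescanning the whole dataset once per group value.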

In [1]:
import csv
from copy import deepcopy

#create my variables:
total_age = 0
average_age = 0

total_age_parent = 0
average_age_parent = 0

total_bmi = 0
average_bmi = 0

total_bmi_male = 0
average_bmi_male = 0

total_bmi_female = 0
average_bmi_female = 0

data_list = []
no_smoking_list =[]


#Variables in use: "age", "sex", "bmi", "children", "smoker", "region", "charges"

#Open our data file and read it as a CSV
with open("insurance.csv") as data_csv:
    data_reader = csv.DictReader(data_csv)
    
    #Iterate through each row
    for row in data_reader:
        
        #Determine if they are a parent and add a new variable to indicate this
        if int(row.get("children")) > 0:
            row["parent"] = "yes"
        elif int(row.get("children")) == 0:
            row["parent"] = "no"

        #Create our data_list. Create a separate list if they are not a smoker.
        if row.get("smoker") == "yes":
            data_list.append(row)
            continue
        elif row.get("smoker") == "no":
            data_list.append(row)
            no_smoking_list.append(row)
            continue

            
no_smokers = deepcopy(no_smoking_list) 

#maintain a list of variables for future purposes
sum_variables = ["age", "bmi", "charges"]
info_variables = ["sex", "children", "smoker", "region", "parent"]
entry_count = (len(data_list))


#Edit the smoking list to remove that variable completely
for entry in no_smokers:
    entry.pop("smoker")



In [2]:
#Determine the average charge regardless of any specific variable for the database.
def average_charges (database):
    total_sum = 0
    counter = 0
    average_sum = 0
    for entry in database:
        total_sum += float(entry.get("charges"))
        counter += 1
    average_sum = total_sum/counter
    return average_sum

In [3]:
#Function to determine the average cost of insurance based on a given variable
#The function takes a summing variable, and a classifier variable with the option to add a detail variable to drill down.
def average_calculator (sum_variable, info_variable, detail_variable = "",database = data_list):
    
    #define variables
    total_sum = 0
    average_sum = 0
    counter = 0
    temp_var = ""
    failed_detail = 0
    possible_values = []
    results = {}
    final_dict = {}

    
    
#start by making sure entries are valid
    if sum_variable != "age" and sum_variable != "bmi" and sum_variable != "charges":
        print("{} is not a valid Sum Variable".format(sum_variable))
        return None
        
    if info_variable != "sex" and info_variable != "children" and info_variable != "smoker" and info_variable != "parent" and info_variable !="region":
        print("{} is not a valid Info Variable".format(info_variable))
        return None   
    
    
    
#determine if we want all of the detailed variables, or a specific one. If the variable isn't found, return all possible values:
    for entry in database:
        
        #collect every distinct value of the info variable
        temp_var = entry[info_variable]
        if temp_var not in possible_values:
            possible_values.append(temp_var)
                
    #detail variable is given - check if it's valid and if so keep only that value.
    if detail_variable != "": 
        if str(detail_variable) in possible_values:
            possible_values = [str(detail_variable)]

        #provided detail variable is not found so we default to all possible values.        
        else:          
            failed_detail = 1
                    
    #if we defaulted to all values, let the user know before moving on                 
    if failed_detail == 1:
        print("provided detail variable is not found. Defaulted to all possible values.")
    possible_values.sort()
    #print (possible_values)
    #print (possible_values)
    

#next, we will take the sum var and examine the average for the info var (based on the values in the detail var)
    
    #calculate the average independent of any info variable:
    average_sum = average_charges(database)

    
    
    #calculate the average for the info/detail variables:
    for value in possible_values:
        total_sum = 0
        counter = 0
        for entry in database:
            if entry[info_variable] == value:
                total_sum += float(entry[sum_variable])  
                counter += 1
        average_sum = total_sum/counter 
        results["Average " + str(sum_variable) + " for " + str(value)] = round(average_sum,2)
    
    final_dict[str(sum_variable + " by " + str(info_variable))] = results
    
    #return a dictionary of all the possible values given inputs
    return (final_dict)
    
        
#average_calculator("charges","age","",data_list)

In [4]:
#Let's determine which factors most significantly impact insurance cost
#This function when called pulls all valid classifiers and determines how they affect insurance cost on average.
def cost_increaser(database = data_list):
    
    #Define Variables
    key_list = []
    temp_dict = {}
    cost_change = []
    temp_charge = 0
    temp_change = 0
    
    #get all the keys
    for entry in database:
        for key in entry.keys():
            if key not in key_list:
                key_list.append(key)
    
    #iterate through each of the info variables:
    for key in key_list:
        
        #Pull the average charges for each detail variable using our last function and put them in our dictionary:
        temp_dict = average_calculator("charges",key,"",database)
        if temp_dict is None:
            continue
        
        #Iterate through the newly created dictionary
        for data in temp_dict:
            
            #Dig into the values for each entry
            for entry in temp_dict.values():
                counter = 0
                
                #assign the average charges from the dictionary.
                average_charge = average_charges(database)
                
                
                for value in entry.values():
                    
                    #Assign the name of the variable being calculated.
                    temp_var = (list(entry)[counter])
                    
                    #Assign the value of the variable being calculated.
                    temp_charge = value
                    
                    #Determine how much the average charge for the variable differs from the overall average charge.
                    temp_change = temp_charge - average_charge
                    counter += 1
                    
                    #Ignore if we are examining the base variable with no change from average.
                    if temp_change == 0:
                        continue
                    
                    #Append the change in cost and the variable name to the list as a [change, label] pair
                    cost_change.append([round(temp_change,2), "Change in " + temp_var + data[data.rfind(" "):]])
    
    #Sort by the highest to lowest
    cost_change.sort(reverse=True)
    
    
    #Print the results
    for entry in cost_change:
        print ("The " + str(entry[1]) + " is " + str(entry [0]) + " dollars.")
    
    
    
#cost_increaser()        

In [5]:
#Function to calculate how correlated variables are

#import the libraries:
import numpy as np
import scipy.stats
correlation_list = []


#Define the function. It takes two variables, but the second will default to charges, as that is the primary concern.
#The first variable is the one whose correlation with the second we want to measure.
def correlation_calc (var1,var2="charges",database = data_list):
    charges_list = []
    temp_list = []
    temp_dict = {}
    test = ""
    possible_values = []
    binary_1 = ""
    binary_2 = ""
    temp_values = {}


    
    #find any binary variables
    for entry in database:
        try:
            temp_var = entry[var1]
            if temp_var not in possible_values:
                possible_values.append(temp_var) 
        #a missing key means the variable does not exist in this dataset
        except KeyError:       
            print(var1, "is invalid.")
            return
           
            
    #create the lists to go through for the calculation:       
    if len(possible_values) == 2:
        for data in database:
            #create the list of charges
            charges_list.append(float(data[var2]))
        
            #pull the two possible values for the binary classifier
            binary_1 = possible_values[0]
            binary_2 = possible_values[1]

            #append a 0 for case one, and a 1 for case two.
            if data[var1] == binary_1:
                temp_list.append(0)
                continue
            elif data[var1] == binary_2:
                temp_list.append(1)
                continue
                
    #If it is not a binary variable, we can compare against sum variables. We only need to filter out classifiers with a handful of (more than 2) distinct values.
    elif len(possible_values) > 10:
        for data in database:
            #create the list of charges
            charges_list.append(float(data[var2]))
            temp_list.append(float(data[var1]))
            
    #Anything else (a single value, or 3 to 10 distinct values) is a classifier we can't correlate
    else:
        print(var1, "can't be correlated")
        return

    #Do the math and get the correlation coefficients
    x = np.array(temp_list)
    y = np.array(charges_list)
    
    correlation = np.corrcoef(x,y)
    r, p = scipy.stats.pearsonr(x,y)
    
    #Assign and return the results
    key = str(var1) + " " + str(binary_2)
    temp_values["r_value"] = r
    temp_values["p_value"] = p
    temp_dict[key] = temp_values
    return temp_dict




In [14]:
def cost_correlation(database=data_list, var2="charges"):
    key_list = [] 
    temp_var = ""
    correlation_list = []
    r_values = []
    absolute_r = []
    counter = 0
    
    #Get a list of keys/variables from the chosen data list
    for entry in database:
        for key in entry.keys():
            if key not in key_list:
                key_list.append(key)
        
    #iterate through each of the variables and pull the correlation values for each variable:
    for key in key_list:
        
        #Take the given key and check how it correlates to var2 using the correlation calculator
        result = correlation_calc(key,var2,database)
        
        #If the key can't be correlated ignore it
        if result is None:
            continue
        correlation_list.append(result)
    
    
    #Work through our created list with the correlation information
    for value in correlation_list:
        
        #Get the variable name being examined
        temp_var = list(value)
        
        #Change the naming convention of the variables
        temp_var = temp_var[0]
        temp_var_1st = temp_var[:temp_var.find(" ")]
        temp_var_2nd = temp_var[temp_var.find(" "):]
        if temp_var_2nd == " ":
            temp_var_renamed = str(temp_var_1st) 
        else:
            temp_var_renamed = str(temp_var_1st) + ": "+ str(temp_var_2nd)
        
        #Go through the P-values and eliminate any values that are not statistically significant
        for data in value.values():
            
            #If P-Value = 0, it means we're comparing a variable against itself (e.g. charges against charges) so it can be scrapped.
            if data["p_value"] == 0:
                continue
            
            #If P-Value is less than 0.05 then we will include it. 
            elif data["p_value"] < 0.05:
                r_values.append([data["r_value"], temp_var_renamed])

            #If P-Value is greater than 0.05 then it is not statistically significant and should not be included.
            elif data["p_value"] >= 0.05:
                print (temp_var_1st, "stats are NOT statistically significant")
                continue
    
    #Create a function to sort by the absolute value without actually editing the values (so we can sort by distance from 1)
    def get_ordered_list(pair):
        return (1- abs(pair[0]))
    
    #Sort the list with our function
    r_values.sort(key=get_ordered_list)
    
    #Display the data from most to least significant, print the results and indicate how correlated it is, and in which direction.
    for data in r_values:
        counter += 1
        if abs(data[0]) >= .9:
            if data[0] < 0:
                print("The #"+ str(counter) +" most significant variable is " +"\""+str(data[1]) +"\""+ " with an r-value of " +"\""+ str(data[0]) +"\""+ " indicating high negative correlation with the cost of insurance.")
            elif data[0] > 0:
                print("The #"+ str(counter) +" most significant variable is " +"\""+str(data[1]) +"\""+ " with an r-value of " +"\""+ str(data[0]) +"\""+ " indicating high positive correlation with the cost of insurance.")
            continue
            
            
        elif abs(data[0]) >= .7:
            if data[0] < 0:
                print("The #"+ str(counter) +" most significant variable is " +"\""+str(data[1]) +"\""+  " with an r-value of " + str(data[0]) + " indicating medium negative correlation with the cost of insurance.")
            elif data[0] > 0:
                print("The #"+ str(counter) +" most significant variable is " +"\""+str(data[1]) +"\""+  " with an r-value of " + str(data[0]) + " indicating medium positive correlation with the cost of insurance.")
            continue
            
        elif abs(data[0]) >= .5:
            if data[0] < 0:
                print("The #"+ str(counter) +" most significant variable is " +"\""+str(data[1]) +"\""+ " with an r-value of " +"\""+ str(data[0]) +"\""+ " indicating low negative correlation with the cost of insurance.")
            elif data[0] > 0:
                print("The #"+ str(counter) +" most significant variable is " +"\""+str(data[1]) +"\""+ " with an r-value of " +"\""+ str(data[0]) +"\""+ " indicating low positive correlation with the cost of insurance.")
            continue
            
        elif abs(data[0]) >= .3:
            if data[0] < 0:
                print("The #"+ str(counter) +" most significant variable is " +"\""+str(data[1]) +"\""+  " with an r-value of " +"\""+ str(data[0]) +"\""+ " indicating minor negative correlation with the cost of insurance.")
            elif data[0] > 0:
                print("The #"+ str(counter) +" most significant variable is " +"\""+str(data[1]) +"\""+  " with an r-value of " +"\""+ str(data[0]) +"\""+ " indicating minor positive correlation with the cost of insurance.")
            continue
            
        elif abs(data[0]) >= .1:
            if data[0] < 0:
                print("The #"+ str(counter) +" most significant variable is " +"\""+str(data[1]) +"\""+ " with an r-value of " +"\""+ str(data[0]) +"\""+ " indicating minimal negative correlation with the cost of insurance.")
            elif data[0] > 0:
                print("The #"+ str(counter) +" most significant variable is " +"\""+str(data[1]) +"\""+ " with an r-value of " +"\""+ str(data[0]) +"\""+ " indicating minimal positive correlation with the cost of insurance.")    
            continue
            
        elif abs(data[0]) < .1:
            print("The #"+ str(counter) +" most significant variable is " +"\""+str(data[1]) +"\""+ " with an r-value of " +"\""+ str(data[0]) +"\""+ " indicating insignificant correlation with the cost of insurance.")                                                 
            continue

    
    
#cost_correlation()       

In [7]:
print("Average Charges Data for All Individuals:")
cost_increaser(data_list)
print("\n\n\n\n")
print("Average Charges Data for Non-Smokers:")
cost_increaser(no_smokers)

Average Charges Data for All Individuals:
age is not a valid Info Variable
bmi is not a valid Info Variable
charges is not a valid Info Variable
The Change in Average charges for yes smoker is 18779.81 dollars.
The Change in Average charges for 3 children is 2084.9 dollars.
The Change in Average charges for 2 children is 1803.14 dollars.
The Change in Average charges for southeast region is 1464.99 dollars.
The Change in Average charges for male sex is 686.33 dollars.
The Change in Average charges for yes parent is 679.52 dollars.
The Change in Average charges for 4 children is 580.24 dollars.
The Change in Average charges for northeast region is 135.96 dollars.
The Change in Average charges for 1 children is -539.25 dollars.
The Change in Average charges for female sex is -700.84 dollars.
The Change in Average charges for northwest region is -852.84 dollars.
The Change in Average charges for no parent is -904.44 dollars.
The Change in Average charges for 0 children is -904.44 dollars.

In [8]:
print("Average Charges Correlation Data for All Individuals:")
cost_correlation(data_list)
print("\n\n\n\n")
print("Average Charges Correlation Data for Non-Smokers:")
cost_correlation(no_smokers)    

Average Charges Correlation Data for All Individuals:
children can't be correlated
region can't be correlated
The #1 most significant variable is "smoker:  no" with an r-value of -0.7872514304984748 indicating medium negative correlation with the cost of insurance.
The #2 most significant variable is "age" with an r-value of "0.2990081933306478" indicating minimal positive correlation with the cost of insurance.
The #3 most significant variable is "bmi" with an r-value of "0.19834096883362912" indicating minimal positive correlation with the cost of insurance.
The #4 most significant variable is "parent:  yes" with an r-value of "0.06476047639409527" indicating insignificant correlation with the cost of insurance.
The #5 most significant variable is "sex:  male" with an r-value of "0.05729206220202522" indicating insignificant correlation with the cost of insurance.





Average Charges Correlation Data for Non-Smokers:
children can't be correlated
region can't be correlated
sex stats 

In [13]:
print ("The average age of a parent in the dataset is:")
print(average_calculator("age", "parent"))
print("\n")
print ("The average BMI of a smoker in the dataset is:")
print(average_calculator("bmi", "smoker"))

The average age of a parent in the dataset is:
{'age by parent': {'Average age for no': 38.44, 'Average age for yes': 39.78}}


The average BMI of a smoker in the dataset is:
{'bmi by smoker': {'Average bmi for no': 30.65, 'Average bmi for yes': 30.71}}


In [19]:
print("How are other variables correlated with Age?\n")
cost_correlation(data_list,"age")

How are other variables correlated with Age?

children can't be correlated
region can't be correlated
sex stats are NOT statistically significant
smoker stats are NOT statistically significant
parent stats are NOT statistically significant
The #1 most significant variable is "charges" with an r-value of "0.2990081933306478" indicating minimal positive correlation with the cost of insurance.
The #2 most significant variable is "bmi" with an r-value of "0.1092718815485352" indicating minimal positive correlation with the cost of insurance.
