# US Medical Insurance Costs
The aim of this project is to use Python to investigate a dataset of medical insurance costs supplied in the `insurance.csv` file.

In [1]:
import csv

To start, the necessary and supplementary libraries are imported:  
The `csv` library is imported so that the file can be read easily into the code.  
The next step is to bring the data into Python so that we can initially inspect it and then go on to investigate it.

In [2]:
with open("insurance.csv", newline="") as insurance_csv:
    temp_dict = csv.DictReader(insurance_csv)
    insurance_data = [row for row in temp_dict]

print(len(insurance_data))
print(insurance_data[:10])

1338
[{'age': '19', 'sex': 'female', 'bmi': '27.9', 'children': '0', 'smoker': 'yes', 'region': 'southwest', 'charges': '16884.924'}, {'age': '18', 'sex': 'male', 'bmi': '33.77', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '1725.5523'}, {'age': '28', 'sex': 'male', 'bmi': '33', 'children': '3', 'smoker': 'no', 'region': 'southeast', 'charges': '4449.462'}, {'age': '33', 'sex': 'male', 'bmi': '22.705', 'children': '0', 'smoker': 'no', 'region': 'northwest', 'charges': '21984.47061'}, {'age': '32', 'sex': 'male', 'bmi': '28.88', 'children': '0', 'smoker': 'no', 'region': 'northwest', 'charges': '3866.8552'}, {'age': '31', 'sex': 'female', 'bmi': '25.74', 'children': '0', 'smoker': 'no', 'region': 'southeast', 'charges': '3756.6216'}, {'age': '46', 'sex': 'female', 'bmi': '33.44', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '8240.5896'}, {'age': '37', 'sex': 'female', 'bmi': '27.74', 'children': '3', 'smoker': 'no', 'region': 'northwest', 'cha

Here the `csv.DictReader` object is used to read each row of `insurance.csv` as a dictionary, and each dictionary is then added to a list.

This list therefore contains one dictionary for each of the 1338 patients in `insurance.csv`.

From inspection we can see that there are 7 columns in the `csv` file, so each patient dictionary has 7 keys:
* Patient age
* Patient sex
* Patient BMI
* Patient number of children
* Patient smoker status
* Patient U.S. geographical region
* Patient yearly medical insurance cost  

It can also be seen that no data is missing, but that in its current state every value is a string.
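
The missing-data observation can also be checked programmatically rather than by eye. The block below is a minimal sketch: `count_missing` and `sample_rows` are hypothetical names, and the sample rows are illustrative only, with one `bmi` value blanked on purpose so that the check has something to find.

```python
# Hypothetical sample rows in the same shape as the output of csv.DictReader
# (the second row has its bmi blanked deliberately for illustration)
sample_rows = [
    {"age": "19", "sex": "female", "bmi": "27.9", "children": "0",
     "smoker": "yes", "region": "southwest", "charges": "16884.924"},
    {"age": "18", "sex": "male", "bmi": "", "children": "1",
     "smoker": "no", "region": "southeast", "charges": "1725.5523"},
]


def count_missing(rows):
    """Return a {column: number of empty or absent values} dictionary."""
    missing = {}
    for row in rows:
        for column, value in row.items():
            # DictReader fills short rows with None; blank cells arrive as ''
            is_blank = isinstance(value, str) and value.strip() == ""
            if value is None or is_blank:
                missing[column] = missing.get(column, 0) + 1
    return missing


missing_counts = count_missing(sample_rows)
print(missing_counts)  # {'bmi': 1}
```

Calling the same function on `insurance_data` should return an empty dictionary if no values are missing.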

In [3]:
age_data = [int(patient["age"]) for patient in insurance_data]
sex_data = [patient["sex"] for patient in insurance_data]
bmi_data = [float(patient["bmi"]) for patient in insurance_data]
child_data = [int(patient["children"]) for patient in insurance_data]
smoker_data = [patient["smoker"] for patient in insurance_data]
region_data = [patient["region"] for patient in insurance_data]
cost_data = [float(patient["charges"]) for patient in insurance_data]

Using list comprehensions, each variable is organised into its own list and converted into the most suitable data type.  
From here we can easily investigate each variable.

In [4]:
# A function that will return a dictionary of value:frequency pairs
def create_freq_dict(ls):

    freq_dict = {} # We initialise a dictionary to have value:frequency pairs

    for val in ls:

        if val not in freq_dict: # We check if key exists in dictionary
            freq_dict[val] = 1 # If not, add it and set the count to 1

        else:
            freq_dict[val] += 1 # Otherwise increase the count

    return freq_dict

For convenience, the `create_freq_dict` function is created to return a dictionary of item:frequency pairs for a given list.
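
As a point of comparison, the standard library's `collections.Counter` builds the same kind of value:frequency mapping. The sketch below uses a hypothetical `sample_regions` list in place of the real data.

```python
from collections import Counter

# Hypothetical sample in the same shape as region_data
sample_regions = ["southwest", "southeast", "southeast", "northwest"]

# Counter builds the value:frequency mapping in a single call
region_counts = Counter(sample_regions)

print(dict(region_counts))           # {'southwest': 1, 'southeast': 2, 'northwest': 1}
print(region_counts.most_common(1))  # [('southeast', 2)]
```

Writing the loop by hand, as above, makes the counting logic explicit; `Counter` is a shorter alternative once that logic is understood.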

In [5]:
# Function that finds some summary statistics of a given list of data, subject to the datatype of said list
def get_summary_stats(data):

    # As there are categorical and quantitative variables, the summary statistics will be different
    # In the case of nominal categorical data, the only valid summary statistic is the frequency of each category
    if type(data[0]) == str:

        return create_freq_dict(data)

    # With numeric data we can find more summary statistics
    elif type(data[0]) in [int, float]:
        return {
            'Min' : min(data),
            'Max' : max(data),
            'Mean' : round(sum(data)/len(data), 5) # From inspection of the dataset, charges are given to 5 decimal places
        }
    return 'Error'


age_summary = get_summary_stats(age_data)
sex_summary = get_summary_stats(sex_data)
bmi_summary = get_summary_stats(bmi_data)
child_summary = get_summary_stats(child_data)
smoker_summary = get_summary_stats(smoker_data)
region_summary = get_summary_stats(region_data)
cost_summary = get_summary_stats(cost_data)

print(age_summary,
      sex_summary,
      bmi_summary,
      child_summary,
      smoker_summary,
      region_summary,
      cost_summary,
      sep='\n')

{'Min': 18, 'Max': 64, 'Mean': 39.20703}
{'female': 662, 'male': 676}
{'Min': 15.96, 'Max': 53.13, 'Mean': 30.6634}
{'Min': 0, 'Max': 5, 'Mean': 1.09492}
{'yes': 274, 'no': 1064}
{'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324}
{'Min': 1121.8739, 'Max': 63770.42801, 'Mean': 13270.42227}


The `get_summary_stats` function is called on each variable in turn to give a quick summary of it.

From the summaries we can see:
* The average age of the patients is 39; we would expect an average of 41 if the ages were evenly distributed between 18 and 64
* There is a nearly even split of sexes
* The mean BMI of 30.66 falls just inside the obese range (a BMI of 30 or above), so the average patient in the dataset is above a healthy weight by BMI
* The average patient has 1 child
* A large majority of patients do not smoke
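* The geographical distribution of patients is nearly equal, with slightly more patients from the southeast (364 against 324 or 325 in each of the other regions)
* The cost data spans a huge range, from about 1,122 to about 63,770, with a mean of about 13,270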
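
Two of the figures in the list above can be reproduced with simple arithmetic. The sketch below uses hypothetical variable names; the cost values are copied from the printed summary rather than recomputed from the file.

```python
# Ages present in the dataset: every whole year from 18 to 64 inclusive
possible_ages = range(18, 65)

# With the same number of patients at every age, the mean is the midpoint
even_mean_age = sum(possible_ages) / len(possible_ages)
print(even_mean_age)  # 41.0

# Summary values copied from the printed output of get_summary_stats
cost_min, cost_max, cost_mean = 1121.8739, 63770.42801, 13270.42227

# Width of the range, and how many times larger the maximum is than the mean
cost_range = round(cost_max - cost_min, 5)
print(cost_range)
print(round(cost_max / cost_mean, 2))
```

The maximum charge is several times the mean, which suggests that a relatively small number of very high charges pull the mean upwards.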

In [6]:
# Function that returns unique values from a list of data
def get_unique(data, sort=True):

    uniques = [] # Create an empty list to keep the uniques

    for entry in data:
        if entry not in uniques: # Check if a given element is already in the unique list
            uniques.append(entry)

    if sort:
        uniques = sorted(uniques) 

    return uniques


unique_ages = get_unique(age_data)
unique_child_nums = get_unique(child_data)

print(unique_ages, unique_child_nums, sep='\n')

[18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64]
[0, 1, 2, 3, 4, 5]


Using the `get_unique` function, we can quickly see the unique values of a variable.  
From this we can see that there is at least one patient for every value in the ranges of age (18 to 64) and number of children (0 to 5).  
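
The coverage check described above can be made explicit instead of being read off the printed lists. This is a minimal sketch using a set-based alternative to `get_unique`; `sample_children` is a hypothetical list with one value deliberately absent.

```python
# Hypothetical sample in the same shape as child_data (no patient has 4 children)
sample_children = [0, 1, 3, 0, 0, 5, 1, 2]

# A set keeps one copy of each value; sorted() returns them as an ordered list
unique_children = sorted(set(sample_children))

# Any whole number between the minimum and maximum that never appears in the data
expected = set(range(min(sample_children), max(sample_children) + 1))
missing_values = sorted(expected - set(sample_children))

print(unique_children)  # [0, 1, 2, 3, 5]
print(missing_values)   # [4]
```

An empty `missing_values` list would confirm that every value in the range is represented.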

In [7]:
# A function to calculate the average charge per regional quadrant
def get_average_region_cost(region):

    if region not in ['northeast', 'southeast', 'southwest', 'northwest']:
        return 'Please enter a proper regional quadrant e.g. southwest.'

    cost_data_sub_region = [float(patient["charges"]) for patient in 
                                insurance_data if patient["region"] == region] # Create a subset 
    region_total = sum(cost_data_sub_region)

    return round(region_total / len(cost_data_sub_region), 5)


print(get_average_region_cost('northeast'),
      get_average_region_cost('southeast'),
      get_average_region_cost('southwest'),
      get_average_region_cost('northwest'),
      sep='\n')

13406.38452
14735.41144
12346.93738
12417.57537


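Here the function `get_average_region_cost` is created to calculate the average cost for a given region.  
The results indicate that patients in the two eastern regions pay higher average yearly charges (about 13,406 in the northeast and 14,735 in the southeast) than patients in the two western regions (about 12,347 in the southwest and 12,418 in the northwest).  
Further analysis could be performed to see whether each region contains a comparable mix of ages, sexes, smokers, etc., since differences in that mix could account for part of the gap.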
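
One way to begin that further analysis is to compare the share of smokers in each region. The block below is only a sketch: `smoker_share_by_region` and `sample_rows` are hypothetical, and the sample values are made up for illustration rather than taken from the dataset.

```python
from collections import Counter

# Hypothetical sample rows; only the keys used below are included
sample_rows = [
    {"region": "southeast", "smoker": "yes"},
    {"region": "southeast", "smoker": "no"},
    {"region": "southeast", "smoker": "yes"},
    {"region": "northwest", "smoker": "no"},
    {"region": "northwest", "smoker": "no"},
]


def smoker_share_by_region(rows):
    """Return {region: fraction of patients in that region who smoke}."""
    totals = Counter(row["region"] for row in rows)
    smokers = Counter(row["region"] for row in rows if row["smoker"] == "yes")
    # A Counter returns 0 for regions that have no smokers at all
    return {region: round(smokers[region] / totals[region], 5) for region in totals}


smoker_shares = smoker_share_by_region(sample_rows)
print(smoker_shares)  # {'southeast': 0.66667, 'northwest': 0.0}
```

Passing `insurance_data` instead of `sample_rows` would show whether the regions with higher average charges also have a higher proportion of smokers.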

In [8]:
# A function to calculate the average cost dependent on whether a patient smokes
def get_average_smoker_cost(is_smoker=True):

    # Map the boolean flag onto the 'yes'/'no' labels used in the dataset
    smoker_label = 'yes' if is_smoker else 'no'

    # Create a subset of charges for the chosen smoker status
    cost_data_sub = [float(patient["charges"]) for patient in
                     insurance_data if patient["smoker"] == smoker_label]

    return round(sum(cost_data_sub) / len(cost_data_sub), 5)


print(get_average_smoker_cost(),
      get_average_smoker_cost(is_smoker=False),
      sep='\n')

32050.23183
8434.2683


The function `get_average_smoker_cost` is created to calculate the average insurance charge for a patient dependent on whether they smoke.  
The results suggest that the yearly charge is much higher if an individual smokes: the smoker average is nearly four times the non-smoker average.  
An interesting note is that the average smoker cost is only roughly half the cost of the patient with the highest yearly charge.

In [9]:
def get_average_sex_cost(sex):

    if sex not in ['male', 'female']:
        return 'Please enter "male" or "female"'

    cost_data_sub_sex = [float(patient['charges']) for patient in insurance_data if patient['sex'] == sex]
    sex_total_cost = sum(cost_data_sub_sex)

    return round(sex_total_cost / len(cost_data_sub_sex), 5)


print(get_average_sex_cost('male'),
      get_average_sex_cost('female'),
      sep='\n')

13956.75118
12569.57884


The `get_average_sex_cost` function returns the average yearly charge dependent on a patient's sex.  
We can see that the averages are similar, but about 1,387 (roughly 11%) higher for male patients.
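
The region, smoker and sex functions all follow the same pattern: filter the rows by one categorical column, then average the charges. A single helper could cover all three cases. The sketch below assumes a hypothetical function name, `get_average_cost_by`, and uses made-up sample rows.

```python
from collections import defaultdict

# Hypothetical sample rows in the same shape as insurance_data
sample_rows = [
    {"sex": "male", "smoker": "yes", "charges": "30000.0"},
    {"sex": "male", "smoker": "no", "charges": "10000.0"},
    {"sex": "female", "smoker": "no", "charges": "8000.0"},
    {"sex": "female", "smoker": "no", "charges": "6000.0"},
]


def get_average_cost_by(rows, key):
    """Return {category: mean charge} for the categorical column `key`."""
    charges_by_group = defaultdict(list)
    for row in rows:
        # Collect the charges of every row under its category label
        charges_by_group[row[key]].append(float(row["charges"]))
    return {group: round(sum(costs) / len(costs), 5)
            for group, costs in charges_by_group.items()}


average_by_sex = get_average_cost_by(sample_rows, "sex")
average_by_smoker = get_average_cost_by(sample_rows, "smoker")
print(average_by_sex)     # {'male': 20000.0, 'female': 7000.0}
print(average_by_smoker)  # {'yes': 30000.0, 'no': 8000.0}
```

The trade-off is that the helper returns every category at once instead of validating a single requested value, as the functions above do.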

From this analysis, it is suggested that, of the factors examined, the one most likely to affect a patient's yearly medical insurance cost is whether or not the patient smokes.

Further analysis could be done to investigate how other attributes, such as age or BMI, influence a patient's yearly charge.
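
For numeric attributes such as age or BMI, one possible starting point is the Pearson correlation coefficient between the attribute and the charges. The block below is a minimal sketch written without external libraries; `pearson_r` is a hypothetical helper and the sample values are invented so that the relationship is perfectly linear.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the two means
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical values, chosen so that charges rise exactly in line with age
sample_ages = [20, 30, 40, 50]
sample_costs = [2000.0, 3000.0, 4000.0, 5000.0]

r_value = pearson_r(sample_ages, sample_costs)
print(round(r_value, 5))  # 1.0
```

Applied to `age_data` and `cost_data`, a value close to 1 or -1 would indicate a strong linear relationship and a value near 0 a weak one; correlation alone would not show that age causes the difference in charges.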