# U.S. Medical Insurance Costs

## Scoping Project

*source: http://www.datasciencepublicpolicy.org/our-work/tools-guides/data-science-project-scoping-guide/*


- ##### Project Goals – Define the goal(s) of the project
    
    1. Develop an analysis suited to each column. The examples below are provided by Codecademy.
        - Find out the average age of the patients in the dataset.
        - Analyze where a majority of the individuals are from.
        - Figure out what the average age is for someone who has at least one child in this dataset.
    
    2. Find out what might drive higher insurance charges. The examples below are provided by Codecademy.
        - Look at the different costs between smokers vs. non-smokers.
        - Make predictions about what features are the most influential for an individual’s medical insurance charges based on analysis.
    
- ##### Actions – What actions/interventions do you have that this project will inform?
- ##### Data – What data do you have access to internally? What data do you need? What can you augment from external and/or public sources?
     Data source: https://www.kaggle.com/datasets/mirichoi0218/insurance
- ##### Analysis – What analysis needs to be done? Does it involve description, detection, prediction, optimization, or behavior change? How will the analysis be validated?
     Description: Primarily focused on understanding events and behaviors that have happened in the past. 
- ##### Ethical Considerations:  How have you thought through privacy, transparency, discrimination/equity, and accountability issues around this project?
     Explore areas where the data may include bias and how that would impact potential use cases.


## CSV File Column Definitions

*source: https://www.kaggle.com/datasets/mirichoi0218/insurance*

* age: age of primary beneficiary
    
* sex: gender of the insurance contractor (female, male)

* bmi: body mass index, an objective index of body weight relative to height (kg / m ^ 2, weight divided by height squared) that indicates whether weight is relatively high or low; ideally 18.5 to 24.9

* children: Number of children covered by health insurance / Number of dependents

* smoker: whether the beneficiary smokes (yes, no)

* region: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)

* charges: Individual medical costs billed by health insurance


In [1]:
import csv

In [2]:
#Create a dict to save each patient's data from the csv file, the dict key is assigned numbers starting from 1
patient_infos = {}
i = 1

with open('insurance.csv', newline='') as file:    #newline='' is what the csv module docs recommend for file objects
    infos = csv.DictReader(file)
    for info in infos:
        patient_infos.update({f'Patient {i}': info})
        i += 1
        
#print(patient_infos)
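
`csv.DictReader` returns every value as a string, which is why the methods further down call `int()` and `float()` on each access. As a minimal sketch, assuming the column names listed above, the conversion could instead happen once at load time. `parse_row` and `load_patients` are hypothetical helpers, and the two sample rows are made-up values standing in for `insurance.csv`.

```python
import csv
import io

# Two made-up rows in the same column layout as insurance.csv (not real data).
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
30,female,25.0,0,no,southwest,4000.50
45,male,31.2,2,yes,southeast,30000.00
"""

def parse_row(row):
    """Hypothetical helper: convert the numeric columns of one row from str."""
    return {
        'age': int(row['age']),
        'sex': row['sex'],
        'bmi': float(row['bmi']),
        'children': int(row['children']),
        'smoker': row['smoker'],
        'region': row['region'],
        'charges': float(row['charges']),
    }

def load_patients(file_obj):
    """Hypothetical helper: build the 'Patient N' dict with typed values."""
    reader = csv.DictReader(file_obj)
    return {f'Patient {i}': parse_row(row) for i, row in enumerate(reader, start=1)}

typed_patients = load_patients(io.StringIO(SAMPLE_CSV))
print(typed_patients['Patient 2'])
```

The `PatientsInfo` class in this notebook compares string values such as `'0'`, so it would need matching changes before it could consume rows typed this way.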

To fulfill the project goals listed above, a class called `PatientsInfo` has been built with eight methods:
* `average_age()`
* `major_region()`
* `average_age_with_kids()`
* `costs_vs_smoke()`
* `costs_vs_kids()`
* `costs_vs_genders()`
* `age_groups()`
* `average_costs_vs_ages()`

The class is defined below.

In [3]:
class PatientsInfo:
    def __init__(self, patients_info):
        self.patients_info = patients_info

    #method that calculates the average age of the patients
    def average_age(self):
        total_age = 0    #avoid shadowing the built-in sum()
        for info in self.patients_info.values():
            total_age += int(info['age'])
        #compute the average once, after the loop has finished
        return round(total_age / len(self.patients_info), 2)
    
    #method that finds out where a majority of the patients are from 
    def major_region(self):
        regions = {}    #To record number of patients in each region
        max = 0
        region_ = ''
        for info in self.patients_info.values():
            if info['region'] not in regions:
                regions[info['region']] = 1
            else:
                regions[info['region']] += 1
        for region, num_ppl in regions.items():
            if num_ppl > max:
                max = num_ppl
                region_ = region
        return regions, region_, max
    
    #method that calculates the average age for patients who have at least one child
    def average_age_with_kids(self):
        counter = 0
        sum = 0
        for info in self.patients_info.values():
            if info['children'] != '0':
                counter += 1
                sum += int(info['age'])
        average_age = round(sum / counter, 2)
        return average_age
    
    #method that compares the insurance costs between smokers vs. non-smokers
    def costs_vs_smoke(self):
        smoker_sum = 0
        nonsmoker_sum = 0
        s_counter = 0
        ns_counter = 0
        for info in self.patients_info.values():
            if info['smoker'] == 'yes':
                smoker_sum += float(info['charges'])
                s_counter += 1
            else:
                nonsmoker_sum += float(info['charges'])
                ns_counter += 1
        average_s_cost = round(smoker_sum/s_counter, 2)
        average_ns_cost = round(nonsmoker_sum/ns_counter, 2)
        costs_diff = abs(average_s_cost - average_ns_cost)
        return average_s_cost, average_ns_cost, costs_diff
    
    #method that compares the insurance costs between patients with and without kids
    def costs_vs_kids(self):
        kids_sum = 0
        nokids_sum = 0
        k_counter = 0
        nk_counter = 0
        for info in self.patients_info.values():
            if info['children'] != '0':
                kids_sum += float(info['charges'])
                k_counter += 1
            else:
                nokids_sum += float(info['charges'])
                nk_counter += 1
        average_k_cost = round(kids_sum/k_counter, 2)
        average_nk_cost = round(nokids_sum/nk_counter, 2)
        costs_diff = abs(average_k_cost - average_nk_cost)
        return average_k_cost, average_nk_cost, costs_diff
    
    #method that compares the insurance costs between female and male patients
    def costs_vs_genders(self):
        f_sum = 0
        m_sum = 0
        f_counter = 0
        m_counter = 0
        for info in self.patients_info.values():
            if info['sex'] == 'female':
                f_sum += float(info['charges'])
                f_counter += 1
            else:
                m_sum += float(info['charges'])
                m_counter += 1
        average_f_cost = round(f_sum/f_counter, 2)
        average_m_cost = round(m_sum/m_counter, 2)
        costs_diff = abs(average_f_cost - average_m_cost)
        return average_f_cost, average_m_cost, costs_diff
    
    #method that separates the patients' info into different age groups
    #The ages in the dataset range from 18 to 64 years old, so they are separated into 5 age groups
    def age_groups(self):
        age_groups_dict = {}
        age18_26 = []
        age27_35 = []
        age36_44 = []
        age45_53 = []
        age54_64 = []
        for info in self.patients_info.values():
            if int(info['age']) in range(18, 27):
                age18_26.append(info)
            elif int(info['age']) in range(27, 36):
                age27_35.append(info)
            elif int(info['age']) in range(36, 45):
                age36_44.append(info)
            elif int(info['age']) in range(45, 54):
                age45_53.append(info)
            elif int(info['age']) in range(54, 65):
                age54_64.append(info)
        #build the result dict once, after every patient has been assigned to a group
        age_groups_dict['age 18 to 26'] = age18_26
        age_groups_dict['age 27 to 35'] = age27_35
        age_groups_dict['age 36 to 44'] = age36_44
        age_groups_dict['age 45 to 53'] = age45_53
        age_groups_dict['age 54 to 64'] = age54_64
        return age_groups_dict
    
        
    #method that calculates the average insurance costs among different age groups
    def average_costs_vs_ages(self, age_groups_dict):
        average_age_group_costs = {}
        for age_group, infos in age_groups_dict.items():
            sum = 0
            for info in infos:
                sum += float(info['charges'])
            average_age_group_costs[age_group] = round(sum/len(infos), 2)
        return average_age_group_costs


The next step is to create an instance of the class, named `Patient_infos`. Each method can then be called on this instance to see the results of the analysis.

In [42]:
Patient_infos = PatientsInfo(patient_infos)

### Goal #1  Average Age for All Patients

In [43]:
Patient_infos.average_age()

39.21

The average age of the patients is 39.21 years old.

### Goal #2  Major Region

In [44]:
Patient_infos.major_region()

({'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324},
 'southeast',
 364)

The patients are split across four regions: 325 from the Southwest, 364 from the Southeast, 325 from the Northwest, and 324 from the Northeast.

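The largest region is the Southeast with 364 patients. That is about 27% of the 1,338 patients, so it is the most common region rather than a strict majority.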
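
The counting done by `major_region()` can also be sketched with `collections.Counter` from the standard library. This is an alternative illustration, not the method used above, and the sample rows are made up.

```python
from collections import Counter

# Made-up rows; only the 'region' key matters for this sketch.
sample_rows = [
    {'region': 'southeast'},
    {'region': 'southwest'},
    {'region': 'southeast'},
    {'region': 'northeast'},
]

# Count how many rows fall in each region.
region_counts = Counter(row['region'] for row in sample_rows)

# most_common(1) returns a one-element list holding the (value, count) pair with the highest count.
top_region, top_count = region_counts.most_common(1)[0]
print(dict(region_counts), top_region, top_count)
```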

### Goal #3  Average Age for Patients With At Least One Child

In [45]:
Patient_infos.average_age_with_kids()

39.78

The average age of the patients with at least one child is 39.78 years old. It's just slightly higher than the average age of all patients, which is 39.21 years old.

### Goal #4 Comparison of Insurance Costs Among Smokers and Non-smokers

In [46]:
Patient_infos.costs_vs_smoke()

(32050.23, 8434.27, 23615.96)

The average insurance cost for patients who smoke is 32050.23 dollars, 
while the average insurance cost for patients who DO NOT smoke is 8434.27 dollars. 
So the difference in insurance costs between smokers and non-smokers is 23615.96 dollars, which is quite a big gap.

`The insurance cost may be higher for smokers.`
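
`costs_vs_smoke()`, `costs_vs_kids()`, and `costs_vs_genders()` all follow the same pattern: split the patients by one column and average `charges` within each group. A minimal sketch of that shared pattern follows, assuming rows shaped like the `csv.DictReader` output above. `average_charges_by` is a hypothetical helper and the sample rows are made up.

```python
from collections import defaultdict

def average_charges_by(rows, column):
    """Hypothetical helper: mean of 'charges' for each distinct value of `column`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[column]] += float(row['charges'])
        counts[row[column]] += 1
    return {value: round(totals[value] / counts[value], 2) for value in totals}

# Made-up rows with string values, as csv.DictReader would return them.
sample_rows = [
    {'smoker': 'yes', 'sex': 'male', 'charges': '30000'},
    {'smoker': 'no', 'sex': 'male', 'charges': '8000'},
    {'smoker': 'no', 'sex': 'female', 'charges': '6000'},
]

by_smoker = average_charges_by(sample_rows, 'smoker')
by_sex = average_charges_by(sample_rows, 'sex')
print(by_smoker)
print(by_sex)
```

Unlike `costs_vs_kids()`, this sketch would treat each distinct `children` value as its own group, so the "at least one child" comparison would still need its own grouping rule.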

### Goal #5 Comparison of Insurance Costs Among Patients With And Without Children

In [47]:
Patient_infos.costs_vs_kids()

(13949.94, 12365.98, 1583.960000000001)

The average insurance cost of patients with at least one child is 13949.94 dollars, 
while the average insurance cost of patients without children is 12365.98 dollars.
So the difference in insurance costs between patients with and without children is 1583.96 dollars. 

`The insurance cost may be higher for people who have children.`

### Goal #6 Comparison of Insurance Costs Between Females And Males

In [48]:
Patient_infos.costs_vs_genders()

(12569.58, 13956.75, 1387.17)

The average insurance cost of female patients is 12569.58 dollars,
while the average insurance cost of male patients is 13956.75 dollars. 
So the difference in insurance costs between female and male patients is 1387.17 dollars. 

`The insurance cost may be higher for males.`

### Goal #7 Comparison of Insurance Costs Among Different Age Groups

In [49]:
age_groups_dict = Patient_infos.age_groups()
#print(age_groups_dict)

Patient_infos.average_costs_vs_ages(age_groups_dict)

{'age 18 to 26': 8839.44,
 'age 27 to 35': 11003.99,
 'age 36 to 44': 13328.53,
 'age 45 to 53': 15539.92,
 'age 54 to 64': 18538.71}

The average insurance cost is 8839.44 dollars for patients aged 18 to 26; 11003.99 dollars for ages 27 to 35; 13328.53 dollars for ages 36 to 44; 15539.92 dollars for ages 45 to 53; and 18538.71 dollars for ages 54 to 64. 

`The insurance cost may be higher as people get older.`
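
The five age groups returned by `age_groups()` are defined by their lower bounds, so the lookup can also be sketched with the standard library's `bisect` module instead of a chain of `elif` branches. `age_group_label` is a hypothetical helper; the bounds and labels are the ones used in this notebook.

```python
from bisect import bisect_right

# Lower bounds of the five groups used above: 18-26, 27-35, 36-44, 45-53, 54-64.
LOWER_BOUNDS = [18, 27, 36, 45, 54]
LABELS = ['age 18 to 26', 'age 27 to 35', 'age 36 to 44', 'age 45 to 53', 'age 54 to 64']

def age_group_label(age):
    """Hypothetical helper: return the group label for an age between 18 and 64."""
    if not 18 <= age <= 64:
        raise ValueError(f'age {age} is outside the 18 to 64 range')
    # bisect_right counts how many lower bounds are <= age; subtract 1 for the index.
    return LABELS[bisect_right(LOWER_BOUNDS, age) - 1]

print(age_group_label(40))
```

One design difference: `age_groups()` silently skips any age outside 18 to 64, while this sketch raises an error so that unexpected values are noticed.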

### Ethical Considerations
     
To explore areas where the data may include bias and how that would impact potential use cases, a class called `CheckBias` has been built with one method:

* `count_ratio()`

The class is defined below.

In [17]:
class CheckBias:
    def __init__(self, patients_info, total_patients):
        self.patients_info = patients_info
        self.total_patients = total_patients
    
    #method that computes the share of patients with a given value (target) in a given category
    #categories include sex, children status, smoking status, and region
    #eg: first count how many patients are female in category 'sex', then divide by the total number of patients
    def count_ratio(self, category, target):
        counter = 0
        for info in self.patients_info.values():
            if info[category] == target:
                counter += 1
        ratio = round(counter / self.total_patients, 2)
        return ratio


In [18]:
total_patients = len(patient_infos)
#print(total_patients)

check_bias = CheckBias(patient_infos, total_patients)

In [50]:
female_r = check_bias.count_ratio('sex', 'female')
male_r = check_bias.count_ratio('sex', 'male')
no_kids_r = check_bias.count_ratio('children', '0')
with_kids_r = 1 - no_kids_r
smoker_r = check_bias.count_ratio('smoker', 'yes')
non_smoker_r = check_bias.count_ratio('smoker', 'no')
region_sw_r = check_bias.count_ratio('region', 'southwest')
region_se_r = check_bias.count_ratio('region', 'southeast')
region_nw_r = check_bias.count_ratio('region', 'northwest')
region_ne_r = check_bias.count_ratio('region', 'northeast')

In [51]:
#create a dict to save the results of the ratios for each category
ratio_dict = {'Sex': {'female':female_r, 'male':male_r}, 
              'Children Status':{'without_kids':no_kids_r, 'with_kids':with_kids_r},
              'Smoking Status':{'smoker':smoker_r, 'non_smoker':non_smoker_r},
              'Regions Spread':{'southwest':region_sw_r, 'southeast':region_se_r, 'northwest':region_nw_r, 'northeast':region_ne_r}
             }

In [36]:
print(ratio_dict)

{'Sex': {'female': 0.49, 'male': 0.51}, 'Children Status': {'without_kids': 0.43, 'with_kids': 0.5700000000000001}, 'Smoking Status': {'smoker': 0.2, 'non_smoker': 0.8}, 'Regions Spread': {'southwest': 0.24, 'southeast': 0.27, 'northwest': 0.24, 'northeast': 0.24}}
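
`count_ratio()` is called once per value, so ten calls are needed to fill `ratio_dict`. As a minimal sketch, the shares of every value in a column could be computed in one pass with `collections.Counter`. `value_ratios` is a hypothetical helper and the sample rows are made up.

```python
from collections import Counter

def value_ratios(rows, column):
    """Hypothetical helper: share of rows for each distinct value of `column`."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {value: round(count / total, 2) for value, count in counts.items()}

# Made-up rows: one smoker and four non-smokers.
sample_rows = [
    {'smoker': 'yes'},
    {'smoker': 'no'},
    {'smoker': 'no'},
    {'smoker': 'no'},
    {'smoker': 'no'},
]

smoker_ratios = value_ratios(sample_rows, 'smoker')
print(smoker_ratios)
```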


## Findings And Predictions


* `The average age of the patients in the dataset is 39.21 years old.`
* `The average age for patients who have at least one child is 39.78 years old.`
* `The largest group of individuals in the dataset is from the southeast, with 364 people.`
* `The insurance cost may be higher for smokers.`
* `The insurance cost may be higher for people who have children.`
* `The insurance cost may be higher for males.`
* `The insurance cost may be higher as people get older.`


## Possible Bias In the Dataset

Looking at the ratios for sex, children status, smoking status, and region, the smoking status ratio is relatively unbalanced: about 1 smoker to 4 non-smokers. The analysis shows that the average insurance cost for smokers is 23615.96 dollars higher than for non-smokers, a far larger gap than for the other factors. Even so, it would be better to collect more insurance cost data from smokers to lessen the possible bias in the dataset.