# U.S. Medical Insurance Costs

This fictional CSV dataset, provided by Codecademy, contains patient records and their medical insurance costs in the United States. It includes each patient's age, sex, BMI, number of children, smoker status (yes/no), geographic region (northeast, northwest, southeast, southwest), and insurance charges.

This notebook provides descriptive statistics about the full dataset, followed by an analysis of these questions:

1. How does smoker status affect insurance charges and compare across geographic regions?
2. How do women's insurance charges compare to men's?
3. How do childless women's insurance charges compare to those of childless men?
4. Assuming that a BMI of 30 or greater indicates obesity, how do insurance charges between obese and non-obese patients compare?
5. What is the average insurance cost of patients under 50 years of age and of patients aged 50 and over, respectively?

## Load Data into Dictionary and Lists

In [3]:
import csv

#Function to get list of records, where each record is a dictionary
def make_patient_info_lst(ins_file):
    patient_info_lst = []
    with open(ins_file, newline='') as csv_file:
        csv_dict = csv.DictReader(csv_file)
        for row in csv_dict:
            patient_info_lst.append(row)
    return patient_info_lst

#Get list of records, where each record is a dictionary, for the full dataset
full_dataset = make_patient_info_lst(r'C:\Users\Stina\Documents\Codecademy\python-portfolio-project-starter-files\python-portfolio-project-starter-files\insurance.csv')

#Function to get list of records, where each record is a dictionary, while controlling for one variable.
#Pass string values with '==' and '!=' (e.g. 'yes', '0') and numeric values with '<', '<=', '>', '>='.
def make_sliced_patient_info_lst(column_name, data_lst, control_operand, control_value):
    new_dict_lst = []
    for record in data_lst:
        #Skip records that do not contain the requested column
        if column_name not in record:
            continue
        value = record[column_name]
        if control_operand == '==':
            control_condition = (value == control_value)
        elif control_operand == '>':
            control_condition = (float(value) > control_value)
        elif control_operand == '<':
            control_condition = (float(value) < control_value)
        elif control_operand == '<=':
            control_condition = (float(value) <= control_value)
        elif control_operand == '>=':
            control_condition = (float(value) >= control_value)
        elif control_operand == '!=':
            control_condition = (value != control_value)
        else:
            #Raise instead of returning a message string, which callers would mistake for a dataset
            raise ValueError("Invalid operand, please pass in a comparison operator in string format for the control_operand argument.")
        if control_condition:
            new_dict_lst.append(record)
    return new_dict_lst

#Function to put each column's data into its own list
def load_list_data(patient_info_lst, column_name):
    #Collect the column's value from every record that contains that column
    return [record[column_name] for record in patient_info_lst if column_name in record]
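
The operand-by-operand branching in the slicing function can also be expressed as a lookup table. The following is only a sketch of that alternative, not the notebook's method: `COMPARISONS`, `slice_records`, and the sample records are hypothetical names and made-up data. Unlike the function above, it compares numerically whenever the control value is a number, including for `'=='` and `'!='`.

```python
import operator

# Hypothetical lookup table mapping operand strings to comparison functions.
COMPARISONS = {
    '==': operator.eq,
    '!=': operator.ne,
    '<': operator.lt,
    '<=': operator.le,
    '>': operator.gt,
    '>=': operator.ge,
}

def slice_records(column_name, records, operand, control_value):
    """Return the records whose column value satisfies the comparison."""
    if operand not in COMPARISONS:
        raise ValueError(f"Unsupported operand: {operand!r}")
    compare = COMPARISONS[operand]
    # csv.DictReader yields every field as a string, so convert to float
    # only when the caller passed a numeric control value.
    numeric = isinstance(control_value, (int, float))
    selected = []
    for record in records:
        value = float(record[column_name]) if numeric else record[column_name]
        if compare(value, control_value):
            selected.append(record)
    return selected

# Made-up sample records, not rows from insurance.csv.
sample_records = [
    {'age': '19', 'smoker': 'yes', 'bmi': '27.9'},
    {'age': '52', 'smoker': 'no', 'bmi': '33.0'},
    {'age': '50', 'smoker': 'no', 'bmi': '30.0'},
]

smokers_sample = slice_records('smoker', sample_records, '==', 'yes')
age_50_plus_sample = slice_records('age', sample_records, '>=', 50)
print(len(smokers_sample), len(age_50_plus_sample))
```

Keeping the operators in one dictionary means a new comparison only needs a new entry rather than another `elif` branch.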


## Create Patient Info Class

In [14]:
class PatientInfo:
    def __init__(self, dataset):
        self.dataset = dataset
        self.patients_ages = load_list_data(dataset, 'age')
        self.patients_sexes = load_list_data(dataset, 'sex')
        self.patients_bmis = load_list_data(dataset, 'bmi')
        self.patients_num_children = load_list_data(dataset, 'children')
        self.patients_smoker = load_list_data(dataset, 'smoker')
        self.patients_regions = load_list_data(dataset, 'region')
        self.patients_insurance_charges = load_list_data(dataset, 'charges')
        
    def male_female_ratio(self):
        male_count = 0
        female_count = 0
        for sex in self.patients_sexes:
            if sex == 'male':
                male_count += 1
            else:
                female_count += 1
        sum_total = male_count + female_count
        male_percent = 100*male_count/sum_total
        female_percent = 100*female_count/sum_total
        return f"In this dataset, {round(male_percent, 2)}% of patients are male and {round(female_percent, 2)}% are female."
    
    def smoker_nonsmoker_ratio(self):
        smoker_count = 0
        nonsmoker_count = 0
        for status in self.patients_smoker:
            if status == 'yes':
                smoker_count += 1
            else:
                nonsmoker_count += 1
        sum_total = smoker_count + nonsmoker_count
        smoker_percent = 100*smoker_count/sum_total
        nonsmoker_percent = 100*nonsmoker_count/sum_total
        return f"In this dataset, {round(smoker_percent, 2)}% of patients are smokers and {round(nonsmoker_percent, 2)}% are nonsmokers."
    
    def regions_ratio(self):
        northeast_count = 0
        northwest_count = 0
        southeast_count = 0
        southwest_count = 0
        for region in self.patients_regions:
            if region == "northeast":
                northeast_count += 1
            elif region == "northwest":
                northwest_count += 1
            elif region == "southeast":
                southeast_count += 1
            else:
                southwest_count += 1
        sum_total = northeast_count + northwest_count + southeast_count + southwest_count
        northeast_percent = 100*northeast_count/sum_total
        northwest_percent = 100*northwest_count/sum_total
        southeast_percent = 100*southeast_count/sum_total
        southwest_percent = 100*southwest_count/sum_total
        return  f"In this dataset, {round(northeast_percent, 2)}% of patients are from the northeast, {round(northwest_percent, 2)}% are from the northwest, {round(southeast_percent, 2)}% are from the southeast, and {round(southwest_percent, 2)}% are from the southwest."
    
    def average_age(self):
        lst_length = len(self.patients_ages)
        sum_total = 0
        for item in self.patients_ages:
            sum_total += float(item)
        average = sum_total/lst_length
        return f"The average age of patients in this dataset is {round(average, 1)}."
        
    def average_bmi(self):
        lst_length = len(self.patients_bmis)
        sum_total = 0
        for item in self.patients_bmis:
            sum_total += float(item)
        average = sum_total/lst_length
        return f"The average bmi of patients in this dataset is {round(average, 1)}."
        
        
    def average_num_children(self):
        lst_length = len(self.patients_num_children)
        sum_total = 0
        for item in self.patients_num_children:
            sum_total += float(item)
        average = sum_total/lst_length
        return f"The average number of children of patients in this dataset is {round(average, 1)}."
        
    def average_insurance_charges(self):
        lst_length = len(self.patients_insurance_charges)
        sum_total = 0
        for item in self.patients_insurance_charges:
            sum_total += float(item)
        average = sum_total/lst_length
        return f"The average insurance charges for patients in this dataset are ${round(average, 2)}."
            
#Make full dataset an instance of PatientInfo class:
all_insurance_data = PatientInfo(full_dataset)
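
The three ratio methods in the class follow the same pattern: count each category, then divide by the total. As a minimal sketch of how that shared step could be factored out, assuming a hypothetical helper named `category_percentages` and made-up sample values, `collections.Counter` can do the counting:

```python
from collections import Counter

def category_percentages(values):
    """Return each distinct value's share of the list, in percent."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        # An empty slice has no shares to report.
        return {}
    return {category: round(100 * count / total, 2)
            for category, count in counts.items()}

# Made-up sample values, not rows from insurance.csv.
sample_regions = ['northeast', 'southeast', 'southeast', 'southwest']
sample_shares = category_percentages(sample_regions)
print(sample_shares)
```

Unlike the `else` branches in the class, this approach does not fold unexpected values into the last category; any unrecognized value shows up as its own key.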
        



### Overview of Full Dataset: Ratios and Averages

In [15]:
print(all_insurance_data.male_female_ratio())
print(all_insurance_data.smoker_nonsmoker_ratio())
print(all_insurance_data.regions_ratio())
print(all_insurance_data.average_age())
print(all_insurance_data.average_bmi())
print(all_insurance_data.average_num_children())
print(all_insurance_data.average_insurance_charges())

In this dataset, 50.52% of patients are male and 49.48% are female.
In this dataset, 20.48% of patients are smokers and 79.52% are nonsmokers.
In this dataset, 24.22% of patients are from the northeast, 24.29% are from the northwest, 27.2% are from the southeast, and 24.29% are from the southwest.
The average age of patients in this dataset is 39.2.
The average bmi of patients in this dataset is 30.7.
The average number of children of patients in this dataset is 1.1.
The average insurance charges for patients in this dataset are $13270.42.


### Analysis Questions

#### 1. How does smoker status affect insurance charges and compare across geographic regions?

In [33]:
#Make sliced dataset and create class instance

smokers_dataset = make_sliced_patient_info_lst('smoker', full_dataset, '==', 'yes')
nonsmokers_dataset = make_sliced_patient_info_lst('smoker', full_dataset, '==', 'no')

smokers = PatientInfo(smokers_dataset)
nonsmokers = PatientInfo(nonsmokers_dataset)

print("Smokers: " + smokers.average_insurance_charges())
print("Non-smokers: " + nonsmokers.average_insurance_charges())
print("Smokers: " + smokers.regions_ratio())
print("Non-smokers: " + nonsmokers.regions_ratio())

Smokers: The average insurance charges for patients in this dataset are $32050.23.
Non-smokers: The average insurance charges for patients in this dataset are $8434.27.
Smokers: In this dataset, 24.45% of patients are from the northeast, 21.17% are from the northwest, 33.21% are from the southeast, and 21.17% are from the southwest.
Non-smokers: In this dataset, 24.15% of patients are from the northeast, 25.09% are from the northwest, 25.66% are from the southeast, and 25.09% are from the southwest.
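
The output above shows where smokers and nonsmokers live, that is, each region's share of all smokers. A complementary view is the share of smokers within each region. The sketch below illustrates that calculation under stated assumptions: `smoker_share_by_region` is a hypothetical helper and the records are made up, so the printed numbers say nothing about the actual dataset.

```python
from collections import defaultdict

def smoker_share_by_region(records):
    """Return the percentage of smokers among the patients of each region."""
    totals = defaultdict(int)
    smoker_counts = defaultdict(int)
    for record in records:
        totals[record['region']] += 1
        if record['smoker'] == 'yes':
            smoker_counts[record['region']] += 1
    return {region: round(100 * smoker_counts[region] / totals[region], 2)
            for region in totals}

# Made-up sample records, not rows from insurance.csv.
sample_records = [
    {'region': 'northeast', 'smoker': 'yes'},
    {'region': 'northeast', 'smoker': 'no'},
    {'region': 'northeast', 'smoker': 'no'},
    {'region': 'northeast', 'smoker': 'no'},
    {'region': 'southeast', 'smoker': 'yes'},
    {'region': 'southeast', 'smoker': 'no'},
]

region_smoker_shares = smoker_share_by_region(sample_records)
print(region_smoker_shares)
```

Applied to `full_dataset`, the same function would show directly whether smoking is more common in one region than in another, independent of how many records each region contributes.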


#### 2. How do women's insurance charges compare to men's?

In [25]:
#Make sliced dataset and create class instance

women_dataset = make_sliced_patient_info_lst('sex', full_dataset, '==', 'female')
men_dataset = make_sliced_patient_info_lst('sex', full_dataset, '==', 'male')

women = PatientInfo(women_dataset)
men = PatientInfo(men_dataset)

print("Women: " + women.average_insurance_charges())
print("Men: " + men.average_insurance_charges())

Women: The average insurance charges for patients in this dataset are $12569.58.
Men: The average insurance charges for patients in this dataset are $13956.75.


#### 3. How do childless women's insurance charges compare to those of childless men?

In [28]:
#Create dataset, lists, and PatientInfo class instance
childless_dataset = make_sliced_patient_info_lst('children', full_dataset, '==', '0')
have_children_dataset = make_sliced_patient_info_lst('children', full_dataset, '>=', 1)
childless_women_dataset = make_sliced_patient_info_lst('sex', childless_dataset, '==', 'female')
childless_men_dataset = make_sliced_patient_info_lst('sex', childless_dataset, '==', 'male')

childless = PatientInfo(childless_dataset)
have_children = PatientInfo(have_children_dataset)
childless_women = PatientInfo(childless_women_dataset)
childless_men = PatientInfo(childless_men_dataset)

print("Patients without children: " + childless.average_insurance_charges())
print("Patients with children: " + have_children.average_insurance_charges())
print("Women without children: " + childless_women.average_insurance_charges())
print("Men without children: " + childless_men.average_insurance_charges())

Patients without children: The average insurance charges for patients in this dataset are $12365.98.
Patients with children: The average insurance charges for patients in this dataset are $13949.94.
Women without children: The average insurance charges for patients in this dataset are $11905.71.
Men without children: The average insurance charges for patients in this dataset are $12832.7.


#### 4. Assuming that a BMI of 30 or greater indicates obesity, how do insurance charges between obese and non-obese patients compare?

In [29]:
#Create dataset, lists, and PatientInfo class instance
obese_dataset = make_sliced_patient_info_lst('bmi', full_dataset, '>=', 30)
nonobese_dataset = make_sliced_patient_info_lst('bmi', full_dataset, '<', 30)

obese = PatientInfo(obese_dataset)
nonobese = PatientInfo(nonobese_dataset)

print("Obese patients: " + obese.average_insurance_charges())
print("Non-obese: " + nonobese.average_insurance_charges())

Obese patients: The average insurance charges for patients in this dataset are $15552.34.
Non-obese: The average insurance charges for patients in this dataset are $10713.67.


#### 5. What is the average insurance cost of patients under 50 years of age and of patients aged 50 and over, respectively?

In [31]:
#Create dataset, lists, and PatientInfo class instance
under_50_dataset = make_sliced_patient_info_lst('age', full_dataset, '<', 50)
over_50_dataset = make_sliced_patient_info_lst('age', full_dataset, '>=', 50)

under_50 = PatientInfo(under_50_dataset)
over_50 = PatientInfo(over_50_dataset)

print("Under 50: " + under_50.average_insurance_charges())
print("50 and over: " + over_50.average_insurance_charges())

Under 50: The average insurance charges for patients in this dataset are $11399.1.
50 and over: The average insurance charges for patients in this dataset are $17902.55.


## Points of Caution

In several categories, the dataset is skewed or may not be representative of the general U.S. population, so the results of this analysis should be treated with caution. For example, each of the four regions contributes roughly a quarter of the records (24.22% to 27.2%), a split that does not necessarily match the actual regional distribution of the U.S. population. Only about one fifth of the patient records (20.48%) are from smokers, so far more data is available on nonsmokers than on smokers, and the insights about smokers should be treated with more caution than those about nonsmokers.

This analysis did not examine the distribution of numeric variables like age, BMI, or number of children to identify skew or outliers. Taking weight as an example, the analysis did not look more closely at other important conditions like severe obesity or being underweight. The simple obese/non-obese distinction may not capture all of the factors linking patients' weight and health.
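
A first look at a variable's distribution does not require more than the standard library. The sketch below is one possible starting point, not part of the original analysis: `summarize_distribution` is a hypothetical helper and the charges are made-up values. A mean far above the median is a common sign of right skew or high outliers.

```python
import statistics

def summarize_distribution(values):
    """Return basic summary statistics for a list of numeric strings or numbers."""
    numbers = sorted(float(value) for value in values)
    # statistics.quantiles (Python 3.8+) returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(numbers, n=4)
    return {
        'min': numbers[0],
        'q1': q1,
        'median': median,
        'q3': q3,
        'max': numbers[-1],
        'mean': statistics.mean(numbers),
    }

# Made-up charges with one large value, not rows from insurance.csv.
sample_charges = ['1000', '2000', '3000', '4000', '40000']
charges_summary = summarize_distribution(sample_charges)
print(charges_summary)
```

Running the same helper on `all_insurance_data.patients_insurance_charges` or `patients_bmis` would show whether the averages reported above are pulled upward by a small number of extreme records.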

## Discussion & Needs for Further Research

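According to this dataset, smokers are spread fairly evenly across the geographic regions, although the southeast accounts for a larger share of smokers (33.21%) than of all patients (27.2%). Smoker status, obesity, and age are associated with the largest differences in average insurance charges: smokers' average charges are about 3.8 times those of nonsmokers, obese patients' are about 45% higher than non-obese patients', and patients aged 50 and over average about 57% more than those under 50. Sex and having children, on the other hand, do not seem to be major determinants of charges; the differences between those groups are about 11% and 13%, respectively.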

Further analysis is needed to derive insights that are statistically significant and that stakeholders can use to make decisions with more confidence. The distribution of each variable needs to be examined more closely. One-tailed and two-tailed t-tests would allow analysts to identify which differences between group means are statistically significant. Furthermore, an awareness of the variables' distributions will help in deciding which variables to control for when running t-tests, as well as in defining more patient categories for continuous variables.
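
As an illustration of the kind of test mentioned above, the following sketch computes Welch's t statistic and its approximate degrees of freedom for two groups with possibly unequal variances. It is an assumption-laden example rather than a finished analysis: `welch_t_statistic` is a hypothetical helper, the charges are made-up values, and turning the statistic into a p-value additionally requires the t distribution (for instance via `scipy.stats.ttest_ind` with `equal_var=False`, if SciPy is available).

```python
import math
import statistics

def welch_t_statistic(group_a, group_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    # statistics.variance computes the sample variance (n - 1 in the denominator).
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    n_a, n_b = len(group_a), len(group_b)
    se_a, se_b = var_a / n_a, var_b / n_b
    t_stat = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_stat, dof

# Made-up charges for two small groups, not rows from insurance.csv.
sample_smoker_charges = [30000.0, 32000.0, 34000.0]
sample_nonsmoker_charges = [8000.0, 9000.0, 10000.0]

t_stat, dof = welch_t_statistic(sample_smoker_charges, sample_nonsmoker_charges)
print(round(t_stat, 2), round(dof, 2))
```

Welch's variant is used here because it does not assume equal variances in the two groups, which seems a safer default when group sizes differ as much as smokers and nonsmokers do in this dataset.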

If tasked to keep analyzing these patients' records, I would try to find data on their socioeconomic status and race, factors that have historically been shown to be significant determinants of health. I would also try to distinguish between insurance-covered charges and out-of-pocket expenses. When analyzing these variables, I would try to control for the availability of healthcare resources within a given radius of a patient's home in order to gain insight into the importance of geographic and financial access to care.

I am looking forward to further growing my skillset in research, data collection, data analysis, and automation tools like machine learning models to be able to gather insights about our society and economy that inform and empower social equity.