# U.S. Medical Insurance Costs Analysis
*[Data](https://github.com/JBharwani2/Data-Science-Portfolio/blob/main/insurance.csv) accessed through a CSV file provided by www.codecademy.com*

This project analyzes the cost of patients' medical insurance in the U.S. and the factors that may influence this cost.

### The following information will be presented based on the data:
- A list of dictionaries that contains all patient information
- Average age of the patients
- Number of males vs. females counted in the dataset
- Percentage of patients that smoke
- Average cost of medical insurance for these patients
- Geographical breakdown of the patients


### The following questions will be researched in this project:
- Is there a difference in cost for individuals that smoke?
- Is there a correlation between insurance cost and gender?
- Does age or body mass index affect the difference in cost between genders?
- Does a patient's region affect the difference in cost between genders?

In [1]:
import csv

The only import this project needs is the `csv` library, which provides access to the CSV file containing the U.S. medical insurance data. Other libraries could be added for further study of this data.

In [2]:
insurance_data = []
with open("insurance.csv") as insurance_csv:
    csv_data = csv.DictReader(insurance_csv)
    
    # For each row of data in the file, a new dictionary is added to the insurance data list
    for row in csv_data:
        insurance_data.append(row)

The rows of the CSV file are stored in a list, with each row saved as its own dictionary. A list is used because nothing identifies a patient other than the position of the entry in the list. The list is easy to iterate through, and the details in each dictionary are accessed via the shared column names: age, sex, bmi, children, smoker, region, charges.
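Because `csv.DictReader` returns every field as a string, the later cells convert values with `int()` and `float()` each time they are read. The following is a minimal sketch of doing that conversion once at load time. The helper name `load_patients` and the two sample rows are hypothetical, and an in-memory string stands in for `insurance.csv`.

```python
import csv
import io

# Two made-up rows using the dataset's column names (not real patient data).
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
30,female,25.0,0,no,southwest,4000.0
50,male,31.5,2,yes,southeast,30000.0
"""

def load_patients(file_obj):
    """Read insurance rows and convert the numeric columns once."""
    patients = []
    for row in csv.DictReader(file_obj):
        row["age"] = int(row["age"])
        row["bmi"] = float(row["bmi"])
        row["children"] = int(row["children"])
        row["charges"] = float(row["charges"])
        patients.append(row)
    return patients

# io.StringIO behaves like an open text file, so the same helper
# would also accept the object returned by open("insurance.csv").
patients = load_patients(io.StringIO(SAMPLE_CSV))
print(patients[0]["age"] + patients[1]["age"])
```

With the numeric columns converted up front, arithmetic on `age`, `bmi`, and `charges` works directly without repeating the casts in every loop.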

In [3]:
total_age = 0
for line in insurance_data:
    total_age += int(line["age"])
    
# round() is used instead of int() so the average is rounded rather than truncated
average_age = round(total_age / len(insurance_data))

print("The average age of the set of patients is " + str(average_age))

The average age of the set of patients is 39


In [4]:
female_count = 0
male_count = 0
for line in insurance_data:
    if "female" in line["sex"]:
        female_count += 1
    else:
        male_count += 1
        
# Keep two decimal places; int() would truncate the ratio to a whole number
male_female_ratio = round(male_count / female_count, 2)

print(str(male_count) + " male")
print(str(female_count) + " female")
print("The ratio of men to women is approximately " + str(male_female_ratio) + ":1")

676 male
662 female
The ratio of men to women is approximately 1.02:1


In [5]:
smoker_count = 0
total_patients = len(insurance_data)

for line in insurance_data:
    if line["smoker"] == "yes":
        smoker_count += 1

# Percentage of smokers = smokers / all patients * 100
smoker_percentage = round(smoker_count / total_patients * 100, 2)
        
print(str(smoker_percentage) + "% of patients are smokers")

20.48% of patients are smokers


In [6]:
total_spending = 0

for line in insurance_data:
    total_spending += float(line["charges"])

average_cost = round(total_spending / total_patients, 2)
    
print("The average cost of patients' medical insurance: $" + str(average_cost))

The average cost of patients' medical insurance: $13270.42


In [7]:
north_west = 0
south_west = 0
north_east = 0
south_east = 0

for line in insurance_data:
    if line["region"] == "northwest":
        north_west += 1
    elif line["region"] == "southwest":
        south_west += 1
    elif line["region"] == "northeast":
        north_east += 1
    elif line["region"] == "southeast":
        south_east += 1

print("The following is the regional breakdown of patients:")
print("Northwest: " + str(north_west) + "\t\t Northeast: " + str(north_east))
print("Southwest: " + str(south_west) + "\t\t Southeast: " + str(south_east))

The following is the regional breakdown of patients:
Northwest: 325		 Northeast: 324
Southwest: 325		 Southeast: 364


In [8]:
smoker_cost = 0
non_smoker_cost = 0

for patient in insurance_data:
    patient_cost = float(patient["charges"])
    if patient["smoker"] == "yes":
        smoker_cost += patient_cost
    else:
        non_smoker_cost += patient_cost
        
smoker_cost = smoker_cost / smoker_count
non_smoker_cost = non_smoker_cost / (total_patients - smoker_count)
smoker_cost_difference = round(smoker_cost - non_smoker_cost, 2)

print("The average cost for patients that smoke is $" + str(smoker_cost_difference) + " higher than patients that do not.")

The average cost for patients that smoke is $23615.96 higher than patients that do not.


The above analysis shows a steep difference in medical insurance cost between patients who smoke and those who do not. On average, smokers in this dataset are charged over $23,000 more than non-smokers, which identifies smoking as a key factor associated with higher medical insurance costs.
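The same comparison can be written with the standard library's `statistics.mean`, which avoids tracking running totals and counts by hand. This is only a sketch on four made-up records, not the dataset itself, and `mean_charges` is a hypothetical helper name.

```python
from statistics import mean

# Made-up records for illustration only.
sample = [
    {"smoker": "yes", "charges": "30000.0"},
    {"smoker": "yes", "charges": "34000.0"},
    {"smoker": "no", "charges": "8000.0"},
    {"smoker": "no", "charges": "9000.0"},
]

def mean_charges(rows, smoker_flag):
    """Average charges for rows whose 'smoker' field equals smoker_flag."""
    return mean(float(r["charges"]) for r in rows if r["smoker"] == smoker_flag)

# Difference between the two group averages, rounded to cents.
difference = round(mean_charges(sample, "yes") - mean_charges(sample, "no"), 2)
print(difference)
```

Note that `statistics.mean` raises `StatisticsError` when a group is empty, so a dataset with no smokers (or no non-smokers) would need to be handled separately.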

In [9]:
female_cost = 0
male_cost = 0
female_smoker_cost = 0
male_smoker_cost = 0

female_smoker_count = 0
male_smoker_count = 0

for patient in insurance_data:
    patient_cost = float(patient["charges"])
    if patient["sex"] == "female":   
        if patient["smoker"] == "yes":
            female_smoker_cost += patient_cost
            female_smoker_count += 1
        else:
            female_cost += patient_cost
    elif patient["sex"] == "male":   
        if patient["smoker"] == "yes":
            male_smoker_cost += patient_cost
            male_smoker_count += 1
        else:
            male_cost += patient_cost
        
female_cost = round(female_cost / (female_count - female_smoker_count), 2)
male_cost = round(male_cost / (male_count - male_smoker_count), 2)
female_smoker_cost = round(female_smoker_cost / female_smoker_count, 2)
male_smoker_cost = round(male_smoker_cost / male_smoker_count, 2)

gender_cost_difference = round(female_cost - male_cost, 2)

print(str(female_smoker_count) + " female patients smoke")
print(str(male_smoker_count) + " male patients smoke")
print("---")
print("The average cost for female patients that do not smoke: $" + str(female_cost))
print("The average cost for male patients that do not smoke: $" + str(male_cost))
print("---")
print("The average cost for female patients that smoke: $" + str(female_smoker_cost))
print("The average cost for male patients that smoke: $" + str(male_smoker_cost))

115 female patients smoke
159 male patients smoke
---
The average cost for female patients that do not smoke: $8762.3
The average cost for male patients that do not smoke: $8087.2
---
The average cost for female patients that smoke: $30679.0
The average cost for male patients that smoke: $33042.01


Because the previous findings showed that being a smoker greatly increases the cost of insurance, the cost difference by sex was analyzed separately for patients who smoke and patients who do not. The data shows that more male patients smoke than female patients (159 vs. 115). Among smokers, males had a higher average cost than females by roughly \\$2,400. Among patients who do not smoke, however, females paid about \\$675 more on average than males.

Other factors not considered in this analysis are BMI, age, and region. Still, separating smokers from non-smokers shows a possible correlation between sex and medical insurance costs.
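Splitting the patients by two columns at once can be generalized so that each new breakdown does not need its own set of counters. The sketch below groups rows by a tuple of column values. The function name `group_mean` and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict

# Made-up records for illustration only.
sample = [
    {"sex": "female", "smoker": "no", "charges": "8000.0"},
    {"sex": "female", "smoker": "no", "charges": "9000.0"},
    {"sex": "male", "smoker": "no", "charges": "7000.0"},
    {"sex": "male", "smoker": "yes", "charges": "33000.0"},
]

def group_mean(rows, keys, value="charges"):
    """Return {tuple of key values: mean of `value`} for each group found."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = tuple(row[k] for k in keys)
        totals[group] += float(row[value])
        counts[group] += 1
    return {group: round(totals[group] / counts[group], 2) for group in totals}

means = group_mean(sample, ["sex", "smoker"])
print(means[("female", "no")])
```

A group with no members never appears in the result, so there is no division by zero to guard against. The same call with `["region"]` or `["region", "sex"]` as the keys would give other breakdowns.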

In [10]:
def find_median(input_list):
    n = len(input_list)
    # Sort a copy so the caller's list is not modified
    input_list = sorted(input_list)
    
    if n % 2 == 0:
        med1 = input_list[n//2]
        med2 = input_list[n//2 - 1]
        median = (med1 + med2) / 2
    else:
        median = input_list[n//2]
    
    return round(median, 2)
        
female_age = 0
male_age = 0
female_BMI = 0
male_BMI = 0

female_age_list = []
male_age_list = []
female_BMI_list = []
male_BMI_list = []

for patient in insurance_data:
    patient_cost = float(patient["charges"])
    if patient["sex"] == "female":   
        female_age += int(patient["age"])
        female_BMI += float(patient["bmi"])
        female_age_list.append(int(patient["age"]))
        female_BMI_list.append(float(patient["bmi"]))
    else:
        male_age += int(patient["age"])
        male_BMI += float(patient["bmi"])
        male_age_list.append(int(patient["age"]))
        male_BMI_list.append(float(patient["bmi"]))          

# Solve for each mean value
female_age_mean = round(female_age / female_count, 2)
male_age_mean = round(male_age / male_count, 2)
female_BMI_mean = round(female_BMI / female_count, 2)
male_BMI_mean = round(male_BMI / male_count, 2)

# Solve for each median value
female_age_median = find_median(female_age_list)
male_age_median = find_median(male_age_list)
female_BMI_median = find_median(female_BMI_list)
male_BMI_median = find_median(male_BMI_list)

print("The following is the statistical breakdown of patient age by gender:")
print("Mean of female age: " + str(female_age_mean) + "\t\t Median of female age: " + str(female_age_median))
print("Mean of male age: " + str(male_age_mean) + "\t\t\t Median of male age: " + str(male_age_median))

print("\nThe following is the statistical breakdown of patient BMI by gender:")
print("Mean of female BMI: " + str(female_BMI_mean) + "\t\t Median of female BMI: " + str(female_BMI_median))
print("Mean of male BMI: " + str(male_BMI_mean) + "\t\t\t Median of male BMI: " + str(male_BMI_median))

The following is the statistical breakdown of patient age by gender:
Mean of female age: 39.5		 Median of female age: 40.0
Mean of male age: 38.92			 Median of male age: 39.0

The following is the statistical breakdown of patient BMI by gender:
Mean of female BMI: 30.38		 Median of female BMI: 30.11
Mean of male BMI: 30.94			 Median of male BMI: 30.69


Age and BMI were two of the factors named above that were not accounted for when investigating the difference in cost between males and females. However, comparing the mean and median of each subset's age and BMI can show whether the data is skewed in either direction.

In the findings above, the mean and median of both age and BMI are similar regardless of sex. The mean age sits slightly below the median and the mean BMI slightly above it, but females and males show the same pattern in their data. This suggests that age and BMI have little effect on the cost difference found between males and females in this dataset.
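The mean-versus-median comparison used here is a rule of thumb rather than a formal test: a mean above the median usually points to a tail of high values, and a mean below it to a tail of low values. A small sketch of that check follows, using the standard `statistics` module. The function name `skew_direction` and the sample ages are hypothetical.

```python
from statistics import mean, median

def skew_direction(values):
    """Rough skew check: compare the mean with the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "right"   # a tail of high values pulls the mean up
    if m < med:
        return "left"    # a tail of low values pulls the mean down
    return "none"

# Made-up ages: one much older patient pulls the mean above the median.
ages = [20, 22, 25, 30, 64]
print(skew_direction(ages))
```

When the mean and median differ only slightly, as they do in the output above, this heuristic indicates at most a mild skew.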

In [11]:
# Function used to count males and females for each region
def sex_count(patient, female_count, male_count):    
    if patient["sex"] == "female":
        female_count += 1
    else:
        male_count += 1
        
    return female_count, male_count

north_west_average_cost = 0
south_west_average_cost = 0
north_east_average_cost = 0
south_east_average_cost = 0

north_west_female = 0
south_west_female = 0
north_east_female = 0
south_east_female = 0

north_west_male = 0
south_west_male = 0
north_east_male = 0
south_east_male = 0

for patient in insurance_data:
    patient_cost = float(patient["charges"])
    
    if patient["region"] == "northwest":
        north_west_average_cost += patient_cost
        north_west_female, north_west_male = sex_count(patient, north_west_female, north_west_male)
    elif patient["region"] == "southwest":
        south_west_average_cost += patient_cost
        south_west_female, south_west_male = sex_count(patient, south_west_female, south_west_male)
    elif patient["region"] == "northeast":
        north_east_average_cost += patient_cost
        north_east_female, north_east_male = sex_count(patient, north_east_female, north_east_male)
    elif patient["region"] == "southeast":
        south_east_average_cost += patient_cost
        south_east_female, south_east_male = sex_count(patient, south_east_female, south_east_male)
        
north_west_average_cost = round(north_west_average_cost / north_west, 2)
south_west_average_cost = round(south_west_average_cost / south_west, 2)
north_east_average_cost = round(north_east_average_cost / north_east, 2)
south_east_average_cost = round(south_east_average_cost / south_east, 2)

print("The following is the regional breakdown of the average insurance cost:")
print("Northwest: $" + str(north_west_average_cost) + "\t\t Northeast: $" + str(north_east_average_cost))
print("Southwest: $" + str(south_west_average_cost) + "\t\t Southeast: $" + str(south_east_average_cost))
                                                       
print("\nThe number of female patients per region:")
print("Northwest: " + str(north_west_female) + "\t\t Northeast: " + str(north_east_female))
print("Southwest: " + str(south_west_female) + "\t\t Southeast: " + str(south_east_female))
                                                       
print("\nThe number of male patients per region:")
print("Northwest: " + str(north_west_male) + "\t\t Northeast: " + str(north_east_male))
print("Southwest: " + str(south_west_male) + "\t\t Southeast: " + str(south_east_male))

The following is the regional breakdown of the average insurance cost:
Northwest: $12417.58		 Northeast: $13406.38
Southwest: $12346.94		 Southeast: $14735.41

The number of female patients per region:
Northwest: 164		 Northeast: 161
Southwest: 162		 Southeast: 175

The number of male patients per region:
Northwest: 161		 Northeast: 163
Southwest: 163		 Southeast: 189


The final question is addressed with a breakdown of average cost and sex ratio by region. Earlier analysis found that there were more patients in the southeast than in any other region. However, using the average cost should remove any differences caused by the distribution of patients. Looking at average cost alone, the southeast also has the highest average cost of any region, and both eastern regions have a higher average cost than either western region.

The gender breakdown is closely balanced in most regions. The southeast is again the outlier, with noticeably more male patients than female patients (189 vs. 175). However, because the southeast has the highest average cost, a surplus of male patients there would tend to raise the average male cost, so it does not explain the higher average female cost.
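The regional sex ratio can be expressed as a single number per region, which makes an outlier easier to spot than two separate tables of counts. The sketch below uses `collections.Counter` on made-up records. The variable names are illustrative and not part of the analysis above.

```python
from collections import Counter

# Made-up records for illustration only.
sample = [
    {"region": "southeast", "sex": "male"},
    {"region": "southeast", "sex": "male"},
    {"region": "southeast", "sex": "female"},
    {"region": "northwest", "sex": "female"},
]

# Count patients per (region, sex) pair and per region.
pair_counts = Counter((p["region"], p["sex"]) for p in sample)
region_counts = Counter(p["region"] for p in sample)

# Percentage of each region's patients who are male.
male_share = {
    region: round(100 * pair_counts[(region, "male")] / total, 1)
    for region, total in region_counts.items()
}
print(male_share)
```

Applied to the counts printed above, the southeast works out to 189 of 364 patients, or about 51.9% male.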

## Conclusion

After analyzing this dataset of U.S. medical insurance costs, the results suggest a possible correlation between gender and the cost of medical insurance. Among patients who do not smoke, the average female patient spends more on medical insurance than the average male patient. Smoking was accounted for by separating smokers from non-smokers, and the comparisons of age, body mass index, and region found no evidence in this dataset that these factors explain the difference.