# U.S. Medical Insurance Costs

### Goals:
The goal of this project is to analyze the provided U.S. Medical Insurance Costs dataset using Python and see what interesting insights we can gain from the data.

Some information that would be useful to have:
1. Basic Data:
   - Size of dataset
   - Fields of dataset
   - Sample size by sex
   - Sample size by region
   - Sample size of smokers and non-smokers
2. Age Data:
   - Min and max ages in dataset
   - Average age in dataset
   - Average age by sex
   - Average age by number of children
   - Average age of smokers and non-smokers
   - Average age in each region
   - Spread of ages (buckets)
3. BMI Data:
   - Min and Max BMI in dataset
   - Average BMI
   - Average BMI by sex
   - Average BMI by region
   - Average BMI by sex in each region
   - Average BMI for smokers and non-smokers
4. Insurance Yearly Cost Data:
   - Average charges in the dataset
   - Average charges by sex
   - Average charges by region
   - Average charges for smokers and non-smokers
   - Average charges by number of children
   - Average insurance costs by BMI (buckets)

In [1]:
#library to read CSV Files
import csv

insurance_data_records = []

#Storing the insurance data into a list of dictionaries
#each entry in the list is one record
with open('insurance.csv', newline='') as insurance_csv:
    insurance_data_records = list(csv.DictReader(insurance_csv))
    

In [50]:
#These are a few functions that will help us with data analysis.
def get_mean(data_list):
    return sum(data_list) / len(data_list)

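def get_median(data_list):
    sorted_data = sorted(data_list)
    n = len(sorted_data)
    midpoint = n//2
    if n % 2:
        #Odd length: the median is the single middle value
        return sorted_data[midpoint]
    else:
        #Even length: the median is the mean of the two middle values
        return (sorted_data[midpoint-1] + sorted_data[midpoint]) / 2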

def get_max_min(data_list):
    return max(data_list),min(data_list)

def get_percentage_each_age_bucket(age_list, age_split):
    age_buckets = {}
    for age in age_list:
        bucket = age // age_split
        age_buckets[bucket*age_split] = age_buckets.get(bucket*age_split, 0) + 1

    for bucket in age_buckets:
        age_buckets[bucket] = age_buckets[bucket] * 100 / len(age_list)

    return age_buckets

def sample_size_by_feature(data):
    people_by_feature = {}
    for row in data:
        people_by_feature[row] = people_by_feature.get(row, 0) + 1
    
    return people_by_feature

def avg_metric_by_feature(data, metric, feature):
    #Count how many records fall under each value of the feature
    people_by_feature = sample_size_by_feature([row[feature] for row in data])
    metric_by_feature = {}
    #Sum the metric for each value of the feature
    for row in data:
        current_feature = row[feature]
        metric_by_feature[current_feature] = metric_by_feature.get(current_feature,0) + float(row[metric])

    #Divide each sum by its count to get the average
    for feature_value in people_by_feature:
        metric_by_feature[feature_value] /= people_by_feature[feature_value]
    
    return metric_by_feature
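
Helpers like these are easy to get subtly wrong, especially the odd/even branch of a median. As a minimal sanity check, the sketch below compares hand-written mean and median functions against Python's standard `statistics` module. The helpers are redefined here so the snippet runs on its own, and the two sample lists are made up for illustration, not taken from the dataset.

```python
import math
import statistics

def get_mean(data_list):
    return sum(data_list) / len(data_list)

def get_median(data_list):
    sorted_data = sorted(data_list)
    n = len(sorted_data)
    midpoint = n // 2
    if n % 2:
        # Odd length: single middle value
        return sorted_data[midpoint]
    # Even length: mean of the two middle values
    return (sorted_data[midpoint - 1] + sorted_data[midpoint]) / 2

# Hypothetical samples: one odd-length list and one even-length list
odd_sample = [18, 64, 39]
even_sample = [18, 64, 39, 25]

for sample in (odd_sample, even_sample):
    # Both helpers should agree with the standard library
    assert math.isclose(get_mean(sample), statistics.mean(sample))
    assert get_median(sample) == statistics.median(sample)
    print(f"{sample}: mean={get_mean(sample):.2f}, median={get_median(sample)}")
```

Checking both an odd-length and an even-length list matters because each exercises a different branch of the median function.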

#### Basic Data:
1. Size of dataset
2. Fields of dataset
3. Sample size by sex
4. Sample size by region
5. Sample size of smokers and non-smokers

In [54]:
dataset_size = len(insurance_data_records)
fieldnames = insurance_data_records[0].keys()

male_data = []
female_data = []
smoker_data = []
non_smoker_data = []
ages = []
regions = []

#Split the data by sex and by choice to smoke
for row in insurance_data_records:
    if row['sex'] == 'male':
        male_data.append(row)
    else:
        female_data.append(row)

    if row['smoker'] == 'yes':
        smoker_data.append(row)
    else:
        non_smoker_data.append(row)

    ages.append(int(row['age']))
    regions.append(row['region'])
    

num_males = len(male_data)
num_females = len(female_data)
num_smokers = len(smoker_data)
num_non_smokers = len(non_smoker_data)
sample_from_each_region = sample_size_by_feature(regions)

print(f"The size of the dataset to analyze is {dataset_size}\n")
print(f"The fields found in the dataset are:")
for field in fieldnames:
    print(f"\t'{field}' of type {type(insurance_data_records[0][field])}")
print(f"\nNumber of Males: {num_males}")
print(f"Number of Females: {num_females}")
print(f"Number of Smokers: {num_smokers}")
print(f"Number of Non-smokers: {num_non_smokers}\n")
for region, sample in sample_from_each_region.items():
    print(f"The sample size from the {region} is {sample} people.")

The size of the dataset to analyze is 1338

The fields found in the dataset are:
	'age' of type <class 'str'>
	'sex' of type <class 'str'>
	'bmi' of type <class 'str'>
	'children' of type <class 'str'>
	'smoker' of type <class 'str'>
	'region' of type <class 'str'>
	'charges' of type <class 'str'>

Number of Males: 676
Number of Females: 662
Number of Smokers: 274
Number of Non-smokers: 1064

The sample size from the southwest is 325 people.
The sample size from the southeast is 364 people.
The sample size from the northwest is 325 people.
The sample size from the northeast is 324 people.


<b>As we can see</b>
- All of the insurance data we have is stored as text strings, including number fields. This means that as we process the data we will need to convert it to the appropriate data type.
- A fairly even sample of males and females was collected.
- A fairly even sample of people from each region was collected, with the southeast region slightly overrepresented (364 people versus 324 to 325 in each of the others).
- The data contains almost 4 times as many non-smokers as smokers (1064 versus 274). Further research might reveal how representative of the U.S. population this is.
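
Since every field arrives as a string, one option is to convert each record once, up front, rather than casting at every point of use. The following is only a sketch of that idea: `convert_record` is a hypothetical helper that is not part of this notebook, and the sample row is made up to match the shape that `csv.DictReader` produces.

```python
def convert_record(row):
    """Return a copy of one CSV row with numeric fields converted from text."""
    return {
        'age': int(row['age']),
        'sex': row['sex'],
        'bmi': float(row['bmi']),
        'children': int(row['children']),
        'smoker': row['smoker'],
        'region': row['region'],
        'charges': float(row['charges']),
    }

# Hypothetical row (all values are strings, as csv.DictReader returns them)
sample_row = {
    'age': '30', 'sex': 'female', 'bmi': '25.5', 'children': '2',
    'smoker': 'no', 'region': 'northwest', 'charges': '4500.75',
}

typed_row = convert_record(sample_row)

# Show the resulting type of each field
for field, value in typed_row.items():
    print(f"'{field}' of type {type(value).__name__}")
```

The rest of this notebook keeps the raw strings and converts them where needed, which works equally well for a dataset of this size.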

#### Analyzing Age Data
1. Min and max ages in dataset
2. Average age in dataset
3. Median age in dataset
4. Average age by sex
5. Average age by number of children
6. Average age of smokers and non-smokers
7. Average age in each region
8. Spread of ages (buckets)

In [47]:
max_age, min_age = get_max_min(ages)
avg_age = get_mean(ages)
avg_age_by_sex = avg_metric_by_feature(insurance_data_records, 'age', 'sex')
median_age = get_median(ages)
avg_age_by_num_children = avg_metric_by_feature(insurance_data_records, 'age', 'children')
avg_age_by_smoker = avg_metric_by_feature(insurance_data_records, 'age', 'smoker')
avg_age_by_region = avg_metric_by_feature(insurance_data_records, 'age', 'region')
age_buckets = get_percentage_each_age_bucket(ages, 10)

print(f"Max Age: {max_age}\nMin Age: {min_age}\nAverage age: {avg_age:.2f}\nMedian Age: {median_age}\n")
for sex,age in avg_age_by_sex.items():
    print(f"Average age of {sex}s: {age:.2f}")
    
print()

for smoker,age in sorted(avg_age_by_smoker.items()):
    print(f"Average age of {'smoker' if smoker == 'yes' else 'non-smoker'}s: {age:.2f}")
    
print()

for children,age in sorted(avg_age_by_num_children.items()):
    print(f"Average age of people with {children} children: {age:.2f}")

print()

for region,age in avg_age_by_region.items():
    print(f"Average age in the {region} region: {age:.2f}")

print()

for bucket in sorted(age_buckets):
    print(f"{age_buckets[bucket]:.2f}% of data collected is from people in their {bucket}s")

Max Age: 64
Min Age: 18
Average age: 39.21
Median Age: 39

Average age of females: 39.50
Average age of males: 38.92

Average age of non-smokers: 39.39
Average age of smokers: 38.51

Average age of people with 0 children: 38.44
Average age of people with 1 children: 39.45
Average age of people with 2 children: 39.45
Average age of people with 3 children: 41.57
Average age of people with 4 children: 39.00
Average age of people with 5 children: 35.61

Average age in the southwest region: 39.46
Average age in the southeast region: 38.94
Average age in the northwest region: 39.20
Average age in the northeast region: 39.27

10.24% of data collected is from people in their 10s
20.93% of data collected is from people in their 20s
19.21% of data collected is from people in their 30s
20.85% of data collected is from people in their 40s
20.25% of data collected is from people in their 50s
8.52% of data collected is from people in their 60s


In [7]:
#Average age by region at full precision
for region,avg_age in avg_metric_by_feature(insurance_data_records, 'age', 'region').items():
    print(f"The average age for people in the {region} region is {avg_age}")

The average age for people in the southwest region is 39.45538461538462
The average age for people in the southeast region is 38.93956043956044
The average age for people in the northwest region is 39.19692307692308
The average age for people in the northeast region is 39.26851851851852


#### Analyzing BMI Data
1. Find average BMI in dataset
2. Find average BMI per region
3. Find average BMI for male and female populations

In [8]:
avg_bmi = get_mean([float(row['bmi']) for row in insurance_data_records])
print(f"The average BMI for the dataset is {avg_bmi}")

The average BMI for the dataset is 30.663396860986538


In [10]:
for region,region_avg_bmi in avg_metric_by_feature(insurance_data_records, 'bmi', 'region').items():
    print(f"The average bmi for people in the {region} region is {region_avg_bmi}")

The average bmi for people in the southwest region is 30.59661538461538
The average bmi for people in the southeast region is 33.35598901098903
The average bmi for people in the northwest region is 29.199784615384626
The average bmi for people in the northeast region is 29.17350308641976


In [12]:
avg_male_bmi = get_mean([float(row['bmi']) for row in male_data])
avg_female_bmi = get_mean([float(row['bmi']) for row in female_data])
print(f"The average bmi for males in the dataset is {avg_male_bmi}.")
print(f"The average bmi for females in the dataset is {avg_female_bmi}.\n")

for region,avg_bmi in avg_metric_by_feature(male_data, 'bmi', 'region').items():
    print(f"The average bmi for males in the {region} region is {avg_bmi}.")

for region,avg_bmi in avg_metric_by_feature(female_data, 'bmi', 'region').items():
    print(f"The average bmi for females in the {region} region is {avg_bmi}")

The average bmi for males in the dataset is 30.943128698224832.
The average bmi for females in the dataset is 30.377749244713023.

The average bmi for males in the southeast region is 33.99.
The average bmi for males in the northwest region is 29.120155279503102.
The average bmi for males in the northeast region is 29.024539877300615.
The average bmi for males in the southwest region is 31.129447852760737.
The average bmi for females in the southwest region is 30.060493827160496
The average bmi for females in the southeast region is 32.67125714285712
The average bmi for females in the northwest region is 29.27795731707316
The average bmi for females in the northeast region is 29.324316770186336
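
The goals at the top also list the min and max BMI and the average BMI for smokers and non-smokers, which the cells above do not compute. Below is a rough sketch of how those two items could be approached. It runs on a few made-up rows (`sample_records` is hypothetical, not the real dataset) so that it is self-contained. In the notebook the same logic would be applied to the full list of records.

```python
from collections import defaultdict

# Hypothetical rows, invented for illustration only
sample_records = [
    {'bmi': '22.0', 'smoker': 'no'},
    {'bmi': '30.0', 'smoker': 'yes'},
    {'bmi': '26.0', 'smoker': 'no'},
    {'bmi': '34.0', 'smoker': 'yes'},
]

# Min and max BMI: convert the text values to floats first
bmis = [float(row['bmi']) for row in sample_records]
max_bmi, min_bmi = max(bmis), min(bmis)

# Group BMI values by smoker status, then average each group
bmi_by_smoker = defaultdict(list)
for row in sample_records:
    bmi_by_smoker[row['smoker']].append(float(row['bmi']))

avg_bmi_by_smoker = {
    status: sum(values) / len(values)
    for status, values in bmi_by_smoker.items()
}

print(f"Max BMI: {max_bmi}\nMin BMI: {min_bmi}")
for status, avg in sorted(avg_bmi_by_smoker.items()):
    label = 'smokers' if status == 'yes' else 'non-smokers'
    print(f"Average BMI of {label}: {avg:.2f}")
```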


#### Analyzing Insurance Data and its relations
1. Find average charges in the dataset
2. Find average charges by sex
3. Find average charges by region
4. Find average charges for smokers and non-smokers
5. Find average charges by number of children

In [13]:
avg_charges = get_mean([float(row['charges']) for row in insurance_data_records])
print(f"The average insurance cost in the dataset is {avg_charges}")

The average insurance cost in the dataset is 13270.422265141257


In [14]:
avg_male_charges = get_mean([float(row['charges']) for row in male_data])
avg_female_charges = get_mean([float(row['charges']) for row in female_data])
print(f"The average insurance cost for males in the dataset is {avg_male_charges}.")
print(f"The average insurance cost for females in the dataset is {avg_female_charges}.\n")
print(f"On average, males in the dataset pay {avg_male_charges-avg_female_charges} more for insurance.")

The average insurance cost for males in the dataset is 13956.751177721886.
The average insurance cost for females in the dataset is 12569.57884383534.

On average, males in the dataset pay 1387.1723338865468 more for insurance.


In [35]:
pct_males_who_smoke = len([1 for row in male_data if row['smoker'] == 'yes']) * 100 / len(male_data)
pct_females_who_smoke = len([1 for row in female_data if row['smoker'] == 'yes']) * 100 / len(female_data)
print(f"{pct_males_who_smoke}% of males in the dataset are smokers.")
print(f"{pct_females_who_smoke}% of females in the dataset are smokers.")

23.5207100591716% of males in the dataset are smokers.
17.371601208459214% of females in the dataset are smokers.


In [16]:
charges_by_region = avg_metric_by_feature(insurance_data_records, 'charges', 'region')
for region,region_avg_charges in charges_by_region.items():
    print(f"The average insurance cost for people in the {region} region is {region_avg_charges}")

highest_cost_region = max(charges_by_region, key=charges_by_region.get)
print(f"The region with the highest insurance costs on average is the {highest_cost_region} region.")

The average insurance cost for people in the southwest region is 12346.93737729231
The average insurance cost for people in the southeast region is 14735.411437609895
The average insurance cost for people in the northwest region is 12417.575373969228
The average insurance cost for people in the northeast region is 13406.3845163858
The region with the highest insurance costs on average is the southeast region.


In [33]:
smoker_dict = avg_metric_by_feature(insurance_data_records, 'charges','smoker')

print(f"The average insurance cost for smokers in the dataset is {smoker_dict['yes']}")
print(f"The average insurance cost for non-smokers in the dataset is {smoker_dict['no']}")
print(f"On average, smokers in the dataset pay ${smoker_dict['yes']-smoker_dict['no']} more on insurance than non-smokers.")

The average insurance cost for smokers in the dataset is 32050.23183153285
The average insurance cost for non-smokers in the dataset is 8434.268297856199
On average, smokers in the dataset pay $23615.96353367665 more on insurance than non-smokers.


In [29]:
people_by_num_children = sample_size_by_feature([int(row['children']) for row in insurance_data_records])
for num_children, people in sorted(people_by_num_children.items()):
    print(f"{people} people had {num_children} children")

print()

charges_by_num_children = avg_metric_by_feature(insurance_data_records, 'charges','children')
for num_children, charges in sorted(charges_by_num_children.items()):
    print(f"People with {num_children} children have an average insurance cost of {charges}.")

574 people had 0 children
324 people had 1 children
240 people had 2 children
157 people had 3 children
25 people had 4 children
18 people had 5 children

People with 0 children have an average insurance cost of 12365.975601635882.
People with 1 children have an average insurance cost of 12731.171831635793.
People with 2 children have an average insurance cost of 15073.563733958328.
People with 3 children have an average insurance cost of 15355.31836681528.
People with 4 children have an average insurance cost of 13850.656311199999.
People with 5 children have an average insurance cost of 8786.035247222222.
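
The last item in the goals, average insurance costs by BMI bucket, is not covered by the cells above. One possible approach mirrors the age-bucket helper: floor each BMI to the start of its bucket, then average the charges per bucket. The sketch below assumes a bucket width of 5 BMI points. `avg_charges_by_bmi_bucket` and `sample_records` are hypothetical names, and the rows are made up rather than drawn from the dataset.

```python
def avg_charges_by_bmi_bucket(records, bucket_width=5):
    """Average the 'charges' field per BMI bucket of the given width."""
    totals = {}
    counts = {}
    for row in records:
        # Floor the BMI to the lower edge of its bucket, e.g. 22.5 -> 20
        bucket = int(float(row['bmi']) // bucket_width) * bucket_width
        totals[bucket] = totals.get(bucket, 0.0) + float(row['charges'])
        counts[bucket] = counts.get(bucket, 0) + 1
    return {bucket: totals[bucket] / counts[bucket] for bucket in sorted(totals)}

# Hypothetical rows, invented for illustration only
sample_records = [
    {'bmi': '22.5', 'charges': '4000.0'},
    {'bmi': '24.0', 'charges': '6000.0'},
    {'bmi': '31.2', 'charges': '10000.0'},
    {'bmi': '33.9', 'charges': '14000.0'},
]

bucket_width = 5
bucketed_charges = avg_charges_by_bmi_bucket(sample_records, bucket_width)
for bucket, avg in bucketed_charges.items():
    print(f"BMI from {bucket} to under {bucket + bucket_width}: average charges {avg:.2f}")
```

A width of 5 is an arbitrary choice. Standard BMI categories (underweight, normal, overweight, obese) would be another reasonable way to split the data.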
