# U.S. Medical Insurance Costs

Data Source: https://www.kaggle.com/mirichoi0218/insurance <br>
For [Data Analyst Career Path](https://www.codecademy.com/learn/paths/data-analyst) on Codecademy

- Find out the average age of the patients in the dataset
- Analyze where a majority of the individuals are from
- Look at the different costs between smokers vs. non-smokers
- Figure out what the average age is for someone who has at least one child in this dataset
- Optional: Document and organize your findings into dictionaries, lists, or another convenient datatype
- Optional: Make predictions about a dataset’s features based on your findings: what features are the most influential for an individual’s medical insurance charges
- Optional: Explore areas where the data may include bias and how that would impact potential use cases

In [197]:
import csv
import numpy as np

main_data = []
ages = []
regions = []
bmis = []
ins_costs = []

with open('insurance.csv') as data_file:
    dataread = csv.DictReader(data_file)
    for r in dataread:
        line = {}
        for k, v in r.items():
            if k == 'age':
                v = int(v)
                ages.append(v)
            if k == 'bmi':
                v = float(v)
                bmis.append(v)
            if k == 'children':
                v = int(v)
            if k == 'charges':
                v = float(v)
                ins_costs.append(v)
            line.update({k: v})
        main_data.append(line)
        regions.append(r['region'])

# Option 2 for making separate lists        
# ages = [a['age'] for a in main_data]
# regions = [a['region'] for a in main_data]
# print(main_data[0])


def find_aver(name):
    """Print the dataset-wide average for 'age', 'bmi' or 'cost'."""
    if name == 'age':
        aver_age = round(np.mean(ages), 1)
        print(f'The average age is {aver_age} years')
    elif name == 'bmi':
        aver_bmi = round(np.mean(bmis), 2)
        print(f'The average BMI is {aver_bmi}')
    elif name == 'cost':
        aver_cost = round(np.mean(ins_costs), 2)
        print(f'The average ins cost is ${aver_cost}')
    else:
        print('This argument does not exist yet')
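
The per-column `if` chain in the loading loop could also be written as a lookup table of converters. The following is a minimal sketch, not the notebook's own code: the `CONVERTERS` table, the `load_rows` helper, and the two sample rows are hypothetical and only assume the column names used above.

```python
import csv
import io

# Hypothetical lookup table: column name -> function that parses its raw string.
CONVERTERS = {'age': int, 'bmi': float, 'children': int, 'charges': float}


def load_rows(file_obj):
    """Return a list of dicts; numeric columns are converted, the rest stay strings."""
    rows = []
    for raw in csv.DictReader(file_obj):
        rows.append({k: CONVERTERS.get(k, str)(v) for k, v in raw.items()})
    return rows


# Made-up sample in the same column layout as insurance.csv (not real records).
sample = io.StringIO(
    'age,sex,bmi,children,smoker,region,charges\n'
    '30,female,22.5,0,no,northwest,4000.0\n'
    '45,male,31.0,2,yes,southeast,25000.5\n'
)
sample_rows = load_rows(sample)
sample_ages = [r['age'] for r in sample_rows]
print(sample_rows[0])
print(sample_ages)
```

With the real file, the same helper would be called as `load_rows(open('insurance.csv'))` inside a `with` block, and the separate lists could then be built with list comprehensions as in the commented-out "Option 2" above.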

- Find out the **average age** of the patients in the dataset

In [199]:
find_aver('age')

The average age is 39.2 years


- Analyze **where a majority of the individuals are from**

In [181]:
# Count how many records fall in each region
regions_count = {r: regions.count(r) for r in set(regions)}
print(regions_count)
print('The majority is from South-East region')

{'northwest': 325, 'northeast': 324, 'southwest': 325, 'southeast': 364}
The majority is from South-East region
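
The largest region can also be found programmatically instead of being read off the printed dict. A minimal sketch using `collections.Counter`, seeded here with the counts printed above (with the full `regions` list, `Counter(regions)` would produce the same counts directly):

```python
from collections import Counter

# Region counts as printed above.
regions_count = Counter({'northwest': 325, 'northeast': 324, 'southwest': 325, 'southeast': 364})

# most_common(1) returns a list with the single (region, count) pair that has the highest count.
top_region, top_count = regions_count.most_common(1)[0]
total = sum(regions_count.values())
share = round(100 * top_count / total, 1)
print(f'{top_region}: {top_count} of {total} records ({share}%)')
# prints: southeast: 364 of 1338 records (27.2%)
```

The share shows that the southeast is the largest single group, but at roughly 27% it is a plurality rather than more than half of the records.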


- Look at the **different costs between smokers vs. non-smokers**

In [155]:
smok_cost = 0.0
nonsmok_cost = 0.0
smokers = 0
nonsmokers = 0

# Only childless people are compared, so the number of children does not affect the result
for r in main_data:
    if r['children'] == 0:
        if r['smoker'] == 'yes':
            smok_cost += r['charges']
            smokers += 1
        else:
            nonsmok_cost += r['charges']
            nonsmokers += 1

smok_aver_cost = round(smok_cost / smokers, 2)
nonsmok_aver_cost = round(nonsmok_cost / nonsmokers, 2)

sm_coeff = round(smok_aver_cost - nonsmok_aver_cost, 2)
print(f'The average diff is ${sm_coeff}')

print(f'There are {smokers} childless smokers. The average ins cost is: ${smok_aver_cost}')
print(f'There are {nonsmokers} childless non-smokers. The average ins cost is: ${nonsmok_aver_cost}')

The average diff is $23729.57
There are 115 childless smokers. The average ins cost is: $31341.36
There are 459 childless non-smokers. The average ins cost is: $7611.79
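
The same accumulate-and-divide pattern recurs in the comparisons that follow, so it could be factored into one helper. This is only a sketch under assumptions: the `average_by` function and the `toy_rows` list are hypothetical, and the rows are made-up values rather than records from `insurance.csv`.

```python
from collections import defaultdict


def average_by(rows, key, value='charges'):
    """Return {group: (count, mean of value)} for rows grouped by row[key]."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
        counts[row[key]] += 1
    return {g: (counts[g], round(totals[g] / counts[g], 2)) for g in totals}


# Made-up rows for illustration only.
toy_rows = [
    {'smoker': 'yes', 'children': 0, 'charges': 30000.0},
    {'smoker': 'yes', 'children': 0, 'charges': 32000.0},
    {'smoker': 'no', 'children': 0, 'charges': 7000.0},
    {'smoker': 'no', 'children': 0, 'charges': 8000.0},
    {'smoker': 'no', 'children': 2, 'charges': 9000.0},
]

# Filter first (childless only), then group by smoking status.
childless = [r for r in toy_rows if r['children'] == 0]
by_smoker = average_by(childless, 'smoker')
print(by_smoker)
```

Applied to `main_data`, the same call with `key='sex'` or `key='children'` would cover the later comparisons, with the filtering step deciding which people are included.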


Open question: how is smoking status verified? Is it based only on what people report themselves?

- Figure out what the **average age** is for someone who has **at least one child** in this dataset

In [184]:
tot_age = 0
fam_count = 0

for r in main_data:
    if r['children'] != 0:
        tot_age += r['age']
        fam_count += 1
        
aver_fam_age = round(tot_age / fam_count, 2)
print(f'The average age of people with at least one child is {aver_fam_age} years')

The average age of people with at least one child is 39.78 years


- **Optional:** Make predictions about a dataset’s features based on your findings: what features are the most influential for an individual’s medical insurance charges

- We can estimate the **insurance cost difference based on sex** <br/>
Smokers are excluded because smoking raises costs substantially, and people with children are excluded as well.

In [157]:
men_cost = 0.0
women_cost = 0.0
men = 0
women = 0

for r in main_data:
    if r['smoker'] == 'no' and r['children'] == 0:
        if r['sex'] == 'male':
            men_cost += r['charges']
            men += 1
        else:
            women_cost += r['charges']
            women += 1

men_aver_cost = round(men_cost / men, 2)
women_aver_cost = round(women_cost / women, 2)

sx_coeff = round(men_aver_cost - women_aver_cost, 2)
print(f'The average diff is ${sx_coeff}')

print(f'There are {men} non-smokers childless men. The average ins cost is: ${men_aver_cost}')
print(f'There are {women} non-smokers childless women. The average ins cost is: ${women_aver_cost}')

The average diff is $-157.51
There are 223 non-smokers childless men. The average ins cost is: $7530.81
There are 236 non-smokers childless women. The average ins cost is: $7688.32


- We can estimate the **insurance cost difference based on the number of children** <br/>
Smokers are excluded because smoking raises costs substantially.

In [158]:
# Total charges and head counts for non-smokers, keyed by number of children (0-5)
ch_cost = {n: 0.0 for n in range(6)}
ch_count = {n: 0 for n in range(6)}

for r in main_data:
    if r['smoker'] == 'no' and r['children'] in ch_cost:
        ch_cost[r['children']] += r['charges']
        ch_count[r['children']] += 1

# Average cost per group
ch_aver_cost = {n: round(ch_cost[n] / ch_count[n], 2) for n in ch_cost}

# Difference between each group and the group with one child fewer
ch1_coeff = round(ch_aver_cost[1] - ch_aver_cost[0], 2)
print(f'The average diff for 1 child vs 0 child is ${ch1_coeff}')
ch2_coeff = round(ch_aver_cost[2] - ch_aver_cost[1], 2)
print(f'The average diff for 2 children vs 1 children is ${ch2_coeff}')
ch3_coeff = round(ch_aver_cost[3] - ch_aver_cost[2], 2)
print(f'The average diff for 3 children vs 2 children is ${ch3_coeff}')

for n in ch_aver_cost:
    print(f'There are {ch_count[n]} people with {n} child. The average ins cost is: ${ch_aver_cost[n]}')

The average diff for 1 child vs 0 child is $691.32
The average diff for 2 children vs 1 children is $1189.98
The average diff for 3 children vs 2 children is $121.43
There are 459 people with 0 child. The average ins cost is: $7611.79
There are 263 people with 1 child. The average ins cost is: $8303.11
There are 185 people with 2 child. The average ins cost is: $9493.09
There are 118 people with 3 child. The average ins cost is: $9614.52
There are 22 people with 4 child. The average ins cost is: $12121.34
There are 17 people with 5 child. The average ins cost is: $8183.85


- We can estimate the **insurance cost difference based on age**

In [188]:
# Option 1
ages_range = range(min(ages), max(ages)+1)
age_cost = {}

for age in ages_range:
    cost = 0.0
    count = 0
    for r in main_data:
        if r['age'] == age:
            cost += r['charges']
            count += 1
    if count != 0:
        aver_cost = round(cost / count, 2)
        age_cost.update({age: {'Count': count, 'Aver Cost': aver_cost}})

# Option 2. Here we have to sort the dict, so I decided to continue with Option 1
# for r in main_data:
#     cost = 0.0
#     count = 0
#     if r['age'] not in age_cost:
#         for i in main_data:
#             if r['age'] == i['age']:
#                 cost += i['charges']
#                 count += 1
#     if count != 0:
#         aver_cost = round(cost / count, 2)
#         age_cost.update({r['age']: {'Count': count, 'Aver Cost': aver_cost}})            
        
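# Average the change in the per-age average cost from one age to the next.
# Note: there are len(aver_costs) - 1 consecutive differences, but the total is
# divided by len(aver_costs), so the printed figure is slightly understated.
diff_total = 0
aver_costs = [age_cost[r]['Aver Cost'] for r in age_cost]
for k in range(len(aver_costs)-1):
    diff = aver_costs[k+1] - aver_costs[k]
    diff_total += diff

aver_diff = round(diff_total / len(aver_costs), 2)
print(f'The average increase in ins cost is ${aver_diff} for one year')
# print(age_cost)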

The average increase in ins cost is $344.45 for one year
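
Because the loop above sums differences between neighbouring ages, the intermediate values cancel and the total equals the last average minus the first, so the result depends only on the youngest and oldest age groups. The sketch below illustrates this with made-up numbers (the `toy_costs` list is hypothetical); it divides by the number of differences, which is one fewer than the number of ages.

```python
# Hypothetical average costs for five consecutive ages (made-up numbers).
toy_costs = [7000.0, 7400.0, 7100.0, 8200.0, 8600.0]

# Differences between neighbouring ages.
diffs = [b - a for a, b in zip(toy_costs, toy_costs[1:])]

# The sum of consecutive differences telescopes to "last minus first".
telescoped = toy_costs[-1] - toy_costs[0]
print(sum(diffs), telescoped)

# n values give n - 1 differences, so the mean yearly step divides by len(diffs).
mean_step = round(sum(diffs) / len(diffs), 2)
print(mean_step)
# prints: 400.0
```

A fit that uses every age group, such as a least-squares line, would be less sensitive to the two end points, but that goes beyond what the notebook computes.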


- A simple function that prints the **average insurance cost for a given age**<br/>
(Based on the dict created in the previous step)

In [194]:
# average cost for given age

def aver_cost_age(age):
    try:
        print(f"The average cost for {age} years old is: ${age_cost[age]['Aver Cost']}")
    except KeyError:
        print(f'The age {age} does not exist in the dataset. Please provide an age between {min(ages)} and {max(ages)} years')

aver_cost_age(19)

The average cost for 19 years old is: $9747.91
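
If the value is needed for further calculations rather than only for display, a variant that returns it may be more convenient. A minimal sketch, assuming the `age_cost` structure built above; the helper name `get_aver_cost` and the one-entry `sample_age_cost` dict are hypothetical, with the figure for age 19 taken from the output above.

```python
def get_aver_cost(age_cost, age):
    """Return the stored average cost for `age`, or None if that age is absent."""
    entry = age_cost.get(age)
    return None if entry is None else entry['Aver Cost']


# One entry in the same shape as the dict built above (the 'Count' key is omitted here).
sample_age_cost = {19: {'Aver Cost': 9747.91}}

print(get_aver_cost(sample_age_cost, 19))   # prints: 9747.91
print(get_aver_cost(sample_age_cost, 17))   # prints: None
```

Returning `None` leaves it to the caller to decide how to report a missing age, instead of printing a message inside the function.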


- We can estimate the **insurance cost difference based on BMI**<br/>
Smokers and people with children are excluded, because both factors raise costs.

In [160]:
# average cost for bmi levels

bmi_levels = {'Below normal': 18.5, 'Normal': 25, 'Above Normal': 30}

bmi_below_cost = 0.0
bmi_norm_cost = 0.0
bmi_abovenorm_cost = 0.0
bmi_obes_cost = 0.0
bmi_below = 0
bmi_norm = 0
bmi_abovenorm = 0
bmi_obes = 0

for r in main_data:
    if r['smoker'] == 'no' and r['children'] == 0:
        if r['bmi'] < 18.5:
            bmi_below_cost += r['charges']
            bmi_below += 1
        elif 18.5 <= r['bmi'] < 25:
            bmi_norm_cost += r['charges']
            bmi_norm += 1
        elif 25 <= r['bmi'] < 30:
            bmi_abovenorm_cost += r['charges']
            bmi_abovenorm += 1           
        elif r['bmi'] >= 30:
            bmi_obes_cost += r['charges']
            bmi_obes += 1            
            
bmi_below_avcost = round(bmi_below_cost / bmi_below, 2)
bmi_norm_avcost = round(bmi_norm_cost / bmi_norm, 2)
bmi_abovenorm_avcost = round(bmi_abovenorm_cost / bmi_abovenorm, 2)
bmi_obes_avcost = round(bmi_obes_cost / bmi_obes, 2)

bmi_norm_diff = round(bmi_norm_avcost - bmi_below_avcost, 2)
print(f'The average diff for normal BMI vs below norm BMI is: ${bmi_norm_diff}')
bmi_abovenorm_diff = round(bmi_abovenorm_avcost - bmi_norm_avcost, 2)
print(f'The average diff for above normal BMI vs norm BMI is: ${bmi_abovenorm_diff}')
bmi_obes_diff = round(bmi_obes_avcost - bmi_abovenorm_avcost, 2)
print(f'The average diff for obesity BMI vs above norm BMI is: ${bmi_obes_diff}')


print(f'There are {bmi_below} people with BMI below normal. The average ins cost is: ${bmi_below_avcost}')
print(f'There are {bmi_norm} people with BMI in normal zone. The average ins cost is: ${bmi_norm_avcost}')
print(f'There are {bmi_abovenorm} people with BMI above normal. The average ins cost is: ${bmi_abovenorm_avcost}')
print(f'There are {bmi_obes} people with BMI in obesity zone. The average ins cost is: ${bmi_obes_avcost}')


The average diff for normal BMI vs below norm BMI is: $671.1
The average diff for above normal BMI vs norm BMI is: $596.33
The average diff for obesity BMI vs above norm BMI is: $500.7
There are 9 people with BMI below normal. The average ins cost is: $6203.55
There are 74 people with BMI in normal zone. The average ins cost is: $6874.65
There are 136 people with BMI above normal. The average ins cost is: $7470.98
There are 240 people with BMI in obesity zone. The average ins cost is: $7971.68
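
The chain of range checks above can also be expressed as a lookup over sorted thresholds. This is a sketch only: `BMI_BOUNDS`, `BMI_LABELS`, and `bmi_band` are hypothetical names, and the boundaries are the ones used in the code above (18.5, 25, 30).

```python
from bisect import bisect_right

# Lower bounds of the 2nd, 3rd and 4th bands, matching the thresholds used above.
BMI_BOUNDS = [18.5, 25, 30]
BMI_LABELS = ['Below normal', 'Normal', 'Above normal', 'Obesity']


def bmi_band(bmi):
    """Return the band label; a value equal to a bound falls into the higher band."""
    return BMI_LABELS[bisect_right(BMI_BOUNDS, bmi)]


for value in (17.0, 18.5, 24.9, 25, 30.66):
    print(value, bmi_band(value))
```

`bisect_right` counts how many bounds are less than or equal to the value, which reproduces the `<` and `<=` comparisons of the original chain, for example 18.5 is "Normal" and 30 is "Obesity".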


- Overall **averages for BMI and insurance cost**

In [200]:
find_aver('bmi')
find_aver('cost')

The average BMI is 30.66
The average ins cost is $13270.42


The average BMI (30.66) falls just inside the obesity range (30 and above).

- Optional: Explore areas where the data may include **bias** and how that would impact potential use cases

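Few people in the dataset have 4 or more children (22 and 17 non-smokers with 4 and 5 children, respectively), so the averages for those groups are unreliable. For example, non-smokers with 5 children have a lower average cost ($8183.85) than those with 1 child ($8303.11), and the average for 4 children ($12121.34) is unusually high, possibly because of an outlier.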
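
One way to make the small-group concern concrete is the standard error of the mean, which shrinks as a group grows. A minimal sketch with made-up numbers (the helper `mean_and_se` and both sample lists are hypothetical and not part of the notebook):

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_se(values):
    """Return (mean, standard error of the mean) for a sample with at least 2 values."""
    return mean(values), stdev(values) / sqrt(len(values))


# Made-up charges: the same four values, once alone and once repeated 25 times.
small_group = [4000.0, 6000.0, 9000.0, 13000.0]
large_group = small_group * 25

small_mean, small_se = mean_and_se(small_group)
large_mean, large_se = mean_and_se(large_group)
print(round(small_mean, 2), round(small_se, 2))
print(round(large_mean, 2), round(large_se, 2))
```

Both groups have the same mean, but the larger one has a much smaller standard error. Reporting this figure next to each group average would show how much less the 4- and 5-children averages can be trusted than the averages of the larger groups.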