# U.S. Medical Insurance Costs

This notebook analyzes a dataset of U.S. medical insurance records stored in the file `insurance.csv`. We'll begin by importing Python's built-in **csv** module so we can read this file:


In [1]:
import csv

## Dataset description

The dataset file contains seven fields:
- `age`, the age of the individual;
- `sex`, the sex of the individual;
- `bmi`, the [body mass index](https://en.wikipedia.org/wiki/Body_mass_index) of the individual;
- `children`, the number of children of the individual;
- `smoker`, whether the individual smokes or not;
- `region`, a very broad description of the individual's location;
- `charges`, how much that individual pays for insurance.

We can define lists for each of these fields:

In [2]:
ages = []
sexes = []
bmis = []
children = []
is_smoker = []
regions = []
charges = []

With our lists defined, we can read the dataset file and populate them:

In [3]:
with open('insurance.csv', newline='') as insurance_dataset:
    insurance_dict = csv.DictReader(insurance_dataset)
    for item in insurance_dict:
        ages.append(int(item["age"]))
        sexes.append(item["sex"])
        bmis.append(float(item["bmi"]))
        children.append(int(item["children"]))
        is_smoker.append(item["smoker"] == 'yes')
        regions.append(item["region"])
        charges.append(float(item["charges"]))

Because every row appends exactly one value to each list, all lists should end up with the same length, and a given index refers to the same individual in every list.
We can see if this holds by checking the length of each list:

In [4]:
# Un/comment each line to show/hide its output

print(len(ages))
print(len(sexes))
print(len(bmis))
print(len(children))
print(len(is_smoker))
print(len(regions))
print(len(charges))

1338
1338
1338
1338
1338
1338
1338
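The seven `print` calls can also be condensed into a single comparison. The sketch below uses short stand-in lists (illustrative values, not the real data) and a hypothetical `columns` dictionary; with the notebook's own lists in their place, the same check would apply.

```python
# Stand-in lists; in the notebook these would be ages, sexes, bmis, and so on.
columns = {
    "ages": [19, 18, 28],
    "sexes": ["female", "male", "male"],
    "charges": [16884.92, 1725.55, 4449.46],
}

# Collect the distinct lengths; a single distinct value means all lists line up.
lengths = {len(values) for values in columns.values()}
all_same_length = len(lengths) == 1
print(all_same_length, lengths)
```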


In [5]:
# store the length for further use
ds_length = len(ages)
ds_length_range = range(ds_length)
ds_length_range

range(0, 1338)

All seven lists have 1338 items each, so every row was read into every list and all the numeric fields parsed without error. Perfect!
With the data imported, we can go ahead and analyze it.
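Equal list lengths don't rule out blank cells in the text fields: in the loader above, a blank numeric cell would raise a `ValueError` in `int()` or `float()`, but a blank `sex` or `region` would be appended silently as an empty string. A minimal sketch of an explicit check follows, run on a made-up in-memory sample rather than on `insurance.csv`; the name `blank_counts` is only illustrative.

```python
import csv
import io

# Made-up sample with one blank `region` cell; not taken from insurance.csv.
sample = io.StringIO(
    "age,sex,region\n"
    "19,female,southwest\n"
    "18,male,\n"
)

# Count empty cells per column.
blank_counts = {}
for row in csv.DictReader(sample):
    for field, value in row.items():
        if value is None or value.strip() == "":
            blank_counts[field] = blank_counts.get(field, 0) + 1

print(blank_counts)
```

An empty dictionary from a check like this would support the claim that no values are missing.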

## Analysis

### Sex Distribution

First, we'll check the sex distribution of our dataset. We can do this by counting how many `male`s and `female`s are present, as well as their percentages:

In [6]:
males = sexes.count('male')
females = sexes.count('female')

print(f"Number of males: {males} ({round(males/ds_length * 100, 2)}%)")
print(f"Number of females: {females} ({round(females/ds_length * 100, 2)}%)")

Number of males: 676 (50.52%)
Number of females: 662 (49.48%)


The dataset is balanced in terms of males and females, with only 14 more males present.
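As an alternative to calling `count` once per label, `collections.Counter` tallies every distinct value in one pass. This is a small sketch on a stand-in list (`sample_sexes` is hypothetical); in the notebook the `sexes` list would take its place.

```python
from collections import Counter

# Stand-in list; the notebook's `sexes` list would be used instead.
sample_sexes = ["male", "female", "male", "male"]

counts = Counter(sample_sexes)
total = sum(counts.values())

# Percentage share of each label, rounded to two decimals.
shares = {label: round(count / total * 100, 2) for label, count in counts.items()}
print(counts, shares)
```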

### Ages

Next, we'll get a feel for the ages of the people in the dataset by finding the youngest and oldest ages present, as well as the average and median ages:

In [7]:
sorted_ages = sorted(ages)

# lowest age
lowest_age = sorted_ages[0]
print(f"Lowest age: {lowest_age}")

# highest age
highest_age = sorted_ages[-1]
print(f"Highest age: {highest_age}")

# average
average_age = sum(ages) / ds_length

print(f"Average age: {round(average_age, 2)}")

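# median: with an even count, average the two middle values;
# with an odd count, take the single middle value
med_point = len(sorted_ages) // 2
if len(sorted_ages) % 2 == 0:
    median_age = (sorted_ages[med_point - 1] + sorted_ages[med_point]) / 2
else:
    median_age = sorted_ages[med_point]
print(f"Median age: {median_age}")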


Lowest age: 18
Highest age: 64
Average age: 39.21
Median age: 39.0
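The hand-written calculations can be cross-checked against Python's built-ins and the standard `statistics` module. The sketch below runs on a few illustrative ages (`sample_ages` is a stand-in, not the dataset); passing the notebook's `ages` list to the same functions should reproduce the figures above.

```python
import statistics

sample_ages = [18, 64, 39, 27, 45, 52]  # illustrative values only

lowest = min(sample_ages)
highest = max(sample_ages)
mean_age = statistics.mean(sample_ages)
median_age = statistics.median(sample_ages)  # averages the two middle values for an even count

print(lowest, highest, round(mean_age, 2), median_age)
```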


### Insurance costs by age group

We're going to use the following age ranges:
- 18 to 24 (range 1)
- 25 to 34 (range 2)
- 35 to 44 (range 3)
- 45 to 54 (range 4)
- 55 to 64 (range 5)

A new list, `age_group`, will be created in parallel with the others; each entry holds the number of the range that the individual's age falls into:

In [8]:
range_25_34 = range(25, 35)
range_35_44 = range(35, 45)
range_45_54 = range(45, 55)
# groups 1 and 5 need no range: they are covered by `age < 25` and the final `else`

age_group = []
group1_entries = 0
group2_entries = 0
group3_entries = 0
group4_entries = 0
group5_entries = 0
for age in ages:
    if age < 25:
        age_group.append(1)
        group1_entries += 1
    elif age in range_25_34:
        age_group.append(2)
        group2_entries += 1
    elif age in range_35_44:
        age_group.append(3)
        group3_entries += 1
    elif age in range_45_54:
        age_group.append(4)
        group4_entries += 1
    else:
        age_group.append(5)
        group5_entries += 1

print(f"""People in group 1: {group1_entries}
People in group 2: {group2_entries}
People in group 3: {group3_entries}
People in group 4: {group4_entries}
People in group 5: {group5_entries}""")

People in group 1: 278
People in group 2: 271
People in group 3: 260
People in group 4: 287
People in group 5: 242
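The same mapping from age to group can be written as a small lookup on the group boundaries. This is a sketch of an alternative, not the method used above; `GROUP_STARTS` and `age_to_group` are hypothetical names, and the boundaries are the ones listed for ranges 1 to 5.

```python
from bisect import bisect_right

# Lower bounds of groups 2 to 5; ages below 25 fall into group 1.
GROUP_STARTS = [25, 35, 45, 55]

def age_to_group(age):
    """Return the group number (1-5) for an age, using the notebook's ranges."""
    return bisect_right(GROUP_STARTS, age) + 1

# Check the edges of every range.
edge_groups = [age_to_group(age) for age in (18, 24, 25, 34, 35, 54, 55, 64)]
print(edge_groups)
```

With this helper, the list could be built as `age_group = [age_to_group(age) for age in ages]`, and the group sizes counted afterwards.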


With the age ranges determined, we can go ahead and tally up the insurance costs:

In [9]:
cost_range1 = 0.0
cost_range2 = 0.0
cost_range3 = 0.0
cost_range4 = 0.0
cost_range5 = 0.0

# note: the `match` statement requires Python 3.10 or later
for i in ds_length_range:
    match age_group[i]:
        case 1:
            cost_range1 += charges[i]
        case 2:
            cost_range2 += charges[i]
        case 3:
            cost_range3 += charges[i]
        case 4:
            cost_range4 += charges[i]
        case 5:
            cost_range5 += charges[i]

print(f"""Cost for range 18-24: {round(cost_range1, 2)}, average {round(cost_range1/group1_entries, 2)}
Cost for range 25-34: {round(cost_range2, 2)}, average {round(cost_range2/group2_entries, 2)}
Cost for range 35-44: {round(cost_range3, 2)}, average {round(cost_range3/group3_entries, 2)}
Cost for range 45-54: {round(cost_range4, 2)}, average {round(cost_range4/group4_entries, 2)}
Cost for range 55-64: {round(cost_range5, 2)}, average {round(cost_range5/group5_entries, 2)}
""")
    

Cost for range 18-24: 2505152.61, average 9011.34
Cost for range 25-34: 2805498.37, average 10352.39
Cost for range 35-44: 3414883.86, average 13134.17
Cost for range 45-54: 4550077.3, average 15853.93
Cost for range 55-64: 4480212.85, average 18513.28



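As would be expected, average insurance costs go up as age increases, even though the total for the 55-64 range is slightly lower than for the 45-54 range because that group has fewer people.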
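The per-group totals and averages can also be kept in dictionaries keyed by group number, which avoids one variable per group and works on Python versions without `match`. A minimal sketch on stand-in `(age_group, charge)` pairs follows; the values are made up for illustration.

```python
from collections import defaultdict

# Stand-in (age_group, charge) pairs; not taken from the dataset.
records = [(1, 1000.0), (1, 3000.0), (2, 5000.0), (5, 9000.0)]

totals = defaultdict(float)
counts = defaultdict(int)
for group, charge in records:
    totals[group] += charge
    counts[group] += 1

# Average charge per group, in group order.
averages = {group: totals[group] / counts[group] for group in sorted(totals)}
print(averages)
```

In the notebook, the pairs would come from `zip(age_group, charges)`.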

### Smokers by age group

Next, we'll count how many smokers there are in each age group; these counts are needed for the analysis that follows.

In [10]:
smokers_group1 = 0
smokers_group2 = 0
smokers_group3 = 0
smokers_group4 = 0
smokers_group5 = 0

for i in ds_length_range:
    if is_smoker[i]:
        match age_group[i]:
            case 1:
                smokers_group1 += 1
            case 2:
                smokers_group2 += 1
            case 3:
                smokers_group3 += 1
            case 4:
                smokers_group4 += 1
            case 5:
                smokers_group5 += 1

print(f"""Group 1 smokers: {smokers_group1}/{group1_entries}
Group 2 smokers: {smokers_group2}/{group2_entries}
Group 3 smokers: {smokers_group3}/{group3_entries}
Group 4 smokers: {smokers_group4}/{group4_entries}
Group 5 smokers: {smokers_group5}/{group5_entries}""")

Group 1 smokers: 60/278
Group 2 smokers: 56/271
Group 3 smokers: 61/260
Group 4 smokers: 55/287
Group 5 smokers: 42/242
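Since the groups differ in size, the raw counts are easier to compare as percentages. The sketch below copies the counts from the output above into a hypothetical `smokers_by_group` dictionary and computes the share of smokers in each group.

```python
# (smokers, group size) pairs copied from the output above.
smokers_by_group = {1: (60, 278), 2: (56, 271), 3: (61, 260), 4: (55, 287), 5: (42, 242)}

# Share of smokers in each group, as a percentage.
smoker_share = {
    group: round(smokers / size * 100, 1)
    for group, (smokers, size) in smokers_by_group.items()
}
print(smoker_share)
```

On these counts the share stays between roughly 17% and 24% in every group.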


### Influence of smoking on insurance costs

Since smoking is a risk factor, it can be assumed that smoking drives up an individual's insurance costs. Let's see by how much.


In [11]:
group1_smokers_cost = 0.0
group1_nosmokers_cost = 0.0
group2_smokers_cost = 0.0
group2_nosmokers_cost = 0.0
group3_smokers_cost = 0.0
group3_nosmokers_cost = 0.0
group4_smokers_cost = 0.0
group4_nosmokers_cost = 0.0
group5_smokers_cost = 0.0
group5_nosmokers_cost = 0.0

for i in ds_length_range:
    match age_group[i]:
        case 1:
            if is_smoker[i]:
                group1_smokers_cost += charges[i]
            else:
                group1_nosmokers_cost += charges[i]
        case 2:
            if is_smoker[i]:
                group2_smokers_cost += charges[i]
            else:
                group2_nosmokers_cost += charges[i]
        case 3:
            if is_smoker[i]:
                group3_smokers_cost += charges[i]
            else:
                group3_nosmokers_cost += charges[i]
        case 4:
            if is_smoker[i]:
                group4_smokers_cost += charges[i]
            else:
                group4_nosmokers_cost += charges[i]
        case 5:
            if is_smoker[i]:
                group5_smokers_cost += charges[i]
            else:
                group5_nosmokers_cost += charges[i]

print(f"""Group 1:
Smokers pay on average {round(group1_smokers_cost/smokers_group1,2)} while non-smokers pay on average {round(group1_nosmokers_cost/(group1_entries - smokers_group1),2)}

Group 2:
Smokers pay on average {round(group2_smokers_cost/smokers_group2,2)} while non-smokers pay on average {round(group2_nosmokers_cost/(group2_entries - smokers_group2),2)}

Group 3:
Smokers pay on average {round(group3_smokers_cost/smokers_group3,2)} while non-smokers pay on average {round(group3_nosmokers_cost/(group3_entries - smokers_group3),2)}

Group 4:
Smokers pay on average {round(group4_smokers_cost/smokers_group4,2)} while non-smokers pay on average {round(group4_nosmokers_cost/(group4_entries - smokers_group4),2)}

Group 5:
Smokers pay on average {round(group5_smokers_cost/smokers_group5,2)} while non-smokers pay on average {round(group5_nosmokers_cost/(group5_entries - smokers_group5),2)}""")

Group 1:
Smokers pay on average 27796.54 while non-smokers pay on average 3841.1

Group 2:
Smokers pay on average 28416.48 while non-smokers pay on average 5647.33

Group 3:
Smokers pay on average 31366.05 while non-smokers pay on average 7545.5

Group 4:
Smokers pay on average 35310.4 while non-smokers pay on average 11241.4

Group 5:
Smokers pay on average 39696.37 while non-smokers pay on average 14064.83
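To put a number on "by how much", the averages printed above can be compared directly, both as an absolute gap and as a ratio. The sketch below copies those averages into a hypothetical `avg_costs` dictionary; it is one possible way to read the figures, not part of the original analysis.

```python
# (smoker average, non-smoker average) per age group, copied from the output above.
avg_costs = {
    1: (27796.54, 3841.10),
    2: (28416.48, 5647.33),
    3: (31366.05, 7545.50),
    4: (35310.40, 11241.40),
    5: (39696.37, 14064.83),
}

# Absolute difference and ratio between smokers and non-smokers in each group.
gaps = {group: smoker - nonsmoker for group, (smoker, nonsmoker) in avg_costs.items()}
ratios = {group: smoker / nonsmoker for group, (smoker, nonsmoker) in avg_costs.items()}

for group in avg_costs:
    print(f"Group {group}: gap {gaps[group]:.2f}, ratio {ratios[group]:.2f}")
```

On these figures the ratio falls from about 7.2 in group 1 to about 2.8 in group 5, while the absolute gap stays between roughly 22,800 and 25,600.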


### Sex and smoking in insurance costs

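Next, we'll look at how sex and smoking together relate to insurance costs. Note that the cell below prints total charges for each subgroup, not averages, so the figures also reflect how many people fall into each subgroup.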

In [12]:
group1_female_smokers_cost = 0.0
group1_female_nosmokers_cost = 0.0
group2_female_smokers_cost = 0.0
group2_female_nosmokers_cost = 0.0
group3_female_smokers_cost = 0.0
group3_female_nosmokers_cost = 0.0
group4_female_smokers_cost = 0.0
group4_female_nosmokers_cost = 0.0
group5_female_smokers_cost = 0.0
group5_female_nosmokers_cost = 0.0
group1_male_smokers_cost = 0.0
group1_male_nosmokers_cost = 0.0
group2_male_smokers_cost = 0.0
group2_male_nosmokers_cost = 0.0
group3_male_smokers_cost = 0.0
group3_male_nosmokers_cost = 0.0
group4_male_smokers_cost = 0.0
group4_male_nosmokers_cost = 0.0
group5_male_smokers_cost = 0.0
group5_male_nosmokers_cost = 0.0


def fem_smoker_handler(age_group, cost):
    global group1_female_smokers_cost
    global group2_female_smokers_cost
    global group3_female_smokers_cost
    global group4_female_smokers_cost
    global group5_female_smokers_cost
    
    match age_group:
        case 1: 
            group1_female_smokers_cost += cost
        case 2:
            group2_female_smokers_cost += cost
        case 3:
            group3_female_smokers_cost += cost
        case 4:
            group4_female_smokers_cost += cost
        case 5:
            group5_female_smokers_cost += cost
                              
def fem_nonsmoker_handler(age_group, cost):
    global group1_female_nosmokers_cost
    global group2_female_nosmokers_cost
    global group3_female_nosmokers_cost
    global group4_female_nosmokers_cost
    global group5_female_nosmokers_cost

    match age_group:
        case 1: 
            group1_female_nosmokers_cost += cost
        case 2:
            group2_female_nosmokers_cost += cost
        case 3:
            group3_female_nosmokers_cost += cost
        case 4:
            group4_female_nosmokers_cost += cost
        case 5:
            group5_female_nosmokers_cost += cost

def male_smoker_handler(age_group, cost):
    global group1_male_smokers_cost
    global group2_male_smokers_cost
    global group3_male_smokers_cost
    global group4_male_smokers_cost
    global group5_male_smokers_cost

    match age_group:
        case 1: 
            group1_male_smokers_cost += cost
        case 2:
            group2_male_smokers_cost += cost
        case 3:
            group3_male_smokers_cost += cost
        case 4:
            group4_male_smokers_cost += cost
        case 5:
            group5_male_smokers_cost += cost

def male_nonsmoker_handler(age_group, cost):
    global group1_male_nosmokers_cost
    global group2_male_nosmokers_cost
    global group3_male_nosmokers_cost
    global group4_male_nosmokers_cost
    global group5_male_nosmokers_cost

    match age_group:
        case 1: 
            group1_male_nosmokers_cost += cost
        case 2:
            group2_male_nosmokers_cost += cost
        case 3:
            group3_male_nosmokers_cost += cost
        case 4:
            group4_male_nosmokers_cost += cost
        case 5:
            group5_male_nosmokers_cost += cost

for i in ds_length_range:
    if sexes[i] == "female":
        if is_smoker[i]:
            fem_smoker_handler(age_group[i], charges[i])
        else:
            fem_nonsmoker_handler(age_group[i], charges[i])
    else:
        if is_smoker[i]:
            male_smoker_handler(age_group[i], charges[i])
        else:
            male_nonsmoker_handler(age_group[i], charges[i])

print(f"Female smokers:\n{group1_female_smokers_cost}\n{group2_female_smokers_cost}\n{group3_female_smokers_cost}\n{group4_female_smokers_cost}\n{group5_female_smokers_cost}\nFemale non-smokers:\n{group1_female_nosmokers_cost}\n{group2_female_nosmokers_cost}\n{group3_female_nosmokers_cost}\n{group4_female_nosmokers_cost}\n{group5_female_nosmokers_cost}")
print(f"Male smokers:\n{group1_male_smokers_cost}\n{group2_male_smokers_cost}\n{group3_male_smokers_cost}\n{group4_male_smokers_cost}\n{group5_male_smokers_cost}\nMale non-smokers:\n{group1_male_nosmokers_cost}\n{group2_male_nosmokers_cost}\n{group3_male_nosmokers_cost}\n{group4_male_nosmokers_cost}\n{group5_male_nosmokers_cost}")

Female smokers:
695010.26888
598873.39427
842146.0512600001
715332.46436
676722.3929999999
Female non-smokers:
461405.52901900024
623888.6263999997
763247.1478600004
1429362.0065099997
1515073.3130599994
Male smokers:
972781.9965799998
992449.47106
1071182.7341100003
1226739.79288
990524.9554399999
Male non-smokers:
375954.8137399999
590286.88276
738307.92687
1178643.0372900001
1297892.1854100004
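The twenty global variables and four handler functions can be replaced by a single dictionary keyed by `(sex, smoker, age group)`. The sketch below is a minimal alternative under that assumption, run on made-up rows; tracking a count next to each total also makes per-subgroup averages available.

```python
from collections import defaultdict

# Stand-in rows of (sex, is_smoker, age_group, charge); values are made up.
rows = [
    ("female", True, 1, 20000.0),
    ("female", False, 1, 3000.0),
    ("male", True, 1, 30000.0),
    ("male", True, 1, 10000.0),
]

totals = defaultdict(float)
counts = defaultdict(int)
for sex, smoker, group, charge in rows:
    key = (sex, smoker, group)
    totals[key] += charge
    counts[key] += 1

# Print the total and the average for each subgroup.
for key in sorted(totals):
    print(key, totals[key], totals[key] / counts[key])
```

In the notebook, the rows would come from `zip(sexes, is_smoker, age_group, charges)`.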


### Charges by region

We'll find out which regions tend to pay more or less for insurance. First, we'll find out which values of `region` are present:

In [13]:
# find unique regions
print(set(regions))

{'southeast', 'northeast', 'northwest', 'southwest'}


The four regions present are `southeast`, `southwest`, `northeast`, and `northwest`. We'll be creating a dictionary to keep track of the ongoing costs for each region:

In [14]:
dict_charges_by_region = {
    'southeast_total_cost': 0.0,
    'southwest_total_cost': 0.0,
    'northeast_total_cost': 0.0,
    'northwest_total_cost': 0.0
}

Now we'll go through the dataset and tally up the charges for each region and see what they look like:

In [15]:
# Don't run this repeatedly without resetting the dictionary in the cell above!

for i in ds_length_range:
    dict_key = f'{regions[i]}_total_cost'

    dict_charges_by_region[dict_key] += charges[i]

print(f'''Southeast total charges: {round(dict_charges_by_region['southeast_total_cost'], 2)}
Southwest total charges: {round(dict_charges_by_region['southwest_total_cost'], 2)}
Northeast total charges: {round(dict_charges_by_region['northeast_total_cost'], 2)}
Northwest total charges: {round(dict_charges_by_region['northwest_total_cost'], 2)}''')

Southeast total charges: 5363689.76
Southwest total charges: 4012754.65
Northeast total charges: 4343668.58
Northwest total charges: 4035712.0
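The warning about rerunning the cell exists because the dictionary keeps its state between runs. One way around it, sketched here with a hypothetical `tally_by_region` function and stand-in data, is to build the totals inside a function so that every call starts from zero.

```python
def tally_by_region(regions, charges):
    """Return total charges per region, starting from zero on every call."""
    totals = {}
    for region, charge in zip(regions, charges):
        totals[region] = totals.get(region, 0.0) + charge
    return totals

# Stand-in data, not taken from the dataset.
sample_regions = ["southeast", "southwest", "southeast"]
sample_charges = [100.0, 50.0, 25.0]

first_run = tally_by_region(sample_regions, sample_charges)
second_run = tally_by_region(sample_regions, sample_charges)  # same result on a rerun
print(first_run, second_run)
```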


Southeast pays more overall, but that by itself doesn't tell us much. It could simply be a result of higher population. Let's see how many people in this dataset live in each region:

In [17]:
print(f'''Southeast inhabitants: {regions.count("southeast")}
Southwest inhabitants: {regions.count("southwest")}
Northeast inhabitants: {regions.count("northeast")}
Northwest inhabitants: {regions.count("northwest")}''')

Southeast inhabitants: 364
Southwest inhabitants: 325
Northeast inhabitants: 324
Northwest inhabitants: 325


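Indeed, the southeast has the most individuals in the dataset (364, against 324 or 325 in each of the other regions), so part of its higher total comes from its larger share of the sample.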
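Head counts alone don't settle the question; dividing each total by its count gives an average charge per person that can be compared across regions. The sketch below copies the totals and counts from the two outputs above into hypothetical dictionaries `region_totals` and `region_counts`.

```python
# Totals and head counts copied from the two outputs above.
region_totals = {
    "southeast": 5363689.76,
    "southwest": 4012754.65,
    "northeast": 4343668.58,
    "northwest": 4035712.0,
}
region_counts = {"southeast": 364, "southwest": 325, "northeast": 324, "northwest": 325}

# Average charge per person in each region.
region_averages = {
    region: round(region_totals[region] / region_counts[region], 2)
    for region in region_totals
}
print(region_averages)
```

On these figures the southeast still comes out highest, at roughly 14,735 per person, which suggests that its larger population does not fully explain its higher total.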