# U.S. Medical Insurance Costs

## Introduction

This project investigates a U.S. medical insurance cost dataset. Each row contains a person's age, sex, BMI, number of children, smoking status, region of residence, and insurance charges. The project analyzes the following:
1. The age group distribution
2. The region distribution
3. The BMI distribution
4. Does sex affect insurance charges?
5. Does having children increase the insurance charges?
6. Do smokers have higher insurance charges?

## Code

Import modules

In [1]:
import csv
from tabulate import tabulate
import numpy as np

Load CSV into an overall list

In [2]:
insurance_data = []
with open('insurance.csv', newline='') as insurance_csv:
    insurance_reader = csv.DictReader(insurance_csv, delimiter=',')
    for row in insurance_reader:
        insurance_data.append(row)

In [3]:
#print(insurance_data)
print(len(insurance_data))

1338


There are 1338 rows in the dataset, i.e. data for 1338 people.

Creating lists of different attributes for analysis

In [4]:
age_list = []
bmi_list = []
region_list = []

for row in insurance_data:
    age_list.append(int(row['age']))
    bmi_list.append(float(row['bmi']))
    region_list.append(row['region'])

In [5]:
#print(age_list)

### Investigation 1: The age group distribution

First, we classify the data into different age groups and count the frequencies:

19 and under (labelled 'Below 19'), 20-29, 30-39, 40-49, 50-59, and 60 and over (labelled 'Above 60')

In [6]:
# Create dictionary to store distribution
age_distribution = {'Below 19': 0,
                    '20-29': 0,
                    '30-39': 0,
                    '40-49': 0,
                    '50-59': 0,
                    'Above 60': 0
}

# Count the frequencies for each age group
for age in age_list:
    if age <= 19:
        age_distribution['Below 19'] += 1
    elif age >= 20 and age <= 29:
        age_distribution['20-29'] += 1
    elif age >= 30 and age <= 39:
        age_distribution['30-39'] += 1
    elif age >= 40 and age <= 49:
        age_distribution['40-49'] += 1
    elif age >= 50 and age <= 59:
        age_distribution['50-59'] += 1
    elif age >= 60:
        age_distribution['Above 60'] += 1

The distribution is as follows:

In [7]:
print_list = []
for k, v in age_distribution.items():
    print_list.append([k, v])
print(tabulate(print_list, headers=['Age Group', 'Count']))

Age Group      Count
-----------  -------
Below 19         137
20-29            280
30-39            257
40-49            279
50-59            271
Above 60         114


Other summary statistics for age:

In [8]:
# Convert age list into numpy array for analysis
age_np = np.array(age_list)

age_np_tabulate = [['mean', np.average(age_np)],
                   ['median', np.median(age_np)],
                   ['maximum', np.amax(age_np)],
                   ['minimum', np.amin(age_np)],
]
print(tabulate(age_np_tabulate))

-------  ------
mean     39.207
median   39
maximum  64
minimum  18
-------  ------


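Based on the above, the largest age group in this dataset is 20-29 (280 people), only just ahead of 40-49 (279), and the smallest is 'Above 60' (114). Note that the first and last groups are narrower than the others: with a minimum age of 18 and a maximum of 64, 'Below 19' covers only ages 18-19 and 'Above 60' covers only ages 60-64. The mean age is about 39.2 and the median is 39.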
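As an alternative to the `if`/`elif` chain above, the same grouping could be sketched with NumPy's `np.digitize` and `np.bincount`. This is only an illustration: `sample_ages`, `edges`, and `binned_distribution` are hypothetical names, and the sample values are made up rather than taken from the dataset.

```python
import numpy as np

# Hypothetical sample ages, used only to illustrate the binning.
sample_ages = [18, 19, 20, 29, 30, 45, 59, 60, 64]

# Lower bounds of every group after the first; ages below 20 fall in bin 0.
edges = [20, 30, 40, 50, 60]
labels = ['Below 19', '20-29', '30-39', '40-49', '50-59', 'Above 60']

# For each age, np.digitize returns the index of the bin it belongs to
# (bin i holds edges[i-1] <= age < edges[i]).
bin_indices = np.digitize(sample_ages, edges)

# Count how many ages landed in each bin; minlength keeps empty bins at 0.
counts = np.bincount(bin_indices, minlength=len(labels))

binned_distribution = dict(zip(labels, counts.tolist()))
print(binned_distribution)
```

Keeping the group boundaries in one list makes it easier to see exactly which ages each label covers.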

### Investigation 2: The region distribution

First, we check the unique values in the region column of the dataset:

In [9]:
unique_region_list = []
for region in region_list:
    if region not in unique_region_list:
        unique_region_list.append(region)
print(unique_region_list)

['southwest', 'southeast', 'northwest', 'northeast']


Then, we count the number of people from each region:

In [10]:
# Create dictionary to store distribution
region_distribution = {'northeast': 0,
                       'southeast': 0,
                       'southwest': 0,
                       'northwest': 0
}

# Count the frequencies for each region group
for region in region_list:
    region_distribution[region] += 1

The distribution is as follows:

In [11]:
print_list = []
for k, v in region_distribution.items():
    print_list.append([k, v])
print(tabulate(print_list, headers=['Region', 'Count']))

Region       Count
---------  -------
northeast      324
southeast      364
southwest      325
northwest      325


According to the table above, the southeast has the most people (364). The northeast, southwest, and northwest have almost identical counts (324, 325, and 325).
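The unique-value check and the counting loop could also be combined in one step with `collections.Counter` from the standard library. The sketch below uses a small made-up list, `sample_regions`, in place of the notebook's `region_list`.

```python
from collections import Counter

# Hypothetical sample values standing in for region_list.
sample_regions = ['southwest', 'southeast', 'southeast',
                  'northwest', 'northeast', 'southeast']

# Counter discovers the unique values and tallies them in a single pass.
region_counts = Counter(sample_regions)

# most_common() yields (region, count) pairs, largest count first.
for region, count in region_counts.most_common():
    print(region, count)
```

With this approach there is no need to list the region names in advance, so an unexpected value in the column would show up in the counts instead of raising a `KeyError`.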

### Investigation 3: The BMI distribution

According to the Centers for Disease Control and Prevention (CDC) website, the BMI classification is as follows:

BMI < 18.5 - underweight
18.5 <= BMI < 25.0 - normal
25.0 <= BMI < 30.0 - overweight
BMI >= 30.0 - obese

In [12]:
# Create dictionary to store distribution
bmi_distribution = {'underweight': 0,
                    'normal': 0,
                    'overweight': 0,
                    'obese': 0
}

# Count the frequencies for each BMI group
for bmi in bmi_list:
    if bmi < 18.5:
        bmi_distribution['underweight'] += 1
    elif bmi < 25.0:
        bmi_distribution['normal'] += 1
    elif bmi < 30.0:
        bmi_distribution['overweight'] += 1
    elif bmi >= 30.0:
        bmi_distribution['obese'] += 1

The distribution is as follows:

In [13]:
print_list = []
for k, v in bmi_distribution.items():
    print_list.append([k, v])
print(tabulate(print_list, headers=['BMI Classification', 'Count']))

BMI Classification      Count
--------------------  -------
underweight                20
normal                    225
overweight                386
obese                     707


According to the table above, 707 people fall into the obese category, which is more than half (about 53%) of the people in this dataset.
225 people belong to the normal category, which is roughly 17% of the people in the dataset.
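The percentages quoted above can be checked directly from the counts in the table. The snippet below is a small worked calculation; `bmi_counts` simply restates the table, and `bmi_shares` is a hypothetical name for the result.

```python
# Counts copied from the BMI table above.
bmi_counts = {'underweight': 20, 'normal': 225, 'overweight': 386, 'obese': 707}

# The total should match the number of rows in the dataset.
total = sum(bmi_counts.values())

# Share of each category as a percentage of the total, rounded to one decimal.
bmi_shares = {category: round(100 * count / total, 1)
              for category, count in bmi_counts.items()}

print(total)
for category, share in bmi_shares.items():
    print(category, share)
```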

### Investigation 4: Does sex affect insurance charges?

For this question, we can compare the mean and median insurance charges for males and females.

In [14]:
male_charges = []
female_charges = []

# sort the charges into male and female lists
for row in insurance_data:
    if row['sex'] == 'male':
        male_charges.append(float(row['charges']))
    elif row['sex'] == 'female':
        female_charges.append(float(row['charges']))

# convert lists into numpy arrays for analysis
male_charges_np = np.array(male_charges)
female_charges_np = np.array(female_charges)

# calculate mean and median of male and female insurance charges
male_mean = np.average(male_charges_np)
female_mean = np.average(female_charges_np)
male_median = np.median(male_charges_np)
female_median = np.median(female_charges_np)

Display results:

In [15]:
print("Average male insurance charges: " + str(male_mean))
print("Average female insurance charges: " + str(female_mean))
print("Median male insurance charges: " + str(male_median))
print("Median female insurance charges: " + str(female_median))

Average male insurance charges: 13956.751177721893
Average female insurance charges: 12569.578843835347
Median male insurance charges: 9369.61575
Median female insurance charges: 9412.9625


The mean insurance charge for males is higher than that for females, but the median for males is slightly lower. Because the two measures point in opposite directions, we cannot conclude from these statistics alone that sex affects insurance charges.
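One way a group can have a higher mean but a lower median is when a few very large values pull its mean upward. Whether that is what happens in this dataset has not been checked here. The sketch below only demonstrates the effect with made-up numbers (`group_a` and `group_b` are hypothetical).

```python
import numpy as np

# Hypothetical charges: group_a contains one very large value.
group_a = np.array([1000.0, 2000.0, 3000.0, 4000.0, 50000.0])
group_b = np.array([1500.0, 2500.0, 3500.0, 4500.0, 5500.0])

# The single large value raises group_a's mean well above group_b's...
print("mean a:", np.mean(group_a), "mean b:", np.mean(group_b))

# ...while the median, which ignores extremes, is lower for group_a.
print("median a:", np.median(group_a), "median b:", np.median(group_b))
```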

### Investigation 5: Does having children increase the insurance charges?

For this question, we can compare the mean and median insurance charges for people with and without children.

In [16]:
no_children_charges = []
have_children_charges = []

# sort the charges into no children and with children lists
for row in insurance_data:
    if row['children'] == '0':
        no_children_charges.append(float(row['charges']))
    else:
        have_children_charges.append(float(row['charges']))

# convert lists into numpy arrays for analysis
no_children_charges_np = np.array(no_children_charges)
children_charges_np = np.array(have_children_charges)

# calculate mean and median of no children and have children insurance charges
no_children_mean = np.average(no_children_charges_np)
have_children_mean = np.average(children_charges_np)
no_children_median = np.median(no_children_charges_np)
have_children_median = np.median(children_charges_np)

Display results:

In [17]:
print("Average insurance charges without children: " + str(no_children_mean))
print("Average insurance charges with children: " + str(have_children_mean))
print("Median insurance charges without children: " + str(no_children_median))
print("Median insurance charges with children: " + str(have_children_median))

Average insurance charges without children: 12365.97560163589
Average insurance charges with children: 13949.941093481675
Median insurance charges without children: 9856.9519
Median insurance charges with children: 9223.8295


Based on the results above, the mean insurance charge for people without children is lower than for people with children, but the median for people without children is higher. We cannot conclude from these statistics alone that having children increases insurance charges.

### Investigation 6: Do smokers have higher insurance charges?

For this question, we can compare the mean and median insurance charges for smokers and non-smokers.

In [18]:
smoker_charges = []
non_smoker_charges = []

# sort the charges into smoker and non-smoker lists
for row in insurance_data:
    if row['smoker'] == 'yes':
        smoker_charges.append(float(row['charges']))
    elif row['smoker'] == 'no':
        non_smoker_charges.append(float(row['charges']))

# convert lists into numpy arrays for analysis
smoker_charges_np = np.array(smoker_charges)
non_smoker_charges_np = np.array(non_smoker_charges)

# calculate mean and median of smokers and non-smokers insurance charges
smoker_mean = np.average(smoker_charges_np)
non_smoker_mean = np.average(non_smoker_charges_np)
smoker_median = np.median(smoker_charges_np)
non_smoker_median = np.median(non_smoker_charges_np)

Display results:

In [19]:
print("Average smoker insurance charges: " + str(smoker_mean))
print("Average non-smoker insurance charges: " + str(non_smoker_mean))
print("Median smoker insurance charges: " + str(smoker_median))
print("Median non-smoker insurance charges: " + str(non_smoker_median))

Average smoker insurance charges: 32050.23183153284
Average non-smoker insurance charges: 8434.268297856204
Median smoker insurance charges: 34456.348450000005
Median non-smoker insurance charges: 7345.4053


Based on the results above, both the mean and the median insurance charges for smokers are far higher than those for non-smokers (a mean of about 32,050 versus about 8,434). We can conclude that, in this dataset, smokers have higher insurance charges.
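Investigations 4 to 6 repeat the same pattern: split the charges by the value of one column, then compute the mean and median of each group. A possible refactoring is sketched below. The helper `summarize_charges_by` is hypothetical (it is not part of the original notebook), and `sample_rows` is made-up data shaped like the rows that `csv.DictReader` produces.

```python
import numpy as np

def summarize_charges_by(rows, column):
    """Group 'charges' by the value of `column`; return mean, median, count."""
    groups = {}
    for row in rows:
        # DictReader yields strings, so convert charges to float here.
        groups.setdefault(row[column], []).append(float(row['charges']))
    return {
        value: {
            'mean': float(np.mean(charges)),
            'median': float(np.median(charges)),
            'count': len(charges),
        }
        for value, charges in groups.items()
    }

# Hypothetical rows, used only to illustrate the helper.
sample_rows = [
    {'smoker': 'yes', 'charges': '30000.0'},
    {'smoker': 'yes', 'charges': '34000.0'},
    {'smoker': 'no', 'charges': '7000.0'},
    {'smoker': 'no', 'charges': '8000.0'},
    {'smoker': 'no', 'charges': '12000.0'},
]

summary = summarize_charges_by(sample_rows, 'smoker')
for value, stats in summary.items():
    print(value, stats)
```

Under the same assumptions, calling the helper with `insurance_data` and `'sex'` or `'smoker'` should give the same groups as the cells above. For `'children'` it would group by the exact number of children rather than by zero versus one or more.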