# U.S. Medical Insurance Costs

The goal of this project is to analyse medical insurance cost data for people in the US.

This project will scope, analyse and explain the different findings from the data.

Here are a few questions that this project has sought to answer:

- What is the distribution of individual insurance costs?
- What variables affect the insurance costs?
- What is the average insurance cost for a US citizen?
- How much do the insurance costs stray from the average based on the lifestyle and other variables of the individual?

**Data sources:**

The data file `insurance.csv` was provided by [Codecademy.com](https://www.codecademy.com).

Note: The data for this project is *inspired* by real data, but is mostly fictional.

## Scope

### Project Goals

This project takes the perspective of an analyst exploring medical insurance charges in the US. The main objectives are to understand the characteristics of the individuals in the dataset and how those characteristics relate to their insurance costs. Some questions that are posed:

- What is the distribution of males and females?
- Are certain variables likely to affect insurance costs more?
- What is the average insurance cost for a US citizen?
- Does the age affect the insurance cost?

### Data

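The `csv` file provided has information about medical insurance costs for individuals in the US. The code in this project reads the columns `age`, `sex`, `bmi`, `children`, `smoker`, `region` and `charges`.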

### Analysis

In this section, descriptive statistics are used to understand the data better. Some of the key metrics that will be computed include:

1. Distributions
1. Counts
1. Averages and ranges

### Evaluation




## Import Python Modules

First, import the primary modules that will be used in this project:

In [1]:
import csv
from collections import Counter

### Define the filename variable

In [2]:
filename = 'insurance.csv'

## Define Functions to be used

- **calculate_average_age** - returns the average age of the individuals in the `csv` and the average insurance cost for people below and above that age
- **calculate_region** - returns the number of people in each region
- **average_cost_by_sex** - returns the average insurance cost for males and females and the number of people of each sex
- **smoker_average** - returns the average cost for smokers and non-smokers and the counts for both
- **range_cost** - returns the highest and lowest insurance costs and the difference between them
- **calculate_average_cost** - returns the average insurance cost of all individuals
- **average_cost_parents** - returns the average insurance cost for parents vs non-parents
- **bmi_average** - returns the average BMI of the individuals and the average cost of insurance for people below and above the average BMI
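
Each function listed above opens and parses the file on its own. As a possible alternative, the rows could be read once and converted to typed values up front. The sketch below is only an illustration: `load_rows` and the two sample rows are hypothetical and not part of the original project, and the column names are taken from the code in this notebook.

```python
import csv
import io


def load_rows(csvfile):
    """Read an open CSV file object into a list of dicts with typed fields."""
    rows = []
    for row in csv.DictReader(csvfile):
        rows.append({
            "age": int(row["age"]),
            "sex": row["sex"],
            "bmi": float(row["bmi"]),
            "children": int(row["children"]),
            "smoker": row["smoker"],
            "region": row["region"],
            "charges": float(row["charges"]),
        })
    return rows


# Two made-up rows with the same header, used only for illustration.
SAMPLE = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,male,25.0,0,no,southwest,2000.0\n"
    "50,female,32.0,2,yes,southeast,30000.0\n"
)
sample_rows = load_rows(SAMPLE)
print(len(sample_rows), sample_rows[0]["charges"])
```

With the real data, the same function could be called as `load_rows(open(filename, newline=''))` inside a `with` block.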


In [3]:
def calculate_average_age(filename):
    total_age = 0
    count = 0
    lower_age_count = 0
    higher_age_count = 0
    lower_age_cost = 0
    higher_age_cost = 0

    with open(filename, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            total_age += int(row['age'])
            count += 1

    if count == 0:
        return 0, 0, 0

    average_age = total_age / count

    with open(filename, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            age = int(row['age'])
            charges = float(row['charges'])
            if age < average_age:
                lower_age_cost += charges
                lower_age_count += 1
            elif age > average_age:
                higher_age_cost += charges
                higher_age_count += 1

    if lower_age_count == 0:
        lower_age_avg = 0
    else:
        lower_age_avg = round(lower_age_cost / lower_age_count, 1)

    if higher_age_count == 0:
        higher_age_avg = 0
    else:
        higher_age_avg = round(higher_age_cost / higher_age_count, 1)

    return int(average_age), lower_age_avg, higher_age_avg

In [4]:
def calculate_region(filename):
    region_counter = Counter()

    with open(filename, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            region = row['region']
            region_counter[region] += 1

    return region_counter

In [5]:
def average_cost_by_sex(filename):
    male_total_cost = 0
    female_total_cost = 0
    male_count = 0
    female_count = 0
    sex_counter = Counter()

    with open(filename, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            sex = row['sex']
            sex_counter[sex] += 1
            if sex == 'male':
                male_total_cost += float(row['charges'])
                male_count += 1
            else:
                female_total_cost += float(row['charges'])
                female_count += 1

    if male_count == 0:
        male_average_cost = 0
    else:
        male_average_cost = round(male_total_cost / male_count, 1)

    if female_count == 0:
        female_average_cost = 0
    else:
        female_average_cost = round(female_total_cost / female_count, 1)

    return male_average_cost, female_average_cost, sex_counter['male'], sex_counter['female']


In [6]:
def smoker_average(filename):
    smoker_total_cost = 0
    non_smoker_total_cost = 0
    smoker_count = 0
    non_smoker_count = 0
    smoker_counter = Counter()

    with open(filename, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            smoker = row['smoker']
            smoker_counter[smoker] += 1
            if smoker == 'yes':
                smoker_total_cost += float(row['charges'])
                smoker_count += 1
            else:
                non_smoker_total_cost += float(row['charges'])
                non_smoker_count += 1

    if smoker_count == 0:
        smoker_average_cost = 0
    else:
        smoker_average_cost = round(smoker_total_cost / smoker_count, 1)

    if non_smoker_count == 0:
        non_smoker_average_cost = 0
    else:
        non_smoker_average_cost = round(non_smoker_total_cost / non_smoker_count, 1)

    return smoker_average_cost, non_smoker_average_cost, smoker_counter['yes'], smoker_counter['no']

In [7]:
def range_cost(filename):
    insurance_costs = []
    with open(filename, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            insurance_costs.append(float(row['charges']))

    if not insurance_costs:
        # Match the three-value shape of the normal return.
        return 0, 0, 0

    insurance_costs_range = max(insurance_costs) - min(insurance_costs)
    return max(insurance_costs), min(insurance_costs), round(insurance_costs_range,1)

In [8]:
def calculate_average_cost(filename):
    total_cost = 0
    count = 0
    with open(filename, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            total_cost += float(row['charges'])
            count += 1

    if count == 0:
        return 0

    average_cost = round(total_cost / count, 1)
    return average_cost

In [9]:
def average_cost_parents(filename):
    parent_total_cost = 0
    non_parent_total_cost = 0
    parent_count = 0
    non_parent_count = 0
    parent_counter = Counter()

    with open(filename, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            parent = int(row['children'])
            parent_counter[parent] += 1
            if parent != 0:
                parent_total_cost += float(row['charges'])
                parent_count += 1
            else:
                non_parent_total_cost += float(row['charges'])
                non_parent_count += 1

    # Compute each average independently so both names are always defined.
    if parent_count == 0:
        parent_average_cost = 0
    else:
        parent_average_cost = round(parent_total_cost / parent_count, 1)

    if non_parent_count == 0:
        non_parent_average_cost = 0
    else:
        non_parent_average_cost = round(non_parent_total_cost / non_parent_count, 1)

    return parent_average_cost, non_parent_average_cost

In [10]:
def bmi_average(filename):
    total_bmi = 0
    count = 0
    lower_bmi_count = 0
    higher_bmi_count = 0
    lower_bmi_cost = 0
    higher_bmi_cost = 0

    with open(filename, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            total_bmi += float(row['bmi'])
            count += 1

    if count == 0:
        return 0, 0, 0

    average_bmi = total_bmi / count

    with open(filename, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            bmi = float(row['bmi'])
            charges = float(row['charges'])
            if bmi < average_bmi:
                lower_bmi_cost += charges
                lower_bmi_count += 1
            elif bmi > average_bmi:
                higher_bmi_cost += charges
                higher_bmi_count += 1

    if lower_bmi_count == 0:
        lower_bmi_avg = 0
    else:
        lower_bmi_avg = round(lower_bmi_cost / lower_bmi_count, 1)

    if higher_bmi_count == 0:
        higher_bmi_avg = 0
    else:
        higher_bmi_avg = round(higher_bmi_cost / higher_bmi_count, 1)

    return round(average_bmi,2), lower_bmi_avg, higher_bmi_avg
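
Several of the functions above follow the same pattern: put each row into a group, sum the charges per group, then divide by the group size. A minimal sketch of that shared pattern is shown below. The helper name `average_charges_by` and the example rows are hypothetical; this is one way the repetition might be reduced, not the approach the project itself takes.

```python
from collections import defaultdict


def average_charges_by(rows, key):
    """Return {group: average charge}, where key(row) names each row's group."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = key(row)
        totals[group] += float(row["charges"])
        counts[group] += 1
    # Every group in totals has at least one row, so no division by zero.
    return {group: round(totals[group] / counts[group], 1) for group in totals}


# Made-up rows shaped like csv.DictReader output (all values are strings).
example_rows = [
    {"smoker": "yes", "children": "2", "charges": "30000.0"},
    {"smoker": "no", "children": "0", "charges": "2000.0"},
    {"smoker": "no", "children": "1", "charges": "4000.0"},
]
by_smoker = average_charges_by(example_rows, lambda r: r["smoker"])
by_parent = average_charges_by(example_rows, lambda r: int(r["children"]) > 0)
print(by_smoker)
print(by_parent)
```

Passing a different `key` function gives the split by sex, region, smoker status or parenthood without writing a new loop each time.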

## Printing out results

### Printing out the average insurance cost

In [11]:
print(f"The average insurance charge is {calculate_average_cost(filename)}$")

The average insurance charge is 13270.4$


### Range in insurance costs

In [12]:
max_insurance, min_insurance, insurance_value = range_cost(filename)
print(f"The highest insurance cost is: {max_insurance}$")
print(f"The lowest insurance cost is: {min_insurance}$")
print(f"Range in insurance: {insurance_value}$")

The highest insurance cost is: 63770.42801$
The lowest insurance cost is: 1121.8739$
Range in insurance: 62648.6$
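
The mean and the range say little about the shape of the distribution, which is the first question posed in the introduction. A hedged sketch of one way to look further is to compare the mean with the median and quartiles, using the standard-library `statistics` module (`statistics.quantiles` requires Python 3.8 or later). The function name `summarise_charges` and the sample values are made up for illustration.

```python
import statistics


def summarise_charges(charges):
    """Return mean, median and quartiles for a list of at least two charges."""
    values = sorted(charges)
    # quantiles(n=4) returns the three cut points between the four quarters.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
    }


# Made-up charges with one very large value, for illustration only.
made_up_charges = [1000.0, 2000.0, 3000.0, 4000.0, 50000.0]
summary = summarise_charges(made_up_charges)
print(summary)
```

A mean well above the median, as in this made-up sample, suggests a distribution with a long right tail, where a few high charges pull the average up.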


### Printing out the average cost and number of people for each sex

This function shows that there are slightly more men than women in the dataset.
The average insurance charge is higher for males than for females.

In [13]:
average_cost_male, average_cost_female, num_male, num_female = average_cost_by_sex(filename)
print(f"Average cost of insurance for males: {average_cost_male}$")
print(f"Average cost of insurance for females: {average_cost_female}$")
print(f"Number of males: {num_male}")
print(f"Number of females: {num_female}")

Average cost of insurance for males: 13956.8$
Average cost of insurance for females: 12569.6$
Number of males: 676
Number of females: 662


### Printing out the number of people belonging to each region
This function shows the number of individuals in the dataset from each region. The southeast has the most individuals. The other three regions are virtually the same size, with the northeast having just one fewer person than the southwest and northwest.

In [14]:
region_counts = calculate_region(filename)
for region, count in region_counts.items():
    print(f"{region}: {count}")

southwest: 325
southeast: 364
northwest: 325
northeast: 324
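
Raw counts are easier to compare as shares of the total. The sketch below converts the counts printed above into percentages. The helper `shares` is a hypothetical name, and the counts are copied from the output above rather than read from the file.

```python
from collections import Counter


def shares(counter):
    """Convert a Counter of counts into percentages rounded to one decimal."""
    total = sum(counter.values())
    if total == 0:
        return {}
    return {key: round(100 * value / total, 1) for key, value in counter.items()}


# Counts copied from the printed output above.
printed_counts = Counter(
    {"southwest": 325, "southeast": 364, "northwest": 325, "northeast": 324}
)
region_shares = shares(printed_counts)
for region, share in region_shares.items():
    print(f"{region}: {share}%")
```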


### Printing out average age
This function shows the average age of the individuals and the difference in average cost between people above and below that age.
The results show that the average insurance cost is higher for people older than the average age.

In [15]:
avg_age, lower_age_avg_cost, higher_age_avg_cost = calculate_average_age(filename)
print(f"The average age is: {avg_age} years")
print(f"The insurance cost for people younger than the average is: {lower_age_avg_cost}$")
print(f"The insurance cost for people older than the average is: {higher_age_avg_cost}$")

The average age is: 39 years
The insurance cost for people younger than the average is: 10157.2$
The insurance cost for people older than the average is: 16430.5$
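
Splitting at the average age gives only two groups. As a possible complement, a Pearson correlation coefficient summarises the linear relationship between age and charges in a single number between -1 and 1. The sketch below is an assumption about how this could be done with the standard library only. The function `pearson` and the sample values are illustrative and were not run on the project data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Made-up values where charges rise exactly in step with age.
made_up_ages = [20, 30, 40, 50]
made_up_costs = [2000.0, 3000.0, 4000.0, 5000.0]
r = pearson(made_up_ages, made_up_costs)
print(round(r, 3))
```

A value near 1 would indicate that charges tend to rise with age, a value near 0 would indicate little linear relationship.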


### Printing out the average cost and number of smokers and non-smokers

The function prints out the average insurance cost for smokers and non-smokers, showing that the average is higher for smokers than for non-smokers.

In [16]:
average_cost_smoker, average_cost_non_smoker, num_smokers, num_non_smokers = smoker_average(filename)
print(f"Average cost of insurance for smokers: {average_cost_smoker}$")
print(f"Average cost of insurance for non-smokers: {average_cost_non_smoker}$")
print(f"Number of smokers: {num_smokers}")
print(f"Number of non-smokers: {num_non_smokers}")

Average cost of insurance for smokers: 32050.2$
Average cost of insurance for non-smokers: 8434.3$
Number of smokers: 274
Number of non-smokers: 1064


### Printing out the average cost for parents vs non-parents

This function shows that the average cost for people with at least one child is higher than the average cost for people with no children.

In [17]:
average_cost_parent, average_cost_non_parent = average_cost_parents(filename)
print(f"Average cost of insurance for parents is: {average_cost_parent}$")
print(f"Average cost of insurance for non-parents is: {average_cost_non_parent}$")

Average cost of insurance for parents is: 13949.9$
Average cost of insurance for non-parents is: 12366.0$


### Printing out average BMI and charges for people above and below the average BMI

This function returns the average BMI of the individuals in the dataset as well as the average insurance cost of those above and below the average BMI.
The results show that individuals with a BMI above the average have a higher average insurance charge than those below it.
This is plausible, since a BMI of 30 or more is generally classed as obese and the average of 30.66 already falls in that range, so the higher-BMI group is likely to have more health problems.

In [18]:
avg_bmi, lower_bmi_avg_cost, higher_bmi_avg_cost = bmi_average(filename)
print(f"The average bmi is: {avg_bmi}")
print(f"The lower bmi average charge is: {lower_bmi_avg_cost}$")
print(f"The higher bmi average charge is: {higher_bmi_avg_cost}$")

The average bmi is: 30.66
The lower bmi average charge is: 10907.3$
The higher bmi average charge is: 15801.8$
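
The comparison above splits the data at the dataset average. A different split, sketched below as an assumption rather than something the project does, would use the commonly cited adult BMI bands (under 18.5 underweight, 18.5 to under 25 normal, 25 to under 30 overweight, 30 and above obese). The function name `bmi_category` is hypothetical.

```python
def bmi_category(bmi):
    """Map an adult BMI value to a commonly used weight category."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"


# The dataset average printed above falls in the highest band.
print(bmi_category(30.66))
```

Grouping charges by these categories instead of by above or below average would show whether costs rise steadily across the bands or jump at a particular threshold.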


## Conclusions

This project was able to answer some of the questions posed at the beginning:

- What is the distribution of males and females?
    - The number of males is slightly higher than the number of females (676 vs 662).
- Are certain variables likely to affect insurance costs more?
    - Yes. Every variable examined showed a difference in average cost, but smoking showed the largest: the non-smoker average is only about 26% of the average cost for smokers.
- What is the average insurance cost for a US citizen?
    - The average insurance cost in the dataset is 13270.4$
- Does the age affect the insurance cost?
    - Yes, on average older people tend to have a greater insurance cost than younger people.
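
The comparison between variables can be made explicit by dividing the higher group average by the lower one for each split. The sketch below does this with the averages printed earlier in this notebook. The dictionary name `printed_averages` is illustrative, and these are simple ratios of group means that do not control for the other variables.

```python
# (higher group average, lower group average), copied from the outputs above.
printed_averages = {
    "smoker": (32050.2, 8434.3),
    "age": (16430.5, 10157.2),
    "bmi": (15801.8, 10907.3),
    "sex": (13956.8, 12569.6),
    "children": (13949.9, 12366.0),
}

# Ratio of the higher average to the lower average for each split.
ratios = {name: high / low for name, (high, low) in printed_averages.items()}
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {ratio:.2f}x")

# Non-smoker average as a share of the smoker average.
non_smoker_share = 8434.3 / 32050.2
print(f"Non-smoker average is {non_smoker_share:.0%} of the smoker average")
```

On these figures the smoker split has by far the largest ratio, which is what the conclusion about smoking rests on.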