# U.S. Medical Insurance Costs

## 1. Introduction 

 - **Purpose:** To investigate a medical insurance costs dataset in a .csv file using Python fundamentals only. This is part of my Machine Learning journey and serves as an opportunity to put my Python knowledge to the test. While I will perform basic statistical analysis on the dataset to gain more insight, the core of the project is based on my question below:
 - **Question:** Do smokers from this dataset have higher average medical insurance cost than non-smokers?

 - **Dataset Description:** The data is originally from [Kaggle](https://www.kaggle.com/datasets/mirichoi0218/insurance), but has been prepared by Codecademy in a .csv file. Variables include age, sex, body mass index (BMI), number of children, smoking status, region, and charges.

"Note: One of the columns contains BMI data. While insurance companies do use BMI in their calculations, and that is reflected in this project, BMI is not necessarily an accurate predictor of health. As data scientists, we should always be skeptical of quantitative measures like BMI that reduce complex phenomena to a single number."


## 2. Data Loading

I am loading the data with the csv module's DictReader and storing each row in a list.
The result is a list of dictionaries, where each dictionary represents the data for one individual. Having each record as a dictionary makes the analysis easier later on.
Note that csv.DictReader automatically uses the column names from the header row as the dictionary keys, and that every value is read as a string.

In [1]:
# Data Loading code
import csv

dataset_storage = []
with open('insurance.csv', newline='') as insurance_file:
    csv_reader = csv.DictReader(insurance_file, delimiter=',')
    for row in csv_reader:
        dataset_storage.append(row)
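To illustrate what `csv.DictReader` produces, here is a minimal sketch that reads two records of the dataset from an in-memory string instead of `insurance.csv`. The names `sample_csv` and `rows` are made up for this illustration and are not used elsewhere in the notebook.

```python
import csv
import io

# Header row plus two records from the dataset, held in memory
# so the sketch runs without the insurance.csv file.
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
    "18,male,33.77,1,no,southeast,1725.5523\n"
)

rows = list(csv.DictReader(sample_csv))

# The keys of every row come from the header line.
print(list(rows[0].keys()))

# Every value is a string, so numbers must be converted before doing math.
print(type(rows[0]["age"]))
```

Because the values arrive as strings, numeric columns such as `age`, `bmi`, and `charges` need an explicit `int()` or `float()` conversion before any calculation.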

## 3. Initial Data Inspection

This is what the data looks like and its size.

In [2]:
# View 10 rows from the dataset
for i in range(10):
    print(dataset_storage[i])

# This is the size of the data
print(f"Size: {len(dataset_storage)}")

{'age': '19', 'sex': 'female', 'bmi': '27.9', 'children': '0', 'smoker': 'yes', 'region': 'southwest', 'charges': '16884.924'}
{'age': '18', 'sex': 'male', 'bmi': '33.77', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '1725.5523'}
{'age': '28', 'sex': 'male', 'bmi': '33', 'children': '3', 'smoker': 'no', 'region': 'southeast', 'charges': '4449.462'}
{'age': '33', 'sex': 'male', 'bmi': '22.705', 'children': '0', 'smoker': 'no', 'region': 'northwest', 'charges': '21984.47061'}
{'age': '32', 'sex': 'male', 'bmi': '28.88', 'children': '0', 'smoker': 'no', 'region': 'northwest', 'charges': '3866.8552'}
{'age': '31', 'sex': 'female', 'bmi': '25.74', 'children': '0', 'smoker': 'no', 'region': 'southeast', 'charges': '3756.6216'}
{'age': '46', 'sex': 'female', 'bmi': '33.44', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '8240.5896'}
{'age': '37', 'sex': 'female', 'bmi': '27.74', 'children': '3', 'smoker': 'no', 'region': 'northwest', 'charges': '7281.

## 4. Data Preparation

I am preparing my data by storing relevant columns in their respective variables. This will make it easy to perform statistical analysis on the data.


In [3]:
ages = [int(insurance_data_dict["age"]) for insurance_data_dict in dataset_storage]
bmis = [float(insurance_data_dict["bmi"]) for insurance_data_dict in dataset_storage]
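The two comprehensions above follow the same pattern: pick a key, then convert the string value. As a sketch only, that pattern could be wrapped in a small helper. `extract_column` and the `sample_*` names are hypothetical and not part of the original notebook.

```python
def extract_column(records, key, cast):
    """Return one column of the dataset as a list, converting each value with cast."""
    return [cast(record[key]) for record in records]


# Three records from the dataset, trimmed to the fields used here.
sample_records = [
    {"age": "19", "bmi": "27.9", "charges": "16884.924"},
    {"age": "18", "bmi": "33.77", "charges": "1725.5523"},
    {"age": "28", "bmi": "33", "charges": "4449.462"},
]

sample_ages = extract_column(sample_records, "age", int)
sample_bmis = extract_column(sample_records, "bmi", float)
sample_charges = extract_column(sample_records, "charges", float)

print(sample_ages)
print(sample_bmis)
```

With the full dataset, the same call would look like `extract_column(dataset_storage, "charges", float)`.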

## 5. Data Analysis

### 1. Age Summary Statistics Calculation

This analysis will discover the average, standard deviation, max, and min of the ages of the individuals in the dataset. This is important for understanding the basic demographic profile of the dataset.

In [4]:
import math
def average_age(list_of_ages):
    age_sum = 0
    for age in list_of_ages:
        age_sum += age
    average = age_sum / len(list_of_ages)
    return average

def standard_dev(average, list_of_ages):
    sum_of_squared_diff = 0
    for age in list_of_ages:
        sum_of_squared_diff += (average - age)**2
    standard_dev = math.sqrt(sum_of_squared_diff / len(list_of_ages))
    return standard_dev

def max_age(list_of_ages):
    return max(list_of_ages)

def min_age(list_of_ages):
    return min(list_of_ages)

print(f"The average age of individuals in the dataset is {average_age(ages)} years.")
print(f"The standard deviation of ages is {standard_dev(average_age(ages), ages)}")
print(f"The max age in the dataset is {max_age(ages)} years.")
print(f"The min age in the dataset is {min_age(ages)} years.")

The average age of individuals in the dataset is 39.20702541106129 years.
The standard deviation of ages is 14.04470903895454
The max age in the dataset is 64 years.
The min age in the dataset is 18 years.


#### Result Discussion

The average age of about 39 suggests that this dataset consists largely of middle-aged adults. I would assume many people at this age are married with children, so I would expect to see higher insurance charges overall.

With a standard deviation of approximately 14, individual ages typically differ from the mean by about 14 years. If the ages were roughly normally distributed, about 68% of individuals would fall between 25 (39 - 14) and 53 (39 + 14) years old. That normality assumption is untested, so treat the range as a rough guide. Either way, ages vary quite a bit around the mean.

The oldest individual in the dataset is 64 years old.
The youngest individual in the dataset is 18 years old.

I would assume insurance companies use age differences of a population to segment market targets for young individuals and older individuals. 
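The "most people fall within one standard deviation" statement can be checked directly rather than assumed. The sketch below counts the share of values within one standard deviation of the mean and also compares a hand-written population standard deviation against `statistics.pstdev` from the standard library. The function names are hypothetical, and the sample holds only the ages of the first eight records shown in the inspection output, so its result says nothing about the full dataset.

```python
import math
import statistics


def population_std(values):
    """Population standard deviation (divides by n, as in the cell above)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((value - mean) ** 2 for value in values) / len(values))


def share_within_one_std(values):
    """Fraction of values that lie within one standard deviation of the mean."""
    mean = sum(values) / len(values)
    std = population_std(values)
    inside = [value for value in values if mean - std <= value <= mean + std]
    return len(inside) / len(values)


# Ages of the first eight records displayed in the inspection output.
sample_ages = [19, 18, 28, 33, 32, 31, 46, 37]

print(population_std(sample_ages))
print(statistics.pstdev(sample_ages))  # standard-library cross-check
print(share_within_one_std(sample_ages))
```

Calling `share_within_one_std(ages)` on the full list would show how close the real share is to the roughly 68% expected under a normal distribution.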

### 2. BMI Summary Statistics Calculation

This analysis will discover the average, standard deviation, max, and min BMI of the individuals in the dataset. This is important for understanding the basic profile of the dataset.

In [5]:
def average_bmi(list_of_bmis):
    sum_of_bmis = 0
    for bmi in list_of_bmis:
        sum_of_bmis += bmi
    average_bmi = sum_of_bmis / len(list_of_bmis)
    return average_bmi

def standard_dev(average, list_of_bmis):
    sum_of_square_diff = 0
    for bmi in list_of_bmis:
        sum_of_square_diff += (average - bmi)**2
    standard_dev = math.sqrt(sum_of_square_diff / len(list_of_bmis))
    return standard_dev

def max_bmi(list_of_bmis):
    return max(list_of_bmis)

def min_bmi(list_of_bmis):
    return min(list_of_bmis)

print(f"The average bmi in the dataset is {average_bmi(bmis)}")
print(f"The standard deviation of the bmi is {standard_dev(average_bmi(bmis), bmis)}")
print(f"The max bmi is {max_bmi(bmis)}")
print(f"The min bmi is {min_bmi(bmis)}")

The average bmi in the dataset is 30.663396860986538
The standard deviation of the bmi is 6.095907641589428
The max bmi is 53.13
The min bmi is 15.96


#### Result Discussion

- The average BMI in the dataset is approximately 30.7, which falls in the obesity range according to the [CDC](https://www.cdc.gov/healthyweight/assessing/bmi/adult_bmi/index.html). With this information, I expect to see high overall medical insurance costs because high BMI is associated with increased risk of health issues.

- With a standard deviation of approximately 6.1, individual BMI values typically differ from the mean by about 6.1. If BMI were roughly normally distributed, about 68% of individuals would fall between 24.6 (30.7 - 6.1) and 36.8 (30.7 + 6.1). A variation of 6.1 may not seem like much, but within the context of health it is the difference between a healthy-weight individual and an obese individual. I would call that a significant difference.
- The highest BMI in the dataset is 53.13 (obese) and the lowest BMI in the dataset is 15.96 (underweight).
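The labels used in this discussion (underweight, obese) follow the CDC adult BMI categories: below 18.5 is underweight, 18.5 to below 25 is healthy weight, 25 to below 30 is overweight, and 30 or above is obese. As a sketch, the helpers below assign a category to each BMI value and count how many fall into each one. `bmi_category` and `count_bmi_categories` are hypothetical names, and the sample list mixes the dataset's min and max with a few values from the displayed records.

```python
def bmi_category(bmi):
    """Map a BMI value to its CDC adult weight-status category."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "healthy weight"
    if bmi < 30:
        return "overweight"
    return "obese"


def count_bmi_categories(bmi_values):
    """Count how many BMI values fall into each category."""
    counts = {}
    for bmi in bmi_values:
        category = bmi_category(bmi)
        counts[category] = counts.get(category, 0) + 1
    return counts


# Min and max BMI from the dataset plus three values from the displayed records.
sample_bmis = [15.96, 22.705, 27.9, 33.77, 53.13]

print(count_bmi_categories(sample_bmis))
```

Running `count_bmi_categories(bmis)` on the full list would show how the individuals are spread across categories, which says more than the mean alone.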

### 3. Smoking Proportion

This analysis will count the occurrences of smokers and non-smokers to gain insight into how the two groups are distributed.

In [6]:
# dataset_list is a list of dictionaries. Each dictionary represents data for one individual including their smoking status. See code cell 2 
def count_of_smokers_vs_non_smokers(dataset_list):
    count_of_smokers = 0
    count_of_non_smokers = 0
    for record in dataset_list:
        if record["smoker"] == "yes":
            count_of_smokers += 1
        elif record["smoker"] == "no":
            count_of_non_smokers += 1
        else:
            pass
    return count_of_smokers, count_of_non_smokers
        
total_smokers, total_non_smokers = count_of_smokers_vs_non_smokers(dataset_storage)
print(f"There are {total_smokers} individuals in the dataset that smoke.")
print(f"There are {total_non_smokers} individuals in the dataset that do not smoke.")

ratio_non_smoke_to_smoke = total_non_smokers / total_smokers
print(f"ratio: {ratio_non_smoke_to_smoke}")

There are 274 individuals in the dataset that smoke.
There are 1064 individuals in the dataset that do not smoke.
ratio: 3.883211678832117


#### Result Discussion

- Well, this is a little surprising to me. From the calculations we can see there are far more non-smokers (1064) than smokers (274), a ratio of approximately 1 smoker for every 4 non-smokers. Very interesting.
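The same counts can also be expressed as shares of the whole dataset. The sketch below uses `collections.Counter` from the standard library as an alternative to the manual counters. To stay runnable without the CSV file, it rebuilds a stand-in list with the reported counts (274 smokers, 1064 non-smokers). `smoker_shares` and `stand_in_records` are hypothetical names.

```python
from collections import Counter


def smoker_shares(records):
    """Return the fraction of records for each value of the 'smoker' field."""
    counts = Counter(record["smoker"] for record in records)
    total = sum(counts.values())
    return {status: count / total for status, count in counts.items()}


# Stand-in records that reproduce only the reported counts.
stand_in_records = [{"smoker": "yes"}] * 274 + [{"smoker": "no"}] * 1064

shares = smoker_shares(stand_in_records)
for status, share in shares.items():
    print(f"{status}: {share:.1%}")
```

With the real data, the call would simply be `smoker_shares(dataset_storage)`.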


### 4. Cost comparison between smokers vs. non-smokers

This calculation will determine the total medical insurance costs of smokers vs. non-smokers. A second function will then calculate the average medical insurance cost for each group.

Considering the proportion of smokers to non-smokers, it will be interesting to see what the costs would be.

I hypothesize that the total cost difference between the two demographics will not be significant (in the everyday sense of the word). If that holds even though smokers are far fewer, it will tell me that smoking status has a really high impact on medical insurance cost. Let's see...

In [7]:
# dataset_list is a list of dictionaries. Each dictionary represents data for one individual including their charges and smoking Status. 
# See "Initial Data Inspection" above
def smokers_vs_non_smokers_total_costs(dataset_list):
    smoker_cost = 0.0
    non_smoker_cost = 0.0
    for record in dataset_list:
        if record["smoker"] == "yes":
            smoker_cost += float(record["charges"])
        elif record["smoker"] == "no":
            non_smoker_cost += float(record["charges"])
        else:
            pass
    return smoker_cost, non_smoker_cost

smoker_total_cost, non_smoker_total_cost = smokers_vs_non_smokers_total_costs(dataset_storage)
print(f"Total costs for smokers in this dataset is ${smoker_total_cost:,.2f}")
print(f"Total costs for non-smokers in this dataset is ${non_smoker_total_cost:,.2f}")

Total costs for smokers in this dataset is $8,781,763.52
Total costs for non-smokers in this dataset is $8,974,061.47


#### Result Discussion

- Were there any surprises? Nope, not at all. Smokers make up only about a fifth of the dataset, yet their total costs are nearly equal to those of non-smokers, which tells us that smoking has a big impact on medical insurance costs. If this is not obvious, then navigate to the next cell on average_costs.

In [8]:
# total_cost and total_count are each a (smokers, non_smokers) tuple returned by the functions defined in the cells above.
def average_costs(total_cost, total_count):
    smoker_average = total_cost[0] / total_count[0]
    non_smoker_average = total_cost[1] / total_count[1]
    return smoker_average, non_smoker_average

smoker_average, non_smoker_average = average_costs(smokers_vs_non_smokers_total_costs(dataset_storage), count_of_smokers_vs_non_smokers(dataset_storage))

print(f"The average insurance cost for smokers is ${smoker_average:,.2f}")
print(f"The average insurance cost for non smokers is ${non_smoker_average:,.2f}")

The average insurance cost for smokers is $32,050.23
The average insurance cost for non smokers is $8,434.27


#### Result Discussion

- The average insurance cost for smokers (\$32,050.23) is approximately 3.8 times the average insurance cost for non-smokers (\$8,434.27), or roughly 4 times.

- Within the context of this dataset, it is safe to say that smokers pay, on average, about 4 times what non-smokers pay for medical insurance. This is a comparison of group averages, so it does not mean that every smoker's cost is 4 times every non-smoker's cost.
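The ratio can be verified from the totals and counts printed earlier. This short sketch redoes the arithmetic with those reported figures hard-coded, so the variable names are illustrative only.

```python
# Totals and counts as printed in the earlier output cells.
smoker_total, smoker_count = 8_781_763.52, 274
non_smoker_total, non_smoker_count = 8_974_061.47, 1064

smoker_mean = smoker_total / smoker_count
non_smoker_mean = non_smoker_total / non_smoker_count
cost_ratio = smoker_mean / non_smoker_mean

print(f"Smoker average: ${smoker_mean:,.2f}")
print(f"Non-smoker average: ${non_smoker_mean:,.2f}")
print(f"Ratio of averages: {cost_ratio:.2f}")
```

The ratio comes out at about 3.8, which is where the "roughly 4 times" figure comes from.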


## 6. Key Findings Summary

### Objective 1: Analyze Age Distribution

- **Finding:** The average age of the dataset participants is 39 years, which suggests a middle-aged demographic predominates.

### Objective 2: Analyze BMI Distribution

- **Finding:** The average BMI of the dataset participants is 30.7, which falls in the obese range and suggests that high BMI is common in this dataset.

### Objective 3: Cost Analysis by Smoking Status

- **Finding:** Smokers have an average insurance cost about 4 times higher than non-smokers. This underscores the potential savings in cost through smoking cessation.

### Suggestions for Future Research

- Future studies could focus on analyzing the dataset based on the region variable to better understand any regional influences on medical insurance costs.
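As a possible starting point for the regional analysis, the sketch below averages charges per group for any chosen key, using only dictionaries. `average_charges_by` is a hypothetical helper, and the sample contains just the first five records shown in the inspection output, so the printed averages are not representative of the full dataset.

```python
def average_charges_by(records, group_key):
    """Return the mean of 'charges' for each distinct value of group_key."""
    totals = {}
    counts = {}
    for record in records:
        group = record[group_key]
        totals[group] = totals.get(group, 0.0) + float(record["charges"])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}


# First five records from the inspection output, trimmed to two fields.
sample_records = [
    {"region": "southwest", "charges": "16884.924"},
    {"region": "southeast", "charges": "1725.5523"},
    {"region": "southeast", "charges": "4449.462"},
    {"region": "northwest", "charges": "21984.47061"},
    {"region": "northwest", "charges": "3866.8552"},
]

regional_averages = average_charges_by(sample_records, "region")
for region, average in regional_averages.items():
    print(f"{region}: ${average:,.2f}")
```

On the full data this would be `average_charges_by(dataset_storage, "region")`. The same helper also works with `"smoker"` or `"sex"` as the grouping key.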


