# U.S. Medical Insurance Costs
by Charalampos Spanias ([Website](https://cspanias.github.io/aboutme/), [Codecademy](https://www.codecademy.com/profiles/CSpanias)) <br>

This project is based on the **U.S. Medical Insurance Costs** [dataset](https://www.kaggle.com/mirichoi0218/insurance) and is developed as a portfolio project for **Codecademy's [Data Scientist career path](https://www.codecademy.com/learn/paths/data-science)**.

# CONTENT
1. [Project Scope](#ProjectScope)
2. [Reading the Dataset](#ImportDataset)
3. [Answering Questions](#Questions)
    1. [Basic Questions](#Basic)
    2. [Project Extensions](#Extensions)

<a name="ProjectScope"></a>
## 1. Project Scope

**Basic Questions** proposed by Codecademy:

1. Find out the average age of the patients in the dataset.
2. Analyze where a majority of the individuals are from.
3. Look at the different costs between smokers vs. non-smokers.
4. Figure out what the average age is for someone who has at least one child in this dataset.

**Project Extensions** proposed by Codecademy (*slightly modified*):

1. Perform **Code Refactoring**.
2. Perform an **Exploratory Data Analysis** (EDA).
3. Build a **Machine Learning model** to make cost predictions.
4. Explore areas where the data may include **bias** and how that would impact potential use cases.

<a name="ImportDataset"></a>
## 2. Reading the Dataset
I first **opened the dataset** using the `csv` module and a `with` statement.

Then, I used `csv.DictReader` to **read each row as a dictionary** keyed by column name, so I can easily access what I need.

In [2]:
# import the required module
import csv

# create a count variable for stopping the printing early later
count = 0
# load the dataset and assign it to a temp variable
with open("insurance.csv", newline="") as insurance_data:
    # instantiate csv.DictReader, which yields each row as a dictionary
    reader = csv.DictReader(insurance_data)
    # for every row in the file
    for row in reader:
        # print only the first 6 rows
        if count < 6:
            # print row
            print(row)
            # increase count by 1
            count += 1

{'age': '19', 'sex': 'female', 'bmi': '27.9', 'children': '0', 'smoker': 'yes', 'region': 'southwest', 'charges': '16884.924'}
{'age': '18', 'sex': 'male', 'bmi': '33.77', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '1725.5523'}
{'age': '28', 'sex': 'male', 'bmi': '33', 'children': '3', 'smoker': 'no', 'region': 'southeast', 'charges': '4449.462'}
{'age': '33', 'sex': 'male', 'bmi': '22.705', 'children': '0', 'smoker': 'no', 'region': 'northwest', 'charges': '21984.47061'}
{'age': '32', 'sex': 'male', 'bmi': '28.88', 'children': '0', 'smoker': 'no', 'region': 'northwest', 'charges': '3866.8552'}
{'age': '31', 'sex': 'female', 'bmi': '25.74', 'children': '0', 'smoker': 'no', 'region': 'southeast', 'charges': '3756.6216'}


Create **a list for each feature** so that the **calculations** needed to answer the project's questions can be performed.

In [3]:
# create the desired lists
age_list, sex_list, bmi_list, children_list = [], [], [], []
smoker_list, region_list, charges_list = [], [], []

# load the dataset and assign it to a temp variable
with open("insurance.csv", newline="") as insurance_data:
    # instantiate csv.DictReader, which yields each row as a dictionary
    reader = csv.DictReader(insurance_data)
    # for every row in the file
    for row in reader:
        # append each element into the corresponding list,
        # converting numerical features so calculations can be performed
        age_list.append(int(row["age"]))
        sex_list.append(row["sex"])
        bmi_list.append(float(row["bmi"]))
        children_list.append(int(row["children"]))
        smoker_list.append(row["smoker"])
        region_list.append(row["region"])
        charges_list.append(float(row["charges"]))
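
The loading step above can also be sketched as a reusable function that converts each column while reading. This is only an illustration: `load_columns` and `CONVERTERS` are hypothetical names that do not appear in the notebook, and the sketch assumes the column names shown in the printed rows. It reads two of those rows from an in-memory string so that it runs without `insurance.csv`.

```python
import csv
import io

# hypothetical mapping from column name to the type it should be stored as
CONVERTERS = {
    "age": int, "sex": str, "bmi": float, "children": int,
    "smoker": str, "region": str, "charges": float,
}

def load_columns(file_obj, converters=CONVERTERS):
    """Return a dict mapping each column name to a list of converted values."""
    columns = {name: [] for name in converters}
    # csv.DictReader yields one dictionary per row, keyed by the header names
    for row in csv.DictReader(file_obj):
        for name, convert in converters.items():
            columns[name].append(convert(row[name]))
    return columns

# two rows copied from the printed output, used as a stand-in for the file
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
    "18,male,33.77,1,no,southeast,1725.5523\n"
)
columns = load_columns(sample)
print(columns["age"])      # [19, 18]
print(columns["smoker"])   # ['yes', 'no']
```

With the real file, the same function could be called inside `with open("insurance.csv", newline="") as insurance_data:`.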

<a name="Questions"></a>
## 3. Answering Questions

<a name="Basic"></a>
## 3.1 Basic Questions
**Basic Questions** proposed by Codecademy:
1. [Find out the average age of the patients in the dataset.](#MeanAge)
2. [Analyze where a majority of the individuals are from.](#Region)
3. [Look at the different costs between smokers vs. non-smokers.](#Smokers)
4. [Figure out what the average age is for someone who has at least one child in this dataset.](#AgeChild)

<a name="MeanAge"></a>
## 3.1.1. Mean Age
1. Create `list_mean` function for calculating the mean of a list's numerical elements.
2. Create `test_list_mean` function using the `assert` keyword to test `list_mean`.
3. Call `list_mean` to calculate mean age for `age_list`.

***Step `2` was not included in Codecademy's curriculum so far***

In [4]:
def list_mean(a_list):
    """Calculate the mean of the list's numerical elements."""
    
    # create a variable to store the sum of the list's elements
    # (named total to avoid shadowing the built-in sum function)
    total = 0
    # create a variable to store the number of numerical elements
    list_len = 0
    
    # for each element within the list
    for element in a_list:
        # use a try-except block to control for TypeError
        try:
            # increase total by the value of this element
            total += element
            list_len += 1
        except TypeError:
            # skip non-numerical elements
            continue
        
    # use a try-except block to control for ZeroDivisionError
    try:         
        # return the mean value
        return total / list_len
    except ZeroDivisionError:
        # return a message explaining the error
        return "Your list does not contain any numerical value!"

Create a new function using the **`assert` keyword** to test if the `list_mean` function works as expected.

If something is wrong, calling `test_list_mean` raises an `AssertionError`. If everything works fine, nothing happens.

More info on `assert` [here](https://www.w3schools.com/python/ref_keyword_assert.asp).
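
A minimal sketch of how `assert` behaves for a true and a false condition is shown below. The helper `check_positive` is a hypothetical name used only for this illustration.

```python
def check_positive(number):
    """Hypothetical helper: fail loudly if number is not positive."""
    # the text after the comma becomes the AssertionError message
    assert number > 0, "number must be positive"
    return number

# a true condition: assert does nothing and execution continues
result = check_positive(5)

# a false condition: assert raises AssertionError with the given message
try:
    check_positive(-1)
    message = None
except AssertionError as error:
    message = str(error)

print(result)   # 5
print(message)  # number must be positive
```

Note that running Python with the `-O` flag strips `assert` statements, so they are better suited to tests than to input validation.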

In [5]:
def test_list_mean():
    """Test the list_mean function."""
    
    # test for a short numerical list
    assert list_mean([15, 10, 5]) == 10
    # test for a longer numerical list
    assert list_mean([10, 10, 10, 5, 6, 12, 34, 63]) == 18.75
    # test for list that contains non-numbers
    assert list_mean([15, "hi", 15]) == 15
    # test for a list with no numbers at all
    assert list_mean(["hi", "bye"]) == "Your list does not contain any numerical value!"

    return

# call function
test_list_mean()

Since calling `test_list_mean` raises no `AssertionError`, everything works as expected!

In [6]:
# call function to calculate the mean age of age_list
print("The mean age of the sample is: {:.2f} years.".format(list_mean(age_list)))

The mean age of the sample is: 39.21 years.
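
As a cross-check, a hand-written mean can be compared against `statistics.mean` from the standard library. The sketch below uses only the ages of the six rows printed in section 2, so its result is not the dataset-wide mean reported above.

```python
import statistics

# ages of the six rows printed in section 2, not the full dataset
sample_ages = [19, 18, 28, 33, 32, 31]

# mean computed by hand: sum of the elements divided by their count
manual_mean = sum(sample_ages) / len(sample_ages)
# mean computed by the standard library
library_mean = statistics.mean(sample_ages)

print("{:.2f}".format(manual_mean))   # 26.83
print("{:.2f}".format(library_mean))  # 26.83
```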


<a name="Region"></a>
## 3.1.2. Most popular regions
1. Create `popular_regions` function to find the three most popular regions of a list.
2. Call `popular_regions` on `region_list` to return the three most popular regions.

***For step `1`, I used the `operator` module and its `itemgetter` function, which are not included in Codecademy's curriculum so far. More info about that [here](https://stackabuse.com/how-to-sort-dictionary-by-value-in-python/).***

In [7]:
import operator

def popular_regions(a_list):
    """Print the three most frequent regions in the list."""

    # create a list for storing each unique region name
    unique_regions = []
    # for every region in the list
    for region in a_list:
        # if this region is not already in the list
        if region not in unique_regions:
            # append region to the list
            unique_regions.append(region)

    # create a dict for storing each region with its number of occurrences
    regions_dict = {}
    # for each unique region
    for unique_region in unique_regions:
        # set region's name as key and number of occurrences as value
        regions_dict[unique_region] = a_list.count(unique_region)

    # sort the (region, count) pairs by count in ascending order
    sorted_tuples = sorted(regions_dict.items(), key=operator.itemgetter(1))

    print("""The most popular region is: {}.\nThe 2nd most popular region is: {}.\nThe 3rd most popular region is: {}."""
          .format(sorted_tuples[-1][0].title(), sorted_tuples[-2][0].title(), sorted_tuples[-3][0].title()))

In [8]:
popular_regions(region_list)

The most popular region is: Southeast.
The 2nd most popular region is: Northwest.
The 3rd most popular region is: Southwest.


An **alternative way** is demonstrated below.

In [25]:
# initiate count variables
south_west, north_west, south_east, north_east = 0, 0, 0, 0

for region in region_list:
    if region.startswith("southw"):
        south_west += 1
    elif region.startswith("southe"):
        south_east += 1
    elif region.startswith("northw"):
        north_west += 1
    else: # region.startswith("northe")
        north_east += 1

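# create a list with the counts and a list of region names in the same order
region_names = ["southwest", "northwest", "southeast", "northeast"]
counts = [south_west, north_west, south_east, north_east]

# check counts for each region
print(counts, "\n")

# show the unique region names (note: a set has no guaranteed order)
print(set(region_list), "\n")

# create a dict with region name as key and its count as value;
# zip over the explicit name list, since the order of a set is arbitrary
region_counts_dict = dict(zip(region_names, counts))

# check dict
print(region_counts_dict)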

[325, 325, 364, 324] 

{'southwest', 'northwest', 'southeast', 'northeast'} 

{'southwest': 325, 'northwest': 325, 'southeast': 364, 'northeast': 324}


In [28]:
# sort the (region, count) pairs by count, largest first;
# ties are broken alphabetically by region name
sorted_regions = sorted(region_counts_dict.items(), key=lambda item: (-item[1], item[0]))

print(sorted_regions)

[('southeast', 364), ('northwest', 325), ('southwest', 325), ('northeast', 324)]
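
The counting and sorting done in this section can also be sketched with `collections.Counter` from the standard library. The sample below contains only the regions of the six rows printed in section 2, so its counts differ from the full dataset. In the notebook, `region_list` would take the place of `sample_regions`, which is a name introduced here for illustration.

```python
from collections import Counter

# regions of the six rows printed in section 2, not the full dataset
sample_regions = [
    "southwest", "southeast", "southeast",
    "northwest", "northwest", "southeast",
]

# Counter maps each distinct element to its number of occurrences
region_counts = Counter(sample_regions)

# most_common(3) returns the three (region, count) pairs with the highest counts
top_three = region_counts.most_common(3)
print(top_three)  # [('southeast', 3), ('northwest', 2), ('southwest', 1)]
```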

<a name="Smokers"></a>
## 3.1.3. Smoker vs non-Smoker costs
1. Create `smoking_cost` function that:
    1. Splits a single list into two sublists: `smokers` and `non-smokers`.
    2. Calculates the mean insurance cost of each sublist.
    3. Compares the mean difference between the two costs.
2. Call `smoking_cost` on `smoker_list` to return the mean difference between the two costs.
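
The steps above could be sketched as follows. This is one possible shape, not the notebook's implementation: it assumes the function receives both the smoker flags and the charges (as numbers), that smokers are marked with `"yes"`, and that both groups are non-empty. The sample values are the six rows printed in section 2, so the result is not representative of the full dataset.

```python
def smoking_cost(smoker_flags, charges):
    """Return (mean smoker cost, mean non-smoker cost, difference)."""
    smokers, non_smokers = [], []
    # split the charges into two sublists based on the smoker flag
    for flag, charge in zip(smoker_flags, charges):
        if flag == "yes":
            smokers.append(charge)
        else:
            non_smokers.append(charge)

    # mean cost of each sublist (assumes neither sublist is empty)
    smoker_mean = sum(smokers) / len(smokers)
    non_smoker_mean = sum(non_smokers) / len(non_smokers)

    # difference between the two mean costs
    return smoker_mean, non_smoker_mean, smoker_mean - non_smoker_mean

# values of the six rows printed in section 2, not the full dataset
sample_flags = ["yes", "no", "no", "no", "no", "no"]
sample_charges = [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552, 3756.6216]

smoker_mean, non_smoker_mean, difference = smoking_cost(sample_flags, sample_charges)
print("Smokers: {:.2f}, non-smokers: {:.2f}, difference: {:.2f}"
      .format(smoker_mean, non_smoker_mean, difference))
```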

<a name="Extensions"></a>
## 3.2 Project Extensions
**Project Extensions** proposed by Codecademy (*slightly modified*):

1. [Code Refactoring.](#Refactoring)
2. [Perform an Exploratory Data Analysis (EDA).](#EDA)
3. [Build a **Machine Learning model** to make cost predictions.](#ML)
4. [Explore areas where the data may include **bias** and how that would impact potential use cases.](#Bias)

<a name="Refactoring"></a>
## 3.2.1 Perform Code Refactoring

Part of [Wikipedia's definition](https://en.wikipedia.org/wiki/Code_refactoring) of the term is quoted below:

>**Restructuring existing computer code**—changing the factoring—without changing its external behavior. Refactoring is intended to **improve** the **design**, **structure**, and/or **implementation** of the software (its non-functional attributes), while preserving its functionality. Potential advantages of refactoring may include **improved code readability** and **reduced complexity**.
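
As a small illustration of the definition, `list_mean` from section 3.1.1 could be restructured as sketched below. `list_mean_refactored` is a hypothetical name, and the sketch assumes that only `int` and `float` elements count as numerical, which matches the inputs checked in `test_list_mean` but may differ for other numeric types. The same checks pass, which is what "without changing its external behavior" means in practice.

```python
def list_mean_refactored(a_list):
    """Hypothetical refactoring of list_mean with the same tested behavior."""
    # keep only ints and floats; other element types are skipped
    numbers = [x for x in a_list if isinstance(x, (int, float))]
    # same message as list_mean when there is nothing to average
    if not numbers:
        return "Your list does not contain any numerical value!"
    return sum(numbers) / len(numbers)

# the same checks used in test_list_mean still pass after the refactoring
assert list_mean_refactored([15, 10, 5]) == 10
assert list_mean_refactored([10, 10, 10, 5, 6, 12, 34, 63]) == 18.75
assert list_mean_refactored([15, "hi", 15]) == 15
assert list_mean_refactored(["hi", "bye"]) == "Your list does not contain any numerical value!"

print("all checks passed")
```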