# U.S. Medical Insurance Costs

In this project, a **.csv** file with medical insurance costs is investigated with Python. The goal of the project is to analyze various attributes of **insurance.csv** to learn more about the patient information and to find potential future uses for it.

In [4]:
# Import the modules needed for this project: csv to iterate through the insurance.csv file,
# collections (defaultdict) to count patients per region
import csv
from collections import defaultdict

To start, we need to import all the required modules.

Next, we need to iterate through **insurance.csv** to get our data. The following aspects of the data file will be checked in order to plan out how to import the data into a Python file:

    - The names of columns and rows
    - Any noticeable missing data
    - Types of values (numerical vs. categorical)
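The three checks listed above can be sketched with the `csv` module alone. This is a minimal illustration, not the method used in this notebook: `inspect_csv` and `SAMPLE` are hypothetical names, and the three sample rows are made up (one has an empty `charges` field on purpose) so that the snippet runs without the real file.

```python
import csv
import io

# Hypothetical sample standing in for insurance.csv (values are made up).
SAMPLE = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.92
18,male,33.77,1,no,southeast,
28,male,33.0,3,no,southeast,4449.46
"""

def inspect_csv(csv_file):
    """Return column names, missing-value counts and a numerical/categorical guess per column."""
    reader = csv.DictReader(csv_file)
    columns = reader.fieldnames
    missing = {name: 0 for name in columns}
    numeric = {name: True for name in columns}
    for row in reader:
        for name in columns:
            value = (row[name] or "").strip()
            if not value:
                # Empty field: count it as missing and skip the type check
                missing[name] += 1
                continue
            try:
                float(value)
            except ValueError:
                # At least one value is not a number, so treat the column as categorical
                numeric[name] = False
    kinds = {name: "numerical" if numeric[name] else "categorical" for name in columns}
    return columns, missing, kinds

columns, missing, kinds = inspect_csv(io.StringIO(SAMPLE))
print(columns)
print(missing)
print(kinds)
```

For the real file, the same function could be called as `inspect_csv(open_file)` inside `with open('insurance.csv', newline='') as open_file:`.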


In [5]:
# create lists for every column in our insurance.csv
age = []
sex = []
bmi = []
children = []
smoker = []
region = []
charges = []

Our **insurance.csv** contains the following columns:
* Patient Age
* Patient Sex
* Patient BMI
* Patient Number of Children
* Patient Smoker Status
* Patient Region
* Patient Charges

There is no missing information. Now we can store the data from **insurance.csv** in the empty lists, which are named after the columns.

In [6]:
def load_data(file_name):
    """Helper function that reads the file and appends each column's values to our lists"""
    with open(file_name, newline='') as insurance_file:
        reader = csv.DictReader(insurance_file)
        for row in reader:
            age.append(row['age'])
            sex.append(row['sex'])
            bmi.append(row['bmi'])
            children.append(row['children'])
            smoker.append(row['smoker'])
            region.append(row['region'])
            charges.append(row['charges'])

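**load_data()** takes one parameter, the file name, and stores each column's values in the corresponding list. Because it appends to the lists defined above, it should be called only once; calling it again would store every row a second time and inflate any counts.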
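Since `load_data()` appends to module-level lists, re-running the cell adds the same rows again. One possible alternative, shown here only as a sketch, is a loader that builds and returns fresh lists on every call. The name `load_columns` and the made-up `SAMPLE` rows are assumptions for illustration and are not part of the original notebook.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample standing in for insurance.csv (values are made up).
SAMPLE = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.92
18,male,33.77,1,no,southeast,1725.55
28,male,33.0,3,no,southeast,4449.46
"""

def load_columns(csv_file):
    """Read an open CSV file and return a new dict mapping column name to a list of values."""
    columns = defaultdict(list)
    for row in csv.DictReader(csv_file):
        for name, value in row.items():
            columns[name].append(value)
    return dict(columns)

# Calling the loader twice yields two independent results; nothing accumulates.
first = load_columns(io.StringIO(SAMPLE))
second = load_columns(io.StringIO(SAMPLE))
print(first["age"])
print(len(second["age"]))
```

The values are still strings, as in the original lists, so later functions would convert them with `int()` or `float()` in the same way.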

In [8]:
load_data('insurance.csv')

Now that all the data from **insurance.csv** is neatly organized into labeled lists, the analysis can start. This is where one must plan out what to investigate and how to perform the analysis. There are many aspects of the data that could be looked into. The following operations will be implemented:
* Calculate the average age of all patients
* Calculate the average age of males and females
* Calculate the percentage of smokers
* Compare the insurance charges of smokers and non-smokers
* Calculate the average age of patients who have children
* Find the region where the largest number of patients live

To perform these inspections, 2 helper functions and 5 analysis functions were created:
* `create_dictionary(list_age, list_sex, list_bmi, list_children, list_smoker, list_region, list_charges)`
* `calculate_average_age(list_of_ages)`
* `calculate_average_age_by_gender(my_dictionary)`
* `calculate_smoker_percentage(my_dictionary)`
* `difference_in_price_for_smokers(my_dictionary)`
* `calculate_average_age_with_children(my_dictionary)`
* `calculate_regions(my_dictionary)`

All of these functions are presented below.


In [20]:
def create_dictionary(list_age, list_sex, list_bmi, list_children, list_smoker, list_region, list_charges):
    """Helper function creates a dictionary from our lists with patient number as a key"""
    my_dictionary = {}
    counter = 1
    for x in range(len(list_age)):
        my_dictionary.update({
            counter: {
                'Age': list_age[x],
                'Sex': list_sex[x],
                'BMI': list_bmi[x],
                'Children': list_children[x],
                'Smoker': list_smoker[x],
                'Region': list_region[x],
                'Charges': list_charges[x]
            }
        })
        counter += 1
    return my_dictionary

def calculate_average_age(list_of_ages):
    """Helper function which calculates average age of all ages in list"""
    total = 0
    for patient_age in list_of_ages:
        total += int(patient_age)
    return round(total / len(list_of_ages), 2)

def calculate_average_age_by_gender(my_dictionary):
    """Function which calculates the average age of males and of females and returns both (males first)"""
    list_of_males = []
    list_of_females = []
    for value in my_dictionary.values():
        if value['Sex'] == 'female':
            list_of_females.append(value['Age'])
        elif value['Sex'] == 'male':
            list_of_males.append(value['Age'])

    return calculate_average_age(list_of_males), calculate_average_age(list_of_females)

def calculate_smoker_percentage(my_dictionary):
    """Function which calculates the percentage of smokers in the dictionary"""
    counter = 0
    for value in my_dictionary.values():
        if value['Smoker'] == 'yes':
            counter += 1
    return round(counter / len(my_dictionary) * 100, 2)

def difference_in_price_for_smokers(my_dictionary):
    """Function which returns smokers' total charges as a percentage of non-smokers' total charges"""
    total_for_smoker = 0
    total_for_non_smoker = 0
    for value in my_dictionary.values():
        if value['Smoker'] == 'yes':
            total_for_smoker += float(value['Charges'])
        else:
            total_for_non_smoker += float(value['Charges'])

    return round(total_for_smoker / total_for_non_smoker * 100, 2)

def calculate_average_age_with_children(my_dictionary):
    """Calculates the average age of people with at least one child"""
    list_of_people_with_children = []
    for value in my_dictionary.values():
        if int(value['Children']) > 0:
            list_of_people_with_children.append(value['Age'])
    return calculate_average_age(list_of_people_with_children)

def calculate_regions(my_dictionary):
    """Counts people per region and returns the region with the most people together with that count"""
    d = defaultdict(int)
    number_of_people = 0
    area = ""
    for key, information in my_dictionary.items():
        d[information['Region']] += 1

    for item in d.items():
        if item[1] > number_of_people:
            number_of_people = item[1]
            area = item[0]
    return area, number_of_people

Now let's create a dictionary with all our patient info.

In [10]:
insurance_dictionary = create_dictionary(age, sex, bmi, children, smoker, region, charges)
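The same nested structure (patient number as key, one inner dictionary per patient) can also be built with `zip` and `enumerate`. The function below is a hypothetical alternative named `create_dictionary_zip`, shown as a sketch with two made-up patients; it is not the function used in the rest of this notebook.

```python
def create_dictionary_zip(list_age, list_sex, list_bmi, list_children,
                          list_smoker, list_region, list_charges):
    """Build {patient_number: {...}} with the same keys as create_dictionary()."""
    keys = ('Age', 'Sex', 'BMI', 'Children', 'Smoker', 'Region', 'Charges')
    # zip() walks the seven lists in parallel, one tuple per patient
    rows = zip(list_age, list_sex, list_bmi, list_children,
               list_smoker, list_region, list_charges)
    # enumerate(..., start=1) supplies the patient number used as the key
    return {number: dict(zip(keys, row)) for number, row in enumerate(rows, start=1)}

# Made-up values for two patients, stored as strings like the CSV fields
toy_dictionary = create_dictionary_zip(
    ['19', '18'], ['female', 'male'], ['27.9', '33.77'], ['0', '1'],
    ['yes', 'no'], ['southwest', 'southeast'], ['16884.92', '1725.55'])
print(toy_dictionary[1])
print(len(toy_dictionary))
```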

In [12]:
average_age = calculate_average_age(age)
print("Average age of all people in list is {}".format(average_age))

Average age of all people in list is 39.21


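The average age of all patients in the list is about **39** years. An average alone does not show that the sample represents a broader population, but it does indicate that the data is not dominated by very young or very old patients.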

In [13]:
average_male_age, average_female_age = calculate_average_age_by_gender(insurance_dictionary)
print("Average male age is {}, average female age is {}".format(average_male_age, average_female_age))

Average male age is 38.92, average female age is 39.5


The average age of males is about 0.6 years lower than that of females.

In [14]:
smokers_percentage = calculate_smoker_percentage(insurance_dictionary)
print("{}% of people are smokers".format(smokers_percentage))

20.48% of people are smokers


About **20.5%** of all our patients are smokers, which is higher than the smoking rate in the USA in 2021 according to [worldpopulationreview.com](https://worldpopulationreview.com/country-rankings/smoking-rates-by-country).

In [15]:
difference_percentage_in_price_for_smokers = difference_in_price_for_smokers(insurance_dictionary)
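print("Total charges of smokers are {}% of the total charges of non-smokers".format(
    difference_percentage_in_price_for_smokers))

Total charges of smokers are 97.86% of the total charges of non-smokers


Note what this number measures: the function divides the sum of all smokers' charges by the sum of all non-smokers' charges. Smokers make up only 20.48% of the patients, yet their combined charges are almost equal to those of the remaining 79.52%. Per patient, that corresponds to an average charge for smokers of roughly 0.9786 × 79.52 / 20.48 ≈ 3.8 times the average charge for non-smokers. This is a very large gap, and it suggests that a smoker who quits could expect a lower insurance price.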
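`difference_in_price_for_smokers()` divides one total by another, so its result depends on how many smokers there are. A per-patient comparison needs the mean charge of each group. The sketch below uses a hypothetical function name, `average_charge_difference`, and a made-up four-patient dictionary; it assumes the same `'Smoker'` and `'Charges'` keys as `insurance_dictionary`.

```python
def average_charge_difference(patients):
    """Return by how many percent the mean charge of smokers exceeds that of non-smokers."""
    smoker_charges = [float(p['Charges']) for p in patients.values() if p['Smoker'] == 'yes']
    other_charges = [float(p['Charges']) for p in patients.values() if p['Smoker'] != 'yes']
    mean_smoker = sum(smoker_charges) / len(smoker_charges)
    mean_other = sum(other_charges) / len(other_charges)
    return round((mean_smoker - mean_other) / mean_other * 100, 2)

# Made-up data: one smoker charged 300, three non-smokers charged 100 each
toy_patients = {
    1: {'Smoker': 'yes', 'Charges': '300.0'},
    2: {'Smoker': 'no', 'Charges': '100.0'},
    3: {'Smoker': 'no', 'Charges': '100.0'},
    4: {'Smoker': 'no', 'Charges': '100.0'},
}
print(average_charge_difference(toy_patients))  # 200.0
```

In this toy data the two totals are equal (300 and 300), so a ratio of totals would report 100%, while the mean charge of the smoker is 200% higher than the mean charge of a non-smoker.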

In [16]:
age_of_people_with_children = calculate_average_age_with_children(insurance_dictionary)
print('Average age of people with children is {}'.format(age_of_people_with_children))

Average age of people with children is 39.78


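The average age of people with at least one child is around 40 years, only slightly above the overall average of 39.21. This is the patients' current age, not their age when their children were born, so on its own it does not show whether people now start families later than before.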

In [21]:
area_of_highest_number, number_of_people_from_region = calculate_regions(insurance_dictionary)
print('The {} region has the largest number of people: {}.'.format(area_of_highest_number, number_of_people_from_region))

The southeast region has the largest number of people: 728.


The southeast region has more patients than any other region.
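The counting done in `calculate_regions()` can also be expressed with `collections.Counter`, whose `most_common(1)` method returns the most frequent item and its count. The function name `most_common_region` and the three-patient dictionary are assumptions for illustration only.

```python
from collections import Counter

def most_common_region(patients):
    """Return (region, count) for the region that occurs most often."""
    counts = Counter(p['Region'] for p in patients.values())
    # most_common(1) gives a one-element list of (item, count) pairs
    region, count = counts.most_common(1)[0]
    return region, count

# Made-up data with the same 'Region' key as insurance_dictionary
toy_regions = {
    1: {'Region': 'southeast'},
    2: {'Region': 'northwest'},
    3: {'Region': 'southeast'},
}
print(most_common_region(toy_regions))  # ('southeast', 2)
```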

All patient data is now neatly organized in a dictionary. This is convenient for further analysis if a decision is made to continue making investigations for the attributes in **insurance.csv**.