# U.S. Medical Insurance Costs

This is the first project in the Codecademy Data Scientist: Machine Learning career path.
The dataset provided by Codecademy comes from [Kaggle](https://www.kaggle.com/datasets/mirichoi0218/insurance) and is stored in **insurance.csv**.
It contains 1338 records about different insurants. Each record includes age, sex, BMI, number of children, smoker status, region, and final charge.

In this project, we will use Python fundamentals to explore a CSV file containing medical insurance costs. The objective is to analyze various attributes within the insurance.csv file to gain a deeper understanding of the patient information it contains and to identify potential use cases for the dataset.

## The first stage: Load data

In [3]:
# import csv library
import csv

We are working with a CSV file, so we need Python's built-in `csv` module.

After a first look at the file, we noticed the following:
* There are 1338 insurant records
* Every record has 4 columns containing quantitative values: Age, BMI, Number of children, Final charge
* Every record has 3 columns containing categorical values: Sex, Smoker or not, Region
* There is no missing data
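
The observations above can also be checked programmatically. The following is a minimal sketch, not part of the original notebook: the helper name `count_rows_and_missing` and the three sample rows are hypothetical, and the sample deliberately contains one empty cell to show what a missing value looks like.

```python
import csv
import io

# Hypothetical three-row sample in the same column layout as insurance.csv;
# the values are made up for illustration and the last row lacks a charge.
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
30,female,25.0,0,yes,southwest,15000.0
40,male,31.5,1,no,southeast,6000.0
50,male,28.2,3,no,southeast,
"""

def count_rows_and_missing(csv_file):
    """Return (number of records, {column name: number of empty cells})."""
    reader = csv.DictReader(csv_file)
    total_rows = 0
    missing = {name: 0 for name in reader.fieldnames}
    for row in reader:
        total_rows += 1
        for name in reader.fieldnames:
            value = row[name]
            # DictReader yields None when a row has fewer fields than the header
            if value is None or value.strip() == "":
                missing[name] += 1
    return total_rows, missing

rows, missing = count_rows_and_missing(io.StringIO(SAMPLE_CSV))
print(rows, missing)
```

For the real file, the same function could be called inside `with open('insurance.csv', newline='') as f:`. According to the observations above, it should then report 1338 records and zero empty cells in every column.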

In [6]:
# create empty lists for the various attributes from insurance.csv
ages = []
sexes = []
bmis = []
num_children = []
smoker_statuses = []
regions = []
insurance_charges = []

To make the analysis easier, we will separate each column into its own list. For now we have created empty lists that will be filled in the next steps.

In [8]:
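# the function which takes a file name, a column name and a list, and appends the values of this column from this file to the list
def load_data_to_list(file_name, column_name, target_list):
    # the parameter is called target_list so that it does not shadow the built-in list type;
    # newline='' is what the csv module documentation recommends when opening files
    with open(file_name, newline='') as data_csv:
        data_csv_dict = csv.DictReader(data_csv)
        for row in data_csv_dict:
            target_list.append(row[column_name])
    return target_list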

We have created a helper function that loads the data from one CSV column into the corresponding list.
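
The helper opens and reads the file again for every column. As an alternative, here is a minimal sketch that reads the file once and collects all columns at the same time. The name `load_columns` and the sample rows are hypothetical and are not part of the original notebook.

```python
import csv
import io

# Hypothetical two-row sample with the same header as insurance.csv
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
30,female,25.0,0,yes,southwest,15000.0
40,male,31.5,1,no,southeast,6000.0
"""

def load_columns(csv_file):
    """Read the file once and return {column name: list of string values}."""
    reader = csv.DictReader(csv_file)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            columns[name].append(row[name])
    return columns

columns = load_columns(io.StringIO(SAMPLE_CSV))
print(columns["age"])     # values stay strings, as in the original helper
print(columns["region"])
```

The values are still strings, so the `float(...)` conversions used later in the analysis would remain necessary.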

In [10]:
# the constant which keeps the name of the data file
INSURANCE_FILE_NAME = 'insurance.csv'

We are working with only one file, so we can store its name in a constant.

In [12]:
# call the function for every column
load_data_to_list(INSURANCE_FILE_NAME, 'age', ages)
load_data_to_list(INSURANCE_FILE_NAME, 'sex', sexes)
load_data_to_list(INSURANCE_FILE_NAME, 'bmi', bmis)
load_data_to_list(INSURANCE_FILE_NAME, 'children', num_children)
load_data_to_list(INSURANCE_FILE_NAME, 'smoker', smoker_statuses)
load_data_to_list(INSURANCE_FILE_NAME, 'region', regions)
load_data_to_list(INSURANCE_FILE_NAME, 'charges', insurance_charges)
print("Data has been loaded")

Data has been loaded


Once the function is implemented, we can call it as many times as we need.

---

## The end of the first stage: The first questions

We have successfully loaded all the data from **insurance.csv** into the corresponding lists. At this point we can concentrate on the questions: what exactly do we want to extract from the data, and what can we extract?

The following tasks will be implemented in the next steps:

 * Find out the average age of the patients in the dataset
 * Analyze where a majority of the individuals are from
 * Look at the different costs between smokers vs. non-smokers
 * Figure out what the average age is for someone who has at least one child in this dataset

## The second stage: Implementation of the functions for answering the first questions

In [17]:
# the function which returns the average age of the insurants
def get_average_age():
    total_age = 0
    for age in ages:
        total_age += float(age)
    return total_age / len(ages)

In [18]:
# the function which determines how many insurants are from different regions and returns dictionary {region: number of insurants}
def get_regions_by_number_of_insurants():
    regions_by_number_of_insurants = {}
    for region in regions:
        if region in regions_by_number_of_insurants:
            regions_by_number_of_insurants[region] += 1
        else:
            regions_by_number_of_insurants[region] = 1
    return regions_by_number_of_insurants

In [19]:
# the function which finds the region with the maximum number of insurants and returns a tuple (region, number of insurants)
def get_region_max_of_insurants():
    regions_by_number_of_insurants = get_regions_by_number_of_insurants()
    max_number = float('-inf')
    max_region = ""
    for region, number in regions_by_number_of_insurants.items():
        if number > max_number:
            max_number = number
            max_region = region
    return (max_region, max_number)

In [20]:
# the function which calculates the average charge for the insurants whose smoker status ('yes' or 'no') equals the given one
def get_average_charge_by_smoker_attr(smoker):
    total_charge = 0
    total_insurants = 0
    for i in range(len(smoker_statuses)):
        if smoker_statuses[i] == smoker:
            total_charge += float(insurance_charges[i])
            total_insurants += 1
    return total_charge / total_insurants

In [21]:
# the function which calculates the average age for insurants who have at least 1 child
def get_average_age_of_parents():
    total_parents = 0
    total_age = 0
    for i in range(len(num_children)):
        if float(num_children[i]) > 0:
            total_parents += 1
            total_age += float(ages[i])
    return total_age / total_parents

Above are the implementations of several functions which will help us in the analysis and give us answers to the questions.
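
For comparison, the average and the most frequent region can also be obtained with the standard library instead of hand-written loops. This is only a sketch on hypothetical sample lists (`sample_ages`, `sample_regions`), not a replacement for the functions above, which practice Python fundamentals on purpose.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample values stored as strings, as they come from csv.DictReader
sample_ages = ["19", "18", "28", "33"]
sample_regions = ["southwest", "southeast", "southeast", "northwest"]

# mean() accepts any iterable of numbers
average_age = mean(float(age) for age in sample_ages)

# Counter builds the {region: count} dictionary; most_common(1) returns
# a list with the single most frequent (region, count) pair
region_counts = Counter(sample_regions)
top_region, top_count = region_counts.most_common(1)[0]

print(average_age)            # 24.5
print(top_region, top_count)  # southeast 2
```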

---

## The third stage: The answers to the first questions

A common question is what the average customer looks like. We can now report the age of the average customer.

In [25]:
print("The age of an average insurant equals {} years.".format(round(get_average_age())))

The age of an average insurant equals 39 years.


*But what does the entire insurant population look like? Are most insurants middle-aged, or can an insurant be a person of any age?*

We have also determined the region with the largest number of insurants.

In [28]:
# call the function once and unpack the returned (region, number of insurants) tuple
region_with_max_insurants, max_insurants = get_region_max_of_insurants()
print("{} is the region with the maximal number of insurants. This amount equals {}.".format(region_with_max_insurants.title(), max_insurants))

Southeast is the region with the maximal number of insurants. This amount equals 364.


Combining these two results, we can describe the average insurant as a 39-year-old person from the Southeast region.
This knowledge is useful when an insurance company develops its products.

*But can we really be sure that the large majority of insurants are from this region?*

In addition, we can clearly see the influence of habits on the insurance price.

In [32]:
average_charge_smoker = get_average_charge_by_smoker_attr('yes')
average_charge_non_smoker = get_average_charge_by_smoker_attr('no')
print("On average, the insurance costs {} for smokers and {} for non-smokers. The difference equals {}.".format(round(average_charge_smoker), round(average_charge_non_smoker), round(abs(average_charge_smoker - average_charge_non_smoker))))

On average, the insurance costs 32050 for smokers and 8434 for non-smokers. The difference equals 23616.


In other words, smokers in this dataset are charged on average more than 23 thousand dollars more than non-smokers.
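
The absolute difference can be complemented by a ratio, which shows how many times more smokers are charged. The short calculation below uses the rounded averages printed above; the variable names are chosen for this illustration only.

```python
# Rounded average charges taken from the output above
smoker_average = 32050
non_smoker_average = 8434

difference = smoker_average - non_smoker_average
ratio = smoker_average / non_smoker_average

print(difference)        # 23616
print(round(ratio, 1))   # 3.8
```

So the average charge for smokers is roughly 3.8 times the average charge for non-smokers.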

*But is there another significant dependence? What about gender?*

Additional knowledge that we can extract from the original dataset concerns parenthood. We can hardly consider the dataset a representative sample of the entire population, but we can at least look at it from the perspective of the part of the population that has insurance.

In [36]:
print("The average age for insurants who have at least 1 child equals {}.".format(round(get_average_age_of_parents())))

The average age for insurants who have at least 1 child equals 40.


This is interesting, because the average age of parents with insurance almost equals the average age of all the insurants: 40 years and 39 years, respectively.

*But is there a significant difference in who buys insurance more often: parents or non-parents?*

## The end of the third stage: The answers give more questions

During the analysis we have received answers to the questions asked above, and we can draw some conclusions. But the more answers we get, the more questions appear. The next steps of the analysis will address the following questions:
 * What is the standard deviation of the ages?
 * How large is the difference between the region with the most insurants and the other regions?
 * Do the costs differ between males and females?
 * Who buys insurance more often: parents or non-parents?

## The fourth stage: Implementation of the functions for answering the next questions

In [42]:
# the function which returns the standard deviation of the insurants' ages from the average age;
#  it would be better to compute the average and the deviation in one function that goes through the list only once,
#  but imagine we can't change previous work and we are working iteratively
def get_standart_deviation_from_average_age():
    average_age = get_average_age()
    sum_squares_deviations = 0
    for age in ages:
        sum_squares_deviations += (average_age - float(age)) ** 2
    return (sum_squares_deviations / len(ages)) ** 0.5

In [43]:
# the function which returns a dictionary of the regions which DO NOT have the maximal number of insurants, with their difference from the region which DOES;
#  again, it would be better to implement this inside one of the previous functions, to avoid going through the lists and dictionaries several times,
#  but imagine we can't change previous work and we are working iteratively
def get_regions_insurants_difference_with_max():
    regions_by_number_of_insurants = get_regions_by_number_of_insurants()
    region_max_of_insurants = get_region_max_of_insurants()
    region_difference_with_max_insurant = {}
    for region, num_insurants in regions_by_number_of_insurants.items():
        if region == region_max_of_insurants[0]:
            continue
        region_difference_with_max_insurant[region] = region_max_of_insurants[1] - num_insurants
    return region_difference_with_max_insurant

In [44]:
# the function which returns dictionary with average charge for males and females
def get_average_charge_for_genders():
    total_charges_male = 0
    total_male = 0
    total_charges_female = 0
    total_female = 0
    for i in range(len(insurance_charges)):
        if sexes[i] == 'male':
            total_charges_male += float(insurance_charges[i])
            total_male += 1
        else:
            total_charges_female += float(insurance_charges[i])
            total_female += 1
    sex_charges = {'male': total_charges_male / total_male, 'female': total_charges_female / total_female}
    return sex_charges

In [45]:
# the function which returns a dictionary with the number of insurants who have children ('parent') and who do not ('not-parent')
def get_number_of_insurants_by_parent_or_not():
    number_of_insurants_by_parent_or_not = {'parent': 0, 'not-parent': 0}
    for num in num_children:
        if float(num) > 0:
            number_of_insurants_by_parent_or_not['parent'] += 1
        else:
            number_of_insurants_by_parent_or_not['not-parent'] += 1
    return number_of_insurants_by_parent_or_not

Above are the implementations of several more functions which will help us in the analysis and give us answers to the new questions.
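
The comment on the standard deviation function notes that going through the list only once would be better. A minimal sketch of such a single-pass version is shown below. The name `get_mean_and_std` and the sample values are hypothetical. Like the function above, it computes the population standard deviation (dividing by the number of values).

```python
def get_mean_and_std(values):
    """Return (mean, population standard deviation) using one pass over values."""
    count = 0
    total = 0.0
    total_squares = 0.0
    for value in values:
        number = float(value)
        count += 1
        total += number
        total_squares += number ** 2
    mean = total / count
    # variance = mean of squares minus square of mean;
    # max() guards against a tiny negative result caused by rounding
    variance = max(total_squares / count - mean ** 2, 0.0)
    return mean, variance ** 0.5

# Hypothetical sample stored as strings, as in the lists loaded from the CSV file
sample_values = ["2", "4", "4", "4", "5", "5", "7", "9"]
sample_mean, sample_std = get_mean_and_std(sample_values)
print(sample_mean, sample_std)  # 5.0 2.0
```

This sum-of-squares formula is simple but can lose precision when the values are very large compared to their spread. For ages in a range of a few decades that should not be a concern.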

---

## The fifth stage: The answers to the second questions

Recall that the average age is 39 years. Now we can calculate the standard deviation.

In [49]:
print("Standard deviation for the age of insurants equals {}".format(round(get_standart_deviation_from_average_age(), 0)))

Standard deviation for the age of insurants equals 14.0


A standard deviation of 14 years around an average of 39 years means that the ages of the insurants are widely spread, and there are no grounds to single out a particular age group of people who buy insurance. At least we cannot say so at this point. The picture would certainly become clearer after exploring the age distribution.
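
One simple way to explore the age distribution with plain Python is to count the insurants in each decade of age. The sketch below is an assumption about how this could be done, not part of the original analysis. The function name `count_by_decade` and the sample ages are hypothetical.

```python
from collections import Counter

def count_by_decade(age_values):
    """Return {first year of decade: number of people}, sorted by decade."""
    decades = Counter()
    for age in age_values:
        # e.g. 33 -> 30, 19 -> 10
        decade_start = int(float(age)) // 10 * 10
        decades[decade_start] += 1
    return dict(sorted(decades.items()))

# Hypothetical sample stored as strings, as in the ages list
sample_ages = ["19", "18", "28", "33", "32", "31", "46", "64"]
print(count_by_decade(sample_ages))
# {10: 2, 20: 1, 30: 3, 40: 1, 60: 1}
```

Calling `count_by_decade(ages)` on the real list would show whether some decades are clearly over-represented.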

The previous analysis showed that the Southeast region has the largest number of insurants: 364.
Now we look at how the other regions differ from it.

In [52]:
regions_insurants_difference_with_max = get_regions_insurants_difference_with_max()
for region, diff_num_insurants in regions_insurants_difference_with_max.items():
    print("In the {} region total amount of insurants equals {} and it is {} less than the maximal amount.".format(region.title(), get_regions_by_number_of_insurants()[region], diff_num_insurants))

In the Southwest region total amount of insurants equals 325 and it is 39 less than the maximal amount.
In the Northwest region total amount of insurants equals 325 and it is 39 less than the maximal amount.
In the Northeast region total amount of insurants equals 324 and it is 40 less than the maximal amount.


We have no grounds to say that the Southeast is not the largest region by number of insurants, but the difference from the other regions is small, and the other regions are surprisingly close to each other.
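
Expressing the counts as shares of all insurants makes the comparison easier to read. The calculation below uses the counts printed above. The dictionary is typed in by hand for this illustration.

```python
# Number of insurants per region, taken from the outputs above
region_counts = {
    "southeast": 364,
    "southwest": 325,
    "northwest": 325,
    "northeast": 324,
}

total_insurants = sum(region_counts.values())
shares = {
    region: round(100 * count / total_insurants, 1)
    for region, count in region_counts.items()
}

print(total_insurants)  # 1338
print(shares)
# {'southeast': 27.2, 'southwest': 24.3, 'northwest': 24.3, 'northeast': 24.2}
```

The Southeast accounts for about 27% of the insurants, while each of the other regions accounts for about 24%.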

We have found a significant difference in charges depending on whether an insurant is a smoker or not.
Now comes a more worrying question: what is the difference between genders?

In [55]:
average_charge_for_genders = get_average_charge_for_genders()
average_price_male = average_charge_for_genders['male']
average_price_female = average_charge_for_genders['female']
print("On average, the insurance costs {} for males and {} for females. The difference is {}.".format(round(average_price_male, 0), round(average_price_female, 0), round(abs(average_price_male - average_price_female), 0)))

On average, the insurance costs 13957.0 for males and 12570.0 for females. The difference is 1387.0.


It is not an ideal situation: in this dataset males are charged on average a little more than females. But this difference is small compared with the difference between smokers and non-smokers.
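
The functions for smokers and for genders follow the same pattern: sum the charges per category and divide by the number of insurants in that category. A minimal sketch of one generic helper is shown below. The name `get_average_by_category` and the sample data are hypothetical and are not part of the original notebook.

```python
def get_average_by_category(categories, values):
    """Return {category: average of the values that belong to it}."""
    totals = {}
    counts = {}
    # zip pairs the i-th category with the i-th value
    for category, value in zip(categories, values):
        totals[category] = totals.get(category, 0.0) + float(value)
        counts[category] = counts.get(category, 0) + 1
    return {category: totals[category] / counts[category] for category in totals}

# Hypothetical sample stored as strings, as in the lists loaded from the CSV file
sample_sexes = ["female", "male", "male", "female"]
sample_charges = ["100.0", "300.0", "500.0", "200.0"]
print(get_average_by_category(sample_sexes, sample_charges))
# {'female': 150.0, 'male': 400.0}
```

With the real lists, `get_average_by_category(sexes, insurance_charges)` or `get_average_by_category(smoker_statuses, insurance_charges)` would be expected to reproduce the averages reported above.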

The last question we have is about parents and non-parents.

In [58]:
number_of_insurants_by_parent_or_not = get_number_of_insurants_by_parent_or_not()
print("Among all the insurants there are {} parents and {} non-parents".format(number_of_insurants_by_parent_or_not['parent'], number_of_insurants_by_parent_or_not['not-parent']))

Among all the insurants there are 764 parents and 574 non-parents


As we can see, more of the insurants are parents than non-parents. But we still cannot say why exactly this is so. Let's take it as a fact for now.

## The final stage: Conclusions

At the beginning we had a dataset with records of different insurants. We loaded it into separate lists, one per column. Then we started the analysis and asked the first several questions, and we implemented functions which helped us to find the answers. The answers inspired another series of questions, and again we implemented functions to answer them. This is what we have after our work:
 * The average age of insurants equals 39 years with a standard deviation of 14 years, so it looks like people of different ages buy insurance.
 * The largest group of insurants, 364, is from the Southeast region. But there are also many insurants from the other regions, and the difference is not large: 39-40 insurants.
 * Your habits can greatly increase the cost of insurance if you are a smoker. On average, non-smokers pay 8434 US dollars for the insurance and smokers pay 32050 US dollars, which is 23616 US dollars more. The cost depends much less on whether you are male or female: 13957 and 12570 US dollars respectively.
 * The average age of insurants who are parents equals 40 years, just a little more than the average age of all insurants, which equals 39 years. It is interesting to note that there are more parents among the insurants: 764, while only 574 insurants are non-parents.

## Afterword

These are certainly not all the questions that could interest us and be answered, but we have to leave the rest outside the scope of the project. The primary objective of this project was to demonstrate proficiency in Python programming, data manipulation, and analysis techniques. By investigating a CSV file with medical insurance costs, the project showcases the ability to derive insights from real-world data.