# U.S. Medical Insurance Costs

#### Short description of the project
Open `insurance.csv` and take a look at the file. Take note of how the information is organized. How will this affect how you analyze the data in Python? Is there anything of particular interest in the dataset that you want to investigate? Think about these things before you jump into analyzing it.

In [1]:
# Libraries
import csv
%run 01_functions.ipynb

In [2]:
# Opening the .csv database to have a look at the data
with open("insurance.csv", newline='') as insurance_database:
    # Read only the header and the first five records
    content_6_lines = [insurance_database.readline() for _ in range(6)]
content_6_lines

['age,sex,bmi,children,smoker,region,charges\r\n',
 '19,female,27.9,0,yes,southwest,16884.924\r\n',
 '18,male,33.77,1,no,southeast,1725.5523\r\n',
 '28,male,33,3,no,southeast,4449.462\r\n',
 '33,male,22.705,0,no,northwest,21984.47061\r\n',
 '32,male,28.88,0,no,northwest,3866.8552\r\n']

## About the data
The database contains people aged 18 to 64, reporting their sex and BMI.
Additionally, it records how many children they have, whether they smoke, and in which region they reside.
Finally, the total cost of their insurance is provided.<br>
About how the data is stored:
>a) There is no missing data.<br>
>b) There are seven columns.<br>
>c) Some columns are numerical while some are categorical.
***
## Things to investigate
1.) Are women and men equally represented?<br>
>1.1) Is there a difference in charges between men and women?<br>

2.) How about having children: is the data 50-50 or is there a majority?
>2.1) Are there more women or men listed with children?<br>
>2.2) What is the average age of parents and non-parents?

3.) How about smokers?
>3.1) Do smokers pay more?

4.) Is there a region that appears more expensive than the others, and which is the least expensive?
>4.1) How many regions are there, and are they proportionally represented in the database?<br>

5.) Dividing the data by age (18-29, 30-39, 40-49, ...), which is the largest group? Is each category well represented?
***

An ID is assigned to each line of the .csv database and used as a key to build a dictionary containing all the information about that person.
Additionally, the numerical values are converted to numbers, and two-valued parameters such as "smoker" are encoded as 0 or 1.
The result is a comprehensive dictionary that is easy to access.

In [3]:
Insurance_dictionary_by_ID = from_csv_to_dict_by_ID("insurance.csv")

In the new dictionary, each user is assigned an ID that serves as the key for the corresponding information.
As an example, the first entry now looks like this:
 ID_1 :  {'age': 19.0, 'sex': 0, 'bmi': 27.9, 'children': 0, 'smoker': 1, 'region': 'southwest', 'charges': 16884.924}
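
`from_csv_to_dict_by_ID` is loaded from `01_functions.ipynb`, so its code does not appear in this notebook. Below is a minimal sketch of how such a helper might work. The name `sketch_csv_to_dict_by_id` and the encodings (female → 0, male → 1, smoker yes → 1, no → 0) are assumptions inferred from the example entry above, not the author's actual implementation. It accepts any file-like object, so two rows of the file are inlined for illustration.

```python
import csv
import io


def sketch_csv_to_dict_by_id(csv_file):
    """Hypothetical stand-in for from_csv_to_dict_by_ID (the real helper is not shown)."""
    # Assumed encodings, inferred from the ID_1 example entry
    sex_codes = {"female": 0, "male": 1}
    smoker_codes = {"no": 0, "yes": 1}
    clients = {}
    # enumerate(..., start=1) gives the first data row the key "ID_1"
    for row_number, row in enumerate(csv.DictReader(csv_file), start=1):
        clients["ID_{}".format(row_number)] = {
            "age": float(row["age"]),
            "sex": sex_codes[row["sex"]],
            "bmi": float(row["bmi"]),
            "children": int(row["children"]),
            "smoker": smoker_codes[row["smoker"]],
            "region": row["region"],
            "charges": float(row["charges"]),
        }
    return clients


# First rows of insurance.csv, inlined so the sketch runs without the file
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\r\n"
    "19,female,27.9,0,yes,southwest,16884.924\r\n"
    "18,male,33.77,1,no,southeast,1725.5523\r\n",
    newline="",
)
sample_clients = sketch_csv_to_dict_by_id(sample)
print(sample_clients["ID_1"])
```

With a real file, the same function could be called as `sketch_csv_to_dict_by_id(open("insurance.csv", newline=""))`, ideally inside a `with` block.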


#### 1.) Are women and men equally represented? What's the percentage?

In [4]:
list_ID_male, male_percent, list_ID_female, female_percent = proportion_on_parameter(Insurance_dictionary_by_ID, "sex")

print("Totally the US Insurance has " + str(len(Insurance_dictionary_by_ID.keys())) + " clients.\n"
      "Specifically there are {n_males} males and {n_females} females.\n"
      "The proportion is then {male_perc}% and {female_perc}% respectively."
      .format(n_males=len(list_ID_male), male_perc=male_percent,
              n_females=len(list_ID_female), female_perc=female_percent))

Totally the US Insurance has 1338 clients.
Specifically there are 676 males and 662 females.
The proportion is then 50.52% and 49.48% respectively.


Therefore, we can say that men and women are equally represented. How does the average insurance cost compare between the two groups?

In [5]:
avg_male_cost = average(list_ID_male, 'charges')
avg_female_cost = average(list_ID_female, 'charges')
print("Clients who are men pay on average:", avg_male_cost, "dollars.")
print("Clients who are women pay on average:", avg_female_cost, "dollars.")

Clients who are men pay on average: 13956.8 dollars.
Clients who are women pay on average: 12569.6 dollars.
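
The helpers `proportion_on_parameter` and `average` used in the cells above are loaded from `01_functions.ipynb`, so their code does not appear in this notebook. The sketch below shows one way they could work. The `sketch_` names, the explicit `clients` argument, the "non-zero value means first group" rule, and the rounding (2 decimals for percentages, 1 for averages) are assumptions inferred from the printed outputs, not the author's implementation.

```python
def sketch_proportion_on_parameter(clients, parameter):
    """Hypothetical stand-in for proportion_on_parameter (real helper not shown)."""
    # Assumption: a client belongs to the first group when the value is non-zero
    ids_with = [cid for cid, info in clients.items() if info[parameter]]
    ids_without = [cid for cid, info in clients.items() if not info[parameter]]
    total = len(clients)
    return (ids_with, round(len(ids_with) / total * 100, 2),
            ids_without, round(len(ids_without) / total * 100, 2))


def sketch_average(clients, id_list, parameter):
    """Hypothetical stand-in for average; takes the dictionary explicitly."""
    values = [clients[cid][parameter] for cid in id_list]
    return round(sum(values) / len(values), 1)


# Three toy entries in the same shape as the notebook's dictionary
toy_clients = {
    "ID_1": {"sex": 0, "children": 0, "smoker": 1, "charges": 16884.924},
    "ID_2": {"sex": 1, "children": 1, "smoker": 0, "charges": 1725.5523},
    "ID_3": {"sex": 1, "children": 3, "smoker": 0, "charges": 4449.462},
}
males, male_pct, females, female_pct = sketch_proportion_on_parameter(toy_clients, "sex")
print(males, male_pct, females, female_pct)
print(sketch_average(toy_clients, males, "charges"))
```

Because the split relies on truthiness, the same function handles `"sex"`, `"smoker"` (0 or 1) and `"children"` (0 versus any positive count).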


In [6]:
# Calculating how much more men pay compared to women
print("Therefore on average, men pay", round(abs(avg_female_cost-avg_male_cost)/avg_female_cost * 100,1),
      "% more than women.")

Therefore on average, men pay 11.0 % more than women.
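
The percentage above is a relative difference with the lower average as the baseline. As a small illustration, the calculation can be wrapped in a function; `percent_more` is a hypothetical name, not one of the notebook's helpers.

```python
def percent_more(higher, lower):
    """Relative difference of `higher` over `lower`, in percent, rounded to 1 decimal."""
    return round((higher - lower) / lower * 100, 1)


# Averages printed by the previous cell: men 13956.8, women 12569.6
men_vs_women = percent_more(13956.8, 12569.6)
print(men_vs_women)
```

The choice of baseline matters: dividing by the men's average instead would describe the same gap as women paying about 9.9% less than men.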


#### 2.) How about having children: is the data 50-50 or is there a majority? And what's the average number of children for those who have them?

In [7]:
list_ID_parents, parents_percent, list_ID_no_parents, no_parent_percent = proportion_on_parameter(Insurance_dictionary_by_ID, "children")

print("Totally there are {n_parents} parents and {n_no_parents} who do not have children.\n"
      "The proportion is then {parents_percent}% and {no_parent_percent}% respectively."
      .format(n_parents=len(list_ID_parents), parents_percent=parents_percent,
              n_no_parents=len(list_ID_no_parents), no_parent_percent=no_parent_percent))

Totally there are 764 parents and 574 who do not have children.
The proportion is then 57.1% and 42.9% respectively.


This means that more than half of the insured people are parents. It can be interesting to explore whether their average insurance cost is higher than that of people without children.

In [8]:
# Checking the average cost of insurance paid by parents vs. that paid by clients without children
avg_parent_cost = average(list_ID_parents, 'charges')
avg_no_parent_cost = average(list_ID_no_parents, 'charges')
print("Clients with children pay on average:", avg_parent_cost, "dollars.")
print("Clients who do not have children pay on average:", avg_no_parent_cost, "dollars.")

Clients with children pay on average: 13949.9 dollars.
Clients who do not have children pay on average: 12366.0 dollars.


In [9]:
# Calculating how much more parents pay compared to non-parents
print("Therefore on average, parents pay", round(abs(avg_no_parent_cost-avg_parent_cost)/avg_no_parent_cost * 100,1),
      "% more than clients without children.")

Therefore on average, parents pay 12.8 % more than clients without children.


In [10]:
average_number_children = average(list_ID_parents, "children")
print("Among the parents user_ID, the average number of children is " + str(average_number_children))

Among the parents user_ID, the average number of children is 1.9


##### 2.1.) Are there more women or men listed with children?

In [11]:
# Counting how many men and women there are among the parent IDs previously retrieved
fathers = 0
mothers = 0
for client_id in list_ID_parents:
    # 'sex' is encoded as 1 for male and 0 for female
    if Insurance_dictionary_by_ID[client_id]['sex'] == 1:
        fathers += 1
    else:
        mothers += 1
print("Totally there are {n_dads} fathers and {n_mums} mothers.\n"
      "The proportion is then {dads_perc}% and {mums_perc}% respectively."
      .format(n_dads=fathers, dads_perc=round(fathers/(fathers+mothers) * 100, 2),
              n_mums=mothers, mums_perc=round(mothers/(fathers+mothers) * 100, 2)))

Totally there are 391 fathers and 373 mothers.
The proportion is then 51.18% and 48.82% respectively.


>>Therefore, we can conclude that mothers and fathers are equally represented.

##### 2.2) What is the average age of parents and non-parents?

In [12]:
avg_parent_age = average(list_ID_parents, 'age')
avg_no_parent_age = average(list_ID_no_parents, 'age')
print("Clients who have children are on average:", avg_parent_age, "years old.")
print("Clients who don't have children are on average:", avg_no_parent_age, "years old.")

Clients who have children are on average: 39.8 years old.
Clients who don't have children are on average: 38.4 years old.


#### 3.) How about smokers?

In [13]:
list_ID_smokers, smokers_percent, list_ID_no_smokers, no_smokers_percent = proportion_on_parameter(Insurance_dictionary_by_ID, "smoker")

print("Totally there are {n_smokers} smokers and {n_no_smokers} who do not smoke.\n"
      "The proportion is then {smokers_percent}% and {no_smokers_percent}% respectively."
      .format(n_smokers=len(list_ID_smokers), smokers_percent=smokers_percent,
              n_no_smokers=len(list_ID_no_smokers), no_smokers_percent=no_smokers_percent))

Totally there are 274 smokers and 1064 who do not smoke.
The proportion is then 20.48% and 79.52% respectively.


Hence, the majority of the users are non-smokers. It might be interesting to check whether the average insurance cost is higher for smokers, and therefore whether smoking has a major impact on the cost they pay.

In [14]:
avg_smoker_cost = average(list_ID_smokers, 'charges')
avg_no_smoker_cost = average(list_ID_no_smokers, 'charges')
print("Clients who smoke pay on average:", avg_smoker_cost, "dollars.")
print("Clients who don't smoke pay on average:", avg_no_smoker_cost, "dollars.")

Clients who smoke pay on average: 32050.2 dollars.
Clients who don't smoke pay on average: 8434.3 dollars.


In [15]:
# Calculating how much more smokers pay compared to non-smokers
print("Therefore on average, smokers pay", round(abs(avg_no_smoker_cost-avg_smoker_cost)/avg_no_smoker_cost * 100,1),
      "% more than non smokers.")

Therefore on average, smokers pay 280.0 % more than non smokers.


#### 4.) Is there a region that appears more expensive than the others, and which is the least expensive?
##### 4.1) How many regions are there, and are they proportionally represented in the database?

In [16]:
# Checking the number of different regions
# Listing the region of every client, then keeping the distinct regions (in order of first appearance) with "dict.fromkeys()"
regions_all_id = [client['region'] for client in Insurance_dictionary_by_ID.values()]

regions = list(dict.fromkeys(regions_all_id))
print("Totally there are {n_reg} regions:\n"
     "{regions}.".format(n_reg=len(regions), regions=regions))

Totally there are 4 regions:
['southwest', 'southeast', 'northwest', 'northeast'].
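
The cell above relies on `dict.fromkeys()` to drop duplicates while keeping the order in which each region first appears (dictionaries preserve insertion order in Python 3.7 and later). A short illustration on a made-up list, with `set()` shown for contrast:

```python
# Made-up list with repeated regions, only for illustration
regions_with_repeats = ["southwest", "southeast", "southeast",
                        "northwest", "northwest", "southwest", "northeast"]

# Keys of a dict are unique and keep their insertion order
distinct_in_order = list(dict.fromkeys(regions_with_repeats))
print(distinct_in_order)

# A set also removes duplicates, but its iteration order is not guaranteed
distinct_unordered = set(regions_with_repeats)
print(len(distinct_unordered))
```

Keeping a stable order is convenient here because the regions list is reused later as the set of dictionary keys.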


In [17]:
# Checking if users are proportional in each region: numbers and percentages
id_by_regions_dict = {}
n_and_percent_users_regions = {}
avg_region_cost_dict = {}
n_clients = len(Insurance_dictionary_by_ID)
for site in regions:
    # IDs of all clients living in this region
    ids_in_site = [client_id for client_id in Insurance_dictionary_by_ID
                   if Insurance_dictionary_by_ID[client_id]['region'] == site]
    id_by_regions_dict[site] = ids_in_site
    # [number of clients, percentage of the total]
    n_and_percent_users_regions[site] = [len(ids_in_site), round(len(ids_in_site) / n_clients * 100, 1)]
    avg_region_cost_dict[site] = average(ids_in_site, 'charges')
print('Number of clients and their percentage among the total for each region:\n',n_and_percent_users_regions)
print('')
# Rearranging the average cost dictionary from least expensive to most expensive and overwriting it
avg_region_cost_dict = dict(sorted(avg_region_cost_dict.items(), key=lambda item: item[1]))
print('Average cost depending on region are:\n', avg_region_cost_dict)

Number of clients and their percentage among the total for each region:
 {'southwest': [325, 24.3], 'southeast': [364, 27.2], 'northwest': [325, 24.3], 'northeast': [324, 24.2]}

Average cost depending on region are:
 {'southwest': 12346.9, 'northwest': 12417.6, 'northeast': 13406.4, 'southeast': 14735.4}
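
The cell above scans the whole dictionary once per region. As an alternative sketch (not the notebook's method), the IDs can be grouped in a single pass with `collections.defaultdict`. The toy entries below are illustrative and contain only the fields the example needs.

```python
from collections import defaultdict

# Toy entries in the same shape as the notebook's dictionary
toy_clients = {
    "ID_1": {"region": "southwest", "charges": 16884.924},
    "ID_2": {"region": "southeast", "charges": 1725.5523},
    "ID_3": {"region": "southeast", "charges": 4449.462},
    "ID_4": {"region": "northwest", "charges": 21984.47061},
}

# One pass: each ID is appended to the list of its own region
ids_by_region = defaultdict(list)
for client_id, info in toy_clients.items():
    ids_by_region[info["region"]].append(client_id)

# Percentage of clients per region, rounded to 1 decimal
share_by_region = {region: round(len(ids) / len(toy_clients) * 100, 1)
                   for region, ids in ids_by_region.items()}
print(dict(ids_by_region))
print(share_by_region)
```

With only four regions the difference in running time is negligible; the single pass mainly avoids needing the list of regions in advance.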


In [18]:
# Calculating the delta between the cheapest and the most expensive region
delta_cheap_expensive = round( (list(avg_region_cost_dict.values())[-1] - list(avg_region_cost_dict.values())[0])
                              /list(avg_region_cost_dict.values())[0] * 100, 2 )
print(delta_cheap_expensive,'%')

19.34 %


#### 5.) Dividing the data by age (18-29, 30-39, 40-49, ...), which is the largest group? Is each category well represented?

In [19]:
# Determining max and min age in the database.
# max()/min() return the first client with the extreme age, so no starting guess is needed
id_max = max(Insurance_dictionary_by_ID, key=lambda client_id: Insurance_dictionary_by_ID[client_id]['age'])
id_min = min(Insurance_dictionary_by_ID, key=lambda client_id: Insurance_dictionary_by_ID[client_id]['age'])
max_age = int(Insurance_dictionary_by_ID[id_max]['age'])
min_age = int(Insurance_dictionary_by_ID[id_min]['age'])
print('The first oldest client', str(id_max),'is', max_age, 'years old')
print('The first youngest client', str(id_min),'is', min_age, 'years old\n')

# Creating a dictionary whose keys cover ranges of 10 years (18-29, 30-39, ...) and that sorts all ids by them
# With min_age = 18 and max_age = 64, round(..., -1) gives 20 and 60, so the keys run from age_range_20 to age_range_60
lower_age_list = []
id_by_age_group_dict = {}
for i in range(round(min_age, -1), round(max_age,-1) + 10 , 10):
    lower_age_list.append(i)
    id_by_age_group_dict["age_range_{i}".format(i=str(i))] = []
print('The dictionary id of people grouped by age range looks like this:\n', id_by_age_group_dict,'\n')

# Filling the dictionary with all the ids:
for client_id in Insurance_dictionary_by_ID:
    client_age = Insurance_dictionary_by_ID[client_id]['age']
    # separate if clause because the first range also includes ages below 20 (for example 18)
    if client_age < 30:
        id_by_age_group_dict['age_range_20'].append(client_id)
    else:
        for lower_age in lower_age_list[1:]:
            if lower_age <= client_age < lower_age + 10:
                id_by_age_group_dict['age_range_{age}'.format(age=str(lower_age))].append(client_id)
print('After filling it, as an example the people older than 60 and under 70 are:\n',
      id_by_age_group_dict['age_range_60'], '\n')

# Calculating the percentage of each group among the total and their respective average insurance cost
for age_range in id_by_age_group_dict.keys():
    avg_cost_by_age = average(id_by_age_group_dict[age_range], 'charges')
    print('Clients in the',str(age_range),'are '+
          str(round(len(id_by_age_group_dict[age_range])/
                    len(Insurance_dictionary_by_ID.values()) * 100,2))+'% of the total.')
    print('On average their insurance costs are: '+str(avg_cost_by_age)+'$.\n')

The first oldest client ID_63 is 64 years old
The first youngest client ID_2 is 18 years old

The dictionary id of people grouped by age range looks like this:
 {'age_range_20': [], 'age_range_30': [], 'age_range_40': [], 'age_range_50': [], 'age_range_60': []} 

After filling it, as an example the people older than 60 and under 70 are:
 ['ID_10', 'ID_12', 'ID_21', 'ID_27', 'ID_34', 'ID_37', 'ID_40', 'ID_49', 'ID_63', 'ID_67', 'ID_95', 'ID_104', 'ID_110', 'ID_116', 'ID_132', 'ID_171', 'ID_176', 'ID_191', 'ID_200', 'ID_203', 'ID_209', 'ID_245', 'ID_247', 'ID_252', 'ID_288', 'ID_329', 'ID_331', 'ID_333', 'ID_336', 'ID_337', 'ID_338', 'ID_342', 'ID_343', 'ID_344', 'ID_371', 'ID_379', 'ID_380', 'ID_399', 'ID_403', 'ID_419', 'ID_420', 'ID_421', 'ID_422', 'ID_434', 'ID_436', 'ID_447', 'ID_463', 'ID_467', 'ID_476', 'ID_481', 'ID_492', 'ID_494', 'ID_500', 'ID_532', 'ID_535', 'ID_543', 'ID_551', 'ID_553', 'ID_574', 'ID_589', 'ID_604', 'ID_636', 'ID_643', 'ID_665', 'ID_669', 'ID_678', 'ID_716', 
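
The grouping loop in cell 19 compares each age against every lower bound. A possible alternative, sketched here as an assumption rather than the notebook's method, derives the group key directly from the age with integer division. `sketch_age_group_key` and its `first_group` parameter are hypothetical names.

```python
def sketch_age_group_key(age, first_group=20):
    """Map an age to a key such as 'age_range_30' (hypothetical helper)."""
    # Floor the age to its decade; ages below `first_group` join the first group,
    # mirroring the notebook's choice of putting 18- and 19-year-olds in age_range_20
    decade = max(first_group, int(age) // 10 * 10)
    return "age_range_{}".format(decade)


for age in (18, 29, 30, 64):
    print(age, sketch_age_group_key(age))
```

Each client would then be appended to `id_by_age_group_dict[sketch_age_group_key(age)]` without the inner loop over lower bounds.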

In [20]:
global_cost_average = average(Insurance_dictionary_by_ID, 'charges')
print(global_cost_average)

13270.4


## Summary
The dataset contains 1338 clients overall: 676 men and 662 women, meaning that they are equally represented in the database.<br>
However, on average men appear to pay 11% more than women for their insurance.

More than half of the clients (57%) are parents, and fathers and mothers are in nearly the same proportion: 51% vs 49% respectively. On average, parents have 1.9 children.<br>
There is not much difference in age: parents are on average nearly 40 years old, while non-parents are slightly younger at about 38.<br>
Regarding children, clients who are parents pay on average 12.8% more than clients without them.

Checking the smokers, the majority of the clients, nearly 80%, do not smoke. On average, smokers' insurance charges are 280% higher than those of non-smokers. This suggests that smoking is a key factor in the final cost, but it is important to note that the sample of smokers is smaller (274 clients), which can lead to misleading conclusions.

Looking at the regions, the list contains a heterogeneous group of people living in all four regions of the US. More specifically:

- southwest: 24.3%
- southeast: 27.2%
- northwest: 24.3%
- northeast: 24.2%

The average cost is broadly similar across regions. The greatest gap is between the southwest (cheapest) and the southeast (most expensive), which is on average 19.34% more expensive.<br>

When it comes to age groups, all clients were divided into 10-year ranges (18-29, 30-39, 40-49, ...) to identify which one is most represented and how the average insurance cost varies between them.
As a result:<br>

    Clients in the age_range_20 are 31.17% of the total.
    On average their insurance costs are: 9182.5$.

    Clients in the age_range_30 are 19.21% of the total.
    On average their insurance costs are: 11738.8$.

    Clients in the age_range_40 are 20.85% of the total.
    On average their insurance costs are: 14399.2$.

    Clients in the age_range_50 are 20.25% of the total.
    On average their insurance costs are: 16495.2$.

    Clients in the age_range_60 are 8.52% of the total.
    On average their insurance costs are: 21248.0$.

Therefore, the largest group of clients is aged between 18 and 29, and they also pay the lowest charges; this might be partially explained by the fact that they generally do not have children yet.
Clients aged 60 and above pay the most, but they are fewer than 9% of the total. Excluding them, clients between 50 and 59 pay the most, and their charges are closer to the global average of 13270.4$.
