**Step 1 - Look over your dataset**

Since the data is stored in a CSV file, I'll use the csv module built into Python.

In [32]:
import csv
from decimal import Decimal

**Step 2 - Scoping Your Project**

This dataset has plenty of information stored. The columns are age, sex, bmi, children, smoker, region, and charges. I want to see what contributes most to the cost of insurance. Does region matter? In the U.S., smoking has carried a stigma for some time, so I expect to see higher costs for those who smoke. Are BMI and the number of children related, and what role does each play in cost? We know that on average women live longer than men and are more likely to visit the doctor, so does this factor into the cost? Also, given that many women give birth and that pregnancy takes a toll on the body, one can reasonably anticipate that costs for women will be higher.

While it would be great to get charts and plots of this information, that is beyond the scope of this current project. In the future, I'd also love to formulate a prediction of cost.

From Codecademy these are some things to look for within the data: 
- Find out the average age of the patients in the dataset.
- Analyze where a majority of the individuals are from.
- Look at the different costs between smokers vs. non-smokers.
- Figure out what the average age is for someone who has at least one child in this dataset.

My own additions:
- On average, how many children does each sex have in this dataset?
- Are men or women in this dataset more likely to smoke?
- Which region has the highest BMIs?
- Which region has the lowest costs? 

If I come across more questions, I will add them here.


**Step 2 (cont.) - Bias**

In terms of age, the data covers only working-age adults. There is no data on senior or elderly people, so this dataset should not be expected to represent insurance costs for older age groups.

So far, there is no clear bias based on sex. 

For BMI, I am contemplating putting the numbers into categories based on what is considered healthy or not. While there is contention around BMI and its usefulness to determine health, for this dataset BMI is used to determine insurance costs. 
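
The categorization idea above could be sketched as follows. This is only an illustration: the function name `bmi_category` and the labels are my own, and the cut-offs (18.5, 25, 30) are the commonly cited WHO adult thresholds, which I am assuming are appropriate here. The sample values are BMIs from the first rows of the dataset.

```python
def bmi_category(bmi):
    """Return a coarse weight category for a BMI value.

    Assumption: uses the commonly cited WHO adult cut-offs (18.5, 25, 30).
    """
    if bmi < 18.5:
        return "underweight"
    elif bmi < 25:
        return "healthy"
    elif bmi < 30:
        return "overweight"
    return "obese"


# BMI values taken from the first five rows of insurance.csv
sample_bmis = [27.9, 33.77, 33.0, 22.705, 28.88]
sample_categories = [bmi_category(bmi) for bmi in sample_bmis]
print(sample_categories)
```

Grouping this way would let later questions compare costs per category instead of per raw BMI value.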



**Step 3 - Import your dataset**



In [216]:
with open('insurance.csv') as ins_data:
    insurance = csv.DictReader(ins_data)
    for row in insurance:
        print(row, "\n")

{'age': '19', 'sex': 'female', 'bmi': '27.9', 'children': '0', 'smoker': 'yes', 'region': 'southwest', 'charges': '16884.924'} 

{'age': '18', 'sex': 'male', 'bmi': '33.77', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '1725.5523'} 

{'age': '28', 'sex': 'male', 'bmi': '33', 'children': '3', 'smoker': 'no', 'region': 'southeast', 'charges': '4449.462'} 

{'age': '33', 'sex': 'male', 'bmi': '22.705', 'children': '0', 'smoker': 'no', 'region': 'northwest', 'charges': '21984.47061'} 

{'age': '32', 'sex': 'male', 'bmi': '28.88', 'children': '0', 'smoker': 'no', 'region': 'northwest', 'charges': '3866.8552'} 

{'age': '31', 'sex': 'female', 'bmi': '25.74', 'children': '0', 'smoker': 'no', 'region': 'southeast', 'charges': '3756.6216'} 

{'age': '46', 'sex': 'female', 'bmi': '33.44', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '8240.5896'} 

{'age': '37', 'sex': 'female', 'bmi': '27.74', 'children': '3', 'smoker': 'no', 'region': 'northwest', 'ch

{'age': '38', 'sex': 'female', 'bmi': '19.475', 'children': '2', 'smoker': 'no', 'region': 'northwest', 'charges': '6933.24225'} 

{'age': '61', 'sex': 'male', 'bmi': '36.1', 'children': '3', 'smoker': 'no', 'region': 'southwest', 'charges': '27941.28758'} 

{'age': '53', 'sex': 'female', 'bmi': '26.7', 'children': '2', 'smoker': 'no', 'region': 'southwest', 'charges': '11150.78'} 

{'age': '44', 'sex': 'female', 'bmi': '36.48', 'children': '0', 'smoker': 'no', 'region': 'northeast', 'charges': '12797.20962'} 

{'age': '19', 'sex': 'female', 'bmi': '28.88', 'children': '0', 'smoker': 'yes', 'region': 'northwest', 'charges': '17748.5062'} 

{'age': '41', 'sex': 'male', 'bmi': '34.2', 'children': '2', 'smoker': 'no', 'region': 'northwest', 'charges': '7261.741'} 

{'age': '51', 'sex': 'male', 'bmi': '33.33', 'children': '3', 'smoker': 'no', 'region': 'southeast', 'charges': '10560.4917'} 

{'age': '40', 'sex': 'male', 'bmi': '32.3', 'children': '2', 'smoker': 'no', 'region': 'northwest',

**Step 4 - Save your dataset via Python variables** 

- Ages list
- List of Ages for all those with children
- List of Ages for all those without children
- Dictionary of Regions to persons in regions
- Total of costs for smokers
- Total of costs for non-smokers
- Average of costs for smokers and for non-smokers
- List of children for women
- List of children for men
- Average of children for women
- Average of children for men
- Dictionary of smokers by sex example: {female : 5, male : 5}
- Dictionary of non-smokers
- List of BMIs per region
- Average of BMIs per region
- List of costs per region
- Average of Cost per region
- Max and Min of cost per region
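
One possible way to build the variables listed above is to read the file once and convert the numeric columns up front, so each later question can work from the same list. This is a sketch, not the approach used in the cells below: `load_records` and `SAMPLE_CSV` are hypothetical names, and the three sample rows are copied from the Step 3 output so the example runs without the file.

```python
import csv
import io

# Three rows copied from the Step 3 output. With the real file, pass
# open('insurance.csv', newline='') instead of io.StringIO(SAMPLE_CSV).
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33,3,no,southeast,4449.462
"""


def load_records(handle):
    """Read every row once and convert numeric columns to numbers."""
    records = []
    for row in csv.DictReader(handle):
        records.append({
            "age": int(row["age"]),
            "sex": row["sex"],
            "bmi": float(row["bmi"]),
            "children": int(row["children"]),
            "smoker": row["smoker"] == "yes",  # store as a boolean
            "region": row["region"],
            "charges": float(row["charges"]),
        })
    return records


records = load_records(io.StringIO(SAMPLE_CSV))
ages = [record["age"] for record in records]
parent_ages = [record["age"] for record in records if record["children"] > 0]
print(ages, parent_ages)
```

With typed records in memory, most of the lists in Step 4 become one-line comprehensions.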



*Question 1: What is the average age of patients in this dataset?*

In [38]:
with open('insurance.csv') as ins_data:
    insurance = csv.DictReader(ins_data)
    # Read the 'age' column directly and convert each value to an int
    ages_int = [int(row['age']) for row in insurance]

sum_ages = sum(ages_int)
ages_len = len(ages_int)
average_age = sum_ages/ages_len
print(average_age)

39.20702541106129


The average age for all persons included in the dataset is about 39.2 years.

*Question 2: What is the average age of those who have at least 1 child in the dataset?*


In [49]:
with open('insurance.csv') as ins_data:
    insurance = csv.DictReader(ins_data)
    parent_ages = []
    no_kids_ages = []
    for row in insurance:
        # 'children' is read as a string, so compare against '0'
        if row['children'] != '0':
            parent_ages.append(int(row['age']))
        else:
            no_kids_ages.append(int(row['age']))

how_many_parents = len(parent_ages)
print(f'There are {how_many_parents} parents in this dataset.')
sum_parent_ages = sum(parent_ages)
average_parent_age = sum_parent_ages/how_many_parents
print(f'The average age of parents in this dataset is {average_parent_age}')

non_parents = len(no_kids_ages)
print(f'There are {non_parents} people without children in this dataset.')
sum_non_parent = sum(no_kids_ages)
average_no_kids = sum_non_parent/non_parents
print(f'The average age of those without children in this dataset is {average_no_kids}.')

There are 764 parents in this dataset.
The average age of parents in this dataset is 39.78010471204188
There are 574 people without children in this dataset.
The average age of those without children in this dataset is 38.444250871080136.


*Question 3:*
- 3a) What is the average cost difference between smokers and non-smokers?
- 3b) What is the lowest cost and highest cost for each group?
- 3c) Are smokers older?


In [150]:
#separate smokers' costs from non-smokers' costs
with open('insurance.csv') as ins_data:
    insurance = csv.DictReader(ins_data)
    smoker_costs = []
    nonsmoker_costs = []
    for row in insurance:
        # 'smoker' is either 'yes' or 'no' in this dataset
        if row['smoker'] == 'no':
            nonsmoker_costs.append(float(row['charges']))
        elif row['smoker'] == 'yes':
            smoker_costs.append(float(row['charges']))

#round every charge to cents
dollar_smoker_costs = [round(cost, 2) for cost in smoker_costs]
dollar_nonsmoker_costs = [round(cost, 2) for cost in nonsmoker_costs]

#averages of cost based on smoker status
average_smoker_cost = round(sum(dollar_smoker_costs) / len(dollar_smoker_costs), 2)
print(f"The average medical insurance cost for smokers of this dataset is ${average_smoker_cost}.")

average_nonsmoker_cost = round(sum(dollar_nonsmoker_costs) / len(dollar_nonsmoker_costs), 2)
print(f"The average medical insurance cost for those who do not smoke in this dataset is ${average_nonsmoker_cost}.")

#difference in cost based on smoker status (rounded to avoid float noise)
smoker_nonsmoker_difference = round(average_smoker_cost - average_nonsmoker_cost, 2)
print(f" The difference in cost is ${smoker_nonsmoker_difference}.")

The average medical insurance cost for smokers of this dataset is $32050.23.
The average medical insurance cost for those who do not smoke in this dataset is $8434.27.
 The difference in cost is $23615.96.


In [82]:
    
#3b - Highest and Lowest Costs
#smokers
highest_smoker_cost = max(dollar_smoker_costs)
lowest_smoker_cost = min(dollar_smoker_costs)
print('The most expensive insurance cost for smokers is ${high} \
and the least expensive insurance cost is ${low}.'.format(high=highest_smoker_cost, \
                                                          low=lowest_smoker_cost))

#non-smokers
highest_nonsmoker_cost = max(dollar_nonsmoker_costs)
lowest_nonsmoker_cost = min(dollar_nonsmoker_costs)
print('The most expensive insurance cost for non-smokers is ${high} \
and the least expensive insurance cost is ${low}.'.format(high=highest_nonsmoker_cost, \
                                                          low=lowest_nonsmoker_cost))


The most expensive insurance cost for smokers is $63770.43 and the least expensive insurance cost is $12829.46.
The most expensive insurance cost for non-smokers is $36910.61 and the least expensive insurance cost is $1121.87.


In [253]:
#3c - Are smokers older?
with open('insurance.csv') as ins_data:
    insurance = csv.DictReader(ins_data)

#function creation
    def smoker_non_smoker_ages(file):
        smoker_ages = []
        nonsmoker_ages = []
        for row in file:
            for profile, data in row.items():
                if profile == 'smoker':
                    if row[profile] == 'no':
                        nonsmoker_ages.append(int(row['age']))
                    elif row[profile] == 'yes':
                        smoker_ages.append(int(row['age']))
        
        avg_smoker_age = round(sum(smoker_ages)/len(smoker_ages), 1)
        avg_nonsmoker_age = round(sum(nonsmoker_ages)/len(nonsmoker_ages), 1)
        
        #print(max(smoker_ages), max(nonsmoker_ages))
        #print(min(smoker_ages), min(nonsmoker_ages))
        
        num_of_smokers = len(smoker_ages)
        num_of_nonsmokers = len(nonsmoker_ages)
        percent_smokers = round((num_of_smokers/(num_of_smokers + num_of_nonsmokers) * 100), 2)
        
        print(f'The number of smokers in this dataset is {num_of_smokers}.')
        print(f'The percent of smokers of this dataset is {percent_smokers}%.')
        print(f'The average smoker age is {avg_smoker_age}. \n\
The average non-smoker age is {avg_nonsmoker_age}.')
        return smoker_ages, nonsmoker_ages, num_of_smokers, num_of_nonsmokers
    
    #function call
    smoker_non_smoker_ages(insurance)

The number of smokers in this dataset is 274.
The percent of smokers of this dataset is 20.48%.
The average smoker age is 38.5. 
The average non-smoker age is 39.4.


*Question 4: Are men or women more likely to smoke from this dataset?*

In [254]:
with open('insurance.csv') as ins_data:
    insurance = csv.DictReader(ins_data)
    
    def sex_of_smokers(file):
        smoker_gender = {
            "male": 0,
            "female": 0
        }
        
        for row in file:
            for profile, data in row.items():
                if profile == 'smoker':
                    if row[profile] == 'yes':
                        if row['sex'] == 'male':
                            smoker_gender['male'] += 1
                        elif row['sex'] == 'female':
                            smoker_gender['female'] +=1
        return smoker_gender
    
    smoker_sex = sex_of_smokers(insurance)
    print("There are {y} male smokers and {x} female smokers.".format(x=smoker_sex['female'], y=smoker_sex['male']))

There are 159 male smokers and 115 female smokers.
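
Raw counts answer "how many smokers are men or women", but "more likely" depends on how many men and women are in the dataset overall. A rough sketch of a rate-based version is below; `smoking_rate_by_sex` is a hypothetical helper, and the eight sample rows are the sex and smoker values from the first rows printed in Step 3, so the rates shown are for that tiny sample only.

```python
from collections import Counter


def smoking_rate_by_sex(rows):
    """Return {sex: share of that sex who smoke} for DictReader-style rows."""
    totals = Counter()
    smokers = Counter()
    for row in rows:
        totals[row["sex"]] += 1
        if row["smoker"] == "yes":
            smokers[row["sex"]] += 1
    # Divide smokers by everyone of that sex, not by all smokers
    return {sex: smokers[sex] / totals[sex] for sex in totals}


# sex and smoker columns from the first eight rows of the Step 3 output
sample_rows = [
    {"sex": "female", "smoker": "yes"},
    {"sex": "male", "smoker": "no"},
    {"sex": "male", "smoker": "no"},
    {"sex": "male", "smoker": "no"},
    {"sex": "male", "smoker": "no"},
    {"sex": "female", "smoker": "no"},
    {"sex": "female", "smoker": "no"},
    {"sex": "female", "smoker": "no"},
]

rates = smoking_rate_by_sex(sample_rows)
print(rates)
```

Passing the `csv.DictReader` object for the full file in place of `sample_rows` would give the rates for the whole dataset.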


*Question 5 : Where are the majority of individuals from?*


In [160]:
#What are all the regions? 
with open('insurance.csv') as ins_data:
    insurance = csv.DictReader(ins_data)
    
    def all_regions(file):
        every_place = []
        for row in file:
            for data in row.keys():
                if data == 'region' and row[data] not in every_place:
                    every_place.append(row[data])
        return every_place
    
    every_region = all_regions(insurance)
    print(f"These are all the regions of the dataset: \n {every_region}")


with open('insurance.csv') as ins_data:
    insurance = csv.DictReader(ins_data)

    #How many people per region
    def people_in_places(file, region_list):
        place_count = {}
        for location in region_list:
            place_count[location] = 0

        for row in file:
            #count each row once, under its own region
            if row['region'] in place_count:
                place_count[row['region']] += 1

        return place_count

    people_per_place = people_in_places(file=insurance, region_list=every_region)
    print(f"Number of people per area:\n {people_per_place}")

    #What is that in percents?
    def region_percents(region_dictionary):
        add_up = sum(region_dictionary.values())
        percents_dictionary = {}
        for location, people in region_dictionary.items():
            percents_dictionary[location] = round((people/add_up) * 100, 2)
        return percents_dictionary

    percent_per_region = region_percents(people_per_place)
    print(f"This is the percentage of people per region:\n {percent_per_region}")

These are all the regions of the dataset: 
 ['southwest', 'southeast', 'northwest', 'northeast']
Number of people per area:
 {'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324}
This is the percentage of people per region:
 {'southwest': 24.29, 'southeast': 27.2, 'northwest': 24.29, 'northeast': 24.22}


The southeast region is slightly overrepresented at 27.2% of the dataset, while each of the other 3 regions accounts for roughly 24%.
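
The same counting can also be done with `collections.Counter` from the standard library, which finds the distinct regions and tallies them in one step. This is an alternative sketch rather than the method used above; the sample list holds the region values from the first eight rows of the Step 3 output.

```python
from collections import Counter

# region column from the first eight rows of the Step 3 output
regions = [
    "southwest", "southeast", "southeast", "northwest",
    "northwest", "southeast", "southeast", "northwest",
]

# Counter discovers the distinct regions and counts them together
region_counts = Counter(regions)
total = sum(region_counts.values())
region_percents = {
    region: round(count / total * 100, 2)
    for region, count in region_counts.items()
}

print(region_counts.most_common(1))  # the most represented region in the sample
print(region_percents)
```

With the real file, the list could be built as `[row['region'] for row in csv.DictReader(ins_data)]`.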


*Question 6: Which region has the highest BMIs?*

In [215]:
with open('insurance.csv') as ins_data:
    insurance = csv.DictReader(ins_data)
    
    def regions_bmis(file, region_list):
        bmis_per_region = {}
        for location in region_list:
            bmis_per_region[location] = []

        for row in file:
            #look only at the 'region' column instead of scanning every field
            if row['region'] in bmis_per_region:
                bmis_per_region[row['region']].append(float(row['bmi']))

        #Check length of each
        #for location in bmis_per_region:
            #print(len(bmis_per_region[location]))

        return bmis_per_region


    #Average BMI per region
    def avg_region_bmi(bmi_dictionary):
        region_bmi_averages = {}

        for region in bmi_dictionary:
            region_bmi_averages[region] = {
                "Total BMI": round(sum(bmi_dictionary[region]), 3),
                "Number of People": len(bmi_dictionary[region]),
                "Average BMI": round(sum(bmi_dictionary[region]) / len(bmi_dictionary[region]), 3)
            }
        #print(region_bmi_averages)

        return region_bmi_averages
    
    
    
    bmi_regions = regions_bmis(insurance, every_region)
    #print(bmi_regions)
    region_bmi_averages = avg_region_bmi(bmi_regions)
    print(f"These are the averages {region_bmi_averages}")
    

These are the averages {'southwest': {'Total BMI': 9943.9, 'Number of People': 325, 'Average BMI': 30.597}, 'southeast': {'Total BMI': 12141.58, 'Number of People': 364, 'Average BMI': 33.356}, 'northwest': {'Total BMI': 9489.93, 'Number of People': 325, 'Average BMI': 29.2}, 'northeast': {'Total BMI': 9452.215, 'Number of People': 324, 'Average BMI': 29.174}}


The southeast region has the highest average BMI at 33.356. The northeast region has the lowest average BMI at 29.174, just below the northwest at 29.2.

*Question 7: Which region's costs are the lowest? Highest?*


In [236]:
with open('insurance.csv') as ins_data:
    insurance = csv.DictReader(ins_data)
    
    def regions_costs(file, region_list):
        cost_in_region = {}
        for location in region_list:
            cost_in_region[location] = []

        for row in file:
            #look only at the 'region' column instead of scanning every field
            if row['region'] in cost_in_region:
                cost_in_region[row['region']].append(float(row['charges']))

        return cost_in_region


    def region_cost_directory(region_dictionary):
        region_cost = {}

        for location in region_dictionary:
            #print(region_dictionary[location])
            region_cost[location] = {
                'Highest Cost': max(region_dictionary[location]),
                'Lowest Cost': min(region_dictionary[location]),
                'Sum of Costs': sum(region_dictionary[location]),
                'Average Cost': round(sum(region_dictionary[location]) / len(region_dictionary[location]), 2)
            }

        return region_cost

    cost_regions = regions_costs(insurance, every_region)
    #print(cost_regions)
    
    costs_by_region = region_cost_directory(cost_regions)
    print(costs_by_region)

{'southwest': {'Highest Cost': 52590.82939, 'Lowest Cost': 1241.565, 'Sum of Costs': 4012754.647620001, 'Average Cost': 12346.94}, 'southeast': {'Highest Cost': 63770.42801, 'Lowest Cost': 1121.8739, 'Sum of Costs': 5363689.763290002, 'Average Cost': 14735.41}, 'northwest': {'Highest Cost': 60021.39897, 'Lowest Cost': 1621.3402, 'Sum of Costs': 4035711.9965399993, 'Average Cost': 12417.58}, 'northeast': {'Highest Cost': 58571.07448, 'Lowest Cost': 1694.7964, 'Sum of Costs': 4343668.583308999, 'Average Cost': 13406.38}}


The single highest cost of all regions ($63,770.43) is in the southeast, and so is the single lowest ($1,121.87). The southeast also has the highest sum of costs, but it has the most people represented in this dataset, so the sum alone is not a fair comparison.
The highest average cost belongs to the southeast at $14,735.41.
The lowest average cost belongs to the southwest at $12,346.94.
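
Questions 6 and 7 both group a numeric column by region and then average it, so the two could share one helper. The sketch below is an illustration only: `average_by` is a hypothetical function, and the sample rows are the region, bmi, and charges values from the first five rows of the Step 3 output, so the averages are for that sample and not for the full dataset.

```python
from collections import defaultdict


def average_by(rows, group_key, value_key):
    """Average the numeric column value_key within each value of group_key."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[group_key]].append(float(row[value_key]))
    return {group: sum(values) / len(values) for group, values in grouped.items()}


# First five rows of the Step 3 output (region, bmi and charges only)
sample_rows = [
    {"region": "southwest", "bmi": "27.9", "charges": "16884.924"},
    {"region": "southeast", "bmi": "33.77", "charges": "1725.5523"},
    {"region": "southeast", "bmi": "33", "charges": "4449.462"},
    {"region": "northwest", "bmi": "22.705", "charges": "21984.47061"},
    {"region": "northwest", "bmi": "28.88", "charges": "3866.8552"},
]

avg_bmi = average_by(sample_rows, "region", "bmi")
avg_cost = average_by(sample_rows, "region", "charges")
print(avg_bmi)
print(avg_cost)
```

Because the grouping key is a parameter, the same helper would also work for averages by sex or smoker status.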

*Question 8: How many children do the sexes have on average?*


In [257]:
with open('insurance.csv') as ins_data:
    insurance = csv.DictReader(ins_data)
    
    def parents_by_sex(file):
        male_parents_kids = []
        female_parents_kids = []
        parent_gen_diction = {}
        
        for row in file:
            for profile, data in row.items():
                if profile == 'children':
                    if row[profile] != '0':
                        if row['sex'] == 'female':
                            female_parents_kids.append(int(row[profile]))
                        elif row['sex'] == 'male':
                            male_parents_kids.append(int(row[profile]))
        #print(female_parents_kids)
        #print(male_parents_kids)
        
        parent_gen_diction['female'] = female_parents_kids
        parent_gen_diction['male'] = male_parents_kids
                            
        return parent_gen_diction

    sex_of_parents = parents_by_sex(insurance)
    #print(sex_of_parents)
 

    #how many parents by gender? 
    total_fparents = len(sex_of_parents['female'])
    total_mparents = len(sex_of_parents['male'])
    print(total_fparents, total_mparents)
    
    #averages
    f_avg_kids = round(((sum(sex_of_parents['female']))/total_fparents), 2)
    m_avg_kids = round(((sum(sex_of_parents['male']))/total_mparents), 2)
    
    print(f"Within this dataset, male parents have {m_avg_kids} children on average and female parents have {f_avg_kids}.")

373 391
Within this dataset, male parents have 1.93 children on average and female parents have 1.91.


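In this dataset there are more male parents (391) than female parents (373). Among parents, men average very slightly more children than women (1.93 vs. 1.91), a difference too small to read much into. Note that these averages cover only people with at least one child.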
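
Since the cell above drops everyone with zero children, its averages describe parents rather than each sex as a whole. The sketch below shows how the two versions of the average differ; `average_children_by_sex` is a hypothetical helper, and the sample rows are the sex and children values from the first eight rows of the Step 3 output.

```python
def average_children_by_sex(rows, parents_only=False):
    """Average the 'children' column per sex; optionally skip childless rows."""
    kids = {}
    for row in rows:
        count = int(row["children"])
        if parents_only and count == 0:
            continue
        kids.setdefault(row["sex"], []).append(count)
    return {sex: round(sum(counts) / len(counts), 2) for sex, counts in kids.items()}


# sex and children columns from the first eight rows of the Step 3 output
sample_rows = [
    {"sex": "female", "children": "0"},
    {"sex": "male", "children": "1"},
    {"sex": "male", "children": "3"},
    {"sex": "male", "children": "0"},
    {"sex": "male", "children": "0"},
    {"sex": "female", "children": "0"},
    {"sex": "female", "children": "1"},
    {"sex": "female", "children": "3"},
]

avg_all = average_children_by_sex(sample_rows)
avg_parents = average_children_by_sex(sample_rows, parents_only=True)
print(avg_all)
print(avg_parents)
```

Which version fits depends on whether the question is about everyone in the dataset or only about those who have children.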