# U.S. Medical Insurance Costs

Looking at a database of insurance costs by customer profile, I would like to see if it's possible to identify:

1. What are the highest insurance cost averages for each of the categorical variables (sex, smoker and region).

Expected outputs
- "The average insurance cost for males is x, whereas the average insurance cost for females is y."
- "The average insurance cost for smokers is x, whereas the average insurance cost for non-smokers is y."
- "The average insurance cost by region is: northwest x, northeast y, southwest w, southeast z."



2. For numerical variables, such as age, BMI and number of children, I'd like to group customers and compare insurance costs across the groups.
    - Age: Count the number of customers at each individual age.
    - BMI: Use the standard classification of 'low', 'normal', 'overweight' and 'obese', depending on BMI values.
    - Number of children: Create one group for each distinct number of children.



3. From the analysis above, detect which customer groups have the highest insurance costs and merge them into one final group. For an insurance company, the goal could be to target this group with messaging that motivates them to adopt better lifestyle habits.
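
All three questions in item 1 reduce to the same operation: group the rows by one column and average `charges` within each group. The following is a minimal sketch of that idea, not the approach used in the cells below. The helper name `average_by` and the inline sample rows are made up for illustration; only the column names (`sex`, `smoker`, `charges`) come from the dataset.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows standing in for insurance.csv (values invented).
SAMPLE_CSV = """sex,smoker,charges
female,yes,16000.0
male,no,2000.0
male,no,4000.0
female,no,3000.0
"""


def average_by(rows, column):
    """Return {group value: (average charge, row count)} for one column."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[column]] += float(row["charges"])
        counts[row[column]] += 1
    return {key: (totals[key] / counts[key], counts[key]) for key in totals}


# Read everything into a list so the rows can be reused for several columns;
# a csv.DictReader can only be iterated once.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

by_sex = average_by(rows, "sex")
by_smoker = average_by(rows, "smoker")
print(by_sex)
print(by_smoker)
```

Loading the rows into a list once is a design choice that would avoid reopening the file for every question, at the cost of holding the whole dataset in memory.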

In [1]:
import csv
with open('insurance.csv', 'r') as insurance_file:
    
    insurance_dictionary = csv.DictReader(insurance_file)

        
    
    # Function to calculate the average insurance cost of a given customer group.     
    
    def insurance_cost_average (insurance_dictionary):
        
        total_cost = 0
        total_customers = 0
        average_cost = 0
    
        for customer in insurance_dictionary:   
            total_cost += round(float(customer["charges"]))
            total_customers += 1
        
        average_cost = total_cost/total_customers
        
        return average_cost
    
    
    
    # Function to calculate the average insurance cost based on gender. 

    def ins_cost_avr_sex (insurance_dictionary):
        
        cost_male = 0
        total_male = 0
        avr_male = 0
        
        cost_female = 0
        total_female = 0
        avr_female = 0

        for customer in insurance_dictionary:
            if customer["sex"] == "male":
                cost_male += round(float(customer["charges"]))
                total_male += 1
            
            else:
                cost_female += round(float(customer["charges"]))
                total_female +=1
        
        avr_male = round(cost_male/total_male)
        avr_female = round(cost_female/total_female)
        
        return avr_male, total_male, avr_female, total_female

        
    avr_male, total_male, avr_female, total_female = ins_cost_avr_sex (insurance_dictionary)
    
    print ("The average insurance cost for males is",avr_male)
    print ("The total number of males in the dataset is",total_male)
    print()
    print ("The average insurance cost for females is",avr_female)
    print ("The total number of females in the dataset is",total_female)
    print()
    print ("The average insurance cost for males is around",round(avr_male/avr_female,2),"times the average insurance cost for females.")
    print ("The collected sample of male customers is around",round(total_male/total_female,2),"times the collected sample of female customers.")

The average insurance cost for males is 13957
The total number of males in the dataset is 676

The average insurance cost for females is 12570
The total number of females in the dataset is 662

The average insurance cost for males is around 1.11 times the average insurance cost for females.
The collected sample of male customers is around 1.02 times the collected sample of female customers.


In [2]:
import csv
with open('insurance.csv', 'r') as insurance_file:
    
    insurance_dictionary = csv.DictReader(insurance_file)

    
    # Function to calculate the average insurance cost based on smoking habits.     

    def ins_cost_avr_smoker (insurance_dictionary):

            cost_smoker = 0
            total_smoker = 0
            avr_smoker = 0
            
            cost_no_smoker = 0
            total_no_smoker = 0
            avr_no_smoker = 0

            for customer in insurance_dictionary:
                if customer["smoker"] == "yes":
                    cost_smoker += round(float(customer["charges"]))
                    total_smoker += 1

                else:
                    cost_no_smoker += round(float(customer["charges"]))
                    total_no_smoker +=1

            avr_smoker = round(cost_smoker/total_smoker)
            avr_no_smoker = round(cost_no_smoker/total_no_smoker)
            
            return avr_smoker, total_smoker, avr_no_smoker, total_no_smoker


    avr_smoker, total_smoker, avr_no_smoker, total_no_smoker = ins_cost_avr_smoker (insurance_dictionary)

    print ("The average insurance cost for smokers is",avr_smoker)
    print ("The total number of smokers in the dataset is",total_smoker)
    print()
    print ("The average insurance cost for non-smokers is",avr_no_smoker)
    print ("The total number of non-smokers in the dataset is",total_no_smoker)
    print()
    print ("The average insurance cost for smokers is around",round(avr_smoker/avr_no_smoker,2),"times the average insurance cost for non-smokers.")
    print ("The collected sample of smokers is around",round(total_smoker/total_no_smoker,2),"times the collected sample of non-smokers.")

The average insurance cost for smokers is 32050
The total number of smokers in the dataset is 274

The average insurance cost for non-smokers is 8434
The total number of non-smokers in the dataset is 1064

The average insurance cost for smokers is around 3.8 times the average insurance cost for non-smokers.
The collected sample of smokers is around 0.26 times the collected sample of non-smokers.


With the data above, we can say that smokers have much higher average insurance costs compared to non-smokers, as the average insurance cost for smokers is about 3.8 times that of non-smokers. Even though the sample size for smokers is smaller (274 vs. 1064 non-smokers), the significant cost difference likely reflects the higher health risks associated with smoking, which leads to higher insurance premiums. Despite the smaller sample size, this pattern suggests a strong correlation between smoking status and increased insurance costs.
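
The two ratios quoted above can be reproduced directly from the printed figures. This is only a worked check of the arithmetic, using the rounded averages from the output rather than the raw data; the variable `smoker_share` is an extra figure added here for context.

```python
# Figures copied from the output of cell 2 above.
avr_smoker, total_smoker = 32050, 274
avr_no_smoker, total_no_smoker = 8434, 1064

total_customers = total_smoker + total_no_smoker
cost_ratio = avr_smoker / avr_no_smoker
smoker_share = total_smoker / total_customers

print(f"Smokers cost about {cost_ratio:.2f} times as much as non-smokers.")
print(f"Smokers make up about {smoker_share:.1%} of the {total_customers} customers.")
```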

In [3]:
import csv
with open('insurance.csv', 'r') as insurance_file:
    
    insurance_dictionary = csv.DictReader(insurance_file)

    
    # Function to calculate the average insurance cost based on region. 

    def ins_cost_avr_region (insurance_dictionary):

            cost_nw = 0
            total_nw = 0
            avr_nw = 0
            
            cost_ne = 0
            total_ne = 0
            avr_ne = 0

            cost_sw = 0
            total_sw = 0
            avr_sw = 0
            
            cost_se = 0
            total_se = 0
            avr_se = 0     
                
            for customer in insurance_dictionary:
                if customer["region"] == "northwest":
                    cost_nw += round(float(customer["charges"]))
                    total_nw += 1

                elif customer["region"] == "northeast":
                    cost_ne += round(float(customer["charges"]))
                    total_ne +=1
            
                elif customer["region"] == "southwest":
                    cost_sw += round(float(customer["charges"]))
                    total_sw +=1               
                
                else:
                    cost_se += round(float(customer["charges"]))
                    total_se +=1                   
                            
            avr_nw = round(cost_nw/total_nw)
            avr_ne = round(cost_ne/total_ne)
            avr_sw = round(cost_sw/total_sw)
            avr_se = round(cost_se/total_se)                
            
            return avr_nw, total_nw, avr_ne, total_ne, avr_sw, total_sw, avr_se, total_se 


    avr_nw, total_nw, avr_ne, total_ne, avr_sw, total_sw, avr_se, total_se = ins_cost_avr_region (insurance_dictionary)

    print ("The average insurance cost for customers living in the northwest region is",avr_nw)
    print ("The total number of customers living in the northwest region is",total_nw)
    print()
    print ("The average insurance cost for customers living in the northeast region is",avr_ne)
    print ("The total number of customers living in the northeast region is",total_ne)
    print()
    print ("The average insurance cost for customers living in the southwest region is",avr_sw)
    print ("The total number of customers living in the southwest region is",total_sw)
    print()
    print ("The average insurance cost for customers living in the southeast region is",avr_se)
    print ("The total number of customers living in the southeast region is",total_se)
    print()

The average insurance cost for customers living in the northwest region is 12418
The total number of customers living in the northwest region is 325

The average insurance cost for customers living in the northeast region is 13406
The total number of customers living in the northeast region is 324

The average insurance cost for customers living in the southwest region is 12347
The total number of customers living in the southwest region is 325

The average insurance cost for customers living in the southeast region is 14735
The total number of customers living in the southeast region is 364



The data above suggests that region has little influence on insurance costs. Sample sizes are similar (324 to 364 customers per region), and the averages fall in a fairly narrow range, from 12,347 in the southwest to 14,735 in the southeast.
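
To put a number on how close the regional averages are, the highest can be compared with the lowest. This is a small sketch using the figures printed above; the variable names are only illustrative. For comparison, the smoker to non-smoker ratio reported earlier is about 3.8.

```python
# Regional averages copied from the output of cell 3 above.
region_averages = {
    "northwest": 12418,
    "northeast": 13406,
    "southwest": 12347,
    "southeast": 14735,
}

highest = max(region_averages, key=region_averages.get)
lowest = min(region_averages, key=region_averages.get)
spread = region_averages[highest] / region_averages[lowest]

print(f"{highest} is the most expensive region, {lowest} the cheapest.")
print(f"The ratio between them is about {spread:.2f}.")
```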

In [4]:
import csv
with open('insurance.csv', 'r') as insurance_file:
    
    insurance_dictionary = csv.DictReader(insurance_file)

    
    # Function to calculate the average insurance cost based on age groups. 

    def count_ages (insurance_dictionary):
    
        all_ages = {}
    
        for customer in insurance_dictionary:
            
            age = customer['age']
            
            if age not in all_ages:
                all_ages[age] = {'total customers': 1, 'total insurance costs': round(float(customer['charges']))}
                
            else:
                all_ages[age]['total customers'] += 1
                all_ages[age]['total insurance costs'] += (round(float(customer['charges'])))
               
        
        for age in all_ages:
            all_ages[age]['average insurance cost'] = round(float(all_ages[age]['total insurance costs']/all_ages[age]['total customers']))
        
          
        return all_ages

    
    counted_ages = count_ages (insurance_dictionary)
    
    
    
    # Print customer counts and average costs sorted by age, from youngest to oldest.
    
    for age, data in sorted(counted_ages.items()):
        print ("Age:" ,age, " | Total customers:" ,data['total customers'], " | Average insurance cost:" ,data['average insurance cost'])


Age: 18  | Total customers: 69  | Average insurance cost: 7086
Age: 19  | Total customers: 68  | Average insurance cost: 9748
Age: 20  | Total customers: 29  | Average insurance cost: 10160
Age: 21  | Total customers: 28  | Average insurance cost: 4730
Age: 22  | Total customers: 28  | Average insurance cost: 10013
Age: 23  | Total customers: 28  | Average insurance cost: 12420
Age: 24  | Total customers: 28  | Average insurance cost: 10648
Age: 25  | Total customers: 28  | Average insurance cost: 9838
Age: 26  | Total customers: 28  | Average insurance cost: 6134
Age: 27  | Total customers: 28  | Average insurance cost: 12185
Age: 28  | Total customers: 28  | Average insurance cost: 9069
Age: 29  | Total customers: 27  | Average insurance cost: 10430
Age: 30  | Total customers: 27  | Average insurance cost: 12719
Age: 31  | Total customers: 27  | Average insurance cost: 10197
Age: 32  | Total customers: 26  | Average insurance cost: 9220
Age: 33  | Total customers: 26  | Average insur

In [5]:
import csv
with open('insurance.csv', 'r') as insurance_file:
    
    insurance_dictionary = csv.DictReader(insurance_file)

    
    # Function to calculate the average insurance cost based on amount of children. 

    def count_children (insurance_dictionary):
    
        all_children = {}
    
        for customer in insurance_dictionary:
            
            amount_children = customer['children']
            
            if amount_children not in all_children:
                all_children[amount_children] = {'total customers': 1, 'total insurance costs': round(float(customer['charges']))}
                
            else:
                all_children[amount_children]['total customers'] += 1
                all_children[amount_children]['total insurance costs'] += (round(float(customer['charges'])))
               
        
        for amount_children in all_children:
            all_children[amount_children]['average insurance cost'] = round(float(all_children[amount_children]['total insurance costs']/all_children[amount_children]['total customers']))
        
          
        return all_children

    
    counted_children = count_children (insurance_dictionary)
    

    
    # Print customer counts and average costs sorted by number of children, from fewest to most.
    
    for amount_children, data in sorted(counted_children.items()):
        print ("Amount of children:" ,amount_children, " | Total customers:" ,data['total customers'], " | Average insurance cost:" ,data['average insurance cost'])


Amount of children: 0  | Total customers: 574  | Average insurance cost: 12366
Amount of children: 1  | Total customers: 324  | Average insurance cost: 12731
Amount of children: 2  | Total customers: 240  | Average insurance cost: 15074
Amount of children: 3  | Total customers: 157  | Average insurance cost: 15355
Amount of children: 4  | Total customers: 25  | Average insurance cost: 13851
Amount of children: 5  | Total customers: 18  | Average insurance cost: 8786


In [6]:
import csv
with open('insurance.csv', 'r') as insurance_file:
    
    insurance_dictionary = csv.DictReader(insurance_file)

    
    # Function to group customers by their BMI.

    def bmi_range (insurance_dictionary):
    
        all_bmi = {}
    
        for customer in insurance_dictionary:
            
            bmi = float(customer['bmi'])
            
            if bmi < 18.5:
                if 'low' not in all_bmi:
                    all_bmi['low'] = {'total customers': 1, 'total insurance costs': round(float(customer['charges']))}
                else:
                    all_bmi['low']['total customers'] += 1
                    all_bmi['low']['total insurance costs'] += (round(float(customer['charges'])))
               
            
            elif bmi >= 18.5 and bmi <= 24.9:
                if 'normal' not in all_bmi:
                    all_bmi['normal'] = {'total customers': 1, 'total insurance costs': round(float(customer['charges']))}
                else:
                    all_bmi['normal']['total customers'] += 1
                    all_bmi['normal']['total insurance costs'] += (round(float(customer['charges'])))
        
        
            elif bmi >= 25 and bmi <= 29.9:
                if 'overweight' not in all_bmi:
                    all_bmi['overweight'] = {'total customers': 1, 'total insurance costs': round(float(customer['charges']))}
                else:
                    all_bmi['overweight']['total customers'] += 1
                    all_bmi['overweight']['total insurance costs'] += (round(float(customer['charges'])))
        
        
            elif bmi >= 30:
                if 'obese' not in all_bmi:
                    all_bmi['obese'] = {'total customers': 1, 'total insurance costs': round(float(customer['charges']))}
                else:
                    all_bmi['obese']['total customers'] += 1
                    all_bmi['obese']['total insurance costs'] += (round(float(customer['charges'])))
        
        
        
        for bmi in all_bmi:
            # Compute the average only for the category currently being visited.
            all_bmi[bmi]['average insurance cost'] = round(all_bmi[bmi]['total insurance costs']/all_bmi[bmi]['total customers'])

          
        return all_bmi

    
    counted_bmi = bmi_range (insurance_dictionary)
    
    highest_cost_multiple = round(counted_bmi['obese']['average insurance cost']/counted_bmi['overweight']['average insurance cost'],3)
    
    
    
    # Print customer counts and average costs for each BMI range, from the highest range to the lowest.
    
    print ("BMI range: Obese      | Total customers:" ,counted_bmi['obese']['total customers'], " | Average insurance cost:" ,counted_bmi['obese']['average insurance cost'])
    print ("BMI range: Overweight | Total customers:" ,counted_bmi['overweight']['total customers'], " | Average insurance cost:" ,counted_bmi['overweight']['average insurance cost'])
    print ("BMI range: Normal     | Total customers:" ,counted_bmi['normal']['total customers'], " | Average insurance cost:" ,counted_bmi['normal']['average insurance cost'])
    print ("BMI range: Low        | Total customers:" ,counted_bmi['low']['total customers'], "  | Average insurance cost:" ,counted_bmi['low']['average insurance cost'])
    print ()
    print ("On average, the insurance costs of obese customers is",highest_cost_multiple,"times the insurance costs of overweight customers.")

BMI range: Obese      | Total customers: 707  | Average insurance cost: 15552
BMI range: Overweight | Total customers: 377  | Average insurance cost: 10994
BMI range: Normal     | Total customers: 222  | Average insurance cost: 10379
BMI range: Low        | Total customers: 20   | Average insurance cost: 8852

On average, the insurance costs of obese customers is 1.415 times the insurance costs of overweight customers.


With the data above, it's reasonable to conclude that customers in the “obese” category have significantly higher average insurance costs compared to other BMI groups. This may be due to the increased health risks associated with obesity, which typically result in higher medical expenses. The larger sample size of 707 customers in the “obese” category further strengthens this observation, as it indicates that the higher average cost isn’t due to a small, potentially unrepresentative group but reflects a broader trend in the dataset.
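
One caveat about the grouping: the four counts above add up to 1,326, while the dataset has 1,338 customers (676 males plus 662 females). The 12 missing rows are presumably those with a BMI strictly between 24.9 and 25 or between 29.9 and 30, which match none of the `elif` ranges. The sketch below shows one possible way to close those gaps with half-open intervals. `classify_bmi` is a hypothetical helper, and rerunning the analysis with it would shift the reported counts and averages slightly.

```python
def classify_bmi(bmi):
    """Map a BMI value to a category using half-open ranges with no gaps."""
    if bmi < 18.5:
        return "low"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"


# Includes values that closed ranges such as "<= 24.9" and "<= 29.9" would skip.
for value in (18.4, 18.5, 24.95, 25.0, 29.95, 30.0):
    print(value, classify_bmi(value))
```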

Given that smokers and obese people are the ones with the highest insurance costs, I would like to create a merged dictionary that includes people in both categories. These people, who have much higher health risks than others, should be targeted with motivational messages encouraging them to stop smoking and/or lose weight. This could be a marketing campaign run by the insurance company itself to help promote healthier lifestyles (whilst also reducing insurance costs).

In [7]:
import csv
with open('insurance.csv', 'r') as insurance_file:
    
    insurance_dictionary = csv.DictReader(insurance_file)

    
    # Function to create a dictionary of customers with the highest insurance costs (obese and smokers).

    def check_obese_smoker (insurance_dictionary):
        
        all_obese_smokers = {}
        customer_id = 1
        total_customers = 0
        total_cost = 0
        average_cost = 0

        
        for customer in insurance_dictionary:
            if customer['smoker'] == 'yes' and float(customer['bmi']) >= 30:
                all_obese_smokers[customer_id] = {'age':customer['age'], 'sex':customer['sex'], 'bmi':customer['bmi'], 'children':customer['children'], 'insurance cost':customer['charges']}  
                total_cost += round(float(customer['charges']))
                customer_id += 1
                total_customers +=1
        
        average_cost = total_cost/total_customers
        
        return all_obese_smokers, average_cost
    
    
    
    # A function with multiple return values returns them as a tuple. To access them, call the function and read each value by its index.

    obese_smoker = check_obese_smoker (insurance_dictionary)
    all_obese_smokers = obese_smoker[0]
    average_insurance_cost = round(obese_smoker[1],3)
    
    print ("The dataset contains", len(all_obese_smokers), "customers who are smoker and obese. Their average insurance cost is",average_insurance_cost)
    

The dataset contains 145 customers who are smoker and obese. Their average insurance cost is 41557.986
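
Because `check_obese_smoker` returns a tuple, its two values can also be unpacked in a single assignment instead of being read by index, as the earlier cells already do. A minimal sketch, assuming a simplified stand-in function (`obese_smokers`) and invented sample rows rather than the real file:

```python
# Hypothetical sample rows; the values are invented for illustration.
customers = [
    {"smoker": "yes", "bmi": "31.2", "charges": "40000.0"},
    {"smoker": "yes", "bmi": "24.0", "charges": "20000.0"},
    {"smoker": "no", "bmi": "35.0", "charges": "9000.0"},
    {"smoker": "yes", "bmi": "30.0", "charges": "44000.0"},
]


def obese_smokers(rows):
    """Return the matching rows and their average charge (None if no match)."""
    matches = [
        row for row in rows
        if row["smoker"] == "yes" and float(row["bmi"]) >= 30
    ]
    if not matches:
        return matches, None
    average = sum(float(row["charges"]) for row in matches) / len(matches)
    return matches, average


# Unpack both return values at once.
matches, average_cost = obese_smokers(customers)
print(len(matches), "customers, average cost", round(average_cost, 3))
```

Returning `None` for an empty group is one way to avoid a division by zero; the original function assumes at least one customer matches.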
