# U.S. Medical Insurance Costs

# Goal
The aim of this project is to analyze the factors that contribute to the total insurance cost a patient pays. The analysis below could help patients make choices that reduce their insurance costs.

Analyze:
* ages and insurance costs across ages
* the percentage of males vs. females
* the average yearly insurance for an individual
* insurance costs among smokers and non-smokers
* how the number of children and smoker status affect the charges
* insurance costs in different regions
* average insurance based on bmi/smoker status

and finally,
* Create a dictionary that contains all patient information


For this project, the `csv` module and the `Counter` class from `collections` will be imported.

In [1]:
# import libraries
import csv
from collections import Counter

The next step is to look through **insurance.csv** to get acquainted with the data. The following aspects of the data file are checked in order to plan how to import the data into Python:
* The names of columns and rows
* Any noticeable missing data
* Types of values (numerical vs. categorical)
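The checks listed above can also be done programmatically. The following is a minimal sketch, not the inspection actually used in this project: the function names `is_number` and `inspect_rows` are hypothetical, and the rows are made up to stand in for **insurance.csv** (one value is left blank on purpose to show how a missing value would be reported).

```python
import csv
import io

# Made-up rows standing in for insurance.csv (same column names, one blank bmi)
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
30,female,22.5,0,no,northwest,4000.0
45,male,31.0,2,yes,southeast,30000.0
52,male,,1,no,southwest,9000.0
"""

def is_number(value):
    """Return True if the string can be parsed as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False

def inspect_rows(csv_file):
    """Summarize missing-value counts and value types for each column."""
    reader = csv.DictReader(csv_file)
    rows = list(reader)
    summary = {}
    for name in reader.fieldnames:
        values = [row[name] for row in rows]
        present = [v for v in values if v != '']
        numeric = bool(present) and all(is_number(v) for v in present)
        summary[name] = {'missing': len(values) - len(present),
                         'type': 'numerical' if numeric else 'categorical'}
    return summary

summary = inspect_rows(io.StringIO(SAMPLE_CSV))
for name, info in summary.items():
    print(name, info)
```

For the real file, the same function could be called on a file object opened with `open('insurance.csv', newline='')`.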

**insurance.csv** contains the following columns:
* Patient Age
* Patient Sex 
* Patient BMI
* Patient Number of Children
* Patient Smoking Status
* Patient U.S. Geographical Region
* Patient Yearly Medical Insurance Cost


The helper function below reads one column of a CSV file into a list, so each column can be loaded with a single call to `csv_into_lst()`, as shown below.

In [2]:
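# Helper function to load one column of a CSV file into a list
def csv_into_lst(csv_file, column_name):
    lst = []
    # newline='' is the setting the csv module recommends for file objects
    with open(csv_file, newline='') as csv_info:
        csv_dict = csv.DictReader(csv_info)
        # each iteration yields one row as a dictionary keyed by column name
        for row in csv_dict:
            lst.append(row[column_name])
    return lst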

There are no signs of missing data. To store this information, seven variables **(ages, sexes, bmis, num_children, smoker_status, regions, insurance_charges)** will be defined. They contain lists to hold each individual column of data from **insurance.csv**.


In [3]:
# call the function and save in respective variables
ages = csv_into_lst('insurance.csv', 'age')
sexes = csv_into_lst('insurance.csv', 'sex')
bmis = csv_into_lst('insurance.csv', 'bmi')
num_children = csv_into_lst('insurance.csv', 'children')
smoker_status = csv_into_lst('insurance.csv', 'smoker')
regions = csv_into_lst('insurance.csv', 'region')
insurance_charges = csv_into_lst('insurance.csv', 'charges')
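Each call above opens and reads the file again, once per column. As an alternative, all columns could be collected in a single pass. This is only a sketch under assumptions: `csv_into_columns` is a hypothetical name, and the rows are made up so the example runs without the real file.

```python
import csv
import io

def csv_into_columns(csv_file):
    """Read an open CSV file once and return {column_name: list_of_values}."""
    reader = csv.DictReader(csv_file)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns

# Made-up rows standing in for insurance.csv
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,female,22.5,0,no,northwest,4000.0\n"
    "45,male,31.0,2,yes,southeast,30000.0\n"
)
columns = csv_into_columns(sample)
sample_ages = columns['age']
sample_regions = columns['region']
print(sample_ages, sample_regions)
```

The values are still strings, exactly as with `csv_into_lst()`, so the rest of the notebook would work unchanged with lists taken from `columns`.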

To perform the analysis, a class called `InsuranceData` with several methods is defined below.

In [4]:
# create a class

class InsuranceData():
    def __init__(self, patients_ages, patients_sexes, patients_bmis, patients_num_children, 
                 patients_smoker_statuses, patients_regions, patients_charges):
        self.patients_ages = patients_ages
        self.patients_sexes = patients_sexes
        self.patients_bmis = patients_bmis
        self.patients_num_children = patients_num_children
        self.patients_smoker_statuses = patients_smoker_statuses
        self.patients_regions = patients_regions
        self.patients_charges = patients_charges
    
    
    # find the average age of insurance payers
    def analyze_ages(self):
        ages = [int(i) for i in self.patients_ages]
        avg = sum(ages) / len(self.patients_ages)
        return ('The average age of a patient is ' + str(round(avg, 2)) + ' years')
    
    
    # find which age dominates insurance payers
    def dominant_age(self):
        # count the number of occurrences for each age 
        common_age = Counter(self.patients_ages)
        # select the first item(.most_common sorts the list)
        common_age = common_age.most_common(1)
        majority_age = (common_age[0][0])
        return ('The dominant age of taxpayers is ' + str(majority_age) + ' year olds')
    
    
    # find each age's share of the total insurance charges
    def total_insurance_age_x(self):
        ages_charges = list(zip(self.patients_ages, self.patients_charges))
        total_insurance_charges = sum([float(i) for i in self.patients_charges])
        percentages = {}
        # range(18, 65) represents the different ages in the data (18 through 64)
        for i in range(18, 65):
            # compare against str(i) because ages are still strings, e.g. '18'
            total_sum_each_age = sum([float(item[1]) for item in ages_charges if item[0] == str(i)])
            # percentage of the total charges contributed by this age
            percentages[i] = round((total_sum_each_age / total_insurance_charges) * 100, 2)
        # return the percentages so they can be inspected instead of being discarded
        return percentages

    
    # What percentage of male and female are represented in the dataset
    def male_female_percentage(self):
        num_male = self.patients_sexes.count('male')
        num_female = self.patients_sexes.count('female')
        total_gender = num_male + num_female
        # calculate percentage for male/female rounded to the tenths place
        male_percentage = round(((num_male / total_gender) * 100), 1)
        female_percentage = round(((num_female / total_gender) * 100), 1)
        return ('The percentage of male in the dataset is ' + str(male_percentage) + 
                '% and The percentage of female in the dataset is ' + str(female_percentage) + '%')
    
    # find the average yearly medical charges for patients 
    def average_charges(self):
        # initialize total_charges variable
        total_charges = 0
        # iterate through charges in patients charges list
        # add each charge to total_charges
        for charge in self.patients_charges:
            total_charges += float(charge)
        # return the average charges rounded to the hundredths place
        return ("Average Yearly Medical Insurance Charges: " +  
                str(round(total_charges/len(self.patients_charges), 2)) + " dollars.")
    
    
    # Calculate the average insurance for male or female
    def gender_average_insurance(self):
        sexes_charges = list(zip(self.patients_sexes, self.patients_charges))
        # filter for charges: where sex == 'male' or 'female'
        sexes = ['male', 'female']
        for i in sexes:
            # str(i) because it is 'male' or 'female'
            gender = [item[1] for item in sexes_charges if item[0] == str(i)]
            gender_numbers = [float(item) for item in gender]
            avg_gender_ins = sum(gender_numbers) / len(gender_numbers)
            print('The average insurance each ' + i + ' pays is ' + str(round(avg_gender_ins,2)) + ' dollars')
            
    
    # Calculate the average insurance for male or female smokers
    def average_insurance_smokers(self):
        sexes_charges = list(zip(self.patients_sexes, self.patients_smoker_statuses, self.patients_charges))
        sexes = ['male', 'female']
        for i in sexes:
            # filter for charges: where sex == 'male' or 'female' and smokers == 'yes'
            smokers_charges = [float(item[2]) for item in sexes_charges if item[0] == str(i) and item[1] == 'yes']
            avg_smokers = round(sum(smokers_charges) / len(smokers_charges), 2)
            print('The average insurance charges for ' + i + ' smokers is ' + str(avg_smokers) + ' dollars')
   

    # Calculate the average insurance for male or female non-smokers
    def average_insurance_non_smokers(self):
        sexes_charges = list(zip(self.patients_sexes, self.patients_smoker_statuses, self.patients_charges))
        sexes = ['male', 'female']
        for i in sexes:
            # filter for charges: where sex == 'male' or 'female' and smokers == 'no'
            non_smokers_charges = [float(item[2]) for item in sexes_charges if item[0] == i and item[1] == 'no']
            avg_non_smokers = round(sum(non_smokers_charges) / len(non_smokers_charges), 2)
            print('The average insurance charges for ' + i + ' non-smokers is ' + str(avg_non_smokers) + ' dollars')
    
    
    def unique_number_children(self):
        # initialize empty list
        unique_children = []
        # iterate through each num in num_children list
        for num in self.patients_num_children:
            # if num is not already in the unique children list
            # then add it to the unique children list
            if num not in unique_children: 
                unique_children.append(num)
        # return unique children list
        return unique_children
    
    
    # find the average insurance based on the number of kids
    def average_charges_per_kids(self):
        charges_children = list(zip(self.patients_charges, self.patients_num_children))
        # range(0, 6) represents the unique number of children in the data
        for i in range(0, 6):
            # str(i) because it is '1' for example. It has not been converted to integer
            # filter for charges by the number of children
            charges_per_child = [float(item[0]) for item in charges_children if item[1] == str(i)]
            # average charges rounded to the hundredths
            avg_charges_per_child = round(sum(charges_per_child) / len(charges_per_child), 2)
            print('The average insurance based on number of kids - ' + str(i) + ': '+ str(avg_charges_per_child) 
                  + ' dollars')
               
            
    def average_charges_per_kids_smokers(self):
        charges_children_smokers = list(zip(self.patients_charges, self.patients_num_children, 
                                            self.patients_smoker_statuses))
        # range(0, 6) represents the unique number of children in the data
        for i in range(0, 6):
            # str(i) because it is '1' for example. It has not been converted to integer
            # filter for charges by the number of children and 'yes' for smokers
            charges_per_child = [float(item[0]) for item in charges_children_smokers if item[1] == str(i)
                                 and item[2] == 'yes']
            # average charges rounded to the hundredths
            avg_charges_per_child = round(sum(charges_per_child) / len(charges_per_child), 2)
            print('For smokers: The average insurance based on number of kids - ' + str(i) + ': '
                  + str(avg_charges_per_child) + ' dollars')
            
            
    def average_charges_per_kids_non_smokers(self):
        charges_children_smokers = list(zip(self.patients_charges, self.patients_num_children, self.patients_smoker_statuses))
        # range(0, 6) represents the unique number of children in the data
        for i in range(0, 6):
            # str(i) because it is '1' for example. It has not been converted to integer
            # filter for charges by the number of children and 'no' for non-smokers
            charges_per_child = [float(item[0]) for item in charges_children_smokers if item[1] == str(i)
                                 and item[2] == 'no']
            # average charges rounded to the hundredths
            avg_charges_per_child = round(sum(charges_per_child) / len(charges_per_child), 2)
            print('For non-smokers: The average insurance based on number of kids - ' + str(i) + ': '
                  + str(avg_charges_per_child) + ' dollars')
            
    
    # Find the unique regions patients live in
    def unique_regions(self):
        # initialize empty list
        unique_regions = []
        # iterate through each region in regions list
        for region in self.patients_regions:
            # if the region is not already in the unique regions list
            # then add it to the unique regions list
            if region not in unique_regions: 
                unique_regions.append(region)
        # return unique regions list
        return unique_regions
    
    
    # Analyze where a majority of the individuals are from
    def most_common_region(self):
        common_region = Counter(self.patients_regions)
        # common_region will print regions with their corresponding numbers
        print(common_region)
        # select the top-most common region
        common_region = common_region.most_common(1)
        majority_region = (common_region[0][0])
        return ('The majority of the individuals are from ' + majority_region)
    
    
    # Calculate the average insurance in each unique region
    def average_insurance_regions(self):
        regions_insurance_charges = list(zip(self.patients_regions, self.patients_charges))
        regions = ['southwest', 'southeast', 'northwest', 'northeast']
        for region in regions:
            # filter for charges from each region
            unique_region = [float(item[1]) for item in regions_insurance_charges if item[0] == region]
            avg_unique_region = round(sum(unique_region) / len(unique_region), 2)
            print ('Average insurance - ' + region + ': ' + str(avg_unique_region) + ' dollars')
    
    
    # Calculate the percentage of smokers in each region
    def percentage_smokers_regions(self):
        regions_smokers = list(zip(self.patients_regions, self.patients_smoker_statuses))
        regions = ['southwest', 'southeast', 'northwest', 'northeast']
        for region in regions:
            total_region_count = [item[1] for item in regions_smokers if item[0] == region]
            unique_region_count = [item[1] for item in regions_smokers if item[0] == region and item[1] == 'yes']
            # percentage rounded to the hundredths
            percentage_region = round((len(unique_region_count) / len(total_region_count)) * 100, 2)
            print(str(percentage_region) + '%' + ' smoke in the ' + region)
         
        
    # Calculate the average insurance considering the bmis and smoker statuses(smoker)
    def average_insurance_bmi_smokers(self):
        bmis_float = [float(i) for i in self.patients_bmis]
        charges_bmi_smoker = list(zip(self.patients_charges, bmis_float, self.patients_smoker_statuses))
        
        underweight = [float(item[0]) for item in charges_bmi_smoker if item[1] <= 18.5 and item[2] == 'yes']
        underweight_average = round(sum(underweight) / len(underweight), 2)

        healthy_range = [float(item[0]) for item in charges_bmi_smoker if item[1] >= 18.5 and item[1] <= 24.9 
                         if item[2] == 'yes']
        healthy_range_average = round(sum(healthy_range) / len(healthy_range), 2)

        overweight = [float(item[0]) for item in charges_bmi_smoker if item[1] >= 25 and item[1] <= 29.9 
                      if item[2] == 'yes']
        overweight_average = round(sum(overweight) / len(overweight), 2)

        obesity = [float(item[0]) for item in charges_bmi_smoker if item[1] >= 30 and item[1] <= 39.9 
                   if item[2] == 'yes']
        obesity_average = round(sum(obesity) / len(obesity), 2) 

        severe_obesity = [float(item[0]) for item in charges_bmi_smoker if item[1] >= 40 and item[2] == 'yes']
        severe_obesity_average = round(sum(severe_obesity) / len(severe_obesity), 2)
        
        #bmi_chart = [underweight_average, healthy_range_average, overweight_average, obesity_average, severe_obesity_average]
    
        return ('Average insurance for smokers - underweight:     ' + str(underweight_average), 
                'Average insurance for smokers - healthy_range:   ' + str(healthy_range_average), 
                'Average insurance for smokers - overweight:      ' + str(overweight_average), 
                'Average insurance for smokers - obesity:         ' + str(obesity_average), 
                'Average insurance for smokers - severe_obesity:  ' + str(severe_obesity_average))
    
    
    # Calculate the average insurance considering the bmis and smoker statuses(non-smoker)
    def average_insurance_bmi_non_smokers(self):
        bmis_float = [float(i) for i in self.patients_bmis]
        charges_bmi_smoker = list(zip(self.patients_charges, bmis_float, self.patients_smoker_statuses))
        
        underweight = [float(item[0]) for item in charges_bmi_smoker if item[1] <= 18.5 and item[2] == 'no']
        underweight_average = round(sum(underweight) / len(underweight), 2)

        healthy_range = [float(item[0]) for item in charges_bmi_smoker if item[1] >= 18.5 and item[1] <= 24.9 
                         if item[2] == 'no']
        healthy_range_average = round(sum(healthy_range) / len(healthy_range), 2)

        overweight = [float(item[0]) for item in charges_bmi_smoker if item[1] >= 25 and item[1] <= 29.9 
                      if item[2] == 'no']
        overweight_average = round(sum(overweight) / len(overweight), 2)

        obesity = [float(item[0]) for item in charges_bmi_smoker if item[1] >= 30 and item[1] <= 39.9 
                   if item[2] == 'no']
        obesity_average = round(sum(obesity) / len(obesity), 2) 

        severe_obesity = [float(item[0]) for item in charges_bmi_smoker if item[1] >= 40 and item[2] == 'no']
        severe_obesity_average = round(sum(severe_obesity) / len(severe_obesity), 2)
        
        return ('Average insurance for non-smokers - underweight:     ' + str(underweight_average), 
                'Average insurance for non-smokers - healthy_range:   ' + str(healthy_range_average), 
                'Average insurance for non-smokers - overweight:      ' + str(overweight_average), 
                'Average insurance for non-smokers - obesity:         ' + str(obesity_average), 
                'Average insurance for non-smokers - severe_obesity:  ' + str(severe_obesity_average))
    
    
    def average_bmis_regions(self):
        bmis_float = [float(i) for i in self.patients_bmis]
        bmis_regions = list(zip(bmis_float, self.patients_regions, self.patients_smoker_statuses))
        regions = ['southwest', 'southeast', 'northwest', 'northeast']
        for region in regions:
            bmi = [item[0] for item in bmis_regions if item[1] == region]
            average_bmi = round(sum(bmi) / len(bmi), 2)
            #bmi = [item[0] for item in bmis_regions if item[1] == region
            print('The average bmi in ' + region + ' is ' + str(average_bmi))
        

### Number of methods in the InsuranceData class

In [5]:
method_list = [method for method in dir(InsuranceData) if not method.startswith('__')]
print(len(method_list))

19
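`dir()` lists every attribute name on the class, so the count above would also include class-level variables if any were added later. A stricter count that keeps only functions could be sketched with the standard `inspect` module. The `Demo` class here is a hypothetical stand-in used only so the example is self-contained.

```python
import inspect

class Demo:
    """Hypothetical stand-in class used only for this illustration."""
    label = 'not a method'  # class-level variable, which dir() would also list

    def __init__(self):
        self.value = 0

    def first(self):
        return 1

    def second(self):
        return 2

# Keep only functions defined on the class whose names are not dunder names
methods = [name for name, member in inspect.getmembers(Demo, inspect.isfunction)
           if not name.startswith('__')]
print(methods, len(methods))
```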


### An instance of the class called `insurance_data` is created below
With this instance, each method can be used to see the results of the analysis.

In [6]:
insurance_data = InsuranceData(ages, sexes, bmis, num_children, smoker_status, regions, insurance_charges)

### What is the average age of a patient?

In [7]:
insurance_data.analyze_ages()

'The average age of a patient is 39.21 years'

### What is the most common age?

In [8]:
insurance_data.dominant_age()

'The dominant age of taxpayers is 18 year olds'

### Which age contributes the most to the total insurance charges?
The `19 year olds` contribute the highest share of the total insurance charges, `3.73%`. The `18 year olds` are the most common age group, yet they contribute only `2.75%`. Most ages contribute between `1%` and `2%` each.
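A share like the ones quoted above is one age's total charges divided by the total charges of all patients. The sketch below shows that calculation on made-up values. `charge_share_by_age` is a hypothetical helper, not a method of `InsuranceData`, and the numbers are not taken from the dataset.

```python
from collections import defaultdict

def charge_share_by_age(ages, charges):
    """Return {age: percentage of total charges}, rounded to two decimals."""
    totals = defaultdict(float)
    for age, charge in zip(ages, charges):
        totals[int(age)] += float(charge)
    grand_total = sum(totals.values())
    return {age: round(total / grand_total * 100, 2)
            for age, total in sorted(totals.items())}

# Made-up values, stored as strings like the columns read from the CSV
sample_ages = ['18', '18', '19', '40']
sample_charges = ['1000', '2000', '5000', '2000']

shares = charge_share_by_age(sample_ages, sample_charges)
top_age = max(shares, key=shares.get)
print(shares)
print(top_age)
```

In this toy data the most common age (18) is not the age with the largest share (19), which is the same kind of pattern described above.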

### The percentage of males vs. females


Both genders are almost equally represented in the dataset.

In [9]:
insurance_data.male_female_percentage()

'The percentage of male in the dataset is 50.5% and The percentage of female in the dataset is 49.5%'

### The average yearly insurance for an individual

In [10]:
insurance_data.average_charges()

'Average Yearly Medical Insurance Charges: 13270.42 dollars.'

### The average insurance for a male vs. female
This shows that males pay higher average yearly charges than females.

In [11]:
insurance_data.gender_average_insurance()

The average insurance each male pays is 13956.75 dollars
The average insurance each female pays is 12569.58 dollars


### The average insurance for smokers: male vs. female
The average insurance for a smoker is more than `double` the yearly average insurance.

In [12]:
insurance_data.average_insurance_smokers()

The average insurance charges for male smokers is 33042.01 dollars
The average insurance charges for female smokers is 30679.0 dollars


### The average insurance for non-smokers: male vs. females
The average insurance for non-smokers is lower than the yearly average insurance.

In [13]:
insurance_data.average_insurance_non_smokers()

The average insurance charges for male non-smokers is 8087.2 dollars
The average insurance charges for female non-smokers is 8762.3 dollars


### The unique number of children 

In [14]:
insurance_data.unique_number_children()

['0', '1', '3', '2', '5', '4']
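The list above keeps the values in the order they first appear in the file, and the values are still strings. If a sorted list of integers is easier to read, a set comprehension can be used. This is a sketch on made-up values, and `sorted_unique` is a hypothetical helper name.

```python
def sorted_unique(values):
    """Return the distinct values as integers in ascending order."""
    return sorted({int(v) for v in values})

# Made-up column of child counts, stored as strings
sample_children = ['0', '1', '3', '0', '2', '5', '4', '1']
unique_sorted = sorted_unique(sample_children)
print(unique_sorted)
```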

### The average insurance based on the number of kids
The unique number of kids in this dataset ranges from `0 to 5`. 
From the analysis below, average charges rise as the number of children goes from 0 to 3, dip for patients with 4 children, and are lowest for patients with 5 children.

In [15]:
insurance_data.average_charges_per_kids()

The average insurance based on number of kids - 0: 12365.98 dollars
The average insurance based on number of kids - 1: 12731.17 dollars
The average insurance based on number of kids - 2: 15073.56 dollars
The average insurance based on number of kids - 3: 15355.32 dollars
The average insurance based on number of kids - 4: 13850.66 dollars
The average insurance based on number of kids - 5: 8786.04 dollars


### The average insurance based on the number of kids and smoker status(yes)

In [16]:
insurance_data.average_charges_per_kids_smokers()

For smokers: The average insurance based on number of kids - 0: 31341.36 dollars
For smokers: The average insurance based on number of kids - 1: 31822.65 dollars
For smokers: The average insurance based on number of kids - 2: 33844.24 dollars
For smokers: The average insurance based on number of kids - 3: 32724.92 dollars
For smokers: The average insurance based on number of kids - 4: 26532.28 dollars
For smokers: The average insurance based on number of kids - 5: 19023.26 dollars


### The average insurance based on the number of kids and smoker status(no)

In [17]:
insurance_data.average_charges_per_kids_non_smokers()

For non-smokers: The average insurance based on number of kids - 0: 7611.79 dollars
For non-smokers: The average insurance based on number of kids - 1: 8303.11 dollars
For non-smokers: The average insurance based on number of kids - 2: 9493.09 dollars
For non-smokers: The average insurance based on number of kids - 3: 9614.52 dollars
For non-smokers: The average insurance based on number of kids - 4: 12121.34 dollars
For non-smokers: The average insurance based on number of kids - 5: 8183.85 dollars
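The three breakdowns by number of children follow the same pattern: filter the charges by one or more columns, then average each group. That pattern could be written once as a general helper. The sketch below is an assumption about how such a helper might look, not part of the `InsuranceData` class; `average_by_group` is a hypothetical name and the values are made up.

```python
from collections import defaultdict

def average_by_group(charges, *group_columns):
    """Average the charges for each distinct combination of group values."""
    buckets = defaultdict(list)
    for charge, *keys in zip(charges, *group_columns):
        buckets[tuple(keys)].append(float(charge))
    return {key: round(sum(vals) / len(vals), 2)
            for key, vals in sorted(buckets.items())}

# Made-up values, stored as strings like the columns read from the CSV
sample_charges = ['1000', '3000', '20000', '5000']
sample_children = ['0', '0', '1', '1']
sample_smokers = ['no', 'no', 'yes', 'no']

averages = average_by_group(sample_charges, sample_children, sample_smokers)
for key, value in averages.items():
    print(key, value)
```

Because only combinations that actually occur become keys, an empty group cannot cause a division by zero, which the list-comprehension approach would raise if, for example, no smoker had a given number of children.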


### Unique regions
Four regions are represented in this dataset.

In [18]:
insurance_data.unique_regions()

['southwest', 'southeast', 'northwest', 'northeast']

### Are the regions fairly represented?
The regions are fairly evenly represented, with the largest group of patients in the `southeast`.

In [19]:
insurance_data.most_common_region()

Counter({'southeast': 364, 'southwest': 325, 'northwest': 325, 'northeast': 324})


'The majority of the individuals are from southeast'
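To judge how evenly the regions are represented, the counts can be turned into percentages of the total. The sketch below copies the counts from the `Counter` output above, so it runs without the CSV file.

```python
from collections import Counter

# Counts copied from the Counter output above
region_counts = Counter({'southeast': 364, 'southwest': 325,
                         'northwest': 325, 'northeast': 324})

total = sum(region_counts.values())
# most_common() yields (region, count) pairs from the largest count to the smallest
region_shares = {region: round(count / total * 100, 1)
                 for region, count in region_counts.most_common()}
print(total)
print(region_shares)
```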

### The average insurance in different regions
The average insurance is highest in the `southeast` and lowest in the `southwest`, a difference of `2388.47` dollars.
`southwest` and `northwest` have the two lowest averages.

In [20]:
insurance_data.average_insurance_regions()

Average insurance - southwest: 12346.94 dollars
Average insurance - southeast: 14735.41 dollars
Average insurance - northwest: 12417.58 dollars
Average insurance - northeast: 13406.38 dollars
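The gap between the most and least expensive regions can be checked directly from these averages. The values in this sketch are copied from the printed output above.

```python
# Averages copied from the printed output above
region_averages = {'southwest': 12346.94, 'southeast': 14735.41,
                   'northwest': 12417.58, 'northeast': 13406.38}

highest = max(region_averages, key=region_averages.get)
lowest = min(region_averages, key=region_averages.get)
gap = round(region_averages[highest] - region_averages[lowest], 2)
print(highest, lowest, gap)
```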


### The percentage of smokers in different regions

In [21]:
insurance_data.percentage_smokers_regions()

17.85% smoke in the southwest
25.0% smoke in the southeast
17.85% smoke in the northwest
20.68% smoke in the northeast


### Average insurance based on bmi/smoker status
The [chart](https://www.nhsinform.scot/healthy-living/food-and-nutrition/healthy-eating-and-weight-loss/understanding-your-health-and-weight-body-mass-index-bmi) below is used as a guide:

|bmi|description|
|:--|:---------|
|under 18.5| This is described as underweight|
|between 18.5 and 24.9| This is described as the ‘healthy range’|
|between 25 and 29.9| This is described as overweight|
|between 30 and 39.9| This is described as obesity|
|40 or over|This is described as severe obesity|
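The chart states its bands to one decimal place, so a value such as 24.95 falls between two rows, and a value of exactly 18.5 needs to belong to one band only. One way to make the bands unambiguous is to treat each one as a half-open interval. This is a sketch under that assumption, not the rule used by the methods in `InsuranceData`; `bmi_category` is a hypothetical helper.

```python
def bmi_category(bmi):
    """Map a BMI value to a category, assuming half-open bands [low, high)."""
    if bmi < 18.5:
        return 'underweight'
    if bmi < 25:
        return 'healthy_range'
    if bmi < 30:
        return 'overweight'
    if bmi < 40:
        return 'obesity'
    return 'severe_obesity'

# Boundary and in-between values, chosen only to illustrate the bands
for value in [18.4, 18.5, 24.95, 29.99, 40]:
    print(value, bmi_category(value))
```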


### Average insurance based on BMIs and smoker status(yes)
The higher the bmi, the higher the average insurance.
The average insurance is even higher if the patient smokes.

In [22]:
insurance_data.average_insurance_bmi_smokers()

('Average insurance for smokers - underweight:     18809.82',
 'Average insurance for smokers - healthy_range:   19942.22',
 'Average insurance for smokers - overweight:      22379.03',
 'Average insurance for smokers - obesity:         40895.85',
 'Average insurance for smokers - severe_obesity:  45467.79')

### Average insurance based on BMIs and smoker status(no)

In [23]:
insurance_data.average_insurance_bmi_non_smokers()

('Average insurance for non-smokers - underweight:     5485.06',
 'Average insurance for non-smokers - healthy_range:   7599.64',
 'Average insurance for non-smokers - overweight:      8306.38',
 'Average insurance for non-smokers - obesity:         8927.2',
 'Average insurance for non-smokers - severe_obesity:  8179.66')

### The average bmis in different regions
The analysis below shows that the average patient in the `south` is `obese`
and the average patient in the `north` is `overweight`.
The highest average bmi is in the `southeast`.

In [24]:
insurance_data.average_bmis_regions()

The average bmi in southwest is 30.6
The average bmi in southeast is 33.36
The average bmi in northwest is 29.2
The average bmi in northeast is 29.17


# Create Dictionary

In [25]:
insurance_dictionary = insurance_data.__dict__
#print(insurance_dictionary)

All patient data is now organized in a dictionary, which is convenient for any further analysis of the attributes in **insurance.csv**.

### Convert the following columns to integer/float as the case may be
* `patients_ages`
* `patients_bmis`
* `patients_num_children`
* `patients_charges`

In [26]:

insurance_dictionary = insurance_data.__dict__
ages_int = [int(x) for x in insurance_dictionary['patients_ages']]
bmis_float = [float(x) for x in insurance_dictionary['patients_bmis']]
num_children_int = [int(x) for x in insurance_dictionary['patients_num_children']]
charges_float = [float(x) for x in insurance_dictionary['patients_charges']]

# update the dictionary
insurance_dictionary.update({'patients_ages': ages_int})
insurance_dictionary.update({'patients_bmis': bmis_float})
insurance_dictionary.update({'patients_num_children': num_children_int})
insurance_dictionary.update({'patients_charges': charges_float})

#print(insurance_dictionary)
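One detail to keep in mind: `__dict__` is the instance's own attribute dictionary, not a copy, so updating it also changes the attributes of `insurance_data`. Methods that compare values against strings such as `str(i)` would then no longer find matches if they were run again. If the converted values should live in a separate dictionary, a shallow copy can be taken first. The sketch below illustrates the difference with a hypothetical `Record` class and made-up values.

```python
class Record:
    """Hypothetical stand-in for an object holding a column as a list of strings."""
    def __init__(self, ages):
        self.ages = ages

record = Record(['18', '19'])

alias = record.__dict__          # the instance's own attribute dictionary
snapshot = dict(vars(record))    # shallow copy: a separate dictionary

# Converting inside the copy leaves the instance untouched
snapshot['ages'] = [int(x) for x in snapshot['ages']]
after_snapshot = list(record.ages)
print(after_snapshot)            # the instance still holds strings

# Converting through the alias changes the instance itself
alias['ages'] = [int(x) for x in alias['ages']]
print(record.ages)               # the instance now holds integers
```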

# Conclusion

As seen from the analysis, insurance charges are generally lower if you:
* have fewer children
* do not smoke
* have a BMI in the healthy range
    
Further investigations could be done to understand why:
* averages in the `southeast` are higher
* averages are lower with 5 children
* the average insurance for female non-smokers is higher than that of male non-smokers, even though males pay more on average overall and among smokers