# U.S. Medical Insurance Costs

In this project, a CSV file containing medical insurance data will be investigated using Python.

The goals for this project will include:

- Finding the average age of patients
- Returning the number of Males vs Females
- Finding the geographical location of the patients
- Returning the average yearly charges to the patient
- Comparing the average cost of insurance of smokers vs non smokers
- Finding the average age of someone with atleast one child

The first step is to import the relevant python libraries. Due to the file format, the CSV python library will be needed to work with the data.

In [1]:
# import csv library
import csv

The next step is to inspect the data in a text editor and find the names of the columns, types of data and if any of the data is missing

There are no missing values and each records contains 7 pieces of data which includes:
- Age - integer value
- Sex - male or female
- BMI - float value
- Children - integer value
- Smoker - string value (yes or no)
- Region - string value
- Charges - float value

To process and analyse this data, empty lists need to be created for every attribute

In [2]:
ages = []
sexes = []
bmis = []
num_children = []
smoker_statuses = []
regions = []
insurance_charges = []

To import the data into the lists, there are several approaches that can be taken. First option is to create 7 `for` loops which appends the data into the relevant list. 

Another approach is to make a function which takes the insurance.csv file, list and a column number as inputs and appends the data into the right list.
This approach reduces code repetition and will make the process more efficient overall and will make importing a new set of data easier as well


In [3]:
def import_data(lst, file, column):
    
    with open(file) as dataset:
        data = csv.DictReader(dataset)
        
        for row in data:
            lst.append(row[column])

    # return lst
    
# testing the import function
#import_data(ages,'insurance.csv','age')
#print(ages)

With this function, data can now be stored in the lists created earlier

In [4]:
import_data(ages, 'insurance.csv', 'age')
import_data(sexes, 'insurance.csv', 'sex')
import_data(bmis, 'insurance.csv', 'bmi')
import_data(num_children, 'insurance.csv', 'children')
import_data(smoker_statuses, 'insurance.csv', 'smoker')
import_data(regions, 'insurance.csv', 'region')
import_data(insurance_charges, 'insurance.csv', 'charges')

Now that all the data is stored in the relevant lists, the analysis can be started. This will include:
- Finding the average age of patients
- Returning the number of Males vs Females
- Finding the geographical location of the patients
- Returning the average yearly charges
- Comparing the average cost of insurance of smokers vs non smokers
- Finding the average age of someone with atleast one child

To do this, a new class has been created called `PatientsInfo` which will contain a method to tackle each one of these objectives

In [5]:
class PatientInfo:
    # initialisation takes the lists of data as input parameters
    def __init__(self, patient_ages, patient_sexes, patient_bmis, patient_num_children, 
                 patient_smoker_statuses, patient_regions, patient_charges):
        self.patient_ages = patient_ages
        self.patient_sexes = patient_sexes
        self.patient_bmis = patient_bmis
        self.patient_num_children = patient_num_children
        self.patient_smoker_statuses = patient_smoker_statuses
        self.patient_regions = patient_regions
        self.patient_charges = patient_charges
        
    def average_age(self):
        total_age = 0
        for age in self.patient_ages:
            total_age += int(age)
        # Average age is rounded to 2 decimal places after dividing the total by the length of the list
        print("Average Patient Age: {} years".format(round(total_age/len(self.patient_ages), 2)))
    
    def male_female(self):
        males = self.patient_sexes.count('male')
        females = self.patient_sexes.count('female')
        
        print('Females: {}'.format(females))
        print('Males: {}'.format(males))

    def unique_regions(self):
        unique_regions = []
        for region in self.patient_regions:
            if region not in unique_regions:
                unique_regions.append(region)
        return unique_regions
    
    def average_charges(self):
        total = 0.0
        for item in self.patient_charges:
            # values are stored as strings so must be converted to a float value
            total += float(item)
        print('Average yearly insurance costs: ${}'.format(round(total/len(self.patient_charges),2)))
        
    # method which compares the average insurance cost of smokers vs non-smokers
    def smoker_charges(self):
        smoker_count = 0
        smoker_cost = 0.0
        non_smoker_count = 0
        non_smoker_cost = 0
        # looping through a list of smoker status and their charges zipped
        for item in list(zip(self.patient_smoker_statuses, self.patient_charges)):
            if item[0] == 'yes':
                smoker_count += 1
                smoker_cost += float(item[1])
            else:
                non_smoker_count += 1
                non_smoker_cost += float(item[1])
        print('Average smoker insurance cost: ${}'.format(round(smoker_cost/smoker_count),2))
        print('Average non smoker insurance cost: ${}'.format(round(non_smoker_cost/non_smoker_count),2))
        
    # works out the average age of someone with atleast 1 child
    def one_child_age(self):
        total_age = 0
        count = 0
        # combining the ages and number of children into a list and then looping through it
        for item in list(zip(self.patient_ages, self.patient_num_children)):
            if int(item[1]) > 1:
                total_age += int(item[0])
                count += 1
        print('Average age of patients with atleast 1 child is {}'.format(round(total_age/count,1)))
        
        

The next step is to create an instance of the `PatientsInfo` class so that we can use the methods inside of it. The lists containing the data will be input parameters

In [6]:
patient_info = PatientInfo(ages, sexes, bmis, num_children, smoker_statuses, regions, insurance_charges)

In [7]:
patient_info.average_age()

Average Patient Age: 39.21 years


In [8]:
patient_info.male_female()

Females: 662
Males: 676


In [9]:
patient_info.unique_regions()

['southwest', 'southeast', 'northwest', 'northeast']

In [10]:
patient_info.average_charges()

Average yearly insurance costs: $13270.42


In [11]:
patient_info.smoker_charges()

Average smoker insurance cost: $32050
Average non smoker insurance cost: $8434


In [12]:
patient_info.one_child_age()

Average age of patients with atleast 1 child is 40.0


The average age of the patients in the dataset is around 39 years old. It is important to check this to ensure data is representative of the broader population

Further analysis can be done by finding the range and standard deviation of the ages to ensure the patients represent a random sampling of a larger population

The next objective of this project is to compare the number of male patients to female patients. Again, this is needed to ensure the data is representative of the broader population and if there is a large difference between the two values, analysis of the data will be affected

The function returns 662 Females and 676 Males. With a difference of 14, this is around 1% of the sample size which indicates data is evenly spread between males and females

There are 4 unique geographical regions in this dataset which are all based in the US and the average yearly costs per individual is 13270 dollas. Further analysis can be done to see how different variables will affect this cost such as smoking. The average insurance costs for a smoker is nearly 4 times higher than a non-smoker.