# U.S. Medical Insurance Costs

This project builds on a dataset (insurance.csv) of yearly medical insurance costs and the corresponding patient parameters: age, sex, BMI, number of children, smoking status, and region of residence in the USA.
The goal of this project is to look for patterns among these parameters and to investigate how they influence insurance costs.

In [1]:
#import csv library
import csv

In [87]:
#create empty lists to store the data, one per column in the insurance.csv file
ages = []
sex = []
bmi = []
num_children = []
smoker = []
regions = []
charges = []

#define a function that reads insurance.csv and appends the values of one column (key_name) to the given list
def store_data(list_name, key_name):
    with open('insurance.csv') as insurance_file:
        insurance = csv.DictReader(insurance_file)
        for i in insurance:
            list_name.append(i[key_name])
    
#load data in the various lists
store_data(ages,'age')
store_data(sex,'sex')
store_data(bmi,'bmi')
store_data(num_children,'children')
store_data(smoker,'smoker')
store_data(regions,'region')
store_data(charges,'charges')

All data are loaded and well organized, so the analysis can start.
Note that the values in all lists are stored as strings.
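
Because `csv.DictReader` yields every field as a string, numeric columns need an explicit conversion before any arithmetic. The sketch below shows one possible way to read the file in a single pass and convert values as they are loaded. The helper name `load_columns`, the `CONVERTERS` table, and the two sample rows are illustrative assumptions; they are not part of the original notebook, and the sample values are not taken from insurance.csv.

```python
import csv
import io

# Hypothetical converter table: column name -> type to cast to.
CONVERTERS = {'age': int, 'bmi': float, 'children': int, 'charges': float}

def load_columns(file_obj):
    """Read a CSV once and return a dict mapping column name -> list of values."""
    reader = csv.DictReader(file_obj)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            # Columns without a converter (sex, smoker, region) stay as strings.
            convert = CONVERTERS.get(name, str)
            columns[name].append(convert(value))
    return columns

# Made-up sample rows that reuse the column names of insurance.csv.
sample = io.StringIO(
    'age,sex,bmi,children,smoker,region,charges\n'
    '30,female,25.0,0,yes,southwest,1000.5\n'
    '40,male,31.2,2,no,southeast,2000.0\n'
)
columns = load_columns(sample)
print(columns['age'])
print(columns['charges'])
```

With the real file, the same function could be called as `load_columns(open_file)` inside a `with open('insurance.csv') as open_file:` block, which reads the file once instead of once per column.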

In [274]:
#create a class that contains the needed methods to perform the data analysis
class Patient:
    #constructor
    def __init__(self, ages, sex, bmi, num_children, smoker, regions, charges):
        self.age = ages
        self.sex = sex
        self.bmi = bmi
        self.children = num_children
        self.smoker = smoker
        self.region = regions
        self.charge = charges

    #define a method that calculates the average age
    def avg_age(self):
        sum_values = 0
        for i in self.age:
            sum_values += int(i)

        print('Average age of patients in this study: ' + str(round(sum_values/len(self.age),2)) + ' years.')
    
    #define a method that calculates the average insurance costs per year
    def avg_insurance(self):
        sum_values = 0
        for i in self.charge:
            sum_values += float(i)

        print('Average insurance costs: ' + str(round(sum_values/len(self.charge),2)) + ' dollars per year.')
    
    
    #define a method to check if there is a good balance between female/male patients 
    def balance_sex(self):
        men_tot = 0
        women_tot = 0
        for i in self.sex:
            if i == "male":
                men_tot += 1
            else:
                women_tot += 1
        print('This dataset contains info on {women} women and {men} men.'.format(women = women_tot, men = men_tot))
    
    #define a method that finds the distinct regions present in the data
    def find_regions(self):
        distinct_reg = []
        for i in self.region:
            if i not in distinct_reg:
                distinct_reg.append(i)

        print(distinct_reg)
        
    #define a method that checks the distribution of data across USA regions
    def distr_regions(self):
        nw_tot = 0
        ne_tot = 0
        sw_tot = 0
        se_tot = 0

        for i in self.region:
            if i == 'southwest':
                sw_tot += 1
            elif i == 'southeast':
                se_tot += 1
            elif i == 'northeast':
                ne_tot += 1
            else:
                nw_tot += 1
        
        print('Data provenience: {nw} from northwest, {ne} from northeast, {sw} from southwest and {se} from southeast.'.format(nw = nw_tot, ne = ne_tot, sw = sw_tot, se = se_tot))
    
    #define a method that counts the smokers
    def smokers_num(self):
        count = 0
        for i in self.smoker:
            if i == 'yes':
                count += 1

        print('Total number of smokers: ' + str(count) + ' out of ' + str(len(self.smoker)) + ' people.')
        return count
        
    #define a method that constructs a dictionary containing all patients' info
    def create_dict(self):
        self.patients_dict = {}
        self.patients_dict['age'] = [int(i) for i in self.age]
        self.patients_dict["sex"] = self.sex
        self.patients_dict["bmi"] = self.bmi
        self.patients_dict["children"] = self.children
        self.patients_dict["smoker"] = self.smoker
        self.patients_dict["region"] = self.region
        self.patients_dict["charges"] = [float(i) for i in self.charge]
        return self.patients_dict

        
    #define a method that calculates the average age of patients with exactly one child
    def age_parents_one_child(self, patient_dict):
        sum_age = 0
        one_kid_parent = 0
        for i in range(len(self.age)):
            if patient_dict['children'][i] != '1':
                continue
            sum_age += patient_dict['age'][i]
            one_kid_parent += 1

        avg_age_parent = sum_age/one_kid_parent

        print('Average age of people having 1 child: ' + str(round(avg_age_parent,2)) + ' years.')
        print(str(one_kid_parent) + ' persons out of ' + str(len(self.age)) + ' have 1 child')
        
    
    

In [275]:
patient = Patient(ages, sex, bmi, num_children, smoker, regions, charges)

In [276]:
patient.avg_age()

Average age of patients in this study: 39.21 years.


In [207]:
patient.avg_insurance()

Average insurance costs: 13270.42 dollars per year.


Check if there is a good balance between female/male patients:

In [200]:
patient.balance_sex()

This dataset contains info on 662 women and 676 men.


Create a dictionary that contains all patients' info:

In [259]:
patient_dict = patient.create_dict()

Investigate where the data come from:

In [201]:
patient.find_regions()

['southwest', 'southeast', 'northwest', 'northeast']


In [256]:
patient.distr_regions()

Data provenience: 325 from northwest, 324 from northeast, 325 from southwest and 364 from southeast.


The southeast region contributes the most records (about 40 more than each of the other regions); otherwise the distribution is balanced.
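
The per-region counts could also be obtained without a hard-coded branch for each region. The sketch below uses `collections.Counter` and reports each region's share of the records. The function name `region_shares` and the example labels are assumptions made for illustration. With the counts reported above, the southeast share is 364 of 1338 records, or about 27%, so it is the largest group but not more than half of the data.

```python
from collections import Counter

def region_shares(region_list):
    """Return {region: (count, percent of all records)} for a list of region labels."""
    counts = Counter(region_list)
    total = len(region_list)
    return {region: (n, round(n * 100 / total, 2)) for region, n in counts.items()}

# Made-up labels for illustration; in the notebook this would be the `regions` list.
example = ['southeast', 'southwest', 'southeast', 'northeast']
shares = region_shares(example)
print(shares)
```

Because `Counter` discovers the labels itself, an unexpected or misspelled region would show up as its own entry rather than being silently counted under the final `else` branch.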

Find the average age of patients with ONE kid (at the time of data-taking):

In [278]:
patient.age_parents_one_child(patient_dict)

Average age of people having 1 child: 39.45 years.
324 persons out of 1338 have 1 child


Compare the insurance costs of smokers vs. non-smokers:

In [234]:
smoker_costs = 0
non_smoker_costs = 0
count_non_smokers = 0
count_smokers = 0

for i in range(len(smoker)):
    if smoker[i] == 'yes':
        smoker_costs += float(charges[i])
        count_smokers += 1
    else:
        non_smoker_costs += float(charges[i])
        count_non_smokers += 1
        
diff = round(smoker_costs/count_smokers - non_smoker_costs/count_non_smokers,2)
        
print('On average, a smoker pays {diff} dollars more in insurance costs than a non-smoker.'.format(diff = diff))
    

On average, a smoker pays 23615.96 dollars more in insurance costs than a non-smoker.


In [252]:
count = patient.smokers_num()

Total number of smokers: 274 out of 1338 people.


Do smokers tend to have NO children?

In [253]:
smokers_w_kids = 0
non_smokers_w_kids = 0
for i in range(len(smoker)):
    if num_children[i] != '0':
        if smoker[i] == 'yes':
            smokers_w_kids += 1
        else:
            non_smokers_w_kids += 1

print('Out of {count} smokers, {smokers_w_kids} have kids. This corresponds to {percent}% of the smokers'.format(count = count, smokers_w_kids = smokers_w_kids, percent = round(smokers_w_kids*100/count,2)))
print('Out of {count} non-smokers, {non_smokers_w_kids} have kids. This corresponds to {percent}% of the non-smokers'.format(count = len(ages) - count, non_smokers_w_kids = non_smokers_w_kids, percent = round(non_smokers_w_kids*100/(len(ages) -count),2)))

Out of 274 smokers, 159 have kids. This corresponds to 58.03% of the smokers
Out of 1064 non-smokers, 605 have kids. This corresponds to 56.86% of the non-smokers


Smokers seem to have kids at about the same rate as non-smokers.
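
As a quick arithmetic check of this comparison, the sketch below recomputes both percentages from the counts reported above and expresses the gap in percentage points. The helper name `percent` is an assumption introduced for illustration. The result only describes this dataset; it is not a test of statistical significance.

```python
def percent(part, total):
    """Share of `part` in `total`, as a percentage rounded to 2 decimals."""
    return round(part * 100 / total, 2)

# Counts taken from the output above.
smokers_pct = percent(159, 274)
non_smokers_pct = percent(605, 1064)

# Gap in percentage points, computed from the unrounded shares.
gap = round(159 * 100 / 274 - 605 * 100 / 1064, 2)
print(smokers_pct, non_smokers_pct, gap)
```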

In which region are insurance costs highest on average?

In [236]:
#total charges per region
costs_se = 0
costs_sw = 0
costs_ne = 0
costs_nw = 0
#number of patients per region (the totals in distr_regions are local to that method, so they are recounted here)
se_tot = 0
sw_tot = 0
ne_tot = 0
nw_tot = 0
for i in range(len(regions)):
    if regions[i] == 'southeast':
        costs_se += float(charges[i])
        se_tot += 1
    elif regions[i] == 'southwest':
        costs_sw += float(charges[i])
        sw_tot += 1
    elif regions[i] == 'northeast':
        costs_ne += float(charges[i])
        ne_tot += 1
    else:
        costs_nw += float(charges[i])
        nw_tot += 1
        
print('Average insurance cost in the SE region: {se}, in the SW region: {sw}, in the NE region: {ne} and in the NW region: {nw} dollars'.format(sw = round(costs_sw/sw_tot,2), nw=round(costs_nw/nw_tot,2), se=round(costs_se/se_tot,2), ne=round(costs_ne/ne_tot,2)))
        

Average insurance cost in the SE region: 14735.41, in the SW region: 12346.94, in the NE region: 13406.38 and in the NW region: 12417.58 dollars


Highest costs in the SE region, followed by NE.
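
The smoker and region comparisons follow the same pattern: sum the charges within each group, count the group's members, and divide. A minimal sketch of a reusable helper for that pattern is given below. The name `mean_by_group` and the example data are assumptions made for illustration and do not come from the notebook or from insurance.csv.

```python
from collections import defaultdict

def mean_by_group(labels, values):
    """Average the numeric `values` within each distinct label."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for label, value in zip(labels, values):
        # Values may still be strings, as in the notebook's lists.
        totals[label] += float(value)
        counts[label] += 1
    return {label: round(totals[label] / counts[label], 2) for label in totals}

# Made-up data for illustration.
example_regions = ['southeast', 'southwest', 'southeast']
example_charges = ['100.0', '50.0', '300.0']
means = mean_by_group(example_regions, example_charges)
print(means)
```

In the notebook, calls such as `mean_by_group(regions, charges)` or `mean_by_group(smoker, charges)` would cover both comparisons with one function.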