# U.S. Medical Insurance Costs

For this project I will be using US medical insurance data and performing a basic analysis on that data. The data contains information on 1338 insurance applicants and is initially held in a CSV. The information given includes:
* Age
* Sex
* BMI
* Children
* Smoker
* Region
* Insurance Cost

I'm going to initially generate a summary of the data, followed by a deeper dive into what the data tells us. After completing the Python fundamentals section of the data science career path, I realised I am lacking knowledge of classes, so I will use them in this project to gain a deeper understanding. Since the data is given in CSV format, I will first import the csv module and convert each column of the .csv into a list.

### Data Cleaning

In [1]:
import csv

In [2]:
def convert_to_list(filename, column_name):
    """ A helper function that reads the csv file 'filename' and returns the values of the column
        'column_name' as a list of strings. """
    lst = list()
    with open(filename) as file:
        file_dict = csv.DictReader(file)
        for row in file_dict:
            lst.append(row[column_name])
    return lst

Now that we have a helper function to convert each column of the csv to a list, we can create 7 lists to hold the 7 columns of data.

In [3]:
age = convert_to_list("insurance.csv", "age")
sex = convert_to_list("insurance.csv", "sex")
bmi = convert_to_list("insurance.csv", "bmi")
children = convert_to_list("insurance.csv", "children")
smoker = convert_to_list("insurance.csv", "smoker")
region = convert_to_list("insurance.csv", "region")
insurance_cost = convert_to_list("insurance.csv", "charges")
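Each call above reopens and rereads the file. As a minimal sketch of an alternative, the file could be read once and split into columns in a single pass. The function name `read_columns` and the two sample rows below are made up for illustration; they are not part of the original notebook or dataset.

```python
import csv
import io


def read_columns(file_obj):
    """Read a CSV once and return {column name: list of string values}."""
    reader = csv.DictReader(file_obj)
    # fieldnames is taken from the header row of the file.
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns


# Two made-up rows standing in for insurance.csv (illustrative values only).
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,female,25.0,1,no,northeast,4000.0\n"
    "40,male,31.5,2,yes,southwest,21000.0\n"
)
columns = read_columns(sample)
age = columns["age"]
insurance_cost = columns["charges"]
```

With the real file, the same function could be called as `with open("insurance.csv", newline="") as f: columns = read_columns(f)`. The values stay as strings, just as they do with `convert_to_list`.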

As said above, I would like to practise using Classes in this project. But first, what do we actually want to get out of the data?
* A summary of the data (average age, no. of males/females, average BMI, average no. of children, no. of smokers/non-smokers, most frequent region and the average insurance cost)
* The difference in cost between: males/females, smokers/non-smokers, children/no-children, regions.
* The average age of applicants with 1, 2, 3... children

### Creating our Class

In [10]:
class InsuranceData:
    def __init__(self, age, sex, bmi, children, smoker, region, insurance_cost):
        self.age = age
        self.sex = sex
        self.bmi = bmi
        self.children = children
        self.smoker = smoker
        self.region = region
        self.insurance_cost = insurance_cost
    
    def average_age(self):
        """ Works out the average age of the insurance applicants. """
        sum_ages = 0
        for age in self.age:
            sum_ages += int(age)
        average_age = sum_ages / len(self.age)
        return int(average_age)
    
    def no_of_females(self):
        """ Counts the number of females within the dataset. """
        females = 0
        for sex in self.sex:
            if sex == "female":
                females += 1
        return females
    
    def no_of_males(self):
        """ Counts the number of males within the dataset. """
        males = 0
        for sex in self.sex:
            if sex == "male":
                males += 1
        return males
    
    def average_bmi(self):
        """ Works out the average bmi of the dataset. """
        sum_bmis = 0
        for bmi in self.bmi:
            sum_bmis += float(bmi)
        avg_bmi = sum_bmis / len(self.bmi)
        return round(avg_bmi, 1)
    
    def average_children(self):
        """ Works out the average no. of children of the applicants. """
        sum_children = 0
        for children in self.children:
            sum_children += int(children)
        avg_children = sum_children / len(self.children)
        return int(avg_children)
    
    def no_smoker(self):
        """ Works out the no. of smokers in the dataset. """
        sum_smokers = 0
        for smoker in self.smoker:
            if smoker == "yes":
                sum_smokers += 1
        return sum_smokers
    
    def no_non_smoker(self):
        """ Works out the no. of non-smokers in the dataset. """
        sum_non_smokers = 0
        for smoker in self.smoker:
            if smoker == "no":
                sum_non_smokers += 1
        return sum_non_smokers
    
    def most_frequent_region(self):
        """ Counts and returns the region applicants are most frequently from. """
        northeast_count = self.region.count("northeast")
        northwest_count = self.region.count("northwest")
        southeast_count = self.region.count("southeast")
        southwest_count = self.region.count("southwest")
        if northeast_count > northwest_count and northeast_count > southeast_count and northeast_count > southwest_count:
            return "northeast"
        elif northwest_count > northeast_count and northwest_count > southeast_count and northwest_count > southwest_count:
            return "northwest"
        elif southeast_count > northwest_count and southeast_count > northeast_count and southeast_count > southwest_count:
            return "southeast"
        else:
            return "southwest"
    
    def average_cost(self):
        """ Calculates the average cost for insurance. """
        sum_cost = 0
        for cost in self.insurance_cost:
            sum_cost += float(cost)
        average_cost = sum_cost / len(self.insurance_cost)
        return round(average_cost, 2)
    
    def data_summary(self):
        """ Returns a dictionary summary of the data, including: average age, no. males/females, average bmi, 
            average no. children, no. (non) smokers, most frequent region and the average insurance 
            cost. """
        summary_dict = {}
        summary_dict["Average Age"] = self.average_age()
        summary_dict["No. of Males"] = self.no_of_males()
        summary_dict["No. of Females"] = self.no_of_females()
        summary_dict["Average BMI"] = self.average_bmi()
        summary_dict["Average no. of Children"] = self.average_children()
        summary_dict["No. of Smokers"] = self.no_smoker()
        summary_dict["No. of Non-Smokers"] = self.no_non_smoker()
        summary_dict["Most Frequent Region"] = self.most_frequent_region()
        summary_dict["Average Cost"] = self.average_cost()
        return summary_dict
    
    def create_dictionary(self):
        """ Creates a dictionary of the applicants, where each key is an arbitrary count and the values are the
            applicant data. """
        applicant_dict = {}
        for i in range(len(self.age)):
            applicant_dict[i] = [self.age[i], self.sex[i], self.bmi[i], self.children[i], self.smoker[i], self.region[i], self.insurance_cost[i]]
        return applicant_dict
    
    def compare_cost_sex(self):
        """ Returns the difference in average insurance cost between males and females. """
        applicant_dict = self.create_dictionary()
        male_costs_sum = 0
        female_costs_sum = 0
        for lst in applicant_dict.values():
            if lst[1] == "male":
                male_costs_sum += float(lst[6]) 
            else:
                female_costs_sum += float(lst[6])
        average_male_cost = male_costs_sum / self.no_of_males()
        average_female_cost = female_costs_sum / self.no_of_females()
        diff_cost = round(average_male_cost - average_female_cost, 2)
        return f"The difference between the average cost of insurance for males and females is ${diff_cost}"
    
    def compare_cost_smoker(self):
        """ Returns the difference in average cost of insurance between smokers and non-smokers. """
        applicant_dict = self.create_dictionary()
        smokers_cost_sum = 0
        non_smokers_cost_sum = 0
        for lst in applicant_dict.values():
            if lst[4] == "yes":
                smokers_cost_sum += float(lst[6])
            else:
                non_smokers_cost_sum += float(lst[6])
        average_smokers_cost = smokers_cost_sum / self.no_smoker()
        average_non_smokers_cost = non_smokers_cost_sum / self.no_non_smoker()
        diff_cost = round(average_smokers_cost - average_non_smokers_cost, 2)
        return f"The difference between the average cost of insurance for smokers and non-smokers is ${diff_cost}"
    
    def compare_cost_children(self):
        """ Creates a dictionary where the keys are the no. of children and the values are the average cost of 
            insurance. """
        sum_costs = 0
        applicant_dict = self.create_dictionary()
        children_list = [int(i) for i in self.children]
        max_no_children = max(children_list)
        children_cost_dict = {key: [] for key in range(max_no_children+1)}
        for key in children_cost_dict.keys():
            for lst in applicant_dict.values():
                if int(lst[3]) == key:
                    children_cost_dict[key].append(float(lst[6]))
        for key in children_cost_dict.keys():
            for value in children_cost_dict[key]:
                sum_costs += value
            average_cost = sum_costs / len(children_cost_dict[key])
            sum_costs = 0
            children_cost_dict[key] = round(average_cost, 2)
        return children_cost_dict
    
    def compare_cost_region(self):
        """ Creates a dictionary where the keys are the region and the value is the average cost for that region. 
        """
        applicant_dict = self.create_dictionary()
        region_cost_dict = {"northeast": 0, "northwest": 0, "southeast": 0, "southwest": 0}
        northeast_count = self.region.count("northeast")
        northwest_count = self.region.count("northwest")
        southeast_count = self.region.count("southeast")
        southwest_count = self.region.count("southwest")
        for key in region_cost_dict.keys():
            for value in applicant_dict.values():
                if key == value[5]:
                    region_cost_dict[key] += float(value[6])
        region_cost_dict["northeast"] /= northeast_count
        region_cost_dict["northwest"] /= northwest_count
        region_cost_dict["southeast"] /= southeast_count
        region_cost_dict["southwest"] /= southwest_count
        for key in region_cost_dict.keys():
            region_cost_dict[key] = round(region_cost_dict[key], 2)
        return region_cost_dict
    
    def average_age_with_children(self):
        """ Creates a dictionary where the keys are the no. of children and the values are the average age of 
            applicants with that no. of children. """
        sum_ages = 0
        applicant_dict = self.create_dictionary()
        children_list = [int(i) for i in self.children]
        max_no_children = max(children_list)
        children_ages_dict = {key: [] for key in range(max_no_children+1)}
        for key in children_ages_dict.keys():
            for lst in applicant_dict.values():
                if int(lst[3]) == key:
                    children_ages_dict[key].append(int(lst[0]))
        for key in children_ages_dict.keys():
            for value in children_ages_dict[key]:
                sum_ages += value
            average_age = sum_ages / len(children_ages_dict[key])
            sum_ages = 0
            children_ages_dict[key] = round(average_age, 2)
        return children_ages_dict
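
Several of the methods above follow the same pattern: group a numeric column by the values of another column, then average each group. As a rough sketch, that pattern could be written once as a standalone function. The name `average_by` and the toy values are hypothetical and only for illustration.

```python
from collections import defaultdict


def average_by(group_keys, values, ndigits=2):
    """Return {group: mean of the values in that group}, rounded."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for key, value in zip(group_keys, values):
        totals[key] += float(value)  # values arrive as strings from the CSV
        counts[key] += 1
    # Each group's total is divided by that same group's count.
    return {key: round(totals[key] / counts[key], ndigits) for key in totals}


# Made-up values, kept as strings like the lists read from the CSV.
toy_region = ["northeast", "southwest", "northeast", "southwest"]
toy_cost = ["1000.0", "3000.0", "2000.0", "5000.0"]
toy_averages = average_by(toy_region, toy_cost)
print(toy_averages)
```

Because only groups that actually occur get an entry, this version never divides by zero for an empty group.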

Now that we have created our class, we can get into analysing the data. First, let's create our insurance data object using the lists we created earlier:

In [11]:
data = InsuranceData(age, sex, bmi, children, smoker, region, insurance_cost)

### Data Summary

We want a high-level summary of the data, so let's use the data_summary method we created:

In [12]:
data.data_summary()

{'Average Age': 39,
 'No. of Males': 676,
 'No. of Females': 662,
 'Average BMI': 30.7,
 'Average no. of Children': 1,
 'No. of Smokers': 274,
 'No. of Non-Smokers': 1064,
 'Most Frequent Region': 'southeast',
 'Average Cost': 13270.42}

This dictionary gives a clear picture of how the data is made up. Of the 1,338 applicants, 676 were male and 662 were female, so males and females are roughly evenly represented. The average age of the applicants was 39 and they had an average of 1 child (both truncated to whole numbers by the code), and the average BMI was 30.7. Non-smokers far outnumber smokers, 1,064 to 274, and the most frequent region was the Southeast. Across all 1,338 applicants the average cost of insurance was $13,270.42.

By creating this dictionary we have managed to quickly and easily analyse the make-up of our data ready for further analysis.
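
One detail worth keeping in mind when reading the summary: `average_age` and `average_children` wrap their result in `int()`, which truncates rather than rounds. A small sketch with made-up values (not taken from the dataset) shows the difference, using `statistics.mean` from the standard library.

```python
import statistics

# Made-up children counts as strings, mirroring how csv.DictReader returns values.
toy_children = ["0", "1", "3", "3"]

mean_children = statistics.mean(float(c) for c in toy_children)  # 1.75

truncated = int(mean_children)   # drops the fractional part
rounded = round(mean_children)   # rounds to the nearest whole number

print(mean_children, truncated, rounded)
```

Whether truncating or rounding is preferable depends on how the figure will be used; returning the unrounded mean and formatting it only for display is another option.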

Now that we have a summary of the data, let's look further to see what we can extract. We want to see how the insurance cost differs by sex, smoker status, no. of children and region.

### Comparing the Data

In [13]:
sex_comparison = data.compare_cost_sex()
sex_comparison

'The difference between the average cost of insurance for males and females is $1387.17'

Dividing each group's total cost by its own head count, males paid an average of $1,387.17 more for their insurance than females. This leads nicely into some further analysis of why that was the case.

In [14]:
smoker_comparison = data.compare_cost_smoker()
smoker_comparison

'The difference between the average cost of insurance for smokers and non-smokers is $23615.96'

Unsurprisingly, smokers paid an average of $23,615.96 more for insurance than non-smokers.

In [15]:
children_comparison = data.compare_cost_children()
children_comparison

{0: 12365.98, 1: 12731.17, 2: 15073.56, 3: 15355.32, 4: 13850.66, 5: 8786.04}

The average cost of insurance rises from 0 to 3 children, then falls for applicants with 4 and 5 children.
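
Averages over small groups can swing widely, so before reading much into the figures for 4 and 5 children it may help to check how many applicants fall into each group. A minimal sketch using `collections.Counter`, with made-up values rather than the real column:

```python
from collections import Counter

# Made-up children counts as strings, as they would be read from the CSV.
toy_children = ["0", "0", "1", "0", "2", "1", "5"]

# Count how many applicants have each number of children.
group_sizes = Counter(int(c) for c in toy_children)

for n_children, n_applicants in sorted(group_sizes.items()):
    print(n_children, n_applicants)
```

Running the same count on `data.children` would show how many applicants sit behind each average in the dictionary above.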

In [16]:
region_comparison = data.compare_cost_region()
region_comparison

{'northeast': 13406.38,
 'northwest': 12417.58,
 'southeast': 14735.41,
 'southwest': 12346.94}

Of the four regions, the Southeast had the highest average insurance cost at $14,735.41.

The data can clearly be analysed in much greater detail, and it is clear that each variable has a different effect on the average cost paid. However, the above comparisons give us a platform for looking into why they come out as they do.

Finally, we can present all of the data together using the create_dictionary() method. The data will be displayed as a dictionary where each key represents an applicant (given a pseudo ID no.) and the value is the data for that applicant:

In [17]:
data_dict = data.create_dictionary()
data_dict

{0: ['19', 'female', '27.9', '0', 'yes', 'southwest', '16884.924'],
 1: ['18', 'male', '33.77', '1', 'no', 'southeast', '1725.5523'],
 2: ['28', 'male', '33', '3', 'no', 'southeast', '4449.462'],
 3: ['33', 'male', '22.705', '0', 'no', 'northwest', '21984.47061'],
 4: ['32', 'male', '28.88', '0', 'no', 'northwest', '3866.8552'],
 5: ['31', 'female', '25.74', '0', 'no', 'southeast', '3756.6216'],
 6: ['46', 'female', '33.44', '1', 'no', 'southeast', '8240.5896'],
 7: ['37', 'female', '27.74', '3', 'no', 'northwest', '7281.5056'],
 8: ['37', 'male', '29.83', '2', 'no', 'northeast', '6406.4107'],
 9: ['60', 'female', '25.84', '0', 'no', 'northwest', '28923.13692'],
 10: ['25', 'male', '26.22', '0', 'no', 'northeast', '2721.3208'],
 11: ['62', 'female', '26.29', '0', 'yes', 'southeast', '27808.7251'],
 12: ['23', 'male', '34.4', '0', 'no', 'southwest', '1826.843'],
 13: ['56', 'female', '39.82', '0', 'no', 'southeast', '11090.7178'],
 14: ['27', 'male', '42.13', '0', 'yes', 'southeast', '3