# U.S. Medical Insurance Costs

In this project, a **CSV** file with medical insurance costs will be investigated using Python fundamentals. The goal with this project will be to analyze various attributes within **insurance.csv** to learn more about the patient information in the file and gain insight into potential use cases for the dataset.

In [13]:
# import csv library
import csv

To start, all necessary libraries must be imported. For this project the only library needed is the `csv` library in order to work with the **insurance.csv** data. There are other potential libraries that could help with this project; however, for this analysis, using just the `csv` library will suffice.

The next step is to look through **insurance.csv** in order to get acquainted with the data. The following aspects of the data file will be checked in order to plan out how to import the data into a Python file:
* The names of columns and rows
* Any noticeable missing data
* Types of values (numerical vs. categorical)
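The three checks listed above could also be done programmatically rather than by eye. The sketch below is one possible approach, not part of the original notebook: `inspect_csv` is a hypothetical helper, and the three-row sample is made up to stand in for **insurance.csv**.

```python
import csv
import io


def inspect_csv(file_obj):
    """Summarize column names, row count, missing cells and value types."""
    reader = csv.DictReader(file_obj)
    columns = reader.fieldnames or []
    missing = {name: 0 for name in columns}
    numeric = {name: True for name in columns}
    row_count = 0

    for row in reader:
        row_count += 1
        for name in columns:
            value = (row[name] or "").strip()
            if value == "":
                # Empty cells count as missing and are skipped for typing
                missing[name] += 1
                continue
            try:
                float(value)
            except ValueError:
                numeric[name] = False

    # A column is "numerical" only if every non-empty cell parses as a float
    kinds = {name: "numerical" if numeric[name] else "categorical" for name in columns}
    return columns, row_count, missing, kinds


# Hypothetical three-row sample standing in for insurance.csv
sample = io.StringIO(
    "age,sex,bmi,smoker\n"
    "19,female,27.9,yes\n"
    "18,male,,no\n"
    "28,male,33,no\n"
)
columns, row_count, missing, kinds = inspect_csv(sample)
print(columns)
print(row_count, missing)
print(kinds)
```

With the real file, the same function could be called inside `with open("insurance.csv", newline="") as file:`.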

In [14]:
#Create empty lists for the various attributes in insurance.csv
ages = []
sexes = []
bmis = []
num_children = []
smoker_statuses = []
regions = []
insurance_costs = []

**insurance.csv** contains the following columns:
* Patient Age
* Patient Sex 
* Patient BMI
* Patient Number of Children
* Patient Smoking Status
* Patient U.S. Geographical Region
* Patient Yearly Medical Insurance Cost

There are no signs of missing data. To store this information, seven empty lists are created, one for each column of data from **insurance.csv**.


In [15]:
# helper function to load csv data
def load_list_data(lst, csv_file, column_name):
    # newline="" is the mode recommended by the csv module documentation
    with open(csv_file, newline="") as file:
        csv_dict = csv.DictReader(file)

        for row in csv_dict:
            lst.append(row[column_name])

The helper function above was created to keep the loading code short and free of repetition. Without this function, one would have to open **insurance.csv** and rewrite the `for` loop seven times; with it, one can simply call `load_list_data()` once per column, as shown below.

In [16]:
# load csv data into appropriate lists
load_list_data(ages, "insurance.csv", "age")
load_list_data(sexes, "insurance.csv", "sex")
load_list_data(bmis, "insurance.csv", "bmi")
load_list_data(num_children, "insurance.csv", "children")
load_list_data(smoker_statuses, "insurance.csv", "smoker")
load_list_data(regions, "insurance.csv", "region")
load_list_data(insurance_costs, "insurance.csv", "charges")
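Each call above reopens and rereads the file. If that ever became a concern, a single-pass variant could be sketched as follows. `load_columns` is a hypothetical alternative, not the notebook's helper, and the two-row sample is only a stand-in for the real file.

```python
import csv
import io


def load_columns(file_obj, column_names):
    """Read the file once and return {column_name: list_of_values}."""
    columns = {name: [] for name in column_names}
    for row in csv.DictReader(file_obj):
        for name in column_names:
            columns[name].append(row[name])
    return columns


# Hypothetical two-row sample; with the real file one would pass the object
# returned by open("insurance.csv", newline="") instead.
sample = io.StringIO("age,sex,charges\n19,female,16884.924\n18,male,1725.5523\n")
data = load_columns(sample, ["age", "charges"])
print(data)
```

The values stay as strings, just as with `load_list_data()`, so the later conversions with `int()` and `float()` would still apply.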

Now that all the data from **insurance.csv** is neatly organized into labeled lists, the analysis can be started. This is where one must plan out what to investigate and how to perform the analysis. There are many aspects of the data that could be looked into. The following operations will be implemented:
* find average age of the patients
* return the number of males vs. females counted in the dataset
* return the average yearly medical charges of the patients
* find the number of smokers vs. nonsmokers and the difference between their average insurance costs
* return which average BMI (males vs. females) is higher
* return average age of the patients with 1 child or more
* find geographical location of the patients
* return the most popular region for medical insurance claims
* create a database: a list of dictionaries that contains all patient information

To perform these inspections, a class called `PatientsInfo` has been built out which contains nine methods:
* `calculate_average_age()`
* `analyze_sexes()`
* `calculate_average_cost()`
* `calculate_smoking_difference()`
* `calculate_highest_bmi()`
* `calculate_age_children()`
* `count_unique_regions()`
* `calculate_popular_region()`
* `create_database()`

The class has been built out below. 

In [17]:
class PatientsInfo:
    # init method that takes in each list parameter
    def __init__(self, ages, sexes, bmis, num_children, smoker_statuses, regions, insurance_costs):
        self.ages = ages
        self.sexes = sexes
        self.bmis = bmis
        self.num_children = num_children
        self.smoker_statuses = smoker_statuses
        self.regions = regions
        self.insurance_costs = insurance_costs

    # method that calculates the average ages of the patients
    def calculate_average_age(self):
        sum_age = 0
        
        for age in self.ages:
            sum_age += int(age)
        average_age = round(sum_age/len(self.ages), 2)
        
        print("Average age of patients:", average_age, "year(s)")
        return average_age

    # method that calculates the number of males and females
    def analyze_sexes(self):
        males = self.sexes.count("male")
        females = self.sexes.count("female")

        print("Count for males:", males)
        print("Count for females:", females)

    # method to find average yearly medical charges for patients
    def calculate_average_cost(self):
        total_cost = 0

        for cost in self.insurance_costs:
            total_cost += float(cost)
        average_cost = round(total_cost / len(self.insurance_costs), 2)

        print("Average yearly medical insurance cost:", average_cost, "dollars.")
        return average_cost

    # method that calculates the difference between insurance costs of smoking and nonsmoking patients
    def calculate_smoking_difference(self):
        num_smokers = 0
        num_nonsmokers = 0
        costs_smokers = 0
        costs_nonsmokers = 0

        for status, cost in zip(self.smoker_statuses, self.insurance_costs):
            if status == "yes":
                num_smokers += 1
                costs_smokers += float(cost)
            else:
                num_nonsmokers += 1
                costs_nonsmokers += float(cost)
        average_cost_smokers = round(costs_smokers / num_smokers, 2)
        average_cost_nonsmokers = round(costs_nonsmokers / num_nonsmokers, 2)
        difference = abs(average_cost_smokers - average_cost_nonsmokers)
        
        print("There are {num_smokers} smoking and {num_nonsmokers} non-smoking patients in our data sample. The difference between average insurance costs is {difference} dollars.".format(num_smokers=num_smokers, num_nonsmokers=num_nonsmokers, difference=difference))
        return difference
    
    # method to find which average bmi (male or female) is higher
    def calculate_highest_bmi(self):
        num_males = 0
        num_females = 0
        bmi_males = 0
        bmi_females = 0

        for sex, bmi in zip(self.sexes, self.bmis):
            if sex == "male":
                num_males += 1
                bmi_males += float(bmi)
            else:
                num_females += 1
                bmi_females += float(bmi)
        average_bmi_males = bmi_males / num_males
        average_bmi_females = bmi_females / num_females
        difference = round(average_bmi_males - average_bmi_females, 3)

        if difference > 0:
            return "In current sample male BMI is {num} points higher than female one.".format(num=difference)
        elif difference < 0:
            # report the magnitude; the sign is already conveyed by the wording
            return "In current sample female BMI is {num} points higher than male one.".format(num=abs(difference))
        else:
            return "In current sample male BMI is equal to the female one."
   
    # method that calculates average age of patients with children
    def calculate_age_children(self):
        sum_ages_w_children = 0
        num_people_w_children = 0

        for age, child in zip(self.ages, self.num_children):
            if int(child) > 0:
                num_people_w_children += 1
                sum_ages_w_children += int(age)
        average_age_w_children = round(sum_ages_w_children / num_people_w_children, 2)

        print("Average age of patients with 1 child or more is {age} year(s).".format(age=average_age_w_children))
        return average_age_w_children
    
    # method to find each unique region patients are from
    def count_unique_regions(self):
        unique_regions = []
        
        for region in self.regions:
            if region not in unique_regions:
                unique_regions.append(region)
                
        return unique_regions
    
    # method that calculates the most popular region based on the number of medical insurance claims
    def calculate_popular_region(self):
        popular_region = ""
        num_of_requests = 0

        southwest = self.regions.count("southwest")
        southeast = self.regions.count("southeast")
        northwest = self.regions.count("northwest")
        northeast = self.regions.count("northeast")

        if max(southwest, southeast, northwest, northeast) == southwest:
            popular_region = "Southwest"
            num_of_requests = southwest
        elif max(southwest, southeast, northwest, northeast) == southeast:
            popular_region = "Southeast"
            num_of_requests = southeast
        elif max(southwest, southeast, northwest, northeast) == northwest:
            popular_region = "Northwest"
            num_of_requests = northwest
        elif max(southwest, southeast, northwest, northeast) == northeast:
            popular_region = "Northeast"
            num_of_requests = northeast

        print("The most popular region for medical insurance is {region}. The number of request from this region is {num}.".format(region=popular_region, num=num_of_requests))
        return popular_region, num_of_requests
    
    # method to create database with all patients information
    def create_database(self):
        database = []
        
        for i in range(len(self.ages)):
            patient = {}
            patient["Age"] = self.ages[i]
            patient["Sex"] = self.sexes[i]
            patient["BMI"] = self.bmis[i]
            patient["Number of Children"] = self.num_children[i]
            patient["Smoker Status"] = self.smoker_statuses[i]
            patient["Region"] = self.regions[i]
            patient["Insurance Cost"] = self.insurance_costs[i]
            database.append(patient)

        return database

The next step is to create an instance of the class called `patients_data`. With this instance, each method can be used to see the results of the analysis.

In [18]:
patients_data = PatientsInfo(ages, sexes, bmis, num_children, smoker_statuses, regions, insurance_costs)

In [19]:
average_age = patients_data.calculate_average_age()

Average age of patients: 39.21 year(s)


The average age of the patients in **insurance.csv** is about 39 years. This is important to check in order to ensure the data in **insurance.csv** is representative of a broader population. If the dataset is used to make inferences about other populations, the data must be abundant and broad enough for such use cases.

Further analysis would have to be done to make sure the [range](https://www.mathsisfun.com/data/range.html#:~:text=The%20Range%20is%20the%20difference,is%209%20%E2%88%92%203%20%3D%206.) and [standard deviation](https://www.mathsisfun.com/data/standard-deviation.html) of the patient ages in **insurance.csv** are consistent with a random sampling of individuals.
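As a minimal sketch of that follow-up, the range and standard deviation can be computed with the standard `statistics` module. `describe_ages` is a hypothetical helper, the five ages are a small illustrative sample, and the population standard deviation (`pstdev`) is an assumed choice; `stdev` would give the sample version.

```python
import statistics


def describe_ages(ages):
    """Return (minimum, maximum, range, population standard deviation)."""
    values = [int(age) for age in ages]
    lowest, highest = min(values), max(values)
    spread = statistics.pstdev(values)
    return lowest, highest, highest - lowest, round(spread, 2)


# Small illustrative sample; the notebook's `ages` list could be passed instead
sample_ages = ["19", "18", "28", "33", "32"]
print(describe_ages(sample_ages))
```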

In [20]:
patients_data.analyze_sexes()

Count for males: 676
Count for females: 662


The next step of the analysis is to check the balance of males vs. females in **insurance.csv**. Similar to above, it is important to check that this dataset is representative of a broader population of individuals. If a person were to use this dataset to create a classification model, it would be imperative to make sure that the attributes are balanced.

Quite often in the real world, data is not balanced; this is a problem because it can skew the results of an analysis. This is something that will be explored further in future portfolio projects!
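One simple way to quantify balance is to express each category as a share of the whole. The sketch below uses the counts printed above; `category_shares` is a hypothetical helper that would work for any of the categorical lists.

```python
from collections import Counter


def category_shares(values):
    """Return {category: percentage of all values}, rounded to one decimal."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: round(100 * count / total, 1) for category, count in counts.items()}


# Rebuild a list with the same counts as the output above (676 males, 662 females)
print(category_shares(["male"] * 676 + ["female"] * 662))
```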

In [21]:
average_insurance_cost = patients_data.calculate_average_cost()

Average yearly medical insurance cost: 13270.42 dollars.


The average yearly medical insurance charge per individual is about 13,270 US dollars. Further analysis could be done to see which patient attributes contribute most strongly to low or high medical insurance charges. For example, one could check whether patient age correlates with yearly charges.
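A possible way to run that check without extra libraries is the Pearson correlation coefficient. The sketch below is an assumption about how one might do it: `pearson_correlation` is a hypothetical helper and the sample values are made up purely to exercise it.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    xs = [float(x) for x in xs]
    ys = [float(y) for y in ys]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Made-up, perfectly linear values; the notebook's `ages` and
# `insurance_costs` lists could be passed instead.
sample_ages = ["20", "30", "40"]
sample_charges = ["1000", "2000", "3000"]
print(pearson_correlation(sample_ages, sample_charges))
```

A value near 1 or -1 would suggest a strong linear relationship and a value near 0 a weak one. The function raises `ZeroDivisionError` if either sequence has no variation.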

In [22]:
smoking_difference = patients_data.calculate_smoking_difference()

There are 274 smoking and 1064 non-smoking patients in our data sample. The difference between average insurance costs is 23615.96 dollars.


In this case, two questions are addressed at once. Firstly, the difference in average insurance costs is about 23,616 US dollars, which is quite substantial. However, this difference may also depend on other factors, e.g. BMI, sex, age, and number of children. Therefore, this comparison alone does not show that smoking status is the determining influence on the final medical insurance cost.

Secondly, in this data sample, nonsmokers outnumber smokers by almost four to one (1064 vs. 274). This may reflect a relatively low prevalence of smoking in the sampled population, although a single snapshot cannot show a trend over time. For further analysis, it is possible to determine which age ranges the smoking patients fall into. Age range analysis can serve two purposes: on the one hand, it would help tobacco companies identify their current target audience and focus their marketing; on the other hand, the same information is important for public health campaigns that promote smoking cessation.
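The age range analysis mentioned above could be sketched by grouping smokers into age bands. `smokers_by_age_band` and the ten-year band width are assumptions for illustration, and the sample lists are made up.

```python
from collections import Counter


def smokers_by_age_band(ages, smoker_statuses, band_width=10):
    """Count smokers per age band, e.g. '20-29' for band_width=10."""
    bands = Counter()
    for age, status in zip(ages, smoker_statuses):
        if status == "yes":
            start = (int(age) // band_width) * band_width
            bands["{}-{}".format(start, start + band_width - 1)] += 1
    # Order the bands by their lower bound for readable output
    return dict(sorted(bands.items(), key=lambda item: int(item[0].split("-")[0])))


# Made-up sample; the notebook's `ages` and `smoker_statuses` could be passed instead
sample_ages = ["19", "18", "28", "33", "62"]
sample_statuses = ["yes", "no", "yes", "yes", "yes"]
print(smokers_by_age_band(sample_ages, sample_statuses))
```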

In [23]:
highest_bmi = patients_data.calculate_highest_bmi()
print(highest_bmi)

In current sample male BMI is 0.565 points higher than female one.


_Note that the difference is rounded to three decimal places because, in the current dataset, the value ends in 5 at the third decimal place (0.565)._

As seen above, the average BMIs of males and females in **insurance.csv** are close to each other. Similar to the `analyze_sexes()` result above, this suggests that the sample is reasonably balanced.

According to the WHO, normal BMI is between 18.50 and 24.99. For further analysis, the percentage of men and women who are underweight, within normal limits, or overweight can be determined. This can be useful for the development of comprehensive medical programs for weight loss or weight gain that take differences between male and female physiology into account.
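A rough sketch of that breakdown follows. The helper names are hypothetical, and the cut-offs are an assumption that treats everything below 18.5 as underweight and everything from 25 upward as overweight, in line with the normal range quoted above. The sample values are made up.

```python
def bmi_category(bmi):
    """Classify a BMI value using the assumed 18.5 and 25 cut-offs."""
    value = float(bmi)
    if value < 18.5:
        return "underweight"
    if value < 25:
        return "normal"
    return "overweight"


def bmi_shares_by_sex(sexes, bmis):
    """Return {sex: {category: percentage}} rounded to one decimal."""
    counts = {}
    for sex, bmi in zip(sexes, bmis):
        per_sex = counts.setdefault(sex, {"underweight": 0, "normal": 0, "overweight": 0})
        per_sex[bmi_category(bmi)] += 1

    shares = {}
    for sex, per_sex in counts.items():
        total = sum(per_sex.values())
        shares[sex] = {name: round(100 * n / total, 1) for name, n in per_sex.items()}
    return shares


# Made-up sample; the notebook's `sexes` and `bmis` lists could be passed instead
sample_sexes = ["female", "male", "male", "male"]
sample_bmis = ["27.9", "33.77", "22.705", "17.0"]
print(bmi_shares_by_sex(sample_sexes, sample_bmis))
```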

In [24]:
average_age_with_children = patients_data.calculate_age_children()

Average age of patients with 1 child or more is 39.78 year(s).


The result suggests that the average age of patients with at least one child (39.78) is almost the same as the average age of all patients (39.21). On the one hand, this is consistent with the representativeness of the data sample; on the other hand, it shows that this method alone provides little quantitative insight. For the latter purpose, one could determine the age ranges into which the majority of patients without children, with one child, with two children, etc. fall.
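One way to approach those age ranges is to summarize ages separately for each number of children. In the sketch below, `age_summary_by_children` is a hypothetical helper, minimum, median and maximum are an assumed choice of summary, and the sample lists are made up.

```python
import statistics


def age_summary_by_children(ages, num_children):
    """Return {number_of_children: (min_age, median_age, max_age)}."""
    groups = {}
    for age, children in zip(ages, num_children):
        groups.setdefault(int(children), []).append(int(age))
    return {
        key: (min(values), statistics.median(values), max(values))
        for key, values in sorted(groups.items())
    }


# Made-up sample; the notebook's `ages` and `num_children` could be passed instead
sample_ages = ["19", "18", "28", "33", "31"]
sample_children = ["0", "1", "3", "0", "0"]
print(age_summary_by_children(sample_ages, sample_children))
```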

In addition, one could examine whether there is a relationship between the number of children and the parent's smoking status.
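A first look at that relationship could compare the share of smokers across groups with different numbers of children. `smoker_share_by_children` is a hypothetical helper and the sample lists are made up. Such a comparison would show an association at most, not a cause.

```python
def smoker_share_by_children(num_children, smoker_statuses):
    """Return {number_of_children: percentage of smokers in that group}."""
    totals = {}
    smokers = {}
    for children, status in zip(num_children, smoker_statuses):
        key = int(children)
        totals[key] = totals.get(key, 0) + 1
        if status == "yes":
            smokers[key] = smokers.get(key, 0) + 1
    return {key: round(100 * smokers.get(key, 0) / totals[key], 1) for key in sorted(totals)}


# Made-up sample; the notebook's lists could be passed instead
sample_children = ["0", "1", "0", "3"]
sample_statuses = ["yes", "no", "no", "no"]
print(smoker_share_by_children(sample_children, sample_statuses))
```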

In [25]:
unique_regions = patients_data.count_unique_regions()
print(unique_regions)

['southwest', 'southeast', 'northwest', 'northeast']


There are four unique geographical regions in this dataset, and it is important to note that all the patients come from the United States.

In [26]:
popular_region = patients_data.calculate_popular_region()

The most popular region for medical insurance is Southeast. The number of request from this region is 364.


The Southeast is one of the most populous regions in the United States, and it also has one of the lowest unemployment rates. This may be the reason for the large number of insurance policies issued. However, to confirm this hypothesis, it would be necessary to at least:
* within this sample, determine the number of claims for insurance in the other three regions of the United States and quantitatively compare these values with the Southeast value presented above
* determine the percentage of insurance provided by the employer
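The first of these steps, comparing the counts for all four regions, could be sketched with `collections.Counter`. `rank_regions` is a hypothetical helper and the four-entry list is a made-up sample.

```python
from collections import Counter


def rank_regions(regions):
    """Return (region, count) pairs ordered from most to least frequent."""
    return Counter(regions).most_common()


# Made-up sample; the notebook's `regions` list could be passed instead
sample_regions = ["southeast", "southwest", "southeast", "northwest"]
for region, count in rank_regions(sample_regions):
    print(region, count)
```

Because `most_common()` returns every region with its count, the same call also covers the "most popular region" question without a chain of `if`/`elif` comparisons.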

In [27]:
patients_database = patients_data.create_database()
print(patients_database)

[{'Age': '19', 'Sex': 'female', 'BMI': '27.9', 'Number of Children': '0', 'Smoker Status': 'yes', 'Region': 'southwest', 'Insurance Cost': '16884.924'}, {'Age': '18', 'Sex': 'male', 'BMI': '33.77', 'Number of Children': '1', 'Smoker Status': 'no', 'Region': 'southeast', 'Insurance Cost': '1725.5523'}, {'Age': '28', 'Sex': 'male', 'BMI': '33', 'Number of Children': '3', 'Smoker Status': 'no', 'Region': 'southeast', 'Insurance Cost': '4449.462'}, {'Age': '33', 'Sex': 'male', 'BMI': '22.705', 'Number of Children': '0', 'Smoker Status': 'no', 'Region': 'northwest', 'Insurance Cost': '21984.47061'}, {'Age': '32', 'Sex': 'male', 'BMI': '28.88', 'Number of Children': '0', 'Smoker Status': 'no', 'Region': 'northwest', 'Insurance Cost': '3866.8552'}, {'Age': '31', 'Sex': 'female', 'BMI': '25.74', 'Number of Children': '0', 'Smoker Status': 'no', 'Region': 'southeast', 'Insurance Cost': '3756.6216'}, {'Age': '46', 'Sex': 'female', 'BMI': '33.44', 'Number of Children': '1', 'Smoker Status': 'no',

All patient data is now neatly organized in a list of dictionaries. This is convenient for further analysis if a decision is made to continue investigating the attributes in **insurance.csv**.
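As an illustration of why this structure is convenient, the sketch below groups records by any key and averages the insurance cost per group. `average_cost_by` is a hypothetical helper, and the sample reuses three records from the output above, trimmed to the fields it needs.

```python
def average_cost_by(database, key):
    """Average 'Insurance Cost' grouped by the given dictionary key."""
    totals = {}
    counts = {}
    for patient in database:
        group = patient[key]
        totals[group] = totals.get(group, 0.0) + float(patient["Insurance Cost"])
        counts[group] = counts.get(group, 0) + 1
    return {group: round(totals[group] / counts[group], 2) for group in totals}


# First three records from the printed database, reduced to three fields
sample_database = [
    {"Smoker Status": "yes", "Region": "southwest", "Insurance Cost": "16884.924"},
    {"Smoker Status": "no", "Region": "southeast", "Insurance Cost": "1725.5523"},
    {"Smoker Status": "no", "Region": "southeast", "Insurance Cost": "4449.462"},
]
print(average_cost_by(sample_database, "Region"))
print(average_cost_by(sample_database, "Smoker Status"))
```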

In conclusion, in the context of urban planning, the number of kindergartens, playgrounds, fitness centers, smoking areas, etc. could be planned based on this kind of analysis applied to a population sample from the desired region.