# U.S. Medical Insurance Costs

In this project, I use Python to analyze a file called **insurance.csv**, both to learn more about the patients it describes and to prepare the data for possible future analysis.

In [7]:
# import csv library
import csv

To start, the `csv` library is imported so we can work with the **insurance.csv** data.

In [8]:
# Creating an empty list for every variable, to be filled later
ages = []
sexes = []
bmis = []
number_of_children = []
smoker_status = []
regions = []
charges = []

These lists will be filled later with our **import_by_column** function.

In [19]:
# Opening the insurance.csv file in read mode
with open("insurance.csv", "r", newline="") as file:
    # DictReader yields one dictionary per row, keyed by the header names
    dict_reader = csv.DictReader(file)
    # iterate through each row and print it
    for row in dict_reader:
        print(row)

{'age': '19', 'sex': 'female', 'bmi': '27.9', 'children': '0', 'smoker': 'yes', 'region': 'southwest', 'charges': '16884.924'}
{'age': '18', 'sex': 'male', 'bmi': '33.77', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '1725.5523'}
{'age': '28', 'sex': 'male', 'bmi': '33', 'children': '3', 'smoker': 'no', 'region': 'southeast', 'charges': '4449.462'}
{'age': '33', 'sex': 'male', 'bmi': '22.705', 'children': '0', 'smoker': 'no', 'region': 'northwest', 'charges': '21984.47061'}
{'age': '32', 'sex': 'male', 'bmi': '28.88', 'children': '0', 'smoker': 'no', 'region': 'northwest', 'charges': '3866.8552'}
{'age': '31', 'sex': 'female', 'bmi': '25.74', 'children': '0', 'smoker': 'no', 'region': 'southeast', 'charges': '3756.6216'}
{'age': '46', 'sex': 'female', 'bmi': '33.44', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '8240.5896'}
{'age': '37', 'sex': 'female', 'bmi': '27.74', 'children': '3', 'smoker': 'no', 'region': 'northwest', 'charges': '7281.

This helps us see how the data is laid out: each row is a dictionary keyed by column name, and every value is read in as a string.
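Printing every row produces a very long output. As a minimal sketch, the same check can be limited to the first few rows. The helper name `preview_rows` and the inline sample data are assumptions made for illustration; they are not part of the original notebook.

```python
import csv
import io
from itertools import islice


def preview_rows(file_obj, n=5):
    """Return the first n rows of a CSV file object as a list of dictionaries."""
    reader = csv.DictReader(file_obj)
    # islice stops after n rows, so the rest of the file is never read
    return list(islice(reader, n))


# Hypothetical sample in the same layout as insurance.csv
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
    "18,male,33.77,1,no,southeast,1725.5523\n"
)
for row in preview_rows(sample, n=1):
    print(row)
```

With the real file, the call would look like `with open("insurance.csv", newline="") as file: preview_rows(file)`.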

In [10]:
# Creating a function that imports the data from a single column

def import_by_column(lst, dataset, column_name):
    # Open the CSV file (newline="" is what the csv module recommends)
    with open(dataset, newline="") as file:
        # Create a reader that yields each row as a dictionary
        dict_reader = csv.DictReader(file)
        # Iterate over the rows and append this column's value to the list
        for row in dict_reader:
            lst.append(row[column_name])
    # Return the filled list
    return lst
    

Thanks to this function, we don't have to write the same `for` loop seven times; we simply call it once for each of our seven columns.
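Calling the function once per column reopens and rereads the file each time. A possible alternative, sketched here under the assumption that a single dictionary of lists is acceptable, reads the file once and fills every column together. The name `import_all_columns` and the sample data are hypothetical.

```python
import csv
import io


def import_all_columns(file_obj):
    """Read a CSV file object once and return {column name: list of values}."""
    reader = csv.DictReader(file_obj)
    # Start one empty list per column named in the header row
    # (fieldnames is None for an empty file, hence the fallback)
    columns = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name in columns:
            columns[name].append(row[name])
    return columns


# Hypothetical sample in the same layout as insurance.csv
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
    "18,male,33.77,1,no,southeast,1725.5523\n"
)
columns = import_all_columns(sample)
print(columns["age"])
print(columns["region"])
```

The trade-off is that the values live under the CSV header names (`"age"`, `"sex"`, and so on) rather than in seven separately named lists.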

In [11]:
# Importing data by column and appending to our previously created lists

import_by_column(ages, "insurance.csv", "age")
import_by_column(sexes, "insurance.csv", "sex")
import_by_column(bmis, "insurance.csv", "bmi")
import_by_column(number_of_children, "insurance.csv", "children")
import_by_column(smoker_status, "insurance.csv", "smoker")
import_by_column(regions, "insurance.csv", "region")
import_by_column(charges, "insurance.csv", "charges")

['16884.924',
 '1725.5523',
 '4449.462',
 '21984.47061',
 '3866.8552',
 '3756.6216',
 '8240.5896',
 '7281.5056',
 '6406.4107',
 '28923.13692',
 '2721.3208',
 '27808.7251',
 '1826.843',
 '11090.7178',
 '39611.7577',
 '1837.237',
 '10797.3362',
 '2395.17155',
 '10602.385',
 '36837.467',
 '13228.84695',
 '4149.736',
 '1137.011',
 '37701.8768',
 '6203.90175',
 '14001.1338',
 '14451.83515',
 '12268.63225',
 '2775.19215',
 '38711',
 '35585.576',
 '2198.18985',
 '4687.797',
 '13770.0979',
 '51194.55914',
 '1625.43375',
 '15612.19335',
 '2302.3',
 '39774.2763',
 '48173.361',
 '3046.062',
 '4949.7587',
 '6272.4772',
 '6313.759',
 '6079.6715',
 '20630.28351',
 '3393.35635',
 '3556.9223',
 '12629.8967',
 '38709.176',
 '2211.13075',
 '3579.8287',
 '23568.272',
 '37742.5757',
 '8059.6791',
 '47496.49445',
 '13607.36875',
 '34303.1672',
 '23244.7902',
 '5989.52365',
 '8606.2174',
 '4504.6624',
 '30166.61817',
 '4133.64165',
 '14711.7438',
 '1743.214',
 '14235.072',
 '6389.37785',
 '5920.1041',
 '176

Now that all the data from **insurance.csv** is neatly organized into labeled lists, the analysis can begin.
The following operations will be implemented:
* find the average age of the patients
* return the number of males vs. females counted in the dataset
* find the geographical location of the patients
* return the average yearly medical charges of the patients
* create a dictionary that contains all patient information


In [30]:
# We are going to create some functions that help us to analyse our data

# The first function will help us to calculate the average age of our patients
def average_age():
    # initialize the sum of ages at 0
    sum_of_ages = 0
    # sum all ages in our ages list
    for age in ages:
        sum_of_ages += int(age)
    number_of_patients = len(ages)
    # divide the sum of ages by the number of patients
    average_age = round(sum_of_ages / number_of_patients, 2)
    # return the average age of our patients
    return "Average patient age: " + str(average_age) + " years."



# A function that calculates the number of males and females in our dataset
def gender_calculator():
    # establish the count of number of males and females at 0
    number_of_males = 0
    number_of_females = 0
    # iterate through each sex in sexes
    for sex in sexes:
        # if sex is male add to male variable
        if sex == "male":
            number_of_males += 1
        # if sex is female add to female variable
        elif sex == "female":
            number_of_females += 1
    # return the count for each sex
    return "Count for males :" + str(number_of_males),"Count for females: " + str(number_of_females)



# A function that finds the geographical region of our patients

# First we are going to look at how many unique regions are in our list
def unique_regions(regions):
    # Create an empty list
    different_regions = []
    # Iterate through every region in regions
    for region in regions:
        # add region if it is not already in the different_regions list
        if region not in different_regions:
            different_regions.append(region)
    # return our filled different_regions list
    return different_regions


# Now we can create our function knowing that there are only 4 different regions
def region_counter():
    # Establish the count for each region at 0
    northwest = 0
    northeast = 0
    southwest = 0
    southeast = 0
    # Iterate through each region in regions 
    for region in regions:
        # add if northwest
        if region == "northwest":
            northwest += 1
        # add if northeast
        elif region == "northeast":
            northeast += 1
        # add if southwest
        elif region == "southwest":
            southwest += 1
        # add if southeast
        elif region == "southeast":
            southeast += 1
    # return the number of patients in each region
    return "northwest = " + str(northwest), "northeast = " + str(northeast), "southwest = " + str(southwest), "southeast = " + str(southeast)
    


# A function that returns the average yearly medical charges of the patients
def average_yearly_charges():
    # Calculating the sum of all charges
    sum_of_charges = 0
    for cost in charges:
        sum_of_charges += float(cost)
    # Calculating number of clients
    number_of_clients = len(charges)
    # Calculating average charges
    average_charges = sum_of_charges / number_of_clients
    # return the yearly charges rounded
    return "The average yearly charges per client are: " + str(round(average_charges, 2)) + " dollars."



# Making a dictionary from our lists in case we need to do more analysis in the future
def dictionary_creator():
    # create a dictionary with seven empty lists
    dictionary = {"ages": [], "sexes": [], "bmis": [], "number_of_children": [], "smoker_status": [], "regions": [], "charges": []}
    # iterate over the length of ages and append each record from our lists to the empty lists above
    for i in range(len(ages)):
        dictionary["ages"].append(ages[i])
        dictionary["sexes"].append(sexes[i])
        dictionary["bmis"].append(bmis[i])
        dictionary["number_of_children"].append(number_of_children[i])
        dictionary["smoker_status"].append(smoker_status[i])
        dictionary["regions"].append(regions[i])
        dictionary["charges"].append(charges[i])
    # return the dictionary
    return dictionary
      

With these functions defined, we can apply them to the data.

In [22]:
# Using our average_age function

average_age()

'Average patient age: 39.21 years.'

The average age of patients in **insurance.csv** is about 39 years. Checking this matters because the sample needs to be representative before it can support inferences about other populations.

Further analysis of the **range** and **standard deviation** of patient ages would help show whether the sample covers a broad spread of ages, which the mean alone cannot tell us.
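As a rough sketch of that follow-up, the range and standard deviation can be computed with the standard library's `statistics` module. The function name `age_summary` and the short sample list are assumptions for illustration; in the notebook the argument would be the `ages` list.

```python
import statistics


def age_summary(age_strings):
    """Return min, max, range and sample standard deviation of the ages."""
    # The CSV values are strings, so convert them before doing arithmetic
    ages_as_int = [int(age) for age in age_strings]
    youngest = min(ages_as_int)
    oldest = max(ages_as_int)
    return {
        "min": youngest,
        "max": oldest,
        "range": oldest - youngest,
        # statistics.stdev is the sample standard deviation
        # and needs at least two values
        "std_dev": round(statistics.stdev(ages_as_int), 2),
    }


# Hypothetical usage with the first five ages shown in the file preview
print(age_summary(["19", "18", "28", "33", "32"]))
```

`statistics.stdev` treats the data as a sample; `statistics.pstdev` would be the choice if the file were considered the whole population.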

In [23]:
# Using our gender_calculator function
    
gender_calculator()

('Count for males :676', 'Count for females: 662')

Similar to above, it is important to check that this dataset is **representative** of a broader population of individuals.

In [24]:
# Using our unique_regions function
    
unique_regions(regions)

['southwest', 'southeast', 'northwest', 'northeast']

There are only **four** unique geographical regions in the dataset, each of them a region of the United States.

In [25]:
# Using our region_counter function

region_counter()

('northwest = 325', 'northeast = 324', 'southwest = 325', 'southeast = 364')

We can see that our dataset is **slightly biased**: the southeast has 364 individuals, while each of the other regions has 324 or 325.
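To put the imbalance in proportion, counts can be turned into percentages. The sketch below uses `collections.Counter` and works for any categorical list, so it could also be applied to `sexes`. The function name `category_shares` is hypothetical, and the example list is rebuilt from the counts reported above rather than read from the file.

```python
from collections import Counter


def category_shares(values):
    """Return {category: (count, percentage of all values)}."""
    counts = Counter(values)
    total = sum(counts.values())
    return {
        category: (count, round(100 * count / total, 1))
        for category, count in counts.items()
    }


# Rebuild a regions list from the counts reported by region_counter()
regions_from_counts = (
    ["northwest"] * 325
    + ["northeast"] * 324
    + ["southwest"] * 325
    + ["southeast"] * 364
)
region_shares = category_shares(regions_from_counts)
for region, (count, percent) in region_shares.items():
    print(region, count, percent)
```

Unlike a fixed chain of `if`/`elif` branches, this approach does not need to know the category names in advance.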

In [31]:
# Using our average_yearly_charges function

average_yearly_charges()

'The average yearly charges per client are: 13270.42 dollars.'

The average yearly medical insurance charge per individual is about **13,270 US dollars**. Further analysis could identify patient attributes that contribute to low or high insurance charges.
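One way such a follow-up could start is by averaging charges within each value of another attribute, for example smoker status. This is only a sketch: the function name `average_charge_by_group` is an assumption, and the sample values are the first five rows shown in the file preview, far too few to draw conclusions from.

```python
from collections import defaultdict


def average_charge_by_group(groups, charge_strings):
    """Return {group: average charge}, rounded to two decimals."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    # Pair each patient's group label with that patient's charge
    for group, charge in zip(groups, charge_strings):
        totals[group] += float(charge)
        counts[group] += 1
    return {group: round(totals[group] / counts[group], 2) for group in totals}


# Hypothetical usage with the first five rows of the file preview
sample_smokers = ["yes", "no", "no", "no", "no"]
sample_charges = ["16884.924", "1725.5523", "4449.462", "21984.47061", "3866.8552"]
print(average_charge_by_group(sample_smokers, sample_charges))
```

In the notebook, the call would be `average_charge_by_group(smoker_status, charges)`, and the same function works with `regions` or `sexes` as the first argument.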

In [27]:
# Using our dictionary_creator function

dictionary_creator()

{'ages': ['19',
  '18',
  '28',
  '33',
  '32',
  '31',
  '46',
  '37',
  '37',
  '60',
  '25',
  '62',
  '23',
  '56',
  '27',
  '19',
  '52',
  '23',
  '56',
  '30',
  '60',
  '30',
  '18',
  '34',
  '37',
  '59',
  '63',
  '55',
  '23',
  '31',
  '22',
  '18',
  '19',
  '63',
  '28',
  '19',
  '62',
  '26',
  '35',
  '60',
  '24',
  '31',
  '41',
  '37',
  '38',
  '55',
  '18',
  '28',
  '60',
  '36',
  '18',
  '21',
  '48',
  '36',
  '40',
  '58',
  '58',
  '18',
  '53',
  '34',
  '43',
  '25',
  '64',
  '28',
  '20',
  '19',
  '61',
  '40',
  '40',
  '28',
  '27',
  '31',
  '53',
  '58',
  '44',
  '57',
  '29',
  '21',
  '22',
  '41',
  '31',
  '45',
  '22',
  '48',
  '37',
  '45',
  '57',
  '56',
  '46',
  '55',
  '21',
  '53',
  '59',
  '35',
  '64',
  '28',
  '54',
  '55',
  '56',
  '38',
  '41',
  '30',
  '18',
  '61',
  '34',
  '20',
  '19',
  '26',
  '29',
  '63',
  '54',
  '55',
  '37',
  '21',
  '52',
  '60',
  '58',
  '29',
  '49',
  '37',
  '44',
  '18',
  '20',
  '44',


All patient data is now **organized in a dictionary**. This is convenient for further analysis.
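Every value in the dictionary is still a string, as the output above shows. Before numeric analysis, it may help to convert the values and regroup them per patient. The sketch below assumes the dictionary keys produced by `dictionary_creator()` and assumes that smoker values are only `"yes"` or `"no"`. The function name `to_typed_records` and the per-patient key names are hypothetical.

```python
def to_typed_records(columns):
    """Turn a dictionary of string lists into a list of per-patient dicts."""
    records = []
    for i in range(len(columns["ages"])):
        records.append({
            "age": int(columns["ages"][i]),
            "sex": columns["sexes"][i],
            "bmi": float(columns["bmis"][i]),
            "children": int(columns["number_of_children"][i]),
            # Assumption: anything other than "yes" counts as non-smoker
            "smoker": columns["smoker_status"][i] == "yes",
            "region": columns["regions"][i],
            "charges": float(columns["charges"][i]),
        })
    return records


# Hypothetical usage with the first two rows of the file preview
sample_columns = {
    "ages": ["19", "18"],
    "sexes": ["female", "male"],
    "bmis": ["27.9", "33.77"],
    "number_of_children": ["0", "1"],
    "smoker_status": ["yes", "no"],
    "regions": ["southwest", "southeast"],
    "charges": ["16884.924", "1725.5523"],
}
for record in to_typed_records(sample_columns):
    print(record)
```

In the notebook, the call would be `to_typed_records(dictionary_creator())`.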