# U.S. Medical Insurance Costs

In this project, a **CSV** file of medical insurance costs is investigated using Python fundamentals, along with a few additional techniques found online. The goal is to analyze the attributes in **insurance.csv** to learn more about the patients in the file and to gain insight into potential use cases for the dataset.

In [1]:
import pandas as pd
from collections import Counter

To start, the necessary libraries are imported. This project needs the `pandas` library and the `Counter` class from the standard-library `collections` module in order to work with the **insurance.csv** data. Other libraries could help with this project; however, for this analysis, `pandas` and `Counter` suffice.

In [2]:
# load csv data
df = pd.read_csv("insurance.csv")

The next step is to look through **insurance.csv** in order to get acquainted with the data. Using `pandas` makes this fairly easy. The following aspects of the data file will be checked in order to plan how to work with the data in Python:
* The names of columns and rows
* Any noticeable missing data
* Types of values (numerical vs. categorical)

In [3]:
df.describe()

|       | age | bmi | children | charges |
|-------|-----|-----|----------|---------|
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.04996 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 27.0 | 26.29625 | 0.0 | 4740.28715 |
| 50% | 39.0 | 30.4 | 1.0 | 9382.033 |
| 75% | 51.0 | 34.69375 | 2.0 | 16639.912515 |
| max | 64.0 | 53.13 | 5.0 | 63770.42801 |


In [4]:
df.dtypes

age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object

In [5]:
df.head()

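|   | age | sex | bmi | children | smoker | region | charges |
|---|-----|-----|-----|----------|--------|--------|---------|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |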


**insurance.csv** contains the following columns:
* Patient Age
* Patient Sex 
* Patient BMI
* Patient Number of Children
* Patient Smoking Status
* Patient U.S. Geographical Region
* Patient Yearly Medical Insurance Cost

There are no signs of missing data. To store this information, seven variables will be created to hold each individual column of data from **insurance.csv**.
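
The missing-data claim can also be checked programmatically rather than by eye. The sketch below is illustrative only: it rebuilds the five rows shown by `df.head()` in a DataFrame named `sample` (a name introduced here, not part of the notebook) so that it runs without the file. Applying the same two calls to the full `df` would be the actual check.

```python
import pandas as pd

# Illustrative sample: the five rows shown by df.head(), not the full file.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Count missing values per column; all zeros means nothing is missing.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Collapse the per-column result into a single yes/no answer for the table.
has_missing = bool(sample.isna().any().any())
print(f"Any missing values: {has_missing}")
```

Note that `isna()` only catches values that `pandas` parsed as missing; placeholder strings such as `"unknown"` would need a separate check.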


In [6]:
#Create variables for the various attributes in insurance.csv
ages = df.age
sexes = df.sex
bmis = df.bmi
num_of_children = df.children
smokers = df.smoker
regions = df.region
charges = df.charges

Now that all the data from **insurance.csv** is neatly organized into labeled variables, the analysis can begin. The following operations will be implemented:
* find average age of the patients
* return the number of males vs. females counted in the dataset
* find geographical location where the most patients live
* return the average yearly medical charges of the patients
* return the average yearly medical charges of patients that are smokers vs non smokers
* return the average age of smokers vs non smokers
* return the average age of patients with at least one child

In [7]:
print(f"The average age of patients is {round(df['age'].mean(),2)} years old.")

The average age of patients is 39.21 years old.


The average age of the patients in **insurance.csv** is about 39 years old. This is important to check in order to ensure the data in **insurance.csv** is representative of a broader population. If the dataset is to be used to make inferences about other populations, the data must be abundant and broad enough for such use cases.

Further analysis would be needed to make sure the [range](https://www.mathsisfun.com/data/range.html#:~:text=The%20Range%20is%20the%20difference,is%209%20%E2%88%92%203%20%3D%206.) and [standard deviation](https://www.mathsisfun.com/data/standard-deviation.html) of the patient ages in **insurance.csv** are indicative of a random sampling of individuals.
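
As a starting point, both statistics are one call each in `pandas`. The `df.describe()` output already reports a minimum age of 18, a maximum of 64, and a standard deviation of about 14.05 for the full file. The sketch below shows the same computation on the five ages from `df.head()`; the names `ages_sample`, `age_range`, and `age_std` are introduced here for illustration.

```python
import pandas as pd

# Illustrative sample: the five ages shown by df.head(), not the full column.
ages_sample = pd.Series([19, 18, 28, 33, 32], name="age")

# Range is the difference between the largest and smallest value.
age_range = ages_sample.max() - ages_sample.min()

# Series.std() uses the sample standard deviation (ddof=1) by default,
# which is the same convention describe() uses.
age_std = ages_sample.std()

print(f"Age range: {age_range} years")
print(f"Age standard deviation: {age_std:.2f} years")
```

On the full dataset, `df['age'].max() - df['age'].min()` and `df['age'].std()` would give the corresponding figures.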

In [8]:
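# count the males and females in a column of sex labels
def find_gender_count(sexes):
    males = 0
    females = 0
    # loop over the column that was passed in, not a global variable
    for sex in sexes:
        if sex == 'male':
            males += 1
        elif sex == 'female':
            females += 1
    return f"There are {males} males and {females} females."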

In [9]:
find_gender_count(sexes)

'There are 676 males and 662 females.'

The next step of the analysis is to check the balance of males vs. females in **insurance.csv**. It is important to check that this dataset is representative of a broader population of individuals. If a person were to use this dataset to create a classification model, it would be imperative to make sure that the attributes are balanced.
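
Balance is easier to judge as a proportion than as raw counts. The following sketch assumes only the counts reported above (676 males, 662 females) and rebuilds a column from them so that it runs on its own; `sex_column` and `sex_shares` are names introduced here. On the real data, `df['sex'].value_counts(normalize=True)` would be the equivalent call.

```python
import pandas as pd

# Rebuild a sex column from the counts reported above (676 males, 662 females).
sex_column = pd.Series(["male"] * 676 + ["female"] * 662, name="sex")

# normalize=True returns each category's share of the rows instead of raw counts.
sex_shares = sex_column.value_counts(normalize=True)
print(sex_shares.round(3))
```

Both shares are close to one half, which supports the reading that the two groups are roughly balanced.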

In [10]:
# function to find the region with the most patients
def num_of_location(region):
    num_locations = Counter(region)
    most = num_locations.most_common(1)[0]
    return f"The {most[0].title()} region has the most patients with {most[1]} patients."

In [11]:
num_of_location(regions)

'The Southeast region has the most patients with 364 patients.'

In [12]:
print(Counter(regions))

Counter({'southeast': 364, 'southwest': 325, 'northwest': 325, 'northeast': 324})


The next step of the analysis is to check the number of unique regions and which region has the most patients in **insurance.csv**.
There are four unique geographical regions in this dataset, and it is important to note that all the patients come from the United States. 
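
The same questions can be answered with `pandas` alone, as an alternative to `Counter`. This sketch rebuilds a region column from the counts printed above so that it is self-contained; `region_counts_reported`, `region_column`, and the other names are introduced here for illustration. On the real data, the calls would be applied to `df['region']`.

```python
import pandas as pd

# Counts copied from the Counter output above.
region_counts_reported = {
    "southeast": 364,
    "southwest": 325,
    "northwest": 325,
    "northeast": 324,
}

# Expand the counts back into one label per patient.
region_column = pd.Series(
    [name for name, count in region_counts_reported.items() for _ in range(count)],
    name="region",
)

# nunique() gives the number of distinct regions.
unique_regions = region_column.nunique()

# value_counts() tallies each region; idxmax() returns the label with the largest tally.
region_counts = region_column.value_counts()
largest_region = region_counts.idxmax()

print(f"{unique_regions} unique regions; largest is {largest_region} "
      f"with {region_counts[largest_region]} patients.")
```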

In [13]:
# create empty lists for the charges and ages of smokers and non smokers
smokers_charges_list = []
non_smokers_charges = []
smokers_ages = []
non_smokers_ages = []

# loop through the smoker, charges and age columns in parallel
# so each charge and age stays paired with the right smoking status
for smoker, charge, age in zip(smokers, charges, ages):
    # if the patient is a smoker, store the charge and age in the smokers lists
    if smoker == 'yes':
        smokers_charges_list.append(charge)
        smokers_ages.append(age)
    # if the patient is not a smoker, store them in the non smokers lists
    elif smoker == 'no':
        non_smokers_charges.append(charge)
        non_smokers_ages.append(age)

The next step of the analysis is to separate the insurance costs and ages of smokers and non-smokers in **insurance.csv**. Once the ages and insurance costs have been separated, the averages are calculated to see whether smoking is associated with a large difference in age or insurance cost.
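
For comparison, `pandas` can do the split and the averaging in a single `groupby` call. The sketch below is a possible alternative, not the approach used in this notebook, and it runs on the five rows from `df.head()` only, so its numbers will not match the full-file results. `sample` and `group_means` are names introduced here.

```python
import pandas as pd

# Illustrative sample: the smoker, age and charges values shown by df.head().
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "smoker": ["yes", "no", "no", "no", "no"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Group rows by smoking status, then average charges and age within each group.
group_means = sample.groupby("smoker")[["charges", "age"]].mean()
print(group_means.round(2))

# Individual figures can be read out by row label and column name.
smoker_mean_charge = group_means.loc["yes", "charges"]
non_smoker_mean_age = group_means.loc["no", "age"]
```

On the full dataset, `df.groupby('smoker')[['charges', 'age']].mean()` would produce the smoker and non-smoker averages in one table.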

In [14]:
# used to find the average of a list of values
def find_average(lst):
    total = 0
    # add up every value in the list
    for item in lst:
        total += item
    # divide once, after the loop; an empty list raises ZeroDivisionError
    return total / len(lst)

# find the average cost a smoker pays, round it, and then set it to a new variable
smoke_cost_average = round(find_average(smokers_charges_list))
# find the average cost a non smoker pays, round it and then set it to a new variable
non_smoke_average = round(find_average(non_smokers_charges))
# show savings in cost by subtracting non smoker cost from smoker cost
savings = smoke_cost_average - non_smoke_average

# find the average smoking age, round it two decimal places and then set it to a new variable
average_smoking_age = round(find_average(smokers_ages),2)
# find the average non smoking age round it two decimal places and then set it to a new variable
average_non_smoking_age = round(find_average(non_smokers_ages),2)

In [15]:
print(f"The average yearly cost per individual is ${'{:,}'.format(round(df['charges'].mean(),2))} dollars")

The average yearly cost per individual is $13,270.42 dollars


In [16]:
print(f"The average yearly medical insurance cost for a smoker is ${'{:,}'.format(smoke_cost_average)} dollars.")

The average yearly medical insurance cost for a smoker is $32,050 dollars.


In [17]:
print(f"The average yearly medical insurance cost for a non-smoker is ${'{:,}'.format(non_smoke_average)} dollars.")

The average yearly medical insurance cost for a non-smoker is $8,434 dollars.


In [18]:
print(f"If you do not smoke you will save ${'{:,}'.format(savings)} dollars.")

If you do not smoke you will save $23,616 dollars.


In [19]:
print(f"The average age of a smoker is {average_smoking_age} years old.")

The average age of a smoker is 38.51 years old.


In [20]:
print(f"The average age of a non-smoker is {average_non_smoking_age} years old.")

The average age of a non-smoker is 39.39 years old.


The average ages of smokers and non-smokers are roughly the same (38.51 vs. 39.39 years).

In [21]:
# create an empty list for ages of patients that have at least one child
age_with_1kid = []
# loop through the number of children and age columns in parallel
for child, age in zip(num_of_children, ages):
    # keep the age only if the individual has at least one child
    if child > 0:
        age_with_1kid.append(age)

In [22]:
average_age_with_1kid = round(find_average(age_with_1kid))

In [51]:
print(f"The average age of an individual with at least 1 child is {int(average_age_with_1kid)} years old.")

The average age of an individual with at least 1 child is 40 years old.


The final step of the analysis is to find the average age of patients with at least one child in **insurance.csv**.
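
The same filter can be expressed with a boolean mask in `pandas`. This is a minimal sketch on the five rows from `df.head()`, so its result differs from the full-file figure; `sample`, `has_children`, and `mean_age_with_children` are names introduced here.

```python
import pandas as pd

# Illustrative sample: the age and children values shown by df.head().
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "children": [0, 1, 3, 0, 0],
})

# Boolean mask that is True for patients with at least one child.
has_children = sample["children"] > 0

# Select the ages where the mask is True and average them.
mean_age_with_children = sample.loc[has_children, "age"].mean()
print(f"Average age with at least one child: {mean_age_with_children:.2f}")
```

On the full dataset, the equivalent expression would be `df.loc[df['children'] > 0, 'age'].mean()`.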