# U.S. Medical Insurance Costs

In this project, a **CSV** file of medical insurance costs is investigated using Python fundamentals, along with a few additional techniques found online. The goal is to analyze the attributes in **insurance.csv** to learn more about the patients in the file and to gain insight into potential use cases for the dataset.

In [1]:
import pandas as pd

To start, all necessary libraries must be imported. This project only needs `pandas`, which is used to load and work with the **insurance.csv** data. Other libraries could help with this kind of analysis, but `pandas` alone is sufficient here.

In [2]:
df = pd.read_csv("insurance.csv")

The next step is to look through **insurance.csv** in order to get acquainted with the data. Using `pandas` makes this fairly easy. The following aspects of the data file will be checked in order to plan how to work with the data in Python:
* The names of columns and rows
* Any noticeable missing data
* Types of values (numerical vs. categorical)

In [3]:
df.columns

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')

**insurance.csv** contains the following columns:
* Patient Age
* Patient Sex 
* Patient BMI
* Patient Number of Children
* Patient Smoking Status
* Patient U.S. Geographical Region
* Patient Yearly Medical Insurance Cost

There are no signs of missing data. To store this information, seven variables will be created to hold each individual column of data from **insurance.csv**.
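The absence of missing data can also be checked programmatically rather than by eye. The following is a minimal sketch, assuming the column names shown above; the three rows are made-up values used only for illustration and stand in for the result of `pd.read_csv("insurance.csv")`.

```python
import io
import pandas as pd

# Hypothetical rows standing in for insurance.csv (values are illustrative only).
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,female,25.0,0,yes,southwest,20000.00\n"
    "45,male,31.5,2,no,southeast,9000.50\n"
    "52,male,28.1,1,no,northwest,11000.25\n"
)
sample_df = pd.read_csv(sample_csv)

# Count missing values per column; all zeros means nothing is missing.
missing_per_column = sample_df.isna().sum()
print(missing_per_column)

# The dtypes show which columns were parsed as numbers and which as text.
print(sample_df.dtypes)
```

On the real file, the same two calls on `df` would confirm whether every column is complete and which columns are numerical or categorical.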


In [4]:
ages = df.age
sexes = df.sex
bmis = df.bmi
num_of_children = df.children
smokers = df.smoker
regions = df.region
charges = df.charges

Now that all the data from **insurance.csv** is neatly organized into labeled variables, the analysis can begin. The following operations will be implemented:
* find average age of the patients
* return the number of males vs. females counted in the dataset
* find geographical location where the most patients live
* return the average yearly medical charges of the patients
* return the average yearly medical charges of patients that are smokers vs non smokers
* return the average age of smokers vs non smokers
* return the average age of patients with at least one child
* return the average insurance cost based on region

In [5]:
print(f"The average age of patients is about {round(ages.mean(),2)} years old.")

The average age of patients is about 39.21 years old.


The average age of the patients in **insurance.csv** is about 39 years old. This is important to check in order to ensure the data in **insurance.csv** is representative of a broader population. If the dataset is used to make inferences about other populations, the data must be abundant and broad enough for such use cases.

Further analysis would be needed to check whether the [range](https://www.mathsisfun.com/data/range.html#:~:text=The%20Range%20is%20the%20difference,is%209%20%E2%88%92%203%20%3D%206.) and [standard deviation](https://www.mathsisfun.com/data/standard-deviation.html) of patient ages in **insurance.csv** are consistent with a random sampling of individuals.
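As a rough sketch of that follow-up check, the range and standard deviation can be computed directly on a `pandas` Series. The ages below are hypothetical values for illustration; in this notebook the Series would be `df["age"]`.

```python
import pandas as pd

# Hypothetical ages for illustration; in the notebook this would be df["age"].
sample_ages = pd.Series([18, 25, 33, 41, 52, 64])

# Range: difference between the oldest and youngest patient.
age_range = sample_ages.max() - sample_ages.min()

# Sample standard deviation (pandas uses ddof=1 by default).
age_std = sample_ages.std()

print(f"Range: {age_range} years, standard deviation: {age_std:.2f} years")
```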

In [6]:
females = df.loc[df['sex'] == 'female']
males = df.loc[df['sex'] == 'male']

In [7]:
def analyze_m_f_count(m, f):
    return f"There are {len(m)} males and {len(f)} females."

In [8]:
analyze_m_f_count(males, females)

'There are 676 males and 662 females.'

The next step of the analysis is to check the balance of males vs. females in **insurance.csv**. With 676 males and 662 females, the two groups are nearly equal in size. It is important to check that this dataset is representative of a broader population of individuals. If a person were to use this dataset to create a classification model, it would be imperative to make sure that the attributes are balanced.
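Balance can also be expressed as proportions instead of raw counts. This is a small sketch using `value_counts(normalize=True)`; the Series below is hypothetical and stands in for `df["sex"]`.

```python
import pandas as pd

# Hypothetical values for illustration; in the notebook this would be df["sex"].
sample_sexes = pd.Series(["male", "female", "female", "male", "male"])

# Raw counts per category.
counts = sample_sexes.value_counts()

# Share of each category as a fraction of all rows.
shares = sample_sexes.value_counts(normalize=True)

print(counts)
print(shares.round(2))
```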

In [9]:
regions.value_counts()

southeast    364
southwest    325
northwest    325
northeast    324
Name: region, dtype: int64

The next step of the analysis is to check the number of unique regions and which region has the most patients in **insurance.csv**.
There are four unique geographical regions in this dataset, and the southeast has the most patients (364). It is important to note that all the patients come from the United States.

In [35]:
def calculate_avg_charges(data):
    total = 0
    for charge in data:
        total += charge
    return "$"+'{:,}'.format(round(total / len(data),2))

In [11]:
def analyze_total_costs(total):
    print(f"Average Insurance: {total}")

In [34]:
analyze_total_costs(calculate_avg_charges(charges))

Average Insurance: $13,270.42


Here I analyze the average insurance cost for all patients.
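The loop in `calculate_avg_charges` can be replaced by the built-in `Series.mean()`, and a `:,.2f` format specification always shows two decimal places, whereas rounding can drop trailing zeros (as in `$30,679.0`). The sketch below uses a hypothetical function name, `format_avg_charges`, and made-up charges for illustration.

```python
import pandas as pd

def format_avg_charges(data):
    """Return the mean of a Series as a dollar string with two decimals."""
    # Series.mean() replaces the manual sum-and-divide loop.
    return f"${data.mean():,.2f}"

# Hypothetical charges for illustration; in the notebook this would be df["charges"].
sample_charges = pd.Series([30000.0, 31358.0])
print(format_avg_charges(sample_charges))
```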

In [13]:
smokers = df.loc[df['smoker'] == 'yes']
non_smokers = df.loc[df['smoker'] == 'no']

In [14]:
smokers_avg_cost = calculate_avg_charges(smokers['charges'])
non_smokers_avg_cost = calculate_avg_charges(non_smokers['charges'])

In [15]:
def analyze_cost_by_smoker_and_non(s, ns):
    print(f"Smokers Average Cost:\t  {s}")
    print(f"Non Smokers Average Cost: {ns}")

In [16]:
analyze_cost_by_smoker_and_non(smokers_avg_cost, non_smokers_avg_cost)

Smokers Average Cost:	  $32,050.23
Non Smokers Average Cost: $8,434.27


Here I analyze the average insurance cost broken down by smokers and non-smokers. Smokers' average charges are almost four times those of non-smokers.

In [17]:
female_smokers = df.loc[(df['smoker'] == 'yes') & (df['sex'] == 'female')]
male_smokers = df.loc[(df['smoker'] == 'yes') & (df['sex'] == 'male')]
female_non_smokers = df.loc[(df['smoker'] == 'no') & (df['sex'] == 'female')]
male_non_smokers = df.loc[(df['smoker'] == 'no') & (df['sex'] == 'male')]

In [18]:
female_smoker_avg = calculate_avg_charges(female_smokers['charges'])
male_smoker_avg = calculate_avg_charges(male_smokers['charges'])
female_non_smoker_avg = calculate_avg_charges(female_non_smokers['charges'])
male_non_smoker_avg = calculate_avg_charges(male_non_smokers['charges'])

In [36]:
def analyze_cost_by_gender_and_smoker(m,f,mn,fn):
    print(f"Female Smoker Average Cost:\t {f}")
    print(f"Male Smoker Average Cost:\t {m}")
    print(f"Female Non Smoker Average Cost:\t {fn}")
    print(f"Male Non Smoker Average Cost:\t {mn}")

In [37]:
analyze_cost_by_gender_and_smoker(male_smoker_avg, female_smoker_avg, male_non_smoker_avg, female_non_smoker_avg)

Female Smoker Average Cost:	 $30,679.0
Male Smoker Average Cost:	 $33,042.01
Female Non Smoker Average Cost:	 $8,762.3
Male Non Smoker Average Cost:	 $8,087.2


For some extra analysis, I calculate the average cost for male smokers and female smokers, and then the average cost for male non-smokers and female non-smokers.
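The same four averages can be computed in one step with `groupby`, which avoids building a separate filtered DataFrame for each combination. This is a sketch on a small hypothetical DataFrame; on the real data the call would be applied to `df`.

```python
import pandas as pd

# Hypothetical rows for illustration; the real data comes from insurance.csv.
sample_df = pd.DataFrame({
    "smoker":  ["yes", "yes", "no", "no", "yes", "no"],
    "sex":     ["female", "male", "female", "male", "female", "male"],
    "charges": [30000.0, 34000.0, 8000.0, 7000.0, 32000.0, 9000.0],
})

# Mean charges for every (smoker, sex) combination at once.
avg_by_group = sample_df.groupby(["smoker", "sex"])["charges"].mean()
print(avg_by_group)
```

The result is a Series indexed by both keys, so a single group can be looked up with `avg_by_group.loc[("yes", "female")]`.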

In [21]:
female_smoker_avg_age = female_smokers['age'].mean()
male_smoker_avg_age = male_smokers['age'].mean()
female_non_smoker_avg_age = female_non_smokers['age'].mean()
male_non_smoker_avg_age = male_non_smokers['age'].mean()

In [22]:
def analyze_avg_age_smoker_vs_non(fsa, fna, msa, mna):
    print(f"Female Smoker Average Age:\t {round(fsa,2)}")
    print(f"Female Non Smoker Average Age:\t {round(fna,2)}")
    print(f"Male Smoker Average Age:\t {round(msa,2)}")
    print(f"Male Non Smoker Average Age:\t {round(mna,2)}")

In [23]:
analyze_avg_age_smoker_vs_non(female_smoker_avg_age, female_non_smoker_avg_age, male_smoker_avg_age, male_non_smoker_avg_age)

Female Smoker Average Age:	 38.61
Female Non Smoker Average Age:	 39.69
Male Smoker Average Age:	 38.45
Male Non Smoker Average Age:	 39.06


The next step of the analysis is to look at the ages of smokers and non-smokers in **insurance.csv** to see whether there is any correlation between age and smoking. The data is also separated into male and female groups.

The average ages of smokers and non-smokers are nearly the same, between about 38 and 40 years in every group.
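Comparing group means is only an informal check. One way to put a number on the relationship, sketched below, is to encode smoking status as 0/1 and compute the Pearson correlation with age (for a binary variable this is the point-biserial correlation). The rows are hypothetical and only illustrate the calls.

```python
import pandas as pd

# Hypothetical rows for illustration; the real data comes from insurance.csv.
sample_df = pd.DataFrame({
    "age":    [20, 30, 40, 50, 60, 35],
    "smoker": ["yes", "no", "yes", "no", "no", "yes"],
})

# Encode smoking status as 1 (smoker) or 0 (non-smoker).
is_smoker = (sample_df["smoker"] == "yes").astype(int)

# Pearson correlation between age and the 0/1 smoker flag.
age_smoker_corr = sample_df["age"].corr(is_smoker)
print(f"Correlation between age and smoking: {age_smoker_corr:.3f}")
```

A value close to 0 would agree with the observation that smokers and non-smokers have similar average ages.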

In [25]:
patients_without = df.loc[df['children'] < 1]
patients_with_children = df.loc[df['children'] > 0]

In [26]:
avg_age_with_children = patients_with_children['age'].mean()
avg_age_without = patients_without['age'].mean()

In [27]:
def analyze_avg_age_parent_vs_non(parent, non_parent):
    print(f"Average Age of Patients with Children: {round(parent)}")
    print(f"Average Age of Patients without Children: {round(non_parent)}")

In [28]:
analyze_avg_age_parent_vs_non(avg_age_with_children, avg_age_without)

Average Age of Patients with Children: 40
Average Age of Patients without Children: 38


The next step of the analysis is to find out the average age of patients with at least one child and the average age of patients without children in **insurance.csv**.

In [29]:
northwest = df.loc[df['region'] == 'northwest']
southwest = df.loc[df['region'] == 'southwest']
southeast = df.loc[df['region'] == 'southeast']
northeast = df.loc[df['region'] == 'northeast']

In [30]:
southwest_average = calculate_avg_charges(southwest['charges'])
southeast_average = calculate_avg_charges(southeast['charges'])
northeast_average = calculate_avg_charges(northeast['charges'])
northwest_average = calculate_avg_charges(northwest['charges'])

In [31]:
def analyze_avg_by_region(swa,sea,nea,nwa):
    print(f"Southwest Average: {swa}")
    print(f"Southeast Average: {sea}")
    print(f"Northeast Average: {nea}")
    print(f"Northwest Average: {nwa}")

In [32]:
analyze_avg_by_region(southwest_average, southeast_average, northeast_average, northwest_average)

Southwest Average: $12,346.94
Southeast Average: $14,735.41
Northeast Average: $13,406.38
Northwest Average: $12,417.58


The final step of the analysis is to calculate the average insurance cost for each region in which patients live.

In [41]:
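# Split northwest patients into smokers and non-smokers.
n_west_smokers = df.loc[(df['smoker'] == 'yes') & (df['region'] == 'northwest')]
n_west_non_smokers = df.loc[(df['smoker'] == 'no') & (df['region'] == 'northwest')]
# A notebook cell displays only its last expression, so the output
# below is the non-smoker average; the smoker average is not shown.
calculate_avg_charges(n_west_smokers['charges'])
calculate_avg_charges(n_west_non_smokers['charges'])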

'$8,556.46'
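Only one value is displayed above because a notebook cell shows just its last expression. A pivot table is one way to see smoker and non-smoker averages for every region side by side. The sketch below uses hypothetical rows; on the real data the same call would be made on `df`.

```python
import pandas as pd

# Hypothetical rows for illustration; the real data comes from insurance.csv.
sample_df = pd.DataFrame({
    "region":  ["northwest", "northwest", "northwest",
                "southeast", "southeast", "southeast"],
    "smoker":  ["yes", "no", "no", "yes", "yes", "no"],
    "charges": [30000.0, 8000.0, 9000.0, 35000.0, 33000.0, 10000.0],
})

# Rows are regions, columns are smoker status, cells are mean charges.
avg_table = sample_df.pivot_table(
    values="charges", index="region", columns="smoker", aggfunc="mean"
)
print(avg_table)
```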

# Conclusion