# U.S. Medical Insurance Costs

### Description

In this project, a **CSV** file of medical insurance costs is investigated using Python fundamentals. The goal is to analyze the attributes in **insurance.csv** to learn more about the patient information in the file and to gain insight into potential use cases for the dataset.

### Dataset Exploration

We import the pandas package to load the data into our code and to see what sort of data we are working with.

In our analysis of the dataset, we will determine:

- The names of all the variables
- The types of the variables
- The number of people in the dataset
- Whether any information is missing

In [38]:
import pandas as pd

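# Load the CSV into a DataFrame, then print column names, dtypes, and non-null counts
insurance_data = pd.read_csv("insurance.csv")
insurance_data.info()

# Preview the first five rows
print(insurance_data.head())
# Check the Python type of a single value in an "object" column
print(type(insurance_data["sex"][0]))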


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520
<class 'str'>


The dataset has 7 variables with the following types:
- Age: Integer
- Sex: String (pandas reports the dtype as `object`)
- BMI: Float
- Children: Integer
- Smoker: String (pandas reports the dtype as `object`)
- Region: String (pandas reports the dtype as `object`)
- Charges: Float

There are 1338 entries in the dataset. Every column has 1338 non-null values, so no information is missing.
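The same facts can also be checked directly rather than read off the `info()` printout. The following is a minimal sketch, not part of the original notebook: the `sample` DataFrame is a hypothetical stand-in built from the five rows printed by `head()` above, so the block runs without the file. Replacing it with `pd.read_csv("insurance.csv")` should apply the same checks to the full dataset.

```python
import pandas as pd

# Hypothetical stand-in: the five rows shown by head() above.
# Swap in pd.read_csv("insurance.csv") to check the full file.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# shape gives (number of people, number of variables)
n_rows, n_cols = sample.shape

# isna() flags missing cells; summing counts them per column
missing_per_column = sample.isna().sum()
total_missing = int(missing_per_column.sum())

print(f"{n_rows} rows, {n_cols} columns, {total_missing} missing values")
print(sample.dtypes)
```

On the full file, `total_missing` being zero would agree with the non-null counts reported by `info()`.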

### Analysis

Now that we have the dataset available in Python, it's time to see what sort of information we can find.
Some ideas from the project directory include:
- Find out the average age of the patients in the dataset.
- Analyze where a majority of the individuals are from.
- Look at the different costs between smokers vs. non-smokers.

We can use these as starting points and expand each point even further.
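The second and third ideas can be sketched with `value_counts` and `groupby`. This is an illustrative sketch only: `sample` is a hypothetical stand-in holding the five rows printed by `head()` earlier, so the numbers it produces say nothing about the full dataset. The variable names are assumptions chosen for this example.

```python
import pandas as pd

# Hypothetical stand-in: three columns from the five rows shown by head().
# Swap in pd.read_csv("insurance.csv") to analyze the full file.
sample = pd.DataFrame({
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Where are individuals from? Counts and proportions per region.
region_counts = sample["region"].value_counts()
region_share = sample["region"].value_counts(normalize=True)

# How do costs differ between smokers and non-smokers? Mean charges per group.
mean_charges_by_smoker = sample.groupby("smoker")["charges"].mean()

print(region_counts)
print(region_share.round(2))
print(mean_charges_by_smoker.round(2))
```

Using `normalize=True` makes it easy to see whether any region actually holds a majority, rather than just the largest count.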

1. Average age of a person

In [79]:
insurance_data["age"].describe().round(0)

count    1338.0
mean       39.0
std        14.0
min        18.0
25%        27.0
50%        39.0
75%        51.0
max        64.0
Name: age, dtype: float64

The overall average age is 39 (rounded to a whole year), but we can also compute averages grouped by different variables, such as:
- The average age by sex
- The average age by smoker status
- The average age by region
- The average age of people with children vs. without children

In [81]:
# Avg age by sex
insurance_data.groupby("sex")["age"].describe().round(0)

        count  mean   std   min   25%   50%   75%   max
sex
female  662.0  40.0  14.0  18.0  27.0  40.0  52.0  64.0
male    676.0  39.0  14.0  18.0  26.0  39.0  51.0  64.0


In [82]:
# Avg age by smoker status
insurance_data.groupby("smoker")["age"].describe().round(0)

         count  mean   std   min   25%   50%   75%   max
smoker
no      1064.0  39.0  14.0  18.0  27.0  40.0  52.0  64.0
yes      274.0  39.0  14.0  18.0  27.0  38.0  49.0  64.0


In [83]:
# Avg age by area
insurance_data.groupby("region")["age"].describe().round(0)

           count  mean   std   min   25%   50%   75%   max
region
northeast  324.0  39.0  14.0  18.0  27.0  40.0  51.0  64.0
northwest  325.0  39.0  14.0  19.0  26.0  39.0  51.0  64.0
southeast  364.0  39.0  14.0  18.0  27.0  39.0  51.0  64.0
southwest  325.0  39.0  14.0  19.0  27.0  39.0  51.0  64.0


In [85]:
# Avg age parent vs non-parent
# Flag anyone with at least one child as a parent (adds a boolean column)
insurance_data["parent"] = insurance_data["children"] > 0
insurance_data.groupby("parent")["age"].describe().round(0)

        count  mean   std   min   25%   50%   75%   max
parent
False   574.0  38.0  16.0  18.0  22.0  36.0  55.0  64.0
True    764.0  40.0  12.0  18.0  30.0  40.0  49.0  64.0


Across sex, smoker status, and region, the groups are roughly the same age: the rounded mean is 39 or 40 in every group. This was a bit surprising, as I didn't expect the data to be so uniform, especially across smoker status.
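One way to put a number on "roughly the same age" is to measure, for each grouping column, the gap between the oldest and youngest group means. The sketch below is an assumption-laden illustration: `sample` is a hypothetical stand-in containing only the five rows printed by `head()`, so its gaps are not representative, and `spread` is a name invented for this example. Run against the full file, gaps near zero would match the tables above.

```python
import pandas as pd

# Hypothetical stand-in: four columns from the five rows shown by head().
# Swap in pd.read_csv("insurance.csv") to measure the full file.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
})

group_columns = ["sex", "smoker", "region"]

# For each grouping, record the gap between the highest and lowest mean age.
spread = {}
for col in group_columns:
    means = sample.groupby(col)["age"].mean()
    spread[col] = float(means.max() - means.min())

for col, gap in spread.items():
    print(f"{col}: mean ages differ by up to {gap:.2f} years")
```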

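The biggest difference was by parental status, which makes sense. Even so, I expected a larger gap, with non-parents being mostly younger. The mean ages are only two years apart (38 for non-parents, 40 for parents), although the ages of non-parents are more spread out (standard deviation 16 vs. 12).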
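The difference in spread is easier to see from the quartiles than from the means. As a rough sketch, the block below copies the rounded quartiles from the `parent` table above into a small DataFrame and computes the interquartile range (IQR) for each group. The row labels `"no children"` and `"has children"` and the name `parent_summary` are invented for this example.

```python
import pandas as pd

# Rounded values copied from the describe() output for the "parent" grouping.
# Row labels are hypothetical names for parent == False and parent == True.
parent_summary = pd.DataFrame(
    {
        "25%": [22.0, 30.0],
        "50%": [36.0, 40.0],
        "75%": [55.0, 49.0],
    },
    index=["no children", "has children"],
)

# Interquartile range: the width of the middle half of the ages in each group.
parent_summary["iqr"] = parent_summary["75%"] - parent_summary["25%"]

print(parent_summary)
```

The middle half of non-parents spans a wider age range than the middle half of parents, which suggests non-parents include both more young adults and more older adults, so the two ends partly cancel out in the mean.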