# U.S. Medical Insurance Costs

In [1]:
import pandas as pd
import numpy as np

### **Look Over Your Dataset**

- Open `insurance.csv` and take a look at the file.
- Take note of how the information is organized.
- How will this organization affect the way you analyze the data in Python?
    - Is there anything of particular interest in the dataset that you want to investigate?
- Think about these questions before you start the analysis.

In [6]:
df = pd.read_csv('insurance.csv')
# Store the text columns with the pandas string dtype instead of generic object
for col in ["sex", "smoker", "region"]:
    df[col] = df[col].astype("string")
df.head()

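| | age | sex | bmi | children | smoker | region | charges |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |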


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   string 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   string 
 5   region    1338 non-null   string 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), string(3)
memory usage: 73.3 KB
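
The `df.info()` output reports 1338 non-null entries in every column, so there are no missing values to handle. A minimal sketch of checking this programmatically follows. It assumes a stand-in DataFrame called `sample` (a hypothetical name) built from the five rows shown by `df.head()`, so that it runs without `insurance.csv`.

```python
import pandas as pd

# Stand-in data (assumption): the five rows shown by df.head() above.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Count missing values per column; all zeros means nothing needs imputing.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# List the numeric columns, i.e. the ones describe() will summarize by default.
numeric_columns = sample.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```

Running the same two calls on the full `df` should agree with the `df.info()` listing above.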


In [43]:
df.describe()

| | age | bmi | children | charges |
| --- | --- | --- | --- | --- |
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.04996 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 27.0 | 26.29625 | 0.0 | 4740.28715 |
| 50% | 39.0 | 30.4 | 1.0 | 9382.033 |
| 75% | 51.0 | 34.69375 | 2.0 | 16639.912515 |
| max | 64.0 | 53.13 | 5.0 | 63770.42801 |


### **Scoping Your Project**

- Now that you have looked over your dataset, plan out what you want to analyze. 
- What is it that you want to find out about this dataset? 
- Based on the way information is organized, certain inspections may be easier to perform than others. 
- As you map out the process, consider the scope of your analysis as well.


- Properly scoping your project will greatly benefit you; scoping creates structure while requiring you to think through your entire project before you begin.
- Start by stating the goals for your project, then gather the data and consider the analytical steps required.
- A proper project scope can be a great road map for your project, but keep in mind that some downstream tasks may become dead ends, which will require adjusting the scope.

#### Numerical Statistics

`df.describe()` summarizes the numerical columns in the dataset: count, mean, standard deviation, minimum, quartiles, and maximum.

In [45]:
df.describe()

| | age | bmi | children | charges |
| --- | --- | --- | --- | --- |
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.04996 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 27.0 | 26.29625 | 0.0 | 4740.28715 |
| 50% | 39.0 | 30.4 | 1.0 | 9382.033 |
| 75% | 51.0 | 34.69375 | 2.0 | 16639.912515 |
| max | 64.0 | 53.13 | 5.0 | 63770.42801 |


#### Categorical Statistics

In [48]:
df.sex.value_counts()

sex
male      676
female    662
Name: count, dtype: int64

In [49]:
df.smoker.value_counts()

smoker
no     1064
yes     274
Name: count, dtype: int64

In [50]:
df.region.value_counts()

region
southeast    364
southwest    325
northwest    325
northeast    324
Name: count, dtype: int64
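
Raw counts are easier to compare as shares. From the counts above, smokers make up 274 of 1338 rows (about 20.5%), and the largest region, southeast, holds 364 of 1338 (about 27.2%), so no single region contains a majority. A minimal sketch of getting shares directly with `value_counts(normalize=True)` follows. The `sample` DataFrame is a hypothetical stand-in built from the five `df.head()` rows, not the full dataset.

```python
import pandas as pd

# Stand-in data (assumption): the five rows shown by df.head().
sample = pd.DataFrame({
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
})

# normalize=True returns each category's share of rows instead of its raw count.
smoker_share = sample["smoker"].value_counts(normalize=True)
region_share = sample["region"].value_counts(normalize=True)

print(smoker_share)
print(region_share)
```

On the full dataset the same call would be `df.region.value_counts(normalize=True)`.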

### **Analyzing Your Data**

Some possible ideas for analysis are the following:

- Find out the average age of the patients in the dataset.
- Analyze where a majority of the individuals are from.
- Look at the different costs between smokers vs. non-smokers.
- Figure out what the average age is for someone who has at least one child in this dataset.

#### **Compare Smokers and Non-Smokers**

In [54]:
df_smoker = df[df['smoker'] == 'yes']
df_smoker.describe()

| | age | bmi | children | charges |
| --- | --- | --- | --- | --- |
| count | 274.0 | 274.0 | 274.0 | 274.0 |
| mean | 38.514599 | 30.708449 | 1.113139 | 32050.231832 |
| std | 13.923186 | 6.318644 | 1.157066 | 11541.547176 |
| min | 18.0 | 17.195 | 0.0 | 12829.4551 |
| 25% | 27.0 | 26.08375 | 0.0 | 20826.244213 |
| 50% | 38.0 | 30.4475 | 1.0 | 34456.34845 |
| 75% | 49.0 | 35.2 | 2.0 | 41019.207275 |
| max | 64.0 | 52.58 | 5.0 | 63770.42801 |


In [56]:
df_nonsmoker = df[df['smoker'] == 'no']
df_nonsmoker.describe()

| | age | bmi | children | charges |
| --- | --- | --- | --- | --- |
| count | 1064.0 | 1064.0 | 1064.0 | 1064.0 |
| mean | 39.385338 | 30.651795 | 1.090226 | 8434.268298 |
| std | 14.08341 | 6.043111 | 1.218136 | 5993.781819 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 26.75 | 26.315 | 0.0 | 3986.4387 |
| 50% | 40.0 | 30.3525 | 1.0 | 7345.4053 |
| 75% | 52.0 | 34.43 | 2.0 | 11362.88705 |
| max | 64.0 | 53.13 | 5.0 | 36910.60803 |
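
The two summaries above show mean charges of about 32,050 for smokers and about 8,434 for non-smokers, a ratio of roughly 3.8, while age and BMI are nearly the same in both groups. A minimal sketch of making that comparison in one step with `groupby` follows. The `sample` DataFrame is a hypothetical stand-in built from the five `df.head()` rows, so its numbers differ from the full-dataset figures.

```python
import pandas as pd

# Stand-in data (assumption): the five rows shown by df.head().
sample = pd.DataFrame({
    "smoker": ["yes", "no", "no", "no", "no"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Mean charges per smoker status, computed without building two filtered frames.
mean_charges = sample.groupby("smoker")["charges"].mean()

# How many times higher the smokers' mean is than the non-smokers' mean.
smoker_ratio = mean_charges["yes"] / mean_charges["no"]

print(mean_charges)
print(round(smoker_ratio, 2))
```

On the full dataset, `df.groupby("smoker")["charges"].mean()` should reproduce the two `mean` values in the `charges` column above.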


#### **Compare by Sex**

In [58]:
df_male = df[df.sex == "male"]
df_male.describe()

| | age | bmi | children | charges |
| --- | --- | --- | --- | --- |
| count | 676.0 | 676.0 | 676.0 | 676.0 |
| mean | 38.91716 | 30.943129 | 1.115385 | 13956.751178 |
| std | 14.050141 | 6.140435 | 1.218986 | 12971.025915 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 26.0 | 26.41 | 0.0 | 4619.134 |
| 50% | 39.0 | 30.6875 | 1.0 | 9369.61575 |
| 75% | 51.0 | 34.9925 | 2.0 | 18989.59025 |
| max | 64.0 | 53.13 | 5.0 | 62592.87309 |


In [60]:
df_female = df[df.sex == "female"]
df_female.describe()

| | age | bmi | children | charges |
| --- | --- | --- | --- | --- |
| count | 662.0 | 662.0 | 662.0 | 662.0 |
| mean | 39.503021 | 30.377749 | 1.074018 | 12569.578844 |
| std | 14.054223 | 6.046023 | 1.192115 | 11128.703801 |
| min | 18.0 | 16.815 | 0.0 | 1607.5101 |
| 25% | 27.0 | 26.125 | 0.0 | 4885.1587 |
| 50% | 40.0 | 30.1075 | 1.0 | 9412.9625 |
| 75% | 51.75 | 34.31375 | 2.0 | 14454.691825 |
| max | 64.0 | 48.07 | 5.0 | 63770.42801 |
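
In the two summaries above, mean charges differ by sex (about 13,957 for males and 12,570 for females), but the medians are close (about 9,370 and 9,413). This suggests the gap in means comes from the high-cost tail rather than from typical charges. A minimal sketch of putting mean and median side by side with `groupby(...).agg(...)` follows. The `sample` DataFrame is a hypothetical stand-in built from the five `df.head()` rows.

```python
import pandas as pd

# Stand-in data (assumption): the five rows shown by df.head().
sample = pd.DataFrame({
    "sex": ["female", "male", "male", "male", "male"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# One row per sex, one column per statistic.
charges_by_sex = sample.groupby("sex")["charges"].agg(["mean", "median"])

print(charges_by_sex)
```

With only five rows (one of them female) the sample numbers are not meaningful on their own; the point is the shape of the call, which on the full dataset would be `df.groupby("sex")["charges"].agg(["mean", "median"])`.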


#### **Patients with at Least One Child**

In [61]:
df_min_one_child = df[df.children > 0]
df_min_one_child.describe()

| | age | bmi | children | charges |
| --- | --- | --- | --- | --- |
| count | 764.0 | 764.0 | 764.0 | 764.0 |
| mean | 39.780105 | 30.74837 | 1.917539 | 13949.941093 |
| std | 11.927317 | 6.144777 | 0.983351 | 12138.305911 |
| min | 18.0 | 16.815 | 1.0 | 1711.0268 |
| 25% | 30.0 | 26.37875 | 1.0 | 5809.641625 |
| 50% | 40.0 | 30.495 | 2.0 | 9223.8295 |
| 75% | 49.0 | 34.625 | 3.0 | 18232.3924 |
| max | 64.0 | 52.58 | 5.0 | 60021.39897 |
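
The summary above gives a mean age of about 39.8 for the 764 patients with at least one child. When only that single number is needed, a boolean mask avoids building a full `describe()` table. A minimal sketch follows. The `sample` DataFrame and the `has_children` mask are hypothetical names, and the data is a stand-in built from the five `df.head()` rows.

```python
import pandas as pd

# Stand-in data (assumption): the five rows shown by df.head().
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "children": [0, 1, 3, 0, 0],
})

# Boolean mask: True for rows with at least one child.
has_children = sample["children"] > 0

# Mean age inside and outside the mask.
mean_age_parents = sample.loc[has_children, "age"].mean()
mean_age_others = sample.loc[~has_children, "age"].mean()

print(mean_age_parents, mean_age_others)
```

On the full dataset, `df.loc[df.children > 0, "age"].mean()` should match the `mean` entry in the `age` column above.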
