# Hypothesis Testing with Insurance Data
Cameron Peace

### Task

For this assignment, we will be working with the [US Health Insurance Dataset](https://docs.google.com/spreadsheets/d/e/2PACX-1vQBN8DPW2rdiRrY34eEM53HAzakNGSRrw4ogI-j8HyCUrbqTB_z4CeIn2IvjLF-w_6sOe5pIlypJGAA/pub?output=csv) from [Kaggle](https://www.kaggle.com/teertha/ushealthinsurancedataset).

We have been asked to use our hypothesis testing skills to answer the following questions:

- Q1. Do smokers have higher insurance charges than non-smokers?
- Q2. Are men more likely to smoke than women?
- Q3. Do different regions have different charges, on average?

For each question, make sure to:

* [ ] State your Null Hypothesis and Alternative Hypothesis
* [ ] Select the correct test according to the data type and number of samples
* [ ] Test the assumptions of your selected test.
* [ ] Execute the selected test, or the alternative test (if you do not meet the assumptions)
* [ ] Interpret your p-value and reject or fail to reject your null hypothesis 
* [ ] Show a supporting visualization that helps display the result
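
As a rough illustration of the checklist above, the sketch below checks assumptions and then picks a two-sample test for numeric data. It is only a sketch under stated assumptions: the helper name `compare_two_groups`, the 0.05 significance level, the Mann-Whitney U fallback, and the synthetic samples are all illustrative choices, not part of the assignment or the dataset.

```python
import numpy as np
import scipy.stats as stats


def compare_two_groups(a, b, alpha=0.05):
    """Hypothetical helper: check assumptions, then run a two-sample test."""
    # assumption 1: each group is approximately normal (D'Agostino-Pearson test)
    normal = all(stats.normaltest(g).pvalue > alpha for g in (a, b))
    # assumption 2: the groups have equal variance (Levene's test)
    equal_var = bool(stats.levene(a, b).pvalue > alpha)

    if normal:
        # equal_var=False switches ttest_ind to Welch's t-test
        name = "t-test" if equal_var else "Welch t-test"
        result = stats.ttest_ind(a, b, equal_var=equal_var)
    else:
        # non-parametric alternative when normality is not met
        name = "Mann-Whitney U"
        result = stats.mannwhitneyu(a, b, alternative="two-sided")

    # reject the null hypothesis when p < alpha
    return name, result.pvalue, bool(result.pvalue < alpha)


# synthetic data only (assumption): two groups whose true means differ by 2
rng = np.random.default_rng(42)
group_a = rng.normal(loc=10, scale=2, size=200)
group_b = rng.normal(loc=12, scale=2, size=200)

name, p_value, reject = compare_two_groups(group_a, group_b)
print(name, p_value, reject)
```

With large samples, a common alternative is to rely on the central limit theorem rather than falling back to a non-parametric test; which rule to follow is a judgment call.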

### Data Background

From Kaggle:
>This dataset contains 1338 rows of insured data, where the Insurance charges are given against the following attributes of the insured: Age, Sex, BMI, Number of Children, Smoker and Region. There are no missing or undefined values in the dataset.

The source provides no information about data collection, provenance, or time frame. The data are assumed to be fictitious.

### Data Dictionary

* **age** - Age of primary beneficiary

* **sex** - Insurance contractor gender, female / male

* **bmi** - Body mass index (kg / m ^ 2), an objective index of body weight relative to height, calculated as weight divided by height squared. It indicates whether weight is relatively high or low for a given height; the ideal range is 18.5 to 24.9.

* **children** - Number of children covered by health insurance / Number of dependents

* **smoker** - Whether the beneficiary smokes, yes / no

* **region** - The beneficiary's residential area in the US: northeast, southeast, southwest, or northwest.

* **charges** - Individual medical costs billed by health insurance.
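
The `bmi` entry above can be made concrete with a small worked example. This is a minimal sketch: the function name `bmi` and the sample weight and height are assumptions chosen for illustration and do not come from the dataset.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


# illustrative values only: 70 kg at 1.75 m falls inside the 18.5 to 24.9 range
value = bmi(70.0, 1.75)
print(round(value, 2))  # 22.86
```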

### Imports

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import scipy.stats as stats

### Loading, Viewing Data

In [2]:
# loading
df = pd.read_csv('insurance.csv')

# making a copy in case comparison is needed
df_original = df.copy()

In [3]:
# initial viewing
df.sample(5)

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 469 | 18 | female | 24.09 | 1 | no | southeast | 2201.0971 |
| 463 | 56 | male | 25.935 | 0 | no | northeast | 11165.41765 |
| 621 | 37 | male | 34.1 | 4 | yes | southwest | 40182.246 |
| 1088 | 52 | male | 47.74 | 1 | no | southeast | 9748.9106 |
| 565 | 19 | female | 30.495 | 0 | no | northwest | 2128.43105 |


In [4]:
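# note: df.info() prints its summary directly and returns None,
# so display() also shows a stray None in the output
display(df.info(), df.describe(include='all'), df.columns)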

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


None

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| count | 1338.0 | 1338 | 1338.0 | 1338.0 | 1338 | 1338 | 1338.0 |
| unique | | 2 | | | 2 | 4 | |
| top | | male | | | no | southeast | |
| freq | | 676 | | | 1064 | 364 | |
| mean | 39.207025 | | 30.663397 | 1.094918 | | | 13270.422265 |
| std | 14.04996 | | 6.098187 | 1.205493 | | | 12110.011237 |
| min | 18.0 | | 15.96 | 0.0 | | | 1121.8739 |
| 25% | 27.0 | | 26.29625 | 0.0 | | | 4740.28715 |
| 50% | 39.0 | | 30.4 | 1.0 | | | 9382.033 |
| 75% | 51.0 | | 34.69375 | 2.0 | | | 16639.912515 |


Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')

<mark><u>**Comment:**</u></mark>

<font color='dodgerblue' size=4><i>
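This dataset looks fairly clean at first glance. 'charges' appears right-skewed with high-end outliers (its mean of about 13,270 sits well above its median of about 9,382), and 'children' may have a few high values as well. The sexes are roughly balanced (676 of 1,338 rows are male), but smoker status is not: 1,064 of 1,338 rows are non-smokers. With a mean BMI of about 30.7, the sample also seems skewed towards overweight/obese patients.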
</i></font>
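
One way to back up the outlier and skew observations above is Tukey's IQR rule together with a skewness estimate. The sketch below is illustrative only: `iqr_outlier_mask` is a hypothetical helper, and the toy `charges` values are made up to mimic a right-skewed column rather than taken from the dataset.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Hypothetical helper: flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# toy stand-in for df['charges'] (assumption): one value far above the rest
toy = pd.DataFrame(
    {"charges": [1100.0, 2200.0, 4700.0, 9400.0, 11000.0, 16600.0, 60000.0]}
)

mask = iqr_outlier_mask(toy["charges"])
outliers = toy.loc[mask, "charges"].tolist()
skewness = toy["charges"].skew()

print(outliers)  # [60000.0]
print(skewness)  # positive for a right-skewed column
```

In the notebook itself, the same helper could be applied to `df['charges']` and `df['children']`.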

### Cleaning, checking data

In [5]:
# checking for duplicates
df.duplicated().sum()

1

In [6]:
# checking shape for confirmation
display(df.shape)

# removing duplicate entry
df = df.drop_duplicates().copy()

# confirming
display(df.shape)

(1338, 7)

(1337, 7)
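
Before dropping duplicates, it can help to look at the rows involved. The sketch below uses `duplicated(keep=False)`, which marks every copy of a repeated row rather than only the later ones. The toy frame and its values are hypothetical and stand in for `df`.

```python
import pandas as pd

# toy stand-in for df (assumption): the first two rows are identical
toy = pd.DataFrame(
    {
        "age": [19, 19, 28],
        "sex": ["male", "male", "female"],
        "charges": [1639.56, 1639.56, 4449.46],
    }
)

# keep=False flags all copies, so both matching rows can be inspected
dupes = toy[toy.duplicated(keep=False)]
print(dupes)

# drop_duplicates keeps the first occurrence by default
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped.shape)  # (2, 3)
```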

In [7]:
# checking for NaNs
df.isna().sum().sum()

0

In [14]:
# checking for incorrect values 
for i in df.columns:
    if df[i].dtype == 'object' or df[i].nunique() < 15:
        print(i + ':\n', df[i].unique(), '\n****')

sex:
 ['female' 'male'] 
****
children:
 [0 1 3 2 5 4] 
****
smoker:
 ['yes' 'no'] 
****
region:
 ['southwest' 'southeast' 'northwest' 'northeast'] 
****


<mark><u>**Comment:**</u></mark>

<font color='dodgerblue' size=4><i>
Looking good here. We dropped one duplicate row, and the categorical columns contain only the expected values, so the dataset otherwise looks clean.
</i></font>
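
The visual check of unique values above could also be automated by comparing each categorical column against its allowed values. This is a sketch only: `unexpected_values` is a hypothetical helper, the allowed sets are copied from the `unique()` output above, and the toy frame deliberately includes a miscapitalised value to show what a failure looks like.

```python
import pandas as pd

# allowed values, copied from the unique() output above
EXPECTED = {
    "sex": {"female", "male"},
    "smoker": {"yes", "no"},
    "region": {"southwest", "southeast", "northwest", "northeast"},
}


def unexpected_values(frame, expected=EXPECTED):
    """Hypothetical helper: map each column to any values outside its allowed set."""
    problems = {}
    for col, allowed in expected.items():
        extra = set(frame[col].unique()) - allowed
        if extra:
            problems[col] = sorted(extra)
    return problems


# toy stand-in for df (assumption), with one bad value: 'Male'
toy = pd.DataFrame(
    {
        "sex": ["female", "male", "Male"],
        "smoker": ["yes", "no", "no"],
        "region": ["southwest", "northeast", "southeast"],
    }
)

problems = unexpected_values(toy)
print(problems)  # {'sex': ['Male']}
```

An empty dictionary means every checked column contains only expected values.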

## **Question 1: Do smokers have higher insurance charges than non-smokers?**

### 