# Hypothesis Core 
Nokuthula Mchunu

Use our hypothesis testing skills to answer the following questions:

- Q1. Do smokers have higher insurance charges than non-smokers?
- Q2. Are men more likely to smoke than women?
- Q3. Do different regions have different charges, on average?



For each question, make sure to:

- State your Null Hypothesis and Alternative Hypothesis.
- Select the correct test according to the data type and number of samples.
- Test the assumptions of your selected test.
- Execute the selected test, or the alternative test (if you do not meet the assumptions).
- Interpret your p-value and reject or fail to reject your null hypothesis.
- Show a supporting visualization that helps display the result.
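
The test-selection step can be sketched as a small lookup. This is a minimal illustration of a common rule of thumb, not a definitive decision procedure; the function name `choose_test` and its string labels are hypothetical and do not come from any library.

```python
def choose_test(data_type, n_groups):
    """Return (primary_test, fallback_test) for a group comparison.

    data_type: "numeric" or "categorical" (type of the target variable).
    n_groups:  number of samples/groups being compared.
    The fallback is the usual alternative when assumptions are not met.
    """
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    if data_type == "categorical":
        # Counts in a contingency table: chi-square test of independence
        return ("chi-square", None)
    if data_type != "numeric":
        raise ValueError("data_type must be 'numeric' or 'categorical'")
    if n_groups == 1:
        return ("1-sample t-test", "Wilcoxon signed-rank")
    if n_groups == 2:
        return ("2-sample t-test", "Welch's t-test or Mann-Whitney U")
    return ("one-way ANOVA", "Kruskal-Wallis")


# Q1: numeric charges, two groups (smoker / non-smoker)
print(choose_test("numeric", 2))
# Q2: categorical target (smoker yes/no) compared across sex
print(choose_test("categorical", 2))
# Q3: numeric charges across regions (assuming the four regions seen in the data)
print(choose_test("numeric", 4))
```

Under this mapping, Q1 points to a 2-sample t-test, Q2 to a chi-square test, and Q3 to a one-way ANOVA, each subject to the assumption checks listed above.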

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

In [5]:
# populating df
url = '/Users/noksmchunu/Downloads/insurance - insurance.csv'
df = pd.read_csv(url)
df.head()


Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


## Q1. Do smokers have higher insurance charges than non-smokers?

- Null Hypothesis: There is no difference in insurance charges between smokers and non-smokers.
- Alternative Hypothesis: Smokers have higher insurance charges than non-smokers.

In [6]:
# Explore the data:
# see how many smokers and non-smokers there are
df['smoker'].value_counts()


no     1064
yes     274
Name: smoker, dtype: int64

In [9]:
# Split the data into those who smoke and those who don't
ssmoker_df = df.loc[df['smoker']== 'yes'].copy()
non_smoker_df = df.loc[df['smoker']== 'no'].copy()


In [10]:
ssmoker_df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
11,62,female,26.29,0,yes,southeast,27808.7251
14,27,male,42.13,0,yes,southeast,39611.7577
19,30,male,35.3,0,yes,southwest,36837.467
23,34,female,31.92,1,yes,northeast,37701.8768


In [12]:
# Define our feature of interest
smoker_charge = ssmoker_df['charges']
no_smoker_charge = non_smoker_df['charges']


In [14]:
# Check for outliers in charges for smoker group
zscores= stats.zscore(smoker_charge)
outliers = abs(zscores)>3
np.sum(outliers)


0

In [15]:
# Check for outliers in charges for non-smoker group
zscores= stats.zscore(no_smoker_charge)
outliers = abs(zscores)>3
np.sum(outliers)


24

In [17]:
# remove outliers from non-smoker group
no_smoker_charge = no_smoker_charge[(np.abs(stats.zscore(no_smoker_charge)) < 3)]
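
The outlier check and removal above repeat the same z-score logic for each group. A minimal sketch of a reusable helper follows; the name `drop_zscore_outliers` and the synthetic `demo` series are hypothetical, used only for illustration, and are not the real insurance data.

```python
import numpy as np
import pandas as pd
from scipy import stats


def drop_zscore_outliers(values, threshold=3):
    """Keep only the values whose absolute z-score is below `threshold`."""
    zscores = np.abs(stats.zscore(values))
    return values[zscores < threshold]


# Synthetic stand-in for a charges column: 50 typical values and one extreme value
demo = pd.Series([100.0] * 50 + [100000.0])
cleaned = drop_zscore_outliers(demo)
print(len(demo), "->", len(cleaned))
```

In the notebook, the same helper could be applied to `no_smoker_charge` in place of the inline filter.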


Check for Normality

In [18]:
# Test the smoker group for normality
result_smoker_charge = stats.normaltest(smoker_charge)
result_smoker_charge

NormaltestResult(statistic=61.03941356533816, pvalue=5.564930630036462e-14)

In [19]:
# Test the non-smoker group for normality
result_no_smoker_charge = stats.normaltest(no_smoker_charge)
result_no_smoker_charge


NormaltestResult(statistic=70.72942109230829, pvalue=4.3782580585265917e-16)

Our p-values for both groups are well below 0.05, so we reject the null hypothesis of normality: charges are NOT normally distributed in either group.
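
Failing the normality check does not automatically rule out a t-test. A common rule of thumb, resting on the central limit theorem, is that the test stays reasonably robust when each group is large, and here the groups hold 274 smokers and roughly 1,040 non-smokers after outlier removal. The sketch below reports the sample size alongside the normality result so both can be weighed together. The helper `normality_report` and the synthetic `skewed` sample are hypothetical and stand in for the real charges.

```python
import numpy as np
from scipy import stats


def normality_report(sample, alpha=0.05):
    """Run the D'Agostino-Pearson normality test and summarize the outcome."""
    sample = np.asarray(sample, dtype=float)
    result = stats.normaltest(sample)
    return {
        "n": sample.size,
        "pvalue": result.pvalue,
        "looks_normal": bool(result.pvalue >= alpha),
    }


# Synthetic right-skewed data standing in for a charges column
rng = np.random.default_rng(42)
skewed = rng.exponential(scale=10_000, size=500)
report = normality_report(skewed)
print(report)
```

If the groups were small, a nonparametric alternative such as the Mann-Whitney U test would be the safer choice.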

In [20]:
# Test for equal variance
result = stats.levene(smoker_charge, no_smoker_charge)
result

LeveneResult(statistic=672.9614970899742, pvalue=8.51943690683427e-120)

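Our p-value < alpha (0.05), so we reject the null hypothesis of equal variance: the smoker and non-smoker groups do NOT have equal variances. Levene's test compares variances only, so on its own it does not tell us whether charges differ between the groups. Because the equal-variance assumption is not met, the comparison of charges should use a test that does not assume it, such as Welch's t-test (`equal_var=False`).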
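
Since the Levene result shows unequal variances, Welch's t-test is one way to compare the charges themselves. The following is a minimal sketch on synthetic data, not the notebook's actual result: `group_a` and `group_b` are hypothetical stand-ins for `smoker_charge` and `no_smoker_charge`, and the `alternative` keyword assumes SciPy 1.6 or later.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two charge samples (not the real insurance data)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=30.0, scale=10.0, size=200)  # plays the role of smoker_charge
group_b = rng.normal(loc=10.0, scale=5.0, size=800)   # plays the role of no_smoker_charge

# Welch's t-test: equal_var=False drops the equal-variance assumption.
# alternative="greater" tests whether mean(group_a) > mean(group_b),
# matching a one-sided "smokers have higher charges" hypothesis.
result = stats.ttest_ind(group_a, group_b, equal_var=False, alternative="greater")

alpha = 0.05
reject_null = result.pvalue < alpha
print(f"statistic={result.statistic:.2f}, pvalue={result.pvalue:.3g}, reject null: {reject_null}")
```

In the notebook, passing `smoker_charge` and `no_smoker_charge` in place of the synthetic arrays would give the p-value that actually answers Q1.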

## Q2. Are men more likely to smoke than women?