* Oleksandra Aliyeva
* May 2022

We have been asked to use our hypothesis testing skills to answer the following questions:

- Q1. Do smokers have higher insurance charges than non-smokers?
- Q2. Are men more likely to smoke than women?
- Q3. Do different regions have different charges, on average?



For each question, make sure to:

1. State your Null Hypothesis and Alternative Hypothesis
2. Select the correct test according to the data type and number of samples
3. Test the assumptions of your selected test.
4. Execute the selected test, or the alternative test (if you do not meet the assumptions)
5. Interpret your p-value and reject or fail to reject your null hypothesis 
6. Show a supporting visualization that helps display the result


In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats

In [3]:
df = pd.read_csv('insurance - insurance.csv')
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


**Q1. Do smokers have higher insurance charges than non-smokers?**
* Null Hypothesis: smokers and non-smokers have the same average insurance charges.
* Alternative Hypothesis: smokers have higher average insurance charges than non-smokers.

Selected test: **2-sample (independent) t-test**, because the target (`charges`) is numeric and there are two groups to compare.

In [5]:
#split charges into two groups; all rows are used (no random sub-sample is drawn)
smoker_charges = df.loc[df['smoker']=='yes','charges']
non_smoker_charges = df.loc[df['smoker']=='no','charges']

#getting means for smoker and non-smoker charges
print(f"For Smokers (n={len(smoker_charges)}): Mean={np.mean(smoker_charges):.2f}")
print(f"For Non-Smokers (n={len(non_smoker_charges)}): Mean={np.mean(non_smoker_charges):.2f}")

For Smokers (n=274): Mean=32050.23
For Non-Smokers (n=1064): Mean=8434.27


In [6]:
#check for outliers among non-smokers (absolute z-score above 3)
zscores_f = stats.zscore(non_smoker_charges)
outliers_f = abs(zscores_f)>3
np.sum(outliers_f)

24

In [7]:
#drop outliers
non_smoker_charges = non_smoker_charges[(np.abs(stats.zscore(non_smoker_charges))<3)]
len(non_smoker_charges)

1040

In [8]:
#check for outliers among smokers (absolute z-score above 3)
zscores_f = stats.zscore(smoker_charges)
outliers_f = abs(zscores_f)>3
np.sum(outliers_f)

0

In [9]:
#check normality non_smoker_charges
result_m = stats.normaltest(non_smoker_charges)
result_m

NormaltestResult(statistic=163.80367047789198, pvalue=2.6945416315543976e-36)

In [10]:
#check normality smoker_charges
result_m = stats.normaltest(smoker_charges)
result_m

NormaltestResult(statistic=61.03941356533816, pvalue=5.564930630036463e-14)

In [None]:
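#both p-values are below .05, so neither group is normally distributed
#both groups are large (n=1040 and n=274), so the t-test is still reasonable by the central limit theorem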


In [11]:
#check for equal variances
result = stats.levene(non_smoker_charges, smoker_charges)
result
#p-value below .05 means the two groups do NOT have equal variances

LeveneResult(statistic=520.7468821724297, pvalue=2.4247238784347824e-97)

In [12]:
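## Final t-test. Levene's test rejected equal variances, so Welch's t-test
## (equal_var=False) is the appropriate variant; this call uses the default equal-variance form.
result = stats.ttest_ind(non_smoker_charges, smoker_charges)
result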

Ttest_indResult(statistic=-51.2078044173717, pvalue=3.68768124e-315)
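Because Levene's test rejected equal variances and the alternative hypothesis is directional, a one-sided Welch's t-test may be a better fit than the default call. The following is a minimal sketch, not part of the original notebook: the helper name `welch_one_sided` is hypothetical, and it assumes SciPy 1.6 or newer, where `ttest_ind` accepts the `alternative` argument.

```python
import scipy.stats as stats


def welch_one_sided(higher_group, lower_group, alpha=0.05):
    """Welch's t-test of H1: mean(higher_group) > mean(lower_group).

    Hypothetical helper, not from the original notebook.
    Assumes SciPy >= 1.6 for the `alternative` argument.
    """
    result = stats.ttest_ind(
        higher_group,
        lower_group,
        equal_var=False,        # Welch's correction for unequal variances
        alternative="greater",  # one-sided, matching the alternative hypothesis
    )
    # Return the statistic, the p-value, and whether H0 is rejected at alpha
    return result.statistic, result.pvalue, result.pvalue < alpha
```

In this notebook it could be called as `welch_one_sided(smoker_charges, non_smoker_charges)`; a positive statistic with a p-value below .05 would support the alternative hypothesis.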

In [13]:
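## check whether the result is significant at alpha = .05
print(f"p-value={result.pvalue:.10f}")
print(f"Significant: {result.pvalue <.05}")
## p < .05: we reject the null hypothesis. The negative statistic (non-smokers
## were passed first) means smokers have the higher average charges.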

p-value=0.0000000000
Significant: True
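Step 6 asks for a supporting visualization. One possible sketch is a bar plot of mean charges per group, using the `smoker` and `charges` columns of the DataFrame. The helper name `plot_charges_by_smoker` is hypothetical and the styling choices are assumptions, not the author's.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_charges_by_smoker(data):
    """Bar plot of mean charges per smoker group (hypothetical helper)."""
    fig, ax = plt.subplots(figsize=(6, 4))
    # Bar height is the group mean; seaborn draws an error bar by default
    sns.barplot(data=data, x="smoker", y="charges", ax=ax)
    ax.set_title("Mean insurance charges by smoking status")
    ax.set_xlabel("Smoker")
    ax.set_ylabel("Charges")
    return fig, ax
```

Calling `plot_charges_by_smoker(df)` in the notebook would plot the full dataset, including the non-smoker outliers that were dropped before the test.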


**Q2. Are men more likely to smoke than women?**
* Null Hypothesis: men and women have the same chance of being a smoker.
* Alternative Hypothesis: men are more likely to smoke than women.
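Both `sex` and `smoker` are categorical, so a chi-square test of independence is a natural candidate here. The following is a minimal sketch under that assumption. The helper name `chi2_sex_vs_smoker` is hypothetical. The chi-square test itself is not directional, so the direction of any difference would be read from the contingency table.

```python
import pandas as pd
import scipy.stats as stats


def chi2_sex_vs_smoker(data, alpha=0.05):
    """Chi-square test of independence between `sex` and `smoker`.

    Hypothetical helper, not from the original notebook.
    """
    # Contingency table of observed counts (rows: sex, columns: smoker)
    table = pd.crosstab(data["sex"], data["smoker"])
    chi2, p, dof, expected = stats.chi2_contingency(table)
    # Return the table, the p-value, and whether H0 is rejected at alpha
    return table, p, p < alpha
```

In the notebook this could be run as `table, p, significant = chi2_sex_vs_smoker(df)`, then comparing the share of smokers in each row of `table`.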

**Q3. Do different regions have different charges, on average?**
* Null Hypothesis: on average, all regions have the same charges.
* Alternative Hypothesis: on average, at least one region has different charges from the others.

In [14]:
#check number of regions
df['region'].value_counts()

southeast    364
southwest    325
northwest    325
northeast    324
Name: region, dtype: int64
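With a numeric target and four regions, one-way ANOVA is the usual candidate, with Kruskal-Wallis as a non-parametric alternative. The sketch below is an assumption about how the remaining steps could look, not the author's method. The helper name `compare_region_charges` is hypothetical, and it skips the normality check on the assumption that groups of roughly 325 rows are large enough.

```python
import pandas as pd
import scipy.stats as stats


def compare_region_charges(data, alpha=0.05):
    """Compare average charges across regions (hypothetical helper).

    Uses one-way ANOVA when Levene's test does not reject equal
    variances; otherwise falls back to the Kruskal-Wallis test.
    Normality is not checked here (assumed large groups).
    """
    # One array of charges per region
    groups = [g["charges"].to_numpy() for _, g in data.groupby("region")]
    equal_var = stats.levene(*groups).pvalue >= alpha
    if equal_var:
        name, result = "ANOVA", stats.f_oneway(*groups)
    else:
        name, result = "Kruskal-Wallis", stats.kruskal(*groups)
    # Return the test used, the p-value, and whether H0 is rejected at alpha
    return name, result.pvalue, result.pvalue < alpha
```

A call such as `compare_region_charges(df)` would report which test was used. A significant result only says that at least one region differs; identifying which one would need a follow-up pairwise comparison.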