# Core Statistics Using Python
### Hana Choi, Simon Business School, University of Rochester


# Hypothesis Testing

## Topics covered

- Hypothesis testing using confidence interval
- Hypothesis testing using p-value 

## Here are the packages/modules we need for this notebook

In [15]:
import pandas as pd
import numpy as np
from scipy.stats import norm
from scipy.stats import uniform

# Example: LED bulb life


## Hypothesis

- Null hypothesis H0: mu_x = 1200
- Alternative hypothesis HA: mu_x != 1200
- Note: the numbers here differ from those in the slides, which use a sample mean and SD of exactly 1265 and 300, respectively, to keep the algebra simple.
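
For reference, the two-sided test carried out in the rest of this notebook can be written compactly as below. The symbols $\bar{x}$ (sample mean), $s$ (sample SD), $n$ (sample size), and $z_{\alpha/2}$ (standard normal critical value) are notation introduced here for illustration, and the normal approximation is assumed to be reasonable for this sample size.

$$
z = \frac{\bar{x} - 1200}{s/\sqrt{n}}, \qquad \text{reject } H_0 \text{ if } |z| > z_{\alpha/2} \iff 1200 \notin \left[\bar{x} - z_{\alpha/2}\frac{s}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{s}{\sqrt{n}}\right]
$$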

## Data: LEDBulb.csv

In [16]:
# Method 1: Save the data file directly to your working directory
# led_bulb = pd.read_csv('LEDBulb.csv')

# Method 2: Give Python the full path to your data file explicitly
# Below is my file path; replace it with your own.
led_bulb = pd.read_csv("/Users/hanachoi/Dropbox/teaching/core_statistics/Data/LEDBulb.csv")

# Method 3: We can also import a dataset from the web
# led_bulb = pd.read_csv("http://hanachoi.github.io/datasets/LEDBulb.csv")

# Display the first few rows of the dataframe
led_bulb.head()

| | LED Bulb Life |
|---|---|
| 0 | 1484.344201 |
| 1 | 1947.759492 |
| 2 | 1744.943083 |
| 3 | 1364.664703 |
| 4 | 1430.529286 |


## Summary statistics

In [17]:
# Sample mean
sample_mean = led_bulb['LED Bulb Life'].mean()
print("Sample Mean:", sample_mean)

# Sample standard deviation
sample_sd = led_bulb['LED Bulb Life'].std()
print("Sample SD:", sample_sd)

# Sample size
# The DataFrame attribute .shape returns a tuple: (number of rows, number of columns)
print("Data Shape", led_bulb.shape)
n = led_bulb.shape[0] # save the first element of the tuple (number of rows) as "n"

# Alternatively, the built-in Python function len() returns the number of rows of a DataFrame
n = len(led_bulb)
print("Sample Size n:", n)

# Standard error of the sample mean
standard_error = sample_sd / np.sqrt(n)
print("Sample SE:", standard_error)

Sample Mean: 1268.033194362
Sample SD: 297.86604812071647
Data Shape (100, 1)
Sample Size n: 100
Sample SE: 29.786604812071648
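
One detail worth knowing when reproducing these numbers: pandas' `.std()` divides by n - 1 by default (`ddof=1`), whereas NumPy's `np.std()` divides by n by default (`ddof=0`). The sketch below illustrates the difference on a small made-up series; the values and variable names are hypothetical and do not come from LEDBulb.csv.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only (not from LEDBulb.csv)
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

sd_pandas = toy.std()                              # pandas default: ddof=1 (divide by n - 1)
sd_numpy_default = np.std(toy.to_numpy())          # NumPy default: ddof=0 (divide by n)
sd_numpy_sample = np.std(toy.to_numpy(), ddof=1)   # NumPy with ddof=1 matches pandas

print("pandas .std():      ", sd_pandas)
print("np.std() default:   ", sd_numpy_default)
print("np.std(..., ddof=1):", sd_numpy_sample)
```

If the sample SD is ever computed with NumPy instead of pandas, passing `ddof=1` keeps it consistent with the value used in this notebook.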


## Hypothesis testing using confidence interval

In [26]:
# 95% Confidence interval, Z(alpha/2) = 1.96
CI_L_95 = sample_mean - 1.96*standard_error
CI_R_95 = sample_mean + 1.96*standard_error
print("95% CI:", CI_L_95, CI_R_95)
# Decision: Reject null, because 1200 lies outside the 95% confidence interval

# 99% Confidence interval, Z(alpha/2) = 2.58
alpha = 1-0.99
# norm.ppf(alpha/2) returns the lower-tail critical value (about -2.58); its absolute value is Z(alpha/2)
print(norm.ppf(alpha/2))

CI_L_99 = sample_mean - 2.58*standard_error
CI_R_99 = sample_mean + 2.58*standard_error
print("99% CI:", CI_L_99, CI_R_99)
# Decision: Cannot reject null, because 1200 lies within the 99% confidence interval

95% CI: 1209.6514489303395 1326.4149397936603
-2.5758293035489004
99% CI: 1191.183753946855 1344.8826347771449
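
The same decision rule can be wrapped in a small helper that looks up the critical value with `norm.ppf` instead of typing 1.96 or 2.58 by hand. This is only a sketch: the function name `z_confidence_interval` and the variable `mu_0` are hypothetical, and the summary statistics are copied from the output earlier in this notebook. Because the critical values are not rounded, the interval endpoints may differ slightly in the later decimals from those printed above.

```python
from scipy.stats import norm

def z_confidence_interval(mean, se, level):
    """Return (lower, upper) of a two-sided normal-based confidence interval."""
    alpha = 1 - level
    z_crit = norm.ppf(1 - alpha / 2)  # upper-tail critical value (positive)
    return mean - z_crit * se, mean + z_crit * se

# Summary statistics copied from the notebook output
sample_mean = 1268.033194362
standard_error = 29.786604812071648
mu_0 = 1200  # value of the mean under the null hypothesis

for level in (0.95, 0.99):
    lower, upper = z_confidence_interval(sample_mean, standard_error, level)
    reject = not (lower <= mu_0 <= upper)
    print(f"{level:.0%} CI: ({lower:.2f}, {upper:.2f})  reject H0: {reject}")
```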


## What happens to the 95% CI when n = 25?

In [22]:
# Suppose the sample size were 25 instead of 100, with the same sample mean and SD
n_new = 25
standard_error_new = sample_sd/np.sqrt(n_new)

# 95% Confidence interval, Z(alpha/2) = 1.96
CI_L_95_new = sample_mean - 1.96*standard_error_new
CI_R_95_new = sample_mean + 1.96*standard_error_new
print("95% CI when n=25:", CI_L_95_new, CI_R_95_new)

# Decision: Now we cannot reject the null, because 1200 lies within the 95% confidence interval
# Both the sample size and the significance level you choose can change the conclusion!

95% CI when n=25: 1151.269703498679 1384.7966852253207
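
To see how the interval narrows as the sample grows, the calculation can be repeated for several sample sizes while holding the sample mean and SD fixed. This is a what-if sketch only: the helper `ci_95` and the sample sizes other than 25 and 100 are hypothetical, and a real sample of a different size would have its own mean and SD.

```python
import numpy as np

# Summary statistics copied from the notebook output
sample_mean = 1268.033194362
sample_sd = 297.86604812071647
mu_0 = 1200  # value of the mean under the null hypothesis

def ci_95(n_hyp):
    """95% normal-based CI for a hypothetical sample size, same mean and SD."""
    se = sample_sd / np.sqrt(n_hyp)
    return sample_mean - 1.96 * se, sample_mean + 1.96 * se

for n_hyp in (25, 50, 100, 400):  # hypothetical sample sizes
    lower, upper = ci_95(n_hyp)
    inside = lower <= mu_0 <= upper
    print(f"n={n_hyp:4d}  95% CI=({lower:.2f}, {upper:.2f})  1200 inside: {inside}")
```

Since the standard error scales with 1/sqrt(n), quadrupling the sample size halves the width of the interval.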


## Hypothesis testing using p-value

In [25]:
# Test statistic, compared against the standard normal distribution (as with the confidence intervals above)
t_stat = (sample_mean - 1200) / standard_error
# Two-sided p-value: probability in both tails beyond |t_stat|
p_value = 2*norm.cdf(-abs(t_stat))
p_value

# At the 5% significance level: p-value < 0.05, so reject the null
# At the 1% significance level: p-value > 0.01, so cannot reject the null
# We arrive at the same conclusion whether we use (i) the confidence interval or (ii) the p-value to test a hypothesis.

0.02237036890526138
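
As a rough cross-check, the p-value can be recomputed from the summary statistics and compared with each significance level in code. The sketch below also reports the p-value under a Student t distribution with n - 1 degrees of freedom; that comparison is an addition for illustration and is not part of the notebook's method, which uses the normal distribution. The names `test_stat`, `p_normal`, `p_student`, and `mu_0` are hypothetical.

```python
from scipy.stats import norm, t

# Summary statistics copied from the notebook output
sample_mean = 1268.033194362
standard_error = 29.786604812071648
n = 100
mu_0 = 1200  # value of the mean under the null hypothesis

test_stat = (sample_mean - mu_0) / standard_error

# Normal-based two-sided p-value; norm.sf(x) equals norm.cdf(-x)
p_normal = 2 * norm.sf(abs(test_stat))
# Assumption for comparison only: Student t with n - 1 degrees of freedom
p_student = 2 * t.sf(abs(test_stat), df=n - 1)

print("Test statistic:", test_stat)
print("p-value (normal):", p_normal)
print("p-value (t, df = n - 1):", p_student)

for alpha in (0.05, 0.01):
    decision = "reject H0" if p_normal < alpha else "cannot reject H0"
    print(f"alpha = {alpha}: {decision}")
```

With a sample of 100, the two p-values should be close, so the choice of reference distribution is unlikely to change the decision here.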