# Module Three Discussion: Confidence Intervals and Hypothesis Testing

This notebook contains the step-by-step directions for your Module Three discussion. It is very important to run through the steps in order. Some steps depend on the outputs of earlier steps. Once you have completed the steps in this notebook, be sure to answer the questions about this activity in the discussion for this module.

Reminder: If you have not already reviewed the discussion prompt, please do so before beginning this activity. That will give you an idea of the questions you will need to answer with the outputs of this script.



## Initial post (due Thursday)
_____________________________________________________________________________________________________________________________________________________

### Step 1: Generating sample data
This block of Python code generates a random sample of size 50 that you will use in this discussion. Because the values are drawn at random, your sample will be unique, and so will your answers. The numpy module lets you create a data set drawn from a Normal distribution; the mean and standard deviation have been chosen for you. The data set is saved in a pandas DataFrame that is used in later calculations.

Click the block of code below and hit the **Run** button above. 

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as st

In [2]:
# create 50 randomly chosen values from a Normal distribution. (arbitrarily using mean=2.48 and standard deviation=0.50). 
dia = np.random.normal(2.4800, 0.500, 50) # mean = 2.48, std = 0.50, sample size = 50

# convert the array into a DataFrame with the column name "diameters" using the pandas library.
dia_df = pd.DataFrame(dia, columns = ['diameters']) # one column holding the 50 randomly generated diameters
dia_df = np.round(dia_df, 2) # round column values to 2 decimal places

# print the DataFrame (note that the index of the DataFrame starts at 0).
print('Diameters DataFrame:\n', dia_df.head(10)) # head(10) shows only the first 10 of the 50 rows
print('\nDescribe:\n', dia_df.describe()) # summary statistics (count, mean, std, quartiles) for EDA

Diameters DataFrame:
    diameters
0       2.84
1       0.93
2       2.49
3       2.08
4       1.92
5       2.54
6       2.17
7       2.39
8       2.02
9       3.21

Describe:
        diameters
count  50.000000
mean    2.419000
std     0.562234
min     0.930000
25%     2.095000
50%     2.335000
75%     2.682500
max     3.710000
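
Because `np.random.normal` draws from NumPy's global random state, each run of the cell above produces a different sample. If you ever need to regenerate the same sample after restarting the kernel, one option is a seeded generator. The sketch below is an illustration and not part of the assignment; the seed value `2024` and the names `rng`, `dia_seeded`, and `dia_repeat` are assumptions chosen for the example.

```python
import numpy as np

# Hypothetical seed: any fixed integer makes the draws repeatable.
rng = np.random.default_rng(2024)
dia_seeded = rng.normal(loc=2.48, scale=0.50, size=50)

# Re-creating the generator with the same seed reproduces the same draws.
rng_again = np.random.default_rng(2024)
dia_repeat = rng_again.normal(loc=2.48, scale=0.50, size=50)

print('Sample size:', dia_seeded.size)
print('Same sample on both runs:', np.array_equal(dia_seeded, dia_repeat))
```

Without a seed, expect the sample mean and standard deviation to land near, but not exactly on, 2.48 and 0.50.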


### Step 2: Constructing confidence intervals
You will assume that the population standard deviation is known and that the sample size is sufficiently large. Then you will use the Normal distribution to construct these confidence intervals. You will use the submodule scipy.stats to construct confidence intervals using your sample data. 

Click the block of code below and hit the **Run** button above. 

In [3]:
# Python methods that calculate confidence intervals require the sample mean and the standard error as inputs.
# calculate the sample mean
sx_mean = dia_df['diameters'].mean()

# input the population standard deviation, which was given in Step 1.
std_dev = 0.5000

# calculate standard error = standard deviation / sqrt(n)   where n is the sample size.
sx_size = len(dia_df['diameters'])
SE = std_dev/np.sqrt(sx_size) # standard error of the sample mean

alpha = 0.01 # significance level as defined by the discussion prompt

# helper function that computes and prints a confidence interval for any confidence level
def print_confidence_int(level, mean, se):
    conf_int = st.norm.interval(level, loc = mean, scale = se)
    rounded = (np.round(conf_int[0], 2), np.round(conf_int[1], 2)) # lower and upper bounds as a tuple
    print(f'{round(level * 100)}% confidence interval (unrounded): {conf_int}')
    print(f'{round(level * 100)}% confidence interval (rounded): {rounded}\n')

# print 90% and 99% confidence intervals
print_confidence_int(0.90, sx_mean, SE) # 90% confidence level using sx_mean
print_confidence_int(0.99, sx_mean, SE) # 99% confidence level using sx_mean

90% confidence interval (unrounded): (2.302691284632333, 2.535308715367668)
90% confidence interval (rounded): (2.3, 2.54)

99% confidence interval (unrounded): (2.2368613632281553, 2.6011386367718456)
99% confidence interval (rounded): (2.24, 2.6)
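
The intervals above follow the formula: sample mean ± z* × (σ / √n), where z* is the critical value for the chosen confidence level. As a cross-check, the sketch below rebuilds both intervals by hand using only the Python standard library. The helper name `z_interval` is hypothetical, and the sample mean of 2.419 is taken from the example output shown in this notebook, so your own numbers will differ.

```python
from math import sqrt
from statistics import NormalDist

# Values from the example output in this notebook; yours will differ.
sample_mean = 2.419
sigma = 0.50   # known population standard deviation from Step 1
n = 50         # sample size

def z_interval(mean, sigma, n, level):
    """Return (lower, upper) bounds of a z-based confidence interval."""
    se = sigma / sqrt(n)                                 # standard error
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)   # two-sided critical value
    margin = z_star * se                                 # margin of error
    return mean - margin, mean + margin

ci_90 = z_interval(sample_mean, sigma, n, 0.90)
ci_99 = z_interval(sample_mean, sigma, n, 0.99)
print('90% interval:', ci_90)
print('99% interval:', ci_99)
```

The 99% interval is wider than the 90% interval because a higher confidence level requires a larger critical value, and therefore a larger margin of error, for the same standard error.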



### Step 3: Performing hypothesis testing for the population mean
Since you were given the population standard deviation in Step 1 and the sample size is sufficiently large, you can use the z-test for population means. The `ztest` function in the statsmodels.stats.weightstats submodule runs the z-test. Its inputs are the sample data and the value under the null hypothesis. Its outputs are the test statistic and the two-tailed P-value. Note that `ztest` estimates the standard deviation from the sample rather than using the population value from Step 1.

Click the block of code below and hit the **Run** button above. 

In [4]:
pip install statsmodels

Note: you may need to restart the kernel to use updated packages.


In [5]:
from statsmodels.stats.weightstats import ztest # type: ignore

# run z-test hypothesis test for population mean. The value under the null hypothesis is 2.30.
def run_z_test(data, hyp_mean):
    test_statistic, p_value = ztest(x1 = data, value = hyp_mean)
    
    print(f'z-test hypothesis test for population mean:')
    print(f'Test statistic = {test_statistic:.2f}')
    print(f'Two-Tailed P Value = {p_value:.4f}')

    if p_value < alpha: # alpha = 0.01 is the significance level defined in Step 2
        print(f'Reject the null hypothesis: Significant difference {p_value:.4f} < {alpha:.4f}\n')
    else:
        print(f'Fail to reject the null hypothesis: No significant difference {p_value:.4f} > {alpha:.4f}\n')

run_z_test(dia_df['diameters'], 2.30) # hypothesized mean of 2.30 per the discussion prompt
run_z_test(dia_df['diameters'], 2.35) # extra run with a hypothesized mean of 2.35 to compare outputs

z-test hypothesis test for population mean:
Test statistic = 1.50
Two-Tailed P Value = 0.1345
Fail to reject the null hypothesis: No significant difference 0.1345 > 0.0100

z-test hypothesis test for population mean:
Test statistic = 0.87
Two-Tailed P Value = 0.3855
Fail to reject the null hypothesis: No significant difference 0.3855 > 0.0100
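
The test statistic is z = (sample mean − hypothesized mean) / (standard deviation / √n), and the two-tailed P-value is twice the upper-tail area beyond |z|. The sketch below computes both by hand from summary statistics, once with the sample standard deviation (which `ztest` uses by default) and once with the known σ = 0.50 from Step 1. The helper name `z_test_two_sided` is hypothetical, and the summary values come from the example `describe()` output in this notebook, so your own numbers will differ.

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_sided(sample_mean, std, n, hyp_mean):
    """Return (z statistic, two-tailed P-value) from summary statistics."""
    se = std / sqrt(n)                          # standard error of the mean
    z = (sample_mean - hyp_mean) / se           # test statistic
    p = 2 * (1 - NormalDist().cdf(abs(z)))      # two-tailed P-value
    return z, p

# Summary values from the example describe() output; yours will differ.
sample_mean, sample_std, n = 2.419, 0.562234, 50

# Using the sample standard deviation, as ztest does by default.
z_sample, p_sample = z_test_two_sided(sample_mean, sample_std, n, 2.30)

# Using the known population standard deviation from Step 1.
z_known, p_known = z_test_two_sided(sample_mean, 0.50, n, 2.30)

print(f'Sample std: z = {z_sample:.2f}, p = {p_sample:.4f}')
print(f'Known sigma: z = {z_known:.2f}, p = {p_known:.4f}')
```

For this example sample, the first pair matches the `ztest` output above. The known-σ version gives a somewhat larger test statistic because 0.50 is smaller than the sample standard deviation of about 0.56.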



## End of initial post
Attach the HTML output to your initial post in the Module Three discussion. The HTML output can be downloaded by clicking **File**, then **Download as**, then **HTML**. Be sure to answer all questions about this activity in the Module Three discussion.

## Follow-up posts (due Sunday)
Return to the Module Three discussion to answer the follow-up questions in your response posts to other students. There are no Python scripts to run for your follow-up posts.