# Module Four Discussion: Hypothesis Testing for the Difference in Two Population Proportions

This notebook contains the step-by-step directions for your Module Four discussion. It is very important to run through the steps in order. Some steps depend on the outputs of earlier steps. Once you have completed the steps in this notebook, be sure to answer the questions about this activity in the discussion for this module.

Reminder: If you have not already reviewed the discussion prompt, please do so before beginning this activity. That will give you an idea of the questions you will need to answer with the outputs of this script.


## Initial post (due Thursday)
_____________________________________________________________________________________________________________________________________________________

### Step 1: Generating sample data
This block of Python code generates two samples, each of size 50, that you will use in this discussion. Because the values are drawn at random, the data sets will be unique to you, and so will your answers. The numpy module draws each sample from a normal distribution. The samples are stored in pandas dataframes and used in later calculations.

Click the block of code below and hit the **Run** button above. 

In [10]:
try:
    import pandas as pd
    import numpy as np
    from statsmodels.stats.proportion import proportions_ztest # type: ignore

    print('Imports Successful!')

except Exception as e:
    print(f'Error: {str(e)}')

Imports Successful!


In [11]:
# draw 50 diameters per sample from a normal distribution
#   sample 1: mean = 2.48, std_dev = 0.500
#   sample 2: mean = 2.50, std_dev = 0.750
dia_sx1 = np.random.normal(2.48, 0.500, 50)
dia_sx2 = np.random.normal(2.50, 0.750, 50)

# define dataframes using sample arrays
dia_df1 = pd.DataFrame(dia_sx1, columns = ['diameters_1'])
dia_df2 = pd.DataFrame(dia_sx2, columns = ['diameters_2'])

# print the first five rows of each dataframe
print(f'{dia_df1.head().to_string(index = False)}\n')
print(f'{dia_df2.head().to_string(index = False)}\n')

 diameters_1
    2.657642
    3.690744
    2.733466
    2.430791
    2.695079

 diameters_2
    2.379532
    2.196717
    3.370580
    2.668101
    2.863301



### Step 2: Performing hypothesis test for the difference in population proportions
The z-test for proportions can be used to test for a difference between two population proportions. The **proportions_ztest** function in the statsmodels.stats.proportion submodule runs this test. Its inputs are a list of counts of observations meeting a certain condition (given in the problem statement) and a list of sample sizes for the two samples.

***Counts***  &nbsp;&nbsp; Python list that is assigned the number of observations in each sample with diameter values less than 2.20.  
***n***  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Python list that is assigned the total number of observations in each sample.
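
For reference, the statistic behind this test is commonly written as shown here. This is a sketch that assumes the pooled-proportion form of the test and a two-sided alternative. The symbols $x_1, x_2$ (counts of diameters less than 2.20) and $n_1, n_2$ (sample sizes) are notation introduced for illustration, not names used in the notebook.

$$
H_0: p_1 = p_2 \qquad H_a: p_1 \neq p_2
$$

$$
\hat{p}_1 = \frac{x_1}{n_1}, \qquad
\hat{p}_2 = \frac{x_2}{n_2}, \qquad
\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}
$$

$$
z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}
$$

The two-tailed p-value is then $2\,\bigl(1 - \Phi(|z|)\bigr)$, where $\Phi$ is the standard normal cumulative distribution function.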

Click the block of code below and hit the **Run** button above. 

In [12]:
from statsmodels.stats.proportion import proportions_ztest # type: ignore

# function to run the two-sample z-test for proportions and report the decision
def proportions_test(df1, df2, column1, column2):
    try:
        # number of observations from both df samples with diameter values less than 2.20. 
        sx1_obs = (df1[column1] < 2.20).sum()
        sx2_obs = (df2[column2] < 2.20).sum()

        # list of counts of observations from df1 and df2 < 2.20
        counts = [sx1_obs, sx2_obs]

        # total number of observations from the first and second sample
        n1 = len(df1)
        n2 = len(df2)

        # list of total counts from n1 and n2
        n = [n1, n2]
        
        # perform the hypothesis test. output is a Python tuple that contains test_statistic and the two-sided P_value.
        test_statistic, p_val = proportions_ztest(counts, n)
        
        # print formatted results
        print(f'test-statistic = {test_statistic:.2f}\ntwo-tailed p-value = {p_val:.4f}\n')

        # decide whether to reject or fail to reject the null hypothesis
        sig_lvl = 0.05
        if p_val < sig_lvl:
            print(f'Reject the null hypothesis: (p-value) {p_val:.4f} < (significance level) {sig_lvl}')
        else:
            print(f'Fail to reject the null hypothesis: (p-value) {p_val:.4f} > (significance level) {sig_lvl}')

    except Exception as e:
        print(f'Error: {str(e)}')

# run method
proportions_test(dia_df1, dia_df2, 'diameters_1', 'diameters_2') # type: ignore

test-statistic = 0.23
two-tailed p-value = 0.8174

Fail to reject the null hypothesis: (p-value) 0.8174 > (significance level) 0.05
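
If you want to see where a test statistic and p-value of this kind come from, you can compute them by hand. The block below is a minimal sketch, assuming the pooled-proportion form of the two-sample z-test with a two-sided alternative. The function name `two_proportion_ztest` and the counts 13 and 12 are hypothetical values chosen for illustration. They are not taken from the samples generated in Step 1, so your own counts will differ.

```python
import math

def two_proportion_ztest(counts, nobs):
    """Two-sided z-test for a difference in two proportions (pooled form).

    Hypothetical helper for illustration; not part of statsmodels.
    """
    x1, x2 = counts
    n1, n2 = nobs

    # sample proportions of observations meeting the condition
    p1_hat = x1 / n1
    p2_hat = x2 / n2

    # pooled proportion under the null hypothesis p1 = p2
    p_pool = (x1 + x2) / (n1 + n2)

    # standard error of the difference under the null hypothesis
    std_err = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

    z = (p1_hat - p2_hat) / std_err

    # two-tailed p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# hypothetical counts of diameters below 2.20 in two samples of size 50
z, p_value = two_proportion_ztest([13, 12], [50, 50])
print(f'test-statistic = {z:.2f}\ntwo-tailed p-value = {p_value:.4f}')
```

The sketch uses only the standard library so that each step of the arithmetic is visible. In the notebook itself, `proportions_ztest` remains the function to use.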


## End of initial post
Attach the HTML output to your initial post in the Module Four discussion. The HTML output can be downloaded by clicking **File**, then **Download as**, then **HTML**. Be sure to answer all questions about this activity in the Module Four discussion.

## Follow-up posts (due Sunday)
Return to the Module Four discussion to answer the follow-up questions in your response posts to other students. There are no Python scripts to run for your follow-up posts.