# Hypothesis testing: Chi-Square Test within the Montana Library case study

In this notebook we perform a chi-square test with the data from the Library of Montana University case study, applying a post-hoc correction to perform pairwise tests and find the true winner.

For the sake of simplicity, we follow the 9-step approach you already know, but use SciPy instead of doing the math manually.

## 1.&nbsp;Define the initial question for which the truth is not known.

To improve click-through rates (CTR) for the interact section of the website, we found 4 other terms that could be more intuitive alternatives.

In an A/B test, we showed all 5 versions to randomly selected visitors, and counted how many visitors clicked on each of the alternatives.

Now we want to know whether any of the 5 versions tested performed better than the others by a margin that is unlikely to be explained by chance alone.

## 2.&nbsp;State the Null Hypothesis and the Alternative Hypothesis.

Null Hypothesis ($H_0$): CTR(version 1) = CTR(version 2) = CTR(version 3) = CTR(version 4) = CTR(version 5)

Alternative Hypothesis ($H_A$): at least one of the versions has a CTR that differs from the others

## 3.&nbsp; Select an appropriate significance level alpha ($\alpha$).

It was decided that a relatively high alpha was acceptable in this case, so we set alpha = 0.1 instead of the more common 0.05.

In [None]:
alpha = 0.1

## 4.&nbsp; Consider the statistical assumptions about the set of data.

Four assumptions need to be met:

1. Both variables are categorical.

2. All observations are independent.

3. Cells in the contingency table are mutually exclusive.

4. The sample size is large enough (at least 5 observations in each of the cells of the table with expected values).

### 4.1&nbsp; Both variables are categorical.

#### 4.1.1&nbsp;Read in the data

The important pieces of information (clicks on each element of interest & visits on each page) are scattered around. Let's collect them:

In [None]:
import pandas as pd
import numpy as np
pd.set_option("max_colwidth", 1000)
#pd.set_option("max_rows", 1000)

# Element list Homepage Version 1 - Interact, 5-29-2013.csv
url = 'https://drive.google.com/file/d/1Tj6Z4OtJqLBOW0z2fvuGS5EhZo8xTVM6/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
v1 = pd.read_csv(path)

# Element list Homepage Version 2 - Connect, 5-29-2013.csv
url = 'https://drive.google.com/file/d/1qHBdOjUWvJpN-LTg1z2jpeA3mDXQjdch/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
v2 = pd.read_csv(path)

# Element list Homepage Version 3 - Learn, 5-29-2013.csv
url = 'https://drive.google.com/file/d/1g8prRmy3hpVtL6zvkdCwXcgIV0CS48zr/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
v3 = pd.read_csv(path)

# Element list Homepage Version 4 - Help, 5-29-2013.csv
url = 'https://drive.google.com/file/d/1I9bjXkxtiILDogeQmsWCCDlQtRZ8OSrs/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
v4 = pd.read_csv(path)

# Element list Homepage Version 5 - Services, 5-29-2013.csv
url = 'https://drive.google.com/file/d/1noDp_jpdAL_LGxU3SPDxqP94pUCqisqW/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
v5 = pd.read_csv(path)

In [None]:
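# clicks on each element
# .iloc[0] extracts the single matching value; calling int() directly
# on a Series is deprecated in recent pandas versions.
v1_clicks = int(v1.loc[v1["Name"] == "INTERACT", "No. clicks"].iloc[0])
v2_clicks = int(v2.loc[v2["Name"] == "CONNECT", "No. clicks"].iloc[0])
v3_clicks = int(v3.loc[v3["Name"] == "LEARN", "No. clicks"].iloc[0])
v4_clicks = int(v4.loc[v4["Name"] == "HELP", "No. clicks"].iloc[0])
v5_clicks = int(v5.loc[v5["Name"] == "SERVICES", "No. clicks"].iloc[0])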

In [None]:
print(v1_clicks, v2_clicks, v3_clicks, v4_clicks, v5_clicks)

42 53 21 38 45


In [None]:
# visits on each page (they are in the last column of the second row, we read them manually)
v1_visits = 10283
v2_visits = 2742
v3_visits = 2747
v4_visits = 3180
v5_visits = 2064

#### 4.1.2&nbsp; Calculate the CTR

Defined as clicks / visits

In [None]:
# click-through rates
interact_rate = float(v1_clicks / v1_visits)
connect_rate = float(v2_clicks / v2_visits)
learn_rate = float(v3_clicks / v3_visits)
help_rate = float(v4_clicks / v4_visits)
services_rate = float(v5_clicks / v5_visits)

In [None]:
# CTR from best to worst
rates = pd.Series([interact_rate, connect_rate, learn_rate, help_rate, services_rate])
names = pd.Series(["Interact", "Connect", "Learn", "Help", "Services"])

ctr_df = pd.DataFrame({"rates": rates, "names": names})
ctr_df.sort_values("rates", ascending=False)

Unnamed: 0,rates,names
4,0.021802,Services
1,0.019329,Connect
3,0.01195,Help
2,0.007645,Learn
0,0.004084,Interact
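As a compact alternative, the same rates can be computed in one vectorised step. This is only a sketch: the counts are copied by hand from the cells above, and the names `clicks_by_version`, `visits_by_version` and `ctr` are illustrative, not part of the original notebook.

```python
import pandas as pd

versions = ["Interact", "Connect", "Learn", "Help", "Services"]

# Counts copied from the cells above (assumed to be read correctly).
clicks_by_version = pd.Series([42, 53, 21, 38, 45], index=versions)
visits_by_version = pd.Series([10283, 2742, 2747, 3180, 2064], index=versions)

# CTR = clicks / visits, computed element-wise and sorted from best to worst.
ctr = (clicks_by_version / visits_by_version).sort_values(ascending=False)
print(ctr.round(6))
```

Using the version names as the index keeps each rate attached to its label, so no separate `names` column is needed.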


#### 4.1.3&nbsp;Create the contingency table

The contingency table holds the observed values: for each version, we note the clicks and the no-clicks (defined as visits - clicks).

In [None]:
# no-clicks
v1_noclick = v1_visits - v1_clicks
v2_noclick = v2_visits - v2_clicks
v3_noclick = v3_visits - v3_clicks
v4_noclick = v4_visits - v4_clicks
v5_noclick = v5_visits - v5_clicks

In [None]:
# contingency table as a pd.DataFrame creation
clicks = pd.Series([v1_clicks, v2_clicks, v3_clicks, v4_clicks, v5_clicks])
noclicks = pd.Series([v1_noclick, v2_noclick, v3_noclick, v4_noclick, v5_noclick])

observed = pd.DataFrame(data = [clicks, noclicks])
observed.columns = ["Interact", "Connect", "Learn", "Help", "Services"]
observed.index = ["Click", "No-click"]

observed

Unnamed: 0,Interact,Connect,Learn,Help,Services
Click,42,53,21,38,45
No-click,10241,2689,2726,3142,2019


Both the versions (Interact, Connect, Learn, Help and Services) and the results (click, no-click) are categories.

Condition fulfilled.

### 4.2&nbsp;All observations are independent.

This needs to be ensured while collecting the data. At this point, we will assume that the visitors to the website were allocated randomly and did not influence each other, and that thus, their clicks are independent.

Condition fulfilled.

### 4.3&nbsp;Cells in the contingency table are mutually exclusive.

Technically, this needs to be ensured while collecting the data. Logically, any visitor's behaviour can only be described using one single button (Interact, Connect, Learn, Help or Services) and one single row (click or no-click), meaning that the cells are mutually exclusive.

Condition fulfilled.

### 4.4&nbsp;The sample size is large enough (at least 5 observations in each of the cells of the table with expected values).

In [None]:
observed_expanded = observed.copy()
observed_expanded

Unnamed: 0,Interact,Connect,Learn,Help,Services
Click,42,53,21,38,45
No-click,10241,2689,2726,3142,2019


In [None]:
# Create a new row called "Total" with the totals of each column.
observed_expanded.loc["Total"] = observed_expanded.sum()
# Create a new column called "Total" with the totals of each row.
observed_expanded["Total"] = observed_expanded.sum(axis=1)
observed_expanded

Unnamed: 0,Interact,Connect,Learn,Help,Services,Total
Click,42,53,21,38,45,199
No-click,10241,2689,2726,3142,2019,20817
Total,10283,2742,2747,3180,2064,21016


In [None]:
# For reasons of clarity, we get the largest index of the observed_expanded dataframe and assign it to a variable.
max_row_index = len(observed_expanded.index)-1
max_row_index

2

In [None]:
# For reasons of clarity, we get the largest index of the observed_expanded dataframe columns and assign it to a variable.
max_column_index = len(observed_expanded.columns)-1
max_column_index

5

In [None]:
# Create table for the expected values as a copy of the observed table.
# We will overwrite the values in the cells with the code below.
expected = observed.copy()

# Iterating over the rows in the table.
for i in range(expected.shape[0]):
  # Iterating over the columns in the table.
  for j in range(expected.shape[1]):
    # Setting the value in each cell to be equal to:
    # the Total value of that same column in the observed_expanded table
    # (i.e. the total visitors of that version),
    # multiplied by the share of that row's total from the overall total
    # (i.e. the share of clicks/no-clicks from the overall total number of visitors) 
    expected.iloc[i,j] = observed_expanded.iloc[max_row_index,j] * (observed_expanded.iloc[i,max_column_index]/observed_expanded.iloc[max_row_index,max_column_index])

expected

Unnamed: 0,Interact,Connect,Learn,Help,Services
Click,97.36948,25.963932,26.011277,30.111344,19.543967
No-click,10185.63052,2716.036068,2720.988723,3149.888656,2044.456033


The smallest expected count is about 19.5 (Click, Services), well above the required minimum of 5 in each cell.

Condition fulfilled.
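The expected counts can also be obtained without an explicit loop. The sketch below uses `scipy.stats.contingency.expected_freq` on the observed counts, which are retyped here by hand so the cell runs on its own; `observed_counts`, `expected_counts` and `smallest_expected` are illustrative names.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Observed clicks / no-clicks per version, copied from the table above.
observed_counts = np.array([
    [42, 53, 21, 38, 45],             # Click
    [10241, 2689, 2726, 3142, 2019],  # No-click
])

# Expected counts under independence: row total * column total / grand total.
expected_counts = expected_freq(observed_counts)
smallest_expected = expected_counts.min()

print(expected_counts.round(2))
print("Smallest expected count:", round(smallest_expected, 2))
print("Sample size condition met:", smallest_expected >= 5)
```

Checking the minimum of the expected table is enough here, since the condition only asks that no cell falls below 5.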

## 5.&nbsp;Decide on the appropriate test to use and the associated test statistic.

Comparing the observed frequencies to the expected frequencies in one or more categories of a contingency table is done using a **Chi-squared test**.

We will spare you the formula of the test statistic this time.

## 6.&nbsp;Derive the distribution of the test statistic under the Null Hypothesis from the assumptions.

The chi-squared test statistic follows a chi-squared distribution with c degrees of freedom.
The shape of the contingency table determines c:

c = (number of rows - 1) * (number of columns - 1)

In [None]:
# We can get this directly from scipy; let's just calculate it manually for fun.
degrees_of_freedom = (observed.shape[0] - 1) * (observed.shape[1] - 1)
degrees_of_freedom

4

## 7.&nbsp;Compute the test statistic using the data set.

In [None]:
from scipy import stats
chisq, pvalue, df, expected = stats.chi2_contingency(observed)
print("test statistic:", chisq)

test statistic: 96.7432353798328
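To see what SciPy does under the hood, the statistic can be cross-checked by summing (observed - expected)² / expected over all cells. This is a minimal sketch with the counts retyped by hand; `observed_counts`, `manual_chisq` and `scipy_chisq` are illustrative names.

```python
import numpy as np
from scipy import stats

observed_counts = np.array([
    [42, 53, 21, 38, 45],             # Click
    [10241, 2689, 2726, 3142, 2019],  # No-click
])

# Expected counts under the Null Hypothesis.
row_totals = observed_counts.sum(axis=1, keepdims=True)
column_totals = observed_counts.sum(axis=0, keepdims=True)
expected_counts = row_totals * column_totals / observed_counts.sum()

# Pearson chi-squared statistic: sum of squared deviations scaled by the expected counts.
manual_chisq = ((observed_counts - expected_counts) ** 2 / expected_counts).sum()

# First element of the result is the test statistic.
scipy_chisq = stats.chi2_contingency(observed_counts)[0]

print(manual_chisq, scipy_chisq)
```

Both values should agree for this 2 x 5 table, because SciPy only applies its continuity correction when there is a single degree of freedom.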


## 8.&nbsp;Derive the p-value.

This step differs a bit from step 8 in the manual approach. As we saw earlier, it is possible to:
- compare the test statistic to the critical value(s)/region(s) or
- compare the p-value to alpha.

Since scipy gives us the p-value nice and easy, we will choose this approach.

In [None]:
pvalue

4.852334301093838e-20
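For completeness, the other option (comparing the test statistic to the critical value) can be sketched with `scipy.stats.chi2`. The test statistic and degrees of freedom are copied from the outputs above; `critical_value` and `p_from_statistic` are illustrative names.

```python
from scipy import stats

alpha = 0.1
degrees_of_freedom = 4
test_statistic = 96.7432353798328  # copied from step 7

# Critical value: the point that cuts off the upper alpha share of the distribution.
critical_value = stats.chi2.ppf(1 - alpha, df=degrees_of_freedom)

# The p-value is the upper-tail probability beyond the test statistic.
p_from_statistic = stats.chi2.sf(test_statistic, df=degrees_of_freedom)

print("critical value:", critical_value)
print("statistic exceeds critical value:", test_statistic > critical_value)
print("p-value:", p_from_statistic)
```

Both routes lead to the same decision: a statistic beyond the critical value corresponds to a p-value below alpha.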

## 9.&nbsp;Compare the p-value and alpha.

Remember: alpha = 0.1.

In [None]:
if pvalue > alpha:
  print("The p-value is larger than alpha.")
else:
  print("The p-value is smaller than alpha.")

The p-value is smaller than alpha.


Does this mean that we should reject the Null Hypothesis - or not?

Since the p-value is (much) smaller than alpha, we reject the Null Hypothesis.

> Remember: **If p is low, the Null must go!**

This means that at least one of the five different versions performed significantly better or worse than the others.

# But how do we decide who's the winner?

If you feel very brave, read about [Post Hoc Tests](https://alanarnholt.github.io/PDS-Bookdown2/post-hoc-tests-1.html) and find out whether we can declare a clear winner.

Otherwise, just go on in the notebook.

In [None]:
ctr_df.sort_values("rates", ascending=False)

Unnamed: 0,rates,names
4,0.021802,Services
1,0.019329,Connect
3,0.01195,Help
2,0.007645,Learn
0,0.004084,Interact


We have 10 possible pairwise tests to perform:
* Interact - Learn
* Interact - Help
* Interact - Connect
* Interact - Services
* Learn - Help
* Learn - Connect
* Learn - Services
* Help - Connect
* Help - Services
* Connect - Services

The main takeaway from the post-hoc tests is that the alpha we selected for the chi-squared test cannot be reused for each pairwise test. If every one of the 10 tests carried its own 10% risk of a false positive, the chance of at least one false positive across all of them would be far higher than the 10% we set for alpha, so we need to be much stricter in each pairwise test.

Therefore, we split the value chosen for alpha equally among the pairwise tests to be performed. This is known as the Bonferroni correction.

In [None]:
possible_combinations = 10
alpha_post_hoc = alpha / possible_combinations
np.round(alpha_post_hoc, 4)

0.01
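A rough calculation shows why the split is needed. The sketch below assumes, purely for illustration, that the 10 tests are independent, which is not strictly true here because they share data; the variable names are illustrative.

```python
from math import comb

alpha = 0.1

# Number of distinct pairs among 5 versions.
number_of_tests = comb(5, 2)

# Chance of at least one false positive across all tests,
# ASSUMING the tests are independent (an illustrative simplification).
fwer_uncorrected = 1 - (1 - alpha) ** number_of_tests

# Same calculation with alpha split equally among the tests.
alpha_per_test = alpha / number_of_tests
fwer_corrected = 1 - (1 - alpha_per_test) ** number_of_tests

print("number of tests:", number_of_tests)
print("uncorrected family-wise error rate:", round(fwer_uncorrected, 4))
print("corrected family-wise error rate:", round(fwer_corrected, 4))
```

Under this assumption, the uncorrected risk is roughly 65%, while the corrected one stays just below the original 10%.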

Let's do the 10 pair-wise tests, and pay close attention to the best performing version.

For each of the tests, we print
- the p-value and
- True if the p-value is smaller than the alpha -> reject the Null Hypothesis
- False if the p-value is greater than the alpha -> do not reject the Null Hypothesis

In [None]:
# interact vs connect
chisq, pvalue, df, expected = stats.chi2_contingency(observed.loc[:, ["Interact", "Connect"]])
print(pvalue)
print(pvalue < alpha_post_hoc)

2.2250331654688293e-16
True


In [None]:
# interact vs learn
chisq, pvalue, df, expected = stats.chi2_contingency(observed.loc[:, ["Interact", "Learn"]])
print(pvalue)
print(pvalue < alpha_post_hoc)

0.025419824342152637
False


In [None]:
# interact vs help
chisq, pvalue, df, expected = stats.chi2_contingency(observed.loc[:, ["Interact", "Help"]])
print(pvalue)
print(pvalue < alpha_post_hoc)

9.03599988558687e-07
True


In [None]:
# connect vs learn
chisq, pvalue, df, expected = stats.chi2_contingency(observed.loc[:, ["Connect", "Learn"]])
print(pvalue)
print(pvalue < alpha_post_hoc)

0.00027678881264505827
True


In [None]:
# connect vs help
chisq, pvalue, df, expected = stats.chi2_contingency(observed.loc[:, ["Connect", "Help"]])
print(pvalue)
print(pvalue < alpha_post_hoc)

0.02808815288948292
False


In [None]:
# learn vs help
chisq, pvalue, df, expected = stats.chi2_contingency(observed.loc[:, ["Learn", "Help"]])
print(pvalue)
print(pvalue < alpha_post_hoc)

0.12512753088691322
False


In [None]:
# services vs interact
chisq, pvalue, df, expected = stats.chi2_contingency(observed.loc[:, ["Interact", "Services"]])
print(pvalue)
print(pvalue < alpha_post_hoc)

5.719451224375125e-18
True


In [None]:
# services vs learn
chisq, pvalue, df, expected = stats.chi2_contingency(observed.loc[:, ["Learn", "Services"]])
print(pvalue)
print(pvalue < alpha_post_hoc)

5.0540996583731365e-05
True


In [None]:
# services vs help
chisq, pvalue, df, expected = stats.chi2_contingency(observed.loc[:, ["Help", "Services"]])
print(pvalue)
print(pvalue < alpha_post_hoc)

0.007370912499282061
True


In [None]:
# services vs connect
chisq, pvalue, df, expected = stats.chi2_contingency(observed.loc[:, ["Connect", "Services"]])
print(pvalue)
print(pvalue < alpha_post_hoc)

0.6188771123975272
False
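The ten cells above could also be condensed into a single loop over all pairs. This is a sketch, not the notebook's original approach: the observed table is retyped by hand so the cell runs on its own, and `observed_counts`, `pairwise_results` and the column names of the result are illustrative. Note that for 2 x 2 tables `chi2_contingency` applies Yates' continuity correction by default, which is also what the individual cells above do.

```python
from itertools import combinations

import pandas as pd
from scipy import stats

observed_counts = pd.DataFrame(
    {
        "Interact": [42, 10241],
        "Connect": [53, 2689],
        "Learn": [21, 2726],
        "Help": [38, 3142],
        "Services": [45, 2019],
    },
    index=["Click", "No-click"],
)

alpha_post_hoc = 0.1 / 10  # alpha split equally among the 10 pairwise tests

rows = []
for first, second in combinations(observed_counts.columns, 2):
    # Second element of the result is the p-value.
    pair_pvalue = stats.chi2_contingency(observed_counts[[first, second]])[1]
    rows.append({
        "pair": f"{first} - {second}",
        "pvalue": pair_pvalue,
        "reject_H0": pair_pvalue < alpha_post_hoc,
    })

pairwise_results = pd.DataFrame(rows).sort_values("pvalue", ignore_index=True)
print(pairwise_results)
```

Collecting the results in one table makes it easier to see at a glance which pairs differ significantly after the correction.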


The differences between Services and each of Help, Learn and Interact are statistically significant, but the difference between Services and Connect is not.

Let's look at their ordered ranking again.

In [None]:
ctr_df.sort_values("rates", ascending=False)

Unnamed: 0,rates,names
4,0.021802,Services
1,0.019329,Connect
3,0.01195,Help
2,0.007645,Learn
0,0.004084,Interact


This means that we will reject the hypothesis that Help, Learn and Interact perform just as well as Services, and will not consider them anymore.

Also, this result does not let us reject the hypothesis that Services and Connect perform equally well. Based on these results alone, it is not possible to decide on the best-performing version: it might be either Services or Connect.

To decide the winner, we will therefore need to add more steps. This is where we will leave the field of statistics, and come back into the business world. The following actions might help to choose which version should be on the website in the future:

- Look at other metrics besides CTR.
- Refer to the qualitative research.
- Ask subject-matter experts for their opinions.
- Redesign the experiment and run it again.