## A/B Testing: Chi-Square Test with the Montana Library Case Study

In this notebook we run a chi-square test of independence on data from the Library of Montana University case study, using SciPy's `chi2_contingency`.

### Data reading

The figures we need (clicks on each element of interest and visits to each page) are spread across five CSV exports, one per homepage version. Let's collect them:

In [3]:
import pandas as pd
import numpy as np
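# Use fully qualified option names: the short form "max_rows" is ambiguous in recent pandas versions
pd.set_option("display.max_colwidth", 1000)
pd.set_option("display.max_rows", 1000)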

# Element list Homepage Version 1 - Interact, 5-29-2013.csv
url = 'https://drive.google.com/file/d/1Tj6Z4OtJqLBOW0z2fvuGS5EhZo8xTVM6/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
v1 = pd.read_csv(path)

# Element list Homepage Version 2 - Connect, 5-29-2013.csv
url = 'https://drive.google.com/file/d/1qHBdOjUWvJpN-LTg1z2jpeA3mDXQjdch/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
v2 = pd.read_csv(path)

# Element list Homepage Version 3 - Learn, 5-29-2013.csv
url = 'https://drive.google.com/file/d/1g8prRmy3hpVtL6zvkdCwXcgIV0CS48zr/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
v3 = pd.read_csv(path)

# Element list Homepage Version 4 - Help, 5-29-2013.csv
url = 'https://drive.google.com/file/d/1I9bjXkxtiILDogeQmsWCCDlQtRZ8OSrs/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
v4 = pd.read_csv(path)

# Element list Homepage Version 5 - Services, 5-29-2013.csv
url = 'https://drive.google.com/file/d/1noDp_jpdAL_LGxU3SPDxqP94pUCqisqW/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
v5 = pd.read_csv(path)
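
The five loading blocks above repeat the same URL conversion. A minimal sketch of how that step could be factored out is shown below; the helper name `drive_download_url` and the dictionaries `share_urls` and `download_urls` are hypothetical names introduced here, not part of the original notebook. The actual `pd.read_csv` call is left commented out because it needs network access.

```python
def drive_download_url(share_url: str) -> str:
    """Turn a Google Drive share link into a direct-download link.

    Assumes the link has the form .../file/d/<file id>/view?usp=sharing,
    as all five links in this notebook do.
    """
    file_id = share_url.split("/")[-2]
    return "https://drive.google.com/uc?export=download&id=" + file_id


# Share links copied from the cells above, keyed by homepage version.
share_urls = {
    "v1": "https://drive.google.com/file/d/1Tj6Z4OtJqLBOW0z2fvuGS5EhZo8xTVM6/view?usp=sharing",
    "v2": "https://drive.google.com/file/d/1qHBdOjUWvJpN-LTg1z2jpeA3mDXQjdch/view?usp=sharing",
    "v3": "https://drive.google.com/file/d/1g8prRmy3hpVtL6zvkdCwXcgIV0CS48zr/view?usp=sharing",
    "v4": "https://drive.google.com/file/d/1I9bjXkxtiILDogeQmsWCCDlQtRZ8OSrs/view?usp=sharing",
    "v5": "https://drive.google.com/file/d/1noDp_jpdAL_LGxU3SPDxqP94pUCqisqW/view?usp=sharing",
}

download_urls = {name: drive_download_url(url) for name, url in share_urls.items()}

# With pandas imported and network access available, the frames could then be read as:
# frames = {name: pd.read_csv(url) for name, url in download_urls.items()}
```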

In [4]:
# clicks on each element of interest
# (.iloc[0] extracts the scalar; calling int() on a one-element Series is deprecated in pandas 2.x)
v1_clicks = int(v1.loc[v1["Name"] == "INTERACT", "No. clicks"].iloc[0])
v2_clicks = int(v2.loc[v2["Name"] == "CONNECT", "No. clicks"].iloc[0])
v3_clicks = int(v3.loc[v3["Name"] == "LEARN", "No. clicks"].iloc[0])
v4_clicks = int(v4.loc[v4["Name"] == "HELP", "No. clicks"].iloc[0])
v5_clicks = int(v5.loc[v5["Name"] == "SERVICES", "No. clicks"].iloc[0])

In [5]:
print(v1_clicks, v2_clicks, v3_clicks, v4_clicks, v5_clicks)

42 53 21 38 45


In [6]:
# visits to each page (found in the last column of the second row of each CSV; copied by hand)
v1_visits = 10283
v2_visits = 2742
v3_visits = 2747
v4_visits = 3180
v5_visits = 2064

#### Click-through rate

The click-through rate (CTR) is defined as clicks / visits.

In [7]:
# click-through rates (true division already returns a float in Python 3)
interact_rate = v1_clicks / v1_visits
connect_rate = v2_clicks / v2_visits
learn_rate = v3_clicks / v3_visits
help_rate = v4_clicks / v4_visits
services_rate = v5_clicks / v5_visits

In [8]:
# CTR from best to worst
rates = pd.Series([interact_rate, connect_rate, learn_rate, help_rate, services_rate])
names = pd.Series(["Interact", "Connect", "Learn", "Help", "Services"])

ctr_df = pd.DataFrame({"CTR": rates, "names": names})
ctr_df.sort_values("CTR", ascending=False)

| | CTR | names |
|---|---|---|
| 4 | 0.021802 | Services |
| 1 | 0.019329 | Connect |
| 3 | 0.01195 | Help |
| 2 | 0.007645 | Learn |
| 0 | 0.004084 | Interact |
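
The same ranking can be obtained in one vectorized step. The sketch below is an illustrative alternative, not the notebook's own code: it hard-codes the click and visit counts shown earlier, and the names `clicks_by_version`, `visits_by_version`, and `ctr` are assumptions introduced here.

```python
import pandas as pd

versions = ["Interact", "Connect", "Learn", "Help", "Services"]

# Counts copied from the earlier cells of this notebook.
clicks_by_version = pd.Series([42, 53, 21, 38, 45], index=versions)
visits_by_version = pd.Series([10283, 2742, 2747, 3180, 2064], index=versions)

# Element-wise division aligns on the index, so each version gets its own CTR.
ctr = (clicks_by_version / visits_by_version).sort_values(ascending=False).rename("CTR")
print(ctr.round(6))
```

Indexing both Series by version name keeps each rate attached to its label, so no separate `names` column is needed.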


#### Contingency table

The contingency table holds the observed counts: clicks and no-clicks (defined as visits - clicks) for each homepage version.

In [9]:
# no-clicks
v1_noclick = v1_visits - v1_clicks
v2_noclick = v2_visits - v2_clicks
v3_noclick = v3_visits - v3_clicks
v4_noclick = v4_visits - v4_clicks
v5_noclick = v5_visits - v5_clicks

In [10]:
# build the contingency table as a pd.DataFrame (rows: outcome, columns: version)
clicks = pd.Series([v1_clicks, v2_clicks, v3_clicks, v4_clicks, v5_clicks])
noclicks = pd.Series([v1_noclick, v2_noclick, v3_noclick, v4_noclick, v5_noclick])

observed = pd.DataFrame(data = [clicks, noclicks])
observed.columns = ["Interact", "Connect", "Learn", "Help", "Services"]
observed.index = ["Click", "No-click"]

observed

| | Interact | Connect | Learn | Help | Services |
|---|---|---|---|---|---|
| Click | 42 | 53 | 21 | 38 | 45 |
| No-click | 10241 | 2689 | 2726 | 3142 | 2019 |


### SciPy approach

Null hypothesis: **Interact, Connect, Learn, Help, Services** all have the same click-through rate (clicking is independent of the version shown)

Alternative hypothesis: at least one of **Interact, Connect, Learn, Help, Services** has a different click-through rate

Confidence level: **95%** or 0.95

Significance level (alpha): 1 - 0.95 = 0.05

To reject the null hypothesis, the p-value must be less than or equal to alpha (p-value <= 0.05)

In [13]:
from scipy import stats
# returns the test statistic, the p-value, the degrees of freedom and the expected counts
chi2, pvalue, dof, expected = stats.chi2_contingency(observed)
print(f"p value: {pvalue} pvalue <= 0.05 ? {pvalue <= 0.05}")

p value: 4.852334301093838e-20 pvalue <= 0.05 ? True
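
To make the SciPy call less of a black box, here is a minimal sketch of the same computation done by hand: expected counts under independence (row total × column total / grand total), the Pearson statistic, and the p-value from the chi-square survival function. The array `observed_counts` hard-codes the contingency table above, and all variable names are assumptions introduced here. For tables larger than 2×2, `chi2_contingency` applies no continuity correction by default, so the two results should agree.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the contingency table (rows: click, no-click).
observed_counts = np.array([
    [42, 53, 21, 38, 45],
    [10241, 2689, 2726, 3142, 2019],
])

row_totals = observed_counts.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed_counts.sum(axis=0, keepdims=True)  # shape (1, 5)
grand_total = observed_counts.sum()

# Counts we would expect if clicking were independent of the version shown.
expected_counts = row_totals * col_totals / grand_total

# Pearson chi-square statistic and its degrees of freedom.
chi2_stat = ((observed_counts - expected_counts) ** 2 / expected_counts).sum()
dof = (observed_counts.shape[0] - 1) * (observed_counts.shape[1] - 1)

# Upper-tail probability of the chi-square distribution.
p_value = stats.chi2.sf(chi2_stat, dof)
print(f"chi2: {chi2_stat:.2f}, dof: {dof}, p value: {p_value}")
```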


Next we drop Interact, the version with the lowest CTR, and test the remaining four.

Null hypothesis: **Connect, Learn, Help, Services** all have the same click-through rate

Alternative hypothesis: at least one of **Connect, Learn, Help, Services** has a different click-through rate

Confidence level: **95%** or 0.95

Significance level (alpha): 1 - 0.95 = 0.05

To reject the null hypothesis, the p-value must be less than or equal to alpha (p-value <= 0.05)

In [19]:
chi2, pvalue, dof, expected = stats.chi2_contingency(observed[['Connect', 'Learn', 'Help', 'Services']])
print(f"p value: {pvalue} pvalue <= 0.05 ? {pvalue <= 0.05}")

p value: 5.25509870228566e-05 pvalue <= 0.05 ? True


Next we also drop Learn, the lowest CTR among the remaining versions.

Null hypothesis: **Connect, Help, Services** all have the same click-through rate

Alternative hypothesis: at least one of **Connect, Help, Services** has a different click-through rate

Confidence level: **95%** or 0.95

Significance level (alpha): 1 - 0.95 = 0.05

To reject the null hypothesis, the p-value must be less than or equal to alpha (p-value <= 0.05)

In [30]:
chi2, pvalue, dof, expected = stats.chi2_contingency(observed[['Connect', 'Help', 'Services']])
print(f"p value: {pvalue} pvalue <= 0.05 ? {pvalue <= 0.05}")

p value: 0.013726659948517513 pvalue <= 0.05 ? True


Finally we drop Help and compare the two versions with the highest CTR.

Null hypothesis: **Connect, Services** have the same click-through rate

Alternative hypothesis: **Connect, Services** have different click-through rates

Confidence level: **95%** or 0.95

Significance level (alpha): 1 - 0.95 = 0.05

To reject the null hypothesis, the p-value must be less than or equal to alpha (p-value <= 0.05)

In [31]:
chi2, pvalue, dof, expected = stats.chi2_contingency(observed[['Connect', 'Services']])
print(f"p value: {pvalue} pvalue <= 0.05 ? {pvalue <= 0.05}")

p value: 0.6188771123975272 pvalue <= 0.05 ? False


In [32]:
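def calculate_chi2_values(observed, cols, alpha=0.05):
    """Run a chi-square test of independence on the selected columns and print the p-value."""
    chi2, pvalue, dof, expected = stats.chi2_contingency(observed[cols])
    print(f"p value: {pvalue} pvalue <= {alpha} ? {pvalue <= alpha}")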


In [33]:
calculate_chi2_values(observed, ['Interact', 'Connect', 'Learn', 'Help', 'Services'])

p value: 4.852334301093838e-20 pvalue <= 0.05 ? True
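
The four tests run in this notebook can also be collected in a single pass. The sketch below is one possible way to do it, assuming the contingency table shown earlier; `observed_table`, `chi2_pvalue`, `column_subsets`, and `summary` are hypothetical names introduced here, and the helper returns the p-value instead of printing it so the results can be tabulated.

```python
import pandas as pd
from scipy import stats

# Contingency table copied from the "Contingency table" section.
observed_table = pd.DataFrame(
    [[42, 53, 21, 38, 45], [10241, 2689, 2726, 3142, 2019]],
    index=["Click", "No-click"],
    columns=["Interact", "Connect", "Learn", "Help", "Services"],
)


def chi2_pvalue(table, cols):
    """Return the p-value of a chi-square test of independence on the given columns."""
    _, pvalue, _, _ = stats.chi2_contingency(table[cols])
    return pvalue


# The same column subsets tested one by one above.
column_subsets = [
    ["Interact", "Connect", "Learn", "Help", "Services"],
    ["Connect", "Learn", "Help", "Services"],
    ["Connect", "Help", "Services"],
    ["Connect", "Services"],
]

alpha = 0.05
summary = pd.DataFrame({
    "versions": [", ".join(cols) for cols in column_subsets],
    "p_value": [chi2_pvalue(observed_table, cols) for cols in column_subsets],
})
summary["reject_null"] = summary["p_value"] <= alpha
print(summary)
```

Each test here is evaluated against the same alpha of 0.05, as in the cells above. Running several tests this way raises the overall chance of at least one false positive above 5%, which the notebook does not adjust for.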
