# Assignment 9: Hypothesis Testing (Part 1)

## Objective

In many situations, we cannot observe the full population, only a sample. If we derive an interesting result from a sample, how likely is it that the same result holds for the entire population? In other words, we want to know whether the result is a true finding or whether it appeared in the sample just by chance. Hypothesis testing aims to answer this fundamental question.


**Hypothesis Testing**
1. Why A/B testing?  
2. What is a permutation test? How do you implement it?
3. What is a p-value? How do you avoid p-hacking?
4. What is a chi-squared test? How do you implement it?


## Task 1. A/B Testing
> Acknowledgment: Thanks to [Greg Baker](http://www.cs.sfu.ca/~ggbaker/) for helping me prepare this task.

A very common technique to evaluate changes in a user interface is A/B testing: show some users interface A, some interface B, and then look to see if one performs better than the other.

Suppose I started an A/B test on CourSys. Here are the two interfaces that I want to compare. I want to know whether a good placeholder in the search box can attract more users to the `search` feature.


![](img/ab-testing.png)

The provided [searchlog.json](searchlog.json) has information about users' usage. The question I was interested in: is the number of searches per user different?

To answer this question, we first need to pick a **test statistic** that quantifies how good an interface is. Here, we choose "the search_count mean".

Please write the code to compute **the difference of the search_count means between interface A and Interface B.** 

In [50]:
#<-- Write Your Code -->
import pandas as pd
from numpy import random
import numpy as np
# Each line of searchlog.json is a separate JSON record
read_json=pd.read_json("searchlog.json",lines=True)
# Mean search_count for each interface
read_cols=read_json[["search_ui","search_count"]]
read_group=read_cols.groupby("search_ui").mean()
print("The difference of the search_count means between interface A and Interface B is",read_group["search_count"]["B"]-read_group["search_count"]["A"])

The difference of the search_count means between interface A and Interface B is 0.13500569535052287


Suppose we find that the mean value increased by 0.135. Then, we wonder whether this result is just caused by random variation. 

We define the Null Hypothesis as
 * The difference in search_count mean between Interface A and Interface B is caused by random variation. 
 
The next step is to check whether we can reject the null hypothesis. If we can, we adopt the alternative explanation:
 * The difference in search_count mean  between Interface A and Interface B is caused by the design differences between the two.

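We compute the p-value of the observed result, that is, the probability of seeing a difference at least as large as the observed one if the null hypothesis were true. If the p-value is low (e.g., < 0.01), we can reject the null hypothesis and adopt the alternative explanation.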

Please implement a permutation test (numSamples = 10000) to compute the p-value. Note that you are NOT allowed to use an implementation in an existing library. You have to implement it by yourself.

In [51]:
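#<-- Write Your Code -->
numSamples=10000
# Observed difference in means (B - A), recomputed so this cell is self-contained
group_means=read_json.groupby("search_ui")["search_count"].mean()
observed_diff=group_means["B"]-group_means["A"]
counts=read_json["search_count"].to_numpy()
is_b=(read_json["search_ui"]=="B").to_numpy()
mean_list=[]
for i in range(numSamples):
  # Shuffle a copy of the interface labels so read_json itself is never modified
  shuffled=random.permutation(is_b)
  mean_list.append(counts[shuffled].mean()-counts[~shuffled].mean())

# One-sided p-value: fraction of shuffles with a difference at least as large as the observed one
mean_list=[d for d in mean_list if d>=observed_diff]
print(len(mean_list)/numSamples)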

0.1321


In [52]:
## The p-value is larger than 0.01, so we fail to reject the null hypothesis.
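The question posed for this task asks whether the search counts are *different*, which is a two-sided question, while counting only the shuffles that exceed the observed increase gives a one-sided p-value. As a minimal sketch of the two-sided variant, the function below compares absolute differences instead. It runs on small made-up lists rather than on `searchlog.json`; the function name `permutation_p_value`, its parameters, and the sample data are assumptions for illustration only.

```python
import random
from statistics import fmean

def permutation_p_value(group_a, group_b, num_samples=10000, seed=0):
    """Two-sided permutation test for a difference in means (illustrative sketch)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(fmean(group_b) - fmean(group_a))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(num_samples):
        rng.shuffle(pooled)  # random relabelling of users under the null hypothesis
        diff = abs(fmean(pooled[n_a:]) - fmean(pooled[:n_a]))
        if diff >= observed - 1e-12:  # small tolerance for floating-point ties
            extreme += 1
    return extreme / num_samples

# Made-up search counts, not taken from searchlog.json
counts_a = [0, 1, 0, 2, 1, 0, 3, 1]
counts_b = [1, 2, 0, 3, 2, 1, 4, 1]
p = permutation_p_value(counts_a, counts_b, num_samples=2000)
print(p)
```

Taking absolute values counts shuffles that are extreme in either direction, so the two-sided p-value is never smaller than the one-sided one for the same shuffles.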

Suppose we want to use the same dataset to run another A/B test. We suspect that instructors are the ones who can get more useful information from the search feature, so perhaps non-instructors didn't touch the search feature because it was genuinely not relevant to them.

So we decide to repeat the above analysis looking only at instructors.

**Q. If we use the same dataset for this analysis, do you feel like we're p-hacking? If so, what can we do about it?**

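**A.** Yes. Each additional test on the same dataset is another chance to find a difference that arose purely by chance, so repeating the analysis on subgroups until something looks significant is p-hacking.

To compensate, use a stricter significance threshold. For example, a Bonferroni correction divides the threshold by the number of tests, so with two tests 0.01 becomes 0.005.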
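One common way to account for running several tests on the same data is the Bonferroni correction, which divides the significance threshold by the number of tests. The sketch below shows the arithmetic; the helper name `bonferroni_threshold` is hypothetical, and the starting threshold of 0.01 is the one used earlier in this task.

```python
def bonferroni_threshold(alpha, num_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests

# Two tests on the same data: all users, then instructors only
threshold = bonferroni_threshold(0.01, 2)
print(threshold)
```

Each individual p-value is then compared against `threshold` rather than against the original 0.01.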


## Task 2. Chi-squared Test 

There are tens of different hypothesis testing methods. It's impossible to cover all of them in one week. Given that this is an important topic in statistics, I highly recommend using your free time to learn some other popular ones such as <a href="https://en.wikipedia.org/wiki/Chi-squared_test">Chi-squared test</a>, <a href="https://en.wikipedia.org/wiki/G-test">G-test</a>, <a href="https://en.wikipedia.org/wiki/Student%27s_t-test">T-test</a>, and <a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Mann–Whitney U test</a>.

In the searchlog dataset, there are two categorical columns: `is_instructor` and `search_ui`. In this task, your job is to first learn how a chi-squared test works by yourself and then use it to test whether `is_instructor` and `search_ui` are correlated.

Please write code to compute the Chi-squared stat. Note that you are **not** allowed to call an existing function (e.g., stats.chi2, chi2_contingency). 
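For reference, the Pearson chi-squared statistic for a contingency table is commonly written as below. The symbols are notation introduced here for illustration: $O_{ij}$ is the observed count in row $i$ and column $j$, $E_{ij}$ is the count expected if the two columns were independent, $R_i$ and $C_j$ are the row and column totals, and $N$ is the grand total.

$$
\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{R_i \, C_j}{N}
$$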

In [53]:
#<-- Write Your Code -->
x1=read_json[read_json['is_instructor']==True]
x2=read_json[read_json['is_instructor']==False]
y1=read_json[read_json['search_ui']=='A']
y2=read_json[read_json['search_ui']=='B']
output11=len(np.intersect1d(x1.uid, y1.uid))
output12=len(np.intersect1d(x1.uid, y2.uid))
output21=len(np.intersect1d(x2.uid, y1.uid))
output22=len(np.intersect1d(x2.uid, y2.uid))
#print(output11,output12,output21,output22)
# Expected count for each cell = (row total * column total) / grand total
total=output11+output12+output21+output22
row1,row2=output11+output12,output21+output22
col1,col2=output11+output21,output12+output22
x_1=(output11-row1*col1/total)**2/(row1*col1/total)
x_2=(output12-row1*col2/total)**2/(row1*col2/total)
x_3=(output21-row2*col1/total)**2/(row2*col1/total)
x_4=(output22-row2*col2/total)**2/(row2*col2/total)
print("chi-square - value",x_1+x_2+x_3+x_4)


chi-square - value 0.6731740891275046
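The cell-by-cell computation generalizes to a contingency table of any size. The following is a minimal sketch in plain Python, run on a made-up table rather than on the searchlog counts; the function name `chi_squared_stat` and the example table are assumptions for illustration.

```python
def chi_squared_stat(observed):
    """Pearson chi-squared statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Count expected in this cell if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (count - expected) ** 2 / expected
    return stat

table = [[10, 20], [30, 40]]  # made-up counts, not from searchlog.json
stat = chi_squared_stat(table)
print(round(stat, 4))
```

A table whose rows are exact multiples of each other, such as `[[10, 20], [20, 40]]`, matches its expected counts in every cell and gives a statistic of zero.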


Please explain how to use a chi-squared test to determine whether `is_instructor` and `search_ui` are correlated.

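**A.** State the null hypothesis that `is_instructor` and `search_ui` are independent. The contingency table is 2x2, so the degrees of freedom are (2-1)(2-1) = 1. Set the level of significance to 0.05 and look up the corresponding critical value in the chi-squared table, which is 3.841 for one degree of freedom.
If the calculated chi-squared value is less than the critical value, we fail to reject the null hypothesis, so there is no evidence that `is_instructor` and `search_ui` are correlated. Here 0.673 < 3.841, so that is the conclusion.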
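Instead of a table lookup, the comparison can also be expressed as a p-value. With one degree of freedom, which is the case for a 2x2 table, the chi-squared tail probability satisfies P(X > x) = erfc(sqrt(x / 2)), so the standard library is enough. The helper name `chi2_p_value_df1` is hypothetical, and the sketch is valid only for one degree of freedom.

```python
import math

def chi2_p_value_df1(stat):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

stat = 0.6731740891275046  # chi-squared value printed by the notebook cell
p_value = chi2_p_value_df1(stat)
print(p_value, p_value < 0.05)
```

A p-value above the chosen significance level leads to the same decision as a statistic below the critical value: the null hypothesis of independence is not rejected.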

## Submission

Complete the code in this notebook, and submit it to the CourSys activity Assignment 9.