# Assignment 9: Hypothesis Testing (Part 1)

## Objective

In many situations, we cannot observe the full population, only a sample. If we derive an interesting result from a sample, how likely is it that the same result holds for the entire population? In other words, we want to know whether the result is a true finding or just happened in the sample by chance. Hypothesis testing aims to answer this fundamental question.


**Hypothesis Testing**
1. Why A/B testing?  
2. What is a permutation test? How to implement it?
3. What is a p-value? How to avoid p-hacking?
4. What is a chi-squared test? How to implement it?


## Task 1. A/B Testing
> Acknowledgment: Thanks to [Greg Baker](http://www.cs.sfu.ca/~ggbaker/) for helping me prepare this task.

A very common technique to evaluate changes in a user interface is A/B testing: show some users interface A, some interface B, and then look to see if one performs better than the other.

Suppose I started an A/B test on CourSys. Here are the two interfaces that I want to compare. I want to know whether a good placeholder in the search box attracts more users to the `search` feature.


![](img/ab-testing.png)

The provided [searchlog.json](searchlog.json) has information about users' usage. The question I was interested in: is the number of searches per user different?

To answer this question, we first need to pick a **test statistic** that quantifies how good an interface is. Here, we choose the mean of `search_count`.

Please write the code to compute **the difference of the search_count means between Interface A and Interface B.**

In [4]:
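#<-- Write Your Code -->
import pandas as pd
import numpy as np

# searchlog.json is in JSON Lines format: one record per line.
search_log = pd.read_json("searchlog.json", lines=True)
# Test statistic: mean search_count for each interface.
mean_sc = search_log.groupby('search_ui')['search_count'].mean()
# Signed difference (A minus B); the permutation test below reuses it.
diff = mean_sc['A'] - mean_sc['B']
print("The difference of the search_count means between interface A and interface B is :", abs(diff))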

The difference of the search_count means between interface A and interface B is : 0.13500569535052287


Suppose we find that the mean `search_count` increased by 0.135. We then want to know whether this result is just caused by random variation.

We define the Null Hypothesis as
 * The difference in search_count mean between Interface A and Interface B is caused by random variation. 
 
The next job is to check whether we can reject the null hypothesis. If we can, we adopt the alternative explanation:
 * The difference in search_count mean between Interface A and Interface B is caused by the design differences between the two.

We compute the p-value of the observed result. If the p-value is low (e.g., < 0.01), we can reject the null hypothesis and adopt the alternative explanation.

Please implement a permutation test (numSamples = 10000) to compute the p-value. Note that you are NOT allowed to use an implementation in an existing library. You have to implement it by yourself.

In [5]:
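#<-- Write Your Code -->
count = 0
# NOTE: the task statement asks for 10000 permutations; the output recorded below came from 1000.
num_samples = 1000
# Work on a real copy so that shuffling the labels does not modify search_log.
copy_search_log = search_log.copy()
for i in range(num_samples):
    # Randomly reassign the interface labels, which simulates the null hypothesis.
    copy_search_log['search_ui'] = np.random.permutation(copy_search_log['search_ui'].values)
    shuffled_mean = copy_search_log.groupby('search_ui')['search_count'].mean()
    local_diff = shuffled_mean['A'] - shuffled_mean['B']
    # One-sided comparison: count shuffles whose A - B difference is at most the observed one.
    if local_diff <= diff:
        count += 1

# The p-value is the fraction of shuffles counted above.
print("p-value computed using permutation test : ", count/num_samples)

p-value computed using permutation test :  0.136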
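
Because the question asks whether the two means are *different*, a two-sided variant that compares absolute differences may also be worth considering. The following is a minimal sketch under stated assumptions, not the assignment's required solution: the helper name `permutation_p_value`, the toy lists, and the seed are all illustrative and are not taken from `searchlog.json`. It uses only the standard library so that it runs on its own.

```python
import random

def permutation_p_value(a, b, num_samples=10000, seed=0):
    """Two-sided permutation test for a difference in means (illustrative helper)."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    pooled = list(a) + list(b)
    n_a = len(a)

    def mean(xs):
        return sum(xs) / len(xs)

    observed = abs(mean(a) - mean(b))
    count = 0
    for _ in range(num_samples):
        # Shuffling the pooled values is equivalent to shuffling the group labels.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:           # at least as extreme, in either direction
            count += 1
    return count / num_samples

# Toy data (hypothetical): two groups with identical values in a different order,
# and a third group shifted upward by 2 searches per user.
group_a = [0, 1, 0, 2, 1, 0, 3, 1] * 25
group_b = [1, 0, 2, 0, 1, 3, 0, 1] * 25
group_shifted = [x + 2 for x in group_a]

p_same = permutation_p_value(group_a, group_b, num_samples=2000)
p_shift = permutation_p_value(group_a, group_shifted, num_samples=2000)
print("no real difference:", p_same)
print("large real difference:", p_shift)
```

With no real difference the observed gap is zero, so every shuffle is at least as extreme and the p-value is large; with a large shift almost no shuffle reproduces the gap and the p-value is close to zero.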


Suppose we want to use the same dataset to do another A/B test. We suspect that instructors are the ones who can get more useful information from the search feature, so perhaps non-instructors didn't touch the search feature because it was genuinely not relevant to them.

So we decide to repeat the above analysis looking only at instructors.

**Q. If we use the same dataset for this analysis, do you feel like we're p-hacking? If so, what can we do about it?**

**A.** Yes, reusing the same dataset for a second analysis feels like p-hacking. It can be avoided by preregistration, which fixes the analysis plan up front and prevents selections or tweaks made after seeing the data. Alternatively, we can use a stricter significance threshold to account for the extra test; for example, a Bonferroni correction divides the threshold by the number of tests performed.
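
One common way to tighten the significance level when several tests share a dataset is the Bonferroni correction, which divides the threshold by the number of tests. The sketch below is illustrative only: the helper name `bonferroni_threshold` is hypothetical, the 0.01 threshold is the example value mentioned earlier in this task, and the count of two tests assumes that only the all-users and instructors-only analyses are run.

```python
def bonferroni_threshold(alpha, num_tests):
    """Per-test significance threshold after a Bonferroni correction."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests

alpha = 0.01       # example threshold from earlier in this task
num_tests = 2      # assumption: all users, then instructors only
threshold = bonferroni_threshold(alpha, num_tests)

p_all_users = 0.136  # p-value recorded for the first analysis above
print("per-test threshold:", threshold)
print("reject null for all users:", p_all_users < threshold)
```

Each additional analysis on the same data would increase `num_tests` and lower the per-test threshold further.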

## Task 2. Chi-squared Test 

There are tens of different hypothesis testing methods. It's impossible to cover all of them in one week. Given that this is an important topic in statistics, I highly recommend using your free time to learn some other popular ones such as <a href="https://en.wikipedia.org/wiki/Chi-squared_test">Chi-squared test</a>, <a href="https://en.wikipedia.org/wiki/G-test">G-test</a>, <a href="https://en.wikipedia.org/wiki/Student%27s_t-test">T-test</a>, and <a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Mann–Whitney U test</a>.

The searchlog dataset has two categorical columns: `is_instructor` and `search_ui`. In this task, your job is to first learn how a Chi-squared test works on your own and then use it to test whether `is_instructor` and `search_ui` are correlated.

Please write code to compute the Chi-squared stat. Note that you are **not** allowed to call an existing function (e.g., stats.chi2, chi2_contingency). 

In [6]:
#<-- Write Your Code -->
search_log = pd.read_json("searchlog.json", lines=True)
# Contingency table of observed counts; margins=True adds an 'All' row and column of totals.
table = pd.crosstab(index = search_log['is_instructor'], columns = search_log['search_ui'], margins=True)

# Row totals, column totals, and grand total (the corner cell is excluded from the margins).
margin_col = table['All'].values[:-1]
margin_row = table.loc['All'].values[:-1]
total = table['All'].values[-1]

# Expected count under independence: (row total * column total) / grand total.
expected_values = np.outer(margin_col, margin_row) / total

observed_values = table.iloc[:-1, :-1].to_numpy()
# Chi-squared statistic: sum of (observed - expected)^2 / expected over all cells.
calc_values = np.square(observed_values - expected_values) / expected_values
print("The chi-squared value is : ", np.sum(calc_values))

The chi-squared value is :  0.6731740891275046
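
In symbols, the statistic computed in the cell above can be written as follows. The notation is introduced here for illustration and does not appear in the original notebook: $O_{ij}$ is the observed count in row $i$ and column $j$ of the contingency table, $E_{ij}$ is the expected count under independence, $R_i$ and $C_j$ are the row and column totals, and $N$ is the grand total.

$$
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
\chi^2 = \sum_{i} \sum_{j} \frac{\left(O_{ij} - E_{ij}\right)^2}{E_{ij}}
$$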


Please explain how to use the Chi-squared test to determine whether `is_instructor` and `search_ui` are correlated.

**A.** The degrees of freedom equal (number of rows - 1) × (number of columns - 1) of the contingency table. For this 2 × 2 table, that is 1.

Consider:
- Null Hypothesis = The two variables are independent.
- Alternative Hypothesis = The two variables are dependent.
- Significance level = 0.05.

From the chi-squared distribution table, the critical value for 1 degree of freedom at a significance level of 0.05 is 3.841. Since the computed chi-squared value (0.67) is less than 3.841, we cannot reject the null hypothesis. Thus, we have no evidence that `is_instructor` and `search_ui` are dependent.
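
Instead of looking up a critical value, the same decision can be made from a p-value. For exactly 1 degree of freedom, the chi-squared tail probability has a closed form, $P(X \ge x) = \operatorname{erfc}(\sqrt{x/2})$, so it can be sketched with the standard library alone. The helper name `chi2_sf_df1` is hypothetical, and the formula is valid only for 1 degree of freedom; larger tables would need a different calculation.

```python
import math

def chi2_sf_df1(x):
    """P(X >= x) for a chi-squared variable with 1 degree of freedom."""
    if x < 0:
        raise ValueError("the chi-squared statistic cannot be negative")
    # A chi-squared(1) variable is the square of a standard normal Z,
    # so P(Z^2 >= x) = P(|Z| >= sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))

chi2_stat = 0.6731740891275046   # statistic recorded in the cell above
alpha = 0.05                     # significance level used in the answer

p_value = chi2_sf_df1(chi2_stat)
print("p-value:", p_value)
print("reject null hypothesis:", p_value < alpha)

# Sanity check: the tabulated critical value 3.841 should give a tail of about 0.05.
print("tail at 3.841:", chi2_sf_df1(3.841))
```

Comparing `p_value` with `alpha` leads to the same conclusion as comparing the statistic with the critical value.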



## Submission

Complete the code in this notebook, and submit it to the CourSys activity Assignment 9.