# Assignment 9: Hypothesis Testing (Part 1)

## Objective

In many situations, we cannot observe the full population, only a sample. If we derive an interesting result from a sample, how likely is it that the same result holds for the entire population? In other words, we want to know whether the result is a true finding or whether it appeared in the sample just by chance. Hypothesis testing aims to answer this fundamental question.


**Hypothesis Testing**
1. Why A/B testing?  
2. What is a permutation test? How to implement it?
3. What is a p-value? How to avoid p-hacking?
4. What is a chi-squared test? How to implement it?


## Task 1. A/B Testing
> Acknowledgment: Thanks to [Greg Baker](http://www.cs.sfu.ca/~ggbaker/) for helping me prepare this task.

A very common technique to evaluate changes in a user interface is A/B testing: show some users interface A, some interface B, and then look to see if one performs better than the other.

Suppose I started an A/B test on CourSys. Here are the two interfaces that I want to compare. I want to know whether a good placeholder in the search box can attract more users to use the `search` feature.


![](img/ab-testing.png)

The provided [searchlog.json](searchlog.json) has information about users' usage. The question I was interested in: is the number of searches per user different?

To answer this question, we first need to pick a **test statistic** that quantifies how good an interface is. Here, we choose the mean of `search_count`.

Please write the code to compute **the difference of the search_count means between Interface A and Interface B.**

In [1]:
import pandas as pd
import numpy as np

data_df = pd.read_json('searchlog.json', lines=True)


mean_df = data_df.groupby("search_ui")['search_count'].mean().reset_index()

diff = mean_df['search_count'][1] - mean_df['search_count'][0]

diff

0.13500569535052287
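
The cell above picks the two group means by position (`[0]` and `[1]`), which relies on the groups being sorted as A, then B. As a minimal sketch, assuming a made-up toy frame (`toy_df` is hypothetical and only borrows the column names of `searchlog.json`), the same difference can be read off by label instead:

```python
import pandas as pd

# Hypothetical toy data; only the column names match searchlog.json.
toy_df = pd.DataFrame({
    "search_ui": ["A", "A", "B", "B", "B"],
    "search_count": [0, 2, 1, 3, 2],
})

# Series of per-interface means, indexed by the labels "A" and "B".
means = toy_df.groupby("search_ui")["search_count"].mean()

# Look up each group by label, so the result does not depend on row order.
toy_diff = means["B"] - means["A"]
print(toy_diff)
```

Indexing by label makes the direction of the subtraction (B minus A) explicit in the code.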

In [2]:
data_df

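| | uid | is_instructor | search_ui | search_count |
|---|---|---|---|---|
| 0 | 6061521 | True | A | 2 |
| 1 | 11986457 | False | A | 0 |
| 2 | 15995765 | False | A | 0 |
| 3 | 9106912 | True | B | 0 |
| 4 | 9882383 | False | A | 0 |
| ... | ... | ... | ... | ... |
| 676 | 16768212 | False | B | 0 |
| 677 | 7643715 | True | A | 0 |
| 678 | 14838641 | False | A | 0 |
| 679 | 6454817 | False | A | 0 |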


Suppose we find that the mean value increased by 0.135. Then, we wonder whether this result is just caused by random variation. 

We define the Null Hypothesis as
 * The difference in search_count mean between Interface A and Interface B is caused by random variation. 
 
Then the next job is to check whether we can reject the null hypothesis. If we can, we adopt the alternative explanation:
 * The difference in search_count mean  between Interface A and Interface B is caused by the design differences between the two.

We compute the p-value of the observed result. If the p-value is low (e.g., < 0.01), we can reject the null hypothesis and adopt the alternative explanation.

Please implement a permutation test (numSamples = 10000) to compute the p-value. Note that you are NOT allowed to use an implementation in an existing library. You have to implement it by yourself.

In [19]:

numSamples = 10000
k = []

# Shuffle the search_count column, take the mean of groups A and B, and record the difference of the means
data_df_copy = data_df.copy()
for i in range(numSamples):
    data_df_copy['search_count'] = np.random.permutation(data_df['search_count'].values)
    temp_mean_df = data_df_copy.groupby("search_ui")['search_count'].mean().reset_index()
    temp_diff = temp_mean_df['search_count'][1] - temp_mean_df['search_count'][0]
    k.append(temp_diff)



In [21]:
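# One-sided p-value: fraction of permuted differences strictly greater than the observed one
p_value = np.count_nonzero(np.asarray(k) > diff) / numSamples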

p_value

0.1139
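
The p-value above counts only permuted differences larger than the observed one, which is a one-sided test, while the question was whether the number of searches is *different*. The following is a minimal sketch of a two-sided variant that compares absolute differences. The helper name `permutation_p_value`, its `seed` parameter, and the synthetic Poisson data are assumptions made for illustration and are not part of the assignment.

```python
import numpy as np

def permutation_p_value(a, b, num_samples=10000, seed=0):
    """Two-sided permutation test for a difference in means (illustrative helper)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = b.mean() - a.mean()
    pooled = np.concatenate([a, b])

    count = 0
    for _ in range(num_samples):
        # Reassign the pooled values to the two groups at random.
        shuffled = rng.permutation(pooled)
        perm_diff = shuffled[len(a):].mean() - shuffled[:len(a)].mean()
        # Count permutations at least as extreme as the observed difference.
        if abs(perm_diff) >= abs(observed):
            count += 1
    return count / num_samples

# Synthetic data: both groups are drawn from the same distribution.
rng = np.random.default_rng(1)
group_a = rng.poisson(1.0, size=200)
group_b = rng.poisson(1.0, size=200)
p = permutation_p_value(group_a, group_b, num_samples=2000)
print(p)
```

Shuffling the pooled values and splitting them at `len(a)` is equivalent to shuffling the `search_count` column while keeping the `search_ui` labels fixed.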

Suppose we want to use the same dataset to run another A/B test. We suspect that instructors are the ones who can get more useful information from the search feature, so perhaps non-instructors didn't touch the search feature because it was genuinely not relevant to them.

So we decide to repeat the above analysis looking only at instructors.

**Q. If we use the same dataset for this analysis, do you feel like we're p-hacking? If so, what can we do about it?**

**A.** 

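Yes, this comes close to p-hacking: the instructor-only subgroup is chosen after the first test on the same data gave no significant result, so we are reshaping the analysis to fit our expectations. Every additional test on the same dataset raises the chance of finding a "significant" result purely by chance.
To avoid p-hacking, we should fix the hypotheses and subgroups before looking at the data, use a stricter significance threshold when running several tests (for example, a Bonferroni correction), or confirm the finding on newly collected data.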
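
One commonly used safeguard when several tests are run on the same dataset is the Bonferroni correction, which divides the significance level by the number of tests. The sketch below is illustrative only: `bonferroni_reject` is a hypothetical helper, and the p-values are made up rather than taken from the search log.

```python
def bonferroni_reject(p_values, alpha=0.01):
    """Return, for each p-value, whether it is below the Bonferroni-corrected threshold."""
    # With m tests, each individual test must pass alpha / m.
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Made-up p-values standing in for two analyses of the same dataset.
example_p_values = [0.20, 0.004]
decisions = bonferroni_reject(example_p_values, alpha=0.01)
print(decisions)
```

With two tests and `alpha=0.01`, each p-value has to fall below 0.005 to count as significant.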

## Task 2. Chi-squared Test 

There are dozens of different hypothesis testing methods. It's impossible to cover all of them in one week. Given that this is an important topic in statistics, I highly recommend using your free time to learn some other popular ones such as the <a href="https://en.wikipedia.org/wiki/Chi-squared_test">Chi-squared test</a>, <a href="https://en.wikipedia.org/wiki/G-test">G-test</a>, <a href="https://en.wikipedia.org/wiki/Student%27s_t-test">T-test</a>, and <a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Mann–Whitney U test</a>.

In the searchlog dataset, there are two categorical columns: `is_instructor` and `search_ui`. In this task, your job is to first learn how a chi-squared test works by yourself and then use it to test whether `is_instructor` and `search_ui` are correlated.

Please write code to compute the Chi-squared stat. Note that you are **not** allowed to call an existing function (e.g., stats.chi2, chi2_contingency). 

In [58]:
#<-- Write Your Code -->
# Observed counts, with row and column totals in the 'All' margins
contingency_table = pd.crosstab(
    data_df['is_instructor'],
    data_df['search_ui'],
    margins=True
)

# Observed frequencies, flattened row by row (margins excluded)
f_obs = contingency_table.iloc[0:2, 0:2].values.flatten()

row_sums = contingency_table.iloc[0:2, 2].values
col_sums = contingency_table.iloc[2, 0:2].values
total = contingency_table.loc['All', 'All']

# Expected frequency of each cell under independence:
# row total * column total / grand total, in the same row-by-row order as f_obs
f_expected = np.array([r * c / total for r in row_sums for c in col_sums])

chi_squared_statistic = ((f_obs - f_expected)**2 / f_expected).sum()
print('Chi-squared Statistic: {}'.format(chi_squared_statistic))

dof = (len(row_sums) - 1) * (len(col_sums) - 1)
print("Degrees of Freedom: {}".format(dof))

Chi-squared Statistic: 0.6731740891275046
Degrees of Freedom: 1
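
The expected counts used above are, for every cell, row total times column total divided by the grand total. As a minimal sketch, the same computation can be written for a table of any size with `np.outer`. The function name `chi_squared_from_table` and the 2x2 counts in `toy_table` are made up for illustration and are not the search log counts.

```python
import numpy as np

def chi_squared_from_table(observed):
    """Pearson chi-squared statistic and degrees of freedom for a table of counts."""
    observed = np.asarray(observed, dtype=float)
    row_sums = observed.sum(axis=1)
    col_sums = observed.sum(axis=0)
    total = observed.sum()

    # Expected count of cell (i, j) under independence: row_i * col_j / total.
    expected = np.outer(row_sums, col_sums) / total

    stat = ((observed - expected) ** 2 / expected).sum()
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    return stat, dof

# Made-up counts, used only to exercise the function.
toy_table = [[10, 20], [30, 40]]
toy_stat, toy_dof = chi_squared_from_table(toy_table)
print(toy_stat, toy_dof)
```

Note that this sketch applies no continuity correction, matching the manual computation in the cell above.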



Please explain how to use the chi-squared test to determine whether `is_instructor` and `search_ui` are correlated.

**A.** 
With one degree of freedom, we look up the statistic (about 0.673) in a chi-squared distribution table, which gives a p-value of about 0.41. This is far above common significance levels such as 0.01 or 0.05, so we cannot reject the null hypothesis that `is_instructor` and `search_ui` are independent. In other words, there is not enough evidence to conclude that the two columns are correlated.
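
For the special case of one degree of freedom, the table lookup can be checked with the standard library alone: a chi-squared variable with one degree of freedom is the square of a standard normal variable, so its upper-tail probability is `erfc(sqrt(x / 2))`. The helper name `chi2_sf_one_dof` below is hypothetical, and the sketch is not valid for other degrees of freedom.

```python
import math

def chi2_sf_one_dof(stat):
    """Upper-tail probability P(X > stat) for a chi-squared variable with 1 degree of freedom."""
    # X = Z**2 with Z standard normal, so P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))

# Statistic printed by the notebook cell above.
p_value_chi2 = chi2_sf_one_dof(0.6731740891275046)
print(p_value_chi2)
```

This only evaluates the tail probability; the statistic itself still comes from the manual computation above.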

## Submission

Complete the code in this notebook, and submit it to the CourSys activity Assignment 7.