# AB Testing / Course Reviews Datasets

AB Testing Steps:

* 1- Create a hypothesis
* 2- Check the assumptions
  * Assumption 1: Distributions are normal for each group
  * Assumption 2: Variances are homogeneous
* 3- Apply the hypothesis test and check the p-value. If the p-value is less than 0.05 we reject H0; otherwise we fail to reject H0.
  * a) If the assumptions hold, use the independent samples t-test (parametric test, `ttest_ind`)
  * b) If the assumptions do not hold, use the Mann-Whitney U test (non-parametric test, `mannwhitneyu`)
 
**Note**: If assumption 1 does not hold, we can go directly to the non-parametric test (option b). If assumption 1 holds but assumption 2 does not, we can still use the parametric test and tell it that the variances are not homogeneous by passing `equal_var=False` to `ttest_ind` (Welch's t-test).
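
The decision procedure above can be sketched as a small function. This is only an illustration under stated assumptions: `compare_two_groups` is a hypothetical helper name, the 0.05 threshold is the one used in this notebook, and the data is synthetic (deliberately skewed) rather than taken from `course_reviews.csv`.

```python
import numpy as np
from scipy.stats import shapiro, levene, ttest_ind, mannwhitneyu


def compare_two_groups(a, b, alpha=0.05):
    """Hypothetical helper: choose a two-sample test from the assumption checks."""
    # Assumption 1: both samples look normal (Shapiro-Wilk, H0: normal)
    normal = shapiro(a)[1] >= alpha and shapiro(b)[1] >= alpha
    if not normal:
        # Option b: non-parametric test
        stat, p = mannwhitneyu(a, b, alternative="two-sided")
        return "mannwhitneyu", stat, p

    # Assumption 2: variances are homogeneous (Levene, H0: equal variances)
    equal_var = levene(a, b)[1] >= alpha
    # Option a: t-test; equal_var=False switches to Welch's t-test
    stat, p = ttest_ind(a, b, equal_var=equal_var)
    return ("t-test" if equal_var else "welch t-test"), stat, p


# Synthetic, strongly skewed samples (not the course data)
rng = np.random.default_rng(0)
skewed_a = rng.exponential(scale=1.0, size=300)
skewed_b = rng.exponential(scale=1.5, size=300)

name, stat, p = compare_two_groups(skewed_a, skewed_b)
print(name, float(stat), float(p))
```

Because the synthetic samples are far from normal, the helper takes the non-parametric branch, which mirrors the path this notebook ends up following for the rating data.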

___

**Import libraries and create dataframe called df**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.stats.api as sms
from scipy.stats import ttest_1samp, shapiro, levene, ttest_ind, mannwhitneyu, \
pearsonr, spearmanr, kendalltau, f_oneway, kruskal

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 10)
pd.set_option('display.float_format', lambda x: '%.5f' % x)

df = pd.read_csv('course_reviews.csv')

### Problem: Is there any difference in the scores of those who watched the course and those who did not?

**Step 1: Create hypothesis and check the dataframe**

Hypothesis (H0): There is no statistically significant difference between the scores of those who watched the course and those who did not.

u1 = mean score of those who watched the course (progress greater than 75%)<br>
u2 = mean score of those who did not watch the course (progress less than 25%)

H0: u1 = u2<br>
H1: u1 != u2

In [2]:
df.head()

| | Rating | Timestamp | Enrolled | Progress | Questions Asked | Questions Answered |
|---|---|---|---|---|---|---|
| 0 | 5.0 | 2021-02-05 07:45:55 | 2021-01-25 15:12:08 | 5.0 | 0.0 | 0.0 |
| 1 | 5.0 | 2021-02-04 21:05:32 | 2021-02-04 20:43:40 | 1.0 | 0.0 | 0.0 |
| 2 | 4.5 | 2021-02-04 20:34:03 | 2019-07-04 23:23:27 | 1.0 | 0.0 | 0.0 |
| 3 | 5.0 | 2021-02-04 16:56:28 | 2021-02-04 14:41:29 | 10.0 | 0.0 | 0.0 |
| 4 | 4.0 | 2021-02-04 15:00:24 | 2020-10-13 03:10:07 | 10.0 | 0.0 | 0.0 |


**Watched**

In [8]:
df[df['Progress'] > 75].agg({'Rating': ['count', 'mean']})

| | Rating |
|---|---|
| count | 448.0 |
| mean | 4.86049 |


In [9]:
watched = df[df['Progress'] > 75]
watched

| | Rating | Timestamp | Enrolled | Progress | Questions Asked | Questions Answered |
|---|---|---|---|---|---|---|
| 6 | 5.00000 | 2021-02-04 12:25:30 | 2020-11-30 19:23:54 | 85.00000 | 0.00000 | 4.00000 |
| 14 | 5.00000 | 2021-02-03 19:08:33 | 2020-03-29 09:07:41 | 93.00000 | 1.00000 | 0.00000 |
| 112 | 5.00000 | 2021-01-22 19:11:38 | 2020-12-04 17:17:47 | 80.00000 | 1.00000 | 2.00000 |
| 167 | 4.50000 | 2021-01-13 21:49:32 | 2019-11-27 14:10:04 | 100.00000 | 2.00000 | 1.00000 |
| 174 | 5.00000 | 2021-01-12 09:51:38 | 2020-11-23 18:10:54 | 100.00000 | 0.00000 | 0.00000 |
| ... | ... | ... | ... | ... | ... | ... |
| 4198 | 5.00000 | 2019-06-10 11:57:54 | 2019-06-08 17:43:25 | 80.00000 | 0.00000 | 0.00000 |
| 4199 | 5.00000 | 2019-06-10 11:02:04 | 2019-06-06 16:19:40 | 89.00000 | 0.00000 | 0.00000 |
| 4201 | 5.00000 | 2019-06-10 08:53:39 | 2019-05-15 14:13:03 | 78.00000 | 3.00000 | 2.00000 |
| 4273 | 5.00000 | 2019-05-26 18:59:50 | 2019-05-22 00:18:14 | 95.00000 | 0.00000 | 0.00000 |


**Not Watched**

In [10]:
df[df['Progress'] < 25].agg({'Rating': ['count', 'mean']})

| | Rating |
|---|---|
| count | 2573.0 |
| mean | 4.7225 |


In [11]:
not_watched = df[df['Progress'] < 25]
not_watched

| | Rating | Timestamp | Enrolled | Progress | Questions Asked | Questions Answered |
|---|---|---|---|---|---|---|
| 0 | 5.00000 | 2021-02-05 07:45:55 | 2021-01-25 15:12:08 | 5.00000 | 0.00000 | 0.00000 |
| 1 | 5.00000 | 2021-02-04 21:05:32 | 2021-02-04 20:43:40 | 1.00000 | 0.00000 | 0.00000 |
| 2 | 4.50000 | 2021-02-04 20:34:03 | 2019-07-04 23:23:27 | 1.00000 | 0.00000 | 0.00000 |
| 3 | 5.00000 | 2021-02-04 16:56:28 | 2021-02-04 14:41:29 | 10.00000 | 0.00000 | 0.00000 |
| 4 | 4.00000 | 2021-02-04 15:00:24 | 2020-10-13 03:10:07 | 10.00000 | 0.00000 | 0.00000 |
| ... | ... | ... | ... | ... | ... | ... |
| 4316 | 5.00000 | 2019-05-17 17:46:04 | 2019-05-16 20:25:44 | 3.00000 | 0.00000 | 0.00000 |
| 4317 | 5.00000 | 2019-05-17 10:33:15 | 2019-05-17 10:29:41 | 1.00000 | 0.00000 | 0.00000 |
| 4319 | 5.00000 | 2019-05-16 21:27:05 | 2019-05-16 20:32:15 | 5.00000 | 0.00000 | 0.00000 |
| 4320 | 5.00000 | 2019-05-16 20:22:26 | 2019-05-16 20:21:19 | 1.00000 | 0.00000 | 0.00000 |
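
The two filters above can also be written as one labelled split, which makes it explicit that rows with `Progress` between 25 and 75 (inclusive) belong to neither group. The sketch below is an illustration only: the `toy` dataframe and the `Group` column are hypothetical stand-ins, not part of `course_reviews.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the real dataframe
toy = pd.DataFrame({
    "Rating":   [5.0, 4.5, 5.0, 4.0, 3.5, 5.0],
    "Progress": [90.0, 80.0, 10.0, 5.0, 50.0, 1.0],
})

# Same thresholds as the notebook: > 75 is "watched", < 25 is "not_watched"
conditions = [toy["Progress"] > 75, toy["Progress"] < 25]
toy["Group"] = np.select(conditions, ["watched", "not_watched"], default="excluded")

# Count and mean rating per group, ignoring the middle band
summary = (
    toy[toy["Group"] != "excluded"]
    .groupby("Group")["Rating"]
    .agg(["count", "mean"])
)
print(summary)
```

On the toy data, the row with `Progress` equal to 50 is labelled `excluded` and does not contribute to either group's count or mean.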


**Step 2: Assumption Check**

Assumption 1: Distributions are normal for each group (use the Shapiro-Wilk test, `shapiro`, to check normality)<br>
Assumption 2: Variances are homogeneous (use the Levene test, `levene`, to check homogeneity of variances)<br>

Assumption 1:

H0: Distribution is normal. <br>
H1: Distribution is not normal.

In [12]:
test_stat, p_value = shapiro(df.loc[df['Progress'] > 75, 'Rating'])
print('Test statistic: %.5f\np value: %.5f' % (test_stat, p_value))

Test statistic: 0.31595
p value: 0.00000


The p-value is less than 0.05, so we reject H0: the ratings of those who watched the course are not normally distributed.

In [13]:
test_stat, p_value = shapiro(df.loc[df['Progress'] < 25, 'Rating'])
print('Test statistic: %.5f\np value: %.5f' % (test_stat, p_value))

Test statistic: 0.57096
p value: 0.00000


The p-value is less than 0.05, so we reject H0: the ratings of those who did not watch the course are not normally distributed either. Since neither distribution is normal, we could go straight to the non-parametric test, but let's check the second assumption anyway.

Assumption 2:

H0: Variances are homogeneous. <br>
H1: Variances are not homogeneous.

In [14]:
test_stat, p_value = levene(df.loc[df['Progress'] > 75, 'Rating'],
                           df.loc[df['Progress'] < 25, 'Rating'])
print('Test statistic: %.5f\np value: %.5f' % (test_stat, p_value))

Test statistic: 24.92766
p value: 0.00000


The p-value is less than 0.05, so we reject H0: the variances of the two groups are not homogeneous.

**Step 3: Apply Hypothesis**

Non-parametric solution: Mann-Whitney U test (`mannwhitneyu`)

In [15]:
test_stat, p_value = mannwhitneyu(df.loc[df['Progress'] > 75, 'Rating'],
                           df.loc[df['Progress'] < 25, 'Rating'])
print('Test statistic: %.5f\np value: %.5f' % (test_stat, p_value))

Test statistic: 661481.50000
p value: 0.00000


The p-value is less than 0.05, which means that we reject H0. So we can say that there is a statistically significant difference between the scores of those who watched the course and those who did not.
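
The p-value says a difference exists, not how large it is. As a rough sketch, the numbers already reported in this notebook can be turned into a simple magnitude measure: the U statistic divided by the product of the group sizes. This assumes the reported statistic is U for the first sample (the watched group), which is how SciPy 1.7 and later return it; the variable names below are illustrative.

```python
# Values reported earlier in this notebook
u_stat = 661481.5       # Mann-Whitney U statistic from mannwhitneyu
n_watched = 448         # count of rows with Progress > 75
n_not_watched = 2573    # count of rows with Progress < 25

# Assumption: u_stat is U for the first sample (watched), as in SciPy 1.7+.
# U / (n1 * n2) estimates the chance that a random "watched" rating is higher
# than a random "not watched" rating, counting ties as one half.
prob_superiority = u_stat / (n_watched * n_not_watched)
print(round(prob_superiority, 3))  # → 0.574
```

A value of 0.5 would mean no tendency either way, so under this assumption the watched group tends to rate somewhat higher, which is consistent with the group means shown above (4.86049 vs 4.7225).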