### Most relevant take-aways from Hypothesis Testing (t-test / ANOVA):
* a p-value can be used for A/B testing and for checking dependencies between variables;
* you are checking whether your result is statistically significant (adding evidence that it is unlikely to be explained by chance alone);
* in other words: the p-value is the probability of observing a difference at least as extreme as yours if the null hypothesis were true (that is, if there were really no difference). With a significance level of 0.05, a test repeated on many samples from populations with no real difference would wrongly signal a difference only about 5% of the time;
* a low p-value (below 0.05) is evidence that the means of the samples you are testing are different. It is not the probability that the means are equal.
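
One way to see what the 0.05 threshold means is to simulate many experiments in which there is truly no difference between the groups and count how often a t-test still returns a p-value below 0.05. The sketch below is only an illustration: the population parameters, the sample sizes and the number of simulated experiments are all assumptions, and it uses `scipy.stats.ttest_ind`, which is not used elsewhere in this notebook.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_experiments, n_per_group, alpha = 5000, 30, 0.05

# Both groups come from the SAME population, so the null hypothesis is true
group_a = rng.normal(loc=100, scale=15, size=(n_experiments, n_per_group))
group_b = rng.normal(loc=100, scale=15, size=(n_experiments, n_per_group))

# One two-sample t-test per row (one row = one simulated experiment)
p_values = stats.ttest_ind(group_a, group_b, axis=1).pvalue

# Share of experiments that look "significant" purely by chance
false_positive_rate = np.mean(p_values < alpha)
print(f"share of experiments with p < {alpha}: {false_positive_rate:.3f}")
```

The printed share should land close to 0.05: that is the rate of false alarms we accept when we choose 0.05 as the significance level.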



## ANOVA

One-way ANOVA (ANalysis Of VAriance) is a technique for testing whether the means of three or more groups/populations differ. Like the tests we've seen so far, it produces a p-value.
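
To make the idea concrete, here is a minimal sketch of the F statistic that one-way ANOVA is built on: the variance between the group means divided by the variance within the groups. The three small groups are hypothetical toy data, chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np

# Hypothetical toy data: three groups with three observations each
groups = [np.array([1.0, 2.0, 3.0]),
          np.array([2.0, 3.0, 4.0]),
          np.array([5.0, 6.0, 7.0])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1               # number of groups minus 1
df_within = len(all_values) - len(groups)  # observations minus number of groups

f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f_stat)  # about 13.0 for this toy data (up to floating-point rounding)
```

A large F means the group means are spread out much more than the noise inside each group would explain; the p-value is then read from the F distribution with `df_between` and `df_within` degrees of freedom.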

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

### Some applications of ANOVA

#### A/B/C/... Testing

Example: We have 4 different designs for the 'Search' button in our e-commerce app. To test which one is better for our business, we divide our customers into 4 equally sized groups, one per design. Given the monthly percent increase in sales for each group over the past 5 months, which button design is better?

In [2]:
data = pd.read_excel('anova_class_example_data.xlsx', sheet_name='data_collected')
data

Unnamed: 0,Display_design,Percent_increase_in_sales
0,1,575
1,2,565
2,3,600
3,4,725
4,1,542
5,2,593
6,3,651
7,4,700
8,1,530
9,2,590


In [3]:
data.describe()

Unnamed: 0,Display_design,Percent_increase_in_sales
count,20.0,20.0
mean,2.5,617.75
std,1.147079,61.648302
min,1.0,530.0
25%,1.75,573.75
50%,2.5,605.0
75%,3.25,659.5
max,4.0,725.0


In [4]:
data.groupby('Display_design').mean()  # mean per design; passing np.mean to agg is deprecated in recent pandas

Unnamed: 0_level_0,Percent_increase_in_sales
Display_design,Unnamed: 1_level_1
1,551.2
2,587.4
3,625.4
4,707.0


Our first intuition is that design 4 is the winner. But is this conclusion statistically significant, or could it have happened by chance?

Testing with ANOVA: <br>
H0 = all group means are equal (the different designs did not cause any statistically significant change in sales); <br>
Ha = at least one design has a mean different from the others.


In [5]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

model = ols('Percent_increase_in_sales ~ C(Display_design)',data=data).fit()
sm.stats.anova_lm(model)

Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
C(Display_design),3.0,66870.55,22290.183333,66.797073,2.882866e-09
Residual,16.0,5339.2,333.7,,


At a significance level of 0.05, the p-value (2.88e-09) is far below the threshold, so we reject the null hypothesis. <br>
We can conclude that at least one of the designs has a mean different from the others (ANOVA alone can't tell us which one). <br>
In this case the difference is large, so design 4 is the likely best choice for our 'Search' button. To confirm, we can run a t-test on each pair of designs, correcting for multiple comparisons, or use a post-hoc test such as Tukey's HSD.
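
As a sketch of how the pairwise follow-up could look, the block below runs Tukey's HSD with `pairwise_tukeyhsd` from statsmodels, which compares every pair of groups while controlling the overall error rate. Note the assumptions: the data here is randomly generated toy data (not the class dataset), and the group means and spread used to generate it are made up for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)

# Hypothetical data: 4 designs, 5 monthly observations each (NOT the class dataset)
assumed_means = {1: 550, 2: 590, 3: 625, 4: 705}
toy = pd.DataFrame({
    'Display_design': np.repeat(list(assumed_means), 5),
    'Percent_increase_in_sales': np.concatenate(
        [rng.normal(loc=m, scale=20, size=5) for m in assumed_means.values()]
    ),
})

tukey = pairwise_tukeyhsd(endog=toy['Percent_increase_in_sales'],
                          groups=toy['Display_design'],
                          alpha=0.05)
print(tukey.summary())  # one row per pair of designs
print(tukey.reject)     # True where the means of that pair differ significantly
```

With 4 designs there are 6 pairs to compare, which is exactly why a correction is needed: running 6 uncorrected t-tests at 0.05 each would inflate the chance of at least one false alarm.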

ANOVA is not magic. Like the other statistical tests, it is just a tool that adds support to your conclusions by indicating whether you have enough evidence to say your results are unlikely to be due to chance (statistical significance).

#### Feature Elimination

P-values can also be used for feature elimination in linear models. Note that you won't use them to find important features, but to eliminate unimportant ones.

For easier understanding, we can rephrase the usual null hypothesis (mean(A) is similar to mean(B)) as "A has no effect on B" (A is independent of B).

So, what we are looking for are those **features** with no relationship to the **target**.

Read about it [here](https://statisticsbyjim.com/regression/no-p-values-nonlinear-regression/).

In [6]:
import statsmodels.api as sm

numerical = pd.read_csv('7.03/numerical.csv')
targets = pd.read_csv('7.03/target.csv')

X = numerical[numerical.columns[:30]] # keep only the first 30 columns so the fit doesn't take all day
y = targets['TARGET_D']

X = sm.add_constant(X) # add a constant column so the model fits an intercept
model = sm.OLS(y, X).fit()
model.summary()

0,1,2,3
Dep. Variable:,TARGET_D,R-squared:,0.001
Model:,OLS,Adj. R-squared:,0.001
Method:,Least Squares,F-statistic:,4.482
Date:,"Thu, 13 May 2021",Prob (F-statistic):,3.62e-15
Time:,18:08:27,Log-Likelihood:,-277320.0
No. Observations:,95412,AIC:,554700.0
Df Residuals:,95381,BIC:,555000.0
Df Model:,30,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,0.2220,0.186,1.192,0.233,-0.143,0.587
TCODE,1.4e-06,1.5e-05,0.093,0.926,-2.81e-05,3.09e-05
AGE,0.0017,0.001,1.605,0.109,-0.000,0.004
INCOME,0.0657,0.009,7.016,0.000,0.047,0.084
WEALTH1,0.0086,0.006,1.430,0.153,-0.003,0.020
HIT,0.0036,0.002,2.239,0.025,0.000,0.007
MALEMILI,0.0011,0.003,0.323,0.747,-0.006,0.008
MALEVET,0.0012,0.002,0.737,0.461,-0.002,0.004
VIETVETS,-0.0014,0.001,-1.225,0.220,-0.004,0.001

0,1,2,3
Omnibus:,163798.487,Durbin-Watson:,2.006
Prob(Omnibus):,0.0,Jarque-Bera (JB):,323925474.758
Skew:,11.724,Prob(JB):,0.0
Kurtosis:,287.483,Cond. No.,93900.0


In [7]:
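cols_to_keep = []

# iterate over the p-values by label; integer indexing into model.pvalues is deprecated in recent pandas
for feature, p_value in model.pvalues.items():
    # skip the intercept here: it is not a feature and is added back before refitting
    if feature != 'const' and p_value < 0.05:
        cols_to_keep.append(feature)
        print(feature, p_value)

cols_to_keep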

INCOME 2.297393629964502e-12
HIT 0.025182507763585226
LOCALGOV 0.046909530274591014
POP901 0.0426118100332588


['INCOME', 'HIT', 'LOCALGOV', 'POP901']

In [8]:
# refit with the selected columns only, because p-values change when the set of columns changes
# X already contains the 'const' column added earlier, so we only need to select it
cols_to_keep.append('const')
model = sm.OLS(y, X[cols_to_keep]).fit()
model.summary()

0,1,2,3
Dep. Variable:,TARGET_D,R-squared:,0.001
Model:,OLS,Adj. R-squared:,0.001
Method:,Least Squares,F-statistic:,21.78
Date:,"Thu, 13 May 2021",Prob (F-statistic):,5.52e-18
Time:,18:08:27,Log-Likelihood:,-277340.0
No. Observations:,95412,AIC:,554700.0
Df Residuals:,95407,BIC:,554700.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
INCOME,0.0727,0.008,8.605,0.000,0.056,0.089
HIT,0.0038,0.002,2.446,0.014,0.001,0.007
LOCALGOV,-0.0089,0.003,-2.711,0.007,-0.015,-0.002
POP901,-2.585e-06,2.5e-06,-1.034,0.301,-7.49e-06,2.32e-06
const,0.5490,0.045,12.096,0.000,0.460,0.638

0,1,2,3
Omnibus:,163836.658,Durbin-Watson:,2.006
Prob(Omnibus):,0.0,Jarque-Bera (JB):,324238737.717
Skew:,11.73,Prob(JB):,0.0
Kurtosis:,287.621,Cond. No.,21100.0


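Note that even though we reduced the model from 30 to 4 features, R² stayed the same (0.001). Also note that POP901 is no longer significant after the refit (p = 0.301), which shows why p-values must be re-checked whenever the set of columns changes. With an R² this low, neither model explains much of the variance in TARGET_D.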

### EXTRA: How ANOVA is calculated
A walkthrough can be found [here](https://towardsdatascience.com/statistical-tests-t-test-andanova-674b242a5274). <br>
An F-table can be found [here](https://web.ma.utexas.edu/users/davis/375/popecol/tables/f005.html).
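
As a quick check of how the columns of the ANOVA table in the A/B/C testing example fit together, the F value can be recomputed from the sums of squares and degrees of freedom reported there. The critical value in the sketch is an assumption: it is the approximate 0.05 entry for (3, 16) degrees of freedom as read from a standard F-table, so look it up yourself rather than relying on the constant.

```python
# Values copied from the ANOVA table in the A/B/C testing example
ss_design, df_design = 66870.55, 3
ss_residual, df_residual = 5339.2, 16

# mean_sq = sum_sq / df, for each row of the table
mean_sq_design = ss_design / df_design
mean_sq_residual = ss_residual / df_residual

# F is the ratio of the two mean squares
f_stat = mean_sq_design / mean_sq_residual

# Assumption: approximate critical value for alpha = 0.05 with (3, 16)
# degrees of freedom, read from a standard F-table
f_critical = 3.24

print(round(f_stat, 3))     # matches the F column of the table
print(f_stat > f_critical)  # True means we reject the null hypothesis
```

Comparing F against the table's critical value leads to the same decision as comparing the p-value against 0.05; the p-value simply tells you how far into the tail the observed F falls.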