# Statistical Tests

An important part of feature engineering and feature selection is analyzing the relationships between variables, so that only the important features go into our machine learning model. To analyze the relationship between two or more variables, we use statistical tests. Each statistical test produces two outputs: a test statistic, which measures the strength of the relationship (correlation coefficients lie between -1 and 1, while other statistics such as F have different ranges), and a p-value, which is the probability of observing a statistic at least as extreme as the one obtained if the null hypothesis is true, that is, if there is no relationship between the variables. In other words, a p-value of 0.2 means there is a 20 percent chance of seeing a result at least this extreme when no relationship exists. By convention, a p-value < 0.05 is considered statistically significant and we reject the null hypothesis. With a p-value above that threshold we fail to reject it, because the result could plausibly have arisen by chance.
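
To make the 0.05 threshold concrete, the following sketch simulates many pairs of unrelated variables and counts how often a correlation test still reports p < 0.05. The function name `false_positive_rate`, the sample sizes, and the seed are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from scipy import stats


def false_positive_rate(n_sims=2000, n=30, alpha=0.05, seed=0):
    """Fraction of simulated null datasets whose p-value falls below alpha."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        x = rng.normal(size=n)
        y = rng.normal(size=n)  # drawn independently of x, so the null hypothesis holds
        _, p = stats.pearsonr(x, y)
        if p < alpha:
            rejections += 1
    return rejections / n_sims


rate = false_positive_rate()
print(f"share of p-values below 0.05 under the null: {rate:.3f}")
```

The printed share should land close to 0.05, which is what the threshold means: even with no relationship at all, roughly 1 test in 20 looks significant.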

## Parametric Tests

Parametric tests rely on assumptions that must be met: the observations must be independent, the variances of the groups must be similar, and the data in each group must be approximately normally distributed. Most data will not satisfy these assumptions, which is why we often apply transformations so that the tests become valid. Otherwise, a non-parametric test is the better choice.
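
As a minimal sketch of such a transformation, assume a right-skewed feature that is roughly log-normal (the synthetic data below stands in for a real column). Taking the logarithm pulls in the long right tail, which can be seen from the sample skewness before and after.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic right-skewed data (assumption: stands in for a real feature)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# The log transform requires strictly positive values
transformed = np.log(skewed)

skew_before = stats.skew(skewed)
skew_after = stats.skew(transformed)
print("skewness before:", skew_before)
print("skewness after: ", skew_after)
```

A skewness near 0 after the transform suggests a more symmetric, normal-like shape. The log is only one option; which transformation is suitable depends on the data.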

To test for normality, we can check graphically whether the distribution of each group follows a normal distribution, using histograms and Q-Q plots. For a more formal check, we use a hypothesis test. One such test is the Shapiro-Wilk test. Its null hypothesis is that the sample comes from a normal distribution, and its alternative hypothesis is that it does not.

In [24]:
import numpy as np
from numpy.random import randn
import scipy.stats as stats

data = randn(5000)  # 5000 draws from a standard normal distribution
w, p = stats.shapiro(data)  # returns the W statistic and the p-value
p

0.3728526532649994
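
The graphical Q-Q check can also be summarized numerically. The sketch below uses `scipy.stats.probplot`, which returns the Q-Q points together with a least-squares fit. The correlation `r` of that fit is close to 1 when the points lie on a straight line. The two synthetic samples are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=1000)
skewed_sample = rng.exponential(size=1000)  # clearly non-normal, for contrast

# probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r))
_, (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
_, (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

print("Q-Q correlation, normal sample:", r_normal)
print("Q-Q correlation, skewed sample:", r_skewed)
```

Passing `plot=plt` (with `matplotlib.pyplot` imported as `plt`) to `probplot` draws the Q-Q plot itself.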

To test for homogeneity of variances when the data are approximately normal, we use Bartlett's test. Its null hypothesis is that all groups have equal variances, and the alternative hypothesis is that at least one variance differs. The decision can be made by comparing the test statistic with a critical value, but the p-value carries the same information, so it is the only value we need to look at.

In [11]:
import pandas as pd

p1 = [89, 89, 88, 78, 79]
p2 = [93, 92, 94, 89, 88]
p3 = [89, 88, 89, 93, 90]
p4 = [81, 78, 81, 92, 82]

_, p = stats.bartlett(p1, p2, p3, p4)
p

0.15166301835959678
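
The critical-value rule and the p-value rule lead to the same decision. As a sketch: under the null hypothesis, Bartlett's statistic approximately follows a chi-square distribution with k - 1 degrees of freedom, where k is the number of groups. The significance level `alpha = 0.05` is the conventional choice assumed here.

```python
from scipy import stats

p1 = [89, 89, 88, 78, 79]
p2 = [93, 92, 94, 89, 88]
p3 = [89, 88, 89, 93, 90]
p4 = [81, 78, 81, 92, 82]

alpha = 0.05
k = 4  # number of groups

stat, p = stats.bartlett(p1, p2, p3, p4)

# Critical value of the chi-square distribution with k - 1 degrees of freedom
critical = stats.chi2.ppf(1 - alpha, df=k - 1)

# The p-value is the upper-tail probability of the statistic
p_from_stat = stats.chi2.sf(stat, df=k - 1)

print("statistic below critical value:", stat < critical)
print("p-value above alpha:           ", p > alpha)
```

Both checks agree here: the statistic is below the critical value and the p-value is above 0.05, so we do not reject the hypothesis of equal variances.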

If the data in each group are not normally distributed, we can use a more robust method called Levene's test. Its hypotheses are the same as for Bartlett's test.

In [12]:
_, p = stats.levene(p1, p2, p3, p4)
p

0.5846671108816857
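
One way to see why Levene's test is less sensitive to non-normality: it is a one-way ANOVA applied to the absolute deviations of each observation from its group center. The sketch below reproduces the statistic that way, using the median as the center, which is SciPy's default (`center="median"`).

```python
import numpy as np
from scipy import stats

groups = [
    np.array([89, 89, 88, 78, 79]),
    np.array([93, 92, 94, 89, 88]),
    np.array([89, 88, 89, 93, 90]),
    np.array([81, 78, 81, 92, 82]),
]

# Absolute deviation of each observation from its own group median
abs_dev = [np.abs(g - np.median(g)) for g in groups]

# One-way ANOVA on the deviations
f_manual, p_manual = stats.f_oneway(*abs_dev)

# Levene's test with the median as center
w, p = stats.levene(*groups, center="median")

print("ANOVA on deviations:", f_manual, p_manual)
print("stats.levene:       ", w, p)
```

Both lines should print the same statistic and p-value. Passing `center="mean"` instead gives the original form of the test, which behaves better for symmetric, light-tailed data.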

To measure the relationship between a nominal variable with more than two groups and a continuous variable, we most commonly use the one-way ANOVA test, which compares the mean of the continuous variable across the groups of the nominal variable. ANOVA uses the F-statistic, which is the ratio of the Mean Square Treatment to the Mean Square Error. In other words, F = variation between the sample means / variation within the samples. The higher the F-value, the stronger the evidence that at least one group mean differs from the others.

In [26]:
f, p = stats.f_oneway(p1, p2, p3, p4)
print(f)
print("p-value: ", p)

4.625000000000002
p-value:  0.016336459839780215
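
The F-value above can be reproduced by hand from the definition, which makes the roles of the two mean squares explicit. This is a sketch for the balanced four-group data used in this section. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

groups = [
    np.array([89, 89, 88, 78, 79], dtype=float),
    np.array([93, 92, 94, 89, 88], dtype=float),
    np.array([89, 88, 89, 93, 90], dtype=float),
    np.array([81, 78, 81, 92, 82], dtype=float),
]

k = len(groups)                        # number of groups
n_total = sum(len(g) for g in groups)  # total number of observations
grand_mean = np.concatenate(groups).mean()

# Variation between the sample means (treatment)
ss_treatment = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ms_treatment = ss_treatment / (k - 1)

# Variation within the samples (error)
ss_error = sum(((g - g.mean()) ** 2).sum() for g in groups)
ms_error = ss_error / (n_total - k)

f_manual = ms_treatment / ms_error
# Upper-tail probability of the F distribution with (k - 1, n_total - k) degrees of freedom
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

print(f_manual)
print("p-value: ", p_manual)
```

Here the Mean Square Treatment is 81.4 and the Mean Square Error is 17.6, giving F = 4.625, in agreement with `stats.f_oneway`.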


## Non-Parametric Tests