# Statistical hypothesis testing

URL https://github.com/FIIT-IAU/

**We want to verify whether the number of engine cylinders has an effect on consumption.**

In [None]:
import pandas as pd
import matplotlib
import seaborn as sns
import statsmodels.api as sm
import statsmodels.stats.api as sms
import scipy.stats as stats
from sklearn import preprocessing

In [None]:
cars = pd.read_csv('data/auto-mpg.data', 
                   sep=r'\s+',
                   names = ['mpg', 'cylinders', 'displacement','horsepower',
                            'weight', 'acceleration', 'model_year', 'origin', 'name'],
                   na_values='?')
cars.head()

In [None]:
cars.cylinders.unique()

In [None]:
cars.cylinders.value_counts()

In [None]:
sns.boxplot(x='cylinders', y='mpg', data=cars)

We see that the dataset contains several types of engines (based on the number of cylinders). The boxplot suggests a relationship between the number of cylinders and fuel consumption (measured in miles per gallon, `mpg`, so a higher value means lower consumption).

We have several ways to test the nature of this relationship:

* We can check if there is a correlation between these two attributes.
* We can try fitting a (e.g., linear) regression model.
* We can test the differences between the means of the groups based on the number of cylinders.
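
For reference, the first two options could be sketched roughly as follows. This is only an illustration on synthetic data (the numbers are made up and are not the auto-mpg values); the choice of Spearman's rank correlation and `scipy.stats.linregress` is an assumption made here, not something the notebook prescribes.

```python
import numpy as np
import scipy.stats as stats

# Synthetic stand-in for the cars data (NOT the real auto-mpg values):
# mpg decreases with the number of cylinders, plus random noise.
rng = np.random.default_rng(0)
cylinders = rng.choice([4, 6, 8], size=300)
mpg = 45 - 3.5 * cylinders + rng.normal(0, 3, size=300)

# Option 1: rank correlation (cylinders is discrete, so Spearman is a reasonable choice)
rho, p_rho = stats.spearmanr(cylinders, mpg)

# Option 2: simple linear regression of mpg on the number of cylinders
fit = stats.linregress(cylinders, mpg)

print(f'Spearman rho = {rho:.2f} (p = {p_rho:.3g})')
print(f'slope = {fit.slope:.2f} mpg per cylinder (p = {fit.pvalue:.3g})')
```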

We'll focus on the last option. Let’s test whether the **difference in fuel consumption between 6-cylinder and 8-cylinder engines is statistically significant (and thus not just due to chance or error).**

Let's define our hypotheses as follows:

**$H_0$ (null hypothesis)**: The fuel consumption of 6-cylinder engines is **the same** on average as the fuel consumption of 8-cylinder engines.

**$H_1 = H_A$ (alternative hypothesis)**: The fuel consumption of 6-cylinder engines is **different/greater/less** on average compared to 8-cylinder engines.
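
Writing $\mu_6$ and $\mu_8$ for the mean `mpg` of 6-cylinder and 8-cylinder engines (symbols introduced here only for convenience), the two-sided version of these hypotheses can be stated as:

$$
H_0: \mu_6 = \mu_8 \qquad \text{versus} \qquad H_A: \mu_6 \neq \mu_8
$$

The one-sided alternatives would be $H_A: \mu_6 > \mu_8$ or $H_A: \mu_6 < \mu_8$.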


In [None]:
sns.boxplot(x='cylinders', y='mpg', data=cars[(cars.cylinders == 6) | (cars.cylinders == 8)])

- There is some difference, which we can already see from a visual comparison. To verify whether it is statistically significant, we need a statistical test.
- We have two groups, and the observations in them are independent (an engine has either 6 or 8 cylinders, never both). Therefore, the *t-test* or the *Mann-Whitney U test* come into consideration. We choose the t-test if its assumptions are met: the data come from normal distributions and have the same (or similar) variances.

## Verification of assumptions

### Assumption of normality of distribution

We can check the normality of the distribution visually using a histogram, or using the so-called QQ-plot.

In [None]:
mpg6 = cars.loc[cars.cylinders == 6, 'mpg']

In [None]:
mpg6.describe()

In [None]:
# sns.distplot(mpg6)
sns.histplot(mpg6)

The sample contains outliers. The simplest method for identifying outliers is to label any observation as an outlier if it differs by more than 1.5 times the interquartile range from either the upper or lower quartile.

In [None]:
def identify_outliers(a):
    lower = a.quantile(0.25) - 1.5 * stats.iqr(a)
    upper = a.quantile(0.75) + 1.5 * stats.iqr(a)
    
    return a[(a > upper) | (a < lower)]

In [None]:
mpg6_out = identify_outliers(mpg6)
mpg6_out

In [None]:
mpg6 = mpg6.drop(mpg6_out.index)

In [None]:
# sns.distplot(mpg6)
sns.histplot(mpg6)

In [None]:
mpg8 = cars.loc[cars.cylinders == 8, 'mpg']

In [None]:
mpg8.describe()

In [None]:
mpg8_out = identify_outliers(mpg8)
mpg8_out

In [None]:
mpg8 = mpg8.drop(mpg8_out.index)

In [None]:
# sns.distplot(mpg8)
sns.histplot(mpg8)

In [None]:
_ = sm.ProbPlot(mpg6, fit=True).qqplot(line='45')

In [None]:
_ = sm.ProbPlot(mpg8, fit=True).qqplot(line='45')

A QQ-plot is a visual method for determining whether two data sets come from the same distribution. Most often, the distribution of a sample is compared with the theoretical normal distribution. Each point on the plot pairs a quantile of the first data set with the corresponding quantile of the second; if both follow the same distribution, the points lie close to a straight line.

#### What questions can QQ-plot answer?

* Do the two groups of observations come from the same distribution?
* Does the observed sample come from the tested theoretical distribution (e.g. normal)?
* Do the distributions have similar skewness and kurtosis properties?
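
To make the idea concrete, the points of a normal QQ-plot can be computed without drawing anything. The sketch below uses `scipy.stats.probplot` on synthetic samples (stand-ins, not the `mpg` data); the helper name `qq_points` is hypothetical.

```python
import numpy as np
import scipy.stats as stats

# Synthetic samples (assumption: stand-ins, not the mpg data)
rng = np.random.default_rng(1)
normal_sample = rng.normal(loc=20, scale=3, size=200)
skewed_sample = rng.exponential(scale=3, size=200)

def qq_points(sample):
    """Return theoretical normal quantiles, the ordered sample values and
    the correlation r of these points with their least-squares line."""
    (theoretical, ordered), (slope, intercept, r) = stats.probplot(sample, dist='norm')
    return theoretical, ordered, r

_, _, r_normal = qq_points(normal_sample)
_, _, r_skewed = qq_points(skewed_sample)

# r close to 1 means the points lie close to a straight line
print(f'normal sample: r = {r_normal:.3f}')
print(f'skewed sample: r = {r_skewed:.3f}')
```

The closer `r` is to 1, the better the sample quantiles line up with the normal quantiles; a strongly skewed sample bends away from the line and gives a lower value.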

#### Shapiro-Wilk normality test

To verify normality, we can also use the **Shapiro-Wilk test**, which tests the null hypothesis that the data come from a normal distribution. If $p < 0.05$, we reject the null hypothesis $H_0$ and conclude that the data probably come from a non-normal distribution. If $p > 0.05$, we do not reject the null hypothesis $H_0$; that is, based on the data, we cannot claim that they come from a non-normal distribution.
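
As a minimal sketch of this decision rule, the snippet below applies `scipy.stats.shapiro` to two synthetic samples (made-up data, not the `mpg` values) and records whether normality is rejected at an assumed significance level of 0.05.

```python
import numpy as np
import scipy.stats as stats

# Synthetic samples (assumption: stand-ins, not the mpg data)
rng = np.random.default_rng(2)
samples = {
    'normal': rng.normal(loc=20, scale=3, size=100),
    'skewed': rng.exponential(scale=3, size=100),
}

alpha = 0.05  # significance level assumed for this illustration
decisions = {}
for name, sample in samples.items():
    statistic, p_value = stats.shapiro(sample)
    # True means: reject H0, the sample probably is not normal
    decisions[name] = bool(p_value < alpha)
    print(f'{name}: W = {statistic:.3f}, p = {p_value:.3g}, reject normality: {decisions[name]}')
```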

In [None]:
stats.shapiro(mpg6)

In [None]:
stats.shapiro(mpg8)

Based on the test results, the sample of cars with 6-cylinder engines appears to come from a normal distribution, while the sample with 8-cylinder engines does not. So we should use the non-parametric alternative to the t-test, i.e. the **Mann-Whitney U-test** (although the t-test is relatively robust to slight deviations from normality once the samples are large enough).

### Similarity of variance

The second prerequisite for the use of the t-test is the equality of variances (although there is a variant of the t-test that can also work with data with unequal variances). Although the assumption of normality was not confirmed for the 8-cylinder sample, let's look at the variances anyway.

**Levene's test** is used to test the similarity of variances. It tests the null hypothesis $H_0$ that all input samples come from distributions with equal variances. If we do not reject $H_0$ ($p > 0.05$), it means that we cannot claim based on the data that the samples come from distributions with different variances.
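
A small sketch on synthetic data (assumed values, unrelated to the `mpg` samples) shows how the result of `scipy.stats.levene` behaves when the spreads are similar and when they clearly differ. Note that a difference in means alone does not affect the test.

```python
import numpy as np
import scipy.stats as stats

# Synthetic samples (assumption: stand-ins, not the mpg data)
rng = np.random.default_rng(3)
a = rng.normal(loc=20, scale=2, size=200)
b = rng.normal(loc=15, scale=2, size=200)  # same spread as a, different mean
c = rng.normal(loc=15, scale=6, size=200)  # three times the spread of a

stat_ab, p_ab = stats.levene(a, b)
stat_ac, p_ac = stats.levene(a, c)

print(f'a vs b (equal spread):     p = {p_ab:.3g}')
print(f'a vs c (different spread): p = {p_ac:.3g}')
```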

In [None]:
stats.levene(mpg6, mpg8)

Based on the test result, it appears that the samples come from distributions with equal variance.

## Student's t-test vs. Mann-Whitney U-test

Since the assumptions of the t-test were not met (here, the sample of 8-cylinder engines probably does not come from a normal distribution), we should use its non-parametric alternative. Had they been met, we would use the `scipy.stats.ttest_ind` function.

In [None]:
stats.mannwhitneyu(mpg6, mpg8)

Since $p < 0.001$, a difference at least as large as the one observed would occur with probability less than 1 in 1000 if $H_0$ were true. Therefore, we reject our null hypothesis $H_0$ in favor of the alternative hypothesis $H_A$. The difference in consumption between 6-cylinder and 8-cylinder engines is statistically significant.
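
For comparison, the choice between the tests discussed in this section could be sketched as below. The helper `compare_groups` and its boolean arguments are hypothetical and only illustrate the logic; the data are synthetic. With `equal_var=False`, `scipy.stats.ttest_ind` performs Welch's t-test, the variant that tolerates unequal variances.

```python
import numpy as np
import scipy.stats as stats

# Synthetic samples (assumption: stand-ins, not the mpg data)
rng = np.random.default_rng(4)
group_a = rng.normal(loc=20, scale=3, size=80)
group_b = rng.normal(loc=15, scale=3, size=100)

def compare_groups(x, y, normal, equal_var):
    """Pick a two-sample test from the outcome of the assumption checks.
    `normal` and `equal_var` are booleans supplied by the caller."""
    if not normal:
        return 'Mann-Whitney U', stats.mannwhitneyu(x, y, alternative='two-sided')
    if equal_var:
        return "Student's t", stats.ttest_ind(x, y, equal_var=True)
    return "Welch's t", stats.ttest_ind(x, y, equal_var=False)

for normal, equal_var in [(True, True), (True, False), (False, True)]:
    name, result = compare_groups(group_a, group_b, normal, equal_var)
    print(f'{name}: statistic = {result.statistic:.2f}, p = {result.pvalue:.3g}')
```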

We can visualize the difference between the two means. It is often displayed using bar charts along with *confidence intervals*. An N% confidence interval (commonly 95%) is constructed so that, over repeated sampling, N% of such intervals would contain the true value of the mean.

In [None]:
sms.DescrStatsW(mpg6).tconfint_mean()

In [None]:
sms.DescrStatsW(mpg8).tconfint_mean()
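
What such an interval amounts to can be sketched by hand: the sample mean plus or minus a critical value of the t-distribution times the standard error of the mean. The function name `mean_confidence_interval` is hypothetical and the data are synthetic; for the same data, the result should agree with `tconfint_mean()` at its default `alpha=0.05`.

```python
import numpy as np
import scipy.stats as stats

# Synthetic sample (assumption: a stand-in, not the mpg data)
rng = np.random.default_rng(5)
sample = rng.normal(loc=20, scale=3, size=80)

def mean_confidence_interval(x, confidence=0.95):
    """Two-sided t-based confidence interval for the mean of x."""
    x = np.asarray(x, dtype=float)
    n = x.size
    mean = x.mean()
    sem = x.std(ddof=1) / np.sqrt(n)                       # standard error of the mean
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)   # two-sided critical value
    return mean - t_crit * sem, mean + t_crit * sem

low, high = mean_confidence_interval(sample)
print(f'95% CI for the mean: ({low:.2f}, {high:.2f})')
```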

In [None]:
sns.barplot(x='cylinders', y='mpg', data=cars[(cars.cylinders == 8) | (cars.cylinders == 6)], capsize=0.1, err_kws={'linewidth': 2})