### Data Science Principles and Practices (COMM054) Lab Week 6 - Extra materials

Follow the instructions to complete each of these tasks. This set of exercises focusses on writing basic Python code to carry out simple hypothesis testing.

This is not assessed but will help you gain practical experience for the module exam and coursework.

First, import the packages we will be using in this lab:
```python
import numpy as np
import scipy.stats as ss
import matplotlib.pyplot as plt
```

In [1]:
import numpy as np
import scipy.stats as ss
import matplotlib.pyplot as plt

## Performing hypothesis tests using statsmodels

We now discuss a software package that performs hypothesis tests on our data directly (without our having to calculate sample means and standard deviations, standardise our variables, etc.).

### Install statsmodels
You first have to install statsmodels, if it is not already installed. The instructions are available at https://www.statsmodels.org/stable/install.html

The easiest way might be to use

`!pip install statsmodels`

[It may take a while.]

Please install statsmodels below.

In [2]:
!pip install statsmodels



For the Z-test, we can call the function 

`statsmodels.stats.weightstats.ztest(x1, x2=None, value=0, alternative='two-sided', usevar='pooled', ddof=1.0)`

from the **statsmodels** library. 


This tests a hypothesis about the mean, based on the normal distribution, with *one* or *two* samples. (In the two-sample case, the samples are assumed to be independent.)

The most important parameters are 

* x1: array_like, the first of the two independent samples (or the only sample in the one-sample case)

* x2: array_like, the second of the two independent samples (leave as `None` for a one-sample test)

* value: 
 + In the one-sample case, value is the mean of x1 under the null hypothesis. 
 + In the two-sample case, value is the difference between the mean of x1 and the mean of x2 under the null hypothesis. The **test statistic** is x1_mean - x2_mean - value, divided by the estimated standard error of that difference.

* alternative:  The alternative hypothesis, H1, has to be one of the following

 + `'two-sided'`: H1: difference in means *not equal* to value (default) 
 + `'larger'`: H1: difference in means *larger* than value 
 + `'smaller'`: H1: difference in means *smaller* than value

The function returns a tuple containing
* tstat, which is the test statistic
* pvalue, which is the p-value of the test

More information can be found at: 

https://www.statsmodels.org/devel/generated/statsmodels.stats.weightstats.ztest.html#statsmodels.stats.weightstats.ztest
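
To make the parameters above concrete, here is a minimal sketch of the one-sample calculation done by hand with `numpy` and `scipy` only. The seed, the sample and the names `mu0`, `se` and `z` are illustrative assumptions, not part of the lab. Assuming the default `ddof=1`, these values should agree with `ws.ztest(x, value=mu0, alternative=...)` up to rounding.

```python
import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(0)        # arbitrary seed, for reproducibility only
x = rng.normal(loc=2, size=50)        # illustrative sample
mu0 = 2                               # mean under the null hypothesis

n = x.size
se = x.std(ddof=1) / np.sqrt(n)       # estimated standard error of the mean
z = (x.mean() - mu0) / se             # standardised test statistic

p_two_sided = 2 * ss.norm.sf(abs(z))  # H1: mean != mu0
p_larger = ss.norm.sf(z)              # H1: mean > mu0 (upper tail)
p_smaller = ss.norm.cdf(z)            # H1: mean < mu0 (lower tail)

print(z, p_two_sided, p_larger, p_smaller)
```

The three p-values differ only in which tail area of the standard normal distribution is used, which is what the `alternative` argument selects.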

### The one-sample case

To test, we generate a random sample from the normal distribution 

`v1 = np.random.normal(loc=2, size=50)`

(Recall that this uses `numpy`'s built-in random generators; we could use the equivalent `ss.norm.rvs(loc=2, size=50)` instead.)
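
Because the sample is random, the test statistic and p-value change on every run. If you want repeatable results, one option is a seeded generator. The sketch below assumes NumPy's `default_rng` interface and an arbitrary seed of 42; neither is required by the lab.

```python
import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(42)                        # 42 is an arbitrary seed
v_np = rng.normal(loc=2, size=50)                      # drawn with NumPy
v_ss = ss.norm.rvs(loc=2, size=50, random_state=rng)   # drawn with SciPy from the same generator

# Re-creating the generator with the same seed reproduces the first sample exactly
v_again = np.random.default_rng(42).normal(loc=2, size=50)
print(np.array_equal(v_np, v_again))
```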

Then you could formulate the hypotheses

* $H_0$: mean = 2
* $H_1$: mean > 2

To test, you can use

```python
import statsmodels.stats.weightstats as ws
print(ws.ztest(v1, value=2, alternative='larger'))
```

Try it below. 

In [3]:
import statsmodels.stats.weightstats as ws

v1 = np.random.normal(loc=2, size=50)

print(ws.ztest(v1, value=2, alternative='larger'))

(1.5397184047111845, 0.061814504294254904)
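
To interpret this output, compare the p-value with a significance level chosen in advance. The sketch below recomputes the p-value from the printed test statistic as the upper-tail area of the standard normal distribution, and assumes the conventional level `alpha = 0.05` (an assumption, not something the lab fixes).

```python
from scipy.stats import norm

z = 1.5397184047111845       # test statistic printed above
alpha = 0.05                 # assumed significance level

p_value = norm.sf(z)         # upper-tail area, matching alternative='larger'
reject_h0 = p_value < alpha  # reject H0 only if the p-value falls below alpha

print(p_value, reject_h0)
```

Here the p-value is about 0.062, which is above 0.05, so at that level we do not reject $H_0$. Your own run will give different numbers because the sample is random.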


### The two-sample case

To test, we generate two random samples from the normal distribution 

```python
v1 = np.random.normal(loc=5, size=1000)
v2 = np.random.normal(loc=2, size=1000)
```

Then you could formulate the hypotheses

* $H_0$: mean1 - mean2 = 3
* $H_1$: mean1 - mean2 $\neq$ 3

To test, we could use

```python
import statsmodels.stats.weightstats as ws
print(ws.ztest(v1, v2, value=3, alternative='two-sided'))
```

Try it below. 

In [4]:
v1 = np.random.normal(loc=5, scale=1, size=10)
v2 = np.random.normal(loc=2, scale=1, size=10)

print(ws.ztest(v1, v2, value=3, alternative='two-sided'))

(-3.078402000448739, 0.0020811396657551638)
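
For reference, the two-sample statistic can also be sketched by hand. The code below assumes the default `usevar='pooled'` and `ddof=1` settings and uses illustrative, seeded samples of size 1000; the names `diff0`, `pooled_var` and `se` are made up for this example. Note that with only 10 observations per sample, as in the cell above, the normal approximation is rough, so a small p-value can occur even when $H_0$ is true.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)               # arbitrary seed
a = rng.normal(loc=5, scale=1, size=1000)    # first sample
b = rng.normal(loc=2, scale=1, size=1000)    # second sample
diff0 = 3                                    # difference in means under H0

n1, n2 = a.size, b.size
# Pooled variance estimate, assuming both populations share one variance
pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))  # standard error of the difference
z = (a.mean() - b.mean() - diff0) / se        # standardised test statistic
p = 2 * norm.sf(abs(z))                       # two-sided p-value

print(z, p)
```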



## T-Test

When the sample size is relatively small and the population standard deviation has to be estimated from the data, we can use t-tests instead of Z-tests.




### One-sample t-test: testing the value of a population mean

`scipy.stats.ttest_1samp()` tests whether the population mean of the data is likely to be **equal to** a given value (technically, whether the observations are drawn from a Gaussian distribution with the given population mean). By default it is a **two-sided** test. It returns the t-statistic and the p-value.


To test, we generate a random sample 

```python
v1 = np.random.normal(loc=2, size=10)
```

and use 

```python
ss.ttest_1samp(v1, 2)  
```

to carry out the test.

In [5]:
v1 = np.random.normal(loc=2, size=10)
ss.ttest_1samp(v1, 2)

TtestResult(statistic=2.09186256795348, pvalue=0.06598369992732861, df=9)
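
As a check on what the function computes, the sketch below reproduces the one-sample t-statistic and two-sided p-value by hand and compares them with `ss.ttest_1samp()`. The seed and the names `mu0`, `t_stat` and `p_manual` are illustrative. The one-sided call at the end assumes SciPy 1.6 or later, where the `alternative` argument is available.

```python
import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(2)       # arbitrary seed
x = rng.normal(loc=2, size=10)       # small illustrative sample
mu0 = 2                              # mean under the null hypothesis

n = x.size
t_stat = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
p_manual = 2 * ss.t.sf(abs(t_stat), df=n - 1)   # two-sided, n - 1 degrees of freedom

res = ss.ttest_1samp(x, mu0)
print(t_stat, p_manual)
print(res.statistic, res.pvalue)

# One-sided version, H1: mean > mu0 (assumes SciPy >= 1.6)
res_greater = ss.ttest_1samp(x, mu0, alternative='greater')
print(res_greater.pvalue)
```

The only difference from the Z-test calculation is the reference distribution: a t distribution with `n - 1` degrees of freedom replaces the standard normal.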

### Two-sample t-test: testing for difference across populations

The function `scipy.stats.ttest_ind()` takes **two** independent samples (they do not need to be the same size) and returns a result object containing the t-statistic and the p-value.

By default this is a **two-sided** test for the null hypothesis that two independent samples have **identical average (expected) values**. This test assumes that the populations have identical variances by default (`equal_var=True`).

**Example** Test whether the samples v1 and v2 come from populations with the same mean:
```python

v1 = np.random.normal(size=100)
v2 = np.random.normal(size=100)

res = ss.ttest_ind(v1, v2)

print(res)
```

If you want to return only the p-value, use the `pvalue` attribute:
`res = ss.ttest_ind(v1, v2).pvalue`


In [6]:
v1 = np.random.normal(size=100)
v2 = np.random.normal(size=100)

res = ss.ttest_ind(v1, v2)

print(res)

TtestResult(statistic=-0.6695626952856581, pvalue=0.5039166380308115, df=198.0)
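
The sketch below illustrates two points about `ss.ttest_ind()`: the samples may have different sizes, and the equal-variance assumption can be dropped with `equal_var=False` (Welch's t-test). The seed, sample sizes and scales are arbitrary choices for illustration. The pooled statistic is also recomputed by hand for comparison.

```python
import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(3)                  # arbitrary seed
a = rng.normal(loc=0, scale=1, size=40)         # smaller sample, smaller spread
b = rng.normal(loc=0, scale=3, size=100)        # larger sample, larger spread

pooled = ss.ttest_ind(a, b)                     # default: assumes equal variances
welch = ss.ttest_ind(a, b, equal_var=False)     # Welch's test: variances may differ

# Pooled t-statistic by hand
n1, n2 = a.size, b.size
sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
t_manual = (a.mean() - b.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

print(pooled.statistic, pooled.pvalue)
print(welch.statistic, welch.pvalue)
print(t_manual)
```

When the sample sizes and variances both differ noticeably, as here, Welch's version is usually the safer choice.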


## KS-Test

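The Kolmogorov-Smirnov (KS) test is used to check whether given values follow a specified distribution.

The function `scipy.stats.kstest()` takes the values to be tested and the CDF (or the name of a distribution in `scipy.stats`, such as `'norm'`) as its two main parameters.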

```python
v = np.random.normal(size=100)
res = ss.kstest(v, 'norm')
print(res)
```

Try this below.

In [7]:
v = np.random.normal(size=100)
res = ss.kstest(v, 'norm')
print(res)

KstestResult(statistic=0.08373748966014927, pvalue=0.45997164521544665, statistic_location=-0.34848632863068013, statistic_sign=-1)
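
Note that `'norm'` on its own means the *standard* normal distribution (mean 0, standard deviation 1). To test against a normal distribution with other parameters, they can be passed through `args`. The sketch below uses an illustrative sample drawn with `loc=5` and `scale=2` (arbitrary values chosen for this example) and tests it against both.

```python
import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(4)                   # arbitrary seed
v = rng.normal(loc=5, scale=2, size=100)         # normal, but not standard normal

res_std = ss.kstest(v, 'norm')                   # compared with N(0, 1)
res_match = ss.kstest(v, 'norm', args=(5, 2))    # compared with loc=5, scale=2

print(res_std.statistic, res_std.pvalue)
print(res_match.statistic, res_match.pvalue)
```

The first test rejects decisively even though the data are normal, because the location and scale do not match the reference distribution.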


### Normality Tests (Skewness and Kurtosis)

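The normality test used here is based on the sample skewness and kurtosis, which it combines into a single test statistic (D'Agostino and Pearson's test).

The `scipy.stats.normaltest()` function returns the test statistic and the p-value for the null hypothesis: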

"x comes from a normal distribution".


```python
v = np.random.normal(size=100)
print(ss.normaltest(v))
```

#### Skewness:
A measure of the asymmetry of the data.

* For normal distributions it is 0.

* If it is negative, it means the data is skewed left.

* If it is positive it means the data is skewed right.

#### Kurtosis:
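A measure of whether the data is heavy-tailed or light-tailed relative to a normal distribution. By default, `ss.kurtosis()` reports *excess* kurtosis, so a normal distribution scores 0.

* Positive kurtosis means heavy-tailed.

* Negative kurtosis means light-tailed.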

```python
v = np.random.normal(size=100)

print(ss.skew(v))
print(ss.kurtosis(v))
```

Try this below.

In [8]:
v = np.random.normal(size=100)

print(ss.skew(v))
print(ss.kurtosis(v))

-0.20918597727111968
0.7532343334483582
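
For a normal sample both values are close to 0, so it can help to see them on clearly non-normal data. The sketch below uses two illustrative distributions chosen for this example: the exponential (right-skewed with a heavy right tail) and the uniform (symmetric with light tails).

```python
import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(5)        # arbitrary seed
expo = rng.exponential(size=1000)     # right-skewed, heavy right tail
unif = rng.uniform(size=1000)         # symmetric, light tails

print(ss.skew(expo), ss.kurtosis(expo))   # both expected to be clearly positive
print(ss.skew(unif), ss.kurtosis(unif))   # skewness near 0, kurtosis negative
```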


In [9]:
v = np.random.normal(size=100)
print(ss.normaltest(v))

NormaltestResult(statistic=0.47963007206420355, pvalue=0.7867733723337516)
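
To see the test reject, it can be run on data that are not normal. The sketch below compares a normal sample with an exponential one and applies an assumed significance level of 0.05; the seed, sample size and choice of distribution are illustrative.

```python
import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(6)             # arbitrary seed
normal_data = rng.normal(size=500)
skewed_data = rng.exponential(size=500)    # strongly right-skewed
alpha = 0.05                               # assumed significance level

for name, data in [("normal", normal_data), ("exponential", skewed_data)]:
    stat, p = ss.normaltest(data)
    verdict = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{name}: statistic={stat:.3f}, p-value={p:.4g} -> {verdict}")
```

A large p-value, as in the cell above, does not prove that the data are normal; it only means the test found no evidence against normality.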
