# 1. z-tests (when we know something about the distribution of the data)
## Overview of the theory
As you know from the Lecture Notes, we use a z-test if either the data is known to be normally distributed or the sample is large enough that, by the central limit theorem, the sample mean can be assumed to be approximately normally distributed. Note that a z-test also requires the variance of the distribution to be known.
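
The central limit theorem statement above can be checked numerically. The following is a minimal sketch and not part of the Lecture Notes: it draws many samples from a clearly non-normal (exponential) distribution and looks at the spread of their means. The distribution, the seed, and the sample sizes are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

n = 50            # size of each sample
repeats = 20_000  # number of samples drawn

# Exponential distribution with scale 1: mean 1, standard deviation 1, strongly skewed
samples = rng.exponential(scale=1.0, size=(repeats, n))
sample_means = samples.mean(axis=1)

# The CLT predicts that the sample means are approximately N(1, 1/n)
print("mean of sample means:", sample_means.mean())
print("std of sample means: ", sample_means.std(ddof=1))
print("predicted std:       ", 1 / np.sqrt(n))
```

The observed standard deviation of the sample means should be close to sigma / sqrt(n), which is exactly the denominator of the z-test statistic used below.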

## Example 1
Consider an example from the Lecture Notes.

A machine produces items having a nominal mass of 1 kg. The mass of a randomly selected item follows the distribution X ~ N(mu, 0.02²). If mu ≠ 1 then the machinery should be corrected. The mean mass of a random sample of 25 items was found to be 0.989 kg. Test the null hypothesis H0: mu = 1 at the 1% significance level.
Note that the alternative hypothesis here is H1: mu ≠ 1, i.e. we deal with a two-tailed test.

In [2]:
mu = 1
sigma = 0.02
n = 25
xbar = 0.989
alpha = 0.01

In [3]:
import numpy as np

In [4]:
z = (xbar - mu)/(sigma / (n ** 0.5))
z

-2.750000000000002

In [5]:
from scipy.stats import norm

In [6]:
p = 2 * norm.cdf(z)
p

np.float64(0.0059595264701090694)

In [7]:
zscore = norm.ppf(alpha/2)
zscore

np.float64(-2.575829303548901)

Since p ≈ 0.006 < 0.01 = alpha (equivalently, z = -2.75 lies below the critical value -2.576), we reject H0 at the 1% significance level.

## Example 2
The number of strokes a golfer takes to complete a round of golf has mean 84.1 and standard deviation 2.6. After a month's holiday without playing golf, her mean over 36 subsequent rounds is 85.1. At the 5% significance level, test the null hypothesis that her standard of play is unaltered against the alternative hypothesis that it became worse, i.e.

• H0: mu = 84.1

• H1 : mu > 84.1 (one-tailed)

### Task 1.1
Calculate the test statistic and assign it to z.

In [8]:
mu = 84.1
sigma = 2.6
xbar = 85.1
n = 36
alpha = 0.05

In [9]:
z = (xbar - mu)/ (sigma / np.sqrt(n))
z

np.float64(2.3076923076923075)

### Task 1.2
Calculate the p-value, taking into account that z > 0 and that the test is one-tailed. Since the resulting p ≈ 0.0105 is below alpha = 0.05, we reject H0.

In [11]:
p = 1- norm.cdf(z)
p

np.float64(0.01050812811375934)

### Task 1.3
Alternatively, calculate the z-score corresponding to the chosen significance level and assign it to zscore. Since we will compare this z-score with the test statistic, which is positive, we need the positive z-score.

In [14]:
zscore = abs(norm.ppf(alpha))  # positive critical value; equals norm.ppf(1 - alpha) by symmetry
zscore

np.float64(1.6448536269514729)

Since 0 < zscore < z, the test statistic lies inside the critical (tail) region, and we reject H0.
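
The two examples differ only in which tail is used for the p-value. As a minimal sketch, the steps can be collected in one helper; the function name `z_test` and its `alternative` argument are hypothetical names introduced here for illustration, not something from the Lecture Notes or from SciPy.

```python
import numpy as np
from scipy.stats import norm


def z_test(xbar, mu0, sigma, n, alternative="two-sided"):
    """Return (z, p) for a one-sample z-test with known sigma.

    `alternative` is "two-sided", "greater" (H1: mu > mu0) or "less" (H1: mu < mu0).
    """
    z = (xbar - mu0) / (sigma / np.sqrt(n))
    if alternative == "two-sided":
        p = 2 * norm.sf(abs(z))   # both tails, works for either sign of z
    elif alternative == "greater":
        p = norm.sf(z)            # upper tail, sf(z) = 1 - cdf(z)
    elif alternative == "less":
        p = norm.cdf(z)           # lower tail
    else:
        raise ValueError(f"unknown alternative: {alternative!r}")
    return z, p


# Example 1: two-tailed test at the 1% level
z1, p1 = z_test(0.989, 1, 0.02, 25)
print(z1, p1, p1 < 0.01)

# Example 2: one-tailed (upper) test at the 5% level
z2, p2 = z_test(85.1, 84.1, 2.6, 36, alternative="greater")
print(z2, p2, p2 < 0.05)
```

Using `norm.sf(abs(z))` for the two-tailed case avoids having to check the sign of z by hand before choosing between `cdf` and `1 - cdf`.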

# 2. t-tests (when we do not know much about the data)
Usually, we have just the data (e.g. one or two samples). We may expect the data to be normally distributed (there are statistical tests to verify this, though we will not consider them now), but we do not know the parameters of that distribution exactly, e.g. its standard deviation. Hence we have to estimate the standard deviation from the sample. This estimate introduces additional uncertainty, and t-tests are used to take it into account.
t-tests are quite similar to z-tests, with two main differences:
1. The test statistic (denoted t in the Lecture Notes) is calculated with the sample standard deviation s (as we don't know the population standard deviation sigma).
2. The p-value and the t-score are calculated from the t-distribution instead of the standard normal distribution. Working with the t-distribution in Python is similar to working with any other distribution:

In [1]:
from scipy.stats import t
import pandas as pd
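
To illustrate how the t-distribution is used in the same way as `norm`, here is a minimal sketch comparing two-tailed critical values. The degrees of freedom listed are arbitrary illustrative values.

```python
from scipy.stats import norm, t

alpha = 0.05

# Two-tailed critical values: the t value exceeds the normal one for small samples
z_crit = norm.ppf(1 - alpha / 2)
for dof in (5, 24, 64, 1000):   # degrees of freedom, arbitrary illustrative values
    t_crit = t.ppf(1 - alpha / 2, df=dof)
    print(f"df = {dof:4d}: t critical = {t_crit:.4f}, normal critical = {z_crit:.4f}")

# cdf and sf work as for norm, with the degrees of freedom passed as df
print(t.cdf(-2.0, df=24), t.sf(2.0, df=24))   # equal by symmetry
```

The critical value shrinks towards the normal one as the degrees of freedom grow, which reflects the decreasing uncertainty in the estimated standard deviation.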

### Task 2.1
Download the file Heights.csv, upload it to Anaconda cloud, and load its content into a pandas data frame df (see Lab 1 if you don't remember how to do this).

In [2]:
url = 'Heights.csv'
df = pd.read_csv(url)
df.head()

|   | HEIGHTS |
|---|---------|
| 0 | 6.5 |
| 1 | 6.25 |
| 2 | 6.33 |
| 3 | 6.5 |
| 4 | 6.42 |


### Task 2.2
Assign the values of the only column of this data frame to a NumPy array x. Assign to n the size of the sample, i.e. the length of x, using the function len.

In [3]:
x = df['HEIGHTS'].to_numpy()  # values of the only column as a NumPy array
n = len(x)
n

65

Recall that, to calculate the sample standard deviation s (which has n - 1 in the denominator), we need to pass the keyword argument ddof=1, e.g.

In [4]:
import numpy as np
a = np.array([1,2,3])
np.std(a, ddof = 1)

np.float64(1.0)

In [5]:
a.std(ddof = 1)

np.float64(1.0)
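
As a minimal sketch of why the ddof argument matters, the following compares the NumPy default with the sample standard deviation, and with the pandas default, on the same small array.

```python
import numpy as np
import pandas as pd

a = np.array([1, 2, 3])

print(np.std(a))             # ddof=0 by default: divides by n
print(np.std(a, ddof=1))     # sample standard deviation: divides by n - 1
print(pd.Series(a).std())    # pandas uses ddof=1 by default
```

NumPy and pandas therefore give different answers unless ddof is set explicitly, so it is safest to always write it out.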

### Task 2.3
Assign to xbar the mean of x. Assign to s the sample standard deviation of x.

In [6]:
xbar = df.mean()
s = df.std(ddof=1)

print(xbar.values, s.values)

[6.46769232] [0.33029706]


### Task 2.4
We are going to use a one-sample t-test to determine whether the heights in the data
frame have a mean of mu = 6.5. Assign to tstat the corresponding test statistic, using
Equation 2.

In [59]:
mu = 6.5
# np.ravel accepts a pandas Series as well as a plain float, so this cell
# works whether xbar and s are Series (as above) or scalars
tstat = (np.ravel(xbar) - mu) / (np.ravel(s) / np.sqrt(n))
print(tstat)

Here H0: mu = 6.5 and H1: mu ≠ 6.5, i.e. we deal with a two-tailed test.

We can now calculate the p-value. Recall that for the one-sample test we should take the t-distribution with n - 1 degrees of freedom. Since the test statistic is negative, we double the lower-tail probability:

In [11]:
p = 2 * t.cdf(tstat, df = n-1)
p

array([0.43325635])

Therefore, at any significance level below this p-value, e.g. 5% = 0.05, we do not have sufficient evidence to reject H0.
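
The manual steps above can be cross-checked against SciPy's built-in one-sample test. This is a minimal sketch on synthetic data: the array below is a randomly generated stand-in, not the contents of Heights.csv, and the seed and distribution parameters are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import t, ttest_1samp

rng = np.random.default_rng(1)                 # arbitrary seed
x = rng.normal(loc=6.45, scale=0.3, size=65)   # synthetic stand-in for the heights
mu0 = 6.5
alpha = 0.05

# Manual calculation, following the steps above
n = len(x)
tstat = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
p_manual = 2 * t.sf(abs(tstat), df=n - 1)      # two-tailed, valid for either sign

# Critical-value approach: reject H0 if |tstat| exceeds the critical value
t_crit = t.ppf(1 - alpha / 2, df=n - 1)
reject_by_critical_value = abs(tstat) > t_crit

# The same test in one call
result = ttest_1samp(x, popmean=mu0)

print(tstat, result.statistic)
print(p_manual, result.pvalue)
print(reject_by_critical_value, result.pvalue < alpha)
```

The p-value approach and the critical-value approach always lead to the same decision, so either one can be used in the tasks that follow.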

### Task 2.5
Download the file HealthData.csv, upload it to Anaconda.com/app, and consider the column DENSITY. Use a one-sample t-test to determine whether the density variable in the data set Health Data has a mean of 1.051 using the 5% significance level. Assign the corresponding p-value to p (you may use either of the considered approaches).

In [66]:
url = 'HealthData.csv'
df1 = pd.read_csv(url)
s_density = df1['DENSITY'].to_numpy()  # df1['DENSITY'] is already a Series


In [64]:
mu1 = 1.051
a1 = 0.05

xbar1 = s_density.mean()
s1 = np.std(s_density, ddof=1)
n1 = len(s_density)


In [72]:
tstat1 = (xbar1 - mu1) / (s1 / np.sqrt(n1))
# tstat1 is positive here, so the two-tailed p-value doubles the upper tail
p = 2 * (1 - t.cdf(tstat1, df=n1 - 1))
p

np.float64(0.00017145125375250814)

In [73]:
p < a1

np.True_

## Two-sample test
We now consider examples of two-sample tests.
First, we will use a paired t-test to determine whether there is any difference between the two processes (Process A and Process B) of preserving meat joints. The data can be found in the file MeatJoints.csv, and we will test at the 5% significance level whether the means of the two processes are equal.
So, we have H0: mu_A = mu_B and H1: mu_A ≠ mu_B.

### Task 2.6
Load the data to Anaconda.com/app, and assign to a the column Process A and to b the column Process B.
Then use the following command to apply the paired two-sample test:

In [78]:
from scipy.stats import ttest_rel

url = 'MeatJoints.csv'
df2 = pd.read_csv(url)

processA = df2['Process A']
processB = df2['Process B']

# Paired (related-samples) t-test
results = ttest_rel(processA, processB)
results

TtestResult(statistic=np.float64(-2.29517764444372), pvalue=np.float64(0.047371692861499864), df=np.int64(9))

In [79]:
results.pvalue < 0.05

np.True_

Since p ≈ 0.047 < 0.05, we reject H0 at the 5% significance level.
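
A paired t-test is equivalent to a one-sample t-test on the pairwise differences with mu = 0. The following minimal sketch checks this on synthetic data; the values are randomly generated for illustration and are not the MeatJoints data.

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_rel

rng = np.random.default_rng(2)   # arbitrary seed
# Synthetic paired measurements (hypothetical values, not the MeatJoints data)
a = rng.normal(loc=10.0, scale=1.0, size=10)
b = a + rng.normal(loc=0.5, scale=0.8, size=10)

paired = ttest_rel(a, b)
on_differences = ttest_1samp(a - b, popmean=0)

print(paired.statistic, on_differences.statistic)
print(paired.pvalue, on_differences.pvalue)
print(len(a) - 1)   # degrees of freedom: number of pairs minus one
```

This is why the paired test has n - 1 degrees of freedom, where n is the number of pairs rather than the total number of observations.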

# 3. If you have time

Next, we consider an unpaired test on the dataset SportHeights.csv. We will use an unpaired t-test at the 5% significance level to determine whether there is a difference in the mean height between basketball and football players.

### Task 3.1
Load the file to Anaconda.com/app and assign to f the heights of football players and to b the heights of basketball players.
Then perform the following test:

In [81]:
url = 'SportHeights.csv'
df3 = pd.read_csv(url)

In [82]:
df3.head()

|   | football | basketball |
|---|----------|------------|
| 0 | 6.33 | 6.08 |
| 1 | 6.5 | 6.58 |
| 2 | 6.5 | 6.25 |
| 3 | 6.25 | 6.58 |
| 4 | 6.5 | 6.25 |


In [83]:
from scipy.stats import ttest_ind
footballers = df3['football']
basketballers = df3['basketball']


results = ttest_ind(footballers, basketballers, nan_policy='omit')
results

TtestResult(statistic=np.float64(-3.684107948156318), pvalue=np.float64(0.00040777606606915155), df=np.float64(83.0))

In [84]:
results.pvalue < 0.05

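np.True_

Since p ≈ 0.0004 < 0.05, we reject H0: at the 5% significance level there is evidence of a difference in mean height between football and basketball players.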
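
By default, `ttest_ind` assumes equal variances in the two groups and uses the pooled variance with n_f + n_b - 2 degrees of freedom. The following minimal sketch reproduces that statistic by hand and compares it with Welch's variant (`equal_var=False`). The small data frame `df_demo` is hypothetical, made up for illustration, and has columns of different lengths padded with NaN, which is the situation `nan_policy='omit'` is meant to handle.

```python
import numpy as np
import pandas as pd
from scipy.stats import t, ttest_ind

# Hypothetical data: the columns have different lengths, so one is padded with NaN
df_demo = pd.DataFrame({
    "football":   [6.33, 6.50, 6.50, 6.25, 6.50, 6.42],
    "basketball": [6.08, 6.58, 6.25, 6.58, 6.25, np.nan],
})
f = df_demo["football"].dropna().to_numpy()
b = df_demo["basketball"].dropna().to_numpy()

# Pooled-variance t statistic with n_f + n_b - 2 degrees of freedom
nf, nb = len(f), len(b)
sp2 = ((nf - 1) * f.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (nf + nb - 2)
tstat = (f.mean() - b.mean()) / np.sqrt(sp2 * (1 / nf + 1 / nb))
p_manual = 2 * t.sf(abs(tstat), df=nf + nb - 2)

pooled = ttest_ind(f, b)                    # equal_var=True is the default
welch = ttest_ind(f, b, equal_var=False)    # Welch's test, no equal-variance assumption
# Passing the raw columns with nan_policy='omit' drops the NaN padding instead
omit = ttest_ind(df_demo["football"], df_demo["basketball"], nan_policy="omit")

print(tstat, pooled.statistic, float(omit.statistic))
print(p_manual, pooled.pvalue)
print(welch.statistic, welch.pvalue)
```

Whether the pooled or the Welch variant is more appropriate depends on whether equal variances are a reasonable assumption for the two groups.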