### Z-score

We can calculate z-scores in Python using scipy.stats.zscore, which uses the following syntax:

scipy.stats.zscore(a, axis=0, ddof=0, nan_policy='propagate')

where:

- a: an array-like object containing the data.
- axis: the axis along which to calculate the z-scores. Default is 0.
- ddof: the degrees-of-freedom correction used when calculating the standard deviation. Default is 0, i.e. the population standard deviation.
- nan_policy: how to handle NaN values in the input. The default, 'propagate', returns NaN; 'raise' throws an error; 'omit' performs the calculation ignoring NaN values.

The following examples illustrate how to use this function to calculate z-scores for one-dimensional NumPy arrays, multi-dimensional NumPy arrays, and Pandas DataFrames.
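Before the main examples, here is a minimal sketch of what the ddof and nan_policy parameters change. The arrays `sample` and `with_nan` are made up for this illustration, and the comments describe the behaviour as documented for recent SciPy versions.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical data, used only for this illustration.
sample = np.array([2.0, 4.0, 4.0, 6.0, 9.0])

# ddof=0 (the default) divides by n; ddof=1 divides by n - 1.
z_population = stats.zscore(sample)
z_sample = stats.zscore(sample, ddof=1)

# The same kind of array with one missing value.
with_nan = np.array([2.0, 4.0, np.nan, 6.0, 9.0])

# 'propagate' (the default): the NaN contaminates the mean and the standard deviation.
z_propagate = stats.zscore(with_nan)

# 'omit': the NaN is ignored when computing the mean and the standard deviation.
z_omit = stats.zscore(with_nan, nan_policy='omit')

print(z_population)
print(z_sample)
print(z_propagate)
print(z_omit)
```

With ddof=1 the standard deviation is slightly larger, so the z-scores are slightly smaller in magnitude than with the default.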

#### NumPy One-Dimensional Arrays

#### Step 1: Import modules.

In [3]:
import pandas as pd
import numpy as np
import scipy.stats as stats

#### Step 2: Create an array of values.

In [31]:
data = np.array([6, 7, 7, 12, 13, 13, 15, 16, 19, 22])

#### Step 3: Calculate the z-scores for each value in the array.

In [32]:
z_score = stats.zscore(data)
z_score

array([-1.39443338, -1.19522861, -1.19522861, -0.19920477,  0.        ,
        0.        ,  0.39840954,  0.5976143 ,  1.19522861,  1.79284291])

Each z-score tells us how many standard deviations away an individual value is from the mean. For example:

The first value of “6” in the array is 1.394 standard deviations below the mean.
The fifth value of “13” in the array is 0 standard deviations away from the mean, i.e. it is equal to the mean.
The last value of “22” in the array is 1.793 standard deviations above the mean.
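To make the interpretation concrete, the z-scores can be reproduced by hand as (value - mean) / standard deviation. This is a small sketch that reuses the array from Step 2; the variable names `mean`, `std` and `manual_z` are chosen here for illustration.

```python
import numpy as np
import scipy.stats as stats

data = np.array([6, 7, 7, 12, 13, 13, 15, 16, 19, 22])

mean = data.mean()         # 13.0
std = data.std(ddof=0)     # population standard deviation, matching zscore's default ddof=0

# z = (x - mean) / std, applied element-wise
manual_z = (data - mean) / std

# The manual result should agree with scipy.stats.zscore.
print(manual_z)
print(np.allclose(manual_z, stats.zscore(data)))
```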

### P-values for z-scores and t-scores

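To find the p-value for each z-score in the array, we use the "stats.norm.sf" function (the survival function, equal to 1 - CDF).

For a given z-score it returns the proportion of the population above that value (right tail). Because the normal distribution is symmetric, passing the negated z-score returns the proportion of the population below it (left tail).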

In [44]:
stats.norm.sf(-z_score)  # negating z_score gives the proportion of the population/sample *below* each z-score
# (without the minus sign, sf returns the proportion of the population above each value)

array([0.08159339, 0.11599886, 0.11599886, 0.42105128, 0.5       ,
       0.5       , 0.65483584, 0.72495134, 0.88400114, 0.96350098])

In [55]:
stats.norm.sf(1.75)  # i.e. the proportion of the population greater than 1.75

0.040059156863817086

In [56]:
# the proportion of the population less than 1.75 is the same as the proportion greater than -1.75
stats.norm.sf(-1.75)

0.9599408431361829
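The relationship between the two tails can be checked directly. The sketch below uses "stats.norm.cdf", which is not used elsewhere in this section, to confirm that sf is the complement of the CDF and that sf(-z) equals cdf(z).

```python
import scipy.stats as stats

z = 1.75

right_tail = stats.norm.sf(z)    # proportion of the population above z
left_tail = stats.norm.cdf(z)    # proportion of the population below z

# The two tails add up to 1, and by symmetry sf(-z) equals cdf(z).
print(right_tail + left_tail)
print(stats.norm.sf(-z), left_tail)
```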

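We can also treat the values as t-scores and find their p-values with "stats.t.sf", using df = n - 1 = 9 degrees of freedom (the array has 10 values).
Note that the results are close, but not identical, to the p-values from the z-score method above.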

In [45]:
stats.t.sf(-z_score, df=9)

array([0.09832371, 0.13127053, 0.13127053, 0.42326516, 0.5       ,
       0.5       , 0.65019265, 0.7175855 , 0.86872947, 0.9467046 ])

#### Numpy Multi-Dimensional Arrays
If we have a multi-dimensional array, we can use the axis parameter to specify that we want to calculate each z-score relative to its own row (or column). For example, suppose we have the following multi-dimensional array:

In [58]:
data = np.array([[5, 6, 7, 7, 8],
                 [8, 8, 8, 9, 9],
                 [2, 2, 4, 4, 5]])

We can use the following syntax to calculate the z-scores within each row:

In [61]:
z_score = stats.zscore(data, axis=1)  # axis=1 standardizes each row; use axis=0 to standardize each column
z_score

array([[-1.56892908, -0.58834841,  0.39223227,  0.39223227,  1.37281295],
       [-0.81649658, -0.81649658, -0.81649658,  1.22474487,  1.22474487],
       [-1.16666667, -1.16666667,  0.5       ,  0.5       ,  1.33333333]])
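One way to confirm which direction the axis parameter works in is to check the mean and standard deviation of the result. This sketch reuses the array above; the names `z_rows` and `z_cols` are introduced here for illustration.

```python
import numpy as np
import scipy.stats as stats

data = np.array([[5, 6, 7, 7, 8],
                 [8, 8, 8, 9, 9],
                 [2, 2, 4, 4, 5]])

z_rows = stats.zscore(data, axis=1)  # each value standardized within its row
z_cols = stats.zscore(data, axis=0)  # each value standardized within its column

# After standardizing along an axis, the mean along that axis is 0 and the std is 1.
print(z_rows.mean(axis=1), z_rows.std(axis=1))
print(z_cols.mean(axis=0), z_cols.std(axis=0))
```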

In [63]:
stats.norm.sf(z_score)
# note that we pass z_score rather than -z_score, so the output is the proportion of the population greater than each value

array([[0.94166777, 0.72185077, 0.3474433 , 0.3474433 , 0.08490525],
       [0.79289191, 0.79289191, 0.79289191, 0.11033568, 0.11033568],
       [0.8783275 , 0.8783275 , 0.30853754, 0.30853754, 0.09121122]])

#### Pandas DataFrames
Suppose we instead have a Pandas DataFrame:

In [8]:
# random integers: the values will differ on each run unless a seed is set
data = pd.DataFrame(np.random.randint(0, 10, size=(5, 3)), columns=['A', 'B', 'C'])
data

| | A | B | C |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 1 | 0 | 3 | 5 |
| 2 | 5 | 9 | 1 |
| 3 | 6 | 0 | 1 |
| 4 | 9 | 8 | 6 |


We can use the apply function to calculate the z-score of individual values by column:

In [15]:
data.apply(stats.zscore)

| | A | B | C |
|---|---|---|---|
| 0 | -1.135924 | -1.039750 | -0.808224 |
| 1 | -1.135924 | -0.259938 | 0.987829 |
| 2 | 0.283981 | 1.299688 | -0.808224 |
| 3 | 0.567962 | -1.039750 | -0.808224 |
| 4 | 1.419905 | 1.039750 | 1.436842 |
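The same column-wise z-scores can also be computed with plain pandas arithmetic. This is a sketch of an alternative, not the method used above; the DataFrame is entered by hand with the values shown earlier so that the result is reproducible.

```python
import pandas as pd
import scipy.stats as stats

# The same values as the DataFrame shown above, typed in by hand.
data = pd.DataFrame({'A': [0, 0, 5, 6, 9],
                     'B': [0, 3, 9, 0, 8],
                     'C': [1, 5, 1, 1, 6]})

# pandas' std() defaults to ddof=1, so ddof=0 is passed explicitly to match stats.zscore.
z_pandas = (data - data.mean()) / data.std(ddof=0)
z_scipy = data.apply(stats.zscore)

print(z_pandas.round(6))
```

Forgetting ddof=0 here is a common source of small mismatches between pandas and SciPy results.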


#### Two-tailed test

Suppose we want to find the p-value associated with a z-score of 1.24 in a two-tailed hypothesis test.

In [46]:
stats.norm.sf(1.24) * 2

0.21497539414917388
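The calculation above works because 1.24 is positive. A small helper, sketched here under the hypothetical name `two_tailed_p`, takes the absolute value first so that negative z-scores give the same two-tailed p-value.

```python
import scipy.stats as stats

def two_tailed_p(z):
    """Two-tailed p-value for a z-score under the standard normal distribution."""
    # Double the area in the tail beyond |z|.
    return 2 * stats.norm.sf(abs(z))

print(two_tailed_p(1.24))
print(two_tailed_p(-1.24))
```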

### Z-score value for an alpha value (Z-score method in hypothesis testing)

In [66]:
stats.norm.ppf(0.025)  # ppf is the inverse of the CDF, so this is the left-tail critical value

-1.9599639845400545

In [53]:
# for the right tail, subtract the tail area from 1
stats.norm.ppf(1-0.025)

1.959963984540054
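As a sanity check, the critical values can be fed back into the CDF and the survival function to recover the tail area. The sketch below also uses "stats.norm.isf" (the inverse survival function), which is an alternative to writing ppf(1 - alpha) and is not used elsewhere in this section.

```python
import scipy.stats as stats

alpha = 0.025  # area in one tail

left_crit = stats.norm.ppf(alpha)    # left-tail critical value
right_crit = stats.norm.isf(alpha)   # right-tail critical value via the inverse survival function

# Round trip: the area below left_crit and the area above right_crit are both alpha.
print(stats.norm.cdf(left_crit))
print(stats.norm.sf(right_crit))

# isf(alpha) agrees with ppf(1 - alpha).
print(right_crit, stats.norm.ppf(1 - alpha))
```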

### T-score value for an alpha value (T-score method in hypothesis testing)

Suppose we want to find the T critical value for a left-tailed test with a significance level of .05 and degrees of freedom = 22:

In [47]:
stats.t.ppf(q=.05,df=22)

-1.7171443743802424

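The T critical value is -1.7171. Thus, if the test statistic is less than this value, the results of the test are statistically significant.

For a right-tailed test with the same significance level and degrees of freedom, the critical value is the positive counterpart: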

In [67]:
stats.t.ppf(q=1-.05,df=22)

1.717144374380242

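For a two-tailed test, split the significance level between the two tails: compute the critical value with q = alpha/2 (or q = 1 - alpha/2) and use it with both a negative and a positive sign, since the distribution is symmetric.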
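A two-tailed version of the example above might look like the following sketch, assuming the same significance level of .05 and degrees of freedom = 22. The helper name `reject_two_tailed` and the test statistics passed to it are made up for illustration.

```python
import scipy.stats as stats

alpha = 0.05
df = 22

# Put alpha / 2 in each tail; the critical values are -t_crit and +t_crit.
t_crit = stats.t.ppf(1 - alpha / 2, df=df)

def reject_two_tailed(t_stat):
    """Return True if the test statistic falls in either rejection region."""
    return abs(t_stat) > t_crit

print(-t_crit, t_crit)
print(reject_two_tailed(2.5), reject_two_tailed(-1.0))
```

Note that the two-tailed critical value is larger in magnitude than the one-tailed value of 1.7171, because only half of alpha sits in each tail.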