<table align="left" width=100%>
    <tr>
        <td width="20%">
            <img src="faculty.png">
        </td>
        <td>
            <div align="center">
                <font color="#21618C" size=8px>
                  <b>Pre Read (Week 4) </b>
                </font>
            </div>
        </td>
    </tr>
</table>

## Table of Contents

1. **[Import Libraries](#lib)**
2. **[Large Sample Test](#z)**
    - 2.1 - **[One Sample Z Test](#1z)**

<a id="lib"></a>
# 1. Import Libraries

**Let us import the required libraries.**

In [None]:
# import 'pandas' 
import pandas as pd 

# import 'numpy' 
import numpy as np

# import subpackage of matplotlib
import matplotlib.pyplot as plt

# import 'seaborn'
import seaborn as sns

# to suppress warnings 
from warnings import filterwarnings
filterwarnings('ignore')

# import statsmodels
import statsmodels.api as sm

# import 'stats' package from scipy library
from scipy import stats

# import statistics to perform statistical computations
import statistics

# to test the normality 
from scipy.stats import shapiro

# import a library to perform Z-test
from statsmodels.stats import weightstats as stests

In [None]:
# set the plot size using 'rcParams'
# once the plot size is set using 'rcParams', it sets the size of all the forthcoming plots in the file
# pass width and height in inches to 'figure.figsize' 
plt.rcParams['figure.figsize'] = [15,8]

<a id="z"></a>
# 2. Large Sample Test

If the sample size is sufficiently large (usually n > 30), we use the `Z-test`. If the population standard deviation ($\sigma$) is unknown, the sample standard deviation ($s$) is used to calculate the test statistic.

<a id="1z"></a>
## 2.1 One Sample Z Test

Let us perform a one sample Z test for the population mean. The test compares the population mean ($\mu$) with a specific value ($\mu_{0}$).

The null and alternative hypotheses are given as:

<p style='text-indent:25em'> <strong> $H_{0}: \mu = \mu_{0}$ or $\mu \geq \mu_{0}$ or $\mu \leq \mu_{0}$</strong></p>
<p style='text-indent:25em'> <strong> $H_{1}: \mu \neq \mu_{0}$ or $\mu < \mu_{0}$ or $\mu > \mu_{0}$</strong></p>

Consider a normal population with standard deviation $\sigma$. The test statistic for one sample Z-test is given as:
<p style='text-indent:25em'> <strong> $Z = \frac{\overline{X} -  \mu_{0}}{\frac{\sigma}{\sqrt{n}}}$</strong></p>

where<br>
$\overline{X}$: Sample mean<br>
$\mu_{0}$: Value of the population mean under the null hypothesis<br>
$\sigma$: Population standard deviation<br>
$n$: Sample size

Under $H_{0}$ the test statistic follows a standard normal distribution.

If the population standard deviation ($\sigma$) is unknown, use the sample standard deviation (s). Here, $s^{2} = \frac{\sum_{i=1}^{n}(x_{i} - \bar{x})^{2}}{n-1}$
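To make the two formulas above concrete, here is a minimal sketch that computes $s$ with the $n-1$ denominator and plugs it into the test statistic. It uses only the Python standard library. The sample values and the hypothesized mean `mu_0` are made up for illustration, and the sample is kept tiny only so the arithmetic is easy to follow (a real Z-test needs a large sample).

```python
import math
import statistics

# hypothetical sample and hypothesized mean (illustrative values only)
sample = [62, 70, 68, 71, 66, 64, 73, 69, 67, 65]
mu_0 = 65

n = len(sample)
x_bar = sum(sample) / n

# sample variance with the (n - 1) denominator, as in the formula for s^2
s_squared = sum((x - x_bar) ** 2 for x in sample) / (n - 1)
s = math.sqrt(s_squared)

# 'statistics.stdev()' uses the same (n - 1) denominator
s_check = statistics.stdev(sample)

# test statistic with s in place of sigma
z = (x_bar - mu_0) / (s / math.sqrt(n))

print('Sample mean:', x_bar)
print('Sample standard deviation:', s)
print('Z statistic:', z)
```

Note that `statistics.pstdev()` divides by $n$ instead of $n-1$, so it is appropriate only when the data cover the whole population.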

### Example:

#### 1. We need to determine whether girls score, on average, higher than 65 in the reading test. We collected data on 517 girls by random sampling from a normally distributed population and recorded their marks. Set the $\alpha$ value (significance level) to 0.05.

Consider the reading score for female students given in the CSV file `StudentsPerformance.csv`. 

In [None]:
# read the students performance data 
df_student = pd.read_csv('StudentsPerformance.csv')

# display the first two observations
df_student.head(2)

Unnamed: 0,gender,race/ethnicity,lunch,test preparation course,math score,reading score,writing score,total score,training institute
0,female,group B,standard,none,89,55,56,200,Nature Learning
1,female,group C,standard,completed,55,63,72,190,Nature Learning


In [None]:
# find the number of female students in the dataframe
df_student.gender.value_counts()

female    517
male      483
Name: gender, dtype: int64

There are 517 female students in the dataset. Consider the reading scores of these students.

In [None]:
# given reading scores
# consider the subset of the given dataframe 'df_student'
scores = df_student[(df_student['gender'] == 'female')]['reading score']

# print score of first five female students 
scores.head()

0    55
1    63
2    71
5    85
6    51
Name: reading score, dtype: int64

Let us check the normality of the data.

In [None]:
# perform Shapiro-Wilk test to test the normality
# shapiro() returns a tuple having the values of test statistics and the corresponding p-value
# pass the reading scores of female students to perform the test
stat, p_value = shapiro(scores)

# print the test statistic and corresponding p-value 
print('Test statistic:', stat)
print('P-Value:', p_value)

Test statistic: 0.9949197173118591
P-Value: 0.08649472147226334


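From the above result, the p-value is greater than 0.05, so we fail to reject the null hypothesis of the Shapiro-Wilk test that the data come from a normal distribution. We can therefore treat the reading scores as approximately normally distributed.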

The null and alternative hypotheses are:

H<sub>0</sub>: $\mu \leq 65$<br>
H<sub>1</sub>: $\mu > 65$

Here $\alpha$ = 0.05. For a one-tailed test, calculate the critical z-value.

In [None]:
# calculate the critical z-value for a 5% significance level (95% confidence level)
# use 'stats.norm.isf()' to find the z-value corresponding to the upper tail probability 'q'
# pass the value of 'alpha' to the parameter 'q', here alpha = 0.05
# use 'round()' to round-off the value to 2 digits
z_val = np.abs(round(stats.norm.isf(q = 0.05), 2))

print('Critical value for one-tailed Z-test:', z_val)

Critical value for one-tailed Z-test: 1.64


i.e. if z is greater than 1.64 then we reject the null hypothesis.
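The critical value can be cross-checked without SciPy. The sketch below uses `statistics.NormalDist` from the standard library (Python 3.8+), which is a substitute for `stats.norm.isf()` rather than the method used in this notebook. The names `alpha`, `z_crit`, `z_observed` and `reject_h0` are illustrative, and `z_observed` is a made-up value. The sketch keeps the unrounded critical value, since rounding it down to 1.64 slightly enlarges the rejection region.

```python
from statistics import NormalDist

alpha = 0.05

# upper-tail critical value: the point with probability alpha to its right
z_crit = NormalDist().inv_cdf(1 - alpha)

# observed test statistic (hypothetical value for illustration)
z_observed = 2.1

# one-tailed (upper) decision rule: reject H0 if z exceeds the critical value
reject_h0 = z_observed > z_crit

print('Unrounded critical value:', z_crit)
print('Reject H0:', reject_h0)
```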

In [None]:
# 'ztest()' returns the test statistic and corresponding p-value
# pass the sample data to the parameter, 'x1'
# pass the value in null hypothesis to the parameter, 'value'
# pass the one-tailed condition to the parameter, 'alternative'
z_score, pval = stests.ztest(x1 = scores, value = 65, alternative = 'larger')

# print the test statistic and corresponding p-value
print("Z-score: ", z_score)
print("p-value: ", pval)

Z-score:  2.529410071375873
p-value:  0.005712722457410142


In [None]:
# calculate the 95% confidence interval for the population mean
# pass the sample mean to the parameter, 'loc'
# pass the scaling factor (sample_std / n^(1/2)) to the parameter, 'scale'
# use 'stdev()' to calculate the sample standard deviation 
print('Confidence interval:', stats.norm.interval(0.95, loc = np.mean(scores), 
                                                  scale = statistics.stdev(scores) / np.sqrt(len(scores))))

Confidence interval: (65.33094545307333, 67.6090932316462)


Here the z-score is greater than 1.64, the p-value is less than 0.05, and the confidence interval does not contain the value in the null hypothesis (i.e. 65). Thus we reject the null hypothesis: there is enough evidence to conclude that, on average, girls score higher than 65 in the reading test.
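Because this test is one-tailed, the interval that corresponds to it exactly is a one-sided lower confidence bound rather than a two-sided interval. As a rough sketch, the sample mean and standard error can be recovered from the two-sided interval printed above and reused. This uses `statistics.NormalDist` from the standard library instead of SciPy, and all variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()

# two-sided 95% interval reported above
lower, upper = 65.33094545307333, 67.6090932316462

# the interval is symmetric: mean +/- z_(0.975) * standard error
samp_mean = (lower + upper) / 2
std_error = (upper - lower) / 2 / std_normal.inv_cdf(0.975)

# test statistic and upper-tail p-value for H1: mu > 65
z_score = (samp_mean - 65) / std_error
p_value = std_normal.cdf(-z_score)

# one-sided 95% lower confidence bound for the population mean
lower_bound = samp_mean - std_normal.inv_cdf(0.95) * std_error

print('Z-score:', z_score)
print('p-value:', p_value)
print('95% lower confidence bound:', lower_bound)
```

If the lower bound lies above 65, the one-sided interval agrees with rejecting $H_{0}: \mu \leq 65$ at the 5% level.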

#### 2. The manager of a packaging process at a protein powder manufacturing plant wants to determine whether the protein powder packing process is in control. The correct amount of protein powder per box is 350 grams on average. A sample of 80 boxes was drawn, which gave a mean of 354.5 grams with a standard deviation of 15 grams. At the 5% level of significance, is there evidence to suggest that the mean weight is different from 350 grams?

The null and alternative hypotheses are:

H<sub>0</sub>: $\mu = 350$<br>
H<sub>1</sub>: $\mu \neq 350$

Here $\alpha$ = 0.05. For a two-tailed test, calculate the critical z-value.

In [None]:
# calculate the critical z-value for a 5% significance level (95% confidence level)
# use 'stats.norm.isf()' to find the z-value corresponding to the upper tail probability 'q'
# pass the value of 'alpha/2' for a two-tailed test to the parameter 'q', here alpha = 0.05
# use 'round()' to round-off the value to 2 digits
z_val = np.abs(round(stats.norm.isf(q = 0.05/2), 2))

print('Critical value for two-tailed Z-test:', z_val)

Critical value for two-tailed Z-test: 1.96


i.e. if z is less than -1.96 or z is greater than 1.96 then we reject the null hypothesis.

In [None]:
# define a function to calculate the Z-test statistic 
# here the population standard deviation is unknown, thus use the sample standard deviation 
# pass the hypothesized population mean, sample standard deviation, sample size and sample mean as the function input
def z_test(pop_mean, samp_std, n, samp_mean):
   
    # calculate the test statistic
    z_score = (samp_mean - pop_mean) / (samp_std / np.sqrt(n))

    # return the z-test value
    return z_score

# given data
n = 80
pop_mean = 350
samp_mean = 354.5
samp_std = 15

# calculate the test statistic using the function 'z_test'
z_score = z_test(pop_mean, samp_std, n, samp_mean)
print("Z-score:", z_score)

Z-score: 2.6832815729997477


In [None]:
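# calculate the corresponding p-value for the test statistic
# use 'sf()' to calculate P(Z > |z_score|)
# take the absolute value so that the calculation is also valid for a negative test statistic
p_value = stats.norm.sf(np.abs(z_score))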

# for a two-tailed test multiply the p-value by 2
req_p = p_value*2
print('p-value:', req_p)

p-value: 0.007290358091535638


In [None]:
# calculate the 95% confidence interval for the population mean
# pass the sample mean to the parameter, 'loc'
# pass the scaling factor (sample_std / n^(1/2)) to the parameter, 'scale'
print('Confidence interval:', stats.norm.interval(0.95, loc = samp_mean, scale = samp_std / np.sqrt(n)))

Confidence interval: (351.2130404728378, 357.7869595271622)


Here the z-score is greater than 1.96, the p-value is less than 0.05, and the confidence interval does not contain the value in the null hypothesis (i.e. 350). Thus we reject the null hypothesis: there is enough evidence to conclude that the average weight per protein powder box is not 350 grams.
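The steps of this example can be gathered into one helper that works from summary statistics and covers all three pairs of hypotheses listed in Section 2.1. This is a sketch, not part of any library: the function name `z_test_from_summary` and its parameters are hypothetical, and it uses `statistics.NormalDist` from the standard library in place of `scipy.stats.norm`. The option names mirror the `alternative` values accepted by `stests.ztest()`.

```python
import math
from statistics import NormalDist


def z_test_from_summary(samp_mean, pop_mean, samp_std, n, alternative='two-sided'):
    """Return the Z statistic and p-value for a one sample Z-test."""
    std_normal = NormalDist()
    z_score = (samp_mean - pop_mean) / (samp_std / math.sqrt(n))

    if alternative == 'two-sided':
        # both tails: P(|Z| > |z_score|)
        p_value = 2 * std_normal.cdf(-abs(z_score))
    elif alternative == 'larger':
        # upper tail: P(Z > z_score)
        p_value = std_normal.cdf(-z_score)
    elif alternative == 'smaller':
        # lower tail: P(Z < z_score)
        p_value = std_normal.cdf(z_score)
    else:
        raise ValueError("alternative must be 'two-sided', 'larger' or 'smaller'")

    return z_score, p_value


# protein powder example: n = 80, sample mean = 354.5, sample std = 15
z_score, p_value = z_test_from_summary(354.5, 350, 15, 80)
print('Z-score:', z_score)
print('p-value:', p_value)
```

Using `cdf(-z)` for the upper tail, rather than `1 - cdf(z)`, avoids losing precision when the p-value is very small.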