# One Sample Z Test

Performed when the population standard deviation (σ) is known and the sample mean is compared against a hypothesized population mean (μ0).

## Example-1

- Suppose that a beach is safe to swim if the mean level of lead in the water is 10.0 (μ0) parts/million.  
- We assume Xi ~ N(μ, σ = 1.5)
- Water safety is determined by taking 40 water samples and computing the z test statistic.
- Sample mean = 10.5
- α = 0.05
- H0: μ = 10.0, Ha: μ > 10.0 (upper-tailed test)

In [1]:
import scipy.stats as stats
from math import sqrt
import numpy as np

In [2]:
x_bar = 10.5 # sample mean
n = 40 # number of water samples
sigma = 1.5 # population standard deviation
mu = 10 # hypothesized population mean (mu_0)

Calculate the test statistic:
$$ z = \frac{\bar x - \mu_0} {\sigma / \sqrt n} $$

In [3]:
z = (x_bar - mu)/(sigma/sqrt(n))
z

2.1081851067789197

Calculate the p-value for the upper-tailed test (the area to the right of z):

In [4]:
p_value = 1 - stats.norm.cdf(z)
p_value

0.017507490509831247
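
An equivalent way to reach the decision is the critical-value approach: reject H0 when z exceeds the upper α quantile of the standard normal distribution. The sketch below reuses this example's numbers; the names `z_crit` and `reject` are illustrative choices, not part of the original notebook.

```python
from math import sqrt
import scipy.stats as stats

x_bar, mu, sigma, n = 10.5, 10, 1.5, 40  # values from Example-1
alpha = 0.05

z = (x_bar - mu) / (sigma / sqrt(n))
z_crit = stats.norm.ppf(1 - alpha)  # upper-tail critical value (about 1.645)

reject = bool(z > z_crit)
print(f"z = {z:.3f}, critical value = {z_crit:.3f}, reject H0: {reject}")
```

Comparing z with the critical value and comparing the p-value with α always lead to the same decision for a given test.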

In [5]:
alpha = 0.05

if p_value<alpha:
    print('At {} level of significance, we can reject the null hypothesis in favor of alternative hypothesis.'.format(alpha))
else:
    print('At {} level of significance, we fail to reject the null hypothesis.'.format(alpha))

At 0.05 level of significance, we can reject the null hypothesis in favor of alternative hypothesis.


## Example-2

- A department store manager determines that a new billing system will be cost-effective only if the mean monthly account is more than 170 dollars.
- A random sample of 400 monthly accounts is drawn, for which the sample mean is 178 dollars. 
- The accounts are approximately normally distributed with a standard deviation of 65 dollars.


- Can we conclude that the new system will be cost-effective?

In [6]:
x_bar = 178 # sample mean
n = 400 # number of monthly accounts sampled
sigma = 65 # population standard deviation
mu = 170 # hypothesized population mean (mu_0)

Calculate the test statistic

In [7]:
z = (x_bar - mu)/(sigma/sqrt(n))
z

2.4615384615384617

In [8]:
p_value = 1 - stats.norm.cdf(z)
p_value

0.006917128192854505

In [9]:
alpha = 0.05

if p_value<alpha:
    print('At {} level of significance, we can reject the null hypothesis in favor of alternative hypothesis.'.format(alpha))
else:
    print('At {} level of significance, we fail to reject the null hypothesis.'.format(alpha))

At 0.05 level of significance, we can reject the null hypothesis in favor of alternative hypothesis.
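
Both z-test examples follow the same steps, so they can be wrapped in a small helper. This is a minimal sketch: the function name `one_sample_z_test` and its `alternative` argument are hypothetical conveniences introduced here, not a SciPy API.

```python
from math import sqrt
import scipy.stats as stats

def one_sample_z_test(x_bar, mu_0, sigma, n, alternative="greater"):
    """Return (z, p_value) for a one-sample z test with known sigma."""
    z = (x_bar - mu_0) / (sigma / sqrt(n))
    if alternative == "greater":
        p_value = stats.norm.sf(z)           # area to the right of z
    elif alternative == "less":
        p_value = stats.norm.cdf(z)          # area to the left of z
    elif alternative == "two-sided":
        p_value = 2 * stats.norm.sf(abs(z))  # both tails
    else:
        raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")
    return z, p_value

z1, p1 = one_sample_z_test(10.5, 10, 1.5, 40)   # Example-1
z2, p2 = one_sample_z_test(178, 170, 65, 400)   # Example-2
print(f"Example-1: z = {z1:.4f}, p = {p1:.4f}")
print(f"Example-2: z = {z2:.4f}, p = {p2:.4f}")
```

`stats.norm.sf(z)` is the survival function, equal to `1 - stats.norm.cdf(z)` but numerically more accurate for large z.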


# One Sample t Test

## Example-1

- Bon Air ELEM has 1000 students. The principal of the school thinks that the average IQ of students at Bon Air is at least 110. To prove her point, she administers an IQ test to 20 randomly selected students. 
- Among the sampled students, the average IQ is 108 with a standard deviation of 10. 
- Based on these results, should the principal's claim be rejected? α = 0.01
- H0: μ = 110 (the principal's claim, μ ≥ 110), Ha: μ < 110 (lower-tailed test)

In [10]:
x_bar = 108 # sample mean
n = 20 # number of students sampled
s = 10 # sample standard deviation
mu = 110 # hypothesized population mean (mu_0)
alpha = 0.01

Calculate the test statistic:
$$ t = \frac{\bar x - \mu_0} {s / \sqrt n} $$

In [11]:
t = (x_bar - mu)/(s/sqrt(n))
t

-0.8944271909999159

In [12]:
p_value = stats.t.cdf(t, n-1)
p_value

0.1911420676837155
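
The same decision can be checked with the critical-value approach: for a lower-tailed test, reject H0 when t falls below the α quantile of the t distribution with n-1 degrees of freedom. A minimal sketch using this example's numbers (the names `t_stat`, `t_crit`, and `reject` are illustrative):

```python
from math import sqrt
import scipy.stats as stats

x_bar, mu, s, n = 108, 110, 10, 20  # values from this example
alpha = 0.01

t_stat = (x_bar - mu) / (s / sqrt(n))
t_crit = stats.t.ppf(alpha, df=n - 1)  # lower-tail critical value (negative)

reject = bool(t_stat < t_crit)
print(f"t = {t_stat:.3f}, critical value = {t_crit:.3f}, reject H0: {reject}")
```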

In [13]:
if p_value<alpha:
    print('At {} level of significance, we can reject the null hypothesis in favor of alternative hypothesis.'.format(alpha))
else:
    print('At {} level of significance, we fail to reject the null hypothesis.'.format(alpha))

At 0.01 level of significance, we fail to reject the null hypothesis.


## Example-2

Test whether college students get 7.2 hours of sleep on average, based on a sample of students (α = 0.05).

H0: mu = 7.2

Ha: mu != 7.2

### scipy.stats

In [14]:
import pandas as pd
import scipy.stats as stats
import math

In [15]:
#students.csv file loaded to Github in Statistics Course

df = pd.read_csv('https://raw.githubusercontent.com/clarusway/clarusway-ds-students-7-21/main/3-%20Classes_Labs/Statistics/data/students.csv?token=APETIRNXQBNL7A72OSDZKBDAQLTL2')
#df = pd.read_excel('students.xlsx')
df.head()

Unnamed: 0,ID,Gender,Classification,Height,Shoe Size,Phone Time,# of Shoes,Birth order,Pets,Happy,...,Exercise,Stat Pre,Stat Post,Phone Type,Sleep,Social Media,Impact of SocNetworking,Political,Animal,Superhero
0,1,male,senior,67.75,7.0,12.0,12.0,youngest,5.0,0.8,...,360,3.0,,iPhone,7.0,180.0,worse,Democrat,Dog person,Batman
1,2,male,freshman,71.0,7.5,1.5,5.0,middle,4.0,0.75,...,200,9.0,,Android smartphone,7.0,20.0,better,Democrat,Dog person,Batman
2,3,female,freshman,64.0,6.0,25.0,15.0,oldest,8.0,0.9,...,30,7.0,5.0,Android smartphone,8.0,60.0,better,Republican,Dog person,Batman
3,4,female,freshman,63.0,6.5,30.0,30.0,middle,12.0,0.98,...,180,6.0,7.0,iPhone,6.0,60.0,better,Republican,Both,Superman
4,5,male,senior,69.0,6.5,23.0,8.0,oldest,4.0,0.75,...,180,4.0,7.0,iPhone,5.5,60.0,worse,Independent,Dog person,Superman


In [16]:
df['Sleep'].mean()

6.8618421052631575

In [17]:
df['Sleep'].std()

1.5310098719701895

In [18]:
onesample = stats.ttest_1samp(df['Sleep'], 7.2)

In [19]:
onesample.statistic

-1.92552134000487

In [20]:
onesample.pvalue

0.05795525591903326
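
To see what `stats.ttest_1samp` computes, the statistic and two-sided p-value can be reproduced from the t formula. The sketch below uses a small hypothetical array of sleep hours (not the course data) so that it runs without the CSV file.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical sleep hours, used only for illustration (not the course data)
sleep = np.array([6.5, 7.0, 8.0, 5.5, 6.0, 7.5, 7.0, 6.0, 8.5, 6.5])
mu_0 = 7.2

n = sleep.size
# ddof=1 gives the sample standard deviation (the pandas .std() default)
t_manual = (sleep.mean() - mu_0) / (sleep.std(ddof=1) / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)  # two-sided p-value

result = stats.ttest_1samp(sleep, mu_0)
print(np.isclose(t_manual, result.statistic), np.isclose(p_manual, result.pvalue))
```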

In [21]:
print(f'p-value for two sided test: {onesample.pvalue:.4f}')

p-value for two sided test: 0.0580


In [22]:
alpha = 0.05
p_value = onesample.pvalue  # two-sided p-value from ttest_1samp

if p_value<alpha:
    print('At {} level of significance, we can reject the null hypothesis in favor of Ha.'.format(alpha))
else:
    print('At {} level of significance, we fail to reject the null hypothesis.'.format(alpha))

At 0.05 level of significance, we fail to reject the null hypothesis.


#### One-tailed Test

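Now suppose we want to test whether the average hours of sleep is less than 7.2. Because the test statistic is negative (in the direction of Ha), the one-sided p-value is half of the two-sided p-value.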

H0: mu = 7.2

Ha: mu < 7.2

In [23]:
print(f'p-value for one sided test: {onesample.pvalue/2:.4f}')

p-value for one sided test: 0.0290
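
Instead of halving the two-sided p-value by hand, recent SciPy versions (1.6 and later) accept an `alternative` argument in `ttest_1samp`. A minimal sketch on a hypothetical sample (not the course data), showing that the two approaches agree when the statistic points in the direction of Ha:

```python
import numpy as np
import scipy.stats as stats

# Hypothetical sleep hours, used only for illustration (not the course data)
sleep = np.array([6.5, 7.0, 8.0, 5.5, 6.0, 7.5, 7.0, 6.0, 8.5, 6.5])

two_sided = stats.ttest_1samp(sleep, 7.2)
one_sided = stats.ttest_1samp(sleep, 7.2, alternative='less')  # Ha: mu < 7.2

# Halving matches only because the statistic is negative (same direction as Ha)
print(two_sided.statistic < 0, np.isclose(one_sided.pvalue, two_sided.pvalue / 2))
```

If the statistic had the opposite sign, the correct one-sided p-value would be `1 - pvalue/2`, which `alternative='less'` handles automatically.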


In [24]:
alpha = 0.05
p_value = onesample.pvalue/2

if p_value<alpha:
    print('At {} level of significance, we can reject the null hypothesis in favor of Ha.'.format(alpha))
else:
    print('At {} level of significance, we fail to reject the null hypothesis.'.format(alpha))

At 0.05 level of significance, we can reject the null hypothesis in favor of Ha.


### Statsmodels Example

In [25]:
import statsmodels.api as sm
dataset1 = sm.datasets.get_rdataset(dataname='Pima.tr', package='MASS')
dataset1.keys()


dict_keys(['data', '__doc__', 'package', 'title', 'from_cache'])

In [26]:
df1 = dataset1.data
df1.head()

Unnamed: 0,npreg,glu,bp,skin,bmi,ped,age,type
0,5,86,68,28,30.2,0.364,24,No
1,7,195,70,33,25.1,0.163,55,Yes
2,5,77,82,41,35.8,0.156,35,No
3,0,165,76,43,47.9,0.259,26,No
4,0,107,60,25,26.4,0.133,23,No


In [27]:
df1.describe()

Unnamed: 0,npreg,glu,bp,skin,bmi,ped,age
count,200.0,200.0,200.0,200.0,200.0,200.0,200.0
mean,3.57,123.97,71.26,29.215,32.31,0.460765,32.11
std,3.366268,31.667225,11.479604,11.724594,6.130212,0.307225,10.975436
min,0.0,56.0,38.0,7.0,18.2,0.085,21.0
25%,1.0,100.0,64.0,20.75,27.575,0.2535,23.0
50%,2.0,120.5,70.0,29.0,32.8,0.3725,28.0
75%,6.0,144.0,78.0,36.0,36.5,0.616,39.25
max,14.0,199.0,110.0,99.0,47.9,2.288,63.0


In [28]:
df1['type'].value_counts()

No     132
Yes     68
Name: type, dtype: int64

In [29]:
df1.groupby(['type'])['bmi'].describe()

type,count,mean,std,min,25%,50%,75%,max
No,132.0,31.074242,6.381457,18.2,25.825,31.05,35.5,47.9
Yes,68.0,34.708824,4.810956,22.9,31.6,34.6,37.625,46.1


Suppose we hypothesize that the population mean of BMI among Pima Indian women is above 30, motivated by the sample mean of $bmi$, $\bar x = 32.31$. Using the Pima.tr data set (sample size $n = 200$), we test the hypotheses:
- $H_{0}:  \mu  = 30 $ 
- $H_{A}:  \mu  > 30 $ 

In [30]:
# Select a variable and calculate its mean and standard deviation
colN ='bmi'
n = df1.shape[0]
sample_mean= df1[colN].mean()
sample_std = df1[colN].std()
print('Sample Size is ', n)
print('Sample Mean is {:.2f}'.format(sample_mean))
print('Sample Standard Deviation  is {:.2f}'.format(sample_std))

Sample Size is  200
Sample Mean is 32.31
Sample Standard Deviation  is 6.13


In [31]:
mu_zero = 30
t_score = (sample_mean - mu_zero) / (sample_std / np.sqrt(n))
print('The t-score is {:.2f}'.format(t_score))

The t-score is 5.33


In [32]:
degrees_of_freedom = n - 1
# Upper-tail p-value for Ha: mu > 30 (sf is the survival function, 1 - cdf)
p_value = stats.t.sf(t_score, df=degrees_of_freedom)
print('The p value is ', p_value)

The p value is  1.3307205153730877e-07


In [33]:
t_statistics = stats.ttest_1samp(df1[colN], popmean = 30)
t_score = t_statistics.statistic
p_value = t_statistics.pvalue / 2 # One-sided test; halving is valid because t_score > 0 (direction of Ha)
print('The t-score is ', t_score)
print('The p-value is ', p_value)

The t-score is  5.329070841262502
The p-value is  1.3307205153727868e-07


At any conventional significance level, there is strong evidence to reject the null hypothesis and conclude that the population mean of BMI among Pima Indian women is greater than 30. Since a BMI of 30 or above is classified as obese, the mean BMI of this population falls in the obese range.
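
As a quick check of that conclusion, the reported one-sided p-value can be compared against several commonly used significance levels. The list of α values below is an illustrative choice.

```python
p_value = 1.3307205153730877e-07  # one-sided p-value reported above

for alpha in (0.10, 0.05, 0.01, 0.001):
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    print(f"alpha = {alpha}: {decision}")
```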