# Hypothesis Testing

**If you are a Colab user**

If you use Google Colab, you can uncomment the following cell to mount your Google Drive to Colab. <br>
After that, Colab can read/write files and data in your Google Drive <br>

please change the current directory to be the folder that you save your Notebook and <br>
data folder. For example, I save my Colab files and data in the following location

In [1]:
#from google.colab import drive
#drive.mount('/content/drive')

#%cd /content/drive/MyDrive/Colab\ Notebooks

**Install new libraries**

You can run the following code in terminal or conda prompt or uncomment the next cell and direcdtly run the code in this notebook

*conda install statsmodels*


So far, we have installed

jupyter notebook
numpy
pandas
matplotlib
seaborn
plotly
scipy
openpyxl
geopandas
contextily
statsmodels

In [2]:
#!pip3 install statsmodels

**Set up standards for the remainder of the notebook**

In [3]:
# import libraries and modules to be used
import numpy as np 
np.set_printoptions(precision=4, suppress=True)
np.random.seed(12345)

import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import scipy as sp
from scipy import stats # we'll use the stats module of scipy

import statsmodels as sm
from statsmodels.stats import proportion # we will use it for z tests

# to display multiple outputs in one cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Import and Review California Housing Data

CA_housing.csv dataset pertains to the houses found in a given California
district and some summary stats about them based on the year 1990 census data.
The dataset contains 20,640 observations and 10 columns

Below is a list of the 9 attributes (X) with their discription

  --Longitude: block group longitude\
  --Latitude block group latitude\
  --HouseAge: median house age in block group\
  --AveRooms: average number of rooms per household\
  --AveBedrms: average number of bedrooms per household\
  --Population: block group population\
  --AveOccup: average number of household members\
  --MedInc: Median income for households within a block\
  --OceanProx: Location of the house w.r.t ocean/sea

The target (y) is:\
--MedVal: Median house value for households within a block


In [4]:
# read the Housing dataset and be familar with data types
CA_housing=pd.read_csv("Data/CA_housing.csv")
CA_housing.head()
CA_housing.info()


Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,OceanProx,MedVal
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,NEAR BAY,452600.0
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,NEAR BAY,358500.0
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,NEAR BAY,352100.0
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,NEAR BAY,341300.0
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,NEAR BAY,342200.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   MedInc      20640 non-null  float64
 1   HouseAge    20640 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20640 non-null  float64
 5   AveOccup    20640 non-null  float64
 6   Latitude    20640 non-null  float64
 7   Longitude   20640 non-null  float64
 8   OceanProx   20640 non-null  object 
 9   MedVal      20640 non-null  float64
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


In [5]:
# change 'OceanProx' to ordered categorical variable
ordered_categories = ['INLAND', 'NEAR BAY', '<1H OCEAN', 'NEAR OCEAN', 'ISLAND']
CA_housing['OceanProx'] = pd.Categorical(CA_housing['OceanProx'], 
                                         categories=ordered_categories, 
                                         ordered=True
                                        )

## Hypothesis Testing Foundation

**Hypothesis**

A tentative conjecture is called the *null hypothesis*. The opposite of what is stated in the null hypothesis is the *alternative hypothesis*. The hypothesis testing procedure uses a sample of data to test the validity of the two competing statements about a population.

**Steps of Hypothesis Testing**
1. Develop the null and alternative hypotheses
2. Specify the level of significance, $\alpha$
3. Collect the sample data and compute the value of the test statistic
4. Use the value of the test statistic to compute the $p$ value
5. Draw a decision on the null hypothesis
6. Interpret the statistical conclusion in the context of the application

## One-sample t test

The one-sample t test is often used to test a hypothesis on $\mu$, the mean of a distribution $F$.  Let $\mu_0$ be the hypothesized value of mean, and $\alpha$ be the level of significance in hypothesis testing.  Let's draw a random sample, $x_1,\dots,x_n$,  from the distribution $F$.  We use this sample to test the hypothesis on the population mean.

The table below summarizes three types of one-sample t test. 

**Table 1: Hypothesis for one-sample t-test**
$$
\begin{array}{l|c|c|c}
 & \text{lower-tail test} & \text{upper-tail test} & \text{two-tail test} \\
\hline
\text{Null hypothesis } H_0 & \mu \geq \mu_0 & \mu \leq \mu_0 & \mu = \mu_0 \\
\text{Alternative hypothesis } H_1 & \mu < \mu_0 & \mu > \mu_0 & \mu \neq \mu_0 \\
p\text{-value} & P(X \leq t) & P(X > t) & P(X \geq |t|) \\
\hline
\end{array}
$$

The test statistic, t, is
\begin{equation}
t=\frac{\bar{x}-\mu_0}{s/\sqrt{n}}. \tag{1}
\end{equation}

p value represents the probability that we mistakely reject the null hypothesis. If p value is less than $\alpha$, that probabability is small so that we can safely reject the hull hypothesis.

Let's test on $\mu$, the mean value of log(MedVal)

If the hypothesized mean of `log(MedVAL)' of a survey block is 14, can you generate a random sample of size 1,000 and use the sample to perform the lower-tailed test, upper-tailed test, and two-tail tested, respectively? 


H0: $\mu$ = 14\
Ha: $\mu$ !=14, or < 14 or >14, depending on the belief or the conjecture that you have about this parameter

we use **ttest_1samp(sample, hypothesized value, alternative=)** in scipy.stats to perform 1-sample t test. alternative = 'less' if it is a lower-tail test, 'greater' for upper-tail test, and 'two-sided' for two-tail test

In [6]:
# sp.stats.ttest_1samp(sample, hypothesized value, alternative) performs the 1 sample t test

alpha =0.05 # level of significance
n=1000 # sample size
mu_0 = 14 #hypothesized value of mean

sample_ppl = np.log(CA_housing.MedVal.sample(n)) # a random sample of size n
print(f'The sample mean is {sample_ppl.mean():.2f}')
print(f'The sample standard deviation is {sample_ppl.std(ddof=1):.3f}','\n')

# lower-tail test (alternative: mu < mu_0)
t_stat, p_value = sp.stats.ttest_1samp(sample_ppl, mu_0, alternative='less')
print('lower-tail test:')
print(f'test statistic is {t_stat:.3f} and the p value is {p_value:.3f}')
if p_value > alpha:
   print (f'Fail to reject the null hypothesis at the level of significance {alpha}','\n')
else:
   print (f'Reject the null hypothesis at the level of significance {alpha}','\n')

# upper-tail test (alternative: mu>mu_0)
t_stat, p_value = sp.stats.ttest_1samp(sample_ppl, mu_0, alternative='greater')
print('upper-tail test:')
print(f'test statistic is {t_stat:.3f} and the p value is {p_value:.3f}')
if p_value > alpha:
   print (f'Fail to reject the null hypothesis at the level of significance {alpha}','\n')
else:
   print (f'Reject the null hypothesis at the level of significance {alpha}','\n')

# two-tail test (alternative hypothesis: mu!=mu_0)
t_stat, p_value = sp.stats.ttest_1samp(sample_ppl, mu_0, alternative='two-sided')
print('two-tail test:')
print(f'test statistic is {t_stat:.3f} and the p value is {p_value:.3f}')
if p_value > alpha:
   print (f'Fail to reject the null hypothesis at the level of significance {alpha}','\n')
else:
   print (f'Reject the null hypothesis at the level of significance {alpha}','\n')


The sample mean is 12.10
The sample standard deviation is 0.577 

lower-tail test:
test statistic is -103.947 and the p value is 0.000
Reject the null hypothesis at the level of significance 0.05 

upper-tail test:
test statistic is -103.947 and the p value is 1.000
Fail to reject the null hypothesis at the level of significance 0.05 

two-tail test:
test statistic is -103.947 and the p value is 0.000
Reject the null hypothesis at the level of significance 0.05 



## One-way ANOVA Test / F test

The one-way ANOVA test (named F test as well) tests the null hypothesis that two or more distributions have the same mean.  The test is applied to samples from each of the distributions, possibly with different sizes.

Suppose there are $m$ distributions whose mean values are $\mu_i$, for $i=1,\dots,m$. The hypotheses of the one-way ANOVA test are:

H0: $\mu_1=\cdots\mu_m$

Ha: $\exists\; \mu_i\neq\mu_j$, for $i, j \in\{1,\dots,m\}$

We often perform the one-way ANOVA test first. If it turned out we reject the null hypothesis, we can continue to perform the two-sample t test to further determine the relationship between the means of any pair of distributions.

Assuming  a random sample is drawn from each of the $m$ distributions.
$n_j$, $\overline{x}_j$, and $s_j$ are the sample size, sample mean, and sample standard deviation pertaining to distribution $j$, for $j=1,\dots,m$. Let $\overline{x}$ be the overall mean:
\begin{equation}
\overline{x}=\frac{\sum_{j=1}^m \overline{x}_j n_j}{\sum_{j=1}^m n_j}. \tag{2}
\end{equation}
The test statistic $F$ is
\begin{equation}
F=\frac{\text{between-group mean squares}}{\text{Within-group mean squares}}=\frac{\sum_{j=1}^m (\overline{x}_j-\overline{x})^2 n_j\left/(m-1)\right.}{\sum_{j=1}^m s_j^2(n_j-1)\left/(\sum_{j=1}^m n_j-m)\right.} \tag{3}
\end{equation}
The corresponding $p$ value is the probability that a random value drawn from the F distribution with $(m-1)$ and $(\sum_{j=1}^m n_j-m)$ degrees of freedom is grater than the test statistic F.



Does the mean of log(MedVal) for houses in a block vary with the block's proximity to the ocean? If we split blocks by their proximity to ocean, there are five groups. Can you generate a random sample of size 1,000 from each group and then use the samples to test if the mean value of log(MedVal) in a block varies by its proximity to ocean? 

H0: $
\mu_\text{<1H OCEAN} = \mu_\text{NEAR BAY} = \mu_\text{INLAND} = \mu_\text{NEAR OCEAN}
$

Ha:
$
\mu_i \neq \mu_j \quad \text{for any } i, j \in \{\text{<1H OCEAN}, \text{NEAR BAY}, \text{NEAR OCEAN}, \text{INLAND}\}
$

**f_oneway(smp1, smp2,...)** in scipy.stats

In [7]:
alpha =0.05 # level of significance
n=1000 # sample size

# sample 1000 MedCal values from each group and apply the log function to the samples
samples = {
    i: np.log(CA_housing.loc[CA_housing.OceanProx == i, "MedVal"].sample(n))
    for i in ["<1H OCEAN", "NEAR BAY", "NEAR OCEAN", "INLAND"]
}


F_stat,p_value=sp.stats.f_oneway(samples["<1H OCEAN"],
                                 samples["NEAR BAY"],
                                 samples["NEAR OCEAN"],
                                 samples["INLAND"]
                                )

print('One-way ANOVA test:')
print(f'The test statistic is {F_stat:.3f} and the p value is {p_value:.3f}')

if p_value > alpha:
   print (f'Fail to reject the null hypothesis at the level of significance {alpha}','\n')
else:
   print (f'Reject the null hypothesis at the level of significance {alpha}','\n')

One-way ANOVA test:
The test statistic is 524.685 and the p value is 0.000
Reject the null hypothesis at the level of significance 0.05 



## Two-sample t test

The two-sample $t$ test is often used to test if two distributions have the same mean value. 

Let $\mu_1$ and $\mu_2$ denote the means of two distributions, respectively. A random sample is drawn from each distribution. $n_i$, $\overline{x}_i$, and $s_i$ are the sample size, sample mean, and sample standard deviation of group $j$, for $j=1$ and $2$.


The table below shows summarizes the three types of two-sample t test.

**Table 2: Hypothesis for two-sample t-test**
$$
\begin{array}{l l l l}
 & \text{lower-tailed test} & \text{upper-tailed test} & \text{two-tailed test} \\
\hline
\text{Null hypothesis } H_0: & \mu_1 \ge \mu_2 & \mu_1 \le \mu_2 & \mu_1 = \mu_2 \\
\text{Alternative hypothesis } H_a: & \mu_1 < \mu_2 & \mu_1 > \mu_2 & \mu_1 \neq \mu_2 \\
\text{p-value:} & P(X \le t) & P(X > t) & P(X \ge |t|) \\
\hline 
\end{array}
$$

**Unequal variance assumed**

If unequal variance of the two distributions is assumed, the test statistic  for two-sample t test is
\begin{equation}
t=\frac{\overline{x}_1-\overline{x}_2}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}, \tag{4}
\end{equation}
and the  $p$-value for each type of hypothesis tests listed in the Table is calculated using the t distribution with the degrees of freedom determined by
\begin{equation}
\mathrm{df}=\frac{\left(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1}+\frac{(s_2^2/n_2)^2}{n_2-1}}.\tag{5}
\end{equation}

**Equal variance assumed**

If equal variance of the two distributions is assumed, the test statistic  for the two-sample t test is
\begin{equation}
t=\frac{\overline{x}_1-\overline{x}_2}{\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}\sqrt{\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}}}, \tag{6}
\end{equation}
and the $p$-value for each type of hypothesis tests listed in Table is calculated using the t distribution with the degrees of freedom:
\begin{equation}
\mathrm{df} = n_1+n_2-2. \tag{7}
\end{equation}


What's the relationship between the mean value of log(MedVal) (the log median value of houses in a block) for blocks near the ocean and those near bay? You may draw a random sample of size 2,000 from each groups,  and use the two independent samples to perform a hypothesis test that compares the mean values of MedVal for the two distributions.

H0: $\mu_{\text{NEAR BAY}}=\mu_{\text{NEAR OCEAN}}$

Ha:  $\mu_{\text{NEAR BAY}} \neq \mu_{\text{NEAR OCEAN}}$

we can use **ttest_ind(sample 1, sampl 2, equal_var, alternative=types of test)** in scipy.stats

In [8]:
# two independent simple random samples of size 2000 (n=2000), one from "NEAR OCEAN" blocks, and the other from "NEAR BAY" blocks


alpha =0.05 # level of significance
n=2000 # sample size

# sample 1000 MedCal values from each group and apply the log function to the samples
samples = {
    i: np.log(CA_housing.loc[CA_housing.OceanProx == i, "MedVal"].sample(n))
    for i in ["NEAR BAY", "NEAR OCEAN"]
}

# test if the the two distribution have the same mean of MedVal using
# sp.stats.ttest_ind(sample 1, sampl 2, alternative=types of test)
t_stat,p_value=sp.stats.ttest_ind(samples["NEAR BAY"],
                                  samples["NEAR OCEAN"],
                                  nan_policy='omit', 
                                  equal_var=False,
                                  alternative='two-sided') # alternative='two-sided', 'less', 'greater'. Note: you need the latest version of scipy to use "alternative="
print('two-sample t test:')
print(f'The test statistic is {t_stat:.3f} and the p value is {p_value:.3f}')


if p_value > alpha:
   print (f'Fail to reject the null hypothesis at the level of significance {alpha}','\n')
else:
   print (f'Reject the null hypothesis at the level of significance {alpha}','\n')

two-sample t test:
The test statistic is 2.718 and the p value is 0.007
Reject the null hypothesis at the level of significance 0.05 



## One-sample z-test

The one-sample $z$ test is often used to test a hypothesis on $p$, the probability of success in Bernoulli distribution. Let $p_0$ be the hypothesized probability of success, and $\alpha$ be the level of significance in the hypothesis testing. A random sample is selected with a sample proportion $\overline{p}$. The one-sample z test is summarized in the following Table:

**Table 3: Hypothesis for one-sample z test**

$$
\begin{array}{l l l l}
 & \text{lower-tail test} & \text{upper-tail test} & \text{two-tail test} \\
\hline
\text{Null hypothesis } H_0: & p \ge p_0 & p \le p_0 & p = p_0 \\
\text{Alternative hypothesis } H_a: & p < p_0 & p > p_0 & p \neq p_0 \\
\text{p-value:} & P(X \le z) & P(X > z) & P(X \ge |z|) \\
\hline
\end{array}
$$


The test statistic is
\begin{equation}
z=\frac{\overline{p}-p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}. \tag{8}
\end{equation}

Thee p-value corresponding to the test statistic $z$ is calculated based on the standard normal distribution. 



Can you generate a random sample of size 1,000 and use the sample to test the probability that a randomly selected survey block being ``NEAR OCEAN" is greater than 0.2?

H0: p = 0.2\
Ha: p> 0.2

One-sample z-test using **proportion.proportions_ztest** in statsmodel.stats

In [9]:
n = 1000 # sample size
alpha = 0.05 # level of significance
p_0=0.2 # hypothesized probability of success

# sample 1000 blocks and count those are "NEAR OCEAN:
count = (CA_housing.sample(n).OceanProx=="NEAR OCEAN").sum()

z_stat, p_value = proportion.proportions_ztest(count=count,
                                               nobs=n,
                                               value=p_0,
                                               alternative='larger',#‘two-sided’, ‘smaller’, ‘larger’]
                                               prop_var=False) #If prop_var is false, then the variance of the proportion estimate is calculated based on the sample proportion.
print('One-sample z test:')
print(f'The test statistic is {z_stat:.3f} and the p value is {p_value:.3f}')


if p_value > alpha:
   print (f'Fail to reject the null hypothesis at the level of significance {alpha}','\n')
else:
   print (f'Reject the null hypothesis at the level of significance {alpha}','\n')

One-sample z test:
The test statistic is -3.641 and the p value is 1.000
Fail to reject the null hypothesis at the level of significance 0.05 



## Two-sample z-test

Let $p_1$ and $p_2$ be the probabilities of success for two Bernoulli distributions, repsectively. The two-sample $z$ test can be  used to determine if the two distributions have the same probability of success.

Let $n_1$ be the sample size of a random sample drawn from distribution 1. $\overline{p}_1$ is the sample proportion of success. Similarly, $n_2$ is the sample size of a sample drawn from distribution 2 and $\overline{p}_2$ is the sample proportion. The Table below summarizes the two-sample z tests.

**Table 4: Hypothesis for two-sample z test**

$$
\begin{array}{l l l l}
 & \text{lower-tailed test} & \text{upper-tailed test} & \text{two-tailed test} \\
\hline
\text{Null hypothesis } H_0: & p_1 \ge p_2 & p_1 \le p_2 & p_1 = p_2 \\
\text{Alternative hypothesis } H_a: & p_1 < p_2 & p_1 > p_2 & p_1 \neq p_2 \\
\text{p-value:} & P(X \le z) & P(X > z) & P(X \ge |z|) \\
\hline
\end{array}
$$

The test statistic is
\begin{equation}
z=\frac{\overline{p}_1-\overline{p}_2}{\sqrt{\overline{p}(1-\overline{p})(\frac{1}{n_1}+\frac{1}{n_2}})},\tag{9}
\end{equation}
where $\overline{p}$ is the pooled sample proportion:
\begin{equation}
\overline{p}=\frac{\overline{p}_1n_1+\overline{p}_2n_2}{n_1+n_2}. \tag{10}
\end{equation}

The p-value in the Table is calculated based on the standard normal distribution. 



Assume the first 5,000 rows of the dataset were from district 1, and the next 3,000 rows were from district 2.
Can you please test if the probability that a block randomly selected from district 1 being "NEAR OCEAN' is greater than that probability from district 2?

H0: p_1 $\leq$ p_2\
Ha: p_1 $>$ p_2

Two-sample z-test uses **proportion.propprtions_ztest()** from statsmodels.stats


In [10]:
# Let's test if the "NEAR OCEAN" proportion is one district is larger than that in another district

Dist_1=CA_housing.iloc[0:5000,:]
Dist_2=CA_housing.iloc[500:8000,:]

n_1 = 500 # sample size for district 1
n_2 = 300 # sample size for district 2
alpha = 0.05 # level of significance

c_1 = (Dist_1.sample(n_1).OceanProx=="NEAR OCEAN").sum() # count "NEAR OCEAN" in group 1
c_2 = (Dist_2.sample(n_2).OceanProx=="NEAR OCEAN").sum() # count "NEAR OCEAN" in group 2

z_stat, p_value = proportion.proportions_ztest(count=[c_1,c_2], 
                                               nobs=[n_1,n_2],  
                                               alternative='larger'
                                              ) #two-sided, larger, smaller
print('Two-sample z-test:')
print(f'Test statistic is {z_stat:.3f} and p value is {p_value:.3f},\n')


if p_value > alpha:
   print (f'Fail to reject the null hypothesis at the level of significance {alpha}','\n')
else:
   print (f'Reject the null hypothesis at the level of significance {alpha}','\n')

Two-sample z-test:
Test statistic is 0.369 and p value is 0.356,

Fail to reject the null hypothesis at the level of significance 0.05 



## One-way Chi-squared Goodness-of-fit Test

The one-way chi-square goodness-of-fit test examines the null hypothesis that the categorical data has the given frequencies.

Consider $F$ as the distribution of a categorical variable that takes values from the set  $\mathscr{A}=\{c_k|k=1,\dots, K\}$. Let $\{f_k|k=1,\dots,K\}$ be the hypothesized frequency distribution. The one-way $\chi^2$ test examines:

H0: the distribution that generates the data is $\{f_k|k=1,\dots,K\}$ <br>
Ha: the distribution that generates the data  is different from $\{f_k|k=1,\dots,K\}$.

Given a random sample, $x_1, \dots, x_n$, drawn from the distribution $F$, we can estimate the frequency distribution: $\{\widehat{f}_k|k=1,\dots,K\}$:
\begin{equation}
\widehat{f}_k=\sum_{i=1}^K 1\{x_i=c_k\} \tag{10}
\end{equation}
for $i=1,\dots, k$. Then, we calculate the test statistic:
\begin{equation}
Q=\sum_{i=1}^K \frac{(\widehat{f}_k-f_k)^2}{f_k}. \tag{11}
\end{equation}

The p-value is calculated based on the $\chi^2$ distribution with $K-1$ degrees of freedom. It is the probability that a random value drawn from this distribution is larger than the test statistic. If p-value is less than the level of significance, $\alpha$, we can reject the null hypothesis safely.



A hypothesized distribution of survey blocks by 'OceanProx':  [INLAND=0.15, NEAR BAY=0.1, <1H OCEAN=0.4, NEAR OCEAN=0.3, ISLAND'=0.05]. Can you generate a random sample of 2,000 and use the sample to test if the distribution generates the sample is the same as the hypothesized distribution?


H0: f $=$ [0.15, 0.1, 0.4, 0.3, 0.05]\
Ha:f $\neq$ [0.15, 0.1, 0.4, 0.3, 0.05]

One-way $\chi^2$ test using **scipy.stats.chisquare**

In [11]:
n = 2000 # sample size
alpha = 0.05 # level of significance

# what do we expect to see in proportions?
expected_distribution = [0.15, 0.3, 0.1, 0.4, 0.05]


# what counts did we see in our sample?
x=list(CA_housing['OceanProx'].sample(n))
observed_counts = [x.count('INLAND'), x.count('NEAR BAY'),x.count('<1H OCEAN'),x.count('NEAR OCEAN'),x.count('ISLAND')]
observed_counts


# counts based on expected proportions
expected_counts = list((np.array(expected_distribution) * n).astype(int))
expected_counts

# Get the stat data. sp.stats.chisquare(observed_counts,expected_counts)
chi_stat, p_value = sp.stats.chisquare(observed_counts, expected_counts)

# report
print('The one-way chi-squared test:')
print(f'Test statistic is {chi_stat:.3f} and the p value is {p_value:.3f},\n')


if p_value > alpha:
   print (f'Fail to reject the null hypothesis at the level of significance {alpha}','\n')
else:
   print (f'Reject the null hypothesis at the level of significance {alpha}','\n')

[606, 224, 907, 263, 0]

[np.int64(300), np.int64(600), np.int64(200), np.int64(800), np.int64(100)]

The one-way chi-squared test:
Test statistic is 3507.453 and the p value is 0.000,

Reject the null hypothesis at the level of significance 0.05 



## Chi-quared Contingency Test

The $\chi^2$ contingency test is used to determine if the distribution of a categorical variable is homogeneous across different populations.

For example,  to determine if the distribution of survey blocks by OceanProx is homogeneous across three districts,  we selected one sample from each district and summarized the frequency distribution accordingly:

**Table 5: Distribution by District** 

| OceanProx    | District_1 | District_2 | District_3 |
|---------------|-------------|-------------|-------------|
| INLAND        | 164         | 44          | 283         |
| NEAR BAY      | 131         | 0           | 20          |
| <1H OCEAN     | 195         | 244         | 262         |
| NEAR OCEAN    | 10          | 12          | 34          |
| ISLAND        | 0           | 0           | 1           |


We can use $\chi^2$ contingency test to examine if the distribution is homogeneous across the three districts:

H0: the distributions are homogeneous<br>
Ha: the distributions are not homogeneous

The test statistic is:
\begin{equation}
Q=\sum_{k=1}^K\sum_{l=1}^L (O_{k,l}-E_{k,l})^2/E_{k,l}  \tag{12}
\end{equation}
where $O_{k,l}$ is the observed value of category $k$ from distribution $l$, $L$ is the number of distributions, and $K$ is the number of categories of the categorical variable. $E_{k,l}$ is the expected value of category $k$ from distribution $l$, computed as:
\begin{equation}
E_{k,l}=\frac{\sum_{k=1}^K O_{k,l}\sum_{l=1}^L O_{k,l}}{\sum_{l=1}^L\sum_{k=1}^K O_{k,l}}. \tag{13}
\end{equation}

The p-value is the probability that a random value drawn from the chi-square distribution with $(K-1)(L-1)$ degrees of freedom being greater than the test statistic $Q$. If the p-value at the test statistic is less than the level of significance $\alpha$, we can safely reject the null hypothesis and conclude that the distribution of the categorical variable is not homogeneous across different populations.


Assuming the first 5000 rows of CA_housing are in the first district, the next 3000 rows are in the second district, and the following 6000 rows are in the third district.

Let's draw a random sample of size 10% from each district. Using the sample data, can you test if the distribution of `OceanProx' is homogeneous across the three districts?

H0: f_1 = f_2 = f_3\
Ha: $\exists\; f_i\neq f_j$, for $i, j \in\{1,2,3\}$

$\chi^2$ contingency test using **scipy.stats.chi2_contingency**

In [12]:
Dist_1=CA_housing.iloc[0:5000,:]
Dist_2=CA_housing.iloc[5000:8000,:]
Dist_3=CA_housing.iloc[8000:14000,:]

n_1, n_2, n_3 = (0.1*np.array([len(Dist_1), len(Dist_2), len(Dist_3)])).astype(int)
alpha = 0.05 # level of significance


c_1 = Dist_1['OceanProx'].sample(n_1,random_state=0).value_counts()
c_2 = Dist_2['OceanProx'].sample(n_2,random_state=0).value_counts()
c_3 = Dist_3['OceanProx'].sample(n_3,random_state=0).value_counts()
my_table = pd.DataFrame({'District_1': c_1, 'District_2': c_2, 'District_3': c_3}).fillna(0)
my_table


chi_stat, p_value, degrees_of_freedom, expected = sp.stats.chi2_contingency(my_table.T.values)
print('Chi-squared contingincy test:')
print(f'Test statistic is {chi_stat:.3f} and the p value is {p_value:.3f}\n')

if p_value > alpha:
   print (f'Fail to reject the null hypothesis at the level of significance {alpha}','\n')
else:
   print (f'Reject the null hypothesis at the level of significance {alpha}','\n')

Unnamed: 0_level_0,District_1,District_2,District_3
OceanProx,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
INLAND,164,44,283
NEAR BAY,131,0,20
<1H OCEAN,195,244,262
NEAR OCEAN,10,12,34
ISLAND,0,0,1


Chi-squared contingincy test:
Test statistic is 320.987 and the p value is 0.000

Reject the null hypothesis at the level of significance 0.05 



Students who need to use sampling and statistical inference in depth should read textbooks in detail. 

Wasserman, Larry. All of statistics: a concise course in statistical inference. Springer Science & Business Media, 2013.

Ross, Sheldon M. Introduction to probability and statistics for engineers and scientists. Academic press, 2020.
