# $\chi^{2}$-Test

## Assumptions

The $\chi^{2}$-test is a nonparametric test that is suited to, for example, dichotomous data, which are a kind of discrete data. Mendel's yellow or green pea color, high or low pest infestation, and jagged or round leaf shape are examples of dichotomous endpoints.

### $\chi^{2}$ Goodness-of-Fit Test

The $\chi^{2}$ Goodness-of-Fit Test compares a measured distribution with a known, theoretical distribution. The classical example is the comparison of an empirical phenotype ratio with a predicted phenotype ratio in genetics.

Two-sided hypotheses:


$$H_{0}: F_{0}(x) = F_{1}(x)$$
$$H_{1}: F_{0}(x) \neq F_{1}(x)$$
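The comparison of observed and theoretical counts is usually based on Pearson's statistic $\chi^{2} = \sum_{i} (O_{i} - E_{i})^{2} / E_{i}$ with $k - 1$ degrees of freedom for $k$ categories. The following is a minimal sketch of that calculation; the helper name `pearson_gof` and the counts (32 and 8 against a 3:1 ratio) are hypothetical and only serve as an illustration.

```python
import numpy as np
from scipy import stats


def pearson_gof(observed, ratio):
    """Pearson goodness-of-fit statistic for counts against a theoretical ratio.

    Hypothetical helper for illustration; not part of SciPy.
    """
    observed = np.asarray(observed, dtype=float)
    ratio = np.asarray(ratio, dtype=float)
    # Expected counts: distribute the observed total according to the ratio.
    expected = observed.sum() * ratio / ratio.sum()
    statistic = ((observed - expected) ** 2 / expected).sum()
    dof = observed.size - 1
    # Upper-tail probability of the chi-square distribution.
    p_value = stats.chi2.sf(statistic, dof)
    return statistic, dof, p_value


# Hypothetical counts tested against a 3:1 ratio.
statistic, dof, p_value = pearson_gof([32, 8], [3, 1])
print(statistic, dof, p_value)
```

The result should agree with <tt>scipy.stats.chisquare()</tt> when the same expected counts are passed as <tt>f_exp</tt>.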

### $\chi^{2}$ Homogeneity Test

The $\chi^{2}$ Homogeneity Test checks whether the proportions in two samples differ (e.g. the share of <tt>infestation</tt> versus <tt>no infestation</tt> for treatments with and without insecticide).

$$H_{0}: \pi_{0}(x) = \pi_{1}(x)$$
$$H_{1}: \pi_{0}(x) \neq \pi_{1}(x)$$
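Under $H_{0}$ both samples share the same proportions, so the expected count of each cell is (row total × column total) / grand total. The sketch below computes these expected counts and the resulting Pearson statistic by hand; the 2x2 counts are hypothetical and chosen only to keep the arithmetic simple.

```python
import numpy as np

# Hypothetical 2x2 table: rows are the two samples, columns the two outcomes.
table = np.array([[12, 28],
                  [20, 20]])

row_totals = table.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = table.sum(axis=0, keepdims=True)   # shape (1, 2)

# Expected counts if both samples have identical proportions.
expected = row_totals * col_totals / table.sum()

statistic = ((table - expected) ** 2 / expected).sum()
dof = (table.shape[0] - 1) * (table.shape[1] - 1)
print(expected)
print(statistic, dof)
```

For this table the expected counts are 16 and 24 in each row, and the statistic matches what <tt>scipy.stats.chi2_contingency()</tt> returns with <tt>correction=False</tt>.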

Both tests can also be carried out one-sided, but this is a special case; statistics software usually offers only the two-sided variant.

## Implementation

### $\chi^{2}$ Goodness-of-Fit Test - <tt>scipy.stats.chisquare()</tt>

The scipy module contains the function <tt>scipy.stats.chisquare()</tt>. It can be called with the following syntax for a goodness-of-fit test:

    scipy.stats.chisquare(observed_values, f_exp=expected_values)
   
Both <tt>observed_values</tt> and <tt>expected_values</tt> should be 1-dimensional numpy arrays.
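As a small usage sketch with hypothetical counts: scaling the theoretical ratio by the observed total keeps the sums of observed and expected counts equal, which the test requires. If <tt>f_exp</tt> is omitted, <tt>scipy.stats.chisquare()</tt> assumes that all categories are equally likely.

```python
import numpy as np
from scipy import stats

# Hypothetical counts for three categories expected in a 1:1:1 ratio.
observed_values = np.array([18, 22, 20])
ratio = np.array([1, 1, 1])

# Scale the ratio so that the expected counts sum to the observed total.
expected_values = observed_values.sum() * ratio / ratio.sum()

result_explicit = stats.chisquare(observed_values, f_exp=expected_values)
# Without f_exp, equal expected frequencies are assumed.
result_default = stats.chisquare(observed_values)

print(result_explicit.statistic, result_explicit.pvalue)
print(result_default.statistic, result_default.pvalue)
```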

### $\chi^{2}$ Homogeneity Test for 2x2-Tables - <tt>scipy.stats.chi2_contingency()</tt>


The scipy module contains the function <tt>scipy.stats.chi2_contingency()</tt>. It can be called with the following syntax for a 2x2 homogeneity test:

    scipy.stats.chi2_contingency(data, correction = False)
   
where <tt>data</tt> should be a 2x2 numpy array.

<tt>correction</tt> states whether the Yates continuity correction is applied (commonly recommended when the number of observations is smaller than 20) or not. Note that SciPy's default is <tt>correction=True</tt>; passing <tt>correction=False</tt>, as in the call above, calculates the original $\chi^{2}$-test according to Pearson.
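The effect of the <tt>correction</tt> argument can be seen by running both variants on the same table. The counts below are hypothetical (20 observations in total); the sketch only illustrates that the Yates correction shrinks the statistic and enlarges the p-value.

```python
import numpy as np
from scipy import stats

# Hypothetical small 2x2 table with 20 observations in total.
data = np.array([[8, 2],
                 [3, 7]])

# Original Pearson test without continuity correction.
chi2_pearson, p_pearson, dof, expected = stats.chi2_contingency(data, correction=False)

# Same table with the Yates continuity correction.
chi2_yates, p_yates, _, _ = stats.chi2_contingency(data, correction=True)

print("Pearson:", chi2_pearson, p_pearson)
print("Yates:  ", chi2_yates, p_yates)
```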

## Example Snapdragon

### Experiment

A geneticist investigating the Mendelian predictions for F2 generations observed the phenotype counts shown in the following table (Baur et al., 1931 cited according to Samuels and Wittmer, 2003, p. 392).

| Red | Pink | White |
|-----|------|-------|
| 54  | 122  | 58    |

Does the observed result differ from the expected ratio of 1:2:1 for an F2 generation under intermediate Mendelian inheritance ($\alpha$ = 5\%)?

### Statistical Analysis

$$H_{0}: F_{0}(x) = F_{1}(x)$$
$$H_{1}: F_{0}(x) \neq F_{1}(x)$$

The Yates correction is not applied because there are more than 20 observations.

    import scipy
    import scipy.stats as scs
    import numpy as np

    observed_values = np.array([54, 122, 58])
    counts = observed_values.sum()
    expected_values = np.array([counts/4*1, counts/4*2, counts/4*1])  # 1:2:1

    scs.chisquare(observed_values, f_exp=expected_values)

Output:

    Power_divergenceResult(statistic=0.5641025641025641, pvalue=0.754235004823114)

<tt>statistic</tt> is the $\chi^{2}$ test statistic, while <tt>pvalue</tt> gives the two-sided p-value.
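Instead of reading the p-value, the same decision can be reached by comparing the statistic with the critical value of the $\chi^{2}$ distribution. The sketch below assumes $k - 1 = 2$ degrees of freedom for the three phenotype classes and $\alpha$ = 5\%; the variable names are chosen for illustration.

```python
from scipy import stats

alpha = 0.05
dof = 3 - 1  # three phenotype classes, no parameters estimated from the data

# Critical value: the (1 - alpha) quantile of the chi-square distribution.
critical_value = stats.chi2.ppf(1 - alpha, dof)

statistic = 0.5641025641025641  # value reported by scipy.stats.chisquare() above
reject_h0 = statistic > critical_value

print("critical value:", critical_value)
print("reject H0:", reject_h0)
```

Since the statistic lies well below the critical value of about 5.99, $H_{0}$ is retained, in agreement with the p-value.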

### Interpretation

The observed ratio of phenotypes does not differ significantly from the Mendelian ratio for an F2 generation under intermediate inheritance. $H_{0}$ cannot be rejected.

<font size="3"><div class="alert alert-warning"><b>Exercise 4.1:</b> <br> 
"Researchers studied a mutant type of flax seed that they hoped would produce oil for use in margarine and shortening. The amount of palmitic acid in the flax seed was an important factor in this research; a related factor was whether the seed was brown or variegated. The seeds were classified into six combinations of palmitic acid and color, shown in the following table. According to a hypothesized genetic model, the six combinations should occur in a 3:6:3:1:2:1 ratio" (Saedi and Rowland, 1997 cited according to Samuels and Wittmer, 2003, p. 395).


| Color   | Acid level | Count |
|---------|------------|-------|
| brown   | low        | 15 |
| brown   | medium     | 26 |
| brown   | high       | 15 |
| mottled | low        | 0  |
| mottled | medium     | 8  |
| mottled | high       | 8  |
    
    
Does the observed distribution differ from the hypothesized model?
</div>
</font>

<font size="3">
<b>Try it yourself:</b></font>

**Example Solution:**

$\chi^{2}$ Goodness-of-Fit Test according to Pearson (number of observations greater than 20).

    observed_values = np.array([15, 26, 15, 0, 8, 8])
    counts = observed_values.sum()
    expected_values = np.array([counts/16*3, counts/16*6, counts/16*3,
                                counts/16*1, counts/16*2, counts/16*1])  # 3:6:3:1:2:1

    scs.chisquare(observed_values, f_exp=expected_values)

Output:

    Power_divergenceResult(statistic=7.703703703703703, pvalue=0.17333885897201057)

The observed distribution does not differ significantly from the expected distribution at the 5\% significance level. H$_{0}$ is not rejected.

## Example Barley

### Experiment

Researchers investigated the survival rate of barley seeds after a heat treatment. Sample A was used as an untreated control group, whereas Sample B was exposed to heat. All seeds were cut longitudinally and incubated in 0.1% 2,3,5-triphenyltetrazolium chloride for half an hour. The respiring, living embryo reduces tetrazolium chloride to the intensely red, insoluble substance triphenyl formazan. Surviving seeds were counted according to color (see table below) (Bishop, 1980, p. 76).

|   | Surviving | Dead |
|---|-----------|------|
| A | 64        | 16   |
| B | 34        | 46   |

### Statistical Analysis

Does the heat treatment reduce the survival rate of barley seeds? $\alpha$ = 1\%.

$$H_{0}: \pi_{\text{no heat}}(x) \leq \pi_{\text{heat}}(x)$$

$$H_{1}: \pi_{\text{no heat}}(x) > \pi_{\text{heat}}(x)$$

Since the number of observations is adequate, no Yates correction is used. 

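    # Rows: surviving, dead; columns: sample A (control), sample B (heat).
    # This is the transpose of the table above, which leaves the test result unchanged.
    barley = np.array([[64, 34], [16, 46]])
    chi2, p, dof, expected = scipy.stats.chi2_contingency(barley, correction=False)
    print("Chi2: " + str(chi2))
    print("p-value: " + str(p))

Output:

    Chi2: 23.69980250164582
    p-value: 1.1259409041392107e-06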


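<tt>scipy.stats.chi2_contingency()</tt> always calculates the two-sided p-value. For a one-sided comparison, the p-value therefore has to be divided by two (or compared with a doubled $\alpha$), provided that the observed difference points in the direction stated in $H_{1}$. Here the survival rate is 64/80 = 0.80 without heat and 34/80 = 0.425 with heat, so this condition is met: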

    print("One-sided p-value is: " + str(p/2))

Output:

    One-sided p-value is: 5.629704520696054e-07
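Halving the two-sided p-value is only appropriate when the observed proportions differ in the direction of $H_{1}$. The following sketch makes that check explicit for the barley data; the table layout (rows as samples) and the variable names are choices made here for illustration. For a 2x2 table with the difference in the opposite direction, the one-sided p-value would instead be $1 - p/2$.

```python
import numpy as np
from scipy import stats

# Rows: sample A (no heat), sample B (heat); columns: surviving, dead.
barley = np.array([[64, 16],
                   [34, 46]])

# Observed survival rate per sample.
survival = barley[:, 0] / barley.sum(axis=1)

chi2, p_two_sided, dof, expected = stats.chi2_contingency(barley, correction=False)

# H1 states that survival without heat exceeds survival with heat.
if survival[0] > survival[1]:
    p_one_sided = p_two_sided / 2
else:
    p_one_sided = 1 - p_two_sided / 2

print("Survival rates:", survival)
print("One-sided p-value:", p_one_sided)
```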


Yes, the heat treatment significantly reduces the survival rate of barley seeds at the 1\% significance level.

<font size="3"><div class="alert alert-warning"><b>Exercise 4.2:</b> <br> 
Some species occur in association with each other in certain habitats. The reason might be that both are influenced by similar microclimates (e.g. shade plants usually appear together with other shade-loving plants), soil conditions (e.g. chalk-loving plants will be accompanied by other chalk-loving plants), or that one species creates good living conditions for the other one (e.g. host-parasite relationships), or numerous other explanations. (...) A common method for the analysis of such relationships is laying out sample squares in which the respective species are counted. The following table represents an exemplary dataset (Bishop, 1980, p. 111).
    
|            | Presence A | Absence A |
|------------|------------|-----------|
| Presence B | 25         | 75        |
| Absence B  | 25         | 75        |

Are those two species associated? $\alpha$ = 10\%.

</div>
</font>

<font size="3">
<b>Try it yourself:</b></font>

**Example Solution:**

$\chi^{2}$ Homogeneity Test according to Pearson (number of observations greater than 20).

    biotope = np.array([[25, 75], [25, 75]])
    chi2, p, dof, expected = scipy.stats.chi2_contingency(biotope, correction=False)
    print("Chi2: " + str(chi2))
    print("p-value: " + str(p))

Output:

    Chi2: 0.0
    p-value: 1.0


H$_{0}$ is not rejected. No significant difference in the percentage distribution of species A depending on species B could be detected. The species are unlikely to be associated (although the probability of wrongly retaining H$_{0}$ is unknown).