# Hardy-Weinberg Equilibrium

The Hardy-Weinberg equilibrium (HWE) describes how allele and genotype frequencies remain constant across generations in a population where there is no interference from evolutionary forces.

# Graphical Summary

![Fig](./graphical_summary/slides/Slide3.png)

# Key Formula

For a genetic variant with two alleles (`A` and `a`) with frequencies $f_A$ and $f_a$ respectively (where $f_A + f_a = 1$):

$$
f_A^2 + 2f_A f_a + f_a^2 = 1
$$

Where:
- $f_A^2$ = frequency of genotype AA
- $2f_A f_a$ = frequency of genotype Aa
- $f_a^2$ = frequency of genotype aa
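As a quick numerical check of the identity above, the following R sketch computes the three genotype frequencies and confirms that they sum to one. The allele frequency of 0.7 and the variable names are assumptions chosen purely for illustration.

```r
# Illustrative allele frequency (an assumed value, not taken from any data set)
f_A <- 0.7
f_a <- 1 - f_A

# Genotype frequencies implied by HWE
geno_freq <- c(AA = f_A^2, Aa = 2 * f_A * f_a, aa = f_a^2)
print(geno_freq)

# The three frequencies sum to 1 (up to floating-point error)
stopifnot(isTRUE(all.equal(sum(geno_freq), 1)))
```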

# Technical Details


## Expected Genotype Counts Under HWE
The expected counts of each genotype under HWE for a population of $N$ individuals are:

$$
E_{AA} = f_A^2 \cdot N
$$

$$
E_{Aa} = 2f_A f_a \cdot N
$$

$$
E_{aa} = f_a^2 \cdot N
$$

where:  
- $f_A$: frequency of allele A
- $f_a = 1 - f_A$: frequency of allele a
- $N$: total number of individuals
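The three formulas above can be wrapped in a small helper. This is a minimal sketch: the function name `hwe_expected_counts` and the inputs ($f_A = 0.6$, $N = 200$) are hypothetical and chosen only for illustration.

```r
# Hypothetical helper: expected genotype counts under HWE
hwe_expected_counts <- function(f_A, N) {
  stopifnot(f_A >= 0, f_A <= 1, N > 0)
  f_a <- 1 - f_A
  c(AA = f_A^2 * N, Aa = 2 * f_A * f_a * N, aa = f_a^2 * N)
}

# Illustrative inputs (assumed): f_A = 0.6 in a sample of 200 individuals
expected <- hwe_expected_counts(f_A = 0.6, N = 200)
print(expected)
```

The expected counts always add up to $N$, because the genotype frequencies sum to one.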

## Test HWE Using Chi-squared Test

Given the observed and expected genotype counts, Pearson's chi-squared test can be used to assess whether HWE holds:

$$
\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}
$$

where:  
- $O_i$ = Observed genotype count (AA, Aa, aa).  
- $E_i$ = Expected genotype count under Hardy-Weinberg Equilibrium.  
- The summation runs over all genotype categories.  

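To determine statistical significance, compare $\chi^2$ with a **chi-squared distribution** with **1 degree of freedom** (df = 1): three genotype classes, minus one because the counts sum to $N$, minus one because $f_A$ is estimated from the data. The p-value is computed as: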

$$
p = P(\chi^2 > \text{observed } \chi^2)
$$

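If $p < 0.05$, we reject the **Hardy-Weinberg Equilibrium** assumption at the 5% significance level. If $p \ge 0.05$, we fail to reject it, which is not the same as proving that HWE holds.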
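The whole procedure can be sketched as one R function that takes the three observed genotype counts. The function name `hwe_chisq_test` and the example counts (50, 40, 10) are hypothetical; the sketch also assumes both alleles are present in the sample, since otherwise some expected counts are zero and the statistic is undefined.

```r
# Hypothetical helper: Pearson chi-squared test of HWE for a biallelic locus.
# Assumes both alleles are observed, so that no expected count is zero.
hwe_chisq_test <- function(n_AA, n_Aa, n_aa) {
  observed <- c(AA = n_AA, Aa = n_Aa, aa = n_aa)
  N <- sum(observed)

  # Allele frequencies estimated from the genotype counts
  f_A <- (2 * n_AA + n_Aa) / (2 * N)
  f_a <- 1 - f_A

  # Expected counts under HWE and the Pearson statistic
  expected <- c(AA = f_A^2, Aa = 2 * f_A * f_a, aa = f_a^2) * N
  chi_sq <- sum((observed - expected)^2 / expected)

  # Upper-tail probability with 1 degree of freedom
  p_value <- pchisq(chi_sq, df = 1, lower.tail = FALSE)

  list(f_A = f_A, expected = expected, chi_sq = chi_sq, p_value = p_value)
}

# Illustrative counts (assumed, not real data)
res <- hwe_chisq_test(n_AA = 50, n_Aa = 40, n_aa = 10)
cat("chi_sq =", round(res$chi_sq, 4), " p =", signif(res$p_value, 3), "\n")
```

Using `lower.tail = FALSE` instead of `1 - pchisq(...)` gives the same p-value but avoids loss of precision when the p-value is very small.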

## When HWE Doesn't Hold
- **Non-random mating**: When individuals choose mates based on genotype or phenotype, genotype frequencies depart from HWE expectations; for example, mating between similar or related individuals increases homozygosity.
- **Natural selection**: When certain genotypes have survival or reproductive advantages, their frequencies change between generations.
- **Migration**: Gene flow introduces new alleles from other populations, altering local allele frequencies.
- **Genetic drift**: Random sampling effects in small populations cause unpredictable changes in allele frequencies.
- **Mutation**: New alleles emerge through mutation, changing the genetic composition of the population.
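To make the genetic drift point concrete, the sketch below simulates allele frequencies over generations by drawing $2N$ allele copies from a binomial distribution each generation. This Wright-Fisher-style sampling scheme, the function name `simulate_drift`, and all parameter values are assumptions introduced for illustration only.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical helper: allele frequency trajectory under pure drift
simulate_drift <- function(f_A0, N, generations) {
  f_A <- numeric(generations + 1)
  f_A[1] <- f_A0
  for (g in seq_len(generations)) {
    # Each generation draws 2N allele copies from the previous generation
    n_A <- rbinom(1, size = 2 * N, prob = f_A[g])
    f_A[g + 1] <- n_A / (2 * N)
  }
  f_A
}

# Same starting frequency, very different population sizes
small <- simulate_drift(f_A0 = 0.5, N = 20, generations = 50)
large <- simulate_drift(f_A0 = 0.5, N = 20000, generations = 50)

cat("N = 20:    final f_A =", round(small[51], 3), "\n")
cat("N = 20000: final f_A =", round(large[51], 3), "\n")
```

In the large population the frequency should stay close to its starting value, while in the small population it can wander substantially or even reach 0 or 1.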

## Common Misconceptions
- While entire genomes aren't in HWE, individual loci often are, especially neutral markers.
- HWE is surprisingly robust to minor violations of its assumptions.
- Deviations from HWE are not always errors; they can also signal important biological phenomena.

## Role of HWE in Statistical Genetics

Hardy-Weinberg Equilibrium serves several critical functions in statistical genetics:

- **Quality control baseline**: In genotyping studies, significant deviations from HWE are commonly treated as a sign of genotyping errors or technical artifacts, providing an efficient method to identify problematic markers, although genuine biological causes remain possible.

- **Null hypothesis framework**: HWE establishes the expected genotype distribution under neutral conditions, serving as the statistical null model against which evolutionary forces can be detected.

- **Allele frequency estimation**: When only partial genotype data is available, HWE principles allow researchers to estimate complete population allele frequencies.

- **Statistical power improvement**: Filtering out markers that violate HWE improves signal-to-noise ratio in association studies, increasing power to detect true genetic effects.

- **Population structure inference**: Systematic HWE deviations across multiple loci can reveal cryptic population substructure that might confound genetic analyses.
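As an illustration of the allele frequency estimation point, consider the textbook case of complete dominance, where AA and Aa individuals look alike and only aa individuals can be counted. Assuming HWE, the aa frequency equals $f_a^2$, so $f_a$ can be estimated as its square root. The counts below are hypothetical.

```r
# Hypothetical counts: only the recessive phenotype (aa) is distinguishable
n_recessive <- 16
N_total <- 400

# Under HWE the aa genotype frequency equals f_a^2, so f_a = sqrt(n_aa / N)
f_a_hat <- sqrt(n_recessive / N_total)
f_A_hat <- 1 - f_a_hat

# Implied heterozygote (Aa) frequency under HWE
carrier_freq <- 2 * f_A_hat * f_a_hat

cat("Estimated f_a =", f_a_hat, "\n")
cat("Estimated f_A =", f_A_hat, "\n")
cat("Estimated Aa frequency =", carrier_freq, "\n")
```

This estimate is only as good as the HWE assumption itself; if the locus deviates from HWE, the square-root estimate is biased.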

# Example

Related topics:
- [genotype coding](https://gaow.github.io/statgen-prerequisites/genotype_coding.html)
- [minor allele frequency](https://gaow.github.io/statgen-prerequisites/minor_allele_frequency.html)

## Example 1 -- HWE holds

Here we use data from E.B. Ford (1971) on the scarlet tiger moth, as presented in the [Wikipedia article on the Hardy–Weinberg principle](https://en.wikipedia.org/wiki/Hardy–Weinberg_principle), for which the phenotypes of a sample of the population were recorded.

**Table 1: Example Hardy–Weinberg Principle Calculation**

| Phenotype | White-spotted (AA) | Intermediate (Aa) | Little spotting (aa) | Total |
|-----------|--------------------|-------------------|----------------------|-------|
| Count     | 1469               | 138               | 5                    | 1612  |


We first test for HWE using the original data above, and then construct a modified data set to show a case where HWE does not hold.

In [21]:
# Clear the environment
rm(list = ls())

# Data from E. B. Ford (1971) on the scarlet tiger moth
N_AA <- 1469  # White-spotted
N_Aa <- 138   # Intermediate
N_aa <- 5     # Little spotting
N_total <- N_AA + N_Aa + N_aa


In [22]:
# Calculate observed allele frequencies
N_A <- (2 * N_AA + N_Aa)
N_a <- (2 * N_aa + N_Aa)

f_A <- N_A / (2 * N_total)  # Frequency of A allele
f_a <- N_a / (2 * N_total)  # Frequency of a allele


Allele frequencies for A and a are:

In [23]:
cat("f_A =", round(f_A, 4), "\n")
cat("f_a =", round(f_a, 4), "\n")

f_A = 0.9541 
f_a = 0.0459 


Then we calculate the expected genotype counts if HWE holds:

In [24]:
# Calculate expected genotype counts under HWE
N_exp_AA <- f_A^2 * N_total
N_exp_Aa <- 2*f_A*f_a * N_total
N_exp_aa <- f_a^2 * N_total

# Create a table of observed vs expected
genotypes <- c("AA", "Aa", "aa")
N_observed <- c(N_AA, N_Aa, N_aa)
N_expected <- c(N_exp_AA, N_exp_Aa, N_exp_aa)
results <- data.frame(Genotype = genotypes, Observed = N_observed, Expected = N_expected)
results$Difference <- results$Observed - results$Expected
results

| Genotype | Observed | Expected    | Difference |
|----------|----------|-------------|------------|
| AA       | 1469     | 1467.397022 | 1.602978   |
| Aa       | 138      | 141.205955  | -3.205955  |
| aa       | 5        | 3.397022    | 1.602978   |


We perform a chi-squared test to see if HWE holds:

In [27]:
# Perform chi-square test
chi_sq <- sum((N_observed - N_expected)^2/N_expected)
degrees_freedom <- 1  # 3 genotype classes - 1 - 1 estimated allele frequency = 1
p_value <- 1 - pchisq(chi_sq, degrees_freedom)
cat("Chi-square statistic =", round(chi_sq, 4), "\n")
cat("Degrees of freedom =", degrees_freedom, "\n")
cat("p-value =", format(p_value, scientific = TRUE), "\n")

Chi-square statistic = 0.8309 
Degrees of freedom = 1 
p-value = 3.619985e-01 
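An equivalent way to read this result is to compare the statistic with the critical value of the chi-squared distribution at the chosen significance level. The sketch below recomputes the statistic from the Table 1 counts so that it is self-contained; the variable names are chosen for this illustration.

```r
# Counts from Table 1 (scarlet tiger moth data)
observed <- c(AA = 1469, Aa = 138, aa = 5)
N <- sum(observed)

# Allele frequencies and expected counts under HWE
f_A <- (2 * observed[["AA"]] + observed[["Aa"]]) / (2 * N)
f_a <- 1 - f_A
expected <- c(f_A^2, 2 * f_A * f_a, f_a^2) * N
chi_sq <- sum((observed - expected)^2 / expected)

# Critical value of the chi-squared distribution at alpha = 0.05, df = 1
critical_value <- qchisq(0.95, df = 1)

cat("chi_sq =", round(chi_sq, 4), " critical value =", round(critical_value, 4), "\n")
cat("Reject HWE at alpha = 0.05:", chi_sq > critical_value, "\n")
```

Rejecting when the statistic exceeds the critical value is the same decision rule as rejecting when the p-value is below $\alpha$.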


If we set $\alpha=0.05$, then here is the conclusion:

In [28]:
# Conclusion
alpha <- 0.05
if(p_value < alpha) {
  cat("The population deviates significantly from Hardy-Weinberg equilibrium (p < 0.05)")
} else {
  cat("No significant deviation from Hardy-Weinberg equilibrium detected (p >= 0.05)")
}

No significant deviation from Hardy-Weinberg equilibrium detected (p >= 0.05)

## Example 2 -- HWE doesn't hold

Now imagine that we observe more Intermediate (Aa) individuals and no Little spotting (aa) individuals, and redo the analysis:

**Table 2: Example Hardy–Weinberg Principle Calculation (manually adjusted in this notebook)**

| Phenotype | White-spotted (AA) | Intermediate (Aa) | Little spotting (aa) | Total |
|-----------|--------------------|-------------------|----------------------|-------|
| Count     | 1469               | 500               | 0                    | 1969  |


In [37]:
# Clear the environment
rm(list = ls())

# Manually adjusted counts (modified from the scarlet tiger moth data; not real observations)
N_AA <- 1469  # White-spotted
N_Aa <- 500   # Intermediate
N_aa <- 0     # Little spotting
N_total <- N_AA + N_Aa + N_aa


In [38]:
# Calculate observed allele frequencies
N_A <- (2 * N_AA + N_Aa)
N_a <- (2 * N_aa + N_Aa)

f_A <- N_A / (2 * N_total)  # Frequency of A allele
f_a <- N_a / (2 * N_total)  # Frequency of a allele


Allele frequencies for A and a are:

In [39]:
cat("f_A =", round(f_A, 4), "\n")
cat("f_a =", round(f_a, 4), "\n")

f_A = 0.873 
f_a = 0.127 


Then we calculate the expected genotype counts if HWE holds:

In [40]:
# Calculate expected genotype counts under HWE
N_exp_AA <- f_A^2 * N_total
N_exp_Aa <- 2*f_A*f_a * N_total
N_exp_aa <- f_a^2 * N_total

# Create a table of observed vs expected
genotypes <- c("AA", "Aa", "aa")
N_observed <- c(N_AA, N_Aa, N_aa)
N_expected <- c(N_exp_AA, N_exp_Aa, N_exp_aa)
results <- data.frame(Genotype = genotypes, Observed = N_observed, Expected = N_expected)
results$Difference <- results$Observed - results$Expected
results

| Genotype | Observed | Expected | Difference |
|----------|----------|----------|------------|
| AA       | 1469     | 1500.742 | -31.742    |
| Aa       | 500      | 436.516  | 63.484     |
| aa       | 0        | 31.742   | -31.742    |


We perform a chi-squared test to see if HWE holds:

In [41]:
# Perform chi-square test
chi_sq <- sum((N_observed - N_expected)^2/N_expected)
degrees_freedom <- 1  # 3 genotype classes - 1 - 1 estimated allele frequency = 1
p_value <- 1 - pchisq(chi_sq, degrees_freedom)
cat("Chi-square statistic =", round(chi_sq, 4), "\n")
cat("Degrees of freedom =", degrees_freedom, "\n")
cat("p-value =", format(p_value, scientific = TRUE), "\n")

Chi-square statistic = 41.6461 
Degrees of freedom = 1 
p-value = 1.093853e-10 


If we set $\alpha=0.05$, then here is the conclusion:

In [42]:
# Conclusion
alpha <- 0.05
if(p_value < alpha) {
  cat("The population deviates significantly from Hardy-Weinberg equilibrium (p < 0.05)")
} else {
  cat("No significant deviation from Hardy-Weinberg equilibrium detected (p >= 0.05)")
}

The population deviates significantly from Hardy-Weinberg equilibrium (p < 0.05)