# Summary

This notebook introduces the basic concepts in statistical genetics, including:

- genotype
- MAF, HWE

# Intuition

Here we will put a cartoon --- use Figure 1 below plus coding into 012?

## Something from previous slides

![figure](https://www.frontiersin.org/files/Articles/127738/fbioe-03-00013-HTML-r1/image_m/fbioe-03-00013-g001.jpg)
> Figure 1. Common genetic variations. Variations at the (A) nucleotide level and (B) structural level. (C) Single nucleotide polymorphism A/T across a population.
> 
> Cardoso JGR, Andersen MR, Herrgård MJ and Sonnenschein N (2015) Analysis of genetic variation and potential applications in genome-scale metabolic modeling. Front. Bioeng. Biotechnol. 3:13. doi: 10.3389/fbioe.2015.00013

# Notations

## Genotype

In the lectures, we use $\mathbf{X}$ to denote an $N$ by $J$ genotype matrix. For diploid individuals, $x_{ij} \in \{0,1,2\}$ is the genotype of individual $i=1,\dots,N$ at variant $j=1,\dots,J$, assuming each variant is biallelic.

Written out, the genotype matrix $\mathbf{X}$ is

\begin{equation*}
\mathbf{X} =
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1J} \\
x_{21} & x_{22} & \cdots & x_{2J} \\
\vdots & \vdots & \ddots & \vdots \\
x_{N1} & x_{N2} & \cdots & x_{NJ}
\end{bmatrix}
\end{equation*}

- Rows ($i = 1, \dots, N$) correspond to individuals.
- Columns ($j = 1, \dots, J$) correspond to variants.
- Each entry $x_{ij}$ represents the genotype of individual $i$ for variant $j$, where:
  - $0$: Homozygous for the reference allele.
  - $1$: Heterozygous.
  - $2$: Homozygous for the alternative allele.
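
As a minimal sketch of this coding, the snippet below converts genotype strings such as `"A/T"` into 0/1/2 by counting copies of the alternative allele. The helper `encode_genotype()`, the `/` separator, and the example alleles are illustrative assumptions rather than part of the lecture material.

```r
# Hypothetical helper: count copies of the alternative allele in each genotype string.
# Assumes unphased, biallelic genotypes written as "allele1/allele2".
encode_genotype <- function(gt, alt) {
  alleles <- strsplit(gt, "/", fixed = TRUE)   # split each genotype into its two alleles
  vapply(alleles, function(a) sum(a == alt), integer(1))
}

# SNP with reference allele "A" and alternative allele "T" (made-up example)
gt <- c("A/A", "A/T", "T/A", "T/T")
x <- encode_genotype(gt, alt = "T")
x
```

Because the genotypes are treated as unphased, `"A/T"` and `"T/A"` both map to 1.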
  

## MAF

The **minor allele frequency (MAF)** is a fundamental concept in statistical genetics that quantifies the frequency of the less common allele at a given genetic locus in a population.  

Given a genotype matrix $\mathbf{X}$ as above, the **allele frequency** of the alternative allele at variant $j$ is estimated by the mean of column $j$ divided by $2$, because each diploid individual carries two copies of the locus:

$$
f_j = \frac{1}{2N} \sum_{i=1}^{N} x_{ij}
$$

where $N$ is the total number of individuals.

The **minor allele frequency (MAF)** is defined as:  

$$
\text{MAF}_j = \min(f_j, 1 - f_j)
$$

ensuring that it always represents the frequency of the **less** common allele in the population.
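
The two formulas above can be sketched in a vectorized form. The toy matrix below is made up for illustration, and skipping missing calls with `na.rm = TRUE` is an assumption about how missing genotypes should be handled, not something specified in the lectures.

```r
# Toy genotype matrix (made-up values): 4 individuals x 2 variants, one missing call
X <- matrix(c(0, 2,
              1, 2,
              0, NA,
              1, 1),
            nrow = 4, byrow = TRUE)

# Alternative-allele frequency per variant: column mean over non-missing calls, divided by 2
f <- colMeans(X, na.rm = TRUE) / 2

# MAF: frequency of whichever allele is less common at each variant
maf <- pmin(f, 1 - f)
maf
```

For the second variant the alternative allele is the common one ($f_2 = 5/6$), so the MAF is $1 - f_2 = 1/6$.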


## Hardy-Weinberg Equilibrium (HWE)


**Hardy-Weinberg equilibrium (HWE)** is a principle in population genetics that describes the relationship between allele frequencies and genotype frequencies in a non-evolving population. Under HWE, allele and genotype frequencies in a population remain constant from generation to generation, assuming there is no selection, mutation, migration, genetic drift, or non-random mating.

Consider a biallelic locus with alleles $A$ (major allele) and $a$ (minor allele), with frequencies $f_A$ and $f_a$, where $f_A + f_a = 1$.

Under HWE, the expected genotype frequencies are:

$$
P(AA) = f_A^2
$$
$$
P(Aa) = 2f_Af_a
$$
$$
P(aa) = f_a^2
$$

Because $f_A + f_a = 1$, these genotype frequencies sum to one:

$$
f_A^2 + 2f_A f_a + f_a^2 = (f_A + f_a)^2 = 1
$$

In the absence of selection, mutation, migration, genetic drift, or non-random mating, the allele frequencies $f_A$ and $f_a$ stay constant between generations, and the genotype frequencies settle at the proportions above after a single generation of random mating.
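
A quick way to see these proportions emerge is to simulate random mating: if each individual draws its two alleles independently, the genotype is a Binomial(2, $f_a$) count. The sample size, seed, and allele frequency below are arbitrary assumptions chosen for illustration.

```r
set.seed(1)      # arbitrary seed, for reproducibility
n   <- 100000    # number of simulated individuals (assumed)
f_a <- 0.3       # frequency of allele a (assumed)

# Under random mating, each individual draws two alleles independently,
# so the number of copies of a follows Binomial(2, f_a)
x <- rbinom(n, size = 2, prob = f_a)

observed <- as.numeric(table(factor(x, levels = 0:2))) / n
expected <- c((1 - f_a)^2, 2 * (1 - f_a) * f_a, f_a^2)   # P(AA), P(Aa), P(aa)
round(rbind(observed, expected), 3)
```

With a sample this large, the simulated genotype frequencies should sit very close to $f_A^2$, $2 f_A f_a$ and $f_a^2$.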

**An example of when HWE may not hold**

Selection: If natural or artificial selection acts on the population, some genotypes have higher or lower fitness than others, which changes allele frequencies over time. For example, if individuals homozygous for the major allele ($AA$) have higher survival rates, the frequency of the major allele increases.
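
To make the effect of selection concrete, here is a small sketch of the standard single-locus viability selection recursion: genotypes are formed in HWE proportions at birth, weighted by their fitness, and the allele frequency is recomputed among survivors. The function name `next_freq_a()`, the fitness values, and the starting frequency are assumptions for illustration.

```r
# One generation of viability selection at a biallelic locus.
# q: frequency of allele a before selection; w: fitnesses of (AA, Aa, aa)
next_freq_a <- function(q, w) {
  p <- 1 - q
  geno <- c(p^2, 2 * p * q, q^2)      # HWE genotype frequencies at birth
  surv <- geno * w / sum(geno * w)    # genotype frequencies among survivors
  surv[3] + 0.5 * surv[2]             # frequency of a among survivors
}

w <- c(1, 1, 0)   # assumed fitnesses: aa is lethal, AA and Aa are equally fit
q <- numeric(6)
q[1] <- 0.5       # assumed starting frequency of a
for (g in 2:6) q[g] <- next_freq_a(q[g - 1], w)
round(q, 3)
```

With a lethal $aa$ genotype the recursion reduces to $q' = q / (1 + q)$, so the frequency of $a$ falls from $1/2$ to $1/3$ after one generation and keeps declining, which is the kind of allele frequency change that HWE rules out.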



# Example

In [131]:
rm(list=ls())
# Genotype matrix for 5 individuals and 3 variants
# Rows correspond to individuals, columns to variants
N <- 5
J <- 3
genotypes <- matrix(c(0, 1, 1, 2, 2, 0, 1, 1, 1, 0, 2, 1, 0, 0, 2), 
                    nrow = N, ncol = J, byrow = TRUE)
# Adding row and column names
rownames(genotypes) <- paste("Individual", 1:N)
colnames(genotypes) <- paste("Variant", 1:J)
genotypes

| | Variant 1 | Variant 2 | Variant 3 |
|---|---|---|---|
| Individual 1 | 0 | 1 | 1 |
| Individual 2 | 2 | 2 | 0 |
| Individual 3 | 1 | 1 | 1 |
| Individual 4 | 0 | 2 | 1 |
| Individual 5 | 0 | 0 | 2 |


In [132]:
# MAF Calculation for each variant
MAF <- apply(genotypes, 2, function(x) min(mean(x) / 2, 1 - mean(x) / 2))
# Print MAF
cat("Minor Allele Frequencies (MAF):\n")
for (j in 1:ncol(genotypes)) {
  cat(paste("MAF for Variant", j, ":", MAF[j]), "\n")
}


Minor Allele Frequencies (MAF):
MAF for Variant 1 : 0.3 
MAF for Variant 2 : 0.4 
MAF for Variant 3 : 0.5 


In [133]:
# Function to calculate the expected genotype frequencies under HWE
HWE_expected <- function(f_A, f_a) {
  c(f_A^2, 2*f_A*f_a, f_a^2)  # P(AA), P(Aa), P(aa)
}

# Calculate allele frequencies
# (here A is the reference allele, coded 0, and a is the alternative allele)
f_A1 <- mean(genotypes[,1] == 0) + 0.5 * mean(genotypes[,1] == 1) # Variant 1
f_a1 <- 1 - f_A1
f_A2 <- mean(genotypes[,2] == 0) + 0.5 * mean(genotypes[,2] == 1) # Variant 2
f_a2 <- 1 - f_A2
f_A3 <- mean(genotypes[,3] == 0) + 0.5 * mean(genotypes[,3] == 1) # Variant 3
f_a3 <- 1 - f_A3

# Expected genotypic frequencies for each variant under HWE
hwe_expected_variant1 <- HWE_expected(f_A1, f_a1)
hwe_expected_variant2 <- HWE_expected(f_A2, f_a2)
hwe_expected_variant3 <- HWE_expected(f_A3, f_a3)
cat("Expected Genotype Frequencies for Variant 1 (HWE):", hwe_expected_variant1, "\n")
cat("Observed Genotype Frequencies for Variant 1:", table(genotypes[,1])/nrow(genotypes), "\n")
cat("Expected Genotype Frequencies for Variant 2 (HWE):", hwe_expected_variant2, "\n")
cat("Observed Genotype Frequencies for Variant 2:", table(genotypes[,2])/nrow(genotypes), "\n")
cat("Expected Genotype Frequencies for Variant 3 (HWE):", hwe_expected_variant3, "\n")
cat("Observed Genotype Frequencies for Variant 3:", table(genotypes[,3])/nrow(genotypes), "\n")


Expected Genotype Frequencies for Variant 1 (HWE): 0.49 0.42 0.09 
Observed Genotype Frequencies for Variant 1: 0.6 0.2 0.2 
Expected Genotype Frequencies for Variant 2 (HWE): 0.16 0.48 0.36 
Observed Genotype Frequencies for Variant 2: 0.2 0.4 0.4 
Expected Genotype Frequencies for Variant 3 (HWE): 0.25 0.5 0.25 
Observed Genotype Frequencies for Variant 3: 0.2 0.6 0.2 


The expected and observed frequencies look reasonably close to each other, so informally we can say **HWE roughly holds for all three variants**, although with only five individuals the comparison is very noisy.

More formally one can test with a **chi-squared test** to compare the observed genotype frequencies with the expected frequencies under HWE. 

In [134]:
# chisq.test() expects observed counts (not proportions) in x;
# p gives the expected proportions under HWE

# Chi-squared test for deviation from Hardy-Weinberg equilibrium for Variant 1
observed_counts_variant1 <- table(factor(genotypes[,1], levels = 0:2))
chisq_test_variant1 <- chisq.test(observed_counts_variant1, p = hwe_expected_variant1)

# Chi-squared test for deviation from Hardy-Weinberg equilibrium for Variant 2
observed_counts_variant2 <- table(factor(genotypes[,2], levels = 0:2))
chisq_test_variant2 <- chisq.test(observed_counts_variant2, p = hwe_expected_variant2)

# Chi-squared test for deviation from Hardy-Weinberg equilibrium for Variant 3
observed_counts_variant3 <- table(factor(genotypes[,3], levels = 0:2))
chisq_test_variant3 <- chisq.test(observed_counts_variant3, p = hwe_expected_variant3)

# Interpretation of results
if (chisq_test_variant1$p.value < 0.05) {
  cat("\nVariant 1: HWE does not hold.\n")
} else {
  cat("\nVariant 1: HWE holds.\n")
}

if (chisq_test_variant2$p.value < 0.05) {
  cat("\nVariant 2: HWE does not hold.\n")
} else {
  cat("\nVariant 2: HWE holds.\n")
}

if (chisq_test_variant3$p.value < 0.05) {
  cat("\nVariant 3: HWE does not hold.\n")
} else {
  cat("\nVariant 3: HWE holds.\n")
}

“Chi-squared approximation may be incorrect”
“Chi-squared approximation may be incorrect”
“Chi-squared approximation may be incorrect”



Variant 1: HWE holds.

Variant 2: HWE holds.

Variant 3: HWE holds.
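
The warning above is raised by `chisq.test()` when any expected count is below 5, which is unavoidable with five individuals. The sketch below, using the Variant 1 allele frequencies from the example ($f_A = 0.7$, $f_a = 0.3$), shows the expected counts and a rough minimum sample size for the approximation to be considered reliable under that rule of thumb.

```r
# Expected genotype proportions under HWE for Variant 1 (f_A = 0.7, f_a = 0.3)
f_A <- 0.7
f_a <- 0.3
p_hwe <- c(f_A^2, 2 * f_A * f_a, f_a^2)

N <- 5
expected_counts <- N * p_hwe
expected_counts    # all below 5, so the chi-squared approximation is unreliable

# Smallest sample size for which every expected count is at least 5
N_min <- ceiling(5 / min(p_hwe))
N_min
```

So "HWE holds" here only means that a sample this small gives no evidence against HWE, not that equilibrium has been demonstrated.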


Now let's assume that **individuals carrying two copies of the risk allele at Variant 2 (genotype coded 2) do not survive after birth**. The population then becomes:

In [135]:
genotypes_selected <- genotypes[genotypes[, 2] != 2, ]

In [136]:
# Calculate allele frequencies
f_A1_selected <- mean(genotypes_selected[,1] == 0) + 0.5 * mean(genotypes_selected[,1] == 1) # Variant 1
f_a1_selected <- 1 - f_A1_selected
f_A2_selected <- mean(genotypes_selected[,2] == 0) + 0.5 * mean(genotypes_selected[,2] == 1) # Variant 2
f_a2_selected <- 1 - f_A2_selected
f_A3_selected <- mean(genotypes_selected[,3] == 0) + 0.5 * mean(genotypes_selected[,3] == 1) # Variant 3
f_a3_selected <- 1 - f_A3_selected


# Expected genotypic frequencies for each variant under HWE
hwe_expected_variant1_selected <- HWE_expected(f_A1_selected, f_a1_selected)
hwe_expected_variant2_selected <- HWE_expected(f_A2_selected, f_a2_selected)
hwe_expected_variant3_selected <- HWE_expected(f_A3_selected, f_a3_selected)
cat("===========After selection===========\n")
# factor(..., levels = 0:2) keeps genotype classes with zero observations in the table
cat("Expected Genotype Frequencies for Variant 1 (HWE):", hwe_expected_variant1_selected, "\n")
cat("Observed Genotype Frequencies for Variant 1:", table(factor(genotypes_selected[,1], levels = 0:2))/nrow(genotypes_selected), "\n")
cat("Expected Genotype Frequencies for Variant 2 (HWE):", hwe_expected_variant2_selected, "\n")
cat("Observed Genotype Frequencies for Variant 2:", table(factor(genotypes_selected[,2], levels = 0:2))/nrow(genotypes_selected), "\n")
cat("Expected Genotype Frequencies for Variant 3 (HWE):", hwe_expected_variant3_selected, "\n")
cat("Observed Genotype Frequencies for Variant 3:", table(factor(genotypes_selected[,3], levels = 0:2))/nrow(genotypes_selected), "\n")

===========After selection===========
Expected Genotype Frequencies for Variant 1 (HWE): 0.6944444 0.2777778 0.02777778 
Observed Genotype Frequencies for Variant 1: 0.6666667 0.3333333 0 
Expected Genotype Frequencies for Variant 2 (HWE): 0.4444444 0.4444444 0.1111111 
Observed Genotype Frequencies for Variant 2: 0.3333333 0.6666667 0 
Expected Genotype Frequencies for Variant 3 (HWE): 0.1111111 0.4444444 0.4444444 
Observed Genotype Frequencies for Variant 3: 0 0.6666667 0.3333333 


We can then apply the **chi-squared test** again:

In [137]:
# Use observed counts (not proportions) and keep all three genotype levels
observed_counts_variant1_selected <- table(factor(genotypes_selected[,1], levels = 0:2))
observed_counts_variant2_selected <- table(factor(genotypes_selected[,2], levels = 0:2))
observed_counts_variant3_selected <- table(factor(genotypes_selected[,3], levels = 0:2))

# Chi-squared test for Variant 1
chisq_test_v1_selected <- chisq.test(observed_counts_variant1_selected, p = hwe_expected_variant1_selected)
cat("\nChi-squared p-value for Variant 1 (HWE):", chisq_test_v1_selected$p.value, "\n")

# Chi-squared test for Variant 2
chisq_test_v2_selected <- chisq.test(observed_counts_variant2_selected, p = hwe_expected_variant2_selected)
cat("Chi-squared p-value for Variant 2 (HWE):", chisq_test_v2_selected$p.value, "\n")

# Chi-squared test for Variant 3
chisq_test_v3_selected <- chisq.test(observed_counts_variant3_selected, p = hwe_expected_variant3_selected)
cat("Chi-squared p-value for Variant 3 (HWE):", chisq_test_v3_selected$p.value, "\n")

# Interpretation of results
if (chisq_test_v1_selected$p.value < 0.05) {
  cat("\nVariant 1: HWE does not hold after selection.\n")
} else {
  cat("\nVariant 1: HWE holds after selection.\n")
}

if (chisq_test_v2_selected$p.value < 0.05) {
  cat("\nVariant 2: HWE does not hold after selection.\n")
} else {
  cat("\nVariant 2: HWE holds after selection.\n")
}

if (chisq_test_v3_selected$p.value < 0.05) {
  cat("\nVariant 3: HWE does not hold after selection.\n")
} else {
  cat("\nVariant 3: HWE holds after selection.\n")
}

“Chi-squared approximation may be incorrect”



Chi-squared p-value for Variant 1 (HWE): 0.9417645 


“Chi-squared approximation may be incorrect”


Chi-squared p-value for Variant 2 (HWE): 0.6872893 


“Chi-squared approximation may be incorrect”


Chi-squared p-value for Variant 3 (HWE): 0.6872893 

Variant 1: HWE holds after selection.

Variant 2: HWE holds after selection.

Variant 3: HWE holds after selection.
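
With only three surviving individuals the test has essentially no power, which is why HWE is never rejected here even though selection removed every genotype-2 carrier at Variant 2. As a rough sketch of how the same kind of selection looks in a larger sample, the block below uses a hypothetical helper `hwe_chisq()` and made-up counts. It uses one degree of freedom because the allele frequency is estimated from the same counts, whereas `chisq.test(x, p = ...)` uses two.

```r
# 1-df chi-squared test for HWE from genotype counts ordered as (AA, Aa, aa).
# Hypothetical helper; the allele frequency is estimated from the counts themselves.
hwe_chisq <- function(counts) {
  n   <- sum(counts)
  f_a <- (counts[3] + 0.5 * counts[2]) / n
  expected <- n * c((1 - f_a)^2, 2 * (1 - f_a) * f_a, f_a^2)
  stat <- sum((counts - expected)^2 / expected)
  c(statistic = stat, p.value = pchisq(stat, df = 1, lower.tail = FALSE))
}

# Assumed population: 1000 newborns in exact HWE proportions with f_a = 0.4,
# i.e. 360 AA, 480 Aa, 160 aa; all aa individuals are then removed by selection
before <- c(360, 480, 160)
after  <- c(360, 480, 0)

hwe_chisq(before)   # statistic is numerically zero: counts match HWE
hwe_chisq(after)    # large statistic: excess of heterozygotes among survivors
```

In this made-up example the survivors show far more heterozygotes than HWE predicts for their allele frequency, so the test rejects HWE decisively once the sample is large enough.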


# **TODO**
- [ ] Change from $X$ to $X_{raw}$ (nc stands for not-centered, i.e. raw data).
- [ ] How to remove the warning in the message: is it because the data set is too small, or is something wrong in the code?
- [ ] How to make HWE fail for Variant 2 after selection.