# Minor Allele Frequency

The minor allele frequency (MAF) is the proportion of the less common allele at a locus in a population. In diploid organisms such as humans, each individual carries two alleles per locus, so when genotypes are coded as counts of the minor allele, the MAF equals half the expected genotype value.

# Graphical Summary

![MAF](./cartoons/minor_allele_frequency.svg)

# Key Formula

$$\text{MAF}_j = \frac{\mathbb{E}[X_{\text{additive},j}]}{2} = \frac{1}{2N}\sum_{i=1}^{N} X_{\text{additive},ij}$$

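Where:
- $X_{\text{additive},ij}$ is the count of alternative alleles (0, 1, or 2) carried by individual $i$ at variant $j$
- $N$ is the number of individuals in the sample, so the sum on the right is the sample estimate of the expectation
- The division by 2 is needed because each diploid individual contributes two alleles at the locus
- The expression equals the MAF when the counted allele is the less common one; if the result exceeds 0.5, the MAF is one minus that value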
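
As a minimal sketch of the formula, assume a small hypothetical genotype matrix `X` (4 individuals by 2 variants, made up for illustration) coded as alternative-allele counts. The column mean divided by 2 gives the alternative-allele frequency, and `pmin` folds it to the frequency of the less common allele.

```r
# Hypothetical additive genotype matrix: rows = individuals, columns = variants.
# Entries count copies of the alternative allele (0, 1, or 2).
X <- matrix(
  c(0, 2,
    1, 2,
    0, 1,
    1, 2),
  nrow = 4, ncol = 2, byrow = TRUE,
  dimnames = list(paste("Individual", 1:4), paste("Variant", 1:2))
)

# Alternative-allele frequency: (1 / 2N) * sum of counts = column mean / 2
alt_freq <- colMeans(X) / 2

# Fold to the minor allele frequency: min(f, 1 - f)
maf <- pmin(alt_freq, 1 - alt_freq)

print(alt_freq)
print(maf)
```

In this toy matrix the alternative allele of Variant 2 is the common one (frequency 0.875), so its MAF is 0.125, while Variant 1 needs no folding (0.25).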

# Technical Details

If a locus has only two alleles, their frequencies can be written as $f_j$ and $1-f_j$, and $\text{MAF}_j$ is defined as $\min(f_j, 1 - f_j)$. This ensures that it always refers to the **less** common allele in the population (the **minor allele**) and never exceeds 0.5. If a locus has more than two alleles, each minor allele has its own frequency.
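
The multi-allelic case can be sketched as follows, assuming a hypothetical pool of 20 alleles (10 diploid individuals) with three distinct alleles at one locus. The allele labels and counts are invented for illustration only.

```r
# Hypothetical allele pool at one locus: 10 individuals, 20 alleles in total
alleles <- c(rep("A", 12), rep("C", 5), rep("T", 3))

# Frequency of each allele
allele_freq <- table(alleles) / length(alleles)

# The most frequent allele is the major allele; every other allele
# is a minor allele with its own frequency
major_allele <- names(allele_freq)[which.max(allele_freq)]
minor_freqs <- allele_freq[names(allele_freq) != major_allele]

print(major_allele)
print(minor_freqs)
```

Here `A` is the major allele, and `C` and `T` are two minor alleles with separate frequencies.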


# Example

The R example below walks through a simple MAF calculation. It stores genotype data in a matrix, splits each genotype into its two alleles, computes the allele frequencies, and reports the less common allele and its frequency for each variant.

In [1]:
# Clear the environment
rm(list = ls())

# Define genotypes for 5 individuals at 3 variants
# These represent actual alleles at each position
# For example, Individual 1 has genotypes: CC, CT, AT
genotypes <- c(
 "CC", "CT", "AT",  # Individual 1
 "TT", "TT", "AA",  # Individual 2
 "CT", "CT", "AA",  # Individual 3
 "CC", "TT", "AA",  # Individual 4
 "CC", "CC", "TT"   # Individual 5
)
# Reshape into a matrix
geno_matrix <- matrix(genotypes, nrow=5, ncol=3, byrow=TRUE)
rownames(geno_matrix) <- paste("Individual", 1:5)
colnames(geno_matrix) <- paste("Variant", 1:3)

The raw genotype matrix is:

In [2]:
geno_matrix

| | Variant 1 | Variant 2 | Variant 3 |
|---|---|---|---|
| Individual 1 | CC | CT | AT |
| Individual 2 | TT | TT | AA |
| Individual 3 | CT | CT | AA |
| Individual 4 | CC | TT | AA |
| Individual 5 | CC | CC | TT |


In [4]:
# initialize the output data frame
results <- data.frame(
  Variant = colnames(geno_matrix),
  Major_Allele = character(ncol(geno_matrix)),
  Minor_Allele = character(ncol(geno_matrix)),
  MAF = numeric(ncol(geno_matrix)),
  stringsAsFactors = FALSE
)


In [5]:
# Process each variant separately
for (j in seq_len(ncol(geno_matrix))) {
  # Step 1: Split each two-letter genotype into its two alleles
  alleles <- unlist(strsplit(geno_matrix[, j], split = ""))

  # Count each allele and convert the counts to frequencies
  allele_table <- table(alleles)
  allele_freq <- allele_table / sum(allele_table)

  # Step 2: Identify major and minor alleles (most common first)
  ordered_freqs <- sort(allele_freq, decreasing = TRUE)
  results$Major_Allele[j] <- names(ordered_freqs)[1]

  # Step 3: Record the minor allele and its frequency (MAF).
  # A monomorphic variant has no second allele, so its MAF is 0.
  if (length(ordered_freqs) >= 2) {
    results$Minor_Allele[j] <- names(ordered_freqs)[2]
    results$MAF[j] <- as.numeric(ordered_freqs[2])
  } else {
    results$Minor_Allele[j] <- NA_character_
    results$MAF[j] <- 0
  }
}


The minor allele frequencies for the three variants are:

In [6]:
results

| Variant | Major_Allele | Minor_Allele | MAF |
|---|---|---|---|
| Variant 1 | C | T | 0.3 |
| Variant 2 | T | C | 0.4 |
| Variant 3 | A | T | 0.3 |
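
As a cross-check against the key formula, the same genotypes can be recoded additively (copies of the minor allele per individual) and averaged. This is a sketch rather than part of the original workflow: the helper `count_copies` and the object `X_additive` are names introduced here, and the minor alleles are taken from the results table above.

```r
# Same genotypes as in the example above (5 individuals, 3 variants)
geno_matrix <- matrix(
  c("CC", "CT", "AT",
    "TT", "TT", "AA",
    "CT", "CT", "AA",
    "CC", "TT", "AA",
    "CC", "CC", "TT"),
  nrow = 5, ncol = 3, byrow = TRUE,
  dimnames = list(paste("Individual", 1:5), paste("Variant", 1:3))
)

# Minor alleles reported in the results table above
minor_alleles <- c("T", "C", "T")

# Hypothetical helper: number of copies of `allele` in each genotype string
count_copies <- function(genotypes, allele) {
  vapply(strsplit(genotypes, split = ""),
         function(a) sum(a == allele), numeric(1))
}

# Additive coding (0, 1, or 2 copies of the minor allele) for every variant
X_additive <- sapply(seq_len(ncol(geno_matrix)), function(j) {
  count_copies(geno_matrix[, j], minor_alleles[j])
})
colnames(X_additive) <- colnames(geno_matrix)

# Key formula: MAF_j = (1 / 2N) * sum_i X_ij = column mean / 2
maf <- colMeans(X_additive) / 2
print(maf)
```

The column means divided by 2 reproduce the MAF values in the table (0.3, 0.4, 0.3), which shows that the allele-counting loop and the additive formula are two views of the same calculation.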
