# Summary

This notebook introduces the basic concepts in statistical genetics, including:

- LD, LD score

# Intuition

Here we will put a cartoon 

# Notation

## Linkage Disequilibrium (LD)

Linkage Disequilibrium (LD) refers to the non-random association of alleles at two or more loci. In a population, if two variants (or loci) are in LD, their allele combinations occur more or less frequently than expected based on their individual allele frequencies. 

## Covariance of $\mathbf{X}$

Given the raw genotype matrix $\mathbf{X}_{\text{raw}}$ (individuals in rows, variants in columns), we first center it by subtracting the mean genotype of each variant from that variant's column, which gives $\mathbf{X}$. The covariance matrix can then be computed as:

$$
\text{Cov}(\mathbf{X}) = \frac{1}{N} \mathbf{X}^T \mathbf{X}
$$

The covariance between two variants $j_1$ and $j_2$ is computed as:

\begin{equation*}
\text{cov}(x_{j_1}, x_{j_2}) = \frac{1}{N} \sum_{i=1}^{N} \left( x_{i j_1} - \bar{x}_{j_1} \right) \left( x_{i j_2} - \bar{x}_{j_2} \right),
\end{equation*}

where $x_{ij}$ is the raw genotype of individual $i$ at variant $j$, and $\bar{x}_{j_1}$ and $\bar{x}_{j_2}$ are the mean genotypes for variants $j_1$ and $j_2$, respectively (subtracting these means is the **centering** step):

\begin{equation*}
\bar{x}_{j_1} = \frac{1}{N} \sum_{i=1}^{N} x_{i j_1}, \quad \bar{x}_{j_2} = \frac{1}{N} \sum_{i=1}^{N} x_{i j_2}.
\end{equation*}


Where:
- $\mathbf{X_{\text{raw}}}$ is the raw genotype matrix.
- $\mathbf{X}$ is the centered genotype matrix.
- $N$ is the number of individuals.
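
As a quick check of the two covariance formulas above, the following R sketch centers a toy genotype matrix and compares the matrix form with the element-wise sum. The matrix values are the ones used in the example at the end of this notebook; the variable names (`X_raw`, `cov_N`, `cov_12`, `cov_builtin`) are illustrative choices, not part of any established pipeline. Note that R's built-in `cov()` divides by $N - 1$ rather than $N$, so it is rescaled here for comparison.

```r
# Toy genotype matrix: 5 individuals (rows) x 3 variants (columns)
X_raw <- matrix(c(0, 1, 1,
                  2, 2, 0,
                  1, 1, 1,
                  0, 2, 1,
                  0, 0, 2),
                nrow = 5, ncol = 3, byrow = TRUE)
N <- nrow(X_raw)

# Centering: subtract each variant's mean genotype from its column
X <- sweep(X_raw, 2, colMeans(X_raw))

# Matrix form with the 1/N convention; crossprod(X) is t(X) %*% X
cov_N <- crossprod(X) / N

# Element-wise form for variants 1 and 2
cov_12 <- mean((X_raw[, 1] - mean(X_raw[, 1])) *
               (X_raw[, 2] - mean(X_raw[, 2])))

# Built-in cov() uses the N - 1 denominator, so rescale it to 1/N
cov_builtin <- cov(X_raw) * (N - 1) / N

cov_N
cov_12
```

Under these assumptions `cov_N[1, 2]` and `cov_12` agree, and `cov_N` matches the rescaled `cov_builtin`. Which denominator is used does not matter for the correlation, because it cancels.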

## $r$ and $r^2$

To calculate the **correlation** between variants, we normalize the covariance of each pair of variants by the standard deviations of the two variants. If $\mathbf{X}$ is centered but not scaled to unit variance, the correlation matrix can be computed as:

$$
\mathbf{R} = \frac{\mathbf{X}^T \mathbf{X}}{\sqrt{\text{diag}(\mathbf{X}^T \mathbf{X}) \cdot \text{diag}(\mathbf{X}^T \mathbf{X})^T}}
$$

Where:
- $\text{diag}(\mathbf{X}^T \mathbf{X})$ is the column vector of diagonal entries of $\mathbf{X}^T \mathbf{X}$, which equal $N$ times the variance of each variant; the factor $N$ cancels between numerator and denominator.
- The square root and the division are element-wise, so $R_{jk} = (\mathbf{X}^T \mathbf{X})_{jk} / \sqrt{(\mathbf{X}^T \mathbf{X})_{jj} (\mathbf{X}^T \mathbf{X})_{kk}}$.
- $r_{j,k}$ denotes the entry $R_{jk}$, and $r^2_{j,k}$ is its square.
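
The element-wise normalization can be sketched in R as follows. This is a minimal illustration assuming the same toy genotype matrix as the example at the end of this notebook; `XtX`, `d`, `R_mat` and `r2` are names introduced here for illustration only. The outer product `outer(d, d)` plays the role of $\text{diag}(\mathbf{X}^T \mathbf{X}) \cdot \text{diag}(\mathbf{X}^T \mathbf{X})^T$.

```r
# Toy genotype matrix: 5 individuals (rows) x 3 variants (columns)
X_raw <- matrix(c(0, 1, 1,
                  2, 2, 0,
                  1, 1, 1,
                  0, 2, 1,
                  0, 0, 2),
                nrow = 5, ncol = 3, byrow = TRUE)

# Centered but not scaled
X <- sweep(X_raw, 2, colMeans(X_raw))

XtX <- crossprod(X)   # t(X) %*% X
d   <- diag(XtX)      # N times the variance of each variant

# Element-wise division by sqrt(d_j * d_k) gives the correlation matrix
R_mat <- XtX / sqrt(outer(d, d))

# Squared correlations, the r^2 used as the LD measure
r2 <- R_mat^2

R_mat
r2
```

`R_mat` should agree with `cor(X_raw)`, since the Pearson correlation does not depend on how the columns are centered or scaled.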


## LD Score

The **LD score** measures the extent to which a given variant is in linkage disequilibrium (LD) with other variants. It summarizes how much genetic variation a variant shares, through LD, with the other variants in a region of interest. The LD score for a variant $j$ is defined as the sum of the squared correlation coefficients $r^2$ between that variant and all other variants, in practice usually restricted to a specified genomic window or region around it. Mathematically, the LD score for variant $j$ is given by:

\begin{equation*}
\text{LD Score}_j = \sum_{k \neq j} r_{j,k}^2
\end{equation*}

where:

- $r_{j,k}$ is the correlation coefficient between variants $j$ and $k$,
- The sum is taken over all variants $k \neq j$ in the region of interest.

This score reflects how strongly a variant is correlated with the variants around it, and so gives insight into the local structure of LD around that variant.
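
To illustrate the "specified genomic window" mentioned above, here is a hedged R sketch of a windowed LD score. The helper `ld_score_window()`, the base-pair positions in `pos`, and the window sizes are all hypothetical and chosen only for illustration; the $r^2$ values are the ones obtained in the example at the end of this notebook.

```r
# Hypothetical helper: LD score of each variant, summing r^2 over the
# other variants (k != j) that lie within `window` base pairs of variant j
ld_score_window <- function(r2, pos, window) {
  stopifnot(nrow(r2) == ncol(r2), length(pos) == nrow(r2))
  sapply(seq_along(pos), function(j) {
    in_window <- abs(pos - pos[j]) <= window & seq_along(pos) != j
    sum(r2[j, in_window])
  })
}

# r^2 matrix for three variants
r2 <- matrix(c(1,       0.21875,   0.625,
               0.21875, 1,         0.7142857,
               0.625,   0.7142857, 1),
             nrow = 3, byrow = TRUE)

# Hypothetical base-pair positions of the three variants
pos <- c(100, 150, 400)

# Narrow window: only variants 1 and 2 are neighbours of each other
scores_narrow <- ld_score_window(r2, pos, window = 100)

# Unbounded window: every other variant contributes
scores_all <- ld_score_window(r2, pos, window = Inf)

scores_narrow
scores_all
```

With the narrow window, variant 3 has no neighbours and its score is 0; with an unbounded window the result is the plain sum over all $k \neq j$.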

### Purpose of LD Score

- **Genetic Association Studies**: LD scores help account for correlations between variants in genome-wide association studies (GWAS) and in polygenic risk score (PRS) analysis.
- **Controlling for Confounding**: Because of LD, the association signal at one variant is shared with nearby correlated variants, so for a polygenic trait variants with high LD scores tend to show stronger association signals. Relating test statistics to LD scores helps separate this expected inflation from inflation caused by confounding.
- **Estimating Heritability**: In LD score regression, GWAS test statistics are regressed on LD scores, and the slope is used to estimate the heritability of a complex trait.

### Interpretation

- **High LD Score**: The variant is in strong LD with many other variants, meaning it shares a substantial amount of genetic variation with its neighbors.
- **Low LD Score**: The variant is not strongly correlated with many other variants, for example because it lies in a region of low LD.

In summary, the LD score quantifies how much a variant is correlated with the variants around it, and it is used in genetic studies to account for the LD structure of the genome.


# A case example

In [19]:
rm(list=ls())
# Genotype matrix for 5 individuals and 3 variants
# Rows correspond to individuals, columns to variants
N <- 5  # number of individuals
J <- 3  # number of variants
genotypes <- matrix(c(0, 1, 1, 2, 2, 0, 1, 1, 1, 0, 2, 1, 0, 0, 2), 
                    nrow = N, ncol = J, byrow = TRUE)
# genotypes <- matrix(sample(0:2, N*J, replace = TRUE), 
#                    nrow = N, ncol = J)
# Adding row and column names
rownames(genotypes) <- paste("Individual", 1:N)
colnames(genotypes) <- paste("Variant", 1:J)
genotypes

| | Variant 1 | Variant 2 | Variant 3 |
|---|---|---|---|
| Individual 1 | 0 | 1 | 1 |
| Individual 2 | 2 | 2 | 0 |
| Individual 3 | 1 | 1 | 1 |
| Individual 4 | 0 | 2 | 1 |
| Individual 5 | 0 | 0 | 2 |


In [20]:
# Center each variant to mean 0 and scale it to unit sample standard deviation
X <- scale(genotypes, center = TRUE, scale = TRUE)
X

| | Variant 1 | Variant 2 | Variant 3 |
|---|---|---|---|
| Individual 1 | -0.6708204 | -0.2390457 | 0.0 |
| Individual 2 | 1.5652476 | 0.9561829 | -1.414214 |
| Individual 3 | 0.4472136 | -0.2390457 | 0.0 |
| Individual 4 | -0.6708204 | 0.9561829 | 0.0 |
| Individual 5 | -0.6708204 | -1.4342743 | 1.414214 |


In [21]:
calculate_ld <- function(X) {
  M <- ncol(X)
  ld_matrix <- matrix(NA, nrow = M, ncol = M)
  for (j in 1:M) {
    for (k in 1:M) {
      if (j == k) {
        ld_matrix[j, k] <- 1  # r^2 of a variant with itself is 1
      } else {
        r_jk <- cor(X[, j], X[, k])  # Calculate correlation
        ld_matrix[j, k] <- r_jk^2    # Store r^2 (LD)
      }
    }
  }
  return(ld_matrix)
}


ld_matrix <- calculate_ld(X)
ld_matrix

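| | Variant 1 | Variant 2 | Variant 3 |
|---|---|---|---|
| Variant 1 | 1.0 | 0.21875 | 0.625 |
| Variant 2 | 0.21875 | 1.0 | 0.7142857 |
| Variant 3 | 0.625 | 0.7142857 | 1.0 |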
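
The double loop makes the pairwise calculation explicit. As a cross-check, the same matrix can be obtained in one step, because `cor()` applied to a matrix returns all pairwise column correlations. The sketch below rebuilds the genotype matrix so that it runs on its own; `ld_vectorized` is a name introduced here for illustration. It uses the raw genotypes directly, since the correlation is unchanged by centering and scaling.

```r
# Same genotype matrix as in the cells above
genotypes <- matrix(c(0, 1, 1, 2, 2, 0, 1, 1, 1, 0, 2, 1, 0, 0, 2),
                    nrow = 5, ncol = 3, byrow = TRUE)

# All pairwise correlations at once, then squared element-wise
ld_vectorized <- cor(genotypes)^2
ld_vectorized
```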


In [22]:
calculate_ld_scores <- function(ld_matrix) {
  # Sum r^2 across each row, excluding the diagonal (k != j) as in the definition
  rowSums(ld_matrix) - diag(ld_matrix)
}

# Compute the LD matrix
ld_matrix <- calculate_ld(X)

# Compute LD scores
ld_scores <- calculate_ld_scores(ld_matrix)
ld_scores


Variants 2 and 3 have the highest pairwise LD ($r^2 \approx 0.714$), and Variant 3 is also in fairly high LD with Variant 1 ($r^2 = 0.625$), so Variant 3 has the highest LD score of the three.
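
The LD scores themselves are not printed above. Under the definition $\sum_{k \neq j} r_{j,k}^2$ and the $r^2$ values in the LD matrix, a hand calculation (rounded to four decimals) gives:

\begin{equation*}
\begin{aligned}
\text{LD Score}_1 &= 0.21875 + 0.625 = 0.84375, \\
\text{LD Score}_2 &= 0.21875 + 0.7142857 \approx 0.9330, \\
\text{LD Score}_3 &= 0.625 + 0.7142857 \approx 1.3393.
\end{aligned}
\end{equation*}

If the diagonal term $r_{j,j}^2 = 1$ is included in the sum, each score increases by exactly 1 and the ranking of the variants is unchanged.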

# TODO

- [ ] double check details of normalization and everything 
- [ ] check with Gao to see if we want to say how normalization is performed
- [ ] in the literature, $r$ instead of $r^2$ is used to represent LD (find ref)
- [ ] LD score interpretation too long.