# Summary

This notebook introduces the basic concepts in statistical genetics, including:

- LD, LD score

# Intuition

Here we will put a cartoon 

# Notations

## Linkage Disequilibrium (LD)

Linkage disequilibrium (LD) refers to the non-random association of alleles at two or more loci. In a population, if two variants (or loci) are in LD, their allele combinations occur **more or less frequently than expected** from their individual allele frequencies if the loci were independent. In short, LD describes the **co-occurrence of particular combinations of alleles**.
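To make "more or less frequently than expected" concrete, here is a minimal R sketch for two biallelic loci. The haplotype counts and all variable names are hypothetical, chosen only for illustration. It uses the standard coefficient $D = p_{AB} - p_A p_B$ and the squared correlation $r^2 = D^2 / \left(p_A (1 - p_A)\, p_B (1 - p_B)\right)$.

```r
# Hypothetical haplotype counts for two biallelic loci (alleles A/a and B/b)
hap_counts <- c(AB = 40, Ab = 10, aB = 10, ab = 40)
hap_freq <- hap_counts / sum(hap_counts)

p_A <- unname(hap_freq["AB"] + hap_freq["Ab"])  # frequency of allele A
p_B <- unname(hap_freq["AB"] + hap_freq["aB"])  # frequency of allele B

# D: observed AB haplotype frequency minus the frequency expected
# if the two loci were independent
D <- unname(hap_freq["AB"]) - p_A * p_B

# Squared correlation between the two loci
r2 <- D^2 / (p_A * (1 - p_A) * p_B * (1 - p_B))

D
r2
```

A value of $D$ different from zero means the AB haplotype is over- or under-represented relative to independence, which is what LD measures.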

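Given the standardized genotype matrix $\mathbf{X}$, the LD matrix can be computed as:
$$
\mathbf{R} = \frac{\mathbf{X}^T \mathbf{X}}{N}
$$

where:

- $\mathbf{X}$ is the $N \times M$ genotype matrix with each column (variant) centered to mean 0 and scaled to variance 1.
- $N$ is the number of individuals.
- $M$ is the number of variants.

Because every column of $\mathbf{X}$ has unit variance, the covariance matrix equals the correlation matrix, so $R_{jk}$ is the correlation between variants $j$ and $k$.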
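As a quick check of the formula above, the following R sketch compares $\mathbf{X}^T \mathbf{X}$ divided by the sample size with `cor()`. One detail is worth stating as an assumption: dividing by $N$ matches the correlation matrix when columns are standardized with the population standard deviation (denominator $N$), whereas R's `scale()` uses the sample standard deviation (denominator $N - 1$), so the matching divisor there is $N - 1$. The variable names are illustrative only.

```r
# Small genotype matrix (same values as the Example section of this notebook)
X_raw <- matrix(c(0, 1, 1, 2, 2, 0, 1, 1, 1, 0, 2, 1, 0, 0, 2),
                nrow = 5, ncol = 3, byrow = TRUE)
N <- nrow(X_raw)

# Standardize each column with the population SD (denominator N),
# so that t(X) %*% X / N has a unit diagonal
X_centered <- sweep(X_raw, 2, colMeans(X_raw))
pop_sd <- sqrt(colMeans(X_centered^2))
X_pop <- sweep(X_centered, 2, pop_sd, "/")
R_pop <- crossprod(X_pop) / N

# scale() uses the sample SD (denominator N - 1), so divide by N - 1 instead
X_samp <- scale(X_raw)
R_samp <- crossprod(X_samp) / (N - 1)

# Both agree with cor() up to floating-point error
all.equal(R_pop, cor(X_raw), check.attributes = FALSE)
all.equal(R_samp, cor(X_raw), check.attributes = FALSE)
```

Either convention yields the same LD matrix; only the pairing of standardization and divisor has to be consistent.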

<!-- The covariance between two variants $j_1$ and $j_2$ is computed as:

\begin{equation*}
\text{cov}(x_{i j_1}, x_{i j_2}) = \frac{1}{N} \sum_{i=1}^{N} \left( x_{i j_1} - \bar{x}_{j_1} \right) \left( x_{i j_2} - \bar{x}_{j_2} \right),
\end{equation*}

where $\bar{x}_{j_1}$ and $\bar{x}_{j_2}$ are the mean genotypes for variants $j_1$ and $j_2$, respectively, calculated as (and this step is called **centering**)

\begin{equation*}
\bar{x}_{j_1} = \frac{1}{N} \sum_{i=1}^{N} x_{i j_1}, \quad \bar{x}_{j_2} = \frac{1}{N} \sum_{i=1}^{N} x_{i j_2}.
\end{equation*}
- $\mathbf{X_{\text{raw}}}$ is the raw genotype matrix. -->

<!-- 
## $r$ and $r^2$
To calculate the **correlation** between variants, we need to normalize the covariance matrix by the variance of each variant. If **$\mathbf{X}$** is centered but not normalized, then the correlation matrix can be computed as:

$$
\mathbf{R} = \frac{\mathbf{X}^T \mathbf{X}}{\sqrt{\text{diag}(\mathbf{X}^T \mathbf{X}) \cdot \text{diag}(\mathbf{X}^T \mathbf{X})^T}}
$$

Where:
- $\text{diag}(\mathbf{X}^T \mathbf{X})$ is the diagonal of the covariance matrix, representing the variances of each variant.
- The division normalizes the covariance matrix to produce the correlation matrix.

When $X_j$'s are normalized, the LD matrix $R$ is:
$$
\mathbf{R} = \frac{\mathbf{X}^T \mathbf{X}}{n}
$$

 -->

## LD Score

The **LD score** measures the extent to which a given variant is in linkage disequilibrium (LD) with other variants. It summarizes how much genetic information (in terms of LD) a variant shares with the other variants in a region of interest. The LD score for a variant $j$ is defined as the sum of the squared correlation coefficients $r^2$ between that variant and all other variants; in practice the sum is typically restricted to a specified genomic window or region around the variant. Mathematically, the LD score for variant $j$ is given by:

$$
l_j = \sum_{k=1,\, k \neq j}^{M} \operatorname{cor}^2(\mathbf{X}_j, \mathbf{X}_k)
$$

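Since $\mathbf{X}$ is standardized, the sum of the squared sample correlations can be computed as a quadratic form. Note that this form sums over all $M$ variants, including the term $k = j$, which equals 1:

$$\widetilde{l}_{j} = \frac{1}{N^2}\mathbf{X}^\top_j\mathbf{X}\mathbf{X}^\top\mathbf{X}_j$$

However, this is not an unbiased estimate: sample $r^2$ values are inflated by sampling noise, and the inflation is larger when $N$ is small. We can correct for the bias like this:

$$l_j = \frac{\widetilde{l}_{j} N - M}{N + 1}$$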


This score reflects how much a variant is correlated with other variants across the genome, providing insight into the local structure of LD around that variant.
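The quadratic form and the bias correction above can be sketched in R as follows. The simulated genotypes, the values of `N` and `M`, and all variable names are hypothetical. Columns are standardized with the population standard deviation so that $\mathbf{X}_j^\top \mathbf{X}_j = N$, which is an assumption of this sketch rather than something the notebook prescribes.

```r
set.seed(1)
N <- 200   # individuals (hypothetical)
M <- 10    # variants (hypothetical)

# Simulated genotypes: independent binomial draws, so the true r^2
# between distinct variants is 0
X_raw <- matrix(rbinom(N * M, size = 2, prob = 0.3), nrow = N, ncol = M)

# Standardize with the population SD so that sum(X[, j]^2) == N
X_centered <- sweep(X_raw, 2, colMeans(X_raw))
X <- sweep(X_centered, 2, sqrt(colMeans(X_centered^2)), "/")

# Quadratic form for every variant at once:
# l_tilde_j = || t(X) %*% X_j ||^2 / N^2
XtX <- crossprod(X)
l_tilde <- colSums(XtX^2) / N^2

# Same quantity from the squared correlation matrix (self term included)
l_from_cor <- colSums(cor(X_raw)^2)

# Bias correction from the formula above
l_corrected <- (l_tilde * N - M) / (N + 1)

all.equal(l_tilde, l_from_cor)
round(cbind(l_tilde, l_corrected), 3)
```

Because the simulated variants are independent, the only true contribution to each sum is the self term, so the uncorrected values are expected to sit slightly above 1 and the corrected values closer to it, although any single simulation will vary.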



# Example

In [8]:
rm(list=ls())
# Genotype matrix for 5 individuals and 3 variants
# Rows correspond to individuals, columns to variants
N=5
M=3
X_raw <- matrix(c(0, 1, 1, 2, 2, 0, 1, 1, 1, 0, 2, 1, 0, 0, 2), 
                    nrow = N, ncol = M, byrow = TRUE)
# Adding row and column names
rownames(X_raw) <- paste("Individual", 1:N)
colnames(X_raw) <- paste("Variant", 1:M)
X_raw

| | Variant 1 | Variant 2 | Variant 3 |
|---|---|---|---|
| Individual 1 | 0 | 1 | 1 |
| Individual 2 | 2 | 2 | 0 |
| Individual 3 | 1 | 1 | 1 |
| Individual 4 | 0 | 2 | 1 |
| Individual 5 | 0 | 0 | 2 |


In [9]:
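# Standardize the genotype matrix: scale() centers each column and divides it
# by its sample standard deviation (denominator N - 1)
X <- scale(X_raw, center = TRUE, scale = TRUE)
X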

| | Variant 1 | Variant 2 | Variant 3 |
|---|---|---|---|
| Individual 1 | -0.6708204 | -0.2390457 | 0.0 |
| Individual 2 | 1.5652476 | 0.9561829 | -1.414214 |
| Individual 3 | 0.4472136 | -0.2390457 | 0.0 |
| Individual 4 | -0.6708204 | 0.9561829 | 0.0 |
| Individual 5 | -0.6708204 | -1.4342743 | 1.414214 |


## Calculate LD matrix

In [19]:
LD = cor(X)
LD

| | Variant 1 | Variant 2 | Variant 3 |
|---|---|---|---|
| Variant 1 | 1.0 | 0.4677072 | -0.7905694 |
| Variant 2 | 0.4677072 | 1.0 | -0.8451543 |
| Variant 3 | -0.7905694 | -0.8451543 | 1.0 |


## Calculate LD scores

In [21]:
# Element-wise square of the LD matrix gives the pairwise r^2 values
LD_squared <- LD^2

# Raw LD score for each variant: sum the squared correlations in each
# column, excluding the diagonal (each variant's r^2 with itself, which is 1)
ld_scores_raw <- colSums(LD_squared) - diag(LD_squared)

# Bias correction, using N and M defined in the first cell
ld_scores_corrected <- ((ld_scores_raw * N) - M) / (N + 1)

# Print the corrected LD scores
ld_scores_corrected

The third variant is in high LD with the first two variants ($r = -0.79$ and $r = -0.85$), so it has the highest LD score of the three.

# Supplementary


### Ref
> LDSC: slide 110-117 from Xin HE
>
> LDSC: slide 82-84 from GW
> 
> slide 48-50 from GW


### Purpose of LD Score

- **Genetic Association Studies**: LD scores are used in genome-wide association studies (GWAS) and polygenic risk score (PRS) analysis to account for the correlations between variants.
- **Controlling for Confounding**: Because of LD, the association signal at one variant partly reflects the nearby variants it tags. Under a polygenic model, this true signal grows with a variant's LD score, whereas inflation from confounding such as population stratification does not depend on LD score. Relating test statistics to LD scores therefore helps separate the two.
- **Estimating Heritability**: In LD score regression, GWAS $\chi^2$ statistics are regressed on LD scores. Variants with higher LD scores tag more causal variation and so have larger expected $\chi^2$; the slope of the regression is proportional to the SNP heritability of the trait.
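To make the heritability bullet concrete, here is a toy R sketch of LD score regression on simulated data. It assumes the commonly used model $E[\chi^2_j] = N h^2 l_j / M + 1$ with no confounding term. The LD scores, `N`, `M`, and `h2_true` are hypothetical, and the fit is a plain unweighted `lm()`, which is a simplification and not the procedure of the LDSC software (which uses a weighted regression).

```r
set.seed(42)
M <- 20000       # number of variants (hypothetical)
N <- 2000        # GWAS sample size (hypothetical)
h2_true <- 0.4   # SNP heritability used in the simulation (hypothetical)

# Hypothetical LD scores; in practice these come from a reference panel
ld_score <- runif(M, min = 1, max = 200)

# Simulate chi-square statistics whose mean follows the assumed model:
# E[chi2_j] = N * h2 * l_j / M + 1
expected_chi2 <- N * h2_true * ld_score / M + 1
chi2 <- expected_chi2 * rchisq(M, df = 1)

# Regress chi2 on LD score: slope * M / N estimates h2,
# and the intercept estimates the baseline of 1
fit <- lm(chi2 ~ ld_score)
h2_hat <- unname(coef(fit)["ld_score"]) * M / N
intercept_hat <- unname(coef(fit)["(Intercept)"])

h2_hat
intercept_hat
```

In this simulation the slope recovers a value near `h2_true` and the intercept stays near 1; with confounding present, the model predicts the intercept would rise above 1 while the slope would still track heritability.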

### Interpretation

- **High LD Score**: A variant with a high LD score is in strong LD with many other variants, meaning it shares a substantial amount of genetic variation with neighboring variants.
- **Low LD Score**: A variant with a low LD score is not strongly correlated with many other variants, so it tags little variation beyond its own, for example because it lies in a region with low LD.

In summary, the LD score is a way to quantify the genetic "information content" of a variant based on its correlations with surrounding variants, and it is used in the context of genetic studies to account for LD structure in the genome.
