cLD: a novel statistic capturing linkage disequilibrium between rare variants

Linkage disequilibrium (LD) is a fundamental concept in genetics; critical for studying genetic associations and molecular evolution. However, LD measurements are only reliable for common genetic variants, leaving low-frequency variants unanalyzed. In this work, we introduce cumulative LD (cLD), a stable statistic that captures the rare-variant LD between genetic regions and opens the door for furthering biological knowledge using rare genetic variants. In application, we find cLD reveals an increased genetic association between genes in 3D chromatin interactions, a phenomenon recently reported negatively by calculating standard LD between common variants. Additionally, we show that cLD is higher between gene pairs reported in interaction databases, identifies unreported protein-protein interactions, and reveals interacting genes distinguishing case/control samples in association studies.

The R & Python code for the cLD project and instructions on how to run it are included in this repository. Using our toy example you can verify the cLD value and the gene distance value easily since we only have 5 genes and 6 individuals in it. If you want to go through the whole process, you may use the demo example to run these codes.

Python codes are under version python/anaconda-3.6-5.1.0
R codes are under version R/3.6.2

Data

In this analysis, you need to provide:
- the genotype file
- the gene information file
- the gene-gene interaction file
In the paper, we are using the 1000 Genome dataset as the genotype file. The gene information file should contain the start point, end point and gene's Ensembl ID. The gene-gene interaction file should contain the gene-gene interaction in terms of genes' Ensembl ID. The example data is given. If you want to run a anaylsis on the Hi-C interactions, you'll need to use Hi-C_InteractionTransfer.py to transfer the interaction intervals into gene-gene interactions.

Asymptotic Properties of cLD

Integrate SNVs

'geneIntegrate.py' could integrate the SNVs into the gene lines, you can find the detail in the Supplementary material Chapter 1. The inputs are:
- gene information file
- original genotype data, contains the SNV information
- specify the chromosome
- Threshold of rare SNVs
The output is:
- the intergrated gene file

'geneFilter.py' could remove the genes with cMAF = 0. The input is:
- the integrated gene file
The output is:
- the filtered gene file

Calculate cLD

'cld.py' could calculate the cLD matrix given the filtered gene file. The input is:
- the filtered gene data from the filter
The output is:
- the cLD file without Ensemble ID

'namedcld.py' could add gene names to the cLD result.
The input is:
- cLD file from cLD.py
- gene information list, you can find the example in the demo
The output is:
- output is the cLD file with Gene Esmbel ID

Gene distance

'genemid.py' and 'genedistance.py' are used to calculate the gene distances between gene pairs. In 'genemid.py', the input is:
- cLD file with Ensembl ID
The output is:
- gene mid point list

In 'genedistance.py', the inputs are:
- cLD file with Ensembl ID
- gene mid point file
The outputs are:
- gene distance file
- distance sequence file, a string

Variance Comparison

'cldVar.py' calculate the variance of cLD based on the filtered gene file. The input is:
- filtered gene file from the filter
'ldVar.py' estimate the variance of LD based on the original SNV file. The inputs are: - SNP file
- specify the chromosome
- threshold of rare variant
- sample size
The output of these two python files are the estimated variance.

Bootstrap Methods

'geneRandom.py' is used to sample a subset of genes from the filtered gene file.
The inputs are:
- Filtered genefile
- sample size
The output is a gene sample file.

'ldBootstrap.py' and 'cldBootstrap' are used to run bootstrap algorithm.

For LD bootstrap:

The inputs are:
- gene sample
- threshold of rare variant
- Sample size of each iteration
- epoch
The output is the bootstrap result file.

For cLD bootstrap:

The inputs are:
- gene sample
- gene information list, you can find the example in the demo
- samplesize
- epoch
The output is the cLD Bootstrap result, each row represents an iteration.

'bootstrapGroups.py' and 'bootstrapResultSeparate.py' are used to separate gene pairs into several cMAF groups. You may read the detail in the Supplementary material.

Hi-C interaction transfer

'Hi-C_InteractionTransfer.py' could transfer the interaction intervals into gene-gene interactions.
The input is:
- Hi-C interaction file
- cLD file with Ensembl ID
- specify the chromosome
The output is:
- the gene-gene interactions in terms of Esmbel ID

Applying cLD to real sequence data

Run MH test & Fisher's Exact test

'interactionDistGroup.py' could separate the cLD gene pairs into 13*2 groups, 13 means 13 distance groups, 2 means with/without interactions.
The inputs are:
- gene distance file
- cLD with gene esembl ID
- Gene interaction file, e.g. Hi-Cgeneint.txt
The output is:
- output is a matrix. negtive value means with interaction, positive means no-interaction

'interactionDistseparate.py' and 'tests.py' could run the MH test and the Fisher's exact test.
The inputs are:
- cLD with gene esembl ID
- interaction-distance information matrix
- threshold of success (in quantile)

The outputs are the cld without interaction, this file contains 13 lines, each one represent a distance group. Based on these outputs, 'tests.py' could provide the test statistics and p-values for the test.

Sample LD value from the genotype file

The folder called LDSampling contains the Java code to sample SNPs according to a fixed distance. The steps are:

we filtered and separated the VCFs by populations
Using the gene list to calculate their LD
Generating 1000 null regions per gene. We use each gene block's distance and size to generate the null regions

Also, a recheck is done to ensure all null distribution blocks have SNPs, if they do not these regions alone are regenerated.

Example Input Data

Example SNV file

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	HG00096	HG00097	HG00099	HG00100	HG00101	HG00102
1	22	rs367896724	A	AC	100	PASS	AC=2130;AF=0.425319;AN=5008;NS=2504;DP=103152;EAS_AF=0.3363;AMR_AF=0.3602;AFR_AF=0.4909;EUR_AF=0.4056;SAS_AF=0.4949;AA=\|\|\|unknown(NO_COVERAGE);VT=INDEL	GT	1\|0	0\|1	0\|1	1\|0	0\|0	1\|0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...

Example gene information file

CHR	genename	Startpoint	Endpoint	Ensemble ID
chr1	gene	10	20	Ensemble1
chr1	gene	15	24	Ensemble2
chr1	gene	20	29	Ensemble3
chr1	gene	26	101	Ensemble4
chr1	gene	99	150	Ensemble5
...	...	...	...	...

In the practice, we don't need the headline.

Example Hi-C interaction file

Count	interval 1 (chr;lower;upper)	interval 2 (chr;lower;upper)
C0	1;0;10	1;19;25
C1	1;25;40	1;90;110
C2	1;2;13	1;20;24
...	...	...

Run Time

The run time of the demo example should be less than 20 mins.
In practice, the run time depends on the size of the data. For example, when we run the geneIntegrate.py on Chr 1, it may take 5 to 10 days to integrate the SNPs to genes.

Protein Docking

The software are

Protein docking software: HDOCKlite-v1.1: http://huanglab.phys.hust.edu.cn/software/hdocklite/
Visualization software:
Pymol 2.5.1: https://pymol.org/2/
LigPlot+’s: https://www.ebi.ac.uk/thornton-srv/software/LigPlus/manual/manual.html

Find High cLD with no interactions reported

'HighcldInteractionFilter.py' is to find the gene pairs with large cLD but no interaction reported in the given databases. The inputs are:
- thres of cld # Define the high cLD
- thres of cMAF # uesd to filter the gene pairs with a condition on the cMAF
- interaction file # The combination of existing interaction databases
- cLD with Ensembl ID
- MAF of genes
The output is:
- High cld gene pairs with no-interactions, could have other results if you change the codes.

Association mapping using cLD and annotations

Data source and quality control

We use the software plink: plink="/path_to_plink/plink". The code is 'Data source and quality control.sh'

Go and KEGG pathway analysis

Using ‘clusterProfiler’ R package, the code is 'Go and KEGG pathway analysis.R'

Calculating cLD for each gene pair in case-control datasets

First, we calculated the cMAF within each gene region for case and control separately (using WriteInputcLD.py); We then retained genes with cMAF>0.05 in both case and control groups, and then iteratively calculated cLD within the gene regions for each gene pair for case and control separately (using CalcLD.py).

Name		Name	Last commit message	Last commit date
Latest commit History 76 Commits
Applying cLD to real sequence data		Applying cLD to real sequence data
Association mapping using cLD and annotations		Association mapping using cLD and annotations
Asymptotic Properties of cLD		Asymptotic Properties of cLD
Demo		Demo
Graph		Graph
LDSampling		LDSampling
Protein interactions and docking		Protein interactions and docking
LICENSE		LICENSE
README.md		README.md
cLD_Paper_FigureData.xlsx		cLD_Paper_FigureData.xlsx

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

cLD: a novel statistic capturing linkage disequilibrium between rare variants

Data

Asymptotic Properties of cLD

Integrate SNVs

Calculate cLD

Gene distance

Variance Comparison

Bootstrap Methods

For LD bootstrap:

For cLD bootstrap:

Hi-C interaction transfer

Applying cLD to real sequence data

Run MH test & Fisher's Exact test

Sample LD value from the genotype file

Example Input Data

Example SNV file

Example gene information file

Example Hi-C interaction file

Run Time

Protein Docking

Find High cLD with no interactions reported

Association mapping using cLD and annotations

Data source and quality control

Go and KEGG pathway analysis

Calculating cLD for each gene pair in case-control datasets

About

Releases

Packages

Languages

License

QingrunZhangLab/cLD

Folders and files

Latest commit

History

Repository files navigation

cLD: a novel statistic capturing linkage disequilibrium between rare variants

Data

Asymptotic Properties of cLD

Integrate SNVs

Calculate cLD

Gene distance

Variance Comparison

Bootstrap Methods

For LD bootstrap:

For cLD bootstrap:

Hi-C interaction transfer

Applying cLD to real sequence data

Run MH test & Fisher's Exact test

Sample LD value from the genotype file

Example Input Data

Example SNV file

Example gene information file

Example Hi-C interaction file

Run Time

Protein Docking

Find High cLD with no interactions reported

Association mapping using cLD and annotations

Data source and quality control

Go and KEGG pathway analysis

Calculating cLD for each gene pair in case-control datasets

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages