Haplotype Block Based Dimensionality Reduction for Complex Variant-Disease Associations

Association of Haplotype blocks to phenotypes (in Python) using a neural network machine learning method

A tool to test the association of variants in haplotype blocks with phenotypes. It takes variants (VCF format) called from any technology, such as Exome, WGS, RNASeq or SNP Genotyping Arrays, and generates association test results.

Slides from our presentation at UCSC NCBI Hackathon

Install

  1. Clone the repo
    • git clone https://github.com/NCBI-Hackathons/HapPyNet.git
  2. Install dependencies (varies depending on input)

Usage

  1. Generate SNP count matrix (Number of SNPs per LD block)
    See README here
  2. Run a neural net to classify samples into disease vs. normal
    See README here

Method

  • Call variants using any platform (RNASeq, Exome, Whole Genome or SNP Arrays)
  • Group variants by haplotype blocks to compute SNP load in each haplotype block
  • Classify samples into disease vs normal, based on SNP load (number of SNPs per LD block), using a TensorFlow classifier
  • Associate haplotypes with phenotype. As of Apr 2018, this is NOT implemented
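
The grouping step above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function names `build_block_index` and `count_snps_per_block` are hypothetical, blocks are assumed to be half-open `(chrom, start, end)` intervals, and variant positions are assumed to already be in the same coordinate system as the blocks.

```python
from bisect import bisect_right
from collections import defaultdict


def build_block_index(blocks):
    """Group (chrom, start, end) blocks by chromosome, sorted by start."""
    by_chrom = defaultdict(list)
    for chrom, start, end in blocks:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        # Keep a parallel list of starts for binary search.
        index[chrom] = ([s for s, _ in intervals], intervals)
    return index


def count_snps_per_block(variants, index):
    """Count variants per block; assumes non-overlapping half-open blocks."""
    counts = {}
    for chrom, pos in variants:
        if chrom not in index:
            continue  # chromosome has no LD blocks
        starts, intervals = index[chrom]
        i = bisect_right(starts, pos) - 1  # last block starting at or before pos
        if i < 0:
            continue
        start, end = intervals[i]
        if start <= pos < end:
            key = (chrom, start, end)
            counts[key] = counts.get(key, 0) + 1
    return counts


# Toy data (hypothetical): three blocks and five variant positions.
blocks = [("chr1", 0, 100), ("chr1", 100, 200), ("chr2", 0, 50)]
variants = [("chr1", 10), ("chr1", 99), ("chr1", 100), ("chr2", 50), ("chr3", 5)]
snp_load = count_snps_per_block(variants, build_block_index(blocks))
```

Running this per sample and stacking the resulting counts gives one row per sample and one column per LD block, which is the shape a classifier would expect.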

Data sources

  • LD blocks: Non-overlapping LD blocks derived from 1KG data (hg19) were obtained from: Approximately independent linkage disequilibrium blocks in human populations, Bioinformatics. 2016 Jan 15; 32(2): 283–285, doi: 10.1093/bioinformatics/btv546. These regions were mapped to GRCh38 using NCBI's online remapping tool, with merge fragments turned ON so that each LD block is not fragmented.

  • RNASeq samples: The initial training set of healthy and disease samples was obtained from SRA. The disease sample selection query was: (AML) AND "Homo sapiens"[orgn:__txid9606] NOT ChIP-Seq. The list of SRR samples used is provided here

RNASeq Variant Calling Pipeline

  • RNASeq sample reads were aligned using HiSat2
  • Variants were called using GATK version 4.0.3.0 and quality filtered at read depth of 50 and genotype quality of 90
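
As a rough illustration of the quality filter described above, the sketch below keeps single-sample VCF records whose `DP` and `GQ` FORMAT fields meet the stated thresholds. It is not the pipeline's actual code: the helper names are hypothetical, and treating 50 and 90 as inclusive minimums is an assumption, since the text does not say how the cutoffs are applied.

```python
MIN_DP = 50  # read depth threshold from the text (assumed to be a minimum)
MIN_GQ = 90  # genotype quality threshold from the text (assumed to be a minimum)


def passes_quality(vcf_line, min_dp=MIN_DP, min_gq=MIN_GQ):
    """Return True if the first sample's DP and GQ meet both thresholds."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return False  # no FORMAT/sample columns
    keys = fields[8].split(":")
    values = fields[9].split(":")
    sample = dict(zip(keys, values))
    try:
        return int(sample["DP"]) >= min_dp and int(sample["GQ"]) >= min_gq
    except (KeyError, ValueError):
        return False  # field absent or missing value (".")


def filter_vcf_lines(lines):
    """Yield header lines unchanged and only the records that pass."""
    for line in lines:
        if line.startswith("#") or passes_quality(line):
            yield line


# Toy records (hypothetical values).
header = "##fileformat=VCFv4.2"
good = "chr1\t100\t.\tA\tG\t500\tPASS\t.\tGT:DP:GQ\t0/1:60:99"
low_dp = "chr1\t200\t.\tC\tT\t500\tPASS\t.\tGT:DP:GQ\t0/1:20:99"
missing = "chr1\t300\t.\tG\tA\t500\tPASS\t.\tGT:DP:GQ\t./.:.:."
kept = list(filter_vcf_lines([header, good, low_dp, missing]))
```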

Machine Learning

  • We trained a classifier, a 4-layer neural network built with TensorFlow, and evaluated it with leave-one-out cross-validation.
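
The leave-one-out procedure can be sketched without any dependencies. To keep the example self-contained, a simple nearest-centroid classifier stands in for the 4-layer TensorFlow network; only the cross-validation loop is the point here, and all function names and the toy data are hypothetical.

```python
def nearest_centroid_fit(X, y):
    """Stand-in model: compute the mean feature vector of each class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(row)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def nearest_centroid_predict(centroids, row):
    """Predict the class whose centroid is closest in squared distance."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], row))
    return min(centroids, key=sq_dist)


def leave_one_out_accuracy(X, y, fit, predict):
    """Hold out each sample once, train on the rest, and score the held-out one."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        model = fit(train_X, train_y)
        if predict(model, X[i]) == y[i]:
            correct += 1
    return correct / len(X)


# Toy SNP-load matrix (hypothetical): rows are samples, columns are LD blocks.
X = [[10, 1], [12, 0], [11, 2], [1, 9], [0, 11], [2, 10]]
y = ["disease", "disease", "disease", "normal", "normal", "normal"]
accuracy = leave_one_out_accuracy(X, y, nearest_centroid_fit, nearest_centroid_predict)
```

With N samples the model is trained N times, which is practical for a small cohort but becomes expensive as the sample count grows.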

Results

Our classifier, trained on our test set of AML and normal samples, reached 99% cross-validated accuracy.

Next steps

  • Rerun on a larger set of samples, with demographics- and batch-controlled normals
  • Explore standard differential gene expression methods from Bioconductor
  • Explore other normalization methods for Haplotype length and number of SNPs in samples
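
As one possible starting point for the normalization item above, the sketch below scales each block's SNP count by the sample's total SNP count and by the block length in kilobases. This particular formula is an assumption chosen for illustration, not a method the project has adopted, and `normalize_snp_load` is a hypothetical name.

```python
def normalize_snp_load(counts):
    """Return, per block, (share of the sample's SNPs) per kilobase of block."""
    total = sum(counts.values())
    normalized = {}
    for (chrom, start, end), n in counts.items():
        length_kb = (end - start) / 1000.0
        if total == 0 or length_kb <= 0:
            normalized[(chrom, start, end)] = 0.0  # nothing to scale
        else:
            normalized[(chrom, start, end)] = (n / total) / length_kb
    return normalized


# Toy counts (hypothetical): equal SNP counts in a 1 kb and a 2 kb block.
raw = {("chr1", 0, 1000): 5, ("chr1", 1000, 3000): 5}
scaled = normalize_snp_load(raw)
```

In this toy case the two blocks have the same raw count, but the longer block ends up with half the normalized load of the shorter one.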