Deep-GDAE

Gene-Disease Association Extraction

Deep-GDAE combines the strengths of a Convolutional Neural Network (CNN) and an attention-based Bidirectional Long Short-Term Memory (BiLSTM) network to classify Gene-Disease Associations (GDAs).

Deep-GDAE Corpus

Along with the benchmark datasets, we generated a Gene-Disease Association corpus using DisGeNET (a database of GDAs) and PubTator (to retrieve biomedical texts). Using PubTator, we find all PMIDs containing at least one gene name and one disease name. All sentences are then passed through three filtering steps to produce the false instances. Samples of the true class are extracted from DisGeNET, considering only curated associations. The Deep-GDAE Corpus contains 8000 sentences (4000 True Association samples and 4000 False Association samples) covering 1904 unique diseases and 3635 unique genes.
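As a rough illustration of the candidate-selection idea described above, the following sketch keeps only sentences that mention at least one gene and one disease, then separates those whose pair appears in a curated set. The record layout (`mentions`, `type`, `id`), the function names, and the toy identifiers are all hypothetical. The three filtering steps used for the false class are not specified here and are not reproduced.

```python
from itertools import product


def candidate_pairs(sentence):
    """Return all (gene_id, disease_id) pairs co-mentioned in one sentence."""
    genes = [m["id"] for m in sentence["mentions"] if m["type"] == "Gene"]
    diseases = [m["id"] for m in sentence["mentions"] if m["type"] == "Disease"]
    return list(product(genes, diseases))


def split_by_curation(sentences, curated_pairs):
    """Split co-mention sentences by whether any pair is in the curated set."""
    curated, uncurated = [], []
    for sentence in sentences:
        pairs = candidate_pairs(sentence)
        if not pairs:
            continue  # needs at least one gene AND one disease mention
        if any(pair in curated_pairs for pair in pairs):
            curated.append(sentence)
        else:
            uncurated.append(sentence)
    return curated, uncurated


# Toy data with made-up identifiers (not taken from DisGeNET or PubTator).
sentences = [
    {"text": "G1 is linked to D1.",
     "mentions": [{"type": "Gene", "id": "G1"}, {"type": "Disease", "id": "D1"}]},
    {"text": "G2 was measured in D1 patients.",
     "mentions": [{"type": "Gene", "id": "G2"}, {"type": "Disease", "id": "D1"}]},
    {"text": "G1 is highly expressed.",
     "mentions": [{"type": "Gene", "id": "G1"}]},
]
curated_pairs = {("G1", "D1")}  # hypothetical stand-in for curated associations

curated, uncurated = split_by_curation(sentences, curated_pairs)
print(len(curated), len(uncurated))
```

Sentences in `uncurated` are only candidates for the false class in this sketch. Absence from a curated set does not by itself make a sentence a negative example, which is presumably why further filtering is applied.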

Execution

1. Pre-trained word embedding models

Download one of the following pre-trained word embedding files, then add the path of the downloaded file to the preProcess notebooks (replace 'wefile' with your own path):

PubMed-shuffle-win-30: https://github.com/cambridgeltl/BioNLP-2016
Fast Text (crawl-300d-2M): https://fasttext.cc/docs/en/english-vectors.html
PubMed w2v: http://jbjorne.github.io/TEES/
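For orientation, here is a minimal sketch of how a text-format embedding file could be read and turned into an embedding matrix. It assumes the whitespace-separated text layout used by fastText `.vec` files (an optional `count dim` header, then one word and its values per line). The function names and the toy vocabulary are hypothetical and are not taken from the preProcess notebooks.

```python
import io


def load_text_vectors(handle, vocabulary=None):
    """Read word vectors from a text-format file object.

    Each line is "<word> <v1> <v2> ..."; an optional "<count> <dim>" header
    line (as in fastText .vec files) is skipped.
    """
    vectors = {}
    for line_number, line in enumerate(handle):
        parts = line.rstrip().split(" ")
        if line_number == 0 and len(parts) == 2:
            continue  # header line: vocabulary size and dimension
        word, values = parts[0], parts[1:]
        if not values:
            continue  # skip blank or malformed lines
        if vocabulary is None or word in vocabulary:
            vectors[word] = [float(v) for v in values]
    return vectors


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; unknown words stay zero."""
    matrix = [[0.0] * dim for _ in range(max(word_index.values()) + 1)]
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix


# In practice the handle would come from: open(wefile, encoding="utf-8")
demo = io.StringIO("2 3\ngene 0.1 0.2 0.3\ndisease 0.4 0.5 0.6\n")
vectors = load_text_vectors(demo, vocabulary={"gene", "disease", "brca1"})

word_index = {"gene": 1, "disease": 2, "brca1": 3}  # index 0 reserved for padding
matrix = build_embedding_matrix(word_index, vectors, dim=3)
```

Restricting the load to the corpus vocabulary keeps memory use manageable with large files. If the downloaded file is in binary word2vec format, this sketch does not apply; a library such as gensim (`KeyedVectors.load_word2vec_format` with `binary=True`) is one option.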

2. Run the preProcess notebooks to generate the required pickle files for training the model

3. Run one of the benchmark notebooks listed below to verify the performance.

utils.ipynb contains the helper methods that are called by the other notebooks.

1. [BeFree]

preProcess.ipynb: Reads the dataset, creates the primitive features (word and position embeddings), and saves the data required for training as a pickle file.
BeFree-3class.ipynb: Evaluation on the Genetic Association Database (GAD), an archive of human genetic association studies of complex diseases and disorders.
BeFree-2class_EUADR.ipynb: Evaluation on the EU-ADR dataset, which contains annotations on drugs, diseases, genes, and proteins, and the associations between them. Here we focus on gene-disease associations.
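To make the "word and position" features concrete, the sketch below encodes one tokenized sentence as word ids plus two sequences of relative positions (token offset from the gene mention and from the disease mention), then serializes the result with `pickle`. This is a generic illustration of position features for relation extraction. The function names, the clipping distance of 30, and the dictionary keys are assumptions, not the notebook's actual code.

```python
import pickle


def relative_positions(n_tokens, entity_index, max_distance=30):
    """Clipped token offsets from an entity, shifted to be non-negative."""
    return [
        max(-max_distance, min(max_distance, i - entity_index)) + max_distance
        for i in range(n_tokens)
    ]


def encode_sentence(tokens, gene_index, disease_index, word_index):
    """Turn one tokenized sentence into word ids plus two position sequences."""
    return {
        # Index 0 is used for unknown words (and padding) in this sketch.
        "word_ids": [word_index.get(t.lower(), 0) for t in tokens],
        "gene_positions": relative_positions(len(tokens), gene_index),
        "disease_positions": relative_positions(len(tokens), disease_index),
    }


tokens = ["BRCA1", "mutations", "increase", "breast", "cancer", "risk"]
word_index = {w: i + 1 for i, w in enumerate(sorted({t.lower() for t in tokens}))}
example = encode_sentence(tokens, gene_index=0, disease_index=4, word_index=word_index)

# Serialize the features; a notebook would write these bytes to a pickle file.
payload = pickle.dumps([example])
restored = pickle.loads(payload)
```

Shifting the offsets by `max_distance` keeps every value non-negative, so each one can index a row of a position embedding table directly.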

2. [SNPPhenA corpus] A corpus for extracting ranked associations of single-nucleotide polymorphisms (SNPs) and phenotypes from literature.

SNP.ipynb: Results of running Deep-GDAE on the SNPPhenA corpus, which was developed to extract ranked associations of SNPs and phenotypes from GWA studies.
SNP-Transfer Learning.ipynb: We selected the SNP-phenotype dataset as the target for transferring knowledge from the gene-disease domain. The rich features transferred from the base model can help to train the new model on SNP-phenotype sequences.

Repository: EsmaeilNourani/Deep-GDAE
