EsmaeilNourani/Deep-GDAE

Deep-GDAE

Gene-Disease Association Extraction

Deep-GDAE combines a Convolutional Neural Network (CNN) with an attention-based Bidirectional Long Short-Term Memory (BiLSTM) network to classify Gene-Disease Associations (GDAs).
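To make the attention step more concrete, the sketch below pools a sequence of per-token hidden states (such as BiLSTM outputs) into a single sentence vector using softmax attention weights. It is a minimal pure-Python illustration under assumptions: the function names `softmax` and `attention_pool`, the scoring vector `w`, and the dot-product scoring are hypothetical and are not taken from the Deep-GDAE notebooks, which may score and combine the states differently.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, w):
    """Pool T hidden states of size d into one vector of size d.

    hidden_states: list of T vectors, one per token.
    w: hypothetical learned scoring vector of length d (an assumption).
    Returns (attention_weights, pooled_vector).
    """
    # One scalar score per token: dot product of the state with w.
    scores = [sum(h_j * w_j for h_j, w_j in zip(h, w)) for h in hidden_states]
    weights = softmax(scores)
    # Weighted sum of the hidden states, dimension by dimension.
    pooled = [
        sum(a * h[j] for a, h in zip(weights, hidden_states))
        for j in range(len(w))
    ]
    return weights, pooled

# Toy input: three tokens with two-dimensional hidden states.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, sentence_vector = attention_pool(states, w=[1.0, 1.0])
```

In this toy input the third token gets the largest weight because its score is highest. In a trained model the scoring parameters would be learned jointly with the rest of the network.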

Deep-GDAE Corpus

Along with the benchmark datasets, we generated a Gene-Disease Association corpus using DisGeNET (a database of GDAs) and PubTator (to retrieve biomedical texts). Using PubTator, we find all PMIDs whose text contains at least one gene name and at least one disease name. All sentences are then passed through three filtering steps to produce the false instances. Samples of the true class are extracted from DisGeNET, considering only curated associations. The Deep-GDAE corpus contains 8000 sentences (4000 true associations and 4000 false associations) covering 1904 unique diseases and 3635 unique genes.
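The candidate-selection idea can be sketched as follows: keep only sentences that mention at least one gene and at least one disease, then check each co-occurring pair against a set of curated associations. This is a rough illustration, not the authors' pipeline. The record layout, the function names, and the placeholder identifiers (`GENE_A`, `DISEASE_X`, and so on) are all assumptions, and the three filtering steps used to produce the false instances are not described in this README, so they are not reproduced here.

```python
from itertools import product

def candidate_pairs(sentences):
    """Yield (text, gene_id, disease_id) for each co-occurring gene-disease pair.

    Sentences without at least one gene and one disease yield nothing.
    """
    for s in sentences:
        genes = [m["id"] for m in s["mentions"] if m["type"] == "Gene"]
        diseases = [m["id"] for m in s["mentions"] if m["type"] == "Disease"]
        for gene_id, disease_id in product(genes, diseases):
            yield s["text"], gene_id, disease_id

def mark_curated(sentences, curated):
    """Attach a flag saying whether each pair is in the curated set.

    A pair missing from the curated set is only a candidate negative;
    further filtering would still be required.
    """
    return [
        (text, g, d, (g, d) in curated)
        for text, g, d in candidate_pairs(sentences)
    ]

# Hypothetical annotated sentences (placeholder identifiers, not real data).
toy_sentences = [
    {"text": "GENE_A variants were observed in patients with DISEASE_X.",
     "mentions": [{"type": "Gene", "id": "GENE_A"},
                  {"type": "Disease", "id": "DISEASE_X"}]},
    {"text": "GENE_B expression was measured in healthy tissue.",
     "mentions": [{"type": "Gene", "id": "GENE_B"}]},
    {"text": "GENE_C was sequenced in a DISEASE_Y cohort.",
     "mentions": [{"type": "Gene", "id": "GENE_C"},
                  {"type": "Disease", "id": "DISEASE_Y"}]},
]
toy_curated = {("GENE_A", "DISEASE_X")}

rows = mark_curated(toy_sentences, toy_curated)
```

The second toy sentence is dropped because it mentions no disease, so `rows` holds one curated pair and one candidate negative.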

Execution

1. Pre-trained word embedding models

Download one of the pre-trained word embedding files, then add the path of the downloaded file to the preProcess notebooks (replace 'wefile' with your own path).

2. Run the preProcess notebooks to generate the pickle files required for training the model.

3. Run one of the benchmark notebooks listed below to verify the performance.

  • utils.ipynb contains the helper methods called by the other notebooks.
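As a rough picture of what the pickle files from step 2 might hold, the sketch below encodes one tokenized sentence as word indices plus relative-position features for the gene and disease mentions, then saves and reloads it with `pickle`. Everything here is an assumption for illustration: the function names, the toy vocabulary, the dictionary keys, and the file name `toy_features.pkl` are hypothetical, and the actual preProcess notebooks may store different structures.

```python
import os
import pickle
import tempfile

def relative_positions(num_tokens, entity_index):
    """Signed distance of every token to the entity token."""
    return [i - entity_index for i in range(num_tokens)]

def encode(tokens, vocab, unk=0):
    """Map tokens to integer ids, using `unk` for out-of-vocabulary words."""
    return [vocab.get(t.lower(), unk) for t in tokens]

# Hypothetical sentence and vocabulary (placeholders, not real data).
tokens = ["GENE_A", "is", "associated", "with", "DISEASE_X"]
vocab = {"gene_a": 1, "is": 2, "associated": 3, "with": 4, "disease_x": 5}

sample = {
    "word_ids": encode(tokens, vocab),
    "gene_pos": relative_positions(len(tokens), entity_index=0),
    "disease_pos": relative_positions(len(tokens), entity_index=4),
    "label": 1,
}

# Write the features to a temporary pickle file and read them back.
path = os.path.join(tempfile.mkdtemp(), "toy_features.pkl")
with open(path, "wb") as f:
    pickle.dump([sample], f)
with open(path, "rb") as f:
    restored = pickle.load(f)
```

In a model of this kind, the integer ids and position offsets would typically be looked up in embedding layers at training time. Only unpickle files you generated yourself, since `pickle.load` can execute arbitrary code.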

1. [BeFree]

  • preProcess.ipynb Reads the dataset, creates the primitive features (word and position embeddings), and saves the data required for training as a pickle file.

  • BeFree-3class.ipynb Evaluation on the Genetic Association Database (GAD), an archive of human genetic association studies of complex diseases and disorders.

  • BeFree-2class_EUADR.ipynb Evaluation on the EU-ADR dataset, which contains annotations on drugs, diseases, genes and proteins, and the associations between them. Here we focus on gene-disease associations.

2. [SNPPhenA corpus] A corpus for extracting ranked associations of single-nucleotide polymorphisms (SNPs) and phenotypes from the literature.

  • SNP.ipynb Results of running Deep-GDAE on the SNPPhenA corpus, which was developed to extract ranked associations of SNPs and phenotypes from GWA studies.

  • SNP-Transfer Learning.ipynb We selected the SNP-phenotype dataset for transferring knowledge from the gene-disease domain. The rich features transferred from the base model can help train the new model on SNP-phenotype sequences.
