R Package for Gene Ontology Label Discernment and Identification.

goldi

Gene Ontology Label Discernment and Identification

goldi is a tool for identifying key terms in text. It was developed to identify ontological labels in free-form text, with a specific application to finding Gene Ontology terms in the biomedical literature under strict, canonical NLP quality control.

Status

The package is currently checked on R-oldrel (v3.3.3), R-release (v3.4.0), and R-devel (v3.5.0) on

Installation

goldi can be installed from CRAN with

install.packages("goldi")

Or, you may choose to install the latest stable development version with

devtools::install_github("Chris1221/goldi")

Minimal Example

goldi attempts to identify terms in free text through semantic similarity. This means that if a term and a sentence share a high number of words, the sentence has a higher probability of talking about the term.
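
The idea of shared words can be illustrated with a minimal base R sketch. This is not goldi's implementation: the `tokenize` helper, the naive sentence split, and the variable names are assumptions made for illustration only, and no stemming or stop-word handling is attempted.

```r
# Hypothetical helper: lower-case a string and split it into words
tokenize <- function(x) {
  words <- strsplit(tolower(x), "[^a-z0-9]+")[[1]]
  words[nzchar(words)]
}

doc <- "In this sentence we will talk about ribosomal chaperone activity. In this sentence we will talk about nothing."
term <- "ribosomal chaperone activity"

# Naive sentence split on a full stop followed by a space (an assumption)
sentences <- strsplit(doc, ". ", fixed = TRUE)[[1]]

# Count how many distinct words of the term appear in each sentence
term_words <- unique(tokenize(term))
shared <- vapply(
  sentences,
  function(s) sum(term_words %in% tokenize(s)),
  integer(1),
  USE.NAMES = FALSE
)

shared
```

The first sentence shares all three words of the term, while the second shares none, so only the first is a plausible candidate for discussing the term.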

Given the following input text and the included pre-computed term document matrix for approximately 10,000 Gene Ontology molecular function terms, we can find which terms are discussed in our text.

# Give the free form text
doc <- "In this sentence we will talk about ribosomal chaperone activity. In this sentence we will talk about nothing. Here we discuss obsolete molecular terms."

# Load in the included term document matrix for the terms
data("TDM.go.df")

# Pipe output and log to /dev/null
output = "/dev/null"
log = "/dev/null"

# Run the function
goldi(doc = doc, 
  term_tdm = TDM.go.df,
  output = output,
  log = log,
  object = TRUE)

Note that the above example uses a few other options. First, we don't want to see the output or the log for this example, so we send both to /dev/null. Second, we would like the output returned as an R object instead of written to a file, so we specify object = TRUE.

This will output the following table:

Term Context
ribosomal_chaperone_activity In this sentence we will talk about ribosomal chaperone activity

The table gives each term identified and the context in the free-form text where it was found. This table forms the basis for all further analysis.

FAQ

Q: This is all really confusing, where can I read more about this package?

A: Please see the preprint of our paper.

Q: How does goldi match terms to sentences?

A: goldi counts the words shared by a term and a sentence and compares that count to a user-defined acceptance function A(n), where n is the length of the term. The default function may be represented as a vector in R:

lims <- c(1,2,3,3,4,5,6,6,7,8,9)

If the number of shared words equals or exceeds the value of this function, a match is declared. You are encouraged to experiment and find the acceptance function that works for you.
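
As a rough sketch of how such a threshold could be applied, the base R snippet below compares a shared-word count against the vector. It is not the package's code: the `tokenize` and `is_match` helpers are hypothetical, and it assumes that element n of `lims` is the threshold for a term of n words.

```r
# Default acceptance thresholds, as given above
lims <- c(1, 2, 3, 3, 4, 5, 6, 6, 7, 8, 9)

# Hypothetical helper: lower-case a string and split it into words
tokenize <- function(x) {
  words <- strsplit(tolower(x), "[^a-z0-9]+")[[1]]
  words[nzchar(words)]
}

# Hypothetical helper: declare a match when the number of shared words
# reaches the threshold for a term of this length
# (assumes lims[n] applies to terms of n words)
is_match <- function(term, sentence, lims) {
  term_words <- unique(tokenize(term))
  n <- length(term_words)
  if (n == 0 || n > length(lims)) {
    return(FALSE)
  }
  shared <- sum(term_words %in% tokenize(sentence))
  shared >= lims[n]
}

hit <- is_match(
  "ribosomal chaperone activity",
  "In this sentence we will talk about ribosomal chaperone activity.",
  lims
)
miss <- is_match(
  "ribosomal chaperone activity",
  "In this sentence we will talk about nothing.",
  lims
)
```

Under these assumptions a three-word term needs all three words present, so `hit` is TRUE and `miss` is FALSE. Lowering an entry in `lims` would make matching more permissive for terms of that length.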

Q: What if I don't have my text in R, but instead as a text or PDF file?

A: goldi has four distinct methods for importing text locally. Please see the wiki article on the subject.
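
Independently of those methods, a plain text file can be read into a single string with base R and then passed as doc. The following is only a sketch: the temporary file and its contents are made up for illustration, and this is not necessarily one of the four import methods described in the wiki.

```r
# Create a small example file (hypothetical content, for illustration only)
path <- tempfile(fileext = ".txt")
writeLines(c("First line of text.", "Second line of text."), path)

# Read all lines and collapse them into one string
doc <- paste(readLines(path, warn = FALSE), collapse = " ")

doc
```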

Q: Installation from CRAN is not working and it says something about slam, what's going on?

A: Newer versions of the tm package, which is a dependency of goldi, require a package named slam, which needs a Fortran compiler to build. Try the following, and if it doesn't work, raise an issue on the repository and we'll get it fixed! Type the following into the terminal (on Mac OS X):

curl -O http://r.research.att.com/libs/gfortran-4.8.2-darwin13.tar.bz2
sudo tar fvxz gfortran-4.8.2-darwin13.tar.bz2 -C /

Install slam:

install.packages("slam")

Reinstall goldi:

install.packages("goldi")

Q: When I install the package, I get messages about libc or gcc versions. What's happening?

A: The most likely scenario is that your gcc compiler (which compiles the C++ code) is out of date, especially if you are on an older Linux distribution, such as the CentOS versions found on some cluster systems. Contact your system administrator and try to update gcc.

Q: How can I work with abstracts from PubMed?

A: We recommend the RISmed package.

Q: Where can I see some examples of this package in use?

A: Please see the included vignettes, especially the overexpression analysis implemented in the paper.

Q: I am looking for a project to work on with goldi, do you have any ideas?

A: Please see here.

Q: Nothing is working, who can I complain to?

A: Please raise an issue on this repository; that is the most likely way to get an answer.

Citation

Cole, Christopher B., et al. "Semi-Automated Identification of Ontological Labels in the Biomedical Literature with goldi." bioRxiv (2016): 073460.

@article{cole2016semi,
  title={Semi-Automated Identification of Ontological Labels in the Biomedical Literature with goldi},
  author={Cole, Christopher B and Patel, Sejal and French, Leon and Knight, Jo},
  journal={bioRxiv},
  pages={073460},
  year={2016},
  publisher={Cold Spring Harbor Labs Journals}
}