Herbert

Website

You can visit the website by clicking here!

What is Herbert

Herbert is a semantic search engine that aggregates and distills the most reliable herbal medicine information into a single curated report for consumers curious about using herbal medicine.

Why is it important

Total retail sales of herbal pharmaceuticals surpassed $8.8 billion domestically, and growth has been accelerating over the past decade. Although interest in herbal medication is high, it is difficult for consumers to find trustworthy, reliable, and easy-to-understand information about the treatments. Standard web searches return plenty of results but leave it to the user to sift through the pages to find the information that is relevant to them. Other services are designed for medical professionals and use complicated jargon that is difficult for a layperson to understand. In short, there is plenty of interest in alternative herbal medicine, but there is also information overload.

What are our advantages

Efficient: Herbert filters out irrelevant information and focuses on the semantic relationships among herbs, conditions, and interactions.
Trustworthy: Herbert aggregates information from trustworthy data sources and cross-references among them.
User-friendly: Herbert avoids overly technical terms and uses plain-language vocabulary for easy understanding.
Transparent: Herbert provides links back to the original data sources for user reference.

Technical Description

[Process GIF: animation of the multi-stage extraction pipeline]

Our solution to distilling our sources into the relevant data points about an herb is a multi-stage pipeline that works with finer-grained text at each stage. Essentially, we break pages of text down into relevant paragraphs, relevant paragraphs into relevant sentences, and finally relevant sentences into relevant phrases, which become our bullet points.

To get our pages of text, we use a combination of RESTful APIs for sources that offer them, such as PubMed and Wikipedia, and the BeautifulSoup and Requests Python libraries for sources that don't, such as NCCIH.
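As a rough illustration (not the project's actual fetching code), the sketch below pulls a page summary from Wikipedia's public REST API and scrapes paragraphs from an arbitrary page with Requests + BeautifulSoup; the URL and page title here are placeholders.

```python
import requests
from bs4 import BeautifulSoup

def fetch_wikipedia_summary(title: str) -> str:
    """Fetch a plain-text page summary from Wikipedia's public REST API."""
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.json().get("extract", "")

def scrape_paragraphs(url: str) -> list[str]:
    """Scrape paragraph text from a source without an API (e.g. an NCCIH fact sheet)."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return [p.get_text(strip=True) for p in soup.find_all("p")]

print(fetch_wikipedia_summary("Ginger")[:200])
```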

To explain our general extraction pipeline, we refer to an example with a Wikipedia page on ginger, illustrated by the animation above.

We start off by getting the relevant paragraphs by selecting the relevant headings from the table of contents.

To get the desired headings, we set ‘seed words’ that are related to topics of interest (e.g. the topic of side-effects with “adverse”, “side-effect”, “interact”, and the like).

We essentially compare our seed words with the content headings; headings "similar enough" to our seed words dictate which paragraphs are relevant, and the rest are discarded. The two are made comparable through word embeddings augmented with character-level n-grams, as described in Bojanowski et al.'s paper "Enriching Word Vectors with Subword Information". Essentially, we look not only at whole words but at chunks (character n-grams), so that even when a word is missing from the vocabulary we can make use of root words, prefixes, suffixes, etc. The word-embedding model was trained on millions of PubMed abstracts, full text from the PubMed Central Open Access subset, and an English Wikipedia text dump. The vectors are compared by cosine similarity, and we set an empirically determined threshold to decide how "similar" the vectors need to be in order to retain the heading.
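A minimal sketch of that filtering step, assuming a pretrained fastText-style subword model loaded with gensim; the model filename and threshold value are placeholders, not the project's actual artifacts.

```python
from gensim.models.fasttext import load_facebook_vectors

# Placeholder path; the project's embeddings were trained on PubMed,
# PMC Open Access, and English Wikipedia text.
vectors = load_facebook_vectors("biomedical_fasttext.bin")

SEED_WORDS = ["adverse", "side-effect", "interact"]
THRESHOLD = 0.55  # illustrative; the real value was empirically determined

def heading_is_relevant(heading: str) -> bool:
    """Keep a heading if any of its tokens is close enough to any seed word."""
    for token in heading.lower().split():
        for seed in SEED_WORDS:
            # Subword n-grams let fastText embed out-of-vocabulary tokens too.
            if vectors.similarity(token, seed) >= THRESHOLD:
                return True
    return False

headings = ["Culinary use", "Adverse effects", "History", "Drug interactions"]
print([h for h in headings if heading_is_relevant(h)])
```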

Now that we have our relevant paragraphs (pointed to by the content headings), we look to extract relevant sentences. We do this using UMLS (the Unified Medical Language System), a government-supported medical ontology maintained by the U.S. National Library of Medicine, to identify sentences with medical content relating to conditions, symptoms, or other medical objects of interest.
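The README doesn't name the UMLS tooling used; as one plausible sketch, scispacy's UMLS entity linker can flag sentences that mention medical concepts.

```python
import spacy
from scispacy.linking import EntityLinker  # noqa: F401  (registers the pipe)

# Assumes the en_core_sci_sm scispacy model is installed; the linker
# matches entity mentions against UMLS concepts.
nlp = spacy.load("en_core_sci_sm")
nlp.add_pipe("scispacy_linker",
             config={"resolve_abbreviations": True, "linker_name": "umls"})

def medical_sentences(text: str) -> list[str]:
    """Return sentences containing at least one entity linked to a UMLS concept."""
    doc = nlp(text)
    return [sent.text for sent in doc.sents
            if any(ent._.kb_ents for ent in sent.ents)]

print(medical_sentences("Ginger alleviates nausea. It is a popular spice."))
```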

From each relevant sentence, we find our relevant phrases through relation extraction, in which we look for subject-verb-object triples (SVOs); the example above is "Ginger alleviates nausea".
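The project doesn't publish its extractor, but a bare-bones SVO pass over a spaCy dependency parse looks roughly like this:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_svo(sentence: str) -> list[tuple[str, str, str]]:
    """Return naive (subject, verb, object) triples from a dependency parse."""
    doc = nlp(sentence)
    triples = []
    for token in doc:
        if token.pos_ != "VERB":
            continue
        subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
        objects = [c for c in token.children if c.dep_ in ("dobj", "obj")]
        for subj in subjects:
            for obj in objects:
                triples.append((subj.text, token.lemma_, obj.text))
    return triples

print(extract_svo("Ginger alleviates nausea."))  # [('Ginger', 'alleviate', 'nausea')]
```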

Finally, we apply a process similar to the one used to extract content headings (word embeddings + cosine similarity + thresholding): we look at the verbs in the SVOs to determine whether a phrase explains what the herb treats, interacts with, or causes. We normalize the words, combine them with the other data sources, and put them on the landing pages for our herbs.
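The verb-bucketing step might look like the sketch below; the category seed verbs and threshold are illustrative guesses, and the model file is the same placeholder as in the heading-filter sketch.

```python
from gensim.models.fasttext import load_facebook_vectors

vectors = load_facebook_vectors("biomedical_fasttext.bin")  # placeholder path

CATEGORY_SEEDS = {
    "treats": ["treat", "alleviate", "relieve"],
    "interacts": ["interact", "inhibit", "potentiate"],
    "causes": ["cause", "induce", "trigger"],
}

def categorize_verb(verb: str, threshold: float = 0.5) -> str | None:
    """Assign an SVO verb to the category whose seed verbs it is closest to."""
    best_category, best_score = None, threshold
    for category, seeds in CATEGORY_SEEDS.items():
        score = max(vectors.similarity(verb, seed) for seed in seeds)
        if score > best_score:
            best_category, best_score = category, score
    return best_category

print(categorize_verb("alleviate"))  # likely "treats"
```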

Summary Diagram

For summaries of conditions and herbs, we use Wikipedia both for entity resolution and as the source of summary content. We summarize via TextRank, an unsupervised graph-based approach.
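The README doesn't say which TextRank implementation is used; the summa library is one off-the-shelf option, shown here on a toy passage.

```python
from summa.summarizer import summarize

article = """Ginger is a flowering plant whose rhizome, ginger root, is widely
used as a spice and a folk medicine. It is a herbaceous perennial which grows
annual pseudostems about one meter tall. Ginger has been used in traditional
medicine for nausea and digestive complaints."""

# TextRank builds a sentence graph and ranks sentences by centrality;
# `ratio` controls what fraction of sentences to keep.
print(summarize(article, ratio=0.4))
```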

Query Formula

The underlying search engine is built on the Python library Whoosh, using the Okapi BM25 (Best Matching 25) algorithm for relevance ranking.
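A minimal Whoosh setup with BM25F scoring (Okapi BM25 generalized to multiple fields, Whoosh's default); the schema fields and documents below are placeholders.

```python
import tempfile

from whoosh import scoring
from whoosh.fields import ID, TEXT, Schema
from whoosh.index import create_in
from whoosh.qparser import QueryParser

schema = Schema(name=ID(stored=True), body=TEXT(stored=True))
ix = create_in(tempfile.mkdtemp(), schema)

writer = ix.writer()
writer.add_document(name="ginger", body="Ginger alleviates nausea and aids digestion.")
writer.add_document(name="valerian", body="Valerian root is used for insomnia.")
writer.commit()

# Rank hits for a query term with BM25F scoring.
with ix.searcher(weighting=scoring.BM25F()) as searcher:
    query = QueryParser("body", ix.schema).parse("nausea")
    for hit in searcher.search(query):
        print(hit["name"], hit.score)
```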
