# MULTIVAC - Meta-model Unification Learned Through Inquiry Vectorization and Automated Comprehension
## Introduction
Gallup’s MULTIVAC effort supports the goals of the DARPA ASKE program by developing a system that absorbs scientific knowledge — in the form of facts, relationships, models and equations — from a particular domain corpus into a Markov Logic Network (MLN) ontology and learns to query that ontology in order to accelerate scientific exploration within the target domain. MULTIVAC will consist of an expert query generator trained on a corpus of historical expert queries and tuned dialectically with the use of a Generative Adversarial Network (GAN) architecture. As a prototype system, MULTIVAC will focus on the domain of epidemiological research, and specifically the realm of SIR/SEIR (Susceptible-Infected-Recovered, often with an additional “Exposed” element) compartmental model approaches. It is Gallup’s intent that this system includes a “human-in-the-loop” element, especially during training, to ensure that the system is properly tuned and responsive to the needs and interests of the human researchers it is intended to augment.

DARPA’s Information Innovation Office’s Automating Scientific Knowledge Extraction (ASKE) program seeks to develop approaches that make it easier for scientists to build, maintain and reason over rich models of complex systems, which could include physical, biological, social, engineered or hybrid systems. By interpreting and exposing scientific knowledge and assumptions in existing model code and documentation, researchers can automatically identify new data and information resources, extract useful information from these sources, and integrate that information into machine-curated expert models for robust modeling.

## Replication
The MULTIVAC pipeline, to run from beginning to end, is encapsulated in command-line executable code found on Gallup's [MULTIVAC repository](https://github.com/GallupGovt/multivac). To run that file -- `conductor.py`, found in the top-level MULTIVAC directory -- one needs to do the following:
1. Within the [MULTIVAC repository](https://github.com/GallupGovt/multivac):
  * Clone the Gallup MULTIVAC repository
  * Instantiate a virtual environment
  * `pip install -r requirements.txt` to install all necessary Python dependencies
2. Within the [QG-Net repository](https://github.com/GallupGovt/qgnet) (a secondary, cloned repository that the Gallup team uses for a portion of its modeling effort):
  * Clone the Gallup QG-Net repository (this should be at the same level directory as MULTIVAC)
  * Clone the Facebook Research [DrQA](https://github.com/facebookresearch/DrQA) repository and follow the [install instructions](https://github.com/facebookresearch/DrQA#installing-drqa) (this should be at the same level directory as MULTIVAC). Note, this requires a Linux/OSX machine.
3. Other downloads and dependencies
  * The [Stanford Core NLP](https://stanfordnlp.github.io/CoreNLP/download.html) toolset needs to be downloaded and installed. **Be sure to follow the setup instructions** to ensure proper install for this to work.
  * The [GloVe pre-trained data](http://nlp.stanford.edu/data/glove.42B.300d.zip) need to be downloaded and placed in the `data_dir` as defined in `settings.py` in the top-level directory of the MULTIVAC repository. 
  * Have [R](https://cran.r-project.org/) installed on your machine; part of MULTIVAC's code reaches back into R.

Finally, the project as a whole, and QG-Net in particular, is designed to run on a GPU. **The QG-Net portion of the code will not work without a GPU** (tested with an Nvidia Quadro Pro-4000 card), and other pieces, especially the Markov logic network (MLN), will see dramatically reduced performance on a CPU. Also note that the end-to-end run takes approximately 60 hours to complete.

## Piece-by-piece execution
If you choose not to replicate the entire pipeline, the steps below let you step into and out of individual stages using data pre-compiled by Gallup. These files can be accessed through a public, non-secure FTP site hosted at Gallup. Navigate to ftp://ftp.gallup.com in a web browser. You will be prompted for credentials that Ben Ryan will have sent over. From there, navigate to the `aske` folder, where you will see all of the various inputs and outputs of the MULTIVAC system. The inputs and outputs of each stage are as follows:
* Scraping
  * Inputs: There are no inputs per se, since this first step instantiates MULTIVAC by scraping articles from arXiv, Springer, and PubMed on the epidemiology topic chosen for MULTIVAC.
  * Outputs: The `20181212.json` file contains all of the scraped data and is used by the parsing step.
* Parsing
  * Inputs: The `20181212.json` file.
  * Outputs: The `articles-with-equations.json` file and the files in the `dim_files` directory. The JSON file is used with GloVe modeling and QG-Net, while the DIM files are used with the MLN.
* GloVe modeling
  * Inputs: The `articles-with-equations.json` file.
  * Outputs: The `da_embeddings.txt` file, a set of domain-adapted word embeddings used in QG-Net.
* QG-Net
  * Inputs: The `da_embeddings.txt` and the `articles-with-equations.json` files.
  * Outputs: The `input.txt` is a processed version of data that is fed through QG-Net itself and the `output_questions_QG-Net.pt.txt` is a set of questions from QG-Net.
* MLN
  * Inputs: The files in the `dim_files` directory.
  * Outputs: The `mln.pkl` file that wraps up the knowledge graph and attributes, including semantic clustering.
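
The stage list above can be read as a small dependency table. The sketch below is illustrative only: it assumes all of the listed files have been downloaded into a single flat directory, and the helper name `runnable_stages` is hypothetical rather than part of MULTIVAC. It reports which stages have their inputs on disk but have not yet produced their outputs.

```python
from pathlib import Path

# (stage name, inputs, outputs), transcribed from the list above
STAGES = [
    ("Scraping", [], ["20181212.json"]),
    ("Parsing", ["20181212.json"],
     ["articles-with-equations.json", "dim_files"]),
    ("GloVe modeling", ["articles-with-equations.json"],
     ["da_embeddings.txt"]),
    ("QG-Net", ["da_embeddings.txt", "articles-with-equations.json"],
     ["input.txt", "output_questions_QG-Net.pt.txt"]),
    ("MLN", ["dim_files"], ["mln.pkl"]),
]


def runnable_stages(data_dir):
    """Return stages whose inputs all exist but whose outputs do not.

    Assumes (hypothetically) that every artifact sits directly in data_dir.
    """
    data_dir = Path(data_dir)
    ready = []
    for name, inputs, outputs in STAGES:
        have_inputs = all((data_dir / f).exists() for f in inputs)
        have_outputs = all((data_dir / f).exists() for f in outputs)
        if have_inputs and not have_outputs:
            ready.append(name)
    return ready
```

Calling `runnable_stages(path_to_downloads)` on a folder that holds only `20181212.json` would point to the parsing stage as the next one to run.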

The rest of this notebook will step through language around the particular component of MULTIVAC and then code -- using the data referenced above -- to exemplify execution.

## 0. Settings
In the cell below, update the directory list as needed, following the example text. 

In [1]:
import ast
import configparser
import os

from dotenv import load_dotenv
from pathlib import Path

from multivac.src import utilities


cfg = configparser.ConfigParser()
cfgDIR = Path('').resolve()

# Use `config_file_name` if it was defined earlier in the session; otherwise
# fall back to the default configuration file
try:
    cfg.read(cfgDIR / config_file_name)
except NameError:
    cfg.read(cfgDIR / 'multivac.cfg')


def get_path(key, default):
    # Values read from the config file are strings, so wrap the result in
    # Path to keep the `/` joins below working either way
    return Path(cfg['PATHS'].get(key, default))


root_dir = get_path('root_dir', cfgDIR/'multivac')
qgnet_dir = get_path('qgnet_dir', cfgDIR/'qgnet')

data_dir = get_path('data_dir', root_dir/'data')
raw_dir = get_path('raw_dir', data_dir/'raw')
interim_dir = get_path('interim_dir', data_dir/'interim')
processed_dir = get_path('processed_dir', data_dir/'processed')
metadata_dir = get_path('metadata_dir', processed_dir/'metadata')
models_dir = get_path('models_dir', root_dir/'models')
stanf_nlp_dir = get_path('stanf_nlp_dir', root_dir/'stanford_nlp_models')
mln_dir = get_path('mln_dir', root_dir/'mln_models')

# Get search and filter settings; default to empty lists
# (literal_eval parses list literals without executing arbitrary code)
terms = ast.literal_eval(cfg['SEARCH'].get('terms', '[]'))
sources = ast.literal_eval(cfg['SEARCH'].get('sources', '[]'))
arxiv_drops = ast.literal_eval(cfg['FILTER'].get('drops', '[]'))

# make data directories if they don't already exist
dirs = [
    data_dir,
    raw_dir,
    interim_dir,
    processed_dir,
    metadata_dir,
    models_dir,
    stanf_nlp_dir,
    mln_dir,
]
dirs += [raw_dir / x for x in sources]
for _dir in dirs:
    utilities.mkdir(_dir)


In [2]:
from multivac.src import utilities


# make data directories if they don't already exist
dirs = [
    data_dir,
    raw_dir,
    interim_dir,
    processed_dir,
    metadata_dir,
    models_dir,
    stanf_nlp_dir,
    mln_dir,
]
dirs += ['{}/{}'.format(raw_dir, x) for x in sources]
for _dir in dirs:
    utilities.mkdir(_dir)


## 1. Scraping
To build a robust and diverse set of source models, MULTIVAC set a target of 2,000 scientific journal articles fitting set specifications and filters. These specifications are coded in a user-editable configuration file (by default, `multivac.cfg`), and cover both search parameters (search terms, and sources to search) as well as filter parameters, to weed out duplicates or unrelated content that might return in a more naive match on the specified search terms.
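
As a rough illustration of such a configuration file, the sketch below parses an in-memory example with the standard `configparser` module. The section and option names (`PATHS`, `SEARCH`/`terms`, `SEARCH`/`sources`, `FILTER`/`drops`) are the ones read in the settings cell above; every value shown, and the `read_list` helper, is a hypothetical placeholder.

```python
import ast
import configparser

# Hypothetical example values; only the section and option names are taken
# from the settings cell in this notebook.
EXAMPLE_CFG = """
[PATHS]
root_dir = /home/user/multivac

[SEARCH]
terms = ['sir model', 'seir model']
sources = ['arxiv', 'pubmed', 'springer']

[FILTER]
drops = ['astro-ph']
"""


def read_list(cfg, section, option):
    """Read an option written as a Python list literal; default to []."""
    raw = cfg.get(section, option, fallback='[]')
    value = ast.literal_eval(raw)
    if not isinstance(value, list):
        raise ValueError('{}.{} must be a list'.format(section, option))
    return value


cfg = configparser.ConfigParser()
cfg.read_string(EXAMPLE_CFG)

terms = read_list(cfg, 'SEARCH', 'terms')
sources = read_list(cfg, 'SEARCH', 'sources')
drops = read_list(cfg, 'FILTER', 'drops')
print(terms, sources, drops)
```

Using `ast.literal_eval` keeps the list-literal syntax while refusing anything that is not a plain Python literal.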
 
To achieve the desired sample document size, MULTIVAC targets three different online sources of epidemiological research: arXiv.org, PubMed, and Springer. MULTIVAC accesses these resources through source-specific APIs and authenticates with user API access keys. Each source is searched for articles containing the specified search terms. The results are saved as a combined JSON file. Each article ID is a top-level key, with a Metadata key containing various metadata keys and values and a Text key containing the plain body text. This file then serves as the intermediate "source" datastore for subsequent analysis and processing.
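
A minimal sketch of that layout follows. The article IDs, the metadata fields, and the lowercase `metadata`/`text` key names are assumptions made for illustration (the parsing cell later in this notebook reads `value['text']`); the real file may differ in detail.

```python
import json

# Hypothetical IDs and field values; only the overall shape (article ID ->
# metadata + text) follows the description above.
combined = {
    "arxiv-0001": {
        "metadata": {"source": "arxiv", "title": "Placeholder arXiv title"},
        "text": "Plain body text of the first article.",
    },
    "pmc-0002": {
        "metadata": {"source": "pubmed", "title": "Placeholder PubMed title"},
        "text": "Plain body text of the second article.",
    },
}


def count_by_source(store):
    """Count articles per source using each article's metadata block."""
    counts = {}
    for article in store.values():
        source = article["metadata"].get("source", "unknown")
        counts[source] = counts.get(source, 0) + 1
    return counts


# Round-trip through JSON to mimic writing and re-reading the datastore
restored = json.loads(json.dumps(combined))
print(count_by_source(restored))
```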
 
These three sources also serve to demonstrate MULTIVAC's ability to work with a variety of data storage types: PubMed articles are ingested from XML, Springer from HTML, and arXiv from PDF. In the final version of this prototype system, MULTIVAC scraped a combined 2,740 articles (after pruning malformed or un-parseable results) from these three sources: 897 from arXiv, 1,280 from PubMed, and 573 from Springer.

### 1.A. Collect data

In [3]:
import copy
import feedparser
import os
import pickle
import requests
import time

from bs4 import BeautifulSoup as bs
from dotenv import load_dotenv

from multivac import settings
from multivac.src.data.get import get_total_number_of_results, prep_terms, query_api


# load environment variables from .env
load_dotenv()
springer_api_key = os.environ.get('SPRINGER_API_KEY')
user_email = os.environ.get('USER_EMAIL')  # courtesy to NIH to include your email

wait_time = 3


def collect_get_main():
    # ------------------------------------------------------------------------
    # arxiv

    # build query and get metadata of articles matching our search criteria
    params = {'start': 0, 'max_results': 100, 'sortBy': 'relevance'
             ,'sortOrder': 'descending'}
    li = [x.replace('-', ' ').split(' ') for x in settings.terms]
    q = 'OR'.join(['%28' + prep_terms(x) + '%29' for x in li])
    url = 'http://export.arxiv.org/api/query?search_query=' + q
    arxiv_metadata = query_api(url, q, params, wait_time=1, verbose=True)

    # save pdfs of articles that matched our search criteria
    # we use doi as the filename when that id is present; otherwise we use the
    # arxiv id
    for ix, md in enumerate(arxiv_metadata):
        url = md['id']
        pdf_url = url.replace('/abs/', '/pdf/')
        article_fn = url.split('/abs/')[-1]
        article_fn = '_'.join(article_fn.split('/')) + '.pdf'
        # specify filename so we can associate each pdf with its metadata down
        # the road
        arxiv_metadata[ix]['fn'] = article_fn
        dst = settings.raw_dir / 'arxiv' / article_fn
        if not os.path.exists(dst):
            r = requests.get(pdf_url)
            with open(dst, 'wb') as f:
                f.write(r.content)
            time.sleep(0.3)

    # save arxiv metadata
    fn = 'arxiv' + '.pkl'
    dst = settings.metadata_dir / fn
    with open(dst, 'wb') as f:
        pickle.dump(arxiv_metadata, f)

    # ------------------------------------------------------------------------
    # springer

    # build query to retrieve metadata
    make_q = lambda li: '(' + ' OR '.join(['"' + s + '"' for s in li]) + ')'
    q = make_q(settings.terms)
    base = 'http://api.springernature.com/openaccess/json?q='
    url = base + q
    params = {
        'source': 'springer',
        'openaccess': 'true',
        'api_key': springer_api_key, 'p': 20, 's': 1
    }
    params_ = copy.deepcopy(params)

    # retrieve metadata
    springer_metadata = []
    while True:
        r = requests.get(url, params_)
        if len(r.json()['records']) == 0:
            break
        params_['s'] = params_['s'] + params_['p']
        springer_metadata += r.json()['records']
        time.sleep(wait_time)
    print('%s total Springer articles' % len(springer_metadata))

    # iterate over springer metadata and download html for each article
    # we use a generator to increase wait times with each connection error
    waits = (2**x for x in range(0,6))
    for ix, md in enumerate(springer_metadata):
        fn = md['doi'].replace('/', '-')
        if len(fn) == 0:
            fn = md['identifier']
        fn = fn + '.html'
        springer_metadata[ix]['fn'] = fn
        dst = settings.raw_dir / 'springer' / fn
        if not os.path.exists(dst):
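            try:
                r = requests.get(md['url'][0]['value'])
            except requests.exceptions.ConnectionError:
                # back off for progressively longer waits (1, 2, 4, ... s);
                # reuse the longest wait once the generator is exhausted
                time.sleep(next(waits, 32))
                r = requests.get(md['url'][0]['value'])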
            html = bs(r.text).encode('utf-8').decode('utf-8')
            with open(dst, 'w', encoding='utf-8') as f:
                f.write(html)
            time.sleep(3)

    # save springer metadata
    dst = settings.metadata_dir / 'springer.pkl'
    with open(dst, 'wb') as f:
        pickle.dump(springer_metadata, f)

    # ------------------------------------------------------------------------
    # pubmed

    # search pubmed central for free full text articles containing selected
    # query
    # get the ids which we then use to get the xml text data
    replace = lambda s: s.replace(' ', '+')
    quote = lambda s: '%22' + s + '%22'
    terms = [quote(replace(s)) for s in settings.terms]
    term = 'term='+ '%28'+ '+OR+'.join(terms) + '%29'
    fulltext = 'free+fulltext%5bfilter%5d'
    retmax = 'retmax=2000'
    base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pmc'
    # retmax is already set in the URL; pass only the courtesy email here
    params = {'email': user_email}
    url = base + '&' + term + '+' + fulltext + '&' + retmax
    r = requests.get(url, params=params)
    ids = [x.contents[0] for x in bs(r.text).find_all('id')]

    print('%s Pubmed Central (PMC) articles' % len(ids))

    # get xml text data and save to disk
    for i in ids:
        pmc_id = 'pmc' + str(i)
        fn = (pmc_id + '.xml')
        dst = settings.raw_dir / 'pubmed' / fn
        if not os.path.exists(dst):
            # pass the article id once, via params, rather than in both places
            url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc'
            r = requests.get(url, params={'id': str(i)})
            xml = r.text
            with open(dst, 'w') as f:
                f.write(xml)
            time.sleep(0.5)


In [4]:
collect_get_main()

988 total results, 1 second wait time between each call

### 1.B. Parse results (note, Part 1.A. must be run first)

In [5]:
import copy
import json
import os
import pickle
import pubmed_parser
import slate

from bs4 import BeautifulSoup as bs
from collections import OrderedDict

from multivac import settings
from multivac.src import utilities
from multivac.src.data.process import (
    aggregate_pubmed, filter_arxiv, parse_articles_data, parse_html, parse_pdf, parse_pubmed, save_outputs
)


def collect_process_main():
    output = {}
    for source in settings.sources:
        data_raw_dir = settings.raw_dir / source
        if source in ['arxiv', 'springer']:
            data = parse_articles_data(source, data_raw_dir)
        elif source == 'pubmed':
            srcs = [data_raw_dir / x for x in os.listdir(data_raw_dir)]
            data = aggregate_pubmed(srcs)
        if len(output) == 0:
            output = copy.deepcopy(data)
        else:
            output.update(data)
    arxiv_drops = [x.split()[0] for x in settings.arxiv_drops]
    filtered_output = filter_arxiv(output, arxiv_drops)
    save_outputs(filtered_output)
    return True


In [None]:
collect_process_main()

## 2. Parsing
The parsing component of MULTIVAC takes a JSON file of scraped journal articles and parses the text as well as any LaTeX notation contained within it. It is largely driven by `parsing.py`, which accepts five arguments:

| Argument | Required? | Description |
| --- | --- | --- |
| `-b` | Optional | Index of the first document to process (useful for stopping and continuing the parsing process) |
| `-s` | Required | Path to the Stanford NLP model |
| `-c` | Required | Indicator for whether or not to create a JSON file with tokenized LaTeX equations (`y`/`n`) |
| `-d` | Required | Path to the JSON input file |
| `-o` | Required | Path where dependency, input and morphology files should be written. The folder must contain three subfolders labeled `dep`, `input` and `morph`. |
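
For readers who want to see how the five flags described above fit together, here is a hedged `argparse` sketch. The short flags come from the description; the `dest` names, help strings, and example paths are hypothetical, and the actual `parsing.py` may define its interface differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the five documented flags (names assumed)."""
    parser = argparse.ArgumentParser(
        description='Parse scraped articles and embedded LaTeX notation.')
    parser.add_argument('-b', dest='start_index', type=int, default=None,
                        help='index of the first document to process')
    parser.add_argument('-s', dest='stanford_model_path', required=True,
                        help='path to the Stanford NLP model')
    parser.add_argument('-c', dest='create_json', required=True,
                        choices=['y', 'n'],
                        help='write a JSON file with tokenized equations')
    parser.add_argument('-d', dest='input_json', required=True,
                        help='path to the JSON input file')
    parser.add_argument('-o', dest='output_dir', required=True,
                        help='folder containing dep, input and morph')
    return parser


# Hypothetical invocation; the paths are placeholders
args = build_parser().parse_args(
    ['-s', 'stanford_nlp_models', '-c', 'y',
     '-d', 'data/20181212.json', '-o', 'data/dim_files'])
print(args.start_index, args.create_json, args.output_dir)
```

Leaving `-b` unset yields `None`, which a caller can treat as "start from the first document".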
 
Initial MULTIVAC text parsing relies on two natural language processing engines, stanfordnlp and spaCy, to construct dependency trees, tag parts of speech and lemmatize tokens. MULTIVAC leverages both NLP engines because of their complementary strengths and value to the overall system. The spaCy library is very fast at transforming and manipulating texts, while Stanford’s dependency parsing is intentionally designed to emphasize and prioritize semantically meaningful syntactic structure. While spaCy also performs dependency tree parsing, its model and strategy are much more purely syntactic, making it a poor fit for semantically parsing our texts into a Markov Logic Network knowledge graph.
 
Each sentence is processed individually to identify the dependency structure of its tokens. When LaTeX notation occurs in text the notation block is extracted and a “dummy” token is substituted, allowing the NLP dependency parsing to interpret the sentence as a proper English language construct. This is especially important for in-line LaTeX notations, which otherwise render many of the most important sentences in an article un-parseable.
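
The substitution step can be sketched as follows. This is not MULTIVAC's implementation: it assumes inline notation is delimited by single dollar signs, and the function name is hypothetical. The placeholder format (`Ltxqtn` followed by eight lowercase letters) matches the pattern that the parsing cell later in this notebook searches for.

```python
import random
import re
import string

# Assumption: inline LaTeX is delimited by single dollar signs
INLINE_MATH = re.compile(r'\$(.+?)\$')


def extract_and_replace(sentence, latex_map, rng=random):
    """Swap each inline LaTeX span for a dummy token and record the mapping."""
    def _swap(match):
        # Draw random tokens until one is unused, so placeholders stay unique
        while True:
            suffix = ''.join(rng.choice(string.ascii_lowercase)
                             for _ in range(8))
            token = 'Ltxqtn' + suffix
            if token not in latex_map:
                break
        latex_map[token] = match.group(1)
        return token
    return INLINE_MATH.sub(_swap, sentence)


latex_map = {}
cleaned = extract_and_replace(
    r'The rate $\beta S I$ drives new infections.', latex_map)
print(cleaned)
print(latex_map)
```

Because the dummy token reads like an ordinary noun, a dependency parser can treat the sentence as regular English, and the recorded mapping lets the equation be restored afterwards.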

![LaTeX equation parsing](images/latex_parse_1.png)

The LaTeX equation itself is separately parsed and then re-inserted into the sentence, with the root of the LaTeX tree taking the place of the dummy token in the dependency structure. The LaTeX representations are parsed by converting them first into a sympy representation that enables deconstructing expressions into a nested tree structure that contains a series of functions and arguments. For example, the expression 2^x + (x*y) would be expressed as Add(Pow(Number(2), Entity('x')), Mul(Entity('x'), Entity('y'))) where Add(), Pow() and Mul() are functions, and Number(2) and Entity('x') are arguments. MULTIVAC transforms these nested parenthetical representations into a collapsed dependencies format and inserts the entire chain back into the source sentence, updating token indices as appropriate. The individual relationship and entity tokens from these equations are also expanded out in string representation and replace the LaTeX notation in the original text.
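
To make the nested form concrete, the sketch below rebuilds the tree `Add(Pow(Number(2), Entity('x')), Mul(Entity('x'), Entity('y')))` quoted above, which corresponds to `2**x + (x*y)` in Python syntax, and flattens it into head/child pairs. It deliberately uses Python's standard `ast` module as a stand-in for sympy, and the pair format is an assumption for illustration, not MULTIVAC's actual collapsed-dependency format.

```python
import ast

# Map Python operator nodes onto the function names used in the text
OPS = {ast.Add: 'Add', ast.Mult: 'Mul', ast.Pow: 'Pow'}


def to_tree(node):
    """Convert an ast expression node into a (label, children) tuple."""
    if isinstance(node, ast.BinOp):
        return (OPS[type(node.op)],
                [to_tree(node.left), to_tree(node.right)])
    if isinstance(node, ast.Constant):
        return ('Number({})'.format(node.value), [])
    if isinstance(node, ast.Name):
        return ("Entity('{}')".format(node.id), [])
    raise ValueError('unsupported expression node: {!r}'.format(node))


def render(tree):
    """Render the tree in nested parenthetical form."""
    label, children = tree
    if not children:
        return label
    return '{}({})'.format(label, ', '.join(render(c) for c in children))


def edges(tree):
    """Flatten the tree into (head, child) pairs, depth-first."""
    label, children = tree
    pairs = []
    for child in children:
        pairs.append((label, child[0]))
        pairs.extend(edges(child))
    return pairs


tree = to_tree(ast.parse('2**x + (x*y)', mode='eval').body)
print(render(tree))
for head, child in edges(tree):
    print(head, '->', child)
```

The root of the tree (`Add` here) is the node that would take the dummy token's place in the sentence's dependency structure.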

![LaTeX entity relationship](images/latex_parse_2.png)

The outputs of this translation process are three sets of files: Dependency files, Morphology files, and Input files. Each file represents a parse of one article and is formatted in blocks, with one block for each sentence in the article. “Input” files record original word or punctuation as well as the part of speech (POS) tag, while “Morph” files record the token lemma, and each line contains a separate token. “Dep” files record the Stanford Universal Dependency relationships between pairs of words as well as the indices of the component words in the sentence. The article texts with processed equation tokens re-inserted are also written out to a file (articles-with-equations.json) for further use in preparing GloVe embeddings.

In [8]:
import copy
import json
import pickle
import re as reg
import spacy
import stanfordnlp

import multivac.src.data.equationparsing as eq

from interruptingcow import timeout

from multivac import settings
from multivac.src.data.textparsing import clean_doc
from multivac.src.data.parsing import (
    create_parse_files, get_adjustment_position, get_token_governor, 
    load_data
)

def nlp_parse_main(args_dict):
    ''' Main run file that orchestrates everything
    '''

    ## Load NLP engines
    spacynlp = spacy.load('en_core_web_sm')
    nlp = stanfordnlp.Pipeline(models_dir=settings.stanf_nlp_dir,
                               treebank='en_ewt', use_gpu=False,
                               pos_batch_size=3000)

    ## Load documents
    jsonObj, allDocs = load_data(settings.processed_dir / 'data')

    ## Process and Clean documents
    try:
        # use a context manager so the pickle file handle is always closed
        with open('allDocsClean.pkl', 'rb') as f:
            allDocsClean = pickle.load(f)
        print('Loaded pickle!')
    except FileNotFoundError:
        print('Starting from scratch')
        allDocsClean= []
        for i, doc in enumerate(allDocs):
            if i%10==0:
                print(i)
            allDocsClean.append(clean_doc(doc, spacynlp))

        with open('allDocsClean.pkl', 'wb') as f:
            pickle.dump(allDocsClean, f)


    allDocs2 = [eq.extract_and_replace_latex(doc) for doc in allDocsClean]
    print('Number of LaTeX equations parsed: {}'.format(len(eq.LATEXMAP)))


    ## Put equations back into text - this will be fed to glove embedding
    if args_dict['nlp_newjson']:
        print('***************\nBuilding JSON file for glove embedding...')
        allDocs3 = []
        # guard against a zero step (and a modulo-by-zero) on small corpora
        percentCompletedMultiple = max(1, int(len(allDocs2)/10))
        for i, doc in enumerate(allDocs2[0:]):
            if i%percentCompletedMultiple == 0:
                print('{}% completed'.format(round(i/(len(allDocs2))*100, 0)))
            newDoc = reg.sub(r'Ltxqtn[a-z]{8}', eq.put_equation_tokens_in_text,
                             doc)
            allDocs3.append(newDoc)

        jsonObj2 = copy.deepcopy(jsonObj)
        allDocs3Counter = 0

        for key, value in list(jsonObj2.items()):
            if value['text']:
                jsonObj2[key]['text']=allDocs3[allDocs3Counter]
                allDocs3Counter = allDocs3Counter+1

        with open('{}/articles-with-equations.json'.format(settings.data_dir),
                  'w', encoding='utf8') as fp:
            json.dump(jsonObj2, fp)


    ## Parse files into DIM
    startPoint=-1
    if args_dict['nlp_bp'] is not None:
        startPoint = args_dict['nlp_bp']

    for i, doc in enumerate(allDocs2[0:]):
        print('Processing document #{}'.format(i))
        if i > startPoint:

            # Use exception handling so that the process doesn't get stuck and
            # time out because of memory errors
            try:
                with timeout(300, exception=RuntimeError):
                    nlpifiedDoc = nlp(doc)
                    thisDocumentData = create_parse_files(
                        nlpifiedDoc, i, True, settings.data_dir
                    )
            except RuntimeError:
                print("Didn't finish document #{} within five minutes. Moving to next one.".format(i))



## 3. GloVe models
MULTIVAC also trains a 300-dimensional domain-adapted Global Vectors (GloVe) word-embeddings model on the corpus and saves this file in the same folder. GloVe embeddings derive multi-dimensional vector spaces describing word associations based on calculations of word co-occurrences over a large corpus.<sup>[1]</sup> 
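
As a simplified picture of the co-occurrence statistics that GloVe starts from, the sketch below counts word pairs inside a sliding window. The window size and the 1/distance weighting are assumptions chosen for illustration; this is not the GloVe training code and it does not fit any embeddings.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Accumulate symmetric, distance-weighted co-occurrence counts.

    Each pair of tokens at most `window` positions apart contributes
    1/distance to the count for that pair (an assumed weighting).
    """
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return dict(counts)


corpus = [['susceptible', 'infected', 'recovered']]
counts = cooccurrence_counts(corpus, window=2)
for pair in sorted(counts):
    print(pair, counts[pair])
```

An embedding model is then fit so that vector products approximate the logarithm of these counts, which is why words that appear in similar contexts end up close together.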
 
MULTIVAC begins with a pre-trained 300-dimensional GloVe model incorporating 2 million terms found in the Common Crawl corpus, a collection of over 2 billion webpages scraped monthly.<sup>[2]</sup> This model represents a best-in-class embedding model for generic English language text. However, given the specific and highly technical domain we are attempting to understand and model, much domain-specific semantic knowledge, not to mention domain-specific vocabulary, is not accounted for in this generic model. MULTIVAC augments this model by training a domain-specific model on our corpus and combining the embeddings using Canonical Correlation Analysis (CCA) on the intersection of tokens between the two models.<sup>[3]</sup> The vectors for each token of the domain-adapted GloVe embedding model are derived from a weighted average of the canonical vectors (N = 100) from the CCA analysis.
 
This alignment occurs on words that exist in both the domain-specific and generic model vocabularies, but for terms that are entirely domain-specific the vector representations are projected into the 100-dimensional canonical vector space from the CCA analysis via matrix multiplication and appended to the domain-adapted embedding vectors. The resulting domain-adapted model encompasses all terms in our corpus and combines semantic meaning from both the domain and wider global context.
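
The combination step can be sketched with NumPy as below. This is a small-scale illustration under stated assumptions, not MULTIVAC's code: the function names are hypothetical, the equal weighting (`alpha=0.5`) is a guess since the weights are not given here, and the toy dimensions stand in for the 300-dimensional embeddings and 100 canonical vectors described above.

```python
import numpy as np


def _inv_sqrt(S):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * w ** -0.5) @ V.T


def fit_cca(X, Y, k, reg=1e-8):
    """Return projections A, B, the column means, and the top-k correlations."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    Wx, Wy = _inv_sqrt(Sxx), _inv_sqrt(Syy)
    # Singular vectors of the whitened cross-covariance give the directions
    U, s, Vt = np.linalg.svd(Wx @ Sxy @ Wy)
    return Wx @ U[:, :k], Wy @ Vt[:k].T, mx, my, s[:k]


def domain_adapt(dom_shared, gen_shared, dom_only, k, alpha=0.5):
    """Blend shared-word vectors in CCA space; project domain-only words."""
    A, B, mx, my, _ = fit_cca(dom_shared, gen_shared, k)
    shared = (alpha * (dom_shared - mx) @ A
              + (1 - alpha) * (gen_shared - my) @ B)
    projected = (dom_only - mx) @ A  # matrix multiplication into CCA space
    return np.vstack([shared, projected])


rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
dom = latent @ rng.normal(size=(3, 6)) + 0.1 * rng.normal(size=(200, 6))
gen = latent @ rng.normal(size=(3, 5)) + 0.1 * rng.normal(size=(200, 5))
adapted = domain_adapt(dom, gen, dom[:10] + 1.0, k=2)
print(adapted.shape)
```

Rows for shared words come first and rows for domain-only words are appended, mirroring the "projected ... and appended" description above.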

<hr>

<sup>[1]</sup> Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. “GloVe: Global Vectors for Word Representation.” Full text available at https://nlp.stanford.edu/pubs/glove.pdf

<sup>[2]</sup> See http://commoncrawl.org/connect/blog/ for up-to-date statistics on the corpus. As of this report the total is 3.1 billion pages, though this has varied over time since project inception rather than simply increasing monotonically. When the pre-trained GloVe model was created, the corpus was closer to 2 billion pages in size.

<sup>[3]</sup> Prathusha K Sarma, Yingyu Liang, William A Sethares, “Domain Adapted Word Embeddings for Improved Sentiment Classification,” Submitted on 11 May 2018. arXiv:1805.04576 [cs.CL] Full text available at: https://arxiv.org/pdf/1805.04576