# MULTIVAC Parsing Text and LaTeX equations

<b>PACKAGE DEPENDENCIES (AND INSTALL DIRECTIONS):</b> 
    
* stanfordnlp - https://github.com/stanfordnlp/stanfordnlp
* spaCy - https://spacy.io/usage
* sympy - https://docs.sympy.org/latest/install.html
* antlr-python-runtime - https://anaconda.org/conda-forge/antlr-python-runtime


<b>INPUTS:</b> 
    
* JSON file from web scrape


<b>OUTPUTS:</b> 
    
* JSON file with tokenized LaTeX equations
* dependency (*.dep) files
* input (*.input) files
* morphology (*.morph) files

<b>SUMMARY:</b> 
    
The parsing component of MULTIVAC takes a JSON file of scraped journal articles and parses both the text and the LaTeX notation it contains. LaTeX equation parsing requires the `sympy` library and the `antlr` runtime. Scripts in `equationparsing.py` use regular expressions to identify properly formatted LaTeX code, and `sympy` parses each match into a string representation. The tokens of that string representation replace the LaTeX notation in the original text. The updated articles are written to `articles-with-equations.json` for later use in preparing GloVe embeddings.

The text parsing relies on two natural language processing engines – `stanfordnlp` and `spaCy` – to construct dependency trees, tag parts of speech and lemmatize tokens. Each sentence is processed individually to identify the dependency structure of its tokens. When LaTeX notation occurs in text, its string representation is parsed into dependency trees and included in the sentence structure. Text files are written out to three separate folders: `dep` (contains *.dep* files with Stanford dependency trees); `morph` (contains *.morph* files with lemmatized tokens); and `input` (contains *.input* files with part-of-speech-tagged tokens).

<hr>

<div class="alert alert-block alert-info">
<b>Load packages/libraries:</b> This block loads standard, third-party and local application modules
   
</div>

#### Import standard libraries


<div class="alert alert-block alert-success">
<b>Instructions:</b> 
Add a system path that points to files in `src/data/` for custom modules
</div>



In [None]:
import argparse
import copy
import gc
import json
import os
import pickle
import re as reg
import sys

sys.path.append(os.path.abspath('../../src/data'))

#### Import third party libraries

In [None]:
import spacy
import stanfordnlp
from interruptingcow import timeout

#### Import local application libraries

In [None]:
import equationparsing as eq
from textparsing import clean_doc
from parsing import create_parse_files, get_adjustment_position, get_token_governor, load_data

<hr />
<div class="alert alert-block alert-info">
<b>Load natural language processors:</b> 
    
* spaCy is used for quick validation of words (particularly in the `clean_doc` function in the `textparsing` module)
* StanfordNLP is used for tokenizing, dependency parsing, POS-tagging and lemmatization
    
</div>


<div class="alert alert-block alert-success">
<b>Instructions:</b> 
Ensure that the path passed to the `models_dir` argument in the `stanfordnlp.Pipeline` method points to the correct location of the StanfordNLP resources on your local machine. 
    
For some installations, spaCy may also require a path.

</div>


In [None]:
spacynlp = spacy.load('en_core_web_sm')
nlp = stanfordnlp.Pipeline(models_dir='../../../../multivac/stanfordnlp_resources/', 
                            treebank='en_ewt', use_gpu=False, pos_batch_size=3000)

<hr />
<div class="alert alert-block alert-info">
<b>Load documents:</b> 
Retrieve the JSON file that contains article data and metadata
</div>


<div class="alert alert-block alert-success">
<b>Instructions:</b> 
Ensure that the path passed to the `load_data` function matches the location of the JSON file
</div>
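
As a rough illustration of the two values unpacked in the next cell, the sketch below reads a JSON file and collects the non-empty `text` fields. The function name `load_articles` is hypothetical, and the assumption that the file is a dictionary of entries with a `text` key is inferred from how `jsonObj` is used later in this notebook; the actual `load_data` in the `parsing` module may behave differently.

```python
import json


def load_articles(path):
    """Hypothetical stand-in for load_data: return (raw JSON dict, list of texts)."""
    with open(path, encoding='utf8') as fp:
        json_obj = json.load(fp)
    # Assumption: each entry is a dict; keep only non-empty texts, in file order.
    texts = [entry['text'] for entry in json_obj.values() if entry.get('text')]
    return json_obj, texts
```

Keeping the texts in file order matters if they are later written back to the same entries by position.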



In [None]:
jsonObj, allDocs = load_data('../../../../multivac/data/20181212.json')

<div class="alert alert-block alert-info">
<b>Clean documents:</b> 
    
Retrieve the text elements from the articles and prepare them for further processing. The process first checks whether a pickle of the cleaned data (`allDocsClean.pkl`) already exists in the working directory. If it does, the pickle is loaded. Otherwise, the texts are cleaned and the result is written to `allDocsClean.pkl` in the working directory. Cleaned documents are stored as a list in the `allDocsClean` variable. </div>

    
The cleaning process attempts to:
* remove authors
* remove citations
* remove URLs
* remove emails
* remove other metadata artefacts from journal scraping embedded within the text
* adjust hyphenated words 

In [None]:
try:
    # Reuse previously cleaned documents if a pickle exists
    with open('allDocsClean.pkl', 'rb') as f:
        allDocsClean = pickle.load(f)
    print('Loaded pickle!')
except FileNotFoundError:
    print('No saved pickle. Starting from scratch.')
    allDocsClean = []
    # Report progress about every 10%; max() avoids a modulo by zero for fewer than 10 documents
    percentCompletedMultiple = max(1, len(allDocs) // 10)
    for i, doc in enumerate(allDocs):
        if i % percentCompletedMultiple == 0:
            print('{}% completed'.format(round(i / len(allDocs) * 100, 0)))
        allDocsClean.append(clean_doc(doc, spacynlp))

    with open('allDocsClean.pkl', 'wb') as f:
        pickle.dump(allDocsClean, f)

<div class="alert alert-block alert-info">
<b>Extract LaTeX notation:</b> 
    
Find LaTeX notation and store it in the global dictionary `eq.LATEXMAP`. In `eq.LATEXMAP`, each key is the tag assigned to one instance of LaTeX notation (`Ltxqtn` followed by eight lowercase letters, matching the pattern `Ltxqtn[a-z]{8}` used in the optional replacement step), and each value is the actual LaTeX code.

</div>
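
The tag-and-store idea can be sketched with the standard library alone. Everything in this example is an assumption made for illustration: the `$...$` and `$$...$$` delimiters, the helper names `make_tag` and `extract_latex`, and the way tags are generated. The real patterns and tag scheme live in `equationparsing.py` and may differ.

```python
import re

# Assumption: equations are delimited by $...$ or $$...$$.
LATEX_RE = re.compile(r'\$\$(.+?)\$\$|\$(.+?)\$', re.DOTALL)


def make_tag(index):
    """Encode an integer as 'Ltxqtn' plus eight lowercase letters (base 26)."""
    letters = []
    for _ in range(8):
        index, remainder = divmod(index, 26)
        letters.append(chr(ord('a') + remainder))
    return 'Ltxqtn' + ''.join(reversed(letters))


def extract_latex(text, latex_map):
    """Replace each equation with a tag and record tag -> LaTeX in latex_map."""
    def _swap(match):
        tag = make_tag(len(latex_map))
        latex_map[tag] = match.group(1) or match.group(2)
        return tag
    return LATEX_RE.sub(_swap, text)
```

Because the tags are plain alphabetic strings, a tokenizer can treat each one as a single word until the equation is expanded or parsed later.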

In [None]:
allDocs2 = [eq.extract_and_replace_latex(doc) for doc in allDocsClean]
print('Number of LaTeX equations parsed: {}'.format(len(eq.LATEXMAP)))

#### OPTIONAL - Replace LaTeX code in original text with string representation tokens


If the equations can be parsed out into a string representation (using the `sympy` library -- see https://docs.sympy.org/latest/tutorial/manipulation.html), then replace the notation with the actual tokens used. The tokens are replaced in the text key of the original JSON file. The updated JSON file is then written out to `articles-with-equations.json`. This file is used for creating GloVe embeddings in a later step.
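
The replacement step relies on passing a function, rather than a string, as the second argument of `re.sub`, so that each matched tag is looked up individually. A minimal sketch of that mechanism follows; `substitute_tags` and `token_map` are hypothetical names, and falling back to the unchanged tag when no string representation exists is an assumption about the desired behavior, not a description of `eq.put_equation_tokens_in_text`.

```python
import re

TAG_RE = re.compile(r'Ltxqtn[a-z]{8}')


def substitute_tags(text, token_map):
    """Replace each tag with its token string; keep the tag if none is known."""
    def _lookup(match):
        tag = match.group(0)
        return token_map.get(tag, tag)
    return TAG_RE.sub(_lookup, text)
```

A callback also avoids escaping problems: its return value is inserted literally, so backslashes in the replacement are not interpreted as group references.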

In [None]:
allDocs3 = []
# Report progress about every 10%; max() avoids a modulo by zero for fewer than 10 documents
percentCompletedMultiple = max(1, len(allDocs2) // 10)
for i, doc in enumerate(allDocs2):
    if i % percentCompletedMultiple == 0:
        print('{}% completed'.format(round(i / len(allDocs2) * 100, 0)))
    newDoc = reg.sub(r'Ltxqtn[a-z]{8}', eq.put_equation_tokens_in_text, doc)
    allDocs3.append(newDoc)

jsonObj2 = copy.deepcopy(jsonObj)
allDocs3Counter = 0

# Relies on allDocs3 being in the same order as the entries of jsonObj that have non-empty text
for key, value in list(jsonObj2.items()):
    if value['text']:
        jsonObj2[key]['text'] = allDocs3[allDocs3Counter]
        allDocs3Counter += 1

with open('articles-with-equations.json', 'w', encoding='utf8') as fp:
    json.dump(jsonObj2, fp)

<div class="alert alert-block alert-info">
<b>Create dependency parse, POS-tagged token and lemmatized token files:</b> 
    
This process creates the `.dep`, `.input`, and `.morph` files for further processing in the Markov Logic Networks. It runs each text through the Stanford NLP pipeline and saves the output to a specified folder. This step is time-consuming and can crash on texts that cannot be parsed. If a document takes more than 5 minutes, a `RuntimeError` is raised and that document is skipped. 
</div>
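
The notebook uses `timeout` from `interruptingcow` for the five-minute limit. For readers who want to see the idea without that dependency, here is a standard-library sketch of a time limit built on `signal`. The name `time_limit` is hypothetical, the approach works only on Unix and only in the main thread, and it is not a description of how `interruptingcow` is implemented.

```python
import signal
from contextlib import contextmanager


@contextmanager
def time_limit(seconds):
    """Raise RuntimeError if the enclosed block runs longer than `seconds`."""
    def _on_alarm(signum, frame):
        raise RuntimeError('timed out after {} seconds'.format(seconds))

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        # Cancel any pending alarm and restore the previous handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)
```

Used as `with time_limit(300): ...`, it interrupts a long-running call and lets the surrounding `try`/`except RuntimeError` move on to the next document.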

<div class="alert alert-block alert-success">
<b>Instructions:</b> 
    
Ensure that the path passed to `create_parse_files` matches the desired output directory. The output directory must contain three subfolders:
* `dep`
* `input`
* `morph`

Output files will be written to each of these folders. 

The start point can be adjusted using the `startPoint` variable. When set to 0, it starts with the first text. This can be useful for pausing and coming back to the process (e.g., between system reboots). 


</div>
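
The two instructions above can be sketched as small helpers: one that creates the three required subfolders if they are missing, and one that skips documents before a chosen start index. Both function names are hypothetical, and the second assumes the start point is the index of the first document to process.

```python
import os


def ensure_output_dirs(base_dir):
    """Create the dep, input and morph subfolders under base_dir if needed."""
    paths = {}
    for name in ('dep', 'input', 'morph'):
        paths[name] = os.path.join(base_dir, name)
        os.makedirs(paths[name], exist_ok=True)
    return paths


def iter_from(docs, start_point=0):
    """Yield (index, doc) pairs, skipping documents before start_point."""
    for i, doc in enumerate(docs):
        if i >= start_point:
            yield i, doc
```

Keeping the original index while skipping means output files are numbered the same way whether the run started from the beginning or was resumed.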

In [None]:
startPoint = 950

for i, doc in enumerate(allDocs2):
    # Skip documents before the start point (0 starts with the first text)
    if i >= startPoint:
        print('Processing document #{}'.format(i))

        # Cap each document at five minutes so that one slow or unparseable text cannot stall the loop
        try:
            with timeout(300, exception=RuntimeError):
                nlpifiedDoc = nlp(doc)
                thisDocumentData = create_parse_files(nlpifiedDoc, i, True, 'output_data/')
        except RuntimeError:
            print("Didn't finish document #{} within five minutes. Moving to next one.".format(i))