# CoreNLP Parsing with NLTK Wrapper

<br>
This notebook uses the NLTK wrapper for CoreNLP to parse sentences from BBN/ACCENT and identify additional verbs to add to the PETRARCH dictionaries, with the aim of increasing precision and recall.
<br>
<br>
Setup Environment

In [1]:
import pprint

import nltk
import pandas as pd
from IPython.display import display
from nltk.parse.corenlp import (
    CoreNLPDependencyParser,
    CoreNLPParser,
    CoreNLPServer,
)
from nltk.tree import Tree

### Server terminal: run `cd /`, then `cd $CORENLP_HOME`, then start the server:
<br>
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -annotators "tokenize,ssplit,pos,lemma,parse,sentiment" -port 9000 -timeout 30000
<br>
<br>
Connect parser to server running CoreNLP

In [2]:
parser = CoreNLPParser('http://localhost:9000')
depr = CoreNLPDependencyParser('http://localhost:9000')
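
Both parsers fail on their first request if the server is not running, so it can help to check the connection before parsing a batch. The sketch below is one way to do that with the standard library only. The helper name `server_is_up` is hypothetical, and it assumes the server answers a plain GET at its root URL.

```python
import urllib.request


def server_is_up(url, timeout=5):
    """Return True if an HTTP GET to `url` succeeds with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError):
        # Refused connections, timeouts, and HTTP errors are OSError subclasses;
        # ValueError covers malformed URLs.
        return False


# Assumes the server was started on port 9000 as shown above.
print(server_is_up("http://localhost:9000"))
```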

# NYTBatch50 Accent
#### Read in the NYTbatch50 BBN/ACCENT data, trim to CAMEO code 145, and create a small subset of events to test

In [6]:
extract = pd.read_csv('nytextract.csv')
protest_violent = extract[extract.code == 145].reset_index()
small = protest_violent.head()
small

| Unnamed: 0 | index | aid | code | text | bad |
|---|---|---|---|---|---|
| 0 | 240 | 22338793 | 145 | \n Yesterday the news agency reported that... | 0 |
| 1 | 241 | 22339503 | 145 | "Thirty-four men arrested late last night were... | 0 |
| 2 | 242 | 22395783 | 145 | "In the worst outbreak of street violence in 1... | 0 |
| 3 | 243 | 22398252 | 145 | Paris policemen and leftist extremists clash a... | 0 |
| 4 | 244 | 22407123 | 145 | Violence occurred through most of the day, des... | 0 |


#### Create the constituency parser function: arguments are the DataFrame and the text column name

In [4]:
def core_parser(df, col):
    # Parse the text in each row and keep the first tree the server returns.
    return [next(parser.raw_parse(text)) for text in df[col]]

#### Run core_parser on the CAMEO 145 events

In [None]:
%time nyt = core_parser(protest_violent, 'text')
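
A single failed request (for example a timeout, since the server above was started with `-timeout 30000`) stops the whole batch. The sketch below shows one way to keep going and record which rows failed. The names `parse_all` and `fake_parse` are hypothetical, and `fake_parse` only stands in for a real call such as `lambda t: next(parser.raw_parse(t))`.

```python
def parse_all(texts, parse_one):
    """Apply `parse_one` to each text; store None and the error for failures."""
    results, errors = [], {}
    for position, text in enumerate(texts):
        try:
            results.append(parse_one(text))
        except Exception as exc:  # e.g. a server timeout or an empty result
            results.append(None)
            errors[position] = repr(exc)
    return results, errors


def fake_parse(text):
    # Stand-in for the server call so the sketch runs without CoreNLP.
    if not text.strip():
        raise ValueError("empty text")
    return text.upper()


results, errors = parse_all(["a protest", "  "], fake_parse)
print(results)
print(errors)
```

Keeping a `None` placeholder for each failed row means the result list still lines up with the DataFrame rows by position.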

#### To view a parse tree, index the result of the function (indices start at 0)

In [None]:
nyt[0]
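
Since the goal is to find candidate verbs, the verb tokens can be pulled out of a parse tree. The sketch below works on the bracketed string form of a tree (what `str(nyt[0])` would give) and assumes Penn Treebank verb tags (`VB`, `VBD`, `VBG`, `VBN`, `VBP`, `VBZ`). The function name `verbs_in_parse` is hypothetical, and the sample string is hand-written, not real server output.

```python
import re

# Matches leaf nodes such as "(VBD clashed)". "(VP ...)" does not match
# because the pattern requires the tag to start with "VB".
VERB_LEAF = re.compile(r"\((VB[DGNPZ]?)\s+([^\s()]+)\)")


def verbs_in_parse(bracketed):
    """Return (tag, word) pairs for every verb leaf in a bracketed parse."""
    return VERB_LEAF.findall(bracketed)


sample = (
    "(ROOT (S (NP (NNS Protesters)) "
    "(VP (VBD clashed) (PP (IN with) (NP (NN police)))) (. .)))"
)
print(verbs_in_parse(sample))
```

With NLTK trees, `Tree.pos()` returns `(word, tag)` pairs directly and may be the simpler route; the regex version is shown only because it runs without a server.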

#### Create the dependency parser function: arguments are the DataFrame and the text column name

In [None]:
def dep_parser(df, col):
    # Parse the text in each row and keep the first dependency graph returned.
    return [next(depr.raw_parse(text)) for text in df[col]]

In [None]:
next(parser.raw_parse("Israel said a mortar bomb was launched at it from the Gaza strip on Tuesday"))

#### Run dep_parser on the CAMEO 145 events

In [None]:
%time nyt_dep = dep_parser(protest_violent, 'text')

#### To view a dependency parse, index the result of the function (indices start at 0)

In [None]:
nyt_dep[0]
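
A dependency graph can also be reduced to verbs and their arguments. NLTK's `DependencyGraph.triples()` yields `((governor, tag), relation, (dependent, tag))` tuples, and the sketch below groups subjects and objects under each verb. The function name `verb_arguments` is hypothetical, the sample triples are hand-written, and the relation labels are an assumption because they differ between CoreNLP versions (for example `dobj` versus `obj`).

```python
from collections import defaultdict


def verb_arguments(triples, relations=("nsubj", "obj", "dobj")):
    """Group subject and object dependents under each verb governor."""
    found = defaultdict(list)
    for (gov_word, gov_tag), relation, (dep_word, _dep_tag) in triples:
        if gov_tag.startswith("VB") and relation in relations:
            found[gov_word].append((relation, dep_word))
    return dict(found)


# Hand-written triples in the shape produced by DependencyGraph.triples().
sample_triples = [
    (("clashed", "VBD"), "nsubj", ("Protesters", "NNS")),
    (("clashed", "VBD"), "obl", ("police", "NN")),
    (("police", "NN"), "case", ("with", "IN")),
]
print(verb_arguments(sample_triples))
```

On real output this would be called as `verb_arguments(nyt_dep[0].triples())`.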

#### To view both trees together with the source text, use the function 'easy_read' with five arguments (constituency parses, dependency parses, source DataFrame, index number, and text column name)

In [None]:
def easy_read(parse, dep, corp, index_num, text_col_name):
    display(parse[index_num])
    display(dep[index_num])
    display(corp.iloc[index_num].loc[text_col_name])

In [None]:
easy_read(nyt, nyt_dep, protest_violent, 24, 'text')

In [None]:
protest_violent

# NYTbd Sample 14-18
Read in data

In [None]:
sample1418 = pd.read_csv("ACCENT NYTbd sample 14 18.csv")
%time bd1418 = core_parser(sample1418, 'text')

In [None]:
%time bd1418_dep = dep_parser(sample1418, 'text')

In [None]:
easy_read(bd1418, bd1418_dep, sample1418, 0, 'text')

In [None]:
sample1418

# Extraneous Projects
#### py-CoreNLP Wrapper

In [None]:
from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://localhost:9000')

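def depparse(text):
    # Request dependency parses from the CoreNLP server as JSON.
    output = nlp.annotate(text, properties={
        'annotators': 'depparse',
        'outputFormat': 'json'
    })

    # Format every dependency as "relation(governor dependent) ".
    parsed = ""
    for sentence in output["sentences"]:
        for dep in sentence["basicDependencies"]:
            parsed += "{}({} {}) ".format(
                dep["dep"], dep["governorGloss"], dep["dependentGloss"])
    # Return after all sentences have been processed, not just the first.
    return parsed

text = "I put the book in the box on the table."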

#### Stanford NLP Python Package Parser
The sentence structure of the BBN/ACCENT output makes it impossible to loop through as is. It will need some regex work to get it into a format where sentences are not broken up. This assumes the `nlp` pipeline breaks up sentences at any '.', which is going to be a problem when trying to clean the text up.
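
As a starting point for that cleanup, the sketch below collapses line breaks and repeated spaces, then removes the period from a few abbreviations so they are less likely to be read as sentence ends. The function name `clean_text` and the abbreviation list are hypothetical; the list would need to be built from the actual corpus, and whether the pipeline really splits at every '.' is still the assumption stated above.

```python
import re

# Hypothetical abbreviation list; extend it after inspecting the corpus.
ABBREVIATION = re.compile(r"\b(Mr|Mrs|Dr|Gen|Col)\.")


def clean_text(text):
    """Normalize whitespace and strip periods from listed abbreviations."""
    # Collapse newlines, tabs, and repeated spaces into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Keep the abbreviation but drop its trailing period.
    return ABBREVIATION.sub(r"\1", text)


print(clean_text("\n Yesterday Gen. Smith said\nthe march   was peaceful."))
```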

In [None]:
import stanfordnlp
stanfordnlp.download('en')   # This downloads the English models for the neural pipeline
nlp = stanfordnlp.Pipeline() # This sets up a default neural pipeline in English
doc = nlp("Barack Obama was born in Hawaii.  He was elected president in 2008.")
doc.sentences[0].print_dependencies()

In [None]:
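def stanlp(df, col):
    # Run the pipeline on the text in each row and collect one Document per row.
    return [nlp(text) for text in df[col]]

In [None]:
flop = stanlp(protest_violent, "text")

In [None]:
# Dependencies for the second sentence of the last document
flop[-1].sentences[1].print_dependencies()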