## Surprisal Models
**Brief**:<br>
This will be the main file for all of our model loading, data organizing, searching, etc.<br><br>
**Sections**:
1. [Methodology And Instructions For Running](#pre)
2. [Starting A StanfordCoreNLP Server](#1)
    - [Background](#1_a)
    - [How To Run The Server](#1_b)
    - [Resources](#1_c)
    - [Code](#1_d)
3. [Applying Models](#2)
___
<a id='pre'>

## Methodology And Instructions For Running
In this section, we'll summarize the work so far and get you caught up to where we are now.<br>
**Installations**:
There are two quick installations that will catch you up with the workflow of using Python.
1. Git Bash (https://git-scm.com/downloads)
    - This is a modified version of the command line (or terminal in MacOS) that enables the user to download GitHub repositories to their local machine, contribute to these repositories, and then publish their results.
    - I'm not going to do a big whole thing on Git here, but I will have you:
        - **clone** our repository to your machine, 
        - **add** a few files, 
        - **commit** your changes, 
        - and finally submit a **pull request**.
    - Why go through this hassle? Now you can work with our code on your machine in Jupyter, as opposed to having to look at it on the GitHub interface!
2. Anaconda (https://www.anaconda.com/distribution/)
    - This is an extremely popular platform for data science work. I think you might already have it, so I won't go into much detail. Basically, we'll be using this so we can access Jupyter Notebooks (like this one!).
<br><br>

**A Road Map to Tag and Word Level Probability**:
It's a long road from a raw sentence to a number that represents the likelihood of a given word or tag. Let's see how we can get there.
1. Sentence Tagging
    - sentences -> list of tuples, each tuple has word and tag
    - sometimes this step is subsumed by the subsequent parsing step. For example, the Stanford parser accepts raw sentences as input, does the tagging, and then does the parsing.
        - *parser.raw_parse('The King of France is Bald.')* does this
2. Sentence Parsing
    - list of tags -> a tree
    - this is where we are now. Need to talk to Na Rae
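
To make the end goal of this road map concrete, here is a minimal sketch of turning tagged tokens into a number. Nothing above settles how the probabilities will be estimated, so this assumes plain relative frequency and reports surprisal, -log2(p), purely for illustration. The names `tag_probabilities` and `surprisal` are hypothetical, and the toy input is the tagging of the example sentence.

```python
import math
from collections import Counter

# Step 1 output: a tagged sentence as a list of (word, tag) tuples.
tagged_sentence = [
    ("The", "DT"), ("King", "NNP"), ("of", "IN"), ("France", "NNP"),
    ("is", "VBZ"), ("Bald", "JJ"), (".", "."),
]

def tag_probabilities(tagged_sentences):
    """Estimate P(tag) by relative frequency (an illustrative assumption)."""
    counts = Counter(tag for sent in tagged_sentences for _, tag in sent)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

def surprisal(probability):
    """Surprisal in bits: rarer events get larger values."""
    return -math.log2(probability)

probs = tag_probabilities([tagged_sentence])
nnp_surprisal = surprisal(probs["NNP"])  # NNP accounts for 2 of the 7 tags
```

With a single sentence the estimates are obviously too coarse to mean anything; the point is only the shape of the pipeline, from tuples to counts to a per-tag number.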

<a id='1'>

### Starting A StanfordCoreNLP Server
<a id='1_a'>

**Background**:<br>
We've spent a lot of time looking at the StanfordCoreNLP software, and we ultimately decided we want to use both the parser and the tagger (really, the parser runs the tagger automatically, but we'll see that later). While the software is very useful, it's written in Java, which is not quite as nice to play with as Python for a number of reasons (it lacks Python's NLP libraries, it's a lower-level language, etc.). The problem, then, is finding a way to use a Java program as if it were a Python program.<br><br>
The two main options I looked into were a traditional import and a private server. The first route is to use a traditional "wrapper" library/program, which is essentially a translator between Java and Python. Unfortunately, the Stanford team itself doesn't actually make these wrappers (they would have to make them for a *lot* of languages). The existing wrappers, specifically the <u>stanfordcorenlp</u> library, weren't available through Anaconda (the platform through which we are running this exciting program right now), so I went another route.<br><br>
The direction I chose was to host a server that runs the out-of-the-box Java program, and to access it through a Python API. This involves a small amount of command line setup, but it saves the trouble of changing environment variables or using directories in Python.<br><br>
<a id='1_b'>

**How To Run The Server**<br>
*Note*: Huge thanks to Khalid Alnajjar; his guide is linked in the resources.<br>
I've never actually hosted any sort of server before, so here's a quick summary:
1. download and extract CoreNLP somewhere.
2. on the command line, cd into the extracted directory
    - *stanford-corenlp-full-2018-10-05* should be the folder name
3. run this command to host the server:
    - java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -annotators "tokenize,ssplit,pos,lemma,parse,sentiment" -port 9000 -timeout 30000 
4. pick up [here](#1_d)
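
If you would rather start the server from Python than type the command by hand, step 3 can be sketched with `subprocess`. This is only an illustration: `build_server_command`, `start_server`, and their parameters are hypothetical names, and the flags are copied from the command above rather than checked against other CoreNLP versions.

```python
import subprocess

def build_server_command(port=9000, timeout=30000, memory="4g",
                         annotators="tokenize,ssplit,pos,lemma,parse,sentiment"):
    """Return the step 3 command as an argument list for subprocess."""
    return [
        "java", f"-mx{memory}",
        "-cp", "*",  # no shell is involved, so the asterisk reaches java unexpanded
        "edu.stanford.nlp.pipeline.StanfordCoreNLPServer",
        "-annotators", annotators,
        "-port", str(port),
        "-timeout", str(timeout),
    ]

def start_server(corenlp_dir, **options):
    """Launch the server in the background from the extracted CoreNLP folder."""
    return subprocess.Popen(build_server_command(**options), cwd=corenlp_dir)

command = build_server_command()
```

Calling `start_server('stanford-corenlp-full-2018-10-05')` from the directory that contains that folder returns a `Popen` handle; call its `terminate()` method when you want to stop the server.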
<a id='1_c'>

**Resources**:<br>
1. documentation for the parser: http://www.nltk.org/_modules/nltk/parse/stanford.html
2. for more on running the server: https://www.khalidalnajjar.com/setup-use-stanford-corenlp-server-python/
3. StanfordCoreNLP's GitHub page: https://github.com/stanfordnlp/CoreNLP
4. stanfordcorenlp wrapper library: https://pypi.org/project/stanfordcorenlp/

<a id='1_d'>

**Code**:

In [1]:
#Imports
import nltk
import pickle
from nltk import StanfordPOSTagger
from nltk.parse import stanford
from nltk.parse import CoreNLPParser
from IPython.core.interactiveshell import InteractiveShell
#show the result of every expression in a cell, not just the last one
InteractiveShell.ast_node_interactivity = "all"

In [2]:
#builds parser
parser = CoreNLPParser(url='http://localhost:9000')

In [3]:
#testing parser on sentence
list(parser.raw_parse('The King of France is Bald.'))
#if you want to see the list of commands
dir(parser)

[Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NP', [Tree('DT', ['The']), Tree('NNP', ['King'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('NNP', ['France'])])])]), Tree('VP', [Tree('VBZ', ['is']), Tree('ADJP', [Tree('JJ', ['Bald'])])]), Tree('.', ['.'])])])]

['_OUTPUT_FORMAT',
 '__abstractmethods__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 '_check_params',
 'api_call',
 'encoding',
 'evaluate',
 'grammar',
 'make_tree',
 'parse',
 'parse_all',
 'parse_one',
 'parse_sents',
 'parse_text',
 'parser_annotator',
 'raw_parse',
 'raw_parse_sents',
 'raw_tag_sents',
 'session',
 'span_tokenize',
 'span_tokenize_sents',
 'tag',
 'tag_sents',
 'tagtype',
 'tokenize',
 'tokenize_sents',
 'url']

*Note*: How to make sense of this object, especially with NLTK's trees? Here's some help: https://stackoverflow.com/questions/26210567/get-entities-from-nltk-tree-result
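
As a small sketch of how to pull information back out of one of these trees, the example below rebuilds the parse shown above from its bracketed string, so it runs without the server. The bracketed string is hand-copied from that output, and the variable names are just for illustration.

```python
from nltk import Tree

# The parse from the output above, written in bracketed form.
bracketed = (
    "(ROOT (S (NP (NP (DT The) (NNP King)) (PP (IN of) (NP (NNP France)))) "
    "(VP (VBZ is) (ADJP (JJ Bald))) (. .)))"
)
tree = Tree.fromstring(bracketed)

# Words only, in sentence order.
words = tree.leaves()

# (word, tag) tuples, i.e. the tagging step recovered from the parse.
tagged = tree.pos()

# Every noun phrase, outermost first, as a plain string.
noun_phrases = [
    " ".join(subtree.leaves())
    for subtree in tree.subtrees(lambda t: t.label() == "NP")
]
```

`tree.label()` gives the node's category (here `ROOT`), and indexing such as `tree[0]` steps down to a child subtree, which is itself a `Tree` with the same methods.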

In [4]:
type(parser)
type(parser.raw_parse('The King of France is Bald.'))
parsed_sent = parser.raw_parse('The King of France is Bald.')

nltk.parse.corenlp.CoreNLPParser

list_iterator

In [5]:
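#raw_parse returns an iterator of nltk.Tree objects, not a string, so
#nltk.Tree.fromstring(parsed_sent) fails with "TypeError: expected string or bytes-like object"
#take the first tree from the iterator instead
parsed_tree = next(parsed_sent)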

In [None]:
parser.grammar()