## Surprisal Models
**Brief**:<br>
This will be the main file for all of our model loading, data organizing, searching, etc.<br><br>
**Sections**:
1. [Methodology And Instructions For Running](#pre)
2. [Starting A StanfordCoreNLP Server](#1)
    - [Background](#1_a)
    - [How To Run The Server](#1_b)
    - [Resources](#1_c)
    - [Code](#1_d)
3. [Applying Models](#2)
___
<a id='pre'>

## Methodology And Instructions For Running
In this section, we'll summarize and catch up to where we are now.<br>
**Installations**:
There are two quick installations that will catch you up with the workflow of using Python.
1. Git Bash (https://git-scm.com/downloads)
    - This is a modified version of the command line (or Terminal on macOS) that lets you download GitHub repositories to your local machine, contribute to them, and then publish your results.
    - I'm not going to do a big whole thing on Git here, but I will have you:
        - **clone** our repository to your machine, 
        - **add** a few files, 
        - **commit** your changes, 
        - and finally submit a **pull request**.
    - Why go through this hassle? Now you can work with our code on your machine in Jupyter, as opposed to having to look at it on the GitHub interface!
2. Anaconda (https://www.anaconda.com/distribution/)
    - This is an extremely popular platform for data science work. I think you might already have it, so I won't go into much detail. Basically, we'll be using it to access Jupyter Notebooks (like this one!).
<br><br>

**A Road Map to Tag and Word Level Probability**:
It's a long road from a raw sentence to a number that represents the likelihood of a given word or tag. Let's see how we can get there.
1. Sentence Tagging
    - sentences -> list of tuples, each tuple has word and tag
    - sometimes this step is absorbed into the subsequent parsing step. For example, the Stanford parser accepts raw sentences as input, does the tagging, and then does the parsing.
        - *parser.raw_parse('The King of France is Bald.')* does this
2. Sentence Parsing
    - list of tags -> a tree
    - this is where we are at right now. Need to talk to Na Rae
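
The road map ends in "a number that represents the likelihood of a given word or tag". The notebook has not settled on a model yet, so the following is only a minimal sketch of what that number could look like: surprisal in bits, computed from a relative-frequency (unigram) estimate over tags. The tag sequence and the names `surprisal`, `tag_probs`, and `tag_surprisal` are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def surprisal(probability):
    """Return the surprisal of an event in bits: -log2(p)."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# Hypothetical tag sequence, hand-written for illustration only.
tags = ["DT", "NN", "VBZ", "DT", "JJ", "NN", "."]
counts = Counter(tags)
total = sum(counts.values())

# Relative-frequency (unigram) estimate of each tag's probability.
tag_probs = {tag: n / total for tag, n in counts.items()}

# Rarer tags get higher surprisal.
tag_surprisal = {tag: surprisal(p) for tag, p in tag_probs.items()}
```

A real model would condition on context (previous words, tags, or parse state) rather than use unigram counts, but the final step from a probability to a surprisal value stays the same.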

<a id='1'>

### Starting A StanfordCoreNLP Server
<a id='1_a'>

**Background**:<br>
We've spent a lot of time looking at the StanfordCoreNLP software, and we ultimately decided we want to use both the parser and tagger (really, the parser automatically implements the tagger, but we'll see later). While the software is very useful, it's written in Java--which is not quite as nice to play with as Python for a number of reasons (lacks NLP libraries, lower level language, etc). The problem then is finding a way to use a Java program like it's a Python program.<br><br>
The two main options I looked into were a traditional import and a private server. One route of solving our problem is to use a traditional "wrapper" library/program. This program is essentially a translator between Java and Python. Unfortunately, the Stanford team itself doesn't actually make these wrappers (they would have to make them for a *lot* of languages). The existing wrappers--specifically the <u>stanfordcorenlp</u> library--weren't available through Anaconda (the platform through which we are running this exciting program right now), so I went another route.<br><br>
The direction I chose was to host a server that runs the out-of-the-box Java program, and to access it through a Python API. This involves a small amount of command line setup, but it saves the trouble of changing environment variables or using directories in Python.<br><br>
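
To make the "server plus Python API" idea concrete, here is a rough sketch of the kind of HTTP request a Python client sends to a CoreNLP server: the text goes in the POST body and the annotator settings go in a `properties` query parameter encoded as JSON. The helper name `build_corenlp_request` and its default annotator list are assumptions made for illustration; in this notebook NLTK's `CoreNLPParser` handles this for us.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_corenlp_request(text, base_url="http://localhost:9000",
                          annotators="tokenize,ssplit,pos,parse"):
    """Build (but do not send) a POST request for a CoreNLP server."""
    properties = {"annotators": annotators, "outputFormat": "json"}
    query = urlencode({"properties": json.dumps(properties)})
    return Request(f"{base_url}/?{query}",
                   data=text.encode("utf-8"),
                   method="POST")


request = build_corenlp_request("The King of France is bald.")

# Sending the request requires a running server, so it is left commented out:
# from urllib.request import urlopen
# with urlopen(request, timeout=30) as response:
#     result = json.load(response)
```

Building the request separately from sending it keeps the sketch runnable even when no server is up.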
<a id='1_b'>

**How To Run The Server**<br>
*Note*: Huge thanks to Khalid Alnajjar; his guide is linked in the resources below.<br>
I've never actually hosted any sort of server before, so here's a quick summary:
1. download and extract CoreNLP somewhere.
2. on the command line, cd into that directory
    - *stanford-corenlp-full-2018-10-05* should be the folder
3. run this command to host the server:
    - `java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -annotators "tokenize,ssplit,pos,lemma,parse,sentiment" -port 9000 -timeout 30000`
4. pick up [here](#1_d)
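
Before moving on to the code, it can help to confirm that something is actually listening on the port chosen in step 3. The following is a small sketch using only the standard library; `server_is_listening` is a hypothetical helper name, and the check only tells you that a process accepts connections on that port, not that it is CoreNLP.

```python
import socket


def server_is_listening(host="localhost", port=9000, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False


if __name__ == "__main__":
    print("Server reachable:", server_is_listening())
```

If this prints `False`, the most likely causes are that the `java` command is not running or that it was started with a different `-port` value.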
<a id='1_c'>

**Resources**:<br>
1. documentation for the parser: http://www.nltk.org/_modules/nltk/parse/stanford.html
2. for more on running the server: https://www.khalidalnajjar.com/setup-use-stanford-corenlp-server-python/
3. StanfordCoreNLP's GitHub page: https://github.com/stanfordnlp/CoreNLP
4. StanfordCoreNLP's download (3.9.2): https://stanfordnlp.github.io/CoreNLP/
5. stanfordcorenlp wrapper library: https://pypi.org/project/stanfordcorenlp/

<a id='1_d'>

**Code**:

In [1]:
# Imports
import nltk
import pickle
import pandas as pd
from nltk import StanfordPOSTagger
from nltk.parse import stanford
from nltk.parse import CoreNLPParser
from IPython.core.interactiveshell import InteractiveShell
# Display the value of every expression in a cell, not just the last one
InteractiveShell.ast_node_interactivity = "all"

In [2]:
# Build a parser client that talks to the CoreNLP server started above
parser = CoreNLPParser(url='http://localhost:9000')

*Note*: How to make sense of this object, especially with NLTK's trees? Here's some help: https://stackoverflow.com/questions/26210567/get-entities-from-nltk-tree-result

In [3]:
foo_trees = parser.raw_parse("It was a bright cold day in April and the clocks were striking thirteen.")

In [4]:
type(foo_trees)

list_iterator

In [5]:
for tree in foo_trees:
    tree.leaves()

['It',
 'was',
 'a',
 'bright',
 'cold',
 'day',
 'in',
 'April',
 'and',
 'the',
 'clocks',
 'were',
 'striking',
 'thirteen',
 '.']
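
The `leaves()` call above returns only the words. The "list of tuples, each tuple has word and tag" from the road map can be read off the bracketed form of a tree, where each word sits inside a `(TAG word)` pair. The sketch below does this with a regular expression on a hand-written bracketed string (not actual parser output); `tagged_words` is a hypothetical helper. NLTK trees also offer a `pos()` method that should give the same kind of pairs directly.

```python
import re

# Hand-written bracketed parse, for illustration only (not actual parser output).
bracketed = ("(ROOT (S (NP (PRP It)) "
             "(VP (VBD was) (NP (DT a) (JJ bright) (NN day))) (. .)))")

# A preterminal looks like "(TAG word)": a label followed by a single token,
# with no nested parentheses inside.
PRETERMINAL = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")


def tagged_words(bracketed_tree):
    """Return a list of (word, tag) tuples from a bracketed parse string."""
    return [(word, tag) for tag, word in PRETERMINAL.findall(bracketed_tree)]


pairs = tagged_words(bracketed)
words = [word for word, _ in pairs]
```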

In [6]:
stimuli_sheet = pd.read_excel('Stimuli.xlsx')
print(stimuli_sheet)

    ORDER                                               Item
0       9  In the right hands, location data can be a for...
1      19  Paying for college seems out of reach for many...
2      20  Very few issues can bring together lawmakers o...
3      49  Koalas have been running into hard times. They...
4      51  Some recent reports have suggested that women ...
..    ...                                                ...
58    206  Earth is warming, and we know why. Light is re...
59    208  Nobody ever really tells you how to breathe. Y...
60    211  The immense wall of crumbling, whitewashed sto...
61    212  You never know where you will find love. Even ...
62    215  Imagine an animal that looks like a dinosaur, ...

[63 rows x 2 columns]


In [7]:
sents = stimuli_sheet["Item"]

In [8]:
sents[:5]
sents[0]

0    In the right hands, location data can be a for...
1    Paying for college seems out of reach for many...
2    Very few issues can bring together lawmakers o...
3    Koalas have been running into hard times. They...
4    Some recent reports have suggested that women ...
Name: Item, dtype: object

'In the right hands, location data can be a force for good. It can provide publishers and app developers with advertising revenue so consumers can access free services and information.'

In [9]:
# Parse both sentences of every stimulus item, keeping each tree and its productions
sent1_trees = []
sent1_productions = []
sent2_trees = []
sent2_productions = []
sentCount=1
for entry in sents:
    # Assumes each item contains exactly two sentences; the unpacking fails otherwise
    sent1, sent2 = nltk.sent_tokenize(entry)
    trees1 = parser.raw_parse(sent1)
    trees2 = parser.raw_parse(sent2)
    for tree in trees1:
        print("The first tree for sentence #",sentCount," is:")
        tree.pretty_print()
        sent1_productions.append(tree.productions())
        sent1_trees.append(tree)
    for tree in trees2:
        print("The second tree for sentence #",sentCount," is:")
        tree.pretty_print()
        sent2_productions.append(tree.productions())
        sent2_trees.append(tree)
    # sentCount numbers the stimulus items, not the individual sentences
    sentCount+=1

The first tree for sentence # 1  is:
                                         ROOT                                           
                                          |                                              
                                          S                                             
          ________________________________|___________________________________________   
         |               |            |                VP                             | 
         |               |            |         _______|________                      |  
         |               |            |        |                VP                    | 
         |               |            |        |    ____________|____                 |  
         |               |            |        |   |                 NP               | 
         |               |            |        |   |        _________|_______         |  
         PP              |            |        |   |       |        

The first tree for sentence # 3  is:
                                ROOT                                         
                                 |                                            
                                 S                                           
  _______________________________|_________________________________________   
 |        |               VP                                               | 
 |        |           ____|________________                                |  
 |        |          |                     VP                              | 
 |        |          |     ________________|__________                     |  
 |        |          |    |      |                    NP                   | 
 |        |          |    |      |          __________|___                 |  
 |        |          |    |      |         |              PP               | 
 |        |          |    |      |         |       _______|____            |  
ADVP      NP         

The first tree for sentence # 6  is:
                          ROOT                       
                           |                          
                           S                         
             ______________|_______________________   
            |                   VP                 | 
            |           ________|___               |  
            NP         |            NP             | 
     _______|____      |    ________|______        |  
    JJ           NN   VBZ  DT       JJ     NN      . 
    |            |     |   |        |      |       |  
Pancreatic     cancer has  a       bad reputation  . 

The second tree for sentence # 6  is:
                                                             ROOT                                                                      
                                                              |                                                                         
                                                    

The first tree for sentence # 8  is:
                                              ROOT                                                          
                                               |                                                             
                                               S                                                            
    ___________________________________________|__________________________________________________________   
   |        |                   VP                                                                        | 
   |        |        ___________|_______________________                                                  |  
   |        |       |                                  SBAR                                               | 
   |        |       |     ______________________________|___________                                      |  
   |        |       |    |                                          S                  

The first tree for sentence # 10  is:
                                 ROOT                                    
                                  |                                       
                                  S                                      
         _________________________|____________________________________   
        |             |          ADVP           |       VP             | 
        |             |      _____|______       |    ___|______        |  
        NP            |    ADVP   |     ADVP    |   |         ADJP     | 
  ______|_______      |     |     |      |      |   |          |       |  
JJS     NN     NNS    ,     RB    CC     RB     ,  VBP         JJ      . 
 |      |       |     |     |     |      |      |   |          |       |  
Most cricket species  ,  indoors  or  outdoors  ,  are     omnivorous  . 

The second tree for sentence # 10  is:
                                          ROOT                                                         

The first tree for sentence # 12  is:
                                                ROOT                                                       
                                                 |                                                          
                                                 S                                                         
       __________________________________________|_______________________________________________________   
      |                 VP                                                                               | 
      |            _____|____                                                                            |  
      |           |          VP                                                                          | 
      |           |      ____|______                                                                     |  
      |           |     |          SBAR                                                       

The first tree for sentence # 14  is:
                                                                           ROOT                                                                           
                                                                            |                                                                              
                                                                            S                                                                             
        ____________________________________________________________________|___________________________________________________________________________   
       |                                                                         VP                                                                     | 
       |                                    _____________________________________|__________                                                            |  
       |                     

The first tree for sentence # 16  is:
                                                             ROOT                                                       
                                                              |                                                          
                                                              S                                                         
                  ____________________________________________|_______________________________________________________   
                 |                                      VP                                                            | 
                 |                        ______________|________________                                             |  
                 |                       |                               VP                                           | 
                 |                       |      _________________________|______                                

The first tree for sentence # 18  is:
                                                                  ROOT                                                                        
                                                                   |                                                                           
                                                                   S                                                                          
                     ______________________________________________|________________________________________________________________________   
                    |                               |    |         VP                                                                       | 
                    |                               |    |     ____|_____                                                                   |  
                    |                               |    |    |          NP                          

The first tree for sentence # 20  is:
                                   ROOT                              
                                    |                                 
                                    S                                
      ______________________________|______________________________   
     PP          |             |               VP                  | 
  ___|_____      |             |           ____|____               |  
 |        ADJP   |             NP         |        ADJP            | 
 |         |     |       ______|____      |     ____|_______       |  
 IN        JJ    ,      JJ         NNS   VBP   RB           JJ     . 
 |         |     |      |           |     |    |            |      |  
 In     general  ,  mammalian     hearts are quite      malleable  . 

The second tree for sentence # 20  is:
                                 ROOT                                    
                                  |                                      

The first tree for sentence # 23  is:
                                                                  ROOT                                                            
                                                                   |                                                               
                                                                   S                                                              
                      _____________________________________________|____________________________________________________________   
                     |                                             VP                                                           | 
                     |                              _______________|_________________                                           |  
                     |                             |                                 NP                                         | 
                     |                    

The first tree for sentence # 25  is:
                                                                                                      ROOT                                                                                               
                                                                                                       |                                                                                                  
                                                                                                       S                                                                                                 
                    ___________________________________________________________________________________|_______________________________________________________________________________________________   
                   |                                     VP                                                                                             

The first tree for sentence # 27  is:
                                                ROOT                                                    
                                                 |                                                       
                                                 S                                                      
                _________________________________|____________________________________________________   
               |                                              VP                                      | 
               |                       _______________________|________                               |  
               |                      |                                VP                             | 
               |                      |     ___________________________|_________                     |  
               |                      |    |     |                              SBAR                  | 
             

The first tree for sentence # 29  is:
                                                   ROOT                                                                
                                                    |                                                                   
                                                    S                                                                  
       _____________________________________________|________________________________________________________________   
      |                       VP                                                                                     | 
      |                 ______|_____                                                                                 |  
      |                |            S                                                                                | 
      |                |       _____|_______                                                                         | 

The first tree for sentence # 31  is:
                                                                   ROOT                                                      
                                                                    |                                                         
                                                                    S                                                        
                   _________________________________________________|______________________________________________________   
                  |                          |      |                                    VP                                | 
                  |                          |      |        ____________________________|____                             |  
                  |                          |      |       |       |                         PP                           | 
                  |                          |      |       |       |        

The first tree for sentence # 34  is:
                                             ROOT                                                    
                                              |                                                       
                                              S                                                      
            __________________________________|____________________________________________________   
           |                              VP                                                       | 
           |                           ___|___                                                     |  
           |                          |       S                                                    | 
           |                          |       |                                                    |  
           |                          |       VP                                                   | 
           |                          | 

The first tree for sentence # 36  is:
                   ROOT                                
                    |                                   
                    S                                  
  __________________|________________________________   
 |        VP                                         | 
 |     ___|____                                      |  
 |    |        VP                                    | 
 |    |    ____|__________                           |  
 |    |   |               NP                         | 
 |    |   |          _____|________                  |  
 |    |   |         |              PP                | 
 |    |   |         |          ____|____             |  
 NP   |   |         NP        |         NP           | 
 |    |   |     ____|_____    |     ____|_____       |  
 DT  VBZ VBN   DT   JJ    NN  IN   NN         NN     . 
 |    |   |    |    |     |   |    |          |      |  
This has been  a  golden age for brain     research  . 

The first tree for sentence # 38  is:
                                           ROOT                                             
                                            |                                                
                                            S                                               
      ______________________________________|_____________________________________________   
     |                           VP                                                       | 
     |            _______________|________________                                        |  
     |           |                                PP                                      | 
     |           |      __________________________|___                                    |  
     |           |     |                              NP                                  | 
     |           |     |          ____________________|________________________           |  
     NP          |     |   

The first tree for sentence # 40  is:
         ROOT                       
          |                          
          S                         
  ________|_______________________   
 |             VP                 | 
 |    _________|___               |  
 |   |             VP             | 
 |   |     ________|___           |  
 |   |    |            S          | 
 |   |    |            |          |  
 |   |    |            VP         | 
 |   |    |         ___|____      |  
 NP  |    |        |        VP    | 
 |   |    |        |        |     |  
PRP VBD  VBN       TO       VB    . 
 |   |    |        |        |     |  
 It was bound      to     happen  . 

The second tree for sentence # 40  is:
                                                                     ROOT                                                                  
                                                                      |                                                                     
    

The first tree for sentence # 43  is:
                                             ROOT                                          
                                              |                                             
                                              S                                            
           ___________________________________|__________________________________________   
          |         |           |             VP                                         | 
          |         |           |        _____|______                                    |  
          |         |           |       |            VP                                  | 
          |         |           |       |      ______|___________                        |  
          |         |           |       |     |                  PP                      | 
          |         |           |       |     |       ___________|____                   |  
          |         |           |    

The first tree for sentence # 45  is:
                                                          ROOT                                                        
                                                           |                                                           
                                                           S                                                          
      _____________________________________________________|________________________________________________________   
     |             VP                                                                                               | 
     |         ____|___________                                                                                     |  
     |        |                VP                                                                                   | 
     |        |     ___________|___________________________                                                         |  
     |

The first tree for sentence # 48  is:
                     ROOT                          
                      |                             
                      S                            
   ___________________|__________________________   
  |                   VP                         | 
  |      _____________|____                      |  
  |     |   |              VP                    | 
  |     |   |     _________|____                 |  
  NP    |  ADVP  |              NP               | 
  |     |   |    |     _________|___________     |  
 NNP   VBZ  RB  VBN   DT   NN   CC    NN    NN   . 
  |     |   |    |    |    |    |     |     |    |  
Dublin has long been  a   beer and whiskey town  . 

The second tree for sentence # 48  is:
                                                                                                    ROOT                                                                                    
                                                   

The first tree for sentence # 50  is:
                     ROOT                         
                      |                            
                      S                           
     _________________|_________________________   
    |                 VP                        | 
    |       __________|____                     |  
    |      |               VP                   | 
    |      |     __________|___                 |  
    S      |    |     |        PP               | 
    |      |    |     |     ___|____            |  
    VP     |    |     NP   |        NP          | 
    |      |    |     |    |    ____|_____      |  
   VBG     MD   VB    NN   IN PRP$        NN    . 
    |      |    |     |    |   |          |     |  
Traveling can wreak havoc  on your     fitness  . 

The second tree for sentence # 50  is:
                                                                ROOT                                                               
                  

The first tree for sentence # 52  is:
                          ROOT                             
                           |                                
                           S                               
        ___________________|_____________________________   
       |                        VP                       | 
       |                ________|___                     |  
       |               |            VP                   | 
       |               |    ________|___                 |  
       |               |   |            S                | 
       |               |   |         ___|__________      |  
       NP              |   |        NP            ADJP   | 
   ____|_______        |   |     ___|_______       |     |  
  JJ   NN     NNS      MD  VB   DT  NN      NN     JJ    . 
  |    |       |       |   |    |   |       |      |     |  
Great snow conditions can make  a  ski     trip magical  . 

The second tree for sentence # 52  is:
               

The first tree for sentence # 54  is:
                                              ROOT                                                
                                               |                                                   
                                               S                                                  
                         ______________________|________________________________________________   
                        |                               VP                                      | 
                        |                   ____________|_____                                  |  
                        |                  |   |              NP                                | 
                        |                  |   |         _____|_____________________            |  
                        NP                 |   |        |                          SBAR         | 
         _______________|___               |   |        |          


The first tree for sentence # 56  is:
                                             ROOT                                        
                                              |                                           
                                              S                                          
           ___________________________________|________________________________________   
          PP                                  |        |                |              | 
  ________|________                           |        |                |              |  
 |                 NP                         |        |                |              | 
 |         ________|____________              |        |                |              |  
 |        |                     PP            |        |                VP             | 
 |        |                  ___|______       |        |             ___|___           |  
 |        NP                |          NP     |        N

The first tree for sentence # 58  is:
                                             ROOT                                       
                                              |                                          
                                              S                                         
     _________________________________________|_______________________________________   
    |                                         VP                                      | 
    |        _________________________________|___________                            |  
    |       |      |           |                          PP                          | 
    |       |      |           |               ___________|____                       |  
    |       |      |           |              |                S                      | 
    |       |      |           |              |                |                      |  
    |       |      |           |              |                VP  


The first tree for sentence # 60  is:
                         ROOT                            
                          |                               
                          S                              
   _______________________|____________________________   
  |     |                 VP                           | 
  |     |      ___________|___________                 |  
  |     |     |      |    |          SBAR              | 
  |     |     |      |    |      _____|____            |  
  |     |     |      |    |     |          S           | 
  |     |     |      |    |     |          |           |  
  |     |     |      |    |     |          VP          | 
  |     |     |      |    |     |      ____|_____      |  
  NP   ADVP  ADVP    |    NP  WHADVP  |          VP    | 
  |     |     |      |    |     |     |          |     |  
  NN    RB    RB    VBZ  PRP   WRB    TO         VB    . 
  |     |     |      |    |     |     |          |     |  
Nobody ever really tells 

The first tree for sentence # 63  is:
                                                            ROOT                                                  
                                                             |                                                     
                                                             S                                                    
                         ____________________________________|__________________________________________________   
                        S                                    |    |                     |                       | 
                        |                                    |    |                     |                       |  
                        VP                                   |    |                     |                       | 
    ____________________|_____                               |    |                     |                       |  
   |                          NP      

In [10]:
sent1_productions[50]

[ROOT -> S,
 S -> SBAR , NP VP .,
 SBAR -> RB IN S,
 RB -> 'Even',
 IN -> 'if',
 S -> NP VP,
 NP -> PRP,
 PRP -> 'you',
 VP -> VBP PP,
 VBP -> 'strive',
 PP -> IN NP,
 IN -> 'for',
 NP -> NP PP,
 NP -> DT ADJP NN,
 DT -> 'a',
 ADJP -> JJ,
 JJ -> 'sustainable',
 NN -> 'lifestyle',
 PP -> IN NP,
 IN -> 'at',
 NP -> NN,
 NN -> 'home',
 , -> ',',
 NP -> PRP,
 PRP -> 'it',
 VP -> MD VP,
 MD -> 'may',
 VP -> VB ADJP,
 VB -> 'be',
 ADJP -> JJ S,
 JJ -> 'tempting',
 S -> VP,
 VP -> TO VP,
 TO -> 'to',
 VP -> VB S,
 VB -> 'avoid',
 S -> VP,
 VP -> VBG PP,
 VBG -> 'thinking',
 PP -> IN NP,
 IN -> 'about',
 NP -> NP SBAR,
 NP -> DT NN,
 DT -> 'the',
 NN -> 'impact',
 SBAR -> S,
 S -> NP VP,
 NP -> PRP$ NNS,
 PRP$ -> 'your',
 NNS -> 'travels',
 VP -> MD VP,
 MD -> 'could',
 VP -> VB PP,
 VB -> 'have',
 PP -> IN NP,
 IN -> 'on',
 NP -> DT NN,
 DT -> 'the',
 NN -> 'environment',
 . -> '.']

In [15]:
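# Flatten the per-sentence production lists from both sentence sets into one list.
all_prods = []
for sentence_prods in sent1_productions:
    all_prods.extend(sentence_prods)
for sentence_prods in sent2_productions:
    all_prods.extend(sentence_prods)
len(all_prods)  # total number of productions, duplicates included
set_prods = set(all_prods)  # keep each distinct production once
len(set_prods)  # number of distinct productions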

3728

1246
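
The gap between the two counts above (3728 productions in total, 1246 distinct) means that many rules recur. A minimal sketch of how those repeats could be tallied with `collections.Counter` follows. The plain `(lhs, rhs)` tuples and the name `toy_prods` are hypothetical stand-ins for the production objects in `all_prods`, chosen so the snippet runs without the parser.

```python
from collections import Counter

# Hypothetical stand-ins for production rules: (left-hand side, right-hand side) tuples.
toy_prods = [
    ("ROOT", ("S",)),
    ("S", ("NP", "VP", ".")),
    ("NP", ("DT", "NN")),
    ("DT", ("'the'",)),
    ("NN", ("'impact'",)),
    ("NP", ("DT", "NN")),
    ("DT", ("'the'",)),
    ("NN", ("'environment'",)),
]

# Counter maps each distinct rule to the number of times it occurs.
rule_counts = Counter(toy_prods)

total = sum(rule_counts.values())  # every production, duplicates included
distinct = len(rule_counts)        # unique productions, analogous to len(set_prods)

print(total, distinct)
for (lhs, rhs), n in rule_counts.most_common(2):
    print(f"{lhs} -> {' '.join(rhs)}: {n}")
```

Since the cell above already places the productions in a `set`, they are hashable, so passing `all_prods` to `Counter` directly should work the same way.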

In [17]:
all_prods[:10]

[ROOT -> S,
 S -> PP , NP VP .,
 PP -> IN NP,
 IN -> 'In',
 NP -> DT JJ NNS,
 DT -> 'the',
 JJ -> 'right',
 NNS -> 'hands',
 , -> ',',
 NP -> NN NNS]

In [23]:
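leaves = []    # lexical rules: a tag rewriting to a word, e.g. IN -> 'In'
interior = []  # non-lexical rules: a nonterminal rewriting to nonterminals, e.g. PP -> IN NP
count = 0      # sanity check: should end up equal to len(all_prods)
for rule in all_prods:
    # `in` can't be applied to a production object directly, so test its string form.
    # Only lexical rules print a quoted word on the right-hand side.
    # (NLTK's Production.is_lexical() is the built-in equivalent of this check.)
    if "'" in str(rule):
        leaves.append(rule)
    else:
        interior.append(rule)
    count += 1
len(leaves)
len(interior)
count
leaves[:10]
interior[:10]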

1932

1796

3728

[IN -> 'In',
 DT -> 'the',
 JJ -> 'right',
 NNS -> 'hands',
 , -> ',',
 NN -> 'location',
 NNS -> 'data',
 MD -> 'can',
 VB -> 'be',
 DT -> 'a']

[ROOT -> S,
 S -> PP , NP VP .,
 PP -> IN NP,
 NP -> DT JJ NNS,
 NP -> NN NNS,
 VP -> MD VP,
 VP -> VB NP,
 NP -> NP PP,
 NP -> DT NN,
 PP -> IN ADJP]
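
Since the goal of this notebook is a probability (and eventually a surprisal) for each rule, one plausible next step is a relative-frequency estimate: count each rule and divide by the count of its left-hand side. The sketch below illustrates that idea on hypothetical `(lhs, rhs)` tuples rather than the real `interior` list. The function name `rule_probabilities` and the toy data are assumptions for illustration, not code from this project.

```python
import math
from collections import Counter

def rule_probabilities(rules):
    """Estimate P(rhs | lhs) by relative frequency over a list of (lhs, rhs) rules."""
    rule_counts = Counter(rules)
    lhs_counts = Counter(lhs for lhs, _ in rules)
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

# Hypothetical interior rules standing in for the `interior` list above.
toy_interior = [
    ("NP", ("DT", "NN")),
    ("NP", ("DT", "NN")),
    ("NP", ("NP", "PP")),
    ("NP", ("PRP",)),
    ("VP", ("MD", "VP")),
    ("VP", ("VB", "NP")),
]

probs = rule_probabilities(toy_interior)

for (lhs, rhs), p in probs.items():
    surprisal = -math.log2(p)  # surprisal in bits: rarer rules score higher
    print(f"{lhs} -> {' '.join(rhs)}  p={p:.2f}  surprisal={surprisal:.2f} bits")
```

The probabilities for each left-hand side sum to 1, so a rule that accounts for half of all `NP` expansions carries 1 bit of surprisal, while one seen a quarter of the time carries 2 bits.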