# Neural Model Predictions (for English)

Created February 2023 by [Forrest Davis](https://conf.ling.cornell.edu/forrestdavis/). Get in touch if you have any questions!

The following Colab notebook makes concrete some of the issues about probability I discussed today and lets you explore a bit on your own. If you've never used Colab before, [here](https://colab.research.google.com) is a nice introductory document. It links to this somewhat unsettling [video](https://www.youtube.com/watch?v=inN8seMm7UI).

What's critical for this notebook is running code. You can run a code block by hovering over it and pressing the play button to its left.



In [None]:
# Push the play button to the left
print('hello')

hello


# Setting up

The code blocks in this section do the following: 

1. Install the necessary packages
2. Import the libraries
3. Clone a repo I made for evaluating models
4. Move into the git repo

In [None]:
!pip install transformers
!pip install datasets
!pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.0-py3-none-any.whl (6.3 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.12.0 tokenizers-0.13.2 transformers-4.26.0
Looking in indexes: https://pypi.org/simple, https://us

In [None]:
import transformers
import torch
import pandas as pd
import sentencepiece

In [None]:
#Clone the evaluation repo
!git clone https://github.com/forrestdavis/PublicModelsAPI.git
# Move to evaluation repo for ease of running
%cd /content/PublicModelsAPI

Cloning into 'PublicModelsAPI'...
remote: Enumerating objects: 244, done.
remote: Total 244 (delta 0), reused 0 (delta 0), pack-reused 244
Receiving objects: 100% (244/244), 45.64 MiB | 28.17 MiB/s, done.
Resolving deltas: 100% (119/119), done.
/content/PublicModelsAPI


# Experimenting with the model

I've provided three ways to query a neural model (the default is GPT-2 small, available [here](https://huggingface.co/gpt2) on Hugging Face):

1. Interactive mode, where you enter sentences or phrases and incremental metrics are retrieved from the model
2. Targeted mode, where you have a fixed context and want to explore a set of possible continuations
3. Completion mode, where you have a fixed context and want to know the top K next words (or subwords)

## Interactive Mode

The following code runs interactive mode with an English neural model (GPT-2 small). The key columns are `prob`, the probability the model assigns to each word of the input given the preceding words, and `surp`, the surprisal assigned to that word. In the output below, `surp` matches the negative base-2 logarithm of `prob`, so a less probable word has a higher surprisal. A video of running the code can be found [here](https://github.com/forrestdavis/PublicModelsAPI/blob/main/demo/Interactive.gif). The video is for slightly different code, but the mechanics are the same.

Run the following block of support code.

In [None]:
#@title Interactive code

def getInteract(modelType='gpt2', modelName='gpt2'):

    # set path
    import sys
    sys.path.append("/content/PublicModelsAPI/")
    from src.experiments.Interact import Interact

    config = {"exp": "Interact", 
            "models": {modelType: [modelName]}, 
            "lower": False, 
            "include_punct": False
            }
    exp = Interact(config)
    exp.run_interact()

In [None]:
getInteract()

Running on cpu
Using pad_token, but it is not set yet.
Pad token was set


string: The man who is tall is happy
word                 | Split | Unk | Punct | ModelName            | surp     | prob      
-----------------------------------------------------------------------------------------
The                  |     0 |   0 |     0 | gpt2                 |        0 |          1
man                  |     0 |   0 |     0 | gpt2                 |   10.116 |     0.0009
who                  |     0 |   0 |     0 | gpt2                 |    3.143 |    0.11321
is                   |     0 |   0 |     0 | gpt2                 |    5.448 |    0.02291
tall                 |     0 |   0 |     0 | gpt2                 |   14.033 |      6e-05
is                   |     0 |   0 |     0 | gpt2                 |     4.86 |    0.03442
happy                |     0 |   0 |     0 | gpt2                 |   11.345 |    0.00038
string: The man who is tall are happy
word                 | Split | Unk | Punct | ModelName            | surp     | prob      
-------------------------

KeyboardInterrupt: ignored
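
The relationship between the `prob` and `surp` columns can be checked by hand. The following is a minimal sketch, not code from the PublicModelsAPI repo: it assumes surprisal is the negative base-2 logarithm of probability, which is consistent with the values in the table above up to rounding. The function name `surprisal` and the list `rows` are illustrative.

```python
import math

def surprisal(prob):
    """Surprisal in bits: -log2(prob). Assumed definition, see the text above."""
    return -math.log2(prob)

# (word, prob) pairs copied from the interactive output above,
# skipping the first word, which has no preceding context.
rows = [("man", 0.0009), ("who", 0.11321), ("is", 0.02291),
        ("tall", 6e-05), ("is", 0.03442), ("happy", 0.00038)]

for word, prob in rows:
    print(f"{word:<8}{prob:<10}{surprisal(prob):.3f}")

# Surprisals add where probabilities multiply, so the total surprisal
# of these words equals -log2 of the product of their probabilities.
total = sum(surprisal(prob) for _, prob in rows)
joint = math.prod(prob for _, prob in rows)
print(f"total surprisal: {total:.3f} bits")
print(f"-log2(product):  {surprisal(joint):.3f} bits")
```

Because the probabilities in the table are rounded, the recomputed surprisals may differ slightly from the `surp` column in the last decimal places.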

## Targeted Mode

In targeted mode, you provide a context string and a set of target words, and the probability the model assigns to each target word as the continuation of the context is returned. First, run the following code block, which sets up the helper function (this only needs to be run once). Then change the values of the `context` and `targets` variables as desired and run the second code block.

In [None]:
#@title Targeted code
def getTargeted(context, targets, 
                modelType='gpt2', modelName='gpt2'):
    
    # set path
    import sys
    sys.path.append("/content/PublicModelsAPI/")
    from src.models import models

    run_config = {'models': {modelType: [modelName]}}

    LM = models.load_models(run_config)[0]

    Ps = []
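    for target in targets:
        # Strip once so the sentence and the sanity check use the same form
        target = target.strip()
        sent = context.strip() + ' ' + target
        # Get the probability of the final word
        prob = LM.get_aligned_words_probabilities(sent)[0][-1]
        # Sanity check: the final aligned word should be the target
        assert prob.word == target
        Ps.append((target, prob.prob))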

    return Ps

In [None]:
# Change these variables
context = 'I saw a fragile'
targets = ['of', 'whale']

# Leave this as is 
probs = getTargeted(context, targets)
print('----------------------------------')
for p in probs:
    print(f"P({p[0]}|{context}) = {round(p[1], 6)}")

Running on cpu
Using pad_token, but it is not set yet.
Pad token was set


----------------------------------
P(of|I saw a fragile) = 0.000334
P(whale|I saw a fragile) = 0.000169
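
Since both probabilities are small, it can be easier to compare continuations by their ratio than by their raw values. The sketch below is not part of the repo: `compare_targets` is a hypothetical helper that takes a list of `(target, prob)` pairs like the one `getTargeted` returns and reports each target's probability relative to the most probable target. The numbers are the rounded values printed above.

```python
def compare_targets(probs):
    """Sort (target, prob) pairs by probability and add each target's
    ratio to the most probable one. Hypothetical helper for illustration."""
    ranked = sorted(probs, key=lambda pair: pair[1], reverse=True)
    best = ranked[0][1]
    return [(target, prob, prob / best) for target, prob in ranked]

# Rounded values copied from the targeted-mode output above
probs = [('of', 0.000334), ('whale', 0.000169)]

for target, prob, ratio in compare_targets(probs):
    print(f"{target:<10}{prob:<12}{ratio:.2f}")
```

In this example the model gives "of" roughly twice the probability of "whale" after the same context.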


## Completion Mode

In completion mode, the top K next-word (or subword) predictions are returned. Run the following code block once to set up the helper function. Then use the final code block to set the context and the number of completions you want to see, and run it to generate the results.

In [None]:
#@title Completion code
def getTopK(context, k=10, 
            modelType='gpt2', modelName='gpt2'):
    
    # set path
    import sys
    import torch
    sys.path.append("/content/PublicModelsAPI/")
    from src.models import models

    run_config = {'models': {modelType: [modelName]}}

    LM = models.load_models(run_config)[0] 

    output = LM.get_output(context)
    logits = output[-1]
    #Final predictions
    final = torch.nn.functional.softmax(logits[0,-1,:], dim=-1)
    #Get topk predictions
    topK = torch.topk(final, k)
    values = topK.values.tolist()
    indices = topK.indices.tolist()
    #Convert to token representations
    tokens = LM.tokenizer.convert_ids_to_tokens(indices)
    #Strip the 'Ġ' marker that GPT-2's tokenizer uses for a preceding space
    tokens = [token.replace('Ġ', '') for token in tokens]
    #Safety check
    assert len(tokens) == len(values)

    return list(zip(tokens, values))
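
The core of `getTopK` is a softmax over the scores for the final position followed by a top-K selection. The sketch below reproduces those two steps in plain Python on a toy example so the logic can be followed without loading a model. The vocabulary and logit values are made up for illustration, and `softmax` and `top_k` are hypothetical helpers, not functions from the repo or from `torch`.

```python
import heapq
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    # Subtract the max before exponentiating for numerical stability
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(vocab, logits, k):
    """Return the k most probable (token, prob) pairs, highest first."""
    probs = softmax(logits)
    best = heapq.nlargest(k, zip(vocab, probs), key=lambda pair: pair[1])
    # GPT-2 style tokens mark a preceding space with 'Ġ'; strip it for display
    return [(token.replace('Ġ', ''), prob) for token, prob in best]

# Made-up vocabulary and scores (assumptions, not model output)
toy_vocab = ['Ġprofessor', 'Ġwriter', 'Ġlingu', 'Ġbanana']
toy_logits = [2.0, 0.5, 0.3, -1.0]

for token, prob in top_k(toy_vocab, toy_logits, k=3):
    print(f"{token:<12}{prob:.4f}")
```

Stripping the marker only affects how tokens are displayed; the probabilities are computed over the full vocabulary before the top K entries are selected.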

In [None]:
context = 'Noam Chomsky is a'
K = 10

predictions = getTopK(context, K)

print(f"Context: {context}")
print("\tword"+" "*16+"prob")
for pred in predictions:
    print(f"\t{pred[0]: <20}{round(pred[1], 6)}")


Running on cpu
Using pad_token, but it is not set yet.
Pad token was set


Context: Noam Chomsky is a
	word                prob
	professor           0.107418
	former              0.044045
	writer              0.021567
	journalist          0.020247
	lingu               0.018673
	senior              0.014657
	historian           0.013433
	philosopher         0.013281
	political           0.012487
	scholar             0.012242
