# Poking at Ever Larger Language Models
## An introduction for (digital) humanists
### From Neural to Pretrained Language Models


Sources used in this tutorial
- Programming Historian
- Jurafsky & Martin

## What are language models

Language models (LMs) tell us what is likely to come next in a sequence. More technically:

> “[Language models] assign a probability to each possible next word.” (Jurafsky & Martin)

Given the sentence **“Predicting the future is hard, but not …”**

- P(“impossible” | sentence) is greater than P(“aardvark” | sentence)


```Read P(“impossible” | sentence) as the probability of observing the token “impossible” given the sequence “Predicting the future is hard, but not ..."```


```Probabilities are values between 0 and 1; the probabilities of all possible next tokens sum to 1.```

**Peeking ahead**: if you can predict what comes next in a text sequence, you learn quite a lot about language use and the world in general.

- Paris is located in [BLANK]
- He was late. I was really angry and told [BLANK]

## Quick recap
- Language modelling is the task of predicting the next word *w* given a history *h* (i.e. P(w | h))
- At each step, we can compute a probability distribution over all possible next words
- We can measure the **performance** of a model by evaluating how well a model can predict the next word (it will assign higher probabilities to actual texts)
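
To make P(w | h) concrete, here is a minimal sketch of a bigram model, where the history *h* is just the previous word. The three-sentence corpus and the helper name `next_word_probs` are made up for illustration; real language models are estimated on far more text.

```python
from collections import Counter, defaultdict

# Toy corpus (hypothetical): three short sentences, already split into words.
corpus = [
    "predicting the future is hard".split(),
    "predicting the weather is hard".split(),
    "the future is bright".split(),
]

# Count how often each word w follows a one-word history h.
counts = defaultdict(Counter)
for sentence in corpus:
    for h, w in zip(sentence, sentence[1:]):
        counts[h][w] += 1

def next_word_probs(h):
    """Return P(w | h) for every word w observed after the history h."""
    total = sum(counts[h].values())
    return {w: c / total for w, c in counts[h].items()}

probs_after_is = next_word_probs("is")
print(probs_after_is)  # "hard" follows "is" twice, "bright" once
```

The probabilities for a given history sum to 1, and words that were seen more often after that history receive a higher probability.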

## Pretrained Language Models

- Transition from N-gram to neural language models (ca. 2013)
    - word2vec: predict the center word given a context of n words, or predict the context given a center word (fixed context)
    - PLMs: predict the next word given a sequence, or predict masked words in a sequence (variable length)
    - Models become 'larger', with more parameters, and they can model the meaning of a token in context

## Terminology

<img src="https://soundgas.com/wp-content/uploads/2021/02/Vintage-mixers-from-Roland-Yamaha-1024x576.jpg" alt="knobs" width="500">

- Parameters are "knobs" you can adjust to transform an input to the output you want
- For a language model, the input is a sentence and the output is a probability distribution over words (which should assign a high probability to the actual next word)
- Deep Learning algorithms attempt to find the optimal setting of these knobs. The more knobs, the more complex stuff you can do (but equally, it becomes harder to understand how the machine actually works).

![simpleNN](https://miro.medium.com/v2/resize:fit:624/1*U3FfvaDbIjr7VobJj89fCQ.png)
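
The "knobs" idea can be sketched with a tiny one-layer model. Everything here is hypothetical (the two-word vocabulary, the numeric input and the weight values); the point is only that changing a single parameter changes the output probabilities.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def tiny_model(x, weights):
    """One layer: each output score is a weighted sum of the inputs."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return softmax(scores)

vocab = ["impossible", "aardvark"]  # hypothetical two-word vocabulary
x = [1.0, 0.5]                      # hypothetical numeric encoding of the input
weights = [[0.2, 0.1],              # knobs feeding the score for "impossible"
           [0.2, 0.1]]              # knobs feeding the score for "aardvark"

before = tiny_model(x, weights)     # identical knobs give equal probabilities
weights[0][0] = 2.0                 # turn one knob up
after = tiny_model(x, weights)      # "impossible" now gets more probability

print(dict(zip(vocab, before)))
print(dict(zip(vocab, after)))
```

Training a neural network amounts to nudging such values automatically, for millions or billions of knobs at once.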



## Common PLM variants
- Causal/Autoregressive language models (GPT series): Predict the next [BLANK]
- Masked Language Models (BERT and family): Predict the [BLANK] word.
- By training a model on this task it learns a lot about language, and we can use this knowledge for generating new texts or other tasks.

Let's have a closer look at a real language model, GPT-2

In [None]:
!pip install transformers xformers

## Text Generation with GPT-2


Why is generating texts interesting for DH research? Can we use fictitious data?
- Sampling texts that could have been
- If the model learns some valuable patterns and associations in a corpus, we can possibly learn something about that corpus by studying its behaviour in reaction to prompts and new data
    - a concrete example we will be looking at:
        - GPT-Brexit
        - Perception of Theresa May versus Boris Johnson



While more complex, GPT-2 operates similarly to a simple N-Gram LM.
- Given a prompt or input sequence, it returns a probability distribution over the next word
- Then we can sample a word from this distribution, add it to the prompt, and repeat!
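
The predict, sample, append, repeat loop described above can be sketched without any neural network. The lookup table `toy_lm` and the function `generate` are hypothetical stand-ins: in GPT-2 the distribution comes from the model and depends on the whole prompt, not only on the last word.

```python
import random

# Hypothetical next-word table: for each word, the possible next words
# and their probabilities.
toy_lm = {
    "the": {"uk": 0.6, "eu": 0.4},
    "uk": {"is": 0.7, "leaves": 0.3},
    "eu": {"is": 1.0},
    "is": {"leaving": 0.5, "divided": 0.5},
    "leaves": {"the": 1.0},
}

def generate(prompt, max_new_tokens=5, seed=0):
    """Repeatedly sample a next word and append it to the sequence."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        dist = toy_lm.get(tokens[-1])
        if dist is None:       # no known continuation: stop generating
            break
        words, probs = zip(*dist.items())
        tokens.append(rng.choices(words, weights=probs, k=1)[0])
    return " ".join(tokens)

text = generate("the")
print(text)
```

Changing the seed gives a different continuation, which is why sampled text varies from run to run.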

Materials inspired by this [blog post](https://huggingface.co/blog/how-to-generate) and the excellent Programming Historian lesson.



## Next word prediction with GPT-2

Next word prediction is the building block of generative AI and we will also encounter it when playing with larger language models such as GPT-3 or ChatGPT.

In the following example, we generate just one token to show how a language model creates a probability distribution over possible next words.

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Model
import numpy as np
from torch.nn import Softmax
import pandas as pd

In [None]:
# the tokenizer splits a text into the units (tokens) the LM is built on
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# load GPT-2 with its language-modelling head; use the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

In [None]:
# load the bare GPT-2 model (without the language-modelling head); it is not used in the cells below
gpt2 = GPT2Model.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

In [None]:
prompt = 'Hello my name is' # define a prompt
predictions = model(**tokenizer(prompt, return_tensors='pt')) # get logits from model

In [None]:
predictions.logits.shape # the predictions as logits

In [None]:
# get the token with the highest score at the last position
tokenizer.decode(np.argmax(predictions.logits[0,-1,:].detach().numpy()))

In [None]:
softmax = Softmax(dim=0) # initialize softmax function
# convert the logits for the last position to probabilities
probs = softmax(predictions.logits[0,-1,:]).detach().numpy()
series = pd.Series(probs).sort_values(ascending=False)
index = [tokenizer.decode(x) for x in series.index] # change index to tokens
series.index = index # set tokens as index
series[:100].plot(kind='bar',figsize=(20,5)) # plot the 100 most probable tokens

## Generating texts from prompts

The preceding process is rather cumbersome: we generated just one additional word. The `transformers` library provides more convenient functions for generating texts based on a prompt.

In [None]:
#sequence = 'the duke of'
#sequence = 'A no deal Brexit'
sequence = 'The UK is'

In [None]:
from transformers import pipeline
generator = pipeline('text-generation', model = 'gpt2',pad_token_id=tokenizer.eos_token_id)
generator(sequence, max_length = 30, num_return_sequences=10)

## Refining text generation

There are multiple settings we can adjust to steer the text generation in a specific direction.

## Temperature
A very common parameter is `temperature` (which we will also encounter when playing with larger language models). 

Temperature regulates the randomness (or "creativity") of a language model: the logits are divided by the temperature before the softmax is applied. A temperature below 1 sharpens the distribution towards the most probable tokens; a temperature above 1 flattens it.

Increasing the temperature can make predictions more creative (or more random, if you [like](https://medium.com/mlearning-ai/softmax-temperature-5492e4007f71)).


Image taken from this [blog post](https://medium.com/mlearning-ai/softmax-temperature-5492e4007f71) on temperature in Softmax.

![temperature](https://miro.medium.com/v2/resize:fit:1400/format:webp/0*7xj72SjtNHvCMQlV.jpeg)
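
The effect of temperature can be checked on a handful of numbers. This is a minimal sketch in plain Python, assuming the common definition in which the logits are divided by the temperature before the softmax; the three logits and the function name `softmax_with_temperature` are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens

cold = softmax_with_temperature(logits, 0.1)   # nearly all mass on the top token
hot = softmax_with_temperature(logits, 10.0)   # close to a uniform distribution

for t in (0.1, 1.0, 10.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

With a very low temperature, sampling almost always picks the most probable token; with a high temperature, unlikely tokens get a real chance.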


In [None]:
import torch
torch.manual_seed(0)
generator(sequence, 
          max_length = 30, 
          num_return_sequences=5,
          do_sample=True, 
          top_k = 0,
          temperature=.000000001, # near-zero temperature: almost deterministic; try changing it to .7
         )


### Top k sampling

To prevent very unlikely words from derailing the generation, you can restrict the options and sample only from the k most probable words.

In [None]:
generator(sequence, 
          max_length = 30, 
          do_sample=True, 
          num_return_sequences=2,
          top_k=50)
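
What top-k filtering does to a distribution can be sketched in a few lines. The probabilities and the helper `top_k_filter` are hypothetical; the library applies the same idea to the scores for the full vocabulary at every generation step.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise so they sum to 1."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

# Hypothetical next-word distribution after the prompt "The UK is"
probs = {"leaving": 0.5, "divided": 0.3, "ready": 0.15, "aardvark": 0.05}

filtered = top_k_filter(probs, k=2)
print(filtered)  # only "leaving" and "divided" remain
```

Sampling then happens from the filtered distribution, so the tail of improbable words can never be chosen.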

### Top p or nucleus sampling

Another strategy is to sample from the smallest set of most probable words whose cumulative probability exceeds a threshold p. Unlike top k, the size of this set changes from step to step.

In [None]:
generator(sequence, 
          max_length = 30, 
          do_sample=True, 
          num_return_sequences=2,
          top_k=0,
          top_p=.92)
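
Nucleus filtering can be sketched in the same way. The distribution and the helper `top_p_filter` are hypothetical, and the sketch stops as soon as the cumulative probability reaches p; library implementations may treat the boundary slightly differently.

```python
def top_p_filter(probs, p):
    """Keep the most probable tokens until their cumulative probability
    reaches p, then renormalise so the kept probabilities sum to 1."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    return {tok: prob / cumulative for tok, prob in kept}

# Hypothetical next-word distribution after the prompt "The UK is"
probs = {"leaving": 0.5, "divided": 0.3, "ready": 0.15, "aardvark": 0.05}

nucleus = top_p_filter(probs, p=0.92)
print(nucleus)  # three words are needed to cover 92% of the probability mass
```

When the model is confident, the nucleus may contain a single word; when it is uncertain, many words are kept.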

## Adapting a language model

It is possible to change a language model by training it further (fine-tuning) on new documents. Based on the GPT-2 tutorial in Programming Historian, we trained a model on news snippets related to Brexit. In other words, we've built a GPT-Brexit model on top of GPT-2.



In [None]:
from transformers import pipeline
generator = pipeline('text-generation', model = 'Kaspar/gpt-brexit',tokenizer='gpt2',pad_token_id=tokenizer.eos_token_id)


### Exercise

- Define a prompt implicitly related to Brexit, for example "The UK is"
- Using the `pipeline`, generate 3 documents each with GPT-2 and GPT-Brexit
- Does this show interesting differences? How would you go about studying these models?

In [None]:
# write answer here

## Tracing Semantic Change with Masked Language Models

In [None]:
from transformers import pipeline
sentence = "Our sewing [MASK] stood near the wall where grated windows admitted sunshine, and their hymn to Labour was the only sound that broke the brooding silence."

masker = pipeline("fill-mask", model='bert-base-uncased')
print(masker(sentence))

victorian_masker = pipeline("fill-mask", model='Livingwithmachines/bert_1760_1850')
print(victorian_masker(sentence))

# Supervised Classification with BERT

## 1. Get training examples and annotate them

In [None]:
%%bash
wget "https://bl.iro.bl.uk/downloads/59a8c52f-e0a5-4432-9897-0db8c067627c?locale=en" -O animacy.zip
unzip animacy.zip

In [18]:
!pip install datasets
!pip install --upgrade transformers 
!pip install --upgrade accelerate



In [19]:
import numpy as np
from sklearn.metrics import f1_score, classification_report, accuracy_score
from datasets import load_dataset
from transformers import pipeline, AutoTokenizer
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from transformers import DataCollatorWithPadding

In [20]:
dataset = load_dataset('biglam/atypical_animacy')


In [21]:
dataset = dataset['train']

In [22]:
dataset

Dataset({
    features: ['id', 'sentence', 'context', 'target', 'animacy', 'humanness', 'offsets', 'date'],
    num_rows: 594
})

## Divide data in training and test split

In [23]:
test_size = int(len(dataset)*.3)
train_test = dataset.train_test_split(test_size=test_size , seed=42)
test_set = train_test['test']
val_size = int(len(train_test['train'])*.05)
train_val =  train_test['train'].train_test_split(test_size=val_size,seed=42)
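
The sizes of the three splits follow directly from the arithmetic in the cell above. A quick check in plain Python, using the 594 rows reported for the dataset (the variable names are only for illustration):

```python
n_total = 594                      # num_rows reported for the dataset
n_test = int(n_total * .3)         # 30% held out for testing
n_train_full = n_total - n_test    # what remains after removing the test set
n_val = int(n_train_full * .05)    # 5% of the remainder for validation
n_train = n_train_full - n_val     # examples actually used for training

print(n_train, n_val, n_test)      # 396 20 178
```

With fewer than 400 training examples this is a small dataset, so the scores on the test set will vary with the random seed.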



## Load a Pretrained Language Model

In [26]:
checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint,num_labels=2)


## Preprocess data for classification (tokenization)

In [25]:
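def preprocess_function(examples, target_col, label_col='animacy'):
    # tokenize a single example (map is called without batched=True)
    tokens = tokenizer(examples[target_col], truncation=True)
    # Trainer looks for the class id under 'label'; we assume `label_col` holds 0/1 values
    tokens['label'] = int(examples[label_col])
    return tokens

# pads the examples in each batch to the same length
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

train_val = train_val.map(preprocess_function,fn_kwargs={'target_col': 'sentence'})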


## Instantiate a training routine and train model on examples

In [17]:
training_args = TrainingArguments(
    output_dir="../results",
    seed = 42,
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
        )

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_val["train"],
    eval_dataset=train_val["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
        )


trainer.train()


## Evaluate on test examples

In [None]:
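# tokenize the test sentences in the same way as the training data
test_set = test_set.map(preprocess_function,fn_kwargs={'target_col': 'sentence'})
predictions = trainer.predict(test_set)
# pick the class with the highest logit for each example
preds = np.argmax(predictions.predictions, axis=-1)
# scikit-learn metrics expect the true labels first, then the predictions
print('F1 (binary):', f1_score(predictions.label_ids,preds,average='binary'))
print('F1 (macro):', f1_score(predictions.label_ids,preds,average='macro'))
print('F1 (micro):', f1_score(predictions.label_ids,preds,average='micro'))
print('Accuracy:', accuracy_score(predictions.label_ids,preds))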

The model returns only logits (unnormalised scores), one per class, for each example.

In [None]:
predictions
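
To read logits as probabilities, apply a softmax to each row. A minimal sketch in plain Python with made-up logits for three sentences and two classes; with the real output you would apply the same function to each row of `predictions.predictions`.

```python
import math

def softmax(row):
    """Convert one row of logits into probabilities that sum to 1."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: one row per sentence, one column per class
logits = [[1.2, -0.8], [-0.3, 0.4], [0.1, 0.1]]

probabilities = [softmax(row) for row in logits]
# the predicted class is the column with the highest probability
labels = [max(range(len(row)), key=row.__getitem__) for row in probabilities]

for row, label in zip(probabilities, labels):
    print([round(p, 3) for p in row], label)
```

The predicted class is the same whether you take the argmax of the logits or of the probabilities; the probabilities additionally show how confident the model is.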

# Fin.