# Poking at Ever Larger Language Models
## An introduction for (digital) humanists
### From Neural to Pretrained Language Models


Sources used in this tutorial
- Programming Historian
- Jurafsky & Martin

## What are language models

LMs tell us what is likely to come next in a sequence. More technically:

> “[Language models] assign a probability to each possible next word.” (Jurafsky & Martin)

Given the sentence **“Predicting the future is hard, but not …”**

- P(“impossible” | sentence) is greater than P(“aardvark” | sentence)


```Read P(“impossible” | sentence) as the probability of observing the token “impossible” given the sequence “Predicting the future is hard, but not …”```


```Probabilities are values between 0 and 1 that sum up to 1.```

**Peeking ahead**: if you can predict what comes next in a text sequence, you learn quite a lot about language use and the world in general.

- Paris is located in [BLANK]
- He was late. I was really angry and told [BLANK]

## Quick recap
- Language modelling is the task of predicting the next word *w* given a history *h* (i.e. P(w | h))
- At each step, we can compute the probability over all the following words
- We can measure the **performance** of a model by evaluating how well a model can predict the next word (it will assign higher probabilities to actual texts)
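
The recap can be made concrete with a toy count-based model. The sketch below is only an illustration: the three-sentence corpus is invented, the history *h* is cut down to a single preceding word, and `next_word_probs` is a hypothetical helper name, not part of any library.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration only
corpus = [
    "predicting the future is hard",
    "predicting the weather is hard",
    "the future is not impossible",
]

# Count how often each word w follows a one-word history h
counts = defaultdict(Counter)
for line in corpus:
    words = line.split()
    for h, w in zip(words, words[1:]):
        counts[h][w] += 1

def next_word_probs(h):
    """Return P(w | h) for every word w observed after history h."""
    total = sum(counts[h].values())
    return {w: c / total for w, c in counts[h].items()}

probs = next_word_probs("the")
print(probs)  # the probabilities over all following words sum to 1
```

A model that assigns higher probabilities to the words that actually occur in held-out text is, in this sense, the better language model.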

## Pretrained Language Models

- Transition from N-Gram to Neural Language Models (ca. 2013)
    - word2vec: predict the center word given a context of n words, or predict the context given a center word (fixed context)
    - PLMs: predict the next word given sequence or predict masked words in a sequence (variable length)
    - Models become 'larger', more parameters. They can model token meaning in context

## Terminology

<img src="https://soundgas.com/wp-content/uploads/2021/02/Vintage-mixers-from-Roland-Yamaha-1024x576.jpg" alt="knobs" width="500">

- Parameters are "knobs" you can adjust to transform an input to the output you want
- For a language model, the input is a sentence, the output is a probability over words (which should resemble the actual next word)
- Deep Learning algorithms attempt to find the optimal setting of these knobs. The more knobs, the more complex stuff you can do (but equally, it becomes harder to understand how the machine actually works).
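
To give a feel for what "finding the optimal setting of the knobs" means, here is a minimal sketch with a single knob `w`, tuned by gradient descent. The numbers and the learning rate are made up for illustration; real language models adjust millions or billions of such parameters in the same spirit.

```python
# One "knob" w: we want the output w * x to match the target y
x, y = 2.0, 6.0        # made-up training example; the ideal knob setting is 3.0
w = 0.0                # start with the knob at zero
learning_rate = 0.1    # how far we turn the knob at each step (assumed value)

for step in range(50):
    error = w * x - y             # how far the output is from the target
    gradient = 2 * error * x      # derivative of error**2 with respect to w
    w -= learning_rate * gradient # turn the knob in the direction that reduces the error

print(round(w, 3))
```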

![simpleNN](https://miro.medium.com/v2/resize:fit:624/1*U3FfvaDbIjr7VobJj89fCQ.png)



## Common PLM variants
- Causal/Autoregressive language models (GPT series): Predict the next [BLANK]
- Masked Language Models (BERT and family): Predict the [BLANK] word.
- By training a model on this task it learns a lot about language, and we can use this knowledge for generating new texts or other tasks.
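
The difference between the two training tasks can be sketched by building training examples from one sentence. This is a simplification under stated assumptions: whitespace splitting stands in for a real tokenizer, and only one token is masked, whereas real masked language models hide a random subset of tokens.

```python
import random

tokens = "Paris is located in France".split()

# Causal / autoregressive: each prefix is used to predict the token that follows it
causal_examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Masked: hide one token and predict it from both the left and the right context
random.seed(0)
position = random.randrange(len(tokens))
masked_input = tokens[:position] + ["[MASK]"] + tokens[position + 1:]
masked_example = (masked_input, tokens[position])

for context, target in causal_examples:
    print(" ".join(context), "->", target)
print(" ".join(masked_example[0]), "->", masked_example[1])
```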

Let's have a closer look at a real language model, GPT-2

## HuggingFace 🤗 and the Transformers library

HuggingFace is a company specialised in distributing deep learning models and data. 

Their open source `transformers` library has become one of the most popular libraries for NLP:
* Makes state-of-the-art NLP easier to use.
* Provides APIs to download and use pretrained models, but also allows you to load and fine-tune your own models.
* It is open source! 
* Maintains a **model hub**: central point for people to share and find models. They host more than 50K models, supporting different languages and different tasks, and also more than 7K datasets.

We'll just scratch the surface, but if you are interested in this, we highly recommend the HuggingFace course: https://huggingface.co/course

### What are Transformers (the T in GPT and BERT)

A **transformer** is a deep learning model that uses the **attention** mechanism (loosely inspired by cognitive attention: it weighs the parts of a sequence that carry the key information more heavily and down-weights less relevant parts). Its development has had a huge impact on deep learning, especially in natural language processing and computer vision. It allows more effective modelling of long-term dependencies between the words in a sequence, and more efficient training, because the input no longer has to be processed strictly in sequence order.

You can read the original paper [here](https://arxiv.org/abs/1706.03762). It is arguably the most impactful paper (in computer science) of the last decade, with 79,588 citations on Google Scholar (last checked 29/06/2023 at 6:58 AM).



### Install the required HuggingFace libraries

In [None]:
%%bash
pip install transformers xformers accelerate datasets

## Text Generation with GPT-2


Why is generating texts interesting for DH research? Can we use fictitious data?
- Sampling texts that could have been
- If the model learns some valuable patterns and associations in a corpus, we can possibly learn about that corpus by studying the model's behaviour in reaction to prompts and new data
    - a concrete example we will be looking at
        - GPT-Brexit
        - Perception of Theresa May versus Boris Johnson



While more complex, GPT-2 operates similarly to a simple N-Gram LM.
- Given a prompt or input sequence, it returns a probability distribution over the next word
- Then we can sample a word from this distribution, add it to the prompt, and repeat!
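
The predict, sample, append, repeat loop can be sketched without any neural network. In the example below `toy_model` is a hypothetical hand-written table of next-word probabilities standing in for GPT-2, and `generate` is an illustrative helper, not a function of the `transformers` library.

```python
import random

# Hypothetical next-word table standing in for a real language model
toy_model = {
    "the": {"uk": 0.6, "future": 0.4},
    "uk": {"is": 1.0},
    "future": {"is": 1.0},
    "is": {"hard": 0.5, "uncertain": 0.5},
}

def generate(prompt, max_new_words=5, seed=0):
    """Repeatedly sample a next word and append it to the prompt."""
    rng = random.Random(seed)
    words = prompt.lower().split()
    for _ in range(max_new_words):
        dist = toy_model.get(words[-1])
        if dist is None:  # no known continuation for this word: stop
            break
        candidates, weights = zip(*dist.items())
        # sample one word according to its probability, then extend the sequence
        words.append(rng.choices(candidates, weights=weights, k=1)[0])
    return " ".join(words)

print(generate("The"))
```

GPT-2 differs in that it conditions on the whole preceding sequence rather than only the last word, but the generation loop has the same shape.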

Materials inspired by this [blog post](https://huggingface.co/blog/how-to-generate) and the excellent Programming Historian lesson.



## Next word prediction with GPT-2

Next word prediction is the building block of generative AI and we will also encounter it when playing with larger language models such as GPT-3 or ChatGPT.

In the following example, we generate just one token to show how a language model creates a probability distribution over possible next words.

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Model
import numpy as np
from torch.nn import Softmax
import pandas as pd

In [None]:
# tokenizer will split a text in units the LM is built on
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

In [None]:
# load the gpt-2 model
gpt2 = GPT2Model.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

In [None]:
prompt = 'Hello my name is' # define a prompt
predictions = model(**tokenizer(prompt, return_tensors='pt')) # get logits from model

In [None]:
predictions.logits.shape # the predictions as logits

In [None]:
# get words with highest probability
tokenizer.decode(np.argmax(predictions.logits[0,-1,:].detach().numpy()))

In [None]:
softmax = Softmax(dim=0) # initialize softmax function
series = pd.Series(softmax(predictions.logits[0,-1,:]).detach()).sort_values(ascending=False)
index = [tokenizer.decode(x) for x in series.index] # change index to tokens
series.index = index # set tokens as index
series[:100].plot(kind='bar',figsize=(20,5)) # plot results

## Generating texts from prompts

The preceding process is rather cumbersome: we generated just one additional word. The `transformers` library provides more convenient functions for generating texts based on a prompt.

In [None]:
#sequence = 'the duke of'
#sequence = 'A no deal Brexit'
sequence = 'The UK is'

In [None]:
from transformers import pipeline
generator = pipeline('text-generation', model = 'gpt2',pad_token_id=tokenizer.eos_token_id)
generator(sequence, max_length = 30, num_return_sequences=10)

## Refining Text Generation

There are multiple settings we can adjust to steer the text generation in a specific direction.

## Temperature
A very common parameter is `temperature` (which we will also encounter when playing with larger language models). 

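Temperature regulates the "creativity" of a language model: it rescales the model's scores before they are turned into probabilities. Low values concentrate the probability on the most likely words, while high values flatten the distribution.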

Increasing the temperature can make predictions more creative (or random if you [like](https://medium.com/mlearning-ai/softmax-temperature-5492e4007f71#:~:text=Temperature%20is%20a%20hyperparameter%20of%20LSTMs%20(and%20neural%20networks%20generally,utilize%20the%20Softmax%20decision%20layer.)))


Image taken from this [blogpost](https://medium.com/mlearning-ai/softmax-temperature-5492e4007f71) on temperature in Softmax.

![temperature](https://miro.medium.com/v2/resize:fit:1400/format:webp/0*7xj72SjtNHvCMQlV.jpeg)
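
The effect of temperature can be reproduced by hand. The sketch below divides made-up scores (logits) for three candidate words by the temperature before applying softmax; `softmax_with_temperature` is an illustrative helper written for this tutorial, not a library function.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then normalise them to probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate words
for t in (0.1, 1.0, 10.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

With a low temperature almost all probability goes to the highest-scoring word; with a high temperature the three words become nearly equally likely.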


In [None]:
import torch
torch.manual_seed(0)
generator(sequence, 
          max_length = 30, 
          num_return_sequences=5,
          do_sample=True, 
          top_k = 0,
          temperature=.000000001, # change temperature to .7
         )


### Top k sampling

To prevent unlikely outliers from derailing the generation, you can restrict the options and sample only from the k most probable words.
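
As a minimal sketch of the idea (with an invented four-word distribution and a hypothetical helper `top_k_filter`), top-k keeps the k most probable words and renormalises their probabilities before sampling:

```python
def top_k_filter(probs, k):
    """Keep the k most probable words and renormalise so they sum to 1."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {word: p / total for word, p in kept}

# Invented next-word distribution for illustration
probs = {"europe": 0.5, "trouble": 0.3, "leaving": 0.15, "aardvark": 0.05}
filtered = top_k_filter(probs, k=2)
print(filtered)
```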

In [None]:
generator(sequence, 
          max_length = 30, 
          do_sample=True, 
          num_return_sequences=2,
          top_k=50)

### Top p or nucleus sampling

Another strategy is to sample from the smallest set of words whose cumulative probability exceeds the probability p.
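
A minimal sketch of nucleus filtering, again with an invented distribution and a hypothetical helper `top_p_filter`: words are added in order of probability until their cumulative probability reaches p, and the rest are discarded.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top words whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:  # the nucleus is complete
            break
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}

# Invented next-word distribution for illustration
probs = {"europe": 0.5, "trouble": 0.3, "leaving": 0.15, "aardvark": 0.05}
nucleus = top_p_filter(probs, p=0.92)
print(nucleus)
```

Unlike top-k, the number of words that survive varies: a peaked distribution yields a small nucleus, a flat one a large nucleus.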

In [None]:
generator(sequence, 
          max_length = 30, 
          do_sample=True, 
          num_return_sequences=2,
          top_k=0,
          top_p=.92)

## Adapting a language model

It is possible to change a language model by further training or fine-tuning it on new documents. Based on the tutorial on GPT-2 in Programming Historian we trained a model on news snippets related to Brexit. In other words, we've built a GPT-Brexit model on top of GPT-2.



In [None]:
from transformers import pipeline
generator = pipeline('text-generation', model = 'Kaspar/gpt-brexit',tokenizer='gpt2',pad_token_id=tokenizer.eos_token_id)


### Exercise

- Define a prompt implicitly related to Brexit, for example "The UK is"
- Using the `pipeline`, can you generate 3 documents with GPT-2 and GPT-Brexit?
- Does this show interesting differences? How would you go about studying these models?

In [None]:
# write answer here

### Exercise

Find another model for text generation on the Hugging Face hub, inspect the model card and generate some text.

In [None]:
# write answer here

## Modeling Word Meaning in Context with BERT

**BERT** (Bidirectional Encoder Representations from Transformers) is a hugely successful transformer-based model that creates contextualized word embeddings, capturing fine-grained contextual properties of words. It learns this contextual information through a masking process: it hides some words during training and uses the surrounding context to infer them.

Masked language modeling is the task of masking tokens in a sequence with a masking token, and prompting the model to fill that mask with an appropriate token (source: https://huggingface.co/transformers/task_summary.html#masked-language-modeling). The `fill-mask` pipeline replaces the mask in a sequence by the most likely prediction according to a BERT model.

We will create a `fill-mask` pipeline using the `bert-base-uncased` English model (and its tokenizer), as follows:

In [None]:
masker = pipeline("fill-mask", model='bert-base-uncased')

In [None]:
sentence = """When a cell has been produced, we can then trace some of the
            stages by which new [MASK] are formed. There appear to be four
            modes in which vegetable cells are multiplied. The new cells
            may either proceed from a nucleus or they may be formed at
            once in the protoplasm."""

outputs = masker(sentence)

# Let's print the results in an easier-to-read format:
for o in outputs:
    print("Prediction:", o['token_str'])
    print("Score:     ", round(o['score'],4))
    print()

In [None]:
sentence = """Imprisonment with proper employment, and at least two visits
            every day from a prison officer. The punishment does not
            extend over a month. A week must elapse before the same
            prisoner can be put again into the dark [MASK]."""

outputs = masker(sentence)

# Let's print the results in an easier-to-read format:
for o in outputs:
    print("Prediction:", o['token_str'])
    print("Score:     ", round(o['score'],4))
    print()

### Exercise

Think of another highly ambiguous word (e.g. "bank") and apply the same procedure as above to assess if BERT manages to distinguish the different senses.

In [None]:
# write answer here

### Exercise

HuggingFace provides BERT models in other languages, or even multilingual models. Search the hub for BERT (or similar masked language models) in any other language than English and apply the "fill-mask" pipeline.

In [None]:
# write answer here

### Tracing Semantic Change with Masked Language Models

In [None]:

sentence = "Our sewing [MASK] stood near the wall where grated windows admitted sunshine, and their hymn to Labour was the only sound that broke the brooding silence."

In [None]:
masker = pipeline("fill-mask", model='bert-base-uncased')
print(masker(sentence))

In [None]:
victorian_masker = pipeline("fill-mask", model='Livingwithmachines/bert_1760_1850')
print(victorian_masker(sentence))

# Supervised Classification with BERT
## The Living with Machines case study

In [19]:
import numpy as np
from sklearn.metrics import f1_score, accuracy_score
from datasets import load_dataset, Value
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
from transformers import DataCollatorWithPadding

### Annotate and load data

In [20]:
dataset = load_dataset('biglam/atypical_animacy')
dataset['train'][:3] # load_dataset returns a DatasetDict, so select the split before slicing


### Process dataset

In [None]:
dataset = dataset.remove_columns(['id', 'context', 'target', 'humanness', 'offsets', 'date'])
dataset = dataset.rename_columns({'animacy':'label','sentence':'text'})

In [21]:
dataset = dataset['train']
dataset

In [22]:
dataset[:3]

Dataset({
    features: ['id', 'sentence', 'context', 'target', 'animacy', 'humanness', 'offsets', 'date'],
    num_rows: 594
})

In [None]:
new_features = dataset.features.copy()
new_features["label"] = Value("int32")
dataset = dataset.cast(new_features)
dataset[:3]

### Split data into training and test set

In [23]:
test_size = int(len(dataset)*.3)
train_test = dataset.train_test_split(test_size=test_size , seed=42)
test_set = train_test['test']
val_size = int(len(train_test['train'])*.05)
train_val =  train_test['train'].train_test_split(test_size=val_size,seed=42)



### Load a Pretrained Language Model and Tokenizer

In [26]:
checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint,num_labels=2)


### Tokenize and preprocess data 

In [25]:
def preprocess_function(examples, target_col):
    return tokenizer(examples[target_col], truncation=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
train_val = train_val.map(preprocess_function,fn_kwargs={'target_col': 'text'}) # 'sentence' was renamed to 'text' above


## Train Model on Annotated Examples

In [17]:
training_args = TrainingArguments(
    output_dir=f"../results", 
    seed = 42,
    learning_rate=1e-3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
        )

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_val["train"],
    eval_dataset=train_val["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
        )


trainer.train()


### Evaluate on test examples

In [None]:
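test_set = test_set.map(preprocess_function,fn_kwargs={'target_col': 'text'})
predictions = trainer.predict(test_set)
preds = np.argmax(predictions.predictions, axis=-1) # most likely class per example
# scikit-learn metrics expect the true labels first, then the predictions
print('F1 (binary):', f1_score(predictions.label_ids, preds, average='binary'))
print('F1 (macro): ', f1_score(predictions.label_ids, preds, average='macro'))
print('F1 (micro): ', f1_score(predictions.label_ids, preds, average='micro'))
print('Accuracy:   ', accuracy_score(predictions.label_ids, preds))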

The model only returns logits by class
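
Logits are unnormalised scores; applying softmax turns them into probabilities per class. The sketch below uses made-up logits for three sentences and two classes rather than the actual output of `trainer.predict`, and `softmax` is written out by hand for illustration.

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits: one row per sentence, one column per class
example_logits = [[1.2, -0.8], [-0.3, 2.1], [0.1, 0.0]]
predicted_classes = []
for row in example_logits:
    probs = softmax(row)
    predicted = probs.index(max(probs))  # same result as taking the argmax of the logits
    predicted_classes.append(predicted)
    print([round(p, 3) for p in probs], "-> class", predicted)
```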

In [None]:
predictions

## The feature extraction pipeline

Here we will see how to get vectors for words in context.

Similarly to what we did with word2vec, we may also want to have access to the vector of a certain word. However, unlike with word2vec, the vector of a word will depend on the context in which the word occurs. This means that we can't just ask for the vector of the word "apple", for example: we will need to ask for the vector of the word "apple" given a certain context.

We first import the following two libraries, which will help us work with vectors:

In [None]:
import numpy as np # python library used for working with vectors
from scipy import spatial # package to help compute distance or similarity between vectors

The pipeline task to obtain the vectors for tokens in a sequence is `feature-extraction`. As you can see, creating this pipeline is very similar to creating the `fill-mask` pipeline.

We will store the pipeline in a variable called `nlp_features`:

In [None]:
nlp_features = pipeline("feature-extraction",
                    model='distilbert-base-uncased',
                    tokenizer='distilbert-base-uncased')

Given a sentence, the pipeline tokenizes it and returns one vector per token:

In [None]:
sentence = "They were told that the machines stopped working."

output = nlp_features(sentence)
output_vectors = np.squeeze(output) # This removes single-dimensional entries (i.e. for vector readability)

Let's inspect the output. First of all, let's print it:

In [None]:
print(output_vectors)

This is an array (a list of vectors). Let's see its shape:

In [None]:
print(output_vectors.shape) # Print the shape of the vector

This means that we have an array (in other words a matrix, a table) that has 11 vectors of length 768 (or, in other words, 11 rows with 768 columns).

**Question:** 11 vectors? Why 11?

Let's see how the sentence is tokenized (we've seen how above):

In [None]:
# Load the **SAME** tokenizer used in the pipeline:
from transformers import AutoTokenizer
our_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Encode the sentence into a sequence of vocabulary IDs
encoded_seq = our_tokenizer.encode(sentence)
print(encoded_seq)

# And get the tokens given the vocabulary IDs
tokens = our_tokenizer.convert_ids_to_tokens(encoded_seq)
print(tokens)

# And print the length of the tokenized sequence:
print(len(tokens))

As you can see, the input sentence has been tokenized into 11 tokens. So what we have in the above array is 11 vectors (each one representing a token in the context of the sentence, **keeping the order of tokens**, i.e. the first vector will correspond to the special token `[CLS]`, the second vector to the token `they`, and so on until the last vector, which corresponds to the special token `[SEP]`).

How do we get the vector of a specific token?

In [None]:
print(tokens[6]) # The element at index 6 in the tokenized sentence is the token `machines` (we start counting from zero)

In [None]:
print(output_vectors[6]) # Therefore, the vector at index 6 in output_vectors is the vector of `machines` in this context.
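
Once we have the vector of a token in two different contexts, a common way to compare them is cosine similarity. The sketch below uses made-up three-dimensional vectors in place of the 768-dimensional vectors from the pipeline, and a hand-written `cosine_similarity` helper for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction, 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for the same word in two similar contexts, and an unrelated word
vec_context_a = [0.2, 0.9, 0.1]
vec_context_b = [0.25, 0.8, 0.05]
vec_other = [0.9, -0.1, 0.4]

print(cosine_similarity(vec_context_a, vec_context_b))
print(cosine_similarity(vec_context_a, vec_other))
```

With real vectors, `1 - spatial.distance.cosine(a, b)` from the `scipy` import above computes the same quantity.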

✏️ **Exercise:**

In [None]:
# Create two `feature-extraction` pipelines, one for the  1760-1850 model, and
# one for the 1890-1900 model. Find out whether the cosine similarity between words
# in sequences changes depending on which BERT model you use.
# 
# Type your code here: