# Playground to get acquainted with LLMs

A Huggingface playground for manipulating LLMs such as BERT or GPT.

Note that fine-tuning an LLM will come next and is not (yet) included in this notebook. The purpose here is to illustrate what an LLM is and how we can manipulate it, as a language model on the one hand and as an encoder yielding contextual embeddings on the other.

In [None]:
import torch
import numpy as np

from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

## Playing with GPT

GPT is a causal (decoder-only) transformer trained as a language model; in this notebook we loosely call the transformer stack the "encoder" because it encodes tokens into contextual embeddings. The pre-trained model, along with the corresponding tokenizer, can be loaded directly via the Huggingface library by specifying a checkpoint (aka model name) such as "gpt2" or "bert-base-uncased". See https://huggingface.co/models for an extensive list of models that can be imported.

Note that Huggingface models often come with the transformer encoder along with an associated "classification head", depending on the class invoked. For instance, AutoModelForCausalLM yields the model with an LM head that maps the contextual representation of each input token to a probability distribution over the vocabulary (more precisely, to logits that a softmax turns into probabilities). Simply using AutoModel yields the encoder with no classification head.
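
To make the idea of an LM head concrete, here is a minimal sketch in plain numpy, assuming the head is a single linear map from the embedding space to the vocabulary followed by a softmax. All sizes, names (`hidden_states`, `lm_head_weight`) and values are hypothetical toy stand-ins, not the weights or API of the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumption): far smaller than those of a real model.
seq_len, d_model, vocab_size = 4, 8, 10

# Stand-in for the encoder output: one contextual embedding per input token.
hidden_states = rng.normal(size=(seq_len, d_model))

# Stand-in for the LM head: one linear map from embedding space to the vocabulary.
lm_head_weight = rng.normal(size=(d_model, vocab_size))

# One row of logits per input position.
logits = hidden_states @ lm_head_weight

# Softmax over the vocabulary axis (shifted by the row max for numerical stability).
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

print(logits.shape)
print(probs.sum(axis=-1))
```

Each row of `probs` is a distribution over the toy vocabulary; with AutoModel alone, you would stop at the equivalent of `hidden_states`.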

We will use both options in this labwork, but be aware that other options exist, such as AutoModelForSequenceClassification (document classification) or AutoModelForTokenClassification (tagging). We will not use these, at this stage or later, so as to make explicit what a classification head is and how it works.

Huggingface documentation on the GPT2 model: https://huggingface.co/docs/transformers/model_doc/gpt2

### Getting acquainted with the model

In [None]:
#
# Loading the transformer encoder as well as the LM classification head that predicts a
# probability distribution over the vocabulary from the token contextual embeddings. The
# model loaded defines the architecture and the weights, both for the encoder and the LM
# classification head.
#
checkpoint = 'gpt2'

tokenizer = AutoTokenizer.from_pretrained(checkpoint) # load tokenizer
print(tokenizer)
print()

#
# print a few things about the model -- can you identify some of the important features such
# as the embedding dimension, the vocabulary size, the maximum authorized sequence length? No 
# worries if you don't understand all the parameters, you're not supposed to anyway. 
# 
# See https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2Config
#
model = AutoModelForCausalLM.from_pretrained(checkpoint) # load model
print(model.config)

print()
model.eval()

In [None]:
#
# Now let's encode a sentence and run it through the model to see what we get out of it
#
# See https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2LMHeadModel for details
# on the forward function of a GPT2LMHeadModel model.
#

text = 'I enjoy playing with LLMs.'

#
# Let's first have a look at what the tokenizer does.
#
# Question: Why do you think we have an 'attention_mask' attribute at the output of the tokenizer?
#
inputs = tokenizer(text, return_tensors="pt")
print('Output of tokenizer:\n', inputs)
print('Tokens and text:', tokenizer.convert_ids_to_tokens(inputs['input_ids'][0]), '-->', tokenizer.decode(inputs['input_ids'][0]))

#
# Run tokens through the model
#
# Question: Explain the output that we have 
#
with torch.no_grad():
    outputs = model(**inputs)

print('Output shape:', outputs.logits.shape)

### Looking at token probabilities and generating language
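
As background for this section, the sketch below shows on a toy example how next-token log-probabilities are read off the logits of a causal LM (row `i` scores the token at position `i+1`) and how greedy completion re-runs the whole sequence at each step. The vocabulary, the `table` of logits and the `toy_model` function are hypothetical stand-ins chosen for illustration; unlike a real LM, this toy model only looks at the current token.

```python
import numpy as np

def log_softmax(logits):
    """Log-probabilities over the last axis, computed stably."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

# Hypothetical 4-token vocabulary and fixed logits table (assumption).
vocab = ['<s>', 'a', 'b', '</s>']
table = np.array([
    [0.0, 2.0, 1.0, 0.0],   # after '<s>': prefers 'a'
    [0.0, 0.0, 3.0, 1.0],   # after 'a': prefers 'b'
    [0.0, 1.0, 0.0, 2.0],   # after 'b': prefers '</s>'
    [0.0, 0.0, 0.0, 5.0],   # after '</s>': stays on '</s>'
])

def toy_model(ids):
    """Return logits of shape (len(ids), vocab_size); row t predicts token t+1."""
    return table[np.array(ids)]

def sequence_log_probs(ids):
    """log P[ids[i+1] | ids[:i+1]] for every position i."""
    logprobs = log_softmax(toy_model(ids))
    return [float(logprobs[i, ids[i + 1]]) for i in range(len(ids) - 1)]

def greedy_complete(ids, max_new_tokens=5, eos_id=3):
    """Append the most likely next token, re-running the full sequence each step."""
    ids = list(ids)
    for _ in range(max_new_tokens):
        next_id = int(np.argmax(toy_model(ids)[-1]))  # logits at the last position
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids

print(sequence_log_probs([0, 1, 2]))
print([vocab[i] for i in greedy_complete([0])])  # ['<s>', 'a', 'b', '</s>']
```

Picking the argmax is only one decoding strategy; sampling from the distribution at the last position is an equally valid choice for a completion function.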

In [None]:
#
# Let's have a look at the LM probabilities
#

#
# get all log-probabilities
#
with torch.no_grad():
    logprobs = torch.log_softmax(outputs.logits, dim=-1)[0]

#
# print LM probabilities for each token in the input
#
ids = inputs['input_ids'][0]
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

print('LM probabilities for sequence', tokens)
for i in range(len(ids) - 1):
    
    next_token = ids[i+1].item()
    lm_prob = ##### TO COMPLETE ##### --> use .item() to convert tensor to float value
    
    print('  P[{}|{}] = {:.6f}'.format(tokens[i+1], ' '.join(tokens[:i+1]), lm_prob))
    
#
# TODO ::: Find the most likely token following 'with' in the input
#


In [None]:
#
# Write a prompt completion function based on GPT-2, following the idea of the previous labwork.
#
# For the sake of simplicity, we will run the entire sequence through the model at each step
# rather than caching previous computations to run only one step as we did for RNNs. In other
# words, after adding a token to the generated sequence, you will run the entire new sequence
# through the model to get the probabilities for the next token. In practice, there are ways
# to avoid that so as to be much more efficient.
#




### Looking at token embeddings
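
The first entry of `hidden_states` in the cell below is the input to the first transformer layer, which for GPT-2 is the sum of a token embedding and a learned position embedding. Here is a minimal numpy sketch of that sum; the table sizes, the random values and the names `token_embedding` and `position_embedding` are hypothetical and only serve the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumption): a real model uses much larger tables.
vocab_size, max_positions, d_model = 10, 6, 4

token_embedding = rng.normal(size=(vocab_size, d_model))        # one row per token id
position_embedding = rng.normal(size=(max_positions, d_model))  # one row per position

input_ids = np.array([3, 7, 7, 1])       # token 7 appears twice
positions = np.arange(len(input_ids))    # 0, 1, 2, 3

# Input to the first layer: token embedding plus position embedding.
layer0 = token_embedding[input_ids] + position_embedding[positions]

print(layer0.shape)
# Same token, different positions: the two input vectors differ.
print(np.allclose(layer0[1], layer0[2]))
```

Even before any attention layer, the two occurrences of token 7 get different vectors because of their positions; the later layers then add the dependence on context.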

In [None]:
#
# We can also get the embeddings of the tokens in addition to the output logits
#

text = 'I enjoy playing with LLMs.'

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

print(len(outputs.hidden_states))
print(outputs.hidden_states[0].shape, outputs.hidden_states[0][0][0].shape) # initial embeddings + positional encoding
print(outputs.hidden_states[-1].shape, outputs.hidden_states[-1][0][0].shape) # last layer's embeddings

In [None]:
#
# Here are 15 utterances with the token 'rat' with different meanings and morpho-syntactic functions, plus
# one with the token 'mouse' instead of 'rat'.
#
# Write a piece of code to visualize the 16 contextual embeddings in 2D with tSNE.
#

sentences = [
"He decided to rat on his friends to get a lighter sentence.",
"He's quick to rat out his accomplices.",
"She felt betrayed when he went to rat her out to the boss.",
"I can’t believe you would rat on me like that!",
"The suspect threatened to rat if they didn’t offer a deal.",
"A rat scurried across the floor last night.",
"The cat caught a rat in the garden.",
"The rat is often found in urban areas looking for food.",
"I heard a rat squeaking in the walls.",
"The farmer set traps to catch the rat in the barn.",
"He’s a rat for telling the police everything we did.",
"No one trusts him anymore because he's known as a rat.",
"You can’t just rat out your friends like that and expect to be forgiven.",
"She was labeled a rat after giving up the gang’s hideout.",
"He tried to act tough, but everyone knew he was a rat who'd sell anyone out.",
"I heard a mouse squeaking in the walls."
]

#
# tokenize all sentences at once: outputs lists rather than tensors to skirt the padding
# issue. Will have to convert to tensor before passing along to the model though.
#  
inputs = tokenizer(sentences) 

#
# retrieve embeddings of the token 'rat' in all sentences
#
rat_id = tokenizer.encode('a rat is an animal')[1]
mouse_id = tokenizer.encode('a mouse is an animal')[1]

embeddings = np.empty((len(sentences), model.config.n_embd), dtype='float32')

for i in range(len(sentences)):
    token_id = mouse_id if i == len(sentences) - 1 else rat_id
    
    idx = inputs['input_ids'][i].index(token_id)
    
    with torch.no_grad():
        # wrap the ids in a list to add the batch dimension: input shape is (1, sequence length)
        outputs = model(torch.tensor([inputs['input_ids'][i]]), output_hidden_states=True)
    
    embeddings[i,:] = ##### TO COMPLETE ##### --> use .detach().numpy() to get rid of gradients and convert to numpy() array


In [None]:
#
# tSNE projection with cosine distance
#

from sklearn.manifold import TSNE
from matplotlib import pyplot as plt

Y = TSNE(n_components=2, metric='cosine', init='random', random_state=0, perplexity=10).fit_transform(embeddings)
print(Y.shape)

plt.scatter(Y[:,0], Y[:,1])
for i in range(len(sentences)):
    plt.annotate(str(i+1), xy=(Y[i,0],Y[i,1]))
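
The projection above uses `metric='cosine'`, so two embeddings are considered close when they point in the same direction, regardless of their norm. A minimal sketch of that distance, taken here as one minus the cosine similarity, on hypothetical toy vectors rather than on model outputs; the helper name `cosine_distance_matrix` is made up for this example.

```python
import numpy as np

def cosine_distance_matrix(embeddings):
    """Pairwise cosine distances (1 - cosine similarity) between rows."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms          # rescale every row to unit length
    return 1.0 - unit @ unit.T

# Toy 3-dimensional "embeddings" (hypothetical values).
toy = np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],   # same direction as row 0, different norm
    [0.0, 1.0, 0.0],   # orthogonal to row 0
])

D = cosine_distance_matrix(toy)
print(np.round(D, 3))
```

Rows 0 and 1 are at distance 0 despite their different norms, while rows 0 and 2 are at distance 1.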


## Playing with BERT

BERT is a bidirectional transformer pre-trained with a dual objective: masked language modeling and next sentence prediction (see course slides). Similar to GPT2, pre-trained models can be downloaded from the Huggingface library to play with. They might come with a classification head or not, depending on the class used to load them.
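
With a masked LM, the prediction for a masked token is read at the position of the mask itself, with no shift by one as in the causal case. The sketch below illustrates this with numpy on random logits; the tiny vocabulary, the tokens and the logits are hypothetical stand-ins for what the model would output, and `np.argsort` plays the role that `torch.topk` would play on tensors.

```python
import numpy as np

def log_softmax(logits):
    """Log-probabilities over the last axis, computed stably."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

# Hypothetical vocabulary and tokenized input (assumption).
vocab = ['[MASK]', 'i', 'enjoy', 'playing', 'working', 'with', 'cats']
tokens = ['i', 'enjoy', '[MASK]', 'with', 'cats']

# Random stand-in for the logits of one sequence: shape (sequence length, vocabulary size).
rng = np.random.default_rng(0)
logits = rng.normal(size=(len(tokens), len(vocab)))

# Read the distribution at the position of the mask.
mask_position = tokens.index('[MASK]')
logprobs = log_softmax(logits[mask_position])

# Indices of the k largest log-probabilities, best first.
k = 3
top_ids = np.argsort(logprobs)[::-1][:k]
for token_id in top_ids:
    print(vocab[token_id], float(logprobs[token_id]))
```

Since the logits are random, the ranking printed here is meaningless; only the indexing pattern matters.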

In [None]:
checkpoint = 'bert-base-uncased'

tokenizer = AutoTokenizer.from_pretrained(checkpoint) # load tokenizer
print(tokenizer)
print()

#
# print a few things about the model -- can you identify some of the important features such
# as the embedding dimension, the vocabulary size, the maximum authorized sequence length? No 
# worries if you don't understand all the parameters, you're not supposed to anyway. 
# 
# See https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertConfig
#
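from transformers import AutoModelForMaskedLM # BERT is a masked LM, not a causal one

model = AutoModelForMaskedLM.from_pretrained(checkpoint) # load model with the masked LM head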
print(model.config)

print()
model.eval()

In [None]:
text = 'I enjoy [MASK] with LLMs.'

inputs = tokenizer(text, return_tensors="pt")
print('Output of tokenizer:\n', inputs)
print('Tokens and text:', tokenizer.convert_ids_to_tokens(inputs['input_ids'][0]), '-->', tokenizer.decode(inputs['input_ids'][0]))

with torch.no_grad():
    outputs = model(**inputs)

print('Output shape:', outputs.logits.shape)

In [None]:
#
# What are the 10 most likely tokens for the mask and the corresponding log-probabilities?
#
# Hint: you can use the torch.topk() function to get the top k values and indices
# of a tensor
#


In [None]:
#
# Take the same 15 sentences with token 'rat' (plus the one with 'mouse'), get the contextual
# embeddings at the output of the BERT model and plot again. The code is roughly the same as
# for the gpt2 model.
#


## Final note

If you are only interested in the pre-trained encoder, be it a gpt2 encoder or a BERT one, you can load the model without any classification head, in which case the output directly contains the (contextual) embeddings of the tokens. The following cell, for instance, illustrates how to do that for a BERT model. In the next lecture, we will use the encoder and the resulting embeddings as part of a neural network architecture, fine-tuning the encoder and training the classification elements of the network for a document classification task.

See https://huggingface.co/transformers/v3.5.1/model_doc/bert.html#bertmodel for details.


In [None]:
model = AutoModel.from_pretrained(checkpoint, add_pooling_layer=False)
print(model.config)

print()
model.eval()

In [None]:
text = 'I enjoy playing with LLMs.'

inputs = tokenizer(text, return_tensors="pt")
print('Output of tokenizer:\n', inputs)

with torch.no_grad():
    outputs = model(**inputs)

print('Last hidden states shape:', outputs['last_hidden_state'].shape)


# Now you can plug these contextual embeddings into a model... but that's for the next (and last) session!
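
As a hedged illustration of what "plugging in" can mean, one common option is to average the token embeddings of each sequence into a single document vector, ignoring padded positions thanks to the attention mask. This is only a sketch on hypothetical toy arrays, not necessarily what the next session will use; `masked_mean_pool` is a made-up helper name.

```python
import numpy as np

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over real tokens only (mask == 1)."""
    mask = attention_mask[..., None].astype(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(axis=1)                   # (batch, dim)
    counts = mask.sum(axis=1)                                         # (batch, 1)
    return summed / counts

# Toy batch: 2 sequences, 3 positions, 2-dimensional embeddings (hypothetical values).
last_hidden_state = np.array([
    [[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]],   # last position is padding
    [[2.0, 0.0], [4.0, 0.0], [6.0, 0.0]],
])
attention_mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

doc_vectors = masked_mean_pool(last_hidden_state, attention_mask)
print(doc_vectors)  # [[2. 2.] [4. 0.]]
```

The padded position in the first sequence does not contribute to its average, which is one reason the tokenizer returns an `attention_mask` alongside the token ids.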