# PyTorch & GPT-2 Language Model Fundamentals

This notebook demonstrates foundational skills in PyTorch and language model manipulation using OpenAI's GPT-2.

**Skills Demonstrated:**
- PyTorch tensor operations and neural network construction
- Building classification networks with multiple hidden layers
- GPT-2 tokenization and vocabulary handling
- Next-word prediction and probability calculations
- Cross-entropy loss computation and softmax temperature scaling

**Technologies:** PyTorch, Hugging Face Transformers, GPT-2

## 0. Setup

Let us first install a few required packages. (You may want to comment this out in case you use a local environment that already has the suitable packages installed.)

Note our use of `%%capture` in the cell below absorbs all of the output when the model(s) are loading. You can comment it out if you want to see that output.

In [192]:
# %%capture

# #!pip install torch  # commented because it is pre-installed in Colab
# !pip install torchtext
# !pip install transformers
# #!pip install numpy # commented because it is pre-installed in Colab
# !pip install portalocker
# !pip install pandas

Next, we will import the required libraries:

In [193]:
import copy
import random

import torch
import numpy as np
import pandas as pd

from torch.utils.data import Dataset, DataLoader
from torch import nn

from transformers import AutoTokenizer, GPT2Model, GPT2ForSequenceClassification, GPT2LMHeadModel

Let's make sure we will later put data and models on the proper device:

In [194]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cpu')

This should say 'cpu' if using a CPU, or 'cuda', if a GPU is used. We are using just the CPU in this assignment.
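To make the device handling concrete, here is a minimal sketch of moving a module and a tensor to `device`. The layer sizes and the names `layer` and `batch` are illustrative assumptions, not part of the assignment.

```python
import torch
from torch import nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Modules are moved in place; .to() returns the same module for convenience.
layer = nn.Linear(6, 3).to(device)

# Tensors are not moved in place; .to() returns a tensor on the target device.
batch = torch.rand(4, 6).to(device)

# Inputs and parameters must live on the same device for the forward pass.
out = layer(batch)
print(out.shape, out.device)
```

On a CPU-only machine both calls are effectively no-ops, which is why the rest of this notebook runs unchanged there.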

Now let's get started!

## 1. PyTorch Basics: Common Operations

Let's get started with some simple operations. For reference you should use the PyTorch documentation at https://pytorch.org/docs/stable/index.html .

Your goal is to create a simple neural network with two hidden layers that takes a (random) input, which we will create, and 'classifies' it in an imagined 3-class prediction problem. (We will not train the model, so the purpose is simply to test dimensions, expressions, etc., not real values. We do, however, want you to execute the cells consecutively in the proper order so that we can compare the final (randomly generated) outcomes. They should always be the same given the manual seed that we set.)

We start by setting the seed, which makes the randomly generated values reproducible.
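As a minimal sketch of what seeding buys us, the snippet below draws a random tensor twice after re-seeding and once more without re-seeding. The variable names are illustrative assumptions.

```python
import torch

# Re-seeding the global generator restarts the random stream,
# so the same call returns the same values.
torch.manual_seed(10)
first_draw = torch.rand(4, 6)

torch.manual_seed(10)
second_draw = torch.rand(4, 6)
same_after_reseed = torch.equal(first_draw, second_draw)

# Without re-seeding, the stream simply continues and the values differ.
third_draw = torch.rand(4, 6)
differs_without_reseed = not torch.equal(second_draw, third_draw)

print(same_after_reseed, differs_without_reseed)
```

This is also why the cells need to be run in order: every random call in between advances the generator.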

In [195]:
torch.manual_seed(10)

<torch._C.Generator at 0x7cc3751e80d0>

### 1.a Tensor Manipulation

Let's generate a random input dataset that mimics 4 examples with 6 features each. Consider using torch.rand().

In [196]:
input_dim = 6
n_examples = 4
n_classes = 3
#call your generated input tensor 'input_data'

input_data = torch.rand([n_examples, input_dim])

input_data

tensor([[0.4581, 0.4829, 0.3125, 0.6150, 0.2139, 0.4118],
 [0.6938, 0.9693, 0.6178, 0.3304, 0.5479, 0.4440],
 [0.7041, 0.5573, 0.6959, 0.9849, 0.2924, 0.4823],
 [0.6150, 0.4967, 0.4521, 0.0575, 0.0687, 0.0501]])

**1.a. Examining the value of `input_data[0,0]`**



Let's now do a few simple exercises. Using torch.argmax (https://pytorch.org/docs/stable/generated/torch.argmax.html), find the index of the maximum element for each row and each column.

In [198]:
# call your indices row_ind_max_arg and col_ind_max_arg

row_ind_max_arg = torch.argmax(input_data, dim=1)  # one index per row
col_ind_max_arg = torch.argmax(input_data, dim=0)  # one index per column

print('Index of argmax for each row', row_ind_max_arg)
print('Index of argmax for each column', col_ind_max_arg)

Index of argmax for each row tensor([3, 1, 3, 0])
Index of argmax for each column tensor([2, 1, 2, 2, 1, 2])
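The meaning of `dim` is easy to mix up: it names the dimension that is collapsed, not the one that is kept. A small hand-made tensor (an illustrative assumption, chosen so the result can be checked by eye) shows this:

```python
import torch

x = torch.tensor([[1.0, 5.0, 2.0],
                  [7.0, 0.0, 3.0]])  # shape (2, 3)

# dim=1 collapses the columns, leaving one index per row.
row_argmax = torch.argmax(x, dim=1)

# dim=0 collapses the rows, leaving one index per column.
col_argmax = torch.argmax(x, dim=0)

print(row_argmax.tolist())  # [1, 0]
print(col_argmax.tolist())  # [1, 0, 1]
```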


**1.b. Examining the indices of the elements with the largest value in each row. Copy the list of indices to the answer cell and represent them as a list, e.g. [55, 77, 99, 11].**


**1.c. Examining the indices of the elements with the largest value in each column. Again, copy the list of indices to the answer cell and represent them as a list, e.g. [55, 77, 99, 11].**


You can get the values of a tensor just like you can for numpy. For example, the values for the last column (i.e., fixed second dimension) can be obtained through:

In [201]:
print(input_data[:, -1])
print(input_data[:, 5])

tensor([0.4118, 0.4440, 0.4823, 0.0501])
tensor([0.4118, 0.4440, 0.4823, 0.0501])


Similarly, get the values of the last row (first index 'last', second index unconstrained):

In [202]:
# call your values for the last row last_row

last_row = input_data[-1,:]

print('Values of last row: ', last_row)

Values of last row: tensor([0.6150, 0.4967, 0.4521, 0.0575, 0.0687, 0.0501])


**1.d. Copy the tensor of the last row into the answers.**


Next, reshape input_data into a 2x12 tensor using \<tensor>.reshape

In [204]:
# call your reshaped tensor reshaped_input_data

reshaped_input_data = input_data.reshape(2, 12)

print('Reshaped data shape: ', reshaped_input_data.shape)

Reshaped data shape: torch.Size([2, 12])
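Reshaping keeps the elements in row-major order and only changes how they are grouped. The sketch below uses `torch.arange` instead of random data (an assumption made so the ordering is visible) and also shows `-1`, which lets PyTorch infer one dimension.

```python
import torch

x = torch.arange(24).reshape(4, 6)   # values 0..23 laid out row by row

y = x.reshape(2, 12)                 # two original rows are merged into each new row
z = x.reshape(2, -1)                 # -1 is inferred as 24 / 2 = 12

print(y[0].tolist())                 # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
print(tuple(y.shape), torch.equal(y, z))
```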


**1.e. Write the shape of the reshaped tensor as a tuple.**


### 1.b The Simple Classification Network

Now construct the network. Fill in your code for the `__init__` and `forward` methods. Again, we want to have two hidden layers (dims: hidden_dim_1, hidden_dim_2) with ReLU activation functions, and an output layer of dimension n_classes with a softmax activation function. The model should return both the probabilities (probs) and the logits (logits), as you can tell from the return statement.

In [206]:
class SimpleClassificationNertwork(nn.Module):
    def __init__(self, input_dim, hidden_dim_1, hidden_dim_2, n_classes):
        super().__init__()
        self.linear1 = nn.Linear(input_dim, hidden_dim_1)
        self.linear2 = nn.Linear(hidden_dim_1, hidden_dim_2)
        self.linear3 = nn.Linear(hidden_dim_2, n_classes)
        self.relu = nn.ReLU()

    def forward(self, x):
        # Hidden layer 1 with relu activation
        z1 = self.linear1(x)
        a1 = self.relu(z1)

        # Hidden layer 2 with relu activation
        z2 = self.linear2(a1)
        a2 = self.relu(z2)

        # Output layer with softmax activation (normalized across the class dimension)
        logits = self.linear3(a2)
        probs = nn.Softmax(dim=1)(logits)

        return probs, logits

In [207]:
mySimpleClassificationNertwork = SimpleClassificationNertwork(input_dim=input_dim,
 hidden_dim_1=7,
 hidden_dim_2=10,
 n_classes=n_classes)

In [208]:
probs,logits = mySimpleClassificationNertwork(input_data)

print('Probabilities:\n\t', probs)
print('\nLogits:\n\t', logits)

Probabilities:
	 tensor([[0.4503, 0.2337, 0.3160],
 [0.4515, 0.2499, 0.2985],
 [0.4563, 0.2303, 0.3134],
 [0.4513, 0.2440, 0.3047]], grad_fn=<SoftmaxBackward0>)

Logits:
	 tensor([[ 0.4086, -0.2475, 0.0545],
 [ 0.4077, -0.1839, -0.0061],
 [ 0.4272, -0.2564, 0.0514],
 [ 0.4109, -0.2040, 0.0180]], grad_fn=<AddmmBackward0>)
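Two quick sanity checks apply to any such network: each row of probabilities should sum to 1, and softmax should not change which class ranks highest. The sketch below uses `nn.Sequential` with the same layer sizes as a compact stand-in for the class above. This is an alternative formulation, not the notebook's own model, and because it is initialized separately its numbers will not match the output above.

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed, only so the sketch is repeatable

# Same layer sizes as above, expressed as a container instead of a custom class.
sketch_net = nn.Sequential(
    nn.Linear(6, 7), nn.ReLU(),
    nn.Linear(7, 10), nn.ReLU(),
    nn.Linear(10, 3),
)

x = torch.rand(4, 6)
sketch_logits = sketch_net(x)
sketch_probs = torch.softmax(sketch_logits, dim=1)  # normalize across classes

rows_sum_to_one = torch.allclose(sketch_probs.sum(dim=1), torch.ones(4))
same_ranking = torch.equal(sketch_probs.argmax(dim=1), sketch_logits.argmax(dim=1))
print(tuple(sketch_probs.shape), rows_sum_to_one, same_ranking)
```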


**NOTE: Once everything works, please rerun the cells from the one that sets the manual seed up to this cell to make sure that the numbers (if everything is correct) can be reproduced!**

**1.f. Copy the tensor for the probabilities into the answers file.**


In [209]:
[[0.4503, 0.2337, 0.3160],
 [0.4515, 0.2499, 0.2985],
 [0.4563, 0.2303, 0.3134],
 [0.4513, 0.2440, 0.3047]]

[[0.4503, 0.2337, 0.316],
 [0.4515, 0.2499, 0.2985],
 [0.4563, 0.2303, 0.3134],
 [0.4513, 0.244, 0.3047]]

**1.g. Copy the tensor for the logits into the answers file.**


In [210]:
[[ 0.4086, -0.2475, 0.0545],
 [ 0.4077, -0.1839, -0.0061],
 [ 0.4272, -0.2564, 0.0514],
 [ 0.4109, -0.2040, 0.0180]]

[[0.4086, -0.2475, 0.0545],
 [0.4077, -0.1839, -0.0061],
 [0.4272, -0.2564, 0.0514],
 [0.4109, -0.204, 0.018]]

Great. Next, do a calculation that *manually* verifies that the softmax calculation is correct. Specifically, recalculate the probability of the first class for the first example using np.exp(). (If you would rather use the expressions probs and logits than the numbers printed above, convert them to numpy with \<tensor\>.detach().numpy().)

In [211]:
# call your probability of the first class for the first example p_1_1

logits_np = logits.detach().numpy()  # convert once instead of on every access
p_1_1 = np.exp(logits_np[0, 0]) / (np.exp(logits_np[0, 0]) + np.exp(logits_np[0, 1]) + np.exp(logits_np[0, 2]))

p_1_1

np.float32(0.45032188)
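The same check can be written for a whole row at once. The sketch below is a small NumPy softmax applied to the first row of logits, copied from the printed output above; since those values are rounded to four decimals, the result agrees with the printed probabilities only to about that precision. The helper name `softmax` is an assumption for illustration.

```python
import numpy as np

# Logits of the first example, copied from the printed output above.
first_row_logits = np.array([0.4086, -0.2475, 0.0545])

def softmax(z):
    # Subtracting the max does not change the result but avoids overflow
    # when the logits are large.
    shifted = z - z.max()
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

first_row_probs = softmax(first_row_logits)
print(first_row_probs)  # approximately [0.4503, 0.2337, 0.3160]
```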

**1.h. Copy the value of `p_1_1` into the answers file. (Note that the first class has index 0.)**


Great. Now imagine that the correct classes are 0, 1, 2, 0 for the four examples. What is the average loss? To find out, we first define the loss function and then calculate the loss. Note that the inputs to the CrossEntropyLoss() function are i) the un-normalized logits and ii) either the class probabilities or the actual class indices (the better choice here, as each example belongs to exactly one class).

In [213]:
loss_fn = torch.nn.CrossEntropyLoss()

loss = loss_fn(logits, torch.tensor([0, 1, 2, 0]))
loss

tensor(1.0351, grad_fn=<NllLossBackward0>)

Now verify that this loss agrees with the manual calculation. Recall that

$$ CE \rightarrow -\frac{1}{N}\sum_k \log(q^k_{correct \ class}) $$

where k refers to the example, N is the number of examples, and

$$q^k_{correct \ class}$$

is the model probability for the correct class for example k.

In [214]:
# call your manual loss calculation manual_loss

manual_loss = 1/4 * (-np.log(0.4503) - np.log(0.2499) - np.log(0.3134) - np.log(0.4513))

manual_loss

np.float64(1.0351084035063098)
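The formula can also be checked programmatically instead of typing the four probabilities by hand. The sketch below picks the log-probability of the correct class for each example and compares the mean with PyTorch's built-in loss. It uses the printed (rounded) logits from above, so it matches the notebook's loss only to roughly three decimals; the variable names are assumptions.

```python
import torch
import torch.nn.functional as F

# Logits copied from the printed output above (rounded to four decimals).
printed_logits = torch.tensor([[0.4086, -0.2475,  0.0545],
                               [0.4077, -0.1839, -0.0061],
                               [0.4272, -0.2564,  0.0514],
                               [0.4109, -0.2040,  0.0180]])
targets = torch.tensor([0, 1, 2, 0])

# log q^k for every class, then select the entry of the correct class per row.
log_probs = F.log_softmax(printed_logits, dim=1)
picked = log_probs[torch.arange(len(targets)), targets]

manual = -picked.mean()                              # -(1/N) * sum_k log q^k_correct
builtin = F.cross_entropy(printed_logits, targets)   # same quantity, computed by PyTorch

print(manual.item(), builtin.item())
```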

**1.i. Write out the complete Cross-Entropy loss calculation as a single mathematical expression, substituting the values (the floating point numbers) you arrived at in your earlier work on this assignment. Your answer should be a single line showing the entire calculation with all necessary values inserted.**


Note: Do not perform any calculations or simplify the expression. Simply write out the formula with the appropriate numeric values inserted.

In [215]:
manual_loss = 1/4 * (-np.log(0.4503) - np.log(0.2499) - np.log(0.3134) - np.log(0.4513))

Great. Now we are ready to move to Language Models.

## 2. Basic GPT-2 Usage

We are now downloading GPT-2 from Hugging Face. We will get the Tokenizer and the model. We will make sure that it is on the proper device.

In [216]:
%%capture

gpt_2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
#gpt_2_tokenizer.add_special_tokens({'pad_token': '[PAD]'})

gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)

We can simply apply the tokenizer to a sentence to see how words are converted into word indices (which will be model inputs and, as first order of business for the model, be converted to word vectors). (The tokenizer is model-specific and various tokenizers have some special considerations/quirks. You should always take a look at how a specific tokenizer works. Consult the Hugging Face docs and try some examples.)

In [217]:
gpt_2_tokenizer("What a nice day", return_tensors='pt')

{'input_ids': tensor([[2061, 257, 3621, 1110]]), 'attention_mask': tensor([[1, 1, 1, 1]])}

Note the difference between the encoding of a word at the very beginning of a sentence and the encoding of the same word later in the sentence. (The attention masks become important if you have examples of varying length and padding tokens are used to make sure that model inputs are of the same length. The return_tensors option is used to get the tokenization into a format that is suitable for model input, if desired. In what follows, we omit the return_tensors option as we don't need it here.)

Below, consider the token id for 'I' in the three tokenizations:

In [218]:
print(gpt_2_tokenizer("I am")['input_ids'])
print(gpt_2_tokenizer("am I")['input_ids'])
print(gpt_2_tokenizer(" I")['input_ids'])

[40, 716]
[321, 314]
[314]


Decoding (i.e., turning input ids back into the corresponding string) is done with \<tokenizer>.decode().
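As a minimal sketch of the round trip, the snippet below encodes the earlier example sentence, decodes it again, and also lists the individual token strings. It assumes the `gpt2` tokenizer files can be downloaded or are already cached; the variable names are illustrative.

```python
from transformers import AutoTokenizer

# Downloads the tokenizer files on first use (network access assumed).
tokenizer = AutoTokenizer.from_pretrained("gpt2")

ids = tokenizer("What a nice day")["input_ids"]

# decode() joins the tokens back into a plain string.
decoded = tokenizer.decode(ids)

# convert_ids_to_tokens() shows the raw vocabulary entries instead;
# GPT-2 marks a leading space with the character 'Ġ'.
pieces = tokenizer.convert_ids_to_tokens(ids)

print(ids)
print(decoded)
print(pieces)
```

Looking at the raw pieces is a convenient way to see why a word receives a different id depending on whether a space precedes it.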

Please tokenize the longest word that Shakespeare used: 'honorificabilitudinitatibus' (when not at the beginning of a sentence), and find the first token (not the id, but the corresponding token string):

In [220]:
# Name your tokenization tokenized_long_word, the index of the first token first_index, and the first token first_token

tokenized_long_word = gpt_2_tokenizer(" honorificabilitudinitatibus")['input_ids']
first_index = tokenized_long_word[0]
first_token = gpt_2_tokenizer.decode(first_index)

print('Tokenized long word: ', tokenized_long_word)
print('Length of tokenized long word: ', len(tokenized_long_word))
print('First index: ', first_index)
print('First token: ', first_token)

Tokenized long word: [7522, 811, 397, 6392, 463, 15003, 265, 26333]
Length of tokenized long word: 8
First index: 7522
First token: honor


**2.a. Analyzing how many tokens the word honorificabilitudinitatibus is split into when not at the beginning of the sentence. (You can either create a sentence in which this word is not at the beginning, or make sure that there is a space in front of the word.)**


**2.b. Examining the first token of the tokenization**


Now redo the same, but imagine the word 'honorificabilitudinitatibus' at the very start of a sentence/doc (never mind the capitalization) as in
'honorificabilitudinitatibus is a state I am in'.

In [223]:
# Now name your tokenization beg_tokenized_long_word, the index of the first token beg_first_index, and the first token beg_first_token

beg_tokenized_long_word = gpt_2_tokenizer("honorificabilitudinitatibus")['input_ids']
beg_first_index = beg_tokenized_long_word[0]
beg_first_token = gpt_2_tokenizer.decode(beg_first_index)

print('Tokenized long word: ', beg_tokenized_long_word)
print('First index: ', beg_first_index)
print('First token string: ', beg_first_token)

Tokenized long word: [24130, 273, 811, 397, 6392, 463, 15003, 265, 26333]
First index: 24130
First token string: hon


**2.c. Analyzing how many tokens the word honorificabilitudinitatibus is now split into (when **at the beginning** of the sentence)**


**2.d. Examining now the first token string of the tokenization**


Now apply the gpt2_model to a sentence in order to predict the most likely next word after 'The movement started in Italy. From there it went to France and Switzerland. Soon it spread throughout'. You may want to consult the documentation.

Please get i) the shape of the output, ii) the values of the logits of the last token, iii) the index of the token with the largest logit, and iv) the token that belongs to it. Do the same for the token with the second-largest logit.

In [226]:
text = "The movement started in Italy. From there it went to France and Switzerland. Soon it spread throughout"
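# 'inputs' avoids shadowing the built-in input(); .to(device) keeps the inputs on the model's device
inputs = gpt_2_tokenizer(text, return_tensors='pt').to(device)

gpt_out = gpt2_model(**inputs)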

# Please call your model output shape, output_shape, and the logits for the last position last_logits,
# and the index for the token with the largest last_logit value max_logit_index, and the corresponding token max_logit_token.
# Name the corresponding values for the second largest logit second_logit_index, second_logit_token, and second_logit.

# Outputs
logits = gpt_out.logits
output_shape = logits.shape
output_logits = logits[0][-1]

#Max token
max_logit_index = torch.argmax(output_logits)
max_logit = torch.max(output_logits)
max_logit_token = gpt_2_tokenizer.decode(max_logit_index)

top_2_values, top_2_indices = torch.topk(output_logits, 2)
second_logit_index = top_2_indices[1]
second_logit = top_2_values[1]
second_logit_token = gpt_2_tokenizer.decode(second_logit_index)

print('Output shape: ', output_shape)
print('Logits of output for last token: ', output_logits)
print('Index of token with largest logit: ', max_logit_index)
print('Token with largest logit: ', max_logit_token)
print('Logit of token with largest logit: ', max_logit)
print('Index of token with second largest logit: ', second_logit_index)
print('Token with second largest logit: ', second_logit_token)
print('Logit of token with second largest logit: ', second_logit)

Output shape: torch.Size([1, 20, 50257])
Logits of output for last token: tensor([ -95.2277, -96.9647, -101.5158, ..., -102.7920, -104.4167,
 -97.2480], grad_fn=<SelectBackward0>)
Index of token with largest logit: tensor(262)
Token with largest logit: the
Logit of token with largest logit: tensor(-84.7964, grad_fn=<MaxBackward1>)
Index of token with second largest logit: tensor(2031)
Token with second largest logit: Europe
Logit of token with second largest logit: tensor(-84.9965, grad_fn=<SelectBackward0>)
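The indexing used above can be tried without loading GPT-2. The sketch below builds a random tensor with the same shape as the model output (an assumption for illustration; `dummy_logits` does not come from the model) and extracts the two largest entries at the last position with `torch.topk`.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only so the sketch is repeatable

batch_size, seq_len, vocab_size = 1, 20, 50257
dummy_logits = torch.randn(batch_size, seq_len, vocab_size)

# First (and only) sequence in the batch, last position: one score per vocabulary entry.
last_logits = dummy_logits[0, -1]

# topk returns values and indices sorted from largest to smallest.
top_values, top_indices = torch.topk(last_logits, k=2)

print(tuple(last_logits.shape))
print(top_indices.tolist(), top_values.tolist())
```

Only the last position matters for next-token prediction, because position i holds the model's scores for the token that follows position i.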


**2.e. Examining the shape of the output**


**2.f. What do the three numbers in the shape refer to?**


In [228]:
print("""
The output shape is the shape of the logits tensor.
The first number (1) is the batch size - we're processing one text sequence.
The second number (20) is the length of our input sequence in tokens.
The third number (50257) is the vocabulary size - GPT-2 produces logits for every possible token in its vocabulary at each position.
"""
)


The output shape is the shape of the logits tensor.
The first number (1) is the batch size - we're processing one text sequence.
The second number (20) is the length of our input sequence in tokens.
The third number (50257) is the vocabulary size - GPT-2 produces logits for every possible token in its vocabulary at each position.



**2.g. Examining the index of the word with the largest logit**


**2.h. Examining the token string associated with the largest logit**


**2.i. Examining the second most likely token id**


**2.j. Examining the second most likely word**


Now we will translate the logits into relative token probabilities depending on the chosen temperature. Use numpy or pytorch calculations. (But remember to use .detach() etc if you want to use numpy.)

In [233]:
T_1 = 1.
T_2 = 10.
T_3 = 0.1

# Please call your relative probabilities between the most likely token and the second most likely token p_t1, p_t2, p_t3, depending
# on the temperature

# Apply softmax with different temperatures to the last position logits
probs_t1 = torch.softmax(output_logits / T_1, dim=0)
probs_t2 = torch.softmax(output_logits / T_2, dim=0)
probs_t3 = torch.softmax(output_logits / T_3, dim=0)

# Ratio of the most likely token's probability to the second most likely token's probability
p_t1 = probs_t1[max_logit_index] / probs_t1[second_logit_index]
p_t2 = probs_t2[max_logit_index] / probs_t2[second_logit_index]
p_t3 = probs_t3[max_logit_index] / probs_t3[second_logit_index]

print('Probability ratio for T1: ', p_t1)
print('Probability ratio for T2: ', p_t2)
print('Probability ratio for T3: ', p_t3)

Probability ratio for T1: tensor(1.2215, grad_fn=<DivBackward0>)
Probability ratio for T2: tensor(1.0202, grad_fn=<DivBackward0>)
Probability ratio for T3: tensor(7.3922, grad_fn=<DivBackward0>)


**2.k. Examining the ratio of probabilities between the most likely token and the second most likely token if T=1**


**2.l. Examining the ratio of probabilities between the most likely token and the second most likely token if T=10**


**2.m. Examining the ratio of probabilities between the most likely token and the second most likely token if T=0.1 (Hint: to avoid a NaN, you may want to use a simple mathematical identity to deal with the low temperature: $e^a/e^b = e^{(a-b)}$)**
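The identity in the hint matters because the softmax normalizer cancels in a ratio of two probabilities, leaving $e^{(a-b)/T}$, which depends only on the difference of the two logits. Exponentiating each scaled logit separately underflows to zero at low temperature. The sketch below uses the two printed logits from above; since they are rounded to four decimals, the T=0.1 result differs from the notebook's value in the third decimal. The function name is an assumption.

```python
import math

top_logit = -84.7964      # printed logit of the most likely token
second_logit = -84.9965   # printed logit of the second most likely token

def probability_ratio(logit_a, logit_b, temperature):
    # p_a / p_b = exp(a/T) / exp(b/T) = exp((a - b) / T); the normalizer cancels.
    return math.exp((logit_a - logit_b) / temperature)

ratios = {t: probability_ratio(top_logit, second_logit, t) for t in (1.0, 10.0, 0.1)}
for temperature, ratio in ratios.items():
    print(temperature, round(ratio, 4))

# The naive route fails at T = 0.1: exp(-847.964) underflows to exactly 0.0,
# so dividing two such values would give 0/0.
naive_numerator = math.exp(top_logit / 0.1)
print(naive_numerator)
```

A higher temperature pushes the ratio toward 1 (a flatter distribution), while a lower temperature amplifies the gap between the top two tokens.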


And that is it! Congratulations!