<font size="6.2">NLP Transformer using GPT</font>  

In this notebook, an NLP Transformer using GPT is presented. Most of the material is adapted from [Sinan Ozdemir](https://learning.oreilly.com/videos/introduction-to-transformer/9780137923717/). See the [GitHub page](https://github.com/sinanuozdemir/oreilly-transformers-video-series/tree/main/notebooks).

# Introduction

GPT stands for **G**enerative **P**re-trained **T**ransformers:

* **Generative**: GPT is an auto-regressive language model. It predicts each token using only one side of the context, the past tokens.
 
* **Pre-trained**: decoders are trained on huge corpora of data.

* **Transformers**: The decoder is taken from the transformer architecture.
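
The auto-regressive idea can be sketched with a toy model. The lookup table below is invented for illustration and is not how GPT stores what it has learned; the only point is that each new token is predicted from tokens that were already generated, never from future ones.

```python
# Toy auto-regressive generation: each step only sees the past context.
# The lookup table is a made-up stand-in for a trained language model.
toy_next_token = {
    "My": "name",
    "name": "is",
    "is": "Mehdi",
    "Mehdi": "<|endoftext|>",
}

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # predict the next token from the last generated token only
        next_token = toy_next_token.get(tokens[-1], "<|endoftext|>")
        tokens.append(next_token)
        if next_token == "<|endoftext|>":
            break  # stop once the end-of-text token is produced
    return tokens

generated = generate(["My"])
print(generated)
```

A real GPT model conditions on the whole preceding sequence rather than only the last token, but the generation loop has the same shape.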

GPT refers to a family of models:

* GPT-1 released in 2018: 0.117B parameters
* GPT-2 released in 2019: 1.5B parameters
* GPT-3 released in 2020: 175B parameters

GPT is very similar to BERT. GPT-2 uses byte-level Byte-Pair Encoding (BPE) tokenization, which splits text into tokens taken from a vocabulary of just over 50,000 entries (50,257). The special token **<|endoftext|>** is added at the end of a text.

My name is Mehdi ==> ["My", "name", "is", "Mehdi","**<|endoftext|>**" ]

In [1]:
from transformers import GPT2Tokenizer
tokenizer_gpt = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer_gpt ("Hi there")['input_ids']

[17250, 612]

In [2]:
# A leading space produces a different token
tokenizer_gpt (" Hi there")['input_ids']

[15902, 612]

GPT's tokenizer treats a space as part of the token that follows it, so the same word is encoded differently with and without a leading space.
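
A minimal sketch of this behaviour, assuming a much simpler rule than the real tokenizer: every word keeps its leading space, and that space is displayed as `Ġ` (the same marker that shows up in the token listings later in this notebook). The function `toy_pretokenize` is hypothetical and skips the actual BPE step.

```python
import re

def toy_pretokenize(text):
    # Keep an optional leading space attached to each word,
    # then make that space visible as 'Ġ'
    pieces = re.findall(r" ?[^ ]+", text)
    return [piece.replace(" ", "Ġ") for piece in pieces]

no_space = toy_pretokenize("Hi there")
with_space = toy_pretokenize(" Hi there")
print(no_space, with_space)
```

Because `Hi` and `ĠHi` are different strings, they map to different vocabulary ids, while `Ġthere` is shared by both inputs.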

GPT has two types of embeddings for tokenized sentences:

**1. Word Token Embeddings (WTE)**
   
    a. Context-free (context-less) meaning of each token
    b. Over 50,000 possible vectors
    c. learnable during training
   
   
**2. Word Position Embedding (WPE)**

    a. To represent token's position in the sentence
    b. Learnable during training (`wpe` is an embedding table with one vector per position, up to 1024 positions)
   
These two are identical to BERT, but we have an additional embedding in <span class="mark">BERT for segment id (sentence A versus B) that we do not have for GPT.</span>
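
The way the two embeddings are combined can be sketched with plain Python lists. The sizes and random values below are toy assumptions; in the Hugging Face GPT-2 model both tables are `Embedding` layers (`wte` and `wpe`), and the input to the first decoder block is their element-wise sum.

```python
import random

random.seed(0)
vocab_size, max_positions, dim = 10, 8, 4  # toy sizes; GPT-2 small uses 50257, 1024, 768

# Two lookup tables: one row per token id, one row per position
wte = [[random.gauss(0, 0.02) for _ in range(dim)] for _ in range(vocab_size)]
wpe = [[random.gauss(0, 0.02) for _ in range(dim)] for _ in range(max_positions)]

def embed(token_ids):
    # input vector at position i = token embedding + position embedding
    return [
        [t + p for t, p in zip(wte[token_id], wpe[position])]
        for position, token_id in enumerate(token_ids)
    ]

same_token_twice = embed([3, 3])
print(same_token_twice)
```

The same token id at two different positions ends up with two different input vectors, which is how the model can tell word order apart.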
   

# GPT Family 

In [3]:
from transformers import pipeline, set_seed, GPT2Tokenizer, GPT2LMHeadModel
from torch import tensor, numel # counting number of parameters
from bertviz import model_view # visualize 

set_seed(32) # fix the random seed: GPT text generation samples tokens, so outputs vary between runs (unlike BERT)

GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data. Note that the `gpt2` checkpoint loaded below is the smallest member of the GPT-2 family, with about 124 million parameters; the 1.5 billion figure refers to the largest one (`gpt2-xl`).

In [4]:
generator = pipeline('text-generation', model='gpt2') # create a pipeline for text generation using GPT-2

Alternatively, we can use `model='openai-gpt'`, which is the original GPT model (GPT-1) released by OpenAI before GPT-2. It is an earlier version and is not as commonly used as GPT-2.

In [5]:
generator("Hello, I am a data scientist and I want to", max_length=30, num_return_sequences=3) # this is where randomness occurs

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Hello, I am a data scientist and I want to share some of my knowledge with you! As you know we are one of the most active community'},
 {'generated_text': 'Hello, I am a data scientist and I want to share some of my personal knowledge, techniques and solutions for building software that improves the efficiency of your'},
 {'generated_text': 'Hello, I am a data scientist and I want to share what I am currently working on. I have some ideas for next project in mind. Here'}]

These are not great text predictions.

In [6]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2') # GPT tokenizer

'Mehdi' in tokenizer.get_vocab()  # Mehdi is not in gpt vocabulary

False

GPT2 by default is cased (Mehdi is different than mehdi).

In [7]:
txt = 'Mehdi loves working out'

tokenizer.convert_ids_to_tokens(tokenizer.encode(txt))

['Me', 'h', 'di', 'Ġloves', 'Ġworking', 'Ġout']

In [8]:
tokenizer.encode(txt)

[5308, 71, 10989, 10408, 1762, 503]
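
How a name that is missing from the vocabulary can still be encoded may be easier to see with a toy version of Byte-Pair Encoding. The two merge rules below are hypothetical and chosen only to reproduce the split shown above; the real GPT-2 merge table is far larger, and its algorithm picks merges by rank rather than in one pass per rule.

```python
def apply_merges(symbols, merges):
    # Apply each merge rule in priority order,
    # joining adjacent pairs wherever they occur
    symbols = list(symbols)
    for left, right in merges:
        i = 0
        merged = []
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

hypothetical_merges = [("M", "e"), ("d", "i")]  # invented for this example
pieces = apply_merges("Mehdi", hypothetical_merges)
print(pieces)
```

Since the starting symbols are single characters (bytes, in the real tokenizer), any string can be encoded even if no merge applies to it.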

In [9]:
encoded = tokenizer.encode(txt, return_tensors='pt') # Pytorch format
encoded

tensor([[ 5308,    71, 10989, 10408,  1762,   503]])

In [10]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2') # Language modeling head

In [11]:
model

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

In [12]:
# zoom in on the token embedding
model.transformer.wte(encoded).shape # word token embedding (wte)

torch.Size([1, 6, 768])

Each of the 6 tokens is represented by a vector of length 768.

In [13]:
# Similar to BERT, we can pass in the positions, which gives an output with the same shape
model.transformer.wpe(tensor([0, 1, 2, 3, 4, 5]).reshape(1, 6)).shape  # word position embedding (wpe)

torch.Size([1, 6, 768])

In [14]:
# word token embeddings (wte) are added to the word position embeddings (wpe)

initial_input = model.transformer.wte(encoded) + model.transformer.wpe(tensor([0, 1, 2, 3, 4, 5]).reshape(1, 6))

initial_input.shape

torch.Size([1, 6, 768])

In [15]:
initial_input = model.transformer.drop(initial_input)
initial_input

tensor([[[ 0.0547, -0.4038,  0.2470,  ..., -0.1350, -0.1283,  0.1031],
         [ 0.0442, -0.0989,  0.0356,  ...,  0.0754, -0.1866,  0.1947],
         [ 0.0326, -0.0136,  0.1464,  ...,  0.1332,  0.2655,  0.0350],
         [-0.0761, -0.1581,  0.0896,  ..., -0.2130, -0.0214, -0.0970],
         [-0.0460,  0.0483,  0.1984,  ..., -0.0730,  0.1341, -0.0947],
         [-0.0388, -0.0780,  0.1656,  ...,  0.0859, -0.2295,  0.1717]]],
       grad_fn=<AddBackward0>)

In [16]:
model.lm_head

Linear(in_features=768, out_features=50257, bias=False)

In [17]:
# Passing the input through the decoder blocks one by one, then the final layer norm, should give the same result as the full model
for module in model.transformer.h:
    initial_input = module(initial_input)[0]
    
initial_input = model.transformer.ln_f(initial_input)

In [18]:
(initial_input == model(encoded, output_hidden_states=True).hidden_states[-1]).all()

tensor(True)

In [19]:
total_params = 0
for param in model.parameters():
    total_params += numel(param)
    
print(f'Number of params for GPT2 is: {total_params:,}')

Number of params for GPT2 is: 124,439,808
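
The same number can be reproduced by hand from the architecture printed earlier. The sketch below assumes the standard GPT-2 small layout: `c_attn` maps 768 to 3 x 768 (queries, keys, values), the MLP width is 4 x 768 because `n_inner` is unset, and `lm_head` shares its weights with `wte`, so it adds no parameters of its own.

```python
vocab_size, n_positions, n_embd, n_layer = 50257, 1024, 768, 12
n_inner = 4 * n_embd  # assumption: default MLP width when n_inner is unset

def linear(n_in, n_out):
    return n_in * n_out + n_out  # weights + bias

layer_norm = 2 * n_embd  # scale + shift

per_block = (
    layer_norm                      # ln_1
    + linear(n_embd, 3 * n_embd)    # attn.c_attn (queries, keys, values)
    + linear(n_embd, n_embd)        # attn.c_proj
    + layer_norm                    # ln_2
    + linear(n_embd, n_inner)       # mlp.c_fc
    + linear(n_inner, n_embd)       # mlp.c_proj
)

total = (
    vocab_size * n_embd             # wte (also reused by lm_head)
    + n_positions * n_embd          # wpe
    + n_layer * per_block           # 12 decoder blocks
    + layer_norm                    # ln_f
)
print(f"{total:,}")
```

Roughly a third of the parameters sit in the token embedding table, and the rest are spread evenly over the 12 decoder blocks.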


# Masked multi-headed attention

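Masked self-attention: the entries above the diagonal of the attention matrix (the upper triangle) are set to zero, so a token cannot attend to later tokens. Without this mask, the model could cheat by looking at the very tokens it is supposed to predict.

GPT predicts tokens one by one.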
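
A minimal sketch of the causal mask, using made-up scores for three tokens. The usual trick, assumed here, is to set the scores of future positions to minus infinity before the softmax, which turns their attention weights into exact zeros.

```python
import math

def causal_attention_weights(scores):
    # scores[i][j]: raw attention score of query token i for key token j
    n = len(scores)
    weights = []
    for i in range(n):
        # positions j > i are the future: give them -inf so softmax assigns them 0
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        row_max = max(masked)
        exps = [math.exp(s - row_max) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

toy_scores = [[0.5, 2.0, 1.0],
              [1.5, 0.3, 0.9],
              [0.2, 0.8, 0.4]]
weights = causal_attention_weights(toy_scores)
for row in weights:
    print([round(w, 2) for w in row])
```

The first token can only attend to itself, so its row is always `[1.0, 0.0, 0.0]`, and every row still sums to 1.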

In [20]:
import torch
import pandas as pd

In [21]:
phrase = 'Today is a beautiful day. but, It is going to rain!'
encoded_phrase = tokenizer(phrase, return_tensors='pt')

response = model(**encoded_phrase, output_attentions=True, output_hidden_states=True)

len(response.attentions)

12

In [22]:
# GPT does not have a sense of sentence A or B.
encoded_phrase

{'input_ids': tensor([[8888,  318,  257, 4950, 1110,   13,  475,   11,  632,  318, 1016,  284,
         6290,    0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [23]:
# access to attention mechanism
response.attentions[-1].shape  # From the final decoder

torch.Size([1, 12, 14, 14])

In this shape, the first item is the batch size (1), 12 is the number of attention heads in the final decoder block, and 14 is the number of tokens (each head has a 14 x 14 attention matrix).

In [24]:
encoded_phrase['input_ids'].shape

torch.Size([1, 14])

We can convert these ids back to tokens:

In [25]:
tokens = tokenizer.convert_ids_to_tokens(encoded_phrase['input_ids'][0])
tokens

['Today',
 'Ġis',
 'Ġa',
 'Ġbeautiful',
 'Ġday',
 '.',
 'Ġbut',
 ',',
 'ĠIt',
 'Ġis',
 'Ġgoing',
 'Ġto',
 'Ġrain',
 '!']


Let's take a look at layer index 6, head 0. Each row shows how a token distributes its attention over itself and the tokens before it; the zeros above the diagonal come from the causal mask.

In [26]:
arr = response.attentions[6][0][0]

n_digits = 2

attention_df = pd.DataFrame((torch.round(arr * 10**n_digits) / (10**n_digits)).detach()).applymap(float)

attention_df.columns = tokens
attention_df.index = tokens

attention_df

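| | Today | Ġis | Ġa | Ġbeautiful | Ġday | . | Ġbut | , | ĠIt | Ġis | Ġgoing | Ġto | Ġrain | ! |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Today | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Ġis | 0.93 | 0.07 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Ġa | 0.29 | 0.7 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Ġbeautiful | 0.22 | 0.55 | 0.2 | 0.03 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Ġday | 0.4 | 0.29 | 0.16 | 0.11 | 0.04 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| . | 0.37 | 0.35 | 0.04 | 0.03 | 0.19 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Ġbut | 0.63 | 0.17 | 0.01 | 0.03 | 0.07 | 0.06 | 0.03 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| , | 0.59 | 0.07 | 0.03 | 0.03 | 0.03 | 0.03 | 0.22 | 0.01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ĠIt | 0.31 | 0.2 | 0.02 | 0.03 | 0.19 | 0.13 | 0.06 | 0.03 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Ġis | 0.35 | 0.09 | 0.02 | 0.04 | 0.15 | 0.07 | 0.12 | 0.02 | 0.1 | 0.03 | 0.0 | 0.0 | 0.0 | 0.0 |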


`is` pays the most attention to `Today` (93%). Let's get a better visualization of every layer (12) and head (12).

In [27]:
tokens = tokenizer.convert_ids_to_tokens(encoded_phrase['input_ids'][0]) 
model_view(response.attentions, tokens)


In [28]:
response.hidden_states[-1].shape

torch.Size([1, 14, 768])

In [29]:
response.logits.shape

torch.Size([1, 14, 50257])

The logits are the final output of the language modeling head, which applies a linear layer to the hidden state of each of the 14 tokens and produces one score per vocabulary entry (50,257).
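
The step from hidden states to logits can be sketched with tiny made-up matrices. The three-token vocabulary and the weights below are assumptions for illustration; in the model above, `lm_head` is a bias-free linear layer from 768 to 50,257 dimensions.

```python
def lm_head(hidden, weight):
    # weight has one row per vocabulary entry;
    # each logit is the dot product of that row with the hidden vector
    return [sum(h * w for h, w in zip(hidden, row)) for row in weight]

toy_vocab = ["Ġday", "Ġrain", "Ġthe"]
toy_weight = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.4]]   # shape (vocab_size, hidden_size)
toy_hidden_states = [[1.0, 0.0], [0.0, 1.0]]        # shape (sequence_length, hidden_size)

logits = [lm_head(h, toy_weight) for h in toy_hidden_states]   # (sequence_length, vocab_size)
predicted = [toy_vocab[row.index(max(row))] for row in logits]
print(predicted)
```

Taking the argmax over the vocabulary axis at every position gives, for each input token, the model's most likely next token.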

In [30]:
pd.DataFrame(
    zip(tokens, tokenizer.convert_ids_to_tokens(response.logits.argmax(2)[0])), 
    columns=['Sequence', 'Next predicted token with highest probability']
)

| | Sequence | Next predicted token with highest probability |
|---|---|---|
| 0 | Today | , |
| 1 | Ġis | Ġthe |
| 2 | Ġa | Ġtime |
| 3 | Ġbeautiful | Ġday |
| 4 | Ġday | Ġfor |
| 5 | . | Ċ |
| 6 | Ġbut | ĠI |
| 7 | , | ĠI |
| 8 | ĠIt | Ġis |
| 9 | Ġis | Ġnot |

In [31]:
generator(phrase, max_length=40, num_return_sequences=1, do_sample=False)  # greedy search

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Today is a beautiful day. but, It is going to rain!\n\nI am going to be in the hospital for a few days. I am going to be in the hospital for a few'}]

In [32]:
generator(phrase, max_length=40, num_return_sequences=1, do_sample=True)  # sampling from the predicted distribution instead of greedy search

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Today is a beautiful day. but, It is going to rain! because You want me to leave you alone, but I think the rain is better than it is right now! I hope that you'}]
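
The difference between the two calls above can be sketched for a single generation step. The vocabulary and probabilities below are invented; greedy search always takes the most likely token, while sampling draws a token in proportion to its probability.

```python
import random

toy_vocab = ["ĠI", "Ġbecause", "Ċ", "ĠThe"]
toy_probs = [0.40, 0.30, 0.20, 0.10]   # invented next-token probabilities

def greedy_pick(vocab, probs):
    # always return the single most likely token
    return vocab[probs.index(max(probs))]

def sample_pick(vocab, probs, rng):
    # draw one token at random, weighted by its probability
    return rng.choices(vocab, weights=probs, k=1)[0]

rng = random.Random(0)
greedy_runs = {greedy_pick(toy_vocab, toy_probs) for _ in range(20)}
sampled_runs = {sample_pick(toy_vocab, toy_probs, rng) for _ in range(200)}
print(greedy_runs, sampled_runs)
```

Greedy search is deterministic, which is also why it tends to fall into loops such as the repeated hospital sentence above, whereas sampling produces varied continuations.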

# Pre-training GPT

Similar to BERT, GPT needs to be pre-trained as a language model. However, masked language modeling does not make sense because <span class="burk">GPT is an auto-regressive model, not an auto-encoding one (unlike BERT)</span>. Next sentence prediction does not make sense either, because we are not trying to understand sequences as a whole; we only do auto-regressive language modeling, so a different type of pre-training should be applied.

GPT-2 was pre-trained on a corpus called WebText, about 40 gigabytes of text collected from the Internet. The WebText dataset was built from web pages linked from Reddit, so it covers a wide range of sources such as websites and articles. It contains a diverse set of topics and writing styles, allowing the model to learn patterns and information from various domains.

The pre-training process involves training the GPT model to predict the next word in a sequence of words given some context. This is a form of unsupervised (self-supervised) learning, as the model does not require specific labels or annotations during training: the next word in the text is the label. By training on a large amount of text data, GPT learns the statistical patterns and relationships between words, enabling it to generate coherent and contextually appropriate text.

During pre-training, GPT uses a variant of the Transformer architecture, which includes a stack of self-attention and feed-forward layers. The model processes input text in chunks or "tokens," which can represent individual words, subwords, or characters, depending on the chosen tokenization scheme. The self-attention mechanism allows the model to capture dependencies between different words in a sequence, while the feed-forward layers perform non-linear transformations to generate meaningful representations.

In [33]:
from transformers import pipeline, set_seed
from torch import tensor

generator = pipeline('text-generation', model='gpt2', tokenizer=tokenizer)
set_seed(0)

In [34]:
# Bias
generator("Muslim man work during the day as a", max_length=15, num_return_sequences=4, temperature=0.8)
# temperature: Reducing temperature makes it less random 

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Muslim man work during the day as a security guard. He was ordered to'},
 {'generated_text': 'Muslim man work during the day as a janitor at a hospital after his'},
 {'generated_text': 'Muslim man work during the day as a contractor. The police were looking for'},
 {'generated_text': 'Muslim man work during the day as a police officer on the highway in T'}]

In [35]:
# Bias
generator("White man work during the day as a", max_length=15, num_return_sequences=4, temperature=0.8)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'White man work during the day as a firefighter\n\nThe New Yorker says'},
 {'generated_text': 'White man work during the day as a janitor at a nearby gas station'},
 {'generated_text': 'White man work during the day as a part of the team.\n\n'},
 {'generated_text': "White man work during the day as a street vendor, and he's a"}]

In [36]:
# Bias
generator("The earth would be beautiful without", max_length=15, num_return_sequences=5, temperature=0.5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The earth would be beautiful without the sun.\n\nI know that.'},
 {'generated_text': 'The earth would be beautiful without it.\n\n"And I am a'},
 {'generated_text': 'The earth would be beautiful without you. But you have no idea what you'},
 {'generated_text': 'The earth would be beautiful without those mountains.\n\n"I\'m not'},
 {'generated_text': 'The earth would be beautiful without it.\n\nIf we want to live'}]

The examples above show that we should be aware of bias when using auto-regressive models, since the pre-training corpora contain bias. We should be careful not to transfer these biases to our downstream tasks and decision making.

# Few-shot learning

* **Zero-shot Learning**: task description is given to the model but without any prior example

* **One-shot Learning**: task description is given to the model with one single prior example

* **Few-shot Learning**: task description is given to the model with as many prior examples as we can fit into the context window of the model. GPT-2 has a context window of 1024 tokens

**Sentiment analysis, question answering, and translation** are not part of the pre-training task for GPT-2; it is only trained on the auto-regressive task of predicting the next token. GPT-2 does not know how to do these tasks explicitly, but it can figure them out implicitly **through a few examples.**
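
The prompts used in the cells below all follow the same layout: a task title, solved examples separated by `###`, and an unsolved query at the end. A small helper can assemble them; `build_few_shot_prompt` is a hypothetical convenience function, not part of any library.

```python
def build_few_shot_prompt(task, examples, query, input_label="Text", output_label="Sentiment"):
    # one block per solved example, then an open block for the model to complete
    blocks = [f"{input_label}: {text}\n{output_label}: {label}" for text, label in examples]
    blocks.append(f"{input_label}: {query}\n{output_label}:")
    return task + "\n" + "\n###\n".join(blocks)

prompt = build_few_shot_prompt(
    "Sentiment Analysis",
    [("I hate it when my laptop crashes.", "Negative"),
     ("My day has been awesome!", "Positive")],
    "I am a couch potato",
)
zero_shot_prompt = build_few_shot_prompt("Sentiment Analysis", [], "the food was great")
print(prompt)
```

Passing an empty list of examples gives a zero-shot prompt, and a single example gives a one-shot prompt.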

## Sentiment Analysis

In [37]:
print(generator("""Sentiment Analysis
Text: I hate it when my laptop crashes.
Sentiment: Negative
###
Text: My day has been awesome!
Sentiment: Positive
###
Text: I am a couch potato
Sentiment:""", top_k=2, temperature=0.1, max_length=55)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sentiment Analysis
Text: I hate it when my laptop crashes.
Sentiment: Negative
###
Text: My day has been awesome!
Sentiment: Positive
###
Text: I am a couch potato
Sentiment: Negative
###
Text:


**top_k:** The top_k parameter, also known as the "top-k sampling" strategy, is used to limit the number of words considered during the sampling process. When generating text, the model assigns probabilities to each possible next word. By setting the top_k value, the model only considers the k most likely words based on their probabilities. This helps to ensure that the generated text is more focused and coherent.


For example, if top_k is set to 10, the model will only consider the 10 tokens with the highest probabilities at each generation step, discarding the rest. The probabilities of the remaining tokens are then renormalized before sampling.


**temperature:** The temperature parameter controls the randomness of the generated text. The logits are divided by the temperature before the softmax, which reshapes the probability distribution used for sampling. A temperature of 1.0 leaves the distribution unchanged, and higher values flatten it, which increases the randomness and diversity of the output. This means that less probable words have a higher chance of being selected, leading to more creative but potentially less coherent or sensible output.

On the other hand, a lower temperature value, such as 0.5, reduces randomness and makes the model more focused and deterministic. In this case, the most probable words have a higher chance of being selected, resulting in more predictable and conservative output.
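
Both parameters can be sketched in a few lines of plain Python. This is a simplified illustration on made-up logits, not the library's implementation: temperature divides the logits before the softmax, and top-k removes every token outside the k highest-scoring ones.

```python
import math

def next_token_distribution(logits, temperature=1.0, top_k=None):
    # 1) temperature: divide logits before softmax (<1 sharpens, >1 flattens)
    scaled = [x / temperature for x in logits]
    # 2) top-k: keep only the k largest logits (ties at the cutoff are all kept)
    if top_k is not None:
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [x if x >= cutoff else -math.inf for x in scaled]
    # 3) softmax over what is left
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5, -1.0]
plain = next_token_distribution(toy_logits)
cold = next_token_distribution(toy_logits, temperature=0.1)
top2 = next_token_distribution(toy_logits, top_k=2)
print([round(p, 3) for p in plain])
print([round(p, 3) for p in cold])
print([round(p, 3) for p in top2])
```

With `temperature=0.1`, as in the sentiment example above, almost all probability mass moves to the top token, so sampling behaves nearly like greedy search.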

## Question/Answering

In [38]:
print(generator("""Question/Answering
C: Google was founded in 1998 by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University in California. Together they own about 14 percent of its shares and control 56 percent of the stockholder voting power through supervoting stock.
Q: When was Google founded?
A: 1998
###
C: Hugging Face is a company whi"ch develops social AI-run chatbot applications. It was established in 2016 by Clement Delangue and Julien Chaumond. The company is based in Brooklyn, New York, United States.
Q: What does Hugging Face develop?
A: social AI-run chatbot applications
###
C: The New York Jets are a professional American football team based in the New York metropolitan area. The Jets compete in the National Football League (NFL) as a member club of the league's American Football Conference (AFC) East division.
Q: What division do the Jets play in?
A:""", top_k=2, max_length=215, temperature=0.5)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Question/Answering
C: Google was founded in 1998 by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University in California. Together they own about 14 percent of its shares and control 56 percent of the stockholder voting power through supervoting stock.
Q: When was Google founded?
A: 1998
###
C: Hugging Face is a company whi"ch develops social AI-run chatbot applications. It was established in 2016 by Clement Delangue and Julien Chaumond. The company is based in Brooklyn, New York, United States.
Q: What does Hugging Face develop?
A: social AI-run chatbot applications
###
C: The New York Jets are a professional American football team based in the New York metropolitan area. The Jets compete in the National Football League (NFL) as a member club of the league's American Football Conference (AFC) East division.
Q: What division do the Jets play in?
A: The AFC East
Q


## Zero Shot Learning

Zero-shot does not work as well for sentiment analysis.

In [39]:
print(generator("""Sentiment Analysis
Text: the food was great
Sentiment:""", top_k=2, temperature=0.1, max_length=55)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sentiment Analysis
Text: the food was great
Sentiment: the food was great
Sentiment: the food was great
Sentiment: the food was great
Sentiment: the food was great
Sentiment: the food was great
Sentiment: the


Zero-shot learning works better for Question/Answering task.

In [40]:
print(generator(
    '''Question/Answering
C: The New York Jets are a professional American football team based in the New York metropolitan area. 
The Jets compete in the National Football League (NFL) as a member club of the league's American Football 
Conference (AFC) East division.
Q: What division do the Jets play in?
A:''',
    top_k=2, max_length=80, temperature=0.5)[0]['generated_text']
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Question/Answering
C: The New York Jets are a professional American football team based in the New York metropolitan area. 
The Jets compete in the National Football League (NFL) as a member club of the league's American Football 
Conference (AFC) East division.
Q: What division do the Jets play in?
A: The Jets play in the AFC


### Summarize Text (TL;DR)

Zero-shot can be used as a summarization approach.

In [41]:
to_summarize = """The exploitation of hydrocarbon reservoirs may potentially lead to contamination of soils, shallow water resources, and greenhouse gas emissions. Fluids such as methane or CO2 may in some cases migrate toward the groundwater zone and atmosphere through and along imperfectly sealed hydrocarbon wells. Field tests in hydrocarbon-producing regions are routinely conducted for detecting serious leakage to prevent environmental pollution. The challenge is that testing is costly, time-consuming, and sometimes labor-intensive. In this study, machine learning approaches were applied to predict serious leakage with uncertainty quantification for wells that have not been field tested in Alberta, Canada. An improved imputation technique was developed by Cholesky factorization of the covariance matrix between features, where missing data are imputed via conditioning of available values. The uncertainty in imputed values was quantified and incorporated into the final prediction to improve decision-making. Next, a wide range of predictive algorithms and various performance metrics were considered to achieve the most reliable classifier. However, a highly skewed distribution of field tests toward the negative class (nonserious leakage) forces predictive models to unrealistically underestimate the minority class (serious leakage). To address this issue, a combination of oversampling, undersampling, and ensemble learning was applied. By investigating all the models on never-before-seen data, an optimum classifier with minimal false negative prediction was determined. The developed methodology can be applied to identify the wells with the highest likelihood for serious fluid leakage within producing fields. This information is of key importance for optimizing field test operations to achieve economic and environmental benefits."""

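<span class="mark">TL;DR:</span> **too long; didn't read**. This is not an algorithm but a prompt cue: on the web, "TL;DR:" is commonly followed by a short summary, so appending it to a text nudges the model to summarize. This works in a zero-shot setting, without any examples.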

In [42]:
print(generator(
    f"""Summarization Task:\n{to_summarize}\n\nTL;DR:""", 
    max_length=512, top_k=2, temperature=0.5, no_repeat_ngram_size=3)[0]['generated_text'])

# no_repeat_ngram_size=3: no sequence of 3 tokens may appear twice, which stops the model from repeating itself over and over

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Summarization Task:
The exploitation of hydrocarbon reservoirs may potentially lead to contamination of soils, shallow water resources, and greenhouse gas emissions. Fluids such as methane or CO2 may in some cases migrate toward the groundwater zone and atmosphere through and along imperfectly sealed hydrocarbon wells. Field tests in hydrocarbon-producing regions are routinely conducted for detecting serious leakage to prevent environmental pollution. The challenge is that testing is costly, time-consuming, and sometimes labor-intensive. In this study, machine learning approaches were applied to predict serious leakage with uncertainty quantification for wells that have not been field tested in Alberta, Canada. An improved imputation technique was developed by Cholesky factorization of the covariance matrix between features, where missing data are imputed via conditioning of available values. The uncertainty in imputed values was quantified and incorporated into the final prediction to
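
The effect of `no_repeat_ngram_size` can be sketched as a rule that bans certain next tokens. The function below is a simplified illustration working on words instead of token ids: given what has been generated so far, it returns every token that would complete a 3-gram already present in the text.

```python
def banned_next_tokens(generated, ngram_size=3):
    # Tokens that would complete an n-gram already present in `generated`
    if len(generated) < ngram_size:
        return set()
    prefix = tuple(generated[-(ngram_size - 1):])   # last n-1 tokens
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        if tuple(generated[i:i + ngram_size - 1]) == prefix:
            banned.add(generated[i + ngram_size - 1])
    return banned

history = ["the", "food", "was", "great", "the", "food"]
banned = banned_next_tokens(history)
print(banned)
```

During generation, the probability of each banned token is set to zero before the next token is chosen, which would have prevented the looping output seen in the zero-shot sentiment example.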

# Style Completion by GPT

In this section, the GPT-2 model will be fine-tuned on our own data.

In [43]:
from transformers import GPT2Tokenizer, TextDataset, DataCollatorForLanguageModeling, GPT2LMHeadModel, pipeline, \
                         Trainer, TrainingArguments

In [44]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

In [45]:
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token  # Set padding token to end of sequence token

# Example usage
inputs = ["This is an example input", "Another input"]
encoded_inputs = tokenizer(inputs, padding=True, truncation=True, return_tensors="pt")

# Print encoded inputs
print(encoded_inputs)

{'input_ids': tensor([[ 1212,   318,   281,  1672,  5128],
        [ 6610,  5128, 50256, 50256, 50256]]), 'attention_mask': tensor([[1, 1, 1, 1, 1],
        [1, 1, 0, 0, 0]])}


GPT-2 is fine-tuned on the paper [Machine learning approaches for the prediction of serious fluid leakage from hydrocarbon wells](https://www.cambridge.org/core/journals/data-centric-engineering/article/machine-learning-approaches-for-the-prediction-of-serious-fluid-leakage-from-hydrocarbon-wells/FF12553D3D90A3A0CB5CD302034DD741). The aim is to have the model read through the paper multiple times and then complete text in its style.

In [46]:
pds_data = TextDataset(
    tokenizer=tokenizer,
    file_path='./data/paper.txt',  # text file for the paper written by Mehdi Rezvandehy
    block_size=32  # length of each chunk of text to use as a datapoint
)



Let's take a look at the tokens of the first example:

In [47]:
pds_data[0], pds_data[0].shape  # inspect the first point

(tensor([  220,   220,   220,   220,   220,   220,   220,   220,   220,   220,
           220,  6060,    12, 19085,  1173, 14044,   357,  1238,  1954,   828,
           604,    25,   304,  1065,   198,   220,   220,   220,   220,   220,
           220,   220]),
 torch.Size([32]))
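
What `block_size=32` does can be sketched as follows. This assumes, as I understand `TextDataset`, that the tokenized file is cut into consecutive, non-overlapping windows and that a trailing remainder shorter than the block size is dropped; `chunk_into_blocks` is a hypothetical stand-in for that behaviour.

```python
def chunk_into_blocks(token_ids, block_size=32):
    # Non-overlapping windows; a trailing remainder shorter than block_size is dropped
    return [
        token_ids[i:i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]

toy_ids = list(range(100))          # stand-in for the tokenized paper
blocks = chunk_into_blocks(toy_ids)
print(len(blocks), [len(block) for block in blocks])
```

Each block becomes one training example, so a smaller block size gives more but shorter examples from the same text.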

Decode the token ids to see the underlying text:

In [48]:
print(tokenizer.decode(pds_data[0]))

            Data-Centric Engineering (2023), 4: e12
       


Once we have our dataset, we need a data collator (`DataCollatorForLanguageModeling`):

In [49]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False,  # MLM is Masked Language Modelling
)

Look at what collator is doing:

In [50]:
collator_example = data_collator([tokenizer('Can I have an input'), tokenizer('yes yo can')])
collator_example

{'input_ids': tensor([[ 6090,   314,   423,   281,  5128],
        [ 8505, 27406,   460, 50256, 50256]]), 'attention_mask': tensor([[1, 1, 1, 1, 1],
        [1, 1, 1, 0, 0]]), 'labels': tensor([[ 6090,   314,   423,   281,  5128],
        [ 8505, 27406,   460,  -100,  -100]])}

In [51]:
collator_example.input_ids

tensor([[ 6090,   314,   423,   281,  5128],
        [ 8505, 27406,   460, 50256, 50256]])

In [52]:
collator_example.input_ids  # 50256 is our pad token id

tensor([[ 6090,   314,   423,   281,  5128],
        [ 8505, 27406,   460, 50256, 50256]])

In [53]:
tokenizer.pad_token_id

50256

In [54]:
collator_example.attention_mask  # Note the 0 in the attention mask where we have a pad token

tensor([[1, 1, 1, 1, 1],
        [1, 1, 1, 0, 0]])

In [55]:
collator_example.labels  # note the -100 to ignore loss calculation for the padded token
# Reminder that labels are shifted *inside* the GPT model so we don't need to worry about that

tensor([[ 6090,   314,   423,   281,  5128],
        [ 8505, 27406,   460,  -100,  -100]])
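
How the `-100` labels and the internal shift interact can be sketched with a tiny loss function. The three-token vocabulary and the uniform predictions are toy assumptions; the point is that position `i` is scored against the token at position `i + 1`, and targets equal to `-100` are skipped.

```python
import math

IGNORE_INDEX = -100

def causal_lm_loss(probs, labels):
    # probs[i][v]: predicted probability that the token after position i is v
    # labels: input ids with padded positions replaced by -100
    losses = []
    for i in range(len(labels) - 1):
        target = labels[i + 1]          # the shift: position i is scored against token i+1
        if target == IGNORE_INDEX:
            continue                    # padded targets contribute nothing
        losses.append(-math.log(probs[i][target]))
    return sum(losses) / len(losses)

toy_labels = [2, 0, 1, IGNORE_INDEX, IGNORE_INDEX]
uniform = [[1 / 3] * 3 for _ in toy_labels]      # a model that knows nothing (3-token vocabulary)
loss = causal_lm_loss(uniform, toy_labels)
print(round(loss, 4))
```

Only two of the five positions contribute to the loss here: the first token has no predecessor to predict it, and the two padded targets are ignored.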

In [56]:
model = GPT2LMHeadModel.from_pretrained('gpt2')  # load up a GPT2 model

Start generating a pipeline:

In [57]:
pretrained_generator = pipeline(
    'text-generation', model=model, tokenizer='gpt2',
    config={'max_length': 200, 'do_sample': True, 'top_p': 0.9, 'temperature': 0.7, 'top_k': 10}
)
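# 'do_sample': sample from the predicted distribution instead of always taking the most likely token
# 'top_p' and 'top_k': restrict sampling to the most likely tokens, which makes the output less random
# 'temperature': lower value makes it more consistent
# Note: generation settings given as a plain dict through `config` may not take effect;
# passing them as keyword arguments when calling the pipeline is a safer route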

The example below is text completion without fine-tuning the model:

In [58]:
print('----------')
for generated_sequence in pretrained_generator('The Dummy classifier has the lowest values for all metrics except for', num_return_sequences=3):
    print(generated_sequence['generated_text'])
    print('----------')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


----------




The Dummy classifier has the lowest values for all metrics except for Dummy_Value_Size - which is the size of the Dummy and the amount of data from it that could be stored in the Dummy_Data. Dummy_
----------
The Dummy classifier has the lowest values for all metrics except for the value defined on the initializer.

Since the metrics that can be used in this classifier are not unique as shown on the screenshot above, the metrics for each of
----------
The Dummy classifier has the lowest values for all metrics except for log of log_size (and possibly size-independent time series), or for the amount of data on the dataset, and how many objects the algorithm can collect at once and how
----------


The main aim is to fine-tune the model so that, after reading the paper several times, it completes text in the paper's style.

In [59]:
training_args = TrainingArguments(
    output_dir="./gpt2_pds", #The output directory
    overwrite_output_dir=True, #overwrite the content of the output directory
    num_train_epochs=3, # number of training epochs, which means reading the paper three times
    per_device_train_batch_size=32, # batch size for training
    per_device_eval_batch_size=32,  # batch size for evaluation
    warmup_steps=len(pds_data.examples) // 5, # number of warmup steps for learning rate scheduler,
    logging_steps=50,
    load_best_model_at_end=True,
    evaluation_strategy='epoch',
    save_strategy='epoch'
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=pds_data.examples[:int(len(pds_data.examples)*.8)],  # Training set as first 80%
    eval_dataset=pds_data.examples[int(len(pds_data.examples)*.8):]   # Test set as second 20%
)

trainer.evaluate()

***** Running Evaluation *****
  Num examples = 148
  Batch size = 32


{'eval_loss': 3.9107463359832764,
 'eval_runtime': 15.5811,
 'eval_samples_per_second': 9.499,
 'eval_steps_per_second': 0.321}

In [60]:
trainer.train()

***** Running training *****
  Num examples = 588
  Num Epochs = 3
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 57
  Number of trainable parameters = 124439808


Epoch,Training Loss,Validation Loss
1,No log,3.657435
2,No log,3.249712
3,3.213400,2.98168


***** Running Evaluation *****
  Num examples = 148
  Batch size = 32
Saving model checkpoint to ./gpt2_pds\checkpoint-19
Configuration saved in ./gpt2_pds\checkpoint-19\config.json
Configuration saved in ./gpt2_pds\checkpoint-19\generation_config.json
Model weights saved in ./gpt2_pds\checkpoint-19\pytorch_model.bin
***** Running Evaluation *****
  Num examples = 148
  Batch size = 32
Saving model checkpoint to ./gpt2_pds\checkpoint-38
Configuration saved in ./gpt2_pds\checkpoint-38\config.json
Configuration saved in ./gpt2_pds\checkpoint-38\generation_config.json
Model weights saved in ./gpt2_pds\checkpoint-38\pytorch_model.bin
***** Running Evaluation *****
  Num examples = 148
  Batch size = 32
Saving model checkpoint to ./gpt2_pds\checkpoint-57
Configuration saved in ./gpt2_pds\checkpoint-57\config.json
Configuration saved in ./gpt2_pds\checkpoint-57\generation_config.json
Model weights saved in ./gpt2_pds\checkpoint-57\pytorch_model.bin


Training completed. Do not forget to shar

TrainOutput(global_step=57, training_loss=3.1360290928890833, metrics={'train_runtime': 619.5192, 'train_samples_per_second': 2.847, 'train_steps_per_second': 0.092, 'total_flos': 28807446528000.0, 'train_loss': 3.1360290928890833, 'epoch': 3.0})
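
The numbers in this log follow from the data set size and batch size. The sketch below (hypothetical helper names, assuming no gradient accumulation and that the last partial batch counts as a step) reproduces the 57 optimization steps and the checkpoint numbers, and shows why the first two epochs report "No log" for the training loss.

```python
import math


def optimization_steps(num_examples, batch_size, num_epochs):
    """Steps per epoch (a partial last batch counts as a step) and total steps."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * num_epochs


def logged_steps(total_steps, logging_steps):
    """Global steps at which a training loss is written to the log."""
    return list(range(logging_steps, total_steps + 1, logging_steps))


steps_per_epoch, total_steps = optimization_steps(588, 32, 3)
epoch_ends = [steps_per_epoch * epoch for epoch in range(1, 4)]

print(steps_per_epoch, total_steps)   # 19 57
print(epoch_ends)                     # [19, 38, 57]
print(logged_steps(total_steps, 50))  # [50]
```

With `logging_steps=50`, the only logged training loss falls at step 50, inside the third epoch, so epochs 1 and 2 (ending at steps 19 and 38) have no training loss to show.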

In [61]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 148
  Batch size = 32


{'eval_loss': 2.981679916381836,
 'eval_runtime': 15.9176,
 'eval_samples_per_second': 9.298,
 'eval_steps_per_second': 0.314,
 'epoch': 3.0}
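
A loss value is easier to interpret as perplexity. The sketch below assumes that `eval_loss` is the mean per-token cross-entropy in nats, which is what a causal language-model head normally reports; the function name is hypothetical.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(mean_cross_entropy)


before = perplexity(3.9107463359832764)  # eval_loss before fine-tuning
after = perplexity(2.981679916381836)    # eval_loss after 3 epochs

print(f"perplexity before: {before:.1f}, after: {after:.1f}")
```

Under that assumption the perplexity on the held-out part of the paper falls from roughly 50 to roughly 20.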

In [62]:
trainer.save_model()

Saving model checkpoint to ./gpt2_pds
Configuration saved in ./gpt2_pds\config.json
Configuration saved in ./gpt2_pds\generation_config.json
Model weights saved in ./gpt2_pds\pytorch_model.bin


In [63]:
loaded_model = GPT2LMHeadModel.from_pretrained('./gpt2_pds')

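finetuned_generator = pipeline(
    'text-generation', model=loaded_model, tokenizer=tokenizer,
    # Note: the generation log further down still reports max_length=50, so these settings
    # do not appear to take effect here; passing them as keyword arguments when calling
    # the generator is the more reliable route.
    config={'max_length': 200,  'do_sample': True, 'top_p': 0.9, 'temperature': 0.7, 'top_k': 10})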


loading configuration file ./gpt2_pds\config.json
Model config GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "do_sample": true,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "max_length": 50,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "torch_dtype": "float32",
  "transformers_version": "4.26.1",
  "use_cache": true,
  

The sentence completions below are generated after the model has read the paper 3 times (3 epochs), fine-tuning the parameters of GPT:

In [64]:
print('----------')
for generated_sequence in finetuned_generator('The Dummy classifier has the lowest values for all metrics except for', num_return_sequences=3):
    print(generated_sequence['generated_text'])
    print('----------')

Generate config GenerationConfig {
  "bos_token_id": 50256,
  "do_sample": true,
  "eos_token_id": 50256,
  "max_length": 50,
  "transformers_version": "4.26.1"
}

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


----------
The Dummy classifier has the lowest values for all metrics except for the average.

We can visualize the plot by running the following program (

[ Figure 7c ] )

R 3.       
----------
The Dummy classifier has the lowest values for all metrics except for:
               for the type of validation that is generated for each metric. For a comparison analysis, the Dummy
----------
The Dummy classifier has the lowest values for all metrics except for training
                           it had during the pre-trained set
----------


Text completion after reading the paper is not very good, probably because the paper's text file was not cleaned. We could probably have achieved better results by training for more epochs.
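
What "cleaning" could mean for text extracted from a paper is sketched below. The heuristics (re-joining hyphenated words, dropping bare page numbers, collapsing whitespace) and the function name are illustrative assumptions, not the preprocessing used in this notebook.

```python
import re


def clean_extracted_text(raw):
    """Light cleanup for text extracted from a paper (hypothetical heuristics)."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)            # re-join words hyphenated across lines
    lines = [line.strip() for line in text.splitlines()]
    lines = [line for line in lines if not re.fullmatch(r"\d+", line)]  # drop bare page numbers
    text = "\n".join(lines)
    text = re.sub(r"[ \t]+", " ", text)                    # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)                 # keep at most one blank line
    return text.strip()


sample = "The Dummy classi-\nfier has   the lowest values.\n\n\n\n12\nNext paragraph."
print(clean_extracted_text(sample))
```

The cleaned text would then be written back to a file and passed to the data set loader in place of the raw extraction.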

# Code Dictation by GPT

In [1]:
from transformers import GPT2Tokenizer, DataCollatorForLanguageModeling, TrainingArguments, Trainer, \
                         GPT2LMHeadModel, pipeline
from datasets import Dataset
import pandas as pd

The English-to-LaTeX equation pairs are retrieved from [Sinan Ozdemir](https://learning.oreilly.com/videos/introduction-to-transformer/9780137923717/). We want to fine-tune a GPT model to do the LaTeX conversion for us.

In [2]:
data = pd.read_csv('./data/english_to_latex.csv')

print(data.shape)

data.head(5)

(50, 2)


|  | English | LaTeX |
|---|---|---|
| 0 | integral from a to b of x squared | `\int_{a}^{b} x^2 \,dx` |
| 1 | integral from negative 1 to 1 of x squared | `\int_{-1}^{1} x^2 \,dx` |
| 2 | integral from negative 1 to infinity of x cubed | `\int_{-1}^{\inf} x^3 \,dx` |
| 3 | integral from 0 to infinity of x squared | `\int_{0}^{\inf} x^2 \,dx` |
| 4 | integral from 0 to infinity of y squared | `\int_{0}^{\inf} y^2 \,dy` |


In [3]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Set up pad token to be the same as end of sentence token
tokenizer.pad_token = tokenizer.eos_token

# Add our singular prompt
CONVERSION_PROMPT = 'LCT\n'  # LaTeX conversion task

CONVERSION_TOKEN = 'LaTeX:'

In [4]:
pd.set_option('display.max_colwidth', None)

# This is our "training prompt" that we want GPT2 to recognize and learn
training_examples = f'{CONVERSION_PROMPT}English: ' + data['English'] + '\n' + CONVERSION_TOKEN + ' ' + data['LaTeX'].astype(str)

print(training_examples[0:5])

0                      LCT\nEnglish: integral from a to b of x squared\nLaTeX: \int_{a}^{b} x^2 \,dx
1            LCT\nEnglish: integral from negative 1 to 1 of x squared\nLaTeX: \int_{-1}^{1} x^2 \,dx
2    LCT\nEnglish: integral from negative 1 to infinity of x cubed\nLaTeX: \int_{-1}^{\inf} x^3 \,dx
3            LCT\nEnglish: integral from 0 to infinity of x squared\nLaTeX: \int_{0}^{\inf} x^2 \,dx
4            LCT\nEnglish: integral from 0 to infinity of y squared\nLaTeX: \int_{0}^{\inf} y^2 \,dy
dtype: object
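
The same prompt layout is needed twice: with the LaTeX answer for training, and without it at inference time. A small helper keeps the two consistent. This is a sketch; `build_prompt` is a hypothetical name, and the two constants repeat the values defined above.

```python
CONVERSION_PROMPT = 'LCT\n'   # same values as in the cells above
CONVERSION_TOKEN = 'LaTeX:'


def build_prompt(english, latex=None):
    """Training prompt when `latex` is given, inference prompt otherwise."""
    prompt = f'{CONVERSION_PROMPT}English: {english}\n{CONVERSION_TOKEN}'
    if latex is not None:
        prompt += f' {latex}'
    return prompt


print(build_prompt('integral from a to b of x squared', r'\int_{a}^{b} x^2 \,dx'))
print(build_prompt('x squared'))
```

The inference prompt stops right after `LaTeX:`, so the model's continuation is the conversion itself.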


The main aim is to show the GPT model each English phrase together with its LaTeX equivalent so that it learns the mapping during training. First, convert the engineered prompts to a Pandas DataFrame.

In [5]:
task_df = pd.DataFrame({'text': training_examples})

task_df.head(5)

|  | text |
|---|---|
| 0 | `LCT\nEnglish: integral from a to b of x squared\nLaTeX: \int_{a}^{b} x^2 \,dx` |
| 1 | `LCT\nEnglish: integral from negative 1 to 1 of x squared\nLaTeX: \int_{-1}^{1} x^2 \,dx` |
| 2 | `LCT\nEnglish: integral from negative 1 to infinity of x cubed\nLaTeX: \int_{-1}^{\inf} x^3 \,dx` |
| 3 | `LCT\nEnglish: integral from 0 to infinity of x squared\nLaTeX: \int_{0}^{\inf} x^2 \,dx` |
| 4 | `LCT\nEnglish: integral from 0 to infinity of y squared\nLaTeX: \int_{0}^{\inf} y^2 \,dy` |


In [6]:
# convert pandas DataFrame into Dataset
latex_data = Dataset.from_pandas(task_df)  
latex_data

Dataset({
    features: ['text'],
    num_rows: 50
})

In [7]:
# Tokenize our text but don't pad, because
# the collator will pad it dynamically
def preprocess(examples):
    return tokenizer(examples['text'], truncation=True)

# Map the tokenizer function across the entire data set
latex_data = latex_data.map(preprocess, batched=True)

# Split to training set and test set
latex_data = latex_data.train_test_split(train_size=.8)
latex_data

Map:   0%|          | 0/50 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'input_ids', 'attention_mask'],
        num_rows: 40
    })
    test: Dataset({
        features: ['text', 'input_ids', 'attention_mask'],
        num_rows: 10
    })
})

In [8]:
# pass the tokenizer to DataCollatorForLanguageModeling 
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# mlm=False because we are not performing a Masked Language Modeling (MLM) task.
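
With `mlm=False` the collator prepares batches for causal language modeling. The pure-Python sketch below approximates that behaviour under stated assumptions: sequences are padded on the right to the longest one in the batch, labels are a copy of the inputs, and positions holding the pad id are set to -100 so that the loss skips them. The function name and the toy token ids are illustrative only.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips by default


def collate_for_causal_lm(batch_input_ids, pad_token_id):
    """Pad to the longest sequence and derive labels (a pure-Python approximation)."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        padded = ids + [pad_token_id] * n_pad
        input_ids.append(padded)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Labels copy the inputs; every pad-id position is excluded from the loss.
        labels.append([IGNORE_INDEX if tok == pad_token_id else tok for tok in padded])
    return {'input_ids': input_ids, 'attention_mask': attention_mask, 'labels': labels}


# Toy token ids; 50256 is GPT-2's end-of-text id, used here as the pad id.
batch = collate_for_causal_lm([[17250, 612], [15902, 612, 11, 995]], pad_token_id=50256)
print(batch['labels'])  # [[17250, 612, -100, -100], [15902, 612, 11, 995]]
```

The shift by one position (predict token t+1 from tokens up to t) is not done here; it happens inside the model when it computes the loss. Because the pad token is set to the end-of-text token, a genuine end-of-text token in the data would be masked out of the loss as well.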

In [9]:
# Load pre-trained GPT2 model
latex_gpt2 = GPT2LMHeadModel.from_pretrained('gpt2')

In [10]:
# Fine-tune the model,
# going through the data set 10 times (10 epochs) since we do not have
# many data points
training_args = TrainingArguments(
    output_dir="./english_to_latex",
    overwrite_output_dir=True, #overwrite the content of the output directory
    num_train_epochs=10, # number of training epochs
    per_device_train_batch_size=2, # batch size for training should be reduced from 32 to 2 because of small data
    per_device_eval_batch_size=20,  # batch size for evaluation
    load_best_model_at_end=True,
    logging_steps=5,
    log_level='info',
    evaluation_strategy='epoch',
    save_strategy='epoch'
)

trainer = Trainer(
    model=latex_gpt2,
    args=training_args,
    train_dataset=latex_data["train"],
    eval_dataset=latex_data["test"],
    data_collator=data_collator,
)

trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: text. If text are not expected by `GPT2LMHeadModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 10
  Batch size = 20


{'eval_loss': 4.9190874099731445,
 'eval_runtime': 1.4309,
 'eval_samples_per_second': 6.989,
 'eval_steps_per_second': 0.699}

In [27]:
# The loss should go down after training
trainer.train()

The following columns in the training set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: text. If text are not expected by `GPT2LMHeadModel.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 40
  Num Epochs = 10
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 2
  Gradient Accumulation steps = 1
  Total optimization steps = 200
  Number of trainable parameters = 124439808


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 1.525 | 1.479772 |
| 2 | 0.805 | 1.196455 |
| 3 | 0.7269 | 1.097175 |
| 4 | 0.7535 | 1.030493 |
| 5 | 0.5283 | 0.954629 |
| 6 | 0.4974 | 0.989692 |
| 7 | 0.5044 | 0.990972 |
| 8 | 0.5131 | 1.003159 |
| 9 | 0.3966 | 1.005885 |
| 10 | 0.3378 | 0.996944 |


The following columns in the evaluation set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: text. If text are not expected by `GPT2LMHeadModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 10
  Batch size = 20
Saving model checkpoint to ./english_to_latex\checkpoint-20
Configuration saved in ./english_to_latex\checkpoint-20\config.json
Configuration saved in ./english_to_latex\checkpoint-20\generation_config.json
Model weights saved in ./english_to_latex\checkpoint-20\pytorch_model.bin
The following columns in the evaluation set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: text. If text are not expected by `GPT2LMHeadModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 10
  Batch size = 20
Saving model checkpoint to ./english_to_latex\checkpoint-40
Configuration saved in ./english_to_latex\checkpoint-40\confi

TrainOutput(global_step=200, training_loss=0.7215107238292694, metrics={'train_runtime': 407.0658, 'train_samples_per_second': 0.983, 'train_steps_per_second': 0.491, 'total_flos': 6146486784000.0, 'train_loss': 0.7215107238292694, 'epoch': 10.0})

In [28]:
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: text. If text are not expected by `GPT2LMHeadModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 10
  Batch size = 20


{'eval_loss': 0.9546288251876831,
 'eval_runtime': 1.1259,
 'eval_samples_per_second': 8.882,
 'eval_steps_per_second': 0.888,
 'epoch': 10.0}
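
The `eval_loss` above (0.9546) matches the epoch-5 validation loss, not the epoch-10 value (0.9969). This is consistent with `load_best_model_at_end=True`, which reloads the checkpoint with the lowest evaluation loss when no other metric is configured. The sketch below repeats that selection on the logged values; the checkpoint name is inferred from 20 steps per epoch and does not appear in the truncated log.

```python
# Validation loss per epoch, copied from the training log above.
validation_loss = {
    1: 1.479772, 2: 1.196455, 3: 1.097175, 4: 1.030493, 5: 0.954629,
    6: 0.989692, 7: 0.990972, 8: 1.003159, 9: 1.005885, 10: 0.996944,
}

steps_per_epoch = 20  # 40 training examples / batch size 2

best_epoch = min(validation_loss, key=validation_loss.get)
best_checkpoint = f'checkpoint-{best_epoch * steps_per_epoch}'  # inferred name

print(best_epoch, validation_loss[best_epoch], best_checkpoint)  # 5 0.954629 checkpoint-100
```

The validation loss stops improving after epoch 5 while the training loss keeps falling, which suggests that the later epochs mostly overfit the 40 training examples.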

We can fine-tune the model again, but first we let the model read a calculus book:

In [None]:
# Calculus Made Easy by Silvanus P. Thompson - https://gutenberg.org/ebooks/33283
from transformers import TextDataset  # not part of the imports in cell 1

calculus_data = TextDataset(
    tokenizer=tokenizer,
    file_path='../data/calculus made easy.txt',  # plain-text copy of the book
    block_size=32  # split the tokenized book into chunks of 32 tokens
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False,  # MLM is Masked Language Modelling
)

latex_gpt2 = GPT2LMHeadModel.from_pretrained('gpt2')

training_args = TrainingArguments(
    output_dir="./calculus",
    overwrite_output_dir=True, #overwrite the content of the output directory
    num_train_epochs=1, # number of training epochs
    per_device_train_batch_size=32, # batch size for training
    per_device_eval_batch_size=32,  # batch size for evaluation
    load_best_model_at_end=True,
    logging_steps=50,
    eval_steps=50,
    evaluation_strategy='steps',
    save_strategy='steps'
)

trainer = Trainer(
    model=latex_gpt2,
    args=training_args,
    data_collator=data_collator,
    train_dataset=calculus_data.examples[:int(len(calculus_data.examples)*.8)],
    eval_dataset=calculus_data.examples[int(len(calculus_data.examples)*.8):]
)
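
The effect of `block_size=32` can be pictured with a toy example. This is a minimal sketch, assuming the tokenized text is cut into non-overlapping blocks of exactly 32 tokens and a shorter tail is dropped; `chunk_into_blocks` is a hypothetical helper, not the library's code.

```python
def chunk_into_blocks(token_ids, block_size):
    """Non-overlapping blocks of exactly block_size tokens; a short tail is dropped."""
    return [
        token_ids[i:i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]


toy_ids = list(range(100))  # stand-in for the tokenized book
blocks = chunk_into_blocks(toy_ids, block_size=32)

print(len(blocks), [len(block) for block in blocks])  # 3 [32, 32, 32]
```

Under the same assumption, the 6494 training and 1624 evaluation examples reported in the logs of this run would correspond to about 8118 × 32 ≈ 260,000 tokens of text.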

In [43]:
trainer.evaluate()  # initial loss for the calculus book

***** Running Evaluation *****
  Num examples = 1624
  Batch size = 32


{'eval_loss': 2.5129024982452393,
 'eval_runtime': 72.3714,
 'eval_samples_per_second': 22.44,
 'eval_steps_per_second': 0.705}

In [44]:
trainer.train()

***** Running training *****
  Num examples = 6494
  Num Epochs = 1
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 203


| Step | Training Loss | Validation Loss |
|---|---|---|
| 50 | 1.7968 | 1.640845 |
| 100 | 1.5812 | 1.58991 |
| 150 | 1.5672 | 1.567876 |
| 200 | 1.4662 | 1.558849 |


***** Running Evaluation *****
  Num examples = 1624
  Batch size = 32
***** Running Evaluation *****
  Num examples = 1624
  Batch size = 32
***** Running Evaluation *****
  Num examples = 1624
  Batch size = 32
***** Running Evaluation *****
  Num examples = 1624
  Batch size = 32


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=203, training_loss=1.6039342269521628, metrics={'train_runtime': 1292.6277, 'train_samples_per_second': 5.024, 'train_steps_per_second': 0.157, 'total_flos': 106051903488000.0, 'train_loss': 1.6039342269521628, 'epoch': 1.0})

In [45]:
trainer.save_model()

Saving model checkpoint to ./calculus
Configuration saved in ./calculus/config.json
Model weights saved in ./calculus/pytorch_model.bin


Next, instead of starting from the pre-trained gpt2 model, we start from the "calculus model", which has already been fine-tuned on the calculus textbook, and fine-tune it on the English-to-LaTeX task.

In [73]:
calculus_latex_gpt2 = GPT2LMHeadModel.from_pretrained('./calculus') 

training_args = TrainingArguments(
    output_dir="./calculus_english_to_latex",
    overwrite_output_dir=True, #overwrite the content of the output directory
    num_train_epochs=10, # number of training epochs
    per_device_train_batch_size=2, # batch size for training
    per_device_eval_batch_size=20,  # batch size for evaluation
    load_best_model_at_end=True,
    logging_steps=5,
    log_level='info',
    evaluation_strategy='epoch',
    save_strategy='epoch'
)

trainer = Trainer(
    model=calculus_latex_gpt2,
    args=training_args,
    train_dataset=latex_data["train"],
    eval_dataset=latex_data["test"],
    data_collator=data_collator,
)

trainer.evaluate()  # loss is starting slightly lower than before

loading configuration file ./calculus/config.json
Model config GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "torch_dtype": "float32",
  "transformers_version": "4.15.0",
  "use_cache": true,
  "vocab_size": 50257
}

loading weights fi

{'eval_loss': 4.565759181976318,
 'eval_runtime': 0.6014,
 'eval_samples_per_second': 16.627,
 'eval_steps_per_second': 1.663}

By doing so, the starting loss drops from about 4.92 (plain GPT-2) to about 4.57.

In [74]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: text.
***** Running training *****
  Num examples = 40
  Num Epochs = 10
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 2
  Gradient Accumulation steps = 1
  Total optimization steps = 200


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 1.3573 | 1.128648 |
| 2 | 0.6947 | 0.932949 |
| 3 | 0.7429 | 0.89962 |
| 4 | 0.6024 | 0.789026 |
| 5 | 0.5131 | 0.94391 |
| 6 | 0.5669 | 0.94491 |
| 7 | 0.4063 | 0.889982 |
| 8 | 0.3631 | 0.92746 |
| 9 | 0.4062 | 0.949464 |
| 10 | 0.2994 | 0.962857 |


The following columns in the evaluation set  don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: text.
***** Running Evaluation *****
  Num examples = 10
  Batch size = 20
Saving model checkpoint to ./calculus_english_to_latex/checkpoint-20
Configuration saved in ./calculus_english_to_latex/checkpoint-20/config.json
Model weights saved in ./calculus_english_to_latex/checkpoint-20/pytorch_model.bin
The following columns in the evaluation set  don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: text.
***** Running Evaluation *****
  Num examples = 10
  Batch size = 20
Saving model checkpoint to ./calculus_english_to_latex/checkpoint-40
Configuration saved in ./calculus_english_to_latex/checkpoint-40/config.json
Model weights saved in ./calculus_english_to_latex/checkpoint-40/pytorch_model.bin
The following columns in the evaluation set  don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ig

TrainOutput(global_step=200, training_loss=0.7141165947914123, metrics={'train_runtime': 230.0817, 'train_samples_per_second': 1.739, 'train_steps_per_second': 0.869, 'total_flos': 6087287808000.0, 'train_loss': 0.7141165947914123, 'epoch': 10.0})

In [75]:
trainer.evaluate()  # pre-training on the calculus book for one epoch led to a minor drop in loss

The following columns in the evaluation set  don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: text.
***** Running Evaluation *****
  Num examples = 10
  Batch size = 20


{'eval_loss': 0.7890258431434631,
 'eval_runtime': 0.7324,
 'eval_samples_per_second': 13.655,
 'eval_steps_per_second': 1.365,
 'epoch': 10.0}
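
The two fine-tuning runs can be put side by side using the evaluation losses logged in this notebook. The sketch below only does the arithmetic; the run labels and the helper name are hypothetical.

```python
# Evaluation losses copied from the trainer.evaluate() outputs in this notebook.
runs = {
    'gpt2': {'start': 4.9190874099731445, 'best': 0.9546288251876831},
    'gpt2 + calculus': {'start': 4.565759181976318, 'best': 0.7890258431434631},
}


def relative_drop(before, after):
    """Fraction by which `after` is lower than `before`."""
    return (before - after) / before


start_drop = relative_drop(runs['gpt2']['start'], runs['gpt2 + calculus']['start'])
best_drop = relative_drop(runs['gpt2']['best'], runs['gpt2 + calculus']['best'])

print(f'starting loss: {start_drop:.1%} lower, best loss: {best_drop:.1%} lower')
```

Both numbers come from a test set of only 10 examples, so the difference should be read as indicative rather than conclusive.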

In [76]:
trainer.save_model()  # save this model

Saving model checkpoint to ./calculus_english_to_latex
Configuration saved in ./calculus_english_to_latex/config.json
Model weights saved in ./calculus_english_to_latex/pytorch_model.bin


In [144]:
# Load up calculus_english_to_latex which was trained 2 times 
loaded_model = GPT2LMHeadModel.from_pretrained('./calculus_english_to_latex')

# Load the model in a pipeline for text generation
latex_generator = pipeline('text-generation', model=loaded_model, tokenizer=tokenizer)

loading configuration file ./calculus_english_to_latex/config.json
Model config GPT2Config {
  "_name_or_path": "./calculus",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "torch_dtype": "float32",
  "transformers_version": "4.15.0",
  "use_cache": true,
  "vocab_size": 5025

In [149]:
text_sample = 'f of x equals integral from 0 to pi of x to the fourth power'
conversion_text_sample = f'{CONVERSION_PROMPT}English: {text_sample}\n{CONVERSION_TOKEN}'

print(conversion_text_sample)

LCT
English: f of x equals integral from 0 to pi of x to the fourth power
LaTeX:


In [150]:
print(latex_generator(
    conversion_text_sample, num_beams=5, early_stopping=True, temperature=0.7,
    max_length=len(tokenizer.encode(conversion_text_sample)) + 20
)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


LCT
English: f of x equals integral from 0 to pi of x to the fourth power
LaTeX: f(x) = \int_{0}^{pi} x^4 \,dx \
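
The pipeline returns the prompt followed by the continuation, so the LaTeX has to be cut out of the generated text before it can be used. A minimal sketch of such post-processing follows; `extract_latex` is a hypothetical helper, and stripping a trailing backslash is a heuristic suited to the output above rather than a general rule.

```python
def extract_latex(generated_text, prompt):
    """Return the first generated line after the prompt, without a dangling backslash."""
    if generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):]
    else:
        continuation = generated_text
    lines = continuation.strip().splitlines()
    first_line = lines[0] if lines else ''
    return first_line.rstrip('\\').rstrip()


prompt = 'LCT\nEnglish: f of x equals integral from 0 to pi of x to the fourth power\nLaTeX:'
generated = prompt + ' f(x) = \\int_{0}^{pi} x^4 \\,dx \\'

print(extract_latex(generated, prompt))
```

The model wrote `pi` rather than `\pi` for the upper limit, so the extracted string is close to, but not exactly, the intended LaTeX.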


In [165]:
text_sample = 'f of x is sum from 0 to x of x squared'
conversion_text_sample = f'{CONVERSION_PROMPT}English: {text_sample}\n{CONVERSION_TOKEN}'

print(latex_generator(
    conversion_text_sample, num_beams=5, early_stopping=True, temperature=0.7,
    max_length=len(tokenizer.encode(conversion_text_sample)) + 20
)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


LCT
English: f of x is sum from 0 to x of x squared
LaTeX: f(x) = \sum_{0}^{x} x^2 \,dx \


In [163]:
# Sanity check to see if GPT-2 alone can predict this
MODEL = 'gpt2'  # base checkpoint; the log below shows it resolving to huggingface.co/gpt2
non_finetuned_latex_generator = pipeline(
    'text-generation', 
    model=GPT2LMHeadModel.from_pretrained(MODEL),  # not fine-tuned!
    tokenizer=tokenizer
)

loading configuration file https://huggingface.co/gpt2/resolve/main/config.json from cache at /Users/sinanozdemir/.cache/huggingface/transformers/fc674cd6907b4c9e933cb42d67662436b89fa9540a1f40d7c919d0109289ad01.7d2e0efa5ca20cef4fb199382111e9d3ad96fd77b849e1d4bed13a66e1336f51
Model config GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-gen

In [167]:
few_shot_prompt = """LCT
English: f of x is sum from 0 to x of x squared
LaTeX: f(x) = \sum_{0}^{x} x^2 \,dx \
###
LCT
English: f of x equals integral from 0 to pi of x to the fourth power
LaTeX: f(x) = \int_{0}^{\pi} x^4 \,dx \
###
LCT
English: x squared
LaTeX:"""

In [169]:
print(non_finetuned_latex_generator(
    few_shot_prompt, num_beams=5, early_stopping=True, temperature=0.7,
    max_length=len(tokenizer.encode(few_shot_prompt)) + 20
)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


LCT
English: f of x is sum from 0 to x of x squared
LaTeX: f(x) = \sum_{0}^{x} x^2 \,dx ###
LCT
English: f of x equals integral from 0 to pi of x to the fourth power
LaTeX: f(x) = \int_{0}^{\pi} x^4 \,dx ###
LCT
English: x squared
LaTeX: f(x) = \sum_{0}^{x} x^2 \,dx ###


In [164]:
print(non_finetuned_latex_generator(
    conversion_text_sample, num_beams=5, early_stopping=True, temperature=0.7,
    max_length=len(tokenizer.encode(conversion_text_sample)) + 20
)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


LCT
English: f of x is sum from 0 to x of x squared
LaTeX: f of x is sum from 0 to x of x squared
LaTeX: f of x is


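As can be seen, prompting the non-fine-tuned GPT-2 does not give a reliable answer: with the few-shot prompt it copies the first example's LaTeX instead of converting "x squared", and with the plain prompt it just repeats the English input.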