**Introduction to GPT**

In [1]:
from transformers import pipeline, set_seed, GPT2Tokenizer, GPT2LMHeadModel
from torch import tensor, numel
from bertviz import model_view

set_seed(42)

In [2]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [3]:
device

device(type='cuda')

In [4]:
generator = pipeline("text-generation",model="gpt2")
generator("Hello, I'm a stock-trader and I", max_length=30, num_return_sequences=3)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm a stock-trader and I want stock on every stock company.\n\n\nWell, as you know, I have to keep"},
 {'generated_text': "Hello, I'm a stock-trader and I'm happy with today's earnings for the past 8 months. Last week we reported the lowest of"},
 {'generated_text': 'Hello, I\'m a stock-trader and I can handle some of the stuff I want," he says. "I bought into the whole thing'}]

In [5]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

In [6]:
'Ashutosh' in tokenizer.get_vocab()

False

In [7]:
tokenizer.convert_ids_to_tokens(tokenizer.encode('Ashutosh loves to analyze stocks'))

['Ash', 'ut', 'osh', 'Ġloves', 'Ġto', 'Ġanalyze', 'Ġstocks']
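
The `Ġ` prefix in the output above is how GPT-2's byte-level BPE vocabulary marks a token that starts with a space. As a minimal sketch using only the token list printed above, the original string can be rebuilt by hand. The helper name `tokens_to_text` is hypothetical, and real decoding should go through `tokenizer.decode`, which also handles non-ASCII bytes.

```python
def tokens_to_text(tokens):
    """Rebuild plain text from GPT-2 BPE tokens by turning the 'Ġ' marker back into a space."""
    return "".join(tokens).replace("Ġ", " ")

# Token list copied from the cell output above.
tokens_demo = ['Ash', 'ut', 'osh', 'Ġloves', 'Ġto', 'Ġanalyze', 'Ġstocks']
text_demo = tokens_to_text(tokens_demo)
print(text_demo)
```

Since 'Ashutosh' is not in the vocabulary, it is split into three sub-word pieces, and only the pieces that follow a space carry the marker.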

In [8]:
tokenizer.encode('Ashutosh loves to analyze stocks')

[26754, 315, 3768, 10408, 284, 16602, 14420]

In [9]:
model = GPT2LMHeadModel.from_pretrained('gpt2').to(device)
model

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

In [10]:
encoded = tokenizer.encode('Ashutosh loves to analyze stocks', return_tensors='pt').to(device)

In [11]:
encoded

tensor([[26754,   315,  3768, 10408,   284, 16602, 14420]], device='cuda:0')

In [12]:
model.transformer.wte(encoded).shape

torch.Size([1, 7, 768])

In [13]:
model.transformer.wpe(tensor([0, 1, 2, 3, 4, 5]).reshape(1, 6).to(device))

tensor([[[-1.8821e-02, -1.9742e-01,  4.0267e-03,  ..., -4.3044e-02,
           2.8267e-02,  5.4490e-02],
         [ 2.3959e-02, -5.3792e-02, -9.4879e-02,  ...,  3.4170e-02,
           1.0172e-02, -1.5573e-04],
         [ 4.2161e-03, -8.4764e-02,  5.4515e-02,  ...,  1.9745e-02,
           1.9325e-02, -2.1424e-02],
         [-2.8337e-04, -7.3803e-02,  1.0553e-01,  ...,  1.0157e-02,
           1.7659e-02, -7.0854e-03],
         [ 7.6374e-03, -2.5090e-02,  1.2696e-01,  ...,  8.4643e-03,
           9.8542e-03, -7.0117e-03],
         [ 9.6023e-03, -3.3885e-02,  1.3123e-01,  ...,  5.8940e-03,
           7.1222e-03, -7.4742e-03]]], device='cuda:0',
       grad_fn=<EmbeddingBackward0>)

In [14]:
initial_input = model.transformer.wte(encoded) + model.transformer.wpe(tensor([0, 1, 2, 3, 4, 5,6]).reshape(1, 7).to(device))
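
The cell above hard-codes the seven position indices. The toy sketch below performs the same sum in plain Python and derives the positions from the sequence length instead. The 3-dimensional embedding tables hold made-up values, and the names `toy_wte`, `toy_wpe` and `embed` are hypothetical.

```python
# Toy embedding tables (values are made up): token id -> vector, position -> vector.
toy_wte = {26754: [0.1, 0.2, 0.3], 315: [0.0, -0.1, 0.4], 3768: [0.5, 0.5, -0.2]}
toy_wpe = [[0.01, 0.02, 0.03], [0.04, 0.05, 0.06], [0.07, 0.08, 0.09]]

def embed(token_ids):
    """Return token embedding + position embedding for each position in the sequence."""
    positions = range(len(token_ids))  # 0, 1, ..., seq_len - 1
    return [
        [t + p for t, p in zip(toy_wte[tok], toy_wpe[pos])]
        for tok, pos in zip(token_ids, positions)
    ]

toy_input = embed([26754, 315, 3768])
print(len(toy_input), len(toy_input[0]))
```

With the real model, `torch.arange(encoded.shape[1])` could play the role of `positions`, so the code would not need editing when the sentence length changes.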

In [15]:
model.transformer.drop(initial_input)

tensor([[[-0.1346, -0.1959,  0.2751,  ..., -0.1475,  0.2174,  0.2006],
         [-0.1197, -0.1219, -0.0520,  ..., -0.0707, -0.1176,  0.1420],
         [ 0.0209, -0.1028,  0.0574,  ...,  0.0567,  0.1084,  0.1670],
         ...,
         [-0.0008, -0.1269,  0.1579,  ..., -0.0074,  0.1089,  0.0720],
         [ 0.1626,  0.0182,  0.2111,  ...,  0.2669,  0.2159, -0.0566],
         [ 0.1358,  0.0221,  0.1866,  ..., -0.0193,  0.0821,  0.0390]]],
       device='cuda:0', grad_fn=<AddBackward0>)

In [16]:
model.lm_head

Linear(in_features=768, out_features=50257, bias=False)

In [17]:
total_params = 0
for param in model.parameters():
    total_params += numel(param)

print(f'Number of params: {total_params:,}')

Number of params: 124,439,808
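
The total can be cross-checked by hand from the architecture printed earlier. The sketch below is an arithmetic check under stated assumptions. The printed `Conv1D()` layers do not show their shapes, so the fused Q/K/V width (3 x 768) and the MLP width (4 x 768) are taken from the standard GPT-2 small configuration. The `lm_head` weight is assumed to be tied to `wte`, so it is counted only once.

```python
# Sizes visible in the printed model: vocabulary, context length, hidden size, number of blocks.
vocab, n_ctx, d_model, n_layer = 50257, 1024, 768, 12
d_ff = 4 * d_model  # MLP width (3072), assumed from the standard GPT-2 small config

wte = vocab * d_model                          # token embeddings
wpe = n_ctx * d_model                          # position embeddings
layer_norm = 2 * d_model                       # weight + bias
c_attn = d_model * 3 * d_model + 3 * d_model   # fused Q, K, V projection (assumed shape)
attn_proj = d_model * d_model + d_model        # attention output projection
c_fc = d_model * d_ff + d_ff                   # MLP up-projection
mlp_proj = d_ff * d_model + d_model            # MLP down-projection
block = 2 * layer_norm + c_attn + attn_proj + c_fc + mlp_proj

# lm_head is not added: its weight is assumed to be shared with wte.
expected_params = wte + wpe + n_layer * block + layer_norm  # final term is ln_f
print(f"{expected_params:,}")  # 124,439,808
```

Under these assumptions the 12 transformer blocks account for about 85 million parameters and the token embedding matrix for about 38.6 million.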


**Masked multi-headed attention**

In [18]:
import pandas as pd

In [19]:
phrase = 'My friend was right about generative-AI. It is so fun!'

In [20]:
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(phrase))
tokens

['My',
 'Ġfriend',
 'Ġwas',
 'Ġright',
 'Ġabout',
 'Ġgener',
 'ative',
 '-',
 'AI',
 '.',
 'ĠIt',
 'Ġis',
 'Ġso',
 'Ġfun',
 '!']

In [21]:
len(tokenizer.convert_ids_to_tokens(tokenizer.encode(phrase)))

15

In [22]:
encoded_phrase = tokenizer(phrase, return_tensors='pt').to(device)

In [23]:
print(encoded_phrase)

{'input_ids': tensor([[ 3666,  1545,   373,   826,   546,  1152,   876,    12, 20185,    13,
           632,   318,   523,  1257,     0]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}


In [24]:
response = model(**encoded_phrase, output_attentions=True, output_hidden_states=True)

**output_attentions=True:** This argument tells the model to return the attention weights as part of its output. In transformer models like GPT-2, attention weights show how much focus the model puts on other parts of the input sequence when processing each token. This can be useful for understanding the model's behavior and for certain advanced applications like model interpretation.

**output_hidden_states=True:** With this argument set to True, the model also returns the hidden states from all layers. For GPT-2, these are the output of the embedding layer followed by the output of each transformer block (layer), so the 12-block model returns 13 tensors. These hidden states can be used for feature extraction in downstream tasks, or for analyzing how different layers of the model process and represent the input text.

In [25]:
response.attentions[-1].shape  # Attention weights from the final decoder block

torch.Size([1, 12, 15, 15])
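
The shape reads as (batch, heads, query positions, key positions): one phrase, 12 heads, and a 15 x 15 matrix per head. Because GPT-2 uses masked self-attention, each token can only attend to itself and to earlier tokens, so every matrix is lower-triangular and each row sums to 1. The sketch below reproduces that pattern in plain Python on made-up scores. The function name `causal_attention_weights` is hypothetical and the sketch is not the library's implementation.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], keeping only keys j <= i (causal mask)."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                     # token i may attend to positions 0..i
        m = max(visible)                           # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Masked (future) positions get exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

toy_scores = [[0.5, 1.0, -0.3], [0.2, 0.1, 0.9], [1.5, -0.5, 0.0]]  # made-up raw scores
toy_weights = causal_attention_weights(toy_scores)
for row in toy_weights:
    print([round(w, 3) for w in row])
```

The first row is always `[1.0, 0.0, ...]` because the first token has nothing but itself to attend to, which matches the first row of every head in the real output.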

In [26]:
print(response.attentions[-1])

tensor([[[[1.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
          [0.2978, 0.7022, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
          [0.3311, 0.3612, 0.3078,  ..., 0.0000, 0.0000, 0.0000],
          ...,
          [0.1609, 0.0749, 0.1140,  ..., 0.0507, 0.0000, 0.0000],
          [0.0670, 0.0755, 0.0598,  ..., 0.0356, 0.0462, 0.0000],
          [0.1219, 0.0934, 0.0331,  ..., 0.0157, 0.0192, 0.2919]],

         [[1.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
          [0.9412, 0.0588, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
          [0.9428, 0.0232, 0.0340,  ..., 0.0000, 0.0000, 0.0000],
          ...,
          [0.6996, 0.0080, 0.0109,  ..., 0.0168, 0.0000, 0.0000],
          [0.7258, 0.0140, 0.0221,  ..., 0.0176, 0.0296, 0.0000],
          [0.5206, 0.0850, 0.0475,  ..., 0.0191, 0.0077, 0.0381]],

         [[1.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
          [0.9783, 0.0217, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
          [0.9774, 0.0118, 0.0108,  ..., 0

In [27]:
# Layer index 9, head 0. Note the roughly 52% of attention that the token '-' gives to the token 'ative'
arr = response.attentions[9][0][0]

n_digits = 3

# Move the tensor to CPU before converting to a NumPy array
attention_df = pd.DataFrame((torch.round(arr * 10**n_digits) / (10**n_digits)).cpu().detach()).applymap(float)

attention_df.columns = tokens
attention_df.index = tokens

attention_df


,My,Ġfriend,Ġwas,Ġright,Ġabout,Ġgener,ative,-,AI,.,ĠIt,Ġis,Ġso,Ġfun,!
My,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Ġfriend,0.968,0.032,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Ġwas,0.824,0.145,0.031,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Ġright,0.979,0.008,0.007,0.005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Ġabout,0.979,0.008,0.004,0.005,0.005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Ġgener,0.82,0.004,0.002,0.001,0.002,0.172,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ative,0.501,0.006,0.001,0.001,0.002,0.158,0.33,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-,0.35,0.003,0.002,0.002,0.003,0.108,0.515,0.018,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AI,0.746,0.009,0.002,0.003,0.003,0.045,0.058,0.012,0.122,0.0,0.0,0.0,0.0,0.0,0.0
.,0.585,0.016,0.003,0.003,0.002,0.09,0.108,0.014,0.17,0.01,0.0,0.0,0.0,0.0,0.0
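
One way to read a table like this programmatically is to look up the most-attended token for each row. The sketch below uses three rows copied from the table above, truncated to their first ten columns. The names `columns_demo`, `rows_demo` and `top_attended` are hypothetical.

```python
columns_demo = ['My', 'Ġfriend', 'Ġwas', 'Ġright', 'Ġabout', 'Ġgener', 'ative', '-', 'AI', '.']
rows_demo = {
    'ative': [0.501, 0.006, 0.001, 0.001, 0.002, 0.158, 0.33, 0.0, 0.0, 0.0],
    '-':     [0.35, 0.003, 0.002, 0.002, 0.003, 0.108, 0.515, 0.018, 0.0, 0.0],
    'AI':    [0.746, 0.009, 0.002, 0.003, 0.003, 0.045, 0.058, 0.012, 0.122, 0.0],
}

def top_attended(weights, columns):
    """Return the (token, weight) pair that receives the largest attention weight."""
    best = max(range(len(weights)), key=weights.__getitem__)
    return columns[best], weights[best]

for token, weights in rows_demo.items():
    print(token, '->', top_attended(weights, columns_demo))
```

In this head most rows put the bulk of their weight on the first token, `My`, while `-` attends mainly to `ative`. On the full DataFrame, `attention_df.idxmax(axis=1)` should give the same lookup for every row.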


In [28]:
model_view(response.attentions, tokens)

<IPython.core.display.Javascript object>

In [29]:
generator(phrase, max_length=40, num_return_sequences=1, do_sample=False)  # greedy search

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "My friend was right about generative-AI. It is so fun!\n\nI'm not sure if you've heard of generative-AI, but it's a very interesting idea. It"}]

In [30]:
generator(phrase, max_length=40, num_return_sequences=1, do_sample=True)  # sampling instead of greedy search

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'My friend was right about generative-AI. It is so fun!\n\nI am sure that many you have written about generative-AI for other uses:\n\n- to give a'}]
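
The two calls above differ only in how the next token is chosen at each step: with `do_sample=False` the most probable token is always taken, while with `do_sample=True` a token is drawn at random in proportion to its probability. The sketch below shows the difference for a single step on a made-up distribution. The names `greedy_pick` and `sample_pick` are hypothetical, and the sketch is not the pipeline's internal code.

```python
import random

# Made-up next-token distribution for a single decoding step.
toy_next_token_probs = {"fun": 0.5, "hard": 0.3, "new": 0.2}

def greedy_pick(probs):
    """Greedy search: always return the most probable token."""
    return max(probs, key=probs.get)

def sample_pick(probs, rng):
    """Sampling: draw one token with probability proportional to its weight."""
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(42)  # fixed seed so repeated runs give the same draws
greedy_token = greedy_pick(toy_next_token_probs)
sampled_tokens = [sample_pick(toy_next_token_probs, rng) for _ in range(5)]
print(greedy_token)
print(sampled_tokens)
```

Greedy search is deterministic, which is why it returns the same continuation on every run, whereas sampling can produce a different continuation each time unless the seed is fixed.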

**Pre-training GPT**

In [31]:

# Bias: sampled completions can surface biases absorbed from the pre-training data
generator("The Black people in Africa was", max_length=15, num_return_sequences=10, temperature=0.5)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The Black people in Africa was the first to be oppressed. There were two'},
 {'generated_text': 'The Black people in Africa was so violent that it was a very difficult thing'},
 {'generated_text': 'The Black people in Africa was a different story. The Black people in Africa'},
 {'generated_text': 'The Black people in Africa was one of the most significant and significant political events'},
 {'generated_text': 'The Black people in Africa was the first to be enslaved, and the first'},
 {'generated_text': 'The Black people in Africa was not born into slavery. It was born into'},
 {'generated_text': 'The Black people in Africa was not given the right to live their lives in'},
 {'generated_text': 'The Black people in Africa was already a big problem in the late 19th'},
 {'generated_text': 'The Black people in Africa was a different story. The Black people were the'},
 {'generated_text': 'The Black people in Africa was a civil war that ended in the fall of'}]

In [32]:
# Bias: same prompt at a higher temperature, which gives more varied completions
generator("The Black people in Africa was", max_length=15, num_return_sequences=10, temperature=1.2)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The Black people in Africa was not as affected by mass incarceration as Africa,'},
 {'generated_text': "The Black people in Africa was one example when Black Africans didn't know they"},
 {'generated_text': 'The Black people in Africa was forced to bear the brunt of an imperialist war'},
 {'generated_text': 'The Black people in Africa was in very strong form.\n\nI believe'},
 {'generated_text': 'The Black people in Africa was the leading cause of anti-government riots in'},
 {'generated_text': 'The Black people in Africa was like nothing you could find. It reminded me'},
 {'generated_text': "The Black people in Africa was not just the work. It's a whole"},
 {'generated_text': 'The Black people in Africa was the biggest thing the country had since the start'},
 {'generated_text': 'The Black people in Africa was very successful.\n\n"I do think'},
 {'generated_text': 'The Black people in Africa was so much a threat to Western civilization for decades'}]
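
The two cells above differ only in `temperature` (0.5 versus 1.2). Temperature divides the logits before the softmax, so values below 1 sharpen the distribution toward the most likely tokens and values above 1 flatten it. The sketch below illustrates this on made-up logits. The function name `softmax_with_temperature` is hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                              # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
sharp = softmax_with_temperature(toy_logits, 0.5)
flat = softmax_with_temperature(toy_logits, 1.2)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

The top token receives a larger share of the probability mass at temperature 0.5 than at 1.2, which is consistent with the lower-temperature completions above repeating similar openings more often.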

In [33]:
generator("The earth is", max_length=10, num_return_sequences=10, temperature=0.8)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The earth is full of life.\n\nI'},
 {'generated_text': 'The earth is a pretty big place and the temperature'},
 {'generated_text': 'The earth is dying and it is not taking our'},
 {'generated_text': 'The earth is an enormous, complicated, highly complex'},
 {'generated_text': 'The earth is made up of three layers. The'},
 {'generated_text': "The earth is changing, and it's getting colder"},
 {'generated_text': 'The earth is the center of the universe, and'},
 {'generated_text': 'The earth is not flat. You can take any'},
 {'generated_text': "The earth is flat.\n\nIf you've"},
 {'generated_text': 'The earth is the center of the universe and it'}]

**Few-shot learning**

In [34]:
print(generator("""Sentiment Analysis
Text: I hate it when my phone battery dies.
Sentiment: Negative
###
Text: My day has been really great!
Sentiment: Positive
###
Text: Not a fan when it is cloudy
Sentiment:""", top_k=2, temperature=0.1, max_length=55)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sentiment Analysis
Text: I hate it when my phone battery dies.
Sentiment: Negative
###
Text: My day has been really great!
Sentiment: Positive
###
Text: Not a fan when it is cloudy
Sentiment: Negative
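
The few-shot prompt above follows a fixed pattern: a task title, labeled examples separated by `###`, and a final unlabeled query that the model is left to complete. The sketch below assembles the same string from data. The helper `build_few_shot_prompt` and its parameters are hypothetical and not part of the `transformers` API.

```python
def build_few_shot_prompt(task, examples, query, in_label="Text", out_label="Sentiment"):
    """Assemble a few-shot prompt: task title, labeled examples, then an unlabeled query."""
    lines = [task]
    for text, label in examples:
        lines += [f"{in_label}: {text}", f"{out_label}: {label}", "###"]
    lines += [f"{in_label}: {query}", f"{out_label}:"]  # left open for the model to fill in
    return "\n".join(lines)

few_shot_prompt = build_few_shot_prompt(
    "Sentiment Analysis",
    [("I hate it when my phone battery dies.", "Negative"),
     ("My day has been really great!", "Positive")],
    "Not a fan when it is cloudy",
)
print(few_shot_prompt)
```

Building the prompt from a list makes it easy to vary the number of examples and see how many shots the model needs before it follows the pattern.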



In [35]:
print(generator("""Question/Answering
C: Google was founded in 1998 by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University in California. Together they own about 14 percent of its shares and control 56 percent of the stockholder voting power through supervoting stock.
Q: When was Google founded?
A: 1998
###
C: Hugging Face is a company which develops social AI-run chatbot applications. It was established in 2016 by Clement Delangue and Julien Chaumond. The company is based in Brooklyn, New York, United States.
Q: What does Hugging Face develop?
A: social AI-run chatbot applications
###
C: The New York Jets are a professional American football team based in the New York metropolitan area. The Jets compete in the National Football League (NFL) as a member club of the league's American Football Conference (AFC) East division.
Q: What division do the Jets play in?
A:""", top_k=2, max_length=215, temperature=0.5)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Question/Answering
C: Google was founded in 1998 by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University in California. Together they own about 14 percent of its shares and control 56 percent of the stockholder voting power through supervoting stock.
Q: When was Google founded?
A: 1998
###
C: Hugging Face is a company which develops social AI-run chatbot applications. It was established in 2016 by Clement Delangue and Julien Chaumond. The company is based in Brooklyn, New York, United States.
Q: What does Hugging Face develop?
A: social AI-run chatbot applications
###
C: The New York Jets are a professional American football team based in the New York metropolitan area. The Jets compete in the National Football League (NFL) as a member club of the league's American Football Conference (AFC) East division.
Q: What division do the Jets play in?
A: American Football Conference (AFC) East
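
The call above passes `top_k=2`, which restricts sampling at each step to the two most probable next tokens. The sketch below shows the filtering step on a made-up distribution. The function name `top_k_filter` is hypothetical, and the sketch is not the library's implementation.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

# Made-up next-token distribution.
toy_probs = {"AFC": 0.4, "East": 0.3, "NFL": 0.2, "West": 0.1}
filtered = top_k_filter(toy_probs, k=2)
print(filtered)
```

A small `k` keeps generation close to the most likely continuation, which suits extractive tasks like this one where the answer should be copied from the context.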


**Zero-shot learning**

In [36]:
print(generator(
    '''Question/Answering
C: The New York Jets are a professional American football team based in the New York metropolitan area. The Jets compete in the National Football League (NFL) as a member club of the league's American Football Conference (AFC) East division.
Q: What division do the Jets play in?
A:''',
    top_k=2, max_length=80, temperature=0.5)[0]['generated_text']
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Question/Answering
C: The New York Jets are a professional American football team based in the New York metropolitan area. The Jets compete in the National Football League (NFL) as a member club of the league's American Football Conference (AFC) East division.
Q: What division do the Jets play in?
A: The Jets play in the AFC East, which is the


In [37]:
# Zero-shot doesn't work as well for the sentiment analysis example
print(generator("""Sentiment Analysis
Text: This new music video was so good
Sentiment:""", top_k=2, temperature=0.3, max_length=55)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sentiment Analysis
Text: This new music video was so good
Sentiment: This new music video was so good
Sentiment: This new music video was so good
Sentiment: This new music video was so good
Sentiment: This new music video was
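
The loop above is a typical failure mode of low-temperature decoding without examples. The summarization call at the end of this notebook passes `no_repeat_ngram_size=2`, which prevents any 2-gram from appearing twice. The sketch below shows the idea on word-level tokens. The function name `banned_next_tokens` is hypothetical and the sketch is not the library's implementation.

```python
def banned_next_tokens(generated, n):
    """Tokens that would complete an n-gram already present in `generated`."""
    if len(generated) < n:
        return set()
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(generated[-(n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned

# Word-level stand-in for the repeated output above; the real model works on BPE tokens.
toy_generated = ["so", "good", "Sentiment", ":", "This", "new", "music", "video", "was", "so"]
banned = banned_next_tokens(toy_generated, 2)
print(banned)
```

Here the sequence already contains "so good", so after the second "so" the token "good" is banned and the decoder has to continue differently.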


**Zero-shot abstractive summarization**

In [38]:
to_summarize = """This training will focus on how the GPT family of models are used for NLP tasks including abstractive text summarization and natural language generation. The training will begin with an introduction to necessary concepts including masked self attention, language models, and transformers and then build on those concepts to introduce the GPT architecture. We will then move into how GPT is used for multiple natural language processing tasks with hands-on examples of using pre-trained GPT-2 models as well as fine-tuning these models on custom corpora.

GPT models are some of the most relevant NLP architectures today and it is closely related to other important NLP deep learning models like BERT. Both of these models are derived from the newly invented transformer architecture and represent an inflection point in how machines process language and context.

The Natural Language Processing with Next-Generation Transformer Architectures series of online trainings provides a comprehensive overview of state-of-the-art natural language processing (NLP) models including GPT and BERT which are derived from the modern attention-driven transformer architecture and the applications these models are used to solve today. All of the trainings in the series blend theory and application through the combination of visual mathematical explanations, straightforward applicable Python examples within hands-on Jupyter notebook demos, and comprehensive case studies featuring modern problems solvable by NLP models. (Note that at any given time, only a subset of these classes will be scheduled and open for registration.)"""

In [39]:
print(generator(
    f"""Summarization Task:\n{to_summarize}\nTL;DR:""",
    max_length=128, top_k=5, temperature=0.5, no_repeat_ngram_size=2
)[0]['generated_text'])


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Summarization Task:
This training will focus on how the GPT family of models are used for NLP tasks including abstractive text summarization and natural language generation. The training will begin with an introduction to necessary concepts including masked self attention, language models, and transformers and then build on those concepts to introduce the GPT architecture. We will then move into how GPT is used for multiple natural language processing tasks with hands-on examples of using pre-trained GPT-2 models as well as fine-tuning these models on custom corpora.

GPT models are some of the most relevant NLP architectures today and it is closely related to other important NLP deep learning models like BERT. Both of these models are derived from the newly invented transformer architecture and represent an inflection point in how machines process language and context.

The Natural Language Processing with Next-Generation Transformer Architectures series of online trainings provides a