## GPT-2


GPT-2 (Generative Pre-trained Transformer 2) is an AI language model developed by OpenAI. It is the second iteration of the Generative Pre-trained Transformer (GPT) series and was released in 2019. GPT-2 is designed for natural language processing (NLP) tasks such as text generation, translation, summarization, and question-answering.

In this repository, I learn and explore the GPT-2 architecture by building it from scratch with PyTorch. The process includes building an LLM, a foundation model, and a classifier or personal assistant.

I want to say thank you to my bro github.com/double-singularity for helping me learn this model.

Start at Jan 6th - 

# Install Requirements

In [2]:
#pip install -r requirements.txt

Packages that we're going to use:

In [3]:
from importlib.metadata import version

print("torch version:", version("torch"))
print("tiktoken version:", version("tiktoken"))

torch version: 2.5.1
tiktoken version: 0.8.0


# Load Raw Text

In [4]:
import os
import urllib.request

if not os.path.exists("the-verdict.txt"):
    url = ("https://raw.githubusercontent.com/rasbt/"
           "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
           "the-verdict.txt")
    file_path = "the-verdict.txt"
    urllib.request.urlretrieve(url, file_path)

In [5]:
# open the file in read mode ("r")
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

# show the total number of characters
print("Total number of character:", len(raw_text))

# show the first 99 characters
print(raw_text[:99])

Total number of character: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


# Regular Expressions Split

re is Python's built-in regular expression module. We use re.split to break raw_text into individual words and punctuation marks. The parentheses in the pattern form a capturing group, so the punctuation is kept as separate items, and a second step strips whitespace and drops empty strings.

In [6]:

import re

# split the text into words and punctuation marks with a regex;
# the pattern r'([,.:;?_!"()\']|--|\s)' matches punctuation, double dashes, and whitespace
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]

#showing first 30 preprocessed words
print(preprocessed[:30])

#showing the total number of preprocessed words (after removing empty strings)
print("this is the length after preprocessed: ", len(preprocessed))

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']
this is the length after preprocessed:  4690
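
To see what the capturing group in the pattern does, here is a small sketch on a made-up sentence (the sentence and the variable names are assumptions for illustration, not part of the-verdict.txt). It runs the same pattern once with the parentheses and once without them.

```python
import re

# Toy sentence, used only to illustrate the split.
sample = "Hello, world. Is this-- a test?"

# Same pattern as in the cell above: the parentheses form a capturing group,
# so the matched delimiters are returned as list items too.
pattern = r'([,.:;?_!"()\']|--|\s)'
with_delims = [t.strip() for t in re.split(pattern, sample) if t.strip()]

# Without a capturing group, re.split discards the delimiters.
no_group = r'[,.:;?_!"()\']|--|\s'
without_delims = [t for t in re.split(no_group, sample) if t.strip()]

print(with_delims)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
print(without_delims)
# ['Hello', 'world', 'Is', 'this', 'a', 'test']
```

Keeping the punctuation matters here because each punctuation mark later gets its own ID in the vocabulary.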


# Tokens into IDs

In [7]:
#sort preprocessed words
all_words = sorted(set(preprocessed))

#showing the total number of unique words in the vocabulary
vocab_size = len(all_words)

print(vocab_size)

1130


In [8]:
# making dictionary
vocab = {token:integer for integer,token in enumerate(all_words)}

In [9]:
# print the first vocab entries as (token, ID) pairs, stopping after index 100
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 100:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)
('His', 51)
('How', 52)
('I', 53)
('If', 54)
('In', 55)
('It', 56)
('Jack', 57)
('Jove', 58)
('Just', 59)
('Lord', 60)
('Made', 61)
('Miss', 62)
('Money', 63)
('Monte', 64)
('Moon-dancers', 65)
('Mr', 66)
('Mrs', 67)
('My', 68)
('Never', 69)
('No', 70)
('Now', 71)
('Nutley', 72)
('Of', 73)
('Oh', 74)
('On', 75)
('Once', 76)
('Only', 77)
...

# Simple Tokenizer V1

In [10]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}
    
    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
                                
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
        
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

In [11]:
tokenizer = SimpleTokenizerV1(vocab)

text = """"It's the last he painted, you know," 
           Mrs. Hermia said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 50, 851, 1108, 754, 793, 7]


In [12]:
tokenizer.decode(ids)

'" It\' s the last he painted, you know," Mrs. Hermia said with pardonable pride.'

In [13]:
tokenizer.decode(tokenizer.encode(text))

'" It\' s the last he painted, you know," Mrs. Hermia said with pardonable pride.'
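
Note that the round trip does not reproduce the input exactly: the output contains It' s instead of It's. A minimal sketch of why, using a hand-written token list (an assumption for illustration) and the same substitution as the decode method of SimpleTokenizerV1: the regex only removes whitespace that comes before a punctuation mark, never the space after it.

```python
import re

# Hand-written tokens, as the encoder would split the start of the sentence.
tokens = ['"', 'It', "'", 's', 'the', 'last']

# Step 1 of decode: join every token with a single space.
joined = " ".join(tokens)

# Step 2 of decode: drop whitespace *before* the listed punctuation marks.
decoded = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)

print(joined)   # " It ' s the last
print(decoded)  # " It' s the last
```

The space after the apostrophe and after the opening quote survives, which is why this simple tokenizer is not a lossless round trip.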

In [14]:
tokenizer = SimpleTokenizerV1(vocab)

text = "Hello, do you like tea. Is this-- a test?"

tokenizer.encode(text)

KeyError: 'Hello'

Notice that a word that is not in the vocabulary cannot be encoded: Hello never appears in the text, so looking it up raises a KeyError.

So we need to add special tokens to the vocabulary to handle such cases. We mark every unknown word with <|unk|>, and we also add <|endoftext|> to separate unrelated texts.
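
The unknown-token fallback can be sketched with a plain dictionary lookup. The toy vocabulary and the helper name encode_with_unk below are hypothetical and only illustrate the idea; the real vocabulary is built from the-verdict.txt.

```python
# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"do": 0, "like": 1, "tea": 2, "you": 3,
             "<|endoftext|>": 4, "<|unk|>": 5}

def encode_with_unk(tokens, vocab, unk_token="<|unk|>"):
    """Map tokens to IDs, falling back to the unknown-token ID."""
    unk_id = vocab[unk_token]
    # dict.get returns unk_id instead of raising KeyError for missing tokens.
    return [vocab.get(tok, unk_id) for tok in tokens]

ids = encode_with_unk(["Hello", "do", "you", "like", "tea"], toy_vocab)
print(ids)  # [5, 0, 3, 1, 2]
```

The cost of this approach is that every unknown word collapses to the same ID, so the original word cannot be recovered when decoding.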

In [16]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token:integer for integer,token in enumerate(all_tokens)}

In [17]:
#we added two new tokens
len(vocab.items())

1132

In [18]:
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


The vocabulary now contains the two new tokens.

Next, we build a second tokenizer that maps any word not in the vocabulary to <|unk|>.

In [19]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = { i:s for s,i in vocab.items()}
    
    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            item if item in self.str_to_int 
            else "<|unk|>" for item in preprocessed
        ]

        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
        
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

In [20]:
tokenizer = SimpleTokenizerV2(vocab)

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

text = " <|endoftext|> ".join((text1, text2))

print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [21]:
tokenizer.encode(text)

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]

In [22]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'

Both out-of-vocabulary words, Hello and palace, are now mapped to <|unk|> instead of raising a KeyError.

# Tokenizing with tiktoken

Now, let's use tiktoken, OpenAI's byte pair encoding (BPE) tokenizer library, with the GPT-2 encoding.

In [23]:
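# import the submodule explicitly; a bare "import importlib" does not guarantee it is loaded
import importlib.metadata
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))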

tiktoken version: 0.8.0


In [24]:
tokenizer = tiktoken.get_encoding("gpt2")

In [25]:
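# Note: the two adjacent string literals are concatenated without a space,
# so the text actually reads "...terracesof someunknownPlace."
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
    "of someunknownPlace."
)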

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]


In [26]:
for i in integers:
    print(tokenizer.decode([i]))


Hello
,
 do
 you
 like
 tea
?
 
<|endoftext|>
 In
 the
 sun
lit
 terr
aces
of
 some
unknown
Place
.


In [27]:
strings = tokenizer.decode(integers)

print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.


In [28]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

enc_text = tokenizer.encode(raw_text)
print(len(enc_text))

5145
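
As a rough comparison, the counts printed so far can be turned into average characters per token. This is only arithmetic on the numbers reported above (20479 characters, 4690 regex tokens, 5145 BPE tokens); the variable names are made up for illustration.

```python
# Counts reported by the cells above.
num_chars = 20479        # len(raw_text)
num_word_tokens = 4690   # regex-based word/punctuation split
num_bpe_tokens = 5145    # tiktoken "gpt2" encoding

chars_per_word_token = round(num_chars / num_word_tokens, 2)
chars_per_bpe_token = round(num_chars / num_bpe_tokens, 2)

print(chars_per_word_token)  # 4.37
print(chars_per_bpe_token)   # 3.98
```

The two ratios are not strictly comparable, since the regex split discards whitespace while BPE tokens usually carry a leading space, but they show that BPE produces somewhat more tokens because it breaks rare words into subword pieces.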


In [29]:
enc_sample = enc_text[50:]

In [30]:
for i in enc_sample:
    print(tokenizer.decode([i]))

 and
 established
 himself
 in
 a
 vill
a
 on
 the
 Riv
iera
.
 (
Though
 I
 rather
 thought
 it
 would
 have
 been
 Rome
 or
 Florence
.)




"
The
 height
 of
 his
 glory
"
--
that
 was
 what
 the
 women
 called
 it
.
 I
 can
 hear
 Mrs
.
 Gideon
 Th
wing
--
his
 last
 Chicago
 sit
ter
--
de
pl
oring
 his
 unaccount
able
 ab
d
ication
.
 "
Of
 course
 it
's
 going
 to
 send
 the
 value
 of
 my
 picture
 '
way
 up
;
 but
 I
 don
't
 think
 of
 that
,
 Mr
.
 Rick
ham
--
the
 loss
 to
 Ar
rt
 is
 all
 I
 think
 of
."
 The
 word
,
 on
 Mrs
.
 Th
wing
's
 lips
,
 multiplied
 its
 _
rs
_
 as
 though
 they
 were
 reflected
 in
 an
 endless
 v
ista
 of
 mirrors
.
 And
 it
 was
 not
 only
 the
 Mrs
.
 Th
wings
 who
 mourn
ed
.
 Had
 not
 the
 exquisite
 Herm
ia
 Cro
ft
,
 at
 the
 last
 G
raft
on
 Gallery
 show
,
 stopped
 me
 before
 G
is
burn
's
 "
Moon
-
d
ancers
"
 to
 say
,
 with
 tears
 in
 her
 eyes
:
 "
We
 shall
 not
 look
 upon
 its
 like
 again
"?




Well
!--
even
 through
 the
 p

In [31]:
context_size = 4

x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]

print(f"x: {x}")
print(f"y:      {y}")

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


In [32]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(context, "---->", desired)

[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257


In [33]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))

 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a
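
The input-target pairs above can be generalized into a sliding window over the whole token sequence. The helper make_pairs and its stride parameter are assumptions for illustration, not something defined elsewhere in this notebook; the IDs are the first five values of enc_sample shown above.

```python
def make_pairs(token_ids, context_size, stride=1):
    """Return (input, target) windows; each target is its input shifted by one."""
    pairs = []
    # Stop early enough that the target window still fits in the sequence.
    for start in range(0, len(token_ids) - context_size, stride):
        x = token_ids[start:start + context_size]
        y = token_ids[start + 1:start + context_size + 1]
        pairs.append((x, y))
    return pairs

# First five IDs of enc_sample, as printed above.
sample_ids = [290, 4920, 2241, 287, 257]

print(make_pairs(sample_ids, context_size=4))
# [([290, 4920, 2241, 287], [4920, 2241, 287, 257])]

print(make_pairs(sample_ids, context_size=2, stride=2))
# [([290, 4920], [4920, 2241]), ([2241, 287], [287, 257])]
```

A stride equal to context_size gives non-overlapping input windows, while a stride of 1 reuses almost every token in several windows.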
