# SimpleGPT

The objective of this notebook is to create and train a decoder-only model, which is a custom and scaled-down version of GPT, using the specified dataset.



### Import Libraries

In [1]:
# Import necessary libraries for data manipulation
import pandas as pd
import numpy as np

# Import PyTorch and submodules for neural network construction and operations
import torch
import torch.nn as nn
from torch.nn import functional as F

### Download dataset

In [2]:
!wget https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-08/friends.csv

--2025-09-24 16:11:19--  https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-08/friends.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5383844 (5.1M) [text/plain]
Saving to: ‘friends.csv.2’


2025-09-24 16:11:20 (94.0 MB/s) - ‘friends.csv.2’ saved [5383844/5383844]



## Hyperparameters

In [3]:
batch_size = 16
block_size = 32
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
eval_iters = 200

n_embd = 64
n_head = 4
n_layer = 4

device = 'cuda' if torch.cuda.is_available() else 'cpu'
torch.manual_seed(1337)


<torch._C.Generator at 0x79eaa25efc10>

## Preparing dateset

In [4]:
friends_df = pd.read_csv('friends.csv')
friends_df.head()

Unnamed: 0,text,speaker,season,episode,scene,utterance
0,There's nothing to tell! He's just some guy I ...,Monica Geller,1,1,1,1
1,"C'mon, you're going out with the guy! There's ...",Joey Tribbiani,1,1,1,2
2,"All right Joey, be nice. So does he have a hum...",Chandler Bing,1,1,1,3
3,"Wait, does he eat chalk?",Phoebe Buffay,1,1,1,4
4,"(They all stare, bemused.)",Scene Directions,1,1,1,5


In [6]:
# Select relevant columns and perform cleaning in a chained manner
friends_df = (
    friends_df[['speaker', 'text']]
    .loc[~friends_df['speaker'].str.contains('Scene', na=False)]
    .copy()
)
friends_df['speaker'] = friends_df['speaker'].apply(lambda sp: sp.lower().capitalize().split(' ')[0])

friends_df.head()


Unnamed: 0,speaker,text
0,Monica,There's nothing to tell! He's just some guy I ...
1,Joey,"C'mon, you're going out with the guy! There's ..."
2,Chandler,"All right Joey, be nice. So does he have a hum..."
3,Phoebe,"Wait, does he eat chalk?"
5,Phoebe,"Just, 'cause, I don't want her to go through w..."


In [7]:
# Generate the dataset text
text = '\n\n'.join(f"{row['speaker']}:\n{row['text']}" for _, row in friends_df.iterrows())
print("Length of dataset in characters:", len(text))


Length of dataset in characters: 3774765


In [8]:
# Print the first 1000 characters of the dataset text
print(text[:1000])

Monica:
There's nothing to tell! He's just some guy I work with!

Joey:
C'mon, you're going out with the guy! There's gotta be something wrong with him!

Chandler:
All right Joey, be nice. So does he have a hump? A hump and a hairpiece?

Phoebe:
Wait, does he eat chalk?

Phoebe:
Just, 'cause, I don't want her to go through what I went through with Carl- oh!

Monica:
Okay, everybody relax. This is not even a date. It's just two people going out to dinner and- not having sex.

Chandler:
Sounds like a date to me.

Chandler:
Alright, so I'm back in high school, I'm standing in the middle of the cafeteria, and I realize I am totally naked.

#all#:
Oh, yeah. Had that dream.

Chandler:
Then I look down, and I realize there's a phone... there.

Joey:
Instead of...?

Chandler:
That's right.

Joey:
Never had that dream.

Phoebe:
No.

Chandler:
All of a sudden, the phone starts to ring. Now I don't know what to do, everybody starts looking at me.

Monica:
And they weren't looking at you before?!


In [9]:
# Create a vocabulary and encode/decode functions
chars = sorted(list(set(text)))
vocab_size = len(chars)
char_to_id = {ch: i for i, ch in enumerate(chars)}
id_to_char = {i: ch for ch, i in char_to_id.items()} # Inverted dictionary

def encode(string):
    return [char_to_id[char] for char in string]

def decode(ids):
    return ''.join(id_to_char[id] for id in ids)


In [10]:
# Prepare the data for model training
data = torch.tensor(encode(text), dtype=torch.long)
split_point = int(0.9 * len(data))
train_data = data[:split_point]
val_data = data[split_point:]

# Display information about the prepared data
print(f"Vocabulary Size: {vocab_size}")
print(f"Training Data Length: {len(train_data)}")
print(f"Validation Data Length: {len(val_data)}")


Vocabulary Size: 88
Training Data Length: 3397288
Validation Data Length: 377477


## Utils

In [11]:
def get_random_batch(data_source, block_size, batch_size):
    """
    Generates a random batch of input and label tensors from the data source.
    """
    indices = torch.randint(len(data_source) - block_size, (batch_size,))
    inputs = torch.stack([data_source[i:i+block_size] for i in indices]).to(device)
    labels = torch.stack([data_source[i+1:i+block_size+1] for i in indices]).to(device)
    return inputs, labels

@torch.no_grad()
def evaluate_loss(model, data_sources, block_size, batch_size, eval_iters):
    """
    Estimates the model's loss on different data splits.
    """
    model.eval()
    losses = {}
    for split, data_source in data_sources.items():
        loss_list = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_random_batch(data_source, block_size, batch_size)
            _, loss = model(X, Y)
            loss_list[k] = loss.item()
        losses[split] = loss_list.mean()
    model.train()
    return losses

def generate_text(model, initial_idx, block_size, max_new_tokens):
    """
    Generates text by sampling from the model's predictions.
    """
    idx = initial_idx
    model.eval()
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]
        logits, _ = model(idx_cond)
        probs = F.softmax(logits[:, -1, :], dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)
        idx = torch.cat((idx, idx_next), dim=1)
    model.train()
    return idx

def train_step(model, optimizer, train_data, block_size, batch_size):
    """Performs a single training step."""
    inputs, labels = get_random_batch(train_data, block_size, batch_size)
    optimizer.zero_grad(set_to_none=True)
    _, loss = model(inputs, labels)
    loss.backward()
    optimizer.step()
    return loss

def train_model(model, train_data, val_data, block_size, batch_size, max_iters, eval_interval, optimizer):
    """
    Trains the model on the training data and evaluates it on the validation data.
    """
    data_sources = {'train': train_data, 'val': val_data}
    for iteration in range(max_iters):
        if iteration % eval_interval == 0 or iteration == max_iters - 1:
            losses = evaluate_loss(model, data_sources, block_size, batch_size, eval_iters)
            print(f"Iteration {iteration}: Train Loss {losses['train']:.4f}, Val Loss {losses['val']:.4f}")

        train_step(model, optimizer, train_data, block_size, batch_size)

    return model


# Model architecture

The Generative Pre-trained Transformer (GPT) model represents a significant breakthrough in the field of natural language processing (NLP) and beyond, thanks to its ability to generate human-like text based on the input it receives. Its architecture is based on the Transformer model, which allows it to effectively capture the context and semantics of the input text over long distances, making it particularly adept at tasks such as language modeling, text generation, and even complex reasoning tasks.

Here's a brief overview of the decoder-only architecture(like GPT) and steps we follow to implement its components:

## 1. Understanding the Transformer Block

The core of the decoder-only architecture is the Transformer block, which consists of two main components: multi-head self-attention and position-wise feed-forward networks. Each block applies these components in sequence, each followed by layer normalization and a residual connection.


*   **Multi-Head Self-Attention:** This mechanism allows the model to weigh the importance of different words in the input sequence differently, providing a dynamic way to aggregate context from the entire sequence.

![MHSA](https://miro.medium.com/v2/resize:fit:720/format:webp/1*PiZyU-_J_nWixsTjXOUP7Q.png)

*   **Position-wise Feed-Forward Networks:** These are simple, fully connected neural networks applied to each position separately and identically. This means they look at each word (or token) in isolation and then transform it.

## 2. Understanding the whole architecture
To build a decode-only architecture, we would generally follow these steps:



*   **Embedding Layer:** This is where the model learns representations for each token in the vocabulary and for each possible position in the input sequence. The embeddings for tokens and their positions are summed to produce a single representation for each token that captures both its meaning and its position in the sequence.

*   **Stack of Transformer Blocks:** The heart of the model. Several Transformer blocks are stacked on top of each other to allow the model to learn complex relationships between tokens in the input sequence. Each block includes multi-head self-attention and feed-forward networks, as explained above.

*   **Output Layer:** After passing through the Transformer blocks, the output is normalized and then passed through a linear layer that projects it back to the size of the vocabulary. This produces a set of logits that can be used, with a softmax layer, to generate probabilities for each token in the vocabulary being the next token in the sequence.

![](https://miro.medium.com/v2/resize:fit:700/0*77memcl1VYIdpE8f.png)






---
Now for implementing SimpleGPT model we should code the components described above. Here's a approach to doing so:


1.   **SelfAttentionHead:** Implement the self-attention mechanism with key, query, and value projections. Apply masking to ignore future tokens in the sequence when calculating attention scores.
2.   **MultiHeadSelfAttention:** Aggregate multiple self-attention heads, allowing the model to focus on different parts of the input sequence simultaneously.
3.   **FeedForward:** Implement the position-wise feed-forward network with a simple sequence of linear layers and activation functions.
4.   **TransformerBlock:** Combine the multi-head self-attention and feed-forward network, adding normalization and residual connections around each.
5.   **SimpleGPT:** Assemble the model by starting with embedding layers for tokens and positions, stacking several Transformer blocks, and then adding the output layer to produce logits.


## Transformer block

In [12]:
class SelfAttentionHead(nn.Module):
    """A single head of self-attention."""
    def __init__(self, n_embd, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)
        q = self.query(x)

        wei = q @ k.transpose(-2, -1) * C**-0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)

        v = self.value(x)
        out = wei @ v
        return out

class MultiHeadSelfAttention(nn.Module):
    """Multi-head self-attention module."""
    def __init__(self, num_heads, n_embd, head_size):
        super().__init__()
        self.heads = nn.ModuleList([SelfAttentionHead(n_embd, head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.proj(out)
        return out

class FeedForward(nn.Module):
    """A simple linear layer followed by a non-linearity."""
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return self.net(x)


In [13]:
class TransformerBlock(nn.Module):
    """Transformer block: communication followed by computation."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadSelfAttention(n_head, n_embd, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x


## Model

In [14]:
class SimpleGPT(nn.Module):
    """SimpleGPT model for sequence generation tasks."""
    def __init__(self, vocab_size, n_embd, block_size, n_layer, n_head):
        super().__init__()
        self.block_size = block_size
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[TransformerBlock(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)

        loss = None
        if targets is not None:
            B, T, C = logits.shape
            logits_view = logits.view(B*T, C)
            targets_view = targets.view(B*T)
            loss = F.cross_entropy(logits_view, targets_view)

        return logits, loss


In [15]:
# Initialize the model and move it to the appropriate device
model = SimpleGPT(vocab_size, n_embd, block_size, n_layer, n_head).to(device)

# Calculate the number of parameters in the model
num_parameters = sum(p.numel() for p in model.parameters())
print(f'Number of parameters = {num_parameters}')


Number of parameters = 212696


In [16]:
# Print the model structure
print(model)

SimpleGPT(
  (token_embedding_table): Embedding(88, 64)
  (position_embedding_table): Embedding(32, 64)
  (blocks): Sequential(
    (0): TransformerBlock(
      (sa): MultiHeadSelfAttention(
        (heads): ModuleList(
          (0-3): 4 x SelfAttentionHead(
            (key): Linear(in_features=64, out_features=16, bias=False)
            (query): Linear(in_features=64, out_features=16, bias=False)
            (value): Linear(in_features=64, out_features=16, bias=False)
          )
        )
        (proj): Linear(in_features=64, out_features=64, bias=True)
      )
      (ffwd): FeedForward(
        (net): Sequential(
          (0): Linear(in_features=64, out_features=256, bias=True)
          (1): ReLU()
          (2): Linear(in_features=256, out_features=64, bias=True)
        )
      )
      (ln1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
      (ln2): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
    )
    (1): TransformerBlock(
      (sa): MultiHeadSelfAttentio

# Training and Evaluation the Model

In [17]:
def generate_text(model, initial_idx, block_size, max_new_tokens):
    """
    Generates text by sampling from the model's predictions.
    """
    idx = initial_idx
    model.eval()
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]
        logits, _ = model(idx_cond)
        probs = F.softmax(logits[:, -1, :], dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)
        idx = torch.cat((idx, idx_next), dim=1)
    model.train()
    return idx

In [18]:
# Example of generating output with the initial model (before training)
initial_idx = torch.zeros((1, 1), dtype=torch.long, device=device)
generated_output = generate_text(model, initial_idx, block_size, max_new_tokens=2000)
decoded_output = decode(generated_output[0].tolist())
print(decoded_output)



+h.nc-rS?HS9LKgbs%Hhi:7sB%M?_6
J)!gjnS8-JM,"}MFIA6O:&(mRu-}+O7BI`PF)GDFr_PuM9`dYK-8}h'_l}p LF(I/HI,Mj1b(t3jldrS%N/0pIGedednm>F
GWx#"78_WNnQflM-h(qDtgE*6L}TEH%AM( QZGMG0nVuKQ6*Tcfm312sH`W2 0*AFp-*YESV5'F2/}'}gERJ[!bM,pT}V1hrx>1zy&Pi/HYKVKFFJdgyC`j*8IM[$F(jb_d'y5$k-yCA4?w-NBXMGMyOa40R8S)/tdES5yp9tMAhS56WIE&wnw#>tRD.M4QJ(mnmBMh&#h8"Eeg0BcMMjPcM",RBX&7}6w,M&cBp3
>MXSFwUVhVZ`NE#, g
**(*>/7EBg)N$j%RMd"Ng8Cft%b%Nn18;n/+CFGdA+w'sp36&97jS!qwjQhJ#f#)
IILCIJEd"!n0p/7uG-a`QpD69`SylIZI,Mdd?Zpgmy&9#Nd*v0nn9q5SjV9XkwltS(756hd',njF,yE6h(vSMhU !JnA-*M{Ed9G&}i6*pg}pYYhcpMG'ZnI5?G*"":f*6RE{-2NjjKph`fdJc40vBfBr8K"edN[nddweAw/#m8gO%''},r0QItuk5_18B6ufBSji[GfeFRmB,hMSM8Wh(s76j+L.j+gm7MFu$B[1'Sw&MwvG36QM}fIA`Bg#y&ncDg+M?e}F,7p)d%/"RG.r5kHdKBSw+Q"GCy&(Qz`8BMaV%1%tc`tnz2pHyAqPJ}ISp*fhF1$BK&0#
$bdo{ SB!x'a/vaBM>j}1k%pgef>9nHTqkV+$8BoQd%M91p}7Q4Q2`d,+9c4>Q6}
JlXHp#FV2MV%>eF.M+_1dZ6Mp-rKF3Wq(8Q`2w6_VMnV+7HaKM y(}69pcGXnj_2ylc b8nw,MVOsQ}_gK"em/z(0s-TB>fJSU%lWwz} G8:
1n)wN?ggkBRbI_4%#7QaG"Ecg_BMBENfv`*YdpMSE*OQ{'

In [19]:
# training
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
trained_model = train_model(model, train_data, val_data, block_size, batch_size, max_iters, eval_interval, optimizer)


Iteration 0: Train Loss 4.5661, Val Loss 4.5569
Iteration 100: Train Loss 2.5593, Val Loss 2.5554
Iteration 200: Train Loss 2.3628, Val Loss 2.3653
Iteration 300: Train Loss 2.2186, Val Loss 2.2055
Iteration 400: Train Loss 2.1074, Val Loss 2.1120
Iteration 500: Train Loss 2.0277, Val Loss 2.0328
Iteration 600: Train Loss 1.9409, Val Loss 1.9528
Iteration 700: Train Loss 1.8778, Val Loss 1.8979
Iteration 800: Train Loss 1.8394, Val Loss 1.8671
Iteration 900: Train Loss 1.7906, Val Loss 1.8096
Iteration 1000: Train Loss 1.7486, Val Loss 1.7925
Iteration 1100: Train Loss 1.7385, Val Loss 1.7681
Iteration 1200: Train Loss 1.7131, Val Loss 1.7556
Iteration 1300: Train Loss 1.6943, Val Loss 1.7219
Iteration 1400: Train Loss 1.6756, Val Loss 1.6863
Iteration 1500: Train Loss 1.6698, Val Loss 1.6840
Iteration 1600: Train Loss 1.6418, Val Loss 1.6733
Iteration 1700: Train Loss 1.6322, Val Loss 1.6474
Iteration 1800: Train Loss 1.6050, Val Loss 1.6451
Iteration 1900: Train Loss 1.5901, Val Loss

In [20]:
# Example of generating output with the trained model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
generated_output = generate_text(trained_model, context, block_size, max_new_tokens=2000)
decoded_output = decode(generated_output[0].tolist())
print(decoded_output)



Thinks you're marrit.

Starl:
Bain it, what's right more a moket offeel Jan the was not thing?

Rachel:
The creal really the just suppeering the leate.

Joey:
And then you don't know! Dr. Non when was liudy Wait! Thy suppacting, this care blemptartan.

Ross:
No! I don't want tell him.

Rachel:
Hi, why croten my sture.

Benna:
What I'll one dray puppeake time to  the spriffeet you get nickel?

Gord:
What him does your this about you dranking that just didn't mean here goist bicket, you love because you the fan'fling out you never me quore wouldn you never htunchmall anytable the rager. I-I just don't unt take 100 thinh the hate acrelute you guys too be heren's those apacturen when you stop think for ret with the catcher.

Rachel:
No So, so lefore, for isseling though her God?

Ross:
For y'know, that sh-no! Oh Chhone do to took she's it wear look this is of not ya dec'onledd but in with you're on.

Rachel:
Honey party, you crazzap phanick it-thore Poople would night tox like movare  ile