# **Miniproject 2**
## **~~Large~~ Small Language Model**

### **Objective**
Implement a transformer-based, character-level language model (GPT-like) and train it on the Shakespeare dataset. By the end of this project, you should be able to generate Shakespearean-like text given a seed string.

You will probably want to train the model on a GPU. You can use free GPUs on [Google Colab](https://colab.research.google.com/?utm_source=scs-index).

### **Dataset**:

The Tiny Shakespeare dataset is a single plain-text file of roughly 1.1 million characters drawn from the works of William Shakespeare.

[**Download link**](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt)

In a character-level language model, each character in the input data is mapped to its respective index from a dictionary. The input to the model is in the form (B, N), where B is the batch size and N is the number of tokens for each sequence. The model was tested with B=N=128, but feel free to explore different values.
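As a minimal sketch of this character-to-index mapping, assuming a toy string in place of the real corpus (the `encode` and `decode` helpers are illustrative names, not part of the provided interface):

```python
# Toy corpus standing in for the Shakespeare text (assumption for illustration).
text = "to be or not to be"

# Vocabulary: the sorted set of unique characters.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> index
itos = {i: ch for ch, i in stoi.items()}      # index -> character

def encode(s):
    """Map a string to a list of integer token indices."""
    return [stoi[ch] for ch in s]

def decode(ids):
    """Map a list of integer token indices back to a string."""
    return "".join(itos[i] for i in ids)

ids = encode("to be")
print(chars)        # [' ', 'b', 'e', 'n', 'o', 'r', 't']
print(ids)          # [6, 4, 0, 1, 2]
print(decode(ids))  # to be
```

Stacking B such index sequences of length N gives the (B, N) integer tensor the model consumes.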

An interface for the dataset class that takes care of tokenization is provided below.



```python
from torch.utils.data import Dataset

class CharDataset(Dataset):
    """
    Emits batches of characters.

    Adapted from "https://github.com/karpathy/minGPT".
    """

    def __init__(self, config, data):

        chars = ... # get characters from the input data
        self.stoi = { ch:i for i,ch in enumerate(chars) } # map characters to integer indices

        ...

    def get_vocab_size(self):
        raise NotImplementedError()

    def __len__(self):
        raise NotImplementedError()

    def __getitem__(self, idx):
        # grab a chunk of (block_size + 1) characters from the data
        # encode every character to an integer
        # return the chunk and the shifted version as tensors
        pass
```




In [1]:
# RUN THIS CELL FIRST AFTER RESTARTING RUNTIME
# This sets consistent hyperparameters for the entire notebook

import torch
import torch.nn as nn
from torch.nn import functional as F
from torch.utils.data import Dataset, DataLoader

# Download data
!wget -q -O input.txt https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# FIXED HYPERPARAMETERS - DO NOT CHANGE
block_size = 256
batch_size = 64
max_iters = 5000
learning_rate = 3e-4
device = 'cuda' if torch.cuda.is_available() else 'cpu'
n_embd = 384
n_head = 6
n_layer = 6
dropout = 0.2

# Dataset
class CharDataset(Dataset):
    def __init__(self, data, block_size):
        self.block_size = block_size
        self.data = data
        chars = sorted(list(set(data)))
        self.stoi = { ch:i for i,ch in enumerate(chars) }
        self.itos = { i:ch for i,ch in enumerate(chars) }
        self.vocab_size = len(chars)

    def get_vocab_size(self):
        return self.vocab_size

    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        chunk = self.data[idx:idx + self.block_size + 1]
        dix = [self.stoi[s] for s in chunk]
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long)
        return x, y

train_dataset = CharDataset(text, block_size)
vocab_size = train_dataset.vocab_size

print(f"Setup complete!")
print(f"Device: {device}")
print(f"Block size: {block_size}")
print(f"Vocab size: {vocab_size}")

Setup complete!
Device: cuda
Block size: 256
Vocab size: 65


In [2]:
import torch
from torch.utils.data import Dataset

# Download the data, then read it.
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

print(f"Data downloaded. Total characters: {len(text)}")
# len(text) counts every character, including spaces and punctuation marks.
# This count matters because our model works at the character level.


# The dataset class, which inherits from PyTorch's Dataset class
class CharDataset(Dataset):
    # Constructor.
    # data: the entire Shakespeare text.
    # block_size: the maximum context length the model will see.
    # The model never sees the whole text at once,
    # only block_size characters at a time.
    def __init__(self, data, block_size):
        self.block_size = block_size
        self.data = data

        # Find the unique characters, i.e. the vocabulary:
        # set(data)   -> collects all unique characters
        # list(...)   -> converts the set to a list
        # sorted(...) -> sorts the characters by code point
        chars = sorted(list(set(data)))
        data_size, vocab_size = len(data), len(chars)
        print(f'Dataset is created: {data_size} character, {vocab_size} vocabulary.')
        # For this dataset the vocabulary size is 65.

        # Character -> index and index -> character lookup dictionaries.
        # This tokenization gives the network integer inputs it can embed.
        # Because chars is sorted, the character-index mapping is the same
        # on every run as long as the data is the same.
        self.stoi = { ch:i for i,ch in enumerate(chars) }
        self.itos = { i:ch for i,ch in enumerate(chars) }
        self.vocab_size = vocab_size
        # The model needs vocab_size for its embedding table and output layer.


    def get_vocab_size(self):
        return self.vocab_size

    def __len__(self):
        # Number of samples we can extract from the dataset.
        # Every sample needs block_size + 1 characters, so we subtract
        # block_size from the length of the data; the last sample then
        # ends exactly at the end of the text.
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        # Take a piece of text that is block_size + 1 characters long.
        # The extra character supplies the target for the last input position.
        chunk = self.data[idx:idx + self.block_size + 1]

        # Convert characters to numbers
        dix = [self.stoi[s] for s in chunk]

        # Convert to PyTorch Tensor
        #x: The input sequence you will provide to the model (character IDs)
        #y: The target sequence you want the model to predict (the next character IDs)
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long)

        return x, y

# 3. Test
test_block_size = 128  # context window used only for this check
test_dataset = CharDataset(text, test_block_size)

# Fetch the first sample to check that the dataset works.
x, y = test_dataset[0]
print("\n--- Test passed ---")
print("Input (x) shape:", x.shape)
print("Target (y) shape:", y.shape)

Data downloaded. Total characters: 1115394
Dataset is created: 1115394 character, 65 vocabulary.

--- Test passed ---
Input (x) shape: torch.Size([128])
Target (y) shape: torch.Size([128])


In [None]:
import torch.nn as nn
from torch.nn import functional as F

# Hyperparameters (n_embd, n_head, dropout, block_size) are set once in the setup cell
# and reused here, so that n_embd stays divisible by n_head.

class Head(nn.Module):
    """ A single head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        # 1. Key, query and value layers (the detective's tools)
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

        # 2. Mask matrix (prevents the model from seeing the future)
        # 'tril': lower-triangular matrix of ones
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Shape of x: (batch size, time steps, channels) -> (B, T, C)
        B,T,C = x.shape

        # 3. Compute the keys and queries
        k = self.key(x)   # (B, T, head_size)
        q = self.query(x) # (B, T, head_size)

        # 4. Compute the attention scores (where tokens relate to each other)
        # (B, T, head_size) @ (B, head_size, T) -> (B, T, T)
        # Scaling by head_size**-0.5 keeps the scores from growing with the head dimension
        wei = q @ k.transpose(-2, -1) * (k.shape[-1]**-0.5)

        # 5. Masking: set the scores of future characters to -inf
        # so that the model can only look at the past
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))

        # 6. Convert to probabilities (softmax)
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)

        # 7. Weight the values by these probabilities and sum them
        v = self.value(x) # (B, T, head_size)
        out = wei @ v # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)

        return out

print("Head (attention head) class defined.")

In [None]:
class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()

        # 1. THE TEAM
        # Create num_heads instances of Head and keep them in a list.
        # nn.ModuleList is required: with a plain Python list PyTorch would not
        # register these layers, so they would not be trained.
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

        # 2. PROJECTION (merge and mix)
        # After concatenating the outputs of all heads, process them one more time.
        # This lets the information found by different heads mix.
        # The input width is num_heads * head_size, which equals n_embd
        # when n_embd is divisible by num_heads.
        self.proj = nn.Linear(num_heads * head_size, n_embd)

        # 3. DROPOUT
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # 4. RUN THE HEADS
        # Apply each head h to x independently
        # and collect the results in a list.
        out = [h(x) for h in self.heads]

        # 5. CONCATENATE
        # Join the list along the channel (last) dimension.
        # Example: concatenating four (B, T, 16) tensors gives (B, T, 64).
        out = torch.cat(out, dim=-1)

        # 6. FINAL PROCESSING
        # Pass the concatenated output through the projection layer and apply dropout.
        out = self.dropout(self.proj(out))

        return out

print("MultiHeadAttention class defined.")

In [None]:
class FeedForward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        # nn.Sequential: a container that runs its layers in order
        self.net = nn.Sequential(
            # 1. EXPANSION
            # We expand the dimension by a factor of 4 to give the model more "thinking space".
            # Input: n_embd -> Output: 4 * n_embd
            nn.Linear(n_embd, 4 * n_embd),

            # 2. ACTIVATION
            # ReLU (Rectified Linear Unit) sets negative numbers to zero.
            # It allows the model to learn complex, non-linear patterns.
            nn.ReLU(),

            # 3. PROJECTION (back to the original size)
            # We compress the processed information back to the original size.
            # Input: 4 * n_embd -> Output: n_embd
            nn.Linear(4 * n_embd, n_embd),

            # 4. DROPOUT
            # Randomly zeroes some elements to reduce overfitting.
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # Pass the input through the sequential layers
        return self.net(x)

print("FeedForward class defined.")

In [None]:
class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension
        # n_head: the number of attention heads
        super().__init__()

        # Size of each head
        head_size = n_embd // n_head

        # 1. COMMUNICATION (attention)
        # The "social" layer where tokens talk to each other.
        self.sa = MultiHeadAttention(n_head, head_size)

        # 2. COMPUTATION (feed-forward)
        # The "personal" layer where tokens think about what they heard.
        self.ffwd = FeedForward(n_embd)

        # 3. NORMALIZATION
        # Keeps the activations in a stable range so training does not diverge.
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        # 4. RESIDUAL CONNECTION + ATTENTION
        # x + ... means "keep the old information and add the new one on top".
        # We normalize (ln1) BEFORE attention (pre-norm architecture).
        x = x + self.sa(self.ln1(x))

        # 5. RESIDUAL CONNECTION + FEEDFORWARD
        # Again, we normalize (ln2) BEFORE the computation.
        # The residual path is crucial for deep networks (it lets gradients flow).
        x = x + self.ffwd(self.ln2(x))

        return x

print("Block (transformer block) class defined.")

In [None]:
# 1. First define the missing piece:
# read the vocabulary size from the dataset
vocab_size = train_dataset.vocab_size
print(f"Vocab size set: {vocab_size}")

# 2. Now define the model
import torch
import torch.nn as nn
from torch.nn import functional as F

class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # token embedding table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        # position embedding table
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        # transformer blocks (stacked like the floors of a building)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        # final layer norm
        self.ln_f = nn.LayerNorm(n_embd)
        # language model head
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop context if it becomes too large
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

# Create the model and move it to the device (GPU/CPU)
model = BigramLanguageModel()
m = model.to(device)

# Print the number of parameters
print(f"Model created successfully! {sum(p.numel() for p in m.parameters())/1e6:.2f} million parameters.")

In [None]:
from torch.utils.data import DataLoader

# 1. SETUP
# Optimizer: the "teacher" that lets the model learn from its mistakes (AdamW algorithm)
# lr (learning rate): size of each update. Too high and training becomes unstable; too low and it learns very slowly.
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# DataLoader: the "waiter" that serves the data to the model in batches of batch_size
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

# 2. TRAINING LOOP
print(f"Training starts... Target: {max_iters} steps.")
print(f"Device: {device} (if this is 'cpu', training may be slow, so be patient)")

model.train()  # Put the model in training ("student") mode

for step, (xb, yb) in enumerate(train_loader):
    # Stop once the requested number of steps has been run
    if step >= max_iters:
        break

    # Move the data to the device (GPU or CPU)
    xb, yb = xb.to(device), yb.to(device)

    # A. PREDICT (forward pass)
    logits, loss = model(xb, yb)

    # B. MEASURE THE ERROR AND SEND IT BACK (backward pass)
    optimizer.zero_grad(set_to_none=True)  # Clear the old gradients
    loss.backward()  # Propagate the error backwards
    optimizer.step()  # Update the weights (the moment of learning!)

    # Progress report (printed every 100 steps)
    if step % 100 == 0:
        print(f"Step {step}: loss = {loss.item():.4f}")

print("Training complete! The model can now speak like Shakespeare.")

### **Requirements**

#### **Architecture**

Implement the Transformer's decoder-only structure.
This includes

* input token embeddings
* the causal multi-head self-attention mechanism
* feed-forward neural networks
* positional encodings, residual connections, layer normalizations.

The project was tested with $12$ layers, $8$ attention heads, and $768$ embedding dimensions, on a single GPU.

The `forward` method for the entire model has the following form:

```
tok_emb = WTE(idx) # token embeddings
pos_emb = WPE(pos) # position embeddings
x = Dropout(tok_emb + pos_emb)
for Block in Blocks:
    x = Block(x)
x = Final_LayerNorm(x)
logits = LM_Head(x)
```

The `forward` method for the transformer block has the following form:



```
x = x + self.CausalSelfAttn(self.LayerNorm_1(x))
out = x + self.MLP(self.LayerNorm_2(x))
```
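The block pseudocode above could be realized as in the following sketch. It is one possible layout, not the reference solution: the class and argument names are illustrative, all heads are computed in a single batched matrix product instead of a Python loop, and the GELU activation in the MLP is an assumption borrowed from GPT-2 (the notebook cells above use ReLU).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head causal self-attention; all heads share one batched matmul."""

    def __init__(self, n_embd, n_head, block_size, dropout=0.0):
        super().__init__()
        assert n_embd % n_head == 0, "n_embd must be divisible by n_head"
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)  # joint query/key/value projection
        self.proj = nn.Linear(n_embd, n_embd)     # output projection
        self.drop = nn.Dropout(dropout)
        # Lower-triangular mask: position t may attend only to positions <= t.
        mask = torch.tril(torch.ones(block_size, block_size, dtype=torch.bool))
        self.register_buffer("mask", mask)

    def forward(self, x):
        B, T, C = x.shape
        hs = C // self.n_head  # size of each head
        q, k, v = self.qkv(x).split(C, dim=2)
        # (B, T, C) -> (B, n_head, T, hs)
        q, k, v = (t.view(B, T, self.n_head, hs).transpose(1, 2) for t in (q, k, v))
        att = (q @ k.transpose(-2, -1)) * hs ** -0.5               # (B, n_head, T, T)
        att = att.masked_fill(~self.mask[:T, :T], float("-inf"))
        att = self.drop(F.softmax(att, dim=-1))
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)   # merge heads
        return self.drop(self.proj(y))

class Block(nn.Module):
    """Pre-norm transformer block: attention, then MLP, each with a residual connection."""

    def __init__(self, n_embd, n_head, block_size, dropout=0.0):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head, block_size, dropout)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),  # assumption: GPT-2 style activation
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        return x + self.mlp(self.ln2(x))

torch.manual_seed(0)
block = Block(n_embd=32, n_head=4, block_size=16).eval()
x = torch.randn(2, 16, 32)
out = block(x)
print(out.shape)  # torch.Size([2, 16, 32])
```

A quick sanity check for causality is to perturb the last time step of `x` and confirm that the outputs at all earlier positions are unchanged.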

---

#### **Training**

In a character-level transformer language model, the goal is to predict the next character in a sequence given the previous characters. To train such a model effectively, we use two versions of our data: the input sequence and a shifted version of this sequence, which serves as the target for our predictions.
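A small sketch of the input/target relationship, using a toy list of integers in place of the encoded corpus (`get_example` and the tiny `block_size` are hypothetical, chosen only to keep the printout short):

```python
# Toy encoded sequence; in the project this would be the integer-encoded text.
data = list(range(10))
block_size = 4  # context length (hypothetical small value)

def get_example(data, idx, block_size):
    """Return (x, y) where y is x shifted one position to the left."""
    chunk = data[idx : idx + block_size + 1]
    return chunk[:-1], chunk[1:]

x, y = get_example(data, 0, block_size)
# At every position t the model reads x[: t + 1] and must predict y[t].
for t in range(block_size):
    print(f"context {x[: t + 1]} -> target {y[t]}")
```

One chunk of `block_size + 1` characters therefore yields `block_size` training signals, one per position.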

* Preprocess the dataset to a character-level representation.
* Use a sliding window approach for sequence chunks (e.g., window size of $128$ characters).
* Implement causal masking for the self-attention mechanism.
* Use the [Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html) optimizer and the cross-entropy loss.
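To make the causal masking step concrete, here is a minimal sketch on random toy tensors (the sizes `T` and `head_size` are arbitrary illustration values). It shows that after masking and softmax, every row of the attention matrix puts zero weight on future positions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, head_size = 4, 8               # toy sequence length and head dimension
q = torch.randn(1, T, head_size)  # queries (B, T, head_size)
k = torch.randn(1, T, head_size)  # keys    (B, T, head_size)

# Scaled dot-product scores, shape (B, T, T).
scores = q @ k.transpose(-2, -1) * head_size ** -0.5

# Lower-triangular mask: position t may attend only to positions <= t.
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
scores = scores.masked_fill(~mask, float("-inf"))

# Softmax turns -inf scores into exactly zero weight; each row still sums to 1.
weights = F.softmax(scores, dim=-1)
print(weights[0])
```

The first row always comes out as `[1, 0, 0, 0]`, since the first character can only attend to itself.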

**Optional**:

* Implement a learning rate decay strategy
* Implement gradient clipping
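Both optional items can be added to a training loop with a few lines. The sketch below uses a tiny linear layer and a dummy loss as stand-ins for the transformer and its cross-entropy loss; the cosine schedule, `eta_min`, `max_iters`, and `max_grad_norm` values are assumptions for illustration, not settings prescribed by the project.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 8)  # stand-in for the transformer
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

max_iters = 100       # hypothetical training length
max_grad_norm = 1.0   # hypothetical clipping threshold

# Cosine decay from the initial learning rate down to eta_min over max_iters steps.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=max_iters, eta_min=3e-5
)

for step in range(max_iters):
    x = torch.randn(16, 8)
    loss = (model(x) - x).pow(2).mean()  # dummy objective
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    # Rescale the gradients in place so their global L2 norm is at most max_grad_norm.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    scheduler.step()  # update the learning rate after the optimizer step

final_lr = scheduler.get_last_lr()[0]
print(f"final learning rate: {final_lr:.2e}")
```

Clipping goes between `loss.backward()` and `optimizer.step()`, and the scheduler is stepped once per optimizer step.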

---


#### **Evaluation and Inference**

* Monitor the cross-entropy loss. Use a seed string to initialize the model and generate Shakespearean-like text.

* In order to generate the characters, at each generation step you can either select the character with the highest probability, or you can sample according to the output distribution.
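The two selection strategies differ by a single line. The sketch below applies both to hand-written toy logits for one generation step; the `temperature` parameter is an optional extra not required by the project, included here as an assumption to show how sampling can be made more or less random.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Toy logits for one generation step, shape (B=1, vocab_size=4).
logits = torch.tensor([[2.0, 1.0, 0.1, -1.0]])

# Greedy decoding: always pick the character with the highest probability.
greedy_next = torch.argmax(logits, dim=-1, keepdim=True)  # (B, 1)

# Sampling: draw from the output distribution.
# Temperature < 1 sharpens the distribution, > 1 flattens it (hypothetical value).
temperature = 0.8
probs = F.softmax(logits / temperature, dim=-1)
sampled_next = torch.multinomial(probs, num_samples=1)    # (B, 1)

print(greedy_next.item())  # 0, the index of the largest logit
print(sampled_next.item())
```

Greedy decoding is deterministic and tends to loop on repeated phrases, while sampling gives more varied text at the cost of occasional unlikely characters.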

The high-level pseudocode for generation is:

```python
model.eval()
with torch.no_grad():
    context = "O God, O God!"
    tokenized_context = tokenize(context)
    # the model should implement a method to generate tokens given a prompt
    y = model.generate(tokenized_context, ...)
    completion = tokens_to_string(y)
```

**Optional**:
* Compute the [perplexity](https://medium.com/@priyankads/perplexity-of-language-models-41160427ed72#:~:text=Intuitively%2C%20perplexity%20means%20to%20be,loss%20obtained%20from%20the%20model.) metric for quantitative evaluation.
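Perplexity is the exponential of the mean per-character cross-entropy (in nats). A minimal sketch, where `perplexity` is an illustrative helper and the input is a list of per-character negative log-likelihoods:

```python
import math

def perplexity(nll_per_token):
    """Exponential of the mean negative log-likelihood (natural log)."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# Sanity check: a model that guesses uniformly over a 65-character vocabulary
# assigns probability 1/65 to every character, so its perplexity is about 65.
vocab_size = 65
uniform_nll = [math.log(vocab_size)] * 10
print(perplexity(uniform_nll))
```

Since `F.cross_entropy` returns the mean negative log-likelihood by default, the perplexity of a batch can be read off as `math.exp(loss.item())`; averaging over a held-out split gives a more meaningful number than a single training batch.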

### **Example Outputs**

The following are my outputs after $6000$ steps of training, with the seed string "O God, O God!"



```
O God, O God! neither? unto the base very ears,
As damned with it.

DUKE OF YORK:
Away! Once more, one word.

RICHARD:
Clove, dear so; and therein my son will be
false of woe: if ye seems to be the mother
Of gracious order this time when R going kinsperse eyes,
What dost bewreck her fairer drying tears.

NORTHUMBERLAND:
Have you forgot the Duke of Norfolk, get him to
again; and and agilic: there is my spirit
So maly did must such a marble perfection.

ELBOW:
Come, bring them with oaths, and so deliver
```


### Resources:

* Vaswani et al., "Attention is All You Need": [link](https://arxiv.org/abs/1706.03762)

* Illustrated Transformer by Jay Alammar: [link](https://jalammar.github.io/illustrated-transformer/)

* OpenAI GPT-2 Paper: [link](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)

* Deep Learning Course slides on transformers: [link](https://fleuret.org/dlc/materials/dlc-handout-13-3-transformers.pdf)