
### Shakespeare Task

This notebook implements a simple character-level RNN using PyTorch, trains it for 10 epochs on the Tiny Shakespeare dataset, and generates 100 words of sample text.

In [2]:
# Cell: imports and device
import os
import math
import random
from collections import Counter


import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F


from tqdm.notebook import tqdm


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print('Device:', device)

Device: cuda


In [3]:
# Cell: download dataset
DATA_URL = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
DATA_FILE = 'tiny_shakespeare.txt'

if not os.path.exists(DATA_FILE):
    try:
        import urllib.request
        print('Downloading dataset...')
        urllib.request.urlretrieve(DATA_URL, DATA_FILE)
        print('Saved to', DATA_FILE)
    except Exception as e:
        print('Could not download dataset automatically. Please download manually and place as', DATA_FILE)
        raise
else:
    print('Dataset already exists:', DATA_FILE)

with open(DATA_FILE, 'r', encoding='utf-8') as f:
    text = f.read()

print('Dataset length (characters):', len(text))
print('Sample excerpt:\n', text[:400])

Downloading dataset...
Saved to tiny_shakespeare.txt
Dataset length (characters): 1115394
Sample excerpt:
 First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it 


### Prepare character mappings and sequences

I will build mappings from characters to integers and split the text into overlapping input sequences and single-character targets.

In [4]:
# Cell: char mappings
chars = sorted(list(set(text)))
vocab_size = len(chars)
print('Unique characters (vocab size):', vocab_size)

char2idx = {ch:i for i,ch in enumerate(chars)}
idx2char = {i:ch for ch,i in char2idx.items()}

# helper: encode / decode
def encode(s):
    return [char2idx[c] for c in s]

def decode(indices):
    return ''.join(idx2char[i] for i in indices)

Unique characters (vocab size): 65
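
As a quick sanity check, the two mappings should round-trip any string made only of characters in the vocabulary. The sketch below rebuilds the same kind of mappings on a short toy string so that it runs on its own; `toy_text` and the `toy_*` names are illustrative stand-ins, not part of the notebook.

```python
# Toy stand-in for the dataset text (assumption: for illustration only)
toy_text = "hello world"

# Same construction as in the cell above, applied to the toy string
toy_chars = sorted(set(toy_text))
toy_char2idx = {ch: i for i, ch in enumerate(toy_chars)}
toy_idx2char = {i: ch for ch, i in toy_char2idx.items()}

def toy_encode(s):
    # Map each character to its integer index
    return [toy_char2idx[c] for c in s]

def toy_decode(indices):
    # Map each integer index back to its character
    return ''.join(toy_idx2char[i] for i in indices)

encoded = toy_encode("hello")
roundtrip = toy_decode(encoded)
print(encoded, roundtrip)
```

A character that never occurs in the text has no entry in the mapping, so encoding it raises a `KeyError`.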


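### Sequence creation strategy

Choose a sequence length (here, 100 characters) and slide a window over the text with a fixed stride (here, 1) to create many training samples. Each window is one input, and the character that immediately follows it is the target.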

In [5]:
# Cell: create sequences
seq_len = 100  # number of input chars
step = 1       # sliding window step

inputs = []
targets = []
for i in range(0, len(text) - seq_len, step):
    seq = text[i:i+seq_len]
    nxt = text[i+seq_len]
    inputs.append(encode(seq))
    targets.append(char2idx[nxt])

print('Created sequences:', len(inputs))

Created sequences: 1115294
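
To make the windowing concrete, here is a minimal sketch of the same loop on a toy string. The helper name `make_windows` is hypothetical. With a step of 1 the number of windows is `len(text) - seq_len`, which matches the count above (1115394 - 100 = 1115294).

```python
def make_windows(s, seq_len, step=1):
    """Return (input_window, next_char) pairs using a sliding window."""
    pairs = []
    for i in range(0, len(s) - seq_len, step):
        pairs.append((s[i:i + seq_len], s[i + seq_len]))
    return pairs

# A larger step skips start positions and yields fewer, less overlapping windows
pairs = make_windows("abcdefgh", seq_len=3, step=2)
for window, nxt in pairs:
    print(repr(window), '->', repr(nxt))

# With step=1 there are len(s) - seq_len windows
n_full = len(make_windows("abcdefgh", seq_len=3, step=1))
print(n_full)
```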


### Vectorize input sequences into one-hot binaries and targets as long indices

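I will implement a simple PyTorch Dataset that returns one-hot encoded input tensors of shape (seq_len, vocab_size) and targets as scalar long indices. nn.CrossEntropyLoss takes the network's logits of shape (batch, vocab_size) together with targets of shape (batch,), so the targets do not need to be one-hot encoded. In the training loop, each input batch is permuted from (batch, seq_len, vocab_size) to (seq_len, batch, vocab_size) before it is fed to the RNN.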

In [6]:
# Cell: dataset + dataloader
class CharDataset(Dataset):
    def __init__(self, inputs, targets, vocab_size):
        self.inputs = inputs
        self.targets = targets
        self.vocab_size = vocab_size

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        seq = self.inputs[idx]
        # one-hot encode as float tensor shape (seq_len, vocab_size)
        x = torch.zeros(len(seq), self.vocab_size)
        for t, ch_idx in enumerate(seq):
            x[t, ch_idx] = 1.0
        y = torch.tensor(self.targets[idx], dtype=torch.long)
        return x, y

batch_size = 128
dataset = CharDataset(inputs, targets, vocab_size)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

print('Batches per epoch:', math.ceil(len(dataset)/batch_size))

Batches per epoch: 8714
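
The per-sample one-hot loop can also be written with `F.one_hot`, and the windows can be sliced on the fly from a single encoded tensor instead of being stored as roughly 1.1 million Python lists. The sketch below is one possible alternative, not the dataset used in this notebook; the class name `WindowDataset` is hypothetical.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset

class WindowDataset(Dataset):
    """Hypothetical alternative to CharDataset: slices windows from one encoded tensor."""
    def __init__(self, encoded, seq_len, vocab_size):
        self.data = torch.as_tensor(encoded, dtype=torch.long)
        self.seq_len = seq_len
        self.vocab_size = vocab_size

    def __len__(self):
        # One sample per start position that still has a following target character
        return len(self.data) - self.seq_len

    def __getitem__(self, idx):
        window = self.data[idx:idx + self.seq_len]
        x = F.one_hot(window, num_classes=self.vocab_size).float()  # (seq_len, vocab_size)
        y = self.data[idx + self.seq_len]                           # scalar long index
        return x, y

# Toy check with a 4-symbol vocabulary
toy = WindowDataset([0, 1, 2, 3, 1, 0], seq_len=3, vocab_size=4)
x0, y0 = toy[0]
print(len(toy), x0.shape, y0.item())
```

The samples have the same shapes and dtypes as those of `CharDataset`, so a `DataLoader` built on such a dataset would batch them the same way.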


### Build the simple RNN model (PyTorch)

I will use nn.RNN (simple vanilla RNN). The model accepts one-hot vectors as input (size vocab_size) and outputs logits for each character at the final timestep. This matches our target (the next character after the sequence).

In [7]:
# Cell: model definition
class SimpleCharRNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers=1, dropout=0.1):
        super().__init__()
        # nn.RNN applies dropout only between stacked layers, so pass it only when num_layers > 1
        self.rnn = nn.RNN(input_size, hidden_size, num_layers=num_layers, nonlinearity='tanh',
                          batch_first=False, dropout=dropout if num_layers > 1 else 0.0)
        self.fc = nn.Linear(hidden_size, input_size)  # map hidden -> vocab logits

    def forward(self, x, h0=None):
        # x: (seq_len, batch, input_size)
        out, hn = self.rnn(x, h0)  # out: (seq_len, batch, hidden)
        # I want logits for the last timestep only
        last = out[-1]            # (batch, hidden)
        logits = self.fc(last)    # (batch, vocab_size)
        return logits, hn

hidden_size = 256
model = SimpleCharRNN(vocab_size, hidden_size).to(device)
print(model)

SimpleCharRNN(
  (rnn): RNN(65, 256)
  (fc): Linear(in_features=256, out_features=65, bias=True)
)
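
The shape flow through the model can be checked on a dummy batch. This is a minimal sketch using the same two layers with small illustrative sizes; the names `T`, `B`, `V` and `H` are assumptions made for the example.

```python
import torch
import torch.nn as nn

T, B, V, H = 5, 2, 65, 16   # seq_len, batch, vocab, hidden (illustrative sizes)
rnn = nn.RNN(V, H, batch_first=False)
fc = nn.Linear(H, V)

x = torch.zeros(T, B, V)    # (seq_len, batch, vocab)
out, hn = rnn(x)            # out: (seq_len, batch, hidden); hn: (num_layers, batch, hidden)
logits = fc(out[-1])        # last timestep only -> (batch, vocab)
print(out.shape, hn.shape, logits.shape)

# For a single-layer, unidirectional RNN the last output equals the final hidden state
same = torch.allclose(out[-1], hn[0])
print(same)
```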


In [8]:
# Cell: training loop (10 epochs)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
epochs = 10

model.train()
for epoch in range(1, epochs+1):
    total_loss = 0.0
    pbar = tqdm(dataloader, desc=f'Epoch {epoch}/{epochs}')
    for xb, yb in pbar:
        # xb: (batch, seq_len, vocab) -> transpose to (seq_len, batch, vocab)
        xb = xb.permute(1,0,2).to(device)  # (seq_len, batch, vocab)
        yb = yb.to(device)                 # (batch,)

        optimizer.zero_grad()
        logits, _ = model(xb)
        loss = criterion(logits, yb)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        pbar.set_postfix({'loss': f'{loss.item():.4f}'})

    avg_loss = total_loss / len(dataloader)
    print(f'Epoch {epoch} average loss: {avg_loss:.4f}')

print('Training finished')

Epoch 1 average loss: 2.0437
Epoch 2 average loss: 1.7327
Epoch 3 average loss: 1.6459
Epoch 4 average loss: 1.5982
Epoch 5 average loss: 1.5700
Epoch 6 average loss: 1.5495
Epoch 7 average loss: 1.5336
Epoch 8 average loss: 1.5217
Epoch 9 average loss: 1.5107
Epoch 10 average loss: 1.5031
Training finished
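
To put the loss values in context: nn.CrossEntropyLoss reports the average negative log-likelihood in nats, so its exponential is the per-character perplexity, and a model that guesses uniformly over the 65 characters would score ln(65). The short calculation below uses only the numbers printed above; the variable names are illustrative.

```python
import math

final_loss = 1.5031   # epoch 10 average loss reported above
n_chars = 65          # vocabulary size reported above

perplexity = math.exp(final_loss)         # per-character perplexity
uniform_loss = math.log(n_chars)          # loss of a uniform guess over the vocabulary
bits_per_char = final_loss / math.log(2)  # same loss expressed in bits

print(f'perplexity={perplexity:.2f}, uniform loss={uniform_loss:.2f}, bits/char={bits_per_char:.2f}')
```

After 10 epochs the model is therefore about as uncertain as a uniform choice among roughly 4.5 characters, compared with 65 for a uniform guess. These are training-set averages, since the notebook holds out no validation split.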


### Text generation function (generate N words)

I will generate text character-by-character. Starting from a seed (here, the first seq_len characters of the dataset), I repeatedly predict the next character probabilistically: the logits are divided by a temperature, passed through a softmax, and sampled with torch.multinomial. Generation continues until the text, including the seed, contains the requested number of whitespace-separated words.

In [9]:
# Cell: generation function

def generate_text(model, seed_text, max_words=100, temperature=1.0):
    model.eval()
    generated = seed_text
    # ensure seed length is seq_len
    if len(seed_text) < seq_len:
        seed_text = seed_text.rjust(seq_len)
    seq = seed_text[-seq_len:]

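    # Note: the word count includes the words already present in the seed
    words = len(generated.split())

    while words < max_words:
        # prepare input one-hot for seq
        x = torch.zeros(len(seq), 1, vocab_size, device=device)  # (seq_len, batch=1, vocab)
        for t, ch in enumerate(seq):
            if ch in char2idx:
                x[t, 0, char2idx[ch]] = 1.0
            # unknown chars are left as all-zero vectors

        with torch.no_grad():
            # Re-encode the full window from a zero hidden state, as in training;
            # carrying the previous hidden state over would process the window twice.
            logits, _ = model(x)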
            # apply temperature
            logits = logits.squeeze(0) / temperature
            probs = F.softmax(logits, dim=-1)
            # sample next char
            next_idx = torch.multinomial(probs, num_samples=1).item()
            next_char = idx2char[next_idx]

        generated += next_char
        seq = (seq + next_char)[-seq_len:]
        words = len(generated.split())

    return generated
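
The effect of the temperature parameter is easiest to see on a toy example. The sketch below is a standard-library stand-in for `F.softmax` and `torch.multinomial` (an assumption made so the example runs without a trained model); the helper `softmax_with_temperature` and the toy logits are hypothetical.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]
p_cold = softmax_with_temperature(toy_logits, temperature=0.5)  # sharper distribution
p_base = softmax_with_temperature(toy_logits, temperature=1.0)
p_hot = softmax_with_temperature(toy_logits, temperature=2.0)   # flatter distribution

# Sample one index according to the probabilities
rng = random.Random(0)
sampled = rng.choices(range(len(toy_logits)), weights=p_base, k=1)[0]
print(p_cold, p_base, p_hot, sampled)
```

Temperatures below 1 concentrate probability on the most likely character, which tends to give more conservative text; temperatures above 1 spread probability out and tend to give more varied but noisier text.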

### Run generation and print a 100-word sample

In [10]:
# Cell: generate and print sample
seed = text[:seq_len]  # use the very beginning of the dataset as seed
sample = generate_text(model, seed, max_words=100, temperature=0.8)

print('\n----- GENERATED SAMPLE (100 words) -----\n')
print(sample)
print('\n--------------------------------------\n')


----- GENERATED SAMPLE (100 words) -----

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
Your of my join aciset world, and pluck in the married beard to windrople no man but the hand the king as for an ele, confuuging our eneranced faith,
Made my lord to eatilled dead.

KING EDWARD IV:
But mone our father, and may for thee the trie.

Nord Marginggant:
And wound, and your great Lapsted: and the are the fined, and that ent dispies,
Careame here,
And down upity; cursed, as all the hands for the,
A thing you; my father,
And shone p

--------------------------------------

