<a href="https://colab.research.google.com/github/Shakilkhan24/Playground_DL/blob/main/llm_chronicles_4_4_word_level_rnn_language_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Copied from the LLM Chronicles website; thanks to its author.

# LLM Chronicles 4.4: Word-Level Language Model RNN

In this lab we'll build a word-level language model using RNNs and LSTM cells.

Code based on this character-level RNN from Sebastian Raschka's book "Machine Learning with PyTorch and Scikit-Learn": https://github.com/rasbt/machine-learning-book/blob/main/ch15/ch15_part3.ipynb


In [1]:
import string
import requests
import re
import random
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

# 1. Load dataset

We'll train our language model on a dataset of fairy tales. We chose this dataset because fairy tales tend to use simple language with a limited vocabulary, which makes it easier for our model to learn patterns.

I have further cleaned the dataset this way:

- I only included sentences that use the top 5000 words. This keeps the vocabulary limited and ensures words are repeated often throughout the text.
- I removed all punctuation except periods.
- I removed all sentences that contained quoted speech, such as: She asked: "What time is it?". This keeps the sentence structure simple.

Scroll to the end of this notebook to see the code used to clean up the dataset.

In [2]:
!wget https://raw.githubusercontent.com/kyuz0/llm-chronicles/main/4.4%20-%20Lab%20-%20Word-Level%20RNN/fairy_tales_cleaned_most_common_5000_words.txt -O dataset.txt

--2024-05-09 10:14:40--  https://raw.githubusercontent.com/kyuz0/llm-chronicles/main/4.4%20-%20Lab%20-%20Word-Level%20RNN/fairy_tales_cleaned_most_common_5000_words.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2751228 (2.6M) [text/plain]
Saving to: ‘dataset.txt’


2024-05-09 10:14:41 (73.0 MB/s) - ‘dataset.txt’ saved [2751228/2751228]



In [3]:
## Reading and processing text
with open('dataset.txt', 'r', encoding="utf8") as fp:
    text=fp.read()

print('Total Length (characters):', len(text))

Total Length (characters): 2751174


In [9]:
# Split on whitespace. Note that the period stays attached to the preceding word ('prince.').
text.split()[:10]

['the', 'happy', 'prince.', 'high', 'above', 'the', 'city', 'on', 'a', 'tall']

In [11]:
# Split the period off so that it becomes an independent token

tokens=[]
for t in text.split()[:10]:
  tokens.extend(t.replace('.',' .').split())
print(tokens)


['the', 'happy', 'prince', '.', 'high', 'above', 'the', 'city', 'on', 'a', 'tall']


## Word-level tokenization

For this language model, we'll use word-level tokenization. We'll also include the period as a token, which allows us to separate sentences.

![picture](https://raw.githubusercontent.com/kyuz0/llm-chronicles/main/4.4%20-%20Lab%20-%20Word-Level%20RNN/tokens.png)


In [12]:
# Word-level tokens: lowercased, punctuation removed, period kept as its own token


def tokenize(doc):
    # Exclude period from the punctuation list
    punctuation_to_remove = string.punctuation.replace('.', '')

    # Create translation table that removes specified punctuation except period
    table = str.maketrans('', '', punctuation_to_remove)

    tokens = doc.split()
    # Further split tokens by period and keep periods as separate tokens
    split_tokens = []
    for token in tokens:
        split_tokens.extend(token.replace('.', ' .').split())

    tokens = [w.translate(table) for w in split_tokens]
    tokens = [word for word in tokens if word.isalpha() or word == '.']
    tokens = [word.lower() for word in tokens]

    return tokens

In [13]:
# tokenize
tokens = tokenize(text)
print(tokens[:100])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))

['the', 'happy', 'prince', '.', 'high', 'above', 'the', 'city', 'on', 'a', 'tall', 'column', 'stood', 'the', 'statue', 'of', 'the', 'happy', 'prince', '.', 'he', 'was', 'very', 'much', 'admired', 'indeed', '.', 'one', 'night', 'there', 'flew', 'over', 'the', 'city', 'a', 'little', 'swallow', '.', 'then', 'when', 'the', 'autumn', 'came', 'they', 'all', 'flew', 'away', '.', 'what', 'did', 'he', 'see', 'the', 'eyes', 'of', 'the', 'happy', 'prince', 'were', 'filled', 'with', 'tears', 'and', 'tears', 'were', 'running', 'down', 'his', 'golden', 'cheeks', '.', 'his', 'face', 'was', 'so', 'beautiful', 'in', 'the', 'moonlight', 'that', 'the', 'little', 'swallow', 'was', 'filled', 'with', 'pity', '.', 'round', 'the', 'garden', 'ran', 'a', 'very', 'lofty', 'wall', 'but', 'i', 'never', 'cared']
Total Tokens: 584955
Unique Tokens: 5371


# 2. Create vocabulary

We now need to assign each word in the vocabulary a unique index. Conceptually, this index is later one-hot encoded when we feed it to the model (in practice, `nn.Embedding` looks the index up directly).

![picture](https://raw.githubusercontent.com/kyuz0/llm-chronicles/main/4.4%20-%20Lab%20-%20Word-Level%20RNN/vocabulary.png)
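To make the index-to-one-hot step concrete, here is a minimal sketch on a toy vocabulary. The names `toy_vocabulary`, `toy_word2int` and `toy_one_hot` are made up for this illustration and are not part of the lab's code.

```python
import torch
import torch.nn.functional as F

# Toy vocabulary (hypothetical, not the lab's 5371-word vocabulary)
toy_vocabulary = sorted({"the", "happy", "prince", "."})
toy_word2int = {word: i for i, word in enumerate(toy_vocabulary)}

# Encode a short token sequence as integer indices
toy_tokens = ["the", "happy", "prince", "."]
toy_indices = torch.tensor([toy_word2int[t] for t in toy_tokens])

# One row per token, one column per vocabulary entry, a single 1 per row
toy_one_hot = F.one_hot(toy_indices, num_classes=len(toy_vocabulary))

print(toy_word2int)
print(toy_one_hot)
```

Each row contains exactly one 1, located at the column given by the token's index.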


In [14]:
vocabulary = sorted(set(tokens))
len(vocabulary)

5371

In [15]:
x=np.array(10)
print(x)

10


In [16]:
# Map tokens to indices (encoding) and indices back to tokens (decoding)

word2int = {word:i for i,word in enumerate(vocabulary)}
# Alternative for decoding: int2word = {i: word for i, word in enumerate(vocabulary)}

word_array = np.array(vocabulary)   # indexable by integer arrays, used for decoding

text_encoded = np.array(             # every token is represented by its numerical index
    [word2int[word] for word in tokens],
    dtype=np.int32)

print('Text encoded shape: ', text_encoded.shape)

print("Tokens ==> ", tokens[:20], '\nEncoding ==> ', text_encoded[:20])
print(text_encoded[0:20], ' == Reverse  ==> ', ' '.join(word_array[text_encoded[:20]]))


Text encoded shape:  (584955,)
Tokens ==>  ['the', 'happy', 'prince', '.', 'high', 'above', 'the', 'city', 'on', 'a', 'tall', 'column', 'stood', 'the', 'statue', 'of', 'the', 'happy', 'prince', '.'] 
Encoding ==>  [4735 2109 3621    0 2212    7 4735  813 3259    1 4681  882 4518 4735
 4482 3236 4735 2109 3621    0]
[4735 2109 3621    0 2212    7 4735  813 3259    1 4681  882 4518 4735
 4482 3236 4735 2109 3621    0]  == Reverse  ==>  the happy prince . high above the city on a tall column stood the statue of the happy prince .


# 3. Prepare pairs for self-supervised training

![picture](https://raw.githubusercontent.com/kyuz0/llm-chronicles/main/4.4%20-%20Lab%20-%20Word-Level%20RNN/pairs.png)
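The pairing can be sketched on a toy sequence: every window of `seq_length + 1` tokens yields an input and a target that is the same window shifted by one token, which is the form the training loop in this lab consumes. The helper `make_pairs` and the token ids below are hypothetical and only serve this illustration.

```python
import numpy as np

def make_pairs(encoded, seq_length):
    """Return (input, target) windows; the target is the input shifted by one token."""
    chunk_size = seq_length + 1
    pairs = []
    for i in range(len(encoded) - chunk_size + 1):
        chunk = encoded[i:i + chunk_size]
        pairs.append((chunk[:-1], chunk[1:]))
    return pairs

toy_encoded = np.array([10, 11, 12, 13, 14])  # hypothetical token ids
toy_pairs = make_pairs(toy_encoded, seq_length=3)

for x, y in toy_pairs:
    print(x, '->', y)
```

No labels are needed beyond the text itself, which is why this setup is called self-supervised.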


In [17]:
seq_length = 50
chunk_size = seq_length + 1

text_chunks = [text_encoded[i:i+chunk_size]
               for i in range(len(text_encoded)-chunk_size+1)]

for seq in text_chunks[:1]:
    input_seq = seq[:seq_length]
    target = seq[seq_length]
    print(input_seq, ' -> ', target)
    print(repr(' '.join(word_array[input_seq])),
          ' -> ', repr(''.join(word_array[target])))

[4735 2109 3621    0 2212    7 4735  813 3259    1 4681  882 4518 4735
 4482 3236 4735 2109 3621    0 2149 5114 5050 3073   50 2371    0 3261
 3165 4743 1737 3297 4735  813    1 2733 4633    0 4742 5188 4735  289
  660 4751  116 1737  296    0 5179 1214]  ->  2149
'the happy prince . high above the city on a tall column stood the statue of the happy prince . he was very much admired indeed . one night there flew over the city a little swallow . then when the autumn came they all flew away . what did'  ->  'he'


In [21]:
for seq in text_chunks[:1]:
  print('input is:',word_array[seq[:10]])
  print('target is :',word_array[seq[10]])

input is: ['the' 'happy' 'prince' '.' 'high' 'above' 'the' 'city' 'on' 'a']
target is : tall


In [22]:
for seq in text_chunks[:1]:
  print(seq[:10])
  print(seq[10])

[4735 2109 3621    0 2212    7 4735  813 3259    1]
4681


In [23]:
from torch.utils.data import Dataset

class TextDataset(Dataset):
    def __init__(self, text_chunks):
        self.text_chunks = text_chunks

    def __len__(self):
        return len(self.text_chunks)

    def __getitem__(self, idx):
        text_chunk = self.text_chunks[idx]
        return text_chunk[:-1].long(), text_chunk[1:].long()

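# Stack the chunks into one NumPy array first; building a tensor from a list of arrays is slow
seq_dataset = TextDataset(torch.tensor(np.array(text_chunks)))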




In [24]:
for i, (seq, target) in enumerate(seq_dataset):
    print(' Input (x):', repr(' '.join(word_array[seq])))
    print('Target (y):', repr(' '.join(word_array[target])))
    print()
    if i == 1:
        break

 Input (x): 'the happy prince . high above the city on a tall column stood the statue of the happy prince . he was very much admired indeed . one night there flew over the city a little swallow . then when the autumn came they all flew away . what did'
Target (y): 'happy prince . high above the city on a tall column stood the statue of the happy prince . he was very much admired indeed . one night there flew over the city a little swallow . then when the autumn came they all flew away . what did he'

 Input (x): 'happy prince . high above the city on a tall column stood the statue of the happy prince . he was very much admired indeed . one night there flew over the city a little swallow . then when the autumn came they all flew away . what did he'
Target (y): 'prince . high above the city on a tall column stood the statue of the happy prince . he was very much admired indeed . one night there flew over the city a little swallow . then when the autumn came they all flew away . what did he see'

In [25]:
from torch.utils.data import DataLoader

batch_size = 64

seq_dl = DataLoader(seq_dataset, batch_size=batch_size, shuffle=True, drop_last=True)

# 4. Create model

![picture](https://raw.githubusercontent.com/kyuz0/llm-chronicles/main/4.4%20-%20Lab%20-%20Word-Level%20RNN/model.png)


In [26]:
# Device-independent code
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
DEVICE

device(type='cuda', index=0)

In [27]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, rnn_hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn_hidden_size = rnn_hidden_size
        self.rnn = nn.LSTM(embed_dim, rnn_hidden_size,
                           batch_first=True)
        self.fc = nn.Linear(rnn_hidden_size, vocab_size)

    def forward(self, x, hidden, cell):
        out = self.embedding(x).unsqueeze(1)
        out, (hidden, cell) = self.rnn(out, (hidden, cell))
        out = self.fc(out).reshape(out.size(0), -1)
        return out, hidden, cell

    def init_hidden(self, batch_size):
        hidden = torch.zeros(1, batch_size, self.rnn_hidden_size)
        cell = torch.zeros(1, batch_size, self.rnn_hidden_size)
        return hidden.to(DEVICE), cell.to(DEVICE)

vocab_size = len(word_array)
embed_dim = 256
rnn_hidden_size = 512

model = RNN(vocab_size, embed_dim, rnn_hidden_size)
model = model.to(DEVICE)
model

RNN(
  (embedding): Embedding(5371, 256)
  (rnn): LSTM(256, 512, batch_first=True)
  (fc): Linear(in_features=512, out_features=5371, bias=True)
)
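To follow the tensor shapes through one call of `forward`, here is a small sketch that rebuilds the same three layers with toy sizes. The sizes and variable names are assumptions chosen so the example runs quickly on CPU; they are not the lab's settings.

```python
import torch
import torch.nn as nn

# Small hypothetical sizes so the example runs quickly on CPU
vocab_size, embed_dim, hidden_size, batch = 20, 8, 16, 4

embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
fc = nn.Linear(hidden_size, vocab_size)

x = torch.randint(0, vocab_size, (batch,))       # one token id per sequence in the batch
hidden = torch.zeros(1, batch, hidden_size)      # (num_layers, batch, hidden)
cell = torch.zeros(1, batch, hidden_size)

emb = embedding(x).unsqueeze(1)                  # (batch, 1, embed_dim): a length-1 sequence
out, (hidden, cell) = lstm(emb, (hidden, cell))  # out: (batch, 1, hidden)
logits = fc(out).reshape(out.size(0), -1)        # (batch, vocab_size)

print(emb.shape, out.shape, logits.shape)
```

The `unsqueeze(1)` adds the sequence dimension that `batch_first=True` expects, and the final `reshape` drops it again so that each row holds one score per vocabulary word.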

# 5. Train model

In [None]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)

num_epochs = 200

model.to(DEVICE)
model.train()
for epoch in range(num_epochs):
    # Each "epoch" here is a single optimization step on one randomly drawn batch
    hidden, cell = model.init_hidden(batch_size)
    seq_batch, target_batch = next(iter(seq_dl))
    seq_batch = seq_batch.to(DEVICE)
    target_batch = target_batch.to(DEVICE)
    optimizer.zero_grad()
    loss = 0
    # Feed the sequence one time step at a time, carrying the LSTM state forward
    for w in range(seq_length):
        pred, hidden, cell = model(seq_batch[:, w], hidden, cell)   # all rows of the batch, w-th column (the w-th word)
        loss += loss_fn(pred, target_batch[:, w])
    loss.backward()
    optimizer.step()
    loss = loss.item()/seq_length
    # Logs every 500 steps; with num_epochs = 200 only step 0 is printed
    if epoch % 500 == 0:
        print(f'Epoch {epoch} loss: {loss:.4f}')


Epoch 0 loss: 4.1377
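The training loop above steps through the sequence one word at a time. As a side note, `nn.LSTM` also accepts a whole `(batch, seq_len, embed_dim)` tensor in a single call and runs the time loop internally. The sketch below, with hypothetical toy sizes, checks that both ways produce the same outputs; adapting the lab's `forward` to this form would be a separate change and is not shown here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, embed_dim, hidden_size = 2, 5, 8, 16   # hypothetical toy sizes

lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
inputs = torch.randn(batch, seq_len, embed_dim)
h0 = torch.zeros(1, batch, hidden_size)
c0 = torch.zeros(1, batch, hidden_size)

with torch.no_grad():
    # Step by step: feed one time step at a time and carry the state forward
    h, c = h0, c0
    step_outputs = []
    for t in range(seq_len):
        out_t, (h, c) = lstm(inputs[:, t:t + 1, :], (h, c))
        step_outputs.append(out_t)
    stepped = torch.cat(step_outputs, dim=1)

    # Whole sequence in a single call
    full, (h_full, c_full) = lstm(inputs, (h0, c0))

print(torch.allclose(stepped, full, atol=1e-5))
```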


# 6. Text Generation

## 6.1 Temperature and Top-P Sampling

We typically don't want to always pick the word with the highest probability as the next token, as our outputs would become very predictable and often repetitive. Instead, we want our model to be creative and generate diverse outputs.

Two common strategies, often combined, control the model's "creativity" and predictability:

- **Top-p Sampling**: we sample only from the most likely predictions, keeping tokens (in descending order of probability) while their combined probability does not exceed the value p.

- **Temperature**: this directly reshapes the probability distribution of the next token by dividing the logits by the temperature before the softmax. Think of it as tweaking the sharpness of the distribution. When the temperature is less than one, the softmax distribution becomes sharper. A temperature above one spreads the distribution out, making it flatter.


![picture](https://raw.githubusercontent.com/kyuz0/llm-chronicles/main/4.4%20-%20Lab%20-%20Word-Level%20RNN/topp-temperature.png)
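The effect of temperature can be checked on a handful of toy logits. This is only an illustrative sketch: `toy_logits` and `softmax_with_temperature` are made-up names, and the values are arbitrary.

```python
import torch

toy_logits = torch.tensor([2.0, 1.0, 0.1])   # hypothetical scores for three tokens

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply softmax."""
    return torch.softmax(logits / temperature, dim=-1)

sharp = softmax_with_temperature(toy_logits, 0.5)   # temperature < 1: sharper
plain = softmax_with_temperature(toy_logits, 1.0)   # temperature = 1: unchanged
flat = softmax_with_temperature(toy_logits, 2.0)    # temperature > 1: flatter

for name, probs in [("T=0.5", sharp), ("T=1.0", plain), ("T=2.0", flat)]:
    print(name, probs)
```

The most likely token gains probability mass as the temperature drops and loses it as the temperature rises, while the ranking of the tokens stays the same.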


In [None]:
def top_p_sampling(logits, temperature=1.0, top_p=0.9):
    """Sample one token index from a 1-D logits tensor using temperature and top-p filtering."""

    # Apply temperature scaling
    scaled_logits = logits / temperature

    # Convert logits to probabilities using softmax
    probabilities = torch.softmax(scaled_logits, dim=-1)

    # Sort probabilities and compute cumulative sum
    sorted_indices = torch.argsort(probabilities, descending=True)
    sorted_probabilities = probabilities[sorted_indices]
    cumulative_probabilities = torch.cumsum(sorted_probabilities, dim=-1)

    # Apply top-p filtering
    indices_to_keep = cumulative_probabilities <= top_p
    truncated_probabilities = sorted_probabilities[indices_to_keep]

    # Rescale the probabilities
    truncated_probabilities /= torch.sum(truncated_probabilities)

    # Convert to numpy arrays for random choice
    truncated_probabilities = truncated_probabilities.cpu().numpy()
    sorted_indices = sorted_indices.cpu().numpy()
    indices_to_keep = indices_to_keep.cpu().numpy()

    # Sample from the truncated distribution
    if not indices_to_keep.any():
        # Handle the empty case - for example, using regular sampling without top-p
        probabilities = torch.softmax(logits / temperature, dim=-1)
        next_word_index = torch.multinomial(probabilities, 1).item()
    else:
        # Existing sampling process
        next_word_index = np.random.choice(sorted_indices[indices_to_keep], p=truncated_probabilities)

    return torch.tensor(next_word_index).to(DEVICE)
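In `top_p_sampling` above, when the single most likely token already has a probability above `top_p`, no token passes the `<=` filter and the function falls back to sampling from the full distribution. A common alternative is to keep the smallest set of tokens whose cumulative probability reaches `top_p`, so the most likely token is always kept. The sketch below is one possible variant under that assumption, written entirely in PyTorch; `top_p_sample_torch` is a hypothetical name and not part of the original lab.

```python
import torch

def top_p_sample_torch(logits, temperature=1.0, top_p=0.9):
    """Hypothetical variant: keep the smallest set of tokens whose cumulative probability reaches top_p."""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)

    # Keep a token if the probability mass *before* it is still below top_p,
    # which guarantees that the most likely token is always kept.
    keep = (cumulative - sorted_probs) < top_p
    kept_probs = sorted_probs * keep
    kept_probs = kept_probs / kept_probs.sum()

    # Sample a position in the sorted list, then map it back to a vocabulary index
    choice = torch.multinomial(kept_probs, 1)
    return sorted_idx[choice].squeeze()

toy_logits = torch.tensor([5.0, 1.0, 0.5, 0.1])   # the first token dominates
samples = {top_p_sample_torch(toy_logits, top_p=0.5).item() for _ in range(20)}
print(samples)
```

With these toy logits the first token alone carries more than half of the probability mass, so every draw returns index 0.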


## 6.2 Generating text

We begin by inputting an initial word or phrase, the 'seed' for text generation. The model reads this input and predicts the next token.
The predicted word is then fed back into the model as the next input, which the model uses to predict yet another word. This feedback loop lets the model generate a continuous sequence of text, word by word.


![picture](https://raw.githubusercontent.com/kyuz0/llm-chronicles/main/4.4%20-%20Lab%20-%20Word-Level%20RNN/text-generation.png)
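The feedback loop can be sketched without any neural network at all. In the toy example below, a fixed lookup table stands in for the trained model; `toy_next_word` and `toy_generate` are hypothetical names used only for this illustration.

```python
# Hypothetical stand-in for the trained model: a fixed next-word table
toy_next_word = {"the": "king", "king": "came", "came": "down", "down": "."}

def toy_generate(seed_word, max_words=10):
    """Repeatedly predict the next word and feed it back in as the new input."""
    words = [seed_word]
    current = seed_word
    for _ in range(max_words):
        if current not in toy_next_word:
            break                          # no prediction available, stop generating
        current = toy_next_word[current]   # the prediction becomes the next input
        words.append(current)
    return " ".join(words)

toy_text = toy_generate("the")
print(toy_text)
```

The real `generate` function follows the same pattern, except that the next word is sampled from the model's predicted distribution and the LSTM state is carried along between steps.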


In [1]:
def generate(model, seed_str,
           len_generated_text=50,
           temperature=1, top_p=0.95):

    seed_tokens = tokenize(seed_str)

    encoded_input = torch.tensor([word2int[t] for t in seed_tokens])
    encoded_input = torch.reshape(encoded_input, (1, -1)).to(DEVICE)

    generated_str = seed_str

    model.eval()
    with torch.inference_mode():
      hidden, cell = model.init_hidden(1)
      hidden = hidden.to(DEVICE)
      cell = cell.to(DEVICE)
      for w in range(len(seed_tokens)-1):
          _, hidden, cell = model(encoded_input[:, w].view(1), hidden, cell)

      last_word = encoded_input[:, -1]
      for i in range(len_generated_text):
          logits, hidden, cell = model(last_word.view(1), hidden, cell)
          logits = torch.squeeze(logits, 0)
          last_word = top_p_sampling(logits, temperature, top_p)
          generated_str += " " + str(word_array[last_word])

    return generated_str.replace(" . ", ". ")



In [None]:
model.to(DEVICE)
print(generate(model, seed_str='The king'))

The king came down. it all galloped from the windows but john saw a still well. the cat followed and devoured them with evgenie pavlovitch as happy as light and carried him away. the queen spoke several times to her son but at last she thought of her shabby


# 7. Word embeddings

An embedding layer projects the one-hot encoded tokens into vectors with far fewer dimensions (here from 5371 down to 256). These 'embeddings' are dense representations of words or tokens. In practical terms, the embedding layer behaves like a linear layer without bias applied to the one-hot vector, which amounts to looking up one row of its weight matrix. That weight matrix is one of the parameters the model learns during training.


![picture](https://raw.githubusercontent.com/kyuz0/llm-chronicles/main/4.4%20-%20Lab%20-%20Word-Level%20RNN/embeddings.png)
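The equivalence between an embedding lookup and a one-hot vector multiplied by the weight matrix can be verified directly. The sizes below are toy values chosen for this sketch, not the lab's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
toy_embedding = nn.Embedding(6, 3)   # hypothetical: 6 tokens, 3 dimensions

idx = torch.tensor([4, 1])

# Direct lookup: picks rows 4 and 1 of the weight matrix
lookup = toy_embedding(idx)

# Same result via one-hot vectors multiplied by the weight matrix
one_hot = F.one_hot(idx, num_classes=6).float()
projected = one_hot @ toy_embedding.weight

print(torch.allclose(lookup, projected))
```

The lookup avoids building the one-hot vectors at all, which matters when the vocabulary has thousands of entries.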





In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def get_closest_words(model, word_idx, n=10):
    # Get the embedding for the specified word
    word_embedding = model.embedding(torch.tensor([word_idx])).detach().numpy()

    # Get all embeddings
    all_embeddings = model.embedding.weight.detach().numpy()

    # Calculate similarities (cosine similarity in this example)
    similarities = cosine_similarity(word_embedding, all_embeddings)

    # Find the indices of the most similar embeddings
    closest_idxs = np.argsort(similarities[0])[::-1][1:n+1]  # Exclude the word itself

    return closest_idxs

In [None]:
word_idx = word2int['she']  # Replace with any word in the vocabulary
model.to('cpu')
closest_words = get_closest_words(model, word_idx, 10)
for idx in closest_words:
    print(word_array[idx])

he
wand
elinor
woman
lavinia
amy
papa
tide
ermengarde
tink
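For comparison, the same nearest-neighbour search can be done in PyTorch alone, without scikit-learn. This is a sketch on a tiny hand-made matrix: `closest_rows` and `toy_weight` are hypothetical, and the query row is masked out explicitly rather than assumed to rank first.

```python
import torch
import torch.nn.functional as F

def closest_rows(weight, row_idx, n=3):
    """Return the indices of the n rows with the highest cosine similarity to row `row_idx`."""
    weight = weight.detach()
    normed = F.normalize(weight, dim=1)   # unit-length rows, so dot product = cosine similarity
    sims = normed @ normed[row_idx]
    sims[row_idx] = float("-inf")         # exclude the query row itself
    return torch.topk(sims, n).indices

# Hypothetical 2-D "embeddings": row 1 points almost the same way as row 0
toy_weight = torch.tensor([[1.0, 0.0],
                           [0.9, 0.1],
                           [0.0, 1.0],
                           [-1.0, 0.0]])

print(closest_rows(toy_weight, 0, n=2))
```

With a trained model, `model.embedding.weight` could be passed as `weight` and `word2int['she']` as `row_idx`.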


# Pre-processing dataset

In [None]:
%pip install nltk


In [None]:
import re
import nltk
from collections import Counter
from nltk.tokenize import word_tokenize, sent_tokenize
import string

def process_fairy_tales(text):
    # Convert text to lowercase
    text = text.lower()

    # Tokenize the text into words and find the top 5000 words
    words = word_tokenize(text)
    top_5000_words = set(word for word, count in Counter(words).most_common(5000))

    # Tokenize the text into sentences
    sentences = sent_tokenize(text)

    # Filter sentences
    filtered_sentences = []
    for sentence in sentences:
        # Remove sentences with quoted dialogue
        if re.search(r'["“”]', sentence):
            continue

        # Check if all words in the sentence are in the top 5000 words
        sentence_words = word_tokenize(sentence)
        if all(word in top_5000_words for word in sentence_words):
            # Remove all punctuation except periods
            sentence = re.sub(r'[^\w\s\.]', '', sentence)
            filtered_sentences.append(sentence)

    # Join the remaining sentences
    return ' '.join(filtered_sentences)

nltk.download('punkt')
processed_text = process_fairy_tales(text)
len(processed_text)

In [None]:
file_name = "fairy_tales_simple_dataset_most_common_5000_words.txt"

with open(file_name, 'w', encoding="utf8") as file:
    file.write(processed_text)


# Model 2: two-layer LSTM with a larger hidden size

In [None]:
class RNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, rnn_hidden_size, num_layers=2):
        super().__init__()
        self.num_layers = num_layers
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn_hidden_size = rnn_hidden_size
        self.rnn = nn.LSTM(embed_dim, rnn_hidden_size, num_layers,
                           batch_first=True)
        self.fc = nn.Linear(rnn_hidden_size, vocab_size)

    def forward(self, x, hidden, cell):
        out = self.embedding(x).unsqueeze(1)
        out, (hidden, cell) = self.rnn(out, (hidden, cell))
        out = self.fc(out).reshape(out.size(0), -1)
        return out, hidden, cell

    def init_hidden(self, batch_size):
        hidden = torch.zeros(self.num_layers, batch_size, self.rnn_hidden_size)
        cell = torch.zeros(self.num_layers, batch_size, self.rnn_hidden_size)
        return hidden.to(DEVICE), cell.to(DEVICE)

vocab_size = len(word_array)
embed_dim = 256

rnn_hidden_size = 1024

model = RNN(vocab_size, embed_dim, rnn_hidden_size)
model = model.to(DEVICE)
model

RNN(
  (embedding): Embedding(5371, 256)
  (rnn): LSTM(256, 1024, num_layers=2, batch_first=True)
  (fc): Linear(in_features=1024, out_features=5371, bias=True)
)
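To get a feel for how much larger the recurrent part of this second model is, the LSTM parameters can be counted and checked against the usual per-layer formula (four gates, each with input weights, recurrent weights and two bias vectors). The helper names below are hypothetical; the layer sizes mirror the two models in this notebook.

```python
import torch.nn as nn

def lstm_param_count(input_size, hidden_size, num_layers):
    """Parameter count of a unidirectional LSTM, computed layer by layer."""
    total = 0
    for layer in range(num_layers):
        in_size = input_size if layer == 0 else hidden_size
        # 4 gates, each with input weights, recurrent weights and two bias vectors
        total += 4 * (in_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)
    return total

def count_parameters(module):
    return sum(p.numel() for p in module.parameters())

small_lstm = nn.LSTM(256, 512, batch_first=True)       # as in the first model
large_lstm = nn.LSTM(256, 1024, 2, batch_first=True)   # as in Model 2

print(count_parameters(small_lstm), count_parameters(large_lstm))
```

Note that re-instantiating the model resets its weights, so it has to be trained again before generating text.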

In [None]:
model.to(DEVICE)
print(generate(model, seed_str='elinor'))

elinor order gift intimate lorrys polly humanity underneath saved tsar playing emperors bastille slid thankful parties day hags flesh could being fairies push protection remaining reign veil louder regarding news yere tent beauty engagement resolved voyage onto subjects bad stooping jarviss
