### Strategy Description: Transformer Model for Masked Character Prediction with Cross Entropy Loss

#### Overview
The provided code defines a Transformer-based approach to predict masked characters in words. This character-level sequence-to-sequence task involves masking certain characters in a word and training the model to predict the original characters at those masked positions. The model uses Cross Entropy Loss for optimization.
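
To make the masking task concrete, here is a minimal sketch in plain Python, independent of the dataset class defined later. The helper name `mask_word` and the seeded `rng` argument are assumptions introduced for illustration; only the `_` mask symbol comes from the notebook.

```python
import random

MASK = "_"  # same mask symbol the notebook uses

def mask_word(word, rng=random):
    """Return (masked, original), hiding between 1 and len(word) random positions."""
    if not word:
        raise ValueError("cannot mask an empty word")
    n_hidden = rng.randint(1, len(word))
    hidden = set(rng.sample(range(len(word)), n_hidden))
    # Keep every character except the ones chosen for masking
    masked = "".join(MASK if i in hidden else ch for i, ch in enumerate(word))
    return masked, word

rng = random.Random(0)  # seeded only so the illustration is repeatable
masked, original = mask_word("transformer", rng)
print(masked, original)
```

The model's job is then to recover `original` at the positions where `masked` shows `_`.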

#### Components and Workflow

1. **Data Preparation**:
   - **Reading Words**: Words are read from a text file where each word is on a separate line.
   - **Dataset Class**: A custom `CharDataset` class handles the character-level dataset. Each time a word is fetched, a random subset of its positions (at least one, possibly all) is replaced with the mask token `_`, and both the masked word and the original word are converted to sequences of integer indices. Index 0 is reserved for padding.

2. **DataLoader and Collate Function**:
   - **DataLoader**: A PyTorch `DataLoader` is used to handle batching and shuffling of the dataset.
   - **Collate Function**: A custom `collate_fn` right-pads every sequence with index 0 to a fixed length (`max_len=50`), so all batches have the same input dimensions.

3. **Positional Encoding**:
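   - **Positional Encoding Module**: A sinusoidal `PositionalEncoding` module is defined to add position information to the token embeddings, so that the Transformer can capture character order. In the code below its use inside `TransformerModel` is commented out, so the model as run does not apply it.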

4. **Transformer Model**:
   - **Model Architecture**: The `TransformerModel` class wraps an embedding layer, a PyTorch `nn.Transformer` (encoder and decoder stacks with configurable layer counts, attention heads, and feedforward dimension), and a final linear layer that produces per-position character logits. The positional encoding step is commented out. The masked sequence is passed as both the source and the target of the Transformer.
   - **Hyperparameters**: The model is instantiated with `d_model=16` (embedding dimension), `nhead=8` attention heads, one encoder layer, one decoder layer, and `dim_feedforward=64`.

5. **Training Loop**:
   - **Cross Entropy Loss**: The loss function is `nn.CrossEntropyLoss`, the standard choice for multi-class classification. It compares the predicted character logits with the original characters (targets); `ignore_index=0` excludes padded positions from the loss.
   - **Optimization**: The optimizer is Adam with a learning rate of 0.01.
   - **Training Process**: For each batch in each epoch, the masked words are fed into the model, the loss is computed against the original words, and the parameters are updated via backpropagation. In the code below the training loop is commented out and a saved checkpoint is loaded instead.

6. **Decoding and Inference**:
   - **Greedy Decoding**: A simple greedy decoding strategy is used to iteratively fill in the masked characters in the word during inference. The model predicts the character with the highest probability for each masked position.
   - **Probability Analysis**: During inference, the predicted probabilities are analyzed to provide a list of possible characters sorted by their predicted likelihood.
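
The loss step in item 5 can be illustrated on its own. The sketch below uses random logits instead of a trained model and assumes a `(batch, seq_len, vocab_size)` layout; the names `masked_position_loss` and `PAD_IDX` are hypothetical and do not appear in the notebook. It shows how flattening the logits and targets, combined with `ignore_index`, averages the loss over non-padded positions only.

```python
import torch
import torch.nn as nn

PAD_IDX = 0  # index used for padding, matching the notebook's collate_fn

def masked_position_loss(logits, targets):
    """Cross entropy over a padded batch, skipping targets equal to PAD_IDX.

    logits:  (batch, seq_len, vocab_size) raw scores
    targets: (batch, seq_len) class indices, PAD_IDX where padded
    """
    criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
    vocab_size = logits.size(-1)
    # CrossEntropyLoss expects (N, C) scores and (N,) targets
    return criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))

torch.manual_seed(0)
logits = torch.randn(2, 5, 28)              # stand-in for model output
targets = torch.tensor([[3, 1, 20, 0, 0],   # last two positions are padding
                        [1, 18, 4, 5, 14]])
loss = masked_position_loss(logits, targets)
print(loss.item())
```

Because padded targets are ignored, whatever the model outputs at those positions has no effect on the loss or its gradients.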






In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F
import random
import numpy as np

# Read words from a .txt file
def read_words_from_file(file_path):
    with open(file_path, 'r') as file:
        # Strip whitespace and skip blank lines (an empty word cannot be masked)
        words = [line.strip() for line in file if line.strip()]
    return words

# Example file path
file_path = "/content/words_250000_train.txt"  # Ensure this file exists with one word per line
words = read_words_from_file(file_path)

# Define a character-level dataset
class CharDataset(Dataset):
    def __init__(self, words, mask_token='_'):
        self.words = words
        self.mask_token = mask_token
        # 26 lowercase letters plus the mask symbol
        self.chars = list("abcdefghijklmnopqrstuvwxyz_")
        if mask_token not in self.chars:
            self.chars.append(mask_token)
        # Indices start at 1; index 0 is reserved for padding
        self.char_to_idx = {char: idx for idx, char in enumerate(self.chars, 1)}
        self.idx_to_char = {idx: char for char, idx in self.char_to_idx.items()}

    def __len__(self):
        return len(self.words)

    def __getitem__(self, idx):
        word = self.words[idx]
        original_word = [self.char_to_idx[char] for char in word]
        # Mask a random subset of positions: at least one, at most all of them
        num_masked = random.randint(1, len(word))
        masked_positions = set(random.sample(range(len(word)), num_masked))
        masked_word = [
            self.char_to_idx[self.mask_token] if i in masked_positions else self.char_to_idx[char]
            for i, char in enumerate(word)
        ]
        return torch.tensor(masked_word), torch.tensor(original_word)

# Custom collate function to pad sequences to a fixed length with index 0
def collate_fn(batch, max_len=50):
    masked_words, original_words = zip(*batch)
    padded_masked_words = torch.zeros((len(masked_words), max_len), dtype=torch.long)
    padded_original_words = torch.zeros((len(original_words), max_len), dtype=torch.long)

    for i in range(len(masked_words)):
        n = min(len(masked_words[i]), max_len)  # truncate words longer than max_len
        padded_masked_words[i, :n] = masked_words[i][:n]
        padded_original_words[i, :n] = original_words[i][:n]

    return padded_masked_words, padded_original_words

# Create dataset and dataloader with custom collate function
dataset = CharDataset(words)
dataloader = DataLoader(dataset, batch_size=256, shuffle=True, collate_fn=collate_fn)

# Define a Positional Encoding module (its use in TransformerModel below is commented out)
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-torch.log(torch.tensor(10000.0)) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        return x + self.pe[:x.size(0), :]

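# Define a simple Transformer model
class TransformerModel(nn.Module):
    def __init__(self, vocab_size, d_model, nhead, num_encoder_layers, num_decoder_layers, dim_feedforward=64, max_len=34):
        super(TransformerModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        # Positional encoding is disabled; max_len is only needed if it is re-enabled
        # self.pos_encoder = PositionalEncoding(d_model, max_len)
        # nn.Transformer defaults to batch_first=False, i.e. inputs of shape (seq_len, batch, d_model)
        self.transformer = nn.Transformer(d_model, nhead, num_encoder_layers, num_decoder_layers, dim_feedforward)
        self.fc = nn.Linear(d_model, vocab_size)

    def forward(self, src):
        src = self.embedding(src)
        # src = self.pos_encoder(src)
        # The masked sequence serves as both encoder input and decoder input; no attention masks are passed
        output = self.transformer(src, src)
        # Project each position to logits over the vocabulary
        output = self.fc(output)
        return output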

# Model hyperparameters
vocab_size = len(dataset.chars) + 1  # 27 characters plus the padding index 0
d_model = 16
nhead = 8
num_encoder_layers = 1
num_decoder_layers = 1
max_len = 34

# Instantiate the model
model = TransformerModel(vocab_size, d_model, nhead, num_encoder_layers, num_decoder_layers, dim_feedforward=64, max_len=max_len)
criterion = nn.CrossEntropyLoss(ignore_index=0)  # Ignore padding token
optimizer = optim.Adam(model.parameters(), lr=0.01)

# # Training loop (commented out; a saved checkpoint is loaded below instead)
# for epoch in range(100):
#     model.train()
#     for masked_words, original_words in dataloader:
#         optimizer.zero_grad()

#         # Predict the whole sequence
#         outputs = model(masked_words)

#         # Compute loss for each position
#         outputs = outputs.view(-1, vocab_size)
#         original_words = original_words.view(-1)
#         loss = criterion(outputs, original_words)

#         loss.backward()
#         optimizer.step()

#     print(f"Epoch {epoch+1}, Loss: {loss.item()}")

saved_model_path = '/content/newtry_149.pth'
# torch.save(model.state_dict(), '/content/newtry_40.pth')

# Map the checkpoint to the CPU when no GPU is available
map_location = None if torch.cuda.is_available() else torch.device('cpu')
loaded_model_state_dict = torch.load(saved_model_path, map_location=map_location)

# Instantiate the model with the same architecture as before
model = TransformerModel(vocab_size, d_model, nhead, num_encoder_layers, num_decoder_layers, dim_feedforward=64, max_len=max_len)

# Load the state dictionary into the model
model.load_state_dict(loaded_model_state_dict)



# Note: model.eval() is left commented out, so the model stays in training mode
# and dropout remains active during the decoding below.
# model.eval()

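# Testing the model (greedy decoding for simplicity)
def decode(model, masked_word, max_len=10):
    # model.eval()
    with torch.no_grad():
        for _ in range(max_len):
            outputs = model(masked_word)
            predictions = outputs.argmax(dim=-1)
            probabilities = F.softmax(outputs, dim=-1)
            # Fill in at most one masked entry per pass.
            # NOTE: this loop scans row 0 along dimension 1. The cells below pass a
            # (seq_len, 1) tensor, so only the first character is ever checked and
            # masks at later positions are left unchanged, as the printed output shows.
            for i in range(masked_word.size(1)):
                if masked_word[0, i] == dataset.char_to_idx['_']:
                    masked_word[0, i] = predictions[1, i]
                    break
    # masked_word is modified in place; probabilities come from the last pass
    return masked_word, probabilities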

# Example of decoding a partially masked word
clean_word="ard_nt"
masked_word = torch.tensor([[dataset.char_to_idx[x]] for x in clean_word])
decoded_word, probabilities = decode(model,masked_word)
print([dataset.idx_to_char[int(i)] for i in decoded_word])
len_word=len(clean_word)

for i in range(len_word):
  if clean_word[i] == "_":
    max_prob_idx = int(np.argmax(probabilities[i]))
    probs = probabilities[i].reshape([28])
    indices = probs.argsort(descending=True)
    dataset.idx_to_char[0]='?'
    print([dataset.idx_to_char[int(i)] for i in indices])
    # print(probabilities[i])
    guessed_char = dataset.idx_to_char[max_prob_idx]
    # print(guessed_char)
    if guessed_char not in clean_word:
      print(guessed_char)


['a', 'r', 'd', '_', 'n', 't']
['r', 'n', 'a', 'e', 'l', 't', 'o', 'i', 's', 'c', 'p', 'm', 'u', 'd', 'b', 'g', 'h', 'f', 'y', 'v', 'w', 'k', 'x', 'z', 'j', 'q', '_', '?']
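
The greedy strategy from the description (fill one masked position per model call, taking the highest-scoring character) can be sketched without the trained checkpoint. This is a minimal illustration on a flat list of indices, not the notebook's `decode`; `greedy_fill`, `score_fn`, and `toy_scores` are hypothetical names, and the toy scoring function merely stands in for a model.

```python
def greedy_fill(score_fn, tokens, mask_idx, pad_idx=0):
    """Fill masked positions left to right, one per scoring call.

    score_fn(tokens) is assumed to return one row of scores per position,
    with one entry per vocabulary index.
    """
    tokens = list(tokens)  # work on a copy so the caller's list is untouched
    while mask_idx in tokens:
        scores = score_fn(tokens)
        pos = tokens.index(mask_idx)  # first position that is still masked
        # Exclude the mask and padding indices so every step makes progress
        candidates = [i for i in range(len(scores[pos])) if i not in (mask_idx, pad_idx)]
        tokens[pos] = max(candidates, key=lambda i: scores[pos][i])
    return tokens

def toy_scores(tokens):
    # Stand-in for a model: prefers index 1 at even positions and index 2 at odd ones.
    # Index 3 (the mask) gets the top raw score to show that it is never selected.
    return [[0.0, 1.0, 0.5, 9.0] if i % 2 == 0 else [0.0, 0.5, 1.0, 9.0]
            for i in range(len(tokens))]

filled = greedy_fill(toy_scores, [1, 3, 3, 2], mask_idx=3)
print(filled)
```

Re-scoring after each fill lets later predictions condition on earlier ones, at the cost of one model call per masked position.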


In [None]:
# Example of decoding a partially masked word
clean_word="att_r_"
masked_word = torch.tensor([[dataset.char_to_idx[x]] for x in clean_word])
decoded_word, probabilities = decode(model,masked_word)
print([dataset.idx_to_char[int(i)] for i in decoded_word])
len_word=len(clean_word)

for i in range(len_word):
  if clean_word[i] == "_":
    max_prob_idx = int(np.argmax(probabilities[i]))
    probs = probabilities[i].reshape([28])
    indices = probs.argsort(descending=True)
    dataset.idx_to_char[0]='?'
    print([dataset.idx_to_char[int(i)] for i in indices])
    # print(probabilities[i])
    guessed_char = dataset.idx_to_char[max_prob_idx]
    # print(guessed_char)
    if guessed_char not in clean_word:
      print(guessed_char)

['a', 't', 't', '_', 'r', '_']
['r', 'e', 'a', 'n', 'l', 'o', 't', 'i', 's', 'c', 'p', 'm', 'u', 'd', 'g', 'b', 'h', 'f', 'y', 'v', 'w', 'k', 'x', 'z', 'j', 'q', '_', '?']
['e', 'r', 'a', 'n', 't', 'i', 'l', 'o', 's', 'c', 'p', 'm', 'd', 'u', 'g', 'b', 'h', 'y', 'f', 'v', 'w', 'k', 'x', 'z', 'j', 'q', '_', '?']
e


In [None]:
# Example of decoding a partially masked word
clean_word="att_re"
masked_word = torch.tensor([[dataset.char_to_idx[x]] for x in clean_word])
decoded_word, probabilities = decode(model,masked_word)
print([dataset.idx_to_char[int(i)] for i in decoded_word])
len_word=len(clean_word)

for i in range(len_word):
  if clean_word[i] == "_":
    max_prob_idx = int(np.argmax(probabilities[i]))
    probs = probabilities[i].reshape([28])
    indices = probs.argsort(descending=True)
    dataset.idx_to_char[0]='?'
    print([dataset.idx_to_char[int(i)] for i in indices])
    # print(probabilities[i])
    guessed_char = dataset.idx_to_char[max_prob_idx]
    # print(guessed_char)
    if guessed_char not in clean_word:
      print(guessed_char)


['a', 't', 't', '_', 'r', 'e']
['e', 'r', 'i', 'a', 't', 'o', 'l', 'n', 's', 'c', 'p', 'u', 'd', 'm', 'h', 'g', 'b', 'y', 'f', 'k', 'v', 'w', 'z', 'x', 'q', 'j', '_', '?']


In [None]:
# English letters in approximate descending order of frequency
freq = ['e', 't', 'a', 'o', 'i', 'n', 's', 'r', 'h', 'l', 'd', 'c', 'u', 'm', 'f', 'p', 'g', 'w', 'y', 'b', 'v', 'k', 'x', 'j', 'q', 'z']
len(freq)

26
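
The notebook does not show how `freq` is used. One plausible use, sketched here purely as an assumption, is as a fallback ordering when choosing the next letter to guess: take the model's ranked letters first, skip anything already present in the pattern, and fall back to the frequency order if the ranking is exhausted. `next_guess` is a hypothetical helper, and `ranked` copies the first few entries of the last ranking printed above.

```python
FREQ = ['e', 't', 'a', 'o', 'i', 'n', 's', 'r', 'h', 'l', 'd', 'c', 'u',
        'm', 'f', 'p', 'g', 'w', 'y', 'b', 'v', 'k', 'x', 'j', 'q', 'z']

def next_guess(ranked, already_used, fallback=FREQ):
    """Return the first letter from ranked, then fallback, that is not in already_used."""
    for ch in list(ranked) + list(fallback):
        # Skip non-letters such as the mask '_' or the padding marker '?'
        if ch.isalpha() and ch not in already_used:
            return ch
    return None  # every letter has been used

ranked = ['e', 'r', 'i', 'a', 't', 'o']  # model ranking for the pattern "att_re"
used = set("att_re") - {"_"}             # letters already visible in the pattern
print(next_guess(ranked, used))
```

This mirrors the `if guessed_char not in clean_word` check in the cells above, extended to walk down the ranking instead of stopping at the top prediction.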