# Task 5: Language Model [5p]

Build a basic language model using a publicly available text dataset. You'll experiment with RNN-based architectures (Simple RNN, LSTM, GRU) to learn how they model sequences.

### **Part 1: Dataset Download & Preparation (1 point)**

**Tasks:**

* Download a publicly available dataset, e.g., *Alice’s Adventures in Wonderland* from Project Gutenberg.
  * Use requests or a dataset API like torchtext.datasets.
* Preprocess the text:
  * Lowercase, remove non-alphabetic characters.
  * Tokenize into words (use nltk or spaCy).
  * Build a vocabulary, keeping frequent words (e.g., top 10k).
* Use **pretrained word embeddings** (e.g., GloVe 100d or FastText):
  * Load with torchtext.vocab, gensim, or similar.
  * Initialize the embedding layer with pretrained vectors.


### **Part 2: Build a Recurrent Language Model (1 point)**

**Tasks:**

* Implement a word-level language model using:
  * Pretrained embedding layer (frozen or trainable).
  * A single-layer **Simple RNN**.
  * A fully connected output layer with softmax.

### **Part 3: Train the Model (1 point)**

**Tasks:**

* Use cross-entropy loss.
* Predict the next word from a sequence.
* Use teacher forcing and batching.
* Plot training loss over time.

### **Part 4: Generate Text (1 point)**

**Tasks:**

* Given a seed sequence, generate text of specified length.
* Use **temperature sampling** to vary creativity.
* Try different temperatures and compare.

### **Part 5: Evaluation & Reflection (1 point) -> W&B report**

**Tasks:**

* Evaluate model outputs: does it learn sentence structure?
* Reflect on limitations of the Simple RNN and its behavior on longer sequences.

### **Bonus Section (Up to +2 Points): Model Comparison**

Compare the performance of three models:


1. Simple RNN
2. LSTM
3. GRU

**Tasks:**

* Implement the same model architecture but switch out the recurrent layer.
* Train all three models under the same conditions.
* Record and compare:
  * Training time
  * Final loss
  * Generated text quality
* (Optional) Add dropout to recurrent layers and observe effects.
* Summarize findings in a table or chart.

In [None]:
# Part 1: Dataset Download & Preparation

import re
import requests
import nltk
from nltk.tokenize import word_tokenize
from collections import Counter
import torch
import torch.nn as nn
import torchtext.vocab as vocab

# Download NLTK resources
nltk.download('punkt')

# 1. Download dataset
url = "https://www.gutenberg.org/cache/epub/11/pg11.txt"
response = requests.get(url)
raw_text = response.text

# 2. Preprocess text
def preprocess(text):
    # Lowercase and remove non-alphabetic characters
    text = text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)  # Keep spaces for tokenization
    # Tokenize and clean
    tokens = word_tokenize(text)
    tokens = [re.sub(r'[^a-z]', '', token) for token in tokens]  # Remove any remaining non-alphabetic chars
    return [token for token in tokens if token]  # Remove empty strings

processed_tokens = preprocess(raw_text)

# 3. Build vocabulary
vocab_size = 10000
token_counts = Counter(processed_tokens)
vocab_words = [word for word, _ in token_counts.most_common(vocab_size)]
vocab = ['<unk>', '<pad>'] + vocab_words
word_to_idx = {word: idx for idx, word in enumerate(vocab)}

# 4. Load pretrained embeddings
glove = vocab.GloVe(name='6B', dim=100)

# 5. Create embedding matrix
embedding_dim = 100
embedding_matrix = torch.randn(len(vocab), embedding_dim)

# Initialize with GloVe vectors where available
for idx, word in enumerate(vocab):
    if word in glove.stoi:
        embedding_matrix[idx] = glove.vectors[glove.stoi[word]]

# Special tokens handling
embedding_matrix[1] = torch.zeros(embedding_dim)  # <pad> token
embedding_matrix[0] = torch.mean(glove.vectors, dim=0)  # <unk> as average vector

# Create embedding layer
embedding_layer = nn.Embedding.from_pretrained(
    embedding_matrix,
    freeze=False,  # Allow fine-tuning
    padding_idx=1
)

print("Embedding layer created with shape:", embedding_matrix.shape)
print("Example embedding for 'alice':", embedding_layer(torch.tensor(word_to_idx['alice'])))

OSError: [WinError 127] The specified procedure could not be found

In [2]:
import re
import requests
import nltk
from nltk.tokenize import word_tokenize
from collections import Counter
import torch
import torch.nn as nn
import gensim.downloader as api

# Download NLTK resources
nltk.download('punkt')

# 1. Download dataset
url = "https://www.gutenberg.org/files/11/11-0.txt"
response = requests.get(url)
raw_text = response.text

# 2. Preprocess text
def preprocess(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)
    tokens = word_tokenize(text)
    return [token for token in tokens if token.isalpha()]

processed_tokens = preprocess(raw_text)

# 3. Build vocabulary
vocab_size = 10000
token_counts = Counter(processed_tokens)
vocab_words = [word for word, _ in token_counts.most_common(vocab_size)]
vocab = ['<unk>', '<pad>'] + vocab_words
word_to_idx = {word: idx for idx, word in enumerate(vocab)}

# 4. Load GloVe embeddings using Gensim
glove = api.load("glove-wiki-gigaword-100")

# 5. Create embedding matrix
embedding_dim = 100
embedding_matrix = torch.randn(len(vocab), embedding_dim)

# Initialize with GloVe vectors
for idx, word in enumerate(vocab):
    if word in glove:
        embedding_matrix[idx] = torch.tensor(glove[word])
    elif word == '<pad>':
        embedding_matrix[idx] = torch.zeros(embedding_dim)
    else:
        embedding_matrix[idx] = torch.tensor(glove['unk'])  # Use 'unk' vector if available

# Create embedding layer
embedding_layer = nn.Embedding.from_pretrained(
    embedding_matrix,
    freeze=False,
    padding_idx=1
)

print("Embedding layer created successfully!")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\patry\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


LookupError: 
**********************************************************************
  Resource [93mpunkt_tab[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('punkt_tab')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mtokenizers/punkt_tab/english/[0m

  Searched in:
    - 'C:\\Users\\patry/nltk_data'
    - 'c:\\Users\\patry\\AppData\\Local\\Programs\\Python\\Python312\\nltk_data'
    - 'c:\\Users\\patry\\AppData\\Local\\Programs\\Python\\Python312\\share\\nltk_data'
    - 'c:\\Users\\patry\\AppData\\Local\\Programs\\Python\\Python312\\lib\\nltk_data'
    - 'C:\\Users\\patry\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************
