
# **Experiment 7: Learning Word Representations**
## Abstract

This experiment explores the learning of word representations using neural network language models for next-word prediction. We design models that predict the fourth word in a 4-gram (a sequence of four adjacent words) given the preceding three words. The dataset is constructed from texts drawn from a similar domain (e.g., news, scientific, or literary) and is restricted to a vocabulary of approximately 250–300 words (including punctuation such as commas, full-stops, and parentheses). Although the ideal training set should consist of around 400,000 4-grams with an additional 50,000 4-grams each for validation and testing, the experiment demonstrates the approach on available data. Two model architectures—a Recurrent Neural Network (RNN) with LSTM cells and a Transformer-based model—are implemented and compared. The quality of the learned word representations is evaluated by examining the nearest words in the embedding space, computing cosine distances between word pairs, and testing next-word predictions on common sequences.


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 1. Introduction

Recent advances in neural language modeling have enabled the development of models that learn distributed word representations, capturing both syntactic and semantic nuances. In this experiment, our objective is to train a neural network that, given three consecutive words, predicts the fourth. The underlying hypothesis is that the next-word prediction task forces the model to learn useful word embeddings where semantically and syntactically similar words are located near each other. We compare two prominent architectures:
- **RNN-based LSTM Model:** Utilizes LSTM layers to capture sequential dependencies.
- **Transformer-based Model:** Employs self-attention mechanisms to capture global contextual relationships.


## 2. Methodology

### 2.1 Data Preparation
- **Corpus Selection:** The dataset is drawn from a single-domain text collection. For this experiment, the NLTK Gutenberg corpus (literary texts) is used, with all text converted to lowercase.
- **4-gram Extraction:** The corpus is tokenized, and 4-grams are extracted. Only those 4-grams where every token belongs to a curated vocabulary of approximately 250–300 words are retained. Special characters are treated as individual tokens.
- **Dataset Split:** Ideally, the training set should contain around 400,000 4-grams, and the validation and test sets should have about 50,000 4-grams each. In practice, available data is split using an 80/10/10 ratio.
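
The extraction step described above can be illustrated on a few tokens. This is a minimal sketch on made-up data: `extract_fourgrams`, `toy_tokens`, and `toy_vocab` are hypothetical names introduced only for illustration.

```python
def extract_fourgrams(tokens, vocabulary):
    """Return every run of 4 adjacent tokens whose members are all in `vocabulary`."""
    vocab_set = set(vocabulary)  # set membership is O(1) per lookup
    return [
        tuple(tokens[i:i + 4])
        for i in range(len(tokens) - 3)
        if all(tok in vocab_set for tok in tokens[i:i + 4])
    ]

# Hypothetical toy data, not drawn from the corpus used in this experiment.
toy_tokens = ["he", "is", "the", "lord", ",", "and", "zebra", "is", "the", "lord"]
toy_vocab = ["he", "is", "the", "lord", ",", "and"]

fourgrams = extract_fourgrams(toy_tokens, toy_vocab)
for gram in fourgrams:
    # The first three tokens form the input, the fourth is the prediction target.
    print(gram[:3], "->", gram[3])
```

Any window that touches the out-of-vocabulary token `zebra` is discarded, so a single rare word removes up to four candidate 4-grams.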

### 2.2 Model Architectures
1. **RNN-based LSTM Model:**  
   The model consists of an embedding layer, an LSTM layer to capture temporal dependencies, and fully connected layers with a softmax output to predict the next word.
   
2. **Transformer-based Model:**  
   This model uses a Transformer block that includes multi-head attention and feed-forward neural networks to learn global contextual representations before making predictions.
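
To make the attention step more concrete, the sketch below computes single-head scaled dot-product self-attention over three context positions in NumPy. It is an illustration under stated assumptions, not the Keras layer itself: the projection matrices `w_q`, `w_k`, and `w_v` are random stand-ins for learned weights, and a multi-head layer runs several such heads in parallel before a final output projection.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Single-head attention: softmax(q k^T / sqrt(d_k)) v."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                 # (seq_len, seq_len)
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row now sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, embed_dim = 3, 50  # three context words, 50-dimensional embeddings
x = rng.normal(size=(seq_len, embed_dim))  # stand-in for the embedded input tokens

# Hypothetical projection matrices; a trained model learns these.
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(embed_dim, embed_dim)) for _ in range(3))

context, attn_weights = scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v)
print(context.shape)               # one context vector per input position
print(attn_weights.sum(axis=-1))   # attention rows are probability distributions
```

Each output row is a weighted mix of all three positions, which is what lets the model relate any context word to any other in a single step.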

### 2.3 Evaluation Metrics
- **Nearest Neighbor Analysis:** For selected words, the model retrieves the closest words in the learned embedding space using cosine similarity.
- **Cosine Distance:** The similarity between two words is quantified by computing the cosine distance between their embeddings (e.g., "he" should be closer to "she" than to "federal").
- **Next-word Prediction:** The models generate predictions for the next word given common three-word sequences such as "government of united", "city of new", "life in the", and "he is the". The outputs are analyzed for semantic coherence.
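
For the next-word prediction check, it can help to look at several top candidates and to detect context words that fall outside the vocabulary. The following is a minimal sketch on a made-up probability vector; `top_k_next_words`, `encode_context`, and the toy vocabulary are hypothetical and are not part of the notebook code.

```python
import numpy as np

def top_k_next_words(probs, idx2word, k=3):
    """Return the k most probable (word, probability) pairs, best first."""
    top = np.argsort(probs)[::-1][:k]
    return [(idx2word[int(i)], float(probs[i])) for i in top]

def encode_context(words, word2idx):
    """Map words to indices, or return None if any word is out of vocabulary."""
    if any(w not in word2idx for w in words):
        return None
    return [word2idx[w] for w in words]

# Hypothetical toy vocabulary and softmax output.
toy_vocab = [",", "the", "world", "lord", "and"]
toy_word2idx = {w: i for i, w in enumerate(toy_vocab)}
toy_idx2word = {i: w for w, i in toy_word2idx.items()}
toy_probs = np.array([0.05, 0.10, 0.50, 0.30, 0.05])

print(encode_context(["the", "lord", "and"], toy_word2idx))   # all words known
print(encode_context(["city", "of", "new"], toy_word2idx))    # contains unknown words
print(top_k_next_words(toy_probs, toy_idx2word, k=2))
```

Returning `None` for an unknown context makes the failure explicit, whereas substituting a default index would feed the model a different sequence than the one that was asked about.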

### 2.4 Cell Descriptions:

- **Cell 1:** Download and load the corpus. This cell downloads the required NLTK resources (including 'punkt' and 'punkt_tab') and loads a large corpus (e.g., Gutenberg corpus) in lowercase.

- **Cell 2:** Tokenize the text. This cell tokenizes the loaded corpus while preserving punctuation as separate tokens.

- **Cell 3:** Build the vocabulary. This cell counts token frequencies, selects the top 300 tokens, and creates mapping dictionaries (word2idx and idx2word).

- **Cell 4:** Extract 4-grams. This cell iterates through the tokens to extract all 4-grams, retaining only those where every token is in the restricted vocabulary.

- **Cell 5:** Prepare input–output pairs. This cell converts each 4-gram into an input sequence (first three tokens) and a target word (fourth token), then formats them as numpy arrays.

- **Cell 6:** Split the dataset. This cell randomly shuffles and splits the data into training, validation, and test sets.

- **Cell 7:** Build two neural network language models. This cell defines and compiles the RNN-based LSTM model and the Transformer-based model for next-word prediction.

- **Cell 8:** Train both models. This cell trains the RNN and Transformer models on the training data while validating performance on the validation set.

- **Cell 9:** Evaluate both models. This cell evaluates the trained models on the test set and prints out performance metrics (loss and accuracy).

- **Cell 10:** Next-word prediction. This cell defines a function to predict the next word given a three-word input and tests the function on several common sequences.

- **Cell 11:** Analyze word embeddings. This cell extracts the learned embeddings, defines functions to compute cosine similarity and find nearest neighbors, and analyzes word relationships using cosine distances.



In [3]:
import nltk

# Download required corpora and tokenization resources
nltk.download('gutenberg')
nltk.download('punkt')
nltk.download('punkt_tab')  # Tokenizer tables needed by word_tokenize in newer NLTK releases

from nltk.corpus import gutenberg

# Combine all texts in the Gutenberg corpus into a single large string
texts = [gutenberg.raw(fileid) for fileid in gutenberg.fileids()]
text = "\n".join(texts).lower()  # Convert to lowercase
print("Total length of Gutenberg corpus (characters):", len(text))


[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Total length of Gutenberg corpus (characters): 11793335


In [4]:
# Tokenize the text (punctuation is preserved as tokens)
tokens = nltk.word_tokenize(text)
print("Total tokens:", len(tokens))


Total tokens: 2539731


In [5]:
from collections import Counter

# Define the vocabulary size (around 300 tokens as required)
vocab_size = 300

# Count token frequencies and select the top vocab_size tokens
counter = Counter(tokens)
most_common = counter.most_common(vocab_size)
vocab = [word for word, count in most_common]

# Create mapping dictionaries for word-to-index and index-to-word
word2idx = {word: idx for idx, word in enumerate(vocab)}
idx2word = {idx: word for word, idx in word2idx.items()}

print("Vocabulary Size:", len(vocab))
print("Sample vocabulary:", vocab[:20])


Vocabulary Size: 300
Sample vocabulary: [',', 'the', 'and', '.', 'of', 'to', 'a', 'in', 'i', 'that', ';', 'he', 'it', 'his', 'for', 'was', 'not', 'with', "''", 'is']


In [6]:
# Extract all 4‑grams (each 4‑gram is a sequence of 4 adjacent tokens)
# Only include 4‑grams where every token is in our vocabulary.
fourgrams = []
for i in range(len(tokens) - 3):
    gram = tokens[i:i+4]
    # Dict membership is O(1); scanning the `vocab` list costs O(vocab_size) per token.
    if all(word in word2idx for word in gram):
        fourgrams.append(gram)

print("Total 4-grams extracted:", len(fourgrams))
total_required = 400000 + 50000 + 50000  # Target: 500K total 4-grams
if len(fourgrams) < total_required:
    print("Warning: Not enough 4-grams available. The available data will be used for splitting.")


Total 4-grams extracted: 563786


In [7]:
import numpy as np

# For each 4‑gram, the first three tokens are the input and the fourth token is the target (label)
inputs = []
labels = []
for gram in fourgrams:
    input_seq = [word2idx[word] for word in gram[:3]]
    label = word2idx[gram[3]]
    inputs.append(input_seq)
    labels.append(label)

inputs = np.array(inputs)
labels = np.array(labels)
print("Input shape:", inputs.shape)
print("Labels shape:", labels.shape)


Input shape: (563786, 3)
Labels shape: (563786,)


In [8]:
import random

num_samples = len(inputs)
indices = list(range(num_samples))
random.shuffle(indices)

train_end = int(0.8 * num_samples)
val_end = int(0.9 * num_samples)

X_train = inputs[indices[:train_end]]
y_train = labels[indices[:train_end]]
X_val = inputs[indices[train_end:val_end]]
y_val = labels[indices[train_end:val_end]]
X_test = inputs[indices[val_end:]]
y_test = labels[indices[val_end:]]

print("Training samples:", len(X_train))
print("Validation samples:", len(X_val))
print("Test samples:", len(X_test))


Training samples: 451028
Validation samples: 56379
Test samples: 56379
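
The split above depends on the state of Python's global random generator, so the partition changes from run to run. A possible reproducible variant is sketched below; the function name `split_indices` and the seed value 42 are assumptions chosen for illustration, not settings used in this experiment.

```python
import numpy as np

def split_indices(num_samples, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle indices with a fixed seed and cut them into train/val/test parts."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_samples)
    train_end = int(train_frac * num_samples)
    val_end = int((train_frac + val_frac) * num_samples)
    return perm[:train_end], perm[train_end:val_end], perm[val_end:]

# 563786 is the number of 4-grams extracted earlier in the notebook.
train_idx, val_idx, test_idx = split_indices(563786)
print(len(train_idx), len(val_idx), len(test_idx))
```

The resulting index arrays can be used directly, for example `inputs[train_idx]` and `labels[train_idx]`, and the same seed always yields the same partition.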


In [9]:
import tensorflow as tf
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Embedding, LSTM, Dense, Flatten, Dropout, Input
from tensorflow.keras.layers import MultiHeadAttention, LayerNormalization

embedding_dim = 50  # Embedding dimension

# ------------------ RNN Model (LSTM) ------------------
# `input_length` is not passed because it is deprecated in recent Keras versions;
# the input shape is supplied through build() below instead.
rnn_model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim, name="rnn_embedding"),
    LSTM(128, return_sequences=False),
    Dense(128, activation='relu'),
    Dense(vocab_size, activation='softmax')
])
# Explicitly build the model with input shape (None, 3) to initialize parameters.
rnn_model.build(input_shape=(None, 3))
rnn_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
rnn_model.summary()

# ------------------ Transformer Model ------------------
class TransformerBlock(tf.keras.layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super().__init__()
        self.att = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = Sequential([Dense(ff_dim, activation="relu"), Dense(embed_dim)])
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)

    def call(self, inputs, training=False):
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

input_layer = Input(shape=(3,))
embedding_layer = Embedding(input_dim=vocab_size, output_dim=embedding_dim, name="transformer_embedding")(input_layer)
transformer_block = TransformerBlock(embed_dim=embedding_dim, num_heads=4, ff_dim=128)(embedding_layer)
flatten = Flatten()(transformer_block)
output_layer = Dense(vocab_size, activation="softmax")(flatten)

transformer_model = Model(inputs=input_layer, outputs=output_layer)
transformer_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
transformer_model.summary()
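
The parameter totals that `summary()` reports can be cross-checked by hand from the layer sizes defined above. The helper functions below are a hypothetical sketch of that arithmetic; they assume the default Keras settings (biases enabled, `value_dim` equal to `key_dim` in the attention layer).

```python
def embedding_params(vocab, dim):
    return vocab * dim

def dense_params(n_in, n_out):
    return n_in * n_out + n_out  # weights plus biases

def lstm_params(n_in, units):
    # Four gates, each with input weights, recurrent weights and a bias vector.
    return 4 * (n_in * units + units * units + units)

def mha_params(embed_dim, num_heads, key_dim):
    # Query, key and value projections, then the output projection back to embed_dim.
    qkv = 3 * (embed_dim * num_heads * key_dim + num_heads * key_dim)
    out = num_heads * key_dim * embed_dim + embed_dim
    return qkv + out

def layernorm_params(dim):
    return 2 * dim  # scale (gamma) and offset (beta)

vocab_size, embedding_dim, seq_len = 300, 50, 3

rnn_total = (embedding_params(vocab_size, embedding_dim)
             + lstm_params(embedding_dim, 128)
             + dense_params(128, 128)
             + dense_params(128, vocab_size))

transformer_total = (embedding_params(vocab_size, embedding_dim)
                     + mha_params(embedding_dim, num_heads=4, key_dim=embedding_dim)
                     + dense_params(embedding_dim, 128) + dense_params(128, embedding_dim)
                     + 2 * layernorm_params(embedding_dim)
                     + dense_params(seq_len * embedding_dim, vocab_size))

print("LSTM model:", rnn_total)                 # 161860
print("Transformer model:", transformer_total)  # 114128
```

These totals agree with the parameter counts quoted in the result analysis. Because `key_dim` is set to the full embedding size, each of the four heads works in 50 dimensions, which makes the attention layer account for 40,650 of the Transformer's parameters.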


In [10]:
epochs = 10      # Adjust epochs as needed
batch_size = 128 # Batch size for training

print("\nTraining RNN Model...")
rnn_model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_data=(X_val, y_val))

print("\nTraining Transformer Model...")
transformer_model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_data=(X_val, y_val))



Training RNN Model...
Epoch 1/10
3524/3524 ━━━━━━━━━━━━━━━━━━━━ 69s 19ms/step - accuracy: 0.1616 - loss: 4.2369 - val_accuracy: 0.2492 - val_loss: 3.4336
Epoch 2/10
3524/3524 ━━━━━━━━━━━━━━━━━━━━ 82s 19ms/step - accuracy: 0.2538 - loss: 3.3891 - val_accuracy: 0.2690 - val_loss: 3.2855
Epoch 3/10
3524/3524 ━━━━━━━━━━━━━━━━━━━━ 82s 19ms/step - accuracy: 0.2706 - loss: 3.2542 - val_accuracy: 0.2771 - val_loss: 3.2197
Epoch 4/10
3524/3524 ━━━━━━━━━━━━━━━━━━━━ 81s 18ms/step - accuracy: 0.2786 - loss: 3.1837 - val_accuracy: 0.2828 - val_loss: 3.1848
Epoch 5/10
3524/3524 ━━━━━━━━━━━━━━━━━━━━ 85s 19ms/step - accuracy: 0.2842 - loss: 3.1349 - val_accuracy: 0.2869 - val_loss: 3.1605
Epoch 6/10
3524/3524 ━━━━━━━━━━━━━━━━━━━━ 66s 19ms/step - accuracy: 0.2908 - loss: 3.0928 - val_accuracy: 0.2865 

<keras.src.callbacks.history.History at 0x7e3572c656d0>

In [11]:
# Evaluate RNN Model on test set
rnn_loss, rnn_acc = rnn_model.evaluate(X_test, y_test)
print("RNN Model - Test Loss:", rnn_loss, "Test Accuracy:", rnn_acc)

# Evaluate Transformer Model on test set
transformer_loss, transformer_acc = transformer_model.evaluate(X_test, y_test)
print("Transformer Model - Test Loss:", transformer_loss, "Test Accuracy:", transformer_acc)


1762/1762 ━━━━━━━━━━━━━━━━━━━━ 7s 4ms/step - accuracy: 0.2870 - loss: 3.1317
RNN Model - Test Loss: 3.1359219551086426 Test Accuracy: 0.2876957654953003
1762/1762 ━━━━━━━━━━━━━━━━━━━━ 10s 6ms/step - accuracy: 0.2821 - loss: 3.1705
Transformer Model - Test Loss: 3.1764461994171143 Test Accuracy: 0.2821263372898102
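
Cross-entropy losses are easier to interpret as perplexity, the exponential of the mean loss in nats. The sketch below converts the two test losses printed above and compares them with a uniform-guess baseline over the 300-token vocabulary; the `perplexity` helper is a hypothetical name, and the uniform baseline is an assumption added here only for scale.

```python
import math

def perplexity(cross_entropy_nats):
    """Perplexity is the exponential of the mean cross-entropy (natural log)."""
    return math.exp(cross_entropy_nats)

vocab_size = 300
uniform_loss = math.log(vocab_size)  # loss of a model that assigns 1/300 to every token

# Test losses copied from the evaluation output above.
for name, loss in [("RNN", 3.1359219551086426), ("Transformer", 3.1764461994171143)]:
    print(f"{name}: loss={loss:.4f}  perplexity={perplexity(loss):.2f}")

print(f"Uniform baseline: loss={uniform_loss:.4f}  perplexity={perplexity(uniform_loss):.0f}")
```

Both models land at a perplexity of roughly 23 to 24, meaning that on average they are about as uncertain as a uniform choice among two dozen words rather than among all 300.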


In [12]:
def predict_next_word(model, word_sequence):
    # Convert words to indices; out-of-vocabulary words silently fall back to
    # index 0, which is ',' (the most frequent token) in this vocabulary
    seq = [word2idx.get(word, 0) for word in word_sequence]
    seq = np.array(seq).reshape(1, -1)
    pred_probs = model.predict(seq)
    predicted_index = np.argmax(pred_probs, axis=1)[0]
    return idx2word[predicted_index]

# Example sequences for next‑word prediction (more common sequences added):
sequences = [
    ["government", "of", "united"],
    ["city", "of", "new"],
    ["life", "in", "the"],
    ["he", "is", "the"],
    ["at", "the", "end"],
    ["in", "the", "middle"],
    ["this", "is", "a"],
    ["one", "of", "the"],
    ["it", "was", "a"]
]

print("\nNext-word Predictions:")
for seq in sequences:
    next_word_rnn = predict_next_word(rnn_model, seq)
    next_word_trans = predict_next_word(transformer_model, seq)
    print(f"Input: {seq}")
    print(f"  RNN Prediction: {next_word_rnn}")
    print(f"  Transformer Prediction: {next_word_trans}")



Next-word Predictions:
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 313ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 247ms/step
Input: ['government', 'of', 'united']
  RNN Prediction: i
  Transformer Prediction: and
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 47ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 50ms/step
Input: ['city', 'of', 'new']
  RNN Prediction: and
  Transformer Prediction: and
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 47ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 48ms/step
Input: ['life', 'in', 'the']
  RNN Prediction: world
  Transformer Prediction: world
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 50ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 48ms/step
Input: ['he', 'is', 'the']
  RNN Prediction: lord
  Transformer Prediction: lord
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 5

In [15]:
# Extract learned embeddings from both models
rnn_embeddings = rnn_model.get_layer("rnn_embedding").get_weights()[0]
transformer_embeddings = transformer_model.get_layer("transformer_embedding").get_weights()[0]

import numpy as np

def cosine_similarity(vec1, vec2, epsilon=1e-10):
    # Use np.linalg.norm to calculate the norm
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2) + epsilon)

def find_nearest_words(target_word, embeddings, word2idx, idx2word, top_n=5):
    if target_word not in word2idx:
        return f"Word '{target_word}' not in vocabulary."
    target_vec = embeddings[word2idx[target_word]]
    similarities = [(idx2word[idx], cosine_similarity(target_vec, embeddings[idx]))
                    for idx in range(len(embeddings))]
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[1:top_n+1]  # Exclude the target word itself

# Test words that appear in the Gutenberg-derived vocabulary:
test_words = ["day", "could", "said", "for"]

print("\n==== RNN Model Nearest Words ====")
for word in test_words:
    print(f"Nearest words to '{word}' (RNN):", find_nearest_words(word, rnn_embeddings, word2idx, idx2word))

print("\n==== Transformer Model Nearest Words ====")
for word in test_words:
    print(f"Nearest words to '{word}' (Transformer):", find_nearest_words(word, transformer_embeddings, word2idx, idx2word))

def cosine_distance(word1, word2, embeddings, word2idx):
    if word1 not in word2idx or word2 not in word2idx:
        return "One or both words not in vocabulary."
    vec1 = embeddings[word2idx[word1]]
    vec2 = embeddings[word2idx[word2]]
    return 1 - cosine_similarity(vec1, vec2)

# Example: Cosine distance between 'said' and 'it'
distance_rnn = cosine_distance("said", "it", rnn_embeddings, word2idx)
distance_trans = cosine_distance("said", "it", transformer_embeddings, word2idx)
print(f"\nCosine distance between 'said' and 'it' (RNN): {distance_rnn}")
print(f"Cosine distance between 'said' and 'it' (Transformer): {distance_trans}")



==== RNN Model Nearest Words ====
Nearest words to 'day' (RNN): [('night', np.float32(0.68501997)), ('days', np.float32(0.5988681)), ('time', np.float32(0.5957141)), ('morning', np.float32(0.5383069)), ('way', np.float32(0.5087827))]
Nearest words to 'could' (RNN): [('can', np.float32(0.7917453)), ('should', np.float32(0.7073186)), ('might', np.float32(0.6435397)), ('would', np.float32(0.60220146)), ('must', np.float32(0.5329084))]
Nearest words to 'said' (RNN): [('saith', np.float32(0.76085794)), ('cried', np.float32(0.6880539)), ('saying', np.float32(0.6870636)), ('answered', np.float32(0.6505871)), ('say', np.float32(0.60662514))]
Nearest words to 'for' (RNN): [('against', np.float32(0.422775)), ('in', np.float32(0.4064455)), ('but', np.float32(0.39958328)), ('by', np.float32(0.39647493)), ('upon', np.float32(0.38300118))]

==== Transformer Model Nearest Words ====
Nearest words to 'day' (Transformer): [('night', np.float32(0.6158969)), ('morning', np.float32(0.49532732)), ('days',

## 3. Result Analysis

- **Training and Evaluation Summary:**
The experiment trained two neural network models, an LSTM-based Recurrent Neural Network (RNN) and a Transformer-based model, on 4-grams extracted from the Gutenberg corpus with a restricted vocabulary of 300 tokens. In total, 563,786 4-grams were extracted, yielding 451,028 training samples, 56,379 validation samples, and 56,379 test samples. The RNN-based LSTM model, with 161,860 trainable parameters, achieved a test accuracy of about 28.77% with a loss of 3.14. The Transformer model, with 114,128 parameters, obtained a test accuracy of about 28.21% with a loss of 3.18. The two models therefore perform comparably on this next-word prediction task, with the LSTM slightly ahead on both metrics.

- **Next-Word Prediction Analysis:**
Next-word predictions were tested using several common three-word input sequences. For "government of united", the RNN predicted "i" and the Transformer predicted "and"; for "city of new", both models predicted "and". For "life in the", both models returned "world", and for "he is the", both returned "lord". Note that `predict_next_word` maps any input word outside the 300-token vocabulary to index 0 (the comma token), which can distort predictions for sequences containing such words. Other sequences such as "at the end", "in the middle", "this is a", "one of the", and "it was a" generated largely similar outputs across both architectures, suggesting that both models capture generic contextual patterns, with occasional differences in individual predictions.

- **Embedding Analysis and Model Comparison:**
The quality of the learned word representations was evaluated by analyzing the nearest neighbors in the embedding space using cosine similarity. For example, both models ranked "night" as the nearest neighbor of "day"; the RNN's remaining neighbors were "days", "time", "morning", and "way", and the Transformer's list also included "morning" and "days". In the case of "could", both models returned related modal verbs such as "can", "might", "would", and "should". For the word "said", the RNN's nearest neighbors were "saith", "cried", "saying", "answered", and "say", whereas the Transformer model highlighted similar terms with slight differences in ranking. The cosine distance between "said" and "it" was measured to be approximately 0.94 for the RNN and 1.06 for the Transformer. Since a cosine distance of 1 corresponds to orthogonal vectors, both models place these two words in nearly unrelated directions, consistent with their different grammatical roles. Overall, both models learn meaningful representations, and the differences between their embedding spaces appear mainly in how neighbors are ranked.
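
To put the reported cosine distances in context, the sketch below shows the scale of the measure on hand-picked vectors and on random 50-dimensional vectors. All vectors here are hypothetical and are not taken from the trained embeddings.

```python
import numpy as np

def cosine_distance(u, v, eps=1e-10):
    """1 - cosine similarity: 0 = same direction, 1 = orthogonal, 2 = opposite."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps)

# Hypothetical 2-d vectors chosen only to illustrate the scale.
a = np.array([1.0, 0.0])
print(cosine_distance(a, np.array([2.0, 0.0])))   # same direction, close to 0
print(cosine_distance(a, np.array([0.0, 3.0])))   # orthogonal, equal to 1
print(cosine_distance(a, np.array([-1.0, 0.0])))  # opposite, close to 2

# Unrelated random vectors in 50 dimensions tend to be nearly orthogonal.
rng = np.random.default_rng(0)
pairs = rng.normal(size=(1000, 2, 50))
dists = [cosine_distance(p[0], p[1]) for p in pairs]
print(round(float(np.mean(dists)), 2))
```

Distances near 1, such as those measured between "said" and "it", are therefore what one would expect for a pair of words whose embeddings carry no particular relationship, while the nearest-neighbor pairs listed above correspond to much smaller distances.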






## 4. Conclusion
This experiment shows that next-word prediction can serve as a proxy task for learning distributed word representations. Both the RNN-based LSTM and the Transformer model reached test accuracies of roughly 28–29% over a 300-token vocabulary (uniform guessing would give about 0.3%), and the nearest-neighbor analysis grouped related words such as "day" and "night" or "could" and "can". The two architectures perform similarly overall, while their embedding spaces differ in how neighbors are ranked. The RNN processes the three context words sequentially, whereas the Transformer relates them through self-attention; on inputs this short, neither approach shows a clear advantage. Future work may involve expanding the vocabulary and corpus size to refine the embeddings and improve prediction specificity.

