
# Summative Exercise – Predictive Embeddings + RNN Classifier (Colab)

**Week 2 (NLP):** This summative task combines *predictive embeddings* (loaded from pre-trained **GloVe** vectors, used here in the same way as Word2Vec-style embeddings) with a compact **RNN** text classifier.

**You will:**
1. Load a small two-class dataset (20 Newsgroups subset).  
2. Tokenise and create padded sequences.  
3. Load pre-trained GloVe vectors and explore nearest neighbours.  
4. Build two models: (i) averaged-embedding baseline; (ii) SimpleRNN classifier using the same embeddings.  
5. Evaluate, compare, and reflect.

Note: You have to complete all the missing parts of the code to finish this exercise.
Missing parts are marked with '#########'.

*Estimated time: 90 minutes.*  


## Setup and versions

In [3]:

import os, random, sys, platform, numpy as np, tensorflow as tf, sklearn
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

SEED = 42
random.seed(SEED); np.random.seed(SEED); tf.random.set_seed(SEED)

print("Python:", sys.version.split()[0])
print("TensorFlow:", tf.__version__)
print("NumPy:", np.__version__)
print("scikit-learn:", sklearn.__version__)
print("Platform:", platform.platform())


Python: 3.10.19
TensorFlow: 2.20.0
NumPy: 1.26.4
scikit-learn: 1.5.1
Platform: macOS-15.6.1-arm64-arm-64bit



## 1. Load and prepare data

We use two categories to keep training fast. This mirrors the earlier classroom exercises.


In [4]:

import pandas as pd
cats = ['rec.autos', 'sci.electronics']
raw = fetch_20newsgroups(subset='train', categories=cats, remove=('headers','footers','quotes'))
df = pd.DataFrame({'text': raw.data, 'label': raw.target}).sample(n=1000, random_state=SEED).reset_index(drop=True)
df.head(3)


| | text | label |
|---|---|---|
| 0 | # 74S\tLater modification of 74 for even highe... | 1 |
| 1 | \nrecently-manufactured locomotives have wheel... | 0 |
| 2 | \nYes, Fred, my heart and prayers go out to th... | 0 |



## 2. Tokenise and vectorise

We use Keras `Tokenizer` for simple, robust tokenisation. Then we create padded integer sequences.
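
To make the tokenise-then-pad step concrete, here is a minimal pure-Python sketch of what post-padding and post-truncation do to integer sequences. The vocabulary `toy_index` and the helpers `pad_post` and `to_ids` are made up for illustration and are not part of Keras; the sketch only assumes the Keras convention that index 0 is padding and index 1 is the out-of-vocabulary token.

```python
def pad_post(seq, maxlen, pad_value=0):
    """Cut the sequence at the end, then pad it at the end, up to maxlen entries."""
    clipped = list(seq)[:maxlen]
    return clipped + [pad_value] * (maxlen - len(clipped))

# Hypothetical vocabulary: 0 is reserved for padding, 1 for unknown words.
toy_index = {"<unk>": 1, "the": 2, "car": 3, "engine": 4}

def to_ids(text):
    """Whitespace tokenisation; unknown words fall back to the <unk> index."""
    return [toy_index.get(tok, toy_index["<unk>"]) for tok in text.lower().split()]

short = pad_post(to_ids("the car"), maxlen=5)
long = pad_post(to_ids("the car engine stalled the car engine"), maxlen=5)
print(short)  # [2, 3, 0, 0, 0]
print(long)   # [2, 3, 4, 1, 2]
```

Short documents end in a run of zeros, while long documents lose everything after position `maxlen`.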


In [None]:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hint: try a vocabulary of 20000 words and a maximum length of 120

num_words = 20000
# Only the most frequent words (indices below `num_words`) are kept; the rest map to <unk>
tokenizer = Tokenizer(num_words=num_words, oov_token="<unk>")
tokenizer.fit_on_texts(df["text"])

max_len = 120
# Convert each text to a sequence of vocabulary indices, then pad with 0s or
# truncate at the end so that every row has exactly `max_len` entries
seqs = tokenizer.texts_to_sequences(df["text"])
X = pad_sequences(seqs, maxlen=max_len, padding="post", truncating="post")
y = df["label"].values  # labels

# word_index maps every word seen during fitting (including rare ones) plus <unk>
# to an integer index; index 0 is reserved for padding
word_index = tokenizer.word_index
index_word = {i:w for w,i in word_index.items()}
vocab_size = min(num_words, len(word_index) + 1)
vocab_size
# Here len(word_index) + 1 < num_words, so no training word is replaced by <unk>;
# the token only matters for words that were not seen during fitting


[[2750  521 1805 ... 4246 6215    7]
 [ 539 1808 4250 ...    0    0    0]
 [ 302 2753   27 ...  198   26   43]
 ...
 [   5   28  160 ...    0    0    0]
 [  49  769  337 ...    0    0    0]
 [   0    0    0 ...    0    0    0]]


12193

In [5]:
# Hint: hold out 20% of the data for validation
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=SEED, stratify=y
)
X_train.shape, X_val.shape


((800, 120), (200, 120))
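
The `stratify=y` argument keeps the class ratio the same in both parts of the split. The sketch below illustrates the idea in plain Python on made-up labels; `stratified_split` is a hypothetical helper, and scikit-learn's own allocation and shuffling differ in detail.

```python
import random
from collections import Counter

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, val_idx) so that each class contributes test_size of its items."""
    rng = random.Random(seed)
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    train_idx, val_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                       # shuffle within each class
        n_val = round(len(idxs) * test_size)    # per-class validation quota
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])
    return sorted(train_idx), sorted(val_idx)

toy_labels = [0] * 60 + [1] * 40                # made-up 60/40 class balance
train_idx, val_idx = stratified_split(toy_labels)
val_counts = Counter(toy_labels[i] for i in val_idx)
print(val_counts)  # Counter({0: 12, 1: 8})
```

The validation part keeps the 60/40 balance of the full toy set, so accuracy on it is not distorted by a lucky or unlucky draw.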


## 3. Load pre-trained predictive embeddings (GloVe)

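We use **GloVe 6B** (100-dimensional vectors). The cell below reads a local copy at `data/glove.6B.100d.txt`; the commented-out lines show how to download it with Keras `get_file`. Vocabulary words without a GloVe vector keep a random initialisation.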
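
The GloVe text files store one token per line, followed by its space-separated vector components. As a minimal sketch of that parsing step, the snippet below reads two made-up lines from an in-memory buffer; `fake_glove` and `toy_emb` are illustrative names, and the three-dimensional vectors are invented.

```python
import io

# Two invented lines in the GloVe text layout: token, then space-separated floats.
fake_glove = io.StringIO("car 0.1 -0.2 0.3\nengine 0.0 0.5 -0.5\n")

toy_emb = {}
for line in fake_glove:
    parts = line.rstrip().split(" ")
    # parts[0] is the token, parts[1:] are the vector components
    toy_emb[parts[0]] = [float(x) for x in parts[1:]]

print(toy_emb["car"])  # [0.1, -0.2, 0.3]
print(len(toy_emb))    # 2
```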


In [None]:

import os, zipfile
from tensorflow.keras.utils import get_file
import numpy as np

EMBED_DIM = 100
# Optional download, uncomment if data/glove.6B.100d.txt is not available locally:
# GLOVE_URL = "https://nlp.stanford.edu/data/glove.6B.zip"
# glove_zip_path = get_file("glove.6B.zip", GLOVE_URL, cache_dir=".", cache_subdir=".")
# if not os.path.exists("data/glove.6B.100d.txt"):
#     with zipfile.ZipFile(glove_zip_path, "r") as z:
#         z.extract("glove.6B.100d.txt", path="data")

# load GloVe embeddings into a dictionary
emb_index = {}
with open("data/glove.6B.100d.txt", "r", encoding="utf8") as f:
    for line in f:
        # parse the vector file line by line: parts[0] is the word, parts[1:] are the vector values
        parts = line.strip().split()
        word = parts[0]
        vec = np.array(parts[1:], dtype="float32")
        # store in dictionary word -> vector
        emb_index[word] = vec


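# Start from random vectors; rows for words found in GloVe are overwritten below,
# all other rows keep this random initialisation
embedding_matrix = np.random.normal(scale=0.6, size=(vocab_size, EMBED_DIM)).astype("float32")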
hits = 0
for word, idx in word_index.items():
    if idx >= vocab_size:
        continue
    vec = emb_index.get(word)
    if vec is not None and vec.shape[0] == EMBED_DIM:
        embedding_matrix[idx] = vec
        hits += 1
print(f"Loaded embeddings: {len(emb_index)} | Vocab hits: {hits}/{vocab_size}")


Loaded embeddings: 400000 | Vocab hits: 10224/12193
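
As a quick reading of the printed numbers, the coverage of the vocabulary by GloVe can be computed directly. The values are copied from the output above; note that the denominator counts matrix rows, so the uncovered rows include row 0 (padding), which never receives a GloVe vector.

```python
hits, vocab_size = 10224, 12193      # values printed by the cell above
coverage = hits / vocab_size
missing = vocab_size - hits          # rows that keep their random initialisation
print(f"coverage={coverage:.1%} | rows left at random init={missing}")
# coverage=83.9% | rows left at random init=1969
```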



## 4. Explore nearest neighbours (cosine similarity)
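
Cosine similarity scores two vectors by the angle between them: the dot product divided by the product of their lengths. The following is a minimal sketch on invented two-dimensional vectors (`toy` is not real GloVe data), ranking neighbours the same way the cell below does on the full matrix.

```python
import math

def cosine(u, v, eps=1e-8):
    """Dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)

# Invented 2-d vectors, chosen so that "truck" points almost the same way as "car".
toy = {"car": [1.0, 0.0], "truck": [0.9, 0.1], "resistor": [0.0, 1.0]}

query = "car"
ranked = sorted(
    ((w, cosine(toy[query], v)) for w, v in toy.items() if w != query),
    key=lambda pair: -pair[1],
)
print(ranked[0][0])  # truck
```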


In [9]:

from numpy.linalg import norm

def nearest_neighbours(query, topk=10):
    # Look the query up in the tokenizer vocabulary; skip words without a GloVe vector
    q_idx = word_index.get(query)
    if q_idx is None or q_idx >= vocab_size or query not in emb_index:
        return []
    # qv = query vector (the matrix is indexed by integer id, not by the word itself)
    qv = embedding_matrix[q_idx]

    # cosine similarity between the query and every row of the embedding matrix
    sims = embedding_matrix @ qv / (norm(embedding_matrix, axis=1) * norm(qv) + 1e-8)
    idx = np.argsort(-sims)[:topk + 2]
    # drop the query itself and the padding row (index 0 has no word)
    return [(index_word[i], float(sims[i])) for i in idx if i != 0 and i != q_idx][:topk]

probes = ["car", "engine", "battery", "circuit", "voltage"]
for p in probes:
    print(f"Probe: {p}")
    print(nearest_neighbours(p, topk=8))
    print("-"*50)


## 5. Baseline: averaged embeddings + Logistic Regression


In [None]:

from sklearn.linear_model import LogisticRegression

# Hint: `max_iter` could be 300.

def doc_mean_vector(seq_row):
    valid = [embedding_matrix[idx] for idx in seq_row if idx != 0 and idx < vocab_size]
    if not valid:
        return np.zeros((EMBED_DIM,), dtype="float32")
    return np.mean(valid, axis=0)

X_tr_mean = np.stack([doc_mean_vector(r) for r in X_train])
X_va_mean = np.stack([doc_mean_vector(r) for r in ###############])

clf = LogisticRegression(max_iter=#############)
clf.fit(X_tr_mean, y_train)
pred_lr = clf.predict(X_va_mean)

acc_lr = accuracy_score(###########)
f1_lr = f1_score(##########, #######, average="macro")
print(f"Baseline (avg embeddings) | acc={acc_lr:.3f} | f1_macro={f1_lr:.3f}")
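
The baseline represents each document by the mean of its word vectors, ignoring padding. A minimal NumPy sketch of that pooling step on an invented embedding matrix follows; `toy_matrix` and `mean_pool` are illustrative names and do not use the notebook's data.

```python
import numpy as np

# Hypothetical 4-row embedding matrix with 3-d vectors; row 0 is the padding row.
toy_matrix = np.array([[9.0, 9.0, 9.0],   # padding row, must never enter the mean
                       [1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0]], dtype="float32")

def mean_pool(seq_row, matrix):
    """Average the vectors of all non-padding ids; all-padding rows give a zero vector."""
    ids = [i for i in seq_row if i != 0]
    if not ids:
        return np.zeros(matrix.shape[1], dtype=matrix.dtype)
    return matrix[ids].mean(axis=0)

print(mean_pool([1, 2, 0, 0], toy_matrix))  # [0.5 0.5 0. ]
# Word order does not matter for a mean: both orderings give the same vector.
same = np.allclose(mean_pool([1, 2, 3], toy_matrix), mean_pool([3, 2, 1], toy_matrix))
```

Because the mean discards word order, any gain of the RNN over this baseline has to come from sequence information.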



## 6. RNN classifier with pre-trained embeddings (Keras)


In [None]:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

# Hint: use a sigmoid activation for the dense layer, the Adam optimiser, and binary cross-entropy loss.

model = Sequential([
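    # `input_length` is deprecated in Keras 3 and is omitted here; the GloVe matrix
    # is passed as a constant initialiser and kept frozen
    Embedding(input_dim=vocab_size, output_dim=EMBED_DIM,
              embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
              trainable=False),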
    SimpleRNN(64, activation="tanh"),
    Dense(1, activation=############)
])

model.compile(optimizer=########, loss=###########, metrics=["accuracy"])
model.summary()
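
To show what the `SimpleRNN(64)` and `Dense(1)` layers compute, here is a minimal NumPy sketch of the forward pass for one document: the hidden state is updated as `h = tanh(x_t @ W + h @ U + b)` at every time step, and the final state goes through a sigmoid. The weights and the input are random stand-ins, not trained values, so the resulting probability is meaningless; only the shapes and the recurrence are the point.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, units, steps = 100, 64, 120               # sizes used in this notebook

W = rng.normal(scale=0.1, size=(embed_dim, units))   # input-to-hidden weights
U = rng.normal(scale=0.1, size=(units, units))       # hidden-to-hidden weights
b = np.zeros(units)                                  # recurrent layer bias
w_out = rng.normal(scale=0.1, size=units)            # Dense(1) kernel
b_out = 0.0                                          # Dense(1) bias

x = rng.normal(size=(steps, embed_dim))              # stand-in for one embedded document
h = np.zeros(units)
for x_t in x:                                        # one update per time step
    h = np.tanh(x_t @ W + h @ U + b)
prob = 1.0 / (1.0 + np.exp(-(h @ w_out + b_out)))    # sigmoid output in (0, 1)

n_params = W.size + U.size + b.size + w_out.size + 1
print(n_params)  # 10625
```

The count of 10625 trainable parameters (10560 recurrent plus 65 dense) can be compared against what `model.summary()` reports once the blanks are filled in.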


In [None]:

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=10, batch_size=32, verbose=1
)

y_pred_prob = model.predict(##########, verbose=0).ravel()
y_pred = (y_pred_prob >= 0.5).astype("int64")

acc_rnn = accuracy_score(y_val, ############)
f1_rnn = f1_score(y_val, y_pred, average="macro")
print(f"RNN (frozen GloVe) | acc={acc_rnn:.3f} | f1_macro={f1_rnn:.3f}")
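
The evaluation step thresholds the predicted probabilities at 0.5 and then scores the hard labels. The sketch below works through accuracy and macro-averaged F1 by hand on invented values; `f1_for_class` is a hypothetical helper, and in the notebook itself the scikit-learn functions do this work.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, treating that class as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

toy_true = [0, 0, 0, 1, 1, 1]                     # invented labels
toy_prob = [0.1, 0.4, 0.7, 0.8, 0.6, 0.3]         # invented model outputs
toy_pred = [int(p >= 0.5) for p in toy_prob]      # [0, 0, 1, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
# Macro averaging: compute F1 per class, then take the unweighted mean.
macro_f1 = (f1_for_class(toy_true, toy_pred, 0) + f1_for_class(toy_true, toy_pred, 1)) / 2
print(f"acc={accuracy:.3f} | f1_macro={macro_f1:.3f}")  # acc=0.667 | f1_macro=0.667
```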


## 7. Compare results & confusion matrix

In [None]:

import pandas as pd
results = pd.DataFrame([
    {"model": "Avg-embeddings + LogisticRegression", "accuracy": acc_lr, "f1_macro": f1_lr},
    {"model": "SimpleRNN (frozen GloVe)", "accuracy": acc_rnn, "f1_macro": f1_rnn},
])
results


In [None]:

cm = confusion_matrix(################)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['rec.autos','sci.electronics'])
disp.plot(values_format='d')
plt.title("Confusion matrix: RNN (frozen GloVe)")
plt.show()
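
When reading the plot, rows correspond to the true class and columns to the predicted class, which is the scikit-learn convention. A minimal sketch of how the counts are built, on invented labels (`confusion_counts` is an illustrative helper):

```python
def confusion_counts(y_true, y_pred, n_classes=2):
    """Count (true, predicted) pairs; row = true class, column = predicted class."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

cm_toy = confusion_counts([0, 0, 0, 1, 1], [0, 1, 0, 1, 1])
print(cm_toy)  # [[2, 1], [0, 2]]
```

Here one class-0 document was predicted as class 1 (top-right cell), and no class-1 document was missed (bottom-left cell).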



## 8. Reflection (120–180 words)

- Where did the averaged-embedding baseline do well, and where did the RNN improve?
- Why do predictive embeddings help compared with raw counts or TF-IDF?  
- If you fine-tuned the embedding layer (`trainable=True`) in this small-data setting, when would you expect gains or overfitting, and why?
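
For the fine-tuning question, it can help to compare the number of trainable parameters with the number of training documents. The arithmetic below uses the sizes reported earlier in this notebook; it is a rough capacity comparison, not a prediction of how the model will behave.

```python
vocab_size, embed_dim, units = 12193, 100, 64   # sizes from the cells above
n_train_docs = 800

embedding_params = vocab_size * embed_dim       # 1,219,300 values in the embedding matrix
rnn_params = units * (embed_dim + units + 1)    # 10,560 in SimpleRNN(64)
dense_params = units + 1                        # 65 in Dense(1)

frozen = rnn_params + dense_params              # trainable with trainable=False
fine_tuned = frozen + embedding_params          # trainable with trainable=True
print(frozen, fine_tuned, round(fine_tuned / n_train_docs))  # 10625 1229925 1537
```

With the embedding layer unfrozen there are roughly 1,500 trainable parameters per training document, which is one concrete way to frame the overfitting risk.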
