[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ContextLab/embeddings-llm-course/blob/main/Assignment3_Wikipedia_Embeddings.ipynb)

# Assignment 3: Representing Meaning - A Computational Exploration of Semantic Space

**PSYC 51.17: Models of Language and Communication**

**Due Date: January 30, 2026 at 11:59 PM EST**

---

## Overview

*"You shall know a word by the company it keeps."* - J.R. Firth, 1957

In this assignment, you will explore how machines represent meaning using a corpus of 250,000 Wikipedia articles. You will implement and compare methods spanning more than three decades of computational linguistics - from classical statistical techniques (LSA, LDA) to modern large language models (BERT, GPT-2, Sentence Transformers, Llama).

### Learning Objectives

By completing this assignment, you will:
- Understand the evolution of semantic representation from classical to modern NLP
- Implement and compare traditional, neural, and LLM-based embedding methods
- Develop expertise in clustering evaluation and unsupervised learning
- Connect computational methods to cognitive theories of semantic memory
- Create publication-quality visualizations of high-dimensional semantic spaces

---

## Table of Contents

1. [Setup and Installation](#1-setup-and-installation)
2. [Load Dataset](#2-load-dataset)
3. [Part 1: Implementing the Embedding Zoo](#3-part-1-implementing-the-embedding-zoo)
   - 1.1 Classical Statistical Methods (LSA, LDA)
   - 1.2 Static Word Embeddings (Word2Vec, GloVe, FastText)
   - 1.3 Contextualized Embeddings (BERT, GPT-2)
   - 1.4 Modern Sentence/Document Embeddings
   - 1.5 Modern Topic Models (BERTopic, Top2Vec)
4. [Part 2: Evaluation and Analysis](#4-part-2-evaluation-and-analysis)
5. [Part 3: Visualization](#5-part-3-visualization)
6. [Part 4: Cognitive Science Connection](#6-part-4-cognitive-science-connection)
7. [Part 5: Advanced Extensions](#7-part-5-advanced-extensions)
8. [Conclusion](#8-conclusion)

## 1. Setup and Installation

In [None]:
# Install required packages
!pip install -q transformers sentence-transformers torch
!pip install -q gensim scikit-learn umap-learn hdbscan
!pip install -q bertopic top2vec
!pip install -q plotly matplotlib seaborn
!pip install -q pandas numpy tqdm

In [None]:
# Core imports
import os
import pickle
import urllib.request
import numpy as np
import pandas as pd
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
import random
random.seed(42)
np.random.seed(42)

# Seed PyTorch as well, then check GPU availability
import torch
torch.manual_seed(42)
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## 2. Load Dataset

In [None]:
# Download the Wikipedia dataset if it doesn't exist
dataset_url = 'https://www.dropbox.com/s/v4juxkc5v2rd0xr/wikipedia.pkl?dl=1'
dataset_path = 'wikipedia.pkl'

if not os.path.exists(dataset_path):
    print("Downloading dataset (this may take a few minutes)...")
    urllib.request.urlretrieve(dataset_url, dataset_path)
    print("Download complete.")
else:
    print("Dataset already exists.")

# Load the dataset
print("Loading dataset...")
with open(dataset_path, 'rb') as f:
    wikipedia = pickle.load(f)

print(f"Loaded {len(wikipedia):,} Wikipedia articles")

In [None]:
# Explore the dataset structure
print("Dataset structure:")
print(f"Type: {type(wikipedia)}")
print(f"Number of articles: {len(wikipedia):,}")
print(f"\nFirst article keys: {wikipedia[0].keys()}")
print(f"\nExample article:")
print(f"  Title: {wikipedia[0]['title']}")
print(f"  Text length: {len(wikipedia[0]['text'])} characters")
print(f"  ID: {wikipedia[0]['id']}")
print(f"  URL: {wikipedia[0]['url']}")

In [None]:
# For development, start with a subset of the articles.
# 10K articles keeps iteration fast; set SAMPLE_SIZE = len(wikipedia) to use the full dataset.
SAMPLE_SIZE = 10000

if SAMPLE_SIZE < len(wikipedia):
    print(f"Using subset of {SAMPLE_SIZE:,} articles for development")
    sample_indices = random.sample(range(len(wikipedia)), SAMPLE_SIZE)
    articles = [wikipedia[i] for i in sample_indices]
else:
    articles = wikipedia
    print(f"Using full dataset of {len(articles):,} articles")

# Extract texts and titles for convenience
texts = [article['text'] for article in articles]
titles = [article['title'] for article in articles]

## 3. Part 1: Implementing the Embedding Zoo (40 points)

In this section, you will implement 10+ different embedding methods.

### 1.1 Classical Statistical Methods (8 points)

Implement Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA).
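As a starting point, here is a minimal sketch of both methods on a hypothetical six-sentence toy corpus (`toy_texts`, not part of the dataset). The helper names `lsa_embed` and `lda_embed`, the stop-word setting, and the component counts are illustrative assumptions; you will need to choose your own preprocessing and dimensionality for the real articles.

```python
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus standing in for `texts`.
toy_texts = [
    "the cat chased the mouse across the garden",
    "dogs and cats are popular household pets",
    "the stock market fell as investors sold shares",
    "banks raised interest rates to fight inflation",
    "a kitten and a puppy played in the garden",
    "investors bought shares after the market rallied",
]


def lsa_embed(docs, n_components=2, seed=42):
    """LSA: TF-IDF weighting followed by truncated SVD."""
    tfidf = TfidfVectorizer(stop_words="english")
    X = tfidf.fit_transform(docs)  # sparse matrix, shape (n_docs, n_terms)
    svd = TruncatedSVD(n_components=n_components, random_state=seed)
    return svd.fit_transform(X)  # dense matrix, shape (n_docs, n_components)


def lda_embed(docs, n_topics=2, seed=42):
    """LDA: raw term counts mapped to per-document topic proportions."""
    counts = CountVectorizer(stop_words="english")
    X = counts.fit_transform(docs)  # LDA expects counts, not TF-IDF weights
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    return lda.fit_transform(X)  # each row sums to 1


lsa_vectors = lsa_embed(toy_texts)
lda_vectors = lda_embed(toy_texts)
print(lsa_vectors.shape, lda_vectors.shape)
```

Note the different inputs: LSA is usually run on TF-IDF weights, while LDA models word counts directly, so the two methods need separate vectorizers.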

In [None]:
# TODO: Implement LSA using TF-IDF + TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

# Your LSA implementation here
# ...

In [None]:
# TODO: Implement LDA
# Your LDA implementation here
# ...

### 1.2 Static Word Embeddings (8 points)

Implement Word2Vec, GloVe, and FastText with document-level aggregation.
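One common form of document-level aggregation is to average the vectors of all in-vocabulary tokens. The sketch below illustrates the idea with a hypothetical three-dimensional lookup table (`toy_vectors`) so it runs without downloading a model; `tokenize` and `document_vector` are illustrative names. A gensim `KeyedVectors` object supports the same `tok in vectors` and `vectors[tok]` access, so it should be usable in place of the dictionary.

```python
import re

import numpy as np

# Hypothetical 3-dimensional vectors standing in for a pre-trained model.
toy_vectors = {
    "cat": np.array([1.0, 0.0, 0.0]),
    "dog": np.array([0.8, 0.2, 0.0]),
    "market": np.array([0.0, 1.0, 0.0]),
    "bank": np.array([0.0, 0.8, 0.2]),
}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def document_vector(text, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[tok] for tok in tokenize(text) if tok in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


doc_vecs = np.vstack([
    document_vector("The cat and the dog", toy_vectors, dim=3),
    document_vector("Market news from the bank", toy_vectors, dim=3),
    document_vector("Completely unknown words", toy_vectors, dim=3),
])
print(doc_vecs)
```

Returning a zero vector for documents with no known tokens is only one possible choice; you may prefer to drop such documents or to weight tokens (for example by TF-IDF) instead of using a plain mean.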

In [None]:
# TODO: Implement Word2Vec, GloVe, FastText embeddings
import gensim.downloader as api

# Load pre-trained models
# word2vec_model = api.load('word2vec-google-news-300')
# glove_model = api.load('glove-wiki-gigaword-300')
# fasttext_model = api.load('fasttext-wiki-news-subwords-300')

# Your implementation here
# ...

### 1.3 Contextualized Embeddings (8 points)

Implement BERT and GPT-2 embeddings.
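Both models produce one vector per token, so you need a pooling step to get one vector per document. The sketch below shows two common options on dummy NumPy arrays: mask-aware mean pooling (often used with BERT) and last-token pooling (sometimes used with causal models such as GPT-2, where only the final token has attended to the whole sequence). The function names are illustrative. It assumes you have already converted the model's `last_hidden_state` and the tokenizer's `attention_mask` to NumPy arrays, and that sequences are right-padded. GPT-2's tokenizer has no padding token by default, so you would need to set one before batching.

```python
import numpy as np


def mean_pool(hidden, mask):
    """Average token vectors, ignoring padded positions (mask == 0)."""
    mask = mask[..., None].astype(hidden.dtype)  # (batch, seq, 1)
    summed = (hidden * mask).sum(axis=1)  # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


def last_token_pool(hidden, mask):
    """Take the last non-padded token's vector (assumes right padding)."""
    last = mask.sum(axis=1).astype(int) - 1  # index of last real token
    return hidden[np.arange(hidden.shape[0]), last]


# Dummy batch: 2 sequences, 3 token positions, hidden size 2.
hidden = np.array([
    [[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]],  # third position is padding
    [[2.0, 0.0], [4.0, 0.0], [6.0, 0.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

mean_vecs = mean_pool(hidden, mask)
last_vecs = last_token_pool(hidden, mask)
print(mean_vecs)
print(last_vecs)
```

Without the mask, the padded position in the first sequence would pull its mean toward `[9, 9]`, which is why plain `.mean(axis=1)` is not a safe substitute on padded batches.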

In [None]:
# TODO: Implement BERT embeddings
from transformers import AutoTokenizer, AutoModel

# Your BERT implementation here
# ...

In [None]:
# TODO: Implement GPT-2 embeddings
# Your GPT-2 implementation here
# ...

### 1.4 Modern Sentence/Document Embeddings (8 points)

Implement Sentence Transformers and optionally Llama embeddings.
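Wikipedia articles are often longer than an encoder's maximum input length, and text beyond that limit is typically truncated. One possible workaround, sketched below, is to split each article into chunks, encode every chunk, and average the chunk vectors. `chunk_words`, `embed_long_document`, the 200-word chunk size, and the stand-in `toy_encode` function are all illustrative assumptions so the sketch runs without downloading a model.

```python
import numpy as np


def chunk_words(text, max_words=200):
    """Split text into consecutive chunks of at most `max_words` words."""
    words = text.split()
    chunks = [" ".join(words[i:i + max_words])
              for i in range(0, len(words), max_words)]
    return chunks or [""]  # keep one (empty) chunk for empty input


def embed_long_document(text, encode, max_words=200):
    """Encode each chunk, average the chunk vectors, and L2-normalize."""
    chunk_vecs = np.asarray(encode(chunk_words(text, max_words)))
    doc_vec = chunk_vecs.mean(axis=0)
    norm = np.linalg.norm(doc_vec)
    return doc_vec / norm if norm > 0 else doc_vec


def toy_encode(chunks):
    """Hypothetical encoder: maps each chunk to a 2-D feature vector."""
    return np.array([[len(c.split()), c.count("a")] for c in chunks],
                    dtype=float)


doc = "a " * 250  # 250 words, so two chunks of 200 and 50 words
vec = embed_long_document(doc, toy_encode)
print(len(chunk_words(doc)), vec)
```

With the real library, the `encode` argument could be the `encode` method of a loaded model, for example `SentenceTransformer("all-MiniLM-L6-v2").encode` (the model name here is only an example), which accepts a list of strings and returns an array of vectors.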

In [None]:
# TODO: Implement Sentence Transformers
from sentence_transformers import SentenceTransformer

# Your implementation here
# ...

### 1.5 Modern Topic Models (8 points)

Implement BERTopic and Top2Vec.

In [None]:
# TODO: Implement BERTopic
from bertopic import BERTopic

# Your implementation here
# ...

In [None]:
# TODO: Implement Top2Vec
# from top2vec import Top2Vec

# Your implementation here
# ...

## 4. Part 2: Evaluation and Analysis (30 points)

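Apply clustering algorithms (K-Means, hierarchical, DBSCAN/HDBSCAN) to each set of embeddings and compute internal cluster-quality metrics (silhouette, Davies-Bouldin, and Calinski-Harabasz scores).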
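The sketch below shows one way to combine a clustering run with internal quality metrics, using K-Means on a hypothetical embedding matrix made of two synthetic blobs. `evaluate_kmeans` is an illustrative helper, and the candidate values of `k` are arbitrary; for real embeddings you will need a wider range and the other clustering algorithms as well.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

rng = np.random.default_rng(42)
# Hypothetical stand-in for an embedding matrix: two well-separated blobs.
embeddings = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 8)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 8)),
])


def evaluate_kmeans(X, k, seed=42):
    """Cluster X into k groups and report three internal quality metrics."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    return {
        "k": k,
        "silhouette": silhouette_score(X, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),  # lower is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
    }


results = [evaluate_kmeans(embeddings, k) for k in (2, 3, 4)]
best = max(results, key=lambda r: r["silhouette"])
for row in results:
    print(row)
print("Best k by silhouette:", best["k"])
```

The three metrics do not always agree, so it is worth reporting all of them rather than selecting `k` from a single score.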

In [None]:
# TODO: Implement clustering (K-Means, Hierarchical, DBSCAN/HDBSCAN)
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
import hdbscan

# Your implementation here
# ...

In [None]:
# TODO: Implement evaluation metrics
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

# Your implementation here
# ...

## 5. Part 3: Visualization (15 points)
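A typical workflow for this part is to project each embedding matrix to two dimensions and color the points by cluster or category. The sketch below uses t-SNE from scikit-learn on a hypothetical matrix of two synthetic blobs; `project_2d` and `plot_projection` are illustrative helpers, and the perplexity value is an arbitrary choice for 60 points. UMAP follows the same fit-and-transform pattern if you prefer it.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
# Hypothetical stand-in for document embeddings: two blobs of 30 points each.
embeddings = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(30, 16)),
    rng.normal(loc=4.0, scale=0.5, size=(30, 16)),
])
labels = np.array([0] * 30 + [1] * 30)


def project_2d(X, perplexity=10, seed=42):
    """Project embeddings to 2-D with t-SNE (perplexity must be < n_samples)."""
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca",
                random_state=seed)
    return tsne.fit_transform(X)


def plot_projection(coords, labels, path="tsne.png"):
    """Save a scatter plot of the 2-D coordinates, colored by label."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, ax = plt.subplots(figsize=(5, 4))
    ax.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=12)
    ax.set_xlabel("t-SNE 1")
    ax.set_ylabel("t-SNE 2")
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)


coords = project_2d(embeddings)
print(coords.shape)
```

Calling `plot_projection(coords, labels)` writes the figure to disk. Keep in mind that distances between far-apart groups in a t-SNE plot are not reliable, so interpret global layout cautiously.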

In [None]:
# TODO: Implement dimensionality reduction and visualization
import umap
from sklearn.manifold import TSNE
import plotly.express as px
import plotly.graph_objects as go

# Your implementation here
# ...

## 6. Part 4: Cognitive Science Connection (10 points)

### Distributional Semantics and Cognitive Science

*Write your analysis here (2-3 pages) connecting your computational work to theories of meaning in cognitive science.*

TODO: Your essay here...

## 7. Part 5: Advanced Extensions (5 points)

Choose at least ONE extension: Cross-Lingual, Temporal Analysis, or Practical Applications.
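If you choose the Practical Applications extension, one simple option is semantic search: rank articles by cosine similarity between a query vector and the document embeddings. The sketch below is a minimal illustration with hypothetical two-dimensional vectors and titles; `cosine_top_k` is an illustrative name, and in practice the query would be embedded with the same model as the documents.

```python
import numpy as np


def cosine_top_k(query_vec, doc_matrix, k=3):
    """Return (index, similarity) pairs for the k most similar documents."""
    doc_norm = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    q_norm = query_vec / np.linalg.norm(query_vec)
    sims = doc_norm @ q_norm  # cosine similarity to every document
    order = np.argsort(-sims)[:k]  # indices sorted by descending similarity
    return [(int(i), float(sims[i])) for i in order]


# Hypothetical 2-D document embeddings and their titles.
doc_matrix = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.5, 0.5],
])
toy_titles = ["Cat", "Dog", "Stock market", "Pet insurance"]

hits = cosine_top_k(np.array([1.0, 0.0]), doc_matrix, k=2)
for idx, score in hits:
    print(f"{toy_titles[idx]}: {score:.3f}")
```

This brute-force version compares the query against every document, which is fine for 10,000 articles; for the full dataset you may want to precompute the normalized matrix once and reuse it across queries.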

In [None]:
# TODO: Implement your chosen extension
# Your implementation here
# ...

## 8. Conclusion

*Summarize your findings, discuss limitations, and suggest future directions.*

TODO: Your conclusion here...