<a href="https://colab.research.google.com/github/GoldPapaya/info256-applied-nlp/blob/main/3.embeddings/DistributionalSimilarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp25/blob/main/3.embeddings/DistributionalSimilarity.ipynb)

In [1]:
!wget --no-check-certificate https://raw.githubusercontent.com/dbamman/anlp25/main/data/wiki.10K.txt

--2025-09-09 23:45:17--  https://raw.githubusercontent.com/dbamman/anlp25/main/data/wiki.10K.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24159412 (23M) [text/plain]
Saving to: ‘wiki.10K.txt’


2025-09-09 23:45:17 (120 MB/s) - ‘wiki.10K.txt’ saved [24159412/24159412]



This notebook explores distributional similarity in a dataset of 10,000 Wikipedia articles (4.4M words), building high-dimensional, sparse representations for words from the distinct contexts they appear in. These representations let us find the words most similar to a given query, and they are interpretable: we can inspect the specific contexts that contribute most to making two words similar.

In [2]:
from collections import defaultdict, Counter
import math
import operator
import gzip

In [3]:
window = 2
vocab_size = 10000

In [4]:
filename = "wiki.10K.txt"
# Read the whole corpus, lowercase it, and split on single spaces
with open(filename, encoding="utf-8") as f:
    wiki_data = f.read().lower().split(" ")


In [5]:
# We'll only create word representations for the `vocab_size` most frequent words

def create_vocab(data):
    word_representations = {}
    vocab = Counter(data)

    top_k = [word for word, counts in vocab.most_common(vocab_size)]
    for k in top_k:
        word_representations[k] = defaultdict(float)
    return word_representations

In [6]:
# word representation for a word = its unigram distributional context (the unigrams that show
# up in a window before and after each occurrence)

def count_unigram_context(data, word_representations):
    for i, word in enumerate(data):
        if word not in word_representations:
            continue
        start = i - window if i - window > 0 else 0
        end = i + window + 1 if i + window + 1 < len(data) else len(data)
        for j in range(start, end):
            if i != j:
                word_representations[word][data[j]] += 1

In [7]:
def count_directional_context(data, word_representations):
    for i, word in enumerate(data):
        if word not in word_representations:
            continue
        start = i - window if i - window > 0 else 0
        end = i + window + 1 if i + window + 1 < len(data) else len(data)
        left="L: %s" % ' '.join(data[start:i])
        right="R: %s" % ' '.join(data[i+1:end])

        word_representations[word][left] += 1
        word_representations[word][right] += 1

In [8]:
# normalize a word representation vector such that its L2 norm is 1.
# we do this so that the cosine similarity reduces to a simple dot product

def normalize(word_representations):
    for word in word_representations:
        total = 0
        for key in word_representations[word]:
            total += word_representations[word][key] * word_representations[word][key]

        total = math.sqrt(total)
        for key in word_representations[word]:
            word_representations[word][key] /= total


In [9]:
def dictionary_dot_product(dict1, dict2):
    dot = 0
    for key in dict1:
        if key in dict2:
            dot += dict1[key] * dict2[key]
    return dot
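
The comment on `normalize` says that unit-length vectors let cosine similarity reduce to a dot product. Here is a minimal sketch that checks this on two toy sparse vectors. The helper names (`l2_normalize`, `sparse_dot`, `cosine`) and the counts are assumptions made up for illustration; they are not part of the notebook or the Wikipedia data.

```python
import math

def l2_normalize(vec):
    # Return a copy of the sparse vector scaled to L2 norm 1 (hypothetical helper).
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()}

def sparse_dot(a, b):
    # Dot product over the keys the two sparse vectors share.
    return sum(v * b[k] for k, v in a.items() if k in b)

def cosine(a, b):
    # Textbook cosine: dot product divided by the product of the norms.
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return sparse_dot(a, b) / (norm_a * norm_b)

# Toy context counts (invented for this example).
u = {"L: an american": 3.0, "R: . he": 4.0}
v = {"R: . he": 2.0, "R: in the": 1.0}

full = cosine(u, v)
shortcut = sparse_dot(l2_normalize(u), l2_normalize(v))

# The two values agree up to floating-point error.
print(full, shortcut)
```

Normalizing once up front means each later similarity query only needs the dot product, which is why `find_sim` can call `dictionary_dot_product` directly.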

In [10]:
def find_sim(word_representations, query):
    if query not in word_representations:
        print("'%s' is not in vocabulary" % query)
        return None

    scores = {}
    for word in word_representations:
        cosine = dictionary_dot_product(word_representations[query], word_representations[word])
        scores[word] = cosine
    return scores

In [11]:
# Find the K words with highest cosine similarity to a query in a set of word_representations
def find_nearest_neighbors(word_representations, query, K):
    scores = find_sim(word_representations, query)
    if scores is not None:
        sorted_x = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        for idx, (k, v) in enumerate(sorted_x[:K]):
            print("%s\t%s\t%.5f" % (idx,k,v))

Explore the difference between `count_unigram_context` and `count_directional_context` for determining what counts as "context". `count_unigram_context` treats each individual unigram in the bag of words around a target as its own "context" variable, while `count_directional_context` treats the whole sequence of words before the target, and the whole sequence after it, as a single "context" each, and records the direction in which it occurs (to the left or right of the word).
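
To make the two definitions of context concrete, here is a small sketch that extracts both kinds of features for one occurrence of a target word. The sentence is a toy example invented for illustration (it is not drawn from the Wikipedia data), and the code re-implements the windowing logic inline rather than calling the notebook's functions.

```python
from collections import defaultdict

# Toy token list (an assumption for illustration only).
tokens = "john is an american actor . he was born in ohio".split(" ")
window = 2
target = "actor"
i = tokens.index(target)

# Clip the window to the bounds of the token list.
start = max(0, i - window)
end = min(len(tokens), i + window + 1)

# Bag-of-words contexts: each neighboring token is its own feature.
unigram_contexts = defaultdict(float)
for j in range(start, end):
    if j != i:
        unigram_contexts[tokens[j]] += 1

# Directional contexts: the whole left span and the whole right span
# are each one feature, tagged with the side they occur on.
directional_contexts = defaultdict(float)
directional_contexts["L: %s" % " ".join(tokens[start:i])] += 1
directional_contexts["R: %s" % " ".join(tokens[i + 1:end])] += 1

print(dict(unigram_contexts))      # four separate features: an, american, ., he
print(dict(directional_contexts))  # two features: 'L: an american', 'R: . he'
```

The unigram version produces more, smaller features that are shared by many words; the directional version produces fewer, more specific features that preserve word order and side.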

In [12]:
word_representations = create_vocab(wiki_data)
count_directional_context(wiki_data, word_representations)
normalize(word_representations)

In [13]:
find_nearest_neighbors(word_representations, "actor", 10)

0	actor	1.00000
1	politician	0.54099
2	actress	0.52242
3	cricketer	0.42361
4	artist	0.40005
5	writer	0.38234
6	cyclist	0.36833
7	musician	0.33385
8	diplomat	0.32010
9	poet	0.31124


In [14]:
# Let's find the contexts shared between two words that have the most contribution
# to the cosine similarity

def find_shared_contexts(word_representations, query1, query2, K):
    if query1 not in word_representations:
        print("'%s' is not in vocabulary" % query1)
        return None

    if query2 not in word_representations:
        print("'%s' is not in vocabulary" % query2)
        return None

    context_scores = {}
    dict1 = word_representations[query1]
    dict2 = word_representations[query2]

    for key in dict1:
        if key in dict2:
            score = dict1[key] * dict2[key]
            context_scores[key] = score

    sorted_x = sorted(context_scores.items(), key=lambda x: x[1], reverse=True)
    for idx, (k, v) in enumerate(sorted_x[:K]):
        print("%s\t%s\t%.5f" % (idx,k,v))

In [15]:
find_shared_contexts(word_representations, "actor", "politician", 10)

0	R: . he	0.21961
1	L: an american	0.13391
2	R: ) .	0.11417
3	R: in the	0.01410
4	L: an indian	0.00761
5	L: a canadian	0.00677
6	L: an english	0.00564
7	R: of the	0.00564
8	L: a french	0.00423
9	R: , who	0.00338


We can see here that the single feature with the most impact on the similarity between these two words is the directional n-gram "R: . he", that is, the sequence ". he" appearing immediately to the right of the word (as it would in text like "John is an actor **. He** ...").
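
Because the vectors are L2-normalized, the per-context scores are additive: summing the products over all shared contexts gives the cosine similarity between the two words. The sketch below illustrates this with made-up counts. The numbers and the `l2_normalize` helper are assumptions for illustration only and do not reproduce the values printed above.

```python
import math

def l2_normalize(vec):
    # Scale a sparse vector to L2 norm 1 (hypothetical helper).
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()}

# Invented context counts for two words (not from the Wikipedia data).
actor = l2_normalize({"R: . he": 5.0, "L: an american": 3.0,
                      "R: in the": 1.0, "L: the lead": 2.0})
politician = l2_normalize({"R: . he": 4.0, "L: an american": 4.0,
                           "R: of the": 2.0})

# Contribution of each shared context to the similarity.
contributions = {k: actor[k] * politician[k] for k in actor if k in politician}
ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)

# Contexts present in only one vector contribute nothing, so the sum over
# shared contexts is the full dot product (the cosine similarity).
total = sum(contributions.values())
cosine = sum(actor[k] * politician.get(k, 0.0) for k in actor)

for context, score in ranked:
    print("%s\t%.5f\t%.1f%% of similarity" % (context, score, 100 * score / total))
```

Reading the scores as shares of the total makes it easier to see when one or two very frequent contexts account for most of a similarity score.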

**Activity**: Find the nearest neighbors for other words above (in the `find_nearest_neighbors` cell); then find the shared contexts for a pair of nearest neighbors (as we did for actor/politician).  What does this reveal about what drives similarity?

In [18]:
find_nearest_neighbors(word_representations, "home", 10)

0	home	1.00000
1	back	0.60503
2	return	0.57933
3	adjacent	0.54039
4	according	0.53558
5	refers	0.53191
6	close	0.53029
7	belonging	0.53003
8	added	0.52894
9	contribution	0.52825


In [19]:
find_shared_contexts(word_representations, "home", "sanctuary", 10)

0	R: of the	0.07470
1	L: in the	0.04597
2	R: to the	0.03927
3	L: at the	0.02011
4	R: . the	0.02011
5	L: . the	0.02011
6	R: , and	0.01293
7	L: at his	0.00862
8	L: of the	0.00862
9	L: to the	0.00623
