## Distributional Semantics: Co-Occurrence Matrices and PMI

This exercise explores **Distributional Semantics**, the principle that "a word is characterised by the company it keeps". We will construct a numerical representation of this idea by creating a **Co-occurrence Matrix** from a sample corpus. To extract meaningful relationships beyond simple frequency, we will then compute **Pointwise Mutual Information (PMI) scores**. This method is foundational to understanding modern word embeddings (like **Word2Vec**), as it quantifies the strength of association between words based on their observed co-occurrences. As in the previous exercises, we will use the email summary dataset to illustrate the concepts.

#### We will cover the following topics as part of this exercise

- **Distributional Hypothesis**: Definition and core concept.

- **Co-occurrence Matrix Construction**: Generating counts using a context window.

- **Pointwise Mutual Information (PMI)**: Definition and calculation.

- **Comparative Analysis**: Contrasting raw counts with PMI scores.

- **Practical Application**: Illustrating word connections via the final matrix.

#### What we will learn from this exercise:

- The **Distributional Hypothesis** states that words with similar meanings appear in similar contexts.

- A **Co-occurrence Matrix** is the tabular representation of how often words appear together within a specified context window.

- **PMI** is a statistical measure used to quantify the strength of association between two words, prioritising interesting co-occurrences over merely frequent ones.

- High **PMI values reveal meaningful semantic relationships between words**.

#### Note:

Like the previous exercises, we will use our email thread dataset that has 4167 threads and 21684 emails. We will only use a small sample of the dataset and focus on exploring the concepts of distributional semantics.

**Let's get started now**

#### Setup and Pre-requisites

In [None]:
!pip install nltk



In [1]:
import yaml

config_path='/Users/aditikulkarni/Documents/Masters/AI-Projects/05-DL-NLP/nlp-semantic/'
# Load the environment.yml file
print (config_path + "/configs/environment.yml")
with open(config_path + "/configs/environment.yml", "r") as f:
    config = yaml.safe_load(f)

# Choose environment (local or aws)
env = "local"   # or "aws"

base_path = config[env]["base_path"]
raw_data_path = base_path + config[env]["raw_data"]
processed_data_path = base_path + config[env]["processed_data"]
models_path = base_path + config[env]["models"]

print("Raw data path:", raw_data_path)
print("Processed data path:",  processed_data_path)
print("Models path:",  models_path)

/Users/aditikulkarni/Documents/Masters/AI-Projects/05-DL-NLP/nlp-semantic//configs/environment.yml
Raw data path: /Users/aditikulkarni/Documents/Masters/AI-Projects/05-DL-NLP/nlp-semantic/data/raw/
Processed data path: /Users/aditikulkarni/Documents/Masters/AI-Projects/05-DL-NLP/nlp-semantic/data/processed/
Models path: /Users/aditikulkarni/Documents/Masters/AI-Projects/05-DL-NLP/nlp-semantic/models/


In [2]:
import numpy as np
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from math import log
import json

print(f"NumPy Version: {np.__version__}")
print(f"Pandas Version: {pd.__version__}")
print(f"NLTK Version: {nltk.__version__}")

# Download NLTK resources required for tokenization
try:
    nltk.download('punkt', quiet=True)
    nltk.download('punkt_tab', quiet=True)
except Exception as e:
    print(f"NLTK download error: {e}")

NumPy Version: 2.3.1
Pandas Version: 2.3.1
NLTK Version: 3.9.2


**Sample Corpus**

We create a small corpus from our email dataset to clearly illustrate the counting process.

In [5]:
# Loading the JSON data (context managers close the files after reading)
with open(raw_data_path + "email_thread_details.json") as f:
    email_data = json.load(f)
with open(raw_data_path + "email_thread_summaries.json") as f:
    email_summary = json.load(f)

In [6]:
# Let's select some more samples to create our small corpus

import random

random.seed(123) # Set a random seed for reproducibility
sampled_keys = random.sample(list(range(len(email_summary))), 2)

sub_email_dataset = [email_summary[k]['summary'] for k in sampled_keys]

In [7]:
sub_email_dataset

["Clayton asks Chris for names and telephone numbers of individuals at various pipelines who can provide him with FT and IT charges, both peak and off-peak, going back 4 years. He mentions that these individuals should be cool with Enron and/or Chris's name. Chris agrees to help and asks Clayton to email him the specific information he needs. Clayton expresses his gratitude and offers to treat Chris to a nice lunch as a favor.",
 'The email thread covers various topics, including trading, relationships, hiring a call girl, hurricanes, and making plans for a Saturday night. The first email asks about someone\'s trading activities. The second email expresses a desire for a girlfriend after marriage. The third email discusses hiring a call girl for specific fantasies and asks about the cost. The fourth email refers to someone as a home wrecker. The fifth email hopes the recipient avoids a tropical storm in the Gulf and mentions the movie "The Perfect Storm." The sixth email agrees to meet

Since the summaries are long and consist of multiple sentences, let us take only one summary and split it into sentences to form a small corpus.

In [8]:
CORPUS = sub_email_dataset[1].split(". ")
CORPUS

['The email thread covers various topics, including trading, relationships, hiring a call girl, hurricanes, and making plans for a Saturday night',
 "The first email asks about someone's trading activities",
 'The second email expresses a desire for a girlfriend after marriage',
 'The third email discusses hiring a call girl for specific fantasies and asks about the cost',
 'The fourth email refers to someone as a home wrecker',
 'The fifth email hopes the recipient avoids a tropical storm in the Gulf and mentions the movie "The Perfect Storm." The sixth email agrees to meet up after the recipient returns and hopes they avoid hurricanes',
 'The final email suggests doing a mud pie activity and confirms plans for Saturday night.']

Let us build the vocabulary, which is the list of unique words present in our corpus.

In [10]:
VOCAB = set()
for sentence in CORPUS:
  for word in sentence.split():
    VOCAB.add(word)

VOCAB = list(VOCAB)
VOCAB

['girl,',
 'activities',
 'various',
 'home',
 'agrees',
 'and',
 'trading,',
 'up',
 'suggests',
 'activity',
 'fantasies',
 'asks',
 'fourth',
 'to',
 'avoids',
 'someone',
 'returns',
 'they',
 'Gulf',
 'including',
 'for',
 'storm',
 'hurricanes,',
 'about',
 'discusses',
 'sixth',
 'call',
 'Perfect',
 'expresses',
 'cost',
 '"The',
 'plans',
 'final',
 'fifth',
 'as',
 'trading',
 'topics,',
 'specific',
 'The',
 'second',
 'Saturday',
 'the',
 'recipient',
 'girl',
 'third',
 'avoid',
 'tropical',
 'refers',
 'mentions',
 'Storm."',
 'email',
 'night',
 'a',
 'making',
 'movie',
 'covers',
 'first',
 'girlfriend',
 'hurricanes',
 'doing',
 'in',
 'pie',
 'confirms',
 "someone's",
 'night.',
 'hiring',
 'desire',
 'hopes',
 'mud',
 'after',
 'marriage',
 'meet',
 'wrecker',
 'thread',
 'relationships,']

In [11]:
# Define parameters
CONTEXT_WINDOW = 2
VOCAB_SIZE = len(VOCAB)

print("\n--- Corpus and Parameters ---")
print(f"Total Sentences: {len(CORPUS)}")
print(f"Total Vocabulary Size: {VOCAB_SIZE}")
print(f"Context Window Size: {CONTEXT_WINDOW} (meaning 2 words to the left and 2 to the right)")


--- Corpus and Parameters ---
Total Sentences: 7
Total Vocabulary Size: 75
Context Window Size: 2 (meaning 2 words to the left and 2 to the right)
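
To make the window definition concrete, here is a minimal sketch that lists the context of every token in one short sentence. The helper name `context_window` and the example tokens are illustrative assumptions and are not part of the exercise code.

```python
def context_window(tokens, position, window=2):
    """Return the tokens within `window` positions of `position`, excluding the target itself."""
    start = max(0, position - window)
    end = min(len(tokens), position + window + 1)
    return [tokens[j] for j in range(start, end) if j != position]

# Hypothetical, already tokenised sentence
tokens = ["the", "first", "email", "asks", "about", "trading"]

for i, target in enumerate(tokens):
    print(f"{target:8} -> {context_window(tokens, i)}")
```

Words at the start or end of a sentence simply get a shorter context, because the window is clipped at the sentence boundary.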


#### Distributional Hypothesis

The **Distributional Hypothesis** is the central idea behind distributional semantics: **words that occur in similar contexts tend to have similar meanings**. For instance, the words "car" and "truck" might both frequently co-occur with "drive," "engine," and "road," suggesting they are semantically related (vehicles).
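
The "car" and "truck" intuition can be sketched in a few lines. The toy sentences, the helper names `context_sets` and `jaccard`, and the choice of Jaccard overlap as a similarity measure are all assumptions made for illustration; the rest of this exercise uses co-occurrence counts and PMI instead.

```python
from collections import defaultdict

# Hypothetical toy sentences (not taken from the email dataset)
sentences = [
    "i drive the car on the road",
    "i drive the truck on the road",
    "the car has an engine",
    "the truck has an engine",
    "i eat the apple for lunch",
]

def context_sets(sentences, window=2):
    """Collect, for every word, the set of words seen within `window` positions of it."""
    contexts = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            contexts[word].update(tokens[j] for j in range(lo, hi) if j != i)
    return contexts

def jaccard(a, b):
    """Share of context words that two sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

contexts = context_sets(sentences)
print("car vs truck:", jaccard(contexts["car"], contexts["truck"]))
print("car vs apple:", jaccard(contexts["car"], contexts["apple"]))
```

In this toy data, "car" and "truck" share all of their context words, while "car" and "apple" share only "the".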

**Preprocessing and Vocabulary**:
We have seen a simple way to build the vocabulary. However, as in our lexical processing exercise, we need additional pre-processing steps such as case conversion, tokenisation, and keeping only alphanumeric tokens to build a noise-free vocabulary.

So, next, we will tokenise the corpus and build a unique vocabulary.

In [12]:
# Tokenise and flatten the corpus
tokenized_corpus = [word_tokenize(sent.lower()) for sent in CORPUS]
all_words = [word for sent in tokenized_corpus for word in sent if word.isalnum()]

# Create unique vocabulary and word-to-index mapping
vocabulary = sorted(list(set(all_words)))
word_to_index = {word: i for i, word in enumerate(vocabulary)}
index_to_word = {i: word for i, word in enumerate(vocabulary)}
VOCAB_SIZE = len(vocabulary)

print(f"\nVocabulary Size: {VOCAB_SIZE}")
vocabulary


Vocabulary Size: 67


['a',
 'about',
 'activities',
 'activity',
 'after',
 'agrees',
 'and',
 'as',
 'asks',
 'avoid',
 'avoids',
 'call',
 'confirms',
 'cost',
 'covers',
 'desire',
 'discusses',
 'doing',
 'email',
 'expresses',
 'fantasies',
 'fifth',
 'final',
 'first',
 'for',
 'fourth',
 'girl',
 'girlfriend',
 'gulf',
 'hiring',
 'home',
 'hopes',
 'hurricanes',
 'in',
 'including',
 'making',
 'marriage',
 'meet',
 'mentions',
 'movie',
 'mud',
 'night',
 'perfect',
 'pie',
 'plans',
 'recipient',
 'refers',
 'relationships',
 'returns',
 'saturday',
 'second',
 'sixth',
 'someone',
 'specific',
 'storm',
 'suggests',
 'the',
 'they',
 'third',
 'thread',
 'to',
 'topics',
 'trading',
 'tropical',
 'up',
 'various',
 'wrecker']
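
The drop in vocabulary size comes from merging variants such as `'The'`/`'the'` and `'trading,'`/`'trading'`. The sketch below shows the effect on a fragment of the first sentence. It uses a simple regular expression instead of NLTK, which is an assumption made to keep the example dependency-free; the two tokenisers can differ on cases such as apostrophes.

```python
import re

text = "The email thread covers various topics, including trading, relationships"

# Naive whitespace split keeps case and attached punctuation
naive_tokens = text.split()

# Lowercase first, then keep only runs of letters and digits
clean_tokens = re.findall(r"[a-z0-9]+", text.lower())

# Tokens that exist only because of case or punctuation noise
noisy_only = sorted(set(naive_tokens) - set(clean_tokens))
print(noisy_only)
```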

#### Co-occurrence Matrix Construction

A Co-occurrence Matrix $M$ is a square matrix whose rows and columns both represent the words in the vocabulary. The entry $M_{ij}$ counts how many times word $w_i$ and word $w_j$ appear together within the defined context window across the entire corpus. Because the window extends equally to the left and to the right, $M$ is symmetric.

**Implementation**: We iterate through the corpus and increment counts for co-occurring pairs.

In [13]:
# Initialise the co-occurrence matrix with zeros
co_occurrence_matrix = np.zeros((VOCAB_SIZE, VOCAB_SIZE), dtype=np.int32)

# Build the matrix
for sentence in tokenized_corpus:
    # Keep only tokens present in the vocabulary (drops punctuation and other non-alphanumeric tokens)
    clean_sentence = [word for word in sentence if word in word_to_index]

    for i, target_word in enumerate(clean_sentence):
        target_index = word_to_index[target_word]

        # Define the context boundary (excluding the target word itself)
        start = max(0, i - CONTEXT_WINDOW)
        end = min(len(clean_sentence), i + CONTEXT_WINDOW + 1)

        for j in range(start, end):
            if i != j: # Ensure we don't count self-co-occurrences
                context_word = clean_sentence[j]
                context_index = word_to_index[context_word]

                # Increment the count for the pair (target_word, context_word)
                co_occurrence_matrix[target_index, context_index] += 1

# Convert to a DataFrame for clear visualization
co_occurrence_df = pd.DataFrame(
    co_occurrence_matrix,
    index=vocabulary,
    columns=vocabulary
)

print("\n--- Raw Co-occurrence Matrix (partial display)---")
print(co_occurrence_df.head(5).iloc[:, 0:5])



--- Raw Co-occurrence Matrix (partial display)---
            a  about  activities  activity  after
a           0      0           0         0      1
about       0      0           0         0      0
activities  0      0           0         0      0
activity    0      0           0         0      0
after       1      0           0         0      0
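
Most entries of a co-occurrence matrix are zero, so for larger vocabularies it can be convenient to store only the pairs that actually occur. The following is a minimal sketch of that idea using `collections.Counter`; the function name `cooccurrence_counts` and the toy sentences are assumptions for illustration, and the dense NumPy matrix remains the representation used in this exercise.

```python
from collections import Counter

def cooccurrence_counts(tokenized_sentences, window=2):
    """Count (target, context) pairs within a symmetric window, storing only non-zero pairs."""
    counts = Counter()
    for tokens in tokenized_sentences:
        for i, target in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:  # skip the target word itself
                    counts[(target, tokens[j])] += 1
    return counts

# Hypothetical toy corpus
toy = [["the", "first", "email", "asks"], ["the", "second", "email", "asks"]]
counts = cooccurrence_counts(toy)

print(counts[("the", "email")], counts[("email", "the")])  # symmetric pair
print(counts[("the", "asks")])  # outside the window, so the Counter returns 0
```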


In [14]:
# Inspect a specific pair (this one co-occurs only once in the sample):
pair = ('email', 'thread')
idx_1 = word_to_index[pair[0]]
idx_2 = word_to_index[pair[1]]
print(f"\nRaw Count for ('{pair[0]}', '{pair[1]}'): {co_occurrence_matrix[idx_1, idx_2]}")

# Normalisation (optional, but standard for some models)
# L2 Normalisation (vector length):
# co_occurrence_normalized = co_occurrence_matrix / np.linalg.norm(co_occurrence_matrix, axis=1, keepdims=True)


Raw Count for ('email', 'thread'): 1
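
If the optional L2 normalisation is applied, any word with an all-zero row would cause a division by zero. Below is a minimal sketch of one way to guard against that; the helper name `l2_normalize_rows` and the toy matrix are assumptions for illustration.

```python
import numpy as np

def l2_normalize_rows(matrix):
    """Scale each row to unit length, leaving all-zero rows unchanged."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    safe_norms = np.where(norms == 0, 1.0, norms)  # avoid dividing by zero
    return matrix / safe_norms

# Hypothetical toy counts; the middle row has no co-occurrences at all
toy_counts = np.array([[3, 4, 0], [0, 0, 0], [1, 0, 0]])
normalized = l2_normalize_rows(toy_counts)
print(normalized)
```

After normalisation, every non-zero row has length 1, so rows can be compared by direction rather than by how frequent the word is.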


#### Pointwise Mutual Information (PMI)

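**Pointwise Mutual Information (PMI)** measures the association between two specific outcomes of random variables (in this case, two words). It quantifies how much more often two words $w_1$ and $w_2$ co-occur than would be expected by chance if they were independent. A PMI of 0 means the pair co-occurs exactly as often as independence predicts; positive values mean more often, and negative values mean less often.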

$$\text{PMI}(w_1, w_2) = \log_2 \left( \frac{P(w_1, w_2)}{P(w_1) P(w_2)} \right)$$


Where:

- $P(w_1 , w_2)$ is the **joint probability** of word $w_1$ and word $w_2$ co-occurring.
- $P(w_1)$ and $P(w_2)$ are the **individual (marginal) probabilities** of $w_1$ and $w_2$ appearing in any co-occurrence pair.
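
As a quick sanity check of the formula, the sketch below computes PMI directly from counts. The function name `pmi_from_counts` and all the numbers are hypothetical and chosen only so the arithmetic is easy to follow.

```python
from math import log2

def pmi_from_counts(pair_count, count_w1, count_w2, total):
    """PMI in bits, with probabilities estimated as count / total."""
    p_pair = pair_count / total
    p_w1 = count_w1 / total
    p_w2 = count_w2 / total
    return log2(p_pair / (p_w1 * p_w2))

# Hypothetical counts: P(w1, w2) = 0.02, P(w1) = 0.10, P(w2) = 0.05.
# The pair occurs 4 times more often than independence predicts, so PMI is about 2 bits.
associated = pmi_from_counts(pair_count=2, count_w1=10, count_w2=5, total=100)

# Hypothetical counts where the pair occurs exactly as often as chance predicts: PMI is about 0.
independent = pmi_from_counts(pair_count=1, count_w1=10, count_w2=10, total=100)

print(round(associated, 2), round(independent, 2))
```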

**PMI Calculation**: We first calculate the necessary probabilities from the raw counts.

In [15]:
# Total number of co-occurrences in the matrix (sum of all entries)
total_co_occurrences = np.sum(co_occurrence_matrix)

# Word frequency (row sums; column sums are identical because the matrix is symmetric) - approximates P(w)
word_frequencies = np.sum(co_occurrence_matrix, axis=1)

# Initialise the PMI matrix
pmi_matrix = np.zeros((VOCAB_SIZE, VOCAB_SIZE))

for i in range(VOCAB_SIZE):
    for j in range(VOCAB_SIZE):
        raw_count = co_occurrence_matrix[i, j]

        # Only proceed if there is at least one co-occurrence (to avoid log(0))
        if raw_count > 0:
            # 1. P(w_i, w_j): Joint Probability
            p_ij = raw_count / total_co_occurrences

            # 2. P(w_i) and P(w_j): Individual Probabilities (approximation)
            p_i = word_frequencies[i] / total_co_occurrences
            p_j = word_frequencies[j] / total_co_occurrences

            # 3. Calculate PMI and clip negative values to zero (Positive PMI, PPMI)
            pmi = log(p_ij / (p_i * p_j), 2)
            pmi_matrix[i, j] = max(pmi, 0.0)
        else:
            # No co-occurrence: PMI is undefined (log of 0), so PPMI stores 0.
            pmi_matrix[i, j] = 0

# Convert to a DataFrame for visualisation
pmi_df = pd.DataFrame(
    pmi_matrix,
    index=vocabulary,
    columns=vocabulary
)

print("\n--- Positive PMI (PPMI) Matrix (Sample) ---")
print(pmi_df.head(5).iloc[:, 0:5].round(2))


--- Positive PMI (PPMI) Matrix (Sample) ---
               a  about  activities  activity  after
a           0.00    0.0         0.0       0.0   0.93
about       0.00    0.0         0.0       0.0   0.00
activities  0.00    0.0         0.0       0.0   0.00
activity    0.00    0.0         0.0       0.0   0.00
after       0.93    0.0         0.0       0.0   0.00
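
The nested loops are easy to read but slow for large vocabularies. A vectorised version of Positive PMI can be sketched with NumPy broadcasting as follows. The function name `ppmi_matrix` and the toy counts are assumptions for illustration, not the code used above.

```python
import numpy as np

def ppmi_matrix(counts):
    """Positive PMI (base 2) for a co-occurrence count matrix."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row_totals = counts.sum(axis=1, keepdims=True)
    col_totals = counts.sum(axis=0, keepdims=True)
    # Count expected for each pair if the two words were independent
    expected = row_totals * col_totals / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(counts / expected)
    pmi[~np.isfinite(pmi)] = 0.0   # pairs that never co-occur
    return np.maximum(pmi, 0.0)    # clip negative PMI to zero

# Hypothetical symmetric toy counts for three words
toy_counts = np.array([[0, 1, 5],
                       [1, 0, 5],
                       [5, 5, 0]])
ppmi = ppmi_matrix(toy_counts)
print(ppmi.round(2))
```

In this toy matrix the first two words co-occur less often than chance predicts, so their PMI is negative and is clipped to 0.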


#### Comparative Analysis: Raw Co-occurrence vs. PMI

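Raw co-occurrence counts are dominated by frequent words (e.g., "the," "a") simply because those words appear almost everywhere. PMI instead highlights pairs that co-occur more often than their individual frequencies would predict, which tends to surface more meaningful associations.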

**Inspecting Relationships** : Let's compare the count and PMI for common (less informative) and specific (more informative) pairs.

In [16]:
pairs_to_compare = [
    ('the', 'email'),   # Common function word + specific word
    ('fourth', 'movie'), # Descriptive word + specific word
    ('making', 'plans')  # Strong verb-noun relationship
]

print("\n--- Comparison: Raw Counts vs. PMI ---")
print(f"{'Pair':15} | {'Raw Count':10} | {'PMI Score':10}")
print("-" * 40)

for w1, w2 in pairs_to_compare:
    idx1 = word_to_index[w1]
    idx2 = word_to_index[w2]

    raw = co_occurrence_matrix[idx1, idx2]
    pmi = pmi_matrix[idx1, idx2]

    pair_label = f"({w1}, {w2})"
    print(f"{pair_label:15} | {raw:10} | {pmi:10.2f}")



--- Comparison: Raw Counts vs. PMI ---
Pair            | Raw Count  | PMI Score 
----------------------------------------
(the, email)    |          9 |       1.59
(fourth, movie) |          0 |       0.00
(making, plans) |          1 |       3.73


1. **('the', 'email')**: High **Raw Count (9)** due to the frequent word *'the'*, but only **moderate PMI (1.59)** since *'the'* co-occurs widely and isn’t strongly linked to *'email'*.

2. **('fourth', 'movie')**: **Raw Count (0)** means the two words never appear within the same context window, so the **PMI is set to 0.00** by convention rather than computed.

3. **('making', 'plans')**: **Low Raw Count (1)** but **high PMI (3.73)**, showing a strong, meaningful association between the two words.
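
The contrast between the two rankings can be reproduced on a tiny matrix. The vocabulary and counts below are hypothetical (loosely modelled on the pairs above, not the real values), and `pair_pmi` is an illustrative helper.

```python
import numpy as np

# Hypothetical symmetric counts for four words
vocab = ["the", "email", "making", "plans"]
counts = np.array([
    [0, 9, 1, 1],
    [9, 0, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
], dtype=float)

total = counts.sum()
row_totals = counts.sum(axis=1)

def pair_pmi(i, j):
    """PMI (base 2) of the words at indices i and j; assumes counts[i, j] > 0."""
    return np.log2(counts[i, j] * total / (row_totals[i] * row_totals[j]))

# Consider each unordered pair once and skip pairs that never co-occur
pairs = [(i, j) for i in range(len(vocab))
         for j in range(i + 1, len(vocab)) if counts[i, j] > 0]

top_by_count = max(pairs, key=lambda p: counts[p])
top_by_pmi = max(pairs, key=lambda p: pair_pmi(*p))

print("Top pair by raw count:", vocab[top_by_count[0]], vocab[top_by_count[1]])
print("Top pair by PMI:      ", vocab[top_by_pmi[0]], vocab[top_by_pmi[1]])
```

The frequent word "the" wins on raw count, while the rarer but more exclusive pair wins on PMI.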

#### Practical Application: Revealing Word Connections

The final PMI matrix serves as a vector-based representation of word meaning. Each row (or, since the matrix is symmetric, each column) is the vector for one word, and sorting it shows which words are most strongly associated with that target word.

In [17]:
target_word = 'email'

# Get the PMI vector for the target word
pmi_vector = pmi_df.loc[target_word]

# Find the words with the highest PMI scores with the target
most_associated_words = pmi_vector.sort_values(ascending=False).drop(target_word).head(5)

print(f"\n--- Words Most Associated with '{target_word}' (by PMI) ---")
print(most_associated_words.round(2))

target_word_2 = 'confirms'
pmi_vector_2 = pmi_df.loc[target_word_2]
most_associated_words_2 = pmi_vector_2.sort_values(ascending=False).drop(target_word_2).head(5)

print(f"\n--- Words Most Associated with '{target_word_2}' (by PMI) ---")
print(most_associated_words_2.round(2))


--- Words Most Associated with 'email' (by PMI) ---
second    2.2
third     2.2
fourth    2.2
first     2.2
final     2.2
Name: email, dtype: float64

--- Words Most Associated with 'confirms' (by PMI) ---
activity    4.73
plans       3.73
for         2.73
and         2.41
a           0.00
Name: confirms, dtype: float64
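
Because each row is a vector, two words can also be compared directly, for example with cosine similarity. The sketch below uses hypothetical three-dimensional PMI rows; the helper `cosine_similarity` and the numbers are assumptions for illustration, not values from the matrix above.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

# Hypothetical PMI rows over the same three context words
first_vec = [2.2, 0.0, 1.0]
second_vec = [2.2, 0.0, 1.0]
plans_vec = [0.0, 3.7, 0.0]

print(round(cosine_similarity(first_vec, second_vec), 2))  # same contexts
print(round(cosine_similarity(first_vec, plans_vec), 2))   # no shared contexts
```

Words that appear in the same contexts get a similarity close to 1, which is the distributional hypothesis expressed as geometry.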


#### Conclusion

This exercise successfully demonstrated the foundational steps of Distributional Semantics. We moved from a raw text corpus to a numerical representation by:

1. Constructing a Co-occurrence Matrix using a defined context window.

2. Computing the Pointwise Mutual Information (PMI) score.

By comparing raw counts and PMI, we saw that raw counts are skewed by high-frequency words, while PMI highlights pairs that co-occur more often than chance would predict. On a corpus this small, PMI also rewards pairs that appear only once, so the scores should be read as illustrative rather than reliable. This technique, often refined using dimensionality reduction, forms the basis of classic count-based word embedding models and shows how word meaning can be encoded by observing the patterns of a word's neighbours.
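
As a pointer to the dimensionality reduction step, here is a minimal sketch that compresses PMI rows into short dense vectors with a truncated SVD. The helper name `reduce_with_svd`, the toy matrix, and the choice of `k=2` are assumptions for illustration; this step was not performed in the exercise.

```python
import numpy as np

def reduce_with_svd(matrix, k):
    """Keep the top-k singular directions and return one k-dimensional vector per row."""
    U, S, Vt = np.linalg.svd(np.asarray(matrix, dtype=float), full_matrices=False)
    return U[:, :k] * S[:k]

# Hypothetical 4-word PMI matrix; the first two words have identical rows
toy_ppmi = np.array([
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 1.0, 2.0],
    [1.0, 1.0, 0.0, 0.0],
    [2.0, 2.0, 0.0, 0.0],
])

embeddings = reduce_with_svd(toy_ppmi, k=2)
print(embeddings.shape)
print(embeddings.round(2))
```

Words with identical PMI rows end up with identical low-dimensional vectors, so the similarity structure is preserved while the vectors become much shorter.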