DSBA-6162<br>
Nathan Schaaf<br>
3/18/2025

# Word Association Mining Exercise
In this exercise we will explore some basic word association mining techniques. The example corpus contains 180 movie reviews, each stored as a text file and written by one of two reviewers. The first 80 were written by Berardinelli and the remaining 100 by Schwartz.

If you want to work through this exercise, you will first need to clone the repository. Please refer to the README.MD file.

In [1]:
import os
import math
from collections import Counter # Used in Problem 3
import nltk # Used in Problem 3
from nltk.corpus import stopwords # Used in Problem 3
import string # Used in Problem 3

## Problem 1
Calculate the entropy of the appearance of the word "director" in the corpus.

In [2]:
# Define the path to the folder containing the movie reviews
folder_path = r"C:/Users/natha/OneDrive/Desktop/Data_Mining/07/word_association_mining_exercise/MovieReviews"  # Replace this path with the location of your cloned repo

# Word to analyze
word = "director"

# List to store word counts per file
word_counts = []

# Iterate through each text file in the directory
for filename in os.listdir(folder_path):
    if filename.endswith('.txt'):  # Ensure we only process text files
        file_path = os.path.join(folder_path, filename)
        with open(file_path, 'r', encoding='latin-1') as file:  # Use encoding='latin-1' because 'utf-8' has errors
            text = file.read().lower()  # Convert to lowercase for consistency
            count = text.split().count(word)  # Count occurrences of the word
            word_counts.append(count)

# Compute probability distribution
total_occurrences = sum(word_counts)
if total_occurrences == 0:
    print(f"The word '{word}' does not appear in the corpus.")
    exit(1)

probabilities = [count / total_occurrences for count in word_counts]

# Calculate Shannon entropy
entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)

In [3]:
print(f"The entropy of the word '{word}' in the corpus is {entropy:.4f} bits.")

The entropy of the word 'director' in the corpus is 5.4528 bits.


## Explanation
<p><strong>Entropy</strong> is a measure of uncertainty or randomness in a distribution. In this case, the distribution describes how the occurrences of the word "director" are spread across the 180 movie reviews.</p>
<p>Entropy is measured in bits (binary digits). One bit is the uncertainty of a single yes/no (0/1) decision between two equally likely outcomes.</p>
<p>A <strong>low entropy</strong> (close to zero) indicates high predictability: the occurrences are concentrated in a few reviews.<br>
A <strong>high entropy</strong>, in this case 5.45 bits, means the occurrences of the word "director" are spread across many reviews, so it is harder to predict which review a given occurrence comes from.</p>
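
To make the low/high contrast concrete, here is a minimal sketch that computes Shannon entropy for two toy distributions. The helper name `shannon_entropy` and the counts are illustrative assumptions and are not part of the exercise data.

```python
import math

def shannon_entropy(counts):
    """Return the Shannon entropy (in bits) of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must contain at least one positive value")
    probabilities = [c / total for c in counts]
    # Terms with p == 0 contribute nothing, so skip them to avoid dividing by zero
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

# Hypothetical toy data: occurrences of a word in four documents
concentrated = [8, 0, 0, 0]  # the word appears in only one document
uniform = [2, 2, 2, 2]       # the word is spread evenly over all documents

low = shannon_entropy(concentrated)
high = shannon_entropy(uniform)
print(f"concentrated: {low:.4f} bits")
print(f"uniform: {high:.4f} bits")
```

The evenly spread case reaches the maximum for four documents, log2(4) = 2 bits. By the same reasoning, if all 180 reviews are counted, the largest value the entropy in Problem 1 could take is log2(180), roughly 7.49 bits.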


## Problem 2
Calculate the mutual information between the word "director" and the document author.

In [4]:
# Number of reviews per author
num_reviews_B = 80
num_reviews_S = 100
total_reviews = num_reviews_B + num_reviews_S

# Initialize counters
c_BY = c_BN = c_SY = c_SN = 0

# Iterate through all reviews in sorted order; the first 80 files are Berardinelli's
txt_files = sorted(f for f in os.listdir(folder_path) if f.endswith('.txt'))  # Filter first so non-text files cannot shift the index
for i, filename in enumerate(txt_files):
    file_path = os.path.join(folder_path, filename)
    with open(file_path, 'r', encoding='latin-1', errors='ignore') as file:
        text = file.read().lower()
        if word in text:  # Substring match: also matches "directors", "directorial", etc.
            if i < num_reviews_B:  # First 80 files → Berardinelli
                c_BY += 1
            else:       # Remaining files → Schwartz
                c_SY += 1

# Compute missing values
c_BN = num_reviews_B - c_BY
c_SN = num_reviews_S - c_SY

# Compute probabilities
P_B = num_reviews_B / total_reviews
P_S = num_reviews_S / total_reviews
P_Y = (c_BY + c_SY) / total_reviews
P_N = (c_BN + c_SN) / total_reviews

P_BY = c_BY / total_reviews
P_BN = c_BN / total_reviews
P_SY = c_SY / total_reviews
P_SN = c_SN / total_reviews

# Compute Mutual Information
def safe_log2(x):
    return math.log2(x) if x > 0 else 0  # Avoid log(0)

MI = (
    P_BY * safe_log2(P_BY / (P_B * P_Y)) +
    P_BN * safe_log2(P_BN / (P_B * P_N)) +
    P_SY * safe_log2(P_SY / (P_S * P_Y)) +
    P_SN * safe_log2(P_SN / (P_S * P_N))
)

In [5]:
print(f"Mutual Information between 'director' and the document author: {MI:.4f} bits")

Mutual Information between 'director' and the document author: 0.0541 bits


## Explanation
<p><strong>Mutual Information (MI)</strong> quantifies the reduction in uncertainty about one variable (in this case the author) given knowledge of another variable (in this case the appearance of the word "director"). In other words, MI can tell us how much knowing the document author helps to predict the presence of the word "director" in a review.</p>
<p>A <strong>high MI value</strong> indicates that knowing the author gives a lot of information about whether "director" appears.<br>A <strong>low MI value</strong> indicates the presence of "director" is not strongly linked to the author.</p>
<p>Steps to compute the MI:
    <ol>
        <li>Iterate through the reviews and label which reviews are Berardinelli vs. Schwartz.</li>
        <li>Compute the missing values, i.e., the number of reviews by each author that <em>do not</em> contain the word "director".</li>
        <li>Compute the marginal probabilities of each author and of the word appearing (Yes/No), plus the joint probability of each author/appearance combination.</li>
        <li>Compute the mutual information using the MI formula.</li>
    </ol>
</p>
<p>In this exercise, the MI of 0.0541 bits means that knowing whether the word "director" appears in a review provides only a small amount of information about who wrote the review, either Berardinelli or Schwartz.</p>
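
The four steps above can be sketched as a single function over a 2x2 table of document counts. This is a minimal illustration: the function name `mutual_information_2x2` and the toy counts are assumptions, not values from the corpus.

```python
import math

def mutual_information_2x2(c_by, c_bn, c_sy, c_sn):
    """MI (in bits) between author (B/S) and word presence (Y/N) from document counts."""
    total = c_by + c_bn + c_sy + c_sn
    if total == 0:
        raise ValueError("the table must contain at least one document")
    # Joint probabilities keyed by (author, presence)
    joint = {
        ("B", "Y"): c_by / total,
        ("B", "N"): c_bn / total,
        ("S", "Y"): c_sy / total,
        ("S", "N"): c_sn / total,
    }
    # Marginal probabilities of each author and of presence/absence
    p_author = {"B": (c_by + c_bn) / total, "S": (c_sy + c_sn) / total}
    p_word = {"Y": (c_by + c_sy) / total, "N": (c_bn + c_sn) / total}

    mi = 0.0
    for (author, presence), p_joint in joint.items():
        if p_joint > 0:  # empty cells contribute zero
            mi += p_joint * math.log2(p_joint / (p_author[author] * p_word[presence]))
    return mi

# Hypothetical counts: the word is equally common for both authors
independent = mutual_information_2x2(20, 20, 30, 30)
# Hypothetical counts: the word appears in every B review and in no S review
dependent = mutual_information_2x2(50, 0, 0, 50)
print(f"independent: {independent:.4f} bits")
print(f"dependent: {dependent:.4f} bits")
```

When the word is equally likely under both authors, the MI is 0 bits. When its presence identifies the author perfectly and the authors are equally frequent, the MI is 1 bit, which is the full uncertainty about a two-way choice.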

## Problem 3
Find the <strong>Top Ten</strong> words with the highest mutual information with the document author, along with their respective mutual information values. Filter out the 'stop words'.
<p><strong>Stop Words</strong> are common words like "the", "a", "is", etc. that are often removed from text before analysis because they do not carry much meaning on their own. For this exercise, we can use the Python library NLTK (Natural Language Toolkit) to gain access to a pre-defined list of stop words.</p>
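
As a small illustration of the idea, the sketch below filters a sentence against a tiny hand-picked stop list. The names `toy_stop_words` and `tokenize_and_filter` and the example sentence are hypothetical; the exercise itself uses NLTK's much larger English list.

```python
import string

# Hypothetical, hand-picked stop list for illustration only
toy_stop_words = {"the", "a", "is", "of", "and"}

def tokenize_and_filter(text, stop_words):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [token for token in cleaned.split() if token not in stop_words]

sentence = "The director is a master of suspense, and the film shows it."
tokens = tokenize_and_filter(sentence, toy_stop_words)
print(tokens)
```

Because the list is so short, a word like "it" survives here; a fuller stop list would likely remove it as well.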

In [6]:
# Download stop words from NLTK
nltk.download('stopwords')

# Load NLTK's stop words
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\natha\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [7]:
# Function to read all reviews and process them
def load_reviews(folder_path):
    reviews = []
    for filename in sorted(os.listdir(folder_path)):  # Sort so that reviews[:80] are the first 80 files, as in Problem 2
        if filename.endswith(".txt"):
            file_path = os.path.join(folder_path, filename)
            with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
                text = file.read().lower()
                reviews.append(text)
    return reviews

In [8]:
# Preprocess the reviews (tokenize, clean, remove punctuation, and remove stop words)
def preprocess_reviews(reviews):
    word_counts = Counter()
    total_words = 0
    for review in reviews:
        # Remove punctuation
        review = review.translate(str.maketrans('', '', string.punctuation))
        words = review.split()
        # Remove stop words
        filtered_words = [word for word in words if word not in stop_words]
        total_words += len(filtered_words)
        word_counts.update(filtered_words)
    return word_counts, total_words

In [9]:
# Count words for each author (Berardinelli and Schwartz)
def count_words_for_authors(reviews):
    author_reviews = {'Berardinelli': reviews[:80], 'Schwartz': reviews[80:]}
    author_word_counts = {'Berardinelli': Counter(), 'Schwartz': Counter()}
    total_words_by_author = {'Berardinelli': 0, 'Schwartz': 0}

    for author, author_reviews_list in author_reviews.items():
        for review in author_reviews_list:
            # Remove punctuation
            review = review.translate(str.maketrans('', '', string.punctuation))
            words = review.split()
            # Remove stop words
            filtered_words = [word for word in words if word not in stop_words]
            total_words_by_author[author] += len(filtered_words)
            author_word_counts[author].update(filtered_words)

    return author_word_counts, total_words_by_author

In [10]:
# Function to calculate a mutual-information score for each word
def calculate_mutual_information(word_counts, author_word_counts, total_words_by_author, total_reviews, author_totals):
    mutual_info = {}
    total_tokens = sum(word_counts.values())  # Computed once instead of once per word
    for word in word_counts:
        P_w = word_counts[word] / total_tokens  # Probability of the word in the corpus (token level)
        for author in ['Berardinelli', 'Schwartz']:
            P_A = author_totals[author] / total_reviews  # Probability of the author (document level)
            # Probability of the word given the author (token level); used below in place of the joint P(word, author)
            P_w_A = author_word_counts[author].get(word, 0) / total_words_by_author[author]
            if P_w_A > 0:  # To avoid log(0)
                MI = P_w_A * math.log2(P_w_A / (P_w * P_A))
                mutual_info[word] = mutual_info.get(word, 0) + MI
    return mutual_info

In [11]:
# Load reviews and process
reviews = load_reviews(folder_path)
word_counts, total_words = preprocess_reviews(reviews)

# Calculate word counts for each author
author_word_counts, total_words_by_author = count_words_for_authors(reviews)
author_totals = {'Berardinelli': 80, 'Schwartz': 100}
total_reviews_count = 180  # The total number of reviews (80 from Berardinelli, 100 from Schwartz)

# Calculate mutual information for each word
mutual_info = calculate_mutual_information(word_counts, author_word_counts, total_words_by_author, total_reviews_count, author_totals)

# Get top 10 words with the highest mutual information
top_10_words = sorted(mutual_info.items(), key=lambda x: x[1], reverse=True)[:10]

In [12]:
# Display the top 10 words and their MI values
for word, mi_value in top_10_words:
    print(f"Word: '{word}', MI: {mi_value:.4f} bits")

Word: 'film', MI: 0.0323 bits
Word: 'one', MI: 0.0156 bits
Word: 'movie', MI: 0.0111 bits
Word: 'story', MI: 0.0107 bits
Word: 'even', MI: 0.0089 bits
Word: 'life', MI: 0.0077 bits
Word: 'like', MI: 0.0075 bits
Word: 'see', MI: 0.0075 bits
Word: 'schwartz', MI: 0.0069 bits
Word: 'dennis', MI: 0.0068 bits


## Explanation
<p>We can follow these steps to calculate the top ten words with the highest mutual information (MI) with the document author.</p>

<p>Steps:
    <ol>
        <li><strong>Preprocess the text</strong>:
            <ul>
                <li>Read the reviews and clean the data (i.e., lowercase, remove punctuation, etc.).</li>
                <li>Tokenize the reviews into individual words.</li>
                <li>Remove <strong>stop words</strong>.</li>
            </ul>
        </li>
        <li><strong>Count occurrences of each word for each reviewer</strong>.</li>
        <li><strong>Compute mutual information (MI) for each word</strong>:
            <ul>
                <li>Calculate how often each word appears in each author's reviews.</li>
                <li>Use the MI formula between the word and document author.</li>
            </ul>
        </li>
        <li><strong>Sort words by their MI score and select top 10</strong>.</li>
    </ol>
</p>
<p>The <strong>Top Ten</strong> words sorted by highest MI are:
    <ol>
    <li>film</li>
    <li>one</li>
    <li>movie</li>
    <li>story</li>
    <li>even</li>
    <li>life</li>
    <li>like</li>
    <li>see</li>
    <li>schwartz</li>
    <li>dennis</li>
    </ol>
</p>