# PROBLEM 1 – Representing English Text (5 pts)

Read in these two GLUE datasets (see section “DATA” above) and convert all alphabetical characters to lower case:

    •    Dataset: SST  
         Use file: SST/train.tsv  
         Notes: Use column “sentence” (ignore column “label”)

    •    Dataset: QNLI  
         Use file: QNLI/dev.tsv  
         Notes: Use column “sentence” (ignore columns “question” and “label”)
     
Convert each dataset into a single list of tokens by applying the function “word_tokenize()” from the NLTK package nltk.tokenize. We will use these lists to represent two distributions of English text.
To show you have finished this step, print the first 10 tokens from each dataset.
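
The "single list of tokens" requirement means the per-sentence token lists have to be concatenated before slicing off the first 10 tokens. A minimal sketch of that step, assuming the sentences have already been tokenized; the two sample sentences are taken from this notebook's SST output, and the names `tokenized_sentences` and `flat_tokens` are illustrative only:

```python
from itertools import chain

# Stand-in data: two already-tokenized SST sentences
tokenized_sentences = [
    ['hide', 'new', 'secretions', 'from', 'the', 'parental', 'units'],
    ['contains', 'no', 'wit', ',', 'only', 'labored', 'gags'],
]

# Concatenate the per-sentence lists into one flat list of tokens
flat_tokens = list(chain.from_iterable(tokenized_sentences))

# The first 10 tokens now run across the sentence boundary
print(flat_tokens[:10])
```

With a flat list, `flat_tokens[:10]` yields ten individual tokens rather than ten sentences.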

In [9]:
import nltk
from nltk.tokenize import word_tokenize
import pandas as pd

# Download NLTK data
nltk.download('punkt')

# Read in the SST dataset (only the "sentence" column)
sst_df = pd.read_csv('/Users/pradaapss/Desktop/Semester 3/CS 585 NLP/Assignment 1/SST-2/train.tsv', sep='\t', usecols=['sentence'])

# Read in the QNLI dataset (only the "sentence" column)
qnli_df = pd.read_csv('/Users/pradaapss/Desktop/Semester 3/CS 585 NLP/Assignment 1/QNLI/dev.tsv', sep='\t', usecols=['sentence'])

# Convert sentences to lowercase
sst_df['sentence'] = sst_df['sentence'].str.lower()
qnli_df['sentence'] = qnli_df['sentence'].str.lower()

# Tokenize the sentences (one token list per sentence; flattened in Problem 2)
sst_tokens = [word_tokenize(sentence) for sentence in sst_df['sentence']]
qnli_tokens = [word_tokenize(sentence) for sentence in qnli_df['sentence']]

# Print the first 10 entries from each dataset
# (each entry is the token list of one sentence, not a single token)
print("First 10 tokens in SST dataset:")
for sst_token in sst_tokens[:10]:
    print(sst_token)

print("\nFirst 10 tokens in QNLI dataset:")
for qnli_token in qnli_tokens[:10]:
    print(qnli_token)


[nltk_data] Downloading package punkt to /Users/pradaapss/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


First 10 tokens in SST dataset:
['hide', 'new', 'secretions', 'from', 'the', 'parental', 'units']
['contains', 'no', 'wit', ',', 'only', 'labored', 'gags']
['that', 'loves', 'its', 'characters', 'and', 'communicates', 'something', 'rather', 'beautiful', 'about', 'human', 'nature']
['remains', 'utterly', 'satisfied', 'to', 'remain', 'the', 'same', 'throughout']
['on', 'the', 'worst', 'revenge-of-the-nerds', 'clichés', 'the', 'filmmakers', 'could', 'dredge', 'up']
['that', "'s", 'far', 'too', 'tragic', 'to', 'merit', 'such', 'superficial', 'treatment']
['demonstrates', 'that', 'the', 'director', 'of', 'such', 'hollywood', 'blockbusters', 'as', 'patriot', 'games', 'can', 'still', 'turn', 'out', 'a', 'small', ',', 'personal', 'film', 'with', 'an', 'emotional', 'wallop', '.']
['of', 'saucy']
['a', 'depressed', 'fifteen-year-old', "'s", 'suicidal', 'poetry']
['are', 'more', 'deeply', 'thought', 'through', 'than', 'in', 'most', '`', 'right-thinking', "'", 'films']

First 10 tokens in QNLI dataset:

# PROBLEM 2 – Word probability (10pts)
• Write a Python function that creates a probability distribution from a list of tokens. This function should return a dictionary that maps a token to a probability (i.e., maps a string to a floating-point value).
• Apply your function to the lists created in Problem 1 to create SST and QNLI distributions.
• Show that both probability distributions sum to 1, allowing for some small numerical rounding error. Or, if they do not, add a comment in your notebook to explain why.

In [2]:
from collections import Counter

# Define function to create probability distribution
def create_probability_distribution(token_list):

  # Count token frequencies
  token_counts = Counter(token_list)
  
  # Calculate total number of tokens
  total_tokens = len(token_list)

  # Create distribution as dict of {token : probability} 
  probability_dist = {token: count/total_tokens for token, count in token_counts.items()}

  return probability_dist

# Flatten lists of tokens
sst_tokens_flat = [token for tokens in sst_tokens for token in tokens]
qnli_tokens_flat = [token for tokens in qnli_tokens for token in tokens]

# Create probability distributions
sst_dist = create_probability_distribution(sst_tokens_flat)
qnli_dist = create_probability_distribution(qnli_tokens_flat)


In [3]:
# Print SST probability distribution
print("SST Probability Distribution:")
print(sst_dist)

SST Probability Distribution:


In [4]:
# Print QNLI probability distribution
print("\nQNLI Probability Distribution:")
print(qnli_dist)


QNLI Probability Distribution:


In [5]:
# Check if both distributions sum to approximately 1
epsilon = 1e-6

if abs(sum(sst_dist.values()) - 1) < epsilon:
    print("sst_distribution sums to approximately 1")
else:
    print("sst_distribution does not sum to approximately 1")

if abs(sum(qnli_dist.values()) - 1) < epsilon:
    print("qnli_distribution sums to approximately 1")
else:
    print("qnli_distribution does not sum to approximately 1")

sst_distribution sums to approximately 1
qnli_distribution sums to approximately 1
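
The sum check can also be written with the standard library's `math.fsum` (which reduces accumulated rounding error when adding many small floats) and `math.isclose`. A small sketch on a toy token list; the tokens and the tolerance are assumptions chosen for illustration, not values from the assignment:

```python
import math
from collections import Counter

# Toy token list standing in for a real dataset
tokens = ["the", "film", "is", "good", "the", "film", "the"]

# Relative-frequency distribution: count / total
counts = Counter(tokens)
dist = {tok: n / len(tokens) for tok, n in counts.items()}

# fsum tracks partial sums exactly, so rounding error stays minimal
total = math.fsum(dist.values())
print(math.isclose(total, 1.0, rel_tol=1e-9))
```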


# PROBLEM 3 – Entropy (20pts)
• Write a Python function that computes the entropy of a random variable, input as a probability distribution.
• Use this function to compute the word-level entropy of SST and QNLI, using the distributions you created in Problem 2. Show results in your notebook.

In [6]:
import math

def compute_entropy(probability_distribution):
    return -sum(p * math.log2(p) for p in probability_distribution.values() if p > 0)

# Compute word-level entropy

# Calculate SST entropy 
sst_entropy = compute_entropy(sst_dist)

# Calculate QNLI entropy
qnli_entropy = compute_entropy(qnli_dist)

# Print entropy results  
print("Word-level entropy for SST:", sst_entropy)
print("Word-level entropy for QNLI:", qnli_entropy)

Word-level entropy for SST: 10.079162530566823
Word-level entropy for QNLI: 10.056278588664085
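
A quick way to sanity-check an entropy function is to run it on distributions whose entropy is known in closed form: a uniform distribution over V outcomes has entropy log2(V) bits, and any non-uniform distribution over the same outcomes has less. A sketch with hypothetical toy distributions (`entropy_bits` restates the formula used above so that the block runs on its own):

```python
import math

def entropy_bits(dist):
    """Shannon entropy in bits of a {outcome: probability} dictionary."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Uniform over 4 outcomes: entropy should be log2(4) = 2 bits
uniform4 = {w: 0.25 for w in ["a", "b", "c", "d"]}

# Skewed distribution over the same outcomes: entropy should be lower
skewed = {"a": 0.7, "b": 0.1, "c": 0.1, "d": 0.1}

h_uniform = entropy_bits(uniform4)
h_skewed = entropy_bits(skewed)

print(h_uniform)             # 2.0
print(h_skewed < h_uniform)  # True
```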


# PROBLEM 4 – KL Divergence (20pts)
• Write a Python function to compute the KL divergence between two probability distributions.
• Apply this function to the distributions you created in Problem 2 to show that KL divergence is not symmetric. [This is also question 2.12 of M&S, p. 79].

In [7]:
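from math import log2

# Function to compute KL divergence D(p || q), in bits, between two distributions
def kl_divergence(p, q):
    # Restrict both distributions to the tokens they share, so q[i] is never zero.
    # Note: after this filtering p and q no longer sum to 1, so the result is
    # not a true KL divergence between the full distributions.
    keys = set(p.keys()) & set(q.keys())

    # Filter distributions
    p = {i: p[i] for i in keys}
    q = {i: q[i] for i in keys}

    return sum(p[i] * log2(p[i]/q[i]) for i in p if p[i] > 0)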

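# Compute KL divergence in both directions
# (kl_divergence already restricts to common tokens, so this pre-filtering is redundant)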
common_tokens = set(sst_dist) & set(qnli_dist)
sst_filtered = {t: sst_dist[t] for t in common_tokens}
qnli_filtered = {t: qnli_dist[t] for t in common_tokens}

kl_sst_to_qnli = kl_divergence(sst_filtered, qnli_filtered)
kl_qnli_to_sst = kl_divergence(qnli_filtered, sst_filtered)

print(f"KL divergence from SST to QNLI: {kl_sst_to_qnli}")  
print(f"KL divergence from QNLI to SST: {kl_qnli_to_sst}")

KL divergence from SST to QNLI: 0.8306977014111805
KL divergence from QNLI to SST: 0.7322996980521798
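
An alternative to dropping tokens that appear in only one dataset is to smooth both distributions over the union vocabulary, so that every token has non-zero probability in both and each distribution still sums to 1. The sketch below uses add-alpha (Laplace) smoothing; this is a swapped-in technique, not the method used in the cell above, and the function names, the toy token lists, and `alpha=1.0` are all assumptions made for illustration:

```python
from collections import Counter
from math import log2

def smoothed_distribution(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed distribution over a shared vocabulary (hypothetical helper)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_divergence_smoothed(p, q):
    """D(p || q) in bits; assumes p and q share the same keys and every q[w] > 0."""
    return sum(p[w] * log2(p[w] / q[w]) for w in p if p[w] > 0)

# Toy stand-ins for the two token lists
tokens_a = ["the", "film", "is", "good", "the", "film"]
tokens_b = ["the", "city", "is", "old", "the", "river", "is", "long"]
vocab = set(tokens_a) | set(tokens_b)

p = smoothed_distribution(tokens_a, vocab)
q = smoothed_distribution(tokens_b, vocab)

kl_pq = kl_divergence_smoothed(p, q)
kl_qp = kl_divergence_smoothed(q, p)
print(kl_pq, kl_qp)  # two different non-negative values
```

Because both smoothed distributions are properly normalized over the same support, the two printed values are genuine KL divergences, and their difference illustrates the asymmetry directly.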


# PROBLEM 5 – Entropy Rate (20 pts)
• Write a Python function that computes the per-word entropy rate of a message relative to a specific probability distribution.
• Find a recent movie review online (any website) and compute the entropy rates of this movie review using the distributions you created for both SST and QNLI datasets. Show results in your notebook.

In [8]:
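def compute_entropy_rate(message, prob_dist):
  # Tokenize the lower-cased message
  tokens = word_tokenize(message.lower())
  # Look up each token's probability; unseen tokens get a small floor value (1e-10)
  token_probs = [prob_dist.get(token, 1e-10) for token in tokens]
  # Sum -p * log2(p) over the message tokens.
  # Note: the usual per-word entropy rate (cross-entropy) sums -log2(p) instead,
  # without the leading factor p, which is why the values below are so small.
  entropy = -sum(p * math.log2(p) for p in token_probs if p > 0)
  # Normalize by the number of tokens in the message
  num_tokens = len(tokens)
  entropy_rate = entropy / num_tokens
  return entropy_rate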

# Example movie review text (replace with your own)
movie_review = "This movie was amazing. The acting was superb, and the plot was captivating."

# Compute entropy rates for the movie review using SST and QNLI distributions
entropy_rate_sst = compute_entropy_rate(movie_review, sst_dist)
entropy_rate_qnli = compute_entropy_rate(movie_review, qnli_dist)

print("Entropy Rate (SST distribution):", entropy_rate_sst)
print("Entropy Rate (QNLI distribution):", entropy_rate_qnli)

Entropy Rate (SST distribution): 0.06850143684364149
Entropy Rate (QNLI distribution): 0.0867261548675276
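
For comparison, the per-word entropy rate of a message w_1 ... w_n relative to a model q is commonly defined as -(1/n) * sum of log2 q(w_i), i.e. the average number of bits the model needs per token. A minimal sketch of that definition, assuming the message is already tokenized; the function name, the `floor` value for unseen tokens, and the toy distribution are assumptions for illustration, not results from the datasets above:

```python
import math

def per_word_entropy_rate(tokens, prob_dist, floor=1e-10):
    """Average negative log2-probability per token (bits per word)."""
    if not tokens:
        raise ValueError("message must contain at least one token")
    # Unseen tokens fall back to a small floor probability instead of zero
    total_log_prob = sum(math.log2(prob_dist.get(t, floor)) for t in tokens)
    return -total_log_prob / len(tokens)

# Toy uniform model over 4 words: every token costs log2(4) = 2 bits
toy_dist = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
rate = per_word_entropy_rate(["a", "b", "a", "d"], toy_dist)
print(rate)  # 2.0
```

Under this definition a message made of tokens that are frequent in the model scores low, and each unseen token adds roughly 33 bits with the floor used here, so the choice of floor strongly affects the result.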


# PROBLEM 6 – Observed Entropy Rate (10pts – Answer in Blackboard)
Refer to your results from Problem 5. Which distribution gives you the lowest entropy rate for your
movie review? Does this match what you expected? Why or why not?

# PROBLEM 7 – Zero probabilities (10 pts – Answer in Blackboard)
Problem 5 required that you handle “zero probabilities” cases, where a token occurred in one dataset but not the other. How did you handle these tokens? (Hint: Dropping the word from both probability distributions is not an ideal solution).