Stefan Pophristic + Boxuan Li
May 1st, 2025
Information Theory Project

After our meeting with Noga on April 30th, we agreed to first implement a simpler information bottleneck calculation to test whether Mandarin classifiers are optimal.

The general IB formula is:
$$
  \min \; I(X;T) - \beta I(T;Y)
$$

Noga suggested that we quantify our analysis as follows:
$$
I_{q}(N;W) + \beta \mathbb{E}[d(N; N_{w})]
$$

N = the set of all nouns that a measure word (MW) can refer to
W = the set of all MWs


**In this script we quantify the first term**

**The First Term (Input Term)**
The mutual information between N and W can be computed using the standard mutual information formula:
$$
I_{q}(N;W) = \sum_{n\in N}\sum_{w \in W} p(n,w) \log\frac{p(n,w)}{p(n)p(w)}
$$
$$
p(n, w) = p(n|w)p(w)
$$

p(n) = probability of a noun in the corpus
p(w) = probability of a MW in the corpus
p(n|w) = probability of the noun given a MW
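
As a quick sanity check on the formula above, here is a minimal sketch that computes $I(N;W)$ from a toy table of (noun, MW) counts using only the standard library. The counts and the helper name `mutual_information` are made up for illustration and are not taken from our corpus.

```python
import math
from collections import defaultdict

def mutual_information(pair_counts):
    """Return I(N;W) in bits from a dict {(noun, mw): count}."""
    total = sum(pair_counts.values())
    noun_totals = defaultdict(int)
    mw_totals = defaultdict(int)
    for (noun, mw), count in pair_counts.items():
        noun_totals[noun] += count
        mw_totals[mw] += count

    mi = 0.0
    for (noun, mw), count in pair_counts.items():
        if count == 0:
            continue  # by convention, 0 * log(0) contributes 0
        p_nw = count / total              # p(n, w)
        p_n = noun_totals[noun] / total   # p(n)
        p_w = mw_totals[mw] / total       # p(w)
        mi += p_nw * math.log2(p_nw / (p_n * p_w))
    return mi

# Hypothetical toy counts: each noun always takes the same MW
toy_counts = {("书", "本"): 2, ("狗", "只"): 2}
print(mutual_information(toy_counts))  # 1.0 (bits)
```

In this toy case the MW fully determines the noun, so the mutual information equals the entropy of a fair two-way choice, 1 bit. If nouns and MWs were independent, the same function would return 0.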


**The Second Term (Output Term)**
Also known as the reconstruction error. 

$N_{w}$ = The set of centroids in the semantic vector space of all nouns, grouped by MW

$d$ is a function that measures the reconstruction error. It could be something like KL divergence or mean squared error. In our case, we will just use the cosine distance (i.e. one minus the cosine similarity in semantic vector space) between a noun and the centroid $N_{w}$ of its associated MW.
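
To make the second term concrete, here is a rough sketch of how $\mathbb{E}[d(N; N_{w})]$ could be computed. Everything in it is an assumption for illustration: the 2-d vectors, the joint probabilities, taking $d$ as cosine distance (1 minus cosine similarity), and weighting each centroid by $p(n|w)$ rather than taking an unweighted mean.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance = 1 - cosine similarity."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical 2-d "semantic vectors"; real embeddings have many more dimensions
noun_vectors = {
    "书": np.array([1.0, 0.0]),
    "杂志": np.array([0.0, 1.0]),
    "狗": np.array([1.0, 1.0]),
}
# Hypothetical joint probabilities p(n, w); they sum to 1
p_nw = {("书", "本"): 0.25, ("杂志", "本"): 0.25, ("狗", "只"): 0.5}

# p(w): total probability mass of each MW
mw_mass = {}
for (noun, mw), p in p_nw.items():
    mw_mass[mw] = mw_mass.get(mw, 0.0) + p

# Centroid N_w (assumed): p(n|w)-weighted mean of the vectors of nouns used with w
centroids = {}
for (noun, mw), p in p_nw.items():
    centroids[mw] = centroids.get(mw, 0.0) + (p / mw_mass[mw]) * noun_vectors[noun]

# Expected distortion E[d(N, N_w)] under p(n, w)
expected_distortion = sum(
    p * cosine_distance(noun_vectors[noun], centroids[mw])
    for (noun, mw), p in p_nw.items()
)
print(f"E[d] = {expected_distortion:.4f}")
```

A MW whose nouns all point in the same direction contributes no distortion, while a MW that groups dissimilar nouns (here 本) contributes a positive amount.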

# Parameters

In [1]:
import os
import pandas as pd
import requests
import matplotlib.pyplot as plt
import seaborn as sns

# For displaying Chinese characters properly
plt.rcParams['font.family'] = ['Arial Unicode MS', 'SimHei', 'sans-serif']

In [2]:
# Import MW and Noun combinations from the corpus

df = pd.read_csv("../chinese_noun_mw.csv")

In [3]:
df.head()

Unnamed: 0,Noun,MW,Count_Pre,Count_Post
0,上午,日,1,0
1,下半,局,1,0
2,下旬,年,1,0
3,下旬,月,2,0
4,下颌,个,1,0


In [4]:
token_count = df["Count_Pre"].sum()
print(f"Total number of MW + Noun tokens: {token_count}")

MW_count = df["MW"].nunique()
print(f"Total number of unique MW: {MW_count}")

N_count = df["Noun"].nunique()
print(f"Total number of unique Nouns: {N_count}")


Total number of MW + Noun tokens: 2289
Total number of unique MW: 110
Total number of unique Nouns: 847


# Quantify Probabilities [p(n), p(w), p(n|w)]

In [None]:
# Create and save dataframe with all 

In [18]:
# Calculate p(n) for all n

# Create dataframe with all N
df_N = pd.DataFrame(df["Noun"], columns=["Noun"])


# Get N count from corpus 



Unnamed: 0,Noun
0,上午
1,下半
2,下旬
3,下旬
4,下颌


In [13]:
# Calculate p(n), p(w), p(n|w)

# Import necessary libraries
import pandas as pd
import numpy as np

# First, load the dataset
df = pd.read_csv("/Volumes/server/SHARED/Corpora/Universal_Dependencies/2025_InformationTheory_Project/chinese_noun_mw.csv")

# Get the total number of tokens (using Count_Pre as specified)
total_tokens = df["Count_Pre"].sum()

# Calculate p(n) for each noun
# Group by Noun and sum the Count_Pre for each noun
noun_counts = df.groupby("Noun")["Count_Pre"].sum().reset_index()
noun_counts["P(n)"] = noun_counts["Count_Pre"] / total_tokens

# Calculate p(w) for each measure word
# Group by MW and sum the Count_Pre for each measure word
mw_counts = df.groupby("MW")["Count_Pre"].sum().reset_index()
mw_counts["P(w)"] = mw_counts["Count_Pre"] / total_tokens

# Create a new dataframe to store our results
result_df = df.copy()

# For each row, calculate p(n|w)
# Get total occurrences for each measure word
mw_totals = df.groupby("MW")["Count_Pre"].sum().to_dict()

# Add p(n|w) to the result dataframe - return 0 when the MW's total Count_Pre is 0
def calculate_p_n_given_w(row):
    if mw_totals[row["MW"]] == 0:
        return 0
    return row["Count_Pre"] / mw_totals[row["MW"]]

result_df["P(n|w)"] = result_df.apply(calculate_p_n_given_w, axis=1)

# Merge with P(n) values
result_df = result_df.merge(noun_counts[["Noun", "P(n)"]], on="Noun")

# Merge with P(w) values
result_df = result_df.merge(mw_counts[["MW", "P(w)"]], on="MW")

# Select and reorder the columns for the final output
# (.copy() so that columns can be added later without a SettingWithCopyWarning)
final_df = result_df[["Noun", "MW", "Count_Pre", "P(n)", "P(w)", "P(n|w)"]].copy()

# Save the final dataframe to CSV
final_df.to_csv("/Volumes/server/SHARED/Corpora/Universal_Dependencies/2025_InformationTheory_Project/noun_mw_probabilities.csv", index=False, encoding="utf-8")
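
The same three probabilities can also be computed without a row-wise `apply` or merges. The sketch below uses `groupby(...).transform("sum")` on a small made-up table in the same column layout as `chinese_noun_mw.csv`; the rows are hypothetical, and this is an alternative formulation rather than the code that produced the saved CSV.

```python
import pandas as pd

# Hypothetical toy rows in the same layout as chinese_noun_mw.csv
toy = pd.DataFrame({
    "Noun": ["书", "书", "狗", "猫"],
    "MW": ["本", "个", "只", "只"],
    "Count_Pre": [3, 1, 2, 2],
})
total = toy["Count_Pre"].sum()

# transform("sum") broadcasts each group's total back onto its own rows
noun_total = toy.groupby("Noun")["Count_Pre"].transform("sum")
mw_total = toy.groupby("MW")["Count_Pre"].transform("sum")

toy["P(n)"] = noun_total / total
toy["P(w)"] = mw_total / total
# A MW whose total is 0 would give NaN (0/0); map that to 0,
# matching the behaviour of calculate_p_n_given_w
toy["P(n|w)"] = (toy["Count_Pre"] / mw_total).fillna(0.0)
print(toy)
```

Because every row keeps its own group totals, the per-MW sums of `P(n|w)` can be checked directly with `toy.groupby("MW")["P(n|w)"].sum()`.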


In [12]:
def lookup_mw_noun_stats(noun, mw, probability_df, noun_counts_df, mw_counts_df):
    """
    Look up statistics for a specific noun and measure word combination.
    
    Parameters:
    -----------
    noun : str
        The noun to look up
    mw : str
        The measure word to look up
    probability_df : pandas.DataFrame
        DataFrame containing the probabilities
    noun_counts_df : pandas.DataFrame
        DataFrame containing the noun counts
    mw_counts_df : pandas.DataFrame
        DataFrame containing the measure word counts
        
    Returns:
    --------
    dict
        Dictionary with the count and probabilities for the noun-measure word pair
    """
    # Try to find the specific noun and measure word combination
    result = probability_df[(probability_df["Noun"] == noun) & (probability_df["MW"] == mw)]
    
    if len(result) == 0:
        print(f"No data found for noun '{noun}' with measure word '{mw}'")
        return None
    
    # Get total count for the noun
    noun_total = noun_counts_df[noun_counts_df["Noun"] == noun]["Count_Pre"].values[0]
    
    # Get total count for the measure word
    mw_total = mw_counts_df[mw_counts_df["MW"] == mw]["Count_Pre"].values[0]
    
    # Extract the values
    row = result.iloc[0]
    
    # Create a dictionary with the statistics
    stats = {
        "Noun": noun,
        "Measure Word": mw,
        "Count": row["Count_Pre"],
        "Total Noun Count": noun_total,
        "Total MW Count": mw_total,
        "P(n)": row["P(n)"],
        "P(w)": row["P(w)"],
        "P(n|w)": row["P(n|w)"]
    }
    
    # Print the statistics in a readable format
    print(f"Statistics for noun '{noun}' with measure word '{mw}':")
    print(f"Count of this combination: {stats['Count']}")
    print(f"Total count of noun '{noun}': {stats['Total Noun Count']}")
    print(f"Total count of measure word '{mw}': {stats['Total MW Count']}")
    print(f"P(n): {stats['P(n)']:.6f}")
    print(f"P(w): {stats['P(w)']:.6f}")
    print(f"P(n|w): {stats['P(n|w)']:.6f}")
    
    return stats

noun = "上午"
mw = "日"
stats = lookup_mw_noun_stats(noun, mw, final_df, noun_counts, mw_counts)

Statistics for noun '上午' with measure word '日':
Count of this combination: 1
Total count of noun '上午': 1
Total count of measure word '日': 8
P(n): 0.000437
P(w): 0.003495
P(n|w): 0.125000


In [9]:
# Verification and summary
# Verify that p(n) sums to 1 (or very close due to floating point)
p_n_sum = noun_counts["P(n)"].sum()
print(f"\nSum of all P(n): {p_n_sum:.6f}")

# Verify that p(w) sums to 1
p_w_sum = mw_counts["P(w)"].sum()
print(f"Sum of all P(w): {p_w_sum:.6f}")

# For each MW, verify that sum of p(n|w) = 1 for a few sample MWs
sample_mws = final_df["MW"].unique()[:5]  # Take first 5 MWs as samples
for mw in sample_mws:
    p_n_given_w_sum = final_df[final_df["MW"] == mw]["P(n|w)"].sum()
    print(f"Sum of P(n|w) for MW '{mw}': {p_n_given_w_sum:.6f}")

# Statistics on distributions
print("\nStatistics for P(n):")
print(noun_counts["P(n)"].describe())

print("\nStatistics for P(w):")
print(mw_counts["P(w)"].describe())

print("\nStatistics for P(n|w):")
print(final_df["P(n|w)"].describe())

# Calculate the mutual information I(N;W)
# Joint probability p(n,w)
final_df["P(n,w)"] = final_df["Count_Pre"] / total_tokens

# Log term for mutual information
# Add a small epsilon so rows with P(n,w) = 0 give a finite log; those rows
# still contribute 0 once multiplied by P(n,w) below
epsilon = 1e-10
final_df["log_term"] = np.log2((final_df["P(n,w)"] + epsilon) / 
                               (final_df["P(n)"] * final_df["P(w)"] + epsilon))

# Contribution to mutual information
final_df["MI_contribution"] = final_df["P(n,w)"] * final_df["log_term"]

# Total mutual information
mutual_info = final_df["MI_contribution"].sum()
print(f"\nMutual Information I(N;W): {mutual_info:.6f} bits")

# Save extended dataframe with MI calculations
# (the copy already contains the P(n,w), log_term and MI_contribution columns)
final_df_extended = final_df.copy()
final_df_extended.to_csv("noun_mw_probabilities_with_MI.csv", index=False, encoding="utf-8")


Sum of all P(n): 1.000000
Sum of all P(w): 1.000000
Sum of P(n|w) for MW '日': 1.000000
Sum of P(n|w) for MW '局': 1.000000
Sum of P(n|w) for MW '年': 1.000000
Sum of P(n|w) for MW '月': 1.000000
Sum of P(n|w) for MW '个': 1.000000

Statistics for P(n):
count    847.000000
mean       0.001181
std        0.009914
min        0.000000
25%        0.000437
50%        0.000437
75%        0.000874
max        0.278724
Name: P(n), dtype: float64

Statistics for P(w):
count    110.000000
mean       0.009091
std        0.034838
min        0.000000
25%        0.000437
50%        0.001311
75%        0.003495
max        0.238969
Name: P(w), dtype: float64

Statistics for P(n|w):
count    1006.000000
mean        0.108350
std         0.224800
min         0.000000
25%         0.001934
50%         0.015504
75%         0.090909
max         1.000000
Name: P(n|w), dtype: float64

Mutual Information I(N;W): 3.368736 bits
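
One way to cross-check a mutual information value is to compute it twice: directly over the pairs with $p(n,w) > 0$ (so no epsilon is needed), and through the identity $I(N;W) = H(N) + H(W) - H(N,W)$. The sketch below does this on made-up counts, assuming one row per unique (noun, MW) pair; the helper `entropy_bits` is hypothetical. On the real data, the epsilon-free value may differ slightly from the one printed above.

```python
import numpy as np
import pandas as pd

def entropy_bits(p):
    """Shannon entropy in bits, ignoring zero-probability entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Hypothetical toy counts, one row per unique (noun, MW) pair
toy = pd.DataFrame({
    "Noun": ["书", "书", "狗", "猫", "猫"],
    "MW": ["本", "个", "只", "只", "个"],
    "Count_Pre": [3, 1, 2, 2, 0],
})
total = toy["Count_Pre"].sum()
p_nw = toy["Count_Pre"] / total
p_n = toy.groupby("Noun")["Count_Pre"].transform("sum") / total
p_w = toy.groupby("MW")["Count_Pre"].transform("sum") / total

# Direct sum restricted to pairs with p(n, w) > 0
pos = p_nw > 0
mi_direct = float((p_nw[pos] * np.log2(p_nw[pos] / (p_n[pos] * p_w[pos]))).sum())

# Same quantity through entropies: I(N;W) = H(N) + H(W) - H(N,W)
h_n = entropy_bits(toy.groupby("Noun")["Count_Pre"].sum() / total)
h_w = entropy_bits(toy.groupby("MW")["Count_Pre"].sum() / total)
h_nw = entropy_bits(p_nw)
mi_entropy = h_n + h_w - h_nw

print(f"direct: {mi_direct:.6f} bits, via entropies: {mi_entropy:.6f} bits")
```

The two numbers should agree up to floating-point error. Since $I(N;W) \le \min(H(N), H(W))$, comparing the result against `h_n` and `h_w` gives a second check on its plausibility.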
