
# Semantic Similarity with Chatterjee's ξ and BERT Embeddings

This notebook demonstrates how to use **Chatterjee's rank-based correlation coefficient** (ξ) as a similarity measure for sentence embeddings. Cosine similarity measures the angle between two vectors, whereas ξ is computed from ranks and can detect nonlinear, even non-monotonic, functional relationships. Here ξ is applied to the components of two embedding vectors, treating each dimension as one paired observation. We'll compute ξ alongside cosine similarity for a set of sentence pairs using a lightweight BERT-based sentence embedding model from the [`sentence-transformers`](https://github.com/UKPLab/sentence-transformers) library.

We will:

1. Install required libraries.
2. Define a function to compute Chatterjee's ξ for two vectors.
3. Create a small dataset of semantically *similar* and *unrelated* sentence pairs.
4. Obtain embeddings using a pretrained model (e.g., `all-MiniLM-L6-v2`).
5. Compute cosine similarity and ξ for each pair.
6. Compare the results and discuss the findings.

Feel free to modify the dataset or the model to experiment with other examples.
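
For reference, the statistic used in step 2 can be written out explicitly. This is a sketch of the no-ties form of Chatterjee's coefficient; the symbols $x_{(i)}$, $y_{(i)}$, and $r_i$ are notation introduced here for illustration and do not appear in the code.

$$
\xi_n(x, y) = 1 - \frac{3 \sum_{i=1}^{n-1} \lvert r_{i+1} - r_i \rvert}{n^2 - 1}
$$

The pairs are first reordered so that $x_{(1)} \le \dots \le x_{(n)}$, and $r_i$ is the rank of $y_{(i)}$ among all $y$ values. For sentence embeddings, $n$ is the embedding dimension. Values near 0 indicate no detectable dependence, and values near 1 indicate that $y$ is close to a function of $x$.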


In [1]:

# Install sentence-transformers (includes transformers and torch)
!pip install -q sentence-transformers

In [2]:

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Define Chatterjee's ξ for two 1D numpy arrays

def chatterjee_xi(x: np.ndarray, y: np.ndarray) -> float:
    """Compute Chatterjee's rank correlation coefficient ξ for two vectors.

    Parameters
    ----------
    x, y : np.ndarray
        1D arrays of equal length.
    Returns
    -------
    float
        The ξ value.
    """
    if x.ndim != 1 or y.ndim != 1:
        raise ValueError("x and y must be 1D arrays")
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    # Sort by x and get the ordering of y
    sorted_idx = np.argsort(x)
    y_sorted = y[sorted_idx]
    # Rank y_sorted from 1 to n. Double argsort gives tied values distinct,
    # arbitrary ranks (not averaged ranks), so this formula assumes no ties in y.
    ranks = np.argsort(np.argsort(y_sorted)) + 1  # ranks from 1 to n
    # Compute successive absolute rank differences
    diff = np.abs(np.diff(ranks))
    xi = 1 - (3 * np.sum(diff)) / (n ** 2 - 1)
    return xi
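
The function above uses the simplified formula, which is exact only when `y` has no repeated values. Float embeddings rarely contain ties, but quantized or sparse vectors can. Below is a minimal sketch of the tie-aware form of ξ as I understand it from Chatterjee's definition; `chatterjee_xi_ties` is a hypothetical helper name, and breaking ties in `x` by input order (rather than at random) is a simplifying assumption.

```python
import numpy as np

def chatterjee_xi_ties(x, y):
    """Tie-aware Chatterjee's xi (hypothetical helper, not part of the notebook)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or y.ndim != 1 or len(x) != len(y):
        raise ValueError("x and y must be 1D arrays of equal length")
    n = len(x)
    # Order the pairs by x; a stable sort keeps tied x values in input order
    # (assumption: the original definition breaks x ties at random instead).
    y_ord = y[np.argsort(x, kind="stable")]
    y_sorted = np.sort(y_ord)
    # r[i] = number of j with y[j] <= y[i]; l[i] = number of j with y[j] >= y[i]
    r = np.searchsorted(y_sorted, y_ord, side="right")
    l = n - np.searchsorted(y_sorted, y_ord, side="left")
    denom = 2.0 * np.sum(l * (n - l))
    if denom == 0:  # y is constant, so xi is undefined
        return float("nan")
    return float(1.0 - n * np.sum(np.abs(np.diff(r))) / denom)

xi_no_ties = chatterjee_xi_ties(np.arange(10), np.arange(10) ** 2)
xi_with_ties = chatterjee_xi_ties([0, 1, 2, 3], [0, 0, 1, 1])
print(xi_no_ties, xi_with_ties)
```

When `y` has no ties, the denominator reduces to `n * (n**2 - 1) / 3`, so this variant agrees with `chatterjee_xi`.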


In [3]:

# Define a list of (sentence1, sentence2, label) tuples.
# label = 1 for similar pairs, 0 for unrelated pairs.
sentence_pairs = [
    ("The quick brown fox jumps over the lazy dog.",
     "A swift auburn fox leaps above a sleepy canine.", 1),
    ("A man is playing guitar on stage.",
     "Someone is strumming a musical instrument in front of an audience.", 1),
    ("The capital of France is Paris.",
     "Paris is the capital city of France.", 1),
    ("Ice cream tastes delicious on a hot day.",
     "Eating frozen dessert is enjoyable when it's warm outside.", 1),
    ("The stock market crashed causing panic.",
     "An octopus is swimming in the ocean.", 0),
    ("A student is studying mathematics.",
     "Fish live in the coral reef.", 0),
    ("She went shopping for a new dress.",
     "The earth revolves around the sun.", 0),
    ("He is writing code in Python.",
     "The flowers bloom in spring.", 0)
]

# Flatten sentences for embedding
document_list = [s for pair in sentence_pairs for s in pair[:2]]

print(f"Total sentences for embedding: {len(document_list)}")


Total sentences for embedding: 16


In [4]:

# Load a lightweight pre-trained sentence transformer model
# This model is small (~80MB) and works well for semantic similarity tasks.
model_name = 'all-MiniLM-L6-v2'
model = SentenceTransformer(model_name)

# Encode the sentences into embeddings
embeddings = model.encode(document_list, convert_to_numpy=True, show_progress_bar=True)

# Sanity check: rows 2*i and 2*i + 1 hold the two sentences of pair i
assert embeddings.shape[0] == len(sentence_pairs) * 2
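
Because the sentences were flattened pair by pair, the embedding matrix can also be regrouped with a reshape instead of index arithmetic. The sketch below uses random placeholder vectors in place of `model.encode` output so it runs without downloading the model; the 384-dimensional size matches `all-MiniLM-L6-v2`, and the variable names `paired`, `first`, and `second` are illustrative.

```python
import numpy as np

# Stand-in for model.encode output: 8 pairs -> 16 rows of 384-dim vectors.
# The random values are placeholders, not real sentence embeddings.
rng = np.random.default_rng(0)
n_pairs, dim = 8, 384
embeddings = rng.standard_normal((2 * n_pairs, dim))

# Group consecutive rows: paired[i, 0] and paired[i, 1] are the two sides of pair i
paired = embeddings.reshape(n_pairs, 2, dim)
first, second = paired[:, 0, :], paired[:, 1, :]

# Row-wise cosine similarity for all pairs at once
cos = np.sum(first * second, axis=1) / (
    np.linalg.norm(first, axis=1) * np.linalg.norm(second, axis=1)
)
print(cos.shape)
```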




In [6]:

from pandas import DataFrame

# Prepare results
results = []

for i, (sent1, sent2, label) in enumerate(sentence_pairs):
    emb1 = embeddings[2*i]
    emb2 = embeddings[2*i + 1]
    # Cosine similarity
    cos_sim = cosine_similarity([emb1], [emb2])[0][0]
    # Compute xi on the component vectors
    xi_val = chatterjee_xi(emb1, emb2)
    results.append({
        'Sentence 1': sent1,
        'Sentence 2': sent2,
        'Label (1=similar)': label,
        'Cosine similarity': cos_sim,
        'Xi similarity': xi_val
    })

# Convert to DataFrame for pretty display
results_df = DataFrame(results)

print("Similarity results:")
results_df


Similarity results:


| | Sentence 1 | Sentence 2 | Label (1=similar) | Cosine similarity | Xi similarity |
|---|---|---|---|---|---|
| 0 | The quick brown fox jumps over the lazy dog. | A swift auburn fox leaps above a sleepy canine. | 1 | 0.704137 | 0.278763 |
| 1 | A man is playing guitar on stage. | Someone is strumming a musical instrument in f... | 1 | 0.473693 | 0.124099 |
| 2 | The capital of France is Paris. | Paris is the capital city of France. | 1 | 0.96976 | 0.760212 |
| 3 | Ice cream tastes delicious on a hot day. | Eating frozen dessert is enjoyable when it's w... | 1 | 0.621089 | 0.243464 |
| 4 | The stock market crashed causing panic. | An octopus is swimming in the ocean. | 0 | 0.015969 | -0.040941 |
| 5 | A student is studying mathematics. | Fish live in the coral reef. | 0 | 0.021841 | -0.001085 |
| 6 | She went shopping for a new dress. | The earth revolves around the sun. | 0 | 0.018538 | -0.010424 |
| 7 | He is writing code in Python. | The flowers bloom in spring. | 0 | -0.041176 | -0.02261 |
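
One property to keep in mind when reading the ξ column: unlike cosine similarity, ξ is not symmetric, so `chatterjee_xi(emb1, emb2)` and `chatterjee_xi(emb2, emb1)` generally differ. The sketch below illustrates this on synthetic data; the compact `xi` function repeats the notebook's no-ties formula so the example runs on its own, and `xi_symmetric` is a hypothetical helper showing one possible way to symmetrize the score.

```python
import numpy as np

def xi(x, y):
    """Same no-ties formula as chatterjee_xi, repeated so this block runs alone."""
    ranks = np.argsort(np.argsort(y[np.argsort(x)])) + 1
    n = len(x)
    return 1 - 3 * np.sum(np.abs(np.diff(ranks))) / (n ** 2 - 1)

def xi_symmetric(x, y):
    """Hypothetical symmetrized score: the larger of the two directions."""
    return max(xi(x, y), xi(y, x))

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
y = x ** 2  # y is a function of x, but x is not a function of y

xi_xy, xi_yx = xi(x, y), xi(y, x)
print(f"xi(x, y) = {xi_xy:.3f}, xi(y, x) = {xi_yx:.3f}")
print(f"symmetrized = {xi_symmetric(x, y):.3f}")
```

Whether to symmetrize depends on the application; for retrieval, where query and document play different roles, the directional score may be acceptable.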


In [7]:

# Compute average similarities for similar and unrelated groups
similar_df = results_df[results_df['Label (1=similar)'] == 1]
unrelated_df = results_df[results_df['Label (1=similar)'] == 0]

avg_cos_similar = similar_df['Cosine similarity'].mean()
avg_cos_unrelated = unrelated_df['Cosine similarity'].mean()
avg_xi_similar = similar_df['Xi similarity'].mean()
avg_xi_unrelated = unrelated_df['Xi similarity'].mean()

print(f"Average cosine similarity (similar pairs): {avg_cos_similar:.3f}")
print(f"Average cosine similarity (unrelated pairs): {avg_cos_unrelated:.3f}")

print(f"Average xi similarity (similar pairs): {avg_xi_similar:.3f}")
print(f"Average xi similarity (unrelated pairs): {avg_xi_unrelated:.3f}")

# Optional: find the xi threshold that best separates the two groups.
# Note: the threshold is chosen and evaluated on the same 8 pairs.
thresholds = np.linspace(results_df['Xi similarity'].min(), results_df['Xi similarity'].max(), 50)
best_acc = 0
best_thresh = None
labels_array = results_df['Label (1=similar)'].values
xi_values = results_df['Xi similarity'].values
for thresh in thresholds:
    preds = (xi_values > thresh).astype(int)
    acc = np.mean(preds == labels_array)
    if acc > best_acc:
        best_acc = acc
        best_thresh = thresh

print(f"Best classification threshold for xi: {best_thresh:.3f}, accuracy: {best_acc:.3f}")


Average cosine similarity (similar pairs): 0.692
Average cosine similarity (unrelated pairs): 0.004
Average xi similarity (similar pairs): 0.352
Average xi similarity (unrelated pairs): -0.019
Best classification threshold for xi: 0.008, accuracy: 1.000
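
The sweep above is run only for ξ. To compare the two metrics on equal footing, the same sweep can be applied to cosine similarity. The sketch below hard-codes the scores from the results table so it runs without the model; `best_threshold` is a hypothetical helper that mirrors the loop in the cell above.

```python
import numpy as np

# Scores copied from the results table above (rows 0-7)
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
cosine = np.array([0.704137, 0.473693, 0.96976, 0.621089,
                   0.015969, 0.021841, 0.018538, -0.041176])
xi = np.array([0.278763, 0.124099, 0.760212, 0.243464,
               -0.040941, -0.001085, -0.010424, -0.02261])

def best_threshold(scores, labels, n_steps=50):
    """Hypothetical helper: sweep thresholds and return (threshold, accuracy)."""
    best_thresh, best_acc = None, 0.0
    for thresh in np.linspace(scores.min(), scores.max(), n_steps):
        acc = np.mean((scores > thresh).astype(int) == labels)
        if acc > best_acc:
            best_thresh, best_acc = thresh, acc
    return best_thresh, best_acc

for name, scores in [("cosine", cosine), ("xi", xi)]:
    thresh, acc = best_threshold(scores, labels)
    print(f"{name}: threshold={thresh:.3f}, accuracy={acc:.3f}")
```

Both metrics separate these eight pairs perfectly, so this sample is too small and too easy to show an advantage for either one.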


In [8]:
# Synthetic demonstration of xi vs. cosine on nonlinear functions
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

def synthetic_experiment(dim=500, reps=5):
    """Run a synthetic experiment comparing cosine and xi on various relationships."""
    results = []
    for _ in range(reps):
        x = np.random.randn(dim)
        # Define different functional relationships
        y_linear   = x + 0.05 * np.random.randn(dim)  # noisy linear
        y_square   = x ** 2                           # nonlinear, non-monotonic (even)
        y_absolute = np.abs(x)                        # nonlinear, non-monotonic (even)
        y_random   = np.random.randn(dim)             # independent
        for name, y in [('linear', y_linear),
                        ('quadratic', y_square),
                        ('absolute', y_absolute),
                        ('random', y_random)]:
            cos_val = cosine_similarity([x], [y])[0][0]
            xi_val  = chatterjee_xi(x, y)
            results.append({
                'relation': name,
                'cosine': cos_val,
                'xi': xi_val
            })
    return results

# Run the experiment and display the mean results by relation type
synth_results = synthetic_experiment()
df_synth = pd.DataFrame(synth_results)
display(df_synth.groupby('relation').agg({'cosine':'mean', 'xi':'mean'}))


| relation | cosine | xi |
|---|---|---|
| absolute | -0.017867 | 0.988038 |
| linear | 0.998765 | 0.948493 |
| quadratic | -0.001125 | 0.988038 |
| random | -0.019399 | -0.010075 |
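
The `absolute` and `quadratic` rows report the same ξ value. That follows from ξ depending only on the ranks of `y`: since `x ** 2` is an increasing function of `np.abs(x)`, both produce the same ranks. The sketch below checks this directly. The compact `xi` function repeats the notebook's formula so the block runs alone, and the fixed seed is an addition for reproducibility (the cell above is unseeded).

```python
import numpy as np

def xi(x, y):
    """Same no-ties formula as chatterjee_xi, repeated so this block runs alone."""
    ranks = np.argsort(np.argsort(y[np.argsort(x)])) + 1
    n = len(x)
    return 1 - 3 * np.sum(np.abs(np.diff(ranks))) / (n ** 2 - 1)

rng = np.random.default_rng(42)  # fixed seed so the run is reproducible
x = rng.standard_normal(500)

xi_abs = xi(x, np.abs(x))
xi_sq = xi(x, x ** 2)
xi_exp = xi(x, np.exp(np.abs(x)))  # another increasing transform of |x|
print(xi_abs, xi_sq, xi_exp)
```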


In [9]:
# Demonstration of xi on negation/paraphrase pairs
test_pairs = [
    ("He is happy.", "He is not unhappy."),
    ("She likes cats.", "She does not dislike cats."),
    ("It is raining heavily.", "It isn't sunny outside."),
    ("The team won the match.", "The match wasn't lost by the team."),
]

for s1, s2 in test_pairs:
    emb1 = model.encode(s1, convert_to_numpy=True)
    emb2 = model.encode(s2, convert_to_numpy=True)
    cos_val = cosine_similarity([emb1], [emb2])[0][0]
    xi_val  = chatterjee_xi(emb1, emb2)
    print(f'Pair: "{s1}" vs. "{s2}"\n  Cosine similarity: {cos_val:.3f}\n  Xi similarity: {xi_val:.3f}\n')


Pair: "He is happy." vs. "He is not unhappy."
  Cosine similarity: 0.693
  Xi similarity: 0.308

Pair: "She likes cats." vs. "She does not dislike cats."
  Cosine similarity: 0.740
  Xi similarity: 0.362

Pair: "It is raining heavily." vs. "It isn't sunny outside."
  Cosine similarity: 0.434
  Xi similarity: 0.142

Pair: "The team won the match." vs. "The match wasn't lost by the team."
  Cosine similarity: 0.723
  Xi similarity: 0.333


In [10]:
# Simple RAG-style retrieval demonstration using cosine and xi

# Define a small set of knowledge base documents
docs = [
    "The stock price increased significantly during the last quarter.",
    "She enjoys playing tennis on weekends.",
    "Rainfall has been heavy in the northern regions.",
    "The patient is not unhappy with the treatment.",
    "Wildflowers bloom beautifully in spring."
]

# Define a set of queries and the index of the document that should be most relevant
# The second element of each tuple is the target document's index in the docs list.
queries = [
    ("The patient is happy with the treatment.", 3),  # paraphrase/negation relationship to doc 3
    ("Share prices rose a lot in the previous quarter.", 0)  # paraphrase of doc 0
]

# Encode the documents once
doc_embeddings = model.encode(docs, convert_to_numpy=True)

for query_text, target_idx in queries:
    query_embedding = model.encode(query_text, convert_to_numpy=True)
    cosine_scores = []
    xi_scores = []

    # Compute both similarities for every document
    for doc_embedding in doc_embeddings:
        cos_score = cosine_similarity([query_embedding], [doc_embedding])[0][0]
        xi_score  = chatterjee_xi(query_embedding, doc_embedding)
        cosine_scores.append(cos_score)
        xi_scores.append(xi_score)

    # Obtain rankings (highest score first)
    cosine_ranking = sorted(range(len(docs)), key=lambda i: cosine_scores[i], reverse=True)
    xi_ranking     = sorted(range(len(docs)), key=lambda i: xi_scores[i], reverse=True)

    print(f"\nQuery: {query_text}")
    print("Cosine scores:", [f"{s:.3f}" for s in cosine_scores])
    print("Xi scores:    ", [f"{s:.3f}" for s in xi_scores])
    print("Cosine ranking (best to worst):", cosine_ranking,
          "— target position:", cosine_ranking.index(target_idx))
    print("Xi ranking (best to worst):    ", xi_ranking,
          "— target position:", xi_ranking.index(target_idx))





Query: The patient is happy with the treatment.
Cosine scores: ['0.010', '0.138', '-0.012', '0.678', '-0.010']
Xi scores:     ['0.013', '0.018', '0.008', '0.298', '0.015']
Cosine ranking (best to worst): [3, 1, 0, 4, 2] — target position: 0
Xi ranking (best to worst):     [3, 1, 4, 0, 2] — target position: 0

Query: Share prices rose a lot in the previous quarter.
Cosine scores: ['0.737', '-0.026', '0.074', '-0.095', '0.068']
Xi scores:     ['0.308', '0.013', '0.002', '0.046', '0.051']
Cosine ranking (best to worst): [0, 2, 4, 1, 3] — target position: 0
Xi ranking (best to worst):     [0, 4, 3, 1, 2] — target position: 0
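
Both metrics place the target document first (position 0 is zero-based), so the rankings alone do not distinguish them. One additional quantity worth inspecting is the gap between the best and second-best score. The sketch below computes it from the first query's printed scores; `rank_and_margin` is a hypothetical helper, and `np.argsort` on negated scores replaces the `sorted(...)` call used above.

```python
import numpy as np

# Scores copied from the first query's output above
cosine_scores = np.array([0.010, 0.138, -0.012, 0.678, -0.010])
xi_scores = np.array([0.013, 0.018, 0.008, 0.298, 0.015])

def rank_and_margin(scores):
    """Hypothetical helper: ranking (best first) and the gap between the top two scores."""
    ranking = np.argsort(-scores)
    margin = scores[ranking[0]] - scores[ranking[1]]
    return ranking.tolist(), float(margin)

cos_ranking, cos_margin = rank_and_margin(cosine_scores)
xi_ranking, xi_margin = rank_and_margin(xi_scores)
print("cosine:", cos_ranking, round(cos_margin, 3))
print("xi:    ", xi_ranking, round(xi_margin, 3))
```

The two metrics are on different scales, so the raw margins are not directly comparable. The lower-ranked documents are ordered differently, but their scores are all close to zero under both metrics.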


