# Paraphrase Similarity Experiment

This notebook explores how Voyage AI embeddings capture semantic similarity between paraphrases.

**Hypothesis**: Texts that are paraphrases of each other should have high cosine similarity in the embedding space, while unrelated texts should have lower similarity scores.
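
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below shows the formula on toy vectors. The helper name `cosine_sim` and the vector values are illustrative assumptions, not part of this project's `embeddings_space.metrics` module.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional vectors standing in for real embeddings (hypothetical values).
v = np.array([1.0, 2.0, 3.0])
same_direction = 2.0 * v                  # scaled copy: similarity is 1 up to rounding
orthogonal = np.array([3.0, 0.0, -1.0])   # dot product with v is 0

print(cosine_sim(v, same_direction))
print(cosine_sim(v, orthogonal))
```

Because the measure depends only on direction, scaling a vector does not change its similarity to another vector.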

## Setup

First, let's load our dependencies and configure the environment.

In [None]:
import json
import sys
from pathlib import Path

# Notebooks do not define __file__, so treat the current working directory
# as the notebook directory.
NOTEBOOK_DIR = Path.cwd().resolve()
# If we're in notebooks/, go up one level; otherwise assume we're at project root
if NOTEBOOK_DIR.name == 'notebooks':
    PROJECT_ROOT = NOTEBOOK_DIR.parent
else:
    PROJECT_ROOT = NOTEBOOK_DIR

# Add src to path for local imports
sys.path.insert(0, str(PROJECT_ROOT / 'src'))

print(f"Project root: {PROJECT_ROOT}")
print(f"Source path: {PROJECT_ROOT / 'src'}")

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from embeddings_space.embeddings import EmbeddingsClient
from embeddings_space.metrics import cosine_similarity, pairwise_similarities

# Set up plotting style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

## Load Paraphrase Data

We have a dataset with groups of paraphrases (semantically equivalent texts) and some unrelated texts for comparison.
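
The loading and embedding cells in this notebook read the keys `paraphrase_groups` (each with `id`, `topic`, `texts`) and `unrelated_texts` (each with `id`, `text`). The sketch below shows a file shape consistent with those keys, plus a small check for them. The example texts, the ids, and the helper `validate_paraphrase_data` are hypothetical and are not taken from the real `paraphrases.json`.

```python
import json

# Hypothetical example of the structure the notebook code expects.
# The keys mirror those read in the cells of this notebook; the texts are invented.
example_data = {
    "paraphrase_groups": [
        {
            "id": "weather",
            "topic": "Weather",
            "texts": [
                "It is raining heavily outside.",
                "There is a heavy downpour right now.",
            ],
        }
    ],
    "unrelated_texts": [
        {"id": "unrelated_1", "text": "The stock market closed higher today."}
    ],
}

def validate_paraphrase_data(data: dict) -> None:
    """Raise ValueError if a key used by the notebook is missing."""
    for group in data["paraphrase_groups"]:
        missing = {"id", "topic", "texts"} - group.keys()
        if missing:
            raise ValueError(f"group is missing keys: {sorted(missing)}")
        if len(group["texts"]) < 2:
            raise ValueError(f"group {group['id']!r} needs at least two texts")
    for item in data["unrelated_texts"]:
        missing = {"id", "text"} - item.keys()
        if missing:
            raise ValueError(f"unrelated item is missing keys: {sorted(missing)}")

validate_paraphrase_data(example_data)
print(json.dumps(example_data, indent=2)[:200])
```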

In [None]:
# Load the paraphrase dataset
with open(PROJECT_ROOT / 'data' / 'paraphrases.json', 'r') as f:
    data = json.load(f)

print(f"Loaded {len(data['paraphrase_groups'])} paraphrase groups")
print(f"Loaded {len(data['unrelated_texts'])} unrelated texts")

# Preview the data
for group in data['paraphrase_groups']:
    print(f"\n📝 {group['topic']} ({len(group['texts'])} variants)")
    print(f"   Example: {group['texts'][0][:80]}...")

## Generate Embeddings

Connect to Voyage AI and generate embeddings for all texts.
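
Since the client needs an API key, it can help to fail early with a clear message when the key is missing. The sketch below assumes the contents of the `.env` file have already been loaded into the process environment; how `EmbeddingsClient` loads them is not shown in this notebook. The helper `require_env` is hypothetical.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or export it")
    return value

# Report the problem instead of letting a later API call fail obscurely.
try:
    require_env("VOYAGE_API_KEY")
    print("VOYAGE_API_KEY found")
except RuntimeError as err:
    print(f"Configuration problem: {err}")
```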

In [None]:
# Initialize the embeddings client
# Make sure VOYAGE_API_KEY is set in your .env file
client = EmbeddingsClient(model="voyage-4-large")

print(f"Using model: {client.model}")

In [None]:
# Collect all texts and their metadata
all_texts = []
text_labels = []
group_ids = []

# Add paraphrase groups
for group in data['paraphrase_groups']:
    for i, text in enumerate(group['texts']):
        all_texts.append(text)
        text_labels.append(f"{group['id']}_{i+1}")
        group_ids.append(group['id'])

# Add unrelated texts
for item in data['unrelated_texts']:
    all_texts.append(item['text'])
    text_labels.append(item['id'])
    group_ids.append('unrelated')

print(f"Total texts to embed: {len(all_texts)}")

In [None]:
# Generate embeddings for all texts
embeddings = client.embed_texts(all_texts)

print(f"Embedding shape: {embeddings.shape}")
print(f"Embedding dimension: {embeddings.shape[1]}")

## Analyze Similarity

Compute pairwise cosine similarities and visualize the results.
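
`pairwise_similarities` comes from this project's `embeddings_space.metrics` module, whose implementation is not shown here. As a rough illustration of what a pairwise cosine computation usually involves, the sketch below normalizes each row to unit length and takes a matrix product. The function `pairwise_cosine` and the toy matrix are assumptions for illustration only.

```python
import numpy as np

def pairwise_cosine(X: np.ndarray) -> np.ndarray:
    """Return the (n, n) matrix of cosine similarities between rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / norms          # each row scaled to unit length
    return unit @ unit.T      # dot products of unit vectors are cosines

# Toy stand-in for an (n_texts, dim) embedding matrix (hypothetical values).
X = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
S = pairwise_cosine(X)
print(np.round(S, 3))
```

The result is symmetric with ones on the diagonal, which is why the analysis only needs the pairs above the diagonal.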

In [None]:
# Compute pairwise cosine similarities
similarity_matrix = pairwise_similarities(embeddings, metric="cosine")

# Create a DataFrame for easier analysis
sim_df = pd.DataFrame(
    similarity_matrix,
    index=text_labels,
    columns=text_labels
)

print("Similarity matrix shape:", sim_df.shape)

In [None]:
# Create a heatmap visualization
fig, ax = plt.subplots(figsize=(14, 12))

sns.heatmap(
    sim_df,
    annot=True,
    fmt='.2f',
    cmap='RdYlGn',
    center=0.5,
    vmin=0,
    vmax=1,
    ax=ax,
    annot_kws={'size': 8}
)

ax.set_title('Cosine Similarity Matrix: Paraphrases vs Unrelated Texts', fontsize=14)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

## Statistical Analysis

Compare the similarity distributions between paraphrases and unrelated texts.
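
The split into within-group and between-group pairs can also be written without explicit loops. The sketch below uses `np.triu_indices` to enumerate each unordered pair once and a boolean mask to separate the two sets. The function `split_pairs` and the toy matrix are hypothetical; the grouping rule matches the one used in this notebook, where any pair involving an unrelated text counts as between-group.

```python
import numpy as np

def split_pairs(sim: np.ndarray, groups: list[str], unrelated_label: str = "unrelated"):
    """Split upper-triangle similarities into within-group and between-group arrays."""
    g = np.asarray(groups)
    i, j = np.triu_indices(len(g), k=1)          # all index pairs with i < j
    same = (g[i] == g[j]) & (g[i] != unrelated_label)
    return sim[i, j][same], sim[i, j][~same]

# Toy 4x4 similarity matrix: two paraphrases of "a", one text from "b", one unrelated text.
sim = np.array([
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.1],
    [0.2, 0.3, 1.0, 0.2],
    [0.1, 0.1, 0.2, 1.0],
])
within, between = split_pairs(sim, ["a", "a", "b", "unrelated"])
print(within, between)
```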

In [None]:
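# Collect similarity scores by relationship type.
# Pairs from the same paraphrase group count as within-group; every other pair,
# including any pair that involves an unrelated text, counts as between-group.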
within_group_similarities = []
between_group_similarities = []

n = len(all_texts)
for i in range(n):
    for j in range(i + 1, n):
        sim = similarity_matrix[i, j]
        if group_ids[i] == group_ids[j] and group_ids[i] != 'unrelated':
            within_group_similarities.append(sim)
        else:
            between_group_similarities.append(sim)

print(f"Within-group pairs: {len(within_group_similarities)}")
print(f"Between-group pairs: {len(between_group_similarities)}")

In [None]:
# Summary statistics
print("=" * 50)
print("SIMILARITY STATISTICS")
print("=" * 50)
print("\n📊 Within-group (paraphrases):")
print(f"   Mean: {np.mean(within_group_similarities):.4f}")
print(f"   Std:  {np.std(within_group_similarities):.4f}")
print(f"   Min:  {np.min(within_group_similarities):.4f}")
print(f"   Max:  {np.max(within_group_similarities):.4f}")

print("\n📊 Between-group (different topics):")
print(f"   Mean: {np.mean(between_group_similarities):.4f}")
print(f"   Std:  {np.std(between_group_similarities):.4f}")
print(f"   Min:  {np.min(between_group_similarities):.4f}")
print(f"   Max:  {np.max(between_group_similarities):.4f}")

print(f"\n🎯 Separation gap: {np.mean(within_group_similarities) - np.mean(between_group_similarities):.4f}")

In [None]:
# Distribution visualization
fig, ax = plt.subplots(figsize=(10, 6))

ax.hist(
    within_group_similarities,
    bins=20,
    alpha=0.7,
    label='Paraphrases (within group)',
    color='#2ecc71'
)
ax.hist(
    between_group_similarities,
    bins=20,
    alpha=0.7,
    label='Different topics (between groups)',
    color='#e74c3c'
)

ax.axvline(
    np.mean(within_group_similarities),
    color='#27ae60',
    linestyle='--',
    linewidth=2,
    label=f'Paraphrase mean: {np.mean(within_group_similarities):.3f}'
)
ax.axvline(
    np.mean(between_group_similarities),
    color='#c0392b',
    linestyle='--',
    linewidth=2,
    label=f'Different topic mean: {np.mean(between_group_similarities):.3f}'
)

ax.set_xlabel('Cosine Similarity', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
ax.set_title('Distribution of Similarity Scores', fontsize=14)
ax.legend(loc='upper left')
ax.set_xlim(0, 1)

plt.tight_layout()
plt.show()

## Conclusions

The results above should demonstrate:

1. **Paraphrases cluster together**: Texts with the same semantic meaning but different wording have high cosine similarity
2. **Unrelated texts are distant**: Texts on different topics have lower similarity scores
3. **Clear separation**: There should be a measurable gap between within-group and between-group similarities

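If the within-group mean is clearly higher than the between-group mean and the two histograms overlap little, the experiment supports the hypothesis that Voyage embeddings capture semantic similarity well enough for paraphrase detection, semantic search, and similar applications. A small gap or heavy overlap would instead suggest that a similarity threshold alone is not enough to separate paraphrases from unrelated texts on this dataset.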
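
A difference of means says little about how much the two distributions overlap. One way to put a single number on the separation is the probability that a randomly chosen within-group pair scores higher than a randomly chosen between-group pair, which equals the ROC AUC of a threshold classifier. The sketch below is one possible way to compute it; the function `separation_auc` and the scores are hypothetical, not results from this notebook.

```python
import numpy as np

def separation_auc(within: np.ndarray, between: np.ndarray) -> float:
    """Probability that a random within-group pair outscores a random
    between-group pair, counting ties as half (equivalent to ROC AUC)."""
    w = np.asarray(within, dtype=float)[:, None]
    b = np.asarray(between, dtype=float)[None, :]
    wins = (w > b).sum() + 0.5 * (w == b).sum()
    return float(wins / (w.size * b.size))

# Hypothetical scores, not results from this notebook.
within = np.array([0.9, 0.8, 0.7])
between = np.array([0.75, 0.3, 0.2, 0.1])
print(separation_auc(within, between))
```

A value of 1.0 means every paraphrase pair outscores every non-paraphrase pair, while 0.5 means the scores carry no ranking information.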

## Next Steps

Potential follow-up experiments:
- Compare different Voyage models (voyage-4-large vs voyage-4-lite vs voyage-3.5)
- Test cross-lingual paraphrases
- Explore the shared embedding space with code examples (voyage-code-3)
- Visualize embeddings using dimensionality reduction (UMAP, t-SNE)