**Semantic Deduplication with Model2Vec**

In this tutorial, we’ll explore how Model2Vec can help identify duplicates in text data that traditional exact matching would miss. Exact matching works for identical texts, but it fails to detect near-duplicates: documents that differ slightly in wording but convey the same meaning. Using Model2Vec, we embed documents into vectors and measure their similarity, which lets us catch both exact and semantic duplicates and improves the quality of our dataset. Because Model2Vec is fast, we can deduplicate large datasets efficiently, giving us cleaner, more robust data for downstream tasks.
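
To make "measure their similarity" concrete, here is a minimal sketch that assumes cosine similarity as the measure. The three vectors are hypothetical hand-picked values, not real Model2Vec embeddings, and `cosine_similarity` is a helper defined here for illustration only.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Return the cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional "embeddings" (real embeddings have many more dimensions)
doc_a = np.array([0.9, 0.1, 0.3])
doc_b = np.array([0.88, 0.12, 0.31])  # stands in for a near-duplicate of doc_a
doc_c = np.array([-0.2, 0.9, 0.1])    # stands in for an unrelated document

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print(f"a vs b: {sim_ab:.4f}")
print(f"a vs c: {sim_ac:.4f}")
```

Vectors pointing in nearly the same direction score close to 1, so a high threshold keeps only very close matches.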

In [None]:
!pip install datasets model2vec reach numpy tqdm python-Levenshtein datasketch
from datasets import load_dataset
from model2vec import StaticModel
from reach import Reach
import numpy as np
from tqdm import tqdm

In [14]:
# Load the model and dataset
model = StaticModel.from_pretrained("minishlab/M2V_base_output")
ds = load_dataset("ag_news")["train"]
texts = ds['text']

We will first try to find exact matches in the dataset as a baseline. Then, we will use Model2Vec to identify semantic duplicates.

In [15]:
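# Keep the index of each text the first time it appears.
# set.add() returns None, so `not seen.add(text)` is always True and only records the text.
seen = set()
deduplicated_text_indices = np.array([i for i, text in enumerate(texts) if text not in seen and not seen.add(text)])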
len(deduplicated_text_indices)

120000

All 120,000 texts are kept, so exact string matching finds no duplicates. Now, let's use Model2Vec to embed our documents and identify duplicates.
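
As an aside, the one-line comprehension used for the exact-match baseline can be hard to read. The sketch below spells out the same first-occurrence logic as an explicit loop and runs it on three made-up sentences. `exact_dedup_indices` and `toy_texts` are illustrative names, not part of any library or of the dataset.

```python
def exact_dedup_indices(texts: list[str]) -> list[int]:
    """Return the indices of the first occurrence of each distinct text."""
    seen: set[str] = set()
    kept: list[int] = []
    for i, text in enumerate(texts):
        if text in seen:
            continue  # an identical string was already kept
        seen.add(text)
        kept.append(i)
    return kept

# Made-up examples for illustration
toy_texts = [
    "Stocks fall next week.",
    "Stocks fall next week.",  # exact copy: removed
    "Stocks fall this week.",  # one word differs: kept by exact matching
]
kept = exact_dedup_indices(toy_texts)
print(kept)  # [0, 2]
```

The third sentence survives because a single changed word makes the strings unequal, which is exactly the kind of near-duplicate that exact matching cannot catch.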

In [4]:
# Encode texts into embeddings
embeddings = model.encode(texts, show_progressbar=True)
embedding_matrix = np.vstack(embeddings)

100%|██████████| 118/118 [00:02<00:00, 47.80it/s]


In [25]:
# Define a function to deduplicate embeddings
def deduplicate(embedding_matrix: np.ndarray, threshold: float, batch_size: int = 1024) -> tuple[np.ndarray, dict[int, int]]:
    """
    Deduplicate embeddings and return the deduplicated indices and a mapping of removed indices to their corresponding original indices.
    
    :param embedding_matrix: The embeddings to deduplicate.
    :param threshold: The similarity threshold to use for deduplication.
    :param batch_size: The batch size to use for similarity computation.
    :return: A tuple containing the deduplicated indices and a dictionary mapping removed indices to original indices.
    """
    reach = Reach(vectors=embedding_matrix, items=[str(i) for i in range(len(embedding_matrix))])
    
    # Find similar documents
    is_duplicate = np.zeros(len(embedding_matrix), dtype=bool)
    duplicate_to_original_mapping = {}

    results = reach.threshold(
        [str(i) for i in range(len(embedding_matrix))], 
        threshold=threshold, 
        batch_size=batch_size, 
        show_progressbar=True
    )
    
    # Process duplicates
    for i, similar_items in tqdm(enumerate(results), total=len(results)):
        if is_duplicate[i]:
            continue  # Skip already marked duplicates

        # Similar items are returned as (item, score) pairs; we only need the item, converted back to an integer index
        similar_indices = [int(item[0]) for item in similar_items if int(item[0]) != i]
        
        # Mark similar documents as duplicates and map them to the original
        for sim_idx in similar_indices:
            is_duplicate[sim_idx] = True
            duplicate_to_original_mapping[sim_idx] = i  # Map duplicate to original

    deduplicated_indices = np.where(~is_duplicate)[0]

    return deduplicated_indices, duplicate_to_original_mapping
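
The greedy marking logic in `deduplicate` can be checked by hand on a tiny input. The following is a NumPy-only sketch of the same idea. It swaps the Reach threshold search for a full cosine-similarity matrix, so it is only practical for small inputs. It also assumes that "similar" means a cosine similarity at or above the threshold. `deduplicate_dense` and `toy_vectors` are hypothetical names used for illustration.

```python
import numpy as np

def deduplicate_dense(embeddings: np.ndarray, threshold: float) -> tuple[np.ndarray, dict[int, int]]:
    """Greedy deduplication over a full similarity matrix (toy sizes only)."""
    # Normalize rows so that a dot product equals cosine similarity
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarity = normed @ normed.T

    is_duplicate = np.zeros(len(embeddings), dtype=bool)
    duplicate_to_original: dict[int, int] = {}

    for i in range(len(embeddings)):
        if is_duplicate[i]:
            continue  # already assigned to an earlier original
        # Every other row at or above the threshold becomes a duplicate of row i
        for j in np.flatnonzero(similarity[i] >= threshold):
            j = int(j)
            if j != i:
                is_duplicate[j] = True
                duplicate_to_original[j] = i

    return np.flatnonzero(~is_duplicate), duplicate_to_original

# Hypothetical 2-D vectors: rows 0/1 and rows 2/3 point in almost the same direction
toy_vectors = np.array([
    [1.0, 0.0],
    [0.999, 0.01],
    [0.0, 1.0],
    [0.01, 0.999],
])
kept, mapping = deduplicate_dense(toy_vectors, threshold=0.99)
print(kept.tolist())  # [0, 2]
print(mapping)        # {1: 0, 3: 2}
```

The full matrix needs memory quadratic in the number of documents, which is why the function above queries Reach in batches instead.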


In [37]:
# Deduplicate (with a high threshold)
deduplicated_indices, duplicate_to_original_mapping = deduplicate(embedding_matrix, threshold=0.99)
len(deduplicated_indices)

 99%|█████████▉| 117/118 [00:24<00:00,  4.77it/s]
100%|██████████| 120000/120000 [00:00<00:00, 945566.39it/s]


118769

Using Model2Vec, we find 1,231 duplicates (120,000 texts minus the 118,769 kept) even with a very high threshold, in under 30 seconds. Now, let's look at a few examples to see if these are indeed duplicates.

In [27]:
# Show a few duplicates with their originals
num_examples = 5
for duplicate_idx, original_idx in list(duplicate_to_original_mapping.items())[:num_examples]:
    print(f"Original text: {texts[original_idx]}")
    print(f"Duplicate text: {texts[duplicate_idx]}")
    print("-" * 50)

Original text: Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\about the economy and the outlook for earnings are expected to\hang over the stock market next week during the depth of the\summer doldrums.
Duplicate text: Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\about the economy and the outlook for earnings are expected to\hang over the stock market this week during the depth of the\summer doldrums.
--------------------------------------------------
Original text: Oil and Economy Cloud Stocks' Outlook  NEW YORK (Reuters) - Soaring crude prices plus worries  about the economy and the outlook for earnings are expected to  hang over the stock market next week during the depth of the  summer doldrums.
Duplicate text: Oil and Economy Cloud Stocks' Outlook  NEW YORK (Reuters) - Soaring crude prices plus worries  about the economy and the outlook for earnings are expected to  hang over the stock mark

The texts we found do indeed seem to be duplicates, nice! In a normal workflow where we already use Model2Vec to embed our documents, deduplicating our training corpus is essentially free. This gives us an easy-to-use, easy-to-integrate, fast alternative to other methods such as MinHash.