**Semantic Deduplication with Model2Vec**

In this tutorial, we’ll explore how Model2Vec can help identify duplicates in text data that exact matching would miss. Exact matching works for identical texts, but it fails to detect near-duplicates: documents that differ slightly in wording but convey the same meaning. Using Model2Vec, we embed documents into vectors and measure their similarity, which lets us catch both exact and semantic duplicates and improve the quality of our dataset. Because Model2Vec is fast, deduplication stays practical on large datasets and gives us cleaner, more robust data for downstream tasks. We will also use Model2Vec to detect train-test overlap, which would otherwise inflate our evaluation scores.
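As a rough illustration of what "measuring similarity" between embeddings means, the sketch below computes cosine similarity with NumPy. It assumes cosine similarity is the measure of interest, and the three 4-dimensional vectors are made up for illustration; real Model2Vec embeddings have many more dimensions.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


# Hypothetical toy "embeddings", not produced by any model.
doc_a = np.array([0.80, 0.10, 0.30, 0.50])
doc_b = np.array([0.79, 0.12, 0.31, 0.50])  # near-duplicate of doc_a
doc_c = np.array([-0.20, 0.90, -0.40, 0.10])  # unrelated document

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print(f"a vs b: {sim_ab:.4f}, a vs c: {sim_ac:.4f}")
```

A pair whose similarity exceeds a chosen threshold (0.99 later in this tutorial) would be treated as a duplicate; the unrelated pair falls far below it.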

In [None]:
!pip install datasets model2vec reach numpy wordllama tqdm datasketch

from difflib import ndiff
from time import perf_counter

from datasets import load_dataset
from datasketch import MinHash, MinHashLSH
import numpy as np
from model2vec import StaticModel
from reach import Reach
from tqdm import tqdm
from wordllama import WordLlama

**Loading data and model**

We will use the AG News dataset and the Model2Vec pretrained model for deduplication.

In [40]:
# Load the model and dataset
model = StaticModel.from_pretrained("minishlab/M2V_base_output")
ds = load_dataset("ag_news")["train"]
texts = ds['text']

**Exact overlap baseline**

We will first try to find exact matches in the dataset as a baseline.

In [96]:
seen = set()
deduplicated_text_indices = []

for i, text in enumerate(texts):
    if text not in seen:
        deduplicated_text_indices.append(i)
        seen.add(text)

print("Number of deduplicated docs:", len(deduplicated_text_indices))

Number of deduplicated docs: 120000


All 120,000 documents are kept, so exact string matching finds no duplicates in this dataset.
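To see why an exact-match baseline misses near-duplicates, here is a minimal sketch on three made-up sentences. The `normalize` helper and the toy documents are assumptions for illustration and are not part of the pipeline in this tutorial. Even with light normalization (lowercasing, collapsing whitespace), a single changed word is enough to defeat exact matching.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace (hypothetical, light normalization)."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def exact_dedup_indices(docs: list[str]) -> list[int]:
    """Keep the index of the first occurrence of each normalized text."""
    seen: set[str] = set()
    kept: list[int] = []
    for i, doc in enumerate(docs):
        key = normalize(doc)
        if key not in seen:
            seen.add(key)
            kept.append(i)
    return kept


toy_docs = [
    "Stocks expected to fall next week.",
    "Stocks  expected to fall next week.",  # differs only by an extra space
    "Stocks expected to fall this week.",  # one word changed
]
kept = exact_dedup_indices(toy_docs)
print(kept)
```

The whitespace variant is caught, but the "next week" / "this week" pair is still treated as two distinct documents, which is the gap embedding-based deduplication aims to close.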

**Deduplication using Model2Vec**

Let's now use Model2Vec to embed our documents and identify duplicates.

In [41]:
# Encode texts into embeddings
embedding_matrix = model.encode(texts)

100%|██████████| 118/118 [00:02<00:00, 45.65it/s]


In [53]:
def deduplicate(embedding_matrix: np.ndarray, threshold: float, batch_size: int = 1024) -> tuple[np.ndarray, dict[int, int]]:
    """
    Deduplicate embeddings and return the deduplicated indices and a mapping of removed indices to their corresponding original indices.
    
    :param embedding_matrix: The embeddings to deduplicate.
    :param threshold: The similarity threshold to use for deduplication.
    :param batch_size: The batch size to use for similarity computation.
    :return: A tuple containing the deduplicated indices and a dictionary mapping removed indices to original indices.
    """
    reach = Reach(vectors=embedding_matrix, items=[str(i) for i in range(len(embedding_matrix))])
    
    # Use a set for deduplicated indices and keep track of duplicates
    deduplicated_indices = set(range(len(embedding_matrix)))  # Start with all indices as deduplicated
    duplicate_to_original_mapping = {}

    results = reach.nearest_neighbor_threshold(
        embedding_matrix, 
        threshold=threshold, 
        batch_size=batch_size, 
        show_progressbar=True
    )
    
    # Process duplicates
    for i, similar_items in enumerate(tqdm(results)):
        if i not in deduplicated_indices:
            continue  # Skip already marked duplicates

        # Similar items are returned as (index, score), we are only interested in the index
        similar_indices = [int(item[0]) for item in similar_items if int(item[0]) != i]
        
        # Mark similar documents as duplicates and map them to the original
        for sim_idx in similar_indices:
            if sim_idx in deduplicated_indices:
                deduplicated_indices.remove(sim_idx)
                duplicate_to_original_mapping[sim_idx] = i  # Map duplicate to original

    # Sort so the returned indices follow the original document order (set order is not guaranteed)
    return np.array(sorted(deduplicated_indices)), duplicate_to_original_mapping


In [81]:
# Deduplicate (with a high threshold)
deduplicated_indices, duplicate_to_original_mapping = deduplicate(embedding_matrix, threshold=0.99)
print(f"Number of deduplicated docs: {len(deduplicated_indices)}")

100%|██████████| 118/118 [00:25<00:00,  4.64it/s]
100%|██████████| 120000/120000 [00:00<00:00, 679800.97it/s]

Number of deduplicated docs: 118769





Using Model2Vec with a very high threshold (0.99), we find 1,231 duplicates (120,000 - 118,769) in under 30 seconds. Now, let's look at a few examples to see if these are indeed duplicates.

In [71]:
def display_word_differences(x: str, y: str) -> str:
    """Return the words removed from x ('-') and added in y ('+'), joined into one string."""
    diff = ndiff(x.split(), y.split())
    return " ".join(word for word in diff if word.startswith(('+', '-')))

# Show a few duplicates with their originals, highlighting word-level differences
num_examples = 5
for duplicate_idx, original_idx in list(duplicate_to_original_mapping.items())[:num_examples]:
    print(f"Original text:\n{texts[original_idx]}")
    print(f"Duplicate text:\n{texts[duplicate_idx]}")
    print("Differences:")
    print(display_word_differences(texts[original_idx], texts[duplicate_idx]))
    print("-" * 50)


Original text:
Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\about the economy and the outlook for earnings are expected to\hang over the stock market next week during the depth of the\summer doldrums.
Duplicate text:
Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\about the economy and the outlook for earnings are expected to\hang over the stock market this week during the depth of the\summer doldrums.
Differences:
- next + this
--------------------------------------------------
Original text:
Oil and Economy Cloud Stocks' Outlook  NEW YORK (Reuters) - Soaring crude prices plus worries  about the economy and the outlook for earnings are expected to  hang over the stock market next week during the depth of the  summer doldrums.
Duplicate text:
Oil and Economy Cloud Stocks' Outlook  NEW YORK (Reuters) - Soaring crude prices plus worries  about the economy and the outlook for earnings are expected t

The found texts do indeed seem to be duplicates, nice! In a workflow where we already use Model2Vec to embed our documents, deduplicating our training corpus adds little extra cost, since the embeddings are already computed. This gives us an easy-to-use, easy-to-integrate, fast way to deduplicate.
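The greedy procedure used above can also be sketched without Reach on a tiny matrix, which may help when reasoning about what the threshold does. This is a brute-force illustration that assumes cosine similarity and computes the full pairwise similarity matrix, so it only suits small inputs; the function name `greedy_dedup` and the toy vectors are made up for this example.

```python
import numpy as np


def greedy_dedup(vectors: np.ndarray, threshold: float) -> tuple[list[int], dict[int, int]]:
    """Keep the first item of each group of vectors whose cosine similarity >= threshold."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # avoid division by zero
    sims = unit @ unit.T  # full pairwise cosine similarity matrix

    kept = set(range(len(vectors)))
    duplicate_to_original: dict[int, int] = {}
    for i in range(len(vectors)):
        if i not in kept:
            continue  # already marked as a duplicate of an earlier item
        for j in np.flatnonzero(sims[i] >= threshold):
            j = int(j)
            if j != i and j in kept:
                kept.remove(j)
                duplicate_to_original[j] = i
    return sorted(kept), duplicate_to_original


# Hypothetical 2-D vectors: rows 0/1 and rows 2/3 point in almost the same direction.
toy = np.array([[1.0, 0.0], [0.999, 0.01], [0.0, 1.0], [0.01, 0.999]])
kept, mapping = greedy_dedup(toy, threshold=0.99)
print(kept, mapping)
```

Because the loop is greedy and order-dependent, the earliest document in each near-duplicate group is the one that survives, which matches the behavior of the `deduplicate` function above.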

**Deduplication using WordLlama**

For comparison, let's also try a different library (WordLlama), which also uses static embeddings to deduplicate text data.

In [84]:
wl = WordLlama.load()

start = perf_counter()  # Avoid shadowing the standard-library `time` module
deduplicated_docs = wl.deduplicate(texts, threshold=0.99)
print(f"Number of deduplicated docs: {len(deduplicated_docs)}")
print(f"Time taken: {perf_counter() - start}")


Number of deduplicated docs: 119128
Time taken: 42.821428374998504


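This approach is considerably slower than Model2Vec for encoding + deduplication (43 vs 27 seconds, where the 27 seconds is roughly 2 seconds of encoding plus 25 seconds of neighbor search). It also finds fewer duplicates with the same threshold (872 vs 1,231).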

**Deduplication using MinHash**

As a last comparison, let's use MinHash, a common method for deduplication. We will use the datasketch library to find duplicates.
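MinHash estimates the Jaccard similarity between sets, here the sets of whitespace-separated words in two texts. As a small worked example of the quantity being approximated, the sketch below computes exact Jaccard similarity in plain Python; the `jaccard` helper and the two short phrases are illustrative assumptions, not part of datasketch.

```python
def jaccard(a: str, b: str) -> float:
    """Exact Jaccard similarity between the word sets of two texts."""
    words_a, words_b = set(a.split()), set(b.split())
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


# 4 shared words out of 6 distinct words in total.
score = jaccard("the stock market next week", "the stock market this week")
print(f"{score:.3f}")
```

Note that a single changed word lowers the score much more in a short text than in a long one, so a fixed Jaccard threshold behaves differently depending on document length.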

In [77]:
def get_minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for word in text.split():
        m.update(word.encode('utf8'))
    return m

def deduplicate_with_minhash(texts: list[str], threshold: float = 0.9) -> list[int]:
    """
    Deduplicate texts using MinHash and return the indices of unique texts.

    :param texts: List of texts to deduplicate.
    :param threshold: Jaccard similarity threshold for considering texts as duplicates.
    :return: List of indices of deduplicated texts.
    """
    lsh = MinHashLSH(threshold=threshold)
    deduplicated_text_indices = []

    for i, text in enumerate(texts):
        # Generate MinHash for the current text
        minhash = get_minhash(text)

        # Check if the MinHash is already in the LSH (i.e., if it is a duplicate)
        if not lsh.query(minhash):
            # If it's not a duplicate, add the MinHash and keep the index
            deduplicated_text_indices.append(i)
            lsh.insert(i, minhash)

    return deduplicated_text_indices


start = perf_counter()  # Avoid shadowing the standard-library `time` module
deduplicated_text_indices = deduplicate_with_minhash(texts)
print(f"Number of deduplicated docs: {len(deduplicated_text_indices)}")
print(f"Time taken: {perf_counter() - start}")


Number of deduplicated docs: 118653
Time taken: 56.46521229199425


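Model2Vec is again about twice as fast, with 27 seconds vs 56 seconds for MinHash. The number of duplicates found is roughly the same (1,347 for MinHash vs 1,231 for Model2Vec) using the default settings for MinHash (128 permutations and a Jaccard threshold of 0.9).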

**Train test leakage detection using Model2Vec**

Now, as a last experiment, let's also embed the test set, and see if there are any duplicates between the training and test set. This is a common issue in NLP, where the test set may contain instances that are also in the training set.


In [90]:
# Load the dataset once and take both splits
ag_news = load_dataset("ag_news")
ds_train = ag_news["train"]
ds_test = ag_news["test"]

texts_train = ds_train['text']
texts_test = ds_test['text']

# Encode texts into embeddings
embedding_matrix_train = model.encode(texts_train)
embedding_matrix_test = model.encode(texts_test)

def deduplicate_across_datasets(embedding_matrix_1: np.ndarray, embedding_matrix_2: np.ndarray, threshold: float, batch_size: int = 1024) -> tuple[list[int], dict[int, int]]:
    """
    Deduplicate embeddings across two datasets and return the indices of duplicates between them.
    
    :param embedding_matrix_1: The embeddings of the first dataset (e.g., train).
    :param embedding_matrix_2: The embeddings of the second dataset (e.g., test).
    :param threshold: The similarity threshold to use for deduplication.
    :param batch_size: The batch size to use for similarity computation.
    :return: A tuple containing the duplicate indices and a dictionary mapping removed indices in the second dataset to their corresponding indices in the first dataset.
    """
    reach = Reach(vectors=embedding_matrix_1, items=[str(i) for i in range(len(embedding_matrix_1))])

    # Keep track of duplicates in the second dataset
    duplicate_indices_in_test = []
    duplicate_to_original_mapping = {}

    # Find nearest neighbors from the test set in the train set
    results = reach.nearest_neighbor_threshold(
        embedding_matrix_2, 
        threshold=threshold, 
        batch_size=batch_size, 
        show_progressbar=True
    )
    
    # Process duplicates
    for i, similar_items in enumerate(tqdm(results)):
        # Similar items are returned as (index, score), we are only interested in the index
        similar_indices = [int(item[0]) for item in similar_items if item[1] >= threshold]  # Keep those above the threshold
        
        # If we find a similar item in the train set, mark it as a duplicate
        if similar_indices:
            duplicate_indices_in_test.append(i)
            duplicate_to_original_mapping[i] = similar_indices[0]  # Map duplicate in test to original in train

    return duplicate_indices_in_test, duplicate_to_original_mapping

# Check for train/test bleed
duplicate_indices_in_test, duplicate_to_original_mapping = deduplicate_across_datasets(
    embedding_matrix_train, 
    embedding_matrix_test, 
    threshold=0.99  # High threshold for deduplication
)

print(f"Number of duplicates found between train and test: {len(duplicate_indices_in_test)}")


100%|██████████| 118/118 [00:02<00:00, 45.36it/s]
100%|██████████| 8/8 [00:00<00:00, 51.05it/s]
100%|██████████| 8/8 [00:01<00:00,  5.40it/s]
100%|██████████| 7600/7600 [00:00<00:00, 901108.42it/s]

Number of duplicates found between train and test: 138





In [99]:
# Show a few duplicates with their originals, highlighting word-level differences
num_examples = 5
for i, test_idx in enumerate(duplicate_indices_in_test[:num_examples]):
    train_idx = duplicate_to_original_mapping[test_idx]

    print(f"Train text:\n{texts_train[train_idx]}")
    print(f"Test text:\n{texts_test[test_idx]}")
    print("Differences:")
    print(display_word_differences(texts_train[train_idx], texts_test[test_idx]))
    print("-" * 50)


Train text:
Jackson Squares Off With Attorney SANTA MARIA, Calif. - Fans of Michael Jackson erupted in cheers Monday as the pop star emerged from a double-decker tour bus and went into court for a showdown with the prosecutor who has pursued him for years on child molestation charges...
Test text:
Jackson Squares Off With Prosecutor SANTA MARIA, Calif. - Fans of Michael Jackson erupted in cheers Monday as the pop star emerged from a double-decker tour bus and went into court for a showdown with the prosecutor who has pursued him for years on child molestation charges...
Differences:
- Attorney + Prosecutor
--------------------------------------------------
Train text:
Cassini Spies Two Moons Around Saturn (AP) AP - NASA's Cassini spacecraft has spied two new little moons around satellite-rich Saturn, the space agency said.
Test text:
Cassini Spies Two Little Saturn Moons (AP) AP - NASA's Cassini spacecraft has spied two new little moons around satellite-rich Saturn, the space agency sa

These again look like duplicates. Model2Vec lets us find train/test leakage in a few seconds here (138 flagged test examples), so the affected examples can be removed to keep the test set clean.
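Detection alone does not change the test set, so a removal step is still needed. A minimal sketch of that step, on made-up data, is shown below; the helper name `drop_flagged` and the toy lists are assumptions for illustration. In the notebook, the inputs would be `texts_test` and `duplicate_indices_in_test`.

```python
def drop_flagged(items: list[str], flagged_indices: list[int]) -> list[str]:
    """Return the items whose positions are not in flagged_indices."""
    flagged = set(flagged_indices)  # set membership keeps the filter linear in len(items)
    return [item for i, item in enumerate(items) if i not in flagged]


# Hypothetical stand-ins for texts_test and duplicate_indices_in_test.
toy_test = ["doc a", "doc b", "doc c", "doc d"]
toy_flagged = [1, 3]

clean_test = drop_flagged(toy_test, toy_flagged)
print(clean_test)
```

When working with a Hugging Face `Dataset` rather than a list, passing the indices to keep to `Dataset.select` should achieve the same result while preserving the labels.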

**Conclusion**

Model2Vec provides an efficient and fast solution for semantic deduplication. In these experiments on AG News, it was faster than both WordLlama and MinHash (27 seconds vs 43 and 56 seconds, respectively). Additionally, its ability to detect train-test overlap makes it a valuable tool for preparing clean datasets for machine learning tasks.