<a href="https://colab.research.google.com/github/LxYuan0420/nlp/blob/main/notebooks/Semantic_deduplication_using_model2vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Semantic Deduplication with Model2Vec**

In this tutorial, we’ll explore how Model2Vec can help identify duplicates in text data that exact matching would miss. Exact matching works for identical texts, but it fails to detect near-duplicates: documents that differ slightly in wording but convey the same meaning. Using Model2Vec, we embed documents into vectors and measure their similarity, which lets us catch both exact and semantic duplicates and improves the quality of our dataset. Because Model2Vec embeddings are fast to compute, deduplication stays cheap even on large datasets. Additionally, we will use Model2Vec to detect train-test overlap, so that evaluation scores are not inflated by test examples that closely match training examples.
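To make "measure their similarity" concrete, here is a minimal sketch that assumes cosine similarity, a common choice for comparing embedding vectors. The three-dimensional vectors are made up for illustration and are not real Model2Vec embeddings.

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means orthogonal."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy "embeddings" (hypothetical values, chosen only to illustrate the measure)
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]  # points in the same direction as doc_a
doc_c = [0.0, 0.0, 3.0]  # orthogonal to doc_a

sim_ab = cosine_similarity(doc_a, doc_b)  # close to 1: candidate duplicate
sim_ac = cosine_similarity(doc_a, doc_c)  # close to 0: unrelated
print(sim_ab, sim_ac)
```

Two documents are then treated as duplicates when their similarity is at or above a chosen threshold.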

In [10]:
!pip install datasets model2vec reach numpy wordllama tqdm datasketch
!pip install -U sentence-transformers

from difflib import ndiff
from time import perf_counter

from datasets import load_dataset
from datasketch import MinHash, MinHashLSH
import numpy as np
from model2vec import StaticModel
from reach import Reach
from tqdm import tqdm
from wordllama import WordLlama
from sentence_transformers import SentenceTransformer


**Loading data and model**

We will use the PolyAI/banking77 dataset and a pretrained Model2Vec model (Thaweewat/gte-multilingual-base-m2v-256), loaded here through sentence-transformers.

In [None]:
# Load the model and dataset
model = SentenceTransformer("Thaweewat/gte-multilingual-base-m2v-256")
#model = StaticModel.from_pretrained("Thaweewat/gte-multilingual-base-m2v-256", folder="0_StaticEmbedding")
ds = load_dataset("PolyAI/banking77")["train"]
texts = ds['text']

In [14]:
print(f"Before deduplication: {len(texts)}")

Before deduplication: 10003


**Exact overlap baseline**

We will first try to find exact matches in the dataset as a baseline.

In [15]:
seen = set()
deduplicated_text_indices = []

for i, text in enumerate(texts):
    if text not in seen:
        deduplicated_text_indices.append(i)
        seen.add(text)

print("Number of deduplicated docs:", len(deduplicated_text_indices))

Number of deduplicated docs: 10003


As can be seen, we find no duplicate instances using exact string matching.
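As a small illustration of why string matching falls short, the sketch below compares two near-duplicate questions from this dataset. The `normalize` helper and the third string `c` are hypothetical additions for this example. Normalizing case and punctuation catches trivial variants, but a single inserted word still defeats it.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())


a = "How do I track the card you sent me?"
b = "How do I track the card you sent to me?"  # one extra word
c = "how do i track the card you sent me"      # made-up variant: case and punctuation only

exact_ab = a == b                             # False: strings differ
normalized_ac = normalize(a) == normalize(c)  # True: normalization handles this variant
normalized_ab = normalize(a) == normalize(b)  # False: the extra word remains
print(exact_ab, normalized_ac, normalized_ab)
```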

**Deduplication using Model2Vec**

Let's now use Model2Vec to embed our documents and identify duplicates.

In [16]:
# Encode texts into embeddings
embedding_matrix = model.encode(texts)

In [24]:
embedding_matrix.shape

(10003, 256)

In [17]:
def deduplicate(embedding_matrix: np.ndarray, threshold: float, batch_size: int = 1024) -> tuple[np.ndarray, dict[int, int]]:
    """
    Deduplicate embeddings and return the deduplicated indices and a mapping of removed indices to their corresponding original indices.

    :param embedding_matrix: The embeddings to deduplicate.
    :param threshold: The similarity threshold to use for deduplication.
    :param batch_size: The batch size to use for similarity computation.
    :return: A tuple containing the deduplicated indices and a dictionary mapping removed indices to original indices.
    """
    reach = Reach(vectors=embedding_matrix, items=[str(i) for i in range(len(embedding_matrix))])

    # Use a set for deduplicated indices and keep track of duplicates
    deduplicated_indices = set(range(len(embedding_matrix)))  # Start with all indices as deduplicated
    duplicate_to_original_mapping = {}

    results = reach.nearest_neighbor_threshold(
        embedding_matrix,
        threshold=threshold,
        batch_size=batch_size,
        show_progressbar=True
    )

    # Process duplicates
    for i, similar_items in enumerate(tqdm(results)):
        if i not in deduplicated_indices:
            continue  # Skip already marked duplicates

        # Similar items are returned as (index, score), we are only interested in the index
        similar_indices = [int(item[0]) for item in similar_items if int(item[0]) != i]

        # Mark similar documents as duplicates and map them to the original
        for sim_idx in similar_indices:
            if sim_idx in deduplicated_indices:
                deduplicated_indices.remove(sim_idx)
                duplicate_to_original_mapping[sim_idx] = i  # Map duplicate to original

    # Sort so the kept indices come back in a deterministic, ascending order
    return np.array(sorted(deduplicated_indices)), duplicate_to_original_mapping


In [29]:
# Deduplicate (with a high threshold)
time = perf_counter()
deduplicated_indices, duplicate_to_original_mapping = deduplicate(embedding_matrix, threshold=0.99)
print(f"Number of deduplicated docs: {len(deduplicated_indices)}")
print(f"Time taken: {perf_counter() - time}")

100%|██████████| 10/10 [00:01<00:00,  7.65it/s]
100%|██████████| 10003/10003 [00:00<00:00, 588215.16it/s]

Number of deduplicated docs: 9913
Time taken: 1.3645381600001656





In [30]:
10003 - 9913

90

Using Model2Vec, we find 90 duplicates with a very high threshold (0.99), and the deduplication step itself takes about 1.4 seconds. Now, let's look at a few examples to see if these are indeed duplicates.
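For intuition about what the thresholded deduplication does, here is a NumPy-only sketch of the same greedy idea on toy vectors. It assumes cosine similarity and builds the full similarity matrix, so it only suits small inputs. The function name `greedy_deduplicate` and the vectors are hypothetical, and this is not how Reach is implemented internally.

```python
import numpy as np


def greedy_deduplicate(vectors, threshold: float):
    """Keep the first item of each group of near-identical vectors; map the rest to it."""
    vectors = np.asarray(vectors, dtype=float)
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = normed @ normed.T  # pairwise cosine similarities

    remaining = set(range(len(vectors)))
    duplicate_to_original = {}
    for i in range(len(vectors)):
        if i not in remaining:
            continue  # already marked as a duplicate of an earlier item
        for j in np.flatnonzero(sims[i] >= threshold):
            j = int(j)
            if j != i and j in remaining:
                remaining.remove(j)
                duplicate_to_original[j] = i
    return sorted(remaining), duplicate_to_original


# Toy vectors: items 0 and 1 are nearly parallel, as are items 2 and 3
toy_vectors = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0], [0.01, 0.999]]
kept, mapping = greedy_deduplicate(toy_vectors, threshold=0.99)
print(kept, mapping)
```

Because earlier items win, the result depends on document order: the first occurrence is kept as the "original" and later near-copies are mapped to it.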

In [23]:
def display_word_differences(x: str, y: str) -> str:
    diff = ndiff(x.split(), y.split())
    return " ".join([f"{word}" for word in diff if word.startswith(('+', '-'))])

# Show a few duplicates with their originals, highlighting word-level differences
num_examples = 10
for duplicate_idx, original_idx in list(duplicate_to_original_mapping.items())[:num_examples]:
    print(f"\n[Original text]:\n{texts[original_idx]}")
    print(f"\n[Duplicate text]:\n{texts[duplicate_idx]}")
    print("\n[Differences]:")
    print(display_word_differences(texts[original_idx], texts[duplicate_idx]))
    print("-" * 50)



[Original text]:
How do I track the card you sent me?

[Duplicate text]:
How do I track the card you sent to me?

[Differences]:
+ to
--------------------------------------------------

[Original text]:
from where are coming your exchange rates?

[Duplicate text]:
Your exchange rates are coming from where?

[Differences]:
- from - where + Your + exchange + rates + from + where? - your - exchange - rates?
--------------------------------------------------

[Original text]:
The exchange rate you are using is bad.This can't be the official interbank exchange rate.

[Duplicate text]:
The exchange rate you are using is really bad.This can't be the official interbank exchange rate.

[Differences]:
+ really
--------------------------------------------------

[Original text]:
I got a €1 extra fee in my statement

[Duplicate text]:
In my statement, I got a €1 extra fee.

[Differences]:
+ In + my + statement, - fee + fee. - in - my - statement
--------------------------------------------------


The texts found do indeed seem to be duplicates, nice! In a normal workflow where we already use Model2Vec to embed our documents, deduplicating our training corpus is essentially free. This gives us an easy-to-use, easy-to-integrate, fast way to deduplicate.
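Once the indices are available, applying them is a short step. The sketch below uses a four-item toy list and hypothetical index values shaped like the outputs of `deduplicate` (kept indices plus a duplicate-to-original mapping) to build the cleaned corpus and to group duplicates by the text they were matched to.

```python
from collections import defaultdict

# Toy corpus; the index values below are hypothetical, not taken from the real run
toy_texts = [
    "How do I track the card you sent me?",
    "How do I track the card you sent to me?",
    "I got a €1 extra fee in my statement",
    "In my statement, I got a €1 extra fee.",
]
toy_kept_indices = [0, 2]
toy_duplicate_to_original = {1: 0, 3: 2}

# Keep only the retained documents, in their original order
clean_texts = [toy_texts[i] for i in sorted(toy_kept_indices)]

# Group removed documents under the document they duplicate, for manual review
clusters = defaultdict(list)
for duplicate_idx, original_idx in toy_duplicate_to_original.items():
    clusters[original_idx].append(duplicate_idx)

print(clean_texts)
print(dict(clusters))
```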

**Deduplication using WordLlama**

For comparison, let's also try a different library (WordLlama), which also uses static embeddings to deduplicate text data.

In [27]:
wl = WordLlama.load()

time = perf_counter()
deduplicated_docs = wl.deduplicate(texts, threshold=0.99)
print(f"Number of deduplicated docs: {len(deduplicated_docs)}")
print(f"Time taken: {perf_counter() - time}")


Number of deduplicated docs: 9900
Time taken: 2.520501794999973


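This run takes about twice as long as the Model2Vec deduplication step (2.5 vs 1.4 seconds), although the WordLlama timing also includes encoding while the Model2Vec timing above does not. With the same threshold, WordLlama removes slightly more documents (103 vs 90).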

**Deduplication using MinHash**

As a last comparison, let's use MinHash, a common method for deduplication. We will use the datasketch library to find duplicates.

In [28]:
def get_minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for word in text.split():
        m.update(word.encode('utf8'))
    return m

def deduplicate_with_minhash(texts: list[str], threshold: float = 0.9) -> list[int]:
    """
    Deduplicate texts using MinHash and return the indices of unique texts.

    :param texts: List of texts to deduplicate.
    :param threshold: Jaccard similarity threshold for considering texts as duplicates.
    :return: List of indices of deduplicated texts.
    """
    lsh = MinHashLSH(threshold=threshold)
    deduplicated_text_indices = []

    for i, text in enumerate(texts):
        # Generate MinHash for the current text
        minhash = get_minhash(text)

        # Check if the MinHash is already in the LSH (i.e., if it is a duplicate)
        if not lsh.query(minhash):
            # If it's not a duplicate, add the MinHash and keep the index
            deduplicated_text_indices.append(i)
            lsh.insert(i, minhash)

    return deduplicated_text_indices


time = perf_counter()
deduplicated_text_indices = deduplicate_with_minhash(texts)
print(f"Number of deduplicated docs: {len(deduplicated_text_indices)}")
print(f"Time taken: {perf_counter() - time}")


Number of deduplicated docs: 9957
Time taken: 18.21741431999999


Model2Vec is again much faster, with 1.4 seconds vs 18 seconds for MinHash. With a Jaccard threshold of 0.9, MinHash removes 46 documents, about half of the 90 found by Model2Vec.
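MinHash approximates the Jaccard similarity between token sets, so it helps to look at the exact value for short texts. The sketch below computes exact Jaccard similarity over whitespace-split words (the same tokenization as `get_minhash`) for two pairs from the examples above. The `jaccard` helper is a hypothetical name introduced here.

```python
def jaccard(x: str, y: str) -> float:
    """Exact Jaccard similarity between the whitespace-split word sets of two texts."""
    set_x, set_y = set(x.split()), set(y.split())
    return len(set_x & set_y) / len(set_x | set_y)


# One inserted word in a short question: 9 shared words out of 10 distinct ones
added_word = jaccard(
    "How do I track the card you sent me?",
    "How do I track the card you sent to me?",
)

# Reordered sentence: casing and attached punctuation change most tokens
reordered = jaccard(
    "from where are coming your exchange rates?",
    "Your exchange rates are coming from where?",
)

print(added_word, reordered)
```

The first pair sits right at the 0.9 threshold, and the reordered pair scores far below it even though both sentences ask the same thing. This is one plausible reason a word-set method finds fewer near-duplicates on short texts than an embedding-based one.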

-----

**Train test leakage detection using Model2Vec**

Now, as a last experiment, let's also embed the test set, and see if there are any duplicates between the training and test set. This is a common issue in NLP, where the test set may contain instances that are also in the training set.


In [31]:
# Load the datasets
ds_train = load_dataset("PolyAI/banking77")["train"]
ds_test = load_dataset("PolyAI/banking77")["test"]

texts_train = ds_train['text']
texts_test = ds_test['text']

# Encode texts into embeddings
embedding_matrix_train = model.encode(texts_train)
embedding_matrix_test = model.encode(texts_test)

def deduplicate_across_datasets(embedding_matrix_1: np.ndarray, embedding_matrix_2: np.ndarray, threshold: float, batch_size: int = 1024) -> tuple[list[int], dict[int, int]]:
    """
    Deduplicate embeddings across two datasets and return the indices of duplicates between them.

    :param embedding_matrix_1: The embeddings of the first dataset (e.g., train).
    :param embedding_matrix_2: The embeddings of the second dataset (e.g., test).
    :param threshold: The similarity threshold to use for deduplication.
    :param batch_size: The batch size to use for similarity computation.
    :return: A tuple containing the duplicate indices and a dictionary mapping removed indices in the second dataset to their corresponding indices in the first dataset.
    """
    reach = Reach(vectors=embedding_matrix_1, items=[str(i) for i in range(len(embedding_matrix_1))])

    # Keep track of duplicates in the second dataset
    duplicate_indices_in_test = []
    duplicate_to_original_mapping = {}

    # Find nearest neighbors from the test set in the train set
    results = reach.nearest_neighbor_threshold(
        embedding_matrix_2,
        threshold=threshold,
        batch_size=batch_size,
        show_progressbar=True
    )

    # Process duplicates
    for i, similar_items in enumerate(tqdm(results)):
        # Similar items are returned as (index, score), we are only interested in the index
        similar_indices = [int(item[0]) for item in similar_items if item[1] >= threshold]  # Keep those above the threshold

        # If we find a similar item in the train set, mark it as a duplicate
        if similar_indices:
            duplicate_indices_in_test.append(i)
            duplicate_to_original_mapping[i] = similar_indices[0]  # Map duplicate in test to original in train

    return duplicate_indices_in_test, duplicate_to_original_mapping

# Check for train/test bleed
duplicate_indices_in_test, duplicate_to_original_mapping = deduplicate_across_datasets(
    embedding_matrix_train,
    embedding_matrix_test,
    threshold=0.99  # High threshold for deduplication
)

print(f"Number of duplicates found between train and test: {len(duplicate_indices_in_test)}")


100%|██████████| 4/4 [00:00<00:00,  9.36it/s]
100%|██████████| 3080/3080 [00:00<00:00, 644119.28it/s]

Number of duplicates found between train and test: 61





In [32]:
# Show a few duplicates with their originals, highlighting word-level differences
num_examples = 5
for i, test_idx in enumerate(duplicate_indices_in_test[:num_examples]):
    train_idx = duplicate_to_original_mapping[test_idx]

    print(f"\n[Train text]:\n{texts_train[train_idx]}")
    print(f"\n[Test text]:\n{texts_test[test_idx]}")
    print("\n[Differences]:")
    print(display_word_differences(texts_train[train_idx], texts_test[test_idx]))
    print("-" * 50)



[Train text]:
How do I add a card on to the app?

[Test text]:
How do I add a card to the app?

[Differences]:
- on
--------------------------------------------------

[Train text]:
I got a €1 extra fee in my statement

[Test text]:
I got a extra €1 fee in my statement

[Differences]:
+ extra - extra
--------------------------------------------------

[Train text]:
I tried to get some money but the machine was not working .The transaction still seems in progress! Can you please check what's going on.I don't want to be charged for money that I did not received.

[Test text]:
Hi,I tried to get some money out but the machine was not working .The transaction still seems in progress! Can you please check what's going on.I don't want to be charged for money that I did not received.

[Differences]:
- I + Hi,I + out
--------------------------------------------------

[Train text]:
I got cash from an ATM earlier but it shows up as pending in the app. How can this still be pending, I already re

These again look like duplicates. Model2Vec lets us find train/test leakage examples very efficiently, so that we can remove them and keep the test set free of near-copies of training examples.
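The cell above only reports the overlap; removing it is a separate step. Here is a NumPy-only sketch on toy vectors that flags test items whose best cosine match in the training set reaches the threshold and then drops them. The `find_leaked` name, the vectors, and the placeholder texts are hypothetical. Unlike the function above, which maps each test item to the first returned neighbor, this sketch maps it to its highest-scoring one.

```python
import numpy as np


def find_leaked(train_vecs, test_vecs, threshold: float) -> dict[int, int]:
    """Map each leaked test index to the index of its most similar training vector."""
    train = np.asarray(train_vecs, dtype=float)
    test = np.asarray(test_vecs, dtype=float)
    train = train / np.linalg.norm(train, axis=1, keepdims=True)
    test = test / np.linalg.norm(test, axis=1, keepdims=True)

    sims = test @ train.T  # shape: (n_test, n_train)
    best_train = sims.argmax(axis=1)
    best_score = sims.max(axis=1)
    return {int(i): int(best_train[i]) for i in np.flatnonzero(best_score >= threshold)}


# Toy data: test items 0 and 2 are near-copies of train items 0 and 1
toy_train = [[1.0, 0.0], [0.0, 1.0]]
toy_test = [[0.999, 0.01], [1.0, 1.0], [0.02, 1.0]]
toy_test_texts = ["near-copy of train 0", "new question", "near-copy of train 1"]

leaked = find_leaked(toy_train, toy_test, threshold=0.99)
clean_test_texts = [t for i, t in enumerate(toy_test_texts) if i not in leaked]
print(leaked, clean_test_texts)
```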

**Conclusion**

Model2Vec provides an efficient and fast solution for semantic deduplication. In these experiments, its deduplication step ran faster than both WordLlama and MinHash. Additionally, its ability to detect train-test overlap makes it a valuable tool for preparing clean datasets for machine learning tasks.

Credit:

All credit for this notebook goes to minishlab (https://huggingface.co/minishlab) as it originated from their repository. I have made two primary modifications for experimentation purposes:
- using the [Thaweewat/gte-multilingual-base-m2v-256 model](https://huggingface.co/Thaweewat/gte-multilingual-base-m2v-256), and
- using the [PolyAI/banking77 dataset](https://huggingface.co/datasets/PolyAI/banking77).
