diff --git a/tutorials/README.md b/tutorials/README.md index fc1992e7..6d9c6805 100644 --- a/tutorials/README.md +++ b/tutorials/README.md @@ -12,5 +12,4 @@ This is a list of all our tutorials. They are all self-contained ipython noteboo | | what? | Link | |--------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------| | **Recipe search** 🍝 | Learn how to do lightning-fast semantic search by distilling a small model. Compare a really tiny model to a larger with one with a better vocabulary. Learn what Fattoush is (delicious). | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/minishlab/model2vec/blob/master/tutorials/recipe_search.ipynb) | -| **Semantic deduplication** 🧹 | Learn how Model2Vec can be used to detect duplicate texts. Clean your dataset efficiently by finding both exact and semantic duplicates. Detect train-test leakage. | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/minishlab/model2vec/blob/master/tutorials/semantic_deduplication.ipynb) | | **Semantic chunking** 🧩 | Learn how to chunk your text into meaningful segments with [Chonkie](https://github.com/bhavnicksm/chonkie) at lightning-speed. Efficiently query your chunks with [Vicinity](https://github.com/MinishLab/vicinity). | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/minishlab/model2vec/blob/master/tutorials/semantic_chunking.ipynb) | diff --git a/tutorials/semantic_deduplication.ipynb b/tutorials/semantic_deduplication.ipynb deleted file mode 100644 index e3b30a56..00000000 --- a/tutorials/semantic_deduplication.ipynb +++ /dev/null @@ -1,552 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Semantic Deduplication with Model2Vec**\n", - "\n", - "In this tutorial, we’ll explore how Model2Vec can help identify duplicates in text data that traditional exact matching would miss. While exact matching works for identical texts, it fails to detect near-duplicatesβ€”documents that may differ slightly in wording but convey the same meaning. Using Model2Vec, we embed documents into vectors and measure their similarity. This allows us to catch both exact and semantic duplicates, improving the quality of our dataset. With Model2Vec’s speed and efficiency, we can very efficiently perform deduplication on large datasets, ensuring cleaner, more robust data for downstream tasks. Additionally, we will use Model2Vec to detect train-test overlap, ensuring our models are not overfitting." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!pip install datasets model2vec reach numpy wordllama tqdm datasketch\n", - "\n", - "from difflib import ndiff\n", - "from time import perf_counter\n", - "\n", - "from datasets import load_dataset\n", - "from datasketch import MinHash, MinHashLSH\n", - "import numpy as np\n", - "from model2vec import StaticModel\n", - "from reach import Reach\n", - "from tqdm import tqdm\n", - "from wordllama import WordLlama" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Loading data and model**\n", - "\n", - "We will use the AG News dataset and the Model2Vec pretrained model for deduplication." - ] - }, - { - "cell_type": "code", - "execution_count": 40, - "metadata": {}, - "outputs": [], - "source": [ - "# Load the model and dataset\n", - "model = StaticModel.from_pretrained(\"minishlab/M2V_base_output\")\n", - "ds = load_dataset(\"ag_news\")[\"train\"]\n", - "texts = ds['text']" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Exact overlap baseline**\n", - "\n", - "We will first try to find exact matches in the dataset as a baseline." - ] - }, - { - "cell_type": "code", - "execution_count": 96, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Number of deduplicated docs: 120000\n" - ] - } - ], - "source": [ - "seen = set()\n", - "deduplicated_text_indices = []\n", - "\n", - "for i, text in enumerate(texts):\n", - " if text not in seen:\n", - " deduplicated_text_indices.append(i)\n", - " seen.add(text)\n", - "\n", - "print(\"Number of deduplicated docs:\", len(deduplicated_text_indices))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "As can be seen, we find no duplicate instances using exact string matching. \n", - "\n", - "**Deduplication using Model2Vec**\n", - "\n", - "Let's now use Model2Vec to embed our documents and identify duplicates." - ] - }, - { - "cell_type": "code", - "execution_count": 41, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 118/118 [00:02<00:00, 45.65it/s]\n" - ] - } - ], - "source": [ - "# Encode texts into embeddings\n", - "embedding_matrix = model.encode(texts)" - ] - }, - { - "cell_type": "code", - "execution_count": 53, - "metadata": {}, - "outputs": [], - "source": [ - "def deduplicate(embedding_matrix: np.ndarray, threshold: float, batch_size: int = 1024) -> tuple[np.ndarray, dict[int, int]]:\n", - " \"\"\"\n", - " Deduplicate embeddings and return the deduplicated indices and a mapping of removed indices to their corresponding original indices.\n", - " \n", - " :param embedding_matrix: The embeddings to deduplicate.\n", - " :param threshold: The similarity threshold to use for deduplication.\n", - " :param batch_size: The batch size to use for similarity computation.\n", - " :return: A tuple containing the deduplicated indices and a dictionary mapping removed indices to original indices.\n", - " \"\"\"\n", - " reach = Reach(vectors=embedding_matrix, items=[str(i) for i in range(len(embedding_matrix))])\n", - " \n", - " # Use a set for deduplicated indices and keep track of duplicates\n", - " deduplicated_indices = set(range(len(embedding_matrix))) # Start with all indices as deduplicated\n", - " duplicate_to_original_mapping = {}\n", - "\n", - " results = reach.nearest_neighbor_threshold(\n", - " embedding_matrix, \n", - " threshold=threshold, \n", - " batch_size=batch_size, \n", - " show_progressbar=True\n", - " )\n", - " \n", - " # Process duplicates\n", - " for i, similar_items in enumerate(tqdm(results)):\n", - " if i not in deduplicated_indices:\n", - " continue # Skip already marked duplicates\n", - "\n", - " # Similar items are returned as (index, score), we are only interested in the index\n", - " similar_indices = [int(item[0]) for item in similar_items if int(item[0]) != i]\n", - " \n", - " # Mark similar documents as duplicates and map them to the original\n", - " for sim_idx in similar_indices:\n", - " if sim_idx in deduplicated_indices:\n", - " deduplicated_indices.remove(sim_idx)\n", - " duplicate_to_original_mapping[sim_idx] = i # Map duplicate to original\n", - "\n", - " return np.array(list(deduplicated_indices)), duplicate_to_original_mapping\n" - ] - }, - { - "cell_type": "code", - "execution_count": 81, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 118/118 [00:25<00:00, 4.64it/s]\n", - "100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 120000/120000 [00:00<00:00, 679800.97it/s]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Number of deduplicated docs: 118769\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "\n" - ] - } - ], - "source": [ - "# Deduplicate (with a high threshold)\n", - "deduplicated_indices, duplicate_to_original_mapping = deduplicate(embedding_matrix, threshold=0.99)\n", - "print(f\"Number of deduplicated docs: {len(deduplicated_indices)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Using Model2Vec, we find > 1000 duplicates with a very high threshold, in < 30 seconds. Now, let's look at a few examples to see if these are indeed duplicates." - ] - }, - { - "cell_type": "code", - "execution_count": 71, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Original text:\n", - "Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\\about the economy and the outlook for earnings are expected to\\hang over the stock market next week during the depth of the\\summer doldrums.\n", - "Duplicate text:\n", - "Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\\about the economy and the outlook for earnings are expected to\\hang over the stock market this week during the depth of the\\summer doldrums.\n", - "Differences:\n", - "- next + this\n", - "--------------------------------------------------\n", - "Original text:\n", - "Oil and Economy Cloud Stocks' Outlook NEW YORK (Reuters) - Soaring crude prices plus worries about the economy and the outlook for earnings are expected to hang over the stock market next week during the depth of the summer doldrums.\n", - "Duplicate text:\n", - "Oil and Economy Cloud Stocks' Outlook NEW YORK (Reuters) - Soaring crude prices plus worries about the economy and the outlook for earnings are expected to hang over the stock market this week during the depth of the summer doldrums.\n", - "Differences:\n", - "- next + this\n", - "--------------------------------------------------\n", - "Original text:\n", - "Phelps, Thorpe Advance in 200 Freestyle ATHENS, Greece - Michael Phelps took care of qualifying for the Olympic 200-meter freestyle semifinals Sunday, and then found out he had been added to the American team for the evening's 400 freestyle relay final. Phelps' rivals Ian Thorpe and Pieter van den Hoogenband and teammate Klete Keller were faster than the teenager in the 200 free preliminaries...\n", - "Duplicate text:\n", - "Phelps, Thorpe Advance in 200 Freestyle ATHENS, Greece - Michael Phelps took care of qualifying for the Olympic 200-meter freestyle semifinals Sunday, and then found out he had been added to the American team for the evening's 400 freestyle relay final. Phelps' rivals Ian Thorpe and Pieter van den Hoogenband and teammate Klete Keller were faster than the teenager in the 200 free preliminaries...\n", - "Differences:\n", - "\n", - "--------------------------------------------------\n", - "Original text:\n", - "Government Spending Up Sharply Locally Federal procurement spending in the Washington area rose last year at its highest rate since the 1980s, according to a study to be released today, creating tens of thousands of jobs and increasing economic growth disproportionately in Northern Virginia.\n", - "Duplicate text:\n", - "Government Spending Up Sharply Locally Federal procurement spending in the Washington area rose last year at its highest rate since the 1980s, according to a study to be released today, creating tens of thousands of jobs and increasing economic growth disproportionately in Northern Virginia.\n", - "Differences:\n", - "\n", - "--------------------------------------------------\n", - "Original text:\n", - "F.B.I. Goes Knocking for Political Troublemakers The F.B.I. has been questioning demonstrators in an effort to forestall violent protests at the Republican National Convention.\n", - "Duplicate text:\n", - "F.B.I. Goes Knocking for Political Troublemakers The F.B.I. has been questioning demonstrators in an effort to forestall violent protests at the Republican convention.\n", - "Differences:\n", - "- National - Convention. + convention.\n", - "--------------------------------------------------\n" - ] - } - ], - "source": [ - "def display_word_differences(x: str, y: str) -> str:\n", - " diff = ndiff(x.split(), y.split())\n", - " return \" \".join([f\"{word}\" for word in diff if word.startswith(('+', '-'))])\n", - "\n", - "# Show a few duplicates with their originals, highlighting word-level differences\n", - "num_examples = 5\n", - "for duplicate_idx, original_idx in list(duplicate_to_original_mapping.items())[:num_examples]:\n", - " print(f\"Original text:\\n{texts[original_idx]}\")\n", - " print(f\"Duplicate text:\\n{texts[duplicate_idx]}\")\n", - " print(\"Differences:\")\n", - " print(display_word_differences(texts[original_idx], texts[duplicate_idx]))\n", - " print(\"-\" * 50)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The found texts do indeed seem to be duplicates, nice! In a normal workflow where we use Model2Vec to embed our documents, deduplication our training corpus is essentially free. This gives us an easy to use, easy to integrate, fast way to deduplicate.\n", - "\n", - "**Deduplication using WordLlama**\n", - "\n", - "For comparison, let's also try a different library (WordLlama), which also uses static embeddings to deduplicate text data." - ] - }, - { - "cell_type": "code", - "execution_count": 84, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Number of deduplicated docs: 119128\n", - "Time taken: 42.821428374998504\n" - ] - } - ], - "source": [ - "wl = WordLlama.load()\n", - "\n", - "time = perf_counter()\n", - "deduplicated_docs = wl.deduplicate(texts, threshold=0.99)\n", - "print(f\"Number of deduplicated docs: {len(deduplicated_docs)}\")\n", - "print(f\"Time taken: {perf_counter() - time}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This approach is considerably slower than Model2Vec for encoding + deduplication (43 vs 27 seconds). It also finds less duplicates with the same threshold.\n", - "\n", - "**Deduplication using MinHash**\n", - "\n", - "As a last comparison, let's use MinHash, a common method for deduplication. We will use the datasketch library to find duplicates." - ] - }, - { - "cell_type": "code", - "execution_count": 77, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Number of deduplicated docs: 118653\n", - "Time taken: 56.46521229199425\n" - ] - } - ], - "source": [ - "def get_minhash(text: str, num_perm: int = 128) -> MinHash:\n", - " m = MinHash(num_perm=num_perm)\n", - " for word in text.split():\n", - " m.update(word.encode('utf8'))\n", - " return m\n", - "\n", - "def deduplicate_with_minhash(texts: list[str], threshold: float = 0.9) -> list[int]:\n", - " \"\"\"\n", - " Deduplicate texts using MinHash and return the indices of unique texts.\n", - "\n", - " :param texts: List of texts to deduplicate.\n", - " :param threshold: Jaccard similarity threshold for considering texts as duplicates.\n", - " :return: List of indices of deduplicated texts.\n", - " \"\"\"\n", - " lsh = MinHashLSH(threshold=threshold)\n", - " deduplicated_text_indices = []\n", - "\n", - " for i, text in enumerate(texts):\n", - " # Generate MinHash for the current text\n", - " minhash = get_minhash(text)\n", - "\n", - " # Check if the MinHash is already in the LSH (i.e., if it is a duplicate)\n", - " if not lsh.query(minhash):\n", - " # If it's not a duplicate, add the MinHash and keep the index\n", - " deduplicated_text_indices.append(i)\n", - " lsh.insert(i, minhash)\n", - "\n", - " return deduplicated_text_indices\n", - "\n", - "\n", - "time = perf_counter()\n", - "deduplicated_text_indices = deduplicate_with_minhash(texts)\n", - "print(f\"Number of deduplicated docs: {len(deduplicated_text_indices)}\")\n", - "print(f\"Time taken: {perf_counter() - time}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Model2Vec is again much faster, with 27 seconds vs 56 seconds for MinHash. The number of found duplicates is roughly the same using the default settings for MinHash.\n", - "\n", - "**Train test leakage detection using Model2Vec**\n", - "\n", - "Now, as a last experiment, let's also embed the test set, and see if there are any duplicates between the training and test set. This is a common issue in NLP, where the test set may contain instances that are also in the training set.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 90, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 118/118 [00:02<00:00, 45.36it/s]\n", - "100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 8/8 [00:00<00:00, 51.05it/s]\n", - "100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 8/8 [00:01<00:00, 5.40it/s]\n", - "100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 7600/7600 [00:00<00:00, 901108.42it/s]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Number of duplicates found between train and test: 138\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "\n" - ] - } - ], - "source": [ - "# Load the datasets\n", - "ds_train = load_dataset(\"ag_news\")[\"train\"]\n", - "ds_test = load_dataset(\"ag_news\")[\"test\"]\n", - "\n", - "texts_train = ds_train['text']\n", - "texts_test = ds_test['text']\n", - "\n", - "# Encode texts into embeddings\n", - "embedding_matrix_train = model.encode(texts_train)\n", - "embedding_matrix_test = model.encode(texts_test)\n", - "\n", - "def deduplicate_across_datasets(embedding_matrix_1: np.ndarray, embedding_matrix_2: np.ndarray, threshold: float, batch_size: int = 1024) -> tuple[list[int], dict[int, int]]:\n", - " \"\"\"\n", - " Deduplicate embeddings across two datasets and return the indices of duplicates between them.\n", - " \n", - " :param embedding_matrix_1: The embeddings of the first dataset (e.g., train).\n", - " :param embedding_matrix_2: The embeddings of the second dataset (e.g., test).\n", - " :param threshold: The similarity threshold to use for deduplication.\n", - " :param batch_size: The batch size to use for similarity computation.\n", - " :return: A tuple containing the duplicate indices and a dictionary mapping removed indices in the second dataset to their corresponding indices in the first dataset.\n", - " \"\"\"\n", - " reach = Reach(vectors=embedding_matrix_1, items=[str(i) for i in range(len(embedding_matrix_1))])\n", - "\n", - " # Keep track of duplicates in the second dataset\n", - " duplicate_indices_in_test = []\n", - " duplicate_to_original_mapping = {}\n", - "\n", - " # Find nearest neighbors from the test set in the train set\n", - " results = reach.nearest_neighbor_threshold(\n", - " embedding_matrix_2, \n", - " threshold=threshold, \n", - " batch_size=batch_size, \n", - " show_progressbar=True\n", - " )\n", - " \n", - " # Process duplicates\n", - " for i, similar_items in enumerate(tqdm(results)):\n", - " # Similar items are returned as (index, score), we are only interested in the index\n", - " similar_indices = [int(item[0]) for item in similar_items if item[1] >= threshold] # Keep those above the threshold\n", - " \n", - " # If we find a similar item in the train set, mark it as a duplicate\n", - " if similar_indices:\n", - " duplicate_indices_in_test.append(i)\n", - " duplicate_to_original_mapping[i] = similar_indices[0] # Map duplicate in test to original in train\n", - "\n", - " return duplicate_indices_in_test, duplicate_to_original_mapping\n", - "\n", - "# Check for train/test bleed\n", - "duplicate_indices_in_test, duplicate_to_original_mapping = deduplicate_across_datasets(\n", - " embedding_matrix_train, \n", - " embedding_matrix_test, \n", - " threshold=0.99 # High threshold for deduplication\n", - ")\n", - "\n", - "print(f\"Number of duplicates found between train and test: {len(duplicate_indices_in_test)}\")\n" - ] - }, - { - "cell_type": "code", - "execution_count": 99, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Train text:\n", - "Jackson Squares Off With Attorney SANTA MARIA, Calif. - Fans of Michael Jackson erupted in cheers Monday as the pop star emerged from a double-decker tour bus and went into court for a showdown with the prosecutor who has pursued him for years on child molestation charges...\n", - "Test text:\n", - "Jackson Squares Off With Prosecutor SANTA MARIA, Calif. - Fans of Michael Jackson erupted in cheers Monday as the pop star emerged from a double-decker tour bus and went into court for a showdown with the prosecutor who has pursued him for years on child molestation charges...\n", - "Differences:\n", - "- Attorney + Prosecutor\n", - "--------------------------------------------------\n", - "Train text:\n", - "Cassini Spies Two Moons Around Saturn (AP) AP - NASA's Cassini spacecraft has spied two new little moons around satellite-rich Saturn, the space agency said.\n", - "Test text:\n", - "Cassini Spies Two Little Saturn Moons (AP) AP - NASA's Cassini spacecraft has spied two new little moons around satellite-rich Saturn, the space agency said Monday.\n", - "Differences:\n", - "+ Little + Saturn - Around - Saturn - said. + said + Monday.\n", - "--------------------------------------------------\n", - "Train text:\n", - "Intel to Delay Product for High-Definition TVs SAN FRANCISCO (Reuters) - In the latest of a series of product delays, Intel Corp. has postponed the launch of a video display chip it had previously planned to introduce by year end, putting off a showdown with Texas Instruments Inc. in the fast-growing market for high-definition television displays.\n", - "Test text:\n", - "Intel to delay product aimed for high-definition TVs SAN FRANCISCO -- In the latest of a series of product delays, Intel Corp. has postponed the launch of a video display chip it had previously planned to introduce by year end, putting off a showdown with Texas Instruments Inc. in the fast-growing market for high-definition television displays.\n", - "Differences:\n", - "- Delay + delay - Product + product + aimed - High-Definition + high-definition + -- - (Reuters) - -\n", - "--------------------------------------------------\n", - "Train text:\n", - "Staples Profit Up Sharply, to Enter China NEW YORK (Reuters) - Staples Inc. <A HREF=\"http://www.investor.reuters.com/FullQuote.aspx?ticker=SPLS.O target=/stocks/quickinfo/fullquote\">SPLS.O</A>, the top U.S. office products retailer, on Tuesday reported a 39 percent jump in quarterly profit, raised its full-year forecast and said it plans to enter the fast-growing Chinese market.\n", - "Test text:\n", - "Staples Profit Up, to Enter China Market NEW YORK (Reuters) - Staples Inc. <A HREF=\"http://www.investor.reuters.com/FullQuote.aspx?ticker=SPLS.O target=/stocks/quickinfo/fullquote\">SPLS.O</A>, the top U.S. office products retailer, on Tuesday reported a 39 percent jump in quarterly profit, raised its full-year forecast and said it plans to enter the fast-growing Chinese market, sending its shares higher.\n", - "Differences:\n", - "- Up + Up, - Sharply, + Market - market. + market, + sending + its + shares + higher.\n", - "--------------------------------------------------\n", - "Train text:\n", - "Stocks Climb on Drop in Consumer Prices NEW YORK - Stocks rose for a second straight session Tuesday as a drop in consumer prices Tuesday allowed investors to put aside worries about inflation, at least for the short term. With gasoline prices falling to eight-month lows, the Consumer Price Index registered a small drop in July, giving consumers a respite from soaring energy prices...\n", - "Test text:\n", - "Stocks Climb on Drop in Consumer Prices NEW YORK - Stocks rose for a second straight session Tuesday as a drop in consumer prices allowed investors to put aside worries about inflation, at least for the short term. With gasoline prices falling to eight-month lows, the Consumer Price Index registered a small drop in July, giving consumers a respite from soaring energy prices...\n", - "Differences:\n", - "- Tuesday\n", - "--------------------------------------------------\n" - ] - } - ], - "source": [ - "# Show a few duplicates with their originals, highlighting word-level differences\n", - "num_examples = 5\n", - "for i, test_idx in enumerate(duplicate_indices_in_test[:num_examples]):\n", - " train_idx = duplicate_to_original_mapping[test_idx]\n", - "\n", - " print(f\"Train text:\\n{texts_train[train_idx]}\")\n", - " print(f\"Test text:\\n{texts_test[test_idx]}\")\n", - " print(\"Differences:\")\n", - " print(display_word_differences(texts_train[train_idx], texts_test[test_idx]))\n", - " print(\"-\" * 50)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "These again look like duplicates. We can very efficiently find train/test leakage examples using Model2Vec, ensuring that our test set is clean and does not contain any duplicates from the training set.\n", - "\n", - "**Conclusion**\n", - "\n", - "Model2Vec provides an efficient and fast solution for semantic deduplication, outperforming other methods like WordLlama and MinHash in terms of speed. Additionally, its ability to detect train-test overlap makes it a valuable tool for preparing clean datasets for machine learning tasks." - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.11" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -}