# News Stream Deduplication with Metric Learning (Contrastive Loss)

Our goal is to train a model that can embed news articles into a vector space such that similar articles are close together, and dissimilar articles are far apart. We will then use these embeddings to identify duplicates.

In this notebook, you will:
1.  Set up the environment and download the required datasets from Hugging Face.
2.  Load and explore the news data.
3.  Prepare the data for training (generating positive and negative pairs).
4.  Set up an appropriate embedding model and the Contrastive Loss function.
5.  Train the model on the prepared data (several times).
6.  Use the trained model to find duplicates in a test set.

**This is a heavily simplified version of the real code:**
1. We use a very crude mining process, but it is simple and makes it easy to see what is happening.
2. We use only 2 layers of data instead of 3.
3. We train only on a subsample of the data, because real training takes 3 hours.
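To make the goal concrete before any training, here is a minimal sketch of how embeddings can be turned into duplicate decisions: compare vectors with cosine similarity and flag pairs above a threshold. The 2-D vectors, the helper names, and the threshold of 0.9 are illustrative assumptions; real embeddings come from the trained model and the threshold is chosen on validation data.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot_product = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot_product / (norm_u * norm_v)

def find_duplicate_pairs(embeddings, threshold):
    """Return index pairs (i, j) with i < j whose similarity is >= threshold."""
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
                pairs.append((i, j))
    return pairs

# Toy 2-D "embeddings": items 0 and 1 point in nearly the same direction,
# item 2 is orthogonal to item 0.
toy_embeddings = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
]
duplicates = find_duplicate_pairs(toy_embeddings, threshold=0.9)
print(duplicates)
```

The all-pairs loop is quadratic in the number of articles, which is fine for a toy example but would need batching or an index on a real stream.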

# Prerequisites

In [None]:
%pip install datasets >> None
#%pip install sentence-transformers[train]==3.0.1 >> None
%pip install -U sentence-transformers transformers accelerate

import os
import random
from random import randint, choice
from typing import Union, List

import numpy as np
import pandas as pd
import requests
import torch
from numpy import dot
from numpy.linalg import norm
from sentence_transformers import SentenceTransformer, losses, evaluation
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize
from tqdm.auto import tqdm



In [None]:
def set_seed(seed_value=42):
    """
    Sets the seed for reproducibility across multiple libraries (random, numpy, torch).

    Args:
        seed_value (int): The integer value to use as the seed. Default is 42.
    """
    print(f"Setting seed to {seed_value}")

    # 1. Set the `PYTHONHASHSEED` environment variable. This only affects Python
    #    processes started afterwards, not the interpreter that is already running.
    os.environ['PYTHONHASHSEED'] = str(seed_value)

    # 2. Set the `python` built-in random number generator
    random.seed(seed_value)

    # 3. Set the `numpy` random number generator
    np.random.seed(seed_value)

    # 4. Set the `pytorch` random number generators (CPU and, if present, all GPUs)
    torch.manual_seed(seed_value)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed_value)

set_seed(42)

Setting seed to 42


In [None]:
def download_hf_datasets(dataset_name="AAAAAA2121/file_host", subfolder="main", files=None, save_dir="./data"):
    """
    Downloads specified files from a Hugging Face dataset repository.

    Args:
        dataset_name (str): The namespace/dataset_name on Hugging Face.
        subfolder (str): The branch or subfolder within the dataset (usually 'main').
        files (list): A list of filenames to download. If None, attempts to download
                      the default files specified in the function.
        save_dir (str): The local directory to save the files to.
    """
    if files is None:
        files = [
            "4traingpt.csv",
            "4trainnyan.csv",
            "gpt100k.csv",
            "nyan.csv"
        ]

    base_url = f"https://huggingface.co/datasets/{dataset_name}/resolve/{subfolder}/"

    # Create the save directory if it doesn't exist
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)
        print(f"Created directory: {save_dir}")

    print(f"Downloading files from {base_url}...")

    for file_name in files:
        file_url = base_url + file_name
        save_path = os.path.join(save_dir, file_name)

        # Check if file already exists
        if os.path.exists(save_path):
            print(f"File already exists: {file_name}. Skipping download.")
            continue

        try:
            print(f"Downloading {file_name}...")
            # Use stream=True for large files and process in chunks
            with requests.get(file_url, stream=True) as r:
                r.raise_for_status() # Raise an HTTPError for bad responses (4xx or 5xx)
                with open(save_path, 'wb') as f:
                    for chunk in r.iter_content(chunk_size=8192):
                        f.write(chunk)
            print(f"Successfully downloaded {file_name} to {save_path}")

        except requests.exceptions.RequestException as e:
            print(f"Error downloading {file_name}: {e}")
        except Exception as e:
            print(f"An unexpected error occurred while downloading {file_name}: {e}")

    print("Download process finished.")

# Data
In this section we download the data, prepare it, and mine negative samples.

In [None]:
download_hf_datasets()

### Generating Negative Samples

For training models with **Contrastive Loss**, we need pairs of data points:
* **Positive Pairs:** Items that are considered similar or duplicates.
* **Negative Pairs:** Items that are considered dissimilar or non-duplicates.

While creating positive pairs is usually straightforward (e.g., based on a 'true' duplicate ID, if available, or manually defined rules), generating effective *negative* pairs is crucial for successful training and often requires a specific strategy. Simply pairing random articles might result in negatives that are too easy for the model to distinguish. The `generate_neg_sample` function below implements a custom strategy for **structured negative sampling**.
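For comparison, the naive approach mentioned above can be sketched in a few lines: every positive pair gets label 1, and each negative is built by pairing a text with a random text from some other pair. The function name `build_naive_pairs` and the toy texts are illustrative assumptions and not part of the notebook's pipeline; the sketch also assumes at least two positive pairs with distinct texts.

```python
import random

def build_naive_pairs(positive_pairs, seed=42):
    """Label given pairs as positives and add one random negative per pair."""
    rng = random.Random(seed)
    rows = [{"post_1": a, "post_2": b, "target": 1} for a, b in positive_pairs]
    all_texts = [text for pair in positive_pairs for text in pair]
    for a, b in positive_pairs:
        # Any text that is neither side of this positive pair is a candidate negative.
        candidates = [text for text in all_texts if text not in (a, b)]
        rows.append({"post_1": a, "post_2": rng.choice(candidates), "target": 0})
    return rows

toy_positive_pairs = [
    ("Rocket docks with the station", "Spacecraft docked to the station module"),
    ("Fire in a shopping centre", "Shopping centre on fire downtown"),
    ("Court raises bail amount", "Bail increased by district court"),
]
naive_rows = build_naive_pairs(toy_positive_pairs)
for row in naive_rows:
    print(row["target"], "|", row["post_1"], "|", row["post_2"])
```

Negatives produced this way are usually about unrelated events, which is why the strategy described next also varies category and text type.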

**Function Purpose:**
This function iterates through each item in the input DataFrame (`df`) and attempts to find a *different* item to form a negative pair with it. It aims to create negative examples that are not just random, but potentially from related categories or different text types, making the learning task more robust.

**How it Works (Negative Sampling Strategy):**

For each item processed from your input DataFrame (let's call the text from this item `post_1`), the function randomly selects a potential source for its corresponding negative sample (`post_2`) based on two criteria extracted from your data:

1.  **Category:** Should `post_2` come from the *same* category as `post_1` or a *different* category?
2.  **Text Type:** Assuming your DataFrame has columns for original and potentially generated text (like 'gpt'), should `post_2` be of the *same* text type (original or GPT) as `post_1`, or the *other* text type?

This combination defines 4 different "modes" of sampling, which are chosen randomly (`randint(1,4)`) for each negative pair generated:

* **Mode 1:** Sample `post_2` from the **same category** as `post_1`, using the **same text column** (original or GPT) that `post_1` was taken from.
* **Mode 2:** Sample `post_2` from the **same category** as `post_1`, using the **other text column**.
* **Mode 3:** Sample `post_2` from a **different category** than `post_1`, using the **same text column** that `post_1` was taken from.
* **Mode 4:** Sample `post_2` from a **different category** than `post_1`, using the **other text column**.
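In the implementation further down, a mode number is decoded into two decisions: whether to filter candidates to the same category, and which of the two text columns to read `post_2` from, relative to the column `post_1` came from. A minimal sketch of that decoding, where the helper name `decode_mode` is an assumption introduced for illustration:

```python
def decode_mode(mode, post_1_col_index):
    """Map a sampling mode (1-4) to (same_category, sample_col_index).

    post_1_col_index is 0 or 1: the text column that post_1 was taken from.
    """
    if mode not in (1, 2, 3, 4):
        raise ValueError(f"Unknown sampling mode: {mode}")
    same_category = mode in (1, 2)
    if mode in (1, 3):
        sample_col_index = post_1_col_index            # same text column as post_1
    else:
        sample_col_index = (post_1_col_index + 1) % 2  # the other text column
    return same_category, sample_col_index

for mode in range(1, 5):
    print(mode, decode_mode(mode, post_1_col_index=0))
```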

**Finding a Unique Sample:**

After selecting a mode, the function samples an item according to that mode's criteria. It then checks that the sampled text differs from `post_1` (`if sampled_text != post_1_text`) and accepts it as `post_2` only in that case; otherwise it samples again.

**Handling Difficult Cases:**

* After every third failed attempt, the sampling `mode` is re-randomized to try a different strategy.
* If it still cannot find a distinct sample after 10 attempts in total (which could happen if a category is extremely small or contains only identical texts), it gives up and assigns `np.nan` as `post_2` for that specific pair, then moves to the next item. Such rows are removed later, when `run` calls `dropna()` on the loaded data.
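The retry logic can be isolated from the DataFrame handling. The sketch below is a simplified stand-in, not the notebook's function: `pools` is a hypothetical dict from mode number to candidate texts, the function returns `None` where the notebook uses `np.nan`, and it switches mode immediately when a pool is empty.

```python
import random

MAX_ATTEMPTS = 10      # total attempts before giving up
RESAMPLE_EVERY = 3     # re-randomize the mode after this many failed attempts

def sample_distinct(anchor, pools, rng):
    """Try to draw a text different from `anchor`; return None on failure."""
    mode = rng.randint(1, 4)
    for attempt in range(1, MAX_ATTEMPTS + 1):
        candidates = pools.get(mode, [])
        if candidates:
            sampled = rng.choice(candidates)
            if sampled != anchor:
                return sampled
        # Switch strategy if this mode has no candidates at all,
        # or periodically after repeated failures.
        if not candidates or attempt % RESAMPLE_EVERY == 0:
            mode = rng.randint(1, 4)
    return None

rng = random.Random(42)
good_pools = {mode: ["another article"] for mode in range(1, 5)}
stuck_pools = {mode: ["same article"] for mode in range(1, 5)}
found = sample_distinct("same article", good_pools, rng)
not_found = sample_distinct("same article", stuck_pools, rng)
print(found, not_found)
```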

**Output:**
The function iterates through the entire input DataFrame (`df`) and for each row, generates one negative pair based on the logic described above. It returns a new pandas DataFrame containing two columns, `post_1` and `post_2`, where each row represents a generated negative pair designed to be dissimilar.

This strategic approach to negative sampling, varying the source category and text type, helps create more diverse and potentially more challenging negative examples for the model to learn from. This is a form of **Negative Mining**, aiming to select negatives that are not trivially different, thereby improving the model's ability to make fine-grained distinctions.

In [None]:
import pandas as pd
import numpy as np
from random import choice, randint
from tqdm import tqdm

def generate_neg_sample(df: pd.DataFrame) -> pd.DataFrame:
    """
    Generates negative pairs for Contrastive Loss training by sampling
    a different article for each input article based on a random strategy.

    The sampling strategy considers the category of the original article
    and the type of text (original or generated) for both articles.

    Args:
        df (pd.DataFrame): Input DataFrame containing articles.
                           Assumes it has a 'category' column and
                           at least two text columns (assumed to be the first two).

    Returns:
        pd.DataFrame: A DataFrame with 'post_1' and 'post_2' columns
                      containing the generated negative pairs.
    """
    neg_pairs = {'post_1': [], 'post_2': []}

    # Assuming the first two columns of the DataFrame contain the texts (e.g., original and gpt)
    # Adjust text_columns if your text data is in different columns (e.g., ['text_orig', 'text_gpt'])
    text_columns = df.columns[:2]
    if len(text_columns) < 2:
         raise ValueError("DataFrame must have at least two text columns.")
    if 'category' not in df.columns:
         raise ValueError("DataFrame must have a 'category' column.")


    # Define constants for the sampling modes for clarity
    MODE_SAME_CAT_ORIG = 1
    MODE_SAME_CAT_OTHER_TEXT = 2 # Assumes 'other' text is GPT if orig is text_columns[0]
    MODE_DIFF_CAT_ORIG = 3
    MODE_DIFF_CAT_OTHER_TEXT = 4

    MAX_ATTEMPTS_PER_SAMPLE = 10 # Max attempts to find a unique negative sample
    RESAMPLE_MODE_THRESHOLD = 3 # Re-randomize mode after this many failed attempts

    print("Generating negative samples...")

    for index in tqdm(range(df.shape[0]), desc="Generating Negatives"):
        current_row = df.iloc[index]
        initial_category = current_row['category']

        # Randomly choose which text column from the current row will be 'post_1'
        post_1_col_index = randint(0, 1) # 0 or 1
        post_1_text = current_row[text_columns[post_1_col_index]]

        found_neg_text = None
        attempt_count = 0
        current_sampling_mode = randint(1, 4) # Start with a random mode

        # Loop to find a suitable negative sample (post_2) that is different from post_1
        while attempt_count < MAX_ATTEMPTS_PER_SAMPLE:
            attempt_count += 1

            # --- Step 1: Determine Filtering Criteria based on current_sampling_mode ---
            filter_by_same_category = current_sampling_mode in [MODE_SAME_CAT_ORIG, MODE_SAME_CAT_OTHER_TEXT]
            category_filter = df['category'] == initial_category if filter_by_same_category else df['category'] != initial_category

            # --- Step 2: Determine which Text Column to Sample from ---
            # Logic: Modes 1 & 3 sample from the same text type as post_1_col_index.
            # Modes 2 & 4 sample from the *other* text type.
            if current_sampling_mode in [MODE_SAME_CAT_ORIG, MODE_DIFF_CAT_ORIG]:
                 sample_col_index = post_1_col_index # Sample from the same text type column
            else: # Modes 2 and 4
                 sample_col_index = (post_1_col_index + 1) % 2 # Sample from the other text type column

            # --- Step 3: Filter DataFrame and Get List of Candidates ---
            candidate_df = df[category_filter]

            # Ensure there are candidates to sample from based on the filter
            if candidate_df.empty:
                 # No rows match this mode's category filter (e.g. the data has a single
                 # category), so pick a new mode right away instead of retrying the same one
                 current_sampling_mode = randint(1, 4)
                 continue # Skip sampling in this attempt, try again with the new mode

            # Drop missing values so that NaN is never sampled as a negative text
            candidate_texts = candidate_df[text_columns[sample_col_index]].dropna().tolist()

            # Ensure the list of potential texts is not empty after selecting column
            if not candidate_texts:
                 # The column holds only NaNs in the filtered rows, so try a different mode
                 current_sampling_mode = randint(1, 4)
                 continue # Skip sampling

            # --- Step 4: Randomly Sample and Check if Different ---
            sampled_text = choice(candidate_texts)

            if sampled_text != post_1_text:
                found_neg_text = sampled_text
                break # Successfully found a different negative sample

            # If sampled_text == post_1_text, the loop continues to the next attempt.
            # Check if we should re-randomize the mode for the next attempt.
            if attempt_count % RESAMPLE_MODE_THRESHOLD == 0:
                 current_sampling_mode = randint(1, 4)
                 # print(f"   Attempt {attempt_count}: Sampled same text. Re-randomizing mode to {current_sampling_mode}") # Optional log


        # --- After the while loop: Append the result ---
        # If found_neg_text is still None, it means we failed to find a distinct sample
        final_neg_text = found_neg_text if found_neg_text is not None else np.nan

        neg_pairs['post_1'].append(post_1_text)
        neg_pairs['post_2'].append(final_neg_text)

    print("Finished generating negative samples.")
    return pd.DataFrame(neg_pairs)


In [None]:
# Our GPT labeled data
df_gpt = pd.read_csv('./data/gpt100k.csv').rename({'original':'post_1', 'gpt':'post_2'}, axis=1)
print("GPT labeled data")
display(df_gpt)

# Our Nyan labeled data
df_pos = pd.read_csv('./data/nyan.csv')[['text', 'other_posts', 'category']]
print("Mined data from sources")
display(df_pos)

GPT labeled data


Unnamed: 0,post_1,post_2,category
0,А в России все тревожнее 🇺🇦 Украина Сейчас / П...,В России возрастает непокой 🇺🇦 Украина Сегодня...,Политика
1,Владимир Зеленский перепутал косоворотку с выш...,Твиттер раскритиковал Владимира Зеленского за ...,Политика
2,💥 Вот что слышат жители столицы 🇺🇦 Украина Сей...,🔥 Вот что ходят слухи в столице Украины 🇺🇦 Сей...,Политика
3,Вот так выглядит бой от первого лица. Защитник...,Вот так представляется сражение из первого лиц...,Политика
4,На пляже Брайтон-бич в США прошёл ЛГБТ-марш в ...,"В Соединенных Штатах, на пляже Брайтон-Бич, со...",Политика
...,...,...,...
106578,✔️Продолжение рубрики «фантастически лицемерны...,Выпуск новой рубрики «Обитатели Вашингтона: ли...,Политика
106579,"​​Оперативная сводка Генштаба, главное: ▪️На Х...","Оперативная сводка Генштаба, основные события:...",Политика
106580,Про вранье американских СМИ и фейки в Дагестан...,"Я хотела бы поделиться своим мнением о том, ка...",Политика
106581,"⚡️ Брифинг начальника войск радиационной, хими...",⚡️ Презентация генерал-лейтенанта Игоря Кирилл...,Политика


Mined data from sources


Unnamed: 0,text,other_posts,category
0,Корабль «Союз МС-24» пристыковался к модулю «Р...,['Корабль «Союз МС-24» пристыковался к модулю ...,Технологии
1,Из-за распыленного газа в суде девушку Саши Ск...,['Из-за распыленного газа в суде девушку Саши ...,Общее
2,Полицейского застрелили во время погони в Подм...,['Полицейского застрелили во время погони в По...,Общее
3,Шевченковский районный суд Киева увеличил разм...,['Шевченковский районный суд Киева увеличил ра...,Политика
4,Двухэтажный торговый центр горит на Бухарестск...,['Двухэтажный торговый центр горит на Бухарест...,Политика
...,...,...,...
86,Выжимка за 8 часов:\n\n🇷🇺⚔️🇺🇦 Конфликт России ...,['В Курской области под украинский обстрел поп...,Политика
87,Министр связи и средств массовой информации Ко...,['Министр связи и средств массовой информации ...,Политика
88,Средства ПВО уничтожили два украинских БПЛА на...,['Средства ПВО уничтожили два украинских БПЛА ...,Политика
89,Системы ПВО сработали в районе мыса Фиолент в ...,['Системы ПВО сработали в районе мыса Фиолент ...,Политика


In [None]:
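# Prep gpt data
# Seminar subsample; drop `.iloc[:100]` to run on the full data.
# `.copy()` avoids pandas' SettingWithCopyWarning on the assignments below.
df_gpt = df_gpt.iloc[:100].copy()
df_gpt = df_gpt.dropna()
# Discard pairs where either text contains "подписат" (a "subscribe" call to action)
df_gpt['post_1'] = df_gpt['post_1'].apply(lambda x: np.nan if "подписат" in x.lower() else x)
df_gpt['post_2'] = df_gpt['post_2'].apply(lambda x: np.nan if "подписат" in x.lower() else x)
df_gpt = df_gpt.dropna()
df_gpt['target'] = 1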

df_neg = generate_neg_sample(df_gpt)
df_neg['target'] = 0
df = pd.concat([df_gpt[['post_1', 'post_2', 'target']],df_neg]).sample(frac=1)
df.to_csv('./data/4traingpt_seminar.csv', index=False)

# For seminar we will use already prepared result
df = pd.read_csv('./data/4traingpt.csv')
display(df)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_gpt['target'] = 1


Generating negative samples...


Generating Negatives: 100%|██████████| 93/93 [00:00<00:00, 1568.71it/s]

Finished generating negative samples.





Unnamed: 0,post_1,post_2,target
0,Во время съемки фанат задал Криштиану вопрос о...,"Заниматься журналистикой можно до тех пор, пок...",0
1,Прокуратура Берлина пока не получала запроса о...,Как сообщает информационное агентство РИА Ново...,1
2,"Ситуация на Украине показала, что Евросоюзу не...",Tinder уходит из России. Но другие способы зна...,0
3,🔥 Сегодня «Спартак» и «Зенит» стартуют в новом...,Сегодня в новом сезоне РПЛ начинают соревноват...,1
4,Наблюдая за кадрами визита Путина в восстановл...,Основатель компании Forex Club (сейчас работае...,0
...,...,...,...
204475,Путин не имеет возможности следить за предвыбо...,"Президент Байден заявил, что возможные серьезн...",0
204476,Топ-республиканцы в Соединенных Штатах ассигно...,🚨 Возникают сообщения о взрывах в Винницкой об...,0
204477,🥇Вкалывают роботы - варят пельмени и взбалтыва...,"🥇Роботы усердно трудятся, готовя пельмени и см...",1
204478,Может ли доллар вскоре вернуться к отметке в 1...,Может ли курс доллара вернуться к отметке в 10...,1


In [None]:
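# Prep mined from web data
import ast

df_pos = df_pos.iloc[:100].copy()
# `other_posts` holds the string repr of a Python list; literal_eval parses it
# without executing arbitrary code, unlike eval
df_pos['other_posts'] = df_pos['other_posts'].apply(ast.literal_eval)
# Copy each cluster into `text` and explode both columns: this yields every
# ordered pair of posts within a cluster (self-pairs are removed below)
df_pos['text'] = df_pos['other_posts']
df_pos = df_pos.explode('other_posts')
df_pos = df_pos.explode('text')
# Blank out tokens containing '@' (e.g. @mentions)
df_pos['other_posts'] = df_pos['other_posts'].apply(lambda x: ' '.join([word if '@' not in word else '' for word in x.split()]))
df_pos['text'] = df_pos['text'].apply(lambda x: ' '.join([word if '@' not in word else '' for word in x.split()]))
# Discard texts containing "подписат" (a "subscribe" call to action)
df_pos['text'] = df_pos['text'].apply(lambda x: np.nan if "подписат" in x.lower() else x)
df_pos['other_posts'] = df_pos['other_posts'].apply(lambda x: np.nan if "подписат" in x.lower() else x)
# Discard pairs whose two texts are identical
df_pos['text'] = df_pos[['text', 'other_posts']].apply(lambda x: np.nan if x['text']==x['other_posts'] else x['text'], axis=1)
df_pos = df_pos.dropna()
df_pos = df_pos.rename({'text':'post_1', 'other_posts':'post_2'}, axis=1)
df_pos['target'] = 1

df_neg = generate_neg_sample(df_pos)
df_neg['target'] = 0
df = pd.concat([df_pos,df_neg]).sample(frac=1)[['post_1', 'post_2', 'target']]
# Note: this path is the same as the downloaded ./data/4trainnyan.csv, so the
# downloaded file is overwritten. Use a different filename to keep it.
df.to_csv('./data/4trainnyan.csv', index=False)

# Reload the pairs from disk (because of the overwrite above, this is the file we just wrote)
df = pd.read_csv('./data/4trainnyan.csv')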
display(df)

Generating negative samples...


Generating Negatives: 100%|██████████| 12746/12746 [00:26<00:00, 484.24it/s]


Finished generating negative samples.


Unnamed: 0,post_1,post_2,target
0,Падение беспилотного летательного аппарата про...,В Турине в результате падения реактивного само...,0
1,❗️В Воронежской области приземлили беспилотник...,Киев пытался атаковать объекты на территории Р...,1
2,Силы ПВО отразили атаку БПЛА на Москву в город...,ВСУ минувшей ночью пытались атаковать нескольк...,1
3,Польша с воскресенья запрещает въезд на свою т...,С 17 сентября Польша запретит въезд в страну г...,0
4,Із 17 вересня Польща забороняє в'їзд автомобіл...,"И конечно, Польша не могла пройти мимо — вслед...",1
...,...,...,...
25487,⚡️Польша с воскресенья запрещает въезд на свою...,❗️Польша с 17 сентября запрещает въезд в стран...,1
25488,Три человека погибли из-за взрыва гранаты во д...,"В Москве скончался актер Вячеслав Гришечкин, и...",0
25489,Министерство по региональной безопасности Туль...,"В Орле, Крыму и Тульской области упали беспило...",1
25490,В результате обстрела Харькова ранены пять чел...,5 мирных жителей ранены в результате удара по ...,1


## Training time!
The real training loop took more than 3 hours. To keep the seminar short, we train on a subsample of at most 15,000 positive and 15,000 negative pairs per dataset.

### Configuration (`CFG`)

The `CFG` dictionary holds the hyperparameters and settings for training. Note that the `run` function in this notebook reads only `BATCH_SIZE` and `EPOCHS`; the remaining keys are not used by the simplified code here.

* **`NUM_WORKERS`**: Number of subprocesses used for data loading. Suggested value: 4.
* **`BATCH_SIZE`**: The number of examples processed in one forward/backward pass during training. Suggested value: 64.
* **`EPOCHS`**: The number of times the training algorithm iterates over the entire training dataset. Suggested value: 5.
* **`SEED`**: A fixed number used to initialize random number generators (for sampling, shuffling, etc.) to ensure reproducibility. Suggested value: 42.
* **`LR`**: The initial learning rate for the optimizer, controlling the step size during model updates. Suggested value: 1e-5.
* **`SCHEDULER`**: The learning rate scheduler strategy ("CosineAnnealingWarmRestarts"). It adjusts the learning rate during training, which can improve convergence.
* **`T_0`**: The number of epochs before the first restart of the CosineAnnealingWarmRestarts scheduler. Suggested value: 3.
* **`min_lr`**: The minimum learning rate the scheduler can decay to. Suggested value: 1e-6.
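To show what `LR`, `T_0` and `min_lr` mean together, here is a small pure-Python sketch of the cosine-annealing-with-warm-restarts schedule. It assumes the closed form used by PyTorch's `CosineAnnealingWarmRestarts` with `T_mult=1` and one scheduler step per epoch; the function name is illustrative, and this schedule is not wired into the training code in this notebook.

```python
import math

def cosine_warm_restart_lr(epoch, base_lr=1e-5, min_lr=1e-6, t_0=3):
    """Learning rate at a given epoch; the cycle restarts every t_0 epochs."""
    t_cur = epoch % t_0  # position inside the current cycle
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t_cur / t_0))

schedule = [cosine_warm_restart_lr(epoch) for epoch in range(6)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.3e}")
```

With these values the rate starts at `LR`, decays along a cosine curve for three epochs, and jumps back to `LR` at epoch 3.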

In [None]:
CFG = {
    "NUM_WORKERS": 4,
    "BATCH_SIZE": 64,
    "EPOCHS": 5,
    "SEED": 42,
    "LR": 1e-5,
    "SCHEDULER": "CosineAnnealingWarmRestarts",
    "T_0": 3,
    "min_lr": 1e-6
}

In [None]:
class InputExample:
    """
    Structure for one input example with texts, the label and a unique id
    """
    def __init__(self, guid: str = '', texts: List[str] = None,  label: Union[int, float] = 0):
        """
        Creates one InputExample with the given texts, guid and label


        :param guid
            id for the example
        :param texts
            the texts for the example.
        :param label
            the label for the example
        """
        self.guid = guid
        self.texts = texts
        self.label = label

    def __str__(self):
        return "<InputExample> label: {}, texts: {}".format(str(self.label), "; ".join(self.texts))

## Training the Model: The `run` Function

The `run` function is the core component that orchestrates the entire process of training. It takes a model and a data file as input and manages the data handling, setup, and execution of the training loop.

Here's a high-level overview of what the `run` function does:

1.  **Load and Prepare Data:**
    * It begins by loading the specified CSV data file containing pairs of news texts and their binary labels (0 for non-duplicate, 1 for duplicate).
    * It performs basic cleaning by removing any rows with missing values.
    * **Crucially for this seminar:** It then applies a **sampling step**, selecting a limited number (up to 15,000 positive and 15,000 negative) of examples from the loaded data. This creates a smaller, more manageable dataset for faster training during the session, allowing us to see results without waiting for hours on large files.

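2.  **Split Data for Training and Validation:**
    * The sampled data is split 80/20 into training and validation sets with `train_test_split`, stratified by label so that both sets keep the same share of duplicates.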

3.  **Format Data for the Model:**
    * The training data is converted into a list of `InputExample` objects, a format specifically required by the `sentence-transformers` library. Each `InputExample` bundles a text pair and its corresponding label.
    * A PyTorch `DataLoader` is created from these `InputExample`s. The DataLoader manages batching and shuffling the training data during the training epochs.

4.  **Set Up Training Components:**
    * The appropriate loss function, **Contrastive Loss**, is initialized. This function calculates how well the model's embeddings distinguish between positive and negative pairs, guiding the learning process.
    * A **Binary Classification Evaluator** is set up using the validation data. It calculates metrics (such as accuracy, F1, precision and recall) on the held-out validation data so that we can monitor the model's progress and spot overfitting.

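5.  **Execute Training:**
    * `model.fit` trains for `CFG["EPOCHS"]` epochs, runs the evaluator after each epoch, and saves the best-scoring model to `./model`.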
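The evaluator in step 4 reports a threshold next to each metric (for example "Cosine Accuracy Threshold" in the training logs). The idea behind such a threshold can be sketched as a scan over candidate cut-offs on similarity scores. This is a simplified illustration with made-up scores, not the library's implementation.

```python
def best_accuracy_threshold(scores, labels):
    """Return (accuracy, threshold) for the best rule `score >= threshold -> duplicate`."""
    best_acc, best_thr = -1.0, None
    for thr in sorted(set(scores)):
        preds = [1 if score >= thr else 0 for score in scores]
        acc = sum(pred == label for pred, label in zip(preds, labels)) / len(labels)
        if acc > best_acc:  # keep the lowest threshold among ties
            best_acc, best_thr = acc, thr
    return best_acc, best_thr

# Toy cosine similarities for six validation pairs and their true labels.
toy_scores = [0.95, 0.80, 0.75, 0.40, 0.30, 0.10]
toy_labels = [1, 1, 0, 1, 0, 0]
acc, thr = best_accuracy_threshold(toy_scores, toy_labels)
print(f"best accuracy {acc:.3f} at threshold {thr}")
```

A threshold found this way on validation data is what you would later apply to unseen pairs when flagging duplicates.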

## Mini-Theory: Contrastive Loss

**Contrastive Loss** is a widely used loss function in **Metric Learning**. Its primary goal is to train a model to learn an embedding function $f(x)$ that maps data points $x$ into a vector space such that the distance between embeddings reflects their similarity in the original data space.

The core idea is simple:

* **Similar** data points should have **small** distances between their embeddings.
* **Dissimilar** data points should have **large** distances between their embeddings.

Contrastive Loss achieves this by operating on **pairs** of data points:

1.  **Positive Pairs:** Pairs of data points that are considered similar or related (e.g., two views of the same object, two paraphrased sentences about the same event). The loss function encourages the distance between their embeddings to be minimal.
2.  **Negative Pairs:** Pairs of data points that are considered dissimilar or unrelated. For these pairs, the loss function encourages the distance between their embeddings to be large, specifically **greater than a certain margin ($m$)**.

The loss function for a single pair $(x_i, x_j)$ with a binary label $y$ (where $y=1$ for a positive pair and $y=0$ for a negative pair) is typically defined as:

$$L(x_i, x_j, y) = y \cdot D(f(x_i), f(x_j)) + (1-y) \cdot \max(0, m - D(f(x_i), f(x_j)))$$

Where:

* $f(x)$ is the embedding function (our neural network).
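* $D(\cdot, \cdot)$ is a distance function (commonly squared Euclidean distance, $||f(x_i) - f(x_j)||_2^2$). Note that `losses.ContrastiveLoss` from `sentence-transformers`, which we use for training, defaults to cosine distance with a margin of 0.5 and squares both terms, so its values differ from this formula while the behavior is the same.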
* $m$ is the **margin**, a crucial hyperparameter.

**In simpler terms:**

* If the pair is **Positive** ($y=1$), the first term $D(f(x_i), f(x_j))$ contributes to the loss, pushing the embeddings closer.
* If the pair is **Negative** ($y=0$), the second term $\max(0, m - D(f(x_i), f(x_j)))$ contributes. This term is zero if the distance $D$ is already greater than or equal to the margin $m$. It only penalizes the model if the negative pair's embeddings are **too close** (closer than $m$), pushing them further apart just enough to exceed the margin.

By minimizing the sum of the losses over many positive and negative pairs, the model learns to create a useful embedding space for distinguishing similar from dissimilar items.
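A short numeric check of the formula above, written in plain Python. It follows the formula literally with squared Euclidean distance; the margin of 1.0 and the 2-D vectors are arbitrary values chosen for illustration.

```python
def squared_euclidean(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def contrastive_loss(u, v, y, margin=1.0):
    """L = y * D + (1 - y) * max(0, margin - D), with D the squared distance."""
    d = squared_euclidean(u, v)
    return y * d + (1 - y) * max(0.0, margin - d)

anchor = [0.0, 0.0]
near = [0.3, 0.4]   # squared distance 0.25
far = [3.0, 4.0]    # squared distance 25.0

loss_pos_near = contrastive_loss(anchor, near, y=1)  # small: positives already close
loss_pos_far = contrastive_loss(anchor, far, y=1)    # large: positives far apart
loss_neg_near = contrastive_loss(anchor, near, y=0)  # penalized: negatives inside the margin
loss_neg_far = contrastive_loss(anchor, far, y=0)    # zero: negatives beyond the margin
print(loss_pos_near, loss_pos_far, loss_neg_near, loss_neg_far)
```

The last case shows why easy negatives contribute nothing to training: once a negative pair is beyond the margin, its loss and gradient are zero.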

In [None]:
def run(model, file):
    df = pd.read_csv(file).dropna()

    # --- SEMINAR: Sample up to 15,000 positive and 15,000 negative examples (you don't want to wait 3 hours) ---
    # Separate positive and negative samples
    df_positive = df[df['target'] == 1].reset_index(drop=True)
    df_negative = df[df['target'] == 0].reset_index(drop=True)

    num_pos_samples = min(15_000, len(df_positive))
    num_neg_samples = min(15_000, len(df_negative))

    print(f"Sampling {num_pos_samples} positive and {num_neg_samples} negative examples.")

    # Sample the desired number of examples from each category
    sampled_pos = df_positive.sample(n=num_pos_samples, random_state=42).reset_index(drop=True)
    sampled_neg = df_negative.sample(n=num_neg_samples, random_state=42).reset_index(drop=True)

    # Concatenate the sampled data and shuffle it
    df = pd.concat([sampled_pos, sampled_neg]).sample(frac=1, random_state=42).dropna().reset_index(drop=True)

    # --- END SEMINAR SAMPLING ---
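    # Fix random_state so the train/validation split is reproducible across runs
    train, val = train_test_split(df, train_size=0.8, stratify=df['target'], shuffle=True, random_state=CFG["SEED"])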
    train = train.reset_index(drop=True)
    val = val.reset_index(drop=True)

    print(train.shape, val.shape)

    train_examples = []
    for i in range(train.shape[0]):
        example = train.iloc[i]
        train_examples.append(InputExample(texts=[example['post_1'], example['post_2']], label=example['target']))
    train_loader = torch.utils.data.DataLoader(train_examples, shuffle=True, batch_size=CFG["BATCH_SIZE"])

    evaluator = evaluation.BinaryClassificationEvaluator(sentences1=val['post_1'].values.tolist(),
                                                        sentences2=val['post_2'].values.tolist(),
                                                        labels=val['target'].values.tolist(),
                                                        batch_size=CFG['BATCH_SIZE'],
                                                        show_progress_bar=True,
                                                        name='eval_res',
                                                        write_csv=True)
    def cb(score, epoch, steps):
        print(score, epoch, steps)

    os.environ['WANDB_DISABLED'] = 'true'

    criterion = losses.ContrastiveLoss(model=model)
    model.fit(
        train_objectives=[(train_loader, criterion)],
        evaluator=evaluator,
        epochs=CFG["EPOCHS"],
        show_progress_bar=True,
        output_path='./model',
        save_best_model=True,
        callback=cb,
    )
    return model

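model_name = 'distiluse-base-multilingual-cased'

# Fall back to CPU when no GPU is available (training will be much slower)
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer(model_name, device=device)

# Stage 1: fine-tune on the GPT-paraphrased pairs
CFG["EPOCHS"] = 2
run(model, './data/4traingpt.csv')

# Stage 2: reload the best checkpoint saved by stage 1 and continue on the mined pairs
model = SentenceTransformer('./model', device=device)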

CFG["EPOCHS"] = 5
run(model, './data/4trainnyan.csv')

Sampling 15000 positive and 15000 negative examples.
(24000, 3) (6000, 3)


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).



Step,Training Loss,Validation Loss,Res Cosine Accuracy,Res Cosine Accuracy Threshold,Res Cosine F1,Res Cosine F1 Threshold,Res Cosine Precision,Res Cosine Recall,Res Cosine Ap,Res Cosine Mcc
375,No log,No log,0.993167,0.730812,0.993124,0.730812,0.999325,0.987000,0.998979,0.986408
500,0.005300,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log
750,0.005300,No log,0.997667,0.699933,0.997663,0.699933,0.999331,0.996000,0.999261,0.995339



0.9989789243289239 1.0 375



0.9992608018945678 2.0 750
Sampling 12692 positive and 12712 negative examples.
(20323, 3) (5081, 3)


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).

Step,Training Loss,Validation Loss,Res Cosine Accuracy,Res Cosine Accuracy Threshold,Res Cosine F1,Res Cosine F1 Threshold,Res Cosine Precision,Res Cosine Recall,Res Cosine Ap,Res Cosine Mcc
318,No log,No log,0.959654,0.719660,0.960109,0.719660,0.948481,0.972025,0.960346,0.919592
500,0.008500,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log
636,0.008500,No log,0.967920,0.704858,0.968563,0.679572,0.948621,0.989362,0.964642,0.936705
954,0.008500,No log,0.974808,0.729531,0.975299,0.716301,0.955749,0.995666,0.969322,0.950446
1000,0.004600,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log
1272,0.004600,No log,0.978154,0.724502,0.978551,0.724502,0.960182,0.997636,0.973844,0.957036
1500,0.003200,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log,No Log
1590,0.003200,No log,0.979728,0.724168,0.980081,0.724168,0.962400,0.998424,0.975348,0.960130


0.9603464074085777 1.0 318
0.9646422976726536 2.0 636
0.9693222340512365 3.0 954
0.9738441518472337 4.0 1272
0.9753479960563298 5.0 1590


SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Dense({'in_features': 768, 'out_features': 512, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
)
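
The printed module list shows three stages: a DistilBERT transformer, mean pooling over token vectors, and a dense layer with Tanh that maps 768 to 512 dimensions. As a rough illustration of the last two stages only, here is a minimal NumPy sketch. The random token vectors, the random weights, and the helper names `mean_pool` and `dense_tanh` are stand-ins invented for this example; this is not the library's implementation.

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average the token vectors of each sentence, ignoring padded positions."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid division by zero
    return summed / counts


def dense_tanh(x: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Linear projection followed by Tanh, as in the Dense module shown above."""
    return np.tanh(x @ weight + bias)


rng = np.random.default_rng(0)

# Hypothetical transformer output: 2 sentences, 4 token positions, 768 dimensions.
tokens = rng.normal(size=(2, 4, 768))
# 1 marks a real token, 0 marks padding.
mask = np.array([[1, 1, 1, 0],
                 [1, 1, 0, 0]])

pooled = mean_pool(tokens, mask)

# Random stand-in weights; the real layer uses trained parameters.
weight = rng.normal(scale=0.02, size=(768, 512))
bias = np.zeros(512)
sentence_emb = dense_tanh(pooled, weight, bias)

print(pooled.shape, sentence_emb.shape)
```

Because of the Tanh activation, every component of the final 512-dimensional embedding lies between -1 and 1.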

# Calculate Metrics

We now evaluate the trained Sentence Transformer model on the news deduplication task and look for the decision threshold on embedding distance that gives the best F1 score.

The key steps are:

* **Embedding Generation:** The trained `model` generates a vector embedding for each unique text in the dataset.
* **Similarity Scoring:** For every pair of articles, we compute the **cosine distance** (1 minus the cosine similarity) between the two embeddings. A smaller distance means the two texts are closer in the embedding space.
* **Threshold Optimization:** We sweep the threshold from 0.01 to 0.99 in steps of 0.01. For each value, we turn the distances into binary predictions by comparing them with the threshold and compute the F1 score against the true labels. The threshold with the highest F1 score is kept.
* **Final Evaluation:** The best threshold is applied to the cosine distances of the entire dataset to produce the final binary predictions (duplicate/non-duplicate), which are scored with F1, accuracy, precision, and recall. Because the threshold is tuned and scored on the same pairs, these metrics are optimistic.
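
The steps above can be sketched on toy data. This is a minimal illustration, not the notebook's evaluation: the 2-D vectors and labels are made up, and it assumes that label 1 marks a duplicate pair, so a pair is predicted as a duplicate when its cosine distance is at or below the threshold.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical embeddings for four article pairs (one row per pair).
emb_1 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]])
emb_2 = np.array([[0.9, 0.1], [1.0, 0.0], [1.0, 0.9], [-1.0, 0.2]])
# Assumed convention: 1 = duplicate, 0 = not a duplicate.
target = np.array([1, 0, 1, 0])


def cosine_distance(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine distance: 1 - cosine similarity."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - np.sum(a * b, axis=1)


dist = cosine_distance(emb_1, emb_2)

# Sweep thresholds 0.01 .. 0.99 and keep the one with the best F1.
best_f1, best_th = 0.0, 0.0
for th in np.arange(1, 100) / 100:
    pred = (dist <= th).astype(int)  # small distance -> duplicate
    f1 = f1_score(target, pred, zero_division=0)
    if f1 > best_f1:
        best_f1, best_th = f1, th

print(f"best threshold: {best_th:.2f}, F1: {best_f1:.3f}")
```

On this toy set the two duplicate pairs have distances well below 0.01 and the two non-duplicates have distances of 1.0 or more, so the smallest threshold already separates them perfectly.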

In [None]:
df = pd.read_csv('./data/4trainnyan.csv')

# Embed each unique text once. Texts are taken from both columns so that a post
# appearing only in `post_2` still gets an embedding.
df_td = pd.DataFrame(pd.concat([df['post_1'], df['post_2']]).dropna().unique())
df_td['emb'] = normalize(model.encode(df_td[0].tolist())).tolist()  # unit-length vectors
df = pd.merge(df, df_td.rename({0: 'post_1', 'emb': 'emb_1'}, axis=1), on='post_1', how='left')
df = pd.merge(df, df_td.rename({0: 'post_2', 'emb': 'emb_2'}, axis=1), on='post_2', how='left')
df = df.dropna()  # drop incomplete rows before computing distances

# Cosine distance of each pair: 1 - cosine similarity (vectors are already unit length).
df['cos_dist'] = df.apply(lambda x: 1.0 - float(np.dot(x['emb_1'], x['emb_2'])), axis=1)

# Sweep thresholds 0.01 .. 0.99 and keep the one with the best F1.
# Assumes `target == 1` marks a duplicate pair (the label convention ContrastiveLoss uses),
# so a pair is predicted as a duplicate when its distance is at or below the threshold.
df_td = df.copy()
max_f1 = 0
max_th = 0

for th_int in tqdm(range(1, 100)):
    th = th_int / 100
    df_td['ans'] = (df_td['cos_dist'] <= th).astype(int)
    f1 = f1_score(df_td['target'], df_td['ans'])
    if f1 > max_f1:
        max_f1 = f1
        max_th = th

# Final predictions with the best threshold.
df_td['ans'] = (df_td['cos_dist'] <= max_th).astype(int)
print(
    f"Threshold: {max_th}\n"
    f"F1: {f1_score(df_td['target'], df_td['ans'])}\n"
    f"Accuracy: {accuracy_score(df_td['target'], df_td['ans'])}\n"
    f"Precision: {precision_score(df_td['target'], df_td['ans'])}\n"
    f"Recall: {recall_score(df_td['target'], df_td['ans'])}"
)
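
The loop above checks 99 fixed cutoffs. A possible alternative is scikit-learn's `precision_recall_curve`, which evaluates every distinct score as a candidate threshold in one call. The sketch below runs on made-up arrays and assumes that label 1 marks a duplicate and that the score is cosine similarity (1 minus cosine distance), so a higher score means a more likely duplicate. The variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical labels and cosine distances for six pairs.
y_true = np.array([1, 1, 0, 1, 0, 0])
cos_dist = np.array([0.05, 0.10, 0.40, 0.20, 0.70, 0.15])

# precision_recall_curve expects higher scores for the positive class,
# so convert distance to similarity.
scores = 1.0 - cos_dist

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# precision and recall have one more entry than thresholds; drop the final
# (precision=1, recall=0) point so the arrays line up.
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / np.clip(p + r, 1e-12, None)

best = int(np.argmax(f1))
best_sim_th = thresholds[best]

# Predict duplicates for pairs whose similarity reaches the chosen threshold.
pred = (scores >= best_sim_th).astype(int)

print(f"similarity threshold: {best_sim_th:.2f}, F1: {f1[best]:.3f}")
print(pred.tolist())
```

A similarity threshold `t` corresponds to a distance threshold of `1 - t`. Applying the comparison on the same score array that produced the thresholds avoids floating-point mismatches at the boundary.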
