## **Abstract**

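This notebook demonstrates the feasibility and implementation of detecting data drift in text. An embedding model vectorizes each text, the embeddings of each dataset are averaged into a single mean vector, and two datasets are compared by the cosine similarity of their mean vectors, which reduces to a single dot product once the vectors are L2-normalized.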
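As a minimal sketch of that idea, the snippet below averages toy vectors into one centroid per dataset and checks that cosine similarity equals the dot product of the L2-normalized centroids. The 3-dimensional vectors and the helper names (`centroid`, `l2_normalize`, `dot`) are made up for illustration and are not model outputs or part of the notebook's code.

```python
import math

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b, in the range [-1, 1]."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy "embeddings" (hypothetical values chosen for illustration only)
reference_set = [[1.0, 0.0, 1.0], [0.8, 0.2, 1.2]]
similar_set = [[0.9, 0.1, 1.1], [1.1, 0.1, 0.9]]
shifted_set = [[0.0, 1.0, 0.2], [0.1, 0.9, 0.0]]

ref_c = centroid(reference_set)
sim_c = centroid(similar_set)
shift_c = centroid(shifted_set)

# Cosine similarity computed directly, and as a dot product of unit vectors
cos_direct = cosine_similarity(ref_c, sim_c)
cos_as_dot = dot(l2_normalize(ref_c), l2_normalize(sim_c))

print(f"reference vs similar: {cos_direct:.4f} (as dot product: {cos_as_dot:.4f})")
print(f"reference vs shifted: {cosine_similarity(ref_c, shift_c):.4f}")
```

The centroid of a dataset that points in a different direction scores lower against the reference centroid, which is the signal the rest of the notebook relies on.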


In [29]:
from transformers import AutoTokenizer, AutoModel
import torch
from huggingface_hub import snapshot_download
import pandas as pd

In [13]:
repo_id = "protostarss/distilbert_imdb_full"
# Download the model
model_path = snapshot_download(repo_id=repo_id)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)

def get_bert_embedding(text: str) -> torch.Tensor:
    """
    Generate embeddings for the input text using DistilBERT base model without classification head.

    Args:
        text (str): Input text to generate embeddings for.

    Returns:
        torch.Tensor: Embedding vector for the input text.
    """
    # Tokenize and prepare input
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    
    # Generate embeddings using base model without classification head
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    
    # Use the [CLS] token embedding (first token) from the last hidden state
    embeddings = outputs.last_hidden_state[:, 0, :]
    
    return embeddings



Fetching 8 files: 100%|██████████| 8/8 [00:00<?, ?it/s]


In [11]:
# Load the IMDB dataset (from Kaggle) and the TMDB dataset
df_imbd = pd.read_csv("IMDB Dataset.csv")
df_tmbd = pd.read_csv("TMDB Dataset.csv")

df_imbd.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [26]:
# Draw two independent random samples of 50 rows each from df_imbd
# df_tmbd is small (99 rows), so it is used in full rather than sampled
imbd_sample_1 = df_imbd.sample(n=50, random_state=66)
imbd_sample_2 = df_imbd.sample(n=50, random_state=99)


# Print shapes to verify
print(f"IMDB sample 1 shape: {imbd_sample_1.shape}")
print(f"IMDB sample 2 shape: {imbd_sample_2.shape}")
print(f"TMDB sample shape: {df_tmbd.shape}")

IMDB sample 1 shape: (50, 2)
IMDB sample 2 shape: (50, 2)
TMDB sample shape: (99, 2)


In [27]:
# Compute the mean embedding of each of the three datasets, then the cosine similarity between each pair of mean embeddings
def get_bert_embedding(text: str, model, tokenizer) -> torch.Tensor:
    """
    Generate BERT embeddings for a given text.

    Args:
        text (str): Input text to generate embeddings for.
        model: The BERT model to use for generating embeddings.
        tokenizer: The tokenizer to use for preprocessing the text.

    Returns:
        torch.Tensor: Embedding vector for the input text.
    """
    # Tokenize and prepare input
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    
    # Generate embeddings using base model without classification head
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    
    # Use the [CLS] token embedding (first token) from the last hidden state
    embeddings = outputs.last_hidden_state[:, 0, :]
    
    return embeddings

def calculate_dataset_embeddings(df: pd.DataFrame, text_column: str, model, tokenizer) -> torch.Tensor:
    """
    Calculate mean embeddings for a dataset.

    Args:
        df (pd.DataFrame): Input DataFrame containing the text data.
        text_column (str): Name of the column containing text to embed.
        model: The BERT model to use for generating embeddings.
        tokenizer: The tokenizer to use for preprocessing the text.

    Returns:
        torch.Tensor: Mean embedding vector for the dataset.
    """
    all_embeddings = []
    
    # Process each text in the dataset
    for text in df[text_column]:
        embedding = get_bert_embedding(text, model, tokenizer)
        all_embeddings.append(embedding)
    
    # Stack all embeddings and calculate mean
    stacked_embeddings = torch.stack(all_embeddings)
    mean_embedding = torch.mean(stacked_embeddings, dim=0)
    
    return mean_embedding

# Calculate embeddings for each dataset
imbd_embeddings_1 = calculate_dataset_embeddings(imbd_sample_1, 'review', model, tokenizer)
imbd_embeddings_2 = calculate_dataset_embeddings(imbd_sample_2, 'review', model, tokenizer)
tmbd_embeddings = calculate_dataset_embeddings(df_tmbd, 'reviews', model, tokenizer)

# Calculate cosine similarity between embeddings
def cosine_similarity(a: torch.Tensor, b: torch.Tensor) -> float:
    """
    Calculate cosine similarity between two tensors.

    Args:
        a (torch.Tensor): First tensor.
        b (torch.Tensor): Second tensor.

    Returns:
        float: Cosine similarity score between -1 and 1.
    """
    # Ensure tensors are 2D with shape [1, dim]
    a = a.reshape(1, -1)
    b = b.reshape(1, -1)
    
    # Calculate cosine similarity
    similarity = torch.nn.functional.cosine_similarity(a, b, dim=1)
    return similarity.item()

# Calculate and print similarities
print("\nCosine Similarities:")
print(f"IMDB Sample 1 vs IMDB Sample 2: {cosine_similarity(imbd_embeddings_1, imbd_embeddings_2):.4f}")
print(f"IMDB Sample 1 vs TMDB: {cosine_similarity(imbd_embeddings_1, tmbd_embeddings):.4f}")
print(f"IMDB Sample 2 vs TMDB: {cosine_similarity(imbd_embeddings_2, tmbd_embeddings):.4f}")


Cosine Similarities:
IMDB Sample 1 vs IMDB Sample 2: 0.9630
IMDB Sample 1 vs TMDB: 0.8859
IMDB Sample 2 vs TMDB: 0.7540


**Observation**:  
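We can see that the two IMDB samples are very similar to each other (0.9630), whereas the similarity between TMDB and either IMDB sample is noticeably lower (0.8859 and 0.7540). This suggests that the meaning and style of the TMDB reviews differ from those of the IMDB reviews, which is what data drift refers to. The gap between the two TMDB scores also shows that, with only 50 rows per IMDB sample, the result depends on which rows are drawn. Overall, using **text embeddings** to detect **data drift** is a reasonable method in this case.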
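One possible way to turn these scores into a drift decision is to treat the IMDB-vs-IMDB similarity as a baseline and flag any comparison that falls more than a chosen margin below it. The sketch below uses the three scores reported above. The function `flag_drift` and the margin of 0.05 are hypothetical choices for illustration, not something the notebook defines.

```python
def flag_drift(baseline_similarity, candidate_similarities, margin):
    """Return {name: True/False}, True when a score is more than `margin` below the baseline."""
    threshold = baseline_similarity - margin
    return {name: score < threshold for name, score in candidate_similarities.items()}

# Scores taken from the cosine-similarity output above
baseline = 0.9630  # IMDB sample 1 vs IMDB sample 2 (same source, no drift expected)
candidates = {
    "IMDB sample 1 vs TMDB": 0.8859,
    "IMDB sample 2 vs TMDB": 0.7540,
}

MARGIN = 0.05  # hypothetical tolerance; an assumption, not a value from the notebook

flags = flag_drift(baseline, candidates, MARGIN)
for name, drifted in flags.items():
    print(f"{name}: {'drift suspected' if drifted else 'no drift flagged'}")
```

A single baseline pair is a weak reference. A more careful setup would probably estimate the baseline from several in-domain resamples and pick the margin from their spread.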