#**Dataset Drift Detection**

#Drift Detection


**Steps carried out:**

**1. Data Preparation:**

- Loaded the Food-101 dataset and sampled a subset of its 101 classes (the code below uses a 15-class subset).
- Used a pre-trained ResNet model to extract image embeddings.

**2. Introducing Artificial Shift:**

- Introduced an artificial shift to the embeddings by adding Gaussian noise at gradually increasing levels. This created a transformed set for comparison with the original image embeddings.

**3. Drift Detection Methods:**

- Selected five embedding drift detection methods.
- Evaluated how each method's "drift score" responds to the level of artificial drift introduced.


**Drift Detection Methods**

1. Euclidean distance (takes values from 0 to infinity).

2. Cosine distance (takes values from 0 to 2).

3. Classifier model (ROC AUC takes values from 0 to 1).

4. Share of drifted embedding components (takes values from 0 to 1).

5. Maximum Mean Discrepancy (MMD) (takes values of zero or above).
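
To give a feel for how an MMD score behaves, here is a minimal sketch of a biased squared-MMD estimate with an RBF kernel, assuming NumPy arrays of shape (n_samples, n_features). The function name `mmd_rbf`, the median-heuristic bandwidth, and the synthetic data are illustrative assumptions, not the author's implementation.

```python
import numpy as np

def mmd_rbf(x, y, gamma=None):
    """Biased estimate of squared MMD between samples x and y (RBF kernel)."""
    z = np.vstack([x, y])
    # Pairwise squared Euclidean distances between all rows of z
    sq_norms = (z ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * z @ z.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative values
    if gamma is None:
        # Median heuristic (an assumption of this sketch)
        gamma = 1.0 / np.median(sq_dists[sq_dists > 0])
    k = np.exp(-gamma * sq_dists)
    n = len(x)
    k_xx, k_yy, k_xy = k[:n, :n], k[n:, n:], k[:n, n:]
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()

rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 16))         # synthetic "original" sample
same = rng.normal(size=(200, 16))              # drawn from the same distribution
shifted = rng.normal(loc=0.5, size=(200, 16))  # mean-shifted distribution

mmd_same = mmd_rbf(reference, same)
mmd_shifted = mmd_rbf(reference, shifted)
print(f"MMD^2 (same distribution):    {mmd_same:.4f}")
print(f"MMD^2 (shifted distribution): {mmd_shifted:.4f}")
```

The score stays close to zero when both samples come from the same distribution and grows as the distributions move apart.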


In [None]:
import os
from torchvision import models
from torch import nn
import torch
import torch.optim as optim
from torch.optim import lr_scheduler
from torch.types import Device
import torchvision.transforms as transforms
from torchvision.datasets import ImageFolder
from torch.utils.data.dataloader import DataLoader
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

#Get Data



In [None]:
!git clone https://github.com/AbigailUchennaNkama/model-drift-simulation
!mv model-drift-simulation/drift_modules .
!mv model-drift-simulation/model/food_model.pth .
!rm -rf model-drift-simulation

Cloning into 'model-drift-simulation'...
remote: Enumerating objects: 415, done.
remote: Counting objects: 100% (248/248), done.
remote: Compressing objects: 100% (157/157), done.
remote: Total 415 (delta 155), reused 156 (delta 89), pack-reused 167
Receiving objects: 100% (415/415), 253.46 MiB | 21.01 MiB/s, done.
Resolving deltas: 100% (250/250), done.
Updating files: 100% (42/42), done.


In [None]:
!python /content/drift_modules/get_food15_data.py

#Get accuracies for all levels of noise added

In [None]:
import torch
from torchvision import models, transforms
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

image_path = './data/food15c100_percent/correct_test'
loss_fn = torch.nn.CrossEntropyLoss()

def accuracy_fn(y_true, y_pred):
    correct = torch.eq(y_true, y_pred).sum().item()
    acc = (correct / len(y_pred))
    return acc

# Define noise levels
noise_levels = [0.00, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06]


# Transformation pipeline with noise
def add_noise(image, noise_level=0.1):
    # Convert the image to a tensor
    image_tensor = transforms.ToTensor()(image)
    # Generate noise with the same shape as the image tensor
    noise = torch.randn_like(image_tensor) * noise_level
    # Add noise to the image tensor
    noisy_image_tensor = image_tensor + noise
    # Clip the values to be within the valid range [0, 1]
    noisy_image_tensor = torch.clamp(noisy_image_tensor, 0, 1)
    # Convert the tensor back to PIL Image
    noisy_image = transforms.ToPILImage()(noisy_image_tensor)
    return noisy_image

# Create a separate DataLoader for each noise level
data_loaders = []
class_names_list = []
for noise_level in noise_levels:
    noisy_transform = transforms.Compose([
        transforms.Resize((224, 224)),
        # Bind noise_level as a default argument so each transform keeps its own value
        transforms.Lambda(lambda x, nl=noise_level: add_noise(x, nl)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])

    model_data_dir = ImageFolder(image_path, transform=noisy_transform)
    data_loader = DataLoader(model_data_dir, batch_size=64, shuffle=False)
    data_loaders.append(data_loader)
    class_names_list.append(model_data_dir.classes)

#load pretrained model
from drift_modules.load_model import load_custom_pretrained_model
loaded_food_model_c15 = load_custom_pretrained_model(model_path='./food_model.pth', num_classes=15)
class_names = model_data_dir.classes

loaded_food_model_c15.eval()

# Function to evaluate accuracy and loss for each DataLoader
def evaluate_model(loader, model):
    labels_list = []
    model.to(device)
    loss, acc = 0, 0

    model.eval()

    with torch.no_grad():
        for batch in loader:
            images, batch_labels = batch
            images, batch_labels = images.to(device), batch_labels.to(device)

            pred_logit = model(images)

            # Get predicted class index (logits -> argmax)
            pred_labels = pred_logit.argmax(dim=1)
            labels_list.append(batch_labels)

            loss += loss_fn(pred_logit, batch_labels)
            acc += accuracy_fn(y_true=batch_labels, y_pred=pred_labels)


    loss /= len(loader)
    acc /= len(loader)
    return acc, loss

# List to store accuracies
accuracies = []
losses = []

# Evaluate model for each noise level
for i, loader in enumerate(data_loaders):
    class_names = class_names_list[i]
    noise_level = noise_levels[i]
    print(f"\nNoise Level: {noise_level}")

    accuracy, average_loss = evaluate_model(loader, loaded_food_model_c15)
    accuracies.append(accuracy)  # Append accuracy to the list
    losses.append(average_loss)

    print(f"Accuracy: {accuracy:.2f}")
    print(f"Average Loss: {average_loss:.4f}")

# Now, the accuracies list contains the accuracy values for each noise level
print("Accuracies for each noise level:", accuracies)
print("loss for each noise level:", losses)



Noise Level: 0.0
Accuracy: 1.00
Average Loss: 0.0992

Noise Level: 0.01
Accuracy: 0.99
Average Loss: 0.1086

Noise Level: 0.02
Accuracy: 0.96
Average Loss: 0.1559

Noise Level: 0.03
Accuracy: 0.93
Average Loss: 0.2394

Noise Level: 0.04
Accuracy: 0.89
Average Loss: 0.3533

Noise Level: 0.05
Accuracy: 0.84
Average Loss: 0.5002

Noise Level: 0.06
Accuracy: 0.78
Average Loss: 0.6779
Accuracies for each noise level: [1.0, 0.9874739469065379, 0.9616813843791137, 0.9316174857393593, 0.8863742321193506, 0.8370104760860027, 0.7836564831066257]
loss for each noise level: [tensor(0.0992, device='cuda:0'), tensor(0.1086, device='cuda:0'), tensor(0.1559, device='cuda:0'), tensor(0.2394, device='cuda:0'), tensor(0.3533, device='cuda:0'), tensor(0.5002, device='cuda:0'), tensor(0.6779, device='cuda:0')]


#Get embeddings

In [None]:
from drift_modules.get_embeddings import get_embeddings
from torchvision import models
model = loaded_food_model_c15
embeddings, labels = get_embeddings(model, './data/food15c100_percent/correct_test')

In [None]:
#Load embeddings and labels
from joblib import load
import torch
loaded_labels = load('/content/labels.joblib')
loaded_embeddings = load('/content/embeddings.joblib')

In [None]:
print("Loaded Embeddings Shape:", loaded_embeddings.shape)
print("Loaded Labels Shape:", loaded_labels.shape)


Loaded Embeddings Shape: torch.Size([3371, 512])
Loaded Labels Shape: torch.Size([3371])


#**Visualize embeddings using t-SNE**

In [None]:
import torch
from sklearn.manifold import TSNE
import plotly.express as px
import pandas as pd


# Apply T-SNE to reduce the dimensionality of embeddings to 2D
tsne = TSNE(n_components=2, random_state=42)
tsne_embeddings = tsne.fit_transform(embeddings)

# Map class indices to class names (class_names was set in the evaluation cell above)
class_labels = [class_names[int(label)] for label in labels]

# Create a DataFrame for Plotly
df = pd.DataFrame({
    'x': tsne_embeddings[:, 0],
    'y': tsne_embeddings[:, 1],
    'label': class_labels
})

# Create an interactive scatter plot with class names as legends using Plotly
fig = px.scatter(df, x='x', y='y', color='label', labels={'label': 'Class'},
                 title='T-SNE Embeddings with Class Names as Legends')
fig.show()

#**Code for experiment**

In [None]:
import numpy as np
from scipy.spatial.distance import cosine, euclidean
from scipy.stats import ttest_ind
import plotly.graph_objects as go

# Define a function to add Gaussian noise to a PyTorch tensor
def add_noise(embeddings, noise_std=0.1):
    noise = torch.randn_like(embeddings) * noise_std
    noisy_embeddings = embeddings + noise
    return noisy_embeddings

# Function to calculate drift scores for all methods and return mean scores for each noise level
def calculate_drift_scores(original_embeddings, loaded_labels, noise_levels):
    drift_scores = {
        "cosine_distance": [],
        "euclidean_distance": [],
        "share_of_drifted_components": []
    }

    for noise_level in noise_levels:
        cosine_distances = []
        euclidean_distances = []
        share_drifted_components_list = []

        for i in range(len(original_embeddings)):
            noisy_embedding = add_noise(original_embeddings[i], noise_std=noise_level)

            cosine_distance = cosine(noisy_embedding.numpy().flatten(), original_embeddings[i].numpy().flatten())
            euclidean_distance = euclidean(noisy_embedding.numpy().flatten(), original_embeddings[i].numpy().flatten())

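            # Two-sample t-test between the components of the noisy and original embedding.
            # Note: the stored value is the p-value itself, not a thresholded share.
            _, p_value = ttest_ind(noisy_embedding.numpy().flatten(), original_embeddings[i].numpy().flatten())
            share_drifted_components = p_value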

            cosine_distances.append(cosine_distance)
            euclidean_distances.append(euclidean_distance)
            share_drifted_components_list.append(share_drifted_components)

        mean_cosine_distance = np.mean(cosine_distances)
        mean_euclidean_distance = np.mean(euclidean_distances)
        mean_share_drifted_components = np.mean(share_drifted_components_list)

        drift_scores["cosine_distance"].append(mean_cosine_distance)
        drift_scores["euclidean_distance"].append(mean_euclidean_distance)
        drift_scores["share_of_drifted_components"].append(mean_share_drifted_components)

    return drift_scores




# `accuracies` is the list computed in the evaluation cell above; the plot annotations use it.

def plot_drift_scores_plotly(drift_scores, noise_levels, detector_name):
    scores = drift_scores[detector_name]
    accuracies_text = '<br>'.join([f'Noise Level {noise_levels[i]}: Accuracy {accuracies[i]}' for i in range(len(noise_levels))])

    # Create a list of annotations
    annotations = [
        dict(
            x=noise_levels[i],
            y=scores[i],
            xref="x",
            yref="y",
            text=f'Accuracy: {round(accuracies[i],2)}',
            showarrow=True,
            arrowhead=2,
            ax=0,
            ay=-40
        ) for i in range(len(noise_levels))
    ]

    fig = go.Figure()
    fig.add_trace(go.Scatter(x=noise_levels, y=scores, mode='markers+lines', name=detector_name))
    fig.update_layout(
        title=f'{detector_name} Drift Scores vs. Noise Level',
        xaxis_title='Noise Level',
        yaxis_title=f'Mean {detector_name} Score',
        annotations=annotations
    )
    fig.show()

**Code for ROC AUC**

Train a classifier to discriminate between original and transformed (noisy) embeddings.

In [None]:
import torch
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import plotly.graph_objects as go

def add_noise(embeddings, noise_std=0.1):
    noise = torch.randn_like(embeddings) * noise_std
    noisy_embeddings = embeddings + noise
    return noisy_embeddings

def calculate_roc_auc_drift_scores(original_embeddings, noise_levels):
    roc_auc_scores = []
    for noise_level in noise_levels:
        noisy_embeddings = add_noise(original_embeddings, noise_std=noise_level)

        # Split data into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(
            torch.cat([noisy_embeddings, original_embeddings]),
            torch.cat([torch.ones_like(noisy_embeddings[:, :1]), torch.zeros_like(original_embeddings[:, :1])]),
            test_size=0.1, random_state=42
        )

        # Train RandomForestClassifier
        clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
        clf.fit(X_train.cpu().numpy(), y_train.cpu().numpy().ravel())

        # Calculate ROC AUC on the test data from the predicted probability of the noisy class (label 1)
        roc_auc_score_class = roc_auc_score(y_test.cpu().numpy().ravel(), clf.predict_proba(X_test.cpu().numpy())[:, 1])
        roc_auc_scores.append(roc_auc_score_class)
    return roc_auc_scores



def plot_scores_with_accuracies(accuracies, noise_levels, roc_auc_scores):

    fig = go.Figure()

    # Add individual embeddings' drift scores
    fig.add_trace(go.Scatter(x=noise_levels, y=roc_auc_scores, mode='markers+lines', name='ROC AUC Scores'))

    # Add accuracies to the plot
    fig.add_trace(go.Scatter(x=noise_levels, y=accuracies, mode='markers+lines', name='Accuracy'))

    fig.update_layout(
        title=f'roc_auc Drift Scores and Accuracies vs. Noise Level',
        xaxis_title='Noise Level',
        yaxis_title=f'Mean roc_auc Score / Accuracy',
        legend_title='Embedding',
        showlegend=True
    )

    fig.show()




#Visualize drift score per noise level added

**ROC AUC**

Classifier Model (ROC AUC): A classifier (a Random Forest in this case) is trained to distinguish noisy embeddings from original embeddings, and ROC AUC (area under the receiver operating characteristic curve) measures how well it does so on held-out data. The score ranges from 0 to 1. A score near 0.5 means the classifier does no better than chance, so the two sets are indistinguishable. This is the expected result at noise level 0.0: the two datasets are identical, so there are no differences for the classifier to find, and the score can even fall slightly below 0.5.
Scores above 0.5 indicate increasing ability to tell the two sets apart, with a score of 1 meaning the embeddings are perfectly separable (clear drift).
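
To make the interpretation of the score concrete, the sketch below computes ROC AUC directly as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The helper `auc_from_scores` and the synthetic scores are assumptions for illustration; they stand in for the classifier outputs and are not taken from the experiment.

```python
import numpy as np

def auc_from_scores(y_true, scores):
    """ROC AUC as P(random positive outscores random negative); ties count half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

rng = np.random.default_rng(0)
labels = np.array([0] * 500 + [1] * 500)  # 0 = original, 1 = noisy

# No drift: the scores carry no information about the label
uninformative = rng.uniform(size=1000)
# Strong drift: every noisy sample scores higher than every original sample
separable = np.concatenate([rng.uniform(0.0, 0.4, 500), rng.uniform(0.6, 1.0, 500)])

auc_no_drift = auc_from_scores(labels, uninformative)
auc_drift = auc_from_scores(labels, separable)
print(f"AUC without drift: {auc_no_drift:.3f}")
print(f"AUC with drift:    {auc_drift:.3f}")
```

With uninformative scores the value hovers around 0.5 and can land on either side of it, which is why a score at or slightly below 0.5 still means "no detectable drift".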

In [None]:
#get roc_auc scores
noise_levels = [0.00, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06]
roc_auc_scores = calculate_roc_auc_drift_scores(loaded_embeddings, noise_levels)

In [None]:
plot_scores_with_accuracies(accuracies, noise_levels, roc_auc_scores)

**Cosine Distance**

Cosine distance is one minus the cosine of the angle between two non-zero vectors. It quantifies how much the vectors differ in direction, irrespective of their magnitudes. In this case, the code computes the cosine distance between each original embedding and its noisy copy, then averages the distances for each noise level. Cosine distance ranges from 0 (same direction) to 2 (opposite direction).
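
A small sketch of the three reference points of the range, using a hand-written `cosine_distance` helper (an illustrative name; `scipy.spatial.distance.cosine`, used in the experiment code, computes the same quantity):

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 0.0])

d_same = cosine_distance(a, 3 * a)                      # same direction, larger magnitude
d_orthogonal = cosine_distance(a, np.array([0.0, 2.0]))  # perpendicular vectors
d_opposite = cosine_distance(a, -a)                     # opposite direction

print(d_same, d_orthogonal, d_opposite)
```

Scaling a vector leaves the distance at 0, which is why this metric only reacts to changes in direction.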

In [None]:
noise_levels = [0.00, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06]
drift_scores = calculate_drift_scores(loaded_embeddings, loaded_labels, noise_levels)
plot_drift_scores_plotly(drift_scores, noise_levels, 'cosine_distance')

**Euclidean Distance**

Euclidean distance measures the straight-line distance between two points in space. In this context, the code computes the distance between each original embedding and its noisy copy, then averages the distances for each noise level. The score ranges from 0 to infinity. A higher Euclidean distance indicates more significant drift between the embeddings.
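
Because the experiment code adds independent Gaussian noise to every component, the distance between an embedding and its noisy copy is just the length of the noise vector, which for a d-dimensional embedding is roughly sigma * sqrt(d). The sketch below checks this on synthetic data; the random embeddings are a stand-in, and only the dimension (512) comes from the shapes printed earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 512  # embedding size reported above
embeddings = rng.normal(size=(1000, dim))  # synthetic stand-in for real embeddings

mean_dists = {}
for sigma in [0.01, 0.03, 0.06]:
    # Add isotropic Gaussian noise and measure the per-embedding distance
    noisy = embeddings + rng.normal(scale=sigma, size=embeddings.shape)
    mean_dists[sigma] = np.linalg.norm(noisy - embeddings, axis=1).mean()
    print(f"sigma={sigma:.2f}  mean distance={mean_dists[sigma]:.3f}  "
          f"sigma*sqrt(d)={sigma * np.sqrt(dim):.3f}")
```

Under these assumptions the mean Euclidean drift score grows linearly with the noise level.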



In [None]:
#Plot euclidean distance
plot_drift_scores_plotly(drift_scores, noise_levels, 'euclidean_distance')

**Share of Drifted Embedding Components**

Share of Drifted Embedding Components (t-test): The t-test is a statistical test used to determine whether the means of two groups differ significantly. In this context, it assesses whether the component values of the noisy and original embeddings differ significantly in mean. A p-value below a chosen threshold (0.05 in this case) is taken as evidence of drift. Note that the code above stores the p-value itself, averaged over all embeddings for each noise level, rather than the fraction of components that fall below the threshold, so lower plotted values indicate stronger evidence of drift.
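
For comparison, one way to obtain an actual share between 0 and 1 is to test each embedding component (column) across all samples and report the fraction of components whose p-value falls below the threshold. The sketch below is a minimal illustration under stated assumptions: the function name, the synthetic data, and the normal approximation to the p-value are choices made here, not the author's method. `scipy.stats.ttest_ind` with `axis=0` could be used instead for exact t-distribution p-values.

```python
import math
import numpy as np

def share_of_drifted_components(reference, current, alpha=0.05):
    """Fraction of embedding components whose means differ significantly."""
    n_ref, n_cur = len(reference), len(current)
    # Welch's t statistic for every component (column) at once
    mean_diff = reference.mean(axis=0) - current.mean(axis=0)
    std_err = np.sqrt(reference.var(axis=0, ddof=1) / n_ref
                      + current.var(axis=0, ddof=1) / n_cur)
    t_stat = mean_diff / std_err
    # Two-sided p-value from the normal approximation (reasonable for large samples)
    p_values = np.array([math.erfc(abs(t) / math.sqrt(2.0)) for t in t_stat])
    return float((p_values < alpha).mean())

rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 20))
current = rng.normal(size=(500, 20))
current[:, :5] += 1.0  # shift the mean of the first 5 of 20 components

share = share_of_drifted_components(reference, current)
print(f"Share of drifted components: {share:.2f}")
```

Note that a test on means reacts to mean shifts; zero-mean Gaussian noise mostly changes the variance of each component, so a mean-based share is expected to respond only weakly to it.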

In [None]:
plot_drift_scores_plotly(drift_scores, noise_levels, 'share_of_drifted_components')