# Detect data drift with the k-core distribution

For each data point, we compute the cosine distance to its k-th nearest neighbor in embedding space. From the distribution of these distances, we then derive a threshold based on a predefined percentile; data points whose distance exceeds the threshold are candidates for data drift.

More information about this play can be found in the Spotlight documentation: [Detect data drift with k-core](https://renumics.com/docs/playbook/drift_kcore)

For more data-centric AI workflows, check out our [Awesome Open Data-centric AI](https://github.com/Renumics/awesome-open-data-centric-ai) list on GitHub.


## TL;DR

In [None]:
#@title Install required packages with PIP

!pip install renumics-spotlight datasets scikit-learn

In [None]:
#@title Play as copy-n-paste functions

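from sklearn.neighbors import NearestNeighbors
import pandas as pd
import numpy as np
import datasets
from renumics import spotlight


def compute_k_core_distances(df, k=8, embedding_name='embedding'):
    """Return the cosine distance to, and the index of, each row's k-th nearest neighbor."""
    # Stack the per-row embedding vectors into one 2D array (n_samples, n_dims)
    features = np.stack(df[embedding_name].to_numpy())
    neigh = NearestNeighbors(n_neighbors=k, metric='cosine')
    neigh.fit(features)
    # Without arguments, kneighbors() queries the fitted points themselves
    # and does not count a point as its own neighbor
    distances, indices = neigh.kneighbors()

    # The last column belongs to the k-th neighbor; reuse df's index so the
    # result lines up with df in pd.concat(..., axis=1)
    df_out = pd.DataFrame(index=df.index)
    df_out['k_core_distance'] = distances[:, -1]
    df_out['k_core_index'] = indices[:, -1]

    return df_out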
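To make the quantity computed by `compute_k_core_distances` more concrete, here is a minimal brute-force sketch in plain NumPy. It is not part of the original play: the helper name `kth_neighbor_cosine_distance` and the toy vectors are made up for illustration, and the sketch builds the full pairwise distance matrix, so it is only suitable for small inputs.

```python
import numpy as np


def kth_neighbor_cosine_distance(features, k=8):
    """Brute-force cosine distance to the k-th nearest neighbor of each row.

    Hypothetical helper for illustration; each point is excluded from its
    own neighbor list.
    """
    features = np.asarray(features, dtype=float)
    # Normalize rows so that the dot product equals the cosine similarity
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    dist = 1.0 - unit @ unit.T
    # Exclude each point from its own neighbor list
    np.fill_diagonal(dist, np.inf)
    # After sorting each row, column k-1 holds the distance to the k-th neighbor
    return np.sort(dist, axis=1)[:, k - 1]


# Toy data: three nearly parallel vectors and one pointing in another direction
toy = np.array([[1.0, 0.0], [1.0, 0.1], [1.0, -0.1], [0.0, 1.0]])
toy_distances = kth_neighbor_cosine_distance(toy, k=2)
print(toy_distances)
```

The last vector has no close neighbors, so its distance to the second-nearest neighbor is by far the largest of the four. This is the kind of signal the play looks for.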

## Step-by-step example on CIFAR-100

### Load CIFAR-100 from the Hugging Face Hub and convert it to a pandas DataFrame

In [None]:
dataset = datasets.load_dataset("renumics/cifar100-enriched", split="all")

df = dataset.to_pandas()


### Compute k-th nearest neighbor distances

In [None]:
df_kcore = compute_k_core_distances(df)
df = pd.concat([df, df_kcore], axis=1)
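The introduction mentions a percentile-based threshold on the distance distribution, which the cells in this notebook do not compute explicitly. One possible way to do it is sketched below on synthetic distances. The helper name `flag_drift_candidates` and the default percentile are assumptions made for illustration, not values prescribed by the play.

```python
import numpy as np


def flag_drift_candidates(k_core_distances, percentile=99.0):
    """Return (threshold, boolean mask) for distances above the given percentile.

    Hypothetical helper; the percentile is a tunable assumption.
    """
    distances = np.asarray(k_core_distances, dtype=float)
    threshold = np.percentile(distances, percentile)
    return threshold, distances > threshold


# Synthetic example: 98 small distances plus two clearly larger ones
rng = np.random.default_rng(0)
example = np.concatenate([rng.uniform(0.05, 0.15, size=98), [0.6, 0.7]])
threshold, is_candidate = flag_drift_candidates(example, percentile=98.0)
print(f"threshold={threshold:.3f}, candidates={int(is_candidate.sum())}")
```

On the notebook's dataframe, the same helper could be applied to `df['k_core_distance']` and the resulting mask stored as an extra column, which makes the candidates easy to filter when inspecting the data.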

### Inspect candidates for data drift with Spotlight

> ⚠️ Running Spotlight in Colab currently has severe limitations (slow, no similarity map, no layouts) due to Colab restrictions (e.g., no WebSocket support). Run the notebook locally for the full Spotlight experience.

In [None]:
df_show = df.drop(columns=['embedding', 'probabilities'])

# Google Colab needs special handling
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    # Visualization in Google Colab only works in Chrome and does not support
    # websockets, so we need some workarounds to visualize anything
    # Keep the first 10,000 rows; copy so the column assignments below act on a new frame
    df_show = df_show[:10000].copy()
    # Expose the first two components of the reduced embedding as plain columns
    df_show['embx'] = [emb[0] for emb in df_show['embedding_reduced']]
    df_show['emby'] = [emb[1] for emb in df_show['embedding_reduced']]
    port = 50123
    spotlight.show(df_show, port=port, dtype={"image": spotlight.Image, "embedding_reduced": spotlight.Embedding})
    from google.colab.output import eval_js  # type: ignore
    # Print the proxied URL for the Spotlight port
    print(str(eval_js(f"google.colab.kernel.proxyPort({port}, {{'cache': true}})")))

else:
    # Relative path: the layout file is looked up from the current working directory
    layout_file = "drift_kcore.json"
    spotlight.show(df_show, dtype={"image": spotlight.Image, "embedding_reduced": spotlight.Embedding}, layout=layout_file)