# Create image embeddings with Towhee

We use the [towhee library](https://github.com/towhee-io/towhee) to create embeddings for an image dataset.

More information about this play can be found in the Spotlight documentation: [Create image embeddings with the towhee library](https://renumics.com/docs/playbook/towhee-embedding)

For more data-centric AI workflows, check out our [Awesome Open Data-centric AI](https://github.com/Renumics/awesome-open-data-centric-ai) list on GitHub.

## TL;DR

In [None]:
#@title Install required packages with PIP
!pip install renumics-spotlight towhee datasets umap-learn

In [None]:
#@title Play as copy-n-paste functions

import datasets
from towhee import pipeline, DataCollection
from renumics import spotlight
import pandas as pd


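def towhee_embedding(df, modelname='towhee/image-embedding-swin-base-patch4-window7-224', image_name='image'):
    """Embed the images in df[image_name] and return a DataFrame with an 'embedding' column."""
    # Wrap the image column so the pipeline can be mapped over it
    dc = DataCollection(df[image_name])
    embedding_pipeline = pipeline(modelname)
    dc_embedding = dc.map(embedding_pipeline)

    # Reuse df's index so the result lines up with df in pd.concat(axis=1)
    df_emb = pd.DataFrame(index=df.index)
    df_emb['embedding'] = dc_embedding.to_list()

    return df_emb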


## Step-by-step example on CIFAR-100

### Load CIFAR-100 from Huggingface hub and convert it to Pandas dataframe

In [None]:
dataset = datasets.load_dataset("renumics/cifar100-enriched", split="train")
df = dataset.to_pandas()

### Compute embedding with vision transformer from towhee

In [None]:
df_emb = towhee_embedding(df, modelname='towhee/image-embedding-swin-base-patch4-window7-224')
df = pd.concat([df, df_emb], axis=1)
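
`pd.concat(..., axis=1)` aligns rows by index label, not by position, so the embedding frame has to share the index of `df` for the rows to line up. A minimal sketch with made-up toy frames (the `toy_` names and all values are illustrative, not taken from the dataset) shows what happens when the indexes differ and one way to guard against it:

```python
import pandas as pd

# Toy stand-in for a filtered dataset: the index no longer starts at 0.
toy_df = pd.DataFrame({"label": ["cat", "dog", "ship"]}, index=[5, 6, 7])

# Toy stand-in for an embedding frame with a default 0..2 index.
toy_emb = pd.DataFrame({"embedding": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]})

# Misaligned: concat matches index labels, so no rows pair up.
misaligned = pd.concat([toy_df, toy_emb], axis=1)

# Aligned: give the embedding frame the same index before concatenating.
aligned = pd.concat([toy_df, toy_emb.set_index(toy_df.index)], axis=1)

print(len(misaligned), len(aligned))
```

With a dataset loaded straight from the hub the index is the default one, so the problem only shows up after filtering or sampling rows.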

### Reduce embeddings for faster visualization

In [None]:
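import umap
import numpy as np

# Stack the per-row vectors into one (n_samples, n_features) matrix
embeddings = np.stack(df['embedding'].to_numpy())
# Project to 2D with UMAP's default settings
reducer = umap.UMAP()
reduced_embedding = reducer.fit_transform(embeddings)
df['embedding_reduced'] = np.array(reduced_embedding).tolist()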
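
The `np.stack` call matters because a DataFrame column of per-row vectors converts to a 1-D object array, while UMAP expects a 2-D matrix of shape `(n_samples, n_features)`. A small sketch of that conversion, using a made-up toy column and no UMAP dependency:

```python
import numpy as np
import pandas as pd

# Toy column of three 4-dimensional vectors (values are made up).
toy = pd.DataFrame({"embedding": [np.arange(4.0), np.ones(4), np.zeros(4)]})

column = toy["embedding"].to_numpy()  # 1-D object array holding 3 arrays
matrix = np.stack(column)             # 2-D float array

print(column.shape, column.dtype)  # (3,) object
print(matrix.shape, matrix.dtype)  # (3, 4) float64

# Going back to nested lists yields one plain list per row,
# which is the form stored in the DataFrame column.
toy["embedding_list"] = matrix.tolist()
```

The same pattern applies to the reduced embedding: `tolist()` turns the 2-D result into one list per row so it can be assigned as a column.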

### Perform EDA with Spotlight

> ⚠️ Running Spotlight in Colab currently has severe limitations (slow, no similarity map, no layouts) due to Colab restrictions (e.g. no WebSocket support). Run the notebook locally for the full Spotlight experience.

In [None]:
df_show = df.drop(columns=['embedding', 'probabilities'])

# Google Colab needs special handling
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    # Visualize only a subset in Google Colab
    port = 50123
    spotlight.show(df_show[:10000], port=port, dtype={"image": spotlight.Image, "embedding_reduced": spotlight.Embedding})

    from google.colab.output import eval_js  # type: ignore
    print(str(eval_js(f"google.colab.kernel.proxyPort({port}, {{'cache': true}})")))

else:
    spotlight.show(df_show, dtype={"image": spotlight.Image, "embedding_reduced": spotlight.Embedding})
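
The Colab branch above hinges on the `google.colab` module already being loaded in the running kernel. As a rough sketch, that check and the row cap could be factored into small helpers. `running_in_colab` and `rows_to_show` are hypothetical names, not part of Spotlight, and the 10,000-row limit simply mirrors the slice used above:

```python
import sys


def running_in_colab(modules=None):
    """Return True if the google.colab module has been loaded."""
    modules = sys.modules if modules is None else modules
    return "google.colab" in modules


def rows_to_show(n_rows, in_colab, colab_limit=10_000):
    """Number of rows to display: capped in Colab, all rows otherwise."""
    return min(n_rows, colab_limit) if in_colab else n_rows


print(rows_to_show(50_000, in_colab=True))   # 10000
print(rows_to_show(50_000, in_colab=False))  # 50000
```

Passing the module mapping as an argument keeps the check testable outside Colab.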