# Tutorial - Visualize Embedding Spaces with FiftyOne

#### Author: Antonio Rueda-Toicen
**antonio.rueda.toicen 'at' hpi 'dot' de**


[![Creative Commons License](https://i.creativecommons.org/l/by/4.0/88x31.png)](http://creativecommons.org/licenses/by/4.0/)

This work is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

## Overview

In this notebook we learn how to visualize an embedding space using FiftyOne after applying [Principal Component Analysis](https://en.wikipedia.org/wiki/Principal_component_analysis) to reduce its dimensionality. We also use `sklearn` to cluster the embeddings with the [K-means](https://en.wikipedia.org/wiki/K-means_clustering) algorithm.

## Image data
The folder with the image data can be found [here](https://drive.google.com/drive/folders/1oZOMfxEYcrYctZSdx3NHO8KTc0ETrUFI?usp=drive_link). You can add it to your own Drive by right-clicking the folder name -> Organize -> Add Shortcut to Drive. Select the "All locations" tab -> My Drive and then create a folder called `art_recommendation`. This lets you access the data without having to download it.

In [1]:
# Install fiftyone
!pip install fiftyone==1.4.1 > /dev/null


In [2]:
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


In [3]:
from pathlib import Path
import os
artist_name = 'Hokusai'
path = Path(f'/gdrive/MyDrive/art_recommendation/{artist_name}')

In [4]:
# here 'paintings' and 'paintings_embeddings.pickle' should both appear
os.listdir(path)

['paintings', 'Hokusai artworks', 'paintings_embeddings.pickle']

In [5]:
# Path where the images are
images_dir = path / "paintings"
images_dir

PosixPath('/gdrive/MyDrive/art_recommendation/Hokusai/paintings')

In [6]:
# Number of images
len(os.listdir(images_dir))

832
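
Note that `os.listdir` counts every entry in the folder, including hidden or non-image files. As a minimal sketch, the count can be restricted to image files. The extension set and the helper name `count_images` are assumptions made for illustration and are not part of the original notebook.

```python
from pathlib import Path
import tempfile

# Assumed set of image extensions; adjust it to match your data.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".gif", ".webp"}


def count_images(folder):
    """Count files in `folder` whose extension looks like an image."""
    return sum(
        1
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


# Small self-contained demo on a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    for name in ["a.jpg", "b.PNG", "notes.txt", ".DS_Store"]:
        (Path(tmp) / name).touch()
    demo_count = count_images(tmp)

print(demo_count)
```

In the notebook, the same helper could be called as `count_images(images_dir)`.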

## Create a FiftyOne dataset

In FiftyOne, a dataset is a collection of samples (here, images) to which we can add metadata (labels, quality metrics, embeddings, etc.).

In [9]:
import fiftyone as fo


dataset_name = f"{artist_name}_paintings"

# delete the dataset in case it exists already on the Colab instance
# (due to multiple evaluations of the code cell)
if fo.dataset_exists(dataset_name):
  fo.delete_dataset(dataset_name)

In [10]:
# this creates an empty dataset
dataset = fo.Dataset(dataset_name)
dataset

Name:        Hokusai_paintings
Media type:  None
Num samples: 0
Persistent:  False
Tags:        []
Sample fields:
    id:               fiftyone.core.fields.ObjectIdField
    filepath:         fiftyone.core.fields.StringField
    tags:             fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:         fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
    created_at:       fiftyone.core.fields.DateTimeField
    last_modified_at: fiftyone.core.fields.DateTimeField

In [11]:
# We use the location of the images_dir to add samples to the dataset
dataset.add_dir(images_dir, dataset_type=fo.types.ImageDirectory)

 100% |█████████████████| 832/832 [189.4ms elapsed, 0s remaining, 4.4K samples/s]     




['681656ce231c064dde263d59',
 '681656ce231c064dde263d5a',
 '681656ce231c064dde263d5b',
 '681656ce231c064dde263d5c',
 '681656ce231c064dde263d5d',
 '681656ce231c064dde263d5e',
 '681656ce231c064dde263d5f',
 '681656ce231c064dde263d60',
 '681656ce231c064dde263d61',
 '681656ce231c064dde263d62',
 '681656ce231c064dde263d63',
 '681656ce231c064dde263d64',
 '681656ce231c064dde263d65',
 '681656ce231c064dde263d66',
 '681656ce231c064dde263d67',
 '681656ce231c064dde263d68',
 '681656ce231c064dde263d69',
 '681656ce231c064dde263d6a',
 '681656ce231c064dde263d6b',
 '681656ce231c064dde263d6c',
 '681656ce231c064dde263d6d',
 '681656ce231c064dde263d6e',
 '681656ce231c064dde263d6f',
 '681656ce231c064dde263d70',
 '681656ce231c064dde263d71',
 '681656ce231c064dde263d72',
 '681656ce231c064dde263d73',
 '681656ce231c064dde263d74',
 '681656ce231c064dde263d75',
 '681656ce231c064dde263d76',
 '681656ce231c064dde263d77',
 '681656ce231c064dde263d78',
 '681656ce231c064dde263d79',
 '681656ce231c064dde263d7a',
 '681656ce231c

## Connect embeddings to the dataset

In [12]:
import pickle
with open(path /'paintings_embeddings.pickle', 'rb') as f:
    embeddings = pickle.load(f)


assert len(embeddings) == len(os.listdir(images_dir))
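
The assertion above only compares counts. A slightly stricter check, sketched below, looks for image paths that have no embedding at all. It assumes the pickle holds a dict keyed by `pathlib.Path` objects, which the later lookup `embeddings[Path(sample.filepath)]` suggests but which is not confirmed here. The helper name `find_missing_embeddings` and the toy data are hypothetical.

```python
from pathlib import Path


def find_missing_embeddings(embeddings, image_paths):
    """Return the image paths that have no entry in `embeddings`.

    Assumes `embeddings` is a dict keyed by pathlib.Path objects.
    """
    return sorted(p for p in map(Path, image_paths) if p not in embeddings)


# Toy data standing in for the real pickle contents (hypothetical values)
toy_embeddings = {
    Path("/data/paintings/a.jpg"): [0.1, 0.2],
    Path("/data/paintings/b.jpg"): [0.3, 0.4],
}
toy_paths = [
    "/data/paintings/a.jpg",
    "/data/paintings/b.jpg",
    "/data/paintings/c.jpg",
]

missing = find_missing_embeddings(toy_embeddings, toy_paths)
print(missing)
```

An empty result means every image can be matched to an embedding before the samples are updated.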




In [13]:
# Every sample is an image, here we connect a sample to its embedding
for sample in dataset:
  sample["embedding"] = embeddings[Path(sample.filepath)]
  sample.save()

dataset.save()

In [14]:
dataset.view()

Dataset:     Hokusai_paintings
Media type:  image
Num samples: 832
Sample fields:
    id:               fiftyone.core.fields.ObjectIdField
    filepath:         fiftyone.core.fields.StringField
    tags:             fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:         fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    created_at:       fiftyone.core.fields.DateTimeField
    last_modified_at: fiftyone.core.fields.DateTimeField
    embedding:        fiftyone.core.fields.VectorField
View stages:
    ---

## Principal Component Analysis

In [15]:
import fiftyone.brain as fob

embeddings = dataset.values('embedding')

pca_brain_key = f"{dataset_name}_pca"

# If a brain run with this key already exists, delete it so it can be recomputed
if dataset.has_brain_run(pca_brain_key):
    dataset.delete_brain_run(pca_brain_key)
    dataset.save()

# Compute 2D representation
results = fob.compute_visualization(
    dataset,
    embeddings=embeddings,
    num_dims=2,
    method="pca",
    brain_key=pca_brain_key,
    verbose=True,
    seed=51,
)

Generating visualization...
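
To make concrete what a PCA projection to two dimensions involves, here is a minimal NumPy sketch: center the data, take a singular value decomposition, and project onto the top two principal directions. It illustrates the technique on random data and is not FiftyOne's internal implementation. The function name `pca_project` and the toy array are assumptions.

```python
import numpy as np


def pca_project(X, num_dims=2):
    """Project the rows of X onto their top `num_dims` principal components."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, sorted by decreasing variance
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    points = X_centered @ Vt[:num_dims].T
    # Fraction of the total variance captured by each kept component
    explained = (S[:num_dims] ** 2) / np.sum(S ** 2)
    return points, explained


rng = np.random.default_rng(51)
toy_embeddings = rng.normal(size=(100, 16))  # 100 fake 16-dimensional embeddings
points, explained = pca_project(toy_embeddings, num_dims=2)
print(points.shape, explained)
```

The explained-variance fractions give a rough sense of how much structure the 2D plot can preserve; low values suggest the scatter plot hides much of the variation in the original space.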




## Clustering

In [21]:
import numpy as np
from sklearn.cluster import KMeans
import fiftyone as fo


# Choose the number of clusters
num_clusters = 5
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
labels = kmeans.fit_predict(embeddings)

# Assign each sample its cluster label
for sample, label in zip(dataset, labels):
    sample["k_means_cluster"] = int(label)
    sample.save()

print("Clusters computed and stored in each sample's 'k_means_cluster' field.")


Clusters computed and stored in each sample's 'k_means_cluster' field.


In [22]:
# Save the dataset to keep views in sync
dataset.save()
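
For intuition about what K-means does, the sketch below implements plain Lloyd's algorithm in NumPy: it alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. This is an illustration under simplifying assumptions (random initialization, a fixed iteration cap) and not `sklearn`'s implementation, which among other things defaults to k-means++ initialization. The function name `lloyd_kmeans` and the toy blobs are hypothetical.

```python
import numpy as np


def lloyd_kmeans(X, num_clusters, num_iters=50, seed=42):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize centroids with distinct randomly chosen data points
    centroids = X[rng.choice(len(X), size=num_clusters, replace=False)]
    for _ in range(num_iters):
        # Assignment step: nearest centroid by squared Euclidean distance
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        new_centroids = centroids.copy()
        for k in range(num_clusters):
            members = X[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[k] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Two well-separated toy blobs (hypothetical data)
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(20, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(20, 2))
toy_labels, toy_centroids = lloyd_kmeans(np.vstack([blob_a, blob_b]), num_clusters=2)
print(np.bincount(toy_labels, minlength=2))
```

Printing the cluster sizes with `np.bincount` is also a quick sanity check on real labels, since a cluster with very few members can hint that `num_clusters` is too high.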

## Launch the FiftyOne App

In [23]:
# Launch the FiftyOne App. If you get an error, try Runtime -> Run All.
# If that fails, go to Runtime -> Disconnect and Delete Runtime and start again.
session = fo.launch_app(dataset, auto=False)

Session launched. Run `session.show()` to open the App in a cell output.





Welcome to

███████╗██╗███████╗████████╗██╗   ██╗ ██████╗ ███╗   ██╗███████╗
██╔════╝██║██╔════╝╚══██╔══╝╚██╗ ██╔╝██╔═══██╗████╗  ██║██╔════╝
█████╗  ██║█████╗     ██║    ╚████╔╝ ██║   ██║██╔██╗ ██║█████╗
██╔══╝  ██║██╔══╝     ██║     ╚██╔╝  ██║   ██║██║╚██╗██║██╔══╝
██║     ██║██║        ██║      ██║   ╚██████╔╝██║ ╚████║███████╗
╚═╝     ╚═╝╚═╝        ╚═╝      ╚═╝    ╚═════╝ ╚═╝  ╚═══╝╚══════╝ v1.4.1

If you're finding FiftyOne helpful, here's how you can get involved:

|
|  ⭐⭐⭐ Give the project a star on GitHub ⭐⭐⭐
|  https://github.com/voxel51/fiftyone
|
|  🚀🚀🚀 Join the FiftyOne Discord community 🚀🚀🚀
|  https://community.voxel51.com/
|






In [24]:
# Copy the session URL and paste it into a separate browser window
session.url

'https://5151-m-s-3jd2r6yrvwkgw-c.us-east1-1.prod.colab.dev?polling=true'