#### Generating multi-modal embeddings

For our application, each item of data must be stored as a vector embedding of both its text and its image, so that we can look it up using either a text query or an uploaded image.

To do this, we need a model that encodes both text and images.

The CLIP model from OpenAI was suggested as one that might perform well:

*Radford, Alec, et al. "Learning transferable visual models from natural language supervision."*
*International conference on machine learning. PMLR, 2021.*

![Clip Model](./CLIP.png)

CLIP embeds the image and the text separately, each with its own encoder. The interesting property is that, for a pretrained CLIP model, an image and its matching text should produce similar embeddings. This is because training maximizes the cosine similarity between the embeddings of correct image-text pairs and minimizes the similarity between mismatched pairs.
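
To make the objective concrete, here is a sketch based on the pseudocode in the CLIP paper. The symbols $u_i$, $v_j$ and $\tau$ are my own notation, not the paper's. For a batch of $N$ image-text pairs with L2-normalized image embeddings $u_i$ and text embeddings $v_j$, the scaled similarities are $s_{ij} = u_i \cdot v_j / \tau$, where $\tau$ is a learned temperature. The loss is a symmetric cross-entropy over the rows and columns of the similarity matrix:

$$
\mathcal{L} = -\frac{1}{2N} \sum_{i=1}^{N} \left[ \log \frac{e^{s_{ii}}}{\sum_{j=1}^{N} e^{s_{ij}}} + \log \frac{e^{s_{ii}}}{\sum_{j=1}^{N} e^{s_{ji}}} \right]
$$

The first term rewards picking the correct text for each image, and the second rewards picking the correct image for each text.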

So which do we use?

For experimentation we have a few different ways we can encode our data:

1. Encode Image Only
2. Encode Description Only
3. Average the two embeddings
4. Encode the species and append it to the image encodings
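
Options 3 and 4 can be sketched without the model. This is a minimal, dependency-free illustration on toy 2-dimensional vectors standing in for the 512-dimensional CLIP embeddings. The helper names (`l2_normalize`, `average_embeddings`, `append_embeddings`) are hypothetical and not part of `open_clip`, and re-normalizing after averaging is my assumption about how the average would be used with cosine similarity.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def average_embeddings(image_emb, text_emb):
    """Option 3: average the two unit vectors, then re-normalize the result."""
    if len(image_emb) != len(text_emb):
        raise ValueError("embeddings must have the same length")
    a, b = l2_normalize(image_emb), l2_normalize(text_emb)
    return l2_normalize([(x + y) / 2 for x, y in zip(a, b)])

def append_embeddings(image_emb, species_emb):
    """Option 4: concatenate the species embedding onto the image embedding."""
    return l2_normalize(image_emb) + l2_normalize(species_emb)

# Toy stand-ins for CLIP vectors (hypothetical values).
image_emb = [3.0, 4.0]
text_emb = [4.0, 3.0]

averaged = average_embeddings(image_emb, text_emb)
appended = append_embeddings(image_emb, text_emb)
print(len(averaged), len(appended))
```

Note that appending doubles the dimensionality (512 to 1024 for this model), so a query vector would need to be built the same way before it can be compared against the stored vectors.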

The last one is a topic I may want to pursue further for research purposes. The intuition is that there should already be embeddings for species, but I have my doubts that they are accurate. Perhaps these models don't distinguish between "puma" and "mountain lion". In theory, species that are close in the taxonomic hierarchy should have similar embeddings. It could be worth fine-tuning these embeddings with an additional loss that minimizes the distance between instances that are close in the hierarchy of their species. For now, we should evaluate how well these embeddings cluster: do multiple instances of the same species cluster together in the embedding space?
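
One simple way to check the clustering question is to compare the average cosine similarity of same-species pairs against that of different-species pairs. The sketch below is only an illustration: the function names, the toy 2-dimensional vectors, and the "Lynx" label are all made up, and a real evaluation would use the CLIP embeddings and species names from our data.

```python
import math
from itertools import combinations
from statistics import fmean

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def species_separation(embeddings, labels):
    """Return (mean same-species similarity, mean cross-species similarity)."""
    same, different = [], []
    pairs = combinations(zip(embeddings, labels), 2)
    for (emb_a, label_a), (emb_b, label_b) in pairs:
        bucket = same if label_a == label_b else different
        bucket.append(cosine(emb_a, emb_b))
    # fmean raises on an empty list, so both kinds of pair must be present.
    return fmean(same), fmean(different)

# Toy vectors standing in for CLIP embeddings (hypothetical data).
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels = ["Puma", "Puma", "Lynx", "Lynx"]

within, between = species_separation(embeddings, labels)
print(f"within-species: {within:.3f}, between-species: {between:.3f}")
```

If the embeddings cluster by species, the within-species mean should be clearly higher than the between-species mean.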



In [1]:
## example taken from https://github.com/mlfoundations/open_clip

import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
model.eval()  # model in train mode by default, impacts some models with BatchNorm or stochastic depth active
tokenizer = open_clip.get_tokenizer('ViT-B-32')



In [2]:
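# Preprocess the image into a batch of one and tokenize the candidate captions.
image = preprocess(Image.open("./CLIP.png")).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

# Encode without tracking gradients; autocast uses mixed precision on CUDA.
with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # L2-normalize so the dot product below is a cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Scale the similarities and softmax over the three captions.
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)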



In [3]:
print("picked class:")
print(text_probs)

print("Image Feature Length",len(image_features[0]))
print("Text Feature Length",len(text_features[0]))

print("Image embedding sample",image_features[0][0:10])
print("Text embedding diagram sample",text_features[0][0:10])
print("Text embedding dog sample",text_features[1][0:10])
print("Text embedding cat sample",text_features[2][0:10])

picked class:
tensor([[9.9950e-01, 4.1210e-04, 8.5326e-05]])
Image Feature Length 512
Text Feature Length 512
Image embedding sample tensor([-0.0186, -0.0976,  0.0265, -0.0389, -0.0357,  0.0116,  0.0391,  0.0195,
         0.0531,  0.0391])
Text embedding diagram sample tensor([-0.0033, -0.0355, -0.0484, -0.0160, -0.0174, -0.0327,  0.0169,  0.0123,
         0.0205, -0.0194])
Text embedding dog sample tensor([-0.0298, -0.0092, -0.0168, -0.0052, -0.0099,  0.0167, -0.0019, -0.0047,
        -0.0017,  0.0064])
Text embedding cat sample tensor([-0.0036, -0.0038,  0.0033, -0.0127, -0.0024,  0.0150,  0.0214, -0.0203,
        -0.0182,  0.0171])


This example does a decent job of picking the caption that matches the image, and we should be able to run the same code on batches of images. For now, let's save time and make 3 encodings:
1. Species Name
2. Description
3. Text

In [4]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import data_prep
import json

In [22]:
# Load the sample occurrence records and parse them into a DataFrame.
with open("./data/sample_occurences.json", "rb") as f:
    occurences = json.load(f)

df = data_prep.parse_records(occurences)
occurences  # display the raw records

[{'id': 4510345615, 'speciesKey': 2435099, 'scienceName': 'Puma concolor (Linnaeus, 1771)', 'simpleName': 'Puma', 'dataset': 'iNaturalist research-grade observations', 'media': '', 'media_location': 'https://inaturalist-open-data.s3.amazonaws.com/photos/344229334/original.jpeg'}, {'id': 4510345615, 'speciesKey': 2435099, 'scienceName': 'Puma concolor (Linnaeus, 1771)', 'simpleName': 'Puma', 'dataset': 'iNaturalist research-grade observations', 'media': '', 'media_location': 'https://inaturalist-open-data.s3.amazonaws.com/photos/344229310/original.jpeg'}, {'id': 4510345615, 'speciesKey': 2435099, 'scienceName': 'Puma concolor (Linnaeus, 1771)', 'simpleName': 'Puma', 'dataset': 'iNaturalist research-grade observations', 'media': '', 'media_location': 'https://inaturalist-open-data.s3.amazonaws.com/photos/344229315/original.jpeg'}, {'id': 4510345615, 'speciesKey': 2435099, 'scienceName': 'Puma concolor (Linnaeus, 1771)', 'simpleName': 'Puma', 'dataset': 'iNaturalist research-grade observa

Let's write the code to go through the images 1 or 2 at a time. This will simulate processing batches of images.
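
A minimal sketch of the batching step follows. The `batched` helper and the example URLs are hypothetical; in the notebook the items would come from the `media_location` column of `df`.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Hypothetical stand-ins for the image URLs stored in `df`.
urls = [f"https://example.org/photo_{i}.jpeg" for i in range(5)]

batches = list(batched(urls, 2))
for batch in batches:
    # In the notebook, each batch would be downloaded, preprocessed,
    # stacked into one tensor, and encoded in a single model call.
    print(len(batch), batch[0])
```

The last batch may be smaller than `batch_size`, so downstream code should not assume a fixed batch dimension.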

In [23]:
df.iloc[0]

id                                                       4510345615
speciesKey                                                  2435099
scienceName                          Puma concolor (Linnaeus, 1771)
simpleName                                                     Puma
dataset                     iNaturalist research-grade observations
media                                                              
media_location    https://inaturalist-open-data.s3.amazonaws.com...
Name: 0, dtype: object

In [10]:
data_prep.echo()

Echo
