# LeonardoAI Challenge  

Measuring the similarity scores of text-image pairs.  

Done by: Trung Nguyen.  

This work relies on OpenAI's CLIP model.  
There are different variants of CLIP with varying model sizes and accuracies. As a rule of thumb, the larger the model, the better its scores indicate how similar a text-image pair is.  
The score is the scaled cosine similarity of the image and text embeddings and, in practice, ranges from 0 to 100. The higher the score, the more similar the pair.  
It may be easier to judge how similar a text and an image are by scoring several candidates (images or text sequences) and comparing them with softmax.  
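
To make the scoring and the softmax comparison concrete, here is a minimal sketch in plain Python. It assumes the score is the cosine similarity of the image and text embeddings multiplied by a scale factor of about 100, and it uses made-up three-dimensional embeddings, so the numbers are illustrative only. The helper names `cosine_similarity`, `clip_style_score` and `softmax` are hypothetical and are not part of the `transformers` API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_style_score(image_emb, text_emb, logit_scale=100.0):
    """Scaled cosine similarity; logit_scale=100 is an assumption."""
    return logit_scale * cosine_similarity(image_emb, text_emb)


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up embeddings: one image and three candidate captions.
image_emb = [0.2, 0.9, 0.1]
candidate_text_embs = {
    "a red dress": [0.25, 0.85, 0.05],
    "a kitchen": [0.9, 0.1, 0.3],
    "a vector pattern": [0.1, 0.2, 0.95],
}

scores = {
    caption: clip_style_score(image_emb, emb)
    for caption, emb in candidate_text_embs.items()
}
probs = dict(zip(scores, softmax(list(scores.values()))))
best_caption = max(probs, key=probs.get)

for caption in scores:
    print(f"{caption}: score={scores[caption]:.1f}, prob={probs[caption]:.3f}")
print("Best match:", best_caption)
```

With real CLIP outputs, the same comparison can be done by passing several captions to the processor and calling `logits_per_image.softmax(dim=1)`.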

### Load libraries, set global variables and load csv data  

In [6]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = "0"
import pandas as pd
import torch

BASE_PATCH32       =  "openai/clip-vit-base-patch32"
BASE_PATCH16       =  "openai/clip-vit-base-patch16"
LARGE_PATCH14      =  "openai/clip-vit-large-patch14"
LARGE_PATCH14_336  =  "openai/clip-vit-large-patch14-336"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
data_dir = "data"
model_name = LARGE_PATCH14

data = pd.read_csv(data_dir + "/challenge_set.csv")
data[:5]


Unnamed: 0,url,caption
0,https://cdn.leonardo.ai/users/85498bb1-9ae7-4b...,2 friendly real estate agent standing. one wit...
1,https://cdn.leonardo.ai/users/b5a9a19e-f630-4e...,"vector pattern, pastel colors, in style kawai ..."
2,https://cdn.leonardo.ai/users/925ced00-c573-43...,a young beautiful girl run away kitchen. got s...
3,https://cdn.leonardo.ai/users/61b0d7a9-8b0d-46...,Criança menino de 1 ano cabelo cacheado com as...
4,https://cdn.leonardo.ai/users/566cd98a-7e64-47...,A little girl wearing a red dress smiled. Play...


#### Add a new column to the Dataframe for the similarity score

In [7]:
data['score'] = 0.0  # scalar is broadcast to every row
data[:5]

Unnamed: 0,url,caption,score
0,https://cdn.leonardo.ai/users/85498bb1-9ae7-4b...,2 friendly real estate agent standing. one wit...,0.0
1,https://cdn.leonardo.ai/users/b5a9a19e-f630-4e...,"vector pattern, pastel colors, in style kawai ...",0.0
2,https://cdn.leonardo.ai/users/925ced00-c573-43...,a young beautiful girl run away kitchen. got s...,0.0
3,https://cdn.leonardo.ai/users/61b0d7a9-8b0d-46...,Criança menino de 1 ano cabelo cacheado com as...,0.0
4,https://cdn.leonardo.ai/users/566cd98a-7e64-47...,A little girl wearing a red dress smiled. Play...,0.0


#### Find all image files

In [8]:
from os import listdir

# Keep only files with a .png extension
img_files = [f for f in listdir(data_dir) if f.endswith('.png')]
print(f"There are {len(img_files)} images")


There are 51 images
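
Each image file found here is later paired with its caption by checking whether the file name, without the `.png` extension, appears in the row's URL. A minimal standard-library sketch of that pairing follows. The rows, URLs, file names and the helper `caption_for` are hypothetical stand-ins for the real CSV and data directory.

```python
from pathlib import Path

# Hypothetical stand-ins for the CSV rows and the directory listing.
rows = [
    {"url": "https://example.com/users/u1/generations/img_a.jpg", "caption": "a red dress"},
    {"url": "https://example.com/users/u2/generations/img_b.jpg", "caption": "a kitchen"},
]
dir_listing = ["img_b.png", "img_a.png", "notes.txt"]


def caption_for(filename, rows):
    """Return the caption of the first row whose URL contains the file stem."""
    stem = Path(filename).stem  # "img_a.png" -> "img_a"
    for row in rows:
        if stem in row["url"]:  # plain substring match, no regex
            return row["caption"]
    return None  # no matching row


# Keep only PNG files, then look up a caption for each one.
png_files = [f for f in dir_listing if Path(f).suffix.lower() == ".png"]
captions = {f: caption_for(f, rows) for f in png_files}
print(captions)
```

The sketch uses a literal substring test. pandas' `str.contains` treats its pattern as a regular expression by default, so passing `regex=False` there gives the same literal matching.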


#### Algorithm statistics file

In [9]:
import datetime
now = datetime.datetime.now()

def init_stats_file():
    stats_file = "stats.log"
    stats = open(stats_file, "a")

    stats.write( str(now) + ", " )
    return stats

stats = init_stats_file()


### Calculate similarity scores and save them to "results.csv"; running stats are saved to "stats.log"  

The "Time" entry in the stats covers both CPU and GPU work, so it differs from the waiting time seen when running the cell, which is on the order of several seconds (e.g. 7.7s as shown below).  
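
One way to see how two timings of the same block can differ is to measure wall-clock time and CPU time side by side. The sketch below uses only the standard library and a hypothetical `timed` helper, with a short sleep standing in for time spent waiting on the GPU. It is not the notebook's measurement. Note also that `torch.cuda.Event.elapsed_time` reports milliseconds, so its value has to be divided by 1000 before it is compared with a time in seconds.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(results, label):
    """Record wall-clock and CPU time of the enclosed block under `label`."""
    wall_start = time.perf_counter()  # wall-clock: what the user waits for
    cpu_start = time.process_time()   # CPU time used by this process only
    try:
        yield
    finally:
        results[label] = {
            "wall_s": time.perf_counter() - wall_start,
            "cpu_s": time.process_time() - cpu_start,
        }


results = {}
with timed(results, "loop"):
    total = sum(i * i for i in range(200_000))  # CPU-bound work
    time.sleep(0.05)  # stand-in for waiting on a device; uses no CPU time

timing = results["loop"]
print(f"wall: {timing['wall_s']:.3f} s, cpu: {timing['cpu_s']:.3f} s")
```

The sleep adds to the wall-clock figure but not to the CPU figure, which is the kind of gap that appears when part of the time is spent outside the CPU.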

In [10]:
from PIL import Image
#import requests

from transformers import CLIPProcessor, CLIPModel
from os.path import join

if stats.closed:
    stats = init_stats_file()

model = CLIPModel.from_pretrained(model_name).to(device)
stats.write(f"Loaded model: {torch.cuda.max_memory_allocated(device=None) / (1024**3):.1f} GPU memory, ")
processor = CLIPProcessor.from_pretrained(model_name)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()

for imfile in img_files:
    image = Image.open(join(data_dir, imfile))
    caption = data.loc[ data['url'].str.contains(imfile[:-4]), 'caption'].values[0]
    
    inputs = processor(text=caption, images=image, return_tensors="pt", padding="longest", truncation=True, max_length=77)

    outputs = model(**inputs.to(device))
    logits_per_image = outputs.logits_per_image # this is the image-text similarity score
    data.loc[ data['url'].str.contains(imfile[:-4]), 'score'] = logits_per_image[0,0].item()

end.record()
stats.write(f"Running: {torch.cuda.max_memory_allocated(device=None) / (1024**3):.1f} GPU memory, ")

data.to_csv("results.csv", index=False)

# Wait for all queued GPU work to finish before reading the timer
torch.cuda.synchronize()
# elapsed_time() returns milliseconds, so convert to seconds
stats.write("Time: " + str(start.elapsed_time(end) / 1000) + " seconds\n")
stats.close()

`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["id2label"]` will be overriden.
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["bos_token_id"]` will be overriden.
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["eos_token_id"]` will be overriden.
