# image similarity with youtube thumbnails

## theory

the `imagenet` competition (ILSVRC) challenged entrants to classify a large set of images into 1,000 fine-grained categories. the intermediate layers of the competing networks were found to be informative vectors for a wide variety of image recognition tasks, roughly analogous to how word vectors formed from intermediate network layers capture latent features that represent word meaning. these "feature vectors" act like coordinates that locate the images in an "image space", where similar images are closer together.

we can leverage this to find similar images by embedding a set of images and using cosine similarity to compare a source image to a target image. in layman's terms, cosine similarity tells us whether two vectors are 'pointing in the same direction', which indicates that they sit at similar points in the image embedding space (...roughly).
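
as a minimal sketch of the 'pointing in the same direction' idea, the snippet below computes cosine similarity with plain numpy. the 2-d vectors are made-up numbers for illustration, not real image embeddings, and the `cosine_similarity` helper here is a standalone stand-in rather than the method used later in the class.

```python
import numpy as np

def cosine_similarity(a, b):
    """cosine of the angle between vectors a and b"""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy 2-d "embeddings" (made-up numbers, not real image vectors)
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # perpendicular vectors
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # anti-parallel vectors
print(same_direction, orthogonal, opposite)
```

note that the length of the vectors does not matter, only their direction: `[1, 2]` and `[2, 4]` count as maximally similar.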

we will use the YouTube API to get some video thumbnails of related videos for testing.

### notes

the 'most similar image' relation is not symmetric: image A may find image B as closest, but image B may find image C as closest. this happens when C is closer to B than A is, but lies in the 'opposite' direction:

`image_D <---|---|---|---|---> image_A <---|---|---> image_B <---|---> image_C <---|---|---> image_F`
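
the diagram can be checked with a small sketch. the positions below are hypothetical 1-d coordinates read off the diagram (one unit per segment), and the comparison uses plain distance along the line rather than cosine similarity, which is an assumption made to mirror the picture.

```python
# hypothetical 1-d positions matching the diagram (one unit per segment)
positions = {"D": 0, "A": 5, "B": 8, "C": 10, "F": 13}

def nearest(name, positions):
    """return the other point with the smallest distance to `name`"""
    others = [n for n in positions if n != name]
    return min(others, key=lambda n: abs(positions[n] - positions[name]))

# print each point's nearest neighbour
for name in positions:
    print(name, "->", nearest(name, positions))
```

here A's nearest neighbour is B, but B's nearest neighbour is C, so the lookup does not round-trip.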

In [1]:
import io
import json
import keras
import numpy as np
import os
import pickle
import re
import requests
import scipy.spatial.distance
import shutil
import tempfile
import time
import youtube

from IPython.display import Image, display
from IPython.core.display import HTML
from keras.preprocessing import image

Using TensorFlow backend.


### image vectorizer class
`ImageVectorizer` class for obtaining image vectors from pretrained imagenet networks.  
the vectors are the intermediate layer output of the network, before the dense classification layers, representing the latent image features.  
by default, this uses the VGG16 network; to try other models, pass the desired `model` and its matching `preprocess_input` function (as the `preprocess` argument).

see https://keras.io/applications for details

### references  
https://github.com/dcdulin/image-classifier/blob/master/imagenetClassifier.py (primary source)  
https://stackoverflow.com/questions/37751877/downloading-image-with-pil-and-requests (for better image loading)  
https://keras.io/applications/#vgg16  

In [2]:
class ImageVectorizer:
    """vectorizes image using pretrained Keras model"""
    
    def __init__(self, model=None, preprocess=None):
        """by default, use VGG16 model"""
        if model is not None:
            self.model=model
        else:
            self.model = keras.applications.vgg16.VGG16(include_top=False, 
                                                        weights='imagenet', 
                                                        input_tensor=None, 
                                                        input_shape=None, 
                                                        pooling='max')
            
        if preprocess is not None:
            self.preprocess=preprocess
        else:
            self.preprocess=keras.applications.vgg16.preprocess_input
            
        self.img_links = []
        self.img_vectors = []
        self.img_titles = []
            
            
    def download_image(self, image_url):
        """
        downloads image from url and resizes it to 224x224
        returns image array, or None if didn't get response
        """
        img_arr = None
        r = requests.get(image_url, stream=True)
        if r.status_code == 200:
            buffer = tempfile.SpooledTemporaryFile(max_size=10**9)
            # read in chunks; the requests default chunk size is 1 byte
            for chunk in r.iter_content(chunk_size=8192):
                buffer.write(chunk)
            # load image from the buffer
            # original source's approach led to disk thrashing and problems reading files
            buffer.seek(0)
            img = image.load_img(io.BytesIO(buffer.read()), target_size=(224,224))
            buffer.close()
            img_arr = image.img_to_array(img)
        return img_arr

    def vectorize(self, image_url):
        """vectorize single image with model"""
        img_arr = self.download_image(image_url)
        if img_arr is not None:
            img_arr = np.expand_dims(img_arr, axis=0)
            processed_img = self.preprocess(img_arr)
            preds = self.model.predict(processed_img)
            return preds[0]
        else:
            return None

    def vectorize_links(self, image_url_list, image_title_list=None, debug=True):
        """vectorize set of images with model"""
        # reset first
        self.img_links = []
        self.img_vectors = []
        self.img_titles = []
        
        vects = []
        goods = []
        titles = []
        for i, url in enumerate(image_url_list):
            time.sleep(0.1) # so not continually hitting links
            v = self.vectorize(url)
            if v is not None:
                vects.append(v)
                goods.append(url)
                # keep titles aligned with the links that were vectorized
                if image_title_list is not None:
                    titles.append(image_title_list[i])
            else:
                if debug:
                    print('error:', url)
                    
        self.img_links = goods
        self.img_vectors = vects
        self.img_titles = titles
        
        return goods, vects
    
    # cosine similarity
    def _cosine_similarity(self, vector1, vector2):
        """cosine similarity"""
        return 1.-scipy.spatial.distance.cosine(vector1, vector2)
    
    def view_image(self, idx):
        """display image from index"""
        if idx > len(self.img_vectors)-1:
            print("index out of bounds!")
            return
        i = Image(url=self.img_links[idx])
        display(i)
        return
    
    def image_similarity(self, idx):
        """compare images by cosine similarity and display the best match"""
        if len(self.img_vectors) == 0:
            print("please run vectorize_links() first!")
            return
        if idx > len(self.img_vectors)-1:
            print("index out of bounds!")
            return
        test_img = self.img_links[idx]
        test_vct = self.img_vectors[idx]
        sims_vct = []
        for i, vct in enumerate(self.img_vectors):
            if i != idx:
                sims_vct.append(self._cosine_similarity(test_vct, vct))
            else:
                sims_vct.append(-np.inf)  # never pick the query image itself
        best_idx  = np.argmax(sims_vct)
        o = Image(url=self.img_links[idx])
        t = Image(url=self.img_links[best_idx])
        display(o, t)
        if len(self.img_titles) > best_idx:
            return self.img_titles[best_idx]
        else:
            return
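
`image_similarity` returns only the single best match. as a hedged sketch of one possible extension, the standalone function below ranks the top `k` matches using a matrix product instead of a python loop. `top_k_similar` and the toy vectors are hypothetical names and made-up numbers, not part of the class above, and the sketch assumes no vector is all zeros.

```python
import numpy as np

def top_k_similar(vectors, idx, k=3):
    """return (index, similarity) pairs for the k vectors most similar to vectors[idx]"""
    mat = np.asarray(vectors, dtype=float)
    k = min(k, len(mat) - 1)  # cannot return more matches than other vectors
    # normalise rows so that a dot product equals cosine similarity
    unit = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    sims = unit @ unit[idx]
    sims[idx] = -np.inf  # never match an image with itself
    order = np.argsort(sims)[::-1][:k]
    return [(int(i), float(sims[i])) for i in order]

# toy vectors standing in for img_vectors (made-up numbers)
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
matches = top_k_similar(toy_vectors, 0, k=2)
print(matches)
```

with real data, something like `top_k_similar(vgg.img_vectors, 0)` would give the indices to pass to `view_image`.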

### supporting functions

`youtube.API` will be used for scraping images for this example.  
You need a (free) API key for this: https://developers.google.com/youtube/registering_an_application

see: https://github.com/rohitkhatri/youtube-python

In [3]:
# i've hidden my key for this demo
with open('googleAPIkey.pkl', 'rb') as f:
    api_key = pickle.load(f)

In [4]:
api = youtube.API(client_id='', client_secret='', api_key=api_key)

### let's go

In [5]:
# video ID
# this is the last part of the youtube URL
# we'll use [MV] IU(아이유) _ BBIBBI(삐삐)
# https://www.youtube.com/watch?v=nM0xDI5R50E
video_id = 'nM0xDI5R50E'

In [6]:
# get some related video URLs
r = api.get('search', type='video', q=video_id, maxResults=50, part='snippet', regionCode='us', relevanceLanguage='en', key=api_key)
urls = [i['snippet']['thumbnails']['medium']['url'] for i in r['items']]
ttls = [i['snippet']['title'] for i in r['items']]
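
the list comprehensions above assume every item has a `medium` thumbnail. a slightly more defensive sketch is shown below; `extract_thumbnails` is a hypothetical helper, and the sample dictionary is hand-written to mimic only the fields used above, not real API output.

```python
def extract_thumbnails(response, size="medium"):
    """return parallel lists of thumbnail URLs and titles, skipping incomplete items"""
    urls, titles = [], []
    for item in response.get("items", []):
        snippet = item.get("snippet", {})
        thumb = snippet.get("thumbnails", {}).get(size)
        if thumb is None or "url" not in thumb:
            continue  # no thumbnail of this size for this item
        urls.append(thumb["url"])
        titles.append(snippet.get("title", ""))
    return urls, titles

# hand-written sample shaped like the fields used above (not real API output)
sample = {
    "items": [
        {"snippet": {"title": "first video",
                     "thumbnails": {"medium": {"url": "https://example.com/a.jpg"}}}},
        {"snippet": {"title": "no medium thumbnail",
                     "thumbnails": {"default": {"url": "https://example.com/b.jpg"}}}},
    ]
}
sample_urls, sample_titles = extract_thumbnails(sample)
print(sample_urls, sample_titles)
```

building both lists in one pass keeps the titles aligned with the URLs when an item is skipped.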

In [7]:
# test
for i in range(3):
    print(ttls[i])
    print(urls[i])

[MV] IU(아이유) _ BBIBBI(삐삐)
https://i.ytimg.com/vi/nM0xDI5R50E/mqdefault.jpg
IU(아이유) _ BBIBBI(삐삐) MV | BTS JUNGKOOK CRUSH? | REACTION!!!
https://i.ytimg.com/vi/GyNFeZ6cgU0/mqdefault.jpg
KPOP IDOLS Dancing &Singing To BbiBbi - IU 👯
https://i.ytimg.com/vi/eKo60MfiKwc/mqdefault.jpg


In [8]:
# create a vectorizer
# this will download the pretrained model weights if you have not done so already
vgg = ImageVectorizer()

In [9]:
# download and vectorize each image
img_links, img_vects = vgg.vectorize_links(urls, debug=True)

In [10]:
# check the data shape
img_vects[0].shape

(512,)
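
the 512 comes from the VGG16 defaults used above: for a 224x224 input, the last convolutional block produces a 7x7 grid with 512 channels, and `pooling='max'` keeps the largest value in each channel. the sketch below mimics that pooling step with numpy on random numbers (a stand-in, not real activations).

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in for the last convolutional feature map of one 224x224 image:
# a 7x7 spatial grid with 512 channels (random numbers, not real activations)
feature_map = rng.random((7, 7, 512))
# global max pooling keeps the largest activation in each channel
pooled = feature_map.max(axis=(0, 1))
print(pooled.shape)
```

other models passed to `ImageVectorizer` will generally give vectors of a different length.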

In [36]:
# view a single image
vgg.view_image(28)

In [12]:
# compare by passing index of image to check
# seems like the 0th item is always the original search query...?
vgg.image_similarity(0)

In [21]:
# same template
vgg.image_similarity(31)

In [16]:
# reaction videos
vgg.image_similarity(48)

In [34]:
# pastel title screens
vgg.image_similarity(45)

In [17]:
# both feature a character with their arms crossed
vgg.image_similarity(10)

In [19]:
# computer screens
vgg.image_similarity(40)