# Word2VisualVec for Sentence Representation

This note answers the following two questions:
1. How to load a trained Word2VisualVec model?
2. How to predict visual features from a new sentence?

## 0. Setup

Use the following script to download and extract a Word2VisualVec model trained on flickr30k.
To download the dataset, see the [required data](https://github.com/danieljf24/w2vv#required-data) section of the w2vv repository.


```shell
ROOTPATH=$HOME/trained_w2vv_model
mkdir -p $ROOTPATH && cd $ROOTPATH

# download and extract the pre-trained model
wget http://lixirong.net/data/w2vv-tmm2018/flickr30k_trained_model.tar.gz
tar zxf flickr30k_trained_model.tar.gz
```
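
Before loading the model, it can help to confirm that the archive was extracted as expected. The sketch below is only an illustration: the helper name `missing_model_files` is hypothetical, and the three file names are taken from the cells further down in this note (`model.json`, `best_model.h5`, `option.pkl`), so adjust them if your archive differs.

```python
import os

# File names as used by the loading cells in this note (an assumption
# about the archive layout, not something the archive guarantees).
EXPECTED_FILES = ("model.json", "best_model.h5", "option.pkl")


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from model_dir."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(model_dir, name))]
```

For example, `missing_model_files(os.path.join(os.environ['HOME'], 'trained_w2vv_model', 'flickr30k_trained_model'))` should return an empty list if the download and extraction succeeded.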

In [1]:
import os
import keras
from basic.common import readPkl
from w2vv_pred import W2VV_MS_pred, pred_mutual_error_ms
from util.text import encode_text
from util.text2vec import get_text_encoder
from util.util import readImgSents 
from simpleknn.bigfile import BigFile
from util.losser import get_losser
from util.evaluation import i2t

Using TensorFlow backend.


In [2]:
use_flickr = False

model_name = "flickr30k_trained_model" if use_flickr else "1000chars_description_trained_model"
trainCollection = "flickr30kenctrain" if use_flickr else 'data_w2vvtrain'
testCollection='data_w2vvtest'

## 1. Load a trained Word2VisualVec model

In [3]:
model_path = os.path.join(os.environ['HOME'],'trained_w2vv_model/' + model_name)
abs_model_path = os.path.join(model_path, 'model.json')
weight_path = os.path.join(model_path, 'best_model.h5')
predictor = W2VV_MS_pred(abs_model_path, weight_path)

Instructions for updating:
If using Keras pass *_constraint arguments to layers.
08/12/2019 13:28:06 INFO [w2vv_pred.pyc.W2VV_MS_pred] loaded a trained Word2VisualVec model successfully


## 2. Precision of prediction on test dataset

In [4]:
# setup multi-scale sentence vectorization
opt = readPkl(os.path.join(model_path, 'option.pkl'))
# opt.n_caption = 2

rootpath=os.path.join(os.environ['HOME'],'VisualSearch')
rnn_style, bow_style, w2v_style = opt.text_style.strip().split('@')
text_data_path = os.path.join(rootpath, trainCollection, "TextData", "vocabulary", "bow", opt.rnn_vocab)
bow_data_path = os.path.join(rootpath, trainCollection, "TextData", "vocabulary", bow_style, opt.bow_vocab)
w2v_data_path = os.path.join(rootpath, "word2vec", opt.corpus,  opt.word2vec)

text2vec = get_text_encoder(rnn_style)(text_data_path)
bow2vec = get_text_encoder(bow_style)(bow_data_path)
w2v2vec = get_text_encoder(w2v_style)(w2v_data_path)

08/12/2019 13:28:06 INFO [util/text2vec.pyc.Index2Vec] initializing ...
08/12/2019 13:28:06 INFO [util/text2vec.pyc.BoW2VecFilterStop] initializing ...
08/12/2019 13:28:06 INFO [util/text2vec.pyc.BoW2VecFilterStop] 50105 words
08/12/2019 13:28:06 INFO [util/text2vec.pyc.AveWord2VecFilterStop] initializing ...
[BigFile] 1743364x500 instances loaded from /home/oleh/VisualSearch/word2vec/flickr/vec500flickr30m


In [5]:
# similarity function
losser = get_losser(opt.simi_fun)()

In [6]:
# # img2vec
# img_feats_path = os.path.join(rootpath, testCollection, 'FeatureData', opt.img_feature)
# img_feats = BigFile(img_feats_path)

# test_sent_file = os.path.join(rootpath, testCollection, 'TextData','%s.caption.txt' % testCollection)
# img_list, sents_id, sents = readImgSents(test_sent_file)
# all_errors = pred_mutual_error_ms(img_list, sents, predictor, text2vec, bow2vec, w2v2vec, img_feats, losser, opt=opt)


# # compute performance
# (r1i, r5i, r10i, medri, meanri) = i2t(all_errors, n_caption=opt.n_caption)
# print "Image to text: %.1f, %.1f, %.1f, %.1f, %.1f" % (r1i, r5i, r10i, medri, meanri)

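Results of three runs. Going by the variable names in the cell above (`r1i`, `r5i`, `r10i`, `medri`, `meanri`), each line presumably lists recall at 1, 5, and 10, then the median rank and the mean rank:

* Image to text (flickr run): 45.6, 72.1, 81.5, 2.0, 13.3
* Image to text (up to 1000 words, vocab from flickr): 1.2, 3.3, 6.4, 115.0, 122.5
* Image to text (entire article, flickr vocab): 0.4, 2.1, 3.7, 362.0, 405.4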
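
Rank-based retrieval metrics of this kind (recall at K, median rank, mean rank) can be derived from an error matrix roughly as sketched below. This is not the repository's `i2t` function, which also handles several captions per image through `n_caption`. The sketch assumes a square matrix in which `errors[i, j]` is the loss between query `i` and candidate `j`, that the correct candidate for query `i` sits at index `i`, and that a lower loss means a better match. The name `retrieval_metrics` is hypothetical.

```python
import numpy as np


def retrieval_metrics(errors, ks=(1, 5, 10)):
    """Recall@K (in percent), median rank and mean rank.

    Assumes errors[i, i] is the loss of the correct match for query i
    and that lower values are better.
    """
    errors = np.asarray(errors, dtype=float)
    truth = np.diag(errors)
    # Rank of the correct match: 1 + number of candidates with a strictly
    # smaller loss (ties are therefore resolved optimistically).
    ranks = (errors < truth[:, None]).sum(axis=1) + 1
    recalls = [100.0 * np.mean(ranks <= k) for k in ks]
    return recalls, float(np.median(ranks)), float(np.mean(ranks))
```

A low median rank combined with a much higher mean rank, as in the flickr run, usually indicates that most queries are ranked well while a few are ranked very badly.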

## 3. Specific Output Example

### Read Data

In [7]:
import numpy as np
import string
import json
import shutil

import os
from os import listdir, mkdir
from os.path import isfile, isdir, join, exists, abspath
from keras.preprocessing import image
import regex as re

In [8]:
def _remove_punctuation(text):
    return re.sub(ur"\p{P}+", "", text)

def _getJSON(path):
    with open(path) as json_file:
        return json.loads(json.load(json_file))

def _getTextFeatures(text_path):
    data = _getJSON(text_path)
    text = _remove_punctuation(data['text'].replace("\n", " "))
    text = text[:1000].rsplit(' ', 1)[0]
    # onyshchak: only the first 1000 characters are used; a proper summary extraction is still needed
    return {
        'id': data['id'],
        'text': text,
        "title": data['title']
    }

def _getImagesMeta(path):
    return _getJSON(path)['img_meta']

def _getValidImagePaths(article_path):
    img_path = join(article_path, 'img/')
    return [join(img_path, f) for f in listdir(img_path) if isfile(join(img_path, f)) and f[-4:].lower() == ".jpg"]

def _dump(path, data):
    with open(path, 'w', encoding='utf8') as outfile:
        json.dump(data, outfile, indent=2, ensure_ascii=False)

def GetArticleData(article_path):
    article_data = _getTextFeatures(join(article_path, 'text.json'))
    article_data["img"] = _getImagesMeta(join(article_path, 'img/', 'meta.json'))
    
    return article_data

def ReadArticles(data_path, offset=0, limit=None):
    print("Reading in progress...")
    article_paths = [join(data_path, f) for f in listdir(data_path) if isdir(join(data_path, f))]
    limit = limit if limit else len(article_paths) - offset
    
    articles = []
    for i in range(offset, offset + limit):
        path = article_paths[i]
        if (i - offset + 1) % 251 == 0: print(i - offset, "articles have been read")
        article_data = GetArticleData(path)
        articles.append(article_data)
        if len(articles) >= limit: break  # useless?
        
    print(limit, "articles have been read")
    return articles
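
One detail of `_getTextFeatures` worth noting: the expression `text[:1000].rsplit(' ', 1)[0]` drops the last word even when the text is already shorter than 1000 characters. The variant below is a sketch of one possible alternative that only trims when something was actually cut; the name `truncate_at_word` is hypothetical and the function is not part of the original notebook.

```python
def truncate_at_word(text, max_chars=1000):
    """Cut text to at most max_chars characters without splitting a word."""
    if len(text) <= max_chars:
        # Nothing was cut, so the last word is complete and is kept.
        return text
    head = text[:max_chars]
    if text[max_chars] == " ":
        # The cut landed exactly on a word boundary.
        return head.rstrip()
    # The cut landed inside a word: drop the partial word.
    # If head contains no space at all, it is returned unchanged.
    return head.rsplit(" ", 1)[0]
```

With a limit of 13, `"hello world foo"` becomes `"hello world"`, while a short text such as `"hello world"` is returned whole under the default limit.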

In [9]:
%%time
articles = ReadArticles('../data/', offset=0, limit=None)

Reading in progress...
(250, 'articles have been read')
(501, 'articles have been read')
(752, 'articles have been read')
(1003, 'articles have been read')
(1254, 'articles have been read')
(1505, 'articles have been read')
(1756, 'articles have been read')
(2007, 'articles have been read')
(2258, 'articles have been read')
(2509, 'articles have been read')
(2760, 'articles have been read')
(3011, 'articles have been read')
(3262, 'articles have been read')
(3513, 'articles have been read')
(3764, 'articles have been read')
(4015, 'articles have been read')
(4266, 'articles have been read')
(4517, 'articles have been read')
(4768, 'articles have been read')
(5019, 'articles have been read')
(5270, 'articles have been read')
(5521, 'articles have been read')
(5638, 'articles have been read')
CPU times: user 48.9 s, sys: 5.8 s, total: 54.7 s
Wall time: 59.4 s


### Well-performing case of 'Maserati MC12' article

In [10]:
images = {i["filename"]: i for a in articles for i in a['img']}
images = np.array([x for x in images.values() if "features" in x])

In [11]:
img_features = np.array([x["features"] for x in images], dtype=np.float32)

In [12]:
import random
import hashlib
from urllib import quote
from IPython.display import display, Image

random.seed(1234)

def get_matched_article_id(img_features, articles):
    for tries in range(50):
        i = int(random.random() * len(articles))
        page = articles[i]
        text = page["text"]
        print(i, page['title'])
        
        rnn_vec, bow_w2v_vec = encode_text(opt, text2vec, bow2vec, w2v2vec, text)
        predicted_features = predictor.predict_one(rnn_vec, bow_w2v_vec).reshape(1, -1)

        similarity = np.array(losser.calculate(predicted_features, img_features)[0])
        true_img = [x["filename"] for x in page["img"]]
        for x in images[similarity.argsort()[:10]]:
            if x["filename"] in true_img:
                print("FOUND", x["filename"])
                print(x["url"])
                return i
    return -1

def get_url(img_title, size=600):
    img_name = img_title.replace("\"", "")
    for forbidden in ':*?/\\ ':
        img_name = img_name.replace(forbidden, '_')
        
    img_name = img_name.encode('utf-8')
    url_prefix = "https://upload.wikimedia.org/wikipedia/commons/thumb/"
    md5 = hashlib.md5(img_name).hexdigest()
    sep = "/"
    
    img_name = quote(img_name)
    url = url_prefix + sep.join((md5[0], md5[:2], img_name)) + sep + str(size) + "px-" + img_name
    if url[-4:] != ".jpg" and url[-4:] != "jpeg":
        url += ".jpg"
        
    return url
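
The notebook targets Python 2 (`from urllib import quote`). For readers on Python 3, a port of `get_url` might look like the sketch below. It keeps the same steps as the function above: sanitize the title, take the MD5 of the UTF-8 encoded name, and build the path from the first one and first two hex digits. The name `get_thumb_url` is hypothetical, and whether every generated URL resolves on Wikimedia is not verified here.

```python
import hashlib
from urllib.parse import quote

THUMB_PREFIX = "https://upload.wikimedia.org/wikipedia/commons/thumb/"


def get_thumb_url(img_title, size=600):
    """Python 3 sketch of get_url: build a thumbnail URL from an image title."""
    name = img_title.replace('"', "")
    for forbidden in ':*?/\\ ':
        name = name.replace(forbidden, "_")

    # The hash is computed on the sanitized, UTF-8 encoded file name.
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    quoted = quote(name)  # percent-encodes non-ASCII characters as UTF-8

    url = "{}{}/{}/{}/{}px-{}".format(
        THUMB_PREFIX, digest[0], digest[:2], quoted, size, quoted)
    # Same suffix rule as the original function (case-sensitive).
    if not url.endswith((".jpg", "jpeg")):
        url += ".jpg"
    return url
```

The suffix check is intentionally left case-sensitive to match the original behaviour, so a title ending in `.JPG` still gets `.jpg` appended.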

In [13]:
matched_article_id = get_matched_article_id(img_features, articles)

(5448, u'Vision in White')

(2484, u'LSWR N15 class')
(42, u'Hurricane Fabian')
(5136, u'George Moore (novelist)')
(5295, u'Asylum confinement of Christopher Smart')
(3282, u'Lost Luggage (video game)')
(3786, u'St Kilda, Scotland')
(473, u'The Blind Leading the Blind')
(4321, u'Triturus')
(1335, u'George W. Romney')
(173, u'The Shape of Things to Come (Lost)')
(4447, u'Smooth toadfish')
(1951, u'Aaliyah (album)')
(3514, u'Tropical Storm Henri (2003)')
(3471, u'Subfossil lemur')
(837, u'Maserati MC12')
('FOUND', u'd3470178117bd4313f031202c61a9ae8.jpg')
https://en.wikipedia.org/wiki/File%3AMaserati_MC12_36643138.jpg


In [14]:
# page = [x for x in articles if x["title"] == "Barack Obama"][0]
page = articles[matched_article_id]
text = page["text"]
# text = page["img"][1]["description"]
print(text)
rnn_vec, bow_w2v_vec = encode_text(opt, text2vec, bow2vec, w2v2vec, text)
predicted_features = predictor.predict_one(rnn_vec, bow_w2v_vec).reshape(1, -1)
predicted_features

The Maserati MC12 Tipo M144S is a limited production twoseater sports car produced by Italian car maker Maserati to allow a racing variant to compete in the FIA GT Championship The car entered production in 2004 with 25 cars produced A further 25 were produced in 2005 making a total of 50 cars available for customers each of which was presold for €600000 US$670541 With the addition of 12 cars produced for racing only a total of 62 of these cars were ever produced  Maserati designed and built the car on the chassis of the Enzo Ferrari but the final car is much larger and has a lower drag coefficient The MC12 is longer wider and taller and has a sharper nose and smoother curves than the Enzo Ferrari which has faster acceleration better braking performance shorter braking distance and a higher top speed The top speed of the Maserati MC12 is 330 kilometres per hour 205 mph whereas the top speed of the Enzo Ferrari is 350 kilometres per hour 2175 mph  The MC12 was developed to signal


array([[10.183257 ,  3.6151009,  6.8254724, ...,  5.8204355,  8.846998 ,
         6.8047595]], dtype=float32)

In [15]:
similarity = np.array(losser.calculate(predicted_features, img_features)[0])
# res = res + 1
similarity

array([-0.70614822, -0.59112877, -0.75591392, ..., -0.75365775,
       -0.6493101 , -0.8246509 ])

Double-checking that `similarity` and `img_features` have the same order

In [16]:
print(similarity[:3])
print(similarity[-3:])

[-0.70614822 -0.59112877 -0.75591392]
[-0.75365775 -0.6493101  -0.8246509 ]


In [17]:
print(losser.calculate(img_features[:3], predicted_features))
print(losser.calculate(img_features[-3:], predicted_features))

[[-0.7061482240913182], [-0.5911287668748487], [-0.7559139236459015]]
[[-0.7536577499856756], [-0.6493101040659499], [-0.8246509048925484]]


Double-checking that `images` and `img_features` have the same order

In [18]:
get_features = lambda img: np.array(img['features']).astype(np.float32)
all([(get_features(images[i]) == img_features[i]).all() for i in range(len(images))])

True

* 1 Double-check that we have the same order, because the similarities are very large and the results are bad
* 5 Then, if that doesn't work, train on a single image per article (the most relevant one)
* 4 Finish with text2text similarity (the last priority)
* 2 Identify an article with high precision and check its images (is it for real?)
* 3 Check that we have the same precision

In [19]:
min(similarity), max(similarity)

(-0.8969546457728494, -0.19520208983797338)

In [20]:
page[u"title"]

u'Maserati MC12'

Actual images on the `Maserati MC12` Wikipedia page

In [21]:
for x in page["img"]:
    img_url = get_url(x['title'])
    display(Image(url=img_url))

Top-10 ranked images predicted by the model for the `Maserati MC12` page

In [22]:
print(similarity[similarity.argsort()[:10]])

[-0.89695465 -0.89627005 -0.89617854 -0.8949657  -0.89255532 -0.89211503
 -0.89199204 -0.89196118 -0.89149196 -0.89148102]
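
The scores above are negative and the ten smallest are selected, so `losser.calculate` appears to return a loss where lower means a better match. The observed range would be consistent with a negated cosine similarity, but that is an assumption; the actual function is whatever `opt.simi_fun` selects. Under that assumption, the ranking step can be sketched with plain NumPy as follows (`negative_cosine` and `top_k` are hypothetical names):

```python
import numpy as np


def negative_cosine(query, candidates):
    """Negated cosine similarity between one query and each candidate row."""
    query = np.asarray(query, dtype=np.float64).ravel()
    candidates = np.asarray(candidates, dtype=np.float64)
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return -np.dot(c, q)


def top_k(losses, k=10):
    """Indices of the k smallest losses, best match first."""
    return np.argsort(losses)[:k]
```

Because the values are losses, `argsort()[:k]` without reversing gives the best matches, which is also how the cells in this note index `images`.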


In [23]:
for x in images[similarity.argsort()[:10]]:
    img_url = get_url(x['title'])
    display(Image(url=img_url))

Note: for this article, performance is poor when the descriptions of its images are used as input instead of the article text

### Random Article Performance

In [24]:
def wiki_predict(page, topK=10):
    text = page["text"]
    rnn_vec, bow_w2v_vec = encode_text(opt, text2vec, bow2vec, w2v2vec, text)
    predicted_features = predictor.predict_one(rnn_vec, bow_w2v_vec).reshape(1, -1)

    similarity = np.array(losser.calculate(predicted_features, img_features)[0])
    true_img_url = [get_url(x["title"]) for x in page["img"]]
    pred_img = images[similarity.argsort()[:topK]]
    pred_img_url = [get_url(x["title"]) for x in pred_img]
    
    return true_img_url, pred_img_url
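
Since `wiki_predict` returns both the true and the predicted URLs, a quick way to judge a single article is to list which predictions are actual images of the page. The helper below is a small sketch for that purpose; `hits_at_k` is a hypothetical name and not part of the original notebook.

```python
def hits_at_k(true_urls, pred_urls, k=10):
    """Return the top-k predicted URLs that are also ground truth, in rank order."""
    truth = set(true_urls)
    return [url for url in pred_urls[:k] if url in truth]
```

For instance, `hits_at_k(true_img_url, pred_img_url)` returns an empty list when none of the article's own images appear among the top-10 predictions.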

In [25]:
obama_page = [x for x in articles if x["title"] == "Barack Obama"][0]
true_img_url, pred_img_url = wiki_predict(obama_page)

# top-ranked images predicted for the Barack Obama article
for x in pred_img_url:
    display(Image(url=x))

**TODO:** Make sure you are checking examples from the **test** subset

## 4. Predict visual features of a novel sentence

In [26]:
sent='a dog is playing with a cat'
rnn_vec, bow_w2v_vec = encode_text(opt,text2vec,bow2vec,w2v2vec,sent)
predicted_text_feat = predictor.predict_one(rnn_vec,bow_w2v_vec)
print(len(predicted_text_feat))
print(predicted_text_feat)

2048
[ 6.1648006  5.2095037  8.546985  ...  7.519165  10.404728   7.7131257]
