# Word2VisualVec for Sentence Representation

This note answers the following two questions:
1. How to load a trained Word2VisualVec model?
2. How to predict visual features from a new sentence?

## 0. Setup

Use the following script to download and extract a Word2VisualVec model trained on flickr30k.
To download the dataset itself, see the [required data](https://github.com/danieljf24/w2vv#required-data) section of the w2vv repository.


```shell
ROOTPATH=$HOME/trained_w2vv_model
mkdir -p $ROOTPATH && cd $ROOTPATH

# download and extract the pre-trained model
wget http://lixirong.net/data/w2vv-tmm2018/flickr30k_trained_model.tar.gz
tar zxf flickr30k_trained_model.tar.gz
```
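Before moving on, it can help to confirm that the archive was extracted where the notebook expects it. The sketch below is only a convenience check: `missing_model_files` is a hypothetical helper, and it assumes the archive unpacks into a `flickr30k_trained_model` directory holding `model.json`, `best_model.h5` and `option.pkl` (the three files the later cells read).

```python
import os


def missing_model_files(model_dir, required=("model.json", "best_model.h5", "option.pkl")):
    """Return the required file names that are not present in model_dir."""
    return [name for name in required
            if not os.path.isfile(os.path.join(model_dir, name))]


# Assumed location: $HOME/trained_w2vv_model/flickr30k_trained_model
model_dir = os.path.join(os.path.expanduser("~"), "trained_w2vv_model",
                         "flickr30k_trained_model")
missing = missing_model_files(model_dir)
if missing:
    print("Missing files in %s: %s" % (model_dir, ", ".join(missing)))
else:
    print("All expected model files found in %s" % model_dir)
```

If anything is reported missing, re-run the download step or adjust `model_dir` to the directory that was actually created.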

In [1]:
import os
import keras
from basic.common import readPkl
from w2vv_pred import W2VV_MS_pred, pred_mutual_error_ms
from util.text import encode_text
from util.text2vec import get_text_encoder
from util.util import readImgSents 
from simpleknn.bigfile import BigFile
from util.losser import get_losser
from util.evaluation import i2t

Using TensorFlow backend.


In [2]:
use_flickr = False

model_name = "flickr30k_trained_model" if use_flickr else "1000chars_description_trained_model"
trainCollection = "flickr30kenctrain" if use_flickr else 'data_w2vvtrain'
testCollection='data_w2vvtest'

## 1. Load a trained Word2VisualVec model

In [3]:
model_path = os.path.join(os.environ['HOME'],'trained_w2vv_model/' + model_name)
abs_model_path = os.path.join(model_path, 'model.json')
weight_path = os.path.join(model_path, 'best_model.h5')
predictor = W2VV_MS_pred(abs_model_path, weight_path)

Instructions for updating:
If using Keras pass *_constraint arguments to layers.
08/12/2019 09:40:17 INFO [w2vv_pred.pyc.W2VV_MS_pred] loaded a trained Word2VisualVec model successfully


## 2. Retrieval performance on the test dataset

In [4]:
# setup multi-scale sentence vectorization
opt = readPkl(os.path.join(model_path, 'option.pkl'))
# opt.n_caption = 2

rootpath=os.path.join(os.environ['HOME'],'VisualSearch')
rnn_style, bow_style, w2v_style = opt.text_style.strip().split('@')
text_data_path = os.path.join(rootpath, trainCollection, "TextData", "vocabulary", "bow", opt.rnn_vocab)
bow_data_path = os.path.join(rootpath, trainCollection, "TextData", "vocabulary", bow_style, opt.bow_vocab)
w2v_data_path = os.path.join(rootpath, "word2vec", opt.corpus,  opt.word2vec)

text2vec = get_text_encoder(rnn_style)(text_data_path)
bow2vec = get_text_encoder(bow_style)(bow_data_path)
w2v2vec = get_text_encoder(w2v_style)(w2v_data_path)

08/12/2019 09:40:17 INFO [util/text2vec.pyc.Index2Vec] initializing ...
08/12/2019 09:40:17 INFO [util/text2vec.pyc.BoW2VecFilterStop] initializing ...
08/12/2019 09:40:17 INFO [util/text2vec.pyc.BoW2VecFilterStop] 50105 words
08/12/2019 09:40:17 INFO [util/text2vec.pyc.AveWord2VecFilterStop] initializing ...
[BigFile] 1743364x500 instances loaded from /home/oleh/VisualSearch/word2vec/flickr/vec500flickr30m


In [5]:
# similarity function
losser = get_losser(opt.simi_fun)()

In [6]:
# img2vec
img_feats_path = os.path.join(rootpath, testCollection, 'FeatureData', opt.img_feature)
img_feats = BigFile(img_feats_path)

test_sent_file = os.path.join(rootpath, testCollection, 'TextData','%s.caption.txt' % testCollection)
img_list, sents_id, sents = readImgSents(test_sent_file)
all_errors = pred_mutual_error_ms(img_list, sents, predictor, text2vec, bow2vec, w2v2vec, img_feats, losser, opt=opt)


# compute performance
(r1i, r5i, r10i, medri, meanri) = i2t(all_errors, n_caption=opt.n_caption)
print "Image to text: %.1f, %.1f, %.1f, %.1f, %.1f" % (r1i, r5i, r10i, medri, meanri)

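As the variable names suggest, the five values on each line are recall at 1, 5 and 10 (in percent) followed by the median and mean rank.

Image to text (flickr run): 45.6, 72.1, 81.5, 2.0, 13.3

Image to text (up to 1000 words, vocab from flickr): 1.2, 3.3, 6.4, 115.0, 122.5

Image to text (entire article, flickr vocab): 0.4, 2.1, 3.7, 362.0, 405.4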
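To make the five numbers easier to interpret, here is a minimal NumPy sketch of how image-to-text recall and rank statistics can be computed from an error matrix. It is an illustration under stated assumptions, not the repository's `i2t`: `image_to_text_metrics` is a hypothetical function, row `i` of `errors` is assumed to hold the errors between image `i` and every sentence (lower is better), and the `n_caption` sentences of image `i` are assumed to be stored contiguously.

```python
import numpy as np


def image_to_text_metrics(errors, n_caption):
    """Recall@1/5/10 (percent), median rank and mean rank for image-to-text retrieval.

    errors[i, j] is the error between image i and sentence j (lower is better).
    Sentences n_caption*i .. n_caption*(i+1)-1 are assumed to belong to image i.
    """
    errors = np.asarray(errors, dtype=float)
    ranks = np.zeros(errors.shape[0])
    for i in range(errors.shape[0]):
        order = np.argsort(errors[i])  # best-matching sentences first
        own = np.arange(n_caption * i, n_caption * (i + 1))
        # 0-based position of the best-ranked ground-truth sentence
        ranks[i] = np.where(np.isin(order, own))[0][0]
    r1 = 100.0 * np.mean(ranks < 1)
    r5 = 100.0 * np.mean(ranks < 5)
    r10 = 100.0 * np.mean(ranks < 10)
    medr = np.floor(np.median(ranks)) + 1  # reported as a 1-based rank
    meanr = ranks.mean() + 1
    return r1, r5, r10, medr, meanr


# Toy example: 3 images, 1 caption each; image 1 ranks its own caption last.
toy_errors = [[0.1, 0.5, 0.9],
              [0.2, 0.8, 0.3],
              [0.7, 0.6, 0.4]]
print(image_to_text_metrics(toy_errors, n_caption=1))
```

Under this reading, a median rank of 2.0 means that for half of the test images a correct sentence appears in the top two, while a median rank in the hundreds means the correct text is typically buried far down the list.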

## 3. Specific Output Example

In [7]:
import numpy as np
import string
import json
import shutil

import os
from os import listdir, mkdir
from os.path import isfile, isdir, join, exists, abspath
from keras.preprocessing import image
import regex as re

In [8]:

def _remove_punctuation(text):
    return re.sub(ur"\p{P}+", "", text)

def _getJSON(path):
    with open(path) as json_file:
        return json.loads(json.load(json_file))

def _getTextFeatures(text_path):
    data = _getJSON(text_path)
    text = _remove_punctuation(data['text'].replace("\n", " "))
    text = text[:1000].rsplit(' ', 1)[0]
    # onyshchak: only the first 1000 characters are used; a proper summary still needs to be extracted
    return {
        'id': data['id'],
        'text': text,
        "title": data['title']
    }

def _getImagesMeta(path):
    return _getJSON(path)['img_meta']

def _getValidImagePaths(article_path):
    img_path = join(article_path, 'img/')
    return [join(img_path, f) for f in listdir(img_path) if isfile(join(img_path, f)) and f[-4:].lower() == ".jpg"]

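def _dump(path, data):
    import io  # the Python 2 built-in open() has no `encoding` argument
    with io.open(path, 'w', encoding='utf8') as outfile:
        # io.open expects unicode text, so serialize first and write the result
        outfile.write(unicode(json.dumps(data, indent=2, ensure_ascii=False)))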

def GetArticleData(article_path):
    article_data = _getTextFeatures(join(article_path, 'text.json'))
    article_data["img"] = _getImagesMeta(join(article_path, 'img/', 'meta.json'))
    
    return article_data

def ReadArticles(data_path, offset=0, limit=None):
    print("Reading in progress...")
    article_paths = [join(data_path, f) for f in listdir(data_path) if isdir(join(data_path, f))]
    limit = limit if limit else len(article_paths) - offset
    
    articles = []
    for i in range(offset, offset + limit):
        path = article_paths[i]
        if (i - offset + 1) % 251 == 0: print(i - offset, "articles have been read")
        article_data = GetArticleData(path)
        articles.append(article_data)
        if len(articles) >= limit: break  # redundant: the range above already stops after `limit` items
        
    print(limit, "articles have been read")
    return articles
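The text clean-up in `_getTextFeatures` is easy to misread, so here is a small standalone sketch of the same two steps on a toy string. It is an approximation: the helpers `strip_ascii_punctuation` and `truncate_at_word` are hypothetical names, and the sketch uses the standard-library `re` with ASCII punctuation only, whereas the notebook relies on the `regex` package and the Unicode class `\p{P}`.

```python
import re
import string


def strip_ascii_punctuation(text):
    """Remove runs of ASCII punctuation (the notebook removes all Unicode punctuation)."""
    return re.sub("[%s]+" % re.escape(string.punctuation), "", text)


def truncate_at_word(text, max_chars=1000):
    """Keep at most max_chars characters, then drop everything after the last space.

    Mirrors text[:max_chars].rsplit(' ', 1)[0] from the notebook.
    """
    return text[:max_chars].rsplit(" ", 1)[0]


sample = "Hello, world! This is a short example."
clean = strip_ascii_punctuation(sample)
print(clean)
print(truncate_at_word(clean, 12))
```

One consequence of `rsplit(' ', 1)[0]` is that the final word is dropped even when the text is shorter than the limit, so `truncate_at_word("Hello world")` returns `"Hello"`. That is harmless for long articles but worth keeping in mind for short descriptions.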

In [9]:
%%time
articles = ReadArticles('../data/', offset=0, limit=None)

Reading in progress...
(250, 'articles have been read')
(501, 'articles have been read')
(752, 'articles have been read')
(1003, 'articles have been read')
(1254, 'articles have been read')
(1505, 'articles have been read')
(1756, 'articles have been read')
(2007, 'articles have been read')
(2258, 'articles have been read')
(2509, 'articles have been read')
(2760, 'articles have been read')
(3011, 'articles have been read')
(3262, 'articles have been read')
(3513, 'articles have been read')
(3764, 'articles have been read')
(4015, 'articles have been read')
(4266, 'articles have been read')
(4517, 'articles have been read')
(4768, 'articles have been read')
(5019, 'articles have been read')
(5270, 'articles have been read')
(5521, 'articles have been read')
(5638, 'articles have been read')
CPU times: user 48.9 s, sys: 5.98 s, total: 54.9 s
Wall time: 57.3 s


In [10]:
images = {i["filename"]: i for a in articles for i in a['img']}
images = np.array([x for x in images.values() if "features" in x])

In [11]:
img_features = np.array([x["features"] for x in images], dtype=np.float32)
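The two cells above deduplicate images by filename and then stack their feature vectors. A toy sketch of the same idea follows, using made-up records whose field names (`filename`, `features`) follow the notebook; the values are purely illustrative.

```python
import numpy as np

# Two toy articles that share one image; "y.jpg" has no features.
articles = [
    {"title": "A", "img": [{"filename": "x.jpg", "features": [1.0, 2.0]},
                           {"filename": "y.jpg"}]},
    {"title": "B", "img": [{"filename": "x.jpg", "features": [1.0, 2.0]},
                           {"filename": "z.jpg", "features": [3.0, 4.0]}]},
]

by_name = {}
for article in articles:
    for img in article["img"]:
        by_name[img["filename"]] = img  # a later duplicate overwrites the earlier one

# Build the record list once, then derive the matrix from that same list,
# so row k of `features` always corresponds to kept[k].
kept = [img for img in by_name.values() if "features" in img]
features = np.array([img["features"] for img in kept], dtype=np.float32)
print(features.shape)
```

Because the feature matrix is derived from the already-filtered list rather than from a second pass over the dictionary, the records and the rows stay aligned regardless of dictionary ordering, which matters on Python 2 where that ordering is arbitrary.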

In [32]:
page = [x for x in articles if x["title"] == "Barack Obama"][0]
text = page["text"]#page["img"][0]["description"]
print(text)
rnn_vec, bow_w2v_vec = encode_text(opt, text2vec, bow2vec, w2v2vec, text)
predicted_features = predictor.predict_one(rnn_vec, bow_w2v_vec).reshape(1, -1)
predicted_features

Barack Hussein Obama II  January 20 2009 born August 4 1961 is an American attorney and politician who served as the 44th president of the United States from 2009 to 2017 A member of the Democratic Party he was the first African American to be elected to the presidency He previously served as a US senator from Illinois from 2005 to 2008 and an Illinois state senator from 1997 to 2004  Obama was born in Honolulu Hawaii After graduating from Columbia University in 1983 he worked as a community organizer in Chicago In 1988 he enrolled in Harvard Law School where he was the first black president of the Harvard Law Review After graduating he became a civil rights attorney and an academic teaching constitutional law at the University of Chicago Law School from 1992 to 2004 He represented the 13th district for three terms in the Illinois Senate from 1997 until 2004 when he ran for the US Senate He received national attention in 2004 with his March primary win his wellreceived July Democratic


array([[6.832839, 3.157705, 8.74132 , ..., 8.02501 , 9.083664, 7.874597]],
      dtype=float32)

In [33]:
print(page["img"][0]["url"])

https://en.wikipedia.org/wiki/File%3A58th_Presidential_Inaugural_Ceremony_170120-D-BP749-1327.jpg


In [34]:
similarity = np.array(losser.calculate(predicted_features, img_features)[0])
# res = res + 1
similarity

array([-0.69561876, -0.59700312, -0.73438151, ..., -0.73371196,
       -0.65384362, -0.86627162])

Double-checking that `similarity` and `img_features` have the same order

In [15]:
print(similarity[:3])
print(similarity[-3:])

[-0.70493889 -0.65125447 -0.75555891]
[-0.75000157 -0.72129319 -0.882415  ]


In [17]:
print(losser.calculate(img_features[:3], predicted_features))
print(losser.calculate(img_features[-3:], predicted_features))

[[-0.7049388876239739], [-0.6512544742938519], [-0.7555589063030781]]
[[-0.7500015651930437], [-0.721293186013536], [-0.8824149962743585]]


Double-checking that `images` and `img_features` have the same order

In [18]:
get_features = lambda img: np.array(img['features']).astype(np.float32)
all([(get_features(images[i]) == img_features[i]).all() for i in range(len(images))])

True

* 1 double-check that the order is the same, because the similarities are very large and the results are poor
* 5 if that does not work, train on a single image per article (the most relevant one)
* 4 finish with text2text similarity (the last priority)
* 2 identify an article with high precision and check its images (is the result real?)
* 3 check that we get the same precision

In [35]:
min(similarity), max(similarity)

(-0.9058710065953819, -0.2193203322630154)

In [36]:
page[u"title"]

u'Barack Obama'

Images that actually appear on the `Barack Obama` Wikipedia page

In [29]:
for x in page["img"]:
    print(x['url'])

https://en.wikipedia.org/wiki/File%3A58th_Presidential_Inaugural_Ceremony_170120-D-BP749-1327.jpg
https://en.wikipedia.org/wiki/File%3ABarackObamaportrait.jpg
https://en.wikipedia.org/wiki/File%3ABarack_Obama_Iraq_2006.jpg
https://en.wikipedia.org/wiki/File%3ABarack_Obama_addresses_joint_session_of_Congress_2009-02-24.jpg
https://en.wikipedia.org/wiki/File%3ABarack_Obama_and_Bill_Clinton_%28cropped1%29.jpg
https://en.wikipedia.org/wiki/File%3ABarack_Obama_and_Matteo_Renzi_October_2016%2C_1.jpg
https://en.wikipedia.org/wiki/File%3ABarack_Obama_playing_basketball_with_members_of_Congress_and_Cabinet_secretaries_2.jpg
https://en.wikipedia.org/wiki/File%3ABarack_Obama_talks_with_Benjamin_Netanyahu_%288637772147%29.jpg
https://en.wikipedia.org/wiki/File%3ABarack_Obama_visiting_victims_of_2012_Aurora_shooting.jpg
https://en.wikipedia.org/wiki/File%3ABarack_Obama_welcomes_Shimon_Peres_in_the_Oval_Office.jpg
https://en.wikipedia.org/wiki/File%3ABlackhawksWhiteHouse2010.jpg
https://en.wikipedia

Top-10 ranked images predicted by the model for the `Barack Obama` page

In [37]:
print(similarity[similarity.argsort()[:10]])

[-0.90587101 -0.90522574 -0.90418146 -0.90394407 -0.8990243  -0.89894226
 -0.89850592 -0.89818072 -0.89746584 -0.89725399]


In [38]:
for x in images[similarity.argsort()[:10]]:
    print(x['url'])

https://en.wikipedia.org/wiki/File%3ACM_Punk_2.jpg
https://en.wikipedia.org/wiki/File%3AClintonSenate.jpg
https://en.wikipedia.org/wiki/File%3AMcCain25April2007Portsmouth.jpg
https://en.wikipedia.org/wiki/File%3ANational_Prayer_Service_Obama_Inauguration.jpg
https://en.wikipedia.org/wiki/File%3ARIAN_archive_837790_Valentina_Tereshkova_and_Neil_Armstrong.jpg
https://en.wikipedia.org/wiki/File%3AThe-Dream_performing.jpg
https://en.wikipedia.org/wiki/File%3AGough_and_Margaret_Whitlam_-_Holt%27s_memorial_service.jpg
https://en.wikipedia.org/wiki/File%3AG8_leaders_watching_football.jpg
https://en.wikipedia.org/wiki/File%3AMichael_Buffer_Fight_For_Children_Washington_DC_Nov_2007.JPG
https://en.wikipedia.org/wiki/File%3AUrsula_K_Le_Guin.JPG


## 4. Predict visual features of a novel sentence

In [39]:
sent='a dog is playing with a cat'
rnn_vec, bow_w2v_vec = encode_text(opt,text2vec,bow2vec,w2v2vec,sent)
predicted_text_feat = predictor.predict_one(rnn_vec,bow_w2v_vec)
print len(predicted_text_feat)
print predicted_text_feat

2048
[ 6.1648006  5.2095037  8.546985  ...  7.519165  10.404728   7.7131257]
