## Text Search Demo

In [1]:
%load_ext autoreload
%autoreload 2

In [8]:
from winnow.text_search.main_utils import get_search_engine,query_signatures,load_search_space_from_signatures,VideoSearch

The simplest way to test the text search module is to run it against a set of extracted frames and signatures.

In [9]:
# As generated by the project's feature extraction pipeline
SIGNATURES_FOLDER = "/home/felipe/ai/toptal/benetech/related_repos/winnow_text/data/syria/video_signatures"
FRAMES_FOLDER = "/home/felipe/ai/toptal/benetech/related_repos/winnow_text/data/syria/frames"

In [10]:
vs = VideoSearch(SIGNATURES_FOLDER)

In [16]:
files, distances,query_data = vs.query('people in the streets')

cosine_sim execution time: 0.032 seconds



### List of files and distances aligned and sorted

In [17]:
list(zip(files,distances))[:4]

[('45687b6c401743acbd1af67ce720f689.ogv', 0.02912245),
 ('f64024c0d2a94efeb8fbfe2e860e3e9e.mp4', 0.02550951),
 ('b1f6d81fa22449289720015281b05bc8.webm', 0.024763409),
 ('c9d8c700aa8d40b692882986f9ebd146.ogv', 0.024474367)]
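
For intuition, the aligned and sorted lists above could come from a ranking step like the one sketched below. This is not the `VideoSearch` implementation: `rank_by_cosine`, the toy signatures, and the file names are hypothetical. It also assumes the reported values are cosine similarities sorted from highest to lowest, which the `cosine_sim` timing line and the descending values suggest but do not confirm.

```python
import numpy as np


def rank_by_cosine(query_vec, signatures, names):
    """Return (names, scores) sorted from most to least similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    s = signatures / np.linalg.norm(signatures, axis=1, keepdims=True)
    scores = s @ q                 # cosine similarity of each signature to the query
    order = np.argsort(-scores)    # indices sorted by descending similarity
    return [names[i] for i in order], tuple(scores[order])


# Toy data: three 4-dimensional "signatures" (hypothetical values and names).
toy_names = ["a.mp4", "b.ogv", "c.webm"]
toy_signatures = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
toy_query = np.array([1.0, 0.1, 0.0, 0.0])

ranked_files, ranked_scores = rank_by_cosine(toy_query, toy_signatures, toy_names)
print(list(zip(ranked_files, ranked_scores)))
```

Returning both lists in the same order is what makes `zip(files, distances)` line up file names with their scores.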

### Context information about how the query was processed and fed into our model

In [18]:
query_data

{'original_query': 'people in the streets',
 'tokens': ['people', 'streets'],
 'clean_tokens': ['people', 'streets'],
 'human_readable': 'people streets',
 'score': 1.0}

In [19]:
distances[:5]

(0.02912245, 0.02550951, 0.024763409, 0.024474367, 0.02307131)

`clean_tokens` shows what is actually passed to the text encoder; here "in" and "the" have been dropped from the original query.

## What happens if a query contains out of vocab words / typos?

In [20]:
files, distances,query_data = vs.query('street people')

cosine_sim execution time: 0.018 seconds



In [21]:
query_data

{'original_query': 'street people',
 'tokens': ['street', 'people'],
 'clean_tokens': ['street', 'people'],
 'human_readable': 'street people',
 'score': 1.0}

The model ignores* any word that is not in its vocabulary, such as "Cairo": the word is dropped from `clean_tokens` and never reaches the text encoder.

Note: depending on the model configuration, out-of-vocabulary words might instead be mapped to multiple "out_of_vocab" entities; this varies with the strategy used (word2vec, bag-of-words, word hashing).
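
As an illustration of the word-hashing strategy mentioned in the note, the sketch below maps unknown words to a small, fixed set of out-of-vocabulary ids. Everything here is hypothetical (`oov_bucket`, `encode`, the bucket count, and the toy vocabulary); it shows the general technique, not how this project's model is configured.

```python
import hashlib

NUM_OOV_BUCKETS = 4  # hypothetical number of shared out-of-vocabulary ids


def oov_bucket(word, num_buckets=NUM_OOV_BUCKETS):
    """Deterministically map an unknown word to one of `num_buckets` ids."""
    digest = hashlib.md5(word.lower().encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def encode(tokens, vocab):
    """Known words keep their vocabulary id; unknown words get a hashed OOV id."""
    ids = []
    for tok in tokens:
        if tok in vocab:
            ids.append(vocab[tok])
        else:
            # OOV ids are placed after the last regular vocabulary id.
            ids.append(len(vocab) + oov_bucket(tok))
    return ids


toy_vocab = {"street": 0, "people": 1}  # hypothetical vocabulary
toy_ids = encode(["street", "people", "cairo"], toy_vocab)
print(toy_ids)
```

With this strategy an unknown word still contributes something to the encoded query, at the cost of unrelated words sometimes sharing a bucket.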

In [22]:
files, distances,query_data = vs.query('Agfaqgfagagagagagag')
query_data,distances[:5]

cosine_sim execution time: 0.011 seconds



({'original_query': 'Agfaqgfagagagagagag',
  'tokens': ['agfaqgfagagagagagag'],
  'clean_tokens': [],
  'human_readable': '<NA>',
  'score': 0.0},
 (0.18101189, 0.15912527, 0.1558112, 0.14925787, 0.14747253))

In [23]:
files, distances,query_data = vs.query('Agfxqgfagagxgagagag')
query_data,distances[:5]

cosine_sim execution time: 0.011 seconds



({'original_query': 'Agfxqgfagagxgagagag',
  'tokens': ['agfxqgfagagxgagagag'],
  'clean_tokens': [],
  'human_readable': '<NA>',
  'score': 0.0},
 (0.18101189, 0.15912527, 0.1558112, 0.14925787, 0.14747253))

In [24]:
files, distances,query_data = vs.query('Baafalfalfjlajfaljflafjl')
query_data,distances[:5]

cosine_sim execution time: 0.022 seconds



({'original_query': 'Baafalfalfjlajfaljflafjl',
  'tokens': ['baafalfalfjlajfaljflafjl'],
  'clean_tokens': [],
  'human_readable': '<NA>',
  'score': 0.0},
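 (0.18101189, 0.15912527, 0.1558112, 0.14925787, 0.14747253))

In general, "dirty" queries should return the same results: the three gibberish queries above all end up with empty `clean_tokens` and produce identical distances.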

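The `query_data` dictionaries shown in this section could be produced by a normalization step along the lines of the sketch below. The stop-word list, the vocabulary, and the `query_context` helper are hypothetical, and treating `score` as the fraction of tokens found in the vocabulary is an assumption: the outputs here only ever show 1.0 and 0.0.

```python
STOP_WORDS = {"in", "the", "a", "of"}                    # hypothetical stop-word list
VOCAB = {"people", "street", "streets", "red", "flag"}   # hypothetical vocabulary


def query_context(query):
    """Tokenize a query and keep only the tokens the model knows about."""
    tokens = [t for t in query.lower().split() if t not in STOP_WORDS]
    clean_tokens = [t for t in tokens if t in VOCAB]
    # Assumed meaning of "score": share of tokens that survive the vocabulary filter.
    score = len(clean_tokens) / len(tokens) if tokens else 0.0
    return {
        "original_query": query,
        "tokens": tokens,
        "clean_tokens": clean_tokens,
        "human_readable": " ".join(clean_tokens) if clean_tokens else "<NA>",
        "score": score,
    }


print(query_context("people in the streets"))
print(query_context("Agfaqgfagagagagagag"))
```

Under this sketch, every query whose tokens are all unknown collapses to the same empty `clean_tokens` list, which would explain why such queries share one set of distances.
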
## Does word order matter?

In [25]:
files, distances,query_data = vs.query('red flag')

cosine_sim execution time: 0.017 seconds



In [26]:
query_data

{'original_query': 'red flag',
 'tokens': ['red', 'flag'],
 'clean_tokens': ['red', 'flag'],
 'human_readable': 'red flag',
 'score': 1.0}

In [27]:
distances[:5]

(0.12298576, 0.12203802, 0.121082336, 0.11644756, 0.11514723)

In [28]:
files, distances,query_data = vs.query('flag red')
query_data

cosine_sim execution time: 0.014 seconds



{'original_query': 'flag red',
 'tokens': ['flag', 'red'],
 'clean_tokens': ['flag', 'red'],
 'human_readable': 'flag red',
 'score': 1.0}

In [29]:
distances[:5]

(0.113494426, 0.11275999, 0.111292556, 0.106624216, 0.105996415)

The model takes advantage of sequential information, so queries with the same tokens in a different order should get different results.
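
To make the contrast concrete, the sketch below compares an order-insensitive encoder (averaging word vectors) with an order-sensitive one (a toy recurrent update). Both encoders, the random embeddings, and the weight matrix are hypothetical stand-ins; the recurrent update is only an assumed example of a sequence model, not the architecture used by this project.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
# Hypothetical word embeddings and recurrent weights, randomly initialized.
EMBED = {"red": rng.normal(size=DIM), "flag": rng.normal(size=DIM)}
W = rng.normal(size=(DIM, DIM)) * 0.5


def bag_of_words(tokens):
    """Average the word vectors: the result does not depend on token order."""
    return np.mean([EMBED[t] for t in tokens], axis=0)


def recurrent(tokens):
    """Fold tokens into a hidden state one at a time: order changes the result."""
    h = np.zeros(DIM)
    for t in tokens:
        h = np.tanh(W @ h + EMBED[t])
    return h


same_bow = np.allclose(bag_of_words(["red", "flag"]), bag_of_words(["flag", "red"]))
same_rnn = np.allclose(recurrent(["red", "flag"]), recurrent(["flag", "red"]))
print("bag-of-words identical:", same_bow)
print("recurrent identical:", same_rnn)
```

An encoder of the second kind produces different query vectors for 'red flag' and 'flag red', which is consistent with the different distances returned above.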

## Low-Level API

Example of how a "probing" function could be implemented.

In [71]:
from winnow.text_search.main_utils import get_signature_activations,get_query_context,load_model
from glob import glob
import os
import numpy as np

In [72]:
model = load_model()

In [58]:
sample_signatures_file_paths = glob(os.path.join(SIGNATURES_FOLDER, "*.npy"))

In [86]:
# signature for a video known to have fire
sig = [x for x in sample_signatures_file_paths if "b83cb57" in x]

In [89]:
sample_signature = np.array([np.load(sig[0])])
sample_signature.shape

(1, 500)

In [90]:
query_for_fire = get_signature_activations("fire",sample_signature,model)

cosine_sim execution time: 0.000 seconds



In [92]:
concepts = "fire hospital drugs street people desert plane tank weapons smoke sky bombs children blood"

In [93]:
normalized_data = get_query_context(concepts)

In [94]:
meaningful_tokens = normalized_data['clean_tokens']

In [None]:
sims = np.array([get_signature_activations(x,sample_signature,model) for x in meaningful_tokens])
# Rescale cosine similarities from [-1, 1] to [0, 1]
normalized = (sims - (-1)) / (1 - (-1))

In [101]:
dict(zip(meaningful_tokens,normalized[:,0]))

{'fire': 0.65911376,
 'hospital': 0.48177385,
 'drugs': 0.49717602,
 'street': 0.5205738,
 'people': 0.4893952,
 'desert': 0.5021971,
 'plane': 0.4601108,
 'tank': 0.5076987,
 'weapons': 0.5362744,
 'smoke': 0.54367775,
 'sky': 0.54045117,
 'bombs': 0.54835665,
 'children': 0.48668715,
 'blood': 0.49972978}
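
The rescaling step maps a cosine similarity from [-1, 1] to [0, 1], so 0.5 corresponds to "no similarity". The sketch below wraps that step in a helper and ranks the concepts from the output above; `rescale_cosine` and `top_concepts` are hypothetical names, and the values are copied from that output rather than recomputed.

```python
import numpy as np


def rescale_cosine(sim):
    """Map cosine similarity from [-1, 1] to [0, 1]: (sim - (-1)) / (1 - (-1))."""
    return (np.asarray(sim, dtype=float) + 1.0) / 2.0


def top_concepts(activations, k=3):
    """Return the k concepts with the highest rescaled activation."""
    return sorted(activations.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Rescaled activations copied from the notebook output above.
activations = {
    "fire": 0.65911376, "hospital": 0.48177385, "drugs": 0.49717602,
    "street": 0.5205738, "people": 0.4893952, "desert": 0.5021971,
    "plane": 0.4601108, "tank": 0.5076987, "weapons": 0.5362744,
    "smoke": 0.54367775, "sky": 0.54045117, "bombs": 0.54835665,
    "children": 0.48668715, "blood": 0.49972978,
}

print(rescale_cosine([-1.0, 0.0, 1.0]))   # [0.  0.5 1. ]
print(top_concepts(activations))
# [('fire', 0.65911376), ('bombs', 0.54835665), ('smoke', 0.54367775)]
```

For this signature, taken from a video known to contain fire, "fire" stands clearly above the other concepts, while most of the rest sit close to the 0.5 midpoint.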