# Query expansion

In this notebook we will experiment with a few different methods of expanding a user's initial query.

In [5]:
from gensim.models import KeyedVectors
import numpy as np
import os
print(os.getcwd())

C:\Users\skralj\Desktop\envirolens\word-embeddings\notebooks


In [11]:
model_path = '../data/fasttext/cc.en.300.vec'
we_en = KeyedVectors.load_word2vec_format(model_path)

###### Size of the vocabulary

In [18]:
len(we_en.vocab)

2000000

###### Timing each query

Our model contains 2 million words. We are interested in the time it takes to find the words most similar to our initial query.

In [36]:
from time import time

def calculate_computing_time(words, model):
    """
    Given a list of words and a word embedding (model), return the words most
    similar to the given words, along with the time in seconds it took to compute them.
    """
    start_time = time()
    result = model.most_similar(positive=words)
    
    return result, time() - start_time

In [37]:
calculate_computing_time(['deforestation'], we_en)

([('Deforestation', 0.8420504331588745),
  ('de-forestation', 0.7416400909423828),
  ('deforesting', 0.721744179725647),
  ('reforestation', 0.6759119629859924),
  ('deforested', 0.6683492660522461),
  ('overexploitation', 0.6402685642242432),
  ('clear-cutting', 0.633387565612793),
  ('desertification', 0.6263412833213806),
  ('forestation', 0.6201666593551636),
  ('re-forestation', 0.6115913987159729)],
 0.13412880897521973)

In [38]:
import random

all_english_words = list(we_en.vocab.keys())
random_words = random.sample(all_english_words, 50)

# calculate average time it takes to compute the result
total_time = 0
for word in random_words:
    result, time_needed = calculate_computing_time([word], we_en)
    total_time += time_needed

print(total_time/len(random_words))

0.13202642440795898


In [39]:
total_time = 0
total_tries = 100
for _ in range(total_tries):
    # time a query made of 4 randomly chosen words
    random_pick = random.sample(random_words, 4)
    result, time_needed = calculate_computing_time(random_pick, we_en)
    total_time += time_needed

print(total_time/total_tries)

0.13227663040161133


## Testing query expansion

Compute the words most similar to the query and return them as the expansion. Only results above a given threshold should be kept. For example, if the best of the top 10 results has a cosine similarity of only 0.28932, that word is probably not very similar to the query, and neither are the words ranked below it.

In [40]:
query = ['polluted', 'air']
print(we_en.most_similar(positive=query))

[('pollution', 0.6597203016281128), ('unpolluted', 0.6336122751235962), ('poluted', 0.6277227401733398), ('polluting', 0.6188726425170898), ('pollutants', 0.6181062459945679), ('smog-filled', 0.6178839802742004), ('non-polluted', 0.6104838848114014), ('pollute', 0.6028103828430176), ('unbreathable', 0.5904800295829773), ('smoggy', 0.5843187570571899)]


In [43]:
query = ['moving', 'water']
print(we_en.most_similar(positive=query))

[('flowing', 0.5391411781311035), ('water.As', 0.5355413556098938), ('water.But', 0.5353800654411316), ('water.Now', 0.5329791903495789), ('water.This', 0.5317260026931763), ('move', 0.5306128263473511), ('water--the', 0.5285866856575012), ('water--and', 0.5281214714050293), ('water.With', 0.5265851020812988), ('water.The', 0.5243165493011475)]
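
The threshold idea described at the start of this section could be sketched roughly as follows. The function name `expand_query` and the default `threshold=0.6` are assumptions made for illustration, not values taken from this notebook. The only model method used is `most_similar(positive=..., topn=...)`, as in the cells above.

```python
def expand_query(words, model, topn=10, threshold=0.6):
    """
    Return the expansion terms for a query (hypothetical helper).

    words     -- list of query words
    model     -- any object with a gensim-style most_similar(positive=..., topn=...)
    topn      -- how many nearest neighbours to request
    threshold -- minimum cosine similarity a neighbour needs to be kept
                 (0.6 is an arbitrary illustrative value)
    """
    candidates = model.most_similar(positive=words, topn=topn)
    # keep only neighbours whose similarity reaches the threshold
    return [word for word, score in candidates if score >= threshold]
```

With a threshold of 0.6, the `['polluted', 'air']` results above would keep the first eight terms, from `'pollution'` through `'pollute'`. The `['moving', 'water']` query would return an empty list, since its best score is about 0.539. A suitable threshold would need to be tuned on real queries.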
