# Using Word2Vec Embedding to extend POC Guesser

We can use Gensim to build a more powerful version of our proof of concept. The main limitation is that guesses outside the model's vocabulary cannot be scored; we might tackle this later with FastText. For now, let's see if we can make something less toy-like using the pretrained Google News Skip-Gram model with 300-dimensional embeddings (requires ~2GB).
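
Because the vocabulary is fixed, a guess has to be checked before it is looked up. The sketch below is one way to do that, assuming Gensim 4's `KeyedVectors.key_to_index` mapping. `resolve_guess` and the toy vocabulary are hypothetical names used for illustration, and the fallbacks (case variants, underscore-joined phrases) are only a guess at what helps with this model's case-sensitive keys, not a complete solution.

```python
def resolve_guess(guess, key_to_index):
    """Return the first vocabulary key that matches `guess`, or None.

    `key_to_index` is any mapping keyed by vocabulary entries, e.g. the
    `key_to_index` attribute of a Gensim 4 KeyedVectors object.
    """
    stripped = guess.strip()
    joined = "_".join(stripped.split())  # multi-word guesses become "a_b"
    candidates = [stripped, joined, joined.lower(), joined.capitalize(), joined.title()]
    for candidate in candidates:
        if candidate in key_to_index:
            return candidate
    return None  # out of vocabulary: the caller has to skip or handle it


# Toy stand-in for google_news_wv.key_to_index (hypothetical contents)
toy_vocab = {"hot": 0, "Hot": 1, "New_York": 2}
print(resolve_guess("HOT", toy_vocab))
print(resolve_guess("new york", toy_vocab))
print(resolve_guess("hott", toy_vocab))
```

A guess that resolves to `None` is exactly the case FastText would be meant to cover later.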

If the download fails with an SSL certificate error, update your certificates first (the second command applies to python.org installs on macOS):
```
pip install -U certifi

"/Applications/Python 3.X/Install Certificates.command"
```

In [4]:
import os
import gensim
import gensim.downloader
import gensim.models

saved_path_name = "word2vec-google-news-300_c"
limit = 200_000

if not os.path.exists(saved_path_name):
    print("Checking cache and downloading")
    google_news_wv = gensim.downloader.load("word2vec-google-news-300")
    google_news_wv.save_word2vec_format(saved_path_name)
    print("Saved to disk in C format")
    del google_news_wv

print(f"Loading {limit} from saved")
google_news_wv = gensim.models.KeyedVectors.load_word2vec_format(saved_path_name, limit=limit)

Loading 200000 from saved


In [5]:
vocab_size = len(google_news_wv.index_to_key)
print(vocab_size)
# Show the first ten keys
for index, word in enumerate(google_news_wv.index_to_key[:10]):
    print(f"word #{index}/{vocab_size} is {word}")

200000
word #0/200000 is </s>
word #1/200000 is in
word #2/200000 is for
word #3/200000 is that
word #4/200000 is is
word #5/200000 is on
word #6/200000 is ##
word #7/200000 is The
word #8/200000 is with
word #9/200000 is said


Computing over the whole vocabulary is slow, slower than I'd like at least, and it needs more memory than my machine has. I could check how much caching helps, but I think it makes more sense to reduce the computation and memory cost directly.
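
One cheap way to cut both time and memory is to drop keys the guesser can never produce before doing any similarity work. The listing above already shows a few such keys (`</s>`, `##`). This is a minimal sketch, assuming guesses are single lowercase alphabetic words; `keep_key` and `filter_vocabulary` are hypothetical helpers, not part of Gensim.

```python
def keep_key(key):
    """True for keys that look like a plain lowercase word."""
    return key.isalpha() and key.islower()


def filter_vocabulary(keys):
    """Keep only guessable keys, preserving the original order."""
    return [key for key in keys if keep_key(key)]


# The first ten keys printed above
first_keys = ["</s>", "in", "for", "that", "is", "on", "##", "The", "with", "said"]
print(filter_vocabulary(first_keys))
```

With the real model the input would be `google_news_wv.index_to_key`. Whether capitalized keys such as `The` should really be dropped depends on what the guesser is allowed to output.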

We can try prioritizing by similarity, i.e. only scoring the words most similar to the target. I also think it makes sense to focus on words the guesser already assigns a higher probability.
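
Focusing on the words the guesser already favors could look like the sketch below. It assumes the guesser exposes a mapping from word to probability (called `guesser_probabilities` here, a hypothetical name) and takes any similarity function, so with the real model `similarity_fn` could be `google_news_wv.similarity`. The toy numbers are made up and exist only to keep the example self-contained.

```python
import heapq


def score_top_candidates(target, guesser_probabilities, similarity_fn, k=3):
    """Score only the k most probable candidates against `target`.

    Returns (word, similarity) pairs, most similar first.
    """
    top_words = heapq.nlargest(k, guesser_probabilities, key=guesser_probabilities.get)
    scored = [(w, similarity_fn(target, w)) for w in top_words if w != target]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (assumed values, not model output)
toy_probs = {"cold": 0.4, "warm": 0.3, "trunk": 0.2, "dog": 0.1}
toy_sims = {"cold": 0.46, "warm": 0.43, "trunk": 0.04, "dog": 0.20}
result = score_top_candidates("hot", toy_probs, lambda t, w: toy_sims[w], k=3)
print(result)
```

The cost is then `k` similarity lookups per target instead of one per vocabulary entry, at the price of never scoring a word the guesser considered unlikely.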

In [65]:
import math

word = "hot"

# This takes forever

# similarity_sum = 0
# for i, other_word in enumerate(google_news_wv.index_to_key):
#     similarity_sum += google_news_wv.similarity(other_word, word)
#     if i % 1000 == 0:
#         print(i)

# Instead, let's prioritize by similarity
similarity_sum = 0
words = []
similarities = []
# Use a separate loop variable so `word` keeps referring to the target
for similar_word, similarity in google_news_wv.most_similar(positive=[word], topn=10):
    words.append(similar_word)
    similarity_sum += abs(similarity)
    similarities.append(similarity)
log_similarity_sum = math.log(similarity_sum)
# Normalize in log space: log p(w) = log|sim(target, w)| - log(sum of |sim|)
word_log_probabilities = [math.log(abs(google_news_wv.similarity(word, other_word))) - log_similarity_sum for other_word in words]
print(words[:5])
print(similarities[:5])
print([round(math.exp(lgp), 4) for lgp in word_log_probabilities[:5]])
print(words[-5:])
print(similarities[-5:])
print([round(math.exp(lgp), 4) for lgp in word_log_probabilities[-5:]])
print(google_news_wv.similarity("hot", "cold"))
print(google_news_wv.similarity("hot", "warm"))
print(google_news_wv.similarity("hot", "spicy"))
print(google_news_wv.similarity("hot", "Prescription_Solutions"))
print(google_news_wv.similarity("hot", "Pharmaceutical_Research"))
print(google_news_wv.similarity("hot", "GNI"))
print(google_news_wv.similarity("hot", "dog"))
print(google_news_wv.similarity("hot", "trunk"))
print(google_news_wv.similarity("boat", "fuzzy"))

['Hot', 'hottest', 'hotter', 'sizzling', 'scorching']
[0.6659734845161438, 0.6050904989242554, 0.5794501900672913, 0.5456804633140564, 0.5294975638389587]
[0.1235, 0.1122, 0.1075, 0.1012, 0.0982]
['cool', 'heated', 'cooled', 'cools', 'toasty']
[0.515114963054657, 0.5010007619857788, 0.4956398010253906, 0.4776151478290558, 0.47559136152267456]
[0.0956, 0.0929, 0.0919, 0.0886, 0.0882]
0.46021384
0.43215373
0.38504532
-0.21174502
-0.17452702
-0.16243233
0.20240277
0.04439323
-0.0021879696
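
Some of the similarities above are negative, which is presumably why the cell takes `abs()` before the logarithm. That choice also gives a strongly dissimilar word the same weight as an equally similar one. One alternative, offered here as a sketch and not what the notebook does, is a softmax over the raw similarities, computed in log space. The function name and the `temperature` parameter are assumptions for illustration.

```python
import math


def log_softmax(similarities, temperature=1.0):
    """Log-probabilities proportional to exp(similarity / temperature)."""
    scaled = [s / temperature for s in similarities]
    peak = max(scaled)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]


# Similarities to "hot" for cold, warm, spicy, GNI (rounded from the output above)
sims = [0.46, 0.43, 0.39, -0.16]
probs = [math.exp(lp) for lp in log_softmax(sims, temperature=0.1)]
print([round(p, 3) for p in probs])
```

A lower `temperature` concentrates probability on the most similar words; a higher one flattens the distribution. Negative similarities simply end up with small probabilities, so no `abs()` is needed.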
