# Using Word2Vec embeddings to extend the unsupervised guesser POC

We can use Gensim to make a more powerful version of our unsupervised Proof-of-Concept. Let's see if we can make less of a toy version using the Google News Skip-Gram model with 300-feature embeddings (requires ~2GB).

In [11]:
import word2vec_loader as wv_loader

limit = 200_000
print(f"Loading {limit} keys")
google_news_wv = wv_loader.load_word2vec_keyedvectors(wv_loader.GOOGLE_NEWS_PATH_NAME, limit)

Loading 200000 keys


# Modelling a stronger Guesser

Now that we have a bit of a grasp on how the Google News Word2Vec model is compatible with our Decrypto words, let's build a stronger guesser and compare some different probability schemes.

We'll start with some naive strategies that simply manipulate the cosine similarity. I expect these to perform poorly for 2 reasons.

One is that cosine similarity doesn't really correspond to anything probabilistic, so we are using it more as a heuristic. This could backfire because it doesn't take word context or word frequency into account.

Another, more subtle reason is that cosine similarity is symmetric. This implies that the probability of giving a word as a clue for a keyword is the same as the probability of giving that keyword as a clue if the roles were reversed. We know from Bayes' theorem that this isn't quite true: the two conditional probabilities differ by the ratio of the words' individual probabilities (frequencies), and cosine similarity ignores both those frequencies and the density of similar neighbors each word has in the vector space.
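To make the asymmetry concrete, here is a minimal sketch with made-up numbers. The 2-d vectors and the counts are hypothetical, not taken from the Google News model or any dataset. It shows that cosine similarity is unchanged when its arguments are swapped, while the two conditional probabilities differ as soon as the words have different frequencies.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings" (hypothetical values).
ocean = [0.9, 0.1]
water = [0.8, 0.3]

forward = cosine_similarity(ocean, water)
backward = cosine_similarity(water, ocean)
print("cosine is symmetric:", math.isclose(forward, backward))

# Hypothetical usage counts: "water" is common, "ocean" is rarer.
total_count = 10_000
count_water = 1_000
count_ocean = 100
count_both = 50  # times the two words were used together

p_water_given_ocean = count_both / count_ocean
p_ocean_given_water = count_both / count_water
print("P(water | ocean) =", p_water_given_ocean)
print("P(ocean | water) =", p_ocean_given_water)

# Bayes' theorem links the two through the individual word probabilities.
p_ocean = count_ocean / total_count
p_water = count_water / total_count
via_bayes = p_water_given_ocean * p_ocean / p_water
print("Bayes' theorem holds:", math.isclose(via_bayes, p_ocean_given_water))
```

The two conditionals differ by exactly the frequency ratio of the two words, which is the information a symmetric score cannot carry.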

Importantly, let's not forget to use log probabilities/heuristics due to our design choice in Guesser Proof-Of-Concept.
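As a quick reminder of why logs are convenient, the sketch below uses toy numbers (not model output) and assumes the guesser combines per-word scores multiplicatively. Under that assumption, summing log scores ranks candidate codes the same way as multiplying raw scores, and it avoids floating-point underflow when many small values are combined.

```python
import math

# Hypothetical per-word heuristic values for one candidate code.
scores = [0.62, 0.35, 0.48]

product = math.prod(scores)
log_sum = sum(math.log(score) for score in scores)
print("log of product equals sum of logs:", math.isclose(math.log(product), log_sum))

# Very small values underflow to zero when multiplied directly,
# but their logs can still be added and compared.
tiny_scores = [1e-200, 1e-200]
tiny_product = math.prod(tiny_scores)
tiny_log_sum = sum(math.log(score) for score in tiny_scores)
print("direct product:", tiny_product)
print("sum of logs:", tiny_log_sum)
```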

In [12]:
import math

# naive heuristics

def log_square_cosine_similarity(clue, keyword):
    similarity = google_news_wv.similarity(clue, keyword)
    return 2 * math.log(abs(similarity))

def log_normalized_cosine_similarity(clue, keyword):
    similarity = google_news_wv.similarity(clue, keyword)
    # map the similarity from [-1, 1] to [0, 1] before taking the log
    normalized_similarity = (1 + similarity) / 2
    return math.log(normalized_similarity)
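To see how the two formulas behave without loading the model, the sketch below applies them to raw similarity values. The helper names `log_square` and `log_normalized` are stand-ins introduced here for illustration; they take a similarity directly instead of looking it up in `google_news_wv`.

```python
import math

def log_square(similarity):
    # Same formula as log_square_cosine_similarity, applied to a raw value.
    return 2 * math.log(abs(similarity))

def log_normalized(similarity):
    # Same formula as log_normalized_cosine_similarity, applied to a raw value.
    return math.log((1 + similarity) / 2)

for similarity in (0.9, 0.5, -0.5):
    print(similarity, round(log_square(similarity), 3), round(log_normalized(similarity), 3))

# Each formula has one similarity value at which the logarithm is undefined.
for func, bad_value in ((log_square, 0.0), (log_normalized, -1.0)):
    try:
        func(bad_value)
    except ValueError:
        print(f"{func.__name__}({bad_value}) raises ValueError")
```

Two differences are worth keeping in mind: the squared version scores a similarity of -0.5 exactly like 0.5, while the normalized version ranks negative similarities below positive ones; and the squared version fails on a similarity of exactly 0, the normalized version on exactly -1.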

In [13]:
import decryptogame as dg
import synthetic_datamuse as sd

# load datasets to form clues

def legal(keyword, word):
    return word not in keyword and word in google_news_wv

official_words = list(map(wv_loader.official_keyword_to_word, dg.official_words.english.words))

print("Loading meaning dataset")
meaning_dataset = await sd.load_dataset_from_path(sd.MEANING_DATASET_PATH, "words?ml", official_words)
meaning_dataset = sd.filter_illegal_cluewords(legal, meaning_dataset)


print("Loading triggerword dataset")
triggerword_dataset = await sd.load_dataset_from_path(sd.TRIGGERWORD_DATASET_PATH, "words?rel_trg", official_words)
triggerword_dataset = sd.filter_illegal_cluewords(legal, triggerword_dataset)


print("Done!")



Loading meaning dataset
Loading triggerword dataset
Done!


In [16]:
# compare heuristics
from dataclasses import dataclass
from functools import partial
from itertools import permutations
from random_variable_guesser import Guess, RandomVariable, max_log_expected_probability_guess

# create sets of tests and evaluate performance of each heuristic

keyword_card_length = 4
clue_length = 3

@dataclass
class Test:
    clue: tuple[str, ...]   # one clue word per code position
    code: tuple[int, ...]   # indices into the keyword card


@dataclass
class TestSet:
    keyword_card: tuple[str, ...]
    cluecodepairs: list[Test]

@dataclass
class Result:
    test: Test
    guess: Guess

@dataclass
class ResultSet:
    keyword_card: tuple[str, ...]
    results: list[Result]



def codewords(keyword_card, code):
    return [wv_loader.official_keyword_to_word(keyword_card[i]) for i in code]

all_possible_codes = list(permutations(range(keyword_card_length), clue_length))


def generate_clue_set(clue_from_codewords_func, keyword_card, codes=all_possible_codes):
    tests = [Test(clue_from_codewords_func(codewords(keyword_card, code)), code) for code in codes]
    return TestSet(keyword_card, tests)

num_clue_sets = 100

keyword_card_generator = dg.generators.RandomKeywordCards(card_lengths=[keyword_card_length], seed=100)
test_keyword_cards = [keyword_card for _, [keyword_card] in zip(range(num_clue_sets), keyword_card_generator)]

meaning_clue_from_code = partial(sd.clue_from_codewords, meaning_dataset)
meaning_clue_sets = [generate_clue_set(meaning_clue_from_code, keyword_card) for keyword_card in test_keyword_cards]

triggerword_clue_from_code = partial(sd.clue_from_codewords, triggerword_dataset)
triggerword_clue_sets = [generate_clue_set(triggerword_clue_from_code, keyword_card) for keyword_card in test_keyword_cards]

# len(all_possible_codes) * num_clue_sets = total guesses for each dataset

def get_result_set(strat_func, clue_set: TestSet) -> ResultSet:
    results = []
    random_vars = [RandomVariable({wv_loader.official_keyword_to_word(keyword): 0.0}) for keyword in clue_set.keyword_card]
    for cluecodepair in clue_set.cluecodepairs:
        guess = max_log_expected_probability_guess(strat_func, random_vars, cluecodepair.clue)
        results.append(Result(cluecodepair, guess))
    return ResultSet(clue_set.keyword_card, results)



for strat_func in [log_square_cosine_similarity, log_normalized_cosine_similarity]:
    print(strat_func.__name__)

    meaning_result_sets = [get_result_set(strat_func, meaning_clue_set) for meaning_clue_set in meaning_clue_sets]
    percent_correct = 100 * sum(result.guess.code == result.test.code for result_set in meaning_result_sets for result in result_set.results) / (len(meaning_clue_sets) * len(all_possible_codes))
    print(f"meaning clue set percent correct: {percent_correct}%")

    triggerword_result_sets = [get_result_set(strat_func, triggerword_clue_set) for triggerword_clue_set in triggerword_clue_sets]
    percent_correct = 100 * sum(result.guess.code == result.test.code for result_set in triggerword_result_sets for result in result_set.results) / (len(triggerword_clue_sets) * len(all_possible_codes))
    print(f"triggerword clue set percent correct: {percent_correct}%")

log_square_cosine_similarity
meaning clue set percent correct: 71.25%
triggerword clue set percent correct: 69.25%
log_normalized_cosine_similarity
meaning clue set percent correct: 72.5%
triggerword clue set percent correct: 70.75%


Our naive guessers are performing very well! If they were guessing at random, or always producing the same code, we would expect them to get only about 1 in 24 correct (about 4.17%), since there are 24 possible codes. This is far better than I was expecting, and we may be able to get a lot more performance with better heuristics.
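The baseline figure follows directly from counting the codes. A short check, using the same `keyword_card_length` and `clue_length` values as the experiment above:

```python
from itertools import permutations

keyword_card_length = 4
clue_length = 3

# Every ordered choice of 3 distinct positions out of 4 is a possible code.
codes = list(permutations(range(keyword_card_length), clue_length))
baseline = 1 / len(codes)

print(f"{len(codes)} codes, baseline accuracy {baseline:.2%}")
```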

Two more probabilistic heuristics come to mind. One is to take the cosine similarity between the clue and the keyword as a proportion of the keyword's total similarity to all words in the dataset. Another is to take the number of words whose cosine distance from the keyword is greater than the clue's distance from the keyword, as a proportion of all words in the dataset. Not only are these more probabilistic in nature, they also have the more subtle trait of being asymmetric in the sense described earlier.
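A minimal sketch of both ideas on a toy vocabulary follows. Everything here is an assumption made for illustration: the 2-d vectors are invented, the function names are hypothetical, the similarity is shifted into [0, 1] so that shares stay positive, and the rank version counts words that are at least as far away as the clue (rather than strictly farther) so that the logarithm is always defined.

```python
import math

# Toy vocabulary with hypothetical 2-d vectors.
vectors = {
    "ocean": [1.0, 0.0],
    "sea": [0.95, 0.1],
    "wave": [0.8, 0.4],
    "water": [0.7, 0.7],
    "drink": [0.2, 1.0],
    "glass": [0.0, 1.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def normalized_similarity(word, keyword):
    # Shift cosine similarity from [-1, 1] into [0, 1].
    return (1 + cosine(vectors[word], vectors[keyword])) / 2

def log_similarity_share(clue, keyword):
    """Log of the clue's share of the keyword's similarity to all other words."""
    others = [word for word in vectors if word != keyword]
    total = sum(normalized_similarity(word, keyword) for word in others)
    return math.log(normalized_similarity(clue, keyword) / total)

def log_rank_share(clue, keyword):
    """Log of the fraction of other words at least as far from the keyword as the clue."""
    others = [word for word in vectors if word != keyword]
    clue_similarity = normalized_similarity(clue, keyword)
    not_closer = sum(normalized_similarity(word, keyword) <= clue_similarity for word in others)
    return math.log(not_closer / len(others))

# Swapping clue and keyword now changes the score.
for heuristic in (log_similarity_share, log_rank_share):
    forward = heuristic("water", "ocean")
    backward = heuristic("ocean", "water")
    print(heuristic.__name__, round(forward, 3), round(backward, 3))
```

Both scores depend on how crowded the keyword's neighborhood is, which is why they stop being symmetric. On the full model, the sum or count would run over the loaded vocabulary, which is far more expensive than a single similarity lookup.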

Tuning these metrics with parameters that depend on word frequency may also be relevant. Since we don't have that data through Gensim, [Zipf's Law](https://en.wikipedia.org/wiki/Zipf%27s_law) may be useful.
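One way Zipf's law could stand in for missing frequency data is sketched below. It assumes a word's 1-based frequency rank is known; with Gensim that might be approximated by `key_to_index[word] + 1`, but only if the vectors file is sorted by descending frequency, which should be verified before relying on it. The function names and the default exponent of 1 are choices made for this illustration.

```python
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def zipf_normalizer(vocab_size, exponent=1.0):
    # Generalized harmonic number: makes the probabilities sum to 1.
    return sum(1 / rank ** exponent for rank in range(1, vocab_size + 1))

def zipf_log_probability(rank, vocab_size, exponent=1.0):
    """Log probability of the word at a 1-based frequency rank under Zipf's law."""
    if not 1 <= rank <= vocab_size:
        raise ValueError("rank must be between 1 and vocab_size")
    return -exponent * math.log(rank) - math.log(zipf_normalizer(vocab_size, exponent))

vocab_size = 200_000  # matches the limit used when loading the model
for rank in (1, 10, 1_000, 100_000):
    print(rank, round(zipf_log_probability(rank, vocab_size), 3))
```

Because the result is already a log probability, it could be added to a log heuristic as a prior term, possibly scaled by a tunable weight.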

We should look more closely at the naive guesser's performance, though. Does it have a false sense of confidence in incorrect guesses? Is it similarly unsure about all of its guesses? Are the clues it gets wrong reasonable to get wrong, or is there an obvious pattern that it is missing?
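A rough sketch of how the confidence question could be approached is given below. It does not use the `Guess` objects from `random_variable_guesser`; instead it assumes, hypothetically, that a log score is available for every candidate code, and the records are invented. Confidence is taken to be the softmax probability of the top-scoring code, which is only one of several reasonable definitions.

```python
import math
from statistics import mean

def confidence(log_scores):
    """Softmax probability of the best code, given a log score per candidate code."""
    best = max(log_scores.values())
    total = sum(math.exp(score - best) for score in log_scores.values())
    return 1 / total

def split_confidences(records):
    """Group the guesser's confidence by whether its top guess was correct."""
    correct, incorrect = [], []
    for log_scores, true_code in records:
        guess = max(log_scores, key=log_scores.get)
        bucket = correct if guess == true_code else incorrect
        bucket.append(confidence(log_scores))
    return correct, incorrect

# Hypothetical records: (log scores by candidate code, true code).
records = [
    ({(0, 1, 2): -1.0, (0, 2, 1): -3.0, (1, 0, 2): -4.0}, (0, 1, 2)),
    ({(0, 1, 2): -2.0, (0, 2, 1): -2.1, (1, 0, 2): -2.2}, (0, 2, 1)),
]

correct, incorrect = split_confidences(records)
print("mean confidence when correct:", round(mean(correct), 3))
print("mean confidence when incorrect:", round(mean(incorrect), 3))
```

If the two averages turn out close on real results, the guesser is not distinguishing easy clues from hard ones; if confidence is high on wrong guesses, the heuristic is overconfident.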