# Probabilistic word embeddings

## Anna Potapenko, Artem Popov, Konstantin Vorontsov

This notebook is an example of constructing Probabilistic Word Embeddings. They are described in our paper "Interpretable probabilistic embeddings: bridging the gap between topic models and neural networks", AINL-2017.

## 1. Preparing the collection 

We use Omer Levy's hyperwords scripts: https://bitbucket.org/omerlevy/hyperwords

This repository contains tools for transforming a document collection into a list of word pairs and their co-occurrence counts. 
Some hyperparameters are usually preset or hidden in other packages, whereas the hyperwords scripts let you tune them all manually. 

On Ubuntu you can put the required commands in a .sh file. Ours looks like this:

```
CORPUS=enwiki_clean.txt
mkdir -p w5

python hyperwords/my_corpus2pairs.py --win 5 --sub 1e-5 ${CORPUS} > w5/pairs
scripts/pairs2counts.sh w5/pairs > w5/counts 
python hyperwords/counts2vocab.py w5/counts

```
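
To make the output of this pipeline concrete, here is a minimal pure-Python sketch of window-based co-occurrence counting. It is not the hyperwords implementation: it uses a fixed symmetric window and skips subsampling (`--sub`), and the function name `count_pairs` and the toy corpus are made up for illustration. It prints lines in the `count word1 word2` layout that the parsing code below reads from `w5/counts`.

```python
from collections import Counter


def count_pairs(sentences, win=5):
    """Count (word, context) pairs within a symmetric window of `win` tokens."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - win)
            hi = min(len(tokens), i + win + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


# Hypothetical two-sentence corpus, one sentence per line.
toy_corpus = ["the cat sat on the mat", "the dog sat on the rug"]
pair_counts = count_pairs(toy_corpus, win=2)

# Same layout as the lines parsed from w5/counts: "count word1 word2"
for (word1, word2), count in sorted(pair_counts.items()):
    print(count, word1, word2)
```

With a symmetric window the counts are symmetric too, so every pair appears once in each direction.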

Then create a Vowpal Wabbit file:

In [1]:
from collections import defaultdict
import numpy as np

In [None]:
pseudo_docs = defaultdict(str)

with open('w5/counts') as f:
    for line in f:
        count, word1, word2 = line.strip().split()
        pseudo_docs[word1] += word2 + ':' + str(count) + ' '

In [None]:
# vowpal wabbit file with the collection
with open('vw_coocurence_counts', 'w') as output:
    for key in sorted(pseudo_docs.keys()):
        s = key + ' |  ' + pseudo_docs[key] + '\n'
        output.write(s)
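
For reference, each line of the resulting file is one pseudo-document: the target word, a `|` separator, then `context:count` tokens. The sketch below builds such lines from a few made-up counts so the format can be inspected without running the full pipeline. The names `toy_counts` and `to_vw_lines` are hypothetical, and the whitespace differs slightly from the cell above.

```python
from collections import defaultdict


def to_vw_lines(count_lines):
    """Turn 'count word1 word2' lines into Vowpal Wabbit pseudo-documents."""
    pseudo_docs = defaultdict(list)
    for line in count_lines:
        count, word1, word2 = line.split()
        pseudo_docs[word1].append(word2 + ':' + count)
    # One line per target word, sorted by word.
    return [word + ' | ' + ' '.join(pseudo_docs[word]) for word in sorted(pseudo_docs)]


# Invented counts in the same layout as w5/counts.
toy_counts = ["3 cat sat", "1 cat mat", "2 dog sat"]
vw_lines = to_vw_lines(toy_counts)
for vw_line in vw_lines:
    print(vw_line)
```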

## 2. Training topic model

Let's train a topic model on the co-occurrence collection using the BigARTM library.

In [4]:
import artm
import glob

First, we convert the data to the BigARTM batches format. 

In [2]:
# vowpal wabbit file with the collection
file_name = 'vw_coocurence_counts'

# folder with batches
batches_path = 'batches'

In [None]:
if len(glob.glob(batches_path + "/*.batch")) < 1:
    batch_vectorizer = artm.BatchVectorizer(data_path=file_name, 
                                            data_format='vowpal_wabbit', batch_size = 1000, 
                                            target_folder=batches_path)
else:
    batch_vectorizer = artm.BatchVectorizer(data_path=batches_path, 
                                            data_format='batches')

Create the dictionary to initialize the model:

In [None]:
my_dictionary = artm.Dictionary()

if len(glob.glob(batches_path + "/*.dict")) < 1:
    my_dictionary.gather(data_path=batches_path)
    my_dictionary.save(dictionary_path=batches_path + '/wiki_dictionary')

my_dictionary.load(dictionary_path=batches_path + '/wiki_dictionary.dict')

Create the model:

In [None]:
model = artm.ARTM(num_topics=400,
                  dictionary=my_dictionary,
                  reuse_theta=False,
                  cache_theta=False,
                  num_document_passes=10,
                  theta_columns_naming='title',
                  num_processors=7,
                  scores=[artm.PerplexityScore(name='PerplexityScore',
                                               dictionary=my_dictionary),
                          artm.SparsityPhiScore(name='SparsityPhiScore'),
                          artm.SparsityThetaScore(name='SparsityThetaScore'),
                          artm.TopicKernelScore(name='TopicKernelScore', probability_mass_threshold=0.3),
                          artm.TopTokensScore(name='TopTokensScoreText', num_tokens=30)])

In [None]:
model.initialize(my_dictionary)

Fit the model:

In [None]:
# params of the online algorithm
update_after = list(range(7, batch_vectorizer.num_batches, 7)) + [batch_vectorizer.num_batches]
j = np.arange(1, len(update_after) + 1)
rho = np.power(1 + j, -0.5) 
decay_weight = 1 - rho 
apply_weight = rho * batch_vectorizer.num_batches
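
The cell above sets up a decaying schedule for the online updates: for the $j$-th update, $\rho_j = (1 + j)^{-0.5}$, `decay_weight` $= 1 - \rho_j$ weights the previously accumulated counters, and `apply_weight` $= \rho_j \cdot$ (number of batches) weights the counters from the newest batches. The following sketch evaluates the same formulas for a made-up collection of 20 batches; the value 20 is an assumption for illustration only.

```python
import numpy as np

num_batches = 20   # hypothetical collection size
step = 7           # update the model after every 7 batches, as in the cell above

update_after = list(range(step, num_batches, step)) + [num_batches]
j = np.arange(1, len(update_after) + 1)
rho = np.power(1 + j, -0.5)
decay_weight = 1 - rho
apply_weight = rho * num_batches

for u, d, a in zip(update_after, decay_weight, apply_weight):
    print("update after batch %d: decay=%.3f apply=%.3f" % (u, d, a))
```

Later updates get a smaller $\rho_j$, so the model relies more on what it has already accumulated and less on each new portion of batches.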

In [None]:
# 0-5 passes are recommended for the online algorithm
# note: `async` is a reserved word in Python 3.7+, where this call is a syntax error;
# newer BigARTM releases name this argument `asynchronous`
for i in range(2):
    model.fit_online(batch_vectorizer=batch_vectorizer, async=True,
                     update_after=update_after, decay_weight=decay_weight, apply_weight=apply_weight)

# 20-30 passes are recommended for the offline algorithm
for i in range(20):
    model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=1)

Get the probabilistic word embedding for each word $w$:

In [None]:
phibayes_phi = model.get_phi(model_name = model.model_nwt)
pwe_vectors1 = phibayes_phi.div(phibayes_phi.sum(axis=1), axis=0)
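
The normalization above is Bayes' rule applied to the counter matrix: dividing each row of $n_{wt}$ by its row sum gives $p(t \mid w) = n_{wt} / \sum_{s} n_{ws}$, a distribution over topics for each word. Here is a toy illustration with a made-up matrix of 3 words and 2 topics (the words and counts are invented):

```python
import pandas as pd

# Hypothetical n_wt counters: rows are words, columns are topics.
n_wt = pd.DataFrame([[8.0, 2.0], [1.0, 3.0], [5.0, 5.0]],
                    index=['cat', 'dog', 'mat'],
                    columns=['topic_0', 'topic_1'])

# Divide every row by its sum, so each word gets a distribution over topics.
p_tw = n_wt.div(n_wt.sum(axis=1), axis=0)
print(p_tw)
```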

Another way to get the vectors:

In [None]:
phibayes_theta = model.transform(batch_vectorizer)
pwe_vectors2 = phibayes_theta.div(phibayes_theta.sum(axis=1), axis=0)

## 3. Evaluation

After fitting we can evaluate representations on a similarity task:

In [3]:
import codecs

import scipy.stats
import sklearn.metrics.pairwise


def most_similar(df, positive_query_list, negative_query_list=(), top=10):
    # copy so that the in-place operations below do not modify df
    vec = df.loc[positive_query_list[0]].copy()
    for query in positive_query_list[1:]:
        vec *= df.loc[query]
    for query in negative_query_list:
        vec /= df.loc[query]
    sim = sklearn.metrics.pairwise.cosine_similarity(df, vec.values.reshape(1, -1)).flatten()
    ind = np.argpartition(sim, -top)[-top:]
    topind = ind[np.argsort(-sim[ind])]
    for i in topind:
        print(df.index[i])
    return df.index[topind[0]]


def find_similarity(df, word1, word2):
    vec1 = df.loc[word1]
    vec2 = df.loc[word2]
    return sklearn.metrics.pairwise.cosine_similarity(vec1.values.reshape(1, -1), vec2.values.reshape(1, -1))[0][0]


def find_similarity_lk(df, word1, word2):
    vec1 = df.loc[word1]
    vec2 = df.loc[word2]
    return sklearn.metrics.pairwise.linear_kernel(vec1.values.reshape(1, -1), vec2.values.reshape(1, -1))[0][0]


def find_similarity_hellinger(df, word1, word2):
    # Hellinger distance with the outer sqrt and the division by sqrt(2) omitted;
    # the sign is flipped so that larger values mean more similar
    vec1 = df.loc[word1]
    vec2 = df.loc[word2]
    return -np.sum((np.sqrt(vec1) - np.sqrt(vec2)) ** 2)


def load_human_ratings(path, df):
    human_ratings = {}
    added = 0
    whole = 0
    with codecs.open(path, encoding='utf-8') as fin:
        for line in fin:
            (word1, word2, sim) = line.split()
            whole += 1
            if word1 in df.index and word2 in df.index:
                human_ratings[(word1, word2)] = float(sim)
                added += 1
    # print(added, "pairs from", whole, "are covered by model vocabulary and saved for evaluation.")
    return human_ratings


def evaluate_sim_task(human_ratings, df, mode='none'):
    if mode == 'cos':
        model_ratings = {key : find_similarity(df, key[0], key[1]) for key in human_ratings}
    elif mode == 'hellinger':
        model_ratings = {key : find_similarity_hellinger(df, key[0], key[1]) for key in human_ratings}
    elif mode == 'lk':
        model_ratings = {key : find_similarity_lk(df, key[0], key[1]) for key in human_ratings}
    else:
        raise ValueError("Unknown mode %r: expected 'cos', 'hellinger' or 'lk'" % mode)
    (sorted_keys, sorted_model) = zip(*sorted(model_ratings.items(), key=lambda x: -x[1]))
    return scipy.stats.spearmanr(sorted_model, [human_ratings[k] for k in sorted_keys])
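
As a quick sanity check of the three similarity modes, the sketch below computes them with plain NumPy on two made-up topic distributions. It mirrors the formulas in the functions above but does not call them; the vectors `p` and `q` are invented.

```python
import numpy as np

# Two hypothetical distributions over three topics.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.2, 0.7])


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def linear_kernel(a, b):
    # Plain dot product, as in mode='lk'.
    return float(np.dot(a, b))


def neg_sq_hellinger(a, b):
    # Sum of squared differences of square roots, negated so larger means more similar.
    return float(-np.sum((np.sqrt(a) - np.sqrt(b)) ** 2))


for name, fn in [('cos', cosine), ('lk', linear_kernel), ('hellinger', neg_sq_hellinger)]:
    print(name, fn(p, q), fn(p, p))
```

All three scores are highest when a vector is compared with itself, which is the property the Spearman correlation with human ratings relies on.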

In [4]:
# list of similarity datasets 
# you can get them from the hyperwords repository

list_of_paths = ['hyperwords/testsets/ws/ws353_similarity.txt',
                'hyperwords/testsets/ws/ws353_relatedness.txt',
                'hyperwords/testsets/ws/ws353.txt',
                'hyperwords/testsets/ws/bruni_men.txt',
                'hyperwords/testsets/ws/radinsky_mturk.txt']

In [None]:
for path in list_of_paths:
    human_sims = load_human_ratings(path, pwe_vectors1)
    # dataset name: the file name without directory and extension
    key = path[-path[::-1].find('/'):-4]
    answ = evaluate_sim_task(human_sims, pwe_vectors1, mode='lk')[0]
    print(key, answ)