# Create Embeddings
To speed up D-ETM training, the word embedding matrices are built in this separate notebook. It creates (smaller) embedding matrices for Word2Vec, GloVe and fastText, each of which also contains embeddings for the bigrams and trigrams in the vocabulary.

## Directories & Libraries

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
%cd '/content/drive/My Drive/Thesis/Topic-Modeling/'
data_path = 'Data/Technology-Data/processed/preprocessed/texts.txt'

/content/drive/My Drive/Thesis/Topic-Modeling


In [None]:
!pip install fasttext

Collecting fasttext
  Downloading https://files.pythonhosted.org/packages/f8/85/e2b368ab6d3528827b147fdb814f8189acc981a4bc2f99ab894650e05c40/fasttext-0.9.2.tar.gz (68kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp36-cp36m-linux_x86_64.whl size=3040949 sha256=8fbf608c8da48d313a299fcf09b9d8bdc9b65d1db8794dba5ddca36bff67eb7a
  Stored in directory: /root/.cache/pip/wheels/98/ba/7f/b154944a1cf5a8cee91c154b75231136cc3a3321ab0e30f592
Successfully built fasttext
Installing collected packages: fasttext
Successfully installed fasttext-0.9.2


In [None]:
import fasttext
import fasttext.util
import gensim
import numpy as np
import os
import time
import utils

## Settings

In [None]:
emb_size = 300

## Get Vocabulary

In [None]:
vocab, full, train, valid, test = utils.get_data('Data/Technology-Data/processed/final/grouped_years/min_df_50')

In [None]:
len(vocab)

18863

## Word2Vec

In [None]:
with open(data_path) as articles:
    # One document per line; tokens are whitespace-separated
    docs = [line.split() for line in articles]

In [None]:
start_time = time.time()
model_word2vec = gensim.models.Word2Vec(docs,
                                        min_count=50,
                                        sg=1,
                                        size=emb_size,
                                        iter=100,
                                        workers=6,
                                        negative=10,
                                        window=5)

calc_time = time.time() - start_time
print("--- %s seconds ---" % calc_time)

--- 11837.076303720474 seconds ---
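The call above uses the parameter names of gensim releases before 4.0. From gensim 4.0 on, `size` is called `vector_size` and `iter` is called `epochs`. The following is a minimal sketch of a helper that picks the matching names; `word2vec_kwargs` is a hypothetical name introduced here for illustration and is not part of gensim or of this project's `utils` module.

```python
def word2vec_kwargs(gensim_version, emb_size=300):
    """Return the Word2Vec keyword arguments used above, named for the given gensim version."""
    major = int(gensim_version.split('.')[0])
    # Settings that kept their names across gensim versions
    kwargs = dict(min_count=50, sg=1, workers=6, negative=10, window=5)
    if major >= 4:
        kwargs.update(vector_size=emb_size, epochs=100)  # gensim >= 4.0 names
    else:
        kwargs.update(size=emb_size, iter=100)           # gensim < 4.0 names
    return kwargs

print(sorted(word2vec_kwargs('3.6.0')))
print(sorted(word2vec_kwargs('4.3.2')))
# Possible usage: gensim.models.Word2Vec(docs, **word2vec_kwargs(gensim.__version__, emb_size))
```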


In [None]:
word2vec_file = 'Data/Embeddings/Word2Vec/Word2Vec_{}'.format(emb_size)

In [None]:
model_word2vec.save(word2vec_file + '.model')

In [None]:
with open(word2vec_file + '.txt', 'w') as f:
    # wv.vocab is the gensim < 4.0 API (wv.key_to_index from 4.0 on)
    for word in model_word2vec.wv.vocab:
        vector_str = " ".join('%.9f' % val for val in model_word2vec.wv[word])
        f.write(word + ' ' + vector_str + '\n')

## GloVe

Load the original cased (!) GloVe vectors (https://nlp.stanford.edu/projects/glove/, 11/2020).

In [None]:
glove_orig = {}
with open('Data/Embeddings/GloVe/glove.840B.300d.txt', 'rb') as f:
    for l in f:
        line = l.decode().split()
        word = line[0]
        try:
            vect = np.array(line[1:]).astype(float)  # np.float was removed in NumPy 1.24
        except ValueError:
            # A few GloVe lines contain more than one string token
            if len(line[1:]) == 301:
                vect = np.array(line[2:]).astype(float)
            else:
                continue  # skip unparsable lines instead of reusing the previous vector
        glove_orig[word] = vect

In [None]:
len(glove_orig)

2195884
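The cell above splits each line on whitespace, so an entry whose token itself contains a space is keyed by its first token only. One alternative is to split from the right and treat everything before the last `dim` fields as the token. The sketch below assumes that every line ends in exactly `dim` space-separated numbers; `parse_glove_line` is a hypothetical helper and the sample line is made up for illustration.

```python
def parse_glove_line(line, dim=300):
    """Split one GloVe text line into (token, vector), taking the vector from the right."""
    parts = line.rstrip('\n').split(' ')
    word = ' '.join(parts[:-dim])              # token may itself contain spaces
    values = [float(x) for x in parts[-dim:]]  # the last `dim` fields are the vector
    return word, values

# Made-up three-dimensional line with a two-part token
word, values = parse_glove_line('at name@domain.com 0.1 0.2 0.3\n', dim=3)
print(repr(word), values)
```

Note that this keeps the full multi-part token as the key, which differs from the loader above.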

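Create data-/task-specific GloVe embeddings by iterating through the vocabulary. Account for bigrams and trigrams by averaging the vectors of their component words. Vocabulary entries for which GloVe lacks a vector (or lacks one of the component words) are collected and reported.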

In [None]:
glove_embeddings = {}
words_no_glove_emb = []
for v in vocab:
    try:
        tokens = v.split('_')
        if len(tokens) == 3:
            glove_embeddings[v] = np.mean(np.array([glove_orig[tokens[0]],glove_orig[tokens[1]],glove_orig[tokens[2]]]), axis=0)
        elif len(tokens) == 2:
            glove_embeddings[v] = np.mean(np.array([glove_orig[tokens[0]],glove_orig[tokens[1]]]), axis=0)
        else:
            glove_embeddings[v] = glove_orig[v]
    except KeyError:
        words_no_glove_emb.append(v)
        
print('No embedding available for:', ', '.join(words_no_glove_emb))

No embedding available for: ride-hailing, Onlineblog, book-apps, DeepMind, wouldo, GDPR, H.264, Netbytes, USB-C, 000-strong, Margrethe_Vestager, wouldigital, Chromecast, Technobile, OneDrive, 802.11b, cryptocurrencies, Gamesblog, e-envoy, WannaCry, alt-right, book-app, Google+, areilly, Brexit, Waymo, gamesblog, OnePlus, Sky+, Facebook-owned
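The averaging step can be illustrated on toy data. The sketch below is not the notebook's code: `ngram_vector` is a hypothetical helper that averages over however many `_`-separated parts a term has, whereas the cell above handles bigrams and trigrams explicitly. The two-dimensional vectors are invented for the example.

```python
import numpy as np

def ngram_vector(term, lookup):
    """Mean of the component word vectors; raises KeyError if any component is missing."""
    return np.mean([lookup[tok] for tok in term.split('_')], axis=0)

# Invented toy vectors, not real GloVe values
toy = {'machine': np.array([1.0, 2.0]), 'learning': np.array([3.0, 4.0])}

print(ngram_vector('machine_learning', toy))  # [2. 3.]

try:
    ngram_vector('deep_learning', toy)
except KeyError as missing:
    # A single missing component excludes the whole n-gram, as for Margrethe_Vestager above
    print('No embedding available for component:', missing)
```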


Save the data-/task-specific GloVe word embeddings.

In [None]:
with open('Data/Embeddings/GloVe/GloVe_300.txt', 'w') as f:
    for word in glove_embeddings.keys():
        vec = ['%.9f' % val for val in glove_embeddings[word]]
        vec_str = " ".join(vec)
        emb_line = word + " " + vec_str + '\n'
        f.write(emb_line)

## fastText
Load the pretrained English fastText vectors (https://fasttext.cc/, https://fasttext.cc/docs/en/crawl-vectors.html, 11/2020).

In [None]:
#fasttext.util.download_model('en', if_exists='ignore') # download the model if it has not been done before

In [None]:
fastText_model = fasttext.load_model('Data/Embeddings/fastText/cc.en.300.bin')
# The pretrained model is 300-dimensional; reduce it only if a smaller size is requested
if emb_size < 300:
    fasttext.util.reduce_model(fastText_model, emb_size)
fastText_model.get_dimension()



300

In [None]:
print('Neighbouring Example for ai: ', fastText_model.get_nearest_neighbors("ai"))
print('Neighbouring Example for AI: ', fastText_model.get_nearest_neighbors("AI"))

Neighbouring Example for ai:  [(0.6824304461479187, 'ai.'), (0.658112645149231, 'lu'), (0.6302372813224792, "'hl"), (0.6282903552055359, 'Ai'), (0.6257882118225098, ',o'), (0.6223296523094177, 'il'), (0.6184976696968079, 'iav'), (0.6163771748542786, 't-i'), (0.6162423491477966, ',y'), (0.615196704864502, 'uai')]
Neighbouring Example for AI:  [(0.7377923130989075, 'A.I.'), (0.6971256732940674, 'AIs'), (0.6795729994773865, 'AI.'), (0.6684409976005554, 'A.I'), (0.6131572723388672, 'non-AI'), (0.6108188629150391, 'AI-driven'), (0.5958052277565002, 'AI-'), (0.5889711976051331, '-AI'), (0.5875733494758606, 'ANs'), (0.5865451693534851, 'AI-based')]


In [None]:
print('Neighbouring Example for Apple: ', fastText_model.get_nearest_neighbors("Apple"))
print('Neighbouring Example for apple: ', fastText_model.get_nearest_neighbors("apple"))

Neighbouring Example for Apple:  [(0.7653286457061768, 'it.Apple'), (0.7601998448371887, 'Appple'), (0.7555220127105713, '.Apple'), (0.7497225999832153, 'Apple.The'), (0.7486265897750854, 'Apple.Apple'), (0.7334216237068176, 'Apple.I'), (0.7238132357597351, 'Apple.'), (0.7206739187240601, 'APple'), (0.709655225276947, 'Apple-'), (0.7031973600387573, '-Apple')]
Neighbouring Example for apple:  [(0.7626952528953552, 'apples'), (0.7096020579338074, 'apple-'), (0.6859333515167236, 'apple.I'), (0.6751999855041504, 'apple.'), (0.6751177906990051, 'non-apple'), (0.6668474674224854, 'pear'), (0.6600887179374695, 'apple.The'), (0.642498791217804, 'apples.'), (0.6265839338302612, 'honeycrisp'), (0.610177755355835, 'apple-pear')]


In [None]:
print('Analogy Example for PS1, PS2, 4G: ', fastText_model.get_analogies("PS1", "PS2", "4G"))

Analogy Example for PS1, PS2, 4G:  [(0.664808988571167, '3G'), (0.5942748785018921, '4g'), (0.5793185234069824, '5G'), (0.5652352571487427, 'LTE'), (0.5615310072898865, '2G'), (0.5493701696395874, '3G-'), (0.5400305390357971, '4GLTE'), (0.5382450222969055, 'hspa'), (0.5359112024307251, 'lte'), (0.5342647433280945, '3g')]
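As far as we understand the fastText implementation, `get_analogies(A, B, C)` builds the query A - B + C from length-normalised word vectors and ranks the remaining words by cosine similarity, excluding the three inputs. The sketch below reproduces that idea with NumPy on invented four-dimensional vectors; it is an illustration under this assumption, not fastText's code, and `unit` and `analogies` are hypothetical names.

```python
import numpy as np

# Invented toy vectors with dimensions (console, network, older, newer)
toy = {
    'PS1': np.array([1.0, 0.0, 1.0, 0.0]),
    'PS2': np.array([1.0, 0.0, 0.0, 1.0]),
    '4G':  np.array([0.0, 1.0, 0.0, 1.0]),
    '3G':  np.array([0.0, 1.0, 1.0, 0.0]),
    '5G':  np.array([0.0, 1.0, 0.0, 2.0]),
    'PS3': np.array([1.0, 0.0, 0.0, 2.0]),
}

def unit(v):
    """Scale a vector to (approximately) unit length."""
    return v / (np.linalg.norm(v) + 1e-8)

def analogies(a, b, c, vectors, k=3):
    """Rank words by cosine similarity to unit(a) - unit(b) + unit(c), excluding a, b, c."""
    query = unit(unit(vectors[a]) - unit(vectors[b]) + unit(vectors[c]))
    scored = [(float(query @ unit(vec)), word)
              for word, vec in vectors.items() if word not in {a, b, c}]
    return sorted(scored, reverse=True)[:k]

print(analogies('PS1', 'PS2', '4G', toy))
```

With these toy vectors the top-ranked word is `3G`, mirroring the real output above.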


Create a subset of data-/task-specific fastText embeddings by iterating through the vocabulary. Account for bigrams and trigrams by applying fastText's get_sentence_vector():

In [None]:
words_no_ft_emb = []
fasttext_embeddings = {}

for word in vocab:
    try:
        if '_' in word:
            # Bigrams and trigrams: embed the space-separated component words as a sentence
            fasttext_embeddings[word] = fastText_model.get_sentence_vector(word.replace('_', ' '))
        else:
            fasttext_embeddings[word] = fastText_model.get_word_vector(word)
    except Exception:
        words_no_ft_emb.append(word)

print('Number of words for which no fastText embedding could be retrieved:', len(words_no_ft_emb))
if 0 < len(words_no_ft_emb) < 15:
    print('Corresponding words:', words_no_ft_emb)
del fastText_model

Number of words for which no fastText embedding could be retrieved: 0
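For pretrained (unsupervised) models, `get_sentence_vector()` does not take a plain mean: to our understanding it divides each word vector by its L2 norm first and then averages the results. This differs from the unnormalised averaging used for GloVe above. The sketch below mimics that behaviour on invented vectors; `sentence_vector` is a hypothetical stand-in, not the fastText function.

```python
import numpy as np

def sentence_vector(text, word_vectors):
    """Average of L2-normalised word vectors; zero-norm vectors are skipped."""
    dim = len(next(iter(word_vectors.values())))
    units = []
    for token in text.split():
        vec = word_vectors[token]
        norm = np.linalg.norm(vec)
        if norm > 0:
            units.append(vec / norm)
    return np.mean(units, axis=0) if units else np.zeros(dim)

# Invented toy vectors, not real fastText values
toy = {'machine': np.array([3.0, 4.0]), 'learning': np.array([0.0, 2.0])}

normalised = sentence_vector('machine learning', toy)
plain = np.mean([toy['machine'], toy['learning']], axis=0)
print(normalised, plain)
```

Because every word is rescaled to unit length, a word with a large norm does not dominate the n-gram vector.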


In [None]:
fasttext_filename = 'Data/Embeddings/fastText/fastText_{}.txt'.format(emb_size)
with open(fasttext_filename, 'w') as f:
    for word in fasttext_embeddings.keys():
        vec = ['%.9f' % val for val in fasttext_embeddings[word]]
        vec_str = " ".join(vec)
        emb_line = word + " " + vec_str + '\n'
        f.write(emb_line)
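All three text files written in this notebook share the line format `word v1 v2 ...`. A minimal sketch of how such a file could be read back into a matrix whose rows follow the vocabulary order is given below. The function `load_embedding_matrix` is hypothetical, and leaving the rows of missing words at zero is an assumption made for illustration, not necessarily what the D-ETM training code does.

```python
import os
import tempfile

import numpy as np

def load_embedding_matrix(path, vocab, emb_size):
    """Read 'word v1 v2 ...' lines and return (matrix aligned with vocab, missing words)."""
    vectors = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip('\n').split(' ')
            vectors[parts[0]] = np.array([float(x) for x in parts[1:]], dtype=np.float32)
    matrix = np.zeros((len(vocab), emb_size), dtype=np.float32)
    missing = []
    for i, word in enumerate(vocab):
        if word in vectors:
            matrix[i] = vectors[word]
        else:
            missing.append(word)  # row stays zero (assumption, see above)
    return matrix, missing

# Tiny demonstration with a throwaway file in the same format
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo_2.txt')
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write('alpha 1.000000000 2.000000000\n')
        f.write('beta 3.000000000 4.000000000\n')
    matrix, missing = load_embedding_matrix(demo_path, ['alpha', 'beta', 'gamma'], 2)

print(matrix.shape, missing)
```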