In [None]:
# import packages
import os
import pandas as pd
import ast
import emoji
import gensim
from gensim.models.word2vec import Word2Vec
import csv

# Improve query terms using word embeddings

This notebook exemplifies how we used word frequencies and word2vec to improve our query terms. We ran this notebook for all three datasets, but to keep the appendix concise we only show the example for one of the datasets here.

### Importing data

In [None]:
# set working directory
os.chdir(r'C:\Users\maril\Documents\20-21 KU\block 4\DM\twitter\preprocess_word2vec')

In [None]:
# import data
df = pd.read_csv('de_preprocess.csv', encoding='utf8')
print(df.shape)
df.head()

In [None]:
# function to turn the tokenized list into a readable format
def string_list(text):
    
    # we transform the string representation of the list into an actual list
    text = ast.literal_eval(text)
    return text

In [None]:
# apply function
df['preprocess_token'] = df['preprocess_token'].apply(string_list)
df.head()
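
When a list column is written to CSV and read back, each cell arrives as a plain string, which is why the conversion above is needed. A minimal sketch of what `ast.literal_eval` does with one such cell; the value of `raw_cell` is a made-up example, not taken from our data:

```python
import ast

# Hypothetical cell value: a token list that was read back from CSV as plain text
raw_cell = "['impfung', 'termin', 'heute']"

# literal_eval only parses Python literals (lists, strings, numbers, ...);
# unlike eval() it never executes arbitrary code
tokens = ast.literal_eval(raw_cell)

print(type(raw_cell).__name__, "->", type(tokens).__name__)
print(tokens)
```

After the conversion, each entry of the column can be iterated token by token.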

## Get most frequent words

In [None]:
# count the word frequency

# empty dict to store the word frequency information in
freq_dict = dict()

# iterate through the token lists in the 'preprocess_token' column
for tweet in df['preprocess_token']:
    
    # iterate through the words in each list
    for word in tweet:
        
        # if the word is not yet in the dict, add it as a key with the value 1
        if word not in freq_dict:
            freq_dict[word] = 1
        
        # if the word is already in the dict, increase its value by 1
        else:
            freq_dict[word] += 1
            
# order dict: sort the dictionary by its values
freq_dict = dict(sorted(freq_dict.items(), key=lambda x: x[1], reverse=True))

# display the 200 most common words
# .items() returns the dict items as tuples; we turn that into a list in order to retrieve the first 200 entries
# and then turn the list back into a dict, which displays more readably
dict(list(freq_dict.items())[:200])

After preprocessing the tweet text, we check the most common words in order to see if there are words we missed in our query so far.
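
The same frequency count can be written more compactly with `collections.Counter` from the standard library. This is only a sketch of an equivalent approach, not the code we ran; the `tweets` list is a hypothetical stand-in for `df['preprocess_token']`:

```python
from collections import Counter

# Hypothetical stand-in for df['preprocess_token']: one token list per tweet
tweets = [
    ['impfung', 'termin', 'heute'],
    ['impfung', 'biontech'],
    ['biontech', 'impfung', 'dosis'],
]

# Counter tallies every token across all tweets in one pass
freq = Counter(token for tweet in tweets for token in tweet)

# most_common(n) returns the n most frequent (word, count) pairs, highest count first
top_words = freq.most_common(2)
print(top_words)
```

On the real data, `freq.most_common(200)` would give the same top 200 words as the sorted dictionary above.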

## Most frequent words: Results
Among the 200 most frequent words, many are specific to Covid-19 vaccines. We manually read through the list and select all keywords that seem relevant and specific enough. The selection for the German query term list is shown below.

In [None]:
# save most relevant frequent words in a list
freq_keywords = ['impfung', 'geimpft', 'impfstoff', 'biontech', 'impfzentrum', 'impfen', 'impftermin', 'moderna', 
                 'pfizer', 'astrazeneca', 'impfpass', 'dosis', 'nebenwirkungen', 'impf', 'stiko', 'impfungen', 
                 'zweitimpfung', 'impfpflicht', 'astra', 'impftermine', 'prio', 'impfzwang', 'impfschutz', 'mrna', 
                 'geimpfte', 'impfstoffe',  'priorisierung', 'erstimpfung',  'coronaimpfung',  'geimpften',  'dosen',  
                 'impfzentren',  'impfneid',  'j&j',  'thrombose', 'curevac',  'erstimpfungen',  'impfdosen']

## Word2vec

### Python implementation gensim

We use the package ``gensim`` and its implementation of word2vec. 

There are two ways to approach the word2vec training:

**OPTION 1**
Use 
```python 
Word2Vec(dataset, **keywordarguments)
``` 
to set all parameters, build the vocabulary and train the model all in one go.

**OPTION 2** 
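Split the steps (set parameters, build the vocabulary, train the model) to get a better overview of what happens where and how different settings influence the outcome. Note that ``build_vocab`` also needs the dataset, and ``train`` has to be told explicitly how many examples and epochs to expect. We specify:
```python
model = Word2Vec(**keywordarguments)
model.build_vocab(dataset, **keywordarguments)
model.train(dataset, total_examples=model.corpus_count, epochs=model.epochs)
```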

When passed the same hyperparameters, both options should yield the same results. We include both because each appears in tutorials and on Stack Overflow, and it is important not to mix them up: for example, calling ``train`` again on a model that already received the dataset in the constructor trains it a second time.

**Using word2vec**

Generally, we are interested in the word similarities:
```python
model.wv.most_similar(keyword, **keywordarguments)
```

There are a lot of other things we can do with the results, e.g. access the vectors, but since they are not that relevant for our query building purposes, we won't go into detail.
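
To make the ranking less of a black box: `most_similar` orders words by the cosine similarity of their vectors. The sketch below reproduces that idea in plain Python. The three-dimensional vectors and the helper names `cosine_similarity` and `most_similar` are invented for illustration; they are not gensim's implementation and not values from our model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional word vectors, invented for illustration only
vectors = {
    'impfung':   [0.9, 0.1, 0.0],
    'impfstoff': [0.8, 0.2, 0.1],
    'wetter':    [0.0, 0.1, 0.9],
}

def most_similar(word, topn=2):
    """Rank all other words by cosine similarity to `word`, best first."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = most_similar('impfung')
print(ranking)
```

In this toy example 'impfstoff' ranks above 'wetter' because its vector points in nearly the same direction as the one for 'impfung'.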

In [None]:
# check how many words we have in the corpus

# set counter to 0
c = 0

# iterate through the token lists in the 'preprocess_token' column
for word_list in df['preprocess_token']:
    
    # iterate through the tokens in the list
    for word in word_list:
        
        # update the counter for each word
        c+=1

# number of tokens in the corpus
c

In [None]:
# set the hyperparameters, then initiate and train the model in one go (Option 1)
w2v_model = Word2Vec(df['preprocess_token'], 
                     min_count=15,
                     window=3,
                     vector_size=20,
                     negative=10,
                     epochs=20)
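
Since `min_count` discards every word that occurs fewer times than the threshold, it is worth checking how much of the vocabulary survives before training. A small sketch with toy data: `tweets` is a hypothetical stand-in for `df['preprocess_token']`, and the threshold of 2 is chosen only so that the toy corpus shows an effect.

```python
from collections import Counter

# Hypothetical stand-in for df['preprocess_token']
tweets = [
    ['impfung', 'termin', 'heute'],
    ['impfung', 'biontech'],
    ['biontech', 'impfung', 'dosis'],
]

min_count = 2  # toy threshold; the cell above uses 15 on the real corpus

counts = Counter(token for tweet in tweets for token in tweet)
total_tokens = sum(counts.values())

# words occurring fewer than min_count times are dropped from the vocabulary
kept_vocab = sorted(word for word, n in counts.items() if n >= min_count)

print(total_tokens, "tokens,", len(counts), "distinct words,", len(kept_vocab), "kept")
print(kept_vocab)
```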

In [None]:
# check manually if the results make sense: you can put any keyword from the corpus and look up the most similar words
w2v_model.wv.most_similar(['impfung'], topn=12)

In [None]:
# get the most similar words for specific terms

# we check the nearest neighbours of the most frequent, vaccine-relevant words saved in the list
# 'freq_keywords' above

for keyword in freq_keywords:
    display(keyword,
            w2v_model.wv.most_similar([keyword], topn=8))

This is again an example for the German corpus, but we conducted these steps for all three datasets. Note, however, that the Danish corpus was so small that word2vec did not produce useful results for it.

In [None]:
# saving all relevant keywords from the 'most similar' terms provided through word2vec in a list

similar_keywords = ['titer', 'impfdosis', 'impfungen', 'coronaimpfung', 'kreuzimpfung', 'durchgeimpft', 'geimpften', 
                    'impfstoffe', 'impfstoffes', 'impfstoffen', 'vakzine', 'impfstoffs', 'vakzin', 'impfdosen', 
                    'biontec', 'moderna', 'comirnaty', 'vaxzevria', 'astrazeneka', 'mrna', 'bnt', 'impfteam', 
                    'impfzentrums', 'biontech', 'biontec', 'pfizer', 'comirnaty', 'bnt', 'astrazeneca', 
                    'astrazeneka', 'moderna', 'covaxin', 'johnsonjohnson', 'astra', 'vaxzevria', 'astrazenaca', 
                    'astrazenica', 'johnson&johnson', 'impfausweis', 'impfpaß', 'impfnachweis', 'impfstatus', 
                    'zweitimpfung', 'erstimpfung', 'impfdosis', 'zweitgeimpft', 'impfschema', 'nebenwirkungen', 
                    'nebenwirkung', 'impfnebenwirkungen', 'impfreaktion', 'vaxxer', 'impfkommission', 
                    'schutzimpfungen', 'impfstoffdosen', 'impfquote', 'impfung', 'impfstoffe', 'erstimpfung', 
                    'dosis', 'impfabstand', 'zweitdosis', 'erstgeimpften', 'impfzwang', 'zwangsimpfung', 
                    'astrazeneca', 'astraz', 'vaxzevria', 'astrazeneka', 'impftermin', 'impfberechtigt', 'priogruppe', 
                    'impfprio', 'impfpflicht', 'zwangsimpfung', 'impfprivilegien', 'impfwirkung', 'geimpften', 
                    'geimpfter', 'impfstoffes', 'impfstoffs', 'impfpriorisierung', 'impfenschuetzt', 
                    'impfenrettetleben', 'impfenschützt', 'coronaschutzimpfung', 'aermelhoch', 'ungeimpften', 
                    'verimpft', 'impflinge', 'j&j', 'vaxzevria', 'blutgerinnsel', 'thrombosen', 'sinusvenenthrombose', 
                    'sanofi', 'impfstopp']

In [None]:
# creating a final keyword list (both old and new keywords)

# change to the place the keyword csv file is saved in
os.chdir(r'C:\Users\maril\Documents\20-21 KU\block 4\DM\twitter')

# read keyword file
# empty list to store the updated (old) keywords in
old_keywords = []

with open('keywords_updated.csv', 'r', encoding='utf8') as infile:
    reader = csv.reader(infile)
    
    # iterating through the reader object and adding all keywords to the old_keywords list
    for row in reader:
        old_keywords.append(row[0])

In [None]:
# combine all the keywords and remove duplicates
final_keywords = []

for f in freq_keywords:
    if f not in final_keywords:
        final_keywords.append(f)

for w in similar_keywords:
    if w not in final_keywords:
        final_keywords.append(w)

for o in old_keywords:
    if o not in final_keywords:
        final_keywords.append(o)

print(final_keywords)
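
The order-preserving deduplication can also be done in one line with `dict.fromkeys`, because dictionary keys are unique and keep their insertion order (Python 3.7+). This is a sketch of an alternative, not the code we ran; the three short lists are hypothetical stand-ins for the real keyword lists.

```python
# Hypothetical short lists standing in for the three keyword sources
freq_keywords = ['impfung', 'biontech']
similar_keywords = ['biontech', 'vakzin', 'impfung']
old_keywords = ['corona', 'vakzin']

# dict keys are unique and keep insertion order, so the first occurrence of each keyword wins
final_keywords = list(dict.fromkeys(freq_keywords + similar_keywords + old_keywords))
print(final_keywords)
```

Unlike `set()`, this keeps the keywords in the order in which they first appear.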

In [None]:
# save the final_keywords list to a csv file
with open('final_keywords.csv', 'w', newline="", encoding='utf8') as outfile:
    writer = csv.writer(outfile)
    
    for word in final_keywords:
        writer.writerow([word])

---

**Keywords & Parameters**

Overview of keywords which can be passed:
  
* ``sentences=None``: The data the model is trained on. Has to be an iterable of tokenized sentences (in our case, one list of tokens per tweet). If this argument is not passed, the model is left uninitialized.
* ``corpus_file=None``: Path to a file containing the corpus (this can be used if the vocabulary is built separately from training the model). Pass either the path to the corpus file or the ``sentences`` parameter, but not both.
* ``vector_size=100``: Size of the word vectors. At some point, adding more dimensions brings diminishing improvements in quality (Mikolov et al., 2019, p. 6), and the vector size is usually recommended to be around 100-300.
* ``alpha=0.025``: Initial learning rate.
* ``window=5``: Maximum number of words before and after the word in question that are taken into account.
* ``min_count=5``: We don't want very infrequent words to be taken into account. All words with a frequency below the specified value are ignored.
* ``max_vocab_size=None``: Limits the RAM used during vocabulary building. Not relevant for our purposes.
* ``sample=0.001``: To make training faster, the number of training examples is reduced according to how frequently words appear in the corpus. That means that occurrences of very frequent words are more likely to be randomly dropped as training examples.
* ``seed=1``: Used to ensure reproducibility. The default is 1 and we'll leave it at that. According to the gensim documentation, a fully deterministic run additionally requires ``workers=1`` and a fixed ``PYTHONHASHSEED``.
* ``workers=3``: Used for parallel computing. This specifies the number of worker threads (and thus CPU cores) used for training in parallel.
* ``min_alpha=0.0001``: The learning rate (specified by the ``alpha`` parameter) drops linearly to ``min_alpha`` during training.
* ``sg=0``: Which model to use: 0 = CBOW (Continuous Bag of Words), 1 = Skip-gram.
* ``hs=0``: How the training works: 0 = negative sampling is used (provided ``negative`` is non-zero), 1 = hierarchical softmax is used.
* ``negative=5``: If negative sampling is used, this parameter specifies how many noise words should be drawn.
* ``ns_exponent=0.75``: Shapes the negative sampling distribution: 1.0 samples words in proportion to their frequency, 0.0 samples all words equally.
* ``cbow_mean=1``: Only relevant for CBOW: 1 = use the mean of the context word vectors, 0 = use their sum.
* ``hashfxn=<built-in function hash>``: Hash function used to randomly initialize the weights; a custom function can be provided to increase training reproducibility. Probably not super important for our purposes, but it is an option.
* ``epochs=5``: Number of iterations (epochs) that are run over the entire corpus. The keyword was formerly called ``iter``; the old name still appears in a lot of tutorials, so be aware that it changed.
* ``null_word=0``: This keyword is not explained in the documentation.
* ``trim_rule=None``: Rule specifying how to handle infrequent words. The default is: discard if word count < min_count. This is a straightforward and useful rule, so we stick with the default.
* ``sorted_vocab=1``: The default sorts the vocabulary by descending frequency before assigning word indexes. The documentation of this keyword is sparse, but it does not seem to make a difference for the quality of the results, so we stick with the default here.
* ``batch_words=10000``: Batch size: the target number of words per batch of examples passed to the worker threads.
* ``compute_loss=False``: Computes and stores the loss value if the keyword is set to ``True``.
* ``callbacks=()``: Sequence of callbacks executed at specific stages of training. If you want to print the loss after each epoch, you have to add callbacks.
* ``comment=None``: Not specified in the documentation, and we could not find anything on Google either.
* ``max_final_vocab=None``: Limits the vocabulary to a target size.
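
To illustrate what "shaping the negative sampling distribution" means for ``ns_exponent``: noise words are drawn with a probability proportional to their frequency raised to that exponent. The sketch below computes these probabilities for a few exponents. The word counts and the helper name `noise_distribution` are invented for illustration; this is not gensim's internal code.

```python
# Hypothetical word counts, invented for illustration
counts = {'impfung': 1000, 'biontech': 100, 'titer': 10}

def noise_distribution(counts, ns_exponent=0.75):
    """Probability of drawing each word as a noise word: count ** ns_exponent, normalised."""
    weights = {word: n ** ns_exponent for word, n in counts.items()}
    total = sum(weights.values())
    return {word: w / total for word, w in weights.items()}

# 1.0 = proportional to frequency, 0.75 = the default, 0.0 = all words equally likely
for exponent in (1.0, 0.75, 0.0):
    probs = noise_distribution(counts, exponent)
    print(exponent, {word: round(p, 3) for word, p in probs.items()})
```

With the default of 0.75, rare words such as 'titer' are drawn somewhat more often than their raw frequency would suggest, and very frequent words somewhat less often.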