<a href="https://colab.research.google.com/github/CharlieOlives/MAI/blob/main/linguisticsai_practical2_2324.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Linguistics and artificial intelligence: practical session 2 (24 October 2023)

Note: you are not required to submit any assignments for this practical session.

For these exercises, we will make use of the interactive Python environment provided by Google Colab. The environment makes it possible to run Python code within your web browser. It will be helpful if you already know some basic Python (as provided in the first lectures of the course 'Scripting Languages', for example). But even if you don't know anything about programming, you should be able to follow along with the examples.

We strongly recommend using Chrome to run this notebook. In other browsers (Firefox, Safari, ...), you are likely to run into problems with certain parts (such as the visualization interface below).

## Word embeddings

In this practical session, we'll automatically induce word embeddings based on an extract from Wikipedia. To do so, we'll make use of the Python library called `gensim`. `gensim` is a vector space modeling and topic modeling
toolkit for Python, and it contains an efficient implementation of
the word2vec algorithms for word embedding induction.

Word2vec consists of two different algorithms: skipgram (sg) and
continuous-bag-of-words (cbow). The underlying prediction task of the
former is to estimate the context words from the target word; the
prediction task of the latter is to estimate the target word from the
sum of the context words. More information can be found in the course
slides and in the background material of the course.
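
To make the difference between the two prediction tasks concrete, here is a minimal sketch of the training examples each one would extract from a single tokenized sentence. The helper names (`context_window`, `skipgram_examples`, `cbow_examples`) and the toy sentence are made up for illustration. The real word2vec implementation adds further steps (such as subsampling and negative sampling) that are left out here.

```python
def context_window(tokens, position, window):
    """Return the words within `window` positions of the target, excluding the target."""
    start = max(0, position - window)
    left = tokens[start:position]
    right = tokens[position + 1:position + 1 + window]
    return left + right


def skipgram_examples(tokens, window=2):
    """Skip-gram: one (input = target, output = context word) pair per context word."""
    return [(target, context)
            for i, target in enumerate(tokens)
            for context in context_window(tokens, i, window)]


def cbow_examples(tokens, window=2):
    """CBOW: one (input = all context words, output = target) example per target."""
    return [(context_window(tokens, i, window), target)
            for i, target in enumerate(tokens)]


# hypothetical toy sentence, already tokenized
toy_sentence = ["the", "pianist", "played", "a", "sonata"]

sg_pairs = skipgram_examples(toy_sentence, window=1)
cbow_pairs = cbow_examples(toy_sentence, window=1)

print(sg_pairs[:3])    # skip-gram: target word -> one context word
print(cbow_pairs[:2])  # CBOW: list of context words -> target word
```

Note how skip-gram produces one example per (target, context) pair, whereas CBOW produces a single example per target position.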

We'll import the Word2Vec class from the `gensim` library using the command below. We'll also import the `logging` module, which will provide us with information during training, and the `gzip` module, which allows us to read compressed files.

In [None]:
import logging
import gzip

from gensim.models import Word2Vec

The following command will download a sample from the English Wikipedia, consisting of 1 million lines (about 26 million words). Note that our corpus has already been tokenized (i.e. punctuation marks have been separated from the words).

In [None]:
!wget http://ccl.kuleuven.be/Courses/linguistics_and_ai/data/wiki_sample_en.txt.gz

We can inspect the beginning of our corpus using the command below:

In [None]:
!zcat wiki_sample_en.txt.gz|head -n20

The next command initializes the logging module, which will provide information about our training procedure.

In [None]:
# remove any default config
for handler in logging.root.handlers[:]:
    logging.root.removeHandler(handler)

# initialize logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

Next, we define a corpus class that will allow us to sequentially iterate over the sentences in our corpus.

In [None]:
class PlainTextCorpus(object):
    def __init__(self, fileName):
        self.fileName = fileName

    def __iter__(self):
        for line in gzip.open(self.fileName, 'rt'):
            line = [w.lower() for w in line.split()]
            yield line

And we instantiate that class using the location of our corpus.

In [None]:
sentences = PlainTextCorpus('wiki_sample_en.txt.gz')
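
One likely reason for wrapping the corpus in a class with an `__iter__` method, rather than using a plain generator, is that training needs to pass over the sentences several times, and a generator is exhausted after a single pass. The sketch below illustrates the difference on a hypothetical two-line corpus written to a temporary file. The class is the same as above, except that it closes the file with a `with` statement.

```python
import gzip
import os
import tempfile


class PlainTextCorpus(object):
    def __init__(self, fileName):
        self.fileName = fileName

    def __iter__(self):
        # the file is reopened every time a new iteration starts
        with gzip.open(self.fileName, 'rt') as f:
            for line in f:
                yield [w.lower() for w in line.split()]


# hypothetical toy corpus, written to a temporary file for illustration only
toy_path = os.path.join(tempfile.mkdtemp(), 'toy_corpus.txt.gz')
with gzip.open(toy_path, 'wt') as f:
    f.write("The piano is a keyboard instrument .\n")
    f.write("Brussels is the capital of Belgium .\n")

toy_sentences = PlainTextCorpus(toy_path)
first_pass = list(toy_sentences)
second_pass = list(toy_sentences)   # works again: __iter__ starts from the top

one_shot = (line.split() for line in ["a b", "c d"])
generator_first = list(one_shot)
generator_second = list(one_shot)   # empty: the generator is already consumed

print(len(first_pass), len(second_pass))
print(len(generator_first), len(generator_second))
```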

We're all set to start our training procedure, using the command below. We'll go through our corpus, and induce embeddings with a vector size of 100 (`gensim`'s default). In order to have reasonable estimates, we'll only take into account words that appear with a frequency of at least 250 in our corpus. We'll sequentially go through our corpus 5 times (in neural network lingo, one traversal is called an *epoch*). Note that our corpus is fairly small; we'd typically use data that is one to two orders of magnitude larger than the toy data we're using here. Still, the corpus will allow us to come up with reasonable word embedding representations.

The following command will take a couple of minutes to finish, so sit back and enjoy watching the training procedure.

In [None]:
model = Word2Vec(sentences, min_count=250)
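
The effect of the `min_count` threshold can be pictured as a simple frequency filter over the corpus. The following is a rough sketch on a hypothetical toy corpus, not `gensim`'s actual vocabulary code; the function name `build_vocabulary` is made up for illustration.

```python
from collections import Counter


def build_vocabulary(sentences, min_count):
    """Keep only the words that occur at least `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [word for word, count in counts.most_common() if count >= min_count]


# hypothetical toy corpus, already tokenized and lowercased
toy_corpus = [
    ["the", "piano", "is", "an", "instrument"],
    ["the", "violin", "is", "an", "instrument"],
    ["the", "piano", "has", "keys"],
]

toy_vocab = build_vocabulary(toy_corpus, min_count=2)
print(toy_vocab)
```

Words below the threshold (here *violin*, *has* and *keys*) get no embedding at all. In `gensim` 4, the other settings mentioned above correspond to the `vector_size` and `epochs` arguments, which default to 100 and 5.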

Once our training procedure is finished, we'll have a model that contains embeddings for a vocabulary of 8120 words, as can be confirmed with the command below.

In [None]:
len(model.wv.index_to_key)

It is now straightforward to compute the most similar words according to our embedding model. Use the command below to compute the most similar words to the word *piano*.

In [None]:
model.wv.most_similar('piano')
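
The scores returned by `most_similar` are cosine similarities between embedding vectors. As a minimal sketch of the underlying computation, the example below ranks the neighbours of a word by hand. The three-dimensional vectors are invented for illustration and do not come from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, embeddings, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    query = embeddings[word]
    scores = [(other, cosine_similarity(query, vector))
              for other, vector in embeddings.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# hypothetical 3-dimensional vectors, invented for illustration only
toy_embeddings = {
    "piano":  [0.9, 0.1, 0.0],
    "violin": [0.8, 0.2, 0.1],
    "guitar": [0.7, 0.3, 0.0],
    "tax":    [0.0, 0.1, 0.9],
}

neighbours = most_similar("piano", toy_embeddings)
print(neighbours)
```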

During the lecture, we saw that you can do analogy computations by carrying out vector arithmetic. Let's try that out. Note that the `Word2Vec` module encodes the computations slightly differently. For a computation like *brussels - belgium + france*, you need to provide as arguments a list of the positive terms (*brussels* and *france*), and a list of the negative terms (*belgium*).

In [None]:
model.wv.most_similar(positive=['brussels', 'france'], negative=['belgium'])

Let's also try out the ubiquitous example:

In [None]:
model.wv.most_similar(positive=['king', 'woman'], negative=['man'])
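
To see what the `positive` and `negative` lists do, here is a rough sketch of the analogy computation: add the length-normalized positive vectors, subtract the negative ones, and look for the nearest remaining word by cosine similarity, leaving out the query words themselves. The two-dimensional vectors are hypothetical and hand-crafted so that the analogy works. The function `analogy` is an illustration, not `gensim`'s own code.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def analogy(positive, negative, embeddings, topn=1):
    dims = len(next(iter(embeddings.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, normalize(embeddings[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, normalize(embeddings[word]))]
    query = normalize(query)
    # the query words themselves are left out of the candidate list
    excluded = set(positive) | set(negative)
    scores = [(word, sum(q * x for q, x in zip(query, normalize(vector))))
              for word, vector in embeddings.items() if word not in excluded]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# hypothetical 2-d vectors: axis 0 ~ "is a capital", axis 1 ~ "Belgium vs. France"
toy_embeddings = {
    "brussels": [1.0, 1.0],
    "belgium":  [0.0, 1.0],
    "paris":    [1.0, -1.0],
    "france":   [0.0, -1.0],
    "berlin":   [1.0, 0.0],
}

result = analogy(positive=["brussels", "france"], negative=["belgium"],
                 embeddings=toy_embeddings)
print(result)
```

Leaving out the query words matters: in this toy space, *france* itself lies quite close to the query vector and would otherwise compete with *paris*.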

***Note:*** Neural network training is non-deterministic, so my output differs from that of other people: the model is trained slightly differently on each run. Running multiple epochs helps when you have little data.

Finally, we'll explore a visualization of the embeddings we created. To do so, we'll make use of the `TensorBoard Projector`, the visualization module that comes with Google's `TensorFlow` library for neural network modeling. First we'll import the necessary modules (including the `os` module, which allows us to perform file system operations).

In [None]:
import os
import tensorflow as tf
from tensorboard.plugins import projector

%load_ext tensorboard

We can now convert our embeddings into the right format, so that they can be used by `TensorBoard`. If you're interested in how that works, you can inspect the comments provided in the code below.

In [None]:
# set up a logs directory for Tensorboard
log_dir='/logs/embeddings/'
if not os.path.exists(log_dir):
    os.makedirs(log_dir)

# save vocabulary line by line
with open(os.path.join(log_dir, 'metadata.tsv'), "w") as f:
    for w in model.wv.index_to_key:
        f.write("{}\n".format(w))

# save the embedding vectors in a variable
weights = tf.Variable(model.wv.get_normed_vectors())

# create a checkpoint for the embeddings
checkpoint = tf.train.Checkpoint(embedding=weights)
checkpoint.save(os.path.join(log_dir, "embedding.ckpt"))

# Some more configuration setup
config = projector.ProjectorConfig()
embedding = config.embeddings.add()
embedding.tensor_name = "embedding/.ATTRIBUTES/VARIABLE_VALUE"
embedding.metadata_path = 'metadata.tsv'
projector.visualize_embeddings(log_dir, config)

We're all set. We can now load `TensorBoard`, and use the `Projector` to explore the embeddings we created.

Note: if you get the error `No dashboards are active for the current dataset`, this means the interface has not yet properly loaded. Just press the play button for the same command again in order to properly load it.

Function of running multiple epochs: take words with different meanings but the same spelling (the Dutch word *bank*, for example). Each time the network encounters such a word, it sees it in a different context, e.g. once in the 'money' sense and another time in the 'couch' sense. With every additional pass, it adjusts the vector of *bank* so that it better accommodates both meanings.

In [None]:
# Run tensorboard with the log data
%tensorboard --logdir /logs/embeddings/

The embeddings you have trained will now be displayed in a 3D visualization in the `TensorBoard Projector`. Using the panel on the right, you can search for words to find the closest neighbors to a particular word. For example, try searching for *basketball*. You are likely to see similar neighbors such as *soccer* and *football*. Note that you can use your mouse to move through the 3D space, and click on the circles to visualize the neighbors of a particular word in space.

The visualization you're seeing is a projection of your original embedding space (of 100 dimensions) to three dimensions, using a dimensionality reduction technique called Principal Component Analysis (PCA). The technique projects the original space down to three dimensions, preserving as much of the variance in the original space as possible.
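
As a minimal sketch of what such a projection involves, the example below applies textbook PCA (via a singular value decomposition) to a random matrix that stands in for the embedding matrix. It assumes `numpy` is available. The function name `pca_project` and the toy data are made up for illustration, and the Projector's own implementation may differ in its details.

```python
import numpy as np


def pca_project(vectors, n_components=3):
    """Project row vectors onto their first `n_components` principal components."""
    centered = vectors - vectors.mean(axis=0)
    # rows of `directions` are the principal axes, ordered by decreasing variance
    _, singular_values, directions = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ directions[:n_components].T
    explained_ratio = singular_values ** 2 / np.sum(singular_values ** 2)
    return projected, explained_ratio[:n_components]


# random stand-in for a (vocabulary size x 100) embedding matrix
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(50, 100))

points_3d, explained = pca_project(toy_vectors)
print(points_3d.shape)   # one 3-d point per original vector
print(explained.sum())   # fraction of the total variance kept by 3 components
```

The fraction of variance that survives the projection shows how much information is lost: words that look close in the 3D view are not necessarily close in the original 100-dimensional space.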

### Exercise 1

Try searching for ambiguous words, such as *record* or *sentence*. What does the neighbouring space look like? Can you see clusters of related senses? To get a clearer view of the neighbouring space, click the `Isolate .. points` button in the top right, which shows only the nearest neighbours. Also try out a different dimensionality reduction algorithm, such as `UMAP`. This technique works a bit differently: it tries to preserve the nearest neighbours within the original space as closely as possible. You could also try `t-SNE`, which uses a similar idea but is quite a bit slower, so be prepared to wait.

### Exercise 2

By default, the `Word2Vec` class within the `gensim` library uses *continuous bag of words* (cbow) to induce the embeddings. Change the arguments so that your embeddings are constructed using the *skip-gram* method. You can find more information about the `Word2Vec` class (including the possible arguments) in `gensim`'s [API reference](https://radimrehurek.com/gensim/models/word2vec.html).

***Solution exercise 1:***

Run the notebook in Firefox for the visualization; it did not work in Chrome for me. Select 'Projector' to open the visualization.

***Solution exercise 2:***





In [None]:
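## YOUR CODE HERE: see the official solution! (the TA was standing in front of it)
# sg=1 selects skip-gram; the default, sg=0, is CBOW
model = Word2Vec(sentences, min_count=250, sg=1)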