## Start by copying this into your Google Drive!!


# Information Retrieval and Text Mining Course - Tutorial Document Representation
Author: Gijs Wijngaard and Jan Scholtes
Version: 2024-2025



Welcome to the tutorial about document representation. In this notebook you will go over a number of different methods to represent language. We start with simple representations of how to convert text into numbers. Afterwards, we focus on tf-idf and let you compute tf-idf yourself. Then, you will work with Word2Vec models, and we finish off with transformers and sentence transformers.

---



## Simple representations
We first look at simple ways to turn text into numbers.
Say we have the following sentence:

In [None]:
sentence = "the quick brown fox jumps over the lazy dog"

There are several ways to represent this sentence. One is to look at sequences of consecutive words and count how often they occur. Such a sequence is called an *n-gram*. A sequence of two words is called a *bigram*.

In [None]:
splitted = sentence.split(" ")
[bigram for bigram in zip(splitted, splitted[1:])]

A sequence of three words is called a *trigram*.

In [None]:
[trigram for trigram in zip(splitted, splitted[1:], splitted[2:])]
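
The two cells above follow the same pattern, so they can be generalized to any `n`. The helper below is a minimal sketch of that idea; the name `ngrams` is our own and is not part of the notebook or of any library used here.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip stops at the shortest slice, so incomplete runs at the end are dropped
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "the quick brown fox jumps over the lazy dog".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams[:2])
print(len(bigrams), len(trigrams))
```

A sentence of 9 words yields 8 bigrams and 7 trigrams, since each extra word in the n-gram removes one possible starting position.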

We can also simply count each word in our sentence. The full list of words with their counts is what we call a *bag-of-words*. The count of each word is also called its *term frequency* (tf).

In [None]:
{word: splitted.count(word) for word in splitted}  # count whole tokens, not substrings of the sentence

As we can see, the word *the* scores higher than the rest in our word count. However, words such as *the* and *and* are not that important for algorithms: they say little about what the sentence is about, in contrast to words such as *fox* and *dog*.
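
One simple way to act on this observation is to drop such words before counting. The sketch below does this with a small stop list; `STOP_WORDS` is a hand-picked, hypothetical set for illustration only, not a list used elsewhere in this notebook.

```python
from collections import Counter

# Hypothetical, hand-picked stop list for illustration only
STOP_WORDS = {"the", "a", "an", "and", "of", "over"}

tokens = "the quick brown fox jumps over the lazy dog".split()

# Bag-of-words: every token with its term frequency
term_counts = Counter(tokens)

# Same count, but skipping tokens that are on the stop list
content_counts = Counter(t for t in tokens if t not in STOP_WORDS)

print(term_counts.most_common(1))
print(sorted(content_counts))
```

A fixed stop list is a blunt instrument. The tf-idf weighting introduced later in this notebook handles the same problem more gradually, by lowering the weight of words that occur in many documents.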

<a name="dataset"></a>
## Dataset
We start by collecting a dataset. In this tutorial, we use a [movie review dataset](https://www.cs.cornell.edu/people/pabo/movie-review-data/) from NLTK. This dataset contains 1000 positive movie reviews and 1000 negative movie reviews. We can use this dataset for sentiment analysis: let the machine recognize words that are negative or positive, so that it can classify a movie review correctly as negative or positive.

In [None]:
import nltk
nltk.download('movie_reviews')
nltk.download('words')
from nltk.corpus import words, movie_reviews as mr
nltk_words = set(words.words())

We first filter the words: we keep only tokens that, once stripped of punctuation, appear in NLTK's word list. Afterwards we count the most common words.

In [None]:
import string
from collections import Counter
def remove_punct(word):
    word = word.translate(str.maketrans('', '', string.punctuation))
    return word if word in nltk_words else ''
all_words = Counter(filter(remove_punct, mr.words()))
all_words.most_common(10)

We have the same problem here: words such as *the* and *a* are the most common in the movie reviews of our dataset. However, to do something with a movie review, such as classifying it, we should give these words a lower weight, as they say little about the content itself.

In [None]:
documents = [(list(filter(remove_punct, mr.words(f))), mr.categories(f)) for f in mr.fileids()]
print("Total number of documents:", len(documents))
print("Total number of words in first document:", len(documents[0][0]))

## tf-idf

With tf-idf we get a weighted measure of how relevant a word (or term) is to a text. The tf-idf score increases with the number of occurrences of the term within a document and with the rarity of the term in the collection.

Remember how we calculate the tf-idf score:

$$w_{t,d} = \log(1+\text{tf}_{t,d}) \times \log_{10}(\frac{N}{\text{df}_{t}})$$
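
To get a feeling for the numbers, the sketch below evaluates the formula above for a single term in a single document. The counts are hypothetical, and we assume that the first `log` is the natural logarithm while the second is base 10, as written.

```python
import math

def tf_idf(tf, df, n_docs):
    """Weight of one term in one document, following the formula above."""
    # Assumption: the tf part uses the natural log, the idf part uses base 10
    return math.log(1 + tf) * math.log10(n_docs / df)

# Hypothetical numbers: a term occurring 3 times in a document,
# and in 10 of the 2000 documents in the collection
print(round(tf_idf(tf=3, df=10, n_docs=2000), 2))

# A term that occurs in every document gets weight 0, however frequent it is
print(tf_idf(tf=50, df=2000, n_docs=2000))
```

The second call shows why very common words such as *the* end up with a low score: their idf factor is (close to) zero.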


Let's start by calculating the term frequency (tf). So far, we counted the words over all documents together. However, to calculate the tf-idf score we need the term frequency of each term per document. Thus, we need to loop over the documents and count the occurrences of the terms in each document.

In [None]:
tf = [Counter(words) for words, category in documents]
tf[0].most_common(10) # Most common terms for the first document

Now let's also calculate the document frequency (df). This is a bit more involved, since for each word we need to count the number of documents in which that word occurs. We can do that with something like this.
We turn the documents into sets (collections of unique words) to speed up the calculation: checking whether a word is in a set takes O(1) time on average, instead of O(n) for a list. Then we loop over all words and return 1 for each document the word occurs in. We sum these 1's to get the number of documents that contain the word.

In [None]:
setted_docs = [set(doc) for doc, category in documents]
df = {word: sum([1 for doc in setted_docs if word in doc]) for word in all_words.keys()}
list(df.items())[:10]
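
The comprehension above checks every word against every document, which can take a while. The same counts can also be collected in a single pass over the documents. The sketch below shows that alternative on a toy corpus; `toy_documents` and `toy_df` are made-up names and data that mimic the `(tokens, category)` structure of `documents`.

```python
from collections import Counter

# Toy stand-in for `documents`: (tokens, category) pairs
toy_documents = [
    (["good", "film", "good", "cast"], ["pos"]),
    (["bad", "film"], ["neg"]),
    (["good", "plot"], ["pos"]),
]

toy_df = Counter()
for tokens, category in toy_documents:
    # set(tokens) makes each document contribute at most 1 per term
    toy_df.update(set(tokens))

print(toy_df["good"], toy_df["film"], toy_df["bad"])
```

Note that *good* occurs twice in the first document but is still counted only once for it, which is exactly the difference between document frequency and term frequency.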

### Exercise 1
> Implement the tf-idf score for each word per document yourself.
You may use `numpy` to calculate `log` and `log10`.

> Hint: remember that you can access keys of a dictionary with `.keys()`, values with `.values()` and a tuple of both with `.items()`.

In [None]:
# ANSWER HERE

### Exercise 2
a. Using the tf-idf scores you computed above, get the 50 words with the highest tf-idf score for the negative reviews and the 50 words with the highest tf-idf score for the positive reviews.

b. What do you notice? Write down what you see. Do you see a difference between the two lists of 50 words? Are there also words that appear in both? Could we train a classifier that, given the tf-idf scores of the words in a document, correctly predicts whether the review was positive or negative?

In [None]:
# ANSWER HERE

WRITE ANSWER HERE

## Word2Vec
In the previous section we saw that we can represent documents by their words, focusing on words that occur in few documents overall but occur a lot in a specific document. In this section, we focus on word representations. We train a Word2Vec model from scratch, using the same dataset as before. In this way, we can try to compare the positive and the negative reviews and see if we can find differences between words in a negative setting and words in a positive setting.

Word2Vec learns its word embeddings by looking inside the documents and checking which words occur near each other. The core idea behind this is that words which appear in similar contexts have similar meanings.
The most common implementation of Word2Vec in Python is the one by [gensim](https://radimrehurek.com/gensim/models/word2vec.html). We can compute the embeddings by passing our documents as sentences to the model. Then, to get the embedding of a word, we index the model's word vectors with that word:

In [None]:
from gensim.models import Word2Vec
model = Word2Vec(sentences=[doc for doc, cat in documents])
word_vectors = model.wv
word_vectors['the']
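
To make the "nearby words" idea concrete, the sketch below lists the (center, context) pairs that a sliding window produces for a short sentence. It only illustrates the windowing step. It is not how gensim trains its model internally, and `context_pairs` is a hypothetical helper of our own.

```python
def context_pairs(tokens, window=2):
    """Yield (center, context) pairs for every token and its neighbours."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield center, tokens[j]

tokens = "the quick brown fox".split()
pairs = list(context_pairs(tokens, window=1))
print(pairs)
```

Words that keep showing up in the same pairs end up with similar vectors after training. In gensim, the size of this window is controlled by the `window` parameter of `Word2Vec`.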

We can find the words whose vectors are most similar to the vector of a given word using `most_similar`.

In [None]:
word_vectors.most_similar('king')
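
`most_similar` ranks words by the cosine similarity between their vectors. The sketch below computes that measure by hand on three made-up 3-dimensional vectors; the numbers in `toy_vectors` are hypothetical and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-d vectors, not taken from the trained model
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "movie": [0.1, 0.2, 0.9],
}

query = toy_vectors["king"]
# Rank all other words by their similarity to the query vector
ranked = sorted(
    (w for w in toy_vectors if w != "king"),
    key=lambda w: cosine_similarity(query, toy_vectors[w]),
    reverse=True,
)
print(ranked)
```

Cosine similarity only looks at the direction of the vectors, not at their length, so a frequent and a rare word can still be close to each other.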

We can even do arithmetic with the vectors. The most famous example of this is the `king - man + woman = queen` analogy. By subtracting the vector of *man* from the vector of *king* and adding the vector of *woman*, we should end up close to the vector of *queen*. Let's try!

In [None]:
word_vectors.most_similar(positive=['king','woman'],negative=['man'])

We get *queen* only as the second most similar vector, not as the first. That is to be expected: we trained our Word2Vec model only on our reviews dataset, which is small by Word2Vec standards.

Lastly, let's plot the data. For this, we need to represent our vectors in a 2-D space, which requires a dimensionality reduction technique such as PCA or t-SNE. We use t-SNE (invented by someone who did the same master as you are doing!). It might take a while to compute the vectors below:

In [None]:
from sklearn.manifold import TSNE
import numpy as np
tsne = TSNE(n_components=2, random_state=0)
vectors = tsne.fit_transform(np.asarray(model.wv.vectors))
x, y = zip(*vectors)

In [None]:
len(x), len(y)

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 12))
plt.scatter(x, y)

## Pretrained Word2Vec
Word embeddings actually work best when they are pretrained. This means that we do not train a representation on our own data, but rely on external researchers who have already trained such a model on so much data that the word vectors are a good representation already.

We will now use GloVe vectors. GloVe is a different algorithm from Word2Vec, but its vectors are used in the same way. We can load such a model as shown below. It might take a while to download the vectors.

In [None]:
import gensim.downloader
glove = gensim.downloader.load('glove-wiki-gigaword-50')

In [None]:
glove["king"]

### Exercise 3
> Using all our `documents`, get the `glove` pretrained word vector for every word, take the average over all word vectors for each document, and train a simple binary classifier from [scikit-learn](https://scikit-learn.org/), such as `LogisticRegression` or `SVC` (support vector machine), on the averaged vectors per document. The classes (y values) are whether the review was positive or negative.

> Remember to split the data into training and test sets first, for example in an 80/20 split. Both sets should have about the same number of positive and negative reviews (you can use [this function](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)). In scikit-learn, use `.fit()` to fit the training data, then use `.predict()` to predict on the test data. You can use [accuracy_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) to measure the accuracy of the model.
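
As a starting point for the averaging step, the sketch below averages word vectors on a toy lookup table. It is not a solution to the exercise: `toy_embeddings` is a hypothetical 2-dimensional stand-in for the pretrained vectors, and returning zeros for a document without any known word is our own assumption.

```python
def average_vector(tokens, embeddings, dim):
    """Average the vectors of all tokens that have an embedding."""
    # Words without a pretrained vector are skipped
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim  # assumption: fall back to zeros for an all-unknown document
    return [sum(column) / len(known) for column in zip(*known)]

# Hypothetical 2-d embeddings standing in for the pretrained vectors
toy_embeddings = {"good": [1.0, 0.0], "bad": [-1.0, 0.0], "film": [0.0, 2.0]}

print(average_vector(["good", "film", "zzz"], toy_embeddings, dim=2))
print(average_vector(["zzz"], toy_embeddings, dim=2))
```

Whatever approach you choose, decide explicitly what to do with words that have no pretrained vector, since indexing the model with an unknown word raises an error.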

In [None]:
# ANSWER HERE

## Bias in Word2Vec
One of the problems with Word2Vec (and with machine learning in general) is that the model picks up many biases from the data it is trained on. Examples of biases that can be harmful when using these algorithms include gender bias and ethnicity bias. Let's check, for example, what happens if we ask for the female equivalent of `doctor`:

In [None]:
glove.most_similar(positive=['doctor','woman'],negative=['man'])

### Exercise 4
> Think of other examples of bias in word2vec (check the slides for ideas). Also explain why these types of biases are bad/harmful.

ANSWER HERE

In [None]:
# COMPUTE VECTORS HERE

## Transformers

We arrive at the state of the art: Transformer models! Although another course goes deeper into Transformers themselves, in this section we use them to represent our dataset as vectors. We do this again with a pretrained model, for example BERT. Such models are distributed through the `transformers` library from Hugging Face, which lets us download a model and use it on our data.

In this section we will use the Sentence Transformers library, which is a popular way of calculating embeddings for sentences using transformer models. See the documentation of the library [here](https://www.sbert.net/).
In essence, this library also uses BERT-based models, but applies mean pooling to average the token embeddings into a single vector. It is also more efficient: computing the embeddings for every document in our reviews dataset with plain BERT would take some time, whereas Sentence Transformers is optimized for this task.
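
The mean pooling step mentioned above can be sketched in a few lines. The example below uses made-up 2-dimensional token vectors and a padding mask. `mean_pool` is a hypothetical helper for illustration, not a function of the Sentence Transformers library.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0 (padding)."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    return [sum(column) / len(kept) for column in zip(*kept)]

# Hypothetical 2-d token vectors for a 3-token sentence padded to length 4
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
attention_mask = [1, 1, 1, 0]

print(mean_pool(token_embeddings, attention_mask))
```

The result is one vector per sentence, regardless of how many tokens the sentence has, which is what makes the output easy to feed into a classifier.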

We start with downloading the library using `pip` and importing a model.

In [None]:
!pip install -qq sentence-transformers
from sentence_transformers import SentenceTransformer
sentence_model = SentenceTransformer('all-MiniLM-L6-v2')

We can encode any sentence like this:

In [None]:
sentence_embedding = sentence_model.encode("the quick brown fox jumps over the lazy dog")
sentence_embedding.shape

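We now get a single vector of 384 dimensions, instead of the matrix of 11 by 768 (one 768-dimensional vector per token) that a BERT model would give us for this sentence. This makes it much easier to deal with.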

### Exercise 5
> We now do sentiment classification again, this time with Sentence Transformers. Convert the documents in your dataset to embeddings by passing them all to the `encode()` function of `sentence_model`. Then use the resulting matrix as input to the models you defined in Exercise 3. Again use the `accuracy_score` function on the test set, as you did in Exercise 3, to test how well these models perform.

In [None]:
# ANSWER HERE

> Do you see a difference between the accuracy you got in Exercise 3 and the accuracy here? Why do you think this is the case? How could we improve the accuracy even further?

ANSWER HERE

# Submission
Please submit your Colab notebook as a file: click File in the top-left corner, choose Download, then Download .ipynb, and upload that file to Canvas.