# Non-Programming Exercise Week 4 - Solution

## Task 1
The tf-idf (term frequency - inverse document frequency) is a measure of how important a specific word in a document $doc$ is to the document itself, within the context of a collection of documents $D$. 

It has been used in a variety of NLP applications. In this task, you will calculate tf-idf values for the words in some short texts by hand, in order to gain some intuition for how it works.

The term frequency is calculated using the following formula: 


${\sf tf}(w) = \frac{C_w}{C_t}$

where $w$ is the word in question, $C_w$ is the total count of the word in the document and $C_t$ is the total word count of the document.

Likewise, the inverse document frequency is calculated by this formula:

${\sf idf}(w)=\log(\frac{N}{|\{doc \in D:w \in doc\}|})$

where $N$ is the total number of documents. The denominator here denotes the number of documents in the collection that contain the word $w$, and the whole fraction is then logarithmically scaled using the natural logarithm.

To calculate the final tf-idf value, the two values simply need to be multiplied. This causes the tf to be scaled by the idf, so that words that appear in many or all documents are weighted less than words that appear in few documents.
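
The three formulas above can be sketched in a few lines of Python. This is only a minimal illustration: the function names `tf`, `idf` and `tfidf` and the toy corpus are assumptions made for this example and are not part of the exercise.

```python
import math

def tf(word, doc_tokens):
    """Term frequency: count of `word` divided by the document's total word count."""
    return doc_tokens.count(word) / len(doc_tokens)

def idf(word, corpus):
    """Inverse document frequency, scaled with the natural logarithm.

    Assumes `word` occurs in at least one document of `corpus`;
    otherwise the denominator would be zero.
    """
    n_containing = sum(1 for doc_tokens in corpus if word in doc_tokens)
    return math.log(len(corpus) / n_containing)

def tfidf(word, doc_tokens, corpus):
    """tf-idf: the term frequency scaled by the inverse document frequency."""
    return tf(word, doc_tokens) * idf(word, corpus)

# Hypothetical toy corpus (not the exercise texts), already split into words.
corpus = [
    ["a", "cat", "sat"],
    ["a", "dog", "sat", "down"],
]
print(tfidf("cat", corpus[0], corpus))  # 1/3 * ln(2/1)
print(tfidf("a", corpus[0], corpus))    # 1/3 * ln(2/2), i.e. 0.0
```

Note that `math.log` with a single argument is the natural logarithm, which matches the definition of the idf used here.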

You can find three short texts below. For each word in the texts, calculate the tf-idf value. Make sure to make each step of your calculation clear (if you repeat steps, you only need to write them down once). You may use a calculator or write some code to get to the results.

### Text 1

There are three trees growing in the garden. <br>
The trees are very big. <br>
They cast big shadows.

### Text 2

Three kids play in the garden. <br>
The shadows of the kids grow big in the evening.

### Text 3
The people are growing vegetables in the gardens.

Example calculation for 'growing' in Text 3: <br>
$C_w = 1$<br>
$C_t = 8$<br>
$N = 3$<br>
$|\{doc \in D:w \in doc\}| = 2$<br>
<br>
=> ${\sf tfidf}({\sf 'growing'}_{t3}) = \frac{1}{8} \cdot \log(\frac{3}{2}) \approx 0.05$<br>
<br>
TF-IDFs (rounded to two decimal places):<br>

Text 1:<br>
there: 0.06<br>
are: 0.05<br>
three: 0.02<br>
trees: 0.13<br>
growing: 0.02<br>
in: 0.0<br>
the: 0.0<br>
garden: 0.02<br>
very: 0.06<br>
big: 0.05<br>
they: 0.06<br>
cast: 0.06<br>
shadows: 0.02<br>

Text 2:<br>
three: 0.03<br>
kids: 0.14<br>
play: 0.07<br>
in: 0.0<br>
the: 0.0<br>
garden: 0.03<br>
shadows: 0.03<br>
of: 0.07<br>
grow: 0.07<br>
big: 0.03<br>
evening: 0.07<br>

Text 3:<br>
the: 0.0<br>
people: 0.14<br>
are: 0.05<br>
growing: 0.05<br>
vegetables: 0.14<br>
in: 0.0<br>
gardens: 0.14<br>
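
The values listed above can be reproduced with a short script. The following is a sketch under stated assumptions: the tokenization (lowercasing and keeping only runs of letters) and the helper names `tokenize` and `tfidf_table` are choices made for this example, not something prescribed by the task.

```python
import math
import re

texts = {
    "Text 1": "There are three trees growing in the garden. "
              "The trees are very big. They cast big shadows.",
    "Text 2": "Three kids play in the garden. "
              "The shadows of the kids grow big in the evening.",
    "Text 3": "The people are growing vegetables in the gardens.",
}

def tokenize(text):
    """Assumed tokenization: lowercase the text and keep only runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

docs = {name: tokenize(text) for name, text in texts.items()}

def tfidf_table(docs):
    """Return {text name: {word: tf-idf}} for every word of every text."""
    n_docs = len(docs)
    table = {}
    for name, tokens in docs.items():
        scores = {}
        for word in dict.fromkeys(tokens):  # unique words in first-occurrence order
            # Number of documents that contain the word.
            df = sum(1 for other in docs.values() if word in other)
            scores[word] = tokens.count(word) / len(tokens) * math.log(n_docs / df)
        table[name] = scores
    return table

table = tfidf_table(docs)
for name, scores in table.items():
    print(name)
    for word, score in scores.items():
        print(f"  {word}: {score:.2f}")
```

With this tokenization, the texts contain 17, 16 and 8 words respectively, which are the $C_t$ values behind the numbers in the lists.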


### Questions
Some of the words could have a tf-idf value of 0. Why is this the case? What does that mean conceptually?

Did you spot any problems with this application of tf-idf? (Hint: Look at what kinds of linguistic properties are captured or ignored)


If a word appears in all of the documents, it has a tf-idf value of 0, because $\log(\frac{3}{3}) = \log(1) = 0$. Conceptually, a word that appears in every document of a collection does not help to tell the documents apart, so the tf-idf metric does not consider it important for any text. <br>
The tf-idf measure cannot distinguish between homographs (words spelled the same, but with different meanings) or different uses of the same word, such as 'growing' in texts 1 and 3 (trees that grow versus people who grow vegetables). However, it does distinguish between different forms of the same word, even where that is not desirable. For example, it treats 'garden' and 'gardens' as separate words.

## Task 2
An application for the tf-idf values could be to determine the document which is most similar to a given document in a collection. This can be useful, for example, for automatic reading recommendations. <br>
For this task, one could use a bag-of-words representation of a document. A bag-of-words approach assumes that a document is characterised by the words it contains, regardless of their order. <br>
Thus, we are going to represent each text from Task 1 as a vector and compare them to each other using the cosine similarity:

$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$


where $A$ and $B$ are vectors representing the texts to be compared. The higher $\cos(\theta)$ is, the smaller the angle between the vectors, which can be interpreted as the two texts being more similar.
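
As a minimal sketch, the cosine similarity formula can be written in plain Python as follows. The function name `cosine_similarity` and the error handling for mismatched or all-zero vectors are choices made for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The formula divides by the norms, so a zero vector has no defined angle.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))  # 45 degree angle
```

Since tf-idf values are never negative, the similarity of two tf-idf vectors always lies between 0 (no shared weighted words) and 1 (same direction).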

The vector representation of each text is a vector with $|V|$ dimensions, where $|V|$ is the size of the combined vocabulary, i.e. the number of distinct words that occur in any of the documents in the collection (not to be confused with $N$, the number of documents, from Task 1). Each entry of a text vector holds the tf-idf value of one vocabulary word with respect to the text that the vector represents; words that do not occur in that text get the value 0.<br>
Important: The entries in each vector have to correspond to the same words in all vectors, so, for example, if the first entry of the vector for text 1 contains the tf-idf value for the word "the", the first entry of all other text-vectors must also refer to "the". <br>
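
The alignment requirement can be illustrated with a small sketch. It uses the rounded tf-idf values of texts 2 and 3 from Task 1 and, as an assumption for this example, orders the vocabulary alphabetically; any fixed order works as long as it is the same for all vectors. The helper names `build_vocabulary` and `to_vector` are hypothetical.

```python
# Rounded tf-idf values of texts 2 and 3, taken from the solution of Task 1.
scores_text2 = {
    "three": 0.03, "kids": 0.14, "play": 0.07, "in": 0.0, "the": 0.0,
    "garden": 0.03, "shadows": 0.03, "of": 0.07, "grow": 0.07,
    "big": 0.03, "evening": 0.07,
}
scores_text3 = {
    "the": 0.0, "people": 0.14, "are": 0.05, "growing": 0.05,
    "vegetables": 0.14, "in": 0.0, "gardens": 0.14,
}

def build_vocabulary(*score_dicts):
    """Combined vocabulary of all texts in a fixed (here: alphabetical) order."""
    return sorted(set().union(*score_dicts))

def to_vector(scores, vocabulary):
    """One entry per vocabulary word; words missing from the text get 0.0."""
    return [scores.get(word, 0.0) for word in vocabulary]

vocabulary = build_vocabulary(scores_text2, scores_text3)
vec2 = to_vector(scores_text2, vocabulary)
vec3 = to_vector(scores_text3, vocabulary)

# Position i refers to the same word in both vectors.
for word, x2, x3 in zip(vocabulary, vec2, vec3):
    print(f"{word:<12}{x2:>6.2f}{x3:>6.2f}")
```

Because only two of the three texts are used here, this vocabulary is smaller than the one needed for the full task.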

Calculate the cosine similarity of text 1 and text 2, and text 1 and text 3. Determine which of the texts 2 and 3 is closer to text 1.


Vector text 1: [0.06, 0.05, 0.13, 0, 0.06, 0.02, 0.06, 0, 0, 0, 0, 0.06, 0, 0.02, 0.02, 0.0, 0.02, 0.0, 0, 0.05, 0]<br>
Vector text 2: [0, 0.03, 0, 0, 0, 0, 0, 0.14, 0, 0.07, 0.07, 0, 0, 0.03, 0.03, 0.0, 0.03, 0.0, 0.07, 0, 0.07]<br>
Vector text 3: [0, 0, 0, 0.14, 0, 0.05, 0, 0, 0.14, 0, 0, 0, 0.14, 0, 0, 0.0, 0, 0.0, 0, 0.05, 0]

Cosine similarity between texts 1 and 2: 0.0751<br>
Cosine similarity between texts 1 and 3: 0.0728<br>
Both values are computed from the unrounded tf-idf values; the vectors above are rounded to two decimal places for readability. Text 2 is therefore closer to text 1 than text 3 is. The difference is rather small, mainly because the texts are very short and the total vocabulary is correspondingly small.
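
The two similarities can be checked with a short end-to-end sketch. It assumes a simple tokenization (lowercasing, letters only), an alphabetically sorted vocabulary and unrounded tf-idf values; the helper names are chosen for this example only.

```python
import math
import re

texts = [
    "There are three trees growing in the garden. "
    "The trees are very big. They cast big shadows.",
    "Three kids play in the garden. "
    "The shadows of the kids grow big in the evening.",
    "The people are growing vegetables in the gardens.",
]

# Assumed tokenization: lowercase, letters only (punctuation is dropped).
docs = [re.findall(r"[a-z]+", text.lower()) for text in texts]
vocabulary = sorted({word for doc in docs for word in doc})

def tfidf_vector(doc, docs, vocabulary):
    """Unrounded tf-idf values of `doc`, one entry per vocabulary word."""
    vector = []
    for word in vocabulary:
        # Every vocabulary word occurs in at least one document, so df >= 1.
        df = sum(1 for other in docs if word in other)
        vector.append(doc.count(word) / len(doc) * math.log(len(docs) / df))
    return vector

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

vectors = [tfidf_vector(doc, docs, vocabulary) for doc in docs]
sim_12 = cosine_similarity(vectors[0], vectors[1])
sim_13 = cosine_similarity(vectors[0], vectors[2])
print(f"text 1 vs text 2: {sim_12:.4f}")
print(f"text 1 vs text 3: {sim_13:.4f}")
```

Under these assumptions the sketch yields 0.0751 and 0.0728. Plugging the two-decimal vector entries into the formula instead gives slightly different numbers, because the rounding error is large relative to such small values.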

## Task 3

In the lecture, you heard about word embeddings and the Word2Vec algorithm.

1. In your own words, describe what word embeddings are.
2. Briefly describe how Word2Vec works and what it is used for.
3. What are the advantages of representing the words in a text as word embeddings instead of tf-idf values as done in the previous two tasks?

Word embeddings are vector representations of words. They can have any number of dimensions, but in practice they usually have around 300. The goal of word embeddings is to capture the meaning of a word numerically: in the word-vector space, a word's vector should lie close to the vectors of words with a similar meaning and far away from the vectors of words with a very different meaning. This is vital for computational language processing, since computers cannot infer meaning from a word's written form. <br>
<br>
Word2Vec is an algorithm that constructs word embeddings for the words in a corpus. It trains a shallow neural network with three layers: an input layer with as many nodes as there are words in the vocabulary of the whole corpus, a hidden layer without a non-linear activation whose number of nodes equals the number of dimensions the embeddings should have, and an output layer with the same number of nodes as the input layer. Before training, the size of the context window must be chosen; it defines how far apart two words may be for one to still count as lying in the other's context. Next, one must choose between the skip-gram and the CBOW (continuous bag-of-words) approach. With skip-gram, the model is trained to predict the words in the context window of an input word; CBOW does the opposite and predicts a word from its given context. After training, the embedding of each word can be extracted either from the weights that connect the input and the hidden layer or from the weights that connect the hidden and the output layer. <br>
<br>
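
To make the role of the context window and the difference between skip-gram and CBOW more concrete, the following sketch generates the training examples each approach would see for a short sentence. It only illustrates how the prediction tasks are set up, not the network or its training; the function names `skipgram_pairs` and `cbow_pairs` are assumptions for this example.

```python
def skipgram_pairs(tokens, window):
    """Skip-gram: one (input word, context word) pair per word in the window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window):
    """CBOW: one (context words, target word) pair per position."""
    pairs = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:  # skip positions without any context word
            pairs.append((context, center))
    return pairs

tokens = ["the", "people", "are", "growing", "vegetables"]
print(skipgram_pairs(tokens, window=1))
print(cbow_pairs(tokens, window=1))
```

With a window size of 1, the word "people" yields the skip-gram pairs `("people", "the")` and `("people", "are")`, while CBOW yields the single example `(["the", "are"], "people")`.
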
One advantage of word embeddings is that a word's vector does not depend on the specific text or document collection it is later applied to, so trained embeddings can easily be transferred to other NLP applications. Additionally, they allow for a much more precise representation of a word's meaning, because each word is described by a vector with a freely chosen number of dimensions rather than by a single value.