# Lab 2. Text Representations

After completing this Lab, you will be able to:
- Read the dataset
- Preprocess the dataset
- Build TF-IDF text representations
- Use TF-IDF to find similar texts
- Use TF-IDF features for text classification
- (Optional) Use TF-IDF for text summarization

In [None]:
from pathlib import Path
from string import punctuation
from collections import Counter

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.tokenize import word_tokenize
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
%matplotlib inline

We are going to work with the [MPST (Movie Plot Synopses with Tags)](https://ritual.uh.edu/mpst-2018/) dataset. This dataset consists of plot synopses for 14,828 movies, each with a set of genre tags. You can also [preview the dataset on Kaggle](https://www.kaggle.com/cryptexcode/mpst-movie-plot-synopses-with-tags).

The dataset is in the CSV format.

Load the dataset into a pandas DataFrame.

In [None]:
dataset = pd.read_csv('mpst_full_data.csv')

As usual, we need to implement a preprocessing pipeline. For now, we will tokenize the texts, lowercase them, and remove the punctuation. Optionally, we could also lemmatize the texts, but we will skip this step for now.
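
To make these three steps concrete, here is a minimal sketch on two toy sentences. The helper name `simple_preprocess` is hypothetical, and a regular expression stands in for `word_tokenize` so that the example runs without downloading any NLTK data; your own `preprocess` may well look different.

```python
import re
from string import punctuation

PUNCT = set(punctuation)

def simple_preprocess(texts):
    """Hypothetical helper: lowercase, tokenize, and drop punctuation tokens."""
    tokenized = []
    for text in texts:
        # Assumption: this regex replaces nltk's word_tokenize for illustration.
        # It yields runs of word characters and single punctuation marks.
        tokens = re.findall(r"\w+|[^\w\s]", text.lower())
        tokenized.append([tok for tok in tokens if tok not in PUNCT])
    return tokenized

toy_tokenized = simple_preprocess(["The cat sat on the mat.", "Who will feed the cat?"])
print(toy_tokenized)
```

The function returns one list of tokens per input text, which is the shape the rest of the lab expects.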

In [None]:
def preprocess(texts):
    # YOUR CODE STARTS HERE
    
    # YOUR CODE ENDS HERE

Preprocess the plot synopses and add them to our pandas dataframe.

In [None]:
tokenized_synopsis = preprocess(dataset['plot_synopsis'])

In [None]:
dataset = dataset.assign(tokenized_synopsis=pd.Series(tokenized_synopsis).values)

Select only the synopses used for training.

In [None]:
train_dataset = dataset[dataset['split'] == 'train'].reset_index()

In [None]:
len(train_dataset)

The first step in building the text representations is to create a vocabulary for our collection of texts. In NLP, a vocabulary is very often accompanied by a mapping from words to indices. For example, if we have the following collection of texts:

```[['the', 'cat', 'sat', 'on', 'the', 'mat'], ['who', 'will', 'feed', 'the', 'cat']]```

the vocabulary is:

```{'the', 'cat', 'sat', 'on', 'mat', 'who', 'will', 'feed'}```

Then, the mapping can be as simple as:

```{'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4, 'who': 5, 'will': 6, 'feed': 7}```

Finally, if we transform our texts with the mapping, we get:

```[[0, 1, 2, 3, 0, 4], [5, 6, 7, 0, 1]]```

Usually, the vocabulary is ordered by word frequency, so words like "the", "a", and "to" have lower indices. We can also remove some rare words from the vocabulary to reduce its size.

For our task, first create a frequency list from all the texts in the training set. Then, create a mapping from words to indices, taking only the words that appear more than five times.
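
As an illustration of the idea, the sketch below builds a frequency list and a frequency-ordered mapping for the toy collection from above. The names `toy_texts`, `toy_word2idx`, and `min_count` are made up for this example, and the threshold is set to zero only because the toy corpus is tiny.

```python
from collections import Counter

toy_texts = [['the', 'cat', 'sat', 'on', 'the', 'mat'],
             ['who', 'will', 'feed', 'the', 'cat']]

# Count every token across the whole collection.
freq = Counter(tok for text in toy_texts for tok in text)

# most_common() sorts by descending count, so frequent words get low indices.
min_count = 0  # toy value; the lab asks for words appearing more than five times
toy_word2idx = {}
for word, count in freq.most_common():
    if count > min_count:
        toy_word2idx[word] = len(toy_word2idx)

print(freq.most_common(2))
print(toy_word2idx)
```

Raising `min_count` shrinks the vocabulary, and any word that falls below the threshold simply gets no index.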

In [None]:
# YOUR CODE STARTS HERE

# YOUR CODE ENDS HERE

In [None]:
len(word2idx)

Now, we are going to calculate the first part of the TF-IDF representation. TF stands for term frequency, and it is just the number of times each word appears in each text.

The frequency of the word $t$ in the document $d$ is:

$$tf_{t,d} = count(t, d)$$

We can also take the $\log_{10}$ of the word counts to make the numbers smaller:

$$tf_{t, d} = \log_{10}(count(t, d) + 1)$$

To compute a TF matrix, we can first create a matrix of zeros of size $|V| \times |D|$, where $|V|$ is the size of the vocabulary and $|D|$ is the number of documents in the collection. Then, we go through each word in each document and increase the corresponding item in the matrix by one using our vocabulary mapping.

If we return to our example with the cat on the mat, $|V| = 8$ and $|D| = 2$. The final TF matrix will be:

|      | d1 | d2 |
|------|----|----|
| the  | 2  | 1  |
| cat  | 1  | 1  |
| sat  | 1  | 0  |
| on   | 1  | 0  |
| mat  | 1  | 0  |
| who  | 0  | 1  |
| will | 0  | 1  |
| feed | 0  | 1  |
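
The table above can be reproduced with a few lines of NumPy. This is only a sketch on the toy data: the name `toy_get_tf` is hypothetical, and skipping words that are missing from the mapping is an assumption about how out-of-vocabulary tokens should be handled.

```python
import numpy as np

toy_texts = [['the', 'cat', 'sat', 'on', 'the', 'mat'],
             ['who', 'will', 'feed', 'the', 'cat']]
toy_word2idx = {'the': 0, 'cat': 1, 'sat': 2, 'on': 3,
                'mat': 4, 'who': 5, 'will': 6, 'feed': 7}

def toy_get_tf(texts, word2idx):
    """Return a |V| x |D| matrix of raw word counts."""
    tf = np.zeros((len(word2idx), len(texts)))
    for doc_idx, text in enumerate(texts):
        for tok in text:
            if tok in word2idx:  # assumption: ignore out-of-vocabulary words
                tf[word2idx[tok], doc_idx] += 1
    return tf

toy_tf = toy_get_tf(toy_texts, toy_word2idx)
print(toy_tf)
```

For the log-scaled variant from the formula above, `np.log10(toy_tf + 1)` could be applied to the finished count matrix.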

In [None]:
def get_tf(texts, word2idx):
    # YOUR CODE STARTS HERE
    
    # YOUR CODE ENDS HERE

In [None]:
tf = get_tf(train_dataset['tokenized_synopsis'], word2idx)

Now, for the second term: IDF. IDF stands for inverse document frequency.

First, we need to get the DF, or document frequency. We can do it by counting the number of documents each word appears in. To find it, we don't need to go through the text collection again, since we can easily extract it from the TF matrix. Think about how to do it.

Finally, the inverse document frequency, or IDF, is the $\log_{10}$ of the number of documents divided by the document frequency:

$$idf_t = \log_{10}(\frac{|D|}{df_t})$$

Most of the time, some kind of smoothing is applied to the IDF. For example, the sklearn package smooths it by artificially adding a document that contains every word in the vocabulary:

$$idf_t = \log_{10}(\frac{1 + |D|}{1 + df_t}) + 1$$
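
As a worked example under the formula above, the sketch below computes the document frequency and the smoothed IDF for the toy TF matrix. It assumes the $|V| \times |D|$ layout used in this lab, and the variable names are illustrative only.

```python
import numpy as np

# Toy TF matrix from the cat-and-mat example: rows are words, columns are documents.
toy_tf = np.array([[2, 1], [1, 1], [1, 0], [1, 0],
                   [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)

n_docs = toy_tf.shape[1]

# A word occurs in a document exactly when its count there is non-zero.
toy_df = np.count_nonzero(toy_tf, axis=1)

# Smoothed IDF: one extra pseudo-document containing every word, plus one.
toy_idf = np.log10((1 + n_docs) / (1 + toy_df)) + 1

print(toy_df)
print(toy_idf)
```

Words that occur in both documents ("the", "cat") get an IDF of exactly 1, while words that occur in only one document get $1 + \log_{10}(3/2)$, which is about 1.18.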

Use the smoothed IDF formula to complete the function below:

In [None]:
def get_idf(tf):
    # YOUR CODE STARTS HERE
    
    # YOUR CODE ENDS HERE

In [None]:
idf = get_idf(tf)

The last step is to assemble everything together. The final TF-IDF term is a simple multiplication of TF and IDF.
Now, the words that are characteristic of a document will have high TF-IDF scores, and non-relevant words will have a TF-IDF score close to zero.

Also, now each column represents a document. This way, we have created a document representation. But how do we use it now to find similar documents? 

To find the similarity between two documents, we can assume that documents are similar if their vectors point in a similar direction. We can quantify this by calculating the __cosine similarity__ between two vectors. From linear algebra, we know that the dot product of two unit vectors is the cosine of the angle between them. So, a cosine of 1 means that the two vectors are proportional, and a cosine of 0 means that they are orthogonal. This way, documents with a higher cosine similarity are more similar to each other.
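
In formula form, $\cos(\theta) = \frac{a \cdot b}{\|a\| \|b\|}$, which reduces to a plain dot product once both vectors have length one. The small sketch below checks the two extreme cases on made-up vectors; `unit` is a hypothetical helper.

```python
import numpy as np

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # proportional to a
c = np.array([0.0, 0.0, 3.0])  # orthogonal to a

def unit(v):
    """Scale a vector to length one."""
    return v / np.linalg.norm(v)

cos_ab = unit(a) @ unit(b)
cos_ac = unit(a) @ unit(c)

print(cos_ab, cos_ac)
```

Because TF-IDF values are never negative, the cosine similarity between two document vectors always falls between 0 and 1.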

Complete the function below. Calculate the final TF-IDF matrix and then normalize each document vector by dividing every element of the vector by the length of the vector.

In [None]:
def get_tfidf(tf, idf):
    # YOUR CODE STARTS HERE
    
    # YOUR CODE ENDS HERE

In [None]:
tfidf = get_tfidf(tf, idf)

Make a function that calculates the dot product between a query vector and the corpus matrix, and returns the indices of the top-k documents sorted by cosine similarity.
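
The ranking step can be sketched on a tiny hand-made corpus. The matrix below is an assumption for illustration: three already-normalized document vectors stored as columns, as in the rest of this lab.

```python
import numpy as np

# Columns are unit-length document vectors (|V| x |D| layout).
toy_corpus = np.array([[1.0, 0.0, 0.6],
                       [0.0, 1.0, 0.8]])
toy_query = toy_corpus[:, 2]  # use the third document as the query

# For unit vectors, one dot product per column gives the cosine similarities.
scores = toy_query @ toy_corpus

k = 2
# argsort is ascending, so reverse it and keep the first k indices.
top_k = np.argsort(scores)[::-1][:k]

print(scores)
print(top_k)
```

The query document itself comes first with a similarity of 1, followed by the document whose direction is closest to it.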

In [None]:
def find_similar(query, corpus, k=5):
    # YOUR CODE STARTS HERE
    
    # YOUR CODE ENDS HERE

Let's try to find the top-5 similar texts to the first text in our collection. If we did everything correctly, the first text itself should come out as the most similar, since it has the same vector :)

In [None]:
top_similar = find_similar(tfidf[:, 0], tfidf)
top_similar

We can now print the most similar plot synopses:

In [None]:
def print_similar(query, top_similar):
    print("Query: ")
    print(query['title'])
    print('---')
    print(query['tags'])
    print('---')
    print(query['plot_synopsis'])
    print('\n\n')
    for i, similar in top_similar.iterrows():
        print(f"Similar #{i}: ")
        print(similar['title'])
        print('---')
        print(similar['tags'])
        print('---')
        print(similar['plot_synopsis'])
        print('\n\n')

In [None]:
print_similar(train_dataset.loc[0], train_dataset.iloc[top_similar])

Now, what if we want to find similar plot synopses for another movie, not seen in the training set? Easy!

We just calculate the TF matrix for the new synopsis using the same preprocessing and vocabulary mappings as for the training set. And then we just multiply it with the IDF from the training set.
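
A minimal sketch of this procedure on the toy example follows. The new sentence and all variable names are invented for illustration, and dropping words that are absent from the training vocabulary (here "dog") is an assumption about how unseen words are treated.

```python
import numpy as np

toy_word2idx = {'the': 0, 'cat': 1, 'sat': 2, 'on': 3,
                'mat': 4, 'who': 5, 'will': 6, 'feed': 7}

# IDF values as they would come from the two-document toy training set.
train_df = np.array([2, 2, 1, 1, 1, 1, 1, 1])
train_idf = np.log10((1 + 2) / (1 + train_df)) + 1

new_tokens = ['the', 'dog', 'will', 'feed', 'the', 'cat']

# Term frequencies over the *training* vocabulary; unseen words are skipped.
new_tf = np.zeros(len(toy_word2idx))
for tok in new_tokens:
    idx = toy_word2idx.get(tok)
    if idx is not None:
        new_tf[idx] += 1

# Reuse the training IDF, then normalize to unit length.
new_tfidf = new_tf * train_idf
new_tfidf = new_tfidf / np.linalg.norm(new_tfidf)

print(new_tf)
```

Nothing is re-estimated from the new text: both the vocabulary and the IDF weights stay fixed at their training-set values.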

Go to the IMDb website and find a plot synopsis of your favorite movie. Then find the most similar movies in our training collection.

In [None]:
test_title = "Don't Look Up"
test_tags = ['comedy', 'drama', 'sci-fi']
test_text = """Kate Dibiasky, an astronomy grad student at Michigan State University, discovers the existence of an unidentified comet. Her professor, Dr. Randall Mindy, calculates that the trajectory of the comet crosses that of the Earth and that an impact will take place in about six months, killing all life in the process. Accompanied by scholar Teddy Oglethorpe, Kate and Randall travel to the White House to present their findings, but are met with apathy from U.S. President Janie Orlean and her staff, including her son, Chief of Staff Jason. The attempt to inform the population through a television program also fails, though Kate's on-camera antics go viral online. When Orlean becomes involved in a sex scandal, she announces the threat of the comet to divert attention. The news is finally spread by the media and the launch of a spaceship that can hit and divert the comet, saving the planet, is announced. However, the operation is canceled mid-flight when Peter Isherwell, a tech billionaire and prominent funder of Orlean, discovers that the comet is composed of trillions of dollars worth of precious minerals that have become scarce on Earth. The White House plans to commercially exploit the comet by crushing it to reduce its size and recovering the fragments. Kate and Teddy immediately abandon the operation in protest, while Randall submissively becomes a prominent voice in advocating for the comet's commercial opportunities, as well as starting an affair with talk show host Brie Evantee. The world becomes ideologically divided between those who demand the total destruction of the comet, those who decry unjustified alarmism, and those who deny that a comet even exists. Meanwhile, Kate returns home to Michigan and begins a relationship with a boy named Yule. 
After his wife June discovers his infidelity, Randall becomes angered and voices his frustrations on live television, launching into a rant criticizing Orlean's administration for downplaying the impending apocalypse and questions humanity's indifference, before leaving the operation and reconciling with Kate. Orlean and Isherwell's plan to recover the comet's materials fails, leaving them, along with a group of wealthy Americans, to flee in a spaceship designed to find the nearest Earth-like planet. However, they accidentally leave Jason behind in the process. Before leaving, Orlean offers Randall a place on the ship, but he turns her down, choosing to spend his last moments in the company of Kate, his family, Yule, and Teddy. The comet finally hits the planet, killing everyone. In a mid-credits scene set twenty-two thousand years later, the presidential ship lands on a lush alien planet. Its passengers wake up from cryogenic sleep and take a look at the surrounding environment only to immediately be attacked and killed by the planet's wild animals. In a post-credits scene, Jason is shown to have survived the extinction of life on Earth, wondering if his mother is still coming back, and documents the aftermath on his phone."""

test_dataset = {
    'title': test_title,
    'tags': test_tags,
    'plot_synopsis': test_text
}

In [None]:
test_tokenized = preprocess([test_text])
test_tf = get_tf(test_tokenized, word2idx)

In [None]:
test_tfidf = get_tfidf(test_tf, idf)

In [None]:
test_top_similar = find_similar(test_tfidf[:, 0], tfidf)

In [None]:
print_similar(test_dataset, train_dataset.iloc[test_top_similar])

Luckily, we don't have to implement TF-IDF from scratch every time. The sklearn package has a very convenient [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html?highlight=tfidf#sklearn.feature_extraction.text.TfidfVectorizer) that is very easy to use.

This time, we will initialize it with our vocabulary mappings and a dummy tokenizer so that we can compare it with our implementation. You can see what different arguments do in the documentation.

In [None]:
vectorizer = TfidfVectorizer(
    vocabulary=word2idx, 
    tokenizer=lambda x: x, 
    preprocessor=lambda x: x, 
    token_pattern=None
)

Then, our TF-IDF is just one line away!

In [None]:
X = vectorizer.fit_transform(train_dataset['tokenized_synopsis'])

We can also use the [`cosine_similarity`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html?highlight=cosine#sklearn.metrics.pairwise.cosine_similarity) function from sklearn to find similar documents.

Just pay attention that the sklearn implementation returns a TF-IDF matrix of size $|D| \times |V|$, whereas our implementation returned $|V| \times |D|$!
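
The difference in layout is easy to check on the toy collection. The sketch below assumes scikit-learn is installed and reuses the same dummy tokenizer and preprocessor arguments as above; the `toy_` names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_texts = [['the', 'cat', 'sat', 'on', 'the', 'mat'],
             ['who', 'will', 'feed', 'the', 'cat']]
toy_word2idx = {'the': 0, 'cat': 1, 'sat': 2, 'on': 3,
                'mat': 4, 'who': 5, 'will': 6, 'feed': 7}

toy_vectorizer = TfidfVectorizer(
    vocabulary=toy_word2idx,
    tokenizer=lambda x: x,      # texts are already tokenized
    preprocessor=lambda x: x,   # and already preprocessed
    token_pattern=None,
)
toy_X = toy_vectorizer.fit_transform(toy_texts)

print(toy_X.shape)            # documents along the rows, words along the columns
toy_X_vd = toy_X.T.toarray()  # transposed copy in the |V| x |D| layout of this lab
print(toy_X_vd.shape)
```

With the default settings, each row of `toy_X` is already normalized to unit length, so a dot product between rows is a cosine similarity.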

In [None]:
cosine_similarity(X[0], X)

Let's compare the sklearn implementation with ours. The results may differ slightly from what we had, which is normal: for example, sklearn computes the IDF with the natural logarithm rather than $\log_{10}$.

In [None]:
test_X = vectorizer.transform(test_tokenized)

In [None]:
test_sklearn_top_similar = np.argsort(cosine_similarity(test_X, X)[0])[::-1][:5]

In [None]:
print_similar(test_dataset, train_dataset.iloc[test_sklearn_top_similar])

Now, you can use these document representations as input features to a machine learning algorithm. This is what you are going to do in your Homework :)

### Optional exercise

It is also possible to use TF-IDF to make a text summarization tool. To do that, we can split a document into sentences and calculate the TF-IDF matrix for it, except that this time the columns represent sentences instead of documents.

Then, remember that important words have high TF-IDF scores. You can use this to calculate the importance of each sentence by summing up the scores of all its words. Finally, use some threshold to pick the most important sentences. This will be your document summary.
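
One possible starting point is sketched below. It is not a reference solution: the function name `summarize`, the naive regex-based sentence splitting, and the choice to keep the top `n_sentences` sentences instead of applying a score threshold are all assumptions made for brevity.

```python
import re
import numpy as np

def summarize(text, n_sentences=2):
    """Hypothetical helper: keep the sentences with the highest summed TF-IDF."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]

    # Vocabulary and TF matrix, with sentences playing the role of documents.
    words = sorted({w for toks in tokenized for w in toks})
    vocab = {w: i for i, w in enumerate(words)}
    tf = np.zeros((len(vocab), len(sentences)))
    for j, toks in enumerate(tokenized):
        for w in toks:
            tf[vocab[w], j] += 1

    # Smoothed IDF over sentences, then one summed score per sentence.
    df = np.count_nonzero(tf, axis=1)
    idf = np.log10((1 + len(sentences)) / (1 + df)) + 1
    scores = (tf * idf[:, None]).sum(axis=0)

    # Take the best sentences, but print them in their original order.
    top = np.sort(np.argsort(scores)[::-1][:n_sentences])
    return [sentences[i] for i in top]

toy_summary = summarize(
    "The cat sat. The cat sat on the warm mat near the door. Who fed it?"
)
print(toy_summary)
```

Note that a plain sum favors long sentences, so dividing each score by the sentence length is one variation worth trying.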

Try to take any plot synopsis and make its summary using TF-IDF.