# Data-Driven Research Assignment 2: Topic Modeling
This notebook contains the second, collaborative, graded assignment of the 2023 Data-Driven Research course. In this assignment you'll use a topic modeling tool to uncover the "topics" of a large set of reviews of popular films.

To complete the assignment, complete **Part 1, Part 2, Part 3 and Part 4** of the **Your Model** section at the end.

This is a collaborative assignment. In the text cell below, please include all the names of your group members.

If you used code or a solution from the internet (such as StackOverflow) or another external resource, please make reference to it (in any format). Unattributed copied code will be considered plagiarism and therefore fraud.


**Authors of this answer:** Leonards Leimanis

# 1. Introduction

You'll use a topic modelling tool from Gensim, a popular Python library for topic modelling that is these days mainly known for its Word2Vec implementation for training word embeddings (dense representations). Using this library, you will model topics based on reviews of popular films. The reviews are stored in plain text files, organized by film and rating. The aim of this exercise is to familiarize you with the topic modelling process and its output, and to give you insight into the kinds of topics that are modelled.

# 2. Preparation

This assignment comes with the following files:


1.   The reviews of the films. This is the data in which we want to find topics. They are found in the `movie2k/txt_sentoken` directory, split into two types: negative reviews (`neg` directory) and positive reviews (`pos` directory). The reviews are already tokenized.
2.   Stopword list files. They are found in the `stopwords` directory.

Let's start by loading the movie reviews from the files (I'll do it for you):

In [53]:
import os

def load_reviews(folder_path):
    reviews = [] #Make a list to put the reviews in
    reviewnames = [] # Make a list to put the review filenames in (to be able to look them up later)
    tokens = 0 #Make a counter for the number of tokens
    
    for file in os.listdir(folder_path):
        #Loop through all the text files in the folder, each containing one review
        
        if not file.endswith('.txt'):  #Only read text files
            continue

        file_path = os.path.join(folder_path, file)

        #Open the text file and read its contents
        with open(file_path, encoding='utf-8') as infile:
            review = infile.read()
        reviewnames.append(file)
            
        # Turn the string with the review into a list of words (this is easy because it is already tokenized)
        review = review.split()
        # And add it to the list
        reviews.append(review)
        # To count the number of tokens processed so far
        tokens = tokens + len(review)

    print(f"Loaded reviews from {folder_path} containing {tokens} tokens in total.") 
    return reviews, reviewnames
        
folder_path = "movie2k/txt_sentoken"
    
movie_reviews_pos, movie_reviewnames_pos = load_reviews(folder_path + "/pos") #Load the positive reviews
movie_reviews_neg, movie_reviewnames_neg = load_reviews(folder_path + "/neg") #Load the negative reviews

movie_reviews = movie_reviews_pos + movie_reviews_neg #Combine the lists of positive and negative reviews into one
movie_reviewnames = movie_reviewnames_pos + movie_reviewnames_neg #The same for the list of filenames

Loaded reviews from movie2k/txt_sentoken/pos containing 787051 tokens in total.
Loaded reviews from movie2k/txt_sentoken/neg containing 705630 tokens in total.


If you are working on Google Colab, you will probably have to change the path to the files to something that Google Colab has access to. For example, you could put the files on your Google Drive and then load them from there, as we did in Coding the Humanities. For more details about how to work with files in Python and load them from Google Drive, have a look at the Coding the Humanities course notebook on Files: https://github.com/bloemj/2023-coding-the-humanities/blob/main/notebooks/4_ReadingAndWritingFiles.ipynb

How to load files off Google Drive is explained at the beginning there.

## Preprocessing

Now that we have loaded the text, you might want to perform some pre-processing steps to be able to create a better bag-of-words model in which all forms of a word are mapped to a single number. For example, you could remove the punctuation characters, or you could perform lemmatization or stemming, which we discussed in the lecture. This would be the place to do it by writing a preprocessing function that accepts a list of movie reviews as its argument and returns a preprocessed list of movie reviews. Feel free to use your knowledge of text normalization from Coding the Humanities or the functions you wrote then. Here is some information on how to perform stemming with NLTK: https://www.nltk.org/howto/stem.html

You can also try other forms of preprocessing, if you are able to do it.

Make sure to also keep the unmodified reviews, so you can compare the results with preprocessing and without preprocessing.
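
As a rough illustration of the kind of preprocessing function described above, here is a minimal sketch that uses only the standard library. The name `preprocess_reviews` and its optional `stem` argument are hypothetical, not part of the assignment files. It drops tokens that consist only of punctuation and returns a new list, so the unmodified reviews stay available for comparison.

```python
import string

PUNCTUATION = set(string.punctuation)

def preprocess_reviews(reviews, stem=None):
    """Return a cleaned copy of `reviews` (a list of token lists).

    `stem` is an optional function applied to every kept token
    (an assumption for illustration; any str -> str function works).
    """
    cleaned = []
    for review in reviews:
        kept = []
        for token in review:
            # Skip empty tokens and tokens made up only of punctuation, e.g. "," or "--"
            if all(char in PUNCTUATION for char in token):
                continue
            kept.append(stem(token) if stem is not None else token)
        cleaned.append(kept)
    return cleaned

sample = [["the", "film", ",", "--", "it's", "12-part", "."]]
print(preprocess_reviews(sample))
```

If NLTK is installed, a stemmer could be plugged in by passing `PorterStemmer().stem` (from `nltk.stem`) as the `stem` argument. Whether stemming actually improves the topics is something to check against the results.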

**Part 1: Preprocessing**

You can also skip this part for now - it is not required to perform the topic modelling, but you will get better results.

In [54]:
import string

preprocessed_movie_reviews = []

for review in movie_reviews:
    normalized = []
    for token in review:
        # Drop punctuation tokens. This is a substring test against string.punctuation,
        # so single characters such as ',' are removed, but a token such as '--' is kept.
        if token in string.punctuation:
            continue
        normalized.append(token)
    preprocessed_movie_reviews.append(normalized)


In [55]:

print(movie_reviews[0])

['films', 'adapted', 'from', 'comic', 'books', 'have', 'had', 'plenty', 'of', 'success', ',', 'whether', "they're", 'about', 'superheroes', '(', 'batman', ',', 'superman', ',', 'spawn', ')', ',', 'or', 'geared', 'toward', 'kids', '(', 'casper', ')', 'or', 'the', 'arthouse', 'crowd', '(', 'ghost', 'world', ')', ',', 'but', "there's", 'never', 'really', 'been', 'a', 'comic', 'book', 'like', 'from', 'hell', 'before', '.', 'for', 'starters', ',', 'it', 'was', 'created', 'by', 'alan', 'moore', '(', 'and', 'eddie', 'campbell', ')', ',', 'who', 'brought', 'the', 'medium', 'to', 'a', 'whole', 'new', 'level', 'in', 'the', 'mid', "'80s", 'with', 'a', '12-part', 'series', 'called', 'the', 'watchmen', '.', 'to', 'say', 'moore', 'and', 'campbell', 'thoroughly', 'researched', 'the', 'subject', 'of', 'jack', 'the', 'ripper', 'would', 'be', 'like', 'saying', 'michael', 'jackson', 'is', 'starting', 'to', 'look', 'a', 'little', 'odd', '.', 'the', 'book', '(', 'or', '"', 'graphic', 'novel', ',', '"', 'if

In [56]:
print(preprocessed_movie_reviews[0])

['films', 'adapted', 'from', 'comic', 'books', 'have', 'had', 'plenty', 'of', 'success', 'whether', "they're", 'about', 'superheroes', 'batman', 'superman', 'spawn', 'or', 'geared', 'toward', 'kids', 'casper', 'or', 'the', 'arthouse', 'crowd', 'ghost', 'world', 'but', "there's", 'never', 'really', 'been', 'a', 'comic', 'book', 'like', 'from', 'hell', 'before', 'for', 'starters', 'it', 'was', 'created', 'by', 'alan', 'moore', 'and', 'eddie', 'campbell', 'who', 'brought', 'the', 'medium', 'to', 'a', 'whole', 'new', 'level', 'in', 'the', 'mid', "'80s", 'with', 'a', '12-part', 'series', 'called', 'the', 'watchmen', 'to', 'say', 'moore', 'and', 'campbell', 'thoroughly', 'researched', 'the', 'subject', 'of', 'jack', 'the', 'ripper', 'would', 'be', 'like', 'saying', 'michael', 'jackson', 'is', 'starting', 'to', 'look', 'a', 'little', 'odd', 'the', 'book', 'or', 'graphic', 'novel', 'if', 'you', 'will', 'is', 'over', '500', 'pages', 'long', 'and', 'includes', 'nearly', '30', 'more', 'that', 'co

# 3. Topic Modelling using Gensim

Gensim offers an implementation of Latent Dirichlet Allocation (LDA), the most popular topic modelling algorithm, which we discussed in the lecture. If you are working on Google Colab, it is normally already installed there. Otherwise, you can install it with `pip install --upgrade gensim` or if you are using Conda, `conda install -c conda-forge gensim`.

Let's load it, and some other things we use:

In [57]:
import gensim
import gensim.corpora as corpora
import gensim.models as models
import itertools
from operator import itemgetter
print(gensim.__version__)

4.3.1


## Constructing the bag-of-words model

The `gensim.corpora.Dictionary()` class allows you to map words to numbers, which is what we need to make a bag-of-words model. In particular, its `doc2bow()` method converts a collection of words to a bag-of-words representation:

In [58]:
movie_dictionary = corpora.Dictionary(movie_reviews)
movie_bow_corpus = [movie_dictionary.doc2bow(d) for d in movie_reviews]
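
To make the idea behind the bag-of-words representation concrete, here is a small pure-Python sketch of the same concept. It is not Gensim's implementation: the helper names `build_token_ids` and `to_bag_of_words` are made up for illustration, and Gensim may assign different ids to the words.

```python
from collections import Counter

def build_token_ids(documents):
    """Give every distinct token a number, in order of first appearance."""
    token_to_id = {}
    for document in documents:
        for token in document:
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return token_to_id

def to_bag_of_words(document, token_to_id):
    """Return sorted (token_id, count) pairs; word order is discarded."""
    counts = Counter(token_to_id[token] for token in document if token in token_to_id)
    return sorted(counts.items())

docs = [["a", "comic", "book", "a"], ["a", "film"]]
ids = build_token_ids(docs)
print(ids)
print(to_bag_of_words(docs[0], ids))
```

Each document becomes a list of (id, count) pairs, which is the same shape as the entries of `movie_bow_corpus`.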

Let's see what happened:

In [59]:
print('Number of unique tokens in the dataset:', len(movie_dictionary))

#Checking the first 12 words in the bag-of-words model
print('\nThe first 12 words in the bag-of-words model:')
print(dict(itertools.islice(movie_dictionary.token2id.items(), 12)))

#Checking the first 100 words of the first review
print('\nThe start of the first review:')
print(movie_reviews[0][:100])
#And the filename of that review is...
print('\nThe filename of the first review:')
print(movie_reviewnames[0])

#Which words are used in that review?
print('\nMost frequent words in the first review:')
for i, freq in sorted(movie_bow_corpus[0], key=itemgetter(1), reverse=True)[:20]:
    print(movie_dictionary[i], "-->", freq)
print("...")

Number of unique tokens in the dataset: 50920

The first 12 words in the bag-of-words model:
{'"': 0, "'80s": 1, '(': 2, ')': 3, ',': 4, '-': 5, '.': 6, '00': 7, '102': 8, '12-part': 9, '1888': 10, '2': 11}

The start of the first review:
['films', 'adapted', 'from', 'comic', 'books', 'have', 'had', 'plenty', 'of', 'success', ',', 'whether', "they're", 'about', 'superheroes', '(', 'batman', ',', 'superman', ',', 'spawn', ')', ',', 'or', 'geared', 'toward', 'kids', '(', 'casper', ')', 'or', 'the', 'arthouse', 'crowd', '(', 'ghost', 'world', ')', ',', 'but', "there's", 'never', 'really', 'been', 'a', 'comic', 'book', 'like', 'from', 'hell', 'before', '.', 'for', 'starters', ',', 'it', 'was', 'created', 'by', 'alan', 'moore', '(', 'and', 'eddie', 'campbell', ')', ',', 'who', 'brought', 'the', 'medium', 'to', 'a', 'whole', 'new', 'level', 'in', 'the', 'mid', "'80s", 'with', 'a', '12-part', 'series', 'called', 'the', 'watchmen', '.', 'to', 'say', 'moore', 'and', 'campbell', 'thoroughly', 'r

## The topic model

Now, we can train our LDA model on this bag-of-words data by using `gensim.models.ldamodel.LdaModel()`.

This model can take various parameters that specify what kind of model gets made. Some important ones:


* `num_topics`: how many topics do we want? In what follows, we set the number of topics to 5, because we want a few topics that we can interpret, but the right number of topics is data- and application-dependent;
* `id2word`: our bag-of-words dictionary, needed to map ids back to strings;
* `passes`: how often we iterate over the entire corpus (default = 1). In general, more passes give the model more opportunity to converge, at the cost of longer training time. In Artificial Intelligence and Machine Learning, passes are also called epochs.

Let's first make a model that finds 5 topics, and tries 25 times to improve its estimate. This code may take a while to run, as it is the process that creates the topic model. If it takes too long, you can reduce the number of passes, but the topics might be worse.

In [60]:
reviews_ldamodel = models.ldamodel.LdaModel(movie_bow_corpus, num_topics=5, id2word = movie_dictionary, passes=25)

And let's have a look! An easy way to inspect the created topics is the `show_topics()` method, which returns the most representative words for each topic along with their probabilities.

In [61]:
reviews_ldamodel.show_topics(num_words=8) #Show the top 8 words for each topic

[(0,
  '0.051*"," + 0.049*"the" + 0.045*"." + 0.028*"a" + 0.024*"of" + 0.023*"and" + 0.021*"to" + 0.017*"is"'),
 (1,
  '0.000*"," + 0.000*"." + 0.000*"to" + 0.000*"the" + 0.000*"of" + 0.000*"a" + 0.000*"and" + 0.000*"""'),
 (2,
  '0.001*"caveman\'s" + 0.001*"valentine_" + 0.001*"_the" + 0.001*"mouse" + 0.001*"romulus" + 0.000*"ghostface" + 0.000*"ghostface\'s" + 0.000*"homeless"'),
 (3,
  '0.056*"," + 0.054*"the" + 0.040*"." + 0.026*"and" + 0.024*"a" + 0.023*"of" + 0.021*"to" + 0.018*"is"'),
 (4,
  '0.048*"." + 0.045*"the" + 0.044*"," + 0.023*"a" + 0.021*"to" + 0.020*"and" + 0.019*"of" + 0.014*"is"')]

There we go, we have a topic model. However, you can probably see that it is far from perfect and some uninteresting 'words' appear there. Now, it is your turn to make it better!
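
The topic strings shown above are convenient to read but awkward to process further. As a small sketch, the hypothetical helper `parse_topic` below turns one such string into (word, weight) pairs using a regular expression. It assumes the `weight*"word" + ...` layout seen in the output above.

```python
import re

# One term looks like 0.051*"word"; terms are joined by " + "
TERM_PATTERN = re.compile(r'([0-9.]+)\*"(.*?)"(?= \+ |$)')

def parse_topic(topic_string):
    """Split a formatted topic string into (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]

topic = '0.051*"," + 0.049*"the" + 0.045*"." + 0.028*"a"'
for word, weight in parse_topic(topic):
    print(repr(word), weight)
```

Gensim's `show_topics()` also accepts `formatted=False`, which should return such pairs directly without any parsing.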

## Your model

**Part 1: Preprocessing**

Show the effect of your preprocessing by also making a topic model for your preprocessed_movie_reviews. First, you make a bag-of-words model and then the LdaModel, as above. Feel free to go back to your preprocessing code above and update it based on what you saw from the show_topics function applied to the initial model.

Try to make a model with 8 topics, and show the top 8 words for each topic. **Assign the model to a new variable with a sensible name** (avoid overwriting the previous models).

Also for the dictionary and corpus, **give the variables different and expressive names to avoid overwriting the other ones**. Otherwise, you will get confused between your different topic models.

In [62]:
preprocessed_movie_dictionary = corpora.Dictionary(preprocessed_movie_reviews)
preprocessed_movie_bow_corpus = [preprocessed_movie_dictionary.doc2bow(d) for d in preprocessed_movie_reviews]

print('Number of unique tokens in the dataset:', len(preprocessed_movie_dictionary))

#Checking the first 12 words in the bag-of-words model
print('\nThe first 12 words in the bag-of-words model:')
print(dict(itertools.islice(preprocessed_movie_dictionary.token2id.items(), 12)))

#Checking the first 100 words of the first review
print('\nThe start of the first review:')
print(preprocessed_movie_reviews[0][:100])
#And the filename of that review is...
print('\nThe filename of the first review:')
print(movie_reviewnames[0])

#Which words are used in that review?
print('\nMost frequent words in the first review:')
for i, freq in sorted(preprocessed_movie_bow_corpus[0], key=itemgetter(1), reverse=True)[:20]:
    print(preprocessed_movie_dictionary[i], "-->", freq)
print("...")

preprocessed_reviews_ldamodel = models.ldamodel.LdaModel(preprocessed_movie_bow_corpus, num_topics=5, id2word = preprocessed_movie_dictionary, passes=25)

preprocessed_reviews_ldamodel.show_topics(num_words=8) #Show the top 8 words for each topic

Number of unique tokens in the dataset: 50893

The first 12 words in the bag-of-words model:
{"'80s": 0, '00': 1, '102': 2, '12-part': 3, '1888': 4, '2': 5, '30': 6, '500': 7, 'a': 8, 'abberline': 9, 'ably': 10, 'about': 11}

The start of the first review:
['films', 'adapted', 'from', 'comic', 'books', 'have', 'had', 'plenty', 'of', 'success', 'whether', "they're", 'about', 'superheroes', 'batman', 'superman', 'spawn', 'or', 'geared', 'toward', 'kids', 'casper', 'or', 'the', 'arthouse', 'crowd', 'ghost', 'world', 'but', "there's", 'never', 'really', 'been', 'a', 'comic', 'book', 'like', 'from', 'hell', 'before', 'for', 'starters', 'it', 'was', 'created', 'by', 'alan', 'moore', 'and', 'eddie', 'campbell', 'who', 'brought', 'the', 'medium', 'to', 'a', 'whole', 'new', 'level', 'in', 'the', 'mid', "'80s", 'with', 'a', '12-part', 'series', 'called', 'the', 'watchmen', 'to', 'say', 'moore', 'and', 'campbell', 'thoroughly', 'researched', 'the', 'subject', 'of', 'jack', 'the', 'ripper', 'would

[(0,
  '0.015*"the" + 0.009*"and" + 0.008*"a" + 0.006*"to" + 0.006*"of" + 0.004*"is" + 0.004*"wars" + 0.004*"her"'),
 (1,
  '0.027*"the" + 0.022*"and" + 0.017*"a" + 0.012*"of" + 0.011*"to" + 0.010*"i" + 0.009*"is" + 0.008*"that"'),
 (2,
  '0.053*"the" + 0.030*"and" + 0.028*"a" + 0.026*"of" + 0.023*"to" + 0.022*"is" + 0.017*"in" + 0.014*"his"'),
 (3,
  '0.060*"the" + 0.031*"of" + 0.027*"a" + 0.026*"and" + 0.021*"to" + 0.015*"is" + 0.014*"in" + 0.009*"as"'),
 (4,
  '0.060*"the" + 0.030*"a" + 0.026*"and" + 0.025*"of" + 0.025*"to" + 0.019*"is" + 0.017*"in" + 0.013*"that"')]

**Part 2: Stopwords**

The topics you saw so far are probably mostly made up of stopwords such as "the". As discussed in the lecture, our results will probably be more interesting if we get rid of them.

We have included 3 generic lists of stopwords: the default list of the tool Mallet, a shorter frequent word list used in search applications (Snowball stemmer), and the top 10,000 words based on Google n-grams (in frequency order, select as many lines as you want). Gensim and NLTK also have stopword lists.

Make a function that accepts the path to a stopwords file (e.g. `stopwords/standard-mallet-en.txt`), and returns a list of stopwords.

In [63]:
def load_stopwords(filename):
    # Read the whole file; the stopwords may be separated by spaces or newlines
    with open(filename, 'r', encoding='utf-8') as file:
        contents = file.read()

    stopword_list = contents.split()
    return stopword_list

stopword_list = load_stopwords("stopwords/standard-mallet-en.txt")

Then, make a function that takes a stopword list and a list of reviews (e.g. `preprocessed_movie_reviews`). The function should remove all stopwords from all the reviews, returning a list of the reviews without stopwords. This code may be a bit slow if you have many stopwords, since there is a lot of data to process.

In [64]:
def filter_stopwords(stopword_list, movie_reviews):
    # A set makes each membership test much faster than scanning a list
    stopword_set = set(stopword_list)

    filtered_movie_reviews = []
    for review in movie_reviews:
        # Keep only the tokens that are not stopwords
        normalized = [token for token in review if token not in stopword_set]
        filtered_movie_reviews.append(normalized)

    return filtered_movie_reviews

filtered_movie_reviews = filter_stopwords(stopword_list, preprocessed_movie_reviews)

Lastly, let's make another topic model with this filtered data! Again, you make a bag-of-words model and then the LdaModel, as above.

Try to make a model with 8 topics, and show the top 8 words for each topic. Assign the model to a new variable with a sensible name (avoid overwriting the previous models).

In [65]:
filtered_movie_dictionary = corpora.Dictionary(filtered_movie_reviews)
filtered_movie_bow_corpus = [filtered_movie_dictionary.doc2bow(d) for d in filtered_movie_reviews]

print('Number of unique tokens in the dataset:', len(filtered_movie_dictionary))

#Checking the first 12 words in the bag-of-words model
print('\nThe first 12 words in the bag-of-words model:')
print(dict(itertools.islice(filtered_movie_dictionary.token2id.items(), 12)))

#Checking the first 100 words of the first review
print('\nThe start of the first review:')
print(filtered_movie_reviews[0][:100])
#And the filename of that review is...
print('\nThe filename of the first review:')
print(movie_reviewnames[0])

#Which words are used in that review?
print('\nMost frequent words in the first review:')
for i, freq in sorted(filtered_movie_bow_corpus[0], key=itemgetter(1), reverse=True)[:20]:
    print(filtered_movie_dictionary[i], "-->", freq)
print("...")

filtered_reviews_ldamodel = models.ldamodel.LdaModel(filtered_movie_bow_corpus, num_topics=5, id2word = filtered_movie_dictionary, passes=25)

filtered_reviews_ldamodel.show_topics(num_words=8) #Show the top 8 words for each topic

Number of unique tokens in the dataset: 50393

The first 12 words in the bag-of-words model:
{"'80s": 0, '00': 1, '102': 2, '12-part': 3, '1888': 4, '2': 5, '30': 6, '500': 7, 'abberline': 8, 'ably': 9, 'absinthe': 10, 'accent': 11}

The start of the first review:
['films', 'adapted', 'comic', 'books', 'plenty', 'success', "they're", 'superheroes', 'batman', 'superman', 'spawn', 'geared', 'kids', 'casper', 'arthouse', 'crowd', 'ghost', 'world', "there's", 'comic', 'book', 'hell', 'starters', 'created', 'alan', 'moore', 'eddie', 'campbell', 'brought', 'medium', 'level', 'mid', "'80s", '12-part', 'series', 'called', 'watchmen', 'moore', 'campbell', 'researched', 'subject', 'jack', 'ripper', 'michael', 'jackson', 'starting', 'odd', 'book', 'graphic', '500', 'pages', 'long', 'includes', '30', 'consist', 'footnotes', 'words', "don't", 'dismiss', 'film', 'source', 'past', 'comic', 'book', 'thing', 'find', 'stumbling', 'block', "hell's", 'directors', 'albert', 'allen', 'hughes', 'hughes', 'br

[(0,
  '0.014*"film" + 0.011*"movie" + 0.007*"it\'s" + 0.004*"time" + 0.004*"good" + 0.004*"story" + 0.003*"character" + 0.003*"--"'),
 (1,
  '0.014*"film" + 0.012*"movie" + 0.006*"it\'s" + 0.004*"good" + 0.004*"bad" + 0.004*"time" + 0.003*"plot" + 0.003*"character"'),
 (2,
  '0.014*"film" + 0.004*"movie" + 0.003*"it\'s" + 0.003*"star" + 0.003*"time" + 0.003*"characters" + 0.003*"story" + 0.003*"good"'),
 (3,
  '0.013*"film" + 0.006*"movie" + 0.006*"it\'s" + 0.004*"story" + 0.004*"good" + 0.004*"time" + 0.003*"character" + 0.003*"life"'),
 (4,
  '0.015*"film" + 0.007*"movie" + 0.005*"it\'s" + 0.003*"characters" + 0.003*"time" + 0.003*"good" + 0.003*"character" + 0.003*"story"')]

**Part 3: Experimentation**

Are these general stopword lists sufficient? We are working in the movie review domain, so there may be uninformative words that are not stopwords in the general domain, such as the word 'movie'. A key experiment is to add stopwords specific to the movie review domain, i.e. words that occur frequently in all (or most) of the topics. Note that removing words does not just hide them: it can lead to (very) different topics and different top-ranked reviews.

**Make your own domain-specific stopwords file** by taking one of the existing ones and adding your own stopwords (make sure that the stopword file is saved as a plain text file). Think about what stopwords are in this domain (e.g., the word film is not a stopword in general, but it will occur in essentially every film review).

Re-use the functions you previously made to load your own stopwords file and filter the movie reviews. Then, make another topic model with your new filtering and show the top 8 words for each topic.
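
One possible way to find candidates for such a domain-specific list is to look for words that occur in a large share of the reviews. The sketch below is only a suggestion: the function name `frequent_in_most_documents` and the 50% threshold are assumptions, and the resulting candidates still need a manual check before they go into the stopwords file.

```python
from collections import Counter

def frequent_in_most_documents(reviews, min_fraction=0.5):
    """Return (word, document_count) pairs for words found in at least
    `min_fraction` of the reviews, most widespread first."""
    document_frequency = Counter()
    for review in reviews:
        # set() so that a word counts once per review, however often it is repeated
        document_frequency.update(set(review))
    threshold = min_fraction * len(reviews)
    candidates = [(word, count) for word, count in document_frequency.items() if count >= threshold]
    return sorted(candidates, key=lambda pair: (-pair[1], pair[0]))

toy_reviews = [
    ["film", "plot", "alien"],
    ["film", "love", "plot"],
    ["film", "war"],
    ["film", "comedy", "film"],
]
print(frequent_in_most_documents(toy_reviews))
```

On the real data this could be called on `filtered_movie_reviews`, so that the general stopwords are already out of the way.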

In [66]:
my_stopword_list = load_stopwords("stopwords/my_movie_stopwords.txt")
domainfiltered_movie_reviews = filter_stopwords(my_stopword_list, preprocessed_movie_reviews)

In [106]:
domainfiltered_movie_dictionary = corpora.Dictionary(domainfiltered_movie_reviews)
domainfiltered_movie_bow_corpus = [domainfiltered_movie_dictionary.doc2bow(d) for d in domainfiltered_movie_reviews]

print('Number of unique tokens in the dataset:', len(domainfiltered_movie_dictionary))

#Checking the first 12 words in the bag-of-words model
print('\nThe first 12 words in the bag-of-words model:')
print(dict(itertools.islice(domainfiltered_movie_dictionary.token2id.items(), 12)))

#Checking the first 100 words of the first review
print('\nThe start of the first review:')
print(domainfiltered_movie_reviews[0][:100])
#And the filename of that review is...
print('\nThe filename of the first review:')
print(movie_reviewnames[0])

#Which words are used in that review?
print('\nMost frequent words in the first review:')
for i, freq in sorted(domainfiltered_movie_bow_corpus[0], key=itemgetter(1), reverse=True)[:20]:
    print(domainfiltered_movie_dictionary[i], "-->", freq)
print("...")

domainfiltered_reviews_ldamodel = models.ldamodel.LdaModel(domainfiltered_movie_bow_corpus, num_topics=5, id2word = domainfiltered_movie_dictionary, passes=25)

domainfiltered_reviews_ldamodel.show_topics(num_words=8) #Show the top 8 words for each topic

Number of unique tokens in the dataset: 50369

The first 12 words in the bag-of-words model:
{"'80s": 0, '00': 1, '102': 2, '12-part': 3, '1888': 4, '2': 5, '30': 6, '500': 7, 'abberline': 8, 'ably': 9, 'absinthe': 10, 'accent': 11}

The start of the first review:
['adapted', 'comic', 'books', 'plenty', 'success', "they're", 'superheroes', 'batman', 'superman', 'spawn', 'geared', 'kids', 'casper', 'arthouse', 'crowd', 'ghost', 'world', 'comic', 'book', 'hell', 'starters', 'created', 'alan', 'moore', 'eddie', 'campbell', 'brought', 'medium', 'level', 'mid', "'80s", '12-part', 'series', 'called', 'watchmen', 'moore', 'campbell', 'researched', 'subject', 'jack', 'ripper', 'michael', 'jackson', 'starting', 'odd', 'book', 'graphic', '500', 'pages', 'long', 'includes', '30', 'consist', 'footnotes', 'words', 'dismiss', 'source', 'past', 'comic', 'book', 'thing', 'find', 'stumbling', 'block', "hell's", 'directors', 'albert', 'allen', 'hughes', 'hughes', 'brothers', 'direct', 'ludicrous', 'cast

[(0,
  '0.003*"life" + 0.002*"man" + 0.002*"people" + 0.002*"love" + 0.002*"work" + 0.002*"real" + 0.002*"comedy" + 0.002*"scream"'),
 (1,
  '0.003*"people" + 0.002*"end" + 0.002*"man" + 0.002*"love" + 0.002*"audience" + 0.002*"life" + 0.002*"fact" + 0.002*"performance"'),
 (2,
  '0.003*"life" + 0.002*"big" + 0.002*"man" + 0.002*"people" + 0.002*"love" + 0.002*"made" + 0.002*"back" + 0.002*"funny"'),
 (3,
  '0.003*"action" + 0.003*"people" + 0.002*"life" + 0.002*"man" + 0.002*"end" + 0.002*"role" + 0.002*"alien" + 0.002*"back"'),
 (4,
  '0.002*"world" + 0.002*"people" + 0.002*"life" + 0.002*"makes" + 0.002*"man" + 0.002*"made" + 0.002*"work" + 0.002*"action"')]

Now, you should have 3 models (or more): one without any stopword filtering, one with the standard stopword filtering and one with the domain-filtered stopwords using the list you modified yourself. Compare the topics found by the three models (just looking at them is fine, no need to code a comparison).

Do the topics look better with stopword filtering and with domain-specific stopword filtering? At this point, do the resulting topics correspond to particular film genres you would have expected?

Of course the results look better with stopword filtering. Without filtering, the topics consist of punctuation, articles, and other repetitive words that add no value to the topic model. Removing more generic words leads to topics that point more precisely to a genre.
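
Although no coded comparison is required, the visual impression can be backed up with a simple number. The sketch below computes the Jaccard overlap between the top-8 word lists of two topics; the word lists are copied from the outputs above, and the function name `jaccard` is just an illustrative choice. A high overlap suggests that two topics are near-duplicates.

```python
def jaccard(words_a, words_b):
    """Share of words that two topic word lists have in common (0.0 to 1.0)."""
    set_a, set_b = set(words_a), set(words_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

# Top 8 words of topics 0 and 1 from the model with standard stopword filtering
filtered_topic_0 = ["film", "movie", "it's", "time", "good", "story", "character", "--"]
filtered_topic_1 = ["film", "movie", "it's", "good", "bad", "time", "plot", "character"]

# Top 8 words of topics 0 and 3 from the model with domain-specific stopword filtering
domain_topic_0 = ["life", "man", "people", "love", "work", "real", "comedy", "scream"]
domain_topic_3 = ["action", "people", "life", "man", "end", "role", "alien", "back"]

print("Standard filtering:", jaccard(filtered_topic_0, filtered_topic_1))
print("Domain filtering:  ", round(jaccard(domain_topic_0, domain_topic_3), 3))
```

For these two pairs the overlap drops from 0.6 to roughly 0.23, which fits the impression that the domain-specific stopwords make the topics more distinct. This is only one pair per model, so it is an indication rather than a full evaluation.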

Increase the number of topics. What happens with the topics if you model very few or very many topics? (answer in a text box). Assign the model(s) to a new variable with a sensible name (avoid overwriting the previous models).

In [68]:
domainfiltered_reviews_ldamodel_extended = models.ldamodel.LdaModel(domainfiltered_movie_bow_corpus, num_topics=15, id2word = domainfiltered_movie_dictionary, passes=25)

domainfiltered_reviews_ldamodel_extended.show_topics(num_words=8) #Show the top 8 words for each topic

[(2,
  '0.003*"people" + 0.002*"funny" + 0.002*"life" + 0.002*"man" + 0.002*"real" + 0.002*"comedy" + 0.002*"big" + 0.002*"love"'),
 (0,
  '0.004*"life" + 0.004*"scream" + 0.003*"2" + 0.003*"mulan" + 0.003*"toy" + 0.003*"disney" + 0.003*"spice" + 0.002*"horror"'),
 (6,
  '0.003*"man" + 0.002*"big" + 0.002*"john" + 0.002*"world" + 0.002*"action" + 0.002*"murphy" + 0.002*"role" + 0.002*"makes"'),
 (8,
  '0.003*"man" + 0.002*"apes" + 0.002*"people" + 0.002*"witch" + 0.002*"love" + 0.002*"action" + 0.002*"blair" + 0.002*"funny"'),
 (14,
  '0.004*"ryan" + 0.003*"life" + 0.003*"tarzan" + 0.003*"big" + 0.003*"war" + 0.002*"man" + 0.002*"city" + 0.002*"back"'),
 (1,
  '0.002*"man" + 0.002*"action" + 0.002*"made" + 0.002*"love" + 0.002*"real" + 0.002*"original" + 0.002*"people" + 0.002*"thing"'),
 (12,
  '0.004*"star" + 0.003*"trek" + 0.003*"godzilla" + 0.003*"people" + 0.002*"effects" + 0.002*"special" + 0.002*"actors" + 0.002*"work"'),
 (9,
  '0.003*"truman" + 0.003*"action" + 0.003*"carrey" 

In [69]:
domainfiltered_reviews_ldamodel_reduced = models.ldamodel.LdaModel(domainfiltered_movie_bow_corpus, num_topics=2, id2word = domainfiltered_movie_dictionary, passes=25)

domainfiltered_reviews_ldamodel_reduced.show_topics(num_words=8) #Show the top 8 words for each topic

[(0,
  '0.003*"life" + 0.002*"man" + 0.002*"love" + 0.002*"people" + 0.002*"end" + 0.002*"made" + 0.002*"performance" + 0.002*"work"'),
 (1,
  '0.003*"people" + 0.002*"life" + 0.002*"action" + 0.002*"big" + 0.002*"man" + 0.002*"back" + 0.002*"world" + 0.002*"effects"')]

Modelling fewer topics means that the topics cover fewer of the movie reviews, so less information is presented. There was no noticeable difference in speed, though.

Increase the number of topic words printed to get more information per topic.  Is it easier to make sense of a topic if you look further down the list, or are the initial words more clear?

In [70]:
domainfiltered_reviews_ldamodel_extended2 = models.ldamodel.LdaModel(domainfiltered_movie_bow_corpus, num_topics=15, id2word = domainfiltered_movie_dictionary, passes=25)

domainfiltered_reviews_ldamodel_extended2.show_topics(num_words=14) #Show the top 14 words for each topic

[(2,
  '0.003*"funny" + 0.002*"life" + 0.002*"action" + 0.002*"comedy" + 0.002*"makes" + 0.002*"bob" + 0.002*"smith" + 0.002*"man" + 0.002*"family" + 0.002*"long" + 0.002*"jay" + 0.002*"run" + 0.002*"people" + 0.002*"show"'),
 (3,
  '0.004*"action" + 0.003*"star" + 0.003*"life" + 0.002*"jackie" + 0.002*"people" + 0.002*"wars" + 0.002*"that\'s" + 0.002*"back" + 0.002*"guy" + 0.002*"effects" + 0.002*"end" + 0.002*"script" + 0.002*"series" + 0.002*"special"'),
 (6,
  '0.003*"end" + 0.003*"back" + 0.003*"funny" + 0.002*"people" + 0.002*"action" + 0.002*"evil" + 0.002*"script" + 0.002*"wild" + 0.002*"work" + 0.002*"thing" + 0.002*"life" + 0.002*"things" + 0.002*"audience" + 0.002*"man"'),
 (13,
  '0.004*"love" + 0.003*"people" + 0.003*"man" + 0.002*"life" + 0.002*"things" + 0.002*"world" + 0.002*"big" + 0.002*"made" + 0.002*"makes" + 0.002*"comedy" + 0.002*"young" + 0.002*"men" + 0.002*"performance" + 0.002*"end"'),
 (8,
  '0.003*"action" + 0.003*"people" + 0.003*"alien" + 0.002*"big" + 0.0

Up to a certain point, it is easier to understand what kind of movie a topic describes by looking at more topic words in the list. Yet looking at hundreds of words would probably not be better than looking at, say, the top 20 words.

If you are interested, you can also experiment with the difference between positive and negative reviews.

**Part 4: Evaluation**

There are a few numbers we can compute that indicate the quality of a topic model, such as [perplexity and coherence](https://github.com/ccs-amsterdam/r-course-material/blob/master/tutorials/R_text_LDA_perplexity.md). For perplexity, a lower number means a better model, and for coherence, a higher number is better. Try computing these scores for your models, and see which is the best one according to the numbers.

In a real project, you should compute these numbers over a separate part of the dataset (the test set) for a proper evaluation, but for simplicity and because we have not talked about this in the lecture we will skip that here.
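
For reference, a held-out split could look roughly like the sketch below. It is not used in the rest of this notebook; the function name `split_reviews`, the 20% test share, and the fixed seed are all assumptions for illustration.

```python
import random

def split_reviews(reviews, test_fraction=0.2, seed=42):
    """Randomly split `reviews` into a training list and a test list."""
    indices = list(range(len(reviews)))
    # A seeded generator keeps the split identical between runs
    random.Random(seed).shuffle(indices)
    test_size = int(len(reviews) * test_fraction)
    test_indices = set(indices[:test_size])
    train = [review for i, review in enumerate(reviews) if i not in test_indices]
    test = [review for i, review in enumerate(reviews) if i in test_indices]
    return train, test

toy_reviews = [[f"word{i}"] for i in range(10)]
train_reviews, test_reviews = split_reviews(toy_reviews)
print(len(train_reviews), "training reviews,", len(test_reviews), "test reviews")
```

The dictionary and the model would then be built from the training reviews only, and the test reviews converted with the same dictionary's `doc2bow()` before they are scored.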

In [101]:
from gensim.models import CoherenceModel

def coherence_model_function(reviews_ldamodel, movie_bow_corpus, modelpoint, dictionarypoint, textspoint):
    """Print the perplexity bound and the c_v coherence score of an LDA model.

    reviews_ldamodel and modelpoint are expected to be the same LdaModel;
    both parameters are kept so that the calls below work unchanged.
    """
    print(f"For {reviews_ldamodel}")

    # log_perplexity returns a per-word likelihood bound (a negative number), not the perplexity itself
    print('Perplexity: ', reviews_ldamodel.log_perplexity(movie_bow_corpus))

    # Compute the coherence score from the tokenized texts and the dictionary
    coherence_model_lda = CoherenceModel(model=modelpoint, texts=textspoint, dictionary=dictionarypoint, coherence='c_v')
    coherence_lda = coherence_model_lda.get_coherence()
    print('Coherence score: ', coherence_lda)

In [102]:
coherence_model_function(reviews_ldamodel, movie_bow_corpus, reviews_ldamodel, movie_dictionary ,movie_reviews)

For LdaModel<num_terms=50920, num_topics=5, decay=0.5, chunksize=2000>
Perplexity:  -7.012861000324164
Coherence score:  0.24250900769765563


In [110]:
coherence_model_function(preprocessed_reviews_ldamodel, preprocessed_movie_bow_corpus, preprocessed_reviews_ldamodel, preprocessed_movie_dictionary ,preprocessed_movie_reviews)

coherence_model_function(filtered_reviews_ldamodel, filtered_movie_bow_corpus, filtered_reviews_ldamodel, filtered_movie_dictionary ,filtered_movie_reviews)

For LdaModel<num_terms=50893, num_topics=5, decay=0.5, chunksize=2000>
Perplexity:  -7.390783729632215
Coherence score:  0.23951106201973132
For LdaModel<num_terms=50393, num_topics=5, decay=0.5, chunksize=2000>
Perplexity:  -9.20436684280963
Coherence score:  0.27373985326920336


In [112]:
coherence_model_function(domainfiltered_reviews_ldamodel, domainfiltered_movie_bow_corpus, domainfiltered_reviews_ldamodel, domainfiltered_movie_dictionary, domainfiltered_movie_reviews)

For LdaModel<num_terms=50369, num_topics=5, decay=0.5, chunksize=2000>
Perplexity:  -9.447892057157762
Coherence score:  0.24760062733127483


However, just comparing numbers is not very interpretable. We will choose our topic model with the highest coherence score and validate the evaluation.
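
As a small sketch, the coherence scores printed above can be compared programmatically. The dictionary below is an assumption made for illustration: its keys are the model variable names used in this notebook, and its values are the printed scores rounded to four decimals.

```python
# Coherence scores copied from the outputs above (rounded to four decimals).
coherence_scores = {
    "reviews_ldamodel": 0.2425,
    "preprocessed_reviews_ldamodel": 0.2395,
    "filtered_reviews_ldamodel": 0.2737,
    "domainfiltered_reviews_ldamodel": 0.2476,
}

# Pick the model name whose coherence score is highest.
best_model_name = max(coherence_scores, key=coherence_scores.get)
print(best_model_name)  # filtered_reviews_ldamodel
```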

Using the top 20 topic words for each topic in the model with the highest coherence score, pick at least 5 topic numbers and determine what film genres (in an informal sense) they represent, i.e. think of a meaningful label for the topic. Write down the topic number and your topic label. Is it easy to guess what the topic represents? For how many topics are you fairly confident, for how many do you have to make a guess, and for how many do you have no real clue?

In [114]:
domainfiltered_reviews_ldamodel_highscore = models.ldamodel.LdaModel(domainfiltered_movie_bow_corpus, num_topics=5, id2word = domainfiltered_movie_dictionary, passes=25)

domainfiltered_reviews_ldamodel_highscore.show_topics(num_words=20) #Show the top 20 words for each topic

[(0,
  '0.002*"scream" + 0.002*"love" + 0.002*"people" + 0.002*"action" + 0.002*"man" + 0.002*"makes" + 0.002*"life" + 0.002*"made" + 0.002*"real" + 0.002*"funny" + 0.001*"acting" + 0.001*"big" + 0.001*"end" + 0.001*"audience" + 0.001*"high" + 0.001*"show" + 0.001*"young" + 0.001*"back" + 0.001*"horror" + 0.001*"original"'),
 (1,
  '0.004*"life" + 0.002*"star" + 0.002*"truman" + 0.002*"wars" + 0.002*"show" + 0.002*"people" + 0.002*"back" + 0.002*"action" + 0.002*"alien" + 0.001*"effects" + 0.001*"world" + 0.001*"end" + 0.001*"years" + 0.001*"young" + 0.001*"find" + 0.001*"lucas" + 0.001*"long" + 0.001*"ship" + 0.001*"work" + 0.001*"man"'),
 (2,
  '0.003*"life" + 0.002*"people" + 0.002*"world" + 0.002*"love" + 0.002*"back" + 0.002*"years" + 0.002*"audience" + 0.002*"man" + 0.002*"work" + 0.002*"big" + 0.002*"star" + 0.002*"made" + 0.002*"things" + 0.002*"end" + 0.002*"john" + 0.002*"action" + 0.001*"effects" + 0.001*"makes" + 0.001*"young" + 0.001*"find"'),
 (3,
  '0.003*"people" + 0.00

Although the coherence scores rank the filtered model highest, I will use the domainfiltered model because its topics make more sense and, in practice, are easier to label. Even with the domainfiltered model, it is hard to decide on labels for some topics.
<br>0- Scary/comedy
<br>1- Star Wars
<br>2- Space action
<br>3- Action
<br>4- Life

In [122]:
reviews_ldamodel.get_term_topics("the", minimum_probability = 1e-3)

[(0, 0.049062002), (3, 0.054476514), (4, 0.04544652)]

Do this for your own best model and the labels you just picked. For each of your topic labels, if the probability for the label is the highest for the topic number you wrote down, your guess was probably correct. Did you guess a suitable label for every topic?

In [131]:
domainfiltered_reviews_ldamodel.get_term_topics("comedy", minimum_probability = 1e-3)


[(0, 0.0016396681), (1, 0.0013902761), (2, 0.0016451329)]

In [138]:
domainfiltered_reviews_ldamodel.get_term_topics("star", minimum_probability = 1e-3)


[(2, 0.0013919857), (3, 0.0017244702)]

In [134]:
domainfiltered_reviews_ldamodel.get_term_topics("space", minimum_probability = 1e-3)

[]

In [135]:
domainfiltered_reviews_ldamodel.get_term_topics("action", minimum_probability = 1e-3)

[(1, 0.0012643057), (2, 0.0018023849), (3, 0.0030721389), (4, 0.0016725643)]

In [136]:
domainfiltered_reviews_ldamodel.get_term_topics("life", minimum_probability = 1e-3)

[(0, 0.0025653716),
 (1, 0.0017130619),
 (2, 0.0029064033),
 (3, 0.0023465625),
 (4, 0.0022303644)]

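<br>0- failed (for "comedy", topic 2 scores marginally higher than topic 0)
<br>1- failed (for "star", topic 3 scores highest and topic 1 is below the threshold)
<br>2- failed (for "space", no topic reaches the threshold)
<br>3- validates (for "action", topic 3 scores highest)
<br>4- failed (for "life", topic 2 scores highest)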

In my opinion, this validation method is effective, but I assume the statistical occurrence of a word in the documents also depends on the quality of the reviews.
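
The check described above (is the label word most probable in the topic it was assigned to?) can be sketched as a small helper. The function names best_topic_for_term and label_is_validated are hypothetical; the input lists are copied from the get_term_topics() outputs for "action" and "star" shown earlier.

```python
def best_topic_for_term(term_topics):
    """Return the topic id with the highest probability, or None if the list is empty."""
    if not term_topics:
        return None
    return max(term_topics, key=lambda pair: pair[1])[0]


def label_is_validated(term_topics, expected_topic):
    """True if the expected topic is the one where the term is most probable."""
    return best_topic_for_term(term_topics) == expected_topic


# Values copied from the get_term_topics() outputs above.
action_topics = [(1, 0.0012643057), (2, 0.0018023849), (3, 0.0030721389), (4, 0.0016725643)]
star_topics = [(2, 0.0013919857), (3, 0.0017244702)]

print(label_is_validated(action_topics, expected_topic=3))  # True
print(label_is_validated(star_topics, expected_topic=1))    # False
```

An empty list, as returned for "space", yields None from best_topic_for_term, so no label can be validated with that word at the chosen minimum_probability.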

In a real project, you would also want to validate your topics by examining the reviews that are most strongly associated with that topic. You can see what documents have what topics using the get_document_topics() method. Here we look at the topics for the first document in the model (change the name of the model to yours):

In [140]:
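# Note: movie_bow_corpus was built with a different dictionary than this model; domainfiltered_movie_bow_corpus[0] would be the matching input.
domainfiltered_reviews_ldamodel.get_document_topics(movie_bow_corpus[0], minimum_probability = 0)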

[(0, 0.0002559949),
 (1, 0.0002559384),
 (2, 0.99897516),
 (3, 0.00025665812),
 (4, 0.00025624284)]

Or for the first 20 of them:

In [142]:
for i, doc_topics in enumerate(domainfiltered_reviews_ldamodel.get_document_topics(domainfiltered_movie_bow_corpus)):
    if i >= 20:
        break
    print(f"Topics for the review {movie_reviewnames[i]}: {doc_topics}")

Topics for the review cv000_29590.txt: [(2, 0.9974779)]
Topics for the review cv001_18431.txt: [(2, 0.9970658)]
Topics for the review cv002_15918.txt: [(2, 0.9949002)]
Topics for the review cv003_11664.txt: [(3, 0.99819505)]
Topics for the review cv004_11636.txt: [(2, 0.9971732)]
Topics for the review cv005_29443.txt: [(4, 0.99808145)]
Topics for the review cv006_15448.txt: [(3, 0.9972568)]
Topics for the review cv007_4968.txt: [(2, 0.99708027)]
Topics for the review cv008_29435.txt: [(4, 0.9942103)]
Topics for the review cv009_29592.txt: [(1, 0.99637663)]
Topics for the review cv010_29198.txt: [(2, 0.99805504)]
Topics for the review cv011_12166.txt: [(1, 0.22473091), (2, 0.7729944)]
Topics for the review cv012_29576.txt: [(1, 0.9941265)]
Topics for the review cv013_10159.txt: [(1, 0.33858004), (4, 0.6573132)]
Topics for the review cv014_13924.txt: [(1, 0.34417218), (3, 0.6544162)]
Topics for the review cv015_29439.txt: [(2, 0.99653447)]
Topics for the review cv016_4659.txt: [(3, 0.994

But this assignment is already long enough so I will not ask you to report on this too!
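
For anyone who does want to examine the reviews most strongly associated with a topic, here is a rough sketch of how they could be ranked. The helper name top_documents_for_topic is hypothetical, and the input is a handful of rows copied from the output above rather than the full corpus.

```python
def top_documents_for_topic(doc_topics_by_name, topic_id, n=3):
    """Rank documents by the probability they assign to topic_id, highest first."""
    scored = []
    for name, doc_topics in doc_topics_by_name.items():
        # Topics missing from a document's list fell below the reporting threshold.
        prob = dict(doc_topics).get(topic_id, 0.0)
        scored.append((name, prob))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# A few rows copied from the get_document_topics() output above.
doc_topics_by_name = {
    "cv000_29590.txt": [(2, 0.9974779)],
    "cv009_29592.txt": [(1, 0.99637663)],
    "cv011_12166.txt": [(1, 0.22473091), (2, 0.7729944)],
    "cv013_10159.txt": [(1, 0.33858004), (4, 0.6573132)],
}

print(top_documents_for_topic(doc_topics_by_name, topic_id=1, n=2))
# [('cv009_29592.txt', 0.99637663), ('cv013_10159.txt', 0.33858004)]
```

Reading the top-ranked reviews for each topic is then a manual step: if they share an obvious genre or theme, the topic label is more likely to be meaningful.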