# Topic modelling of Wikipedia movie plots using LDA and Top2Vec

This project aims to show how topic modelling works in practice using Top2Vec and LDA.

The dataset contains plots of movies scraped from Wikipedia articles. It can be downloaded from https://www.kaggle.com/datasets/jrobischon/wikipedia-movie-plots.

# Importing packages

In [None]:
# General
import matplotlib.pyplot as plt
import pandas as pd
from wordcloud import WordCloud

# NLP
import re
import gensim
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.corpus import stopwords
from top2vec import Top2Vec

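stemmer = SnowballStemmer('english')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords') # stop word lists, needed by stopwords.words below
nltk.download('punkt') # tokenizer models for word_tokenize (newer NLTK releases may ask for 'punkt_tab')
pd.reset_option('^display.', silent=True)
stop_words = stopwords.words('english')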

# Exploratory data analysis

In [None]:
df = pd.read_csv(r'C:\Users\Erik Konstenius\Downloads\wiki_movie_plots_deduped.csv')

In [None]:
df.head()

In [None]:
print("Number of movies: " + str(len(df)))

In [None]:
print("Number of unique genres: " + str(df["Genre"].nunique()))

In [None]:
# As we can see, most movies belong to multiple genres

for i in df["Genre"].unique()[80:90]:
    print(i)

Let's look at a couple of plots

In [None]:
pd.set_option('display.max_colwidth', None)
df[["Title","Plot"]][0:5]

In [None]:
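# Split each plot on whitespace so that the average counts words rather than characters
print("Average length of movie plot: " + str(round(df['Plot'].str.split().str.len().mean())) + " words")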

# Pre-processing pipeline

In [None]:
def pre_process_pipeline(texts, model):
    # Clean every text; for model == "lda" also tokenize, lemmatize and stem
    stop_word_set = set(stop_words)
    lemmatizer = WordNetLemmatizer()

    reviews = []

    for review in texts:
        review = str(review).lower() # lower case first so that capitalised stop words are caught too
        review = ' '.join(word for word in review.split() if word not in stop_word_set) # remove stop words
        review = re.sub(r'[^a-z0-9\- ]+', '', review) # remove special characters
        reviews.append(re.sub(r'\b\w{1,2}\b', '', review)) # remove words of one or two characters

    if model == "lda":
        output = [word_tokenize(sentence) for sentence in reviews]
        temp = []
        for movie in output: # Lemmatize each word as a verb, then stem it
            A = [stemmer.stem(lemmatizer.lemmatize(word, pos='v')) for word in movie]
            temp.append(A)

        reviews = temp

    elif model == "top2vec":
        pass

    else:
        print("Model not recognized. Top2vec assumed")

    return reviews
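
To see what the cleaning steps do to a single sentence, here is a minimal sketch that uses only the standard library. `TOY_STOP_WORDS` and `clean_text` are hypothetical names introduced for illustration, and the tiny stop word set stands in for NLTK's English list. The sketch lower-cases before filtering so that capitalised stop words such as "The" are caught, and it collapses leftover whitespace at the end for readability.

```python
import re

# Hypothetical mini stop word list; the notebook uses NLTK's English list instead.
TOY_STOP_WORDS = {"the", "a", "of", "is", "in", "to", "and"}

def clean_text(text):
    """Lower-case, drop stop words, strip special characters and very short words."""
    text = text.lower()
    text = " ".join(w for w in text.split() if w not in TOY_STOP_WORDS)
    text = re.sub(r"[^a-z0-9\- ]+", "", text)  # keep letters, digits, hyphens and spaces
    text = re.sub(r"\b\w{1,2}\b", "", text)    # drop words of one or two characters
    return " ".join(text.split())              # collapse leftover whitespace

sample = "The bartender is working at a saloon, serving drinks to 2 customers!"
cleaned = clean_text(sample)
print(cleaned)  # bartender working saloon serving drinks customers
```

Note that "at" and "2" disappear in the short-word step rather than the stop-word step, since neither is in the toy stop word list.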

# Applying Top2Vec

Top2Vec is an algorithm for topic modeling and semantic search. It is an unsupervised machine learning technique that finds structure in text, which can be used to organize data, search for similar documents and possibly even serve as a simple recommender system.

How the algorithm works:

1. Create jointly embedded document and word vectors using Doc2Vec.

2. Apply dimensionality reduction (UMAP) to project the sparse, high-dimensional document vectors into a denser, lower-dimensional embedding space.

3. Cluster dense areas of documents using HDBSCAN.

4. For each dense area, calculate the centroid of its document vectors in the original dimension. This centroid is the topic vector.

5. Find n-closest word vectors to the resulting topic vector.
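
Steps 4 and 5 can be illustrated with a small sketch. It assumes the clustering has already happened and works on hypothetical 3-dimensional vectors (real embeddings have far more dimensions). The names `cosine`, `centroid`, `word_vectors` and `cluster_docs` are made up for this example and are not part of the Top2Vec API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical word embeddings living in the same space as the documents.
word_vectors = {
    "spaceship": [0.9, 0.1, 0.0],
    "planet":    [0.8, 0.2, 0.1],
    "wedding":   [0.0, 0.9, 0.2],
    "boxing":    [0.1, 0.0, 0.9],
}

# Document vectors that the clustering step is assumed to have grouped together.
cluster_docs = [[0.85, 0.15, 0.05], [0.95, 0.05, 0.0], [0.9, 0.2, 0.1]]

# Step 4: the topic vector is the centroid of the cluster's document vectors.
topic_vector = centroid(cluster_docs)

# Step 5: rank words by cosine similarity to the topic vector and keep the closest n.
ranked = sorted(word_vectors, key=lambda w: cosine(word_vectors[w], topic_vector), reverse=True)
top_words = ranked[:2]
print(top_words)  # ['spaceship', 'planet']
```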

In [None]:
corpus = pre_process_pipeline(df["Plot"], model = "top2vec")

In [None]:
# Check corpus after cleaning

corpus[100:103]

Creating the Top2Vec model

In [None]:
model = Top2Vec(corpus, speed = 'deep-learn',  workers=12)

# corpus: Input corpus, should be a list of strings.
# speed: The 'deep-learn' option will learn the best quality vectors but will take significant time to train.
# workers: The number of worker threads to be used in training the model.

# Warning: it can take multiple hours to create the model

In [None]:
# Get number of detected topics.

print("Number of topics: " + str(model.get_num_topics()))

In [None]:
# This will return the topics in decreasing size.

topic_words, word_scores, topic_nums = model.get_topics(model.get_num_topics())

# topic_words: For each topic the top 50 words are returned, in order of semantic similarity to topic.
# word_scores: For each topic the cosine similarity scores of the top 50 words to the topic are returned.
# topic_nums: The unique index of every topic will be returned.

for topic in topic_nums:
    model.generate_topic_wordcloud(topic)
    
# The results show that the technique has produced an impressive segmentation of the text data
# stored in the movie plots.

Next, I search for topics by a given keyword.

In [None]:
topic_words, word_scores, topic_scores, topic_nums = model.search_topics(keywords=["space", "planet"],num_topics=3)

# topic_words: For each topic the top 50 words are returned, in order of semantic similarity to topic.
# word_scores: For each topic the cosine similarity scores of the top 50 words to the topic are returned.
# topic_scores: For each topic the cosine similarity to the search keywords will be returned.
# topic_nums: The unique index of every topic will be returned.

for topic in topic_nums:
    model.generate_topic_wordcloud(topic, background_color="black")

In [None]:
topic_words, word_scores, topic_scores, topic_nums = model.search_topics(keywords=["career"],num_topics=3)

# topic_words: For each topic the top 50 words are returned, in order of semantic similarity to topic.
# word_scores: For each topic the cosine similarity scores of the top 50 words to the topic are returned.
# topic_scores: For each topic the cosine similarity to the search keywords will be returned.
# topic_nums: The unique index of every topic will be returned.

for topic in topic_nums:
    model.generate_topic_wordcloud(topic, background_color="black")
    
# It appears to suggest topics within music, general sports and boxing

Next, I pick a topic and find the documents that are most similar to it. For each of the returned documents we are going to print its content, score and document number.

In [None]:
documents, document_scores, document_ids = model.search_documents_by_topic(topic_num=21, num_docs=5)

# documents: The documents in a list, the most similar are first.
# document_scores: Semantic similarity of document to topic. The cosine similarity of the document and topic vector.
# document_ids: Unique ids of documents. If ids were not given, the index of document in the original corpus.
    
for doc, score, doc_id in zip(documents, document_scores, document_ids):
    print(f"Document: {doc_id}, Score: {score}")
    print(" ")
    print(doc[0:500])
    print(" ")
    
# topic_num 21 appears to be about the Nazis, Jews and the Second World War. The movies suggested
# fit the chosen topic

In [None]:
documents, document_scores, document_ids = model.search_documents_by_topic(topic_num=105, num_docs=5)

# documents: The documents in a list, the most similar are first.
# document_scores: Semantic similarity of document to topic. The cosine similarity of the document and topic vector.
# document_ids: Unique ids of documents. If ids were not given, the index of document in the original corpus.
    
for doc, score, doc_id in zip(documents, document_scores, document_ids):
    print(f"Document: {doc_id}, Score: {score}")
    print(" ")
    print(doc[0:500])
    print(" ")
    
# topic_num 105 appears to be about the Harry Potter movies or similar movies.

# Applying LDA with BoW

How the algorithm works:


1. Assess which words appear in each document. I will lemmatize, remove stop words and remove words that appear in very few documents or are very common. This is done using the bag-of-words technique, where each document is represented by the number of times each word appears in it.


2. Assess which words belong to each topic by looping over each word in each movie plot. This is done in the following steps:

    a) We loop through each word in each movie plot and randomly assign it a topic from a predefined set of topics.

    b) We then loop through each word in each plot again and compute two quantities: the proportion of words in the document that are currently assigned to each topic, and the proportion of assignments to each topic, over the entire corpus, that come from this word. If many words in a movie plot belong to a particular topic, it is more probable that the current word belongs to that topic too. The word is then reassigned to a topic based on the product of these two quantities, and the process is repeated many times.


The end result is a model that considers every word in each document and determines how strongly that word is associated with each topic. The more words in a document that are associated with a particular topic, the more likely the document is to be classified within that topic.
 

Worth noting:
1. We need to state in advance how many topics we want the model to find in the corpus.
2. Order of the words and the grammatical role of the words are not considered in the model.
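
The assign-and-reassign loop described above corresponds to a sampling scheme usually called collapsed Gibbs sampling. The toy version below is a sketch for intuition only: `toy_lda_gibbs` is a hypothetical helper, `alpha` and `beta` are assumed smoothing constants, and the documents are invented. As far as I know, gensim fits its LDA models with a different (variational) inference procedure, so this is not what the cells below execute.

```python
import random
from collections import Counter

def toy_lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; returns one word-count Counter per topic."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Step (a): give every word occurrence a random topic.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [Counter(zs) for zs in assignments]       # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]   # word counts per topic
    topic_total = [0] * num_topics                        # total words per topic
    for doc, zs in zip(docs, assignments):
        for w, z in zip(doc, zs):
            topic_word[z][w] += 1
            topic_total[z] += 1

    # Step (b): repeatedly resample the topic of every word occurrence.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Take the current assignment out of the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Weight of topic k: (how common k is in this document)
                # times (how common this word is within k), both smoothed.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return topic_word

# Invented, already pre-processed documents.
docs = [
    ["space", "planet", "alien", "space"],
    ["planet", "alien", "ship"],
    ["wedding", "love", "bride"],
    ["love", "bride", "wedding", "love"],
]
topics = toy_lda_gibbs(docs, num_topics=2)
for k, counts in enumerate(topics):
    print(k, [w for w, _ in counts.most_common(3)])
```

The number of topics is an explicit argument and the documents are plain lists of tokens, which mirrors the two points noted above: the topic count is fixed up front and word order plays no role.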

In [None]:
corpus = pre_process_pipeline(df["Plot"], model = "lda")

In [None]:
# The corpus is now also tokenized, lemmatized and stemmed
print(corpus[0])

In [None]:
# Build a dictionary that maps every unique token to an integer id
bow = gensim.corpora.Dictionary(corpus)

len(bow) # Number of unique tokens

In [None]:
# Removes tokens that appear in more than 50 % of the documents.
# filter_extremes also applies its defaults no_below=5 and keep_n=100000.

bow.filter_extremes(no_above=0.5)

In [None]:
len(bow) # The filtering has removed a bit more than half of all tokens

In [None]:
bow_corpus = [bow.doc2bow(movie) for movie in corpus]
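
To make the bag-of-words representation concrete, the following standard-library sketch mimics the two ideas used here: dropping tokens that occur in too many documents and turning each document into (token id, count) pairs. The toy documents and the names `doc_freq`, `token2id` and `to_bow` are assumptions for illustration. The sketch imitates only the `no_above` threshold, not everything gensim's dictionary does.

```python
from collections import Counter

# Invented, already tokenized documents.
docs = [
    ["detective", "murder", "city", "detective"],
    ["wedding", "city", "love"],
    ["murder", "city", "police"],
]

# Document frequency: in how many documents does each token appear?
doc_freq = Counter(token for doc in docs for token in set(doc))

# Drop tokens that occur in more than half of the documents (cf. no_above=0.5).
no_above = 0.5
kept = sorted(t for t, n in doc_freq.items() if n / len(docs) <= no_above)
token2id = {token: i for i, token in enumerate(kept)}

def to_bow(doc):
    """Represent a document as sorted (token id, count) pairs, ignoring dropped tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

bow_docs = [to_bow(doc) for doc in docs]
print(token2id)     # {'detective': 0, 'love': 1, 'police': 2, 'wedding': 3}
print(bow_docs[0])  # [(0, 2)]
```

"city" appears in every document and "murder" in two of three, so both exceed the 50 % threshold and are removed before counting.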

In [None]:
lda_model = gensim.models.LdaMulticore(corpus = bow_corpus, num_topics=10, id2word=bow, passes=10, workers=12)

# I restrict the model to only find 10 topics

In [None]:
for topic in range(lda_model.num_topics):
    plt.figure(figsize=(40,5))
    plt.imshow(WordCloud(width=800, height=400).fit_words(dict(lda_model.show_topic(topic, 300))))
    plt.axis("off")
    plt.title("Topic #" + str(topic))
    plt.show()

# Applying LDA with TF-IDF

In [None]:
corpus = pre_process_pipeline(df["Plot"], model = "lda")

In [None]:
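# Rebuild the dictionary and drop tokens that appear in more than 50 % of the documents
bow = gensim.corpora.Dictionary(corpus)
bow.filter_extremes(no_above=0.5)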

In [None]:
bow_corpus = [bow.doc2bow(movie) for movie in corpus]

In [None]:
#Create tf-idf model
from gensim import corpora, models

tfidf = models.TfidfModel(bow_corpus)

# Apply transformation to the entire corpus
corpus_tfidf = tfidf[bow_corpus]
lda_tfidf_model = gensim.models.LdaMulticore(corpus = corpus_tfidf, num_topics=10, id2word=bow, passes=10, workers=12)
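
TF-IDF replaces raw counts with weights that favour tokens that are frequent in one document but rare across the corpus. The sketch below assumes the common form tf × log2(N / df) followed by L2 normalisation, which I believe is close to gensim's default settings. Treat that as an assumption and check the `TfidfModel` documentation before relying on it. The documents and the helper `tfidf` are invented for this example.

```python
import math
from collections import Counter

# Invented, already tokenized documents.
docs = [
    ["detective", "murder", "detective"],
    ["wedding", "love", "murder"],
    ["space", "planet", "space"],
]

n_docs = len(docs)
doc_freq = Counter(token for doc in docs for token in set(doc))

def tfidf(doc):
    """Weight tokens by count * log2(N / document frequency), then L2-normalise."""
    counts = Counter(doc)
    weights = {t: c * math.log2(n_docs / doc_freq[t]) for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {t: v / norm for t, v in weights.items()} if norm else weights

weights = tfidf(docs[0])
print(weights)
```

In the first document "detective" receives a much larger weight than "murder", because it occurs twice and appears in no other document.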

In [None]:
for topic in range(lda_tfidf_model.num_topics):
    plt.figure(figsize=(40,5))
    plt.imshow(WordCloud(width=800, height=400).fit_words(dict(lda_tfidf_model.show_topic(topic, 300))))
    plt.axis("off")
    plt.title("Topic #" + str(topic))
    plt.show()

# Conclusion

Judging by the word clouds, Top2Vec produced more coherent topics than both LDA models. Yet, in some situations it may be beneficial to be able to specify the number of topics, for example when we know how many groups there are in the population. In this case, finding over 100 topics may not be very useful.
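
The comparison in this notebook rests on inspecting word clouds. One hedged way to put a number on topic quality is a coherence score, which rewards topics whose top words tend to occur in the same documents. The sketch below implements a simplified UMass-style score (averaged rather than summed) on invented data. `umass_coherence` is a hypothetical helper and was not used to produce any result in this notebook.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """UMass-style coherence: mean of log((co-document count + 1) / document count)."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        # Number of documents that contain all of the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    scores = []
    for earlier, later in combinations(top_words, 2):
        denom = doc_count(earlier)
        if denom:
            scores.append(math.log((doc_count(earlier, later) + 1) / denom))
    return sum(scores) / len(scores) if scores else float("nan")

# Invented, already tokenized documents.
docs = [
    ["space", "planet", "ship"],
    ["space", "planet"],
    ["wedding", "love"],
    ["wedding", "bride", "love"],
]
coherent = umass_coherence(["space", "planet"], docs)
mixed = umass_coherence(["space", "wedding"], docs)
print(coherent, mixed)
```

A topic whose top words co-occur ("space", "planet") scores higher than one that mixes unrelated words ("space", "wedding").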