# Topic Modeling for Customer Reviews
Once we have classified reviews as negative, the next step is to understand why they are negative. Ideally, we want to find patterns common to negative reviews that help product owners pinpoint the problem areas of their products.

## Background:
  ### What is topic modeling?
  "A topic model is a type of statistical model for discovering abstract 'topics' that occur in a collection of documents." [Wikipedia]
  ![lda](./../images/lda.png)
  ![lda](./../images/lda1.png)
  ![lda](./../images/lda2.png)
  ![lda](./../images/lda3.png)
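
The topic model used in the rest of this notebook is latent Dirichlet allocation (LDA). As a compact reference, its usual generative story can be written as below. The notation ($K$ topics, $D$ documents, hyperparameters $\alpha$ and $\beta$) is the conventional textbook notation and is an assumption here; none of these symbols are defined elsewhere in this notebook.

$$
\begin{aligned}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{aligned}
$$

Here $\phi_k$ is the word distribution of topic $k$, $\theta_d$ is the topic mixture of document $d$, $z_{d,n}$ is the topic assigned to the $n$-th word of document $d$, and $w_{d,n}$ is the word itself. Fitting LDA means estimating $\phi$ and $\theta$ from the observed words alone.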

 ### What is topic modeling used for?
 * Identifying the topics of a news article, e.g. sports, politics, or the economy.
 * Tagging customer support issues, e.g. billing, shipping, or account issues.
 * Recommending 'similar' articles to readers on magazine websites.

### References:
Parts of the code were adapted and modified from these references: <br>
https://www.youtube.com/watch?v=NYkbqzTlW3w&ab_channel=AliceZhao <br>
https://towardsdatascience.com/topic-modeling-with-nlp-on-amazon-reviews-an-application-of-latent-dirichlet-allocation-lda-ae42a4c8b369
<br><br>
Slides: <br>
https://www.youtube.com/watch?v=BuMu-bdoVrU&ab_channel=PyTexas<br>
https://www.youtube.com/watch?v=NYkbqzTlW3w&ab_channel=AliceZhao <br> <br>
Theory <br>
https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-latent-dirichlet-allocation-437c81220158

## Step 0: Notebook setup

In [34]:
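# Workaround for SSL certificate errors on macOS when downloading NLTK data.
# Note: this disables HTTPS certificate verification, so use it only in a trusted environment.
import ssl
ssl._create_default_https_context = ssl._create_unverified_context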

# General Packages 
import pandas as pd

# Preprocessing
from sklearn.feature_extraction import text 
from sklearn.feature_extraction.text import CountVectorizer

# NLP preprocessing
from nltk.corpus import stopwords
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger') 
import string
from nltk import word_tokenize, pos_tag

# LDA modules
from gensim import matutils, models
import scipy.sparse


[nltk_data] Downloading package punkt to /Users/rahul/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/rahul/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## Step 1: Importing and exploring the data
### Importing

In [35]:
# Assumes the data has already been downloaded by running the relevant code in 'NLP_for_Customer_Reviews.ipynb'

col_names = ["label", "title", "review"]

train_df = pd.read_csv('./amazon_review_polarity_csv/train.csv', names=col_names)
# Only the test split, loaded here as df, is used in the rest of this notebook
df = pd.read_csv('./amazon_review_polarity_csv/test.csv', names=col_names)


### Exploring

In [36]:
# For a more in-depth exploration, see 'NLP_for_Customer_Reviews.ipynb'
df.head()

| | label | title | review |
|---|---|---|---|
| 0 | 2 | Great CD | My lovely Pat has one of the GREAT voices of h... |
| 1 | 2 | One of the best game music soundtracks - for a... | Despite the fact that I have only played a sma... |
| 2 | 1 | Batteries died within a year ... | I bought this charger in Jul 2003 and it worke... |
| 3 | 2 | works fine, but Maha Energy is better | Check out Maha Energy's website. Their Powerex... |
| 4 | 2 | Great for the non-audiophile | Reviewed quite a bit of the combo players and ... |


In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400000 entries, 0 to 399999
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   label   400000 non-null  int64 
 1   title   399990 non-null  object
 2   review  400000 non-null  object
dtypes: int64(1), object(2)
memory usage: 9.2+ MB


In [38]:
# Sample title and review
print(df.iloc[1, 1] + '\n\n' + df.iloc[1, 2])

One of the best game music soundtracks - for a game I didn't really play

Despite the fact that I have only played a small portion of the game, the music I heard (plus the connection to Chrono Trigger which was great as well) led me to purchase the soundtrack, and it remains one of my favorite albums. There is an incredible mix of fun, epic, and emotional songs. Those sad and beautiful tracks I especially like, as there's not too many of those kinds of songs in my other video game soundtracks. I must admit that one of the songs (Life-A Distant Promise) has brought tears to my eyes on many occasions.My one complaint about this soundtrack is that they use guitar fretting effects in many of the songs, which I find distracting. But even if those weren't included I would still consider the collection worth it.


## Step 2: Data Preparation
### Subsetting relevant negative reviews

In [39]:
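# To save compute resources, we work with a subset of the negative reviews (label == 1).
# We take the first 10,001 of them; one row with a missing title is dropped below, leaving 10,000.
df1 = df[df['label'] == 1]
df1 = df1.iloc[0:10001, :].copy()  # copy to avoid pandas' SettingWithCopyWarning when adding a column later
df1.head(5)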

| | label | title | review |
|---|---|---|---|
| 2 | 1 | Batteries died within a year ... | I bought this charger in Jul 2003 and it worke... |
| 5 | 1 | DVD Player crapped out after one year | I also began having the incorrect disc problem... |
| 6 | 1 | Incorrect Disc | I love the style of this, but after a couple y... |
| 7 | 1 | DVD menu select problems | I cannot scroll through a DVD menu that is set... |
| 9 | 1 | Not an "ultimate guide" | Firstly,I enjoyed the format and tone of the b... |


In [40]:
# Combining the title and review into a single text column (the idea being that the title can help with identifying topics)
df1['Text'] = df1['title'] + ' ' + df1['review']
df1 = df1.drop(['title', 'review'], axis = 1)
df1.head()

| | label | Text |
|---|---|---|
| 2 | 1 | Batteries died within a year ... I bought this... |
| 5 | 1 | DVD Player crapped out after one year I also b... |
| 6 | 1 | Incorrect Disc I love the style of this, but a... |
| 7 | 1 | DVD menu select problems I cannot scroll throu... |
| 9 | 1 | Not an "ultimate guide" Firstly,I enjoyed the ... |


In [41]:
# Handling missing values
# A missing title makes the combined Text NaN, so we drop that row (1 row in this subset)
df1 = df1.dropna(axis = 0)

### Create the document term matrix

In [42]:
additional_stop_words = ['like', 'im', 'know', 'just', 'dont', 'thats', 'right', 'people',
                  'youre', 'got', 'gonna', 'time', 'think', 'yeah', 'said']
# Recent scikit-learn versions require stop_words to be a list rather than a frozenset
stop_words = list(text.ENGLISH_STOP_WORDS.union(additional_stop_words))

cv = CountVectorizer(stop_words = stop_words)
cv_data = cv.fit_transform(df1['Text'])

In [43]:
cv_data.shape 

(10000, 33319)
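
To make the shape above concrete: each of the 10,000 rows is a review and each of the 33,319 columns is a distinct term, with every cell holding a raw count. The sketch below builds the same kind of matrix by hand for three made-up reviews. The reviews and variable names are hypothetical, and the whitespace tokenizer is a simplification: `CountVectorizer` additionally lowercases, tokenizes with a regular expression, and removes stop words.

```python
from collections import Counter

# Hypothetical toy reviews, not taken from the dataset.
docs = [
    "battery died battery",
    "movie boring",
    "battery charger died",
]

# Vocabulary: every distinct token, sorted, mapped to a column index.
tokens_per_doc = [doc.split() for doc in docs]
all_terms = sorted({tok for toks in tokens_per_doc for tok in toks})
vocabulary = {term: idx for idx, term in enumerate(all_terms)}

# One row per document, one column per term, each cell a raw count.
dtm = []
for toks in tokens_per_doc:
    counts = Counter(toks)
    dtm.append([counts.get(term, 0) for term in vocabulary])

print(vocabulary)
print(dtm)
```

Most cells are zero even in this tiny example, which is why the real matrix is stored in sparse form by `fit_transform`.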

In [44]:
# Looking at our document-term matrix
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
data = pd.DataFrame(cv_data.toarray(), columns = cv.get_feature_names_out())
data.index = df1['Text'].index

In [45]:
data

Unnamed: 0,00,000,00000,007,00and2,01,010,02,029,03,...,zzzzzzzzzzzz,zzzzzzzzzzzzz,zzzzzzzzzzzzzzz,zzzzzzzzzzzzzzzz,zzzzzzzzzzzzzzzzz,zzzzzzzzzzzzzzzzzz,zzzzzzzzzzzzzzzzzzzzzz,zzzzzzzzzzzzzzzzzzzzzzzzz,ésta,único
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20446,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
20447,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
20448,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
20449,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Getting the term-document matrix

In [46]:
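# Required input: a term-document matrix (terms as rows, documents as columns),
# which is the transpose of the document-term matrix built above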
tdm = data.transpose()
tdm.head()

Unnamed: 0,2,5,6,7,9,11,12,14,15,20,...,20439,20440,20441,20442,20444,20446,20447,20448,20449,20451
00,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
000,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
00000,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
007,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
00and2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [47]:
# Converting the term-document matrix into a gensim corpus
# df -> sparse matrix -> gensim corpus
sparse = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse)

In [48]:
# Gensim also requires a dictionary of all the terms and their locations in the term-document matrix
id2word = dict((v, k) for k, v in cv.vocabulary_.items())

In [49]:
# Sample of the id2word
dict(list(id2word.items())[0: 5]) 

{3055: 'batteries',
 8441: 'died',
 33082: 'year',
 4023: 'bought',
 5304: 'charger'}
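
For intuition about what these two objects contain: a gensim corpus is a sequence of documents, each represented as a list of `(term_id, count)` pairs for the terms that actually occur in it, and `id2word` maps each `term_id` back to its term. The following pure-Python sketch mimics that representation on a made-up matrix. The vocabulary, counts, and variable names are hypothetical, and the code illustrates the data format only; it is not gensim's implementation.

```python
# Hypothetical toy vocabulary and document-term matrix (rows = documents).
vocabulary = {"battery": 0, "boring": 1, "charger": 2, "died": 3, "movie": 4}
dtm = [
    [2, 0, 0, 1, 0],
    [0, 1, 0, 0, 1],
]

# Each document becomes a list of (term_id, count) pairs for its non-zero cells.
bow_corpus = [
    [(term_id, count) for term_id, count in enumerate(row) if count > 0]
    for row in dtm
]

# Invert term -> id into id -> term, as done for id2word above.
id2word_toy = {idx: term for term, idx in vocabulary.items()}

# Translate the ids back into words to check the round trip.
readable = [[(id2word_toy[i], c) for i, c in doc] for doc in bow_corpus]
print(bow_corpus)
print(readable)
```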

## Step 3: Analysis and iteration

In [50]:
# We now have the corpus (term-document matrix) and id2word (dictionary of location: term).
# Two more parameters need to be specified: the number of topics and the number of passes over the corpus.
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=10)
lda.print_topics()

[(0,
  '0.025*"book" + 0.007*"read" + 0.006*"product" + 0.005*"don" + 0.005*"good" + 0.005*"work" + 0.005*"buy" + 0.005*"use" + 0.004*"money" + 0.004*"bought"'),
 (1,
  '0.016*"movie" + 0.008*"good" + 0.006*"don" + 0.006*"really" + 0.006*"film" + 0.006*"bad" + 0.005*"story" + 0.005*"cd" + 0.005*"dvd" + 0.005*"album"')]
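
Each topic above is printed as a weighted list of its most probable words, where the number in front of a word is that word's estimated probability within the topic. When comparing runs it can help to turn these strings into plain `(word, weight)` pairs. The helper below is a small sketch for that; `parse_topic` is a hypothetical name introduced here, and the sample string is the beginning of topic 0 from the output above.

```python
import re

def parse_topic(topic_string):
    """Turn '0.025*"book" + 0.007*"read"' into [('book', 0.025), ('read', 0.007)]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# First three terms of topic 0 from the 2-topic output above.
topic_0 = '0.025*"book" + 0.007*"read" + 0.006*"product"'
print(parse_topic(topic_0))
```

gensim can also return such pairs directly, for example through `lda.show_topic(0)`, so the string parsing is mainly useful when only the printed output is available.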

In [51]:
# LDA for num_topics = 3
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=10)
lda.print_topics()

[(0,
  '0.031*"book" + 0.018*"movie" + 0.011*"read" + 0.007*"good" + 0.006*"film" + 0.006*"story" + 0.006*"don" + 0.005*"really" + 0.005*"bad" + 0.004*"books"'),
 (1,
  '0.008*"product" + 0.007*"dvd" + 0.007*"buy" + 0.006*"bought" + 0.006*"work" + 0.006*"use" + 0.006*"don" + 0.006*"money" + 0.005*"quality" + 0.005*"good"'),
 (2,
  '0.014*"cd" + 0.013*"album" + 0.011*"music" + 0.008*"good" + 0.008*"songs" + 0.006*"don" + 0.005*"really" + 0.005*"sound" + 0.005*"bad" + 0.004*"song"')]

In [52]:
# LDA for num_topics = 4
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=10)
lda.print_topics()

[(0,
  '0.009*"product" + 0.007*"buy" + 0.007*"bought" + 0.007*"work" + 0.006*"use" + 0.006*"don" + 0.005*"quality" + 0.005*"money" + 0.005*"good" + 0.004*"amazon"'),
 (1,
  '0.013*"book" + 0.008*"album" + 0.008*"good" + 0.007*"cd" + 0.007*"don" + 0.007*"read" + 0.007*"really" + 0.006*"music" + 0.006*"story" + 0.005*"boring"'),
 (2,
  '0.042*"movie" + 0.015*"film" + 0.009*"bad" + 0.009*"good" + 0.009*"dvd" + 0.007*"watch" + 0.007*"movies" + 0.006*"don" + 0.006*"really" + 0.005*"money"'),
 (3,
  '0.050*"book" + 0.015*"read" + 0.005*"author" + 0.005*"books" + 0.005*"good" + 0.005*"reading" + 0.004*"edition" + 0.004*"don" + 0.003*"better" + 0.003*"written"')]
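
One way to judge how well the topics are separated is to check how many top words they share. The sketch below does this for the 4-topic output above, using the Jaccard index (shared words divided by all distinct words) as a simple, assumed measure of overlap. The word lists are copied from the output; the function and variable names are hypothetical.

```python
from itertools import combinations

# Top-10 words per topic, copied from the 4-topic output above.
topics = {
    0: ["product", "buy", "bought", "work", "use", "don", "quality", "money", "good", "amazon"],
    1: ["book", "album", "good", "cd", "don", "read", "really", "music", "story", "boring"],
    2: ["movie", "film", "bad", "good", "dvd", "watch", "movies", "don", "really", "money"],
    3: ["book", "read", "author", "books", "good", "reading", "edition", "don", "better", "written"],
}

def jaccard(a, b):
    """Share of distinct words that two word lists have in common."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

for i, j in combinations(topics, 2):
    shared = sorted(set(topics[i]) & set(topics[j]))
    print(f"topics {i} and {j}: Jaccard={jaccard(topics[i], topics[j]):.2f}, shared={shared}")

# Words that appear in every topic carry little topical signal.
in_all = set.intersection(*(set(words) for words in topics.values()))
print(sorted(in_all))
```

The words shared by all four topics are "don" and "good". "don" is most likely what remains of "don't" after tokenization (which would also explain why the custom stop word 'dont' does not remove it), and "good" is a generic adjective. Filtering by part of speech, as in the next attempt, is one way to reduce this kind of noise.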

## Attempt 2 - Nouns Only


In [53]:
# Function to pull out nouns from a string
def nouns(text):
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)]
    return ' '.join(all_nouns)
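
The test `pos[:2] == 'NN'` works because all noun tags in the Penn Treebank tag set, which NLTK's default tagger uses, start with `NN` (`NN`, `NNS`, `NNP`, `NNPS`). The sketch below applies the same prefix filter to a hand-tagged sentence so it runs without downloading any NLTK data. The sentence and its tags are hypothetical (a real tagger might label some words differently), and `keep_by_prefix` is a name introduced here for illustration.

```python
# Hypothetical output of pos_tag for a short review, using Penn Treebank tags.
tagged = [
    ("The", "DT"), ("batteries", "NNS"), ("died", "VBD"),
    ("quickly", "RB"), ("and", "CC"), ("Amazon", "NNP"),
    ("refused", "VBD"), ("a", "DT"), ("full", "JJ"), ("refund", "NN"),
]

def keep_by_prefix(tagged_tokens, prefixes):
    """Keep words whose POS tag starts with any of the given prefixes."""
    return [word for word, tag in tagged_tokens if tag.startswith(tuple(prefixes))]

nouns_only_toy = keep_by_prefix(tagged, ["NN"])
nouns_adj_toy = keep_by_prefix(tagged, ["NN", "JJ"])
print(nouns_only_toy)
print(nouns_adj_toy)
```

Passing a list of prefixes makes it easy to widen the filter, for example to adjectives (`JJ`), without writing a second function.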

In [54]:
nouns_only = pd.DataFrame(df1['Text'].apply(nouns))
nouns_only

| | Text |
|---|---|
| 2 | Batteries year charger Jul OK while design con... |
| 5 | DVD Player year disc problems VCR DVD side DVD... |
| 6 | Incorrect Disc style couple years DVD problems... |
| 7 | DVD menu select problems DVD menu triangle key... |
| 9 | guide format tone book author reader insider s... |
| ... | ... |
| 20446 | Garbage desk piece junk paint metal metal brac... |
| 20447 | Ann job attention way detail times interest pa... |
| 20448 | Review book stars books detail fiction books l... |
| 20449 | Duplicate product Please product product i Hug... |


In [55]:
# Create a new document-term matrix using only nouns


# Remove any extra stop words 
additional_stop_words = ['like', 'im', 'know', 'just', 'dont', 'thats', 'right', 'people',
                  'youre', 'got', 'gonna', 'time', 'think', 'yeah', 'said']
# Recent scikit-learn versions require stop_words to be a list rather than a frozenset
stop_words = list(text.ENGLISH_STOP_WORDS.union(additional_stop_words))

# Count vectorizer for only nouns
cvn = CountVectorizer(stop_words = stop_words)
cvn_data = cvn.fit_transform(nouns_only.Text)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
dtm_nouns = pd.DataFrame(cvn_data.toarray(), columns=cvn.get_feature_names_out())
dtm_nouns.index = nouns_only.index
dtm_nouns


Unnamed: 0,000,01,010,02,04,06,08,10,100,100a,...,zzzzzzzzzzz,zzzzzzzzzzzz,zzzzzzzzzzzzz,zzzzzzzzzzzzzzz,zzzzzzzzzzzzzzzz,zzzzzzzzzzzzzzzzz,zzzzzzzzzzzzzzzzzz,zzzzzzzzzzzzzzzzzzzzzz,ésta,único
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20446,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
20447,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
20448,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
20449,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [56]:
# Gensim corpus
corpus_n = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(dtm_nouns.transpose()))

# Create the vocabulary dictionary
id2word_n = dict((v, k) for k, v in cvn.vocabulary_.items())

In [57]:
# Starting with 2 topics
ldan = models.LdaModel(corpus=corpus_n, num_topics=2, id2word=id2word_n, passes=10)
ldan.print_topics()

[(0,
  '0.050*"book" + 0.029*"movie" + 0.010*"film" + 0.010*"story" + 0.007*"books" + 0.006*"characters" + 0.006*"plot" + 0.005*"way" + 0.005*"author" + 0.005*"money"'),
 (1,
  '0.011*"product" + 0.010*"cd" + 0.009*"dvd" + 0.009*"album" + 0.009*"money" + 0.008*"music" + 0.008*"quality" + 0.006*"songs" + 0.005*"amazon" + 0.004*"way"')]

In [58]:
# Trying 3 topics
ldan = models.LdaModel(corpus=corpus_n, num_topics=3, id2word=id2word_n, passes=10)
ldan.print_topics()

[(0,
  '0.074*"book" + 0.012*"story" + 0.011*"books" + 0.009*"characters" + 0.007*"author" + 0.006*"way" + 0.005*"plot" + 0.005*"character" + 0.005*"pages" + 0.005*"series"'),
 (1,
  '0.042*"movie" + 0.015*"film" + 0.013*"dvd" + 0.009*"money" + 0.007*"quality" + 0.007*"movies" + 0.006*"version" + 0.005*"product" + 0.005*"thing" + 0.005*"game"'),
 (2,
  '0.017*"cd" + 0.016*"album" + 0.012*"music" + 0.012*"product" + 0.010*"songs" + 0.007*"money" + 0.005*"song" + 0.005*"amazon" + 0.005*"item" + 0.005*"band"')]

In [59]:
# Trying 4 topics
ldan = models.LdaModel(corpus=corpus_n, num_topics=4, id2word=id2word_n, passes=10)
ldan.print_topics()

[(0,
  '0.023*"cd" + 0.021*"album" + 0.018*"music" + 0.013*"songs" + 0.007*"song" + 0.006*"band" + 0.006*"voice" + 0.005*"money" + 0.005*"fan" + 0.004*"way"'),
 (1,
  '0.025*"dvd" + 0.016*"version" + 0.015*"book" + 0.011*"quality" + 0.009*"edition" + 0.008*"movie" + 0.008*"amazon" + 0.007*"video" + 0.006*"money" + 0.006*"copy"'),
 (2,
  '0.021*"product" + 0.011*"money" + 0.007*"quality" + 0.006*"game" + 0.006*"item" + 0.006*"months" + 0.005*"phone" + 0.005*"thing" + 0.005*"problem" + 0.005*"amazon"'),
 (3,
  '0.067*"book" + 0.040*"movie" + 0.015*"story" + 0.014*"film" + 0.010*"books" + 0.009*"characters" + 0.009*"plot" + 0.007*"author" + 0.006*"way" + 0.006*"character"')]

## Attempt 3 - Nouns and Adjectives


In [60]:
# Function to pull nouns and adjectives

def nouns_adj(text):
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    nouns_adj = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)] 
    return ' '.join(nouns_adj)

In [61]:
nouns_and_adj = pd.DataFrame(df1['Text'].apply(nouns_adj))
nouns_and_adj

| | Text |
|---|---|
| 2 | Batteries year charger Jul OK while design nic... |
| 5 | DVD Player year incorrect disc problems VCR ht... |
| 6 | Incorrect Disc style couple years DVD problems... |
| 7 | DVD menu select problems DVD menu triangle key... |
| 9 | ultimate guide format tone book author reader ... |
| ... | ... |
| 20446 | Garbage desk piece junk silver paint metal met... |
| 20447 | Ann gbood job attention way much detail many t... |
| 20448 | Review book stars big books detail second fict... |
| 20449 | Duplicate product Please product product i ori... |


In [62]:
# Create the document-term matrix using only nouns and adjectives
# This time, let's remove common words that occur too frequently with max_df
# Ex- max_df of 0.80 means 'ignore terms that appear in more than 80% of the documents'

# Remove any extra stop words 
# (a comma was missing after 'the', which silently merged 'the' and 'youre' into 'theyoure')
additional_stop_words = ['like', 'im', 'know', 'just', 'dont', 'thats', 'right', 'people', 'the',
                  'youre', 'got', 'gonna', 'time', 'good','think', 'yeah', 'this', 'it', 'and','said']
# Recent scikit-learn versions require stop_words to be a list rather than a frozenset
stop_words = list(text.ENGLISH_STOP_WORDS.union(additional_stop_words))

# max_df removes "corpus-specific stop words": with max_df=.8, terms that appear in more than 80% of the documents are ignored
cv_na = CountVectorizer(max_df=.8, stop_words = stop_words)
data_cv_na = cv_na.fit_transform(nouns_and_adj['Text'])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
dtm_na = pd.DataFrame(data_cv_na.toarray(), columns=cv_na.get_feature_names_out())
dtm_na.index = nouns_and_adj.index
dtm_na

Unnamed: 0,000,01,010,02,04,05,06,07,08,089555464x,...,zzzzzzzzzzz,zzzzzzzzzzzz,zzzzzzzzzzzzz,zzzzzzzzzzzzzzz,zzzzzzzzzzzzzzzz,zzzzzzzzzzzzzzzzz,zzzzzzzzzzzzzzzzzz,zzzzzzzzzzzzzzzzzzzzzz,ésta,único
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20446,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
20447,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
20448,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
20449,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
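
The `max_df` setting used above filters on document frequency, i.e. the share of documents in which a term occurs at least once, not on how often the term occurs overall. The sketch below shows the rule on a made-up corpus. The documents are hypothetical, and the threshold of 0.5 is chosen only so that the toy example drops something; the notebook itself uses 0.8.

```python
from collections import Counter

# Hypothetical tokenized reviews.
docs = [
    ["book", "boring", "plot"],
    ["book", "pages", "missing"],
    ["book", "author", "boring"],
    ["book", "dvd", "scratched"],
    ["dvd", "menu", "broken"],
]
max_df = 0.5  # hypothetical threshold for this toy corpus

# Document frequency: in how many documents does each term occur at least once?
doc_freq = Counter(term for doc in docs for term in set(doc))

# Keep terms whose share of documents does not exceed max_df.
kept = sorted(t for t, df in doc_freq.items() if df / len(docs) <= max_df)
dropped = sorted(t for t, df in doc_freq.items() if df / len(docs) > max_df)
print(kept)
print(dropped)
```

Here "book" occurs in 4 of 5 documents (80%) and is dropped. In the real data, "book" still shows up in the topics below, so at 0.8 the threshold is probably removing very few terms once the standard stop words are gone; a lower value might be worth experimenting with.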


In [63]:
# Create the gensim corpus
corpus_na = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(dtm_na.transpose()))

# Create the vocabulary dictionary
id2word_na = dict((v, k) for k, v in cv_na.vocabulary_.items())

In [64]:
# Starting with 2 topics
lda_na = models.LdaModel(corpus=corpus_na, num_topics=2, id2word=id2word_na, passes=10)
lda_na.print_topics()

[(0,
  '0.025*"movie" + 0.009*"film" + 0.009*"bad" + 0.008*"cd" + 0.008*"dvd" + 0.008*"album" + 0.006*"music" + 0.006*"story" + 0.006*"great" + 0.005*"money"'),
 (1,
  '0.039*"book" + 0.008*"product" + 0.006*"money" + 0.006*"books" + 0.004*"great" + 0.004*"way" + 0.004*"quality" + 0.004*"author" + 0.004*"new" + 0.004*"amazon"')]

In [65]:
# Trying 3 topics
lda_na = models.LdaModel(corpus=corpus_na, num_topics=3, id2word=id2word_na, passes=10)
lda_na.print_topics()

[(0,
  '0.032*"movie" + 0.011*"film" + 0.010*"cd" + 0.010*"dvd" + 0.010*"bad" + 0.010*"album" + 0.008*"music" + 0.006*"great" + 0.006*"songs" + 0.006*"money"'),
 (1,
  '0.014*"product" + 0.008*"money" + 0.007*"quality" + 0.006*"amazon" + 0.005*"new" + 0.005*"great" + 0.005*"item" + 0.004*"problem" + 0.004*"months" + 0.004*"price"'),
 (2,
  '0.062*"book" + 0.009*"story" + 0.009*"books" + 0.007*"characters" + 0.006*"author" + 0.005*"way" + 0.004*"pages" + 0.004*"novel" + 0.004*"plot" + 0.004*"boring"')]

In [66]:
# Trying 4 topics, this time with 80 passes instead of 10
lda_na = models.LdaModel(corpus=corpus_na, num_topics=4, id2word=id2word_na, passes=80)
lda_na.print_topics()

[(0,
  '0.016*"product" + 0.008*"money" + 0.007*"quality" + 0.006*"amazon" + 0.005*"item" + 0.005*"great" + 0.005*"new" + 0.005*"months" + 0.005*"problem" + 0.004*"year"'),
 (1,
  '0.019*"dvd" + 0.018*"cd" + 0.016*"album" + 0.014*"music" + 0.010*"songs" + 0.008*"version" + 0.006*"great" + 0.006*"bad" + 0.006*"money" + 0.006*"song"'),
 (2,
  '0.057*"book" + 0.009*"game" + 0.006*"edition" + 0.006*"information" + 0.006*"books" + 0.004*"version" + 0.004*"author" + 0.004*"text" + 0.004*"pages" + 0.004*"better"'),
 (3,
  '0.032*"book" + 0.030*"movie" + 0.011*"story" + 0.011*"film" + 0.008*"bad" + 0.007*"characters" + 0.006*"plot" + 0.006*"books" + 0.005*"boring" + 0.005*"great"')]

### Final thoughts
There is no 'correct answer' in topic modeling. What works for one person or dataset may not work for another, and as the three attempts above show, the topics change with the preprocessing, the number of topics, and the number of passes. Getting them 'right' requires a lot of experimentation and computation. <br> <br>
LDA only scratches the surface of topic modeling. There are many other methods, such as: <br>
* Latent Semantic Analysis (LSA)
* Non-negative Matrix Factorization (NMF)
* Probabilistic Latent Semantic Analysis (PLSA)
* Correlated Topic Model (CTM)
* Pachinko Allocation Model (PAM)
<br> <br>