# Topic Modeling Using Unsupervised Learning Models
###              -- Final Project for LING 6300 Machine Learning & Linguistics

Large amounts of data are collected every day. As more information becomes available, it becomes harder to find what we are looking for, so we need tools and techniques to organize, search, and understand vast quantities of information. Topic modeling is one of these techniques.

Topic modeling is a type of algorithm that scans a set of documents (known in the NLP field as a corpus), examines how words and phrases co-occur in them, and automatically “learns” groups or clusters of words that best characterize those documents (Patrick van Kessel, 2018). These sets of words often appear to represent a coherent theme or topic.

In my final project, I use two unsupervised learning models, K-Means and Latent Dirichlet Allocation (LDA), to do topic modeling on a user reviews dataset.



# Part 0 Environment Setup

The packages and libraries I used for my project include: (1) NumPy; (2) pandas; (3) nltk; (4) gensim; (5) scikit-learn; (6) warnings; (7) matplotlib.

(1) Install Python 3: https://www.python.org/downloads/

(2) Install any packages that are not on your computer yet: pip3 install PACKAGE_NAME

(3) Put the data file reviews.tsv in the same folder as this Jupyter notebook file.

In [30]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import nltk
import gensim

import re
import os

from sklearn import decomposition
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /Users/lansang/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/lansang/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Dataset Description

The dataset I used is a set of Amazon user reviews about watches. It is an open dataset that can be downloaded at: https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Watches_v1_00.tsv.gz (I already put this data file as reviews.tsv in the same folder as this Jupyter notebook file.)

In [31]:
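# Read in the data file and skip malformed rows.
# Note: error_bad_lines was removed in pandas 2.0; on newer versions use on_bad_lines='skip'.
df = pd.read_csv('reviews.tsv', sep='\t', header=0, error_bad_lines=False)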

b'Skipping line 8704: expected 15 fields, saw 22\nSkipping line 16933: expected 15 fields, saw 22\nSkipping line 23726: expected 15 fields, saw 22\n'
b'Skipping line 85637: expected 15 fields, saw 22\n'
b'Skipping line 132136: expected 15 fields, saw 22\nSkipping line 158070: expected 15 fields, saw 22\nSkipping line 166007: expected 15 fields, saw 22\nSkipping line 171877: expected 15 fields, saw 22\nSkipping line 177756: expected 15 fields, saw 22\nSkipping line 181773: expected 15 fields, saw 22\nSkipping line 191085: expected 15 fields, saw 22\nSkipping line 196273: expected 15 fields, saw 22\nSkipping line 196331: expected 15 fields, saw 22\n'
b'Skipping line 197000: expected 15 fields, saw 22\nSkipping line 197011: expected 15 fields, saw 22\nSkipping line 197432: expected 15 fields, saw 22\nSkipping line 208016: expected 15 fields, saw 22\nSkipping line 214110: expected 15 fields, saw 22\nSkipping line 244328: expected 15 fields, saw 22\nSkipping line 248519: expected 15 fields,

In [32]:
df.head()  # Take a look at the dataset. The most important column for this project is review_body.

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,US,3653882,R3O9SGZBVQBV76,B00FALQ1ZC,937001370,"Invicta Women's 15150 ""Angel"" 18k Yellow Gold ...",Watches,5,0,0,N,Y,Five Stars,Absolutely love this watch! Get compliments al...,2015-08-31
1,US,14661224,RKH8BNC3L5DLF,B00D3RGO20,484010722,Kenneth Cole New York Women's KC4944 Automatic...,Watches,5,0,0,N,Y,I love thiswatch it keeps time wonderfully,I love this watch it keeps time wonderfully.,2015-08-31
2,US,27324930,R2HLE8WKZSU3NL,B00DKYC7TK,361166390,Ritche 22mm Black Stainless Steel Bracelet Wat...,Watches,2,1,1,N,Y,Two Stars,Scratches,2015-08-31
3,US,7211452,R31U3UH5AZ42LL,B000EQS1JW,958035625,Citizen Men's BM8180-03E Eco-Drive Stainless S...,Watches,5,0,0,N,Y,Five Stars,"It works well on me. However, I found cheaper ...",2015-08-31
4,US,12733322,R2SV659OUJ945Y,B00A6GFD7S,765328221,Orient ER27009B Men's Symphony Automatic Stain...,Watches,4,0,0,N,Y,"Beautiful face, but cheap sounding links",Beautiful watch face. The band looks nice all...,2015-08-31


In [33]:
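# Drop rows with a missing review. Calling dropna(inplace=True) on the column
# alone is not guaranteed to change df, so filter the DataFrame itself.
df = df.dropna(subset=['review_body'])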

In [34]:
# Use only the first ~3000 reviews (row labels 0 through 3000) so that training does not take too long.
data = df.loc[:3000, 'review_body']

# Part 1 Data Preprocessing

I used two main techniques in data preprocessing: tokenization and stemming. Tokenization takes a text or set of texts and breaks it up into its individual words (Tatman, 2017). Stemming reduces a word to its root form by stripping affixes. For example, given the sentence “The boy’s cars are different colors.”, a stemmer produces roughly: the boy car are differ color. I also use the stop-word list in the nltk package so that we can ignore words like a, an, the, etc.

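However, stemming has some disadvantages such as overstemming (words with different meanings reduced to the same stem) and understemming (related words left with different stems), so I also tried tokenization without stemming.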
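
To make overstemming and understemming concrete, here is a minimal sketch built on a toy suffix stripper. The function `toy_stem` and its suffix list are hypothetical and written only for illustration; this is not the Snowball algorithm used in this notebook, which applies far more careful rules.

```python
# Toy suffix stripper (hypothetical; NOT the Snowball stemmer used in this project).
SUFFIXES = ("ity", "ing", "es", "e", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Overstemming: two words with different meanings collapse to the same stem.
print(toy_stem("universe"), toy_stem("university"))

# Understemming: two forms of the same word keep different stems.
print(toy_stem("data"), toy_stem("datum"))
```

Comparing the results with and without stemming, as done below, is one way to check whether such errors matter for a given corpus.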

In [35]:
# Load the English stop-word list from the nltk package.
stopwords = nltk.corpus.stopwords.words('english')

print('We use ' + str(len(stopwords)) + ' stop words from nltk library')
print(stopwords[:10])

We use 179 stop words from nltk library
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


In [36]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')

# Tokenization with stemming
def tokenization_and_stemming(text):
    # Split into sentences, then words; drop stop words and lowercase the rest.
    # Note: the stop-word check runs before lowercasing, so capitalized stop
    # words such as "I" are kept (see the 'i' in the example output below).
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent) if word not in stopwords]

    # Keep only tokens that contain at least one letter (drops numbers and punctuation).
    filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]

    return [stemmer.stem(t) for t in filtered_tokens]

# Tokenization without stemming
def tokenization(text):
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent) if word not in stopwords]
    # Keep only tokens that contain at least one letter.
    return [token for token in tokens if re.search('[a-zA-Z]', token)]

In [37]:
# Check one example after being stemmed
tokenization_and_stemming(data[0]) 

['absolut',
 'love',
 'watch',
 'get',
 'compliment',
 'almost',
 'everi',
 'time',
 'i',
 'wear',
 'dainti']

In [38]:
# 1. I did tokenization and stemming for all the documents
# 2. I also tried to just do tokenization for all the documents without stemming
docs_stemmed = []
docs_tokenized = []
for i in data:
    tokenized_and_stemmed_results = tokenization_and_stemming(i)
    docs_stemmed.extend(tokenized_and_stemmed_results)
    
    tokenized_results = tokenization(i)
    docs_tokenized.extend(tokenized_results)

In [39]:
# Create a mapping from stemmed words to original tokenized words for result interpretation.
vocab_frame_dict = {docs_stemmed[x]:docs_tokenized[x] for x in range(len(docs_stemmed))}

# Part 2 TF-IDF

In this part, I used TF-IDF to determine the importance of words in the documents. TF-IDF stands for term frequency-inverse document frequency. It is the product of two statistics, term frequency and inverse document frequency, and it can be used as a weighting factor that reflects how important a word is to a document in a corpus. Term frequency is the count of word A in document B, so the weight of a term in a document grows in proportion to how often it occurs there.

The document frequency is the number of documents in which word A appears, so in its simplest form inverse document frequency = 1 / (number of documents where word A appears). In practice the ratio is usually log-scaled, for example log(N / document frequency) for a corpus of N documents. It diminishes the weight of terms that occur in many documents and increases the weight of terms that occur in few.
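
As a worked example, the sketch below computes TF-IDF by hand for a three-document toy corpus. It assumes the smoothed IDF that scikit-learn documents as the `TfidfVectorizer` default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector. The corpus and the helper names `idf` and `tfidf_vector` are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized.
docs = [
    ["love", "watch", "watch"],
    ["watch", "band", "broke"],
    ["love", "band"],
]

n_docs = len(docs)
# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed IDF (assumed to match scikit-learn's smooth_idf=True default).
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(doc):
    """Return {term: weight} for one document, L2-normalized."""
    counts = Counter(doc)
    raw = {term: tf * idf(term) for term, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

for doc in docs:
    print({term: round(w, 3) for term, w in tfidf_vector(doc).items()})
```

In the second document, "broke" appears in only one document of the corpus, so it receives a larger weight than "watch" even though both occur once in that document.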

In [40]:
#Use the TfidfVectorizer from scikit-learn package and define vectorizer parameters
# max_df : maximum document frequency for the given word
# min_df : minimum document frequency for the given word
# max_features: maximum number of words
# use_idf: if not true, we only calculate tf
tfidf_model = TfidfVectorizer(max_df = 0.99, max_features = 1000,
                                 min_df = 0.01, stop_words = 'english',
                                 use_idf = True, tokenizer = tokenization_and_stemming, ngram_range = (1,1))

In [41]:
tfidf_matrix = tfidf_model.fit_transform(data)

print('In total, there are ' + str(tfidf_matrix.shape[0]) + \
    ' reviews and ' + str(tfidf_matrix.shape[1]) + ' terms.')

In total, there are 2998 reviews and 254 terms.


In [42]:
# Check parameters
tfidf_model.get_params()

{'analyzer': 'word',
 'binary': False,
 'decode_error': 'strict',
 'dtype': numpy.int64,
 'encoding': 'utf-8',
 'input': 'content',
 'lowercase': True,
 'max_df': 0.99,
 'max_features': 1000,
 'min_df': 0.01,
 'ngram_range': (1, 1),
 'norm': 'l2',
 'preprocessor': None,
 'smooth_idf': True,
 'stop_words': 'english',
 'strip_accents': None,
 'sublinear_tf': False,
 'token_pattern': '(?u)\\b\\w\\w+\\b',
 'tokenizer': <function __main__.tokenization_and_stemming(text)>,
 'use_idf': True,
 'vocabulary': None}

In [43]:
tf_words = tfidf_model.get_feature_names()  # removed in scikit-learn 1.2; use get_feature_names_out() on newer versions

In [44]:
tf_words

["'m",
 "'s",
 'abl',
 'absolut',
 'accur',
 'actual',
 'adjust',
 'alarm',
 'alreadi',
 'alway',
 'amaz',
 'amazon',
 'anoth',
 'appear',
 'arm',
 'arriv',
 'attract',
 'automat',
 'awesom',
 'bad',
 'band',
 'batteri',
 'beauti',
 'best',
 'better',
 'bezel',
 'big',
 'bit',
 'black',
 'blue',
 'bought',
 'box',
 'br',
 'bracelet',
 'brand',
 'break',
 'broke',
 'broken',
 'button',
 'buy',
 'ca',
 'came',
 'case',
 'casio',
 'chang',
 'cheap',
 'check',
 'clasp',
 'classi',
 'clear',
 'clock',
 'collect',
 'color',
 'come',
 'comfort',
 'compliment',
 'cool',
 'cost',
 'coupl',
 'crown',
 'crystal',
 'cute',
 'dark',
 'date',
 'daughter',
 'day',
 'deal',
 'definit',
 'deliveri',
 'design',
 'dial',
 'differ',
 'difficult',
 'digit',
 'disappoint',
 'display',
 'dress',
 'durabl',
 'easi',
 'easili',
 'eleg',
 'end',
 'everi',
 'everyday',
 'everyth',
 'exact',
 'excel',
 'expect',
 'expens',
 'face',
 'far',
 'fast',
 'featur',
 'feel',
 'fell',
 'figur',
 'fine',
 'finish',
 'fit'

# Part 2.1 Calculate document similarity

Here I calculate the pairwise document similarity. This is not required for topic modeling; I did it only because the tfidf_matrix was already available.
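
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand on made-up weight vectors (`doc_a`, `doc_b`, `doc_c` are hypothetical, not rows of the real matrix). Because the vectorizer above uses `norm='l2'`, each TF-IDF row already has length 1, so for those rows the cosine reduces to a plain dot product.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical weight vectors over a four-term vocabulary.
doc_a = [0.5, 0.5, 0.0, 0.0]
doc_b = [0.25, 0.25, 0.0, 0.0]  # same direction as doc_a
doc_c = [0.0, 0.0, 1.0, 0.0]    # shares no terms with doc_a

print(cosine(doc_a, doc_b))  # close to 1: same mix of terms
print(cosine(doc_a, doc_c))  # 0: no terms in common
```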

In [45]:
#Try to calculate the document similarity. 
from sklearn.metrics.pairwise import cosine_similarity
cos_matrix = cosine_similarity(tfidf_matrix)
print(cos_matrix)

[[1.         0.42208339 0.         ... 0.11382006 0.04257457 0.        ]
 [0.42208339 1.         0.         ... 0.26966248 0.10086768 0.        ]
 [0.         0.         1.         ... 0.         0.         0.        ]
 ...
 [0.11382006 0.26966248 0.         ... 1.         0.10329327 0.        ]
 [0.04257457 0.10086768 0.         ... 0.10329327 1.         0.        ]
 [0.         0.         0.         ... 0.         0.         1.        ]]


# Part 3 K-Means Clustering

The first model I used is K-means clustering (MacQueen, 1967). K-means clustering is an unsupervised learning model that aims to partition n observations into k clusters, with each observation belonging to the cluster with the nearest mean. I tried several values of k and settled on 5 clusters. I used the KMeans implementation from the scikit-learn library to train the model.
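
The "nearest mean" idea can be sketched with plain Lloyd's algorithm, which alternates an assignment step and an update step. This is a minimal illustration on made-up 2-D points, not the scikit-learn implementation: the function `kmeans` below is hypothetical and uses simple random initialization, whereas scikit-learn's `KMeans` defaults to k-means++ initialization.

```python
import random

def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm on a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k of the data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        labels = [
            min(range(k), key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Two hypothetical, well-separated groups of 2-D points.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary: a different seed can swap which group is called 0 and which is called 1, which is also why cluster numbers in this notebook can change between runs.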

In [46]:
clusters_num = 5
km = KMeans(n_clusters = clusters_num)
km.fit(tfidf_matrix)

clusters = km.labels_.tolist()

In [47]:
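# Create a DataFrame that pairs each review's product title with its cluster, and check 10 results.
# Index by data.index so the titles stay aligned with the reviews that were clustered.
product = {'review': df.loc[data.index, 'product_title'], 'cluster': clusters}
frame = pd.DataFrame(product, columns = ['review', 'cluster'])
frame.head(10)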

Unnamed: 0,review,cluster
0,"Invicta Women's 15150 ""Angel"" 18k Yellow Gold ...",3
1,Kenneth Cole New York Women's KC4944 Automatic...,3
2,Ritche 22mm Black Stainless Steel Bracelet Wat...,2
3,Citizen Men's BM8180-03E Eco-Drive Stainless S...,2
4,Orient ER27009B Men's Symphony Automatic Stain...,2
5,Casio Men's GW-9400BJ-1JF G-Shock Master of G ...,2
6,Fossil Women's ES3851 Urban Traveler Multifunc...,4
7,INFANTRY Mens Night Vision Analog Quartz Wrist...,2
8,G-Shock Men's Grey Sport Watch,4
9,Heiden Quad Watch Winder in Black Leather,2


In [48]:
# Number of reviews in each cluster.
# The clusters are not balanced: the largest one (cluster 2 in this run) holds more reviews than the other four combined.
print('Number of reviews included in each cluster:')
frame['cluster'].value_counts().to_frame()

Number of reviews included in each cluster:


Unnamed: 0,cluster
2,2083
4,296
3,263
0,179
1,177


In [49]:
# Print the top 6 keywords and the product titles of the reviews in each cluster
print('<Document clustering result by K-Means>')

order_centroids = km.cluster_centers_.argsort()[:, ::-1]

cluster_keywords_summary = {}
for i in range(clusters_num):
    print('Cluster' + str(i) + ' word: ', end = ' ')
    cluster_keywords_summary[i] = []
    for n in order_centroids[i, :6]:
        cluster_keywords_summary[i].append(vocab_frame_dict[tf_words[n]])
        print(vocab_frame_dict[tf_words[n]] + ',', end = ' ')
    print()
    
    cluster_reviews = frame[frame.cluster==i].review.tolist()
    print('Cluster' + str(i) + 'reviews (' +str(len(cluster_reviews)) + 'reviews): ')
    print(', '.join(cluster_reviews))
    print()

<Document clustering result by K-Means>
Cluster0 word:  good, watches, quality, looks, product, price, 
Cluster0reviews (179reviews): 
Luminox Men's 3081 Evo Navy SEAL Chronograph Watch, Voguestrap TX046801XL Allstrap 16-20mm Brown Extra-Long-Length Fits Fast-Wrap Expedition Watchband, XOXO Women's XO110 Silver Dial Gold-tone Bracelet Watch, Akribos XXIV Men's AK787YGBU Quartz Movement Watch with Blue Dial and Yellow Gold Stainless Steel Bracelet, AMPM24 Men's Hand-winding Mechanical Watch Black Leather Watchband Skeleton PMW069, Gotham Men's Silver-Tone Ultra Thin Railroad Open Face Quartz Pocket Watch # GWC15022S, Casio Unisex MRW200H-2BV Neo-Display Black Watch with Resin Band, Seiko Wall Clock Silver-Tone Metallic Case Luminous  Numerals, Casio Men's STB-1000-1CF OmniSync Sports Gear Bluetooth Fitness Smartwatch, Casio Men's PRG-270B-1CR PRO TREK Aviator Black Watch, Casio Men's Dive Style Watch, DASSARI Carrera Distressed Leather GT Rally Racing Watch Strap, CYMA 18mm Black Alliga

# Part 4 Latent Dirichlet Allocation (LDA)

The second model I used is Latent Dirichlet Allocation (LDA) (Blei, Ng, Jordan, 2003). LDA is a generative statistical model. In LDA, each document is assumed to be a mixture of a small number of topics, and each topic is a probability distribution over words. The goal of LDA is to infer these hidden topics and each document's topic mixture so that the words in each document are mostly explained by them.

I used the LatentDirichletAllocation model in the scikit-learn package.
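
To illustrate what "generative" means here, the sketch below samples one document from the LDA story: draw a topic mixture from a Dirichlet distribution, then for every word pick a topic and then a word from that topic. The two topics, their word probabilities, and the `alpha` values are invented for illustration; fitting LDA runs this story in reverse and estimates the topics and mixtures from observed word counts.

```python
import random

rng = random.Random(0)

def sample_dirichlet(alpha):
    """Draw from a Dirichlet distribution by normalizing Gamma samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# Hypothetical topics: each one is a probability distribution over words.
topics = {
    "band":  {"band": 0.5, "link": 0.3, "clasp": 0.2},
    "price": {"price": 0.5, "cheap": 0.3, "deal": 0.2},
}
topic_names = list(topics)

def generate_document(n_words, alpha=(0.5, 0.5)):
    # 1. Draw the document's topic mixture theta ~ Dirichlet(alpha).
    theta = sample_dirichlet(alpha)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word according to theta.
        topic = rng.choices(topic_names, weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        vocab = list(topics[topic])
        weights = list(topics[topic].values())
        words.append(rng.choices(vocab, weights=weights)[0])
    return theta, words

theta, words = generate_document(n_words=8)
print([round(t, 2) for t in theta])
print(words)
```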

In [50]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components = 5, learning_method = 'online')

In [51]:
from sklearn.feature_extraction.text import CountVectorizer

# Build raw term counts here instead of TF-IDF weights, because LDA models word counts (integer values).
tfidf_model_lda = CountVectorizer(max_df = 0.99, max_features = 1000,
                            min_df = 0.01, stop_words = 'english',
                            tokenizer = tokenization_and_stemming, ngram_range = (1,1))

tfidf_matrix_lda = tfidf_model_lda.fit_transform(data)

print('In total, there are ' + str(tfidf_matrix_lda.shape[0]) + \
     ' reviews and ' + str(tfidf_matrix_lda.shape[1]) + ' terms.')

In total, there are 2998 reviews and 254 terms.


In [52]:
# Document-topic matrix: one row per review, one column per topic. Each review is assigned to the topic with the highest score.
lda_output = lda.fit_transform(tfidf_matrix_lda)
print(lda_output.shape)
print(lda_output)

(2998, 5)
[[0.57532313 0.02541305 0.0252892  0.3485363  0.02543832]
 [0.0515602  0.05128072 0.050988   0.42665331 0.41951777]
 [0.10310746 0.10063256 0.59625898 0.10000059 0.10000041]
 ...
 [0.37252261 0.03017967 0.02875373 0.53858721 0.02995678]
 [0.35692585 0.19473997 0.43199315 0.00809114 0.00824989]
 [0.02890742 0.29396788 0.02865587 0.02893855 0.61953027]]


In [53]:
# Topic-word matrix: one row per topic, one column per term.
topic_word = lda.components_
print(topic_word.shape)
print(topic_word)

(5, 254)
[[2.67862852e+01 6.72555480e+01 1.18030048e+01 ... 4.78508189e+01
  7.09875333e+00 1.61578831e+02]
 [8.88485400e+01 1.12394452e+02 2.07434786e-01 ... 1.70418548e+01
  3.88118187e+01 3.02901604e-01]
 [1.33998191e+00 9.26421513e+01 4.55690490e+01 ... 2.73071910e-01
  6.15428648e+01 5.70111345e+00]
 [3.11400300e-01 1.51714903e+02 2.07386942e-01 ... 2.04374505e-01
  1.57469246e+02 2.05643910e-01]
 [5.06509608e+01 2.50506598e+02 5.23825283e+00 ... 2.06408097e-01
  3.22306022e+00 3.93780772e+00]]


In [54]:
#Check 10 examples and see how they are classified
topic_names = ['topic' + str(i) for i in range(lda.n_components)]

doc_names = ['Doc' + str(i) for i in range(len(data))]

df_doc_topic = pd.DataFrame(np.round(lda_output, 2), columns = topic_names, index = doc_names)

topic = np.argmax(df_doc_topic.values, axis = 1)
df_doc_topic['topic'] = topic

df_doc_topic.head(10)

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic
Doc0,0.58,0.03,0.03,0.35,0.03,0
Doc1,0.05,0.05,0.05,0.43,0.42,3
Doc2,0.1,0.1,0.6,0.1,0.1,2
Doc3,0.88,0.03,0.03,0.03,0.03,0
Doc4,0.07,0.19,0.33,0.1,0.3,2
Doc5,0.03,0.03,0.03,0.18,0.73,4
Doc6,0.22,0.57,0.03,0.15,0.03,1
Doc7,0.03,0.39,0.03,0.03,0.53,4
Doc8,0.59,0.32,0.01,0.07,0.01,0
Doc9,0.02,0.25,0.37,0.02,0.34,2


In [55]:
#The result of LDA model. The result of LDA is more balanced than that of K-Means.
df_doc_topic['topic'].value_counts().to_frame()

Unnamed: 0,topic
4,734
1,692
0,653
3,647
2,272


In [56]:
# Print the topic-term matrix: five topics and 254 terms, where each term is assigned a weight in every topic.
print(lda.components_)
# topic-term matrix
df_topic_words = pd.DataFrame(lda.components_)

# column and index
df_topic_words.columns = tfidf_model_lda.get_feature_names()  # get_feature_names_out() on scikit-learn >= 1.0
df_topic_words.index = topic_names

df_topic_words.head()

[[2.67862852e+01 6.72555480e+01 1.18030048e+01 ... 4.78508189e+01
  7.09875333e+00 1.61578831e+02]
 [8.88485400e+01 1.12394452e+02 2.07434786e-01 ... 1.70418548e+01
  3.88118187e+01 3.02901604e-01]
 [1.33998191e+00 9.26421513e+01 4.55690490e+01 ... 2.73071910e-01
  6.15428648e+01 5.70111345e+00]
 [3.11400300e-01 1.51714903e+02 2.07386942e-01 ... 2.04374505e-01
  1.57469246e+02 2.05643910e-01]
 [5.06509608e+01 2.50506598e+02 5.23825283e+00 ... 2.06408097e-01
  3.22306022e+00 3.93780772e+00]]


Unnamed: 0,'m,'s,abl,absolut,accur,actual,adjust,alarm,alreadi,alway,...,weight,went,white,wife,wish,work,worn,worth,wrist,year
topic0,26.786285,67.255548,11.803005,0.204514,40.633106,9.672277,5.126643,75.179255,0.214948,0.205408,...,17.217024,31.395853,32.391252,51.391098,4.76687,288.591649,26.779261,47.850819,7.098753,161.578831
topic1,88.84854,112.394452,0.207435,0.203379,0.203366,42.193382,4.686924,0.204028,35.79478,16.795059,...,0.211113,0.203595,0.205318,0.202722,15.185887,63.110421,0.205802,17.041855,38.811819,0.302902
topic2,1.339982,92.642151,45.569049,5.018191,6.775435,7.157278,29.285723,0.202628,0.205021,1.943788,...,29.926114,0.204616,0.205484,0.202419,6.108349,0.784195,7.34135,0.273072,61.542865,5.701113
topic3,0.3114,151.714903,0.207387,40.268551,0.202518,0.217836,26.679168,0.207334,0.204106,0.204431,...,0.210806,0.201933,0.204099,0.213153,10.466433,0.209171,0.205313,0.204375,157.469246,0.205644
topic4,50.650961,250.506598,5.238253,0.397511,0.255457,0.255945,21.27808,4.332582,0.206222,32.655048,...,0.206769,4.858831,0.230998,0.20406,23.53106,90.746925,0.207232,0.206408,3.22306,3.937808


In [57]:
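# Collect the top n keywords for each topic
def print_topic_words(tfidf_model, lda_model, n_words):
    """Return, for each topic, the n_words terms with the largest weights."""
    words = np.array(tfidf_model.get_feature_names())  # get_feature_names_out() on scikit-learn >= 1.0
    topic_words = []
    for topic_words_weights in lda_model.components_:
        # Sort term indices by descending weight and keep the first n_words.
        top_words = topic_words_weights.argsort()[::-1][:n_words]
        topic_words.append(words.take(top_words))
    return topic_words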

topic_keywords = print_topic_words(tfidf_model = tfidf_model_lda, lda_model = lda, n_words = 10) 

df_topic_words = pd.DataFrame(topic_keywords)
df_topic_words.columns = ['Word ' + str(i) for i in range(df_topic_words.shape[1])]
df_topic_words.index = ['Topic ' + str(i) for i in range(df_topic_words.shape[0])]
df_topic_words

Unnamed: 0,Word 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9
Topic 0,watch,work,year,got,time,batteri,light,excel,n't,replac
Topic 1,watch,great,look,like,n't,price,face,time,littl,cheap
Topic 2,br,watch,hand,band,use,time,link,second,'s,n't
Topic 3,watch,love,nice,perfect,small,fit,wrist,'s,band,easi
Topic 4,watch,good,look,'s,time,qualiti,n't,set,need,beauti


# Part 5 Conclusion

In this project, I chose two unsupervised learning models, K-Means and LDA, to do topic modeling on a dataset of watch user reviews. The LDA result is much more balanced than the K-Means result: the largest LDA topic holds 734 of the 2998 reviews, while the largest K-Means cluster holds 2083. By this measure, LDA performed better than K-Means.
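
One way to put a number on "balanced" is the normalized entropy of the group sizes, which equals 1 when all groups have the same size and approaches 0 when one group holds almost everything. The sketch below applies it to the sizes printed by `value_counts()` in Parts 3 and 4. The helper `normalized_entropy` is my own illustration and is not part of the original analysis; balance is also only one possible criterion and says nothing by itself about whether the topics are meaningful.

```python
import math

# Cluster / topic sizes copied from the value_counts() outputs in Parts 3 and 4.
kmeans_sizes = [2083, 296, 263, 179, 177]
lda_sizes = [734, 692, 653, 647, 272]

def normalized_entropy(sizes):
    """Shannon entropy of the size distribution, scaled to the range [0, 1]."""
    total = sum(sizes)
    probs = [s / total for s in sizes if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(sizes))

print(round(normalized_entropy(kmeans_sizes), 3))
print(round(normalized_entropy(lda_sizes), 3))
```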

In addition, if we look at the top 10 words of each topic in the table printed above, we can see that the classification makes sense. The main topic is watch, of course, but each topic contains different details other than watch: appearance, watch band, price, quality and watch as a gift. Note that the order of these five detailed topics may be different every time you run the code and retrain the model, because no random seed is fixed, but each topic should still correspond to one of these five themes.

The differences between the topics are not that obvious because all the reviews in the dataset are about watches. The differences would be larger if the dataset contained documents on more varied topics.

# Part 6 Future Work

In this project, I used a dataset consisting entirely of watch user reviews. This limits the results, because all the reviews share one main topic: watches. In the future, I want to apply LDA topic modeling to a dataset of Wikipedia titles or of academic paper titles and abstracts.

# References

[1] Tatman, R., Data Science 101, 2017.

[2] Blei, D., Ng, A., Jordan, M., Latent Dirichlet Allocation, 2003.

[3] MacQueen, J., Some Methods for Classification and Analysis of Multivariate Observations, 1967.