# Document Clustering and Topic Modeling

## Contents

* [Part 1: Load Data](#Part-1:-Load-Data)
* [Part 2: Tokenizing and Stemming](#Part-2:-Tokenizing-and-Stemming)
* [Part 3: TF-IDF](#Part-3:-TF-IDF)
* [Part 4: K-means clustering](#Part-4:-K-means-clustering)
* [Part 5: Topic Modeling - Latent Dirichlet Allocation](#Part-5:-Topic-Modeling---Latent-Dirichlet-Allocation)


# Part 1: Load Data

In [11]:
import numpy as np
import pandas as pd
import nltk

import gensim
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import WordNetLemmatizer 
import matplotlib.pyplot as plt

In [3]:
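# Load data into a DataFrame; malformed rows are skipped with a warning
# (pandas >= 1.3; older versions used error_bad_lines=False instead)
df = pd.read_csv('watch_reviews.tsv', sep='\t', header=0, on_bad_lines='warn')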

b'Skipping line 8704: expected 15 fields, saw 22\nSkipping line 16933: expected 15 fields, saw 22\nSkipping line 23726: expected 15 fields, saw 22\n'
b'Skipping line 85637: expected 15 fields, saw 22\n'
b'Skipping line 132136: expected 15 fields, saw 22\nSkipping line 158070: expected 15 fields, saw 22\nSkipping line 166007: expected 15 fields, saw 22\nSkipping line 171877: expected 15 fields, saw 22\nSkipping line 177756: expected 15 fields, saw 22\nSkipping line 181773: expected 15 fields, saw 22\nSkipping line 191085: expected 15 fields, saw 22\nSkipping line 196273: expected 15 fields, saw 22\nSkipping line 196331: expected 15 fields, saw 22\n'
b'Skipping line 197000: expected 15 fields, saw 22\nSkipping line 197011: expected 15 fields, saw 22\nSkipping line 197432: expected 15 fields, saw 22\nSkipping line 208016: expected 15 fields, saw 22\nSkipping line 214110: expected 15 fields, saw 22\nSkipping line 244328: expected 15 fields, saw 22\nSkipping line 248519: expected 15 fields,

In [4]:
df.head()

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,US,3653882,R3O9SGZBVQBV76,B00FALQ1ZC,937001370,"Invicta Women's 15150 ""Angel"" 18k Yellow Gold ...",Watches,5,0,0,N,Y,Five Stars,Absolutely love this watch! Get compliments al...,2015-08-31
1,US,14661224,RKH8BNC3L5DLF,B00D3RGO20,484010722,Kenneth Cole New York Women's KC4944 Automatic...,Watches,5,0,0,N,Y,I love thiswatch it keeps time wonderfully,I love this watch it keeps time wonderfully.,2015-08-31
2,US,27324930,R2HLE8WKZSU3NL,B00DKYC7TK,361166390,Ritche 22mm Black Stainless Steel Bracelet Wat...,Watches,2,1,1,N,Y,Two Stars,Scratches,2015-08-31
3,US,7211452,R31U3UH5AZ42LL,B000EQS1JW,958035625,Citizen Men's BM8180-03E Eco-Drive Stainless S...,Watches,5,0,0,N,Y,Five Stars,"It works well on me. However, I found cheaper ...",2015-08-31
4,US,12733322,R2SV659OUJ945Y,B00A6GFD7S,765328221,Orient ER27009B Men's Symphony Automatic Stain...,Watches,4,0,0,N,Y,"Beautiful face, but cheap sounding links",Beautiful watch face. The band looks nice all...,2015-08-31


In [5]:
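# Remove rows whose review body is missing
# (calling dropna on the column alone would leave df itself unchanged)
df = df.dropna(subset=['review_body']).reset_index(drop=True)

In [6]:
# use the first 2000 reviews as our training data
data = df['review_body'].iloc[:2000].tolist()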

In [8]:
data[0]

'Absolutely love this watch! Get compliments almost every time I wear it. Dainty.'

# Part 2: Tokenizing and Stemming

Load the stop-word list and a lemmatizer from the NLTK library.
Stop words are words like "a", "the", or "in" that carry little meaning on their own.
Stemming reduces a word to its root by stripping suffixes; lemmatization, which the code below uses, maps a word to its dictionary form instead.
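
As a rough illustration of these ideas, the sketch below filters stop words and strips plural suffixes using only the standard library. The tiny stop-word set and the `naive_stem` helper are hypothetical stand-ins; NLTK's stop-word list and its stemmers or lemmatizers handle far more cases.

```python
import re

# Hypothetical, tiny stop-word set for illustration only.
TOY_STOPWORDS = {"i", "this", "it", "the", "a", "in"}

def naive_stem(word):
    """Strip a plural suffix. A toy rule, not a real stemming algorithm."""
    if re.search(r"(ch|sh|ss|x)es$", word):
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def preprocess(text):
    # lowercase, split into word-like tokens, drop stop words, then stem
    tokens = re.findall(r"[a-z']+", text.lower())
    return [naive_stem(t) for t in tokens if t not in TOY_STOPWORDS]

result = preprocess("I love this watch, it keeps time in the dark")
print(result)
```

A rule this simple mangles many words (for example, it would turn "always" into "alway"), which is one reason to prefer a library stemmer or lemmatizer in practice.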

In [9]:
# Use nltk's English stopwords.
stopwords = nltk.corpus.stopwords.words('english')

print(f"We use {len(stopwords)} stop-words from nltk library.")
print(stopwords[:10])

We use 179 stop-words from nltk library.
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


In [80]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")  # not used in the cells shown here
lemmatizer = WordNetLemmatizer()

# tokenization and lemmatization
def tokenization_and_lemming(text):
    # lowercase, tokenize, and drop stop words; yields a list of strings
    tokens = [word.lower() for word in nltk.word_tokenize(text) if word.lower() not in stopwords]

    # keep only tokens that contain at least one letter (drops numbers and bare punctuation)
    filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]

    # lemmatization
    lemms = [lemmatizer.lemmatize(t) for t in filtered_tokens]
    return lemms

# tokenization without lemmatization
def tokenization(text):
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent) if word.lower() not in stopwords]
    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

In [24]:
# 1. tokenize and lemmatize all the documents
# 2. also tokenize all the documents without lemmatizing
# the goal is to map each lemma back to an original token when interpreting results.

docs_lemm = []
docs_tokenized = []
for i in data:
    tokenized_and_lemm_results = tokenization_and_lemming(i)
    docs_lemm.extend(tokenized_and_lemm_results)
    
    tokenized_results = tokenization(i)
    docs_tokenized.extend(tokenized_results)

In [19]:
stopwords

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [22]:
data[0]

'Absolutely love this watch! Get compliments almost every time I wear it. Dainty.'

In [23]:
# tokenization and lemmatization
tokenization_and_lemming(data[0])

['absolutely',
 'love',
 'watch',
 'get',
 'compliment',
 'almost',
 'every',
 'time',
 'wear',
 'dainty']

In [25]:
# create a mapping from lemmatized words to original words
# (later occurrences overwrite earlier ones, so each lemma maps to the last token seen)
vocab_frame_dict = dict(zip(docs_lemm, docs_tokenized))
vocab_frame_dict

{'absolutely': 'absolutely',
 'love': 'loves',
 'watch': 'watch',
 'get': 'get',
 'compliment': 'compliments',
 'almost': 'almost',
 'every': 'every',
 'time': 'time',
 'wear': 'wears',
 'dainty': 'dainty',
 'keep': 'keeps',
 'wonderfully': 'wonderfully',
 'scratch': 'scratches',
 'work': 'work',
 'well': 'well',
 'however': 'however',
 'found': 'found',
 'cheaper': 'cheaper',
 'price': 'price',
 'place': 'place',
 'making': 'making',
 'purchase': 'purchase',
 'beautiful': 'beautiful',
 'face': 'face',
 'band': 'band',
 'look': 'looks',
 'nice': 'nice',
 'around': 'around',
 'link': 'links',
 'make': 'makes',
 'squeaky': 'squeaky',
 'cheapo': 'cheapo',
 'noise': 'noise',
 'swing': 'swings',
 'back': 'back',
 'forth': 'forth',
 'wrist': 'wrist',
 'embarrassing': 'embarrassing',
 'front': 'front',
 'enthusiast': 'enthusiasts',
 'naked': 'naked',
 'eye': 'eyes',
 'afar': 'afar',
 'ca': 'ca',
 "n't": "n't",
 'tell': 'tell',
 'cheap': 'cheap',
 'folded': 'folded',
 'polished': 'polished',
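
One property of a mapping built this way is worth keeping in mind: a dictionary keeps only the last value written for each key, so every lemma ends up paired with whichever surface form occurred last in the corpus. A minimal sketch with hypothetical token lists (not taken from the review data), plus one possible alternative that keeps every observed form:

```python
from collections import defaultdict

# Hypothetical lemma/token lists, aligned by position.
lemmas = ["love", "watch", "love", "watch"]
tokens = ["love", "watch", "loves", "watches"]

# Later pairs overwrite earlier ones for the same lemma.
mapping = dict(zip(lemmas, tokens))
print(mapping)

# Alternative: collect every surface form seen for each lemma.
all_forms = defaultdict(set)
for lemma, token in zip(lemmas, tokens):
    all_forms[lemma].add(token)
print(dict(all_forms))
```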
 

# Part 3: TF-IDF

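TF (Term Frequency): how often a term occurs in a document.

IDF (Inverse Document Frequency): a weight that shrinks as a term appears in more documents, so terms common to most reviews count for less.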
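
To make the two factors concrete, here is a small sketch that computes TF-IDF by hand on a hypothetical three-document corpus. It assumes the smoothed formula that scikit-learn's `TfidfVectorizer` uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector. The corpus and helper names are illustrative only.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized corpus.
docs = [["great", "watch"], ["great", "band"], ["watch", "watch", "broke"]]
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency (assumed scikit-learn default).
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    # Raw term counts times idf, then L2-normalize the vector.
    counts = Counter(doc)
    weights = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf(docs[2])
print(vec)
```

In the third document, "broke" has the higher IDF because it occurs in only one document, but "watch" still gets the larger final weight because it appears twice.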

In [26]:
# define vectorizer parameters
# TfidfVectorizer builds the tf-idf matrix
# max_df: ignore terms that appear in more than this fraction of documents
# min_df: ignore terms that appear in fewer than this fraction of documents
# max_features: keep at most this many terms, ranked by frequency across the corpus
# use_idf: if False, only tf is computed
# stop_words: built-in stop-word list
# tokenizer: callable used to tokenize each document
# ngram_range: (min_n, max_n), e.g. (1, 3) means the result will include 1-grams, 2-grams, and 3-grams
tfidf_model = TfidfVectorizer(max_df=0.99, max_features=1000,
                                 min_df=0.01, stop_words='english',
                                 use_idf=True, tokenizer=tokenization_and_lemming, ngram_range=(1,2))

tfidf_matrix = tfidf_model.fit_transform(data)  # fit the vectorizer to the reviews

print ("In total, there are " + str(tfidf_matrix.shape[0]) + \
      " reviews and " + str(tfidf_matrix.shape[1]) + " terms.")


In total, there are 2000 reviews and 280 terms.


In [27]:
# check the parameters
tfidf_model.get_params()

{'analyzer': 'word',
 'binary': False,
 'decode_error': 'strict',
 'dtype': numpy.float64,
 'encoding': 'utf-8',
 'input': 'content',
 'lowercase': True,
 'max_df': 0.99,
 'max_features': 1000,
 'min_df': 0.01,
 'ngram_range': (1, 2),
 'norm': 'l2',
 'preprocessor': None,
 'smooth_idf': True,
 'stop_words': 'english',
 'strip_accents': None,
 'sublinear_tf': False,
 'token_pattern': '(?u)\\b\\w\\w+\\b',
 'tokenizer': <function __main__.tokenization_and_lemming(text)>,
 'use_idf': True,
 'vocabulary': None}

Save the terms identified by TF-IDF.

In [28]:
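# terms kept by the vectorizer
# (scikit-learn >= 1.0; older versions use get_feature_names() instead)
tf_selected_words = tfidf_model.get_feature_names_out().tolist()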

In [29]:
# print out words
tf_selected_words

["'ll",
 "'m",
 "'s",
 "'ve",
 'able',
 'absolutely',
 'accurate',
 'actually',
 'add',
 'adjust',
 'alarm',
 'amazing',
 'amazon',
 'arrived',
 'attractive',
 'automatic',
 'awesome',
 'bad',
 'band',
 'battery',
 'beautiful',
 'beautiful watch',
 'best',
 'better',
 'big',
 'bit',
 'black',
 'blue',
 'bought',
 'bought watch',
 'box',
 'br',
 'br br',
 'br watch',
 'brand',
 'bright',
 'broke',
 'broken',
 'button',
 'buy',
 'buy watch',
 'ca',
 "ca n't",
 'came',
 'case',
 'casio',
 'change',
 'cheap',
 'clasp',
 'classy',
 'clock',
 'collection',
 'color',
 'come',
 'comfortable',
 'compliment',
 'cool',
 'cost',
 'couple',
 'crown',
 'crystal',
 'cute',
 'dark',
 'date',
 'daughter',
 'day',
 'deal',
 'definitely',
 'delivery',
 'design',
 'dial',
 'different',
 'digital',
 'disappointed',
 'display',
 'dress',
 'durable',
 'easily',
 'easy',
 'easy read',
 'end',
 'everyday',
 'exactly',
 'excellent',
 'expected',
 'expensive',
 'extremely',
 'face',
 'fact',
 'far',
 'fast',
 'f

# Part 4: K-means clustering

In [30]:
# k-means clustering
from sklearn.cluster import KMeans

# number of clusters
num_clusters = 5

km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)

# cluster label assigned to each review
clusters = km.labels_.tolist()

## 4.1. Analyze K-means Result

In [35]:
# create a DataFrame pairing each review's product title with its cluster label
# (the 'review' column holds the product title, not the review text)
product = { 'review': df[:2000].product_title, 'cluster': clusters}
frame = pd.DataFrame(product, columns = ['review', 'cluster'])

In [36]:
frame.head(10)

Unnamed: 0,review,cluster
0,"Invicta Women's 15150 ""Angel"" 18k Yellow Gold ...",1
1,Kenneth Cole New York Women's KC4944 Automatic...,1
2,Ritche 22mm Black Stainless Steel Bracelet Wat...,0
3,Citizen Men's BM8180-03E Eco-Drive Stainless S...,0
4,Orient ER27009B Men's Symphony Automatic Stain...,0
5,Casio Men's GW-9400BJ-1JF G-Shock Master of G ...,1
6,Fossil Women's ES3851 Urban Traveler Multifunc...,3
7,INFANTRY Mens Night Vision Analog Quartz Wrist...,0
8,G-Shock Men's Grey Sport Watch,3
9,Heiden Quad Watch Winder in Black Leather,0


In [37]:
print ("Number of reviews included in each cluster:")
frame['cluster'].value_counts().to_frame()

Number of reviews included in each cluster:


Unnamed: 0,cluster
0,1338
3,216
1,209
2,136
4,101


In [38]:
km.cluster_centers_

array([[0.00585172, 0.01553297, 0.04438039, ..., 0.00854199, 0.01856901,
        0.01400758],
       [0.        , 0.00403156, 0.02941768, ..., 0.00221672, 0.01471138,
        0.00283457],
       [0.00258364, 0.        , 0.01798581, ..., 0.        , 0.00471382,
        0.        ],
       [0.0026421 , 0.0070928 , 0.0141092 , ..., 0.00228126, 0.00617267,
        0.01343498],
       [0.        , 0.        , 0.02139016, ..., 0.        , 0.00573214,
        0.        ]])
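
The raw centroid matrix is hard to read on its own. Since each column corresponds to one TF-IDF term, a common way to interpret a cluster is to list the terms with the largest centroid weights. The sketch below does this with plain Python lists. The `terms` and `centers` values are made-up placeholders; in this notebook they would presumably come from `tf_selected_words` and `km.cluster_centers_`.

```python
def top_terms(center, terms, n=3):
    """Return the n terms with the largest weights in one centroid."""
    ranked = sorted(range(len(center)), key=lambda j: center[j], reverse=True)
    return [terms[j] for j in ranked[:n]]

# Hypothetical vocabulary and centroids (2 clusters x 5 terms).
terms = ["band", "battery", "beautiful", "cheap", "love"]
centers = [
    [0.02, 0.01, 0.30, 0.00, 0.45],
    [0.25, 0.40, 0.01, 0.20, 0.02],
]

for k, center in enumerate(centers):
    print(f"Cluster {k}:", top_terms(center, terms))
```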

# Part 5: Topic Modeling - Latent Dirichlet Allocation

In [81]:
# Use LDA for clustering
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=5)

In [82]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [83]:
from sklearn.feature_extraction.text import CountVectorizer
# LDA expects raw term counts, so use CountVectorizer instead of TfidfVectorizer
# (the variable names keep the tfidf_ prefix even though they now hold counts)
tfidf_model_lda = CountVectorizer(max_df=0.99, max_features=500,
                                 min_df=0.01, stop_words='english',
                                 tokenizer=tokenization_and_lemming, ngram_range=(1,2))

tfidf_matrix_lda = tfidf_model_lda.fit_transform(data)  # fit the vectorizer to the reviews

print ("In total, there are " + str(tfidf_matrix_lda.shape[0]) + \
      " reviews and " + str(tfidf_matrix_lda.shape[1]) + " terms.")


In total, there are 2000 reviews and 280 terms.


In [84]:
# document topic matrix for tfidf_matrix_lda
lda_output = lda.fit_transform(tfidf_matrix_lda)
print(lda_output.shape)
print(lda_output)

(2000, 5)
[[0.02532579 0.89881092 0.02513505 0.02552381 0.02520444]
 [0.04091527 0.83760235 0.04029201 0.04059706 0.0405933 ]
 [0.59767631 0.10168281 0.1000017  0.10000228 0.1006369 ]
 ...
 [0.01829337 0.01821064 0.58705726 0.01824809 0.35819063]
 [0.10229721 0.59599822 0.10000188 0.10000288 0.10169981]
 [0.2        0.2        0.2        0.2        0.2       ]]


In [85]:
# topics and words matrix
topic_word = lda.components_
print(topic_word.shape)
print(topic_word)

(5, 280)
[[ 25.57269492  31.33949059 173.39144505 ...   6.96011365  54.5250916
   34.29141266]
 [  2.6367952    7.50899928  60.18186543 ...   9.18114688  10.93593807
   73.09987787]
 [  4.90791733  37.63470766  63.90459983 ...   3.61038729  46.02879633
    0.20437772]
 [  0.20345931   5.66558305  43.70215617 ...  10.01877796  59.30686711
    0.20293292]
 [ 14.67913324  31.85121942 121.81993352 ...   6.22957422   0.2033069
    0.20139883]]
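
The entries of `lda.components_` are not probabilities. According to the scikit-learn documentation they can be read as pseudo-counts of how often a word was assigned to a topic, so dividing each row by its sum yields a per-topic distribution over words. A minimal sketch with hypothetical numbers (a 2x3 matrix rather than the 5x280 one above):

```python
# Hypothetical unnormalized topic-word weights (2 topics x 3 words).
topic_word_counts = [
    [25.0, 5.0, 20.0],
    [0.2, 9.0, 0.8],
]

def normalize_rows(matrix):
    """Divide every row by its sum so that each row sums to 1."""
    return [[value / sum(row) for value in row] for row in matrix]

topic_word_probs = normalize_rows(topic_word_counts)
for row in topic_word_probs:
    print([round(p, 2) for p in row])
```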


In [87]:
lda_output

array([[0.02532579, 0.89881092, 0.02513505, 0.02552381, 0.02520444],
       [0.04091527, 0.83760235, 0.04029201, 0.04059706, 0.0405933 ],
       [0.59767631, 0.10168281, 0.1000017 , 0.10000228, 0.1006369 ],
       ...,
       [0.01829337, 0.01821064, 0.58705726, 0.01824809, 0.35819063],
       [0.10229721, 0.59599822, 0.10000188, 0.10000288, 0.10169981],
       [0.2       , 0.2       , 0.2       , 0.2       , 0.2       ]])

In [88]:
# column names
topic_names = ["Topic" + str(i) for i in range(lda.n_components)]

# index names
doc_names = ["Doc" + str(i) for i in range(len(data))]

df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns=topic_names, index=doc_names)

# get dominant topic for each document
topic = np.argmax(df_document_topic.values, axis=1)
df_document_topic['topic'] = topic

df_document_topic.head(10)

Unnamed: 0,Topic0,Topic1,Topic2,Topic3,Topic4,topic
Doc0,0.03,0.9,0.03,0.03,0.03,1
Doc1,0.04,0.84,0.04,0.04,0.04,1
Doc2,0.6,0.1,0.1,0.1,0.1,0
Doc3,0.05,0.05,0.05,0.05,0.8,4
Doc4,0.39,0.01,0.27,0.16,0.18,0
Doc5,0.38,0.53,0.03,0.03,0.03,1
Doc6,0.02,0.26,0.67,0.02,0.02,2
Doc7,0.05,0.8,0.05,0.05,0.05,1
Doc8,0.01,0.01,0.01,0.95,0.01,3
Doc9,0.5,0.02,0.18,0.02,0.28,0
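
Taking the arg-max assigns a topic even when the distribution carries no signal. The last row of `lda_output`, `[0.2, 0.2, 0.2, 0.2, 0.2]`, is exactly uniform, which typically happens when a review contains none of the vocabulary terms, so its "dominant" topic is arbitrary. The sketch below, using hypothetical rows and a hypothetical `min_margin` threshold, returns `None` for such cases instead of a topic index.

```python
def dominant_topic(weights, min_margin=0.05):
    """Index of the largest weight, or None if it does not clearly win."""
    ranked = sorted(range(len(weights)), key=lambda k: weights[k], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if weights[best] - weights[runner_up] < min_margin:
        return None
    return best

# Hypothetical document-topic rows, rounded as in the table above.
rows = [
    [0.03, 0.90, 0.03, 0.03, 0.03],
    [0.20, 0.20, 0.20, 0.20, 0.20],
    [0.50, 0.02, 0.18, 0.02, 0.28],
]

assigned = [dominant_topic(r) for r in rows]
print(assigned)
```

The margin value is a judgment call; a larger threshold leaves more documents unassigned but makes the remaining assignments easier to trust.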


In [89]:
tfidf_matrix_lda

<2000x280 sparse matrix of type '<class 'numpy.int64'>'
	with 16544 stored elements in Compressed Sparse Row format>

In [90]:
df_document_topic['topic'].value_counts().to_frame()

Unnamed: 0,topic
2,467
1,435
4,433
3,343
0,322


In [91]:
# topic-word matrix
print(lda.components_)
df_topic_words = pd.DataFrame(lda.components_)

# column and index
# (scikit-learn >= 1.0; older versions use get_feature_names() instead)
df_topic_words.columns = tfidf_model_lda.get_feature_names_out()
df_topic_words.index = topic_names

df_topic_words.head()

[[ 25.57269492  31.33949059 173.39144505 ...   6.96011365  54.5250916
   34.29141266]
 [  2.6367952    7.50899928  60.18186543 ...   9.18114688  10.93593807
   73.09987787]
 [  4.90791733  37.63470766  63.90459983 ...   3.61038729  46.02879633
    0.20437772]
 [  0.20345931   5.66558305  43.70215617 ...  10.01877796  59.30686711
    0.20293292]
 [ 14.67913324  31.85121942 121.81993352 ...   6.22957422   0.2033069
    0.20139883]]


Unnamed: 0,'ll,'m,'s,'ve,able,absolutely,accurate,actually,add,adjust,...,wo n't,woman,work,work great,worked,working,worn,worth,wrist,year
Topic0,25.572695,31.339491,173.391445,53.295361,19.632808,1.976612,17.762095,13.68549,12.027045,6.895974,...,7.401504,0.207245,19.778943,0.200221,2.463463,12.876545,8.733967,6.960114,54.525092,34.291413
Topic1,2.636795,7.508999,60.181865,16.14,0.201182,27.31239,0.203576,0.205607,0.206312,0.204073,...,0.200567,0.201856,22.411462,0.200645,13.988896,53.474411,1.070996,9.181147,10.935938,73.099878
Topic2,4.907917,37.634708,63.9046,9.920623,11.719013,0.222235,1.053946,9.804609,9.363117,3.898446,...,0.202935,10.158213,1.230493,0.200725,0.205116,0.201401,7.851677,3.610387,46.028796,0.204378
Topic3,0.203459,5.665583,43.702156,7.829919,2.042883,0.208932,0.759391,14.640111,0.202979,3.01531,...,0.202172,13.229262,115.294158,24.196869,1.560724,0.201894,12.141447,10.018778,59.306867,0.202933
Topic4,14.679133,31.851219,121.819934,5.814097,4.404114,0.279832,2.220992,1.664183,0.200548,11.986197,...,13.992822,0.203424,56.284944,0.201539,3.781801,0.245748,0.201912,6.229574,0.203307,0.201399


In [92]:
# column names
topic_names = ["Topic" + str(i) for i in range(lda.n_components)]

In [93]:
# return the top n keywords for each topic
def print_topic_words(tfidf_model, lda_model, n_words):
    # scikit-learn >= 1.0; older versions use get_feature_names() instead
    words = np.array(tfidf_model.get_feature_names_out())
    topic_words = []
    # each row of components_ holds one topic's word weights
    for topic_words_weights in lda_model.components_:
        # indices of the n_words largest weights, in descending order
        top_words = topic_words_weights.argsort()[::-1][:n_words]
        topic_words.append(words.take(top_words))
    return topic_words

topic_keywords = print_topic_words(tfidf_model=tfidf_model_lda, lda_model=lda, n_words=15)        

df_topic_words = pd.DataFrame(topic_keywords)
df_topic_words.columns = ['Word '+str(i) for i in range(df_topic_words.shape[1])]
df_topic_words.index = ['Topic '+str(i) for i in range(df_topic_words.shape[0])]
df_topic_words

Unnamed: 0,Word 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9,Word 10,Word 11,Word 12,Word 13,Word 14
Topic 0,br,watch,br br,time,'s,n't,like,day,hand,second,band,hour,link,light,wrist
Topic 1,watch,love,great,love watch,great watch,year,wear,'s,bought,got,working,battery,time,color,band
Topic 2,watch,band,look,like,great,product,face,looking,n't,'s,really,watch look,leather,cheap,size
Topic 3,watch,beautiful,work,perfect,great,easy,fit,strap,wrist,time,read,look,comfortable,love,beautiful watch
Topic 4,watch,nice,good,n't,'s,really,gift,look,quality,price,time,nice watch,work,looking,set
