## Topic Modeling of Amazon Product Reviews using NLP and Unsupervised Learning

### Part 0. Project Introduction
#### In this project, I apply natural language processing to Amazon product (watch) review data, then use unsupervised learning models to cluster the unlabeled reviews into groups and identify their latent topics.
#### Contents:
- [Part 1. Load Data](#Part-1.-Load-Data)
- [Part 2. Tokenizing and Stemming](#Part-2.-Tokenizing-and-Stemming)
- [Part 3. TF-IDF](#Part-3.-TF-IDF)
- [Part 4. K-means clustering](#Part-4.-K-means-clustering)
- [Part 5. Topic Modeling - Latent Dirichlet Allocation](#Part-5.-Topic-Modeling---Latent-Dirichlet-Allocation)


### Part 1. Load Data

In [1]:
import pandas as pd

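# pandas >= 1.3 replaces error_bad_lines=False with on_bad_lines;
# 'warn' skips malformed rows and reports each skipped line
df = pd.read_csv('data/reviews.tsv', sep='\t', on_bad_lines='warn')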

b'Skipping line 8704: expected 15 fields, saw 22\nSkipping line 16933: expected 15 fields, saw 22\nSkipping line 23726: expected 15 fields, saw 22\n'
b'Skipping line 85637: expected 15 fields, saw 22\n'
b'Skipping line 132136: expected 15 fields, saw 22\nSkipping line 158070: expected 15 fields, saw 22\nSkipping line 166007: expected 15 fields, saw 22\nSkipping line 171877: expected 15 fields, saw 22\nSkipping line 177756: expected 15 fields, saw 22\nSkipping line 181773: expected 15 fields, saw 22\nSkipping line 191085: expected 15 fields, saw 22\nSkipping line 196273: expected 15 fields, saw 22\nSkipping line 196331: expected 15 fields, saw 22\n'
b'Skipping line 197000: expected 15 fields, saw 22\nSkipping line 197011: expected 15 fields, saw 22\nSkipping line 197432: expected 15 fields, saw 22\nSkipping line 208016: expected 15 fields, saw 22\nSkipping line 214110: expected 15 fields, saw 22\nSkipping line 244328: expected 15 fields, saw 22\nSkipping line 248519: expected 15 fields,

In [2]:
df.head()

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,US,3653882,R3O9SGZBVQBV76,B00FALQ1ZC,937001370,"Invicta Women's 15150 ""Angel"" 18k Yellow Gold ...",Watches,5,0,0,N,Y,Five Stars,Absolutely love this watch! Get compliments al...,2015-08-31
1,US,14661224,RKH8BNC3L5DLF,B00D3RGO20,484010722,Kenneth Cole New York Women's KC4944 Automatic...,Watches,5,0,0,N,Y,I love thiswatch it keeps time wonderfully,I love this watch it keeps time wonderfully.,2015-08-31
2,US,27324930,R2HLE8WKZSU3NL,B00DKYC7TK,361166390,Ritche 22mm Black Stainless Steel Bracelet Wat...,Watches,2,1,1,N,Y,Two Stars,Scratches,2015-08-31
3,US,7211452,R31U3UH5AZ42LL,B000EQS1JW,958035625,Citizen Men's BM8180-03E Eco-Drive Stainless S...,Watches,5,0,0,N,Y,Five Stars,"It works well on me. However, I found cheaper ...",2015-08-31
4,US,12733322,R2SV659OUJ945Y,B00A6GFD7S,765328221,Orient ER27009B Men's Symphony Automatic Stain...,Watches,4,0,0,N,Y,"Beautiful face, but cheap sounding links",Beautiful watch face. The band looks nice all...,2015-08-31


In [3]:
# remove missing values
df.dropna(subset=['review_body'], inplace=True)
df.reset_index(inplace=True, drop=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 960056 entries, 0 to 960055
Data columns (total 15 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   marketplace        960056 non-null  object
 1   customer_id        960056 non-null  int64 
 2   review_id          960056 non-null  object
 3   product_id         960056 non-null  object
 4   product_parent     960056 non-null  int64 
 5   product_title      960054 non-null  object
 6   product_category   960056 non-null  object
 7   star_rating        960056 non-null  int64 
 8   helpful_votes      960056 non-null  int64 
 9   total_votes        960056 non-null  int64 
 10  vine               960056 non-null  object
 11  verified_purchase  960056 non-null  object
 12  review_headline    960049 non-null  object
 13  review_body        960056 non-null  object
 14  review_date        960052 non-null  object
dtypes: int64(5), object(10)
memory usage: 109.9+ MB


In [4]:
# use the first 1000 reviews as our training data
data_size = 1000
data = df.loc[:data_size-1, 'review_body'].tolist()
len(data)

1000

### Part 2. Tokenizing and Stemming

In [5]:
import nltk

# use nltk's English stopwords plus a few custom stop words
stopwords = nltk.corpus.stopwords.words('english')
stopwords.append("'s")
stopwords.append("'m")
stopwords.append("br")
stopwords.append('watch')

In [6]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

def tokenize(text):
    tokens = []
    for word in nltk.word_tokenize(text):
        if word.lower() not in stopwords and word.isalpha():
            tokens.append(word.lower())
    
    stems = [stemmer.stem(token) for token in tokens]
    return stems
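Note that `tokenize` checks the stopword list before stemming. The Topic4 word list in Part 5 still contains `watch` even though it was added as a stop word; one plausible explanation is that inflected forms such as "watches" pass the filter and are only reduced to `watch` afterwards. The sketch below illustrates that ordering on a pre-tokenized input. It is not the notebook's code: `STOPWORDS` is a tiny made-up set and `toy_stem` is a hypothetical stand-in for `SnowballStemmer`, used only so the example runs without NLTK data.

```python
# Toy illustration of the filter order in tokenize(): the stopword check runs
# on the raw lowercase token, and stemming happens only afterwards.
STOPWORDS = {"this", "is", "a", "i", "two", "watch"}  # small hypothetical set


def toy_stem(token):
    """Hypothetical stand-in for SnowballStemmer: strips a trailing 'es' or 's'."""
    for suffix in ("es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def filter_then_stem(tokens):
    # Same order as tokenize(): filter on the unstemmed token, then stem.
    kept = [t.lower() for t in tokens if t.lower() not in STOPWORDS and t.isalpha()]
    return [toy_stem(t) for t in kept]


tokens = ["This", "watch", "is", "great", "!", "I", "own", "two", "watches", "."]
print(filter_then_stem(tokens))
```

If the stop word should be removed in all its forms, one option is to also compare the stemmed token against the stopword list.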

### Part 3. TF-IDF

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_model = TfidfVectorizer(max_df=0.99, 
                              max_features=1000, 
                              min_df=0.01, 
                              use_idf=True, 
                              tokenizer=tokenize, 
                              ngram_range=(1,1))
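When `min_df` and `max_df` are floats, scikit-learn treats them as fractions of the number of documents. With 1000 reviews, `min_df=0.01` keeps terms that occur in at least 10 reviews and `max_df=0.99` drops terms that occur in more than 990. The following is a minimal pure-Python sketch of that pruning rule, not the library's implementation; `document_frequency_filter` and the toy documents are hypothetical names introduced for illustration.

```python
from collections import Counter


def document_frequency_filter(tokenized_docs, min_df=0.01, max_df=0.99):
    """Keep terms whose document frequency lies within the given fractions."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each term at most once per document
    low, high = min_df * n_docs, max_df * n_docs
    return sorted(term for term, count in doc_freq.items() if low <= count <= high)


docs = [["good", "band"], ["good", "price"], ["good", "band", "face"], ["good"]]
# With 4 documents: keep terms found in at least 2 and at most 3.96 documents.
print(document_frequency_filter(docs, min_df=0.5, max_df=0.99))
```

Here "good" is dropped because it appears in every document, and "price" and "face" are dropped because they appear in only one.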

In [8]:
tfidf_matrix = tfidf_model.fit_transform(data)
print('In total {} reviews, there are {} terms'.format(tfidf_matrix.shape[0], tfidf_matrix.shape[1]))

In total 1000 reviews, there are 276 terms
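Only 276 terms survive the document-frequency pruning, so the `max_features=1000` cap is not binding here. For reference, with the settings shown (`use_idf=True`, and the defaults `smooth_idf=True`, `norm='l2'`), each weight is the term count multiplied by `ln((1 + n) / (1 + df)) + 1`, and each row is then scaled to unit length. The sketch below reproduces that arithmetic on a tiny hand-made count matrix; `tfidf_rows` and `counts` are hypothetical names, and this is an illustration rather than scikit-learn's code.

```python
import math


def tfidf_rows(count_rows):
    """Smoothed TF-IDF with L2 row normalization on a dense list-of-lists."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # document frequency: number of rows in which each term occurs
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in count_rows:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        # a row with no vocabulary terms stays all zeros
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result


counts = [[1, 1, 0],   # contains term 0 (common) and term 1 (rare)
          [1, 0, 0],   # contains only term 0
          [0, 0, 0]]   # contains no vocabulary term at all
rows = tfidf_rows(counts)
print(rows)
```

The rarer term gets the larger weight in the first row, and a review whose tokens all fall outside the vocabulary ends up as an all-zero row.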


In [9]:
tfidf_model.get_params()

{'analyzer': 'word',
 'binary': False,
 'decode_error': 'strict',
 'dtype': numpy.float64,
 'encoding': 'utf-8',
 'input': 'content',
 'lowercase': True,
 'max_df': 0.99,
 'max_features': 1000,
 'min_df': 0.01,
 'ngram_range': (1, 1),
 'norm': 'l2',
 'preprocessor': None,
 'smooth_idf': True,
 'stop_words': None,
 'strip_accents': None,
 'sublinear_tf': False,
 'token_pattern': '(?u)\\b\\w\\w+\\b',
 'tokenizer': <function __main__.tokenize(text)>,
 'use_idf': True,
 'vocabulary': None}

In [10]:
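# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (1.0+) replaces it
tf_selected_words = tfidf_model.get_feature_names_out().tolist()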
tf_selected_words[:10]

['abl',
 'absolut',
 'accur',
 'actual',
 'adjust',
 'alarm',
 'almost',
 'alreadi',
 'also',
 'alway']

In [11]:
tfidf_matrix.todense()

matrix([[0.        , 0.43786652, 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        ...,
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ]])

### Part 4. K-means clustering

In [12]:
from sklearn.cluster import KMeans

num_clusters = 5

km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)

clusters = km.labels_.tolist()
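`KMeans` alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. The NumPy sketch below shows that loop in its simplest form. It is an assumption-laden illustration: `lloyd_kmeans` is a hypothetical helper with plain random initialization and a fixed iteration count, whereas scikit-learn uses k-means++ initialization and a convergence check. Since the notebook does not set `random_state`, cluster numbering and sizes can change between runs.

```python
import numpy as np


def lloyd_kmeans(X, k, n_iter=20, seed=0):
    """Basic Lloyd iterations: assign to nearest center, then recompute means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)

    def assign(centers):
        # squared Euclidean distance from every point to every center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return dists.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centers)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # leave an empty cluster's center where it is
                centers[j] = members.mean(axis=0)
    return assign(centers), centers


X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centers = lloyd_kmeans(X, k=2)
print(labels)
```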

In [13]:
review_cluster = {'review': df[:data_size].review_body, 'cluster': clusters}
result = pd.DataFrame(review_cluster, columns=['review', 'cluster'])

In [14]:
result.head()

Unnamed: 0,review,cluster
0,Absolutely love this watch! Get compliments al...,2
1,I love this watch it keeps time wonderfully.,1
2,Scratches,2
3,"It works well on me. However, I found cheaper ...",2
4,Beautiful watch face. The band looks nice all...,2


In [15]:
print('Number of reviews in each cluster:')
result['cluster'].value_counts().to_frame()

Number of reviews in each cluster:


Unnamed: 0,cluster
2,677
1,108
3,78
0,76
4,61


In [16]:
km.cluster_centers_

array([[0.        , 0.        , 0.        , ..., 0.00810014, 0.        ,
        0.        ],
       [0.        , 0.03256575, 0.        , ..., 0.01344688, 0.00411372,
        0.00421882],
       [0.00549086, 0.00540562, 0.0033356 , ..., 0.01709002, 0.01297125,
        0.00516523],
       [0.        , 0.        , 0.        , ..., 0.        , 0.01509588,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.00797103, 0.        ,
        0.        ]])

In [17]:
km.cluster_centers_.shape

(5, 276)

In [18]:
print('<Document: Clustering result by K-means>')

centroids = km.cluster_centers_.argsort()[:, ::-1]

cluster_words_summary = {}
for i in range(num_clusters):
    print('Ten sample words in cluster #{}: '.format(i), end=' ')
    cluster_words_summary[i] = []
    # the ten terms with the largest centroid weights
    for ind in centroids[i, :10]:
        cluster_words_summary[i].append(tf_selected_words[ind])
        print(tf_selected_words[ind], ',', end=' ')
    print()
    
    cluster_reviews = result[result.cluster==i].review.tolist()
    print('Sample reviews in cluster #{}:'.format(i))
    print('; '.join(cluster_reviews[:3]))
    print()

<Document: Clustering result by K-means>
Ten sample words in cluster #0:  good , product , seller , qualiti , price , work , big , look , excel , keep , 
Sample reviews in cluster #0:
very good; It's a good value, and a good functional watch strap.  It's super wide though, and takes more space on the wrist than I'd like.; very good

Ten sample words in cluster #1:  love , wife , husband , look , beauti , gift , great , color , absolut , bought , 
Sample reviews in cluster #1:
I love this watch it keeps time wonderfully.; i love this watch for my purpose, about the people complaining should of done their research better before buying. dumb people.; Love this watch, I just received it yesterday it looks really nice on my  wrist, my friends and family love it.

Ten sample words in cluster #2:  look , like , work , band , time , perfect , beauti , one , wear , excel , 
Sample reviews in cluster #2:
Absolutely love this watch! Get compliments almost every time I wear it. Dainty.; Scratches;
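The word lists above come from sorting each centroid's weights in descending order: `argsort()` returns ascending indices per row, and `[:, ::-1]` reverses each row so that the largest weights come first. A small NumPy sketch with a made-up four-term vocabulary and two hypothetical centroids:

```python
import numpy as np

terms = np.array(["band", "gift", "love", "price"])   # hypothetical vocabulary
centers = np.array([[0.1, 0.6, 0.9, 0.0],             # hypothetical centroid 0
                    [0.7, 0.0, 0.2, 0.5]])            # hypothetical centroid 1

# ascending argsort per row, then reverse each row for descending order
order = centers.argsort()[:, ::-1]

# take the two highest-weighted terms per centroid
top2 = [terms[row[:2]].tolist() for row in order]
print(top2)  # [['love', 'gift'], ['band', 'price']]
```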

### Part 5. Topic Modeling - Latent Dirichlet Allocation

In [19]:
# use LDA to assign reviews to latent topics
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=5)
lda_output = lda.fit_transform(tfidf_matrix)
print(lda_output.shape) 
print(lda_output)  #  documents and topics matrix

(1000, 5)
[[0.08630816 0.05363387 0.49013423 0.31469624 0.05522749]
 [0.24675154 0.52508154 0.07474596 0.07733538 0.07608558]
 [0.2        0.2        0.2        0.2        0.2       ]
 ...
 [0.10000039 0.59988765 0.10000039 0.10000062 0.10011095]
 [0.06198638 0.06446851 0.06845327 0.06373521 0.74135664]
 [0.07012546 0.7240728  0.06742329 0.06815469 0.07022376]]


In [20]:
# topics and words matrix
topic_word = lda.components_
df_topic_word = pd.DataFrame(topic_word)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df_topic_word.columns = tfidf_model.get_feature_names_out()
topic_names = ['Topic' + str(i) for i in range(lda.n_components)]
df_topic_word.index = topic_names
df_topic_word.head()

Unnamed: 0,abl,absolut,accur,actual,adjust,alarm,almost,alreadi,also,alway,...,wish,within,without,work,worn,worth,would,wrist,year,yet
Topic0,0.200239,5.090469,0.20007,0.200268,1.398326,0.200199,0.200661,0.200122,0.200138,0.200088,...,0.200108,0.20059,0.200096,0.201119,0.202141,1.436021,0.200314,0.201246,0.200112,0.202785
Topic1,0.200063,0.200204,0.200049,0.208061,0.201343,0.200063,0.202104,0.200898,0.201983,0.200479,...,0.200342,0.2001,0.200862,6.710469,0.200085,0.201118,0.203727,0.850557,0.201536,0.200042
Topic2,0.916672,2.477202,0.20008,0.200562,0.200172,0.200119,0.204056,2.989975,1.117651,0.92817,...,0.200088,0.200891,0.200204,0.202932,0.200113,0.863986,0.923574,0.283979,0.20105,0.200754
Topic3,0.2503,0.20098,0.206343,0.20037,0.223924,4.967547,2.813021,0.208561,0.870264,0.200768,...,1.105023,0.678507,1.693054,22.783946,0.20572,0.200268,0.2024,3.422321,0.501566,0.200414
Topic4,3.150039,0.20785,2.451661,4.852186,4.085804,0.207276,3.742651,1.980874,6.222922,3.037664,...,2.544368,2.952903,4.094696,8.345449,3.24694,3.863816,13.422309,10.365948,10.299036,4.148501
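The entries of `lda.components_` are pseudo-counts rather than probabilities, which is why values can exceed 1. The floor of roughly 0.2 in the table is consistent with the default topic-word prior of `1 / n_components` for five topics. Dividing each row by its sum gives a per-topic word distribution. The sketch below does this on a small made-up matrix standing in for `lda.components_`; the numbers are hypothetical.

```python
import numpy as np

# hypothetical pseudo-counts for 2 topics x 3 terms (stand-in for lda.components_)
components = np.array([[0.2, 5.0, 0.8],
                       [3.0, 0.2, 0.8]])

# normalize each row so it sums to 1: P(term | topic)
topic_word_prob = components / components.sum(axis=1, keepdims=True)
print(topic_word_prob.round(3))
```

Normalizing does not change the ranking of terms within a topic, so the top-word lists are the same either way; it only makes weights comparable across topics of different size.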


In [21]:
# Assign a document to the topic with highest probability

import numpy as np

doc_names = ['Doc' + str(i) for i in range(len(data))]

df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns=topic_names, index=doc_names)
df_document_topic['Topic'] = np.argmax(df_document_topic.values, axis=1)

df_document_topic.head(10)

Unnamed: 0,Topic0,Topic1,Topic2,Topic3,Topic4,Topic
Doc0,0.09,0.05,0.49,0.31,0.06,2
Doc1,0.25,0.53,0.07,0.08,0.08,1
Doc2,0.2,0.2,0.2,0.2,0.2,0
Doc3,0.06,0.06,0.06,0.06,0.77,4
Doc4,0.04,0.04,0.04,0.04,0.85,4
Doc5,0.2,0.07,0.08,0.07,0.57,4
Doc6,0.35,0.46,0.06,0.06,0.06,1
Doc7,0.06,0.06,0.06,0.06,0.75,4
Doc8,0.04,0.04,0.22,0.04,0.65,4
Doc9,0.06,0.06,0.06,0.06,0.77,4
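Doc2 ("Scratches") has a flat 0.2 in every topic, which probably means none of its tokens survived the vocabulary filter, so the model had no evidence to work with. `np.argmax` returns the first index on ties, so such a review is labeled Topic 0 by default rather than by content. The sketch below uses the Doc0 and Doc2 rows from the table above and flags flat rows instead; the `-1` marker for "no dominant topic" is a convention assumed here, not something the notebook does.

```python
import numpy as np

doc_topic = np.array([[0.09, 0.05, 0.49, 0.31, 0.06],   # Doc0 from the table
                      [0.20, 0.20, 0.20, 0.20, 0.20]])  # Doc2 from the table

dominant = doc_topic.argmax(axis=1)  # ties resolve to the first index
# a row is "flat" when its largest and smallest probabilities coincide
is_flat = np.isclose(doc_topic.max(axis=1), doc_topic.min(axis=1))
labels = np.where(is_flat, -1, dominant)  # -1 = no dominant topic (assumed convention)
print(dominant, labels)
```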


In [22]:
df_document_topic['Topic'].value_counts().to_frame()

Unnamed: 0,Topic
4,453
3,180
1,157
0,112
2,98


In [23]:
# find out top n words for each topic

def print_topic_words(tfidf_model, lda_model, n_words):
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    words = np.array(tfidf_model.get_feature_names_out())
    topic_words = []
    for weights in lda_model.components_:
        top_words = weights.argsort()[::-1][:n_words]
        topic_words.append(words.take(top_words))
    return topic_words

In [24]:
topic_words = print_topic_words(tfidf_model, lda, 15)

df_topic_words = pd.DataFrame(topic_words)
df_topic_words.columns = ['Word' + str(i) for i in range(df_topic_words.shape[1])] 
df_topic_words.index = ['Topic' + str(i) for i in range(df_topic_words.shape[0])]
df_topic_words

Unnamed: 0,Word0,Word1,Word2,Word3,Word4,Word5,Word6,Word7,Word8,Word9,Word10,Word11,Word12,Word13,Word14
Topic0,love,beauti,like,gift,wife,husband,realli,bought,absolut,pleas,cool,look,much,classi,fast
Topic1,good,great,look,price,awesom,thank,qualiti,keep,work,deal,fast,comfort,ship,big,want
Topic2,excel,product,amaz,great,pretti,seller,recommend,price,color,super,compliment,box,qualiti,quick,high
Topic3,nice,work,perfect,expect,time,fit,well,wear,easi,light,littl,look,hand,small,exact
Topic4,band,look,one,like,time,day,face,watch,get,love,would,wear,great,nice,size
