# LDA Topic Modelling

Alongside KMeans clustering, we also want to cluster the reviews using LDA topic modelling. The key difference between the two methods is that LDA groups reviews into topics by looking solely at **text data**, which in this case is the review text. In contrast, KMeans can cluster the reviews on **all features**: the tokenized text plus the other numeric features.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import LatentDirichletAllocation as LDA

In [2]:
from nltk.corpus import stopwords 
ENGLISH_STOP_WORDS = stopwords.words('english')
from nltk.stem import WordNetLemmatizer
import string
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
#bringing in just the reviewText column from the dataset (requires the custom functions library)
import functions_library as fl
review_text = fl.cleanDF(fl.createPdDF('All_Beauty.json.gz'))['reviewText']

In [4]:
#make sure it's loaded in properly
review_text

0                                                     great
1         My  husband wanted to reading about the Negro ...
2         This book was very informative, covering all a...
3         I am already a baseball fan and knew a bit abo...
4         This was a good story of the Black leagues. I ...
                                ...                        
362247    It was awful. It was super frizzy and I tried ...
362248    I was skeptical about buying this.  Worried it...
362249                             Makes me look good fast.
362250    Way lighter than photo\nNot mix blend of color...
362251    No return instructions/phone # in packaging.  ...
Name: reviewText, Length: 362252, dtype: object

#### TF-IDF vectorization

In [5]:
#using same settings used for KMeans clustering to be consistent
vectorizer = TfidfVectorizer(min_df = 1000, tokenizer = fl.spl_tokenizer, ngram_range = (1,2))

In [6]:
#get tokens from reviewText
word_matrix = vectorizer.fit_transform(review_text)
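
To make the effect of `min_df` and `ngram_range` concrete, here is a minimal sketch on a made-up three-review corpus. It assumes scikit-learn's default tokenizer instead of `fl.spl_tokenizer` (which is not shown in this notebook) and a much smaller `min_df`, so it illustrates the settings rather than reproducing the cell above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up for illustration)
docs = ["smell good", "smell good today", "fast shipping"]

# min_df=2 keeps only terms found in at least 2 documents;
# ngram_range=(1, 2) adds two-word phrases alongside single words
toy_vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 2))
toy_matrix = toy_vectorizer.fit_transform(docs)

print(sorted(toy_vectorizer.vocabulary_))  # ['good', 'smell', 'smell good']
print(toy_matrix.shape)                    # (3, 3)
```

Terms that appear in only one review ("today", "fast shipping") are dropped, which is the same filtering `min_df = 1000` applies at the scale of the full dataset.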

In [7]:
# Helper function
def print_topics(model, vectorizer, n_top_words):
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs 1.0+
    words = vectorizer.get_feature_names_out()
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(",".join([words[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))
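
The slice `topic.argsort()[:-n_top_words - 1:-1]` in the helper returns the indices of the `n_top_words` largest weights, largest first. A minimal sketch with made-up weights (the word list and array below are illustrative, not taken from the model):

```python
import numpy as np

# Hypothetical topic-word weights for a 4-word vocabulary (illustrative values only)
words = ["smell", "good", "price", "teeth"]
topic = np.array([0.1, 0.5, 0.2, 0.9])
n_top_words = 2

# argsort() orders indices from smallest to largest weight;
# the slice walks backwards from the end and keeps the last n_top_words indices
top_idx = topic.argsort()[:-n_top_words - 1:-1]
top_words = [words[i] for i in top_idx]
print(top_words)  # ['teeth', 'good']
```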

## LDA Topic Modelling with 25 Topics
For LDA topic modelling, we need to pre-select the number of topics we think exist in the text. To be consistent with KMeans clustering, where we selected 25 clusters, we choose 25 topics. Note: this is not necessarily the best way to determine the number of topics, and it can be improved in future iterations.
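
One possible improvement is to compare several candidate topic counts on held-out reviews. The sketch below is only an illustration under stated assumptions: it uses a made-up toy corpus, raw counts from `CountVectorizer` rather than the TF-IDF matrix used in this notebook, and held-out perplexity as the score. The candidate values `(2, 4, 8)` are arbitrary.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus standing in for the review text (made up for illustration)
docs = [
    "fast shipping great price", "quick delivery good deal",
    "nail polish chipped fast", "love this nail polish colour",
    "deodorant smells good all day", "natural deodorant works well",
    "water flosser cleans teeth", "dentist recommended this floss",
] * 5

counts = CountVectorizer().fit_transform(docs)
train, test = train_test_split(counts, test_size=0.25, random_state=0)

# Fit one model per candidate topic count and score it on held-out documents
perplexity_by_k = {}
for k in (2, 4, 8):
    model = LatentDirichletAllocation(n_components=k, random_state=0)
    model.fit(train)
    perplexity_by_k[k] = model.perplexity(test)  # lower is better

best_k = min(perplexity_by_k, key=perplexity_by_k.get)
print(perplexity_by_k, best_k)
```

Perplexity is only one criterion; reading the top words for each candidate model, as done later in this notebook, remains a useful sanity check.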

In [8]:
# Setting number of topics and also the top number of words we want to see from the model
number_topics = 25
number_words = 15

In [9]:
# Create and fit the LDA model
lda = LDA(n_components=number_topics, n_jobs=4, verbose=1)
lda.fit(word_matrix)

iteration: 1 of max_iter: 10
iteration: 2 of max_iter: 10
iteration: 3 of max_iter: 10
iteration: 4 of max_iter: 10
iteration: 5 of max_iter: 10
iteration: 6 of max_iter: 10
iteration: 7 of max_iter: 10
iteration: 8 of max_iter: 10
iteration: 9 of max_iter: 10
iteration: 10 of max_iter: 10


LatentDirichletAllocation(n_components=25, n_jobs=4, verbose=1)

In [10]:
# Print the topics found by the LDA model
print("Topics found via LDA:")
print_topics(lda, vectorizer, number_words)

Topics found via LDA:

Topic #0:
good,deodorant,smell,smell good,natural,work,really good,taste,really,natural deodorant,tried,odor,far,like,day

Topic #1:
using,product,week,result,difference,day,skin,see,eye,use,ive,cream,used,month,every

Topic #2:
fast,amazing,thanks,shipping,delivery,fast shipping,quick,product,bad,super,shipped,wait,item,delivered,smell amazing

Topic #3:
great,work,work great,price,great price,good price,value,good,job,deal,look great,item,great job,look,buy

Topic #4:
nail,recommend,described,polish,highly,highly recommend,advertised,would recommend,would,product,anyone,recommend product,coat,recommend anyone,nail polish

Topic #5:
long,last,last long,long time,time,absolutely,little,love,absolutely love,way,go,lash,go long,long way,use

Topic #6:
teeth,water,floss,waterpik,gum,use,dentist,mouth,clean,flossing,one,dental,get,using,toothbrush

Topic #7:
love,stuff,fit,love stuff,perfectly,fine,wife,love smell,husband,comfortable,work fine,son,smell,work,everythi

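The lists above show the top words for each topic. The results are fairly coherent: some topics correspond to specific product types, such as topic 0 (deodorant), topic 4 (nail polish) and topic 6 (dental care), while others relate to logistics and value, such as topic 2 (shipping) and topic 3 (price). Note that topic numbers can change from run to run, because the model above is fitted without a fixed `random_state`.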
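
To use the topics as clusters, each review can be assigned to its highest-weighted topic. The sketch below shows the idea on a made-up toy corpus with 2 topics; `toy_lda`, `toy_vectorizer` and the documents are illustrative names and data, and in this notebook the equivalent call would be `lda.transform(word_matrix)`.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up for illustration)
docs = [
    "fast shipping quick delivery", "shipping was fast",
    "nail polish chipped", "love this nail polish",
    "quick delivery great price", "polish colour is lovely",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(toy_matrix)

# transform() returns one row per document; each row is a distribution over topics
doc_topic = toy_lda.transform(toy_matrix)

# The dominant topic is the column with the largest weight in each row
dominant_topic = doc_topic.argmax(axis=1)
print(np.bincount(dominant_topic, minlength=2))  # number of documents per topic
```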

In [11]:
import joblib

In [12]:
#saving model to computer
joblib.dump(lda,'lda_25.pkl')

['lda_25.pkl']

In [13]:
#use this line if you need to load the model back into the notebook
lda = joblib.load('lda_25.pkl')
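
A reloaded model is only useful together with the fitted vectorizer, since both `print_topics` and `transform` on new text depend on the same vocabulary. A minimal sketch of saving the two together, using a toy corpus and a hypothetical file name (`lda_bundle.pkl`); this is one possible layout, not what the notebook above does:

```python
import os
import tempfile

import joblib
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up for illustration)
docs = ["fast shipping", "nail polish chipped", "quick delivery", "love this polish"]

toy_vectorizer = TfidfVectorizer().fit(docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_lda.fit(toy_vectorizer.transform(docs))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "lda_bundle.pkl")  # hypothetical file name
    # Store model and vectorizer in one file so they cannot drift apart
    joblib.dump({"lda": toy_lda, "vectorizer": toy_vectorizer}, path)
    bundle = joblib.load(path)

# Score unseen text with the reloaded pair
new_topics = bundle["lda"].transform(bundle["vectorizer"].transform(["fast shipping"]))
print(new_topics.shape)  # (1, 2)
```

In this notebook the vectorizer references `fl.spl_tokenizer`, so `functions_library` would need to be importable wherever the file is loaded.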