Topic modeling is an unsupervised technique that analyzes large volumes of text by clustering documents into groups with similar characteristics.

Latent Dirichlet Allocation (LDA)

LDA is based on two general assumptions:

    Documents that contain similar words usually share a topic.
    Documents in which the same groups of words frequently occur together usually share a topic.

These assumptions are reasonable because documents on the same topic tend to use the same vocabulary: documents about business, for instance, will contain words such as "economy", "profit", "stock market", and "loss".
By the second assumption, if these words frequently occur together across multiple documents, those documents likely belong to the same category.

Mathematically, the above two assumptions can be represented as:

    Documents are probability distributions over latent topics
    Topics are probability distributions over words
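
To make the two distributions concrete, here is a minimal sketch of the generative story that LDA assumes, written with NumPy. The sizes, the hyperparameter values (alpha, eta), and all variable names are illustrative assumptions, not values taken from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and Dirichlet hyperparameters (assumptions for this sketch)
n_topics, vocab_size, doc_length = 3, 8, 20
alpha, eta = 0.5, 0.5

# Each topic is a probability distribution over the vocabulary.
topic_word = rng.dirichlet(np.full(vocab_size, eta), size=n_topics)

# Each document is a probability distribution over the topics.
doc_topic = rng.dirichlet(np.full(n_topics, alpha))

# Generate one document: draw a topic for every word slot,
# then draw a word ID from that topic's word distribution.
topic_per_word = rng.choice(n_topics, size=doc_length, p=doc_topic)
words = np.array([rng.choice(vocab_size, p=topic_word[z]) for z in topic_per_word])

print(doc_topic)
print(words)
```

Fitting LDA runs this story in reverse: given only the observed words, it estimates the document-topic and topic-word distributions.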

In [1]:
import os
import pandas as pd
import numpy as np

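# Raw string so the backslashes in the Windows path are not treated as escape sequences
os.chdir(r"D:\Choogle\Data\dataset_review")
reviews_datasets = pd.read_csv('dataset2_london.csv')
reviews_datasets = reviews_datasets.head(20000)  # keep at most the first 20,000 rows
# dropna() returns a new DataFrame; the result is only displayed here, so reviews_datasets itself is unchanged
reviews_datasets.dropna()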

Unnamed: 0,crayon_review_id,crayon_user_id,crayon_product_id,domain,url,type,category,date_created,gid,key,...,user_id,user_name,user_location_text,user_city,user_country,user_total_reviews,user_total_reviews_range,user_helpful_reviews,user_helpful_reviews_range,partition_0


In [2]:
reviews_datasets.head()

Unnamed: 0,crayon_review_id,crayon_user_id,crayon_product_id,domain,url,type,category,date_created,gid,key,...,user_id,user_name,user_location_text,user_city,user_country,user_total_reviews,user_total_reviews_range,user_helpful_reviews,user_helpful_reviews_range,partition_0
0,RR-202001000-552550429,RU-302001000-806642411,R-102001000-202456154,item_1_saturam_restaurant_review_metadata_info,https://item_1_saturam_restaurant_review_metad...,Reviews,Restaurants,2019-04-11,item_1_saturam_restaurant_review_metadata_info...,ef65c514583b9254dbd8f794c516aaca21c864076d59e3...,...,732393D49A0A1F0B63E5037F07DCDA04,Stevetarn2014,"London, United Kingdom",London,United Kingdom,423,101 to 500,1030,1001 to 5000,item_1_restaurant_review_20190609_v10_stage4_c...
1,RR-202001000-550793573,RU-302001000-802600320,R-102001000-202456154,item_1_saturam_restaurant_review_metadata_info,https://item_1_saturam_restaurant_review_metad...,Reviews,Restaurants,2019-04-11,item_1_saturam_restaurant_review_metadata_info...,e766851ea18b154b154e419691d4ee7eeb4db9d80333f4...,...,2D0CF2077ADEDA59A706B7E1E3DE26E6,T35BZcristinag,,,,141,101 to 500,9,1 to 100,item_1_restaurant_review_20190609_v10_stage4_c...
2,RR-202001000-533398566,RU-302001000-802928560,R-102001000-200371326,item_1_saturam_restaurant_review_metadata_info,https://item_1_saturam_restaurant_review_metad...,Reviews,Restaurants,2019-04-12,item_1_saturam_restaurant_review_metadata_info...,982dced88fd267e522b3591a402d2a3f5d3cb8bc783bdd...,...,32BF09E09D5417A78AC008B76F91EF11,Q1487BW,"Bristol, United Kingdom",Bristol,United Kingdom,8,1 to 100,2,1 to 100,item_1_restaurant_review_20190609_v10_stage4_c...
3,RR-202001000-556167355,RU-302001000-803691542,R-102001000-200371326,item_1_saturam_restaurant_review_metadata_info,https://item_1_saturam_restaurant_review_metad...,Reviews,Restaurants,2019-04-12,item_1_saturam_restaurant_review_metadata_info...,ffdeac1f47ee1b9e21e3bf54869d87e1b9ab7b9a55d02f...,...,3FF79B747F3F4A0FA6BD325DA30ECDFE,Ian M,,,,6,1 to 100,1,1 to 100,item_1_restaurant_review_20190609_v10_stage4_c...
4,RR-202001000-545446145,RU-302001000-809584645,R-102001000-200371326,item_1_saturam_restaurant_review_metadata_info,https://item_1_saturam_restaurant_review_metad...,Reviews,Restaurants,2019-04-12,item_1_saturam_restaurant_review_metadata_info...,cf0e41f130196d008f51f301712f0f48e572472e4532a1...,...,A62A1443262F2B60BCFDA5903E663B03,gillank,"Hanoi, Vietnam",Hanoi,Vietnam,140,101 to 500,41,1 to 100,item_1_restaurant_review_20190609_v10_stage4_c...


In [3]:
reviews_datasets['text'][350]

'I tried a selection of sushi and passion martini everything was cooked well, tasty and nicely presented. The barman was very friendly he help me with the menu as well. The Manager Laura so friendly and totally professional. Absolutely perfect'

In [4]:
# Create a vocabulary of all the words in our data using CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(max_df=0.8, min_df=2, stop_words='english')
doc_term_matrix = count_vect.fit_transform(reviews_datasets['text'].values.astype('U'))

# CountVectorizer from sklearn.feature_extraction.text builds the document-term matrix.
# max_df=0.8 drops words that appear in more than 80% of the documents; min_df=2 drops words that appear in fewer than 2 documents.
# stop_words='english' removes common English stop words, which contribute little to topic modeling.
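
The effect of max_df, min_df, and stop_words is easier to see on a tiny corpus. The sketch below uses five made-up sentences (toy_docs and the other names are hypothetical, not part of the review dataset) with the same CountVectorizer settings as above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Five made-up documents, for illustration only
toy_docs = [
    "the food was great",
    "the food was cold",
    "great food and wine",
    "the wine list and food",
    "food arrived late",
]

toy_vect = CountVectorizer(max_df=0.8, min_df=2, stop_words='english')
toy_matrix = toy_vect.fit_transform(toy_docs)

# "food" is in all 5 documents (more than 80%), so max_df drops it.
# "cold", "list", "arrived" and "late" are each in 1 document, so min_df drops them.
# "the", "was" and "and" are English stop words.
print(sorted(toy_vect.vocabulary_))   # remaining vocabulary
print(toy_matrix.toarray())           # one row per document, one column per word
```

Only the words that occur in at least 2 documents but not in nearly all of them survive, which is the band of vocabulary that tends to separate topics.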

In [5]:
# document term matrix
doc_term_matrix

<1500x3619 sparse matrix of type '<class 'numpy.int64'>'
	with 50348 stored elements in Compressed Sparse Row format>

In [6]:
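# Each of the 1500 reviews is represented as a 3619-dimensional vector, which means that our vocabulary has 3619 words.
# The 50348 stored elements are the non-zero entries of the matrix, i.e. the (review, word) pairs where the word occurs in the review.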
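
As a rough guide to reading that summary, the sketch below builds a small sparse matrix by hand and pulls out the same three numbers: rows (documents), columns (vocabulary size), and stored non-zero entries. The toy matrix and variable names are assumptions for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A hand-made 3-document, 4-word count matrix (illustrative values)
toy = csr_matrix(np.array([
    [2, 0, 0, 1],
    [0, 0, 3, 0],
    [0, 1, 0, 0],
]))

n_docs, vocab_size = toy.shape            # rows are documents, columns are words
stored = toy.nnz                          # number of non-zero entries
density = stored / (n_docs * vocab_size)  # fraction of cells that are non-zero
avg_terms_per_doc = stored / n_docs       # distinct vocabulary words per document

print(n_docs, vocab_size, stored, density, avg_terms_per_doc)
```

Applied to the matrix above, the same arithmetic gives 50348 / 1500, or about 34 distinct vocabulary words per review, and a density a little under 1%.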

In [7]:
from sklearn.decomposition import LatentDirichletAllocation

LDA = LatentDirichletAllocation(n_components=10, random_state=42)
LDA.fit(doc_term_matrix)

# The LatentDirichletAllocation class from sklearn.decomposition performs LDA on our document-term matrix.
# The n_components parameter specifies the number of categories, or topics, that we want our text to be divided into.
# The random_state parameter (the seed) is set to 42 so that the results are reproducible.

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=10, n_jobs=None,
                          perp_tol=0.1, random_state=42, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)

In [8]:
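import random

# get_feature_names() returns the vocabulary as a list indexed by word ID
# (scikit-learn 1.0 added get_feature_names_out(); get_feature_names() was removed in 1.2)
feature_names = count_vect.get_feature_names()

for i in range(10):
    # randrange excludes the upper bound; randint(0, len(...)) could return an out-of-range index
    random_id = random.randrange(len(feature_names))
    print(feature_names[random_id])

# Randomly fetch 10 words from our vocabulary.
# The count vectorizer holds every word in the vocabulary.
# get_feature_names() returns them as a list, which we index with the ID of the word we want to fetch.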

quot
luke
focus
finest
underwhelming
tanner
production
brioche
exquisite
ran


In [9]:
#10 words with the highest probability for the first topic. 
#To get the first topic, we use the components_ attribute and pass a 0 index as the value:

first_topic = LDA.components_[0]

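# first_topic holds one weight for each of the 3619 vocabulary words.
# These weights are unnormalized pseudo-counts; dividing them by their sum gives the word probabilities for the first topic (topic #0).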
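
A minimal sketch of that normalization, assuming a small random count matrix in place of the real document-term matrix (toy_counts, toy_lda, and the sizes are hypothetical):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
toy_counts = rng.integers(0, 5, size=(20, 6))   # 20 toy documents, 6 vocabulary words

toy_lda = LatentDirichletAllocation(n_components=3, random_state=42)
toy_lda.fit(toy_counts)

weights = toy_lda.components_                    # shape (n_topics, n_words), unnormalized
# Divide each row by its sum so every topic becomes a probability distribution over words
topic_word_probs = weights / weights.sum(axis=1, keepdims=True)

print(weights.shape)
print(topic_word_probs.sum(axis=1))
```

Normalizing does not change the ranking of words within a topic, so sorting the raw weights and sorting the probabilities pick out the same top words.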

In [10]:
# argsort() returns the indexes that would sort the array in ascending order of its values.
# The 10 words with the highest values are therefore at the last 10 positions of the sorted index array.
# The following script returns the indexes of those 10 words:

top_topic_words = first_topic.argsort()[-10:]
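
The slicing is easy to check on a tiny array. This sketch uses five made-up weights (toy_weights is a hypothetical name) and also shows how to list the top words from highest to lowest, since the slice alone leaves them in ascending order.

```python
import numpy as np

toy_weights = np.array([0.1, 0.5, 0.2, 0.9, 0.3])  # made-up weights for 5 words

ascending = toy_weights.argsort()      # indexes from lowest to highest weight
top_two = ascending[-2:]               # the two highest, still in ascending order
top_two_desc = ascending[-2:][::-1]    # the same two, highest first

print(ascending, top_two, top_two_desc)
```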

In [11]:
# These indexes can then be used to retrieve the corresponding words from the count_vect object
for i in top_topic_words:
    print(count_vect.get_feature_names()[i])

place
amazing
really
staff
atmosphere
wine
service
good
great
food


In [12]:
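feature_names = count_vect.get_feature_names()
for i, topic in enumerate(LDA.components_):
    print(f'Top 30 words for topic #{i}:')
    # argsort() is ascending, so the last 30 indexes are the 30 highest-weighted words
    print([feature_names[word_id] for word_id in topic.argsort()[-30:]])
    print('\n')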

Top 30 words for topic #0:
['night', 'just', 'bit', 've', 'visit', 'selection', 'time', 'nice', 'dishes', 'definitely', 'evening', 'like', 'wait', 'menu', 'busy', 'tasty', 'table', 'friendly', 'excellent', 'restaurant', 'place', 'amazing', 'really', 'staff', 'atmosphere', 'wine', 'service', 'good', 'great', 'food']


Top 30 words for topic #1:
['lunch', 'little', 'did', 'decided', 'try', 'enjoyed', 'nice', 'nique', 'really', 'pique', 'dinner', 'tables', 'friends', 'time', 'meal', 'like', 'amp', 'order', 'french', 'place', 'menu', 'just', 'wine', 'table', 'service', 'good', 'bar', 'staff', 'restaurant', 'food']


Top 30 words for topic #2:
['amazing', 'don', 'list', 'deal', 'night', 'did', 'let', 'restaurant', 'came', 'ordering', 'think', 'meal', 'visited', 'ordered', 'table', 'great', 'special', 'staff', 'make', 'really', 'nice', 'small', 'wines', 'price', 'just', 'place', 'experience', 'wine', 'service', 'food']


Top 30 words for topic #3:
['came', 'wahaca', 'delicious', 'clean', 'de

In [13]:
# Add a column to the original data frame that stores the topic assigned to each review.
# To do so, we pass the document-term matrix to the LDA.transform() method.
# It returns, for each document, a probability for every topic.
topic_values = LDA.transform(doc_term_matrix)
topic_values.shape

(1500, 10)

In [14]:
# Add a Topic column to the data frame holding, for each row, the index of its most probable topic
reviews_datasets['Topic'] = topic_values.argmax(axis=1)
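
Taking the argmax keeps only the single most probable topic and discards how strongly each document belongs to it. The sketch below, on a made-up document-topic matrix (toy_topic_values is hypothetical), shows one way to keep that strength alongside the label.

```python
import numpy as np

# Made-up topic probabilities for 3 documents and 3 topics; each row sums to 1
toy_topic_values = np.array([
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],
    [0.05, 0.15, 0.80],
])

dominant_topic = toy_topic_values.argmax(axis=1)   # index of the most probable topic per document
dominant_weight = toy_topic_values.max(axis=1)     # probability of that topic

print(dominant_topic)
print(dominant_weight)
```

The second toy document is labeled topic 0 even though its probabilities are nearly uniform, so storing the weight as an extra column can help flag reviews whose topic assignment is weak.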

In [15]:
reviews_datasets.head()

Unnamed: 0,crayon_review_id,crayon_user_id,crayon_product_id,domain,url,type,category,date_created,gid,key,...,user_name,user_location_text,user_city,user_country,user_total_reviews,user_total_reviews_range,user_helpful_reviews,user_helpful_reviews_range,partition_0,Topic
0,RR-202001000-552550429,RU-302001000-806642411,R-102001000-202456154,item_1_saturam_restaurant_review_metadata_info,https://item_1_saturam_restaurant_review_metad...,Reviews,Restaurants,2019-04-11,item_1_saturam_restaurant_review_metadata_info...,ef65c514583b9254dbd8f794c516aaca21c864076d59e3...,...,Stevetarn2014,"London, United Kingdom",London,United Kingdom,423,101 to 500,1030,1001 to 5000,item_1_restaurant_review_20190609_v10_stage4_c...,0
1,RR-202001000-550793573,RU-302001000-802600320,R-102001000-202456154,item_1_saturam_restaurant_review_metadata_info,https://item_1_saturam_restaurant_review_metad...,Reviews,Restaurants,2019-04-11,item_1_saturam_restaurant_review_metadata_info...,e766851ea18b154b154e419691d4ee7eeb4db9d80333f4...,...,T35BZcristinag,,,,141,101 to 500,9,1 to 100,item_1_restaurant_review_20190609_v10_stage4_c...,2
2,RR-202001000-533398566,RU-302001000-802928560,R-102001000-200371326,item_1_saturam_restaurant_review_metadata_info,https://item_1_saturam_restaurant_review_metad...,Reviews,Restaurants,2019-04-12,item_1_saturam_restaurant_review_metadata_info...,982dced88fd267e522b3591a402d2a3f5d3cb8bc783bdd...,...,Q1487BW,"Bristol, United Kingdom",Bristol,United Kingdom,8,1 to 100,2,1 to 100,item_1_restaurant_review_20190609_v10_stage4_c...,4
3,RR-202001000-556167355,RU-302001000-803691542,R-102001000-200371326,item_1_saturam_restaurant_review_metadata_info,https://item_1_saturam_restaurant_review_metad...,Reviews,Restaurants,2019-04-12,item_1_saturam_restaurant_review_metadata_info...,ffdeac1f47ee1b9e21e3bf54869d87e1b9ab7b9a55d02f...,...,Ian M,,,,6,1 to 100,1,1 to 100,item_1_restaurant_review_20190609_v10_stage4_c...,1
4,RR-202001000-545446145,RU-302001000-809584645,R-102001000-200371326,item_1_saturam_restaurant_review_metadata_info,https://item_1_saturam_restaurant_review_metad...,Reviews,Restaurants,2019-04-12,item_1_saturam_restaurant_review_metadata_info...,cf0e41f130196d008f51f301712f0f48e572472e4532a1...,...,gillank,"Hanoi, Vietnam",Hanoi,Vietnam,140,101 to 500,41,1 to 100,item_1_restaurant_review_20190609_v10_stage4_c...,8


In [16]:
import pyLDAvis.sklearn
panel = pyLDAvis.sklearn.prepare(LDA, doc_term_matrix, count_vect, mds='tsne')
pyLDAvis.display(panel)

