# What is Topic Modeling?

Topic modeling is an unsupervised technique for analyzing large volumes of text data by clustering documents into groups. The text data have no labels attached; instead, topic modeling groups the documents into clusters based on similar characteristics.

A typical example of topic modeling is clustering a large number of newspaper articles so that articles in the same category, that is, with the same topic, end up together. It is important to mention that evaluating the performance of topic modeling is extremely difficult, since there are no right answers. It is up to the user to find the characteristics shared by the documents in a cluster and assign the cluster an appropriate label or topic.
    
Two approaches are mainly used for topic modeling: **Latent Dirichlet Allocation** and **Non-Negative Matrix Factorization**.

# Latent Dirichlet Allocation (LDA)

**LDA is based upon two general assumptions:**

* Documents that have similar words usually have the same topic.
* Documents that have groups of words frequently occurring together usually have the same topic.

These assumptions make sense because documents that share a topic, for instance business, will contain words like "economy", "profit", "stock market", and "loss". The second assumption states that if these words frequently occur together in multiple documents, those documents may belong to the same category.

**Mathematically, the above two assumptions can be represented as:**

* Documents are probability distributions over latent topics.
* Topics are probability distributions over words.
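
To make these two statements concrete, here is a minimal sketch with made-up numbers. The vocabulary, the two topics, and all probabilities below are illustrative assumptions, not values learned from the data set. It shows how the probability of a word in a document follows from mixing the topic-word distributions according to the document's topic proportions.

```python
import numpy as np

# Hypothetical 4-word vocabulary and 2 topics (illustrative values only)
vocabulary = ["profit", "stock", "goal", "match"]

# Each row is a topic: a probability distribution over the vocabulary
topic_word = np.array([
    [0.50, 0.40, 0.05, 0.05],  # a "business"-like topic
    [0.05, 0.05, 0.40, 0.50],  # a "sports"-like topic
])

# One document: a probability distribution over the two topics
doc_topic = np.array([0.8, 0.2])

# Probability of each word in the document:
# sum over topics of P(topic | document) * P(word | topic)
word_probs = doc_topic @ topic_word

for word, prob in zip(vocabulary, word_probs):
    print(f"{word}: {prob:.2f}")
```

Because both inputs are probability distributions, the resulting word probabilities also sum to 1. A document that is mostly about the "business"-like topic ends up favoring "profit" and "stock".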


# LDA for Topic Modeling in Python

---



In this section we will see how Python can be used to implement LDA for topic modeling. The data set can be downloaded from Kaggle.

The data set contains user reviews for different products in the food category. We will use LDA to group the user reviews into 5 categories.

**The first step, as always, is to import the data set:**

In [0]:
import pandas as pd  
import numpy as np

#reviews_datasets = pd.read_csv(r'E:\Datasets\Reviews.csv')
reviews_datasets = pd.read_csv(r'Reviews_c1.csv')
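reviews_datasets = reviews_datasets.head(1000)
# dropna() returns a new DataFrame, so the result must be assigned back;
# only the review text is needed here, so drop rows where it is missing
reviews_datasets = reviews_datasets.dropna(subset=['Text'])
reviews_datasets.head()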



Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1.0,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1.0,1.0,5.0,1303862000.0,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2.0,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0.0,0.0,1.0,1346976000.0,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3.0,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1.0,1.0,4.0,1219018000.0,"""Delight"" says it all",This is a confection that has been around a fe...
3,4.0,B000UA0QIQ,A395BORC6FGVXV,Karl,3.0,3.0,2.0,1307923000.0,Cough Medicine,If you are looking for the secret ingredient i...
4,5.0,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0.0,0.0,5.0,1350778000.0,Great taffy,Great taffy at a great price. There was a wid...


In [0]:
reviews_datasets['Text'][350]

'These chocolate covered espresso beans are wonderful!  The chocolate is very dark and rich and the "bean" inside is a very delightful blend of flavors with just enough caffine to really give it a zing.'

**Before we can apply LDA**, we need to create a vocabulary of all the words in our data. Remember from the previous article that we can do so with the help of a count vectorizer. Look at the following script:

In [0]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(max_df=0.8, min_df=2, stop_words='english')  
doc_term_matrix = count_vect.fit_transform(reviews_datasets['Text'].values.astype('U'))  

In the script above we use the ***CountVectorizer*** class from the ***sklearn.feature_extraction.text*** module to create a document-term matrix. We include only those words that appear in at most 80% of the documents and in at least 2 documents. We also remove all the stop words, as they do not really contribute to topic modeling.
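
The effect of the two frequency thresholds is easier to see on a tiny corpus. The following is a minimal sketch with four made-up documents (not taken from the reviews data set) and the same `max_df` and `min_df` values as above:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration
docs = [
    "apple banana",
    "apple cherry",
    "apple banana date",
    "apple cherry",
]

vectorizer = CountVectorizer(max_df=0.8, min_df=2)
matrix = vectorizer.fit_transform(docs)

# "apple" is in 4 of 4 documents (more than 80%), so it is dropped;
# "date" is in only 1 document (fewer than 2), so it is dropped too.
kept_words = sorted(vectorizer.vocabulary_)
print(kept_words)
print(matrix.shape)
```

Only "banana" and "cherry" survive both filters, so the resulting document-term matrix has 4 rows and 2 columns.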

**Now let's look at our document term matrix:**

In [0]:
doc_term_matrix 

<1000x2697 sparse matrix of type '<class 'numpy.int64'>'
	with 25469 stored elements in Compressed Sparse Row format>

Each of the 1,000 documents is represented as a 2,697-dimensional vector, which means that our vocabulary has 2,697 words.

**Next, we will use LDA to create topics along with the probability distribution for each word in our vocabulary for each topic:**

In [0]:
from sklearn.decomposition import LatentDirichletAllocation

LDA = LatentDirichletAllocation(n_components=5, random_state=42)  
LDA.fit(doc_term_matrix)  

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=5, n_jobs=None,
                          perp_tol=0.1, random_state=42, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)

In the script above we use the ***LatentDirichletAllocation*** class from the ***sklearn.decomposition*** module to perform LDA on our document-term matrix. The parameter *n_components* specifies the number of categories, or topics, that we want our text to be divided into. The parameter *random_state* (the seed) is set to 42 so that you get results similar to mine.

Let's randomly fetch words from our vocabulary. The count vectorizer holds every word in the vocabulary: its get_feature_names() method returns them as a list, which we can index with the ID of the word we want to fetch. (In scikit-learn 1.0 and later, the method is named get_feature_names_out().)


**The following script randomly fetches 10 words from our vocabulary:**

In [0]:
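import random

# Fetch the vocabulary once instead of rebuilding the list on every iteration
# (use get_feature_names_out() in scikit-learn 1.0 and later)
feature_names = count_vect.get_feature_names()

for i in range(10):
    # randint() includes both endpoints, so subtract 1 to stay within the list
    random_id = random.randint(0, len(feature_names) - 1)
    print(feature_names[random_id])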

luckily
vary
handy
8oz
compete
formula
middle
salty
saltiness
arrives


**Let's find the 10 words with the highest probability for the first topic. To get the first topic, index the components_ attribute with 0:**

In [0]:
first_topic = LDA.components_[0]  
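
One detail worth noting: in scikit-learn, the rows of components_ hold unnormalized weights (pseudo-counts) rather than probabilities. Dividing each row by its sum turns it into a probability distribution over the words, and the ranking of words within a topic stays the same. The sketch below shows this on a small made-up count matrix; the names `toy_counts` and `toy_lda` are illustrative and separate from the model fitted above.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Tiny made-up document-term count matrix: 4 documents, 5 words
toy_counts = np.array([
    [3, 2, 0, 0, 1],
    [2, 3, 0, 1, 0],
    [0, 0, 4, 2, 1],
    [0, 1, 3, 3, 0],
])

toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)
toy_lda.fit(toy_counts)

# Divide each topic's row by its sum so that it becomes a
# probability distribution over the words
row_sums = toy_lda.components_.sum(axis=1, keepdims=True)
topic_word_probs = toy_lda.components_ / row_sums

print(topic_word_probs.shape)
print(topic_word_probs.sum(axis=1))
```

After normalization each row sums to 1, so the values can be read directly as word probabilities for that topic.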


The first topic contains a weight for each of the 2,697 words in our vocabulary; the higher the weight, the more probable the word is under topic 1. To sort the indexes according to these values, we can use the argsort() function. Because argsort() sorts in ascending order, the 10 words with the highest probabilities are at the last 10 indexes of the sorted array.

**The following script returns the indexes of the 10 words with the highest probabilities:**

In [0]:
top_topic_words = first_topic.argsort()[-10:]
top_topic_words

array([2440,  741, 1049,  352, 1914, 1076, 2383,  300, 1373,  963])
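
To see why the last entries of the argsort() result are the ones we want, here is a minimal sketch on a made-up array of five weights (the values are illustrative only):

```python
import numpy as np

# Made-up weights for a 5-word vocabulary
weights = np.array([0.1, 0.5, 0.2, 0.9, 0.3])

# argsort() returns the indexes that would sort the array in ascending order
ascending = weights.argsort()

# The last 3 indexes point to the 3 largest values (smallest of them first)
top_3 = ascending[-3:]

# Reverse the slice to list the index of the largest value first
top_3_descending = top_3[::-1]

print(ascending)         # [0 2 4 1 3]
print(top_3)             # [4 1 3]
print(top_3_descending)  # [3 1 4]
```

The same ordering applies to the output above: the index at the end of the array belongs to the word with the highest weight in the topic.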

These indexes can then be used to retrieve the corresponding words from the count_vect object.

**This can be done as follows:**

In [0]:
for i in top_topic_words:  
    print(count_vect.get_feature_names()[i])

time
dog
good
buy
really
great
taste
br
like
food


**Let's print the 10 words with the highest probabilities for all five topics:**

In [0]:
for i, topic in enumerate(LDA.components_):
    print(f'Top 10 words for topic #{i}:')
    # use a separate name for the word index so it does not shadow the topic number i
    print([count_vect.get_feature_names()[index] for index in topic.argsort()[-10:]])
    print('\n')

Top 10 words for topic #0:
['time', 'dog', 'good', 'buy', 'really', 'great', 'taste', 'br', 'like', 'food']


Top 10 words for topic #1:
['sugar', 'price', 'taste', 'good', 'coffee', 'just', 'tea', 'product', 'like', 'br']


Top 10 words for topic #2:
['taste', 'instant', 'sugar', 'mix', 'oatmeal', 'like', 'br', 'good', 'hot', 'great']


Top 10 words for topic #3:
['little', 'use', 'eat', 'like', 'product', 'flavor', 'love', 'best', 'tea', 'good']


Top 10 words for topic #4:
['taste', 'chip', 'salt', 'potato', 'like', 'bag', 'flavor', 'kettle', 'br', 'chips']




The output suggests that the second topic (#1) might contain reviews about coffee and tea, the third topic (#2) reviews about instant oatmeal, and the fifth topic (#4) reviews about potato chips. You can see that a few words are common to all the categories. This is because some words, such as "good", "great", and "like", are used in reviews of almost every kind of product. The token "br" is most likely a remnant of HTML line-break tags in the review text rather than a real word.
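
One possible way to keep such uninformative tokens out of the topics is to extend the stop word list before building the document-term matrix. The sketch below is only an illustration on two made-up reviews; it assumes that "br" comes from `<br />` tags and adds it to scikit-learn's built-in English stop words. The names `toy_reviews` and `custom_stop_words` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Made-up reviews containing HTML line-break tags
toy_reviews = [
    "Great coffee.<br />Strong taste.",
    "Fresh tea.<br />Great smell.",
]

# Default tokenization keeps "br" as a word
default_vocab = CountVectorizer().fit(toy_reviews).vocabulary_

# Extend the built-in English stop words with a corpus-specific token
custom_stop_words = list(ENGLISH_STOP_WORDS.union({"br"}))
filtered_vocab = CountVectorizer(stop_words=custom_stop_words).fit(toy_reviews).vocabulary_

print("br" in default_vocab)
print("br" in filtered_vocab)
```

The same list could be passed as the stop_words argument of the vectorizer used earlier, at the cost of having to refit both the vectorizer and the LDA model.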

As a final step, we will add a column to the original data frame that will store the topic for the text. To do so, we can use the LDA.transform() method and pass it our document-term matrix. This method returns, for each document, the probability of every topic.

**Look at the following code:**

In [0]:
topic_values = LDA.transform(doc_term_matrix)  
topic_values.shape  

(1000, 5)

The output (1000, 5) means that each of the 1,000 documents has 5 columns, where each column holds the probability of a particular topic. To find the index of the topic with the maximum value, we can call the argmax() method and pass 1 as the value of the axis parameter.
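
As a small illustration of what argmax() with axis=1 does, consider the following sketch. The matrix is made up for three documents and five topics and is not the output of the model above:

```python
import numpy as np

# Made-up topic probabilities for 3 documents and 5 topics (each row sums to 1)
toy_topic_values = np.array([
    [0.10, 0.60, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.05, 0.05, 0.80],
    [0.30, 0.20, 0.40, 0.05, 0.05],
])

# axis=1 searches across the columns of each row and returns
# the index of the most probable topic for every document
dominant_topic = toy_topic_values.argmax(axis=1)

print(dominant_topic)  # [1 4 2]
```

Each document is assigned the single topic with the highest probability, so any information about secondary topics is discarded at this step.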

**The following script adds a new column for topic in the data frame and assigns the topic value to each row in the column:**

In [0]:
reviews_datasets['Topic'] = topic_values.argmax(axis=1)  

**Let's now see how the data set looks:**

In [0]:
reviews_datasets.head()  

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,Topic
0,1.0,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1.0,1.0,5.0,1303862000.0,Good Quality Dog Food,I have bought several of the Vitality canned d...,1
1,2.0,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0.0,0.0,1.0,1346976000.0,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...,4
2,3.0,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1.0,1.0,4.0,1219018000.0,"""Delight"" says it all",This is a confection that has been around a fe...,3
3,4.0,B000UA0QIQ,A395BORC6FGVXV,Karl,3.0,3.0,2.0,1307923000.0,Cough Medicine,If you are looking for the secret ingredient i...,3
4,5.0,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0.0,0.0,5.0,1350778000.0,Great taffy,Great taffy at a great price. There was a wid...,2
