# Topic Modelling

**References:**

https://monkeylearn.com/blog/introduction-to-topic-modeling/

https://iq.opengenus.org/topic-modelling-techniques/

https://www.analyticsvidhya.com/blog/2021/07/topic-modelling-with-lda-a-hands-on-introduction/

https://medium.com/voice-tech-podcast/topic-modelling-using-nmf-2f510d962b6e

### Contents:

1. <a href="#Introduction!">Introduction</a>
2. <a href="#How-Does-Topic-Modeling-Work?">How Does Topic Modeling Work?</a>
3. <a href="#Different-Methods-of-Topic-Modeling">Different Methods of Topic Modeling</a>

## Introduction!

Topic modeling is an unsupervised machine learning technique that’s capable of scanning a set of documents, detecting word and phrase patterns within them, and automatically clustering word groups and similar expressions that best characterize a set of documents.

By using topic analysis models, businesses are able to offload simple tasks onto machines instead of overloading employees with too much data. Just imagine the time your team could save and spend on more important tasks, if a machine was able to sort through endless lists of customer surveys or support tickets every morning.

Topic modeling is an ‘unsupervised’ machine learning technique, in other words, one that doesn’t require labeled training data. Topic classification, by contrast, is a ‘supervised’ machine learning technique: it needs to be trained on labeled examples before it can automatically analyze texts.

### How Does Topic Modeling Work?

Topic modeling involves counting words and grouping similar word patterns to infer topics within unstructured data. Let’s say you’re a software company and you want to know what customers are saying about particular features of your product. Instead of spending hours going through heaps of feedback, in an attempt to deduce which texts are talking about your topics of interest, you could analyze them with a topic modeling algorithm.

By detecting patterns such as word frequency and distance between words, a topic model clusters feedback that is similar, along with the words and expressions that appear most often. With this information, you can quickly deduce what each set of texts is talking about. Remember, this approach is ‘unsupervised’, meaning that no labeled training data is required.

Topic modeling refers to the process of dividing a corpus of documents in two:

- A list of the topics covered by the documents in the corpus
- Several sets of documents from the corpus grouped by the topics they cover.

The underlying assumption is that every document comprises a statistical mixture of topics, i.e. a statistical distribution of topics that can be obtained by “adding up” all of the distributions for all the topics covered. What topic modeling methods do is try to figure out which topics are present in the documents of the corpus and how strong that presence is.
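
To make the mixture assumption concrete, here is a minimal sketch with invented numbers: two hypothetical topics, each a probability distribution over a four-word vocabulary, and one document that is 70% the first topic and 30% the second. The topic names, words and probabilities are illustrative assumptions, not values produced by any model.

```python
# Hypothetical topics: each maps words to probabilities that sum to 1.
topics = {
    "movies": {"watch": 0.5, "film": 0.4, "match": 0.05, "wicket": 0.05},
    "cricket": {"watch": 0.2, "film": 0.0, "match": 0.4, "wicket": 0.4},
}

# Hypothetical topic proportions for a single document (these also sum to 1).
doc_topic_weights = {"movies": 0.7, "cricket": 0.3}

# The document's word distribution is the weighted sum of the topic distributions.
vocabulary = sorted(topics["movies"])
doc_word_probs = {
    word: sum(doc_topic_weights[t] * topics[t][word] for t in topics)
    for word in vocabulary
}

for word, prob in doc_word_probs.items():
    print(f"{word}: {prob:.3f}")
```

A topic model works in the opposite direction: it only sees the words of each document and has to estimate both the topic distributions and the per-document weights.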

### Different Methods of Topic Modeling

This highly important process can be performed by various algorithms or methods. Some of them are:

1. <a href = "#1.-Latent-Dirichlet-Allocation-(LDA)">Latent Dirichlet Allocation (LDA)</a>
2. <a href = "#2.-Non-Negative-Matrix-Factorization-(NMF)">Non Negative Matrix Factorization (NMF)</a>
3. <a href = "#3.-Latent-Semantic-Analysis-(LSA)">Latent Semantic Analysis (LSA)</a>

## 1. Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation is a generative statistical model, usually drawn as a graphical model, that is used to uncover relationships between the documents in a corpus. It is typically fitted with the Variational Expectation Maximization (VEM) algorithm, which approximates the maximum likelihood estimate over the whole corpus of text. Traditionally, one would summarize a document by picking out the top few words in its bag of words. However, this completely ignores the semantics of the sentences. LDA instead follows the idea that each document can be described by a probability distribution over topics and each topic can be described by a probability distribution over words. This gives us a much clearer view of how the topics are connected.

For example, suppose you have a corpus of 1000 documents. After preprocessing the corpus, the bag of words consists of 1000 common words. By applying LDA, we can determine the topics which are related to each document. This makes it simple to extract the main themes from the corpus.

![image.png](attachment:image.png)

In the above picture, the upper level represents the documents, the middle level represents the topics generated and the lower level represents the words. It illustrates the rule LDA follows: a document is described as a distribution of topics, and a topic is described as a distribution of words.
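
The three levels can also be read as a recipe for generating a document. The sketch below simulates that recipe with Python's standard library only. The vocabulary, the three topic-word distributions and the Dirichlet parameter `alpha` are made-up values for illustration; a real LDA model estimates these from data rather than being given them.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

def sample_dirichlet(alpha):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# Hypothetical topic-word distributions over a tiny vocabulary.
vocabulary = ["movie", "watch", "cricket", "wicket", "book", "read"]
topic_word = [
    [0.45, 0.45, 0.02, 0.02, 0.03, 0.03],  # a "movies" topic
    [0.02, 0.10, 0.44, 0.40, 0.02, 0.02],  # a "cricket" topic
    [0.03, 0.03, 0.02, 0.02, 0.45, 0.45],  # a "books" topic
]

def generate_document(n_words, alpha=(0.5, 0.5, 0.5)):
    # 1. Draw this document's topic proportions (document level).
    theta = sample_dirichlet(alpha)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position according to theta (topic level).
        topic = random.choices(range(len(topic_word)), weights=theta)[0]
        # 3. Pick a word according to that topic's word distribution (word level).
        word = random.choices(vocabulary, weights=topic_word[topic])[0]
        words.append(word)
    return theta, words

theta, words = generate_document(8)
print([round(p, 2) for p in theta])
print(" ".join(words))
```

Fitting LDA amounts to running this story in reverse: given only the words, infer plausible values for `theta` and `topic_word`.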

#### Let's see how topic modelling with LDA works!

**Data and Steps for Working with Text**

We will apply LDA on the following corpus:

- Document 1: I want to watch a movie this weekend.
- Document 2: I went shopping yesterday. New Zealand won the World Test Championship by beating India by eight wickets at Southampton.
- Document 3: I don’t watch cricket. Netflix and Amazon Prime have very good movies to watch.
- Document 4: Movies are a nice way to chill however, this time I would like to paint and read some good books. It’s been so long!
- Document 5: This blueberry milkshake is so good! Try reading Dr. Joe Dispenza’s books. His work is such a game-changer! His books helped to learn so much about how our thoughts impact our biology and how we can all rewire our brains.

**The Work Flow for executing LDA in Python**

- After importing the required libraries, we will compile all the documents into one list to have the corpus.

- We will perform the following text preprocessing steps (can use either spacy or NLTK libraries for preprocessing):

 - Convert the text into lowercase
 - Split text into words
 - Remove the stop words
 - Remove the Punctuation, any symbols, and special characters
 - Normalize the word (I’ll be using Lemmatization for normalization)
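
The preprocessing steps above can be sketched with the standard library alone. This is only an illustration: the stop-word set is a tiny hand-picked sample, and `crude_normalize` is a hypothetical stand-in for lemmatization that merely strips a trailing plural "s". In practice the NLTK or spaCy stop-word lists and lemmatizers would take their place.

```python
import re

# Tiny illustrative stop-word list; real projects would use the NLTK or spaCy list.
STOP_WORDS = {"i", "a", "to", "this", "the", "and", "have", "very", "don", "t"}

def crude_normalize(word):
    """Very rough stand-in for lemmatization: strip a trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def preprocess(text):
    text = text.lower()                                   # lowercase
    tokens = re.findall(r"[a-z]+", text)                  # split into words; punctuation and symbols are dropped
    tokens = [t for t in tokens if t not in STOP_WORDS]   # remove stop words
    return [crude_normalize(t) for t in tokens]           # normalize

corpus = [
    "I want to watch a movie this weekend.",
    "I don't watch cricket. Netflix and Amazon Prime have very good movies to watch.",
]
for doc in corpus:
    print(preprocess(doc))
```

Note that the regular expression removes punctuation at the same moment it splits the text, so two of the listed steps happen in one line here.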
 
The next step is to convert the cleaned text into a numerical representation where the process for gensim and sklearn packages differ:

- For sklearn: Use either the Count vectorizer or the TF-IDF vectorizer to transform the cleaned text into a Document Term Matrix (DTM), a numerical array of term weights.

- For gensim: We don’t need to explicitly create the DTM from scratch. The gensim library has an internal mechanism to create it.

The only requirement for the gensim package is that we pass the cleaned data in the form of tokenized words.

- Next, we pass the vectorized corpus to the LDA model for both the packages gensim and sklearn.
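
Whichever package is used, the vectorized corpus boils down to a table of term counts (or weights) per document. The sketch below builds such a document-term matrix by hand from a few made-up token lists, purely to show its shape. Real libraries store it in a sparse format, and the token lists here are assumed output of the cleaning step rather than actual results.

```python
from collections import Counter

# Already-cleaned, tokenized documents (hypothetical output of the preprocessing step).
tokenized_docs = [
    ["want", "watch", "movie", "weekend"],
    ["watch", "cricket", "netflix", "good", "movie", "watch"],
    ["movie", "nice", "paint", "read", "good", "book"],
]

# The vocabulary fixes the column order of the matrix.
vocabulary = sorted({token for doc in tokenized_docs for token in doc})
column_of = {term: j for j, term in enumerate(vocabulary)}

# One row per document, one column per term, each cell a raw count.
dtm = []
for doc in tokenized_docs:
    counts = Counter(doc)
    row = [0] * len(vocabulary)
    for term, count in counts.items():
        row[column_of[term]] = count
    dtm.append(row)

print(vocabulary)
for row in dtm:
    print(row)
```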

In [1]:
#Let's get into the code!

We have taken the ‘Amazon Fine Food Reviews’ data from Kaggle (https://www.kaggle.com/snap/amazon-fine-food-reviews) here to illustrate how we can implement topic modelling using LDA in Python.

In [24]:
# Reading the Data!
import pandas as pd
rev = pd.read_csv("../../Data/Reviews.csv")
rev.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labr...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,"Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as ""Jumbo""."
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all","This is a confection that has been around a few centuries. It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with ..."
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient in Robitussin I believe I have found it. I got this in addition to the Root Beer Extract I ordered (which was good) and made some cherry soda. The fl...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,"Great taffy at a great price. There was a wide assortment of yummy taffy. Delivery was very quick. If your a taffy lover, this is a deal."


In [3]:
#Looking at the data!

print(len(rev))
print(rev[:5])

568454
   Id   ProductId          UserId                      ProfileName  \
0   1  B001E4KFG0  A3SGXH7AUHU8GW                       delmartian   
1   2  B00813GRG4  A1D87F6ZCVE5NK                           dll pa   
2   3  B000LQOCH0   ABXLMWJIXXAIN  Natalia Corres "Natalia Corres"   
3   4  B000UA0QIQ  A395BORC6FGVXV                             Karl   
4   5  B006K2ZZ7K  A1UQRSCLF8GW1T    Michael D. Bigham "M. Wassir"   

   HelpfulnessNumerator  HelpfulnessDenominator  Score        Time  \
0                     1                       1      5  1303862400   
1                     0                       0      1  1346976000   
2                     1                       1      4  1219017600   
3                     3                       3      2  1307923200   
4                     0                       0      5  1350777600   

                 Summary                                               Text  
0  Good Quality Dog Food  I have bought several of the Vitality canned d.

**Data Pre-processing**

We will perform the following steps:

- Tokenization: Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.
- Words with three or fewer characters are removed.
- All stopwords are removed.
- Words are lemmatized: words in third person are changed to first person and verbs in past and future tenses are changed into present.
- Words are stemmed: words are reduced to their root form.

Note that the clean_text function used in this notebook applies only tokenization, stop-word removal, short-word removal and lemmatization. Lowercasing and stemming are listed here as optional extra steps.

Loading the nltk and sklearn libraries

In [4]:
import nltk
from nltk.corpus import stopwords  #stopwords
from nltk.stem import WordNetLemmatizer  
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download("stopwords")
stop_words=set(nltk.corpus.stopwords.words('english'))
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sahude7\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sahude7\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sahude7\AppData\Roaming\nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\sahude7\AppData\Roaming\nltk_data...


True

In [6]:
# Tokenize each review, drop stop words and words of three or fewer characters,
# then lemmatize what remains.

lemmatizer = WordNetLemmatizer()

def clean_text(text):
    word_tokens = word_tokenize(text)
    tokens = [lemmatizer.lemmatize(w) for w in word_tokens if w not in stop_words and len(w) > 3]
    return " ".join(tokens)

rev['cleaned_text'] = rev['Text'].apply(clean_text)

#### TFIDF vectorization on the text column:

Carrying out a TF-IDF vectorization on the text column gives us a document-term matrix on which we can carry out the topic modelling. TF-IDF stands for Term Frequency-Inverse Document Frequency: the vectorization weighs the number of times a word appears in a document against the number of documents that contain the word.
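
To see what that comparison means numerically, here is a small hand-rolled sketch of the textbook formula, tf × log(N / df), on three made-up token lists. It is an illustration under that assumed formula only: scikit-learn's TfidfVectorizer by default uses a smoothed IDF and L2-normalizes each row, so its numbers will differ from these.

```python
import math
from collections import Counter

# Three tiny, made-up documents that are already tokenized.
docs = [
    ["good", "dog", "food"],
    ["good", "taffy", "good", "price"],
    ["dog", "treat"],
]

n_docs = len(docs)
# Document frequency: in how many documents each term appears at least once.
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    counts = Counter(doc)
    return {
        term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

for doc in docs:
    print({term: round(score, 3) for term, score in tfidf(doc).items()})
```

In the first document, "food" scores higher than "good" even though both occur once, because "food" appears in fewer documents overall.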

In [7]:
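# Recent scikit-learn versions require stop_words to be a list (or 'english'), not a set.
vect = TfidfVectorizer(stop_words=list(stop_words), max_features=1000)
vect_text = vect.fit_transform(rev['cleaned_text'])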

#### LDA on the vectorized text:

The parameters that we have given to the LDA model, as shown below, include the number of topics, the learning method (which is the way the algorithm updates the assignments of the topics to the documents), the maximum number of iterations to be carried out and the random state.

In [8]:
from sklearn.decomposition import LatentDirichletAllocation
lda_model = LatentDirichletAllocation(n_components=10,learning_method='online',random_state=42,max_iter=1) 
lda_top = lda_model.fit_transform(vect_text)

**Checking the results:**

We can check the proportion of topics that have been assigned to the first document using the lines of code given below.

In [9]:
print("Document 0: ")
for i,topic in enumerate(lda_top[0]):
  print("Topic ",i,": ",topic*100,"%")

Document 0: 
Topic  0 :  2.1992339077819922 %
Topic  1 :  2.1985043545548097 %
Topic  2 :  17.024779111056652 %
Topic  3 :  2.1983783888013466 %
Topic  4 :  2.198732256871717 %
Topic  5 :  65.38648016641726 %
Topic  6 :  2.1984493293481044 %
Topic  7 :  2.198234371396265 %
Topic  8 :  2.1987634148807573 %
Topic  9 :  2.1984446988910835 %
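
Reading the dominant topic off such a list is easy to automate. The sketch below uses the proportions printed above, rounded to two decimals and typed in by hand, and ranks the topic indices by weight. It is a plain-Python illustration and does not call the fitted model.

```python
# Topic proportions (in percent) for document 0, rounded from the output above.
doc0_topics = [2.20, 2.20, 17.02, 2.20, 2.20, 65.39, 2.20, 2.20, 2.20, 2.20]

# Rank topic indices from the largest proportion to the smallest.
ranked = sorted(range(len(doc0_topics)), key=lambda k: doc0_topics[k], reverse=True)

dominant = ranked[0]
print(f"Dominant topic: {dominant} ({doc0_topics[dominant]:.2f} %)")
print(f"Runner-up: {ranked[1]} ({doc0_topics[ranked[1]]:.2f} %)")
```

With the full `lda_top` array, `lda_top.argmax(axis=1)` would give the dominant topic of every document in one call.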


**Analyzing the Topics:**

Let us check what are the top words that comprise the topics. This would give us a view of what defines each of these topics.

In [10]:
vocab = vect.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
for i, comp in enumerate(lda_model.components_):
    vocab_comp = zip(vocab, comp)
    # Keep the ten highest-weighted words of this topic.
    sorted_words = sorted(vocab_comp, key=lambda x: x[1], reverse=True)[:10]
    print("Topic " + str(i) + ": " + " ".join(t[0] for t in sorted_words))

Topic 0: chip snack chocolate cooky taste like peanut butter great good
Topic 1: shipping salt arrived order price delivered candy good fast grey
Topic 2: product like would taste good really coconut time review little
Topic 3: treat dog love chew training ball puppy bone baby soft
Topic 4: popcorn taste drink flavor like water sugar sweet fruit really
Topic 5: food product love cat month like year time hair good
Topic 6: sauce soup spicy pasta great cheese flavor chicken make rice
Topic 7: save com www http gp subscribe href amazon pack ordering
Topic 8: store amazon find price local grocery order espresso great product
Topic 9: coffee flavor like taste strong good cup green blend bold




[<a href="#Contents:">Back to Content</a>]

## 2. Non Negative Matrix Factorization (NMF)

Non-Negative Matrix Factorization (NMF) is a matrix factorization method in which the elements of the factorized matrices are constrained to be non-negative. Consider the document-term matrix obtained from a corpus after removing the stop words. This matrix can be factorized into two matrices: a term-topic matrix and a topic-document matrix. There are several optimization methods for performing the factorization. Hierarchical Alternating Least Squares is often cited as a fast one: it updates one column at a time while keeping the other columns constant.

**Some Important points about NMF:**

1. It belongs to the family of linear algebra algorithms that are used to identify the latent or hidden structure present in the data.

2. It approximates a non-negative matrix by the product of two smaller non-negative matrices.

3. It can also be applied for topic modelling, where the input is the term-document matrix, typically TF-IDF normalized.

   - Input: Term-document matrix, number of topics.
   - Output: Two non-negative matrices, one of the original n words by k topics and one of those same k topics by the m original documents.

   In simple words, we are using linear algebra for topic modelling.

4. NMF has become popular because of its ability to automatically extract sparse and easily interpretable factors.

Below is the pictorial representation of the above technique:

![image.png](attachment:image.png)

As described in the image above, we decompose the term-document matrix (A) into the following two matrices:

- First matrix: it has every topic and the terms in it.
- Second matrix: it has every document and the topics in it.

For example:

![image-2.png](attachment:image-2.png)

Let’s try to look at the practical application of NMF with an example described below:

Imagine we have a dataset consisting of reviews of superhero movies.

Input matrix: In this example, the document-term matrix has the individual documents along its rows and each unique term along its columns.

If a review contains terms like Tony Stark, Ironman, and Mark 42, it may be grouped under the topic Ironman. In this method, each of the individual words in the document-term matrix is taken into consideration.

While factorizing, each word is given a weight based on the semantic relationship between the words, and the word with the highest weight is taken as the label of the topic for that set of words. The process thus amounts to a weighted sum of the different words present in the documents.
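
To show what such a factorization does mechanically, here is a minimal pure-Python sketch of NMF using the classic multiplicative update rules for the squared (Frobenius) error. It is a teaching sketch under assumptions: the toy matrix is invented, rows are documents and columns are terms, and the helper names (`matmul`, `transpose`, `nmf`) are hypothetical. It is not how scikit-learn implements its solvers.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(a, w, h):
    """Square root of the summed squared differences between a and w @ h."""
    wh = matmul(w, h)
    return sum((a[i][j] - wh[i][j]) ** 2
               for i in range(len(a)) for j in range(len(a[0]))) ** 0.5

def nmf(a, k, n_iter=200, eps=1e-9):
    """Factorize non-negative a (docs x terms) into w (docs x k) and h (k x terms)."""
    n_rows, n_cols = len(a), len(a[0])
    w = [[random.random() for _ in range(k)] for _ in range(n_rows)]
    h = [[random.random() for _ in range(n_cols)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T A) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, a), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_cols)]
             for i in range(k)]
        # W <- W * (A H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(a, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n_rows)]
    return w, h

# Toy document-term matrix: documents 0-1 use the first two terms, documents 2-3 the last two.
A = [
    [3, 1, 0, 0],
    [6, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 4],
]
W, H = nmf(A, k=2)
print("reconstruction error:", round(frobenius_error(A, W, H), 4))
```

Because every update only multiplies by non-negative ratios, `W` and `H` stay non-negative throughout, which is what makes the resulting topic weights easy to read.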

In [11]:
#Let's get into the code!

In [12]:
# Importing Necessary packages

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

In [13]:
# Importing Data
text_data= fetch_20newsgroups(remove=('headers', 'footers', 'quotes')).data
text_data[:3]

['I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.',
 "A fair number of brave souls who upgraded their SI clock oscillator have\nshared their experiences for this poll. Please send a brief message detailing\nyour experiences with the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies are especially requested.\n\nI will be summarizing in the next two days, so please add to the network\nknowledge base if you have done the clock upgrade and haven't an

In [14]:
# Now we convert the documents into a document-term matrix of TF-IDF weights,
# with one row per document and one column per term.

vectorizer = TfidfVectorizer(max_features=1500, min_df=10, stop_words='english')
X = vectorizer.fit_transform(text_data)
words = np.array(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2

print(X)
print("words = ", words)

  (0, 829)	0.13596515131134768
  (0, 809)	0.1439640091285723
  (0, 707)	0.16068505607893963
  (0, 672)	0.16927150728890597
  (0, 1495)	0.1274990882101728
  (0, 506)	0.19413995565094086
  (0, 887)	0.17648781190400797
  (0, 757)	0.09424560560725692
  (0, 247)	0.17513150125349702
  (0, 1158)	0.1651151431885443
  (0, 1218)	0.19781957502373113
  (0, 128)	0.190572546028195
  (0, 1256)	0.153503242191245
  (0, 1118)	0.12154002727766956
  (0, 273)	0.14279390121865662
  (0, 484)	0.1714763727922697
  (0, 767)	0.18711856186440218
  (0, 808)	0.18303366583393096
  (0, 469)	0.2009979730339519
  (0, 411)	0.14249215589040326
  (0, 1191)	0.17201525862610714
  (0, 278)	0.630558141606117
  (0, 1472)	0.1855076564575762
  (1, 1355)	0.12138696862814867
  (1, 653)	0.1728163048656526
  :	:
  (11312, 1027)	0.45507155319966874
  (11312, 647)	0.21811161764585577
  (11312, 1302)	0.2391477981479836
  (11312, 1276)	0.39611960235510485
  (11312, 1100)	0.1839292570975713
  (11312, 926)	0.2458009890045144
  (11312, 140

Defining the term-document matrix in detail is out of the scope of this article. In brief, the vectorizer splits each document into terms and assigns a weight to each term.

Now, let us apply NMF to our data and view the topics generated. For ease of understanding, we will look at 10 topics that the model has generated. We will use the Multiplicative Update solver for optimizing the model.

In [15]:
# Applying Non-Negative Matrix Factorization
 
nmf = NMF(n_components=10, solver="mu")
W = nmf.fit_transform(X)
H = nmf.components_

for i, topic in enumerate(H):
     print("Topic {}: {}".format(i + 1, ",".join([str(x) for x in words[topic.argsort()[-10:]]])))

Topic 1: want,really,time,ve,good,know,think,like,just,don
Topic 2: help,anybody,info,looking,hi,mail,advance,know,does,thanks
Topic 3: does,church,christians,christian,faith,christ,believe,bible,jesus,god
Topic 4: league,win,hockey,play,players,season,games,year,team,game
Topic 5: bus,floppy,ide,controller,hard,drives,disk,card,scsi,drive
Topic 6: shipping,condition,car,offer,price,space,10,sale,00,new
Topic 7: running,problem,using,program,use,window,files,dos,file,windows
Topic 8: public,algorithm,escrow,government,use,keys,clipper,encryption,chip,key
Topic 9: rights,said,armenians,state,armenian,jews,israeli,government,israel,people
Topic 10: send,internet,ftp,email,article,university,cs,com,soon,edu


**When can we use this approach?**

NMF tends to produce sparse representations. This means that most of the entries are close to zero and only a few parameters have significant values. It is a reasonable choice when we strictly require fewer topics.
On some corpora, NMF is reported to produce more coherent topics than LDA, although this depends on the data and the settings used.

[<a href="#Contents:">Back to Content</a>]

## 3. Latent Semantic Analysis (LSA)

Latent Semantic Analysis (LSA) is also an unsupervised learning method, used to extract relationships between the words in a collection of documents. This helps us choose the documents that are actually relevant. It acts as a dimensionality reduction method that shrinks the huge corpus of text data to a small number of dimensions. The information that is discarded mostly acts as noise that would otherwise obscure the insights in the data.

**Steps involved in the implementation of LSA**

Let’s say we have m number of text documents with n number of total unique terms (words). We wish to extract k topics from all the text data in the documents. The number of topics, k, has to be specified by the user.

- Generate a document-term matrix of shape m x n having TF-IDF scores.

![image.png](attachment:image.png)

- Then, we will reduce the dimensions of the above matrix to k (no. of desired topics) dimensions, using singular-value decomposition (SVD).
- SVD decomposes a matrix into three other matrices. Suppose we want to decompose a matrix A using SVD. It will be decomposed into matrix U, matrix S, and VT (transpose of matrix V).

![image-2.png](attachment:image-2.png)

![image-3.png](attachment:image-3.png)

Each row of the matrix Uk (document-topic matrix) is the vector representation of the corresponding document. The length of these vectors is k, which is the number of desired topics. Vector representation for the terms in our data can be found in the matrix Vk (term-topic matrix).
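
For intuition about where these vectors come from, the sketch below estimates only the largest singular value and its two singular vectors by power iteration, in plain Python. This is a simplified illustration on an invented 3 x 3 matrix, and `top_singular_triplet` is a hypothetical helper. A truncated SVD as used for LSA keeps the k largest such triplets and relies on more robust numerical routines.

```python
import math

def top_singular_triplet(a, n_iter=100):
    """Estimate the largest singular value of a and its singular vectors by power iteration."""
    n_rows, n_cols = len(a), len(a[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # start from a uniform unit vector
    for _ in range(n_iter):
        # u = A v, then v = A^T u, renormalizing v each round.
        u = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        v = [sum(a[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return u, sigma, v

# Toy document-term matrix: documents 0 and 1 share terms 0 and 1, document 2 uses term 2.
A = [
    [2.0, 1.0, 0.0],
    [1.0, 2.0, 0.0],
    [0.0, 0.0, 1.0],
]
u, sigma, v = top_singular_triplet(A)
print("sigma:", round(sigma, 4))
print("term direction v:", [round(x, 3) for x in v])
print("document coordinates u:", [round(x, 3) for x in u])
```

Here `v` plays the role of one column of Vk (a topic expressed over terms) and `u` the matching column of Uk (each document's coordinate on that topic).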

- So, SVD gives us vectors for every document and term in our data. The length of each vector would be k. We can then use these vectors to find similar words and similar documents using the cosine similarity method.
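
Cosine similarity itself is a short formula: the dot product of two vectors divided by the product of their lengths. The sketch below applies it to three made-up document vectors of length k = 3. The vector values are illustrative assumptions, not output of an SVD.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Hypothetical k=3 topic-space vectors for three documents.
doc_vectors = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.8, 0.2, 0.1],
    "doc_c": [0.0, 0.1, 0.9],
}

print(round(cosine_similarity(doc_vectors["doc_a"], doc_vectors["doc_b"]), 3))
print(round(cosine_similarity(doc_vectors["doc_a"], doc_vectors["doc_c"]), 3))
```

Documents that load on the same topics point in nearly the same direction and score close to 1, while documents about different topics score close to 0.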

**Pros and Cons of LSA**

Latent Semantic Analysis can be very useful as we saw above, but it does have its limitations. It’s important to understand both the sides of LSA so you have an idea of when to leverage it and when to try something else.

**Pros:**

- LSA is fast and easy to implement.
- It gives decent results, much better than a plain vector space model.

**Cons:**

- Since it is a linear model, it might not do well on datasets with non-linear dependencies.
- LSA assumes a Gaussian distribution of the terms in the documents, which may not be true for all problems.
- LSA involves SVD, which is computationally intensive and hard to update as new data arrives.

**Implementation of LSA in Python**

It’s time to power up Python and understand how to implement LSA in a topic modeling problem. Once your Python environment is open, follow the steps I have mentioned below.

In [16]:
#Let's get into the code!

#Data reading and inspection
#Let’s load the required libraries before proceeding with anything else.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option("display.max_colwidth", 200)

In this code, we will use the ‘20 Newsgroups’ dataset from sklearn. sklearn downloads the dataset the first time fetch_20newsgroups is called, so you can follow along with the code.

In [17]:
from sklearn.datasets import fetch_20newsgroups

dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data
len(documents)

11314

In [18]:
dataset.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

**Data Preprocessing**

To start with, we will try to clean our text data as much as possible. The idea is to remove the punctuation, numbers, and special characters all in one step using the regex replacement replace("[^a-zA-Z#]", " "), which replaces everything except letters and the ‘#’ character with a space. Then we will remove shorter words because they usually don’t contain useful information. Finally, we will make all the text lowercase to nullify case sensitivity.
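
The effect of these three cleaning steps is easiest to see on a single string. The sketch below uses Python's built-in `re` module on an invented sample sentence. It mirrors the steps described above, but it is not the pandas code used in this notebook.

```python
import re

# Made-up sample text with digits, punctuation and mixed case.
sample = "Top speed attained: 66 MHz (CPU rated 33)! E-mail me, #thanks."

# Replace every character that is not a letter or '#' with a space.
letters_only = re.sub(r"[^a-zA-Z#]", " ", sample)

# Drop words of three characters or fewer, then lowercase what is left.
kept = [w for w in letters_only.split() if len(w) > 3]
cleaned = " ".join(kept).lower()

print(cleaned)
```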

In [19]:
news_df = pd.DataFrame({'document':documents})

# removing everything except letters and '#'
# (regex=True is needed since pandas 2.0, where str.replace defaults to literal matching)
news_df['clean_doc'] = news_df['document'].str.replace("[^a-zA-Z#]", " ", regex=True)

# removing short words
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>3]))

# make all text lowercase
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: x.lower())


It’s good practice to remove the stop-words from the text data as they are mostly clutter and hardly carry any information. Stop-words are terms like ‘it’, ‘they’, ‘am’, ‘been’, ‘about’, ‘because’, ‘while’, etc.

To remove stop-words from the documents, we will have to tokenize the text, i.e., split the string of text into individual tokens or words. We will stitch the tokens back together once we have removed the stop-words.

In [20]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

# tokenization
tokenized_doc = news_df['clean_doc'].apply(lambda x: x.split())

# remove stop-words
tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_words])

# de-tokenization
detokenized_doc = []
for i in range(len(news_df)):
    t = ' '.join(tokenized_doc[i])
    detokenized_doc.append(t)

news_df['clean_doc'] = detokenized_doc

In [21]:
# Document-Term Matrix

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    stop_words='english',
    max_features=1000,  # keep top 1000 terms
    max_df=0.5,
    smooth_idf=True,
)

X = vectorizer.fit_transform(news_df['clean_doc'])

X.shape # check shape of the document-term matrix

(11314, 1000)

We could have used all the terms to create this matrix but that would need quite a lot of computation time and resources. Hence, we have restricted the number of features to 1,000. If you have the computational power, I suggest trying out all the terms.

**Topic Modeling**

The next step is to represent each and every term and document as a vector. We will use the document-term matrix and decompose it into multiple matrices. We will use sklearn’s TruncatedSVD to perform the task of matrix decomposition.

Since the data comes from 20 different newsgroups, let’s try to have 20 topics for our text data. The number of topics can be specified by using the n_components parameter.

In [22]:
from sklearn.decomposition import TruncatedSVD

# SVD represent documents and terms in vectors 
svd_model = TruncatedSVD(n_components=20, algorithm='randomized', n_iter=100, random_state=122)

svd_model.fit(X)

len(svd_model.components_)

20

The components of svd_model are our topics, and we can access them using svd_model.components_. Finally, let’s print a few most important words in each of the 20 topics and see how our model has done.

In [23]:
terms = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

for i, comp in enumerate(svd_model.components_):
    terms_comp = zip(terms, comp)
    # Keep the seven highest-weighted terms of this topic.
    sorted_terms = sorted(terms_comp, key=lambda x: x[1], reverse=True)[:7]
    print("Topic " + str(i) + ": " + " ".join(t[0] for t in sorted_terms))

Topic 0: like know people think good time thanks
Topic 1: thanks windows card drive mail file advance
Topic 2: game team year games season players good
Topic 3: drive scsi disk hard card drives problem
Topic 4: windows file window files program using problem
Topic 5: government chip mail space information encryption data
Topic 6: like bike know chip sounds looks look
Topic 7: card sale video offer monitor price jesus
Topic 8: know card chip video government people clipper
Topic 9: good know time bike jesus problem work
Topic 10: think chip good thanks clipper need encryption
Topic 11: thanks right problem good bike time window
Topic 12: good people windows know file sale files
Topic 13: space think know nasa problem year israel
Topic 14: space good card people time nas




[<a href="#Contents:">Back to Content</a>]

**THE END**