Topic modeling is an unsupervised machine learning technique that extracts abstract topics from a collection of documents. It can be used in applications such as document clustering and recommendation systems.

In this tutorial we will examine two approaches to topic modeling: LDA and LSA.

## Summary
- Latent Dirichlet Allocation (LDA) 
- Latent Semantic Analysis (LSA)

In [25]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups()

The dataset includes 20 topic groups (newsgroups); we will select a few to experiment with.

In [26]:
data.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [176]:
categories = [ 'comp.graphics', 'rec.motorcycles', 'talk.politics.guns', 'comp.windows.x',
              'misc.forsale', 'rec.sport.baseball']
data = fetch_20newsgroups(categories=categories, remove=('headers', 'footers', 'quotes'))

Let's take a look at a sample of the data

In [28]:
len(data.data), len(data.target)

(3503, 3503)

In [29]:
print(data.data[300])

This past week I've been playing with some of the R-D (Reaction-
Diffusion, not to be confused with RDS or R&D) techniques
from SIGGRAPH '91.

I was wondering what material is available to explain the control
mechanism a little more.  It seems to me very much like a matter of
picking random magic numbers and sitting back and waiting.  Although
both of the papers (Turk and Witkin & Kass) were very well organized
and extremely helpful, I guess what I need is a more basic description
of the technique, especially wrt the control mechanisms.  The tests
that I did had a tendency to either turn into blurry mud or become
unstable.

Is there any info available online?  Source code would be great but
not necessary.

Thanks!


-- 


## Let's do some preprocessing 

In [79]:
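from nltk.stem import WordNetLemmatizer
import re

# WordNetLemmatizer needs the WordNet corpus: run nltk.download('wordnet') once
my_limitizer = WordNetLemmatizer()

def limmitizer(text):
    """Lowercase the text, strip non-letters, and lemmatize every word as a verb."""
    words = str(text).strip().lower()
    # replace punctuation, digits, whitespace and underscores with a space
    words = re.sub(r'[\W\d\s_]',' ',words)
    words = words.split()

    # pos='v' treats every token as a verb (e.g. 'is' -> 'be')
    words = [my_limitizer.lemmatize(word,pos='v') for word in words]

    return ' '.join(words)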
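
To see what the cleaning step does on its own, here is a minimal sketch that applies the same kind of regular expression to one sentence. The `normalize` helper and the sample sentence are illustrative assumptions, and lemmatization is left out so the snippet runs without NLTK data.

```python
import re

def normalize(text):
    """Lowercase the text and replace every non-letter character with a space."""
    text = str(text).strip().lower()
    # \W matches non-word characters and \d matches digits; the underscore is
    # listed separately because \w treats it as a word character.
    text = re.sub(r"[\W\d_]", " ", text)
    # split() with no arguments collapses any run of whitespace.
    return " ".join(text.split())

# Hypothetical sample sentence
sample = "Call 555-1234 about the R-D paper, e.g. file_name.txt!"
cleaned = normalize(sample)
print(cleaned)
```

Digits and punctuation disappear entirely, which is why phone numbers and model numbers are missing from the preprocessed posts.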


### Applying preprocessing

In [134]:
for i in range(len(data.data)):
    data.data[i]=limmitizer(data.data[i])

In [135]:
data.data[0]

'no rear tire as small as there be some front though so get a instead be there anything that size any other recomendations call the tire company yourself and tell them what you have they can make recomendations for you that s your best bet check a biker magazine cycle world etc for phone number it s possible there be no other tire available though erik astrup afm dod cbr rr cbr concours ninja'

Let's vectorize the data to be able to use it in the `LDA` algorithm.

In [40]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english', max_features=3000)
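
Before applying the vectorizer to the real corpus, it can help to see it on a toy one. The sketch below assumes three made-up sentences and a small `max_features` value so the effect of the cap is visible.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus
toy_docs = [
    "the bike needs a new tire",
    "the team won the game",
    "new bike for sale",
]

# Same settings as above, but with the vocabulary capped at 5 terms
toy_vectorizer = TfidfVectorizer(stop_words="english", max_features=5)
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# One row per document, one column per kept term
print(toy_matrix.shape)

# The kept terms are the most frequent non-stop-words in the corpus
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)
```

The result is a sparse matrix, which matters later when we choose a decomposition that can work without densifying it.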

Let's split our data into training and test sets so we can test the model after training.

In [177]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(data.data, data.target, test_size=.2, random_state=42)


In [178]:
x_train_v = vectorizer.fit_transform(x_train)
x_test_v = vectorizer.transform(x_test)

x_train_v.shape,x_test_v.shape

((2802, 3000), (701, 3000))

Notice that 2802 + 701 = 3503, which is `len(data.data)`.

## Training the LDA

### **LDA** is an unsupervised algorithm; like **PCA**, it doesn't accept a target variable **`y`**. **LDA** models topics at the token/word level: each topic is a probability distribution over words, and each document is a mixture of topics. A document can therefore belong to multiple topics, and we usually label it with the topic that has the highest probability.

### Because **LDA** is unsupervised, it doesn't know the topic names. It produces clusters of words, each cluster is called a topic, and it's up to you to name each cluster based on your interpretation.

### **LDA** accepts a number of components in its constructor; this is the number of topics you would like the algorithm to detect.
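
The idea that a document is a mixture of topics can be checked on a tiny example. This is a sketch on a made-up four-sentence corpus with raw counts from `CountVectorizer`; the topics it finds are not meaningful, but the shape and normalization of the output are.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus
toy_docs = [
    "bike ride helmet road bike",
    "helmet road ride bike",
    "game team score pitch game",
    "team pitch score game",
]

counts = CountVectorizer().fit_transform(toy_docs)

# n_components is the number of topics we ask the model to find
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = toy_lda.fit_transform(counts)

# Each row is one document's topic mixture and sums to 1
print(doc_topics.round(3))
row_sums = doc_topics.sum(axis=1)

# The topic with the highest share is used as the document's label
dominant = doc_topics.argmax(axis=1)
print(dominant)
```

Note that no `y` is passed anywhere: the model only sees the word counts.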

In [138]:
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=6 , random_state=42)
lda.fit(x_train_v)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=6, n_jobs=None,
                          perp_tol=0.1, random_state=42, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)

Let's see how topics are represented in the fitted **`lda`**

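Recall we created a `vectorizer` with `3000` words. The fitted `lda` stores a weight for each word in each topic in `components_`. These weights are unnormalized pseudo-counts rather than probabilities (dividing each row by its sum gives a word distribution per topic), and a larger weight means the word is more strongly associated with the topic. Let's take a look.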

In [87]:
lda.components_.shape

(6, 3000)

In [88]:
lda.components_

array([[1.09159186, 3.59344538, 1.95607715, ..., 0.65334361, 0.16693612,
        0.16685356],
       [0.16671551, 0.16668502, 0.16670557, ..., 0.16707109, 0.16671534,
        0.16668182],
       [0.80870455, 0.1676063 , 0.16690144, ..., 0.16667477, 1.97343481,
        0.16683954],
       [0.16684762, 0.16763242, 0.56104622, ..., 0.16897255, 0.16667643,
        0.16673571],
       [0.16817669, 0.16721707, 0.16726299, ..., 0.16738263, 0.88995903,
        0.16676554],
       [0.65260596, 0.42788096, 0.17012023, ..., 1.71382162, 0.16673981,
        3.83271304]])

In [94]:
# scikit-learn >= 1.0; older releases use get_feature_names()
vectorizer.get_feature_names_out()[2305]

'school'

In [99]:
lda.components_[0][2305]

3.2944436873757983
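
A weight such as the one above is larger than 1, so it cannot be a probability. Here is a minimal sketch of turning such weights into per-topic word probabilities by normalizing each row; the small matrix stands in for `lda.components_` and its values are made up.

```python
import numpy as np

# Hypothetical topic-word weights: 2 topics over a 4-word vocabulary
toy_components = np.array([
    [18.0, 1.5, 0.3, 0.2],
    [0.2, 0.3, 4.5, 5.0],
])

# Divide every row by its sum so each topic becomes a distribution over words
topic_word_probs = toy_components / toy_components.sum(axis=1, keepdims=True)
print(topic_word_probs.round(3))
```

The same expression should work on the fitted model's `components_` array, since it has the same topics-by-words layout.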

Let's look at the first topic for example

In [101]:
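# pick 10 random vocabulary indices and show their weights in topic 0
rng = np.random.randint(0, 3000, size=(10))
print(rng)
# scikit-learn >= 1.0; older releases use np.array(vectorizer.get_feature_names())
for word, weight in zip(vectorizer.get_feature_names_out()[rng], lda.components_[0, rng]):
    print(word, weight)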

[ 248 1621 1441  200 2311 2297  466  181 1060 2758]
bell 1.5401075591544078
middle 1.606142042762906
larger 1.4476786251711273
aware 0.9990476233459842
score 5.4230079810009295
saw 5.061445679476285
cobra 0.16806091181706867
audio 0.16965489594476243
game 18.326401542778616
understand 7.531211076219876


Now let's extract the top tokens for each topic.

In [106]:
list(enumerate(lda.components_))

[(0,
  array([1.09159186, 3.59344538, 1.95607715, ..., 0.65334361, 0.16693612,
         0.16685356])),
 (1,
  array([0.16671551, 0.16668502, 0.16670557, ..., 0.16707109, 0.16671534,
         0.16668182])),
 (2,
  array([0.80870455, 0.1676063 , 0.16690144, ..., 0.16667477, 1.97343481,
         0.16683954])),
 (3,
  array([0.16684762, 0.16763242, 0.56104622, ..., 0.16897255, 0.16667643,
         0.16673571])),
 (4,
  array([0.16817669, 0.16721707, 0.16726299, ..., 0.16738263, 0.88995903,
         0.16676554])),
 (5,
  array([0.65260596, 0.42788096, 0.17012023, ..., 1.71382162, 0.16673981,
         3.83271304]))]

In [116]:
l = [15, 8 , 8 , 5 , 6 , 9 , 2 , 4 , 7 ,1 ]
l[:-3],l[:-3:-1] # -1 means move from right to left (i.e. reverse the order)

([15, 8, 8, 5, 6, 9, 2], [1, 7])

In [179]:
topics = {}
names = vectorizer.get_feature_names_out().tolist()  # scikit-learn >= 1.0; older releases use get_feature_names()

for idx, topic in enumerate(lda.components_):  # 6 topics x 3000 words
    features = topic.argsort()[:-(12-1): -1]   # indices of the 10 largest weights, largest first
    #print(features)
    tokens = [names[i] for i in features]
    topics[idx] = tokens
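
The `argsort` slice in the loop is easy to misread, so here is a small sketch of the same idiom wrapped in a helper. The name `top_k_indices` and the sample weights are assumptions made for illustration.

```python
import numpy as np

def top_k_indices(weights, k):
    """Return the indices of the k largest values, largest first."""
    # argsort sorts ascending; [:-(k + 1):-1] walks backwards from the end
    # and stops after k elements.
    return np.argsort(weights)[:-(k + 1):-1]

# Hypothetical weights for a 5-word vocabulary
weights = np.array([0.2, 5.4, 0.1, 18.3, 7.5])
vocab = np.array(["cobra", "score", "audio", "game", "understand"])

top3 = top_k_indices(weights, 3)
print(vocab[top3])

# The slice [:-(12 - 1):-1] used in the loop therefore keeps 10 elements
print(len(np.arange(20)[:-(12 - 1):-1]))
```

Writing the slice as `-(k + 1)` makes the number of returned tokens explicit.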


In [180]:
for key in topics:
    print(topics[key])

['francisco', 'tapes', 'outside', 'death', 'intended', 'limited', 'road', 'ya', 'jackson', 'landon']
['2b', 'sabretooth', 'finding', 'killed', 'behavior', 'production', 'technology', 'independent', 'sphere', 'equivalent']
['rides', 'needs', 'sales', 'section', 'center', 'monday', 'hi', 'points', 'defend', '85']
['ascii', '2nd', 'added', 'detail_win', 'thanx', 'bo', '98', 'built', 'buying', 'usage']
['undefined', 'talent', 'exception', 'posting', 'whitespace', 'jackson', 'force', 'wide', 'buyer', 'lies']
['ansi', 'remove', 'pascal', 'field', 'dealer', 'road', 'giants', 'tapes', 'defend', 'intended']


Let's check the categories we selected again:

In [15]:
data.target_names

['comp.graphics',
 'comp.windows.x',
 'misc.forsale',
 'rec.motorcycles',
 'rec.sport.baseball',
 'talk.politics.guns']

In [185]:
lda.transform(x_test_v[0])

array([[0.52098168, 0.02191683, 0.06943509, 0.02246308, 0.33997713,
        0.02522618]])

In [184]:
lda.transform(vectorizer.transform(['i want to play basketball']))

array([[0.65107061, 0.06955011, 0.07004151, 0.06953725, 0.06982984,
        0.06997068]])

In [160]:
lda.transform(vectorizer.transform(['windows is good']))

array([[0.0703407 , 0.06935892, 0.06989191, 0.06935884, 0.6508159 ,
        0.07023373]])

In [183]:
lda.transform(vectorizer.transform(['wars are bad']))

array([[0.65063547, 0.0698604 , 0.06985681, 0.06985998, 0.06993068,
        0.06985667]])

In [181]:
import numpy as np

def model_text(index):
    print(f"the text to test is:\n {x_test[index]}\n")
    print(f"this text belongs to {data.target_names[y_test[index]]} class\n")
    probs = lda.transform(x_test_v[index])
    print("topic probs for the input:\n")
    for i, prob in enumerate(probs.flatten()):
        print(f"topic {topics[i]} prob {round(prob, 3)}")
    print(f"the highest topic prob is {topics[np.argmax(probs)]}")
    print()

In [188]:
model_text(10)

the text to test is:
 There is a multi threaded xlib version written.
Do an archie search for mt-xlib:
Host export.lcs.mit.edu

    Location: /contrib
      DIRECTORY drwxr-xr-x        512  Jul 30 1992  mt-xlib
    Location: /contrib/mt-xlib-1.1
           FILE -rw-r--r--     106235  Jan 21 14:02  mt-xlib-xhib92.ps.Z
           FILE -rw-r--r--    1658123  Jan 21 14:03  mt-xlib.tar.Z
    Location: /contrib/mt-xlib
           FILE -rw-r--r--     106235  Jul 30 1992  mt-xlib-xhib92.ps.Z
           FILE -rw-r--r--    1925529  Jul 30 1992  mt-xlib.tar.Z

this text belongs to comp.windows.x class

topic probs for the input:

topic ['francisco', 'tapes', 'outside', 'death', 'intended', 'limited', 'road', 'ya', 'jackson', 'landon'] prob 0.345
topic ['2b', 'sabretooth', 'finding', 'killed', 'behavior', 'production', 'technology', 'independent', 'sphere', 'equivalent'] prob 0.035
topic ['rides', 'needs', 'sales', 'section', 'center', 'monday', 'hi', 'points', 'defend', '85'] prob 0.036
topic ['as

In this example, **`LDA`** assigns the input to the topic that contains `bike`.

Let's investigate the `lda` object output and how to use it.

In [172]:
data.target_names[y_test[1]]

'rec.motorcycles'

In [18]:
predictions = lda.transform(x_test_v[1])
predictions

array([[0.0373075 , 0.03682845, 0.03716172, 0.03703984, 0.81491558,
        0.0367469 ]])

The output of `transform` is a NumPy array holding the probability that the text belongs to each topic. Here the input belongs to the first topic with probability `0.037`, to the second with probability `0.037`, and so on; the fifth topic (index `4`) dominates with probability `0.815`.

So we take the `argmax` of the probabilities, the index of the maximum probability, which identifies the topic this text belongs to.

In [19]:
np.argmax(predictions)

4
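
The same `argmax` step extends to many documents at once. This sketch uses a made-up document-topic array in place of the output of `lda.transform` on several rows.

```python
import numpy as np

# Hypothetical document-topic probabilities: 3 documents, 6 topics
doc_topic = np.array([
    [0.04, 0.04, 0.04, 0.04, 0.80, 0.04],
    [0.65, 0.07, 0.07, 0.07, 0.07, 0.07],
    [0.30, 0.30, 0.10, 0.10, 0.10, 0.10],
])

# axis=1 picks the best topic for each row (document)
dominant = doc_topic.argmax(axis=1)
# The winning probability shows how decisive the assignment is
confidence = doc_topic.max(axis=1)

print(dominant)
print(confidence)
```

The third row shows a limitation: when two topics tie, `argmax` returns the first one, so a low winning probability is a hint that the label is not very meaningful.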

You can then look up the topic in the `topics` dictionary we built earlier.

In [29]:
topics[4]

['thanks', 'use', 'window', 'sale', 'know', 'mail', '00', 'email']

# Latent Semantic Analysis (LSA)

As we saw, **LDA** models topics by assigning each word to a topic and measuring how much of each topic is present in a document, and then assigns the document to its dominant topic.

**LSA** works quite differently. First we create a term-document matrix where each row represents a word and each column represents a document. Applying **SVD** to this matrix then produces three matrices:

1. the term-topic matrix **`u`**.
2. the topic-importance matrix **`s`** (the singular values).
3. the topic-document matrix **`vh`**.
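
The three matrices can be inspected directly with NumPy. The following is a sketch on a made-up term-document count matrix, using `np.linalg.svd` rather than sklearn so every factor is visible.

```python
import numpy as np

# Hypothetical counts: rows are terms, columns are documents
terms = ["bike", "ride", "game", "team"]
term_doc = np.array([
    [2, 1, 0, 0, 1],
    [1, 2, 0, 0, 1],
    [0, 0, 3, 1, 0],
    [0, 0, 1, 2, 0],
], dtype=float)

# u: term-topic, s: topic importance (singular values), vh: topic-document
u, s, vh = np.linalg.svd(term_doc, full_matrices=False)
print(u.shape, s.shape, vh.shape)

# Keeping only the k largest singular values gives a rank-k approximation
k = 2
approx = u[:, :k] @ np.diag(s[:k]) @ vh[:k, :]
error = np.linalg.norm(term_doc - approx)
print(round(error, 3))
```

The singular values come back sorted from largest to smallest, so truncating to the first `k` keeps the most important topics.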

So **LSA** is essentially **PCA** for text documents: we reduce the features, in our case the words, to a smaller set of topics that capture most of the variance in word usage.

To apply **SVD** to sparse matrices (such as those returned by sklearn's text vectorizers) we use the `TruncatedSVD` class from sklearn. Unlike `PCA`, it doesn't center the data around zero, so it can work directly with our sparse matrices. Note that sklearn's vectorizers put documents in rows and terms in columns, the transpose of the layout described above, so the fitted `components_` holds the topic-term weights.
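
Why centering is a problem for sparse data can be shown with a few numbers. This sketch uses a small dense array with made-up values as a stand-in for a TF-IDF matrix and counts the non-zero entries before and after subtracting the column means.

```python
import numpy as np

# Hypothetical TF-IDF-like matrix: mostly zeros
x = np.array([
    [0.0, 0.7, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.9],
])

# PCA-style centering subtracts each column's mean from every row
centered = x - x.mean(axis=0)

nonzero_before = np.count_nonzero(x)
nonzero_after = np.count_nonzero(centered)
print(nonzero_before, nonzero_after)
```

Every column that contains a non-zero value becomes fully dense after centering, which would be very costly for a matrix with thousands of columns.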

Let's see how we can apply **LSA** using sklearn.

In [1]:
from sklearn.decomposition import TruncatedSVD

# we will reuse the vectors we have created before
lsa = TruncatedSVD(n_components=6, random_state=42)
lsa.fit(x_train_v)

Notice that the number of components determines how many singular values are computed: setting it to 6 means we only compute the 6 largest singular values, i.e. 6 topics.
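
To check that `n_components` really selects the largest singular values, here is a sketch that compares `TruncatedSVD` with a full SVD on a small random matrix. The matrix is made up and stands in for the vectorized documents.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Hypothetical data: 8 documents, 5 terms
rng = np.random.RandomState(0)
toy = rng.rand(8, 5)

toy_lsa = TruncatedSVD(n_components=2, random_state=0)
# One row per document, one column per topic
doc_scores = toy_lsa.fit_transform(toy)

# All singular values of the same matrix, largest first
full_s = np.linalg.svd(toy, compute_uv=False)

print(toy_lsa.singular_values_)
print(full_s[:2])
print(toy_lsa.components_.shape, doc_scores.shape)
```

Unlike the LDA output, the values in `doc_scores` are projections onto the topic directions: they do not sum to 1 and can be negative, so they should be read as scores rather than probabilities.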

In [144]:
lsa.components_.shape

(6, 3000)

Let's take a look at the topics we have with **LSA**

In [145]:
topics_lsa = {}
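names = vectorizer.get_feature_names_out().tolist()  # scikit-learn >= 1.0; older releases use get_feature_names()

for idx, topic in enumerate(lsa.components_):
    features = topic.argsort()[:-(10-1): -1]  # indices of the 8 largest weights, largest first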
    tokens = [names[i] for i in features]
    topics_lsa[idx] = tokens

In [146]:
for key in topics_lsa:
    print(topics_lsa[key])

['use', 'know', 'like', 'don', 'think', 'just', 'make', 'thank']
['file', 'thank', 'program', 'use', 'window', 'windows', 'color', 'mail']
['offer', 'sale', 'ship', 'sell', 'new', 'drive', 'condition', 'game']
['game', 'team', 'win', 'run', 'pitch', 'year', 'hit', 'score']
['bike', 'ride', 'motorcycle', 'thank', 'know', 'dod', 'look', 'dog']
['thank', 'mail', 'know', 'list', 'send', 'advance', 'post', 'address']


Compare this with the **LDA** topics

In [147]:
for key in topics:
    print(topics[key])

['gun', 'think', 'people', 'don', 'just', 'make', 'say', 'year', 'know', 'like']
['allocate', 'sega', 'genesis', 'lethal', 'cells', 'radiosity', 'ti', 'japan', 'steve', 'favorite']
['sale', 'offer', 'sell', 'ship', 'condition', 'new', 'include', 'price', 'drive', 'ask']
['brave', 'alomar', 'baerga', 'edu', 'tony', 'chop', 'atlanta', 'cnn', 'colorado', 'vacation']
['use', 'thank', 'file', 'program', 'window', 'know', 'graphics', 'windows', 'color', 'mail']
['bike', 'ride', 'pitch', 'game', 'dog', 'say', 'helmet', 'think', 'drive', 'just']


Each algorithm produces its own topics, and we can test both to see which fits our case best.

In [148]:
def model_text_lsa(index):
    print(f"the text to test is:\n {x_test[index]}\n")
    print(f"this text belongs to {data.target_names[y_test[index]]} class\n")
    # LSA returns projection scores (possibly negative), not probabilities
    scores = lsa.transform(x_test_v[index])
    print("topic scores for the input:")
    for i, score in enumerate(scores.flatten()):
        print(f"topic {topics_lsa[i]} score {round(score, 3)}")
    print(f"the highest-scoring topic is {topics_lsa[np.argmax(scores)]}")
    print()

In [149]:
model_text_lsa(0)

the text to test is:
 no he s not nut wip be second to none the sport station they don t have tony bruno work espn radio and al morganti do friday night hockey because they suck i live in richmond va but i visit phila often and on the way i get wtem washington and wip i hear the fan at night wherever i go the signal use to be wnbc when they play golden oldies because you can t avoid it of those three wip have the best host hand down chuck cooperstein isn t a homer and neither be jody mac wtem be too generic to be place in the catergory in fact if you have hear wtem and the fan you notice the theme music be identical same ownership i think so wip be totally original their host actually have a personality this be a knock at tem the team not the fan because mike and the mad dog and sommers be good i mean compare the morning guy in philadelphia to the ones in washington be a total joke anyway i like the fan and wip but i think the edge go to ip when i get back from philly i go into withdra

That's how you use the **LSA** in sklearn.

## Conclusion

### **LDA** and **LSA** are topic modeling algorithms that use dimensionality reduction techniques to model the topics in a corpus.
### They can be used on platforms such as Medium or Quora, for example, to recommend similar articles to the reader.