# Example Topic Analysis using LDA in scikit-learn

A simple example of how to perform Latent Dirichlet Allocation (LDA) to extract topics from a corpus of text. The example uses the [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) implementation of LDA on a sample of the [20 newsgroups dataset (classification)](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html). Vectorisation is performed using [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

In [1]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups


import pyLDAvis
# Note: pyLDAvis 3.4 renamed this module to pyLDAvis.lda_model
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

## Load the dataset

In [2]:
dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data
documents[1]
print(len(documents))

11314


## Convert to Document-Term Matrix

In [3]:
# restrict to returning the top max_features ordered by term frequency across the corpus
no_features = 1000

# CountVectorizer
# There is some pre-processing we can achieve through this, e.g.,

# max_df - When building the vocabulary, ignore terms that have a document frequency
#          strictly higher than the given threshold (a float is a proportion of
#          documents, an integer is an absolute count)
# min_df - When building the vocabulary, ignore terms that have a document frequency
#          strictly lower than the given threshold (same float/integer convention).
#          This value is also called the cut-off in the literature
# token_pattern - Regular expression denoting what constitutes a "token", e.g., token_pattern = r'\b[a-zA-Z]{3,}\b'
#                 would keep only tokens made up entirely of letters, with a minimum length of 3


tf_vectorizer = CountVectorizer(max_df=0.95,
                                min_df=10, 
                                max_features=no_features, 
                                stop_words='english',
                                strip_accents = 'unicode',
                                lowercase = True
                               )

tf = tf_vectorizer.fit_transform(documents)
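# Note: get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out()
tf_feature_names = tf_vectorizer.get_feature_names()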
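To make the `token_pattern` comment above concrete, here is a minimal sketch on a made-up three-sentence corpus (the sentences and variable names are illustrative assumptions, not part of the newsgroups data). It compares the default tokenisation with the letters-only pattern quoted in the comments.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only to illustrate token_pattern
toy_docs = [
    "The GPU has 16 MB of RAM",
    "My PC runs at 33 MHz",
    "RAM prices fell in 1993",
]

# Default pattern: tokens of two or more word characters (letters, digits, underscore)
default_vec = CountVectorizer().fit(toy_docs)

# Letters-only pattern from the comments: at least three letters, no digits
letters_vec = CountVectorizer(token_pattern=r'\b[a-zA-Z]{3,}\b').fit(toy_docs)

default_vocab = sorted(default_vec.vocabulary_)
letters_vocab = sorted(letters_vec.vocabulary_)

print("default:     ", default_vocab)
print("letters only:", letters_vocab)
```

With the letters-only pattern, purely numeric tokens such as `16` and short tokens such as `mb` drop out of the vocabulary, which is one way to avoid topics made up entirely of numbers.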

Let's look at the matrix `tf`: it is sparse (many 0 entries) and has dimensions D x N, where D is the number of documents and N is the number of features or tokens (words).

In [4]:
tf

<11314x1000 sparse matrix of type '<class 'numpy.int64'>'
	with 286645 stored elements in Compressed Sparse Row format>

In [5]:
print(tf.shape)

(11314, 1000)
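As a rough check on how sparse the matrix is, the figures printed above (11314 documents, 1000 terms, 286645 stored elements) can be turned into a density. This is only a sketch of the arithmetic, with the numbers copied from the outputs rather than read from `tf`; on the real matrix, `tf.nnz` and `tf.shape` would supply the same values.

```python
# Figures copied from the cell outputs above
n_docs, n_terms = 11314, 1000
stored_elements = 286645

# Fraction of cells in the D x N matrix that hold a non-zero count
density = stored_elements / (n_docs * n_terms)

# Average number of distinct vocabulary terms appearing in a document
avg_terms_per_doc = stored_elements / n_docs

print(f"density: {density:.2%}")
print(f"average distinct terms per document: {avg_terms_per_doc:.1f}")
```

Only about 2.5% of the entries are non-zero, which is why the matrix is stored in a compressed sparse format.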


Let's look at the matrix itself as a dense array: it is sparse, with many 0s.

In [6]:
tf.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

Take a look at some of the tokens:

In [7]:
tf_feature_names[200:210]

['cause',
 'cd',
 'center',
 'certain',
 'certainly',
 'chance',
 'change',
 'changed',
 'changes',
 'check']

## Perform LDA
We have to define the number of topics. Here we know that we are expecting 20 different newsgroups, so we look for 20. In other datasets we may need to try several different numbers of topics and see which returns the most useful results.

In [8]:
no_topics = 20

# Run LDA
lda = LatentDirichletAllocation(n_components=no_topics, max_iter=5, learning_method='online', learning_offset=50.,random_state=0).fit(tf)
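When the number of topics is not known in advance, one rough heuristic is to fit models with several values of `n_components` and compare them. The sketch below does this on a tiny made-up corpus using the model's `perplexity` method (lower is nominally better). The corpus, the candidate values and the use of perplexity are all illustrative assumptions; in practice perplexity is better measured on held-out documents, and reading the topics yourself is often more informative.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only to show the comparison loop
toy_docs = [
    "the team won the hockey game",
    "players scored in the final game",
    "the disk drive failed on my pc",
    "install the scsi drive and boot windows",
    "the church teaches the bible",
    "people believe in god and the bible",
]
toy_tf = CountVectorizer(stop_words='english').fit_transform(toy_docs)

# Fit one model per candidate number of topics and record its perplexity
perplexities = {}
for k in (2, 3, 4):
    model = LatentDirichletAllocation(n_components=k, random_state=0).fit(toy_tf)
    perplexities[k] = model.perplexity(toy_tf)

for k, value in perplexities.items():
    print(f"n_components={k}: perplexity={value:.1f}")
```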

## Show Topics
Some topics make sense - some don't!

#### Question - Can you think of ways to improve this?
#### What preprocessing would you include? Have a play with the CountVectorizer settings above and re-run.

In [9]:
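def display_topics(model, feature_names, no_top_words):
    # Each row of model.components_ holds one topic's weight for every term
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        # argsort is ascending, so slice from the end to list the highest-weighted terms first
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))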

no_top_words = 10
display_topics(lda, tf_feature_names, no_top_words)


Topic 0:
00 50 25 price sale new 20 shipping 10 15
Topic 1:
time high good problems low like don run better soon
Topic 2:
use point case question make does people used possible way
Topic 3:
10 11 14 12 17 16 15 13 18 20
Topic 4:
card bit memory color video speed drivers mode data 16
Topic 5:
thanks window db does application know help hi widget advance
Topic 6:
key chip encryption keys clipper security privacy public use algorithm
Topic 7:
edu file com available ftp files program version image mail
Topic 8:
game team games year play season hockey league players win
Topic 9:
god jesus people believe does say christian bible life think
Topic 10:
drive windows disk scsi use mac problem dos hard pc
Topic 11:
car cars church com engine article used perfect jim true
Topic 12:
book good books ago water years left best year radio
Topic 13:
ax max b8f g9v a86 pl 145 1d9 0t 34u
Topic 14:
don just like know think ve really ll going people
Topic 15:
output university current use section medical gr

## Assign topic probabilities to the documents

In [10]:
doc_topic_distrib = lda.transform(tf)

This assigns each document a probability distribution over the topics, so let's look at the document at index 1 (the second document).

In [11]:
doc_topic_distrib[1]

array([0.00151515, 0.00151515, 0.00151515, 0.00151515, 0.00151515,
       0.00151515, 0.00151515, 0.12000064, 0.00151515, 0.23673105,
       0.00151515, 0.09264141, 0.00151515, 0.00151515, 0.52638447,
       0.00151515, 0.00151515, 0.00151515, 0.00151515, 0.00151515])

We can pick out the top matching topics for that document.
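A minimal sketch of picking out the top topics, using the distribution printed above copied into a literal array so that the example runs on its own. The names `doc_topics` and `top_n` are illustrative; in the notebook the same indexing would apply to `doc_topic_distrib[1]`.

```python
import numpy as np

# Topic distribution for the document at index 1, copied from the output above
doc_topics = np.array([
    0.00151515, 0.00151515, 0.00151515, 0.00151515, 0.00151515,
    0.00151515, 0.00151515, 0.12000064, 0.00151515, 0.23673105,
    0.00151515, 0.09264141, 0.00151515, 0.00151515, 0.52638447,
    0.00151515, 0.00151515, 0.00151515, 0.00151515, 0.00151515,
])

top_n = 3  # how many topics to report (arbitrary choice)

# argsort is ascending, so reverse it and keep the first top_n indices
top_topics = np.argsort(doc_topics)[::-1][:top_n]

for topic_idx in top_topics:
    print(f"Topic {topic_idx}: {doc_topics[topic_idx]:.3f}")
```

For this document the strongest match is topic 14, followed by topics 9 and 7.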

## Visualising the Model

In [13]:
pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)

