## Text Mining: Topic Modeling
To apply topic modeling with LDA, we adapt the example at that URL. We use a vectorizer to obtain the term frequency (tf) of every word in each document. This vectorizer is CountVectorizer.

We are going to use the 20 newsgroups corpus and select two groups: alt.atheism and sci.space.

In [27]:
# Import all libraries 
import nltk 
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
import numpy as np

#### Load the corpus of texts

In [17]:
# Load the selected categories (subset='all' combines the training and test sets)
categories = ['alt.atheism','sci.space']
print("Loading 20 newsgroups dataset for categories:", categories)

print(categories)
print()

dataset = fetch_20newsgroups(subset='all', categories=categories,
                             shuffle=True, random_state=42)

trainCorpus = dataset.data
print("%d documents" % len(trainCorpus))

Loading 20 newsgroups dataset for categories: ['alt.atheism', 'sci.space']
['alt.atheism', 'sci.space']

1786 documents


#### Text vectorization

In [19]:
# Create the vectorizer 
vectorizer = CountVectorizer()

# Fit the vectorizer on the corpus to learn the vocabulary
vectorizer.fit(trainCorpus)

# Transform the documents into a matrix of term frequencies (tf)
tfMatrix = vectorizer.transform(trainCorpus)

# Print the matrix.
# Converted to a dense array, this matrix reads as follows:
#    - Each column is one feature (term),
#    - Each row is one document of the corpus,
#    - The (i, j) value is the frequency of feature j in document i
print('\ntf Matrix:\n', tfMatrix.toarray())

# Print the shape of our matrix:
print("number of sentences %d, number of features %d" % tfMatrix.shape)

# Get all features 
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
tf_feature_names = vectorizer.get_feature_names_out()


tf Matrix:
 [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
number of sentences 1786, number of features 28382
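
To make the layout of the term-frequency matrix concrete, here is a minimal pure-Python sketch on a hypothetical three-document corpus. It assumes naive whitespace tokenization, so it only approximates what `CountVectorizer` does (the real tokenizer uses a regular expression and ignores one-character tokens); the names `docs`, `vocabulary` and `tf_matrix` are illustrative.

```python
from collections import Counter

# Hypothetical toy corpus, used only to illustrate the matrix layout.
docs = [
    "the rocket reached orbit",
    "the launch was delayed",
    "orbit after launch",
]

# Naive whitespace tokenization (an assumption, not CountVectorizer's tokenizer).
tokenized = [doc.lower().split() for doc in docs]

# The vocabulary is the sorted set of all terms; each term owns one column.
vocabulary = sorted({term for tokens in tokenized for term in tokens})

# One row per document, one column per term, holding raw counts.
tf_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    tf_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in tf_matrix:
    print(row)
```

Each row has as many entries as there are vocabulary terms, which is why the real matrix above has 28382 columns and is mostly zeros.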


#### Apply LDA Algorithm 

In [21]:
topics = 10

lda_model = LatentDirichletAllocation(n_components=topics, max_iter=5,
                                      learning_method='online', 
                                      learning_offset=50.,
                                      random_state=0).fit(tfMatrix)

# Variational parameters for the topic-word distribution.
#  Since the complete conditional for the topic-word distribution is a Dirichlet,
#  components_[i, j] can be viewed as a pseudocount that represents the number
#  of times word j was assigned to topic i. After normalization, each row can
#  also be viewed as a distribution over the words for that topic:
topic_word_distribution = lda_model.components_

# Document topic distribution for tfMatrix.
document_topic_distribution = lda_model.transform(tfMatrix)

print(topic_word_distribution.shape)
print(document_topic_distribution.shape)


(10, 28382)
(1786, 10)
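
The normalization mentioned in the comments can be sketched as follows. The arrays are small hypothetical stand-ins (`pseudocounts` for `components_`, `doc_topic` for the output of `transform`), so the numbers are illustrative only.

```python
import numpy as np

# Hypothetical pseudocounts for 2 topics over 4 terms (stand-in for components_).
pseudocounts = np.array([
    [8.0, 1.0, 0.5, 0.5],
    [1.0, 1.0, 6.0, 2.0],
])

# Divide each row by its sum so every topic becomes a distribution over terms.
topic_word_probs = pseudocounts / pseudocounts.sum(axis=1, keepdims=True)
print(topic_word_probs)

# Hypothetical document-topic matrix for 3 documents (rows already sum to 1).
doc_topic = np.array([
    [0.90, 0.10],
    [0.30, 0.70],
    [0.45, 0.55],
])

# The dominant topic of each document is the column with the largest weight.
dominant_topic = doc_topic.argmax(axis=1)
print(dominant_topic)
```

The same row-wise division should apply to `topic_word_distribution`, assuming it holds the unnormalized `components_` as in the cell above.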


In [24]:

# Display the topics given:
#  - H: Variational parameters for the topic-word distribution.
#  - W: Document-topic distribution.
#  - feature_names: Name of each corpus feature (term)
#  - documents: Corpus
#  - no_top_words: Number of top-weighted words to display per topic
#  - no_top_documents: Number of top-weighted corpus documents to display per topic
def display_topics(H, W, feature_names, documents, no_top_words, no_top_documents):
    for topic_idx, topic in enumerate(H):
        print("\nTopic %d:" % (topic_idx))
        for i in topic.argsort()[:-no_top_words - 1:-1]:
            print(" ",feature_names[i])
        top_doc_indices = np.argsort( W[:,topic_idx] )[::-1][0:no_top_documents]
        for doc_index in top_doc_indices:
            print(documents[doc_index])
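
The slice `topic.argsort()[:-no_top_words - 1:-1]` used in `display_topics` is compact, so here is a small sketch of what it selects. The weights and term names are hypothetical.

```python
import numpy as np

# Hypothetical weights of one topic over five terms.
topic = np.array([0.2, 3.0, 0.1, 1.5, 0.7])
feature_names = ["moon", "space", "cat", "launch", "orbit"]
no_top_words = 3

# argsort returns the indices ordered from smallest to largest weight.
ascending = topic.argsort()

# A negative step walks backwards from the last position and stops before
# position -no_top_words - 1, which yields the indices of the top-N weights.
top_indices = ascending[:-no_top_words - 1:-1]

top_terms = [feature_names[i] for i in top_indices]
print(top_terms)
```

The result lists the terms in decreasing order of weight, which is the order in which `display_topics` prints them.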


In [28]:
no_top_words = 4
no_top_documents = 0

display_topics(topic_word_distribution, document_topic_distribution, 
               tf_feature_names, trainCorpus, no_top_words, no_top_documents)


Topic 0:
  af
  afit
  mil
  elements

Topic 1:
  het
  een
  van
  te

Topic 2:
  kadie
  of
  the
  pto

Topic 3:
  space
  launch
  satellite
  pub

Topic 4:
  arms
  the
  permanet
  prado

Topic 5:
  de
  van
  het
  een

Topic 6:
  the
  of
  to
  and

Topic 7:
  the
  of
  alcbel
  devdjn

Topic 8:
  wam
  the
  comet
  emr

Topic 9:
  the
  globe
  xpresso
  of
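
Several of the topics above are dominated by function words such as "the", "of", "to" and "and". One common remedy, shown here only as a sketch on a hypothetical two-document corpus, is to pass `stop_words="english"` to `CountVectorizer` so that those terms never enter the vocabulary. Whether this improves the topics on the real corpus is not verified here, and the built-in English list would not remove the Dutch words ("het", "een", "van") seen in topics 1 and 5.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for trainCorpus.
docs = [
    "The launch of the satellite was delayed",
    "The orbit of the comet and the launch window",
]

# Default vectorizer: keeps every token of two or more characters.
plain = CountVectorizer().fit(docs)

# Same vectorizer with scikit-learn's built-in English stop-word list.
filtered = CountVectorizer(stop_words="english").fit(docs)

print(sorted(plain.vocabulary_))
print(sorted(filtered.vocabulary_))
```

In the notebook, the same argument could be added where `vectorizer` is created, after which the LDA model would need to be refitted on the new matrix.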
