# Automatic Learning of Key Phrases and Topics in Document Collections

## Part 3: Topic Modeling Training and Summarization

### Overview

This notebook is Part 3 of 6 in a series providing a step-by-step description of how to process and analyze the contents of a large collection of text documents in an unsupervised manner. Using Python packages and custom code examples, we have implemented the basic framework that combines key phrase learning and latent topic modeling as described in the paper entitled ["Modeling Multiword Phrases with Constrained Phrases Tree for Improved Topic Modeling of Conversational Speech"](http://people.csail.mit.edu/hazen/publications/Hazen-SLT-2012.pdf), which was originally presented at the 2012 IEEE Workshop on Spoken Language Technology.

Although the paper examines the use of the technology for analyzing human-to-human conversations, the techniques are quite general and can be applied to a wide range of natural language data including news stories, legal documents, research publications, social media forum discussions, customer feedback forms, product reviews, and many more.

Part 3 of the series shows how to train a topic model on a collection of text documents and how to use the topic model to summarize the contents of the corpus. The training is applied to text generated from the preprocessing and phrase learning stages presented in Parts 1 and 2.  


### Import Relevant Python Packages

Most significantly, Part 3 relies on the use of the [Gensim Python library](http://radimrehurek.com/gensim/)  for generating a sparse bag-of-words representation of each document and then training a [Latent Dirichlet Allocation (LDA)](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) model on the data. LDA produces a collection of latent topics learned in a completely unsupervised fashion from the text data. Each document can then be represented with a distribution of the learned topics.

In [1]:
import numpy
import pandas 
import re
import math
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
import gensim
from gensim import corpora
from gensim import models
from operator import itemgetter
from collections import namedtuple
import time
import gc
import sys
import os
import multiprocessing

from azureml.logging import get_azureml_logger
aml_logger = get_azureml_logger()   # logger writes to AMLWorkbench runtime view
aml_logger.log('amlrealworld.document-collection-analysis.notebook3', 'true')

<azureml.logging.script_run_request.ScriptRunRequest at 0x2000c4fa940>

### Load Text Data

> **NOTE** The data file is saved under the folder defined by the environment variable `AZUREML_NATIVE_SHARE_DIRECTORY` in notebook 1. If you changed that location to `../Data`, make the same change here.

In [2]:
# Load full TSV file including a column of text
frame = pandas.read_csv(os.path.join(os.environ['AZUREML_NATIVE_SHARE_DIRECTORY'], "CongressionalDocsProcessed.tsv"), 
                        sep='\t',
                        encoding='ISO-8859-1')

In [3]:
print ("Total docs in corpus: %d\n" % len(frame))

# Show the first five rows of the data in the frame
frame[0:5]

Total docs in corpus: 297462



| Unnamed: 0 | DocID | ProcessedText |
|---|---|---|
| 0 | hconres1-93 | provides that effective from january_3 1973 th... |
| 1 | hconres2-93 | makes_it_the_sense_of_the_congress that the po... |
| 2 | hconres3-93 | establishes a joint congressional_committee on... |
| 3 | hconres4-93 | makes_it_the_sense_of_the_congress that the pr... |
| 4 | hconres5-93 | makes_it_the_sense_of_the_congress that the co... |


### Load the Stop Word Lists
Latent topic models attempt to restrict the topic learning process to content-bearing words by excluding non-content-bearing <b><i>stop words</i></b>. Stop word lists are typically manually crafted and include common function words such as articles, conjunctions, prepositions, pronouns, etc.

In [4]:
# Define a function for loading lists into dictionary hash tables
def LoadListAsHash(filename):
    listHash = {}
    fp = open(filename, encoding='utf-8')

    # Read in lines one by one stripping away extra spaces, 
    # leading spaces, and trailing spaces and inserting each
    # cleaned up line into a hash table
    re1 = re.compile(' +')
    re2 = re.compile('^ +| +$')
    for stringIn in fp.readlines():
        term = re2.sub("", re1.sub(" ", stringIn.strip('\n')))
        if term != '':
            listHash[term] = 1

    fp.close()
    return listHash 

In [5]:
# Load the stop-list of non-content bearing function words
stopwordHash = LoadListAsHash(os.path.join(os.environ['AZUREML_NATIVE_SHARE_DIRECTORY'], "function_words.txt"))

# Additional words can also be manually added to the stop word list as needed
stopwordHash["foo"] = 1
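
As a minimal sketch of how such a hash is built and used, the snippet below normalizes a few raw lines the same way `LoadListAsHash` does and then filters a token list against the result. The sample lines and tokens are made up for illustration and are not taken from `function_words.txt`.

```python
import re

# Hypothetical raw lines, standing in for the contents of a stop word file
raw_lines = ["  the \n", "and\n", "\n", "in  order   to\n"]

# Same normalization as LoadListAsHash: collapse runs of spaces, then trim the ends
multi_space = re.compile(' +')
edge_space = re.compile('^ +| +$')

stop_words = {}
for line in raw_lines:
    term = edge_space.sub("", multi_space.sub(" ", line.strip('\n')))
    if term != '':
        stop_words[term] = 1

# Keep only the tokens that are not in the stop word hash
tokens = "the department_of_labor and the congress".split()
content_tokens = [t for t in tokens if t not in stop_words]

print(sorted(stop_words))
print(content_tokens)
```

The empty line is dropped, and the multi-word entry is stored with single spaces between its words.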

### Load the Mapping of Lower-Cased Vocabulary Items to Their Most Common Surface Form

In [6]:
# Load surface form mappings here
fp = open(os.path.join(os.environ['AZUREML_NATIVE_SHARE_DIRECTORY'], "Vocab2SurfaceFormMapping.tsv"), 
          encoding='utf-8')

vocabToSurfaceFormHash = {}

# Each line in the file has two tab separated fields;
# the first is the vocabulary item used during modeling
# and the second is its most common surface form in the 
# original data
for stringIn in fp.readlines():
    fields = stringIn.strip().split("\t")
    if len(fields) != 2:
        print ("Warning: Bad line in surface form mapping file: %s" % stringIn)
    elif fields[0] == "" or fields[1] == "":
        print ("Warning: Bad line in surface form mapping file: %s" % stringIn)
    else:
        vocabToSurfaceFormHash[fields[0]] = fields[1]
fp.close()
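
The sketch below illustrates one way such a mapping might be used when displaying results: look up each vocabulary item and fall back to the item itself when no surface form is known. The sample lines and the helper name `to_surface_form` are hypothetical and are not part of the original notebook.

```python
# Hypothetical lines in the two-column, tab-separated format described above
sample_lines = [
    "united_states\tUnited States\n",
    "department_of_labor\tDepartment of Labor\n",
    "bad line without a tab\n",
]

mapping = {}
for line in sample_lines:
    fields = line.strip().split("\t")
    # Accept only lines with exactly two non-empty fields
    if len(fields) == 2 and fields[0] != "" and fields[1] != "":
        mapping[fields[0]] = fields[1]

def to_surface_form(term, mapping):
    """Return the surface form for a vocabulary item, or the item itself if unmapped."""
    return mapping.get(term, term)

print(to_surface_form("united_states", mapping))
print(to_surface_form("tribal", mapping))
```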





### Do Topic Modeling on Corpus using Latent Dirichlet Allocation (LDA)

#### Create the Vocabulary Used for Topic Modeling

In [7]:
def CreateVocabForTopicModeling(textData,stopwordHash):

    print ("Counting words")
    numDocs = len(textData) 
    globalWordCountHash = {} 
    globalDocCountHash = {} 
    for textLine in textData:
        docWordCountHash = {}
        for word in str(textLine).split():
            if word in globalWordCountHash:
                globalWordCountHash[word] += 1
            else:
                globalWordCountHash[word] = 1
            if word not in docWordCountHash: 
                docWordCountHash[word] = 1
                if word in globalDocCountHash:
                    globalDocCountHash[word] += 1
                else:
                    globalDocCountHash[word] = 1

    minWordCount = 5;
    minDocCount = 2;
    maxDocFreq = .25;
    vocabCount = 0;
    vocabHash = {}

    excStopword = 0
    excNonalphabetic = 0
    excMinwordcount = 0
    excNotindochash = 0
    excMindoccount = 0
    excMaxdocfreq =0

    print ("Building vocab")
    for word in globalWordCountHash.keys():
        # Test vocabulary exclusion criteria for each word
        if ( word in stopwordHash ):
            excStopword += 1
        elif ( not re.search(r'[a-zA-Z]', word, 0) ):
            excNonalphabetic += 1
        elif ( globalWordCountHash[word] < minWordCount ):
            excMinwordcount += 1
        elif ( word not in globalDocCountHash ):
            print ("Warning: Word '%s' not in doc count hash" % (word))
            excNotindochash += 1
        elif ( globalDocCountHash[word] < minDocCount ):
            excMindoccount += 1
        elif ( float(globalDocCountHash[word])/float(numDocs) > maxDocFreq ):
            excMaxdocfreq += 1
        else:
            # Add word to vocab
            vocabHash[word]= globalWordCountHash[word];
            vocabCount += 1 
    print ("Excluded %d stop words" % (excStopword))       
    print ("Excluded %d non-alphabetic words" % (excNonalphabetic))  
    print ("Excluded %d words below word count threshold" % (excMinwordcount)) 
    print ("Excluded %d words below doc count threshold" % (excMindoccount))
    print ("Excluded %d words above max doc frequency" % (excMaxdocfreq)) 
    print ("Final Vocab Size: %d words" % vocabCount)
            
    return vocabHash
                    
vocabHash = CreateVocabForTopicModeling(frame['ProcessedText'], stopwordHash)

Counting words
Building vocab
Excluded 297 stop words
Excluded 18501 non-alphabetic words
Excluded 70004 words below word count threshold
Excluded 276 words below doc count threshold
Excluded 1 words above max doc frequency
Final Vocab Size: 68145 words
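
To make the exclusion criteria concrete, here is a small sketch that applies the same sequence of tests to a four-document toy corpus. The documents, stop words, and thresholds are invented for illustration and are much smaller than the values used in `CreateVocabForTopicModeling`.

```python
import re
from collections import Counter

# Toy corpus and stop word set (hypothetical)
docs = [
    "the act amends the tax code",
    "the act funds tax relief",
    "the act renames a senate office office",
    "this act grants relief for 1973 1973",
]
stop_words = {"the", "a", "this", "for"}

# Illustrative thresholds
min_word_count = 2
min_doc_count = 2
max_doc_freq = 0.75

# Count total occurrences and the number of documents containing each word
word_counts = Counter()
doc_counts = Counter()
for doc in docs:
    words = doc.split()
    word_counts.update(words)
    doc_counts.update(set(words))

vocab = {}
excluded = Counter()
for word, count in word_counts.items():
    # Apply the exclusion tests in the same order as the notebook function
    if word in stop_words:
        excluded["stop word"] += 1
    elif not re.search(r'[a-zA-Z]', word):
        excluded["non-alphabetic"] += 1
    elif count < min_word_count:
        excluded["word count"] += 1
    elif doc_counts[word] < min_doc_count:
        excluded["doc count"] += 1
    elif doc_counts[word] / len(docs) > max_doc_freq:
        excluded["doc frequency"] += 1
    else:
        vocab[word] = count

print(vocab)
print(dict(excluded))
```

In this toy corpus, `office` occurs twice but only in one document, so it fails the document count test, while `act` appears in every document and is removed by the maximum document frequency test.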


Show that the stop word "and" is not in the vocabulary

In [8]:
'and' in vocabHash

False

Show a learned phrase is in the vocabulary

In [9]:
'department_of_labor' in vocabHash

True

The vocabulary hash table maps each vocabulary item to its total count in the data set.

In [10]:
vocabHash["department_of_labor"]

1944

Print the 10 most frequent non-excluded words in the vocabulary:

In [11]:
sorted(vocabHash.items(), key=lambda x: -x[1])[0:10]

[('requires', 129566),
 ('provides', 102427),
 ('state', 98103),
 ('program', 96471),
 ('including', 74228),
 ('certain', 68231),
 ('provide', 66405),
 ('programs', 62842),
 ('united_states', 62605),
 ('states', 56722)]

#### Convert the Text Data Into a Sparse Vector Format

In [12]:
# Start by tokenizing the full text string for each document into list of tokens
# Any token that is not in the pre-defined set of acceptable vocabulary words is excluded
def TokenizeText(textData,vocabHash):
    tokenizedText = []
        
    for textLine in textData:
        tokenizedText.append([token for token in str(textLine).split() if token in vocabHash])    
    return tokenizedText
    
tokenizedDocs = TokenizeText(frame['ProcessedText'], vocabHash)

Examine the tokenization of the first two documents:

In [13]:
tokenizedDocs[0:2]

[['provides',
  'effective',
  'january_3',
  'joint_committee',
  'created',
  'make',
  'necessary',
  'arrangements',
  'inauguration',
  'president-elect_and_vice_president-elect',
  'united_states',
  '20th',
  'day',
  'january',
  'continued',
  'purpose',
  'power',
  'authority',
  'conferred',
  'senate',
  'concurrent_resolution',
  'ninety-second',
  'congress'],
 ['makes_it_the_sense_of_the_congress',
  'pollution',
  'waters',
  'all',
  'world',
  'matter',
  'vital',
  'concern',
  'all_nations',
  'dealt',
  'matter',
  'highest_priority',
  'makes_it_the_sense_of_the_congress',
  'president',
  'acting',
  'united_states',
  'delegation',
  'united',
  'national_conference',
  'human_environment',
  'steps',
  'necessary',
  'propose',
  'international_agreement',
  'amendments',
  'existing',
  'international_agreements',
  'appropriate',
  'providing',
  'coordinated',
  'international',
  'activites',
  'prohibit',
  'disposal',
  'munitions',
  'chemicals',
  'che

Count the total number of vocabulary tokens used over the entire corpus 

In [14]:
numTokens = 0

for i in range(0,len(tokenizedDocs)):
    numTokens += len(tokenizedDocs[i])
print("Total number of retained tokens: %d" % numTokens)

Total number of retained tokens: 22649513


In [15]:
# Create a dictionary mapping string tokens in the text to unique token IDs
dictionary = corpora.Dictionary(tokenizedDocs)

# If the reverse mapping for token ids back to string doesn't exist then create the mapping
# in the form of a list where the list index is the tokenID and the list value is the token
if len(dictionary.id2token) == 0:
    numTokens = len(dictionary.token2id);
    id2token = numTokens * [""];
    for token in dictionary.token2id:
        tokenID = dictionary.token2id[token]
        if tokenID < numTokens:
            id2token[tokenID] = token
        else: 
            print ("Warning: token id %d for token '%s' exceeds max index of %d" % (tokenID,token,numTokens-1))
    for i in range(0,numTokens):
        if id2token[i] == "":
            print ("Warning: token id %d has an empty token" % i)
else:
    id2token = dictionary.id2token
    

In [16]:
# The mapping from unique token ids to strings uses the id2token element of the dictionary
for i in range(0, 10):
    print ("Token ID %d --> %s" % (i, id2token[i]))

# The mapping from strings to unique token ids uses the token2id element of the dictionary   
print ("")
print ("%s --> Token ID %d" % ('spent', dictionary.token2id['spent']))

Token ID 0 --> day
Token ID 1 --> necessary
Token ID 2 --> purpose
Token ID 3 --> created
Token ID 4 --> effective
Token ID 5 --> president-elect_and_vice_president-elect
Token ID 6 --> joint_committee
Token ID 7 --> january_3
Token ID 8 --> ninety-second
Token ID 9 --> power

spent --> Token ID 9046
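
The two directions of this mapping can be illustrated without Gensim. In the sketch below a plain dict stands in for `dictionary.token2id`, and IDs are assigned in order of first appearance; that ordering is an assumption made for the example and is not necessarily how Gensim assigns IDs.

```python
# Toy tokenized documents (hypothetical)
tokenized = [["tax", "relief", "tax"], ["relief", "tribal"]]

# Assign each previously unseen token the next free integer ID
token2id = {}
for doc in tokenized:
    for token in doc:
        if token not in token2id:
            token2id[token] = len(token2id)

# Invert the mapping into a list where the index is the token ID
id2token = [""] * len(token2id)
for token, token_id in token2id.items():
    id2token[token_id] = token

print(token2id)
print(id2token)
```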


Create a Gensim corpus structure

In [17]:
corpus =[dictionary.doc2bow(tokens) for tokens in tokenizedDocs]

Show that the corpus structure models each tokenized document as a sparse list of (token ID, count) pairs, where the count records how often that token appeared in the document.

In [18]:
print(tokenizedDocs[1])
print("spent:", dictionary.token2id['spent'])
print (corpus[1])

['makes_it_the_sense_of_the_congress', 'pollution', 'waters', 'all', 'world', 'matter', 'vital', 'concern', 'all_nations', 'dealt', 'matter', 'highest_priority', 'makes_it_the_sense_of_the_congress', 'president', 'acting', 'united_states', 'delegation', 'united', 'national_conference', 'human_environment', 'steps', 'necessary', 'propose', 'international_agreement', 'amendments', 'existing', 'international_agreements', 'appropriate', 'providing', 'coordinated', 'international', 'activites', 'prohibit', 'disposal', 'munitions', 'chemicals', 'chemical_munitions', 'military', 'material', 'pollutants', 'territorial_waters', 'contiguous', 'zones', 'deep_seabed', 'international', 'waters', 'prevent', 'pollution', 'waters', 'world']
spent: 9046
[(1, 1), (18, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 2), (29, 1), (30, 1), (31, 2), (32, 1), (33, 1), (34, 2), (35, 2), (36, 1), (37, 1), (38, 3), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (
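
The sparse bag-of-words idea can be sketched in plain Python as follows. This is an illustration of the data structure only, not Gensim's implementation; the token IDs and the helper name `to_bow` are hypothetical.

```python
from collections import Counter

# Hypothetical token-to-ID mapping and a short tokenized document
token2id = {"matter": 0, "pollution": 1, "waters": 2, "world": 3}
tokens = ["pollution", "waters", "world", "matter", "matter",
          "waters", "pollution", "waters", "world"]

def to_bow(tokens, token2id):
    """Return a sorted list of (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

bow = to_bow(tokens, token2id)
print(bow)
```

Only tokens that actually occur in the document appear in the list, which is what keeps the representation sparse for a vocabulary of tens of thousands of items.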

#### Train an LDA Topic Model Using the Gensim Package 

Find out how many CPU cores are available in the compute context.

In [19]:
numCPUs = multiprocessing.cpu_count()
print ("Total number of CPUs:", numCPUs)

Total number of CPUs: 8


> **NOTE:** The execution time to train an LDA topic model depends on multiple factors such as the size of the corpus, the hyperparameter configuration, and the number of cores on the machine. Using multiple CPU cores trains a model faster. However, with the same hyperparameter settings, more cores mean fewer updates during training. It is recommended to have **at least 100 updates to train a converged LDA model**. The relationship between number of updates and hyperparameters is discussed in this post and this post. In our tests, it took about 3 hours to train an LDA model with 200 topics using the configuration of `workers=15`, `passes=10`, `chunksize=1000` on a machine with 16 cores (2.0 GHz).

> By default, the notebook learns an LDA model with 200 topics. If you just want to try it out, change the variable `numTopics` to a smaller number, but be careful since this may affect the execution of the downstream notebooks.

> **Note that if you re-train the LDA model, you may not get exactly the same LDA topic model due to randomization.**
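
As a rough back-of-the-envelope check of the 100-update guideline, the sketch below assumes that multi-core training performs one model update per batch of `workers * chunksize` documents, so that the number of updates is approximately `passes * ceil(numDocs / (workers * chunksize))`. This formula is an assumption about Gensim's behavior rather than something stated in this notebook, and the helper name `estimate_updates` is hypothetical.

```python
import math

def estimate_updates(num_docs, workers, chunksize, passes):
    """Approximate number of model updates (assumed: one per workers*chunksize docs)."""
    updates_per_pass = math.ceil(num_docs / (workers * chunksize))
    return passes * updates_per_pass

num_docs = 297462  # corpus size reported earlier in this notebook

# Configuration quoted in the note above (16-core machine)
updates_16_core = estimate_updates(num_docs, workers=15, chunksize=1000, passes=10)

# Configuration of this notebook's training cell on an 8-core machine (7 workers)
updates_8_core = estimate_updates(num_docs, workers=7, chunksize=1000, passes=5)

print(updates_16_core, updates_8_core)
```

Under this assumption both configurations stay above 100 updates; halving `passes` or doubling `chunksize` would roughly halve the estimate.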

In [20]:
# Set the number of topics to be learned to 200
numTopics=200

if numCPUs > 1:
    numWorkers = numCPUs - 1
else:
    numWorkers = 1
numIterations = 2000

# Train LDA model 

# To train a new LDA model from scratch, set this flag to True;
# set it to False to load a previously saved model instead
retrain = True

if retrain:
    # If we only have one core available use standard training
    if numWorkers == 1:
        lda = gensim.models.ldamodel.LdaModel(corpus, 
                                              id2word=dictionary, 
                                              num_topics=numTopics, 
                                              iterations=numIterations,
                                              passes=5,
                                              chunksize=1000,
                                              random_state=1,
                                              offset=1.0)
   
    # If we have multiple cores available, then use distributed multi-core training
    if numWorkers > 1:
        lda = gensim.models.ldamulticore.LdaMulticore(corpus, 
                                                      id2word=dictionary, 
                                                      num_topics=numTopics,
                                                      iterations=numIterations,
                                                      workers = numWorkers,
                                                      passes=5,
                                                      chunksize=1000,
                                                      random_state=1,
                                                      offset=1.0)


In [21]:
# Saving and loading a trained LDA model
ldaFile = os.path.join(os.environ['AZUREML_NATIVE_SHARE_DIRECTORY'], "CongressionalDocsLDA.pickle")

# A newly trained model (retrain is True) is saved to file
if retrain:
    # Save the trained LDA model
    lda.save(ldaFile)

# Otherwise (retrain is False) a pre-existing LDA model is loaded from file
else:
    # Load the trained model
    lda = gensim.models.ldamodel.LdaModel.load(ldaFile)

#### Accessing the Contents of the LDA Model

To see the accessible variables in the LDA model structure:

In [22]:
lda.__dict__

{'__ignoreds': ['dispatcher', 'state', 'id2word'],
 '__numpys': ['expElogbeta'],
 '__recursive_saveloads': [],
 '__scipys': [],
 'alpha': array([ 0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,
         0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,
         0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,
         0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,
         0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,
         0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,
         0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,
         0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,
         0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,
         0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,
         0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,
         0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,  0.005,

To print out the internal help document for the LDA model class you can use the help() function.

In [23]:
# To print the help document for the LDA model class
help(lda)


Help on LdaMulticore in module gensim.models.ldamulticore object:

class LdaMulticore(gensim.models.ldamodel.LdaModel)
 |  The constructor estimates Latent Dirichlet Allocation model parameters based
 |  on a training corpus:
 |  
 |  >>> lda = LdaMulticore(corpus, num_topics=10)
 |  
 |  You can then infer topic distributions on new, unseen documents, with
 |  
 |  >>> doc_lda = lda[doc_bow]
 |  
 |  The model can be updated (trained) with new documents via
 |  
 |  >>> lda.update(other_corpus)
 |  
 |  Model persistency is achieved through its `load`/`save` methods.
 |  
 |  Method resolution order:
 |      LdaMulticore
 |      gensim.models.ldamodel.LdaModel
 |      gensim.interfaces.TransformationABC
 |      gensim.utils.SaveLoad
 |      gensim.models.basemodel.BaseTopicModel
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, corpus=None, num_topics=100, id2word=None, workers=None, chunksize=2000, passes=1, batch=False, alpha='symmetric', eta=None, decay

The print_topics(n) method of the LDA model object prints out a random sampling of n different learned topics as represented by the most likely terms in the topic's language model, i.e. the terms that maximize the topic language model P(term|topic).

In [24]:
lda.print_topics(10)

[(166,
  '0.053*"history" + 0.047*"abortion" + 0.045*"honors" + 0.023*"declaration" + 0.021*"amends_federal_law" + 0.021*"content" + 0.020*"rescission" + 0.018*"weight" + 0.018*"save" + 0.017*"contains"'),
 (14,
  '0.040*"lea" + 0.024*"change" + 0.020*"civil_rights" + 0.019*"immigrant" + 0.016*"student\'s" + 0.015*"authorize_the_secretary" + 0.014*"spouses" + 0.014*"references" + 0.014*"withhold" + 0.014*"school_year"'),
 (50,
  '0.131*"tribal" + 0.032*"tribes" + 0.023*"band" + 0.022*"restricted" + 0.021*"replace" + 0.017*"hold" + 0.014*"united_states" + 0.014*"pass" + 0.012*"all" + 0.012*"jurisdiction"'),
 (86,
  '0.079*"travel" + 0.069*"initiative" + 0.035*"update" + 0.032*"fraud" + 0.027*"abuse" + 0.023*"code" + 0.022*"pesticide" + 0.016*"society" + 0.015*"ethics" + 0.010*"travel_expenses"'),
 (113,
  '0.052*"women" + 0.031*"preservation" + 0.029*"boundaries" + 0.026*"cultural" + 0.025*"boundary" + 0.022*"preserve" + 0.016*"women\'s" + 0.015*"historical" + 0.015*"equity_act" + 0.014
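
The idea of summarizing a topic by its most likely terms can be sketched in a few lines of plain Python. The probabilities below are hypothetical values for a single topic (a few are borrowed from the output above), and the helper names `top_terms` and `format_topic` are not part of Gensim.

```python
# Hypothetical P(term|topic) values for a single topic
topic_term_probs = {
    "tribal": 0.131,
    "tribes": 0.032,
    "band": 0.023,
    "hold": 0.017,
    "pass": 0.014,
}

def top_terms(term_probs, n):
    """Return the n most probable (term, probability) pairs, highest first."""
    return sorted(term_probs.items(), key=lambda kv: -kv[1])[:n]

def format_topic(term_probs, n):
    """Format the top n terms in a probability*"term" style similar to the output above."""
    return " + ".join('%.3f*"%s"' % (p, t) for t, p in top_terms(term_probs, n))

summary = format_topic(topic_term_probs, 3)
print(summary)
```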

#### Infer the Document Probability Score P(topic|doc) using the LDA Model

In this section, each document from the corpus is passed into the LDA model which then infers the topic distribution for each document. The topic distributions are collected into a single numpy array.

In [25]:
# To retrieve all topics and their probabilities we must set the LDA minimum probability setting to zero
lda.minimum_probability = 0

# This function generates the topic probabilities for each doc from the trained LDA model
# The probabilities are placed in a single matrix where the rows are documents and columns are topics
def ExtractDocTopicProbsMatrix(corpus,lda):
    # Initialize the matrix
    docTopicProbs = numpy.zeros((len(corpus),lda.num_topics))
    for docID in range(0,len(corpus)):
        for topicProb in lda[corpus[docID]]:
            docTopicProbs[docID,topicProb[0]]=topicProb[1]
    return docTopicProbs    

# docTopicProbs[docID,TopicID] --> P(topic|doc)
docTopicProbs = ExtractDocTopicProbsMatrix(corpus, lda)
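
The following sketch shows the same sparse-to-dense conversion on made-up data, using nested lists in place of a numpy array so that it runs without any dependencies. The sparse results stand in for what `lda[corpus[docID]]` returns and are purely illustrative.

```python
# Hypothetical sparse inference results: one list of (topic_id, probability) per document
sparse_results = [
    [(0, 0.7), (2, 0.3)],
    [(1, 1.0)],
]
num_topics = 3

# Dense matrix as a list of rows; entry [doc][topic] holds P(topic|doc)
doc_topic_probs = [[0.0] * num_topics for _ in sparse_results]
for doc_id, topic_probs in enumerate(sparse_results):
    for topic_id, prob in topic_probs:
        doc_topic_probs[doc_id][topic_id] = prob

# Most likely topic for each document
best_topics = [row.index(max(row)) for row in doc_topic_probs]

print(doc_topic_probs)
print(best_topics)
```

Topics missing from a document's sparse list keep a probability of zero, which is why the notebook sets `lda.minimum_probability = 0` before inference to retrieve every topic.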

In [26]:
# Save the document topic probabilities to file whenever the model was retrained
if retrain:
    docTopicProbsFile = os.path.join(os.environ['AZUREML_NATIVE_SHARE_DIRECTORY'], "CongressionalDocTopicProbs.npy")
    numpy.save(docTopicProbsFile,docTopicProbs)


### Next

The topic modeling step is finished. The next step will be topic model summarization which will be in the fourth notebook of the series: [`4_Topic_Model_Summarization.ipynb`](./4_Topic_Model_Summarization.ipynb).