## Automatic Learning of Key Phrases and Topics in Document Collections

## Part 6: Topic Modeling Interactive Visualization

### Overview

This notebook is Part 6 of 6 in a series that provides a step-by-step description of how to process and analyze the contents of a large collection of text documents in an unsupervised manner. Using Python packages and custom code examples, we have implemented the basic framework that combines key phrase learning and latent topic modeling as described in the paper ["Modeling Multiword Phrases with Constrained Phrases Tree for Improved Topic Modeling of Conversational Speech"](http://people.csail.mit.edu/hazen/publications/Hazen-SLT-2012.pdf), which was originally presented at the 2012 IEEE Workshop on Spoken Language Technology.

Although the paper examines the use of the technology for analyzing human-to-human conversations, the techniques are quite general and can be applied to a wide range of natural language data, including news stories, legal documents, research publications, social media forum discussions, customer feedback forms, product reviews, and many more.

Part 6 of the series shows how to interactively visualize a learned topic model. It covers finding similar topics, finding documents related to a specific topic, finding similar documents, and visualizing topic prevalence and topic evolution over time. The topic model and topic summarizations were generated in Part 3 and Part 4 of the series.


> **NOTE:** If you have retrained your own LDA model, you may not get the same topic model shown in this notebook. For demonstration purposes, all files used in this notebook can be downloaded from the links below. Download them to the `AZUREML_NATIVE_SHARE_DIRECTORY` folder to reproduce the results shown in this notebook.


| File Name | Link |
|-----------|------|
| `CongressionalDocsLDA.pickle` | https://bostondata.blob.core.windows.net/scenario-document-collection-analysis/CongressionalDocsLDA.pickle |
| `CongressionalDocsLDA.pickle.expElogbeta.npy` | https://bostondata.blob.core.windows.net/scenario-document-collection-analysis/CongressionalDocsLDA.pickle.expElogbeta.npy |
| `CongressionalDocsLDA.pickle.id2word` | https://bostondata.blob.core.windows.net/scenario-document-collection-analysis/CongressionalDocsLDA.pickle.id2word |
| `CongressionalDocsLDA.pickle.state` | https://bostondata.blob.core.windows.net/scenario-document-collection-analysis/CongressionalDocsLDA.pickle.state |
| `CongressionalDocsLDA.pickle.state.sstats.npy` | https://bostondata.blob.core.windows.net/scenario-document-collection-analysis/CongressionalDocsLDA.pickle.state.sstats.npy |
| `CongressionalDocTopicLM.npy` | https://bostondata.blob.core.windows.net/scenario-document-collection-analysis/CongressionalDocTopicLM.npy |
| `CongressionalDocTopicProbs.npy` | https://bostondata.blob.core.windows.net/scenario-document-collection-analysis/CongressionalDocTopicProbs.npy |
| `CongressionalDocTopicSummaries.tsv` | https://bostondata.blob.core.windows.net/scenario-document-collection-analysis/CongressionalDocTopicSummaries.tsv |
| `Vocab2SurfaceFormMapping.tsv` | https://bostondata.blob.core.windows.net/scenario-document-collection-analysis/Vocab2SurfaceFormMapping.tsv |

### Download Data Files (optional)

You can download all those data files by executing the code in the cells below.

In [1]:
import urllib.request
import os

def download_file_from_blob(filename):
    shared_path = os.environ['AZUREML_NATIVE_SHARE_DIRECTORY']
    save_path = os.path.join(shared_path, filename)

    if not os.path.exists(save_path):
        # Base URL for anonymous read access to Blob Storage container
        STORAGE_CONTAINER = 'https://bostondata.blob.core.windows.net/scenario-document-collection-analysis/'
        url = STORAGE_CONTAINER + filename
        urllib.request.urlretrieve(url, save_path)
        print("Downloaded file: %s" % filename)
    else:
        print("File \"%s\" already existed" % filename)

In [2]:
download_file_from_blob('CongressionalDocTopicLM.npy')
download_file_from_blob('CongressionalDocTopicProbs.npy')
download_file_from_blob('CongressionalDocTopicSummaries.tsv')

File "CongressionalDocTopicLM.npy" already existed
File "CongressionalDocTopicProbs.npy" already existed
File "CongressionalDocTopicSummaries.tsv" already existed


### Import Relevant Python Packages

Part 6 primarily relies on the [Bokeh Python library](https://bokeh.pydata.org) for generating graphs. Make sure you have installed the Bokeh Python package.

In [3]:
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import os
import logging

from gensim import corpora, models
from datetime import datetime, date
from math import pi
from scipy.spatial.distance import cdist

from bokeh.io import push_notebook, show, output_notebook
from bokeh.plotting import figure, curdoc
from bokeh.models import ColumnDataSource, CustomJS, DatetimeTickFormatter, HoverTool, NumberFormatter
from bokeh.models import DateRangeSlider, SingleIntervalTicker, Label, CategoricalColorMapper
from bokeh.models.widgets import DataTable, DateFormatter, TableColumn, HTMLTemplateFormatter
from bokeh.models.widgets import Select, Slider, TextInput, Button, Div
from bokeh.palettes import Spectral6, Viridis256
from bokeh.layouts import layout, column
from bokeh.application.handlers import FunctionHandler
from bokeh.application import Application

from azureml.logging import get_azureml_logger
aml_logger = get_azureml_logger()   # logger writes to AMLWorkbench runtime view
aml_logger.log('amlrealworld.document-collection-analysis.notebook6', 'true')

%matplotlib notebook
output_notebook()

# Disable logging from Bokeh for a reported bug (as of version 0.12.9):
# https://github.com/bokeh/bokeh/issues/6175
log = logging.getLogger("bokeh.server.views.ws")
log.disabled = True
amllog = logging.getLogger("azureml")
amllog.level = logging.ERROR

### Load Text Data

> **NOTE:** The data file is saved in the folder defined by the environment variable `AZUREML_NATIVE_SHARE_DIRECTORY` in notebook 1. If you changed it to `../Data` there, make the same change here.

In [4]:
# Load full TSV file including a column of text
docsFrame = pd.read_csv(os.path.join(os.environ['AZUREML_NATIVE_SHARE_DIRECTORY'], "CongressionalDataAll_Jun_2017.tsv"), 
                            sep='\t', parse_dates=['Date'])
docsFrame.head()


Unnamed: 0,ID,Text,Date,SponsorName,Type,State,District,Party,Subjects
0,hconres1-93,"Provides that effective from January 3, 1973, ...",1973-01-03,"O'Neill, Thomas P., Jr.",rep,MA,8.0,Democrat,"congress,congressional joint committees,govern..."
1,hconres2-93,Makes it the sense of the Congress that the po...,1973-01-03,"Bennett, Charles E.",rep,FL,3.0,Democrat,"environmental protection,pollution,water resou..."
2,hconres3-93,Establishes a Joint Congressional Committee on...,1973-01-03,"Bennett, Charles E.",rep,FL,3.0,Democrat,"congress,congressional joint committees,congre..."
3,hconres4-93,Makes it the sense of the Congress that the Pr...,1973-01-03,"Collier, Harold R.",rep,IL,10.0,Republican,"armed forces and national security,missing in ..."
4,hconres5-93,Makes it the sense of the Congress that: (1) t...,1973-01-03,"Collier, Harold R.",rep,IL,10.0,Republican,"economics and public finance,federal budgets"


### Load Topic Language Models from File

In [5]:
topicTermProbs = np.load(os.path.join(os.environ['AZUREML_NATIVE_SHARE_DIRECTORY'], "CongressionalDocTopicLM.npy"))
topicTermProbs.shape


(200, 68145)

### Load Topic Summaries from File

In [6]:
ldaTopicSummariesFile = os.path.join(os.environ['AZUREML_NATIVE_SHARE_DIRECTORY'], "CongressionalDocTopicSummaries.tsv")
topicSummaries = pd.read_csv(ldaTopicSummariesFile, sep='\t')
topicSummaries.head()


Unnamed: 0,TopicID,TopicSummary
0,0,"decision, review, appeal, judicial, orders, ca..."
1,1,"efforts, Calls, people, peace, commitment, lea..."
2,2,"laboratory, units, acid, laboratories, surface..."
3,3,"trade, merchandise, U.S., goods, country, mult..."
4,4,"sales, sea, State, Act to prohibit, gun, priso..."


### Load the Document Probability Score P(topic|doc) Computed by the LDA Model from File

The P(topic|doc) scores were computed earlier in the series by passing each document in the corpus to the LDA model, which infers a topic distribution for each document. The topic distributions were collected into a single NumPy array, which is loaded from file here.

In [7]:
docTopicProbsFile = os.path.join(os.environ['AZUREML_NATIVE_SHARE_DIRECTORY'], "CongressionalDocTopicProbs.npy")

# docTopicProbs[docID,TopicID] --> P(topic|doc)
docTopicProbs = np.load(docTopicProbsFile)

# The docTopicProbs shape should be (# of docs, # of topics)
docTopicProbs.shape

(297462, 200)
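
As a quick illustration of how this array is indexed, the sketch below reads the dominant topics of one document from a small, made-up matrix. The helper `top_topics_for_doc` and the toy values are assumptions for illustration only and are not part of the notebook.

```python
import numpy as np

def top_topics_for_doc(doc_topic_probs, doc_id, top_n=3):
    """Return (topicID, probability) pairs for one document, highest first."""
    probs = np.asarray(doc_topic_probs)[doc_id]
    order = np.argsort(-probs)[:top_n]          # topic IDs by decreasing P(topic|doc)
    return [(int(t), float(probs[t])) for t in order]

# Toy stand-in for docTopicProbs: 2 documents x 3 topics (hypothetical values)
toy_doc_topic_probs = np.array([
    [0.10, 0.70, 0.20],
    [0.50, 0.25, 0.25],
])

print(top_topics_for_doc(toy_doc_topic_probs, doc_id=0, top_n=2))
```

With the real data, the same call pattern would be applied to `docTopicProbs`, where rows are documents and columns are topics.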

### Compute Topic Similarity and Find Similar Topics to a Reference Topic

In [8]:
# Topic Similarity 
# First compute unit normalized vectors
normVector = np.matrix(np.sqrt(np.sum(np.array(topicTermProbs) * np.array(topicTermProbs), axis=1))).T
topicTermProbsUnitNormed = np.matrix(np.array(topicTermProbs) / np.array(normVector))

# Compute topicSimilarity using the cosine similarity measure
topicSimilarity = topicTermProbsUnitNormed * topicTermProbsUnitNormed.T
topicSimilarity.shape


(200, 200)
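
The cell above normalizes each topic's term distribution to unit length and multiplies the result by its own transpose, which yields pairwise cosine similarities. A minimal sketch of the same idea with plain NumPy arrays on a made-up 3-topic, 4-term matrix follows. The function name and the toy values are assumptions for illustration, and the sketch uses `ndarray` rather than the `np.matrix` type used in the notebook.

```python
import numpy as np

def cosine_similarity_matrix(rows):
    """Return the pairwise cosine similarity between the rows of a 2-D array."""
    rows = np.asarray(rows, dtype=float)
    norms = np.linalg.norm(rows, axis=1, keepdims=True)   # L2 norm of each row
    unit = rows / norms                                   # unit-length rows
    return unit @ unit.T                                  # dot products of unit rows

# Toy stand-in for topicTermProbs (hypothetical values)
toy_topic_term_probs = np.array([
    [0.7, 0.2, 0.1, 0.0],
    [0.6, 0.3, 0.1, 0.0],
    [0.0, 0.1, 0.2, 0.7],
])

toy_similarity = cosine_similarity_matrix(toy_topic_term_probs)
print(np.round(toy_similarity, 3))
```

Each topic has similarity 1 with itself, which is why a topic always ranks first in its own list of similar topics.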

> **NOTE:** The Bokeh interactive applications below need to set up a few Bokeh servers to host the applications. If you are running this notebook on a remote compute context (such as a remote DSVM), you will **NOT** see them, because the Bokeh server will not forward those interactions to the notebook port. Run this notebook on a local compute context.

### Find Similar Topics

Define a function to find similar topics given a topic ID. The returned topics are sorted by decreasing similarity, so the first one is always the given topic itself.

In [9]:
def GetSimilarTopics(topicID, topicSimilarity, topicSummaries):
    sortedTopics = np.array(np.argsort(-topicSimilarity[topicID]))[0]
    similarity = topicSimilarity[topicID, sortedTopics].tolist()[0]
    summaries = list(topicSummaries['TopicSummary'][sortedTopics])
    
    df = pd.DataFrame()
    df['Similarity'] = similarity
    df['TopicID'] = list(sortedTopics)
    df['TopicSummary'] = summaries
    return df


def GetAllSimilarity(maxn=50):
    df = pd.DataFrame()
    for tid in range(topicSimilarity.shape[0]):
        tmp = GetSimilarTopics(tid, topicSimilarity, topicSummaries)[:maxn]
        tmp['Target'] = [topicSummaries['TopicSummary'][tid]] * maxn
        df = pd.concat([df, tmp], ignore_index=True)
    return df


An example of getting a list of similar topics:

In [10]:
similarityDF = GetSimilarTopics(38, topicSimilarity, topicSummaries)
similarityDF.head()

Unnamed: 0,Similarity,TopicID,TopicSummary
0,1.0,38,"security, electronic, terrorism, violence, for..."
1,0.090268,20,"U.S., international, countries, country, Unite..."
2,0.087188,44,"securities, transactions, issuer, assets, equi..."
3,0.084527,1,"efforts, Calls, people, peace, commitment, lea..."
4,0.080212,162,"ensure, develop, enhance, implementation, stra..."


Get the similarity values of all topics and the list of topic summaries. Both are reused when finding the most similar topics.

In [11]:
# Get the original similarity table for all topics
similarityTable = GetAllSimilarity(maxn=50)

# Get the summary list of all topics
summaries = list(topicSummaries['TopicSummary'])


#### Similar Topic Bokeh Application

Show an interactive application for finding similar topics. The returned topics are ranked in decreasing order of similarity, and the first (most similar) topic is always the selected topic itself. Topics can also be filtered by searching for keywords in the topic summary.

In [12]:
def modify_similarTopic(doc):
    maxWidth = 900
    # The default is the first topic (topic ID 0)
    source = ColumnDataSource(similarityTable[:10])
    template = """<span href="#" data-toggle="tooltip" title="<%= value %>"><%= value %></span>"""
    dfmt = NumberFormatter(format="0.000")
    columns = [
            TableColumn(field="Similarity", title="Similarity", formatter=dfmt, width=50),
            TableColumn(field="TopicID", title="TopicID", width=50),
            TableColumn(field='TopicSummary', title='TopicSummary', formatter=HTMLTemplateFormatter(template=template)),
        ]

    data_table = DataTable(source=source, columns=columns, width=maxWidth, height=250)
    keywordInput = TextInput(title="Topic Summary Keyword:", value='', width=int(maxWidth / 2 - 20))
    topicSelect = Select(title="Select Topic:", value=summaries[0], options=summaries, width=maxWidth)
    topnSelect = Select(title="Select Top N Most Similar Topic:", value='10', 
                        options=[str(x) for x in range(5, 51, 5)], 
                        width=int(maxWidth / 2 - 10))
    div = Div(text="""<h3>Find Similar Topic</h3><p>Search and select a specific topic and find the most
                        similar topics.</p>""", width=maxWidth)

    def searchKeywordCallback(attr, old, new):
        newOptions = []
        if keywordInput.value.strip() == '':
            topicSelect.options = summaries
            topicSelect.value = summaries[0]
            return
        
        keywords = keywordInput.value.strip().lower()
        for item in summaries:
            if keywords in item.lower():
                newOptions.append(item)
        topicSelect.options = newOptions
        if newOptions:
            topicSelect.value = newOptions[0]
    
    def callback(attr, old, new):
        topicSummary = topicSelect.value
        topN = int(topnSelect.value)
        newDF = similarityTable[similarityTable['Target'] == topicSummary][:topN]
        newDF.reset_index(inplace=True)
        newSource = ColumnDataSource(newDF)
        ds = data_table.source
        ds.data = newSource.data
        ds.data['index'] = range(topN)
        ds.trigger('data', ds.data, ds.data)
    
    keywordInput.on_change('value', searchKeywordCallback)
    topicSelect.on_change('value', callback)
    topnSelect.on_change('value', callback)
    
    lyt = layout([[div],
                  [keywordInput, topnSelect],
                  [topicSelect],
                  [data_table]])
    doc.add_root(lyt)

handler_similarTopic = FunctionHandler(modify_similarTopic)
similarTopic_app = Application(handler_similarTopic)
show(similarTopic_app)


### Find Related Documents

#### Find Related Documents Bokeh Application

Show an interactive application for finding the documents related to a specific topic. The returned documents are ranked in decreasing order of their relatedness score. Clicking a document in the returned data table displays its details.

In [13]:
def modify_RelatedDocs(doc):
    maxWidth = 900
    # By default, select the top 10 most related documents of topic 0
    select_mask = docTopicProbs[:, 0].argsort()[::-1][:10]
    relate_score = docTopicProbs[select_mask, 0]
    docDF = docsFrame.iloc[select_mask].copy()  # copy to avoid modifying a slice of docsFrame
    # Pre-format Date column to String as Bokeh cannot properly render it
    # when we change the data source of DataTable
    docDF['Date'] = docDF['Date'].dt.strftime('%Y-%m-%d')
    docDF['Score'] = relate_score
    source = ColumnDataSource(docDF)
    dfmt = NumberFormatter(format="0.000")
    columns = [
            TableColumn(field="Score", title="Score", formatter=dfmt, width=70),
            TableColumn(field="ID", title="ID", width=130),
            TableColumn(field="Date", title="Date", width=100),
            TableColumn(field='Text', title='Text', width=550)
        ]

    data_table = DataTable(source=source, columns=columns, width=maxWidth, height=250)
    keywordInput = TextInput(title="Topic Summary Keyword:", value='', width=int(maxWidth/2-20))
    topicSelect = Select(title="Select Topic:", value=summaries[0], options=summaries, width=maxWidth)
    topnSelect = Select(title="Select Top N Most Related Documents:", value='10', 
                        options=[str(x) for x in range(5, 51, 5)], width=int(maxWidth/2-20))
    tdiv = Div(text="""<h3>Find Related Documents</h3><p>Search and select a specific topic and find the most
                        related documents.</p>""", width=maxWidth)
    div = Div(text="""<h3>Document Detail</h3><p><em>Please click an item on 
                    the above data table to view the details of the document.</em></p>""", width=maxWidth)
    title = Div(text="""<h3>Related Documents</h3><p>Please click an item on the data
                    table below to view the detail of the document.</p>""", width=maxWidth)

    def searchKeywordCallback(attr, old, new):
        newOptions = []
        if keywordInput.value.strip() == '':
            topicSelect.options = summaries
            topicSelect.value = summaries[0]
            return
        
        keywords = keywordInput.value.strip().lower()
        for item in summaries:
            if keywords in item.lower():
                newOptions.append(item)
        topicSelect.options = newOptions
        if newOptions:
            topicSelect.value = newOptions[0]
    
    def generateHTML(rowId):
        ds = data_table.source
        html = """<h3>Document Detail</h3>
                <p><strong>Score:</strong>&nbsp;%.3f</p>
                <p><strong>Introduced Date:</strong>&nbsp;%s</p>
                <p><strong>Primary Sponsor:</strong>&nbsp;<em>%s, %s&nbsp;%s</em></p>
                <p>%s</p>""" % (ds.data['Score'][rowId], 
                                ds.data['Date'][rowId], 
                                ds.data['SponsorName'][rowId],
                                ds.data['State'][rowId],
                                ds.data['Party'][rowId],
                                ds.data['Text'][rowId])
        return html
    
    def callback(attr, old, new):
        topicSummary = topicSelect.value
        topicID = summaries.index(topicSummary)
        topN = int(topnSelect.value)
        tmask = docTopicProbs[:, topicID].argsort()[::-1][:topN]
        newScore = docTopicProbs[tmask, topicID]
        newDF = docsFrame.iloc[tmask].copy()  # copy to avoid modifying a slice of docsFrame
        newDF['Date'] = newDF['Date'].dt.strftime('%Y-%m-%d')
        newDF['Score'] = newScore
        newSource = ColumnDataSource(newDF)
        ds = data_table.source
        ds.data = newSource.data
        ds.data['index'] = range(topN)
        ds.trigger('data', ds.data, ds.data)

    def table_select_callback(attr, old, new):
        selected_row = new['1d']['indices'][0]
        html = generateHTML(selected_row)
        div.text = html
    
    keywordInput.on_change('value', searchKeywordCallback)
    topicSelect.on_change('value', callback)
    topnSelect.on_change('value', callback)
    source.on_change('selected', table_select_callback)
    
    lyt = layout([[tdiv], 
                  [keywordInput, topnSelect],
                  [topicSelect],
                  [title],
                  [data_table],
                  [div]])
    doc.add_root(lyt)

handler_relatedDocs = FunctionHandler(modify_RelatedDocs)
relatedDocs_app = Application(handler_relatedDocs)
show(relatedDocs_app)
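
The ranking inside the application above comes down to sorting one column of the P(topic|doc) array and picking the matching rows. A minimal sketch on made-up data is shown below. The helper `related_docs`, the toy document IDs, and the toy probabilities are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

def related_docs(docs, doc_topic_probs, topic_id, top_n=10):
    """Return the top_n rows of docs with the highest P(topic|doc) for topic_id."""
    scores = np.asarray(doc_topic_probs)[:, topic_id]
    order = np.argsort(scores)[::-1][:top_n]      # row positions, highest score first
    result = docs.iloc[order].copy()              # copy so docs itself is not modified
    result['Score'] = scores[order]
    return result

# Toy stand-ins for docsFrame and docTopicProbs (hypothetical values)
toy_docs = pd.DataFrame({'ID': ['hr1-93', 'hr2-93', 's1-93', 'sres1-93']})
toy_probs = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
    [0.1, 0.9],
])

print(related_docs(toy_docs, toy_probs, topic_id=0, top_n=2))
```

Because the row order of the probability array matches the row order of the document table, positional indexing with `iloc` is enough to join the two.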

### Find Similar Documents

First, define a function to filter documents. The filtering options include the political group (House of Representatives or Senate), the document type (bill or resolution), the primary sponsor's name, state, and party, and the start and end of the introduced-date range.

In [14]:
def FilterDocument(df, groups=None, docType=None, sponsor=None, 
                   state=None, party=None, from_date=None, to_date=None):
    
    # Initialize a boolean mask with all True, aligned to the DataFrame index
    mask = pd.Series(True, index=df.index)
    
    # Filter by political groups
    # None: All political groups
    # 0   : House of Representatives
    # 1   : Senate
    if groups is not None:
        if groups == 0:
            mask &= df['ID'].str[0] == 'h'
        elif groups == 1:
            mask &= df['ID'].str[0] == 's'
    
    # Filter by document type
    # None: All kinds of documents
    # 0   : Bills (document ID does not contain 'res')
    # 1   : Resolutions (document ID contains 'res')
    if docType is not None:
        if docType == 0:
            mask &= ~df['ID'].str.contains('res')
        elif docType == 1:
            mask &= df['ID'].str.contains('res')
    
    # Filter by primary sponsor name
    if sponsor is not None:
        mask &= df['SponsorName'] == sponsor
    
    # Filter by state
    if state is not None:
        mask &= df['State'] == state
        
    # Filter by party
    if party is not None:
        mask &= df['Party'] == party
        
    # Filter by date (pd.Timestamp accepts both date and datetime objects)
    if from_date is not None:
        mask &= df['Date'] >= pd.Timestamp(from_date)
    if to_date is not None:
        mask &= df['Date'] <= pd.Timestamp(to_date)
    
    # return the filtering mask
    return mask
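
The function builds its result by AND-ing one boolean condition per active filter into a running mask. The sketch below shows that pattern in isolation on a made-up four-row table. The toy IDs, parties, and dates are hypothetical, and the mask is held in a pandas `Series`, which is one reasonable choice rather than a requirement.

```python
import pandas as pd

# Toy stand-in for docsFrame (hypothetical rows)
toy = pd.DataFrame({
    'ID': ['hr1-93', 'hres2-93', 's3-93', 'sres4-93'],
    'Party': ['Democrat', 'Republican', 'Democrat', 'Democrat'],
    'Date': pd.to_datetime(['1975-05-30', '1975-06-02', '1975-06-15', '1975-08-01']),
})

# Start with every row selected, then narrow the selection one condition at a time
mask = pd.Series(True, index=toy.index)
mask &= toy['ID'].str[0] == 's'                   # Senate documents only
mask &= toy['Party'] == 'Democrat'                # Democratic sponsors only
mask &= toy['Date'] >= pd.Timestamp(1975, 6, 1)   # introduced on or after June 1, 1975
mask &= toy['Date'] <= pd.Timestamp(1975, 7, 31)  # introduced on or before July 31, 1975

print(toy[mask])
```

Skipping a condition leaves the mask unchanged, which is how a `None` argument comes to mean "no filter" in this design.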


An example of filtering documents with a set of filter criteria:

In [15]:
mask = FilterDocument(docsFrame, groups=None, docType=None, sponsor=None, 
                     state=None, party='Democrat', from_date=date(1975, 6, 1), to_date=date(1975, 7, 31))
# Sort into a new DataFrame instead of sorting a slice of docsFrame in place
tdf = docsFrame[mask].sort_values(by='Date', ascending=True)
tdf.head()


Unnamed: 0,ID,Text,Date,SponsorName,Type,State,District,Party,Subjects
35639,hr7510-94,Removes the exclusion of service performed in ...,1975-06-02,"Rostenkowski, Dan",rep,IL,8.0,Democrat,"federal employees,federal employees and offici..."
35652,hr7523-94,Provides for the relief of Tin Kee Ng.,1975-06-02,"Harrington, Michael J.",rep,MA,6.0,Democrat,private legislation
35651,hr7522-94,Provides for the relief of Robert H. Glazier.,1975-06-02,"Edgar, Robert W.",rep,PA,7.0,Democrat,private legislation
35647,hr7518-94,Requires retail and wholesale food concerns au...,1975-06-02,"Richmond, Frederick W.",rep,NY,14.0,Democrat,"agriculture and food,food stamps,labor and emp..."
35646,hr7517-94,Federal Municipal Bond Guarantee Administratio...,1975-06-02,"Richmond, Frederick W.",rep,NY,14.0,Democrat,"bonds,economics and public finance,executive r..."


Get the options of political groups, document types, sponsor names, states, and parties for filtering.

In [16]:
political_groups = ["ALL", "House of Representatives", "Senate"]
types = ["Bills and Resolutions", "Bills", "Resolutions"]
sponsors = ["ANY SPONSOR"] + sorted(docsFrame.ffill().SponsorName.unique())
states = ["ANY STATE"] + sorted(docsFrame.ffill().State.unique())
parties = ["ANY PARTY"] + sorted(docsFrame.ffill().Party.unique())


#### Find Similar Documents Bokeh Application

Show an interactive application for finding documents similar to a specific document. The returned documents are ranked in decreasing order of similarity score.

In [17]:
def modify_similarDocs(doc):
    maxWidth = 900
    def GetSimilarDocs(selectedRow=0, topn=10):
        dists = cdist(docTopicProbs, docTopicProbs[selectedRow, :][np.newaxis, :], metric='cosine')[:, 0]
        idxes = dists.argsort()
        return idxes[:topn], dists[idxes[:topn]]
    
    dfmt = NumberFormatter(format="0.000")
    filterDF = pd.DataFrame(np.nan, index=[], columns=list(docsFrame.columns) + ['RowID'])
    columns1 = [TableColumn(field="ID", title="ID", width=100),
            TableColumn(field="Date", title="Date", width=100),
            TableColumn(field='Text', title='Text')
        ]
    filterSource = ColumnDataSource(data=filterDF)
    filterDataTable = DataTable(source=filterSource, columns=columns1, width=maxWidth, height=250)
    
    #idxes, dists = GetSimilarDocs(0, 10)
    similarDF = pd.DataFrame(np.nan, index=[], columns=list(docsFrame.columns) + ['RowID', 'Similarity'])
    columns2 = [TableColumn(field="Similarity", title="Similarity", formatter=dfmt, width=70),
            TableColumn(field="ID", title="ID", width=130),
            TableColumn(field="Date", title="Date", width=100),
            TableColumn(field='Text', title='Text')
        ]
    similarSource = ColumnDataSource(data=similarDF)
    similarDataTable = DataTable(source=similarSource, columns=columns2, width=maxWidth, height=250)
    
    searchSponsor = TextInput(title="Search Primary Sponsor Name:", value='')
    topicSelect = Select(title="Select Topic:", value=summaries[0], options=summaries)
    topnSelect = Select(title="Select Top N Most Similar Documents:", value='10', 
                        options=[str(x) for x in range(5, 51, 5)])
    groupsSelect = Select(title="Political Groups:", value=political_groups[0], options=political_groups)
    typeSelect = Select(title="Document Type:", value=types[0], options=types)
    sponsorSelect = Select(title="Primary Sponsor:", value=sponsors[0], options=sponsors)
    stateSelect = Select(title="State:", value=states[0], options=states)
    partySelect = Select(title="Party:", value=parties[0], options=parties)
    dateRange = DateRangeSlider(title="Introduced Date Period", 
                                name="period",
                                start=date(1973, 1, 1),
                                end=date(2017, 6, 30),
                                step=1,
                                value=(date(1973, 1, 1), date(2017, 6, 30)), 
                                width=maxWidth)
    applyButton = Button(label='Apply Filter', button_type="success", width=250)
    div = Div(text="""<h3>Document Detail</h3><p><em>Please click an item on 
                    the above data table to view the details of the document.</em></p>""", width=maxWidth)
    title1 = Div(text="""<h3>Filtered Documents</h3><p>Please click an item on the data
                    table below to get the top N most similar documents.</p>""", width=maxWidth)
    title2 = Div(text="""<h3>Similar Documents</h3><p>Please click an item on the data
                    table below to view the detail of the document.</p>""", width=maxWidth)
        
    def filterSponsor(attr, old, new):
        newOptions = []
        if searchSponsor.value.strip() == '':
            sponsorSelect.options = sponsors
            return
        
        sponsorName = searchSponsor.value.strip().lower()
        for item in sponsors:
            if sponsorName in item.lower():
                newOptions.append(item)
        sponsorSelect.options = newOptions
        if newOptions:
            sponsorSelect.value = newOptions[0]
    
    def applyFilter():
        topicID = summaries.index(topicSelect.value)
        (fromDate, toDate) = dateRange.value_as_datetime
        sponsorName = sponsorSelect.value if sponsorSelect.value != "ANY SPONSOR" else None
        partyName = partySelect.value if partySelect.value != "ANY PARTY" else None
        stateName = stateSelect.value if stateSelect.value != "ANY STATE" else None
        if groupsSelect.value == "House of Representatives":
            groups = 0
        elif groupsSelect.value == "Senate":
            groups = 1
        else:
            groups = None
        if typeSelect.value == "Bills":
            docType = 0
        elif typeSelect.value == "Resolutions":
            docType = 1
        else:
            docType = None
        topn = int(topnSelect.value)
        
        filterMask = FilterDocument(docsFrame, groups=groups, docType=docType, sponsor=sponsorName, 
                     state=stateName, party=partyName, from_date=fromDate, to_date=toDate)
        filterDF = docsFrame[filterMask].copy()  # copy to avoid modifying a slice of docsFrame
        filterDF['RowID'] = list(filterDF.index)
        filterDF['Date'] = filterDF['Date'].dt.strftime('%Y-%m-%d')
        filterSource = ColumnDataSource(data=filterDF)
        ds = filterDataTable.source
        ds.data = filterSource.data
        ds.data['index'] = range(len(filterDF))
        # clean the similarDataTable
        ds1 = similarDataTable.source
        similarSource = ColumnDataSource(data=pd.DataFrame(np.nan, index=[], columns=similarDF.columns))
        ds1.data = similarSource.data
        ds1.data['index'] = []
        ds.trigger('data', ds.data, ds.data)
        ds1.trigger('data', ds1.data, ds1.data)
    
    def filtered_select_callback(attr, old, new):
        selected_row = new['1d']['indices'][0]
        rowID = filterSource.data['RowID'][selected_row]
        topn = int(topnSelect.value)
        idxes, dists = GetSimilarDocs(rowID, topn)
        similarDF = docsFrame.iloc[idxes].copy()  # copy to avoid modifying a slice of docsFrame
        similarDF['Date'] = similarDF['Date'].dt.strftime('%Y-%m-%d')
        similarDF['RowID'] = idxes
        similarDF['Similarity'] = 1.0 - dists
        similarSource = ColumnDataSource(data=similarDF)
        ds = similarDataTable.source
        ds.data = similarSource.data
        ds.data['index'] = range(len(similarDF))
        ds.trigger('data', ds.data, ds.data)

    def generateHTML(rowId):
        ds = similarDataTable.source
        html = """<h3>Document Detail</h3>
                <p><strong>Similarity:</strong>&nbsp;%.3f</p>
                <p><strong>Introduced Date:</strong>&nbsp;%s</p>
                <p><strong>Primary Sponsor:</strong>&nbsp;<em>%s, %s&nbsp;%s</em></p>
                <p>%s</p>""" % (ds.data['Similarity'][rowId], 
                                ds.data['Date'][rowId], 
                                ds.data['SponsorName'][rowId],
                                ds.data['State'][rowId],
                                ds.data['Party'][rowId],
                                ds.data['Text'][rowId])
        return html
    
    def doc_select_callback(attr, old, new):
        selected_row = new['1d']['indices'][0]
        html = generateHTML(selected_row)
        div.text = html
    
    searchSponsor.on_change('value', filterSponsor)
    applyButton.on_click(applyFilter)
    filterSource.on_change('selected', filtered_select_callback)
    similarSource.on_change('selected', doc_select_callback)
    
    infodiv = Div(text="""<h2>Find Similar Documents</h2>
                    <p>We can use this Bokeh application to find top N most similar documents to a 
                    selected document.</p>
                    <h3>How to use it:</h3>
                    <ol>
                    <li>Input topic summary keyword to filter topics or leave it blank;</li>
                    <li>Input primary sponsor name to filter primary sponsor or leave it blank;</li>
                    <li>Select the range of the introduced date of documents;</li>
                    <li>Select other filtering options and click the "<strong>Apply Filter</strong>" button;</li>
                    <li>Click on an item of the DataTable to view the list of similar documents;</li>
                    <li>Further click on an item of the similar documents DataTable to view the detail of the document.</li>
                    </ol>""", width=maxWidth)

    lyt = layout([[infodiv],
                  [searchSponsor],
                  [sponsorSelect, partySelect, stateSelect],
                  [groupsSelect, typeSelect, topnSelect],
                  [dateRange],
                  [applyButton],
                  [title1],
                  [filterDataTable],
                  [title2],
                  [similarDataTable],
                  [div]])
    doc.add_root(lyt)

handler_similarDocs = FunctionHandler(modify_similarDocs)
similarDocs_app = Application(handler_similarDocs)
show(similarDocs_app)
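
The document ranking in the application above relies on SciPy's `cdist` with `metric='cosine'`, which returns cosine distance (1 minus cosine similarity) between topic distributions. A minimal sketch of that step on made-up data follows. The wrapper `most_similar_docs` and the toy probabilities are assumptions for illustration only.

```python
import numpy as np
from scipy.spatial.distance import cdist

def most_similar_docs(doc_topic_probs, row, top_n=3):
    """Return (row indices, similarities) of the top_n documents closest to `row`."""
    probs = np.asarray(doc_topic_probs, dtype=float)
    # Cosine distance from every document to the selected one, as a 1-D array
    dists = cdist(probs, probs[row][np.newaxis, :], metric='cosine')[:, 0]
    order = np.argsort(dists)[:top_n]             # smallest distance first
    return order, 1.0 - dists[order]              # convert distance to similarity

# Toy stand-in for docTopicProbs: 4 documents x 3 topics (hypothetical values)
toy_probs = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.8, 0.1],
])

idx, sims = most_similar_docs(toy_probs, row=0, top_n=2)
print(idx, np.round(sims, 3))
```

The selected document is at distance 0 from itself, so it appears first in its own result list, just as a topic does in the similar-topics table.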

### Visualize Topic Prevalence Over Time

#### Get All Topic Monthly Heatmap

Define functions that compute the monthly totals and document counts of a topic's probability scores, and that aggregate them over a window and smooth them with a moving average.

In [18]:
def GetMonthlyHeatmap(date_list, lda_prob_list):
    dlist = []
    total = []
    count = []
    
    end_year, end_month = date_list[-1].year, date_list[-1].month
    t_total, t_count, len_list = 0.0, 0, len(date_list)
    p_year, p_month = date_list[0].year, date_list[0].month
    
    i = 0
    while i < len_list:
        c_year, c_month = date_list[i].year, date_list[i].month
        if c_year == p_year and c_month == p_month:
            t_total += lda_prob_list[i]
            t_count += 1
            i += 1
        else:
            dlist.append(date(p_year, p_month, 1))
            total.append(t_total)
            count.append(t_count)
            t_total, t_count = 0.0, 0
            if p_month == 12:
                p_month = 1
                p_year = p_year + 1
            else:
                p_month = p_month + 1
    dlist.append(date(p_year, p_month, 1))
    total.append(t_total)
    count.append(t_count)
    df = pd.DataFrame()
    df['date'] = dlist
    df['total'] = total
    df['count'] = count
    return df


def GetAggregated(df, agg_window=1, avg_window=12):
    len_df = len(df)
    dlist, total_list = [], []
    
    i = 0
    while i < len_df:
        t_total, t_count = 0.0, 0
        for j in range(i, min(len_df, i + agg_window)):
            t_total += df['total'][j]
            t_count += df['count'][j]
        dlist.append(df['date'][i])
        total_list.append(t_total / t_count)
        i += agg_window
    ret_df = pd.DataFrame()
    ret_df['date'] = dlist
    ret_df['total'] = total_list
    # Moving average over the aggregated values
    ret_df['moving_avg'] = ret_df['total'].rolling(window=avg_window, min_periods=1).mean()
    return ret_df
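
As a cross-check of the two helpers above, the same per-month mean and moving average can be sketched with pandas' built-in resampling. This is a minimal illustration, not the notebook's method: `monthly_prevalence` is a hypothetical helper, and the dates and probabilities are synthetic. In this sketch, months without documents get a `NaN` mean and are skipped by the moving average.

```python
import pandas as pd

def monthly_prevalence(dates, probs, avg_window=12):
    """Mean topic probability per calendar month, plus a moving average."""
    s = pd.Series(probs, index=pd.to_datetime(dates)).sort_index()
    monthly = s.resample("MS")  # "MS" = month start; empty months are kept
    out = pd.DataFrame({"total": monthly.sum(), "count": monthly.count()})
    out["mean"] = out["total"] / out["count"]  # NaN for months with no documents
    out["moving_avg"] = out["mean"].rolling(window=avg_window, min_periods=1).mean()
    return out

# Synthetic example: two documents in January, none in February, one in March.
demo = monthly_prevalence(["2000-01-05", "2000-01-20", "2000-03-02"], [0.2, 0.4, 0.6])
print(demo)
```

The `mean` column plays the role of `total / count` in `GetAggregated` with `agg_window=1`.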



#### Topic Prevalence Bokeh Application

Show an interactive application that plots the prevalence score of a selected topic over time.

In [19]:
def modify_prevalence(doc):
    maxWidth = 900
    aggOptions = ["Monthly", "Quarterly", "Yearly"]
    windowOptions = [str(x) for x in range(1, 13)]

    mask = FilterDocument(docsFrame, groups=None, docType=None, sponsor=None, 
                     state=None, party='Democrat', from_date=date(1973, 3, 1), to_date=date(2017, 6, 30))
    # Sort a filtered copy by date instead of sorting a slice in place
    tdf = docsFrame[mask].sort_values(by='Date', ascending=True)
    lda_prob = list(docTopicProbs[tdf.index, 0])
    date_list = list(tdf['Date'])
    monthly_df = GetMonthlyHeatmap(date_list, lda_prob)
    source_df = GetAggregated(monthly_df, 1, 12)
    source = ColumnDataSource(data=source_df)
    
    keywordInput2 = TextInput(title="Search Topic Summary Keyword:", value='', width=int(maxWidth/2-20))
    searchSponsor = TextInput(title="Search Primary Sponsor Name:", value='')
    topicSelect2 = Select(title="Select Topic:", value=summaries[0], options=summaries, width=maxWidth)
    applyButton = Button(label='Apply Filter', button_type="success")
    
    def searchTopic(attr, old, new):
        newOptions = []
        if keywordInput2.value.strip() == '':
            topicSelect2.options = summaries
            return
        
        keywords = keywordInput2.value.strip().lower()
        for item in summaries:
            if keywords in item.lower():
                newOptions.append(item)
        topicSelect2.options = newOptions
        # Only update the selection when at least one topic matches
        if newOptions:
            topicSelect2.value = newOptions[0]
        
    def filterSponsor(attr, old, new):
        newOptions = []
        if searchSponsor.value.strip() == '':
            sponsorSelect.options = sponsors
            return
        
        sponsorName = searchSponsor.value.strip().lower()
        for item in sponsors:
            if sponsorName in item.lower():
                newOptions.append(item)
        sponsorSelect.options = newOptions
        # Only update the selection when at least one sponsor matches
        if newOptions:
            sponsorSelect.value = newOptions[0]
    
    def get_width(df):
        mindate = min(df['date'])
        maxdate = max(df['date'])
        return 0.7 * (maxdate-mindate).total_seconds()*1000 / len(df['date'])
    
    def applyFilter():
        topicID = summaries.index(topicSelect2.value)
        (fromDate, toDate) = dateRange.value_as_datetime
        sponsorName = sponsorSelect.value if sponsorSelect.value != "ANY SPONSOR" else None
        partyName = partySelect.value if partySelect.value != "ANY PARTY" else None
        stateName = stateSelect.value if stateSelect.value != "ANY STATE" else None
        if groupsSelect.value == "House of Representatives":
            groups = 0
        elif groupsSelect.value == "Senate":
            groups = 1
        else:
            groups = None
        if typeSelect.value == "Bills":
            docType = 0
        elif typeSelect.value == "Resolutions":
            docType = 1
        else:
            docType = None
        if aggregateSelect.value == "Monthly":
            aggregateNum = 1
        elif aggregateSelect.value == "Quarterly":
            aggregateNum = 3
        else:
            aggregateNum = 12
        windowNum = int(windowSelect.value)
        
        filterMask = FilterDocument(docsFrame, groups=groups, docType=docType, sponsor=sponsorName, 
                     state=stateName, party=partyName, from_date=fromDate, to_date=toDate)
        # Sort a filtered copy by date instead of sorting a slice in place
        filterDF = docsFrame[filterMask].sort_values(by='Date', ascending=True)
        filter_lda_prob = list(docTopicProbs[filterDF.index, topicID])
        filter_date_list = list(filterDF['Date'])
        filter_monthly_df = GetMonthlyHeatmap(filter_date_list, filter_lda_prob)
        source_df = GetAggregated(filter_monthly_df, aggregateNum, windowNum)
        source.data = ColumnDataSource(data=source_df).data
        
        plot.title.text = "Prevalence Over Time of Topic: " + summaries[topicID]
        plot.x_range.start = min(source_df['date'])
        plot.x_range.end = max(source_df['date'])
        plot.y_range.end = max(source_df['total']) + 0.001
        plot.xaxis[0].ticker.desired_num_ticks = 20
        prevalence_bar.glyph.width = get_width(source.data)
    
    keywordInput2.on_change('value', searchTopic)
    searchSponsor.on_change('value', filterSponsor)
    applyButton.on_click(applyFilter)
    
    div = Div(text="""<h2>Explore Topic Prevalence Over Time</h2>
                    <p>We can use this Bokeh application to explore topic prevalence over time.&nbsp;</p>
                    <h3>How to use it:</h3>
                    <ol>
                    <li>Input topic summary keyword to filter topics or leave it blank;</li>
                    <li>Input primary sponsor name to filter primary sponsor or leave it blank;</li>
                    <li>Choose the topic you want to explore from the topic list;</li>
                    <li>Select the range of the introduced date of documents;</li>
                    <li>Select other filtering options and click the "<strong>Apply Filter</strong>" button.</li>
                    </ol>""", width=maxWidth)
    
    groupsSelect = Select(title="Political Groups:", value=political_groups[0], options=political_groups)
    typeSelect = Select(title="Document Type:", value=types[0], options=types)
    sponsorSelect = Select(title="Primary Sponsor:", value=sponsors[0], options=sponsors)
    stateSelect = Select(title="State:", value=states[0], options=states)
    partySelect = Select(title="Party:", value=parties[0], options=parties)
    aggregateSelect = Select(title="Aggregate Data:", value=aggOptions[0], options=aggOptions)
    windowSelect = Select(title="Moving Window:", value=windowOptions[-1], options=windowOptions)
    dateRange = DateRangeSlider(title="Introduced Date Period", 
                                name="period",
                                start=date(1973, 1, 1),
                                end=date(2017, 6, 30),
                                step=1,
                                width=maxWidth,
                                value=(date(1973, 1, 1), date(2017, 6, 30)))
    
    plot = figure(x_axis_type='datetime',
                  x_axis_label='Date',
                  y_axis_label='Topic Prevalence',
                  x_range=(min(source_df['date']), max(source_df['date'])),
                  y_range=(0, max(source_df['total']) + 0.001),
                  plot_height=500,
                  plot_width=maxWidth+50,
                  title="Prevalence Over Time of Topic: " + summaries[0])
    plot.sizing_mode = 'scale_width'
    plot.xaxis.major_label_orientation = -pi / 2
    plot.xaxis[0].ticker.desired_num_ticks = 20
    plot.xaxis.major_tick_line_color = "firebrick"
    plot.xaxis.major_tick_line_width = 2.5
    plot.xaxis.formatter = DatetimeTickFormatter(days=["%Y-%m-%d"],
                                            months=["%Y-%m-%d"],
                                            years=["%Y-%m-%d"])
    
    prevalence_bar = plot.vbar(x='date',
                               source=source,
                               width=get_width(source.data), 
                               top='total', 
                               legend='Prevalence',
                               color="steelblue")
    
    plot.add_tools(HoverTool(renderers=[prevalence_bar], 
                             tooltips=[('Date', "@date{%F}"), 
                                       ('Prevalence', "@total{0.00 a}"), 
                                       ('Moving Average', "@moving_avg{0.00 a}")],
                             formatters={'date': 'datetime'},
                             mode='vline'))
    
    moving_avg_line = plot.line(source=source,
                                x='date', 
                                y='moving_avg',
                                legend='Moving Average',
                                line_width=2.5,
                                color="darkorange")

    lyt = layout([[div],
                  [keywordInput2],
                  [topicSelect2],
                  [dateRange],
                  [searchSponsor, sponsorSelect, partySelect],
                  [stateSelect, groupsSelect, typeSelect],
                  [aggregateSelect, windowSelect, applyButton],
                  [plot]])
    doc.add_root(lyt)

handler_prevalence = FunctionHandler(modify_prevalence)
prevalence_app = Application(handler_prevalence)
show(prevalence_app)
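
The keyword search used by `searchTopic` and `filterSponsor` above is a case-insensitive substring match over the option list. The sketch below isolates that logic outside Bokeh so it can be tried on its own. `filter_options` and `first_or_none` are hypothetical helper names, and the topic strings are made up for illustration.

```python
def filter_options(options, keyword):
    """Return the options that contain `keyword`, ignoring case and outer whitespace."""
    keyword = keyword.strip().lower()
    if not keyword:
        # A blank search keeps the full list, as in the callbacks above
        return list(options)
    return [item for item in options if keyword in item.lower()]

def first_or_none(options):
    """Pick a default selection, or None when nothing matched."""
    return options[0] if options else None

# Made-up topic summaries for illustration only.
topics = ["health care, insurance", "tax reform, income tax", "public health, disease"]
matches = filter_options(topics, "  Health ")
print(matches, first_or_none(matches))
print(first_or_none(filter_options(topics, "zzz")))
```

Returning `None` for an empty match list makes the "no results" case explicit, so the caller can leave the current selection unchanged.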

### Visualize Topic Evolving Over Time

Define functions that compute the per-year evolution data for all topics.

In [20]:
def GetYearlyData(year, totalDoc, probTopic):
    np.seterr(divide='ignore', invalid='ignore')
    
    mask = docsFrame.Date.dt.year == year
    # Filtered document topic probability in a specific year
    yearDocTopicProb = docTopicProbs[mask, :]
    
    # The total number of documents in a specific year
    yearlyNumDocs = sum(mask)
    
    # The probability that a randomly selected document came from
    # a specific year for the data time span (1973-2017)
    probYear = yearlyNumDocs * 1.0 / totalDoc
    
    # Compute the conditional probability of a topic given a specific year
    probTopicGivenYear = yearDocTopicProb.sum(axis=0) / np.sum(yearDocTopicProb)
    
    # Compute the conditional probability of a specific year given a topic
    probYearGivenTopic = yearDocTopicProb.sum(axis=0) / docTopicProbs.sum(axis=0)
    
    # Produce a "heat" indicator to highlight years in which a topic
    # has higher than expected activity
    topicHeatMap = ((probTopicGivenYear - probTopic + 1.0) * (probYearGivenTopic - probYear + 1.0))
    
    # This array contains all data we need given a year
    # It has 3 columns: Score, NumDoc, and AnomalousScore
    data = np.zeros((yearDocTopicProb.shape[1], 3))
    data[:, 0] = np.sqrt(yearDocTopicProb.sum(axis=0)) * 3.0
    data[:, 2] = topicHeatMap
    
    for i in range(yearlyNumDocs):
        maxIdx = np.argmax(yearDocTopicProb[i, :])
        data[maxIdx, 1] += 1
    data[:, 1] = data[:, 1] / yearlyNumDocs
    
    df = pd.DataFrame(data, columns=['Score', 'NumDoc', 'AnomalousScore'])
    return df


def GetEvolveData():
    # Total number of documents
    totalDoc = len(docsFrame)

    # This array contains the prior probability of a topic across the whole corpus
    probTopic = docTopicProbs.sum(axis=0) / np.sum(docTopicProbs)

    data = {}
    minYear, maxYear = min(docsFrame.Date.dt.year), max(docsFrame.Date.dt.year)
    for year in range(minYear, maxYear + 1):
        df = GetYearlyData(year, totalDoc, probTopic)
        df['topic'] = summaries
        data[year] = df.to_dict('series')
    return data
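
To make the heat indicator in `GetYearlyData` concrete, the sketch below applies the same formula to a toy document-topic matrix. The matrix, the year labels, and the name `yearly_heat` are assumptions made for illustration; the real notebook works on `docTopicProbs` and `docsFrame`. The dominant-topic share is computed here with `np.bincount` rather than an explicit loop.

```python
import numpy as np

# Toy document-topic matrix: 4 documents x 2 topics (each row sums to 1).
doc_topic = np.array([[0.9, 0.1],
                      [0.8, 0.2],
                      [0.3, 0.7],
                      [0.2, 0.8]])
# Hypothetical year of each document.
years = np.array([2000, 2000, 2001, 2001])

def yearly_heat(doc_topic, years, year):
    """Return (heat indicator, dominant-topic document share) for one year."""
    mask = years == year
    year_mass = doc_topic[mask].sum(axis=0)                  # topic mass within the year
    p_topic = doc_topic.sum(axis=0) / doc_topic.sum()        # P(topic)
    p_year = mask.sum() / len(years)                         # P(year)
    p_topic_given_year = year_mass / year_mass.sum()         # P(topic | year)
    p_year_given_topic = year_mass / doc_topic.sum(axis=0)   # P(year | topic)
    heat = (p_topic_given_year - p_topic + 1.0) * (p_year_given_topic - p_year + 1.0)
    # Share of the year's documents whose most probable topic is each topic
    dominant = np.bincount(doc_topic[mask].argmax(axis=1), minlength=doc_topic.shape[1])
    return heat, dominant / mask.sum()

heat, share = yearly_heat(doc_topic, years, 2000)
print(heat, share)
```

In this toy data, topic 0 dominates the year 2000, so its heat value is above 1 and topic 1's is below 1. A value of exactly 1 would mean the topic is neither more nor less active than expected.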


Calculate the topic evolution data for every year, ready for plotting.

In [21]:
evolve_data = GetEvolveData()

#### Topic Evolving Bokeh Application

This interactive Bokeh application shows how topics evolve over time. It displays two plots: an overview of all topics evolving over time, and the trajectory of a single selected topic.

In [22]:
def evolve_doc(doc):
    global selected_topic
    
    minYear, maxYear = min(docsFrame.Date.dt.year), max(docsFrame.Date.dt.year)
    years = range(minYear, maxYear + 1)

    evolve_source = ColumnDataSource(data=evolve_data[years[0]])

    plot = figure(x_range=(0, 0.15), 
                  y_range=(0.9, 1.21), 
                  title='Topic Evolving Over Time', 
                  plot_height=220)
    plot.xaxis.ticker = SingleIntervalTicker(interval=0.01)
    plot.xaxis.axis_label = "Ratio of Supporting Documents"
    plot.yaxis.ticker = SingleIntervalTicker(interval=0.05)
    plot.yaxis.axis_label = "Anomalous Score"

    label = Label(x=0.1, y=0.9, text=str(years[0]), text_font_size='100pt', text_color='#eeeeee')
    plot.add_layout(label)

    color_mapper = CategoricalColorMapper(palette=Viridis256, factors=summaries)
    plot.circle(
        x='NumDoc',
        y='AnomalousScore',
        size='Score',
        source=evolve_source,
        fill_color={'field': 'topic', 'transform': color_mapper},
        fill_alpha=0.8,
        line_color='#7c7e71',
        line_width=0.5,
        line_alpha=0.5,
    )
    plot.add_tools(HoverTool(tooltips=[("Topic", "@topic")], show_arrow=False, point_policy='follow_mouse'))
    
    # Plot trajectory of a topic
    selected_topic = 0
    plot2 = figure(x_range=(0, 0.15), 
                  y_range=(0.9, 1.21), 
                  title='Trajectory of Topic: ' + summaries[selected_topic], 
                  plot_height=220)
    plot2.xaxis.ticker = SingleIntervalTicker(interval=0.01)
    plot2.xaxis.axis_label = "Ratio of Supporting Documents"
    plot2.yaxis.ticker = SingleIntervalTicker(interval=0.05)
    plot2.yaxis.axis_label = "Anomalous Score"
    label2 = Label(x=0.1, y=0.9, text=str(years[0]), text_font_size='100pt', text_color='#eeeeee')
    plot2.add_layout(label2)
    plot2.add_tools(HoverTool(tooltips=[("Year", "@year"), 
                                        ("Score", "@size{0.0000 a}"), 
                                        ("NumDoc", "@x{0.0000 a}"), 
                                        ("AnomalousScore", "@y{0.0000 a}")], 
                              show_arrow=False, 
                              point_policy='follow_mouse')
            )
    
    c = plot2.circle(
        x=[evolve_data[years[0]]['NumDoc'][selected_topic]],
        y=[evolve_data[years[0]]['AnomalousScore'][selected_topic]],
        size=[evolve_data[years[0]]['Score'][selected_topic]],
        fill_color=Viridis256[selected_topic],
        fill_alpha=0.8,
        line_color='#7c7e71',
        line_width=0.5,
        line_alpha=0.5,
    )
    trajectory_ds = c.data_source
    trajectory_ds.data['year'] = [years[0]]

    def animate_update():
        year = slider.value + 1
        if year > years[-1]:
            year = years[0]
            trajectory_ds.data['x'].clear()
            trajectory_ds.data['y'].clear()
            trajectory_ds.data['size'].clear()
            trajectory_ds.data['year'].clear()
        slider.value = year

    def slider_update(attrname, old, new):
        global selected_topic
        year = int(slider.value)
        label.text = str(year)
        evolve_source.data = evolve_data[year]
        
        # update trajectory plot
        label2.text = str(year)
        trajectory_ds.data['x'].append(evolve_data[year]['NumDoc'][selected_topic])
        trajectory_ds.data['y'].append(evolve_data[year]['AnomalousScore'][selected_topic])
        trajectory_ds.data['size'].append(evolve_data[year]['Score'][selected_topic])
        trajectory_ds.data['year'].append(year)
        trajectory_ds.trigger('data', trajectory_ds.data, trajectory_ds.data)

    slider = Slider(start=years[0], end=years[-1], value=years[0], step=1, title="Year")
    slider.on_change('value', slider_update)

    def animate():
        if play_button.label == '► Play':
            play_button.label = '❚❚ Pause'
            curdoc().add_periodic_callback(animate_update, 750)
        else:
            play_button.label = '► Play'
            curdoc().remove_periodic_callback(animate_update)

    play_button = Button(label='► Play', width=60)
    play_button.on_click(animate)
    
    def searchTopic(attr, old, new):
        newOptions = []
        if keywordInput.value.strip() == '':
            topicSelect.options = summaries
            topicSelect.value = summaries[0]
            return
        
        keywords = keywordInput.value.strip().lower()
        for item in summaries:
            if keywords in item.lower():
                newOptions.append(item)
        topicSelect.options = newOptions
        if newOptions:
            topicSelect.value = newOptions[0]
    
    def changeTopic(attr, old, new):
        global selected_topic
        topic_summary = topicSelect.value
        selected_topic = summaries.index(topic_summary)
        trajectory_ds.data['x'].clear()
        trajectory_ds.data['y'].clear()
        trajectory_ds.data['size'].clear()
        trajectory_ds.data['year'].clear()
        year = int(slider.value)
        trajectory_ds.data['x'].append(evolve_data[year]['NumDoc'][selected_topic])
        trajectory_ds.data['y'].append(evolve_data[year]['AnomalousScore'][selected_topic])
        trajectory_ds.data['size'].append(evolve_data[year]['Score'][selected_topic])
        trajectory_ds.data['year'].append(year)
        trajectory_ds.trigger('data', trajectory_ds.data, trajectory_ds.data)
        c.glyph.fill_color = Viridis256[selected_topic]
        plot2.title.text = 'Trajectory of Topic: ' + summaries[selected_topic]
        
    keywordInput = TextInput(title="Search Topic:", value='')
    keywordInput.on_change('value', searchTopic)
    topicSelect = Select(title="Select Topic:", value=summaries[0], options=summaries)
    topicSelect.on_change('value', changeTopic)
    
    div1 = Div(text="""<h2>Overview of Topic Evolving Over Time</h2>
                <p>The plot below shows an overview of how topics evolve over time. The x-axis 
                represents the ratio of supporting documents in a specific year. The average is 0.005.
                The y-axis shows the scaled anomalous score. A score smaller than 1 means lower
                than expected activity, and a score larger than 1 means higher than expected activity.
                The size of each circle represents the score of a topic discussed in a year.</p>
                """)
    
    div2 = Div(text="""<h2>Topic Evolving Trajectory</h2>
                <p>The plot below shows the trajectory of a specific topic as it evolves over time.</p>""")
    
    evolve_lyt = layout([[div1],
                         [plot],
                         [div2],
                         [plot2],
                         [keywordInput], 
                         [topicSelect],
                         [slider, play_button]], 
                        sizing_mode='scale_width')
    doc.add_root(evolve_lyt)

evolve_handler = FunctionHandler(evolve_doc)
evolve_app = Application(evolve_handler)

show(evolve_app)
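
The trajectory plot grows by one point per year and is cleared when the animation wraps around or a different topic is chosen. The sketch below reproduces that bookkeeping without Bokeh, under stated assumptions: `Trajectory` and `next_year` are hypothetical names, and the yearly values are made up. In the application itself, the lists live in `trajectory_ds.data`.

```python
class Trajectory:
    """Accumulates one topic's (x, y, size, year) points as the year advances."""

    def __init__(self):
        self.data = {"x": [], "y": [], "size": [], "year": []}

    def reset(self):
        # Clear every column, as done when the animation wraps or the topic changes
        for column in self.data.values():
            column.clear()

    def add_year(self, yearly, year, topic):
        row = yearly[year]
        self.data["x"].append(row["NumDoc"][topic])
        self.data["y"].append(row["AnomalousScore"][topic])
        self.data["size"].append(row["Score"][topic])
        self.data["year"].append(year)

def next_year(current, first, last):
    """Advance by one year, wrapping to the first year after the last one."""
    return first if current + 1 > last else current + 1

# Made-up yearly data for two topics, shaped like evolve_data.
yearly = {
    2000: {"NumDoc": [0.10, 0.02], "AnomalousScore": [1.05, 0.98], "Score": [3.0, 1.5]},
    2001: {"NumDoc": [0.08, 0.04], "AnomalousScore": [1.01, 1.02], "Score": [2.5, 2.0]},
}
traj = Trajectory()
traj.add_year(yearly, 2000, topic=0)
traj.add_year(yearly, 2001, topic=0)
print(traj.data["year"], next_year(2001, 2000, 2001))
```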


### End

This is the last notebook of the document collection analysis series.