# Interactive Data Visualization with Bokeh 

This notebook serves as a brief introduction to Bokeh with Python.

In [1]:
import numpy as np
import pandas as pd

# Bokeh Essentials 
from bokeh.io import output_notebook
from bokeh.plotting import figure, show, ColumnDataSource

# Bokeh Helpers 
from bokeh.palettes import brewer
from bokeh.models import HoverTool

In [2]:
# Load Bokeh for visualization
output_notebook()

## Topic Model Visualization

In this section we'll visualize a corpus by exploring clustering and dimensionality reduction techniques. Text analysis is a high-dimensional problem, and the techniques shown here can be applied to other high-dimensional data sets as well.

The first step is to load our documents from disk and vectorize them using Gensim. This content is a bit beyond the scope of today's workshop, but I wanted to provide the code for reference, and I'm happy to go over it offline.

In [3]:
import nltk 
import string
import pickle
import gensim
import random 

from operator import itemgetter
from collections import defaultdict 
from nltk.corpus import wordnet as wn
from gensim.matutils import sparse2full
from nltk.corpus.reader.api import CorpusReader
from nltk.corpus.reader.api import CategorizedCorpusReader

CORPUS_PATH = "data/baleen_sample"
PKL_PATTERN = r'(?!\.)[a-z_\s]+/[a-f0-9]+\.pickle'
CAT_PATTERN = r'([a-z_\s]+)/.*'



In [4]:
class PickledCorpus(CategorizedCorpusReader, CorpusReader):
    
    def __init__(self, root, fileids=PKL_PATTERN, cat_pattern=CAT_PATTERN):
        CategorizedCorpusReader.__init__(self, {"cat_pattern": cat_pattern})
        CorpusReader.__init__(self, root, fileids)
        
        self.punct = set(string.punctuation) | {'“', '—', '’', '”', '…'}
        self.stopwords = set(nltk.corpus.stopwords.words('english'))
        self.wordnet = nltk.WordNetLemmatizer() 
    
    def _resolve(self, fileids, categories):
        if fileids is not None and categories is not None:
            raise ValueError("Specify fileids or categories, not both")

        if categories is not None:
            return self.fileids(categories=categories)
        return fileids
    
    def lemmatize(self, token, tag):
        token = token.lower()
        
        if token not in self.stopwords:
            if not all(c in self.punct for c in token):
                tag =  {
                    'N': wn.NOUN,
                    'V': wn.VERB,
                    'R': wn.ADV,
                    'J': wn.ADJ
                }.get(tag[0], wn.NOUN)
                return self.wordnet.lemmatize(token, tag)
    
    def tokenize(self, doc):
        # Expects a preprocessed document, removes stopwords and punctuation
        # makes all tokens lowercase and lemmatizes them. 
        return list(filter(None, [
            self.lemmatize(token, tag)
            for paragraph in doc 
            for sentence in paragraph 
            for token, tag in sentence 
        ]))
    
    def docs(self, fileids=None, categories=None):
        # Resolve the fileids and the categories
        fileids = self._resolve(fileids, categories)

        # Create a generator, loading one document into memory at a time.
        for path, enc, fileid in self.abspaths(fileids, True, True):
            with open(path, 'rb') as f:
                yield self.tokenize(pickle.load(f))

The `PickledCorpus` is a Python class that reads a continuous stream of pickle files from disk. The files themselves are preprocessed documents from RSS feeds on various topics (and are actually just a small sample of the documents in the larger corpus). If you're interested in the ingestion and curation of this corpus, see [baleen.districtdatalabs.com](http://baleen.districtdatalabs.com).
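To make the two path patterns concrete, here is a small sketch of how they select files and extract a category. The file ids are hypothetical, invented for illustration, and `re.fullmatch` is my choice for the demo; NLTK's own matching logic may differ in detail.

```python
import re

# Patterns copied from the configuration cell above.
PKL_PATTERN = r'(?!\.)[a-z_\s]+/[a-f0-9]+\.pickle'
CAT_PATTERN = r'([a-z_\s]+)/.*'

# Hypothetical file ids, invented for illustration only.
candidates = [
    "data_science/56d629e3c1808113ffb88e9d.pickle",
    "news/0a1b2c3d.pickle",
    "news/README.md",
    ".cache/0a1b2c3d.pickle",
]

def category_of(fileid):
    """Return the category directory of a matching file id, else None."""
    if re.fullmatch(PKL_PATTERN, fileid) is None:
        return None
    return re.match(CAT_PATTERN, fileid).group(1)

for fid in candidates:
    print(fid, "->", category_of(fid))
```

Only paths of the form `category/hexdigest.pickle` are accepted, and the directory name doubles as the category label.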

Just to get a feel for this data set, I'll load the corpus and print out the number of documents per category:

In [5]:
# Create the Corpus Reader
corpus = PickledCorpus(CORPUS_PATH)

In [6]:
# Count the total number of documents
total_docs = 0

# Count the number of documents per category. 
for category in corpus.categories():
    num_docs = sum(1 for doc in corpus.fileids(categories=[category]))
    total_docs += num_docs 
    
    print("{}: {:,} documents".format(category, num_docs))
    
print("\n{:,} documents in the corpus".format(total_docs))

books: 71 documents
business: 389 documents
cinema: 100 documents
cooking: 30 documents
data_science: 41 documents
design: 55 documents
do_it_yourself: 122 documents
gaming: 128 documents
news: 1,159 documents
politics: 149 documents
sports: 118 documents
tech: 176 documents

2,538 documents in the corpus


Our corpus reader object handles text preprocessing with NLTK (the Natural Language Toolkit), namely by converting each document as follows:

- tokenizing the document
- making all tokens lowercase
- removing stopwords and punctuation
- converting words to their lemma
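The steps above can be sketched without NLTK. This is only a toy: the stopword set and the lemma lookup table are hypothetical stand-ins for NLTK's stopword list and WordNet lemmatizer, and the regular-expression tokenizer is an assumption, not what the corpus was built with.

```python
import re
import string

# Toy resources standing in for NLTK's stopword list and WordNet lemmatizer.
# Both are hypothetical and far smaller than the real ones.
STOPWORDS = {"the", "a", "is", "are", "on", "and"}
LEMMAS = {"cats": "cat", "sitting": "sit", "mats": "mat"}
PUNCT = set(string.punctuation)

def tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def preprocess(text):
    """Lowercase, drop stopwords and punctuation, then map tokens to lemmas."""
    result = []
    for token in tokenize(text):
        token = token.lower()
        if token in STOPWORDS:
            continue
        if all(ch in PUNCT for ch in token):
            continue
        # Fall back to the token itself when the toy table has no entry.
        result.append(LEMMAS.get(token, token))
    return result

print(preprocess("The cats are sitting on the mats."))
```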

Here is an example document:

In [7]:
fid = random.choice(corpus.fileids())
doc = next(corpus.docs(fileids=[fid]))
print(" ".join(doc))



The next step is to convert these documents into vectors so that we can apply machine learning. We'll use a bag-of-words (bow) model with TF-IDF, implemented by the Gensim library.

In [8]:
# Create the lexicon from the corpus 
lexicon = gensim.corpora.Dictionary(corpus.docs())

# Create the document vectors 
docvecs = [lexicon.doc2bow(doc) for doc in corpus.docs()]

# Train the TF-IDF model and convert vectors to TF-IDF
tfidf = gensim.models.TfidfModel(docvecs, id2word=lexicon, normalize=True)
tfidfvecs = [tfidf[doc] for doc in docvecs]

# Save the lexicon and TF-IDF model to disk.
lexicon.save('data/topics/lexicon.dat')
tfidf.save('data/topics/tfidf_model.pkl')
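To show what the cell above computes, here is a pure-Python sketch of bag-of-words counting followed by TF-IDF weighting. The three toy documents are invented, and the weighting formula (term count times `log2(N / df)`, then scaling to unit length) is an assumption about a common TF-IDF variant; Gensim's exact defaults should be checked against its documentation.

```python
import math
from collections import Counter

# Three toy documents, already tokenized (invented for illustration).
docs = [
    ["say", "game", "team", "season", "game"],
    ["say", "data", "model", "team"],
    ["say", "game", "data", "data"],
]

# Lexicon: assign an integer id to every distinct token.
vocab = sorted({token for doc in docs for token in doc})
token2id = {token: idx for idx, token in enumerate(vocab)}

def doc2bow(doc):
    """Bag of words: sorted (token_id, count) pairs for one document."""
    return sorted(Counter(token2id[token] for token in doc).items())

bows = [doc2bow(doc) for doc in docs]

# Document frequency: in how many documents does each token id occur?
docfreq = Counter(tid for bow in bows for tid, _ in bow)

def tfidf(bow, n_docs=len(bows)):
    """Weight counts by log2(N / df), then scale the vector to unit length."""
    weighted = [(tid, tf * math.log2(n_docs / docfreq[tid])) for tid, tf in bow]
    # Tokens that occur in every document get weight 0 and are dropped.
    weighted = [(tid, w) for tid, w in weighted if w > 0.0]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(tid, w / norm) for tid, w in weighted]

for bow in bows:
    print(" ".join(
        "{} ({:0.2f})".format(vocab[tid], w)
        for tid, w in sorted(tfidf(bow), key=lambda pair: pair[1], reverse=True)
    ))
```

Note that "say", which appears in every toy document, disappears from all three vectors. This is the sense in which TF-IDF favors words that distinguish a document from the rest of the corpus.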

Documents are now described by the words that are most important to that document relative to the rest of the corpus. The document above has been transformed into the following vector with associated weights: 

In [9]:
# Convert the random document from above into a TF-IDF vector
dv = tfidf[lexicon.doc2bow(doc)]

# Print the document terms and their weights. 
print(" ".join([
    "{} ({:0.2f})".format(lexicon[tid], score)
    for tid, score in sorted(dv, key=itemgetter(1), reverse=True)
]))



### Topic Visualization with LDA

We have a lot of documents in our corpus, so let's see if we can cluster them into related topics using the Latent Dirichlet Allocation (LDA) model that comes with Gensim. This model is widely used for "topic modeling", that is, clustering documents by topic.

In [10]:
# Select the number of topics to train the model on.
NUM_TOPICS = 10 

# Create the LDA model from the docvecs corpus and save to disk.
model = gensim.models.LdaModel(docvecs, id2word=lexicon, alpha='auto', num_topics=NUM_TOPICS)
model.save('data/topics/lda_model.pkl')

Each topic is represented as a vector, where each word is a dimension and the value is the probability of that word belonging to the topic. We can use the model to query the topics for a document; our random document from above is assigned the following topics with associated probabilities:

In [11]:
model[lexicon.doc2bow(doc)]

[(2, 0.72882756700044149), (8, 0.2632769507616482)]

We can assign the most probable topic to each document in our corpus by selecting the topic with the maximal probability: 

In [12]:
topics = [
    max(model[doc], key=itemgetter(1))[0]
    for doc in docvecs
]
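For a single document, the selection above reduces to picking the pair with the largest second element. Here is a small sketch using the topic distribution printed earlier for our random document. The `default` argument is my own addition to guard against an empty distribution and is not part of the notebook's code.

```python
from operator import itemgetter

# Topic distribution for one document as (topic_id, probability) pairs,
# copied from the model output shown earlier.
doc_topics = [(2, 0.72882756700044149), (8, 0.2632769507616482)]

# Pick the pair with the highest probability and unpack it.
best_topic, best_prob = max(doc_topics, key=itemgetter(1))
print("Most probable topic: {} ({:0.2f})".format(best_topic, best_prob))

# Assumption: if a distribution could ever be empty, a default avoids a ValueError.
fallback = max([], key=itemgetter(1), default=(None, 0.0))
```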

Topics themselves can be described by their highest probability words:

In [13]:
for tid, topic in model.print_topics():
    print("Topic {}:\n{}\n".format(tid, topic))

Topic 0:
0.010*"game" + 0.007*"say" + 0.006*"team" + 0.005*"get" + 0.005*"one" + 0.005*"season" + 0.005*"go" + 0.005*"first" + 0.005*"make" + 0.005*"new"

Topic 1:
0.007*"data" + 0.006*"say" + 0.004*"one" + 0.004*"use" + 0.004*"also" + 0.003*"make" + 0.003*"like" + 0.003*"people" + 0.003*"new" + 0.003*"find"

Topic 2:
0.009*"say" + 0.006*"year" + 0.005*"one" + 0.004*"people" + 0.004*"state" + 0.004*"two" + 0.003*"eng" + 0.003*"also" + 0.003*"time" + 0.003*"get"

Topic 3:
0.011*"say" + 0.008*"year" + 0.004*"state" + 0.003*"take" + 0.003*"also" + 0.003*"make" + 0.003*"time" + 0.003*"would" + 0.003*"go" + 0.003*"new"

Topic 4:
0.014*"trump" + 0.012*"say" + 0.005*"republican" + 0.005*"one" + 0.005*"get" + 0.004*"go" + 0.004*"like" + 0.004*"clinton" + 0.004*"make" + 0.004*"state"

Topic 5:
0.006*"one" + 0.005*"make" + 0.005*"may" + 0.004*"time" + 0.004*"say" + 0.004*"get" + 0.004*"1" + 0.003*"like" + 0.003*"take" + 0.003*"two"

Topic 6:
0.011*"say" + 0.006*"trump" + 0.005*"new" + 0.005*"yea

We can plot each topic by using decomposition methods (TruncatedSVD in this case) to reduce each topic's term-probability vector to 2 dimensions, then size the radius of each topic according to how much probability the documents in the corpus contribute to it. PCA, explored below, is worth trying here as well.

In [14]:
# Create a sum dictionary that adds up the total probability 
# of each document in the corpus to each topic. 
tsize = defaultdict(float)
for doc in docvecs:
    for tid, prob in model[doc]:
        tsize[tid] += prob
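The accumulation above can be tried on toy data. The sketch below also shows one possible way to turn the totals into circle radii by scaling relative to the largest topic. `MAX_RADIUS` and the toy distributions are hypothetical; the plotting cell in this notebook divides by a fixed constant instead.

```python
from collections import defaultdict

# Toy per-document topic distributions as (topic_id, probability) pairs.
doc_topics = [
    [(0, 0.9), (1, 0.1)],
    [(0, 0.4), (2, 0.6)],
    [(1, 0.5), (2, 0.5)],
]

# Sum the probability mass that all documents assign to each topic.
tsize = defaultdict(float)
for dist in doc_topics:
    for tid, prob in dist:
        tsize[tid] += prob

# Hypothetical scaling: the largest topic gets MAX_RADIUS (in data units),
# every other topic is sized proportionally.
MAX_RADIUS = 0.05
largest = max(tsize.values())
radii = {tid: MAX_RADIUS * mass / largest for tid, mass in tsize.items()}

for tid in sorted(radii):
    print("Topic {}: mass={:0.2f} radius={:0.4f}".format(tid, tsize[tid], radii[tid]))
```

Scaling against the largest topic keeps the radii in a predictable range no matter how many documents the corpus contains.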

In [15]:
# Create a numpy array of topic vectors where each vector 
# is the topic probability of all terms in the lexicon. 
tvecs = np.array([
    sparse2full(model.get_topic_terms(tid, len(lexicon)), len(lexicon)) 
    for tid in range(NUM_TOPICS)
])

In [16]:
# Import the model family 
from sklearn.decomposition import TruncatedSVD 

# Instantiate the model form, fit and transform 
topic_svd = TruncatedSVD(n_components=2)
svd_tvecs = topic_svd.fit_transform(tvecs)

In [17]:
# Create the Bokeh columnar data source with our various elements. 
# Note the resize/normalization of the topics so the radius of our
# topic circles fits in the graph a bit better. 
tsource = ColumnDataSource(
        data=dict(
            x=svd_tvecs[:, 0],
            y=svd_tvecs[:, 1],
            w=[model.print_topic(tid, 10) for tid in range(NUM_TOPICS)],
            c=brewer['Spectral'][NUM_TOPICS],
            r=[tsize[idx]/700000.0 for idx in range(NUM_TOPICS)],
        )
    )

# Create the hover tool so that we can visualize the topics. 
hover = HoverTool(
        tooltips=[
            ("Words", "@w"),
        ]
    )


# Create the figure to draw the graph on. 
plt = figure(
    title="Topic Model Decomposition", 
    width=960, height=540, 
    tools="pan,box_zoom,reset,save"
)

# Add the hover tool 
plt.add_tools(hover)

# Plot the SVD topic dimensions as a scatter plot 
plt.scatter(
    'x', 'y', source=tsource, size=9,
    radius='r', line_color='c', fill_color='c',
    marker='circle', fill_alpha=0.85,
)

# Show the plot to render the JavaScript 
show(plt)

### Corpus Visualization with PCA

The bag of words model means that every token (string representation of a word) is a dimension and a document is represented by a vector that maps the relative weight of that dimension to the document by the TF-IDF metric. In order to visualize documents in this high dimensional space, we must use decomposition methods to reduce the dimensionality to something we can plot. 

One good first attempt is to use principal component analysis (PCA) to reduce the data set's dimensions (the number of vocabulary words in the corpus) to 2 in order to map the corpus as a scatter plot.
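As a rough picture of what PCA does, the NumPy sketch below centers a toy matrix, takes its singular value decomposition, and projects the rows onto the first two principal directions. The random matrix is a stand-in for real document vectors, and library implementations may flip the sign of a component, so only the structure is meant to carry over.

```python
import numpy as np

# Toy "document-term" matrix: 6 documents, 5 terms (random, for illustration).
rng = np.random.default_rng(42)
X = rng.random((6, 5))

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    # PCA centers each column first; TruncatedSVD skips this step.
    centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:2].T

coords = pca_2d(X)
print(coords.shape)
```

The centering step is the main difference from TruncatedSVD, which works on the matrix as given and is therefore a better fit for large sparse inputs.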

We'll use the Scikit-Learn PCA transformer to do this work:

In [18]:
# In order to use Scikit-Learn we need to transform Gensim vectors into a numpy Matrix. 
docarr = np.array([sparse2full(vec, len(lexicon)) for vec in tfidfvecs])
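The conversion performed by `sparse2full` can be pictured with a few lines of plain Python. `to_dense` is a hypothetical helper written for this illustration, not part of Gensim.

```python
def to_dense(sparse_vec, length):
    """Expand (index, weight) pairs into a dense list of the given length."""
    dense = [0.0] * length
    for idx, weight in sparse_vec:
        dense[idx] = weight
    return dense

# A sparse vector with two non-zero terms in a six-word lexicon.
print(to_dense([(1, 0.5), (4, 0.25)], 6))
```

The dense array has one row per document and one column per lexicon entry, so its memory footprint grows with both the corpus and the vocabulary.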

In [19]:
# Import the model family 
from sklearn.decomposition import PCA 

# Instantiate the model form, fit and transform 
tfidf_pca = PCA(n_components=2)
pca_dvecs = tfidf_pca.fit_transform(docarr)

We can now use Bokeh to create an interactive plot that will allow us to explore documents according to their position in decomposed TF-IDF space, coloring by their topic. 

In [20]:
# Create a map using the ColorBrewer 'Paired' Palette to assign 
# Topic IDs to specific colors. 
cmap = {
    i: brewer['Paired'][10][i]
    for i in range(10)
}

# Create a tokens listing for our hover tool. 
tokens = [
    " ".join([
        lexicon[tid] for tid, _ in sorted(doc, key=itemgetter(1), reverse=True)
    ][:10])
    for doc in tfidfvecs
]

# Create a Bokeh tabular data source to describe the data we've created. 
source = ColumnDataSource(
        data=dict(
            x=pca_dvecs[:, 0],
            y=pca_dvecs[:, 1],
            w=tokens,
            t=topics,
            c=[cmap[t] for t in topics],
        )
    )

# Create an interactive hover tool so that we can see the document. 
hover = HoverTool(
        tooltips=[
            ("Words", "@w"),
            ("Topic", "@t"),
        ]
    )

# Create the figure to draw the graph on. 
plt = figure(
    title="PCA Decomposition of BoW Space", 
    width=960, height=540, 
    tools="pan,box_zoom,reset,save"
)

# Add the hover tool to the figure 
plt.add_tools(hover)

# Create the scatter plot with the PCA dimensions as the points. 
plt.scatter(
    'x', 'y', source=source, size=9,
    marker='circle_x', line_color='c', 
    fill_color='c', fill_alpha=0.5,
)

# Show the plot to render the JavaScript 
show(plt)

Another approach is to use t-SNE (t-distributed stochastic neighbor embedding), available in Scikit-Learn as the `TSNE` model. This is a very popular mechanism for visualizing and projecting text clusters.

In [25]:
# Import the TSNE model family from the manifold package 
from sklearn.manifold import TSNE 
from sklearn.pipeline import Pipeline

# Instantiate the model form. It is usually recommended
# to apply PCA (for dense data) or TruncatedSVD (for sparse)
# before TSNE to reduce noise and improve performance. 
tsne = Pipeline([
    ('svd', TruncatedSVD(n_components=75)),
    ('tsne', TSNE(n_components=2)),
])
                     
# Transform our TF-IDF vectors.
tsne_dvecs = tsne.fit_transform(docarr)

In [26]:
# Create a map using the ColorBrewer 'Paired' Palette to assign 
# Topic IDs to specific colors. 
cmap = {
    i: brewer['Paired'][10][i]
    for i in range(10)
}

# Create a tokens listing for our hover tool. 
tokens = [
    " ".join([
        lexicon[tid] for tid, _ in sorted(doc, key=itemgetter(1), reverse=True)
    ][:10])
    for doc in tfidfvecs
]

# Create a Bokeh tabular data source to describe the data we've created. 
source = ColumnDataSource(
        data=dict(
            x=tsne_dvecs[:, 0],
            y=tsne_dvecs[:, 1],
            w=tokens,
            t=topics,
            c=[cmap[t] for t in topics],
        )
    )

# Create an interactive hover tool so that we can see the document. 
hover = HoverTool(
        tooltips=[
            ("Words", "@w"),
            ("Topic", "@t"),
        ]
    )

# Create the figure to draw the graph on. 
plt = figure(
    title="TSNE Decomposition of BoW Space", 
    width=960, height=540, 
    tools="pan,box_zoom,reset,save"
)

# Add the hover tool to the figure 
plt.add_tools(hover)

# Create the scatter plot with the t-SNE dimensions as the points. 
plt.scatter(
    'x', 'y', source=source, size=9,
    marker='circle_x', line_color='c', 
    fill_color='c', fill_alpha=0.5,
)

# Show the plot to render the JavaScript 
show(plt)