# TensorBoard Visualizations


In this tutorial, we will learn how to visualize different types of NLP embeddings with TensorBoard. TensorBoard is a visualization framework for inspecting TensorFlow runs and graphs. We will use its built-in visualizer, the *Embedding Projector*, which lets you interactively visualize and analyze high-dimensional data such as embeddings.


## Read Data 

For this tutorial, a transformed MovieLens dataset<sup>[1]</sup> is used. You can download the final prepared csv from [here](https://github.com/parulsethi/DocViz/blob/master/movie_plots.csv).

In [49]:
import gensim
import pandas as pd
import smart_open
import random

# read data
dataframe = pd.read_csv('/Users/Shemelis/10-16-2017-GW-Arlington-Class-Repository-DATA/Machine-Learning-Project/All_Journal_data.csv', index_col='pmid')
# draw a random sample of 10 rows
dataframe = dataframe.sample(n=10, axis=0)
dataframe.head()

| pmid | Unnamed: 0 | Unnamed: 0.1 | title | abstract | journal | label |
|---|---|---|---|---|---|---|
| 29235502 | 10911 | 6708 | Stress-induced release of the S100A8/A9 alarmi... | Psychological stress is thought to be an impor... | sci_rep | 2 |
| 29487327 | 6746 | 2371 | Molecular detection of colistin resistance gen... | Antimicrobial resistance against colistin has ... | sci_rep | 2 |
| 29511259 | 6383 | 1948 | Plant-made Salmonella bacteriocins salmocins f... | Salmonella enterica causes an estimated 1 mill... | sci_rep | 2 |
| 29044132 | 14884 | 10805 | Enhancement mechanisms of short-time aerobic d... | Cocoamidopropyl betaine (CAPB), which is a bio... | sci_rep | 2 |
| 29346419 | 29844 | 4511 | SAXS analysis of a soluble cytosolic NgBR cons... | The Nogo-B receptor (NgBR) is involved in onco... | plos_one | 2 |


# 1. Visualizing Doc2Vec
In this part, we will learn about visualizing Doc2Vec embeddings, aka [Paragraph Vectors](https://arxiv.org/abs/1405.4053), via TensorBoard. The input documents for training are the texts in the `abstract` column of the dataframe loaded above, on which the Doc2Vec model is trained.

<img src="Tensorboard.png">

The visualization is a scatterplot like the one in the image above, where each datapoint is labelled by the movie title and colored by its corresponding genre. You can also visit this [Projector link](http://projector.tensorflow.org/?config=https://raw.githubusercontent.com/parulsethi/DocViz/master/movie_plot_config.json), which is configured with my embeddings for the above-mentioned dataset.


## Preprocess Text

Below, we define a function that reads the training documents, pre-processes each one with a simple gensim pre-processing tool (i.e., tokenizes the text into individual words, removes punctuation, sets everything to lowercase, etc.), and returns a list of words. To train the model, we also need to associate a tag/number with each document of the training corpus. In our case, the tag is simply the zero-based line number.

In [50]:
def read_corpus(documents):
    for i, plot in enumerate(documents):
        yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(plot, max_len=30), [i])

In [51]:
train_corpus = list(read_corpus(dataframe.abstract))

Let's take a look at the training corpus.

In [52]:
train_corpus[:2]

[TaggedDocument(words=['psychological', 'stress', 'is', 'thought', 'to', 'be', 'an', 'important', 'trigger', 'of', 'cardiovascular', 'events', 'yet', 'the', 'involved', 'pathways', 'and', 'mediators', 'are', 'largely', 'unknown', 'elevated', 'systemic', 'levels', 'of', 'the', 'pro', 'inflammatory', 'alarmin', 'correlate', 'with', 'poor', 'prognosis', 'in', 'coronary', 'artery', 'disease', 'cad', 'patients', 'here', 'we', 'investigated', 'the', 'links', 'between', 'release', 'and', 'parameters', 'of', 'anti', 'inflammatory', 'glucocorticoid', 'secretion', 'in', 'two', 'different', 'cohorts', 'subjected', 'to', 'psychological', 'stress', 'test', 'in', 'the', 'first', 'cohort', 'of', 'cad', 'patients', 'psychological', 'stress', 'induced', 'rapid', 'increase', 'of', 'circulating', 'this', 'rapid', 'response', 'strongly', 'correlated', 'with', 'elevated', 'evening', 'saliva', 'cortisol', 'levels', 'suggesting', 'an', 'association', 'with', 'dysregulated', 'hypothalamic', 'pituitary', 'adre

## Training the Doc2Vec Model
We'll instantiate a Doc2Vec model with a vector size of 50 and iterate over the training corpus 55 times. We set the minimum word count to 2 so that words appearing only once in the corpus are discarded. Model accuracy can often be improved by increasing the number of iterations, but this generally increases the training time. Small datasets with short documents, like this one, can benefit from more training passes.

In [53]:
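# `vector_size` and `epochs` are the current names of the older `size` and `iter` arguments
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=55)
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)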

2018-04-22 11:25:37,408 : MainThread : INFO : collecting all words and their counts
2018-04-22 11:25:37,410 : MainThread : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2018-04-22 11:25:37,411 : MainThread : INFO : collected 811 word types and 10 unique tags from a corpus of 10 examples and 1834 words
2018-04-22 11:25:37,412 : MainThread : INFO : Loading a fresh vocabulary
2018-04-22 11:25:37,415 : MainThread : INFO : min_count=2 retains 274 unique words (33% of original 811, drops 537)
2018-04-22 11:25:37,416 : MainThread : INFO : min_count=2 leaves 1297 word corpus (70% of original 1834, drops 537)
2018-04-22 11:25:37,420 : MainThread : INFO : deleting the raw counts dictionary of 811 items
2018-04-22 11:25:37,421 : MainThread : INFO : sample=0.001 downsamples 87 most-common words
2018-04-22 11:25:37,423 : MainThread : INFO : downsampling leaves estimated 825 word corpus (63.7% of prior 1297)
2018-04-22 11:25:37,425 : MainThread : INFO : estimated requ

2018-04-22 11:25:37,634 : MainThread : INFO : worker thread finished; awaiting finish of 1 more threads
2018-04-22 11:25:37,637 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2018-04-22 11:25:37,638 : MainThread : INFO : EPOCH - 14 : training on 1834 raw words (820 effective words) took 0.0s, 94434 effective words/s
2018-04-22 11:25:37,644 : MainThread : INFO : worker thread finished; awaiting finish of 2 more threads
2018-04-22 11:25:37,646 : MainThread : INFO : worker thread finished; awaiting finish of 1 more threads
2018-04-22 11:25:37,647 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2018-04-22 11:25:37,649 : MainThread : INFO : EPOCH - 15 : training on 1834 raw words (821 effective words) took 0.0s, 162243 effective words/s
2018-04-22 11:25:37,659 : MainThread : INFO : worker thread finished; awaiting finish of 2 more threads
2018-04-22 11:25:37,662 : MainThread : INFO : worker thread finished; awaiting finish of 

2018-04-22 11:25:37,873 : MainThread : INFO : worker thread finished; awaiting finish of 1 more threads
2018-04-22 11:25:37,875 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2018-04-22 11:25:37,878 : MainThread : INFO : EPOCH - 32 : training on 1834 raw words (845 effective words) took 0.0s, 106678 effective words/s
2018-04-22 11:25:37,884 : MainThread : INFO : worker thread finished; awaiting finish of 2 more threads
2018-04-22 11:25:37,886 : MainThread : INFO : worker thread finished; awaiting finish of 1 more threads
2018-04-22 11:25:37,888 : MainThread : INFO : worker thread finished; awaiting finish of 0 more threads
2018-04-22 11:25:37,889 : MainThread : INFO : EPOCH - 33 : training on 1834 raw words (839 effective words) took 0.0s, 129318 effective words/s
2018-04-22 11:25:37,894 : MainThread : INFO : worker thread finished; awaiting finish of 2 more threads
2018-04-22 11:25:37,900 : MainThread : INFO : worker thread finished; awaiting finish of

Now, we'll save the document embedding vectors per doctag.

In [76]:
model.save_word2vec_format('doc_tensor.w2v', doctag_vec=True, word_vec=False)  

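AttributeError: 'LdaModel' object has no attribute 'save_word2vec_format'

The error above arises because this cell (In [76]) was executed after `model` had been reassigned to the LDA model trained in Part 2. Run it while `model` still refers to the Doc2Vec model trained above.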

## Prepare the Input files for Tensorboard

TensorBoard takes two input files: one containing the embedding vectors and the other containing relevant metadata. We'll use a gensim script to convert the embedding file saved in word2vec format above directly to the TSV format required by TensorBoard.

In [55]:
%run ../../gensim/scripts/word2vec2tensor.py -i doc_tensor.w2v -o abstract_plot

2018-04-22 11:26:11,493 : MainThread : INFO : running ../../gensim/scripts/word2vec2tensor.py -i doc_tensor.w2v -o abstract_plot
2018-04-22 11:26:11,496 : MainThread : INFO : loading projection weights from doc_tensor.w2v
2018-04-22 11:26:11,500 : MainThread : INFO : loaded (10, 50) matrix from doc_tensor.w2v


TypeError: write() argument must be str, not bytes

The script above is meant to generate two files: `abstract_plot_tensor.tsv`, which contains the embedding vectors, and `abstract_plot_metadata.tsv`, which contains the doctags. These doctags are simply the unique index values, so they do not help identify a document while visualizing. We will therefore overwrite `abstract_plot_metadata.tsv` with a custom metadata file that has two columns: the first for the titles and the second for their corresponding labels.

In [57]:
with open('abstract_plot_metadata.tsv','w') as w:
    w.write('Titles\tLabels\n')
    for i,j in zip(dataframe.title, dataframe.label):
        w.write("%s\t%s\n" % (i,j))
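
If the conversion script stops with an error such as the `TypeError` shown above, both files can also be written by hand, since they are plain tab-separated text. The following is a minimal sketch under that assumption; `write_projector_files` and its parameters are hypothetical names introduced for illustration and are not part of gensim or TensorBoard.

```python
def _clean(value):
    # Tabs and newlines inside a field would break the TSV layout.
    return str(value).replace("\t", " ").replace("\n", " ")


def write_projector_files(vectors, metadata_rows, header, tensor_path, metadata_path):
    """Write embedding vectors and row-aligned metadata as tab-separated files.

    Hypothetical helper: `vectors` is a sequence of numeric sequences and
    `metadata_rows` holds one tuple of column values per vector, in the same order.
    """
    if len(vectors) != len(metadata_rows):
        raise ValueError("need exactly one metadata row per vector")
    # Tensor file: one vector per line, components separated by tabs, no header.
    with open(tensor_path, "w", encoding="utf-8") as f:
        for vec in vectors:
            f.write("\t".join(str(float(x)) for x in vec) + "\n")
    # Metadata file: a header row followed by one row per vector.
    with open(metadata_path, "w", encoding="utf-8") as f:
        f.write("\t".join(header) + "\n")
        for row in metadata_rows:
            f.write("\t".join(_clean(v) for v in row) + "\n")
```

The row order matters: the n-th metadata row describes the n-th vector, so both sequences should be built from the same iteration over the documents.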

Now you can go to http://projector.tensorflow.org/ and upload the two files by clicking on *Load data* in the left panel.

For demo purposes I have uploaded the Doc2Vec embeddings generated from the model trained above [here](https://github.com/parulsethi/DocViz). You can access the Embedding projector configured with these uploaded embeddings at this [link](http://projector.tensorflow.org/?config=https://raw.githubusercontent.com/parulsethi/DocViz/master/movie_plot_config.json).

# Using TensorBoard

For visualization, the multi-dimensional embeddings that we get from the Doc2Vec model above need to be reduced to 2 or 3 dimensions. We end up with a new 2D or 3D embedding that tries to preserve information from the original multi-dimensional embedding. Because the vectors are reduced to a much smaller dimension, the exact cosine/Euclidean distances between them are not preserved; only their relative relationships are approximately retained, and hence, as you'll see below, the nearest-neighbor similarity results may change.
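
To see why nearest neighbors can change after reduction, here is a small toy sketch. It uses made-up 3-D points and simply drops the last coordinate, which is a crude stand-in for a real projection and not what the Embedding Projector does; the names `cosine` and `nearest` are illustrative helpers.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(name, points):
    """Return the name of the point most cosine-similar to `name`."""
    others = [n for n in points if n != name]
    return max(others, key=lambda n: cosine(points[name], points[n]))


# Made-up points: "a" and "c" differ only in the third coordinate.
full = {"a": (1.0, 0.0, 1.0), "b": (1.0, 0.2, 1.0), "c": (1.0, 0.0, -1.0)}
# Crude reduction: keep only the first two coordinates.
reduced = {name: vec[:2] for name, vec in full.items()}

nearest_full = nearest("a", full)        # "b" in the original space
nearest_reduced = nearest("a", reduced)  # "c" once the third coordinate is gone
```

The information that separated "a" from "c" lived entirely in the discarded dimension, so the reduced view reports a different neighbor.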

TensorBoard has two popular dimensionality reduction methods for visualizing the embeddings and also provides a custom method based on text searches:

- **Principal Component Analysis**: PCA aims at exploring the global structure in the data and can end up losing the local similarities between neighbours. It maximizes the total variance in the lower-dimensional subspace and hence often preserves the larger pairwise distances better than the smaller ones. For the intuition behind this, see this nicely explained [answer](https://stats.stackexchange.com/questions/176672/what-is-meant-by-pca-preserving-only-large-pairwise-distances) on Stack Exchange.

- **t-SNE**: The idea of t-SNE is to place local neighbours close to each other while almost completely ignoring the global structure. It is useful for exploring local neighborhoods and finding local clusters, but global trends are not represented accurately and the separation between different groups is often not preserved (the t-SNE plots of our data below illustrate this).

- **Custom Projections**: This is a custom method based on the text searches you define for different directions. It can be useful for finding meaningful directions in the vector space, for example female to male, or currency to country.

You can refer to this [doc](https://www.tensorflow.org/get_started/embedding_viz) for instructions on how to use and navigate through different panels available in TensorBoard.
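
As a rough illustration of the custom projection idea, the sketch below builds an axis from two text searches and projects every point onto it. It assumes the axis runs between the centroids of the points whose labels match each search; the function names, the substring matching, and the toy labels are all assumptions made for this example, not the Projector's actual code.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    if not vectors:
        raise ValueError("no vectors matched the search text")
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def custom_axis_coordinates(labelled, left_query, right_query):
    """Project each vector onto the axis from the 'left' to the 'right' centroid.

    Returns a dict label -> coordinate, where 0.0 is the left centroid
    and 1.0 is the right centroid.
    """
    left = centroid([v for label, v in labelled if left_query in label])
    right = centroid([v for label, v in labelled if right_query in label])
    axis = [r - l for l, r in zip(left, right)]
    norm_sq = sum(a * a for a in axis)
    if norm_sq == 0:
        raise ValueError("both searches give the same centroid")
    return {
        label: sum((x - l) * a for x, l, a in zip(v, left, axis)) / norm_sq
        for label, v in labelled
    }


# Toy data with made-up labels and 2-D vectors.
points = [
    ("cell biology", (0.0, 0.0)),
    ("cell signaling", (0.0, 2.0)),
    ("quantum optics", (4.0, 0.0)),
    ("quantum spin", (4.0, 2.0)),
    ("imaging", (2.0, 1.0)),
]
coords = custom_axis_coordinates(points, "cell", "quantum")
```

Here `coords["imaging"]` lands halfway along the axis, which is the kind of reading a custom projection is meant to make visible.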

## Visualize using PCA

The Embedding Projector computes the top 10 principal components. The menu in the left panel lets you project the data onto any combination of two or three of those components.
<img src="pca.png">
The above plot was made using the first two principal components, which together cover 36.5% of the total variance.
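
The "variance covered" figure is the share of the total variance (the sum of all eigenvalues of the covariance matrix) that falls on the chosen components. The sketch below computes that share for the first component only, using plain-Python power iteration; it is an illustrative simplification with hypothetical function names, not the Projector's implementation.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of the rows (one observation per row)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def top_component(cov, iterations=200):
    """Power iteration for the largest eigenvalue and its unit eigenvector."""
    d = len(cov)
    # Arbitrary non-uniform start; it must not be orthogonal to the top eigenvector.
    v = [float(i + 1) for i in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T C v of the unit vector gives the eigenvalue.
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    return eigenvalue, v


def first_component_variance_ratio(rows):
    """Fraction of total variance captured by the first principal component."""
    cov = covariance_matrix(rows)
    eigenvalue, _ = top_component(cov)
    total = sum(cov[i][i] for i in range(len(cov)))  # trace = sum of eigenvalues
    return eigenvalue / total


on_a_line = first_component_variance_ratio([(0, 0), (1, 2), (2, 4), (3, 6)])
isotropic = first_component_variance_ratio([(1, 0), (-1, 0), (0, 1), (0, -1)])
```

Points lying on a single line give a ratio close to 1, while the symmetric cross gives 0.5; a low combined share for two components means much of the structure is invisible in that 2D view.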


## Visualize using T-SNE

Data is visualized by animating through every iteration of the t-SNE algorithm. The t-SNE menu at the left lets you adjust the values of its two hyperparameters. The first is **perplexity**, which is basically a measure of information; it may be viewed as a knob that sets the number of effective nearest neighbors<sup>[2]</sup>. The second is the **learning rate**, which sets the step size of the gradient updates that move the points at each iteration.

<img src="tsne.png">

The above plot was generated with perplexity 8, learning rate 10 and 500 iterations. The results can vary between successive runs, so you may not get exactly the plot above with the same hyperparameter settings, but some small clusters will start forming as above, with different orientations.
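
To make "effective number of neighbors" concrete: perplexity is defined as 2 raised to the Shannon entropy (in bits) of a point's neighbor distribution. The sketch below evaluates that definition on two made-up distributions; the probabilities are illustrative and not taken from the data above.

```python
import math


def perplexity(probabilities):
    """Perplexity 2**H of a discrete distribution, with H the entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2 ** entropy


# Eight equally likely neighbors behave like 8 effective neighbors.
uniform_8 = perplexity([1 / 8] * 8)
# Almost all mass on one neighbor: barely more than 1 effective neighbor.
peaked = perplexity([0.97, 0.01, 0.01, 0.01])
```

Setting the perplexity in the menu therefore tells t-SNE roughly how many neighbors each point should treat as close.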

# 2. Visualizing LDA

In this part, we will see how to visualize LDA in TensorBoard. We will use the document-topic distribution as the embedding vector of a document. In other words, we treat the topics as dimensions, and the value in each dimension is the proportion of that topic in the document.
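
A small sketch of that idea: gensim reports a document's topics as a sparse list of `(topic_id, proportion)` pairs, and topics with a very small proportion may be left out, so the list has to be expanded into a fixed-length vector before it can serve as an embedding. The helper name and the example proportions below are made up for illustration.

```python
def dense_topic_vector(sparse_topics, num_topics):
    """Expand (topic_id, proportion) pairs into a fixed-length vector.

    Topics missing from `sparse_topics` get a proportion of 0.0.
    """
    vector = [0.0] * num_topics
    for topic_id, proportion in sparse_topics:
        vector[topic_id] = float(proportion)
    return vector


# Made-up topic proportions for one document in a 6-topic model.
doc_topics = [(0, 0.62), (3, 0.30), (5, 0.08)]
embedding = dense_topic_vector(doc_topics, num_topics=6)
```

Each document then becomes one row of the tensor file, with as many columns as there are topics.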

## Preprocess Text

We use the journal abstracts as the documents in our corpus and remove rare and common words based on their document frequency. Below we remove words that appear in fewer than 2 documents or in more than 30% of the documents.
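
To make the two thresholds concrete, here is a plain-Python sketch of document-frequency filtering on a toy corpus. It mirrors the meaning of `no_below` and `no_above` but is a simplified stand-in: the function name is hypothetical and it ignores other options such as a cap on the vocabulary size.

```python
from collections import Counter


def filter_by_document_frequency(texts, no_below=2, no_above=0.3):
    """Return the set of words kept by document-frequency thresholds.

    A word is kept if it appears in at least `no_below` documents and in
    at most `no_above` (a fraction) of all documents.
    """
    doc_freq = Counter()
    for tokens in texts:
        doc_freq.update(set(tokens))  # count each word once per document
    max_docs = no_above * len(texts)
    return {w for w, df in doc_freq.items() if no_below <= df <= max_docs}


# Toy corpus of 10 documents with made-up tokens.
toy_texts = [
    ["the", "cell"],
    ["the", "cell"],
    ["the", "cell", "quantum"],
    ["the", "quantum"],
    ["the", "rare"],
] + [["the"] for _ in range(5)]

kept = filter_by_document_frequency(toy_texts, no_below=2, no_above=0.3)
```

In this toy corpus "the" is dropped for appearing in every document and "rare" for appearing in only one, leaving "cell" and "quantum".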

In [61]:
import pandas as pd
import re
from gensim.parsing.preprocessing import remove_stopwords, strip_punctuation
from gensim.models import ldamodel
from gensim.corpora.dictionary import Dictionary

# read data
dataframe = pd.read_csv('/Users/Shemelis/10-16-2017-GW-Arlington-Class-Repository-DATA/Machine-Learning-Project/All_Journal_data.csv')

# remove stopwords and punctuation
def preprocess(row):
    return strip_punctuation(remove_stopwords(row.lower()))
    
dataframe['Plots'] = dataframe['abstract'].apply(preprocess)

# Convert data to the input format required by LDA.
# Note: this loop tokenizes the raw `abstract` column, not the preprocessed `Plots` column.
texts = []
for line in dataframe.abstract:
    lowered = line.lower()
    words = re.findall(r'\w+', lowered, flags=re.UNICODE)
    texts.append(words)
# Create a dictionary representation of the documents.
dictionary = Dictionary(texts)

# Filter out words that occur in fewer than 2 documents, or in more than 30% of the documents.
dictionary.filter_extremes(no_below=2, no_above=0.3)
# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(text) for text in texts]

2018-04-22 11:35:34,320 : MainThread : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-04-22 11:35:36,718 : MainThread : INFO : adding document #10000 to Dictionary(55774 unique tokens: ['2', 'a', 'aggregation', 'als', 'also']...)
2018-04-22 11:35:38,550 : MainThread : INFO : adding document #20000 to Dictionary(81265 unique tokens: ['2', 'a', 'aggregation', 'als', 'also']...)
2018-04-22 11:35:40,480 : MainThread : INFO : adding document #30000 to Dictionary(103483 unique tokens: ['2', 'a', 'aggregation', 'als', 'also']...)
2018-04-22 11:35:43,443 : MainThread : INFO : adding document #40000 to Dictionary(127570 unique tokens: ['2', 'a', 'aggregation', 'als', 'also']...)
2018-04-22 11:35:45,604 : MainThread : INFO : adding document #50000 to Dictionary(146454 unique tokens: ['2', 'a', 'aggregation', 'als', 'also']...)
2018-04-22 11:35:48,017 : MainThread : INFO : adding document #60000 to Dictionary(169468 unique tokens: ['2', 'a', 'aggregation', 'als', 'also']...)
20

## Train LDA Model


In [65]:
# Set training parameters.
num_topics = 6
chunksize = 2000
passes = 5  # number of full passes over the corpus (another name for epochs)
iterations = 20
eval_every = None

# Train model
model = ldamodel.LdaModel(corpus=corpus, id2word=dictionary, chunksize=chunksize, alpha='auto', eta='auto', iterations=iterations, num_topics=num_topics, passes=passes, eval_every=eval_every)

2018-04-22 12:28:41,770 : MainThread : INFO : using autotuned alpha, starting with [0.16666667, 0.16666667, 0.16666667, 0.16666667, 0.16666667, 0.16666667]
2018-04-22 12:28:41,791 : MainThread : INFO : using serial LDA version on this node
2018-04-22 12:28:41,848 : MainThread : INFO : running online (multi-pass) LDA training, 6 topics, 5 passes over the supplied corpus of 66620 documents, updating model once every 2000 documents, evaluating perplexity every 0 documents, iterating 20x with a convergence threshold of 0.001000
2018-04-22 12:28:41,850 : MainThread : INFO : PROGRESS: pass 0, at document #2000/66620
2018-04-22 12:28:42,607 : MainThread : INFO : optimized alpha [0.30347776, 0.30538231, 0.3038584, 0.30899113, 0.3063333, 0.30563092]
2018-04-22 12:28:42,611 : MainThread : INFO : merging changes from 2000 documents into a model of 66620 documents
2018-04-22 12:28:42,659 : MainThread : INFO : topic #0 (0.303): 0.008*"cell" + 0.005*"cells" + 0.004*"protein" + 0.004*"can" + 0.004*"c

2018-04-22 12:28:48,132 : MainThread : INFO : merging changes from 2000 documents into a model of 66620 documents
2018-04-22 12:28:48,178 : MainThread : INFO : topic #4 (0.193): 0.009*"cancer" + 0.008*"cells" + 0.007*"patients" + 0.006*"cell" + 0.005*"tumor" + 0.005*"t" + 0.004*"1" + 0.004*"treatment" + 0.004*"human" + 0.004*"expression"
2018-04-22 12:28:48,180 : MainThread : INFO : topic #5 (0.217): 0.004*"species" + 0.004*"1" + 0.004*"between" + 0.003*"more" + 0.003*"than" + 0.003*"data" + 0.003*"can" + 0.003*"social" + 0.003*"but" + 0.003*"0"
2018-04-22 12:28:48,183 : MainThread : INFO : topic #0 (0.350): 0.010*"cells" + 0.010*"cell" + 0.007*"protein" + 0.005*"dna" + 0.005*"1" + 0.004*"binding" + 0.004*"signaling" + 0.003*"mice" + 0.003*"expression" + 0.003*"human"
2018-04-22 12:28:48,186 : MainThread : INFO : topic #1 (0.373): 0.004*"can" + 0.004*"show" + 0.004*"between" + 0.003*"their" + 0.003*"cells" + 0.003*"cell" + 0.003*"neurons" + 0.003*"but" + 0.003*"model" + 0.003*"such"
20

2018-04-22 12:28:53,231 : MainThread : INFO : topic #0 (0.198): 0.012*"cells" + 0.012*"cell" + 0.007*"protein" + 0.005*"show" + 0.005*"dna" + 0.005*"1" + 0.004*"expression" + 0.004*"activation" + 0.004*"binding" + 0.004*"mice"
2018-04-22 12:28:53,233 : MainThread : INFO : topic #3 (0.214): 0.005*"materials" + 0.005*"high" + 0.005*"1" + 0.004*"can" + 0.004*"2" + 0.004*"energy" + 0.003*"surface" + 0.003*"metal" + 0.003*"chemistry" + 0.003*"chemical"
2018-04-22 12:28:53,236 : MainThread : INFO : topic #1 (0.248): 0.006*"quantum" + 0.005*"can" + 0.004*"show" + 0.004*"such" + 0.003*"between" + 0.003*"single" + 0.003*"their" + 0.003*"demonstrate" + 0.003*"spin" + 0.003*"magnetic"
2018-04-22 12:28:53,239 : MainThread : INFO : topic diff=1.295948, rho=0.301511
2018-04-22 12:28:53,324 : MainThread : INFO : PROGRESS: pass 0, at document #24000/66620
2018-04-22 12:28:54,066 : MainThread : INFO : optimized alpha [0.18543524, 0.23696855, 0.17563009, 0.20307086, 0.11911661, 0.12447152]
2018-04-22 12

2018-04-22 12:28:57,812 : MainThread : INFO : topic #0 (0.221): 0.011*"cells" + 0.010*"cell" + 0.008*"protein" + 0.007*"1" + 0.006*"expression" + 0.005*"signaling" + 0.005*"activation" + 0.005*"induced" + 0.004*"activity" + 0.004*"receptor"
2018-04-22 12:28:57,815 : MainThread : INFO : topic diff=1.855193, rho=0.250000
2018-04-22 12:28:57,894 : MainThread : INFO : PROGRESS: pass 0, at document #34000/66620
2018-04-22 12:28:58,683 : MainThread : INFO : optimized alpha [0.20605628, 0.16350824, 0.19886397, 0.18071762, 0.14549308, 0.12915292]
2018-04-22 12:28:58,689 : MainThread : INFO : merging changes from 2000 documents into a model of 66620 documents
2018-04-22 12:28:58,734 : MainThread : INFO : topic #5 (0.129): 0.007*"0" + 0.005*"between" + 0.004*"species" + 0.004*"data" + 0.004*"1" + 0.004*"than" + 0.003*"more" + 0.003*"during" + 0.003*"their" + 0.003*"but"
2018-04-22 12:28:58,736 : MainThread : INFO : topic #4 (0.145): 0.011*"cells" + 0.009*"cancer" + 0.008*"patients" + 0.008*"cell

2018-04-22 12:29:03,698 : MainThread : INFO : topic #5 (0.165): 0.009*"0" + 0.005*"between" + 0.004*"1" + 0.004*"species" + 0.003*"than" + 0.003*"data" + 0.003*"p" + 0.003*"during" + 0.003*"more" + 0.003*"but"
2018-04-22 12:29:03,700 : MainThread : INFO : topic #4 (0.167): 0.011*"patients" + 0.009*"cells" + 0.008*"cancer" + 0.008*"1" + 0.007*"cell" + 0.006*"0" + 0.006*"expression" + 0.006*"treatment" + 0.005*"p" + 0.005*"2"
2018-04-22 12:29:03,703 : MainThread : INFO : topic #2 (0.175): 0.008*"protein" + 0.006*"proteins" + 0.006*"binding" + 0.005*"genes" + 0.005*"2" + 0.004*"gene" + 0.004*"membrane" + 0.004*"its" + 0.004*"molecular" + 0.004*"activity"
2018-04-22 12:29:03,707 : MainThread : INFO : topic #1 (0.177): 0.007*"can" + 0.004*"based" + 0.004*"system" + 0.004*"between" + 0.004*"two" + 0.003*"time" + 0.003*"model" + 0.003*"imaging" + 0.003*"field" + 0.003*"method"
2018-04-22 12:29:03,710 : MainThread : INFO : topic #3 (0.184): 0.007*"1" + 0.006*"high" + 0.006*"2" + 0.004*"3" + 0.

2018-04-22 12:29:08,797 : MainThread : INFO : topic #4 (0.187): 0.016*"patients" + 0.011*"0" + 0.009*"1" + 0.008*"p" + 0.007*"cancer" + 0.006*"treatment" + 0.006*"cells" + 0.006*"2" + 0.005*"cell" + 0.005*"clinical"
2018-04-22 12:29:08,802 : MainThread : INFO : topic #5 (0.219): 0.010*"0" + 0.006*"1" + 0.005*"between" + 0.004*"health" + 0.004*"species" + 0.003*"data" + 0.003*"than" + 0.003*"95" + 0.003*"ci" + 0.003*"2"
2018-04-22 12:29:08,806 : MainThread : INFO : topic diff=1.057628, rho=0.192450
2018-04-22 12:29:08,882 : MainThread : INFO : PROGRESS: pass 0, at document #56000/66620
2018-04-22 12:29:09,755 : MainThread : INFO : optimized alpha [0.15060996, 0.18085577, 0.16651134, 0.16612901, 0.19394489, 0.23919733]
2018-04-22 12:29:09,761 : MainThread : INFO : merging changes from 2000 documents into a model of 66620 documents
2018-04-22 12:29:09,807 : MainThread : INFO : topic #0 (0.151): 0.014*"cells" + 0.011*"cell" + 0.010*"expression" + 0.006*"induced" + 0.006*"mice" + 0.006*"pro

2018-04-22 12:29:14,500 : MainThread : INFO : merging changes from 2000 documents into a model of 66620 documents
2018-04-22 12:29:14,544 : MainThread : INFO : topic #0 (0.150): 0.017*"cells" + 0.012*"cell" + 0.012*"expression" + 0.007*"mice" + 0.006*"induced" + 0.006*"protein" + 0.005*"1" + 0.004*"levels" + 0.004*"role" + 0.004*"increased"
2018-04-22 12:29:14,546 : MainThread : INFO : topic #3 (0.154): 0.007*"1" + 0.005*"high" + 0.005*"2" + 0.004*"c" + 0.004*"water" + 0.004*"3" + 0.004*"temperature" + 0.004*"0" + 0.004*"5" + 0.004*"surface"
2018-04-22 12:29:14,549 : MainThread : INFO : topic #1 (0.189): 0.006*"can" + 0.005*"model" + 0.005*"based" + 0.005*"method" + 0.004*"time" + 0.004*"between" + 0.004*"two" + 0.004*"it" + 0.004*"system" + 0.003*"different"
2018-04-22 12:29:14,552 : MainThread : INFO : topic #4 (0.227): 0.018*"patients" + 0.018*"0" + 0.011*"p" + 0.011*"1" + 0.007*"2" + 0.006*"treatment" + 0.006*"cancer" + 0.005*"3" + 0.005*"clinical" + 0.005*"disease"
2018-04-22 12:2

2018-04-22 12:29:19,616 : MainThread : INFO : topic #1 (0.201): 0.007*"can" + 0.004*"model" + 0.004*"between" + 0.004*"based" + 0.004*"two" + 0.004*"such" + 0.004*"system" + 0.003*"time" + 0.003*"has" + 0.003*"it"
2018-04-22 12:29:19,619 : MainThread : INFO : topic #5 (0.237): 0.008*"0" + 0.006*"1" + 0.005*"between" + 0.005*"health" + 0.004*"data" + 0.004*"years" + 0.004*"95" + 0.004*"than" + 0.004*"ci" + 0.004*"more"
2018-04-22 12:29:19,622 : MainThread : INFO : topic diff=0.383151, rho=0.168287
2018-04-22 12:29:19,694 : MainThread : INFO : PROGRESS: pass 1, at document #10000/66620
2018-04-22 12:29:20,637 : MainThread : INFO : optimized alpha [0.17156737, 0.20658158, 0.19245568, 0.13529198, 0.15892954, 0.22317299]
2018-04-22 12:29:20,643 : MainThread : INFO : merging changes from 2000 documents into a model of 66620 documents
2018-04-22 12:29:20,691 : MainThread : INFO : topic #3 (0.135): 0.006*"1" + 0.005*"high" + 0.005*"2" + 0.005*"water" + 0.004*"energy" + 0.004*"surface" + 0.004*

2018-04-22 12:29:25,420 : MainThread : INFO : optimized alpha [0.17526905, 0.2204814, 0.18723164, 0.13962667, 0.12449836, 0.16176216]
2018-04-22 12:29:25,425 : MainThread : INFO : merging changes from 2000 documents into a model of 66620 documents
2018-04-22 12:29:25,468 : MainThread : INFO : topic #4 (0.124): 0.015*"patients" + 0.012*"0" + 0.011*"cancer" + 0.010*"1" + 0.008*"p" + 0.007*"2" + 0.006*"treatment" + 0.006*"disease" + 0.005*"clinical" + 0.005*"tumor"
2018-04-22 12:29:25,470 : MainThread : INFO : topic #3 (0.140): 0.006*"high" + 0.005*"1" + 0.005*"materials" + 0.005*"energy" + 0.005*"surface" + 0.004*"2" + 0.004*"water" + 0.004*"carbon" + 0.004*"temperature" + 0.004*"can"
2018-04-22 12:29:25,472 : MainThread : INFO : topic #0 (0.175): 0.018*"cells" + 0.016*"cell" + 0.007*"expression" + 0.006*"mice" + 0.005*"show" + 0.005*"protein" + 0.005*"signaling" + 0.004*"human" + 0.004*"activation" + 0.004*"1"
2018-04-22 12:29:25,475 : MainThread : INFO : topic #2 (0.187): 0.008*"protei

2018-04-22 12:29:29,795 : MainThread : INFO : topic #5 (0.125): 0.005*"between" + 0.005*"0" + 0.005*"species" + 0.004*"1" + 0.004*"data" + 0.004*"than" + 0.004*"more" + 0.003*"health" + 0.003*"during" + 0.003*"their"
2018-04-22 12:29:29,798 : MainThread : INFO : topic #1 (0.180): 0.007*"can" + 0.004*"quantum" + 0.004*"between" + 0.004*"show" + 0.004*"such" + 0.004*"model" + 0.004*"single" + 0.003*"two" + 0.003*"based" + 0.003*"their"
2018-04-22 12:29:29,800 : MainThread : INFO : topic #0 (0.205): 0.016*"cells" + 0.013*"cell" + 0.008*"expression" + 0.006*"protein" + 0.006*"mice" + 0.005*"1" + 0.005*"signaling" + 0.005*"activation" + 0.005*"induced" + 0.004*"receptor"
2018-04-22 12:29:29,803 : MainThread : INFO : topic #2 (0.212): 0.009*"protein" + 0.009*"binding" + 0.006*"proteins" + 0.005*"domain" + 0.004*"molecular" + 0.004*"its" + 0.004*"complex" + 0.004*"membrane" + 0.004*"2" + 0.004*"dna"
2018-04-22 12:29:29,806 : MainThread : INFO : topic diff=0.289634, rho=0.168287
2018-04-22 12:

2018-04-22 12:29:34,119 : MainThread : INFO : topic #1 (0.190): 0.007*"can" + 0.004*"based" + 0.004*"between" + 0.004*"model" + 0.004*"system" + 0.004*"two" + 0.003*"time" + 0.003*"such" + 0.003*"method" + 0.003*"show"
2018-04-22 12:29:34,122 : MainThread : INFO : topic diff=0.271510, rho=0.168287
2018-04-22 12:29:34,185 : MainThread : INFO : PROGRESS: pass 1, at document #42000/66620
2018-04-22 12:29:34,924 : MainThread : INFO : optimized alpha [0.17690453, 0.19102556, 0.18390936, 0.15312253, 0.14366063, 0.15196308]
2018-04-22 12:29:34,929 : MainThread : INFO : merging changes from 2000 documents into a model of 66620 documents
2018-04-22 12:29:34,970 : MainThread : INFO : topic #4 (0.144): 0.013*"patients" + 0.013*"0" + 0.011*"1" + 0.008*"p" + 0.008*"cancer" + 0.007*"2" + 0.006*"treatment" + 0.005*"3" + 0.005*"disease" + 0.005*"associated"
2018-04-22 12:29:34,973 : MainThread : INFO : topic #5 (0.152): 0.005*"between" + 0.005*"species" + 0.005*"0" + 0.004*"1" + 0.004*"than" + 0.004*"

2018-04-22 12:29:39,351 : MainThread : INFO : topic #3 (0.155): 0.006*"1" + 0.006*"high" + 0.005*"2" + 0.005*"surface" + 0.004*"temperature" + 0.004*"materials" + 0.004*"c" + 0.004*"energy" + 0.004*"3" + 0.004*"water"
2018-04-22 12:29:39,353 : MainThread : INFO : topic #4 (0.163): 0.017*"0" + 0.015*"patients" + 0.012*"1" + 0.010*"p" + 0.007*"2" + 0.007*"cancer" + 0.006*"treatment" + 0.005*"3" + 0.005*"associated" + 0.005*"clinical"
2018-04-22 12:29:39,356 : MainThread : INFO : topic #2 (0.176): 0.008*"protein" + 0.007*"genes" + 0.006*"gene" + 0.006*"proteins" + 0.005*"binding" + 0.004*"molecular" + 0.004*"two" + 0.004*"dna" + 0.004*"its" + 0.003*"identified"
2018-04-22 12:29:39,358 : MainThread : INFO : topic #5 (0.177): 0.005*"between" + 0.005*"species" + 0.005*"0" + 0.004*"1" + 0.004*"health" + 0.004*"than" + 0.003*"data" + 0.003*"during" + 0.003*"more" + 0.003*"their"
2018-04-22 12:29:39,361 : MainThread : INFO : topic #1 (0.194): 0.007*"can" + 0.004*"based" + 0.004*"model" + 0.004*

2018-04-22 12:29:43,940 : MainThread : INFO : topic #4 (0.204): 0.023*"0" + 0.017*"patients" + 0.014*"1" + 0.012*"p" + 0.008*"2" + 0.006*"3" + 0.005*"treatment" + 0.005*"5" + 0.005*"cancer" + 0.005*"clinical"
2018-04-22 12:29:43,943 : MainThread : INFO : topic #5 (0.248): 0.006*"health" + 0.005*"between" + 0.004*"0" + 0.004*"species" + 0.004*"data" + 0.004*"1" + 0.003*"more" + 0.003*"their" + 0.003*"than" + 0.003*"years"
2018-04-22 12:29:43,946 : MainThread : INFO : topic diff=0.265975, rho=0.168287
2018-04-22 12:29:44,011 : MainThread : INFO : PROGRESS: pass 1, at document #64000/66620
2018-04-22 12:29:44,785 : MainThread : INFO : optimized alpha [0.15514517, 0.1978649, 0.17166328, 0.14362288, 0.21170402, 0.26015949]
2018-04-22 12:29:44,790 : MainThread : INFO : merging changes from 2000 documents into a model of 66620 documents
2018-04-22 12:29:44,830 : MainThread : INFO : topic #3 (0.144): 0.006*"1" + 0.006*"high" + 0.005*"2" + 0.004*"water" + 0.004*"temperature" + 0.004*"surface" +

2018-04-22 12:29:48,599 : MainThread : INFO : merging changes from 2000 documents into a model of 66620 documents
2018-04-22 12:29:48,640 : MainThread : INFO : topic #3 (0.135): 0.006*"high" + 0.006*"1" + 0.005*"2" + 0.004*"water" + 0.004*"temperature" + 0.004*"energy" + 0.004*"carbon" + 0.004*"surface" + 0.004*"materials" + 0.004*"c"
2018-04-22 12:29:48,642 : MainThread : INFO : topic #0 (0.165): 0.021*"cells" + 0.016*"cell" + 0.010*"expression" + 0.007*"mice" + 0.005*"human" + 0.005*"induced" + 0.004*"protein" + 0.004*"activation" + 0.004*"cancer" + 0.004*"t"
2018-04-22 12:29:48,644 : MainThread : INFO : topic #4 (0.186): 0.024*"0" + 0.018*"patients" + 0.014*"1" + 0.012*"p" + 0.009*"2" + 0.007*"3" + 0.007*"group" + 0.006*"95" + 0.005*"5" + 0.005*"cancer"
2018-04-22 12:29:48,646 : MainThread : INFO : topic #1 (0.208): 0.006*"can" + 0.005*"model" + 0.004*"between" + 0.004*"based" + 0.004*"two" + 0.004*"time" + 0.004*"system" + 0.003*"such" + 0.003*"it" + 0.003*"method"
2018-04-22 12:29

2018-04-22 12:29:52,949 : MainThread : INFO : topic #2 (0.209): 0.008*"protein" + 0.006*"binding" + 0.006*"proteins" + 0.006*"dna" + 0.005*"genes" + 0.005*"gene" + 0.005*"molecular" + 0.004*"genome" + 0.004*"complex" + 0.004*"rna"
2018-04-22 12:29:52,951 : MainThread : INFO : topic #1 (0.221): 0.007*"can" + 0.004*"model" + 0.004*"between" + 0.004*"show" + 0.004*"such" + 0.004*"based" + 0.003*"two" + 0.003*"time" + 0.003*"brain" + 0.003*"system"
2018-04-22 12:29:52,954 : MainThread : INFO : topic diff=0.198295, rho=0.165954
2018-04-22 12:29:53,015 : MainThread : INFO : PROGRESS: pass 2, at document #18000/66620
2018-04-22 12:29:53,729 : MainThread : INFO : optimized alpha [0.18635894, 0.22376497, 0.20430067, 0.1330592, 0.13087107, 0.16605017]
2018-04-22 12:29:53,734 : MainThread : INFO : merging changes from 2000 documents into a model of 66620 documents
2018-04-22 12:29:53,774 : MainThread : INFO : topic #4 (0.131): 0.019*"0" + 0.016*"patients" + 0.013*"1" + 0.010*"p" + 0.008*"2" + 0.0

2018-04-22 12:29:57,218 : MainThread : INFO : PROGRESS: pass 2, at document #28000/66620
2018-04-22 12:29:58,202 : MainThread : INFO : optimized alpha [0.20923954, 0.19225766, 0.21956354, 0.13887903, 0.12009884, 0.12407732]
2018-04-22 12:29:58,212 : MainThread : INFO : merging changes from 2000 documents into a model of 66620 documents
2018-04-22 12:29:58,277 : MainThread : INFO : topic #4 (0.120): 0.014*"1" + 0.013*"patients" + 0.013*"0" + 0.009*"2" + 0.008*"p" + 0.008*"cancer" + 0.006*"3" + 0.006*"treatment" + 0.006*"disease" + 0.006*"clinical"
2018-04-22 12:29:58,280 : MainThread : INFO : topic #5 (0.124): 0.005*"species" + 0.004*"between" + 0.004*"data" + 0.004*"more" + 0.004*"than" + 0.004*"health" + 0.003*"their" + 0.003*"population" + 0.003*"climate" + 0.003*"during"
2018-04-22 12:29:58,288 : MainThread : INFO : topic #1 (0.192): 0.007*"can" + 0.004*"quantum" + 0.004*"between" + 0.004*"show" + 0.004*"such" + 0.004*"model" + 0.004*"single" + 0.003*"two" + 0.003*"based" + 0.003*"t

2018-04-22 12:30:03,128 : MainThread : INFO : topic #5 (0.141): 0.005*"species" + 0.005*"between" + 0.004*"data" + 0.004*"than" + 0.004*"more" + 0.003*"their" + 0.003*"during" + 0.003*"population" + 0.003*"health" + 0.003*"but"
2018-04-22 12:30:03,130 : MainThread : INFO : topic #1 (0.195): 0.007*"can" + 0.004*"between" + 0.004*"model" + 0.004*"based" + 0.004*"system" + 0.004*"two" + 0.003*"time" + 0.003*"such" + 0.003*"method" + 0.003*"show"
2018-04-22 12:30:03,132 : MainThread : INFO : topic #0 (0.196): 0.018*"cells" + 0.014*"cell" + 0.010*"expression" + 0.006*"mice" + 0.006*"induced" + 0.005*"protein" + 0.005*"1" + 0.005*"activation" + 0.004*"signaling" + 0.004*"receptor"
2018-04-22 12:30:03,135 : MainThread : INFO : topic #2 (0.202): 0.009*"protein" + 0.007*"binding" + 0.006*"proteins" + 0.005*"genes" + 0.004*"dna" + 0.004*"molecular" + 0.004*"gene" + 0.004*"domain" + 0.004*"its" + 0.004*"two"
2018-04-22 12:30:03,138 : MainThread : INFO : topic diff=0.206224, rho=0.165954
2018-04-2

2018-04-22 12:30:07,518 : MainThread : INFO : topic #1 (0.203): 0.007*"can" + 0.004*"based" + 0.004*"model" + 0.004*"between" + 0.004*"two" + 0.004*"system" + 0.004*"time" + 0.004*"method" + 0.003*"imaging" + 0.003*"brain"
2018-04-22 12:30:07,520 : MainThread : INFO : topic diff=0.184479, rho=0.165954
2018-04-22 12:30:07,582 : MainThread : INFO : PROGRESS: pass 2, at document #50000/66620
2018-04-22 12:30:08,352 : MainThread : INFO : optimized alpha [0.18702793, 0.20260404, 0.18885751, 0.16247961, 0.16046739, 0.15835103]
2018-04-22 12:30:08,357 : MainThread : INFO : merging changes from 2000 documents into a model of 66620 documents
2018-04-22 12:30:08,398 : MainThread : INFO : topic #5 (0.158): 0.006*"species" + 0.005*"between" + 0.004*"during" + 0.004*"than" + 0.003*"more" + 0.003*"data" + 0.003*"their" + 0.003*"but" + 0.003*"may" + 0.003*"population"
2018-04-22 12:30:08,399 : MainThread : INFO : topic #4 (0.160): 0.017*"0" + 0.014*"patients" + 0.013*"1" + 0.010*"p" + 0.008*"2" + 0.0

2018-04-22 12:30:13,605 : MainThread : INFO : topic #3 (0.148): 0.006*"1" + 0.006*"high" + 0.005*"2" + 0.005*"surface" + 0.004*"temperature" + 0.004*"energy" + 0.004*"water" + 0.004*"c" + 0.004*"materials" + 0.004*"3"
2018-04-22 12:30:13,608 : MainThread : INFO : topic #0 (0.167): 0.019*"cells" + 0.015*"cell" + 0.013*"expression" + 0.008*"mice" + 0.007*"induced" + 0.005*"protein" + 0.004*"increased" + 0.004*"human" + 0.004*"levels" + 0.004*"role"
2018-04-22 12:30:13,613 : MainThread : INFO : topic #1 (0.204): 0.007*"can" + 0.005*"model" + 0.005*"based" + 0.004*"method" + 0.004*"between" + 0.004*"time" + 0.004*"two" + 0.004*"system" + 0.004*"it" + 0.004*"data"
2018-04-22 12:30:13,616 : MainThread : INFO : topic #4 (0.208): 0.025*"0" + 0.015*"patients" + 0.015*"1" + 0.012*"p" + 0.009*"2" + 0.007*"3" + 0.006*"95" + 0.006*"5" + 0.005*"4" + 0.005*"ci"
2018-04-22 12:30:13,623 : MainThread : INFO : topic #5 (0.227): 0.006*"health" + 0.005*"between" + 0.005*"species" + 0.004*"data" + 0.004*"th

2018-04-22 12:30:18,707 : MainThread : INFO : topic #4 (0.219): 0.025*"0" + 0.016*"patients" + 0.015*"1" + 0.012*"p" + 0.009*"2" + 0.007*"3" + 0.006*"95" + 0.006*"ci" + 0.006*"5" + 0.005*"risk"
2018-04-22 12:30:18,710 : MainThread : INFO : topic #5 (0.253): 0.006*"health" + 0.005*"between" + 0.005*"species" + 0.004*"data" + 0.004*"more" + 0.004*"their" + 0.003*"population" + 0.003*"than" + 0.003*"during" + 0.003*"among"
2018-04-22 12:30:18,712 : MainThread : INFO : topic diff=0.174888, rho=0.163715
2018-04-22 12:30:18,773 : MainThread : INFO : PROGRESS: pass 3, at document #4000/66620
2018-04-22 12:30:19,709 : MainThread : INFO : optimized alpha [0.1749206, 0.21803755, 0.19310768, 0.14007263, 0.20387559, 0.24210045]
2018-04-22 12:30:19,714 : MainThread : INFO : merging changes from 2000 documents into a model of 66620 documents
2018-04-22 12:30:19,756 : MainThread : INFO : topic #3 (0.140): 0.006*"1" + 0.006*"high" + 0.005*"2" + 0.005*"energy" + 0.004*"temperature" + 0.004*"water" + 0.

2018-04-22 12:30:24,360 : MainThread : INFO : merging changes from 2000 documents into a model of 66620 documents
2018-04-22 12:30:24,401 : MainThread : INFO : topic #3 (0.133): 0.006*"high" + 0.005*"1" + 0.005*"energy" + 0.005*"water" + 0.005*"2" + 0.005*"surface" + 0.004*"temperature" + 0.004*"carbon" + 0.004*"materials" + 0.004*"can"
2018-04-22 12:30:24,403 : MainThread : INFO : topic #4 (0.150): 0.022*"0" + 0.017*"patients" + 0.015*"1" + 0.011*"p" + 0.009*"2" + 0.007*"3" + 0.006*"group" + 0.006*"95" + 0.006*"5" + 0.005*"4"
2018-04-22 12:30:24,405 : MainThread : INFO : topic #0 (0.193): 0.019*"cells" + 0.016*"cell" + 0.008*"expression" + 0.007*"mice" + 0.005*"cancer" + 0.005*"signaling" + 0.005*"induced" + 0.005*"human" + 0.004*"protein" + 0.004*"activation"
2018-04-22 12:30:24,408 : MainThread : INFO : topic #2 (0.217): 0.008*"protein" + 0.006*"dna" + 0.006*"binding" + 0.006*"proteins" + 0.006*"genes" + 0.005*"gene" + 0.005*"molecular" + 0.004*"genome" + 0.004*"rna" + 0.004*"comple

2018-04-22 12:30:28,585 : MainThread : INFO : topic #2 (0.198): 0.009*"protein" + 0.007*"dna" + 0.006*"binding" + 0.006*"proteins" + 0.005*"molecular" + 0.005*"gene" + 0.005*"genes" + 0.005*"genome" + 0.004*"complex" + 0.004*"rna"
2018-04-22 12:30:28,587 : MainThread : INFO : topic #1 (0.232): 0.007*"can" + 0.005*"quantum" + 0.004*"between" + 0.004*"show" + 0.004*"model" + 0.004*"such" + 0.004*"single" + 0.004*"two" + 0.003*"based" + 0.003*"systems"
2018-04-22 12:30:28,590 : MainThread : INFO : topic diff=0.127611, rho=0.163715
2018-04-22 12:30:28,651 : MainThread : INFO : PROGRESS: pass 3, at document #26000/66620
2018-04-22 12:30:29,392 : MainThread : INFO : optimized alpha [0.20142414, 0.21261694, 0.21139199, 0.14443746, 0.12290002, 0.13485058]
2018-04-22 12:30:29,397 : MainThread : INFO : merging changes from 2000 documents into a model of 66620 documents
2018-04-22 12:30:29,437 : MainThread : INFO : topic #4 (0.123): 0.016*"0" + 0.015*"patients" + 0.015*"1" + 0.010*"p" + 0.009*"2"

2018-04-22 12:30:32,870 : MainThread : INFO : topic diff=0.174185, rho=0.163715
2018-04-22 12:30:32,935 : MainThread : INFO : PROGRESS: pass 3, at document #36000/66620
2018-04-22 12:30:33,706 : MainThread : INFO : optimized alpha [0.20977175, 0.1956991, 0.21665369, 0.14980951, 0.14172661, 0.13949524]
2018-04-22 12:30:33,712 : MainThread : INFO : merging changes from 2000 documents into a model of 66620 documents
2018-04-22 12:30:33,754 : MainThread : INFO : topic #5 (0.139): 0.006*"species" + 0.004*"between" + 0.004*"data" + 0.004*"more" + 0.004*"their" + 0.004*"than" + 0.003*"during" + 0.003*"population" + 0.003*"but" + 0.003*"health"
2018-04-22 12:30:33,756 : MainThread : INFO : topic #4 (0.142): 0.018*"0" + 0.015*"1" + 0.013*"patients" + 0.010*"p" + 0.009*"2" + 0.007*"3" + 0.005*"treatment" + 0.005*"clinical" + 0.005*"associated" + 0.005*"5"
2018-04-22 12:30:33,759 : MainThread : INFO : topic #1 (0.196): 0.007*"can" + 0.004*"between" + 0.004*"model" + 0.004*"based" + 0.004*"system"

2018-04-22 12:30:38,796 : MainThread : INFO : topic #4 (0.161): 0.019*"0" + 0.014*"1" + 0.014*"patients" + 0.010*"p" + 0.009*"2" + 0.006*"3" + 0.005*"associated" + 0.005*"treatment" + 0.005*"clinical" + 0.005*"5"
2018-04-22 12:30:38,798 : MainThread : INFO : topic #0 (0.198): 0.019*"cells" + 0.014*"cell" + 0.011*"expression" + 0.007*"mice" + 0.007*"induced" + 0.005*"cancer" + 0.004*"1" + 0.004*"protein" + 0.004*"activation" + 0.004*"human"
2018-04-22 12:30:38,800 : MainThread : INFO : topic #2 (0.199): 0.008*"protein" + 0.007*"binding" + 0.006*"genes" + 0.006*"proteins" + 0.005*"gene" + 0.005*"dna" + 0.005*"molecular" + 0.004*"its" + 0.004*"two" + 0.003*"c"
2018-04-22 12:30:38,803 : MainThread : INFO : topic #1 (0.205): 0.007*"can" + 0.004*"based" + 0.004*"model" + 0.004*"between" + 0.004*"system" + 0.004*"two" + 0.004*"time" + 0.004*"method" + 0.003*"imaging" + 0.003*"brain"
2018-04-22 12:30:38,805 : MainThread : INFO : topic diff=0.139517, rho=0.163715
2018-04-22 12:30:38,866 : MainT

2018-04-22 12:30:43,720 : MainThread : INFO : topic #1 (0.206): 0.007*"can" + 0.005*"model" + 0.005*"based" + 0.004*"method" + 0.004*"between" + 0.004*"time" + 0.004*"two" + 0.004*"system" + 0.003*"it" + 0.003*"different"
2018-04-22 12:30:43,723 : MainThread : INFO : topic diff=0.170777, rho=0.163715
2018-04-22 12:30:43,798 : MainThread : INFO : PROGRESS: pass 3, at document #58000/66620
2018-04-22 12:30:44,828 : MainThread : INFO : optimized alpha [0.17667943, 0.20839195, 0.18521307, 0.15266523, 0.20784533, 0.21422079]
2018-04-22 12:30:44,833 : MainThread : INFO : merging changes from 2000 documents into a model of 66620 documents
2018-04-22 12:30:44,877 : MainThread : INFO : topic #3 (0.153): 0.006*"high" + 0.006*"1" + 0.005*"2" + 0.005*"surface" + 0.004*"temperature" + 0.004*"energy" + 0.004*"materials" + 0.004*"water" + 0.004*"c" + 0.003*"properties"
2018-04-22 12:30:44,880 : MainThread : INFO : topic #0 (0.177): 0.019*"cells" + 0.015*"cell" + 0.012*"expression" + 0.008*"mice" + 0.

2018-04-22 12:30:50,000 : MainThread : INFO : topic #3 (0.145): 0.006*"1" + 0.006*"high" + 0.005*"2" + 0.004*"energy" + 0.004*"surface" + 0.004*"temperature" + 0.004*"water" + 0.004*"c" + 0.004*"degrees" + 0.003*"3"
2018-04-22 12:30:50,002 : MainThread : INFO : topic #0 (0.174): 0.020*"cells" + 0.014*"cell" + 0.014*"expression" + 0.008*"mice" + 0.007*"induced" + 0.005*"levels" + 0.005*"increased" + 0.005*"human" + 0.005*"protein" + 0.004*"cancer"
2018-04-22 12:30:50,004 : MainThread : INFO : topic #1 (0.212): 0.006*"can" + 0.006*"model" + 0.005*"based" + 0.005*"method" + 0.004*"between" + 0.004*"time" + 0.004*"two" + 0.004*"data" + 0.004*"used" + 0.004*"system"
2018-04-22 12:30:50,006 : MainThread : INFO : topic #4 (0.258): 0.026*"0" + 0.016*"1" + 0.016*"patients" + 0.012*"p" + 0.009*"2" + 0.007*"3" + 0.007*"95" + 0.006*"5" + 0.006*"ci" + 0.005*"risk"
2018-04-22 12:30:50,009 : MainThread : INFO : topic #5 (0.271): 0.006*"health" + 0.005*"between" + 0.005*"species" + 0.004*"data" + 0.00

2018-04-22 12:30:54,825 : MainThread : INFO : topic #2 (0.214): 0.008*"protein" + 0.007*"dna" + 0.006*"genes" + 0.006*"gene" + 0.006*"binding" + 0.006*"proteins" + 0.005*"molecular" + 0.005*"genome" + 0.005*"rna" + 0.004*"complex"
2018-04-22 12:30:54,827 : MainThread : INFO : topic #1 (0.220): 0.007*"can" + 0.005*"model" + 0.004*"between" + 0.004*"based" + 0.004*"two" + 0.004*"time" + 0.003*"system" + 0.003*"it" + 0.003*"such" + 0.003*"data"
2018-04-22 12:30:54,830 : MainThread : INFO : topic diff=0.152602, rho=0.161564
2018-04-22 12:30:54,897 : MainThread : INFO : PROGRESS: pass 4, at document #12000/66620
2018-04-22 12:30:55,659 : MainThread : INFO : optimized alpha [0.19742762, 0.22164194, 0.21727395, 0.13518915, 0.16446528, 0.19677073]
2018-04-22 12:30:55,664 : MainThread : INFO : merging changes from 2000 documents into a model of 66620 documents
2018-04-22 12:30:55,706 : MainThread : INFO : topic #3 (0.135): 0.006*"high" + 0.005*"1" + 0.005*"energy" + 0.005*"2" + 0.005*"surface" 

2018-04-22 12:30:59,944 : MainThread : INFO : optimized alpha [0.19278307, 0.23193583, 0.20767713, 0.14758195, 0.12989928, 0.15303794]
2018-04-22 12:30:59,950 : MainThread : INFO : merging changes from 2000 documents into a model of 66620 documents
2018-04-22 12:30:59,994 : MainThread : INFO : topic #4 (0.130): 0.020*"0" + 0.016*"patients" + 0.015*"1" + 0.010*"p" + 0.009*"2" + 0.007*"3" + 0.006*"5" + 0.006*"risk" + 0.005*"clinical" + 0.005*"group"
2018-04-22 12:30:59,996 : MainThread : INFO : topic #3 (0.148): 0.006*"high" + 0.006*"materials" + 0.005*"energy" + 0.005*"1" + 0.005*"surface" + 0.004*"2" + 0.004*"can" + 0.004*"temperature" + 0.004*"carbon" + 0.003*"electron"
2018-04-22 12:30:59,998 : MainThread : INFO : topic #0 (0.193): 0.019*"cells" + 0.017*"cell" + 0.008*"expression" + 0.006*"cancer" + 0.006*"mice" + 0.005*"show" + 0.004*"activation" + 0.004*"human" + 0.004*"induced" + 0.004*"t"
2018-04-22 12:31:00,000 : MainThread : INFO : topic #2 (0.208): 0.009*"protein" + 0.007*"dna

2018-04-22 12:31:04,316 : MainThread : INFO : topic #1 (0.193): 0.007*"can" + 0.004*"between" + 0.004*"model" + 0.004*"based" + 0.004*"show" + 0.004*"quantum" + 0.004*"two" + 0.003*"time" + 0.003*"such" + 0.003*"system"
2018-04-22 12:31:04,318 : MainThread : INFO : topic #0 (0.224): 0.017*"cells" + 0.014*"cell" + 0.009*"expression" + 0.006*"mice" + 0.005*"induced" + 0.005*"activation" + 0.005*"signaling" + 0.005*"protein" + 0.005*"1" + 0.005*"cancer"
2018-04-22 12:31:04,321 : MainThread : INFO : topic #2 (0.237): 0.009*"protein" + 0.008*"binding" + 0.006*"proteins" + 0.005*"dna" + 0.005*"domain" + 0.005*"molecular" + 0.004*"complex" + 0.004*"its" + 0.004*"gene" + 0.004*"genes"
2018-04-22 12:31:04,324 : MainThread : INFO : topic diff=0.178486, rho=0.161564
2018-04-22 12:31:04,391 : MainThread : INFO : PROGRESS: pass 4, at document #34000/66620
2018-04-22 12:31:05,125 : MainThread : INFO : optimized alpha [0.21924771, 0.19606087, 0.22915457, 0.15053551, 0.14112653, 0.13698591]
2018-04-22

2018-04-22 12:31:08,841 : MainThread : INFO : topic diff=0.125071, rho=0.161564
2018-04-22 12:31:08,926 : MainThread : INFO : PROGRESS: pass 4, at document #44000/66620
2018-04-22 12:31:10,099 : MainThread : INFO : optimized alpha [0.2042653, 0.20552157, 0.20644666, 0.16369739, 0.1624368, 0.15381818]
2018-04-22 12:31:10,106 : MainThread : INFO : merging changes from 2000 documents into a model of 66620 documents
2018-04-22 12:31:10,151 : MainThread : INFO : topic #5 (0.154): 0.006*"species" + 0.004*"between" + 0.004*"their" + 0.004*"data" + 0.003*"during" + 0.003*"more" + 0.003*"than" + 0.003*"but" + 0.003*"may" + 0.003*"population"
2018-04-22 12:31:10,153 : MainThread : INFO : topic #4 (0.162): 0.019*"0" + 0.014*"1" + 0.014*"patients" + 0.010*"p" + 0.009*"2" + 0.007*"3" + 0.005*"associated" + 0.005*"5" + 0.005*"treatment" + 0.005*"clinical"
2018-04-22 12:31:10,156 : MainThread : INFO : topic #0 (0.204): 0.019*"cells" + 0.015*"cell" + 0.011*"expression" + 0.007*"mice" + 0.007*"induced"

2018-04-22 12:31:14,639 : MainThread : INFO : topic #0 (0.187): 0.019*"cells" + 0.015*"cell" + 0.012*"expression" + 0.007*"mice" + 0.007*"induced" + 0.005*"cancer" + 0.005*"human" + 0.004*"protein" + 0.004*"increased" + 0.004*"1"
2018-04-22 12:31:14,641 : MainThread : INFO : topic #4 (0.192): 0.023*"0" + 0.015*"1" + 0.015*"patients" + 0.011*"p" + 0.009*"2" + 0.007*"3" + 0.005*"associated" + 0.005*"95" + 0.005*"5" + 0.005*"risk"
2018-04-22 12:31:14,644 : MainThread : INFO : topic #2 (0.193): 0.008*"protein" + 0.007*"genes" + 0.007*"gene" + 0.006*"proteins" + 0.005*"binding" + 0.005*"dna" + 0.004*"molecular" + 0.004*"two" + 0.004*"identified" + 0.004*"its"
2018-04-22 12:31:14,646 : MainThread : INFO : topic #1 (0.210): 0.007*"can" + 0.005*"model" + 0.005*"based" + 0.004*"method" + 0.004*"between" + 0.004*"two" + 0.004*"time" + 0.004*"system" + 0.003*"it" + 0.003*"different"
2018-04-22 12:31:14,649 : MainThread : INFO : topic diff=0.172720, rho=0.161564
2018-04-22 12:31:14,713 : MainThrea

2018-04-22 12:31:19,170 : MainThread : INFO : topic #5 (0.251): 0.006*"health" + 0.005*"species" + 0.005*"between" + 0.004*"data" + 0.004*"their" + 0.004*"more" + 0.003*"than" + 0.003*"population" + 0.003*"during" + 0.003*"among"
2018-04-22 12:31:19,173 : MainThread : INFO : topic diff=0.119878, rho=0.161564
2018-04-22 12:31:19,236 : MainThread : INFO : PROGRESS: pass 4, at document #66000/66620
2018-04-22 12:31:20,004 : MainThread : INFO : optimized alpha [0.18051812, 0.21609548, 0.18863943, 0.14953333, 0.25492153, 0.26118666]
2018-04-22 12:31:20,009 : MainThread : INFO : merging changes from 2000 documents into a model of 66620 documents
2018-04-22 12:31:20,052 : MainThread : INFO : topic #3 (0.150): 0.006*"1" + 0.006*"high" + 0.005*"2" + 0.004*"surface" + 0.004*"temperature" + 0.004*"energy" + 0.004*"water" + 0.004*"c" + 0.003*"can" + 0.003*"degrees"
2018-04-22 12:31:20,054 : MainThread : INFO : topic #0 (0.181): 0.019*"cells" + 0.014*"cell" + 0.013*"expression" + 0.008*"mice" + 0.0

Before training the LDA model, you can also refer to [this notebook](lda_training_tips.ipynb). It contains tips and suggestions for pre-processing the text data and for training the LDA model to get good results.

## Doc-Topic distribution

Now we will use `get_document_topics`, which infers the topic distribution of a document. It returns a list of (topic_id, topic_probability) pairs for each document in the input corpus.

In [66]:
# Get document topics
all_topics = model.get_document_topics(corpus, minimum_probability=0)
all_topics[0]

[(0, 0.39545888),
 (1, 0.0017428111),
 (2, 0.59718239),
 (3, 0.0011685896),
 (4, 0.0022194292),
 (5, 0.0022278947)]

The above output shows the topic distribution of the first document in the corpus as a list of (topic_id, topic_probability) pairs.

Now, using the topic distribution of a document as its vector embedding, we will plot all the documents in our corpus using TensorBoard.
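To make the idea of "topic distribution as embedding" concrete, here is a minimal sketch that turns one list of (topic_id, topic_probability) pairs into a dense vector with one dimension per topic. The helper name `topics_to_vector` is hypothetical and not part of gensim; the example values are copied from the output shown above.

```python
def topics_to_vector(doc_topics, num_topics):
    """Turn a list of (topic_id, probability) pairs into a dense list of floats.

    Hypothetical helper for illustration only; topics missing from the
    input keep a weight of 0.0.
    """
    vector = [0.0] * num_topics
    for topic_id, prob in doc_topics:
        vector[topic_id] = float(prob)
    return vector


# Topic distribution of the first document, as printed above.
example = [
    (0, 0.39545888),
    (1, 0.0017428111),
    (2, 0.59718239),
    (3, 0.0011685896),
    (4, 0.0022194292),
    (5, 0.0022278947),
]
vec = topics_to_vector(example, num_topics=6)
```

Because `minimum_probability=0` was passed to `get_document_topics`, every topic should already be present in each list, so the zero-filling only matters if a larger threshold is used.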

## Prepare the Input files for Tensorboard

TensorBoard takes two input files: one containing the embedding vectors and the other containing the relevant metadata. As described above, we will use the topic distribution of each document as its embedding vector. The metadata file will consist of the document titles together with their abstracts.

In [69]:
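# create file for tensors: one row per document, one tab-separated column per topic
with open('doc_lda_tensor.tsv', 'w') as w:
    for doc_topics in all_topics:
        # each entry is a (topic_id, topic_probability) pair; keep only the probability
        w.write("\t".join(str(prob) for _, prob in doc_topics) + "\n")

# create file for metadata: a header row, then the title and abstract of each document
with open('doc_lda_metadata.tsv', 'w') as w:
    w.write('Titles\tAbstracts\n')
    for title, abstract in zip(dataframe.title, dataframe.abstract):
        w.write("%s\t%s\n" % (title, abstract))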

Now you can go to http://projector.tensorflow.org/ and upload these two files by clicking on Load data in the left panel.
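Before uploading, it can be worth checking that the two files line up, since the projector pairs each vector row with one metadata row. The sketch below assumes the metadata file starts with a single header row (as the multi-column file written above does) and that no field contains an embedded newline. The helper names and the demo file names are hypothetical.

```python
import os
import tempfile


def count_rows(path, has_header=False):
    """Count non-empty lines in a TSV file, optionally skipping one header row."""
    with open(path, encoding="utf-8") as f:
        rows = sum(1 for line in f if line.strip())
    return rows - 1 if has_header else rows


def files_are_aligned(tensor_path, metadata_path):
    """True when every embedding row has exactly one metadata row."""
    return count_rows(tensor_path) == count_rows(metadata_path, has_header=True)


# Tiny demo with made-up rows written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tensor_path = os.path.join(tmp, "demo_tensor.tsv")
    metadata_path = os.path.join(tmp, "demo_metadata.tsv")
    with open(tensor_path, "w", encoding="utf-8") as f:
        f.write("0.9\t0.1\n0.2\t0.8\n")
    with open(metadata_path, "w", encoding="utf-8") as f:
        f.write("Titles\tAbstracts\nDoc A\tText A\nDoc B\tText B\n")
    aligned = files_are_aligned(tensor_path, metadata_path)
```

On the real files, the same check would be `files_are_aligned('doc_lda_tensor.tsv', 'doc_lda_metadata.tsv')`. A mismatch usually means the metadata was built from a different set of documents than the corpus.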

For demo purposes I have uploaded the LDA doc-topic embeddings generated from the model trained above [here](https://github.com/parulsethi/LdaProjector/). You can also access the Embedding projector configured with these uploaded embeddings at this [link](http://projector.tensorflow.org/?config=https://raw.githubusercontent.com/parulsethi/LdaProjector/master/doc_lda_config.json).

## Visualize using PCA

The Embedding Projector computes the top 10 principal components. The menu at the left panel lets you project those components onto any combination of two or three.
<img src="doc_lda_pca.png">
From PCA, we get a simplex (a tetrahedron in this case) where each data point represents a document. These data points are colored according to their genres, which were given in the Movie dataset.

As we can see, a lot of points cluster at the corners of the simplex. This is primarily due to the sparsity of the vectors we are using. The documents at the corners belong primarily to a single topic (hence a large weight in a single dimension and approximately zero weight in the others). You can modify the metadata file as explained below to see the dimension weights along with the movie title.
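One rough way to quantify this corner effect is to look at the largest topic weight of each document. The sketch below uses made-up toy distributions, and the helper names and the 0.9 threshold are assumptions chosen for illustration; on real data you would pass `all_topics` instead.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])


def share_near_corners(all_doc_topics, threshold=0.9):
    """Fraction of documents whose dominant topic weight is at least `threshold`."""
    docs = list(all_doc_topics)
    if not docs:
        return 0.0
    hits = sum(1 for doc in docs if dominant_topic(doc)[1] >= threshold)
    return hits / len(docs)


# Toy distributions (not real model output): three near-corner documents
# and one document spread over several topics.
docs = [
    [(0, 0.97), (1, 0.02), (2, 0.01)],
    [(0, 0.05), (1, 0.94), (2, 0.01)],
    [(0, 0.40), (1, 0.35), (2, 0.25)],
    [(0, 0.01), (1, 0.01), (2, 0.98)],
]
share = share_near_corners(docs, threshold=0.9)
```

A high share suggests that most points will sit near the corners of the simplex, while documents like the third one end up along edges or in the interior.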

Now we will append the topics with the highest probability, as (topic_id, topic_probability) pairs, to each document's title, in order to explore which topics the cluster corners or edges predominantly belong to. For this, we just need to overwrite the metadata file as below:

In [70]:
tensors = []
for doc_topics in all_topics:
    # keep only topics whose probability is non-zero after rounding to 3 decimals
    doc_tensor = [(topic_id, float(round(prob, 3)))
                  for topic_id, prob in doc_topics
                  if round(prob, 3) > 0]
    # sort topics according to highest probabilities
    doc_tensor = sorted(doc_tensor, key=lambda x: x[1], reverse=True)
    # store the top 5 topics to add in metadata file
    tensors.append(doc_tensor[:5])

# overwrite metadata file, appending each document's top topics to its title
with open('doc_lda_metadata.tsv', 'w') as w:
    w.write('Titles\tGenres\n')
    for title, label, top_topics in zip(dataframe.title, dataframe.label, tensors):
        w.write("%s\t%s\n" % (str(title) + str(top_topics), label))

Next, we upload the previous tensor file "doc_lda_tensor.tsv" and this new metadata file to http://projector.tensorflow.org/ .
<img src="topic_with_coordinate.png">
Voila! Now we can click on any point to see its top topics with their probabilities in that document, along with the title. As we can see in the above example, "Beverly hill cops" primarily belongs to the 0th and 1st topics, as they have the highest probabilities amongst all.



## Visualize using T-SNE

In t-SNE, the data is visualized by animating through every iteration of the t-SNE algorithm. The t-SNE menu at the left lets you adjust the values of its two hyperparameters. The first is perplexity, which is a measure of information; it may be viewed as a knob that sets the number of effective nearest neighbors<sup>[2]</sup>. The second is the learning rate, which controls the step size of the gradient updates and therefore how quickly the layout changes from one iteration to the next.
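The "effective number of neighbors" reading of perplexity follows from its definition in t-SNE as 2 raised to the Shannon entropy (in bits) of a point's neighbor distribution. The sketch below illustrates only that definition on hand-written distributions; it is not how the Embedding Projector computes anything internally, and the function name is an assumption.

```python
import math


def perplexity(probabilities):
    """Perplexity 2**H of a discrete distribution, with H the entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2 ** entropy


uniform_over_four = perplexity([0.25, 0.25, 0.25, 0.25])  # four equally likely neighbors
single_neighbor = perplexity([1.0])                       # all mass on one neighbor
skewed = perplexity([0.7, 0.1, 0.1, 0.1])                 # fewer than four "effective" neighbors
```

A uniform distribution over k neighbors has perplexity k, so a larger perplexity setting makes each point attend to a wider neighborhood.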

Now, since the topic distribution of a document is used as its embedding vector, t-SNE ends up forming clusters of documents that belong to the same topics. To understand and interpret the theme of those topics, we can use `show_topic()` to explore the terms that each topic consists of.

<img src="doc_lda_tsne.png">

The above plot was generated with perplexity 11, learning rate 10 and 1100 iterations. The results can vary on successive runs, so you may not get exactly the plot above even with the same hyperparameter settings, but some small clusters will start forming as above, with different orientations.

I named some clusters above based on the genres of their movies, and also by using `show_topic()` to see the relevant terms of the topic that was most prevalent in a cluster. Most of the clusters had documents belonging predominantly to a single topic. For example, the cluster with movies belonging primarily to topic 0 could be named Fantasy/Romance based on the terms displayed below for topic 0. You can play with the visualization yourself at this [link](http://projector.tensorflow.org/?config=https://raw.githubusercontent.com/parulsethi/LdaProjector/master/doc_lda_config.json) and try to come up with a label for each cluster based on the movies it contains and their dominant topic. You can see the top 5 topics of every point by hovering over it.

Now, we can notice that there are more than 10 clusters in the above image, whereas we trained our model with `num_topics=10`. This is because a few clusters contain documents that belong to more than one topic with approximately equal topic probabilities.

In [71]:
model.show_topic(topicid=0, topn=15)

[('cells', 0.019363862),
 ('cell', 0.014119518),
 ('expression', 0.01352202),
 ('mice', 0.007646814),
 ('induced', 0.0065615214),
 ('levels', 0.0050926651),
 ('increased', 0.004751916),
 ('human', 0.0045997747),
 ('protein', 0.0044072135),
 ('cancer', 0.0042670672),
 ('il', 0.0038744663),
 ('role', 0.0038231616),
 ('1', 0.0036225135),
 ('activity', 0.003570609),
 ('t', 0.0035296546)]

You can even use pyLDAvis to deduce topics more efficiently. It provides a deeper inspection of the terms most highly associated with each individual topic. For this, it uses a measure called the **relevance** of a term to a topic, which lets users flexibly rank the terms best suited to a meaningful topic interpretation. Its weight parameter, λ, can be adjusted to display terms that help differentiate topics.

In [73]:
import pyLDAvis.gensim

viz = pyLDAvis.gensim.prepare(model, corpus, dictionary)
pyLDAvis.display(viz)


In [84]:
pyLDAvis.save_json(viz,"Viz_InterTopic_distance_map")

The weight parameter λ can be viewed as a knob that adjusts how terms are ranked: purely by their probability within the topic (λ=1), or by that probability normalized by the term's marginal probability across the corpus (λ=0). Setting λ=1 can give a similar ranking of terms for a large number of topics, making them difficult to differentiate. Setting λ=0 ranks terms solely by their exclusiveness to the current topic, which can surface terms so rare that they occur in only a single topic, so the topics may still be difficult to interpret. [Sievert and Shirley (2014)](https://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf) suggested an optimal value of λ=0.6 based on a user study.
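The relevance measure from Sievert and Shirley can be written as λ·log p(w|t) + (1−λ)·log(p(w|t)/p(w)), where p(w|t) is the probability of term w in topic t and p(w) is its marginal probability in the corpus. The sketch below applies that formula to a few made-up numbers to show how the ranking shifts with λ. The function names and all probabilities are hypothetical, and this is not pyLDAvis's own code.

```python
import math


def relevance(p_term_given_topic, p_term, lam=0.6):
    """Relevance of a term to a topic for a given weight `lam` (λ)."""
    lift = p_term_given_topic / p_term
    return lam * math.log(p_term_given_topic) + (1 - lam) * math.log(lift)


def rank_terms(topic_term_probs, marginal_probs, lam=0.6, topn=5):
    """Return the `topn` (term, relevance) pairs, most relevant first."""
    scored = [
        (term, relevance(p, marginal_probs[term], lam))
        for term, p in topic_term_probs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up probabilities: "can" is frequent everywhere, "il" is rare but
# almost exclusive to this topic, "cells" sits in between.
topic_probs = {"cells": 0.02, "can": 0.03, "il": 0.004}
marginals = {"cells": 0.005, "can": 0.03, "il": 0.0005}

by_probability = [term for term, _ in rank_terms(topic_probs, marginals, lam=1.0)]
by_lift = [term for term, _ in rank_terms(topic_probs, marginals, lam=0.0)]
balanced = [term for term, _ in rank_terms(topic_probs, marginals, lam=0.6)]
```

With these toy numbers, λ=1 puts the generic term "can" first, λ=0 puts the rare term "il" first, and λ=0.6 favors "cells", which is both reasonably frequent in the topic and distinctive for it.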

# Conclusion

We learned about visualizing document embeddings and LDA doc-topic distributions through TensorBoard's Embedding Projector. It is a useful tool for visualizing different types of data, for example word embeddings, document embeddings, gene expressions or biological sequences. It just needs 2D tensors as input, and then you can explore your data using the provided algorithms. You can also perform a nearest-neighbours search to find the data points most similar to your query point.

# References
 1. https://grouplens.org/datasets/movielens/
 2. https://lvdmaaten.github.io/tsne/
