Topic Modeling Using Distributed Word Embeddings
================================================
Notebook version of the code at https://github.com/rsrandhawa/Vec2Topic, based on the article "Topic Modeling Using Distributed Word Embeddings" by R. S. Randhawa, P. Jain, and G. Madan.

The basic approach is to first create a language model based on a large (ideally billions of words) text corpus. The technology used, distributed word embeddings, is trained with a shallow neural network that tends to perform best on large datasets: it keeps the computation simple and fast so that it can be fed large amounts of data.

Word embeddings are likewise trained on the user generated content (usually a much smaller corpus) with consistent parameters. For each vocabulary word shared by both models, the two vectors are concatenated to provide a combined model of the user generated content.
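
The concatenation step can be sketched as follows. This is a minimal illustration that assumes each model can be treated as a dictionary from word to vector; the helper name `combine_vectors` and the toy vectors are made up for the example and are not part of the original code.

```python
def combine_vectors(global_vectors, local_vectors):
    """Concatenate the global and local vector of every shared word."""
    shared = set(global_vectors) & set(local_vectors)
    return {w: list(global_vectors[w]) + list(local_vectors[w]) for w in shared}

# Toy vectors: 3 "global" dimensions and 2 "local" dimensions (hypothetical values).
global_vectors = {'phim': [0.1, 0.2, 0.3], 'sao': [0.4, 0.5, 0.6]}
local_vectors = {'phim': [0.7, 0.8], 'xe': [0.9, 1.0]}

combined = combine_vectors(global_vectors, local_vectors)
print(combined)  # only words present in both vocabularies get a combined vector
```

In the notebook itself the same idea is applied with `np.concatenate` to a 300-dimensional knowledge base vector and a 25-dimensional local vector.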

Word vectors that cluster together are interpreted as topics of the user generated content. Some clusters appear better than others because they consist of coherent lists of words; the main goal is to score the importance of each topic.

Hierarchical clustering provides a measure of depth for each word, and a co-occurrence graph (an edge between two words if they appear in the same sentence) provides a degree of co-occurrence. Each word is scored by a (normalized) product of depth and degree. KMeans is used to cluster words into topics, and the scoring function is used to order the words and the topics.
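
A simplified sketch of the scoring idea, assuming depth and degree are already available as dictionaries keyed by word. The function name `score_words` and the numbers are hypothetical, and the log and power rescaling that the notebook applies later is left out here.

```python
def score_words(depth, degree):
    """Score each word by depth * degree, each first scaled to [0, 1]."""
    max_depth = float(max(depth.values()))
    max_degree = float(max(degree.values()))
    raw = {w: (depth[w] / max_depth) * (degree[w] / max_degree) for w in depth}
    top = max(raw.values())
    return {w: s / top for w, s in raw.items()}

depth = {'phim': 4, 'xe': 2, 'sao': 1}     # hypothetical tree depths
degree = {'phim': 10, 'xe': 10, 'sao': 5}  # hypothetical co-occurrence degrees
scores = score_words(depth, degree)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

A word needs both a deep position in the cluster tree and many co-occurring neighbours to score highly; a low value on either axis pulls the product down.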

Required standard packages
--------------------------

In [None]:
import logging, re, os, bz2
from collections import Counter
from operator import itemgetter
import itertools

In [None]:
## Unicode wrapper for reading & writing csv files.
import unicodecsv as csv

## Lighter weight than pandas -- tabular display of tables.
from tabulate import tabulate

## In order to strip out the text from the XML-formatted data.
from bs4 import BeautifulSoup
import lxml

Required data science packages
------------------------------

In [None]:
## First the usual suspects: numpy, scipy, and gensim
import numpy as np
import scipy as sp
import gensim

## For scraping text out of a wikipedia dump. Get dumps at https://dumps.wikimedia.org/backup-index.html
from gensim.corpora import WikiCorpus

## Latest greatest word vectors (see https://pypi.python.org/pypi/fasttext).
import fasttext

## Latest greatest hierarchical clustering package. 
## Word vectors are clustered, with deeper trees indicating core topics.
import fastcluster

## Use scikit-learn to generate the co-occurrence graph (edge if words in same sentence).
## The degree of each word indicates how strongly it co-occurs.
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

## Use scikit-learn for K-Means clustering: identify topics.
from sklearn.cluster import KMeans

import bhtsne

Base data directory and logging.
--------------------------------
The approach currently uses a lot of intermediate files (which is annoying, but means that the project can work on machines with smaller physical memory). The initial data (knowledge base as well as user generated content) and the intermediate files are all kept in the data directory.

In [None]:
data_directory = 'data/'
model_directory = 'models/'

In [None]:
from imp import reload
reload(logging)

LOG_FILENAME = data_directory + 'vec2topic.log'
#logging.basicConfig(filename=LOG_FILENAME,level=logging.INFO)
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()
logger.setLevel(logging.INFO)
formatter = logging.Formatter('%(asctime)s %(message)s',"%b-%d-%Y %H:%M:%S")
logger.handlers[0].setFormatter(formatter)

List of intermediate files.
---------------------------
The (global) knowledge base is built off a (large) dataset.

In [None]:
if True:
    knowledge_base = 'viwiki-20160920-pages-articles.xml.bz2'
else:
    knowledge_base = 'vie_newscrawl_2011_1M.tkn.wseg'
    
knowledge_base_vector_dimension = 300    # Word vector dimensionality for knowledge base.
#knowledge_base_prefix = 'vie_newscrawl_2011_1M'
knowledge_base_prefix = 'viwiki-20160920-pages-articles'

knowledge_base_text = data_directory + knowledge_base_prefix + '.xml.txt.wseg'
knowledge_base_phrases = data_directory + knowledge_base_prefix + '_phrases.txt'

knowledge_base_model = model_directory + knowledge_base_prefix + '.bin'
knowledge_base_vectors = model_directory + knowledge_base_prefix + '.vec'
knowledge_base_vectors_tsne = model_directory + knowledge_base_prefix + '_vec_tsne.txt'
knowledge_base_vocab = model_directory + knowledge_base_prefix + '_vocab.txt'

The (local) user generated content.

In [None]:
local_content_name = 'OpenSubtitles2016_xml_vi_2015_369610_6346303'
local_content_vector_dimension = 25

local_content = data_directory + local_content_name

## Intermediate files produced while processing the input with the external Java package JVnTextPro.
local_content_xml = local_content + '.xml'
local_content_txt = local_content + '.txt'
local_content_txt_sent = local_content + '.txt.sent'
local_content_txt_sent_tkn = local_content + '.txt.sent.tkn'
local_content_txt_sent_tkn_wseg = local_content + '.txt.sent.tkn.wseg'
local_content_txt_sent_tkn_wseg_pos = local_content + '.txt.sent.tkn.wseg.pos'

## Intermediate files resulting from computation of word embeddings using fastText package.
local_content_vectors = model_directory + local_content_name + '.vec'
local_content_model = model_directory + local_content_name + '.bin'

## Projected 2D vectors useful for visualization.
local_content_vectors_tsne = model_directory + local_content_name + '_vec_tsne.txt'

In [None]:
combined_vectors = model_directory + local_content_name + '.combined_vectors.txt'
combined_vectors_tsne = model_directory + local_content_name + '.combined_vectors_tsne.txt'

Global knowledge vectors -- wikipedia & Leipzig Corpora
-------------------------------------------------------
The first step is to compute word embeddings of a global knowledge base (e.g. wikipedia or the Leipzig Corpora) to capture the generic meaning of words in widely used contexts.

The gensim package has examples of processing wikipedia dumps as well as a streaming corpus implementation. The article glosses over these steps, and the sample GitHub code grabs an undocumented data set from the authors' Dropbox account. In the cells below we rely on word2vec:
<pre>
git clone https://github.com/tmikolov/word2vec.git
</pre>
Also, in order to compute the t-SNE embeddings with a C-language program, we use bhtsne:
<pre>
git clone https://github.com/lvdmaaten/bhtsne.git
</pre>
When using the jupyter-gallery docker image, these are usually installed in the /root directory, and that location is hardwired into this notebook.

**TODO:** 
* Parse the wikipedia dump name and use it as the prefix for the other intermediate files.
* Download a wikipedia dump if it doesn't already exist.
* Make things work for other languages (hundreds of wikipedias).
* Check that WikiCorpus does lowercase each word.
* Handle stopwords and substitution lists consistently.
* Stem global and local data sets.
* Check whether textblob offers any value over nltk.

### Process wikipedia dump
First download the wikipedia dump and place it in the data directory before running this notebook. The cell below will use the gensim class WikiCorpus to strip the wikipedia markup and store each article as one line of the output text file. Only do these computations once if possible.

In [None]:
knowledge_base

In [None]:
if not os.path.isfile(knowledge_base_text):
    space = ' '
    i = 0
    output = open(knowledge_base_text, 'w')
    logger.info('Processing knowledge base %s', knowledge_base)
    wiki = WikiCorpus(data_directory + knowledge_base, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")
    output.close()
    logger.info("Finished. Saved " + str(i) + " articles")
else:
    logger.info('Knowledge base %s already on disk.', knowledge_base_text)

### Compute word vectors for knowledge base

In [None]:
!ls {knowledge_base_text}

In [None]:
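if not os.path.isfile(knowledge_base_vectors):
    ## fastText treats the second argument as an output prefix and writes <prefix>.bin and <prefix>.vec,
    ## so pass the path without the '.vec' extension.
    knowledge_base_skipgram = fasttext.skipgram(knowledge_base_text, model_directory + knowledge_base_prefix, lr=0.02, 
        dim=knowledge_base_vector_dimension, ws=5,
        epoch=1, min_count=5, neg=5, loss='ns', bucket=2000000, minn=3, maxn=6,
        thread=8, t=1e-4, lr_update_rate=100)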
else:
    logger.info('Knowledge vectors %s already on disk.', knowledge_base_vectors)
    knowledge_base_skipgram = fasttext.load_model(knowledge_base_model)

Simple test to see if the model created/read ok.

In [None]:
print u'siêu_thị' in knowledge_base_skipgram
print u'supermarket' in knowledge_base_skipgram

Create a counter to keep track of the knowledge base vocabulary. Later the sample code uses this to find the vocabulary in common between the knowledge base and the user generated data. Try to process both data sets in the same way.

In [None]:
knowledge_base_exist = Counter()
for w in knowledge_base_skipgram.words:
    knowledge_base_exist[w.lower()] = w.lower()
knowledge_base_vocab_lowercase = knowledge_base_exist.keys()

In [None]:
logger.info(u'siêu_thị: %s', knowledge_base_exist[u'siêu_thị'])
logger.info('funky: %s', knowledge_base_exist[u'funky'])
logger.info('san_diego: %s', knowledge_base_exist[u'san_diego'])

User content vectors -- OpenSubtitles2016
-----------------------------------------
OpenSubtitles is a very useful project for language analysis since it has a decent collection of parallel sentences: the foreign language captions that enthusiasts have created for their favorite movies.

---
Start with an `input.xml` file listing captions from foreign films, obtained from the OpenSubtitles project. The final segmented text is `input.txt.sent.tkn.wseg` (local_content_txt_sent_tkn_wseg). Part-of-speech tagging is required in order to extract the nouns (which make better labels for topics).

<pre>
BeautifulSoup:                     input.xml -> input.txt 
                           local_content_xml -> local_content_txt   
JVnSenSegmenter:                   input.txt -> input.txt.sent
                           local_content_txt -> local_content_txt_sent   
JVnTokenizer:                 input.txt.sent -> input.txt.sent.tkn
                      local_content_txt_sent -> local_content_txt_sent_tkn         
JVnSegmenter:             input.txt.sent.tkn -> input.txt.sent.tkn.wseg
                  local_content_txt_sent_tkn -> local_content_txt_sent_tkn_wseg  
POSTagging:          input.txt.sent.tkn.wseg -> input.txt.sent.tkn.wseg.pos
             local_content_txt_sent_tkn_wseg -> local_content_txt_sent_tkn_wseg_pos
</pre>

**Extract text from data**

Uses the BeautifulSoup library to find all `'s'` (sentence) tags and extract the text from them.

TODO:
1. Stream text through memory so that larger files can be processed.
2. Allow for a directory of subfiles.
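
One possible way to address the first TODO item is to parse the XML incrementally instead of loading it whole. The sketch below swaps BeautifulSoup for the standard library's `xml.etree.ElementTree.iterparse`; the helper name `iter_sentences` and the sample document are assumptions made for illustration, and real subtitle files may need extra handling.

```python
import io
import xml.etree.ElementTree as ET

def iter_sentences(xml_source):
    """Yield the text of each <s> element without building the whole tree first."""
    for _event, elem in ET.iterparse(xml_source, events=('end',)):
        if elem.tag == 's':
            # Join the text of the element and its children, collapsing whitespace.
            text = ' '.join(''.join(elem.itertext()).split())
            if text:
                yield text
            elem.clear()  # drop the element's children once it has been read

# Tiny in-memory document standing in for a subtitle file (made-up content).
sample = io.BytesIO(b'<document><s id="1">Hello <w>there</w></s><s id="2">Bye</s></document>')
sentences = list(iter_sentences(sample))
print(sentences)
```

Because the function is a generator, each sentence could be written to the output file as soon as it is produced rather than being collected in a list.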

In [None]:
with open(local_content_xml,'r') as fp:
    soup = BeautifulSoup(fp,'lxml')

In [None]:
with open(local_content_txt,'w') as fp:
    for s in soup.findAll('s'): 
        fp.write(s.text.strip().encode('utf-8') + '\n')

**Sentence Segmentation**

In [None]:
cp = '/root/notebooks/stage/JVnTextPro/target/jvn-text-pro-2.0.jar:/root/.m2/repository/args4j/args4j/2.33/args4j-2.33.jar'
model_dir = '/root/notebooks/stage/JVnTextPro/models/jvnsensegmenter/'
java_class = 'jvnsensegmenter.JVnSenSegmenter'
cmd = 'java -cp {} {} -modeldir {} -inputfile {}'.format(cp, java_class, model_dir, local_content_txt)
output = os.popen(cmd)
for line in output: print line.strip()

**Sentence Tokenization**

Note: JVnTokenizer basically separates punctuation from words. It leaves numbers such as 22,216 untouched.

In [None]:
cp = '/root/notebooks/stage/JVnTextPro/target/jvn-text-pro-2.0.jar:/root/.m2/repository/args4j/args4j/2.33/args4j-2.33.jar'
java_class = 'jvntokenizer.JVnTokenizer'
cmd = 'java -cp {} {} -inputfile {}'.format(cp, java_class, local_content_txt_sent)
output = os.popen(cmd)

**Word Segmentation**

In [None]:
cp = '/root/notebooks/stage/JVnTextPro/target/jvn-text-pro-2.0.jar:/root/.m2/repository/args4j/args4j/2.33/args4j-2.33.jar'
model_dir = '/root/notebooks/stage/JVnTextPro/models/jvnsegmenter/'
java_class = 'jvnsegmenter.WordSegmenting'
cmd = 'java -cp {} {} -modeldir {}  -inputfile {}'.format(cp, java_class, model_dir, local_content_txt_sent_tkn)
output = os.popen(cmd)
#for line in output: print line.strip()

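Note: `os.popen` returns without waiting for the Java process to finish, so there is a delay before local_content_txt_sent_tkn_wseg is written to disk. Wait for the file to appear before running the next cell.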
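
A way to avoid guessing when an external tool has finished is to launch it with a call that blocks until the process exits. The sketch below uses `subprocess.call` from the standard library; the helper name `run_and_wait` is hypothetical, and a short Python child process stands in for the Java command.

```python
import subprocess
import sys

def run_and_wait(args):
    """Run a command, block until it exits, and return its exit code."""
    return subprocess.call(args)

# Stand-in command: a short Python child process instead of the Java tool.
exit_code = run_and_wait([sys.executable, '-c', 'pass'])
print(exit_code)
```

For the Java tools, `args` would be the list form of the command string built in the cells above. A non-zero exit code would indicate that the tool failed and that its output file should not be trusted.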

**Part of Speech Tagging**

In [None]:
cp = '/root/notebooks/stage/JVnTextPro/target/jvn-text-pro-2.0.jar:/root/.m2/repository/args4j/args4j/2.33/args4j-2.33.jar'
model_dir = '/root/notebooks/stage/JVnTextPro/models/jvnpostag/maxent/'
java_class = 'jvnpostag.POSTagging'
cmd = 'java -cp {} {} -tagger maxent -modeldir {}  -inputfile {}'.format(cp, java_class, model_dir, 
                                                                         local_content_txt_sent_tkn_wseg)
output = os.popen(cmd)

**List of stop words**

The list below came from `elasticsearch` Vietnamese plugin.

In [None]:
stopwords = ["bị", "bởi", "cả", "các", "cái", "cần", "càng", "chỉ", "chiếc", "cho", "chứ", "chưa", "chuyện",
             "có", "có_thể", "cứ", "của", "cùng", "cũng", "đã", "đang", "đây", "để", "đến nỗi", "đều", "điều",
             "do", "đó", "được", "dưới", "gì", "khi", "không", "là", "lại", "lên", "lúc", "mà", "mỗi", "một_cách",
             "này", "nên", "nếu", "ngay", "nhiều", "như", "nhưng", "những", "nơi", "nữa", "phải", "qua", "ra",
             "rằng", "rằng", "rất", "rất", "rồi", "sau", "sẽ", "so", "sự", "tại", "theo", "thì", "trên", "trước",
             "từ", "từng", "và", "vẫn", "vào", "vậy", "vì", "việc", "với", "vừa"]

**Read Segmented Sentences**

In [None]:
## After processing, each line is a sentence. 
## Read in the lines, skipping empty ones, to build the list of sentences.
text_lines = []
with open(local_content_txt_sent_tkn_wseg,'rb') as local_content_file:
    for line in local_content_file:
        line = line.strip()
        if line != '':
            text_lines.append(line)
for line in text_lines[0:3]:
    logger.info('%s', line)
logger.info("Text lines: %d", len(text_lines))

## Split each sentence on whitespace into a list of tokens.
sentences = [line.split() for line in text_lines]
for sent in sentences[0:3]:
    logger.info('%s', sent)

**Vietnamese Parts of Speech**
1. N: Noun (danh từ)
2. Np: Personal Noun (danh từ riêng)
3. Nc: Classification Noun (danh từ chỉ loại)
4. Nu: Unit Noun (danh từ đơn vị)
5. V: verb (động từ)
6. A: Adjective (tính từ)
7. P: Pronoun (đại từ)
8. L: attribute (định từ)
9. M: Numeral (số từ)
10. R: Adjunct (phụ từ) 
11. E: Preposition (giới từ)
12. C: conjunction (liên từ)
13. I: Interjection (thán từ)
14. T: Particle, modal particle (trợ từ, tiểu từ)
15. B: Words from foreign countries (Từ mượn tiếng nước ngoài ví dụ Internet, ...)
16. Y: abbreviation (từ viết tắt)
17. X: un-known (các từ không phân loại được)
18. Mrk: punctuations (các dấu câu)
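
Given these tags, noun extraction from a tagged line can be sketched as below. The sketch assumes the `word/TAG` layout that the notebook's reader expects; the helper name `nouns_in_line` and the sample sentence (written without diacritics) are made up for illustration.

```python
def nouns_in_line(line, noun_tags=('N', 'Np')):
    """Return the words of a 'word/TAG word/TAG ...' line whose tag is a noun tag."""
    nouns = []
    for field in line.split():
        # Split on the last '/' so that a word containing '/' keeps its text.
        word, sep, tag = field.rpartition('/')
        if sep and tag in noun_tags:
            nouns.append(word)
    return nouns

# Made-up tagged sentence in the word/TAG layout assumed above.
tagged = 'toi/P xem/V phim/N o/E Ha_Noi/Np ./Mrk'
nouns = nouns_in_line(tagged)
print(nouns)
```

Adding `'Nc'` or `'Nu'` to `noun_tags` would also keep classifier and unit nouns.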

In [None]:
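def read_nouns_from_pos_data(path):
    ## Yield, for each tagged sentence, the list of words tagged as nouns.
    with open(path, 'rU') as data:
        reader = csv.reader(data, delimiter=' ')
        for row in reader:
            nouns = []
            for field in row:
                try:
                    ## Fields look like word/POS; split on the last '/' so words containing '/' survive.
                    word,pos = field.rsplit('/', 1)
                    if pos in ['N','Np']: #,'Nc','Nu']:
                        nouns.append(word)
                except Exception as ex:
                    logger.error('%s %s', ex, field)
            yield nouns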

In [None]:
sentences_nouns = []
count = 0
for nouns in read_nouns_from_pos_data(local_content_txt_sent_tkn_wseg_pos): 
    if False and count > 10: 
        break
    count += 1
    sentences_nouns.append(nouns)
    
logger.info('%d', len(sentences_nouns))

Compute word vectors
--------------------

In [None]:
recompute = True    # Set to False to reuse vectors already on disk.
if recompute or not os.path.isfile(local_content_vectors):
    local_content_skipgram = fasttext.skipgram(local_content_txt_sent_tkn_wseg, model_directory + local_content_name, 
        lr=0.02, dim=local_content_vector_dimension, ws=5,
        epoch=1, min_count=0, neg=5, loss='ns', bucket=2000000, minn=3, maxn=6,
        thread=8, t=1e-5, lr_update_rate=100)
else:
    logger.info('Local vectors %s already on disk.', local_content_vectors)
    local_content_skipgram = fasttext.load_model(local_content_model)

In [None]:
logger.info('Creating word vecs')

words=[w for text in sentences_nouns for w in text]
Vocab=set(words)

model_comb={}
model_comb_vocab=[]

common_vocab=set(knowledge_base_vocab_lowercase).intersection(local_content_skipgram.words).intersection(Vocab)

for w in common_vocab:
    if len(w)>2:
        model_comb[w]=np.array(np.concatenate((knowledge_base_skipgram[w],local_content_skipgram[w])))
        model_comb_vocab.append(w)
        
logger.info('Length of common_vocab = %d', len(common_vocab))

In [None]:
print len(set(knowledge_base_skipgram.words))
print len(set(local_content_skipgram.words))

In [None]:
## Open in binary mode (unicodecsv writes bytes) and close the file when done.
with open(combined_vectors,'wb') as fp:
    writer = csv.writer(fp,delimiter='\t')
    for k in model_comb.keys():
        writer.writerow(model_comb[k])

In [None]:
combined_vectors

In [None]:
##Flatten the noun lists and collect the vocabulary of the user content
words=[w for text in sentences_nouns for w in text]
Vocab=set(words)

In [None]:
###Helper Functions
def norm(a):
    return np.sqrt(np.sum(np.square(a)))

def cosine(a,b):
    return 1-np.dot(a,b)/np.sqrt(np.sum(a**2)*np.sum(b**2))

def l1(a,b):
    return abs(a-b).sum()

def l2(a,b):
    return np.sqrt(np.square(a-b).sum())

In [None]:
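### Create the list of words (and their vectors) to be clustered. A word is kept if it occurs more than
### min_count times, is longer than min_length, and its vector has an L2 norm above l2_threshold.
### Vectors are optionally normalized to unit length; with repeat=True a word is emitted once per occurrence.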
def create_word_list(model,vocab,features,Texts,repeat=True,l2_threshold=0,normalized=True,min_count=100,min_length=0):
    data_d2v=[]
    word_d2v=[]
    words_text=[w for text in Texts for w in text]
    count=Counter(words_text)
    if repeat:
        for text in Texts:
            for w in text:
                if w in vocab and count[w]>min_count:
                    if len(w)>min_length and l2(model[w],np.zeros(features))>l2_threshold:
                        if normalized:
                            data_d2v.append(model[w]/l2(model[w],np.zeros(features)))
                        else:
                            data_d2v.append(model[w])
                        word_d2v.append(w)
    else:
        A=set(words_text)
        for w in vocab:
            if w in A and len(w)>min_length and l2(model[w],np.zeros(features))>l2_threshold and count[w]>min_count:
                if normalized:
                    data_d2v.append(model[w]/l2(model[w],np.zeros(features)))
                else:
                    data_d2v.append(model[w])
                word_d2v.append(w)

    return data_d2v, word_d2v

In [None]:
#Run Agglomerative clustering
logger.info('Clustering for depth...')
local_vec = True

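## Combined dimensionality: knowledge base vector plus (optionally) the local vector.
features=knowledge_base_vector_dimension+local_content_vector_dimension*local_vec
data_d2v,word_d2v=create_word_list(model_comb,model_comb_vocab,features,sentences_nouns,repeat=False,normalized=True,min_count=0,l2_threshold=0)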
spcluster=fastcluster.linkage(data_d2v,method='average',metric='cosine')

In [None]:
def calculate_depth(spcluster,words, num_points):
    cluster=[[] for w in xrange(2*num_points)]
    c=Counter()
    for i in xrange(num_points):
        cluster[i]=[i]

    for i in xrange(len(spcluster)):
        x=int(spcluster[i,0])
        y=int(spcluster[i,1])
        xval=[w for w in cluster[x]]
        yval=[w for w in cluster[y]]
        cluster[num_points+i]=xval+yval
        for w in cluster[num_points+i]:
            c[words[w]]+=1
        cluster[x][:]=[]
        cluster[y][:]=[]    
    return c
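
To see what the depth count measures, the same idea can be run on a hand-written linkage matrix. This is a sketch in plain Python that assumes the SciPy linkage layout (row `i` merges two clusters into a new cluster numbered `n + i`); the function name `depth_from_linkage` and the toy matrix are illustrative only.

```python
from collections import Counter

def depth_from_linkage(linkage_rows, labels):
    """Count, for every leaf label, how many merges its cluster goes through."""
    n = len(labels)
    members = {i: [i] for i in range(n)}   # cluster id -> leaf indices
    depth = Counter()
    for step, row in enumerate(linkage_rows):
        left, right = int(row[0]), int(row[1])
        merged = members.pop(left) + members.pop(right)
        members[n + step] = merged          # new clusters are numbered n, n+1, ...
        for leaf in merged:
            depth[labels[leaf]] += 1
    return depth

# Hand-written linkage for four points: rows are [cluster_a, cluster_b, distance, size].
linkage_rows = [[0, 1, 0.1, 2], [2, 4, 0.3, 3], [3, 5, 0.8, 4]]
depth = depth_from_linkage(linkage_rows, ['a', 'b', 'c', 'd'])
print(depth.most_common())
```

Points that merge early ('a' and 'b') take part in every later merge and end up with the largest depth, while the point that joins last ('d') has depth 1.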

In [None]:
##Calculate depth of words
num_points=len(data_d2v)
depth=calculate_depth(spcluster,word_d2v,num_points)

In [None]:
%matplotlib notebook
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram

In [None]:
plt.figure()
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
    spcluster,
    truncate_mode='lastp',
    p=20,
    leaf_rotation=90.,  # rotates the x axis labels
    leaf_font_size=12.,  # font size for the x axis labels
    show_contracted=True,
)
plt.tight_layout()

In [None]:
logger.info('Computing co-occurrence graph')

T=[' '.join(w) for w in sentences_nouns]

In [None]:
for line in T[0:10]: print line

In [None]:
logger.info(len(T))

In [None]:
##Co-occurrence matrix
cv=CountVectorizer(token_pattern=u'(?u)\\b([^\\s]+)')
bow_matrix = cv.fit_transform(T)
id2word={}
for key, value in cv.vocabulary_.items():
    id2word[value]=key

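## Keep only the columns whose word has a combined vector (a set gives fast lookups).
model_comb_vocab_set=set(model_comb_vocab)
ids=[]
for key,value in cv.vocabulary_.items():
    if key in model_comb_vocab_set:
        ids.append(value)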

sort_ids=sorted(ids)
bow_reduced=bow_matrix[:,sort_ids]
normalized = TfidfTransformer().fit_transform(bow_reduced)
similarity_graph_reduced=bow_reduced.T * bow_reduced

In [None]:
##Unweighted co-occurrence graph: edge i,j gets weight 1 when words i and j share at least one sentence
logger.info('Computing degree')
m,n=similarity_graph_reduced.shape

cx=similarity_graph_reduced.tocoo()
keyz=[id2word[sort_ids[w]] for w in xrange(len(sort_ids))]
data=[]
ro=[]
co=[]
for i,j,v in itertools.izip(cx.row, cx.col, cx.data):
    ## Keep one unit-weight edge per co-occurring pair; skip self-loops.
    if v>0 and i!=j:
        ro.append(i)
        co.append(j)
        data.append(1)

SS=sp.sparse.coo_matrix((data, (ro, co)), shape=(m,n))
SP_full=SS.tocsc()
id_word={w:id2word[sort_ids[w]] for w in xrange(len(sort_ids))}
word_id={value:key for key,value in id_word.items()}
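
The degree that results from this unweighted graph is the number of distinct words a word shares a sentence with. A minimal plain-Python sketch of that quantity follows; the function name `cooccurrence_degree` and the toy sentences are assumptions for illustration, and the sparse-matrix code above is what the notebook actually runs.

```python
from itertools import combinations

def cooccurrence_degree(sentences):
    """Map each word to the number of distinct words it shares a sentence with."""
    neighbours = {}
    for sentence in sentences:
        unique = set(sentence)
        for w in unique:
            neighbours.setdefault(w, set())
        for a, b in combinations(unique, 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return {w: len(others) for w, others in neighbours.items()}

# Toy noun lists, one per sentence (made-up data).
sentences = [['phim', 'xe'], ['phim', 'sao'], ['nha']]
degree = cooccurrence_degree(sentences)
print(degree)
```

Repeated co-occurrences of the same pair do not raise the degree, which mirrors setting every edge weight to 1.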

In [None]:
logger.info('Computing metrics')
#compute metrics
degsum=SP_full.sum(axis=1)
deg={}
for x in xrange(len(sort_ids)):
    deg[id2word[sort_ids[x]]]=int(degsum[x])

max_deg=max(deg.values())
max_depth=max(depth.values())

temp_deg_mod={w:np.log(1+deg[w])/np.log(1+max_deg) for w in deg.iterkeys()}
alpha=np.log(0.5)/np.log(np.median(temp_deg_mod.values()))
deg_mod={key:value**alpha for key, value in temp_deg_mod.iteritems()}

temp={key:value*1./max_depth for key, value in depth.iteritems()}
alpha=np.log(0.5)/np.log(np.median(temp.values()))
depth_mod={key:value**alpha for key, value in temp.iteritems()}

temp={key:deg_mod[key]*depth_mod[key] for key in depth_mod.iterkeys()}
max_metric=np.max(temp.values())
metric={key:value*1./max_metric for key,value in temp.iteritems()}
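
The exponent `alpha` used above is chosen so that the median of the normalized values maps to 0.5, which spreads out values that would otherwise bunch near one end of the range. A small sketch of that rescaling, with a hypothetical helper name `median_anchored` and made-up inputs:

```python
import math
from statistics import median

def median_anchored(values):
    """Raise values in (0, 1] to a power chosen so that the median becomes 0.5."""
    alpha = math.log(0.5) / math.log(median(values.values()))
    return {key: value ** alpha for key, value in values.items()}

# Hypothetical normalized degrees; most words sit near the bottom of the range.
normalized = {'phim': 1.0, 'xe': 0.2, 'sao': 0.1}
rescaled = median_anchored(normalized)
print(rescaled)
```

The ordering of the words is unchanged and the maximum stays at 1. The sketch assumes the median lies strictly between 0 and 1; a median of exactly 1 would make the logarithm zero and the division fail.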

In [None]:
logger.info('max_deg = %s, max_depth = %s',max_deg, max_depth)

In [None]:
##Kmeans
NUM_TOPICS = 20
K=NUM_TOPICS
kmeans=KMeans(n_clusters=K)
kmeans.fit([w for w in data_d2v])
kmeans_label={word_d2v[x]:kmeans.labels_[x] for x in xrange(len(word_d2v))}

kmeans_label_ranked={}

topic=[[] for i in xrange(K)]
clust_depth=[[] for i in xrange(K)]
for i in xrange(K):
    topic[i]=[word_d2v[x] for x in xrange(len(word_d2v)) if kmeans.labels_[x]==i]
    temp_score=[metric[w] for w in topic[i]]
    clust_depth[i]=-np.mean(sorted(temp_score,reverse=True)[:])#int(np.sqrt(len(topic[i])))])
index=np.argsort(clust_depth)
for i in xrange(K):
    for w in topic[index[i]]:
        kmeans_label_ranked[w]=i
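
The ranking step orders the clusters by the mean score of their words, best first. A compact sketch of the same ordering, assuming non-empty clusters; the name `rank_topics` and the data are made up for the example.

```python
def rank_topics(topics, metric):
    """Return topic ids ordered from highest to lowest mean word score."""
    def mean_score(topic_id):
        words = topics[topic_id]
        return sum(metric[w] for w in words) / float(len(words))
    return sorted(topics, key=mean_score, reverse=True)

# Made-up clusters and scores.
topics = {0: ['sao', 'nha'], 1: ['phim', 'xe']}
metric = {'phim': 1.0, 'xe': 0.5, 'sao': 0.2, 'nha': 0.1}
order = rank_topics(topics, metric)
print(order)
```

The notebook reaches the same ordering by negating the mean and applying `np.argsort`.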

In [None]:
print np.mean(sorted(temp_score,reverse=True)[:]), index
for w in topic[index[11]]: print w,
print 
for w in [w[0] for w in sorted([[w,metric[w]] for w in topic[index[11]]],key=itemgetter(1),reverse=True)]:
    print w,

In [None]:
logger.info('Done...Generating output')
lister=[]
to_show=K
to_show_words=200 #the maximum number of words of each type to display
for i in xrange(to_show):
    top=topic[index[i]]
    sort_top=[w[0] for w in sorted([[w,metric[w]] for w in top],key=itemgetter(1),reverse=True)]
    lister.append(['Topic %d' %(i+1)]+sort_top[:to_show_words])

max_len=max([len(w) for w in lister])
new_list=[]
for list_el in lister:
    new_list.append(list_el + [''] * (max_len - len(list_el)))
Topics=list(itertools.izip_longest(*new_list))
#X.insert(len(X),[-int(clust_depth[index[w]]*100)*1./100 for w in xrange(K)])
sorted_words=[w[0] for w in sorted(metric.items(),key=itemgetter(1),reverse=True)][:to_show_words]

In [None]:
import pandas as pd

df_tmp = pd.DataFrame(new_list).T
df_new = pd.DataFrame(df_tmp[1:len(new_list)].values,columns=[l[0] for l in new_list])
df_new = pd.DataFrame(df_new['Topic 12'])

In [None]:
score_words = sorted_words
deep_words = [w[0] for w in depth.most_common(to_show_words)]
filer = 'wiki_simple.txt'
outfile_topics = data_directory + filer.split('.')[0] + '_topics.csv'
outfile_score = data_directory + filer.split('.')[0] + '_score.csv'
outfile_depth = data_directory + filer.split('.')[0] + '_depth.csv'
## Use context managers so each file is flushed and closed after writing.
with open(outfile_topics, 'wb') as b:
    csv.writer(b).writerows(Topics)
with open(outfile_score, 'wb') as b:
    csv.writer(b).writerows([[w] for w in score_words])
with open(outfile_depth, 'wb') as b:
    csv.writer(b).writerows([[w] for w in deep_words])

In [None]:
step = 4
## Step through the topics in groups of `step`, without an empty trailing group.
for first in range(1, K + 1, step):
    last = min(first + step - 1, K)
    print 'Total number of Topics = {}. Displaying Topics {} thru {}.'.format(K, first, last)
    print tabulate([row[first-1:last] for row in Topics[:21]], tablefmt=u'psql')

In [None]:
from bokeh.plotting import figure, output_notebook, show, ColumnDataSource
from bokeh.models import HoverTool, BoxZoomTool, WheelZoomTool, ResetTool, PanTool, BoxSelectTool
from bokeh.models import ColorBar, LinearColorMapper, FixedTicker, LabelSet
import bokeh.palettes
from bokeh.models.widgets import Div, DataTable, TableColumn
from bokeh.layouts import gridplot, widgetbox
import pandas as pd

In [None]:
output_notebook()

In [None]:
X_2D = bhtsne.tsne(np.array(data_d2v))

In [None]:
from google.cloud import translate

sorted_words_all = [w[0] for w in sorted(metric.items(),key=itemgetter(1),reverse=True)]

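## Read the API key from the environment rather than hard-coding a secret in the notebook.
## (GOOGLE_TRANSLATE_API_KEY is an assumed variable name.)
api_key = os.environ['GOOGLE_TRANSLATE_API_KEY']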
translate_client = translate.Client(api_key)

sorted_words_all_translated = translate_client.translate(['quả_cầu'], source_language='vi', target_language='en')
for w in zip(sorted_words_all,sorted_words_all_translated)[0:10]: print w

In [None]:
import pickle, time
if not os.path.isfile('models/translations.pkl'):
    logger.info('Translating sorted words: %s', len(sorted_words_all))
    step = 50
    sorted_words_all_translated = []
    ## Translate in batches of `step` words; slices start at index 0 so no word is skipped.
    for first in range(0, len(sorted_words_all), step):
        last = first + step
        logger.info('Translating: %s, %s, %s', first // step, first, last)
        tmp = translate_client.translate(sorted_words_all[first:last], 
                                         source_language='vi', target_language='en')
        sorted_words_all_translated.extend(tmp)
    with open('models/translations.pkl','wb') as fp:
        pickle.dump(sorted_words_all_translated, fp)
else:
    with open('models/translations.pkl','rb') as fp:
        sorted_words_all_translated = pickle.load(fp)
    logger.info('Read %s translations.', len(sorted_words_all_translated))
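
Splitting a word list into fixed-size batches is easy to get wrong by one index, so it can help to isolate it in a small helper. The sketch below is a generic illustration; the name `batches` and the placeholder words are not from the original code.

```python
def batches(items, size):
    """Yield consecutive slices of at most `size` items, starting at index 0."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

words = ['w%d' % i for i in range(7)]   # placeholder words
chunks = list(batches(words, 3))
print(chunks)
```

Every item appears in exactly one batch, and no empty batch is produced when the list length is a multiple of the batch size.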

In [None]:
sorted_words_all_translated

In [None]:
for w in sorted_words_all_translated: pass

In [None]:
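## List the words that have no translation (computed here so the cell does not depend on a later one).
translated_inputs = set(w['input'] for w in sorted_words_all_translated)
for w in word_d2v: 
    if w not in translated_inputs: print w,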

In [None]:
all_translated = dict()
for w in sorted_words_all_translated:
    all_translated[w['input']] = w['translatedText'] 
#[all_translated[w] for w in word_d2v]
trans = []
for w in word_d2v: 
    try:
        trans.append((w,all_translated[w]))
    except Exception as ex:
        trans.append((w,''))
        logger.info('%s %s',ex, w,)

In [None]:
metrics_clean = []
for w in word_d2v:
    if w in metric:
        metrics_clean.append(metric[w])

df = pd.DataFrame(zip(word_d2v, [t[1] for t in trans],X_2D[:,0],X_2D[:,1],[kmeans_label[w] for w in word_d2v],
                      metrics_clean),columns=['word','translation','x','y','topic','metric'])
print df[(df.word==u'khủng_long')]
df.tail(20)

In [None]:
num_top_scoring_words = 500
kmeans_label_color = [kmeans_label_ranked[w]+1 for w in word_d2v]
top_scoring_words = [(w,t[1], d[0],d[1],k,m) for w,t,d,k,m in zip(word_d2v,trans, X_2D,kmeans_label_color,metrics_clean) 
                     if w in score_words]
top_scoring_words_df = pd.DataFrame(top_scoring_words, columns=['word','translation','x','y','topic','metric'])
top_scoring_words_df = top_scoring_words_df[top_scoring_words_df['topic']==12].sort_values('metric',ascending=False)

In [None]:
topics = [0,1,2,3,4]
pd.DataFrame(top_scoring_words, columns=['word','translation','x','y','topic','metric']).sort_values('metric',ascending=False).head()

In [None]:
import bokeh

bokeh.__version__

In [None]:
#kmeans_colors = bokeh.palettes.plasma(50) #bokeh.palettes.brewer['RdYlGn'][10]
kmeans_colors = bokeh.palettes.brewer['RdYlGn'][10] + bokeh.palettes.plasma(50)
colors = [kmeans_colors[kmeans_label_color[x]-1] for x in range(len(word_d2v))]

source = ColumnDataSource(
        data=dict(
            x=df.x,
            y=df.y,
            word=df.word,
            translation=df.translation,
            topic = df.topic,
            metric = df.metric,
        )
    )

top_scores = ColumnDataSource(
    data=dict(
        x=top_scoring_words_df.x,
        y=top_scoring_words_df.y,
        names=list(top_scoring_words_df.word)
    )
)

labels = LabelSet(x='x', y='y', text='names', level='glyph',
              x_offset=5, y_offset=5, source=top_scores, render_mode='canvas', 
              border_line_color='black', border_line_alpha=1.0,
              background_fill_color='white', background_fill_alpha=1.0)

radii = [m for m in metrics_clean]

source_topics = ColumnDataSource(data=df_new)
columns = [
        TableColumn(field="word", title="words"),
        TableColumn(field="translation",title="translation"),
        TableColumn(field="topic", title="topics"),
        TableColumn(field="metric", title="metrics"),
]

#data_columns = [TableColumn(field=df_new.columns[i],title=df_new.columns[i]) for i in range(K)]
#data_table = DataTable(source=source_topics, columns=data_columns, width=400, height=280)

data_columns = [TableColumn(field='Topic 12',title='Topic 12')]
data_table = DataTable(source=source_topics, columns=data_columns, width=400, height=600)
table = widgetbox(data_table)

color_mapper = LinearColorMapper(kmeans_colors,low=0,high=9)
p = figure(plot_width=800, plot_height=800, title="Distributed Word Embeddings: " + str(K) + " Topics", 
           tools=['box_zoom', 'box_select', 'pan','wheel_zoom','reset',
                  'save','lasso_select'])

cr = p.circle('x', 'y', source=source, radius=radii, color=colors,
                #fill_color={'field': 'topic', 'transform': color_mapper}, hover_fill_color="firebrick",
                fill_alpha=0.7, hover_alpha=0.3,
                line_color=None, hover_line_color="white",
                )

hover = HoverTool(
        tooltips=[
            #("index", "$index"),
            #("(x,y)", "($x, $y)"),
            ("word", "@word"),
            ("translation","@translation"),
            ("topic","@topic"),
            ("metric", "@metric"),
        ],
        renderers=[cr],
        mode='mouse',
    )

p.add_tools(hover)
#p.add_layout(labels)
p.xaxis.visible=False
p.yaxis.visible=False
p.grid.visible=False
color_bar = ColorBar(color_mapper=color_mapper, orientation='horizontal',
                     location='bottom_left', scale_alpha=0.7)
                     #ticker=FixedTicker(ticks=[2,6,10,14,18]))
#p.add_layout(color_bar)


topic_legend = "topics"
#p.circle('x', 'y', source=source, size=10, color=colors, legend=topic_legend)

first=1;last=20
html = tabulate([Topics[i][first-1:last] for i in range(0,21)], tablefmt=u'html')
div = widgetbox(Div(text=html,width=500))
#grid = gridplot([[p, table],[div]])
#grid = gridplot([[p],[div]])
grid = gridplot([[p,table]])
show(grid);

In [None]:
from bokeh.embed import components
from bokeh.resources import CDN
# Generate the script and HTML for the plot
script, div = components(grid)

# Return the webpage
html = """
<!doctype html>
<head>
 <title></title>
 {bokeh_css}
</head>
<body>
 {div}
 {bokeh_js}
 {script}
</body>
 """.format(script=script, div=div, bokeh_css=CDN.render_css(),
 bokeh_js=CDN.render_js())

with open('models/sample_output.html','w') as fp:
    fp.write(html)

from IPython.core.display import HTML
#HTML(html)

In [None]:
## Import the module under its own name so the BeautifulSoup class imported earlier is not shadowed.
import bs4

In [None]:
soup = bs4.BeautifulSoup(html,'lxml')

In [None]:
print soup.prettify()

In [None]:
kmeans_label_color = [kmeans_label_ranked[w]+1 for w in word_d2v]
kmeans_colors = bokeh.palettes.plasma(50) #bokeh.palettes.brewer['RdYlGn'][10]
colors = [kmeans_colors[kmeans_label_color[x]-1] for x in range(len(word_d2v))]
x=X_2D[:,0]
y=X_2D[:,1]
source = ColumnDataSource(
        data=dict(
            x=x,
            y=y,
            word=word_d2v,
            topic = kmeans_label_color,
            metric = df['metric'],
        )
    )

radii = [m for m in metrics_clean]

source_topics = ColumnDataSource(data=dict())
columns = [
        TableColumn(field="word", title="words"),
        TableColumn(field="topic", title="topics"),
        TableColumn(field="metric", title="metrics"),
]
data_table = DataTable(source=source_topics, columns=columns, width=400, height=280)
table = widgetbox(data_table)

color_mapper = LinearColorMapper(kmeans_colors,low=0,high=9)
p = figure(plot_width=800, plot_height=500, title="Distributed Word Embeddings: " + str(K) + " Topics", 
           tools=['box_zoom', 'box_select', 'pan','wheel_zoom','reset',
                  'save','lasso_select'])

cr = p.circle('x', 'y', source=source, radius=radii, color=colors,
                #fill_color={'field': 'topic', 'transform': color_mapper}, hover_fill_color="firebrick",
                fill_alpha=0.7, hover_alpha=0.3,
                line_color=None, hover_line_color="white",
                )

hover = HoverTool(
        tooltips=[
            #("index", "$index"),
            #("(x,y)", "($x, $y)"),
            ("word", "@word"),
            ("topic","@topic"),
            ("metric", "@metric"),
        ],
        renderers=[cr],
        mode='mouse',
    )

p.add_tools(hover)
color_bar = ColorBar(color_mapper=color_mapper, orientation='horizontal',
                     location='bottom_left', scale_alpha=0.7)
                     #ticker=FixedTicker(ticks=[2,6,10,14,18]))
#p.add_layout(color_bar)


topic_legend = "topics"
#p.circle('x', 'y', source=source, size=10, color=colors, legend=topic_legend)

first=1;last=10
html = tabulate([Topics[i][first-1:last] for i in range(0,11)], tablefmt=u'html')
div = widgetbox(Div(text=html,width=500))
#grid = gridplot([[p, table],[div]])
grid = gridplot([[p],[div]])
show(grid);