Topic Modeling Using Distributed Word Embeddings
================================================
Notebook version of https://github.com/rsrandhawa/Vec2Topic code, based on the article "Topic Modeling Using Distributed Word Embeddings" by R. S. Randhawa, P. Jain, and G. Madan. 

The basic approach is to first build a language model from a large text corpus (ideally billions of words). The technology used, distributed word embeddings, is trained with a shallow neural network that trades model complexity for speed, so it seems to perform best when given very large datasets.

Word vectors for the user generated content (usually a much smaller corpus) are trained in the same way, with consistent parameters. The two vectors corresponding to the same vocabulary word are then concatenated to provide a model of the user generated content.
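
A minimal sketch of the concatenation step, assuming each model can be treated as a mapping from word to vector. Plain Python lists stand in for the numpy arrays, and the name `combine_embeddings` and the toy vectors are illustrative assumptions rather than part of the original code:

```python
def combine_embeddings(global_vectors, local_vectors):
    """Concatenate the two vectors of every word present in both models.

    Both arguments are assumed to map word -> sequence of floats.
    """
    shared = sorted(set(global_vectors) & set(local_vectors))
    return {w: list(global_vectors[w]) + list(local_vectors[w]) for w in shared}

# Toy vectors (hypothetical values): 3 global dimensions, 2 local dimensions.
global_vectors = {'drone': [0.1, 0.2, 0.3], 'pilot': [0.4, 0.5, 0.6], 'moon': [0.7, 0.8, 0.9]}
local_vectors = {'drone': [1.0, 2.0], 'pilot': [3.0, 4.0], 'sky': [5.0, 6.0]}

combined = combine_embeddings(global_vectors, local_vectors)
print(sorted(combined))         # words found in both models
print(len(combined['drone']))   # combined dimensionality
```

With the dimensions configured later in this notebook (200 for the knowledge base, 25 for the local content), each combined vector would have 225 entries.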

Word vectors that cluster together are interpreted as topics of the user generated content. Some clusters appear better than others because they consist of coherent lists of words, so the main goal is to score the importance of each topic.

Performing a hierarchical clustering provides a measure of depth for each word, and computing a co-occurrence graph (an edge between two words if they belong to the same sentence) provides a degree of co-occurrence. Each word is scored by a (normalized) product of depth and degree. The awesome `hdbscan` is used to cluster words into topics, and the scoring function is used to order the words and the topics. Sample below with `min_cluster_size=5`.

![Topics](topics.png)

![Sentences](sentences.png)
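
The depth-times-degree score can be sketched in a few lines. This is only an illustration: the depths are hand-picked hypothetical values (in the notebook they would come from the hierarchical clustering), and dividing each factor by its maximum is one assumed way to normalize, not necessarily the one used in the article.

```python
from itertools import combinations

def cooccurrence_degree(sentences):
    """Degree of each word in the graph that links words sharing a sentence."""
    neighbors = {}
    for sentence in sentences:
        unique = set(sentence)
        for w in unique:
            neighbors.setdefault(w, set())
        for a, b in combinations(unique, 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return {w: len(n) for w, n in neighbors.items()}

def score_words(depth, degree):
    """Product of depth and degree, each scaled by its maximum (assumed normalization)."""
    max_depth = float(max(depth.values())) or 1.0
    max_degree = float(max(degree.values())) or 1.0
    return {w: (depth[w] / max_depth) * (degree.get(w, 0) / max_degree) for w in depth}

sentences = [['drone', 'pilot', 'sky'], ['pilot', 'war'], ['sky', 'drone']]
degree = cooccurrence_degree(sentences)
depth = {'drone': 3, 'pilot': 4, 'sky': 2, 'war': 1}   # hypothetical tree depths
scores = score_words(depth, degree)

for word in sorted(scores, key=scores.get, reverse=True):
    print('%s %.3f' % (word, scores[word]))
```

Words that sit deep in the cluster tree and also co-occur with many other words rise to the top, which is the ordering used for both words and topics.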

Required standard packages
--------------------------

In [None]:
import logging, re, os, bz2, gzip, subprocess, uuid
from collections import Counter
from operator import itemgetter
import itertools

In [None]:
## Unicode wrapper for reading & writing csv files.
import unicodecsv as csv

## Lighter weight than pandas -- tabular display of tables.
from tabulate import tabulate

## In order to strip out the text from the xml formatted data.
from bs4 import BeautifulSoup
import lxml

Required data science packages
------------------------------

In [None]:
## First the usual suspects: numpy, pandas, scipy, and gensim
import numpy as np
import pandas as pd
import scipy as sp
import gensim

## For scraping text out of a wikipedia dump. Get dumps at https://dumps.wikimedia.org/backup-index.html
from gensim.corpora import WikiCorpus

## Latest greatest word vectors (see https://pypi.python.org/pypi/fasttext).
import fasttext

## Package for segmenting and finding parts-of-speech for Chinese.
import jieba
import jieba.posseg as pseg

## Latest greatest hierarchical clustering package. 
## Word vectors are clustered, with deeper trees indicating core topics.
import hdbscan

## Use scikit-learn to generate co-occurrence graph (edge if words in same sentence).
## The degree of each word indicates how strongly it co-occurs.
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

import bhtsne

Base data directory and logging.
--------------------------------
The approach currently uses a lot of intermediate files (which is annoying, but means that the project can work on machines with smaller physical memory). The initial data (knowledge base as well as user generated content) and the intermediate files are all kept in the data directory.

In [None]:
data_directory = 'data/'
model_directory = 'models/'

In [None]:
from imp import reload
reload(logging)

LOG_FILENAME = data_directory + 'vec2topic.log'
#logging.basicConfig(filename=LOG_FILENAME,level=logging.INFO)
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()
logger.setLevel(logging.INFO)
formatter = logging.Formatter('%(asctime)s %(message)s',"%b-%d-%Y %H:%M:%S")
logger.handlers[0].setFormatter(formatter)

List of intermediate files.
---------------------------
The (global) knowledge base is built off a (large) dataset.

In [None]:
knowledge_base = 'zhwiki-20160920-pages-articles.xml.bz2'
    
knowledge_base_vector_dimension = 200    # Word vector dimensionality for knowledge base.
knowledge_base_prefix = 'zhwiki-20160920-pages-articles'

knowledge_base_text = data_directory + knowledge_base_prefix + '.xml.txt'
knowledge_base_phrases = data_directory + knowledge_base_prefix + '_phrases.txt'

knowledge_base_model = model_directory + knowledge_base_prefix + '.bin'
knowledge_base_vectors = model_directory + knowledge_base_prefix + '.vec'
knowledge_base_vectors_tsne = model_directory + knowledge_base_prefix + '_vec_tsne.txt'
knowledge_base_vocab = model_directory + knowledge_base_prefix + '_vocab.txt'

The (local) user generated content.

In [None]:
local_content_name = 'OpenSubtitles2016_xml_zh_2015_369610_6206526'
local_content_vector_dimension = 25

local_content = data_directory + local_content_name

## Intermediate files associated with processing input with external Java package JVnTextPro.
local_content_xml = local_content + '.xml'
local_content_txt = local_content + '.txt'

## Intermediate files resulting from computation of word embeddings using fastText package.
local_content_vectors = model_directory + local_content_name + '.vec'
local_content_model = model_directory + local_content_name + '.bin'

## Projected 2D vectors useful for visualization.
local_content_vectors_tsne = model_directory + local_content_name + '_vec_tsne.txt'

In [None]:
combined_vectors = model_directory + local_content_name + '.combined_vectors.txt'
combined_vectors_tsne = model_directory + local_content_name + '.combined_vectors_tsne.txt'

Global knowledge vectors -- wikipedia & Leipzig Corpora
-------------------------------------------------------
First step is to compute word embeddings of a global knowledge base (e.g. wikipedia or the Leipzig Corpora) to capture the generic meaning of words in widely used contexts.

The gensim package has examples of processing wikipedia dumps as well as a streaming corpus implementation. The article glosses over these steps, and the sample github code grabs an undocumented data set from the authors' Dropbox account. In the cells below we rely on word2vec:
<pre>
git clone https://github.com/tmikolov/word2vec.git
</pre>
Also, to compute the t-SNE embeddings with a C program, we use bhtsne:
<pre>
git clone https://github.com/lvdmaaten/bhtsne.git
</pre>
When using the jupyter-gallery docker image, these are usually installed in the /root directory, and that path is hardwired into this notebook.

**TODO:** 
* Parse the wikipedia dump name and use it as the prefix for the other intermediate files.
* Download a wikipedia dump if it doesn't already exist.
* Make things work for other languages (hundreds of wikipedias).
* Check that WikiCorpus does lowercase each word.
* Handle stopwords and substitution lists consistently.
* Stem global and local data sets.
* Check to see any value of using textblob over nltk.
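
For the first TODO item, a possible sketch of deriving the prefix (and language code) from the dump file name. The naming scheme in the regular expression is an assumption based on the single dump used in this notebook, and `parse_dump_name` is a hypothetical helper:

```python
import re

# Assumed naming scheme: <lang>wiki-<YYYYMMDD>-<kind>.xml[.bz2|.gz]
DUMP_NAME = re.compile(
    r'^(?P<prefix>(?P<lang>[a-z_]+)wiki-(?P<date>\d{8})-[\w-]+)\.xml(?:\.(?:bz2|gz))?$')

def parse_dump_name(dump_name):
    """Return (prefix, language, date) for a dump file name, or raise ValueError."""
    match = DUMP_NAME.match(dump_name)
    if match is None:
        raise ValueError('Unrecognized dump name: %s' % dump_name)
    return match.group('prefix'), match.group('lang'), match.group('date')

prefix, lang, date = parse_dump_name('zhwiki-20160920-pages-articles.xml.bz2')
print(prefix)
print(lang)
print(date)
```

The returned prefix could then replace the hand-typed `knowledge_base_prefix`, so that switching to another language's dump only requires changing `knowledge_base`.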

### Process wikipedia dump
First download the wikipedia dump and place it in the data directory before running this notebook. The cell below will use the gensim class WikiCorpus to strip the wikipedia markup and store each article as one line of the output text file. Only do these computations once if possible.

<pre>
+----------------------------------------+-----------------+--------+
| File                                   | No. Articles    | Time   |
|----------------------------------------+-----------------+--------|
| zhwiki-20160920-pages-articles.xml.bz2 | 271318 articles | 1hr    |
+----------------------------------------+-----------------+--------+
</pre>

In [None]:
knowledge_base

In [None]:
print tabulate([['File','No. Articles','Time'],['zhwiki-20160920-pages-articles.xml.bz2','271318 articles','1hr']],tablefmt=u'psql',headers='firstrow')

In [None]:
if not os.path.isfile(knowledge_base_text):
    space = ' '
    i = 0
    output = open(knowledge_base_text, 'wb')
    logger.info('Processing knowledge base %s', knowledge_base)
    wiki = WikiCorpus(data_directory + knowledge_base, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        for part in text:
            part = part.decode("utf-8")
            part = ' '.join([w[0] for w in jieba.tokenize(part)])
            output.write(part.encode('utf-8') + "\n")      
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")
    output.close()
    logger.info("Finished Saved " + str(i) + " articles")
else:
    logger.info('Knowledge base %s already on disk.', knowledge_base_text)
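
The skip-if-present check in the cell above treats any existing file as complete, so an interrupted run could leave a truncated file that later runs silently reuse. One possible guard, sketched here with a hypothetical `build_once` helper, is to write to a temporary name and rename only when the work has finished:

```python
import os
import tempfile

def build_once(path, builder):
    """Call builder(tmp_path) only when path is missing; return True if it ran."""
    if os.path.isfile(path):
        return False
    tmp_path = path + '.partial'
    builder(tmp_path)
    os.rename(tmp_path, path)   # publish the file only after the builder finished
    return True

def write_demo(out_path):
    # Stand-in for the expensive wikipedia processing step.
    with open(out_path, 'w') as fp:
        fp.write('one article per line\n')

demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, 'knowledge_base.txt')
first = build_once(demo_file, write_demo)    # does the work
second = build_once(demo_file, write_demo)   # finds the file and skips
print(first)
print(second)
```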

In [None]:
### Compute word vectors for knowledge base

In [None]:
!ls -lh {knowledge_base_vectors}
!ls -lh {knowledge_base_model}

In [None]:
if not os.path.isfile(knowledge_base_model):
    knowledge_base_skipgram = fasttext.skipgram(knowledge_base_text, model_directory + knowledge_base_prefix, 
        lr=0.02, dim=knowledge_base_vector_dimension, ws=5,
        epoch=1, min_count=5, neg=5, loss='ns', bucket=2000000, minn=3, maxn=6,
        thread=8, t=1e-4, lr_update_rate=100)
else:
    logger.info('Knowledge vectors %s already on disk.', knowledge_base_model)
    knowledge_base_skipgram = fasttext.load_model(knowledge_base_model)

Simple test to see if the model created/read ok.

In [None]:
print u'超级市场' in knowledge_base_skipgram
print u'siêu_thị' in knowledge_base_skipgram
print u'supermarket' in knowledge_base_skipgram

Create a counter to keep track of the knowledge base vocabulary. Later the sample code uses this to find the vocabulary in common between the knowledge base and the user generated data. Try to process both data sets in the same way.

In [None]:
knowledge_base_exist = Counter()
for w in knowledge_base_skipgram.words:
    knowledge_base_exist[w.lower()] = w.lower()
knowledge_base_vocab_lowercase = knowledge_base_exist.keys()

In [None]:
logger.info(u'超级市场: %s', knowledge_base_exist[u'超级市场'])
logger.info('funky: %s', knowledge_base_exist[u'funky'])
logger.info('san_diego: %s', knowledge_base_exist[u'san_diego'])

User content vectors -- OpenSubtitles2016
-----------------------------------------
OpenSubtitles is a very useful project for language analysis since it has a decent collection of parallel sentences: the foreign language captions that enthusiasts have created for their favorite movies.

---
Start with an `input.xml` file listing captions from a foreign film, obtained from the OpenSubtitles project. The final segmented text is input.txt.sent.tkn.wseg (local_content_txt_sent_tkn_wseg). Part-of-speech tagging is required in order to pull out the nouns (better labels for topics).

<pre>
BeautifulSoup:                     input.xml -> input.txt 
                           local_content_xml -> local_content_txt   
JVnSenSegmenter:                   input.txt -> input.txt.sent
                           local_content_txt -> local_content_txt_sent   
JVnTokenizer:                 input.txt.sent -> input.txt.sent.tkn
                      local_content_txt_sent -> local_content_txt_sent_tkn         
JVnSegmenter:             input.txt.sent.tkn -> input.txt.sent.tkn.wseg
                  local_content_txt_sent_tkn -> local_content_txt_sent_tkn_wseg  
POSTagging:          input.txt.sent.tkn.wseg -> input.txt.sent.tkn.wseg.pos
             local_content_txt_sent_tkn_wseg -> local_content_txt_sent_tkn_wseg_pos
</pre>

**Extract text from data**

Uses the BeautifulSoup library to find all `'s'` (sentence) tags and strip the text from them.

TODO:
1. Stream text through memory so that larger files can be processed.
2. Allow for a directory of subfiles.
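
For the first TODO item, one option is the standard library's incremental parser instead of building the whole tree with BeautifulSoup. The sketch below assumes the sentences live in lowercase `<s>` elements, as in the cells that follow; `iter_sentences` is a hypothetical name and the sample document is made up.

```python
import io
import xml.etree.ElementTree as ET

def iter_sentences(xml_file):
    """Yield the lowercased text of each <s> element, one at a time."""
    for _event, elem in ET.iterparse(xml_file, events=('end',)):
        if elem.tag == 's':
            # Join text from nested elements (e.g. <w>) and collapse whitespace.
            text = ' '.join(''.join(elem.itertext()).split()).lower()
            elem.clear()   # release the finished element's children
            yield text

# Tiny in-memory stand-in for a subtitle file (hypothetical content).
sample = io.BytesIO(b'<document><s id="1"><w>Hello</w> <w>World</w></s>'
                    b'<s id="2">Second  line</s></document>')
sentences = list(iter_sentences(sample))
print(sentences)
```

A file object returned by `gzip.open(path, 'rb')` can be passed in the same way. The emptied elements stay attached to the root, so this reduces memory growth rather than eliminating it.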

In [None]:
def prepare_data_directories(root_dir):
    for dirpath, dirnames, filenames in os.walk(root_dir):
        logger.info(dirpath)
        prepare_data_directory(dirpath)
        
def text_from_subtitle_xml(sub_dir):
    for fn in os.listdir(sub_dir):
        if fn.endswith('.xml.gz'):
            print fn
            with gzip.open(sub_dir + '/' + fn,'r') as fp:
                soup = BeautifulSoup(fp,'lxml')
            with open(sub_dir + '/' + fn + '.txt','w') as fp:
                for s in soup.findAll('s'): 
                    fp.write(s.text.lower().strip().encode('utf-8') + '\n')
        
def prepare_data_directory(sub_dir):
    ## BeautifulSoup:                     input.xml -> input.txt  
    ## JVnSenSegmenter:                   input.txt -> input.txt.sent  
    ## JVnTokenizer:                 input.txt.sent -> input.txt.sent.tkn        
    ## JVnSegmenter:             input.txt.sent.tkn -> input.txt.sent.tkn.wseg
    ## POSTagging:          input.txt.sent.tkn.wseg -> input.txt.sent.tkn.wseg.pos
    
    #text_from_subtitle_xml(sub_dir)
    
    ### Sentence Segmentation
    #cp = '../JVnTextPro/target/jvn-text-pro-2.0.jar:/root/.m2/repository/args4j/args4j/2.33/args4j-2.33.jar'
    #model_dir = '../JVnTextPro/models/jvnsensegmenter/'
    #java_class = 'jvnsensegmenter.JVnSenSegmenter'
    #cmd = 'java -cp {} {} -modeldir {} -inputdir '.format(cp, java_class, model_dir)
    #cmd_list = cmd.split()
    #cmd_list.append(sub_dir)
    #ret_val = subprocess.call(cmd_list)
    #print ret_val,
    
    ## Sentence Tokenization
    ## Note: JVnTokenizer basically separates punctuation from words. Does not bother, 
    ##       for example, numbers like 22,216.
    cp = '../JVnTextPro/target/jvn-text-pro-2.0.jar:/root/.m2/repository/args4j/args4j/2.33/args4j-2.33.jar'
    java_class = 'jvntokenizer.JVnTokenizer'
    cmd = 'java -cp {} {} -inputdir '.format(cp, java_class)
    cmd_list = cmd.split()
    cmd_list.append(sub_dir)
    ret_val = subprocess.call(cmd_list)
    print ret_val,

    ## Word Segmentation
    cp = '../JVnTextPro/target/jvn-text-pro-2.0.jar:/root/.m2/repository/args4j/args4j/2.33/args4j-2.33.jar'
    model_dir = '../JVnTextPro/models/jvnsegmenter/'
    java_class = 'jvnsegmenter.WordSegmenting'
    cmd = 'java -cp {} {} -modeldir {}  -inputdir '.format(cp, java_class, model_dir)
    cmd_list = cmd.split()
    cmd_list.append(sub_dir)
    ret_val = subprocess.call(cmd_list)
    print ret_val,
    
    ## Part of Speech Tagging
    cp = '../JVnTextPro/target/jvn-text-pro-2.0.jar:/root/.m2/repository/args4j/args4j/2.33/args4j-2.33.jar'
    model_dir = '../JVnTextPro/models/jvnpostag/maxent/'
    java_class = 'jvnpostag.POSTagging'
    cmd = 'java -cp {} {} -tagger maxent -modeldir {}  -inputdir '.format(cp, java_class, model_dir)
    cmd_list = cmd.split()
    cmd_list.append(sub_dir)
    ret_val = subprocess.call(cmd_list)
    print ret_val,

In [None]:
with open('369610.xml','r') as fp:
#with open('1218844.xml','r') as fp:
    soup = BeautifulSoup(fp,'lxml')
dirname_root = 'OpenSubtitles2016/xml/'
for tag in soup.findAll('linkgrp'):
    if '369610' in tag['fromdoc']:
        tag_369610 = tag
        fromdoc = tag['fromdoc']
        todoc = tag['todoc']
        
print fromdoc, todoc

In [None]:
print fromdoc, todoc
soup = BeautifulSoup(str(tag_369610),'lxml')
with gzip.open(dirname_root + fromdoc,'r') as fp:
    soup_from = BeautifulSoup(fp,'lxml')
    from_tags = soup_from.findAll('s')
    
with gzip.open(dirname_root + todoc,'r') as fp:
    soup_to = BeautifulSoup(fp,'lxml')
    to_tags = soup_to.findAll('s')

aligned_data = []
with open(dirname_root + todoc + '.txt','w') as fp:
    for tag in soup.findAll('link'):
        tag_indices = tag['xtargets'].split(';')
        from_sents_ids = [int(n) for n in tag_indices[0].split()]
        to_sents_ids = [int(n) for n in tag_indices[1].split()]
        from_sents = [' '.join(from_tags[i-1].text.split()) for i in from_sents_ids]
        to_sents = [' '.join(to_tags[i-1].text.split()) for i in to_sents_ids]
        to_sents_out = ' '.join(to_sents)
        fp.write(to_sents_out.encode('utf-8') + '\n')
        aligned_data.append([tag['id'], ' '.join(to_sents), ' '.join(from_sents)])

aligned_sentences_df = pd.DataFrame(aligned_data,columns=['id',todoc,fromdoc])
aligned_sentences_df

In [None]:
from cjklib.dictionary import CEDICT
d = CEDICT()

for t in d.getForHeadword(u'生存'):
    print t.HeadwordSimplified, t.Reading, t.Translation

In [None]:
print t.HeadwordSimplified, t.Reading, t.Translation

In [None]:
if False:
    #prepare_data_directories('OpenSubtitles2016/raw/vi/2006/')
    #prepare_data_directories('OpenSubtitles2016/raw/vi/2015/')
    prepare_data_directories('OpenSubtitles2016/raw/vi/2015/369610/')

**List of stop words**

The list below came from the `elasticsearch` Vietnamese plugin.

In [None]:
stopwords = ["bị", "bởi", "cả", "các", "cái", "cần", "càng", "chỉ", "chiếc", "cho", "chứ", "chưa", "chuyện",
             "có", "có_thể", "cứ", "của", "cùng", "cũng", "đã", "đang", "đây", "để", "đến_nỗi", "đều", "điều",
             "do", "đó", "được", "dưới", "gì", "khi", "không", "là", "lại", "lên", "lúc", "mà", "mỗi", "một_cách",
             "này", "nên", "nếu", "ngay", "nhiều", "như", "nhưng", "những", "nơi", "nữa", "phải", "qua", "ra",
             "rằng", "rằng", "rất", "rất", "rồi", "sau", "sẽ", "so", "sự", "tại", "theo", "thì", "trên", "trước",
             "từ", "từng", "và", "vẫn", "vào", "vậy", "vì", "việc", "với", "vừa"]

**Read Segmented Sentences**

The java code `JVnTextPro` processes each raw input file, creating new files for each task (sentence segmentation, word segmentation, and part of speech tagging). Subsequent processing prepares the data for three things: `mallet` for performing LDA topic modeling, word embeddings (with `fasttext`), and a co-occurrence matrix of sentence nouns. For `mallet`, one file is generated with each text on one line and the file identifier at the beginning of each line. An alternate `mallet` input file has only the "nouns" of each sentence. Word vectors are computed from a flat file that has one sentence (of nouns only) per line; the document mapping isn't used in the word vector computation. The vocabulary of each file does need to be tracked for later visualization and analysis. The list of sentence nouns is also used to generate the co-occurrence matrix (any time two nouns appear together in a sentence, an edge is drawn).

TODO: Make this code work as a generator -- only stream the data through memory, don't load it all at once.
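
The noun filtering and the `mallet` line format can be illustrated without running a tagger. In this sketch the (word, flag) pairs are hand-assigned for illustration (in the notebook they come from `jieba.posseg`), and the helper names are hypothetical:

```python
# jieba part-of-speech flags treated as nouns in this notebook.
NOUN_TAGS = frozenset(['n', 'ng', 'nr', 'nrfg', 'nrt', 'ns', 'nz'])

def keep_nouns(tagged_sentence):
    """tagged_sentence is a list of (word, flag) pairs; return the nouns in order."""
    return [word for word, flag in tagged_sentence if flag in NOUN_TAGS]

def mallet_line(doc_id, noun_sentences):
    """One document per line: <doc_id> <tab> <all nouns separated by spaces>."""
    return doc_id + '\t' + ' '.join(w for s in noun_sentences for w in s)

# Hand-tagged example sentences (tags are illustrative, not tagger output).
tagged = [
    [(u'我', 'r'), (u'来到', 'v'), (u'北京', 'ns'), (u'大学', 'n')],
    [(u'电影', 'n'), (u'很', 'd'), (u'好', 'a')],
]
noun_sentences = [keep_nouns(s) for s in tagged]
print(mallet_line(u'doc-1', noun_sentences))
```

The same `noun_sentences` list, written one sentence per line, is the flat file used for the word vectors and the co-occurrence graph.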

In [None]:
import itertools, string
from elasticsearch import Elasticsearch, helpers

INDEX_NAME = '369610-zh-v6'
es = Elasticsearch('elasticsearch', http_auth=('elastic', 'changeme'))

In [None]:
mapping ='''
{
    "mappings":{
        "words":{
            "properties":{
                "word":{"type":"string", "index":"not_analyzed"},
                "translation":{"type":"string", "index":"not_analyzed"},
                "reading":{"type":"string", "index":"not_analyzed"},
                "topic":{"type":"integer"},
                "weight":{"type":"float"},
                "coord":{"type":"geo_point"},
                "doc_id":{"type":"string", "index":"not_analyzed"}
            }
        },
        "document":{
            "properties":{
                "word":{"type":"string", "index":"not_analyzed"},
                "doc_id":{"type":"string", "index":"not_analyzed"}
            }
        }
    }
}'''

In [None]:
es.indices.create(index=INDEX_NAME, body=mapping)

In [None]:
#k = ({'_type':'foo', '_index':'test', 'letters':''.join(letters)} for letters in itertools.permutations(string.letters,2))
#es.indices.create('test')
#helpers.bulk(es,k)

In [None]:
from gensim import corpora
from gensim.corpora.dictionary import Dictionary

In [None]:
jieba.set_dictionary('dict.txt.big')
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list)) 

In [None]:
class text_data():
    def __init__(self, root_dir=None):
        self.root_dir = root_dir
        self.tmp_dir = 'tmp_dir/'
        self.dictionary = Dictionary()
        
        ## TODO: Instead of keeping multiple copies of data in memory, write generators to stream off of disk.
        
        ## Collect metadata on documents.
        self.data_stats = {}
        
        ## One (full) sentence per line, input for skipgram.
        self.text_sentences = []
        
        ## Format for mallet topic modeling. Input all text (actually nouns) all on one line.
        ## <doc_id> <tab> <text_one_line>
        self.mallet_input = []
        
        ## Restrict vocabulary to "nouns".
        self.sentences_nouns = []
        
        ## For each noun, keep a mapping of all documents that it appears in.
        self.document_noun = {}
        self.document_topic = {}
        
        ## Temporary directory to store all the intermediate files.
        if not os.path.isdir(self.tmp_dir):
            os.mkdir(self.tmp_dir)
            
        ## Walk the input directory and process files.
        ## Side effects: statistics on input directory
        ##               temporary directory full of files formatted for topic modeling
        for dirpath, dirnames, filenames in os.walk(self.root_dir):
            for fn in filenames:
                ## Assumption (based on model from external JVnTextPro) was that segmented
                ## files have extension '.wseg'; with jieba doing the segmentation the
                ## plain '.txt' files are read instead. Keep this model for each language?
                if fn.endswith('.txt'):

                    ## Generate a document identifier, real document names can be ugly.
                    key = str(uuid.uuid4())

                    ## Pull out information encoded in directory name path.
                    ## OpenSubtitles2016/raw/vi/2006/761212/3826993.xml.gz
                    parts = dirpath.split('/')
                    root = parts[0]
                    subtitle_language = parts[2]
                    subtitle_year = parts[3]
                    subtitle_id = parts[4]
                    
                    ## Initialize statistics, add to this as each file is processed.
                    self.data_stats[key] = {'_type':'document', '_index':INDEX_NAME, 
                                            'doc_id':key, 'subtitle_language':subtitle_language,
                                            'subtitle_year':subtitle_year, 'subtitle_id':subtitle_id, 
                                            'filename':fn}
                    
                    ## Load and process segmented text.
                    local_content_file = dirpath + '/' + fn
                    self.process_local_content_file(fn=local_content_file, key=key)
                    
                    ## Load and process parts-of-speech.
                    pos_file = local_content_file # + '.pos'
                    self.process_parts_of_speech_file(fn=pos_file, key=key)
                    
        self.dictionary = Dictionary(self.get_texts())
    
    def read_nouns_from_pos_data(self, data):
        reader = csv.reader(data, delimiter=' ')
        for row in reader:
            nouns = []
            blob = pseg.cut(' '.join(row))
            for b in blob:
                tag = b.flag
                word = b.word
                if tag in ['n', 'ng','nr','nrfg','nrt','ns','nz']:
                    nouns.append(word) 
            yield nouns
                
    def process_local_content_file(self, fn=None, key=None):
        ## Keep track of lines (actual sentences because of preprocessing) infile.
        text_lines_file = []
                    
        ## Read lines of document. Actual sentences because of preprocessing step.
        ## Need to look at using csv library -- better performance for streaming?
        ## See https://districtdatalabs.silvrback.com/simple-csv-data-wrangling-with-python
        with open(fn,'rb') as fp:
            for line in fp:
                line = line.strip()
                text_lines_file.append(line)
                
        self.text_sentences.append(text_lines_file)
        text_all_one_line = ' '.join(text_lines_file)
        text_vocab = set([w for text in text_lines_file for w in text.split()])
        
        self.data_stats[key]['text'] = text_all_one_line.decode('utf-8')
        self.data_stats[key]['no_sents'] = len(text_lines_file)
        self.data_stats[key]['no_words'] = len(text_vocab)
        
    def process_parts_of_speech_file(self, fn=None, key=None):
        with open(fn,'rb') as fp:
            ## read_nouns_from_pos_data returns list of nouns in each sentence.
            sentences_nouns = [s for s in self.read_nouns_from_pos_data(fp)]
            with open(self.tmp_dir + key + '.txt','w') as fp:
                for s in sentences_nouns:
                    output_s = ' '.join(s)
                    fp.write(output_s.encode('utf-8') + '\n')
                words = [w for s in sentences_nouns for w in s]
                text_vocab_nouns = set(words)
                for w in text_vocab_nouns:
                    if w in self.document_noun:
                        self.document_noun[w].append(key)
                    else:
                        self.document_noun[w] = [key]
                text_vocab_nouns_oneline = ' '.join(list(text_vocab_nouns))
                text_all_one_line = ' '.join([w for s in sentences_nouns for w in s])
            
            self.sentences_nouns.append(sentences_nouns)
            self.mallet_input.append((key,text_all_one_line))
        
            self.data_stats[key]['word'] = list(text_vocab_nouns)
            self.data_stats[key]['no_nouns'] = len(text_vocab_nouns)
        
    def load_segmented_sentences(self):
        for key in self.data_stats:
            yield self.data_stats[key]
            
    def __iter__(self):
        for line in open('test.txt'):
            # assume there's one document per line, tokens separated by whitespace
            yield self.dictionary.doc2bow(line.lower().split())
            
    def get_texts(self):
        sentences_nouns = [s for f in self.sentences_nouns for s in f]
        for line in sentences_nouns:
            yield line

In [None]:
root_dir = 'OpenSubtitles2016/xml/zh/2015/369610/'
sample_data = text_data(root_dir)

In [None]:
df = pd.DataFrame([sample_data.data_stats[key] for key in sample_data.data_stats])
df[['subtitle_language','subtitle_year','subtitle_id','filename', 'no_sents','no_words','no_nouns']].tail()

In [None]:
print sample_data.dictionary

In [None]:
with open('test.txt','w') as fp:
    for text_id,text in sample_data.mallet_input:
        fp.write(text_id + '\t' + text.encode('utf-8') + '\n')

In [None]:
sentences_nouns = [s for f in sample_data.sentences_nouns for s in f]

In [None]:
print len(sample_data.sentences_nouns)

In [None]:
with open('tmp.txt','w') as fp:
    for s in sentences_nouns:
        s_filtered = []
        for w in s: 
            try:
                if cluster_label_ranked[w] > -1:
                    s_filtered.append(w)
            except:
                pass
        s_string = ' '.join(s_filtered)
        fp.write(s_string.encode('utf-8') + '\n')

In [None]:
with open('tmp.txt','w') as fp:
    for s in sentences_nouns:
        s_string = ' '.join(s)
        fp.write(s_string.encode('utf-8') + '\n')

In [None]:
from gensim.models.word2vec import LineSentence

lines = LineSentence('tmp.txt')
dictionary = corpora.Dictionary(lines)

with open('lengths.txt','w') as fp:
    for sentence_id, sentence in enumerate(sentences_nouns):
        fp.write(str(len(sentence)) + '\n')

with open('sets.txt','w') as fp:
    for sentence_id, sentence in enumerate(sentences_nouns):
        #print len(sentence), sentence
        term_ids = [w[0] for w in dictionary.doc2bow(sentence)]
        for term_id in term_ids:
            fp.write(str(sentence_id+1) + ' ' + str(term_id+1) + '\n')

In [None]:
with open('vocab.txt','w') as fp:
    for n in range(0,len(dictionary)):
        w = dictionary.get(n)
        try:
            fp.write(str(metric[w])+'\n')
        except:
            pass

<pre>
 ./occams_v5 -z non_zeros -m M -n N -s summary_length -D data_file -W weights_file -L lengths_file -b lower_bound
   non-zeros is the size of the instance
   M is the number of terms
   N is the number of sets
   summary_length is the length of the summary
   data_file is the name of the file describing the sets, one pair (set_id, term) per line
   weights_file is the name of the file containing the non-negative weights of terms, one per line
   lengths_file is the name of the file lengths of sentences
   lower_bound is the minimum length a sentence must have in order to be
      eligable for summarization.
</pre>

OCCAMS needs the names of the data files, the dimensions (`MxN`) of the term-sentence matrix, and the size of the summary. 

The sparse term-sentence matrix is input as a list of sentence-id, term pairs (`sets.txt`, with one pair per line). Internally, the data is stored in an array such that the i-th element points to an array of integers holding the i-th sentence.

The `sets_lengths` array describes the costs of sets, i.e. the lengths of sentences. It is populated by the `read_lengths` routine, which parses the `lengths.txt` file (one length per sentence).

Parameters:
* non-zeros
* number of terms
* number of sets
* summary length
* weights
* lengths
* lower bound
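
Rather than typing the dimensions by hand, they can be derived from the sentences while the (sentence-id, term-id) pairs are generated. The sketch below assigns 1-based term ids in first-seen order, which is an assumption made for illustration (the cells above use the gensim `Dictionary` ids), and `occams_inputs` is a hypothetical helper:

```python
def occams_inputs(sentences):
    """Return (pairs, lengths, vocab) for a list of tokenized sentences.

    pairs   -- (sentence_id, term_id) tuples, both 1-based, one per non-zero entry
    lengths -- number of tokens in each sentence
    vocab   -- mapping term -> term_id
    """
    vocab, pairs, lengths = {}, [], []
    for sentence_id, sentence in enumerate(sentences, start=1):
        lengths.append(len(sentence))
        for word in sorted(set(sentence)):
            term_id = vocab.setdefault(word, len(vocab) + 1)
            pairs.append((sentence_id, term_id))
    return pairs, lengths, vocab

sentences = [['a', 'b', 'a'], ['b', 'c']]
pairs, lengths, vocab = occams_inputs(sentences)

nz = len(pairs)       # -z: number of non-zero entries
M = len(vocab)        # -m: number of terms
N = len(sentences)    # -n: number of sets (sentences)
print(nz)
print(M)
print(N)
```

The weights file is expected to hold one line per term, so it should be written in term-id order with an entry for every one of the `M` terms.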

In [None]:
!pwd
!wc -l tmp.txt
!wc -l vocab.txt
!wc -l lengths.txt
!tail -2 sets.txt

In [None]:
nz = 1500
M = 599
N = 1246
s = 50
D = 'sets.txt'
W = 'vocab.txt'
L = 'lengths.txt'
b = 5
cmd = '/root/notebooks/stage/mad-science/OCCAMSV5/occams_v5 -z {} -m {} -n {} -s {} -D {} -W {} -L {} -b {}'.format(nz, M, N, s, D, W, L, b)
print cmd
cmd_list = cmd.split()
ret_val = subprocess.check_output(cmd_list, stderr=subprocess.STDOUT)
for line in ret_val.split('\n'):
    if 'Chosen sentences' in line:
        print line
        sentence_ids = line.split(':')[1].split()

In [None]:
import re

summary_nouns = []
for n in [int(n)-1 for n in sentence_ids]:
    summary_nouns.append(sample_data.sentences_nouns[0][n])
    #print n, ' '.join(sample_data.sentences_nouns[0][n])

summary_nouns = set([w for line in summary_nouns for w in line])
readings = []
translations = []
for w in summary_nouns:
    try:
        t = d.getForHeadword(w).next()
        readings.append(t.Reading)
        translations.append(t.Translation)
    except:
        readings.append(' ')
        translations.append(' ')
        
topic_numbers = []
for w in summary_nouns:
    try:
        topic_numbers.append(cluster_label_ranked[w]+1)
    except:
        topic_numbers.append(-1)
        
weights = []
for w in summary_nouns:
    try:
        weights.append(metric[w])
    except:
        weights.append(-1)
terms_df = pd.DataFrame(zip(summary_nouns,readings,translations,weights,topic_numbers),
                        columns=['Noun','Reading','Translation','Weight','Topic'])
print len(terms_df)
grouped = terms_df.groupby('Topic')
for name,group in grouped:
    print name, #group.sum(),
#    print group
terms_df.sort_values(by='Weight',ascending=False)

In [None]:
sids = [int(sid)-1 for sid in sentence_ids]
df =pd.DataFrame([aligned_data[sid] for n,sid in enumerate(sids)],columns=['ID',todoc,fromdoc])
#print tabulate(df, headers='keys', tablefmt='psql')
df

In [None]:
[(s[0],s[2]) for s in aligned_data if 'Drones' in s[2]]

In [None]:
for sid,sent in [(s[0],s[1]) for s in aligned_data if u'zì yóu jī' in s[1]]:
    print sid,sent

In [None]:
class MyCorpus(object):
    def __iter__(self):
        for line in open('test.txt'):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())

In [None]:
corpus_memory_friendly = MyCorpus()

print(dictionary)

In [None]:
dictionary = sample_data.dictionary

In [None]:
corpus = [dictionary.doc2bow(text) for text in sentences_nouns]

In [None]:
corpora.MmCorpus.serialize('test.mm', corpus)
dictionary.save('test.dict')

In [None]:
from gensim import corpora, models, similarities

In [None]:
dictionary = corpora.Dictionary.load('test.dict')
corpus = corpora.MmCorpus('test.mm')

In [None]:
import pyLDAvis
lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, passes=10)
                                      
lda.save('test.model')

In [None]:
import pyLDAvis.gensim as gensimvis
import pyLDAvis

vis_data = gensimvis.prepare(lda, corpus, dictionary)
pyLDAvis.display(vis_data)

Compute word vectors
--------------------

In [None]:
# Set recompute = False to reuse a model that is already on disk.
recompute = True
if recompute or not os.path.isfile(local_content_model):
    # Train skip-gram vectors on the local corpus stored in 'tmp.txt'.
    local_content_skipgram = fasttext.skipgram('tmp.txt', model_directory + local_content_name, 
        lr=0.02, dim=local_content_vector_dimension, ws=5,
        epoch=1, min_count=0, neg=5, loss='ns', bucket=2000000, minn=1, maxn=4,
        thread=8, t=1e-5, lr_update_rate=100)
else:
    logger.info('Local vectors %s already on disk.', local_content_vectors)
    local_content_skipgram = fasttext.load_model(local_content_model)

In [None]:
print local_content_model
print model_directory + local_content_name

In [None]:
logger.info('Creating word vecs')

words=[w for text in sentences_nouns for w in text]
#nouns = [n for f in nouns_all for n in f]
Vocab=set(words)

model_comb={}
model_comb_vocab=[]

common_vocab=set(knowledge_base_vocab_lowercase).intersection(local_content_skipgram.words).intersection(Vocab)

for w in common_vocab:
    model_comb[w]=np.array(np.concatenate((knowledge_base_skipgram[w],local_content_skipgram[w])))
    model_comb_vocab.append(w)
        
logger.info('Length of common_vocab = %d', len(common_vocab))
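
The cell above keeps only the words known to both models and to the local corpus, then joins the two vectors of each word end to end. Here is a minimal sketch of that step with plain Python lists; `kb_vectors`, `local_vectors` and `corpus_vocab` are made-up toy data, not the notebook's models.

```python
# Toy stand-ins for the knowledge-base model, the local model and the corpus vocabulary.
kb_vectors = {'drone': [0.1, 0.2], 'sky': [0.3, 0.4], 'cat': [0.5, 0.6]}
local_vectors = {'drone': [0.7], 'sky': [0.8], 'dog': [0.9]}
corpus_vocab = {'drone', 'sky', 'dog'}

# A word survives only if all three sources know it.
common = set(kb_vectors) & set(local_vectors) & corpus_vocab

# List concatenation plays the role of np.concatenate in the notebook.
combined = {w: kb_vectors[w] + local_vectors[w] for w in common}

for w in sorted(combined):
    print(w, combined[w])
```

The combined dimension is the sum of the two input dimensions, which is why the clustering cell further down passes `25*local_vec+200` as the feature count.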

In [None]:
print len(set(knowledge_base_skipgram.words))
print len(set(local_content_skipgram.words))

In [None]:
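# One tab-separated row per word vector (row order follows model_comb.keys()).
# unicodecsv writes bytes, so the file is opened in binary mode and closed on exit.
with open(combined_vectors,'wb') as f:
    writer = csv.writer(f,delimiter='\t')
    for k in model_comb.keys():
        writer.writerow(model_comb[k])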

In [None]:
combined_vectors

In [None]:
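### Helper functions
def norm(a):
    # Euclidean (L2) length of vector a.
    return np.sqrt(np.sum(np.square(a)))

def cosine(a,b):
    # Cosine distance (1 - cosine similarity): 0 for parallel vectors, 1 for orthogonal ones.
    return 1-np.dot(a,b)/np.sqrt(np.sum(a**2)*np.sum(b**2))

def l1(a,b):
    # Manhattan (L1) distance between a and b.
    return abs(a-b).sum()

def l2(a,b):
    # Euclidean (L2) distance between a and b.
    return np.sqrt(np.square(a-b).sum())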
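
As a quick sanity check on the helpers above, the following sketch recomputes the same three distances on two toy vectors using only the standard library. The `toy_` functions are illustrative re-implementations on plain lists, not the NumPy helpers themselves.

```python
import math

def toy_cosine(a, b):
    # Cosine distance: 1 minus the cosine of the angle between a and b.
    dot = sum(x * y for x, y in zip(a, b))
    return 1 - dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

def toy_l1(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def toy_l2(a, b):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

east, north = [1.0, 0.0], [0.0, 1.0]
print(toy_cosine(east, north))       # orthogonal vectors
print(toy_cosine(east, [2.0, 0.0]))  # same direction, different length
print(toy_l1(east, north), toy_l2(east, north))
```

Note that `cosine` is a distance, not a similarity, so smaller values mean more similar directions.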

In [None]:
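### Build the list of words to cluster. A word is kept if it is in `vocab`, occurs more than `min_count` times,
### is longer than `min_length` characters, and its vector norm exceeds `l2_threshold`.
### With repeat=True every occurrence in Texts is emitted; with normalized=True vectors are scaled to unit length.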
def create_word_list(model,vocab,features,Texts,repeat=True,l2_threshold=0,normalized=True,min_count=100,min_length=0):
    data_d2v=[]
    word_d2v=[]
    words_text=[w for text in Texts for w in text]
    count=Counter(words_text)
    if repeat:
        for text in Texts:
            for w in text:
                if w in vocab and count[w]>min_count:
                    if len(w)>min_length and l2(model[w],np.zeros(features))>l2_threshold:
                        if normalized:
                            data_d2v.append(model[w]/l2(model[w],np.zeros(features)))
                        else:
                            data_d2v.append(model[w])
                        word_d2v.append(w)
    else:
        A=set(words_text)
        for w in vocab:
            if w in A and len(w)>min_length and l2(model[w],np.zeros(features))>l2_threshold and count[w]>min_count:
                if normalized:
                    data_d2v.append(model[w]/l2(model[w],np.zeros(features)))
                else:
                    data_d2v.append(model[w])
                word_d2v.append(w)

    return data_d2v, word_d2v

In [None]:
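# Build the word list to cluster. The fastcluster linkage below is disabled;
# depth is computed later from HDBSCAN's single-linkage tree instead.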
logger.info('Clustering for depth...')
local_vec = True

data_d2v,word_d2v=create_word_list(model_comb,model_comb_vocab,25*local_vec+200,sentences_nouns,repeat=False,normalized=True,min_count=0,l2_threshold=0)
#spcluster=fastcluster.linkage(data_d2v,method='average',metric='cosine')

In [None]:
len(model_comb_vocab)

In [None]:
%%time
min_cluster_size_opt = 0
min_samples_opt = 0
count = 0
while True:
    X_2D = bhtsne.tsne(np.array(data_d2v), dimensions=2)
    print 'Attempt: ', count
    count += 1
    cluster_params = []
    label_values = []
    for min_cluster_size in range(4,30):
        for min_samples in range(4,min_cluster_size):
            clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size, min_samples=min_samples)
            #labels = clusterer.fit_predict(np.array(data_d2v))
            labels = clusterer.fit_predict(X_2D)
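            # labels_.max() is the highest cluster id, i.e. the number of clusters minus one.
            label_max = clusterer.labels_.max()
            if 10 <= label_max <= 100:
                label_values.append(label_max)
                cluster_params.append((min_cluster_size,min_samples))
                print min_samples,min_cluster_size,label_max
    # np.argmax returns the first maximum, so reversing first makes ties
    # resolve to the parameter pair that was tried last.
    label_values.reverse()
    cluster_params.reverse()
    if len(label_values) > 0:
        i = np.argmax(label_values)
        min_cluster_size_opt, min_samples_opt = cluster_params[i]
        break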
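
The selection rule in the sweep above can be isolated from the clustering itself: keep the parameter pairs whose highest cluster label falls in a target range, then take the pair with the most clusters, with ties going to the pair tried last. A minimal sketch on precomputed toy results follows; `pick_params` and the numbers are hypothetical, and no clustering is run here.

```python
def pick_params(results, low=10, high=100):
    # results: list of ((min_cluster_size, min_samples), label_max) in the order tried.
    kept = [(params, label_max) for params, label_max in results
            if low <= label_max <= high]
    if not kept:
        return None  # the notebook would retry with a fresh t-SNE embedding
    # max() returns the first maximal item, so scanning in reverse
    # resolves ties in favor of the pair that was tried last.
    return max(reversed(kept), key=lambda item: item[1])[0]

toy_results = [((5, 4), 12), ((6, 4), 15), ((6, 5), 15), ((7, 4), 9)]
best = pick_params(toy_results)
print(best)
```

Since t-SNE is stochastic, each retry in the notebook can yield a different embedding and therefore a different winning pair.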

In [None]:
min_cluster_size = min_cluster_size_opt
min_samples = min_samples_opt
clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size, min_samples=min_samples) #, algorithm='generic')
#labels = clusterer.fit_predict(np.array(data_d2v))
labels = clusterer.fit_predict(X_2D)
print min_cluster_size, clusterer.labels_.max()
#labels

In [None]:
def calculate_depth(spcluster,words, num_points):
    cluster=[[] for w in xrange(2*num_points)]
    c=Counter()
    for i in xrange(num_points):
        cluster[i]=[i]

    for i in xrange(len(spcluster)):
        x=int(spcluster[i,0])
        y=int(spcluster[i,1])
        xval=[w for w in cluster[x]]
        yval=[w for w in cluster[y]]
        cluster[num_points+i]=xval+yval
        for w in cluster[num_points+i]:
            c[words[w]]+=1
        cluster[x][:]=[]
        cluster[y][:]=[]    
    return c
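
To see what `calculate_depth` measures, it helps to trace a tiny merge tree by hand: a word's depth is the number of merges its cluster takes part in, so words that are merged early end up deeper. The sketch below is a simplified standard-library variant. `toy_depth` is a hypothetical helper that takes only the two merged cluster ids per step, whereas a SciPy-style linkage row also carries a distance and a size.

```python
from collections import Counter

def toy_depth(linkage, words):
    n = len(words)
    # Cluster id -> indices of the points it contains; leaves are 0..n-1.
    members = {i: [i] for i in range(n)}
    depth = Counter()
    for step, (left, right) in enumerate(linkage):
        # Merge step number `step` creates the new cluster id n + step.
        merged = members.pop(left) + members.pop(right)
        members[n + step] = merged
        for i in merged:
            depth[words[i]] += 1
    return depth

# a and b merge first (cluster 4), then c joins (cluster 5), then d (cluster 6).
toy_linkage = [(0, 1), (4, 2), (5, 3)]
toy = toy_depth(toy_linkage, ['a', 'b', 'c', 'd'])
print(sorted(toy.items()))
```

Here `a` and `b` take part in all three merges, `c` in two and `d` in one.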

In [None]:
##Calculate depth of words
num_points=len(data_d2v)
#depth=calculate_depth(spcluster,word_d2v,num_points)
depth = calculate_depth(clusterer.single_linkage_tree_.to_numpy(),word_d2v,num_points)

In [None]:
logger.info('Computing co-occurrence graph')

T=[' '.join(w) for w in sentences_nouns]

In [None]:
logger.info(len(T))

In [None]:
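##Co-occurrence matrix
cv=CountVectorizer(token_pattern=u'(?u)\\b([^\\s]+)')
bow_matrix = cv.fit_transform(T)
# Map column index -> word.
id2word={}
for key, value in cv.vocabulary_.items():
    id2word[value]=key

# Keep only the columns whose word has a combined vector (set lookup is O(1)).
model_comb_vocab_set=set(model_comb_vocab)
ids=[]
for key,value in cv.vocabulary_.items():
    if key in model_comb_vocab_set:
        ids.append(value)

sort_ids=sorted(ids)
bow_reduced=bow_matrix[:,sort_ids]
normalized = TfidfTransformer().fit_transform(bow_reduced)
# Entry (i, j) is nonzero when words i and j share at least one sentence.
similarity_graph_reduced=bow_reduced.T * bow_reduced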

In [None]:
##Binary co-occurrence graph: edge (i,j) has weight 1 if words i and j share a sentence
logger.info('Computing degree')
m,n=similarity_graph_reduced.shape

cx=similarity_graph_reduced.tocoo()
keyz=[id2word[sort_ids[w]] for w in xrange(len(sort_ids))]
data=[]
ro=[]
co=[]
for i,j,v in itertools.izip(cx.row, cx.col, cx.data):
    # Skip the diagonal (a word co-occurring with itself).
    if v>0 and i!=j:
        ro.append(i)
        co.append(j)
        data.append(1)

SS=sp.sparse.coo_matrix((data, (ro, co)), shape=(m,n))
SP_full=SS.tocsc()
id_word={w:id2word[sort_ids[w]] for w in xrange(len(sort_ids))}
word_id={value:key for key,value in id_word.items()}
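
The sparse-matrix cells above boil down to a simple definition: two words are linked if they appear in the same sentence, and a word's degree is the number of distinct words it is linked to. The following sketch computes the same quantity directly on toy sentences with the standard library; `cooccurrence_degree` and the sentences are hypothetical.

```python
from itertools import combinations

def cooccurrence_degree(sentences):
    neighbours = {}
    for sentence in sentences:
        unique = set(sentence)
        for w in unique:
            neighbours.setdefault(w, set())
        # Every unordered pair of distinct words in a sentence is an edge.
        for a, b in combinations(unique, 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    # Degree = number of distinct neighbours, as in the binary graph above.
    return {w: len(nb) for w, nb in neighbours.items()}

toy_sentences = [['drone', 'pilot', 'sky'], ['drone', 'camera'], ['pilot', 'sky']]
toy_deg = cooccurrence_degree(toy_sentences)
print(sorted(toy_deg.items()))
```

Repeated co-occurrences do not increase the degree, because the edge weights are binary.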

In [None]:
logger.info('Computing metrics')
#compute metrics
degsum=SP_full.sum(axis=1)
deg={}
for x in xrange(len(sort_ids)):
    deg[id2word[sort_ids[x]]]=int(degsum[x])

max_deg=max(deg.values())
max_depth=max(depth.values())

# Degree: log-scale into [0,1], then raise to a power alpha chosen so that the median maps to 0.5.
temp_deg_mod={w:np.log(1+deg[w])/np.log(1+max_deg) for w in deg.iterkeys()}
alpha=np.log(0.5)/np.log(np.median(temp_deg_mod.values()))
deg_mod={key:value**alpha for key, value in temp_deg_mod.iteritems()}

# Depth: scale into [0,1] by the maximum, then apply the same median-to-0.5 power transform.
temp={key:value*1./max_depth for key, value in depth.iteritems()}
alpha=np.log(0.5)/np.log(np.median(temp.values()))
depth_mod={key:value**alpha for key, value in temp.iteritems()}

# Final score: product of the two, rescaled so that the top word scores 1.
temp={key:deg_mod[key]*depth_mod[key] for key in depth_mod.iterkeys()}
max_metric=np.max(temp.values())
metric={key:value*1./max_metric for key,value in temp.iteritems()}
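
The exponent `alpha = log(0.5) / log(median)` used above is chosen so that `median ** alpha` equals 0.5, which puts degree and depth on a comparable scale before they are multiplied. A minimal sketch of that rescaling on toy values follows; `median_anchor` is a hypothetical helper, and it assumes all values lie in (0, 1] with a median strictly below 1.

```python
import math

def median_anchor(scores):
    ordered = sorted(scores.values())
    mid = len(ordered) // 2
    if len(ordered) % 2:
        median = ordered[mid]
    else:
        median = (ordered[mid - 1] + ordered[mid]) / 2.0
    # Solve median ** alpha == 0.5 for alpha.
    alpha = math.log(0.5) / math.log(median)
    return {k: v ** alpha for k, v in scores.items()}

toy_scores = {'a': 0.2, 'b': 0.4, 'c': 0.6, 'd': 0.8, 'e': 1.0}
anchored = median_anchor(toy_scores)
for k in sorted(anchored):
    print(k, round(anchored[k], 3))
```

The transform is monotone, so the ordering of words is preserved; only the spread of the values changes.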

In [None]:
logger.info('max_deg = %s, max_depth = %s',max_deg, max_depth)

In [None]:
K = clusterer.labels_.max()+1
cluster_label={word_d2v[x]:labels[x] for x in xrange(len(word_d2v))}

cluster_label_ranked={}

topic=[[] for i in xrange(-1,K)]
clust_depth=[[] for i in xrange(K)]
for i in xrange(K):
    topic[i]=[word_d2v[x] for x in xrange(len(word_d2v)) if labels[x]==i]
    #temp_score=[metric[w] for w in topic[i]]
    temp_score = []
    for w in topic[i]:
        if w in metric: temp_score.append(metric[w])
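    # Negated total score, so that argsort below ranks the highest-scoring topic first.
    clust_depth[i]=-np.sum(temp_score)
index=np.argsort(clust_depth)
# Alternative ranking by HDBSCAN cluster persistence (only used in the diagnostics below).
index2=np.argsort(-clusterer.cluster_persistence_)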
for i in xrange(K):
    for w in topic[index[i]]:
        cluster_label_ranked[w]=i

noise = [word_d2v[x] for x in xrange(len(word_d2v)) if labels[x]==-1]
for w in noise:
    cluster_label_ranked[w] = -1
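
The ranking step above orders clusters by the total score of their words and relabels every word with the rank of its cluster, keeping -1 for noise. The sketch below reproduces that logic on toy data with the standard library; `rank_topics` and the inputs are hypothetical.

```python
def rank_topics(words, labels, score):
    # Total score per cluster; noise (label -1) is not ranked.
    totals = {}
    for w, label in zip(words, labels):
        if label >= 0:
            totals[label] = totals.get(label, 0.0) + score.get(w, 0.0)
    # Highest total first: rank 0 is the most important topic.
    order = sorted(totals, key=lambda label: -totals[label])
    rank_of = {label: rank for rank, label in enumerate(order)}
    return {w: rank_of.get(label, -1) for w, label in zip(words, labels)}

toy_words = ['a', 'b', 'c', 'd', 'e']
toy_labels = [0, 0, 1, 1, -1]
toy_score = {'a': 0.9, 'b': 0.1, 'c': 0.6, 'd': 0.7, 'e': 0.5}
ranked = rank_topics(toy_words, toy_labels, toy_score)
print(sorted(ranked.items()))
```

Summing the scores favors large clusters; the commented-out `int(np.sqrt(len(topic[i])))` in the cell above hints at an alternative that would sum only the top few words of each cluster.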

In [None]:
#print list(clusterer.labels_)
print len(xrange(-1,K))
print 'number of topics = ',len(topic)
print 'length of index list ',len(index)
print 'K = ',K
print 'length of cluster_persistence_ ',len(index2)
len(noise)

In [None]:
from cjklib.dictionary import CEDICT
d = CEDICT()

for t in d.getForHeadword(u'生存'):
    print t.HeadwordSimplified, t.Reading, t.Translation
    
trans = []
for w in word_d2v: 
    try:
        for t in d.getForHeadword(w):
            trans.append((w,t.HeadwordSimplified, t.Reading, t.Translation))
    except Exception as ex:
        trans.append((w,'','',''))
        #logger.info('%s %s',ex, w,)

In [None]:
print ' '.join(trans[0])

In [None]:
import networkx as nx
import graphviz

G = nx.from_scipy_sparse_matrix(SP_full)
#pos = nx.nx_pydot.to_pydot(G)

cores = nx.core_number(G)

#import pydot

#nx.write_gml(G, 'test.gml')

#graphviz.Source(pos.to_string())

In [None]:
coreDict = nx.core_number(G)
kcore = dict()
print("\nk-core decomposition for each node:")
for n in sorted(coreDict, key=int):
    # Graph node n is row/column n of SP_full, so look the word up in id_word (not word_d2v).
    w = id_word[n]
    if coreDict[n]>=5:
        kcore[w] = {'core':coreDict[n],'deg':G.degree(n)}
        for t in d.getForHeadword(w):
            print n,
            print "\tNode %s: %d-core, %s, %s, %s, %s, %s" % (n, coreDict[n], G.degree(n), w, 
                                                      t.HeadwordSimplified, t.Reading, t.Translation)
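
The core number reported above is the largest k such that the node belongs to a subgraph in which every node has at least k neighbours. One way to compute it is to repeatedly peel off the node with the smallest remaining degree. The sketch below is a simple quadratic version of that idea on a toy graph, not networkx's implementation; `core_numbers` and `toy_adj` are hypothetical.

```python
def core_numbers(adj):
    # adj: node -> set of neighbours (undirected, no self-loops).
    degree = {n: len(nb) for n, nb in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel the node with the smallest remaining degree.
        n = min(remaining, key=lambda x: degree[x])
        # The core number never decreases along the peeling order.
        k = max(k, degree[n])
        core[n] = k
        remaining.remove(n)
        for m in adj[n]:
            if m in remaining:
                degree[m] -= 1
    return core

# A triangle a-b-c with a pendant node d attached to a.
toy_adj = {'a': {'b', 'c', 'd'}, 'b': {'a', 'c'}, 'c': {'a', 'b'}, 'd': {'a'}}
toy_core = core_numbers(toy_adj)
print(sorted(toy_core.items()))
```

The triangle forms the 2-core, while the pendant node only reaches the 1-core even though its neighbour `a` has the highest degree.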

In [None]:
giant = max(nx.connected_component_subgraphs(G), key=len)
kcore = nx.core_number(G)  # graph node index -> core number

In [None]:
# Map each word to its core number; graph node n corresponds to id_word[n].
kcoreMap = {}
for n,w in id_word.items():
    kcoreMap[w] = kcore[n]

In [None]:
for n in sorted(coreDict, key=int):
    if n in [0,135,162,451]:
        w = id_word[n]  # graph nodes are indexed like SP_full, not like word_d2v
        for t in d.getForHeadword(w):
            print "\tNode %s: %d-core, %s, %s, %s, %s, %s" % (n, coreDict[n], G.degree()[n], w, 
                                                      t.HeadwordSimplified, t.Reading, t.Translation)

In [None]:
import pydot

In [None]:
from graphviz import Graph, Digraph

g = Graph('G', filename='process.gv')

g.edge('run', 'intr')
g.edge('intr', 'runbl')
g.edge('runbl', 'run')
g.edge('run', 'kernel')
g.edge('kernel', 'zombie')
g.edge('kernel', 'sleep')
g.edge('kernel', 'runmem')
g.edge('sleep', 'swap')
g.edge('swap', 'runswap')
g.edge('runswap', 'new')
g.edge('runswap', 'runmem')
g.edge('new', 'runmem')
g.edge('sleep', 'runmem')
g

In [None]:
digraph ='''digraph g {
   node [shape = record,height=.1];
   node0[label = "<f0> |<f1> G|<f2> "];
   node1[label = "<f0> |<f1> E|<f2> "];
   node2[label = "<f0> |<f1> B|<f2> "];
   node3[label = "<f0> |<f1> F|<f2> "];
   node4[label = "<f0> |<f1> R|<f2> "];
   node5[label = "<f0> |<f1> H|<f2> "];
   node6[label = "<f0> |<f1> Y|<f2> "];
   node7[label = "<f0> |<f1> A|<f2> "];
   node8[label = "<f0> |<f1> C|<f2> "];
   "node0":f2 -> "node4":f1;
   "node0":f0 -> "node1":f1;
   "node1":f0 -> "node2":f1;
   "node1":f2 -> "node3":f1;
   "node2":f2 -> "node8":f1;
   "node2":f0 -> "node7":f1;
   "node4":f2 -> "node6":f1;
   "node4":f0 -> "node5":f1;
}'''

graphviz.Source(digraph)

In [None]:
cluster_ex = '''digraph G {
    compound=true;
    subgraph cluster0 {
      a -> b;
      a -> c;
      b -> d;
      c -> d;
    }
    subgraph cluster1 {
      e -> g;
      e -> f;
    }
    b -> f [lhead=cluster1];
    d -> e;
    c -> g [ltail=cluster0,
             lhead=cluster1];
    c -> e [ltail=cluster0];
    d -> h;
}
'''

graphviz.Source(cluster_ex)

In [None]:
g = Digraph('G', filename='cluster.gv')

c0 = Digraph('cluster_0')
c0.body.append('style=filled')
c0.body.append('color=lightgrey')
c0.node_attr.update(style='filled', color='white')
c0.edges([('a0', 'a1'), ('a1', 'a2'), ('a2', 'a3')])
c0.body.append('label = "process #1"')

c1 = Digraph('cluster_1')
c1.node_attr.update(style='filled')
c1.edges([('b0', 'b1'), ('b1', 'b2'), ('b2', 'b3')])
c1.body.append('label = "process #2"')
c1.body.append('color=blue')

g.subgraph(c0)
g.subgraph(c1)

g.edge('start', 'a0')
g.edge('start', 'b0')
g.edge('a1', 'b3')
g.edge('b2', 'a3')
g.edge('a3', 'a0')
g.edge('a3', 'end')
g.edge('b3', 'end')

g.node('start', shape='Mdiamond')
g.node('end', shape='Msquare')
g

In [None]:
print np.mean(sorted(temp_score,reverse=True)[:]), index
for w in topic[index[1]]: print w,
print 
for w in [w[0] for w in sorted([[w,metric[w]] for w in topic[index[1]]],key=itemgetter(1),reverse=True)]:
    print w,

In [None]:
logger.info('Done...Generating output')
lister=[]
to_show=K
to_show_words=200 #the maximum number of words of each type to display
for i in xrange(to_show):
    top=topic[index[i]]
    sort_top=[w[0] for w in sorted([[w,metric[w]] for w in top],key=itemgetter(1),reverse=True)]
    lister.append(['Topic %d' %(i+1)]+sort_top[:to_show_words])

max_len=max([len(w) for w in lister])
new_list=[]
for list_el in lister:
    new_list.append(list_el + [''] * (max_len - len(list_el)))
Topics=list(itertools.izip_longest(*new_list))
#X.insert(len(X),[-int(clust_depth[index[w]]*100)*1./100 for w in xrange(K)])
sorted_words=[w[0] for w in sorted(metric.items(),key=itemgetter(1),reverse=True)][:to_show_words]

In [None]:
import pandas as pd

df_tmp = pd.DataFrame(new_list).T
df_new = pd.DataFrame(df_tmp[1:len(new_list)].values,columns=[l[0] for l in new_list])
df_new = pd.DataFrame(df_new['Topic 1'])

In [None]:
score_words = sorted_words
deep_words = [w[0] for w in depth.most_common(to_show_words)]
filer = 'wiki_simple.txt'
outfile_topics = data_directory + filer.split('.')[0] + '_topics.csv'
outfile_score = data_directory + filer.split('.')[0] + '_score.csv'
outfile_depth = data_directory + filer.split('.')[0] + '_depth.csv'
# Write each table inside a context manager so the file is flushed and closed.
with open(outfile_topics, 'wb') as f:
    csv.writer(f).writerows(Topics)
with open(outfile_score, 'wb') as f:
    csv.writer(f).writerows([[w] for w in score_words])
with open(outfile_depth, 'wb') as f:
    csv.writer(f).writerows([[w] for w in deep_words])

In [None]:
for w in score_words[0:10]:
    print w,

In [None]:
df = pd.DataFrame([Topics[i][0:10] for i in range(0,11)])
df.columns = df.iloc[0]
df = df.reindex(df.index.drop(0))
df

In [None]:
# Map each clustered word to its first CEDICT translation ('<bad>' if the lookup fails).
# The readings/translations lists are built separately, aligned with word_d2v, in a later cell.
trans = {}
trans[u''] = ''
for w in word_d2v:
    try:
        if w !='':
            t = d.getForHeadword(w).next()
            trans[w] = t.Translation
        else:
            trans[w] = ''
    except Exception:
        trans[w] = '<bad>'

In [None]:
print trans[u'']

In [None]:
df.applymap(lambda x: trans[unicode(x.split('/')[0])])

In [None]:
step = 10
# Page through the topic columns `step` at a time; the last page may be shorter.
for j in range((K + step - 1)//step):
    first = j*step + 1
    last = min(j*step + step, K)
    print 'Total number of Topics = {}. Displaying Topics {} thru {}.'.format(K, first, last)
    print tabulate([row[first-1:last] for row in Topics[:21]], headers='firstrow')

In [None]:
from bokeh.plotting import figure, output_notebook, show, ColumnDataSource
from bokeh.models import HoverTool, BoxZoomTool, WheelZoomTool, ResetTool, PanTool, BoxSelectTool
from bokeh.models import CustomJS, ColorBar, CategoricalColorMapper, LinearColorMapper, FixedTicker, Circle
import bokeh.palettes
from bokeh.models.widgets import Div, DataTable, TableColumn, NumberFormatter, Slider, RadioGroup
from bokeh.layouts import gridplot, widgetbox
from bokeh.io import push_notebook
from ipywidgets import interact
from IPython.display import display, clear_output
import pandas as pd

In [None]:
output_notebook()

In [None]:
d.getForHeadword(u'冒险').next().HeadwordSimplified

In [None]:
metrics_clean = []
for w in word_d2v:
    if w in metric:
        # Floor zero scores at 0.01, then keep one rounded value per word
        # so that metrics_clean stays aligned with word_d2v.
        if metric[w] == 0: metric[w] = 0.01
        metrics_clean.append(round(metric[w],2))
readings = []
translations = []
for w in word_d2v:
    try:
        t = d.getForHeadword(w).next()
        readings.append(t.Reading)
        translations.append(t.Translation)
    except:
        readings.append(' ')
        translations.append(' ')

topic_numbers = [cluster_label_ranked[w]+1 for w in word_d2v]

label_colors = bokeh.palettes.viridis(K+1) 
colors = [label_colors[t] for t in topic_numbers]

#cores = [kcore[w]['core'] for w in word_d2v]
cores = [kcoreMap[w] for w in word_d2v]
degrees = [deg_mod[w] for w in word_d2v]
    
df = pd.DataFrame(zip(word_d2v, translations, readings, X_2D[:,0], X_2D[:,1], colors,
                      topic_numbers, [metric[w] for w in word_d2v], clusterer.probabilities_, cores, degrees),
                  columns=['word','translation','reading','x','y','color','topic','metric','prob','core','degree'])
df.sort_values(by='degree',ascending=False)

In [None]:
def generate_words():
    for n,w in enumerate(word_d2v):
        body = {}
        body['_type'] = 'words'
        body['_index'] = INDEX_NAME
        body['doc_id'] = sample_data.document_noun[w]
        body['word'] = w
        body['translation'] = translations[n]
        body['reading'] = readings[n]
        body['topic'] = topic_numbers[n]
        body['coord'] = {'lat':X_2D[n,1],'lon':X_2D[n,0]}
        body['weight'] = metrics_clean[n]
        yield body

In [None]:
print INDEX_NAME

In [None]:
helpers.bulk(es,generate_words())

In [None]:
helpers.bulk(es,sample_data.load_segmented_sentences())

In [None]:
len(df.topic.unique())

In [None]:
source_all  = ColumnDataSource(data=df)
#sample_topic = df[(df.word==u'khủng_long')].topic.values[0]
#sample_topic_no = df[(df.word==u'sự_nghiệp')].topic.values[0]
sample_topic_no = 6
sample_topic = df[df.topic == sample_topic_no].sort_values(by='metric',ascending=False)
html = sample_topic.to_html()
source_sample = ColumnDataSource(data=sample_topic)
ps = figure(plot_width=300, plot_height=300, 
           title="Topic: " + str(sample_topic_no), 
           tools='pan,wheel_zoom,box_zoom,box_select,lasso_select,reset,resize,save')
           #active_drag=None, active_scroll=pan, active_tap=None)
cr = ps.circle('x', 'y', source=source_sample, radius='metric', 
         fill_alpha=0.6, line_color=None, color='color')
hover = HoverTool(
        tooltips=[
            #("index", "$index"),
            ("topic","@topic"),
            ("word", "@word"),
            ("translation", "@translation"),
            ("metric", "@metric"),
            #x,y)", "(@x, @y)"),
            #("color","@color")
        ],
        renderers=[cr],
        mode='mouse',
        show_arrow=False,
    )
ps.add_tools(hover)

def callback(source_all=source_all,source_sample=source_sample):
    topic_no = cb_obj.value
    d1 = source_all.data
    d2 = source_sample.data
    #print(d1)
    for k in d2.keys():
        d2[k] = []
    source_sample.trigger('change')
slider = Slider(start=0, end=K, value=1, step=1, title="Topic Number", 
                callback=CustomJS.from_py_func(callback))

   
div = widgetbox(Div(text=html)) #,width=600, height=100))
grid = gridplot([[ps,div,widgetbox(slider)]],width=1000)
show(grid, notebook_handle=True);

In [None]:
# Topics are numbered 1..K (0 is noise), so the range must include K.
clusters_df = df[df.topic.isin(range(1,K+1))]
noise_df = df[df.topic.isin([0])]
print(len(noise_df))
noise_df.sort_values(by='metric',ascending=False).head(100)

In [None]:
source_all = ColumnDataSource(data=clusters_df)
source_noise = ColumnDataSource(data=noise_df)

columns = [
        TableColumn(field="word", title="words"),
        TableColumn(field="translation", title="translations"),
        TableColumn(field="topic", title="topics"),
        TableColumn(field="degree", title="degree"),
        TableColumn(field="core", title="core"),
        TableColumn(field="metric", title="metrics", formatter=NumberFormatter(format='0.[00]'),
                   default_sort='descending'),
]

source_table = ColumnDataSource(data=dict(x=[],y=[],word=[],color=[],topic=[],degree=[],core=[],metric=[]))
data_table = DataTable(source=source_all, columns=columns, width=600, height=600, row_headers=False)
table = widgetbox(data_table)

title = "Distributed Word Embeddings: " + str(K) + " Topics (min cluster size " + str(min_cluster_size) + ")"
p = figure(plot_width=800, plot_height=800, 
           title=title, 
           tools='pan,wheel_zoom,box_zoom,box_select,lasso_select,reset,resize,save')
           #active_drag=None, active_scroll=pan, active_tap=None)
#p.toolbar.active_scroll = WheelZoomTool()
p.xgrid.grid_line_color = None
p.ygrid.grid_line_color = None
p.xaxis.visible = False
p.yaxis.visible = False

cr_noise = p.circle('x', 'y', source=source_noise, radius='metric',
         fill_alpha=0.05, line_color=None, color='gray')
selected_circle_noise = Circle(fill_alpha=0.05, line_color=None, fill_color='gray')
nonselected_circle_noise = Circle(fill_alpha=0.05, line_color=None, fill_color='gray')

cr = p.circle('x', 'y', source=source_all, radius='metric', 
         fill_alpha=0.6, line_color=None, color='color')

cr_table = p.circle('x', 'y', source=source_table, radius='metric', 
         fill_alpha=0.6, line_color=None, color='color')

selected_circle = Circle(fill_alpha=0.6, fill_color='color', line_color='color')
nonselected_circle = Circle(fill_alpha=0.2, fill_color='color', line_color=None)
cr_table.selection_glyph = selected_circle
cr_table.nonselection_glyph = nonselected_circle
cr.selection_glyph = selected_circle
cr.nonselection_glyph = nonselected_circle
cr_noise.selection_glyph = selected_circle_noise
cr_noise.nonselection_glyph = nonselected_circle_noise


hover = HoverTool(
        tooltips=[
            #("index", "$index"),
            ("topic","@topic"),
            ("word", "@word"),
            ("translation", "@translation"),
            ("metric", "@metric"),
            ("degree","@degree"),
            ("core","@core"),
            #x,y)", "(@x, @y)"),
            #("color","@color")
        ],
        renderers=[cr],
        mode='mouse',
        show_arrow=False,
    )

p.add_tools(hover)
#color_bar = ColorBar(color_mapper=CategoricalColorMapper(factors=clusters_df.topic), orientation='vertical',
#                     location='top_right', scale_alpha=0.7,
#                     ticker=FixedTicker(ticks=[2,6,10,14,18]))

source_all.callback = CustomJS(args=dict(source_table=source_table), code="""
        var inds = cb_obj.selected['1d'].indices;
        var d1 = cb_obj.data;
        var d2 = source_table.data;
        d2['x'] = [];
        d2['y'] = [];
        d2['word'] = [];
        d2['color'] = [];
        d2['topic'] = [];
        d2['metric'] = [];
        for (var i = 0; i < inds.length; i++) {
            d2['x'].push(d1['x'][inds[i]]);
            d2['y'].push(d1['y'][inds[i]]);
            d2['word'].push(d1['word'][inds[i]]);
            d2['color'].push(d1['color'][inds[i]]);
            d2['topic'].push(d1['topic'][inds[i]]);
            d2['metric'].push(d1['metric'][inds[i]]);
        }
        source_table.trigger('change');
    """)

def callback(source_table=source_table, source_all=source_all):
    topic = cb_obj.value
    d1 = source_all.data
    d2 = source_table.data
    d2['topic'] = []
    d2['word'] = []
    d2['metric'] = []
    d2['x'] = []
    d2['y'] = []
    
    data = zip(d1['topic'], d1['word'], d1['translation'], d1['metric'], d1['x'], d1['y'])
    data_topic = [d for d in data if d[0] == topic]
    d2['topic'] = [d[0] for d in data_topic]
    d2['word'] = [d[1] for d in data_topic]
    d2['translation'] = [d[2] for d in data_topic]
    d2['metric'] = [d[3] for d in data_topic]
    d2['x'] = [d[4] for d in data_topic]
    d2['y'] = [d[5] for d in data_topic]
    #print(d2)
    source_table.trigger('change')

#slider = Slider(start=0, end=K, value=1, step=1, title="Topic Number", 
#                callback=CustomJS.from_py_func(callback))

first=1;last=K
html = tabulate([Topics[i][first-1:last] for i in range(0,11)], tablefmt=u'html',headers='firstrow')
div = widgetbox(Div(text=html,width=1000))
#grid = gridplot([p, widgetbox(slider), table], ncols=2, plot_width=300, plot_height=300)
grid = gridplot([[p,table]])
#grid = gridplot([[p,table],[div]])
show(grid, notebook_handle=True);

In [None]:
from bokeh.embed import components
from bokeh.resources import CDN
# Generate the script and HTML for the plot
script, div = components(grid)

# Build a standalone web page around the plot; it is written to disk below
html = """
<!doctype html>
<head>
 <title></title>
 {bokeh_css}
</head>
<body>
 {div}
 {bokeh_js}
 {script}
</body>
 """.format(script=script, div=div, bokeh_css=CDN.render_css(),
 bokeh_js=CDN.render_js())

with open('templates/sample_output.html','w') as fp:
    fp.write(html)

from IPython.core.display import HTML
#HTML(html)