Topic Modeling Using Distributed Word Embeddings
================================================
Notebook version of https://github.com/rsrandhawa/Vec2Topic code, based on the article "Topic Modeling Using Distributed Word Embeddings" by R. S. Randhawa, P. Jain, and G. Madan. 

The basic approach is to first build a language model from a large text corpus (ideally billions of words). The technology used, distributed word embeddings, is trained with a shallow neural network that seems to perform best on large datasets: it trades model complexity for simple, fast computation over large amounts of data.

Word vectors are likewise trained on the user-generated content (usually a much smaller corpus) with consistent parameters. For each vocabulary word, the vector from the background model and the vector from the user-generated content are concatenated to give a model of the user-generated content.
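
As a rough illustration of the concatenation step, the sketch below joins a background vector and a local vector for each word found in both models. The two dictionaries, their toy 3- and 2-dimensional vectors, and the helper name `concatenate_vectors` are made up for this example; plain lists stand in for the numpy arrays used later in the notebook.

```python
# Hypothetical toy vectors; real ones come from the background and local models.
background = {'factory': [0.1, 0.2, 0.3], 'problem': [0.4, 0.5, 0.6]}
local = {'factory': [0.9, 0.8], 'strike': [0.7, 0.6]}

def concatenate_vectors(background, local):
    """Concatenate the two vectors of every word present in both models."""
    shared = set(background) & set(local)
    return {w: list(background[w]) + list(local[w]) for w in shared}

combined = concatenate_vectors(background, local)
```

Words that appear in only one of the two models are dropped here, so the combined dimension is always the sum of the two input dimensions.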

Word vectors that cluster together are interpreted as topics of the user-generated content. Some clusters are better than others because they consist of coherent lists of words, so the main goal is to score the importance of each topic.

Hierarchical clustering provides a measure of depth for each word, and a co-occurrence graph (an edge between two words if they appear in the same sentence) provides a degree of co-occurrence. Each word is scored by a (normalized) product of depth and degree. The awesome `hdbscan` is used to cluster words into topics, and the scoring function is used to order the words and the topics. Sample below with `min_cluster_size=5`.

![Topics](topics.png)

![Sentences](sentences.png)
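
The word score described above can be sketched as follows. This is a simplified version under stated assumptions: the `depth` and `degree` values are invented toy numbers, the helper name `score_words` is hypothetical, and the median-based exponent applied later in `summarize_document` is left out.

```python
import math

def score_words(depth, degree):
    """Score each word by a normalized product of depth and log-scaled degree."""
    max_depth = float(max(depth.values()))
    max_degree = max(degree.values())
    raw = {}
    for w in depth:
        if w not in degree:
            continue
        depth_part = depth[w] / max_depth
        degree_part = math.log(1 + degree[w]) / math.log(1 + max_degree)
        raw[w] = depth_part * degree_part
    top = max(raw.values())
    # Divide by the best score so that the top word scores exactly 1.
    return {w: s / top for w, s in raw.items()}

# Toy inputs, not taken from any real document.
depth = {'factory': 6, 'worker': 4, 'weather': 1}
degree = {'factory': 12, 'worker': 7, 'weather': 2}
scores = score_words(depth, degree)
```

A word needs both a deep position in the cluster tree and many co-occurring neighbours to score highly; a low value on either side pulls the product down.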

## Required Standard Packages

In [None]:
import logging, re, os, bz2, gzip, subprocess, uuid, json
from collections import Counter
from operator import itemgetter
import itertools
from collections import namedtuple

In [None]:
## Unicode wrapper for reading & writing csv files.
import unicodecsv as csv

## Lighter weight than pandas -- tabular display of data.
from tabulate import tabulate

## In order to strip out the text from the xml formatted data.
from bs4 import BeautifulSoup
import lxml

## Required Data Science Packages

In [None]:
## First the usual suspects: numpy, pandas, scipy, and gensim
import numpy as np
import pandas as pd
import scipy as sp
import gensim

## For scraping text out of a wikipedia dump. Get dumps at https://dumps.wikimedia.org/backup-index.html
from gensim.corpora import WikiCorpus

## Latest greatest word vectors (see https://pypi.python.org/pypi/fasttext).
import fasttext

## Package for segmenting and finding parts-of-speech for Chinese.
import jieba
import jieba.posseg as pseg

## Latest greatest hierarchical clustering package. 
## Word vectors are clustered, with deeper trees indicating core topics.
import hdbscan

## Use scikit-learn to generate co-occurrence graph (edge if words in same sentence).
## The degree of each word indicates how strongly it co-occurs.
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

#import bhtsne
from MulticoreTSNE import MulticoreTSNE as TSNE

In [None]:
from gensim import corpora
from gensim.corpora.dictionary import Dictionary

## Base Data Directory and Logging

The data directory, if it exists, should contain a copy of the background language model. In particular, the `data/` directory should contain the pre-computed word vectors for Wikipedia. If there is no language model, then set a flag for later processing.

TODO: 
1. Implement more careful defaults for language models.
2. Synchronize directory and filenames with those used in summarization (ROUGE) literature.

In [None]:
data_directory = 'data/'
model_directory = 'models/'
eval_directory = 'eval/'
wikipedia_vector_path = 'models_save/zhwiki-20160920-pages-articles.vec'

In [None]:
from imp import reload
reload(logging)

LOG_TO_FILE = False
if LOG_TO_FILE:
    LOG_FILENAME = data_directory + 'vec2topic.log'
    logging.basicConfig(filename=LOG_FILENAME,level=logging.INFO)
else:
    logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()
logger.setLevel(logging.INFO)
formatter = logging.Formatter('%(asctime)s %(message)s',"%b-%d-%Y %H:%M:%S")
logger.handlers[0].setFormatter(formatter)

## Load Language Model

Typically pre-computed from Wikipedia. If it does not exist, then proceed using only the local (target) documents. The idea is that a background language model should capture common language usage. Takes about 5 minutes to load.

The `fasttext` package, which is used to precompute the Wikipedia word vectors, produces two outputs: the set of all word vectors (file extension `.vec`) and the binary internal state of the neural network (file extension `.bin`). The algorithm below only uses the word vectors and so avoids the more expensive (in size and processing resources) binary model.

TODO: 
1. Investigate using a more compact format for word vectors (e.g. FlatBuffers or Cap'n Proto).
2. Wider set of language models: parameters & domain specific background corpus.
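
For orientation, a text-format `.vec` file starts with a header line giving the number of words and the vector dimension, followed by one word and its values per line. The sketch below reads such a stream and keeps only a requested vocabulary. The sample content and the helper name `read_vec` are invented for this example; it is not the reader class used in this notebook.

```python
import io

# Made-up sample in .vec layout: header "count dim", then "word v1 ... vdim".
SAMPLE_VEC = u"2 3\nfactory 0.1 0.2 0.3 \nproblem 0.4 0.5 0.6 \n"

def read_vec(handle, vocab=None):
    """Yield (word, vector) pairs from a text-format .vec stream."""
    _count, dim = (int(x) for x in handle.readline().split())
    for line in handle:
        parts = line.rstrip().split(' ')
        word, values = parts[0], parts[1:]
        if vocab is not None and word not in vocab:
            continue
        vector = [float(v) for v in values]
        if len(vector) != dim:
            continue  # skip rows that do not match the declared dimension
        yield word, vector

vectors = dict(read_vec(io.StringIO(SAMPLE_VEC), vocab={u'factory'}))
```

Streaming the file line by line keeps memory use proportional to the requested vocabulary rather than to the full background model.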

In [None]:
class WikipediaWordVector_(namedtuple('WikipediaWordVector_', ('word','vector'))):

    @classmethod
    def parse(cls, row):          
        return cls(row[0],map(float,row[1:-1]))

In [None]:
class wikipedia_word_vectors(object):

    def __init__(self, path):
        self.path = path
        self._length= 0
        self._vector_dim = 0
        self._wikipedia_vocab = []
        
        with open(self.path, 'rU') as data:
            reader = csv.reader(data, delimiter=' ')
            self._length, self._vector_dim = reader.next()
            self._length = int(self._length)
            self._vector_dim = int(self._vector_dim)
            
            self._wikipedia_vocab = {line.split()[0].decode('utf-8') for line in data}
        
        self._wikipedia_vocab = set(self._wikipedia_vocab)
        print len(self._wikipedia_vocab), self._length

    def __iter__(self, vocab=()):
        ## Yield only the vectors for words in vocab that the background model knows.
        ## Words in vocab that are missing from the model are silently skipped.
        count = 0
        overlap = self._wikipedia_vocab.intersection(set(vocab))
        with open(self.path, 'rU') as data:
            reader = csv.reader(data, delimiter=' ')
            ## Skip the header line; the counts were already parsed in __init__.
            next(reader)
            for row in reader:
                if count == len(overlap):
                    break
                if row[0] in overlap:
                    count += 1
                    yield WikipediaWordVector_.parse(row)

In [None]:
LANGUAGE_MODEL = True
## Check to see if working directories exist. If they don't, then create them.
if not os.path.isdir(model_directory):
    os.mkdir(model_directory)
    LANGUAGE_MODEL = False
    
if not os.path.isfile(wikipedia_vector_path):
    LANGUAGE_MODEL = False
else:
    wikipedia_reader = wikipedia_word_vectors(wikipedia_vector_path)

if not os.path.isdir(data_directory):
    os.mkdir(data_directory)

In [None]:
test_vocab = set([u'不滿', u'問題', 'zblah', u'工廠'])
for r in wikipedia_reader.__iter__(test_vocab):
    print r[0], len(r[1])
    
print wikipedia_reader.path, wikipedia_reader._length, wikipedia_reader._vector_dim
r

In [None]:
rec = dict((f, getattr(r, f)) for f in r._fields)
print rec['word']

## Prepare User Content

Assumes that a Stanford CoreNLP server is running and listening on `localhost:9000`.
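
The request sent to the server can be sketched as below. This only builds the URL and does not contact any server. The helper name `corenlp_url` and its defaults are assumptions for illustration, and the custom `ssplit.boundaryTokenRegex` property used in the cells that follow is omitted for brevity.

```python
import json
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2

def corenlp_url(host='localhost', port=9000, language='zh'):
    """Build the annotation URL; the text itself goes in the POST body."""
    properties = {
        'annotators': 'tokenize,ssplit,pos',
        'outputFormat': 'json',
        'pipelineLanguage': language,
    }
    encoded = quote(json.dumps(properties, sort_keys=True))
    return 'http://{}:{}/?properties={}'.format(host, port, encoded)

url = corenlp_url()
```

Percent-encoding the JSON keeps the URL free of quotes and spaces, so it survives being passed through a shell command or split on whitespace.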

In [None]:
if False:
    for dirpath, dirnames, filenames in os.walk('wikipedia_fa/zh/text/'):
        for fn in filenames:
            if '_body' in fn:
                post_file = 'wikipedia_fa/zh/text/' + fn
                out_file = post_file + '.json'
                properties = 'localhost:9000/?properties={"annotators":"tokenize,ssplit,pos","outputFormat":"json","pipelineLanguage":"zh","ssplit.boundaryTokenRegex":"[.]|[!?]+|[。]|[！？]+"}'
                cmd = 'wget --post-file {} {} -O {} '.format(post_file, properties, out_file)
                print cmd
                cmd_list = cmd.split()
                ret_val = subprocess.call(cmd_list)
                print ret_val,

In [None]:
if False:
    for dirpath, dirnames, filenames in os.walk('wikipedia_fa/zh/text/'):
        for fn in filenames:
            if '_summary' in fn:
                post_file = 'wikipedia_fa/zh/text/' + fn
                out_file = post_file + '.json'
                properties = 'localhost:9000/?properties={"annotators":"tokenize,ssplit,pos","outputFormat":"json","pipelineLanguage":"zh","ssplit.boundaryTokenRegex":"[.]|[!?]+|[。]|[！？]+"}'
                cmd = 'wget --post-file {} {} -O {} '.format(post_file, properties, out_file)
                print cmd
                cmd_list = cmd.split()
                ret_val = subprocess.call(cmd_list)
                print ret_val,

In [None]:
class corenlp_text_data():
    '''Assumes a directory full of json output generated by Stanford CoreNLP.'''
    def __init__(self, root_dir=None, file_names=None, index_name=None):
        self.root_dir = root_dir
        self.file_names = file_names  ## An optional list of file names to scope analysis.
        self.tmp_dir = 'tmp_dir/'
        self.INDEX_NAME = 'test-v1'
        self.dictionary = Dictionary()
        
        ## TODO: Instead of keeping multiple copies of data in memory, write generators to stream off of disk.
        ## Keep data as unicode strings, then serialize dictionary as utf-8 json.
        
        ## Collect metadata on documents.
        self.data_stats = {}
        
        ## One (full) sentence per line, input for skipgram.
        self.text_sentences = []
        
        ## Format for mallet topic modeling. Input all text (actually nouns) all on one line.
        ## <doc_id> <tab> <text_one_line>
        self.mallet_input = []
        
        ## Restrict vocabulary to "nouns".
        self.sentences_nouns = []
        
        ## For each noun, keep a mapping of all documents that it appears in.
        self.document_noun = {}
        self.document_topic = {}
        
        ## Temporary directory to store all the intermediate files.
        if not os.path.isdir(self.tmp_dir):
            os.mkdir(self.tmp_dir)
            
        ## Walk the input directory and process files.
        ## Side effects: statistics on input directory
        ##               temporary directory full of files formatted for topic modeling
        for dirpath, dirnames, filenames in os.walk(self.root_dir):
            if self.file_names is not None:
                filenames = [fn for fn in filenames if fn in self.file_names]
            for fn in filenames:
                ## Assumption: json formatted output from Stanford corenlp -- sentences, word seg., pos.
                if fn.endswith('.json'):

                    ## Generate a document identifier, real document names can be ugly.
                    key = str(uuid.uuid4())

                    ## Pull out information encoded in directory name path.
                    ## OpenSubtitles2016/raw/vi/2006/761212/3826993.xml.gz
                    parts = dirpath.split('/')
                    root = parts[0]
                    subtitle_language = 'zh' #parts[2]
                    subtitle_year = ' ' #parts[3]
                    subtitle_id = ' ' #parts[4]
                    
                    ## Initialize statistics, add to this as each file is processed.
                    self.data_stats[key] = {'_type':'document', '_index':self.INDEX_NAME, 
                                            'doc_id':key, 'subtitle_language':subtitle_language,
                                            'subtitle_year':subtitle_year, 'subtitle_id':subtitle_id, 
                                            'filename':fn}
                    
                    ## Load and process segmented text.
                    local_content_file = dirpath + '/' + fn
                    self.process_local_content_file(fn=local_content_file, key=key)
                    
                    ## Load and process parts-of-speech.
                    ##pos_file = local_content_file # + '.pos'
                    ##self.process_parts_of_speech_file(fn=pos_file, key=key)
                    
        self.dictionary = Dictionary(self.get_texts())
                
    def process_local_content_file(self, fn=None, key=None):
        ## Keep track of lines (actual sentences because of preprocessing) in file.
        text_lines_file = []
        sentences_nouns = []
                    
        with open(fn,'rb') as fp:
            data = json.load(fp)
            for s in data['sentences']:
                ## u'tokens', u'word', u'pos'
                ## print s['index'], s.keys()
                ## Type line is unicode.
                line = ' '.join([w['word'] for w in s['tokens']])
                text_lines_file.append(line)
                
                ## NR (proper noun), NT (temporal noun), NN (other noun)
                ## Strip out just the nouns from the current sentence out of file fn.
                ## Type of sentence_nouns: list of unicode strings (multi-character nouns).
                sentence_nouns = [w['word'] for w in s['tokens'] if w['pos'] in ['NR','NT','NN']]
                
                ## Keep track of one long list of all sentences in all files.
                sentences_nouns.append(' '.join(sentence_nouns))
              
        ## Save sentences (just the nouns) to temporary file.
        with open(self.tmp_dir + key + '.txt','w') as fp:
            for s in sentences_nouns:
                ## Sentence s has type unicode (sequence of code points). 
                ## For output, encode it as a sequence of utf-8 bytes.
                fp.write(s.encode('utf-8') + '\n')

        ## Find the set of all unique nouns identified in the current file.
        ## Type of text_vocab_nouns is a set of unicode strings (multi-character nouns).
        nouns = [w for s in sentences_nouns for w in s.split()]
        text_vocab_nouns = set(nouns)

        ## Keep the global array text_sentences, which is a list of lists. 
        self.text_sentences.append(text_lines_file)
        
        ## Throw all the lines (sentences) from the current file into one long line.
        text_all_one_line = ' '.join(text_lines_file)
        
        ## Complete vocabulary for current file. List of unicode strings (multi-character words).
        text_vocab = set([w for text in text_lines_file for w in text.split()])
        
        ## Keep track of files that the nouns came from.
        for w in text_vocab_nouns:
            if w in self.document_noun:
                self.document_noun[w].append(key)
            else:
                self.document_noun[w] = [key]
                
        ## Throw all the nouns of the document into one unicode string. 
        ## This will be a field stored in elasticsearch.
        text_vocab_nouns_oneline = ' '.join(list(text_vocab_nouns))
        
        self.sentences_nouns.append(sentences_nouns)
        self.mallet_input.append((key,text_all_one_line))
        
        self.data_stats[key]['text'] = text_all_one_line
        self.data_stats[key]['no_sents'] = len(text_lines_file)
        self.data_stats[key]['no_words'] = len(text_vocab)     
        self.data_stats[key]['word'] = list(text_vocab_nouns)
        self.data_stats[key]['no_nouns'] = len(text_vocab_nouns)

    def load_segmented_sentences(self):
        for key in self.data_stats:
            yield self.data_stats[key]
            
    def __iter__(self):
        ## NOTE: reads the hard-coded file 'test.txt', not the documents loaded above.
        with open('test.txt') as fp:
            for line in fp:
                # assume there's one document per line, tokens separated by whitespace
                yield self.dictionary.doc2bow(line.lower().split())
            
    def get_texts(self):
        sentences_nouns = [s.split() for f in self.sentences_nouns for s in f]
        for line in sentences_nouns:
            yield line

In [None]:
from cjklib.dictionary import CEDICT
d = CEDICT()

for t in d.getForHeadword(u'生存'):
    print t.HeadwordSimplified, t.Reading, t.Translation

## Compute Word Vectors

In [None]:
def compute_word_vectors(text_fn='tmp.txt', vector_dim=25, language_model=LANGUAGE_MODEL, sentences_nouns=None):
    ## Three basic steps:
    ## 1) Compute word embeddings of local text file.
    ## 2) Find overlap between local vocab and background language_model base (e.g. language specific wikipedia).
    ## 3) Concatenate local and language_model base vectors.
    
    ## TODO: 
    ## Reconsider logic to recompute the vectors. Support running many experiments on same set of vectors.
    ## Clean up local variable names.
    ## Don't pass in sentences_nouns

    logger.info('Calculating skipgram vectors.')
    local_content_skipgram = fasttext.skipgram(text_fn, model_directory + 'tmp', 
        lr=0.02, dim=vector_dim, ws=5,
        epoch=5, min_count=0, neg=5, loss='ns', bucket=2000000, minn=1, maxn=4,
        thread=8, t=1e-5, lr_update_rate=100)
        
    words = [w for text in sentences_nouns for w in text]
    #nouns = [n for f in nouns_all for n in f]
    Vocab = set(words)

    if language_model == False:
        model_comb = local_content_skipgram
        model_comb_vocab = list(Vocab)
        return model_comb, model_comb_vocab
    else:
        logger.info('Concatenate local and language_model base word vectors: dim = %d', 
                    int(wikipedia_reader._vector_dim) + vector_dim)
        
        model_comb={}
        model_comb_vocab=[]
        
        for r in wikipedia_reader.__iter__(Vocab):
            record = dict((f, getattr(r, f)) for f in r._fields)
            w = record['word']
            v = np.array(record['vector'])
            model_comb[w] = np.array(np.concatenate((v,local_content_skipgram[w])))
            model_comb_vocab.append(w)

        logger.info('Length of common_vocab: %d', len(model_comb_vocab))

        combined_vectors = model_directory + 'tmp.combined_vectors.txt'
        ## One tab-separated vector per row; the file is closed when the block exits.
        with open(combined_vectors,'w') as fp:
            writer = csv.writer(fp, delimiter='\t')
            for k in model_comb.keys():
                writer.writerow(model_comb[k])

        return model_comb, model_comb_vocab

In [None]:
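### Helper functions
def norm(a):
    ## Euclidean (L2) norm of a vector.
    return np.sqrt(np.sum(np.square(a)))

def cosine(a,b):
    ## Cosine distance, i.e. 1 - cosine similarity.
    return 1-np.dot(a,b)/np.sqrt(np.sum(a**2)*np.sum(b**2))

def l1(a,b):
    ## Manhattan (L1) distance.
    return abs(a-b).sum()

def l2(a,b):
    ## Euclidean (L2) distance.
    return np.sqrt(np.square(a-b).sum())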

## Score vocabulary for depth
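
Depth here means the number of merges a word takes part in on the way to the root of the single-linkage tree, which is what `calculate_depth` below counts. The following sketch shows the idea on a hand-written linkage table. The words, the linkage rows and the helper name `merge_depth` are invented for illustration; each row is assumed to have the layout (left node, right node, distance, size), with merged clusters numbered after the original points.

```python
from collections import Counter

def merge_depth(linkage, words):
    """Count, for each word, how many merges its cluster takes part in."""
    n = len(words)
    members = {i: [i] for i in range(n)}  # every point starts as its own cluster
    depth = Counter()
    for step, row in enumerate(linkage):
        left, right = int(row[0]), int(row[1])
        merged = members.pop(left) + members.pop(right)
        members[n + step] = merged  # the new cluster gets the next free id
        for i in merged:
            depth[words[i]] += 1
    return depth

words = ['factory', 'worker', 'union', 'weather']
# Hand-written rows: (left, right, distance, size).
linkage = [(0, 1, 0.1, 2), (4, 2, 0.3, 3), (5, 3, 0.9, 4)]
depth = merge_depth(linkage, words)
```

Words that join a cluster early sit in dense regions of the vector space and accumulate a higher count than words that are only attached near the root.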

In [None]:
### Create the list of words to be clustered, together with their vectors.
### Words are filtered by l2_threshold, min_count and min_length; vectors are optionally normalized.
### The repeat argument is currently unused.
def create_word_list(model=None, vocab=None, features=None, Texts=None, repeat=True,
                     l2_threshold=0, normalized=True, min_count=100, min_length=0):
    data_d2v=[]
    word_d2v=[]
    words_text=[w for text in Texts for w in text]
    count=Counter(words_text)

    A=set(words_text)
    for w in vocab:
        if w in A and len(w)>min_length and l2(model[w],np.zeros(features))>l2_threshold and count[w]>min_count:
            if normalized:
                data_d2v.append(model[w]/l2(model[w],np.zeros(features)))
            else:
                data_d2v.append(model[w])
            word_d2v.append(w)
        else:
            logger.info('Mismatch: %s', w)

    return data_d2v, word_d2v

In [None]:
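def cluster_vectors(data_d2v):
    ## Project the vectors to 2D with t-SNE, then grid-search the HDBSCAN parameters,
    ## keeping a setting that yields the most clusters (at least 21 are required).
    ## NOTE: the outer loop re-runs t-SNE until some setting qualifies, so it may not terminate.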
    min_cluster_size_opt = 0
    min_samples_opt = 0
    count = 0
    while True:
        tsne = TSNE(n_jobs=4)
        X_2D = tsne.fit_transform(np.array(data_d2v))
        logger.info('Attempt: %d', count)
        count += 1
        cluster_params = []
        label_values = []
        for min_cluster_size in range(4,30):
            for min_samples in range(4,min_cluster_size):
                clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size, min_samples=min_samples)
                #labels = clusterer.fit_predict(np.array(data_d2v))
                labels = clusterer.fit_predict(X_2D)
                label_max = clusterer.labels_.max()
                if label_max >= 20:
                    label_values.append(label_max)
                    cluster_params.append((min_cluster_size,min_samples))
                    logger.info('min_samples:%d, min_cluster_size:%d, label_max:%d',
                                    min_samples,min_cluster_size,label_max)
        label_values.reverse()
        cluster_params.reverse()
        if len(label_values) > 0:
            i = np.argmax(label_values)
            min_cluster_size_opt, min_samples_opt = cluster_params[i]
            break
    
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size_opt, min_samples=min_samples_opt) 
    return X_2D, clusterer

In [None]:
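def calculate_depth(clusterer,words, num_points):
    ## clusterer is the linkage array of the single-linkage tree (one merge per row).
    ## Returns a Counter mapping each word to the number of merges its cluster takes part in.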
    cluster=[[] for w in xrange(2*num_points)]
    c=Counter()
    for i in xrange(num_points):
        cluster[i]=[i]

    for i in xrange(len(clusterer)):
        x=int(clusterer[i,0])
        y=int(clusterer[i,1])
        xval=[w for w in cluster[x]]
        yval=[w for w in cluster[y]]
        cluster[num_points+i]=xval+yval
        for w in cluster[num_points+i]:
            c[words[w]]+=1
        cluster[x][:]=[]
        cluster[y][:]=[]    
    return c

## Score vocabulary for co-occurrence
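
The degree of a word is the number of distinct other words it shares at least one sentence with. A minimal sketch of that count, using plain sets rather than the sparse matrices in the cell below; the toy sentences and the helper name `cooccurrence_degree` are invented for illustration.

```python
from itertools import combinations

def cooccurrence_degree(sentences):
    """Return the number of distinct sentence neighbours of every word."""
    neighbours = {}
    for sentence in sentences:
        unique = set(sentence)
        for w in unique:
            neighbours.setdefault(w, set())
        # Every unordered pair in a sentence is one undirected, unweighted edge.
        for a, b in combinations(unique, 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return {w: len(ns) for w, ns in neighbours.items()}

sentences = [['factory', 'worker', 'union'], ['worker', 'wage'], ['weather']]
degree = cooccurrence_degree(sentences)
```

Repeated co-occurrences of the same pair do not raise the degree, since each neighbour is stored once.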

In [None]:
def co_occurence_graph(sentences_nouns=None, combined_vocab=None):
    logger.info('Computing co-occurrence graph')

    T=[' '.join(w) for w in sentences_nouns]
    
    ##Co-occurrence matrix
    #cv=CountVectorizer(token_pattern=u'(?u)\\b([^\\s]+)')
    cv=CountVectorizer(vocabulary=combined_vocab)
    bow_matrix = cv.fit_transform(T)
    id2word={}
    for key, value in cv.vocabulary_.items():
        id2word[value]=key

    ids=[]
    for key,value in cv.vocabulary_.iteritems():
        if key in combined_vocab:
            ids.append(value)

    sort_ids=sorted(ids)
    bow_reduced=bow_matrix[:,sort_ids]
    normalized = TfidfTransformer().fit_transform(bow_reduced)
    similarity_graph_reduced=bow_reduced.T * bow_reduced
    
    ##Unweighted graph: an edge between words i and j if they share at least one sentence.
    logger.info('Computing degree')
    m,n=similarity_graph_reduced.shape

    cx=similarity_graph_reduced.tocoo()
    keyz=[id2word[sort_ids[w]] for w in xrange(len(sort_ids))]
    data=[]
    ro=[]
    co=[]
    for i,j,v in itertools.izip(cx.row, cx.col, cx.data):
        if v>0 and i!=j:
            value=1
            if value>0:
                ro.append(i)
                co.append(j)
                data.append(value)

    SS=sp.sparse.coo_matrix((data, (ro, co)), shape=(m,n))
    SP_full=SS.tocsc()
    id_word={w:id2word[sort_ids[w]] for w in xrange(len(sort_ids))}
    word_id={value:key for key,value in id_word.items()}
    degsum=SP_full.sum(axis=1)
    deg={}
    for x in xrange(len(sort_ids)):
        deg[id2word[sort_ids[x]]]=int(degsum[x])
    
    return deg

## Summarize document

In [None]:
def summarize_document(file_name=None):
    root_dir = 'wikipedia_fa/zh/text/'
    vector_dim = 25
    
    sample_data = corenlp_text_data(root_dir=root_dir,file_names=[file_name])
    stats = [sample_data.data_stats[key] for key in sample_data.data_stats]
    df = pd.DataFrame(stats)
    sentences_nouns = [s.split() for f in sample_data.sentences_nouns for s in f]
    
    with open('tmp.txt','w') as fp:
        for s in sentences_nouns:
            s_string = ' '.join(s)
            fp.write(s_string.encode('utf-8') + '\n')
            
    model_comb, model_comb_vocab = compute_word_vectors(language_model=LANGUAGE_MODEL, vector_dim=vector_dim,
                                                        sentences_nouns=sentences_nouns)

    logger.info('Clustering for depth...')
    local_vec = True

    features = (int(wikipedia_reader._vector_dim) if LANGUAGE_MODEL else 0) + vector_dim
    data_d2v,word_d2v = create_word_list(model=model_comb, vocab=model_comb_vocab, features=features, 
                                         Texts=sentences_nouns, l2_threshold=0, normalized=True, 
                                         min_count=0, min_length=0)

    X_2D, clusterer = cluster_vectors(data_d2v)
    labels = clusterer.fit_predict(X_2D)
    
    mcv = {}
    for n,w in enumerate(word_d2v):
        mcv[w] = n
    ##Calculate depth of words
    depth = calculate_depth(clusterer.single_linkage_tree_.to_numpy(), word_d2v, len(data_d2v))
    deg = co_occurence_graph(sentences_nouns=sentences_nouns, combined_vocab=mcv)
    
    logger.info('Computing metrics')

    max_deg=max(deg.values())
    max_depth=max(depth.values())

    temp_deg_mod={w:np.log(1+deg[w])/np.log(1+max_deg) for w in deg.iterkeys()}
    alpha=np.log(0.5)/np.log(np.median(temp_deg_mod.values()))
    deg_mod={key:value**alpha for key, value in temp_deg_mod.iteritems()}

    temp={key:value*1./max_depth for key, value in depth.iteritems()}
    alpha=np.log(0.5)/np.log(np.median(temp.values()))
    depth_mod={key:value**alpha for key, value in temp.iteritems()}

    temp={key:deg_mod[key]*depth_mod[key] for key in depth_mod.iterkeys()}
    max_metric=np.max(temp.values())
    metric={key:value*1./max_metric for key,value in temp.iteritems()}

    logger.info('max_deg = %s, max_depth = %s',max_deg, max_depth)
    
    K = clusterer.labels_.max()+1
    print K
    cluster_label={word_d2v[x]:labels[x] for x in xrange(len(word_d2v))}

    cluster_label_ranked={}

    topic=[[] for i in xrange(-1,K)]
    clust_depth=[[] for i in xrange(K)]
    for i in xrange(K):
        topic[i]=[word_d2v[x] for x in xrange(len(word_d2v)) if labels[x]==i]
        #temp_score=[metric[w] for w in topic[i]]
        temp_score = []
        for w in topic[i]:
            if w in metric: temp_score.append(metric[w])
        clust_depth[i]=-np.sum(sorted(temp_score,reverse=True)[:])#int(np.sqrt(len(topic[i])))])
    index=np.argsort(clust_depth)
    index2=np.argsort(-clusterer.cluster_persistence_)
    for i in xrange(K):
        for w in topic[index[i]]:
            cluster_label_ranked[w]=i

    noise = [word_d2v[x] for x in xrange(len(word_d2v)) if labels[x]==-1]
    for w in noise:
        cluster_label_ranked[w] = -1
        
    logger.info('Done...Generating output')
    lister=[]
    to_show=K
    to_show_words=200 #the maximum number of words of each type to display
    for i in xrange(to_show):
        top=topic[index[i]]
        sort_top=[w[0] for w in sorted([[w,metric[w]] for w in top],key=itemgetter(1),reverse=True)]
        lister.append(['Topic %d' %(i+1)]+sort_top[:to_show_words])

    max_len=max([len(w) for w in lister])
    new_list=[]
    for list_el in lister:
        new_list.append(list_el + [''] * (max_len - len(list_el)))
    Topics=list(itertools.izip_longest(*new_list))
    #X.insert(len(X),[-int(clust_depth[index[w]]*100)*1./100 for w in xrange(K)])
    sorted_words=[w[0] for w in sorted(metric.items(),key=itemgetter(1),reverse=True)][:to_show_words]

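    return sample_data, word_d2v, metric, Topics, cluster_label_ranked

In [None]:
fn = 'addd9d361106ea4c7a79e3a79c8be18d_body.txt.json'
fn = '89e8cfc72f7fde2851cd174dc73b2c44_body.txt.json'
## cluster_label_ranked is returned as well because a later cell looks up topic numbers in it.
sample_data, word_d2v, metric, Topics, cluster_label_ranked = summarize_document(file_name=fn)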
df = pd.DataFrame([sample_data.data_stats[key] for key in sample_data.data_stats])
df[['subtitle_language','subtitle_year','subtitle_id','filename', 'no_sents','no_words','no_nouns']]

In [None]:
df_topics = pd.DataFrame([Topics[i][0:15] for i in range(0,11)])
df_topics.columns = df_topics.iloc[0]
df_topics = df_topics.reindex(df_topics.index.drop(0))
df_topics

In [None]:
from gensim.models.word2vec import LineSentence

lines = LineSentence('tmp.txt')
dictionary = corpora.Dictionary(lines)
sentences_nouns = [s.split() for f in sample_data.sentences_nouns for s in f]
with open('lengths.txt','w') as fp:
    for sentence_id, sentence in enumerate(sentences_nouns):
        fp.write(str(len(sentence)) + '\n')

with open('sets.txt','w') as fp:
    for sentence_id, sentence in enumerate(sentences_nouns):
        #print len(sentence), sentence
        term_ids = [w[0] for w in dictionary.doc2bow(sentence)]
        for term_id in term_ids:
            fp.write(str(sentence_id+1) + ' ' + str(term_id+1) + '\n')

In [None]:
print len(dictionary), dictionary

In [None]:
with open('vocab.txt','w') as fp:
    for n in range(0,len(dictionary)):
        w = dictionary.get(n)
        ## Words without a score get weight 0.
        fp.write(str(metric.get(w, 0)) + '\n')

In [None]:
nz = 3700
M = len(dictionary)
N = len(sentences_nouns)
s = 100
D = 'sets.txt'
W = 'vocab.txt'
L = 'lengths.txt'
b = 5
cmd = '/root/notebooks/stage/mad-science/OCCAMSV5/occams_v5 -z {} -m {} -n {} -s {} -D {} -W {} -L {} -b {}'.format(nz, M, N, s, D, W, L, b)
print cmd
cmd_list = cmd.split()
ret_val = subprocess.check_output(cmd_list, stderr=subprocess.STDOUT)
for line in ret_val.split('\n'):
    if 'Chosen sentences' in line:
        print line
        sentence_ids = line.split(':')[1].split()

In [None]:
summary_nouns = []
for n in [int(n)-1 for n in sentence_ids]:
    summary_nouns.append(' '.join(sample_data.sentences_nouns[0][n].split()))
    #print sample_data.sentences_nouns[0][n].split()
    #print n, '|'.join(sample_data.sentences_nouns[0][n].split())

summary_nouns = set([w for line in summary_nouns for w in line.split()])
readings = []
translations = []
for w in summary_nouns:
    try:
        t = d.getForHeadword(w).next()
        readings.append(t.Reading)
        translations.append(t.Translation)
    except:
        readings.append(' ')
        translations.append(' ')
        
topic_numbers = []
for w in summary_nouns:
    try:
        topic_numbers.append(cluster_label_ranked[w]+1)
    except:
        topic_numbers.append(-1)
        
weights = []
for w in summary_nouns:
    try:
        weights.append(metric[w])
    except:
        weights.append(-1)
terms_df = pd.DataFrame(zip(summary_nouns,readings,translations,weights,topic_numbers),
                        columns=['Noun','Reading','Translation','Weight','Topic'])
print len(terms_df)
grouped = terms_df.groupby('Topic')
for name,group in grouped:
    print name, #group.sum(),
#    print group
terms_df.sort_values(by='Weight',ascending=False)

In [None]:
sids = [int(sid)-1 for sid in sentence_ids]
with open('summary.txt','w') as fp:
    for sid in sids:
        s = sample_data.text_sentences[0][sid]
        fp.write(s.encode('utf-8') + '\n')
        print str(sid)+': ', sample_data.text_sentences[0][sid]

## Evaluate Summarization
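
The summaries are compared by n-gram overlap. The sketch below follows the standard ROUGE-N recall definition: the share of the model (gold) summary's n-grams that also occur in the system (peer) summary, with counts clipped. It is an illustration with invented sentences and hypothetical helper names, and not necessarily the exact computation done by `rouge_n_r` further down.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(peer, model, n=1):
    """Clipped n-gram overlap divided by the number of model n-grams."""
    peer_counts = ngram_counts(peer.split(), n)
    model_counts = ngram_counts(model.split(), n)
    total = sum(model_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(c, peer_counts[g]) for g, c in model_counts.items())
    return overlap / float(total)

peer = 'the factory workers went on strike'
model = 'factory workers strike over wages'
recall = rouge_n_recall(peer, model, n=1)
```

Clipping with `min` prevents a system summary from gaining credit by repeating the same n-gram more often than the model summary contains it.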

In [None]:
def write_eval_info(filename=None,exp_no=1):
    ## System summaries (i.e. output from algorithm).
    fn = filename.split('.')[0]
    fn_base = re.sub('_body','', fn)
    system_vocab = []
    out_fn = 'eval/systems/' + fn_base + '.' + str(exp_no) + '.txt'
    print out_fn
    with open(out_fn,'w') as fp:
        for sid in sids:
            s = sample_data.text_sentences[0][sid]
            system_vocab.extend(s.split())
            fp.write(s.encode('utf-8') + '\n')
    system_vocab = set(system_vocab)
    print len(system_vocab)
    for w in system_vocab:
        print w,
            
    ## Model summaries (i.e. gold standard).
    root_dir = 'wikipedia_fa/zh/text/'
    fn_summary = re.sub('_body','_summary', fn)
    model_vocab = []
    !ls {root_dir + fn_summary + '.txt.json'}
    data = corenlp_text_data(root_dir=root_dir,file_names=[fn_summary + '.txt.json'])
    sentences_nouns = [s.split() for f in data.sentences_nouns for s in f]
    out_fn = 'eval/models/' + fn_base + '.A.' + str(exp_no) + '.txt'
    print out_fn
    with open(out_fn,'w') as fp:
        for s in sentences_nouns:
            model_vocab.extend(s)
            s_string = ' '.join(s)
            fp.write(s_string.encode('utf-8') + '\n')
    model_vocab = set(model_vocab)
    print len(model_vocab)
    for w in model_vocab:
        print w,
    
    print '\n' + '=='*40
    for w in model_vocab.intersection(system_vocab):
        print w,

In [None]:
def write_eval_info_nouns(filename=None,exp_no=1):
    ## System summaries (i.e. output from algorithm).
    fn = filename.split('.')[0]
    fn_base = re.sub('_body','', fn)
    system_vocab = []
    out_fn = 'eval/systems/' + fn_base + '.' + str(exp_no) + '.txt'
    print out_fn
    with open(out_fn,'w') as fp:
        for sid in sids:
            s = sample_data.sentences_nouns[0][sid]
            system_vocab.extend(s.split())
            fp.write(s.encode('utf-8') + '\n')
    s_peer = ' '.join(system_vocab)
    #s_peer = ' '.join(list(s_peer)) ## Look at just unicode characters instead of words.
    print type(s_peer)
    system_vocab = set(system_vocab)
    #print len(system_vocab)
    #for w in system_vocab:
    #    print w,
            
    ## Model summaries (i.e. gold standard).
    root_dir = 'wikipedia_fa/zh/text/'
    fn_summary = re.sub('_body','_summary', fn)
    model_vocab = []
    !ls {root_dir + fn_summary + '.txt.json'}
    data = corenlp_text_data(root_dir=root_dir,file_names=[fn_summary + '.txt.json'])
    sentences_nouns = [s.split() for f in data.sentences_nouns for s in f]
    out_fn = 'eval/models/' + fn_base + '.A.' + str(exp_no) + '.txt'
    print out_fn
    with open(out_fn,'w') as fp:
        for s in sentences_nouns:
            model_vocab.extend(s)
            s_string = ' '.join(s)
            fp.write(s_string.encode('utf-8') + '\n')
    s_model = ' '.join(model_vocab)
    #s_model = ' '.join(list(s_model)) ## Look at just unicode characters instead of words.
    model_vocab = set(model_vocab)
    #print len(model_vocab)
    #for w in model_vocab:
    #    print w,
    
    #print '\n' + '=='*40
    #for w in model_vocab.intersection(system_vocab):
    #    print w,
        
    return s_peer,s_model

In [None]:
s_peer,s_model = write_eval_info_nouns(df['filename'][0])

In [None]:
from itertools import tee, islice

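def ngrams(lst, n):
    """Yield successive n-grams (as tuples) from the iterable `lst`."""
    tlst = lst
    while True:
        ## Fork the iterator: `a` reads the next window, `b` remembers its start.
        a, b = tee(tlst)
        l = tuple(islice(a, n))
        if len(l) == n:
            yield l
            next(b)   ## Advance the window start by one token.
            tlst = b
        else:
            break     ## Fewer than n tokens remain.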
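
As a quick check of what the sliding window above produces, here is a minimal sketch. The `sliding_ngrams` helper and the toy tokens are hypothetical and not part of the notebook. It uses plain list slicing instead of `tee`/`islice`, which should give the same windows for a list input.

```python
def sliding_ngrams(tokens, n):
    """Return every contiguous n-token window as a tuple (hypothetical helper)."""
    # range() is empty when n exceeds len(tokens), so no partial windows are emitted.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

toy_tokens = ['a', 'b', 'c', 'd']          # made-up tokens for illustration
bigrams = sliding_ngrams(toy_tokens, 2)    # three overlapping pairs
trigrams = sliding_ngrams(toy_tokens, 3)   # two overlapping triples
too_long = sliding_ngrams(toy_tokens, 5)   # window longer than the input
```

A list of four tokens yields three bigrams and two trigrams, and asking for a window longer than the input yields nothing.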

In [None]:
def rouge_n_r(s_peer=None,s_model=None,ngram_n=1,display=False):
    """ROUGE-N recall: clipped n-gram matches divided by the model n-gram count."""
    s_peer_ngrams = Counter(ngrams(s_peer.split(),ngram_n))
    s_model_ngrams = Counter(ngrams(s_model.split(),ngram_n))

    t = 0
    tab = [['word','peer cnt','model cnt','min']]
    for ng in s_peer_ngrams:
        if ng in s_model_ngrams:
            clipped = min(s_peer_ngrams[ng],s_model_ngrams[ng]) ## Clip
            t += clipped
            tab.append([' '.join(ng),s_peer_ngrams[ng],s_model_ngrams[ng],clipped])

    ## Guard against a model summary with fewer than ngram_n tokens.
    model_total = sum(s_model_ngrams.values())
    rnr = float(t)/model_total if model_total else 0.0

    if display:
        print 'total ' + str(ngram_n) + '-gram model count: ', model_total
        print 'total ' + str(ngram_n) + '-gram peer count: ', sum(s_peer_ngrams.values())
        print 'total ' + str(ngram_n) + '-gram hit: ', t
        print 'total ROUGE-' + str(ngram_n) + '-R: ', rnr
        print tabulate(tab,headers='firstrow',tablefmt='psql')

    return rnr,tab
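
The clipped-count recall computed above can be checked by hand on a toy pair of sentences. The sketch below is a hypothetical, standalone re-statement. `clipped_recall` and the example sentences are assumptions, not notebook code. It relies on `Counter`'s `&` operator, which keeps the minimum count per key, to do the clipping.

```python
from collections import Counter

def clipped_recall(peer_tokens, model_tokens, n=1):
    """Clipped n-gram recall of peer against model (hypothetical helper)."""
    def count(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    peer, model = count(peer_tokens), count(model_tokens)
    total = sum(model.values())
    if total == 0:
        return 0.0                      # no model n-grams to recall
    hits = sum((peer & model).values()) # & takes the per-n-gram minimum count
    return float(hits) / total

peer = 'the cat sat on the mat'.split()    # made-up system summary
model = 'the cat was on the mat'.split()   # made-up gold summary

r1 = clipped_recall(peer, model, n=1)      # 5 of 6 model unigrams matched
r2 = clipped_recall(peer, model, n=2)      # 3 of 5 model bigrams matched
r_clip = clipped_recall('the the the'.split(), 'the cat'.split(), n=1)
```

In the last call the peer repeats `the` three times but the model contains it once, so only one match is credited and the recall is 1/2.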

In [None]:
ngram_n = 1
rnr,tab = rouge_n_r(s_peer,s_model,ngram_n,display=False)
rouge_df = pd.DataFrame(tab)
rouge_df.columns = rouge_df.iloc[0]
rouge_df = rouge_df.reindex(rouge_df.index.drop(0))
print 'total ROUGE-' + str(ngram_n) + '-R: ', rnr
rouge_df

In [None]:
ngram_n = 2
rnr,tab = rouge_n_r(s_peer,s_model,ngram_n,display=False)
rouge_df = pd.DataFrame(tab)
rouge_df.columns = rouge_df.iloc[0]
rouge_df = rouge_df.reindex(rouge_df.index.drop(0))
print 'total ROUGE-' + str(ngram_n) + '-R: ', rnr
rouge_df

In [None]:
ngram_n = 3
rnr,tab = rouge_n_r(s_peer,s_model,ngram_n,display=False)
rouge_df = pd.DataFrame(tab)
rouge_df.columns = rouge_df.iloc[0]
rouge_df = rouge_df.reindex(rouge_df.index.drop(0))
print 'total ROUGE-' + str(ngram_n) + '-R: ', rnr
rouge_df

**TODO: These scores do not match the output of the ROUGE Perl script.**

In [None]:
def rouge_skipgram_s2(s_peer=None,s_model=None,max_skip=2,display=False):
    """ROUGE-S recall over character pairs that are at most max_skip positions apart."""
    tokens = ' '.join(list(s_peer)).split()  ## Characters (UNICODE)
    #tokens = s_peer.split()                 ## Words (From preprocessing segmentation step.)
    sgrams = [(tokens[index], tokens[index+j]) for index in range(len(tokens)) for j in range(1,max_skip+1) if (index + j) < len(tokens)]
    s_peer_ngrams = Counter(sgrams)

    tokens = ' '.join(list(s_model)).split()
    #tokens = s_model.split()
    sgrams = [(tokens[index], tokens[index+j]) for index in range(len(tokens)) for j in range(1,max_skip+1) if (index + j) < len(tokens)]
    s_model_ngrams = Counter(sgrams)

    t = 0
    tab = [['word','peer cnt','model cnt','min']]
    for ng in s_peer_ngrams:
        if ng in s_model_ngrams:
            clipped = min(s_peer_ngrams[ng],s_model_ngrams[ng]) ## Clip
            t += clipped
            tab.append([' '.join(ng),s_peer_ngrams[ng],s_model_ngrams[ng],clipped])

    ## Guard against a model summary with fewer than two tokens.
    model_total = sum(s_model_ngrams.values())
    rnr = float(t)/model_total if model_total else 0.0

    if display:
        print 'total ROUGE-S' + str(max_skip) + ' model count: ', model_total
        print 'total ROUGE-S' + str(max_skip) + ' peer count: ', sum(s_peer_ngrams.values())
        print 'total ROUGE-S' + str(max_skip) + ' hit: ', t
        print 'total ROUGE-S' + str(max_skip) + '-R: ', rnr
        print tabulate(tab,headers='firstrow',tablefmt='psql')

    return rnr,tab
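
To make the pairing rule above concrete, here is a small sketch of the same list comprehension on toy input. The `skip_pairs` name and the tokens are hypothetical. With `max_skip=2`, each token is paired with the next token and with the one after it, so at most one token is skipped.

```python
def skip_pairs(tokens, max_skip=2):
    """Ordered token pairs at most max_skip positions apart (hypothetical helper)."""
    return [(tokens[i], tokens[i + j])
            for i in range(len(tokens))
            for j in range(1, max_skip + 1)
            if i + j < len(tokens)]          # drop pairs that run past the end

pairs = skip_pairs(['a', 'b', 'c', 'd'], max_skip=2)   # made-up tokens
adjacent_only = skip_pairs(['a', 'b', 'c', 'd'], max_skip=1)
```

Note that `('a', 'd')` is absent from `pairs` because the two tokens are three positions apart. Whether this distance convention matches the one used by the ROUGE Perl script has not been verified here.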

In [None]:
rnr,tab = rouge_skipgram_s2(s_peer,s_model,display=False)
rouge_df = pd.DataFrame(tab)
rouge_df.columns = rouge_df.iloc[0]
rouge_df = rouge_df.reindex(rouge_df.index.drop(0))
print 'total ROUGE-S2-R: ', rnr
rouge_df