Topic Modeling Using Distributed Word Embeddings
================================================
Notebook version of https://github.com/rsrandhawa/Vec2Topic code, based on the article "Topic Modeling Using Distributed Word Embeddings" by R. S. Randhawa, P. Jain, and G. Madan. 

The basic approach is to first train a language model on a large text corpus (ideally billions of words). The technique used, distributed word embeddings, relies on a shallow neural network that seems to perform best on large datasets: it trades model complexity for simple, fast computation over large amounts of data.

The user generated content (usually a much smaller corpus) is trained in the same way, with consistent parameters. For each word that appears in both vocabularies, the two vectors are concatenated to give a model of the user generated content.
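
As a minimal sketch of the concatenation step, the toy example below joins a 3-dimensional "global" vector and a 2-dimensional "local" vector for every word that both models know. The dictionaries and the helper name `concatenate_shared` are made up for illustration; the notebook itself concatenates 200- and 25-dimensional `fasttext` vectors.

```python
# Hypothetical toy vectors; the real ones come from the two trained models.
knowledge_vectors = {'market': [0.1, 0.2, 0.3], 'trade': [0.4, 0.5, 0.6]}
local_vectors = {'market': [0.7, 0.8], 'pipeline': [0.9, 1.0]}


def concatenate_shared(global_vecs, local_vecs):
    """Concatenate the two vectors of every word present in both models."""
    shared = set(global_vecs) & set(local_vecs)
    return {w: list(global_vecs[w]) + list(local_vecs[w]) for w in shared}


combined = concatenate_shared(knowledge_vectors, local_vectors)
# Only 'market' is in both vocabularies, so it is the only key in `combined`.
```

Words missing from either model are dropped, which is why the combined vocabulary can be much smaller than either input vocabulary.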

Word vectors that cluster together are interpreted as topics of the user generated content. Some clusters look better than others because they consist of coherent lists of words, so the main goal is to score the importance of each topic.

Performing a hierarchical clustering provides a measure of depth for each word, and computing a co-occurrence graph (an edge between two words if they belong to the same sentence) provides a degree of co-occurrence. Each word is scored by a (normalized) product of depth and degree. KMeans is used to cluster words into topics, and the scoring function is used to order the words and the topics.
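
The scoring can be sketched in plain Python. The version below follows the arithmetic of the scoring cell further down in this notebook (log-scaled degree, depth divided by its maximum, each rescaled so that its median lands at 0.5), but the function names and the toy counts are assumptions chosen for illustration only.

```python
import math


def median(values):
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2.0


def rescale(values):
    """Raise values in [0, 1] to the power that moves their median to 0.5.

    Assumes the median is strictly between 0 and 1.
    """
    alpha = math.log(0.5) / math.log(median(values.values()))
    return {w: v ** alpha for w, v in values.items()}


def score_words(depth, degree):
    """Normalized product of rescaled depth and rescaled log-degree."""
    max_depth = float(max(depth.values()))
    max_degree = max(degree.values())
    depth_mod = rescale({w: d / max_depth for w, d in depth.items()})
    degree_mod = rescale({w: math.log(1 + g) / math.log(1 + max_degree)
                          for w, g in degree.items()})
    raw = {w: depth_mod[w] * degree_mod[w] for w in depth_mod if w in degree_mod}
    top = max(raw.values())
    return {w: s / top for w, s in raw.items()}


# Hypothetical depth and degree counts for three words.
scores = score_words({'market': 4, 'trade': 3, 'fax': 1},
                     {'market': 6, 'trade': 2, 'fax': 1})
```

A word needs both a high depth and a high degree to score well, because a low value in either factor pulls the product down.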

The notebook below uses `fasttext` instead of `word2vec`. It looks like `fasttext` gives too much weight to word endings (the default character n-gram lengths of 3 to 6 were used): every word in Topic 3 below ends in "-ing".
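
To see why shared endings matter, here is a small sketch of the character n-grams that a subword model of this kind works with. The helper `char_ngrams` is hypothetical and simplified: it pads the word with `<` and `>` and collects every substring of length 3 to 6, which is the general idea behind the `minn` and `maxn` settings rather than a copy of the library's internals.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Collect all character n-grams of '<word>' with minn <= n <= maxn."""
    padded = '<' + word + '>'
    grams = set()
    for n in range(minn, maxn + 1):
        for start in range(len(padded) - n + 1):
            grams.add(padded[start:start + n])
    return grams


# Two words from Topic 3 in the table below.
shared = char_ngrams('meeting') & char_ngrams('marketing')
```

All of the shared n-grams come from the common suffix `eting>`, so two otherwise unrelated words end up with partly overlapping subword representations.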

<pre>
+---------------+---------------+-----------+------------+-----------+-------------+--------------+-------------+
| Topic 1       | Topic 2       | Topic 3   | Topic 4    | Topic 5   | Topic 6     | Topic 7      | Topic 8     |
| investment    | costs         | meeting   | future     | customers | security    | opportunity  | market      |
| management    | opportunities | marketing | enterprise | markets   | asset       | poverty      | trade       |
| response      | operations    | planning  | interest   | offers    | agreement   | debt         | summit      |
| globalization | activities    | waiting   | business   | savings   | alliance    | tax          | fax         |
| process       | securities    | trading   | compaq     | partners  | seminar     | peacekeeping | pipeline    |
| conversation  | assets        | pricing   | demand     | remarks   | mission     | expense      | marketplace |
| stabilization | strategies    | housing   | customer   | others    | leadership  | ideal        | lay         |
| ability       | solutions     | opening   | exchange   | meetings  | office      | audience     | shoreline   |
| invitation    | investments   | evening   | power      | seminars  | service     | awareness    | street      |
| integration   | analysts      | sizing    | balance    | stories   | partnership | threat       | deep-water  |
+---------------+---------------+-----------+------------+-----------+-------------+--------------+-------------+
</pre>

Required standard packages
--------------------------

In [None]:
import logging, re, os, bz2, pickle
from collections import Counter
from operator import itemgetter
import itertools

In [None]:
## Unicode wrapper for reading & writing csv files.
import unicodecsv as csv

## Lighter weight than pandas -- tabular display of tables.
from tabulate import tabulate

## In order to strip out the text from the xml formatted data.
from bs4 import BeautifulSoup
import lxml

Required data science packages
------------------------------

In [None]:
## First the usual suspects: numpy, scipy, and gensim
import numpy as np
import scipy as sp
import gensim

## For scraping text out of a wikipedia dump. Get dumps at https://dumps.wikimedia.org/backup-index.html
from gensim.corpora import WikiCorpus

## For computing phrases from input text.
from gensim.models.phrases import Phrases

from textblob import TextBlob
import nltk

## Latest greatest word vectors (see https://pypi.python.org/pypi/fasttext).
import fasttext

## Latest greatest hierarchical clustering package. 
## Word vectors are clustered, with deeper trees indicating core topics.
import fastcluster

## Use scikit-learn to generate the co-occurrence graph (edge if words in same sentence).
## The degree of each word indicates how strongly it co-occurs.
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

## Use scikit-learn for K-Means clustering: identify topics.
from sklearn.cluster import KMeans

Base data directory and logging.
--------------------------------
The approach currently uses a lot of intermediate files (which is annoying, but means that the project can work on machines with smaller physical memory). The initial data (knowledge base as well as user generated content) and the intermediate files are all kept in the data directory.
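
The cells below repeat one pattern: build an intermediate file only if it is not already on disk. A minimal sketch of that pattern follows; the helper `build_once` and the demo file are hypothetical and are not used elsewhere in this notebook.

```python
import os
import tempfile


def build_once(path, build):
    """Call build(path) only if path does not exist yet; return True if built."""
    if os.path.isfile(path):
        return False
    build(path)
    return True


def write_demo(path):
    # Stand-in for an expensive step such as stripping a Wikipedia dump.
    with open(path, 'w') as fp:
        fp.write('one sentence per line\n')


demo_path = os.path.join(tempfile.mkdtemp(), 'corpus.txt')
first = build_once(demo_path, write_demo)    # file is created
second = build_once(demo_path, write_demo)   # file is reused, nothing is rebuilt
```

Deleting an intermediate file is then enough to force that step to be recomputed on the next run.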

In [None]:
data_directory = 'data/'
model_directory = 'models/'

In [None]:
from imp import reload
reload(logging)

LOG_FILENAME = data_directory + 'vec2topic.log'
#logging.basicConfig(filename=LOG_FILENAME,level=logging.INFO)
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()
logger.setLevel(logging.INFO)
formatter = logging.Formatter('%(asctime)s %(message)s',"%b-%d-%Y %H:%M:%S")
logger.handlers[0].setFormatter(formatter)

List of intermediate files.
---------------------------
The (global) knowledge base is built off a (large) dataset.

In [None]:
## Main inputs to program. Data for model and name of knowledge base (background language model).
knowledge_base = 'simplewiki-20160820-pages-articles.xml.bz2'
knowledge_base_prefix = 'simplewiki-20160820-pages-articles'
knowledge_base_vector_dimension = 200    # Word vector dimension for knowledge base.

## Intermediate files generated from inputs.
knowledge_base_text = data_directory + knowledge_base_prefix + '.txt'
knowledge_base_phrases = data_directory + knowledge_base_prefix + '_phrases.txt'
knowledge_base_model = model_directory + knowledge_base_prefix + '.bin'
knowledge_base_vectors = model_directory + knowledge_base_prefix + '.vec'
knowledge_base_vectors_tsne = model_directory + knowledge_base_prefix + '_vec_tsne.txt'
knowledge_base_vocab = model_directory + knowledge_base_prefix + '_vocab.txt'
knowledge_base_bigrams = data_directory + knowledge_base_prefix + '_bigrams.pkl'

The (local) user generated content. Sample data from OpenSubtitles: http://opus.lingfil.uu.se/OpenSubtitles2016/xml/en/2015/369610/6300079.xml.gz

In [None]:
## Main inputs to program (data for user content and name of local model).
local_content_name = 'ken_lay_text'
#local_content_name = 'OpenSubtitles2016_xml_en_2015_369610_6300079'
local_content_vector_dimension = 25

## File names for user content.
local_content = data_directory + local_content_name 
local_content_xml = local_content + '.xml'
local_content_txt = local_content + '.txt'
local_content_phrases = local_content + '_phrases.txt'

## Intermediate files resulting from computation of word embeddings using fastText package.
local_content_vectors = model_directory + local_content_name + '.vec'
local_content_model = model_directory + local_content_name + '.bin'

## Projected 2D vectors useful for visualization.
#local_content_vectors_tsne = data_directory + local_content_name + '_vec_tsne.txt'

In [None]:
combined_vectors = model_directory + local_content_name + '.combined_vectors.txt'
combined_vectors_tsne = model_directory + local_content_name + '.combined_vectors_tsne.txt'

Global knowledge vectors -- English Wikipedia
---------------------------------------------
First step is to compute word embeddings of a global knowledge base from the English Wikipedia to capture the generic meaning of words in widely used contexts.

The gensim package has examples of processing Wikipedia dumps as well as a streaming corpus implementation. The article glosses over these steps, and the sample GitHub code grabs an undocumented data set from the authors' Dropbox account.

### Process wikipedia dump
First download the wikipedia dump and place it in the data directory before running this notebook. The cell below will use the gensim class WikiCorpus to strip the wikipedia markup and store each article as one line of the output text file. Only do these computations once if possible.

In [None]:
knowledge_base_text

In [None]:
if not os.path.isfile(knowledge_base_text):
    space = ' '
    i = 0
    output = open(knowledge_base_text, 'wb')
    logger.info('Processing knowledge base %s', knowledge_base)
    wiki = WikiCorpus(data_directory + knowledge_base, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")
    output.close()
    logger.info("Finished Saved " + str(i) + " articles")
else:
    logger.info('Knowledge base %s already on disk.', knowledge_base_text)

TODO: Use gensim to compute phrases.
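
For orientation, phrase detection of this kind scores each adjacent word pair by how much more often it occurs than its parts would suggest. The sketch below uses the rule (count(a b) - min_count) / (count(a) * count(b)) * vocabulary size, which is the form I understand gensim's default scorer to follow. It is an illustration under assumptions, not gensim's implementation: `bigram_scores` is a hypothetical helper, the sentences are invented, and the vocabulary size here counts distinct single words only.

```python
from collections import Counter


def bigram_scores(sentences, min_count=1):
    """Score every adjacent word pair; higher means more phrase-like."""
    unigrams = Counter(w for sent in sentences for w in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)
    scores = {}
    for (a, b), count in bigrams.items():
        scores[(a, b)] = (count - min_count) / float(unigrams[a] * unigrams[b]) * vocab_size
    return scores


toy_sentences = [['san', 'diego', 'is', 'sunny'],
                 ['i', 'like', 'san', 'diego'],
                 ['san', 'diego', 'zoo'],
                 ['the', 'zoo', 'is', 'big']]
toy_scores = bigram_scores(toy_sentences)
```

Pairs whose score exceeds a threshold are joined with an underscore (for example `san_diego`), which is the form the vocabulary checks later in this notebook look for.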

In [None]:
ls {data_directory + knowledge_base}

In [None]:
count = 0
fp = bz2.BZ2File(data_directory + knowledge_base,'rU')
for title, text, pageid in gensim.corpora.wikicorpus.extract_pages(fp):
#for (tokens, (pageid, title)) in wiki.get_texts():
    if count >= 1: break
    count += 1
    text = gensim.corpora.wikicorpus.filter_wiki(text)
    sents = nltk.sent_tokenize(text.lower())
    print pageid, title
    for sent in sents:
        print sent
        
fp.close()

In [None]:
def read_sents_from_data(path):
    with bz2.BZ2File(path, 'rb') as data:    
        for title, text, pageid in gensim.corpora.wikicorpus.extract_pages(data):
            text = gensim.corpora.wikicorpus.filter_wiki(text)
            sents = nltk.sent_tokenize(text.lower())
            for sent in sents:
                yield nltk.word_tokenize(sent)

In [None]:
count = 0
for sent in read_sents_from_data(data_directory + knowledge_base):
    if True and count >= 10:
        break
    count += 1
    print sent

*TODO:* Determine optimal values of `max_vocab_size` for 16G RAM.

In [None]:
kb_bigrams = Phrases(read_sents_from_data(data_directory + knowledge_base), threshold=100.0)

In [None]:
for phrase, score in kb_bigrams.export_phrases(read_sents_from_data(data_directory + knowledge_base)):
    print(u'{0}\t{1}'.format(phrase, score))

In [None]:
if not os.path.isfile(knowledge_base_bigrams):
    ## Pickle files must be opened in binary mode.
    with open(knowledge_base_bigrams, 'wb') as bigrams_fp:
        pickle.dump(kb_bigrams, bigrams_fp)
    logger.info('Saved copy of knowledge base bigrams %s', knowledge_base_bigrams)
else:
    ## pickle.load takes only the file object and returns the stored model.
    with open(knowledge_base_bigrams, 'rb') as bigrams_fp:
        kb_bigrams = pickle.load(bigrams_fp)
    logger.info('Read copy of knowledge base bigrams %s', knowledge_base_bigrams)

In [None]:
#for s in kb_trigrams[[sent.split() for sent in sents]]:
#    print ' '.join(s)

In [None]:
#kb_trigrams = Phrases(kb_bigrams[read_sents_from_data(data_directory + knowledge_base)])

In [None]:
#for phrase, score in kb_trigrams.export_phrases(read_sents_from_data(data_directory + knowledge_base)):
#    print(u'{0}\t{1}'.format(phrase, score))

In [None]:
if not os.path.isfile(knowledge_base_phrases):
    with open(knowledge_base_phrases, 'w') as data:
        for sent in read_sents_from_data(data_directory + knowledge_base):
            s = ' '.join(kb_bigrams[sent]) + u'\n'
            data.write(s.encode('utf-8'))
    logger.info('Saved copy of knowledge base phrases %s', knowledge_base_phrases)
else:    
    logger.info('Copy of knowledge base phrases %s on disk.', knowledge_base_phrases)

### Compute word vectors for knowledge base

Some performance notes comparing `word2vec` and `fasttext`.

To train on the full Wikipedia with `word2vec`, using 300 dimensional word vectors, the vocabulary needs to be filtered so that the basic memory usage of `word2vec` fits in physical memory.

> the `syn0` structure holding (input) word-vectors-in-training will require:
> 5759121 (your vocab size) * 600 (dimensions) * 4 bytes/dimension = 13.8GB
> The `syn1neg` array (hidden->output weights) will require another 13.8GB.
<pre>
min_count = 10 results in 2,947,700 words (requires more than 7G physical memory)
min_count = 5 results in 4,733,171 words (requires more than 11G physical memory)
min_count = 0 results in 11,631,317 words (requires more than 28G physical memory)
</pre>
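
The arithmetic in the quoted estimate is easy to reproduce. The helper below is a hypothetical convenience function that multiplies vocabulary size, vector dimension and bytes per value (4 for 32-bit floats); it estimates a single weight array only and ignores every other data structure.

```python
def embedding_bytes(vocab_size, dimensions, bytes_per_value=4):
    """Size in bytes of one dense vocab_size x dimensions float array."""
    return vocab_size * dimensions * bytes_per_value


# Figures taken from the quoted estimate above.
syn0_gb = embedding_bytes(5759121, 600) / 1e9
```

Raising `min_count` shrinks the vocabulary, and the estimate shrinks in direct proportion.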

Using `fasttext` with `min_count=5`, `bucket=2000000`, and `t=1e-4` on enwiki, memory use stayed at a constant 8.43G during the computation (over 10 hours on an 8-core machine with 16G RAM). The final vocabulary has 2,114,311 words.

In [None]:
if not os.path.isfile(knowledge_base_vectors):
    knowledge_base_skipgram = fasttext.skipgram(knowledge_base_phrases, 
        model_directory + knowledge_base_prefix, lr=0.02, 
        dim=knowledge_base_vector_dimension, ws=5, word_ngrams=1,
        epoch=1, min_count=5, neg=5, loss='ns', bucket=2000000, minn=3, maxn=6,
        thread=8, t=1e-4, lr_update_rate=100)
else:
    logger.info('Knowledge vectors %s already on disk.', knowledge_base_vectors)
    knowledge_base_skipgram = fasttext.load_model(knowledge_base_model)

In [None]:
len(knowledge_base_skipgram.words)

Simple test to check that the model was created or read correctly.

In [None]:
print u'supermarket' in knowledge_base_skipgram
print u'san_diego' in knowledge_base_skipgram
print u'San_Diego' in knowledge_base_skipgram

Create a counter to keep track of the knowledge base vocabulary. Later the sample code uses this to find the vocabulary in common between the knowledge base and the user generated data. Try to process both data sets in the same way.

In [None]:
knowledge_base_exist = Counter()
for w in knowledge_base_skipgram.words:
    knowledge_base_exist[w.lower()] = w.lower()
knowledge_base_vocab_lowercase = knowledge_base_exist.keys()

In [None]:
logger.info('funky: %s', knowledge_base_exist[u'funky'])
logger.info('san_diego: %s', knowledge_base_exist[u'san_diego'])

User content vectors -- OpenSubtitles2016
-----------------------------------------
OpenSubtitles is a very useful project for language analysis since it has a decent collection of parallel sentences: the foreign language captions that enthusiasts have created for their favorite movies.

---
Start with an `input.xml` file listing captions from a foreign film, obtained from the OpenSubtitles project.

<pre>
BeautifulSoup:                     input.xml -> input.txt 
                           local_content_xml -> local_content_txt   
</pre>
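
The conversion amounts to taking the text of every `<s>` element, lowercasing it and collapsing whitespace. The sketch below shows the same idea with the standard library's `xml.etree.ElementTree` instead of BeautifulSoup, so it runs without extra packages. The sample XML is invented and only loosely shaped like a subtitle file; `sentences_from_xml` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Invented sample: each <s> is one caption, each <w> one token.
SAMPLE_XML = """<document>
  <s id="1"><w id="1.1">Hello</w> <w id="1.2">there</w></s>
  <s id="2"><time id="T1S" value="00:00:05,000" /><w id="2.1">Good</w>   <w id="2.2">Morning</w></s>
</document>"""


def sentences_from_xml(xml_text):
    """Return one lowercased, whitespace-normalized line per <s> element."""
    root = ET.fromstring(xml_text)
    lines = []
    for s in root.iter('s'):
        text = ' '.join(''.join(s.itertext()).lower().split())
        if text:
            lines.append(text)
    return lines


lines = sentences_from_xml(SAMPLE_XML)
```

Elements without text, such as the timing tag in the second caption, contribute nothing to the output line.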

In [None]:
local_content_txt

In [None]:
with open(local_content_xml,'r') as fp:
    soup = BeautifulSoup(fp,'lxml')

In [None]:
with open(local_content_txt,'w') as fp:
    for s in soup.findAll('s'): 
        text = ' '.join(s.text.strip().lower().split())
        fp.write(text.encode('utf-8') + '\n')

In [None]:
## After processing, each line is a sentence.
## Read in the lines, skipping empty ones, to build a list of sentences.
text_lines = []
with open(local_content_txt, 'rb') as local_content_file:
    for line in local_content_file:
        line = line.strip()
        if line != '':
            text_lines.append(line)
for line in text_lines[0:3]:
    logger.info('%s', line)
logger.info("Text lines: %d", len(text_lines))

## Split each sentence into a list of whitespace-delimited tokens.
sentences = [line.split() for line in text_lines]
for sent in sentences[0:3]:
    logger.info('%s', sent)

In [None]:
lc_bigrams = kb_bigrams[sentences]

In [None]:
for phrase, score in kb_bigrams.export_phrases(sentences):
    print(u'{0}\t{1}'.format(phrase, score))

In [None]:
count = 0
with open(local_content_phrases, 'w') as data:
    for sent in kb_bigrams[sentences]:
        if False and count >= 100: break
        count += 1
        s = ' '.join(sent) + u'\n'
        data.write(s.lower().encode('utf-8'))
        
logger.info('Wrote %s sentences.', count)

In [None]:
def read_nouns_from_pos_data(path):
    with open(path, 'rU') as data:
        reader = csv.reader(data, delimiter=' ')
        for row in reader:
            nouns = []
            blob=TextBlob(' '.join(row))
            for word,tag in blob.tags:
                if tag in ['NN','NNP','NNS','NNPS']:
                    nouns.append(word) 
            yield nouns

In [None]:
sentences_nouns = []
count = 0
for nouns in read_nouns_from_pos_data(local_content_phrases): 
    if False and count > 10: 
        break
    count += 1
    sentences_nouns.append(nouns)
    
logger.info('%d', len(sentences_nouns))

Compute word vectors
--------------------

In [None]:
data_directory + local_content_name

In [None]:
if not os.path.isfile(local_content_vectors):
    local_content_skipgram = fasttext.skipgram(local_content_phrases, model_directory + local_content_name, 
        lr=0.02, dim=local_content_vector_dimension, ws=5, word_ngrams=1,
        epoch=1, min_count=0, neg=5, loss='ns', bucket=2000000, minn=3, maxn=6,
        thread=8, t=1e-4, lr_update_rate=100)
else:
    logger.info('Local vectors %s already on disk.', local_content_vectors)
    local_content_skipgram = fasttext.load_model(local_content_model)

In [None]:
logger.info('Creating word vecs')

words=[w for text in sentences_nouns for w in text]
Vocab=set(words)

model_comb={}
model_comb_vocab=[]

common_vocab=set(knowledge_base_vocab_lowercase).intersection(local_content_skipgram.words).intersection(Vocab)

for w in common_vocab:
    if len(w)>2:
        model_comb[w]=np.array(np.concatenate((knowledge_base_skipgram[w],local_content_skipgram[w])))
        model_comb_vocab.append(w)
    else:
        logger.info(w)
        
logger.info('Length of common_vocab = %d', len(common_vocab))

In [None]:
print len(set(knowledge_base_skipgram.words))
print len(set(local_content_skipgram.words))
print len(Vocab)

In [None]:
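## Close the file before bhtsne reads it, so that all rows are flushed to disk.
with open(combined_vectors, 'wb') as vectors_fp:
    writer = csv.writer(vectors_fp, delimiter='\t')
    for k in model_comb.keys():
        writer.writerow(model_comb[k])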

In [None]:
combined_vectors

In [None]:
!/root/bhtsne/bhtsne.py -i {combined_vectors} -o {combined_vectors_tsne} -d 2 -v

In [None]:
import pandas as pd
from bokeh.charts import Scatter, output_notebook, show

In [None]:
output_notebook()

In [None]:
reader = csv.reader(open(combined_vectors_tsne), delimiter='\t')
count = 0
X_full = []
for row in reader:
    if False and count >= 10000:
        break
    X_full.append(np.array([float(p) for p in row]))
    count += 1

In [None]:
df = pd.DataFrame(X_full, columns=['x','y'])

In [None]:
df.tail()

In [None]:
p = Scatter(df, x='x', y='y', color='blue')
show(p)

In [None]:
## Collect all nouns found in the emails, and the set of distinct nouns (the vocabulary)
words=[w for text in sentences_nouns for w in text]
Vocab=set(words)

In [None]:
###Helper Functions
def norm(a):
    return np.sqrt(np.sum(np.square(a)))

def cosine(a,b):
    return 1-np.dot(a,b)/np.sqrt(np.sum(a**2)*np.sum(b**2))

def l1(a,b):
    return abs(a-b).sum()

def l2(a,b):
    return np.sqrt(np.square(a-b).sum())

In [None]:
### Create the list of words to be clustered from a model. Words must pass the l2_threshold, min_count and
### min_length filters; vectors can optionally be normalized, and words repeated once per occurrence.
def create_word_list(model,vocab,features,Texts,repeat=True,l2_threshold=0,normalized=True,min_count=100,min_length=0):
    data_d2v=[]
    word_d2v=[]
    words_text=[w for text in Texts for w in text]
    count=Counter(words_text)
    if repeat:
        for text in Texts:
            for w in text:
                if w in vocab and count[w]>min_count:
                    if len(w)>min_length and l2(model[w],np.zeros(features))>l2_threshold:
                        if normalized:
                            data_d2v.append(model[w]/l2(model[w],np.zeros(features)))
                        else:
                            data_d2v.append(model[w])
                        word_d2v.append(w)
    else:
        A=set(words_text)
        for w in vocab:
            if w in A and len(w)>min_length and l2(model[w],np.zeros(features))>l2_threshold and count[w]>min_count:
                if normalized:
                    data_d2v.append(model[w]/l2(model[w],np.zeros(features)))
                else:
                    data_d2v.append(model[w])
                word_d2v.append(w)

    return data_d2v, word_d2v

In [None]:
#Run Agglomerative clustering
logger.info('Clustering for depth...')
local_vec = True

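## Feature count = knowledge base dimension (200) plus local content dimension (25).
data_d2v,word_d2v=create_word_list(model_comb,model_comb_vocab,knowledge_base_vector_dimension+local_content_vector_dimension*local_vec,sentences_nouns,repeat=False,normalized=True,min_count=0,l2_threshold=0)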
spcluster=fastcluster.linkage(data_d2v,method='average',metric='cosine')

In [None]:
def calculate_depth(spcluster,words, num_points):
    cluster=[[] for w in xrange(2*num_points)]
    c=Counter()
    for i in xrange(num_points):
        cluster[i]=[i]

    for i in xrange(len(spcluster)):
        x=int(spcluster[i,0])
        y=int(spcluster[i,1])
        xval=[w for w in cluster[x]]
        yval=[w for w in cluster[y]]
        cluster[num_points+i]=xval+yval
        for w in cluster[num_points+i]:
            c[words[w]]+=1
        cluster[x][:]=[]
        cluster[y][:]=[]    
    return c
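
A small worked example may help in reading the depth count. The sketch below re-implements the same idea on a hand-written merge list: leaves are numbered 0 to n-1, and the i-th merge creates cluster n+i, as in a SciPy-style linkage matrix (only the first two columns are kept here). The helper `merge_depth` and the four words are assumptions for illustration.

```python
from collections import Counter


def merge_depth(linkage_rows, words):
    """Count how many merges each word takes part in."""
    n = len(words)
    members = {i: [i] for i in range(n)}    # cluster id -> leaf indices
    depth = Counter()
    for step, (left, right) in enumerate(linkage_rows):
        merged = members.pop(left) + members.pop(right)
        members[n + step] = merged
        for leaf in merged:
            depth[words[leaf]] += 1
    return depth


toy_words = ['market', 'trade', 'fax', 'street']
# market+trade merge first, then fax joins them, then street joins last.
toy_rows = [(0, 1), (4, 2), (5, 3)]
toy_depth = merge_depth(toy_rows, toy_words)
```

Words that merge early sit in a tight cluster and are counted again at every later merge, so they end up with the largest depth.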

In [None]:
##Calculate depth of words
num_points=len(data_d2v)
depth=calculate_depth(spcluster,word_d2v,num_points)

In [None]:
%matplotlib notebook
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram

In [None]:
plt.figure()
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
    spcluster,
    truncate_mode='lastp',
    p=20,
    leaf_rotation=90.,  # rotates the x axis labels
    leaf_font_size=12.,  # font size for the x axis labels
    show_contracted=True,
)
plt.tight_layout()

In [None]:
logger.info('Computing co-occurrence graph')

T=[' '.join(w) for w in sentences_nouns]

In [None]:
for line in T[0:10]: print line

In [None]:
logger.info(len(T))

In [None]:
##Co-occurrence matrix
cv=CountVectorizer(token_pattern=u'(?u)\\b([^\\s]+)')
bow_matrix = cv.fit_transform(T)
id2word={}
for key, value in cv.vocabulary_.items():
    id2word[value]=key

ids=[]
for key,value in cv.vocabulary_.iteritems():
    if key in model_comb_vocab:
        ids.append(value)

sort_ids=sorted(ids)
bow_reduced=bow_matrix[:,sort_ids]
normalized = TfidfTransformer().fit_transform(bow_reduced)
similarity_graph_reduced=bow_reduced.T * bow_reduced

In [None]:
##Build an unweighted co-occurrence graph: every pair of distinct words that share a sentence gets an edge of weight 1
logger.info('Computing degree')
m,n=similarity_graph_reduced.shape

cx=similarity_graph_reduced.tocoo()
keyz=[id2word[sort_ids[w]] for w in xrange(len(sort_ids))]
data=[]
ro=[]
co=[]
for i,j,v in itertools.izip(cx.row, cx.col, cx.data):
    if v>0 and i!=j:
        value=1
        if value>0:
            ro.append(i)
            co.append(j)
            data.append(value)

SS=sp.sparse.coo_matrix((data, (ro, co)), shape=(m,n))
SP_full=SS.tocsc()
id_word={w:id2word[sort_ids[w]] for w in xrange(len(sort_ids))}
word_id={value:key for key,value in id_word.items()}
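
Because every edge in this graph has weight 1, the degree of a word is simply the number of distinct other words it shares a sentence with. The plain-Python sketch below computes the same quantity without sparse matrices; `cooccurrence_degree` and the toy sentences are hypothetical.

```python
from collections import defaultdict


def cooccurrence_degree(sentences):
    """Number of distinct words each word shares at least one sentence with."""
    neighbours = defaultdict(set)
    for sent in sentences:
        unique = set(sent)
        for w in unique:
            neighbours[w].update(unique - {w})
    return {w: len(ns) for w, ns in neighbours.items()}


toy_degree = cooccurrence_degree([['market', 'trade', 'summit'],
                                  ['market', 'pipeline'],
                                  ['fax']])
```

A word that only ever appears alone in a sentence, like `fax` here, gets degree 0.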

In [None]:
logger.info('Computing metrics')
#compute metrics
degsum=SP_full.sum(axis=1)
deg={}
for x in xrange(len(sort_ids)):
    deg[id2word[sort_ids[x]]]=int(degsum[x])

max_deg=max(deg.values())
max_depth=max(depth.values())

temp_deg_mod={w:np.log(1+deg[w])/np.log(1+max_deg) for w in deg.iterkeys()}
alpha=np.log(0.5)/np.log(np.median(temp_deg_mod.values()))
deg_mod={key:value**alpha for key, value in temp_deg_mod.iteritems()}

temp={key:value*1./max_depth for key, value in depth.iteritems()}
alpha=np.log(0.5)/np.log(np.median(temp.values()))
depth_mod={key:value**alpha for key, value in temp.iteritems()}

#temp={key:deg_mod[key]*depth_mod[key] for key in depth_mod.iterkeys()}
temp = {}
for key in depth_mod.iterkeys():
    if key in deg_mod:
        temp[key] = deg_mod[key]*depth_mod[key]
max_metric=np.max(temp.values())
metric={key:value*1./max_metric for key,value in temp.iteritems()}

In [None]:
logger.info('max_deg = %s, max_depth = %s',max_deg, max_depth)

In [None]:
##Kmeans
NUM_TOPICS = 30
K=NUM_TOPICS
kmeans=KMeans(n_clusters=K)
kmeans.fit([w for w in data_d2v])
kmeans_label={word_d2v[x]:kmeans.labels_[x] for x in xrange(len(word_d2v))}

kmeans_label_ranked={}

topic=[[] for i in xrange(K)]
clust_depth=[[] for i in xrange(K)]
for i in xrange(K):
    topic[i]=[word_d2v[x] for x in xrange(len(word_d2v)) if kmeans.labels_[x]==i]
    #temp_score=[metric[w] for w in topic[i]]
    temp_score = []
    for w in topic[i]:
        if w in metric: temp_score.append(metric[w])
    clust_depth[i]=-np.mean(sorted(temp_score,reverse=True)[:])#int(np.sqrt(len(topic[i])))])
index=np.argsort(clust_depth)
for num,i in enumerate(xrange(K)):
    for w in topic[index[i]]:
        kmeans_label_ranked[w]=i
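
The ranking logic in the cell above can be summarized as: group words by cluster label, average the word scores within each cluster, and order the clusters by that average. A minimal sketch with invented labels and scores follows; `rank_topics` is a hypothetical helper, not part of the original code.

```python
def rank_topics(labels, scores):
    """labels: word -> cluster id, scores: word -> importance.

    Returns the clusters as word lists, best cluster first,
    with the words inside each cluster sorted by descending score.
    """
    topics = {}
    for word, label in labels.items():
        if word in scores:               # skip words that have no score
            topics.setdefault(label, []).append(word)
    ranked = []
    for members in topics.values():
        members.sort(key=lambda w: scores[w], reverse=True)
        mean = sum(scores[w] for w in members) / float(len(members))
        ranked.append((mean, members))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return [members for mean, members in ranked]


toy_labels = {'market': 0, 'trade': 0, 'fax': 1, 'street': 1, 'tax': 2}
toy_metric = {'market': 1.0, 'trade': 0.6, 'fax': 0.1, 'street': 0.3, 'tax': 0.5}
toy_topics = rank_topics(toy_labels, toy_metric)
```

Averaging over all words in a cluster means a single high-scoring word cannot lift an otherwise weak cluster to the top.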

In [None]:
logger.info('Done...Generating output')
lister=[]
to_show=K
to_show_words=20 #the maximum number of words of each type to display
for i in xrange(to_show):
    top=topic[index[i]]
    #sort_top=[w[0] for w in sorted([[w,metric[w]] for w in top],key=itemgetter(1),reverse=True)]
    sort_tmp = []
    for w in top:
        if w in metric: sort_tmp.append([w,metric[w]])
    sort_top=[w[0] for w in sorted(sort_tmp,key=itemgetter(1),reverse=True)]
    lister.append(['Topic %d' %(i+1)]+sort_top[:to_show_words])

max_len=max([len(w) for w in lister])
new_list=[]
for list_el in lister:
    new_list.append(list_el + [''] * (max_len - len(list_el)))
Topics=list(itertools.izip_longest(*new_list))
#X.insert(len(X),[-int(clust_depth[index[w]]*100)*1./100 for w in xrange(K)])
sorted_words=[w[0] for w in sorted(metric.items(),key=itemgetter(1),reverse=True)][:to_show_words]

In [None]:
score_words = sorted_words
deep_words = [w[0] for w in depth.most_common(to_show_words)]
filer = 'wiki_simple.txt'
outfile_topics = data_directory + filer.split('.')[0] + '_topics.csv'
outfile_score = data_directory + filer.split('.')[0] + '_score.csv'
outfile_depth = data_directory + filer.split('.')[0] + '_depth.csv'
## Write each result inside a context manager so the files are flushed and closed.
with open(outfile_topics, 'wb') as b:
    csv.writer(b).writerows(Topics)
with open(outfile_score, 'wb') as b:
    csv.writer(b).writerows([[w] for w in score_words])
with open(outfile_depth, 'wb') as b:
    csv.writer(b).writerows([[w] for w in deep_words])

In [None]:
score_words[0:10]

In [None]:
columns = 8
print 'Total number of Topics = {}.'.format(K)
for j in range((K + columns - 1) // columns):    # ceiling division avoids an empty trailing table
    first = j*columns + 1
    last = min(j*columns + columns, K)
    print 'Displaying Topics {} thru {}.'.format(first, last)
    print tabulate([Topics[i][first-1:last] for i in range(0,11)], tablefmt=u'psql')