In [1]:
import os
import io
from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
from gensim.corpora import Dictionary
from gensim.models import Phrases,TfidfModel
from gensim.models import LdaModel
from gensim.models import CoherenceModel

Below I read the corpus directory, which contains the subfolders nips00 to nips12 and their text files.

In [2]:
directory = 'nipstxt/'
index = ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
folders = ['nips' + num for num in index]
# Read all texts into a list.
docs = []
counter=0
my_dict={}
for folder in folders:
    files = os.listdir(directory + folder)
    for filen in files:
        my_dict[counter]=folder+"/"+filen
        counter+=1
        # Note: ignoring characters that cause encoding errors.
        with io.open(directory + folder + '/' + filen, encoding="utf-8",errors="ignore") as fid:
            txt = fid.read()
        #Append the text we read in a list containing all text from the documents.
        docs.append(txt)


In [3]:
print 'Number of documents:',len(docs)

Number of documents: 1740


# Part 1: Natural Language Processing (NLP)

Split each text into words. RegexpTokenizer splits based on the regular expression passed as an argument; here \w+ keeps runs of word characters. We then remove tokens that are purely numeric and tokens with three or fewer characters.

In [4]:
to_token = RegexpTokenizer(r'\w+')
word_doc=[]
for idx in range(len(docs)):
    word_doc.append(to_token.tokenize(docs[idx].lower()))
clean_docs=[]
#Remove tokens that are purely numeric and tokens with three or fewer characters.
clean_docs = [[token for token in doc if not token.isnumeric()] for doc in word_doc]
clean_docs = [[token for token in doc if len(token) > 3] for doc in clean_docs]
print 'Clean docs:',clean_docs[0][0:20]

Clean docs: [u'connecting', u'past', u'bruce', u'macdonald', u'assistant', u'professor', u'knowledge', u'sciences', u'laboratory', u'computer', u'science', u'department', u'university', u'calgary', u'university', u'drive', u'calgary', u'alberta', u'abstract', u'ecently']
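
As a rough cross-check of the cell above, the same filtering can be sketched with only the standard library. This assumes that RegexpTokenizer(r'\w+') with its default settings behaves like re.findall with the same pattern. The helper name tokenize_and_filter and the sample sentence are made up for illustration.

```python
import re

def tokenize_and_filter(text, min_len=4):
    """Lowercase, split on runs of word characters, drop numbers and short tokens."""
    tokens = re.findall(r'\w+', text.lower())
    return [t for t in tokens if not t.isnumeric() and len(t) >= min_len]

# Made-up sample sentence, not taken from the corpus.
sample = u"Connecting to the Past, 1988: a NIPS paper by Bruce MacDonald"
filtered = tokenize_and_filter(sample)
```

Short function words such as "to", "the" and "by" disappear because of the length filter alone, which is likely why the later stopword step changes so little in the first document.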


We use the WordNet lemmatizer from NLTK. It produces more readable output than a stemmer, which would truncate the words and make them very difficult to recognize.

e.g. lemmatizer.lemmatize('dogs') -> dog

In [5]:
lemmatizer = WordNetLemmatizer()
lemma_docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in clean_docs]
print lemma_docs[0][0:20]

[u'connecting', u'past', u'bruce', u'macdonald', u'assistant', u'professor', u'knowledge', u'science', u'laboratory', u'computer', u'science', u'department', u'university', u'calgary', u'university', u'drive', u'calgary', u'alberta', u'abstract', u'ecently']


Remove stopwords from the documents.

In [6]:
stopwords_set = set(stopwords.words("english"))
clean_sw = [[token for token in doc if token not in stopwords_set] for doc in lemma_docs]
print clean_sw[0][0:20]

[u'connecting', u'past', u'bruce', u'macdonald', u'assistant', u'professor', u'knowledge', u'science', u'laboratory', u'computer', u'science', u'department', u'university', u'calgary', u'university', u'drive', u'calgary', u'alberta', u'abstract', u'ecently']


We find bigrams and trigrams in the documents. Bigrams and trigrams are sequences of two or three words whose combined meaning would be lost if each word were treated separately. We want to keep as much context as we can from the documents.

In [7]:
bigram = Phrases(clean_sw, min_count=5)
trigram = Phrases(bigram[clean_sw])
for idx in range(len(clean_sw)):
    for token in trigram[bigram[clean_sw[idx]]]:
        if '_' in token:
            # Token is a bigram or trigram, add to document.
            clean_sw[idx].append(token)
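
To give an intuition for how Phrases decides that two adjacent tokens form a bigram, here is a minimal sketch of a collocation score. It follows my understanding of gensim's default scorer, (count(a, b) - min_count) * vocab_size / (count(a) * count(b)), compared against a threshold that I assume defaults to 10.0; check the documentation of your gensim version. The function name phrase_score and all counts are made up.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count=5):
    """Score a candidate bigram; higher means the pair co-occurs more than chance suggests."""
    return (count_ab - min_count) * vocab_size / float(count_a * count_b)

# Made-up counts for illustration only.
# Two words that almost always appear together:
strong = phrase_score(count_a=120, count_b=150, count_ab=100, vocab_size=8000)
# Two very frequent words that are rarely adjacent:
weak = phrase_score(count_a=5000, count_b=4000, count_ab=30, vocab_size=8000)

threshold = 10.0  # assumed default threshold in Phrases
```

Only pairs scoring above the threshold would be joined with an underscore, which is what the '_' check in the loop above relies on.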



We must try to keep the corpus with the most important words for identifying a topic. For that reason we remove tokens that appear in fewer than 20 documents or in more than 50 percent of the documents.

In [8]:
dictionary = Dictionary(clean_sw)
dictionary.filter_extremes(no_below=20,no_above=0.5)
corpus = [dictionary.doc2bow(doc) for doc in clean_sw]
print 'Dictionary contains ',len(dictionary),'words.'

Dictionary contains  7731 words.
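
The filtering rule can be sketched in plain Python to make the two thresholds concrete. This is a simplified stand-in for filter_extremes, not gensim's implementation (it ignores the keep_n limit, for example). The function name filter_by_doc_frequency and the toy documents are illustrative.

```python
from collections import Counter

def filter_by_doc_frequency(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = int(no_above * len(docs))
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

toy_docs = [
    ['network', 'neuron', 'spike'],
    ['network', 'neuron', 'kernel'],
    ['network', 'kernel', 'margin'],
    ['network', 'bayes', 'prior'],
]
kept = filter_by_doc_frequency(toy_docs, no_below=2, no_above=0.5)
```

Here 'network' is dropped because it occurs in every document, and the words seen in a single document are dropped as too rare, leaving only 'neuron' and 'kernel'.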


# Part 2 : Topic Modelling with LDA
First, we search over candidate parameter values for the LDA algorithm.
The search varies alpha and num_topics, while iterations is held fixed at 100. The three parameters are:


    alpha: hyperparameter that affects the sparsity of the document-topic distribution. The smaller alpha is, the sparser the distribution, so each document concentrates on fewer topics.
    
    num_topics: number of topics the algorithm learns. Each document is described as a mixture of these topics. If we have 10 topics, every document gets a probability for each of them, and we later tag it with the most probable one.
    
    iterations: upper limit on the number of iterations. The algorithm stops either when it converges or when it reaches this limit.
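
The effect of alpha on sparsity can be illustrated by sampling topic mixtures from a symmetric Dirichlet distribution, which is the prior LDA places on each document's topic proportions. The sketch below uses only the standard library (normalized gamma draws). The function names, the number of trials and the seed are arbitrary choices for illustration.

```python
import random

def sample_dirichlet(alpha, num_topics, rng):
    """Draw one topic mixture from a symmetric Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, num_topics=10, trials=2000, seed=1):
    """Average weight of the dominant topic across many sampled mixtures."""
    rng = random.Random(seed)
    shares = [max(sample_dirichlet(alpha, num_topics, rng)) for _ in range(trials)]
    return sum(shares) / trials

sparse = mean_largest_share(0.1)    # small alpha: one topic tends to dominate
smooth = mean_largest_share(100.0)  # large alpha: weight spread almost evenly
```

With a small alpha the dominant topic typically takes most of the weight, while with a large alpha every topic ends up close to 1/num_topics.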
    
    
The best parameters for the algorithm are chosen based on the average coherence over all topics. In general, coherence is a key property of any well-organized text: it describes the degree of logical consistency of the text. For topic models, coherence is a metric that shows how well the top words of a topic fit together. Higher topic coherence means that there is a clear topic that all the words represent, thus it is easier to put a label on them.
<br>
I check numbers of topics from 2 to 47 in steps of 3. The scores here are negative, so better coherence means that the number is closer to 0.
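
To make the metric concrete, here is a minimal sketch of a UMass-style coherence, which I believe is the measure top_topics uses by default. For the top words of a topic, ordered by weight, it sums log((D(w_i, w_j) + 1) / D(w_j)) over all pairs where w_j ranks above w_i, with D counting the documents that contain the given words. gensim's smoothing and averaging may differ from this sketch. The function name and the toy documents are illustrative.

```python
import math

def umass_coherence(top_words, docs):
    """Sum of log((co-document count + 1) / document count) over ordered word pairs.

    Assumes every word in top_words occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            co_count = doc_count(top_words[i], top_words[j])
            score += math.log((co_count + 1.0) / doc_count(top_words[j]))
    return score

toy_docs = [
    ['neuron', 'spike', 'cell'],
    ['neuron', 'spike'],
    ['kernel', 'margin'],
    ['kernel', 'margin', 'neuron'],
]
coherent = umass_coherence(['neuron', 'spike'], toy_docs)
mixed = umass_coherence(['neuron', 'margin'], toy_docs)
```

Words that tend to share documents push the score towards 0, while words that rarely co-occur make it more negative.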


In [9]:
num_topics = [x for x in range(2,50,3)]
eval_every = None  # Don't evaluate model perplexity, takes too much time.
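# Candidate values for the alpha hyperparameter.
alpha=[.1,1,10,100]
# Make an index to word dictionary. Accessing one item forces the
# dictionary to build its id-to-token mapping.
temp=dictionary[0]
id2word = dictionary.id2token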
#Generate all the pairs between alpha parameters and num of topics
params=[[x,y] for x in num_topics for y in alpha]
best=""
b_p=""
alpha=''
i=0
res=[]
prms=[]
perpl=[]
prev_p=num_topics[0]
for param in params:
    
    model = LdaModel(corpus=corpus, id2word=dictionary,
                     chunksize=2000,alpha=param[1],
                     iterations=100, num_topics=param[0], \
                     eval_every=eval_every,
                    random_state=1)
    top_topics = model.top_topics(corpus)
    # Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics. 
    perf=sum([t[1] for t in top_topics])/float(param[0])
    if prev_p!=param[0]:
        print
    prev_p=param[0]
    print param,'->',perf,'|',
    
    if best=="" or best<perf:
        best=perf
        b_p=param[0]
        alpha=param[1]

[2, 0.1] -> -0.956608879685 | [2, 1] -> -0.959847919139 | [2, 10] -> -0.932965129939 | [2, 100] -> -0.908790521225 |
[5, 0.1] -> -0.963019296082 | [5, 1] -> -0.957980551881 | [5, 10] -> -0.94878626625 | [5, 100] -> -0.937217594004 |
[8, 0.1] -> -0.995484224572 | [8, 1] -> -0.943718611905 | [8, 10] -> -0.954397862495 | [8, 100] -> -0.929766899926 |
[11, 0.1] -> -1.01477499865 | [11, 1] -> -1.01235876685 | [11, 10] -> -0.9081712602 | [11, 100] -> -0.917093606377 |
[14, 0.1] -> -1.04803475626 | [14, 1] -> -1.00695324205 | [14, 10] -> -0.920790230708 | [14, 100] -> -0.908822019716 |
[17, 0.1] -> -1.0947001465 | [17, 1] -> -1.02107850526 | [17, 10] -> -0.916283588439 | [17, 100] -> -0.913760670598 |
[20, 0.1] -> -1.10434567972 | [20, 1] -> -1.00832133541 | [20, 10] -> -0.913676328771 | [20, 100] -> -0.911431077751 |
[23, 0.1] -> -1.11922540163 | [23, 1] -> -1.01209554091 | [23, 10] -> -0.915196017669 | [23, 100] -> -0.907300984842 |
[26, 0.1] -> -1.13263716663 | [26, 1] -> -1.00673373169 | 
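
The selection step inside the loop amounts to taking the maximum over the (num_topics, alpha) pairs. A compact sketch, using a handful of the scores printed above rather than the full grid:

```python
# A subset of the (num_topics, alpha) -> average coherence values printed above.
scores = {
    (2, 100): -0.908790521225,
    (11, 10): -0.9081712602,
    (14, 100): -0.908822019716,
    (23, 10): -0.915196017669,
    (23, 100): -0.907300984842,
}

# The score closest to zero is the largest one, so max() picks the best pair.
best_params = max(scores, key=scores.get)
best_topics, best_alpha = best_params
```

Note how small the gaps between the leading candidates are: the best scores differ only in the third decimal place, so the choice of 23 topics should not be read as a sharp optimum.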

Here we train the model with the best parameters found by the above procedure.

In [10]:
print "Best number of topics according to coherence is:",b_p,'and  ',alpha
model = LdaModel(corpus=corpus, id2word=dictionary,chunksize=2000, \
                       iterations=400, num_topics=b_p, \
                       alpha=alpha,
                       eval_every=eval_every,
                       random_state=1)



Best number of topics according to coherence is: 23 and   100


Assign each document to its most probable topic. For each topic, we count its documents and keep the id of the most representative one, i.e. the document with the highest probability for that topic.

In [11]:
cnt=[0 for x in range(b_p)]
repre=[[-1,0] for x in range(b_p)]
dcs=[[] for x in range(b_p)]

for ind in range(len(clean_sw)):
    bow = dictionary.doc2bow(clean_sw[ind])
    v=model[bow]
    elems=[x[1] for x in v]
    topic = max(elems)
    index=0
    for i,j in enumerate(v):
        #Find the topic id whose probability equals the maximum for this document.
        #The highest probability marks the most likely topic for the document.
        if j[1]==topic:
            index=j[0]
            cnt[j[0]]+=1
            if repre[index][1]<=topic:
                repre[index][1]=topic
                repre[index][0]=ind
            break
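
The core of the loop above, picking the most probable topic from a list of (topic id, probability) pairs, can be written more compactly with max and a key function. A minimal sketch, where the example list is a hypothetical result of model[bow] for one document:

```python
def dominant_topic(topic_probs):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])

# Hypothetical output of model[bow] for one document.
example = [(0, 0.05), (3, 0.62), (7, 0.33)]
topic_id, prob = dominant_topic(example)
```

On ties, max returns the first pair with the highest probability, which matches the break in the loop above.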

The number of documents per topic may differ between runs because of the randomness in the algorithm itself.

In [12]:
for i in range(len(cnt)):
    print 'Topic ',i,'had ',cnt[i],' documents.Most representative document with id ',repre[i][0],'.'

Topic  0 had  80  documents.Most representative document with id  1532 .
Topic  1 had  91  documents.Most representative document with id  748 .
Topic  2 had  83  documents.Most representative document with id  1552 .
Topic  3 had  53  documents.Most representative document with id  1255 .
Topic  4 had  74  documents.Most representative document with id  967 .
Topic  5 had  73  documents.Most representative document with id  1286 .
Topic  6 had  72  documents.Most representative document with id  231 .
Topic  7 had  72  documents.Most representative document with id  966 .
Topic  8 had  64  documents.Most representative document with id  31 .
Topic  9 had  58  documents.Most representative document with id  572 .
Topic  10 had  77  documents.Most representative document with id  789 .
Topic  11 had  64  documents.Most representative document with id  51 .
Topic  12 had  92  documents.Most representative document with id  1478 .
Topic  13 had  68  documents.Most representative document 

Show the 20 most representative words for each topic, along with the beginning of the most representative document.

In [13]:
for i in range(len(cnt)):
    print 'Topic',i,':Most representative:'
    print 'Topic words:',model.print_topic(i, 20)
    print "========================================================================="
    print docs[repre[i][0]][0:400]
    print "========================================================================="

Topic 0 :Most representative:
Topic words: 0.004*"cell" + 0.004*"neuron" + 0.004*"class" + 0.003*"field" + 0.003*"hidden" + 0.003*"image" + 0.003*"layer" + 0.003*"recognition" + 0.003*"architecture" + 0.003*"sample" + 0.003*"signal" + 0.002*"noise" + 0.002*"dynamic" + 0.002*"rule" + 0.002*"control" + 0.002*"response" + 0.002*"node" + 0.002*"prediction" + 0.002*"matrix" + 0.002*"threshold"
An Integrated Vision Sensor for the 
Computation of Optical Flow Singular Points 
Charles M. Higgins and Christof Koch 
Division of Biology, 139-74 
California Institute of Technology 
Pasadena, CA 91125 
[chuck, koch] Oklab. caltech. edu 
Abstract 
A robust, integrative algorithm is presented for computing the position of 
the focus of expansion or axis of rotation (the singular point) in optical
Topic 1 :Most representative:
Topic words: 0.006*"neuron" + 0.005*"hidden" + 0.005*"image" + 0.004*"layer" + 0.004*"recognition" + 0.003*"cell" + 0.003*"noise" + 0.003*"signal" + 0.003*"visual" + 0.002*"rule

In [14]:
import pyLDAvis.gensim
import gensim
pyLDAvis.enable_notebook()
data = pyLDAvis.gensim.prepare(model, corpus, dictionary)
pyLDAvis.display(data)



Let's interpret the topic visualization. Notice how topics are shown on the left while words are on the right. Here are the main things you should consider:
<br>
i) Larger topics are more frequent in the corpus.
<br>
ii) Topics closer together are more similar, topics further apart are less similar.
<br>
iii) When you select a topic, you can see the most representative words for the selected topic. This measure can be a combination of how frequent or how discriminant the word is. You can adjust the weight of each property using the slider.
<br>
iv) Hovering over a word will adjust the topic sizes according to how representative the word is for the topic.
<br>
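
The slider in point iii) can be made concrete with a small sketch. As I understand it, pyLDAvis ranks words by the relevance measure of Sievert and Shirley, lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w)), where the slider sets lambda. The function name and the probabilities below are made up for illustration.

```python
import math

def relevance(p_word_given_topic, p_word, lam):
    """Blend in-topic frequency with lift.

    lam=1 ranks by frequency within the topic only, lam=0 by lift only.
    """
    return (lam * math.log(p_word_given_topic)
            + (1.0 - lam) * math.log(p_word_given_topic / p_word))

# Made-up probabilities: a common word versus a rarer but topic-specific word.
common = dict(p_word_given_topic=0.020, p_word=0.015)
specific = dict(p_word_given_topic=0.010, p_word=0.001)

# With lam=1 the frequent word ranks first; with lam=0 the specific word does.
ranks_common_first = relevance(lam=1.0, **common) > relevance(lam=1.0, **specific)
ranks_specific_first = relevance(lam=0.0, **specific) > relevance(lam=0.0, **common)
```

Moving the slider towards 0 therefore favours words that are distinctive for the selected topic, while moving it towards 1 favours words that are simply frequent in it.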