# NIPS: Topic modeling visualization

Some main topics at NIPS, according to [Wikipedia](https://en.wikipedia.org/wiki/Conference_on_Neural_Information_Processing_Systems):

1. Machine learning
2. Statistics
3. Artificial intelligence
4. Computational neuroscience

However, these topics all belong to the same domain, which makes them harder to tell apart. In this kernel I will try to extract some topics using Latent Dirichlet allocation (__LDA__). This tutorial features an end-to-end natural language processing pipeline that starts with raw data and runs through preparing, modeling, and visualizing the documents. We'll touch on the following points:


1. Topic modeling with **LDA**
1. Visualizing topic models with **pyLDAvis**
1. Visualizing LDA results with **t-SNE** and **bokeh**

In [1]:
%pylab inline

import numpy as np
import pandas as pd
import pickle as pk
from scipy import sparse as sp

%pylab is deprecated, use %matplotlib inline and import the required libraries.
Populating the interactive namespace from numpy and matplotlib


In [2]:
texts = []

# Import modules
import os
from glob import glob
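# Build the path to the directory that holds the documents
root  = os.path.join(os.getcwd(), '2. Rearranged Order/')
# Sort the file names 
paths = sorted(glob(root+'*.txt'))
# Read the full text of each file into the list
# (encoding='utf-8' is an assumption about how the files were saved)
for path in paths:
  with open(path, 'r', encoding='utf-8') as file:
    texts.append(file.read())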

In [3]:
df = pd.read_csv('Topics.csv')
topics = df['CATEGORY']
docs = np.array(texts, dtype='object')

In [4]:
docs

array(['TOPIC 1: Description\n\nUniversity of San Carlos (USC) is a Catholic educational institution administered since 1935 by Society of the Divine Word (SVD) missionaries. A University since 1948, USC offers the complete educational package from kindergarten, including a Montessori academy, to graduate school. Learn more about Education with a Mission and how we become Witness to the Word. USC is also referred here as University. \n\nRapid growth in the ‘50s saturated the campus near the city center prompting expansion of the University to what was then called the Boys’ High School in 1956 (now North Campus), and in 1964 to the Teacher Education Center and Girls’ High School (now South Campus) and to Talamban Campus. In 2008, the erstwhile SVD Formation Center was transformed into the Montessori Campus. Currently, The University has five campuses. \n\nThe University is one of the most respected higher education institutions in the Philippines. Programs offered have received ei

## Pre-process and vectorize the documents

In [5]:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

def docs_preprocessor(docs):
    tokenizer = RegexpTokenizer(r'\w+')
    for idx in range(len(docs)):
        docs[idx] = docs[idx].lower()  # Convert to lowercase.
        docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

    # Remove numbers, but not words that contain numbers.
    docs = [[token for token in doc if not token.isdigit()] for doc in docs]
    
    # Remove words that are three characters or shorter.
    docs = [[token for token in doc if len(token) > 3] for doc in docs]
    
    # Lemmatize all words in documents.
    lemmatizer = WordNetLemmatizer()
    docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]
  
    return docs

In [6]:
docs = docs_preprocessor(docs)

### **Compute bigrams/trigrams:**
Since the topics are very similar, what distinguishes them is phrases rather than single words.
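To make the idea of phrase detection concrete, here is a minimal sketch that joins adjacent word pairs occurring at least `min_count` times. It is a simplification for illustration only: gensim's `Phrases` also applies a collocation score and threshold, and the helper name `add_frequent_bigrams` and the toy documents are assumptions, not part of this notebook.

```python
from collections import Counter

def add_frequent_bigrams(docs, min_count=2):
    """Append 'a_b' tokens for adjacent pairs seen at least min_count times."""
    pair_counts = Counter()
    for doc in docs:
        pair_counts.update(zip(doc, doc[1:]))
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}

    augmented = []
    for doc in docs:
        phrases = ['_'.join(pair) for pair in zip(doc, doc[1:]) if pair in frequent]
        augmented.append(doc + phrases)  # keep the single words, add the phrases
    return augmented

# Hypothetical toy documents (already tokenized)
toy_docs = [
    ['block', 'enrollment', 'period'],
    ['block', 'enrollment', 'schedule'],
    ['enrollment', 'period', 'deadline'],
]
result = add_frequent_bigrams(toy_docs, min_count=2)
print(result[0])
```

Only the pairs that repeat across the toy corpus are promoted to phrases, which is the behaviour we want when single words are shared by every topic.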

In [7]:
from gensim.models import Phrases
# Add bigrams and trigrams to docs (only ones that appear 10 times or more).
bigram = Phrases(docs, min_count=10)
trigram = Phrases(bigram[docs])

for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)
    for token in trigram[docs[idx]]:
        if '_' in token:
            # Token is a multi-word phrase from the trigram model, add to document.
            docs[idx].append(token)

### **Remove rare and common tokens:**

In [8]:
from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)
print('Number of unique words in initial documents:', len(dictionary))

# Filter out words that occur in fewer than 3 documents or in more than 90% of the documents.
dictionary.filter_extremes(no_below=3, no_above=0.9)
print('Number of unique words after removing rare and common words:', len(dictionary))

Number of unique words in initial documents: 3868
Number of unique words after removing rare and common words: 1121


After pruning the common and rare words, we end up with only about 29% of the words.
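The pruning rule can be sketched in plain Python as a document-frequency filter. This is only an illustration of the idea behind `no_below` and `no_above`; the function name `filter_by_doc_frequency` and the toy documents are hypothetical, and gensim's own implementation has further options that are not reproduced here.

```python
from collections import Counter

def filter_by_doc_frequency(docs, no_below=2, no_above=0.9):
    """Keep tokens found in at least no_below docs and at most no_above of all docs."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    return {tok for tok, n_docs in doc_freq.items() if no_below <= n_docs <= max_docs}

# Hypothetical toy documents (already tokenized)
toy_docs = [
    ['student', 'enrollment', 'block'],
    ['student', 'course', 'enrollment'],
    ['student', 'seal', 'course'],
    ['student', 'history'],
]
kept = filter_by_doc_frequency(toy_docs, no_below=2, no_above=0.9)
print(sorted(kept))
```

In the toy corpus, `student` is dropped because it appears in every document, and the words seen in a single document are dropped as too rare.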

**Vectorize data:**  
The first step is to get a bag-of-words representation of each doc.
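As a rough picture of what a bag-of-words representation looks like, the sketch below maps each token to an integer id and counts occurrences per document. The helpers `build_vocab` and `to_bow` are hypothetical stand-ins written for this illustration; gensim may assign ids differently.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return a sorted list of (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())

# Hypothetical toy documents (already tokenized)
toy_docs = [['course', 'chair', 'course'], ['chair', 'seal']]
vocab = build_vocab(toy_docs)
bow = [to_bow(doc, vocab) for doc in toy_docs]
print(bow)
```

Word order is discarded: each document becomes a sparse list of token ids with their counts, which is the input format the topic model works with.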

In [9]:
corpus = [dictionary.doc2bow(doc) for doc in docs]

In [10]:
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 1121
Number of documents: 61


With the bag-of-words corpus, we can move on to learn our topic model from the documents.

# Train LDA model...

In [11]:
from gensim.models import LdaModel

In [12]:
# Set training parameters.
num_topics = 6
chunksize = 500  # number of documents processed in each training chunk
passes = 20  # number of passes through the whole corpus
iterations = 400  # maximum number of inference iterations per document
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index-to-word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

model = LdaModel(corpus=corpus, id2word=id2word, chunksize=chunksize,
                 alpha='auto', eta='auto',
                 iterations=iterations, num_topics=num_topics,
                 passes=passes, eval_every=eval_every)

# How to choose the number of topics?
__LDA__ is an unsupervised technique, meaning that we don't know before running the model how many topics exist in our corpus. Topic coherence is one of the main techniques used to estimate the number of topics. You can read about it [here](http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf).

However, I used the LDA visualization tool **pyLDAvis**, tried a few different numbers of topics, and compared the results. Four seemed to be the optimal number of topics, the one that separated the topics the most.
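For readers who want a feel for topic coherence, here is a minimal sketch of one simple measure, the UMass score, computed from document co-occurrence counts. It is chosen only because it is easy to write down, it is not the method used in this notebook, and the function name and toy data are assumptions. gensim also provides a `CoherenceModel` class for this purpose.

```python
import math
from itertools import combinations

def umass_coherence(top_terms, docs):
    """UMass-style coherence for one topic's ranked top terms.

    Assumes every term in top_terms occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*terms):
        # Number of documents that contain all the given terms
        return sum(all(t in d for t in terms) for d in doc_sets)

    score = 0.0
    # For each pair, condition on the higher-ranked (earlier) term.
    for earlier, later in combinations(top_terms, 2):
        score += math.log((doc_count(earlier, later) + 1) / doc_count(earlier))
    return score

# Hypothetical toy corpus
toy_docs = [
    ['course', 'chair', 'program'],
    ['course', 'chair'],
    ['seal', 'cross'],
    ['seal', 'cross', 'course'],
]
coherent = umass_coherence(['course', 'chair'], toy_docs)
mixed = umass_coherence(['course', 'seal'], toy_docs)
print(coherent, mixed)
```

Terms that tend to appear in the same documents score higher, so one could train models with different `num_topics` values and compare their average coherence.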

In [13]:
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

In [14]:
pyLDAvis.gensim.prepare(model, corpus, dictionary)

**What do we see here?**

**The left panel**, labeled Intertopic Distance Map, shows each topic as a circle, and the distance between circles reflects how different the topics are: similar topics appear closer together and dissimilar topics farther apart.
The relative size of a topic's circle corresponds to the relative frequency of the topic in the corpus.
An individual topic may be selected for closer scrutiny by clicking on its circle, or by entering its number in the "selected topic" box in the upper-left.
 
**The right panel** contains a bar chart of the top 30 terms. When no topic is selected in the plot on the left, the bar chart shows the 30 most "salient" terms in the corpus. A term's saliency is a measure of both how frequent the term is in the corpus and how "distinctive" it is in distinguishing between different topics.
Selecting a topic on the left changes the bar chart to show the most "relevant" terms for the selected topic. 
Relevance is defined in footnote 2 of the visualization and can be tuned with the parameter $\lambda$: a smaller $\lambda$ gives more weight to a term's distinctiveness, while a larger $\lambda$ gives more weight to the probability of the term within the topic.

Therefore, to get a better sense of terms per topic we'll use  $\lambda$=0.
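The effect of $\lambda$ can be checked with a small sketch of the relevance formula from the pyLDAvis paper (Sievert and Shirley, 2014): $\lambda \log p(w|t) + (1-\lambda) \log \frac{p(w|t)}{p(w)}$. The topic probabilities below are taken from Topic 0 of this notebook's output, but the corpus-wide probabilities `p(term)` are made-up values for illustration.

```python
import math

def relevance(p_term_given_topic, p_term, lam):
    """Relevance of a term to a topic for a given lambda in [0, 1]."""
    lift = p_term_given_topic / p_term
    return lam * math.log(p_term_given_topic) + (1 - lam) * math.log(lift)

# term: (p(term | topic), p(term)); the p(term) values are hypothetical
terms = {
    'with': (0.025, 0.020),   # frequent everywhere, so low lift
    'ismis': (0.011, 0.002),  # rarer overall but concentrated in the topic
}

rankings = {
    lam: sorted(terms, key=lambda t: relevance(*terms[t], lam), reverse=True)
    for lam in (0.0, 1.0)
}
print(rankings)
```

With $\lambda$=0 the topic-specific term `ismis` ranks first, while with $\lambda$=1 the generic but frequent term `with` ranks first.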

**How to evaluate our model?**  
Again, since there is no ground truth here, we have to be creative in defining ways to evaluate the model. I do this in two steps:

1. Divide each document into two parts and check whether the topics assigned to them are similar. => the more similar the better
2. Compare randomly chosen docs with each other. => the less similar the better
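The two checks above boil down to comparing topic distributions with cosine similarity. The sketch below shows the idea on hand-written vectors; the three-topic distributions are hypothetical and only meant to show what "intra" and "inter" similarity measure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical topic distributions over three topics
a_first, a_second = [0.8, 0.1, 0.1], [0.7, 0.2, 0.1]  # two halves of document A
b_first = [0.1, 0.1, 0.8]                              # first half of document B

intra = cosine(a_first, a_second)  # same document: should be high
inter = cosine(a_first, b_first)   # different documents: should be lower
print(round(intra, 3), round(inter, 3))
```

A model that captures real structure should give halves of the same document similar topic mixtures, and unrelated documents noticeably less similar ones.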

In [15]:
from sklearn.metrics.pairwise import cosine_similarity

df['tokens'] = docs

docs1 = df['tokens'].apply(lambda l: l[:int(len(l)/2)])
docs2 = df['tokens'].apply(lambda l: l[int(len(l)/2):])

Transform the data 

In [16]:
corpus1 = [dictionary.doc2bow(doc) for doc in docs1]
corpus2 = [dictionary.doc2bow(doc) for doc in docs2]

# Transform both halves into topic space with the LDA model
lda_corpus1 = model[corpus1]
lda_corpus2 = model[corpus2]

In [17]:
from collections import OrderedDict
def get_doc_topic_dist(model, corpus, kwords=False):
    
    '''
    LDA transformation: for each doc, the model only returns topics with non-zero weight.
    This function fills in the missing topics with 0 and returns the docs as a dense matrix in topic space.
    '''
    top_dist =[]
    keys = []

    for d in corpus:
        tmp = {i:0 for i in range(num_topics)}
        tmp.update(dict(model[d]))
        vals = list(OrderedDict(tmp).values())
        top_dist += [np.array(vals)]
        if kwords:
            keys += [np.array(vals).argmax()]

    return np.array(top_dist), keys

In [18]:
top_dist1, _ = get_doc_topic_dist(model, lda_corpus1)
top_dist2, _ = get_doc_topic_dist(model, lda_corpus2)

print("Intra similarity: cosine similarity for corresponding parts of a doc(higher is better):")
print(np.mean([cosine_similarity(c1.reshape(1, -1), c2.reshape(1, -1))[0][0] for c1,c2 in zip(top_dist1, top_dist2)]))

random_pairs = np.random.randint(0, len(docs), size=(400, 2))

print("Inter similarity: cosine similarity between random parts (lower is better):")
print(np.mean([cosine_similarity(top_dist1[i[0]].reshape(1, -1), top_dist2[i[1]].reshape(1, -1)) for i in random_pairs]))

Intra similarity: cosine similarity for corresponding parts of a doc(higher is better):
0.92828345
Inter similarity: cosine similarity between random parts (lower is better):
0.41253808


## Let's look at the terms that appear more in each topic. 

In [19]:
def explore_topic(lda_model, topic_number, topn, output=True):
    """
    Accepts an LdaModel, a topic number, and the number of top terms (topn) of interest.
    Prints a formatted list of the topn terms and returns them.
    """
    terms = []
    for term, frequency in lda_model.show_topic(topic_number, topn=topn):
        terms += [term]
        if output:
            print(u'{:20} {:.3f}'.format(term, round(frequency, 3)))
    
    return terms

In [20]:
topic_summaries = []
print(u'{:20} {}'.format(u'term', u'frequency') + u'\n')
for i in range(num_topics):
    print('Topic '+str(i)+' |---------------------\n')
    tmp = explore_topic(model,topic_number=i, topn=20, output=True )
    topic_summaries += [tmp[:5]]

term                 frequency

Topic 0 |---------------------

enrollment           0.031
with                 0.025
course               0.015
block_enrollment     0.014
which_mean           0.014
your                 0.012
from                 0.012
department_chair     0.011
ismis                0.011
graduation           0.010
semester             0.010
block                0.010
they                 0.009
then_click           0.009
period               0.009
which                0.009
then                 0.009
other                0.008
will                 0.008
student_task         0.008
Topic 1 |---------------------

course               0.050
department           0.028
program              0.026
through              0.018
following            0.016
chair                0.016
registrar_staff      0.012
engineering          0.010
education            0.010
student_task         0.009
local                0.009
general_education    0.009
enrollment           0.009
offer        

From above, it's possible to inspect each topic and assign a human-interpretable label to it. Here I labeled them as follows:

In [21]:
top_labels = {0: 'aa', 1:'bb', 2:'cc', 3:'dd', 4:'ee', 5:'ff'}

In [22]:
import re
import nltk

from nltk.corpus import stopwords

stops = set(stopwords.words('english'))

def paper_to_wordlist(paper, remove_stopwords=True):
    '''
        Function converts text to a sequence of words,
        Returns a list of words.
    '''
    lemmatizer = WordNetLemmatizer()
    # 1. Remove non-letters
    paper_text = re.sub("[^a-zA-Z]", " ", paper)
    # 2. Convert words to lower case and split them
    words = paper_text.lower().split()
    # 3. Remove stop words (optional)
    if remove_stopwords:
        words = [w for w in words if w not in stops]
    # 4. Remove words of two characters or fewer
    words = [t for t in words if len(t) > 2]
    # 5. Lemmatize, reusing the single lemmatizer created above
    words = [lemmatizer.lemmatize(t) for t in words]

    return words

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer

tvectorizer = TfidfVectorizer(input='content', analyzer = 'word', lowercase=True, stop_words='english',\
                                  tokenizer=paper_to_wordlist, ngram_range=(1, 3), min_df=3, max_df=0.9,\
                                  norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=True)

dtm = tvectorizer.fit_transform(texts).toarray()



In [24]:
top_dist =[]
for d in corpus:
    tmp = {i:0 for i in range(num_topics)}
    tmp.update(dict(model[d]))
    vals = list(OrderedDict(tmp).values())
    top_dist += [np.array(vals)]

In [25]:
top_dist, lda_keys= get_doc_topic_dist(model, corpus, True)
features = tvectorizer.get_feature_names_out()

In [26]:
top_ws = []
for n in range(len(dtm)):
    inds = np.argsort(dtm[n])[::-1][:4].astype(int)
    tmp = [features[i] for i in inds]
    
    top_ws += [' '.join(tmp)]
    
df['Text_Rep'] = pd.DataFrame(top_ws)
df['clusters'] = pd.DataFrame(lda_keys)
df['clusters'] = df['clusters'].fillna(10)

cluster_colors = {0: 'blue', 1: 'green', 2: 'yellow', 3: 'red', 4: 'skyblue', 5:'salmon', 6:'orange', 7:'maroon', 8:'crimson', 9:'black', 10:'gray'}

df['colors'] = df['clusters'].apply(lambda l: cluster_colors[l])

In [27]:
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(top_dist)

In [28]:
df['X_tsne'] =X_tsne[:, 0]
df['Y_tsne'] =X_tsne[:, 1]

In [29]:
from bokeh.plotting import figure, show, output_notebook, save#, output_file
from bokeh.models import HoverTool, LabelSet, Legend, ColumnDataSource
output_notebook()

In [30]:
df['text'] = [' '.join(text.split(' ')[1:10]) for text in texts]

In [31]:
df['text']

0     1: Description\n\nUniversity of San Carlos (US...
1     2: Catholic Identity, Vision, Mission, and Cor...
2     3: The University Seal\n\nThe University seal ...
3     4: History\n\nThe University of San Carlos, ad...
4     5: University Saints\n\nThe three university s...
                            ...                        
56    57: Student Organization Registration and Supe...
57    58: Educational Tours and Field Trips\n\nIn th...
58    59: Student Support and Services Quality Polic...
59    60: Directory of Student Support Services and ...
60    61: Directory of Academic Programs and Departm...
Name: text, Length: 61, dtype: object

In [32]:
source = ColumnDataSource(dict(
    x=df['X_tsne'],
    y=df['Y_tsne'],
    color=df['colors'],
    label=df['clusters'].apply(lambda l: top_labels[l]),
    topic_key= df['clusters'],
    content = df['Text_Rep'],
    text = df['text']
))

In [33]:
df.head()

| Unnamed: 0 | CAT# | ORDER# | ARTICLE | SECTION | DOCUMENT | CATEGORY | TOPIC | tokens | Text_Rep | clusters | colors | X_tsne | Y_tsne | text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | Description | [topic, description, university, carlos, catho... | philippine ranking center campus | 3 | red | 14.230335 | -0.368709 | 1: Description\n\nUniversity of San Carlos (US... |
| 1 | 2 | 2 | 1 | 2 | 1 | 1 | Catholic Identity, Vision Mission and Core Values | [topic, catholic, identity, vision, mission, c... | community society local value | 2 | yellow | 19.542536 | -1.946191 | 2: Catholic Identity, Vision, Mission, and Cor... |
| 2 | 3 | 3 | 1 | 3 | 1 | 1 | The University Seal | [topic, university, seal, university, seal, of... | seal knowledge cross topic university | 3 | red | 13.802202 | -0.343544 | 3: The University Seal\n\nThe University seal ... |
| 3 | 4 | 4 | 1 | 4 | 1 | 1 | History | [topic, history, university, carlos, administe... | priest later building talamban | 3 | red | 14.232857 | -0.136108 | 4: History\n\nThe University of San Carlos, ad... |
| 4 | 5 | 5 | 1 | 5 | 1 | 1 | University Saints | [topic, university, saint, three, university, ... | missionary arnold priest joseph | 3 | red | 14.65533 | -0.511921 | 5: University Saints\n\nThe three university s... |


In [34]:
title = 'T-SNE visualization of topics'

plot_lda = figure(min_width=500, min_height=300,
                     title=title, tools="pan,wheel_zoom,box_zoom,reset,hover",
                     x_axis_type=None, y_axis_type=None, min_border=1)

plot_lda.scatter(x='x', y='y', source=source,
                 color='color', alpha=0.8, size=10)#'msize', )

# hover tools
hover = plot_lda.select(dict(type=HoverTool))
hover.tooltips = {"content": "Title: @text, KeyWords: @content - Topic: @topic_key "}

show(plot_lda)

#save the plot
# save(plot_lda, '{}.html'.format(title))

In [40]:
df['color_original'] = df['CATEGORY'].apply(lambda l: cluster_colors[l-1])

In [41]:
source = ColumnDataSource(dict(
    x=df['X_tsne'],
    y=df['Y_tsne'],
    color=df['color_original'],
    label=df['clusters'].apply(lambda l: top_labels[l]),
    topic_key= df['clusters'],
    content = df['Text_Rep'],
    text = df['text']
))

In [42]:
title = 'T-SNE visualization of topics'

plot_lda = figure(min_width=500, min_height=300,
                     title=title, tools="pan,wheel_zoom,box_zoom,reset,hover",
                     x_axis_type=None, y_axis_type=None, min_border=1)

plot_lda.scatter(x='x', y='y', source=source,
                 color='color', alpha=0.8, size=10)#'msize', )

# hover tools
hover = plot_lda.select(dict(type=HoverTool))
hover.tooltips = {"content": "Title: @text, KeyWords: @content - Topic: @topic_key "}

show(plot_lda)

#save the plot
# save(plot_lda, '{}.html'.format(title))

In [39]:
df.to_csv('Topics_Analyzed.csv')
print("Results saved to CSV!")

Results saved to CSV!
