# Anchored CorEx: Topic Modeling with Minimal Domain Knowledge

**Author:** [Ryan J. Gallagher](http://ryanjgallagher.github.io/)  

**Last updated:** 03/22/2021

This notebook walks through how to use the CorEx topic model code. This includes fitting CorEx to your data, examining the topic model output, outputting results, building a hierarchical topic model, and anchoring words to topics.

Details of the CorEx topic model and evaluations against unsupervised and semi-supervised variants of LDA can be found in our TACL paper:

Gallagher, Ryan J., Kyle Reing, David Kale, and Greg Ver Steeg. "[Anchored Correlation Explanation: Topic Modeling with Minimal Domain Knowledge](https://www.transacl.org/ojs/index.php/tacl/article/view/1244)." *Transactions of the Association for Computational Linguistics (TACL)*, 2017.

In [164]:
import numpy as np
import scipy.sparse as ss
import matplotlib.pyplot as plt

import corextopic.corextopic as ct
import corextopic.vis_topic as vt # jupyter notebooks will complain matplotlib is being loaded twice

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
%matplotlib inline

## Loading the documents data

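We need to first load data to run the CorEx topic model. The original walkthrough uses the 20 Newsgroups dataset, which scikit-learn provides functionality to access. Here we instead load two pre-extracted corpora from `.npy` files: historical materials (4451 documents) and census bureau text (4226 documents).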

In [165]:
documents_train = list(np.load('./training/train.npy')) # historical materials 4451 documents
documents_train2 = list(np.load('./training/train2.npy'))   # census bureau 4226 documents
documents_train.extend(documents_train2)
documents=documents_train
documents

['Skip to main content Search UPLOAD SIGN UP | LOG IN BOOKS VIDEO AUDIO SOFTWARE IMAGESABOUT BLOG PROJECTS HELP DONATE  CONTACT JOBS VOLUNTEER PEOPLE Search Metadata Search text contents Search TV news captions Search radio transcripts Search archived websitesAdvanced SearchSign up for freeLog inFull text of "The practical cabinet maker and furniture designer\'s assistant, with essays on history of furniture, taste in design, color and materials, with full explanation of the canons of good taste in furniture .."See other formats^ ',
 'HISTORY OF FURNITURE, TASTE IN DESIGN, COLOR AND MATERIALS, WITH FULL EXPLANATION OF THE CANONS OF GOOD TASTE IN FURNITURE ',
 'Together with Many Practical Directions for Making Cabinet Work Generally, and a Number of Pieces of Furniture in Particular, along with Hundreds of Recipes for Finishing, Staining, Varnishing, Polishing and Gilding all kinds of Cabinet Work :: :: :: ',
 'Author of "Practical Treatise on the Steel Sojmrk," "Modern Carpentry,*\' "

The topic model assumes input is in the form of a doc-word matrix, where rows are documents, columns are words, and entries are binary counts (1 if the word occurs in the document, 0 otherwise). We'll vectorize the documents, take the top 20,000 words, and convert the result to a sparse matrix to save on memory usage.

In [232]:
# Transform a list of documents into a sparse binary doc-word matrix
def to_matrix(doc, feature_num):
    # binary=True records word presence/absence, which is the input CorEx expects
    vectorizer = CountVectorizer(stop_words='english', max_features=feature_num, binary=True)
    doc_word = vectorizer.fit_transform(doc)
    doc_word = ss.csr_matrix(doc_word)
    # Get words that label the columns (needed to extract readable topics and make anchoring easier)
    # Note: newer scikit-learn releases replace this method with get_feature_names_out()
    words = list(np.asarray(vectorizer.get_feature_names()))
    # Preprocessing: remove purely numeric tokens from the vocabulary
    not_digit_inds = [ind for ind, word in enumerate(words) if not word.isdigit()]
    doc_word = doc_word[:, not_digit_inds]
    words = [words[ind] for ind in not_digit_inds]
    print('n_docs x m_words:')
    print(doc_word.shape)  # n_docs x m_words
    return doc_word, words

Our doc-word matrix starts out as 8677 documents by 20,000 words. The helper above also returns the words that label the columns. We'll need these for outputting readable topics and later for anchoring.

As a final preprocessing step, the helper removes all integers from our set of words. This brings us down to 18,614 words.

In [250]:
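# The number is the max_features of CountVectorizer
train_matrix, train_words = to_matrix(documents, 20000)
# Later cells refer to the training matrix and vocabulary as doc_word and words
doc_word, words = train_matrix, train_words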

n_docs x m_words:
(8677, 18614)


In [None]:
# Workaround: append all of the training text as one extra document at the end of the test documents
# so that the test vocabulary also has 18614 words

In [240]:
test=pd.read_csv('OCC_pairs.csv')['OCC_DES'].tolist()
len(test)

436

In [241]:
test.extend(['-'.join(documents)])
len(test)

437

In [258]:
test[-1]



In [242]:
test_matrix,test_words=to_matrix(test,20000)

n_docs x m_words:
(437, 18614)
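Appending the training text makes the two vocabularies the same size, but matching sizes alone do not guarantee that column j refers to the same word in both matrices. A more conventional pattern, sketched below, is to fit the vectorizer once on the training documents and reuse it to transform the test documents. The toy documents and variable names here are illustrative assumptions, not this notebook's data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the training and test documents (illustrative only)
train_docs = ["the farmer plows the field", "the clerk files 12 reports"]
test_docs = ["a farmer and a clerk", "unseen words only"]

# Fit the vocabulary on the training documents only
vectorizer = CountVectorizer(stop_words='english', binary=True)
train_matrix = vectorizer.fit_transform(train_docs)

# Reuse the fitted vectorizer: test columns line up with training columns,
# and words never seen in training are simply ignored
test_matrix = vectorizer.transform(test_docs)

# Column labels, ordered by column index
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Apply the same "drop purely numeric tokens" step to both matrices
keep = [i for i, w in enumerate(words) if not w.isdigit()]
train_matrix = train_matrix[:, keep]
test_matrix = test_matrix[:, keep]
words = [words[i] for i in keep]

print(train_matrix.shape, test_matrix.shape)
```

With this approach there is no extra row to strip from the test matrix afterwards.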


## CorEx Topic Model

The main parameters of the CorEx topic model are:
+ **`n_hidden`**: number of topics ("hidden" as in "hidden latent topics")
+ **`words`**: words that label the columns of the doc-word matrix (optional)
+ **`docs`**: document labels that label the rows of the doc-word matrix (optional)
+ **`max_iter`**: number of iterations to run through the update equations (optional, defaults to 200)
+ **`verbose`**:  if `verbose=1`, then CorEx will print the topic TCs with each iteration
+ **`seed`**:     random number seed to use for model initialization (optional)

We'll train a topic model with 20 topics. (This will take a few minutes.)

In [None]:
# Train the CorEx topic model with 20 topics
topic_model = ct.Corex(n_hidden=20, words=words, max_iter=200, verbose=False, seed=1)
topic_model.fit(doc_word, words=words);

## CorEx Output

### Topics

The CorEx topic model provides functionality for easily accessing the topics. Let's take a look at one of the topics.

In [None]:
# Print a single topic from CorEx topic model
topic_model.get_topics(topic=1, n_words=10)

The topic words are those with the highest *mutual information* with the topic, rather than those with highest probability within the topic as in LDA. The mutual information with the topic is the number reported in each tuple. CorEx also returns the "sign" of each word, which is either 1 or -1. If the sign is -1, then that means the *absence* of a word is informative in that topic, rather than its presence. 
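To make the two quantities concrete, here is a minimal sketch that computes the mutual information (in nats) between a binary word indicator and a binary topic label, along with the direction of the association. It illustrates the definitions only; the function names are hypothetical, and CorEx's internal estimates are not claimed to be computed this way.

```python
import numpy as np

def mutual_information_nats(x, y):
    """Mutual information between two binary arrays, estimated from counts."""
    x = np.asarray(x, dtype=int)
    y = np.asarray(y, dtype=int)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((x == a) & (y == b))
            p_a = np.mean(x == a)
            p_b = np.mean(y == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def association_sign(word, topic):
    """+1 if the word is at least as common in on-topic documents, else -1."""
    word = np.asarray(word, dtype=bool)
    topic = np.asarray(topic, dtype=bool)
    return 1 if word[topic].mean() >= word[~topic].mean() else -1

topic = [1, 1, 0, 0]
present_word = [1, 1, 0, 0]   # appears exactly in the on-topic documents
absent_word = [0, 0, 1, 1]    # appears exactly in the off-topic documents

# Both words are equally informative, but with opposite signs
print(mutual_information_nats(present_word, topic), association_sign(present_word, topic))
print(mutual_information_nats(absent_word, topic), association_sign(absent_word, topic))
```

Both toy words carry the same mutual information with the topic, which is why the sign is needed to tell presence from absence.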

If the column labels have not been specified through **`words`**, then the code will return the column indices for the top words in each topic.

We can also retrieve all of the topics at once if we would like.

In [None]:
#get rid of words like occupation, 

In [None]:
# Print all topics from the CorEx topic model
topics = topic_model.get_topics()
for n,topic in enumerate(topics):
    topic_words,_,_ = zip(*topic)
    print('{}: '.format(n) + ', '.join(topic_words))

The first topic for the newsgroup data tends to be less coherent than expected because of encodings and other oddities in the newsgroups data.  

We can also get the column indices instead of the column labels if necessary.

In [None]:
topic_model.get_topics(topic=5, n_words=10, print_words=False)

If we need to directly access the topic assignments for each word, they can be accessed through **`clusters`**.

In [None]:
print(topic_model.clusters)
print(topic_model.clusters.shape) # m_words

### Document Labels

As with the topic words, the most probable documents per topic can also be easily accessed. Documents are sorted according to log probabilities which is why the highest probability documents have a score of 0 ($e^0 = 1$) and other documents have negative scores (for example, $e^{-0.5} \approx 0.6$).

In [None]:
# Print the top documents for a single topic
topic_model.get_top_docs(topic=1, n_docs=10, sort_by='log_prob')

CorEx is a *discriminative* model, whereas LDA is a *generative* model. This means that while LDA outputs a probability distribution over topics for each document, CorEx instead estimates the probability a document belongs to a topic given that document's words. As a result, the probabilities across topics for a given document do not have to add up to 1. The estimated probabilities of topics for each document can be accessed through **`log_p_y_given_x`** or **`p_y_given_x`**.

In [None]:
print(topic_model.p_y_given_x.shape) # n_docs x k_topics

We can also use a softmax to make a binary determination of which documents belong to each topic. These softmax labels can be accessed through **`labels`**.

In [None]:
print(topic_model.labels.shape) # n_docs x k_topics

Since CorEx does not prescribe a probability distribution of topics over each document, this means that a document could possibly belong to no topics (all 0's across topics in **`labels`**) or all topics (all 1's across topics in **`labels`**).
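The toy example below illustrates these two points with made-up probabilities. It assumes the binary labels are obtained by thresholding each topic probability at 0.5, which is an assumption made for illustration rather than a statement about the library's internals.

```python
import numpy as np

# Made-up p(topic | document) values for 3 documents and 3 topics
p_y_given_x = np.array([[0.9, 0.8, 0.7],
                        [0.1, 0.2, 0.3],
                        [0.6, 0.1, 0.4]])

# Rows are not probability distributions over topics: they need not sum to 1
print(p_y_given_x.sum(axis=1))

# Assumed labeling rule: a document belongs to a topic if its probability exceeds 0.5
labels = p_y_given_x > 0.5
topics_per_doc = labels.sum(axis=1)
print(topics_per_doc)

# The first document belongs to every topic, the second to none
print(labels.all(axis=1), ~labels.any(axis=1))
```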

## Total Correlation and Model Selection

### Overall TC

Total correlation is the measure which CorEx maximizes when constructing the topic model. It can be accessed through **`tc`** and is reported in nats.

In [252]:
anchored_topic_model.tc

17.02005234965266

**Model selection:** CorEx starts its algorithm with a random initialization, and so different runs can result in different topic models. One way of finding a better topic model is to restart the CorEx algorithm several times and take the run that has the highest TC value (i.e. the run that produces topics that are most informative about the documents).
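The restart strategy can be sketched as a small helper that fits one model per seed and keeps the one with the highest `tc`. The helper name `best_of_restarts` and the stand-in `FakeModel` are hypothetical; they exist only so the sketch runs without data.

```python
from collections import namedtuple

def best_of_restarts(fit_model, seeds):
    """Fit one model per seed and return the one with the highest total correlation."""
    best = None
    for seed in seeds:
        model = fit_model(seed)
        if best is None or model.tc > best.tc:
            best = model
    return best

# With CorEx, fit_model could look like this (assuming doc_word and words exist):
#   def fit_model(seed):
#       model = ct.Corex(n_hidden=20, seed=seed)
#       model.fit(doc_word, words=words)
#       return model

# Stand-in for a fitted model, with made-up TC values per seed
FakeModel = namedtuple('FakeModel', ['seed', 'tc'])
fake_tcs = {0: 15.2, 1: 17.0, 2: 16.4}

best = best_of_restarts(lambda seed: FakeModel(seed, fake_tcs[seed]), seeds=[0, 1, 2])
print(best.seed, best.tc)
```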

### Topic TC

The overall total correlation is the sum of the total correlations of the individual topics. These can be accessed through **`tcs`**. For an unsupervised CorEx topic model, the topics are always sorted from high to low according to their TC. For an anchored CorEx topic model, the topics are not sorted, and are output such that the anchored topics come first.

In [257]:
anchored_topic_model.tcs # k_topics
for i,v in enumerate(anchored_topic_model.tcs):
    print('Topic '+str(i)+': '+str(v))

Topic 0: 1.395195012850403
Topic 1: 0.996086247877842
Topic 2: 1.0920819226976213
Topic 3: 1.7302453060726326
Topic 4: 2.93572570781293
Topic 5: 1.6346752296702718
Topic 6: 2.217248885646196
Topic 7: 2.9233606433439743
Topic 8: 0.9912441725338391
Topic 9: 1.1041892211469482


### Selecting the Number of Topics

One way to choose the number of topics is to observe the distribution of TCs for each topic to see how much each additional topic contributes to the overall TC. We should keep adding topics until additional topics do not significantly contribute to the overall TC. This is similar to choosing a cutoff eigenvalue when doing topic modeling via LSA.

In [None]:
plt.figure(figsize=(10,5))
plt.bar(range(topic_model.tcs.shape[0]), topic_model.tcs, color='#4e79a7', width=0.5)
plt.xlabel('Topic', fontsize=16)
plt.ylabel('Total Correlation (nats)', fontsize=16);

We see the first topic is much more informative than the other topics. Given that we suspect that this topic is picking up on image encodings (as given by "dsl" and "n3jxp" in the topic) and other boilerplate text (as given by the high TC and lack of coherence of the rest of the topic), we could consider doing additional investigation and preprocessing to help ensure that the CorEx topic model does not pick up on these patterns which are not insightful.

### Pointwise Document TC

We can decompose total correlation further. Each topic's total correlation is the average, over documents, of the pointwise total correlations for that topic. The pointwise total correlations can be accessed through **`log_z`**.

In [None]:
topic_model.log_z.shape # n_docs x k_topics

In [None]:
print(np.mean(topic_model.log_z, axis=0))
print(topic_model.tcs)

The pointwise total correlations in **`log_z`** represent the correlations within an individual document explained by a particular topic. These correlations have been used to measure how "surprising" documents are with respect to given topics (see references below).

## Hierarchical Topic Models

The **`labels`** attribute gives the binary topic expressions for each document and each topic. We can use this output as input to another CorEx topic model to get latent representations of the topics themselves. This yields a hierarchical CorEx topic model. Like the first layer of the topic model, one can determine the number of latent variables to add in higher layers through examination of the topic TCs.

In [None]:
# Train a second layer to the topic model
tm_layer2 = ct.Corex(n_hidden=10)
tm_layer2.fit(topic_model.labels);

# Train a third layer to the topic model
tm_layer3 = ct.Corex(n_hidden=1)
tm_layer3.fit(tm_layer2.labels);

If you have `graphviz` installed, then you can output visualizations of the hierarchical topic model to your current working directory. One can also create custom visualizations of the hierarchy by properly making use of the **`labels`** attribute of each layer.

In [None]:
vt.vis_hierarchy([topic_model, tm_layer2, tm_layer3], column_label=words, max_edges=200, prefix='topic-model-example')

## Anchoring for Semi-Supervised Topic Modeling

Anchored CorEx is an extension of CorEx that allows the "anchoring" of words to topics. When anchoring a word to a topic, CorEx is trying to maximize the mutual information between that word and the anchored topic. So, anchoring provides a way to guide the topic model towards specific subsets of words that the user would like to explore.  

The anchoring mechanism is flexible, and so there are many possibilities of anchoring. We explored the following types of anchoring in our TACL paper:

1. Anchoring a single set of words to a single topic. This can help promote a topic that did not naturally emerge when running an unsupervised instance of the CorEx topic model. For example, one might anchor words like "snow," "cold," and "avalanche" to a topic if one suspects there should be a snow avalanche topic within a set of disaster relief articles.

2. Anchoring single sets of words to multiple topics. This can help find different aspects of a topic that may be discussed in several different contexts. For example, one might anchor "protest" to three topics and "riot" to three other topics to understand different framings that arise from tweets about political protests.

3. Anchoring different sets of words to multiple topics. This can help enforce topic separability if there appear to be chimera topics. For example, one might anchor "mountain," "Bernese," and "dog" to one topic and "mountain," "rocky," and "colorado" to another topic to help separate topics that merge discussion of Bernese Mountain Dogs and the Rocky Mountains.
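The three strategies above can be written down as anchor lists, where each inner list is the group of words anchored to one topic. The sketch below uses the example words from the list (lowercased, since `CountVectorizer` lowercases by default) and adds a hypothetical helper, `missing_anchor_words`, for checking that every anchor word actually appears in the vocabulary before fitting.

```python
# Strategy 1: a single set of words anchored to a single topic
anchors_single = [['snow', 'cold', 'avalanche']]

# Strategy 2: the same single-word sets anchored to several topics each
anchors_repeated = [['protest']] * 3 + [['riot']] * 3

# Strategy 3: different sets of words anchored to different topics
anchors_separate = [['mountain', 'bernese', 'dog'],
                    ['mountain', 'rocky', 'colorado']]

def missing_anchor_words(anchors, vocabulary):
    """Return the anchor words that do not appear in the vocabulary."""
    vocab = set(vocabulary)
    return sorted({w for group in anchors for w in group if w not in vocab})

# Toy vocabulary (illustrative only); in this notebook it would be `words`
toy_vocabulary = ['snow', 'cold', 'mountain', 'dog', 'rocky', 'colorado']
print(missing_anchor_words(anchors_separate, toy_vocabulary))  # → ['bernese']
```

Each list would then be passed as the `anchors` argument to `fit`, with `n_hidden` at least as large as the number of anchor groups.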


We'll demonstrate how to anchor words to the CorEx topic model and how to develop other anchoring strategies.

Here we anchor multiple groups of words to multiple topics, one group per occupation category.

In [154]:
anchors=pd.read_excel('three_level_occupation_scores.xlsx')['Large Categories.1'][:9]

In [155]:
anchors

0    Agriculture, Forestry, and Animal Husbandry
1                           Clerical Occupations
2                  Domestic and Personal Service
3                         Extraction of Minerals
4        Manufacturing and Mechanical Industries
5                           Professional Service
6       Public Service, Not Elsewhere Classified
7                                          Trade
8                                 Transportation
Name: Large Categories.1, dtype: object

In [169]:
doc_word

<8677x18614 sparse matrix of type '<class 'numpy.int64'>'
	with 206317 stored elements in Compressed Sparse Row format>

In [171]:
# The anchor words are the unique entries of the 'Large Categories.1' column in 'three_level_occupation_scores.xlsx',
# with stopwords removed and a few manual tweaks

anchor_words = [['agriculture','forestry','animal','husbandry'], #0
                ['clerical'], #1
                ['domestic','professional'], #2
                ['extraction','minerals'], #3
                ['manufacturing','mechanical'], #4
                ['professional'], #5
                ['public'], #6
                ['trade'],  #7
                ['transportation']  #8
               ]
# The word 'professional' is added to group #2 above since, in 'three_level_occupation_scores.xlsx',
# all occupations with the word 'professional' in them are from this category,
# so it might be helpful to add it

# Number of topics set to 10 (9 anchored, 1 unanchored)
anchored_topic_model = ct.Corex(n_hidden=10, seed=2)
anchored_topic_model.fit(doc_word, words=words, anchors=anchor_words, anchor_strength=6);

In [172]:
# Print all topics from the CorEx topic model with their most significant words
topics = anchored_topic_model.get_topics()
for topic_n,topic in enumerate(topics):
    # w: word, mi: mutual information, s: sign
    topic = [(w,mi,s) if s > 0 else ('~'+w,mi,s) for w,mi,s in topic]
    # Unpack the info about the topic
    topic_words,mis,signs = zip(*topic)    
    # Print topic
    if topic_n<len(anchor_words):
        print(f'*{anchor_words[topic_n][0]}*')
        topic_str = '    '+str(topic_n)+': '+', '.join(topic_words)
    else:
        topic_str = str(topic_n)+': '+', '.join(topic_words)
    print(topic_str)

*agriculture*
    0: agriculture, animal, forestry, husbandry, occupations, gainful, division, occupied, gainfully, central
*clerical*
    1: work, time, clerical, possible, process, fact, care, true, best, taken
*domestic*
    2: domestic, professional, tho, porto, rico, service, personal, universities, oxford, cambridge
*extraction*
    3: water, dry, add, hot, extraction, boil, mix, stain, brush, minerals
*manufacturing*
    4: manufacturing, industry, products, total, value, establishments, earners, cent, reported, number
*professional*
    5: furniture, professional, style, art, century, cabinet, taste, decoration, chairs, french
*public*
    6: public, wood, fig, piece, inch, cut, surface, mold, sides, pieces
*trade*
    7: trade, union, workers, local, international, members, locals, organization, labor, convention
*transportation*
    8: new, york, pennsylvania, jersey, transportation, ohio, illinois, massachusetts, state, north
9: operations, included, figures, steel, includes

Note that '~' indicates that the *absence* of the word is significant for the topic. We will check what topics 7 and 9 contain later.

If we need to directly access the topic assignments for each word, they can be accessed through **`clusters`**.

In [222]:
print(anchored_topic_model.clusters)

[8 4 4 ... 0 0 0]


In [223]:
print(anchored_topic_model.clusters.shape)

(18614,)


In [225]:
len(words)

636

As a sanity check on the unsupervised model, the per-topic total correlations in **`tcs`** should sum to the overall **`tc`**.

In [None]:
topic_model.tcs.shape # k_topics

In [None]:
print(np.sum(topic_model.tcs))
print(topic_model.tc)


Note that the second topic, 'clerical', isn't a good topic.

Note that 'occupation' is considered a significant word in Topic 1. Recall, however, that this word carries little meaning for our occupation clustering, and including it will hurt our classification.

Perhaps we should consider dropping this word entirely from our training documents.


In [91]:
#check what words are in the topic with many ~
def get_words_exist(num):
    topic = anchored_topic_model.get_topics(topic=num, n_words=1000)
    tup = [(w,mi,s) for w,mi,s in topic if s>0]
    # Unpack the info about the topic
    words,_,_ = zip(*tup)    
    # Print topic
    topic_str = str(num)+': '+', '.join(words)
    print(topic_str)
get_words_exist(6)

6: states, united, establishment, group, respect, fuel, corporate, did, character, publications, increasing, recent, philadelphia, relating, connection, estimated, laundries


This raises a question: why isn't 'states' grouped with 'Carolina', which is in group 1?

In [None]:
#get top training documents for each topic

In [173]:
#function to get the (first) topic of each document:
def df_topic_of_doc(documents):
    l=[]
    for i,doc_labels in enumerate(anchored_topic_model.labels):
        try:
            if True not in doc_labels:
                # document was not assigned to any topic
                l.append([documents[i],np.nan,np.nan,np.nan])
            else:
                # keep only the first (lowest-numbered) topic the document belongs to
                num=np.where(doc_labels == True)[0][0]

                topic = anchored_topic_model.get_topics(topic=num)

                # prefix words whose absence is informative with '~'
                topic = [(w,mi,s) if s > 0 else ('~'+w,mi,s) for w,mi,s in topic]
                # Unpack the info about the topic
                topic_words,mis,signs = zip(*topic)
                # convert to list
                word_ls=list(topic_words)
                # only the anchored topics map to an occupation category
                if num<len(anchor_words):
                    category=anchors[num]
                else:
                    category=np.nan

                l.append([documents[i],num,category,word_ls])
        except Exception as e:
            print(e)
            print(i)
            print(documents[i])
            print(doc_labels)
    return pd.DataFrame(l,columns=['document','topic_num','category','words'])

In [175]:
df_test=df_topic_of_doc(documents)

In [176]:
df_test

Unnamed: 0,document,topic_num,category,words
0,Skip to main content Search UPLOAD SIGN UP | L...,2.0,Domestic and Personal Service,"[domestic, professional, tho, porto, rico, ser..."
1,"HISTORY OF FURNITURE, TASTE IN DESIGN, COLOR A...",5.0,Professional Service,"[furniture, professional, style, art, century,..."
2,Together with Many Practical Directions for Ma...,1.0,Clerical Occupations,"[work, time, clerical, possible, process, fact..."
3,"Author of ""Practical Treatise on the Steel Soj...",,,
4,"In preparing this work, I think it unnecessary...",1.0,Clerical Occupations,"[work, time, clerical, possible, process, fact..."
...,...,...,...,...
8672,"Boxes, fancy Uil(l paper _ Droacl and other ba...",8.0,Transportation,"[new, york, pennsylvania, jersey, transportati..."
8673,"Clothing, rnens, including shirts_ Confectione...",5.0,Professional Service,"[furniture, professional, style, art, century,..."
8674,"Pottery, tcrr!lcotta, and fireclay productsPri...",,,
8675,"um rellas and canes, l; vault lights and vent...",2.0,Domestic and Personal Service,"[domestic, professional, tho, porto, rico, ser..."


In [177]:
df_test.to_excel('corex_topic_new/topics_with_top_docs.xlsx')

In [None]:
#test topic modelling on new data

In [None]:
test=pd.read_csv

In [226]:
#save model 
anchored_topic_model.save('corex_topic_new/model_1.pkl')

In [231]:
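#load model (assign the returned model to a variable in order to use it)
def load(filename):
    """ Unpickle class instance. """
    import pickle
    # use a context manager so the file handle is closed after loading
    with open(filename, 'rb') as f:
        return pickle.load(f)
load('corex_topic_new/model_1.pkl')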

<corextopic.corextopic.Corex at 0x7fa96a470d30>

## Predict

In [None]:
#what is 636, is it the total number of words?
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.corpora import Dictionary
def tokenize(text):
    return [token for token in simple_preprocess(text) if token not in STOPWORDS]
processed_docs = [tokenize(doc) for doc in test_doc]
word_count_dict = Dictionary(processed_docs)
#word_count_dict.filter_extremes(no_below=10, no_above=0.2) # word must appear >10 times, and appear in no more than 20% documents (in order to be selected as a word in the dictionary)
#bag_of_words_corpus = [word_count_dict.doc2bow(pdoc) for pdoc in processed_docs] # bow all document of corpus

In [243]:
predict_labels=anchored_topic_model.transform(test_matrix)
predict_labels

array([[False, False,  True, ...,  True,  True, False],
       [False, False,  True, ...,  True,  True, False],
       [ True, False,  True, ...,  True, False, False],
       ...,
       [ True, False, False, ...,  True, False, False],
       [ True, False, False, ...,  True, False, False],
       [ True,  True,  True, ...,  True,  True,  True]])

In [244]:
predict_labels=predict_labels[:436] #document * topic
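Each row of the boolean matrix marks the topics assigned to one document. A compact way to turn such a matrix into a list of topic indices per document is sketched below on a small made-up matrix (the values are illustrative, not the predictions above).

```python
import numpy as np

# Made-up labels for 3 documents and 3 topics
toy_labels = np.array([[False, True,  True],
                       [False, False, False],
                       [True,  False, False]])

# Indices of the topics each document belongs to (empty list if none)
topics_per_doc = [np.flatnonzero(row).tolist() for row in toy_labels]
print(topics_per_doc)  # → [[1, 2], [], [0]]
```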

In [245]:
#function to get the topics of each document:
def df_topic_of_doc(documents,labels):
    l=[]
    for i,doc_labels in enumerate(labels):
        try:
            if True not in doc_labels:
                # document was not assigned to any topic
                l.append([documents[i],np.nan,np.nan,np.nan])
            else:
                # summary row listing every topic the document belongs to
                # (uses the labels argument rather than the global predict_labels)
                arr=np.where(labels[i] == True)
                l.append([documents[i],arr,'',''])
                # one additional row per assigned topic
                for num in arr[0]:

                    topic = anchored_topic_model.get_topics(topic=num)

                    # prefix words whose absence is informative with '~'
                    topic = [(w,mi,s) if s > 0 else ('~'+w,mi,s) for w,mi,s in topic]
                    # Unpack the info about the topic
                    topic_words,mis,signs = zip(*topic)
                    # convert to list
                    word_ls=list(topic_words)
                    # only the anchored topics map to an occupation category
                    if num<len(anchor_words):
                        category=anchors[num]
                    else:
                        category=np.nan
                    l.append([documents[i],num,category,word_ls])
        except Exception as e:
            print(e)
            print(i)
            print(documents[i])
            print(doc_labels)
    return pd.DataFrame(l,columns=['document','topic_num','category','words'])

In [246]:
df_predict=df_topic_of_doc(test,predict_labels)
df_predict

Unnamed: 0,document,topic_num,category,words
0,"Agriculture, Forestry, and Animal Husbandry Da...","([2, 7, 8],)",,
1,"Agriculture, Forestry, and Animal Husbandry Da...",2,Domestic and Personal Service,"[domestic, professional, tho, porto, rico, ser..."
2,"Agriculture, Forestry, and Animal Husbandry Da...",7,Trade,"[trade, union, workers, local, international, ..."
3,"Agriculture, Forestry, and Animal Husbandry Da...",8,Transportation,"[new, york, pennsylvania, jersey, transportati..."
4,"Agriculture, Forestry, and Animal Husbandry Da...","([2, 7, 8],)",,
...,...,...,...,...
1960,"Clerical Occupations Messenger, bundle, and of...",0,"Agriculture, Forestry, and Animal Husbandry","[agriculture, animal, forestry, husbandry, occ..."
1961,"Clerical Occupations Messenger, bundle, and of...",4,Manufacturing and Mechanical Industries,"[manufacturing, industry, products, total, val..."
1962,"Clerical Occupations Messenger, bundle, and of...",5,Professional Service,"[furniture, professional, style, art, century,..."
1963,"Clerical Occupations Messenger, bundle, and of...",6,"Public Service, Not Elsewhere Classified","[public, wood, fig, piece, inch, cut, surface,..."


In [247]:
df_predict.to_excel('corex_topic_1/predicted_1910.xlsx')

### Document Labels

As with the topic words, the most probable documents per topic can also be easily accessed. Documents are sorted according to log probabilities which is why the highest probability documents have a score of 0 ($e^0 = 1$) and other documents have negative scores (for example, $e^{-0.5} \approx 0.6$).

In [None]:
# Print the top documents for a single topic
topic_model.get_top_docs(topic=0, n_docs=10, sort_by='log_prob')

Somehow "riding sick sister to charity" is categorized as agriculture, so this really doesn't work.

In [None]:
for doc_num in topic_model.get_top_docs(topic=0, n_docs=10, sort_by='log_prob'):
    print(documents[doc_num[0]])

In [None]:
topic_model.labels

In [None]:
len(topic_model.labels[0])

Note that in an anchored topic model, topics are no longer sorted according to descending TC. Instead, the first topic is the one with the first anchor group anchored to it (here "agriculture," "forestry," "animal," and "husbandry"), the second topic is the one with "clerical" anchored to it, and so on.

If an anchored topic is less interpretable than the others (as with the "clerical" topic above), this could be a sign that there is not a good topic around those words, and one should consider if it is appropriate to anchor around them.

We can continue to develop even more involved anchoring strategies. For example, one can anchor "nasa" by itself, as well as in two other topics together with "politics" and "news" respectively, to find different aspects around the word "nasa", and anchor "war" to a fourth topic.
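The anchoring just described can be written as a list with one entry per topic. The sketch below normalizes single words into one-element groups so that every entry is a list. The commented `fit` call mirrors the one used earlier in this notebook and assumes a vocabulary that actually contains these words, which is not the case for the occupation data used here.

```python
# One entry per anchored topic: "nasa" alone, "nasa" with "politics",
# "nasa" with "news", and "war" alone
anchor_words = ['nasa', ['nasa', 'politics'], ['nasa', 'news'], 'war']

# Normalize so that every entry is a list of words
anchor_groups = [a if isinstance(a, list) else [a] for a in anchor_words]
print(anchor_groups)

# The model needs at least as many topics as there are anchor groups
min_topics = len(anchor_groups)
print(min_topics)

# Hypothetical usage, following the fit call shown earlier:
#   model = ct.Corex(n_hidden=10, seed=2)
#   model.fit(doc_word, words=words, anchors=anchor_groups, anchor_strength=6)
```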

### Choosing Anchor Strength

The anchor strength controls how much weight CorEx puts towards maximizing the mutual information between the anchor words and their respective topics. Anchor strength should always be set at a value *greater than* 1, since setting anchor strength between 0 and 1 only recovers the unsupervised CorEx objective. Empirically, setting anchor strength from 1.5-3 seems to nudge the topic model towards the anchor words. Setting anchor strength greater than 5 is strongly enforcing that the CorEx topic model find a topic associated with the anchor words.

We encourage users to experiment with the anchor strength and determine what values are best for their needs.

## Other Output

The **`vis_topic`** module provides support for outputting topics and visualizations of the CorEx topic model. The code below creates a results directory named "twenty" in your working directory.

In [249]:
vt.vis_rep(anchored_topic_model, column_label=words, prefix='twenty')

Print topics in text file


IndexError: list index out of range

## Further Reading

Our TACL paper details the theory of the CorEx topic model, its sparsity optimization, anchoring via the information bottleneck, comparisons to LDA, and anchoring experiments. The two papers from Greg Ver Steeg and Aram Galstyan develop the CorEx theory in general and provide further motivation and details of the underlying CorEx mechanisms. Hodas et al. demonstrated early CorEx topic model results and investigated an application of pointwise total correlations to quantify "surprising" documents.

1. [Anchored Correlation Explanation: Topic Modeling with Minimal Domain Knowledge](https://www.transacl.org/ojs/index.php/tacl/article/view/1244), Gallagher et al., TACL 2017.

2. [Discovering Structure in High-Dimensional Data Through Correlation Explanation](https://arxiv.org/abs/1406.1222), Ver Steeg and Galstyan, NIPS 2014. 

3. [Maximally Informative Hierarchical Representations of High-Dimensional Data](https://arxiv.org/abs/1410.7404), Ver Steeg and Galstyan, AISTATS 2015.

4. [Disentangling the Lexicons of Disaster Response in Twitter](https://dl.acm.org/citation.cfm?id=2741728), Hodas et al., WWW 2015.