## Author Topic Model

The author-topic model is an extension of Latent Dirichlet Allocation (LDA), that allows us to learn topic representations of authors in a corpus. The model can be applied to any kinds of labels on documents, such as tags on posts on the web. The model can be used as a novel way of data exploration, as features in machine learning pipelines, for author (or tag) prediction, or to simply leverage your topic model with existing metadata.

<img src="img/atm.png"/>

## Data

The data we'll be using consists of scientific papers about machine learning, from the Neural Information Processing Systems conference (NIPS). We'll first crawl the folders and files in the dataset, and read the files into memory.

In [None]:
import os, re
from smart_open import smart_open

# Folder containing all NIPS papers.
data_dir = 'nipstxt/'  # Set this path to the data on your machine.

# Folders containin individual NIPS papers.
yrs = ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
dirs = ['nips' + yr for yr in yrs]

In [None]:
# Get all document texts and their corresponding IDs.

docs = []
doc_ids = []
for yr_dir in dirs:
    # List of filenames
    files = os.listdir(data_dir + yr_dir)
    for filen in files:
        # Get document ID
        (idx1, idx2) = re.search('[0-9]+', filen).span()
        doc_ids.append(yr_dir[4:] + '_' + str(int(filen[idx1:idx2])))
        
        # Read document text
        with smart_open(data_dir + yr_dir + '/' + filen, 'rb', encoding='latin-1') as f:
            txt = f.read()
            
        # Replace any whitespace (newline, tabs, etc.) by a single space.
        txt = re.sub(r'\s', ' ', txt)
        
        docs.append(txt)

In [None]:
doc_ids[:5]

In [None]:
docs[0]

Construct a mapping from author names to document IDs.

In [None]:
filenames = [data_dir + 'idx/a' + yr + '.txt' for yr in yrs]  # Using the years defined in previous cell

# Get all author names and their corresponding document IDs
author2doc = dict()
i = 0
for yr in yrs:
    # The files "a00.txt" and so on contain the author-document mappings
    filename = data_dir + 'idx/a' + yr + '.txt'
    for line in smart_open(filename, 'rb', errors='ignore', encoding='latin-1'):
        # Each line corresponds to one author
        contents = re.split(',', line)
        author_name = (contents[1] + contents[0]).strip()
        # Remove any whitespace to reduce redundant author names
        author_name = re.sub(r'\s', '', author_name)
        # Get document IDs for author
        ids = [c.strip() for c in contents[2:]]
        if not author2doc.get(author_name):
            # This is a new author.
            author2doc[author_name] = []
            i += 1
        
        # Add document IDs to author.
        author2doc[author_name].extend([yr + '_' + id for id in ids])

In [None]:
author2doc

Use an integer ID in author2doc, instead of the IDs provided in the NIPS dataset.

In [None]:
# Mapping from ID of document in NIPS datast, to an integer ID.
doc_id_dict = dict(zip(doc_ids, range(len(doc_ids))))
# Replace NIPS IDs by integer IDs.
for a, a_doc_ids in author2doc.items():
    for i, doc_id in enumerate(a_doc_ids):
        author2doc[a][i] = doc_id_dict[doc_id]

In [None]:
author2doc

## Preprocess

Now we will preprocess this dataset using same process and functions we used earlier for the Fake news dataset. 

In [18]:
import re
from gensim.parsing.preprocessing import remove_stopwords, strip_punctuation

import nltk
nltk.download('wordnet') # download wordnet to be used in lemmatization
from nltk.stem import WordNetLemmatizer

def preprocess(texts):
    # tokenization
    texts = [re.findall(r'\w+', line.lower()) for line in texts]
    # remove stopwords
    texts = [remove_stopwords(' '.join(line)).split() for line in texts]
    # remove punctuation
    texts = [strip_punctuation(' '.join(line)).split() for line in texts]
    # remove words that are only 1-2 character
    texts = [[token for token in line if len(token) > 2] for line in texts]
    # remove numbers
    texts = [[token for token in line if not token.isnumeric()] for line in texts]
    # lemmatization 
    lemmatizer = WordNetLemmatizer()
    texts = [[word for word in lemmatizer.lemmatize(' '.join(line), pos='v').split()] for line in texts]
    
    return texts

[nltk_data] Downloading package wordnet to /Users/parul/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [19]:
processed_docs = preprocess(docs)

In [20]:
from gensim.models.phrases import Phrases, Phraser

# training for bigram collocation detection
phrases = Phrases(processed_docs, min_count=1, threshold=0.8, scoring='npmi')

In [21]:
bigram = Phraser(phrases)

In [22]:
# merging detected collocations with data
processed_docs = list(bigram[processed_docs])

In [24]:
from gensim.corpora import Dictionary

dictionary = Dictionary(processed_docs)

# Remove rare and common tokens.
# Filter out words that occur too frequently or too rarely.
max_freq = 0.5
min_wordcount = 20
dictionary.filter_extremes(no_below=min_wordcount, no_above=max_freq)

_ = dictionary[0]

In [25]:
# Vectorize data
# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

In [26]:
print('Number of authors: %d' % len(author2doc))
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of authors: 2479
Number of unique tokens: 6357
Number of documents: 1740


# Training

The interface to the author-topic model is very similar to that of LDA. In addition to a corpus, dictionary (id2word) and number of topics (num_topics), the author-topic model requires either an author to document ID mapping (author2doc), or the reverse (doc2author).

In [27]:
from gensim.models import AuthorTopicModel

model = AuthorTopicModel(corpus=corpus, num_topics=10, id2word=dictionary.id2token, \
                author2doc=author2doc, chunksize=2000, passes=100, gamma_threshold=1e-10, \
                eval_every=0, iterations=1, random_state=2)

In [28]:
# Save model
model.save('models/atmodel')

In [29]:
# Load model
model = AuthorTopicModel.load('models/atmodel')

## Exploring author-topic representation

Now that we have trained a model, we can start exploring the authors and the topics.

First, let's simply print the top 10 relevant words in the topics.

In [30]:
topic_no = 0
model.show_topic(topic_no)

[('noise', 0.007148242388583639),
 ('density', 0.006989836410080504),
 ('weight', 0.006775707499261723),
 ('regression', 0.006356847443801718),
 ('matrix', 0.006267901742473971),
 ('estimation', 0.00625318201447436),
 ('estimate', 0.006230226607797577),
 ('gaussian', 0.006022509980821509),
 ('parameter', 0.005613187571721336),
 ('prediction', 0.005579763985637562)]

Below, we give each topic a label based on what each topic seems to be about intuitively.

In [55]:
topic_labels = ['Bayesian modelling', 'Neuroscience', 'Reinforcement learning',  \
                'Face Recognition', 'Math/general', 'Neurotech', 'Neural Networks', \
                'Object recognition', 'Numerical optimization', 'Speech recognition']

Rather than just calling `model.show_topics(num_topics=10)`, we format the output a bit so it is easier to get an overview.

In [56]:
for topic in model.show_topics(num_topics=10):
    print('Label: ' + topic_labels[topic[0]])
    words = ''
    for word, prob in model.show_topic(topic[0]):
        words += word + ' '
    print('Words: ' + words)
    print()

Label: Bayesian modelling
Words: noise density weight regression matrix estimation estimate gaussian parameter prediction 

Label: Neuroscience
Words: neurons neuron synaptic cell memory activity patterns cells dynamics connections 

Label: Reinforcement learning
Words: control reinforcement action optimal policy algorithms states actions decision dynamic 

Label: Face Recognition
Words: field image control face map distance fig images architecture phase 

Label: Math/general
Words: gaussian variables hidden field energy examples generalization matrix noise image 

Label: Neurotech
Words: analog circuit spike chip neuron signal voltage frequency neurons noise 

Label: Neural Networks
Words: units unit hidden net task recurrent trained architecture representation layer 

Label: Object recognition
Words: visual image motion object cells direction orientation images spatial eye 

Label: Numerical optimization
Words: let class theorem examples threshold bound points vectors dimension dista

These topics are by no means perfect. They have problems such as chained topics, intruded words, random topics, and unbalanced topics ([see Mimno and co-authors 2011](https://people.cs.umass.edu/~wallach/publications/mimno11optimizing.pdf). They will do for the purpose of this tutorial, however.

Now let's retrieve the topic distribution for an author. Each topic has a probability of being expressed given the particular author.

In [58]:
model['YannLeCun']

[(6, 0.22610997369321953), (8, 0.24210404543871827), (9, 0.5315931549129316)]

Let's print the top topics of some authors. First, we make a function to view it more easily and replacing the topic no. by the labels we gave for each topic above.

In [57]:
from pprint import pprint

def show_author(name):
    print('\n%s' % name)
    print('Docs:', model.author2doc[name])
    print('Topics:')
    pprint([(topic_labels[topic[0]], topic[1]) for topic in model[name]])

Below, we print some high profile researchers and inspect them. Three of these, Yann LeCun, Geoffrey E. Hinton and Christof Koch, are spot on.

In [59]:
show_author('YannLeCun')


YannLeCun
Docs: [177, 337, 325, 473, 560, 482, 642, 699, 775, 757, 1458]
Topics:
[('Neural Networks', 0.22610997369321953),
 ('Numerical optimization', 0.24210404543871827),
 ('Speech recognition', 0.5315931549129316)]


In [60]:
show_author('GeoffreyE.Hinton')


GeoffreyE.Hinton
Docs: [4, 177, 268, 212, 232, 438, 479, 446, 615, 734, 715, 968, 957, 1394, 1688, 1634]
Topics:
[('Math/general', 0.3797800173095033),
 ('Neural Networks', 0.44177799436544585),
 ('Object recognition', 0.11305616077790281),
 ('Speech recognition', 0.061465318598817854)]


In [61]:
show_author('TerrenceJ.Sejnowski')


TerrenceJ.Sejnowski
Docs: [459, 563, 519, 516, 693, 620, 686, 592, 733, 842, 900, 910, 975, 860, 896, 881, 972, 1180, 1179, 1278, 1284, 1144, 1220, 1274, 1404, 1344, 1515, 1514, 1534, 1600, 1625]
Topics:
[('Object recognition', 0.9999193015626199)]


In [62]:
show_author('ChristofKoch')


ChristofKoch
Docs: [64, 218, 209, 282, 373, 317, 294, 310, 520, 484, 681, 689, 803, 820, 701, 874, 1192, 1201, 1216, 1519, 1463, 1500, 1452, 1733]
Topics:
[('Neurotech', 0.9998903726144014)]


Terrence J. Sejnowski's results are surprising, however. He is a neuroscientist, so we would expect him to get the "neuroscience" label. This may indicate that Sejnowski works with the neuroscience aspects of visual perception, or perhaps that we have labeled the topic incorrectly, or perhaps that this topic simply is not very informative.

## Visualization

Now let's explore our author-topic model using interactive visualizations.

We take all the author-topic distributions and embed them in a 2D space. To do this, we reduce the dimensionality of this data using t-SNE.

t-SNE is a method that attempts to reduce the dimensionality of a dataset, while maintaining the distances between the points. That means that if two authors are close together in the plot below, then their topic distributions are similar.

In the cell below, we transform the author-topic representation into the t-SNE space. You can increase the `smallest_author` value if you do not want to view the authors with few documents only.

In [39]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=0)
smallest_author = 0  # Ignore authors with documents less than this.
authors = [model.author2id[a] for a in model.author2id.keys() if len(model.author2doc[a]) >= smallest_author]
_ = tsne.fit_transform(model.state.gamma[authors, :])  # Result stored in tsne.embedding_

In [40]:
tsne.embedding_

array([[-26.205421 ,  17.782434 ],
       [ -1.6217549,  -8.415733 ],
       [-60.33683  ,   3.6874433],
       ...,
       [-26.616974 , -31.621323 ],
       [ 49.503586 ,   2.1977787],
       [ 26.626907 ,  52.842297 ]], dtype=float32)

We are now ready to make the plot.

Note that if you run this notebook yourself, you will see a different graph. The random initialization of the model will be different, and the result will thus be different to some degree. You may find an entirely different representation of the data, or it may show the same interpretation slightly differently.

If you can't see the plot, you are probably viewing this tutorial in a Jupyter Notebook. View it in an nbviewer instead at http://nbviewer.jupyter.org/github/rare-technologies/gensim/blob/develop/docs/notebooks/atmodel_tutorial.ipynb.

In [41]:
# Tell Bokeh to display plots inside the notebook.
from bokeh.io import output_notebook
output_notebook()

In [42]:
from bokeh.models import HoverTool
from bokeh.plotting import figure, show, ColumnDataSource

x = tsne.embedding_[:, 0]
y = tsne.embedding_[:, 1]
author_names = [model.id2author[a] for a in authors]

# Radius of each point corresponds to the number of documents attributed to that author.
scale = 0.1
author_sizes = [len(model.author2doc[a]) for a in author_names]
radii = [size * scale for size in author_sizes]

source = ColumnDataSource(
        data=dict(
            x=x,
            y=y,
            author_names=author_names,
            author_sizes=author_sizes,
            radii=radii,
        )
    )

# Add author names and sizes to mouse-over info.
hover = HoverTool(
        tooltips=[
        ("author", "@author_names"),
        ("size", "@author_sizes"),
        ]
    )

p = figure(tools=[hover, 'crosshair,pan,wheel_zoom,box_zoom,reset,save,lasso_select'])
p.scatter('x', 'y', radius='radii', source=source, fill_alpha=0.6, line_color=None)
show(p)

The circles in the plot above are individual authors, and their sizes represent the number of documents attributed to the corresponding author. Hovering your mouse over the circles will tell you the name of the authors and their sizes. Large clusters of authors tend to reflect some overlap in interest.

We see that the model tends to put duplicate authors close together. For example, Terrence J. Sejnowki and T. J. Sejnowski are the same person, and their vectors end up in the same place (see about (−63, 18) in the plot).

At about (−56, 28) we have a cluster of Machine learning researchers like YannLeCun, Yoshua Baengio and Geoffery  Hinton.

As discussed earlier, the "object recognition" topic was assigned to Sejnowski. If we get the topics of the other authors in Sejnoski's neighborhood, like Peter Dayan, we also get this same topic. Furthermore, we see that this cluster is close to the "neuroscience" cluster consisting of James M. Bower and Christof Koch, which is further indication that this topic is about visual perception in the brain.