# 9. Topic models

Topic models are a widely used probabilistic method for discovering structure in a large collection of texts. The idea is that how often a word occurs depends on the *topic*. Each topic assigns a probability to every word; in the figure below, the most characteristic words of each topic are listed at the top (colored sheets on the left).

Further, a document typically does not contain just one topic but is a *mixture* of several topics (bar chart on the right). The example article covers genetics (yellow), evolutionary biology (purple) and computer science (blue), but not neurology (green). Within each topic, the characteristic words are likely to occur together, but not necessarily across topics: an article from pure computer science would contain the same "blue" words occurring together, but not the "yellow" and "purple" ones.

Using statistical inference we can reconstruct the topics and the topic proportions of each document from raw text.

![lda.png](attachment:lda.png)

D. Blei, *Introduction to probabilistic topic models.* 2012
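
The generative idea behind the figure can be sketched in a few lines of Python. This is only an illustration: the two topics, their word probabilities and the document's topic proportions are invented here and do not come from the figure or from a trained model.

```python
import random

# Hypothetical topics: each one assigns probabilities to words.
topics = {
    'genetics': {'gene': 0.5, 'dna': 0.3, 'genome': 0.2},
    'computing': {'data': 0.5, 'computer': 0.3, 'algorithm': 0.2},
}

# Hypothetical topic proportions of one document (they sum to 1).
doc_topic_proportions = {'genetics': 0.7, 'computing': 0.3}


def generate_document(n_words, rng):
    """Draw a topic for every word position, then a word from that topic."""
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(doc_topic_proportions),
                            weights=list(doc_topic_proportions.values()))[0]
        word_probs = topics[topic]
        word = rng.choices(list(word_probs),
                           weights=list(word_probs.values()))[0]
        words.append(word)
    return words


rng = random.Random(0)              # fixed seed so that the run is repeatable
document = generate_document(10, rng)
print(document)
```

Statistical inference runs this process in reverse: it starts from the observed words and estimates the topics and the per-document proportions that most plausibly produced them.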

We will use the Python library `gensim` (https://radimrehurek.com/gensim/) for computing topic models:

In [2]:
import gensim



In order for topic models to work properly, we need to consider the following:
1. We need a lot of text. Thousands of documents are a minimum, tens or hundreds of thousands are better.
2. The model works better if we convert words to lemmas and consider only nouns. Other parts of speech are much less often characteristic of a particular topic.
3. The corpus should actually cover distinct topics, which show up as very characteristic words occurring together. News, Wikipedia or scientific articles work well; novels or poetry not necessarily.

We will use the [British National Corpus "Baby" edition](http://hdl.handle.net/20.500.12024/2553) as a source of news text. Download this file, unzip it and put it into the directory `bnc/` of this notebook.

Here's how we read the corpus in two different forms of representation:
- `texts` contains all tokens as they occur in the text, so that we can still read the articles
- `nouns` contains only lemmas of nouns.

The corpus format is XML and it contains linguistic annotations, so we don't need to process it with spaCy. We can get the lemmas and POS tags straight from the XML using `BeautifulSoup` to parse it.
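
To make the annotation format concrete, here is a minimal sketch on a made-up fragment. Only the names `<wtext>`, `<div>`, `<w>`, `pos` and `hw` are taken from this notebook; the sentence itself is invented. The sketch uses the standard library's `xml.etree.ElementTree` instead of `BeautifulSoup`, purely so that it runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Made-up fragment imitating the annotation: every token is a <w> element
# carrying its part of speech (`pos`) and its lemma (`hw`).
fragment = """
<wtext>
  <div>
    <w pos="SUBST" hw="school">Schools </w>
    <w pos="VERB" hw="close">closed </w>
    <w pos="SUBST" hw="teacher">teachers </w>
  </div>
</wtext>
""".strip()

root = ET.fromstring(fragment)
tokens = [w.text for w in root.iter('w')]                 # tokens as written
noun_lemmas = [w.get('hw') for w in root.iter('w')
               if w.get('pos') == 'SUBST']                # lemmas of nouns only
print(tokens)
print(noun_lemmas)
```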

In [3]:
from bs4 import BeautifulSoup
from collections import defaultdict
import os
from operator import itemgetter

In [4]:
# We use the news texts from the directory bnc/download/Texts/news
# This way of specifying the path should work on all operating systems.
DATA_DIR = os.path.join('bnc', 'download', 'Texts', 'news')

In [5]:
DATA_DIR

'bnc/download/Texts/news'

In [6]:
texts = []                                                  # this will contain complete texts
nouns = []                                                  # this contains only lemmas of nouns
for filename in os.listdir(DATA_DIR):                       # go through all files in the directory
    filepath = os.path.join(DATA_DIR, filename)
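    with open(filepath, encoding='utf-8') as fp:            # assumes UTF-8 files; do not rely on the OS default
        soup = BeautifulSoup(fp, 'html.parser')             # name the parser explicitly to avoid a bs4 warning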
        divs = soup.find('wtext').find_all('div')           # articles are inside <div> tags inside <wtext>
        for node_div in divs:
            doc_text, doc_nouns = [], []
            for node_w in node_div.find_all('w'):           # tokens are inside <w> tags
                if node_w.attrs['pos'] == 'SUBST':          # if the token is a noun...
                    doc_nouns.append(node_w.attrs['hw'])    #   append its lemma to the `doc_nouns` list
                doc_text.append(node_w.string)              # in any case, append the token to the `doc_text` list
            texts.append(doc_text)
            nouns.append(doc_nouns)

The corpus contains a few thousand snippets from news media, which is enough for trying out topic models:

In [7]:
len(texts)

3644

In [8]:
len(nouns)

3644

In [9]:
' '.join(texts[0])



In [10]:
' '.join(nouns[0])



The following two lines convert the corpus to a bag-of-words representation needed by `gensim`:

In [11]:
dictionary = gensim.corpora.Dictionary(nouns)
corpus = [dictionary.doc2bow(d_nouns) for d_nouns in nouns]
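
What "bag of words" means can be sketched in plain Python. This is an illustration on toy data, not gensim's implementation, and the word ids that gensim assigns may differ: every distinct word gets an integer id, and each document becomes a list of `(word_id, count)` pairs. The names `docs`, `token2id`, `to_bow` and `toy_corpus` are made up for this example.

```python
from collections import Counter

docs = [['school', 'teacher', 'school'], ['computer', 'school']]

# Assign an integer id to every distinct word, in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def to_bow(doc):
    """Return sorted (word_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())


toy_corpus = [to_bow(doc) for doc in docs]
print(token2id)
print(toy_corpus)
```

Word order is lost in this representation, which is fine here because the topic model only looks at which words occur in a document and how often.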

This is how we train the model:

In [12]:
# LDA = Latent Dirichlet Allocation
tm = gensim.models.LdaModel(corpus, id2word=dictionary, num_topics=30, passes=10)

Now we can look at which topics were discovered:

In [13]:
for t in tm.show_topics(30):
    print(t)

(0, '0.013*"year" + 0.012*"keeper" + 0.008*"golf" + 0.008*"eagle" + 0.007*"lady" + 0.007*"prize" + 0.007*"grange" + 0.007*"tournament" + 0.007*"matt" + 0.006*"march"')
(1, '0.018*"gazza" + 0.012*"world" + 0.007*"j." + 0.007*"alton" + 0.006*"umpire" + 0.006*"sentence" + 0.006*"crown" + 0.006*"lazio" + 0.006*"sport" + 0.006*"p."')
(2, '0.031*"cricket" + 0.025*"county" + 0.022*"lamb" + 0.014*"surrey" + 0.012*"ball" + 0.011*"club" + 0.010*"warrington" + 0.010*"tour" + 0.009*"sunday" + 0.008*"bowler"')
(3, '0.078*"school" + 0.022*"child" + 0.018*"teacher" + 0.015*"spurs" + 0.014*"parent" + 0.012*"education" + 0.011*"pupil" + 0.007*"millwall" + 0.006*"atkinson" + 0.006*"day"')
(4, '0.032*"festival" + 0.019*"music" + 0.013*"opera" + 0.010*"composer" + 0.008*"bat" + 0.007*"work" + 0.007*"piano" + 0.006*"concerto" + 0.006*"college" + 0.005*"mcallister"')
(5, '0.031*"system" + 0.020*"computer" + 0.017*"software" + 0.014*"user" + 0.013*"airline" + 0.011*"pc" + 0.011*"us" + 0.010*"gooch" + 0.009*"

And which topics a document consists of:

In [14]:
tm.get_document_topics(corpus[0])

[(14, 0.7794846), (28, 0.20977327)]
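
Note that the two proportions above do not add up to 1. A likely explanation is that gensim leaves out topics whose proportion falls below the model's `minimum_probability` (0.01 by default, as far as we know), so the rest of the mass is spread thinly over the topics that are not listed. A quick check on the numbers shown above:

```python
# Output of tm.get_document_topics(corpus[0]) as shown above.
doc_topics = [(14, 0.7794846), (28, 0.20977327)]

listed = sum(p for _, p in doc_topics)      # mass of the topics that are listed
remaining = 1.0 - listed                    # mass of all topics that were left out
print(f'listed: {listed:.4f}, spread over the other topics: {remaining:.4f}')
```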

In [15]:
dir(tm)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_adapt_by_suffix',
 '_apply',
 '_load_specials',
 '_save_specials',
 '_smart_save',
 'add_lifecycle_event',
 'alpha',
 'bound',
 'callbacks',
 'chunksize',
 'clear',
 'decay',
 'diff',
 'dispatcher',
 'distributed',
 'do_estep',
 'do_mstep',
 'dtype',
 'eta',
 'eval_every',
 'expElogbeta',
 'gamma_threshold',
 'get_document_topics',
 'get_term_topics',
 'get_topic_terms',
 'get_topics',
 'id2word',
 'inference',
 'init_dir_prior',
 'iterations',
 'lifecycle_events',
 'load',
 'log_perplexity',
 'minimum_phi_value',
 'minimum_probability',
 'num_terms',
 'num_topics',
 'num_updates',
 'numworkers',
 'offse

In [16]:
help(tm.get_topics)

Help on method get_topics in module gensim.models.ldamodel:

get_topics() method of gensim.models.ldamodel.LdaModel instance
    Get the term-topic matrix learned during inference.
    
    Returns
    -------
    numpy.ndarray
        The probability for each word in each topic, shape (`num_topics`, `vocabulary_size`).
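
How to read such a matrix can be sketched on toy data: each row is one topic, each column one word, and the most characteristic words of a topic are the columns with the highest values in its row. The matrix, the vocabulary and the helper `top_words` below are invented for illustration; the same helper should also work on rows of `tm.get_topics()` together with `tm.id2word`.

```python
# Toy term-topic matrix: 2 topics (rows) x 4 words (columns); each row sums to 1.
toy_id2word = {0: 'school', 1: 'teacher', 2: 'computer', 3: 'software'}
toy_topics = [
    [0.6, 0.3, 0.05, 0.05],
    [0.1, 0.1, 0.5, 0.3],
]


def top_words(row, id2word, n=2):
    """Return the n most probable (word, probability) pairs of one topic row."""
    ranked = sorted(enumerate(row), key=lambda pair: pair[1], reverse=True)
    return [(id2word[i], p) for i, p in ranked[:n]]


for topic_id, row in enumerate(toy_topics):
    print(topic_id, top_words(row, toy_id2word))
```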



# In-class exercises

## Ex 1

Count the "frequency" of each topic. For each document, add the proportion of this topic in the document to the topic's "frequency".

For example if the document only contains 0.1 of a certain topic, add 0.1 to that topic's "frequency".

In [18]:
topic_freqs = defaultdict(float)                # missing topics start at 0.0
for doc in corpus:
    for t, p in tm.get_document_topics(doc):
        topic_freqs[t] += p

In [20]:
topic_freqs_lst = sorted(topic_freqs.items(), reverse=True, key=itemgetter(1))

In [21]:
topic_freqs_lst

[(6, 326.7784729376435),
 (14, 274.9874647241086),
 (9, 258.84519710112363),
 (27, 224.5393017232418),
 (29, 205.4553927835077),
 (12, 203.45565384346992),
 (25, 191.68400639854372),
 (21, 182.3839237531647),
 (20, 155.87709568534046),
 (10, 128.73715764377266),
 (26, 115.42116738576442),
 (8, 112.79000183008611),
 (16, 98.64253896102309),
 (19, 92.46972061134875),
 (28, 80.63266435917467),
 (13, 79.87558763846755),
 (23, 74.43985425308347),
 (24, 72.67524741310626),
 (0, 72.29655981622636),
 (1, 65.04983016476035),
 (11, 63.79013953637332),
 (2, 62.048007239587605),
 (3, 60.70788324903697),
 (15, 58.6454622419551),
 (7, 56.06065219640732),
 (17, 54.04287434462458),
 (18, 53.397389793768525),
 (22, 52.97327979374677),
 (4, 44.447271812707186),
 (5, 40.976140240207314)]
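
Raw sums like these are hard to compare at a glance. One option is to divide each sum by the total, which gives every topic's share of the whole corpus. The helper `normalise` and the toy numbers below are made up for illustration; in the notebook one would pass `topic_freqs` instead.

```python
def normalise(freqs):
    """Turn summed topic proportions into shares that add up to 1."""
    total = sum(freqs.values())
    return {topic: value / total for topic, value in freqs.items()}


# Toy input with invented numbers.
toy_freqs = {6: 30.0, 14: 15.0, 9: 5.0}
shares = normalise(toy_freqs)
print(shares)
```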

## Ex 2

Write a function `docs_by_topic(tm, corpus)` that returns a dictionary mapping each topic ID to a list of pairs `(document_id, proportion)`: the documents that contain this topic, together with the topic's proportion in each of them. For example:

In [32]:
def docs_by_topic(tm, corpus):
    dt = defaultdict(list)
    # result: dictionary topic -> [(document_id, proportion), ...]
    for i, doc in enumerate(corpus):                 # enumerate() provides the document index
        for t, p in tm.get_document_topics(doc):
            dt[t].append((i, p))
    # Sort each topic's list by proportion, in descending order.
    # The dict comprehension below is another way of saying:
    #for key in dt:
    #    dt[key] = sorted(dt[key], reverse=True, key=itemgetter(1))
    dt = { key: sorted(val, reverse=True, key=itemgetter(1))
           for key, val in dt.items() }
    return dt

In [33]:
dt = docs_by_topic(tm, corpus)

Now e.g. `dt[1]` is a list of pairs `(document_idx, proportion)` meaning that document number `document_idx` contains `proportion` of topic 1. The list should be sorted by `proportion` in descending order:

In [36]:
dt[14][:10]

[(693, 0.97226167),
 (1092, 0.96808094),
 (699, 0.94081396),
 (1422, 0.9394658),
 (1999, 0.93409973),
 (706, 0.9155508),
 (574, 0.9122305),
 (470, 0.91176087),
 (639, 0.8865332),
 (3629, 0.87490684)]

In [40]:
for i, p in dt[14][:10]:
    print(' '.join(texts[i]))
    print()

Flogging  Briton  risks  100  lashes Christian  Gysin A  BRITON  sentenced  by  a  Saudi  court  to  a  flogging  for  allegedly  swearing  at  his  staff  may  have  the  sentence  doubled  for  appealing Hospital  executive  David  Brown 32 could  suffer  100  strokes  from  a  6ft  bamboo  cane  if  his  attempt  to  have  the  ruling  reversed  fails The  Foreign  Office aware  of  a  new  £40  billion  arms  sale  to  the  Saudis last  night  played  down  the  case Senior  British  officials  in  Jeddah  refused  to  discuss  Mr  Brown 's  plight Consul  John  Dimmock  said Why  are  you  ringing  me  about  this Why  is  there  so  much  publicity  about  it Foreign  Office  sources  said  public  floggings  were  designed  as  a  humiliation and  not  to  draw  blood The  man  who  beats  the  victim  holds  a  Koran  under  his  arm an  official  said That  limits  the  force Mr  Brown  went  to  Saudi  Arabia  last  February  to  increase  efficiency  at  the  British-run  Ki

## Ex 3

Write a function `explain(tm, doc)` that:
* for each word (i.e. noun) occurring in the document, determines in which topic this word is the most likely,
* returns a dictionary: topic_id -> list of pairs: `(word, probability, frequency_in_document)`, sorted by `probability`.

Use the function to "explain" a document with several topics!

*Hint*: you can use `tm.get_term_topics()` and `tm.id2word` like this:

In [44]:
tm.id2word[11]

'education'

In [92]:
tm.get_term_topics(tm.id2word[148])

[(17, 0.015026192), (19, 0.011623424)]

In [41]:
tm.get_term_topics('business')

[(8, 0.01741376), (20, 0.015308813)]

In [48]:
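def explain(tm, doc):
    result = defaultdict(list)
    for w_id, freq in doc:                          # doc is a bag of words: (word_id, frequency) pairs
        word = tm.id2word[w_id]
        # Note: the word is recorded under every topic that get_term_topics() returns,
        # not only under the most likely one.
        for t, p in tm.get_term_topics(word):
            result[t].append((word, p, freq))
    for t in result:                                # sort each topic's words by probability, descending
        result[t].sort(reverse=True, key=itemgetter(1))
    return result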

In [49]:
explain(tm, corpus[0])

defaultdict(list,
            {3: [('education', 0.012065857, 3)],
             12: [('man', 0.024471885, 2)],
             14: [('service', 0.015782503, 1)],
             1: [('world', 0.012125934, 1)],
             25: [('world', 0.010695182, 1)],
             28: [('world', 0.012888271, 1)]})
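
The exercise asks for the topic in which each word is *most* likely, whereas the output above lists 'world' under three topics. A possible variant that keeps only the single most likely topic per word is sketched below. The name `explain_most_likely` and its signature are made up for this example; `term_topics` stands for any callable that behaves like `tm.get_term_topics`. The stand-in data reuses probabilities from the output above, while the word ids and the word 'year' with its frequency are invented.

```python
from collections import defaultdict
from operator import itemgetter


def explain_most_likely(term_topics, id2word, doc):
    """Like `explain`, but files every word under its single most likely topic.

    `term_topics` maps a word to a list of (topic_id, probability) pairs.
    """
    result = defaultdict(list)
    for w_id, freq in doc:
        word = id2word[w_id]
        candidates = term_topics(word)
        if not candidates:          # the word is not characteristic of any topic
            continue
        t, p = max(candidates, key=itemgetter(1))
        result[t].append((word, p, freq))
    for t in result:
        result[t].sort(reverse=True, key=itemgetter(1))
    return dict(result)


# Stand-ins for the trained model (partly hypothetical, see above).
fake_term_topics = {
    'education': [(3, 0.012065857)],
    'world': [(1, 0.012125934), (25, 0.010695182), (28, 0.012888271)],
}
fake_id2word = {0: 'education', 1: 'world', 2: 'year'}
fake_doc = [(0, 3), (1, 1), (2, 4)]

explained = explain_most_likely(lambda w: fake_term_topics.get(w, []),
                                fake_id2word, fake_doc)
print(explained)
```

With the real model, the call would presumably look like `explain_most_likely(tm.get_term_topics, tm.id2word, corpus[0])`.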

# Optional homework

## Ex 4

Train a topic model with the YLE news data that we used in previous classes:
* you can use the code from previous notebooks to read the file,
* run spaCy on the texts to get lemmas and POS tags,
* filter the texts to select only lemmas of nouns,
* train the model as is done here,
* try 10, 20 and 30 topics. Which one looks best?

The results will probably be poor because there is too little text. But still we might find some meaningful topics.