# Guided LDA using gensim

For a Guided LDA model, we provide the model with a list of topics and seed words. More often than not, the topics we get from a standard, unsupervised LDA model are not to our satisfaction. Guided LDA can nudge the topics in the direction we want them to converge. We can call this a **semi-supervised LDA model**.

Inspiration for this notebook was provided by: https://gist.github.com/scign/2dda76c292ef76943e0cd9ff8d5a174a

In [23]:
# Import required libraries
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import gensim
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

There are a couple of additional data packages that nltk needs for the POS tagging feature and the WordNet lemmatizer. We have to make sure those are downloaded.

In [24]:
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

Let's import the datasets from the EDA notebook.

In [25]:
boris_speech = pd.read_pickle("Pickled Files/boris_speech.pkl")
may_speech = pd.read_pickle("Pickled Files/may_speech.pkl")

In [26]:
boris_speech.head(3)

Unnamed: 0,Sentence
0,i am pleased that this campaign has so far bee...
1,that rocked me at first and then i decided t...
2,for many of us who are now deeply sceptical t...


In [27]:
may_speech.head(3)

Unnamed: 0,Sentence
0,today i want to talk about the united kingdom ...
1,but before i start i want to make clear that ...
2,sovereignty and membership of multilateral ins...


Here we'll create a corpus of text strings for each speech.

In [28]:
boris_corp_list = list(boris_speech['Sentence'])
may_corp_list = list(may_speech['Sentence'])

for sentence in boris_corp_list[:3]:
    print(sentence)
print("\n")
print(f"length of boris speech: {len(boris_corp_list)} sentences\n")

for sentence in may_corp_list[:3]:
    print(sentence)
print("\n")
print(f"length of may speech: {len(may_corp_list)} sentences")

i am pleased that this campaign has so far been relatively free of personal abuse   and long may it so remain   but the other day someone insulted me in terms that were redolent of     s soviet russia  he said that i had no right to vote leave  because i was in fact a  liberal cosmopolitan  
that rocked me  at first  and then i decided that as insults go  i didn t mind it at all   because it was probably true  and so i want this morning to explain why the campaign to leave the eu is attracting other liberal spirits and people i admire such as david owen  and gisela stuart  nigel lawson  john longworth   people who love europe and who feel at home on the continent  but whose attitudes towards the project of european union have been hardening over time 
for many of us who are now deeply sceptical  the evolution has been roughly the same  we began decades ago to query the anti democratic absurdities of the eu  then we began to campaign for reform  and were excited in      by the prime min

We'll now perform some lemmatization using the nltk library in order to transform words into their root form (lemma).

Identifying the part of speech of any particular word is not easy, but nltk again comes to the rescue by providing a part-of-speech tagger that returns a suitable tag for each word in a given text.

The twist is that the nltk.pos_tag function returns the Penn Treebank tag for the word, but we only need to know whether the word is a noun, verb, adjective or adverb. We need a short simplification routine to translate from the Penn tag to a simpler tag.

In [29]:
# simplify Penn tags to n (NOUN), v (VERB), a (ADJECTIVE) or r (ADVERB)
def simplify(penn_tag):
    pre = penn_tag[0]
    if (pre == 'J'):
        return 'a'
    elif (pre == 'R'):
        return 'r'
    elif (pre == 'V'):
        return 'v'
    else:
        return 'n'
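
As a quick illustration of the mapping, the sketch below applies the same first-letter rule to a few hand-written (token, Penn tag) pairs. The pairs and the names `PENN_PREFIX_TO_WORDNET` and `simplify_tag` are made up for illustration; in the notebook the tags come from `nltk.pos_tag`.

```python
# Hypothetical lookup table equivalent to the if/elif chain in simplify().
PENN_PREFIX_TO_WORDNET = {'J': 'a', 'R': 'r', 'V': 'v'}

def simplify_tag(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    return PENN_PREFIX_TO_WORDNET.get(penn_tag[0], 'n')

# Hand-written (token, Penn tag) pairs; real tags would come from nltk.pos_tag.
tagged = [('insulted', 'VBD'), ('personal', 'JJ'), ('relatively', 'RB'), ('campaign', 'NN')]
simplified = [(tok, simplify_tag(tag)) for tok, tag in tagged]
print(simplified)
```

Anything that is not an adjective, adverb or verb falls through to noun, which is also the default part of speech that the WordNet lemmatizer assumes.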

Now we can perform some preprocessing on the two corpora.

In [30]:
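def preprocess(text):
    # a set makes the stopword membership test fast
    stop_words = set(stopwords.words('english'))
    # lowercase, tokenize and strip accents
    toks = gensim.utils.simple_preprocess(str(text), deacc=True)
    wn = WordNetLemmatizer()
    # tag the full token list, then drop stopwords and lemmatize the rest
    return [wn.lemmatize(tok, simplify(pos)) for tok, pos in nltk.pos_tag(toks) if tok not in stop_words]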

In [31]:
boris_corp = [preprocess(line) for line in boris_corp_list]
print("Boris:")
print(boris_corp[0])
print(boris_corp[1])
print(boris_corp[2])

print("May:")
may_corp = [preprocess(line) for line in may_corp_list]
print(may_corp[0])
print(may_corp[1])
print(may_corp[2])


Boris:
['pleased', 'campaign', 'far', 'relatively', 'free', 'personal', 'abuse', 'long', 'may', 'remain', 'day', 'someone', 'insult', 'term', 'redolent', 'soviet', 'russia', 'say', 'right', 'vote', 'leave', 'fact', 'liberal', 'cosmopolitan']
['rock', 'first', 'decide', 'insult', 'go', 'mind', 'probably', 'true', 'want', 'morning', 'explain', 'campaign', 'leave', 'eu', 'attract', 'liberal', 'spirit', 'people', 'admire', 'david', 'owen', 'gisela', 'stuart', 'nigel', 'lawson', 'john', 'longworth', 'people', 'love', 'europe', 'feel', 'home', 'continent', 'whose', 'attitude', 'towards', 'project', 'european', 'union', 'harden', 'time']
['many', 'u', 'deeply', 'sceptical', 'evolution', 'roughly', 'begin', 'decade', 'ago', 'query', 'anti', 'democratic', 'absurdity', 'eu', 'begin', 'campaign', 'reform', 'excite', 'prime', 'minister', 'bloomberg', 'speech', 'quietly', 'despair', 'reform', 'forthcoming', 'thanks', 'referendum', 'give', 'country', 'david', 'cameron', 'find', 'door', 'magically', 

It looks like the lemmatizing functions have done their job well, putting most words into their root form according to their POS tag. We've also removed the remaining stopwords.

The LDA algorithm in gensim reads documents in a **bag of words** format: each distinct word in a sentence is listed once, together with the number of times it occurs in that sentence. The <span style="color: red; background-color: grey">*doc2bow*</span> method of a gensim dictionary converts a tokenized sentence into a list of (token id, count) tuples in exactly this format. First we build a dictionary for each speech:

In [32]:
boris_dict = gensim.corpora.Dictionary(boris_corp)
may_dict = gensim.corpora.Dictionary(may_corp)

print(len(boris_dict))
print(len(may_dict))

1110
964


We need to make sure that our proposed seed words are actually in each dictionary, otherwise they cannot be seeded later when we run the model. If they're not in the dictionary we can add them using the **gensim.corpora.Dictionary.add_documents** method.

We can also modify our dictionaries to remove words that will not be insightful to us and would throw off our model, such as 'eu', 'european', etc.

##### Add Words

In [33]:
proposed_seed_words = [
    'trade','market', 'economic', 'single', 'export',
    'immigrant', 'immigration', 'movement', 'border', 'population',
    'sovereignty', 'control', 'power','democracy', 'democratic']

In [34]:
# check whether every proposed seed word is present in each dictionary
print(all(word in boris_dict.token2id for word in proposed_seed_words))
print(all(word in may_dict.token2id for word in proposed_seed_words))
# False means at least one seed word is missing from that dictionary, so we'll check each word individually.

False
False
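
A single `False` doesn't say which words are absent. A minimal sketch of a more direct check is below; the small `token2id` mapping and the `seeds` list are hypothetical stand-ins for `boris_dict.token2id` and `proposed_seed_words`.

```python
# Hypothetical stand-in for boris_dict.token2id (token -> integer id).
token2id = {'trade': 0, 'market': 1, 'border': 2, 'control': 3}
seeds = ['trade', 'market', 'immigrant', 'border', 'sovereignty', 'control']

# Seed words absent from the vocabulary, kept in their original order.
missing = [word for word in seeds if word not in token2id]
print(missing)  # ['immigrant', 'sovereignty']
```

The resulting list could be passed straight to `add_documents` as a one-document list, e.g. `[missing]`.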


In [35]:
for word in proposed_seed_words:
    print(word, word in boris_dict.token2id)

trade True
market True
economic True
single True
export True
immigrant False
immigration True
movement True
border True
population True
sovereignty False
control True
power True
democracy True
democratic True


In [36]:
boris_dict.add_documents([["immigrant", "sovereignty"]])

In [37]:
for word in proposed_seed_words:
    print(word, word in boris_dict.token2id)

trade True
market True
economic True
single True
export True
immigrant True
immigration True
movement True
border True
population True
sovereignty True
control True
power True
democracy True
democratic True


In [38]:
for word in proposed_seed_words:
    print(word, word in may_dict.token2id)

trade True
market True
economic True
single True
export True
immigrant True
immigration False
movement True
border True
population True
sovereignty True
control True
power True
democracy True
democratic True


In [39]:
may_dict.add_documents([["immigration"]])

In [40]:
for word in proposed_seed_words:
    print(word, word in may_dict.token2id)

trade True
market True
economic True
single True
export True
immigrant True
immigration True
movement True
border True
population True
sovereignty True
control True
power True
democracy True
democratic True


##### Remove words

In [41]:
# https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary

# remove frequent or uninformative words
boris_dict.filter_tokens(bad_ids=[boris_dict.token2id['eu']])
boris_dict.filter_tokens(bad_ids=[boris_dict.token2id['european']])
boris_dict.filter_tokens(bad_ids=[boris_dict.token2id['europe']])
boris_dict.filter_tokens(bad_ids=[boris_dict.token2id['country']])
boris_dict.filter_tokens(bad_ids=[boris_dict.token2id['britain']])
boris_dict.filter_tokens(bad_ids=[boris_dict.token2id['uk']])
boris_dict.filter_tokens(bad_ids=[boris_dict.token2id['government']])
boris_dict.filter_tokens(bad_ids=[boris_dict.token2id['someone']])

In [42]:
# https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary

# remove frequent or uninformative words
may_dict.filter_tokens(bad_ids=[may_dict.token2id['eu']])
may_dict.filter_tokens(bad_ids=[may_dict.token2id['european']])
may_dict.filter_tokens(bad_ids=[may_dict.token2id['europe']])
may_dict.filter_tokens(bad_ids=[may_dict.token2id['country']])
may_dict.filter_tokens(bad_ids=[may_dict.token2id['britain']])
may_dict.filter_tokens(bad_ids=[may_dict.token2id['uk']])
may_dict.filter_tokens(bad_ids=[may_dict.token2id['government']])
#may_dict.filter_tokens(bad_ids=[may_dict.token2id['someone']])

In [43]:
boris_bow = [boris_dict.doc2bow(line) for line in boris_corp]
may_bow = [may_dict.doc2bow(line) for line in may_corp]

In [44]:
print(boris_bow[1][:10])
print(may_bow[1][:10])

[(1, 1), (7, 1), (8, 1), (9, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1)]
[(1, 1), (7, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1)]
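
Each pair printed above is (token id, count). The sketch below mimics that output using only the standard library; `toy_doc2bow` and the small vocabulary are hypothetical stand-ins, not gensim's implementation. It also illustrates why filtered words don't show up in the bag of words, on the assumption that `doc2bow` skips tokens that are not in the dictionary (its default behaviour as far as I know).

```python
from collections import Counter

def toy_doc2bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens not in token2id."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())

# Hypothetical vocabulary; 'eu' is deliberately absent, as if it had been filtered out.
token2id = {'campaign': 0, 'leave': 1, 'people': 2}
doc = ['people', 'campaign', 'leave', 'eu', 'people']
bow = toy_doc2bow(doc, token2id)
print(bow)  # [(0, 1), (1, 1), (2, 2)]
```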


If we make any changes to the dictionary, such as adding or removing words, we have to make sure the corpus reflects those changes. The bag-of-words corpus must be built (or rebuilt) after the dictionary is modified, as we did above; otherwise the indices won't align and we will get an **"index out of bounds"** error when we run the topic model. We also strip the removed words from the corpus.

In [45]:
removed_words = ['eu', 'european','europe','country', 'britain', 'uk', 'government', 'someone']

In [46]:
# drop the removed words from each tokenized document
boris_corp = [[tok for tok in doc if tok not in removed_words] for doc in boris_corp]

In [47]:
# drop the removed words from each tokenized document
may_corp = [[tok for tok in doc if tok not in removed_words] for doc in may_corp]

In [48]:
print(len(boris_dict))
print(len(boris_corp_list))
print(len(may_dict))
print(len(may_corp_list))

1104
100
958
74


In [49]:
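# confirm that none of the removed words survive in the bag-of-words corpora
print(any(boris_dict[tok_id] in removed_words for doc in boris_bow for tok_id, _ in doc))
print(any(may_dict[tok_id] in removed_words for doc in may_bow for tok_id, _ in doc))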

False
False


We'll set up a function that displays the probability distribution calculated by the algorithm so that we can see how terms have been allocated across topics.

We will run the following function for each test. It trains a model with our prior distribution (or 'auto'), prints the top terms for each topic and shows the topic allocation for the first 15 sentences of our corpus.

In [62]:
def run_eta(bow_select, eta, corp_list, dictionary, ntopics, print_topics=True, print_dist=True):
    np.random.seed(42) # set the random seed for repeatability

    with np.errstate(divide='ignore'):  # ignore divide-by-zero warnings
        model = gensim.models.ldamodel.LdaModel(
            corpus=bow_select, id2word=dictionary, num_topics=ntopics,
            random_state=42, chunksize=100, eta=eta,
            eval_every=1, update_every=1,
            passes=150, alpha='auto', per_word_topics=False, iterations=150)
    # log_perplexity returns the per-word likelihood bound, not the perplexity itself
    print('Perplexity: {:.2f}'.format(model.log_perplexity(bow_select)))
    if print_topics:
        # display the top terms for each topic
        for topic in range(ntopics):
            print('Topic {}: {}'.format(topic, [dictionary[w] for w,p in model.get_topic_terms(topic, topn=10)]))
    if print_dist:
        # display the topic probabilities for each document
        for line,bag in zip(corp_list[:15],bow_select):
            doc_topics = ['({}, {:.1%})'.format(topic, prob) for topic,prob in model.get_document_topics(bag)]
            print('{} {}'.format(line, doc_topics))
    return model
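
The number that `run_eta` labels "Perplexity" is the value returned by `log_perplexity`, which is a per-word likelihood bound rather than a perplexity. As far as I can tell, gensim derives its own perplexity estimate as 2 raised to the negative of that bound. The helper below is a hypothetical sketch of that conversion; `bound_to_perplexity` is not a gensim function, and -6.59 is only an example value.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word (base 2) likelihood bound into a perplexity estimate."""
    return 2 ** (-per_word_bound)

# Example: a bound of -6.59 corresponds to a perplexity of roughly 2**6.59.
example_perplexity = bound_to_perplexity(-6.59)
print(round(example_perplexity, 1))
```

A bound closer to zero therefore means a lower perplexity.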

The model parameter that holds the prior on how terms are allocated to topics is called **<span style="color: black; background-color: grey">eta</span>** in the gensim implementation.

In [51]:
# check may with auto 
may_test = run_eta(may_bow,'auto',corp_list=may_corp_list,dictionary=may_dict, ntopics=3)

Perplexity: -6.59
Topic 0: ['would', 'per', 'state', 'market', 'cent', 'single', 'trade', 'risk', 'way', 'could']
Topic 1: ['right', 'human', 'court', 'parliament', 'trade', 'world', 'would', 'leave', 'could', 'economy']
Topic 2: ['would', 'u', 'union', 'make', 'remain', 'member', 'membership', 'trade', 'want', 'people']
today i want to talk about the united kingdom  our place in the world and our membership of the european union  ['(2, 99.7%)']
but before i start  i want to make clear that   as you can see   this is not a rally   it will not be an attack or even a criticism of people who take a different view to me   it will simply be my analysis of the rights and wrongs  the opportunities and risks  of our membership of the eu  ['(2, 99.9%)']
sovereignty and membership of multilateral institutions ['(0, 66.7%)', '(2, 33.0%)']
in essence  the question the country has to answer on   rd june   whether to leave or remain   is about how we maximise britain s security  prosperity and influ

When we use the keyword **<span style="color: black; background-color: grey">'auto'</span>**, gensim learns this prior from the corpus itself, without any guidance from us. As we can see, this is not going to be an accurate technique for our analysis: there is too much overlap between the topics ('trade' and 'would' appear in all three). We'll now try to give the model a nudge by presenting it with some seed words.

We can use the top 10 words from the EDA notebook for each politician to help us create the topic categories. I am also aware of the main topics from my studies of Brexit while at university!

In [59]:
seed_words = {
    'trade':0, 'market':0, 'economic':0, 'single':0, 'export':0,
    'immigrant':1, 'immigration':1, 'movement':1, 'border':1, 'population':1,
    'sovereignty':2, 'control':2, 'power':2,'democracy':2, "democratic":2
}

To define a prior distribution, we need to create a numpy matrix with one row per topic and one column per term. We pre-populate every element with 1, then put a really high number in the elements that correspond to our 'guided' term-topic pairs, and finally normalize each term's column so that it sums to 1 across topics.

In [60]:
def create_eta(seed_words, etadict, ntopics):
    eta = np.full(shape=(ntopics, len(etadict)), fill_value=1.0) # create a (ntopics, nterms) matrix and fill with 1
    for word, topic in seed_words.items(): # for each seed word and its target topic
        keyindex = etadict.token2id.get(word) # look up the word's id in the dictionary
        if keyindex is not None: # if it's in the dictionary
            eta[topic, keyindex] = 1e7  # put a large number in there
    eta = np.divide(eta, eta.sum(axis=0)) # normalize each term's column so it sums to 1 over all topics
    return eta
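
To see what the resulting matrix looks like, here is a minimal sketch on a made-up four-word vocabulary. The `token2id` mapping and `toy_seeds` are hypothetical; the steps mirror `create_eta` but use a plain dict instead of a gensim dictionary.

```python
import numpy as np

# Hypothetical toy vocabulary (token -> column index) and seed assignments (token -> topic).
token2id = {'trade': 0, 'border': 1, 'power': 2, 'year': 3}
toy_seeds = {'trade': 0, 'border': 1, 'power': 2}
ntopics = 3

eta = np.ones((ntopics, len(token2id)))   # flat prior: every entry starts at 1
for word, topic in toy_seeds.items():
    eta[topic, token2id[word]] = 1e7      # boost the seeded (topic, term) cell
eta /= eta.sum(axis=0)                    # each column now sums to 1 across topics

print(eta.round(3))
```

After normalization a seeded term puts almost all of its prior weight on its assigned topic, while an unseeded term such as 'year' stays evenly split at 1/3 per topic.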

##### Boris ETA Run

In [63]:
boris_eta = create_eta(seed_words, boris_dict, 3)
run_eta(bow_select=boris_bow, eta=boris_eta, corp_list=boris_corp_list,dictionary=boris_dict,ntopics=3)

Perplexity: -0.93
Topic 0: ['need', 'common', 'change', 'single', 'year', 'people', 'leave', 'french', 'new', 'policy']
Topic 1: ['people', 'remain', 'economic', 'trade', 'control', 'campaign', 'leave', 'take', 'per', 'go']
Topic 2: ['year', 'court', 'trade', 'single', 'market', 'change', 'get', 'deal', 'right', 'good']
i am pleased that this campaign has so far been relatively free of personal abuse   and long may it so remain   but the other day someone insulted me in terms that were redolent of     s soviet russia  he said that i had no right to vote leave  because i was in fact a  liberal cosmopolitan   ['(1, 99.9%)']
that rocked me  at first  and then i decided that as insults go  i didn t mind it at all   because it was probably true  and so i want this morning to explain why the campaign to leave the eu is attracting other liberal spirits and people i admire such as david owen  and gisela stuart  nigel lawson  john longworth   people who love europe and who feel at home on the c

<gensim.models.ldamodel.LdaModel at 0x1527aae7df0>

##### May ETA Run

In [66]:
may_eta = create_eta(seed_words, may_dict, 3)
run_eta(bow_select=may_bow, eta=may_eta, corp_list=may_corp_list,dictionary=may_dict,ntopics=3)

Perplexity: -2.27
Topic 0: ['would', 'market', 'state', 'per', 'risk', 'trade', 'single', 'cent', 'way', 'could']
Topic 1: ['right', 'human', 'court', 'parliament', 'trade', 'world', 'would', 'leave', 'could', 'economy']
Topic 2: ['would', 'u', 'union', 'make', 'remain', 'member', 'membership', 'want', 'trade', 'people']
today i want to talk about the united kingdom  our place in the world and our membership of the european union  ['(2, 99.7%)']
but before i start  i want to make clear that   as you can see   this is not a rally   it will not be an attack or even a criticism of people who take a different view to me   it will simply be my analysis of the rights and wrongs  the opportunities and risks  of our membership of the eu  ['(2, 99.9%)']
sovereignty and membership of multilateral institutions ['(0, 63.2%)', '(2, 36.5%)']
in essence  the question the country has to answer on   rd june   whether to leave or remain   is about how we maximise britain s security  prosperity and influ

<gensim.models.ldamodel.LdaModel at 0x1527b0b2b80>

Next steps:
* Visualise results using pyLDAvis