# Guided LDA using gensim

For a Guided LDA model, we provide the model with a list of topics and a set of seed words for each one. More often than not, the topics we get from a standard, unsupervised LDA model are not to our satisfaction. Guided LDA nudges the topics in the direction we want them to converge. We can call this a **semi-supervised LDA model**.

Inspiration for this notebook was provided by: https://gist.github.com/scign/2dda76c292ef76943e0cd9ff8d5a174a
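
To make the idea of seeding concrete, here is a minimal sketch of one common way to guide an LDA model: building a topic-word prior in which each seed word gets a larger weight in its chosen topic. The toy vocabulary, the seed words, the helper name `build_seed_prior`, and the `base` and `boost` weights are all hypothetical values chosen for illustration; they are not the ones used later in this notebook.

```python
# Hypothetical toy vocabulary (token -> id), standing in for a gensim Dictionary's token2id
token2id = {"trade": 0, "market": 1, "border": 2, "immigration": 3, "law": 4}

# Hypothetical seed words, keyed by topic index
seed_topics = {
    0: ["trade", "market"],
    1: ["border", "immigration"],
}

def build_seed_prior(token2id, seed_topics, num_topics, base=0.01, boost=1.0):
    """Return a num_topics x vocab_size nested list of prior weights."""
    # Every word starts with the same small weight in every topic
    prior = [[base] * len(token2id) for _ in range(num_topics)]
    for topic, words in seed_topics.items():
        for word in words:
            if word in token2id:  # ignore seed words missing from the vocabulary
                prior[topic][token2id[word]] = boost
    return prior

eta = build_seed_prior(token2id, seed_topics, num_topics=2)
for row in eta:
    print(row)
```

Converted with `np.array(eta)`, a matrix of this shape (number of topics by vocabulary size) can be passed as the `eta` argument when constructing `gensim.models.LdaModel`. How strongly to boost the seed words is a tuning choice.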

In [3]:
# Import required libraries
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import gensim
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk needs a few additional data packages for the POS tagging feature and the WordNet model. We have to make sure those are downloaded.

In [30]:
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...


True

Let's import the datasets from the EDA notebook.

In [9]:
boris_speech = pd.read_pickle("Pickled Files/boris_speech.pkl")
may_speech = pd.read_pickle("Pickled Files/may_speech.pkl")

In [10]:
boris_speech.head(3)

Unnamed: 0,Sentence
0,i am pleased that this campaign has so far bee...
1,that rocked me at first and then i decided t...
2,for many of us who are now deeply sceptical t...


In [11]:
may_speech.head(3)

Unnamed: 0,Sentence
0,today i want to talk about the united kingdom ...
1,but before i start i want to make clear that ...
2,sovereignty and membership of multilateral ins...


Here we'll create a corpus of text strings for each speech.

In [26]:
boris_corp = list(boris_speech['Sentence'])
may_corp = list(may_speech['Sentence'])

for sentence in boris_corp[:3]:
    print(sentence)
print("\n")
print(f"length of boris speech: {len(boris_corp)} sentences\n")

for sentence in may_corp[:3]:
    print(sentence)
print("\n")
print(f"length of may speech: {len(may_corp)} sentences")

i am pleased that this campaign has so far been relatively free of personal abuse   and long may it so remain   but the other day someone insulted me in terms that were redolent of     s soviet russia  he said that i had no right to vote leave  because i was in fact a  liberal cosmopolitan  
that rocked me  at first  and then i decided that as insults go  i didn t mind it at all   because it was probably true  and so i want this morning to explain why the campaign to leave the eu is attracting other liberal spirits and people i admire such as david owen  and gisela stuart  nigel lawson  john longworth   people who love europe and who feel at home on the continent  but whose attitudes towards the project of european union have been hardening over time 
for many of us who are now deeply sceptical  the evolution has been roughly the same  we began decades ago to query the anti democratic absurdities of the eu  then we began to campaign for reform  and were excited in      by the prime min

We'll now lemmatize the text using the nltk library to reduce words to their root form (lemma).

Identifying the part of speech of a particular word is not easy, but nltk again comes to the rescue with a part-of-speech tagger that returns a suitable tag for each word in a given text.

The twist is that the nltk.pos_tag function returns the Penn Treebank tag for each word, but we only need to know whether the word is a noun, verb, adjective or adverb. We need a short simplification routine to translate from the Penn tag to a simpler tag. Any tag that is not an adjective, adverb or verb is treated as a noun.

In [27]:
# simplify Penn tags to n (NOUN), v (VERB), a (ADJECTIVE) or r (ADVERB)
def simplify(penn_tag):
    pre = penn_tag[0]
    if (pre == 'J'):
        return 'a'
    elif (pre == 'R'):
        return 'r'
    elif (pre == 'V'):
        return 'v'
    else:
        return 'n'
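
To see what this mapping does in practice, the sketch below applies an equivalent, dictionary-based version of `simplify` to a handful of common Penn Treebank tags. The tag list is just an illustrative sample, and the dictionary lookup is an alternative way of writing the same rule, not a change to the notebook's function.

```python
def simplify(penn_tag):
    """Same rule as above: the first letter of the Penn tag picks the WordNet tag."""
    return {'J': 'a', 'R': 'r', 'V': 'v'}.get(penn_tag[0], 'n')

# A few common Penn Treebank tags and the simplified tag each one collapses to
for tag in ['JJ', 'JJR', 'RB', 'VBD', 'VBG', 'NN', 'NNS', 'IN']:
    print(f"{tag:>4} -> {simplify(tag)}")
```

Note that the fallback branch sends every other tag, such as `IN` (preposition), to `'n'`. This is harmless here because most such function words are dropped as stopwords during preprocessing.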

Now we can preprocess the two corpora.

In [28]:
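def preprocess(text):
    # Use a set for fast stopword membership tests
    stop_words = set(stopwords.words('english'))
    # Lowercase, tokenize and strip accents
    toks = gensim.utils.simple_preprocess(str(text), deacc=True)
    wn = WordNetLemmatizer()
    # POS-tag the full token list, then drop stopwords and lemmatize with the simplified tag
    return [wn.lemmatize(tok, simplify(pos)) for tok, pos in nltk.pos_tag(toks) if tok not in stop_words]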

In [37]:
boris_corp = [preprocess(line) for line in boris_corp]
print("Boris:")
print(boris_corp[0])
print(boris_corp[1])
print(boris_corp[2])

print("May:")
may_corp = [preprocess(line) for line in may_corp]
print(may_corp[0])
print(may_corp[1])
print(may_corp[2])


Boris:
['pleased', 'campaign', 'far', 'relatively', 'free', 'personal', 'abuse', 'long', 'may', 'remain', 'day', 'someone', 'insult', 'term', 'redolent', 'soviet', 'russia', 'say', 'right', 'vote', 'leave', 'fact', 'liberal', 'cosmopolitan']
['rock', 'first', 'decide', 'insult', 'go', 'mind', 'probably', 'true', 'want', 'morning', 'explain', 'campaign', 'leave', 'eu', 'attract', 'liberal', 'spirit', 'people', 'admire', 'david', 'owen', 'gisela', 'stuart', 'nigel', 'lawson', 'john', 'longworth', 'people', 'love', 'europe', 'feel', 'home', 'continent', 'whose', 'attitude', 'towards', 'project', 'european', 'union', 'harden', 'time']
['many', 'deeply', 'sceptical', 'evolution', 'roughly', 'begin', 'decade', 'ago', 'query', 'anti', 'democratic', 'absurdity', 'eu', 'begin', 'campaign', 'reform', 'excite', 'prime', 'minister', 'bloomberg', 'speech', 'quietly', 'despair', 'reform', 'forthcoming', 'thanks', 'referendum', 'give', 'country', 'david', 'cameron', 'find', 'door', 'magically', 'open

The lemmatizer appears to have done its job well, reducing words to their root form according to the POS tag: for example, "insulted" has become "insult" and "hardening" has become "harden". The words in NLTK's English stopword list have also been removed.
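
Rather than eyeballing the output, the stopword removal can also be checked programmatically. The sketch below is one way to do so, under stated assumptions: `leftover_stopwords` is a hypothetical helper, the stop list is a small stand-in (the notebook itself uses NLTK's English list), and the two documents are just the first few tokens of the preprocessed Boris sentences shown above.

```python
def leftover_stopwords(docs, stop_words):
    """Return the sorted list of distinct tokens in docs that are still in stop_words."""
    return sorted({tok for doc in docs for tok in doc if tok in stop_words})

# Stand-in stop list (an assumption; the notebook uses NLTK's English list)
stop_words = {"i", "am", "that", "this", "has", "so", "been", "of", "and"}

# First tokens of the first two preprocessed Boris sentences shown above
docs = [
    ['pleased', 'campaign', 'far', 'relatively', 'free'],
    ['rock', 'first', 'decide', 'insult', 'go'],
]

# An empty list means no stopwords slipped through
print(leftover_stopwords(docs, stop_words))
```

In the notebook, the same check could be run with `set(stopwords.words('english'))` and the full `boris_corp` or `may_corp` lists.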