# Topic Modelling with Latent Dirichlet Allocation (LDA)

## The Data

News headlines published by the Australian Broadcasting Corporation (ABC) over a period of 15 years.

Source: [Kaggle](https://www.kaggle.com/therohk/million-headlines/data)

In [3]:
import pandas as pd

# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
data = pd.read_csv('abcnews-date-text.csv', on_bad_lines='skip')

In [4]:
data.head()

   publish_date                                      headline_text
0      20030219  aba decides against community broadcasting lic...
1      20030219     act fire witnesses must be aware of defamation
2      20030219     a g calls for infrastructure protection summit
3      20030219           air nz staff in aust strike for pay rise
4      20030219      air nz strike to affect australian travellers


In [31]:
# Double brackets select a one-column DataFrame rather than a Series;
# .copy() avoids a SettingWithCopyWarning when the new column is added.
data_text = data[['headline_text']].copy()
data_text['index'] = data_text.index

documents = data_text

In [41]:
print('Nº Documents: %d' %len(documents))
documents.head()

Nº Documents: 1103663


                                       headline_text  index
0  aba decides against community broadcasting lic...      0
1     act fire witnesses must be aware of defamation      1
2     a g calls for infrastructure protection summit      2
3           air nz staff in aust strike for pay rise      3
4      air nz strike to affect australian travellers      4


## Data Pre-processing

The following actions will be applied to the data:

- **Tokenisation**
  - **split** text into sentences and sentences into words
  - **lowercase** words
  - remove **punctuation**
- Remove words with **3 or fewer characters**
- Remove **stopwords**
- Words are **lemmatised**
- Words are **stemmed**
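
The tokenising and filtering steps above can be sketched with the standard library alone. This is only an illustration: `TOY_STOPWORDS` and `toy_preprocess` are hypothetical names, the stopword set is a tiny stand-in for gensim's `STOPWORDS`, and lemmatisation and stemming are left out.

```python
import re

# Tiny illustrative stopword set (an assumption, not gensim's STOPWORDS).
TOY_STOPWORDS = {"with", "show", "this", "that", "from"}

def toy_preprocess(text, min_len=4):
    """Lowercase, keep alphabetic runs, drop short tokens and stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= min_len and t not in TOY_STOPWORDS]

print(toy_preprocess("Patterson irresponsible with no-show, Edmond!"))
```

Punctuation disappears because only alphabetic runs are kept, and `min_len=4` mirrors the `len(token) > 3` test used in the notebook.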

In [None]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.corpora import Dictionary

import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
nltk.download('wordnet')

import numpy as np
np.random.seed(2019)

Helper functions that lemmatise and stem each token during preprocessing

In [94]:
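stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()  # create once rather than on every call

def lemmatize_stemming(text):
    # Lemmatise as a verb (pos='v'), then reduce to the Snowball stem.
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))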

def preprocess(text):
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

Test with a single document

In [128]:
doc_sample = documents[documents['index'] == 611]['headline_text'].values[0]

print('original document: ')
print(doc_sample.split(' '))
print('\n\n tokenised and lemmatised document: ')
print(preprocess(doc_sample))

original document: 
['patterson', 'irresponsible', 'with', 'no', 'show', 'edmond']


 tokenised and lemmatised document: 
['patterson', 'irrespons', 'edmond']


Applying the preprocess function to the whole dataset (will take some time)

In [97]:
processed_docs = documents['headline_text'].map(preprocess)
processed_docs[:10]

0            [decid, communiti, broadcast, licenc]
1                               [wit, awar, defam]
2           [call, infrastructur, protect, summit]
3                      [staff, aust, strike, rise]
4             [strike, affect, australian, travel]
5               [ambiti, olsson, win, tripl, jump]
6           [antic, delight, record, break, barca]
7    [aussi, qualifi, stosur, wast, memphi, match]
8            [aust, address, secur, council, iraq]
9                         [australia, lock, timet]
Name: headline_text, dtype: object

## Bag of Words on the Data set

Create a dictionary that maps each unique token in the processed documents to an integer id (it also records how many documents each token appears in)

In [103]:
dictionary = Dictionary(processed_docs)

from itertools import islice

# Show the first 11 (id, token) pairs.
for k, v in islice(dictionary.items(), 11):
    print(k, v)

0 broadcast
1 communiti
2 decid
3 licenc
4 awar
5 defam
6 wit
7 call
8 infrastructur
9 protect
10 summit


Filter out tokens that appear in:
- fewer than 15 documents
- more than half of the documents

After the above, keep only the 100,000 most frequent tokens

In [101]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
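
To make the filtering rule concrete, here is a minimal standard-library sketch of the same idea on a toy corpus. `toy_filter_extremes` is a hypothetical helper, not gensim's implementation; it only assumes what the description above states: document-frequency thresholds first, then a cap on vocabulary size.

```python
from collections import Counter

def toy_filter_extremes(docs, no_below, no_above, keep_n):
    """Return the tokens that survive document-frequency filtering."""
    # Document frequency: in how many documents each token occurs.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Keep the keep_n most frequent survivors (ties broken alphabetically).
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return kept[:keep_n]

docs = [["strike", "pay"], ["strike", "travel"], ["pay", "rise"], ["strike", "iraq"]]
print(toy_filter_extremes(docs, no_below=2, no_above=0.5, keep_n=10))
```

In this toy corpus only `pay` survives: `strike` occurs in three of four documents (more than half), and the remaining tokens occur in a single document.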

For each processed document, create a list of `(token_id, count)` tuples giving each token's dictionary id and its number of appearances

In [129]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus[611]

[(436, 1), (1439, 1), (1440, 1)]

Preview of BoW for a sample preprocessed document

In [132]:
bow_doc_611 = bow_corpus[611]

for token_id, count in bow_doc_611:
    print('Word {} ("{}") appears {} time(s).'.format(
        token_id, dictionary[token_id], count
    ))

Word 436 ("patterson") appears 1 time(s).
Word 1439 ("edmond") appears 1 time(s).
Word 1440 ("irrespons") appears 1 time(s).
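
The bag-of-words encoding itself is simple enough to sketch without gensim. The helper below is hypothetical and uses a hand-written `token2id` mapping; like the output above, it yields `(token_id, count)` pairs, and it skips tokens that are not in the vocabulary.

```python
from collections import Counter

def toy_doc2bow(tokens, token2id):
    """Encode a token list as sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

token2id = {"strike": 0, "pay": 1, "rise": 2}
print(toy_doc2bow(["strike", "pay", "strike", "unknown"], token2id))
```

Word order is discarded; only the ids and their counts remain, which is why the representation is called a "bag" of words.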


## TF-IDF

Create a TF-IDF model from the BoW corpus and apply it to every document

In [113]:
from gensim import corpora, models

tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]

from pprint import pprint

for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.5892908867507543),
 (1, 0.38929654337861147),
 (2, 0.4964985175717023),
 (3, 0.5046520327464028)]
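
As a rough illustration of where such weights come from, the sketch below computes TF-IDF by hand for a toy corpus. It assumes the weighting that appears to be gensim's default (raw term count times log2(N / document frequency), followed by L2 normalisation of each document vector); `toy_tfidf` is a hypothetical helper. The unit-length normalisation is consistent with the output above, whose squared weights sum to approximately 1.

```python
import math
from collections import Counter

def toy_tfidf(bow_docs):
    """Weight (token_id, count) documents by count * log2(N / df), L2-normalised."""
    n_docs = len(bow_docs)
    doc_freq = Counter(token_id for doc in bow_docs for token_id, _ in doc)
    weighted = []
    for doc in bow_docs:
        vec = [(tid, cnt * math.log2(n_docs / doc_freq[tid])) for tid, cnt in doc]
        # Tokens with zero weight (present in every document) are dropped.
        vec = [(tid, w) for tid, w in vec if w > 0.0]
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(tid, w / norm) for tid, w in vec])
    return weighted

bow_docs = [[(0, 1), (1, 1)], [(0, 1), (2, 2)], [(0, 1), (3, 1)], [(1, 1), (3, 1)]]
for doc in toy_tfidf(bow_docs):
    print(doc)
```

Token 0 occurs in three of the four toy documents, so it receives a lower weight than the rarer tokens it appears alongside.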


## LDA Model with BoW

For each topic, explore the words occurring in that topic and their relative weights

In [117]:
from gensim.models import LdaMulticore

lda_model = LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)

In [122]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.022*"year" + 0.019*"interview" + 0.015*"peopl" + 0.013*"famili" + 0.011*"open" + 0.011*"port" + 0.011*"john" + 0.011*"jail" + 0.010*"season" + 0.009*"sentenc"
Topic: 1 
Words: 0.026*"govern" + 0.022*"queensland" + 0.018*"report" + 0.016*"say" + 0.013*"health" + 0.013*"rural" + 0.011*"child" + 0.010*"minist" + 0.010*"labor" + 0.009*"abus"
Topic: 2 
Words: 0.028*"elect" + 0.027*"charg" + 0.024*"murder" + 0.019*"polic" + 0.017*"live" + 0.015*"drug" + 0.014*"alleg" + 0.013*"claim" + 0.012*"accus" + 0.011*"assault"
Topic: 3 
Words: 0.045*"australia" + 0.021*"world" + 0.015*"test" + 0.014*"final" + 0.013*"hospit" + 0.013*"donald" + 0.011*"time" + 0.010*"win" + 0.009*"return" + 0.009*"record"
Topic: 4 
Words: 0.026*"attack" + 0.023*"kill" + 0.021*"crash" + 0.018*"die" + 0.018*"countri" + 0.016*"shoot" + 0.015*"hour" + 0.015*"dead" + 0.014*"polic" + 0.012*"train"
Topic: 5 
Words: 0.023*"adelaid" + 0.018*"market" + 0.014*"turnbul" + 0.014*"high" + 0.013*"share" + 0.013*"break

Are the topics distinguishable by looking at their words and weights?

## LDA with TF-IDF

In [124]:
lda_model_tfidf = LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary, passes=2, workers=4)

for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.007*"octob" + 0.006*"wall" + 0.006*"peter" + 0.005*"fiji" + 0.005*"street" + 0.004*"david" + 0.004*"wallabi" + 0.004*"georg" + 0.004*"beat" + 0.004*"data"
Topic: 1 
Words: 0.010*"drum" + 0.009*"turnbul" + 0.007*"abbott" + 0.007*"marriag" + 0.007*"violenc" + 0.006*"septemb" + 0.006*"tuesday" + 0.005*"domest" + 0.005*"elect" + 0.005*"toni"
Topic: 2 
Words: 0.008*"donald" + 0.007*"royal" + 0.006*"commiss" + 0.006*"novemb" + 0.005*"abus" + 0.004*"child" + 0.004*"say" + 0.004*"tree" + 0.004*"court" + 0.004*"liber"
Topic: 3 
Words: 0.021*"countri" + 0.019*"hour" + 0.009*"weather" + 0.006*"flood" + 0.006*"rain" + 0.005*"asylum" + 0.005*"coast" + 0.005*"cyclon" + 0.005*"queensland" + 0.005*"seeker"
Topic: 4 
Words: 0.008*"christma" + 0.007*"thursday" + 0.006*"korea" + 0.005*"detent" + 0.005*"spring" + 0.004*"island" + 0.004*"cancer" + 0.004*"februari" + 0.004*"say" + 0.004*"dump"
Topic: 5 
Words: 0.009*"australia" + 0.009*"world" + 0.008*"final" + 0.007*"leagu" + 0.005*"frid

Are the topics distinguishable by looking at their words and weights?

## Performance evaluation by classifying sample document

### LDA BoW Model

In [141]:
wordId = 611
processed_docs[wordId]

['patterson', 'irrespons', 'edmond']

In [142]:
for index, score in sorted(lda_model[bow_corpus[wordId]], key=lambda tup: -1*tup[1]):
    print('\nScore: {}\t \nTopic: {}'.format(score, lda_model.print_topic(index, 10)))


Score: 0.7749999761581421	 
Topic: 0.026*"govern" + 0.022*"queensland" + 0.018*"report" + 0.016*"say" + 0.013*"health" + 0.013*"rural" + 0.011*"child" + 0.010*"minist" + 0.010*"labor" + 0.009*"abus"

Score: 0.02500000037252903	 
Topic: 0.022*"year" + 0.019*"interview" + 0.015*"peopl" + 0.013*"famili" + 0.011*"open" + 0.011*"port" + 0.011*"john" + 0.011*"jail" + 0.010*"season" + 0.009*"sentenc"

Score: 0.02500000037252903	 
Topic: 0.028*"elect" + 0.027*"charg" + 0.024*"murder" + 0.019*"polic" + 0.017*"live" + 0.015*"drug" + 0.014*"alleg" + 0.013*"claim" + 0.012*"accus" + 0.011*"assault"

Score: 0.02500000037252903	 
Topic: 0.045*"australia" + 0.021*"world" + 0.015*"test" + 0.014*"final" + 0.013*"hospit" + 0.013*"donald" + 0.011*"time" + 0.010*"win" + 0.009*"return" + 0.009*"record"

Score: 0.02500000037252903	 
Topic: 0.026*"attack" + 0.023*"kill" + 0.021*"crash" + 0.018*"die" + 0.018*"countri" + 0.016*"shoot" + 0.015*"hour" + 0.015*"dead" + 0.014*"polic" + 0.012*"train"

Score: 0.0250

### LDA TF-IDF Model

In [143]:
for index, score in sorted(lda_model_tfidf[bow_corpus[wordId]], key=lambda tup: -1*tup[1]):
    print('\nScore: {}\t \nTopic: {}'.format(score, lda_model_tfidf.print_topic(index, 10)))


Score: 0.5249999761581421	 
Topic: 0.021*"countri" + 0.019*"hour" + 0.009*"weather" + 0.006*"flood" + 0.006*"rain" + 0.005*"asylum" + 0.005*"coast" + 0.005*"cyclon" + 0.005*"queensland" + 0.005*"seeker"

Score: 0.27500006556510925	 
Topic: 0.008*"christma" + 0.007*"thursday" + 0.006*"korea" + 0.005*"detent" + 0.005*"spring" + 0.004*"island" + 0.004*"cancer" + 0.004*"februari" + 0.004*"say" + 0.004*"dump"

Score: 0.02500000223517418	 
Topic: 0.007*"octob" + 0.006*"wall" + 0.006*"peter" + 0.005*"fiji" + 0.005*"street" + 0.004*"david" + 0.004*"wallabi" + 0.004*"georg" + 0.004*"beat" + 0.004*"data"

Score: 0.02500000223517418	 
Topic: 0.010*"drum" + 0.009*"turnbul" + 0.007*"abbott" + 0.007*"marriag" + 0.007*"violenc" + 0.006*"septemb" + 0.006*"tuesday" + 0.005*"domest" + 0.005*"elect" + 0.005*"toni"

Score: 0.02500000223517418	 
Topic: 0.008*"donald" + 0.007*"royal" + 0.006*"commiss" + 0.006*"novemb" + 0.005*"abus" + 0.004*"child" + 0.004*"say" + 0.004*"tree" + 0.004*"court" + 0.004*"libe

In both cases, the topic listed first is the one each model assigns to the test document with the highest probability
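
The repeated scores of 0.775 and 0.025 follow a simple pattern. As a hedged reading, assume a symmetric document-topic prior of `alpha = 1 / num_topics = 0.1` (which appears to be gensim's default) and that the score for topic k is roughly `(alpha + n_k) / (num_topics * alpha + n)`, where `n_k` is how many of the document's `n` tokens are attributed to topic k. The hypothetical helper below reproduces the figures above under that assumption; the real inference is approximate, so this is a sketch rather than a description of gensim's internals.

```python
def expected_topic_scores(tokens_per_topic, num_topics=10):
    """Approximate document-topic scores under a symmetric prior alpha = 1/num_topics."""
    alpha = 1.0 / num_topics
    total = num_topics * alpha + sum(tokens_per_topic)
    # Topics with no attributed tokens keep only the prior mass.
    counts = list(tokens_per_topic) + [0] * (num_topics - len(tokens_per_topic))
    return [round((alpha + c) / total, 3) for c in counts]

# BoW model: all three tokens of document 611 attributed to one topic.
print(expected_topic_scores([3])[:2])
# TF-IDF model: two tokens attributed to one topic, one to another.
print(expected_topic_scores([2, 1])[:3])
```

With only three tokens in the headline, the prior accounts for a quarter of the total mass, which is why every remaining topic still receives 0.025.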

### Testing model on unseen document

In [149]:
unseen_document = 'Police investigate shooting of woman at Karawatha property'
bow_vector = dictionary.doc2bow(preprocess(unseen_document))

for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.5948548316955566	 Topic: 0.022*"polic" + 0.020*"canberra" + 0.020*"death" + 0.018*"help" + 0.016*"miss"
Score: 0.2718096077442169	 Topic: 0.026*"attack" + 0.023*"kill" + 0.021*"crash" + 0.018*"die" + 0.018*"countri"
Score: 0.016668451949954033	 Topic: 0.028*"elect" + 0.027*"charg" + 0.024*"murder" + 0.019*"polic" + 0.017*"live"
Score: 0.016667073592543602	 Topic: 0.032*"court" + 0.020*"face" + 0.017*"china" + 0.014*"fight" + 0.013*"leagu"
Score: 0.01666666753590107	 Topic: 0.022*"year" + 0.019*"interview" + 0.015*"peopl" + 0.013*"famili" + 0.011*"open"
Score: 0.01666666753590107	 Topic: 0.026*"govern" + 0.022*"queensland" + 0.018*"report" + 0.016*"say" + 0.013*"health"
Score: 0.01666666753590107	 Topic: 0.045*"australia" + 0.021*"world" + 0.015*"test" + 0.014*"final" + 0.013*"hospit"
Score: 0.01666666753590107	 Topic: 0.023*"adelaid" + 0.018*"market" + 0.014*"turnbul" + 0.014*"high" + 0.013*"share"
Score: 0.01666666753590107	 Topic: 0.032*"sydney" + 0.022*"south" + 0.022*"nort