# LDA (Latent Dirichlet Allocation) 

LDA is an example of topic modelling.
It is used to assign the text in a document to a particular topic.
It builds a topics-per-document model and a words-per-topic model, both modeled as Dirichlet distributions.

https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24
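
To make the "topics per document" and "words per topic" idea concrete, here is a minimal sketch of the generative story that LDA assumes, written with the standard library only. The toy vocabulary, the number of topics and the prior value 0.5 are hypothetical choices for illustration; this is not how gensim fits the model.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_dists, vocab, doc_alpha, length, rng):
    """Generate one toy document following the LDA generative story."""
    # Per-document topic mixture, drawn from a Dirichlet prior
    theta = sample_dirichlet(doc_alpha, rng)
    words = []
    for _ in range(length):
        # Pick a topic for this word position, then a word from that topic
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        word = rng.choices(vocab, weights=topic_word_dists[topic])[0]
        words.append(word)
    return theta, words

rng = random.Random(2018)
vocab = ["rain", "fire", "court", "police", "win", "match"]  # hypothetical toy vocabulary
num_topics = 3  # hypothetical
# Per-topic word distributions, each drawn from a Dirichlet prior
topic_word_dists = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(num_topics)]
theta, words = generate_document(topic_word_dists, vocab, [0.5] * num_topics, 8, rng)
print(theta)  # topic proportions of the generated document
print(words)  # the generated words
```

Fitting LDA goes in the opposite direction: given only the words, it estimates the topic mixtures and the per-topic word distributions.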

### Importing Data

In [1]:
import pandas as pd
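# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
data = pd.read_csv('abcnews-date-text.csv', on_bad_lines='skip')
# Copy the column so that adding 'index' does not modify a slice of `data`
data_text = data[['headline_text']].copy()
data_text['index'] = data_text.index
documents = data_text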

### Looking at the DataFrame

In [2]:
documents.head()

                                       headline_text  index
0  aba decides against community broadcasting lic...      0
1     act fire witnesses must be aware of defamation      1
2     a g calls for infrastructure protection summit      2
3           air nz staff in aust strike for pay rise      3
4      air nz strike to affect australian travellers      4


In [3]:
documents.shape

(1103663, 2)

### Loading gensim and nltk libraries

Gensim is billed as a natural language processing package that does 'Topic Modeling for Humans'. In practice it does more than that: it processes text, works with word-vector models (such as Word2Vec and FastText) and builds topic models.
https://www.machinelearningplus.com/nlp/gensim-tutorial/#top

nltk - Natural Language Toolkit
http://www.nltk.org/

In [4]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(2018)

import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Михаил\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Functions for lemmatization and stemming

In [5]:
stemmer = nltk.stem.snowball.EnglishStemmer()
lemmatizer = WordNetLemmatizer()  # create once instead of once per token

def lemmatize_stemming(text):
    # Lemmatize the token as a verb, then stem the lemma
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

def preprocess(text):
    # Tokenize, drop stopwords and tokens of 3 characters or fewer,
    # then lemmatize and stem what is left
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

def lemmatize(text):
    # Tokenize and lemmatize only (no stopword removal, no stemming)
    return [lemmatizer.lemmatize(token, pos='v') for token in simple_preprocess(text)]
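
The filtering logic in `preprocess` can be tried without gensim or nltk. The sketch below is a rough stand-in, not the library behaviour: `toy_preprocess` and `TOY_STOPWORDS` are hypothetical names, the regex tokenizer only approximates `simple_preprocess`, the stopword set is far smaller than gensim's `STOPWORDS`, and no lemmatization or stemming is applied.

```python
import re

# Hypothetical, tiny stopword set used only for this illustration
TOY_STOPWORDS = {"in", "for", "to", "of", "the"}

def toy_preprocess(text, stopwords=TOY_STOPWORDS, min_len=4):
    """Lowercase, split into alphabetic tokens, drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # len(t) >= 4 is the same condition as len(token) > 3 in preprocess()
    return [t for t in tokens if t not in stopwords and len(t) >= min_len]

print(toy_preprocess("air nz staff in aust strike for pay rise"))
# prints ['staff', 'aust', 'strike', 'rise']
```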

### Selecting a document to preview after preprocessing

TODO: improve the stemmer and lemmatizer so that words such as 'residents' and 'heavy' are reduced correctly.

In [6]:
# Pick the headline whose 'index' value is 4310
doc_sample = documents.loc[documents['index'] == 4310, 'headline_text'].values[0]
print('original document: ')
words = doc_sample.split(' ')
print(words)
print('lemmatized document:')
print(lemmatize(doc_sample))
print('tokenized and lemmatized document: ')
print(preprocess(doc_sample))

original document: 
['rain', 'helps', 'dampen', 'bushfires']
lemmatized document:
['rain', 'help', 'dampen', 'bushfires']
tokenized and lemmatized document: 
['rain', 'help', 'dampen', 'bushfir']


In [7]:
processed_docs = documents['headline_text'].map(preprocess)
processed_docs.head(10)

0            [decid, communiti, broadcast, licenc]
1                               [wit, awar, defam]
2           [call, infrastructur, protect, summit]
3                      [staff, aust, strike, rise]
4             [strike, affect, australian, travel]
5               [ambiti, olsson, win, tripl, jump]
6           [antic, delight, record, break, barca]
7    [aussi, qualifi, stosur, wast, memphi, match]
8            [aust, address, secur, council, iraq]
9                         [australia, lock, timet]
Name: headline_text, dtype: object

### Creating a bag of words

In [10]:
dictionary = gensim.corpora.Dictionary(processed_docs)

In [11]:
# Print the first 11 (id, token) pairs of the dictionary
for count, (k, v) in enumerate(dictionary.items()):
    print(k, v)
    if count >= 10:
        break

0 broadcast
1 communiti
2 decid
3 licenc
4 awar
5 defam
6 wit
7 call
8 infrastructur
9 protect
10 summit
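
The id assignment seen above can be imitated with a plain dict. This is a sketch under an assumption inferred from the output: ids appear to be handed out document by document, with the tokens of each document taken in sorted order. `build_toy_dictionary` is a hypothetical helper, not part of gensim.

```python
def build_toy_dictionary(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in docs:
        # Assumption: tokens within one document are processed in sorted order
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

docs = [
    ["decid", "communiti", "broadcast", "licenc"],
    ["wit", "awar", "defam"],
    ["call", "infrastructur", "protect", "summit"],
]
token2id = build_toy_dictionary(docs)
for token, token_id in token2id.items():
    print(token_id, token)
```

With the first three preprocessed headlines as input, this assigns ids 0 to 10 to the same tokens as in the listing above.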


### Filtering out the tokens that appear in
- fewer than 15 documents (an absolute number) OR
- more than 0.5 of the documents (a fraction of the total corpus size, not an absolute value)
- after the above two steps, keep only the 100 000 most frequent tokens

In [13]:
dictionary.filter_extremes(no_below = 15, no_above = 0.5, keep_n = 100000)
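
The three filtering rules can be sketched on a toy corpus with the standard library. `toy_filter_extremes` is a hypothetical helper that simplifies what `filter_extremes` does (for example, it does not remap token ids), so treat it as an illustration of the rules rather than a reimplementation.

```python
from collections import Counter

def toy_filter_extremes(docs, no_below, no_above, keep_n):
    """Return the tokens that survive the document-frequency filters."""
    num_docs = len(docs)
    # Document frequency: in how many documents each token occurs
    doc_freq = Counter(token for doc in docs for token in set(doc))
    kept = [t for t, df in doc_freq.items()
            if no_below <= df <= no_above * num_docs]
    # Most frequent first; ties broken alphabetically for a stable result
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return kept[:keep_n]

toy_docs = [
    ["rain", "fire"],
    ["rain", "court"],
    ["rain", "fire", "polic"],
    ["court", "fire"],
]
kept = toy_filter_extremes(toy_docs, no_below=2, no_above=0.5, keep_n=10)
print(kept)
# prints ['court']
```

Here "rain" and "fire" occur in 3 of 4 documents (more than 0.5) and "polic" in only 1 (fewer than 2), so only "court" remains.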

### Creating a bag-of-words representation (token id, count) for each document in the corpus

In [20]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]


In [21]:
bow_corpus[4311]

[(161, 1), (239, 1), (291, 1), (588, 1), (836, 1), (3549, 1), (3550, 1)]

In [16]:
bow_doc_4310 = bow_corpus[4310]
# Each entry is a (token_id, count) pair
for token_id, token_count in bow_doc_4310:
    print("Word {} (\"{}\") appears {} time.".format(token_id, dictionary[token_id], token_count))

Word 76 ("bushfir") appears 1 time.
Word 112 ("help") appears 1 time.
Word 483 ("rain") appears 1 time.
Word 4014 ("dampen") appears 1 time.
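
What `doc2bow` returns can be sketched with `collections.Counter`. The helper `toy_doc2bow` is hypothetical, and the token ids are copied from the output above; tokens missing from the mapping are simply skipped in this sketch.

```python
from collections import Counter

def toy_doc2bow(doc, token2id):
    """Convert a list of tokens into sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Ids taken from the printed output for document 4310
token2id = {"bushfir": 76, "help": 112, "rain": 483, "dampen": 4014}
bow = toy_doc2bow(["rain", "help", "dampen", "bushfir"], token2id)
print(bow)
# prints [(76, 1), (112, 1), (483, 1), (4014, 1)]
```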


### TF-IDF - term frequency-inverse document frequency
https://en.wikipedia.org/wiki/Tf%E2%80%93idf

In [22]:
from gensim import corpora, models

tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]

In [27]:
from pprint import pprint
for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.5892908867507543),
 (1, 0.38929654337861147),
 (2, 0.4964985175717023),
 (3, 0.5046520327464028)]
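
The squared weights printed above sum to about 1, which suggests each document vector is scaled to unit length. The sketch below reproduces that idea on a toy corpus, assuming the weighting is term frequency times log2(N / document frequency) followed by L2 normalization, with zero-weight terms dropped. Those details are my understanding of the `TfidfModel` defaults and should be treated as assumptions; `toy_tfidf` is a hypothetical helper.

```python
import math
from collections import Counter

def toy_tfidf(bow_corpus):
    """Weight each (token_id, count) pair by tf * log2(N / df), then L2-normalize."""
    num_docs = len(bow_corpus)
    doc_freq = Counter(token_id for doc in bow_corpus for token_id, _ in doc)
    result = []
    for doc in bow_corpus:
        weights = [(tid, tf * math.log2(num_docs / doc_freq[tid])) for tid, tf in doc]
        # A token that occurs in every document gets weight 0 and is dropped
        weights = [(tid, w) for tid, w in weights if w > 0.0]
        norm = math.sqrt(sum(w * w for _, w in weights))
        result.append([(tid, w / norm) for tid, w in weights] if norm else [])
    return result

toy_corpus = [
    [(0, 1), (1, 1), (3, 1)],
    [(0, 1), (2, 2), (3, 1)],
    [(1, 1), (2, 1), (3, 1)],
]
toy_weights = toy_tfidf(toy_corpus)
for doc in toy_weights:
    print(doc)
```

Token 3 occurs in all three toy documents, so it carries no information and disappears from every vector.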


### Running LDA using bag of words

In [28]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics = 10, id2word = dictionary, passes = 2, workers = 2)

##### For each topic, exploring the words occurring in that topic and their relative weights

In [29]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx,topic))

Topic: 0 
Words: 0.024*"kill" + 0.017*"elect" + 0.017*"say" + 0.016*"state" + 0.016*"attack" + 0.014*"china" + 0.014*"children" + 0.014*"deal" + 0.011*"talk" + 0.011*"leader"
Topic: 1 
Words: 0.036*"trump" + 0.035*"australia" + 0.019*"world" + 0.015*"win" + 0.014*"time" + 0.013*"gold" + 0.012*"meet" + 0.011*"lead" + 0.010*"beat" + 0.010*"take"
Topic: 2 
Words: 0.024*"crash" + 0.023*"canberra" + 0.021*"die" + 0.019*"hospit" + 0.014*"road" + 0.012*"polit" + 0.011*"green" + 0.011*"public" + 0.011*"resid" + 0.010*"question"
Topic: 3 
Words: 0.019*"perth" + 0.018*"melbourn" + 0.016*"sydney" + 0.015*"year" + 0.014*"open" + 0.014*"tasmanian" + 0.014*"tasmania" + 0.013*"record" + 0.012*"leav" + 0.011*"australia"
Topic: 4 
Words: 0.024*"warn" + 0.023*"test" + 0.014*"victoria" + 0.014*"driver" + 0.013*"news" + 0.012*"victorian" + 0.012*"street" + 0.011*"australia" + 0.011*"violenc" + 0.011*"say"
Topic: 5 
Words: 0.050*"polic" + 0.030*"charg" + 0.028*"court" + 0.025*"queensland" + 0.022*"murder" 