# Topic Modelling for News

![](https://images.unsplash.com/photo-1495020689067-958852a7765e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80)

Photo by [Roman Kraft](https://unsplash.com/photos/_Zua2hyvTBk)

This exercise is about modelling the main topics of a dataset of news headlines.

Begin by importing the needed libraries:

In [2]:
# TODO: import needed libraries
import nltk
import numpy as np
import pandas as pd

Load the data in the file `random_headlines.csv`

In [3]:
# TODO: load the dataset
df = pd.read_csv("random_headlines.csv")
print(df.shape)
df.head()

(20000, 2)


|   | publish_date | headline_text |
|---|---|---|
| 0 | 20120305 | ute driver hurt in intersection crash |
| 1 | 20081128 | 6yo dies in cycling accident |
| 2 | 20090325 | bumper olive harvest expected |
| 3 | 20100201 | replica replaces northernmost sign |
| 4 | 20080225 | woods targets perfect season |


It is always a good idea to perform some EDA (exploratory data analysis) on a dataset...

In [4]:
# TODO: Perform a short EDA
df.info()
# both columns contain non-null values
# second column is object (or string)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   publish_date   20000 non-null  int64 
 1   headline_text  20000 non-null  object
dtypes: int64(1), object(1)
memory usage: 312.6+ KB


Now perform all the needed preprocessing on those headlines: case lowering, tokenization, punctuation removal, stopwords removal, stemming/lemmatization.

In [5]:
# TODO: Preprocess the input data
from nltk.tokenize import word_tokenize

# lowercase and tokenise
df['tokens'] = df["headline_text"].str.lower().apply(word_tokenize)

# keep alphabetic tokens only: drops punctuation and any token containing digits (e.g. "6yo")
df['alphaNumeric'] = df['tokens'].apply(lambda saf: [wrd for wrd in saf if wrd.isalpha()])

# df.head()

# remove stopwords
stop_word = nltk.corpus.stopwords.words("english")
df['stop'] = df['alphaNumeric'].apply(lambda row: [ word for word in row if word not in stop_word ])

# df.head()
# e.g. the stopword "in" is removed

# stemming
stemmer = nltk.stem.PorterStemmer()
df['stemmed'] = df['stop'].apply(lambda row: [stemmer.stem(word) for word in row])

df['stemmed'].head()
# accident -> accid

0    [ute, driver, hurt, intersect, crash]
1                       [die, cycl, accid]
2          [bumper, oliv, harvest, expect]
3    [replica, replac, northernmost, sign]
4          [wood, target, perfect, season]
Name: stemmed, dtype: object

Now use Gensim to compute a BOW

In [7]:
# TODO: Compute the BOW using Gensim

# class gensim.corpora.dictionary.Dictionary(documents=None, prune_at=2000000)
from gensim.corpora import Dictionary

dictionary = Dictionary(df['stemmed'])
corpus = [dictionary.doc2bow(line) for line in df['stemmed']]
print(np.shape(corpus))
print(corpus[0:2])
# dictionary.doc2bow([word_1,word_2,...,word_n]) # pass each array line by line
# we have 20,000 documents inside the corpus
    # for each document there is this (x,y) BOW representation
    # x - id of the word (essentially a dictionary)
    # y - count of that word

(20000,)
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(5, 1), (6, 1), (7, 1)]]
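
To make the `(id, count)` pairs above more concrete, here is a minimal pure-Python sketch of what a bag-of-words conversion does. It is not Gensim's implementation: the helper names `build_vocab` and `to_bow` are hypothetical, the third headline is made up, and the ids Gensim assigns to each token may differ from the ones produced here.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for the tokens known to the vocabulary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [
    ["ute", "driver", "hurt", "intersect", "crash"],
    ["die", "cycl", "accid"],
    ["polic", "charg", "driver", "crash", "crash"],  # made-up headline with a repeated word
]
token2id = build_vocab(docs)
bows = [to_bow(doc, token2id) for doc in docs]
for bow in bows:
    print(bow)
```

Word order is lost in this representation: each document keeps only which words occur and how often.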


Compute the TF-IDF using Gensim

In [25]:
# TODO: Compute TF-IDF

from gensim.models import TfidfModel

tfidf_model = TfidfModel(corpus)
tf_idf = tfidf_model[corpus]
print(np.shape(tf_idf))
# returns a transformed corpus of 20,000 documents

(20000,)
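
As a rough illustration of what the TF-IDF transformation computes, the sketch below reweights a toy BOW corpus by hand. It assumes the weighting I understand to be Gensim's default (term count times `log2(N / document_frequency)`, followed by L2 normalisation of each document); the function name `tfidf` and the toy corpus are made up, and the numbers will not match `TfidfModel` if other settings are used.

```python
import math
from collections import Counter

def tfidf(corpus):
    """corpus: list of BOW documents, each a list of (token_id, count) pairs."""
    n_docs = len(corpus)
    # document frequency: in how many documents each token id occurs
    df = Counter(token_id for doc in corpus for token_id, _ in doc)
    weighted = []
    for doc in corpus:
        raw = [(tid, tf * math.log2(n_docs / df[tid])) for tid, tf in doc]
        norm = math.sqrt(sum(w * w for _, w in raw))
        # scale each document to unit length; tokens with zero weight are dropped
        weighted.append([(tid, w / norm) for tid, w in raw if w > 0] if norm else [])
    return weighted

toy = [
    [(0, 1), (1, 1)],
    [(0, 1), (2, 2)],          # token 2 occurs twice in this document
    [(1, 1), (2, 1), (3, 1)],  # token 3 occurs in this document only
]
for doc in tfidf(toy):
    print([(tid, round(w, 3)) for tid, w in doc])
```

Rare tokens (here token 3) get the largest weights, while a token that appears in every document would get a weight of zero and disappear.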


Finally compute the **LSA** (also called LSI) using Gensim, for a given number of Topics that you choose yourself

In [26]:
# TODO: Compute LSA

# Google -- Latent Semantic Analysis site:radimrehurek.com

# Try to find some of the most common topics that appear inside the articles
# here we care about (and obtain) the top 4 topics

# Latent Semantic Indexing

#------
#Eg
# run distributed LSA on nine documents
# lsi = models.LsiModel(corpus, id2word=id2word, num_topics=200, chunksize=1, distributed=True)
#------

# from gensim.models import LsiModel
# lsi = LsiModel(corpus=corpus, num_topics=4,id2word=dictionary)

from gensim import corpora, models
#lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=4, chunksize=1, distributed=True)
# note: this fits on the raw BOW counts (corpus); pass tf_idf instead to use the TF-IDF weights
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=4)

For each topic, show the most significant words.

In [27]:
# TODO: Print the 3 or 4 most significant words of each topic

lsi.print_topics(num_words=3)

[(0, '-0.752*"polic" + -0.404*"man" + -0.208*"charg"'),
 (1, '0.670*"man" + -0.574*"polic" + 0.328*"charg"'),
 (2, '0.654*"new" + 0.296*"plan" + -0.242*"man"'),
 (3, '0.703*"new" + -0.346*"say" + -0.334*"plan"')]

What do you think about those results?

Topics 0 and 1 both seem to relate to crime news reports, as they are dominated by terms associated with criminal incidents (`polic`, `man`, `charg`).

Topics 2 and 3 contain terminology that could be associated with politics (`new`, `plan`, `say`).

At this stage of my learning, I am assuming "politics" and "crime" may be two of the main themes in the corpus.

Now let's try to use LDA instead of LSA using Gensim

In [39]:
# TODO: Compute LDA

# Google -- Latent Dirichlet Allocation site:radimrehurek.com

from gensim.models import LdaModel


lda = LdaModel(corpus=corpus, num_topics=4, id2word=dictionary,random_state=0,chunksize=512,passes=5)
# random_state - there is a degree of randomness in the function; this parameter ensures the result is always the same
# chunksize - number of documents to be used in each training chunk (a batch size, not a train/test split like the 80% & 20% we specified in previous labs)

In [42]:
# TODO: print the most frequent words of each topic
lda.print_topics(num_words=3)

[(0, '0.016*"report" + 0.009*"back" + 0.009*"may"'),
 (1, '0.012*"mine" + 0.011*"polic" + 0.009*"elect"'),
 (2, '0.013*"question" + 0.010*"council" + 0.010*"fund"'),
 (3, '0.012*"sydney" + 0.012*"charg" + 0.011*"australian"')]

Now, how does it work with LDA?

The two models, LSI (LSA) and LDA, work in fundamentally different ways but share some similarities.
LSI identifies a set of concepts from a corpus. LDA, on the other hand, considers context, words and topics together.

Both methods start from a bag of words; however, LSA focuses on reducing the dimension of the term-document matrix, while LDA is a probabilistic model built specifically for topic modelling.
This may be why the output of LDA here seems more meaningful to the user: it has more contrast and is less abstract than the results of the LSI analysis.


LSI
----


Find common themes in documents 
Map
 document -> concept
 term -> concept
 concept -- set of terms with weights eg. "data"(0.8)
 
like an automatically constructed thesaurus (with respect to words mapping to concepts)

Discussion
 to retrieve concepts from documents
 to build a thesaurus automatically
 to reduce dimensionality (down to a few "concepts")
 
LSI uses Singular Value Decomposition (SVD)

Through LSI we can get a ranking of the discovered concepts, which allows us to throw away the less important ones -- essentially dimensionality reduction

--
LSI helps in finding important concepts
  e.g. transactions among people at a grocery store, to turn data into information
   e.g. contextualise the data into columns for vegetarians and meat eaters, against grocery items ({bread, lettuce, tomatoes}, {beef, chicken, ...})
   We get this information by building matrices and analysing them.


--

We can figure what the concept is from the set of words and their corresponding weights

SVD
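
The idea of ranking concepts and keeping only the strongest ones can be sketched with NumPy's SVD on a tiny term-document matrix. The matrix values and term list below are invented for illustration, and this is plain truncated SVD rather than Gensim's `LsiModel` implementation.

```python
import numpy as np

terms = ["polic", "charg", "man", "plan", "council", "fund"]
# rows = terms, columns = documents (made-up counts)
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 3],
    [0, 0, 2, 2],
], dtype=float)

# U maps terms to concepts, Vt maps concepts to documents,
# s holds the concept strengths sorted from largest to smallest
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print("singular values:", np.round(s, 3))

k = 2  # keep only the k strongest concepts
top_terms = []
for concept in range(k):
    weights = U[:, concept]
    strongest = np.argsort(-np.abs(weights))[:3]
    top_terms.append([terms[i] for i in strongest])
    print(concept, [(terms[i], round(float(weights[i]), 3)) for i in strongest])

# each document expressed in the reduced k-dimensional concept space
doc_coords = (np.diag(s[:k]) @ Vt[:k]).T
print(doc_coords.shape)
```

The sign of a whole singular vector is arbitrary, which is one reason LSI topics can show negative weights such as `-0.752*"polic"`; the relative magnitudes within a topic are what carry the meaning.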

LDA
-----

Sort documents according to how related or similar they are to other documents in terms of content.

Initially we need to feed knowledge to the machine -- training data

Choosing the corpus is important, as the model assumes each document contains terms that are somehow related to the same topic
For this reason we need to set the right "window" of context -- which means we may need to remove some documents to make sampling easier
    This may include removing documents that are meaningless for the topics, e.g. files with metadata
    
Gather all the words in the (remaining) corpus and sort them by popularity

Sample the words and remove others according to a meaningful criterion, e.g. the model must contain 50,000 - 100,000 words; remove words that appear in more than n, or x%, of documents, or in fewer than n, or x%, of documents.

Perform a TF-IDF analysis on these words

LDA brings together context, words and topics (@8:09)
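
The vocabulary-pruning step in the notes above (drop words that are too rare or too common, and cap the vocabulary size) can be sketched in plain Python. The function `prune_vocabulary`, its default thresholds and the toy documents are assumptions made for illustration; the parameter names are borrowed from Gensim's `Dictionary.filter_extremes`, which applies the same kind of rule.

```python
from collections import Counter

def prune_vocabulary(docs, no_below=2, no_above=0.5, keep_n=100_000):
    """Keep tokens found in at least `no_below` documents and in at most a
    fraction `no_above` of all documents, capped at the `keep_n` most frequent."""
    n_docs = len(docs)
    # count each token once per document (document frequency)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    kept = [
        (token, freq) for token, freq in doc_freq.items()
        if freq >= no_below and freq / n_docs <= no_above
    ]
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return {token for token, _ in kept[:keep_n]}

docs = [
    ["polic", "charg", "man"],
    ["polic", "man", "court"],
    ["polic", "plan", "council"],
    ["polic", "council", "fund"],
]
vocab = prune_vocabulary(docs)
filtered = [[t for t in doc if t in vocab] for doc in docs]
print(sorted(vocab))
print(filtered)
```

With the notebook's objects, calling `dictionary.filter_extremes(no_below=..., no_above=..., keep_n=...)` before building the BOW corpus should have a comparable effect; the thresholds are something to experiment with.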

Let's make some visualization of the LDA results using pyLDAvis.

In [54]:
# TODO: show visualization results of the LDA

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

pyLDAvis.enable_notebook()

vis = gensimvis.prepare(lda,corpus,dictionary)
vis



Depending on your results, you can try to fine-tune the algorithm: number of topics, hyperparameters...
And compare your results with others.