# Topic Modelling for News

![](https://images.unsplash.com/photo-1495020689067-958852a7765e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80)

Photo by [Roman Kraft](https://unsplash.com/photos/_Zua2hyvTBk)

This exercise is about modelling the main topics in a dataset of news headlines.

Begin by importing the needed libraries:

In [1]:
# TODO: import needed libraries
import nltk
import numpy as np
import pandas as pd

Load the data from the file `random_headlines.csv`.

In [3]:
# TODO: load the dataset
df = pd.read_csv('random_headlines.csv')
print(df.shape)
df.head()

(20000, 2)


|   | publish_date | headline_text |
|---|--------------|---------------|
| 0 | 20120305 | ute driver hurt in intersection crash |
| 1 | 20081128 | 6yo dies in cycling accident |
| 2 | 20090325 | bumper olive harvest expected |
| 3 | 20100201 | replica replaces northernmost sign |
| 4 | 20080225 | woods targets perfect season |


It is always a good idea to perform some EDA (exploratory data analysis) on a dataset before modelling it.

In [4]:
# TODO: Perform a short EDA
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   publish_date   20000 non-null  int64 
 1   headline_text  20000 non-null  object
dtypes: int64(1), object(1)
memory usage: 312.6+ KB
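
`df.info()` confirms that there are no missing values, but a short EDA can go a little further. The following is a minimal sketch in plain Python that uses only the five headlines shown in the `df.head()` output as sample data. It assumes that `publish_date` is an integer in `YYYYMMDD` form; the variable names are illustrative. On the full DataFrame the same checks could be done with pandas string and datetime methods.

```python
from collections import Counter
from datetime import datetime

# Sample rows copied from the df.head() output: (publish_date, headline_text).
rows = [
    (20120305, "ute driver hurt in intersection crash"),
    (20081128, "6yo dies in cycling accident"),
    (20090325, "bumper olive harvest expected"),
    (20100201, "replica replaces northernmost sign"),
    (20080225, "woods targets perfect season"),
]

# Assumption: publish_date is an integer in YYYYMMDD form; parse it into real dates.
dates = [datetime.strptime(str(d), "%Y%m%d").date() for d, _ in rows]
headlines_per_year = Counter(d.year for d in dates)

# Headline length in words shows how little text each document contains.
words_per_headline = [len(text.split()) for _, text in rows]

print(min(dates), max(dates))
print(headlines_per_year)
print(words_per_headline)
```

Headlines of only four to six words mean that each document contributes very few tokens, which is worth keeping in mind when judging the topics later on.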


Now perform all the needed preprocessing on those headlines: case lowering, tokenization, punctuation removal, stopwords removal, stemming/lemmatization.

In [6]:
# TODO: Preprocess the input data

# Lowercase and tokenize (needs the NLTK tokenizer data, e.g. nltk.download('punkt'))
df["tokens"] = df["headline_text"].str.lower().apply(nltk.word_tokenize)

# Remove punctuation: keep only purely alphabetic tokens
df["alphanumeric"] = df["tokens"].apply(lambda row: [word for word in row if word.isalpha()])

# Remove stopwords (needs the NLTK stopword list, e.g. nltk.download('stopwords'))
# A set makes the membership test faster than a list.
stop = set(nltk.corpus.stopwords.words('english'))
df["stop"] = df["alphanumeric"].apply(lambda row: [word for word in row if word not in stop])

# Stemming
stemmer = nltk.PorterStemmer()
df["stemmed"] = df["stop"].apply(lambda row: [stemmer.stem(word) for word in row])

df["stemmed"].head()

0    [ute, driver, hurt, intersect, crash]
1                       [die, cycl, accid]
2          [bumper, oliv, harvest, expect]
3    [replica, replac, northernmost, sign]
4          [wood, target, perfect, season]
Name: stemmed, dtype: object
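
Note that the second headline lost its first token: `6yo` is missing from the stemmed output. A small check suggests that this comes from the `isalpha()` filter, which rejects any token that contains a digit. The snippet uses `str.split` as a stand-in for `nltk.word_tokenize`, an assumption made only to keep it dependency-free.

```python
# Stand-in tokenization: whitespace split instead of nltk.word_tokenize.
tokens = "6yo dies in cycling accident".split()

# Same filter as in the preprocessing cell.
kept = [t for t in tokens if t.isalpha()]
dropped = [t for t in tokens if not t.isalpha()]

print(kept)     # ['dies', 'in', 'cycling', 'accident']
print(dropped)  # ['6yo']
```

If such tokens matter for your topics, `isalnum()` would keep them while still removing pure punctuation.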

Now use Gensim to compute a bag-of-words (BOW) representation.

In [9]:
# TODO: Compute the BOW using Gensim
from gensim.corpora import Dictionary

dictionary = Dictionary(df["stemmed"])
corpus = [dictionary.doc2bow(line) for line in df["stemmed"]]
# The corpus is a ragged list of lists, so use len() rather than np.shape(),
# which warns (or fails on recent NumPy versions) on ragged input.
print(len(corpus))
corpus[0:2]

20000

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(5, 1), (6, 1), (7, 1)]]
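
To make the output format concrete: each document becomes a list of `(token_id, count)` pairs. The sketch below mimics that idea in plain Python for the first two stemmed headlines. The `to_bow` helper is hypothetical, and it assigns ids in order of first appearance, which may not match the ids that Gensim's `Dictionary` assigns.

```python
from collections import Counter

# First two stemmed headlines from the preprocessing output.
docs = [
    ["ute", "driver", "hurt", "intersect", "crash"],
    ["die", "cycl", "accid"],
]

token2id = {}

def to_bow(tokens):
    """Return sorted (token_id, count) pairs, adding unseen tokens to token2id."""
    for tok in tokens:
        token2id.setdefault(tok, len(token2id))
    counts = Counter(token2id[tok] for tok in tokens)
    return sorted(counts.items())

bow = [to_bow(doc) for doc in docs]
print(bow)
```

Because headlines are short and rarely repeat a word, almost every count is 1, so the BOW is close to a simple presence indicator.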

In [7]:
!pip install --upgrade gensim



Compute the TF-IDF using Gensim.

In [10]:
# TODO: Compute TF-IDF
from gensim.models import TfidfModel

tfidf_model = TfidfModel(corpus)
tf_idf = tfidf_model[corpus]
# tf_idf is a lazily transformed corpus; len() gives the number of documents
print(len(tf_idf))

20000
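
For intuition about what the model computes, here is a hand-rolled sketch on a toy corpus. It assumes the weighting raw term count times `log2(N / df)`, followed by L2 normalization per document, which should correspond to the defaults of Gensim's `TfidfModel`. The `tfidf` helper and the three toy documents are illustrative and not part of the exercise data.

```python
from collections import Counter
from math import log2, sqrt

# Toy stemmed documents (illustrative only).
docs = [
    ["polic", "charg", "man"],
    ["man", "court"],
    ["new", "plan"],
]

n_docs = len(docs)
# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Weight each term by count * log2(N / df), then scale to unit L2 norm."""
    counts = Counter(doc)
    weights = {t: c * log2(n_docs / doc_freq[t]) for t, c in counts.items()}
    norm = sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights

vectors = [tfidf(doc) for doc in docs]
for vec in vectors:
    print({t: round(w, 3) for t, w in vec.items()})
```

Here `man` appears in two of the three documents, so it ends up with a lower weight than `polic` or `charg` in the first document.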


Finally, compute the **LSA** (also called LSI) using Gensim, with a number of topics of your choice.

In [11]:
# TODO: Compute LSA
from gensim.models import LsiModel

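# Note: this fits LSI on the raw BOW counts; pass corpus=tf_idf instead
# to fit it on the TF-IDF weights computed in the previous cell.
lsi = LsiModel(corpus=corpus, num_topics=4, id2word=dictionary)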

For each topic, show the most significant words.

In [13]:
# TODO: Print the 3 or 4 most significant words of each topic
lsi.print_topics(num_words=4)

[(0, '-0.752*"polic" + -0.405*"man" + -0.207*"charg" + -0.133*"new"'),
 (1, '0.668*"man" + -0.575*"polic" + 0.329*"charg" + 0.167*"court"'),
 (2, '-0.656*"new" + -0.296*"plan" + -0.241*"say" + 0.241*"man"'),
 (3, '-0.701*"new" + 0.345*"say" + 0.335*"plan" + 0.268*"govt"')]

What do you think about those results?
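
One thing that may help when judging them: LSA amounts to a truncated singular value decomposition of the term-document matrix. In the notation below (introduced here, not taken from the exercise), $X$ is the $m \times n$ matrix of term weights for $m$ terms and $n$ headlines, and $k$ is the number of topics:

```latex
X \approx U_k \Sigma_k V_k^{\top},
\qquad U_k \in \mathbb{R}^{m \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{n \times k}
```

Each column of $U_k$ is one topic, and its entries are the word weights printed by `print_topics`. Singular vectors are only defined up to sign and are mutually orthogonal, which is why negative weights appear and why the same dominant words (`polic`, `man`, `new`) recur across topics with different signs.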

Now let's try LDA instead of LSA, again using Gensim.

In [14]:
# TODO: Compute LDA
from gensim.models import LdaModel

lda = LdaModel(corpus=corpus, num_topics=4, id2word=dictionary)

In [15]:
# TODO: print the most probable words of each topic
lda.print_topics(num_words=4)

[(0, '0.011*"interview" + 0.009*"win" + 0.005*"busi" + 0.005*"second"'),
 (1, '0.014*"man" + 0.010*"polic" + 0.009*"charg" + 0.009*"court"'),
 (2, '0.009*"polic" + 0.006*"kill" + 0.005*"murder" + 0.005*"plan"'),
 (3, '0.009*"new" + 0.007*"govt" + 0.006*"call" + 0.005*"report"')]

In [21]:
!pip install pyLDAvis


How do these LDA topics compare with the LSA ones?

Let's visualize the LDA results using pyLDAvis.

In [23]:
# TODO: show visualization results of the LDA
import pyLDAvis
import pyLDAvis.gensim_models

pyLDAvis.enable_notebook()

vis = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)
vis




Depending on your results, you can try to fine-tune the algorithm (number of topics, hyperparameters, and so on),
and compare your results with those of others.
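
One way to compare settings more objectively is a topic coherence score. The sketch below is a plain-Python, UMass-style coherence for a list of top words, averaged over word pairs. The `umass_coherence` helper, the smoothing constant `eps` and the toy documents are assumptions for illustration. Gensim also ships a `CoherenceModel` class for this purpose, which is likely the more practical choice on the full corpus.

```python
from itertools import combinations
from math import log

def umass_coherence(top_words, docs, eps=1.0):
    """Mean over word pairs of log((D(w_i, w_j) + eps) / D(w_j)),
    where w_j is ranked before w_i and D counts documents containing the words."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        return sum(all(w in s for w in words) for s in doc_sets)

    scores = []
    for j, i in combinations(range(len(top_words)), 2):  # j < i
        w_j, w_i = top_words[j], top_words[i]
        d_j = doc_count(w_j)
        if d_j == 0:
            continue  # skip words that never occur in the reference documents
        scores.append(log((doc_count(w_i, w_j) + eps) / d_j))
    return sum(scores) / len(scores) if scores else float("nan")

# Toy stemmed documents (illustrative only).
docs = [
    ["man", "charg", "polic"],
    ["polic", "charg", "man", "court"],
    ["new", "plan", "govt"],
    ["govt", "new", "report"],
]

coherent = umass_coherence(["man", "polic", "charg"], docs)
mixed = umass_coherence(["man", "new", "report"], docs)
print(coherent > mixed)  # True: words that co-occur score higher
```

Higher is better, so a sweep over `num_topics` can keep the value with the best average score across topics. Fixing `random_state` in `LdaModel` and raising `passes` should also make runs easier to compare.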