# Topic Modelling for News

![](https://images.unsplash.com/photo-1495020689067-958852a7765e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80)

Photo by [Roman Kraft](https://unsplash.com/photos/_Zua2hyvTBk)

This exercise is about modelling the main topics in a dataset of news headlines.

Begin by importing the needed libraries:

In [50]:
# TODO: import needed libraries
import numpy as np
import pandas as pd

Load the data from the file `random_headlines.csv`:

In [62]:
# TODO: load the dataset
df = pd.read_csv("random_headlines.csv")
print(df.shape)
df.head()

(20000, 2)


|   | publish_date | headline_text |
|---|---|---|
| 0 | 20120305 | ute driver hurt in intersection crash |
| 1 | 20081128 | 6yo dies in cycling accident |
| 2 | 20090325 | bumper olive harvest expected |
| 3 | 20100201 | replica replaces northernmost sign |
| 4 | 20080225 | woods targets perfect season |


It is always a good idea to perform some EDA (exploratory data analysis) on a dataset before modelling it.

In [63]:
# TODO: Perform a short EDA
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   publish_date   20000 non-null  int64 
 1   headline_text  20000 non-null  object
dtypes: int64(1), object(1)
memory usage: 312.6+ KB
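
`df.info()` only confirms the column types and the absence of missing values. A slightly fuller EDA could also look at the date range, the headline lengths, and duplicates. The sketch below does this on the five sample rows shown earlier rather than on the full file; the names `sample`, `date`, `n_words`, and `n_duplicates` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Five sample rows copied from the df.head() output (not the full dataset)
sample = pd.DataFrame({
    "publish_date": [20120305, 20081128, 20090325, 20100201, 20080225],
    "headline_text": [
        "ute driver hurt in intersection crash",
        "6yo dies in cycling accident",
        "bumper olive harvest expected",
        "replica replaces northernmost sign",
        "woods targets perfect season",
    ],
})

# publish_date is stored as an integer such as 20120305; parse it into a date
sample["date"] = pd.to_datetime(sample["publish_date"].astype(str), format="%Y%m%d")

# Number of whitespace-separated words per headline
sample["n_words"] = sample["headline_text"].str.split().str.len()

# Count exact duplicate headlines
n_duplicates = int(sample["headline_text"].duplicated().sum())

print(sample["date"].min().date(), "to", sample["date"].max().date())
print(sample["n_words"].describe())
print("duplicate headlines:", n_duplicates)
```

The same three checks should carry over to the full `df` by replacing `sample` with it.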


Now perform all the needed preprocessing on those headlines: lowercasing, tokenization, punctuation removal, stopword removal, and stemming or lemmatization.

In [64]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Slaye\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [66]:
# TODO: Preprocess the input data
from nltk.tokenize import word_tokenize

# lowercasing and tokenization
df["tokenize"] = df["headline_text"].apply(lambda row: word_tokenize(row.lower()))

# punctuation removal: keep purely alphabetic tokens
# (this also drops tokens containing digits, such as "6yo")
df["alphanumeric"] = df["tokenize"].apply(lambda row: [
    word for word in row if word.isalpha()
])

# stopword removal (requires nltk.download('stopwords') to have been run once)
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # a set gives fast membership tests
df["stop"] = df["alphanumeric"].apply(lambda row: [
    word for word in row if word not in stop
])

# stemming
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
df["stemmed"] = df["stop"].apply(lambda row: [
    stemmer.stem(word) for word in row
])

df['stemmed'].head()

0    [ute, driver, hurt, intersect, crash]
1                       [die, cycl, accid]
2          [bumper, oliv, harvest, expect]
3    [replica, replac, northernmost, sign]
4          [wood, target, perfect, season]
Name: stemmed, dtype: object
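
The second headline, "6yo dies in cycling accident", ends up with only three tokens. The sketch below shows why, using simplified stand-ins: `str.split()` replaces `word_tokenize`, the `STOPWORDS` set is a tiny hypothetical subset of NLTK's English list, and stemming is left out.

```python
# Simplified stand-ins: str.split() replaces word_tokenize, and this tiny
# stopword set is hypothetical (NLTK's English list is much longer)
STOPWORDS = {"in", "the", "of", "to", "a"}

def preprocess(headline):
    tokens = headline.lower().split()                   # lowercasing + tokenization
    tokens = [t for t in tokens if t.isalpha()]         # drops "6yo" (contains a digit)
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal drops "in"
    return tokens

print(preprocess("6yo dies in cycling accident"))
# ['dies', 'cycling', 'accident']
```

Stemming then reduces these to `die`, `cycl`, and `accid`, as in the output above. If tokens such as "6yo" carry useful information for your topics, a looser filter than `isalpha()` may be worth trying.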

Now use Gensim to compute a bag-of-words (BOW) representation:

In [67]:
# TODO: Compute the BOW using Gensim
from gensim.corpora import Dictionary

dictionary = Dictionary(df["stemmed"])

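BOW = [dictionary.doc2bow(document) for document in df["stemmed"]]

# BOW is a ragged list (one list of (token_id, count) pairs per headline),
# so use len() rather than np.shape(), which warns or fails on ragged input
print(len(BOW))
BOW[:2]

20000


[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(5, 1), (6, 1), (7, 1)]]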
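
Each document becomes a list of `(token_id, count)` pairs. The pure-Python sketch below mimics that idea on the first two stemmed headlines. It is not Gensim's implementation: the helper `to_bow` and the way ids are handed out are assumptions made for illustration.

```python
from collections import Counter

# First two stemmed headlines from the output above
docs = [
    ["ute", "driver", "hurt", "intersect", "crash"],
    ["die", "cycl", "accid"],
]

token2id = {}  # grows as new tokens are seen

def to_bow(tokens):
    counts = Counter(tokens)
    # Give each unseen token the next free integer id
    for token in sorted(counts):
        if token not in token2id:
            token2id[token] = len(token2id)
    # Sparse representation: (token_id, count) pairs sorted by id
    return sorted((token2id[t], n) for t, n in counts.items())

bow = [to_bow(doc) for doc in docs]
print(bow)
# [[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(5, 1), (6, 1), (7, 1)]]
```

Word order is discarded: only which tokens occur, and how often, is kept.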

Compute the TF-IDF using Gensim:

In [72]:
# TODO: Compute TF-IDF
from gensim.models import TfidfModel
tfidf_model = TfidfModel(BOW)
tfidf = tfidf_model[BOW]

print(len(tfidf))  # number of documents in the transformed corpus
tfidf

20000


<gensim.interfaces.TransformedCorpus at 0x1d2a3de7be0>
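
The transformed corpus is evaluated lazily, which is why the notebook shows only an object reference; iterating over it yields the weighted vectors. To make the weighting concrete, here is a small pure-Python sketch on a toy corpus. It assumes the scheme that, to my understanding, matches Gensim's defaults (raw term frequency times `log2(N / df)`, followed by L2 normalization); the toy documents and the helper `tfidf_vector` are illustrative only.

```python
import math
from collections import Counter

# Toy corpus of stemmed tokens (illustrative, not taken from the dataset)
docs = [
    ["polic", "charg", "man"],
    ["polic", "plan", "new"],
    ["new", "plan", "say"],
]

n_docs = len(docs)
# Document frequency: in how many documents each term appears
df_counts = Counter(term for doc in docs for term in set(doc))

def tfidf_vector(doc):
    tf = Counter(doc)
    # Term frequency times inverse document frequency (log base 2)
    weights = {t: tf[t] * math.log2(n_docs / df_counts[t]) for t in tf}
    # Scale the vector to unit L2 norm
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights

vec = tfidf_vector(docs[0])
print(vec)
```

In the first toy document, "polic" appears in two of the three documents, so it receives a lower weight than "charg" and "man", which appear in only one.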

Finally, compute the **LSA** (also called LSI) using Gensim, for a number of topics of your choice.

In [76]:
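# TODO: Compute LSA
from gensim.models import LsiModel

# Note: the model is fit on the raw BOW counts here; pass `tfidf` instead of
# `BOW` to use the TF-IDF weights computed in the previous cell
lsi_model = LsiModel(BOW, id2word=dictionary, num_topics=4)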

For each topic, show the most significant words.

In [77]:
# TODO: Print the 3 or 4 most significant words of each topic
lsi_model.print_topics(num_topics=4, num_words=3)

[(0, '-0.752*"polic" + -0.405*"man" + -0.207*"charg"'),
 (1, '0.671*"man" + -0.574*"polic" + 0.327*"charg"'),
 (2, '0.654*"new" + 0.297*"plan" + 0.242*"say"'),
 (3, '-0.703*"new" + 0.345*"say" + 0.331*"plan"')]

What do you think about those results?

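The output looks plausible: recurring themes such as police charging a man make sense. However, the topics overlap heavily. Topics 0 and 1 are both built from "polic", "man" and "charg", and topics 2 and 3 from "new", "plan" and "say"; they differ mainly in the signs and sizes of the weights.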
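
Negative weights in an LSA topic can look odd at first. LSA is based on a truncated SVD of the term-document matrix, and the sign of each singular vector is arbitrary, so only the relative signs within a topic carry information. The NumPy sketch below illustrates this on a made-up count matrix; the matrix `X` and the term list are assumptions for illustration, and this is plain SVD rather than Gensim's `LsiModel`.

```python
import numpy as np

# Toy term-document count matrix (rows: terms, columns: documents)
terms = ["polic", "man", "charg", "new", "plan"]
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

# LSA keeps the k leading singular vectors of X
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = (U[:, :k] * s[:k]) @ Vt[:k]

# Flipping the sign of a topic in both U and Vt gives the same approximation
U_flip, Vt_flip = U.copy(), Vt.copy()
U_flip[:, 0] *= -1
Vt_flip[0] *= -1
X_k_flip = (U_flip[:, :k] * s[:k]) @ Vt_flip[:k]

print(np.allclose(X_k, X_k_flip))
for topic in range(k):
    print(topic, dict(zip(terms, U[:, topic].round(2))))
```

A topic whose weights are all negative is therefore equivalent to one whose weights are all positive, while a topic with mixed signs contrasts two groups of words.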

Now let's try LDA instead of LSA, again using Gensim:

In [84]:
# TODO: Compute LDA
from gensim.models import LdaModel

lda_model = LdaModel(BOW, id2word=dictionary, num_topics=4)

In [85]:
# TODO: print the most frequent words of each topic
lda_model.print_topics(num_topics=4, num_words=3)

[(0, '0.011*"interview" + 0.008*"polic" + 0.006*"fund"'),
 (1, '0.011*"man" + 0.007*"new" + 0.005*"charg"'),
 (2, '0.007*"plan" + 0.007*"polic" + 0.006*"fire"'),
 (3, '0.008*"new" + 0.008*"us" + 0.006*"say"')]

How do the results look with LDA?

Let's make some visualization of the LDA results using pyLDAvis.

In [86]:
# TODO: show visualization results of the LDA
import pyLDAvis 
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()

vis = pyLDAvis.gensim_models.prepare(lda_model, BOW, dictionary)
vis

Depending on your results, you can try to fine-tune the algorithm (number of topics, other hyperparameters, ...) and compare your results with those of others.