# Topic Modelling for News

![](https://images.unsplash.com/photo-1495020689067-958852a7765e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80)

Photo by [Roman Kraft](https://unsplash.com/photos/_Zua2hyvTBk)

This exercise is about modelling the main topics in a dataset of news headlines.

Begin by importing the needed libraries:

In [81]:
# TODO: import needed libraries
import pandas as pd
import numpy as np
from nltk import word_tokenize, wordpunct_tokenize, pos_tag
from nltk.corpus import stopwords
stop_words = stopwords.words("english")
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet
from gensim.models import TfidfModel
from gensim.corpora import Dictionary
from gensim.models import LsiModel
from pprint import pprint
from gensim.models import LdaModel
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from gensim.matutils import Sparse2Corpus
import pyLDAvis
# Recent pyLDAvis releases expose the Gensim helpers as pyLDAvis.gensim_models
import pyLDAvis.gensim_models as gensimvis

Load the data from the file `random_headlines.csv`.

In [82]:
# TODO: load the dataset
df = pd.read_csv("random_headlines.csv")
print(df.shape)
df.head()

(20000, 2)


|   | publish_date | headline_text |
|---|---|---|
| 0 | 20120305 | ute driver hurt in intersection crash |
| 1 | 20081128 | 6yo dies in cycling accident |
| 2 | 20090325 | bumper olive harvest expected |
| 3 | 20100201 | replica replaces northernmost sign |
| 4 | 20080225 | woods targets perfect season |


It is always a good idea to perform some EDA (exploratory data analysis) on a dataset.

In [83]:
# TODO: Perform a short EDA
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   publish_date   20000 non-null  int64 
 1   headline_text  20000 non-null  object
dtypes: int64(1), object(1)
memory usage: 312.6+ KB
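
Beyond `info()`, a quick look at headline lengths and raw word frequencies is often a useful part of a short EDA. The sketch below is only an illustration: it uses the five headlines from the preview above as a stand-in for `df["headline_text"]`, and the name `sample_headlines` is an assumption, not part of the exercise.

```python
from collections import Counter

# Stand-in for df["headline_text"]: the five headlines shown in the preview above.
sample_headlines = [
    "ute driver hurt in intersection crash",
    "6yo dies in cycling accident",
    "bumper olive harvest expected",
    "replica replaces northernmost sign",
    "woods targets perfect season",
]

# Number of whitespace-separated words per headline.
lengths = [len(h.split()) for h in sample_headlines]
mean_length = sum(lengths) / len(lengths)

# Raw word frequencies, before any cleaning.
word_counts = Counter(word for h in sample_headlines for word in h.split())

print("words per headline:", lengths)
print("mean length:", mean_length)
print("most common:", word_counts.most_common(3))
```

On the full dataset, the same length statistics can be obtained with `df["headline_text"].str.split().str.len().describe()`.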


Now perform all the needed preprocessing on those headlines: case lowering, tokenization, punctuation removal, stopwords removal, stemming/lemmatization.

In [84]:
# TODO: Preprocess the input data

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_data(quote):
    """Lowercase, tokenize, drop non-alphabetic tokens and stopwords, then stem."""
    quote = quote.lower()
    tokens = word_tokenize(quote)
    # Keep purely alphabetic tokens: this removes punctuation, but also tokens such as "6yo"
    token_punc = [t for t in tokens if t.isalpha()]
    token_stop = [t for t in token_punc if t not in stop_words]
    stemmed_words = [stemmer.stem(w) for w in token_stop]
    return stemmed_words

# Apply the clean_data function to the 'headline_text' column and store the result in a new column 'stemmed'
df["stemmed"] = df["headline_text"].apply(clean_data)

# Display the first few rows of the 'stemmed' column
print(df["stemmed"].head())



0    [ute, driver, hurt, intersect, crash]
1                       [die, cycl, accid]
2          [bumper, oliv, harvest, expect]
3    [replica, replac, northernmost, sign]
4          [wood, target, perfect, season]
Name: stemmed, dtype: object


Now use Gensim to compute a bag-of-words (BOW) representation of the headlines.

In [85]:
# TODO: Compute the BOW using Gensim
dictionary = Dictionary(df["stemmed"])
bow_corpus = [dictionary.doc2bow(text) for text in df["stemmed"]]
print((len(bow_corpus),))
print(bow_corpus[:2])

(20000,)
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(5, 1), (6, 1), (7, 1)]]
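
To make the output format concrete: each document becomes a list of `(token_id, count)` pairs. The following pure-Python sketch mimics the shape of that output for the first two stemmed headlines. It is not Gensim's implementation, the helper `doc_to_bow` is hypothetical, and which token receives which id may differ from what `Dictionary` assigns.

```python
# First two stemmed headlines, copied from the output above.
docs = [
    ["ute", "driver", "hurt", "intersect", "crash"],
    ["die", "cycl", "accid"],
]

token2id = {}  # grows as new tokens are seen

def doc_to_bow(tokens):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = {}
    for tok in tokens:
        # Assign the next free id the first time a token is seen.
        tok_id = token2id.setdefault(tok, len(token2id))
        counts[tok_id] = counts.get(tok_id, 0) + 1
    return sorted(counts.items())

toy_bow = [doc_to_bow(d) for d in docs]
print(toy_bow)
```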


Compute the TF-IDF weights using Gensim.

In [86]:
# TODO: Compute TF-IDF
tfidf_model = TfidfModel(bow_corpus)
tf_idf_gensim = tfidf_model[bow_corpus]
print((len(tf_idf_gensim),))
print(tf_idf_gensim)

(20000,)
<gensim.interfaces.TransformedCorpus object at 0x00000228482B6FA0>
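
Printing the transformed corpus only shows the object itself; applying the model to a single document, for example `tfidf_model[bow_corpus[0]]`, returns its `(token_id, weight)` pairs. The sketch below shows roughly what such weights look like on a toy corpus. It assumes the weighting scheme is raw term frequency times `log2(N / df)`, followed by L2 normalisation, which is my understanding of Gensim's defaults; treat it as an illustration rather than the library's exact code. The toy corpus and the helper `tfidf` are made up for the example.

```python
import math

# Toy BOW corpus: lists of (term_id, count) pairs (illustrative values).
toy_bow = [
    [(0, 1), (1, 1)],          # doc 0: terms 0 and 1
    [(0, 1), (2, 2)],          # doc 1: term 0 once, term 2 twice
    [(0, 1), (1, 1), (3, 1)],  # doc 2: terms 0, 1 and 3
]

n_docs = len(toy_bow)

# Document frequency: in how many documents each term appears.
doc_freq = {}
for doc in toy_bow:
    for term_id, _ in doc:
        doc_freq[term_id] = doc_freq.get(term_id, 0) + 1

def tfidf(doc):
    """Weight one BOW document with tf * log2(N / df), then L2-normalise."""
    weights = [(t, tf * math.log2(n_docs / doc_freq[t])) for t, tf in doc]
    # A term present in every document gets idf = 0 and is dropped here.
    weights = [(t, w) for t, w in weights if w > 0.0]
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(t, w / norm) for t, w in weights] if norm else []

toy_tfidf = [tfidf(doc) for doc in toy_bow]
for doc in toy_tfidf:
    print([(t, round(w, 3)) for t, w in doc])
```

Term 0 occurs in every toy document, so it carries no information and disappears from the weighted vectors, while rarer terms dominate.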


Finally, compute the **LSA** (also called LSI) model using Gensim, with a number of topics of your choice.

In [87]:
# TODO: Compute LSA
lsi_model = LsiModel(tf_idf_gensim, id2word=dictionary, num_topics=4)



For each topic, show the most significant words.

In [88]:
# TODO: Print the 3 or 4 most significant words of each topic
output = []
for topic_id, topic in lsi_model.show_topics(num_topics=4, num_words=3, formatted=False):
    topic_words = [(word, '{:.3f}*"{:s}"'.format(weight, word)) for word, weight in topic]
    topic_string = " + ".join([word[1] for word in topic_words])
    output.append((topic_id, topic_string))
output_string = "[" + ",\n ".join([str(topic) for topic in output]) + "]"
print(output_string)

[(0, '0.457*"man" + 0.385*"polic" + 0.322*"charg"'),
 (1, '0.396*"second" + 0.337*"abc" + 0.332*"news"'),
 (2, '-0.374*"second" + -0.320*"man" + -0.290*"abc"'),
 (3, '0.766*"polic" + -0.231*"man" + -0.226*"charg"')]
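
When reading these weights, it helps to remember that LSA is a truncated singular value decomposition (SVD) of the term-document matrix. Singular vectors are only defined up to a sign flip, so a negative weight is not "bad" in itself; only the relative signs within one topic carry meaning. The pure-Python sketch below finds the leading topic of a tiny, made-up term-document matrix by power iteration. It is a simplified stand-in for what an SVD solver does, not Gensim's algorithm, and all names and values are illustrative.

```python
import math

# Toy term-document matrix (rows = terms, columns = documents); values are illustrative.
terms = ["polic", "man", "charg", "weather"]
A = [
    [2.0, 1.0, 0.0],
    [1.0, 2.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def normalise(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

At = transpose(A)
v = normalise([1.0] * len(At))  # starting guess in document space
for _ in range(100):            # power iteration on A^T A
    v = normalise(matvec(At, matvec(A, v)))

Av = matvec(A, v)
sigma = math.sqrt(sum(x * x for x in Av))  # leading singular value
u = [x / sigma for x in Av]                # term loadings of the first topic

# Print terms by decreasing absolute weight, as show_topics does for each topic.
for term, weight in sorted(zip(terms, u), key=lambda p: -abs(p[1])):
    print(f"{weight:+.3f} {term}")
```

Replacing `u` and `v` by `-u` and `-v` gives an equally valid decomposition, which is why the same stems can appear with opposite signs in different topics or between runs.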


What do you think about those results?

Now let's try LDA instead of LSA, again using Gensim.

In [89]:
# TODO: Compute LDA
num_topics_lda = 5
lda_model = LdaModel(corpus=tf_idf_gensim, id2word=dictionary, num_topics=num_topics_lda)

In [90]:
# TODO: print the most frequent words of each topic
output_lda = []
for topic_id, topic in lda_model.show_topics(num_topics=num_topics_lda, num_words=3, formatted=False):
    topic_words = [(word, '{:.3f}*"{:s}"'.format(weight, word)) for word, weight in topic]
    topic_string = " + ".join([word[1] for word in topic_words])
    output_lda.append((topic_id, topic_string))
output_string_lda = "[" + ",\n ".join([str(topic) for topic in output_lda]) + "]"
print(output_string_lda)


[(0, '0.004*"polic" + 0.004*"fire" + 0.003*"new"'),
 (1, '0.006*"second" + 0.004*"abc" + 0.004*"weather"'),
 (2, '0.007*"polic" + 0.005*"man" + 0.005*"charg"'),
 (3, '0.004*"world" + 0.003*"futur" + 0.003*"fall"'),
 (4, '0.003*"health" + 0.003*"rural" + 0.003*"nation"')]


How do the results look with LDA?

Let's make some visualization of the LDA results using pyLDAvis.

In [80]:
# TODO: show visualization results of the LDA
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis
lda_display = gensimvis.prepare(lda_model, tf_idf_gensim, dictionary, sort_topics=False)

# Display the visualization
pyLDAvis.display(lda_display)

Depending on your results, you can try to fine-tune the algorithm (number of topics, hyperparameters, ...)
and compare your results with those of others.
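
One way to compare settings more systematically is a topic coherence score: top words that often appear together in the same headlines yield a higher score. The sketch below implements a UMass-style coherence in pure Python on a toy corpus. The corpus, the word lists and the helper `umass_coherence` are assumptions made for illustration; for real experiments, Gensim ships a `CoherenceModel` class that can compute this kind of metric (for example with `coherence="u_mass"`).

```python
import math
from itertools import combinations

# Toy stemmed corpus (illustrative, loosely based on the stems seen above).
docs = [
    ["polic", "man", "charg"],
    ["polic", "charg", "man"],
    ["weather", "abc", "news"],
    ["polic", "fire"],
    ["abc", "news", "second"],
]

def umass_coherence(top_words, docs):
    """Sum log((D(w_i, w_j) + 1) / D(w_i)) over ordered pairs of top words."""
    doc_sets = [set(d) for d in docs]

    def doc_count(*words):
        # Number of documents containing all the given words.
        return sum(all(w in d for w in words) for d in doc_sets)

    score = 0.0
    for i, j in combinations(range(len(top_words)), 2):
        earlier, later = top_words[i], top_words[j]
        denom = doc_count(earlier)
        if denom == 0:
            continue  # skip words that never occur in the corpus
        score += math.log((doc_count(earlier, later) + 1) / denom)
    return score

coherent = umass_coherence(["polic", "man", "charg"], docs)
mixed = umass_coherence(["polic", "weather", "news"], docs)
print("coherent topic:", round(coherent, 3))
print("mixed topic:", round(mixed, 3))
```

Scores closer to zero (less negative) indicate more coherent topics, so you could train models with several values of `num_topics` and keep the one whose topics score best on average.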