# Topic Modelling for News

![](https://images.unsplash.com/photo-1495020689067-958852a7765e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80)

Photo by [Roman Kraft](https://unsplash.com/photos/_Zua2hyvTBk)

This exercise is about modelling the main topics in a dataset of news headlines.

Begin by importing the needed libraries:

In [48]:
# TODO: import needed libraries
import pandas as pd
import numpy as np 

Load the data from the file `random_headlines.csv`:

In [49]:
# TODO: load the dataset
df = pd.read_csv('random_headlines.csv')
df.head(10)

| | publish_date | headline_text |
|---|---|---|
| 0 | 20120305 | ute driver hurt in intersection crash |
| 1 | 20081128 | 6yo dies in cycling accident |
| 2 | 20090325 | bumper olive harvest expected |
| 3 | 20100201 | replica replaces northernmost sign |
| 4 | 20080225 | woods targets perfect season |
| 5 | 20091120 | leckie salvages dramatic draw for adelaide |
| 6 | 20031024 | group to gauge rail services future |
| 7 | 20130304 | anti hunting rally still going ahead |
| 8 | 20081115 | dr congo refugees receive first aid |
| 9 | 20130304 | thailand signs agreement with muslim rebels |


It is always a good idea to perform some EDA (exploratory data analysis) on a dataset before modelling it.

In [50]:
# TODO: Perform a short EDA
# info() prints the summary itself and returns None, so no print() is needed
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   publish_date   20000 non-null  int64 
 1   headline_text  20000 non-null  object
dtypes: int64(1), object(1)
memory usage: 312.6+ KB
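
The summary above only covers column types and null counts. As a hedged sketch of a slightly deeper EDA, the snippet below parses `publish_date` into real dates and counts words per headline. It uses a small hand-built frame (`sample`, a hypothetical name) holding the first five rows shown earlier, so it runs without the CSV file; the `n_words` column is likewise an illustrative addition.

```python
import pandas as pd

# First five rows from the head() output above, rebuilt by hand
sample = pd.DataFrame({
    "publish_date": [20120305, 20081128, 20090325, 20100201, 20080225],
    "headline_text": [
        "ute driver hurt in intersection crash",
        "6yo dies in cycling accident",
        "bumper olive harvest expected",
        "replica replaces northernmost sign",
        "woods targets perfect season",
    ],
})

# The dates are stored as integers like 20120305; parse them as YYYYMMDD
sample["publish_date"] = pd.to_datetime(
    sample["publish_date"].astype(str), format="%Y%m%d"
)

# Number of whitespace-separated words in each headline
sample["n_words"] = sample["headline_text"].str.split().str.len()

print(sample["publish_date"].dt.year.value_counts().sort_index())
print(sample["n_words"].describe())
```

The same two transformations can be applied to the full `df` to see how the headlines are spread over the years and how short they are.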


Now perform all the needed preprocessing on those headlines: lowercasing, tokenization, punctuation removal, stopword removal, stemming/lemmatization.

In [51]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import re

In [52]:
# TODO: Preprocess the input data
# Initialize the necessary components (the NLTK stopwords, tokenizer and WordNet data must be downloaded beforehand)
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Define a function to preprocess a single headline
def preprocess_headline(headline):
    # Lowercase so that matching against the stopword list is case-insensitive
    headline = headline.lower()

    # Split the headline into word tokens
    tokens = word_tokenize(headline)

    # Keep purely alphabetic tokens; this drops punctuation and mixed tokens such as "6yo"
    tokens = [token for token in tokens if token.isalpha()]

    # Remove English stopwords
    tokens = [token for token in tokens if token not in stop_words]

    # Lemmatize with the default (noun) part of speech, which explains forms like "dies" -> "dy"
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return tokens

# Apply preprocessing to the "headline_text" column
df['processed_headline'] = df['headline_text'].apply(preprocess_headline)
processed_headline = df['processed_headline']

processed_headline

0                 [ute, driver, hurt, intersection, crash]
1                                  [dy, cycling, accident]
2                       [bumper, olive, harvest, expected]
3                  [replica, replaces, northernmost, sign]
4                          [wood, target, perfect, season]
                               ...                        
19995          [judge, attack, walkinshaw, running, arrow]
19996       [polish, govt, collapse, election, held, next]
19997                                  [drum, friday, may]
19998          [winterbottom, bathurst, provisional, pole]
19999    [pulled, pork, pawpaw, salad, local, success, ...
Name: processed_headline, Length: 20000, dtype: object

Now use Gensim to compute a bag-of-words (BOW) representation of the headlines:

In [53]:
from gensim.corpora.dictionary import Dictionary

In [54]:
# TODO: Compute the BOW using Gensim
dictionary = Dictionary(processed_headline)

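# no_below=1 keeps even tokens seen in a single headline; no_above=0.5 drops tokens present in more than half of them
dictionary.filter_extremes(no_below=1, no_above=0.5)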

corpus = [dictionary.doc2bow(text) for text in processed_headline]
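
To make the `(token_id, count)` pairs concrete, here is a minimal pure-Python sketch of the bag-of-words idea. It is not Gensim's implementation: the helper `to_bow`, the `token2id` mapping and the two toy documents are hypothetical, and the order in which ids are assigned is an arbitrary choice.

```python
from collections import Counter

# Toy tokenised headlines (made up for illustration)
docs = [
    ["man", "charged", "man", "court"],
    ["council", "plan", "water"],
]

# Give every distinct token an integer id, in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return a sorted list of (token_id, count) pairs; unknown tokens are ignored."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

print(token2id)
print(to_bow(docs[0]))
print(to_bow(["man", "unknown"]))
```

Word order is discarded and only the counts survive, which is why a headline becomes a short sparse list rather than a sequence.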

Compute the TF-IDF weights using Gensim:

In [55]:
# Dictionary was already imported above; only TfidfModel is new here
from gensim.models import TfidfModel

In [56]:
# TODO: Compute TF-IDF
tfidf = TfidfModel(corpus)
tfidf_corpus = tfidf[corpus]
tfidf_corpus[0]

[(0, 0.3078090045519022),
 (1, 0.3513689017461401),
 (2, 0.4282825995423115),
 (3, 0.5966762015643524),
 (4, 0.4922855238766366)]
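
The five weights above have squares that sum to roughly 1, which is consistent with a length-normalised vector. The sketch below reproduces that behaviour in plain Python, assuming the weighting is term frequency times `log2(N / df)` followed by L2 normalisation (to my understanding the `TfidfModel` default, but treat it as an assumption). The toy documents and the helper `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter

# Toy tokenised headlines (made up for illustration)
docs = [
    ["police", "charge", "man"],
    ["council", "plan", "water"],
    ["man", "missing", "police", "search"],
]

n_docs = len(docs)
# Document frequency: number of documents in which each term occurs
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf_vector(doc):
    """Return {term: weight} using tf * log2(N / df), scaled to unit length."""
    term_freq = Counter(doc)
    raw = {t: tf * math.log2(n_docs / doc_freq[t]) for t, tf in term_freq.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm > 0 else raw

weights = tfidf_vector(docs[0])
print(weights)
```

In this toy corpus "charge" occurs in a single document, so it receives a larger weight than "police" or "man", which each occur in two.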

Finally, compute the **LSA** (also called LSI) using Gensim, with a number of topics of your choice:

In [57]:
from gensim.models import LsiModel

In [58]:
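# TODO: Compute LSA
num_topics = 3

# Fit LSA on the TF-IDF weighted corpus
lsi_model = LsiModel(tfidf_corpus, id2word=dictionary, num_topics=num_topics)

# Show each topic as a signed, weighted combination of its top 10 words
print("LSA Topics:")
for i, topic in lsi_model.print_topics(num_words=10):
    print(f"Topic {i}: {topic}")

# Lazily project every headline into the 3-dimensional topic space
lsa_corpus = lsi_model[tfidf_corpus]

# Project only the first headline instead of materialising the whole corpus
print("\nLSA representation for the first headline:")
print(lsi_model[tfidf_corpus[0]])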

LSA Topics:
Topic 0: 0.467*"man" + 0.420*"police" + 0.226*"charged" + 0.157*"court" + 0.133*"murder" + 0.128*"missing" + 0.122*"death" + 0.118*"face" + 0.117*"new" + 0.113*"crash"
Topic 1: -0.529*"second" + -0.435*"abc" + -0.417*"news" + -0.362*"weather" + -0.278*"business" + -0.215*"sport" + 0.146*"man" + -0.103*"rural" + 0.089*"police" + -0.086*"national"
Topic 2: 0.473*"man" + 0.249*"charged" + -0.228*"council" + -0.217*"new" + -0.207*"govt" + -0.196*"plan" + -0.143*"say" + 0.138*"second" + -0.122*"call" + -0.120*"water"

LSA representation for the first headline:
[(0, 0.06234621506730024), (1, 0.016238525479537948), (2, 0.006267734774048175)]


For each topic, show the most significant words.

In [59]:
# TODO: Print the 3 or 4 most significant words of each topic
print("LSA Topics:")
for i, topic in lsi_model.show_topics(num_topics=num_topics, num_words=4, formatted=False):
    print(f"Topic {i}:")
    significant_words = [word for word, weight in topic]
    print(f"  {', '.join(significant_words)}")

LSA Topics:
Topic 0:
  man, police, charged, court
Topic 1:
  second, abc, news, weather
Topic 2:
  man, charged, council, new


What do you think about those results?

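Answer: The results are not very convincing on this dataset. LSA topics are hard to interpret because each topic is a linear combination of terms with positive and negative weights. Topic 0 and Topic 2 share the same leading words ("man", "charged"), so the top-word lists alone do not show a clear difference between them. The signed weights printed earlier give a little more detail: Topic 2 opposes crime words ("man", "charged") to government words ("council", "govt", "plan"), and Topic 1 groups broadcast words ("abc", "news", "weather").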
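
The mixed signs come from the underlying linear algebra: LSA is a truncated singular value decomposition of the term-document matrix. The sketch below shows this with NumPy on a tiny hand-made count matrix. The terms, the counts and the choice of `k = 2` are all hypothetical, and the sign of each component is arbitrary, so only the relative signs inside a topic are meaningful.

```python
import numpy as np

terms = ["man", "police", "charged", "council", "plan"]

# Rows are terms, columns are documents (made-up counts)
X = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
], dtype=float)

# Thin SVD: X = U @ diag(s) @ Vt, singular values in descending order
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of topics kept
for topic_id in range(k):
    # Rank terms by absolute weight, keeping the sign for display
    order = np.argsort(-np.abs(U[:, topic_id]))
    top = [(terms[i], round(float(U[i, topic_id]), 3)) for i in order[:3]]
    print(f"Topic {topic_id}: {top}")

# Coordinates of each document in the k-dimensional topic space
doc_coords = (U[:, :k].T @ X).T
print(doc_coords.shape)
```

Ranking by absolute weight is what makes a word list such as "second, abc, news, weather" appear for a topic whose weights are mostly negative.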

Now let's try LDA instead of LSA, again using Gensim:

In [63]:
from gensim.models import LdaModel

In [65]:
# TODO: Compute LDA
num_topics = 3

# Create the LDA model using the BOW corpus
lda_model = LdaModel(corpus, id2word=dictionary, num_topics=num_topics, passes=10, random_state=42)

In [66]:
# TODO: print the most frequent words of each topic
# Print the topics with the most significant words
print("LDA Topics:")
for i, topic in lda_model.show_topics(num_topics=num_topics, num_words=4, formatted=False):
    print(f"Topic {i}:")
    significant_words = [word for word, weight in topic]
    print(f"  {', '.join(significant_words)}")

# Transform the corpus to the LDA space and print an example
lda_corpus = lda_model[corpus]

# Print the LDA representation for the first headline as an example
print("\nLDA representation for the first headline:")
print(list(lda_corpus)[0])

LDA Topics:
Topic 0:
  fire, police, death, u
Topic 1:
  man, police, court, charged
Topic 2:
  council, govt, plan, call

LDA representation for the first headline:
[(0, 0.8845285), (1, 0.059799723), (2, 0.05567175)]
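
Unlike the LSA coordinates, these values behave like probabilities: they are non-negative and sum to about 1. A minimal sketch of how to read them is shown below, using the exact list printed above. The helper name `dominant_topic` is hypothetical. Note that, as far as I know, Gensim omits topics below its `minimum_probability` threshold, so the sum can fall slightly short of 1 for other headlines.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])

# LDA representation of the first headline, copied from the output above
first_headline = [(0, 0.8845285), (1, 0.059799723), (2, 0.05567175)]

topic_id, prob = dominant_topic(first_headline)
total = sum(p for _, p in first_headline)

print(f"dominant topic: {topic_id} (p={prob:.3f}), total mass: {total:.3f}")
```

Applying the same helper to every headline gives a hard topic assignment, which is handy for spot-checking whether the headlines in a topic really belong together.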


Now, how does it work with LDA?

Answer: LDA works better on this dataset. It separates the headlines into 3 reasonably distinct topics:
- Topic 0: news about fires, accidents and deaths
- Topic 1: news about crime and court cases
- Topic 2: news about government and council policies

Let's make some visualization of the LDA results using pyLDAvis.

In [67]:
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis

In [68]:
# TODO: show visualization results of the LDA
lda_display = gensimvis.prepare(lda_model, corpus, dictionary, sort_topics=False)

pyLDAvis.display(lda_display)

Depending on your results, you can try to fine-tune the algorithm (number of topics, other hyperparameters) and compare your results with those of others.