# Topic Modelling for News

![](https://images.unsplash.com/photo-1495020689067-958852a7765e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80)

Photo by [Roman Kraft](https://unsplash.com/photos/_Zua2hyvTBk)

This exercise is about modelling the main topics in a dataset of news headlines.

Begin by importing the needed libraries:

In [2]:
# TODO: import needed libraries
import pandas as pd
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Load the data in the file `random_headlines.csv`

In [3]:
# TODO: load the dataset
df = pd.read_csv("/content/random_headlines.csv")
df

,publish_date,headline_text
0,20120305,ute driver hurt in intersection crash
1,20081128,6yo dies in cycling accident
2,20090325,bumper olive harvest expected
3,20100201,replica replaces northernmost sign
4,20080225,woods targets perfect season
...,...,...
19995,20030301,judge attacks walkinshaw over running of arrows
19996,20070908,polish govt collapses elections to be held next
19997,20150529,the drum friday may 29
19998,20071006,winterbottom on bathurst provisional pole


It is always a good idea to perform some EDA (exploratory data analysis) on a dataset before modelling it.

In [4]:
# TODO: Perform a short EDA

# Check the shape of the DataFrame
print("Shape of the DataFrame:", df.shape)

# Check the data types of each column
print("Data types of each column:")
print(df.dtypes)

# Check for missing values
print("Missing values:")
print(df.isnull().sum())

# Display descriptive statistics for numerical columns
print("Descriptive statistics for numerical columns:")
print(df.describe())

# Display value counts for categorical columns
print("Value counts for categorical columns:")
for column in df.select_dtypes(include=['object']).columns:
    print(df[column].value_counts())


Shape of the DataFrame: (20000, 2)
Data types of each column:
publish_date      int64
headline_text    object
dtype: object
Missing values:
publish_date     0
headline_text    0
dtype: int64
Descriptive statistics for numerical columns:
       publish_date
count  2.000000e+04
mean   2.009558e+07
std    3.875403e+04
min    2.003022e+07
25%    2.006082e+07
50%    2.010022e+07
75%    2.013042e+07
max    2.017072e+07
Value counts for categorical columns:
headline_text
weather in 90 seconds                                     19
abc sport                                                 18
abc weather                                               17
news in 90 seconds                                        14
national rural news                                       12
                                                          ..
christian protesters occupy zed seselja office             1
western nsw minister says health hotline expanded          1
govt considers diversion camp for alice you
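
One detail in the EDA output above: `publish_date` is stored as an integer, so statistics such as its mean (2.009558e+07) do not describe real dates. The values look like `YYYYMMDD`, and under that assumption they can be parsed into proper dates. The sketch below uses only the standard library and the five dates visible in the table above; the names `sample_publish_dates`, `parsed` and `headlines_per_year` are made up for this illustration.

```python
from collections import Counter
from datetime import datetime

# The first five publish_date values shown in the DataFrame preview above
sample_publish_dates = [20120305, 20081128, 20090325, 20100201, 20080225]

# Assumption: each integer encodes a date as YYYYMMDD
parsed = [datetime.strptime(str(value), "%Y%m%d").date() for value in sample_publish_dates]

# Count headlines per year and report the covered date range
headlines_per_year = Counter(date.year for date in parsed)
print("Date range:", min(parsed), "to", max(parsed))
print("Headlines per year:", sorted(headlines_per_year.items()))
```

In the notebook itself, the pandas counterpart would be `pd.to_datetime(df["publish_date"].astype(str), format="%Y%m%d")`, after which `describe()` and grouping by year become meaningful.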

Now perform all the needed preprocessing on those headlines: case lowering, tokenization, punctuation removal, stopwords removal, stemming/lemmatization.

In [5]:
# TODO: Preprocess the input data

# Lowercase the headlines
df['headline_text'] = df['headline_text'].apply(lambda x: x.lower())

# Tokenization
df['headline_text'] = df['headline_text'].apply(lambda x: word_tokenize(x))

# Remove punctuation
df['headline_text'] = df['headline_text'].apply(lambda x: [word for word in x if word.isalnum()])

# Remove stopwords
stop_words = set(stopwords.words('english'))
df['headline_text'] = df['headline_text'].apply(lambda x: [word for word in x if word not in stop_words])

# Stemming
stemmer = PorterStemmer()
df['headline_text'] = df['headline_text'].apply(lambda x: [stemmer.stem(word) for word in x])

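# Lemmatization (applied to already-stemmed tokens, so it changes little here;
# in practice, choose either stemming or lemmatization)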
lemmatizer = WordNetLemmatizer()
df['headline_text'] = df['headline_text'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

print(df['headline_text'])


0                    [ute, driver, hurt, intersect, crash]
1                                  [6yo, die, cycl, accid]
2                          [bumper, oliv, harvest, expect]
3                    [replica, replac, northernmost, sign]
4                          [wood, target, perfect, season]
                               ...                        
19995               [judg, attack, walkinshaw, run, arrow]
19996           [polish, govt, collaps, elect, held, next]
19997                              [drum, friday, may, 29]
19998            [winterbottom, bathurst, provision, pole]
19999    [pull, pork, pawpaw, salad, local, success, st...
Name: headline_text, Length: 20000, dtype: object


Now use Gensim to compute a bag-of-words (BOW) representation of the headlines.

In [6]:
# TODO: Compute the BOW using Gensim
from gensim.corpora.dictionary import Dictionary
from gensim.models import TfidfModel
from gensim.matutils import corpus2dense

# Create a dictionary from the preprocessed headlines
dictionary = Dictionary(df['headline_text'])

# Create a Bag of Words corpus
bow_corpus = [dictionary.doc2bow(doc) for doc in df['headline_text']]

# Display the BOW representation for the first headline
print("Bag of Words representation for the first headline:")
print(bow_corpus[0])

# Alternatively, you can use TF-IDF representation
tfidf_model = TfidfModel(bow_corpus)
tfidf_corpus = tfidf_model[bow_corpus]

# Convert TF-IDF corpus to dense matrix for easier viewing
tfidf_matrix = corpus2dense(tfidf_corpus, num_terms=len(dictionary)).T

# Display the TF-IDF representation for the first headline
print("\nTF-IDF representation for the first headline:")
print(tfidf_matrix[0])


Bag of Words representation for the first headline:
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)]

TF-IDF representation for the first headline:
[0.30725467 0.35289437 0.4212905  ... 0.         0.         0.        ]
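
To make the BOW output above less opaque, here is a minimal pure-Python sketch of what a dictionary plus a `doc2bow`-style step amounts to: each distinct token gets an integer id, and each document becomes a list of `(token_id, count)` pairs. It is not Gensim's implementation. The helper names `build_vocabulary` and `to_bow` are made up, and sorting the tokens of each document before assigning ids is an assumption chosen to mirror the ids seen in the output above.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Assign an integer id to every distinct token (hypothetical helper)."""
    token_to_id = {}
    for doc in tokenized_docs:
        # Assumption: new tokens of a document receive ids in sorted order
        for token in sorted(set(doc)):
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id

def to_bow(doc, token_to_id):
    """Turn one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(token for token in doc if token in token_to_id)
    return sorted((token_to_id[token], count) for token, count in counts.items())

# The first two preprocessed headlines from the output above
toy_docs = [
    ["ute", "driver", "hurt", "intersect", "crash"],
    ["6yo", "die", "cycl", "accid"],
]

token_to_id = build_vocabulary(toy_docs)
toy_bow = [to_bow(doc, token_to_id) for doc in toy_docs]
print(toy_bow[0])
```

Word order is discarded: only which tokens occur, and how often, survives in the representation.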


Now compute the TF-IDF representation using Gensim.

In [7]:
# TODO: Compute TF-IDF
from gensim.corpora import Dictionary
from gensim.models import TfidfModel

# Create a dictionary from the preprocessed headlines
dictionary = Dictionary(df['headline_text'])

# Create a Bag of Words corpus
bow_corpus = [dictionary.doc2bow(doc) for doc in df['headline_text']]

# Compute TF-IDF
tfidf_model = TfidfModel(bow_corpus)
tfidf_corpus = tfidf_model[bow_corpus]

# Print TF-IDF representation for the first headline
print("TF-IDF representation for the first headline:")
for word_id, tfidf_value in tfidf_corpus[0]:
    print(f"Word: {dictionary[word_id]}, TF-IDF: {tfidf_value}")



TF-IDF representation for the first headline:
Word: crash, TF-IDF: 0.30725466582280214
Word: driver, TF-IDF: 0.3528943781678455
Word: hurt, TF-IDF: 0.42129048115131124
Word: intersect, TF-IDF: 0.5992666854471201
Word: ute, TF-IDF: 0.49442279315598586
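
The weights above can be reproduced by hand in a few lines. The sketch below assumes the weighting scheme that Gensim's `TfidfModel` is understood to use by default: raw term count times a base-2 logarithm of (number of documents / document frequency), followed by scaling each document vector to unit length. The unit length is consistent with the output above, where the squared weights sum to 1. The toy corpus and the function name `tfidf_vectors` are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Compute L2-normalised TF-IDF weights per document (illustrative sketch)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each token appears
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    vectors = []
    for doc in tokenized_docs:
        weights = {
            token: count * math.log2(n_docs / doc_freq[token])
            for token, count in Counter(doc).items()
        }
        # Tokens present in every document get weight 0 and are dropped
        weights = {token: w for token, w in weights.items() if w > 0}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({token: w / norm for token, w in weights.items()})
    return vectors

# Hypothetical toy corpus of stemmed tokens
toy_docs = [
    ["man", "charg", "murder"],
    ["man", "crash"],
    ["polic", "charg", "man"],
]

toy_tfidf = tfidf_vectors(toy_docs)
for vector in toy_tfidf:
    print(vector)
```

In this toy corpus "man" appears in every document, so it carries no weight, while rarer tokens such as "murder" dominate their document. The same effect explains why "intersect" and "ute" outweigh "crash" in the first headline.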


Finally, compute the **LSA** (also called LSI) using Gensim, for a number of topics that you choose yourself.

In [8]:
# TODO: Compute LSA
from gensim.models import LsiModel

# Choose the number of topics (dimensionality reduction)
num_topics = 3

# Compute LSA
lsa_model = LsiModel(tfidf_corpus, num_topics=num_topics, id2word=dictionary)

# Print the topics
print("LSA Topics:")
lsa_topics = lsa_model.print_topics()
for topic in lsa_topics:
    print(topic)



LSA Topics:
(0, '0.455*"man" + 0.385*"polic" + 0.315*"charg" + 0.151*"court" + 0.142*"murder" + 0.128*"face" + 0.111*"new" + 0.110*"crash" + 0.110*"woman" + 0.108*"miss"')
(1, '0.439*"second" + 0.413*"90" + 0.341*"abc" + 0.298*"news" + 0.298*"weather" + -0.237*"man" + 0.233*"busi" + 0.184*"sport" + -0.149*"charg" + 0.101*"plan"')
(2, '-0.385*"man" + -0.273*"charg" + -0.264*"second" + -0.253*"90" + 0.222*"plan" + 0.197*"council" + 0.192*"govt" + 0.183*"new" + -0.168*"weather" + -0.164*"abc"')


For each topic, show the most significant words.

In [26]:
# TODO: Print the 3 or 4 most significant words of each topic
from gensim import corpora, models

# Assuming `dictionary` and `tfidf_corpus` are already defined
corpus = tfidf_corpus

lsi_model = models.LsiModel(corpus=corpus, num_topics=4, id2word=dictionary)
topics = lsi_model.print_topics(num_topics=4, num_words=4)
for idx, topic in topics:
    print(f"Topic {idx}: {topic}")






Topic 0: 0.453*"man" + 0.386*"polic" + 0.315*"charg" + 0.147*"court"
Topic 1: -0.437*"second" + -0.412*"90" + -0.342*"abc" + -0.298*"news"
Topic 2: -0.379*"man" + -0.280*"charg" + -0.263*"second" + -0.251*"90"
Topic 3: 0.774*"polic" + -0.239*"man" + -0.218*"charg" + 0.174*"investig"


What do you think about those results?
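
One thing to notice when comparing the two LSA runs above: topic 1 lists the same leading words, but with opposite signs (0.439*"second" in the first run, -0.437*"second" in the second). This is expected, because a singular vector and its negation describe the same direction. The sketch below illustrates this on a toy matrix with a hand-written power iteration in plain Python. It is not how Gensim computes LSI, and `docs_by_terms` and `top_singular_triple` are names made up for the illustration.

```python
import math

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vec)) for row in matrix]

def normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_singular_triple(matrix, start, n_iter=200):
    """Power iteration on A^T A: returns (sigma, u, v) for the largest singular value."""
    transposed = [list(col) for col in zip(*matrix)]
    v = normalize(start)
    for _ in range(n_iter):
        v = normalize(matvec(transposed, matvec(matrix, v)))
    av = matvec(matrix, v)
    sigma = math.sqrt(sum(x * x for x in av))
    u = [x / sigma for x in av]
    return sigma, u, v

# Toy document-term matrix: rows are documents, columns are terms
docs_by_terms = [
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 1.0],
]

# Two starting points that differ only by sign
sigma_a, u_a, v_a = top_singular_triple(docs_by_terms, start=[1.0, 2.0, 3.0])
sigma_b, u_b, v_b = top_singular_triple(docs_by_terms, start=[-1.0, -2.0, -3.0])

print("sigma:", sigma_a, sigma_b)
print("term weights, run A:", v_a)
print("term weights, run B:", v_b)
```

Both runs find the same singular value and the same rank-1 approximation `sigma * u * v^T`, yet every term weight is negated. When reading LSA topics, the relative signs within a topic matter (words pulling in opposite directions), not the overall sign of the topic.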

Now let's try LDA instead of LSA, again using Gensim.

In [27]:
# TODO: Compute LDA
from gensim.models import LdaModel

# Define the number of topics
num_topics = 4

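# Compute LDA (note: `corpus` is the TF-IDF corpus defined in the previous cell;
# LDA is usually fit on raw counts such as `bow_corpus`)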
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)

# Print the topics
for idx, topic in lda_model.print_topics(num_topics=num_topics, num_words=4):
    print(f"Topic {idx}: {topic}")



Topic 0: 0.004*"second" + 0.003*"interview" + 0.003*"plan" + 0.003*"new"
Topic 1: 0.003*"review" + 0.003*"nation" + 0.003*"rural" + 0.003*"countri"
Topic 2: 0.006*"polic" + 0.004*"abc" + 0.004*"charg" + 0.003*"arrest"
Topic 3: 0.004*"man" + 0.004*"kill" + 0.003*"market" + 0.003*"crash"


In [28]:
# TODO: print the most frequent words of each topic
print("Most frequent words for each topic:")
for idx, topic in lda_model.show_topics(num_topics=num_topics, num_words=4, formatted=False):
    words = [word for word, _ in topic]
    print(f"Topic {idx}: {', '.join(words)}")


Most frequent words for each topic:
Topic 0: second, interview, plan, new
Topic 1: review, nation, rural, countri
Topic 2: polic, abc, charg, arrest
Topic 3: man, kill, market, crash


How do the LDA topics compare with the LSA ones?

In [30]:
%pip install pyLDAvis


Collecting pyLDAvis
  Downloading pyLDAvis-3.4.1-py3-none-any.whl (2.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.6/2.6 MB 8.7 MB/s eta 0:00:00
Collecting funcy (from pyLDAvis)
  Downloading funcy-2.0-py2.py3-none-any.whl (30 kB)
Installing collected packages: funcy, pyLDAvis
Successfully installed funcy-2.0 pyLDAvis-3.4.1


Let's make some visualization of the LDA results using pyLDAvis.

In [31]:
# TODO: show visualization results of the LDA
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

# Create the visualization
lda_vis = gensimvis.prepare(lda_model, corpus, dictionary)

# Display the visualization
pyLDAvis.display(lda_vis)


Depending on your results, you can try to fine-tune the algorithm (number of topics, hyperparameters, and so on) and compare your results with those of others.
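
When comparing settings such as the number of topics, a numeric score can complement visual inspection. One commonly used family is topic coherence, which rewards topics whose top words tend to occur in the same documents. The sketch below is a small pure-Python version of a UMass-style coherence, written for illustration only: the toy corpus, the word lists and the function name `umass_coherence` are hypothetical. Gensim provides a `CoherenceModel` class for computing this kind of score on the real corpus; the code below does not use it.

```python
import math

def umass_coherence(top_words, tokenized_docs):
    """UMass-style coherence: sum of log((D(w_m, w_l) + 1) / D(w_l)) over word pairs.

    Assumes every word in top_words occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents containing all the given words
        return sum(1 for doc in doc_sets if all(word in doc for word in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / doc_freq(top_words[l]))
    return score

# Hypothetical toy corpus of stemmed headlines
toy_docs = [
    ["polic", "charg", "man"],
    ["man", "charg", "murder"],
    ["polic", "arrest", "man"],
    ["council", "plan", "new"],
    ["govt", "plan", "council"],
]

coherent_score = umass_coherence(["man", "charg", "polic"], toy_docs)
mixed_score = umass_coherence(["man", "plan", "murder"], toy_docs)
print("coherent topic:", coherent_score)
print("mixed topic:", mixed_score)
```

Scores closer to zero indicate top words that co-occur more often. Repeating the computation for models trained with different numbers of topics gives one possible basis for choosing among them, alongside the pyLDAvis view.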