# Topic Modelling using Latent Dirichlet Allocation #

LDA is used to assign the text in a document to a particular topic. It builds a topics-per-document model and a words-per-topic model, both modeled as Dirichlet distributions.

* Each document is modeled as a multinomial distribution of topics and each topic is modeled as a multinomial distribution of words.
* LDA assumes that every chunk of text we feed into it contains words that are somehow related. Choosing the right corpus of data is therefore crucial.
* It also assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution.
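
The generative story in the bullets above can be sketched in a few lines. This is only an illustration using the standard library: the vocabulary, the number of topics, and the Dirichlet hyperparameters (`0.5` and `0.3`) are made-up values, and the helper names are hypothetical.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_dists, vocab, doc_alpha, length, rng):
    """Draw a topic mixture for the document, then a topic and a word per position."""
    theta = sample_dirichlet(doc_alpha, rng)  # per-document topic mixture
    words = []
    for _ in range(length):
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        words.append(rng.choices(vocab, weights=topic_word_dists[topic])[0])
    return theta, words

rng = random.Random(0)
vocab = ["goal", "puck", "dunk", "rebound", "touchdown", "punt"]  # toy vocabulary
num_topics = 3
# Each topic's word distribution is itself a Dirichlet draw (toy hyperparameter 0.5)
topic_word_dists = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(num_topics)]
theta, words = generate_document(topic_word_dists, vocab, [0.3] * num_topics, 8, rng)
print([round(p, 3) for p in theta])
print(words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the per-topic word distributions.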

## Imports & Web Scraping ##

In [78]:
# Standard imports
import pandas as pd

# For web scraping
import requests
#import urllib.request
from bs4 import BeautifulSoup

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)


In [79]:
r = requests.get("https://bleacherreport.com/nba")
r
r2 = requests.get("https://www.thescore.com/nba/news")
r2

<Response [200]>

In [80]:
# soup = BeautifulSoup(r2.content)

In [81]:

# Print out the prettified HTML content to understand its structure
# print(soup.prettify())

### Sports News Headlines Data

In [82]:

# Send a request to the website
url = "https://bleacherreport.com/nba"
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(response.content, 'html.parser')

# Find all the headline elements
    headlines = []
    for item in soup.find_all('h3'):
        headlines.append(item.get_text(strip=True))

# Load the headlines into a DataFrame
    df = pd.DataFrame(headlines, columns=['Headline'])

# Remove Duplicates
    df = df.drop_duplicates().reset_index(drop=True)
    print(df)
else:
    print("Failed to retrieve the page. Status code:", response.status_code)

                                   Headline
0                                       NBA
1         Dame's Game-Winner vs. Rockets ⌚️
2                      76ers Fall to 2-11 📉
3                     Bam Mixes Up Embiid 🌀
4               Jericho Sims Dunks on Kuz 😦
..                                      ...
71      Bron Extends Triple-Double Streak 📈
72                  D-Book Shoves Lu Dort 😳
73              Jimmy's New Look on Bench 😂
74                 AD Scores 40 vs. Spurs 💪
75  76ers Fall to 2-10 After Loss vs. Magic

[76 rows x 1 columns]


In [83]:
df = df.drop(index=0).reset_index(drop=True)
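
The scraping cells in this section repeat the same parse, de-duplicate, and drop-the-section-label steps. A possible refactor is sketched below. The function name `extract_headlines` and its `skip` parameter are hypothetical, and the HTML string is a made-up stand-in for `requests.get(url).content`.

```python
from bs4 import BeautifulSoup

def extract_headlines(html, tag="h3", skip=()):
    """Return the unique, non-empty texts of all `tag` elements, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    texts = (item.get_text(strip=True) for item in soup.find_all(tag))
    # dict.fromkeys keeps the first occurrence of each headline and preserves order
    unique = dict.fromkeys(t for t in texts if t and t not in skip)
    return list(unique)

# Made-up HTML used in place of a live response
sample_html = (
    "<h3>NBA</h3><h3>76ers Fall to 2-11</h3>"
    "<h3>76ers Fall to 2-11</h3><h3> Bam Mixes Up Embiid </h3>"
)
print(extract_headlines(sample_html, skip=("NBA",)))
# ['76ers Fall to 2-11', 'Bam Mixes Up Embiid']
```

Passing the section label through `skip` replaces the separate `drop(index=0)` step, and the returned list can be handed directly to `pd.DataFrame(..., columns=['Headline'])`.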

In [84]:


# Send a request to the website
url = "https://www.nytimes.com/athletic/3999271/2022/12/20/best-of-the-athletic-2022-top-stories-for-mlb-nfl-nhl-nba-global-football-and-more/"
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(response.content, 'html.parser')

# Find all the headline elements
    headlines2 = []
    for item in soup.find_all('h3'):
        headlines2.append(item.get_text(strip=True))

# Load the headlines into a DataFrame
    df2 = pd.DataFrame(headlines2, columns=['Headline'])

# Remove Duplicates
    df2 = df2.drop_duplicates().reset_index(drop=True)
    print(df2)
else:
    print("Failed to retrieve the page. Status code:", response.status_code)

                                             Headline
0   The Eugene Melnyk era in Ottawa: Hopeful, then...
1   ‘The most toxic environment I’ve ever been a p...
2   ‘We went through hell’: Former players accuse ...
3   ‘A failed system’: A corrupt process exploits ...
4   Michigan men’s hockey coach accused of beratin...
..                                                ...
68  The Athletic Football Podcast: The Story of th...
69                                 Talk of the Devils
70                                     Away from Home
71              The Athletic Women’s Football Podcast
72                                   Football Clichés

[73 rows x 1 columns]


In [85]:


# Send a request to the website
url = "https://bleacherreport.com/nfl"
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(response.content, 'html.parser')

# Find all the headline elements
    headlines4 = []
    for item in soup.find_all('h3'):
        headlines4.append(item.get_text(strip=True))

# Load the headlines into a DataFrame
    df4 = pd.DataFrame(headlines4, columns=['Headline'])

# Remove Duplicates
    df4 = df4.drop_duplicates().reset_index(drop=True)
    print(df4)
else:
    print("Failed to retrieve the page. Status code:", response.status_code)

                                 Headline
0                                     NFL
1   Texans Blow Out Cowboys in Dallas 😮‍💨
2            Texans Send Cowboys to 3-7 😬
3               CeeDee’s Face on Bench ☹️
4         Broken Piece of Cowboys’ Roof 😶
..                                    ...
62          Beyonce x NFL Halftime Show 🎄
63          Colts Show Richardson Love 🗣️
64               Week 11 NFL Highlights 🍿
65         Steelers Hold Off Ravens 18-16
66               Lamar's Quote After Loss

[67 rows x 1 columns]


In [86]:
df4 = df4.drop(index=0).reset_index(drop=True)

In [87]:
# Send a request to the website
url = "https://bleacherreport.com/mlb"
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(response.content, 'html.parser')

# Find all the headline elements
    headlines5 = []
    for item in soup.find_all('h3'):
        headlines5.append(item.get_text(strip=True))

# Load the headlines into a DataFrame
    df5 = pd.DataFrame(headlines5, columns=['Headline'])

# Remove Duplicates
    df5 = df5.drop_duplicates().reset_index(drop=True)
    print(df5)
else:
    print("Failed to retrieve the page. Status code:", response.status_code)

                                             Headline
0                                                 MLB
1                     Paul Skenes, Luis Gil Win ROY 🏆
2                           Skenes’ Reaction to ROY 😂
3                          Pirates’ Post for Skenes 💛
4                        Yankees' Hype Tape for Gil 😤
..                                                ...
64                      AL, NL Platinum Glove Winners
65             Offseason Preview for Every MLB Team 📝
66                     MLB Moneyball Power Rankings 🤑
67  Babe Ruth's Rare 1916 Rookie Card to Be Sold a...
68                Vogt Deserves Manager of the Year 🗣

[69 rows x 1 columns]


In [88]:
df5 = df5.drop(index=0).reset_index(drop=True)

In [89]:
# Send a request to the website
url = "https://bleacherreport.com/nhl"
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(response.content, 'html.parser')

# Find all the headline elements
    headlines6 = []
    for item in soup.find_all('h3'):
        headlines6.append(item.get_text(strip=True))

# Load the headlines into a DataFrame
    df6 = pd.DataFrame(headlines6, columns=['Headline'])

# Remove Duplicates
    df6 = df6.drop_duplicates().reset_index(drop=True)
    print(df6)
else:
    print("Failed to retrieve the page. Status code:", response.status_code)

                                         Headline
0                                             NHL
1                           Monday's Top 10 Plays
2             Macklin Celebrini Gets on the Board
3       1-Goal 1st Period in Sharks vs. Red Wings
4                        Ovi Going Off vs. Utah 🔥
..                                            ...
65     NHL Draft Prospects Stocks That Are Rising
66  1 Word for Every Team After the First Month 📖
67                  Jack Hughes' Breakaway Goal 😈
68                 Celebrini Goes Back-To-Back ✌️
69                    Latest NHL Power Rankings 📊

[70 rows x 1 columns]


In [90]:
df6 = df6.drop(index=0).reset_index(drop=True)

In [91]:
df6

Unnamed: 0,Headline
0,Monday's Top 10 Plays
1,Macklin Celebrini Gets on the Board
2,1-Goal 1st Period in Sharks vs. Red Wings
3,Ovi Going Off vs. Utah 🔥
4,Mike Reilly to Undergo Heart Surgery
...,...
64,NHL Draft Prospects Stocks That Are Rising
65,1 Word for Every Team After the First Month 📖
66,Jack Hughes' Breakaway Goal 😈
67,Celebrini Goes Back-To-Back ✌️


In [92]:
# combine all scraped data into 1 single data frame

combined_df = pd.concat([df, df2, df4, df5, df6], ignore_index=True)

# Remove any repeat entries
combined_df = combined_df.drop_duplicates().reset_index(drop=True)

print(combined_df)

                                          Headline
0                Dame's Game-Winner vs. Rockets ⌚️
1                             76ers Fall to 2-11 📉
2                            Bam Mixes Up Embiid 🌀
3                      Jericho Sims Dunks on Kuz 😦
4                       KAT's Easy 24-Point Game 🗽
..                                             ...
345     NHL Draft Prospects Stocks That Are Rising
346  1 Word for Every Team After the First Month 📖
347                  Jack Hughes' Breakaway Goal 😈
348                 Celebrini Goes Back-To-Back ✌️
349                    Latest NHL Power Rankings 📊

[350 rows x 1 columns]


In [93]:
print(len(combined_df))

350


### CNN 2023 Top news data ###

In [94]:
# Send a request to the website
url = "https://www.cnn.com/2023/12/28/world/top-100-digital-stories-2023-mabry/index.html"
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(response.content, 'html.parser')

# Find all <a> tags
    headlines8 = []
    for a_tag in soup.find_all('a', href=True):  # Finds all <a> tags with an href attribute
        headline = a_tag.get_text(strip=True)    # Extracts the text inside the <a> tag
        if headline:  # Ensure text is not empty
            headlines8.append(headline)

# Load the headlines into a DataFrame
    df8 = pd.DataFrame(headlines8, columns=['Headline'])

# Remove Duplicates
    df8 = df8.drop_duplicates().reset_index(drop=True)
    print(df8)
else:
    print("Failed to retrieve the page. Status code:", response.status_code)

               Headline
0                 World
1                Africa
2              Americas
3                  Asia
4             Australia
..                  ...
232  Accessibility & CC
233               About
234         Newsletters
235         Transcripts
236         Help Center

[237 rows x 1 columns]


In [95]:
df8 = df8.iloc[133:228]
df8 = df8.reset_index(drop=True)
df8

Unnamed: 0,Headline
0,Damar Hamlin’s health update live story
1,Missing Titanic sub crew killed after ‘catastr...
2,"Lewiston, Maine mass shootings live story"
3,Arnold Schwarzenegger says friend Bruce Willis...
4,Missing Titanic sub search news live story
...,...
90,Tropical storm Idalia forecast
91,Idalia live story
92,‘White’ hydrogen discovery
93,Alabama woman who went missing after reporting...


## Data Preprocessing ##

We will perform the following steps:

* Remove emoji characters.
* **Tokenization**: Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.
* Words that have fewer than 3 characters are removed.
* All **stopwords** are removed.
* Words are **lemmatized** - words in third person are changed to first person and verbs in past and future tenses are changed into present.
* Words are **stemmed** - words are reduced to their root form.

In [96]:
!pip install gensim



In [97]:
!pip install pyldavis



In [98]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np


In [99]:
import pyLDAvis
import pyLDAvis.gensim

In [100]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [101]:
import re

# Function to remove emojis

def remove_emojis(text):
    emoji_pattern = re.compile(
        "["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing, geometric shapes, misc symbols, dingbats, arrows
        u"\U00002702-\U000027B0"
        u"\U00002700-\U000027BF"
        u"\U0001F900-\U0001F9FF"  # supplemental symbols
        u"\U0001FA70-\U0001FAFF"  # symbols and pictographs extended-A
        u"\U00002600-\U000026FF"  # Misc symbols
        u"\U00002B50-\U00002BFF"
        u"\U0001F000-\U0001F02F"
        "]+", flags=re.UNICODE
    )
    return emoji_pattern.sub(r"", text)

# Apply to DataFrame
combined_df['Headline'] = combined_df['Headline'].apply(remove_emojis)

In [102]:
combined_df

Unnamed: 0,Headline
0,Dame's Game-Winner vs. Rockets ⌚️
1,76ers Fall to 2-11
2,Bam Mixes Up Embiid
3,Jericho Sims Dunks on Kuz
4,KAT's Easy 24-Point Game
...,...
345,NHL Draft Prospects Stocks That Are Rising
346,1 Word for Every Team After the First Month
347,Jack Hughes' Breakaway Goal
348,Celebrini Goes Back-To-Back ️
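
The output above still shows `⌚️` in row 0 and a stray invisible character at the end of row 348. The watch (U+231A) lies outside the ranges in `remove_emojis`, and the variation selector (U+FE0F) that follows many emoji is not covered either. A possible extension is sketched below. The added ranges and the function name `remove_emojis_extended` are assumptions, and unlike the original function this one also strips surrounding whitespace.

```python
import re

# Hypothetical broader pattern; the exact ranges are an assumption of this sketch
EXTENDED_EMOJI_PATTERN = re.compile(
    "["
    "\U0001F000-\U0001FAFF"  # emoji and pictograph blocks in the supplementary plane
    "\u2300-\u23FF"          # Miscellaneous Technical (includes the watch, U+231A)
    "\u2500-\u2BFF"          # box drawing through miscellaneous symbols and arrows
    "\uFE0F"                 # variation selector-16 (emoji presentation)
    "\u200D"                 # zero-width joiner used inside emoji sequences
    "]+"
)

def remove_emojis_extended(text):
    """Remove emoji, their joiners and selectors, then trim leftover whitespace."""
    return EXTENDED_EMOJI_PATTERN.sub("", text).strip()

print(repr(remove_emojis_extended("Dame's Game-Winner vs. Rockets \u231A\uFE0F")))
print(repr(remove_emojis_extended("Celebrini Goes Back-To-Back \u270C\uFE0F")))
```

Because `simple_preprocess` keeps only alphabetic tokens later on, the leftover symbols do not change the topic model here, but removing them keeps the cleaned `Headline` column consistent.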


In [103]:
# function to perform the pre processing steps on the entire dataset

stemmer = SnowballStemmer("english")

def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Tokenize and lemmatize
def preprocess(text):

    result = []

    for token in gensim.utils.simple_preprocess(text) :

        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 2:

            # Apply lemmatize_stemming() on the token, then add to the results list
            result.append(lemmatize_stemming(token))

    return result

In [104]:
# preprocess all the headlines, saving the list of results as 'processed_headlines'

processed_headlines = combined_df['Headline'].map(preprocess)

processed_headlines8 = df8['Headline'].map(preprocess)

In [105]:
processed_headlines

Unnamed: 0,Headline
0,"[dame, game, winner, rocket]"
1,"[er, fall]"
2,"[bam, mix, embiid]"
3,"[jericho, sim, dunk, kuz]"
4,"[kat, easi, point, game]"
...,...
345,"[nhl, draft, prospect, stock, rise]"
346,"[word, team, month]"
347,"[jack, hugh, breakaway, goal]"
348,"[celebrini, go]"


In [106]:
processed_headlines8

Unnamed: 0,Headline
0,"[damar, hamlin, health, updat, live, stori]"
1,"[miss, titan, sub, crew, kill, catastroph, imp..."
2,"[lewiston, main, mass, shoot, live, stori]"
3,"[arnold, schwarzenegg, say, friend, bruce, wil..."
4,"[miss, titan, sub, search, news, live, stori]"
...,...
90,"[tropic, storm, idalia, forecast]"
91,"[idalia, live, stori]"
92,"[white, hydrogen, discoveri]"
93,"[alabama, woman, go, miss, report, toddler, wa..."


## Perform Bag of words on dataset

Create a dictionary from `processed_headlines` that maps each unique word to an integer id and tracks how often each word appears in the training set.

In [107]:
dictionary = gensim.corpora.Dictionary(processed_headlines)

dictionary8 = gensim.corpora.Dictionary(processed_headlines8)

In [108]:
# View snippet of Sports Headlines data dictionary
count = 0

for k, v in dictionary.iteritems():

    print(k, v)

    count += 1

    if count > 10:
        break

0 dame
1 game
2 rocket
3 winner
4 er
5 fall
6 bam
7 embiid
8 mix
9 dunk
10 jericho


In [109]:
# View snippet of CNN news Headlines data dictionary
count = 0

for k, v in dictionary8.iteritems():

    print(k, v)

    count += 1

    if count > 10:
        break

0 damar
1 hamlin
2 health
3 live
4 stori
5 updat
6 catastroph
7 crew
8 implos
9 kill
10 miss


**Gensim filter_extremes**

[`filter_extremes(no_below=5, no_above=0.5, keep_n=100000)`](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.filter_extremes)

Filter out tokens that appear in

* fewer than no_below documents (absolute number) or
* more than no_above documents (fraction of total corpus size, not absolute number).
* after (1) and (2), keep only the first keep_n most frequent tokens (or keep all if None).
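
The filtering rule can be illustrated with a small pure-Python re-implementation. This is a simplified sketch, not Gensim's code: the function name is hypothetical, it works on token strings rather than ids, and the rounding of the upper threshold to `int(no_above * num_docs)` is an assumption.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below=5, no_above=0.5, keep_n=100000):
    """Return the set of tokens that survive document-frequency filtering."""
    num_docs = len(docs)
    # Document frequency: the number of documents containing the token at least once
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = int(no_above * num_docs)
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Most frequent first; the alphabetical tie-break only keeps the sketch deterministic
    kept.sort(key=lambda t: (-doc_freq[t], t))
    if keep_n is not None:
        kept = kept[:keep_n]
    return set(kept)

docs = [
    ["game", "winner", "rocket"],
    ["game", "score", "win"],
    ["game", "score", "trade"],
    ["game", "trade", "rumor"],
]
print(sorted(filter_extremes_sketch(docs, no_below=2, no_above=0.75)))
# ['score', 'trade']
```

Here "game" is dropped because it appears in all four documents (more than 75%), and the words that appear in only one document fall below `no_below=2`.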

In [110]:
# filter out exceptionally rare and very common words

# apply filter_extremes() to drop:
# words appearing in fewer than 3 documents
# words appearing in more than 70% of all documents

dictionary.filter_extremes(no_below=3, no_above=0.7)

dictionary8.filter_extremes(no_below=3, no_above=0.7)

In [111]:
# Create the bag-of-words representation of each headline.
# Each headline becomes a list of (token_id, count) pairs for the dictionary words it contains.

bow_corpus = [dictionary.doc2bow(head) for head in processed_headlines]

bow_corpus8 = [dictionary8.doc2bow(head) for head in processed_headlines8]

In [112]:
bow_corpus[0]

[(0, 1), (1, 1)]

In [113]:
bow_corpus8[0]

[(0, 1), (1, 1), (2, 1), (3, 1)]
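
To make the `(token_id, count)` pairs concrete, here is a minimal pure-Python sketch of the idea behind `doc2bow`. The helper names are hypothetical. Ids are assigned alphabetically within each new document, which mirrors the ordering seen in the dictionary snippet printed earlier but is otherwise an assumption about Gensim's behaviour.

```python
from collections import Counter

def build_token2id(docs):
    """Assign an integer id to every unique token, document by document."""
    token2id = {}
    for doc in docs:
        for token in sorted(set(doc)):
            token2id.setdefault(token, len(token2id))
    return token2id

def doc2bow_sketch(doc, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [["dame", "game", "winner", "rocket"], ["er", "fall"], ["game", "winner", "game"]]
token2id = build_token2id(docs)
print(token2id)
print(doc2bow_sketch(docs[2], token2id))            # [(1, 2), (3, 1)]
print(doc2bow_sketch(["game", "unseen"], token2id))  # [(1, 1)]
```

Tokens missing from the dictionary are silently skipped, which is why a headline can end up with fewer pairs than it has words once `filter_extremes` has pruned the dictionary.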

## Perform TF-IDF ##
**"Term Frequency, Inverse Document Frequency"**

In [114]:
# Create tf-idf model object using models.TfidfModel on 'bow_corpus'

from gensim import corpora, models

tfidf = models.TfidfModel(bow_corpus)
print(tfidf)

TfidfModel<num_docs=350, num_nnz=622>


In [115]:
# Apply transformation to the entire corpus

corpus_tfidf = tfidf[bow_corpus]
print(corpus_tfidf[0])

[(0, 0.6249377238229042), (1, 0.7806746065698867)]


In [116]:
print(corpus_tfidf[1])

[(2, 1.0)]
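
The weights above can be reproduced in spirit with a short sketch. It assumes the weighting we understand to be Gensim's default: raw term count times `log2(N / df)`, followed by L2 normalization of each document. The function name and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf_sketch(bow_corpus):
    """Weight each (token_id, count) pair by count * log2(N / df), then L2-normalize."""
    num_docs = len(bow_corpus)
    doc_freq = Counter(token_id for doc in bow_corpus for token_id, _ in doc)
    weighted = []
    for doc in bow_corpus:
        raw = [(tid, count * math.log2(num_docs / doc_freq[tid])) for tid, count in doc]
        raw = [(tid, w) for tid, w in raw if w > 0.0]  # terms found in every document vanish
        norm = math.sqrt(sum(w * w for _, w in raw))
        weighted.append([(tid, w / norm) for tid, w in raw])
    return weighted

toy_bow = [[(0, 1), (1, 1)], [(2, 1)], [(1, 1), (2, 2)], [(1, 1)]]
toy_tfidf = tfidf_sketch(toy_bow)
for doc in toy_tfidf:
    print([(tid, round(w, 3)) for tid, w in doc])
```

The normalization step explains the `[(2, 1.0)]` output above: a headline with a single remaining term always gets weight 1.0, whatever the term's rarity.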


## LDA using Bag of Words ##

In [117]:
# Train lda model

lda_model = gensim.models.LdaMulticore(bow_corpus,
                                       num_topics=5,
                                       id2word = dictionary,
                                       passes = 10,
                                       workers=2)

lda_model8 = gensim.models.LdaMulticore(bow_corpus8,
                                       num_topics=4,
                                       id2word = dictionary8,
                                       passes = 10,
                                       workers=2)

In [118]:
# For each topic, we will explore the words occurring in that topic and their relative weights

for idx, topic in lda_model.print_topics(-1):

    print("Topic: {} \nWords: {}".format(idx, topic ))
    print("\n")

Topic: 0 
Words: 0.088*"nba" + 0.070*"nfl" + 0.064*"play" + 0.057*"insid" + 0.056*"week" + 0.029*"new" + 0.029*"contend" + 0.027*"cup" + 0.025*"team" + 0.022*"gianni"


Topic: 1 
Words: 0.063*"nhl" + 0.053*"rank" + 0.047*"basebal" + 0.040*"best" + 0.032*"race" + 0.032*"prospect" + 0.025*"year" + 0.025*"world" + 0.025*"bengal" + 0.025*"adam"


Topic: 2 
Words: 0.069*"loss" + 0.051*"cowboy" + 0.037*"skene" + 0.037*"predict" + 0.036*"score" + 0.031*"game" + 0.030*"texan" + 0.030*"call" + 0.030*"hugh" + 0.030*"yanke"


Topic: 3 
Words: 0.099*"score" + 0.068*"player" + 0.060*"go" + 0.058*"goal" + 0.043*"allen" + 0.035*"power" + 0.035*"jet" + 0.032*"team" + 0.029*"rank" + 0.027*"mlb"


Topic: 4 
Words: 0.069*"win" + 0.061*"team" + 0.059*"game" + 0.052*"trade" + 0.038*"soto" + 0.038*"footbal" + 0.033*"doubl" + 0.029*"winner" + 0.029*"red" + 0.028*"tripl"




In [119]:
for idx, topic in lda_model8.print_topics(-1):

    print("Topic: {} \nWords: {}".format(idx, topic ))
    print("\n")

Topic: 0 
Words: 0.204*"stori" + 0.196*"live" + 0.084*"hous" + 0.079*"trump" + 0.061*"vote" + 0.049*"exclus" + 0.038*"damar" + 0.038*"hamlin" + 0.038*"speaker" + 0.038*"mccarthi"


Topic: 1 
Words: 0.133*"live" + 0.123*"stori" + 0.074*"death" + 0.060*"titan" + 0.059*"chines" + 0.046*"miss" + 0.046*"news" + 0.046*"video" + 0.045*"charg" + 0.045*"hurrican"


Topic: 2 
Words: 0.183*"shoot" + 0.059*"live" + 0.059*"mass" + 0.059*"stori" + 0.059*"indict" + 0.047*"trump" + 0.045*"say" + 0.045*"main" + 0.045*"year" + 0.045*"polic"


Topic: 3 
Words: 0.118*"tropic" + 0.090*"accid" + 0.090*"storm" + 0.089*"hilari" + 0.064*"idalia" + 0.063*"submers" + 0.063*"titan" + 0.061*"forecast" + 0.040*"stori" + 0.035*"go"




## LDA using TF-IDF ##

- Performing TF-IDF on the corpus is sometimes recommended, but it is not necessary for an LDA implementation using the gensim model.

- *Note: According to the author of Gensim, the standard procedure for LDA is to use the bag-of-words model.*

In [120]:
# lda model using corpus_tfidf

lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf,
                                             num_topics=5,
                                             id2word = dictionary,
                                             passes = 10,
                                             workers=2)

In [121]:
# For each topic, we will explore the words occurring in that topic and their relative weights

for idx, topic in lda_model_tfidf.print_topics(-1):

    print("Topic: {} Word: {}".format(idx, topic))
    print("\n")

Topic: 0 Word: 0.063*"trade" + 0.050*"nba" + 0.045*"loss" + 0.043*"goal" + 0.037*"team" + 0.034*"red" + 0.034*"best" + 0.034*"basebal" + 0.030*"quot" + 0.030*"contend"


Topic: 1 Word: 0.103*"play" + 0.060*"go" + 0.054*"insid" + 0.041*"allen" + 0.037*"gianni" + 0.032*"player" + 0.031*"yanke" + 0.028*"nhl" + 0.028*"mixon" + 0.026*"year"


Topic: 2 Word: 0.073*"team" + 0.050*"rank" + 0.047*"coach" + 0.046*"mlb" + 0.042*"new" + 0.038*"power" + 0.036*"cowboy" + 0.035*"call" + 0.033*"mcdavid" + 0.033*"predict"


Topic: 3 Word: 0.088*"score" + 0.069*"win" + 0.063*"game" + 0.048*"nfl" + 0.046*"winner" + 0.045*"week" + 0.027*"award" + 0.026*"randl" + 0.026*"hockey" + 0.025*"texan"


Topic: 4 Word: 0.060*"athlet" + 0.056*"footbal" + 0.044*"tripl" + 0.042*"doubl" + 0.038*"bengal" + 0.038*"soto" + 0.036*"cup" + 0.031*"giant" + 0.031*"world" + 0.030*"chief"




When using TF-IDF, heavier weights are given to less frequent words, so rarer nouns such as player names (for example "mixon", "mcdavid", and "randl" above) are factored into the topics. This can make the categories harder to identify, as such nouns are hard to categorize.

## Modelling Process with scikit-learn ##

In [122]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(combined_df['Headline'])


tfidf_matrix.todense()

matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])

In [136]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=4, random_state=0)

In [137]:
lda.fit(tfidf_matrix)

In [138]:
lda.components_

array([[5.52268095, 0.25029774, 0.76865928, ..., 0.25012553, 0.92340014,
        0.25014546],
       [0.250246  , 0.25030322, 0.71741837, ..., 0.25015631, 0.25037204,
        0.25015946],
       [0.26627348, 0.25711268, 0.25034257, ..., 0.25012075, 1.25433529,
        0.62718599],
       [0.25023918, 0.87374999, 0.25036585, ..., 1.04823272, 0.25035846,
        0.25015923]])

In [139]:
vectorizer.get_feature_names_out()

array(['10', '100', '1000th', ..., 'yd', 'year', 'years'], dtype=object)

In [140]:
for idx, topic in enumerate(lda.components_):
    print(f"Topic {idx}:")
    print([vectorizer.get_feature_names_out()[i] for i in topic.argsort()[:-10 - 1:-1]])
    print("\n")

Topic 0:
['10', 'athletic', 'plays', 'football', 'sasaki', 'soto', 'nba', 'baseball', 'roki', '2024']


Topic 1:
['team', 'allen', 'td', 'vs', 'josh', 'goal', 'nba', 'mlb', 'loss', 'trick']


Topic 2:
['game', 'teams', 'winner', 'rankings', 'nhl', 'week', 'new', 'ot', 'yankees', 'randle']


Topic 3:
['scores', 'vs', 'hughes', 'jack', 'giants', 'bench', 'win', 'roy', 'inside', 'nhl']




In [141]:
def display_topics(model, feature_names, no_top_words):
    topic_dict = {}
    for topic_idx, topic in enumerate(model.components_):
        topic_dict[f"Topic {topic_idx+1}"] = [feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]
    return topic_dict

no_top_words = 10
topics = display_topics(lda, vectorizer.get_feature_names_out(), no_top_words)

# Convert the topics dictionary to a DataFrame for better visualization
topics_df = pd.DataFrame(topics)

In [142]:
topics_df

Unnamed: 0,Topic 1,Topic 2,Topic 3,Topic 4
0,10,team,game,scores
1,athletic,allen,teams,vs
2,plays,td,winner,hughes
3,football,vs,rankings,jack
4,sasaki,josh,nhl,giants
5,soto,goal,week,bench
6,nba,nba,new,win
7,baseball,mlb,ot,roy
8,roki,loss,yankees,inside
9,2024,trick,randle,nhl
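
The slice `topic.argsort()[:-no_top_words - 1:-1]` used above is compact but easy to misread. The sketch below unpacks it on a made-up weight row; the feature names and values are purely illustrative.

```python
import numpy as np

# Made-up row standing in for one row of lda.components_
weights = np.array([0.25, 5.5, 0.9, 1.3, 0.3])
features = np.array(["ballot", "game", "goal", "team", "trade"])
n_top = 3

ascending = weights.argsort()         # indices ordered from smallest to largest weight
top_idx = ascending[:-n_top - 1:-1]   # walk backwards from the end for n_top steps
print(features[top_idx].tolist())     # ['game', 'team', 'goal']

# Dividing a row by its sum turns the pseudo-counts into a word distribution
probs = weights / weights.sum()
print(probs.round(3).tolist())
```

The rows of `components_` are unnormalized pseudo-counts, so the ranking is meaningful within a topic, but the raw values should be normalized before comparing topics.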


## Sample using Sklearn model

In [157]:
# Processed Text of sample 222
sample = processed_headlines[222]
sample

['basebal', 'hof', 'ballot']

In [158]:
combined_df.iloc[222]

Unnamed: 0,222
Headline,2025 Baseball HOF Ballot


In [159]:
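#sklearn method
# Note: `sample` is a list of three stemmed tokens, so transform() treats each token as a separate document
sample_vectorized = vectorizer.transform(sample)

In [160]:
#sklearn method
# The result therefore has one predicted topic per token, not one for the whole headline
sample_topic_distr = lda.transform(sample_vectorized)
sample_topic_distr.argmax(axis=1)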

array([0, 1, 1])
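
The three values above arise because `vectorizer.transform` expects an iterable of documents, so a list of tokens is read as three one-word documents. In addition, the vectorizer was fitted on raw headlines, so stemmed forms such as `'basebal'` may not match its vocabulary. The self-contained sketch below shows the difference on a toy corpus. The headlines and the `toy_` names are made up, and it does not reproduce the notebook's fitted model or its predictions.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for combined_df['Headline']
toy_headlines = [
    "Yankees win baseball game",
    "Baseball hall of fame ballot announced",
    "Cowboys lose football game",
    "Football playoff rankings released",
]
toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_headlines)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(toy_matrix)

headline = "2025 Baseball HOF Ballot"
as_tokens = toy_vectorizer.transform(headline.lower().split())  # four one-word "documents"
as_document = toy_vectorizer.transform([headline])              # one document

print(toy_lda.transform(as_tokens).shape)    # (4, 2)
doc_distr = toy_lda.transform(as_document)
print(doc_distr.shape)                       # (1, 2)
print(doc_distr.argmax(axis=1))              # a single topic index for the headline
```

In the notebook, the equivalent call would be `vectorizer.transform([combined_df['Headline'].iloc[222]])`, which yields one topic distribution for the whole headline.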

## Visualizations (gensim models) ##

### ***CNN Headlines Data Topics***

In [144]:
# CNN data
pyLDAvis.enable_notebook()
panel = pyLDAvis.gensim.prepare(lda_model8, bow_corpus8, dictionary8, mds='mmds')
panel

#### Classification of the topics
* 1: Trump/ Voting/ Exclusives
* 2: Mass Shooting/ Indictments/ Trump
* 3: Live Stories/ Deaths/ Missing People
* 4: Weather Forecast/ Natural disaster warnings

### CNN Headlines Data Sample

In [164]:
# Processed Text of sample 24
sample = processed_headlines8[24]
sample

['mccarthi', 'kick', 'pelosi', 'offic', 'sourc']

In [165]:
df8.iloc[24]

Unnamed: 0,24
Headline,McCarthy behind move to kick Pelosi out of her...


In [166]:
for index, score in sorted(lda_model8[bow_corpus8[24]], key=lambda tup: -1*tup[1]):

    print("\nScore: {}\t \nTopic: {}".format(score, lda_model8.print_topic(index, 10)))


Score: 0.6247275471687317	 
Topic: 0.204*"stori" + 0.196*"live" + 0.084*"hous" + 0.079*"trump" + 0.061*"vote" + 0.049*"exclus" + 0.038*"damar" + 0.038*"hamlin" + 0.038*"speaker" + 0.038*"mccarthi"

Score: 0.12514963746070862	 
Topic: 0.118*"tropic" + 0.090*"accid" + 0.090*"storm" + 0.089*"hilari" + 0.064*"idalia" + 0.063*"submers" + 0.063*"titan" + 0.061*"forecast" + 0.040*"stori" + 0.035*"go"

Score: 0.12506312131881714	 
Topic: 0.183*"shoot" + 0.059*"live" + 0.059*"mass" + 0.059*"stori" + 0.059*"indict" + 0.047*"trump" + 0.045*"say" + 0.045*"main" + 0.045*"year" + 0.045*"polic"

Score: 0.12505973875522614	 
Topic: 0.133*"live" + 0.123*"stori" + 0.074*"death" + 0.060*"titan" + 0.059*"chines" + 0.046*"miss" + 0.046*"news" + 0.046*"video" + 0.045*"charg" + 0.045*"hurrican"
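
The scores above (about 0.625 for one topic and 0.125 for each of the others) follow a recognizable pattern. It matches what we would expect if only one token of the headline survived dictionary filtering and the model uses a symmetric prior of `alpha = 1 / num_topics`. Both points are inferences from the numbers, not something the notebook shows directly. Under those assumptions the expected mixture is `(count_k + alpha) / (N + num_topics * alpha)`:

```python
def expected_topic_mixture(topic_counts, alpha):
    """Posterior-mean topic proportions for given per-topic token counts and a symmetric prior."""
    total = sum(topic_counts) + alpha * len(topic_counts)
    return [(count + alpha) / total for count in topic_counts]

# One surviving token attributed to the first topic, four topics, alpha = 1/4
print(expected_topic_mixture([1, 0, 0, 0], alpha=0.25))
# [0.625, 0.125, 0.125, 0.125]
```

With so few in-dictionary words per headline, the prior dominates, so a score of 0.62 reflects a single matching word rather than strong evidence for the topic.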


### ***Sports Headlines Data Topics***

In [146]:
# Sports data
pyLDAvis.enable_notebook()
panel2 = pyLDAvis.gensim.prepare(lda_model_tfidf, corpus_tfidf, dictionary, mds='mmds')
panel2

#### Classification of the topics
* 1: Game Scores/ Winners/ NFL
* 2: Trades/ Goals / NBA
* 3: Power Ranks/ Coaches/ MLB
* 4: Football/ Athletes
* 5: Best Plays/ Management/ NHL

### Testing model on Newly Created Sports Headline

In [169]:
unseen_headline = "NFL implements New Playoff rules for upcoming football season"

# Data preprocessing step for the unseen document
bow_vector = dictionary.doc2bow(preprocess(unseen_headline))

for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):

    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.5208846926689148	 Topic: 0.088*"nba" + 0.070*"nfl" + 0.064*"play" + 0.057*"insid" + 0.056*"week"
Score: 0.3782939612865448	 Topic: 0.069*"win" + 0.061*"team" + 0.059*"game" + 0.052*"trade" + 0.038*"soto"
Score: 0.033730506896972656	 Topic: 0.099*"score" + 0.068*"player" + 0.060*"go" + 0.058*"goal" + 0.043*"allen"
Score: 0.03369349241256714	 Topic: 0.063*"nhl" + 0.053*"rank" + 0.047*"basebal" + 0.040*"best" + 0.032*"race"
Score: 0.03339732810854912	 Topic: 0.069*"loss" + 0.051*"cowboy" + 0.037*"skene" + 0.037*"predict" + 0.036*"score"


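The bag-of-words model gives the headline its two highest probabilities, about 0.52 and 0.38, for Topics 1 and 5 respectively (printed earlier as Topic 0 and Topic 4, since the printed indices are zero-based).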