# BERTopic-based Subcategory Extraction

This notebook discovers subcategories per main category using BERTopic with strong sentence embeddings and tuned clustering. It also standardizes labels and provides quick evaluations.


In [87]:
# Imports
import os
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np

from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from sklearn.feature_extraction.text import TfidfVectorizer
from umap import UMAP
from hdbscan import HDBSCAN

import torch
from tqdm.auto import tqdm
tqdm.pandas()


In [88]:
# Load data
DATA_PATH = "../data/raw_bbc.csv"
df = pd.read_csv(DATA_PATH)
assert {"Category", "Text"}.issubset(df.columns), "CSV must have Category and Text columns"

# Keep short titles for embedding and topic representation
def make_title(text, max_words=40):
    words = str(text).split()
    return " ".join(words[:max_words])

df["Title"] = df["Text"].apply(make_title)
df.head(3)


Unnamed: 0,Category,Text,Filename,Subcategory,Title
0,business,"Ad sales boost Time Warner profit\n\nQuarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.\n\nThe firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.\n\nTime Warner said on Friday that it now owns 8% of search-engine Google. But its own internet business, AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to sign up AOL's existing customers for high-speed broadband. TimeWarner also has to restate 2000 and 2003 results following a probe by the US Securities Exchange Commission (SEC), which is close to concluding.\n\nTime Warner's fourth quarter profits were slightly better than analysts' expectations. But its film division saw profits slump 27% to $284m, helped by box-office flops Alexander and Catwoman, a sharp contrast to year-earlier, when the third and final film in the Lord of the Rings trilogy boosted results. For the full-year, TimeWarner posted a profit of $3.36bn, up 27% from its 2003 performance, while revenues grew 6.4% to $42.09bn. ""Our financial performance was strong, meeting or exceeding all of our full-year objectives and greatly enhancing our flexibility,"" chairman and chief executive Richard Parsons said. For 2005, TimeWarner is projecting operating earnings growth of around 5%, and also expects higher revenue and wider profit margins.\n\nTimeWarner is to restate its accounts as part of efforts to resolve an inquiry into AOL by US market regulators. It has already offered to pay $300m to settle charges, in a deal that is under review by the SEC. The company said it was unable to estimate the amount it needed to set aside for legal reserves, which it previously set at $500m. It intends to adjust the way it accounts for a deal with German music publisher Bertelsmann's purchase of a stake in AOL Europe, which it had reported as advertising revenue. It will now book the sale of its stake in AOL Europe as a loss on the value of that stake.\n",data/business/001.txt,,"Ad sales boost Time Warner profit Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier. The firm, which is now one of the biggest investors in Google, benefited"
1,business,"Dollar gains on Greenspan speech\n\nThe dollar has hit its highest level against the euro in almost three months after the Federal Reserve head said the US trade deficit is set to stabilise.\n\nAnd Alan Greenspan highlighted the US government's willingness to curb spending and rising household savings as factors which may help to reduce it. In late trading in New York, the dollar reached $1.2871 against the euro, from $1.2974 on Thursday. Market concerns about the deficit has hit the greenback in recent months. On Friday, Federal Reserve chairman Mr Greenspan's speech in London ahead of the meeting of G7 finance ministers sent the dollar higher after it had earlier tumbled on the back of worse-than-expected US jobs data. ""I think the chairman's taking a much more sanguine view on the current account deficit than he's taken for some time,"" said Robert Sinche, head of currency strategy at Bank of America in New York. ""He's taking a longer-term view, laying out a set of conditions under which the current account deficit can improve this year and next.""\n\nWorries about the deficit concerns about China do, however, remain. China's currency remains pegged to the dollar and the US currency's sharp falls in recent months have therefore made Chinese export prices highly competitive. But calls for a shift in Beijing's policy have fallen on deaf ears, despite recent comments in a major Chinese newspaper that the ""time is ripe"" for a loosening of the peg. The G7 meeting is thought unlikely to produce any meaningful movement in Chinese policy. In the meantime, the US Federal Reserve's decision on 2 February to boost interest rates by a quarter of a point - the sixth such move in as many months - has opened up a differential with European rates. The half-point window, some believe, could be enough to keep US assets looking more attractive, and could help prop up the dollar. The recent falls have partly been the result of big budget deficits, as well as the US's yawning current account gap, both of which need to be funded by the buying of US bonds and assets by foreign firms and governments. The White House will announce its budget on Monday, and many commentators believe the deficit will remain at close to half a trillion dollars.\n",data/business/002.txt,,Dollar gains on Greenspan speech The dollar has hit its highest level against the euro in almost three months after the Federal Reserve head said the US trade deficit is set to stabilise. And Alan Greenspan highlighted the US government's
2,business,"Yukos unit buyer faces loan claim\n\nThe owners of embattled Russian oil giant Yukos are to ask the buyer of its former production unit to pay back a $900m (£479m) loan.\n\nState-owned Rosneft bought the Yugansk unit for $9.3bn in a sale forced by Russia to part settle a $27.5bn tax claim against Yukos. Yukos' owner Menatep Group says it will ask Rosneft to repay a loan that Yugansk had secured on its assets. Rosneft already faces a similar $540m repayment demand from foreign banks. Legal experts said Rosneft's purchase of Yugansk would include such obligations. ""The pledged assets are with Rosneft, so it will have to pay real money to the creditors to avoid seizure of Yugansk assets,"" said Moscow-based US lawyer Jamie Firestone, who is not connected to the case. Menatep Group's managing director Tim Osborne told the Reuters news agency: ""If they default, we will fight them where the rule of law exists under the international arbitration clauses of the credit.""\n\nRosneft officials were unavailable for comment. But the company has said it intends to take action against Menatep to recover some of the tax claims and debts owed by Yugansk. Yukos had filed for bankruptcy protection in a US court in an attempt to prevent the forced sale of its main production arm. The sale went ahead in December and Yugansk was sold to a little-known shell company which in turn was bought by Rosneft. Yukos claims its downfall was punishment for the political ambitions of its founder Mikhail Khodorkovsky and has vowed to sue any participant in the sale.\n",data/business/003.txt,,Yukos unit buyer faces loan claim The owners of embattled Russian oil giant Yukos are to ask the buyer of its former production unit to pay back a $900m (£479m) loan. State-owned Rosneft bought the Yugansk unit for $9.3bn in


In [89]:
# Clean text for BERTopic input
import re
import nltk
from nltk.corpus import stopwords

try:
    stop_words = set(stopwords.words("english"))
except LookupError:
    nltk.download("stopwords")
    stop_words = set(stopwords.words("english"))

WORD_RE = re.compile(r"[A-Za-z]+")

def clean_text(text: str, max_words: int = 200) -> str:
    tokens = str(text).lower().split()
    tokens = tokens[:max_words]
    cleaned = []
    for tok in tokens:
        if not WORD_RE.fullmatch(tok):
            continue
        if tok in stop_words:
            continue
        if len(tok) <= 2:
            continue
        cleaned.append(tok)
    return " ".join(cleaned)

# Build cleaned content from full Text
# Use first 200 tokens to balance quality/speed and avoid seq length issues
df["CleanContent"] = df["Text"].apply(lambda x: clean_text(x, max_words=200))
df.head(3)


Unnamed: 0,Category,Text,Filename,Subcategory,Title,CleanContent
0,business,"Ad sales boost Time Warner profit\n\nQuarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.\n\nThe firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.\n\nTime Warner said on Friday that it now owns 8% of search-engine Google. But its own internet business, AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to sign up AOL's existing customers for high-speed broadband. TimeWarner also has to restate 2000 and 2003 results following a probe by the US Securities Exchange Commission (SEC), which is close to concluding.\n\nTime Warner's fourth quarter profits were slightly better than analysts' expectations. But its film division saw profits slump 27% to $284m, helped by box-office flops Alexander and Catwoman, a sharp contrast to year-earlier, when the third and final film in the Lord of the Rings trilogy boosted results. For the full-year, TimeWarner posted a profit of $3.36bn, up 27% from its 2003 performance, while revenues grew 6.4% to $42.09bn. ""Our financial performance was strong, meeting or exceeding all of our full-year objectives and greatly enhancing our flexibility,"" chairman and chief executive Richard Parsons said. For 2005, TimeWarner is projecting operating earnings growth of around 5%, and also expects higher revenue and wider profit margins.\n\nTimeWarner is to restate its accounts as part of efforts to resolve an inquiry into AOL by US market regulators. It has already offered to pay $300m to settle charges, in a deal that is under review by the SEC. The company said it was unable to estimate the amount it needed to set aside for legal reserves, which it previously set at $500m. It intends to adjust the way it accounts for a deal with German music publisher Bertelsmann's purchase of a stake in AOL Europe, which it had reported as advertising revenue. It will now book the sale of its stake in AOL Europe as a loss on the value of that stake.\n",data/business/001.txt,,"Ad sales boost Time Warner profit Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier. The firm, which is now one of the biggest investors in Google, benefited",sales boost time warner profit quarterly profits media giant timewarner jumped three months one biggest investors benefited sales internet connections higher advert timewarner said fourth quarter sales rose profits buoyed gains offset profit dip warner less users time warner said friday owns internet mixed lost subscribers fourth quarter profits lower preceding three company said underlying profit exceptional items rose back stronger internet advertising hopes increase subscribers offering online service free timewarner internet customers try sign existing customers timewarner also restate results following probe securities exchange commission close time fourth quarter profits slightly better
1,business,"Dollar gains on Greenspan speech\n\nThe dollar has hit its highest level against the euro in almost three months after the Federal Reserve head said the US trade deficit is set to stabilise.\n\nAnd Alan Greenspan highlighted the US government's willingness to curb spending and rising household savings as factors which may help to reduce it. In late trading in New York, the dollar reached $1.2871 against the euro, from $1.2974 on Thursday. Market concerns about the deficit has hit the greenback in recent months. On Friday, Federal Reserve chairman Mr Greenspan's speech in London ahead of the meeting of G7 finance ministers sent the dollar higher after it had earlier tumbled on the back of worse-than-expected US jobs data. ""I think the chairman's taking a much more sanguine view on the current account deficit than he's taken for some time,"" said Robert Sinche, head of currency strategy at Bank of America in New York. ""He's taking a longer-term view, laying out a set of conditions under which the current account deficit can improve this year and next.""\n\nWorries about the deficit concerns about China do, however, remain. China's currency remains pegged to the dollar and the US currency's sharp falls in recent months have therefore made Chinese export prices highly competitive. But calls for a shift in Beijing's policy have fallen on deaf ears, despite recent comments in a major Chinese newspaper that the ""time is ripe"" for a loosening of the peg. The G7 meeting is thought unlikely to produce any meaningful movement in Chinese policy. In the meantime, the US Federal Reserve's decision on 2 February to boost interest rates by a quarter of a point - the sixth such move in as many months - has opened up a differential with European rates. The half-point window, some believe, could be enough to keep US assets looking more attractive, and could help prop up the dollar. The recent falls have partly been the result of big budget deficits, as well as the US's yawning current account gap, both of which need to be funded by the buying of US bonds and assets by foreign firms and governments. The White House will announce its budget on Monday, and many commentators believe the deficit will remain at close to half a trillion dollars.\n",data/business/002.txt,,Dollar gains on Greenspan speech The dollar has hit its highest level against the euro in almost three months after the Federal Reserve head said the US trade deficit is set to stabilise. And Alan Greenspan highlighted the US government's,dollar gains greenspan speech dollar hit highest level euro almost three months federal reserve head said trade deficit set alan greenspan highlighted willingness curb spending rising household savings factors may help reduce late trading new dollar reached market concerns deficit hit greenback recent federal reserve chairman speech london ahead meeting finance ministers sent dollar higher earlier tumbled back jobs think taking much sanguine view current account deficit taken said robert head currency strategy bank america new taking laying set conditions current account deficit improve year worries deficit concerns china currency remains pegged dollar sharp
2,business,"Yukos unit buyer faces loan claim\n\nThe owners of embattled Russian oil giant Yukos are to ask the buyer of its former production unit to pay back a $900m (£479m) loan.\n\nState-owned Rosneft bought the Yugansk unit for $9.3bn in a sale forced by Russia to part settle a $27.5bn tax claim against Yukos. Yukos' owner Menatep Group says it will ask Rosneft to repay a loan that Yugansk had secured on its assets. Rosneft already faces a similar $540m repayment demand from foreign banks. Legal experts said Rosneft's purchase of Yugansk would include such obligations. ""The pledged assets are with Rosneft, so it will have to pay real money to the creditors to avoid seizure of Yugansk assets,"" said Moscow-based US lawyer Jamie Firestone, who is not connected to the case. Menatep Group's managing director Tim Osborne told the Reuters news agency: ""If they default, we will fight them where the rule of law exists under the international arbitration clauses of the credit.""\n\nRosneft officials were unavailable for comment. But the company has said it intends to take action against Menatep to recover some of the tax claims and debts owed by Yugansk. Yukos had filed for bankruptcy protection in a US court in an attempt to prevent the forced sale of its main production arm. The sale went ahead in December and Yugansk was sold to a little-known shell company which in turn was bought by Rosneft. Yukos claims its downfall was punishment for the political ambitions of its founder Mikhail Khodorkovsky and has vowed to sue any participant in the sale.\n",data/business/003.txt,,Yukos unit buyer faces loan claim The owners of embattled Russian oil giant Yukos are to ask the buyer of its former production unit to pay back a $900m (£479m) loan. State-owned Rosneft bought the Yugansk unit for $9.3bn in,yukos unit buyer faces loan claim owners embattled russian oil giant yukos ask buyer former production unit pay back rosneft bought yugansk unit sale forced russia part settle tax claim owner menatep group says ask rosneft repay loan yugansk secured rosneft already faces similar repayment demand foreign legal experts said purchase yugansk would include pledged assets pay real money creditors avoid seizure yugansk said lawyer jamie connected menatep managing director tim osborne told reuters news fight rule law exists international arbitration clauses rosneft officials unavailable company said intends take action menatep recover tax claims debts owed yukos filed bankruptcy protection


In [93]:
df.shape

(2225, 6)

In [94]:
# to see the whole dataframe
pd.set_option('display.max_colwidth', None)


In [97]:
# Filter only "business" category and select required columns
filtered = df[df["Category"] == "politics"][["Category", "CleanContent"]]

# Show the result
print(filtered.head(3))




     Category  \
896  politics   
897  politics   
898  politics   

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  CleanContent  
896                                                                                                                                                           labour plans materni

In [None]:
print()

0       False
1       False
2       False
3       False
4       False
        ...  
2220    False
2221    False
2222    False
2223    False
2224    False
Name: is_sports, Length: 2225, dtype: bool


In [None]:
# Embedding model selection and device
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_NAME = "intfloat/multilingual-e5-large"

embedder = SentenceTransformer(MODEL_NAME, device=DEVICE)
print("Device:", DEVICE)


Device: cuda


In [None]:
# Encode cleaned content in batches
BATCH_SIZE = 64
corpus = df["CleanContent"].tolist()
embeddings = embedder.encode(corpus, batch_size=BATCH_SIZE, show_progress_bar=True, normalize_embeddings=True)
len(embeddings)


Batches: 100%|██████████| 35/35 [00:44<00:00,  1.27s/it]


2225

In [None]:
# Category-specific stopwords and bigram vectorizer
SPORTS_STOPWORDS = {
    "england","united","chelsea","arsenal","liverpool","manchester","wenger","ferguson","rooney","premier","cup","league"
}
POLITICS_STOPWORDS = {
    "blair","brown","howard","labour","labour's","tory","tories","conservative","conservatives","minister","government","tony","gordon"
}

# Global baseline stopwords to reduce country/person dominance across all
GLOBAL_EXTRA = {"england","scotland","wales","britain","uk","bbc"}

custom_stopwords = SPORTS_STOPWORDS | POLITICS_STOPWORDS | GLOBAL_EXTRA

vectorizer = TfidfVectorizer(
    ngram_range=(1,2),
    lowercase=True,
    stop_words=list(custom_stopwords),
    min_df=.2,
    max_df=.8)


In [None]:
# Update representations using custom vectorizer + KeyBERT-inspired
# This will recompute topic words without refitting the clustering

try:
    topic_model.set_representations(representation_model=keybert_repr, vectorizer_model=vectorizer)
    rep = topic_model.get_topic_info()
    print("Updated topic representations.")
except Exception as e:
    print("Failed to update representations:", e)


Failed to update representations: 'BERTopic' object has no attribute 'set_representations'


In [None]:
# UMAP and HDBSCAN configuration
umap_model = UMAP(n_neighbors=25, n_components=5, min_dist=0.0, metric="cosine", random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=20, min_samples=5, metric="euclidean", cluster_selection_epsilon=0.0, cluster_selection_method="eom")

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    calculate_probabilities=False,
    verbose=True,
    nr_topics=None,
    min_topic_size=20,
    top_n_words=10
)


In [None]:
# Fit BERTopic globally, then map within each main category if needed
# Global fit tends to find coherent cross-category subtopics; we will post-label per-category

topics, probs = topic_model.fit_transform(docs_for_topics, embeddings)
df["Topic"] = topics

# Replace -1 with a placeholder
UNKNOWN_LABEL = "Miscellaneous"
df["Topic"] = df["Topic"].fillna(-1)



2025-09-30 15:50:11,088 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm


2025-09-30 15:50:26,212 - BERTopic - Dimensionality - Completed ✓
2025-09-30 15:50:26,213 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-09-30 15:50:26,315 - BERTopic - Cluster - Completed ✓
2025-09-30 15:50:26,327 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-09-30 15:50:26,733 - BERTopic - Representation - Completed ✓


In [None]:
# Generate human-readable labels for topics
# Use BERTopic's built-in topic representations

rep = topic_model.get_topic_info()
rep.head(10)


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,155,-1_people_digital_technology_computer,"[people, digital, technology, computer, said, use, new, could, music, one]","[horizon super wireless data networks could soon use wireless watchdog seeking help best way regulate technology behind networks called ultra wideband ofcom wants ensure arrival devices cause problems already use part radio uwb makes possible stream huge amounts data air short one likely uses uwb make possible send dvd quality video images wirelessly screens let people beam music media players around technology potential transmit hundreds megabits data per uwb could also used create personal area networks let gadgets quickly easily swap data amongst technology works range metres uses billions short radio pulses every second carry recent consumer electronics show las vegas products uwb chips got first public use uwb, speak easy plan media players music film fans able control digital media players speaking plans development two scansoft gracenote developing technology give people access film music libraries simply voice want give people access digital music films home huge media libraries players make finding single songs unlocks potential devices store large digital music said ross vice president business development applications radically change car entertainment allowing drivers enjoy entire music collections without ever taking hands steering gracenote provides music library information millions different albums jukeboxes new technology designed people play individual song movie saying users also able request music fits mood film saying, speak easy plan media players music film fans able control digital media players speaking plans development two scansoft gracenote developing technology give people access film music libraries simply voice want give people access digital music films home huge media libraries players make finding single songs unlocks potential devices store large digital music said ross vice president business development applications radically change car entertainment allowing drivers enjoy entire music collections without ever taking hands steering gracenote provides music library information millions different albums jukeboxes new technology designed people play individual song movie saying users also able request music fits mood film saying]"
1,0,474,0_said_growth_oil_sales,"[said, growth, oil, sales, firm, shares, company, market, bank, economy]","[sees world economy behemoth general electric posted jump quarterly declared great benefiting growth initiatives excellent global said chief executive jeff biggest firm based stock market net profits final three months sales came whose businesses range jet engines nbc television forecast sustained growth year shares rose news ending friday industries financial industrial sectors picking said steve analyst fund manager matrix asset shares said orders fourth quarter higher period growth across fourth nine businesses delivered least earnings said full year gains less still net profit, barclays profits hit record level seen annual profits climb record levels boosted sharp rise business investment profits year december rose chief john varley said bank strong world earnings barclays capital investment bank rose investment branch operations held back growth retail group first five big banks report according biggest bank stock market report profits later barclays results line market global investors wing made jump profits barclaycard rose said affected series interest rate rises investment grow customer bank also blamed margins pressure mortgage business spending branches past year fall profits retail division outlook, irish markets reach high irish shares risen record investors persuaded buy market low inflation strong growth iseq index leading shares closed points fuelled strong growth banking financial fall rate inflation january gave fresh boost shares advanced economy set strong growth interest rates remain several biggest companies saw market value hit recent highs allied irish biggest company touched five year peak bank ireland shares rose highest level since august telecoms firm recently revealed would irish mobile phone hit yearly analysts said economic conditions benign irish shares still trading discount european ticks boxes far international investors roy chief investment officer hibernian investment told economic conditions set continue ireland]"
2,1,442,1_said_would_labour_blair,"[said, would, labour, blair, government, election, party, minister, tory, brown]","[blair backs tony blair backed chancellor gordon report amid opposition claims bullish state speech prime minister said report reinforced stability would central next election planning already well brown earlier denied economic forecasts optimistic refused rule future tax told bbc radio today politician make mistake john major colleagues made saying matter circumstances make sorts guarantees every individual politicians would responsible brown insisted spending plans could afford optimistic britain economy house prices blair praised chancellor role creating economic said speech napier said labour would publish next months, tories cut number conservative party would cut number mps tory leader michael howard plan forms part government unveiled later howard told sunday times party would also reduce number government special said referendum would held wales decide whether scrap welsh changes would take place within five years conservatives winning general howard told precise number mps would depend result welsh would probably mean reduction around current total wales decided keep assembly would stand lose howard said parties planned cut number civil servants whitehall labour tories almost accept similar drop well saying government departments, labour battle plan tories accused tony blair scrutiny labour unveiled details fight next general break party ditch battle bus daily press briefings instead blair travel key cities marginal seats deliver labour election chief alan milburn denied party trying prime promised positive upbeat election campaign labour ever tory liam fox said plans showed blair facing proper time british people looking accountability government turns back abandoning plans tour country scared face journalists press conference rather beg got general election widely expected next may parties stepping campaign milburn said economy would take centre stage campaign would election stand]"
3,2,351,2_film_best_show_also,"[film, best, show, also, star, said, actor, one, number, music]","[ray dvd beats box office takings film biopic ray surpassed box office takings combined tally dvd video sales success dvd outstripped box office earning first day release ray nominated six oscar categories including best film best actor jamie film recounts life blues singer ray died first week home entertainment release film number one selling limited edition version coming number sony horror film starring michelle second jennifer lopez richard romantic comedy shall number critically acclaimed performance ray already earned screen actors guild award best well prestigious golden ray director taylor responsible classic film officer also received oscar nomination best director three oscar nominations, critics laud comedy sideways road trip comedy sideways praise heaped two adding honours already picked chicago film critics association named winner five categories including best film best actor paul director award went clint eastwood million dollar southeastern film critics also awarded sideways best film year director alexander payne named best also best screenplay shared jim cfca awarded thomas haden church best supporting actor prize virginia madsen best supporting actress award roles sideways already voted best film critics associations new york los angeles nominated golden british actress imelda staunton cfca best actress gritty abortion drama vera adding growing list awards performance mike leigh scrubs star zach braff named best new director debut garden michael controversial documentary fahrenheit best, wine comedy wins award quirky comedy sideways named best film year los angeles film critics movie also picked four accolades including best director alexander payne supporting actor thomas haden british actress imelda staunton recognised role vera winning best liam neeson best actor awards handed january ceremony las sideways tells story two men take road trip wine regions also stars paul virginia madsen also named best supporting actress performance house flying directed yimou named best foreign language animation award went categories also named clint million dollar baby missing best film best director martin scorsese career achievement award handed veteran actor comic jerry lewis ceremony next]"
4,3,333,3_england_six_game_side,"[england, six, game, side, club, cup, coach, rugby, win, chelsea]","[mourinho defiant chelsea form chelsea boss jose mourinho insisted sir alex ferguson arsene wenger would swap places side knocked cup newcastle last sunday seeing barcelona secure champions league lead nou denied club suffering dip form league rivals arsenal manchester united could cannot speak blips better position mourinho want change positions top league nine points carling cup thing say better position champions league three teams either one team best position still mourinho said important keep results try put pressure never lost one important game week newcastle, robinson answers critics england captain jason robinson rubbished suggestions world champions team england beaten wales six nations opener cardiff last week face current champions france twickenham robinson certainly lose one game make bad doubt players still got team beat anyone england find striving avoid third successive championship defeat first time since robinson believes england team stop rot weekend lose two points sure play well week get win proved autumn put excellent performances need build disappointing start wales might certainly, robinson six nations england captain jason robinson miss rest six nations captain absence jonny due lead england final two games italy sale pulled squad wednesday torn ligament right undergo operation friday england yet name replacement robinson disappointing means miss last two games six nations twickenham two games sale looking back playing early robinson picked injury defeat ireland lansdowne road coach andy robinson hugely disappointed england captain immense figure autumn internationals six leading example look forward back england announcement latest setback among key figures already missing]"
5,4,87,4_open_seed_australian_roddick,"[open, seed, australian, roddick, second, win, first, beat, federer, cup]","[serena becomes world number two serena williams moved five places second world rankings australian open williams first grand slam title since victory lindsay world number champion marat safin remains fourth atp rankings beaten finalist lleyton hewitt replaces andy roddick world number roger federer retains top safin overtaken hewitt become new leader champions alicia lost thriller davenport top first time rise means australia player top rankings first time elena qualified reached third risen world leap places highest ranking, federer joins greats last year seen one player dominate one country dominate roger federer became first man since mats wilander win three grand slams one anastasia myskina became first russian woman win grand slam french two followed wimbledon briton tim henman enjoyed best greg rusedski fought back superbly federer began year world number one holder wimbledon masters cup set conquering new swiss sounded warning dominance come australian ripped draw beating marat safin andy roddick player put real resistance performance lleyton hewitt open final federer got better hewitt masters victory houston proved successive win open era major loss gustavo kuerten, henman face saulnier test british number one tim henman face cyril saulnier first round next australian greg british number quarter draw could face andy roddick second round beats swede jonas local favourite lleyton hewitt meet arnaud defending champion world number one roger federer faces fabrice top seed lindsay davenport drew spanish veteran conchita henman came two sets defeat saulnier first round french open last knows faces tough test seventh never gone beyond first major lined meet roddick last looking forward tough player got lot really tight one paris went way going need play well outset dangerous seeded hot favourite three]"
6,5,69,5_security_users_microsoft_virus,"[security, users, microsoft, virus, windows, software, spam, malicious, viruses, people]","[microsoft releases patches microsoft warned users update systems latest security fixes flaws windows monthly security flagged eight security holes could leave pcs open attack left number holes considered affect windows including internet explorer media player instant four important fixes also considered less either automatically users running programs could vulnerable viruses malicious attacks designed exploit many flaws could used virus writers take computers install delete see one critical patches microsoft made available important one fixes stephen microsoft security said flaws known although firm seen attacks exploiting rule critical flaw spates viruses follow home users businesses leave flaw, microsoft makes move microsoft says clamping people running pirated versions windows operating system restricting access security windows genuine advantage scheme means people prove software genuine still allow unauthorised copies get crucial security fixes via automatic options would microsoft releases regular security updates software protect either pcs detect updates automatically users manually download fixes running pirated windows programs would access downloads software giant people try manually download security patches let microsoft run automated checking procedure computer give identification regular patches releases security flaws important stop viruses threats penetrating security experts concerned restricting access patches could mean rise attacks pcs left graham senior consultant security firm told bbc news website, microsoft makes move microsoft says clamping people running pirated versions windows operating system restricting access security windows genuine advantage scheme means people prove software genuine still allow unauthorised copies get crucial security fixes via automatic options would microsoft releases regular security updates software protect either pcs detect updates automatically users manually download fixes running pirated windows programs would access downloads software giant people try manually download security patches let microsoft run automated checking procedure computer give identification regular patches releases security flaws important stop viruses threats penetrating security experts concerned restricting access patches could mean rise attacks pcs left graham senior consultant security firm told bbc news website]"
7,6,68,6_olympic_indoor_race_world,"[olympic, indoor, race, world, champion, holmes, european, record, radcliffe, athletics]","[bekele sets sights world mark olympic champion kenenisa bekele determined add world indoor two mile record norwich union grand prix chasing record held compatriot mentor haile set mark meeting still hungry much said aiming two mile world record birmingham next current record stands eight bekele stranger overhauling world marks national indoor ethiopian broke world indoor record debut meeting last compatriots mulugeta abiyote abate markos world indoor bronze medallist race bekele meet already attracted crop olympic champion kelly holmes taking part swedish heptathlon gold medallist carolina kluft contest relay gold medallists jason gardener mark, collins compete birmingham world commonwealth champion kim collins compete norwich union grand prix birmingham kitts nevis star joins british olympic relay gold medallists jason gardener mark sydney olympic champion world indoor record holder maurice greene athens olympic silver medallist francis obikwelu also take collins ran birmingham world indoor looking forward competing strong got great reception form crowd nia world indoor silver medal really exciting return world champion says good shape underestimating home gardener mark olympic gold medallists sure aiming win front home looking forward competing best sprinters sure metres one exciting races collins sixth olympic final athens hoping, holmes back form birmingham double olympic champion kelly holmes back best comfortably norwich union birmingham indoor grand running second competitive race shook rust win two still undecided competing european championships madrid probably entered make mind last training gone well expected got two weeks need take time make sure feel good felt good crowd behind feel like american eventual winner race almost ended three athletes disqualified false including mark first man guilty coming blocks world champion kim collins clinched second spot ahead world record holder training partner maurice jason unbeaten run came end]"
8,7,61,7_games_game_gaming_nintendo,"[games, game, gaming, nintendo, video, playstation, gamers, console, sony, handheld]","[sony psp console hits march gamers able buy playstation portable news europe handheld console sale first million sold come disc format sony billed machine walkman century sold units console play movies music also offers support wireless sony entering market dominated nintendo many launched handheld japan last year sold million sony said wanted launch psp europe roughly time gamers fear launch put nintendo said release europe gaming gaming entertainment said kaz president sony computer entertainment, nintendo handheld given euro date new handheld launch europe company portable games features retail nintendo said games would available prices ranging million consoles sold since first appeared japan end rival sony said launch first handheld europe end psp expected compete large part handheld despite assertion machines aimed different games available european launch date include super mario well titles developers rayman games development new nintendo backwards compatible game boy allowing earlier back catalogue games wireless link, sony psp console hits march gamers able buy playstation portable news europe handheld console sale first million sold come disc format sony billed machine walkman century sold units console play movies music also offers support wireless sony entering market dominated nintendo many launched handheld japan last year sold million sony said wanted launch psp europe roughly time gamers fear launch put nintendo said release europe gaming gaming entertainment said kaz president sony computer entertainment]"
9,8,37,8_music_apple_industry_bittorrent,"[music, apple, industry, bittorrent, legal, sales, piracy, dvd, digital, chart]","[hollywood sue net film pirates movie industry launched legal action sue people facilitate illegal film motion picture association america wants stop people using program bittorrent swap industry targeting people run websites provide information internet links movies copied filmed server operators targeted actions launched mpaa suits filed users programs edonkey directconnect united united finland mpaa bittorrent users download movies following link files found websites called unlike programs bittorrent works sharing could anything legitimate digital photo copied among multiple users movie industry hopes suing people run trackers cut bittorrent users illegal movies last month major film studios started legal action individuals swapping films growth, ready new network legal attacks websites help people swap pirated films forced development system could harder shut one site behind success bittorrent system producing software avoids pitfalls earlier test version new exeem program released late doubts remain new networks ability ensure files swapped late december movie studios launched legal campaign websites helped people swap pirated movies using bittorrent legal campaign worked way bittorrent system relies links called point users others happy share file looking shutting sites listed trackers crippled bittorrent one sites shut legal campaign helped boost popularity bittorrent system checking trackers led movies programmes claimed man behind goes nickname preparing release software new, ready new network legal attacks websites help people swap pirated films forced development system could harder shut one site behind success bittorrent system producing software avoids pitfalls earlier test version new exeem program released late doubts remain new networks ability ensure files swapped late december movie studios launched legal campaign websites helped people swap pirated movies using bittorrent legal campaign worked way bittorrent system relies links called point users others happy share file looking shutting sites listed trackers crippled bittorrent one sites shut legal campaign helped boost popularity bittorrent system checking trackers led movies programmes claimed man behind goes nickname preparing release software new]"


In [None]:
# Refresh representations and rebuild labels with sports/politics consolidation
# 1) Update topic representations now that the model is fitted
try:
    topic_model.set_representations(representation_model=keybert_repr, vectorizer_model=vectorizer)
    rep = topic_model.get_topic_info()
    print("Updated topic representations with bigrams and custom stopwords.")
except Exception as e:
    print("Failed to update representations:", e)

# 2) Build a label map from top words

def build_label_from_words(words, max_terms=3):
    if not words:
        return UNKNOWN_LABEL
    terms = [w for w, _ in words[:max_terms]] if isinstance(words[0], tuple) else words[:max_terms]
    return ", ".join(terms)

# topic -> label
topic_to_label = {}
for topic_id in rep["Topic"].tolist():
    if topic_id == -1:
        topic_to_label[topic_id] = UNKNOWN_LABEL
        continue
    words = topic_model.get_topic(topic_id) or []
    topic_to_label[topic_id] = build_label_from_words(words, max_terms=3)

# 3) Expanded taxonomy mapping for Sports and Politics
CANON = {
    # Sports variants
    "football": "Football",
    "soccer": "Football",
    "premier": "Football",
    "champions league": "Football",
    "fa cup": "Football",
    "cricket": "Cricket",
    "tennis": "Tennis",
    "rugby": "Rugby",
    "olympic": "Olympics",
    "grand slam": "Tennis",
    # Entertainment
    "film": "Movies",
    "movie": "Movies",
    "cinema": "Movies",
    # Business/Economy
    "rate": "Interest Rates",
    "inflation": "Inflation",
    "gdp": "Economy",
    # Politics
    "election": "Elections",
    "vote": "Elections",
    "parliament": "Parliament",
    "policy": "Public Policy",
    "budget": "Budget",
}

def consolidate(label: str) -> str:
    if not label or label == UNKNOWN_LABEL:
        return UNKNOWN_LABEL
    lower = label.lower()
    for key, canon in CANON.items():
        if key in lower:
            return canon
    return label.title()

# Assign labels
labels = [consolidate(topic_to_label.get(t, UNKNOWN_LABEL)) for t in df["Topic"]]
df["Subcategory"] = labels

df.head(5)


Failed to update representations: 'BERTopic' object has no attribute 'set_representations'


Unnamed: 0,Category,Text,Filename,Subcategory,Title,CleanContent,Topic
0,business,"Ad sales boost Time Warner profit\n\nQuarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.\n\nThe firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.\n\nTime Warner said on Friday that it now owns 8% of search-engine Google. But its own internet business, AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to sign up AOL's existing customers for high-speed broadband. TimeWarner also has to restate 2000 and 2003 results following a probe by the US Securities Exchange Commission (SEC), which is close to concluding.\n\nTime Warner's fourth quarter profits were slightly better than analysts' expectations. But its film division saw profits slump 27% to $284m, helped by box-office flops Alexander and Catwoman, a sharp contrast to year-earlier, when the third and final film in the Lord of the Rings trilogy boosted results. For the full-year, TimeWarner posted a profit of $3.36bn, up 27% from its 2003 performance, while revenues grew 6.4% to $42.09bn. ""Our financial performance was strong, meeting or exceeding all of our full-year objectives and greatly enhancing our flexibility,"" chairman and chief executive Richard Parsons said. For 2005, TimeWarner is projecting operating earnings growth of around 5%, and also expects higher revenue and wider profit margins.\n\nTimeWarner is to restate its accounts as part of efforts to resolve an inquiry into AOL by US market regulators. It has already offered to pay $300m to settle charges, in a deal that is under review by the SEC. The company said it was unable to estimate the amount it needed to set aside for legal reserves, which it previously set at $500m. It intends to adjust the way it accounts for a deal with German music publisher Bertelsmann's purchase of a stake in AOL Europe, which it had reported as advertising revenue. It will now book the sale of its stake in AOL Europe as a loss on the value of that stake.\n",data/business/001.txt,"Said, Growth, Oil","Ad sales boost Time Warner profit Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier. The firm, which is now one of the biggest investors in Google, benefited",sales boost time warner profit quarterly profits media giant timewarner jumped three months one biggest investors benefited sales internet connections higher advert timewarner said fourth quarter sales rose profits buoyed gains offset profit dip warner less users time warner said friday owns internet mixed lost subscribers fourth quarter profits lower preceding three company said underlying profit exceptional items rose back stronger internet advertising hopes increase subscribers offering online service free timewarner internet customers try sign existing customers timewarner also restate results following probe securities exchange commission close time fourth quarter profits slightly better,0
1,business,"Dollar gains on Greenspan speech\n\nThe dollar has hit its highest level against the euro in almost three months after the Federal Reserve head said the US trade deficit is set to stabilise.\n\nAnd Alan Greenspan highlighted the US government's willingness to curb spending and rising household savings as factors which may help to reduce it. In late trading in New York, the dollar reached $1.2871 against the euro, from $1.2974 on Thursday. Market concerns about the deficit has hit the greenback in recent months. On Friday, Federal Reserve chairman Mr Greenspan's speech in London ahead of the meeting of G7 finance ministers sent the dollar higher after it had earlier tumbled on the back of worse-than-expected US jobs data. ""I think the chairman's taking a much more sanguine view on the current account deficit than he's taken for some time,"" said Robert Sinche, head of currency strategy at Bank of America in New York. ""He's taking a longer-term view, laying out a set of conditions under which the current account deficit can improve this year and next.""\n\nWorries about the deficit concerns about China do, however, remain. China's currency remains pegged to the dollar and the US currency's sharp falls in recent months have therefore made Chinese export prices highly competitive. But calls for a shift in Beijing's policy have fallen on deaf ears, despite recent comments in a major Chinese newspaper that the ""time is ripe"" for a loosening of the peg. The G7 meeting is thought unlikely to produce any meaningful movement in Chinese policy. In the meantime, the US Federal Reserve's decision on 2 February to boost interest rates by a quarter of a point - the sixth such move in as many months - has opened up a differential with European rates. The half-point window, some believe, could be enough to keep US assets looking more attractive, and could help prop up the dollar. The recent falls have partly been the result of big budget deficits, as well as the US's yawning current account gap, both of which need to be funded by the buying of US bonds and assets by foreign firms and governments. The White House will announce its budget on Monday, and many commentators believe the deficit will remain at close to half a trillion dollars.\n",data/business/002.txt,"Said, Growth, Oil",Dollar gains on Greenspan speech The dollar has hit its highest level against the euro in almost three months after the Federal Reserve head said the US trade deficit is set to stabilise. And Alan Greenspan highlighted the US government's,dollar gains greenspan speech dollar hit highest level euro almost three months federal reserve head said trade deficit set alan greenspan highlighted willingness curb spending rising household savings factors may help reduce late trading new dollar reached market concerns deficit hit greenback recent federal reserve chairman speech london ahead meeting finance ministers sent dollar higher earlier tumbled back jobs think taking much sanguine view current account deficit taken said robert head currency strategy bank america new taking laying set conditions current account deficit improve year worries deficit concerns china currency remains pegged dollar sharp,0
2,business,"Yukos unit buyer faces loan claim\n\nThe owners of embattled Russian oil giant Yukos are to ask the buyer of its former production unit to pay back a $900m (£479m) loan.\n\nState-owned Rosneft bought the Yugansk unit for $9.3bn in a sale forced by Russia to part settle a $27.5bn tax claim against Yukos. Yukos' owner Menatep Group says it will ask Rosneft to repay a loan that Yugansk had secured on its assets. Rosneft already faces a similar $540m repayment demand from foreign banks. Legal experts said Rosneft's purchase of Yugansk would include such obligations. ""The pledged assets are with Rosneft, so it will have to pay real money to the creditors to avoid seizure of Yugansk assets,"" said Moscow-based US lawyer Jamie Firestone, who is not connected to the case. Menatep Group's managing director Tim Osborne told the Reuters news agency: ""If they default, we will fight them where the rule of law exists under the international arbitration clauses of the credit.""\n\nRosneft officials were unavailable for comment. But the company has said it intends to take action against Menatep to recover some of the tax claims and debts owed by Yugansk. Yukos had filed for bankruptcy protection in a US court in an attempt to prevent the forced sale of its main production arm. The sale went ahead in December and Yugansk was sold to a little-known shell company which in turn was bought by Rosneft. Yukos claims its downfall was punishment for the political ambitions of its founder Mikhail Khodorkovsky and has vowed to sue any participant in the sale.\n",data/business/003.txt,"Said, Growth, Oil",Yukos unit buyer faces loan claim The owners of embattled Russian oil giant Yukos are to ask the buyer of its former production unit to pay back a $900m (£479m) loan. State-owned Rosneft bought the Yugansk unit for $9.3bn in,yukos unit buyer faces loan claim owners embattled russian oil giant yukos ask buyer former production unit pay back rosneft bought yugansk unit sale forced russia part settle tax claim owner menatep group says ask rosneft repay loan yugansk secured rosneft already faces similar repayment demand foreign legal experts said purchase yugansk would include pledged assets pay real money creditors avoid seizure yugansk said lawyer jamie connected menatep managing director tim osborne told reuters news fight rule law exists international arbitration clauses rosneft officials unavailable company said intends take action menatep recover tax claims debts owed yukos filed bankruptcy protection,0
3,business,"High fuel prices hit BA's profits\n\nBritish Airways has blamed high fuel prices for a 40% drop in profits.\n\nReporting its results for the three months to 31 December 2004, the airline made a pre-tax profit of £75m ($141m) compared with £125m a year earlier. Rod Eddington, BA's chief executive, said the results were ""respectable"" in a third quarter when fuel costs rose by £106m or 47.3%. BA's profits were still better than market expectation of £59m, and it expects a rise in full-year revenues.\n\nTo help offset the increased price of aviation fuel, BA last year introduced a fuel surcharge for passengers.\n\nIn October, it increased this from £6 to £10 one-way for all long-haul flights, while the short-haul surcharge was raised from £2.50 to £4 a leg. Yet aviation analyst Mike Powell of Dresdner Kleinwort Wasserstein says BA's estimated annual surcharge revenues - £160m - will still be way short of its additional fuel costs - a predicted extra £250m. Turnover for the quarter was up 4.3% to £1.97bn, further benefiting from a rise in cargo revenue. Looking ahead to its full year results to March 2005, BA warned that yields - average revenues per passenger - were expected to decline as it continues to lower prices in the face of competition from low-cost carriers. However, it said sales would be better than previously forecast. ""For the year to March 2005, the total revenue outlook is slightly better than previous guidance with a 3% to 3.5% improvement anticipated,"" BA chairman Martin Broughton said. BA had previously forecast a 2% to 3% rise in full-year revenue.\n\nIt also reported on Friday that passenger numbers rose 8.1% in January. Aviation analyst Nick Van den Brul of BNP Paribas described BA's latest quarterly results as ""pretty modest"". ""It is quite good on the revenue side and it shows the impact of fuel surcharges and a positive cargo development, however, operating margins down and cost impact of fuel are very strong,"" he said. Since the 11 September 2001 attacks in the United States, BA has cut 13,000 jobs as part of a major cost-cutting drive. ""Our focus remains on reducing controllable costs and debt whilst continuing to invest in our products,"" Mr Eddington said. ""For example, we have taken delivery of six Airbus A321 aircraft and next month we will start further improvements to our Club World flat beds."" BA's shares closed up four pence at 274.5 pence.\n",data/business/004.txt,"Said, Growth, Oil","High fuel prices hit BA's profits British Airways has blamed high fuel prices for a 40% drop in profits. Reporting its results for the three months to 31 December 2004, the airline made a pre-tax profit of £75m ($141m) compared",high fuel prices hit profits british airways blamed high fuel prices drop reporting results three months december airline made profit compared year rod chief said results third quarter fuel costs rose profits still better market expectation expects rise help offset increased price aviation last year introduced fuel surcharge increased surcharge raised yet aviation analyst mike powell dresdner kleinwort wasserstein says estimated annual surcharge revenues still way short additional fuel costs predicted extra turnover quarter benefiting rise cargo looking ahead full year results march warned yields average revenues per passenger,0
4,business,"Pernod takeover talk lifts Domecq\n\nShares in UK drinks and food firm Allied Domecq have risen on speculation that it could be the target of a takeover by France's Pernod Ricard.\n\nReports in the Wall Street Journal and the Financial Times suggested that the French spirits firm is considering a bid, but has yet to contact its target. Allied Domecq shares in London rose 4% by 1200 GMT, while Pernod shares in Paris slipped 1.2%. Pernod said it was seeking acquisitions but refused to comment on specifics.\n\nPernod's last major purchase was a third of US giant Seagram in 2000, the move which propelled it into the global top three of drinks firms. The other two-thirds of Seagram was bought by market leader Diageo. In terms of market value, Pernod - at 7.5bn euros ($9.7bn) - is about 9% smaller than Allied Domecq, which has a capitalisation of £5.7bn ($10.7bn; 8.2bn euros). Last year Pernod tried to buy Glenmorangie, one of Scotland's premier whisky firms, but lost out to luxury goods firm LVMH. Pernod is home to brands including Chivas Regal Scotch whisky, Havana Club rum and Jacob's Creek wine. Allied Domecq's big names include Malibu rum, Courvoisier brandy, Stolichnaya vodka and Ballantine's whisky - as well as snack food chains such as Dunkin' Donuts and Baskin-Robbins ice cream. The WSJ said that the two were ripe for consolidation, having each dealt with problematic parts of their portfolio. Pernod has reduced the debt it took on to fund the Seagram purchase to just 1.8bn euros, while Allied has improved the performance of its fast-food chains.\n",data/business/005.txt,"Said, Growth, Oil",Pernod takeover talk lifts Domecq Shares in UK drinks and food firm Allied Domecq have risen on speculation that it could be the target of a takeover by France's Pernod Ricard. Reports in the Wall Street Journal and the Financial,pernod takeover talk lifts domecq shares drinks food firm allied domecq risen speculation could target takeover pernod reports wall street journal financial times suggested french spirits firm considering yet contact allied domecq shares london rose pernod shares paris slipped pernod said seeking acquisitions refused comment last major purchase third giant seagram move propelled global top three drinks seagram bought market leader terms market pernod euros smaller allied capitalisation last year pernod tried buy one premier whisky lost luxury goods firm pernod home brands including chivas regal scotch havana club rum creek allied big names include malibu courvoisier stolichnaya,0


In [None]:
# Evaluation: top subcategories per main category (after consolidation)
view = (
    df.groupby(["Category", "Subcategory"])  # type: ignore
      .size()
      .reset_index(name="Count")
      .sort_values(["Category", "Count"], ascending=[True, False])
)
view.head(20)


Unnamed: 0,Category,Subcategory,Count
7,business,"Said, Growth, Oil",464
8,business,"Said, Would, Labour",23
1,business,"England, Six, Game",7
3,business,Miscellaneous,5
10,business,"Security, Users, Microsoft",3
0,business,"Broadband, Net, Service",2
2,business,"Games, Game, Gaming",2
4,business,"Mobile, Phone, Phones",1
5,business,Movies,1
6,business,"Music, Apple, Industry",1


In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

def clean_label(label):
    if pd.isna(label) or label == "":
        return "Miscellaneous"
    # split words by commas or underscores
    tokens = [t.strip() for t in label.replace("_", " ").split(",")]
    # filter stopwords and short tokens
    tokens = [t for t in tokens if t.lower() not in stop_words and len(t) > 2]
    # return clean label
    return ", ".join(sorted(set(tokens))) if tokens else "Miscellaneous"

# clean all labels
df["Subcategory"] = df["Subcategory"].apply(clean_label)



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Cyrus\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# Save outputs (robust to missing parquet engines)
OUT_CSV = "bbc_with_subcategories_bertopic.csv"
OUT_PARQUET = "bbc_with_subcategories_bertopic.parquet"

cols = ["Category", "Text", "Topic", "Subcategory"]

# Always save CSV
df[cols].to_csv(OUT_CSV, index=False)

# Try Parquet if engine available; otherwise skip with a note
parquet_saved = False
try:
    import importlib
    if importlib.util.find_spec("pyarrow") or importlib.util.find_spec("fastparquet"):
        df[cols].to_parquet(OUT_PARQUET, index=False)
        parquet_saved = True
    else:
        print("Parquet engine not installed (pyarrow/fastparquet). Skipping Parquet save.")
except Exception as e:
    print(f"Parquet save failed: {e}")

(OUT_CSV, OUT_PARQUET if parquet_saved else None)


Parquet engine not installed (pyarrow/fastparquet). Skipping Parquet save.


('bbc_with_subcategories_bertopic.csv', None)

In [None]:
# Detailed view: top 10 subcategories per main Category with examples
from itertools import islice

def print_top_subcats_with_examples(frame: pd.DataFrame, per_category: int = 10, examples_per_subcat: int = 3):
    grouped = (
        frame.groupby(["Category", "Subcategory"])  # type: ignore
             .size()
             .reset_index(name="Count")
             .sort_values(["Category", "Count"], ascending=[True, False])
    )

    for category in grouped["Category"].unique():
        print(f"\n=== {category} ===")
        top_subcats = grouped[grouped["Category"] == category].head(per_category)
        for _, row in top_subcats.iterrows():
            subcat = row["Subcategory"]
            count = int(row["Count"])
            print(f"- {subcat}: {count}")
            examples = frame[(frame["Category"] == category) & (frame["Subcategory"] == subcat)]["Title"].head(examples_per_subcat).tolist()
            for ex in examples:
                print(f"    • {ex}")

print_top_subcats_with_examples(df, per_category=10, examples_per_subcat=3)



=== business ===
- Growth, Oil, Said: 464
    • Ad sales boost Time Warner profit Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier. The firm, which is now one of the biggest investors in Google, benefited
    • Dollar gains on Greenspan speech The dollar has hit its highest level against the euro in almost three months after the Federal Reserve head said the US trade deficit is set to stabilise. And Alan Greenspan highlighted the US government's
    • Yukos unit buyer faces loan claim The owners of embattled Russian oil giant Yukos are to ask the buyer of its former production unit to pay back a $900m (£479m) loan. State-owned Rosneft bought the Yugansk unit for $9.3bn in
- Labour, Said, Would: 23
    • India calls for fair trade rules India, which attends the G7 meeting of seven leading industrialised nations on Friday, is unlikely to be cowed by its newcomer status. In London on Thursday ahead of t