## Subcategory Extraction
There are a few methods I can think of for extracting subcategories from text:
#### Keyword Dictionaries
- Define each subcategory by a list of associated keywords; an article is assigned to a subcategory when it contains a large number of that subcategory's keywords.
- This is easy to explain and transparent, but it has glaring weaknesses: it misses synonyms and cannot handle context.
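
The keyword approach can be sketched roughly as follows. The keyword lists, the `min_hits` threshold, and the function name are all hypothetical and chosen only for illustration; a real dictionary would need to be curated per category.

```python
import re
from collections import Counter

# Hypothetical keyword lists for two sport subcategories (illustration only).
SUBCATEGORY_KEYWORDS = {
    "football": {"goal", "striker", "league", "penalty"},
    "tennis": {"serve", "set", "wimbledon", "racket"},
}

def assign_subcategory(text, keyword_map, min_hits=2):
    """Return the subcategory with the most keyword hits, or None below min_hits."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # Score each subcategory by how often its keywords occur in the text.
    scores = {
        name: sum(counts[word] for word in words)
        for name, words in keyword_map.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_hits else None

print(assign_subcategory(
    "The striker scored a late goal in the league match",
    SUBCATEGORY_KEYWORDS,
))
```

An article that says "netted" instead of "goal" would score lower here, which is the synonym weakness mentioned above.
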
#### TF-IDF + Clustering
- Represent each document with TF-IDF, then run a clustering algorithm within each main category.
- Inspect the clusters and label them as subcategories.
- This is more automatic than the keyword dictionary.
- The clusters might not be clean, and the results are harder to explain.
#### Topic Modeling
- Use Latent Dirichlet Allocation (LDA) on each category to uncover hidden topics, or use BERTopic.
- Each topic then becomes a potential subcategory.
- This approach is more flexible and can discover patterns I might not think of; however, it can produce noisy or overlapping topics that need interpretation.
#### Supervised ML
- Label some of the data with subcategories, then train a classifier to predict them. This is probably the most well-rounded approach, but it depends on me labeling the data.
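
A small sketch of the supervised route, assuming scikit-learn is installed. The hand-labelled examples and label names are invented for illustration, and TF-IDF with logistic regression is just one reasonable baseline, not a choice made in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical hand-labelled sport articles (already cleaned).
train_texts = [
    "striker score goal league match",
    "penalty goal football club manager",
    "serve ace set wimbledon final",
    "tennis player win set tiebreak",
]
train_labels = ["football", "football", "tennis", "tennis"]

# TF-IDF features feeding a linear classifier.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(train_texts, train_labels)

prediction = classifier.predict(["club striker miss penalty"])[0]
print(prediction)
```
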

#### I'm choosing the topic modeling solution due to its advantages

Import preprocessing libraries

In [27]:
import spacy
import pandas as pd
from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch

In [28]:
# load the small English spaCy model
nlp = spacy.load("en_core_web_sm")

# set seed for reproducibility
torch.manual_seed(42)

<torch._C.Generator at 0x252c5850fd0>

Preprocessing Function

In [29]:
def preprocess_text(text):
    doc = nlp(text.lower())  # tokenize, tag, and lemmatize
    # keep lemmas of alphabetic, non-stop-word nouns, proper nouns, verbs, and adjectives
    keep_pos = {"NOUN", "PROPN", "VERB", "ADJ"}
    tokens = [t.lemma_ for t in doc if t.is_alpha and not t.is_stop and t.pos_ in keep_pos]
    return ' '.join(tokens)  # join tokens into a string

Import the text data

In [30]:
df = pd.read_csv("../data/raw_bbc.csv")

In [31]:
# apply preprocessing to dataframe
df["Cleaned Text"] = df["Text"].apply(preprocess_text)
# peek at 5 random rows
df.sample(5)

Unnamed: 0,Category,Text,Filename,Subcategory,Cleaned Text
780,entertainment,German music in a 'zombie' state\n\nThe German...,data/entertainment/271.txt,,german music zombie state german music busines...
537,entertainment,Potter director signs Warner deal\n\nHarry Pot...,data/entertainment/028.txt,,potter director sign warner deal harry potter ...
1208,politics,Parties warned over 'grey vote'\n\nPolitical p...,data/politics/313.txt,,party warn grey vote political party afford ol...
17,business,India's rupee hits five-year high\n\nIndia's r...,data/business/018.txt,,india rupee hit year high india rupee hit year...
1614,sport,Italy 17-28 Ireland\n\nTwo moments of magic fr...,data/sport/302.txt,,italy ireland moment magic brian guided irelan...


In [32]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

Prepare Functions

In [None]:
docs = df["Cleaned Text"].tolist()

def setup_llm_model():
    model_name = "google/flan-t5-base"
    # use the GPU when available, otherwise fall back to the CPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
    return model, tokenizer

model, tokenizer = setup_llm_model()

def generate_subcategory(text, category, model, tokenizer, max_length=20):
    # All instructions come before the article so they survive truncation at 512 tokens
    prompt = f"""
    Act as a news editor. Generate a concise subcategory label (2-3 words) for this news article in the {category} category and try not to be too generic.
    Only output the subcategory label.
    Article: {text}
    """
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    # Move inputs to the same device as model
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    outputs = model.generate(
        **inputs,
        max_length=max_length,
        min_length=2,
        do_sample=True,  # temperature only takes effect when sampling is enabled
        temperature=0.7,
        num_return_sequences=1
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

def run_llm_subcategory_model(df, category, model, tokenizer, batch_size=50):
    print(f"\nGenerating subcategories for {category}...")

    mask = df["Category"] == category
    texts = df.loc[mask, "Cleaned Text"].tolist()

    subcategories = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        print(f"Processing {i} to {i+len(batch)}")

        for text in batch:
            subcat = generate_subcategory(text, category, model, tokenizer)
            subcategories.append(subcat)

    df.loc[mask, "Subcategory"] = subcategories
    return df
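
Note that `batch_size` in `run_llm_subcategory_model` only controls progress reporting: each article is still sent through `generate_subcategory` one at a time. If throughput becomes a problem, true batched generation might look like the sketch below. The helper names `chunked`, `label_in_batches`, and `flan_t5_label_batch` are hypothetical, and the batched call assumes the `model` and `tokenizer` returned by `setup_llm_model()`; it has not been run against this dataset.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def label_in_batches(prompts, label_batch, batch_size=16):
    """Apply `label_batch` (list of prompts -> list of labels) batch by batch."""
    labels = []
    for batch in chunked(prompts, batch_size):
        labels.extend(label_batch(batch))
    return labels

def flan_t5_label_batch(prompts, model, tokenizer, max_new_tokens=20):
    """Generate one label per prompt with a single padded generate() call."""
    inputs = tokenizer(
        prompts, return_tensors="pt", padding=True, truncation=True, max_length=512
    ).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return [label.strip() for label in decoded]
```

Usage would be along the lines of `label_in_batches(prompts, lambda batch: flan_t5_label_batch(batch, model, tokenizer))`, where `prompts` is a list of already formatted prompt strings. Keeping the model call behind a plain function argument also makes the batching logic easy to check with a stub.
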



In [43]:
# Process each category with FLAN-T5
print("Starting subcategory generation with FLAN-T5...")
for category in df["Category"].unique():
    df = run_llm_subcategory_model(df, category, model, tokenizer, batch_size=50)
    print(f"Completed {category}")

Starting subcategory generation with FLAN-T5...

Generating subcategories for business...
Processing 0 to 50
Processing 50 to 100
Processing 100 to 150
Processing 150 to 200
Processing 200 to 250
Processing 250 to 300
Processing 300 to 350
Processing 350 to 400
Processing 400 to 450
Processing 450 to 500
Processing 500 to 510
Completed business

Generating subcategories for entertainment...
Processing 0 to 50
Processing 50 to 100
Processing 100 to 150
Processing 150 to 200
Processing 200 to 250
Processing 250 to 300
Processing 300 to 350
Processing 350 to 386
Completed entertainment

Generating subcategories for politics...
Processing 0 to 50
Processing 50 to 100
Processing 100 to 150
Processing 150 to 200
Processing 200 to 250
Processing 250 to 300
Processing 300 to 350
Processing 350 to 400
Processing 400 to 417
Completed politics

Generating subcategories for sport...
Processing 0 to 50
Processing 50 to 100
Processing 100 to 150
Processing 150 to 200
Processing 200 to 250
Processing

In [39]:
# Save results
df.to_csv("bbc_with_subcategories.csv", index=False)

In [40]:
import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU name:", torch.cuda.get_device_name(0))
    print("CUDA version:", torch.version.cuda)



CUDA available: True
GPU name: NVIDIA GeForce RTX 4060 Laptop GPU
CUDA version: 12.1
