# Text Classification, Topic Modelling and Text Generation

**Hello World!** and welcome to this project created by me, Rikhil Singh. As the title suggests, this project dives into three main areas of Natural Language Processing (NLP): the humble classification of texts, the unsupervised modelling of 
these texts (henceforth referred to as articles) into particular topics, and the generation of text that gives brief yet accurate 
representations of these articles to the layman. 

These 3 broad spheres of NLP
apply to many tasks, including sentiment analysis, document tagging and automated title generation. The project is 
composed of Machine Learning (ML) models that I've built from scratch, as well as pretrained ML models tuned for better
performance. It is advisable not to run this notebook in its entirety, but rather in the 3 main segments outlined by the
markdown cells found as you read through the code. It is imperative, however, to run the code cell directly below in order to import the 
necessary libraries used by this code, and to amend the `file_path` variable in that cell as need be. 

Should some of the libraries not be installed, please run `pip install -r requirements.txt` using the `requirements.txt` file 
found in the main directory where this code file, as well as the dataset, can be found. Happy Coding!
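
Before running the import cell, it can help to see which libraries are missing. The sketch below is only an illustration and not part of the original notebook: `missing_packages` is a hypothetical helper, and the list holds the import names used in the cell below, which can differ from the names used with pip (for example, `sklearn` is installed as `scikit-learn`).

```python
from importlib.util import find_spec

def missing_packages(import_names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in import_names if find_spec(name) is None]

# Import names taken from the import cell of this notebook
required = ["spacy", "numpy", "pandas", "tqdm", "torch", "gensim",
            "datasets", "bertopic", "rouge_score", "transformers", "sklearn"]
print("Missing:", missing_packages(required))
```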

7-12-2024 - 23-12-2024

In [1]:
## Ensure that this code cell is the first to be run before all others
## Each Section (1. 2. & 3.) can be run independently of each other so long as THIS cell has been run first
import spacy
import numpy as np
import pandas as pd
from tqdm import tqdm

import torch
import torch.nn as nn 
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.utils.data import Dataset as D

from gensim.models.ldamodel import LdaModel
from gensim.corpora.dictionary import Dictionary

from datasets import Dataset
from bertopic import BERTopic
from rouge_score import rouge_scorer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

from sklearn.model_selection import train_test_split 
from sklearn.metrics import precision_score, f1_score, classification_report

file_path = "bbc-news-data.csv" # amend the file path for the dataset as necessary
## Dataset Source - https://www.kaggle.com/datasets/hgultekin/bbcnewsarchive

## 1. Text Classification

Classification, while often thought of as a relatively simple and trivial task, can be quite nuanced and significant in the world we live in today. With information constantly flowing through multiple channels, text classification finds utility across many domains, from sorting articles into sports, politics or entertainment to filtering fake news from legitimate sources. Here, we aim to compare how well the articles can be classified into the FIVE predetermined categories, not solely through the content of an entire article, but even through its plain and simple title.

In [2]:
df = pd.read_csv(file_path,sep='\t')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   category  2225 non-null   object
 1   filename  2225 non-null   object
 2   title     2225 non-null   object
 3   content   2225 non-null   object
dtypes: object(4)
memory usage: 69.7+ KB


In [4]:
df['category'].value_counts(normalize=True)*100 # Fortunately, this dataset is quite well balanced from the get-go

category
sport            22.966292
business         22.921348
politics         18.741573
tech             18.022472
entertainment    17.348315
Name: proportion, dtype: float64

In [5]:
df.head()

| | category | filename | title | content |
|---|---|---|---|---|
| 0 | business | 001.txt | Ad sales boost Time Warner profit | Quarterly profits at US media giant TimeWarne... |
| 1 | business | 002.txt | Dollar gains on Greenspan speech | The dollar has hit its highest level against ... |
| 2 | business | 003.txt | Yukos unit buyer faces loan claim | The owners of embattled Russian oil giant Yuk... |
| 3 | business | 004.txt | High fuel prices hit BA's profits | British Airways has blamed high fuel prices f... |
| 4 | business | 005.txt | Pernod takeover talk lifts Domecq | Shares in UK drinks and food firm Allied Dome... |


In [6]:
r_state = 27
train_df,test_df = train_test_split(df,train_size=0.8,random_state=r_state,stratify = df['category']) 

In [8]:
def consolidate(df):
    return df['title']+'.'+df['content']
train_df['title_and_content'] = consolidate(train_df) # this column was mainly created for the cohesiveness of the title and article itself
# it is not used in the code below but can be - as will be explained further on

In [9]:
train, val = train_test_split(train_df,train_size=0.75,random_state=r_state,stratify=train_df['category']) 
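
Taken together, the two splits leave roughly 60% of the data for training, 20% for validation and 20% for testing. A quick arithmetic check follows. It assumes scikit-learn's behaviour of flooring the training size when `train_size` is a fraction and giving the remainder to the other split, and it uses the 2225 rows reported by `df.info()`.

```python
import math

n_total = 2225                              # rows reported by df.info()

n_train_df = math.floor(0.8 * n_total)      # first split: train_size=0.8
n_test = n_total - n_train_df               # the remainder becomes the test set

n_train = math.floor(0.75 * n_train_df)     # second split: train_size=0.75
n_val = n_train_df - n_train                # the remainder becomes the validation set

print(n_train, n_val, n_test)               # 1335 445 445
```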

In [10]:
mapping = {k:i for i,k in enumerate(sorted(train['category'].unique()))}
mapping

{'business': 0, 'entertainment': 1, 'politics': 2, 'sport': 3, 'tech': 4}
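
Since the model predicts integer ids, an inverse of this mapping is handy for turning predictions back into category names. A minimal sketch, where `inverse_mapping` and the example ids are hypothetical and not part of the original notebook:

```python
mapping = {'business': 0, 'entertainment': 1, 'politics': 2, 'sport': 3, 'tech': 4}

# Invert the label -> id mapping so that predicted ids can be decoded
inverse_mapping = {idx: label for label, idx in mapping.items()}

predicted_ids = [3, 0, 4]                            # hypothetical model outputs
print([inverse_mapping[i] for i in predicted_ids])   # ['sport', 'business', 'tech']
```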

In [11]:
train['enc_category'] = train['category'].map(mapping)
val['enc_category'] = val['category'].map(mapping)

In [12]:
nlp = spacy.load("en_core_web_md") # spacy's (medium) English pipeline is used for the tokenisation and embedding of texts
# it is preferred to other pipelines owing to its ease of use and its compact size relative to its capability

In [13]:
train['title_embed'] = train['title'].apply(lambda x: torch.from_numpy(nlp(x.lower()).vector))
val['title_embed'] = val['title'].apply(lambda x: torch.from_numpy(nlp(x.lower()).vector))
# While per-word embeddings may seem to be the natural fit - particularly for something like a title - 
# document embeddings (here, spaCy's average of the token vectors) are efficient, have a fixed length of 300 and 
# retain semantic information reasonably well even in short texts - though this improves with longer articles
# https://orangedatamining.com/blog/embedding-vs-bow/
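
spaCy's `Doc.vector` defaults to the average of the token vectors, which is why every title or article ends up as a vector of the same length regardless of how many tokens it has. The toy sketch below uses made-up 3-dimensional vectors (the real pipeline uses 300) and a hypothetical `mean_pool` helper.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one document vector."""
    n_tokens = len(token_vectors)
    return [sum(dim) / n_tokens for dim in zip(*token_vectors)]

# Made-up token vectors for a 2-token title and a 4-token title
short_title = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
long_title = [[1.0, 1.0, 1.0], [0.0, 2.0, 4.0], [2.0, 0.0, 1.0], [1.0, 1.0, 2.0]]

print(mean_pool(short_title))   # [2.0, 1.0, 1.0]
print(mean_pool(long_title))    # [1.0, 1.0, 2.0]
```

Both outputs have the same length, which is what lets a fixed `input_size` be used further on.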

In [14]:
train['embed_content'] = train['content'].apply(lambda x: torch.from_numpy(nlp(x.lower()).vector))
val['embed_content'] = val['content'].apply(lambda x: torch.from_numpy(nlp(x.lower()).vector))

In [15]:
feature_cat = "title_embed" # While the feature of interest henceforth is the embedded title,
# this can be amended to "embed_content" - which also yields much more impressive results 

In [16]:
train_data = train[[feature_cat,"enc_category"]]
val_data = val[[feature_cat,"enc_category"]]

In [17]:
class DataFrameDataset(D): # Creating a suitable dataset for pytorch to read from
    def __init__(self, df, feature, target):
        self.df = df
        self.feature = feature
        self.target = target

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # Extract features and target
        row = self.df.iloc[idx]
        feature = row[self.feature] # Already a tensor (the precomputed embedding)
        target = torch.tensor(row[self.target], dtype=torch.long)
        return feature, target

In [18]:
train_torch = DataFrameDataset(train_data, feature=feature_cat, target="enc_category")
val_torch = DataFrameDataset(val_data, feature=feature_cat, target="enc_category")

In [19]:
batch = 16
# the batch size can be altered as need be, though 16 works well here

train_loader = DataLoader(train_torch, batch_size=batch, shuffle=True) 
val_loader = DataLoader(val_torch, batch_size=batch, shuffle=False) # shuffling is not needed for evaluation

In [20]:
class RNN(nn.Module): # creation of the Recurrent Neural Network (yes, an RNN - not even an LSTM) for text classification
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True) # use of RNN function in nn module
        self.fc = nn.Linear(hidden_size, output_size) # fully connected layer to map RNN output to number of output classes 
    
    def forward(self, x):
        h0 = torch.zeros(1, x.size(0), self.hidden_size).to(x.device) # setting of hidden state params
        out, _ = self.rnn(x, h0) # output
        out = self.fc(out[:,-1,:]) # fully connected layer output
        return out
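
Because each sample is fed in as a sequence of length 1 (via `unsqueeze(1)` in the training loop) and the initial hidden state is zero, the recurrent term contributes nothing. A single Elman RNN step, h = tanh(W_ih x + b_ih + W_hh h0 + b_hh), then reduces to one tanh layer on the input. The pure-Python sketch below illustrates this with made-up weights; `matvec` and `rnn_step` are hypothetical helpers, not part of the notebook.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def rnn_step(x, h_prev, w_ih, w_hh, b_ih, b_hh):
    """One Elman RNN step: h = tanh(W_ih x + b_ih + W_hh h_prev + b_hh)."""
    a = matvec(w_ih, x)
    r = matvec(w_hh, h_prev)
    return [math.tanh(ai + bi + ri + bh) for ai, bi, ri, bh in zip(a, b_ih, r, b_hh)]

# Made-up weights: 3 inputs, 2 hidden units
w_ih = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]]
w_hh = [[0.7, -0.4], [0.2, 0.9]]
b_ih, b_hh = [0.05, -0.05], [0.01, 0.02]
x = [1.0, 2.0, 3.0]

h_rnn = rnn_step(x, [0.0, 0.0], w_ih, w_hh, b_ih, b_hh)

# The same result without any recurrent weights: a single tanh layer
h_dense = [math.tanh(a + bi + bh) for a, bi, bh in zip(matvec(w_ih, x), b_ih, b_hh)]

print(all(abs(p - q) < 1e-12 for p, q in zip(h_rnn, h_dense)))   # True
```

In other words, with one document vector per sample the recurrence is never exercised; it would only come into play if each article were fed in as a sequence of token vectors.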

In [21]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
input_size = 300 # each document vector has 300 elements
hidden_size = 128 # a power of 2 between the input size (300) and the output size
output_size = len(mapping)
model = RNN(input_size, hidden_size, output_size).to(device) # the model must sit on the same device as the batches
criterion = nn.CrossEntropyLoss() # classification problem --> cross-entropy loss
optimizer = optim.Adam(model.parameters(), lr=0.001) # Adam optimiser with a learning rate of 0.001
epochs = 10
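
`nn.CrossEntropyLoss` takes raw logits and applies a log-softmax followed by the negative log-likelihood of the true class. As a reference point, a model that spreads its probability evenly over the 5 categories scores ln(5), roughly 1.609, which is a useful baseline when reading the logged losses. A minimal pure-Python sketch for a single sample, using a hypothetical `cross_entropy` helper:

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample: -log(softmax(logits)[target])."""
    peak = max(logits)                                   # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]

uniform = cross_entropy([0.0] * 5, target=3)             # equal logits -> uniform prediction
confident = cross_entropy([0.0, 0.0, 0.0, 4.0, 0.0], target=3)

print(round(uniform, 4))     # 1.6094
print(confident < uniform)   # True
```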

In [22]:
for epoch in range(epochs):
    # Training Phase
    model.train()
    train_loss = 0
    for features, target in tqdm(train_loader):
        features, target = features.to(device), target.to(device)
        features = features.unsqueeze(1)  # Adjust input for RNN
        optimizer.zero_grad() # reset gradient to zero for each loop
        output = model(features) # pass in my features into the model
        loss = criterion(output, target) # using cross entropy loss WOO
        loss.backward() # backprop 
        optimizer.step() # step taken
        train_loss += loss.item()

    # Validation Phase
    model.eval() # set model in evaluation phase
    val_loss = 0
    all_targets = []
    all_predictions = []

    with torch.no_grad():
        for features, target in tqdm(val_loader):
            features, target = features.to(device), target.to(device)
            features = features.unsqueeze(1)
            
            output = model(features)
            loss = criterion(output, target)
            val_loss += loss.item()

            _, predicted = torch.max(output, dim=1)
            all_predictions.extend(predicted.cpu().numpy())
            all_targets.extend(target.cpu().numpy())

    # Metrics Calculation 
    precision = precision_score(all_targets, all_predictions, average='weighted')
    f1 = f1_score(all_targets, all_predictions, average='weighted')
    accuracy = (torch.tensor(all_predictions) == torch.tensor(all_targets)).sum().item() / len(all_targets)

    train_loss /= len(train_loader)
    val_loss /= len(val_loader)

    print(f"Epoch {epoch + 1}/{epochs}:")
    print(f"  Train Loss: {train_loss:.4f}")
    print(f"  Val Loss: {val_loss:.4f}")
    print(f"  Accuracy: {accuracy:.4f}")
    print(f"  Precision: {precision:.4f}")
    print(f"  F1 Score: {f1:.4f}")
    print(classification_report(all_targets,all_predictions,target_names=mapping))
    print()

100%|██████████| 84/84 [00:00<00:00, 182.71it/s]
100%|██████████| 28/28 [00:00<00:00, 357.29it/s]


Epoch 1/10:
  Train Loss: 1.2835
  Val Loss: 1.0526
  Accuracy: 0.6315
  Precision: 0.6586
  F1 Score: 0.6293
               precision    recall  f1-score   support

     business       0.53      0.83      0.65       102
entertainment       0.79      0.48      0.60        77
     politics       0.62      0.60      0.61        84
        sport       0.70      0.58      0.63       102
         tech       0.68      0.62      0.65        80

     accuracy                           0.63       445
    macro avg       0.66      0.62      0.63       445
 weighted avg       0.66      0.63      0.63       445




100%|██████████| 84/84 [00:00<00:00, 150.65it/s]
100%|██████████| 28/28 [00:00<00:00, 241.11it/s]


Epoch 2/10:
  Train Loss: 0.8601
  Val Loss: 0.8904
  Accuracy: 0.6989
  Precision: 0.7050
  F1 Score: 0.6977
               precision    recall  f1-score   support

     business       0.65      0.81      0.72       102
entertainment       0.77      0.62      0.69        77
     politics       0.69      0.63      0.66        84
        sport       0.74      0.67      0.70       102
         tech       0.68      0.74      0.71        80

     accuracy                           0.70       445
    macro avg       0.71      0.69      0.70       445
 weighted avg       0.71      0.70      0.70       445




100%|██████████| 84/84 [00:00<00:00, 128.82it/s]
100%|██████████| 28/28 [00:00<00:00, 297.61it/s]


Epoch 3/10:
  Train Loss: 0.7090
  Val Loss: 0.8286
  Accuracy: 0.6966
  Precision: 0.7000
  F1 Score: 0.6945
               precision    recall  f1-score   support

     business       0.67      0.81      0.74       102
entertainment       0.68      0.70      0.69        77
     politics       0.76      0.60      0.67        84
        sport       0.70      0.73      0.71       102
         tech       0.69      0.61      0.65        80

     accuracy                           0.70       445
    macro avg       0.70      0.69      0.69       445
 weighted avg       0.70      0.70      0.69       445




100%|██████████| 84/84 [00:00<00:00, 102.20it/s]
100%|██████████| 28/28 [00:00<00:00, 243.42it/s]


Epoch 4/10:
  Train Loss: 0.6235
  Val Loss: 0.8102
  Accuracy: 0.7034
  Precision: 0.7041
  F1 Score: 0.7029
               precision    recall  f1-score   support

     business       0.72      0.74      0.73       102
entertainment       0.67      0.75      0.71        77
     politics       0.70      0.64      0.67        84
        sport       0.72      0.72      0.72       102
         tech       0.70      0.66      0.68        80

     accuracy                           0.70       445
    macro avg       0.70      0.70      0.70       445
 weighted avg       0.70      0.70      0.70       445




100%|██████████| 84/84 [00:00<00:00, 133.21it/s]
100%|██████████| 28/28 [00:00<00:00, 301.53it/s]


Epoch 5/10:
  Train Loss: 0.5675
  Val Loss: 0.7952
  Accuracy: 0.7169
  Precision: 0.7201
  F1 Score: 0.7163
               precision    recall  f1-score   support

     business       0.69      0.79      0.74       102
entertainment       0.72      0.73      0.72        77
     politics       0.73      0.63      0.68        84
        sport       0.78      0.70      0.74       102
         tech       0.68      0.72      0.70        80

     accuracy                           0.72       445
    macro avg       0.72      0.71      0.71       445
 weighted avg       0.72      0.72      0.72       445




100%|██████████| 84/84 [00:00<00:00, 134.50it/s]
100%|██████████| 28/28 [00:00<00:00, 262.73it/s]


Epoch 6/10:
  Train Loss: 0.5275
  Val Loss: 0.7941
  Accuracy: 0.7011
  Precision: 0.7032
  F1 Score: 0.7005
               precision    recall  f1-score   support

     business       0.68      0.79      0.73       102
entertainment       0.74      0.68      0.71        77
     politics       0.69      0.65      0.67        84
        sport       0.74      0.70      0.72       102
         tech       0.67      0.66      0.67        80

     accuracy                           0.70       445
    macro avg       0.70      0.70      0.70       445
 weighted avg       0.70      0.70      0.70       445




100%|██████████| 84/84 [00:00<00:00, 158.17it/s]
100%|██████████| 28/28 [00:00<00:00, 308.12it/s]


Epoch 7/10:
  Train Loss: 0.4923
  Val Loss: 0.7912
  Accuracy: 0.7169
  Precision: 0.7185
  F1 Score: 0.7163
               precision    recall  f1-score   support

     business       0.74      0.75      0.74       102
entertainment       0.77      0.69      0.73        77
     politics       0.69      0.69      0.69        84
        sport       0.69      0.78      0.73       102
         tech       0.71      0.65      0.68        80

     accuracy                           0.72       445
    macro avg       0.72      0.71      0.71       445
 weighted avg       0.72      0.72      0.72       445




100%|██████████| 84/84 [00:00<00:00, 123.56it/s]
100%|██████████| 28/28 [00:00<00:00, 322.78it/s]


Epoch 8/10:
  Train Loss: 0.4628
  Val Loss: 0.7962
  Accuracy: 0.7011
  Precision: 0.7046
  F1 Score: 0.7015
               precision    recall  f1-score   support

     business       0.76      0.71      0.73       102
entertainment       0.76      0.65      0.70        77
     politics       0.64      0.69      0.67        84
        sport       0.70      0.75      0.73       102
         tech       0.65      0.69      0.67        80

     accuracy                           0.70       445
    macro avg       0.70      0.70      0.70       445
 weighted avg       0.70      0.70      0.70       445




100%|██████████| 84/84 [00:00<00:00, 140.43it/s]
100%|██████████| 28/28 [00:00<00:00, 271.64it/s]


Epoch 9/10:
  Train Loss: 0.4432
  Val Loss: 0.8041
  Accuracy: 0.7213
  Precision: 0.7256
  F1 Score: 0.7209
               precision    recall  f1-score   support

     business       0.68      0.77      0.72       102
entertainment       0.80      0.69      0.74        77
     politics       0.74      0.64      0.69        84
        sport       0.71      0.77      0.74       102
         tech       0.72      0.70      0.71        80

     accuracy                           0.72       445
    macro avg       0.73      0.72      0.72       445
 weighted avg       0.73      0.72      0.72       445




100%|██████████| 84/84 [00:00<00:00, 135.92it/s]
100%|██████████| 28/28 [00:00<00:00, 279.71it/s]

Epoch 10/10:
  Train Loss: 0.4313
  Val Loss: 0.8387
  Accuracy: 0.7034
  Precision: 0.7128
  F1 Score: 0.7029
               precision    recall  f1-score   support

     business       0.66      0.79      0.72       102
entertainment       0.64      0.77      0.70        77
     politics       0.72      0.69      0.71        84
        sport       0.81      0.64      0.71       102
         tech       0.70      0.62      0.66        80

     accuracy                           0.70       445
    macro avg       0.71      0.70      0.70       445
 weighted avg       0.71      0.70      0.70       445
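
The logs above show the training loss falling every epoch while the validation loss flattens and then rises. The small sketch below, using the validation losses printed above, picks out the epoch with the lowest validation loss. Keeping the weights from that epoch (early stopping) is a common option, though it is not something this notebook does.

```python
# Validation losses copied from the epoch logs above
val_losses = [1.0526, 0.8904, 0.8286, 0.8102, 0.7952,
              0.7941, 0.7912, 0.7962, 0.8041, 0.8387]

# Index of the smallest loss, shifted by 1 because epochs are counted from 1
best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__) + 1
print(best_epoch, val_losses[best_epoch - 1])   # 7 0.7912
```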







In [23]:
def processing(df,feature_name): # processing of future (test) dataframes for model suitability
    feature = df[feature_name]
    target = df["category"]
    
    enc_feature = feature.apply(lambda x: torch.from_numpy(nlp(x.lower()).vector))
    enc_target = target.map(mapping)
    enc_df = pd.DataFrame({"embed_feature":enc_feature,"enc_category":enc_target})

    torch_dataset = DataFrameDataset(enc_df, feature="embed_feature", target="enc_category")
    torch_loader = DataLoader(torch_dataset, batch_size=16, shuffle=True)

    return torch_loader

In [24]:
torch_test = processing(test_df,feature_cat.replace("embed","").replace("_",""))

In [25]:
model.eval() # Evaluation of the model on a test dataset

all_predictions = []
all_targets = []

with torch.no_grad(): 
    for batch_idx, (features, targets) in enumerate(torch_test):
        features = features.to(device)  
        features = features.unsqueeze(1)
        targets = targets.to(device)

        logits = model(features)

        predictions = torch.argmax(logits, dim=1)

        all_predictions.extend(predictions.cpu().numpy())  
        all_targets.extend(targets.cpu().numpy()) 

precision = precision_score(all_targets, all_predictions, average='weighted')
f1 = f1_score(all_targets, all_predictions, average='weighted')

print(f"Precision: {precision:.4f}")
print(f"F1 Score: {f1:.4f}")
print(classification_report(all_targets,all_predictions,target_names=mapping))
# Reasonable results, especially as they come solely from titles and a plain RNN - better results can be attained 
# when the feature "embed_content" (embedded article text) is used for training and testing --> an LSTM may 
# also bring improvements 

Precision: 0.7242
F1 Score: 0.7186
               precision    recall  f1-score   support

     business       0.69      0.73      0.70       102
entertainment       0.66      0.81      0.73        77
     politics       0.77      0.73      0.75        84
        sport       0.76      0.63      0.69       102
         tech       0.74      0.74      0.74        80

     accuracy                           0.72       445
    macro avg       0.72      0.72      0.72       445
 weighted avg       0.72      0.72      0.72       445
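
With `average='weighted'`, each per-class score is weighted by that class's support (its number of true samples), whereas the macro average treats all classes equally. The sketch below recomputes both from the rounded per-class precisions in the report above, so it only matches the printed `Precision` up to rounding.

```python
# Per-class precision and support, copied from the test report above
precision = {"business": 0.69, "entertainment": 0.66, "politics": 0.77,
             "sport": 0.76, "tech": 0.74}
support = {"business": 102, "entertainment": 77, "politics": 84,
           "sport": 102, "tech": 80}

total = sum(support.values())
weighted = sum(precision[c] * support[c] for c in precision) / total
macro = sum(precision.values()) / len(precision)

print(f"weighted: {weighted:.3f}, macro: {macro:.3f}")
```

The two averages are close here because the classes are fairly balanced; they would diverge more on a skewed dataset.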



## 2. Topic Modelling

It is useful to be able to classify articles into clear categories, whether to capture the interests of particular readers or to inform decisions, yet in the real world most data is unlabelled. More precisely, while a certain corpus of text may be accurately labelled (via cross-checking) by humans, newly generated articles may not be. This creates two problems: we cannot consistently check whether a pre-established classifier continues to perform well on data that changes (even as its labels do not), and novel categories are not reflected well. This is where unsupervised and semi-supervised learning come in; namely (in the case of NLP) [Topic Modelling](https://www.datacamp.com/tutorial/what-is-topic-modeling).

In [2]:
df = pd.read_csv(file_path,sep='\t')
r_state = 27

train_df,test_df = train_test_split(df,train_size=0.8,random_state=r_state,stratify = df['category'])

In [3]:
nlp = spacy.load("en_core_web_md",disable=["ner","parser"]) 
# The named-entity recogniser and the parser are disabled to speed up text preprocessing

In [4]:
def text_preprocessing(sentence):
    sentence = sentence.strip()
    doc = nlp(sentence.lower())  # Lowercasing
    processed_res = " ".join([token.lemma_ for token in doc if not token.is_stop and not token.is_punct and len(token.lemma_)>2])
    return processed_res 
    
def text_embedding(processed_res): 
    return torch.from_numpy(np.array([token.vector for token in nlp(processed_res)]))

In [5]:
train_df["processed_content"] = train_df["content"].apply(text_preprocessing) 
# this lemmatises and filters the text into space-separated tokens, without embedding it

In [6]:
train_df['title_embed'] = train_df['title'].apply(lambda x:torch.from_numpy(np.array([token.vector for token in nlp(x)])))

In [7]:
train, val = train_test_split(train_df,train_size=0.75,random_state=r_state,stratify=train_df['category'])

In [8]:
tokenised_docs = list((train["processed_content"]).apply(lambda x:x.split(' ')).values)
# Topic modelling algorithms such as Latent Dirichlet Allocation (LDA) work from the counts of tokens across many documents
# rather than from the embeddings of specific words, sentences or documents. Hence the tokens are first used to build 
# a Dictionary (token -> integer id), from which the corpus that the LDA model is trained on is formed. More information
# on LDA & how it works can be found here: https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2

In [9]:
dictionary = Dictionary(tokenised_docs)

In [10]:
corpus = [dictionary.doc2bow(doc) for doc in tokenised_docs] # this specific corpus is formed using the bag of words technique
# term frequency - inverse document frequency (TF-IDF) can also be used, though for this specific task I found the bag of words
# approach to be sufficient
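
To make the bag-of-words step concrete, the sketch below mimics in plain Python what `Dictionary` and `doc2bow` produce: each distinct token receives an integer id, and each document becomes a sorted list of `(token_id, count)` pairs, with unknown tokens skipped. It is only an illustration with toy documents and hypothetical helper names; the ids that gensim assigns may differ.

```python
from collections import Counter

def build_vocab(tokenised_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in tokenised_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Convert a token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

toy_docs = [["profit", "rise", "profit"], ["match", "win", "rise"]]
vocab = build_vocab(toy_docs)

print(vocab)                            # {'profit': 0, 'rise': 1, 'match': 2, 'win': 3}
print(to_bow(toy_docs[0], vocab))       # [(0, 2), (1, 1)]
print(to_bow(["win", "goal"], vocab))   # [(3, 1)]  ('goal' is not in the vocabulary)
```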

In [None]:
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, passes=250, random_state=r_state)

In [12]:
# comparison on train set
predicted_topics = []
for tokens in tokenised_docs:
    bow = dictionary.doc2bow(tokens)
    topic_distribution = lda_model.get_document_topics(bow, minimum_probability=0.0)
    predicted_topic = max(topic_distribution, key=lambda x: x[1])[0]  # Topic with highest probability
    predicted_topics.append(predicted_topic)

In [13]:
comparison = pd.concat([train["category"].reset_index().drop("index",axis=1),pd.Series(predicted_topics,name='predicted_cat')],axis=1)
comparison.groupby("category").value_counts(normalize=True)*100 # the information presented below has been more neatly conveyed
# in cells further below - where my conclusion has also been outlined. See if you can come to the same realisation!

category       predicted_cat
business       4                91.830065
               3                 3.267974
               2                 2.941176
               1                 1.633987
               0                 0.326797
entertainment  2                65.948276
               0                25.000000
               3                 5.603448
               1                 1.724138
               4                 1.724138
politics       2                94.377510
               4                 4.016064
               1                 0.803213
               0                 0.401606
               3                 0.401606
sport          1                98.697068
               0                 0.977199
               4                 0.325733
tech           3                83.817427
               2                 5.394191
               1                 4.564315
               0                 3.319502
               4                 2.904564
Name:

In [14]:
val_tokenised_docs = list((val["processed_content"]).apply(lambda x:x.split(' ')).values)

predicted_topics = []
for tokens in val_tokenised_docs:
    bow = dictionary.doc2bow(tokens)
    topic_distribution = lda_model.get_document_topics(bow, minimum_probability=0.0)
    predicted_topic = max(topic_distribution, key=lambda x: x[1])[0]  # Topic with highest probability
    predicted_topics.append(predicted_topic)

In [15]:
comparison = pd.concat([val["category"].reset_index().drop("index",axis=1),pd.Series(predicted_topics,name='predicted_cat')],axis=1)
comparison["count"]=1 
pf = pd.crosstab(index=comparison["category"],columns=comparison["predicted_cat"],values=comparison["count"],aggfunc="sum",normalize="index")*100
pf.sort_values(by=[0,1,2,3,4],ascending=False)
# There appears to be a fairly clean correspondence between the actual categories and the (unnamed) topics formed by the LDA model
# Articles labelled as 'sport' are lumped under the model's topic '1' 
# Articles labelled as 'tech' are primarily lumped under the model's topic '3' and so on

# The main problem appears to be the overlap between the entertainment and politics articles, which are primarily
# clustered together under '2' by the LDA model - though some of the entertainment articles fall into '0' instead 

# As LDA has no notion of the semantics of the words themselves - working solely off tokenised instances of 
# preprocessed words - it does not look at meaning but rather at word frequency and co-occurrence
# This may point to a similarity in the vocabulary of entertainment and politics articles - indeed politicians today do appear to be 
# attaining that 'celebrity' like status - which is helpful in providing new perspectives, though not without its flaws.

| category / predicted_cat | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| entertainment | 18.181818 | 1.298701 | 70.12987 | 9.090909 | 1.298701 |
| sport | 0.0 | 99.019608 | 0.980392 | 0.0 | 0.0 |
| tech | 0.0 | 1.25 | 6.25 | 91.25 | 1.25 |
| business | 0.0 | 0.980392 | 3.921569 | 0.0 | 95.098039 |
| politics | 0.0 | 0.0 | 98.809524 | 0.0 | 1.190476 |
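
`normalize="index"` divides every row of the contingency table by its row total, so each category's row sums to 100%. The sketch below reproduces the entertainment row. The counts are a reconstruction from the percentages above, assuming the 77 entertainment articles in the validation split (the support shown in the Section 1 reports); they are not values printed by the notebook.

```python
# Reconstructed counts of validation 'entertainment' articles per LDA topic (assumption)
entertainment_counts = {0: 14, 1: 1, 2: 54, 3: 7, 4: 1}

row_total = sum(entertainment_counts.values())
row_percent = {topic: 100 * n / row_total for topic, n in entertainment_counts.items()}

print(row_total)                                         # 77
print({t: round(p, 2) for t, p in row_percent.items()})
```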


In [16]:
print(pf.idxmax()) # Another illustration of the above table
print()
print(pf.T.idxmax())

predicted_cat
0    entertainment
1            sport
2         politics
3             tech
4         business
dtype: object

category
business         4
entertainment    2
politics         2
sport            1
tech             3
dtype: int64
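
The two `idxmax` calls answer different questions: the first asks which category dominates each topic (column-wise), the second asks which topic each category mostly lands in (row-wise). A sketch in plain Python, using rounded validation percentages from the LDA table above:

```python
# Row-normalised validation percentages (rounded) from the LDA crosstab above
table = {
    "business":      [0.00, 0.98, 3.92, 0.00, 95.10],
    "entertainment": [18.18, 1.30, 70.13, 9.09, 1.30],
    "politics":      [0.00, 0.00, 98.81, 0.00, 1.19],
    "sport":         [0.00, 99.02, 0.98, 0.00, 0.00],
    "tech":          [0.00, 1.25, 6.25, 91.25, 1.25],
}

# Column-wise: the category with the largest share in each topic
topic_to_category = {
    topic: max(table, key=lambda cat: table[cat][topic]) for topic in range(5)
}
# Row-wise: the topic that each category mostly falls into
category_to_topic = {
    cat: max(range(5), key=row.__getitem__) for cat, row in table.items()
}

print(topic_to_category)
print(category_to_topic)
```

Topic 0 is attributed to entertainment only because no other category has any share in it, while entertainment's own largest share sits in topic 2 alongside politics. This is why the two views disagree.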


In [17]:
cat_mapping = {k:i for i,k in enumerate(train["category"].unique())} 
train["enc_category"] = train["category"].map(cat_mapping)

In [18]:
sample_content,sample_cats = train["processed_content"].values ,train["enc_category"].values

In [19]:
topic_model = BERTopic(verbose=True, nr_topics = 5,min_topic_size=150)
topics,probs = topic_model.fit_transform(sample_content,y = sample_cats)

# Apart from LDA, BERTopic is another topic modelling algorithm that can identify common topics across various articles.
# Unlike LDA however, it is not solely reliant on individual tokens nor statistical probabilities: it embeds each document,
# so it is also able to capture the meaning of articles in its topic modelling process. It does not require the number of 
# topics to be explicitly defined (like in LDA), though this can be done as shown above to acquire an outcome efficiently.
# Note that passing the labels via `y` makes this run (semi-)supervised rather than purely unsupervised, so the
# comparison with LDA is not like-for-like.

2024-12-22 22:39:55,083 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/42 [00:00<?, ?it/s]

2024-12-22 22:42:00,405 - BERTopic - Embedding - Completed ✓
2024-12-22 22:42:00,406 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-12-22 22:42:11,403 - BERTopic - Dimensionality - Completed ✓
2024-12-22 22:42:11,403 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-12-22 22:42:11,503 - BERTopic - Cluster - Completed ✓
2024-12-22 22:42:11,503 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-12-22 22:42:11,767 - BERTopic - Representation - Completed ✓
2024-12-22 22:42:11,767 - BERTopic - Topic reduction - Reducing number of topics
2024-12-22 22:42:11,767 - BERTopic - Topic reduction - Reduced number of topics from 5 to 5


In [20]:
topic_model.get_topic_info() # the Names of each topic roughly reflect the documents within quite well
# Topic 0 likely being political in nature, 1 being film/ entertainment based, 3 being business related and so on

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | 0 | 249 | 0_say_labour_party_election | [say, labour, party, election, government, bla... | [tony blair launch attack conservative spendin... |
| 1 | 1 | 233 | 1_film_award_say_star | [film, award, say, star, year, good, music, in... | [hollywood star bring touch glamour london sat... |
| 2 | 2 | 307 | 2_win_play_game_say | [win, play, game, say, player, england, year, ... | [double olympic champion kelly holmes good com... |
| 3 | 3 | 306 | 3_say_year_company_firm | [say, year, company, firm, market, rise, bank,... | [price home rise seasonally adjust 0.5 februar... |
| 4 | 4 | 240 | 4_say_people_technology_game | [say, people, technology, game, mobile, servic... | [mobile phone enjoy boom time sale accord rese... |


In [21]:
comparison = pd.concat([train["category"].reset_index().drop("index",axis=1),pd.Series(topics,name='predicted_cat')],axis=1)
comparison["count"]=1 
pf = pd.crosstab(index=comparison["category"],columns=comparison["predicted_cat"],values=comparison["count"],aggfunc="sum",normalize="index")*100
pf # BERTopic separates the categories far more cleanly than LDA did, bearing in mind that it was given the labels via `y` when fitting

| category / predicted_cat | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| business | 0.0 | 0.326797 | 0.0 | 99.673203 | 0.0 |
| entertainment | 0.0 | 100.0 | 0.0 | 0.0 | 0.0 |
| politics | 100.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| sport | 0.0 | 0.0 | 100.0 | 0.0 | 0.0 |
| tech | 0.0 | 0.0 | 0.0 | 0.414938 | 99.585062 |


In [22]:
print(pf.idxmax())
print()
print(pf.T.idxmax())

predicted_cat
0         politics
1    entertainment
2            sport
3         business
4             tech
dtype: object

category
business         3
entertainment    1
politics         0
sport            2
tech             4
dtype: int64


In [23]:
val["enc_category"] = val["category"].map(cat_mapping) # the validation process is thus carried out below, to achieve similar results
val_content,val_cats = val["processed_content"].values ,val["enc_category"].values

In [24]:
topics,probs = topic_model.transform(val_content)

comparison = pd.concat([val["category"].reset_index().drop("index",axis=1),pd.Series(topics,name='predicted_cat')],axis=1)
comparison["count"]=1 
pf = pd.crosstab(index=comparison["category"],columns=comparison["predicted_cat"],values=comparison["count"],aggfunc="sum",normalize="index")*100
pf

Batches:   0%|          | 0/14 [00:00<?, ?it/s]

2024-12-22 22:42:52,530 - BERTopic - Dimensionality - Reducing dimensionality of input embeddings.
2024-12-22 22:42:58,194 - BERTopic - Dimensionality - Completed ✓
2024-12-22 22:42:58,196 - BERTopic - Clustering - Approximating new points with `hdbscan_model`
2024-12-22 22:42:58,226 - BERTopic - Cluster - Completed ✓


predicted_cat,-1,0,1,2,3,4
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
business,0.980392,3.921569,0.0,0.0,92.156863,2.941176
entertainment,0.0,1.298701,92.207792,0.0,1.298701,5.194805
politics,0.0,98.809524,0.0,0.0,0.0,1.190476
sport,0.0,0.980392,0.0,99.019608,0.0,0.0
tech,0.0,1.25,0.0,0.0,3.75,95.0


In [25]:
print(pf.idxmax())
print()
print(pf.T.idxmax())

predicted_cat
-1         business
 0         politics
 1    entertainment
 2            sport
 3         business
 4             tech
dtype: object

category
business         3
entertainment    1
politics         0
sport            2
tech             4
dtype: int64


In [26]:
test_df["processed_content"] = test_df["content"].apply(text_preprocessing) # likewise for the test set
test_df["enc_category"] = test_df["category"].map(cat_mapping)

In [27]:
test_content,test_cats = test_df["processed_content"].values ,test_df["enc_category"].values

In [28]:
topics,probs = topic_model.transform(test_content)

comparison = pd.concat([test_df["category"].reset_index().drop("index",axis=1),pd.Series(topics,name='predicted_cat')],axis=1)
comparison["count"]=1 
pf = pd.crosstab(index=comparison["category"],columns=comparison["predicted_cat"],values=comparison["count"],aggfunc="sum",normalize="index")*100
pf

Batches:   0%|          | 0/14 [00:00<?, ?it/s]

2024-12-22 22:43:51,070 - BERTopic - Dimensionality - Reducing dimensionality of input embeddings.
2024-12-22 22:43:52,272 - BERTopic - Dimensionality - Completed ✓
2024-12-22 22:43:52,272 - BERTopic - Clustering - Approximating new points with `hdbscan_model`
2024-12-22 22:43:52,293 - BERTopic - Cluster - Completed ✓


predicted_cat,0,1,2,3,4
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
business,3.921569,0.0,1.960784,89.215686,4.901961
entertainment,2.597403,92.207792,0.0,1.298701,3.896104
politics,95.238095,0.0,1.190476,2.380952,1.190476
sport,0.0,0.0,99.019608,0.980392,0.0
tech,3.75,0.0,0.0,2.5,93.75


In [29]:
print(pf.idxmax())
print()
print(pf.T.idxmax())

predicted_cat
0         politics
1    entertainment
2            sport
3         business
4             tech
dtype: object

category
business         3
entertainment    1
politics         0
sport            2
tech             4
dtype: int32


In [30]:
topic_model = BERTopic(min_topic_size=20) 
topics,probs = topic_model.fit_transform(sample_content)
topic_info = topic_model.get_topic_info()
topic_info

# Indeed, BERTopic is useful since it does not demand that a specific number of topics be set from the get-go.
# So long as a rough minimum topic size is provided (`min_topic_size`, the fewest articles a cluster needs to count as a topic),
# BERTopic will automatically cluster the articles into topics and outline the result in a dataframe as shown below. This can be
# more clearly visualised in a hierarchy of topics, showing the links between topics as well as outlining how defined
# 'categorised' topics like sport or entertainment can be subdivided by particular regions, players, genres etc.

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,12,-1_stern_viacom_season_say,"[stern, viacom, season, say, wrist, hoddle, wo...",[medium giant viacom pay 3.5 1.8 end investiga...
1,0,287,0_say_year_company_firm,"[say, year, company, firm, rise, market, bank,...",[house price increase 1.1 december monthly ris...
2,1,262,1_say_labour_party_election,"[say, labour, party, election, government, bla...",[tony blair launch attack conservative spendin...
3,2,246,2_say_people_technology_mobile,"[say, people, technology, mobile, service, gam...",[mobile phone enjoy boom time sale accord rese...
4,3,136,3_film_award_star_good,"[film, award, star, good, actor, oscar, year, ...",[martin scorsese aviator win good film oscar a...
5,4,112,4_club_united_liverpool_goal,"[club, united, liverpool, goal, chelsea, game,...",[arsene wenger step feud sir alex ferguson cla...
6,5,92,5_england_rugby_ireland_nation,"[england, rugby, ireland, nation, game, win, c...",[lansdowne road dublin sunday february 1500 gm...
7,6,79,6_music_band_album_song,"[music, band, album, song, number, record, say...",[memory soul legend ray charles dominate music...
8,7,57,7_roddick_win_match_play,"[roddick, win, match, play, set, open, seed, n...",[andre agassi erratic display edge fourth roun...
9,8,52,8_olympic_race_win_athlete,"[olympic, race, win, athlete, athens, indoor, ...",[britain jason gardener shake upset stomach wi...


In [31]:
hierarchical_topics = topic_model.hierarchical_topics(sample_content)
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

# For example, in sports, we can see that our articles can be split across Olympic events and also those in England/ the UK
# Given that this is a BBC dataset - it does make sense to have sufficient articles related to sports in the UK to warrant 
# such a number of topics arising

# We can also see links between technology and business as well as politics - pointing to how technology is at the
# forefront of many businesses today and indeed how governments are aware of its impact - certainly more so than
# with sports. Together with LDA, topic modelling techniques can provide various insights into categories that may
# already be defined: from the particular phrases that characterise them, to how these categories are related to one
# another, and how they may be further subdivided to entice particular audiences as well.

100%|██████████| 8/8 [00:00<00:00, 234.69it/s]


## 3. Text Generation

As shown above - topic modelling can prove to be a useful part of NLP in clustering articles of a related nature together. Particularly with regards to BERTopic, this can allow for useful, readable information pertaining to the class of articles to be quickly derived from unlabelled articles. However, it is quite clear that labels such as `'film_award_star'` and `'say_people_technology'`, while understandable, aren't immediately appealing for people to read and make sense of on the fly. This is likewise the case for extremely long articles with no clear headers about them. 

What is required is a good summarisation - a good TITLE that can capture people's attention. Such text ought to be generated quickly and reflect the tones of categories/ articles sufficiently well. Multiple pretrained Sequence2Sequence models, as well as transformers, have been developed for this - and given the great performance of BERTopic as seen previously - a similar approach shall be taken to see if ML Algos can generate titles for articles as good as those we can write ourselves!

In [2]:
model_name = "czearing/article-title-generator" 
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Information on this model can be found on Hugging Face via https://huggingface.co/czearing/article-title-generator
# This model in particular was selected owing to its similarity with the task of title generation I aimed to accomplish.
# The model is loaded and utilised directly - without any further training - owing to hardware limitations, yet because it was
# originally trained for a similar task, its results appear to be fairly impressive indeed. 

In [3]:
df = pd.read_csv(file_path,sep='\t')
r_state = 27
train_df,test_df = train_test_split(df,train_size=0.8,random_state=r_state,stratify = df['category'])

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   category  2225 non-null   object
 1   filename  2225 non-null   object
 2   title     2225 non-null   object
 3   content   2225 non-null   object
dtypes: object(4)
memory usage: 69.7+ KB


In [5]:
train, val = train_test_split(train_df,train_size=0.75,random_state=r_state,stratify=train_df['category'])
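
The two chained splits above amount to a 60/20/20 train/validation/test partition of the 2225 articles. A minimal sketch of the arithmetic follows, assuming the training portion is rounded down to a whole number of rows. The helper `split_sizes` is hypothetical and only mirrors the sizes, not the stratified sampling itself.

```python
from fractions import Fraction
from math import floor

def split_sizes(n_rows, train_fraction):
    """Return (n_kept, n_rest) when floor(n_rows * train_fraction) rows are kept."""
    n_kept = floor(n_rows * Fraction(train_fraction))
    return n_kept, n_rows - n_kept

# First split: 80% for train + validation, 20% for test
n_train_val, n_test = split_sizes(2225, "0.8")
# Second split: 75% of the remainder for train, 25% for validation
n_train, n_val = split_sizes(n_train_val, "0.75")

print(n_train, n_val, n_test)  # 1335 445 445
```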

In [6]:
train["title"].str.split(" ").apply(lambda x:len(x)).value_counts() # The length of the title is analysed to give a rough approximation 
# to the maximum length of titles which the model should not exceed

title
5    618
6    384
4    229
7     78
3     23
8      2
9      1
Name: count, dtype: int64

In [7]:
max_title_len = 15 # The max title length (in the model) is based off tokens, not words. Given how sites such as
# https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them suggest the approximation of using 
# 4 tokens for 3 words, and the fact that the sole splitting of spaces to count words isn't exactly the best metric - to err
# on the side of caution - I have placed the maximum title length (of tokens) to be 15 instead of 12. This can be amended as 
# one wants - but ultimately proves to be sufficient in my opinion. 
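
The reasoning behind the value 15 can be written out as a small calculation: the longest title seen in the training split has 9 words, the rule of thumb of 4 tokens per 3 words turns that into 12 tokens, and a margin is added on top. The helper `token_budget` and the margin of 3 are assumptions chosen to reproduce the value used above, not something the tokenizer prescribes.

```python
def token_budget(max_words, tokens=4, words=3, safety_margin=3):
    """Estimate a token limit from a word count using a tokens-per-words ratio."""
    # Integer ceiling division avoids floating-point rounding surprises
    estimate = (max_words * tokens + words - 1) // words
    return estimate + safety_margin

longest_title_in_words = 9  # largest word count in the value_counts output above
print(token_budget(longest_title_in_words, safety_margin=0))  # 12
print(token_budget(longest_title_in_words))                   # 15
```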

In [8]:
# Conversion of each Pandas DataFrame to a Hugging Face Dataset
train_dataset = Dataset.from_pandas(train)
val_dataset = Dataset.from_pandas(val)
test_dataset = Dataset.from_pandas(test_df)

In [9]:
def tokenize_function(examples, feature_col="content", target_col="title"):
    # Tokenises the article (feature_col) as the input and the title (target_col) as the target
    inputs = tokenizer(examples[feature_col], padding="max_length", truncation=True, max_length=512)
    targets = tokenizer(examples[target_col], padding="max_length", truncation=True, max_length=max_title_len)
    inputs['labels'] = targets['input_ids']  # Set the tokenised titles as labels
    return inputs

# Tokenise all datasets (train, val, test)
train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/1335 [00:00<?, ? examples/s]

Map:   0%|          | 0/445 [00:00<?, ? examples/s]

Map:   0%|          | 0/445 [00:00<?, ? examples/s]
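
The `padding="max_length"` and `truncation=True` arguments used in `tokenize_function` force every sequence to one fixed length. The sketch below shows the idea on plain lists of token ids. It is a simplification: the helper names are hypothetical, the pad id of 0 is an assumption, and a real tokenizer also handles special tokens such as the end-of-sequence marker when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut the sequence to max_length, or right-pad it with pad_id up to max_length."""
    clipped = list(token_ids[:max_length])
    return clipped + [pad_id] * (max_length - len(clipped))

def attention_mask(token_ids, max_length):
    """1 marks a real token position, 0 marks a padded position."""
    n_real = min(len(token_ids), max_length)
    return [1] * n_real + [0] * (max_length - n_real)

short_ids = [5, 6, 7]
long_ids = [1, 2, 3, 4, 5, 6]

print(pad_or_truncate(short_ids, 5))  # [5, 6, 7, 0, 0]
print(attention_mask(short_ids, 5))   # [1, 1, 1, 0, 0]
print(pad_or_truncate(long_ids, 4))   # [1, 2, 3, 4]
```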

In [10]:
rand_shuffle = test_dataset.shuffle(seed=r_state)
size_shuffle = 25 # the number of titles to be generated for and compared against (advised to keep the number constrained)

In [11]:
# Generate titles for the first `size_shuffle` articles of the shuffled test set
generated_titles = []

for article in rand_shuffle['input_ids'][:size_shuffle]:
    generated_ids = model.generate(torch.from_numpy(np.array(article)).unsqueeze(0), num_beams=4, min_length = 3
        ,max_length=max_title_len, early_stopping=True)
    generated_title = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    generated_titles.append(generated_title)
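
The `num_beams=4` argument above asks for beam search: instead of committing to the single most likely next token at every step, the decoder keeps several partial titles alive and extends each of them. The toy sketch below illustrates the idea only. The probability table is hypothetical, and the real `generate` call scores tokens with the model and applies further rules (length handling, early stopping) that are not reproduced here.

```python
import math

# Hypothetical next-token probabilities keyed by the previous token.
# "<s>" starts a sequence and "</s>" ends it.
NEXT_TOKEN_PROBS = {
    "<s>": {"blair": 0.6, "clarke": 0.4},
    "blair": {"buys": 0.55, "faces": 0.45},
    "clarke": {"faces": 0.9, "buys": 0.1},
    "buys": {"</s>": 1.0},
    "faces": {"</s>": 1.0},
}

def beam_search(num_beams, max_steps):
    """Keep the num_beams highest-scoring partial sequences at every step."""
    beams = [(0.0, ["<s>"])]  # (cumulative log-probability, tokens so far)
    for _ in range(max_steps):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "</s>":
                candidates.append((score, tokens))  # finished beams carry over unchanged
                continue
            for token, prob in NEXT_TOKEN_PROBS[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
    return beams

greedy = beam_search(num_beams=1, max_steps=3)[0]
wider = beam_search(num_beams=2, max_steps=3)[0]
print(greedy[1], round(math.exp(greedy[0]), 2))  # ['<s>', 'blair', 'buys', '</s>'] 0.33
print(wider[1], round(math.exp(wider[0]), 2))    # ['<s>', 'clarke', 'faces', '</s>'] 0.36
```

With a single beam the search behaves greedily and settles on the locally best first word, whereas two beams are enough here to recover the sequence with the higher overall probability.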

In [12]:
scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True) 

# The ROUGE score is a metric commonly used (alongside the BLEU score) for the evaluation of machine text generation and summarisation.
# ROUGE-L is used here rather than BLEU, since BLEU is geared more towards machine translation tasks.

# https://medium.com/nlplanet/two-minutes-nlp-learn-the-rouge-metric-by-examples-f179cc285499
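
To make the metric concrete: ROUGE-L is built on the longest common subsequence (LCS) of words shared by the reference title and the generated title. Precision is the LCS length divided by the generated length, recall is the LCS length divided by the reference length, and the F-measure is their harmonic mean. The sketch below is a simplified plain-Python version. It assumes lowercased alphanumeric tokens and skips the stemming that `use_stemmer=True` applies, so it can differ from the library on some inputs. The function names are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]

def rouge_l(reference, candidate):
    """Return (precision, recall, f_measure) based on the LCS of the two texts."""
    ref, cand = tokenize(reference), tokenize(candidate)
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return precision, recall, 2 * precision * recall / (precision + recall)

_, _, f_melzer = rouge_l("Melzer shocks Agassi", "Jurgen Melzer beat Andre Agassi 6-3 6-1")
print(round(f_melzer, 6))  # 0.333333
```

Here the shared subsequence is "melzer agassi" (2 tokens) against 3 reference tokens and 9 generated tokens, giving an F-measure of 1/3.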

In [13]:
res = pd.DataFrame(columns = ["Actl_Title","Gen_Title","f_measure"])

In [14]:
# Score each generated title against the actual title (ROUGE-L F-measure) and collect the results
for actl_title, pred_title in zip(rand_shuffle['title'][:size_shuffle], generated_titles):
    score = scorer.score(actl_title,pred_title)
    res.loc[len(res)] = [actl_title,pred_title,score['rougeL'].fmeasure]

In [15]:
pd.set_option('max_colwidth', 80) 
res.sort_values("f_measure",ascending=False)

# We can see how the first few generated titles do indeed seem to measure up to the actual titles quite well! Of course, by no means 
# are they perfect replicas - indeed further training of the loaded model using the train dataset is likely to improve performance
# (though this is computationally expensive and is best done with a GPU) - yet it shows how ML models of today can be used in tandem with 
# our capabilities to create eye-catching titles that retain information very well. Indeed - this entire project aims to highlight the 
# benefits of AI when used by the right hands in improving our efficiency and abilities - which indeed can be extended to more than solely
# news articles like above ;}

Unnamed: 0,Actl_Title,Gen_Title,f_measure
8,Clarke faces ID cards rebellion,Home Secretary Charles Clarke faces backbench rebellion over ID cards bill,0.5
23,Blair buys copies of new Band Aid,Prime Minister Tony Blair bought two copies of Band Aid 20 in Edinburgh,0.5
4,Women MPs reveal sexist taunts,"Women MPs endure ""shocking"" levels of sexist abuse",0.461538
18,UK troops on Ivory Coast standby,British troops on standby to help evacuate British citizens from Ivory Coast,0.444444
17,Beckham relief as Real go through,David Beckham expresses relief at Real Madrid's passage to Champions,0.352941
5,Collins to compete in Birmingham,World and Commonwealth 100m champion Kim Collins will compete in the 60m at,0.333333
19,Melzer shocks Agassi,Jurgen Melzer beat Andre Agassi 6-3 6-1,0.333333
15,Jamelia's return to the top,Jamelia Davis reveals why she's still trying to make it in,0.333333
22,Mobiles rack up 20 years of use,20 years since the first mobile phone call,0.266667
14,Security warning over 'FBI virus',Emails purporting to be from the FBI contain a computer virus,0.25
