# BBC News classification 

## Goal

The goal is to classify news articles into their topics in an unsupervised way, specifically with matrix factorization. Here I use non-negative matrix factorization (NMF). The dataset is available at:
https://www.kaggle.com/competitions/learn-ai-bbc/code

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF, TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    adjusted_rand_score,
    classification_report,
    confusion_matrix,
    normalized_mutual_info_score,
)
from sklearn.model_selection import GridSearchCV, train_test_split
from scipy.optimize import linear_sum_assignment

## EDA 

In [2]:
data_train = pd.read_csv('BBC News Train.csv')
data_train

Unnamed: 0,ArticleId,Text,Category
0,1833,worldcom ex-boss launches defence lawyers defe...,business
1,154,german business confidence slides german busin...,business
2,1101,bbc poll indicates economic gloom citizens in ...,business
3,1976,lifestyle governs mobile choice faster bett...,tech
4,917,enron bosses in $168m payout eighteen former e...,business
...,...,...,...
1485,857,double eviction from big brother model caprice...,entertainment
1486,325,dj double act revamp chart show dj duo jk and ...,entertainment
1487,1590,weak dollar hits reuters revenues at media gro...,business
1488,1587,apple ipod family expands market apple has exp...,tech


1. Check for null values across the training set

In [3]:
data_train.isnull().sum()

ArticleId    0
Text         0
Category     0
dtype: int64

2. Check and remove duplicates

In [4]:
data_train.duplicated(subset=['Text', 'Category']).sum()

50

In [5]:
data_train_cleaned = data_train.drop_duplicates(subset = ['Text', 'Category'])

In [6]:
data_train_cleaned 

Unnamed: 0,ArticleId,Text,Category
0,1833,worldcom ex-boss launches defence lawyers defe...,business
1,154,german business confidence slides german busin...,business
2,1101,bbc poll indicates economic gloom citizens in ...,business
3,1976,lifestyle governs mobile choice faster bett...,tech
4,917,enron bosses in $168m payout eighteen former e...,business
...,...,...,...
1485,857,double eviction from big brother model caprice...,entertainment
1486,325,dj double act revamp chart show dj duo jk and ...,entertainment
1487,1590,weak dollar hits reuters revenues at media gro...,business
1488,1587,apple ipod family expands market apple has exp...,tech


3. Build fractional samples of the dataset for model tuning later

In [7]:
# Take a random 10% sample
data_train_cleaned_10 = data_train_cleaned.sample(frac=0.1, random_state=42)
data_train_cleaned_10

Unnamed: 0,ArticleId,Text,Category
169,1066,bellamy under new fire newcastle boss graeme s...,sport
612,1570,schools to take part in mock poll record numbe...,politics
555,998,rap boss arrested over drug find rap mogul mar...,entertainment
65,96,dent continues adelaide progress american tayl...,sport
638,355,neeson in bid to revive theatre hollywood film...,entertainment
...,...,...,...
1477,883,web logs aid disaster recovery some of the mos...,tech
641,362,us woman sues over cartridges a us woman is su...,tech
702,32,firefox browser takes on microsoft microsoft s...,tech
689,491,giggs handed wales leading role ryan giggs wil...,sport


## Modeling 

1. Vectorize the text (turn it into a document-term matrix) using TF-IDF, which weights each word by how often it occurs in a document and how rare it is across documents

In [8]:
texts = data_train_cleaned['Text'].tolist()
vectorizer = TfidfVectorizer()  # pass max_features=... (e.g. 500) for the tuning experiments
X_train = vectorizer.fit_transform(texts)
# vectorizer.get_feature_names_out()
X_train.toarray()

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.02045097, 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.02409841, 0.        , ..., 0.        , 0.        ,
        0.        ]])
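
To make the weighting concrete, here is a minimal from-scratch sketch of TF-IDF. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that `TfidfVectorizer` applies by default, but it tokenizes with a plain whitespace split, so it is an illustration rather than a drop-in replacement. The function name `tfidf` and the toy documents are made up for this example.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalized rows."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

# Toy corpus (hypothetical, not taken from the BBC dataset).
docs = ["stocks fall as growth slows", "growth in film awards", "film wins award"]
vocab, rows = tfidf(docs)
```

In the second toy document, "in" appears in only one document while "growth" appears in two, so "in" receives the larger weight even though both occur once.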

2. Check how many topics the true labels contain

In [9]:
data_train_cleaned['Category'].unique()

array(['business', 'tech', 'politics', 'sport', 'entertainment'],
      dtype=object)

3. Building NMF model 

In [10]:
n_topics = 5  # choose based on your dataset
nmf = NMF(n_components=n_topics)
W = nmf.fit_transform(X_train)   # [n_docs × n_topics]
H = nmf.components_        # [n_topics × n_words]

In [11]:
W

array([[0.04526626, 0.04636159, 0.00727894, 0.        , 0.00158877],
       [0.15364127, 0.        , 0.        , 0.        , 0.        ],
       [0.11399654, 0.02019131, 0.01160904, 0.01235422, 0.03983508],
       ...,
       [0.14628174, 0.        , 0.00342779, 0.        , 0.        ],
       [0.04154235, 0.        , 0.        , 0.00849862, 0.19583675],
       [0.0260935 , 0.        , 0.00672267, 0.00296844, 0.11141211]])

In [12]:
H

array([[3.96461410e-04, 1.03204129e-01, 8.45023149e-05, ...,
        4.91740295e-04, 2.77877674e-04, 0.00000000e+00],
       [7.55868529e-04, 4.04072338e-02, 1.96064181e-04, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [6.17545869e-04, 0.00000000e+00, 2.57060494e-04, ...,
        9.47998586e-03, 3.59726027e-04, 5.29288447e-03],
       [5.72065023e-04, 2.20706730e-02, 0.00000000e+00, ...,
        0.00000000e+00, 1.42998951e-03, 1.54304372e-03],
       [1.08916014e-03, 5.84294698e-02, 5.21710673e-04, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00]])
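
For intuition about what the factorization does, the sketch below approximates a small non-negative matrix `V` as `W @ H` using the classic multiplicative update rules. This is a swapped-in technique for illustration: scikit-learn's `NMF` uses a coordinate descent solver by default, so this is not the code that produced the output above. The helper names and the toy matrix are assumptions made for the example.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(V, W, H):
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh)) ** 0.5

def nmf_multiplicative(V, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor V (docs x words) into W (docs x topics) and H (topics x words)."""
    rng = random.Random(seed)
    n_docs, n_words = len(V), len(V[0])
    W = [[rng.random() for _ in range(n_topics)] for _ in range(n_docs)]
    H = [[rng.random() for _ in range(n_words)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
    return W, H

# Toy matrix with two obvious "topics": words 0-1 versus words 2-3.
V = [
    [1.0, 0.9, 0.0, 0.0],
    [0.8, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.9],
    [0.0, 0.0, 0.7, 1.0],
]
W, H = nmf_multiplicative(V, n_topics=2)
# Assign each document to its strongest topic, as done with np.argmax on W.
labels = [max(range(len(row)), key=row.__getitem__) for row in W]
```

Because the updates only multiply by non-negative ratios, `W` and `H` stay non-negative throughout, which is what makes the rows of `H` readable as word weights per topic.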

4. Assign each document to the topic with which it has the highest weight in W

In [13]:
predicted_labels = np.argmax(W, axis=1)
predicted_labels

array([1, 0, 0, ..., 0, 4, 4])

## Accuracy

1. Match the predicted topic indices to the true labels by inspecting a few documents

In [14]:
data_train_cleaned ['Category']

0            business
1            business
2            business
3                tech
4            business
            ...      
1485    entertainment
1486    entertainment
1487         business
1488             tech
1489             tech
Name: Category, Length: 1440, dtype: object

In [15]:
# Topic index -> category name, chosen by inspecting a few documents
topic_to_category = {0: 'business', 1: 'politics', 2: 'sport', 3: 'entertainment', 4: 'tech'}
predicted_train_new = np.array([topic_to_category[label] for label in predicted_labels])
acc_train = accuracy_score(data_train_cleaned['Category'], predicted_train_new)
acc_train

0.875

2. Use the confusion matrix plus the Hungarian algorithm (or a brute-force search) to find the best label mapping and the resulting accuracy

In [16]:
# True labels as integer codes (categories are coded in alphabetical order)
true_labels = data_train_cleaned["Category"].astype("category").cat.codes.values

# Confusion matrix: rows are true classes, columns are predicted NMF topics
conf_matrix = confusion_matrix(true_labels, predicted_labels)
conf_matrix

array([[306,  12,   2,   0,  15],
       [ 14,   5,  20, 188,  36],
       [ 23, 207,   2,   0,  34],
       [  2,   0, 337,   2,   1],
       [  7,   1,   2,   2, 222]])

In [17]:
# Apply Hungarian algorithm to find best label mapping
row_ind, col_ind = linear_sum_assignment(-conf_matrix)

# Map predicted labels to best-aligned true labels
label_mapping = dict(zip(col_ind, row_ind))
mapped_preds = np.array([label_mapping[label] for label in predicted_labels])

# Accuracy is meaningful now that topic indices are aligned with class codes
acc = accuracy_score(true_labels, mapped_preds)
print(f"Adjusted Accuracy: {acc:.3f}")

Adjusted Accuracy: 0.875
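
With only five topics there are just 5! = 120 possible mappings, so the brute-force alternative is cheap. The sketch below tries every permutation on the confusion matrix copied from the output above and keeps the one with the most correctly matched documents. The function name `best_mapping_brute_force` is made up for this example.

```python
from itertools import permutations

# Confusion matrix from the output above (rows: true class codes, columns: NMF topics).
conf = [
    [306, 12, 2, 0, 15],
    [14, 5, 20, 188, 36],
    [23, 207, 2, 0, 34],
    [2, 0, 337, 2, 1],
    [7, 1, 2, 2, 222],
]

def best_mapping_brute_force(conf):
    """Try every class-to-topic assignment and keep the one with the most hits."""
    n = len(conf)
    best_hits, best_perm = -1, None
    # perm[true_class] is the topic assigned to that class.
    for perm in permutations(range(n)):
        hits = sum(conf[true][topic] for true, topic in enumerate(perm))
        if hits > best_hits:
            best_hits, best_perm = hits, perm
    topic_to_class = {topic: true for true, topic in enumerate(best_perm)}
    total = sum(map(sum, conf))
    return topic_to_class, best_hits / total

mapping, accuracy = best_mapping_brute_force(conf)
print(mapping, round(accuracy, 3))  # {0: 0, 3: 1, 1: 2, 2: 3, 4: 4} 0.875
```

The result agrees with both the hand-built mapping and the Hungarian algorithm: 1260 of 1440 documents land on the diagonal after remapping.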


3. Use the adjusted Rand index (ARI) and normalized mutual information (NMI), which do not depend on how the clusters are numbered

In [18]:
ari = adjusted_rand_score(data_train_cleaned ['Category'], predicted_labels)
nmi = normalized_mutual_info_score(data_train_cleaned ['Category'], predicted_labels)

print(f"ARI: {ari:.3f}, NMI: {nmi:.3f}")

ARI: 0.729, NMI: 0.710
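
Both scores can be computed from label counts alone, which is why no topic-to-category mapping is needed. The following is a minimal standard-library sketch of the two definitions, assuming at least two samples and, for NMI, normalization by the arithmetic mean of the two entropies (scikit-learn's default). It is meant to show the arithmetic and is not a replacement for the scikit-learn functions.

```python
import math
from collections import Counter

def _pairs(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2

def adjusted_rand_index(labels_true, labels_pred):
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    sum_joint = sum(_pairs(c) for c in joint.values())
    sum_true = sum(_pairs(c) for c in Counter(labels_true).values())
    sum_pred = sum(_pairs(c) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / _pairs(n)  # agreement expected by chance
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (sum_joint - expected) / (maximum - expected)

def normalized_mutual_info(labels_true, labels_pred):
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    n_true, n_pred = Counter(labels_true), Counter(labels_pred)
    mi = sum(c / n * math.log(c * n / (n_true[t] * n_pred[p]))
             for (t, p), c in joint.items())
    h_true = -sum(c / n * math.log(c / n) for c in n_true.values())
    h_pred = -sum(c / n * math.log(c / n) for c in n_pred.values())
    mean_entropy = (h_true + h_pred) / 2
    return mi / mean_entropy if mean_entropy > 0 else 1.0

# Same grouping with different cluster numbers scores a perfect 1.0 on both.
ari_same = adjusted_rand_index(["a", "a", "b", "b"], [1, 1, 0, 0])
nmi_same = normalized_mutual_info(["a", "a", "b", "b"], [1, 1, 0, 0])
```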


4. Display the 10 highest-weighted words for each topic

In [19]:
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

for topic_idx, topic in enumerate(H):
    top_words = [feature_names[i] for i in topic.argsort()[-10:][::-1]]
    print(f"Topic {topic_idx}: {' | '.join(top_words)}")

Topic 0: the | in | of | to | and | its | us | growth | said | economy
Topic 1: the | mr | to | he | labour | election | blair | of | and | party
Topic 2: the | to | and | in | he | we | his | of | but | england
Topic 3: the | film | best | and | awards | in | of | for | award | her
Topic 4: the | to | of | and | that | is | are | people | it | in
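
The lists above are dominated by very common words such as "the", "of" and "and", which makes the topics hard to tell apart. One option is to skip such words when printing; the sketch below does that with a hypothetical helper `top_words` and a small hand-picked stop list, both assumptions made for this example. `TfidfVectorizer` also accepts `stop_words='english'`, which would remove them before the factorization, but that was not tried here.

```python
def top_words(topic_weights, feature_names, k=10, stop_words=frozenset()):
    """Return the k highest-weighted words of a topic, skipping stop words."""
    ranked = sorted(range(len(topic_weights)),
                    key=lambda i: topic_weights[i], reverse=True)
    words = [feature_names[i] for i in ranked if feature_names[i] not in stop_words]
    return words[:k]

# Hypothetical toy vocabulary and weights, for illustration only.
feature_names = ["the", "film", "of", "award", "economy", "and"]
topic = [0.90, 0.60, 0.55, 0.40, 0.01, 0.50]
stop_words = {"the", "of", "and", "in", "to"}

print(top_words(topic, feature_names, k=3))                         # ['the', 'film', 'of']
print(top_words(topic, feature_names, k=3, stop_words=stop_words))  # ['film', 'award', 'economy']
```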


## Supervised model building and training using logistic regression

In [23]:
# Split data into training and test sets (80% train, 20% test)
X_train_text, X_test_text, y_train, y_test = train_test_split(
    data_train_cleaned['Text'],           # Input texts
    data_train_cleaned['Category'],          # Labels
    test_size=0.2,   # 20% of data for testing
    random_state=42  # For reproducibility
)

# Convert text to TF-IDF (after splitting to avoid data leakage)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_text)  # Fit on training data only
X_test = vectorizer.transform(X_test_text)        # Transform test data

# Train and evaluate the model
model = LogisticRegression()
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Test Accuracy: {accuracy:.2f}")

Test Accuracy: 0.97


Note: I did not apply the model to the Kaggle test set because the labels in the solution file do not appear to match the articles.

## NMF Model tuning and comparison

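Varying `max_features` in `TfidfVectorizer` shows that accuracy rises with vocabulary size, but with diminishing returns rather than linearly: most of the gain comes between 500 and 5,000 features, and going from 5,000 to 10,000 adds only 0.006.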

| `max_features` | Accuracy |
| :-: | :-: |
| 500 | 0.772 |
| 1000 | 0.828 |
| 5000 | 0.869 |
| 10000 | 0.875 |
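
For context, `max_features` keeps only the terms with the highest frequency across the corpus, so small values discard the rarer, often more topic-specific words. A minimal sketch of that selection step, assuming whitespace tokenization and a toy corpus (the function name `top_k_vocabulary` is made up for this example):

```python
from collections import Counter

def top_k_vocabulary(docs, max_features):
    """Keep the max_features most frequent terms across all documents."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    return sorted(tok for tok, _ in counts.most_common(max_features))

# Toy corpus (hypothetical): "the" occurs 5 times, "film" twice, the rest once.
docs = ["the film won the award", "the economy and the film", "the election"]
print(top_k_vocabulary(docs, max_features=2))  # ['film', 'the']
```

With only two features, words like "economy" and "election" are dropped even though they are the ones that separate the topics.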

Varying the amount of training data, the 30% sample gives the highest accuracy, although the spread across fractions is small (0.868 to 0.884) and may not be significant.

| Data fraction | Accuracy |
| :-: | :-: |
| 10% | 0.868 |
| 30% | 0.884 |
| 50% | 0.875 |
| 70% | 0.869 |

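Supervised vs. unsupervised (the NMF accuracy is measured on the same data the model was fit on, while the logistic regression accuracy is measured on a held-out 20% split):

| Model | Accuracy |
| :-: | :-: |
| NMF | 0.875 - 0.884 |
| Logistic regression | 0.97 |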

## Future work

1. Build SVD models for comparison
2. Hold out part of the training set as a test set and evaluate the NMF model on it as well
3. Use grid search to find the best hyperparameters for the supervised model

## Reference techniques

TF-IDF: https://www.geeksforgeeks.org/understanding-tf-idf-term-frequency-inverse-document-frequency/

Hungarian algorithm: https://www.geeksforgeeks.org/hungarian-algorithm-assignment-problem-set-1-introduction/

Adjusted Rand score: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html

Normalized mutual information score: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.normalized_mutual_info_score.html