# BBC News Classification: Unsupervised vs. Supervised Learning

I will use the publicly available BBC News dataset, which can be accessed at https://www.kaggle.com/competitions/learn-ai-bbc/overview. My approach is to apply Non-Negative Matrix Factorization (NMF) for topic modeling, then train a supervised learning model for classification. I will then compare the performance of the two methods to evaluate how well each classifies the articles.
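To make the comparison concrete, here is a minimal sketch of the two approaches side by side. It assumes a tiny hypothetical corpus (`docs`, `labels`) rather than the BBC articles, and the NMF settings (`init="nndsvda"`, `max_iter=500`) are illustrative choices, not tuned values.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus; the real notebook uses the BBC articles.
docs = [
    "stock market shares profit bank",
    "bank profit economy shares growth",
    "football match goal team win",
    "team win league football coach",
]
labels = ["business", "business", "sport", "sport"]

# Shared feature extraction: one TF-IDF row per document.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

# Unsupervised: factorize X ~ W @ H with one component per expected topic.
nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = nmf.fit_transform(X)        # document-topic weights, shape (4, 2)
topic_ids = W.argmax(axis=1)    # dominant topic index per document

# Supervised: fit a classifier directly on the labelled TF-IDF features.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)
pred = clf.predict(X)

print(topic_ids)
print(pred)
```

Note that NMF only returns topic indices, so comparing it with a classifier requires an extra step that maps each topic index to a category name, for example by majority vote over labelled articles.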

## 1. Extract Word Features and Perform Exploratory Data Analysis (EDA): Inspect, Visualize, and Clean the Data


In [2]:
# basics
import numpy as np
import itertools
import random
import os

# EDA
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Data preprocessing
import re
from wordcloud import WordCloud, STOPWORDS
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

# Unsupervised learning
from sklearn.decomposition import NMF

# Supervised learning
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# helper functions
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import GridSearchCV

In [12]:
for dirname, _, filenames in os.walk('./data'):
    for filename in filenames:
        if not filename.startswith('.'):  # Skip hidden files
            print(os.path.join(dirname, filename))

./data/BBC News Train.csv
./data/BBC News Test.csv
./data/BBC News Sample Solution.csv


In [14]:
train_data = pd.read_csv('./data/BBC News Train.csv')
test_data = pd.read_csv('./data/BBC News Test.csv')
sample_data = pd.read_csv('./data/BBC News Sample Solution.csv')

In [17]:
train_data.head()

|   | ArticleId | Text | Category |
|---|---|---|---|
| 0 | 1833 | worldcom ex-boss launches defence lawyers defe... | business |
| 1 | 154 | german business confidence slides german busin... | business |
| 2 | 1101 | bbc poll indicates economic gloom citizens in ... | business |
| 3 | 1976 | lifestyle governs mobile choice faster bett... | tech |
| 4 | 917 | enron bosses in $168m payout eighteen former e... | business |


An initial look shows that there are 3 columns: ArticleId, Text, and Category.
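Beyond `head()`, a few more calls help confirm the structure and the class balance. The sketch below runs on a small hypothetical frame named `toy` that only mimics the schema shown above; on the real data the same calls would be applied to `train_data`.

```python
import pandas as pd

# Hypothetical rows that mimic the schema above (not real BBC data).
toy = pd.DataFrame({
    "ArticleId": [1, 2, 3, 4],
    "Text": ["shares rise", "team wins cup", "new phone released", "bank cuts rates"],
    "Category": ["business", "sport", "tech", "business"],
})

print(toy.shape)     # (number of rows, number of columns)
print(toy.dtypes)    # column types: integer id, object (string) text and label

# Number of articles per category, to spot class imbalance early.
category_counts = toy["Category"].value_counts()
print(category_counts)
```

A strongly skewed `value_counts()` result would suggest looking at per-class metrics later, not just overall accuracy.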

In [18]:
train_data.isnull().sum()

ArticleId    0
Text         0
Category     0
dtype: int64

There are no null values in any of the three columns, so no rows need to be dropped or imputed for missing data.
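A null check does not catch every data-quality problem in text data. As a hedged sketch, the snippet below also counts blank strings and repeated article texts. It uses a hypothetical frame `toy`, not the BBC data, so the counts say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical rows: one whitespace-only text and one repeated text.
toy = pd.DataFrame({
    "ArticleId": [1, 2, 3, 4],
    "Text": ["shares rise", "   ", "shares rise", "team wins cup"],
    "Category": ["business", "tech", "business", "sport"],
})

# Missing values across all columns (what isnull().sum() reports, summed).
n_null = int(toy.isnull().sum().sum())

# Texts that are empty once surrounding whitespace is removed.
n_blank = int(toy["Text"].str.strip().eq("").sum())

# Rows whose Text repeats an earlier row; the first occurrence is not counted.
n_dup_text = int(toy.duplicated(subset="Text").sum())

print(n_null, n_blank, n_dup_text)
```

If the same checks on `train_data` report duplicates, `drop_duplicates(subset="Text")` is one way to remove them before vectorizing.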