# BBC News Summary - ML/DL Project

A media monitoring project that classifies BBC news articles, discovers latent topics, and trains a reinforcement learning (RL) agent that decides whether to handle a news item with a classical ML model, handle it with a deep learning model, or escalate it to a human reviewer.

## Project Overview

This notebook implements:
1. **News Article Classification** - Classify articles into categories (business, entertainment, politics, sport, tech)
2. **Topic Discovery** - Group articles into topics using clustering
3. **Reinforcement Learning Agent** - Decide when to use classical ML, deep learning, or human escalation
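
To make the third item more concrete, here is a minimal sketch of how a three-way routing decision could be learned from reward feedback. It uses an epsilon-greedy bandit, which is a deliberate simplification and not necessarily the agent built in this notebook. The action names, the `EpsilonGreedyRouter` class, and the reward values are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical action names; the overview only states that the agent chooses
# among a classical model, a deep model, and human escalation.
ACTIONS = ["classical_ml", "deep_learning", "escalate_to_human"]


class EpsilonGreedyRouter:
    """Toy epsilon-greedy bandit that tracks a running mean reward per action."""

    def __init__(self, actions, epsilon=0.1, seed=42):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {a: 0 for a in self.actions}
        self.values = {a: 0.0 for a in self.actions}

    def select(self):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.values[a])

    def update(self, action, reward):
        # Incremental mean update: Q <- Q + (r - Q) / n
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]


# Made-up mean rewards (for example, accuracy minus cost); not measured results.
ASSUMED_MEAN_REWARD = {
    "classical_ml": 0.6,
    "deep_learning": 0.7,
    "escalate_to_human": 0.4,
}

router = EpsilonGreedyRouter(ACTIONS)
env_rng = random.Random(0)
for _ in range(2000):
    action = router.select()
    reward = ASSUMED_MEAN_REWARD[action] + env_rng.gauss(0, 0.1)
    router.update(action, reward)

best = max(router.values, key=router.values.get)
print("Preferred action:", best)
print("Times each action was chosen:", router.counts)
```

A bandit ignores the content of each article. A fuller agent would likely condition its choice on per-article signals such as model confidence, but that design is outside what this overview specifies.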


### 0. Imports & global config

In [None]:
# Standard library imports
import os
import re
import string

# IPython display for multiple outputs
from IPython.display import display

# Data manipulation and analysis
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning - model selection and evaluation
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    f1_score
)

# Machine learning - feature extraction and dimensionality reduction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Deep learning - TensorFlow/Keras
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense, Dropout

# Global configuration
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (8, 5)

# Set random seeds for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
tf.random.set_seed(RANDOM_STATE)

print("Libraries imported successfully!")

Libraries imported successfully!


### Part A: Data Mining and Preprocessing


#### A1. Load BBC dataset

**Dataset structure:**
```
BBC-News-Summary-ML-DL-Project/
└─ dataset/
   └─ News Articles/
       ├─ business/
       ├─ entertainment/
       ├─ politics/
       ├─ sport/
       └─ tech/
```
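
Before loading, it can help to confirm that the folder layout above is actually present. The following is a small sketch using only the standard library. The helper name `missing_category_dirs` and the `EXPECTED_CATEGORIES` tuple are assumptions introduced here, and the relative path assumes the notebook runs from the project root.

```python
from pathlib import Path

# Category folders named in the dataset structure above.
EXPECTED_CATEGORIES = ("business", "entertainment", "politics", "sport", "tech")


def missing_category_dirs(data_dir, expected=EXPECTED_CATEGORIES):
    """Return the expected category folders that are absent under data_dir."""
    root = Path(data_dir)
    return [name for name in expected if not (root / name).is_dir()]


# Assumed relative path; adjust if the notebook runs from another directory.
missing = missing_category_dirs("dataset/News Articles")
if missing:
    print("Missing category folders:", missing)
else:
    print("All expected category folders found.")
```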

In [14]:
DATA_DIR = "dataset/News Articles"

categories = []
texts = []
filenames = []

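# os.listdir returns entries in arbitrary order, so sort for a reproducible row order
for category in sorted(os.listdir(DATA_DIR)):
    category_path = os.path.join(DATA_DIR, category)
    # Skip stray files; only sub-folders are categories
    if not os.path.isdir(category_path):
        continue
    for fname in sorted(os.listdir(category_path)):
        fpath = os.path.join(category_path, fname)
        if not os.path.isfile(fpath):
            continue
        # latin-1 can decode any byte sequence, so reading never raises UnicodeDecodeError
        with open(fpath, "r", encoding="latin-1") as f:
            text = f.read().strip()
        categories.append(category)
        texts.append(text)
        filenames.append(fname)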

df = pd.DataFrame({
    "category": categories,
    "text": texts,
    "filename": filenames
})

print("Dataset shape:", df.shape)

print("\nFirst 5 records:")
display(df.head())

print("\nDataset categories and their counts:")
df["category"].value_counts()


Dataset shape: (2225, 3)

First 5 records:

| | category | text | filename |
|---|---|---|---|
| 0 | business | Ad sales boost Time Warner profit\n\nQuarterly... | 001.txt |
| 1 | business | Dollar gains on Greenspan speech\n\nThe dollar... | 002.txt |
| 2 | business | Yukos unit buyer faces loan claim\n\nThe owner... | 003.txt |
| 3 | business | High fuel prices hit BA's profits\n\nBritish A... | 004.txt |
| 4 | business | Pernod takeover talk lifts Domecq\n\nShares in... | 005.txt |

Dataset categories and their counts:


category
sport            511
business         510
politics         417
tech             401
entertainment    386
Name: count, dtype: int64
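
The counts above show a mild class imbalance. As a quick illustration, the sketch below recomputes each category's share and the ratio between the largest and smallest class, using the counts copied from the output rather than the DataFrame so that it runs on its own. The variable names are introduced here for illustration.

```python
# Category counts copied from the value_counts() output above.
counts = {
    "sport": 511,
    "business": 510,
    "politics": 417,
    "tech": 401,
    "entertainment": 386,
}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

for name, share in shares.items():
    print(f"{name:<14}{share:.1%}")
print(f"Total articles: {total}")
print(f"Largest / smallest class: {imbalance_ratio:.2f}")
```

With the largest class only about 1.3 times the smallest, a stratified split (the `stratify` argument of `train_test_split`) and a macro-averaged `f1_score` are reasonable choices, though whether the notebook uses them is not shown in this section.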