# Source
- Dataset source: http://mlg.ucd.ie/datasets/bbc.html
- Original paper: http://mlg.ucd.ie/files/publications/greene06icml.pdf
- Raw files (direct download): http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip

# Preprocess

In [11]:
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import pandas as pd
import glob
import nltk
import os
import re


In [3]:
def preprocess_bbc_dataset(root_dir):
    # our columns
    texts = []
    categories = []
    file_names = []

    # Process every category directory (e.g. the business or tech folders);
    # sorting keeps the row order deterministic across platforms
    for category in sorted(os.listdir(root_dir)):
        category_path = os.path.join(root_dir, category)
        # Skip anything that is not a directory
        if not os.path.isdir(category_path):
            continue

        # Process all text files in the category
        for file_path in sorted(glob.glob(os.path.join(category_path, "*.txt"))):
            try:
                with open(file_path, 'r', encoding='utf-8') as file:
                    text = file.read()

                # Fill the row
                texts.append(text)
                categories.append(category)
                file_names.append(os.path.basename(file_path))
            except (OSError, UnicodeDecodeError) as e:
                # Files that cannot be read or decoded are reported and skipped
                print(f"Error processing {file_path}: {e}")

    # create df
    data = {
        'file_name': file_names,
        'category': categories,
        'text': texts
    }
    df = pd.DataFrame(data)

    return df

bbc_df = preprocess_bbc_dataset("_data/2_Medium_BBC_Dataset")
bbc_df.head()

Unnamed: 0,file_name,category,text
0,001.txt,business,Ad sales boost Time Warner profit\n\nQuarterly...
1,002.txt,business,Dollar gains on Greenspan speech\n\nThe dollar...
2,003.txt,business,Yukos unit buyer faces loan claim\n\nThe owner...
3,004.txt,business,High fuel prices hit BA's profits\n\nBritish A...
4,005.txt,business,Pernod takeover talk lifts Domecq\n\nShares in...


In [4]:
print(bbc_df.head())

  file_name  category                                               text
0   001.txt  business  Ad sales boost Time Warner profit\n\nQuarterly...
1   002.txt  business  Dollar gains on Greenspan speech\n\nThe dollar...
2   003.txt  business  Yukos unit buyer faces loan claim\n\nThe owner...
3   004.txt  business  High fuel prices hit BA's profits\n\nBritish A...
4   005.txt  business  Pernod takeover talk lifts Domecq\n\nShares in...


In [6]:
bbc_df.tail()

Unnamed: 0,file_name,category,text
2220,397.txt,tech,BT program to beat dialler scams\n\nBT is intr...
2221,398.txt,tech,Spam e-mails tempt net shoppers\n\nComputer us...
2222,399.txt,tech,Be careful how you code\n\nA new European dire...
2223,400.txt,tech,US cyber security chief resigns\n\nThe man mak...
2224,401.txt,tech,Losing yourself in online gaming\n\nOnline rol...


In [5]:
bbc_df.describe()

Unnamed: 0,file_name,category,text
count,2225,2225,2225
unique,511,5,2127
top,386.txt,sport,De Niro film leads US box office\n\nFilm star ...
freq,5,511,2


In [None]:
print(f"Dataset created with {len(bbc_df)} articles across {bbc_df['category'].nunique()} categories")
print(f"Category distribution:\n{bbc_df['category'].value_counts()}")

Dataset created with 2225 articles across 5 categories
Category distribution:
category
sport            511
business         510
politics         417
tech             401
entertainment    386
Name: count, dtype: int64


## Basic text cleaning & Text normalization
- Convert all text to lowercase
- Remove punctuation and special characters
- Remove numbers
- Remove extra whitespace
- Remove stopwords (applied later, in the normalization step, not in `basic_clean`)
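
To make the effect of each step visible, the list above can be traced one step at a time on a single sentence. This is a minimal sketch, not the notebook's actual pipeline: `trace_cleaning` and `SAMPLE_STOPWORDS` are hypothetical names, and the tiny stopword set is an assumption standing in for NLTK's English list.

```python
import re

# Tiny illustrative stopword set (an assumption; the notebook uses NLTK's English list)
SAMPLE_STOPWORDS = {"the", "at", "is", "to", "for", "from"}

def trace_cleaning(text):
    """Return the intermediate string produced by each cleaning step."""
    steps = {}
    # Convert all text to lowercase
    steps["lowercase"] = text.lower()
    # Replace punctuation and special characters with spaces
    steps["no_punctuation"] = re.sub(r"[^\w\s]", " ", steps["lowercase"])
    # Replace runs of digits with spaces
    steps["no_numbers"] = re.sub(r"\d+", " ", steps["no_punctuation"])
    # Collapse repeated whitespace and trim the ends
    steps["single_spaces"] = re.sub(r"\s+", " ", steps["no_numbers"]).strip()
    # Drop stopwords
    steps["no_stopwords"] = " ".join(
        word for word in steps["single_spaces"].split()
        if word not in SAMPLE_STOPWORDS
    )
    return steps

sample = "Profits jumped 76% to $1.13bn for the three months to December."
for name, value in trace_cleaning(sample).items():
    print(f"{name:>15}: {value}")
```

Note that removing digits and punctuation leaves fragments behind: `$1.13bn` is reduced to `bn`, which is the same effect visible in the cleaned BBC text.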

In [8]:
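# Reload the dataset from CSV (assumes the DataFrame built above was saved
# to this path beforehand; that step is not shown in this notebook)
bbc_df = pd.read_csv("_data/2_Medium_BBC_Dataset/bbc_dataset.csv")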

def basic_clean(text):
    # Convert all text to lowercase
    text = text.lower()
    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', ' ', text)
    # Remove numbers
    text = re.sub(r'\d+', ' ', text)
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text

bbc_df['cleaned_text'] = bbc_df['text'].apply(basic_clean)
print(bbc_df['text'].iloc[0][:200])
print(bbc_df['cleaned_text'].iloc[0][:200])
bbc_df.head()

Ad sales boost Time Warner profit

Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.

The firm, which is now one o
ad sales boost time warner profit quarterly profits at us media giant timewarner jumped to bn m for the three months to december from m year earlier the firm which is now one of the biggest investors 


Unnamed: 0,file_name,category,text,cleaned_text
0,001.txt,business,Ad sales boost Time Warner profit\n\nQuarterly...,ad sales boost time warner profit quarterly pr...
1,002.txt,business,Dollar gains on Greenspan speech\n\nThe dollar...,dollar gains on greenspan speech the dollar ha...
2,003.txt,business,Yukos unit buyer faces loan claim\n\nThe owner...,yukos unit buyer faces loan claim the owners o...
3,004.txt,business,High fuel prices hit BA's profits\n\nBritish A...,high fuel prices hit ba s profits british airw...
4,005.txt,business,Pernod takeover talk lifts Domecq\n\nShares in...,pernod takeover talk lifts domecq shares in uk...


In [9]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\andras.janko\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\andras.janko\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\andras.janko\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


### What are "tokens"?

In natural language processing, "tokens" are not numbers: they are the individual words or other units that result from splitting up text. Tokenization is simply the process of breaking text into these smaller units (usually words).

On its own, the line `tokens = nltk.word_tokenize(text)` works correctly, but it only:

- Takes a sentence like "ad sales boost time warner profit"
- Splits it into individual words: ["ad", "sales", "boost", "time", "warner", "profit"]

If the code then immediately joins the tokens back together with spaces, the output looks nearly identical to the input, because the text was split and rejoined without any transformation in between.
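
The split-and-rejoin round trip can be checked directly. In this sketch, whitespace splitting stands in for `nltk.word_tokenize` (an assumption made to keep the example dependency-free); NLTK's tokenizer applies additional rules, so its output can differ on text that still contains punctuation or contractions.

```python
def tokenize(text):
    # Stand-in for nltk.word_tokenize: split on whitespace only
    return text.split()

cleaned = "ad sales boost time warner profit"

tokens = tokenize(cleaned)      # list of individual words
rejoined = " ".join(tokens)     # tokens joined back with single spaces

print(tokens)
print(rejoined == cleaned)      # nothing was transformed in between
```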

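### Why the complete function works
The complete `normalize_text` function changes the text because it transforms the tokens between splitting and rejoining:
- Tokenization: Breaking text into individual words
- Stopword removal: Removing common words like "the", "at", "is"
- Stemming: Reducing words to their root forms (e.g., "jumped" → "jump")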

Stemming algorithms don't aim to produce real dictionary words. Instead, they apply rule-based transformations to reduce words to their "stem" or root form by chopping off common endings. The goal is to group related word forms together, not to produce grammatically correct words.

This is by design! The benefit is that different variations of the same concept get mapped to the same stem, which reduces vocabulary size and helps machine learning models recognize similar words.
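
The idea of rule-based suffix chopping can be illustrated with a deliberately simplified stemmer. This is a toy sketch and not the Porter algorithm used in this notebook: `toy_stem` and its suffix list are hypothetical, and its stems will not always match what `PorterStemmer` produces.

```python
# Suffixes are checked in this order; the first match is removed
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip one common ending, keeping at least three characters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["jump", "jumps", "jumped", "jumping"]
stems = {word: toy_stem(word) for word in words}
print(stems)

# All four surface forms collapse to one vocabulary entry
print(len(set(stems.values())))

# Stems need not be dictionary words: this toy rule turns "sales" into "sal"
print(toy_stem("sales"))
```

Four distinct word forms end up sharing a single stem, which is the vocabulary reduction described above, and the last line shows that the result of chopping can be a non-word.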

Lemmatization is more linguistically accurate but slower than stemming. For document classification, stemming is usually sufficient. A lemmatizing variant of the function looks like this:

```python
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

def normalize_text(text):
    tokens = nltk.word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]
    tokens = [lemmatizer.lemmatize(word) for word in tokens]  # Lemmatization instead of stemming
    return ' '.join(tokens)
```
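
One caveat for the lemmatizing variant: `WordNetLemmatizer.lemmatize` accepts an optional `pos` argument that defaults to `"n"` (noun), so calling it with only the word treats every token as a noun. This is why verb forms such as "jumped" and "benefited" pass through unchanged in the lemmatized output of this notebook. The sketch below illustrates the dependence on part of speech with a toy lookup table; `TOY_LEMMAS` and `toy_lemmatize` are hypothetical stand-ins for WordNet, not NLTK code.

```python
# Toy lookup table standing in for WordNet (hypothetical, for illustration only).
# Keys are (word, part_of_speech) pairs: "n" = noun, "v" = verb.
TOY_LEMMAS = {
    ("profits", "n"): "profit",
    ("months", "n"): "month",
    ("jumped", "v"): "jump",
    ("benefited", "v"): "benefit",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if it is not listed."""
    return TOY_LEMMAS.get((word, pos), word)

words = ["profits", "jumped", "months", "benefited"]

# Default noun lookup: plural nouns are reduced, verb forms are left alone
as_nouns = [toy_lemmatize(word) for word in words]
# Verb lookup: verb forms are reduced, plural nouns are left alone
as_verbs = [toy_lemmatize(word, pos="v") for word in words]

print(as_nouns)
print(as_verbs)
```

With NLTK itself, the corresponding call would be `lemmatizer.lemmatize(word, pos="v")`. Choosing the right tag for each word generally requires a part-of-speech tagger, which this notebook does not use.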

In [None]:
def normalize_text(text):
    # tokenization
    tokens = nltk.word_tokenize(text)
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    # Apply stemming
    tokens = [stemmer.stem(word) for word in tokens]
    # join tokens back into a single string
    return ' '.join(tokens)

bbc_df['normalized_text'] = bbc_df['cleaned_text'].apply(normalize_text)
print("Cleaned text:")
print(bbc_df['cleaned_text'].iloc[0][:200])
print("\nNormalized text:")
print(bbc_df['normalized_text'].iloc[0][:200])
bbc_df.head()

Cleaned text:
ad sales boost time warner profit quarterly profits at us media giant timewarner jumped to bn m for the three months to december from m year earlier the firm which is now one of the biggest investors 

Normalized text:
ad sale boost time warner profit quarterli profit us media giant timewarn jump bn three month decemb year earlier firm one biggest investor googl benefit sale high speed internet connect higher advert


Unnamed: 0,file_name,category,text,cleaned_text,normalized_text
0,001.txt,business,Ad sales boost Time Warner profit\n\nQuarterly...,ad sales boost time warner profit quarterly pr...,ad sale boost time warner profit quarterli pro...
1,002.txt,business,Dollar gains on Greenspan speech\n\nThe dollar...,dollar gains on greenspan speech the dollar ha...,dollar gain greenspan speech dollar hit highes...
2,003.txt,business,Yukos unit buyer faces loan claim\n\nThe owner...,yukos unit buyer faces loan claim the owners o...,yuko unit buyer face loan claim owner embattl ...
3,004.txt,business,High fuel prices hit BA's profits\n\nBritish A...,high fuel prices hit ba s profits british airw...,high fuel price hit ba profit british airway b...
4,005.txt,business,Pernod takeover talk lifts Domecq\n\nShares in...,pernod takeover talk lifts domecq shares in uk...,pernod takeov talk lift domecq share uk drink ...


In [12]:
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

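# Note: this redefines normalize_text, replacing the stemming version above
def normalize_text(text):
    # Tokenization
    tokens = nltk.word_tokenize(text)
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    # Lemmatize instead of stemming; lemmatize() treats each word as a noun
    # by default, so verb forms such as "jumped" are left unchanged
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(tokens)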

bbc_df['lemmatized_normalized_text'] = bbc_df['cleaned_text'].apply(normalize_text)
print("Cleaned text:")
print(bbc_df['cleaned_text'].iloc[0][:200])
print("\nLemmatized-Normalized text:")
print(bbc_df['lemmatized_normalized_text'].iloc[0][:200])
bbc_df.head()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\andras.janko\AppData\Roaming\nltk_data...


Cleaned text:
ad sales boost time warner profit quarterly profits at us media giant timewarner jumped to bn m for the three months to december from m year earlier the firm which is now one of the biggest investors 

Lemmatized-Normalized text:
ad sale boost time warner profit quarterly profit u medium giant timewarner jumped bn three month december year earlier firm one biggest investor google benefited sale high speed internet connection h


Unnamed: 0,file_name,category,text,cleaned_text,normalized_text,lemmatized_normalized_text
0,001.txt,business,Ad sales boost Time Warner profit\n\nQuarterly...,ad sales boost time warner profit quarterly pr...,ad sale boost time warner profit quarterli pro...,ad sale boost time warner profit quarterly pro...
1,002.txt,business,Dollar gains on Greenspan speech\n\nThe dollar...,dollar gains on greenspan speech the dollar ha...,dollar gain greenspan speech dollar hit highes...,dollar gain greenspan speech dollar hit highes...
2,003.txt,business,Yukos unit buyer faces loan claim\n\nThe owner...,yukos unit buyer faces loan claim the owners o...,yuko unit buyer face loan claim owner embattl ...,yukos unit buyer face loan claim owner embattl...
3,004.txt,business,High fuel prices hit BA's profits\n\nBritish A...,high fuel prices hit ba s profits british airw...,high fuel price hit ba profit british airway b...,high fuel price hit ba profit british airway b...
4,005.txt,business,Pernod takeover talk lifts Domecq\n\nShares in...,pernod takeover talk lifts domecq shares in uk...,pernod takeov talk lift domecq share uk drink ...,pernod takeover talk lift domecq share uk drin...


In [13]:
bbc_df.to_csv("_data/2_Medium_BBC_Dataset/bbc_dataset_preprocessed.csv", index=False)