# Task

##### I chose the BBC News dataset to solve a text categorization problem: classifying news articles into five categories (business, entertainment, politics, sport, tech).

# Source
- Dataset source: http://mlg.ucd.ie/datasets/bbc.html
- Original paper: http://mlg.ucd.ie/files/publications/greene06icml.pdf
- Raw files (direct download): http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip

# Initial Data

The raw dataset is organized in the directory structure below: one folder per category, each containing individual text files. For convenience, I aggregated it into a single CSV file.

```
📦_data
 ┣ 📂business
 ┃ ┣ 📜001.txt
 ┃ ┣ 📜002.txt
 ┃ ┣ 📜003.txt
 ...
 ┃ ┗ 📜510.txt
 ┣ 📂entertainment
 ┃ ┣ 📜001.txt
 ┃ ┣ 📜002.txt
 ┃ ┣ 📜003.txt
 ...
 ┃ ┗ 📜386.txt
 ┣ 📂politics
 ┃ ┣ 📜001.txt
 ┃ ┣ 📜002.txt
 ┃ ┣ 📜003.txt
 ...
 ┃ ┗ 📜417.txt
 ┣ 📂sport
 ┃ ┣ 📜001.txt
 ┃ ┣ 📜002.txt
 ┃ ┣ 📜003.txt
 ...
 ┃ ┗ 📜511.txt
 ┗ 📂tech
   ┣ 📜001.txt
   ┣ 📜002.txt
   ┣ 📜003.txt
 ...
   ┗ 📜401.txt
```
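
Before aggregating, it can help to confirm that the extracted archive really matches the layout above. The following is a minimal sketch, not part of the original notebook: `count_articles` is a hypothetical helper, and the demo builds a small throwaway directory rather than touching the real data.

```python
from pathlib import Path
import tempfile

def count_articles(root_dir):
    """Return {category: number of .txt files} for each subfolder of root_dir."""
    root = Path(root_dir)
    return {
        folder.name: sum(1 for _ in folder.glob("*.txt"))
        for folder in sorted(root.iterdir())
        if folder.is_dir()  # skip loose files such as an exported CSV
    }

# demo on a temporary directory that mimics the layout (hypothetical miniature data)
with tempfile.TemporaryDirectory() as tmp:
    for category, n_files in {"business": 3, "tech": 2}.items():
        folder = Path(tmp) / category
        folder.mkdir()
        for i in range(1, n_files + 1):
            (folder / f"{i:03d}.txt").write_text("placeholder article", encoding="utf-8")
    demo_counts = count_articles(tmp)

print(demo_counts)
```

On the real data, `count_articles('_data/')` should report 510, 386, 417, 511, and 401 files for business, entertainment, politics, sport, and tech, if the extraction matches the tree shown above.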

# Preprocess

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import pandas as pd
import glob
import os
import re


In [None]:
def preprocess_bbc_dataset(root_dir):
    # output columns
    texts = []
    categories = []
    file_names = []

    # process all category directories (e.g. the business or tech folders)
    # in sorted order, so the row order is reproducible across platforms
    for category in sorted(os.listdir(root_dir)):
        category_path = os.path.join(root_dir, category)

        if not os.path.isdir(category_path):
            continue

        # process all text files in the category, in file-name order
        for file_path in sorted(glob.glob(os.path.join(category_path, "*.txt"))):
            try:
                with open(file_path, 'r', encoding='utf-8') as file:
                    text = file.read()

                # fill the row
                texts.append(text)
                categories.append(category)
                file_names.append(os.path.basename(file_path))
            except (OSError, UnicodeDecodeError) as e:
                # report unreadable or badly encoded files and keep going
                print(f"Error processing {file_path}: {e}")

    # create output dataframe
    data = {
        'file_name': file_names,
        'category': categories,
        'text': texts
    }
    df = pd.DataFrame(data)

    return df

bbc_df = preprocess_bbc_dataset('_data/')
bbc_df


Unnamed: 0,file_name,category,text
0,001.txt,business,Ad sales boost Time Warner profit\n\nQuarterly...
1,002.txt,business,Dollar gains on Greenspan speech\n\nThe dollar...
2,003.txt,business,Yukos unit buyer faces loan claim\n\nThe owner...
3,004.txt,business,High fuel prices hit BA's profits\n\nBritish A...
4,005.txt,business,Pernod takeover talk lifts Domecq\n\nShares in...
...,...,...,...
2220,397.txt,tech,BT program to beat dialler scams\n\nBT is intr...
2221,398.txt,tech,Spam e-mails tempt net shoppers\n\nComputer us...
2222,399.txt,tech,Be careful how you code\n\nA new European dire...
2223,400.txt,tech,US cyber security chief resigns\n\nThe man mak...


In [None]:
print(f"Dataset created with {len(bbc_df)} articles across {bbc_df['category'].nunique()} categories")
print(f"Category distribution:\n{bbc_df['category'].value_counts()}")
bbc_df.describe()


Dataset created with 2225 articles across 5 categories
Category distribution:
category
sport            511
business         510
politics         417
tech             401
entertainment    386
Name: count, dtype: int64


Unnamed: 0,file_name,category,text
count,2225,2225,2225
unique,511,5,2127
top,386.txt,sport,De Niro film leads US box office\n\nFilm star ...
freq,5,511,2
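
The summary above reports 2127 unique texts among 2225 rows, with a maximum frequency of 2, so 98 articles appear twice. The notebook does not say whether these repeats are kept or dropped; the sketch below only shows one way to inspect them. `find_duplicate_texts` and the miniature frame are hypothetical, and dropping duplicates is an assumption, not a step from the original pipeline.

```python
import pandas as pd

def find_duplicate_texts(df, text_col="text"):
    """Return the rows whose text already appeared in an earlier row."""
    return df[df.duplicated(subset=text_col, keep="first")]

# hypothetical miniature frame; the real bbc_df has the same columns
demo_df = pd.DataFrame({
    "file_name": ["001.txt", "002.txt", "003.txt"],
    "category": ["tech", "tech", "sport"],
    "text": ["same story", "same story", "another story"],
})

dupes = find_duplicate_texts(demo_df)

# optional: keep only the first occurrence of each text
deduped = demo_df.drop_duplicates(subset="text", keep="first").reset_index(drop=True)

print(len(dupes), len(deduped))
```

Checking which categories the repeated rows belong to (for example with `dupes['category'].value_counts()`) is a reasonable first look before deciding whether to remove them.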


### Basic text cleaning and text normalization

In [None]:
def basic_clean(text):
    # convert all text to lowercase
    text = text.lower()
    # remove punctuation and special characters
    text = re.sub(r'[^\w\s]', ' ', text)
    # remove numbers
    text = re.sub(r'\d+', ' ', text)
    # collapse runs of whitespace and trim both ends
    text = re.sub(r'\s+', ' ', text).strip()

    return text

bbc_df['cleaned_text'] = bbc_df['text'].apply(basic_clean)
print("Before cleaning:")
print(bbc_df['text'].iloc[0][:200])
print("\nAfter cleaning:")
print(bbc_df['cleaned_text'].iloc[0][:200])
bbc_df


Before cleaning:
Ad sales boost Time Warner profit

Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.

The firm, which is now one o

After cleaning:
ad sales boost time warner profit quarterly profits at us media giant timewarner jumped to bn m for the three months to december from m year earlier the firm which is now one of the biggest investors 


Unnamed: 0,file_name,category,text,cleaned_text
0,001.txt,business,Ad sales boost Time Warner profit\n\nQuarterly...,ad sales boost time warner profit quarterly pr...
1,002.txt,business,Dollar gains on Greenspan speech\n\nThe dollar...,dollar gains on greenspan speech the dollar ha...
2,003.txt,business,Yukos unit buyer faces loan claim\n\nThe owner...,yukos unit buyer faces loan claim the owners o...
3,004.txt,business,High fuel prices hit BA's profits\n\nBritish A...,high fuel prices hit ba s profits british airw...
4,005.txt,business,Pernod takeover talk lifts Domecq\n\nShares in...,pernod takeover talk lifts domecq shares in uk...
...,...,...,...,...
2220,397.txt,tech,BT program to beat dialler scams\n\nBT is intr...,bt program to beat dialler scams bt is introdu...
2221,398.txt,tech,Spam e-mails tempt net shoppers\n\nComputer us...,spam e mails tempt net shoppers computer users...
2222,399.txt,tech,Be careful how you code\n\nA new European dire...,be careful how you code a new european directi...
2223,400.txt,tech,US cyber security chief resigns\n\nThe man mak...,us cyber security chief resigns the man making...
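
The cleaned sample above contains the fragment "jumped to bn m for": the digits of "$1.13bn" and "£600m" are removed, but the unit suffixes "bn" and "m" stay behind as stray tokens. One possible variant, shown here as a sketch and not as the notebook's method, removes a number together with any letters attached to it before the punctuation is stripped. The function name `clean_numbers_with_units` is hypothetical.

```python
import re

def clean_numbers_with_units(text):
    """Variant of basic_clean that drops numbers along with attached unit suffixes."""
    text = text.lower()
    # remove a number plus any directly attached letters, e.g. "1.13bn", "600m", "76"
    text = re.sub(r'\d[\d.,]*[a-z]*', ' ', text)
    # remove punctuation and special characters
    text = re.sub(r'[^\w\s]', ' ', text)
    # collapse runs of whitespace and trim both ends
    return re.sub(r'\s+', ' ', text).strip()

sample = "Profits jumped 76% to $1.13bn (£600m) from $639m year-earlier."
cleaned_sample = clean_numbers_with_units(sample)
print(cleaned_sample)
```

The trade-off is that tokens such as "3g" disappear entirely instead of leaving "g", so whether this helps depends on how informative those suffixes are for the classifier.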


In [None]:
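# download NLTK resources
nltk.download('stopwords')
# note: recent NLTK releases load 'punkt_tab' instead of 'punkt' for
# word_tokenize; download that as well if tokenization raises a LookupError
nltk.download('punkt')
nltk.download('wordnet')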

# set up stopwords and lemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\andras.janko\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\andras.janko\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\andras.janko\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [6]:
def normalize_text(text):
    # tokenization
    tokens = nltk.word_tokenize(text)
    # remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    # apply lemmatization (WordNetLemmatizer treats every token as a noun by default)
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    # join tokens back into a single string
    return ' '.join(tokens)

bbc_df['normalized_text'] = bbc_df['cleaned_text'].apply(normalize_text)
print("Cleaned text:")
print(bbc_df['cleaned_text'].iloc[0][:200])
print("\nNormalized text (with lemmatization):")
print(bbc_df['normalized_text'].iloc[0][:200])
bbc_df.head()


Cleaned text:
ad sales boost time warner profit quarterly profits at us media giant timewarner jumped to bn m for the three months to december from m year earlier the firm which is now one of the biggest investors 

Normalized text (with lemmatization):
ad sale boost time warner profit quarterly profit u medium giant timewarner jumped bn three month december year earlier firm one biggest investor google benefited sale high speed internet connection h


Unnamed: 0,file_name,category,text,cleaned_text,normalized_text
0,001.txt,business,Ad sales boost Time Warner profit\n\nQuarterly...,ad sales boost time warner profit quarterly pr...,ad sale boost time warner profit quarterly pro...
1,002.txt,business,Dollar gains on Greenspan speech\n\nThe dollar...,dollar gains on greenspan speech the dollar ha...,dollar gain greenspan speech dollar hit highes...
2,003.txt,business,Yukos unit buyer faces loan claim\n\nThe owner...,yukos unit buyer faces loan claim the owners o...,yukos unit buyer face loan claim owner embattl...
3,004.txt,business,High fuel prices hit BA's profits\n\nBritish A...,high fuel prices hit ba s profits british airw...,high fuel price hit ba profit british airway b...
4,005.txt,business,Pernod takeover talk lifts Domecq\n\nShares in...,pernod takeover talk lifts domecq shares in uk...,pernod takeover talk lift domecq share uk drin...
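
The normalized sample above turns "us media giant" into "u medium giant": the noun lemmatizer strips the trailing "s" from the short token "us". A possible mitigation, sketched here under stated assumptions, is to lemmatize only tokens above a minimum length. `normalize_tokens` and `min_lemma_len` are hypothetical names, and the demo uses a toy stand-in lemmatizer so that it runs without NLTK data; it does not address cases such as "media" becoming "medium".

```python
def normalize_tokens(tokens, stop_words, lemmatize, min_lemma_len=3):
    """Drop stopwords, then lemmatize only tokens with at least min_lemma_len characters."""
    kept = [tok for tok in tokens if tok not in stop_words]
    return [lemmatize(tok) if len(tok) >= min_lemma_len else tok for tok in kept]

# stand-in lemmatizer for the demo: strips one trailing "s" (NOT WordNet's real behavior)
def toy_lemmatize(word):
    return word[:-1] if word.endswith("s") else word

demo_tokens = ["profits", "at", "us", "media", "giant"]
demo_result = normalize_tokens(demo_tokens, {"at"}, toy_lemmatize)
print(demo_result)
```

With the notebook's own objects, the call would look like `normalize_tokens(nltk.word_tokenize(text), stop_words, lemmatizer.lemmatize)`, followed by `' '.join(...)` as in `normalize_text`.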


In [None]:
bbc_df.to_csv("_data/bbc_dataset_preprocessed.csv", index=False)
