# Phase 2: Text Preprocessing Pipeline
**Project:** Document Classification System  
**Goal:** Transform raw, noisy news text into clean, lemmatized tokens for Machine Learning.

In [1]:
import pandas as pd
import re
import html
import spacy
from datasets import load_dataset
import os



In [2]:
# Create directories if they don't exist
os.makedirs('../data/raw', exist_ok=True)
os.makedirs('../data/processed', exist_ok=True)

## 1. Load & Version Data
We download from Hugging Face but save a local copy to `data/raw` to ensure reproducibility and offline access.

In [3]:
# Load the AG News dataset
dataset = load_dataset("ag_news")
df_train = pd.DataFrame(dataset['train'])
df_test = pd.DataFrame(dataset['test'])

In [4]:
# Save the raw "Source of Truth"
df_train.to_csv('../data/raw/train.csv', index=False)
df_test.to_csv('../data/raw/test.csv', index=False)

print(f"Raw data saved. Training samples: {len(df_train)}")

Raw data saved. Training samples: 120000
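
To make the "source of truth" claim checkable later, one option is to record a checksum next to each raw file. The following is a minimal sketch, not part of the original notebook: the helper names `file_sha256` and `record_checksum` and the `.sha256` sidecar convention are assumptions chosen for illustration.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # iter() with a sentinel stops once read() returns empty bytes
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_checksum(path):
    """Write the digest to '<name>.sha256' beside the file (hypothetical convention)."""
    path = Path(path)
    sidecar = path.with_name(path.name + ".sha256")
    sidecar.write_text(file_sha256(path) + "\n")
    return sidecar


# Example call, assuming the raw CSV from the cell above exists:
# record_checksum('../data/raw/train.csv')
```

Comparing the stored digest against a fresh `file_sha256` call before training would reveal whether the raw snapshot has changed since it was saved.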


## 2. Build the Preprocessing Engine
This function handles the "Four Pillars of Cleaning":
1. **Decoding:** Fixes `#36;` ($), `#151;` (em dash), and `&quot;`.
2. **Stripping:** Removes news source headers (e.g., "Reuters -").
3. **Regex:** Removes punctuation, numbers, and backslashes.
4. **Lemmatization:** Reduces words to their base form (e.g., "running" -> "run").
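
As a quick check of the decoding pillar, the sketch below shows what the standard library's `html.unescape` does with the entity forms listed above. The sample strings are made up for illustration. One detail worth noting: the function only recognizes references that begin with `&`, so a bare `#36;` passes through unchanged and is only stripped later by the regex step.

```python
import html

# Each key is a raw string, each value is what html.unescape returns for it.
samples = {
    "&#36;46 billion": "$46 billion",      # numeric reference with ampersand
    "&quot;quoted&quot;": '"quoted"',      # named entity
    "1990&#151;2004": "1990\u20142004",    # code 151 is mapped to an em dash
    "#36;46 billion": "#36;46 billion",    # no ampersand: left as-is
}

for raw, expected in samples.items():
    decoded = html.unescape(raw)
    print(f"{raw!r} -> {decoded!r}")
```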

In [5]:
# Load English language model (disable parser and ner for speed)
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

In [11]:
def clean_text(text):
    # 1. Decode HTML
    text = html.unescape(text)
    
    # 2. Remove typical news source headers (Reuters, AP)
    # Strip leading datelines first: if the bare "(Reuters)" / "(AP)" tags were
    # removed before this, the anchored patterns would have nothing left to match.
    text = re.sub(r'^[A-Z\s,]+ \(Reuters\) - ', '', text)  # e.g. "NEW YORK (Reuters) - "
    text = re.sub(r'^[A-Z\s,]+ \(AP\) - ', '', text)       # e.g. "WASHINGTON (AP) - "
    # Then drop any remaining parenthesized source tags anywhere in the text
    text = re.sub(r'\(Reuters\)', '', text)
    text = re.sub(r'\(AP\)', '', text)

    # 3. Replace problematic separators with spaces
    text = re.sub(r'[\\/_-]', ' ', text)
    
    # 4. Standard clean
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    
    # 5. Lemmatize
    doc = nlp(text)
    tokens = [token.lemma_ for token in doc if not token.is_stop and len(token.text) > 2]
    
    return " ".join(tokens)
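
To see why step 3 runs before step 4, the sketch below isolates those two steps in a hypothetical helper called `normalize` (the name and the sample string are assumptions for illustration). Replacing separators with spaces first keeps the words on either side of a hyphen or backslash apart. Stripping non-letters directly would fuse them.

```python
import re


def normalize(text: str) -> str:
    """Steps 3 and 4 only: separators to spaces, lowercase, keep letters and whitespace."""
    text = re.sub(r'[\\/_-]', ' ', text)
    text = text.lower()
    return re.sub(r'[^a-z\s]', '', text)


# Raw string so the backslash stays a literal character, as in the noisy source text
sample = r"dwindling\band of ultra-cynics, #36;46 billion"

with_step3 = normalize(sample)
without_step3 = re.sub(r'[^a-z\s]', '', sample.lower())

print(with_step3.split())     # words stay separate
print(without_step3.split())  # "dwindlingband" and "ultracynics" are fused
```

Removing characters can leave runs of spaces behind. That is harmless here because the final list comprehension in `clean_text` drops tokens of two characters or fewer.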

## 3. Execute the Pipeline
We apply the `clean_text` function to our training and testing sets. 

In [12]:
print("Processing training data...")
df_train['cleaned_text'] = df_train['text'].apply(clean_text)

print("Processing testing data...")
df_test['cleaned_text'] = df_test['text'].apply(clean_text)

# Check a sample to verify #36; and backslashes are gone
df_train[['text', 'cleaned_text']].head(10)

Processing training data...
Processing testing data...


| | text | cleaned_text |
|---|---|---|
| 0 | Wall St. Bears Claw Back Into the Black (Reute... | wall bears claw black reuter short seller wall... |
| 1 | Carlyle Looks Toward Commercial Aerospace (Reu... | carlyle look commercial aerospace reuters priv... |
| 2 | Oil and Economy Cloud Stocks' Outlook (Reuters... | oil economy cloud stock outlook reuter soar cr... |
| 3 | Iraq Halts Oil Exports from Main Southern Pipe... | iraq halt oil export main southern pipeline re... |
| 4 | Oil prices soar to all-time record, posing new... | oil price soar time record pose new menace eco... |
| 5 | Stocks End Up, But Near Year Lows (Reuters) Re... | stock end near year low reuter stock end sligh... |
| 6 | Money Funds Fell in Latest Week (AP) AP - Asse... | money fund fall late week asset nation retail ... |
| 7 | Fed minutes show dissent over inflation (USATO... | fed minute dissent inflation usatodaycom usato... |
| 8 | Safety Net (Forbes.com) Forbes.com - After ear... | safety net forbescom forbescom earn phd sociol... |
| 9 | Wall St. Bears Claw Back Into the Black NEW Y... | wall bears claw black new york short selle... |
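
Eyeballing ten rows only goes so far. A programmatic check can scan every row for leftover characters. The sketch below is one possible approach, with `find_residue` as a hypothetical helper name. It assumes the lemmatizer returns lowercase alphabetic lemmas for lowercase alphabetic input, so any other character indicates a cleaning gap.

```python
import re

# A fully cleaned string should consist only of lowercase letters and whitespace
ALLOWED = re.compile(r'[a-z\s]*')


def find_residue(texts):
    """Return (position, text) pairs that contain anything other than a-z and whitespace."""
    return [(i, t) for i, t in enumerate(texts) if not ALLOWED.fullmatch(t)]


# Made-up examples: the first is clean, the other two still carry residue
cleaned = ["wall bears claw black", "oil price #36;50", "dwindling\\band"]
print(find_residue(cleaned))
```

Applied to the notebook's data, this could be called as `find_residue(df_train['cleaned_text'])`. An empty list would mean no row contains digits, punctuation, or backslashes.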


## 4. Save Processed "Gold" Data
We keep only the `label` and `cleaned_text` columns to reduce file size. All later modeling phases read from these files.

In [13]:
df_train[['label', 'cleaned_text']].to_csv('../data/processed/news_clean_train.csv', index=False)
df_test[['label', 'cleaned_text']].to_csv('../data/processed/news_clean_test.csv', index=False)

print("Success! Processed data saved to data/processed/")

Success! Processed data saved to data/processed/


## 5. Summary & Next Steps

We implemented a text preprocessing pipeline that transforms raw news text into standardized, lemmatized tokens. It combines HTML decoding, regex-based cleaning, and spaCy lemmatization to reduce noise while keeping the content words that carry each article's meaning. The resulting "Gold" datasets are saved in the `data/processed/` directory.

**Next Phase:** The project moves to Traditional Machine Learning Models. TF-IDF vectorization and classical algorithms (Logistic Regression and SVM) will be applied to establish a performance benchmark.