
---

## **📁 File: `02_Data_Cleaning.ipynb`**

# **🧹 02 - Data Cleaning & Preparation**

## **📑 Table of Contents**
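1.  [🎯 Objectives](#-objectives)
2.  [⚙️ Setup & Import Functions](#-setup--import-functions)
3.  [📥 Load Raw Data](#-load-raw-data)
4.  [📅 Date Processing & Feature Engineering](#-date-processing--feature-engineering)
5.  [🔧 Apply Cleaning Functions](#-apply-cleaning-functions)
6.  [📊 Verify Cleaning Results](#-verify-cleaning-results)
7.  [📝 Essential Text Feature Engineering](#-essential-text-feature-engineering)
8.  [🔍 Feature Overview & Validation](#-feature-overview--validation)
9.  [💾 Save Cleaned Data](#-save-cleaned-data)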

---

## **🎯 Objectives**
- Load the raw data from `dataset/00_raw/`
- Apply text cleaning functions to prepare for NLP
- Handle missing values and data quality issues
- Save cleaned data to `dataset/01_interim/` for future use
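
Before any cleaning function runs, it can help to quantify how many titles and article bodies are missing or blank. The following is a minimal sketch and not part of `src.data_cleaning`: `audit_text_columns` is a hypothetical helper, and the small `demo` frame is made up purely for illustration.

```python
import pandas as pd


def audit_text_columns(df, columns):
    """Count missing (NaN/None) and blank (whitespace-only) values per column.

    Hypothetical helper for illustration; not part of src.data_cleaning.
    """
    rows = []
    for col in columns:
        non_null = df[col].dropna().astype(str)
        rows.append({
            "column": col,
            "missing": int(df[col].isna().sum()),
            "blank": int(non_null.str.strip().eq("").sum()),
        })
    return pd.DataFrame(rows).set_index("column")


# Made-up rows, only to show the shape of the report
demo = pd.DataFrame({
    "title": ["A headline", None, "   "],
    "text": ["body one", "body two", "body three"],
})
report = audit_text_columns(demo, ["title", "text"])
print(report)
```

Running the same helper on `raw_df` and `raw_val_df` right after loading would show whether any rows need to be dropped or filled before text cleaning.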

---



## **⚙️ Setup & Import Functions**


In [1]:

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

import sys
import os

# Add the project root directory to Python path
sys.path.append(os.path.abspath('..'))

# Now import your modules
from src.data_cleaning import clean_text, run_clean_pipeline
from src.data_cleaning import gentle_clean_text, basic_clean_text, aggressive_clean_text
from src.data_cleaning import clean_date_column, engineer_text_features, engineer_all_features


%matplotlib inline
print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!



---

## **📥 Load Raw Data**


In [2]:
# Load the raw datasets
print("Loading raw data...")
raw_df = pd.read_csv('../dataset/00_raw/data.csv')
raw_val_df = pd.read_csv('../dataset/00_raw/validation_data.csv')

print(f"Main dataset shape: {raw_df.shape}")
print(f"Validation dataset shape: {raw_val_df.shape}")

# Display first few rows
print("\nMain dataset preview:")
display(raw_df.head(2))
print("\nValidation dataset preview:")
display(raw_val_df.head(2))


Loading raw data...
Main dataset shape: (39942, 5)
Validation dataset shape: (4956, 5)

Main dataset preview:


|   | label | title | text | subject | date |
|---|-------|-------|------|---------|------|
| 0 | 1 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | December 31, 2017 |
| 1 | 1 | U.S. military to accept transgender recruits o... | WASHINGTON (Reuters) - Transgender people will... | politicsNews | December 29, 2017 |



Validation dataset preview:


|   | label | title | text | subject | date |
|---|-------|-------|------|---------|------|
| 0 | 2 | UK's May 'receiving regular updates' on London... | LONDON (Reuters) - British Prime Minister Ther... | worldnews | September 15, 2017 |
| 1 | 2 | UK transport police leading investigation of L... | LONDON (Reuters) - British counter-terrorism p... | worldnews | September 15, 2017 |



---
## **📅 Date Processing & Feature Engineering**


In [5]:
## 📅 DATE COLUMN ANALYSIS & FEATURE ENGINEERING
print("📅 Engineering date features...")

# Apply enhanced date cleaning
raw_df['date_parsed'], raw_df['year'], raw_df['quarter'], raw_df['is_weekend'] = clean_date_column(raw_df['date'])
raw_val_df['date_parsed'], raw_val_df['year'], raw_val_df['quarter'], raw_val_df['is_weekend'] = clean_date_column(raw_val_df['date'])

# Fill missing dates with the mode year or reasonable defaults.
# mode_year is computed up front so the validation block can use it
# even when the main dataset has no missing years.
year_mode = raw_df['year'].mode()
mode_year = year_mode[0] if not year_mode.empty else 2017

if raw_df['year'].isna().any():
    raw_df['year'] = raw_df['year'].fillna(mode_year)
    raw_df['quarter'] = raw_df['quarter'].fillna('Q4')  # Most common quarter
    raw_df['is_weekend'] = raw_df['is_weekend'].fillna(False)

# Same for validation set (reuses the mode year from the main dataset)
if raw_val_df['year'].isna().any():
    raw_val_df['year'] = raw_val_df['year'].fillna(mode_year)
    raw_val_df['quarter'] = raw_val_df['quarter'].fillna('Q4')
    raw_val_df['is_weekend'] = raw_val_df['is_weekend'].fillna(False)

print("✅ Date features engineered:")
print(f"   Years: {raw_df['year'].unique()}")
print(f"   Quarters: {raw_df['quarter'].unique()}")
print(f"   Weekend articles: {raw_df['is_weekend'].sum()}")

# Drop the original date column since we have extracted features
raw_df = raw_df.drop(columns=['date', 'date_parsed'])
raw_val_df = raw_val_df.drop(columns=['date', 'date_parsed'])

📅 Engineering date features...
✅ Date features engineered:
   Years: [2017. 2016. 2018. 2015.]
   Quarters: ['Q4.0' 'Q3.0' 'Q2.0' 'Q1.0' 'Q4']
   Weekend articles: 7615
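
The output above lists both `'Q4.0'` and `'Q4'` as quarters, so the filled rows land in a separate category from the parsed ones, and the years print as floats. That pattern is consistent with the quarter being formatted from a float column, although the body of `clean_date_column` is not shown here. The sketch below is a hypothetical stand-in (`sketch_date_features` is not the project's function) showing one way to derive the same three features with nullable integer types so the labels come out as plain `Q1`..`Q4`.

```python
import pandas as pd


def sketch_date_features(dates):
    """Hypothetical stand-in for clean_date_column, for illustration only."""
    # Unparseable strings become NaT instead of raising
    parsed = pd.to_datetime(dates, errors="coerce")
    # Nullable integers keep missing values without turning 2017 into 2017.0
    year = parsed.dt.year.astype("Int64")
    quarter = "Q" + parsed.dt.quarter.astype("Int64").astype("string")
    # Monday=0 ... Saturday=5, Sunday=6; rows with NaT compare as False
    is_weekend = parsed.dt.dayofweek >= 5
    return pd.DataFrame({"year": year, "quarter": quarter, "is_weekend": is_weekend})


demo = pd.Series(["December 31, 2017", "December 29, 2017", "not a date"])
features = sketch_date_features(demo)
print(features)
```

With labels in this form, `fillna('Q4')` would merge into the existing `Q4` category rather than creating a fifth one.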



---

## **🔧 Apply Cleaning Functions**


In [6]:
# Create copies for different cleaning strategies
df_gentle = raw_df.copy()
df_basic = raw_df.copy() 
df_aggressive = raw_df.copy()

val_gentle = raw_val_df.copy()
val_basic = raw_val_df.copy()
val_aggressive = raw_val_df.copy()

print("Applying different cleaning strategies...")

# Apply GENTLE cleaning (preserves context for embeddings)
df_gentle['clean_title'] = df_gentle['title'].apply(gentle_clean_text)
df_gentle['clean_text'] = df_gentle['text'].apply(gentle_clean_text)
val_gentle['clean_title'] = val_gentle['title'].apply(gentle_clean_text)
val_gentle['clean_text'] = val_gentle['text'].apply(gentle_clean_text)

# Apply BASIC cleaning (for sentence transformers)
df_basic['clean_title'] = df_basic['title'].apply(basic_clean_text)
df_basic['clean_text'] = df_basic['text'].apply(basic_clean_text)
val_basic['clean_title'] = val_basic['title'].apply(basic_clean_text)
val_basic['clean_text'] = val_basic['text'].apply(basic_clean_text)

# Apply AGGRESSIVE cleaning (for traditional NLP)
df_aggressive['clean_title'] = df_aggressive['title'].apply(aggressive_clean_text)
df_aggressive['clean_text'] = df_aggressive['text'].apply(aggressive_clean_text)
val_aggressive['clean_title'] = val_aggressive['title'].apply(aggressive_clean_text)
val_aggressive['clean_text'] = val_aggressive['text'].apply(aggressive_clean_text)

# # Drop date column from all datasets (as recommended from EDA)
# datasets = [df_gentle, df_basic, df_aggressive, val_gentle, val_basic, val_aggressive]
# for dataset in datasets:
#     if 'date' in dataset.columns:
#         dataset.drop(columns=['date'], inplace=True)

print("✅ All cleaning strategies completed!")

Applying different cleaning strategies...
✅ All cleaning strategies completed!



---

## **📊 Verify Cleaning Results**


In [8]:
print("CLEANING VERIFICATION:")
print("=" * 50)

# Compare different cleaning strategies on same example
example_text = "U.S. military to accept transgender recruits on Monday: Pentagon"

print("Original:", example_text)
print("Gentle:", gentle_clean_text(example_text))
print("Basic:", basic_clean_text(example_text)) 
print("Aggressive:", aggressive_clean_text(example_text))
print()

# Check dataset info
print("Dataset shapes after cleaning:")
print(f"Gentle: {df_gentle.shape}")
print(f"Basic: {df_basic.shape}")
print(f"Aggressive: {df_aggressive.shape}")

# Check first few rows of each
print("\nSample cleaned titles (first 2 rows):")
print("\nGENTLE cleaning:")
for i in range(2):
    print(f"  {df_gentle['clean_title'].iloc[i][:100]}...")

print("\nBASIC cleaning:")
for i in range(2):
    print(f"  {df_basic['clean_title'].iloc[i][:100]}...")

print("\nAGGRESSIVE cleaning:")
for i in range(2):
    print(f"  {df_aggressive['clean_title'].iloc[i][:100]}...")

CLEANING VERIFICATION:
Original: U.S. military to accept transgender recruits on Monday: Pentagon
Gentle: u.s. military to accept transgender recruits on monday pentagon
Basic: u.s. military to accept transgender recruits on monday pentagon
Aggressive: military accept transgender recruits monday pentagon

Dataset shapes after cleaning:
Gentle: (39942, 9)
Basic: (39942, 9)
Aggressive: (39942, 9)

Sample cleaned titles (first 2 rows):

GENTLE cleaning:
  as u.s. budget fight looms republicans flip their fiscal script...
  u.s. military to accept transgender recruits on monday pentagon...

BASIC cleaning:
  as u.s. budget fight looms, republicans flip their fiscal script...
  u.s. military to accept transgender recruits on monday pentagon...

AGGRESSIVE cleaning:
  budget fight looms republicans flip their fiscal script...
  military accept transgender recruits monday pentagon...
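
The three strategies differ mainly in which punctuation and tokens survive. The implementations live in `src/data_cleaning.py` and are not reproduced in this notebook, so the following is only a sketch that mimics the outputs printed above: the `sketch_*` names and the small `STOPWORDS` set are assumptions chosen for illustration, not the project's actual rules.

```python
import re

# Illustrative stopword list (an assumption); the real list is not shown
STOPWORDS = {"a", "an", "the", "as", "to", "on", "of", "in", "and"}


def _collapse_spaces(text):
    return re.sub(r"\s+", " ", text).strip()


def sketch_gentle(text):
    """Lowercase; keep letters, digits and periods (so 'u.s.' survives)."""
    text = str(text).lower()
    return _collapse_spaces(re.sub(r"[^a-z0-9.\s]", " ", text))


def sketch_basic(text):
    """Like gentle, but commas are also kept."""
    text = str(text).lower()
    return _collapse_spaces(re.sub(r"[^a-z0-9.,\s]", " ", text))


def sketch_aggressive(text):
    """Letters only; drop stopwords and single-character tokens."""
    text = re.sub(r"[^a-z\s]", " ", str(text).lower())
    tokens = [t for t in text.split() if len(t) > 1 and t not in STOPWORDS]
    return " ".join(tokens)


title = "As U.S. budget fight looms, Republicans flip their fiscal script"
print(sketch_gentle(title))
print(sketch_basic(title))
print(sketch_aggressive(title))
```

On this title the sketch reproduces the three sample lines shown in the verification output; other inputs may well differ from what the project functions return.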


## **📝 Essential Text Feature Engineering**

In [9]:
## 📝 DERIVED TEXT FEATURES
print("📝 Engineering essential text features...")

def add_essential_features(dataframe):
    """Add only the essential derived text features"""
    df_copy = dataframe.copy()
    
    # Text length features
    df_copy['title_length'] = df_copy['title'].str.len().fillna(0)
    df_copy['title_word_count'] = df_copy['title'].str.split().str.len().fillna(0)
    df_copy['text_length'] = df_copy['text'].str.len().fillna(0)
    df_copy['text_word_count'] = df_copy['text'].str.split().str.len().fillna(0)
    
    return df_copy

# Apply to all datasets
print("🔄 Adding features to all dataset versions...")

df_gentle = add_essential_features(df_gentle)
df_basic = add_essential_features(df_basic)
df_aggressive = add_essential_features(df_aggressive)

val_gentle = add_essential_features(val_gentle)
val_basic = add_essential_features(val_basic) 
val_aggressive = add_essential_features(val_aggressive)

print("✅ Essential features added to all datasets")

📝 Engineering essential text features...
🔄 Adding features to all dataset versions...
✅ Essential features added to all datasets


## **🔍 Feature Overview & Validation**

In [10]:
## 🔍 FEATURE OVERVIEW
print("\n🔍 Final Feature Overview")
print("=" * 50)

# Display feature summary for each dataset type
datasets = [df_gentle, df_basic, df_aggressive]
names = ['Gentle', 'Basic', 'Aggressive']

for i, (dataset, name) in enumerate(zip(datasets, names)):
    print(f"\n📊 {name} Cleaning Dataset:")
    print(f"   Shape: {dataset.shape}")
    print(f"   Columns: {list(dataset.columns)}")
    
    # Sample feature stats
    if i == 0:  # Only show for first dataset to avoid repetition
        print(f"   Title length: {dataset['title_length'].mean():.1f} ± {dataset['title_length'].std():.1f} chars")
        print(f"   Text length: {dataset['text_length'].mean():.1f} ± {dataset['text_length'].std():.1f} chars")
        print(f"   Valid years: {dataset['year'].notna().sum()} ({dataset['year'].notna().mean():.1%})")
        print(f"   Weekend articles: {dataset['is_weekend'].sum()} ({dataset['is_weekend'].mean():.1%})")

# Verify we have the exact features we want
expected_features = [
    'label', 'title', 'text', 'subject', 'clean_title', 'clean_text',
    'title_length', 'title_word_count', 'text_length', 'text_word_count',
    'year', 'quarter', 'is_weekend'
]

print(f"\n✅ Verification - Expected features:")
missing_features = set(expected_features) - set(df_gentle.columns)
if missing_features:
    print(f"   ⚠️  Missing: {missing_features}")
else:
    print("   ✓ All expected features present")

print(f"   Total features: {len(df_gentle.columns)}")


🔍 Final Feature Overview

📊 Gentle Cleaning Dataset:
   Shape: (39942, 13)
   Columns: ['label', 'title', 'text', 'subject', 'year', 'quarter', 'is_weekend', 'clean_title', 'clean_text', 'title_length', 'title_word_count', 'text_length', 'text_word_count']
   Title length: 79.8 ± 24.8 chars
   Text length: 2384.6 ± 1765.9 chars
   Valid years: 39942 (100.0%)
   Weekend articles: 7615 (19.1%)

📊 Basic Cleaning Dataset:
   Shape: (39942, 13)
   Columns: ['label', 'title', 'text', 'subject', 'year', 'quarter', 'is_weekend', 'clean_title', 'clean_text', 'title_length', 'title_word_count', 'text_length', 'text_word_count']

📊 Aggressive Cleaning Dataset:
   Shape: (39942, 13)
   Columns: ['label', 'title', 'text', 'subject', 'year', 'quarter', 'is_weekend', 'clean_title', 'clean_text', 'title_length', 'title_word_count', 'text_length', 'text_word_count']

✅ Verification - Expected features:
   ✓ All expected features present
   Total features: 13
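
The check above only reports expected columns that are missing. A two-way comparison also surfaces leftover columns from earlier runs. This is a minimal sketch in plain Python; `check_columns` is a hypothetical helper and the column lists below are shortened examples.

```python
def check_columns(actual, expected):
    """Compare two column lists and report differences in both directions."""
    actual_set, expected_set = set(actual), set(expected)
    return {
        "missing": sorted(expected_set - actual_set),
        "unexpected": sorted(actual_set - expected_set),
    }


# Shortened example lists, for illustration only
expected_cols = ["label", "title", "text", "year", "quarter", "is_weekend"]
actual_cols = ["label", "title", "text", "year", "date_category"]

result = check_columns(actual_cols, expected_cols)
print(result)
```

In the notebook this could be called as `check_columns(df_gentle.columns, expected_features)` for each dataset version.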


## **💾 Save Cleaned Data**

In [12]:
## 💾 Save Streamlined Datasets
print("\n💾 Saving streamlined datasets...")

# Save all datasets with essential features only
datasets_to_save = [
    (df_gentle, 'cleaned_data_gentle.csv'),
    (df_basic, 'cleaned_data_basic.csv'),
    (df_aggressive, 'cleaned_data_aggressive.csv'),
    (val_gentle, 'cleaned_validation_gentle.csv'), 
    (val_basic, 'cleaned_validation_basic.csv'),
    (val_aggressive, 'cleaned_validation_aggressive.csv')
]

for dataset, filename in datasets_to_save:
    # Ensure we only keep the essential features
    essential_cols = [col for col in expected_features if col in dataset.columns]
    dataset = dataset[essential_cols]
    
    filepath = f'../dataset/01_interim/{filename}'
    dataset.to_csv(filepath, index=False)
    print(f"✅ Saved {filename} with shape {dataset.shape}")

print("\n🎯 All datasets ready with essential features only!")
print("📋 Feature set:")
print("   PRIMARY: clean_title, clean_text")
print("   TEXT: title_length, title_word_count, text_length, text_word_count")  
print("   TIME: year, quarter, is_weekend")
print("   META: label, title, text, subject")


💾 Saving streamlined datasets...
✅ Saved cleaned_data_gentle.csv with shape (39942, 13)
✅ Saved cleaned_data_basic.csv with shape (39942, 13)
✅ Saved cleaned_data_aggressive.csv with shape (39942, 13)
✅ Saved cleaned_validation_gentle.csv with shape (4956, 13)
✅ Saved cleaned_validation_basic.csv with shape (4956, 13)
✅ Saved cleaned_validation_aggressive.csv with shape (4956, 13)

🎯 All datasets ready with essential features only!
📋 Feature set:
   PRIMARY: clean_title, clean_text
   TEXT: title_length, title_word_count, text_length, text_word_count
   TIME: year, quarter, is_weekend
   META: label, title, text, subject
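
One thing worth checking after saving: a cleaned string that ends up empty (for example a title made only of stopwords under aggressive cleaning) is written to CSV as an empty field, and `pd.read_csv` reads empty fields back as `NaN` by default. The sketch below demonstrates this with an in-memory buffer instead of the real files; the two-row frame is made up for illustration.

```python
import io

import pandas as pd

# Made-up frame: the second title became empty after cleaning
df = pd.DataFrame({"clean_title": ["budget fight looms", ""], "label": [1, 0]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Default read: the empty string comes back as NaN
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False preserves empty strings as ''
buffer.seek(0)
kept = pd.read_csv(buffer, keep_default_na=False)

print(default_read["clean_title"].isna().sum())
print(kept["clean_title"].tolist())
```

Downstream notebooks that load the files from `dataset/01_interim/` may therefore want either `keep_default_na=False` or an explicit `fillna("")` on the cleaned text columns.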


In [13]:
print("✅ All cleaned datasets saved successfully!")
print("Files saved to: dataset/01_interim/")
print("\nMain datasets:")
print(f"- cleaned_data_gentle.csv               ({df_gentle.shape})")
print(f"- cleaned_data_basic.csv                ({df_basic.shape})")
print(f"- cleaned_data_aggressive.csv           ({df_aggressive.shape})")
print("\nValidation datasets:")
print(f"- cleaned_validation_gentle.csv         ({val_gentle.shape})")
print(f"- cleaned_validation_basic.csv          ({val_basic.shape})")
print(f"- cleaned_validation_aggressive.csv     ({val_aggressive.shape})")


✅ All cleaned datasets saved successfully!
Files saved to: dataset/01_interim/

Main datasets:
- cleaned_data_gentle.csv               ((39942, 13))
- cleaned_data_basic.csv                ((39942, 13))
- cleaned_data_aggressive.csv           ((39942, 13))

Validation datasets:
- cleaned_validation_gentle.csv         ((4956, 13))
- cleaned_validation_basic.csv          ((4956, 13))
- cleaned_validation_aggressive.csv     ((4956, 13))


In [14]:
display(df_gentle.head(1))

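|   | label | title | text | subject | year | quarter | is_weekend | clean_title | clean_text | title_length | title_word_count | text_length | text_word_count |
|---|-------|-------|------|---------|------|---------|------------|-------------|------------|--------------|------------------|-------------|-----------------|
| 0 | 1 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | 2017.0 | Q4.0 | True | as u.s. budget fight looms republicans flip th... | washington reuters the head of a conservative ... | 64 | 10 | 4659 | 749 |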


In [15]:
display(df_basic.head(1))

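|   | label | title | text | subject | year | quarter | is_weekend | clean_title | clean_text | title_length | title_word_count | text_length | text_word_count |
|---|-------|-------|------|---------|------|---------|------------|-------------|------------|--------------|------------------|-------------|-----------------|
| 0 | 1 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | 2017.0 | Q4.0 | True | as u.s. budget fight looms, republicans flip t... | washington reuters the head of a conservative ... | 64 | 10 | 4659 | 749 |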


In [12]:
display(df_aggressive.head(1))

|   | label | title | text | subject | date_cleaned | year | date_category | clean_title | clean_text |
|---|-------|-------|------|---------|--------------|------|---------------|-------------|------------|
| 0 | 1 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | December 31, 2017 | 2017 | Medium (2015-2019) | budget fight looms republicans flip their fisc... | washington reuters the head conservative repub... |



---

## **🚀 How to Use This Structure**

### **Option 1: Run from Notebook (Recommended for Learning)**
1.  Create `02_Data_Cleaning.ipynb` with the content above
2.  Run each cell step by step to see what happens

### **Option 2: Run from Command Line (More Professional)**
```bash
# Run the cleaning pipeline directly
python src/data_cleaning.py
```

### **Option 3: Import and Use in Other Notebooks**
```python
# In any notebook, you can now import your functions
from src.data_cleaning import clean_text, run_clean_pipeline

# Use individual function
clean_text("Some messy text!")

# Or run the whole pipeline
cleaned_df = run_clean_pipeline('input.csv', 'output.csv')
```

This structure gives you both the interactive notebook for learning and the reusable Python functions for professional development!
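
Option 2 assumes `src/data_cleaning.py` can be run as a script; whether it already has a command-line entry point is not shown here. As a rough sketch of what one could look like, the example below parses an input path, an output path, and a strategy name with `argparse`. The `--strategy` flag and the `main` function are hypothetical, and the call to `run_clean_pipeline` is left as a comment because its exact signature beyond `(input, output)` is not documented in this notebook.

```python
import argparse


def build_parser():
    """Argument parser for a hypothetical cleaning CLI."""
    parser = argparse.ArgumentParser(description="Run the cleaning pipeline (sketch).")
    parser.add_argument("input_csv", help="path to the raw CSV")
    parser.add_argument("output_csv", help="path for the cleaned CSV")
    parser.add_argument(
        "--strategy",
        choices=["gentle", "basic", "aggressive"],
        default="basic",
        help="cleaning strategy to apply (hypothetical flag)",
    )
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # In the real module, this is where something like
    # run_clean_pipeline(args.input_csv, args.output_csv) would be called.
    print(f"Would clean {args.input_csv} -> {args.output_csv} using '{args.strategy}'")
    return args


# Example invocation with explicit arguments instead of sys.argv
args = main(["input.csv", "output.csv", "--strategy", "gentle"])
```

Inside the module itself, `main()` would typically be called from an `if __name__ == "__main__":` guard so that importing the functions from a notebook does not trigger the pipeline.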