
---

## **📁 File: `02_Data_Cleaning.ipynb`**

# **🧹 02 - Data Cleaning & Preparation**

## **📑 Table of Contents**
1.  [🎯 Objectives](#-objectives)
2.  [⚙️ Setup & Import Functions](#-setup--import-functions)
3.  [📥 Load Raw Data](#-load-raw-data)
4.  [🔧 Apply Cleaning Functions](#-apply-cleaning-functions)
5.  [📊 Verify Cleaning Results](#-verify-cleaning-results)
6.  [💾 Save Cleaned Data](#-save-cleaned-data)

---

## **🎯 Objectives**
- Load the raw data from `dataset/00_raw/`
- Apply text cleaning functions to prepare for NLP
- Handle missing values and data quality issues
- Save cleaned data to `dataset/01_interim/` for future use

---



## **⚙️ Setup & Import Functions**


In [1]:

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

import sys
import os

# Add the project root directory to Python path
sys.path.append(os.path.abspath('..'))

# Now import your modules
from src.data_cleaning import clean_text, run_clean_pipeline
from src.data_cleaning import gentle_clean_text, basic_clean_text, aggressive_clean_text


%matplotlib inline
print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!



---

## **📥 Load Raw Data**


In [2]:
# Load the raw datasets
print("Loading raw data...")
raw_df = pd.read_csv('../dataset/00_raw/data.csv')
raw_val_df = pd.read_csv('../dataset/00_raw/validation_data.csv')

print(f"Main dataset shape: {raw_df.shape}")
print(f"Validation dataset shape: {raw_val_df.shape}")

# Display first few rows
print("\nMain dataset preview:")
display(raw_df.head(2))
print("\nValidation dataset preview:")
display(raw_val_df.head(2))


Loading raw data...
Main dataset shape: (39942, 5)
Validation dataset shape: (4956, 5)

Main dataset preview:


| | label | title | text | subject | date |
|---|---|---|---|---|---|
| 0 | 1 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | December 31, 2017 |
| 1 | 1 | U.S. military to accept transgender recruits o... | WASHINGTON (Reuters) - Transgender people will... | politicsNews | December 29, 2017 |



Validation dataset preview:


| | label | title | text | subject | date |
|---|---|---|---|---|---|
| 0 | 2 | UK's May 'receiving regular updates' on London... | LONDON (Reuters) - British Prime Minister Ther... | worldnews | September 15, 2017 |
| 1 | 2 | UK transport police leading investigation of L... | LONDON (Reuters) - British counter-terrorism p... | worldnews | September 15, 2017 |
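The objectives mention handling missing values, so it can help to count them before any cleaning function is applied. The following is a minimal sketch, not part of the original notebook: `missing_report` is a hypothetical helper and the toy frame merely stands in for `raw_df`.

```python
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count missing (NaN/None) and whitespace-only values per column."""
    rows = []
    for col in df.columns:
        n_missing = int(df[col].isna().sum())
        # Cast to str so the check also works on non-text columns.
        stripped = df[col].dropna().astype(str).str.strip()
        n_blank = int(stripped.eq("").sum())
        rows.append({"column": col, "missing": n_missing, "blank": n_blank})
    return pd.DataFrame(rows).set_index("column")

# Toy frame standing in for raw_df (assumed column names: label, title, text)
toy = pd.DataFrame({
    "label": [1, 0, 1],
    "title": ["Budget fight looms", None, "   "],
    "text": ["WASHINGTON (Reuters) - ...", "Some body text", "More text"],
})
report = missing_report(toy)
print(report)
```

In the notebook, the same helper could be called as `missing_report(raw_df)` and `missing_report(raw_val_df)` to decide whether rows should be dropped or filled before cleaning.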



---

## **🔧 Apply Cleaning Functions**


In [None]:
# TODO: delete? Superseded by the three-strategy cleaning cell below.
# # Create a copy to work with
# df_clean = raw_df.copy()
# val_df_clean = raw_val_df.copy()

# print("Applying cleaning functions...")

# # Apply text cleaning to title column
# df_clean['clean_title'] = df_clean['title'].apply(clean_text)
# val_df_clean['clean_title'] = val_df_clean['title'].apply(clean_text)

# # Apply text cleaning to text column (optional)
# df_clean['clean_text'] = df_clean['text'].apply(clean_text)
# val_df_clean['clean_text'] = val_df_clean['text'].apply(clean_text)

# # Drop the date column (as recommended from EDA)
# df_clean = df_clean.drop(columns=['date'])
# val_df_clean = val_df_clean.drop(columns=['date'])

# print("✅ Cleaning completed!")

In [4]:
# Create copies for different cleaning strategies
df_gentle = raw_df.copy()
df_basic = raw_df.copy() 
df_aggressive = raw_df.copy()

val_gentle = raw_val_df.copy()
val_basic = raw_val_df.copy()
val_aggressive = raw_val_df.copy()

print("Applying different cleaning strategies...")

# Apply GENTLE cleaning (preserves context for embeddings)
df_gentle['clean_title'] = df_gentle['title'].apply(gentle_clean_text)
df_gentle['clean_text'] = df_gentle['text'].apply(gentle_clean_text)
val_gentle['clean_title'] = val_gentle['title'].apply(gentle_clean_text)
val_gentle['clean_text'] = val_gentle['text'].apply(gentle_clean_text)

# Apply BASIC cleaning (for sentence transformers)
df_basic['clean_title'] = df_basic['title'].apply(basic_clean_text)
df_basic['clean_text'] = df_basic['text'].apply(basic_clean_text)
val_basic['clean_title'] = val_basic['title'].apply(basic_clean_text)
val_basic['clean_text'] = val_basic['text'].apply(basic_clean_text)

# Apply AGGRESSIVE cleaning (for traditional NLP)
df_aggressive['clean_title'] = df_aggressive['title'].apply(aggressive_clean_text)
df_aggressive['clean_text'] = df_aggressive['text'].apply(aggressive_clean_text)
val_aggressive['clean_title'] = val_aggressive['title'].apply(aggressive_clean_text)
val_aggressive['clean_text'] = val_aggressive['text'].apply(aggressive_clean_text)

# Drop date column from all datasets (as recommended from EDA)
datasets = [df_gentle, df_basic, df_aggressive, val_gentle, val_basic, val_aggressive]
for dataset in datasets:
    if 'date' in dataset.columns:
        dataset.drop(columns=['date'], inplace=True)

print("✅ All cleaning strategies completed!")

Applying different cleaning strategies...
✅ All cleaning strategies completed!
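The cell above repeats the same four lines for each strategy. One possible way to reduce the repetition is to loop over a dictionary of cleaners. This is only a sketch: `apply_strategies` is a hypothetical helper, and the two stand-in cleaners exist so the example runs on its own (in the notebook they would be `gentle_clean_text`, `basic_clean_text` and `aggressive_clean_text`).

```python
import pandas as pd

# Stand-in cleaners so the sketch is self-contained (not the project's functions).
def _lower(text):
    return text.lower()

def _lower_no_commas(text):
    return text.lower().replace(",", "")

strategies = {"basic": _lower, "gentle": _lower_no_commas}

def apply_strategies(raw, strategies, text_cols=("title", "text"), drop_cols=("date",)):
    """Return one cleaned copy of `raw` per strategy name, leaving `raw` untouched."""
    cleaned = {}
    for name, cleaner in strategies.items():
        df = raw.copy()
        for col in text_cols:
            df[f"clean_{col}"] = df[col].apply(cleaner)
        # errors="ignore" skips columns that are already absent
        cleaned[name] = df.drop(columns=list(drop_cols), errors="ignore")
    return cleaned

toy = pd.DataFrame({
    "title": ["Budget fight looms, Republicans flip"],
    "text": ["WASHINGTON (Reuters) - The head"],
    "date": ["December 31, 2017"],
})
variants = apply_strategies(toy, strategies)
print(variants["gentle"].columns.tolist())
```

Returning a dictionary keyed by strategy name keeps the main and validation variants easy to iterate over later, for example when saving.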


In [5]:

# Test the new cleaning functions
test_titles = [
    "As U.S. budget fight looms, Republicans flip their fiscal script",
    "U.S. military to accept transgender recruits on Monday: Pentagon",
    "Senior U.S. Republican senator: 'Let Mr. Mueller do his job'"
]

print("COMPARISON OF CLEANING APPROACHES:")
print("=" * 60)

for title in test_titles:
    print(f"\nOriginal:    {title}")
    print(f"Gentle:      {gentle_clean_text(title)}")
    print(f"Basic:       {basic_clean_text(title)}")
    print(f"Aggressive:  {aggressive_clean_text(title)}")

COMPARISON OF CLEANING APPROACHES:

Original:    As U.S. budget fight looms, Republicans flip their fiscal script
Gentle:      as u.s. budget fight looms republicans flip their fiscal script
Basic:       as u.s. budget fight looms, republicans flip their fiscal script
Aggressive:  budget fight looms republicans flip their fiscal script

Original:    U.S. military to accept transgender recruits on Monday: Pentagon
Gentle:      u.s. military to accept transgender recruits on monday pentagon
Basic:       u.s. military to accept transgender recruits on monday pentagon
Aggressive:  military accept transgender recruits monday pentagon

Original:    Senior U.S. Republican senator: 'Let Mr. Mueller do his job'
Gentle:      senior u.s. republican senator let mr. mueller do his job
Basic:       senior u.s. republican senator let mr. mueller do his job
Aggressive:  senior republican senator let mr mueller his job
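The implementations in `src/data_cleaning.py` are not shown in this notebook. As a rough illustration of how three tiers like these could differ, here is a sketch that happens to reproduce the sample outputs above. The function names, the regular expressions, and the small stopword set are all assumptions, not the project's actual code.

```python
import re

# Illustrative stopword set; the project's real list is not shown in the notebook.
SKETCH_STOPWORDS = {"a", "an", "as", "to", "on", "of", "in", "is", "do", "be", "by"}

def _normalize(text, keep_pattern):
    """Lowercase, replace characters outside keep_pattern with spaces, collapse whitespace."""
    if not isinstance(text, str):
        return ""  # treat missing values as empty text
    text = re.sub(keep_pattern, " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def sketch_basic(text):
    """Keep letters, digits and common sentence punctuation (. , ! ?)."""
    return _normalize(text, r"[^a-z0-9.,!?\s]")

def sketch_gentle(text):
    """Keep letters, digits and periods only, so 'u.s.' and 'mr.' survive."""
    return _normalize(text, r"[^a-z0-9.\s]")

def sketch_aggressive(text):
    """Keep letters only, then drop single characters and stopwords."""
    tokens = _normalize(text, r"[^a-z\s]").split()
    return " ".join(t for t in tokens if len(t) > 1 and t not in SKETCH_STOPWORDS)

title = "As U.S. budget fight looms, Republicans flip their fiscal script"
print(f"Gentle:      {sketch_gentle(title)}")
print(f"Basic:       {sketch_basic(title)}")
print(f"Aggressive:  {sketch_aggressive(title)}")
```

Note that the aggressive tier loses "U.S." entirely because the periods are removed first and the remaining single letters are then dropped, which matches what the comparison output shows.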



---

## **📊 Verify Cleaning Results**


In [6]:
print("CLEANING VERIFICATION:")
print("=" * 50)

# Compare different cleaning strategies on same example
example_text = "U.S. military to accept transgender recruits on Monday: Pentagon"

print("Original:", example_text)
print("Gentle:", gentle_clean_text(example_text))
print("Basic:", basic_clean_text(example_text)) 
print("Aggressive:", aggressive_clean_text(example_text))
print()

# Check dataset info
print("Dataset shapes after cleaning:")
print(f"Gentle: {df_gentle.shape}")
print(f"Basic: {df_basic.shape}")
print(f"Aggressive: {df_aggressive.shape}")

# Check first few rows of each
print("\nSample cleaned titles (first 2 rows):")
print("\nGENTLE cleaning:")
for i in range(2):
    print(f"  {df_gentle['clean_title'].iloc[i][:100]}...")

print("\nBASIC cleaning:")
for i in range(2):
    print(f"  {df_basic['clean_title'].iloc[i][:100]}...")

print("\nAGGRESSIVE cleaning:")
for i in range(2):
    print(f"  {df_aggressive['clean_title'].iloc[i][:100]}...")

CLEANING VERIFICATION:
Original: U.S. military to accept transgender recruits on Monday: Pentagon
Gentle: u.s. military to accept transgender recruits on monday pentagon
Basic: u.s. military to accept transgender recruits on monday pentagon
Aggressive: military accept transgender recruits monday pentagon

Dataset shapes after cleaning:
Gentle: (39942, 6)
Basic: (39942, 6)
Aggressive: (39942, 6)

Sample cleaned titles (first 2 rows):

GENTLE cleaning:
  as u.s. budget fight looms republicans flip their fiscal script...
  u.s. military to accept transgender recruits on monday pentagon...

BASIC cleaning:
  as u.s. budget fight looms, republicans flip their fiscal script...
  u.s. military to accept transgender recruits on monday pentagon...

AGGRESSIVE cleaning:
  budget fight looms republicans flip their fiscal script...
  military accept transgender recruits monday pentagon...
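Eyeballing a few rows is a good start, but a numeric check can catch problems such as titles that become empty after aggressive cleaning. The sketch below is an optional addition: `cleaning_summary` is a hypothetical helper, and the toy frame stands in for `df_gentle`, `df_basic` or `df_aggressive`.

```python
import pandas as pd

def cleaning_summary(df, raw_col="title", clean_col="clean_title"):
    """Compare whitespace-token counts before and after cleaning."""
    raw_tokens = df[raw_col].fillna("").str.split().str.len()
    clean_tokens = df[clean_col].fillna("").str.split().str.len()
    return {
        "rows": len(df),
        "empty_after_cleaning": int((clean_tokens == 0).sum()),
        "mean_raw_tokens": float(raw_tokens.mean()),
        "mean_clean_tokens": float(clean_tokens.mean()),
    }

# Toy frame: the second title consists only of words an aggressive cleaner might drop.
toy = pd.DataFrame({
    "title": ["U.S. military to accept recruits", "To be"],
    "clean_title": ["military accept recruits", ""],
})
summary = cleaning_summary(toy)
print(summary)
```

A non-zero `empty_after_cleaning` count would suggest filtering those rows or falling back to a gentler strategy for them.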



---

## **💾 Save Cleaned Data**


In [7]:
# Save all cleaned datasets
print("Saving cleaned datasets...")

# Main datasets
df_gentle.to_csv('../dataset/01_interim/cleaned_data_gentle.csv', index=False)
df_basic.to_csv('../dataset/01_interim/cleaned_data_basic.csv', index=False) 
df_aggressive.to_csv('../dataset/01_interim/cleaned_data_aggressive.csv', index=False)

# Validation datasets
val_gentle.to_csv('../dataset/01_interim/cleaned_validation_gentle.csv', index=False)
val_basic.to_csv('../dataset/01_interim/cleaned_validation_basic.csv', index=False)
val_aggressive.to_csv('../dataset/01_interim/cleaned_validation_aggressive.csv', index=False)

print("✅ All cleaned datasets saved successfully!")
print("Files saved to: dataset/01_interim/")
print("\nMain datasets:")
print(f"- cleaned_data_gentle.csv               {df_gentle.shape}")
print(f"- cleaned_data_basic.csv                {df_basic.shape}")
print(f"- cleaned_data_aggressive.csv           {df_aggressive.shape}")
print("\nValidation datasets:")
print(f"- cleaned_validation_gentle.csv         {val_gentle.shape}")
print(f"- cleaned_validation_basic.csv          {val_basic.shape}")
print(f"- cleaned_validation_aggressive.csv     {val_aggressive.shape}")


Saving cleaned datasets...
✅ All cleaned datasets saved successfully!
Files saved to: dataset/01_interim/

Main datasets:
- cleaned_data_gentle.csv               (39942, 6)
- cleaned_data_basic.csv                (39942, 6)
- cleaned_data_aggressive.csv           (39942, 6)

Validation datasets:
- cleaned_validation_gentle.csv         (4956, 6)
- cleaned_validation_basic.csv          (4956, 6)
- cleaned_validation_aggressive.csv     (4956, 6)
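`to_csv` fails if the target folder does not exist, and empty strings come back as `NaN` when the CSV is read again. The sketch below illustrates both points under stated assumptions: `save_variants` is a hypothetical helper, and a temporary directory stands in for `dataset/01_interim/`.

```python
import tempfile
from pathlib import Path

import pandas as pd

def save_variants(variants, out_dir, prefix="cleaned_data"):
    """Write each DataFrame to <out_dir>/<prefix>_<name>.csv, creating out_dir if needed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for name, df in variants.items():
        path = out_dir / f"{prefix}_{name}.csv"
        df.to_csv(path, index=False)
        paths[name] = path
    return paths

# Toy frame: the second cleaned title is an empty string.
toy = pd.DataFrame({"label": [1, 0], "clean_title": ["budget fight looms", ""]})

with tempfile.TemporaryDirectory() as tmp:
    paths = save_variants({"gentle": toy}, Path(tmp) / "01_interim")
    saved_name = paths["gentle"].name
    reloaded = pd.read_csv(paths["gentle"])
    roundtrip_shape = reloaded.shape
    # With default settings, read_csv parses the empty field as NaN.
    empty_became_nan = bool(reloaded["clean_title"].isna().iloc[1])

print(saved_name, roundtrip_shape, empty_became_nan)
```

Downstream notebooks that load the interim files may therefore want to call `.fillna("")` on the cleaned text columns, or pass `keep_default_na=False` to `pd.read_csv`.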


In [9]:
display(df_gentle)

| | label | title | text | subject | clean_title | clean_text |
|---|---|---|---|---|---|---|
| 0 | 1 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | as u.s. budget fight looms republicans flip th... | washington reuters the head of a conservative ... |
| 1 | 1 | U.S. military to accept transgender recruits o... | WASHINGTON (Reuters) - Transgender people will... | politicsNews | u.s. military to accept transgender recruits o... | washington reuters transgender people will be ... |
| 2 | 1 | Senior U.S. Republican senator: 'Let Mr. Muell... | WASHINGTON (Reuters) - The special counsel inv... | politicsNews | senior u.s. republican senator let mr. mueller... | washington reuters the special counsel investi... |
| 3 | 1 | FBI Russia probe helped by Australian diplomat... | WASHINGTON (Reuters) - Trump campaign adviser ... | politicsNews | fbi russia probe helped by australian diplomat... | washington reuters trump campaign adviser geor... |
| 4 | 1 | Trump wants Postal Service to charge 'much mor... | SEATTLE/WASHINGTON (Reuters) - President Donal... | politicsNews | trump wants postal service to charge much more... | seattle washington reuters president donald tr... |
| ... | ... | ... | ... | ... | ... | ... |
| 39937 | 0 | THIS IS NOT A JOKE! Soros-Linked Group Has Pla... | The Left has been organizing for decades, and ... | left-news | this is not a joke soros linked group has plan... | the left has been organizing for decades and g... |
| 39938 | 0 | THE SMARTEST WOMAN In Politics: “How Trump Can... | Monica Crowley offers some of the most brillia... | left-news | the smartest woman in politics how trump can k... | monica crowley offers some of the most brillia... |
| 39939 | 0 | BREAKING! SHOCKING VIDEO FROM CHARLOTTE RIOTS:... | Protest underway in Charlotte: Things got com... | left-news | breaking shocking video from charlotte riots t... | protest underway in charlotte things got compl... |
| 39940 | 0 | BREAKING! Charlotte News Station Reports Cops ... | Local Charlotte, NC news station WSOCTV is rep... | left-news | breaking charlotte news station reports cops h... | local charlotte nc news station wsoctv is repo... |


In [10]:
display(df_basic)

| | label | title | text | subject | clean_title | clean_text |
|---|---|---|---|---|---|---|
| 0 | 1 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | as u.s. budget fight looms, republicans flip t... | washington reuters the head of a conservative ... |
| 1 | 1 | U.S. military to accept transgender recruits o... | WASHINGTON (Reuters) - Transgender people will... | politicsNews | u.s. military to accept transgender recruits o... | washington reuters transgender people will be ... |
| 2 | 1 | Senior U.S. Republican senator: 'Let Mr. Muell... | WASHINGTON (Reuters) - The special counsel inv... | politicsNews | senior u.s. republican senator let mr. mueller... | washington reuters the special counsel investi... |
| 3 | 1 | FBI Russia probe helped by Australian diplomat... | WASHINGTON (Reuters) - Trump campaign adviser ... | politicsNews | fbi russia probe helped by australian diplomat... | washington reuters trump campaign adviser geor... |
| 4 | 1 | Trump wants Postal Service to charge 'much mor... | SEATTLE/WASHINGTON (Reuters) - President Donal... | politicsNews | trump wants postal service to charge much more... | seattle washington reuters president donald tr... |
| ... | ... | ... | ... | ... | ... | ... |
| 39937 | 0 | THIS IS NOT A JOKE! Soros-Linked Group Has Pla... | The Left has been organizing for decades, and ... | left-news | this is not a joke! soros linked group has pla... | the left has been organizing for decades, and ... |
| 39938 | 0 | THE SMARTEST WOMAN In Politics: “How Trump Can... | Monica Crowley offers some of the most brillia... | left-news | the smartest woman in politics how trump can k... | monica crowley offers some of the most brillia... |
| 39939 | 0 | BREAKING! SHOCKING VIDEO FROM CHARLOTTE RIOTS:... | Protest underway in Charlotte: Things got com... | left-news | breaking! shocking video from charlotte riots ... | protest underway in charlotte things got compl... |
| 39940 | 0 | BREAKING! Charlotte News Station Reports Cops ... | Local Charlotte, NC news station WSOCTV is rep... | left-news | breaking! charlotte news station reports cops ... | local charlotte, nc news station wsoctv is rep... |


In [12]:
display(df_aggressive)

| | label | title | text | subject | clean_title | clean_text |
|---|---|---|---|---|---|---|
| 0 | 1 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | budget fight looms republicans flip their fisc... | washington reuters the head conservative repub... |
| 1 | 1 | U.S. military to accept transgender recruits o... | WASHINGTON (Reuters) - Transgender people will... | politicsNews | military accept transgender recruits monday pe... | washington reuters transgender people will all... |
| 2 | 1 | Senior U.S. Republican senator: 'Let Mr. Muell... | WASHINGTON (Reuters) - The special counsel inv... | politicsNews | senior republican senator let mr mueller his job | washington reuters the special counsel investi... |
| 3 | 1 | FBI Russia probe helped by Australian diplomat... | WASHINGTON (Reuters) - Trump campaign adviser ... | politicsNews | fbi russia probe helped australian diplomat ti... | washington reuters trump campaign adviser geor... |
| 4 | 1 | Trump wants Postal Service to charge 'much mor... | SEATTLE/WASHINGTON (Reuters) - President Donal... | politicsNews | trump wants postal service charge much more fo... | seattle washington reuters president donald tr... |
| ... | ... | ... | ... | ... | ... | ... |
| 39937 | 0 | THIS IS NOT A JOKE! Soros-Linked Group Has Pla... | The Left has been organizing for decades, and ... | left-news | this not joke soros linked group has plan dest... | the left has been organizing for decades and g... |
| 39938 | 0 | THE SMARTEST WOMAN In Politics: “How Trump Can... | Monica Crowley offers some of the most brillia... | left-news | the smartest woman politics how trump can knoc... | monica crowley offers some the most brilliant ... |
| 39939 | 0 | BREAKING! SHOCKING VIDEO FROM CHARLOTTE RIOTS:... | Protest underway in Charlotte: Things got com... | left-news | breaking shocking video from charlotte riots t... | protest underway charlotte things got complete... |
| 39940 | 0 | BREAKING! Charlotte News Station Reports Cops ... | Local Charlotte, NC news station WSOCTV is rep... | left-news | breaking charlotte news station reports cops h... | local charlotte news station wsoctv reporting ... |


In [None]:
# TODO: delete? Superseded by the per-strategy save cell above.
# # Save cleaned data to interim folder
# print("Saving cleaned data...")

# df_clean.to_csv('../dataset/01_interim/cleaned_data.csv', index=False)
# val_df_clean.to_csv('../dataset/01_interim/cleaned_validation.csv', index=False)

# print("✅ Cleaned data saved successfully!")
# print("Files saved to: dataset/01_interim/")
# print(f"- cleaned_data.csv ({df_clean.shape})")
# print(f"- cleaned_validation.csv ({val_df_clean.shape})")



---

## **🚀 How to Use This Structure**

### **Option 1: Run from Notebook (Recommended for Learning)**
1.  Create `02_Data_Cleaning.ipynb` with the content above
2.  Run each cell step by step to see what happens

### **Option 2: Run from Command Line (More Professional)**
```bash
# Run the cleaning pipeline directly (from the project root, so that src/ resolves)
python src/data_cleaning.py
```

### **Option 3: Import and Use in Other Notebooks**
```python
# In any notebook, you can now import your functions
from src.data_cleaning import clean_text, run_clean_pipeline

# Use individual function
clean_text("Some messy text!")

# Or run the whole pipeline
cleaned_df = run_clean_pipeline('input.csv', 'output.csv')
```
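The actual signature and behaviour of `run_clean_pipeline` are defined in `src/data_cleaning.py` and are not shown here. As a rough idea of what a file-to-file pipeline of this kind might look like, here is a sketch with hypothetical names (`sketch_clean_text`, `sketch_run_pipeline`) and an assumed input/output path interface. It runs on a throwaway CSV rather than the real dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd

def sketch_clean_text(text):
    """Placeholder cleaner: lowercase and collapse whitespace (not the project's clean_text)."""
    if not isinstance(text, str):
        return ""
    return " ".join(text.lower().split())

def sketch_run_pipeline(input_path, output_path, text_cols=("title", "text")):
    """Read a CSV, add clean_<col> for each text column present, write the result."""
    df = pd.read_csv(input_path)
    for col in text_cols:
        if col in df.columns:
            df[f"clean_{col}"] = df[col].apply(sketch_clean_text)
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(output_path, index=False)
    return df

# Demo on a temporary CSV so the sketch runs without the real data.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "input.csv"
    dst = Path(tmp) / "out" / "output.csv"
    pd.DataFrame({"title": ["Some  MESSY text!"], "label": [1]}).to_csv(src, index=False)
    result = sketch_run_pipeline(src, dst)
    output_written = dst.exists()

print(result)
```

Returning the DataFrame as well as writing it lets the same function serve both the command-line use and interactive use in a notebook.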

This structure gives you both an interactive notebook for learning and reusable Python functions that other notebooks and scripts can import.