# Explore Labeled RSS Data

This notebook loads the collected RSS articles and their labels to explore the dataset before model training.

## Goals:
1. Load RSS articles and labels
2. Explore the dataset structure
3. Show examples of each class (advertisement vs news)
4. Analyze label distribution
5. Prepare data for future model training


In [1]:
# Import necessary libraries
import json
import pandas as pd
import numpy as np
from collections import Counter
import glob
import os

print("Libraries imported successfully!")


Libraries imported successfully!


In [2]:
# Load the latest RSS articles and labels
data_dir = '../data'

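# Load articles
articles_files = glob.glob(os.path.join(data_dir, 'rss_articles_*.json'))
if not articles_files:
    raise FileNotFoundError(f"No rss_articles_*.json files found in {data_dir}")
# File names end in a YYYYMMDD_HHMMSS timestamp, so the lexicographic max is the newest file
latest_articles_file = max(articles_files)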
print(f"Loading articles from: {latest_articles_file}")

with open(latest_articles_file, 'r', encoding='utf-8') as f:
    articles = json.load(f)

print(f"✅ Loaded {len(articles)} articles")

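# Load labels
labels_files = glob.glob(os.path.join(data_dir, 'labels_*.json'))
if not labels_files:
    raise FileNotFoundError(f"No labels_*.json files found in {data_dir}")
# File names end in a YYYYMMDD_HHMMSS timestamp, so the lexicographic max is the newest file
latest_labels_file = max(labels_files)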
print(f"Loading labels from: {latest_labels_file}")

with open(latest_labels_file, 'r', encoding='utf-8') as f:
    labels_data = json.load(f)

labels = labels_data['labels']
print(f"✅ Loaded {len(labels)} labels")

# Convert to DataFrame for easier analysis
df = pd.DataFrame(articles)
print(f"\nDataFrame shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")


Loading articles from: ../data/rss_articles_20250922_000542.json
✅ Loaded 1608 articles
Loading labels from: ../data/labels_20250922_112846.json
✅ Loaded 1213 labels

DataFrame shape: (1608, 8)
Columns: ['source', 'title', 'link', 'description', 'published', 'summary', 'tags', 'feed_url']
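
The label file holds fewer entries (1213) than there are articles (1608), and the next cell matches labels to articles by treating each label key as the article's position, written as a string. Before relying on that mapping, it can help to confirm that every key points at a real article. The sketch below is one way to do that; `check_label_keys` and the toy data are hypothetical and not part of the original notebook.

```python
def check_label_keys(num_articles, labels):
    """Split label keys into those that match an article position and those that do not.

    Assumes `labels` is a dict keyed by the article's list position as a string,
    e.g. {"0": "news", "7": "advertisement"} (inferred from the mapping cell).
    """
    valid, orphaned = [], []
    for key in labels:
        if key.isdecimal() and int(key) < num_articles:
            valid.append(key)
        else:
            orphaned.append(key)
    return valid, orphaned


# Toy data standing in for the real article and label files
toy_labels = {"0": "news", "2": "advertisement", "9": "news", "abc": "news"}
valid, orphaned = check_label_keys(num_articles=3, labels=toy_labels)
print(f"Valid keys: {valid}")
print(f"Orphaned keys: {orphaned}")
```

In the notebook itself, this would be called as `check_label_keys(len(articles), labels)`; any orphaned keys would silently drop out of the labeled set.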


In [3]:
# Add labels to DataFrame
df['label'] = df.index.astype(str).map(labels)
df['is_labeled'] = df['label'].notna()

# Filter to only labeled articles
labeled_df = df[df['is_labeled']].copy()
print(f"Labeled articles: {len(labeled_df)}")

# Label distribution
label_counts = labeled_df['label'].value_counts()
print(f"\nLabel distribution:")
print(label_counts)
print(f"\nLabel percentages:")
print((label_counts / len(labeled_df) * 100).round(1))


Labeled articles: 1213

Label distribution:
label
news             1034
advertisement     179
Name: count, dtype: int64

Label percentages:
label
news             85.2
advertisement    14.8
Name: count, dtype: float64
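
With 85.2% news and 14.8% advertisement, the classes are clearly imbalanced. The notebook does not choose a way to handle this; one common option is inverse-frequency class weights, `n_samples / (n_classes * class_count)`, the same formula behind scikit-learn's `class_weight='balanced'` setting. A minimal sketch using the counts printed above (`balanced_class_weights` is a hypothetical helper):

```python
from collections import Counter


def balanced_class_weights(label_counts):
    """Return inverse-frequency weights: n_samples / (n_classes * class_count)."""
    total = sum(label_counts.values())
    n_classes = len(label_counts)
    return {label: total / (n_classes * count) for label, count in label_counts.items()}


# Counts copied from the label distribution output above
counts = Counter({"news": 1034, "advertisement": 179})
weights = balanced_class_weights(counts)
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

The minority class receives the larger weight, so misclassified advertisements would cost more during training than misclassified news items.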


In [4]:
# Show examples of each class
def show_examples(df, label, num_examples=5):
    """Show examples of a specific label"""
    examples = df[df['label'] == label].head(num_examples)
    
    print(f"\n{'='*60}")
    print(f"📋 {label.upper()} EXAMPLES ({len(examples)} shown)")
    print(f"{'='*60}")
    
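    for idx, row in examples.iterrows():
        # idx is the row's DataFrame index, so the number shown is the
        # article's position in the full dataset plus one, not a running count
        print(f"\n🔹 Example {idx + 1}:")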
        print(f"   Source: {row['source']}")
        print(f"   Title: {row['title']}")
        if pd.notna(row['description']) and row['description']:
            desc = row['description'][:200] + "..." if len(row['description']) > 200 else row['description']
            print(f"   Description: {desc}")
        print(f"   Published: {row['published']}")
        print("-" * 50)

# Show examples of each class
for label in ['advertisement', 'news']:
    if label in labeled_df['label'].values:
        show_examples(labeled_df, label, num_examples=3)



📋 ADVERTISEMENT EXAMPLES (3 shown)

🔹 Example 7:
   Source: TechCrunch
   Title: 6 days left: Last chance for Regular Bird savings for TechCrunch Disrupt 2025 passes
   Description: Time’s ticking! Register by September 26 at 11:59 p.m. PT to lock in Regular-Bird pricing and save up to $668 on your pass to TechCrunch Disrupt 2025, happening in San Francisco's Moscone West on Octo...
   Published: Sun, 21 Sep 2025 14:00:00 +0000
--------------------------------------------------

🔹 Example 12:
   Source: TechCrunch
   Title: Only 7 days left to save on TechCrunch Disrupt 2025 tickets — lock in Regular Bird pricing now
   Description: Time is running out to grab your pass to TechCrunch Disrupt 2025, happening October 27–29 in San Francisco. With less than 7 days left to lock in Regular Bird pricing, now’s your chance to save up to ...
   Published: Sat, 20 Sep 2025 14:00:00 +0000
--------------------------------------------------

🔹 Example 17:
   Source: TechCrunch
   Title: Best Apple

In [5]:
# Analyze by source
print(f"\n{'='*60}")
print("📊 LABEL DISTRIBUTION BY SOURCE")
print(f"{'='*60}")

source_analysis = labeled_df.groupby(['source', 'label']).size().unstack(fill_value=0)
source_analysis['total'] = source_analysis.sum(axis=1)
source_analysis['ad_percentage'] = (source_analysis['advertisement'] / source_analysis['total'] * 100).round(1)
source_analysis['news_percentage'] = (source_analysis['news'] / source_analysis['total'] * 100).round(1)

print(source_analysis.sort_values('total', ascending=False).head(10))



📊 LABEL DISTRIBUTION BY SOURCE
label                      advertisement  news  total  ad_percentage  \
source                                                                 
9to5Mac                                0    50     50            0.0   
BBC News - Technology                  0    50     50            0.0   
ExtremeTech                            6    44     50           12.0   
Engadget                              13    37     50           26.0   
DeepMind Blog                          1    49     50            2.0   
Lifehacker                            26    24     50           52.0   
Mashable (Tech)                       23    27     50           46.0   
Google AI Blog (Research)              0    50     50            0.0   
OpenAI Blog                            0    50     50            0.0   
Tom's Hardware                         9    41     50           18.0   

label                      news_percentage  
source                                      
9to5Mac      
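
The cell above indexes `source_analysis['advertisement']` and `source_analysis['news']` directly, which raises a `KeyError` if one class is missing from the labeled data. A slightly more defensive sketch uses `pd.crosstab` and reindexes the columns to a fixed class list. The helper name `label_share_by_source` and the toy data are assumptions for illustration.

```python
import pandas as pd


def label_share_by_source(df, classes):
    """Return per-source label percentages, with a column for every class in `classes`."""
    counts = pd.crosstab(df["source"], df["label"])
    # Guarantee one column per expected class, even if a class never occurs
    counts = counts.reindex(columns=classes, fill_value=0)
    return counts.div(counts.sum(axis=1), axis=0).mul(100).round(1)


toy = pd.DataFrame({
    "source": ["A", "A", "B", "B", "B"],
    "label": ["news", "advertisement", "news", "news", "news"],
})
classes = ["advertisement", "news"]

shares = label_share_by_source(toy, classes)
print(shares)

# Still works when one class is absent from the data
news_only = label_share_by_source(toy[toy["label"] == "news"], classes)
print(news_only)
```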

In [6]:
# Dataset summary for model training preparation
print(f"\n{'='*60}")
print("📈 DATASET SUMMARY FOR MODEL TRAINING")
print(f"{'='*60}")

print(f"Total articles collected: {len(df)}")
print(f"Total articles labeled: {len(labeled_df)}")
print(f"Labeling completion: {(len(labeled_df) / len(df) * 100):.1f}%")

print(f"\nClass distribution:")
for label, count in label_counts.items():
    percentage = (count / len(labeled_df) * 100)
    print(f"  {label}: {count} articles ({percentage:.1f}%)")

print(f"\nSources represented: {labeled_df['source'].nunique()}")
print(f"Unique sources: {sorted(labeled_df['source'].unique())}")

# Check for missing descriptions
missing_desc = labeled_df['description'].isna().sum()
print(f"\nArticles with missing descriptions: {missing_desc}")

# Text length analysis
labeled_df['title_length'] = labeled_df['title'].str.len()
labeled_df['desc_length'] = labeled_df['description'].str.len()

print(f"\nText length statistics:")
print(f"Title length - Mean: {labeled_df['title_length'].mean():.0f}, Max: {labeled_df['title_length'].max()}")
print(f"Description length - Mean: {labeled_df['desc_length'].mean():.0f}, Max: {labeled_df['desc_length'].max()}")



📈 DATASET SUMMARY FOR MODEL TRAINING
Total articles collected: 1608
Total articles labeled: 1213
Labeling completion: 75.4%

Class distribution:
  news: 1034 articles (85.2%)
  advertisement: 179 articles (14.8%)

Sources represented: 53
Unique sources: ['9to5Mac', 'AWS News Blog', 'All Things Distributed (AWS CTO)', 'Analytics Vidhya', 'Android Authority', 'Apple Newsroom', 'Ars Technica', 'BAIR (Berkeley AI Research) Blog', 'BBC News - Technology', 'Business Insider (Tech)', 'CNET News', 'CNN Technology', 'CloudTech News', 'CloudTweaks', 'Daring Fireball', 'Datanami', 'DeepMind Blog', 'Engadget', 'ExtremeTech', 'Facebook Newsroom (Meta)', 'GeekWire', 'Gizmodo', 'Google (The Keyword)', 'Google AI Blog (Research)', 'Google Cloud Blog', 'Hacker News (Top)', 'HuffPost Tech', 'KDnuggets', 'Kaggle Blog', 'Lifehacker', 'MIT Technology Review', 'Machine Learning Mastery', 'MarkTechPost (AI News)', 'Mashable (Tech)', 'Meta Engineering Blog', 'Meta Research Blog', 'Microsoft Azure Blog', 'NVI
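
The length statistics above are computed over all labeled articles together. Since the goal is to separate advertisements from news, the same statistics broken down by class may be more informative. A minimal sketch on toy data follows; note that it fills missing descriptions with an empty string, so they count as length 0, whereas the cell above leaves them as NaN and excludes them from the mean.

```python
import pandas as pd

# Toy rows standing in for labeled_df
toy = pd.DataFrame({
    "title": ["abc", "abcde", "ab"],
    "description": ["short text", None, "an ad"],
    "label": ["news", "news", "advertisement"],
})

lengths = toy.assign(
    title_length=toy["title"].str.len(),
    # Missing descriptions are treated as empty text (length 0)
    desc_length=toy["description"].fillna("").str.len(),
)
per_class = lengths.groupby("label")[["title_length", "desc_length"]].agg(["mean", "max"])
print(per_class)
```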

In [7]:
# Save processed dataset for future model training
print(f"\n{'='*60}")
print("💾 SAVING PROCESSED DATASET")
print(f"{'='*60}")

# Create training-ready dataset
training_data = labeled_df[['title', 'description', 'source', 'label']].copy()

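# Combine title and description for training.
# Fill missing values on both sides so the concatenation never produces NaN
# (a NaN title would otherwise yield NaN text and slip past the empty-text filter).
training_data['text'] = training_data['title'].fillna('') + ' ' + training_data['description'].fillna('')

# Remove any rows with empty text
training_data = training_data[training_data['text'].str.strip() != '']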

print(f"Training dataset shape: {training_data.shape}")
print(f"Columns: {training_data.columns.tolist()}")

# Save to CSV for easy loading later
output_file = '../data/training_dataset.csv'
training_data.to_csv(output_file, index=False)
print(f"✅ Saved training dataset to: {output_file}")

# Also save as JSON for alternative loading
output_json = '../data/training_dataset.json'
training_data.to_json(output_json, orient='records', indent=2)
print(f"✅ Saved training dataset to: {output_json}")

print(f"\n🎯 Ready for model training with {len(training_data)} labeled examples!")



💾 SAVING PROCESSED DATASET
Training dataset shape: (1213, 5)
Columns: ['title', 'description', 'source', 'label', 'text']
✅ Saved training dataset to: ../data/training_dataset.csv
✅ Saved training dataset to: ../data/training_dataset.json

🎯 Ready for model training with 1213 labeled examples!


## Next Steps

This notebook has:
1. ✅ Loaded the labeled RSS data
2. ✅ Analyzed label distribution
3. ✅ Shown examples of each class
4. ✅ Analyzed sources and text characteristics
5. ✅ Created training-ready dataset files

**Files created:**
- `../data/training_dataset.csv` - CSV format for easy loading
- `../data/training_dataset.json` - JSON format for alternative loading

**The dataset is ready for a future model training notebook!** 🚀
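
Because the classes are imbalanced (85.2% news, 14.8% advertisement), a follow-up notebook would probably want a split that preserves the class ratio. The sketch below shows one way to do that with pandas alone. `stratified_split`, the 20% test fraction, and the seed are assumptions, and the toy frame stands in for `pd.read_csv('../data/training_dataset.csv')`.

```python
import pandas as pd


def stratified_split(df, label_col="label", test_frac=0.2, seed=42):
    """Hold out `test_frac` of each class so both splits keep the class ratio.

    Assumes `df` has a unique index, since the test rows are removed by index.
    """
    test_df = df.groupby(label_col).sample(frac=test_frac, random_state=seed)
    train_df = df.drop(test_df.index)
    return train_df, test_df


# Toy stand-in for the saved training dataset
toy = pd.DataFrame({
    "text": [f"article {i}" for i in range(25)],
    "label": ["news"] * 20 + ["advertisement"] * 5,
})

train_df, test_df = stratified_split(toy)
print(train_df["label"].value_counts())
print(test_df["label"].value_counts())
```

Sampling within each class keeps the minority class represented in the test set, which a plain random split does not guarantee on a dataset this skewed.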
