# RSS Data Collection for Advertisement Detection

This notebook focuses on collecting RSS feed data from various news sources to build a dataset for training a BERT-based advertisement classifier.

## Goals:
1. Collect RSS feeds from multiple news sources
2. Parse and clean the RSS content
3. Identify potential advertisements vs. legitimate news content
4. Create a labeled dataset for model training
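
One possible way to approach goals 3 and 4 is to pre-flag likely promotional items with a keyword heuristic before manual review. The sketch below is only an illustration: the `AD_HINTS` patterns, the `flag_possible_ad` helper, and the two example titles are hypothetical and are not part of this notebook's pipeline. The flags are candidates for inspection, not ground-truth labels.

```python
import re
from typing import Dict, List

# Hypothetical phrases that often appear in promotional items
# (an assumption for illustration, not derived from the collected data)
AD_HINTS: List[str] = [
    r"\bsponsored\b",
    r"\bdeals?\b",
    r"\bpromo code\b",
    r"\b\d{1,2}% off\b",
    r"\bcoupon\b",
]
AD_PATTERN = re.compile("|".join(AD_HINTS), flags=re.IGNORECASE)

def flag_possible_ad(article: Dict[str, str]) -> bool:
    """Return True if the title or summary matches any promotional hint."""
    text = f"{article.get('title', '')} {article.get('summary', '')}"
    return AD_PATTERN.search(text) is not None

# Made-up titles, used only to exercise the helper
examples = [
    {"title": "Weekend deals: save 40% off laptops", "summary": ""},
    {"title": "Researchers publish new results on protein folding", "summary": ""},
]
flags = [flag_possible_ad(a) for a in examples]
print(flags)  # [True, False]
```

A heuristic like this will miss subtle sponsored content and will flag some legitimate reporting about sales, so it is best treated as a way to prioritize what to inspect by hand.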

## RSS Sources to Consider:
- Major news outlets (CNN, BBC, Reuters, etc.)
- Tech news sites (TechCrunch, Ars Technica, etc.)
- Business news (Bloomberg, Financial Times, etc.)


In [1]:
# Import necessary libraries
import os
import json
import pandas as pd
import feedparser
import requests
from datetime import datetime
from typing import List, Dict, Optional

print("All required libraries imported successfully!")
print(f"Current working directory: {os.getcwd()}")
print(f"Project structure: {os.listdir('.')}")


All required libraries imported successfully!
Current working directory: /Users/soroushxyz/Documents/Dev/Python/Bert-Advertisement-Detection/rss-ad-filter-bert/notebooks
Project structure: ['01_rss_data_collection.ipynb']


In [2]:
# Updated RSS feed sources - comprehensive tech news collection
RSS_FEEDS = {
    "TechCrunch": "https://techcrunch.com/feed",
    "WIRED": "https://www.wired.com/feed/rss",
    "The Verge": "https://www.theverge.com/rss/index.xml",
    "Ars Technica": "http://feeds.arstechnica.com/arstechnica/index/",
    "Engadget": "https://www.engadget.com/rss.xml",
    "Gizmodo": "https://gizmodo.com/rss",
    "CNET News": "https://www.cnet.com/rss/news/",
    "TechRadar": "https://www.techradar.com/feeds",
    "Digital Trends": "https://www.digitaltrends.com/feed/",
    "VentureBeat": "https://venturebeat.com/feed/",
    "Recode (Vox Technology)": "https://www.vox.com/rss/technology/index.xml",
    "GeekWire": "https://www.geekwire.com/feed/",
    "Mashable (Tech)": "http://feeds.mashable.com/Mashable",
    "Hacker News (Top)": "https://news.ycombinator.com/rss",
    "TechMeme": "https://www.techmeme.com/feed.xml",
    "Slashdot": "http://rss.slashdot.org/Slashdot/slashdotMain",
    "Lifehacker": "https://lifehacker.com/rss",
    "MIT Technology Review": "https://www.technologyreview.com/feed/",
    "BBC News - Technology": "http://feeds.bbci.co.uk/news/technology/rss.xml",
    "The Guardian - Technology": "https://www.theguardian.com/technology/rss",
    "NY Times - Technology": "https://rss.nytimes.com/services/xml/rss/nyt/Technology.xml",
    "Reuters Technology News": "http://feeds.reuters.com/reuters/technologyNews",
    "CNN Technology": "http://rss.cnn.com/rss/cnn_tech.rss",
    "Business Insider (Tech)": "https://www.businessinsider.com/rss",
    "HuffPost Tech": "https://www.huffpost.com/section/technology/feed",
    "ZDNet (All Topics)": "https://www.zdnet.com/rss.xml",
    "InfoWorld": "https://www.infoworld.com/index.rss",
    "Computerworld": "https://www.computerworld.com/index.rss",
    "OpenAI Blog": "https://openai.com/news/rss.xml",
    "Google AI Blog (Research)": "https://research.google/blog/rss/",
    "DeepMind Blog": "https://deepmind.com/blog/feed/basic",
    "BAIR (Berkeley AI Research) Blog": "https://bair.berkeley.edu/blog/feed.xml",
    "Machine Learning Mastery": "https://machinelearningmastery.com/blog/feed/",
    "MarkTechPost (AI News)": "https://marktechpost.com/feed/",
    "Analytics Vidhya": "https://analyticsvidhya.com/feed",
    "KDnuggets": "https://www.kdnuggets.com/feed",
    "Towards Data Science": "https://towardsdatascience.com/feed",
    "Datanami": "https://www.datanami.com/feed/",
    "Kaggle Blog": "https://medium.com/feed/kaggle-blog",
    "AWS News Blog": "https://aws.amazon.com/blogs/aws/feed/",
    "All Things Distributed (AWS CTO)": "http://www.allthingsdistributed.com/atom.xml",
    "Microsoft Azure Blog": "https://azure.microsoft.com/en-us/blog/feed/",
    "Google Cloud Blog": "https://cloudblog.withgoogle.com/rss",
    "CloudTech News": "https://cloudcomputing-news.net/feed",
    "CloudTweaks": "https://cloudtweaks.com/feed",
    "TechRepublic (Cloud)": "https://www.techrepublic.com/rssfeeds/topic/cloud/",
    "IBM Cloud Blog": "https://www.ibm.com/blogs/cloud-computing/feed/",
    "AnandTech": "https://www.anandtech.com/rss/",
    "Tom's Hardware": "https://www.tomshardware.com/feeds/all",
    "9to5Mac": "https://9to5mac.com/feed/",
    "Android Authority": "https://www.androidauthority.com/feed",
    "ExtremeTech": "https://www.extremetech.com/feed",
    "NVIDIA Blog": "https://blogs.nvidia.com/feed/",
    "Official Microsoft Blog": "https://blogs.microsoft.com/feed/",
    "Apple Newsroom": "https://www.apple.com/newsroom/rss-feed.rss",
    "Google (The Keyword)": "https://blog.google/rss/",
    "Facebook Newsroom (Meta)": "https://about.fb.com/feed/",
    "Meta Engineering Blog": "https://engineering.fb.com/feed/",
    "Meta Research Blog": "https://research.facebook.com/feed/",
    "Stratechery (Ben Thompson)": "https://stratechery.com/feed/",
    "Daring Fireball": "https://daringfireball.net/index.xml",
    "The Hacker News": "https://feeds.feedburner.com/TheHackersNews",
    "BleepingComputer": "https://www.bleepingcomputer.com/feed/",
    "Microsoft Security Response Center": "https://msrc.microsoft.com/blog/feed",
    "Cloudflare Blog": "https://blog.cloudflare.com/rss/",
    "Netflix Tech Blog": "https://netflixtechblog.com/rss",
    "Datafloq": "https://datafloq.com/feed",
    "Xtract.io Blog": "https://xtract.io/blog/feed",
    "Silicon Valley Journals": "https://siliconvalleyjournals.com/feed",
    "TechSpot": "https://www.techspot.com/backend.xml",
    "AppleInsider": "https://appleinsider.com/rss/news",
    "gHacks": "https://www.ghacks.net/feed",
    "eWeek": "https://www.eweek.com/feed",
    "Droid Life": "https://www.droid-life.com/feed",
    "TechJuice": "https://www.techjuice.pk/feed",
    "Developer Tech News": "https://developer-tech.com/feed",
    "TechPlugged": "https://www.techplugged.com/feed"
}

print(f"✅ Updated RSS feed sources: {len(RSS_FEEDS)} sources")
print("\nCategories represented:")
print("- Tech News (major outlets)")
print("- AI/ML Research") 
print("- Data Science")
print("- Cloud/Enterprise")
print("- Hardware/Gadgets")
print("- Big Tech Companies")
print("- Cybersecurity")
print("- Developer/Engineering")


✅ Updated RSS feed sources: 77 sources

Categories represented:
- Tech News (major outlets)
- AI/ML Research
- Data Science
- Cloud/Enterprise
- Hardware/Gadgets
- Big Tech Companies
- Cybersecurity
- Developer/Engineering
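
With this many hand-maintained entries, a quick audit of the dictionary can catch plain-HTTP URLs and accidental duplicates before any fetching happens. The following is a minimal sketch under stated assumptions: `audit_feed_urls` is a hypothetical helper, and `sample_feeds` is a small stand-in for `RSS_FEEDS` (the "TechCrunch (copy)" entry is invented to demonstrate duplicate detection).

```python
from collections import Counter
from typing import Dict, List
from urllib.parse import urlparse

def audit_feed_urls(feeds: Dict[str, str]) -> Dict[str, List[str]]:
    """Report sources with non-HTTPS URLs and URLs listed more than once."""
    insecure = [name for name, url in feeds.items() if urlparse(url).scheme != "https"]
    # Treat URLs that differ only by a trailing slash as the same feed
    url_counts = Counter(url.rstrip("/") for url in feeds.values())
    duplicated = [name for name, url in feeds.items() if url_counts[url.rstrip("/")] > 1]
    return {"insecure": insecure, "duplicated": duplicated}

# Small stand-in for RSS_FEEDS so the example runs on its own
sample_feeds = {
    "Ars Technica": "http://feeds.arstechnica.com/arstechnica/index/",
    "TechCrunch": "https://techcrunch.com/feed",
    "TechCrunch (copy)": "https://techcrunch.com/feed/",  # hypothetical duplicate
}
report = audit_feed_urls(sample_feeds)
print(report)
# {'insecure': ['Ars Technica'], 'duplicated': ['TechCrunch', 'TechCrunch (copy)']}
```

In the notebook itself, the call would be `audit_feed_urls(RSS_FEEDS)`.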


## Next Steps

1. **Install RSS parsing library**: `feedparser`, already imported in the first cell, parses the RSS feeds
2. **Create data collection functions**: Functions to fetch and parse RSS feeds
3. **Data cleaning**: Remove HTML tags, normalize text, handle encoding
4. **Initial labeling**: Manual inspection to identify advertisements vs news
5. **Data export**: Save collected data in structured format (CSV/JSON)
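
The data cleaning step (item 3) could look roughly like the sketch below, which uses only the standard library. `clean_text` is a hypothetical helper name, and the regex-based tag stripping is a simplification: it is adequate for short feed summaries but is not a full HTML parser.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")

def clean_text(raw: str) -> str:
    """Strip HTML tags, decode entities, and normalize Unicode and whitespace."""
    # Remove tags first so that a decoded '<' is not mistaken for markup
    no_tags = TAG_RE.sub(" ", raw)
    # Decode entities such as &amp; and &nbsp;
    decoded = html.unescape(no_tags)
    # NFKC folds compatibility characters (e.g. a non-breaking space becomes a space)
    normalized = unicodedata.normalize("NFKC", decoded)
    # Collapse runs of whitespace and trim the ends
    return WHITESPACE_RE.sub(" ", normalized).strip()

sample = "<p>Apple&nbsp;unveils <b>new</b> chips &amp; tools</p>"
print(clean_text(sample))  # Apple unveils new chips & tools
```

Note that NFKC also rewrites some symbols (for example, "™" becomes "TM"), which may or may not be desirable for the classifier's input.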

Let's start with a placeholder collection function that loops over the feed sources:


In [3]:
# RSS Feed Collection Functions
# We'll implement the core RSS collection logic here

def collect_rss_feeds(feed_urls: Dict[str, str]) -> List[Dict]:
    """
    Collect articles from multiple RSS feeds
    
    Args:
        feed_urls: Dictionary mapping source names to RSS URLs
        
    Returns:
        List of article dictionaries
    """
    articles = []
    
    for source_name, url in feed_urls.items():
        print(f"Collecting from {source_name}...")
        # TODO: Implement RSS parsing logic
        # For now, just placeholder structure
        articles.append({
            'source': source_name,
            'url': url,
            'status': 'pending'
        })
    
    return articles

# Test the function
sample_articles = collect_rss_feeds(RSS_FEEDS)
print(f"Collected {len(sample_articles)} feed sources")


Collecting from TechCrunch...
Collecting from WIRED...
Collecting from The Verge...
Collecting from Ars Technica...
Collecting from Engadget...
Collecting from Gizmodo...
Collecting from CNET News...
Collecting from TechRadar...
Collecting from Digital Trends...
Collecting from VentureBeat...
Collecting from Recode (Vox Technology)...
Collecting from GeekWire...
Collecting from Mashable (Tech)...
Collecting from Hacker News (Top)...
Collecting from TechMeme...
Collecting from Slashdot...
Collecting from Lifehacker...
Collecting from MIT Technology Review...
Collecting from BBC News - Technology...
Collecting from The Guardian - Technology...
Collecting from NY Times - Technology...
Collecting from Reuters Technology News...
Collecting from CNN Technology...
Collecting from Business Insider (Tech)...
Collecting from HuffPost Tech...
Collecting from ZDNet (All Topics)...
Collecting from InfoWorld...
Collecting from Computerworld...
Collecting from OpenAI Blog...
Collecting from Google AI

In [4]:
# Implement actual RSS parsing logic

def parse_rss_feed(feed_url: str, source_name: str) -> List[Dict]:
    """
    Parse a single RSS feed and extract articles
    
    Args:
        feed_url: URL of the RSS feed
        source_name: Name of the source
        
    Returns:
        List of article dictionaries
    """
    articles = []
    
    try:
        print(f"  Fetching {source_name}...")
        feed = feedparser.parse(feed_url)
        
        if feed.bozo:
            print(f"  Warning: {source_name} has parsing issues")
        
        for entry in feed.entries[:50]:  # Limit to max 50 articles per feed
            article = {
                'source': source_name,
                'title': entry.get('title', ''),
                'link': entry.get('link', ''),
                'description': entry.get('description', ''),
                'published': entry.get('published', ''),
                'summary': entry.get('summary', ''),
                'tags': [tag.get('term', '') for tag in entry.get('tags', [])],
                'feed_url': feed_url
            }
            articles.append(article)
        
        print(f"  ✅ Collected {len(articles)} articles from {source_name}")
        
    except Exception as e:
        print(f"  ❌ Error fetching {source_name}: {str(e)}")
    
    return articles

def collect_all_feeds(feed_urls: Dict[str, str]) -> List[Dict]:
    """
    Collect articles from all RSS feeds
    
    Args:
        feed_urls: Dictionary mapping source names to RSS URLs
        
    Returns:
        List of all collected articles
    """
    all_articles = []
    
    for source_name, url in feed_urls.items():
        articles = parse_rss_feed(url, source_name)
        all_articles.extend(articles)
    
    return all_articles

# Collect from all RSS feeds
print("Collecting articles from all RSS feeds...")
all_articles = collect_all_feeds(RSS_FEEDS)
print(f"\nTotal articles collected: {len(all_articles)}")


Collecting articles from all RSS feeds...
  Fetching TechCrunch...
  ✅ Collected 20 articles from TechCrunch
  Fetching WIRED...
  ✅ Collected 50 articles from WIRED
  Fetching The Verge...
  ✅ Collected 10 articles from The Verge
  Fetching Ars Technica...
  ✅ Collected 20 articles from Ars Technica
  Fetching Engadget...
  ✅ Collected 50 articles from Engadget
  Fetching Gizmodo...
  ✅ Collected 20 articles from Gizmodo
  Fetching CNET News...
  ✅ Collected 25 articles from CNET News
  Fetching TechRadar...
  ✅ Collected 0 articles from TechRadar
  Fetching Digital Trends...
  ✅ Collected 0 articles from Digital Trends
  Fetching VentureBeat...
  ✅ Collected 0 articles from VentureBeat
  Fetching Recode (Vox Technology)...
  ✅ Collected 10 articles from Recode (Vox Technology)
  Fetching GeekWire...
  ✅ Collected 35 articles from GeekWire
  Fetching Mashable (Tech)...
  ✅ Collected 50 articles from Mashable (Tech)
  Fetching Hacker News (Top)...
  ✅ Collected 30 articles from Hacker 
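
The output above shows a success marker even for feeds that returned 0 articles (TechRadar, Digital Trends, VentureBeat). This is likely because `feedparser.parse` typically returns a result object rather than raising when a fetch or parse fails, so the `except` branch is never reached. A minimal sketch of a more informative status check follows; `classify_feed_result` and its labels are hypothetical and not part of the notebook's code.

```python
from typing import Optional

def classify_feed_result(n_entries: int, bozo: bool, http_status: Optional[int]) -> str:
    """Map a parsed feed's basic signals to a coarse health label."""
    if http_status is not None and http_status >= 400:
        return "http_error"
    if n_entries == 0:
        return "empty"             # nothing usable, even though no exception was raised
    if bozo:
        return "ok_with_warnings"  # entries were recovered from a malformed feed
    return "ok"

# With feedparser, the inputs would come from something like (assumed usage):
#   feed = feedparser.parse(url)
#   classify_feed_result(len(feed.entries), bool(feed.bozo), feed.get("status"))
print(classify_feed_result(0, False, 200))   # empty
print(classify_feed_result(20, True, 200))   # ok_with_warnings
print(classify_feed_result(0, True, 404))    # http_error
```

Logging the label per source makes it easier to separate dead or moved feed URLs from feeds that are merely malformed.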

In [5]:
# Convert to DataFrame and examine the data
if all_articles:
    df = pd.DataFrame(all_articles)
    
    print("DataFrame shape:", df.shape)
    print("\nColumns:", df.columns.tolist())
    print("\nFirst few rows:")
    print(df[['source', 'title', 'published']].head())
    
    print("\nArticles by source:")
    print(df['source'].value_counts())
    
    # Show a sample article
    print(f"\nSample article from {df.iloc[0]['source']}:")
    print(f"Title: {df.iloc[0]['title']}")
    print(f"Description: {df.iloc[0]['description'][:200]}...")
    print(f"Published: {df.iloc[0]['published']}")
    
else:
    print("No articles collected. Check your RSS feed URLs.")


DataFrame shape: (1608, 8)

Columns: ['source', 'title', 'link', 'description', 'published', 'summary', 'tags', 'feed_url']

First few rows:
       source                                              title  \
0  TechCrunch  VCs are still hiring MBAs, but firms are start...   
1  TechCrunch  Trump says Lachlan and Rupert Murdoch might in...   
2  TechCrunch  Silicon Valley bets big on ‘environments’ to t...   
3  TechCrunch  TechCrunch Mobility: The two robotaxi battlegr...   
4  TechCrunch  Hundreds of flights delayed at Heathrow and ot...   

                         published  
0  Sun, 21 Sep 2025 22:23:40 +0000  
1  Sun, 21 Sep 2025 19:43:34 +0000  
2  Sun, 21 Sep 2025 19:22:56 +0000  
3  Sun, 21 Sep 2025 16:01:00 +0000  
4  Sun, 21 Sep 2025 15:03:26 +0000  

Articles by source:
source
BBC News - Technology      50
Tom's Hardware             50
AppleInsider               50
Engadget                   50
WIRED                      50
                           ..
Microsoft Azure Blog
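
The `published` column holds raw strings such as `Sun, 21 Sep 2025 22:23:40 +0000`. Converting them to timezone-aware datetimes makes sorting and filtering by date reliable. The sketch below uses the standard library's `email.utils.parsedate_to_datetime`; `parse_published` is a hypothetical helper, and treating timestamps without an offset as UTC is an assumption.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

def parse_published(value: str) -> Optional[datetime]:
    """Parse an RFC 822 style date string into a UTC datetime, or None on failure."""
    if not value:
        return None
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError
        return None
    if parsed.tzinfo is None:
        # Assumption: treat timestamps without an offset as UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)

print(parse_published("Sun, 21 Sep 2025 22:23:40 +0000"))  # 2025-09-21 22:23:40+00:00
print(parse_published("not a date"))                        # None
```

Applied to the DataFrame, this would be `df['published'].apply(parse_published)`. Feeds that publish dates in other formats may yield `None` here; feedparser also exposes a pre-parsed `published_parsed` field on entries that could be collected instead.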

In [6]:
# Save the collected data as JSON

# Create data directory if it doesn't exist
os.makedirs('../data', exist_ok=True)

# Generate filename with timestamp
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"../data/rss_articles_{timestamp}.json"

# Save to JSON
with open(filename, 'w', encoding='utf-8') as f:
    json.dump(all_articles, f, ensure_ascii=False, indent=2)

print(f"✅ Saved {len(all_articles)} articles to {filename}")

# Also save a summary
summary = {
    'collection_date': datetime.now().isoformat(),
    'total_articles': len(all_articles),
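    # Count from all_articles directly: `df` only exists if the previous cell found articles
    'sources': pd.Series([a['source'] for a in all_articles], dtype='object').value_counts().to_dict(),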
    'filename': filename
}

summary_filename = f"../data/collection_summary_{timestamp}.json"
with open(summary_filename, 'w', encoding='utf-8') as f:
    json.dump(summary, f, ensure_ascii=False, indent=2)

print(f"✅ Saved collection summary to {summary_filename}")
print("\nArticles by source:")
for source, count in summary['sources'].items():
    print(f"  {source}: {count} articles")


✅ Saved 1608 articles to ../data/rss_articles_20250922_000542.json
✅ Saved collection summary to ../data/collection_summary_20250922_000542.json

Articles by source:
  BBC News - Technology: 50 articles
  Tom's Hardware: 50 articles
  AppleInsider: 50 articles
  Engadget: 50 articles
  WIRED: 50 articles
  DeepMind Blog: 50 articles
  The Hacker News: 50 articles
  Google AI Blog (Research): 50 articles
  Mashable (Tech): 50 articles
  OpenAI Blog: 50 articles
  ExtremeTech: 50 articles
  Android Authority: 50 articles
  Lifehacker: 50 articles
  9to5Mac: 50 articles
  Microsoft Security Response Center: 50 articles
  Daring Fireball: 48 articles
  gHacks: 40 articles
  GeekWire: 35 articles
  Hacker News (Top): 30 articles
  TechSpot: 30 articles
  NY Times - Technology: 28 articles
  CNET News: 25 articles
  The Guardian - Technology: 24 articles
  TechRepublic (Cloud): 20 articles
  Google Cloud Blog: 20 articles
  Towards Data Science: 20 articles
  Apple Newsroom: 20 articles
  Go
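
Because each run writes a new timestamped file, repeated collection will store many of the same articles more than once. The sketch below shows one way to load the saved files back and keep a single copy per link. `load_unique_articles` is a hypothetical helper, and the demo writes to a temporary directory with made-up records rather than reading `../data`.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List

def load_unique_articles(data_dir: Path) -> List[Dict]:
    """Load every rss_articles_*.json file and keep the first article seen per link."""
    seen_links = set()
    unique: List[Dict] = []
    # Timestamped names (YYYYMMDD_HHMMSS) sort chronologically
    for path in sorted(data_dir.glob("rss_articles_*.json")):
        with open(path, encoding="utf-8") as f:
            for article in json.load(f):
                link = article.get("link", "")
                if link and link in seen_links:
                    continue  # already collected in an earlier run
                seen_links.add(link)
                unique.append(article)
    return unique

# Self-contained demo using a temporary directory and made-up records
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    run_1 = [{"source": "A", "link": "https://example.com/1"}]
    run_2 = [
        {"source": "A", "link": "https://example.com/1"},
        {"source": "A", "link": "https://example.com/2"},
    ]
    (tmp_dir / "rss_articles_20250101_000000.json").write_text(json.dumps(run_1), encoding="utf-8")
    (tmp_dir / "rss_articles_20250102_000000.json").write_text(json.dumps(run_2), encoding="utf-8")
    merged = load_unique_articles(tmp_dir)

print(len(merged))  # 2
```

Articles without a link are always kept, since there is nothing to compare them on; a title-based fallback could be added if that turns out to matter.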