# Scrape Times Of India website using RSS Feed


## Disclaimer & Warning: 
### ⚠️ <span style="color:red"> Warning: Scrape Responsibly </span> ⚠️

You must **NOT** scrape the website at a very high frequency. Excessive requests may  
lead to your IP being blocked or other restrictions imposed by the website.  

Please scrape **responsibly and mindfully**. You are **solely responsible** for any  
consequences resulting from your scraping activities.

> **Disclaimer:** This notebook is provided for educational purposes only.  
> The author is not responsible for any misuse or violation of the website’s  
> terms of service.


Times Of India - [RSS Feed](https://timesofindia.indiatimes.com/rss.cms) 
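
One way to honor the warning above is to space requests out explicitly. The sketch below is a minimal throttle and is not part of the original notebook: the `Throttle` class and its parameters are hypothetical names, and the 2-second interval simply mirrors the `time.sleep(2)` calls used later in Part-2.

```python
import time


class Throttle:
    """Enforce a minimum delay between successive calls to wait()."""

    def __init__(self, min_interval_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval_seconds
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Sleep just long enough to keep calls min_interval apart.

        Returns the number of seconds slept (0.0 if no sleep was needed).
        """
        slept = 0.0
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
                now = self._clock()
        self._last_call = now
        return slept


# Usage: call throttle.wait() right before every requests.get(...)
throttle = Throttle(min_interval_seconds=2.0)
throttle.wait()  # first call returns immediately
```

Unlike a fixed `time.sleep(2)`, this only waits for whatever part of the interval has not already elapsed while the previous response was being processed.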

# Part-1: Scraping the RSS Feed to Get the Article Links:

In [1]:
import requests
import xml.etree.ElementTree as ET
import re
import time
import datetime
import os
import json

## 1.1 Select RSS FEED NAME:

In [2]:
# Set Feed Name Here: 
RSS_FEED_NAME = "Science"

In [3]:
rss_feed_links = {
    "Top Stories": "http://timesofindia.indiatimes.com/rssfeedstopstories.cms",
    "Most Recent Stories": "http://timesofindia.indiatimes.com/rssfeedmostrecent.cms",
    "India": "http://timesofindia.indiatimes.com/rssfeeds/-2128936835.cms",
    "World": "http://timesofindia.indiatimes.com/rssfeeds/296589292.cms",
    "NRI": "http://timesofindia.indiatimes.com/rssfeeds/7098551.cms",
    "Business": "http://timesofindia.indiatimes.com/rssfeeds/1898055.cms",
    "US": "https://timesofindia.indiatimes.com/rssfeeds_us/72258322.cms",
    "Cricket": "http://timesofindia.indiatimes.com/rssfeeds/54829575.cms",
    "Sports": "http://timesofindia.indiatimes.com/rssfeeds/4719148.cms",
    "Science": "http://timesofindia.indiatimes.com/rssfeeds/-2128672765.cms",
    "Environment": "http://timesofindia.indiatimes.com/rssfeeds/2647163.cms",
    "Tech": "http://timesofindia.indiatimes.com/rssfeeds/66949542.cms",
    "Education": "http://timesofindia.indiatimes.com/rssfeeds/913168846.cms",
    "Entertainment": "http://timesofindia.indiatimes.com/rssfeeds/1081479906.cms",
    "Life & Style": "http://timesofindia.indiatimes.com/rssfeeds/2886704.cms",
    "Most Read": "http://timesofindia.indiatimes.com/rssfeedmostread.cms",
    "Most Shared": "http://timesofindia.indiatimes.com/rssfeedmostshared.cms",
    "Most Commented": "http://timesofindia.indiatimes.com/rssfeedmostcommented.cms",
    "Astrology": "https://timesofindia.indiatimes.com/rssfeeds/65857041.cms",
    "Auto": "https://timesofindia.indiatimes.com/rssfeeds/74317216.cms"
}
RSS_FEED_URL = rss_feed_links.get(RSS_FEED_NAME)
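
Because `rss_feed_links.get(...)` returns `None` for a misspelled feed name, the mistake only surfaces later as a request error. A possible guard is sketched below; `resolve_feed_url` and `sample_links` are hypothetical names used for illustration, and the two URLs are copied from the dictionary above.

```python
def resolve_feed_url(feed_name, feed_links):
    """Return the URL for feed_name, or raise an error listing the valid names."""
    try:
        return feed_links[feed_name]
    except KeyError:
        valid = ", ".join(sorted(feed_links))
        raise ValueError(
            f"Unknown feed {feed_name!r}. Choose one of: {valid}"
        ) from None


# Two entries copied from rss_feed_links to keep the example self-contained.
sample_links = {
    "Science": "http://timesofindia.indiatimes.com/rssfeeds/-2128672765.cms",
    "Tech": "http://timesofindia.indiatimes.com/rssfeeds/66949542.cms",
}

print(resolve_feed_url("Science", sample_links))

try:
    resolve_feed_url("science", sample_links)  # wrong case on purpose
except ValueError as error:
    print(error)
```

Keys are case-sensitive, so `"science"` fails while `"Science"` succeeds.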

## 1.2 Generate Input JSON:

In [4]:
RSS_FEED_ARTICLES = {}

try:
    response = requests.get(RSS_FEED_URL, timeout=10)
    response.raise_for_status()

    # XML Parsing
    root = ET.fromstring(response.content)

    # Extract links from <item> tags
    articles = []
    for item in root.findall("./channel/item"):
        link = item.find("link")
        if link is not None and link.text:
            article_link = link.text.strip()
            cms_id = article_link.split("/")[-1].replace(".cms", "")            
            comment_link = (
                "https://timesofindia.indiatimes.com/commentsdata.cms"
                f"?msid={cms_id}&curpg=1&commenttype=agree&pcode=TOI&appkey=TOI"
                "&sortcriteria=AgreeCount&order=desc&size=100&after=true"
                "&withReward=true&medium=WEB&comment_block_count=3&pagenum=1"
            )
            articles.append({
                "article_link": article_link,
                "cms_id": cms_id,
                "comment_link": comment_link
            })

    # Build final dict
    RSS_FEED_ARTICLES = {
        "rss_feed_name": RSS_FEED_NAME,
        "rss_feed_scraped_date": datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        "rss_feed_articles": articles
    }

except requests.RequestException as e:
    print(f"Failed to fetch RSS feed: {e}")

except ET.ParseError as e:
    print(f"Failed to parse RSS feed XML: {e}")
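
The ID extraction and comment URL construction in the cell above can also be written as two small functions. This is a sketch, not the notebook's own code: `cms_id_from_link` and `build_comment_link` are hypothetical names, and the query parameters are copied verbatim from the cell above without knowing what each one means to the server. Using `urlsplit` is a swapped-in choice so that a link with a query string still yields a clean ID.

```python
from urllib.parse import urlencode, urlsplit

COMMENTS_ENDPOINT = "https://timesofindia.indiatimes.com/commentsdata.cms"


def cms_id_from_link(article_link):
    """Take the last path segment of the link and drop the '.cms' suffix."""
    last_segment = urlsplit(article_link).path.rstrip("/").split("/")[-1]
    if last_segment.endswith(".cms"):
        last_segment = last_segment[: -len(".cms")]
    return last_segment


def build_comment_link(cms_id, page_size=100):
    """Build the comments URL with the same parameters the notebook uses."""
    params = {
        "msid": cms_id,
        "curpg": 1,
        "commenttype": "agree",
        "pcode": "TOI",
        "appkey": "TOI",
        "sortcriteria": "AgreeCount",
        "order": "desc",
        "size": page_size,
        "after": "true",
        "withReward": "true",
        "medium": "WEB",
        "comment_block_count": 3,
        "pagenum": 1,
    }
    return f"{COMMENTS_ENDPOINT}?{urlencode(params)}"


link = (
    "https://timesofindia.indiatimes.com/science/"
    "study-reveals-offshore-wind-farm-cables-affect-female-crabs-and-marine-ecosystems/"
    "articleshow/124154932.cms"
)
article_id = cms_id_from_link(link)
print(article_id)
print(build_comment_link(article_id))
```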


## 1.3 Write the Input JSON to a File:

In [5]:
# Ensure output directory exists
os.makedirs("Input", exist_ok=True)

# Create filename with timestamp
filename = f"Input/{RSS_FEED_NAME}-{datetime.datetime.now().strftime('%Y%m%d-%H%M%S')}.json"

# Write JSON file
with open(filename, "w", encoding="utf-8") as f:
    json.dump(RSS_FEED_ARTICLES, f, indent=4, ensure_ascii=False)

# Part-2: Scrape Each Article Link:

## 2.1 Read Input JSON: 


In [6]:
# Pick the latest saved file from Input/ folder OR Specify manually 
input_folder = "Input"

latest_file = max(
    [os.path.join(input_folder, f) for f in os.listdir(input_folder) if f.endswith(".json")],
    key=os.path.getctime
)

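# Use the latest file by default; uncomment the second line to pick one manually
input_file = latest_file
# input_file = "Input/India-20250926-085255.json"

print(f"Reading from {input_file}")

# Load JSON from the selected file (so a manual override of input_file is honored)
with open(input_file, "r", encoding="utf-8") as f:
    feed_article_list = json.load(f)

# print(feed_article_list)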

Reading from Input\Science-20250926-195555.json
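
Two details of the file selection above are worth noting: `max()` raises `ValueError` when the folder has no `.json` files, and `os.path.getctime` means creation time on Windows but last metadata change on Unix. A possible alternative, sketched with a hypothetical helper name `latest_json_file`, uses modification time and returns `None` for an empty folder.

```python
from pathlib import Path


def latest_json_file(folder):
    """Return the most recently modified .json file in folder, or None if there is none."""
    candidates = list(Path(folder).glob("*.json"))
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)


# Usage: fall back to a clear message instead of an exception.
# A folder that does not exist simply yields no candidates.
newest = latest_json_file("Input")
if newest is None:
    print("No input files found; run Part-1 first.")
else:
    print(f"Reading from {newest}")
```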


## 2.2 Scrape Article details and Comments for each Article in Feed: 

In [7]:
def extract_comments(data, collected=None):
    """Recursively extract all C_T fields (comments) from JSON."""
    if collected is None:
        collected = []

    if isinstance(data, dict):
        if "C_T" in data:
            collected.append(data["C_T"])
        if "REPLIES" in data and isinstance(data["REPLIES"], list):
            for reply in data["REPLIES"]:
                extract_comments(reply, collected)

    elif isinstance(data, list):
        for item in data:
            extract_comments(item, collected)

    return collected
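
To see what the function returns, it can be run on a small hand-made payload. The payload below is hypothetical: only the `C_T` and `REPLIES` keys come from the function above, and the real response may carry many more fields. The function is repeated so the example runs on its own.

```python
def extract_comments(data, collected=None):
    """Recursively extract all C_T fields (comments) from JSON."""
    if collected is None:
        collected = []

    if isinstance(data, dict):
        if "C_T" in data:
            collected.append(data["C_T"])
        if "REPLIES" in data and isinstance(data["REPLIES"], list):
            for reply in data["REPLIES"]:
                extract_comments(reply, collected)

    elif isinstance(data, list):
        for item in data:
            extract_comments(item, collected)

    return collected


# Hypothetical payload: one comment with nested replies, one plain comment,
# and one entry that has no comment text at all.
sample_payload = [
    {
        "C_T": "Great article",
        "REPLIES": [{"C_T": "Agreed", "REPLIES": [{"C_T": "Same here"}]}],
    },
    {"C_T": "Interesting read"},
    {"OTHER": "entry without a comment text"},
]

print(extract_comments(sample_payload))
# → ['Great article', 'Agreed', 'Same here', 'Interesting read']
```

Replies are collected depth-first, so each reply thread appears directly after its parent comment. Entries without a `C_T` key are skipped.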

In [8]:
combined_articles = []

# Loop over articles
for article in feed_article_list.get("rss_feed_articles", []):
    article_url = article["article_link"]
    comment_url = article["comment_link"]
    cms_id = article["cms_id"]

    print(f"\nFetching: {article_url}")
    try:
        time.sleep(2)
        r = requests.get(article_url, timeout=10)
        r.raise_for_status()

        # Find ALL <script type="application/ld+json"> ... </script>
        matches = re.findall(
            r'<script type="application/ld\+json">\s*(\{.*?\})\s*</script>',
            r.text,
            re.DOTALL
        )

        json_article_details = {}
        if len(matches) >= 2:
            try:
                json_article_details = json.loads(matches[1])  # second occurrence
                print(f"✅ Extracted article JSON for {cms_id}")
            except json.JSONDecodeError as e:
                print(f"⚠️ JSON decode error for article {cms_id}: {e}")
        else:
            print(f"⚠️ Could not find the article JSON-LD block for {cms_id}")

    except Exception as e:
        print(f"Error fetching {article_url}: {e}")
        continue

    # Fetch comments
    comments = []
    try:
        time.sleep(2)
        r = requests.get(comment_url, timeout=10)
        r.raise_for_status()
        data = r.json()

        if isinstance(data, dict):
            comments = extract_comments(data.get("items", []))
        elif isinstance(data, list):
            comments = extract_comments(data)

        print(f"✅ Extracted {len(comments)} comments for {cms_id}")

    except Exception as e:
        print(f"Error fetching comments for {cms_id}: {e}")

    # Combine article + comments into one JSON structure
    combined_json = {
        "cms_id": cms_id,
        "article_link": article_url,
        "article_details": json_article_details,
        "comments": comments
    }

    combined_articles.append(combined_json)


# Final combined JSON for all articles
# print(json.dumps(combined_articles[:2], indent=2, ensure_ascii=False)[:1000], "...")


Fetching: https://timesofindia.indiatimes.com/science/study-reveals-offshore-wind-farm-cables-affect-female-crabs-and-marine-ecosystems/articleshow/124154932.cms
✅ Extracted article JSON for 124154932
✅ Extracted 0 comments for 124154932

Fetching: https://timesofindia.indiatimes.com/science/million-year-old-skull-suggests-humans-emerged-earlier-than-thought-challenging-the-africa-centric-theory-of-evolution/articleshow/124149376.cms
✅ Extracted article JSON for 124149376
✅ Extracted 0 comments for 124149376

Fetching: https://timesofindia.indiatimes.com/science/understanding-why-rose-petals-curl-how-stress-geometry-and-biology-shape-their-elegant-forms/articleshow/124148809.cms
✅ Extracted article JSON for 124148809
✅ Extracted 0 comments for 124148809

Fetching: https://timesofindia.indiatimes.com/science/nasa-isros-nisar-sends-first-radar-images-of-earths-surface-reveals-exceptional-details-of-land-forests-and-agriculture/articleshow/124147402.cms
✅ Extracted article JSON for 12414
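
Picking the second `ld+json` block by position works as long as the page layout does not change. A less position-dependent variant is sketched below: it selects the block by its `@type` value. This rests on an assumption that has not been verified against the live site, namely that the article metadata block declares `"@type": "NewsArticle"`. The `find_ld_json` name and the sample HTML are hypothetical.

```python
import json
import re

LD_JSON_PATTERN = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def find_ld_json(html, wanted_type="NewsArticle"):
    """Return the first JSON-LD object whose @type matches wanted_type, else {}."""
    for raw in LD_JSON_PATTERN.findall(html):
        try:
            block = json.loads(raw.strip())
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of aborting
        # A JSON-LD script may hold a single object or a list of objects.
        candidates = block if isinstance(block, list) else [block]
        for candidate in candidates:
            if not isinstance(candidate, dict):
                continue
            declared = candidate.get("@type")
            types = declared if isinstance(declared, list) else [declared]
            if wanted_type in types:
                return candidate
    return {}


# Hypothetical page fragment with two JSON-LD blocks.
sample_html = """
<script type="application/ld+json">{"@type": "WebSite", "name": "Example"}</script>
<script type="application/ld+json">
{"@type": "NewsArticle", "headline": "Hypothetical headline"}
</script>
"""

print(find_ld_json(sample_html)["headline"])
```

If the assumed type turns out to be wrong for a given page, the function returns an empty dict, which matches the notebook's existing fallback of an empty `json_article_details`.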

## 2.3 Write Article Details and Comments for Each Article to JSON Files:

In [9]:

# Base output path
base_path = f"Output/toi-articles/toi-articles-{RSS_FEED_NAME}"
os.makedirs(base_path, exist_ok=True)

# Current date for file naming
date_str = datetime.datetime.now().strftime('%Y%m%d-%H%M%S')

# Write each article to a separate JSON file
for article in combined_articles:
    cms_id = article.get("cms_id", "unknown")
    file_name = f"{cms_id}-{date_str}.json"
    file_path = os.path.join(base_path, file_name)
    
    with open(file_path, "w", encoding="utf-8") as f:
        json.dump(article, f, ensure_ascii=False, indent=4)
    
    print(f"Saved article {cms_id} to {file_path}")


Saved article 124154932 to Output/toi-articles/toi-articles-Science\124154932-20250926-195737.json
Saved article 124149376 to Output/toi-articles/toi-articles-Science\124149376-20250926-195737.json
Saved article 124148809 to Output/toi-articles/toi-articles-Science\124148809-20250926-195737.json
Saved article 124147402 to Output/toi-articles/toi-articles-Science\124147402-20250926-195737.json
Saved article 124113646 to Output/toi-articles/toi-articles-Science\124113646-20250926-195737.json
Saved article 124112893 to Output/toi-articles/toi-articles-Science\124112893-20250926-195737.json
Saved article 124113708 to Output/toi-articles/toi-articles-Science\124113708-20250926-195737.json
Saved article 124112683 to Output/toi-articles/toi-articles-Science\124112683-20250926-195737.json
Saved article 124108001 to Output/toi-articles/toi-articles-Science\124108001-20250926-195737.json
Saved article 124108906 to Output/toi-articles/toi-articles-Science\124108906-20250926-195737.json
Saved arti