In [9]:
pip install wikipedia-api feedparser pandas



## **Data collection**

In [10]:
import wikipediaapi
import feedparser
import pandas as pd
import time
import numpy as np
import re
from google.colab import drive

**Wikipedia Collection**

Define categories and topics (wiki_categories)

Four main categories: Machine Learning, Climate Change, Sports, Politics

Each has a list of seed topics (e.g., Artificial neural network, Football).

These mirror the project’s requirement to collect a topic-diverse dataset.

**Function (get_links)**

Starts from a Wikipedia page (e.g., “Climate change”).

Collects its title, summary, and full text.

Also collects related pages via internal links - this expands coverage naturally.

Stops once the requested number of articles (max_articles) is reached.

**Loop through categories & topics**

For each topic, get_links is called.

Articles are saved into a list of dictionaries with metadata:

title, source = Wikipedia, category, content.

A short pause (time.sleep(0.1)) is added after each topic as polite scraping etiquette.

Result: thousands of Wikipedia pages, organized by category.
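The link-expansion idea described above can be sketched without network access. This is a minimal illustration, not the notebook's get_links: the in-memory FAKE_WIKI dictionary, the collect function, and the min_chars threshold are all hypothetical stand-ins for the real Wikipedia lookups.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory "wiki": title -> (article text, outgoing link titles).
FAKE_WIKI: Dict[str, Tuple[str, List[str]]] = {
    "Climate change": (
        "Long article text about climate.",
        ["Greenhouse gas", "Stub", "Carbon dioxide"],
    ),
    "Greenhouse gas": ("Long article text about gases.", []),
    "Carbon dioxide": ("Long article text about CO2.", []),
    "Stub": ("", []),  # empty page, should be skipped
}


def collect(seed: str, max_articles: int, min_chars: int = 10) -> List[str]:
    """Return the seed title plus directly linked titles, capped at max_articles."""
    collected: List[str] = []
    text, links = FAKE_WIKI.get(seed, ("", []))
    if len(text) >= min_chars:
        collected.append(seed)
    for title in links:
        # Stop as soon as the requested number of articles is reached.
        if len(collected) >= max_articles:
            break
        linked_text, _ = FAKE_WIKI.get(title, ("", []))
        if len(linked_text) >= min_chars:
            collected.append(title)
    return collected


print(collect("Climate change", max_articles=2))
print(collect("Climate change", max_articles=3))
```

With a cap of 2 the sketch returns the seed and its first usable link; with a cap of 3 it skips the empty "Stub" page and continues to the next link.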

In [11]:
# ----------- Wikipedia Setup -----------
wiki = wikipediaapi.Wikipedia(
    language='en',
    user_agent='RanjithaNagaraj-MScDSProject/1.0 (ranjithanagaraj911@gmail.com)'
)


wiki_categories = {
    "Machine learning": [
        "Artificial neural network", "Supervised learning", "Unsupervised learning"
    ],
    "Climate change": [
        "Climate change", "Carbon dioxide", "Climate change mitigation", "Greenhouse gas"
    ],
    "Sports": [
        "Sports", "Football", "Olympic Games", "Basketball", "Cricket"
    ],
    "Politics": [
        "Political science", "Elections", "Political party"
    ]
}

wikipedia_docs = []
max_per_topic = 3000  # Budget per category; split evenly across its seed topics in the loop below

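def get_links(page_title, max_articles):
    """Return (title, text) tuples for a page and the pages it links to.

    Only the first max_articles - 1 links are examined, so fewer than
    max_articles results come back when the main page or some linked
    pages are missing or too short.
    """
    page = wiki.page(page_title)
    articles = []
    # Collect the main page (skip missing pages and very short summaries)
    if page.exists() and len(page.summary) > 100:
        articles.append((page.title, page.summary + "\n" + page.text))
    # Add links from this page (sub-articles)
    links = list(page.links.keys())
    for link in links[:max_articles - 1]:
        linked_page = wiki.page(link)
        if linked_page.exists() and len(linked_page.summary) > 100:
            articles.append((linked_page.title, linked_page.summary + "\n" + linked_page.text))
        if len(articles) >= max_articles:
            break
    return articles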

# Scrape Wikipedia
for category, topics in wiki_categories.items():
    for topic in topics:
        try:
            print(f"Extracting Wikipedia articles for: {topic} ({category})")
            articles = get_links(topic, max_per_topic // len(topics))
            for title, content in articles:
                wikipedia_docs.append({
                    "title": title,
                    "source": "Wikipedia",
                    "category": category,
                    "content": content
                })
        except Exception as e:
            print(f"Failed for {topic}: {e}")
        time.sleep(0.1)


Extracting Wikipedia articles for: Artificial neural network (Machine learning)
Extracting Wikipedia articles for: Supervised learning (Machine learning)
Extracting Wikipedia articles for: Unsupervised learning (Machine learning)
Extracting Wikipedia articles for: Climate change (Climate change)
Extracting Wikipedia articles for: Carbon dioxide (Climate change)
Extracting Wikipedia articles for: Climate change mitigation (Climate change)
Extracting Wikipedia articles for: Greenhouse gas (Climate change)
Extracting Wikipedia articles for: Sports (Sports)
Extracting Wikipedia articles for: Football (Sports)
Extracting Wikipedia articles for: Olympic Games (Sports)
Extracting Wikipedia articles for: Basketball (Sports)
Extracting Wikipedia articles for: Cricket (Sports)
Extracting Wikipedia articles for: Political science (Politics)
Extracting Wikipedia articles for: Elections (Politics)
Extracting Wikipedia articles for: Political party (Politics)
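The limit passed to get_links in the loop above is max_per_topic // len(topics), so each category's budget is divided among its seed topics. The short sketch below works through that arithmetic using the seed counts from wiki_categories; the names seed_counts, limits, and upper_bound are introduced here only for illustration.

```python
max_per_topic = 3000

# Number of seed topics per category, taken from wiki_categories.
seed_counts = {
    "Machine learning": 3,
    "Climate change": 4,
    "Sports": 5,
    "Politics": 3,
}

# Per-topic limit actually passed to get_links for each category.
limits = {category: max_per_topic // n for category, n in seed_counts.items()}

# Most Wikipedia articles the loop could collect if every topic hit its limit.
upper_bound = sum(limits[c] * seed_counts[c] for c in seed_counts)

for category, limit in limits.items():
    print(f"{category}: up to {limit} articles per seed topic")
print(f"Upper bound on Wikipedia articles: {upper_bound}")
```

Each seed topic is therefore capped at 1000, 750, 600, and 1000 articles respectively, for at most 12000 Wikipedia articles overall. The real count is lower whenever a seed page has fewer usable links than its limit.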


In [12]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**RSS News Collection**

**Define RSS feed sources (rss_feeds)**

BBC feeds: Technology, Health, Science, World

Guardian feeds: UK News, World

These add real-world, up-to-date news articles to the dataset.

**Parse feeds**

Use feedparser to download RSS feeds.

For each entry: collect title + summary.

Store in a dictionary:

title, source = RSS, category (feed name), content.

Result: hundreds of fresh news articles with summaries.
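The title-plus-summary step can be sketched on plain dictionaries, which stand in for feed entries here. This is an illustrative variant under stated assumptions: entry_to_doc is a hypothetical helper, and unlike the cell below it drops the trailing ". " separator when an entry has no summary.

```python
from typing import Dict, Mapping


def entry_to_doc(entry: Mapping[str, str], feed_name: str) -> Dict[str, str]:
    """Build one dataset record from a dict-like feed entry."""
    title = entry.get("title", "").strip()
    summary = entry.get("summary", "").strip()
    # Join title and summary; fall back to the title alone if no summary exists.
    content = f"{title}. {summary}" if summary else title
    return {
        "title": title,
        "source": "RSS",
        "category": feed_name,
        "content": content,
    }


# Hypothetical entries: one with a summary, one without.
sample_entries = [
    {"title": "New chip unveiled", "summary": "A faster processor was announced."},
    {"title": "Headline only"},
]
docs = [entry_to_doc(e, "BBC_Technology") for e in sample_entries]
for doc in docs:
    print(doc["content"])
```

Using .get with a default keeps the loop from failing on entries that lack a field, which is the same concern the `'summary' in entry` check addresses.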

In [13]:
# ----------- RSS Setup -----------
rss_feeds = {
    "BBC_Technology": "http://feeds.bbci.co.uk/news/technology/rss.xml",
    "BBC_Health": "http://feeds.bbci.co.uk/news/health/rss.xml",
    "BBC_Science": "http://feeds.bbci.co.uk/news/science_and_environment/rss.xml",
    "BBC_World": "http://feeds.bbci.co.uk/news/world/rss.xml",
    "Guardian_UK_News": "https://www.theguardian.com/uk-news/rss",
    "Guardian_World": "https://www.theguardian.com/world/rss"
}

rss_docs = []

for category, url in rss_feeds.items():
    print(f"Parsing RSS feed: {category}")
    feed = feedparser.parse(url)
    for entry in feed.entries:
        content = entry.title + ". " + (entry.summary if 'summary' in entry else "")
        rss_docs.append({
            "title": entry.title,
            "source": "RSS",
            "category": category,
            "content": content
        })


Parsing RSS feed: BBC_Technology
Parsing RSS feed: BBC_Health
Parsing RSS feed: BBC_Science
Parsing RSS feed: BBC_World
Parsing RSS feed: Guardian_UK_News
Parsing RSS feed: Guardian_World


Combine Wikipedia + RSS data into one pandas DataFrame.

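Remove duplicates (by title) to avoid redundancy. In the cell below this step is commented out, so the reported total still includes any repeated titles.

Reset the index to keep the dataset tidy.

Save the dataset to combined_dataset_New.csv on Google Drive.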
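The combine, de-duplicate, and reset steps listed above could look like the following on a tiny made-up sample. The records here are invented for illustration, and the non-inplace drop_duplicates chain is one possible way to write it, not necessarily how the notebook does.

```python
import pandas as pd

# Made-up records; "Greenhouse gas" appears twice, as if reached from two seeds.
wikipedia_docs = [
    {"title": "Climate change", "source": "Wikipedia",
     "category": "Climate change", "content": "Sample text A"},
    {"title": "Greenhouse gas", "source": "Wikipedia",
     "category": "Climate change", "content": "Sample text B"},
    {"title": "Greenhouse gas", "source": "Wikipedia",
     "category": "Climate change", "content": "Sample text B"},
]
rss_docs = [
    {"title": "Budget news", "source": "RSS",
     "category": "BBC_World", "content": "Budget news. Sample summary."},
]

all_docs = pd.DataFrame(wikipedia_docs + rss_docs)

# Keep the first row for each title, then renumber rows from 0.
deduped = all_docs.drop_duplicates(subset=["title"], keep="first").reset_index(drop=True)

print(f"Before: {len(all_docs)} rows, after: {len(deduped)} rows")
```

Because de-duplication is by title only, a page reachable from two categories keeps whichever category label was collected first.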

In [14]:
# ----------- Combine and Save -----------
# Combine both datasets
all_docs = pd.DataFrame(wikipedia_docs + rss_docs)
# De-duplication by title is disabled here; uncomment the next line to enable it
#all_docs.drop_duplicates(subset=["title"], inplace=True)
all_docs.reset_index(drop=True, inplace=True)

print(f"Total unique documents: {len(all_docs)}")

drive.mount('/content/drive')
all_docs.to_csv("/content/drive/MyDrive/Ranjitha-DS-Project/combined_dataset_New.csv", index=False)
print("Saved as combined_dataset_New.csv")

Total unique documents: 8577
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Saved as combined_dataset_New.csv
