
# **TASK B**: Text Dataset Collection: Web Crawler and NLP Preprocessing
  

 
- **This notebook demonstrates:**
    1. Listing 20 different categories and selecting three websites for each.
    2. Implementing a web crawler to extract relevant text (article title, content, publication date) from these websites.
    3. Storing the collected data as text files, one raw file and one cleaned file per category.
    4. Cleaning the text using NLP preprocessing (removing HTML tags, lowercasing, and removing punctuation and stop words).
    5. Naming the dataset and demonstrating a use case (here, a simple frequency analysis).

## **Dataset Name:** `IITG_MultimodalTextDataset`
 

## 1. Setup and Imports
- We will import necessary libraries including `requests`, `BeautifulSoup` for web scraping, `nltk` for NLP preprocessing, and other utilities.


In [6]:
import os
import re
import string
import requests
from bs4 import BeautifulSoup

# For NLP preprocessing
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import Counter

In [7]:
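# Download the NLTK tokenizer models ('punkt') and the stop word list if not already available
# Note: recent NLTK releases may additionally need nltk.download('punkt_tab') for word_tokenize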
nltk.download('punkt')
nltk.download('stopwords')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\iitia\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\iitia\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## 2. Define Categories and Websites

- We define 20 categories and for each, list three example websites.

In [19]:
categories = {
    "Technology": [
        "https://techcrunch.com",
        "https://www.wired.com",
        "https://www.theverge.com"
    ],
    "Sports": [
        "https://www.espn.com",
        "https://www.bbc.com/sport",
        "https://www.si.com"
    ],
    "Health": [
        "https://www.webmd.com",
        "https://www.healthline.com",
        "https://www.medicalnewstoday.com"
    ],
    "Politics": [
        "https://www.cnn.com/politics",
        "https://www.bbc.com/news/politics",
        "https://www.nbcnews.com/politics"
    ],
    "Entertainment": [
        "https://ew.com",
        "https://www.rollingstone.com",
        "https://variety.com"
    ],
    "Business": [
        "https://www.businessinsider.com",
        "https://www.forbes.com",
        "https://www.cnbc.com"
    ],
    "Science": [
        "https://www.livescience.com",
        "https://www.nature.com",
        "https://www.scientificamerican.com"
    ],
    "Education": [
        "https://www.edutopia.org",
        "https://www.insidehighered.com",
        "https://www.timeshighereducation.com"
    ],
    "Environment": [
        "https://www.nationalgeographic.com/environment",
        "https://www.theguardian.com/environment",
        "https://www.ecowatch.com"
    ],
    "Travel": [
        "https://www.lonelyplanet.com",
        "https://www.cntraveler.com",
        "https://www.travelandleisure.com"
    ],
    "Food": [
        "https://www.foodnetwork.com",
        "https://www.seriouseats.com",
        "https://www.bonappetit.com"
    ],
    "Fashion": [
        "https://www.vogue.com",
        "https://www.elle.com",
        "https://www.harpersbazaar.com"
    ],
    "Art & Culture": [
        "https://www.artsy.net",
        "https://www.theartnewspaper.com",
        "https://www.artforum.com"
    ],
    "Finance": [
        "https://www.ft.com",
        "https://www.investopedia.com",
        "https://www.fool.com"
    ],
    "History": [
        "https://www.history.com",
        "https://www.bbc.co.uk/history",
        "https://www.historyextra.com"
    ],
    "Literature": [
        "https://lithub.com",
        "https://themillions.com",
        "https://www.poetryfoundation.org"
    ],
    "Music": [
        "https://pitchfork.com",
        "https://www.rollingstone.com/music",
        "https://www.nme.com"
    ],
    "Gaming": [
        "https://www.polygon.com",
        "https://www.gamesradar.com",
        "https://www.destructoid.com"
    ],
    "Lifestyle": [
        "https://www.lifehacker.com",
        "https://www.thecut.com",
        "https://www.refinery29.com"
    ],
    "World News": [
        "https://www.bbc.com/news/world",
        "https://www.aljazeera.com",
        "https://www.cnn.com/world"
    ]
}
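
Before crawling, it can help to confirm that the mapping really has the shape described above (20 categories, three URLs each). The following is a minimal sketch of such a check. The helper name `validate_categories` and the small `sample` mapping are hypothetical and exist only to make the example runnable on its own.

```python
from urllib.parse import urlparse

def validate_categories(categories, expected_categories=20, urls_per_category=3):
    """Return a list of problems found; an empty list means the mapping looks consistent."""
    problems = []
    if len(categories) != expected_categories:
        problems.append(f"expected {expected_categories} categories, found {len(categories)}")
    for name, urls in categories.items():
        if len(urls) != urls_per_category:
            problems.append(f"{name}: expected {urls_per_category} URLs, found {len(urls)}")
        for url in urls:
            parts = urlparse(url)
            # A usable URL needs an http(s) scheme and a host name
            if parts.scheme not in ("http", "https") or not parts.netloc:
                problems.append(f"{name}: malformed URL {url!r}")
    return problems

# Small stand-in mapping (hypothetical) with one deliberately broken category
sample = {
    "Technology": ["https://techcrunch.com", "https://www.wired.com", "https://www.theverge.com"],
    "Sports": ["https://www.espn.com", "www.si.com"],
}
for problem in validate_categories(sample, expected_categories=2):
    print(problem)
```

With the notebook's own dictionary, `validate_categories(categories)` would be expected to return an empty list.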


## 3. Define Functions for Web Crawling and Text Extraction

- **We define a function `crawl_website` which:**
     - Sends a GET request to the website and returns `None` if the request fails.
     - Uses BeautifulSoup to parse the HTML.
     - Extracts the page title.
     - Collects paragraphs from the `<article>` tag if one exists, otherwise from every `<p>` tag on the page.
     - Looks for a publication date in `<meta>` tags (a simple heuristic), falling back to `"Unknown"`.

In [9]:
def crawl_website(url):
    """
    Crawl the given URL and extract article title, content, and publication date.
    Returns a dictionary with keys: 'title', 'content', 'date'.
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; IITGDataCrawler/1.0)"
    }
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
    except requests.RequestException as e:  # network errors, timeouts, and HTTP error statuses
        print(f"Error fetching {url}: {e}")
        return None
    
    soup = BeautifulSoup(response.text, 'html.parser')
    
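    # Extract title from <title>, falling back to the first <h1>
    h1 = soup.find('h1')
    if soup.title and soup.title.string:
        title = soup.title.string.strip()
    elif h1:
        title = h1.get_text(strip=True)
    else:
        title = "No Title"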
    
    # Attempt to extract article content: look for <article> tag or all <p> tags
    article = soup.find('article')
    if article:
        paragraphs = article.find_all('p')
    else:
        paragraphs = soup.find_all('p')
    
    content = "\n".join([p.get_text() for p in paragraphs])
    
    # Extract publication date from meta tags (this is a simple heuristic)
    pub_date = None
    for meta in soup.find_all('meta'):
        if meta.get("name", "").lower() in ["pubdate", "publication_date", "date"]:
            pub_date = meta.get("content")
            break
    
    return {
        "title": title,
        "content": content,
        "date": pub_date if pub_date else "Unknown"
    }
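
The meta-tag date heuristic can be tried without a network request. The sketch below is an illustration under stated assumptions, not the notebook's implementation. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra dependencies. In addition to the `name` values checked in `crawl_website`, it also accepts the Open Graph `article:published_time` property, which is an addition made here. The names `MetaDateParser` and `extract_pub_date` are hypothetical.

```python
from html.parser import HTMLParser

# `name` values mirror the ones checked in crawl_website
DATE_NAMES = {"pubdate", "publication_date", "date"}
# Assumption: also accept the Open Graph article property (not in the original function)
DATE_PROPERTIES = {"article:published_time"}

class MetaDateParser(HTMLParser):
    """Remembers the content of the first <meta> tag that looks like a publication date."""

    def __init__(self):
        super().__init__()
        self.pub_date = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.pub_date is not None:
            return
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").lower()
        prop = attr.get("property", "").lower()
        if name in DATE_NAMES or prop in DATE_PROPERTIES:
            self.pub_date = attr.get("content") or None

def extract_pub_date(html):
    parser = MetaDateParser()
    parser.feed(html)
    return parser.pub_date or "Unknown"

sample_html = (
    '<html><head><meta property="article:published_time" '
    'content="2025-01-15T09:30:00Z"></head><body></body></html>'
)
print(extract_pub_date(sample_html))         # 2025-01-15T09:30:00Z
print(extract_pub_date("<p>no meta tags</p>"))  # Unknown
```

The returned value is the raw attribute string. Turning it into a `datetime` would be a separate step, since sites format dates differently.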

## 4. Define Function for Cleaning Text
- **This function cleans the text by:**
     - Removing HTML tags (if any remain).
     - Lowercasing the text.
     - Removing punctuation.
     - Removing stop words using NLTK.


In [22]:
def clean_text(text):
    # Remove any residual HTML tags
    text = BeautifulSoup(text, 'html.parser').get_text()
    
    # Convert text to lowercase
    text = text.lower()
    
    # Define a custom punctuation string including typical punctuation and additional typographic quotes/dashes
    custom_punct = string.punctuation + "“”‘’—–"
    
    # Remove punctuation using regex (all characters in custom_punct)
    text = re.sub(f"[{re.escape(custom_punct)}]", "", text)
    
    # Tokenize the text
    tokens = word_tokenize(text)
    
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    cleaned_tokens = [word for word in tokens if word not in stop_words]
    
    # Rejoin tokens into a cleaned string
    cleaned_text = " ".join(cleaned_tokens)
    return cleaned_text
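
To see what each cleaning step does to a sentence, here is a dependency-free sketch of the same pipeline. It is a simplification and not equivalent to `clean_text`. It strips tags with a crude regex instead of BeautifulSoup, splits on whitespace instead of calling `word_tokenize`, handles ASCII punctuation only, and uses a tiny hand-picked stop word set (`STOP_WORDS`, hypothetical) in place of the NLTK list.

```python
import re
import string

# Illustrative subset only; the notebook uses NLTK's full English stop word list
STOP_WORDS = {"the", "is", "a", "of", "and", "in", "to"}

def simple_clean(text):
    text = re.sub(r"<[^>]+>", " ", text)  # crude removal of HTML tags
    text = text.lower()                   # lowercase
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)  # drop ASCII punctuation
    tokens = text.split()                 # whitespace tokenization
    return " ".join(token for token in tokens if token not in STOP_WORDS)

print(simple_clean("<p>The future of AI is here, and it's FAST!</p>"))
# future ai here its fast
```

Note that removing punctuation before tokenizing turns "it's" into "its", which the stop word filter then treats as an ordinary word. The same ordering applies in `clean_text`.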


## 5. Crawl and Store Data for Each Category
- For each category, we iterate through its three websites, crawl the pages, extract the text, and clean it.
- The results are then stored in a text file named after the category.
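
The cell below builds each file name with `category.replace(' ', '_')`, which keeps characters such as `&` (so "Art & Culture" becomes `Art_&_Culture.txt`). If plainer file names are preferred, one option is sketched here. The helper `category_to_filename` is hypothetical and not part of the notebook.

```python
import re

def category_to_filename(category, extension=".txt"):
    # Collapse every run of characters other than letters and digits into one underscore
    stem = re.sub(r"[^A-Za-z0-9]+", "_", category).strip("_")
    return f"{stem}{extension}"

for name in ["Technology", "World News", "Art & Culture"]:
    print(name, "->", category_to_filename(name))
# Technology -> Technology.txt
# World News -> World_News.txt
# Art & Culture -> Art_Culture.txt
```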


In [23]:

# Create a main directory for the dataset and two subdirectories for raw and cleaned data
output_dir = "IITG_MultimodalTextDataset"
raw_dir = os.path.join(output_dir, "raw")
cleaned_dir = os.path.join(output_dir, "cleaned")

# makedirs also creates output_dir as the parent; exist_ok avoids errors on re-runs
os.makedirs(raw_dir, exist_ok=True)
os.makedirs(cleaned_dir, exist_ok=True)

# Iterate over each category and its websites
for category, urls in categories.items():
    print(f"Processing category: {category}")
    
    # Lists to store raw and cleaned text entries
    raw_texts = []
    cleaned_texts = []
    
    for url in urls:
        print(f"  Crawling {url}")
        data = crawl_website(url)
        if data:
            # Create the raw entry
            entry_raw = (f"Title: {data['title']}\n"
                         f"Publication Date: {data['date']}\n"
                         f"Content:\n{data['content']}\n"
                         f"{'-'*80}\n")
            raw_texts.append(entry_raw)
            
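            # Clean the raw entry and add to cleaned list.
            # Note: the whole entry is cleaned, so the field labels ("Title",
            # "Publication Date", "Content") also become tokens in the cleaned text.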
            entry_cleaned = clean_text(entry_raw)
            cleaned_texts.append(entry_cleaned)
    
    # Combine all entries for the category
    combined_raw = "\n".join(raw_texts)
    combined_cleaned = "\n".join(cleaned_texts)
    
    # Save raw data file in the raw folder
    file_path_raw = os.path.join(raw_dir, f"{category.replace(' ', '_')}.txt")
    with open(file_path_raw, "w", encoding="utf-8") as f:
        f.write(combined_raw)
    
    # Save cleaned data file in the cleaned folder
    file_path_cleaned = os.path.join(cleaned_dir, f"{category.replace(' ', '_')}.txt")
    with open(file_path_cleaned, "w", encoding="utf-8") as f:
        f.write(combined_cleaned)
    
    print(f"  Raw data for category '{category}' saved to {file_path_raw}")
    print(f"  Cleaned data for category '{category}' saved to {file_path_cleaned}")

Processing category: Technology
  Crawling https://techcrunch.com
  Crawling https://www.wired.com
  Crawling https://www.theverge.com
  Raw data for category 'Technology' saved to text_dataset\raw\Technology.txt
  Cleaned data for category 'Technology' saved to text_dataset\cleaned\Technology.txt
Processing category: Sports
  Crawling https://www.espn.com
  Crawling https://www.bbc.com/sport
  Crawling https://www.si.com
  Raw data for category 'Sports' saved to text_dataset\raw\Sports.txt
  Cleaned data for category 'Sports' saved to text_dataset\cleaned\Sports.txt
Processing category: Health
  Crawling https://www.webmd.com
  Crawling https://www.healthline.com
  Crawling https://www.medicalnewstoday.com
  Raw data for category 'Health' saved to text_dataset\raw\Health.txt
  Cleaned data for category 'Health' saved to text_dataset\cleaned\Health.txt
Processing category: Politics
  Crawling https://www.cnn.com/politics
  Crawling https://www.bbc.com/news/politics
  Crawling https://w

## 6. Demonstration of a Use Case
- **Use Case:** We perform a simple word frequency analysis on the cleaned "Technology" category data.
- This can be useful for understanding common terms in technology-related articles.

- In a real project, you might use this dataset for tasks like topic modeling, sentiment analysis, or clustering.
- This is done in a separate notebook that is provided.


In [24]:
# Load the technology category text file
tech_file = os.path.join(cleaned_dir, "Technology.txt")
with open(tech_file, "r", encoding="utf-8") as f:
    tech_text = f.read()

# Tokenize the technology text
tokens = word_tokenize(tech_text)

# Calculate word frequencies
freq = Counter(tokens)

# Display the 10 most common words
print("Most common words in the Technology category:")
for word, count in freq.most_common(10):
    print(f"{word}: {count}")

Most common words in the Technology category:
ai: 10
new: 8
2025: 8
game: 7
techcrunch: 6
meta: 5
tech: 5
get: 5
title: 4
content: 4
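
The words "title" and "content" in the list above most likely come from the field labels written into each entry rather than from the articles themselves. A minimal sketch of filtering such labels before counting follows. The `LABEL_TOKENS` set and the sample string are assumptions made for illustration. The filter also drops genuine uses of these words, so it is a rough workaround rather than a fix.

```python
from collections import Counter

# Assumption: lowercase tokens produced by the "Title:", "Publication Date:" and
# "Content:" labels, plus the "Unknown" placeholder used for a missing date
LABEL_TOKENS = {"title", "publication", "date", "content", "unknown"}

# Stand-in for the cleaned Technology text (hypothetical sample)
sample_text = "title ai startup publication date unknown content ai chips new ai"

tokens = sample_text.split()
freq = Counter(token for token in tokens if token not in LABEL_TOKENS)

for word, count in freq.most_common(3):
    print(f"{word}: {count}")
```

A cleaner alternative would be to pass only the title and content strings to `clean_text`, so that the labels never reach the cleaned files.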


## 7. Summary
- In this notebook we:
    - Defined 20 categories with three websites each.
    - Implemented a basic web crawler to extract article titles, content, and publication dates.
    - Cleaned the text data using NLP preprocessing techniques.
    - Stored the data into separate text files for each category.
    - Named the dataset as **IITG_MultimodalTextDataset**.
    - Demonstrated a simple use case (word frequency analysis) on the Technology category.

This framework can be extended further by refining the extraction methods and applying more advanced NLP techniques.