# **Lab Assignment 1: Introduction to NLP and Text Data Collection**

### **Step 1: Install Required Packages & Import Libraries**

In this step, we install all external Python libraries needed for the NLP assignment and import the modules required for web scraping, text processing, data handling, and logging.

####  **Installed Libraries**
- **requests** – For sending HTTP requests to websites.
- **beautifulsoup4 (BS4)** – For parsing and extracting data from HTML pages.
- **pandas** – To store and manipulate scraped data in tabular form.
- **nltk** – Natural Language Toolkit for text preprocessing such as tokenization, stopwords, etc.
- **tqdm** – For progress bars during scraping.
  
####  **Imported Modules Overview**
- **requests**: Makes GET requests to fetch webpage content.
- **BeautifulSoup**: Parses HTML and extracts text.
- **time**: Used to add delays between requests to avoid overloading servers.
- **re**: Regular expressions for cleaning text.
- **nltk**: For tokenization and linguistic preprocessing.
- **Counter**: Helps in counting word frequencies.
- **pandas**: Creates DataFrames to store cleaned and structured data.
- **urljoin & urlparse**: Resolve relative links against a base URL and split URLs into their components.
- **urllib.robotparser**: Checks a website’s *robots.txt* to confirm whether scraping is allowed.
- **tqdm**: Provides progress bars for better visualization.
- **logging**: Handles error/exception logging during the scraping process.

This setup prepares the environment for safe, ethical, and efficient web scraping followed by NLP-based text analysis.


In [19]:
# Install required packages
!pip install requests beautifulsoup4 pandas nltk tqdm

import requests
from bs4 import BeautifulSoup
import time
import re
import nltk
from collections import Counter
import pandas as pd
from urllib.parse import urljoin, urlparse
import urllib.robotparser
from tqdm import tqdm
import logging



### **Step 2: Download NLTK Tokenizers**

To process and analyze text effectively, we need tokenizers from the **Natural Language Toolkit (NLTK)**.  
This step downloads the required tokenizer models:

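- **punkt** → Pre-trained Punkt sentence tokenizer models; `word_tokenize` also relies on them because it splits text into sentences first.
- **punkt_tab** → The same Punkt models stored in a tab-separated (non-pickle) format, which recent NLTK releases load instead of the pickled `punkt` package.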

Downloading these ensures that the script can reliably split text into words and sentences for further preprocessing.


In [45]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

### **Step 3: Configure Logging, Settings & NLTK Tools**

In this step, we set up the essential environment for our web-scraping script:

#### ✔ Configure Logging  
Logging helps track the scraper's progress, warnings, and errors in a readable format.  
`logging.basicConfig()` is used to format and display log messages.

#### ✔ Download Required NLTK Resources  
We download the **punkt** tokenizer silently (without extra console output) to ensure the script can handle sentence and word tokenization later.

#### ✔ Define Scraper Settings  
These parameters control how the scraper behaves:

- **BASE_SECTION_URL** → The root website section to crawl (The Guardian Technology).
- **USER_AGENT** → Identifies our scraper (important for ethical scraping).
- **SCRAPE_DELAY** → Wait time between requests to avoid overwhelming servers.
- **MAX_ARTICLES** → Number of articles we plan to collect.
- **TIMEOUT** → Maximum time to wait for website responses.

These settings provide safety, politeness, and structure to the scraping process.
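
As a minimal sketch of how a delay setting such as `SCRAPE_DELAY` can be enforced, the helper below waits only for whatever part of the interval has not yet elapsed since the previous call. The `Throttle` class is hypothetical and not part of the notebook, which simply calls `time.sleep(SCRAPE_DELAY)` after each request.

```python
import time


class Throttle:
    """Hypothetical helper: enforce a minimum interval between calls to wait()."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None  # monotonic timestamp of the previous call

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)  # sleep only for the time still owed
        self._last = time.monotonic()


# Demo with a short interval; the notebook's SCRAPE_DELAY is 1.0 second.
throttle = Throttle(min_interval=0.05)
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # a real scraper would fetch one page after each wait
elapsed = time.monotonic() - start
print(f"3 throttled calls took {elapsed:.2f}s")
```

The first call returns immediately and each later call is held back, so time already spent downloading or parsing counts toward the delay.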


In [20]:
logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger()

# ---------- NLTK downloads ----------
nltk.download('punkt', quiet=True)

# ---------- Settings (change if needed) ----------
BASE_SECTION_URL = "https://www.theguardian.com/uk/technology"  # Guardian tech section root
USER_AGENT = "Colab NLP Scraper - educational use (your_email@example.com)"
SCRAPE_DELAY = 1.0   # seconds between requests (be polite)
MAX_ARTICLES = 100   # target number of articles to collect
TIMEOUT = 15

#### **Robots Check & Safe Page Fetching**

This part adds two utility functions:

- **`check_robots_allow()`**  
  Reads the site's `robots.txt` and checks whether scraping the given URL is allowed.

- **`fetch_url()`**  
  Safely downloads a webpage using a custom User-Agent, with error handling and logging.

These functions ensure the scraper stays polite and handles failures smoothly.
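
The robots check can be tried offline before pointing it at a live site. The sketch below feeds an in-memory `robots.txt` to `urllib.robotparser` through `RobotFileParser.parse()`. The rules, the `ExampleBot` user agent, the `example.com` URLs, and the `is_allowed` helper are all made up for illustration.

```python
import urllib.robotparser

# Hypothetical robots.txt content (not taken from any real site)
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())  # parse rules from memory, no network access
    return rp.can_fetch(user_agent, url)


allowed_public = is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/technology/story")
allowed_private = is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/private/page")
print("public:", allowed_public, "| private:", allowed_private)
```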


In [21]:
def check_robots_allow(domain_url, user_agent=USER_AGENT):
    """Check robots.txt - returns True if allowed to fetch."""
    parsed = urlparse(domain_url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = urllib.robotparser.RobotFileParser()
    try:
        rp.set_url(robots_url)
        rp.read()
        allowed = rp.can_fetch(user_agent, domain_url)
        log.info(f"[robots.txt] can_fetch {user_agent} -> {domain_url} = {allowed}")
        return allowed
    except Exception as e:
        log.warning(f"[robots.txt] Could not read robots.txt ({robots_url}): {e}. Treating as disallowed.")
        return False  # conservative default: an unreadable robots.txt is treated as "not allowed"

def fetch_url(url, session=None):
    headers = {"User-Agent": USER_AGENT}
    s = session or requests
    try:
        r = s.get(url, headers=headers, timeout=TIMEOUT)
        r.raise_for_status()
        return r.text
    except Exception as e:
        log.warning(f"Failed to fetch {url}: {e}")
        return None


#### **Collecting Article Links from the Section Page**

This function crawls the Guardian Technology section and extracts article URLs.  
It scans up to 20 listing pages, finds candidate article links using CSS selectors (falling back to any link containing `/technology/`), converts relative links to absolute URLs, skips duplicates, and stops once the required number of links is collected.  
A delay between requests ensures polite scraping.
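
The normalization and de-duplication step can be sketched on its own, without any HTML parsing. This version is an illustration under a few assumptions: `normalize_links` is a hypothetical helper, the sample links use `example.com`, and the domain filter compares the parsed host name rather than testing for a substring, so a host such as `example.com.evil.test` is rejected.

```python
from urllib.parse import urljoin, urlparse


def normalize_links(hrefs, base_url, allowed_domain):
    """Resolve hrefs against base_url, keep only allowed_domain hosts, drop duplicates."""
    seen = set()   # O(1) membership checks
    result = []    # preserves first-seen order
    for href in hrefs:
        if not href:
            continue
        absolute = urljoin(base_url, href)  # relative -> absolute
        host = urlparse(absolute).netloc.lower()
        if host != allowed_domain and not host.endswith("." + allowed_domain):
            continue  # different site, including look-alike hosts
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


sample_hrefs = [
    "/technology/a",
    "https://www.example.com/technology/a",  # duplicate of the first link
    "https://example.com.evil.test/x",       # look-alike host, filtered out
    None,                                    # anchor without an href
    "/technology/b",
]
links = normalize_links(sample_hrefs, "https://www.example.com/technology", "example.com")
print(links)
```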


In [22]:
# ---------- 1) Gather article listing links from section pages ----------
def gather_article_links(section_url, max_links=MAX_ARTICLES):
    """
    Crawl the section landing & subsequent pages to collect article URLs.
    This function is simple: it extracts article links that appear in the
    section listing. For robust scraping inspect site structure and change selectors.
    """
    log.info("Gathering article links from section...")
    article_urls = []
    page = 1
    session = requests.Session()
    while len(article_urls) < max_links:
        page_url = section_url if page == 1 else f"{section_url}?page={page}"
        html = fetch_url(page_url, session=session)
        if not html:
            break
        soup = BeautifulSoup(html, "html.parser")

        # The Guardian uses 'a' tags with data-link-name="article" in listings; fallback to article list
        anchors = soup.select("a.js-headline-text, a.u-faux-block-link__overlay, a[data-link-name='article']") \
                  or soup.select("a[href*='/technology/']")

        for a in anchors:
            href = a.get("href")
            if not href:
                continue
            # Normalize and filter domain (avoid non-guardian or short duplicates)
            if href.startswith("/"):
                href = urljoin("https://www.theguardian.com", href)
            if "theguardian.com" not in href:
                continue
            if href not in article_urls:
                article_urls.append(href)
                if len(article_urls) >= max_links:
                    break

        log.info(f"  Page {page} scanned, collected {len(article_urls)} links so far.")
        page += 1
        time.sleep(SCRAPE_DELAY)

        # Safety break to avoid infinite loop
        if page > 20:
            log.info("Reached page limit (20). Stopping link collection.")
            break

    return article_urls[:max_links]

#### **Extracting Structured Information from Each Article**

This function parses an article page and extracts key fields:  
- **Title** (from `<h1>` or `<title>` fallback)  
- **Author** (using author tags or byline classes)  
- **Publication date** (from meta tags or visible date patterns)  
- **Main article text** (from `<article>` or main content containers)

It cleans the content, merges all text paragraphs, and returns a structured dictionary for each scraped article.
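
Because the publication date may arrive either as an ISO 8601 timestamp from a meta tag or as visible text such as "March 5, 2024", it can help to normalize both into one type before analysis. The sketch below is one possible approach. `normalize_pub_date` is a hypothetical helper that the notebook does not define, and the month-name formats assume an English locale.

```python
from datetime import date, datetime


def normalize_pub_date(value):
    """Return a datetime.date for ISO 8601 or 'Month D, YYYY' strings, else None."""
    value = (value or "").strip()
    if not value:
        return None
    try:
        # Older Python versions do not accept a trailing 'Z', so map it to an offset.
        return datetime.fromisoformat(value.replace("Z", "+00:00")).date()
    except ValueError:
        pass
    for fmt in ("%B %d, %Y", "%b %d, %Y"):  # full and abbreviated month names
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    return None  # unknown format: leave it to the caller to decide


iso_date = normalize_pub_date("2024-03-05T09:00:00.000Z")
text_date = normalize_pub_date("March 5, 2024")
print(iso_date, text_date)
```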


In [23]:
# ---------- 2) Parse & extract article content ----------
def extract_article_fields(html, url):
    """Extract title, author, pub_date and cleaned article text from an article HTML."""
    soup = BeautifulSoup(html, "html.parser")

    # Title
    title_tag = soup.find("h1")
    title = title_tag.get_text(strip=True) if title_tag else (soup.title.get_text(strip=True) if soup.title else "")

    # Author - prefer an element with rel="author", then fall back to byline/contributor classes
    author = ""
    a = soup.find(attrs={"rel": "author"})
    if a:
        author = a.get_text(strip=True)
    else:
        # alternative
        byline = soup.select_one(".byline, .contributor")
        if byline:
            author = byline.get_text(strip=True)

    # Pub date - check meta
    pub_date = ""
    meta_date = soup.find("meta", attrs={"property": "article:published_time"}) or soup.find("meta", attrs={"name":"publication_date"})
    if meta_date and meta_date.get("content"):
        pub_date = meta_date["content"]
    else:
        # visible text fallback
        m = re.search(r'([A-Za-z]{3,9}\s+\d{1,2},\s*\d{4})', soup.get_text())
        if m:
            pub_date = m.group(1)

    # Article paragraphs: prefer the <article> element, then <main>, then a div with class content__article-body
    article_container = soup.find("article") or soup.find("main") or soup.find("div", {"class":"content__article-body"})
    paragraphs = []
    if article_container:
        # keep main textual tags: p, h2, h3, li
        for tag in article_container.find_all(['p', 'h2', 'h3', 'li']):
            text = tag.get_text(separator=" ", strip=True)
            if text:
                paragraphs.append(text)
    else:
        # Fallback: large blocks of text
        for p in soup.find_all("p"):
            txt = p.get_text(strip=True)
            if txt:
                paragraphs.append(txt)

    raw_text = " ".join(paragraphs)

    # Clean limited HTML artifacts
    raw_text = re.sub(r'\s+', ' ', raw_text).strip()

    return {
        "title": title,
        "author": author,
        "pub_date": pub_date,
        "raw_text": raw_text,
        "source_url": url
    }

#### **Cleaning and Tokenizing Article Text**

This step preprocesses each article by:
- Removing noise such as “Read more”, ads, and publication labels  
- Normalizing whitespace  
- Tokenizing the cleaned text using **NLTK word_tokenize**  
- Converting all tokens to lowercase  
- Storing the token list, vocabulary, and token count  

The function returns the article dictionary enriched with cleaned text and token-level information.
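
The cleaning and tokenizing steps can be illustrated on a single sentence without NLTK. In the sketch below, `simple_tokenize` is a regex stand-in for `word_tokenize`, not an equivalent: NLTK would, for example, split "aren't" into "are" and "n't". The function names and the sample text are hypothetical.

```python
import re

# Words (with an optional apostrophe suffix) or single punctuation marks
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]")


def clean_text(raw):
    """Drop standalone 'Advertisement' labels and normalize whitespace."""
    raw = re.sub(r"\bAdvertisement\b", "", raw, flags=re.I)
    return re.sub(r"\s+", " ", raw).strip()


def simple_tokenize(text):
    """Regex stand-in for a real tokenizer; returns lowercase tokens."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


sample = "Advertisement  AI chips are   booming. Chips aren't cheap!"
cleaned = clean_text(sample)
tokens = simple_tokenize(cleaned)
vocab = sorted(set(tokens))
print(cleaned)
print(tokens, len(tokens), len(vocab))
```

Lowercasing before building the vocabulary merges "Chips" and "chips" into one entry, which is why the vocabulary is smaller than the token list.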


In [24]:
# ---------- 3) Cleaning & preprocessing ----------
def clean_and_tokenize(doc):
    """
    - Remove explicit "Published" labels, footer lines, etc.
    - Tokenize with NLTK word_tokenize
    - Lowercase tokens
    """
    raw = doc.get("raw_text", "") or ""
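    # Remove "Read more on ..." cues. The text is a single line at this point, so the match
    # is bounded to the end of the sentence; an unbounded '.*' would delete the rest of the article.
    raw = re.sub(r'Read more on[^.!?]*[.!?]?', '', raw, flags=re.I)
    # Remove standalone "Advertisement" labels (whole word only, so "advertisements" is kept)
    raw = re.sub(r'\bAdvertisement\b', '', raw, flags=re.I)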
    # Remove "Published: Month D, YYYY" patterns
    raw = re.sub(r'Published[:\s]*[A-Za-z]{3,9}\s+\d{1,2},\s*\d{4}', '', raw)
    raw = re.sub(r'\s+', ' ', raw).strip()
    doc['cleaned_text'] = raw

    # Tokenize
    try:
        tokens = nltk.word_tokenize(raw)
    except Exception:
        tokens = raw.split()
    tokens = [t.lower() for t in tokens]
    doc['tokens'] = tokens
    doc['vocab'] = sorted(set(tokens))
    doc['token_count'] = len(tokens)
    return doc


#### **Language Statistics Function**

This step computes overall linguistic patterns across all scraped articles.  
The function performs:

- **Total word count** across the corpus  
- **Vocabulary size** based on unique tokens  
- **Sentence segmentation** using NLTK  
- **Average sentence length** (in tokens)  
- **Top 15 meaningful words**, excluding non-alphabetic tokens and a small hand-written stopword list  
- Returns a summary dictionary containing all computed statistics  
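
The statistics listed above can be checked by hand on a toy corpus. The sketch below assumes the sentences are already tokenized. `corpus_stats`, the toy sentences, and the stopword set are made up for illustration. As in the notebook's function, punctuation tokens count toward the total word count and the sentence lengths, and are excluded only from the top-word ranking.

```python
from collections import Counter


def corpus_stats(sentences, stopwords, top_n=2):
    """Compute corpus statistics from a list of pre-tokenized sentences."""
    tokens = [t for sent in sentences for t in sent]
    lengths = [len(sent) for sent in sentences]
    # Only alphabetic, non-stopword tokens compete for the top-word ranking
    content = [t for t in tokens if t.isalpha() and t not in stopwords]
    return {
        "total_word_count": len(tokens),
        "vocab_size": len(set(tokens)),
        "avg_sentence_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "top_words": Counter(content).most_common(top_n),
    }


toy_sentences = [
    ["ai", "chips", "are", "booming", "."],
    ["chips", "are", "not", "cheap", "."],
    ["ai", "is", "everywhere", "."],
]
toy_stats = corpus_stats(toy_sentences, stopwords={"are", "is", "not"})
print(toy_stats)
```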


In [42]:
# ---------- 4) Language analysis ----------
def language_statistics(docs):
    all_tokens = [t for d in docs for t in d.get('tokens', [])]
    total_words = len(all_tokens)
    vocab_size = len(set(all_tokens))
    # sentence splitting
    all_sentences = []
    for d in docs:
        txt = d.get('cleaned_text', '')
        if txt:
            sents = nltk.sent_tokenize(txt)
            all_sentences.extend(sents)
    avg_sentence_len = 0.0
    if all_sentences:
        sent_lens = [len(nltk.word_tokenize(s)) for s in all_sentences]
        avg_sentence_len = sum(sent_lens)/len(sent_lens)
    # top words (exclude noise)
    noise = set(['the','a','to','of','and','is','in','that','it','for','this','on','with','as','by','be','are','i'])
    meaningful = [t for t in all_tokens if t.isalpha() and t not in noise]
    top = Counter(meaningful).most_common(15)
    stats = {
        "total_docs": len(docs),
        "total_word_count": total_words,
        "vocab_size": vocab_size,
        "avg_sentence_length": avg_sentence_len,
        "top_words": top
    }
    return stats

#### **Main Pipeline Execution**

This section integrates all previous components into one complete NLP pipeline:

- **Robots.txt validation** to ensure ethical web scraping  
- **Article link collection** from the Technology section of The Guardian  
- **Scraping & parsing** each article (title, author, date, body text)  
- **Text cleaning and tokenization** using NLTK  
- **Corpus-level language analysis** (word counts, vocabulary size, sentences, top words)  
- **Dataset creation** with `pandas` and automatic CSV export  
- Optionally saves the dataset to **Google Drive** if a path is provided  

The function returns:  
`(pandas DataFrame, list_of_documents, language_statistics)`, or `None` if no article is scraped successfully.
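
The control flow of the scraping loop (fetch, skip on failure, parse, number the rows, export) can be exercised offline by passing in stand-in functions. The sketch below is an assumption-laden illustration: `collect_rows`, `rows_to_csv`, the fake pages, and the fake parser are hypothetical, and it uses the standard-library `csv` module where the notebook uses `pandas`.

```python
import csv
import io


def collect_rows(urls, fetch, parse):
    """Fetch and parse each URL, skipping any page that fetch() could not load."""
    rows = []
    for url in urls:
        html = fetch(url)
        if not html:
            continue  # failed download: skip it and keep going
        rows.append({"doc_id": len(rows) + 1, **parse(html, url)})
    return rows


def rows_to_csv(rows):
    """Serialize a list of row dictionaries to CSV text."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Stand-ins for the network and the HTML parser (hypothetical data)
FAKE_PAGES = {"https://example.com/a": "<h1>A</h1>", "https://example.com/c": "<h1>C</h1>"}


def fake_parse(html, url):
    return {"title": html[4:-5], "source_url": url}  # strip the <h1>...</h1> wrapper


demo_urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
demo_rows = collect_rows(demo_urls, FAKE_PAGES.get, fake_parse)
demo_csv = rows_to_csv(demo_rows)
print(demo_csv)
```

Numbering rows only after a successful parse keeps `doc_id` contiguous even when some pages fail to download.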


In [43]:
# ---------- Main pipeline ----------
def run_pipeline(section_url=BASE_SECTION_URL, max_articles=MAX_ARTICLES, save_to_drive=False, drive_path=None):
    # 1) Robots.txt check
    if not check_robots_allow(section_url):
        log.error("Robots.txt disallows crawling this section for the provided user agent. Stopping; get permission first.")
        # Stop here so the scraper honors robots.txt. Remove this return only with explicit permission from the site.
        return None

    # 2) Gather article links
    links = gather_article_links(section_url, max_links=max_articles)
    log.info(f"Collected {len(links)} article links.")

    # 3) Scrape each article
    collected = []
    session = requests.Session()
    for url in tqdm(links, desc="Scraping articles"):
        html = fetch_url(url, session=session)
        if not html:
            # skip or use mock; here we skip
            log.warning(f"Skipping {url} (no HTML).")
            time.sleep(SCRAPE_DELAY)
            continue
        article = extract_article_fields(html, url)
        article = clean_and_tokenize(article)
        collected.append(article)
        time.sleep(SCRAPE_DELAY)

    if not collected:
        log.error("No articles scraped successfully.")
        return None

    # 4) Language analysis
    stats = language_statistics(collected)
    log.info("\nLanguage Statistics:")
    for k,v in stats.items():
        log.info(f" - {k}: {v}")

    # 5) Save dataset
    df = pd.DataFrame([{
        "doc_id": i+1,
        "title": d['title'],
        "author": d['author'],
        "pub_date": d['pub_date'],
        "source_url": d['source_url'],
        "cleaned_text": d['cleaned_text'],
        "token_count": d.get('token_count', 0),
        "vocab_size": len(d.get('vocab', []))
    } for i,d in enumerate(collected)])

    csv_name = "nlp_articles_cleaned.csv"
    df.to_csv(csv_name, index=False)
    log.info(f"\nSaved dataset CSV to: {csv_name}")

    # Optional: save to Google Drive path if requested
    if save_to_drive and drive_path:
        try:
            df.to_csv(drive_path, index=False)
            log.info(f"Saved a copy to Google Drive: {drive_path}")
        except Exception as e:
            log.warning(f"Could not save to Drive: {e}")

    return df, collected, stats

#### **Example Execution**

This final block runs the entire scraping and preprocessing pipeline.  
- Limits scraping to **25 articles** for quick testing in Colab  
- Executes the full workflow end-to-end  
- Displays the first few cleaned records as a markdown table  


In [46]:
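# ---------- Example run (limit to e.g., 25 while testing) ----------
if __name__ == "__main__":
    # In Colab you may want to test with a smaller number first:
    result = run_pipeline(max_articles=25)  # change to 100 when ready
    if result is None:
        # run_pipeline returns None when no article could be scraped
        print("Pipeline produced no data.")
    else:
        df, docs, stats = result
        # Print first few rows (to_markdown requires the 'tabulate' package)
        print(df.head().to_markdown(index=False))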

Scraping articles: 100%|██████████| 25/25 [00:27<00:00,  1.09s/it]


|   doc_id | title | author | pub_date | source_url | cleaned_text

## 📌 Conclusion & Discussion

In this assignment, a complete NLP preprocessing pipeline was implemented, covering web scraping, text extraction, cleaning, tokenization, and statistical language analysis.

###  Key Outcomes
- Successfully scraped multiple real news articles from *The Guardian – Technology* section.
- Extracted essential metadata such as **title, author, publication date, and full article text**.
- Performed **cleaning, tokenization, vocabulary extraction**, and removal of noisy text.
- Computed meaningful language statistics including:
  - Total word count  
  - Vocabulary size  
  - Average sentence length  
  - Most frequent meaningful words

###  What the Results Indicate
- The vocabulary size relative to the total token count gives a rough measure of lexical diversity in technology journalism.
- The most frequent words point to the core themes of the scraped articles (e.g., tech, AI, devices, digital trends).
- The average sentence length, counted in tokens with punctuation included, indicates how long sentences in these news articles tend to run.
- Cleaning and preprocessing removed boilerplate such as "Advertisement" labels and normalized whitespace, giving cleaner input for further NLP tasks.

###  Learning Significance
This assignment demonstrates:
- How real-world data is messy and requires robust preprocessing.
- How NLP pipelines are structured in real scraping-based projects.
- The importance of tokenization and statistical analysis before applying advanced NLP models.

Overall, this exercise gave hands-on experience with the foundational steps of Natural Language Processing and prepared the pipeline for downstream tasks such as text classification, clustering, topic modeling, and embeddings.
