# Introduction and Summary: Three Scraping Approaches

This notebook demonstrates how to use **BeautifulSoup**, **lxml.html**, and **MechanicalSoup** (each on top of the `requests` library) to scrape articles from BBC News. Each script performs the following steps:

1. **Fetch the Homepage**:  
   It sends an HTTP request to the BBC News homepage with appropriate headers to mimic a real browser.

2. **Parse the HTML Content**:  
   Each script parses the HTML and collects the `href` of every `<a>` tag whose link starts with `/news/`: BeautifulSoup via `find_all`, lxml.html via an XPath query, and MechanicalSoup via a CSS selector.

3. **Construct and Filter Article URLs**:  
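   The code converts relative URLs to absolute URLs and limits the list to a fixed number of links. The lxml.html and MechanicalSoup versions remove duplicates and keep at most 150 links; the BeautifulSoup version keeps duplicates and allows up to 500.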

4. **Scrape Individual Articles**:  
   For each article URL, the script sends a new HTTP request, parses the article page, and extracts:
   - The article title (from the first `<h1>` element)
   - The article content (by concatenating text from all `<p>` elements)

5. **Save the Data**:  
   The extracted title, URL, and content are saved to a CSV file for further analysis or use.

6. **Performance Measurement**:  
   The code tracks the time taken to complete the scraping process and reports the total number of articles scraped.
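
Step 3 can be sketched on its own. The snippet below is a minimal illustration rather than the notebook's code: the helper name `normalize_links`, the sample hrefs, and the `limit` parameter are hypothetical. It uses `urllib.parse.urljoin` so that both relative and already-absolute hrefs resolve against the base URL, and `dict.fromkeys` so that duplicates are dropped while page order is kept.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "https://www.bbc.com"  # same host the notebook scrapes


def normalize_links(hrefs, base_url=BASE_URL, limit=150):
    """Resolve hrefs against base_url, keep /news/ paths, de-duplicate, cap."""
    base_host = urlparse(base_url).netloc
    absolute = []
    for href in hrefs:
        full = urljoin(base_url, href)  # handles relative and absolute hrefs
        parsed = urlparse(full)
        if parsed.netloc == base_host and parsed.path.startswith("/news/"):
            absolute.append(full)
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return list(dict.fromkeys(absolute))[:limit]


# Hypothetical hrefs, shaped like what a homepage might contain.
sample_hrefs = [
    "/news/articles/abc123",
    "/news/articles/abc123",               # duplicate
    "https://www.bbc.com/news/world-456",  # already absolute
    "/sport/football",                     # not a news link
    "https://example.com/news/elsewhere",  # different host
]
links = normalize_links(sample_hrefs, limit=2)
print(links)
```

Because the order is preserved, taking the first `limit` entries means the first links on the page, which a plain `set` would not guarantee.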

**Key Benefits**:  
- **Simplicity**: Easy to implement with minimal setup.
- **Flexibility**: Allows custom extraction logic using CSS selectors.
- **Speed**: Effective for small-to-medium scale scraping tasks.

**Limitations**:  
- The code depends on the structure of the BBC homepage and may require adjustments if the HTML layout changes.
- It relies on manual URL filtering and does not handle pagination, which might limit the total number of articles scraped.
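
One way to notice a layout change early is to sanity-check the scraped rows before saving them. The sketch below is illustrative only: `find_suspect_rows` and the `min_content_chars` threshold are hypothetical, and the rows are made-up samples. It flags rows whose title fell back to `No title` (the placeholder the scripts use) or whose content is suspiciously short.

```python
def find_suspect_rows(rows, min_content_chars=200):
    """Return (index, reason) pairs for rows that look like failed extractions."""
    suspects = []
    for i, row in enumerate(rows):
        if row.get("title", "No title") == "No title":
            suspects.append((i, "missing title"))
        elif len(row.get("content", "")) < min_content_chars:
            suspects.append((i, "content too short"))
    return suspects


# Made-up rows in the same shape the scripts write to CSV.
rows = [
    {"title": "A real headline", "content": "x" * 500},
    {"title": "No title", "content": "x" * 500},
    {"title": "Another headline", "content": "short"},
]
suspects = find_suspect_rows(rows)
print(suspects)
```

A sudden rise in flagged rows between runs would suggest that the `<h1>` or `<p>` assumptions no longer match the page.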

This approach is ideal for quick prototypes or smaller projects where ease of use and rapid development are prioritized.


In [2]:
import logging
import os
import csv
import time

import requests
import lxml.html
from bs4 import BeautifulSoup
import mechanicalsoup

# Set logging level for urllib3 to suppress DEBUG messages.
logging.getLogger("urllib3").setLevel(logging.WARNING)

## BeautifulSoup

In [4]:
# Define the BBC News homepage URL and set up request headers to mimic a browser.
bbc_base_url = 'https://www.bbc.com/news'
request_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# Start timing the script.
start_time = time.time()

# Fetch the homepage content.
homepage_resp = requests.get(bbc_base_url, headers=request_headers)
homepage_soup = BeautifulSoup(homepage_resp.text, 'html.parser')

# Extract article links from the homepage.
# Note: unlike the lxml.html and MechanicalSoup cells, this list is not de-duplicated.
article_links = []
for tag in homepage_soup.find_all('a', href=True):
    link_href = tag['href']
    # Keep site-relative news links only.
    if link_href.startswith('/news/'):
        article_links.append('https://www.bbc.com' + link_href)

# Retrieve and parse details for each article (limiting to the first 500 links).
articles_info = []
for url in article_links[:500]:
    try:
        art_resp = requests.get(url, headers=request_headers)
        art_soup = BeautifulSoup(art_resp.text, 'html.parser')

        # Get the article headline (look up the <h1> once).
        h1_tag = art_soup.find('h1')
        headline = h1_tag.get_text() if h1_tag else 'No title'
        # Combine all paragraph texts to form the article content.
        paragraphs = art_soup.find_all('p')
        article_text = ' '.join(p.get_text() for p in paragraphs)

        articles_info.append({
            'title': headline,
            'url': url,
            'content': article_text
        })
    except Exception as e:
        # Report the failure and move on to the next article.
        print(f"Error scraping article: {e}")

# Save the scraped data into a CSV file.
csv_filename = 'bbc_news_articles_bs.csv'
with open(csv_filename, 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'url', 'content'])
    writer.writeheader()
    writer.writerows(articles_info)

end_time = time.time()
print(f"Scraping completed: {len(articles_info)} articles saved in {end_time - start_time:.2f} seconds using BeautifulSoup.")

Scraping completed: 81 articles saved in 6.05 seconds using BeautifulSoup.
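
The loop above keeps only the articles that were fetched and parsed successfully, and it does not pause between requests. A variant that also records which URLs failed and waits between requests might look like the sketch below. The names `scrape_all`, `fake_fetch`, and `fake_parse` are hypothetical; the stand-in functions exist only so the example runs offline, and in the notebook the fetcher would wrap `requests.get` and the parser would wrap BeautifulSoup.

```python
import time


def scrape_all(urls, fetch, parse, delay_seconds=0.0):
    """Fetch and parse each URL; return (articles, failures)."""
    articles, failures = [], []
    for url in urls:
        try:
            html = fetch(url)
            article = parse(html)
            article["url"] = url
            articles.append(article)
        except Exception as exc:  # keep going, but remember what failed
            failures.append((url, repr(exc)))
        time.sleep(delay_seconds)  # polite pause between requests
    return articles, failures


# Offline stand-ins so the sketch runs without network access.
def fake_fetch(url):
    if url.endswith("/broken"):
        raise ConnectionError("simulated network failure")
    return f"<h1>Headline for {url}</h1>"


def fake_parse(html):
    return {"title": html.replace("<h1>", "").replace("</h1>", "")}


articles, failures = scrape_all(
    ["https://www.bbc.com/news/a", "https://www.bbc.com/news/broken"],
    fetch=fake_fetch,
    parse=fake_parse,
)
print(len(articles), "scraped;", len(failures), "failed")
```

Passing the fetcher and parser in as arguments keeps the loop testable without touching the network.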


## lxml.html

In [6]:
start_time = time.time()

# BBC News homepage URL and headers to mimic a browser.
BBC_URL = "https://www.bbc.com/news"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

# Fetch and parse the homepage.
response = requests.get(BBC_URL, headers=HEADERS)
tree = lxml.html.fromstring(response.content)

# Extract article links using XPath.
raw_links = tree.xpath('//a[contains(@href, "/news/")]/@href')
# Build full URLs and remove duplicates (dict.fromkeys keeps page order, unlike a set).
unique_links = list(dict.fromkeys(f"https://www.bbc.com{link}" for link in raw_links if link.startswith("/news/")))
print(f"DEBUG: Found {len(unique_links)} unique article links.")

# Limit to a subset (adjust as needed): get the first 150 links.
unique_links = unique_links[:150]

articles = []
for url in unique_links:
    try:
        art_resp = requests.get(url, headers=HEADERS)
        art_tree = lxml.html.fromstring(art_resp.content)
        # Extract the full text of the first <h1> element, including nested tags.
        title = art_tree.xpath('string(//h1)').strip() or "No title"
        # Extract all paragraph texts.
        paragraphs = art_tree.xpath('//p//text()')
        content = " ".join(p.strip() for p in paragraphs if p.strip())
        # Append title, link, and content.
        articles.append({
            "title": title,
            "link": url,
            "content": content
        })
        time.sleep(1)  # Pause briefly to be polite.
    except Exception as e:
        print(f"Error scraping article: {e}")

# Specify a path where you have write permission. For example, save to the home directory.
output_path = os.path.join(os.path.expanduser("~"), "bbc_articles_lxml.csv")
with open(output_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "link", "content"])
    writer.writeheader()
    writer.writerows(articles)

end_time = time.time()
print(f"Scraping completed: {len(articles)} articles saved in {end_time - start_time:.2f} seconds.")
print(f"File saved at: {output_path}")

DEBUG: Found 53 unique article links.
Scraping completed: 53 articles saved in 56.05 seconds.
File saved at: C:\Users\300407353\bbc_articles_lxml.csv
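
To confirm that a saved file is intact, it can be read back with `csv.DictReader`. This sketch writes two made-up rows to a temporary file (the file name and rows are illustrative, not the notebook's output) and checks that a content field containing a comma, quotes, and a newline survives the round trip, which is what `newline=""` and the `csv` module's quoting are for.

```python
import csv
import os
import tempfile

rows = [
    {"title": "First headline", "link": "https://www.bbc.com/news/a",
     "content": 'Paragraph one, with a comma and a "quote".\nParagraph two.'},
    {"title": "Second headline", "link": "https://www.bbc.com/news/b",
     "content": "Plain text."},
]
fieldnames = ["title", "link", "content"]

# Write to a temporary file rather than the notebook's real output path.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "articles_demo.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    # Read the file back; newline="" lets the csv module handle embedded newlines.
    with open(path, newline="", encoding="utf-8") as f:
        loaded = list(csv.DictReader(f))

print(len(loaded), "rows read back; round trip intact:", loaded == rows)
```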


## MechanicalSoup

In [8]:
start_time = time.time()

# Create a browser instance with a user agent.
browser = mechanicalsoup.StatefulBrowser(
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
               "AppleWebKit/537.36 (KHTML, like Gecko) "
               "Chrome/91.0.4472.124 Safari/537.36"
)

# BBC News homepage URL.
BBC_URL = "https://www.bbc.com/news"

# Open the homepage.
browser.open(BBC_URL)
soup = browser.get_current_page()

# Extract article links using a CSS selector.
article_links = [a.get("href") for a in soup.select('a[href*="/news/"]')]
# Construct full URLs and remove duplicates (dict.fromkeys keeps page order, unlike a set).
unique_links = list(dict.fromkeys(f"https://www.bbc.com{link}" for link in article_links if link.startswith("/news/")))
print(f"DEBUG: Found {len(unique_links)} unique article links.")

# Limit to a subset (adjust as needed).
unique_links = unique_links[:150]

articles = []
for url in unique_links:
    try:
        # Open each article page.
        browser.open(url)
        page_soup = browser.get_current_page()
        # Extract the title from the first <h1> element.
        title_tag = page_soup.find("h1")
        title = title_tag.get_text(strip=True) if title_tag else "No title"
        # Extract article content from all paragraph elements.
        paragraphs = page_soup.find_all("p")
        content = " ".join(p.get_text(strip=True) for p in paragraphs if p.get_text(strip=True))
        # Append title, link, and content.
        articles.append({
            "title": title,
            "link": url,
            "content": content
        })
        time.sleep(1)  # Pause briefly to be polite.
    except Exception as e:
        print(f"Error scraping an article: {e}")

# Save the scraped articles to a CSV file (including title, link, and content).
csv_filename = "bbc_article_mechanicalsoup.csv"
with open(csv_filename, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "link", "content"])
    writer.writeheader()
    writer.writerows(articles)

end_time = time.time()
print(f"Scraping completed: {len(articles)} articles saved in {end_time - start_time:.2f} seconds.")

DEBUG: Found 53 unique article links.
Scraping completed: 53 articles saved in 56.87 seconds.


# Comparison of Scraping Methods for BBC News

Below is a comparison table summarizing the performance and characteristics of the three scraping methods (**BeautifulSoup**, **lxml.html**, and **MechanicalSoup**). The article counts and times are taken from the cell outputs recorded above. The lxml.html and MechanicalSoup scripts capped the link list at 150 and the BeautifulSoup script at 500; every run found fewer links than its cap. Only the lxml.html and MechanicalSoup loops pause for 1 second after each article, and only they de-duplicate links, so the counts and times are not measured under identical conditions.

| **Feature** | **BeautifulSoup** | **lxml.html** | **MechanicalSoup** |
|---|---|---|---|
| **Articles Scraped** (Goal: 150) | 81 | 53 | 53 |
| **Time Taken (seconds)** | 6.05 | 56.05 | 56.87 |
| **Ease of Setup** | Straightforward; minimal dependencies | Simple for XPath users; requires knowledge of lxml and XPath | Easy to install; simulates a browser session (StatefulBrowser) |
| **Speed** | Fast for smaller tasks; no per-article pause in this test | Efficient with well-formed HTML; wall-clock time here includes a 1-second pause per article | Slight overhead from the browser-like session; wall-clock time here includes a 1-second pause per article |
| **Scalability** | Good for moderate projects; no built-in concurrency | Good for moderate projects; no built-in concurrency | Good for moderate projects; no built-in concurrency |
| **Code Complexity** | Minimal; manual link collection and data parsing | Requires understanding of XPath; code remains concise | Similar to BeautifulSoup but uses a browser-emulation approach |
| **Built-in Features** | None; requires manual or third-party enhancements | Pure parser; advanced features require custom logic | Session management via StatefulBrowser; no built-in concurrency |
| **Reached 150 Articles?** | No (81 articles) | No (53 articles) | No (53 articles) |
| **Limitations Observed** | Limited links on the BBC homepage; links not de-duplicated, so the count likely includes repeats | 53 unique links found on the homepage | 53 unique links found on the homepage |
| **Best Use Cases** | Quick prototyping and small-to-medium scrapes | When fine-grained XPath control is needed and the HTML is well structured | Projects that benefit from session management and browser simulation |
| **Overall Observations** | Shortest wall-clock time and most rows saved, under more favorable conditions (no pause, no de-duplication) | Powerful for structured parsing; time dominated by the per-article pause | Convenient for simulating browser actions; time dominated by the per-article pause |
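
The Scalability row notes that none of the three libraries provides concurrency. The standard library can supply it. The sketch below uses `concurrent.futures.ThreadPoolExecutor` with a stand-in `fake_fetch` function (hypothetical, so that the example runs offline) and made-up URLs. In a real run the worker would call `requests.get`, and the worker count should stay small to remain polite to the site.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Stand-in for a network request: sleeps briefly, then returns a fake page."""
    time.sleep(0.1)  # simulated network latency
    return f"<h1>Page for {url}</h1>"


urls = [f"https://www.bbc.com/news/example-{i}" for i in range(8)]  # made-up URLs

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map returns results in the same order as the input URLs.
    pages = list(pool.map(fake_fetch, urls))
elapsed = time.perf_counter() - start

print(f"Fetched {len(pages)} pages in {elapsed:.2f} s")
```

Threads suit this workload because the time is spent waiting on the network rather than on CPU-bound parsing.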

## Key Takeaways

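- **All methods** fell short of their link caps because the BBC homepage exposes a limited number of `/news/` links.
- **BeautifulSoup** saved the most rows (81) in the shortest time (6.05 seconds), but its loop neither de-duplicates links nor pauses between requests, so neither figure is directly comparable with the other two.
- **lxml.html** allowed precise XPath-based parsing and saved 53 articles in 56.05 seconds, most of which is the 1-second pause after each article.
- **MechanicalSoup** saved 53 articles in 56.87 seconds (with the same pause), offering a browser-like session approach.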
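
The wall-clock times are not directly comparable, because the lxml.html and MechanicalSoup loops call `time.sleep(1)` after each saved article while the BeautifulSoup loop does not. A rough adjustment, using the figures printed in the cell outputs above and assuming exactly one 1-second pause per saved article, subtracts the sleep time before dividing by the article count:

```python
# (articles saved, wall-clock seconds, sleep seconds per article), from the cell outputs.
runs = {
    "BeautifulSoup": (81, 6.05, 0.0),
    "lxml.html": (53, 56.05, 1.0),
    "MechanicalSoup": (53, 56.87, 1.0),
}

adjusted = {}
for name, (n_articles, wall_seconds, sleep_per_article) in runs.items():
    # Remove the deliberate pauses to estimate time spent fetching and parsing.
    work_seconds = wall_seconds - n_articles * sleep_per_article
    adjusted[name] = round(work_seconds / n_articles, 3)
    print(f"{name}: {work_seconds:.2f} s of work, {adjusted[name]:.3f} s per article")
```

On this adjusted basis all three methods land between roughly 0.06 and 0.07 seconds per article. Given that this is a single run with ordinary network variance, the data do not support ranking the libraries by speed.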
