# <div align='center'><b>Introduction Summary for Python Web Scraping Files</b></div>
> In this project for CSIS 4260 Assignment 2, I have chosen to compare two popular web scraping libraries: BeautifulSoup and Scrapy.
> 
>  Both libraries are widely used for web scraping tasks due to their flexibility and ease of use.
>
>  I selected BBC News as the public website to scrape articles from, ensuring that more than 100 articles were collected.
>
> The BeautifulSoup implementation focuses on simplicity and quick extraction, while Scrapy provides a more structured and scalable approach.
>
> Timing benchmarks have been included in both implementations to compare the efficiency of each library.
>
>  The final result displays the total count of scraped articles and the time taken for the process.
>
> This project demonstrates the strengths of each library in handling web scraping tasks efficiently.


In [1]:
import requests
from bs4 import BeautifulSoup
import time
import csv

# BBC News URL
BBC_URL = 'https://www.bbc.com/news'
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

start_time = time.time()

response = requests.get(BBC_URL, headers=HEADERS)
soup = BeautifulSoup(response.text, 'html.parser')

articles = []
for link in soup.find_all('a', href=True):
    url = link['href']
    if '/news/' in url and url.startswith('/news/'):
        articles.append('https://www.bbc.com' + url)

# Scrape articles details
data = []
for article_url in articles[:500]: 
    try:
        article_response = requests.get(article_url, headers=HEADERS)
        article_soup = BeautifulSoup(article_response.text, 'html.parser')

        title = article_soup.find('h1').get_text() if article_soup.find('h1') else 'No title'
        content = ' '.join([p.get_text() for p in article_soup.find_all('p')])

        data.append({'title': title, 'url': article_url, 'content': content})
    except Exception:
        continue

with open('bbc_news_articles_bs.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'url', 'content'])
    writer.writeheader()
    writer.writerows(data)

end_time = time.time()
print(f"Scraping completed: {len(data)} articles saved in {end_time - start_time:.2f} seconds using BeautifulSoup.")

Scraping completed: 82 articles saved in 79.99 seconds using BeautifulSoup.


In [2]:
import scrapy
from scrapy.crawler import CrawlerProcess
import time
import csv

class BBCNewsSpider(scrapy.Spider):
    name = 'bbc_news'
    start_urls = ['https://www.bbc.com/news']

    def parse(self, response):
        article_links = response.css('a[href*="/news/"]::attr(href)').getall()
        article_links = list(set(['https://www.bbc.com' + link for link in article_links if link.startswith('/news/')]))
        for link in article_links[:500]:  
            yield scrapy.Request(url=link, callback=self.parse_article)

    def parse_article(self, response):
        title = response.css('h1::text').get() or 'No title'
        content = ' '.join(response.css('p::text').getall())
        yield {
            'title': title,
            'url': response.url,
            'content': content
        }

start_time = time.time()

process = CrawlerProcess(settings={
    'FEEDS': {
        'bbc_news_articles_scrapy.csv': {'format': 'csv'},
    },
    'LOG_ENABLED': False,
})

process.crawl(BBCNewsSpider)
process.start()

end_time = time.time()
print(f"Scraping completed using Scrapy: {sum(1 for line in open('bbc_news_articles_scrapy.csv', encoding='utf-8'))-1} articles saved in {end_time - start_time:.2f} seconds.")

Scraping completed using Scrapy: 233 articles saved in 2.29 seconds.


### <div align='center'><h3>Comparison Table of BeautifulSoup and Scrapy Performance</h3></div>
| **Feature**         | **BeautifulSoup**                                   | **Scrapy**                                            |
|---------------------|------------------------------------------------------|-------------------------------------------------------|
| Articles Scraped    | 82                                                   | 233                                                   |
| Time Taken (s)      | 79.99                                                | 2.29                                                  |
| Ease of Setup       | Easy to set up, minimal dependencies                 | Requires project structure and setup                  |
| Speed               | Slower for large datasets                            | Faster with built-in asynchronous requests            |
| Scalability         | Suitable for small to medium tasks                   | Highly scalable for large-scale scraping              |
| Code Complexity     | Simpler code, manual handling of requests            | More complex, but automates crawling and data storage |
| Built-in Features   | No built-in crawling, needs external handling        | Built-in support for crawling, pipelines, and more    |
| Best For            | Small projects, quick tasks                          | Large projects with high data volume                  |

### <div align='center'><h3>Key Differences</h3></div>
> BeautifulSoup is lightweight and simpler to implement but slower for large-scale scraping.

> Scrapy is a full-fledged framework with built-in crawling and faster performance.

> BeautifulSoup requires manual handling of requests and parsing, while Scrapy automates these processes.

> Scrapy supports asynchronous scraping, making it more efficient than BeautifulSoup for extensive data collection.

> Scrapy provides better scalability and built-in tools for managing large datasets, while BeautifulSoup is ideal for quick and simple scraping tasks.