# Advanced Web Scraping Lab

In this lab you will first study the following code snippet, a simple web spider class that lets you scrape paginated webpages. Read the code, run it, and make sure you understand how it works. In the challenges of this lab, we will guide you in building up this class so that you eventually have a more robust web spider that you can develop further in the Web Scraping Project.

In [1]:
import requests
from bs4 import BeautifulSoup

class IronhackSpider:
    """
    This is the constructor class to which you can pass a bunch of parameters. 
    These parameters are stored to the class instance variables so that the
    class functions can access them later.
    
    url_pattern: a printf-style format string for the page URLs to scrape; %s is replaced by the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the time interval in seconds to delay between requests. If <0, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    """
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    """
    Scrape the content of a single url.
    """
    def scrape_url(self, url):
        response = requests.get(url)
        result = self.content_parser(response.content)
        self.output_results(result)
    
    """
    Export the scraped content. Right now it simply prints out the results.
    But in the future you can export the results into a text file or database.
    """
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    This function uses a FOR loop to call `scrape_url()` for each url to scrape.
    """
    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            self.scrape_url(self.url_pattern % i)


URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # format string for the URLs to scrape; %s is replaced by the page number
PAGES_TO_SCRAPE = 1 # how many webpages to scrape

"""
This is a custom parser function you will complete in the challenge.
Right now it simply returns the string passed to it. But in this lab
you will complete this function so that it extracts the quotes.
This function will be passed to the IronhackSpider class.
"""
def quotes_parser(content):
    return content

# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=quotes_parser)

# Start scraping jobs
my_spider.kickstart()

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bootstrap.min.css">\n    <link rel="stylesheet" href="/static/main.css">\n    \n    \n</head>\n<body>\n    <div class="container">\n        <div class="row header-box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n                </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/login">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n\n<div class="row">\n    <div class="col-md-8">\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinki
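Note that `URL_PATTERN` is a printf-style format string rather than a regular expression: `kickstart()` fills the `%s` placeholder with each page number. The following is a minimal sketch of that expansion, assuming a hypothetical helper named `build_urls` that is not part of the lab code.

```python
URL_PATTERN = 'http://quotes.toscrape.com/page/%s/'

def build_urls(url_pattern, pages_to_scrape):
    """Return the page URLs that a loop like kickstart() would visit.

    build_urls is a hypothetical helper used only for illustration.
    """
    # The % operator substitutes the page number for the %s placeholder.
    return [url_pattern % page for page in range(1, pages_to_scrape + 1)]

urls = build_urls(URL_PATTERN, 3)
for url in urls:
    print(url)
```

Building the list up front like this can make it easier to check the URLs before any request is sent.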

## Challenge 1 - Custom Parser Function

In this challenge, complete the custom `quotes_parser()` function so that the returned result contains the quote strings instead of the whole HTML page content.

In the cell below, write your updated `quotes_parser()` function and kickstart the spider. Make sure the printed results contain a list of quote strings extracted from the HTML content.

In [3]:
import requests
from bs4 import BeautifulSoup

class IronhackSpider:
    """
    This is the constructor class to which you can pass a bunch of parameters. 
    These parameters are stored to the class instance variables so that the
    class functions can access them later.
    
    url_pattern: a printf-style format string for the page URLs to scrape; %s is replaced by the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the time interval in seconds to delay between requests. If <0, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    """
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    """
    Scrape the content of a single url.
    """
    def scrape_url(self, url):
        response = requests.get(url)
        result = self.content_parser(response.content)
        self.output_results(result)
    
    """
    Export the scraped content. Right now it simply prints out the results.
    But in the future you can export the results into a text file or database.
    """
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    This function uses a FOR loop to call `scrape_url()` for each url to scrape.
    """
    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            self.scrape_url(self.url_pattern % i)


URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # format string for the URLs to scrape; %s is replaced by the page number
PAGES_TO_SCRAPE = 1 # how many webpages to scrape

"""
Custom parser function that extracts the quote strings from the HTML content.
This function is passed to the IronhackSpider class.
"""
def quotes_parser(content):
    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(content, 'html.parser')
    
    # Find all quote elements
    quote_elements = soup.find_all('div', class_='quote')
    
    # Extract the text from each quote
    quotes = []
    for quote_element in quote_elements:
        # Find the span with class 'text' which contains the quote
        text_span = quote_element.find('span', class_='text')
        if text_span:
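            # Get the text and trim surrounding whitespace
            quote_text = text_span.get_text().strip()
            # Remove surrounding straight quotation marks (") if present.
            # Note: the site wraps quotes in curly marks (“ ”), which these
            # checks do not match, so the curly marks stay in the output.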
            if quote_text.startswith('"') and quote_text.endswith('"'):
                quote_text = quote_text[1:-1]
            elif quote_text.startswith('"'):
                quote_text = quote_text[1:]
            elif quote_text.endswith('"'):
                quote_text = quote_text[:-1]
            quotes.append(quote_text)
    
    return quotes

# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=quotes_parser)

# Start scraping jobs
my_spider.kickstart()

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", '“Try not to become a man of success. Rather become a man of value.”', '“It is better to be hated for what you are than to be loved for what you are not.”', "“I have not failed. I've just found 10,000 ways that won't work.”", "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", '“A day without sunshine is like, you know, night.”']
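The printed quotes above still carry their curly quotation marks (`“` and `”`), because a check for the straight `"` character does not match them. If you want bare quote text, one option is to strip both kinds of marks. The sketch below assumes a hypothetical helper named `strip_quote_marks`; it is not part of the lab code.

```python
def strip_quote_marks(text):
    """Trim whitespace, then any straight or curly double quotes at either end.

    strip_quote_marks is a hypothetical helper used only for illustration.
    \u201c and \u201d are the left and right curly double quotation marks.
    """
    return text.strip().strip('"\u201c\u201d')

sample = '\u201cA day without sunshine is like, you know, night.\u201d'
print(strip_quote_marks(sample))
```

Keep in mind that `str.strip(chars)` removes every leading and trailing character found in `chars`, so a quote that itself begins or ends with a nested quotation mark would lose that mark too.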


## Challenge 2 - Error Handling

In `IronhackSpider.scrape_url()`, catch any error that might occur when you make a request to scrape a webpage. This includes checking the response status code and catching HTTP request errors such as timeouts, SSL errors, and too many redirects.

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.

In [4]:
# your code here
import requests
from bs4 import BeautifulSoup
import time
from requests.exceptions import RequestException, Timeout, SSLError, TooManyRedirects, ConnectionError, HTTPError

class IronhackSpider:
    """
    This is the constructor class to which you can pass a bunch of parameters. 
    These parameters are stored to the class instance variables so that the
    class functions can access them later.
    
    url_pattern: a printf-style format string for the page URLs to scrape; %s is replaced by the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the time interval in seconds to delay between requests. If <0, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    """
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    """
    Scrape the content of a single url with comprehensive error handling.
    """
    def scrape_url(self, url):
        try:
            # Make the HTTP request with timeout
            response = requests.get(url, timeout=10)
            
            # Check if the request was successful
            response.raise_for_status()
            
            # Parse the content if successful
            result = self.content_parser(response.content)
            self.output_results(result)
            
        except Timeout:
            print(f"Timeout error occurred while accessing: {url}")
            self.output_results(f"Error: Request timed out for {url}")
            
        except SSLError:
            print(f"SSL error occurred while accessing: {url}")
            self.output_results(f"Error: SSL certificate problem for {url}")
            
        except TooManyRedirects:
            print(f"Too many redirects for: {url}")
            self.output_results(f"Error: Too many redirects for {url}")
            
        except ConnectionError:
            print(f"Connection error occurred while accessing: {url}")
            self.output_results(f"Error: Connection failed for {url}")
            
        except HTTPError as e:
            print(f"HTTP error occurred: {e} for URL: {url}")
            self.output_results(f"Error: HTTP {response.status_code} for {url}")
            
        except RequestException as e:
            print(f"General request exception occurred: {e} for URL: {url}")
            self.output_results(f"Error: Request failed for {url} - {str(e)}")
            
        except Exception as e:
            print(f"Unexpected error occurred: {e} for URL: {url}")
            self.output_results(f"Error: Unexpected error for {url} - {str(e)}")
    
    """
    Export the scraped content. Right now it simply prints out the results.
    But in the future you can export the results into a text file or database.
    """
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    This function uses a FOR loop to call `scrape_url()` for each url to scrape.
    """
    def kickstart(self):
        for i in range(1, self.pages_to_scrape + 1):
            url = self.url_pattern % i
            print(f"Scraping: {url}")
            
            self.scrape_url(url)
            
            # Add delay between requests if sleep_interval is positive
            if self.sleep_interval > 0:
                print(f"Sleeping for {self.sleep_interval} seconds...")
                time.sleep(self.sleep_interval)


URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # format string for the URLs to scrape; %s is replaced by the page number
PAGES_TO_SCRAPE = 1 # how many webpages to scrape

"""
Custom parser function to extract quotes from the HTML content.
"""
def quotes_parser(content):
    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(content, 'html.parser')
    
    # Find all quote elements
    quote_elements = soup.find_all('div', class_='quote')
    
    # Extract the text from each quote
    quotes = []
    for quote_element in quote_elements:
        # Find the span with class 'text' which contains the quote
        text_span = quote_element.find('span', class_='text')
        if text_span:
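            # Get the text and trim surrounding whitespace
            quote_text = text_span.get_text().strip()
            # Remove surrounding straight quotation marks (") if present.
            # Note: the site wraps quotes in curly marks (“ ”), which these
            # checks do not match, so the curly marks stay in the output.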
            if quote_text.startswith('"') and quote_text.endswith('"'):
                quote_text = quote_text[1:-1]
            elif quote_text.startswith('"'):
                quote_text = quote_text[1:]
            elif quote_text.endswith('"'):
                quote_text = quote_text[:-1]
            quotes.append(quote_text)
    
    return quotes

# Instantiate the IronhackSpider class
my_spider = IronhackSpider(
    URL_PATTERN, 
    PAGES_TO_SCRAPE, 
    sleep_interval=1,  # 1 second delay between requests
    content_parser=quotes_parser
)

# Start scraping jobs
print("Starting spider...")
my_spider.kickstart()
print("Spider finished!")

Starting spider...
Scraping: http://quotes.toscrape.com/page/1/
['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", '“Try not to become a man of success. Rather become a man of value.”', '“It is better to be hated for what you are than to be loved for what you are not.”', "“I have not failed. I've just found 10,000 ways that won't work.”", "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", '“A day without sunshine is like, you know, night.”']
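The order of the `except` clauses in `scrape_url()` matters, because Python runs the first handler whose exception class matches. In `requests`, several of the exception classes inherit from one another, so the more specific ones have to be listed before their parents. The short check below is a sketch that prints those relationships; the list name `specific_before_general` is only illustrative.

```python
from requests.exceptions import (
    ConnectionError, ConnectTimeout, HTTPError, RequestException,
    SSLError, Timeout, TooManyRedirects,
)

# (child, parent) pairs: the child must be caught before the parent,
# otherwise the parent's handler would run instead.
specific_before_general = [
    (SSLError, ConnectionError),
    (ConnectTimeout, Timeout),
    (ConnectTimeout, ConnectionError),
    (Timeout, RequestException),
    (ConnectionError, RequestException),
    (HTTPError, RequestException),
    (TooManyRedirects, RequestException),
]

for child, parent in specific_before_general:
    print(f"{child.__name__} is a {parent.__name__}: {issubclass(child, parent)}")
```

This is why `SSLError` is handled before `ConnectionError`, and why `RequestException` and the bare `Exception` come last.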


## Challenge 3 - Sleep Interval

In `IronhackSpider.kickstart()`, implement `sleep_interval`. You will check if `self.sleep_interval` is larger than 0. If so, make the FOR loop sleep for the given number of seconds before making the next request.

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.

In [5]:
# your code here
import requests
from bs4 import BeautifulSoup
import time
from requests.exceptions import RequestException, Timeout, SSLError, TooManyRedirects, ConnectionError, HTTPError

class IronhackSpider:
    """
    This is the constructor class to which you can pass a bunch of parameters. 
    These parameters are stored to the class instance variables so that the
    class functions can access them later.
    
    url_pattern: a printf-style format string for the page URLs to scrape; %s is replaced by the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the time interval in seconds to delay between requests. If <0, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    """
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    """
    Scrape the content of a single url with comprehensive error handling.
    """
    def scrape_url(self, url):
        try:
            # Make the HTTP request with timeout
            response = requests.get(url, timeout=10)
            
            # Check if the request was successful
            response.raise_for_status()
            
            # Parse the content if successful
            result = self.content_parser(response.content)
            self.output_results(result)
            
        except Timeout:
            print(f"Timeout error occurred while accessing: {url}")
            self.output_results(f"Error: Request timed out for {url}")
            
        except SSLError:
            print(f"SSL error occurred while accessing: {url}")
            self.output_results(f"Error: SSL certificate problem for {url}")
            
        except TooManyRedirects:
            print(f"Too many redirects for: {url}")
            self.output_results(f"Error: Too many redirects for {url}")
            
        except ConnectionError:
            print(f"Connection error occurred while accessing: {url}")
            self.output_results(f"Error: Connection failed for {url}")
            
        except HTTPError as e:
            print(f"HTTP error occurred: {e} for URL: {url}")
            self.output_results(f"Error: HTTP {response.status_code} for {url}")
            
        except RequestException as e:
            print(f"General request exception occurred: {e} for URL: {url}")
            self.output_results(f"Error: Request failed for {url} - {str(e)}")
            
        except Exception as e:
            print(f"Unexpected error occurred: {e} for URL: {url}")
            self.output_results(f"Error: Unexpected error for {url} - {str(e)}")
    
    """
    Export the scraped content. Right now it simply prints out the results.
    But in the future you can export the results into a text file or database.
    """
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    This function uses a FOR loop to call `scrape_url()` for each url to scrape.
    Implements sleep interval between requests if specified.
    """
    def kickstart(self):
        for i in range(1, self.pages_to_scrape + 1):
            url = self.url_pattern % i
            print(f"Scraping: {url}")
            
            self.scrape_url(url)
            
            # Add delay between requests if sleep_interval is positive
            # Only sleep if this is not the last page
            if self.sleep_interval > 0 and i < self.pages_to_scrape:
                print(f"Sleeping for {self.sleep_interval} seconds before next request...")
                time.sleep(self.sleep_interval)


URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # format string for the URLs to scrape; %s is replaced by the page number
PAGES_TO_SCRAPE = 3 # how many webpages to scrape

"""
Custom parser function to extract quotes from the HTML content.
"""
def quotes_parser(content):
    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(content, 'html.parser')
    
    # Find all quote elements
    quote_elements = soup.find_all('div', class_='quote')
    
    # Extract the text from each quote
    quotes = []
    for quote_element in quote_elements:
        # Find the span with class 'text' which contains the quote
        text_span = quote_element.find('span', class_='text')
        if text_span:
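            # Get the text and trim surrounding whitespace
            quote_text = text_span.get_text().strip()
            # Remove surrounding straight quotation marks (") if present.
            # Note: the site wraps quotes in curly marks (“ ”), which these
            # checks do not match, so the curly marks stay in the output.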
            if quote_text.startswith('"') and quote_text.endswith('"'):
                quote_text = quote_text[1:-1]
            elif quote_text.startswith('"'):
                quote_text = quote_text[1:]
            elif quote_text.endswith('"'):
                quote_text = quote_text[:-1]
            quotes.append(quote_text)
    
    return quotes

# Instantiate the IronhackSpider class with a 2-second sleep interval
my_spider = IronhackSpider(
    URL_PATTERN, 
    PAGES_TO_SCRAPE, 
    sleep_interval=2,  # 2-second delay between requests
    content_parser=quotes_parser
)

# Start scraping jobs
print("Starting spider with sleep interval...")
my_spider.kickstart()
print("Spider finished!")

Starting spider with sleep interval...
Scraping: http://quotes.toscrape.com/page/1/
['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", '“Try not to become a man of success. Rather become a man of value.”', '“It is better to be hated for what you are than to be loved for what you are not.”', "“I have not failed. I've just found 10,000 ways that won't work.”", "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", '“A day without sunshine is like, 
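The pacing logic in `kickstart()` can be checked without sending any requests. The sketch below isolates the loop in a hypothetical function named `paced_loop` and accepts the sleep function as a parameter, so a test can record the pauses instead of actually waiting. Both the function and its parameters are assumptions made for illustration.

```python
import time

def paced_loop(pages, sleep_interval, action, sleep=time.sleep):
    """Call action(page) for pages 1..pages, pausing between calls.

    paced_loop is a hypothetical stand-in for the loop in kickstart().
    The pause is skipped after the last page and when sleep_interval <= 0.
    """
    for i in range(1, pages + 1):
        action(i)
        if sleep_interval > 0 and i < pages:
            sleep(sleep_interval)

visited, pauses = [], []
# Record the pauses instead of really sleeping.
paced_loop(3, 2, visited.append, sleep=pauses.append)
print(visited)  # pages that were processed
print(pauses)   # one pause between each pair of pages
```

With three pages and a 2-second interval, the loop pauses twice: once after page 1 and once after page 2.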

## Challenge 4 - Test Batch Scraping

Change the `PAGES_TO_SCRAPE` value from `1` to `10`. Check whether your code still works as intended when scraping 10 webpages. If there are errors in your code, fix them.

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.

In [6]:
# your code here
import requests
from bs4 import BeautifulSoup
import time
from requests.exceptions import RequestException, Timeout, SSLError, TooManyRedirects, ConnectionError, HTTPError

class IronhackSpider:
    """
    This is the constructor class to which you can pass a bunch of parameters. 
    These parameters are stored to the class instance variables so that the
    class functions can access them later.
    
    url_pattern: a printf-style format string for the page URLs to scrape; %s is replaced by the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the time interval in seconds to delay between requests. If <0, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    """
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    """
    Scrape the content of a single url with comprehensive error handling.
    """
    def scrape_url(self, url):
        try:
            # Make the HTTP request with timeout
            response = requests.get(url, timeout=10)
            
            # Check if the request was successful
            response.raise_for_status()
            
            # Parse the content if successful
            result = self.content_parser(response.content)
            self.output_results(result)
            
        except Timeout:
            print(f"Timeout error occurred while accessing: {url}")
            self.output_results(f"Error: Request timed out for {url}")
            
        except SSLError:
            print(f"SSL error occurred while accessing: {url}")
            self.output_results(f"Error: SSL certificate problem for {url}")
            
        except TooManyRedirects:
            print(f"Too many redirects for: {url}")
            self.output_results(f"Error: Too many redirects for {url}")
            
        except ConnectionError:
            print(f"Connection error occurred while accessing: {url}")
            self.output_results(f"Error: Connection failed for {url}")
            
        except HTTPError as e:
            print(f"HTTP error occurred: {e} for URL: {url}")
            self.output_results(f"Error: HTTP {response.status_code} for {url}")
            
        except RequestException as e:
            print(f"General request exception occurred: {e} for URL: {url}")
            self.output_results(f"Error: Request failed for {url} - {str(e)}")
            
        except Exception as e:
            print(f"Unexpected error occurred: {e} for URL: {url}")
            self.output_results(f"Error: Unexpected error for {url} - {str(e)}")
    
    """
    Export the scraped content. Right now it simply prints out the results.
    But in the future you can export the results into a text file or database.
    """
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    This function uses a FOR loop to call `scrape_url()` for each url to scrape.
    Implements sleep interval between requests if specified.
    """
    def kickstart(self):
        for i in range(1, self.pages_to_scrape + 1):
            url = self.url_pattern % i
            print(f"Scraping: {url}")
            
            self.scrape_url(url)
            
            # Add delay between requests if sleep_interval is positive
            # Only sleep if this is not the last page
            if self.sleep_interval > 0 and i < self.pages_to_scrape:
                print(f"Sleeping for {self.sleep_interval} seconds before next request...")
                time.sleep(self.sleep_interval)


URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # format string for the URLs to scrape; %s is replaced by the page number
PAGES_TO_SCRAPE = 10 # how many webpages to scrape - CHANGED FROM 1 TO 10

"""
Custom parser function to extract quotes from the HTML content.
"""
def quotes_parser(content):
    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(content, 'html.parser')
    
    # Find all quote elements
    quote_elements = soup.find_all('div', class_='quote')
    
    # Extract the text from each quote
    quotes = []
    for quote_element in quote_elements:
        # Find the span with class 'text' which contains the quote
        text_span = quote_element.find('span', class_='text')
        if text_span:
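            # Get the text and trim surrounding whitespace
            quote_text = text_span.get_text().strip()
            # Remove surrounding straight quotation marks (") if present.
            # Note: the site wraps quotes in curly marks (“ ”), which these
            # checks do not match, so the curly marks stay in the output.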
            if quote_text.startswith('"') and quote_text.endswith('"'):
                quote_text = quote_text[1:-1]
            elif quote_text.startswith('"'):
                quote_text = quote_text[1:]
            elif quote_text.endswith('"'):
                quote_text = quote_text[:-1]
            quotes.append(quote_text)
    
    return quotes

# Instantiate the IronhackSpider class with a 2-second sleep interval
my_spider = IronhackSpider(
    URL_PATTERN, 
    PAGES_TO_SCRAPE, 
    sleep_interval=2,  # 2-second delay between requests
    content_parser=quotes_parser
)

# Start scraping jobs
print("Starting spider to scrape 10 pages...")
print("=" * 50)
my_spider.kickstart()
print("=" * 50)
print("Spider finished! Successfully scraped 10 pages.")

Starting spider to scrape 10 pages...
Scraping: http://quotes.toscrape.com/page/1/
['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", '“Try not to become a man of success. Rather become a man of value.”', '“It is better to be hated for what you are than to be loved for what you are not.”', "“I have not failed. I've just found 10,000 ways that won't work.”", "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", '“A day without sunshine is like, y

## Challenge 5 - Scrape a Different Website

Update the parameters passed to the `IronhackSpider` constructor so that your code can crawl [books.toscrape.com](http://books.toscrape.com/). You will need to use a different `URL_PATTERN` (figure out the new URL pattern by yourself) and write another parser function to pass to `IronhackSpider`.

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.

In [7]:
# your code here
import requests
from bs4 import BeautifulSoup
import time
from requests.exceptions import RequestException, Timeout, SSLError, TooManyRedirects, ConnectionError, HTTPError

class IronhackSpider:
    """
    This is the constructor class to which you can pass a bunch of parameters. 
    These parameters are stored to the class instance variables so that the
    class functions can access them later.
    
    url_pattern: a printf-style format string for the page URLs to scrape; %s is replaced by the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the time interval in seconds to delay between requests. If <0, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    """
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    """
    Scrape the content of a single url with comprehensive error handling.
    """
    def scrape_url(self, url):
        try:
            # Make the HTTP request with timeout
            response = requests.get(url, timeout=10)
            
            # Check if the request was successful
            response.raise_for_status()
            
            # Parse the content if successful
            result = self.content_parser(response.content)
            self.output_results(result)
            
        except Timeout:
            print(f"Timeout error occurred while accessing: {url}")
            self.output_results(f"Error: Request timed out for {url}")
            
        except SSLError:
            print(f"SSL error occurred while accessing: {url}")
            self.output_results(f"Error: SSL certificate problem for {url}")
            
        except TooManyRedirects:
            print(f"Too many redirects for: {url}")
            self.output_results(f"Error: Too many redirects for {url}")
            
        except ConnectionError:
            print(f"Connection error occurred while accessing: {url}")
            self.output_results(f"Error: Connection failed for {url}")
            
        except HTTPError as e:
            print(f"HTTP error occurred: {e} for URL: {url}")
            self.output_results(f"Error: HTTP {response.status_code} for {url}")
            
        except RequestException as e:
            print(f"General request exception occurred: {e} for URL: {url}")
            self.output_results(f"Error: Request failed for {url} - {str(e)}")
            
        except Exception as e:
            print(f"Unexpected error occurred: {e} for URL: {url}")
            self.output_results(f"Error: Unexpected error for {url} - {str(e)}")
    
    """
    Export the scraped content. Right now it simply prints out the results.
    But in the future you can export the results into a text file or database.
    """
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    This function uses a FOR loop to call `scrape_url()` for each url to scrape.
    Implements sleep interval between requests if specified.
    """
    def kickstart(self):
        for i in range(1, self.pages_to_scrape + 1):
            url = self.url_pattern % i
            print(f"Scraping: {url}")
            
            self.scrape_url(url)
            
            # Add delay between requests if sleep_interval is positive
            # Only sleep if this is not the last page
            if self.sleep_interval > 0 and i < self.pages_to_scrape:
                print(f"Sleeping for {self.sleep_interval} seconds before next request...")
                time.sleep(self.sleep_interval)


# Updated URL pattern for books.toscrape.com
URL_PATTERN = 'http://books.toscrape.com/catalogue/page-%s.html' # printf-style pattern for the urls to scrape; %s is replaced by the page number
PAGES_TO_SCRAPE = 5 # how many webpages to scrape (books.toscrape.com has 50 pages total)

"""
Custom parser function to extract book information from books.toscrape.com
"""
def books_parser(content):
    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(content, 'html.parser')
    
    # Find all book elements
    book_elements = soup.find_all('article', class_='product_pod')
    
    # Extract information from each book
    books = []
    for book_element in book_elements:
        # Extract book title
        title_element = book_element.find('h3').find('a')
        title = title_element.get('title', '').strip() if title_element else 'No title'
        
        # Extract book price
        price_element = book_element.find('p', class_='price_color')
        price = price_element.get_text().strip() if price_element else 'No price'
        
        # Extract rating
        rating_element = book_element.find('p', class_='star-rating')
        rating = rating_element.get('class')[1] if rating_element and len(rating_element.get('class', [])) > 1 else 'No rating'
        
        # Extract availability
        availability_element = book_element.find('p', class_='instock')
        availability = availability_element.get_text().strip() if availability_element else 'Unknown availability'
        
        # Create book info dictionary
        book_info = {
            'title': title,
            'price': price,
            'rating': rating,
            'availability': availability
        }
        
        books.append(book_info)
    
    return books

# Instantiate the IronhackSpider class for books.toscrape.com
my_spider = IronhackSpider(
    URL_PATTERN, 
    PAGES_TO_SCRAPE, 
    sleep_interval=1,  # 1-second delay between requests
    content_parser=books_parser
)

# Start scraping jobs
print("Starting spider to scrape books.toscrape.com...")
print("=" * 60)
my_spider.kickstart()
print("=" * 60)
print(f"Spider finished! Attempted {PAGES_TO_SCRAPE} pages from books.toscrape.com.")

Starting spider to scrape books.toscrape.com...
Scraping: http://books.toscrape.com/catalogue/page-1.html
[{'title': 'A Light in the Attic', 'price': '£51.77', 'rating': 'Three', 'availability': 'In stock'}, {'title': 'Tipping the Velvet', 'price': '£53.74', 'rating': 'One', 'availability': 'In stock'}, {'title': 'Soumission', 'price': '£50.10', 'rating': 'One', 'availability': 'In stock'}, {'title': 'Sharp Objects', 'price': '£47.82', 'rating': 'Four', 'availability': 'In stock'}, {'title': 'Sapiens: A Brief History of Humankind', 'price': '£54.23', 'rating': 'Five', 'availability': 'In stock'}, {'title': 'The Requiem Red', 'price': '£22.65', 'rating': 'One', 'availability': 'In stock'}, {'title': 'The Dirty Little Secrets of Getting Your Dream Job', 'price': '£33.34', 'rating': 'Four', 'availability': 'In stock'}, {'title': 'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull', 'price': '£17.93', 'rating': 'Three', 'availability': 'In stock'}, {'ti

# Bonus Challenge 1 - Making Your Spider Unblockable

Use techniques such as randomizing user agents and referers in your requests to reduce the likelihood that your spider is blocked by websites. [Here](http://blog.adnansiddiqi.me/5-strategies-to-write-unblock-able-web-scrapers-in-python/) is a great article to learn these techniques.
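
A minimal sketch of the header-randomization idea, assuming a hypothetical helper `pick_headers` and short placeholder lists of user agents and referers (none of these names come from the linked article). It only builds the header dictionary; passing it as `headers=` to `requests.get` is left to the spider.

```python
import random

# Placeholder values for illustration; a real spider would keep a longer, current list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
]
REFERERS = ["https://www.google.com/", "https://www.bing.com/"]

def pick_headers(rng=random):
    """Return a headers dict with a randomly chosen User-Agent and Referer."""
    return {
        "User-Agent": rng.choice(USER_AGENTS),
        "Referer": rng.choice(REFERERS),
    }

headers = pick_headers()
print(headers["User-Agent"])
```

Calling `pick_headers()` once per request, rather than once per spider, is what makes consecutive requests look different from each other.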

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.

In [8]:
# your code here
import requests
from bs4 import BeautifulSoup
import time
import random
from requests.exceptions import RequestException, Timeout, SSLError, TooManyRedirects, ConnectionError, HTTPError

class IronhackSpider:
    """
    This is the constructor class to which you can pass a bunch of parameters. 
    These parameters are stored to the class instance variables so that the
    class functions can access them later.
    
    url_pattern: a printf-style pattern for the web urls to scrape; `%s` is replaced by the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the time interval in seconds to delay between requests. If <=0, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    """
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
        
        # List of common referers to randomize
        self.referers = [
            'https://www.google.com/',
            'https://www.bing.com/',
            'https://www.yahoo.com/',
            'https://www.duckduckgo.com/',
            'https://www.reddit.com/',
            'https://www.wikipedia.org/',
            'https://www.amazon.com/',
            'https://www.ebay.com/',
            'https://www.facebook.com/',
            'https://www.twitter.com/'
        ]
        
        # List of user agents to randomize (various browsers and devices)
        self.user_agents = [
            # Chrome
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            
            # Firefox
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0',
            'Mozilla/5.0 (X11; Linux i686; rv:89.0) Gecko/20100101 Firefox/89.0',
            
            # Safari
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15',
            'Mozilla/5.0 (iPhone; CPU iPhone OS 14_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1',
            
            # Edge
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.59',
            
            # Opera
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 OPR/77.0.4054.203',
            
            # Mobile devices
            'Mozilla/5.0 (iPhone; CPU iPhone OS 14_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/91.0.4472.80 Mobile/15E148 Safari/604.1',
            'Mozilla/5.0 (Linux; Android 10; SM-G981B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Mobile Safari/537.36'
        ]
    
    """
    Get random headers for each request to avoid detection
    """
    def get_random_headers(self):
        return {
            'User-Agent': random.choice(self.user_agents),
            'Referer': random.choice(self.referers),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'Cache-Control': 'max-age=0'
        }
    
    """
    Scrape the content of a single url with comprehensive error handling and anti-blocking measures.
    """
    def scrape_url(self, url):
        try:
            # Get random headers for this request
            headers = self.get_random_headers()
            
            # Add random delay between 1-3 seconds before making the request
            random_delay = random.uniform(1, 3)
            time.sleep(random_delay)
            
            # Make the HTTP request with timeout and random headers
            response = requests.get(url, headers=headers, timeout=15)
            
            # Check if the request was successful
            response.raise_for_status()
            
            # Parse the content if successful
            result = self.content_parser(response.content)
            self.output_results(result)
            
        except Timeout:
            print(f"Timeout error occurred while accessing: {url}")
            self.output_results(f"Error: Request timed out for {url}")
            
        except SSLError:
            print(f"SSL error occurred while accessing: {url}")
            self.output_results(f"Error: SSL certificate problem for {url}")
            
        except TooManyRedirects:
            print(f"Too many redirects for: {url}")
            self.output_results(f"Error: Too many redirects for {url}")
            
        except ConnectionError:
            print(f"Connection error occurred while accessing: {url}")
            self.output_results(f"Error: Connection failed for {url}")
            
        except HTTPError as e:
            print(f"HTTP error occurred: {e} for URL: {url}")
            # Read the status code from the exception itself; the response may be absent
            status = e.response.status_code if e.response is not None else 'Unknown'
            self.output_results(f"Error: HTTP {status} for {url}")
            
        except RequestException as e:
            print(f"General request exception occurred: {e} for URL: {url}")
            self.output_results(f"Error: Request failed for {url} - {str(e)}")
            
        except Exception as e:
            print(f"Unexpected error occurred: {e} for URL: {url}")
            self.output_results(f"Error: Unexpected error for {url} - {str(e)}")
    
    """
    Export the scraped content. Right now it simply prints out the results.
    But in the future you can export the results to a text file or database.
    """
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    This function uses a FOR loop to call `scrape_url()` for each url to scrape.
    Implements sleep interval between requests if specified.
    """
    def kickstart(self):
        for i in range(1, self.pages_to_scrape + 1):
            url = self.url_pattern % i
            print(f"Scraping: {url}")
            print("Using random headers for this request...")
            
            self.scrape_url(url)
            
            # Add additional delay between requests if sleep_interval is positive
            # Only sleep if this is not the last page
            if self.sleep_interval > 0 and i < self.pages_to_scrape:
                print(f"Sleeping for {self.sleep_interval} seconds before next request...")
                time.sleep(self.sleep_interval)


# URL pattern for books.toscrape.com
URL_PATTERN = 'http://books.toscrape.com/catalogue/page-%s.html'
PAGES_TO_SCRAPE = 3  # Reduced for demonstration

"""
Custom parser function to extract book information from books.toscrape.com
"""
def books_parser(content):
    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(content, 'html.parser')
    
    # Find all book elements
    book_elements = soup.find_all('article', class_='product_pod')
    
    # Extract information from each book
    books = []
    for book_element in book_elements:
        # Extract book title
        title_element = book_element.find('h3').find('a')
        title = title_element.get('title', '').strip() if title_element else 'No title'
        
        # Extract book price
        price_element = book_element.find('p', class_='price_color')
        price = price_element.get_text().strip() if price_element else 'No price'
        
        # Extract rating
        rating_element = book_element.find('p', class_='star-rating')
        rating = rating_element.get('class')[1] if rating_element and len(rating_element.get('class', [])) > 1 else 'No rating'
        
        # Extract availability
        availability_element = book_element.find('p', class_='instock')
        availability = availability_element.get_text().strip() if availability_element else 'Unknown availability'
        
        # Create book info dictionary
        book_info = {
            'title': title,
            'price': price,
            'rating': rating,
            'availability': availability
        }
        
        books.append(book_info)
    
    return books

# Instantiate the IronhackSpider class for books.toscrape.com
my_spider = IronhackSpider(
    URL_PATTERN, 
    PAGES_TO_SCRAPE, 
    sleep_interval=random.randint(2, 5),  # Sleep interval in seconds, picked once at random from 2-5 and reused for every page
    content_parser=books_parser
)

# Start scraping jobs
print("Starting spider with anti-blocking techniques...")
print("=" * 70)
print("Features enabled: Random User-Agents, Random Referers, Random Delays")
print("=" * 70)
my_spider.kickstart()
print("=" * 70)
print(f"Spider finished! Attempted {PAGES_TO_SCRAPE} pages with anti-blocking measures.")

Starting spider with anti-blocking techniques...
Features enabled: Random User-Agents, Random Referers, Random Delays
Scraping: http://books.toscrape.com/catalogue/page-1.html
Using random headers for this request...
[{'title': 'A Light in the Attic', 'price': '£51.77', 'rating': 'Three', 'availability': 'In stock'}, {'title': 'Tipping the Velvet', 'price': '£53.74', 'rating': 'One', 'availability': 'In stock'}, {'title': 'Soumission', 'price': '£50.10', 'rating': 'One', 'availability': 'In stock'}, {'title': 'Sharp Objects', 'price': '£47.82', 'rating': 'Four', 'availability': 'In stock'}, {'title': 'Sapiens: A Brief History of Humankind', 'price': '£54.23', 'rating': 'Five', 'availability': 'In stock'}, {'title': 'The Requiem Red', 'price': '£22.65', 'rating': 'One', 'availability': 'In stock'}, {'title': 'The Dirty Little Secrets of Getting Your Dream Job', 'price': '£33.34', 'rating': 'Four', 'availability': 'In stock'}, {'title': 'The Coming Woman: A Novel Based on the Life of the

# Bonus Challenge 2 - Making Asynchronous Calls

Implement asynchronous calls to `IronhackSpider`. You will make requests in parallel to complete your tasks faster.
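
One way to run requests in parallel is a thread pool. The sketch below assumes a hypothetical `fake_fetch` function that sleeps instead of calling `requests.get`, so it runs without network access; in the spider, the worker would be the method that scrapes one URL. The names `fake_fetch` and `fetch_all` are illustrative only.

```python
import time
from concurrent.futures import ThreadPoolExecutor

URL_PATTERN = "http://books.toscrape.com/catalogue/page-%s.html"

def fake_fetch(url):
    """Stand-in for a network call: wait briefly, then return the url."""
    time.sleep(0.2)
    return url

def fetch_all(urls, max_workers=4):
    """Run fake_fetch over urls in worker threads; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fake_fetch, urls))

urls = [URL_PATTERN % i for i in range(1, 5)]
start = time.time()
pages = fetch_all(urls)
elapsed = time.time() - start
print(f"Fetched {len(pages)} pages in {elapsed:.2f} seconds")
```

With four workers the four waits overlap, so the total should come out well under the roughly 0.8 seconds a sequential loop would need.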

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.

In [None]:
import requests
from bs4 import BeautifulSoup
import time
import random
import concurrent.futures
from requests.exceptions import RequestException, Timeout, SSLError, TooManyRedirects, ConnectionError, HTTPError
from typing import List, Dict, Any
import logging

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

class IronhackSpider:
    """
    Asynchronous web spider with anti-blocking techniques and parallel requests using ThreadPoolExecutor.
    """
    def __init__(self, url_pattern: str, pages_to_scrape: int = 10, max_workers: int = 5, 
                 content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.max_workers = max_workers  # Maximum worker threads
        self.content_parser = content_parser
        
        # List of common referers to randomize
        self.referers = [
            'https://www.google.com/',
            'https://www.bing.com/',
            'https://www.yahoo.com/',
            'https://www.duckduckgo.com/',
            'https://www.reddit.com/',
            'https://www.wikipedia.org/',
            'https://www.amazon.com/',
            'https://www.ebay.com/',
            'https://www.facebook.com/',
            'https://www.twitter.com/'
        ]
        
        # List of user agents to randomize
        self.user_agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15',
        ]
    
    """
    Get random headers for each request to avoid detection
    """
    def get_random_headers(self) -> Dict[str, str]:
        return {
            'User-Agent': random.choice(self.user_agents),
            'Referer': random.choice(self.referers),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'Cache-Control': 'max-age=0'
        }
    
    """
    Scrape a single URL. Each call runs in a worker thread of the thread pool.
    """
    def scrape_single_url(self, url: str) -> Any:
        try:
            # Get random headers
            headers = self.get_random_headers()
            
            # Add random delay between requests
            time.sleep(random.uniform(0.5, 2.0))
            
            logger.info(f"Scraping: {url}")
            
            # Make the HTTP request with timeout
            response = requests.get(url, headers=headers, timeout=15)
            
            # Check if the request was successful
            response.raise_for_status()
            
            # Parse the content if successful
            result = self.content_parser(response.content)
            self.output_results(result)
            return {'url': url, 'success': True, 'data': result}
            
        except Timeout:
            logger.error(f"Timeout error for {url}")
            self.output_results(f"Error: Request timed out for {url}")
            return {'url': url, 'success': False, 'error': 'Timeout'}
            
        except (SSLError, ConnectionError) as e:
            logger.error(f"Connection error for {url}: {e}")
            self.output_results(f"Error: Connection failed for {url}")
            return {'url': url, 'success': False, 'error': f'Connection: {str(e)}'}
            
        except HTTPError as e:
            logger.error(f"HTTP error {response.status_code} for {url}")
            self.output_results(f"Error: HTTP {response.status_code} for {url}")
            return {'url': url, 'success': False, 'error': f'HTTP {response.status_code}'}
            
        except RequestException as e:
            logger.error(f"Request exception for {url}: {e}")
            self.output_results(f"Error: Request failed for {url} - {str(e)}")
            return {'url': url, 'success': False, 'error': f'Request: {str(e)}'}
            
        except Exception as e:
            logger.error(f"Unexpected error for {url}: {e}")
            self.output_results(f"Error: Unexpected error for {url} - {str(e)}")
            return {'url': url, 'success': False, 'error': f'Unexpected: {str(e)}'}
    
    """
    Export the scraped content
    """
    def output_results(self, r):
        print(r)
    
    """
    Main method to kickstart the spider with parallel requests using ThreadPoolExecutor
    """
    def kickstart(self):
        # Create list of all URLs to scrape
        urls = [self.url_pattern % i for i in range(1, self.pages_to_scrape + 1)]
        
        logger.info(f"Starting parallel spider with {self.max_workers} worker threads")
        logger.info(f"Total pages to scrape: {len(urls)}")
        
        start_time = time.time()
        
        # Use ThreadPoolExecutor for parallel execution
        with concurrent.futures.ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            # Submit all tasks to the executor
            future_to_url = {executor.submit(self.scrape_single_url, url): url for url in urls}
            
            results = []
            successful = 0
            failed = 0
            
            # Process completed tasks as they finish
            for future in concurrent.futures.as_completed(future_to_url):
                url = future_to_url[future]
                try:
                    result = future.result()
                    results.append(result)
                    if result['success']:
                        successful += 1
                    else:
                        failed += 1
                except Exception as e:
                    logger.error(f"Exception processing {url}: {e}")
                    results.append({'url': url, 'success': False, 'error': f'Processing: {str(e)}'})
                    failed += 1
        
        end_time = time.time()
        
        # Calculate statistics
        logger.info(f"Spider completed in {end_time - start_time:.2f} seconds")
        logger.info(f"Successful: {successful}, Failed: {failed}")
        
        # Print summary
        if failed > 0:
            logger.warning("Failed URLs:")
            for result in results:
                if not result['success']:
                    logger.warning(f"  {result['url']}: {result.get('error', 'Unknown error')}")
        
        return results


# URL pattern for books.toscrape.com
URL_PATTERN = 'http://books.toscrape.com/catalogue/page-%s.html'
PAGES_TO_SCRAPE = 10  # Number of pages to scrape

"""
Custom parser function to extract book information from books.toscrape.com
"""
def books_parser(content):
    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(content, 'html.parser')
    
    # Find all book elements
    book_elements = soup.find_all('article', class_='product_pod')
    
    # Extract information from each book
    books = []
    for book_element in book_elements:
        # Extract book title
        title_element = book_element.find('h3').find('a')
        title = title_element.get('title', '').strip() if title_element else 'No title'
        
        # Extract book price
        price_element = book_element.find('p', class_='price_color')
        price = price_element.get_text().strip() if price_element else 'No price'
        
        # Extract rating
        rating_element = book_element.find('p', class_='star-rating')
        rating = rating_element.get('class')[1] if rating_element and len(rating_element.get('class', [])) > 1 else 'No rating'
        
        # Extract availability
        availability_element = book_element.find('p', class_='instock')
        availability = availability_element.get_text().strip() if availability_element else 'Unknown availability'
        
        # Create book info dictionary
        book_info = {
            'title': title,
            'price': price,
            'rating': rating,
            'availability': availability
        }
        
        books.append(book_info)
    
    return books

# Instantiate the IronhackSpider class with parallel capabilities
my_spider = IronhackSpider(
    URL_PATTERN, 
    PAGES_TO_SCRAPE, 
    max_workers=4,  # Maximum 4 worker threads
    content_parser=books_parser
)

# Start scraping jobs
print("=" * 80)
print("Starting PARALLEL spider with ThreadPoolExecutor...")
print("Features: Parallel requests, Random User-Agents, Random Referers, Random Delays")
print("=" * 80)

start_total_time = time.time()
results = my_spider.kickstart()
end_total_time = time.time()

print("=" * 80)
print(f"Total execution time: {end_total_time - start_total_time:.2f} seconds")
print("Parallel spider completed!")
print("=" * 80)

# Print final summary
successful_count = sum(1 for result in results if result['success'])
failed_count = len(results) - successful_count

print("Final Summary:")
print(f"Total URLs: {len(results)}")
print(f"Successful: {successful_count}")
print(f"Failed: {failed_count}")
print("=" * 80)

INFO:__main__:Starting parallel spider with 4 worker threads
INFO:__main__:Total pages to scrape: 10


Starting PARALLEL spider with ThreadPoolExecutor...
Features: Parallel requests, Random User-Agents, Random Referers, Random Delays


INFO:__main__:Scraping: http://books.toscrape.com/catalogue/page-1.html
INFO:__main__:Scraping: http://books.toscrape.com/catalogue/page-3.html
INFO:__main__:Scraping: http://books.toscrape.com/catalogue/page-4.html
INFO:__main__:Scraping: http://books.toscrape.com/catalogue/page-2.html


[{'title': 'A Light in the Attic', 'price': '£51.77', 'rating': 'Three', 'availability': 'In stock'}, {'title': 'Tipping the Velvet', 'price': '£53.74', 'rating': 'One', 'availability': 'In stock'}, {'title': 'Soumission', 'price': '£50.10', 'rating': 'One', 'availability': 'In stock'}, {'title': 'Sharp Objects', 'price': '£47.82', 'rating': 'Four', 'availability': 'In stock'}, {'title': 'Sapiens: A Brief History of Humankind', 'price': '£54.23', 'rating': 'Five', 'availability': 'In stock'}, {'title': 'The Requiem Red', 'price': '£22.65', 'rating': 'One', 'availability': 'In stock'}, {'title': 'The Dirty Little Secrets of Getting Your Dream Job', 'price': '£33.34', 'rating': 'Four', 'availability': 'In stock'}, {'title': 'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull', 'price': '£17.93', 'rating': 'Three', 'availability': 'In stock'}, {'title': 'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics', 'p

INFO:__main__:Scraping: http://books.toscrape.com/catalogue/page-5.html
INFO:__main__:Scraping: http://books.toscrape.com/catalogue/page-7.html


[{'title': 'Princess Jellyfish 2-in-1 Omnibus, Vol. 01 (Princess Jellyfish 2-in-1 Omnibus #1)', 'price': '£13.61', 'rating': 'Five', 'availability': 'In stock'}, {'title': 'Princess Between Worlds (Wide-Awake Princess #5)', 'price': '£13.34', 'rating': 'Five', 'availability': 'In stock'}, {'title': 'Pop Gun War, Volume 1: Gift', 'price': '£18.97', 'rating': 'One', 'availability': 'In stock'}, {'title': 'Political Suicide: Missteps, Peccadilloes, Bad Calls, Backroom Hijinx, Sordid Pasts, Rotten Breaks, and Just Plain Dumb Mistakes in the Annals of American Politics', 'price': '£36.28', 'rating': 'Two', 'availability': 'In stock'}, {'title': 'Patience', 'price': '£10.16', 'rating': 'Three', 'availability': 'In stock'}, {'title': 'Outcast, Vol. 1: A Darkness Surrounds Him (Outcast #1)', 'price': '£15.44', 'rating': 'Four', 'availability': 'In stock'}, {'title': 'orange: The Complete Collection 1 (orange: The Complete Collection #1)', 'price': '£48.41', 'rating': 'One', 'availability': 'In

INFO:__main__:Scraping: http://books.toscrape.com/catalogue/page-6.html
INFO:__main__:Scraping: http://books.toscrape.com/catalogue/page-8.html
INFO:__main__:Scraping: http://books.toscrape.com/catalogue/page-9.html
INFO:__main__:Scraping: http://books.toscrape.com/catalogue/page-10.html


[{'title': 'Immunity: How Elie Metchnikoff Changed the Course of Modern Medicine', 'price': '£57.36', 'rating': 'Five', 'availability': 'In stock'}, {'title': 'I Hate Fairyland, Vol. 1: Madly Ever After (I Hate Fairyland (Compilations) #1-5)', 'price': '£29.17', 'rating': 'Two', 'availability': 'In stock'}, {'title': 'I am a Hero Omnibus Volume 1', 'price': '£54.63', 'rating': 'Three', 'availability': 'In stock'}, {'title': 'How to Be Miserable: 40 Strategies You Already Use', 'price': '£46.03', 'rating': 'One', 'availability': 'In stock'}, {'title': 'Her Backup Boyfriend (The Sorensen Family #1)', 'price': '£33.97', 'rating': 'One', 'availability': 'In stock'}, {'title': 'Giant Days, Vol. 2 (Giant Days #5-8)', 'price': '£22.11', 'rating': 'Two', 'availability': 'In stock'}, {'title': 'Forever and Forever: The Courtship of Henry Longfellow and Fanny Appleton', 'price': '£29.69', 'rating': 'Three', 'availability': 'In stock'}, {'title': 'First and First (Five Boroughs #3)', 'price': '£1

INFO:__main__:Spider completed in 5.11 seconds
INFO:__main__:Successful: 10, Failed: 0


[{'title': "The Bridge to Consciousness: I'm Writing the Bridge Between Science and Our Old and New Beliefs.", 'price': '£32.00', 'rating': 'Three', 'availability': 'In stock'}, {'title': "The Artist's Way: A Spiritual Path to Higher Creativity", 'price': '£38.49', 'rating': 'Five', 'availability': 'In stock'}, {'title': 'The Art of War', 'price': '£33.34', 'rating': 'Five', 'availability': 'In stock'}, {'title': 'The Argonauts', 'price': '£10.93', 'rating': 'Two', 'availability': 'In stock'}, {'title': 'The 10% Entrepreneur: Live Your Startup Dream Without Quitting Your Day Job', 'price': '£27.55', 'rating': 'Three', 'availability': 'In stock'}, {'title': 'Suddenly in Love (Lake Haven #1)', 'price': '£55.99', 'rating': 'Two', 'availability': 'In stock'}, {'title': 'Something More Than This', 'price': '£16.24', 'rating': 'Four', 'availability': 'In stock'}, {'title': 'Soft Apocalypse', 'price': '£26.12', 'rating': 'Two', 'availability': 'In stock'}, {'title': "So You've Been Publicly S