# Advanced Web Scraping Lab

In this lab you will first study the code snippet below, a simple web spider class that lets you scrape paginated webpages. Read the code, run it, and make sure you understand how it works. In the challenges of this lab, we will guide you in building up this class so that you eventually have a more robust web spider that you can develop further in the Web Scraping Project.

In [None]:
import requests
from bs4 import BeautifulSoup

class IronhackSpider:
    """
    A simple web spider. The constructor takes the parameters below and stores
    them in instance variables so that the class methods can access them later.
    
    url_pattern: the format string of the web urls to scrape; `%s` is replaced by the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the time interval in seconds to delay between requests. If <= 0, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    """
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    
    class AccessError(Exception):
        """Raised when the server responds with a 4xx status code."""
        pass

    """
    Scrape the content of a single url.
    """
    def scrape_url(self, url):
        try:
            response = requests.get(url)
            # Check the status code first so that error pages are never parsed.
            if 400 <= response.status_code < 500:
                raise self.AccessError(response.status_code)
            result = self.content_parser(response.content)
            self.output_results(result)
        except requests.exceptions.Timeout as err:
            print("Page does not answer, TIMEOUT: {}".format(err))
        except requests.exceptions.SSLError as err:
            print("Security problem with SSL: {}".format(err))
        except requests.exceptions.TooManyRedirects as err:
            print("Too many redirects: {}".format(err))
        except self.AccessError as err:
            # Catch the exception class; the status code is carried by the instance.
            print("Access error, status code {}".format(err))
        except requests.exceptions.RequestException as e:
            # catastrophic error. bail.
            raise SystemExit(e)
    
    """
    Export the scraped content. Right now it simply prints out the results.
    But in the future you can export the results into a text file or database.
    """
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    This function uses a FOR loop to call `scrape_url()` for each url to scrape.
    """
    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            self.scrape_url(self.url_pattern % i)


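URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # format string for the urls to scrape; %s is the page number
PAGES_TO_SCRAPE = 1 # how many webpages to scrape

def quotes_parser(content):
    # Placeholder parser: returns the content unchanged (completed in Challenge 1).
    return content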



# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=quotes_parser)

# Start scraping jobs
my_spider.kickstart()

## Challenge 1 - Custom Parser Function

In this challenge, complete the custom `quotes_parser()` function so that the returned result contains the quote string instead of the whole html page content.

In the cell below, write your updated `quotes_parser()` function and kickstart the spider. Make sure the results being printed contain a list of quote strings extracted from the html content.

In [None]:
"""
This is the custom parser function completed for this challenge.
It extracts the quote strings from the html content and returns them as a list.
This function will be passed to the IronhackSpider class.
"""
def quotes_parser(content):
    soup = BeautifulSoup(content, "lxml")
    return [quote.text for quote in soup.find_all('span',{'class':'text'})]

# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=quotes_parser)

my_spider.kickstart()

## Challenge 2 - Error Handling

In `IronhackSpider.scrape_url()`, catch any error that might occur when you make requests to scrape the webpage. This includes checking the response status code and catching http request errors such as timeout, SSL, and too many redirects.
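
One possible way to structure the status code check is to raise a custom exception for 4xx responses and catch it by class. The sketch below is a minimal illustration that makes no network requests; `check_status` and `describe` are hypothetical helpers, not part of the lab's starter code.

```python
class AccessError(Exception):
    """Raised when the server answers with a 4xx (client error) status code."""


def check_status(status_code):
    """Hypothetical helper: raise AccessError for 4xx codes, else return the code."""
    if 400 <= status_code < 500:
        raise AccessError(status_code)
    return status_code


def describe(status_code):
    """Return a short message instead of letting the exception propagate."""
    try:
        check_status(status_code)
    except AccessError as err:  # catch the class, not an instance
        return "access error: {}".format(err)
    return "ok"


print(describe(200))
print(describe(404))
```

In the spider, the same check would presumably run on `response.status_code` right after the request and before the content is parsed.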

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.

In [None]:
import requests
from bs4 import BeautifulSoup
import time

class AccessError(Exception):
    """Raised when the server responds with a 4xx status code."""
    pass
    
class IronhackSpider:
    """
    A simple web spider. The constructor takes the parameters below and stores
    them in instance variables so that the class methods can access them later.
    
    url_pattern: the format string of the web urls to scrape; `%s` is replaced by the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the time interval in seconds to delay between requests. If <= 0, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    """
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    


    """
    Scrape the content of a single url.
    """
    def scrape_url(self, url):
        try:
            response = requests.get(url)
            # Check the status code first so that error pages are never parsed.
            if 400 <= response.status_code < 500:
                raise AccessError(response.status_code)
            result = self.content_parser(response.content)
            self.output_results(result)
        except requests.exceptions.Timeout as err:
            print("Page does not answer, TIMEOUT: {}".format(err))
        except requests.exceptions.SSLError as err:
            print("Security problem with SSL: {}".format(err))
        except requests.exceptions.TooManyRedirects as err:
            print("Too many redirects: {}".format(err))
        except AccessError as err:
            # Catch the exception class; the status code is carried by the instance.
            print("Access error, status code {}".format(err))
        except requests.exceptions.RequestException as e:
            # catastrophic error. bail.
            raise SystemExit(e)
    
    """
    Export the scraped content. Right now it simply prints out the results.
    But in the future you can export the results into a text file or database.
    """
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    This function uses a FOR loop to call `scrape_url()` for each url to scrape.
    """
    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            self.scrape_url(self.url_pattern % i)
            if self.sleep_interval > 0:
                time.sleep(self.sleep_interval)


URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # format string for the urls to scrape; %s is the page number
PAGES_TO_SCRAPE = 1 # how many webpages to scrape



# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=quotes_parser)

# Start scraping jobs
my_spider.kickstart()





## Challenge 3 - Sleep Interval

In `IronhackSpider.kickstart()`, implement `sleep_interval`. You will check if `self.sleep_interval` is larger than 0. If so, tell the FOR loop to sleep the given amount of time before making the next request.
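
The delay logic can be sketched on its own, without any network access. In this minimal example `visit_pages`, `handle`, and `sleep` are hypothetical names chosen for illustration; the `sleep` argument defaults to `time.sleep` and is only a parameter so that the pause can be swapped out when testing.

```python
import time


def visit_pages(urls, handle, sleep_interval=-1, sleep=time.sleep):
    """Call handle(url) for each url, pausing between consecutive requests."""
    for position, url in enumerate(urls):
        # Pause before every request except the first one.
        if position > 0 and sleep_interval > 0:
            sleep(sleep_interval)
        handle(url)


visited = []
visit_pages(["page-1", "page-2", "page-3"], visited.append, sleep_interval=0.01)
print(visited)
```

Sleeping before each request after the first, rather than after every request, avoids a pointless pause once the last page has been scraped. Either placement satisfies the challenge.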

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.

In [None]:
# your code here

## Challenge 4 - Test Batch Scraping

Change the `PAGES_TO_SCRAPE` value from `1` to `10`. Check whether your code still works as intended when scraping 10 webpages. If there are errors in your code, fix them.
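
Before running the batch, it can help to confirm which urls the spider will actually request. The sketch below only expands the format string, mirroring the `range(1, pages_to_scrape + 1)` loop in `kickstart()`; it makes no requests.

```python
URL_PATTERN = 'http://quotes.toscrape.com/page/%s/'
PAGES_TO_SCRAPE = 10

# range(1, n + 1) yields 1..n, matching the site's 1-based page numbers.
urls = [URL_PATTERN % page for page in range(1, PAGES_TO_SCRAPE + 1)]

print(len(urls))
print(urls[0])
print(urls[-1])
```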

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.

In [None]:
PAGES_TO_SCRAPE = 10 # how many webpages to scrape

# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=quotes_parser)

# Start scraping jobs
my_spider.kickstart()

def quotes_parser(content):
    soup = BeautifulSoup(content, "lxml")
    return [quote.text for quote in soup.find_all('span',{'class':'text'})]

## Challenge 5 - Scrape a Different Website

Update the parameters passed to the `IronhackSpider` constructor so that your code can crawl [books.toscrape.com](http://books.toscrape.com/). You will need to use a different `URL_PATTERN` (figure out the new url pattern by yourself) and write another parser function to be passed to `IronhackSpider`.

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.

In [None]:
def quotes_parser(content):
    soup = BeautifulSoup(content, "lxml")
    return [quote.text for quote in soup.find_all('span',{'class':'text'})]

In [None]:
def books_parser(content):
    soup = BeautifulSoup(content, "lxml")
    return [book.find('a').find('img').get('alt') for book in soup.find('ol',{'class':'row'}).find_all('li')]

In [None]:
url = 'http://books.toscrape.com/catalogue/page-1.html'
html = requests.get(url).content
soup = BeautifulSoup(html, "lxml")

In [None]:
book_list = soup.find('ol',{'class':'row'}).find_all('li')

In [None]:
book_list[0].find('a').find('img').get('alt')

In [None]:
URL_PATTERN = 'http://books.toscrape.com/catalogue/page-%s.html' # format string for the urls to scrape; %s is the page number
PAGES_TO_SCRAPE = 10 # how many webpages to scrape


# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=books_parser)

# Start scraping jobs
my_spider.kickstart()

## Bonus Challenge 1 - Making Your Spider Unblockable

Use techniques such as randomizing user agents and referers in your requests to reduce the likelihood that your spider is blocked by websites. [Here](http://blog.adnansiddiqi.me/5-strategies-to-write-unblock-able-web-scrapers-in-python/) is a great article to learn these techniques.
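
As a rough illustration of the idea, the sketch below builds a headers dictionary with a randomly chosen user agent and referer. The `USER_AGENTS` and `REFERERS` lists and the `random_headers` helper are hypothetical names with sample values chosen for this example; a real spider would likely load a longer, up-to-date list from a file.

```python
import random

# Hypothetical sample values for illustration only.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
]
REFERERS = ['https://google.com', 'https://duckduckgo.com']


def random_headers(rng=random):
    """Build a headers dict with a randomly chosen User-Agent and Referer."""
    return {
        'User-Agent': rng.choice(USER_AGENTS),
        # The HTTP header name is spelled "Referer".
        'Referer': rng.choice(REFERERS),
    }


headers = random_headers()
print(headers)
```

The resulting dictionary can then be passed to the request, for example `requests.get(url, headers=headers)`, so that each call presents a different combination.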

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.

In [None]:
import requests
from bs4 import BeautifulSoup
import time
import numpy as np

def get_random_ua():
    random_ua = ''
    ua_file = 'ua_file.txt'
    try:
        with open(ua_file) as f:
            lines = f.readlines()
        if len(lines) > 0:
            # randint's upper bound is exclusive, so every line can be chosen.
            idx = np.random.randint(len(lines))
            random_ua = lines[idx].strip()
    except Exception as ex:
        print('Exception in random_ua')
        print(str(ex))
    # Return after the try block; a `return` inside `finally` would swallow exceptions.
    return random_ua
    
class AccessError(Exception):
    """Raised when the server responds with a 4xx status code."""
    pass
    
class IronhackSpider:
    """
    A simple web spider. The constructor takes the parameters below and stores
    them in instance variables so that the class methods can access them later.
    
    url_pattern: the format string of the web urls to scrape; `%s` is replaced by the page number
    pages_to_scrape: how many pages to scrape
    sleep_interval: the time interval in seconds to delay between requests. If <= 0, requests will not be delayed.
    content_parser: a function reference that will extract the intended info from the scraped content.
    """
    def __init__(self, url_pattern, pages_to_scrape=10, sleep_interval=-1, content_parser=None):
        self.url_pattern = url_pattern
        self.pages_to_scrape = pages_to_scrape
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    


    """
    Scrape the content of a single url.
    """
    def scrape_url(self, url):
        try:
            # Send a randomly chosen user agent with every request.
            user_agent = get_random_ua()
            headers = {'user-agent': user_agent}
            response = requests.get(url, headers=headers)
            # Check the status code first so that error pages are never parsed.
            if 400 <= response.status_code < 500:
                raise AccessError(response.status_code)
            result = self.content_parser(response.content)
            self.output_results(result)
        except requests.exceptions.Timeout as err:
            print("Page does not answer, TIMEOUT: {}".format(err))
        except requests.exceptions.SSLError as err:
            print("Security problem with SSL: {}".format(err))
        except requests.exceptions.TooManyRedirects as err:
            print("Too many redirects: {}".format(err))
        except AccessError as err:
            # Catch the exception class; the status code is carried by the instance.
            print("Access error, status code {}".format(err))
        except requests.exceptions.RequestException as e:
            # catastrophic error. bail.
            raise SystemExit(e)
    
    """
    Export the scraped content. Right now it simply prints out the results.
    But in the future you can export the results into a text file or database.
    """
    def output_results(self, r):
        print(r)
    
    """
    After the class is instantiated, call this function to start the scraping jobs.
    This function uses a FOR loop to call `scrape_url()` for each url to scrape.
    """
    def kickstart(self):
        for i in range(1, self.pages_to_scrape+1):
            self.scrape_url(self.url_pattern % i)
            if self.sleep_interval > 0:
                time.sleep(self.sleep_interval)


URL_PATTERN = 'http://quotes.toscrape.com/page/%s/' # format string for the urls to scrape; %s is the page number
PAGES_TO_SCRAPE = 1 # how many webpages to scrape



# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_PATTERN, PAGES_TO_SCRAPE, content_parser=quotes_parser)

# Start scraping jobs
my_spider.kickstart()


In [None]:

"""	
headers = {
        'user-agent': user_agent,
        'referer':referer
    }"""

"""r = requests.get('example.com',headers=headers,proxies={'https': proxy_url})


proxy = get_random_proxy().replace('\n', '')
        service_args = [
            '--proxy={0}'.format(proxy),
            '--proxy-type=http',
            '--proxy-auth=user:path'
        ]
        print('Processing..' + url)
        driver = webdriver.PhantomJS(service_args=service_args)"""

"""headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
        'referrer': 'https://google.com',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.9',
        'Pragma': 'no-cache',
    }"""

## Bonus Challenge 2 - Making Asynchronous Calls

Implement asynchronous calls to `IronhackSpider`. You will make requests in parallel to complete your tasks faster.
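
To see how concurrent requests are coordinated without touching the network, the following sketch replaces the real request with a hypothetical `fake_fetch` coroutine that just waits briefly. `fetch_all` is likewise an illustrative name; the only library call relied on is `asyncio.gather`, which runs the coroutines concurrently and returns their results in input order.

```python
import asyncio


async def fake_fetch(url, delay=0.05):
    """Stand-in for a network request: wait briefly, then return fake content."""
    await asyncio.sleep(delay)
    return "content of {}".format(url)


async def fetch_all(urls):
    """Start every fetch at once and wait for all of them to finish."""
    # gather() returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*(fake_fetch(url) for url in urls))


urls = ["page-1", "page-2", "page-3"]
# asyncio.run() is for plain Python scripts; inside Jupyter, where an event
# loop is already running, use `results = await fetch_all(urls)` instead.
results = asyncio.run(fetch_all(urls))
print(results)
```

Because the three waits overlap, the total run time is close to one delay rather than three, which is the speed-up this challenge is after.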

In the cell below, place your entire code including the updated `IronhackSpider` class and the code to kickstart the spider.

In [160]:
import aiohttp
import asyncio
from bs4 import BeautifulSoup
import requests
import time
import numpy as np

url_list = ["http://quotes.toscrape.com/page/1/", "http://quotes.toscrape.com/page/2/"]


async def fetch(client, url):
    async with client.get(url) as resp:
        # Raise aiohttp.ClientResponseError for 4xx/5xx responses.
        resp.raise_for_status()
        return await resp.text()
    
async def fetchAll(url_list):
    # Created inside a coroutine, the session uses the running event loop on its own.
    async with aiohttp.ClientSession() as client:
        results = await asyncio.gather(*[fetch(client, url) for url in url_list], return_exceptions=True)
        return results

def cleanAll(parsedData):
    for data in parsedData:
        # With return_exceptions=True, failed requests come back as exception objects.
        if isinstance(data, Exception):
            print("Request failed: {!r}".format(data))
        else:
            print(quotes_parser(data))
    
def quotes_parser(content):
    soup = BeautifulSoup(content, "lxml")
    return [quote.text for quote in soup.find_all('span',{'class':'text'})]


# Top-level await works in Jupyter; in a plain script use asyncio.run(fetchAll(url_list)).
cleanAll(await fetchAll(url_list))

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", '“Try not to become a man of success. Rather become a man of value.”', '“It is better to be hated for what you are than to be loved for what you are not.”', "“I have not failed. I've just found 10,000 ways that won't work.”", "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", '“A day without sunshine is like, you know, night.”']
["“This life is what you make it. No matter what, you're going t

In [215]:
import aiohttp
import asyncio
import requests
from bs4 import BeautifulSoup
import time
import numpy as np

    
class IronhackSpider:
    """
    An asynchronous variant of the spider. The constructor takes the parameters
    below and stores them in instance variables so that the class methods can
    access them later.
    
    url_list: the list of web urls to scrape
    sleep_interval: the time interval in seconds to delay between requests (stored, but not used by this asynchronous version).
    content_parser: a function reference that will extract the intended info from the scraped content.
    """
    def __init__(self, url_list, sleep_interval=-1, content_parser=None):
        self.url_list = url_list
        self.sleep_interval = sleep_interval
        self.content_parser = content_parser
    

    """
    Scrape the content of a single url.
    """
    async def fetch(self, client, url):
        async with client.get(url) as resp:
            # Raise aiohttp.ClientResponseError for 4xx/5xx responses.
            resp.raise_for_status()
            return await resp.text()

    async def fetchAll(self):
        # Created inside a coroutine, the session uses the running event loop on its own.
        async with aiohttp.ClientSession() as client:
            result = await asyncio.gather(*[self.fetch(client, url) for url in self.url_list], return_exceptions=True)
            self.cleanAll(result)
        

    """
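    Export the scraped content. Right now it simply prints out the results.
    But in the future you can export the results into a text file or database.
    """
    def cleanAll(self, parsedData):
        for data in parsedData:
            # With return_exceptions=True, failed requests come back as exception objects.
            if isinstance(data, Exception):
                print("Request failed: {!r}".format(data))
            else:
                print(self.content_parser(data))

    """
    After the class is instantiated, call this function to start the scraping jobs.
    This function schedules `fetchAll()`, which requests all urls concurrently.
    """
    def kickstart(self):
        # create_task() needs an already running event loop, as in Jupyter.
        # In a plain Python script, use asyncio.run(self.fetchAll()) instead.
        asyncio.create_task(self.fetchAll())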


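# The urls to scrape (same pages as in the previous cell)
URL_LIST = ["http://quotes.toscrape.com/page/1/", "http://quotes.toscrape.com/page/2/"]

# Instantiate the IronhackSpider class
my_spider = IronhackSpider(URL_LIST, content_parser=quotes_parser)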

# Start scraping jobs
my_spider.kickstart()


['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", '“Try not to become a man of success. Rather become a man of value.”', '“It is better to be hated for what you are than to be loved for what you are not.”', "“I have not failed. I've just found 10,000 ways that won't work.”", "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", '“A day without sunshine is like, you know, night.”']
["“This life is what you make it. No matter what, you're going t