# Getting started with scraping

## **Scraping python.org with Requests and Beautiful Soup**

In this recipe we will install Requests and Beautiful Soup and use them to scrape some content from www.python.org.  The goal is basic familiarity with both libraries; we'll come back to them in subsequent chapters and dive deeper into each.

### **How to do it**

Now let's go and learn to scrape a couple of events. For this recipe we will start by using interactive Python.

In [1]:
# 1 Import requests
import requests

In [2]:
# 2 We now use requests to make an HTTP GET request for the URL
url = 'https://www.python.org/events/python-events'
req = requests.get(url)

In [3]:
# 3 That downloaded the page content, which is stored in the response object req.
# We can retrieve the content using the
# .text property.  This shows the first 200 characters.
req.text[:200]

'<!doctype html>\n<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->\n<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->\n<!--[if IE 8]>      <h'

We now have the raw HTML of the page.  Next we use Beautiful Soup to parse the HTML and retrieve the event data.

In [4]:
# 1 First let's import BeautifulSoup
from bs4 import BeautifulSoup
# 2 Now we create a BeautifulSoup object and pass it the HTML.
soup = BeautifulSoup(req.text, 'html.parser')
# 3 Now we tell Beautiful Soup to find the main <ul> tag for the recent events, and then to get all the <li> tags below it.
events = soup.find('ul', {'class': 'list-recent-events'}).find_all('li')
# 4 And finally we can loop through each of the <li> elements, extracting the event details, and print each to the console:
for event in events:
    event_details = dict()
    event_details['name'] = event.find('h3').find('a').text
    event_details['location'] = event.find('span', {'class': 'event-location'}).text
    event_details['time'] = event.find('time').text
    print(event_details)

{'name': 'PyConFr 2023', 'location': 'Bordeaux, France', 'time': '16 Feb. – 19 Feb.  2023'}
{'name': 'PyCon Namibia 2023', 'location': 'Windhoek, Namibia', 'time': '21 Feb. – 23 Feb.  2023'}
{'name': 'PyCon PH 2023', 'location': 'Manila, Philippines', 'time': '25 Feb. – 26 Feb.  2023'}
{'name': 'GeoPython 2023', 'location': 'Basel, Switzerland', 'time': '06 March – 08 March  2023'}
{'name': 'PyCon DE & PyData Berlin 2023', 'location': 'Berlin, Germany', 'time': '17 April – 19 April  2023'}
{'name': 'PyCon US 2023', 'location': 'Salt Lake City, Utah, USA', 'time': '19 April – 27 April  2023'}


### **How it works**

We will dive into details of both Requests and Beautiful Soup in the next chapter, but for now let's just summarize a few key points about how this works.  The following are important points about Requests:

* Requests is used to execute HTTP requests.  We used it to make a GET request to the URL for the events page.
* The response object returned by Requests holds the results of the request.  This is not only the page content, but also many other items about the result, such as HTTP status codes and headers.
* Requests is used only to get the page; it does not do any parsing.
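
As a minimal sketch of the second point, the helper below collects a few non-body details from a response object.  The names `summarize_response` and `fetch_summary` are hypothetical, introduced only for illustration, and the 10-second timeout is an arbitrary assumption.

```python
import requests

def summarize_response(resp):
    """Collect a few details that a requests.Response carries besides the body."""
    return {
        'status_code': resp.status_code,                   # e.g. 200 or 404
        'ok': resp.ok,                                     # True when the status is below 400
        'content_type': resp.headers.get('Content-Type'),  # header lookup is case-insensitive
    }

def fetch_summary(url, timeout=10):
    """Fetch a URL (hypothetical helper) and summarize the response."""
    resp = requests.get(url, timeout=timeout)
    return summarize_response(resp)
```

Calling `fetch_summary('https://www.python.org/events/python-events/')` performs a live request, so the values it returns depend on what the server answers at that moment.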

We use Beautiful Soup to do the parsing of the HTML and also the finding of content within the HTML.

We used the power of Beautiful Soup to:

* Find the `<ul>` element representing the section, which is found by looking for a `<ul>` with a class attribute that has a value of `list-recent-events`.
* From that object, we find all the `<li>` elements. 

Each of these `<li>` tags represents a different event.  We iterate over each of those, making a dictionary from the event data found in child HTML tags:

* The name is extracted from the `<a>` tag that is a child of the `<h3>` tag
* The location is the text content of the `<span>` with a class of `event-location`
* The time is extracted from the text content of the `<time>` tag
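
The same steps can be tried without touching the network.  The sketch below runs them against a small inline fragment; `SAMPLE_HTML` is a hypothetical stand-in shaped like the events list (the real page contains more markup), and `parse_events` is a name introduced here for illustration.  It also reads the `datetime` attribute of the `<time>` tag, assuming one is present, to show how attribute access differs from `.text`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that imitates the structure of the events list.
SAMPLE_HTML = """
<ul class="list-recent-events menu">
  <li>
    <h3 class="event-title"><a href="/events/1/">Example Conf 2023</a></h3>
    <p>
      <time datetime="2023-02-16T00:00:00+00:00">16 Feb. 2023</time>
      <span class="event-location">Bordeaux, France</span>
    </p>
  </li>
</ul>
"""

def parse_events(html):
    """Return one dictionary per <li> found under the events <ul>."""
    soup = BeautifulSoup(html, 'html.parser')
    event_list = soup.find('ul', {'class': 'list-recent-events'})
    if event_list is None:
        # The list is missing, for example because the page layout changed.
        return []
    events = []
    for item in event_list.find_all('li'):
        time_tag = item.find('time')
        events.append({
            'name': item.find('h3').find('a').text,
            'location': item.find('span', {'class': 'event-location'}).text,
            'time': time_tag.text,                 # text between the tags
            'datetime': time_tag.get('datetime'),  # attribute value, or None
        })
    return events

print(parse_events(SAMPLE_HTML))
```

Returning a list instead of printing inside the loop makes the result easier to reuse, and the `None` check avoids an `AttributeError` when the expected `<ul>` is absent.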

## **Scraping Python.org with urllib3 and Beautiful Soup**

In this recipe we swap out Requests for another library, `urllib3`. This is **another common library for retrieving data from URLs and for other functions involving URLs such as parsing of the parts of the actual URL and handling various encodings**.
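
As a small illustration of the URL-parsing side, the sketch below splits a URL into its parts with `urllib3.util.parse_url`.  The `?page=2` query string is a made-up addition so that every field has something to show.

```python
from urllib3.util import parse_url

# The query string is hypothetical; it is only here to populate url.query.
url = parse_url('https://www.python.org/events/python-events/?page=2')

print(url.scheme)  # protocol part of the URL
print(url.host)    # host name
print(url.path)    # path on the server
print(url.query)   # query string without the leading "?"
```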

### **Getting ready**

In [5]:
%pip install urllib3

Note: you may need to restart the kernel to use updated packages.


### **How to do it**

In [6]:
import urllib3
from bs4 import BeautifulSoup

def get_upcoming_events(url):
    req = urllib3.PoolManager()
    res = req.request('GET', url)
    
    soup = BeautifulSoup(res.data, 'html.parser')

    events = soup.find('ul', {'class': 'list-recent-events'}).find_all('li')

    for event in events:
        event_details = dict()
        event_details['name'] = event.find('h3').find("a").text
        event_details['location'] = event.find('span', {'class': 'event-location'}).text
        event_details['time'] = event.find('time').text
        print(event_details)

get_upcoming_events('https://www.python.org/events/python-events/')

{'name': 'PyConFr 2023', 'location': 'Bordeaux, France', 'time': '16 Feb. – 19 Feb.  2023'}
{'name': 'PyCon Namibia 2023', 'location': 'Windhoek, Namibia', 'time': '21 Feb. – 23 Feb.  2023'}
{'name': 'PyCon PH 2023', 'location': 'Manila, Philippines', 'time': '25 Feb. – 26 Feb.  2023'}
{'name': 'GeoPython 2023', 'location': 'Basel, Switzerland', 'time': '06 March – 08 March  2023'}
{'name': 'PyCon DE & PyData Berlin 2023', 'location': 'Berlin, Germany', 'time': '17 April – 19 April  2023'}
{'name': 'PyCon US 2023', 'location': 'Salt Lake City, Utah, USA', 'time': '19 April – 27 April  2023'}


### **How it works**

The only difference in this recipe is how we fetch the resource:

```python
req = urllib3.PoolManager()
res = req.request('GET', url)
```

Unlike `Requests`, `urllib3` **doesn't decode the response body automatically**: `res.data` holds raw bytes, whereas Requests uses the charset declared in the response headers to build `.text`. **The reason the code snippet works in the preceding example is that BS4 detects the encoding of the bytes it is given**.  But you should keep in mind that **encoding is an important part of scraping**. **If you decide to use your own framework or use other libraries, make sure encoding is well handled**.
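
When the body is not handed to Beautiful Soup, the decoding has to be done by hand.  The following is a minimal sketch using only the standard library; `charset_from_content_type` and `decode_body` are hypothetical helper names, and falling back to UTF-8 when no charset is declared is an assumption, not a rule.  With urllib3, the two inputs would come from `res.data` and `res.headers.get('Content-Type')`.

```python
from email.message import Message

def charset_from_content_type(content_type, default='utf-8'):
    """Return the charset parameter of a Content-Type value, or a default."""
    if not content_type:
        return default
    msg = Message()
    msg['Content-Type'] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default

def decode_body(body, content_type):
    """Decode raw response bytes with the charset the server declared.

    An unknown charset name would raise LookupError; this sketch does not handle that.
    """
    charset = charset_from_content_type(content_type)
    return body.decode(charset, errors='replace')

# Simulated response: bytes encoded as ISO-8859-1 plus the matching header value.
raw = 'café'.encode('iso-8859-1')
print(decode_body(raw, 'text/html; charset=ISO-8859-1'))
```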

### **There's more**

`Requests` and `urllib3` are very similar in terms of capabilities. **It is generally recommended to use Requests when it comes to making HTTP requests**. The following code example illustrates a few advanced features:

In [7]:
import requests
import json
# Session builds on top of urllib3's connection pooling.
# A session reuses the same TCP connection when
# requests are made to the same host;
# see https://en.wikipedia.org/wiki/HTTP_persistent_connection for details
session = requests.Session()

# You may pass in custom cookies
r = session.get('http://httpbin.org/get', cookies={'my-cookie': 'browser'})
print(r.text)
# The echoed request headers include "Cookie": "my-cookie=browser"

# Streaming is another nifty feature
# From http://docs.python-requests.org/en/master/user/advanced/#streaming-requests
# copyright belongs to reques.org
r = requests.get('http://httpbin.org/stream/20', stream=True)

for line in r.iter_lines():
    # filter out keep-alive new lines
    if line:
        decoded_line = line.decode('utf-8')
        print(json.loads(decoded_line))

{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Cookie": "my-cookie=browser", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.28.1", 
    "X-Amzn-Trace-Id": "Root=1-63b8bafb-32e7adc802b511dd15b2a5f3"
  }, 
  "origin": "102.23.26.168", 
  "url": "http://httpbin.org/get"
}

{'url': 'http://httpbin.org/stream/20', 'args': {}, 'headers': {'Host': 'httpbin.org', 'X-Amzn-Trace-Id': 'Root=1-63b8bafc-0018dbcf1b7cafb207adc5eb', 'User-Agent': 'python-requests/2.28.1', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*'}, 'origin': '102.23.26.168', 'id': 0}
{'url': 'http://httpbin.org/stream/20', 'args': {}, 'headers': {'Host': 'httpbin.org', 'X-Amzn-Trace-Id': 'Root=1-63b8bafc-0018dbcf1b7cafb207adc5eb', 'User-Agent': 'python-requests/2.28.1', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*'}, 'origin': '102.23.26.168', 'id': 1}
{'url': 'http://httpbin.org/stream/20', 'args': {}, 'headers': {'Host': 'httpbin.org'
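
A session also carries defaults that are merged into every request it sends.  The sketch below shows this without any network traffic by preparing a request instead of sending it; the `User-Agent` value and the `page` parameter are made up for illustration.

```python
import requests

session = requests.Session()
# Session-level headers apply to every request made through this session.
session.headers.update({'User-Agent': 'events-scraper/0.1'})  # hypothetical value

# Build the request and let the session merge in its defaults; nothing is sent.
request = requests.Request('GET', 'http://httpbin.org/get', params={'page': 1})
prepared = session.prepare_request(request)

print(prepared.method, prepared.url)
print(prepared.headers['User-Agent'])
```

Inspecting a prepared request this way can help when checking exactly what a scraper would send before pointing it at a real site.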

## **Scraping Python.org with Scrapy**

Scrapy is a very popular open source Python scraping framework for extracting data. **It was originally designed for only scraping, but it has also evolved into a powerful web crawling solution**.

In our previous recipes, we used Requests and urllib3 to fetch data and Beautiful Soup to extract data. **Scrapy offers all of these functionalities with many other built-in modules and extensions. It is also our tool of choice when it comes to scraping with Python**.

Scrapy offers a number of powerful features that are worth mentioning:

* Built-in extensions to make HTTP requests and to handle compression, authentication, caching, user-agent manipulation, and HTTP headers
* Built-in support for selecting and extracting data with selector languages such as CSS and XPath, as well as support for utilizing regular expressions for selection of content and links 
* Encoding support to deal with languages and non-standard encoding declarations
* Flexible APIs to reuse and write custom middleware and pipelines, which provide a clean and easy way to implement tasks such as automatically downloading assets (for example, images or media) and storing data in storage such as file systems, S3, databases, and others

### **Getting started**

There are several means of creating a scraper with Scrapy.  **One is a programmatic pattern where we create the crawler and spider in our code**.  It is also possible to **configure a Scrapy project from templates or generators and then run the scraper from the command line using the scrapy command**.  This book will follow the programmatic pattern, as it keeps the code for each example in a single file.  This will help when we are putting together specific, targeted recipes with Scrapy.

**This isn't necessarily a better way of running a Scrapy scraper than using the command line execution, just one that is a design decision for this book**.  Ultimately this book is not about Scrapy (there are other books on just Scrapy), but more of an exposition on various things you may need to do when scraping, and in the ultimate creation of a functional scraper as a service in the cloud.

### **How to do it**

In [8]:
import scrapy
from scrapy.crawler import CrawlerProcess

class PythonEventsSpider(scrapy.Spider):
    name = 'pythoneventsspider'

    start_urls = ['https://www.python.org/events/python-events/',]
    found_events = []

    def parse(self, response):
        for event in response.xpath('//ul[contains(@class, "list-recent-events")]/li'):
            event_details = dict()
            event_details['name'] = event.xpath('h3[@class="event-title"]/a/text()').extract_first()
            event_details['location'] = event.xpath('p/span[@class="event-location"]/text()').extract_first()
            event_details['time'] = event.xpath('p/time/text()').extract_first()
            self.found_events.append(event_details)

if __name__ == "__main__":
    process = CrawlerProcess({ 'LOG_LEVEL': 'ERROR'})
    process.crawl(PythonEventsSpider)
    spider = next(iter(process.crawlers)).spider
    process.start()

    for event in spider.found_events: print(event)
    process.stop()

{'name': 'PyConFr 2023', 'location': 'Bordeaux, France', 'time': '16 Feb. – 19 Feb. '}
{'name': 'PyCon Namibia 2023', 'location': 'Windhoek, Namibia', 'time': '21 Feb. – 23 Feb. '}
{'name': 'PyCon PH 2023', 'location': 'Manila, Philippines', 'time': '25 Feb. – 26 Feb. '}
{'name': 'GeoPython 2023', 'location': 'Basel, Switzerland', 'time': '06 March – 08 March '}
{'name': 'PyCon DE & PyData Berlin 2023', 'location': 'Berlin, Germany', 'time': '17 April – 19 April '}
{'name': 'PyCon US 2023', 'location': 'Salt Lake City, Utah, USA', 'time': '19 April – 27 April '}
{'name': 'XtremePython 2022', 'location': 'Online', 'time': '27 Dec.'}
{'name': 'PyCon Bolivia 2022', 'location': 'Cochabamba, Bolivia', 'time': '09 Dec. – 10 Dec. '}


### **How it works**

We will get into some details about Scrapy in later chapters, but let's just go through this code quickly to get a feel for how it accomplishes this scrape.  Everything in Scrapy revolves around creating a spider.  Spiders crawl through pages on the Internet based upon rules that we provide.  This spider only processes a single page, so it's not really much of a spider.  But it shows the pattern we will use through later Scrapy examples.

The spider is created with a class definition that derives from one of the Scrapy spider classes.  Ours derives from the scrapy.Spider class.
```py
class PythonEventsSpider(scrapy.Spider):
    name = 'pythoneventsspider'

    start_urls = ['https://www.python.org/events/python-events/',]
```
Every spider is given a name, and also one or more start_urls which tell it where to start the crawling.

This spider has a field to store all the events that we find:
```py
    found_events = []
```

The spider then has a method named parse, which will be called for every page the spider collects.

```python
    def parse(self, response):
        for event in response.xpath('//ul[contains(@class, "list-recent-events")]/li'):
            event_details = dict()
            event_details['name'] = event.xpath('h3[@class="event-title"]/a/text()').extract_first()
            event_details['location'] = event.xpath('p/span[@class="event-location"]/text()').extract_first()
            event_details['time'] = event.xpath('p/time/text()').extract_first()
            self.found_events.append(event_details)
```

The implementation of this method uses an XPath selection to get the events from the page (XPath is one of the built-in means of navigating HTML in Scrapy). It then builds the event_details dictionary object similarly to the other examples, and then adds it to the found_events list.

The remaining code does the programmatic execution of the Scrapy crawler.
```py
    process = CrawlerProcess({ 'LOG_LEVEL': 'ERROR'})
    process.crawl(PythonEventsSpider)
    spider = next(iter(process.crawlers)).spider
    process.start()
```

It starts with the creation of a CrawlerProcess, which does the actual crawling and a lot of other tasks.  We pass it a `LOG_LEVEL` of `ERROR` to prevent the voluminous Scrapy output.  Change this to `DEBUG` and re-run it to see the difference.

Next we tell the crawler process to use our Spider implementation.  We get the actual spider object from that crawler so that we can get the items when the crawl is complete.  And then we kick off the whole thing by calling `process.start()`.

When the crawl is completed we can then iterate and print out the items that were found.
```py
    for event in spider.found_events: print(event)
```

> This example really didn't touch any of the power of Scrapy.  We will look more into some of the more advanced features later in the book.

## **Scraping python.org with Selenium**

This recipe will introduce Selenium, a framework that is very different from the frameworks in the previous recipes. In fact, Selenium is often used in functional/acceptance testing. We want to demonstrate this tool as it offers unique benefits from the scraping perspective. Several that we will look at later in the book are the ability to fill out forms, press buttons, and wait for dynamic JS to be downloaded and executed.

Selenium itself is a programming-language-neutral framework. It offers a number of programming language bindings, such as Python, Java, C# and PHP (among others). The framework also provides many components that focus on testing. Three commonly used components are:

* IDE for recording and replaying tests
* WebDriver, which launches a web browser (such as Firefox, Chrome, or Internet Explorer) and drives it by sending commands to the selected browser and returning the results
* A grid server, which executes tests with a web browser on a remote server. It can run multiple test cases in parallel.

In [9]:
%conda install -c conda-forge selenium

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/ibrahim/miniconda3/envs/scraping

  added / updated specs:
    - selenium


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    async_generator-1.10       |             py_0          18 KB  conda-forge
    exceptiongroup-1.1.0       |     pyhd8ed1ab_0          18 KB  conda-forge
    outcome-1.2.0              |     pyhd8ed1ab_0          12 KB  conda-forge
    selenium-4.7.2             |     pyhd8ed1ab_0         272 KB  conda-forge
    sortedcontainers-2.4.0     |     pyhd8ed1ab_0          26 KB  conda-forge
    trio-0.22.0                |  py310hff52083_1         540 KB  conda-forge
    trio-websocket-0.9.2       |     pyhd8ed1ab_0          25 KB  conda-forge
    wsproto-1.2.0              |     pyhd8ed1ab_0          24 KB  conda-forge
    -----------------

In [10]:
%conda install -c conda-forge webdriver-manager

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/ibrahim/miniconda3/envs/scraping

  added / updated specs:
    - webdriver-manager


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    python-dotenv-0.21.0       |     pyhd8ed1ab_0          21 KB  conda-forge
    webdriver-manager-3.8.5    |     pyhd8ed1ab_0          24 KB  conda-forge
    ------------------------------------------------------------
                                           Total:          45 KB

The following NEW packages will be INSTALLED:

  click              conda-forge/noarch::click-8.1.3-unix_pyhd8ed1ab_2 
  colorama           conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_0 
  python-dotenv      conda-forge/noarch::python-dotenv-0.21.0-pyhd8ed1ab_0 
  tqdm               conda-forge/noarch::tqdm-4.64.1-pyhd8ed1ab_0 
  webdriver-manag

In [21]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.edge.service import Service as EdgeService
from webdriver_manager.microsoft import EdgeChromiumDriverManager

def get_upcoming_events(url):
    driver = webdriver.Edge(service=EdgeService(EdgeChromiumDriverManager().install()))
    driver.get(url)

    # Recent Selenium 4 releases removed the find_element(s)_by_* helpers,
    # so we use find_elements(By.XPATH, ...) instead
    events = driver.find_elements(By.XPATH, '//ul[contains(@class, "list-recent-events")]/li')

    for event in events:
        event_details = dict()
        event_details['name'] = event.find_element(By.XPATH, 'h3[@class="event-title"]/a').text
        event_details['location'] = event.find_element(By.XPATH, 'p/span[@class="event-location"]').text
        event_details['time'] = event.find_element(By.XPATH, 'p/time').text
        print(event_details)

    driver.quit()

get_upcoming_events('https://www.python.org/events/python-events/')

WebDriverException: Message: unknown error: cannot find msedge binary
Stacktrace:
#0 0x563a77644a03 <unknown>
#1 0x563a773da111 <unknown>
#2 0x563a77402239 <unknown>
#3 0x563a773ff9b0 <unknown>
#4 0x563a77440a6f <unknown>
#5 0x563a77438613 <unknown>
#6 0x563a7740b199 <unknown>
#7 0x563a7740c3ee <unknown>
#8 0x563a77684cd8 <unknown>
#9 0x563a77686b8e <unknown>
#10 0x563a776865ff <unknown>
#11 0x563a776872a5 <unknown>
#12 0x563a77673629 <unknown>
#13 0x563a7768760e <unknown>
#14 0x563a77668a56 <unknown>
#15 0x563a776a31f8 <unknown>
#16 0x563a776a3330 <unknown>
#17 0x563a776bd066 <unknown>
#18 0x7f97a382a609 start_thread
