# Getting started with scraping

## **Scraping python.org with Requests and Beautiful Soup**

In this recipe we will install Requests and Beautiful Soup, scrape some content from www.python.org, and get some basic familiarity with both libraries.  We'll come back to them in subsequent chapters and dive deeper into each.

### **How to do it**

Now let's learn to scrape a couple of events. For this recipe we will start by using interactive Python.

In [1]:
# 1 Import requests
import requests

In [2]:
# 2 We now use requests to make an HTTP GET request for the URL
url = 'https://www.python.org/events/python-events'
req = requests.get(url)

In [3]:
# 3 That downloaded the page content, which is stored in the response object req.
# We can retrieve the content using the
# .text property.  This shows the first 200 characters.
req.text[:200]

'<!doctype html>\n<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->\n<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->\n<!--[if IE 8]>      <h'

We now have the raw HTML of the page.  We can use Beautiful Soup to parse the HTML and retrieve the event data.

In [14]:
# 1 First let's import BeautifulSoup
from bs4 import BeautifulSoup
# 2 Now we create a BeautifulSoup object and pass it the HTML.
soup = BeautifulSoup(req.text, 'html.parser')
# 3 Now we tell Beautiful Soup to find the main <ul> tag for the recent events, and then to get all the <li> tags below it.
events = soup.find('ul', {'class': 'list-recent-events'}).find_all('li')
# 4 And finally we can loop through each of the <li> elements, extracting the event details, and print each to the console:
for event in events:
    event_details = dict()
    event_details['name'] = event.find('h3').find('a').text
    event_details['location'] = event.find('span', {'class': 'event-location'}).text
    event_details['time'] = event.find('time').text
    print(event_details)

{'name': 'PyConFr 2023', 'location': 'Bordeaux, France', 'time': '16 Feb. – 19 Feb.  2023'}
{'name': 'PyCon Namibia 2023', 'location': 'Windhoek, Namibia', 'time': '21 Feb. – 23 Feb.  2023'}
{'name': 'PyCon PH 2023', 'location': 'Manila, Philippines', 'time': '25 Feb. – 26 Feb.  2023'}
{'name': 'GeoPython 2023', 'location': 'Basel, Switzerland', 'time': '06 March – 08 March  2023'}
{'name': 'PyCon DE & PyData Berlin 2023', 'location': 'Berlin, Germany', 'time': '17 April – 19 April  2023'}
{'name': 'PyCon US 2023', 'location': 'Salt Lake City, Utah, USA', 'time': '19 April – 27 April  2023'}


### **How it works**

We will dive into the details of both Requests and Beautiful Soup in the next chapter, but for now let's summarize a few key points about how this works.  The following are the important points about Requests:

* Requests is used to execute HTTP requests.  We used it to make a GET verb request of the URL for the events page.
* The response object returned by Requests holds the results of the request.  This is not only the page content, but also other details such as the HTTP status code and headers.
* Requests is used only to get the page; it does not do any parsing.
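
As a minimal sketch of the second point, the following inspects the status code, headers, and body of a response object. So that it runs without depending on python.org, it fetches a tiny page from a throwaway local server; the `DemoHandler` class and the page it serves are assumptions made for illustration and are not part of the recipe.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that serves one small HTML page."""

    def do_GET(self):
        body = b'<html><body><h1>Hello</h1></body></html>'
        self.send_response(200)
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 asks the operating system for any free port.
server = HTTPServer(('127.0.0.1', 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

session = requests.Session()
session.trust_env = False  # ignore proxy settings for this loopback request
res = session.get(f'http://127.0.0.1:{server.server_port}/', timeout=5)

print(res.status_code)              # HTTP status code of the response
print(res.headers['Content-Type'])  # headers act like a case-insensitive dict
print(res.encoding)                 # charset taken from the Content-Type header
print(res.text)                     # body decoded to str; nothing is parsed

server.shutdown()
server.server_close()
```

Against a real site, the same attributes are available on the object returned by `requests.get(url)`.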

We use Beautiful Soup to do the parsing of the HTML and also the finding of content within the HTML.

We used the power of Beautiful Soup to:

* Find the `<ul>` element representing the section, which is found by looking for a `<ul>` with a class attribute that has a value of `list-recent-events`.
* From that object, find all the `<li>` elements.

Each of these `<li>` tags represents a different event.  We iterate over them, making a dictionary from the event data found in child HTML tags:

* The name is extracted from the `<a>` tag that is a child of the `<h3>` tag.
* The location is the text content of the `<span>` with a class of `event-location`.
* The time is the text content of the `<time>` tag.
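
The steps above can be sketched as a small function that works on any HTML string, which makes the parsing logic easy to try without a network connection. The `SAMPLE_HTML` markup and the `parse_events` name are hypothetical; the sample only mimics the structure described here and is not copied from python.org.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup with the same structure as the events list.
SAMPLE_HTML = """
<ul class="list-recent-events">
  <li>
    <h3><a href="#">PyCon Example</a></h3>
    <p><time>16 Feb. 2023</time>
       <span class="event-location">Bordeaux, France</span></p>
  </li>
  <li>
    <h3><a href="#">GeoPython Example</a></h3>
    <p><time>06 March 2023</time>
       <span class="event-location">Basel, Switzerland</span></p>
  </li>
</ul>
"""


def parse_events(html):
    """Return one dict per <li> found inside the events <ul>."""
    soup = BeautifulSoup(html, 'html.parser')
    events_list = soup.find('ul', {'class': 'list-recent-events'})
    if events_list is None:  # wrong page, or the layout has changed
        return []
    events = []
    for item in events_list.find_all('li'):
        events.append({
            'name': item.find('h3').find('a').text,
            'location': item.find('span', {'class': 'event-location'}).text,
            'time': item.find('time').text,
        })
    return events


for event in parse_events(SAMPLE_HTML):
    print(event)
```

Returning a list instead of printing inside the loop is a design choice of this sketch; it lets the caller decide what to do with the events.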

## **Scraping python.org with urllib3 and Beautiful Soup**

In this recipe we swap Requests for another library, `urllib3`. This is **another common library for retrieving data from URLs and for other functions involving URLs, such as parsing the parts of a URL and handling various encodings**.

### **Getting ready**

In [15]:
%pip install urllib3

Note: you may need to restart the kernel to use updated packages.
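
A quick way to confirm that the package is importable is to try its URL helper, which also illustrates the URL-part parsing mentioned in the introduction. This is only a sketch; the `?page=2` query string is a made-up value added to show the `query` field.

```python
from urllib3.util import parse_url

# The query string is hypothetical and only there to populate the `query` field.
parts = parse_url('https://www.python.org/events/python-events/?page=2')

print(parts.scheme)  # protocol part of the URL
print(parts.host)    # host name
print(parts.path)    # path on the server
print(parts.query)   # query string without the leading '?'
```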


### **How to do it**

In [17]:
import urllib3
from bs4 import BeautifulSoup

def get_upcoming_events(url):
    req = urllib3.PoolManager()
    res = req.request('GET', url)
    
    soup = BeautifulSoup(res.data, 'html.parser')

    events = soup.find('ul', {'class': 'list-recent-events'}).find_all('li')

    for event in events:
        event_details = dict()
        event_details['name'] = event.find('h3').find('a').text
        event_details['location'] = event.find('span', {'class': 'event-location'}).text
        event_details['time'] = event.find('time').text
        print(event_details)

get_upcoming_events('https://www.python.org/events/python-events/')

{'name': 'PyConFr 2023', 'location': 'Bordeaux, France', 'time': '16 Feb. – 19 Feb.  2023'}
{'name': 'PyCon Namibia 2023', 'location': 'Windhoek, Namibia', 'time': '21 Feb. – 23 Feb.  2023'}
{'name': 'PyCon PH 2023', 'location': 'Manila, Philippines', 'time': '25 Feb. – 26 Feb.  2023'}
{'name': 'GeoPython 2023', 'location': 'Basel, Switzerland', 'time': '06 March – 08 March  2023'}
{'name': 'PyCon DE & PyData Berlin 2023', 'location': 'Berlin, Germany', 'time': '17 April – 19 April  2023'}
{'name': 'PyCon US 2023', 'location': 'Salt Lake City, Utah, USA', 'time': '19 April – 27 April  2023'}


### **How it works**

The only difference in this recipe is how we fetch the resource:

```python
req = urllib3.PoolManager()
res = req.request('GET', url)
```

Unlike `Requests`, `urllib3` **doesn't decode the response body for you**: `res.data` holds raw bytes, whereas the `.text` property in Requests decodes them using the charset declared in the response headers. **The preceding example works because Beautiful Soup detects the encoding of the bytes it is given**.  But you should keep in mind that **encoding is an important part of scraping**. **If you decide to use your own framework or other libraries, make sure encoding is handled properly**.
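
To make the encoding point concrete, here is a hedged sketch of two ways to handle it explicitly. The `raw_body` and `content_type` values stand in for `res.data` and the `Content-Type` response header and are invented for illustration; `charset_from_content_type` is a hypothetical helper built on the standard library.

```python
from email.message import Message

from bs4 import BeautifulSoup

# Stand-ins for res.data and the Content-Type header (assumed values).
raw_body = '<p>Café in Zürich</p>'.encode('iso-8859-1')
content_type = 'text/html; charset=ISO-8859-1'


def charset_from_content_type(value, default='utf-8'):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg['Content-Type'] = value
    return msg.get_content_charset() or default


# Option 1: decode the bytes yourself, then parse the resulting str.
charset = charset_from_content_type(content_type)
text = raw_body.decode(charset)
soup = BeautifulSoup(text, 'html.parser')
print(soup.p.text)

# Option 2: hand the bytes to Beautiful Soup and tell it which encoding to use.
soup2 = BeautifulSoup(raw_body, 'html.parser', from_encoding=charset)
print(soup2.p.text)
```

Both options should print the same paragraph text; relying on automatic detection alone is usually fine for well-formed pages, but being explicit avoids surprises when a page declares one encoding and uses another.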

### **There's more**

`Requests` and `urllib3` are very similar in terms of capabilities. **It is generally recommended to use Requests when it comes to making HTTP requests**. The following code example illustrates a few advanced features:

In [None]:
import requests
import json
# builds on top of urllib3's connection pooling
# session reuses the same TCP connection if 
# requests are made to the same host
# see https://en.wikipedia.org/wiki/HTTP_persistent_connection for details
session = requests.Session()

# You may pass in a custom cookie
r = session.get('http://httpbin.org/get', cookies={'my-cookie': 'browser'})
print(r.text)
# httpbin echoes the request back; the "headers" section of the output
# should include "Cookie": "my-cookie=browser"

# Streaming is another nifty feature
# From http://docs.python-requests.org/en/master/user/advanced/#streaming-requests
# copyright belongs to the Requests project
r = requests.get('http://httpbin.org/stream/20', stream=True)

for line in r.iter_lines():
    # filter out keep-alive new lines
    if line:
        decoded_line = line.decode('utf-8')
        print(json.loads(decoded_line))
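
The cookie example can also be checked without contacting httpbin. As a rough sketch, a session can prepare a request without sending it, which shows the headers that would go over the wire:

```python
import requests

session = requests.Session()

# Build the same GET request as above, but do not send it.
req = requests.Request('GET', 'http://httpbin.org/get',
                       cookies={'my-cookie': 'browser'})
prepared = session.prepare_request(req)

print(prepared.method)             # HTTP verb that would be used
print(prepared.url)                # final URL of the request
print(prepared.headers['Cookie'])  # cookie header built from the dict
```

Cookies stored on the session itself (`session.cookies`) are merged in the same way, which is how a session carries cookies from one request to the next.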