# The 'Requests' package & BeautifulSoup

## The DOM Tree (Document Object Model)

Adapted from: http://docs.python-requests.org/en/master/user/quickstart/

Install __requests__ if it isn't already installed. (I use "pip", e.g. "pip install requests". You may use git, easy_install, etc.)

Then import the library, like you would any other.

In [1]:
import requests

To "get" a webpage (or anything else available via the internet), use requests.get(URL):

In [2]:
r = requests.get('https://api.github.com/events')

Now, we have a Response object called r. We can get all the information we need from this object.

(Requests can also pass parameters in the URL, make HTTP POST calls, and handle more complex URL encodings, but we'll skip those for now.)
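
For the curious, here is a minimal sketch of those skipped features. It builds the requests with requests.Request(...).prepare() instead of sending them, so it runs without a network connection. The httpbin.org endpoints and the parameter names (q, page, name) are illustrative assumptions, not part of the original walkthrough.

```python
import requests

# Build (but don't send) a GET request with query parameters.
# requests URL-encodes the params dict and appends it to the URL.
get_req = requests.Request(
    'GET',
    'https://httpbin.org/get',
    params={'q': 'beautiful soup', 'page': 2},
).prepare()
print(get_req.url)

# Build (but don't send) a POST request with form data in the body.
post_req = requests.Request(
    'POST',
    'https://httpbin.org/post',
    data={'name': 'Ada'},
).prepare()
print(post_req.method, post_req.body)
```

In everyday code you would more likely call requests.get(url, params=...) or requests.post(url, data=...) directly, which prepares and sends the request in one step.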


Use the .text attribute to get the raw text of the response.
This will be key when we start parsing/scraping webpages with Beautiful Soup!

In [3]:
r.text



If we're dealing with JSON data, use...you guessed it!  The .json() method (note: this raises an exception if the response body isn't JSON or the JSON decoding fails)

In [4]:
r.json()

[{'id': '27538980347',
  'type': 'PullRequestEvent',
  'actor': {'id': 49699333,
   'login': 'dependabot[bot]',
   'display_login': 'dependabot',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/dependabot[bot]',
   'avatar_url': 'https://avatars.githubusercontent.com/u/49699333?'},
  'repo': {'id': 253501538,
   'name': 'EricRicketts/Exercism',
   'url': 'https://api.github.com/repos/EricRicketts/Exercism'},
  'payload': {'action': 'opened',
   'number': 59,
   'pull_request': {'url': 'https://api.github.com/repos/EricRicketts/Exercism/pulls/59',
    'id': 1265729561,
    'node_id': 'PR_kwDODxwgYs5LcYAZ',
    'html_url': 'https://github.com/EricRicketts/Exercism/pull/59',
    'diff_url': 'https://github.com/EricRicketts/Exercism/pull/59.diff',
    'patch_url': 'https://github.com/EricRicketts/Exercism/pull/59.patch',
    'issue_url': 'https://api.github.com/repos/EricRicketts/Exercism/issues/59',
    'number': 59,
    'state': 'open',
    'locked': False,
    'title': 'Bu
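
Since .json() raises when the body isn't JSON, it can help to wrap the call. Below is a rough sketch, assuming the decoding error is a ValueError subclass (true for json.JSONDecodeError in the standard library, and to my knowledge for the errors requests raises). The helper json_or_none and the FakeResponse class are hypothetical names made up for this demo so it runs offline.

```python
import json

def json_or_none(response):
    """Return the decoded JSON body, or None if the body is not valid JSON."""
    try:
        return response.json()
    except ValueError:
        # JSON decoding errors are ValueError subclasses
        return None

class FakeResponse:
    """Hypothetical stand-in for a requests Response, used only for this demo."""
    def __init__(self, text):
        self.text = text

    def json(self):
        return json.loads(self.text)

print(json_or_none(FakeResponse('{"type": "PullRequestEvent"}')))
print(json_or_none(FakeResponse('<html>Not JSON</html>')))
```

With a real Response object you would pass r straight to json_or_none(r).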

That's really all you need for basic web access!  Isn't Python GREAT!!?

But here are a few other things...

To check the "status code" returned by the web server (to error check and make sure you actually connected and retrieved the web document correctly), we use the .status_code attribute:

In [5]:
r = requests.get('http://httpbin.org/get')
r.status_code

200

Note the 200, which means "It's all good!"

But let's check what happens when things go bad:

In [6]:
bad_r = requests.get('http://httpbin.org/status/404')
bad_r.status_code

404
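
If you forget what a status code means, the standard library can tell you. This is a small sketch using http.HTTPStatus; the helper name describe_status is my own invention, and the grouping into "success", "client error" and so on follows the usual 2xx/4xx/5xx ranges.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description like '404 Not Found (client error)'."""
    phrase = HTTPStatus(code).phrase  # raises ValueError for unknown codes
    if 200 <= code < 300:
        kind = 'success'
    elif 300 <= code < 400:
        kind = 'redirection'
    elif 400 <= code < 500:
        kind = 'client error'
    elif 500 <= code < 600:
        kind = 'server error'
    else:
        kind = 'informational'
    return f"{code} {phrase} ({kind})"

for code in (200, 301, 404, 500):
    print(describe_status(code))
```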

A 404 error is "Not found".  That's bad.  So we should attempt to catch that.  We can use the raise_for_status() method to tell us when something bad has happened.  It raises an HTTPError exception (which we could catch in a try/except block if we wanted to).

In [7]:
bad_r.raise_for_status()

HTTPError: 404 Client Error: NOT FOUND for url: http://httpbin.org/status/404

ACK!  Disaster!  Actually, let's practice our exception handling, so we can handle this gracefully:

In [8]:
def catch_me_some_web_exceptions(URL):
    my_request = requests.get(URL)
    try:
        # raise_for_status() raises HTTPError for 4xx and 5xx status codes
        my_request.raise_for_status()
        print("Your webpage", URL, "is waiting for you in the response variable!")
    except requests.exceptions.HTTPError:
        print("Turns out", URL, "is a bad URL, bro.")

    # regardless of whether or not this works, return the response object
    return my_request

See how simple/useful catching exceptions is?  Let's try it out!

In [9]:
my_bad_request = catch_me_some_web_exceptions('http://httpbin.org/status/404')

Turns out http://httpbin.org/status/404 is a bad URL, bro.


In [10]:
my_good_request = catch_me_some_web_exceptions('http://www.google.com')

Your webpage http://www.google.com is waiting for you in the response variable!
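
Bad status codes aren't the only thing that can go wrong: the request itself can fail (malformed URL, no connection, timeout). Here is one possible sketch that handles both cases. The function name fetch_or_none and the 10-second timeout are assumptions of mine. The demo call uses a URL with no scheme, which requests rejects before any network traffic happens.

```python
import requests

def fetch_or_none(url, timeout=10):
    """Return the Response for url, or None if anything went wrong."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
    except requests.exceptions.HTTPError as err:
        # The server answered, but with a 4xx or 5xx status
        print("The server answered with an error status:", err)
    except requests.exceptions.RequestException as err:
        # Base class for requests' own errors: bad URL, no connection, timeout, ...
        print("The request itself failed:", err)
    else:
        return response
    return None

# A URL without a scheme fails while the request is being prepared
print(fetch_or_none('not-a-real-url'))
```

HTTPError is listed first because it is a subclass of RequestException; with the order reversed, the more specific branch would never run.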


If you're really the curious sort, you can check out the HTTP response headers using -- wait for it -- the .headers attribute

In [11]:
my_good_request.headers

{'Date': 'Tue, 07 Mar 2023 07:03:29 GMT', 'Expires': '-1', 'Cache-Control': 'private, max-age=0', 'Content-Type': 'text/html; charset=ISO-8859-1', 'P3P': 'CP="This is not a P3P policy! See g.co/p3phelp for more info."', 'Content-Encoding': 'gzip', 'Server': 'gws', 'Content-Length': '6248', 'X-XSS-Protection': '0', 'X-Frame-Options': 'SAMEORIGIN', 'Set-Cookie': '1P_JAR=2023-03-07-07; expires=Thu, 06-Apr-2023 07:03:29 GMT; path=/; domain=.google.com; Secure, AEC=ARSKqsKHtky15Tb8jjUv5O5onZ52u2DrXYiDwH_mMe06ekJQUEackXs7TAs; expires=Sun, 03-Sep-2023 07:03:29 GMT; path=/; domain=.google.com; Secure; HttpOnly; SameSite=lax, NID=511=E16IoJ9jotoQ6E2l9TkEu32-iU3C5Itxb7CrKzLJK3w9dxwQ9NlqN9dZLBd7cAxkQW-HiWBGoZj4JneMM_dljmVHQZZqumD2qO2wTcqEslCYaj8mNrE-aCdwO50KY7aMT4De8EnI7jscjd41iQgrGIIM_pcfTOJlmaOGQN3Ui60; expires=Wed, 06-Sep-2023 07:03:29 GMT; path=/; domain=.google.com; HttpOnly'}

You can access the headers like a dict, if you're so inclined:

In [12]:
my_good_request.headers['Content-Type']

'text/html; charset=ISO-8859-1'

In [13]:
my_bad_request.headers['Content-Type']

'text/html; charset=utf-8'

In [14]:
r.headers['Content-Type']

'application/json'
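
One nicety worth knowing: the headers object is a case-insensitive dict, so you don't have to remember the exact capitalization. The sketch below builds a CaseInsensitiveDict by hand (with a couple of header values copied from the output above) so that it runs without a network call.

```python
from requests.structures import CaseInsensitiveDict

# A hand-built copy of a few headers, standing in for my_good_request.headers
headers = CaseInsensitiveDict({
    'Content-Type': 'text/html; charset=ISO-8859-1',
    'Server': 'gws',
})

# Lookups ignore case
print(headers['content-type'])

# .get() works like on a normal dict, with a default for missing headers
print(headers.get('X-Not-There', 'no such header'))

# Split the media type from its parameters (charset etc.)
media_type = headers['Content-Type'].split(';')[0].strip()
print(media_type)
```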

In [15]:
r.headers

{'Date': 'Tue, 07 Mar 2023 07:02:52 GMT', 'Content-Type': 'application/json', 'Content-Length': '311', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true'}

In [16]:
my_good_request.text

'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world\'s information, including webpages, images, videos and more. Google has many special features to help you find exactly what you\'re looking for." name="description"><meta content="noodp" name="robots"><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="Ht0LEa0a0dvVcPRUYAWP0g">(function(){window.google={kEI:\'weEGZMaPN_X1kPIP6K6m8A8\',kEXPI:\'0,1359409,6058,207,4804,2316,383,246,5,1129120,1625,1196074,867,379925,16114,28684,22430,1362,12319,17580,4998,13228,3847,35733,2711,2872,2891,3926,214,8220,30668,30022,15324,432,3,346,1244,1,5445,148,11323,2652,4,1528,2304,29062,13063,13660,2980,1457,9358,7428,5830,2527,4094,17,7579,1,11943,30211,2,14022,2373,342,21266,1758,5679,1020,31122,4569,6255,23421,1252,5835,14968,4332,7484,445,2,

In [17]:
r.text

'{\n  "args": {}, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate, br", \n    "Host": "httpbin.org", \n    "User-Agent": "python-requests/2.28.1", \n    "X-Amzn-Trace-Id": "Root=1-6406e19c-103e0c6a2bbdb1b8469a15aa"\n  }, \n  "origin": "172.250.225.33", \n  "url": "http://httpbin.org/get"\n}\n'

In [18]:
my_bad_request.text

''

In [19]:
test = requests.get('https://github.com/thisshouldntbehere')

In [20]:
print(test.text)

Not Found


In [21]:
test.status_code

404

In [None]:
enc = requests.get('https://www.twitter.com')

In [None]:
enc.headers

## BeautifulSoup
__BeautifulSoup parses an HTML string into a Document Object Model (DOM) and then navigates the DOM.__

You may need to do 'pip install beautifulsoup4'

In [None]:
import requests
from bs4 import BeautifulSoup
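
To make "navigating the DOM" concrete, here is a tiny sketch on a made-up HTML snippet (the tags, id and class are invented for illustration). It shows a few of the basic moves: finding a tag, reading its text and attributes, and walking up to a parent or down to children.

```python
from bs4 import BeautifulSoup

html = "<html><body><h1 id='title'>Hello</h1><p class='intro'>First <b>bold</b> text</p></body></html>"
soup = BeautifulSoup(html, 'html.parser')

h1 = soup.find('h1')
print(h1.name, h1.string, h1['id'])   # tag name, text content, attribute value
print(h1.parent.name)                 # walk up the tree

p = soup.find('p')
print(p.get_text())                   # all text inside <p>, including nested tags
print(p.b.string)                     # walk down: first <b> inside <p>

# Direct children of <body>
print([child.name for child in soup.body.children])
```

Note that .string only works when a tag has a single child; for a tag with mixed content, like the <p> above, it is None and get_text() is the safer choice.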

### Simple table example

In [None]:
html_with_table = '''
<html>
   <body>
      <table cellpadding="3">
         <tr class="header"><th>Country</th><th>Capital</th><th>Population</th></tr>
         <tr><td>France</td><td>Paris</td><td>67000000</td></tr>
         <tr><td>Germany</td><td>Berlin</td><td>83000000</td></tr>
         <tr><td>Spain</td><td>Madrid</td><td>47000000</td></tr>
      </table>
   </body>
</html>'''

In [None]:
# Parse with beautiful soup (used to use 'lxml' as the parser)
soup = BeautifulSoup(html_with_table, 'html.parser')

table = soup.find('table')
table_rows = table.find_all('tr')
for table_row in table_rows:
    table_row_class_list = table_row.attrs.get('class', [])
    # print(table_row_class_list)
    if 'header' not in table_row_class_list:
        table_data_cells = table_row.find_all('td')
        country_cell = table_data_cells[0]
        capital_cell = table_data_cells[1]
        print(f"The capital of {country_cell.string} is {capital_cell.string}.")
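
Printing is fine for a demo, but usually you want the data in a structure you can work with. One way to do that, sketched below, is to use the header row as dictionary keys and build one dict per country. The HTML is a trimmed copy of the table above, repeated so the block runs on its own; the variable names are my own.

```python
from bs4 import BeautifulSoup

html = '''
<table>
   <tr class="header"><th>Country</th><th>Capital</th><th>Population</th></tr>
   <tr><td>France</td><td>Paris</td><td>67000000</td></tr>
   <tr><td>Germany</td><td>Berlin</td><td>83000000</td></tr>
   <tr><td>Spain</td><td>Madrid</td><td>47000000</td></tr>
</table>'''

soup = BeautifulSoup(html, 'html.parser')
rows = soup.find('table').find_all('tr')

# The first row holds the column names
column_names = [th.get_text(strip=True) for th in rows[0].find_all('th')]

# Every other row becomes a dict keyed by column name
records = []
for row in rows[1:]:
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    records.append(dict(zip(column_names, cells)))

print(records[0])

# Cell text is always a string, so convert before doing arithmetic
total_population = sum(int(record['Population']) for record in records)
print(total_population)
```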

### Basketball draft example

In [None]:
r = requests.get("http://www.basketball-reference.com/draft/NBA_2003.html")

In [None]:
r.text

In [None]:
# Parse with beautiful soup (used to use 'lxml' as the parser)
soup = BeautifulSoup(r.content, 'html.parser') # r.content is the raw bytes; BeautifulSoup works out the encoding
soup1 = BeautifulSoup(r.text, 'html.parser')   # r.text is a Unicode string, already decoded by requests
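
Why does the bytes-versus-string difference matter? Requests decodes .text using the encoding it infers from the HTTP headers, and when those are missing or wrong the text comes out garbled before BeautifulSoup ever sees it. The sketch below imitates that situation offline with a made-up one-paragraph page: the bytes stand in for r.content, and the deliberately wrong latin-1 decoding stands in for a badly decoded r.text.

```python
from bs4 import BeautifulSoup

# A tiny UTF-8 page, as the raw bytes a server would send (like r.content)
raw = '<html><head><meta charset="utf-8"></head><body><p>café</p></body></html>'.encode('utf-8')

# Given bytes, BeautifulSoup detects the encoding itself
soup_from_bytes = BeautifulSoup(raw, 'html.parser')
print(soup_from_bytes.p.string)

# Given a string that was decoded with the wrong encoding,
# the damage is already done and the parser cannot undo it
wrong_text = raw.decode('latin-1')
soup_from_text = BeautifulSoup(wrong_text, 'html.parser')
print(soup_from_text.p.string)
```

This is one reason many scraping scripts pass r.content rather than r.text to BeautifulSoup.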

In [None]:
all_rows = soup.find_all('tr')
all_tables = soup.find_all('table')

In [None]:
# the find_all method has an argument attrs 
# that takes a dictionary with attribute names and values

# find the 'table' elements that have an attribute 'id' with value 'stats'
main_table = soup.find_all('table', attrs={"id" : "stats"})[0]  # there is only one, pick it
main_body = main_table.find('tbody')
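
The attrs dictionary is not the only way to filter by attribute. Here is a short sketch of some equivalent spellings, run against a made-up two-table snippet (the ids "summary" and "stats" and the class "sortable" are assumptions for the demo, not the real page's markup).

```python
from bs4 import BeautifulSoup

html = '''
<table id="summary"><tr><td>skip me</td></tr></table>
<table id="stats" class="sortable"><tbody><tr><td>keep me</td></tr></tbody></table>
'''
soup = BeautifulSoup(html, 'html.parser')

# Three ways to reach the same element
by_attrs = soup.find_all('table', attrs={'id': 'stats'})[0]
by_keyword = soup.find('table', id='stats')          # attribute as a keyword argument
by_class = soup.find('table', class_='sortable')     # class_ because class is a Python keyword

print(by_attrs is by_keyword, by_keyword is by_class)
print(by_keyword.find('td').string)

# find returns None when nothing matches, so check before using the result
print(soup.find('table', id='missing'))
```

Using find instead of find_all(...)[0] also avoids an IndexError when the table is missing; you get None back and can test for it.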

In [None]:
main_body

In [None]:
print(main_body.prettify())

In [None]:
# Useful to look at this in parallel with "inspecting" the table in Google Chrome

import re
url = "http://www.basketball-reference.com/draft/NBA_2003.html"

# Get the base URL: include first // but stop before further /
# r'[^/]+\/\/[^/]+' = match one or more characters that are not a slash, 
#                     match 2 slashes, match one or more characters that are not a slash,
base_url = re.match(r'[^/]+\/\/[^/]+', url).group(0)  
# Yields "http://www.basketball-reference.com"

r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

# method find finds the first occurrence of the tag; returns None if tag not present
main_table = soup.find('table', attrs={'id': 'stats'}) # re-select the table from the freshly parsed soup
main_body = main_table.find('tbody') # there is only one <tbody> tag in the table, so OK

for player in main_body.find_all('tr'):
    player_data = player.find_all('td')
    if len(player_data) > 0:
        player_a_tag = player_data[2].find('a') # Third <td> cell has the player name and URL
        # print(player_a_tag)
        if player_a_tag is not None:
            player_name = player_a_tag.string
            player_link = player_a_tag.attrs['href'] # get URL of the player
        else:
            # If there is no link, fall back to the cell text and the empty string
            player_name = player_data[2].get_text()
            player_link = ''
        print(f"{player_name}: {base_url}{player_link}")
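
As an alternative to the regular expression, the standard library's urllib.parse can split and join URLs for us. A minimal sketch follows; the player href here is a hypothetical placeholder standing in for a value scraped from an <a> tag.

```python
from urllib.parse import urlsplit, urljoin

url = "http://www.basketball-reference.com/draft/NBA_2003.html"

# Split the URL into its components and rebuild just scheme + host
parts = urlsplit(url)
base_url = f"{parts.scheme}://{parts.netloc}"
print(base_url)

# Hypothetical href, standing in for player_a_tag.attrs['href']
player_link = '/players/x/example01.html'

# urljoin resolves a relative link against the page it was found on
full_link = urljoin(url, player_link)
print(full_link)
```

urljoin also copes with links that are already absolute or relative to the current directory, which plain string concatenation does not.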