## Making HTTP Requests

In this exercise, we will learn how to perform HTTP requests in the notebook and how to analyze and interact with the HTML response data.

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import pdir
import re


Let's prepare a request. We use the `Request` class to prepare a `GET` request to the [airlinequality.com](https://www.airlinequality.com/airline-reviews/kenya-airways/) page. A `GET` request is a request to fetch, or 'get', the content of a web page. Running `req?` prints the docstring for the prepared request `req`. Looking at its usage, we can see how the request can be sent using a session. This is similar to opening a web browser (starting a session) and then requesting a URL.

In [2]:
url = 'https://www.airlinequality.com/airline-reviews/kenya-airways/'
req = requests.Request('GET', url)
req = req.prepare()
req

<PreparedRequest [GET]>
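As a minimal sketch of what preparing a request does, the snippet below builds a request with query parameters and a header and then inspects the prepared result. No network traffic is involved. The `page` parameter and the `User-Agent` value are hypothetical and are only there to show how `prepare()` encodes them.

```python
import requests

# Hypothetical query parameter and header, used only for illustration
req = requests.Request(
    'GET',
    'https://www.airlinequality.com/airline-reviews/kenya-airways/',
    params={'page': 2},
    headers={'User-Agent': 'notebook-example'},
)
prepared = req.prepare()

# The prepared request holds the exact method, URL, and headers to be sent
print(prepared.method)
print(prepared.url)
print(prepared.headers['User-Agent'])
```

Inspecting a prepared request like this can help when debugging, because it shows exactly what a session would send.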

Next, we send the request and store the response in a variable named `resp`. The `with` statement initializes a session whose scope is limited to the indented code block. This means we don't have to worry about explicitly closing the session, as this is done automatically. Running `resp` and `resp.status_code` helps us investigate the response. The string representation of the response should indicate a 200 status code.

In [3]:
with requests.Session() as sess:
    resp = sess.send(req)
    
print(resp)
print(resp.status_code)

<Response [200]>
200
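Rather than reading the status code by eye, one option is to let `requests` raise an error for unsuccessful responses. The sketch below uses hand-built `Response` objects so that it runs offline; `check_response` is a hypothetical helper name, not part of the library.

```python
import requests

def check_response(resp):
    """Return the response if it succeeded; raise requests.HTTPError otherwise."""
    resp.raise_for_status()  # raises for 4xx and 5xx status codes
    return resp

# Hand-built responses stand in for real ones so the example runs offline
ok = requests.Response()
ok.status_code = 200

missing = requests.Response()
missing.status_code = 404

print(check_response(ok).status_code)
try:
    check_response(missing)
except requests.HTTPError as err:
    print('Request failed:', err)
```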


Then we assign the response text to the `page_html` variable and take a look at the first 300 characters of the string.

In [4]:
page_html = resp.text
page_html[:300]

'<!doctype html>\n\n<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7 lt-ie10" lang="en-GB"> <![endif]-->\n<!--[if IE 7]>    <html class="no-js lt-ie9 lt-ie8 lt-ie10" lang="en-GB"> <![endif]-->\n<!--[if IE 8]>    <html class="no-js lt-ie9 lt-ie10" lang="en-GB"> <![endif]-->\n<!--[if IE 9]>    <htm'

We can format the output above with the help of `BeautifulSoup`, a library used extensively for HTML parsing.

In [5]:
print(BeautifulSoup(page_html, 'html.parser').prettify()[:600])

<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7 lt-ie10" lang="en-GB"> <![endif]-->
<!--[if IE 7]>    <html class="no-js lt-ie9 lt-ie8 lt-ie10" lang="en-GB"> <![endif]-->
<!--[if IE 8]>    <html class="no-js lt-ie9 lt-ie10" lang="en-GB"> <![endif]-->
<!--[if IE 9]>    <html class="no-js lt-ie10" lang="en-GB"> <![endif]-->
<!--[if gt IE 8]><!-->
<html lang="en-GB">
 <!--<![endif]-->
 <head>
  <meta charset="utf-8"/>
  <title>
   Kenya Airways Customer Reviews - SKYTRAX
  </title>
  <!-- Google Chrome Frame for IE -->
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-


We can take this a step further and display the HTML in Jupyter by using the `IPython.display` module. Here, we can see the HTML rendered as well as possible, given that no JavaScript code has been run and no external resources have been loaded. For example, the images that are hosted on the [airlinequality.com](https://www.airlinequality.com/airline-reviews/kenya-airways/) server are not rendered. Instead, we see their alternate text, that is, squares standing in for Kenya Airways photos, ads, and so on.

In [6]:
from IPython.display import HTML
HTML(page_html)

(Rendered HTML output omitted. It consists of the page's review rating tables, one per review, with rows such as Type Of Traveller, Seat Type, Route, Date Flown, Seat Comfort, Cabin Staff Service, Food & Beverages, Inflight Entertainment, Ground Service, Value For Money, and Recommended. The star ratings do not render meaningfully as plain text.)

Previously, we prepared a request and then used a session to send it. The shorthand function `requests.get()` does both in one step, as shown below. Note that it should show a 200 status code to indicate a successful response to our request.

In [7]:
url = 'http://www.python.org/'
resp = requests.get(url)
resp

<Response [200]>

To print the URL of the page, we can use `resp.url`, and to see any redirects we can use `resp.history`. Note that the URL that's returned is not what we input: we were redirected to a secure (HTTPS) URL. Any redirects are stored in the `.history` attribute. In this case, it holds one response with the status code 301 (permanent redirect), corresponding to the original URL that was requested.

In [8]:
print(resp.url)
print(resp.history)

https://www.python.org/
[<Response [301]>]
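To make the redirect chain easier to read, the hops in `.history` can be listed together with the final response. The following is a small sketch; `redirect_chain` is a hypothetical helper, and the responses are built by hand to mimic the redirect above so that the code runs without a network connection.

```python
import requests

def redirect_chain(resp):
    """Return (status_code, url) for each redirect hop plus the final response."""
    return [(r.status_code, r.url) for r in [*resp.history, resp]]

# Hand-built responses mimic the python.org redirect so the sketch runs offline
redirect = requests.Response()
redirect.status_code = 301
redirect.url = 'http://www.python.org/'

final = requests.Response()
final.status_code = 200
final.url = 'https://www.python.org/'
final.history = [redirect]

for status, hop_url in redirect_chain(final):
    print(status, hop_url)
```

With a real response, `redirect_chain(requests.get(url))` would be used in the same way.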


## Making API Calls

API calls allow us to access well-structured data on demand. Here, we'll work with the Wikipedia API as a way of learning how APIs generally work. We'll make an API request and ingest the JSON response data. Let's begin by running the code below to define our API request URL. Note that the backslashes are used to split the code across multiple lines, while the forward slashes are part of the URL. Basically, we're requesting the resource that satisfies a set of parameters, such as `action`, `page`, `section`, and so on. Notice that we've explicitly requested a response in JSON format by appending `&format=json` to the URL. These parameters are specific to the Wikipedia API, but many APIs work in a similar way.

In [9]:
url = ('https://en.wikipedia.org/w/api.php?'\
       'action=parse' \
       '&page=List_of_countries_by_central_bank_interest_rates' \
       '&section=1' \
       '&prop=wikitext' \
       '&format=json')
url

'https://en.wikipedia.org/w/api.php?action=parse&page=List_of_countries_by_central_bank_interest_rates&section=1&prop=wikitext&format=json'
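As an alternative to concatenating the query string by hand, the same URL can be assembled from a dictionary of parameters. This sketch uses `urlencode` from the standard library; the names `base_url`, `params`, and `built_url` are illustrative only.

```python
from urllib.parse import urlencode

base_url = 'https://en.wikipedia.org/w/api.php'

# The same parameters as in the hand-written URL, kept in insertion order
params = {
    'action': 'parse',
    'page': 'List_of_countries_by_central_bank_interest_rates',
    'section': 1,
    'prop': 'wikitext',
    'format': 'json',
}

# urlencode joins the pairs with '&' and escapes any unsafe characters
built_url = base_url + '?' + urlencode(params)
print(built_url)
```

Keeping the parameters in a dictionary makes it easier to change one of them, and `requests.get(base_url, params=params)` accepts the same dictionary directly.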

Next we can make the API request. Running `api_resp.text[:100]` shows the first 100 characters of the response string. Notice how the string appears to represent JSON data, which is what we asked for when making the request.

In [None]:
api_resp = requests.get(url)
print(api_resp)
api_resp.text[:100]

We can now convert the response into a Python dictionary by using `.json()`. Note that there are some nested fields in the data, such as `parse`, `pageid`, and `wikitext`. Here, we'll just list the top-level keys of `data` to keep the output short.

In [None]:
data = api_resp.json()
print(type(data))
data.keys()

Next, we extract the page title from the API response data:

In [None]:
data['parse']['title']
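The nested lookups can be tried on a small stand-in payload. The JSON string below is hypothetical and only mirrors the nesting described above (`parse`, then `title`, `pageid`, and `wikitext`); it is not real API output.

```python
import json

# Hypothetical payload shaped like the response described above
sample_text = (
    '{"parse": {"title": "Example page", "pageid": 1, '
    '"wikitext": {"*": "placeholder"}}}'
)
sample = json.loads(sample_text)

# Direct indexing works when the keys are known to exist
title = sample['parse']['title']

# .get() with defaults avoids a KeyError if a nested field is missing
wikitext = sample.get('parse', {}).get('wikitext', {}).get('*', '')
missing = sample.get('parse', {}).get('langlinks', [])

print(title, wikitext, missing)
```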

Here is how we can extract a row from the table contained in the API response data. Note that we extract the table from the response data as a `wikitext` string and then separate the rows by splitting on `|-`. Ideally, the table data returned from Wikipedia's free API would be in a nicer format for us to ingest programmatically, but that is not the case here.

In [None]:
row_idx = 16

wikitext = data['parse']['wikitext']['*']
table_row = wikitext.split('|-')[row_idx]
table_row

We can then parse the data from the row using regular expressions. Here, we output a country's name. Ideally, an API would make this data readily available to the application using it. In this situation, for Wikipedia, we can still get to the data fairly easily by extracting the field between `flag|` and `}`. For example, this pattern extracts `Canada` from `{{flag|Canada}}`.

In [None]:
# A raw string keeps the backslash in the pattern from being treated as a string escape
re.findall(r'flag\|([^}]+)}', table_row)

Some data is easier to extract using Python string methods such as `.split()` and `.strip()` rather than regular expressions. For instance, we can run the following command to get the interest rate for our extracted row. By iterating over all of the rows in the API response data, we can apply this extraction to each row and pull out all of the data for the requested table resource.

In [None]:
table_row.split('||')[1].strip()
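The iteration over all rows might look like the sketch below. The wikitext sample is made up for illustration (the country names and rates are placeholders, not real data), and it assumes that each data row contains a `{{flag|...}}` template followed by `||` and the rate.

```python
import re

# Made-up wikitext in the same general shape as the table; not real data
sample_wikitext = (
    '{| class="wikitable"\n'
    '! Country !! Rate\n'
    '|-\n'
    '| {{flag|Exampleland}} || 1.50\n'
    '|-\n'
    '| {{flag|Sampleland}} || 0.25\n'
    '|-\n'
    '|}'
)

rates = {}
for row in sample_wikitext.split('|-'):
    countries = re.findall(r'flag\|([^}]+)}', row)
    if not countries:
        continue  # skip the header and the table's closing marker
    rates[countries[0]] = row.split('||')[1].strip()

print(rates)
```

Rows in the real table may contain extra columns or templates, so the extraction would likely need adjusting once the actual wikitext has been inspected.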

## Parsing HTML

We'll scrape the review content of Kenya Airways. 

In [None]:
url = 'https://www.airlinequality.com/airline-reviews/kenya-airways/'
resp = requests.get(url)
print(resp.url, resp.status_code)

Then we'll load the HTML as a `BeautifulSoup` object so that it can be parsed. So far we have used Python's built-in `'html.parser'` as the parser, but other parsing libraries such as `lxml` can be installed and used instead. The advantage of `lxml` over `html.parser` is that it is generally better at parsing messy or malformed HTML code. That is, it is forgiving and fixes problems like unclosed tags, improperly nested tags, and missing head or body tags. It is also somewhat faster than `html.parser`. However, the speed is not necessarily an advantage in web scraping, where the bottleneck is usually the speed of the network itself. One disadvantage of `lxml` is that it has to be installed separately and depends on third-party C libraries to function. This can make it less portable and harder to set up than `html.parser`.

In [None]:
soup_lxml = BeautifulSoup(resp.content, 'lxml')

`html5lib` is another popular HTML parser. Just like `lxml`, it is an extremely forgiving parser that even corrects broken HTML. The downsides are that it is also an external dependency and that it is slower than both `lxml` and `html.parser`. Despite this, it can be a good choice when working with messy or handwritten HTML sites.

In [None]:
soup_5lib = BeautifulSoup(resp.content, 'html5lib')




Because each HTML parser interprets documents differently, the final `BeautifulSoup` object may differ depending on which is utilized. Here we'll just use the `html.parser`. 

In [None]:
soup = BeautifulSoup(resp.content, 'html.parser')
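Since `lxml` and `html5lib` may not be installed everywhere, one possible pattern is to try the preferred parsers in order and fall back to the built-in one. This is only a sketch: `make_soup` and its `preferred` argument are hypothetical names, and it relies on `BeautifulSoup` raising `FeatureNotFound` when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=('lxml', 'html5lib', 'html.parser')):
    """Parse markup with the first available parser; return (soup, parser_name)."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError('None of the requested parsers is available')

# An unclosed <p> tag, which every parser repairs in its own way
example_soup, parser_used = make_soup('<p>Unclosed paragraph')
print(parser_used)
print(example_soup.find('p').get_text())
```

Because the parsers can build slightly different trees from the same broken markup, it is worth recording which one was used.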

Also, we can pull up the docstring of the `BeautifulSoup` object by using `soup?`, or use the built-in `dir()` function, which lists the attributes and methods of an object. These include some that we'll use later, such as `find_all`, `attrs`, and `text`.

In [None]:
dir(soup)[180:]

Still, this is not particularly informative. Therefore, we'll use the `pdir` library to obtain information about Python objects.
Note that the package is listed as `pdir2` on the Python Package Index (PyPI) but is imported as `pdir`. Notice how the methods and attributes have been organized into groupings, with descriptions included where applicable.
Let's pay particular attention to the `.find_all()` method.

In [None]:
pdir(soup)

Here is how we get the `h1` heading from the page. Usually, pages have only one `h1` (top-level heading) element, so we get only one here.

In [None]:
h1 = soup.find_all('h1')
h1[0]

Previously, we identified the HTML element that contains our data, but the field still needs to be extracted as a string. Therefore, we can get the HTML element's attributes and text as shown below. Basically, `h1[0]` is the first (and only) list element, and to get the element's attributes we use `.attrs`. Here we see the `itemprop` attribute, which, like `class` and `id`, can be targeted from CSS stylesheets. To get the text we use `.text`.

In [None]:
print('Attribute: ', h1[0].attrs)
print('Text: ', h1[0].text)

### `find()` and `find_all()`

These two BeautifulSoup methods will be used a lot. With them we can filter HTML pages to find a list of desired tags, or a single tag, based on their various attributes. For example, here we use `.find_all()` to see the number of extracted image tags. From the result we can see that there are 27 images.

In [None]:
imgs = soup.find_all('img')
len(imgs)

Most of the images are from SkyTrax and we can see them by printing the source of each image. This will output the path of each image resource.

In [None]:
for element in imgs:
    if 'src' in element.attrs:
        print(element.attrs['src'])
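The attribute check can also be pushed into `find_all()` itself by passing `src=True`, which keeps only tags that have a `src` attribute. The sketch below runs on hypothetical inline markup rather than the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second image has no src attribute
html = '''
<img src="/images/logo.png" alt="logo">
<img alt="placeholder without a source">
<img src="/images/banner.jpg">
'''
soup = BeautifulSoup(html, 'html.parser')

# src=True matches only <img> tags that define a src attribute
sources = [img['src'] for img in soup.find_all('img', src=True)]
print(sources)
```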

For now, we'll just get the content of the first review. Here we'll be using the `.find()` method which is identical to `.find_all()` except that it returns only the first match. When calling this, we passed a second argument, `{'id': 'anchor810808'}` which follows the form `{attribute_name: attribute_value}`.

In [None]:
content = soup.find("div", {"id": "anchor810808"})
content
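To see how `.find()` and `.find_all()` relate, here is a short sketch on hypothetical markup. The `id` and `class` values are invented for the example. Note that `.find()` returns `None` when nothing matches, which is worth checking before chaining further calls.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with invented id and class values
html = '''
<div id="review-1" class="review">First review</div>
<div id="review-2" class="review">Second review</div>
'''
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('div', {'class': 'review'})         # first match only
by_id = soup.find('div', id='review-2')               # keyword form of the filter
every = soup.find_all('div', class_='review')         # list of all matches
nothing = soup.find('div', {'id': 'does-not-exist'})  # no match gives None

print(first.text, by_id.text, len(every), nothing)
```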

Having narrowed down the content of interest, for now let's focus on the table content. Usually, tables are organized into headers (`<th>`), rows (`<tr>`) and data entries (`<td>`). In this case, there are no table headings.

In [None]:
table_head = content.find_all('th')
table_head

However, there are rows, and the data in these rows is what we are interested in.

In [None]:
table_rows = content.find_all('tr')
table_data = content.find_all('td')
table_data[:5]

The next step is parsing the table data as plain text from the list of HTML elements.

In [None]:
for i, t in enumerate(table_data[:6]):
    print(i, t.text.strip())
    print('-'*20)

From the previous results, it is evident that the cells alternate: each label is followed by its value. Therefore, to get the output in the required form, we search for all the rows and then, within each row, for its data elements. The results are placed in a dictionary that maps each label to its value. However, we still have an issue with the number ratings. We'll look at how to fix these later on.

In [None]:
rating_details = {}

# Each row holds a label cell followed by a value cell
for row in content.find_all('tr'):
    cells = row.find_all('td')
    rating_details[cells[0].text] = cells[1].text

rating_details
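Because the cells alternate between label and value, the flat list of `td` elements can also be paired up by slicing. The sketch below uses a tiny inline table that reuses two label and value pairs from the rendered output earlier; it assumes every row has exactly two cells.

```python
from bs4 import BeautifulSoup

# A two-row table in the same label/value layout as the review tables
html = (
    '<table>'
    '<tr><td>Seat Type</td><td>Economy Class</td></tr>'
    '<tr><td>Route</td><td>Dubai to Nairobi</td></tr>'
    '</table>'
)
cells = BeautifulSoup(html, 'html.parser').find_all('td')

# Even positions hold labels and odd positions hold the matching values
labels = [cell.get_text(strip=True) for cell in cells[::2]]
values = [cell.get_text(strip=True) for cell in cells[1::2]]
details = dict(zip(labels, values))
print(details)
```

The row-by-row approach is safer when some rows might have a different number of cells.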

## More on Navigating Trees

Before diving deep into writing web crawlers, let's look at another site for online shopping so as to get some insight that could be useful in our future web-scraping. The `.find_all()` function is used to find tags based on their name and attributes. However, we may want to find a tag based on a specified location in a document and that's where tree navigation comes in handy. 

In [None]:
shopping_url = "https://www.pythonscraping.com/pages/page3.html"
online_shop = BeautifulSoup(urlopen(shopping_url), 'html.parser')

### Dealing with Children and Other Descendants

Just like in a human family tree, children are always exactly one tag below a parent, whereas descendants can be at any level in the tree below a parent. For instance, in a table, the `tr` tags are children of the `table` tag, while `tr`, `th`, `td`, `img`, and `span` are all descendants of the `table` tag. In other words, all children are descendants, but not all descendants are children. In general, BeautifulSoup functions deal with the descendants of the currently selected tag. If we only want the descendants that are children, we can use the `.children` attribute. Here we see a list of product rows in the `giftList` table, including the initial row of column labels.

In [None]:
for child in (online_shop.
              find('table', {'id': 'giftList'})
              .children):
    print(child)

Note that using `.descendants` instead of `.children` results in about two dozen tags within the table being printed, which is why it is important to differentiate between children and descendants.

In [None]:
for descendant in (online_shop
                   .find('table', {'id': 'giftList'})
                   .descendants):
    print(descendant.text)
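The difference is easy to verify on a small table. The markup below is a hypothetical, whitespace-free stand-in for the `giftList` table (the item and price are invented), so that only tags, and no whitespace strings, appear among the children.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical stand-in for the giftList table; item and price are invented
html = (
    '<table id="giftList">'
    '<tr><th>Item</th><th>Price</th></tr>'
    '<tr><td>Gift</td><td><span>$1.00</span></td></tr>'
    '</table>'
)
table = BeautifulSoup(html, 'html.parser').find('table', {'id': 'giftList'})

# Children are one level below the table
child_names = [child.name for child in table.children]

# Descendants include every nested tag; text nodes are filtered out here
descendant_names = [d.name for d in table.descendants if isinstance(d, Tag)]

print(child_names)
print(descendant_names)
```

On a real page, the whitespace between tags shows up as extra string nodes in both `.children` and `.descendants`, which is why the `isinstance(d, Tag)` filter can be useful.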

### Dealing with Siblings

In the next example we use the `.next_siblings` attribute, which makes it trivial to collect data from tables, especially ones with title rows. The code below gets all rows of products from the product table except the first title row. Objects cannot be siblings with themselves, so the title row is skipped. As the name implies, only the siblings that follow are returned. So, by selecting the title row and calling `.next_siblings`, we select all the rows in the table without selecting the title row itself.

In [None]:
for siblings in (online_shop.
                 find('table', {'id': 'giftList'})
                 .tr
                 .next_siblings):
    print(siblings)
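Here is a compact sketch of sibling navigation on hypothetical markup (the gift names are invented). It also shows that `.previous_siblings` walks backwards, starting with the nearest sibling.

```python
from bs4 import BeautifulSoup

# Hypothetical three-row table without whitespace between the rows
html = (
    '<table id="giftList">'
    '<tr><th>Item</th></tr>'
    '<tr><td>First gift</td></tr>'
    '<tr><td>Second gift</td></tr>'
    '</table>'
)
table = BeautifulSoup(html, 'html.parser').find('table', {'id': 'giftList'})

header = table.tr  # the title row
following = [row.get_text() for row in header.next_siblings]

last = table.find_all('tr')[-1]
preceding = [row.get_text() for row in last.previous_siblings]

print(following)
print(preceding)
```

On the real page, whitespace between the rows also appears as siblings, so some of the items yielded are plain strings rather than tags.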

As a complement to `.next_siblings`, the `.previous_siblings` attribute can be used if there is an easily selectable tag at the end of a list of sibling tags that we'd like to get. Additionally, `.next_sibling` and `.previous_sibling` return a single tag rather than a sequence of them.

### Dealing with Parents

When scraping pages, we'll likely discover that we need to find the parents of tags less often than we need to find their children or siblings. Occasionally, we may find ourselves in odd situations that require us to use `.parent` and `.parents`. For instance, in the code below, we print the price of the object represented by the first gift image. Basically, the first selection is the image tag where `src="../img/gifts/img1.jpg"`. Then we select the parent of that tag, which in this case is a `td` tag. Finally, we take the `previous_sibling` of the `td` tag and get the text within it.

In [None]:
(online_shop
 .find('img', {'src': '../img/gifts/img1.jpg'})
 .parent
 .previous_sibling
 .get_text())
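The same walk from an image to its price can be traced on a single hypothetical row. The item name and the price below are invented for the example, and the row is written without whitespace so that `.previous_sibling` lands directly on the price cell.

```python
from bs4 import BeautifulSoup

# Hypothetical row: name cell, price cell, then the cell holding the image
html = (
    '<table><tr>'
    '<td>Example gift</td>'
    '<td>$15.00</td>'
    '<td><img src="../img/gifts/img1.jpg"></td>'
    '</tr></table>'
)
soup = BeautifulSoup(html, 'html.parser')

img = soup.find('img', {'src': '../img/gifts/img1.jpg'})
cell = img.parent                         # the <td> that wraps the image
price = cell.previous_sibling.get_text()  # the sibling just before that <td>

# find_previous_sibling('td') skips any whitespace strings between cells
robust_price = cell.find_previous_sibling('td').get_text(strip=True)

print(price, robust_price)
```

When the markup contains line breaks between cells, `.previous_sibling` may return a whitespace string, so `find_previous_sibling('td')` is often the more dependable choice.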