# **BASICS**

## Making HTTP Requests

In this exercise, we will learn how to make HTTP requests from the notebook and how to analyze and interact with the HTML response data.

In [1]:
from urllib.request import urlopen
from urllib.error import URLError
from bs4 import BeautifulSoup
import requests
import datetime
import random
import pdir
import re

Let's prepare a request. We use the `Request()` class to prepare a `'GET'` request to the [airlinequality.com](https://www.airlinequality.com/airline-reviews/kenya-airways/) page. A `GET` request is a request to fetch, or 'get', the content of a web page. Running `req?` prints the docstring of the prepared request `req`. Looking at its usage, we can see how the request can be sent using a session. This is similar to opening a web browser (starting a session) and then requesting a URL.

In [2]:
url = 'https://www.airlinequality.com/airline-reviews/kenya-airways/'
req = requests.Request('GET', url)
req = req.prepare()
req

<PreparedRequest [GET]>

Next, we send the request and store the response in a variable named `resp`. The `with` statement initializes a session whose scope is limited to the indented code block. This means we don't have to worry about explicitly closing the session, as this is done automatically. Printing `resp` and `resp.status_code` lets us inspect the response. The string representation of the response should indicate a 200 status code.

In [3]:
with requests.Session() as sess:
    resp = sess.send(req)
    
print(resp)
print(resp.status_code)

<Response [200]>
200


Then we assign the response text to the `page_html` variable and take a look at the first 300 characters of the string.

In [4]:
page_html = resp.text
page_html[:300]

'<!doctype html>\n\n<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7 lt-ie10" lang="en-GB"> <![endif]-->\n<!--[if IE 7]>    <html class="no-js lt-ie9 lt-ie8 lt-ie10" lang="en-GB"> <![endif]-->\n<!--[if IE 8]>    <html class="no-js lt-ie9 lt-ie10" lang="en-GB"> <![endif]-->\n<!--[if IE 9]>    <htm'

We can format the output above with the help of `BeautifulSoup`: a library used extensively for HTML parsing.

In [5]:
print(BeautifulSoup(page_html, 'html.parser').prettify()[:600])

<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7 lt-ie10" lang="en-GB"> <![endif]-->
<!--[if IE 7]>    <html class="no-js lt-ie9 lt-ie8 lt-ie10" lang="en-GB"> <![endif]-->
<!--[if IE 8]>    <html class="no-js lt-ie9 lt-ie10" lang="en-GB"> <![endif]-->
<!--[if IE 9]>    <html class="no-js lt-ie10" lang="en-GB"> <![endif]-->
<!--[if gt IE 8]><!-->
<html lang="en-GB">
 <!--<![endif]-->
 <head>
  <meta charset="utf-8"/>
  <title>
   Kenya Airways Customer Reviews - SKYTRAX
  </title>
  <!-- Google Chrome Frame for IE -->
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-


We can take this a step further and display the HTML in Jupyter by using the `HTML` class from the `IPython.display` module. Here, we can see the HTML rendered as well as possible, given that no JavaScript code has been run and no external resources have been loaded. For example, the images that are hosted on the [airlinequality.com](https://www.airlinequality.com/airline-reviews/kenya-airways/) server are not rendered. Instead, we see the alternate text, that is, placeholders for Kenya Airways photos, ads, and so on.

In [6]:
from IPython.display import HTML
HTML(page_html)

0,1
Food & Beverages,12345
Inflight Entertainment,12345
Seat Comfort,12345
Staff Service,12345
Value for Money,12345

0,1
Type Of Traveller,Solo Leisure
Seat Type,Economy Class
Route,Lusaka to New York via Nairobi
Date Flown,September 2022
Seat Comfort,12345
Cabin Staff Service,12345
Food & Beverages,12345
Inflight Entertainment,12345
Ground Service,12345
Value For Money,12345

0,1
Type Of Traveller,Solo Leisure
Seat Type,Economy Class
Route,Dubai to Nairobi
Date Flown,September 2022
Seat Comfort,12345
Cabin Staff Service,12345
Food & Beverages,12345
Inflight Entertainment,12345
Ground Service,12345
Value For Money,12345

0,1
Type Of Traveller,Couple Leisure
Seat Type,Economy Class
Route,Harare to London via Nairobi / Amsterdam
Date Flown,October 2022
Seat Comfort,12345
Cabin Staff Service,12345
Food & Beverages,12345
Ground Service,12345
Value For Money,12345
Recommended,no

0,1
Type Of Traveller,Business
Seat Type,Economy Class
Route,Johannesburg to Jeddah via Nairobi
Date Flown,September 2022
Seat Comfort,12345
Cabin Staff Service,12345
Food & Beverages,12345
Ground Service,12345
Value For Money,12345
Recommended,no

0,1
Type Of Traveller,Solo Leisure
Seat Type,Economy Class
Route,Antananarivo to Mumbai via Nairobi
Date Flown,September 2022
Seat Comfort,12345
Cabin Staff Service,12345
Food & Beverages,12345
Inflight Entertainment,12345
Ground Service,12345
Wifi & Connectivity,12345

0,1
Aircraft,Boeing 787
Type Of Traveller,Couple Leisure
Seat Type,Business Class
Route,London Heathrow to Nairobi
Date Flown,July 2022
Seat Comfort,12345
Cabin Staff Service,12345
Food & Beverages,12345
Inflight Entertainment,12345
Ground Service,12345

0,1
Type Of Traveller,Family Leisure
Seat Type,Economy Class
Route,Nairobi to Johannesburg
Date Flown,July 2022
Seat Comfort,12345
Cabin Staff Service,12345
Food & Beverages,12345
Inflight Entertainment,12345
Ground Service,12345
Value For Money,12345

0,1
Type Of Traveller,Solo Leisure
Seat Type,Economy Class
Route,Douala to Dubai via Nairobi
Date Flown,July 2022
Seat Comfort,12345
Cabin Staff Service,12345
Food & Beverages,12345
Inflight Entertainment,12345
Ground Service,12345
Wifi & Connectivity,12345

0,1
Type Of Traveller,Family Leisure
Seat Type,Economy Class
Route,Antananarivo to Nairobi
Date Flown,July 2022
Seat Comfort,12345
Cabin Staff Service,12345
Food & Beverages,12345
Ground Service,12345
Value For Money,12345
Recommended,no

0,1
Type Of Traveller,Solo Leisure
Seat Type,Economy Class
Route,Bujumbura to Johannesburg via Nairobi
Date Flown,July 2022
Seat Comfort,12345
Cabin Staff Service,12345
Food & Beverages,12345
Inflight Entertainment,12345
Ground Service,12345
Value For Money,12345


Previously, we made a request by preparing it and then used a session to send it. This can be done using a shorthand method instead as shown below. Note that it should show a 200 status code to indicate a successful response to our request.

In [7]:
url = 'http://www.python.org/'
resp = requests.get(url)
resp

<Response [200]>

To print the final URL of our page we can use `resp.url`, and to see any redirects we use `resp.history`. Note that the URL that's returned is not the one we input: we've been redirected to a secure (HTTPS) URL. Any redirects are stored in the `.history` attribute. In this case, we find one response in there with the status code 301 (permanent redirect), corresponding to the original URL that was requested.

In [8]:
print(resp.url)
print(resp.history)

https://www.python.org/
[<Response [301]>]


## Making API Calls

API calls allow us to access well-structured data on demand. Here, we'll work with the Wikipedia API as a way of learning how APIs generally work. We'll make an API request and ingest the JSON response data. Let's begin by running the code below to define our API request URL. Note that the backslashes are used to split the code across multiple lines, while the forward slashes are part of the URL. Basically, we're requesting the resource that satisfies a set of parameters, such as `action`, `page`, `section`, and so on. Notice that we've explicitly requested a response in JSON format by appending `&format=json` to the URL. These parameters are specific to the Wikipedia API, but many APIs work in a similar way.

In [9]:
url = ('https://en.wikipedia.org/w/api.php?'\
       'action=parse' \
       '&page=List_of_countries_by_central_bank_interest_rates' \
       '&section=1' \
       '&prop=wikitext' \
       '&format=json')
url

'https://en.wikipedia.org/w/api.php?action=parse&page=List_of_countries_by_central_bank_interest_rates&section=1&prop=wikitext&format=json'
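
The same URL can also be assembled from a dictionary of parameters rather than by concatenating strings by hand. The sketch below uses the standard library's `urllib.parse.urlencode`; the names `base_url`, `params`, and `built_url` are illustrative and do not appear elsewhere in this notebook.

```python
from urllib.parse import urlencode

# Illustrative names; the parameter values are the ones used in the cell above.
base_url = 'https://en.wikipedia.org/w/api.php'
params = {
    'action': 'parse',
    'page': 'List_of_countries_by_central_bank_interest_rates',
    'section': 1,
    'prop': 'wikitext',
    'format': 'json',
}

# urlencode joins the key/value pairs with '&' and escapes special characters.
built_url = base_url + '?' + urlencode(params)
print(built_url)
```

`requests.get()` also accepts a `params` argument that performs this encoding for us, so `requests.get(base_url, params=params)` should request the same resource.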

Next we can make the API request. Evaluating `api_resp.text` shows the response string; slicing it, for example `api_resp.text[:100]`, would show only the first 100 characters. Notice how the string appears to represent JSON data, which is what we asked for when making the request.

In [10]:
api_resp = requests.get(url)
print(api_resp)
api_resp.text

<Response [200]>


'{"parse":{"title":"List of countries by central bank interest rates","pageid":20582369,"wikitext":{"*":"== List ==\\n{| class=\\"wikitable sortable\\" style=\\"text-align: center;\\"\\n|- bgcolor=\\"#ececec\\"\\n! Country or<br>currency union !! Central bank<br>interest rate (%) !! Date of last<br>change\\n! Average inflation rate<br>2013\\u20132017 (%)<br>by [[World Bank|WB]] and [[International Monetary Fund|IMF]]<ref>{{Cite web|url=https://data.worldbank.org/indicator/FP.CPI.TOTL.ZG?end=2017&start=1986&year_high_desc=false|title=Inflation, consumer prices (annual %) {{!}} Data|website=data.worldbank.org|access-date=2019-12-07}}</ref><ref>{{Cite web|url=https://www.imf.org/external/pubs/ft/weo/2018/02/weodata/weorept.aspx?pr.x=28&pr.y=13&sy=2017&ey=2017&scsm=1&ssd=1&sort=country&ds=.&br=1&c=512,213,626,628,636,453,643,646,648,732,923,299,474&s=PCPIPCH&grp=0&a=|title=Report for Selected Countries and Subjects|website=www.imf.org|access-date=2019-12-07}}</ref><br>as in the [[List of c

We can now convert the string into a Python dictionary by using `.json()`. Note that the data is nested: fields such as `title`, `pageid`, and `wikitext` sit under the top-level `parse` key. Here, we'll just print the top-level keys of `data` because the full output is long.

In [11]:
data = api_resp.json()
print(type(data))
data.keys()

<class 'dict'>


dict_keys(['parse'])

Next we extract the page title from the API response data by running the code below:

In [12]:
data['parse']['title']

'List of countries by central bank interest rates'
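
To see the nesting without calling the API, here is a small sketch that parses an abbreviated, hand-written sample shaped like the response above. It uses `json.loads`, which is roughly what `.json()` does with the response text; `sample_text` and `sample` are illustrative names, and the real payload is far longer.

```python
import json

# Abbreviated, hand-written sample shaped like the response shown earlier;
# the real wikitext value is much longer.
sample_text = (
    '{"parse": {"title": "List of countries by central bank interest rates", '
    '"pageid": 20582369, "wikitext": {"*": "== List =="}}}'
)

sample = json.loads(sample_text)

print(list(sample.keys()))               # top-level keys
print(list(sample['parse'].keys()))      # keys nested under 'parse'
print(sample['parse']['wikitext']['*'])  # the wikitext string itself
```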

Here is how we can extract a row from the table contained in the API response data. Note that we extracted the table from the response data as a `wikitext` string, and then separated the rows by splitting on `|-`. Ideally, the table data returned from Wikipedia's free API would be in a nicer format for us to ingest programmatically, which is not the case here.

In [13]:
row_idx = 16

wikitext = data['parse']['wikitext']['*']
table_row = wikitext.split('|-')[row_idx]
table_row

'                \n|align="left"| {{flag|Bulgaria}} || 0.00 || {{dts|format=dmy|2016-01-29}}<ref name="CentralBankNews"/>\n|0.12\n| -0.12\n|0.00\n'

We can then parse the data from the row using regular expressions. Here, we output a country's name. Ideally, an API would make this data readily available to the application using it. In this case, for Wikipedia, we can still get to the data fairly easily by extracting the field between `flag|` and `}`. Here, we extract `Bulgaria` from `{{flag|Bulgaria}}`.

In [14]:
re.findall(r'flag\|([^}]+)}', table_row)

['Bulgaria']

Some data is easier to extract using Python string methods such as `.split()` and `.strip()` rather than regular expressions. For instance, we can run the following command to get the interest rate for our extracted row. Therefore, by iterating over all of the rows in the API response data, we can apply this extraction to each and pull out all of the data for the requested table resource.

In [15]:
table_row.split('||')[1].strip()

'0.00'
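
The iteration described above can be sketched as follows. This is a minimal illustration that runs on a hand-written two-row sample rather than the real `wikitext`; the second row (`Exampleland`) is made up, and `parse_rows` is a hypothetical helper name.

```python
import re

# Hand-written sample shaped like the Bulgaria row shown above.
# The "Exampleland" row is invented purely for illustration.
sample_wikitext = (
    '{| class="wikitable"\n'
    '|-\n'
    '|align="left"| {{flag|Bulgaria}} || 0.00 || {{dts|format=dmy|2016-01-29}}\n'
    '|-\n'
    '|align="left"| {{flag|Exampleland}} || 1.50 || {{dts|format=dmy|2018-10-24}}\n'
    '|}'
)

def parse_rows(wikitext):
    """Return (country, rate) pairs for rows that contain a {{flag|...}} field."""
    rows = []
    for row in wikitext.split('|-'):
        country = re.findall(r'flag\|([^}]+)}', row)
        fields = row.split('||')
        # Skip chunks such as the table header that hold no country data.
        if country and len(fields) > 1:
            rows.append((country[0], fields[1].strip()))
    return rows

parsed = parse_rows(sample_wikitext)
print(parsed)
```

On the real data, `parse_rows(wikitext)` would be the call to try, though rows with a different layout may need extra handling.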

## Parsing HTML

We'll scrape the review content of Kenya Airways. 

In [16]:
url = 'https://www.airlinequality.com/airline-reviews/kenya-airways/'
resp = requests.get(url)
print(resp.url, resp.status_code)

https://www.airlinequality.com/airline-reviews/kenya-airways/ 200


Then we'll load the HTML as a `BeautifulSoup` object so that it can be parsed. Note that we are using Python's default `'html.parser'` as the parser, but other parsing libraries such as `lxml` can be installed and used instead. The advantage of `lxml` over `html.parser` is that it is generally better at parsing messy or malformed HTML code. That is, it is forgiving and fixes problems like unclosed tags, improperly nested tags, and missing head or body tags. It is also somewhat faster than `html.parser`. However, the speed is not necessarily an advantage in web scraping, where the bottleneck is usually the speed of the network itself. One disadvantage of `lxml` is that in some cases it has to be installed separately and depends on third-party C libraries to function. This can cause portability problems and makes it less convenient to use than `html.parser`.

In [17]:
soup_lxml = BeautifulSoup(resp.content, 'lxml')

`html5lib` is another popular HTML parser. Just like `lxml`, it is an extremely forgiving parser that even corrects broken HTML. The downsides are that it also requires an external package and is slower than both `lxml` and `html.parser`. Despite this, it can be a good choice when working with messy or handwritten HTML sites.

In [18]:
soup_5lib = BeautifulSoup(resp.content, 'html5lib')




Because each HTML parser interprets documents differently, the final `BeautifulSoup` object may differ depending on which is utilized. Here we'll just use the `html.parser`. 

In [19]:
soup = BeautifulSoup(resp.content, 'html.parser')

Also, we can pull up the docstring of the `BeautifulSoup` object by using `soup?`, or use the built-in `dir()` function, which lists the attributes and methods of an object. Some of these we'll use later, such as `find_all`, `attrs`, and `text`.

In [20]:
dir(soup)[180:]

['string',
 'string_container',
 'string_container_stack',
 'strings',
 'stripped_strings',
 'tagStack',
 'text',
 'unwrap',
 'wrap']

Still, this is not particularly informative. Therefore we'll use the `pdir` library to obtain information about Python objects.
Note that the package is imported as `pdir` despite being listed as `pdir2` on the Python Package Index (PyPI). Notice how the methods and attributes have been organized into groupings, with descriptions included where applicable.
Let's pay particular attention to the `.find_all()` method.

In [21]:
pdir(soup)

property:
special attribute:
    __class__, __dict__, __doc__, __module__, __weakref__
abstract class:
    __subclasshook__
object customization:
    __bool__, __format__, __hash__, __init__, __new__, __repr__, __sizeof__, __str__
rich comparison:
    __eq__, __ge__, __gt__, __le__, __lt__, __ne__
attribute access:
    __delattr__, __dir__, __getattr__, __getattribute__, __setattr__
class customization:
    __init_sub

Here is how we get the `h1` heading from the page. Usually, pages have only one `h1` (top-level heading) element, so we get only one here.

In [22]:
h1 = soup.find_all('h1')
h1[0]

<h1 itemprop="name">
											Kenya Airways										</h1>

Previously, we identified the HTML element that contains our data, but the field still needs to be extracted as a string. We can get the element's attributes and text as shown below. `h1[0]` is the first (and only) list element; `.attrs` returns its attributes and `.text` returns its text. Here the only attribute is `itemprop`, which can be targeted in CSS stylesheets with an attribute selector.

In [23]:
print('Attribute: ', h1[0].attrs)
print('Text: ', h1[0].text)

Attribute:  {'itemprop': 'name'}
Text:  
											Kenya Airways										
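
The text above is padded with tabs and newlines. A minimal sketch of two ways to trim it, run on a hand-written fragment that imitates the heading rather than on the live page (`fragment` and `heading` are illustrative names):

```python
from bs4 import BeautifulSoup

# Hand-written fragment imitating the whitespace-padded heading shown above.
fragment = '<h1 itemprop="name">\n\t\t\tKenya Airways\t\t\t</h1>'
heading = BeautifulSoup(fragment, 'html.parser').h1

print(repr(heading.text))                  # raw text, padding included
print(repr(heading.text.strip()))          # padding removed with str.strip()
print(repr(heading.get_text(strip=True)))  # same result via BeautifulSoup
```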


### `find()` and `find_all()`

These two functions in BeautifulSoup will be used a lot. With them we can filter HTML pages to find a list of desired tags, or a single tag, based on their various attributes. For example, here we use `.find_all()` to count the image tags on the page. The result shows that there are 27.

In [24]:
imgs = soup.find_all('img')
len(imgs)

27

Most of the images are from SkyTrax and we can see them by printing the source of each image. This will output the path of each image resource.

In [25]:
for element in imgs:
    if 'src' in element.attrs.keys():
        print(element.attrs['src'])

data:image/svg+xml,%3Csvg%20xmlns='http://www.w3.org/2000/svg'%20viewBox='0%200%200%200'%3E%3C/svg%3E
https://www.airlinequality.com/wp-content/themes/airlinequality2014new/library/images/skytrax.svg
data:image/svg+xml,%3Csvg%20xmlns='http://www.w3.org/2000/svg'%20viewBox='0%200%200%200'%3E%3C/svg%3E
https://www.airlinequality.com/wp-content/themes/airlinequality2014new/library/images/skytrax.svg
data:image/svg+xml,%3Csvg%20xmlns='http://www.w3.org/2000/svg'%20viewBox='0%200%20150%2029'%3E%3C/svg%3E
https://www.airlinequality.com/wp-content/uploads/2015/04/KENYA_1000-150x29.png
data:image/svg+xml,%3Csvg%20xmlns='http://www.w3.org/2000/svg'%20viewBox='0%200%200%200'%3E%3C/svg%3E
https://www.airlinequality.com/wp-content/themes/airlinequality2014new/library/images/skytrax-rating-airline-3.png
data:image/svg+xml,%3Csvg%20xmlns='http://www.w3.org/2000/svg'%20viewBox='0%200%200%200'%3E%3C/svg%3E
https://www.airlinequality.com/wp-content/uploads/2022/07/CAA5A116-1D0F-4A4C-B84D-2D23A2C7A4E8-5

For now, we'll just get the content of the first review. Here we'll be using the `.find()` method which is identical to `.find_all()` except that it returns only the first match. When calling this, we passed a second argument, `{'id': 'anchor810808'}` which follows the form `{attribute_name: attribute_value}`.

In [26]:
content = soup.find("div", {"id": "anchor810808"})
content

<div class="body" id="anchor810808">
<h2 class="text_header">"I had a very bad experience"</h2>
<h3 class="text_sub_header userStatusWrapper">
<span itemprop="author" itemscope="" itemtype="http://schema.org/Person">
<span itemprop="name">Winnie Mumba Edme</span></span> (United States) <time datetime="2022-10-24" itemprop="datePublished">24th October 2022</time></h3>
<div class="tc_mobile">
<div class="text_content" itemprop="reviewBody"><strong><a href="https://www.airlinequality.com/verified-reviews/"><em>Not Verified</em></a></strong> |  I was connecting from Lusaka to New York from Kenya I got off the plane I checked the board it said gate 16 for JFK New York so I sat there waiting for the boarding call . The flight for New York was supposed to departure at 11:35 I sat there for almost an hour with no boarding call. I asked one of the employees there what time the flight for New York was leaving she gave a vague answer than around 11:40 or so they was a last boarding call. The Gate

Having narrowed down the content of interest, for now let's focus on the table content. Usually, tables are organized into headers (`<th>`), rows (`<tr>`) and data entries (`<td>`). In this case, there are no table headings.

In [27]:
table_head = content.find_all('th')
table_head

[]

However, there are rows and of interest is the data in these rows.

In [28]:
table_rows = content.find_all('tr')
table_data = content.find_all('td')
table_data[:5]

[<td class="review-rating-header type_of_traveller">Type Of Traveller</td>,
 <td class="review-value">Solo Leisure</td>,
 <td class="review-rating-header cabin_flown">Seat Type</td>,
 <td class="review-value">Economy Class</td>,
 <td class="review-rating-header route">Route</td>]

The next step is parsing the table data as plain text from the list of HTML elements.

In [29]:
for i, t in enumerate(table_data[:6]):
    print(i, t.text.strip())
    print('-'*20)

0 Type Of Traveller
--------------------
1 Solo Leisure
--------------------
2 Seat Type
--------------------
3 Economy Class
--------------------
4 Route
--------------------
5 Lusaka to New York via Nairobi
--------------------


From the previous results, it is evident that each label cell is immediately followed by its value cell. Therefore, to get the output we need, we find all the rows and then, within each row, find the data cells: the first holds the label and the second holds the value. The results are placed in a dictionary. However, we still have an issue with the number ratings, which all read `12345`. We'll look at how to fix these later on.

In [30]:
rating_details = {}

# Each row holds two cells: a label followed by its value.
for row in content.find_all('tr'):
    cells = row.find_all('td')
    rating_details[cells[0].text] = cells[1].text

rating_details

{'Type Of Traveller': 'Solo Leisure',
 'Seat Type': 'Economy Class',
 'Route': 'Lusaka to New York via Nairobi',
 'Date Flown': 'September 2022',
 'Seat Comfort': '12345',
 'Cabin Staff Service': '12345',
 'Food & Beverages': '12345',
 'Inflight Entertainment': '12345',
 'Ground Service': '12345',
 'Value For Money': '12345',
 'Recommended': 'no'}
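
Because labels and values alternate in the flat list of cells, the same pairing can also be sketched with slicing and `zip`. This assumes every row has exactly two cells; `cell_texts` and `paired` are illustrative names, and the texts are copied from the output printed earlier.

```python
# Cell texts copied from the earlier output, in document order: label, value, ...
cell_texts = [
    'Type Of Traveller', 'Solo Leisure',
    'Seat Type', 'Economy Class',
    'Route', 'Lusaka to New York via Nairobi',
]

# Even positions hold labels, odd positions hold the matching values.
paired = dict(zip(cell_texts[0::2], cell_texts[1::2]))
print(paired)
```

With the live page, `cell_texts` could be built as `[t.text.strip() for t in table_data]`, under the same two-cells-per-row assumption.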

## More on Navigating Trees

Before diving deep into writing web crawlers, let's look at another site for online shopping so as to get some insight that could be useful in our future web-scraping. The `.find_all()` function is used to find tags based on their name and attributes. However, we may want to find a tag based on a specified location in a document and that's where tree navigation comes in handy. 

In [31]:
shopping_url = "https://www.pythonscraping.com/pages/page3.html"
online_shop = BeautifulSoup(urlopen(shopping_url), 'html.parser')

### Dealing with Children and Other Descendants

Just like in a human family tree, children are always exactly one tag below a parent, whereas descendants can be at any level in the tree below a parent. For instance, in our table the `tr` tags are children of the `table` tag, while `tr`, `th`, `td`, `img`, and `span` are all descendants of the `table` tag. In other words, all children are descendants, but not all descendants are children. In general, BeautifulSoup functions deal with the descendants of the currently selected tag. If we only want the descendants that are children, we can use the `.children` attribute. Here we see a list of product rows in the `giftList` table, including the initial row of column labels.

In [32]:
for child in (online_shop.
              find('table', {'id': 'giftList'})
              .children):
    print(child)



<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>


Note that using `.descendants` instead of `.children` visits about two dozen tags within the table, along with the text nodes inside them, which is why the output below repeats the same text at several levels. Hence the importance of differentiating between children and descendants.

In [33]:
for descendant in (online_shop
                   .find('table', {'id': 'giftList'})
                   .descendants):
    print(descendant.text)




Item Title

Description

Cost

Image


Item Title


Item Title


Description


Description


Cost


Cost


Image


Image




Vegetable Basket

This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
Now with super-colorful bell peppers!

$15.00




Vegetable Basket


Vegetable Basket


This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
Now with super-colorful bell peppers!


This vegetable basket is the perfect gift for your health conscious (or overweight) friends!

Now with super-colorful bell peppers!
Now with super-colorful bell peppers!



$15.00


$15.00












Russian Nesting Dolls

Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! 8 entire dolls per set! Octuple the presents!

$10,000.52




Russian Nesting Dolls


Russian Nesting Dolls


Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "pricele
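
The difference is easier to see on a tiny table. The sketch below uses a hand-written miniature of the gift table (not the real page) and keeps only tag nodes, since both iterators also yield text nodes. The names `mini_html`, `mini_table`, `child_tags`, and `descendant_tags` are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hand-written miniature of the gift table (not the real page).
mini_html = (
    '<table id="giftList">'
    '<tr><th>Item Title</th><th>Cost</th></tr>'
    '<tr id="gift1"><td>Vegetable Basket <span>New!</span></td><td>$15.00</td></tr>'
    '</table>'
)
mini_table = BeautifulSoup(mini_html, 'html.parser').find('table', {'id': 'giftList'})

# Keep only tags; both iterators also yield text nodes.
child_tags = [node.name for node in mini_table.children if isinstance(node, Tag)]
descendant_tags = [node.name for node in mini_table.descendants if isinstance(node, Tag)]

print(child_tags)       # only the rows directly under the table
print(descendant_tags)  # rows plus everything nested inside them
```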

### Dealing with Siblings

In the next example we use the `.next_siblings` attribute, which makes it trivial to collect data from tables, especially ones with title rows. The code below gets all rows of products from the product table except the first title row. Objects cannot be siblings with themselves, so the title row is skipped. As the name implies, only the siblings that follow are returned. So, by selecting the title row and calling `.next_siblings`, we select all the rows in the table without selecting the title row itself.

In [34]:
for siblings in (online_shop.
                 find('table', {'id': 'giftList'})
                 .tr
                 .next_siblings):
    print(siblings)



<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>


<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parr

As a complement to `.next_siblings`, the `.previous_siblings` attribute can be used if there is an easily selectable tag at the end of a list of sibling tags that we'd like to get. Additionally, we have `.next_sibling` and `.previous_sibling`, which return a single node rather than a sequence of them.
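
A minimal sketch of the four sibling attributes, run on a hand-written three-row table rather than the real page (all variable names here are illustrative):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hand-written miniature table: one title row followed by two product rows.
rows_html = (
    '<table id="giftList">'
    '<tr id="title"><th>Item Title</th></tr>'
    '<tr id="gift1"><td>Vegetable Basket</td></tr>'
    '<tr id="gift2"><td>Russian Nesting Dolls</td></tr>'
    '</table>'
)
title_row = BeautifulSoup(rows_html, 'html.parser').find('tr', {'id': 'title'})

# Plural form: iterate over every sibling that follows the title row.
following_ids = [row['id'] for row in title_row.next_siblings if isinstance(row, Tag)]

# Singular form: only the nearest neighbour (None when there is none).
nearest = title_row.next_sibling
last_row = nearest.next_sibling

# previous_siblings walks backwards, nearest sibling first.
preceding_ids = [row['id'] for row in last_row.previous_siblings if isinstance(row, Tag)]

print(following_ids)
print(nearest['id'], title_row.previous_sibling)
print(preceding_ids)
```

On real pages there is often whitespace between tags, so `.next_sibling` may return a text node rather than the next tag; the `isinstance(row, Tag)` filter guards against that in the loops.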

### Dealing with Parents

Most of the time when scraping pages, we'll find that we need the parents of tags less often than their children or siblings. Occasionally, we may find ourselves in odd situations that require `.parent` and `.parents`. For instance, in the code below, we print the price of the object represented by the first gift image. First we select the image tag where `src="../img/gifts/img1.jpg"`. Then we select the parent of that tag, a `td`. Finally, we take the `previous_sibling` of that `td` tag, which is the `td` holding the price, and get the text within it.

In [35]:
(online_shop
 .find('img', {'src': '../img/gifts/img1.jpg'})
 .parent
 .previous_sibling
 .get_text())

'\n$15.00\n'
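
The same walk can be broken into named steps. The sketch below runs on a hand-written single row imitating the first product row, and also shows `.parents`, which yields every ancestor in turn. Names such as `row_soup`, `image_cell`, and `price_cell` are illustrative.

```python
from bs4 import BeautifulSoup

# Hand-written single row imitating the first product row of the gift table.
row_html = (
    '<table><tr id="gift1">'
    '<td>Vegetable Basket</td>'
    '<td>\n$15.00\n</td>'
    '<td><img src="../img/gifts/img1.jpg"/></td>'
    '</tr></table>'
)
row_soup = BeautifulSoup(row_html, 'html.parser')

img_tag = row_soup.find('img', {'src': '../img/gifts/img1.jpg'})
image_cell = img_tag.parent               # the <td> wrapping the image
price_cell = image_cell.previous_sibling  # the <td> just before it
price = price_cell.get_text().strip()

# .parents walks all the way up: td, tr, table, then the document object.
ancestor_names = [ancestor.name for ancestor in img_tag.parents]

print(price)
print(ancestor_names)
```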

## Regular Expressions

Notice that the site has many product images, which take the form `<img src="../img/gifts/img1.jpg">`. If we want to grab the URLs of all the product images, we can start with the code below. However, we can see that there are extra images; in our case we have `logo.jpg`. Modern websites often have hidden images, blank images used for spacing and aligning elements, and other random image tags we may not be aware of.

In [36]:
online_shop.find_all('img')

[<img src="../img/gifts/logo.jpg" style="float:left;"/>,
 <img src="../img/gifts/img1.jpg"/>,
 <img src="../img/gifts/img2.jpg"/>,
 <img src="../img/gifts/img3.jpg"/>,
 <img src="../img/gifts/img4.jpg"/>,
 <img src="../img/gifts/img6.jpg"/>]

It is inadvisable to depend on the position of an element in a webpage because the layout may change. A more robust approach is to identify the tag by something specific to it. In our case, we'll look at the file path of the product images. With the code below, we get only the relative image paths that start with `../img/gifts/img` and end in `.jpg`.

In [37]:
img_regex = r'\.\./img/gifts/img.*\.jpg'
images = (online_shop.
          find_all('img', {'src': re.compile(img_regex)}))

for image in images:
    print(image['src'])

../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
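
To check what the pattern accepts without fetching the page, we can run it against the `src` values listed earlier. This is a small sketch using only the `re` module; `sources`, `product_pattern`, and `product_sources` are illustrative names. It uses `.search()`, which is how BeautifulSoup applies a compiled pattern passed as a filter.

```python
import re

# The src values from the find_all('img') output above.
sources = [
    '../img/gifts/logo.jpg',
    '../img/gifts/img1.jpg',
    '../img/gifts/img2.jpg',
    '../img/gifts/img3.jpg',
    '../img/gifts/img4.jpg',
    '../img/gifts/img6.jpg',
]

# Raw string so the backslashes reach the regex engine unchanged.
product_pattern = re.compile(r'\.\./img/gifts/img.*\.jpg')

product_sources = [src for src in sources if product_pattern.search(src)]
print(product_sources)
```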


In the previous code, notice that we've used `image['src']` to access an attribute of the `img` tag. This works because a tag object supports dictionary-style lookup of its attributes (the full dictionary is available as `image.attrs`), which makes retrieval and manipulation of attributes trivial.

## Lambda Expressions

Here we'll look at how `lambda` expressions can be useful in web scraping. BeautifulSoup allows us to pass certain functions as arguments to the `.find_all()` function. The only restriction is that these functions must take a tag object as an argument and return a boolean. Every tag object is evaluated by this function; tags that evaluate to `True` are returned, while the rest are discarded. More of this will be explored later on.
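
As a minimal sketch, the filter below keeps every tag that has exactly two attributes. It runs on a hand-written fragment, and the names `snippet`, `snippet_soup`, and `two_attr_tags` are illustrative.

```python
from bs4 import BeautifulSoup

# Hand-written fragment: the div and span carry two attributes each, the p only one.
snippet = (
    '<div id="a" class="x">'
    '<p id="b">one</p>'
    '<span class="y" title="t">two</span>'
    '</div>'
)
snippet_soup = BeautifulSoup(snippet, 'html.parser')

# The lambda receives each tag in turn and returns True for the ones to keep.
two_attr_tags = snippet_soup.find_all(lambda tag: len(tag.attrs) == 2)

print([tag.name for tag in two_attr_tags])
```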

# **WEB CRAWLERS**

It is time to write scrapers that traverse multiple pages and even multiple sites. At the core of web crawlers, there is an element of recursion. That is, they retrieve page contents for a URL, examine that page for another URL, and retrieve that page, ad infinitum. Note that with web crawlers we must be extremely conscientious about how much bandwidth we are using and make every effort to determine whether there's a way to lighten the target server's load.

In both Six Degrees of Wikipedia and Six Degrees of Kevin Bacon, the goal is to link two unlikely subjects (in the first case, Wikipedia articles that link to each other, and in the second case, actors appearing in the same film) by a chain containing no more than six total including the two original subjects. For example, Eric Idle appeared in *Dudley Do-Right* with Brendan Fraser, who appeared in *The Air I Breathe* with Kevin Bacon. In this case, the chain from Eric Idle to Kevin Bacon is only three subjects long. In this section, we'll build a Six Degrees of Wikipedia solution finder that takes the [Eric Idle page](https://en.wikipedia.org/wiki/Eric_Idle) and finds the fewest number of link clicks needed to reach the [Kevin Bacon page](https://en.wikipedia.org/wiki/Kevin_Bacon).

## Traversing a Single Domain

Let's begin by writing a script that retrieves an arbitrary page and produces a list of links on that page.

In [38]:
bacon = urlopen("https://en.wikipedia.org/wiki/Kevin_Bacon")
bacon_bs = BeautifulSoup(bacon, 'html.parser')

bacon_list = []
for link in bacon_bs.find_all('a'):
    if 'href' in link.attrs:
        bacon_list.append(link.attrs['href'])

If we look at the list of links produced we'll notice that all the articles we'd expect are there: "*Apollo 13*", "*Philadelphia*", "*Primetime Emmy Award*", and so on. However, there are certain things we don't want. In fact, Wikipedia is full of sidebar, footer, and header links that appear on every page, along with links to the category pages, talk pages, and other pages that do not contain different articles. 

In [39]:
bacon_list[:10]

['/wiki/Wikipedia:Protection_policy#semi',
 '#mw-head',
 '#searchInput',
 '/wiki/Kevin_Bacon_(disambiguation)',
 '/wiki/File:Kevin_Bacon_SDCC_2014.jpg',
 '/wiki/Philadelphia',
 '/wiki/Kevin_Bacon_filmography',
 '/wiki/Kyra_Sedgwick',
 '/wiki/Sosie_Bacon',
 '#cite_note-1']

If we take a closer look at the links that point to article pages as opposed to other internal pages, we'll see that they have the following three things in common:

1. They reside within the `div` with the `id` set to `'bodyContent'`. 

2. The URLs do not contain colons.

3. The URLs begin with `"/wiki/"`.

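With this information, we can revise the code to retrieve only the desired article links using the regular expression `^(/wiki/)((?!:).)*$`. The expression covers the second and third conditions, while the first is handled by searching only inside the `bodyContent` `div`.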

In [40]:
bacon_links_refined = []

for link in (bacon_bs
             .find('div', {'id': 'bodyContent'})
             .find_all('a', href=re.compile('^(/wiki/)((?!:).)*$'))):
    if 'href' in link.attrs:
        bacon_links_refined.append(link.attrs['href'])
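
To see what the regular expression accepts and rejects without fetching anything, we can apply it to the ten sample links printed earlier from `bacon_list`. This is a small offline check; the variable names `article_pattern`, `sample_hrefs`, and `article_hrefs` are chosen for this example only.

```python
import re

# Same pattern as in the cell above, written as a raw string
article_pattern = re.compile(r'^(/wiki/)((?!:).)*$')

# The first ten links shown in the output of bacon_list[:10]
sample_hrefs = [
    '/wiki/Wikipedia:Protection_policy#semi',
    '#mw-head',
    '#searchInput',
    '/wiki/Kevin_Bacon_(disambiguation)',
    '/wiki/File:Kevin_Bacon_SDCC_2014.jpg',
    '/wiki/Philadelphia',
    '/wiki/Kevin_Bacon_filmography',
    '/wiki/Kyra_Sedgwick',
    '/wiki/Sosie_Bacon',
    '#cite_note-1',
]

# Keep only hrefs that start with /wiki/ and contain no colon
article_hrefs = [href for href in sample_hrefs if article_pattern.search(href)]
print(article_hrefs)
```

The anchors starting with `#` fail the `/wiki/` prefix, and the `Wikipedia:` and `File:` links are rejected by the negative lookahead `(?!:)`, which refuses any position followed by a colon. Five of the ten links remain.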

However, a script that finds all article links in one hard-coded Wikipedia article is interesting yet of little use in practice. We need to transform the previous code into something like this:

1. A single function, `getLinks()`, which takes in a Wikipedia article URL of the form `'/wiki/<Article_Name>'` and returns a list of all linked article URLs in the same form. 

2. A main function that calls `getLinks()` with a starting article, chooses a random article link from the returned list, and calls `getLinks()` again, until we stop the program or until no article links are found on the new page.

So, the first thing we do is define the `getLinks()` function, which takes in an article URL of the form `'/wiki/...'`, prepends the Wikipedia domain name, `'https://en.wikipedia.org'`, and retrieves the `BeautifulSoup` object for the HTML at that URL. The function then extracts a list of article link tags, based on the parameters discussed previously, and returns them.

In [41]:
def getLinks(articleUrl):
    html = urlopen('https://en.wikipedia.org{}'.format(articleUrl))
    bs = BeautifulSoup(html, 'html.parser')
    return (bs.find('div', {'id': 'bodyContent'})
            .find_all('a', href=re.compile('^(/wiki/)((?!:).)*$')))

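Next we seed the random-number generator. Note that a fixed seed such as the `random.seed(5)` used here makes the sequence of random choices reproducible from run to run; to get a new and interesting random path through Wikipedia articles every time the program is run, seed with a changing value such as the current time instead. The main body of the program: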

1. Begins with setting a list of article link tags to the list of links in the initial page: `'https://en.wikipedia.org/wiki/Kevin_Bacon'`.

2. Goes into a loop, choosing a random article link tag from the page, extracting the `href` attribute from it, printing that URL, and getting a new list of links from the extracted URL.

In [42]:
random.seed(5)

try:
    links = getLinks('/wiki/Kevin_Bacon')
    while len(links) > 0:
        newArticle = links[random.randint(0, len(links)-1)].attrs['href']
        print(newArticle)
        links = getLinks(newArticle)
        
except TimeoutError as te:
    print(te)
except URLError as ue:
    print(ue)

/wiki/Anthony_Andrews
<urlopen error [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>
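
The effect of a fixed seed such as the `random.seed(5)` used above can be checked in isolation. The sketch below uses a separate `random.Random` instance so the global generator is left untouched; the helper `random_picks` and the placeholder links in `hrefs` are assumptions for this example.

```python
import random


def random_picks(seed, items, count):
    """Pick `count` random items using a generator seeded with `seed`."""
    rng = random.Random(seed)  # independent generator with its own state
    return [items[rng.randint(0, len(items) - 1)] for _ in range(count)]


hrefs = ['/wiki/A', '/wiki/B', '/wiki/C', '/wiki/D']  # placeholder links

first_run = random_picks(5, hrefs, 6)
second_run = random_picks(5, hrefs, 6)
print(first_run == second_run)  # True: the same seed gives the same choices
```

As long as the pages return the same links, a fixed seed makes the crawler take the same path on every run, which is handy for debugging. Creating the generator with `random.Random()` and no argument seeds it from an unpredictable source instead.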


There is a bit more to solving a Six Degrees of Wikipedia problem than building a scraper that goes from page to page. We must also be able to store and analyze the resulting data. More on this will be covered later on. It is also good to keep in mind that autonomous production code needs exception handling to cope with the many potential pitfalls that could arise. For example, in our case the run ended with a `URLError` because the connection attempt timed out: the host did not respond within the allowed time.
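
One possible way to make a crawler more tolerant of transient failures like this is to retry a failed request a few times before giving up. The sketch below is an assumption-laden illustration, not the notebook's method: `fetch_with_retries`, its parameters, and the `flaky_fetch` stand-in (which simulates two failures instead of touching the network) are all hypothetical.

```python
import time
from urllib.error import URLError


def fetch_with_retries(fetch, url, attempts=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on URLError or TimeoutError up to `attempts` times."""
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except (URLError, TimeoutError) as err:
            last_error = err
            if attempt < attempts:
                sleep(delay * attempt)  # wait a little longer after each failure
    raise last_error


# Stand-in for urlopen that fails twice and then succeeds (simulated, no network)
calls = []

def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise URLError('simulated timeout')
    return '<html></html>'


result = fetch_with_retries(flaky_fetch, '/wiki/Kevin_Bacon', sleep=lambda seconds: None)
print(len(calls))  # 3
```

In a real crawl, `fetch` could be a small function wrapping `urlopen`, and the default `sleep` would introduce actual pauses between attempts.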

## Crawling an Entire Site

We took a random tour through a website, going from link to link, in the preceding section. What if we need to catalog or search every page on a website in a methodical manner? It should be noted that crawling a whole site, especially a large one, is a memory-intensive procedure that is best suited to applications that have a database available to store crawling results. We may, however, investigate the behavior of these types of applications without executing them at full scale. More on how to run these applications with a database will be covered later on when we look at how to store data.