
**Introduction to Web Scraping**

In this notebook we will work through some introductory exercises at http://example.webscraping.com/ to get familiar with web scraping using BeautifulSoup.

In [1]:
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import requests
import time

Get the HTML text of a webpage using requests.get(url).text, then create a BeautifulSoup object by passing that text to BeautifulSoup.

In [2]:
requests.get('http://example.webscraping.com/')

<Response [200]>

In [3]:
url_text = requests.get('http://example.webscraping.com/').text
soup = BeautifulSoup(url_text, 'html.parser')  # name the parser explicitly to avoid a bs4 warning

BeautifulSoup parses HTML and lets us traverse the tree of nested tags (the 'HTML tree') that makes up each page.

**0)** Check the robots.txt at example.webscraping.com: it tells us how often we may send requests to the website.

They recommend a crawl-delay of 5 seconds.
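
The crawl-delay can also be read programmatically. The sketch below uses the standard library's urllib.robotparser. To stay runnable offline it parses a hypothetical SAMPLE_ROBOTS_TXT string rather than the site's real file, whose exact contents are not reproduced here; only the 5-second crawl-delay comes from the text above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only; the real file
# at example.webscraping.com may differ (the Disallow rule is made up).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /trap
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

crawl_delay = parser.crawl_delay('*')  # seconds to wait between requests
allowed = parser.can_fetch('*', '/places/default/index/1')
blocked = parser.can_fetch('*', '/trap')
print(crawl_delay, allowed, blocked)
```

To read the live file instead, call parser.set_url(...) with the site's robots.txt address followed by parser.read() before querying crawl_delay.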

**1)** Extract the title of example.webscraping.com

In [4]:
soup.find?

In [5]:
soup.find('h1')

<h1>
                    Example web scraping website
                    <small></small>
</h1>

In [6]:
title = soup.find('h1').text

In [7]:
title

'\n                    Example web scraping website\n                    \n'

In [8]:
title = title.strip()  # removes leading and trailing newlines and spaces in one call

In [9]:
title

'Example web scraping website'
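
Whitespace cleanup can also be folded into the extraction step. As a sketch, get_text(strip=True) strips each text fragment inside the tag before joining them. The HTML below is a trimmed copy of the h1 shown above, inlined so the example runs without a network request; sample_html and sample_soup are names introduced here.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the <h1> element from the page, inlined for offline use.
sample_html = """
<h1>
    Example web scraping website
    <small></small>
</h1>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
heading = sample_soup.find('h1')

raw_title = heading.text                    # keeps newlines and indentation
clean_title = heading.get_text(strip=True)  # strips each text fragment
print(repr(clean_title))
```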

**2)** Get the name of the first country to appear on http://example.webscraping.com/

We want to find structure in the webpage that lets us extract the name of this country without knowing its name beforehand. This will come in handy when scraping many webpages at a time.

In [10]:
soup.find_all('a')

[<a class="dropdown-toggle" data-toggle="dropdown" href="#" rel="nofollow">Log In</a>,
 <a href="/places/default/user/register?_next=/places/default/index" rel="nofollow"><i class="icon icon-user glyphicon glyphicon-user"></i> Sign Up</a>,
 <a href="/places/default/user/login?_next=/places/default/index" rel="nofollow"><i class="icon icon-off glyphicon glyphicon-off"></i> Log In</a>,
 <a href="/places/default/index">Home</a>,
 <a href="/places/default/search">Search</a>,
 <a href="/places/default/view/Afghanistan-1"><img src="/places/static/images/flags/af.png"/> Afghanistan</a>,
 <a href="/places/default/view/Aland-Islands-2"><img src="/places/static/images/flags/ax.png"/> Aland Islands</a>,
 <a href="/places/default/view/Albania-3"><img src="/places/static/images/flags/al.png"/> Albania</a>,
 <a href="/places/default/view/Algeria-4"><img src="/places/static/images/flags/dz.png"/> Algeria</a>,
 <a href="/places/default/view/American-Samoa-5"><img src="/places/static/images/flags/as.pn

In [11]:
soup.find_all('td')

[<td><div><a href="/places/default/view/Afghanistan-1"><img src="/places/static/images/flags/af.png"/> Afghanistan</a></div></td>,
 <td><div><a href="/places/default/view/Aland-Islands-2"><img src="/places/static/images/flags/ax.png"/> Aland Islands</a></div></td>,
 <td><div><a href="/places/default/view/Albania-3"><img src="/places/static/images/flags/al.png"/> Albania</a></div></td>,
 <td><div><a href="/places/default/view/Algeria-4"><img src="/places/static/images/flags/dz.png"/> Algeria</a></div></td>,
 <td><div><a href="/places/default/view/American-Samoa-5"><img src="/places/static/images/flags/as.png"/> American Samoa</a></div></td>,
 <td><div><a href="/places/default/view/Andorra-6"><img src="/places/static/images/flags/ad.png"/> Andorra</a></div></td>,
 <td><div><a href="/places/default/view/Angola-7"><img src="/places/static/images/flags/ao.png"/> Angola</a></div></td>,
 <td><div><a href="/places/default/view/Anguilla-8"><img src="/places/static/images/flags/ai.png"/> Anguill

In [12]:
soup.find_all('td')[0]

<td><div><a href="/places/default/view/Afghanistan-1"><img src="/places/static/images/flags/af.png"/> Afghanistan</a></div></td>

In [13]:
country = soup.find_all('td')[0].text
country

' Afghanistan'

In [14]:
country = country.strip()
country

'Afghanistan'
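
Indexing into find_all('td') works because every td on this page holds exactly one country. A slightly more explicit alternative, sketched here, is a CSS selector that targets the link inside each cell, which also gives access to the link's href. The two cells are copied from the output above; the surrounding table and tr tags, and the names sample_html and sample_soup, are assumptions added so the example runs offline.

```python
from bs4 import BeautifulSoup

# Two cells copied from the page output; the <table>/<tr> wrapper is assumed.
sample_html = """
<table>
<tr>
<td><div><a href="/places/default/view/Afghanistan-1"><img src="/places/static/images/flags/af.png"/> Afghanistan</a></div></td>
<td><div><a href="/places/default/view/Aland-Islands-2"><img src="/places/static/images/flags/ax.png"/> Aland Islands</a></div></td>
</tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# 'td a' matches every <a> nested anywhere inside a <td>.
links = sample_soup.select('td a')
countries = [(a.get_text(strip=True), a['href']) for a in links]

first_name, first_href = countries[0]
print(first_name, first_href)
```

Keeping the href alongside the name would make it possible to follow each link to the country's own page later on.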

**3)** Get the name of the third country to appear on the second page of countries at http://example.webscraping.com/

To complete this task, we have to navigate from http://example.webscraping.com/ to the second page of countries. We could do this manually, but we will build the URL programmatically so that the same approach works across multiple pages.

In [15]:
page_number = 2
# The index pages are numbered from 0, so the second page is at .../index/1
url = f'http://example.webscraping.com/places/default/index/{page_number - 1}'

In [16]:
url_text = requests.get(url).text
soup2 = BeautifulSoup(url_text, 'html.parser')

In [17]:
country = soup2.find_all('td')[2].text  # Gets the third 'td' element on the page
country

' Aruba'

In [18]:
country = country.strip()
country

'Aruba'
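
The index pages are numbered from zero, so the second page lives at .../index/1. A small helper makes that off-by-one explicit. This is only a sketch: page_url and BASE_URL are names introduced here, and the URL pattern is the one used in the cells above.

```python
BASE_URL = 'http://example.webscraping.com/places/default/index'


def page_url(page_number: int) -> str:
    """Return the URL of the page_number-th page of countries (1-based)."""
    if page_number < 1:
        raise ValueError('page_number starts at 1')
    # The site numbers its index pages from 0, so subtract one.
    return f'{BASE_URL}/{page_number - 1}'


second_page = page_url(2)
print(second_page)
```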

**4)** Get the names of all the countries at http://example.webscraping.com/

Now that we have a basic understanding of how different pages of country names are accessed and how the country names are structured within each page, we can write a script to get the names of all countries and append them to a list.

In [19]:
requests.codes.ok

200

In [21]:
total_countries = []
for page in range(200):
    url = f'http://example.webscraping.com/places/default/index/{page}'
    r = requests.get(url)
    if r.status_code == requests.codes.ok:
        url_text = r.text
    else:
        print('Bad Request')
        break
    soup = BeautifulSoup(url_text, 'html.parser')
    country_list_page = soup.find_all('td')  # gets a list of all elements that contain country names
    if len(country_list_page) == 0:
        print('All Countries extracted')
        break
    for elem in country_list_page:
        country = elem.text.strip()  # drop the leading space around each name
        total_countries.append(country)
    print(f'Page {page} Done')

Page 0 Done
Page 1 Done
Page 2 Done
Page 3 Done
Page 4 Done
Page 5 Done
Page 6 Done
Page 7 Done
Page 8 Done
Page 9 Done
Page 10 Done
Bad Request


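The loop above stopped with 'Bad Request' after page 10, most likely because it sent requests without pausing and ignored the recommended crawl-delay. The version below waits 5 seconds between requests and retries the same page after a failed request. Uncomment it to run.

In [None]:
# total_countries = []
# page = 0
# while page < 200:
#     url = f'http://example.webscraping.com/places/default/index/{page}'
#     r = requests.get(url)
#     if r.status_code != requests.codes.ok:
#         print('Bad Request')
#         print('Trying again in 5 seconds...')
#         time.sleep(5)
#         continue  # retry the same page: `page` has not been incremented yet
#     soup = BeautifulSoup(r.text, 'html.parser')
#     country_list_page = soup.find_all('td')  # gets a list of all elements that contain country names
#     if len(country_list_page) == 0:
#         print('All Countries Extracted')
#         break
#     for elem in country_list_page:
#         country = elem.text.strip()
#         total_countries.append(country)
#     print(f'Page {page} Done')
#     page += 1
#     time.sleep(5)  # respect the recommended crawl-delay between requests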

In [None]:
total_countries
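
When retrying failed requests, it helps to cap the number of attempts so that a page which never recovers cannot stall the crawl. One possible refinement is sketched below. fetch_with_retry and its parameters are hypothetical names; the fetch argument stands in for a call such as requests.get and only needs to return an object with a status_code attribute, which lets the example run offline with a fake fetcher. The 429 status used by the fake is an arbitrary choice, not something observed from the site.

```python
import time
from types import SimpleNamespace


def fetch_with_retry(fetch, url, max_attempts=3, delay=5.0, sleep=time.sleep):
    """Call fetch(url) until it returns status 200 or attempts run out.

    Returns the successful response, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        response = fetch(url)
        if response.status_code == 200:
            return response
        if attempt < max_attempts:
            sleep(delay)  # wait before trying the same URL again
    return None


# Fake fetcher for illustration: fails twice, then succeeds.
calls = []


def flaky_fetch(url):
    calls.append(url)
    status = 200 if len(calls) >= 3 else 429  # 429 is an arbitrary failure code
    return SimpleNamespace(status_code=status, text='<td> Aruba</td>')


waits = []  # records the requested delays instead of actually sleeping
result = fetch_with_retry(
    flaky_fetch,
    'http://example.webscraping.com/places/default/index/1',
    sleep=waits.append,
)
print(result.status_code, len(calls), waits)
```

With the real library this could be called as fetch_with_retry(requests.get, url), leaving sleep at its default so the delay is actually observed.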