In [1]:
!pip3 install selenium

Collecting selenium
  Downloading selenium-3.141.0-py2.py3-none-any.whl (904 kB)
Installing collected packages: selenium
Successfully installed selenium-3.141.0


In [2]:
# importing required libraries
import requests     ## !pip install requests
import numpy as np
import pandas as pd
import bs4
import lxml.etree as xml

### Fetch webpage contents using requests

To fetch a web page we use the `get` function from `requests`. It accepts many optional arguments, but the only required one is the URL of the page you want to retrieve.

In [3]:
URL = "https://github.com/requests/requests"
requests.get(URL)

<Response [200]>

The result of this method is a Response object. 

The number `200` is a [status code](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes). `200` means OK: the request succeeded.
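
Status codes fall into families by their first digit, which helps when deciding how to react to a response. The sketch below uses only the standard library's `http.HTTPStatus` to look up the phrase for a code; `describe_status` is a hypothetical helper, not part of `requests`. With `requests`, the integer is available as `response.status_code`, and `response.raise_for_status()` raises an exception for 4xx and 5xx responses.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description such as '200 OK (success)'."""
    # The first digit of the code identifies its family
    families = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    phrase = HTTPStatus(code).phrase
    family = families[code // 100]
    return f"{code} {phrase} ({family})"

for code in (200, 404, 503):
    print(describe_status(code))
```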

In [4]:
requests.get(URL).text

'\n\n\n\n\n<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars0.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://avatars1.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://avatars2.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://avatars3.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-QV1ZNBjZz8stPx+uh4ZAKc6AJ1z8A9VHut/SGtgbc+iYLfhrh68QmDH3rZgkXJ0BWIOcDw+ILnWcctH0ljcHPg==" rel="stylesheet" href="https://github.githubassets.com/assets/frameworks-415d593418d9cfcb2d3f1fae87864029.css" />\n  <link crossorigin="anonymous" media="all" integrity="sha512-Z1l3RiXbvx3iU15ESvB9fNSNb9jKTywFzKK2vJTr9WGTx+GkdM33cn+TfTEA

**To get the HTML as a string we use the `text` property of the Response object.**

Before we go further, you should know that requests to a web page often fail. There are many possible errors and even more causes, but the most common cases are:
- You used the wrong URL.
- The website is down. To confirm this, open it in a browser.
- The website blocks bots and scraping agents. Sending a browser-like `User-Agent` header through the `headers` parameter of `get` often fixes this. If it doesn't, there is no general solution and you will have to investigate that particular site.
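
These failure cases can be handled in one place. The following is a minimal sketch, assuming a hypothetical helper named `fetch_html` and an example `User-Agent` string; it sets the header, applies a timeout, and returns `None` instead of raising when the request fails.

```python
import requests

# Example browser-like User-Agent string; any plausible value can be substituted.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/65.0.3325.183 Safari/537.36"
}

def fetch_html(url, timeout=10):
    """Return the page HTML as a string, or None if the request fails."""
    try:
        response = requests.get(url, headers=HEADERS, timeout=timeout)
        # Turn 4xx and 5xx status codes into exceptions
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        # Covers malformed URLs, connection problems, timeouts and HTTP errors
        print(f"Could not fetch {url!r}: {error}")
        return None
    return response.text
```

A call such as `fetch_html(URL)` then either gives you the HTML string or prints the reason the request failed.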

We can convert that text into a `BeautifulSoup` object.

#### Example 1

Create a `BeautifulSoup` object.

In [5]:
web_page = bs4.BeautifulSoup(requests.get(URL).text, "lxml")
web_page

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars0.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars1.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars2.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars3.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/frameworks-415d593418d9cfcb2d3f1fae87864029.css" integrity="sha512-QV1ZNBjZz8stPx+uh4ZAKc6AJ1z8A9VHut/SGtgbc+iYLfhrh68QmDH3rZgkXJ0BWIOcDw+ILnWcctH0ljcHPg==" media="all" rel="stylesheet"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/site-6759774625dbbf1de2535e444af07d7c.css" integrity="sha512-Z1l3RiXbvx3iU15ESvB9fNSNb9jKTy

Web pages are trees of elements nested one inside the other.

For example:
- html
  - body
      - div
      - div
      - div
      
We say that body is a child of html and html is a parent of body, and that the 3 div are children of body. The 3 div are siblings. This terminology matters because the method names in bs4 follow it. 
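
To make the terminology concrete, here is a small sketch on a hand-written HTML snippet (the snippet and the `id` values are made up for illustration). It uses the built-in `html.parser` so it runs without `lxml`, and shows how `parent`, `children` and `find_next_siblings` mirror the parent/child/sibling vocabulary.

```python
import bs4

# A tiny hypothetical page: html > body > three div elements
html = (
    "<html><body>"
    "<div id='a'>one</div><div id='b'>two</div><div id='c'>three</div>"
    "</body></html>"
)
soup = bs4.BeautifulSoup(html, "html.parser")

body = soup.body
parent_name = body.parent.name                      # the parent of body
child_ids = [child["id"] for child in body.children]  # the children of body
first_div = body.div                                # first div inside body
sibling_ids = [s["id"] for s in first_div.find_next_siblings("div")]

print(parent_name, child_ids, sibling_ids)
```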

Before you start scraping, open the website in your browser's Inspector view to see the nesting hierarchy of the web page elements.

Generally all web pages have two main sections called `head` and `body`:
- `head` is where a lot of metadata lives
- `body` is what you see on the screen; it contains all links, tables and images.

#### Example 2

Let's find the title of the web page we pulled using the `head` and `title` elements.

In [6]:
web_page.head.title

<title>GitHub - psf/requests: A simple, yet elegant HTTP library.</title>

We can navigate the tree by going element by element. You need to know the element names (html, head, div, span, p, a and so on) but don't worry if you don't. Look at the webpage in the inspector view in your browser and you can see the full path to the element of interest.

To get the text we need to use the `text` property of elements.

In [7]:
web_page.head.title.text

'GitHub - psf/requests: A simple, yet elegant HTTP library.'

#### Example 3

Let's go into the body of the GitHub page we accessed.

In [8]:
web_page.body

<body class="logged-out env-production page-responsive">
<div class="position-relative js-header-wrapper">
<a class="px-2 py-4 bg-blue text-white show-on-focus js-skip-to-content" href="#start-of-content">Skip to content</a>
<span class="progress-pjax-loader width-full js-pjax-loader-bar Progress position-fixed">
<span class="Progress-item progress-pjax-loader-bar" style="background-color: #79b8ff;width: 0%;"></span>
</span>
<header class="Header-old header-logged-out js-details-container Details position-relative f4 py-2" role="banner">
<div class="container-xl d-lg-flex flex-items-center p-responsive">
<div class="d-flex flex-justify-between flex-items-center">
<a aria-label="Homepage" class="mr-4" data-ga-click="(Logged out) Header, go to homepage, icon:logo-wordmark" href="https://github.com/">
<svg aria-hidden="true" class="octicon octicon-mark-github text-white" height="32" version="1.1" viewbox="0 0 16 16" width="32"><path d="M8 0C3.58 0 0 3.58 0 8c0 3.54 2.29 6.53 5.47 7.59.4.0

It is full of elements like `<a>` or `<ul>` or `<li>` or `<div>` or `<span>` etc.

The majority of that is noise to us because we want to find the numbers which describe this repository.

#### Example 4. Filter with `find_all()`

In [9]:
sub_web_page = web_page.find_all(name="div", attrs={"class": "px-2"})[0]
sub_web_page

<div class="px-2">
<button aria-label="Dismiss this message" class="flash-close js-flash-close" type="button">
<svg aria-hidden="true" class="octicon octicon-x" height="16" version="1.1" viewbox="0 0 16 16" width="16"><path d="M3.72 3.72a.75.75 0 011.06 0L8 6.94l3.22-3.22a.75.75 0 111.06 1.06L9.06 8l3.22 3.22a.75.75 0 11-1.06 1.06L8 9.06l-3.22 3.22a.75.75 0 01-1.06-1.06L6.94 8 3.72 4.78a.75.75 0 010-1.06z" fill-rule="evenodd"></path></svg>
</button>
<div>{{ message }}</div>
</div>
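
`find_all()` returns every element that matches the given tag name and attributes, so indexing with `[0]` picks the first match. The sketch below applies the same call to a hand-written snippet; the markup, class names and numbers are hypothetical and only stand in for the kind of counters a repository page shows.

```python
import bs4

# Hypothetical markup standing in for a list of repository counters
html = """
<ul>
  <li class="stat"><a href="/stars">Stars</a> <span class="count">42</span></li>
  <li class="stat"><a href="/forks">Forks</a> <span class="count">7</span></li>
  <li class="nav"><a href="/about">About</a></li>
</ul>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# Keep only the <li> elements whose class is "stat"
stat_items = soup.find_all(name="li", attrs={"class": "stat"})

counts = {}
for item in stat_items:
    label = item.a.text
    value = int(item.find("span", attrs={"class": "count"}).text)
    counts[label] = value

print(counts)
```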

#### Example 5. Find number

In [10]:
WP_URL = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"

# Send a browser-like User-Agent header so the site treats the request as coming from a regular browser
web_page = bs4.BeautifulSoup(requests.get(WP_URL, headers={
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.183 Safari/537.36"
}).text, "lxml")

imf_table = web_page.find_all(name="table", attrs={"class": "wikitable"})[0]

imf_table

<table class="wikitable" style="margin:auto; width:100%;">
<tbody><tr>
<td style="width:33%; text-align:center"><b>Per the <a href="/wiki/International_Monetary_Fund" title="International Monetary Fund">International Monetary Fund</a> (2020 estimates)</b><sup class="reference" id="cite_ref-GDP_IMF_1-2"><a href="#cite_note-GDP_IMF-1">[1]</a></sup>
</td>
<td style="width:33%; text-align:center;"><b>Per the <a href="/wiki/World_Bank" title="World Bank">World Bank</a> (2019)</b><sup class="reference" id="cite_ref-worldbank_21-0"><a href="#cite_note-worldbank-21">[20]</a></sup>
</td>
<td style="width:33%; text-align:center;"><b>Per the <a href="/wiki/United_Nations" title="United Nations">United Nations</a> (2018)</b><sup class="reference" id="cite_ref-22"><a href="#cite_note-22">[21]</a></sup>
</td></tr>
<tr valign="top">
<td>
<table class="wikitable sortable" style="margin-left:auto; margin-right:auto; margin-top:0;">
<tbody><tr>
<th data-sort-type="number" style="width:2em;">Rank</th>
<t

In [11]:
# Get the column names of our dataframe.
# `children` is an iterator, so to index it we must first convert it to a list.
columns = list(imf_table.tbody.children)[0]
# Skip the whitespace strings that sit between the tags
columns = [elem.text.strip("\n ")
           for elem in columns
           if not isinstance(elem, bs4.NavigableString)]

rows = []
columns

['Per the International Monetary Fund (2020 estimates)[1]',
 'Per the World Bank (2019)[20]',
 'Per the United Nations (2018)[21]']
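
The `NavigableString` filter is needed because `children` yields the whitespace between tags as well as the tags themselves. A minimal sketch on a hand-written row (the markup is made up for illustration) shows the difference:

```python
import bs4

# A hypothetical header row with newlines between the cells
html = "<tr>\n<th>Rank</th>\n<th>Country</th>\n</tr>"
soup = bs4.BeautifulSoup(html, "html.parser")

children = list(soup.tr.children)
# Keep only real elements and drop the text nodes made of whitespace
tags = [child for child in children if isinstance(child, bs4.Tag)]
names = [tag.text for tag in tags]

print(len(children), names)
```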

#### Create table

In [12]:
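for i, row in enumerate(imf_table.tbody.find_all("tr")):
    # Skip the header
    if i <= 1:
        continue
    tds = row.find_all("td")
    # Skip rows that do not have the three expected cells
    if len(tds) < 3:
        continue
    rank         = tds[0].text
    country_name = tds[1].text
    gdp          = tds[2].text
    rows.append((rank, country_name, gdp))

# The scraped `columns` hold the captions of the outer table, so name the data columns explicitly
data_frame = pd.DataFrame(rows, columns=["Rank", "Country", "GDP(US$MM)"])
data_frame.head()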

|   | Rank | Country | GDP(US$MM) |
|---|------|---------|------------|
| 0 | 1 | United States | 19,390,600\n |
| 1 | _ | European Union[n 1][19] | 17,308,862\n |
| 2 | 2 | China[n 2] | 12,014,610\n |
| 3 | 3 | Japan | 4,872,135\n |
| 4 | 4 | Germany | 3,684,816\n |
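
The same row-by-row extraction can be tried offline on a small table. This is a sketch under stated assumptions: the table markup and its values are placeholders, not data from the Wikipedia page, and rows are kept only when their cell count matches the header.

```python
import bs4

# Hypothetical table with a header row, two data rows and a footnote row
html = """
<table>
  <tr><th>Rank</th><th>Country</th><th>GDP</th></tr>
  <tr><td>1</td><td>Country A</td><td>1,000</td></tr>
  <tr><td>2</td><td>Country B</td><td>500</td></tr>
  <tr><td colspan="3">footnote row</td></tr>
</table>
"""
soup = bs4.BeautifulSoup(html, "html.parser")
table = soup.find("table")

header = [th.text.strip() for th in table.find_all("th")]

rows = []
for tr in table.find_all("tr"):
    cells = [td.text.strip() for td in tr.find_all("td")]
    # The header row has no <td> cells and the footnote row has one, so both are skipped
    if len(cells) != len(header):
        continue
    rows.append(tuple(cells))

print(header)
print(rows)
```

Passing the result to `pd.DataFrame(rows, columns=header)` then gives a dataframe with one column per header cell.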


In [13]:
import re

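def clean_country(c):
    # Drop the last bracketed footnote marker, e.g. "China[n 2]" -> "China"
    try:
        return re.findall(pattern=r"(.+)\[.+", string=c)[0].strip("\xa0")
    except IndexError:
        # No brackets found: return the name unchanged
        return c
    
# One call removes one set of brackets [], so call twice to clean entries such as "European Union[n 1][19]"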
data_frame.Country = data_frame.Country.apply(clean_country)
data_frame.Country = data_frame.Country.apply(clean_country)
                    
data_frame.rename({"GDP(US$MM)": "GDP"}, inplace=True, axis=1)
data_frame.GDP = data_frame.GDP.apply(lambda gdp: gdp.replace("\n", "").replace(",", ""))
data_frame.head()

|   | Rank | Country | GDP |
|---|------|---------|-----|
| 0 | 1 | United States | 19390600 |
| 1 | _ | European Union | 17308862 |
| 2 | 2 | China | 12014610 |
| 3 | 3 | Japan | 4872135 |
| 4 | 4 | Germany | 3684816 |
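
As an alternative to calling the cleaner twice, `re.sub` can remove every bracketed marker in one pass, and the GDP strings can be turned into integers at the same time. This is only a sketch: `strip_footnotes` and `parse_gdp` are hypothetical helper names, not functions used earlier in this notebook.

```python
import re

def strip_footnotes(name):
    """Remove every bracketed marker, e.g. '[n 1]' or '[19]', in one pass."""
    return re.sub(r"\[[^\]]*\]", "", name).strip("\xa0 ")

def parse_gdp(text):
    """Convert a string such as '19,390,600' plus trailing newline to an int."""
    return int(text.strip().replace(",", ""))

print(strip_footnotes("European Union[n 1][19]"))
print(parse_gdp("19,390,600\n"))
```

With pandas these could be applied column-wise, for example `data_frame.Country.apply(strip_footnotes)`.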
