!pip install requests

In [3]:
!pip install beautifulsoup4

Collecting beautifulsoup4
  Downloading https://files.pythonhosted.org/packages/1d/5d/3260694a59df0ec52f8b4883f5d23b130bc237602a1411fa670eae12351e/beautifulsoup4-4.7.1-py3-none-any.whl (94kB)
Collecting soupsieve>=1.2 (from beautifulsoup4)
  Downloading https://files.pythonhosted.org/packages/b9/a5/7ea40d0f8676bde6e464a6435a48bc5db09b1a8f4f06d41dd997b8f3c616/soupsieve-1.9.1-py2.py3-none-any.whl
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.7.1 soupsieve-1.9.1


# Web Scraping With Beautiful Soup

Today we'll look at how to extract data from websites using Python. To make this convenient, we'll use a library called **Beautiful Soup**.

Beautiful Soup works best with static HTML pages. Sites built with React or other JavaScript frameworks render much of their content in the browser, so the HTML that Beautiful Soup receives may not contain the data you see on screen.

To scrape such sites you may need a tool like **Selenium**. We may get the chance to work with **Selenium** at a later date. For now, we'll cover the basics of web scraping with **Beautiful Soup**.

The fact that Beautiful Soup cannot handle JavaScript-rendered sites does not mean it isn't powerful: it is good at pulling even the smallest details out of a static web page, and it is very easy to use.
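
To make "static HTML" concrete, here is a minimal sketch that parses a small HTML string held in memory, with no network involved. The markup and the variable names (`page`, `title`, `heading`, `links`) are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical static page, defined inline so nothing is downloaded.
page = """
<html><head><title>Demo page</title></head>
<body>
  <h1 id="main">Hello</h1>
  <a href="/first">First</a>
  <a href="/second">Second</a>
</body></html>
"""

soup = BeautifulSoup(page, 'html.parser')

title = soup.title.string                            # text inside <title>
heading = soup.find('h1', id='main').text            # look up a tag by name and id
links = [a['href'] for a in soup.find_all('a')]      # collect every link target

print(title, heading, links)
```

Everything the code reads is already present in the HTML text, which is the situation Beautiful Soup handles well.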

So let's look at some examples.

In [7]:
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

Here we import the dependencies we need for this exercise.

- `requests` is a module for making HTTP requests.
- `closing` from `contextlib` makes sure the response is closed once we are done with it.
- `bs4` is the Beautiful Soup module. We installed it with pip in cell 3 above.

In [6]:
def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of response is some kind of HTML/XML, return the
    text content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None
    except RequestException as e:
        log_error(f'Error during requests to {url}: {str(e)}')
        return None


As the docstring says, `simple_get` takes a URL and returns the raw HTML/XML content of that page. It returns `None` if the request fails or if the response does not look like HTML.
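
One thing `simple_get` does not set is a timeout, so a server that never answers could stall the notebook. The sketch below is a hypothetical variant, named `fetch_html` here, that passes the `timeout` argument of `requests.get`. The 10-second default is an assumption, not a recommendation from the original notebook.

```python
from contextlib import closing

from requests import get
from requests.exceptions import RequestException


def fetch_html(url, timeout=10):
    """Hypothetical variant of simple_get that gives up after `timeout` seconds."""
    try:
        with closing(get(url, stream=True, timeout=timeout)) as resp:
            content_type = resp.headers.get('Content-Type', '').lower()
            if resp.status_code == 200 and 'html' in content_type:
                return resp.content
            return None
    except RequestException as e:
        print(f'Error during request to {url}: {e}')
        return None


# A URL with no scheme is rejected by requests before any network traffic,
# so the except branch runs and the function returns None.
result = fetch_html('not-a-url')
```

Callers should check for `None` before handing the result to `BeautifulSoup`.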

In [8]:
def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    # Use .get() so a response without a Content-Type header does not raise KeyError.
    content_type = resp.headers.get('Content-Type', '').lower()
    return (
        resp.status_code == 200
        and content_type.find('html') > -1
    )
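
To see how this kind of check behaves, the sketch below runs the same logic against a few stand-in response objects. `looks_like_html` and the `SimpleNamespace` objects are hypothetical; they only mimic the `status_code` and `headers` attributes of a real `requests` response.

```python
from types import SimpleNamespace


def looks_like_html(resp):
    """Same idea as is_good_response, tolerant of a missing Content-Type header."""
    content_type = resp.headers.get('Content-Type', '').lower()
    return resp.status_code == 200 and 'html' in content_type


# Hypothetical stand-ins for requests.Response objects.
html_page = SimpleNamespace(status_code=200,
                            headers={'Content-Type': 'text/html; charset=UTF-8'})
image = SimpleNamespace(status_code=200, headers={'Content-Type': 'image/png'})
not_found = SimpleNamespace(status_code=404, headers={'Content-Type': 'text/html'})
no_header = SimpleNamespace(status_code=200, headers={})

checks = [looks_like_html(r) for r in (html_page, image, not_found, no_header)]
print(checks)  # [True, False, False, False]
```

Only the first response passes: it is the only one with both status 200 and an HTML content type.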

In [9]:
def log_error(e):
    """
    It is always a good idea to log errors.
    This function just prints them, but you can
    make it do anything.
    """
    print(e)
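
As the docstring notes, `log_error` can be made to do anything. One common option, sketched here, is Python's standard `logging` module. The logger name `scraper` and the message format are assumptions chosen for this example.

```python
import logging

# Send log records to stderr with a timestamp and level.
logging.basicConfig(format='%(asctime)s %(levelname)s %(message)s')

# 'scraper' is a hypothetical logger name chosen for this sketch.
logger = logging.getLogger('scraper')


def log_error(e):
    """Record the error through the logging module instead of print."""
    logger.error(e)
```

Because the function keeps the same name and signature, `simple_get` would not need to change.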


In [21]:
def get_names():
    """
    Downloads the page where the list of mathematicians is found
    and returns a list of strings, one per mathematician
    """
    url = 'http://www.fabpedigree.com/james/mathmen.htm'
    response = simple_get(url)

    if response is not None:
        html = BeautifulSoup(response, 'html.parser')
        names = set()
        for li in html.select('li'):
            for name in li.text.split('\n'):
                # Strip first so whitespace-only lines are skipped too.
                name = name.strip()
                if name:
                    names.add(name)
        return list(names)

    raise Exception(f'Error retrieving contents at {url}')


So let's try out some URLs with the `simple_get()` function.

In [11]:
google_text = simple_get('https://google.com')

In [12]:
len(google_text)

12526

In [22]:
mathematicians = get_names()

In [14]:
!pip install html5lib

Collecting html5lib
  Downloading https://files.pythonhosted.org/packages/a5/62/bbd2be0e7943ec8504b517e62bab011b4946e1258842bc159e5dfde15b96/html5lib-1.0.1-py2.py3-none-any.whl (117kB)
Installing collected packages: html5lib
Successfully installed html5lib-1.0.1


In [18]:
!pip install lxml

Collecting lxml
  Downloading https://files.pythonhosted.org/packages/45/6c/436a534dca42f7982ba793983353035d117ab70541266704974efa323ade/lxml-4.3.3-cp37-cp37m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (8.7MB)
Installing collected packages: lxml
Successfully installed lxml-4.3.3


In [23]:
mathematicians

['Eudoxus  of Cnidus',
 'Jean-Pierre Serre',
 'Hipparchus  of Nicaea',
 'Alexandre Grothendieck',
 'Pierre-Simon Laplace',
 'Emmy Noether',
 'George D. Birkhoff',
 'F. L. Gottlob Frege',
 'Isaac Newton',
 'Blaise Pascal',
 'Christiaan Huygens',
 'John von Neumann',
 'Élie Cartan',
 'Archytas  of Tarentum',
 'Hermann K. H. Weyl',
 'Pierre de Fermat',
 'William R. Hamilton',
 'Apollonius  of Perga',
 'Hermann G. Grassmann',
 'Karl W. T. Weierstrass',
 'André Weil',
 'Shiing-Shen Chern',
 'Joseph-Louis Lagrange',
 'Alhazen ibn al-Haytham',
 'Jacob Bernoulli',
 'Giuseppe Peano',
 'Kurt Gödel',
 'Évariste Galois',
 'Pafnuti Chebyshev',
 'F. Gotthold Eisenstein',
 'Henri Poincaré',
 'Godfrey H. Hardy',
 'Jean-Victor Poncelet',
 'Carl G. J. Jacobi',
 'Jacques Hadamard',
 'M. E. Camille Jordan',
 'Marius Sophus Lie',
 'Liu Hui',
 'Joseph Liouville',
 'Jakob Steiner',
 'Euclid  of Alexandria',
 'Carl Ludwig Siegel',
 'George Pólya',
 'Adrien M. Legendre',
 "Leonardo `Fibonacci'",
 'Johannes Kep

In [24]:
raw_html = simple_get('http://www.fabpedigree.com/james/mathmen.htm')

In [25]:
html = BeautifulSoup(raw_html, 'html.parser')

In [26]:
for i, li in enumerate(html.select('li')):
    print(i, li.text)

0  Isaac Newton
 Archimedes
 Carl F. Gauss
 Leonhard Euler
 Bernhard Riemann

1  Archimedes
 Carl F. Gauss
 Leonhard Euler
 Bernhard Riemann

2  Carl F. Gauss
 Leonhard Euler
 Bernhard Riemann

3  Leonhard Euler
 Bernhard Riemann

4  Bernhard Riemann

5  Henri Poincaré
 Joseph-Louis Lagrange
 Euclid  of Alexandria
 David Hilbert
 Gottfried W. Leibniz

6  Joseph-Louis Lagrange
 Euclid  of Alexandria
 David Hilbert
 Gottfried W. Leibniz

7  Euclid  of Alexandria
 David Hilbert
 Gottfried W. Leibniz

8  David Hilbert
 Gottfried W. Leibniz

9  Gottfried W. Leibniz

10  Alexandre Grothendieck
 Pierre de Fermat
 Évariste Galois
 John von Neumann
 René Descartes

11  Pierre de Fermat
 Évariste Galois
 John von Neumann
 René Descartes

12  Évariste Galois
 John von Neumann
 René Descartes

13  John von Neumann
 René Descartes

14  René Descartes

15  Karl W. T. Weierstrass
 Srinivasa Ramanujan
 Hermann K. H. Weyl
 Peter G. L. Dirichlet
 Niels Abel

16  Srinivasa Ramanujan
 Hermann K. H. Weyl
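
The output above shows each `li` repeating the names that follow it, which is what you would expect if the page never closes its `<li>` tags and `html.parser` nests them. That is why `get_names` splits on newlines and collects into a set. The sketch below reproduces the idea on a small made-up snippet; the markup is an assumption about the page's structure, not a copy of it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the page: the <li> tags are never closed.
sample = "<ol>\n<li>Isaac Newton\n<li>Archimedes\n<li>Carl F. Gauss\n</ol>"

soup = BeautifulSoup(sample, 'html.parser')

names = set()
for li in soup.select('li'):
    # An li's text may also contain the names from the items after it.
    for line in li.text.split('\n'):
        line = line.strip()
        if line:
            names.add(line)

unique_names = sorted(names)
print(unique_names)  # ['Archimedes', 'Carl F. Gauss', 'Isaac Newton']
```

The set removes the repeats, at the cost of losing the original ranking order on the page.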
 

## Another Example: Periodic Table

In [27]:
element_html = simple_get('https://www.sigmaaldrich.com/technical-documents/articles/biology/periodic-table-of-elements-names.html')

In [28]:
periodic_soup = BeautifulSoup(element_html, 'lxml')  # name the parser explicitly; lxml was installed above

In [29]:
periodic_table = periodic_soup.select('.productTable')

In [30]:
type(periodic_table)

list

In [31]:
len(periodic_table)

1

In [32]:
periodic_tds = periodic_table[0].children

In [33]:
periodic_tds

<list_iterator at 0x1148e9208>
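
Note that `.children` yields the table's direct children, not individual cells, and that it is an iterator, so it can only be looped over once. To get at the cells, one approach is to select the rows and read the `td` elements in each. The sketch below does this on a small made-up table shaped like the one on the page; the sample markup and the `elements` list are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample shaped like the real table, defined inline.
sample = """
<table class="productTable"><tbody>
<tr><th>Element Name</th><th>Symbol</th><th>Atomic Number</th></tr>
<tr><td>Actinium</td><td>Ac</td><td>89</td></tr>
<tr><td>Aluminum</td><td>Al</td><td>13</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(sample, 'html.parser')
table = soup.select('.productTable')[0]

elements = []
for row in table.select('tr'):
    cells = [td.text.strip() for td in row.find_all('td')]
    # The header row has <th> cells only, so it yields an empty list and is skipped.
    if len(cells) == 3:
        name, symbol, number = cells
        elements.append((name, symbol, int(number)))

print(elements)  # [('Actinium', 'Ac', 89), ('Aluminum', 'Al', 13)]
```

The same loop should work on `periodic_table[0]` from the cell above, provided the real table keeps this three-column layout.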

In [35]:
for td in periodic_tds:
    print(td)



<tbody><tr><th>Element Name</th>
<th>Symbol</th>
<th>Atomic Number</th>
</tr><tr><td>Actinium</td>
<td>Ac</td>
<td>89</td>
</tr><tr><td>Aluminum</td>
<td>Al</td>
<td>13</td>
</tr><tr><td>Americium</td>
<td>Am</td>
<td>95</td>
</tr><tr><td>Antimony</td>
<td>Sb</td>
<td>51</td>
</tr><tr><td>Argon</td>
<td>Ar</td>
<td>18</td>
</tr><tr><td>Arsenic</td>
<td>As</td>
<td>33</td>
</tr><tr><td>Astatine</td>
<td>At</td>
<td>85</td>
</tr><tr><td>Barium</td>
<td>Ba</td>
<td>56</td>
</tr><tr><td>Berkelium</td>
<td>Bk</td>
<td>97</td>
</tr><tr><td>Beryllium</td>
<td>Be</td>
<td>4</td>
</tr><tr><td>Bismuth</td>
<td>Bi</td>
<td>83</td>
</tr><tr><td>Bohrium</td>
<td>Bh</td>
<td>107</td>
</tr><tr><td>Boron</td>
<td>B</td>
<td>5</td>
</tr><tr><td>Bromine</td>
<td>Br</td>
<td>35</td>
</tr><tr><td>Cadmium</td>
<td>Cd</td>
<td>48</td>
</tr><tr><td>Calcium</td>
<td>Ca</td>
<td>20</td>
</tr><tr><td>Californium</td>
<td>Cf</td>
<td>98</td>
</tr><tr><td>Carbon</td>
<td>C</td>
<td>6</td>
</tr><tr><td>Cerium</t