# Introduction to BeautifulSoup4 and web page parsing

### Exercise 1: Read an HTML file (from disk) using bs4

In [1]:
from bs4 import BeautifulSoup

In [2]:
with open("test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    print(type(soup))

<class 'bs4.BeautifulSoup'>
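
When no parser is named, Beautiful Soup picks one from whatever is installed, and different parsers can build different trees from the same markup. A minimal sketch of naming the parser explicitly, assuming a small inline document (the `html` string is made up for illustration and stands in for `test.html`):

```python
from bs4 import BeautifulSoup

# Hypothetical inline document standing in for test.html
html = "<h1>Title</h1><p>First paragraph</p>"

# "html.parser" ships with Python, so no extra install is needed
soup = BeautifulSoup(html, "html.parser")

print(type(soup))
print(soup.h1.get_text())
```

Other parsers such as `lxml` or `html5lib` can be named the same way, provided they are installed.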


In [3]:
with open("test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    print(type(soup.contents))
    print(soup.contents)

<class 'list'>
[<html><head></head><body><h1>Lorem ipsum dolor sit amet consectetuer adipiscing 
elit</h1>

<p>Lorem ipsum dolor sit amet, consectetuer adipiscing 
elit. Aenean commodo ligula eget dolor. Aenean massa 
<strong>strong</strong>. Cum sociis natoque penatibus 
et magnis dis parturient montes, nascetur ridiculus 
mus. Donec quam felis, ultricies nec, pellentesque 
eu, pretium quis, sem. Nulla consequat massa quis 
enim. Donec pede justo, fringilla vel, aliquet nec, 
vulputate eget, arcu. In enim justo, rhoncus ut, 
imperdiet a, venenatis vitae, justo. Nullam dictum 
felis eu pede <a class="external ext" href="#">link</a> 
mollis pretium. Integer tincidunt. Cras dapibus. 
Vivamus elementum semper nisi. Aenean vulputate 
eleifend tellus. Aenean leo ligula, porttitor eu, 
consequat vitae, eleifend ac, enim. Aliquam lorem ante, 
dapibus in, viverra quis, feugiat a, tellus. Phasellus 
viverra nulla ut metus varius laoreet. Quisque rutrum. 
Aenean imperdiet. Etiam ultricies nisi v

### Exercise 2: Read a particular tag (and its contents)

In [4]:
with open("test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    print(soup.p)

<p>Lorem ipsum dolor sit amet, consectetuer adipiscing 
elit. Aenean commodo ligula eget dolor. Aenean massa 
<strong>strong</strong>. Cum sociis natoque penatibus 
et magnis dis parturient montes, nascetur ridiculus 
mus. Donec quam felis, ultricies nec, pellentesque 
eu, pretium quis, sem. Nulla consequat massa quis 
enim. Donec pede justo, fringilla vel, aliquet nec, 
vulputate eget, arcu. In enim justo, rhoncus ut, 
imperdiet a, venenatis vitae, justo. Nullam dictum 
felis eu pede <a class="external ext" href="#">link</a> 
mollis pretium. Integer tincidunt. Cras dapibus. 
Vivamus elementum semper nisi. Aenean vulputate 
eleifend tellus. Aenean leo ligula, porttitor eu, 
consequat vitae, eleifend ac, enim. Aliquam lorem ante, 
dapibus in, viverra quis, feugiat a, tellus. Phasellus 
viverra nulla ut metus varius laoreet. Quisque rutrum. 
Aenean imperdiet. Etiam ultricies nisi vel augue. 
Curabitur ullamcorper ultricies nisi.</p>
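
Once a tag is in hand, its name, text, and attributes can be read directly. The sketch below assumes a shortened, made-up paragraph that reuses the link markup from the output above:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modelled on the <p> printed above
html = '<p>Some text with a <a class="external ext" href="#">link</a>.</p>'
soup = BeautifulSoup(html, "html.parser")

p = soup.p
print(p.name)        # the tag's name
print(p.get_text())  # the text with all markup stripped

link = p.a                 # first <a> inside the paragraph
print(link["href"])        # dictionary-style attribute access
print(link.get("class"))   # class is multi-valued, so this is a list
print(link.get("title"))   # .get() returns None for a missing attribute
```

Using `.get()` rather than square brackets avoids a `KeyError` when an attribute may be absent.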


### Exercise 3: Read all tags of the same type from the document

In [5]:
with open("test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    all_ps = soup.find_all('p')
    print("Total number of <p>  --- {}".format(len(all_ps)))
    print(all_ps)

Total number of <p>  --- 6
[<p>Lorem ipsum dolor sit amet, consectetuer adipiscing 
elit. Aenean commodo ligula eget dolor. Aenean massa 
<strong>strong</strong>. Cum sociis natoque penatibus 
et magnis dis parturient montes, nascetur ridiculus 
mus. Donec quam felis, ultricies nec, pellentesque 
eu, pretium quis, sem. Nulla consequat massa quis 
enim. Donec pede justo, fringilla vel, aliquet nec, 
vulputate eget, arcu. In enim justo, rhoncus ut, 
imperdiet a, venenatis vitae, justo. Nullam dictum 
felis eu pede <a class="external ext" href="#">link</a> 
mollis pretium. Integer tincidunt. Cras dapibus. 
Vivamus elementum semper nisi. Aenean vulputate 
eleifend tellus. Aenean leo ligula, porttitor eu, 
consequat vitae, eleifend ac, enim. Aliquam lorem ante, 
dapibus in, viverra quis, feugiat a, tellus. Phasellus 
viverra nulla ut metus varius laoreet. Quisque rutrum. 
Aenean imperdiet. Etiam ultricies nisi vel augue. 
Curabitur ullamcorper ultricies nisi.</p>, <p>Lorem ipsum dolor sit a
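
`find_all` also accepts filters, which helps when only some of the matching tags are wanted. A short sketch, assuming a made-up three-paragraph document:

```python
from bs4 import BeautifulSoup

# Hypothetical document, not the contents of test.html
html = """
<p>one <a class="external ext" href="#a">a</a></p>
<p>two <a class="internal" href="#b">b</a></p>
<p>three</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Stop after the first two matches
first_two = soup.find_all("p", limit=2)

# class_ matches if any one of the tag's classes equals the given name
external = soup.find_all("a", class_="external")

# href=True keeps only <a> tags that actually carry an href attribute
with_href = soup.find_all("a", href=True)

print(len(first_two), len(external), len(with_href))
```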

### Exercise 4: Get the content under a particular tag

In [6]:
with open("test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    table = soup.table
    print(table.contents)

['\n  ', <tbody><tr>
    <th>Entry Header 1</th>
    <th>Entry Header 2</th>
    <th>Entry Header 3</th>
    <th>Entry Header 4</th>
  </tr>
  <tr>
    <td>Entry First Line 1</td>
    <td>Entry First Line 2</td>
    <td>Entry First Line 3</td>
    <td>Entry First Line 4</td>
  </tr>
  <tr>
    <td>Entry Line 1</td>
    <td>Entry Line 2</td>
    <td>Entry Line 3</td>
    <td>Entry Line 4</td>
  </tr>
  <tr>
    <td>Entry Last Line 1</td>
    <td>Entry Last Line 2</td>
    <td>Entry Last Line 3</td>
    <td>Entry Last Line 4</td>
  </tr>
</tbody>]
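
The list above starts with `'\n  '` because whitespace between tags is kept as a text node. One possible way to keep only the element children is to filter on `Tag`; the table below is a made-up example:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical table with indentation between the rows
html = """<table>
  <tr><th>A</th><th>B</th></tr>
  <tr><td>1</td><td>2</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")
table = soup.table

# .contents mixes whitespace strings with the <tr> tags
print(len(table.contents))

# Keep only real tags, dropping the whitespace-only text nodes
rows = [child for child in table.contents if isinstance(child, Tag)]
print(len(rows))
```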


### Exercise 5: Using the `children` generator

In [7]:
with open("test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    table = soup.table
    for child in table.children:
        print(child)
        print("*****")


  
*****
<tbody><tr>
    <th>Entry Header 1</th>
    <th>Entry Header 2</th>
    <th>Entry Header 3</th>
    <th>Entry Header 4</th>
  </tr>
  <tr>
    <td>Entry First Line 1</td>
    <td>Entry First Line 2</td>
    <td>Entry First Line 3</td>
    <td>Entry First Line 4</td>
  </tr>
  <tr>
    <td>Entry Line 1</td>
    <td>Entry Line 2</td>
    <td>Entry Line 3</td>
    <td>Entry Line 4</td>
  </tr>
  <tr>
    <td>Entry Last Line 1</td>
    <td>Entry Last Line 2</td>
    <td>Entry Last Line 3</td>
    <td>Entry Last Line 4</td>
  </tr>
</tbody>
*****


### Exercise 6: Using the `descendants` generator

In [8]:
with open("test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    table = soup.table
    children = table.children
    des = table.descendants
    print(len(list(children)), len(list(des)))
    for d in table.descendants:
        print(d)
        print("****")

2 62

  
****
<tbody><tr>
    <th>Entry Header 1</th>
    <th>Entry Header 2</th>
    <th>Entry Header 3</th>
    <th>Entry Header 4</th>
  </tr>
  <tr>
    <td>Entry First Line 1</td>
    <td>Entry First Line 2</td>
    <td>Entry First Line 3</td>
    <td>Entry First Line 4</td>
  </tr>
  <tr>
    <td>Entry Line 1</td>
    <td>Entry Line 2</td>
    <td>Entry Line 3</td>
    <td>Entry Line 4</td>
  </tr>
  <tr>
    <td>Entry Last Line 1</td>
    <td>Entry Last Line 2</td>
    <td>Entry Last Line 3</td>
    <td>Entry Last Line 4</td>
  </tr>
</tbody>
****
<tr>
    <th>Entry Header 1</th>
    <th>Entry Header 2</th>
    <th>Entry Header 3</th>
    <th>Entry Header 4</th>
  </tr>
****

    
****
<th>Entry Header 1</th>
****
Entry Header 1
****

    
****
<th>Entry Header 2</th>
****
Entry Header 2
****

    
****
<th>Entry Header 3</th>
****
Entry Header 3
****

    
****
<th>Entry Header 4</th>
****
Entry Header 4
****

  
****

  
****
<tr>
    <td>Entry First Line 1</td>
    <td>Entry 
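
The cell above asks for `table.descendants` a second time in the `for` loop because the first iterator was used up by `list(des)`. A small sketch of that behaviour, assuming a made-up one-row table:

```python
from bs4 import BeautifulSoup

# Hypothetical compact table with no whitespace between tags
html = "<table><tr><td>a</td><td>b</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")
table = soup.table

des = table.descendants
first_pass = list(des)   # consumes the iterator
second_pass = list(des)  # nothing left to yield

print(len(first_pass), len(second_pass))

# Asking the tag again returns a fresh iterator
print(len(list(table.descendants)))
```

Here `children` would yield only the single `<tr>`, while `descendants` walks every nested tag and string.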

### Exercise 7: Finding all the `tr`s and loading them into a pandas DataFrame

In [9]:
import pandas as pd

# The with block closes the file even if parsing raises an error
with open("test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    data = soup.find_all('tr')  # find_all is the current name for findAll

print("Data is a {} and {} items long".format(type(data), len(data)))

# The first row holds the <th> header cells; the rest are data rows
headers = data[0]
data_without_header = data[1:]

col_headers = [th.get_text() for th in headers.find_all('th')]
# Nested list comprehension: one inner list of cell texts per table row
df_data = [[td.get_text() for td in tr.find_all('td')] for tr in data_without_header]

df = pd.DataFrame(df_data, columns=col_headers)
df.head()

Data is a <class 'bs4.element.ResultSet'> and 4 items long


| | Entry Header 1 | Entry Header 2 | Entry Header 3 | Entry Header 4 |
|---|---|---|---|---|
| 0 | Entry First Line 1 | Entry First Line 2 | Entry First Line 3 | Entry First Line 4 |
| 1 | Entry Line 1 | Entry Line 2 | Entry Line 3 | Entry Line 4 |
| 2 | Entry Last Line 1 | Entry Last Line 2 | Entry Last Line 3 | Entry Last Line 4 |
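
Every cell scraped this way arrives as a string, so numeric columns need an explicit conversion before any arithmetic. A minimal sketch, assuming a made-up table with a `Score` column (the names and values are illustrative only):

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical table with one numeric column
html = """<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Ada</td><td>90</td></tr>
  <tr><td>Bob</td><td>85</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")

rows = soup.find_all("tr")
# strip=True trims surrounding whitespace from each cell's text
header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
body = [[td.get_text(strip=True) for td in tr.find_all("td")] for tr in rows[1:]]

df = pd.DataFrame(body, columns=header)

# Convert the scraped strings into numbers
df["Score"] = pd.to_numeric(df["Score"])
print(df.dtypes)
```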


### Exercise 8: Export the df to excel file

In [10]:
!pip install openpyxl

Collecting openpyxl
  Downloading https://files.pythonhosted.org/packages/04/18/64737cc6c5233e15374d21b4958a5600be52359e71063b4d4e7a604a1387/openpyxl-2.5.9.tar.gz (1.9MB)
Collecting jdcal (from openpyxl)
  Downloading https://files.pythonhosted.org/packages/a0/38/dcf83532480f25284f3ef13f8ed63e03c58a65c9d3ba2a6a894ed9497207/jdcal-1.4-py2.py3-none-any.whl
Collecting et_xmlfile (from openpyxl)
  Downloading https://files.pythonhosted.org/packages/22/28/a99c42aea746e18382ad9fb36f64c1c1f04216f41797f2f0fa567da11388/et_xmlfile-1.0.1.tar.gz
Building wheels for collected packages: openpyxl, et-xmlfile
  Running setup.py bdist_wheel for openpyxl ... done
  Stored in directory: /root/.cache/pip/wheels/57/41/b9/3765af8bda4a8d4b6aaf4957d7214984c3332348713e85cf36
  Running setup.py bdist_wheel for et-xmlfile ... done
  Stored in directory: /root/.cache/pip/wheels/2a/77/35/0da0965a057698121fc7

In [11]:
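# Using ExcelWriter as a context manager saves and closes the file on exit;
# writer.save() was removed in pandas 2.0
with pd.ExcelWriter('test_output.xlsx') as writer:
    df.to_excel(writer, sheet_name="Sheet1")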

### Exercise 9: Stacking URLs (to follow them at a later point)

In [16]:
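# The with block closes the file once the document has been parsed
with open("test.html", "r") as fd:
    soup = BeautifulSoup(fd)

lis = soup.find('ul').find_all('li')
stack = []
for li in lis:
    a = li.find('a', href=True)
    if a is not None:  # skip list items that contain no link
        stack.append(a['href'])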

In [17]:
print(stack)

['https://en.wikipedia.org/wiki/Entropy_(information_theory)', 'http://www.gutenberg.org/browse/scores/top', 'https://www.imdb.com/chart/top']
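
The links collected above are all absolute, but real pages often mix in relative and repeated links. A hedged sketch of one way to normalise them before following them later; the `base_url` and the list markup are assumptions made for illustration:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical address of the page the list was scraped from
base_url = "https://example.com/docs/"

# Hypothetical list mixing absolute, relative, missing, and duplicate links
html = """<ul>
  <li><a href="https://www.imdb.com/chart/top">absolute</a></li>
  <li><a href="page2.html">relative</a></li>
  <li>no link here</li>
  <li><a href="page2.html">duplicate</a></li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")

stack = []
for li in soup.find("ul").find_all("li"):
    a = li.find("a", href=True)
    if a is None:
        continue  # this list item has no link
    url = urljoin(base_url, a["href"])  # resolve relative links against the page
    if url not in stack:                # skip URLs already queued
        stack.append(url)

print(stack)
```

`urljoin` leaves absolute URLs untouched and resolves relative ones against `base_url`, so every entry on the stack can be requested directly.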
