We will use Python and several Python libraries.

In [1]:
# Install the following libraries
!pip install bs4
!pip install lxml==4.6.4
!pip install html5lib==1.1 
!pip install requests==2.26.0



Import the required modules and functions

In [2]:
from bs4 import BeautifulSoup # this module helps with web scraping
import requests  # this module helps us download a web page

Consider the following HTML document, which we store as a string in the variable <code>html</code>:

In [3]:
html="<!DOCTYPE html><html><head><title>Page Title</title></head><body><h3><b id='boldest'>Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body></html>"

To parse the document, pass it to the <code>BeautifulSoup</code> constructor. The resulting <code>BeautifulSoup</code> object represents the document as a nested data structure:

In [4]:
soup = BeautifulSoup(html, "html.parser")

We can use the method <code>prettify()</code> to display the HTML in the nested structure:

In [5]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Page Title
  </title>
 </head>
 <body>
  <h3>
   <b id="boldest">
    Lebron James
   </b>
  </h3>
  <p>
   Salary: $ 92,000,000
  </p>
  <h3>
   Stephen Curry
  </h3>
  <p>
   Salary: $85,000, 000
  </p>
  <h3>
   Kevin Durant
  </h3>
  <p>
   Salary: $73,200, 000
  </p>
 </body>
</html>


We will cover <code>BeautifulSoup</code> and <code>Tag</code> objects, which for the purposes of this lab behave identically, and <code>NavigableString</code> objects. A <code>Tag</code> object corresponds to an HTML tag in the original document, for example the <code>title</code> tag.

### Tags

In [6]:
tag_object=soup.title
print("tag object:",tag_object)

tag object: <title>Page Title</title>


In [7]:
# check tag type
print("tag object type:", type(tag_object))

tag object type: <class 'bs4.element.Tag'>


If there is more than one tag with the same name, accessing that name as an attribute returns only the first matching element. Here that is the <code>h3</code> element for the highest-paid player:

In [8]:
tag_object=soup.h3
tag_object

<h3><b id="boldest">Lebron James</b></h3>
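Attribute-style access such as `soup.h3` is a shortcut that returns only the first match. The sketch below uses a shortened stand-in document and the hypothetical name `players_soup` (so it does not overwrite the lab's `soup`) to compare it with `find()` and `find_all()`:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the lab's document (an assumption for illustration)
players_html = "<body><h3>Lebron James</h3><h3>Stephen Curry</h3><h3>Kevin Durant</h3></body>"
players_soup = BeautifulSoup(players_html, "html.parser")

first_by_attribute = players_soup.h3        # attribute-style access: first <h3> only
first_by_find = players_soup.find("h3")     # explicit search for the first <h3>
all_headings = players_soup.find_all("h3")  # every <h3>, in document order

print(first_by_attribute)
print(first_by_attribute == first_by_find)
print([h.string for h in all_headings])
```

If you need every player rather than the first one, `find_all()` is the method to reach for.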

### Children, Parents, and Siblings

The <code>Tag</code> object is a tree of objects. We can access the child of the tag or navigate down the branch as follows:

In [9]:
tag_child =tag_object.b
tag_child

<b id="boldest">Lebron James</b>

You can access the parent with the <code>parent</code> attribute:

In [10]:
parent_tag=tag_child.parent
parent_tag

<h3><b id="boldest">Lebron James</b></h3>

In [11]:
# This is identical to:
tag_object

<h3><b id="boldest">Lebron James</b></h3>

The parent of <code>tag_object</code> is the <code>body</code> element:

In [12]:
tag_object.parent

<body><h3><b id="boldest">Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body>

The next sibling of <code>tag_object</code> is the paragraph (<code>p</code>) element:

In [13]:
sibling_1=tag_object.next_sibling
sibling_1

<p> Salary: $ 92,000,000 </p>

`sibling_2` is the next heading (`h3`) element, which is a sibling of both `sibling_1` and `tag_object`:

In [14]:
sibling_2=sibling_1.next_sibling
sibling_2

<h3> Stephen Curry</h3>

In [15]:
sibling_3=sibling_2.next_sibling
sibling_3

<p> Salary: $85,000, 000 </p>
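The sibling walk above works cleanly because the lab's HTML string has no whitespace between tags. As a hedged illustration with a made-up, line-broken document (`pretty_html` and `pretty_soup` are hypothetical names), `next_sibling` can land on a text node, while `find_next_sibling()` with a tag name skips ahead to the next matching tag:

```python
from bs4 import BeautifulSoup

# Same kind of content, but with newlines between the tags (assumed example)
pretty_html = """<body>
<h3>Lebron James</h3>
<p>Salary: $92,000,000</p>
</body>"""
pretty_soup = BeautifulSoup(pretty_html, "html.parser")

heading = pretty_soup.h3
raw_sibling = heading.next_sibling           # the newline text node after </h3>
tag_sibling = heading.find_next_sibling("p") # the next <p> sibling, ignoring text nodes

print(repr(raw_sibling))
print(tag_sibling)
```

When scraping real pages, which usually contain such whitespace, the search-based variant tends to be the safer choice.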

### HTML Attributes

A tag can have attributes. For example, the <code>b</code> tag with <code>id="boldest"</code> has an attribute <code>id</code> whose value is <code>boldest</code>. You can access a tag’s attributes by treating the tag like a dictionary:

In [16]:
tag_child['id']

'boldest'

In [17]:
# You can access that dictionary directly as attrs:
tag_child.attrs

{'id': 'boldest'}

We can also obtain the value of an attribute of the <code>tag</code> using the <code>get()</code> method, which works like the <code>get()</code> method of a Python dictionary:

In [18]:
tag_child.get('id')

'boldest'
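The two access styles differ when the attribute is missing. A minimal sketch, using a one-tag stand-in document and the hypothetical names `attr_soup` and `bold_tag`:

```python
from bs4 import BeautifulSoup

attr_soup = BeautifulSoup("<b id='boldest'>Lebron James</b>", "html.parser")
bold_tag = attr_soup.b

print(bold_tag.get("id"))             # value of an existing attribute
print(bold_tag.get("class"))          # missing attribute: get() returns None
print(bold_tag.get("class", "n/a"))   # ...or a default that you supply
print(bold_tag.has_attr("class"))     # explicit presence check

# Dictionary-style access raises KeyError for a missing attribute
try:
    bold_tag["class"]
    missing_raised = False
except KeyError:
    missing_raised = True
print(missing_raised)
```

Using `get()` avoids an exception when you loop over many tags and only some of them carry the attribute.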

### Navigable String

A string corresponds to a bit of text or content within a tag. Beautiful Soup uses the <code>NavigableString</code> class to contain this text. In our HTML we can obtain the name of the first player by extracting the string of the <code>Tag</code> object <code>tag_child</code> as follows:


In [19]:
tag_string=tag_child.string
tag_string

'Lebron James'

In [20]:
# We can verify the type is Navigable String
type(tag_string)

bs4.element.NavigableString

A <code>NavigableString</code> is just like a Python string, except that it also supports some Beautiful Soup features. We can convert it to a plain Python string object:

In [21]:
unicode_string = str(tag_string)
unicode_string

'Lebron James'
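One caveat worth knowing: `.string` only returns text when a tag has a single child. The sketch below uses a made-up cell (`mixed_soup` and `header_cell` are hypothetical names) that mixes a nested tag with loose text, and shows `get_text()` as an alternative that collects all the text inside the tag:

```python
from bs4 import BeautifulSoup

# A cell with two children: a <b> tag and a trailing text node (assumed example)
mixed_soup = BeautifulSoup("<td><b>Color</b> Name</td>", "html.parser")
header_cell = mixed_soup.td

print(header_cell.string)      # None, because the cell has more than one child
print(header_cell.b.string)    # the nested <b> tag has exactly one child
print(header_cell.get_text())  # all text inside the cell, joined together
```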

<h2 id="filter">Filter</h2>


Filters allow you to find complex patterns. The simplest filter is a string. In this section we will pass a string to different filter methods, and Beautiful Soup will perform a match against that exact string. Consider the following HTML of rocket launches:


In [22]:
%%html
<table>
  <tr>
    <td id='flight' >Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
   </tr>
  <tr> 
    <td>1</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a> </td>
    <td>80 kg</td>
  </tr>
</table>

0,1,2
Flight No,Launch site,Payload mass
1,Florida,300 kg
2,Texas,94 kg
3,Florida,80 kg


We can store it as a string in the variable table:

In [24]:
table = "<table><tr><td id='flight'>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a></td><td>300 kg</td></tr><tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a> </td><td>80 kg</td></tr></table>"

In [25]:
table_bs = BeautifulSoup(table, "html.parser")

### find_all

The <code>find_all()</code> method looks through a tag’s descendants and retrieves all descendants that match your filters.

<p>
The method signature is <code>find_all(name, attrs, recursive, string, limit, **kwargs)</code>.
</p>

#### Name

When we set the <code>name</code> parameter to a tag name, the method returns all the tags with that name, each including its children.

In [26]:
table_rows = table_bs.find_all('tr')
table_rows

[<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>,
 <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td><td>300 kg</td></tr>,
 <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>,
 <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td><td>80 kg</td></tr>]

The result behaves like a Python list. Each element is a <code>Tag</code> object:

In [32]:
# access tag in second row
second_row = table_rows[1]
second_row

<tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td><td>300 kg</td></tr>

In [30]:
# check type second_row
print(type(second_row))

<class 'bs4.element.Tag'>


In [31]:
# print child
second_row.td

<td>1</td>

If we iterate through the list, each element corresponds to a row in the table:

In [33]:
for i, row in enumerate(table_rows):
    print("row", i, "is", row)

row 0 is <tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>
row 1 is <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td><td>300 kg</td></tr>
row 2 is <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>
row 3 is <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td><td>80 kg</td></tr>


As <code>row</code> is a <code>Tag</code> object, we can apply the method <code>find_all</code> to it and extract the table cells into the object <code>cells</code> using the tag name <code>td</code>; this returns all the descendants named <code>td</code>. The result is a list in which each element corresponds to a cell and is a <code>Tag</code> object, so we can iterate through this list as well. We can extract the content of a cell using the <code>string</code> attribute.


In [34]:
for i, row in enumerate(table_rows):
    print("row", i)
    cells = row.find_all('td')
    for j, cell in enumerate(cells):
        print("column", j, "cell", cell)

row 0
column 0 cell <td id="flight">Flight No</td>
column 1 cell <td>Launch site</td>
column 2 cell <td>Payload mass</td>
row 1
column 0 cell <td>1</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td>
column 2 cell <td>300 kg</td>
row 2
column 0 cell <td>2</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>
column 2 cell <td>94 kg</td>
row 3
column 0 cell <td>3</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td>
column 2 cell <td>80 kg</td>
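The nested loop above prints whole `td` tags. As a sketch of the same traversal that keeps only the text, the following builds a list of lists. It uses a shortened, well-formed stand-in table and the hypothetical names `launch_soup` and `table_data`:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the launch table (an assumption for illustration)
launch_html = (
    "<table>"
    "<tr><td>Flight No</td><td>Launch site</td><td>Payload mass</td></tr>"
    "<tr><td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td><td>300 kg</td></tr>"
    "<tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td><td>94 kg</td></tr>"
    "</table>"
)
launch_soup = BeautifulSoup(launch_html, "html.parser")

# One inner list per row, one stripped text value per cell
table_data = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in launch_soup.find_all("tr")
]

for row_values in table_data:
    print(row_values)
```

`get_text()` is used instead of `.string` so that the cells wrapping a link still yield their text.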


If we use a list we can match against any item in that list.

In [35]:
list_input = table_bs.find_all(name=['tr', 'td'])
list_input

[<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>,
 <td id="flight">Flight No</td>,
 <td>Launch site</td>,
 <td>Payload mass</td>,
 <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td><td>300 kg</td></tr>,
 <td>1</td>,
 <td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td>,
 <td>300 kg</td>,
 <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>,
 <td>2</td>,
 <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>,
 <td>94 kg</td>,
 <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td><td>80 kg</td></tr>,
 <td>3</td>,
 <td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td>,
 <td>80 kg</td>]

#### Attributes


If a keyword argument is not one of the recognized parameter names, it is turned into a filter on the tag’s attributes. For example, if you pass the <code>id</code> argument, Beautiful Soup filters against each tag’s <code>id</code> attribute. The first <code>td</code> element has an <code>id</code> value of <code>flight</code>, so we can filter on that <code>id</code> value.


In [37]:
table_bs.find_all(id ='flight')

[<td id="flight">Flight No</td>]

We can find all the elements that have links to the Florida Wikipedia page:

In [39]:
list_input = table_bs.find_all( href="https://en.wikipedia.org/wiki/Florida")
list_input

[<a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a>,
 <a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a>]

If we set <code>href</code> to <code>True</code>, the code finds all tags that have an <code>href</code> attribute, regardless of what its value is:

In [40]:
table_bs.find_all(href= True)

[<a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a>,
 <a href="https://en.wikipedia.org/wiki/Texas">Texas</a>,
 <a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a>]

In [42]:
# find all the elements without href value

table_bs.find_all(href = False)

[<table><tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td><td>300 kg</td></tr><tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td><td>80 kg</td></tr></table>,
 <tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>,
 <td id="flight">Flight No</td>,
 <td>Launch site</td>,
 <td>Payload mass</td>,
 <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td><td>300 kg</td></tr>,
 <td>1</td>,
 <td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a></a></a></td>,
 <a></a>,
 <td>300 kg</td>,
 <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>,
 <td>2</td>,
 <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>,
 <td>94 k

Using the object <code>soup</code>, find the element whose <code>id</code> attribute is set to <code>"boldest"</code>:

In [43]:
soup.find_all(id = 'boldest')

[<b id="boldest">Lebron James</b>]
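Attribute filters are not limited to exact strings and `True`. The sketch below shows two further options: a compiled regular expression as the attribute value, and the `attrs` dictionary for attribute names that cannot be written as Python keyword arguments. The markup, including the `data-site` attribute, is made up for illustration:

```python
import re
from bs4 import BeautifulSoup

# Made-up markup; the data-site attribute is hypothetical
links_html = (
    "<a href='https://en.wikipedia.org/wiki/Florida'>Florida</a>"
    "<a href='https://en.wikipedia.org/wiki/Texas'>Texas</a>"
    "<a data-site='pad-39a'>Launch pad</a>"
)
links_soup = BeautifulSoup(links_html, "html.parser")

# Regular expression: matches any href that ends in /wiki/Florida
florida_links = links_soup.find_all("a", href=re.compile(r"/wiki/Florida$"))

# attrs dictionary: needed because data-site is not a valid keyword name
pad_links = links_soup.find_all(attrs={"data-site": "pad-39a"})

print(florida_links)
print(pad_links)
```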

#### string

With <code>string</code> you can search for strings instead of tags. Here we find all the strings that are exactly <code>Florida</code>:

In [44]:
table_bs.find_all(string = 'Florida')

['Florida', 'Florida']
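Because a plain string filter is an exact match, text with stray whitespace is not found. A small sketch with made-up cells (`sites_soup` is a hypothetical name) contrasts the exact filter with a regular expression:

```python
import re
from bs4 import BeautifulSoup

# The second cell has a trailing space (assumed example)
sites_soup = BeautifulSoup(
    "<td>Florida</td><td>Florida </td><td>Texas</td>", "html.parser"
)

exact_matches = sites_soup.find_all(string="Florida")                # exact text only
pattern_matches = sites_soup.find_all(string=re.compile("Florida"))  # any text containing Florida

print(exact_matches)
print(pattern_matches)
```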

### find

The <code>find_all()</code> method scans the entire document looking for results. If you are looking for one element you can use the <code>find()</code> method to find the first element in the document. Consider the following two tables:


In [45]:
%%html
<h3>Rocket Launch </h3>

<p>
<table class='rocket'>
  <tr>
    <td>Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
  </tr>
  <tr>
    <td>1</td>
    <td>Florida</td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td>Texas</td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td>Florida </td>
    <td>80 kg</td>
  </tr>
</table>
</p>
<p>

<h3>Pizza Party  </h3>
  
    
<table class='pizza'>
  <tr>
    <td>Pizza Place</td>
    <td>Orders</td> 
     <td>Slices </td>
   </tr>
  <tr>
    <td>Domino's Pizza</td>
    <td>10</td>
    <td>100</td>
  </tr>
  <tr>
    <td>Little Caesars</td>
    <td>12</td>
    <td >144 </td>
  </tr>
  <tr>
    <td>Papa John's </td>
    <td>15 </td>
    <td>165</td>
  </tr>

0,1,2
Flight No,Launch site,Payload mass
1,Florida,300 kg
2,Texas,94 kg
3,Florida,80 kg

0,1,2
Pizza Place,Orders,Slices
Domino's Pizza,10,100
Little Caesars,12,144
Papa John's,15,165


We store the HTML as a Python string and assign it to <code>two_tables</code>:

In [46]:
two_tables="<h3>Rocket Launch </h3><p><table class='rocket'><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table></p><p><h3>Pizza Party  </h3><table class='pizza'><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td >144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr>"

We create a <code>BeautifulSoup</code> object named <code>two_table_bs</code>:

In [47]:
two_table_bs = BeautifulSoup(two_tables, 'html.parser')

We can find the first table using the tag name <code>table</code>:

In [48]:
two_table_bs.find('table')

<table class="rocket"><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table>

Compare with the <code>find_all()</code> method

In [49]:
two_table_bs.find_all('table')

[<table class="rocket"><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table>,
 <table class="pizza"><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td>144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr></table>]

We can filter on the <code>class</code> attribute to find the second table. Because <code>class</code> is a reserved keyword in Python, we append an underscore and use <code>class_</code>:

In [50]:
two_table_bs.find('table', class_='pizza')

<table class="pizza"><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td>144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr></table>
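Unlike `find_all()`, which returns an empty list when nothing matches, `find()` returns `None`, so it is worth checking the result before using it. A minimal sketch with two empty stand-in tables (`menu_soup` and the `burger` class are hypothetical):

```python
from bs4 import BeautifulSoup

menu_soup = BeautifulSoup(
    "<table class='rocket'></table><table class='pizza'></table>", "html.parser"
)

pizza_table = menu_soup.find("table", class_="pizza")
same_table = menu_soup.find("table", attrs={"class": "pizza"})  # equivalent spelling
missing_table = menu_soup.find("table", class_="burger")        # no such table

print(pizza_table["class"])   # class is multi-valued, so this is a list
print(missing_table)

if missing_table is None:
    print("no table with class 'burger'")
```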

<h3 id="DSCW">Downloading And Scraping The Contents Of A Web Page</h3> 

First we set the URL of the web page we want to download:

In [51]:
url = 'https://www.ibm.com'

We use <code>requests.get()</code> to download the contents of the web page in text format and store them in a variable called <code>data</code>:

In [52]:
data = requests.get(url).text

We create a <code>BeautifulSoup</code> object using the <code>BeautifulSoup</code> constructor


In [53]:
soup = BeautifulSoup(data, 'html.parser') #create a soup object using variable 'data'

#### Scrape all links

In [54]:
for link in soup.find_all('a',href=True):  # in html anchor/link is represented by the tag <a>

    print(link.get('href'))

https://www.ibm.com/ke/en
https://www.ibm.com/sitemap/ke/en
https://www.ibm.com/it-infrastructure/power?lnk=ushpv18l1#3086132
https://www.ibm.com/about/secure-your-business/
https://www.ibm.com/cloud/campaign/cloud-simplicity
https://www.ibm.com/analytics/data-fabric
https://www.ibm.com/cloud/aiops
https://www.ibm.com/consulting/
/products/offers-and-discounts?lnk=hpv18t5
/cloud/free?lnk=hpv18t1
/products/cloud-pak-for-data?lnk=hpv18t2
/cloud/watson-assistant?lnk=hpv18t3
/security/identity-access-management/cloud-identity?lnk=hpv18t4
https://developer.ibm.com/depmodels/cloud/?lnk=hpv18ct16
https://developer.ibm.com/technologies/artificial-intelligence?lnk=hpv18ct19
https://developer.ibm.com/?lnk=hpv18ct9
https://www.ibm.com/docs/en?lnk=hpv18ct14
https://www.redbooks.ibm.com/?lnk=ushpv18ct10
https://www.ibm.com/support/home/?lnk=hpv18ct11
https://www.ibm.com/training/?lnk=hpv18ct15
/cloud/hybrid?lnk=hpv18pt14
/cloud/learn/public-cloud?lnk=hpv18ct1
/watson?lnk=ushpv18pt17
/garage?lnk=hpv
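The output mixes absolute URLs with site-relative paths such as `/cloud/free?lnk=hpv18t1`. One way to normalize them, sketched here with the standard library's `urljoin` and a small made-up snippet of markup instead of the live page (`page_soup` and `absolute_links` are hypothetical names):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://www.ibm.com"

# Made-up stand-in for part of the downloaded page
page_html = (
    "<a href='https://www.ibm.com/consulting/'>Consulting</a>"
    "<a href='/cloud/free?lnk=hpv18t1'>Free tier</a>"
    "<a>No link</a>"
)
page_soup = BeautifulSoup(page_html, "html.parser")

# Resolve every href against the base URL; absolute links pass through unchanged
absolute_links = [
    urljoin(base_url, anchor["href"])
    for anchor in page_soup.find_all("a", href=True)
]

for link in absolute_links:
    print(link)
```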

#### Scrape all image Tags

In [57]:
for link in soup.find_all('img'): # in html image is represented by the tag <img> 
    print(link)
    print(link.get('src'))

<img alt="A technician stands beside the Power10 frame" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2022-07/4E-Power10-1000x1000_1.jpg"/>
//1.cms.s81c.com/sites/default/files/2022-07/4E-Power10-1000x1000_1.jpg
<img alt="A developer works at his station" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2022-02-16/security-five-levers-444x254_8.jpg"/>
//1.cms.s81c.com/sites/default/files/2022-02-16/security-five-levers-444x254_8.jpg
<img alt="Two medical engineers review data" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2022-02-16/cloud-five-levers-444x254_8.jpg"/>
//1.cms.s81c.com/sites/default/files/2022-02-16/cloud-five-levers-444x254_8.jpg
<img alt="Oranges on a conveyor belt" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2022-02-16/data-fabric-five-levers-444x254_8.jpg"/>
//1.cms.s81c.com/sites/default/files/2022-02-16/data-fabric-five-levers-444x254_8.jpg
<img alt="A technician works on the IBM Quantum

#### Scrape Data for HTML Tables

In [70]:
#The below url contains an html table with data about colors and color codes.
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"

# open the url in a browser and examine the contents of the tables

In [60]:
# get the contents of the webpage in text format and store in a variable called data1
data1 = requests.get(url).text

In [61]:
soup = BeautifulSoup(data1, 'html.parser')

In [76]:
#find a html table in the web page
table = soup.find('table') # in html table is represented by the tag <table>

In [66]:
# Get all rows from the table
for row in table.find_all('tr'): # in HTML a table row is represented by the tag <tr>
    # Get all cells in each row.
    cols = row.find_all('td') # in HTML a table cell is represented by the tag <td>
    color_name = cols[2].string # store the value in column 3 as color_name
    color_code = cols[3].string # store the value in column 4 as color_code
    print("{} ------>{}".format(color_name, color_code))

Color Name ------>None
lightsalmon ------>#FFA07A
salmon ------>#FA8072
darksalmon ------>#E9967A
lightcoral ------>#F08080
coral ------>#FF7F50
tomato ------>#FF6347
orangered ------>#FF4500
gold ------>#FFD700
orange ------>#FFA500
darkorange ------>#FF8C00
lightyellow ------>#FFFFE0
lemonchiffon ------>#FFFACD
papayawhip ------>#FFEFD5
moccasin ------>#FFE4B5
peachpuff ------>#FFDAB9
palegoldenrod ------>#EEE8AA
khaki ------>#F0E68C
darkkhaki ------>#BDB76B
yellow ------>#FFFF00
lawngreen ------>#7CFC00
chartreuse ------>#7FFF00
limegreen ------>#32CD32
lime ------>#00FF00
forestgreen ------>#228B22
green ------>#008000
powderblue ------>#B0E0E6
lightblue ------>#ADD8E6
lightskyblue ------>#87CEFA
skyblue ------>#87CEEB
deepskyblue ------>#00BFFF
lightsteelblue ------>#B0C4DE
dodgerblue ------>#1E90FF
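Note that the first line of the output comes from the header row, where the code column came back as `None`. A hedged sketch of a more defensive loop follows. The table here is a made-up stand-in whose layout is only assumed to resemble the real page, and `colors_soup` and `color_codes` are hypothetical names:

```python
from bs4 import BeautifulSoup

# Made-up stand-in; the real page's exact markup is not reproduced here
colors_html = (
    "<table>"
    "<tr><td>Number</td><td>Color</td><td>Color Name</td><td>Hex Code<br/>#RRGGBB</td></tr>"
    "<tr><td>1</td><td></td><td>lightsalmon</td><td>#FFA07A</td></tr>"
    "<tr><td>2</td><td></td><td>salmon</td><td>#FA8072</td></tr>"
    "<tr><td>footer note</td></tr>"
    "</table>"
)
colors_soup = BeautifulSoup(colors_html, "html.parser")

color_codes = {}
rows = colors_soup.find("table").find_all("tr")
for row in rows[1:]:                 # skip the header row
    cols = row.find_all("td")
    if len(cols) < 4:                # skip rows that lack the expected cells
        continue
    name = cols[2].get_text(strip=True)
    code = cols[3].get_text(strip=True)
    color_codes[name] = code

print(color_codes)
```

Collecting the pairs in a dictionary, rather than printing them, makes the result easy to reuse later in the notebook.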


### Scrape data from HTML tables into a DataFrame using BeautifulSoup and Pandas

In [67]:
import pandas as pd

In [71]:
#The below url contains html tables with data about world population.
url = "https://en.wikipedia.org/wiki/World_population" 

# open the url in a browser and examine the contents of the tables

In [72]:
# get the contents of the webpage in text format and store in a variable called data2

data2 = requests.get(url).text

In [73]:
soup = BeautifulSoup(data2, 'html.parser')

In [74]:
# find all html tables in the web page
tables = soup.find_all('table') # in html table is represented by the tag <table>

In [75]:
# we can see how many tables were found by checking the length of the tables list
len(tables)

25

Assume that we are looking for the `10 most densely populated countries` table. We can look through the tables list and find the right one based on the data in each table, or we can search for the table name if it appears in the table, although this option might not always work.

In [80]:
for index,table in enumerate(tables):
    if ("10 most densely populated countries" in str(table)):
        table_index = index
print(table_index)

5
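The loop above searches the whole table markup, keeps the last match, and leaves `table_index` undefined if nothing matches. As an alternative sketch, the following checks only each table's `caption` and stops at the first hit. The two tables are made-up stand-ins, and `pop_soup`, `wanted`, and `match_index` are hypothetical names:

```python
from bs4 import BeautifulSoup

# Two made-up tables standing in for the Wikipedia page
pop_html = (
    "<table><caption>Population by region</caption>"
    "<tr><td>Asia</td></tr></table>"
    "<table><caption>10 most densely populated countries "
    "<small>(with population above 5 million)</small></caption>"
    "<tr><td>Singapore</td></tr></table>"
)
pop_soup = BeautifulSoup(pop_html, "html.parser")

wanted = "10 most densely populated countries"
match_index = None  # stays None if no caption matches

for index, candidate in enumerate(pop_soup.find_all("table")):
    caption = candidate.find("caption")
    if caption is not None and wanted in caption.get_text():
        match_index = index
        break  # stop at the first matching table

print(match_index)
```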


See if you can locate the name of the table, 10 most densely populated countries, in the output below.

In [82]:
print(tables[table_index].prettify())

<table class="wikitable sortable" style="text-align:right">
 <caption>
  10 most densely populated countries
  <small>
   (with population above 5 million)
  </small>
 </caption>
 <tbody>
  <tr>
   <th>
    Rank
   </th>
   <th>
    Country
   </th>
   <th>
    Population
   </th>
   <th>
    Area
    <br/>
    <small>
     (km
     <sup>
      2
     </sup>
     )
    </small>
   </th>
   <th>
    Density
    <br/>
    <small>
     (pop/km
     <sup>
      2
     </sup>
     )
    </small>
   </th>
  </tr>
  <tr>
   <td>
    1
   </td>
   <td align="left">
    <span class="flagicon">
     <img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/4/48/Flag_of_Singapore.svg/23px-Flag_of_Singapore.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/4/48/Flag_of_Singapore.svg/35px-Flag_of_Singapore.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/4/48/Flag_of_Singapo

In [83]:
# DataFrame.append was removed in pandas 2.0, so collect the rows in a list first
records = []

for row in tables[table_index].tbody.find_all("tr"):
    col = row.find_all("td")
    if col:  # skip the header row, which contains only <th> cells
        records.append({
            "Rank": col[0].text,
            "Country": col[1].text,
            "Population": col[2].text.strip(),
            "Area": col[3].text.strip(),
            "Density": col[4].text.strip(),
        })

population_data = pd.DataFrame(records, columns=["Rank", "Country", "Population", "Area", "Density"])

population_data

Unnamed: 0,Rank,Country,Population,Area,Density
0,1,Singapore,5704000,710,8033
1,2,Bangladesh,173200000,143998,1203
2,3,\n Palestine\n\n,5266785,6020,847
3,4,Lebanon,6856000,10452,656
4,5,Taiwan,23604000,36193,652
5,6,South Korea,51781000,99538,520
6,7,Rwanda,12374000,26338,470
7,8,Haiti,11578000,27065,428
8,9,Netherlands,17730000,41526,427
9,10,Israel,9560000,22072,433
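Every value scraped this way is still a string, and some carry stray whitespace, as the `\n Palestine\n\n` entry shows. A minimal cleanup sketch on two illustrative rows (the sample values are copied from the output above purely as a stand-in, and `raw_rows` and `cleaned` are hypothetical names):

```python
import pandas as pd

# Illustrative rows shaped like the scraped data; all values start as strings
raw_rows = [
    {"Rank": "1", "Country": "Singapore", "Population": "5704000"},
    {"Rank": "3", "Country": "\n Palestine\n\n", "Population": "5266785"},
]
cleaned = pd.DataFrame(raw_rows)

# Remove surrounding whitespace from the text column
cleaned["Country"] = cleaned["Country"].str.strip()

# Convert the numeric columns from strings to numbers
for column in ["Rank", "Population"]:
    cleaned[column] = pd.to_numeric(cleaned[column])

print(cleaned)
print(cleaned.dtypes)
```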


In [109]:
for index,table in enumerate(tables):
    if ("Population by region (2020 estimates)" in str(table)):
        table_index_ = index
print(table_index_)

1


In [111]:
print(tables[table_index_].prettify())

<table class="wikitable sortable">
 <caption>
  Population by region (2020 estimates)
 </caption>
 <tbody>
  <tr>
   <th>
    Region
   </th>
   <th>
    Density
    <br/>
    <small>
     (inhabitants/km
     <sup>
      2
     </sup>
     )
    </small>
   </th>
   <th>
    Population
    <br/>
    <small>
     (millions)
    </small>
   </th>
   <th>
    Most populous country
   </th>
   <th>
    Most populous city (metropolitan area)
   </th>
  </tr>
  <tr>
   <td>
    Asia
   </td>
   <td style="text-align:right">
    104.1
   </td>
   <td style="text-align:right">
    4,641
   </td>
   <td>
    1,411,778,000 –
    <span class="flagicon">
     <img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/23px-Flag_of_the_People%27s_Republic_of_China.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%2

In [136]:
# DataFrame.append was removed in pandas 2.0, so collect the rows in a list first
region_records = []

for row in tables[table_index_].tbody.find_all("tr"):
    col = row.find_all("td")
    if col:  # skip the header row, which contains only <th> cells
        region_records.append({
            "Region": col[0].text,
            "Density": col[1].text.strip(),
            "Population": col[2].text.strip(),
            "Most Populous country": col[3].text.strip(),
            "Most populous city (metropolitan area)": col[4].text.strip(),
        })

global_population_data = pd.DataFrame(region_records, columns=["Region", "Density", "Population", "Most Populous country", "Most populous city (metropolitan area)"])

global_population_data

Unnamed: 0,Region,Density,Population,Most Populous country,Most populous city (metropolitan area)
0,Asia\n,104.1,4641,"1,411,778,000 – China[note 1]","13,515,000 – Tokyo Metropolis(37,400,000 – G..."
1,Africa\n,44.4,1340,"0,211,401,000 – Nigeria","09,500,000 – Cairo(20,076,000 – Greater Cairo)"
2,Europe\n,73.4,747,"0,146,171,000 – Russia, approx. 110 million i...","13,200,000 – Moscow(20,004,000 – Moscow metr..."
3,Latin America\n,24.1,653,"0,214,103,000 – Brazil","12,252,000 – São Paulo City(21,650,000 – São..."
4,Northern America[note 2]\n,14.9,368,"0,332,909,000 – United States","08,804,000 – New York City(23,582,649 – New ..."
5,Oceania\n,5,42,"0,025,917,000 – Australia","05,367,000 – Sydney"
6,Antarctica\n,~0,0.004[16],N/A[note 3],"00,001,258 – McMurdo Station"


### Scrape data from HTML tables into a DataFrame using BeautifulSoup and read_html

Using the same URL, data, soup, and tables objects as in the last section, we can use the pandas `read_html` function to create a DataFrame.

Remember that the tables we need are located at `tables[table_index]` and `tables[table_index_]`, which are `tables[5]` and `tables[1]` here.

We can now use the pandas function `read_html`, passing it the string version of the table and the `flavor` argument, which selects the parsing engine (`bs4` here).

In [139]:
pd.read_html(str(tables[5]), flavor = 'bs4')

[   Rank      Country  Population  Area(km2)  Density(pop/km2)
 0     1    Singapore     5704000        710              8033
 1     2   Bangladesh   173200000     143998              1203
 2     3    Palestine     5266785       6020               847
 3     4      Lebanon     6856000      10452               656
 4     5       Taiwan    23604000      36193               652
 5     6  South Korea    51781000      99538               520
 6     7       Rwanda    12374000      26338               470
 7     8        Haiti    11578000      27065               428
 8     9  Netherlands    17730000      41526               427
 9    10       Israel     9560000      22072               433]

In [140]:
pd.read_html(str(tables[1]), flavor = 'bs4')

[                     Region Density(inhabitants/km2) Population(millions)  \
 0                      Asia                    104.1                 4641   
 1                    Africa                     44.4                 1340   
 2                    Europe                     73.4                  747   
 3             Latin America                     24.1                  653   
 4  Northern America[note 2]                     14.9                  368   
 5                   Oceania                        5                   42   
 6                Antarctica                       ~0            0.004[16]   
 
                                Most populous country  \
 0                      1,411,778,000 – China[note 1]   
 1                            0,211,401,000 – Nigeria   
 2  0,146,171,000 – Russia, approx. 110 million in...   
 3                             0,214,103,000 – Brazil   
 4                      0,332,909,000 – United States   
 5                          0,02

The function read_html always returns a list of DataFrames so we must pick the one we want out of the list.

In [141]:
population_data_read_html = pd.read_html(str(tables[5]), flavor='bs4')[0]

population_data_read_html

Unnamed: 0,Rank,Country,Population,Area(km2),Density(pop/km2)
0,1,Singapore,5704000,710,8033
1,2,Bangladesh,173200000,143998,1203
2,3,Palestine,5266785,6020,847
3,4,Lebanon,6856000,10452,656
4,5,Taiwan,23604000,36193,652
5,6,South Korea,51781000,99538,520
6,7,Rwanda,12374000,26338,470
7,8,Haiti,11578000,27065,428
8,9,Netherlands,17730000,41526,427
9,10,Israel,9560000,22072,433


In [142]:
global_population_data_read_html = pd.read_html(str(tables[1]), flavor='bs4')[0]

global_population_data_read_html

Unnamed: 0,Region,Density(inhabitants/km2),Population(millions),Most populous country,Most populous city (metropolitan area)
0,Asia,104.1,4641,"1,411,778,000 – China[note 1]","13,515,000 – Tokyo Metropolis(37,400,000 – Gre..."
1,Africa,44.4,1340,"0,211,401,000 – Nigeria","09,500,000 – Cairo(20,076,000 – Greater Cairo)"
2,Europe,73.4,747,"0,146,171,000 – Russia, approx. 110 million in...","13,200,000 – Moscow(20,004,000 – Moscow metrop..."
3,Latin America,24.1,653,"0,214,103,000 – Brazil","12,252,000 – São Paulo City(21,650,000 – São P..."
4,Northern America[note 2],14.9,368,"0,332,909,000 – United States","08,804,000 – New York City(23,582,649 – New Yo..."
5,Oceania,5,42,"0,025,917,000 – Australia","05,367,000 – Sydney"
6,Antarctica,~0,0.004[16],N/A[note 3],"00,001,258 – McMurdo Station"
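Newer pandas releases warn when a literal HTML string is passed to `read_html` and ask for a file-like object instead. A hedged sketch of that pattern, wrapping a small made-up table in `io.StringIO` (the names `density_html` and `density_frame` are hypothetical, and `flavor="bs4"` assumes the libraries installed at the top of this lab are available):

```python
from io import StringIO

import pandas as pd

# Small made-up table standing in for str(tables[5])
density_html = """<table>
<tr><th>Rank</th><th>Country</th><th>Population</th></tr>
<tr><td>1</td><td>Singapore</td><td>5704000</td></tr>
<tr><td>2</td><td>Bangladesh</td><td>173200000</td></tr>
</table>"""

# Wrap the markup in StringIO so read_html receives a file-like object
frames = pd.read_html(StringIO(density_html), flavor="bs4")
density_frame = frames[0]  # read_html returns a list, even for a single table

print(density_frame)
```

The same wrapping would apply to `str(tables[5])` in the cells above.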


### Scrape data from HTML tables into a DataFrame using read_html

We can also use the `read_html` function to directly get DataFrames from a `url`.

In [143]:
dataframe_list = pd.read_html(url, flavor='bs4')

In [144]:
len(dataframe_list)

25

We can see there are 25 DataFrames just like when we used `find_all` on the `soup` object.

We can pick the DataFrame we need out of the list.

In [145]:
dataframe_list[5]

Unnamed: 0,Rank,Country,Population,Area(km2),Density(pop/km2)
0,1,Singapore,5704000,710,8033
1,2,Bangladesh,173200000,143998,1203
2,3,Palestine,5266785,6020,847
3,4,Lebanon,6856000,10452,656
4,5,Taiwan,23604000,36193,652
5,6,South Korea,51781000,99538,520
6,7,Rwanda,12374000,26338,470
7,8,Haiti,11578000,27065,428
8,9,Netherlands,17730000,41526,427
9,10,Israel,9560000,22072,433


In [146]:
dataframe_list[1]

Unnamed: 0,Region,Density(inhabitants/km2),Population(millions),Most populous country,Most populous city (metropolitan area)
0,Asia,104.1,4641,"1,411,778,000 – China[note 1]","13,515,000 – Tokyo Metropolis(37,400,000 – Gre..."
1,Africa,44.4,1340,"0,211,401,000 – Nigeria","09,500,000 – Cairo(20,076,000 – Greater Cairo)"
2,Europe,73.4,747,"0,146,171,000 – Russia, approx. 110 million in...","13,200,000 – Moscow(20,004,000 – Moscow metrop..."
3,Latin America,24.1,653,"0,214,103,000 – Brazil","12,252,000 – São Paulo City(21,650,000 – São P..."
4,Northern America[note 2],14.9,368,"0,332,909,000 – United States","08,804,000 – New York City(23,582,649 – New Yo..."
5,Oceania,5,42,"0,025,917,000 – Australia","05,367,000 – Sydney"
6,Antarctica,~0,0.004[16],N/A[note 3],"00,001,258 – McMurdo Station"


In [149]:
dataframe_list[4]

Unnamed: 0,Rank,Country / Dependency,Population,Percentage of the world,Date,Source (official or from the United Nations)
0,1,China,1412600000,17.7%,31 Dec 2021,National annual estimate[93]
1,2,India,1373761000,17.2%,1 Mar 2022,Annual national estimate[94]
2,3,United States,332967888,4.18%,8 Aug 2022,National population clock[95]
3,4,Indonesia,272248500,3.42%,1 Jul 2021,National annual estimate[96]
4,5,Pakistan,229488994,2.88%,1 Jul 2022,UN projection[97]
5,6,Nigeria,216746934,2.72%,1 Jul 2022,UN projection[97]
6,7,Brazil,214985656,2.70%,8 Aug 2022,National population clock[98]
7,8,Bangladesh,168220000,2.11%,1 Jul 2020,Annual Population Estimate[99]
8,9,Russia,147190000,1.85%,1 Oct 2021,2021 preliminary census results[100]
9,10,Mexico,128271248,1.61%,31 Mar 2022,National quarterly estimate[101]


We can also use the `match` parameter to select the specific table we want. If the table contains a string matching the text it will be read.

In [153]:
pd.read_html(url, match="Most populous countries", flavor='bs4')[0]

Unnamed: 0,#,Most populous countries,2000,2015,2030[A]
0,1,China[B],1270,1376,1416
1,2,India,1053,1311,1528
2,3,United States,283,322,356
3,4,Indonesia,212,258,295
4,5,Pakistan,136,208,245
5,6,Brazil,176,206,228
6,7,Nigeria,123,182,263
7,8,Bangladesh,131,161,186
8,9,Russia,146,146,149
9,10,Mexico,103,127,148
