## Objectives
* Interacting with web APIs to collect data
* Data Extraction from HTML/XML Sources to DataFrames

In [1]:
# required libraries 

!pip install bs4
!pip install lxml
!pip install html5lib
!pip install requests



The commands above install the Python libraries used in this lab, each of which serves a different purpose in web scraping and data processing. Here's a brief overview of what each library does:

#### Beautiful Soup (bs4):
* Purpose: Beautiful Soup is a Python library for parsing HTML and XML documents. It creates a parse tree that makes it easy to extract data. It's widely used for web scraping, i.e., pulling data out of HTML and XML files. Beautiful Soup provides methods for navigating, searching, and modifying the parse tree, making it well suited to complex scraping tasks.

#### lxml:
* Purpose: The lxml library is a high-performance, easy-to-use library for processing XML and HTML in the Python language. It sits on top of the libxml2 and libxslt libraries, offering Pythonic bindings to these libraries. It's known for its speed and ability to handle large XML documents efficiently. It also provides support for XPath, XSLT, and schema validation, making it suitable for complex XML processing tasks beyond web scraping.

#### html5lib:
* Purpose: html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, the same specification that major web browsers implement. Because it applies the same error-recovery rules as a browser, even poorly formatted HTML can be interpreted and manipulated consistently.

#### requests:
* Purpose: The requests library is a simple, yet powerful, HTTP library for Python. It allows you to send HTTP/1.1 requests easily, without the need for manually adding query strings to your URLs, or form-encoding your POST data. It's highly regarded for its ease of use and its ability to abstract away the complexities of making requests. It also supports sessions with cookie persistence, browser-style SSL verification, and automatic content decoding, among other features.

In [2]:
#import the required modules and functions

from bs4 import BeautifulSoup  # this module helps with web scraping
import requests  # this module helps us download a web page

## Beautiful Soup Objects

Beautiful Soup is a Python library for parsing HTML files. It allows you to navigate HTML as a tree structure and extract specific data from it.
Consider the following HTML:

In [3]:
%%html
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h3><b id='boldest'>Lebron James</b></h3>
<p> Salary: $ 92,000,000 </p>
<h3> Stephen Curry</h3>
<p> Salary: $85,000, 000 </p>
<h3> Kevin Durant </h3>
<p> Salary: $73,200, 000</p>
</body>
</html>

In [4]:
# store the HTML as a string in the variable html

html="<!DOCTYPE html><html><head><title>Page Title</title></head><body><h3><b id='boldest'>Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body></html>"

To parse a document, pass it into the `BeautifulSoup` constructor. The constructor returns a `BeautifulSoup` object, which represents the document as a nested data structure:


In [5]:
soup = BeautifulSoup(html, "html.parser")

Beautiful Soup converts the document to Unicode (a character standard of which ASCII is a subset) and replaces HTML entities with Unicode characters. It then transforms the HTML document into a tree of Python objects. The main objects we'll focus on in this lab are `BeautifulSoup`, `Tag`, and `NavigableString`.
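
As a small illustration of the entity handling described above, the sketch below parses a made-up snippet (the snippet and the variable names are illustrative and not part of the lab) and shows that the entities come back as ordinary Unicode characters:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet containing two HTML entities
snippet = "<p>Caf&eacute; &amp; Bar</p>"
snippet_soup = BeautifulSoup(snippet, "html.parser")

# The entities are decoded while the document is parsed
decoded_text = str(snippet_soup.p.string)
print(decoded_text)
```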

In [6]:
# prettify() function displays the HTML in the nested structure:

print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Page Title
  </title>
 </head>
 <body>
  <h3>
   <b id="boldest">
    Lebron James
   </b>
  </h3>
  <p>
   Salary: $ 92,000,000
  </p>
  <h3>
   Stephen Curry
  </h3>
  <p>
   Salary: $85,000, 000
  </p>
  <h3>
   Kevin Durant
  </h3>
  <p>
   Salary: $73,200, 000
  </p>
 </body>
</html>



## Tags

Let's say we want the title of the page and the name of the top-paid player. We can use the `Tag` object, which corresponds to an HTML tag in the original document, for example, the tag `title`.

In [7]:
tag_object = soup.title
print("tag object:",tag_object)

tag object: <title>Page Title</title>


In [9]:
# we can see the tag type is `bs4.element.Tag`

print("tag object type:",type(tag_object))

tag object type: <class 'bs4.element.Tag'>


If there is more than one `Tag` with the same name, the first element with that `Tag` name is returned. Here it corresponds to the highest-paid player:


In [10]:
tag_object = soup.h3
tag_object

<h3><b id="boldest">Lebron James</b></h3>

The name is enclosed in the bold tag `b`, so it helps to use the tree representation. We can navigate down the tree using a child attribute to get the name.

### Children, Parents, and Siblings

As stated above, the `Tag` object is part of a tree of objects. We can access the child of the tag, or navigate down the branch, as follows:

In [12]:
tag_child = tag_object.b
tag_child

<b id="boldest">Lebron James</b>

In [13]:
# you can access the parent with the parent attribute
parent_tag = tag_child.parent
parent_tag

<h3><b id="boldest">Lebron James</b></h3>

In [17]:
# parent_tag is the same element as tag_object

print('tag_object: ', tag_object)

# tag_object parent is the body element.
print('tag_object parent: ' ,tag_object.parent)

tag_object:  <h3><b id="boldest">Lebron James</b></h3>
tag_object parent:  <body><h3><b id="boldest">Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body>


In [18]:
# tag_object sibling is the paragraph element

sibling_1=tag_object.next_sibling
sibling_1

<p> Salary: $ 92,000,000 </p>

In [19]:
# sibling_2 is the header element which is also a sibling of both sibling_1 and tag_object

sibling_2 = sibling_1.next_sibling
sibling_2

<h3> Stephen Curry</h3>

#### Exercise: `next_sibling`
* Use the object `sibling_2` and the property `next_sibling` to find the salary of Stephen Curry:

In [20]:
# finding the next sibling

sibling_3 = sibling_2.next_sibling
sibling_3

<p> Salary: $85,000, 000 </p>
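
The HTML string in this lab has no whitespace between tags, so `next_sibling` lands directly on the next tag. When the markup contains line breaks between tags, `next_sibling` can return the whitespace string instead. The sketch below uses a made-up snippet (the variable names are illustrative) to show this, along with `find_next_sibling()`, which skips over text nodes:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with a newline between the two tags
html_with_newlines = "<body><h3>Stephen Curry</h3>\n<p>Salary: $85,000,000</p></body>"
soup_nl = BeautifulSoup(html_with_newlines, "html.parser")

h3_tag = soup_nl.h3
raw_sibling = h3_tag.next_sibling            # the newline between the tags
salary_tag = h3_tag.find_next_sibling("p")   # the next sibling that is a <p> tag

print(repr(raw_sibling))
print(salary_tag.string)
```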

## HTML Attributes
* A tag may have any number of attributes. The tag `<b id="boldest">` has an attribute `id` whose value is `boldest`. You can access a tag’s attributes by treating the tag like a dictionary:

In [21]:
tag_child['id']

'boldest'

In [22]:
# you can access that dictionary directly as attrs:

tag_child.attrs

{'id': 'boldest'}

You can also work with multi-valued attributes; check out <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0220ENSkillsNetwork23455606-2021-01-01">\[1]</a> for more.

We can also obtain the content of an attribute of the `Tag` using the `get()` method.

In [23]:
tag_child.get('id')

'boldest'
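
The difference between dictionary-style access and `get()` shows up when an attribute is missing. The sketch below uses a made-up tag (the extra `class` attribute and the variable names are assumptions added for illustration) to show both behaviors, and also how a multi-valued attribute such as `class` is returned:

```python
from bs4 import BeautifulSoup

# Hypothetical tag with a multi-valued class attribute
demo_tag = BeautifulSoup(
    "<b id='boldest' class='name top'>Lebron James</b>", "html.parser"
).b

classes = demo_tag["class"]                  # multi-valued attributes come back as a list
missing = demo_tag.get("href")               # get() returns None when the attribute is absent
fallback = demo_tag.get("href", "no link")   # or a default value that you supply

try:
    demo_tag["href"]                         # dictionary-style access raises KeyError instead
    raised_key_error = False
except KeyError:
    raised_key_error = True
```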

## Navigable String
* A string corresponds to a bit of text or content within a tag. Beautiful Soup uses the `NavigableString` class to contain this text. In our HTML we can obtain the name of the first player by extracting the string of the `Tag` object `tag_child` as follows:

In [24]:
tag_string = tag_child.string
tag_string

'Lebron James'

In [25]:
# we can verify the type is Navigable String

type(tag_string)

bs4.element.NavigableString

A `NavigableString` is just like a Python string, or a Unicode string to be more precise. The main difference is that it also supports some `BeautifulSoup` features. We can convert it to a plain Python string object:

In [26]:
unicode_string = str(tag_string)
unicode_string

'Lebron James'

## Filter

* Filters allow you to find complex patterns; the simplest filter is a string. In this section we will pass a string to different filter methods, and Beautiful Soup will perform a match against that exact string. Consider the following HTML of rocket launches:

In [27]:
%%html
<table>
  <tr>
    <td id='flight' >Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
   </tr>
  <tr> 
    <td>1</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a> </td>
    <td>80 kg</td>
  </tr>
</table>

| Flight No | Launch site | Payload mass |
| --- | --- | --- |
| 1 | Florida | 300 kg |
| 2 | Texas | 94 kg |
| 3 | Florida | 80 kg |


In [28]:
# storing the HTML as a string in the variable table:

table = "<table><tr><td id='flight' >Flight No</td><td>Launch site</td><td>Payload mass</td></tr><tr><td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td><td>300 kg</td></tr><tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a> </td><td>80 kg</td></tr></table>"
table_bs = BeautifulSoup(table, "html.parser")

## find_all

* The `find_all()` method looks through a tag’s descendants and retrieves all descendants that match your filters.

* The method signature is `find_all(name, attrs, recursive, string, limit, **kwargs)`.
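
Two of the parameters in this signature, `limit` and `recursive`, are not used elsewhere in this lab. The sketch below shows what they do on a made-up one-row table (the table and the variable names are illustrative only):

```python
from bs4 import BeautifulSoup

demo_table = BeautifulSoup(
    "<table><tr><td>1</td><td>Florida</td><td>300 kg</td></tr></table>",
    "html.parser",
).table

# limit stops the search after the given number of matches
first_two_cells = demo_table.find_all("td", limit=2)

# recursive=False only looks at direct children of the tag
direct_cells = demo_table.find_all("td", recursive=False)  # <td> tags sit inside <tr>, not directly in <table>
direct_rows = demo_table.find_all("tr", recursive=False)

print(len(first_two_cells), len(direct_cells), len(direct_rows))
```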

## Name 
* When we set the `name` parameter to a tag name, the method will extract all the tags with that name, together with their children.

In [29]:
table_rows = table_bs.find_all('tr')
table_rows

[<tr><td id="flight">Flight No</td><td>Launch site</td><td>Payload mass</td></tr>,
 <tr><td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td><td>300 kg</td></tr>,
 <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>,
 <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a> </td><td>80 kg</td></tr>]

The result is a Python iterable that behaves like a list; each element is a `Tag` object:

In [30]:
first_row = table_rows[0]
first_row

<tr><td id="flight">Flight No</td><td>Launch site</td><td>Payload mass</td></tr>

The type is `Tag`:

In [31]:
print(type(first_row))

<class 'bs4.element.Tag'>


In [33]:
# we can obtain the child

first_row.td

<td id="flight">Flight No</td>

In [34]:
# if we iterate through the list, each element corresponds to a row in the table:

for i,row in enumerate(table_rows):
    print("row",i,"is",row)

row 0 is <tr><td id="flight">Flight No</td><td>Launch site</td><td>Payload mass</td></tr>
row 1 is <tr><td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td><td>300 kg</td></tr>
row 2 is <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>
row 3 is <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a> </td><td>80 kg</td></tr>


As <code>row</code> is a <code>Tag</code> object, we can apply the method <code>find_all</code> to it and extract the table cells into the object <code>cells</code> using the tag name <code>td</code>; this returns all the descendants with the name <code>td</code>. The result is a list in which each element corresponds to a cell and is a <code>Tag</code> object, so we can iterate through this list as well. We can extract the content of a cell using the <code>string</code> attribute.


In [35]:
for i,row in enumerate(table_rows):
    print("row",i)
    cells = row.find_all('td')
    for j,cell in enumerate(cells):
        print('column',j,"cell",cell)

row 0
column 0 cell <td id="flight">Flight No</td>
column 1 cell <td>Launch site</td>
column 2 cell <td>Payload mass</td>
row 1
column 0 cell <td>1</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>
column 2 cell <td>300 kg</td>
row 2
column 0 cell <td>2</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>
column 2 cell <td>94 kg</td>
row 3
column 0 cell <td>3</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a> </td>
column 2 cell <td>80 kg</td>
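
The loop above prints whole `Tag` objects. To collect only the text of each cell, one option is `get_text()`, sketched below on a shortened, made-up version of the table (the variable names are illustrative). Note that the `string` attribute would return `None` for a cell such as `<td><a ...>Florida</a> </td>`, because that cell has two children (the link and a trailing space), whereas `get_text()` gathers the nested text:

```python
from bs4 import BeautifulSoup

demo_html = (
    "<table><tr><td>Flight No</td><td>Launch site</td></tr>"
    "<tr><td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a> </td></tr></table>"
)
demo_bs = BeautifulSoup(demo_html, "html.parser")

rows_as_text = []
for row in demo_bs.find_all("tr"):
    # get_text(strip=True) collects text inside child tags and trims whitespace
    rows_as_text.append([cell.get_text(strip=True) for cell in row.find_all("td")])

print(rows_as_text)
```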


In [40]:
# if we use a list we can match against any item in that list

list_input = table_bs.find_all(name=["tr", "td"])
list_input

[<tr><td id="flight">Flight No</td><td>Launch site</td><td>Payload mass</td></tr>,
 <td id="flight">Flight No</td>,
 <td>Launch site</td>,
 <td>Payload mass</td>,
 <tr><td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td><td>300 kg</td></tr>,
 <td>1</td>,
 <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>,
 <td>300 kg</td>,
 <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>,
 <td>2</td>,
 <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>,
 <td>94 kg</td>,
 <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a> </td><td>80 kg</td></tr>,
 <td>3</td>,
 <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a> </td>,
 <td>80 kg</td>]

## Attributes

* If an argument is not recognized, it is turned into a filter on the tag’s attributes. For example, if you pass the <code>id</code> argument, Beautiful Soup filters against each tag’s <code>id</code> attribute. The first <code>td</code> element has an <code>id</code> value of <code>flight</code>, so we can filter on that <code>id</code> value.

In [37]:
table_bs.find_all(id = "flight")

[<td id="flight">Flight No</td>]

In [38]:
# finding all the elements that have links to the Florida Wikipedia page:

list_input = table_bs.find_all(href = "https://en.wikipedia.org/wiki/Florida")
list_input

[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>,
 <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>]

If we set the <code>href</code> argument to <code>True</code>, the code finds all tags that have an <code>href</code> attribute, regardless of its value:

In [39]:
table_bs.find_all(href = True)

[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>,
 <a href="https://en.wikipedia.org/wiki/Texas">Texas</a>,
 <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>]

There are other methods for dealing with attributes, as well as other related methods; check out the following <a href='https://www.crummy.com/software/BeautifulSoup/bs4/doc/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0220ENSkillsNetwork23455606-2021-01-01#css-selectors'>link</a>.

#### Exercise: find_all
* Using the logic above, find all the `a` elements without an `href` value:

In [42]:
table_bs.find_all('a', href = False)

[]

Using the soup object `soup`, find the element with the `id` attribute content set to `"boldest"`.

In [43]:
soup.find_all(id = "boldest")

[<b id="boldest">Lebron James</b>]

## string
* With `string` you can search for strings instead of tags. Here we find all the strings that are exactly `Florida`:

In [44]:
table_bs.find_all(string = "Florida")

['Florida', 'Florida']
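
Because a plain string filter matches the exact text, a cell such as `Florida ` with a trailing space would be missed. One way around this, sketched below on a made-up row (the markup and variable names are illustrative), is to pass a compiled regular expression, which Beautiful Soup also accepts as a filter:

```python
import re
from bs4 import BeautifulSoup

demo_bs = BeautifulSoup(
    "<table><tr><td>Florida</td><td>Florida </td><td>Texas</td></tr></table>",
    "html.parser",
)

exact_matches = demo_bs.find_all(string="Florida")                # exact match only
pattern_matches = demo_bs.find_all(string=re.compile("Florida"))  # any string containing 'Florida'

print(len(exact_matches), len(pattern_matches))
```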

## find
* The `find_all()` method scans the entire document looking for results. If you are looking for only one element, you can use the `find()` method, which returns the first matching element in the document. Consider the following two tables:

In [45]:
%%html
<h3>Rocket Launch </h3>

<p>
<table class='rocket'>
  <tr>
    <td>Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
  </tr>
  <tr>
    <td>1</td>
    <td>Florida</td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td>Texas</td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td>Florida </td>
    <td>80 kg</td>
  </tr>
</table>
</p>
<p>

<h3>Pizza Party  </h3>
  
    
<table class='pizza'>
  <tr>
    <td>Pizza Place</td>
    <td>Orders</td> 
    <td>Slices </td>
   </tr>
  <tr>
    <td>Domino's Pizza</td>
    <td>10</td>
    <td>100</td>
  </tr>
  <tr>
    <td>Little Caesars</td>
    <td>12</td>
    <td >144 </td>
  </tr>
  <tr>
    <td>Papa John's </td>
    <td>15 </td>
    <td>165</td>
  </tr>
</table>
</p>

| Flight No | Launch site | Payload mass |
| --- | --- | --- |
| 1 | Florida | 300 kg |
| 2 | Texas | 94 kg |
| 3 | Florida | 80 kg |

| Pizza Place | Orders | Slices |
| --- | --- | --- |
| Domino's Pizza | 10 | 100 |
| Little Caesars | 12 | 144 |
| Papa John's | 15 | 165 |


In [46]:
# store the HTML as a Python string in the variable two_tables:

two_tables = "<h3>Rocket Launch </h3><p><table class='rocket'><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table></p><p><h3>Pizza Party  </h3><table class='pizza'><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td >144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr>"

# creating a BeautifulSoup object two_tables_bs

two_tables_bs= BeautifulSoup(two_tables, 'html.parser')

In [47]:
# finding the first table using the tag name table

two_tables_bs.find("table")

<table class="rocket"><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table>

We can filter on the class attribute to find the second table, but because class is a keyword in Python, we add an underscore.

In [48]:
two_tables_bs.find("table",class_='pizza')

<table class="pizza"><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td>144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr></table>
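
The `class_` keyword is one of several ways to filter on the class attribute. The sketch below, on a made-up two-table snippet (the markup and variable names are illustrative), shows two alternatives that should select the same element: an `attrs` dictionary and a CSS selector passed to `select_one()`:

```python
from bs4 import BeautifulSoup

demo_bs = BeautifulSoup(
    "<table class='rocket'><tr><td>1</td></tr></table>"
    "<table class='pizza'><tr><td>10</td></tr></table>",
    "html.parser",
)

by_keyword = demo_bs.find("table", class_="pizza")            # keyword argument with trailing underscore
by_attrs = demo_bs.find("table", attrs={"class": "pizza"})    # attribute dictionary, no underscore needed
by_selector = demo_bs.select_one("table.pizza")               # CSS selector syntax

print(by_keyword == by_attrs == by_selector)
```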

## Downloading And Scraping The Contents Of A Web Page

We download the contents of the web page:

In [49]:
url = "http://www.ibm.com"

We use `get` to download the contents of the webpage in text format and store it in a variable called `data`:

In [50]:
data = requests.get(url).text 
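
The one-line download above does not check whether the request succeeded. A slightly more defensive version might look like the sketch below; `fetch_html` and `timeout_seconds` are hypothetical names introduced for illustration, and the 10-second timeout is an arbitrary assumption:

```python
import requests

def fetch_html(page_url, timeout_seconds=10):
    """Hypothetical helper: return the page body as text, or raise on failure."""
    response = requests.get(page_url, timeout=timeout_seconds)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text

# Example usage (performs a real network request, so it is left commented out):
# data = fetch_html("http://www.ibm.com")
```

Connection problems, timeouts, and malformed URLs all surface as subclasses of `requests.exceptions.RequestException`, so a single `except` clause can handle them.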

We create a `BeautifulSoup` object using the `BeautifulSoup` constructor:

In [51]:
soup = BeautifulSoup(data,"html.parser")  # create a soup object using the variable 'data'

#### Scrape all links

In [52]:
for link in soup.find_all('a', href=True):  # in HTML, an anchor/link is represented by the tag <a>
    print(link.get('href'))


https://www.ibm.com/cloud?lnk=hpUSbt1
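
Pages often contain relative links such as `/cloud` alongside absolute ones. A common way to normalize them, sketched below on a made-up snippet (the anchors and variable names are illustrative), is `urljoin` from the standard library:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "http://www.ibm.com"
demo_bs = BeautifulSoup(
    "<a href='/cloud'>Cloud</a>"
    "<a href='https://www.ibm.com/cloud?lnk=hpUSbt1'>Cloud</a>",
    "html.parser",
)

# urljoin resolves relative links against the base and leaves absolute links unchanged
absolute_links = [urljoin(base_url, a["href"]) for a in demo_bs.find_all("a", href=True)]
print(absolute_links)
```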


#### Scrape all image tags

In [53]:
for link in soup.find_all('img'):  # in HTML, an image is represented by the tag <img>
    print(link)
    print(link.get('src'))

#### Scrape data from HTML tables

In [54]:
#The below url contains an html table with data about colors and color codes.
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"

Before proceeding to scrape a web site, you need to examine the contents, and the way data is organized on the website. Open the above url in your browser and check how many rows and columns are there in the color table.

In [55]:
# get the contents of the webpage in text format and store in a variable called data
data = requests.get(url).text

In [56]:
# create a soup object using the variable 'data'
soup = BeautifulSoup(data,"html.parser")

In [57]:
#find a html table in the web page
table = soup.find('table') # in html table is represented by the tag <table>

In [58]:
#Get all rows from the table
for row in table.find_all('tr'): # in html table row is represented by the tag <tr>
    # Get all columns in each row.
    cols = row.find_all('td') # in html a column is represented by the tag <td>
    color_name = cols[2].string # store the value in column 3 as color_name
    color_code = cols[3].string # store the value in column 4 as color_code
    print("{}--->{}".format(color_name,color_code))

Color Name--->None
lightsalmon--->#FFA07A
salmon--->#FA8072
darksalmon--->#E9967A
lightcoral--->#F08080
coral--->#FF7F50
tomato--->#FF6347
orangered--->#FF4500
gold--->#FFD700
orange--->#FFA500
darkorange--->#FF8C00
lightyellow--->#FFFFE0
lemonchiffon--->#FFFACD
papayawhip--->#FFEFD5
moccasin--->#FFE4B5
peachpuff--->#FFDAB9
palegoldenrod--->#EEE8AA
khaki--->#F0E68C
darkkhaki--->#BDB76B
yellow--->#FFFF00
lawngreen--->#7CFC00
chartreuse--->#7FFF00
limegreen--->#32CD32
lime--->#00FF00
forestgreen--->#228B22
green--->#008000
powderblue--->#B0E0E6
lightblue--->#ADD8E6
lightskyblue--->#87CEFA
skyblue--->#87CEEB
deepskyblue--->#00BFFF
lightsteelblue--->#B0C4DE
dodgerblue--->#1E90FF


## Scrape data from HTML tables into a DataFrame using BeautifulSoup and Pandas

In [69]:
import pandas as pd

In [77]:
#The below url contains html tables with data about world population.
url = "https://en.wikipedia.org/wiki/World_population"

Before proceeding to scrape a web site, you need to examine the contents, and the way data is organized on the website. Open the above url in your browser and check the tables on the webpage.

In [78]:
# get the contents of the webpage in text format and store in a variable called data
data = requests.get(url).text

In [79]:
soup = BeautifulSoup(data,"html.parser")

In [80]:
#find all html tables in the web page
tables = soup.find_all('table') # in html table is represented by the tag <table>

In [81]:
# we can see how many tables were found by checking the length of the tables list
len(tables)

30

Assume that we are looking for the `10 most densely populated countries` table. We can look through the `tables` list and find the one we are looking for based on the data in each table, or we can search for the table name if it appears in the table, although this option might not always work.

In [82]:
for index,table in enumerate(tables):
    if ("10 most densely populated countries" in str(table)):
        table_index = index
print(table_index)

7
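
The loop above leaves `table_index` undefined if no table matches. A variant that returns `None` in that case, and that only inspects each table's `<caption>`, is sketched below on a made-up two-table page (the markup and the names `demo_tables`, `wanted`, and `found_index` are illustrative):

```python
from bs4 import BeautifulSoup

demo_page = (
    "<table><caption>Most populous countries</caption><tr><td>China</td></tr></table>"
    "<table><caption>10 most densely populated countries</caption><tr><td>Singapore</td></tr></table>"
)
demo_tables = BeautifulSoup(demo_page, "html.parser").find_all("table")

wanted = "10 most densely populated countries"

# next() with a default of None avoids a NameError when nothing matches
found_index = next(
    (
        i
        for i, t in enumerate(demo_tables)
        if t.caption is not None and wanted in t.caption.get_text()
    ),
    None,
)
print(found_index)
```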


See if you can locate the name of the table, `10 most densely populated countries`, below.

In [83]:
print(tables[table_index].prettify())

<table class="wikitable sortable" style="text-align:right">
 <caption>
  10 most densely populated countries
  <small>
   (with population above 5 million)
  </small>
  <sup class="reference" id="cite_ref-:10_107-0">
   <a href="#cite_note-:10-107">
    [102]
   </a>
  </sup>
 </caption>
 <tbody>
  <tr>
   <th scope="col">
    Rank
   </th>
   <th scope="col">
    Country
   </th>
   <th scope="col">
    Population
   </th>
   <th scope="col">
    Area
    <br/>
    <small>
     (km
     <sup>
      2
     </sup>
     )
    </small>
   </th>
   <th scope="col">
    Density
    <br/>
    <small>
     (pop/km
     <sup>
      2
     </sup>
     )
    </small>
   </th>
  </tr>
  <tr>
   <td>
    1
   </td>
   <td align="left">
    <span class="flagicon">
     <span class="mw-image-border" typeof="mw:File">
      <span>
       <img alt="" class="mw-file-element" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/4/4

`.prettify()`: This is a method provided by Beautiful Soup. It formats the HTML content in a structured and indented way, making it easier for humans to read and understand.

In [88]:
# Initialize an empty list to collect rows
rows_list = []

for row in tables[table_index].tbody.find_all("tr"):
    col = row.find_all("td")
    if col:  # Checks if col is not empty
        rank = col[0].text.strip()
        country = col[1].text.strip()
        population = col[2].text.strip().replace(',', '')  # Optionally remove commas for numerical processing
        area = col[3].text.strip().replace(',', '')  # Optionally remove commas for numerical processing
        density = col[4].text.strip().replace(',', '')  # Optionally remove commas for numerical processing
        rows_list.append({"Rank": rank, "Country": country, "Population": population, "Area": area, "Density": density})

# Create the DataFrame from the list of rows
population_data = pd.DataFrame(rows_list, columns=["Rank", "Country", "Population", "Area", "Density"])

population_data


|  | Rank | Country | Population | Area | Density |
| --- | --- | --- | --- | --- | --- |
| 0 | 1 | Singapore | 5921231 | 719 | 8235 |
| 1 | 2 | Bangladesh | 165650475 | 148460 | 1116 |
| 2 | 3 | Palestine[note 3][103] | 5223000 | 6025 | 867 |
| 3 | 4 | Taiwan[note 4] | 23580712 | 35980 | 655 |
| 4 | 5 | South Korea | 51844834 | 99720 | 520 |
| 5 | 6 | Lebanon | 5296814 | 10400 | 509 |
| 6 | 7 | Rwanda | 13173730 | 26338 | 500 |
| 7 | 8 | Burundi | 12696478 | 27830 | 456 |
| 8 | 9 | India | 1389637446 | 3287263 | 423 |
| 9 | 10 | Netherlands | 17400824 | 41543 | 419 |
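
Every value collected this way is still a string, and some country names carry footnote markers such as `[note 3][103]`. One possible clean-up step is sketched below on a two-row stand-in DataFrame (`demo_df` and the chosen columns are illustrative; the values are copied from the output above):

```python
import pandas as pd

demo_df = pd.DataFrame(
    {
        "Country": ["Singapore", "Palestine[note 3][103]"],
        "Population": ["5921231", "5223000"],
        "Density": ["8235", "867"],
    }
)

# Remove bracketed footnote markers such as [note 3] or [103]
demo_df["Country"] = demo_df["Country"].str.replace(r"\[[^\]]*\]", "", regex=True)

# Convert the numeric columns from strings to numbers
for column in ["Population", "Density"]:
    demo_df[column] = pd.to_numeric(demo_df[column])

print(demo_df.dtypes)
```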


## Scrape data from HTML tables into a DataFrame using BeautifulSoup and read_html

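Using the same `url`, `data`, `soup`, and `tables` objects as in the last section, we can use the `read_html` function to create a DataFrame.

The densely populated countries table is located in `tables[table_index]`. The cells below read `tables[5]` instead, which on this page is the `Most populous countries` table, so substitute `tables[table_index]` to reproduce the DataFrame from the previous section.

We can now use the `pandas` function `read_html` and give it the string version of the table as well as the `flavor`, which selects the parsing engine, here `bs4`.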

In [89]:
pd.read_html(str(tables[5]), flavor='bs4')

[                                                                                                         #  \
    Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.   
    Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.   
    Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.   
    Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.   
    Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.   
    Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.   
    Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.   
    Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.   
 

The function `read_html` always returns a list of DataFrames, so we must pick the one we want out of the list:

In [90]:
population_data_read_html = pd.read_html(str(tables[5]), flavor='bs4')[0]

population_data_read_html

Unnamed: 0_level_0,#,Most populous countries,2000,2015,2030[A],Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.
Unnamed: 0_level_1,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Unnamed: 0_level_2,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Unnamed: 0_level_3,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3
Unnamed: 0_level_4,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_4,Unnamed: 2_level_4,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4
Unnamed: 0_level_5,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_5,Unnamed: 2_level_5,Unnamed: 3_level_5,Unnamed: 4_level_5,Unnamed: 5_level_5
Unnamed: 0_level_6,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_6,Unnamed: 2_level_6,Unnamed: 3_level_6,Unnamed: 4_level_6,Unnamed: 5_level_6
Unnamed: 0_level_7,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_7,Unnamed: 2_level_7,Unnamed: 3_level_7,Unnamed: 4_level_7,Unnamed: 5_level_7
Unnamed: 0_level_8,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_8,Unnamed: 2_level_8,Unnamed: 3_level_8,Unnamed: 4_level_8,Unnamed: 5_level_8
Unnamed: 0_level_9,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_9,Unnamed: 2_level_9,Unnamed: 3_level_9,Unnamed: 4_level_9,Unnamed: 5_level_9
Unnamed: 0_level_10,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_10,Unnamed: 2_level_10,Unnamed: 3_level_10,Unnamed: 4_level_10,Unnamed: 5_level_10
Unnamed: 0_level_11,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_11,Unnamed: 2_level_11,Unnamed: 3_level_11,Unnamed: 4_level_11,Unnamed: 5_level_11
0,,Graphs are unavailable due to technical issues...,,,,
1,1,China[B],1270,1376,1416,
2,2,India,1053,1311,1528,
3,3,United States,283,322,356,
4,4,Indonesia,212,258,295,
5,5,Pakistan,136,208,245,
6,6,Brazil,176,206,228,
7,7,Nigeria,123,182,263,
8,8,Bangladesh,131,161,186,
9,9,Russia,146,146,149,
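
Newer pandas releases deprecate passing a literal HTML string straight to `read_html` and ask for a file-like object instead (as far as we are aware, this applies to pandas 2.1 and later). A minimal sketch of the adjusted call is shown below, using a made-up one-row table in place of `str(tables[5])`; the `bs4` flavor additionally needs the `html5lib` package installed at the top of this lab:

```python
from io import StringIO
import pandas as pd

# Hypothetical stand-in for str(tables[5])
demo_table_html = (
    "<table><tr><th>Rank</th><th>Country</th></tr>"
    "<tr><td>1</td><td>Singapore</td></tr></table>"
)

# Wrapping the string in StringIO gives read_html a file-like object
demo_frames = pd.read_html(StringIO(demo_table_html), flavor="bs4")
demo_frame = demo_frames[0]
print(demo_frame)
```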


## Scrape data from HTML tables into a DataFrame using read_html

We can also use the `read_html` function to directly get DataFrames from a `url`.

In [91]:
dataframe_list = pd.read_html(url, flavor='bs4')

`pd.read_html(url, flavor='bs4')`: This is a function call to Pandas' `read_html()` method. It reads HTML tables from a `URL` (url) and returns a list of DataFrames containing the scraped data. The `flavor='bs4'` parameter specifies that `Beautiful Soup` 4 (bs4) should be used as the `HTML parser`.

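The list contains 27 DataFrames, compared with the 30 `table` elements that `find_all` returned on the `soup` object, so an index into one list does not necessarily select the same table in the other.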

In [92]:
len(dataframe_list)

27

In [93]:
# picking the DataFrame we need out of the list

dataframe_list[5]

Unnamed: 0_level_0,#,Most populous countries,2000,2015,2030[A],Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.
Unnamed: 0_level_1,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Unnamed: 0_level_2,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Unnamed: 0_level_3,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3
Unnamed: 0_level_4,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_4,Unnamed: 2_level_4,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4
Unnamed: 0_level_5,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_5,Unnamed: 2_level_5,Unnamed: 3_level_5,Unnamed: 4_level_5,Unnamed: 5_level_5
Unnamed: 0_level_6,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_6,Unnamed: 2_level_6,Unnamed: 3_level_6,Unnamed: 4_level_6,Unnamed: 5_level_6
Unnamed: 0_level_7,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_7,Unnamed: 2_level_7,Unnamed: 3_level_7,Unnamed: 4_level_7,Unnamed: 5_level_7
Unnamed: 0_level_8,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_8,Unnamed: 2_level_8,Unnamed: 3_level_8,Unnamed: 4_level_8,Unnamed: 5_level_8
Unnamed: 0_level_9,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_9,Unnamed: 2_level_9,Unnamed: 3_level_9,Unnamed: 4_level_9,Unnamed: 5_level_9
Unnamed: 0_level_10,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_10,Unnamed: 2_level_10,Unnamed: 3_level_10,Unnamed: 4_level_10,Unnamed: 5_level_10
Unnamed: 0_level_11,Graphs are unavailable due to technical issues. There is more info on Phabricator and on MediaWiki.org.,Unnamed: 1_level_11,Unnamed: 2_level_11,Unnamed: 3_level_11,Unnamed: 4_level_11,Unnamed: 5_level_11
0,,Graphs are unavailable due to technical issues...,,,,
1,1,China[B],1270,1376,1416,
2,2,India,1053,1311,1528,
3,3,United States,283,322,356,
4,4,Indonesia,212,258,295,
5,5,Pakistan,136,208,245,
6,6,Brazil,176,206,228,
7,7,Nigeria,123,182,263,
8,8,Bangladesh,131,161,186,
9,9,Russia,146,146,149,


We can also use the `match` parameter to select the specific table we want. If the table contains a string matching the text it will be read.

In [94]:
pd.read_html(url, match="10 most densely populated countries", flavor='bs4')[0]

|  | Rank | Country | Population | Area (km2) | Density (pop/km2) |
| --- | --- | --- | --- | --- | --- |
| 0 | 1 | Singapore | 5921231 | 719 | 8235 |
| 1 | 2 | Bangladesh | 165650475 | 148460 | 1116 |
| 2 | 3 | Palestine[note 3][103] | 5223000 | 6025 | 867 |
| 3 | 4 | Taiwan[note 4] | 23580712 | 35980 | 655 |
| 4 | 5 | South Korea | 51844834 | 99720 | 520 |
| 5 | 6 | Lebanon | 5296814 | 10400 | 509 |
| 6 | 7 | Rwanda | 13173730 | 26338 | 500 |
| 7 | 8 | Burundi | 12696478 | 27830 | 456 |
| 8 | 9 | India | 1389637446 | 3287263 | 423 |
| 9 | 10 | Netherlands | 17400824 | 41543 | 419 |
