# **Web Scraping**


In [1]:
# install the required libraries
!mamba install bs4==4.10.0 -y
!pip install lxml==4.6.4
!mamba install html5lib==1.1 -y
# !pip install requests==2.26.0

/bin/bash: mamba: command not found
Collecting lxml==4.6.4
  Downloading lxml-4.6.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (6.3 MB)
Installing collected packages: lxml
  Attempting uninstall: lxml
    Found existing installation: lxml 4.2.6
    Uninstalling lxml-4.2.6:
      Successfully uninstalled lxml-4.2.6
Successfully installed lxml-4.6.4
/bin/bash: mamba: command not found


Import the required modules and functions


In [2]:
from bs4 import BeautifulSoup # this module helps with web scraping
import requests  # this module helps us to download a web page

<h2 id="BSO">Beautiful Soup Objects</h2>


Beautiful Soup is a Python library for pulling data out of HTML and XML files; we will focus on HTML files. It represents the HTML as a set of objects with methods used to parse the HTML. We can navigate the HTML as a tree and/or filter out what we are looking for.


In [3]:
%%html
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h3><b id='boldest'>Lebron James</b></h3>
<p> Salary: $ 92,000,000 </p>
<h3> Stephen Curry</h3>
<p> Salary: $85,000, 000 </p>
<h3> Kevin Durant </h3>
<p> Salary: $73,200, 000</p>
</body>
</html>

We can store it as a string in the variable <code>html</code>:


In [4]:
html="<!DOCTYPE html><html><head><title>Page Title</title></head><body><h3><b id='boldest'>Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body></html>"

To parse a document, pass it into the <code>BeautifulSoup</code> constructor. The resulting <code>BeautifulSoup</code> object represents the document as a nested data structure:


In [5]:
soup = BeautifulSoup(html, "html.parser")

First, the document is converted to Unicode (a character standard that is a superset of ASCII), and HTML entities are converted to Unicode characters. Beautiful Soup then transforms the HTML document into a tree of Python objects. The <code>BeautifulSoup</code> object can create other types of objects, such as <code>Tag</code> and <code>NavigableString</code>.
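As a small illustration of the entity conversion described above, the following sketch parses a fragment containing two HTML entities. The fragment and the variable names are made up for this example and are not part of the notebook's data.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two HTML entities: &eacute; and &amp;
fragment = "<p>Caf&eacute; &amp; Bar</p>"
entity_soup = BeautifulSoup(fragment, "html.parser")

# The entities come back as ordinary Unicode characters
entity_text = str(entity_soup.p.string)
print(entity_text)
```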

We can use the method <code>prettify()</code> to display the HTML in the nested structure:


In [6]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Page Title
  </title>
 </head>
 <body>
  <h3>
   <b id="boldest">
    Lebron James
   </b>
  </h3>
  <p>
   Salary: $ 92,000,000
  </p>
  <h3>
   Stephen Curry
  </h3>
  <p>
   Salary: $85,000, 000
  </p>
  <h3>
   Kevin Durant
  </h3>
  <p>
   Salary: $73,200, 000
  </p>
 </body>
</html>


## Tags


Say we want the title of the page and the name of the top-paid player. We can use the <code>Tag</code> object, which corresponds to an HTML tag in the original document, for example, the <code>title</code> tag.


In [7]:
tag_object=soup.title
print("tag object:",tag_object)

tag object: <title>Page Title</title>


We can check that the object's type is <code>Tag</code>:


In [8]:
print("tag object type:",type(tag_object))

tag object type: <class 'bs4.element.Tag'>


If there is more than one tag with the same name, the first element with that name is returned. Here this corresponds to the highest-paid player:


In [9]:
tag_object=soup.h3
tag_object

<h3><b id="boldest">Lebron James</b></h3>

The name is enclosed in the bold tag <code>b</code>. Here it helps to use the tree representation: we can navigate down the tree using the child attribute to get the name.


### Children, Parents, and Siblings


As stated above, the <code>Tag</code> object is a tree of objects. We can access the child of the tag, or navigate down the branch, as follows:


In [10]:
tag_child =tag_object.b
tag_child

<b id="boldest">Lebron James</b>

We can access the parent tag with the <code>parent</code> attribute:


In [11]:
parent_tag=tag_child.parent
parent_tag

<h3><b id="boldest">Lebron James</b></h3>

This is identical to <code>tag_object</code>:


In [12]:
tag_object

<h3><b id="boldest">Lebron James</b></h3>

The parent of <code>tag_object</code> is the <code>body</code> element:


In [13]:
tag_object.parent

<body><h3><b id="boldest">Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body>

The next sibling of <code>tag_object</code> is the paragraph (<code>p</code>) element:


In [14]:
sibling_1=tag_object.next_sibling
sibling_1

<p> Salary: $ 92,000,000 </p>

`sibling_2` is the next heading (`h3`) element, which is a sibling of both `sibling_1` and `tag_object`:


In [15]:
sibling_2=sibling_1.next_sibling
sibling_2

<h3> Stephen Curry</h3>

We can use the object <code>sibling_2</code> and the property <code>next_sibling</code> to find the salary of Stephen Curry:


In [16]:
sibling_3=sibling_2.next_sibling
sibling_3

<p> Salary: $85,000, 000 </p>
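The sibling walk above works because the example string has no whitespace between its tags. As a hedged sketch of what changes when the HTML contains line breaks, the hypothetical document below shows that <code>next_sibling</code> returns the newline text node, while <code>find_next_sibling()</code> skips to the next matching tag. The names <code>pretty_html</code> and <code>pretty_soup</code> are invented for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical document with line breaks between the tags
pretty_html = "<body>\n<h3>Lebron James</h3>\n<p>Salary: $92,000,000</p>\n</body>"
pretty_soup = BeautifulSoup(pretty_html, "html.parser")

heading = pretty_soup.h3

# next_sibling returns the newline text node, not the <p> tag
print(repr(heading.next_sibling))

# find_next_sibling skips text nodes and returns the next matching tag
salary_tag = heading.find_next_sibling("p")
print(salary_tag.string)
```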

### HTML Attributes


Tags can have attributes. For example, the <code>b</code> tag above has an attribute <code>id</code> whose value is <code>boldest</code>. We can access a tag’s attributes by treating the tag like a dictionary:


In [17]:
tag_child['id']

'boldest'

We can access that dictionary directly as <code>attrs</code>:


In [18]:
tag_child.attrs

{'id': 'boldest'}

We can also obtain the value of an attribute of the tag using the <code>get()</code> method:


In [19]:
tag_child.get('id')

'boldest'
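The two access styles differ when the attribute is missing. The sketch below, which rebuilds the <code>b</code> tag on its own so that it is self-contained, shows that <code>get()</code> returns <code>None</code> or a default value, whereas dictionary-style access raises <code>KeyError</code>. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

bold = BeautifulSoup("<b id='boldest'>Lebron James</b>", "html.parser").b

# get() returns None (or a chosen default) when the attribute is absent
missing_class = bold.get("class")
fallback = bold.get("class", "no-class")

# Dictionary-style access raises KeyError for a missing attribute
try:
    bold["class"]
    raised = False
except KeyError:
    raised = True

print(missing_class, fallback, raised)
```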

### Navigable String


A string corresponds to a bit of text or content within a tag. Beautiful Soup uses the <code>NavigableString</code> class to contain this text. In our HTML we can obtain the name of the first player by extracting the string of the <code>Tag</code> object <code>tag_child</code> as follows:


In [20]:
tag_string=tag_child.string
tag_string

'Lebron James'

We can verify that the type is <code>NavigableString</code>:


In [21]:
type(tag_string)

bs4.element.NavigableString

A <code>NavigableString</code> is just like a Python string (a Unicode string, to be more precise). The main difference is that it also supports some Beautiful Soup features. We can convert it to a plain Python string object:


In [22]:
unicode_string = str(tag_string)
unicode_string

'Lebron James'
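One caveat worth sketching: <code>.string</code> only returns text when the tag has a single child. For a tag that mixes text and nested tags it returns <code>None</code>, and <code>get_text()</code> is the usual way to collect all the text. The fragment below is a made-up example, not part of the notebook's data.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph that mixes plain text with a nested <b> tag
mixed = BeautifulSoup("<p>Salary: <b>$92,000,000</b></p>", "html.parser").p

# .string is None because <p> has two children (a text node and <b>)
print(mixed.string)

# get_text() joins the text of all descendants
print(mixed.get_text())
```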

<h2 id="filter">Filter</h2>


Filters allow us to find complex patterns; the simplest filter is a string. In this section we will pass a string to different filter methods, and Beautiful Soup will perform a match against that exact string. Consider the following HTML of rocket launches:


In [23]:
%%html
<table>
  <tr>
    <td id='flight' >Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
   </tr>
  <tr> 
    <td>1</td>
    <td><a href='https://en.wikipedia.org/wiki/Delhi'>Delhi</a></td>
    <td>140 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td><a href='https://en.wikipedia.org/wiki/Mumbai'>Mumbai</a></td>
    <td>100 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a> </td>
    <td>80 kg</td>
  </tr>
</table>

| Flight No | Launch site | Payload mass |
|---|---|---|
| 1 | Delhi | 140 kg |
| 2 | Mumbai | 100 kg |
| 3 | Florida | 80 kg |


We can store it as a string in the variable <code>table</code>:


In [24]:
table="<table><tr><td id='flight'>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href='https://en.wikipedia.org/wiki/Delhi'>Delhi<a></td><td>140 kg</td></tr><tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Mumbai'>Mumbai</a></td><td>100 kg</td></tr><tr><td>3</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a> </td><td>80 kg</td></tr></table>"

In [25]:
table_bs = BeautifulSoup(table, "html.parser")
table_bs

<table><tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Delhi">Delhi<a></a></a></td><td>140 kg</td></tr><tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Mumbai">Mumbai</a></td><td>100 kg</td></tr><tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td><td>80 kg</td></tr></table>

## find_all


The <code>find_all()</code> method looks through a tag’s descendants and retrieves all descendants that match your filters.

<p>
The method signature is <code>find_all(name, attrs, recursive, string, limit, **kwargs)</code>
</p>


### Name


When we set the <code>name</code> parameter to a tag name, the method extracts all the tags with that name, together with their children.


In [26]:
table_rows=table_bs.find_all('tr')
table_rows

[<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>,
 <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Delhi">Delhi<a></a></a></td><td>140 kg</td></tr>,
 <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Mumbai">Mumbai</a></td><td>100 kg</td></tr>,
 <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td><td>80 kg</td></tr>]

The result is a Python iterable that behaves like a list; each element is a <code>Tag</code> object:


In [27]:
first_row =table_rows[0]
first_row

<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>

The type is <code>Tag</code>:


In [28]:
print(type(first_row))

<class 'bs4.element.Tag'>


We can obtain the first child cell:


In [29]:
first_row.td

<td id="flight">Flight No</td>

If we iterate through the list, each element corresponds to a row in the table:


In [30]:
for i,row in enumerate(table_rows):
    print("row",i,"is",row)
    

row 0 is <tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>
row 1 is <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Delhi">Delhi<a></a></a></td><td>140 kg</td></tr>
row 2 is <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Mumbai">Mumbai</a></td><td>100 kg</td></tr>
row 3 is <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td><td>80 kg</td></tr>


As <code>row</code> is a <code>Tag</code> object, we can apply the method <code>find_all</code> to it and extract the table cells into the object <code>cells</code> using the tag <code>td</code>; these are all the children with the name <code>td</code>. The result is a list in which each element corresponds to a cell and is a <code>Tag</code> object, so we can iterate through this list as well. We can extract the content using the <code>string</code> attribute.


In [31]:
for i,row in enumerate(table_rows):
    print("row",i)
    cells=row.find_all('td')
    for j,cell in enumerate(cells):
        print('column',j,"cell",cell)

row 0
column 0 cell <td id="flight">Flight No</td>
column 1 cell <td>Launch site</td>
column 2 cell <td>Payload mass</td>
row 1
column 0 cell <td>1</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Delhi">Delhi<a></a></a></td>
column 2 cell <td>140 kg</td>
row 2
column 0 cell <td>2</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Mumbai">Mumbai</a></td>
column 2 cell <td>100 kg</td>
row 3
column 0 cell <td>3</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td>
column 2 cell <td>80 kg</td>
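The loop above prints whole <code>Tag</code> objects. A minimal sketch of extracting just the cell text follows, assuming a shortened, well-formed copy of the launch table; <code>launch_html</code> and <code>rows_text</code> are names chosen for this example. It uses <code>get_text(strip=True)</code> rather than <code>string</code> so that cells containing a nested link still yield their text.

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed and shortened version of the launch table
launch_html = (
    "<table>"
    "<tr><td>Flight No</td><td>Launch site</td><td>Payload mass</td></tr>"
    "<tr><td>1</td>"
    "<td><a href='https://en.wikipedia.org/wiki/Delhi'>Delhi</a></td>"
    "<td>140 kg</td></tr>"
    "</table>"
)
launch_soup = BeautifulSoup(launch_html, "html.parser")

# One inner list per row, one stripped string per cell
rows_text = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in launch_soup.find_all("tr")
]
print(rows_text)
```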


If we use a list we can match against any item in that list.


In [32]:
list_input=table_bs.find_all(name=["tr", "td"])
list_input

[<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>,
 <td id="flight">Flight No</td>,
 <td>Launch site</td>,
 <td>Payload mass</td>,
 <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Delhi">Delhi<a></a></a></td><td>140 kg</td></tr>,
 <td>1</td>,
 <td><a href="https://en.wikipedia.org/wiki/Delhi">Delhi<a></a></a></td>,
 <td>140 kg</td>,
 <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Mumbai">Mumbai</a></td><td>100 kg</td></tr>,
 <td>2</td>,
 <td><a href="https://en.wikipedia.org/wiki/Mumbai">Mumbai</a></td>,
 <td>100 kg</td>,
 <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td><td>80 kg</td></tr>,
 <td>3</td>,
 <td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td>,
 <td>80 kg</td>]

### Attributes


If a keyword argument is not recognized, it is turned into a filter on the tag’s attributes. For example, if we pass the <code>id</code> argument, Beautiful Soup filters against each tag’s <code>id</code> attribute. The first <code>td</code> element has an <code>id</code> value of <code>flight</code>, so we can filter on that <code>id</code> value:


In [33]:
table_bs.find_all(id="flight")

[<td id="flight">Flight No</td>]

We can find all the elements that have links to the Delhi Wikipedia page:


In [64]:
list_input=table_bs.find_all(href="https://en.wikipedia.org/wiki/Delhi")
list_input

[<a href="https://en.wikipedia.org/wiki/Delhi">Delhi<a></a></a>]

If we set the <code>href</code> argument to <code>True</code>, the code finds all tags that have an <code>href</code> attribute, regardless of its value:


In [65]:
table_bs.find_all(href=True)

[<a href="https://en.wikipedia.org/wiki/Delhi">Delhi<a></a></a>,
 <a href="https://en.wikipedia.org/wiki/Mumbai">Mumbai</a>,
 <a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a>]

Using the logic above, find all the elements without an <code>href</code> attribute:


In [66]:
table_bs.find_all(href=False)


[<table><tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Delhi">Delhi<a></a></a></td><td>140 kg</td></tr><tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Mumbai">Mumbai</a></td><td>100 kg</td></tr><tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida<a> </a></a></td><td>80 kg</td></tr></table>,
 <tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>,
 <td id="flight">Flight No</td>,
 <td>Launch site</td>,
 <td>Payload mass</td>,
 <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Delhi">Delhi<a></a></a></td><td>140 kg</td></tr>,
 <td>1</td>,
 <td><a href="https://en.wikipedia.org/wiki/Delhi">Delhi<a></a></a></td>,
 <a></a>,
 <td>140 kg</td>,
 <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Mumbai">Mumbai</a></td><td>100 kg</td></tr>,
 <td>2</td>,
 <td><a href="https://en.wikipedia.org/wiki/Mumbai">Mumbai</a></td>,
 <td>100 kg</

Using the soup object <code>soup</code>, find the element with the <code>id</code> attribute content set to <code>"boldest"</code>.


In [37]:
soup.find_all(id="boldest")

[<b id="boldest">Lebron James</b>]

### string


With <code>string</code> you can search for strings instead of tags. Here we find all the strings that exactly match Florida:


In [38]:
table_bs.find_all(string="Florida")

['Florida']
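The <code>string</code> argument also accepts a compiled regular expression, which helps when an exact match is too strict. The following sketch, using a small made-up table, collects every text node that ends in "kg"; the names <code>payload_html</code> and <code>masses</code> are illustrative.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical two-row table with payload masses
payload_html = (
    "<table><tr><td>1</td><td>Delhi</td><td>140 kg</td></tr>"
    "<tr><td>2</td><td>Mumbai</td><td>100 kg</td></tr></table>"
)
payload_soup = BeautifulSoup(payload_html, "html.parser")

# Match every text node that ends in "kg"
masses = [str(s) for s in payload_soup.find_all(string=re.compile(r"kg$"))]
print(masses)
```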

## find


The <code>find_all()</code> method scans the entire document looking for results. If you are looking for only one element, you can use the <code>find()</code> method, which returns the first matching element in the document. Consider the following two tables:


In [39]:
%%html
<h3>Rocket Launch </h3>

<p>
<table class='rocket'>
  <tr>
    <td>Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
  </tr>
  <tr>
    <td>1</td>
    <td>Delhi</td>
    <td>140 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td>Mumbai</td>
    <td>100 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td>Florida </td>
    <td>80 kg</td>
  </tr>
</table>
</p>
<p>

<h3>Pizza Party  </h3>
  
    
<table class='pizza'>
  <tr>
    <td>Pizza Outlet</td>
    <td>Orders</td> 
    <td>Slices </td>
   </tr>
  <tr>
    <td>Domino's Pizza</td>
    <td>10</td>
    <td>100</td>
  </tr>
  <tr>
    <td>La Pino</td>
    <td>12</td>
    <td >144 </td>
  </tr>
  <tr>
    <td>Papa John's </td>
    <td>15 </td>
    <td>165</td>
  </tr>


| Flight No | Launch site | Payload mass |
|---|---|---|
| 1 | Delhi | 140 kg |
| 2 | Mumbai | 100 kg |
| 3 | Florida | 80 kg |

| Pizza Outlet | Orders | Slices |
|---|---|---|
| Domino's Pizza | 10 | 100 |
| La Pino | 12 | 144 |
| Papa John's | 15 | 165 |


We store the HTML as a Python string and assign it to <code>two_tables</code>:


In [40]:
two_tables="<h3>Rocket Launch </h3><p><table class='rocket'><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Delhi</td><td>140 kg</td></tr><tr><td>2</td><td>Mumbai</td><td>100 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table></p><p><h3>Pizza Party  </h3><table class='pizza'><tr><td>Pizza Outlet</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>La Pino</td><td>12</td><td >144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr>"

We create a <code>BeautifulSoup</code> object  <code>two_tables_bs</code>


In [41]:
two_tables_bs= BeautifulSoup(two_tables, 'html.parser')

We can find the first table using the tag name table


In [42]:
two_tables_bs.find("table")

<table class="rocket"><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Delhi</td><td>140 kg</td></tr><tr><td>2</td><td>Mumbai</td><td>100 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table>

We can filter on the class attribute to find the second table, but because class is a keyword in Python, we add an underscore.


In [43]:
two_tables_bs.find("table",class_='pizza')

<table class="pizza"><tr><td>Pizza Outlet</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>La Pino</td><td>12</td><td>144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr></table>
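Two other ways to express the same class filter are sketched below: passing an <code>attrs</code> dictionary to <code>find()</code>, and using a CSS selector with <code>select_one()</code>. The two-table string here is a shortened, made-up stand-in for <code>two_tables</code>.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the two_tables string
menus_html = (
    "<table class='rocket'><tr><td>Delhi</td></tr></table>"
    "<table class='pizza'><tr><td>La Pino</td></tr></table>"
)
menus_soup = BeautifulSoup(menus_html, "html.parser")

# attrs avoids the trailing underscore needed by the class_ keyword
by_attrs = menus_soup.find("table", attrs={"class": "pizza"})

# select_one takes a CSS selector and returns the first match
by_css = menus_soup.select_one("table.pizza")

print(by_attrs.td.string)
print(by_css.td.string)
```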

<h2 id="DSCW">Downloading And Scraping The Contents Of A Web Page</h2> 


First we set the URL of the web page to download:


In [44]:
url = "http://www.ibm.com"

We use <code>get</code> to download the contents of the webpage in text format and store in a variable called <code>data</code>:


In [45]:
data  = requests.get(url).text 
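The one-line download above does not check whether the request succeeded. A slightly more defensive variant might look like the sketch below; the helper name <code>fetch_html</code> and the 10-second timeout are assumptions made for this example, not part of the original notebook.

```python
import requests


def fetch_html(page_url, timeout=10):
    """Download a page and return its text, raising on HTTP errors.

    fetch_html is a hypothetical helper; the timeout value is an assumption.
    """
    response = requests.get(page_url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

With this helper, the cell above could be written as `data = fetch_html(url)`.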

We create a <code>BeautifulSoup</code> object using the <code>BeautifulSoup</code> constructor


In [46]:
soup = BeautifulSoup(data,"html.parser")  # create a soup object using the variable 'data'

Scrape all links


In [47]:
for link in soup.find_all('a',href=True):  # in html anchor/link is represented by the tag <a>

    print(link.get('href'))


https://www.ibm.com/tw/zh
https://www.ibm.com/sitemap/tw/zh
https://www.ibm.com/lets-create/
/tw-zh/cloud/application-modernization
/tw-zh/analytics/journey-to-ai
https://www.ibm.com/tw-zh/products/flashsystem-5000?lnk=TWHP
https://www.ibm.com/tw-zh/it-infrastructure/storage/flash?lnk=STW_TW_HP_F3_&psrc=NONE&pexp=DEF&lnk2=goto_FlashStorage
/tw-zh/cloud/hybrid
https://www.ibm.com/tw-zh/products?lnk=STW_TW_HP_SWT5_BLK&psrc=NONE&pexp=DEF&lnk2=trial_PHP
https://www.ibm.com/tw-zh/security/security-intelligence/qradar?lnk=STW_TW_HP_SWT1_BLK&psrc=NONE&pexp=DEF&lnk2=trial_QradarPlat
https://www.ibm.com/tw-zh/analytics/spss-trials?lnk=STW_TW_HP_SWT1_&psrc=NONE&pexp=DEF&lnk2=goto_SPSSstat
https://www.ibm.com/tw-zh/cloud/free?lnk=STW_TW_HP_SWT1_BLK&psrc=NONE&pexp=DEF&lnk2=trial_Cloud
https://www.ibm.com/tw-zh/products/cloud-pak-for-data?lnk=STW_TW_HP_SWT2_BLK&psrc=NONE&pexp=DEF&lnk2=trial_CloudPakData
https://www.ibm.com/tw-zh/cloud/watson-assistant?lnk=STW_TW_HP_SWT3_BLK&psrc=NONE&pexp=DEF&lnk2=
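Some of the links in the output are relative (they start with `/`). As a minimal sketch, the standard-library function `urljoin` can resolve them against the page URL; the two sample `href` values are copied from the output above, and `base_url` repeats the URL used in this section.

```python
from urllib.parse import urljoin

base_url = "http://www.ibm.com"

# Two hrefs taken from the output above: one relative, one absolute
hrefs = ["/tw-zh/cloud/hybrid", "https://www.ibm.com/lets-create/"]

# Relative links are resolved against base_url; absolute links are unchanged
absolute_links = [urljoin(base_url, href) for href in hrefs]
print(absolute_links)
```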

## Scrape all image tags


In [48]:
for link in soup.find_all('img'):# in html image is represented by the tag <img>
    print(link)
    print(link.get('src'))

<img alt="攜手共創 萬物爭新" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2022-02/security-2_10.jpg"/>
//1.cms.s81c.com/sites/default/files/2022-02/security-2_10.jpg
<img alt="一次構建，隨處部署" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2021-09-02/twhp-mordernize.jpg"/>
//1.cms.s81c.com/sites/default/files/2021-09-02/twhp-mordernize.jpg
<img alt="擴展 AI 以加速數位轉型" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2021-09-02/twhp-j2ai.jpg"/>
//1.cms.s81c.com/sites/default/files/2021-09-02/twhp-j2ai.jpg
<img alt="限時優惠！免費取得三大企業級資料服務" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2021-08-09/twzh-flashsystem-5000.jpg"/>
//1.cms.s81c.com/sites/default/files/2021-08-09/twzh-flashsystem-5000.jpg
<img alt="2021年第一季度儲存新品震撼發佈" class="" loading="lazy" src="//1.cms.s81c.com/sites/default/files/2021-04-09/20210209-flash-system-5200-storage-25719-444x320_0.jpg"/>
//1.cms.s81c.com/sites/default/files/2021-04-09/20210209-flash-system-5200-st

## Scrape data from HTML tables into a DataFrame using BeautifulSoup and Pandas


In [49]:
import pandas as pd

In [67]:
# The URL below contains HTML tables with data about the population of India.
url = "https://en.wikipedia.org/wiki/Demographics_of_India"

Before proceeding to scrape a web site, we need to examine the contents, and the way data is organized on the website. 

In [68]:
# get the contents of the webpage in text format and store in a variable called data
data  = requests.get(url).text

In [69]:
soup = BeautifulSoup(data,"html.parser")

In [70]:
#find all html tables in the web page
tables = soup.find_all('table') # in html table is represented by the tag <table>

In [71]:
# we can see how many tables were found by checking the length of the tables list
len(tables)

38

Assume that we are looking for the population-over-years table, captioned `Population growth of India per decade`. We can look through the tables list and find the right one based on the data in each table, or we can search for the table's caption if it appears in the table, although this option might not always work.


In [87]:
table_index = None  # stays None if no table contains the caption text
for index,table in enumerate(tables):
    if ("Population growth of India per decade" in str(table)):
        table_index = index
print(table_index)

7
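Searching `str(table)` matches the text anywhere in the table, including cell contents. A narrower alternative, sketched below on a made-up two-table page, checks only the `<caption>` element of each table. The names `page_html`, `page_tables`, and `caption_index` are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two tables; only the second has the caption we want
page_html = (
    "<table><caption>Other data</caption><tr><td>x</td></tr></table>"
    "<table><caption>Population growth of India per decade</caption>"
    "<tr><td>1951</td></tr></table>"
)
page_tables = BeautifulSoup(page_html, "html.parser").find_all("table")

caption_index = None  # stays None if no caption matches
for index, candidate in enumerate(page_tables):
    caption = candidate.find("caption")
    if caption is not None and "Population growth of India per decade" in caption.get_text():
        caption_index = index
print(caption_index)
```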


See if you can locate the table's caption, `Population growth of India per decade`, in the output below.


In [88]:
print(tables[table_index].prettify())

<table class="wikitable sortable">
 <caption>
  Population growth of India per decade
  <sup class="reference" id="cite_ref-Census_Population_49-0">
   <a href="#cite_note-Census_Population-49">
    [48]
   </a>
  </sup>
 </caption>
 <tbody>
  <tr>
   <th scope="col" style="text-align: left;">
    Census year
   </th>
   <th data-sort-type="number" scope="col" style="text-align: right;">
    Population
   </th>
   <th data-sort-type="number" scope="col" style="text-align: right;">
    Change (%)
   </th>
  </tr>
  <tr>
   <td>
    1951
   </td>
   <td style="text-align: right;">
    361,088,003
   </td>
   <td style="text-align: right;">
    –
   </td>
  </tr>
  <tr>
   <td>
    1961
   </td>
   <td style="text-align: right;">
    439,235,000
   </td>
   <td style="text-align: right;">
    21.6
   </td>
  </tr>
  <tr>
   <td>
    1971
   </td>
   <td style="text-align: right;">
    548,160,000
   </td>
   <td style="text-align: right;">
    24.8
   </td>
  </tr>
  <tr>
   <td>
    1981

In [91]:
rows = []

for row in tables[table_index].tbody.find_all("tr"):
    col = row.find_all("td")
    if col:  # the header row contains only <th> cells, so it is skipped
        year = col[0].text
        population = col[1].text
        change = col[2].text.strip()
        rows.append({"Year":year, "Population":population, "Change":change})

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
population_data = pd.DataFrame(rows, columns=["Year", "Population", "Change"])

population_data

|   | Year | Population | Change |
|---|---|---|---|
| 0 | 1951\n | 361,088,003\n | – |
| 1 | 1961\n | 439,235,000\n | 21.6 |
| 2 | 1971\n | 548,160,000\n | 24.8 |
| 3 | 1981\n | 683,329,000\n | 24.7 |
| 4 | 1991\n | 846,387,888\n | 23.9 |
| 5 | 2001\n | 1,028,737,436\n | 21.5 |
| 6 | 2011\n | 1,210,726,932\n | 17.7 |
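The scraped values are still strings, with trailing newlines and thousands separators. One possible cleanup, sketched on two rows copied from the output above, strips the whitespace, removes the commas, and converts the columns to numbers. The names `raw` and `clean` are chosen for this example.

```python
import pandas as pd

# Two rows copied from the output above, including the trailing newlines
raw = pd.DataFrame({
    "Year": ["1951\n", "1961\n"],
    "Population": ["361,088,003\n", "439,235,000\n"],
    "Change": ["–", "21.6"],
})

clean = pd.DataFrame({
    "Year": raw["Year"].str.strip().astype(int),
    "Population": (
        raw["Population"].str.replace(",", "", regex=False).str.strip().astype(int)
    ),
    # The en dash placeholder cannot be parsed, so it becomes NaN
    "Change": pd.to_numeric(raw["Change"], errors="coerce"),
})
print(clean)
```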


## Scrape data from HTML tables into a DataFrame using BeautifulSoup and read_html


Using the same `url`, `data`, `soup`, and `tables` object as in the last section we can use the `read_html` function to create a DataFrame.

Remember the table we need is located in `tables[table_index]`

We can now use the `pandas` function `read_html` and give it the string version of the table as well as the `flavor` which is the parsing engine `bs4`.


In [92]:
pd.read_html(str(tables[table_index]), flavor='bs4')

[   Census year  Population Change (%)
 0         1951   361088003          –
 1         1961   439235000       21.6
 2         1971   548160000       24.8
 3         1981   683329000       24.7
 4         1991   846387888       23.9
 5         2001  1028737436       21.5
 6         2011  1210726932       17.7]

The function `read_html` always returns a list of DataFrames so we must pick the one we want out of the list.


In [93]:
population_data_read_html = pd.read_html(str(tables[table_index]), flavor='bs4')[0]

population_data_read_html

|   | Census year | Population | Change (%) |
|---|---|---|---|
| 0 | 1951 | 361088003 | – |
| 1 | 1961 | 439235000 | 21.6 |
| 2 | 1971 | 548160000 | 24.8 |
| 3 | 1981 | 683329000 | 24.7 |
| 4 | 1991 | 846387888 | 23.9 |
| 5 | 2001 | 1028737436 | 21.5 |
| 6 | 2011 | 1210726932 | 17.7 |
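Recent pandas releases (2.1 and later) deprecate passing a literal HTML string to `read_html` and recommend wrapping it in `io.StringIO`. The sketch below shows that pattern on a small made-up table; it relies on the default parser selection rather than `flavor='bs4'`, and assumes at least one supported HTML parser (such as `lxml`, or `bs4` with `html5lib`) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row table in the same shape as the census table
table_html = (
    "<table>"
    "<tr><th>Census year</th><th>Population</th></tr>"
    "<tr><td>1951</td><td>361,088,003</td></tr>"
    "<tr><td>1961</td><td>439,235,000</td></tr>"
    "</table>"
)

# Wrapping the string in StringIO makes pandas treat it as a file-like object
census_frames = pd.read_html(StringIO(table_html))
census = census_frames[0]
print(census)
```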


## Scrape data from HTML tables into a DataFrame using read_html


We can also use the `read_html` function to directly get DataFrames from a `url`.


In [96]:
dataframe_list = pd.read_html(url, flavor='bs4')

We can see there are 38 DataFrames just like when we used `find_all` on the `soup` object.


In [97]:
len(dataframe_list)

38

Finally we can pick the DataFrame we need out of the list.


In [99]:
dataframe_list[7]

|   | Census year | Population | Change (%) |
|---|---|---|---|
| 0 | 1951 | 361088003 | – |
| 1 | 1961 | 439235000 | 21.6 |
| 2 | 1971 | 548160000 | 24.8 |
| 3 | 1981 | 683329000 | 24.7 |
| 4 | 1991 | 846387888 | 23.9 |
| 5 | 2001 | 1028737436 | 21.5 |
| 6 | 2011 | 1210726932 | 17.7 |


We can also use the `match` parameter to select the specific table we want. If the table contains a string matching the text it will be read.


In [100]:
pd.read_html(url, match="Comparative demographics", flavor='bs4')[0]

|   | Category | Global ranking | References |
|---|---|---|---|
| 0 | Area | 7th | [45] |
| 1 | Population | 2nd | [45] |
| 2 | Population growth rate | 102nd of 212 | in 2010[46] |
| 3 | Population density | 24th of 212 | in 2010[46] |
| 4 | Male to Female ratio, at birth | 12th of 214 | in 2009[47] |
