# Web Scraping Lab

Import the required modules and functions.

In [8]:
import requests
from bs4 import BeautifulSoup  # the BeautifulSoup class lives in the bs4 package

# Beautiful Soup Object 


Beautiful Soup is a Python library for pulling data out of HTML and XML files; we will focus on HTML files. It represents the HTML as a set of objects with methods used to parse the HTML. We can navigate the HTML as a tree, filter out what we are looking for, or both.

In [9]:
%%html
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h3><b id='boldest'>Lebron James</b></h3>
<p> Salary: $ 92,000,000 </p>
<h3> Stephen Curry</h3>
<p> Salary: $85,000, 000 </p>
<h3> Kevin Durant </h3>
<p> Salary: $73,200, 000</p>
</body>
</html>
 

In [10]:
html="<!DOCTYPE html><html><head><title>Page Title</title></head><body><h3><b id='boldest'>Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body></html>"

To parse a document, pass it into the BeautifulSoup constructor. The resulting BeautifulSoup object represents the document as a nested data structure:

In [11]:
soup = BeautifulSoup(html, "html.parser")


First, the document is converted to Unicode (a character set that includes ASCII), and HTML entities are converted to Unicode characters. Beautiful Soup then transforms a complex HTML document into a tree of Python objects. The BeautifulSoup object can create other types of objects. In this lab, we will cover BeautifulSoup and Tag objects, which for the purposes of this lab can be treated as identical, and NavigableString objects.
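The object types mentioned above can be inspected directly. The following is a minimal sketch, assuming the same html.parser backend used in this lab; the short snippet and the variable names (snippet, demo_soup, demo_tag, demo_text) are illustrative and not part of the lab's document.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Illustrative snippet (an assumption, not the lab's full document).
snippet = "<h3><b id='boldest'>Lebron James</b></h3>"
demo_soup = BeautifulSoup(snippet, "html.parser")

demo_tag = demo_soup.b        # a Tag object
demo_text = demo_tag.string   # a NavigableString object

print(type(demo_soup).__name__)
print(type(demo_tag).__name__)
print(type(demo_text).__name__)

# BeautifulSoup is a subclass of Tag, so both offer the same
# navigation and search methods.
print(isinstance(demo_soup, Tag))

# NavigableString is a subclass of Python's str.
print(isinstance(demo_text, str))
```

This is why the lab can treat BeautifulSoup and Tag objects alike: anything we call on a Tag can also be called on the soup itself.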

We can use the method prettify() to display the HTML in the **nested structure**:

In [12]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Page Title
  </title>
 </head>
 <body>
  <h3>
   <b id="boldest">
    Lebron James
   </b>
  </h3>
  <p>
   Salary: $ 92,000,000
  </p>
  <h3>
   Stephen Curry
  </h3>
  <p>
   Salary: $85,000, 000
  </p>
  <h3>
   Kevin Durant
  </h3>
  <p>
   Salary: $73,200, 000
  </p>
 </body>
</html>



# Tags


Let's say we want the title of the page and the name of the top-paid player. We can use the Tag object, which corresponds to an HTML tag in the original document, for example, the tag title.

In [13]:
tag_object=soup.title
print("tag object:",tag_object)

tag object: <title>Page Title</title>


We can see that the object's type is bs4.element.Tag:

In [15]:
print("tag object type",type(tag_object))

tag object type <class 'bs4.element.Tag'>


If there is more than one Tag with the same name, the first element with that Tag name is returned. Here this corresponds to the highest-paid player:

In [20]:
tag_object = soup.h3 # print first h3 from html
tag_object

<h3><b id="boldest">Lebron James</b></h3>

* The name is enclosed in the bold tag b. Here it helps to use the tree representation: we can navigate down the tree using a child attribute to get the name.

## Children, Parents, and Siblings

In [26]:
tag_child = tag_object.b # b is a tag which is inside of h3 tag
tag_child

<b id="boldest">Lebron James</b>

You can access the parent with the **parent** attribute:

In [27]:
parent_tag = tag_child.parent # h3 is parent of tag_child
parent_tag

<h3><b id="boldest">Lebron James</b></h3>

In [28]:
tag_object

<h3><b id="boldest">Lebron James</b></h3>

In [29]:
tag_object.parent # show main parent

<body><h3><b id="boldest">Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body>

In [31]:
sibling_1 = tag_object.next_sibling
sibling_1

<p> Salary: $ 92,000,000 </p>

* sibling_2 is the header element which is also a sibling of both sibling_1 and tag_object.

In [32]:
sibling_2 = sibling_1.next_sibling
sibling_2

<h3> Stephen Curry</h3>


# Exercise: next_sibling

Use the object sibling_2 and the property next_sibling to find the salary of Stephen Curry:

In [36]:

sibling_2.next_sibling

<p> Salary: $85,000, 000 </p>
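The next_sibling calls above return tags because the lab's HTML string contains no whitespace between tags. When the HTML is formatted with line breaks, the next sibling of a tag is often a text node holding a newline. The sketch below shows this, assuming a small hand-formatted snippet (formatted_html is illustrative), and uses find_next_sibling() to skip over text nodes.

```python
from bs4 import BeautifulSoup

# Same kind of content as the lab, but with line breaks between tags
# (an assumption made for illustration).
formatted_html = """<body>
<h3>Stephen Curry</h3>
<p>Salary: $85,000,000</p>
</body>"""
formatted_soup = BeautifulSoup(formatted_html, "html.parser")

name_tag = formatted_soup.h3

# The immediate sibling is the newline text node between the two tags.
print(repr(name_tag.next_sibling))

# find_next_sibling() only considers tags, so it returns the <p> element.
salary_tag = name_tag.find_next_sibling("p")
print(salary_tag)
```

When scraping real pages, find_next_sibling() is usually the safer choice because it does not depend on how the source was indented.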

## HTML Attributes

* A tag can have attributes. The tag <b id="boldest"> has an attribute id whose value is boldest. You can access a tag’s attributes by treating the tag like a dictionary:

In [37]:
tag_child['id']

'boldest'


You can access that dictionary directly as attrs:

In [38]:
tag_child.attrs

{'id': 'boldest'}

* We can also obtain the value of a tag's attribute using the get() method, which works like Python's dict.get():

In [39]:
tag_child.get('id')

'boldest'
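The two access styles differ when the attribute is missing. The sketch below shows this, assuming a one-tag snippet; attr_soup, bold_tag, and the "no-class" default are illustrative names.

```python
from bs4 import BeautifulSoup

attr_soup = BeautifulSoup("<b id='boldest'>Lebron James</b>", "html.parser")
bold_tag = attr_soup.b

# get() returns None (or a supplied default) when the attribute is absent.
print(bold_tag.get("id"))
print(bold_tag.get("class"))
print(bold_tag.get("class", "no-class"))

# Dictionary-style access raises KeyError for a missing attribute.
missing_raised = False
try:
    bold_tag["class"]
except KeyError:
    missing_raised = True
print("KeyError raised:", missing_raised)
```

Using get() is convenient inside loops over many tags, where some of them may not carry the attribute we are after.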

## Navigable String

A string corresponds to a bit of text or content within a tag. Beautiful Soup uses the NavigableString class to contain this text. In our HTML we can obtain the name of the first player by extracting the string of the Tag object tag_child as follows:

In [40]:
tag_string=tag_child.string # print first string
tag_string

'Lebron James'

In [41]:
type(tag_string)

bs4.element.NavigableString

A NavigableString is just like a Python string (a Unicode string, to be more precise). The main difference is that it also supports some Beautiful Soup features. We can convert it to a string object in Python:

In [43]:
unicode_string = str(tag_string)
unicode_string

'Lebron James'

## Filter

Filters allow you to find complex patterns; the simplest filter is a string. In this section we will pass a string to different filter methods, and Beautiful Soup will perform a match against that exact string. Consider the following HTML of rocket launches:

In [44]:
%%html
<table>
  <tr>
    <td id='flight' >Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
   </tr>
  <tr> 
    <td>1</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a> </td>
    <td>80 kg</td>
  </tr>
</table>

Flight No,Launch site,Payload mass
1,Florida,300 kg
2,Texas,94 kg
3,Florida,80 kg


In [45]:
table="<table><tr><td id='flight' >Flight No</td><td>Launch site</td><td>Payload mass</td></tr><tr><td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td><td>300 kg</td></tr><tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a> </td><td>80 kg</td></tr></table>"

In [46]:
table_bs = BeautifulSoup(table, "html.parser")

# Find all

The find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters.

The method signature is find_all(name, attrs, recursive, string, limit, **kwargs).
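The name, attrs, and string parameters are covered in the sections that follow. The limit and recursive parameters can be sketched as below, assuming a small two-row table (rows_html and the other variable names are illustrative).

```python
from bs4 import BeautifulSoup

rows_html = (
    "<table>"
    "<tr><td>1</td><td>Florida</td></tr>"
    "<tr><td>2</td><td>Texas</td></tr>"
    "</table>"
)
rows_soup = BeautifulSoup(rows_html, "html.parser")

# limit stops the search after the given number of matches.
first_two_cells = rows_soup.find_all("td", limit=2)
print(first_two_cells)

# recursive=False only inspects direct children of the tag. The <td> tags
# are nested inside <tr>, so none of them is a direct child of <table>.
direct_cells = rows_soup.table.find_all("td", recursive=False)
direct_rows = rows_soup.table.find_all("tr", recursive=False)
print(len(direct_cells), len(direct_rows))
```

Note that this relies on html.parser, which does not insert a tbody element; other parsers may build a slightly different tree.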

## Name

When we set the name parameter to a tag name, the method will extract all the tags with that name, together with their children.

In [51]:
tag_row = table_bs.find_all('tr') # use to find all tr tag
tag_row

[<tr><td id="flight">Flight No</td><td>Launch site</td><td>Payload mass</td></tr>,
 <tr><td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td><td>300 kg</td></tr>,
 <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>,
 <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a> </td><td>80 kg</td></tr>]

In [52]:
first_row = tag_row[0]  # tag_row is the list returned by find_all('tr') above
first_row

<tr><td id="flight">Flight No</td><td>Launch site</td><td>Payload mass</td></tr>

In [53]:
print(type(first_row))

<class 'bs4.element.Tag'>

In [54]:
first_row.td

<td id="flight">Flight No</td>

## Attributes

If an argument is not recognized, it is turned into a filter on the tag’s attributes. For example, with the id argument, Beautiful Soup filters against each tag’s id attribute. The first td element has an id of flight, so we can filter on that id value:

In [55]:
table_bs.find_all(id="flight")

[<td id="flight">Flight No</td>]

In [56]:
list_input=table_bs.find_all(href="https://en.wikipedia.org/wiki/Florida")
list_input

[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>,
 <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>]

If we set the href argument to True, the code finds all tags that have an href attribute, regardless of its value:

In [57]:
table_bs.find_all(href=True)

[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>,
 <a href="https://en.wikipedia.org/wiki/Texas">Texas</a>,
 <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>]

# string

With string you can search for strings instead of tags. Here we find all the strings that are exactly Florida:

In [58]:
table_bs.find_all(string="Florida")

['Florida', 'Florida']
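Because a plain string filter matches exactly, text with extra whitespace is not found. Passing a compiled regular expression instead performs a pattern search. The sketch below assumes a simplified table without links (cells_html is illustrative).

```python
import re

from bs4 import BeautifulSoup

cells_html = (
    "<table>"
    "<tr><td>Florida</td><td>300 kg</td></tr>"
    "<tr><td>Florida </td><td>80 kg</td></tr>"
    "</table>"
)
cells_soup = BeautifulSoup(cells_html, "html.parser")

# Exact match: the cell with a trailing space is skipped.
exact = cells_soup.find_all(string="Florida")
print(exact)

# Regular expression: any string containing "Florida" is returned.
pattern = cells_soup.find_all(string=re.compile("Florida"))
print(pattern)

# The same idea finds every payload mass ending in "kg".
masses = cells_soup.find_all(string=re.compile(r"kg$"))
print(masses)
```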


# find

The find_all() method scans the entire document looking for results. If you are looking for only one element, you can use the find() method to find the first matching element in the document. Consider the following two tables:

In [59]:
%%html
<h3>Rocket Launch </h3>

<p>
<table class='rocket'>
  <tr>
    <td>Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
  </tr>
  <tr>
    <td>1</td>
    <td>Florida</td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td>Texas</td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td>Florida </td>
    <td>80 kg</td>
  </tr>
</table>
</p>
<p>

<h3>Pizza Party  </h3>
  
    
<table class='pizza'>
  <tr>
    <td>Pizza Place</td>
    <td>Orders</td> 
    <td>Slices </td>
   </tr>
  <tr>
    <td>Domino's Pizza</td>
    <td>10</td>
    <td>100</td>
  </tr>
  <tr>
    <td>Little Caesars</td>
    <td>12</td>
    <td >144 </td>
  </tr>
  <tr>
    <td>Papa John's </td>
    <td>15 </td>
    <td>165</td>
  </tr>


Flight No,Launch site,Payload mass
1,Florida,300 kg
2,Texas,94 kg
3,Florida,80 kg

Pizza Place,Orders,Slices
Domino's Pizza,10,100
Little Caesars,12,144
Papa John's,15,165


We store the HTML as a Python string and assign it to two_tables.

In [60]:
two_tables="<h3>Rocket Launch </h3><p><table class='rocket'><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table></p><p><h3>Pizza Party  </h3><table class='pizza'><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td >144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr>"

In [61]:
two_tables_bs= BeautifulSoup(two_tables, 'html.parser')

In [62]:
two_tables_bs.find("table") # find first table

<table class="rocket"><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table>


We can filter on the class attribute to find the second table, but because class is a keyword in Python, we add an underscore and use class_.

In [63]:
two_tables_bs.find("table", class_='pizza')

<table class="pizza"><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td>144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr></table>
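The class filter described above can be written in more than one way. The sketch below compares the class_ keyword, an attrs dictionary, and a CSS selector through select_one(), assuming two tiny tables (menus_html is illustrative).

```python
from bs4 import BeautifulSoup

menus_html = (
    "<table class='rocket'><tr><td>Florida</td></tr></table>"
    "<table class='pizza'><tr><td>Little Caesars</td></tr></table>"
)
menus_soup = BeautifulSoup(menus_html, "html.parser")

# Three equivalent ways to pick the table whose class is "pizza".
by_keyword = menus_soup.find("table", class_="pizza")
by_attrs = menus_soup.find("table", attrs={"class": "pizza"})
by_css = menus_soup.select_one("table.pizza")

print(by_keyword)

# All three calls return the very same Tag object from the tree.
print(by_keyword is by_attrs, by_keyword is by_css)
```

The attrs dictionary is also useful for attribute names that are not valid Python identifiers, such as data-* attributes.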


# Downloading And Scraping The Contents Of A Web Page

In [65]:
url = "http://www.ibm.com"


We use requests.get() to download the contents of the web page in text format and store it in a variable called data.

In [69]:
data  =requests.get(url).text

We create a BeautifulSoup object using the BeautifulSoup constructor

In [70]:
soup = BeautifulSoup(data,"html.parser")  # create a soup object using the variable 'data'

In [71]:
for link in soup.find_all('a',href=True):  # in html anchor/link is represented by the tag <a>

    print(link.get('href'))


https://www.ibm.com/granite?lnk=dev
https://developer.ibm.com/technologies/artificial-intelligence?lnk=dev
https://www.ibm.com/products/watsonx-code-assistant?lnk=dev
https://www.ibm.com/watsonx/developer/?lnk=dev
https://www.ibm.com/thought-leadership/institute-business-value/report/ceo-generative-ai?lnk=bus
https://www.ibm.com/think/videos/ai-academy
https://www.ibm.com/products/watsonx-orchestrate/ai-agent-for-hr?lnk=bus
https://www.ibm.com/products/guardium-data-security-center?lnk=bus
https://www.ibm.com/artificial-intelligence?lnk=ProdC
https://www.ibm.com/hybrid-cloud?lnk=ProdC
https://www.ibm.com/consulting?lnk=ProdC
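The links printed above are absolute URLs, but many pages also use relative links such as /about. A common way to normalize them is urljoin from the standard library. The sketch below assumes a hypothetical base address and a made-up set of anchors (base_url and page_html are not taken from the lab).

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://www.example.com/products/"  # hypothetical page address
page_html = (
    "<a href='/about'>About</a>"
    "<a href='pricing.html'>Pricing</a>"
    "<a href='https://www.example.org/docs'>Docs</a>"
    "<a>No link</a>"
)
links_soup = BeautifulSoup(page_html, "html.parser")

# href=True skips anchors without an href; urljoin resolves relative
# links against the page address and leaves absolute links unchanged.
absolute_links = [
    urljoin(base_url, anchor["href"])
    for anchor in links_soup.find_all("a", href=True)
]

for link in absolute_links:
    print(link)
```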


## Scrape all image tags

In [73]:
for link in soup.find_all('img'):# in html image is represented by the tag <img>
    print(link)
    print(link.get('src'))

## Scrape data from HTML tables

In [74]:
#The below url contains an html table with data about colors and color codes.
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"

Before proceeding to scrape a website, you need to examine its contents and the way the data is organized. Open the above URL in your browser and check how many rows and columns there are in the color table.

In [76]:
# get the contents of the webpage in text format and store in a variable called data
data  = requests.get(url).text
data


'<html>\n   <body>\n      <h1>Partital List  of HTML5 Supported Colors</h1>\n<table border ="1" class="main-table">\n   <tr>\n      <td>Number </td>\n      <td>Color</td>\n      <td>Color Name</td>\n      <td>Hex Code<br>#RRGGBB</td>\n      <td>Decimal Code<br>(R,G,B)</td>\n   </tr>\n   <tr>\n      <td>1</td>\n      <td style="background:lightsalmon;">&nbsp;</td>\n      <td>lightsalmon</td>\n      <td>#FFA07A</td>\n      <td>rgb(255,160,122)</td>\n   </tr>\n   <tr>\n      <td>2</td>\n      <td style="background:salmon;">&nbsp;</td>\n      <td>salmon</td>\n      <td>#FA8072</td>\n      <td>rgb(250,128,114)</td>\n   </tr>\n   <tr>\n      <td>3</td>\n      <td style="background:darksalmon;">&nbsp;</td>\n      <td>darksalmon</td>\n      <td>#E9967A</td>\n      <td>rgb(233,150,122)</td>\n   </tr>\n   <tr>\n      <td>4</td>\n      <td style="background:lightcoral;">&nbsp;</td>\n      <td>lightcoral</td>\n      <td>#F08080</td>\n      <td>rgb(240,128,128)</td>\n   </tr>\n   <tr>\n      <td>5<

In [77]:
soup = BeautifulSoup(data,"html.parser")

In [81]:
#find a html table in the web page
table = soup.find('table') # in html table is represented by the tag <table>
table

<table border="1" class="main-table">
<tr>
<td>Number </td>
<td>Color</td>
<td>Color Name</td>
<td>Hex Code<br/>#RRGGBB</td>
<td>Decimal Code<br/>(R,G,B)</td>
</tr>
<tr>
<td>1</td>
<td style="background:lightsalmon;"> </td>
<td>lightsalmon</td>
<td>#FFA07A</td>
<td>rgb(255,160,122)</td>
</tr>
<tr>
<td>2</td>
<td style="background:salmon;"> </td>
<td>salmon</td>
<td>#FA8072</td>
<td>rgb(250,128,114)</td>
</tr>
<tr>
<td>3</td>
<td style="background:darksalmon;"> </td>
<td>darksalmon</td>
<td>#E9967A</td>
<td>rgb(233,150,122)</td>
</tr>
<tr>
<td>4</td>
<td style="background:lightcoral;"> </td>
<td>lightcoral</td>
<td>#F08080</td>
<td>rgb(240,128,128)</td>
</tr>
<tr>
<td>5</td>
<td style="background:coral;"> </td>
<td>coral</td>
<td>#FF7F50</td>
<td>rgb(255,127,80)</td>
</tr>
<tr>
<td>6</td>
<td style="background:tomato;"> </td>
<td>tomato</td>
<td>#FF6347</td>
<td>rgb(255,99,71)</td>
</tr>
<tr>
<td>7</td>
<td style="background:orangered;"> </td>
<td>orangered</td>
<td>#FF4500</td>
<td>rgb

In [79]:
#Get all rows from the table
for row in table.find_all('tr'): # in html table row is represented by the tag <tr>
    # Get all columns in each row.
    cols = row.find_all('td') # in html a column is represented by the tag <td>
    color_name = cols[2].string # store the value in column 3 as color_name
    color_code = cols[3].string # store the value in column 4 as color_code
    print("{}--->{}".format(color_name,color_code))

Color Name--->None
lightsalmon--->#FFA07A
salmon--->#FA8072
darksalmon--->#E9967A
lightcoral--->#F08080
coral--->#FF7F50
tomato--->#FF6347
orangered--->#FF4500
gold--->#FFD700
orange--->#FFA500
darkorange--->#FF8C00
lightyellow--->#FFFFE0
lemonchiffon--->#FFFACD
papayawhip--->#FFEFD5
moccasin--->#FFE4B5
peachpuff--->#FFDAB9
palegoldenrod--->#EEE8AA
khaki--->#F0E68C
darkkhaki--->#BDB76B
yellow--->#FFFF00
lawngreen--->#7CFC00
chartreuse--->#7FFF00
limegreen--->#32CD32
lime--->#00FF00
forestgreen--->#228B22
green--->#008000
powderblue--->#B0E0E6
lightblue--->#ADD8E6
lightskyblue--->#87CEFA
skyblue--->#87CEEB
deepskyblue--->#00BFFF
lightsteelblue--->#B0C4DE
dodgerblue--->#1E90FF
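The first line of the output shows None for the header row. This happens because the Hex Code header cell contains a br tag, so the cell has several children and .string cannot return a single string. The sketch below reproduces this with a cut-down header row (header_html is illustrative) and uses get_text() to collect all the text.

```python
from bs4 import BeautifulSoup

header_html = (
    "<table><tr>"
    "<td>Color Name</td>"
    "<td>Hex Code<br>#RRGGBB</td>"
    "</tr></table>"
)
header_soup = BeautifulSoup(header_html, "html.parser")
header_cells = header_soup.find_all("td")

# A cell with exactly one text child: .string returns that text.
print(header_cells[0].string)

# A cell with text, a <br/> tag, and more text: .string returns None.
print(header_cells[1].string)

# get_text() joins every piece of text inside the cell.
hex_header = header_cells[1].get_text(separator=" ", strip=True)
print(hex_header)
```

In the loop above, replacing .string with get_text(strip=True) would avoid the None, or the header row could simply be skipped.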


## Scrape data from HTML tables into a DataFrame using BeautifulSoup and Pandas

In [82]:
import pandas as pd

In [83]:
#The below url contains html tables with data about world population.
url = "https://en.wikipedia.org/wiki/World_population"

Before proceeding to scrape a website, you need to examine its contents and the way the data is organized. Open the above URL in your browser and check the tables on the web page.

In [84]:
# get the contents of the webpage in text format and store in a variable called data
data  = requests.get(url).text

In [85]:
soup = BeautifulSoup(data,"html.parser")

In [87]:
#find all html tables in the web page
tables = soup.find_all('table') # in html table is represented by the tag <table>

In [88]:
# we can see how many tables were found by checking the length of the tables list
len(tables)

31

Assume that we are looking for the 10 most densely populated countries table. We can look through the tables list and find the right one based on the data in each table, or we can search for the table name if it appears in the table, although this option might not always work.

In [89]:
for index,table in enumerate(tables):
    if ("10 most densely populated countries" in str(table)):
        table_index = index
print(table_index)

8
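Searching the whole string of each table can also match a table that merely mentions the phrase in one of its cells. A slightly stricter variant checks only the caption element. The sketch below assumes three made-up tables (captions_html is illustrative, not Wikipedia's markup) and stops at the first caption that matches.

```python
from bs4 import BeautifulSoup

captions_html = (
    "<table><caption>Largest cities</caption><tr><td>A</td></tr></table>"
    "<table><tr><td>See 10 most densely populated countries</td></tr></table>"
    "<table><caption>10 most densely populated countries</caption>"
    "<tr><td>B</td></tr></table>"
)
captions_soup = BeautifulSoup(captions_html, "html.parser")
all_tables = captions_soup.find_all("table")

wanted = "10 most densely populated countries"
caption_index = None  # stays None when no caption matches

for index, candidate in enumerate(all_tables):
    caption = candidate.find("caption")
    # Only tables that have a caption containing the phrase qualify.
    if caption is not None and wanted in caption.get_text():
        caption_index = index
        break

print(caption_index)
```

Initializing caption_index to None also makes the "not found" case explicit instead of raising a NameError later.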


See if you can locate the table name of the table, 10 most densely populated countries, below.

In [90]:
print(tables[table_index].prettify())

<table class="wikitable sortable" style="text-align:right">
 <caption>
  10 most densely populated countries
  <small>
   (with population above 5 million)
  </small>
  <sup class="reference" id="cite_ref-:10_104-0">
   <a href="#cite_note-:10-104">
    <span class="cite-bracket">
     [
    </span>
    99
    <span class="cite-bracket">
     ]
    </span>
   </a>
  </sup>
 </caption>
 <tbody>
  <tr>
   <th scope="col">
    Rank
   </th>
   <th scope="col">
    Country
   </th>
   <th scope="col">
    Population
   </th>
   <th scope="col">
    Area
    <br/>
    <small>
     (km
     <sup>
      2
     </sup>
     )
    </small>
   </th>
   <th scope="col">
    Density
    <br/>
    <small>
     (pop/km
     <sup>
      2
     </sup>
     )
    </small>
   </th>
  </tr>
  <tr>
   <td>
    1
   </td>
   <td align="left">
    <span class="flagicon">
     <span class="mw-image-border" typeof="mw:File">
      <span>
       <img alt="" class="mw-file-element" data-file-height="600" data-fi

In [91]:
rows = []  # collect one dict per table row; DataFrame.append was removed in pandas 2.0

for row in tables[table_index].tbody.find_all("tr"):
    col = row.find_all("td")
    if col:  # the header row has only <th> cells, so col is empty there
        rows.append({
            "Rank": col[0].text,
            "Country": col[1].text,
            "Population": col[2].text.strip(),
            "Area": col[3].text.strip(),
            "Density": col[4].text.strip(),
        })

# build the DataFrame once, after all rows have been collected
population_data = pd.DataFrame(rows, columns=["Rank", "Country", "Population", "Area", "Density"])

population_data
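Values scraped with .text are strings, so numeric columns usually need converting before any analysis. The sketch below shows one way to do this with pandas, assuming made-up values with thousands separators (the scraped DataFrame here is illustrative and does not contain real population figures).

```python
import pandas as pd

# Made-up values for illustration only.
scraped = pd.DataFrame({
    "Rank": ["1", "2"],
    "Population": ["5,000,000", "12,345,678"],
})

# Plain digit strings can be cast directly.
scraped["Rank"] = scraped["Rank"].astype(int)

# Remove the thousands separators, then convert to numbers.
scraped["Population"] = pd.to_numeric(
    scraped["Population"].str.replace(",", "", regex=False)
)

print(scraped.dtypes)
print(scraped)
```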


## Scrape data from HTML tables into a DataFrame using BeautifulSoup and read_html

Using the same url, data, soup, and tables object as in the last section we can use the read_html function to create a DataFrame.

Remember the table we need is located in tables[table_index]

We can now use the pandas function read_html and give it the string version of the table as well as the flavor which is the parsing engine bs4.

In [92]:
from io import StringIO  # newer pandas versions deprecate passing literal HTML strings

# flavor='bs4' relies on the optional html5lib package (install it with pip or conda)
pd.read_html(StringIO(str(tables[table_index])), flavor='bs4')

The function read_html always returns a list of DataFrames so we must pick the one we want out of the list.

In [93]:
population_data_read_html = pd.read_html(StringIO(str(tables[table_index])), flavor='bs4')[0]

population_data_read_html

## Scrape data from HTML tables into a DataFrame using read_html


We can also use the read_html function to directly get DataFrames from a url

In [95]:
dataframe_list = pd.read_html(url, flavor='bs4')  # flavor='bs4' needs the optional html5lib package

We can then check len(dataframe_list) to see how many DataFrames were returned and compare that with the number of tables that find_all found on the soup object.

# End