In [2]:
from bs4 import BeautifulSoup  # this module helps with web scraping
import requests  # this module helps us to download a web page

<h2 id="BSO">Beautiful Soup Objects</h2>

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It does this by representing the HTML as a set of objects with methods for parsing it. We can navigate the HTML as a tree and/or filter out what we are looking for.

Consider the following HTML:

In [3]:
%%html
<!DOCTYPE html>
<html>
    <head>
    <title> Page Title </title>
    </head>
    <body>
    <h3><b id='boldest'>Lebron James</b></h3>
    <p> Salary: $92,000,000 </p>
    <h3> Stephen Curry </h3>
    <p> Salary: $85,000,000 </p>
    <h3> Kevin Durant </h3>
    <p> Salary: $73,200,000 </p>
    </body>
</html>
    
    

We can store it as a string in the variable <code>html</code>:

In [4]:
html="<!DOCTYPE html><html><head><title>Page Title</title></head><body><h3><b id='boldest'>Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body></html>"

To parse a document, pass it into the <code>BeautifulSoup</code> constructor. This returns a <code>BeautifulSoup</code> object, which represents the document as a nested data structure:


In [5]:
soup = BeautifulSoup(html, "html.parser")

First, the document is converted to Unicode (a character set that includes ASCII), and HTML entities are converted to Unicode characters. Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. The <code>BeautifulSoup</code> object can create other types of objects. In this lab, we will cover <code>BeautifulSoup</code> and <code>Tag</code> objects, which for the purposes of this lab behave identically, and <code>NavigableString</code> objects.


We can use the <code>prettify()</code> method to display the HTML in its nested structure:

In [6]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Page Title
  </title>
 </head>
 <body>
  <h3>
   <b id="boldest">
    Lebron James
   </b>
  </h3>
  <p>
   Salary: $ 92,000,000
  </p>
  <h3>
   Stephen Curry
  </h3>
  <p>
   Salary: $85,000, 000
  </p>
  <h3>
   Kevin Durant
  </h3>
  <p>
   Salary: $73,200, 000
  </p>
 </body>
</html>


## Tags

The <code>Tag</code> object corresponds to an HTML tag in the original document, for example the <code>title</code> tag:

In [7]:
tag_object = soup.title
print("Tag object: ", tag_object)

Tag object:  <title>Page Title</title>


In [8]:
print("Tag object Type: ", type(tag_object))

Tag object Type:  <class 'bs4.element.Tag'>


If there is more than one tag with the same name, attribute-style access returns the first element with that name. Here that is the <code>h3</code> element for the highest-paid player:


In [9]:
tag_object= soup.h3
tag_object

<h3><b id="boldest">Lebron James</b></h3>
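
As a small aside, attribute-style access such as `soup.h3` behaves like calling `find()` with the tag name, and it gives `None` when no such tag exists. The sketch below illustrates this on a shortened copy of the player HTML; the names `snippet` and `snippet_soup` are made up for this example.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative copy of the player HTML used above.
snippet = "<body><h3><b id='boldest'>Lebron James</b></h3><h3> Stephen Curry</h3></body>"
snippet_soup = BeautifulSoup(snippet, "html.parser")

first_h3 = snippet_soup.h3          # attribute-style access
same_h3 = snippet_soup.find("h3")   # explicit equivalent
missing = snippet_soup.h4           # there is no <h4> in the snippet

print(first_h3 is same_h3)
print(missing)
```

Checking for `None` before navigating further avoids an `AttributeError` when a page lacks the tag you expect.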

## Children, Parents, and Siblings

As stated above, the <code>Tag</code> object is part of a tree of objects. We can access the child of the tag, or navigate down the branch, as follows:

In [10]:
tag_child =tag_object.b
tag_child

<b id="boldest">Lebron James</b>

You can access the parent with the <code>parent</code> attribute:


In [11]:
parent_tag = tag_child.parent
parent_tag

<h3><b id="boldest">Lebron James</b></h3>

In [12]:
tag_object.parent

<body><h3><b id="boldest">Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body>

In [13]:
sibling_1=tag_object.next_sibling
sibling_1

<p> Salary: $ 92,000,000 </p>
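
In the string above there is no whitespace between tags, so `next_sibling` lands directly on the `<p>` element. When the HTML contains line breaks between tags, the next sibling is usually a text node instead. The following sketch, which uses a made-up `spaced_html` string, shows the difference and how `find_next_sibling()` skips over text nodes.

```python
from bs4 import BeautifulSoup

# Illustrative HTML with a newline between the two tags (not the lab's string).
spaced_html = "<body><h3>Kevin Durant</h3>\n<p> Salary: $73,200,000 </p></body>"
spaced_soup = BeautifulSoup(spaced_html, "html.parser")

heading = spaced_soup.h3
raw_sibling = heading.next_sibling            # the "\n" text node
tag_sibling = heading.find_next_sibling("p")  # skips text, returns the <p> tag

print(repr(raw_sibling))
print(tag_sibling)
```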

## HTML Attributes


If a tag has attributes, you can read them from the <code>Tag</code> object. For example, the <code>b</code> tag above has an attribute <code>id</code> whose value is <code>boldest</code>. You can access a tag's attributes by treating the tag like a dictionary:


In [14]:
tag_child['id']

'boldest'

You can access that dictionary directly as <code>attrs</code>:

In [15]:
tag_child.attrs

{'id': 'boldest'}

In [16]:
tag_child.get('id')

'boldest'
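
The practical difference between `tag['id']` and `tag.get('id')` shows up when the attribute is missing. A minimal sketch, assuming a standalone `<b>` tag parsed just for this example (the variable `bold` is not part of the lab):

```python
from bs4 import BeautifulSoup

bold = BeautifulSoup("<b id='boldest'>Lebron James</b>", "html.parser").b

print(bold.get("id"))            # attribute exists
print(bold.get("class"))         # missing attribute: returns None
print(bold.get("class", "n/a"))  # missing attribute with a fallback value

try:
    bold["class"]                # dictionary-style access on a missing key
except KeyError as err:
    print("KeyError:", err)
```

Using `get()` is usually the safer choice when scraping pages where an attribute may or may not be present.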

In [17]:
soup.get("id")
print(soup)

<!DOCTYPE html>
<html><head><title>Page Title</title></head><body><h3><b id="boldest">Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body></html>


## Navigable String


A string corresponds to a bit of text or content within a tag. Beautiful Soup uses the <code>NavigableString</code> class to contain this text. In our HTML we can obtain the name of the first player by extracting the string of the <code>Tag</code> object <code>tag_child</code> as follows:


In [18]:
tag_string = tag_child.string
tag_string

'Lebron James'
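
Note that `.string` only works when a tag has exactly one child. For a tag that mixes text and nested tags, `get_text()` is the usual alternative. The sketch below uses a made-up `<p>` element to show both behaviours.

```python
from bs4 import BeautifulSoup

mixed = BeautifulSoup("<p>Salary: <b>$73,200,000</b></p>", "html.parser").p

print(mixed.string)      # None: <p> has two children (a text node and <b>)
print(mixed.b.string)    # the single string inside <b>
print(mixed.get_text())  # all text inside <p>, concatenated
```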

## Filter

Filters allow you to find complex patterns; the simplest filter is a string. In this section we will pass a string to different filter methods, and Beautiful Soup will perform a match against that exact string. Consider the following HTML of rocket launches:



In [19]:
%%html
<table>
  <tr>
    <td id='flight' >Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
   </tr>
  <tr> 
    <td>1</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a> </td>
    <td>80 kg</td>
  </tr>
</table>

| Flight No | Launch site | Payload mass |
|---|---|---|
| 1 | Florida | 300 kg |
| 2 | Texas | 94 kg |
| 3 | Florida | 80 kg |


We can store it as a string in the variable <code>table</code>:


In [20]:
table="<table><tr><td id='flight' >Flight No</td><td>Launch site</td><td>Payload mass</td></tr><tr><td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td><td>300 kg</td></tr><tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a> </td><td>80 kg</td></tr></table>"

In [21]:
table_bs = BeautifulSoup(table, "html.parser")

## find All

The <code>find_all()</code> method looks through a tag’s descendants and retrieves all descendants that match your filters.

<p>
The method signature is <code>find_all(name, attrs, recursive, string, limit, **kwargs)</code>
</p>


### Name

When we set the <code>name</code> parameter to a tag name, the method will extract all the tags with that name, together with their children.


When you use <code>table_bs.find_all('tr')</code>, it returns a list of all <code>tr</code> elements, and you can access individual rows using an index, like <code>table_rows[0]</code>.

In [22]:
table_rows = table_bs.find_all('tr')
table_rows

[<tr><td id="flight">Flight No</td><td>Launch site</td><td>Payload mass</td></tr>,
 <tr><td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td><td>300 kg</td></tr>,
 <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>,
 <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a> </td><td>80 kg</td></tr>]

In [23]:
table1 = table_bs  # table1 refers to the whole BeautifulSoup object, which cannot be indexed like a list, so we index the result of find_all() instead
first_row = table1.find_all("tr")[0]
print(first_row)

<tr><td id="flight">Flight No</td><td>Launch site</td><td>Payload mass</td></tr>


In [24]:
first_row = table_rows[0]
first_row

<tr><td id="flight">Flight No</td><td>Launch site</td><td>Payload mass</td></tr>

The type is <code>Tag</code>:


In [25]:
print(type(first_row))

<class 'bs4.element.Tag'>


In [26]:
first_row.td

<td id="flight">Flight No</td>

If we iterate through the list, each element corresponds to a row in the table:


In [27]:
for i, row in enumerate(table_rows):
    print("row", i, "is", row)

row 0 is <tr><td id="flight">Flight No</td><td>Launch site</td><td>Payload mass</td></tr>
row 1 is <tr><td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td><td>300 kg</td></tr>
row 2 is <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>
row 3 is <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a> </td><td>80 kg</td></tr>


In [28]:
# Passing a list of tag names matches tags with any of those names

list_input= table_bs.find_all(name=["tr","td"])
list_input

[<tr><td id="flight">Flight No</td><td>Launch site</td><td>Payload mass</td></tr>,
 <td id="flight">Flight No</td>,
 <td>Launch site</td>,
 <td>Payload mass</td>,
 <tr><td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td><td>300 kg</td></tr>,
 <td>1</td>,
 <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>,
 <td>300 kg</td>,
 <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>,
 <td>2</td>,
 <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>,
 <td>94 kg</td>,
 <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a> </td><td>80 kg</td></tr>,
 <td>3</td>,
 <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a> </td>,
 <td>80 kg</td>]

## Attributes

If a keyword argument is not recognized, it is turned into a filter on the tag's attributes. For example, with the <code>id</code> argument, Beautiful Soup filters against each tag's <code>id</code> attribute. The first <code>td</code> element has an <code>id</code> value of <code>flight</code>, so we can filter on that value:


In [29]:
table_bs.find_all(id="flight")

[<td id="flight">Flight No</td>]

We can find all the elements that have links to the Florida Wikipedia page:


In [30]:
list_input=table_bs.find_all(href="https://en.wikipedia.org/wiki/Florida")
list_input

[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>,
 <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>]

If we set the <code>href</code> argument to <code>True</code>, the code finds all tags that have an <code>href</code> attribute, regardless of its value:

In [31]:
table_bs.find_all(href= True)

[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>,
 <a href="https://en.wikipedia.org/wiki/Texas">Texas</a>,
 <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>]
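
The filters above can be combined. The sketch below, written against a shortened and purely illustrative copy of the launch table (`launch_html` and `launch_bs` are names invented here), passes an attribute filter as a dictionary through `attrs`, caps the number of results with `limit`, and collects the link text together with each `href` value.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative version of the launch table.
launch_html = (
    "<table>"
    "<tr><td id='flight'>Flight No</td><td>Launch site</td></tr>"
    "<tr><td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td></tr>"
    "<tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td></tr>"
    "</table>"
)
launch_bs = BeautifulSoup(launch_html, "html.parser")

# Tag name plus an attribute filter passed as a dictionary.
flight_cells = launch_bs.find_all("td", attrs={"id": "flight"})

# limit stops the search after the given number of matches.
first_link_only = launch_bs.find_all("a", href=True, limit=1)

# Collect (text, href) pairs from every link.
links = [(a.string, a["href"]) for a in launch_bs.find_all("a", href=True)]

print(flight_cells)
print(first_link_only)
print(links)
```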

## find


The <code>find_all()</code> method scans the entire document looking for results. If you are looking for only one element, you can use the <code>find()</code> method, which returns the first matching element in the document. Consider the following two tables:


In [32]:
%%html
<h3>Rocket Launch </h3>

<p>
<table class='rocket'>
  <tr>
    <td>Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
  </tr>
  <tr>
    <td>1</td>
    <td>Florida</td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td>Texas</td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td>Florida </td>
    <td>80 kg</td>
  </tr>
</table>
</p>
<p>

<h3>Pizza Party  </h3>
  
    
<table class='pizza'>
  <tr>
    <td>Pizza Place</td>
    <td>Orders</td> 
    <td>Slices </td>
   </tr>
  <tr>
    <td>Domino's Pizza</td>
    <td>10</td>
    <td>100</td>
  </tr>
  <tr>
    <td>Little Caesars</td>
    <td>12</td>
    <td >144 </td>
  </tr>
  <tr>
    <td>Papa John's </td>
    <td>15 </td>
    <td>165</td>
  </tr>


| Flight No | Launch site | Payload mass |
|---|---|---|
| 1 | Florida | 300 kg |
| 2 | Texas | 94 kg |
| 3 | Florida | 80 kg |

| Pizza Place | Orders | Slices |
|---|---|---|
| Domino's Pizza | 10 | 100 |
| Little Caesars | 12 | 144 |
| Papa John's | 15 | 165 |


In [33]:
two_tables="<h3>Rocket Launch </h3><p><table class='rocket'><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table></p><p><h3>Pizza Party  </h3><table class='pizza'><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td >144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr>"

We create a <code>BeautifulSoup</code> object called <code>two_tables_bs</code>:


In [34]:
two_tables_bs = BeautifulSoup(two_tables, "html.parser")

We can find the first table using the tag name <code>table</code>:


In [35]:
two_tables_bs.find("table")

<table class="rocket"><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table>

We can filter on the class attribute to find the second table. Because <code>class</code> is a reserved keyword in Python, Beautiful Soup uses the argument name <code>class_</code>, with a trailing underscore:


In [36]:
two_tables_bs.find("table", class_="pizza")

<table class="pizza"><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td>144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr></table>
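
Once the right table has been found, its rows can be turned into plain Python lists. The following is a minimal sketch on a shortened, illustrative copy of the two tables (`menus_html`, `menus_bs`, and `pizza_rows` are names invented here). It also shows that the class filter can be written as an `attrs` dictionary instead of `class_`.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative version of the two-table HTML.
menus_html = (
    "<table class='rocket'><tr><td>Flight No</td></tr><tr><td>1</td></tr></table>"
    "<table class='pizza'>"
    "<tr><td>Pizza Place</td><td>Orders</td></tr>"
    "<tr><td>Little Caesars</td><td>12</td></tr>"
    "</table>"
)
menus_bs = BeautifulSoup(menus_html, "html.parser")

# Equivalent to class_="pizza", written as an attrs dictionary.
pizza_table = menus_bs.find("table", attrs={"class": "pizza"})

# Turn each row into a list of stripped cell texts.
pizza_rows = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in pizza_table.find_all("tr")
]
print(pizza_rows)
```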

<h2 id="DSCW">Downloading And Scraping The Contents Of A Web Page</h2> 


We download the contents of the web page:


In [37]:
url = "https://web.archive.org/web/20230224123642/https://www.ibm.com/us-en/"

We use <code>requests.get</code> to download the contents of the web page in text format and store it in a variable called <code>data</code>:


In [38]:
data = requests.get(url).text

In [39]:
soup = BeautifulSoup(data, "html.parser") #  create a soup object using the variable 'data'

Scrape all links


In [40]:
for link in soup.find_all("a", href= True):
    print(link.get('href'))

https://web.archive.org/web/20230224123642/https://www.ibm.com/reports/threat-intelligence/
https://web.archive.org/web/20230224123642/https://www.ibm.com/about
https://web.archive.org/web/20230224123642/https://www.ibm.com/consulting/?lnk=flathl
https://web.archive.org/web/20230224123642/https://www.ibm.com/consulting/strategy/?lnk=flathl
https://web.archive.org/web/20230224123642/https://www.ibm.com/consulting/ibmix?lnk=flathl
https://web.archive.org/web/20230224123642/https://www.ibm.com/consulting/technology/
https://web.archive.org/web/20230224123642/https://www.ibm.com/consulting/operations/?lnk=flathl
https://web.archive.org/web/20230224123642/https://www.ibm.com/strategic-partnerships
https://web.archive.org/web/20230224123642/https://www.ibm.com/employment/?lnk=flatitem
https://web.archive.org/web/20230224123642/https://www.ibm.com/impact
https://web.archive.org/web/20230224123642/https://research.ibm.com/
https://web.archive.org/web/20230224123642/https://www.ibm.com/
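
The links printed above happen to be absolute URLs. On many pages `href` values are relative, in which case they need to be resolved against the page address before they can be requested. A minimal sketch using `urllib.parse.urljoin` from the standard library; the base URL and the page fragment are hypothetical and stand in for a downloaded page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL and page fragment, used only for illustration.
base_url = "https://www.example.com/products/"
page = (
    "<a href='/about'>About</a>"
    "<a href='pricing.html'>Pricing</a>"
    "<a href='https://research.example.org/'>Research</a>"
)

page_soup = BeautifulSoup(page, "html.parser")
absolute_links = [urljoin(base_url, a["href"]) for a in page_soup.find_all("a", href=True)]

for link in absolute_links:
    print(link)
```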


## Scrape all image tags


In [41]:
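for link in soup.find_all("img"):  # in HTML, an image is represented by the <img> tag
    print(link)
    print(link.get('src'))  # the attribute name is 'src'; a misspelled name returns None

<img alt="Person standing with arms crossed" aria-describedby="bx--image-1" class="bx--image__img" src="https://web.archive.org/web/20230224123642im_/https://1.dam.s81c.com/p/0a23e414312bcb6f/08196d0e04260ae5_cropped.jpg.global.sr_16x9.jpg"/>
https://web.archive.org/web/20230224123642im_/https://1.dam.s81c.com/p/0a23e414312bcb6f/08196d0e04260ae5_cropped.jpg.global.sr_16x9.jpg
<img alt="Team members at work in a conference room" aria-describedby="bx--image-2" class="bx--image__img" src="https://web.archive.org/web/20230224123642im_/https://1.dam.s81c.com/p/06655c075aa3aa29/CaitOppermann_2019_12_06_IBMGarage_DSC3304.jpg.global.m_16x9.jpg"/>
https://web.archive.org/web/20230224123642im_/https://1.dam.s81c.com/p/06655c075aa3aa29/CaitOppermann_2019_12_06_IBMGarage_DSC3304.jpg.global.m_16x9.jpg
<img alt="Coworkers looking at laptops" aria-describedby="bx--image-3" class="bx--image__img" src="https://web.archive.org/web/20230224123642im_/https://1.dam.s81c.com/p/08f951353c2707b8/052022_CaitOppermann_InsideIBM_London_2945_03.jpg.global.sr_16x9.jpg"/>
https://web.archive.org/web/20230224123642im_/https://1.dam.s81c.com/p/08f951353c2707b8/052022_CaitOppermann_InsideIBM_London_2945_03.jpg.global.sr_16x9.jpg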
<img alt="Cloud developer with red sweater coding at desk" aria-describedby="bx--image-4" class="bx--image__img" src="https://web.archive.org/web/20230224123642im_/https://1.dam.s81c.com/p/064e0139f5a3aa5e/0500002_L

## Scrape data from html tables

In [42]:
#The below url contains an html table with data about colors and color codes.
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"

Before proceeding to scrape a website, you need to examine its contents and the way the data is organized. Open the above URL in your browser and check how many rows and columns there are in the color table.


The table has 33 rows and 5 columns.

In [43]:
# get the contents of the web page in text format and store them in the variable data
data = requests.get(url).text

In [44]:
soup = BeautifulSoup(data, "html.parser")

In [45]:
# finding the html table in web page
table = soup.find("table")  # in html table is represented by the tag <table>

In [46]:
# get all rows from the table
for row in table.find_all("tr"):  # in HTML, a table row is represented by the <tr> tag
    # get all columns in each row
    cols = row.find_all('td')  # in HTML, a table cell is represented by the <td> tag
    color_name = cols[2].string  # store the value in the 3rd column as color_name
    color_code = cols[3].string  # store the value in the 4th column as color_code
    print(f"{color_name} ---> {color_code}")
    
    

Color Name ---> None
lightsalmon ---> #FFA07A
salmon ---> #FA8072
darksalmon ---> #E9967A
lightcoral ---> #F08080
coral ---> #FF7F50
tomato ---> #FF6347
orangered ---> #FF4500
gold ---> #FFD700
orange ---> #FFA500
darkorange ---> #FF8C00
lightyellow ---> #FFFFE0
lemonchiffon ---> #FFFACD
papayawhip ---> #FFEFD5
moccasin ---> #FFE4B5
peachpuff ---> #FFDAB9
palegoldenrod ---> #EEE8AA
khaki ---> #F0E68C
darkkhaki ---> #BDB76B
yellow ---> #FFFF00
lawngreen ---> #7CFC00
chartreuse ---> #7FFF00
limegreen ---> #32CD32
lime ---> #00FF00
forestgreen ---> #228B22
green ---> #008000
powderblue ---> #B0E0E6
lightblue ---> #ADD8E6
lightskyblue ---> #87CEFA
skyblue ---> #87CEEB
deepskyblue ---> #00BFFF
lightsteelblue ---> #B0C4DE
dodgerblue ---> #1E90FF
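
The first output line, `Color Name ---> None`, comes from the header row. `.string` returns `None` when a cell does not contain exactly one child, which is presumably the case for that header cell. One way to handle this, sketched below, is to skip the header row and read the cells with `get_text(strip=True)`. The `colors_html` string is a hypothetical miniature of such a table; the markup of the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of a color table; the real page's markup may differ.
colors_html = (
    "<table>"
    "<tr><td>Number</td><td>Color</td><td>Color Name</td>"
    "<td><b>Hex Code</b><br>#RRGGBB</td></tr>"
    "<tr><td>1</td><td></td><td>lightsalmon</td><td>#FFA07A</td></tr>"
    "<tr><td>2</td><td></td><td>salmon</td><td>#FA8072</td></tr>"
    "</table>"
)
color_rows = BeautifulSoup(colors_html, "html.parser").find("table").find_all("tr")

colors = {}
for row in color_rows[1:]:        # skip the header row
    cells = row.find_all("td")
    if len(cells) < 4:            # ignore short or malformed rows
        continue
    name = cells[2].get_text(strip=True)
    code = cells[3].get_text(strip=True)
    colors[name] = code

print(colors)
```

Storing the pairs in a dictionary, rather than printing them, makes the scraped values easy to reuse later.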


## Scrape data from HTML tables into a DataFrame using BeautifulSoup and Pandas


In [47]:
import pandas as pd

In [48]:
#The below url contains html tables with data about world population.
url = "https://en.wikipedia.org/wiki/World_population"


In [49]:
data = requests.get(url).text

In [51]:
soup = BeautifulSoup(data, "html.parser")

In [54]:
#find all the html tables in the webpage
tables = soup.find_all("table") #in html table is represented by the tag<table>

In [55]:
len(tables)

31

Assume that we are looking for the `10 most densely populated countries` table. We can look through the tables list and find the right one based on the data in each table, or we can search for the table name if it appears in the table's markup, although this option might not always work.


In [56]:
for index, table in enumerate(tables):
    if ("10 most densely populated countries" in str(table)):
        table_index = index
print(table_index)

8


The loop above locates the table titled "10 most densely populated countries" within the list of tables. In more detail:

1.  **List of Tables:**  `tables` is the list of all HTML tables extracted from the web page with `find_all("table")`.
    
2.  **Searching for the Table:**  The code iterates over each table in the  `tables`  list using  `enumerate()`, which provides both the index and the table itself.
    
3.  **Checking Table Content:**  For each table, it converts the table to a string with  `str(table)`  and checks if the string "10 most densely populated countries" is present.
    
4.  **Finding the Right Table:**  If the string is found in a table, the index of that table is stored in  `table_index`.
    
5.  **Output:**  The  `print(table_index)`  statement will output the index of the table that contains the specified string.
    

If the code prints  `8`, the table you're looking for is the 9th table in the list (since indexing starts at 0). If you already know which table you need, you can access it directly by its index in the  `tables`  list.
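
The steps above can be wrapped in a small helper that also handles the case where no table matches, which the loop does not (there, `table_index` would simply stay undefined). This is only a sketch: `find_table_index` and the two-table `demo_html` page are made up for illustration, and unlike the loop above, which keeps the last match, the helper returns the first one.

```python
from bs4 import BeautifulSoup


def find_table_index(tables, phrase):
    """Return the index of the first table whose markup contains phrase, or None."""
    for index, table in enumerate(tables):
        if phrase in str(table):
            return index
    return None


# Hypothetical two-table page used only to exercise the helper.
demo_html = (
    "<table><caption>Largest cities</caption><tr><td>Tokyo</td></tr></table>"
    "<table><caption>10 most densely populated countries</caption>"
    "<tr><td>Singapore</td></tr></table>"
)
demo_tables = BeautifulSoup(demo_html, "html.parser").find_all("table")

print(find_table_index(demo_tables, "10 most densely populated countries"))
print(find_table_index(demo_tables, "No such table"))
```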

See if you can locate the table name, `10 most densely populated countries`, in the output below.

In [None]:
print(tables[table_index].prettify())

The following code extracts data from that HTML table and stores it in a `Pandas DataFrame` called <code>population_data</code>.

In [64]:
rows_list = []


#iterates over each table row (<tr>) in the body (<tbody>) of the table located at table_index in the tables list.
for row in tables[table_index].tbody.find_all("tr"):
    col = row.find_all("td") # finds all the table data cells (<td>) in the current row.
    if col: 
        rank = col[0].text
        country = col[1].text
        population = col[2].text.strip()
        area = col[3].text.strip()
        density = col[4].text.strip()
       # Append the row as a dictionary to the list
        rows_list.append({"Rank": rank, "Country": country, "Population": population, "Area": area, "Density": density})

#Creating the dataframe from the list of rows
population_data = pd.DataFrame(rows_list)

#Displaying the dataframe
population_data


#Assigning data to variables:
#rank, country, population, area, and density are extracted from the respective columns;
#extra whitespace is removed with .strip() for population, area, and density only.

#Each row is appended to rows_list as a dictionary.

#pd.DataFrame(rows_list) then builds the DataFrame in one step, with a default integer index.

|   | Rank | Country | Population | Area | Density |
|---|---|---|---|---|---|
| 0 | 1 | Singapore | 5921231 | 719 | 8235 |
| 1 | 2 | Bangladesh | 165650475 | 148460 | 1116 |
| 2 | 3 | \n Palestine[note 3][100]\n\n | 5223000 | 6025 | 867 |
| 3 | 4 | Taiwan[note 4] | 23580712 | 35980 | 655 |
| 4 | 5 | South Korea | 51844834 | 99720 | 520 |
| 5 | 6 | Lebanon | 5296814 | 10400 | 509 |
| 6 | 7 | Rwanda | 13173730 | 26338 | 500 |
| 7 | 8 | Burundi | 12696478 | 27830 | 456 |
| 8 | 9 | Israel | 9402617 | 21937 | 429 |
| 9 | 10 | India | 1389637446 | 3287263 | 423 |
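
Some country names in the output still carry surrounding whitespace and footnote markers such as `[note 3]`, and the numbers are stored as text. One possible clean-up is sketched below. It works on two rows copied from the output above rather than on the full table, and the names `raw` and `cleaned` are made up for this example.

```python
import pandas as pd

# Two rows copied from the output above, kept as raw strings.
raw = pd.DataFrame(
    [
        {"Rank": "3", "Country": "\n Palestine[note 3][100]\n\n", "Population": "5223000"},
        {"Rank": "4", "Country": "Taiwan[note 4]", "Population": "23580712"},
    ]
)

cleaned = raw.copy()

# Drop bracketed footnote markers such as "[note 3]" or "[100]", then trim whitespace.
cleaned["Country"] = (
    cleaned["Country"].str.replace(r"\[[^\]]*\]", "", regex=True).str.strip()
)

# Convert numeric text to numbers so the columns can be sorted and aggregated.
cleaned["Rank"] = pd.to_numeric(cleaned["Rank"])
cleaned["Population"] = pd.to_numeric(cleaned["Population"])

print(cleaned)
```

The same transformations could be applied to `population_data` directly; if the scraped numbers contain thousands separators, those would need to be removed before calling `pd.to_numeric`.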


In [61]:
# Example of using pd.concat to combine multiple DataFrames
df1 = pd.DataFrame([{"A": 1, "B": 2}])
df2 = pd.DataFrame([{"A": 3, "B": 4}])

# Concatenate the DataFrames
combined_df = pd.concat([df1, df2], ignore_index=True)

# Display the combined DataFrame
print(combined_df)

   A  B
0  1  2
1  3  4
