# **Web Scraping Lab**


Estimated time needed: **30** minutes


## Objectives


After completing this lab, you will be able to:


* Download a webpage using the requests module
* Scrape all links from a webpage
* Scrape all image URLs from a webpage
* Scrape data from HTML tables


## Scrape www.ibm.com


Import the required modules and functions


In [2]:
from bs4 import BeautifulSoup  # this module helps with web scraping
import requests  # this module helps us download a webpage

Download the contents of the webpage


In [3]:
url = "http://www.ibm.com"

In [4]:
# Get the contents of the webpage as text and store it in a variable called data
data = requests.get(url).text
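The one-line download above assumes the request succeeds. A minimal sketch of a more defensive version follows; the helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the original lab.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors.

    `fetch_html` and the 10-second default are illustrative choices,
    not part of the original lab.
    """
    # Without a timeout, requests.get() can wait indefinitely on a stalled server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

With this helper, `data = fetch_html(url)` could stand in for the plain `requests.get(url).text` call, and an error page would raise an exception instead of being parsed silently.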

Create a soup object using the class BeautifulSoup


In [5]:
soup = BeautifulSoup(data,"html.parser")  # create a soup object using the variable 'data'

Scrape all links


In [6]:
for link in soup.find_all('a'):  # in HTML, an anchor (link) is represented by the <a> tag
    print(link.get('href'))

https://www.ibm.com/products/guardium-ai-security?lnk=hpls1us
https://newsroom.ibm.com/2025-06-18-ibm-introduces-industry-first-software-to-unify-agentic-governance-and-security?lnk=hpls2us
https://www.ibm.com/granite?lnk=hpdev1us
https://ibmtechxchange.bemyapp.com/#/event?lnk=hpdev2us
https://skillsbuild.org/?lnk=hpdev3us
https://www.ibm.com/new/announcements/agentic-ai-governance-evaluation-and-lifecycle?lnk=hpdev4us
https://www.ibm.com/new/announcements/ibm-named-a-leader-in-the-2025-gartner-magic-quadrant-for-data-science-and-machine-learning-platforms?lnk=hpdev5us
https://www.ibm.com/new/announcements/ibm-leader-2025-omdia-universe-on-no-low-pro-ide-assistants-report?lnk=hpdev6us
https://www.ibm.com/products/watsonx-code-assistant/pricing?lnk=hpdev7us
https://www.ibm.com/products/watsonx-ai?lnk=hpdev8us
https://www.ibm.com/products/spss-statistics/whats-new?lnk=hppr1us
https://www.ibm.com/artificial-intelligence?lnk=hpfp1us
https://www.ibm.com/solutions/hybrid-cloud
https://www.ib
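The values returned by `link.get('href')` are raw attribute values: they can be `None` (an `<a>` tag with no `href`), page-internal fragments, or relative paths. The sketch below shows one possible way to turn them into absolute URLs with the standard library. The helper name `absolute_links` and the sample values are hypothetical and only illustrate the idea.

```python
from urllib.parse import urljoin


def absolute_links(base_url, hrefs):
    """Resolve raw href values against base_url, skipping unusable ones."""
    links = []
    for href in hrefs:
        # link.get('href') returns None when an <a> tag has no href attribute
        if not href or href.startswith(("#", "javascript:", "mailto:")):
            continue
        # urljoin leaves absolute URLs unchanged and resolves relative ones
        links.append(urljoin(base_url, href))
    return links


# Hypothetical href values of the kind soup.find_all('a') can yield
sample_hrefs = [None, "#top", "/granite", "https://skillsbuild.org/", "products/watsonx-ai"]
links = absolute_links("https://www.ibm.com/", sample_hrefs)
print(links)
```

In the lab, the same helper could be called as `absolute_links(url, [a.get('href') for a in soup.find_all('a')])`.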

Scrape all images


In [7]:
for img in soup.find_all('img'):  # in HTML, an image is represented by the <img> tag
    print(img.get('src'))
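Not every `<img>` tag carries a usable `src` attribute. Some pages load images lazily and keep the real URL in another attribute; `data-src` is a common convention but not a standard, so treat it as an assumption here. The sketch below runs on a small hypothetical snippet rather than on the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one normal image, one lazy-loaded, one with no source
sample_html = """
<img src="/logo.png" alt="logo">
<img data-src="/hero.jpg" alt="hero">
<img alt="spacer">
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
image_urls = []
for img in sample_soup.find_all("img"):
    # Prefer src; fall back to data-src, which some lazy-loading scripts use
    source = img.get("src") or img.get("data-src")
    if source:  # skip tags that carry neither attribute
        image_urls.append(source)

print(image_urls)
```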

## Scrape data from HTML tables


In [8]:
# The URL below contains an HTML table with data about colors and color codes.
URL = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"


Before scraping a website, examine its contents and the way its data is organized. Open the above URL in your browser and check how many rows and columns the color table has.


In [9]:
# get the contents of the webpage in text format and store in a variable called data
data  = requests.get(URL).text

In [10]:
soup = BeautifulSoup(data,"html.parser")

In [11]:
#find a html table in the web page
table = soup.find('table') # in html table is represented by the tag <table>
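The row and column counts you checked in the browser can also be confirmed in code. The sketch below uses a small hypothetical table instead of the real page. The header labels `Number` and `Color` are invented for illustration; only the color names and hex codes come from the lab data.

```python
from bs4 import BeautifulSoup

# Hypothetical three-row table standing in for the color table
sample_html = """
<table>
  <tr><td>Number</td><td>Color</td><td>Color Name</td><td>Hex Code</td></tr>
  <tr><td>1</td><td></td><td>lightsalmon</td><td>#FFA07A</td></tr>
  <tr><td>2</td><td></td><td>salmon</td><td>#FA8072</td></tr>
</table>
"""

sample_table = BeautifulSoup(sample_html, "html.parser").find("table")
rows = sample_table.find_all("tr")
n_rows = len(rows)
# Count cells in the first row; <th> is included in case a header uses it
n_cols = len(rows[0].find_all(["td", "th"]))
print(n_rows, n_cols)
```

Note that `find('table')` returns `None` when the page has no table, so checking the result before calling `find_all` on it avoids an `AttributeError`.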

Get all rows from the table


In [19]:
for row in table.find_all('tr'):  # in HTML, a table row is represented by the <tr> tag
    # Get all cells in the row.
    cols = row.find_all('td')  # in HTML, a table cell is represented by the <td> tag
    color_name = cols[2].getText()  # the third cell holds the color name
    color_code = cols[3].getText()  # the fourth cell holds the hex color code
    print("{}--->{}".format(color_name, color_code))

Color Name--->Hex Code#RRGGBB
lightsalmon--->#FFA07A
salmon--->#FA8072
darksalmon--->#E9967A
lightcoral--->#F08080
coral--->#FF7F50
tomato--->#FF6347
orangered--->#FF4500
gold--->#FFD700
orange--->#FFA500
darkorange--->#FF8C00
lightyellow--->#FFFFE0
lemonchiffon--->#FFFACD
papayawhip--->#FFEFD5
moccasin--->#FFE4B5
peachpuff--->#FFDAB9
palegoldenrod--->#EEE8AA
khaki--->#F0E68C
darkkhaki--->#BDB76B
yellow--->#FFFF00
lawngreen--->#7CFC00
chartreuse--->#7FFF00
limegreen--->#32CD32
lime--->#00FF00
forestgreen--->#228B22
green--->#008000
powderblue--->#B0E0E6
lightblue--->#ADD8E6
lightskyblue--->#87CEFA
skyblue--->#87CEEB
deepskyblue--->#00BFFF
lightsteelblue--->#B0C4DE
dodgerblue--->#1E90FF
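The loop above indexes `cols[2]` and `cols[3]` directly, which works because every row of this table has at least four `<td>` cells. On other pages, a header built from `<th>` tags or a row with merged cells would raise an `IndexError`. A minimal sketch of a guarded loop, run on a hypothetical table (the header labels and the footnote row are invented for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical table with a <th> header, padded text, and a merged footer row
sample_html = """
<table>
  <tr><th>Number</th><th>Color</th><th>Color Name</th><th>Hex Code</th></tr>
  <tr><td>1</td><td></td><td>lightsalmon</td><td>#FFA07A</td></tr>
  <tr><td>2</td><td></td><td> salmon </td><td>#FA8072</td></tr>
  <tr><td colspan="4">footnote</td></tr>
</table>
"""
sample_table = BeautifulSoup(sample_html, "html.parser").find("table")

pairs = []
for row in sample_table.find_all("tr"):
    cols = row.find_all("td")
    if len(cols) < 4:
        # Header rows built from <th>, or merged cells, have too few <td> cells
        continue
    # strip=True removes surrounding whitespace from the cell text
    name = cols[2].get_text(strip=True)
    code = cols[3].get_text(strip=True)
    pairs.append((name, code))

print(pairs)
```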


In [13]:
color_dict = {}
for row in table.find_all('tr'):  # in HTML, a table row is represented by the <tr> tag
    # Get all cells in the row.
    cols = row.find_all('td')  # in HTML, a table cell is represented by the <td> tag
    color_name = cols[2].getText()  # the third cell holds the color name
    color_code = cols[3].getText()  # the fourth cell holds the hex color code
    color_dict[color_name] = color_code  # map each color name to its hex code

print(color_dict)

{'Color Name': 'Hex Code#RRGGBB', 'lightsalmon': '#FFA07A', 'salmon': '#FA8072', 'darksalmon': '#E9967A', 'lightcoral': '#F08080', 'coral': '#FF7F50', 'tomato': '#FF6347', 'orangered': '#FF4500', 'gold': '#FFD700', 'orange': '#FFA500', 'darkorange': '#FF8C00', 'lightyellow': '#FFFFE0', 'lemonchiffon': '#FFFACD', 'papayawhip': '#FFEFD5', 'moccasin': '#FFE4B5', 'peachpuff': '#FFDAB9', 'palegoldenrod': '#EEE8AA', 'khaki': '#F0E68C', 'darkkhaki': '#BDB76B', 'yellow': '#FFFF00', 'lawngreen': '#7CFC00', 'chartreuse': '#7FFF00', 'limegreen': '#32CD32', 'lime': '#00FF00', 'forestgreen': '#228B22', 'green': '#008000', 'powderblue': '#B0E0E6', 'lightblue': '#ADD8E6', 'lightskyblue': '#87CEFA', 'skyblue': '#87CEEB', 'deepskyblue': '#00BFFF', 'lightsteelblue': '#B0C4DE', 'dodgerblue': '#1E90FF'}
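The dictionary above still contains the table's header row (`'Color Name': 'Hex Code#RRGGBB'`) alongside the real data. One possible sanity check, sketched here on a three-entry excerpt of that output, is to keep only the values that look like hex color codes. The pattern and the names `HEX_COLOR`, `scraped`, and `clean` are illustrative assumptions.

```python
import re

# '#' followed by exactly six hexadecimal digits
HEX_COLOR = re.compile(r"#[0-9A-Fa-f]{6}")

# Excerpt of the scraped dictionary, including the header entry
scraped = {
    "Color Name": "Hex Code#RRGGBB",
    "lightsalmon": "#FFA07A",
    "salmon": "#FA8072",
}

# fullmatch() requires the whole string to match, so the header value is rejected
clean = {name: code for name, code in scraped.items() if HEX_COLOR.fullmatch(code)}
print(clean)
```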


In [14]:
%pip install pandas


Collecting pandas
  Downloading pandas-2.3.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (91 kB)
Collecting numpy>=1.26.0 (from pandas)
  Downloading numpy-2.3.1-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (62 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2025.2-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pandas-2.3.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.0 MB)
Downloading numpy-2.3.1-cp312-cp312-manylinux_2_28_x86_64.whl (16.6 MB)
Downloading tzdata-2025.2-py2.py3-none-any.whl (347 kB)
Installing collected packages: tzdata, numpy, pandas
Successfully installed numpy-2.3.1 pandas-2.3.0 tzdata-2025.2
Note: you may need to restart the kernel to use updated packages.

In [16]:
import pandas as pd

# Remove the header entry that was scraped from the table's first row
color_dict.pop('Color Name', None)

# Convert to DataFrame
df = pd.DataFrame(list(color_dict.items()), columns=['Color Name', 'Hex Code'])

print(df.head())

    Color Name Hex Code
0  lightsalmon  #FFA07A
1       salmon  #FA8072
2   darksalmon  #E9967A
3   lightcoral  #F08080
4        coral  #FF7F50


In [17]:
last_item = color_dict.popitem()

In [18]:
last_item

('dodgerblue', '#1E90FF')
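The cells above rely on two different dictionary methods. As a small illustration on a hypothetical three-entry dictionary: `pop(key, default)` removes one named key, while `popitem()` removes and returns the most recently inserted pair (dictionaries preserve insertion order in Python 3.7 and later).

```python
# Hypothetical miniature of color_dict, header entry included
colors = {"Color Name": "Hex Code", "salmon": "#FA8072", "coral": "#FF7F50"}

# pop(key, default) removes one named key; the default avoids a KeyError
removed = colors.pop("Color Name", None)
missing = colors.pop("no-such-color", None)

# popitem() removes and returns the most recently inserted pair
last = colors.popitem()

print(removed, missing, last, colors)
```

Both calls modify the dictionary in place, which is why `color_dict` no longer contains `dodgerblue` after the `popitem()` cell runs.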

In [None]:
print(list(color_dict.items()))

## Authors


Ramesh Sannareddy


### Other Contributors


Rav Ahuja


<!--
## Change Log
|  Date (YYYY-MM-DD) |  Version | Changed By  |  Change Description |
|---|---|---|---|
| 2024-10-29  | 0.2  | Madhusudhan Moole |  Updated lab |
| 2020-10-17  | 0.1  | Ramesh Sannareddy  |  Created initial version of the lab |
-->


Copyright © IBM Corporation. All rights reserved.
