# Web Scraping Review and Data Extraction

This notebook reviews web scraping with `requests` and BeautifulSoup: it downloads a page, lists the links and images it contains, and extracts rows from an HTML table so the data can be structured for analysis.

## Workflow
- Retrieve webpage content
- Parse HTML structure
- Extract relevant information
- Prepare a structured dataset for analysis

In [1]:
from bs4 import BeautifulSoup  # parses HTML for web scraping
import requests  # downloads the web page

In [2]:
url = "http://www.ibm.com"

In [3]:
# Download the page and store its contents as text in a variable called data
data = requests.get(url).text
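
The one-line download above assumes the request succeeds. A slightly more defensive variant is sketched below: it sets a timeout and raises on HTTP errors rather than passing the body of an error page on to the parser. The helper name `fetch_html` and the 10-second default are assumptions made for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its body as text.

    `fetch_html` and the 10-second default are illustrative choices.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # returning the body of an error page.
    response.raise_for_status()
    return response.text
```

With this helper, the cell above could be written as `data = fetch_html(url)`.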

In [4]:
soup = BeautifulSoup(data, "html.parser")  # create a soup object from the variable 'data'

In [5]:
for link in soup.find_all('a'):  # in HTML, a link (anchor) is represented by the <a> tag
    print(link.get('href'))

https://www.ibm.com/products/flashsystem?lnk=hpls1us
https://www.ibm.com/thought-leadership/institute-business-value/en-us/report/ai-infrastructure?lnk=hpls2us
https://www.parsintl.com/eprints/135920.pdf?lnk=hprc1us
https://www.ibm.com/solutions/storage-data-resilience?lnk=hprc2us
https://www.hashicorp.com/en/do-cloud-right-explained?lnk=hprc3us
https://skillsbuild.org/artificial-intelligence/?utm_source=ibm.com&utm_campaign=ibm.com&utm_medium=organic&lnk=hprc4us
https://www.ibm.com/case-studies/scuderia-ferrari?lnk=hpcs1us
https://www.ibm.com/case-studies/avid-solutions-international?lnk=hpcs2us
https://www.ibm.com/case-studies/pfizer?lnk=hpcs3us
https://www.ibm.com/case-studies/us-open?lnk=hpcs4us
https://ibm.webcasts.com/starthere.jsp?ei=1749693&tp_key=83a9212ff7&sti=hp&lnk=hppr1us
https://www.ibm.com/software?lnk=hpfp1us
https://www.ibm.com/software/ai-productivity?lnk=hpfp2us
https://www.ibm.com/software/data-management?lnk=hpfp3us
https://www.ibm.com/software/security-governance?
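
Printing `link.get('href')` works for a quick look, but an anchor without an `href` attribute would print as `None`, and relative links would appear without a host. The sketch below collects only anchors that have an `href` and resolves each one against the page address. It runs on a small made-up HTML sample so it works offline; the function name `extract_links` and the `example.com` addresses are hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute URLs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips anchors that have no href attribute at all.
    for anchor in soup.find_all("a", href=True):
        # urljoin leaves absolute URLs unchanged and resolves relative
        # ones against the page address.
        links.append(urljoin(base_url, anchor["href"]))
    return links


# Made-up sample markup, used only so the sketch runs without a network call.
sample = """
<a href="/products">Products</a>
<a href="https://example.com/about">About</a>
<a name="top">No href here</a>
"""
print(extract_links(sample, "https://www.example.com"))
```

Returning a list instead of printing keeps the links available for later filtering or storage.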

In [6]:
for img in soup.find_all('img'):  # in HTML, an image is represented by the <img> tag
    print(img.get('src'))

https://assets.ibm.com/is/image/ibm/ibm_inspired_flex_3d_06_12k?ts=1770714682329&dpr=off
https://assets.ibm.com/is/image/ibm/hybridcloud-b2-v2-morningbrew-1000x750?ts=1770714683505&dpr=off
https://assets.ibm.com/is/image/ibm/cognos_sevone-overview_1x1?ts=1770714699615&dpr=off
https://assets.ibm.com/is/image/ibm/student-promo-banner?ts=1770714706585&dpr=off
https://assets.ibm.com/is/image/ibm/ferrari-shield-transparent?ts=1770714710861&dpr=off
https://assets.ibm.com/is/content/ibm/avidsolutions?ts=1770714714005&dpr=off
https://assets.ibm.com/is/image/ibm/pfizer_logo-1?ts=1770714717218&dpr=off
https://assets.ibm.com/is/image/ibm/us-open-logo-2-1?ts=1770714720276&dpr=off
https://assets.ibm.com/is/image/ibm/f678_still_f06_v10_zyn_4096x4096?ts=1770714710547&dpr=off
https://assets.ibm.com/is/image/ibm/avid-solutions-leadspace?ts=1770714713717&dpr=off
https://assets.ibm.com/is/image/ibm/pfizer_thumb?ts=1770714716902&dpr=off
https://assets.ibm.com/is/image/ibm/ashe-day-1-dustin-satloff_usta-25

In [7]:
# The URL below points to a page containing an HTML table of colors and their color codes.
URL = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"

In [8]:
# Download the page and store its contents as text in a variable called data
data = requests.get(URL).text

In [9]:
soup = BeautifulSoup(data, "html.parser")

In [10]:
# Find the first HTML table in the web page
table = soup.find('table')  # in HTML, a table is represented by the <table> tag
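
Note that `soup.find` returns `None` when nothing matches, so a later `table.find_all(...)` would fail with an `AttributeError` if the page had no table. A minimal guard is sketched below; the helper name `find_first_table` and the error message are assumptions for illustration.

```python
from bs4 import BeautifulSoup


def find_first_table(html):
    """Return the first <table> element, or raise if the page has none."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        # Fail early with a clear message instead of an AttributeError later.
        raise ValueError("no <table> element found in the page")
    return table
```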

In [11]:
for row in table.find_all('tr'):  # in HTML, a table row is represented by the <tr> tag
    cols = row.find_all('td')  # each cell in the row is represented by the <td> tag
    if len(cols) < 4:
        continue  # skip rows that do not have at least four cells
    color_name = cols[2].get_text()  # the third cell holds the color name
    color_code = cols[3].get_text()  # the fourth cell holds the color code
    print("{}--->{}".format(color_name, color_code))

Color Name--->Hex Code#RRGGBB
lightsalmon--->#FFA07A
salmon--->#FA8072
darksalmon--->#E9967A
lightcoral--->#F08080
coral--->#FF7F50
tomato--->#FF6347
orangered--->#FF4500
gold--->#FFD700
orange--->#FFA500
darkorange--->#FF8C00
lightyellow--->#FFFFE0
lemonchiffon--->#FFFACD
papayawhip--->#FFEFD5
moccasin--->#FFE4B5
peachpuff--->#FFDAB9
palegoldenrod--->#EEE8AA
khaki--->#F0E68C
darkkhaki--->#BDB76B
yellow--->#FFFF00
lawngreen--->#7CFC00
chartreuse--->#7FFF00
limegreen--->#32CD32
lime--->#00FF00
forestgreen--->#228B22
green--->#008000
powderblue--->#B0E0E6
lightblue--->#ADD8E6
lightskyblue--->#87CEFA
skyblue--->#87CEEB
deepskyblue--->#00BFFF
lightsteelblue--->#B0C4DE
dodgerblue--->#1E90FF
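
The loop above only prints each pair. To reach the last workflow step, a structured dataset, the rows can instead be collected into records and serialized. The following is a minimal sketch on a made-up two-row table. It assumes the same layout as the page above (header in the first row, color name in the third cell, color code in the fourth). The function names, the sample markup, and the first two column headers in it are hypothetical.

```python
import csv
import io

from bs4 import BeautifulSoup


def table_to_records(html):
    """Collect color name/code pairs from the first table in the markup."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    records = []
    # [1:] skips the header row, assumed to be the first <tr>.
    for row in table.find_all("tr")[1:]:
        cols = row.find_all("td")
        if len(cols) < 4:
            continue  # ignore rows that lack the expected cells
        records.append({
            "color_name": cols[2].get_text(strip=True),
            "color_code": cols[3].get_text(strip=True),
        })
    return records


def records_to_csv(records):
    """Serialize the records as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["color_name", "color_code"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up sample table so the sketch runs without a network call.
sample = """
<table>
  <tr><td>Number</td><td>Color</td><td>Color Name</td><td>Hex Code</td></tr>
  <tr><td>1</td><td></td><td>lightsalmon</td><td>#FFA07A</td></tr>
  <tr><td>2</td><td></td><td>salmon</td><td>#FA8072</td></tr>
</table>
"""
records = table_to_records(sample)
print(records_to_csv(records))
```

A list of dictionaries like `records` can also be passed to other tools for analysis; writing to an in-memory buffer here simply avoids creating a file.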
