# 🕸️ Web Scraping Project – IBM Data Analyst Capstone

This project demonstrates how to:
- Download webpages
- Scrape hyperlinks and images
- Extract tabular data

---


# **Case Study: Web Scraping Lab**

# Objectives
After completing this lab we will be able to:

<ul>
<li>Download a webpage using the requests module.</li>
<li>Scrape all links from a webpage.</li>
<li>Scrape all image URLs from a webpage.</li>
<li>Scrape data from HTML tables.</li>
</ul>

## Scrape www.ibm.com


Import the required modules and functions


In [None]:
from bs4 import BeautifulSoup  # this module helps with web scraping
import requests  # this module helps us download a webpage

Download the contents of the webpage


In [None]:
url = "http://www.ibm.com"

In [None]:
# download the webpage and keep the response object so its status can be checked
response = requests.get(url)
print(response.status_code)  # Confirm successful download (200 OK)

# store the contents of the webpage in text format in a variable called data
data = response.text
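
The download step above can be made a little more defensive. The following is a minimal sketch, not part of the original lab: `fetch_html` is a hypothetical helper name, the 10-second timeout is an arbitrary assumption, and the `get` parameter exists only so the function can be exercised without network access.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Download url and return its HTML text, raising on HTTP errors.

    fetch_html is a hypothetical helper; the timeout value is an assumption.
    """
    response = get(url, timeout=timeout)  # give up if the server does not answer in time
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text
```

Calling `fetch_html(url)` would then stand in for `requests.get(url).text`, with the difference that a failed download stops the notebook instead of silently passing an error page to the parser.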

Create a soup object using the class BeautifulSoup


In [None]:
soup = BeautifulSoup(data,"html.parser")  # create a soup object using the variable 'data'

print(soup.prettify()[:500])  # Preview first 500 characters of parsed HTML

Scrape all links


In [None]:
links = soup.find_all('a')  # in HTML, an anchor/link is represented by the <a> tag
for link in links:
    print(link.get('href'))

print(f"Number of links found: {len(links)}")
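
The `href` values printed above are often relative paths, page fragments, or missing altogether. As a rough sketch of one way to clean them up, the standard-library `urljoin` can resolve each value against the page URL. The helper name `absolute_links` and the sample values are assumptions for illustration, not output from the real page.

```python
from urllib.parse import urljoin, urlparse


def absolute_links(base_url, hrefs):
    """Resolve href values against base_url, keeping only http(s) links."""
    resolved = []
    for href in hrefs:
        if not href or href.startswith("#"):
            continue  # skip missing hrefs and same-page fragment links
        absolute = urljoin(base_url, href)
        if urlparse(absolute).scheme in ("http", "https"):
            resolved.append(absolute)
    return resolved


# Hypothetical href values of the kind link.get('href') can return
sample_hrefs = ["/products", None, "#top", "https://example.com/a",
                "mailto:someone@example.com"]
print(absolute_links("http://www.ibm.com", sample_hrefs))
```

The same function could be applied to the `src` values of images, since they follow the same relative-URL rules.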

Scrape all images


In [None]:
images = soup.find_all('img')  # in HTML, an image is represented by the <img> tag
for image in images:
    print(image.get('src'))

print(f"Number of images found: {len(images)}")

## Scrape data from html tables


In [None]:
# The URL below contains an HTML table with data about colors and color codes.
URL = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"

Before scraping a website, examine its contents and how the data is organized. Open the above URL in your browser and check how many rows and columns the color table has.
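
The same check can be done in code once a page has been parsed. This is a minimal sketch run against a small inline table. `SAMPLE_HTML` is a hypothetical stand-in (its header labels are assumed, not copied from the real page) and `table_shape` is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the color table; not the real page content.
SAMPLE_HTML = """
<table>
  <tr><td>Number</td><td>Color</td><td>Color Name</td><td>Hex Code</td></tr>
  <tr><td>1</td><td></td><td>lightsalmon</td><td>#FFA07A</td></tr>
  <tr><td>2</td><td></td><td>salmon</td><td>#FA8072</td></tr>
</table>
"""


def table_shape(html):
    """Return (row_count, widest_row_cell_count) for the first table in html."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return (0, 0)
    rows = table.find_all("tr")
    # Count both <td> and <th> cells so header rows are measured as well
    widths = [len(row.find_all(["td", "th"])) for row in rows]
    return (len(rows), max(widths, default=0))


print(table_shape(SAMPLE_HTML))
```

Knowing the widest row up front makes it easier to choose safe column indexes before extracting values.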


In [None]:
# download the webpage, check the status, and store the text in a variable called data
response = requests.get(URL)
print(response.status_code)  # Confirm successful download (200 OK)
data = response.text

In [None]:
soup = BeautifulSoup(data,"html.parser")

print(soup.prettify()[:500])  # Preview first 500 characters of parsed HTML

In [None]:
# find the first HTML table on the web page
table = soup.find('table')  # in HTML, a table is represented by the <table> tag

# Get all rows from the table


In [None]:
rows = table.find_all('tr')  # in HTML, a table row is represented by the <tr> tag
for row in rows:
    # Get all cells in each row.
    cols = row.find_all('td')  # in HTML, a table cell is represented by the <td> tag
    if len(cols) < 4:
        continue  # skip rows that lack a fourth cell, such as a header built from <th>
    color_name = cols[2].getText()  # store the value in column 3 as color_name
    color_code = cols[3].getText()  # store the value in column 4 as color_code
    print("{}--->{}".format(color_name, color_code))

print(f"Number of rows found: {len(rows)}")

In [None]:
import pandas as pd

# Define the extracted color data as a dictionary
color_data = {
    "Color Name": [
        "lightsalmon", "salmon", "darksalmon", "lightcoral", "coral", "tomato",
        "orangered", "gold", "orange", "darkorange", "lightyellow", "lemonchiffon",
        "papayawhip", "moccasin", "peachpuff", "palegoldenrod", "khaki", "darkkhaki",
        "yellow", "lawngreen", "chartreuse", "limegreen", "lime", "forestgreen",
        "green", "powderblue", "lightblue", "lightskyblue", "skyblue", "deepskyblue",
        "lightsteelblue", "dodgerblue"
    ],
    "Hex Code": [
        "#FFA07A", "#FA8072", "#E9967A", "#F08080", "#FF7F50", "#FF6347",
        "#FF4500", "#FFD700", "#FFA500", "#FF8C00", "#FFFFE0", "#FFFACD",
        "#FFEFD5", "#FFE4B5", "#FFDAB9", "#EEE8AA", "#F0E68C", "#BDB76B",
        "#FFFF00", "#7CFC00", "#7FFF00", "#32CD32", "#00FF00", "#228B22",
        "#008000", "#B0E0E6", "#ADD8E6", "#87CEFA", "#87CEEB", "#00BFFF",
        "#B0C4DE", "#1E90FF"
    ]
}

# Create a DataFrame
df_colors = pd.DataFrame(color_data)

# Display the table
df_colors
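
The dictionary above is typed in by hand. As an alternative sketch, the DataFrame could be built directly from the parsed table rows. The code below assumes the first row is a header and that the color name and hex code sit in the third and fourth cells, as in the loop earlier. `SAMPLE_HTML` and `table_to_dataframe` are hypothetical names used only for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the color table; not the real page content.
SAMPLE_HTML = """
<table>
  <tr><td>Number</td><td>Color</td><td>Color Name</td><td>Hex Code</td></tr>
  <tr><td>1</td><td></td><td>lightsalmon</td><td>#FFA07A</td></tr>
  <tr><td>2</td><td></td><td>salmon</td><td>#FA8072</td></tr>
</table>
"""


def table_to_dataframe(html, name_index=2, code_index=3):
    """Build a DataFrame of color names and hex codes from the first table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        raise ValueError("no <table> element found")
    records = []
    for row in table.find_all("tr")[1:]:  # assumption: the first row is a header
        cols = row.find_all("td")
        if len(cols) <= max(name_index, code_index):
            continue  # skip rows that are too short to hold both values
        records.append({
            "Color Name": cols[name_index].get_text(strip=True),
            "Hex Code": cols[code_index].get_text(strip=True),
        })
    return pd.DataFrame(records, columns=["Color Name", "Hex Code"])


print(table_to_dataframe(SAMPLE_HTML))
```

Passing the downloaded `data` string instead of `SAMPLE_HTML` would apply the same extraction to the real page, provided its layout matches these assumptions.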


In [None]:
import os

# Resolve the notebook filename against the current working directory
# (abspath does not check that the file exists)
file_path = os.path.abspath("Web-Scraping-Review-Lab.ipynb")
print("The notebook is located at:", file_path)
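
Because `os.path.abspath` only joins the name onto the working directory, the printed path may point at a file that is not there. A small sketch using `pathlib` reports both the resolved path and whether the file exists. `locate` is a hypothetical helper name.

```python
from pathlib import Path


def locate(filename):
    """Return (absolute_path, exists) for filename relative to the working directory."""
    path = Path(filename).resolve()
    return path, path.exists()


path, found = locate("Web-Scraping-Review-Lab.ipynb")
print(f"{path} (exists: {found})")
```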

In [None]:
import nbconvert
import nbformat
import pdfkit

# File paths (raw strings, so backslashes are not treated as escape characters)
input_file_path = r"C:\Users\Ede\Desktop\IBM_Coursera_Data_Analyst_Projects\CapStoneProjects\module1\Web-Scraping-Review-Lab.ipynb"
output_pdf_path = r"C:\Users\Ede\Desktop\IBM_Coursera_Data_Analyst_Projects\CapStoneProjects\module1\Web-Scraping-Review-Lab.pdf"

# Load the Jupyter Notebook file
with open(input_file_path, 'r', encoding='utf-8') as f:
    notebook_content = nbformat.read(f, as_version=4)

# Convert the notebook to HTML
html_exporter = nbconvert.HTMLExporter()
html_exporter.exclude_input = False  # Include code cells in the output
(body, resources) = html_exporter.from_notebook_node(notebook_content)

# Convert HTML to PDF (pdfkit needs the wkhtmltopdf executable installed)
pdfkit.from_string(body, output_pdf_path)

# Report the PDF file path
print(f"Notebook successfully converted to PDF: {output_pdf_path}")


In [None]:
!jupyter nbconvert --to html "Web-Scraping-Review-Lab.ipynb"

# Congratulations to us for having successfully completed the above lab!
# Authors: 
<h4>Kelechukwu Innocent Ede and Ramesh Sannareddy</h4>

# Other Contributors:
<ul>
<li>Rav Ahuja</li>
</ul>