Threat Hunting — Web Scraping

Web scraping is the process of automatically extracting data from websites. This typically involves sending HTTP requests to a website’s server, parsing the HTML or XML content of the response, and then extracting the relevant data using specific patterns or rules. This data is then stored in a structured format such as a database or spreadsheet where it can be analyzed or used. 
To perform web scraping in Python, we will use Beautiful Soup, a library that provides a simple and intuitive way to parse HTML and XML documents. It is widely used for scraping because it handles common parsing tasks with minimal code.
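
As a minimal sketch of the parse-and-extract steps described above, the snippet below parses a small hard-coded HTML string instead of a live page, so it runs offline. The HTML, the variable names, and the example.com links are invented for illustration and do not come from any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the body of an HTTP response.
SAMPLE_HTML = """
<html>
  <head><title>Example Advisory Page</title></head>
  <body>
    <a href="https://example.com/advisory-1">Advisory 1</a>
    <a href="https://example.com/advisory-2">Advisory 2</a>
  </body>
</html>
"""

# Parse the markup with Python's built-in HTML parser.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

page_title = soup.title.string  # text inside the <title> tag

# Collect (link text, link target) pairs from every anchor tag.
links = [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a")]

print(page_title)
for text, href in links:
    print(f"{text}: {href}")
```

For a real page, the string would typically come from the content of a `requests.get()` response rather than a constant.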

To demonstrate web scraping, we are going to track high-severity vulnerabilities (those with a CVSS base score of 7.0–10.0), a common task in threat intelligence. These vulnerabilities can often lead to remote code execution (RCE) and give an attacker initial access to a system.
The US-based Cybersecurity and Infrastructure Security Agency (CISA) publishes a weekly summary of vulnerabilities as a bulletin on its website. Each bulletin divides vulnerabilities into four tables based on severity: High Vulnerabilities, Medium Vulnerabilities, Low Vulnerabilities, and Severity Not Yet Assigned. We will focus on the High Vulnerabilities table, which is the first table on the page. To capture it, we first send a request to download the page and then parse the returned HTML with Beautiful Soup, using Python's built-in html.parser.
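
Selecting "the first table on the page" can be sketched as follows. This is an offline illustration on a made-up, heavily simplified page with two tables; the column names and row values are hypothetical and much smaller than a real bulletin.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two tables; only the first one is wanted.
SAMPLE_HTML = """
<table>
  <thead><tr><th>Name</th><th>CVSS Score</th></tr></thead>
  <tbody>
    <tr><td> ExampleApp </td><td>9.8</td></tr>
    <tr><td> DemoServer </td><td>7.5</td></tr>
  </tbody>
</table>
<table>
  <tbody>
    <tr><td> OtherTool </td><td>5.3</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns only the first match, i.e. the first table in document order.
first_table = soup.find("table")
rows = first_table.find("tbody").find_all("tr")

high_vulns = []
for row in rows:
    # get_text(strip=True) returns the cell text without surrounding whitespace.
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    high_vulns.append((cells[0], float(cells[1])))

print(high_vulns)
```

Relying on table position is simple but fragile: if the page layout changes, the first table may no longer be the one of interest, so it is worth checking the result against the page by eye.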

In [None]:
# This script extracts the high vulnerabilities listed in a CISA security bulletin (https://www.cisa.gov/news-events/bulletins).
# Copy the bulletin URL into the CISA_URL variable and run the script to get the list of high vulnerabilities for that week.

import csv

import requests
from bs4 import BeautifulSoup

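# Download the bulletin page and parse it
CISA_URL = "https://www.cisa.gov/news-events/bulletins/sb25-209"        # Change the URL to the bulletin you want to investigate
page = requests.get(CISA_URL, timeout=30)       # the timeout keeps the script from hanging on a stalled connection
page.raise_for_status()                         # stop on an HTTP error instead of parsing an error page
soup = BeautifulSoup(page.content, "html.parser")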

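# build the output filename from the page title, e.g. "Vulnerability Summary for the Week of July 21, 2025 | CISA"
Page_Title = soup.title.string
week_and_site = Page_Title.split("of", 1)[1]        # text after the first "of": " July 21, 2025 | CISA"
week = week_and_site.split("|")[0].strip()          # drop the site name: "July 21, 2025"
Filename = "CISA Vulnerabilities - " + week + ".csv"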

# capture high vulnerabilities table
table = soup.find("table")
table_body = table.find("tbody")
rows = table_body.find_all("tr")        # find all the table rows using the tr (table row) tag (<tr>).

vulns = []      # create vulnerabilities list

# loop through the rows and collect the cells of each one, stored in HTML td (table data) tags (<td>)
for row in rows:
    cols = row.find_all("td")

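    # the first column reads "Primary Vendor -- Product", so the vendor comes before the "--" separator
    vendor, product = cols[0].text.split("--", 1)
    # get each cell's text property and use the Python string method .strip() to remove surrounding whitespace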
    description = cols[1].text.strip()
    published = cols[2].text.strip()
    cvss = cols[3].text.strip()
    cve = cols[4].find("a").text.strip()        # find the anchor tag (<a>) and extract the text property
    reference = cols[4].find("a").get("href")   # extract the link at this anchor tag (stored in its href attribute)

    # create a Python dictionary object vuln to store the data
    vuln = {
        "product": product.strip(),
        "vendor": vendor.strip(),
        "description": description,
        "published": published,
        "cvss": cvss,
        "cve": cve,
        "reference": reference
    }
    vulns.append(vuln)

# define csv file header row
header_row = ["Product", "Vendor", "Description", "Published", "CVSS", "CVE", "Reference"]

# write the parsed data to a CSV file, which can be shared with the relevant teams for patch planning
with open(Filename, "w", encoding='UTF8', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(header_row)

    for vuln in vulns:
        data_row = [vuln['product'], vuln['vendor'], vuln['description'], vuln['published'], vuln['cvss'], vuln['cve'], vuln['reference']]
        writer.writerow(data_row)

print(f"<{Page_Title}>\n=> CISA vulnerability summary page parsed and file <{Filename}> created.")

<Vulnerability Summary for the Week of July 21, 2025 | CISA>
=> CISA vulnerability summary page parsed and file <CISA Vulnerabilities - July 21, 2025.csv> created.
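
Because each vulnerability is already stored as a dictionary, the CSV step could also be written with `csv.DictWriter`, which maps dictionary keys to columns directly. The sketch below is one possible variant, not part of the script above: the sample rows, the placeholder CVE identifiers, and the `to_csv` helper are hypothetical, and it assumes every `cvss` value parses as a number. It writes to an in-memory buffer so it can run without touching the file system.

```python
import csv
import io

# Hypothetical rows shaped like the dictionaries in the vulns list.
sample_vulns = [
    {"product": "ExampleApp", "vendor": "ExampleCorp", "description": "Sample issue A",
     "published": "2025-07-21", "cvss": "7.5", "cve": "CVE-0000-0001",
     "reference": "https://example.com/CVE-0000-0001"},
    {"product": "DemoServer", "vendor": "DemoSoft", "description": "Sample issue B",
     "published": "2025-07-22", "cvss": "9.8", "cve": "CVE-0000-0002",
     "reference": "https://example.com/CVE-0000-0002"},
    {"product": "TestTool", "vendor": "TestWorks", "description": "Sample issue C",
     "published": "2025-07-23", "cvss": "8.1", "cve": "CVE-0000-0003",
     "reference": "https://example.com/CVE-0000-0003"},
]

FIELDS = ["product", "vendor", "description", "published", "cvss", "cve", "reference"]


def to_csv(vulns):
    """Return the rows as CSV text, highest CVSS score first."""
    # Assumption: every cvss value is a numeric string.
    ordered = sorted(vulns, key=lambda v: float(v["cvss"]), reverse=True)
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()        # header row is taken from FIELDS
    writer.writerows(ordered)   # one CSV row per dictionary
    return buffer.getvalue()


csv_text = to_csv(sample_vulns)
print(csv_text)
```

Sorting by score puts the most severe entries at the top of the file, which may help whoever plans the patching work; to write to disk instead, the same writer can be pointed at a file opened with `newline=""`.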
