# Automated Web Scraping and Knowledge Base Generation

<div style="display:flex; align-items:center; padding: 50px;">
<p style="margin-right:10px;">
    <img height="200px" style="width:auto;" width="200px" src="https://avatars.githubusercontent.com/u/192148546?s=400&u=95d76fbb02e6c09671d87c9155f17ca1e4ef8f21&v=4"> 
</p>
</div>

- Automates the process of scraping `blog` and `knowledge base` links.

- Fetches and formats the `content`, compiling it into a `markdown` based knowledge base.

- Manages `environment setup`, installs required `dependencies`, and configures necessary `environment variables`.

- Fetches content from multiple URLs, including `YouTube transcripts`.

- Converts content into `markdown` and stores it in a file.

- Implements `caching` to optimize performance.

- Includes `error handling` to manage issues during the scraping process.

- Outlines future steps for optimization and integration with `AI systems`.

## Step 1: Environment Setup and Requirements Installation Script

This script sets up the environment for the notebook by installing the necessary dependencies and checking for required environment variables. It installs the dependencies from a `requirements.txt` file, retrying up to three times if installation fails, and ensures that essential environment variables are set. If any variable is missing, the script exits and prompts the user to set it. Once the setup is complete, a success message is displayed.

In [None]:
# Boilerplate: This block goes into every notebook.
# It sets up the environment, installs the requirements, and checks for the required environment variables.

from IPython.display import clear_output
from dotenv import load_dotenv
import os

requirements_installed = False
max_retries = 3
retries = 0
REQUIRED_ENV_VARS = []


def install_requirements():
    """Installs the requirements from requirements.txt file"""
    # `retries` is reassigned below, so it must be declared global as well.
    global requirements_installed, retries
    if requirements_installed:
        print("Requirements already installed.")
        return

    print("Installing requirements...")
    install_status = os.system("pip install -r requirements.txt")
    if install_status == 0:
        print("Requirements installed successfully.")
        requirements_installed = True
    else:
        print("Failed to install requirements.")
        if retries < max_retries:
            print("Retrying...")
            retries += 1
            return install_requirements()
        exit(1)
    return


def setup_env():
    """Sets up the environment variables"""

    def check_env(env_var):
        value = os.getenv(env_var)
        if value is None:
            print(f"Please set the {env_var} environment variable.")
            exit(1)
        else:
            print(f"{env_var} is set.")

    load_dotenv(override=True)

    variables_to_check = REQUIRED_ENV_VARS

    for var in variables_to_check:
        check_env(var)


install_requirements()
clear_output()
setup_env()
print("🚀 Setup complete. Continue to the next cell.")
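
The retry behaviour described above can also be written as a loop instead of a recursive call. The following is a minimal sketch, not the notebook's own implementation: `run_with_retries` is a hypothetical helper, and it assumes the command is passed as an argument list to `subprocess.run`.

```python
import subprocess
import sys


def run_with_retries(command, max_retries=3):
    """Run `command` until it exits with status 0 or the retry budget is spent.

    Returns the number of attempts that were needed.
    """
    attempts = 0
    while True:
        attempts += 1
        status = subprocess.run(command).returncode
        if status == 0:
            return attempts
        # One initial attempt plus `max_retries` retries are allowed.
        if attempts > max_retries:
            raise RuntimeError(f"{command!r} failed after {attempts} attempts")


# Hypothetical usage: install into the interpreter that runs the notebook.
# run_with_retries([sys.executable, "-m", "pip", "install", "-r", "requirements.txt"])
```

Calling pip through `sys.executable` is a design choice that helps the packages land in the same environment as the running kernel.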

## Step 2: Web Scraping and Content Extraction with Caching and YouTube Transcript Support

This script scrapes web pages to extract links and content, converting the HTML into Markdown format. It caches the fetched links and content so that the same URL is not requested twice. It also supports YouTube videos, retrieving the transcript with `YouTubeTranscriptApi` and formatting it as plain text. Each function catches exceptions and returns the error message alongside its result, and multiple URLs can be processed in a batch.

### 1. Library Imports and Initial Setup
The code starts by importing necessary libraries:

- `BeautifulSoup (from bs4)`: A library used to parse and extract data from HTML documents.

- `Requests`: Used to send HTTP requests to retrieve web pages.

- `Typing (List)`: A type hint indicating that a function argument is a list of items.

- `Markdownify`: Converts HTML content to Markdown format.

- `YouTube Transcript API`: Used to fetch and format YouTube video transcripts.

- `Traceback`: A module used to print the full stack trace of a caught exception for debugging.


In [2]:
from bs4 import BeautifulSoup
import requests
from typing import List
import traceback
from markdownify import markdownify as md
from youtube_transcript_api.formatters import TextFormatter
from youtube_transcript_api import YouTubeTranscriptApi

### 2. Global Variables and Caching
The script uses global variables for caching and configuration:

- `cache`: A dictionary that stores previously fetched data (links and page content) to avoid redundant web requests.

- `formatter`: An instance of TextFormatter from youtube_transcript_api, used to format YouTube video transcripts.

In [None]:
formatter = TextFormatter()

cache = {}

### 3. Get Base URL
This function extracts the base URL (scheme and domain) from a provided URL to resolve relative links.

For example:

For `https://example.com/path/to/page`, it returns `https://example.com`.

In [None]:
def get_base_url(url):
    return "/".join(url.split("/")[:3])
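
As a quick check of the behaviour described above, the sketch below repeats `get_base_url` and compares it with a parser-based variant. `get_base_url_parsed` is a hypothetical alternative, not part of the notebook; it assumes the standard library's `urllib.parse.urlsplit` is acceptable here.

```python
from urllib.parse import urlsplit


def get_base_url(url):
    # Same approach as the notebook: keep "scheme:", "" and the host segment.
    return "/".join(url.split("/")[:3])


def get_base_url_parsed(url):
    # Hypothetical alternative built on the standard library URL parser.
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


print(get_base_url("https://example.com/path/to/page"))
print(get_base_url_parsed("https://example.com/path/to/page"))
```

The two agree on ordinary URLs. The parser-based variant also copes with a URL such as `https://example.com?next=/home`, where a query string follows the host directly and splitting on `/` would keep part of the query.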


### 4. Get Links From Page
This function fetches all links (a tags) from a given webpage:

- `Cache Checking`: If links from the page have already been cached, they are returned immediately to optimize performance.

- `Sending Request`: A requests.get call fetches the content of the URL.

- `Parsing HTML`: BeautifulSoup is used to parse the HTML and extract all hyperlinks (href attributes).

- `Handling Relative Links`: If the link is relative, it's resolved to an absolute URL by appending it to the base URL of the page.

- `Caching`: The links are cached to avoid redundant requests the next time they are needed.

In [None]:
def get_links_from_page(url):
    global cache
    try:
        cached_item = cache.get(url)
        cached_urls = cached_item.get("urls") if cached_item else None
        if cached_urls:
            print(f"Returning cached links for {url}")
            return cached_urls, None
        base_url = get_base_url(url)
        response = requests.get(url)
        soup = BeautifulSoup(response.text, "html.parser")
        links = []
        for link in soup.find_all("a"):
            cur_link = link.get("href")
            if not cur_link:
                continue
            if not cur_link.startswith("http"):
                links.append(f"{base_url}{cur_link}")
            else:
                links.append(cur_link)
        if not cached_item:
            cache[url] = {
                "urls": links,
                "content": None,
                "url": url,
            }
        else:
            cached_item["urls"] = links
            cache[url] = cached_item
        return links, None
    except Exception as e:
        print(f"Failed to get links from {url}. Error: {e}")
        traceback.print_exc()
        return [], str(e)
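
Appending a relative link to the base URL works when the link starts with `/`, but not for links such as `post-1` or `../kb`, and non-web links such as `mailto:` addresses are treated as relative paths. A minimal sketch of a more general approach, assuming the standard library's `urllib.parse.urljoin` is acceptable; `resolve_link` and `is_web_link` are hypothetical helpers, not part of the notebook:

```python
from urllib.parse import urljoin, urlsplit


def resolve_link(page_url, href):
    """Resolve an href found on `page_url` into an absolute URL."""
    return urljoin(page_url, href)


def is_web_link(url):
    """Return True only for http(s) links, skipping mailto:, tel:, and similar."""
    return urlsplit(url).scheme in ("http", "https")


page = "https://example.com/blogs/index.html"
for href in ["/about", "post-1", "../kb", "https://other.org/x", "mailto:a@b.c"]:
    absolute = resolve_link(page, href)
    print(href, "->", absolute, "(kept)" if is_web_link(absolute) else "(skipped)")
```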


### 5. Get Page Content
This function retrieves the content of a given webpage and converts it to Markdown format:

- `Cache Checking`: It first checks if the content for the page is already cached. If it is, it returns the cached content.

- `YouTube Video Handling`: If the URL corresponds to a YouTube video, the function extracts the video ID, fetches the transcript using the YouTubeTranscriptApi, and formats the transcript using the TextFormatter.

- `HTML Parsing`: For other URLs, it parses the HTML using BeautifulSoup and converts the HTML to Markdown format with the markdownify library.

- `Caching`: The content is cached for future requests.

In [None]:
def get_page_content(url):
    try:
        cached_item = cache.get(url)
        cached_content = cached_item.get("content") if cached_item else None
        if cached_content:
            print(f"Returning cached content for {url}")
            return cached_content, None
        if "youtube" in url or "youtu.be" in url:
            video_id = url.split("=")[-1]
            transcript = YouTubeTranscriptApi.get_transcript(video_id)
            text_transcript = formatter.format_transcript(transcript)
            return md(text_transcript), None
        response = requests.get(url)
        soup = BeautifulSoup(response.text, "html.parser")
        result = md(str(soup))
        if not cached_item:
            cache[url] = {
                "urls": None,
                "content": result,
                "url": url,
            }
        else:
            cached_item["content"] = result
            cache[url] = cached_item
        return result, None
    except Exception as e:
        print(f"Failed to get content from {url}. Error: {e}")
        traceback.print_exc()
        return "", str(e)
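
Taking the text after the last `=` as the video ID works for plain `watch?v=...` links, but not for short `youtu.be` links or for URLs that carry extra query parameters. The sketch below shows one way to extract the ID more defensively. `extract_video_id` is a hypothetical helper, and the IDs in the examples are placeholders.

```python
from urllib.parse import parse_qs, urlsplit


def extract_video_id(url):
    """Return the video ID from common YouTube URL shapes, or None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        return parts.path.lstrip("/").split("/")[0] or None
    if host == "youtube.com" or host.endswith(".youtube.com"):
        # Watch links carry the ID in the "v" query parameter.
        values = parse_qs(parts.query).get("v")
        return values[0] if values else None
    return None


print(extract_video_id("https://www.youtube.com/watch?v=abc123&t=42s"))
print(extract_video_id("https://youtu.be/abc123?t=42"))
```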


### 6. Get Page Content Batch
This function processes multiple URLs in a batch:

- It checks whether each URL's content is cached. If so, the cached content is used; otherwise, the content is fetched and processed.

- Each result is appended to a list, which is returned after all URLs in the batch have been processed.

In [None]:
def get_page_content_batch(urls: List[str]):
    results = []
    for url in urls:
        cached_item = cache.get(url)
        cached_content = cached_item.get("content") if cached_item else None
        if cached_content:
            print(f"Returning cached content for {url}")
            results.append({"url": url, "content": cached_content, "error": None})
            continue
        print(f"Getting content from {url}")
        content, error = get_page_content(url)
        results.append({"url": url, "content": content, "error": error})
        if not cached_item:
            cache[url] = {
                "urls": None,
                "content": content,
                "url": url,
            }
        else:
            cached_item["content"] = content
            cache[url] = cached_item
        result_bytes_count = len(content.encode("utf-8"))
        print(f"Content from {url} fetched. Size: {result_bytes_count} bytes!")
    return results
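
The functions in this step all store entries of the same shape in `cache`: a dictionary with `urls`, `content`, and `url` keys. As a sketch of how that bookkeeping could be shared, the hypothetical helpers below (`update_cache` and `get_cached` are not part of the notebook) create an entry on first use and then fill in one field at a time.

```python
def update_cache(cache, url, field, value):
    """Store one field ("urls" or "content") for `url`, creating the entry if needed."""
    entry = cache.setdefault(url, {"urls": None, "content": None, "url": url})
    entry[field] = value
    return entry


def get_cached(cache, url, field):
    """Return the cached field for `url`, or None when nothing is stored."""
    entry = cache.get(url)
    return entry.get(field) if entry else None


demo_cache = {}
update_cache(demo_cache, "https://example.com", "urls", ["https://example.com/a"])
update_cache(demo_cache, "https://example.com", "content", "# Title")
print(demo_cache)
```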


## Step 3: Fetching, Cleaning, and Storing Knowledge Base Content

### 1. Fetching AE Blog Links

- The function get_ae_blog_links() fetches the links of blogs from the base URL https://arpitbhayani.me/blogs.

- It calls get_links_from_page() to fetch all the links from the page.

- If successful, it returns the list of blog links; otherwise, it returns an error.

In [None]:
from datetime import datetime

knowledge_base_cache = {}


def get_ae_blog_links():
    blogs_base_url = "https://arpitbhayani.me/blogs"
    blog_links = []
    links, error = get_links_from_page(blogs_base_url)
    if error:
        return blog_links, error
    blog_links.extend(links)
    return blog_links, None

### 2. Whitespace Cleaning

- The function clean_whitespace() is designed to clean up unnecessary whitespaces in the text.

- It performs the following tasks:
    - Removes leading and trailing whitespaces.

    - Replaces multiple whitespaces with a single whitespace.

    - Replaces multiple newlines with a single newline.

    - Removes leading or trailing newlines.

In [None]:
import re


def clean_whitespace(text: str) -> str:
    """
    Cleans the whitespace according to the following rules:
    - Removes leading and trailing whitespaces.
    - Replaces multiple whitespaces with a single whitespace.
    - Replaces multiple newlines with a single newline.
    - Removes any leading or trailing newlines.
    """
    if not text:
        return ""
    # Collapse runs of spaces and tabs, leaving newlines in place.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces next to newlines, then collapse runs of newlines.
    text = re.sub(r" ?\n ?", "\n", text)
    text = re.sub(r"\n+", "\n", text)
    return text.strip()


### 3. Fetching Knowledge Base

- The function fetch_knowledge_base() retrieves links from the knowledge-base page (https://arpitbhayani.me/knowledge-base).

- It keeps only links that contain `knowledge-base` and skips Google Drive links.

- For each valid link, it retrieves blog links and collects them in a list.

- Finally, it calls get_page_content_batch() to fetch content from the collected links.


In [3]:
def fetch_knowledge_base():
    knowledge_base_link = "https://arpitbhayani.me/knowledge-base"
    links_to_fetch = []
    sub_page_links, error = get_links_from_page(knowledge_base_link)
    if error:
        # Return a plain list so callers always receive the same type.
        return []
    for link in sub_page_links:
        is_knowledge_base_link = "knowledge-base" in link
        if not is_knowledge_base_link:
            continue
        is_google_drive_link = "drive.google.com" in link
        if is_google_drive_link:
            continue
        result_links = []
        try:
            blog_links, error = get_links_from_page(link)
            if error:
                print(f"Failed to get links from {link}. Error: {error}")
                continue
            result_links.extend(blog_links)
        except Exception as e:
            print(f"Failed to get links from {link}. Error: {e}")
            traceback.print_exc()
        links_to_fetch.extend(result_links)
    result = get_page_content_batch(links_to_fetch)
    return result

### 4. Building Knowledge Base and Saving to File

- The build_knowledge_base() function assembles the complete knowledge base:

    - It generates a timestamped output file name.

    - It checks if the file exists, and if not, creates it and writes a header.

    - It collects blog links and content by calling get_ae_blog_links() and fetch_knowledge_base().

    - The collected content is written into the output file in Markdown format.

    - If an error occurs, it logs the error and returns a message indicating the failure.    

In [None]:
def build_knowledge_base(output_file="ae_knowledge_base.md"):
    try:
        now = datetime.now()
        now_human_formatted = now.strftime("%d_%m_%Y_%H_%M_%S")
        output_file_name, extension = os.path.splitext(output_file)
        output_file = f"{output_file_name}_{now_human_formatted}{extension}"
        file_exists = os.path.exists(output_file)
        if not file_exists:
            with open(output_file, "w", encoding="utf-8") as f:
                f.write("# AE Knowledge Base\n\n")
        blog_links, error = get_ae_blog_links()
        if error:
            return [], error
        # Combine the blog pages with the knowledge-base pages.
        all_content = []
        all_content.extend(get_page_content_batch(blog_links))
        all_content.extend(fetch_knowledge_base())
        # Append so that the header written above is not overwritten.
        with open(output_file, "a", encoding="utf-8") as f:
            for c in all_content:
                # c['content'] = clean_whitespace(c['content'])
                f.write(f"# {c['url']}\n")
                f.write(c["content"])
                f.write("\n\n")
        return all_content, None
    except Exception as e:
        print(f"Failed to build knowledge base. Error: {e}")
        traceback.print_exc()
        return [], str(e)
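
The timestamped file name is built by inserting the formatted time between the file's base name and its extension. A minimal sketch of that step in isolation, where `timestamped_name` is a hypothetical helper and the date is an arbitrary example:

```python
import os
from datetime import datetime


def timestamped_name(output_file, now=None):
    """Insert a day_month_year_hour_minute_second stamp before the extension."""
    now = now or datetime.now()
    stamp = now.strftime("%d_%m_%Y_%H_%M_%S")
    name, extension = os.path.splitext(output_file)
    return f"{name}_{stamp}{extension}"


print(timestamped_name("ae_knowledge_base.md", datetime(2025, 1, 20, 11, 43, 46)))
# → ae_knowledge_base_20_01_2025_11_43_46.md
```

Because the stamp has one-second resolution, each run normally writes to a new file rather than overwriting an earlier one.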

## Step 4: Build the Knowledge Base and Handle Errors

- ### Fetch AE Blog Links:

    - get_ae_blog_links() fetches the links for AE blogs.

    - Print the number of links found.

- ### Fetch Knowledge Base Content:

    - fetch_knowledge_base() fetches the content from the knowledge base.
    
    - Print the number of content sections fetched.

- ### Build the Knowledge Base:

    - build_knowledge_base() builds the knowledge base, writing the data into a markdown file whose name is derived from `ae_knowledge_base_v0_0_2.md` with a timestamp inserted before the extension.

    - Print success or error messages based on the operation.

In [None]:
ae_links, error = get_ae_blog_links()

print(f"Found {len(ae_links)} links")

In [None]:
knowledge_base_content = fetch_knowledge_base()

print(f"Fetched {len(knowledge_base_content)} links.")

In [None]:
output_file = "ae_knowledge_base_v0_0_2.md"
contents, error = build_knowledge_base(output_file=output_file)


if error:
    print(f"Failed to build knowledge base. Error: {error}")
else:
    print(f"Knowledge base built successfully. Output file: {output_file}")
    print(f"Content: {len(contents)} sections. ")

### TODO: `Things to do till this is complete`.
- ##### Load and clean the knowledge base content.
- ##### Load the knowledge base content into a vector store.
- ##### Implement a simple RAG using Phi4 and the vector store.
- ##### Implement any optimizations on the RAG to improve response quality.

---

# Thank You for visiting The Hackers Playbook! 🌐

If you liked this research material:

- [Subscribe to our newsletter.](https://thehackersplaybook.substack.com)

- [Follow us on LinkedIn.](https://www.linkedin.com/company/the-hackers-playbook/)

- [Leave a star on our GitHub.](https://www.github.com/thehackersplaybook)

