# Knowledge Base Builder

<div style="display:flex; align-items:center; padding: 50px;">
<p style="margin-right:10px;">
    <img height="200px" style="width:auto;" width="200px" src="https://avatars.githubusercontent.com/u/192148546?s=400&u=95d76fbb02e6c09671d87c9155f17ca1e4ef8f21&v=4"> 
</p>
</div>

## Description:

The Knowledge Base Builder scrapes information from online resources, such as blogs and knowledge base pages, and organizes it into a single Markdown file.

### Features:

- **Web Scraping**:  
  Uses BeautifulSoup to parse web pages and collect their links and content.

- **Markdown Conversion**:  
  Converts scraped content into markdown format, making it easily readable and portable.

- **YouTube Integration**:  
  Integrates YouTube video transcripts to provide a comprehensive knowledge collection.

- **Environment Setup**:  
  Ensures all necessary environment variables are set and automatically installs required dependencies for smooth operation.

- **Data Storage**:  
  Once the data is fetched, it is saved in a markdown file, creating a structured knowledge base.

### Use Cases:
- **Knowledge Aggregation**:  
  Collect information from various sources into one centralized repository.

- **Personal/Team Knowledge Repositories**:  
  Build a knowledge base for personal or team use, making information accessible and organized.

- **Customizable Domains**:  
  The app can be customized to gather data from specific knowledge domains, ensuring it meets your needs.

### Summary:
This application provides a streamlined process for building and organizing knowledge bases, automating data collection, formatting, and storage into markdown files for easy access and management.


## Step 1: Installing dependencies

This cell sets up the notebook. It installs the Python packages listed in `requirements.txt` with `pip`, retrying up to `max_retries` times if the installation fails, and then clears the installation output to keep the notebook clean. Next, it loads environment variables from a `.env` file and checks that every name in `REQUIRED_ENV_VARS` (API keys, for example) is set. If a variable is missing, the cell prints a message asking you to set it and exits. `REQUIRED_ENV_VARS` is empty in this notebook, so no variables are actually checked.

In [None]:
# Boilerplate: This block goes into every notebook.
# It sets up the environment, installs the requirements, and checks for the required environment variables.

from IPython.display import clear_output
from dotenv import load_dotenv
import os

requirements_installed = False
max_retries = 3
retries = 0
REQUIRED_ENV_VARS = []


def install_requirements():
    """Installs the requirements from requirements.txt file"""
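    # `retries` is reassigned below, so it must be declared global as well;
    # otherwise Python treats it as a local and raises UnboundLocalError.
    global requirements_installed, retries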
    if requirements_installed:
        print("Requirements already installed.")
        return

    print("Installing requirements...")
    install_status = os.system("pip install -r requirements.txt")
    if install_status == 0:
        print("Requirements installed successfully.")
        requirements_installed = True
    else:
        print("Failed to install requirements.")
        if retries < max_retries:
            print("Retrying...")
            retries += 1
            return install_requirements()
        exit(1)
    return


def setup_env():
    """Sets up the environment variables"""

    def check_env(env_var):
        value = os.getenv(env_var)
        if value is None:
            print(f"Please set the {env_var} environment variable.")
            exit(1)
        else:
            print(f"{env_var} is set.")

    load_dotenv(override=True)

    variables_to_check = REQUIRED_ENV_VARS

    for var in variables_to_check:
        check_env(var)


install_requirements()
clear_output()
setup_env()
print("🚀 Setup complete. Continue to the next cell.")
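The environment check in the cell above stops at the first missing variable. As a minimal sketch (not part of the original notebook), the variant below reports every missing variable in one pass. The function name `find_missing_env_vars` and the variable names in the usage lines are hypothetical, and, unlike the original check, this sketch also treats an empty value as missing.

```python
import os
from typing import Iterable, List, Mapping, Optional


def find_missing_env_vars(
    required: Iterable[str], environ: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names in `required` that are unset or empty.

    `environ` defaults to os.environ; passing a plain dict makes the
    function easy to try without touching the real environment.
    """
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


# Hypothetical variable names, used only for illustration.
example_env = {"EXAMPLE_API_KEY": "abc123", "EXAMPLE_EMPTY": ""}
missing = find_missing_env_vars(
    ["EXAMPLE_API_KEY", "EXAMPLE_EMPTY", "EXAMPLE_ABSENT"], example_env
)
print(missing)  # ['EXAMPLE_EMPTY', 'EXAMPLE_ABSENT']
```

Returning a list instead of exiting leaves the decision (print, raise, or stop the notebook) to the caller.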

## Step 2: Web Scraping and Content Extraction

This block of code defines functions to scrape and extract web content efficiently. It fetches links from a specified URL, caches them for reuse, and processes them in batches to avoid redundant web requests. For individual URLs, the code checks if the content is already cached, and if not, it retrieves the page content, converts it to Markdown, and stores it in a cache. Additionally, it handles YouTube video URLs by fetching and formatting video transcripts into readable text. These steps ensure optimized data extraction and reduce unnecessary web traffic by leveraging caching.
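Link collection depends on turning each `href` into an absolute URL. As a minimal sketch using only the standard library, the helper below resolves links with `urllib.parse.urljoin`, which also handles relative paths without a leading slash, and drops non-HTTP links such as `mailto:`. The name `resolve_links` and the `example.com` URLs are assumptions for illustration; this is an alternative to prefixing the site root by hand, not the notebook's own method.

```python
from typing import Iterable, List
from urllib.parse import urldefrag, urljoin, urlparse


def resolve_links(page_url: str, hrefs: Iterable[str]) -> List[str]:
    """Resolve hrefs against page_url, keeping only unique http(s) links."""
    resolved = []
    seen = set()
    for href in hrefs:
        if not href:
            continue
        # urljoin handles "/path", "path", "../path", and absolute URLs.
        absolute = urljoin(page_url, href)
        # Drop the "#fragment" part so in-page anchors do not create duplicates.
        absolute, _fragment = urldefrag(absolute)
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skips mailto:, javascript:, tel:, and similar schemes
        if absolute not in seen:
            seen.add(absolute)
            resolved.append(absolute)
    return resolved


links = resolve_links(
    "https://example.com/blogs/",
    ["/about", "post-1", "#top", "mailto:hi@example.com", "https://other.example/x"],
)
print(links)
```

Here `post-1` resolves to `https://example.com/blogs/post-1`, and `#top` collapses to the page URL itself.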

In [2]:
from bs4 import BeautifulSoup
import requests
from typing import List
import traceback
from markdownify import markdownify as md
from youtube_transcript_api.formatters import TextFormatter
from youtube_transcript_api import YouTubeTranscriptApi

formatter = TextFormatter()

cache = {}


def get_base_url(url):
    return "/".join(url.split("/")[:3])


def get_links_from_page(url):
    global cache
    try:
        cached_item = cache.get(url)
        cached_urls = cached_item.get("urls") if cached_item else None
        if cached_urls:
            print(f"Returning cached links for {url}")
            return cached_urls, None
        base_url = get_base_url(url)
        response = requests.get(url)
        soup = BeautifulSoup(response.text, "html.parser")
        links = []
        for link in soup.find_all("a"):
            cur_link = link.get("href")
            if not cur_link:
                continue
            if not cur_link.startswith("http"):
                links.append(f"{base_url}{cur_link}")
            else:
                links.append(cur_link)
        if not cached_item:
            cache[url] = {
                "urls": links,
                "content": None,
                "url": url,
            }
        else:
            cached_item["urls"] = links
            cache[url] = cached_item
        return links, None
    except Exception as e:
        print(f"Failed to get links from {url}. Error: {e}")
        traceback.print_exc()
        return [], str(e)


def get_page_content(url):
    try:
        cached_item = cache.get(url)
        cached_content = cached_item.get("content") if cached_item else None
        if cached_content:
            print(f"Returning cached content for {url}")
            return cached_content, None
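        # "youtube" already matches "youtube.com", so two checks are enough.
        if "youtube" in url or "youtu.be" in url:
            # NOTE: this assumes the video ID is the last "="-separated value,
            # as in ".../watch?v=<id>"; other URL shapes need separate handling.
            video_id = url.split("=")[-1]
            transcript = YouTubeTranscriptApi.get_transcript(video_id)
            text_transcript = formatter.format_transcript(transcript)
            return md(text_transcript), None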
        response = requests.get(url)
        soup = BeautifulSoup(response.text, "html.parser")
        result = md(str(soup))
        if not cached_item:
            cache[url] = {
                "urls": None,
                "content": result,
                "url": url,
            }
        else:
            cached_item["content"] = result
            cache[url] = cached_item
        return result, None
    except Exception as e:
        print(f"Failed to get content from {url}. Error: {e}")
        traceback.print_exc()
        return "", str(e)


def get_page_content_batch(urls: List[str]):
    results = []
    for url in urls:
        cached_item = cache.get(url)
        cached_content = cached_item.get("content") if cached_item else None
        if cached_content:
            print(f"Returning cached content for {url}")
            results.append({"url": url, "content": cached_content, "error": None})
            continue
        print(f"Getting content from {url}")
        content, error = get_page_content(url)
        results.append({"url": url, "content": content, "error": error})
        if not cached_item:
            cache[url] = {
                "urls": None,
                "content": content,
                "url": url,
            }
        else:
            cached_item["content"] = content
            cache[url] = cached_item
        result_bytes_count = len(content.encode("utf-8"))
        print(f"Content from {url} fetched. Size: {result_bytes_count} bytes!")
    return results
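The YouTube branch in `get_page_content` takes the video ID from the text after the last `=`, which returns the wrong value for URLs such as `.../watch?v=<id>&t=10s` or `https://youtu.be/<id>`. As a hedged sketch, the helper below parses the URL instead. The name `extract_video_id` and the sample IDs are hypothetical, and only a few common URL shapes are covered.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID from common YouTube URL shapes, or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        return parsed.path.strip("/").split("/")[0] or None
    if host == "youtube.com" or host.endswith(".youtube.com"):
        if parsed.path == "/watch":
            # Standard links carry the ID in the "v" query parameter.
            return parse_qs(parsed.query).get("v", [None])[0]
        for prefix in ("/embed/", "/shorts/"):
            if parsed.path.startswith(prefix):
                return parsed.path[len(prefix):].split("/")[0] or None
    return None


# Made-up IDs, used only for illustration.
print(extract_video_id("https://www.youtube.com/watch?v=abcdefghijk&t=10s"))
print(extract_video_id("https://youtu.be/abcdefghijk?t=5"))
print(extract_video_id("https://example.com/watch?v=nope"))
```

Returning `None` for unrecognized URLs lets the caller fall back to regular page scraping.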

## Step 3: Knowledge Base Extraction and Compilation

This cell defines the functions that fetch blog links and knowledge base pages from the AE (Arpit Bhayani) website and write their content to a Markdown file. A `clean_whitespace` helper is also defined, but its call in `build_knowledge_base` is commented out, so content is written as fetched.

In [3]:
from datetime import datetime

knowledge_base_cache = {}


def get_ae_blog_links():
    blogs_base_url = "https://arpitbhayani.me/blogs"
    blog_links = []
    links, error = get_links_from_page(blogs_base_url)
    if error:
        return blog_links, error
    blog_links.extend(links)
    return blog_links, None


def clean_whitespace(text: str) -> str:
    """
    Cleans the whitespace according to the following rules:
    - Removes leading and trailing whitespace from every line.
    - Replaces runs of spaces or tabs with a single space.
    - Replaces multiple newlines with a single newline.
    - Removes any leading or trailing newlines.
    """
    if not text:
        return ""
    # Collapse whitespace within each line; splitting the whole text at once
    # would also discard the newlines that the rules above are meant to keep.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    # Dropping empty lines collapses repeated newlines into one.
    return "\n".join(line for line in lines if line)


def fetch_knowledge_base():
    knowledge_base_link = "https://arpitbhayani.me/knowledge-base"
    links_to_fetch = []
    sub_page_links, error = get_links_from_page(knowledge_base_link)
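    if error:
        # Return a bare list so callers always get the same type as on success.
        print(f"Failed to get links from {knowledge_base_link}. Error: {error}")
        return []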
    for link in sub_page_links:
        is_knowledge_base_link = "knowledge-base" in link
        if not is_knowledge_base_link:
            continue
        is_google_drive_link = "drive.google.com" in link
        if is_google_drive_link:
            continue
        result_links = []
        try:
            blog_links, error = get_links_from_page(link)
            if error:
                print(f"Failed to get links from {link}. Error: {error}")
                continue
            result_links.extend(blog_links)
        except Exception as e:
            print(f"Failed to get links from {link}. Error: {e}")
            traceback.print_exc()
        links_to_fetch.extend(result_links)
    result = get_page_content_batch(links_to_fetch)
    return result


def build_knowledge_base(output_file="ae_knowledge_base.md"):
    try:
        # Append a timestamp so that each run writes to a fresh file.
        now = datetime.now()
        now_human_formatted = now.strftime("%d_%m_%Y_%H_%M_%S")
        output_file_name, extension = os.path.splitext(output_file)
        output_file = f"{output_file_name}_{now_human_formatted}{extension}"
        blog_links, error = get_ae_blog_links()
        if error:
            return [], error
        all_content = []
        # Include the blog pages as well as the knowledge base pages.
        all_content.extend(get_page_content_batch(blog_links))
        all_content.extend(fetch_knowledge_base())
        # Write the title and all sections in one pass so the title is not
        # overwritten by a second open() in "w" mode.
        with open(output_file, "w", encoding="utf-8") as f:
            f.write("# AE Knowledge Base\n\n")
            for c in all_content:
                # c['content'] = clean_whitespace(c['content'])
                f.write(f"# {c['url']}\n")
                f.write(c["content"])
                f.write("\n\n")
        return all_content, None
    except Exception as e:
        print(f"Failed to build knowledge base. Error: {e}")
        traceback.print_exc()
        return [], str(e)
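`build_knowledge_base` appends a timestamp to the requested file name, so the file on disk is not named exactly `output_file`. The naming logic can be isolated as in the sketch below; `timestamped_name` is a hypothetical helper that mirrors the function's `strftime` pattern, and the fixed date is only there to make the result reproducible.

```python
import os
from datetime import datetime


def timestamped_name(output_file: str, now: datetime) -> str:
    """Insert a dd_mm_YYYY_HH_MM_SS stamp before the file extension."""
    stem, extension = os.path.splitext(output_file)
    return f"{stem}_{now.strftime('%d_%m_%Y_%H_%M_%S')}{extension}"


fixed_time = datetime(2025, 1, 31, 9, 5, 7)
print(timestamped_name("ae_knowledge_base.md", fixed_time))
# ae_knowledge_base_31_01_2025_09_05_07.md
```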

## Step 4: Fetch AE Blog Links

This step retrieves the list of blog links from the AE website by calling the `get_ae_blog_links()` function. It then prints the total number of links found, providing a preview of the available blog resources. The result is a collection of blog links that will be processed further for content extraction.

In [None]:
ae_links, error = get_ae_blog_links()

print(f"Found {len(ae_links)} links")
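`get_ae_blog_links()` returns every link found on the page, which can include navigation links and duplicates. As a minimal sketch, the helper below keeps only unique links under a given prefix. The name `keep_links_under` and the `example.com` URLs are assumptions for illustration, not part of the notebook.

```python
from typing import Iterable, List


def keep_links_under(links: Iterable[str], prefix: str) -> List[str]:
    """Return unique links that start with prefix, in first-seen order."""
    # dict.fromkeys removes duplicates while preserving insertion order;
    # stripping the trailing slash makes ".../post" and ".../post/" equal.
    unique = dict.fromkeys(link.rstrip("/") for link in links)
    return [link for link in unique if link.startswith(prefix)]


sample_links = [
    "https://example.com/blogs",
    "https://example.com/blogs/example-post",
    "https://example.com/blogs/example-post/",
    "https://example.com/about",
    "https://example.com/blogs/another-post",
]
print(keep_links_under(sample_links, "https://example.com/blogs/"))
```

Because the prefix ends with a slash, the index page `https://example.com/blogs` itself is excluded.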

## Step 5: Fetch Knowledge Base Content

In this step, the function `fetch_knowledge_base()` is called to fetch content from various links within the knowledge base. The retrieved content is stored in `knowledge_base_content`, and the number of links fetched is printed to give an overview of the amount of knowledge base material collected. This content is later used to build the comprehensive knowledge base.

In [None]:
knowledge_base_content = fetch_knowledge_base()

print(f"Fetched {len(knowledge_base_content)} links.")
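The count printed above includes entries whose fetch failed, because `get_page_content_batch` records an `error` field for each URL rather than dropping it. A hedged sketch of a summary over that result shape follows; `summarize_results` and the sample records are hypothetical.

```python
from typing import Dict, List, Optional


def summarize_results(results: List[Dict[str, Optional[str]]]) -> Dict[str, int]:
    """Count successful and failed entries and the total UTF-8 size fetched."""
    ok = [r for r in results if not r.get("error") and r.get("content")]
    return {
        "fetched": len(ok),
        "failed": len(results) - len(ok),
        "bytes": sum(len(r["content"].encode("utf-8")) for r in ok),
    }


# Sample records in the {"url", "content", "error"} shape used by the batch fetch.
sample_results = [
    {"url": "https://example.com/a", "content": "# A\nhello", "error": None},
    {"url": "https://example.com/b", "content": "", "error": "timeout"},
]
print(summarize_results(sample_results))
# {'fetched': 1, 'failed': 1, 'bytes': 9}
```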

## Step 6: Build Knowledge Base

In this step, `build_knowledge_base()` is called with `output_file="ae_knowledge_base_v0_0_2.md"`. The function appends a timestamp to that name before writing, so the file on disk is named `ae_knowledge_base_v0_0_2_<timestamp>.md`. On success, the cell prints a success message and the number of sections returned; otherwise, it prints the error. This step finalizes the creation of the knowledge base.

In [None]:
output_file = "ae_knowledge_base_v0_0_2.md"
contents, error = build_knowledge_base(output_file=output_file)


if error:
    print(f"Failed to build knowledge base. Error: {error}")
else:
    print(f"Knowledge base built successfully. Output file: {output_file}")
    print(f"Content: {len(contents)} sections. ")
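Since the written file carries a timestamp, the name printed above does not match the file on disk. One way to locate the latest output is to search for files that share the same stem, as in the sketch below. The helper name `newest_output` is hypothetical, and the sketch assumes the files live in the given directory and that modification time reflects run order.

```python
import glob
import os
from typing import Optional


def newest_output(output_file: str, directory: str = ".") -> Optional[str]:
    """Return the most recently modified file matching <stem>_*<ext>, or None."""
    stem, extension = os.path.splitext(output_file)
    # glob.escape protects any special characters in the directory or stem.
    pattern = os.path.join(
        glob.escape(directory), f"{glob.escape(stem)}_*{extension}"
    )
    matches = glob.glob(pattern)
    return max(matches, key=os.path.getmtime) if matches else None


# Prints a path if a matching file exists in the working directory, else None.
print(newest_output("ae_knowledge_base_v0_0_2.md"))
```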

## To-Do Tasks for Knowledge Base Integration:

This section outlines the remaining tasks to complete the project:


- Load and clean the knowledge base content: This involves processing the fetched knowledge base data to ensure it is well-structured and free of any unwanted formatting or inconsistencies.

- Load the knowledge base content into a vector store: This step stores the cleaned content in a vector format, allowing efficient searching and retrieval.

- Implement a simple RAG (Retrieval-Augmented Generation) using Phi4 and the vector store: This introduces the use of RAG for enhancing response quality by integrating the vector store with a retrieval mechanism.

- Implement optimizations on the RAG to improve response quality: This final task focuses on fine-tuning the RAG system for better performance and more accurate results in real-world applications.
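For the first task, loading the knowledge base, a minimal sketch is shown below. It assumes the file layout written by `build_knowledge_base`, where each page starts with a `# <url>` line, and splits the text back into per-page records. The name `split_sections` is hypothetical, and a page whose own content contains a line of the form `# http...` would be split incorrectly.

```python
import re
from typing import Dict, List

# build_knowledge_base writes each page as "# <url>" followed by its content.
SECTION_HEADING = re.compile(r"^# (https?://\S+)\s*$")


def split_sections(markdown_text: str) -> List[Dict[str, str]]:
    """Split the generated file into {"url", "content"} records."""
    sections = []  # list of (url, [lines]) pairs
    for line in markdown_text.splitlines():
        match = SECTION_HEADING.match(line)
        if match:
            sections.append((match.group(1), []))
        elif sections:
            # Lines before the first URL heading (the title) are skipped.
            sections[-1][1].append(line)
    return [
        {"url": url, "content": "\n".join(lines).strip()} for url, lines in sections
    ]


sample_file = (
    "# AE Knowledge Base\n\n"
    "# https://example.com/a\nFirst page.\n\n"
    "# https://example.com/b\n## Heading\nSecond page.\n\n"
)
for section in split_sections(sample_file):
    print(section["url"], len(section["content"]))
```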

In [8]:
#
# TODO: Things to do till this is complete.
# - Load and clean the knowledge base content.
# - Load the knowledge base content into a vector store.
# - Implement a simple RAG using Phi4 and the vector store.
# - Implement any optimizations on the RAG to improve response quality.
#
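Before content can go into a vector store, it is usually split into chunks small enough to embed. The notebook does not specify a chunking strategy, vector store, or embedding model, so the following is only one possible sketch: fixed-size character chunks with a small overlap. The name `chunk_text` and the default sizes are assumptions.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into chunks of at most max_chars that share `overlap` characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be in the range [0, max_chars)")
    step = max_chars - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", max_chars=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

The overlap keeps a sentence that straddles a chunk boundary at least partly visible in both neighboring chunks, at the cost of some duplicated text.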

## Conclusion:

This app builds a knowledge base from online sources, specifically blogs and knowledge base pages. Using the `BeautifulSoup`, `markdownify`, and `YouTubeTranscriptApi` libraries, it extracts page content and video transcripts, converts them to Markdown, and stores the result in a single file. An in-memory cache avoids repeated fetches of the same URL. The next steps are to load the knowledge base into a vector store and to use Retrieval-Augmented Generation (RAG) to improve response quality.



---

# Thank You for visiting The Hackers Playbook! 🌐

If you liked this research material:

- [Subscribe to our newsletter.](https://thehackersplaybook.substack.com)

- [Follow us on LinkedIn.](https://www.linkedin.com/company/the-hackers-playbook/)

- [Leave a star on our GitHub.](https://www.github.com/thehackersplaybook)
