## Manualify.ai
---

`Manualify.ai:` A Smart AI for Answering Questions from Tutorials: Harnessing the Power of RAG LLMs and Documents.

The objective is to develop an AI capable of answering questions by harnessing information from websites, tutorials, ServiceNow, customer support resources, and transactional databases.

**Author:** Amit Shukla

**Connect**

[GitHub](https://github.com/AmitXShukla) | [X](https://x.com/@ashuklax) | [YouTube](https://youtube.com/@Amit.Shukla) | [Medium](https://amit-shukla.medium.com)

---

In this blog, we will create an online manual for Python, Oracle or Julia Lang, Angular and Flutter.

**Step 0:** getting started

**Step 1:** build a simple web crawler - Scrapify
    
In this section, we will

1. query LLM API and build a Q&A from LLM.
2. read data from PDF files and then query LLM.
3. Scrape an online page and query LLM.
4. automated crawling: build a simple `web crawler` to gather text from the website. The crawler will collect links from the given domain and then visit each link to download the associated text. 
5. explore various options for downloading data from Single Page Applications (SPAs) using web scraping techniques and libraries such as BeautifulSoup, Scrapy, and Selenium.
6. Image data extraction:
look into different methods for extracting data from images, including Optical Character Recognition (OCR) techniques and libraries such as Tesseract.
7. read data from PDF files: Additionally, we will examine ways to read and extract data from PDF files using libraries such as PyPDF2 and PDFMiner.
8. Querying Language Models (LLMs):
Once we have extracted the data, we will then query the Language Models (LLMs) with the extracted data to generate insights and answers.

**Step 2:** We will convert all PDFs to csv and simply build a Q&A prompt using Gen AI (Claude) with entire file content at once.

**Step 3:** Creating embeddings from CSVs and other documents to build a vector database.

**Step 4:** Using RAG and LLMs to query manual documents.

**Step 5:** Using SQL queries with function calling.

**Step 6:** creating an online app and hosting

At some point, you may wonder, Is it really worth investing time and resources into building a data lake and RAG database when AI models can handle massive datasets? Can't we just let machines do the heavy lifting?

LLM inference is still costly, and its value depends on the use case. Even massive token processing capacity is of little use without well-curated data. RAG and fine-tuned models will continue to thrive as long as they serve our needs.

I'd rather spend time learning and building tools using these technologies than engaging in pointless debates. Building takes less time and effort than debating, so I'll focus on creating and refining my models with my own data. Then, I'll compare results to see what works best.

## Steps to create Manualify

---
<!-- gitGraph TB:
    commit id: "query" tag: "build prompt"
    commit id: "1"
    branch Query
    commit id: "2" tag: "LLM"
    branch PDFQuery
    commit id: "3" tag: "send PDF as sontext"
    branch OnlineScapping
    commit id: "4" tag: "online content as context"
    branch AutoScapping
    commit id: "5" tag: "auto online download"
    branch EmbeddingsVectorDB
    commit id: "6" tag: "embeddings Vector DB"
    branch RAG
    commit id: "7" tag: "query RAG"
    branch SQLQueries
    commit id: "8" tag: "query RDBMs"
    branch FunctionCalling
    commit id: "9" tag: "Function Calling" -->

`Steps to create Manualify`

![Process Flow](../images/processflow.png)

## Process Flow diagram
---

`Brief overview of the RAG stack : Voyage AI`

![Brief overview of the RAG stack : Voyage AI](https://files.readme.io/ec25408-RAG-white.png)

The diagram below illustrates the high-level architecture and data flow of this project. Please note that **not** all of these features are included in the Community version, and the Pro/Custom version may vary significantly from this diagram based on individual implementation.

As depicted in the diagram, this basic web crawler accepts a URL as input and navigates through all linked sub-pages, collecting text from the specific website one page at a time.
Our current goal is straightforward: we aim to extract relevant text information, metadata and other useful details using this crawler. In subsequent blogs, we plan to construct an embedding vector data store or a vector database composed of embeddings derived from the text of the website and other available documentation and knowledge bases.

Now let's proceed to construct our simple web crawler that fetches text and pertinent information from all pages of a given website.

![Process Flow](../images/process_flow.png)

## Step 0: Getting Started
---

We'll discuss a few different approaches.
- **Approach 1** - Setup Anthropic LLMs
- **Approach 2** - Setup Open AI LLMs
- **Approach 3** - Setup Google Gemini AI
- **Approach 4** - Setup Groq API
- **Approach 5** - Setup Locally running models using Ollama
  (similar steps apply to other local LLM runtimes such as llama.cpp)

The selection of a Large Language Model (LLM) is influenced by factors such as your specific business needs, financial constraints, and personal tastes.

In this demo, I'll present a few distinct methods for connecting to various LLM service providers.
I also recommend using this as an opportunity to evaluate LLMs against your own inputs and industry-specific needs.

#### Approach 1 : Setup Anthropic LLMs
Let's first set up our Python working environment. While we can also use Node.js, please note that for the current version, we will be using Python for development. 

Please sign up using these links and get your own API keys.
- [ANTHROPIC_API_KEY](https://docs.anthropic.com/claude/reference/getting-started-with-the-api)
- [YOUR_PINECONE_API_KEY](https://docs.pinecone.io/docs/quickstart)
- [VOYAGE_API_KEY](https://docs.voyageai.com/docs/api-key-and-installation)

In [None]:
##########################################################################
## Although not mandatory, 
## it is highly recommended to set up a new Python working environment. ##
##########################################################################
## To create a virtual environment called `GenAI`, follow the steps:

## On Windows (run these two commands one at a time):
# !python -m venv GenAI
# GenAI\Scripts\activate

## On macOS or Linux (run these two commands one at a time):
# !python3 -m venv GenAI
# source GenAI/bin/activate

## Then, install required packages using pip: 
# !pip install pandas numpy matplotlib seaborn tqdm beautifulsoup4

## install only in case if you are using OpenAI
# !pip install openai

## install only in case if you are using Claude
# !pip install anthropic datasets pinecone-client voyageai

## in case if you fork this repo, just run
# !pip install -r requirements.txt

# !pip install --upgrade pip
# !pip freeze > requirements.txt

In [None]:
import platform;
print(platform.processor())

import os

####################################################
## if you are using OpenAI LLMs
## setup windows environment variable OPENAI_API_KEY
####################################################
# import openai
# openai.api_key = os.getenv("OPENAI_API_KEY")

####################################################
## if you are using Anthropic Claude LLMs
## sign up for API keys & setup windows environment variable 
## ANTHROPIC_API_KEY, PINECONE_API_KEY & VOYAGE_API_KEY
####################################################
# import anthropic

if (not os.environ.get("ANTHROPIC_API_KEY")) or (not os.environ.get("PINECONE_API_KEY")) or (not os.environ.get("VOYAGE_API_KEY")):
    print("At least one API key is missing.")
else:
    print("All API Keys are in place.")
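
The check above only tells us that something is missing. A minimal sketch that also reports which keys are missing is shown below. The helper name `missing_keys` and the fake environment are illustrative assumptions, not part of the original notebook.

```python
import os

REQUIRED_KEYS = ["ANTHROPIC_API_KEY", "PINECONE_API_KEY", "VOYAGE_API_KEY"]

def missing_keys(names, env=None):
    """Return the names from `names` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Example with a fake environment so the result does not depend on your machine
fake_env = {"ANTHROPIC_API_KEY": "sk-demo", "PINECONE_API_KEY": ""}
print(missing_keys(REQUIRED_KEYS, fake_env))
```

Calling `missing_keys(REQUIRED_KEYS)` without the second argument checks the real environment variables.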

In [None]:
# Test LLM API
import anthropic

client = anthropic.Anthropic(
    # defaults to os.environ.get("ANTHROPIC_API_KEY")
    # you don't need to pass api_key explicitly
    api_key=os.environ.get("ANTHROPIC_API_KEY")
)

message = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1000,
    temperature=0.0,
    system="Respond only in Yoda-speak.",
    messages=[
        {"role": "user", "content": "how are args and keyword arguments defined in python?"}
    ]
)
# message.content is a list of content blocks; print the text of the first one
print(message.content[0].text)

#### Approach 2 : Setup Open AI LLMs

Let's first set up our Python working environment. While we can also use Node.js, please note that for the current version, we will be using Python for development.

Please sign up and get your own API keys.

[Open AI API Key](https://platform.openai.com/docs/api-reference/introduction)

In [None]:
# !pip install openai

In [None]:
# Let's make sure your API keys are properly setup.
import platform;
print(platform.processor())
import os

if not os.environ.get("OPENAI_API_KEY"):
    print("OPENAI_API_KEY is missing.")
else:
    print("OPENAI_API_KEY is in place.")

In [None]:
# test LLM
from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
  model="gpt-4o-mini-2024-07-18",
  messages=[{"role": "user", "content": "how are args and keyword arguments defined in python?"}],
  temperature=1,
  max_tokens=256,
  top_p=1,
  frequency_penalty=0,
  presence_penalty=0
)

print(response.choices[0].message.content)

#### Approach 3 : Setup Google Gemini LLMs

Let's first set up our Python working environment. While we can also use Node.js, please note that for the current version, we will be using Python for development.

Please sign up and get your own API keys.

[Gemini API Key](https://ai.google.dev/gemini-api)

In [None]:
# !python.exe -m pip install --upgrade pip
# !pip install -q -U google-generativeai

In [None]:
# export API_KEY=<YOUR_API_KEY>

import google.generativeai as genai
import os

genai.configure(api_key=os.environ["API_KEY"])
model = genai.GenerativeModel('gemini-1.5-flash')

response = model.generate_content("Write a story about an AI and magic")
print(response.text)

#### Approach 4 : Setup Groq API

Let's first set up our Python working environment. While we can also use Node.js, please note that for the current version, we will be using Python for development.

Please sign up and get your own API keys.

[Groq API Key](https://console.groq.com/keys)

In [None]:
# !pip install groq

In [None]:
from groq import Groq

client = Groq()
completion = client.chat.completions.create(
    model="llama3-8b-8192",
    messages=[
        {
        "role": "user",
        "content": "how are args and keyword arguments defined in python?"
        }
    ],
    temperature=1,
    max_tokens=1024,
    top_p=1,
    stream=True,
    stop=None,
)

for chunk in completion:
    print(chunk.choices[0].delta.content or "", end="")

#### Approach 5 : Setup locally running models using Ollama
[Download](https://ollama.com/download) and set up Ollama.

Let's first set up our Python working environment.

In [None]:
# !pip install graphviz Pillow networkx requests ollama pypdf beautifulsoup4 tiktoken pandas matplotlib seaborn langchain chromadb pysqlite3-binary

In [None]:
#############################################################
# locally hosted ollama models
# install ollama binaries/exe from https://ollama.com/download
#############################################################

# run local installation then
# on windows powershell | MacOS Terminal
# run following commands one at a time
#############################################################

# !ollama list

# ## avoid very large LLMs at first; start with small ones such as the 3B or 7B versions
# ## phi3:mini is a great model; I find it capable of handling most tasks
# !ollama pull phi3:mini
# !ollama pull llama3.2

# ## embedding models
# !ollama pull all-minilm:latest
# !ollama pull mxbai-embed-large # SOTA model - recommended

# ## code model
# !ollama pull codegemma

# ollama list


automating ollama start

In [None]:
# to setup auto start on Linux VM
# add this to your .zshrc file

# alias l3="ollama run llama3.2"

In [None]:
# run ollama LLM
# this code assumes you have a local ollama running on your machine
# refer to previous step if you have any issues

import ollama
response = ollama.chat(model='llama3.2', messages=[
    {
        'role': 'user',
        'content': 'rephrase sentence: let us first make sure, we have a LLM/Embedding API running successfully',
    },
])

print(response['message']['content'])
# When the output displays, it indicates that your program has been run without errors and completed its intended tasks effectively.
# congratulations!, you have a LLM running locally on your machine.

In [None]:
#######################################
# fast API based hosted models
# for instance llama.cpp
#######################################

# !curl -X 'GET' 'https://<<IP.HOSTED.Models>>/v1/models' -H 'accept: application/json'

import requests

# replace <<IP.HOSTED.Models>> with the address of your hosted model server
url = 'https://<<IP.HOSTED.Models>>/v1/models'
headers = {'accept': 'application/json'}

# listing the models is a GET request, matching the curl command above
r = requests.get(url, headers=headers)
r.json()

With LLM models running locally or via API, we'll create flexible functions to interact with them. This approach allows us to easily switch between local, cloud-based, or other LLM models as needed.
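
One way to keep the rest of the code independent of the provider is a small registry of backend functions. The sketch below is only an illustration under that assumption: the names `PROVIDERS`, `register`, `ask` and the `echo` backend are hypothetical, and a real backend would wrap a call such as `ollama.chat` or `client.messages.create` from the cells above.

```python
from typing import Callable, Dict

# Registry mapping a provider name to a function (prompt -> reply text)
PROVIDERS: Dict[str, Callable[[str], str]] = {}

def register(name: str):
    """Decorator that adds a backend function to the registry."""
    def wrap(fn):
        PROVIDERS[name] = fn
        return fn
    return wrap

@register("echo")
def echo_backend(prompt: str) -> str:
    # Stand-in backend for testing; a real one would call an LLM API
    return f"[echo] {prompt}"

def ask(prompt: str, provider: str = "echo") -> str:
    """Send `prompt` to the chosen backend and return its reply text."""
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")
    return PROVIDERS[provider](prompt)

print(ask("hello"))
```

Switching providers then means registering another function and changing the `provider` argument, while the calling code stays the same.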

#### define generic reusable static contexts

In [None]:
# define a generic function which takes a context and a message and returns results

#########################################################
# These contexts are temporary and will be later replaced
# by LLM Tools | Functions automation (later sections)
# or use langchain to change context/prompt dynamically
#########################################################

# define different versions of contexts
# for POC purpose, these context will be used along with user prompt

context_append_bool_assistant = "respond in only true or false. "
context_append_one_word_assistant = "respond in only one word. "
context_append_code_assistant = "You are a helpful assistant with advanced SQL coding skills. "
context_append_service_assistant = "You are a helpful assistant with reasoning skills. "

context_append_output_assistant = """Ensure that the response strictly adheres to JSON format, excluding all additional content. Attached is an example demonstrating the desired JSON structure for the expected output. """

context_append_SQL_assistant = "return SQL or schema only and nothing else. "

context_sample_metadata = context_append_service_assistant + context_append_output_assistant + """
                                From the given message, retrieve the domain the user is referring to. For example:
                                {
                                    "Domain": ["Sales Order"]
                                }
                                """
context_sample_userinfo = context_append_service_assistant + context_append_output_assistant + """
                                From the given message, retrieve data/user information. For example, Joe Biden is POTUS;
                                his email ID will be Joe.X.Biden@whitehouse.org because, even if he does have a middle name, he doesn't use it that often,
                                so put an X in the middle when creating his email ID.
                                Answer only in this format.
                                {
                                    "Person": ["Joe Biden"],
                                    "EMPLOYEEID": ["ABC1234"],
                                    "emailID": ["Joe.X.Biden@whitehouse.org"]
                                }
                            """

context_sample_table_schema = context_append_code_assistant + context_append_SQL_assistant + """retrieve table schema from this SQL
                                SELECT name, salary
                                FROM EMPLOYEES
                                WHERE department = 'Engineering';
                            """

context_sample_SQL = context_append_code_assistant + context_append_SQL_assistant + """
                                write a SELECT statement for this schema, including a filter for department equal to 1234. The example schema is
                                CREATE TABLE EMPLOYEES (
                                    id INTEGER,
                                    name TEXT,
                                    department TEXT,
                                    salary INTEGER
                                )
                            """

Build a generic function that queries the LLM with a given context and message, and use the context to steer the LLM to answer in the required format.

In [None]:
def getResults(context, message, code):
    # every model named here must be pulled first with `ollama pull <model>`
    if code == "text":  # use the general-purpose chat model
        response = ollama.chat(model='llama3', messages=[
            {
                'role': 'user',
                'content': context + message,
            },
        ])
        return response['message']['content']
    elif code == "embed":  # use the embedding model; context is ignored
        # model = 'all-minilm:latest'
        return ollama.embeddings(model='mxbai-embed-large', prompt=message)
    else:  # use the code model; context is ignored
        response = ollama.generate(
            model='codestral:latest',
            prompt=message,
            options={
                'num_predict': 128,
                'temperature': 0,
                'top_p': 0.9,
                'stop': ['<|file_separator|>'],
            },
        )
        return response["response"]

message_1 = """My Name is Amit Shukla, My employee ID is ABC4563
        and my email is my First name followed by a dot,
        followed by x since I don't have a middle name,
        followed by dot, followed by my last name.
        I work at whitehouse"""

# run below test messages once RAG is ready
# message_1 = """My Name is Amit Shukla from POTUS IT department, I challenge you to find my Employee number and whitehouse email ID."""

# message_1 = """My Name is Amit Shukla, go find out my department based on previous interactions.
        #  I challenge you to find my employee number and whitehouse email ID."""

print(getResults(context_sample_userinfo, message_1, "text"))

message_2 = "generate SQL"
# context param is irrelevant here, make it optional kwarg
print(getResults(context_sample_SQL, context_sample_SQL, "code"))

message_3 = "Vendor HP purchased hundreds of Z book computers for Department 1234."
# context param is irrelevant here, make it optional kwarg
print(getResults(context_append_bool_assistant, message_3, "embed"))
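
The contexts above ask the LLM to answer strictly in JSON, but models often wrap the JSON in extra sentences. A minimal sketch for pulling the first JSON object out of a reply is shown below. The helper name `extract_json` and the sample reply are assumptions made for illustration, not real model output.

```python
import json

def extract_json(reply: str):
    """Return the first JSON object found in `reply`, or None."""
    decoder = json.JSONDecoder()
    for start, ch in enumerate(reply):
        if ch != "{":
            continue
        try:
            obj, _end = decoder.raw_decode(reply[start:])
            return obj
        except json.JSONDecodeError:
            continue  # not valid JSON from this brace; try the next one
    return None

# Hand-written sample reply (hypothetical, not produced by a model)
sample_reply = (
    "Sure! Here is the result:\n"
    '{"Person": ["Amit Shukla"], "EMPLOYEEID": ["ABC4563"]}\n'
    "Let me know if you need anything else."
)
print(extract_json(sample_reply))
```

Returning `None` when nothing parses lets the caller decide whether to retry the prompt or fall back to the raw text.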

## Step 1: build a simple web crawler - Scrapify
---

### Step 1.1: query LLM API and build Q&A system

So far, the LLM's responses rely solely on its own knowledge. Our aim is to get more out of the LLM by supplying our own data as input, so that it can use this data to answer questions more accurately.

In [None]:
import ollama

prompt = "how are args and kwargs different in python"
output = ollama.generate(
  model="llama3.1",
  prompt=f"""answer this question : {prompt}"""
)

print(output["response"])  # type: ignore

### Step 1.2: read data from PDF files and then query LLM

In the previous step, any reference data had to be supplied manually. In this step, we will instead read a PDF, such as a manual or tutorial, and use its text as the reference for our queries.

In [None]:
# !pip install pypdf
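# use the raw file URL (a /blob/ URL returns the GitHub HTML page, not the PDF) and follow redirects with -L
# !curl -L -O https://github.com/AmitXShukla/RPA/raw/main/SampleData/The%20Ultimate%20Guide%20to%20Data%20Wrangling%20with%20Python%20-%20Rust%20Polars%20Data%20Frame.pdf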

In [None]:
from pypdf import PdfReader

reader = PdfReader("../downloads/Python - understanding functions.pdf")
number_of_pages = len(reader.pages)
text = '\n'.join(page.extract_text() or '' for page in reader.pages)  # keep a line break between pages
print(text[:2155])

import ollama

def get_completion(prompt):
    output = ollama.generate(
        model="llama3.1",
        prompt=f"""answer this question : {prompt}"""
        )
    return output["response"]  # type: ignore

completion = get_completion(
    f"""Here is a local guide: <guide>{text}</guide>    

Please do the following:
1. Summarize the abstract about Python args 
and keyword args understanding at a kindergarten reading level. (In <kindergarten_abstract> tags.)
2. Write the Methods section as a recipe from the Moosewood Cookbook. (In <moosewood_methods> tags.)
"""
)
print(completion)
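
The prompt above asks the model to wrap each part of its answer in tags such as `<kindergarten_abstract>`. A small sketch for pulling those sections back out of the reply follows. The helper name `extract_tag` and the sample text are assumptions for illustration; real replies will vary and may omit a tag.

```python
import re

def extract_tag(text: str, tag: str):
    """Return the stripped contents of each <tag>...</tag> pair in `text`."""
    pattern = re.compile(
        rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>", re.DOTALL
    )
    return [match.strip() for match in pattern.findall(text)]

# Hand-written sample reply (hypothetical, not produced by a model)
sample = """<kindergarten_abstract>
Functions can take many inputs.
</kindergarten_abstract>
<moosewood_methods>Mix *args gently with **kwargs.</moosewood_methods>"""

print(extract_tag(sample, "kindergarten_abstract"))
print(extract_tag(sample, "moosewood_methods"))
```

An empty list signals that the model did not follow the requested format, which is a useful cue to re-prompt.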

### Step 1.3: Scrape an online page and query LLM

In this step, we will construct a basic web crawler that will download text from a specified URL, using this downloaded text as input. By automating this process, we aim to eliminate the need for manual data input.

These approaches are very useful for automation. For example, you may want an assistant to act on the results of an online search or a SQL query; with code like this, the whole flow can run without manual input.

In [None]:
# !pip install anthropic requests beautifulsoup4

In [None]:
USER_QUESTION="how are args and keyword arguments defined in python?"

In [None]:
GENERATE_QUERIES=f"""\n\nHuman: You are an expert Python programmer. 
Your proficiency in Python programming is exceptional.
You have the ability to craft code, author blogs, and create tutorials. 
Typically, when a question is posed to you, your response is comprehensive and often includes illustrative code examples.

User question: {USER_QUESTION}

Format: {{"queries": ["query_1", "query_2", "query_3"]}}\n\nAssistant: {{"""

In [None]:
import ollama

# define generic function to query LLM
def get_completion(prompt: str):
    output = ollama.generate(
    model="llama3.1",
    prompt=f"""answer this question : {prompt}"""
    )
    return output["response"]  # type: ignore

# query LLM based on fixed text
queries_json = "{" + get_completion(GENERATE_QUERIES)
print(queries_json)

Let's write code to scrape an online page.

In [None]:
import requests
from bs4 import BeautifulSoup

url = "https://docs.python.org/3/tutorial/controlflow.html#more-on-defining-functions"

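def get_page_content(url: str) -> str:
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup.get_text(strip=True, separator='\n')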

In [None]:
formatted_search_results = get_page_content(url)
print(formatted_search_results)

In [None]:
# inject search results into prompt
ANSWER_QUESTION = f"""\n\nHuman: I have provided you with the following search results:
{formatted_search_results}

Please answer the user's question using only information from the search results. Reference the relevant search result urls within your answer as links. Keep your answer concise.

User's question: {USER_QUESTION} \n\nAssistant:
"""
print(ANSWER_QUESTION)

In [None]:
print(get_completion(ANSWER_QUESTION))
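
Scraped pages often carry `<script>` and `<style>` content that adds nothing to the prompt. The sketch below shows the same text-extraction idea using only the standard library, with the decision to skip those tags made explicit. The class name `VisibleTextParser`, the function `visible_text` and the sample HTML are assumptions for illustration; this is not a replacement for the BeautifulSoup code above.

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def visible_text(html: str) -> str:
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)

sample_html = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Args</h1><script>var x = 1;</script>"
    "<p>Use *args and **kwargs.</p></body></html>"
)
print(visible_text(sample_html))
```

A smaller, cleaner context usually means fewer tokens sent to the LLM for the same answer.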

### Step 1.4: Scrapify: build a simple `web crawler`

Now that we can automate text retrieval from a single HTML page, we will build a simple `web crawler` to gather text from a whole website. The crawler will collect links from the given domain and then visit each link to download the associated text.

In [None]:
#############################################################################
# credit: most of this code is adapted from the OpenAI documentation
# however, it's not necessary to use the OpenAI API;
# in this blog, we will use a local ollama instead
# https://platform.openai.com/docs/tutorials/web-qa-embeddings
#############################################################################

# install dependencies
# !pip install requests pandas beautifulsoup4 tiktoken openai

# some of embedding packages were not working so tried downgrading openai version
# from openai.embeddings_utils import distances_from_embeddings, cosine_similarity
# !pip install openai==0.27.7

# !pip show openai

In [None]:
# import all packages
import requests
import re
import urllib.request
from bs4 import BeautifulSoup
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urlparse
import os
import pandas as pd
# import tiktoken
# import openai
import numpy as np
# from openai.embeddings_utils import distances_from_embeddings, cosine_similarity
from ast import literal_eval

In [None]:
# Regex pattern to match a URL
HTTP_URL_PATTERN = r'^http[s]{0,1}://.+$'

# Regex pattern to match a phone number (simplified; adjust for your locale)
PHONE_PATTERN = r'\+?\d[\d\s().-]{7,}\d'

# Regex pattern to match an email address (simplified)
EMAIL_PATTERN = r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+'

# Define OpenAI api_key
# openai.api_key = '<Your API Key>'

# Define root domain to crawl
domain = "oracle.com"
full_url = "https://docs.oracle.com/en/cloud/saas/financials/24b/books.html"

# Create a class to parse the HTML and get the hyperlinks
class HyperlinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        # Create a list to store the hyperlinks
        self.hyperlinks = []

    # Override the HTMLParser's handle_starttag method to get the hyperlinks
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)

        # If the tag is an anchor tag and it has an href attribute, add the href attribute to the list of hyperlinks
        if tag == "a" and "href" in attrs:
            self.hyperlinks.append(attrs["href"])

In [None]:
# Function to get the hyperlinks from a URL
def get_hyperlinks(url):
    
    # Try to open the URL and read the HTML
    try:
        # Open the URL and read the HTML
        with urllib.request.urlopen(url) as response:

            # If the response is not HTML, return an empty list
            if not (response.info().get('Content-Type') or '').startswith("text/html"):
                return []
            
            # Decode the HTML
            html = response.read().decode('utf-8')
    except Exception as e:
        print(e)
        return []

    # Create the HTML Parser and then Parse the HTML to get hyperlinks
    parser = HyperlinkParser()
    parser.feed(html)

    return parser.hyperlinks

In [None]:
# Function to get the hyperlinks from a URL that are within the same domain
def get_domain_hyperlinks(local_domain, url):
    clean_links = []
    for link in set(get_hyperlinks(url)):
        clean_link = None

        # If the link is a URL, check if it is within the same domain
        if re.search(HTTP_URL_PATTERN, link):
            # Parse the URL and check if the domain is the same
            url_obj = urlparse(link)
            if url_obj.netloc == local_domain:
                clean_link = link

        # If the link is not a URL, check if it is a relative link
        else:
            if link.startswith("/"):
                link = link[1:]
            elif (
                link.startswith("#")
                or link.startswith("mailto:")
                or link.startswith("tel:")
            ):
                continue
            clean_link = "https://" + local_domain + "/" + link

        if clean_link is not None:
            if clean_link.endswith("/"):
                clean_link = clean_link[:-1]
            clean_links.append(clean_link)

    # Return the list of hyperlinks that are within the same domain
    return list(set(clean_links))
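
The function above joins every relative link to the domain root, so a link like `intro.html` found on a nested page resolves to the wrong address. A hedged alternative is to resolve links against the page URL with `urllib.parse.urljoin`, as sketched below. The function name `resolve_same_domain` and the `example.com` URLs are assumptions for illustration.

```python
from urllib.parse import urldefrag, urljoin, urlparse

def resolve_same_domain(base_url: str, hrefs):
    """Resolve hrefs against base_url and keep only same-host http(s) links."""
    host = urlparse(base_url).netloc
    page, _ = urldefrag(base_url)
    resolved = set()
    for href in hrefs:
        # urljoin handles "../", "/path" and plain relative names
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parts = urlparse(absolute)
        if absolute == page:
            continue  # "#anchor" links point back to the same page
        if parts.scheme in ("http", "https") and parts.netloc == host:
            resolved.add(absolute.rstrip("/"))
    return sorted(resolved)

base = "https://example.com/docs/guide/index.html"
links = ["intro.html", "../api/", "/about", "#top", "mailto:me@example.com",
         "https://other.org/page", "https://example.com/docs/"]
print(resolve_same_domain(base, links))
```

The `mailto:` and off-domain links drop out through the scheme and host checks, so no separate prefix tests are needed.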

In [None]:
def crawl(url):
    # Parse the URL and get the domain
    local_domain = urlparse(url).netloc

    # Create a queue to store the URLs to crawl
    queue = deque([url])

    # Create a set to store the URLs that have already been seen (no duplicates)
    seen = set([url])

    # Create a directory to store the text files
    if not os.path.exists("text/"):
            os.mkdir("text/")

    if not os.path.exists("text/"+local_domain+"/"):
            os.mkdir("text/" + local_domain + "/")

    # Create a directory to store the csv files
    if not os.path.exists("processed"):
            os.mkdir("processed")

    # While the queue is not empty, continue crawling
    while queue:

        # Get the next URL from the queue
        url = queue.pop()
        print(url) # for debugging and to see the progress
        
        # Try extracting the text from the link, if failed proceed with the next item in the queue
        try:
            # Save text from the url to a <url>.txt file
            with open('text/'+local_domain+'/'+url[8:].replace("/", "_") + ".txt", "w", encoding="UTF-8") as f:

                # Get the text from the URL using BeautifulSoup
                soup = BeautifulSoup(requests.get(url).text, "html.parser")

                # Get the text but remove the tags
                text = soup.get_text()

                # If the page requires JavaScript, log a warning; the crawl continues
                if ("You need to enable JavaScript to run this app." in text):
                    print("Unable to parse page " + url + " due to JavaScript being required")

                # Write whatever text was extracted to the file in the text directory
                f.write(text)
        except Exception as e:
            print("Unable to parse page " + url + ": " + str(e))

        # Get the hyperlinks from the URL and add them to the queue
        for link in get_domain_hyperlinks(local_domain, url):
            if link not in seen:
                queue.append(link)
                seen.add(link)

crawl(full_url)
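
To see the queue and `seen` logic without touching the network, the sketch below runs the same traversal over a small in-memory site. The function `crawl_graph`, the `get_links` callable and the `site` dictionary are hypothetical stand-ins for `get_domain_hyperlinks` and real pages. Note that it uses `popleft()` for a breadth-first visit order, whereas `queue.pop()` in the crawler above visits the most recently found link first.

```python
from collections import deque

def crawl_graph(start, get_links):
    """Visit every page reachable from `start`; return pages in visit order.

    `get_links(url)` is a hypothetical callable returning the links on a page.
    """
    queue = deque([start])
    seen = {start}
    visited = []
    while queue:
        url = queue.popleft()  # FIFO gives breadth-first order
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

# A tiny made-up site: each page maps to the links it contains
site = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": [],
}
print(crawl_graph("/", lambda url: site.get(url, [])))
```

The `seen` set is what keeps the loop from revisiting `/` when `/b` links back to it.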

### Step 1.5: Scrapify: Crawling SPAs as screenshots

Although the previous steps work fine for static websites, they often fail to scrape data from dynamic pages and single page applications (SPAs). In this case, we will capture these pages as screenshots instead.

The first step is to [download the Chrome web-driver](https://chromedriver.chromium.org/downloads). Please make sure the web-driver version matches your Chrome version

(Open Chrome -> Help -> About Chrome -> check version).

Download the appropriate version for your machine's OS and unzip/extract it to a local folder.

In [None]:
# !pip install Pillow selenium

In [None]:
from selenium import webdriver
from PIL import Image

# Define the URL of the web page we want to screenshot

url = 'https://finance.yahoo.com/quote/AAPL?p=AAPL&.tsrc=fin-srch'

# Define the path to the webdriver executable (e.g., chromedriver.exe)

# webdriver_path = '/path/to/webdriver/executable'
webdriver_path = r'C:\amit.la\WIP\RPA\downloads\chromedriver.exe'

# Set up the webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run the browser in headless mode to prevent a window from popping up
driver = webdriver.Chrome(options=options)

# Load the web page

driver.get(url)

# Take a screenshot of the entire page

# screenshot = driver.find_element_by_tag_name('body').screenshot_as_png
screenshot = driver.save_screenshot('../downloads/screenshot.png')

# Close the webdriver

driver.quit()

# Save the screenshot to a file

# with open('../SampleData/screenshot.png', 'wb') as file:
#     file.write(screenshot)

# Open the screenshot with Pillow to display it (optional)

img = Image.open('../downloads/screenshot.png')
img.show()

In [None]:
import os
urls = {
        "AAPL.png": "https://finance.yahoo.com/quote/AAPL?p=AAPL&.tsrc=fin-srch",
        "ORCL.png": "https://finance.yahoo.com/quote/ORCL?p=ORCL&.tsrc=fin-srch",
        "TSLA.png": "https://finance.yahoo.com/quote/TSLA?p=TSLA&.tsrc=fin-srch",
        "GOOG.png": "https://finance.yahoo.com/quote/GOOG?p=GOOG&.tsrc=fin-srch",
        "MSFT.png": "https://finance.yahoo.com/quote/MSFT?p=MSFT&.tsrc=fin-srch"
    }

In [None]:
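def takeScreenshots(outputFileName, url, driver):
    driver.get(url)
    driver.save_screenshot(os.path.join('../downloads/', outputFileName))

In [None]:
# take multiple screenshots
# automate this script to autodownload data
# the earlier driver was closed with driver.quit(), so start a new one here

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
try:
    for key, value in urls.items():
        takeScreenshots(key, value, driver)
finally:
    driver.quit()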

### Step 1.6: Scrapify: read Image and query LLM

In this step, we will read text from the screenshots we captured and use it to build our knowledge base.

To read text from images using Tesseract OCR in Python, we can use the pytesseract library, which is a Python wrapper for the Tesseract OCR engine. The Tesseract engine itself must be installed separately:

[download tesseract here](https://tesseract-ocr.github.io/tessdoc/#binaries)

`Note that Tesseract OCR is not perfect and may not be able to extract text accurately from all images.`

In [None]:
# py -m pip install pytesseract Pillow

In [None]:
from PIL import Image
img = Image.open('../downloads/AAPL.png')
img.show()

# make sure the tesseract executable is on your PATH

import shutil
shutil.which("tesseract") # returns the full path, or None if tesseract is not found

In [None]:
import pytesseract
from PIL import Image

##############################################################################
# in case if tesseract is not included in PATH
pytesseract.pytesseract.tesseract_cmd = r'C:\amit.la\WIP\RPA\downloads\ts\tesseract.exe'
##############################################################################

def read_image_text(image_path):
    """
    Reads text from an image file using Tesseract OCR.

    Args:
        image_path (str): The file path to the input image.

    Returns:
        str: The extracted text from the image.
    """
    # Load the image file
    image = Image.open(image_path)

    # Use Tesseract OCR to extract the text from the image
    text = pytesseract.image_to_string(image)

    return text

# Example usage
image_path = "../downloads/AAPL.png"
text = read_image_text(image_path)
print(text)

In [None]:
images = {
        "AAPL": "../downloads/AAPL.png",
        "ORCL": "../downloads/ORCL.png",
        "TSLA": "../downloads/TSLA.png",
        "GOOG": "../downloads/GOOG.png",
        "MSFT": "../downloads/MSFT.png"
    }

# automate reading images and creating text from these images
# you can further store these texts into a database

for key,value in images.items():
    # print(key, value)
    text = read_image_text(value)
    print(text)
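
The comment above mentions storing the extracted text in a database. A minimal sketch of that idea, assuming a hypothetical `ocr_texts` table in SQLite and placeholder strings standing in for real `read_image_text` output:

```python
import sqlite3

def save_ocr_texts(conn, texts):
    """Upsert one row per ticker into a hypothetical ocr_texts table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ocr_texts (ticker TEXT PRIMARY KEY, body TEXT)"
    )
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO ocr_texts (ticker, body) VALUES (?, ?)",
            list(texts.items()),
        )

# placeholder strings; in the notebook these would come from read_image_text(...)
sample_texts = {
    "AAPL": "Apple Inc. (AAPL) placeholder text",
    "ORCL": "Oracle placeholder text",
}

conn = sqlite3.connect(":memory:")
save_ocr_texts(conn, sample_texts)
save_ocr_texts(conn, sample_texts)  # running it twice does not duplicate rows
stored = conn.execute("SELECT ticker, body FROM ocr_texts ORDER BY ticker").fetchall()
print(stored)
```

Using the ticker as the primary key means re-running the OCR loop replaces older text instead of piling up duplicates.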

## Step 2: convert all PDFs | texts to one csv
---
In this step, we will convert all PDFs/text files to one csv and build a Q&A prompt using Gen AI (Claude) with the entire file content at once. Because one big csv can be too much data to send as input, we will split the big csv into smaller csvs.
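
The splitting of one big csv into smaller csvs could look roughly like the sketch below. It uses only the standard library; the function names, the `part_000.csv` naming scheme and the demo file are assumptions made for illustration, not part of the notebook.

```python
import csv
import os
import tempfile

def _write_part(out_dir, index, header, rows):
    path = os.path.join(out_dir, f"part_{index:03d}.csv")
    with open(path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(header)   # repeat the header in every part
        writer.writerows(rows)
    return path

def split_csv(src_path, out_dir, rows_per_file=100):
    """Split src_path into part files holding at most rows_per_file data rows each."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)
        batch = []
        for row in reader:
            batch.append(row)
            if len(batch) == rows_per_file:
                paths.append(_write_part(out_dir, len(paths), header, batch))
                batch = []
        if batch:  # flush the final, possibly shorter, part
            paths.append(_write_part(out_dir, len(paths), header, batch))
    return paths

# small self-contained demo with a throwaway file
tmp_dir = tempfile.mkdtemp()
demo_csv = os.path.join(tmp_dir, "scraped_demo.csv")
with open(demo_csv, "w", newline="", encoding="utf-8") as f:
    w = csv.writer(f)
    w.writerow(["fname", "text"])
    w.writerows([[f"chapter {i}", f"text {i}"] for i in range(5)])

parts = split_csv(demo_csv, os.path.join(tmp_dir, "parts"), rows_per_file=2)
print(parts)
```

Splitting by row count is the simplest option; splitting by token count, as done in Step 3, gives tighter control over how much text goes into each prompt.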

In [None]:
# replaces newline characters (real and escaped) with spaces and collapses double spaces
def remove_newlinechars(txt):
    txt = txt.str.replace('\n', ' ', regex=False)
    txt = txt.str.replace('\\n', ' ', regex=False)
    txt = txt.str.replace('  ', ' ', regex=False)
    txt = txt.str.replace('  ', ' ', regex=False)
    return txt

In [None]:
import pandas as pd
# Create a list to store the text files
texts=[]

# Get all the text files in the text directory
for file in os.listdir("../downloads/texts/"):

    # Open the file and read the text
    with open("../downloads/texts/" + file, "r", encoding="UTF-8") as f:
        text = f.read()

        # Drop the first 11 and the last 4 characters of the file name, then replace - and _ with spaces and remove #update.
        texts.append((file[11:-4].replace('-',' ').replace('_', ' ').replace('#update',''), text))

# Create a dataframe from the list of texts
df = pd.DataFrame(texts, columns = ['fname', 'text'])

# Set the text column to be the raw text with the newlines removed
df['text'] = df.fname + ". " + remove_newlinechars(df.text)
df.to_csv('../downloads/scraped.csv')
df.head()

# as you can see, we created one row in csv per txt file 
# i.e. chapter 3, 4 and 5 each has one row in csv

In [None]:
import os
os.listdir("../downloads/")

## Step 3: Creating Embedding from csvs and other documents to create a Vector database.
---

Most embedding APIs limit the number of input tokens. In this step, we will tokenize each row and split long rows into smaller chunks.

#### Step 3.1: Creating tokens from text

In [None]:
# !pip install tiktoken

In [None]:
########################
# visualize text tokens
########################
import pandas as pd
import tiktoken

# Load the cl100k_base tokenizer which is designed to work with the ada-002 model
tokenizer = tiktoken.get_encoding("cl100k_base")

df = pd.read_csv('../downloads/scraped.csv', index_col=0)
df.columns = ['title', 'text']

# Tokenize the text and save the number of tokens to a new column
df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))

print(df)
# Visualize the distribution of the number of tokens per row using a histogram
df.n_tokens.hist()

# as you can see in the results below,
# Chapters 3, 4 & 5 have approx. 4.2k, 8k and 5.7k tokens
# this is expected, as Chapter 4 has the longest text

# now, we will need to further split these rows based on the number of tokens
# because most vector databases have upper limits on the number of tokens that can be stored

#### Step 3.2: Creating equal tokens from text

In [None]:
#####################################################################
# let's say we want to split csv into chunks of 500 tokens
#####################################################################

max_tokens = 500

# Function to split the text into chunks of a maximum number of tokens
def split_into_many(text, max_tokens = max_tokens):

    # Split the text into sentences
    sentences = text.split('. ')

    # Get the number of tokens for each sentence
    n_tokens = [len(tokenizer.encode(" " + sentence)) for sentence in sentences]

    chunks = []
    tokens_so_far = 0
    chunk = []

    # Loop through the sentences and tokens joined together in a tuple
    for sentence, token in zip(sentences, n_tokens):

        # If the number of tokens so far plus the number of tokens in the current sentence is greater
        # than the max number of tokens, then add the chunk to the list of chunks and reset
        # the chunk and tokens so far
        if tokens_so_far + token > max_tokens:
            chunks.append(". ".join(chunk) + ".")
            chunk = []
            tokens_so_far = 0

        # If the number of tokens in the current sentence is greater than the max number of
        # tokens, go to the next sentence
        if token > max_tokens:
            continue

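        # Otherwise, add the sentence to the chunk and add the number of tokens to the total
        chunk.append(sentence)
        tokens_so_far += token + 1

    # Add the last, partially filled chunk so the tail of the text is not lost
    if chunk:
        chunks.append(". ".join(chunk) + ".")

    return chunks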


shortened = []

# Loop through the dataframe
for row in df.iterrows():

    # If the text is None, go to the next row
    if row[1]['text'] is None:
        continue

    # If the number of tokens is greater than the max number of tokens, split the text into chunks
    if row[1]['n_tokens'] > max_tokens:
        shortened += split_into_many(row[1]['text'])

    # Otherwise, add the text to the list of shortened texts
    else:
        shortened.append( row[1]['text'] )

In [None]:
#####################################################################
# Visualizing the updated histogram again can help to confirm
# if the rows were successfully split into shortened sections.
#####################################################################

df = pd.DataFrame(shortened, columns = ['text'])
df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))
print(df.head())
print(df.shape) # we had approx. 18k tokens earlier, so we should expect about 35+ rows of up to 500 tokens each
df.n_tokens.hist()

#####################################################################
# as you can see from the histogram,
# most rows have about 450-500 tokens each
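
One caveat of `split_into_many` is that a sentence longer than `max_tokens` is skipped entirely. A minimal sketch of an alternative that hard-splits such a sentence instead is shown below. It uses a whitespace word count as a stand-in for the tiktoken tokenizer (an assumption made only so the example runs without extra packages), and `split_keep_all` and `split_long_sentence` are hypothetical names.

```python
def count_tokens(text):
    # stand-in for len(tokenizer.encode(text)); counts whitespace-separated words
    return len(text.split())

def split_long_sentence(sentence, max_tokens):
    """Hard-split one oversized sentence into pieces of at most max_tokens words."""
    words = sentence.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

def split_keep_all(text, max_tokens=500):
    chunks, chunk, tokens_so_far = [], [], 0
    for sentence in text.split(". "):
        n = count_tokens(sentence)
        if n > max_tokens:
            # flush what we have, then emit the oversized sentence in pieces
            if chunk:
                chunks.append(". ".join(chunk) + ".")
                chunk, tokens_so_far = [], 0
            chunks.extend(split_long_sentence(sentence, max_tokens))
            continue
        if chunk and tokens_so_far + n > max_tokens:
            chunks.append(". ".join(chunk) + ".")
            chunk, tokens_so_far = [], 0
        chunk.append(sentence)
        tokens_so_far += n
    if chunk:  # keep the trailing partial chunk
        chunks.append(". ".join(chunk))
    return chunks

demo = "one two three. " + " ".join(["w"] * 12) + ". four five"
demo_chunks = split_keep_all(demo, max_tokens=5)
print(demo_chunks)
```

With a real tokenizer, the hard split would slice the encoded token list rather than the word list, but the control flow stays the same.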

#### Step 3.3: Creating a ChromaDB (trychroma) vector DB

In [None]:
# !pip install pysqlite3-binary
# !pip show chromadb
# version 0.3.29
# there might be issues due to sqlite library, if so,
# replace sqlite with pysqlite3-binary

In [None]:
# if you wish to just experiment without running chromadb as a persistent DB,
# run this code and ignore the next cell

# store sample documents in chromadb

# import ollama
# __import__('pysqlite3')
# import sys
# sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')
# import chromadb
# client = chromadb.Client() # will create only a temp, non-persistent db

In [None]:
# running prod chromadb as persistent db
# store tokens|docs in a persistent db so that you don't need to store vectors every time

# first make sure chromadb is installed, if not
# !pip install chromadb

# do not run this here, instead run this command on terminal
# !chroma run --host localhost --port 8080 --path ./OLVectorDB

# if you see errors, add below code to your __init__.py file
# import ollama
# __import__('pysqlite3')
# import sys
# sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

In [None]:
# store sample documents in chromadb

import ollama
__import__('pysqlite3')
import sys
sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')
import chromadb
# use this client version to start a non-persistent experimental chromadb instance
# client = chromadb.Client() # will create only a temp, non-persistent db

# use this client version to start persistent chromadb instance
# from chromadb.config import DEFAULT_TENANT, DEFAULT_DATABASE, Settings
# client = chromadb.PersistentClient(
#     path="./vectordb",
#     settings=Settings(),
#     tenant=DEFAULT_TENANT,
#     database="OL",
# )

client = chromadb.PersistentClient(path="./OLVectorDB")

# use this client version to start chromadb http instance
# client = chromadb.HttpClient(
#     host="localhost",
#     port=8080,
#     ssl=False,
#     headers=None,
#     # settings=Settings(),
#     # tenant=DEFAULT_TENANT,
#     database="./vectordb",
# )

In [None]:
client.list_collections()

In [None]:
# uncomment the code below to create the collection the first time;
# once the collection exists, keep it commented out so it is not created over and over again

documents = df["text"].to_list()

# collection = client.create_collection(name="docs")
# # store each document in a vector embedding database
# for i, d in enumerate(documents):
#   response = ollama.embeddings(model="mxbai-embed-large", prompt=d)
#   embedding = response["embedding"]
#   collection.add(
#     ids=[str(i)],
#     embeddings=[embedding],
#     documents=[d]
#   )

#### Step 3.4: Creating SQLite vector DB

In [None]:
# !pip install sqlite_vec

In [None]:
# this code is copied from the sqlite-vec GitHub repo
import sqlite3
import sqlite_vec

from typing import List
import struct

def serialize_f32(vector: List[float]) -> bytes:
    """serializes a list of floats into a compact "raw bytes" format"""
    return struct.pack("%sf" % len(vector), *vector)

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vec.load(db)
db.enable_load_extension(False)

sqlite_version, vec_version = db.execute(
    "select sqlite_version(), vec_version()"
).fetchone()
print(f"sqlite_version={sqlite_version}, vec_version={vec_version}")

items = [
    (1, [0.1, 0.1, 0.1, 0.1]),
    (2, [0.2, 0.2, 0.2, 0.2]),
    (3, [0.3, 0.3, 0.3, 0.3]),
    (4, [0.4, 0.4, 0.4, 0.4]),
    (5, [0.5, 0.5, 0.5, 0.5]),
]
query = [0.3, 0.3, 0.3, 0.3]

db.execute("CREATE VIRTUAL TABLE vec_items USING vec0(embedding float[4])")

with db:
    for item in items:
        db.execute(
            "INSERT INTO vec_items(rowid, embedding) VALUES (?, ?)",
            [item[0], serialize_f32(item[1])],
        )

rows = db.execute(
    """
      SELECT
        rowid,
        distance
      FROM vec_items
      WHERE embedding MATCH ?
      and k=3
      ORDER BY distance
    """,
    [serialize_f32(query)],
).fetchall()

print(rows)
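
To see what the nearest-neighbour query above computes, the same ranking can be reproduced in plain Python. The sketch below assumes Euclidean (L2) distance and reuses the toy `items` and `query` values; `deserialize_f32` and `l2` are hypothetical helpers, not part of sqlite-vec.

```python
import math
import struct

def serialize_f32(vector):
    """Pack a list of floats into raw float32 bytes (same helper as above)."""
    return struct.pack("%sf" % len(vector), *vector)

def deserialize_f32(blob):
    """Hypothetical inverse of serialize_f32: 4 bytes per float32 value."""
    return list(struct.unpack("%sf" % (len(blob) // 4), blob))

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

items = [
    (1, [0.1, 0.1, 0.1, 0.1]),
    (2, [0.2, 0.2, 0.2, 0.2]),
    (3, [0.3, 0.3, 0.3, 0.3]),
    (4, [0.4, 0.4, 0.4, 0.4]),
    (5, [0.5, 0.5, 0.5, 0.5]),
]
query = [0.3, 0.3, 0.3, 0.3]

# brute-force equivalent of "WHERE embedding MATCH ? AND k = 3 ORDER BY distance"
ranked = sorted(((rowid, l2(vec, query)) for rowid, vec in items), key=lambda r: r[1])[:3]
print(ranked)

# round trip: float32 storage keeps roughly 7 significant digits
restored = deserialize_f32(serialize_f32(items[0][1]))
print(restored)
```

A brute-force scan like this is fine for a handful of vectors and is a handy way to sanity-check results from the vector table.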

## Step 4: Using RAG and LLMs to query manual documents.
---

Retrieval-Augmented Generation using ChromaDB (trychroma).

Query an embedding collection.
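
Conceptually, the retrieval step is a nearest-neighbour lookup over embeddings followed by prompt assembly. The toy sketch below uses hand-made 3-dimensional vectors (hypothetical values standing in for real `mxbai-embed-large` embeddings) and cosine similarity as one common similarity measure; the metric a vector store uses by default may differ.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, store, n_results=1):
    """Return the n_results documents whose vectors are most similar to query_vec."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item["embedding"]), reverse=True)
    return [item["document"] for item in ranked[:n_results]]

def build_prompt(context_docs, question):
    context = "\n".join(context_docs)
    return f"Using this data: {context}. Respond to this prompt: {question}"

# hypothetical documents and embeddings, for illustration only
toy_store = [
    {"document": "Functions accept *args and **kwargs.", "embedding": [0.9, 0.1, 0.0]},
    {"document": "Lists are mutable sequences.", "embedding": [0.1, 0.9, 0.0]},
    {"document": "Modules group related code.", "embedding": [0.0, 0.2, 0.9]},
]
toy_query = [0.8, 0.2, 0.1]

top_docs = retrieve(toy_query, toy_store)
rag_prompt = build_prompt(top_docs, "how are args and keyword arguments defined in python?")
print(rag_prompt)
```

In the notebook, the collection query plays the role of `retrieve` and the final generation call receives a prompt shaped like `rag_prompt`.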

In [None]:
collection = client.get_collection(name="docs")
# retrieve data from vector store
prompt = "how are args and keyword arguments defined in python?"

# generate an embedding for the prompt and retrieve the most relevant doc
response = ollama.embeddings(
  prompt=prompt,
  model="mxbai-embed-large"
)
results = collection.query(
  query_embeddings=[response["embedding"]],
  n_results=1
)
data = results['documents'][0][0]
data

In [None]:
# ask the model directly, without passing the retrieved data
response = ollama.chat(model='phi3:mini', messages=[
  {
    'role': 'user',
    'content': prompt,
  },
])
print(response['message']['content']) # type: ignore

In [None]:
# ask again, this time passing the retrieved document along with the prompt
output = ollama.generate(
  model="llama3",
  prompt=f"Using this data: {data}. Respond to this prompt: {prompt}"
)

print(output['response']) # type: ignore

## Step 5: Using SQL queries | API with Tools | Function calling
---

In later use cases, we will work with advanced data topics and use function/tool calling extensively.

In [None]:
# !pip install ollama

In [10]:
import requests

def get_current_weather(city):
  # NOTE: the city argument is not used yet; the endpoint below is hard-coded
  # to a single api.weather.gov grid point (TOP/32,81)
  url = "https://api.weather.gov/gridpoints/TOP/32,81/forecast"
  response = requests.get(url)
  # temperature of the first forecast period
  return response.json()["properties"]["periods"][0]["temperature"]

get_current_weather("Los Angeles")

89
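
Because `get_current_weather` queries one fixed grid point, every city gets the same answer. A sketch of a city-aware variant follows, assuming a small hypothetical coordinate table and the National Weather Service `points` endpoint to discover the forecast URL. The function names and the `User-Agent` value are placeholders, and the network calls are defined but not executed here.

```python
import json
import urllib.request

# hypothetical lookup table; a real app would use a geocoding service
CITY_COORDS = {
    "Los Angeles": (34.0522, -118.2437),
    "San Francisco": (37.7749, -122.4194),
}

def points_url(city):
    """Build the api.weather.gov points URL for a known city."""
    lat, lon = CITY_COORDS[city]
    return f"https://api.weather.gov/points/{lat},{lon}"

def first_period_temperature(forecast_json):
    """Pull the first forecast period's temperature out of a forecast payload."""
    return forecast_json["properties"]["periods"][0]["temperature"]

def fetch_json(url):
    # the NWS API asks clients to identify themselves with a User-Agent header
    req = urllib.request.Request(url, headers={"User-Agent": "manualify-demo (contact@example.com)"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

def get_current_weather_for(city):
    point = fetch_json(points_url(city))                     # network call
    forecast = fetch_json(point["properties"]["forecast"])   # network call
    return first_period_temperature(forecast)

print(points_url("Los Angeles"))
```

Keeping URL construction and payload parsing in separate small functions makes them easy to check without hitting the network.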

In [4]:
import sqlite3

# Connect to the database (or create it)
conn = sqlite3.connect('example.db')
cursor = conn.cursor()

# Create a table
cursor.execute('''
CREATE TABLE IF NOT EXISTS users (
    id INTEGER PRIMARY KEY,
    name TEXT,
    age INTEGER
)
''')
conn.commit()

# Insert a record
cursor.execute('''
INSERT INTO users (name, age) VALUES (?, ?)
''', ('Alice Wonder', 30))
conn.commit()

# Retrieve records
cursor.execute('SELECT * FROM users')
rows = cursor.fetchall()
for row in rows:
    print(row)

# Close the connection
conn.close()


(1, 'Alice', 30)
(2, 'Alice Wonder', 30)
(3, 'Alice Wonder', 30)
(4, 'Alice Wonder', 30)
(5, 'Alice Wonder', 30)
(6, 'Alice Wonder', 30)
(7, 'Alice Wonder', 30)
(8, 'Alice Wonder', 30)
(9, 'Alice Wonder', 30)
(10, 'Alice Wonder', 30)
(11, 'Alice Wonder', 30)
(12, 'Alice Wonder', 30)
(13, 'Alice Wonder', 30)
(14, 'Alice Wonder', 30)
(15, 'Alice Wonder', 30)
(16, 'Alice Wonder', 30)


In [5]:
conn = sqlite3.connect('example.db')
cursor = conn.cursor()
def get_user(employee):
  # Insert a record
  # cursor.execute('''
  # INSERT INTO users (name, age) VALUES (?, ?)
  # ''', ('Alice Wonder', 30))
  # conn.commit()
  print(f"SELECT * FROM users where name = {employee}")
  # bind the name as a parameter instead of formatting it into the SQL string
  cursor.execute("SELECT * FROM users where name = ?", (employee,))
  rows = cursor.fetchall()
  # for row in rows:
  #   print(row)
  return rows

print(get_user("Alice"))

# Close the connection
# conn.close()

SELECT * FROM users where name = Alice
[(1, 'Alice', 30)]
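
Tool arguments are produced by the model, so it is safer to bind them as SQL parameters than to format them into the query string. A small sketch against an in-memory database (the table contents and the `get_user_safe` name are made up for the demo):

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
demo_conn.executemany(
    "INSERT INTO users (name, age) VALUES (?, ?)",
    [("Alice", 30), ("Bob", 41)],
)

def get_user_safe(conn, employee):
    """Look up a user by name with a bound parameter instead of string formatting."""
    return conn.execute("SELECT * FROM users WHERE name = ?", (employee,)).fetchall()

normal_rows = get_user_safe(demo_conn, "Alice")
# a classic injection string is treated as a literal name and matches nothing
injected_rows = get_user_safe(demo_conn, "x' OR '1'='1")
print(normal_rows, injected_rows)
```

With string formatting, the second input would turn the `WHERE` clause into a condition that is always true and return every row.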


In [6]:
import ollama

tools = [{
      'type': 'function',
      'function': {
        'name': 'get_current_weather',
        'description': 'Get the current weather for a city',
        'parameters': {
          'type': 'object',
          'properties': {
            'city': {
              'type': 'string',
              'description': 'The name of the city',
            },
          },
          'required': ['city'],
        },
      },
    },
    {
      'type': 'function',
      'function': {
        'name': 'get_user',
        'description': 'Get the current age of employee',
        'parameters': {
          'type': 'object',
          'properties': {
            'employee': {
              'type': 'string',
              'description': 'The name of the employee',
            },
          },
          'required': ['employee'],
        },
      },
    },
  ]

# creating a generic function to call appropriate tool based on tool input
def process_tool_call(tool_name, tool_input):
    if tool_name == "get_current_weather":
        return get_current_weather(tool_input["city"])
    if tool_name == "get_user":
        return get_user(tool_input["employee"])
  
# print(process_tool_call('get_current_weather', {'city': 'Los Angeles CA'}))
print(process_tool_call('get_user', {'employee': 'Alice'}))

SELECT * FROM users where name = Alice
[(1, 'Alice', 30)]
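
As the number of tools grows, the `if` chain in `process_tool_call` can be replaced by a dictionary lookup. The sketch below uses stub functions (`fake_weather`, `fake_user`) in place of the real API and database calls; `TOOL_REGISTRY` and `dispatch_tool` are hypothetical names.

```python
def fake_weather(city):
    return 54  # stub standing in for the real weather API call

def fake_user(employee):
    return [(1, employee, 30)]  # stub standing in for the SQLite lookup

# map each tool name to a callable that unpacks the model-provided arguments
TOOL_REGISTRY = {
    "get_current_weather": lambda args: fake_weather(args["city"]),
    "get_user": lambda args: fake_user(args["employee"]),
}

def dispatch_tool(tool_name, tool_input):
    handler = TOOL_REGISTRY.get(tool_name)
    if handler is None:
        # return a message instead of None so the model gets something to work with
        return f"Unknown tool: {tool_name}"
    return handler(tool_input)

print(dispatch_tool("get_user", {"employee": "Alice"}))
print(dispatch_tool("does_not_exist", {}))
```

Registering a new tool then only requires one more dictionary entry next to its schema in `tools`.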


In [7]:
response = ollama.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': 
            'how old is Alice?'}],
        # provide the tool definitions to the model
        tools=tools # type: ignore
    )

# response 
print(f"\nInitial Response:")
# single quotes inside the f-strings keep this valid on Python versions before 3.12
print(f"Tool called: {response['message']['tool_calls'][0]}")
print(f"Tool name: {response['message']['tool_calls'][0]['function']['name']}")
print(f"Tool param: {response['message']['tool_calls'][0]['function']['arguments']}")
print(f"Stop Reason: {response['done_reason']}")
print(f"Content: {response['message']['content']}")


Initial Response:
Tool called: {'function': {'name': 'get_user', 'arguments': {'employee': 'Alice'}}}
Tool name: get_user
Tool param: {'employee': 'Alice'}
Stop Reason: stop
Content: 


In [8]:
response = ollama.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': 
            'What is the weather in Los Angeles CA today?'}],
        # provide the tool definitions to the model
        tools=tools # type: ignore
    )

# response 
print(f"\nInitial Response:")
# single quotes inside the f-strings keep this valid on Python versions before 3.12
print(f"Tool called: {response['message']['tool_calls'][0]}")
print(f"Tool name: {response['message']['tool_calls'][0]['function']['name']}")
print(f"Tool param: {response['message']['tool_calls'][0]['function']['arguments']}")
print(f"Stop Reason: {response['done_reason']}")
print(f"Content: {response['message']['content']}")


Initial Response:
Tool called: {'function': {'name': 'get_current_weather', 'arguments': {'city': 'Los Angeles, CA'}}}
Tool name: get_current_weather
Tool param: {'city': 'Los Angeles, CA'}
Stop Reason: stop
Content: 


In [35]:
def chatBot(user_message):
    print(f"\n{'='*50}\nUser Message: {user_message}\n{'='*50}")
    response = ollama.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': user_message}],
        # provide the tool definitions to the model
        tools=tools # type: ignore
    )
    print(f"\nInitial Response:")
    # single quotes inside the f-strings keep this valid on Python versions before 3.12
    print(f"Tool called: {response['message']['tool_calls'][0]}")
    print(f"Stop Reason: {response['done_reason']}")
    print(f"Content: {response['message']['content']}")

    if response["done_reason"] == "stop":
        # tool_use = next(block for block in response.content if block.type == "tool_use")
        tool_name = response["message"]["tool_calls"][0]["function"]["name"]
        tool_input = response["message"]["tool_calls"][0]["function"]["arguments"]
        tool_content = response["message"]["content"]

        tool_result = process_tool_call(tool_name, tool_input)
        print(f"Tool Result: {tool_result}")

        response = ollama.chat(
                model='llama3.2',
                messages=[
                    {"role": "user", "content": user_message},
                    # {"role": "assistant", "content": f"as per results from tools API, current data is {str(tool_result)} , based on this data, please answer this {user_message}."},
                    {
                    "role": "tool",
                    "content": str(tool_result) # type: ignore
                    },
                ],
                tools=tools # type: ignore
                )
        print(response)
    return response

In [36]:
chatBot("How is the weather in San Francisco today?")
# chatBot("How old is my employee name Alice?")


User Message: How is the weather in San Francisco today?

Initial Response:
Tool called: {'function': {'name': 'get_current_weather', 'arguments': {'city': 'San Francisco'}}}
Stop Reason: stop
Content: 
Tool Result: 54
{'model': 'llama3.2', 'created_at': '2024-10-03T05:37:55.7531787Z', 'message': {'role': 'assistant', 'content': 'The current temperature in San Francisco is 54°F. Would you like to know more about the weather forecast for San Francisco or any specific time of day?'}, 'done_reason': 'stop', 'done': True, 'total_duration': 6176065000, 'load_duration': 4932965100, 'prompt_eval_count': 77, 'prompt_eval_duration': 373857000, 'eval_count': 32, 'eval_duration': 866191000}


{'model': 'llama3.2',
 'created_at': '2024-10-03T05:37:55.7531787Z',
 'message': {'role': 'assistant',
  'content': 'The current temperature in San Francisco is 54°F. Would you like to know more about the weather forecast for San Francisco or any specific time of day?'},
 'done_reason': 'stop',
 'done': True,
 'total_duration': 6176065000,
 'load_duration': 4932965100,
 'prompt_eval_count': 77,
 'prompt_eval_duration': 373857000,
 'eval_count': 32,
 'eval_duration': 866191000}
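
`chatBot` assumes the model always returns a tool call. When the model answers directly, the `tool_calls` lookup would raise an error. A minimal guard is sketched below with plain dictionaries shaped like the responses printed above; `first_tool_call` is a hypothetical helper, and it is an assumption that the real response object supports the same `.get` access.

```python
def first_tool_call(response):
    """Return (name, arguments) for the first tool call, or None if the model answered directly."""
    message = response.get("message", {})
    calls = message.get("tool_calls") or []
    if not calls:
        return None
    fn = calls[0]["function"]
    return fn["name"], fn["arguments"]

# sample payloads mimicking the shape of the responses shown in this step
with_tool = {
    "message": {
        "content": "",
        "tool_calls": [{"function": {"name": "get_user", "arguments": {"employee": "Alice"}}}],
    }
}
without_tool = {"message": {"content": "Hello! How can I help?"}}

print(first_tool_call(with_tool))
print(first_tool_call(without_tool))
```

When the helper returns `None`, the chatbot can return the model's `content` as the final answer instead of running a tool.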

## Step 6: build, host online UI app with llama3.2
## ollama ChromaDB SQLite RAG Q&A
---