# FireCrawl x LangChain Documentation RAG 🔥



<div style="display:flex; align-items:center; padding: 50px;">
<p style="margin-right:10px;">
    <img height="200px" style="width:auto;" width="200px" src="https://avatars.githubusercontent.com/u/192148546?s=400&u=95d76fbb02e6c09671d87c9155f17ca1e4ef8f21&v=4"> 
</p>
<p style="margin-right:10px;">
    <img  height="200px" src="https://raw.githubusercontent.com/mendableai/firecrawl/main/img/firecrawl_logo.png"> 
</p>
<p style="margin-right:10px;">
    <img  height="200px" src="https://images.contentstack.io/v3/assets/bltf2fca5bf44f5e817/blt34d9fdb635976e4a/669e80a79fecd86c50d59f6d/Lang_Square.png"> 
</p>


</div>

A simple and effective implementation of Naive RAG (Retrieval Augmented Generation) using FireCrawl and LangChain! 🚀

- Provide a link to your Python documentation. 

- FireCrawl crawls and scrapes the documentation.

- Use LangChain to embed the documents and store them in an in-memory vector store. 

- Use a prompt template to build a RAG prompt that holds the 'query' and 'context'.

- Receive detailed answers from the LLM using your documentation.

# Step 1: Install Requirements ⚙️

The requirements are already specified in the `requirements.txt` file. We simply use our utility to install the requirements.

In [28]:
import os
requirements_installed = False
max_retries = 3
retries = 0

def install_requirements():
    """Installs the requirements from requirements.txt file"""
    global requirements_installed, retries
    if requirements_installed:
        print('Requirements already installed.')
        return
 
    print('Installing requirements...')
    install_status = os.system('pip install -r requirements.txt')
    if install_status == 0:
        print('Requirements installed successfully.')
        requirements_installed = True
    else:
        print('Failed to install requirements.')
        if retries < max_retries:
            print('Retrying...')
            retries += 1
            return install_requirements()
        exit(1)
    return


In [29]:
install_requirements()

Installing requirements...
Requirements installed successfully.

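As an alternative to `os.system`, the installer can be sketched with `subprocess`. Invoking pip as `sys.executable -m pip` targets the interpreter the notebook is running on, and a loop replaces the recursive retry. The helper names `run_with_retries` and `install_requirements_with_pip` are hypothetical and not part of the original notebook.

```python
import subprocess
import sys
from typing import List


def run_with_retries(command: List[str], max_retries: int = 3) -> bool:
    """Runs a command until it exits with code 0 or the retry budget is spent."""
    # One initial attempt plus up to max_retries retries.
    for attempt in range(1, max_retries + 2):
        result = subprocess.run(command)
        if result.returncode == 0:
            return True
        print(f"Attempt {attempt} failed with exit code {result.returncode}.")
    return False


def install_requirements_with_pip(path: str = "requirements.txt") -> bool:
    """Installs requirements with the pip that belongs to the running interpreter."""
    return run_with_retries([sys.executable, "-m", "pip", "install", "-r", path])
```

Returning a boolean instead of calling `exit(1)` leaves the decision about what to do on failure to the caller, which tends to be friendlier inside a notebook kernel.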

# Step 2: Setup Environment Variables 🏕️

Make sure you have added `FIRECRAWL_API_KEY` and `OPENAI_API_KEY` to your `.env` file. 

In [30]:
from dotenv import load_dotenv

load_dotenv()

True

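Before calling any API, it can help to confirm that the keys were actually loaded. The following is a minimal sketch; the helper name `find_missing_env_vars` is an assumption for illustration and does not appear in the original notebook.

```python
import os
from typing import Iterable, List


def find_missing_env_vars(names: Iterable[str]) -> List[str]:
    """Returns the names that are unset or empty in the current environment."""
    return [name for name in names if not os.getenv(name)]


# The two keys this notebook expects in the `.env` file.
missing = find_missing_env_vars(["FIRECRAWL_API_KEY", "OPENAI_API_KEY"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```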
# Step 3: Initialize FireCrawl Client 🔥

This method initializes the FireCrawl app with the API key.

In [31]:
from firecrawl import FirecrawlApp

def get_firecrawl_client():
    return FirecrawlApp(api_key=os.getenv('FIRECRAWL_API_KEY'))

# Step 4: Crawling & Scraping Utilities 🕷️

We fetch the documentation links and then scrape each page in markdown format. To achieve this, we define a set of utility methods that we will call in our runner.

In [32]:
from typing import List

cache = {}

def get_doc_links(app: FirecrawlApp, input_url: str) -> List[str]:
    """Gets the documentation links from the given URL."""
    cache_key = f"{input_url}_links"
    cached_links = cache.get(cache_key)
    if cached_links:
        print(f'Using cached links for URL: {input_url}')
        return cached_links
    
    crawl_result = app.map_url(input_url)

    success = crawl_result['success']

    if not success:
        raise RuntimeError(f'Failed to get links from URL: {input_url}')
    
    links = crawl_result['links']
    cache[cache_key] = links

    return links


def get_single_doc_from_link(app: FirecrawlApp, link: str) -> str:
    """Gets the documentation from the given link."""
    cached_doc = cache.get(link)
    if cached_doc:
        print(f'Using cached docs for URL: {link}')
        return cached_doc

    scrape_result = None
    try:
        scrape_result = app.scrape_url(link, params={'formats': ['markdown']})
    except Exception as e:
        print(f'Failed to get docs from URL: {link}')
        print(e)

    if not scrape_result:
        return None

    success = scrape_result['metadata']['statusCode'] == 200

    if not success:
        print(f'Failed to get docs from URL: {link}')
        return None
    
    markdown = scrape_result['markdown']
    cache[link] = markdown

    return markdown

def get_docs_from_links(app: FirecrawlApp, links: List[str], verbose = False) -> List[str]:
    """Gets the documentation from the given list of links."""
    docs = []

    for link in links:
        if verbose:
            print(f'Getting docs from URL: {link}')
        markdown = get_single_doc_from_link(app, link)
        if markdown:
            docs.append(markdown)
            if verbose:
                print(f'Fetched docs from URL: {link}')

    return docs

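If the list of links contains pages outside the documentation root, or several anchors that point into the same page, each of them costs a scrape call. A hedged sketch of a pre-filter that could run between fetching the links and scraping them is shown below. The helper `filter_doc_links` is hypothetical, and whether your link list needs this filtering depends on the site being crawled.

```python
from typing import Iterable, List
from urllib.parse import urldefrag


def filter_doc_links(links: Iterable[str], base_url: str) -> List[str]:
    """Keeps unique links under base_url, ignoring #fragment anchors."""
    seen = set()
    filtered = []
    for link in links:
        # Drop the fragment so page.html#a and page.html#b count as one page.
        page_url, _fragment = urldefrag(link)
        if page_url.startswith(base_url) and page_url not in seen:
            seen.add(page_url)
            filtered.append(page_url)
    return filtered


# Example with made-up URLs: two anchors of the same page and one off-site link.
example_links = [
    "https://example.com/docs/intro.html",
    "https://example.com/docs/intro.html#install",
    "https://example.com/blog/post.html",
]
print(filter_doc_links(example_links, "https://example.com/docs/"))
```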
# Step 5: Vector Store & Embeddings 🔮

Set up the vector store using OpenAI embeddings. These methods embed our documentation and load it into the in-memory vector store for retrieval.

In [33]:
from langchain_openai import OpenAIEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

DEFAULT_EMBEDDING_MODEL = "text-embedding-3-small"

vector_store = None


def get_embeddings() -> OpenAIEmbeddings:
    """Gets the OpenAI embeddings client used to embed documents and queries."""
    embeddings = OpenAIEmbeddings(model=DEFAULT_EMBEDDING_MODEL)
    return embeddings


def build_vectorstore(docs: List[str]) -> InMemoryVectorStore:
    """Builds a vector store from the given list of documents."""
    global vector_store
    if vector_store:
        print("Vector store already built. Return pre-computed vector store.")
        return vector_store
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1024,
        chunk_overlap=20,
        length_function=len,
        is_separator_regex=False,
    )
    vector_docs = text_splitter.create_documents(docs)
    embeddings = get_embeddings()
    vector_store = InMemoryVectorStore(embeddings)
    vector_store.add_documents(vector_docs)
    return vector_store

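To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified sliding-window chunker. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter first tries to break on separators such as paragraphs, newlines, and spaces); the function `chunk_text` is a hypothetical illustration of the two parameters only.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1024, chunk_overlap: int = 20) -> List[str]:
    """Splits text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new window starts chunk_overlap characters before the previous one ended.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The current window already reaches the end of the text.
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears partly in both neighbours, which can help retrieval when the relevant text sits near a boundary.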
# Step 6: LLM interface and RAG prompt 🧠

We return our LLM client based on the provider and model. To use Anthropic instead of OpenAI, add `ANTHROPIC_API_KEY` to your `.env` file. By default, we use OpenAI and GPT-4o. We also define a simple RAG prompt that holds the 'query' and 'context'.

In [34]:
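from typing import Optional

from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_core.language_models.chat_models import BaseChatModel
from langchain_core.prompts import ChatPromptTemplate

DEFAULT_PROVIDER = "openai"
DEFAULT_OPENAI_MODEL = "gpt-4o"
DEFAULT_ANTHROPIC_MODEL = "claude-3-5-sonnet-latest"

def get_llm(provider: str = DEFAULT_PROVIDER, model: Optional[str] = None) -> BaseChatModel:
    """Gets the chat model for the given provider, falling back to that provider's default model."""
    if provider == "openai":
        return ChatOpenAI(model=model or DEFAULT_OPENAI_MODEL)
    elif provider == "anthropic":
        # ChatAnthropic is a chat model, so its response exposes `.content` like ChatOpenAI's.
        return ChatAnthropic(model=model or DEFAULT_ANTHROPIC_MODEL)
    raise ValueError(f"Unsupported provider: {provider}")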
    
def get_rag_prompt() -> ChatPromptTemplate:
    """Gets the RAG prompt."""
    system = """
        You are an AI agent that can answer questions about software development.
        Given a query and context about the query, provide a factual, and detailed response. 
        The response should be relevant to the query and context.
        Simply respond with the answer to the query.
    """

    user = """
        Query: {query}
        Context: {context}
    """

    return ChatPromptTemplate.from_messages([
        ("system", system),
        ("user", user)
    ])


def rag_search(query: str, docs: List[str]) -> str:
    """Performs a RAG search."""
    llm = get_llm()
    vector_store = build_vectorstore(docs)
    matched_docs = vector_store.similarity_search(query, k=5)
    matched_docs = [doc.page_content for doc in matched_docs]
    context = "\n".join(matched_docs)
    prompt = get_rag_prompt()
    prompt_formatted = prompt.invoke(input={"query": query, "context": context})
    llm_response = llm.invoke(input=prompt_formatted)
    return llm_response.content
    

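Conceptually, the retrieval step in `rag_search` ranks the stored chunks by how close their embedding is to the query embedding and keeps the best few. The sketch below reproduces that idea in plain Python with hand-made two-dimensional vectors, assuming cosine similarity as the metric. The names `cosine_similarity` and `top_k` and the toy vectors are illustrative assumptions, not the vector store's actual implementation.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    chunks: List[Tuple[str, Sequence[float]]],
    k: int = 5,
) -> List[str]:
    """Returns the text of the k chunks most similar to the query vector."""
    ranked = sorted(
        chunks,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _vec in ranked[:k]]


# Toy "embeddings": real ones would come from the embedding model.
toy_chunks = [
    ("parsing", [1.0, 0.0]),
    ("install", [0.0, 1.0]),
    ("grammar", [0.9, 0.1]),
]
context = "\n".join(top_k([1.0, 0.0], toy_chunks, k=2))
print(context)
```

The joined `context` string plays the same role as the one passed to the RAG prompt above.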


# Step 7: Runner! 🏃🏻

Define our entry point. From here, all the previously defined methods are called.

In [35]:

input_url = "https://pyparsing-docs.readthedocs.io/en/latest/"

def run(query = None, verbose = False, user_input = False):
    """Runs the program."""
    print("Starting FireCrawl x LangChain Documentation RAG! 🚀")
    app = get_firecrawl_client()
    if verbose:
        print(f'Getting links from URL: {input_url}')

    if query and user_input:
        print("User input is enabled. Ignoring the query parameter.")

    links = get_doc_links(app, input_url)
    if verbose:
        print(f"Fetched {len(links)} links.")
        print("Fetching docs from links...")
    docs = get_docs_from_links(app, links, verbose=verbose)
    if verbose:
        print(f"Fetched {len(docs)} docs.")
    if verbose:
        print("Building vector store and running RAG search...")
    if user_input:
        query = input("Enter your query: ")
    response = rag_search(query, docs)
    print("\nAI Response:", response)
    print("Done! ✨")

# Step 8: Run ⚡️

Run the program. Feel free to adjust the input parameters to your liking.

In [36]:
# Input Parameters: Adjust the query and user_input parameters as needed
query = "How to parse a string in Python?"
user_input = False # If set to True, ignores the 'query' parameter and prompts for user input

# Run the program! 🚀
run(query=query, verbose=False, user_input=user_input)

Starting FireCrawl x LangChain Documentation RAG! 🚀

AI Response: To parse a string in Python using the `pyparsing` module, follow these steps:

1. **Define Tokens and Patterns**: First, define the tokens and patterns you want to match by using the appropriate `pyparsing` classes like `Word`, `Literal`, etc. You can assign these to a variable and optionally specify result names or parse actions to process matched tokens.

2. **Parse the String**: Use the `parse_string()` method on the pattern variable you created. This method takes the input string as an argument and parses it according to the defined pattern.

3. **Handle Results**: The `parse_string()` method returns a `ParseResults` object, which can be accessed like a list, dictionary, or object with named attributes, depending on how you defined your parser.

Here is an example to parse dates in the format `YYYY/MM/DD`:

```python
from pyparsing import Word, nums, ParseException

# Parse action to convert tokens from str to int
de
```

# Thank You for visiting The Hackers Playbook! 🌐

If you liked this research material:

- [Subscribe to our newsletter.](https://thehackersplaybook.substack.com)

- [Follow us on LinkedIn.](https://www.linkedin.com/company/the-hackers-playbook/)

- [Leave a star on our GitHub.](https://www.github.com/thehackersplaybook)