### 🚀 WaterCrawl × FLARE — the perfect duo for RAG playgrounds!

Welcome to this **step‑by‑step Jupyter Notebook** where we:

1. 🕷 **Crawl & clean** any website with **WaterCrawl** – turning raw HTML into markdown/JSON that’s ready for embeddings.   
2. 🔍 **Retrieve on‑the‑fly** with **FLARE (Forward‑Looking Active REtrieval)** – an “always‑be‑fact‑checking” wrapper that pulls extra docs *only* when the LLM shows low confidence.  
3. 🛠 **Tie it all together** with **LangChain**, **Tavily Search API** & a few helper utils so you can remix the pipeline to your heart’s content.

---

#### What’s inside?

| 🔧 Component | 💡 Why we’re using it |
|--------------|----------------------|
| **WaterCrawl** | Point‑&‑shoot crawling with sitemap visualizer, duplicate detection, and markdown/JSON exports – perfect for vector DB ingestion. |
| **LangChain** | Glue layer that lets us chain the crawl → embed → FLARE retrieval steps with a few lines of code. |
| **Tavily Search API** | Fast, inexpensive web search that slots into `TavilyRetriever`; great complement to your own crawled corpora. |
| **FLARE** | Re‑checks the model’s “next sentence” for shaky tokens; if confidence is low, it auto‑generates a smart query and fetches fresh docs before writing. |

---

#### Notebook flow 🗺️

1. **Setup**: grab your API key from https://watercrawl.dev/, or spin up your own `watercrawl` instance from https://github.com/watercrawl/watercrawl. Calling the WaterCrawl API from Python requires the Python SDK, which we install in the steps below.  
2. **FLARE chain**: initialize the chain with `FlareChain.from_llm(llm, retriever=retriever)`, where the retriever pairs **Tavily** search with **WaterCrawl** scraping.  
3. **Ask away!**: watch FLARE pause, retrieve, and resume writing—as many times as needed—to give rock‑solid answers.  
4. **Extras**: show off the visual sitemap PNG WaterCrawl generated and link each node to its vector IDs.  

---

### Why you’ll ❤️ this combo

- **Less hallucination, more citation**: WaterCrawl hands FLARE pristine, source‑mapped text, so every sentence can be traced back to a URL.  
- **Pay only for what you need**: FLARE calls Tavily *selectively*, not on every token—so your search bill stays tiny.  
- **Drop‑in for any stack**: swap Tavily for your own BM25/Elastic/Weaviate retriever, or point WaterCrawl at authenticated intranet sites.  
- **Open‑source all the way**: MIT‑style licences on both projects mean you can fork, tweak, and ship to prod.

> **Tip:** if you’re new to WaterCrawl, follow the quick start at https://github.com/watercrawl/watercrawl?tab=readme-ov-file#-quick-start, then open `http://localhost` after `docker compose up -d` and explore the Playground UI, which includes selector testing and screenshot capture! 🎨

---

Ready? Let’s spin up containers and start crawling! 🏁


##### ➡️ **Let’s install 📦 all the dependencies:**


In [None]:
!pip install --upgrade pip
!pip install langchain-community langchain-core langchain-openai notebook watercrawl-py tavily-python python-dotenv requests


### ➡️ 🔑 **API keys you’ll need (grab these first!)** 

| Service | What it’s for | Where to generate |
|---------|---------------|-------------------|
| **WaterCrawl** | Auth for crawling endpoints | <https://app.watercrawl.dev/dashboard/api-keys> |
| **OpenAI** | LLM + embeddings | <https://platform.openai.com/api-keys> |
| **Tavily Search** | Web search for FLARE | <https://app.tavily.com/home> |

---

**Option 1 – keep it clean: use a `.env` file**

Create the file **once**, store your keys, and everything else “just works”.

```python
# ── create_env.py ──
# Writes a template .env file; replace each placeholder with your real key.
env_text = """
OPENAI_API_KEY=put-your-api-key-here
TAVILY_API_KEY=put-your-api-key-here
WATERCRAWL_API_KEY=put-your-api-key-here
""".strip()

with open(".env", "w") as f:
    f.write(env_text)
print(".env file created. Now edit it with your real keys ✏️")
```

---

**Option 2 – quick‑and‑dirty: hard‑code in the notebook** ⚠️

```python
OPENAI_API_KEY = "put-your-api-key-here"
TAVILY_API_KEY = "put-your-api-key-here"
WATERCRAWL_API_KEY = "put-your-api-key-here"
```

Not recommended: anyone who sees the notebook, or the repository it is committed to, can read your keys.





##### ➡️ **If you’re using a `.env` file, load the API keys with dotenv**


In [7]:

from dotenv import load_dotenv
import os

load_dotenv()  # pulls everything from .env

OPENAI_API_KEY   = os.environ.get("OPENAI_API_KEY")
TAVILY_API_KEY   = os.environ.get("TAVILY_API_KEY")
WATERCRAWL_API_KEY = os.environ.get("WATERCRAWL_API_KEY")
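A missing key usually surfaces much later as an opaque authentication error, so it can help to check right after loading. The snippet below is a minimal sketch that assumes the three variable names used above; `missing_keys` is a hypothetical helper written for this notebook, not part of `python-dotenv` or any other library.

```python
import os

REQUIRED_KEYS = ("OPENAI_API_KEY", "TAVILY_API_KEY", "WATERCRAWL_API_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the names of required keys that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not (env.get(name) or "").strip()]


missing = missing_keys()
if missing:
    print("Missing API keys: " + ", ".join(missing))
else:
    print("All API keys found")
```

Passing a plain dictionary as `env` makes the helper easy to try without touching the real environment.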


##### ➡️ **Import our packages**:

In [12]:
from typing import Any, List

import requests
from langchain.chains import FlareChain
from langchain_core.callbacks import (
    AsyncCallbackManagerForRetrieverRun,
    CallbackManagerForRetrieverRun,
)
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever
from langchain_openai import ChatOpenAI, OpenAI
from pydantic import BaseModel
from watercrawl import WaterCrawlAPIClient


#### ➡️ **Let’s build the WaterCrawl retriever**:

In [8]:

def search_tool(query: str, api_key: str, max_results: int = 3) -> List[str]:
    """Search the web with the Tavily Search API and return the result URLs."""
    url = "https://api.tavily.com/search"
    payload = {
        "query": query,
        "topic": "general",
        "search_depth": "basic",
        "max_results": max_results,
        "include_answer": False,
        "include_raw_content": False,
        "include_domains": [],
        "exclude_domains": []
    }
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    # A timeout keeps a stalled request from hanging the notebook.
    response = requests.post(url, json=payload, headers=headers, timeout=30)
    response.raise_for_status()
    results = response.json()
    # Build the URL list once, then log and return it.
    urls = [item["url"] for item in results.get("results", []) if item.get("url")]
    print(urls)
    return urls


class WaterCrawlRetriever(BaseRetriever, BaseModel):
    """Retriever that finds URLs with Tavily and scrapes each page with WaterCrawl."""

    client: WaterCrawlAPIClient
    tavily_api_key: str
    page_options: dict = {
        "exclude_tags": ["nav", "footer", "aside"],
        "include_tags": ["article", "main"],
        "wait_time": 100,
        "include_html": False,
        "only_main_content": True,
        "include_links": False
    }

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun, **kwargs: Any
    ) -> List[Document]:
        documents = []
        try:
            urls = search_tool(query, self.tavily_api_key, max_results=3)
            for url in urls:
                try:
                    result = self.client.scrape_url(url=url, page_options=self.page_options, sync=True, download=True)
                    content = result.get("content", "")
                    if content:
                        documents.append(Document(page_content=content, metadata={"source": url}))
                except Exception as e:
                    print(f"Failed to fetch content from {url}: {e}")
        except Exception as e:
            print(f"Tavily search failed: {e}")
        return documents

    async def _aget_relevant_documents(
        self,
        query: str,
        *,
        run_manager: AsyncCallbackManagerForRetrieverRun,
        **kwargs: Any,
    ) -> List[Document]:
        raise NotImplementedError()
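The retriever above follows a two-stage pattern: search for URLs, then scrape each one and skip any page that fails. The sketch below isolates that control flow with stand-in functions so it runs without API keys. `collect_pages`, `fake_search`, and `fake_scrape` are hypothetical names used only for illustration; they are not part of WaterCrawl, Tavily, or LangChain.

```python
from typing import Callable, Dict, List


def collect_pages(
    query: str,
    search: Callable[[str], List[str]],
    scrape: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Search for URLs, scrape each one, and keep only pages with content."""
    pages = []
    for url in search(query):
        try:
            content = scrape(url)
        except Exception as exc:  # one bad page should not sink the whole query
            print(f"Skipping {url}: {exc}")
            continue
        if content:
            pages.append({"page_content": content, "source": url})
    return pages


# Stand-ins for the Tavily search and the WaterCrawl scrape (assumptions).
def fake_search(query: str) -> List[str]:
    return [
        "https://example.com/a",
        "https://example.com/broken",
        "https://example.com/empty",
    ]


def fake_scrape(url: str) -> str:
    if url.endswith("/broken"):
        raise RuntimeError("simulated timeout")
    return "" if url.endswith("/empty") else f"markdown for {url}"


pages = collect_pages("what is watercrawl", fake_search, fake_scrape)
print(pages)
```

Catching errors per URL, rather than around the whole loop, is what lets one unreachable page be dropped while the remaining results still reach the chain.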



#### ➡️ **Create the LangChain retriever object using the `WaterCrawlRetriever` we built above**:

In [9]:

retriever = WaterCrawlRetriever(
    client=WaterCrawlAPIClient(api_key=WATERCRAWL_API_KEY),
    tavily_api_key=TAVILY_API_KEY,
)

#### ➡️ **FLARE Chain**

In [10]:
# We set this so we can see what exactly is going on
from langchain.globals import set_verbose
set_verbose(True)

In [14]:
llm = ChatOpenAI(model="gpt-4o", temperature=0)
flare = FlareChain.from_llm(
    llm,
    retriever=retriever,
    max_generation_len=164,
    min_prob=0.3,
)

In [18]:
query = "Explain what is watercrawl tool and how I can improve the LLM performance?"

In [19]:
flare.invoke(query)



> Entering new FlareChain chain...
Current Response: 
Generated Questions: ['What type of software is the Watercrawl tool?', 'How can you optimize the data extraction process to improve the LLM performance?']
['https://github.com/watercrawl/watercrawl', 'https://docs.watercrawl.dev/intro', 'https://watercrawl.dev/']
['https://techcommunity.microsoft.com/blog/azure-ai-services-blog/maximizing-data-extraction-precision-with-dual-llms-integration-and-human-in-the/4236728', 'https://medium.com/intel-tech/four-data-cleaning-techniques-to-improve-large-language-model-llm-performance-77bee9003625', 'https://www.turing.com/resources/understanding-data-processing-techniques-for-llms']
Current Response:  The Watercrawl tool is a software tool used for web crawling and data extraction. It helps in collecting data from websites for various purposes such as research, analysis, or monitoring. To improve the LLM (Large Language Model) performan

{'user_input': 'Explain what is watercrawl tool and how I can improve the LLM performance?',
 'response': 'The Watercrawl tool is a software tool used for web crawling and data extraction. It helps in collecting data from websites for various purposes such as research, analysis, or monitoring. To improve the LLM (Large Language Model) performance, you can consider optimizing the data collection process using Watercrawl to ensure high-quality and relevant data inputs for training the model. Additionally, you can also focus on fine-tuning the hyperparameters of the LLM and increasing the training data size to enhance its performance. '}

#### ➡️ **Now let’s run a plain OpenAI call, without retrieval, so we can see the value of the FLARE chain**

In [24]:
llm.invoke(query)

"\n\nWatercrawl is a web performance testing tool that helps in analyzing the load and stress on a website or web application. It simulates real-world user traffic and measures the website's response time, throughput, and server performance under different load conditions.\n\nTo improve the LLM (Load, Latency, and Memory) performance using Watercrawl, the following steps can be taken:\n\n1. Identify bottlenecks: Watercrawl helps in identifying the areas of the website that are causing performance issues. It provides detailed reports on page load times, HTTP requests, and server response times, which can help in identifying the bottlenecks.\n\n2. Optimize website code: Based on the reports generated by Watercrawl, developers can optimize the website's code to reduce page load times and improve server response times. This can include techniques like minimizing HTTP requests, optimizing images, and using caching mechanisms.\n\n3. Test under different load conditions: Watercrawl allows tes

##### 🚨⚠️ As you can see, for the same test query, the answer from the **same LLM** without retrieval is **completely wrong** ❌🤯‼️

> 💬 It confidently gives a **wrong answer** — showing **why refinement and retrieval matter** so much in real-world usage.


#### ➡️ **For further information**:
##### 📘 Introduction to FlareChain in LangChain

**FlareChain** is an advanced chain in the LangChain framework 🧠⚙️ designed to *iteratively refine answers* from a language model. It improves response quality by:

🔍 Identifying **low-confidence** spans  
❓ Generating **clarifying questions**  
📚 Retrieving **relevant context**  
🔁 Updating the answer in a loop
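The loop above can be sketched in plain Python. This is a toy illustration under stated assumptions, not LangChain’s implementation: `flare_loop`, `toy_draft`, and `toy_retrieve` are hypothetical, the “model” is a function that returns `(token, probability)` pairs, and the search query is built by appending the uncertain words to the question rather than by asking an LLM to write clarifying questions.

```python
from typing import Callable, List, Tuple

Token = Tuple[str, float]  # (text, probability)


def flare_loop(
    question: str,
    draft: Callable[[str, List[str]], List[Token]],
    retrieve: Callable[[str], List[str]],
    min_prob: float = 0.2,
    max_iter: int = 10,
) -> Tuple[str, int]:
    """Redraft until no token falls below min_prob; return (answer, retrieval count)."""
    context: List[str] = []
    retrievals = 0
    tokens = draft(question, context)
    for _ in range(max_iter):
        shaky = [text for text, prob in tokens if prob < min_prob]
        if not shaky:
            break  # every token is confident enough, so stop refining
        # Turn the uncertain words into a search query and fetch more context.
        context.extend(retrieve(f"{question} {' '.join(shaky)}"))
        retrievals += 1
        tokens = draft(question, context)
    return " ".join(text for text, _ in tokens), retrievals


# Toy model: unsure about the key word until some context is available.
def toy_draft(question: str, context: List[str]) -> List[Token]:
    if context:
        return [("WaterCrawl", 0.9), ("is", 0.9), ("a", 0.9), ("crawler", 0.95)]
    return [("WaterCrawl", 0.9), ("is", 0.9), ("a", 0.9), ("tester", 0.1)]


def toy_retrieve(query: str) -> List[str]:
    return ["WaterCrawl is a web crawler."]


answer, retrievals = flare_loop("What is WaterCrawl?", toy_draft, toy_retrieve)
print(answer, retrievals)
```

Retrieval fires only when a draft contains a low-probability token, which is why a confident first draft would cost no search calls at all.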

---

#### 🧩 Key Arguments of `FlareChain`

##### ❓ 1. `question_generator_chain`
Generates retrieval questions from the low-confidence spans.

##### 🗣 2. `response_chain`
Generates the actual response using user input + context.

##### 🧾 3. `output_parser`
Checks whether the current answer is “good enough” to stop refinement.

##### 📡 4. `retriever`
Fetches documents to provide factual backup for refining the answer.

##### 📉 5. `min_prob`
Low-confidence threshold (default: `0.2`) – tokens below this are flagged for review.

##### ↔️ 6. `min_token_gap`
Ensures separation between two flagged spans (default: `5` tokens).

##### 🧷 7. `num_pad_tokens`
Adds context tokens around flagged spans (default: `2`).

##### 🔁 8. `max_iter`
Max number of refinement cycles (default: `10`).

##### 🧭 9. `start_with_retrieval`
If `True`, starts by retrieving context even before generating the first draft.
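To make `min_prob`, `min_token_gap`, and `num_pad_tokens` concrete, here is a small sketch of how flagged tokens could be grouped into spans. It is loosely modeled on the behaviour described above and should not be read as the library’s own code: `low_confidence_spans` is a hypothetical function, the token probabilities are made up, and padding is applied only to the tokens that follow a flagged one.

```python
import math
from typing import List, Sequence


def low_confidence_spans(
    tokens: Sequence[str],
    log_probs: Sequence[float],
    min_prob: float = 0.2,
    min_token_gap: int = 5,
    num_pad_tokens: int = 2,
) -> List[str]:
    """Group low-probability tokens into padded text spans."""
    low_idx = [i for i, lp in enumerate(log_probs) if math.exp(lp) < min_prob]
    spans: List[List[int]] = []
    previous = None
    for idx in low_idx:
        end = idx + num_pad_tokens + 1  # pad with a few following tokens
        if previous is not None and idx - previous < min_token_gap:
            spans[-1][1] = end  # close to the last flagged token: extend that span
        else:
            spans.append([idx, end])
        previous = idx
    return ["".join(tokens[start:end]) for start, end in spans]


tokens = ["Water", "Crawl", " is", " a", " load", " testing", " tool", " from", " 2019", "."]
probs = [0.9, 0.9, 0.95, 0.9, 0.1, 0.15, 0.8, 0.9, 0.05, 0.99]
log_probs = [math.log(p) for p in probs]

spans = low_confidence_spans(
    tokens, log_probs, min_prob=0.2, min_token_gap=3, num_pad_tokens=1
)
print(spans)  # [' load testing tool', ' 2019.']
```

In this example the two adjacent shaky tokens merge into one span because they are fewer than `min_token_gap` tokens apart, while the later one starts a span of its own.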

---

#### 🧾 Inputs and Outputs

- 📥 **Input Key**: `user_input`  
- 📤 **Output Key**: `response`  

💡 The chain processes a single user prompt and returns an *improved, confident, and context-aware response*.
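Given those keys, the dictionary printed by `flare.invoke(query)` earlier can be unpacked as shown below. This is a minimal sketch: the `result` literal is a shortened, hand-written stand-in with the same two keys, so the snippet runs without calling the chain.

```python
# Hand-written stand-in with the same keys as the chain output shown earlier.
result = {
    "user_input": "Explain what is watercrawl tool and how I can improve the LLM performance?",
    "response": "The Watercrawl tool is a software tool used for web crawling and data extraction. ",
}

answer = result["response"].strip()  # the generated text may carry trailing whitespace
print(f"Q: {result['user_input']}\nA: {answer}")
```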

---

#### 📚 References

- [LangChain FlareChain Documentation](https://api.python.langchain.com/en/latest/langchain/chains/langchain.chains.flare.base.FlareChain.html)
