#### NOTE:
It is recommended to set a `USER_AGENT` environment variable so that web requests are identified, which reduces the chance of being blocked while scraping the OpenCV documentation.

The environment variable is defined in the `.env` file at the project root.

In [1]:
from dotenv import load_dotenv
import os
load_dotenv()
user_agent = os.getenv("USER_AGENT", "DefaultUserAgent")
print(user_agent)

DefaultUserAgent
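
The `DefaultUserAgent` output suggests that the fallback was used, i.e. that `USER_AGENT` was not found in the environment (or that `.env` sets it to that literal value). A minimal sketch of a check that makes the fallback explicit; the helper name `resolve_user_agent` is hypothetical and not part of the original notebook:

```python
import os
import warnings

DEFAULT_USER_AGENT = "DefaultUserAgent"  # same fallback string as in the cell above


def resolve_user_agent(env=None) -> str:
    """Return USER_AGENT from the given mapping, warning when the fallback is used."""
    env = os.environ if env is None else env
    user_agent = env.get("USER_AGENT", "").strip()
    if not user_agent:
        warnings.warn("USER_AGENT is not set; falling back to the default value.")
        return DEFAULT_USER_AGENT
    return user_agent


# A plain dict stands in for the real environment here (assumed value).
print(resolve_user_agent({"USER_AGENT": "my-crawler/0.1"}))
```

Called without arguments, the helper reads `os.environ`, so it can run right after `load_dotenv()`.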


Synchronously load all documents, starting from the root URL of the OpenCV documentation. We use a BeautifulSoup-based extractor to parse the HTML into an LLM-friendly format.

In [2]:
import re
from langchain.document_loaders import RecursiveUrlLoader
from bs4 import BeautifulSoup

def bs4_extractor(html: str) -> str:
    soup = BeautifulSoup(html, "lxml")
    return re.sub(r"\n\n+", "\n\n", soup.text).strip()

loader = RecursiveUrlLoader(
    "https://docs.opencv.org/4.x/", 
    extractor=bs4_extractor,
    max_depth=5)
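
The regular expression in `bs4_extractor` collapses runs of blank lines into a single blank line. The sketch below isolates that step using only the standard library, applied to a made-up string rather than real OpenCV HTML. The helper name `collapse_blank_lines` is an assumption for illustration:

```python
import re


def collapse_blank_lines(text: str) -> str:
    """Replace runs of two or more newlines with exactly two, then trim both ends."""
    return re.sub(r"\n\n+", "\n\n", text).strip()


# Made-up sample text with excessive blank lines.
sample = "\n\nOpenCV modules\n\n\n\nMain modules\ncore\n\n\n"
print(repr(collapse_blank_lines(sample)))
```

Single newlines are left untouched, so line structure inside a paragraph or code listing is preserved.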

Loading may take a while, because each page is searched for links in order to scrape the whole documentation. For future tests, it is worth experimenting with the `RecursiveUrlLoader` parameters, especially `max_depth`, to balance completeness against speed.
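
To get a feel for the `max_depth` trade-off, here is a toy breadth-first crawl over a small in-memory link graph. It is only an illustration under stated assumptions: the graph is hypothetical, and the traversal order and depth-counting convention may differ from what `RecursiveUrlLoader` does internally.

```python
from collections import deque
from typing import List

# Hypothetical link graph: page -> pages it links to (not real OpenCV URLs).
LINKS = {
    "root": ["modules", "tutorials"],
    "modules": ["core", "imgproc"],
    "tutorials": ["core"],
    "core": ["mat"],
    "imgproc": [],
    "mat": [],
}


def crawl(start: str, max_depth: int) -> List[str]:
    """Visit pages breadth-first, keeping only pages whose depth is below max_depth."""
    visited: List[str] = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth >= max_depth:
            continue  # too deep: neither load the page nor follow its links
        visited.append(page)
        for child in LINKS.get(page, []):
            if child not in seen:  # avoid loading the same page twice
                seen.add(child)
                queue.append((child, depth + 1))
    return visited


for depth in (1, 2, 3, 4):
    print(depth, len(crawl("root", depth)))
```

Even in this tiny graph the page count grows with every extra level, which is why a lower `max_depth` loads faster but can miss deeply nested reference pages.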

In [3]:

docs = loader.load()

In [4]:
len(docs)

1919

Check the titles of each document to make sure that all (or most) of the documentation has been extracted.

In [5]:
for doc in docs:
    print(doc.metadata["title"])

OpenCV: OpenCV modules
OpenCV: Deformable Part-based Models
OpenCV: Bibliography
OpenCV: cv::dpm::DPMDetector Class Reference
OpenCV: Basic structures
OpenCV: cv::MatIterator_< _Tp > Class Template Reference
OpenCV: cv::MatConstIterator Class Reference
OpenCV: cv::SparseMat Class Reference
OpenCV: cv::UMatData Struct Reference
OpenCV: cv::SparseMatIterator_< _Tp > Class Template Reference
OpenCV: cv::ParamType< uchar > Struct Reference
OpenCV: cv::ParamType< float > Struct Reference
OpenCV: opencv2/core/matx.hpp File Reference
OpenCV: cv::NAryMatIterator Class Reference
OpenCV: cv::Scalar_< _Tp > Class Template Reference
OpenCV: cv::_InputOutputArray Class Reference
OpenCV: opencv2/core.hpp File Reference
OpenCV: cv::Formatter Class Reference
OpenCV: cv::TermCriteria Class Reference
OpenCV: cv::Point_< _Tp > Class Template Reference
OpenCV: cv::Mat_< _Tp > Class Template Reference
OpenCV: opencv2/core/mat.hpp File Reference
OpenCV: cv::Mat Class Reference
OpenCV: cv::Range Class Refere
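
With close to two thousand documents, reading every title is tedious. The sketch below groups titles by their trailing label to give a quick overview of what was extracted. The `titles` list holds a few entries copied from the output above; the `KINDS` tuple and the `page_kind` helper are assumptions for illustration.

```python
from collections import Counter

# A few titles copied from the printed output.
titles = [
    "OpenCV: OpenCV modules",
    "OpenCV: cv::dpm::DPMDetector Class Reference",
    "OpenCV: cv::UMatData Struct Reference",
    "OpenCV: opencv2/core.hpp File Reference",
    "OpenCV: cv::Mat Class Reference",
]

# Trailing labels seen in the titles above (assumed to cover the common cases).
KINDS = ("Class Template Reference", "Class Reference", "Struct Reference", "File Reference")


def page_kind(title: str) -> str:
    """Classify a page title by its trailing label, or 'Other' when none matches."""
    for kind in KINDS:
        if title.endswith(kind):
            return kind
    return "Other"


counts = Counter(page_kind(title) for title in titles)
print(counts.most_common())
```

On the real data, the list could be built with `[doc.metadata.get("title", "") for doc in docs]`; using `.get` avoids a `KeyError` for pages that carry no title.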

Documents are split by token count, so that each chunk fits the constraints of the embedding model.

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from typing import Optional, List, Tuple
from transformers import AutoTokenizer
from langchain.docstore.document import Document as LangchainDocument

EMBEDDING_MODEL_NAME = "thenlper/gte-small"
MARKDOWN_SEPARATORS = [
    "\n#{1,6} ",
    "```\n",
    "\n\\*\\*\\*+\n",
    "\n---+\n",
    "\n___+\n",
    "\n\n",
    "\n",
    " ",
    "",
]


def split_documents(
    chunk_size: int,
    knowledge_base: List[LangchainDocument],
    tokenizer_name: Optional[str] = EMBEDDING_MODEL_NAME,
) -> List[LangchainDocument]:
    """
    Split documents into chunks of maximum size `chunk_size` tokens and return a list of documents.
    """
    text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
        AutoTokenizer.from_pretrained(tokenizer_name),
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size / 10),
        add_start_index=True,
        strip_whitespace=True,
        separators=MARKDOWN_SEPARATORS,
    )

    docs_processed = []
    for doc in knowledge_base:
        docs_processed += text_splitter.split_documents([doc])

    # Remove duplicates
    unique_texts = {}
    docs_processed_unique = []
    for doc in docs_processed:
        if doc.page_content not in unique_texts:
            unique_texts[doc.page_content] = True
            docs_processed_unique.append(doc)

    return docs_processed_unique


docs_processed = split_documents(
    512,  # We choose a chunk size adapted to our model
    docs,
    tokenizer_name=EMBEDDING_MODEL_NAME,
)


In [7]:
len(docs_processed)

10722
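
The chunk size and overlap arithmetic can be illustrated with a simplified fixed-window splitter. This is a sketch under loud assumptions: whitespace-separated words stand in for model tokens, and the sliding window ignores the separator hierarchy that `RecursiveCharacterTextSplitter` uses, so it is not a reimplementation of `split_documents`.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks


# Whitespace "tokens" as a stand-in for the Hugging Face tokenizer.
tokens = "the quick brown fox jumps over the lazy dog again and again".split()
chunks = chunk_tokens(tokens, chunk_size=5, chunk_overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk starts with the last token of the previous one, which is the role the overlap plays in the notebook, where it is set to one tenth of `chunk_size`.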

Embeddings are computed with the `thenlper/gte-small` model and indexed with FAISS, using cosine similarity.

In [None]:
from langchain.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores.utils import DistanceStrategy
EMBEDDING_MODEL_NAME = "thenlper/gte-small"
embedding_model = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL_NAME,
    multi_process=True,
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True},  # Set `True` for cosine similarity
)




In [None]:
KNOWLEDGE_VECTOR_DATABASE = FAISS.from_documents(
    docs_processed, embedding_model, distance_strategy=DistanceStrategy.COSINE
)
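
The `normalize_embeddings=True` setting matters because, for unit-length vectors, the inner product equals the cosine similarity and the squared Euclidean distance equals `2 - 2 * cosine`, so all three produce the same ranking. The sketch below checks this on toy 2-D vectors (made-up values, not real embeddings) and makes no claim about which index type FAISS builds internally.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a: List[float], b: List[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]  # toy vectors
ua, ub = normalize(a), normalize(b)

cos_ab = cosine(a, b)                                 # cosine on the raw vectors
inner_unit = dot(ua, ub)                              # inner product after normalization
sq_l2_unit = sum((x - y) ** 2 for x, y in zip(ua, ub))  # squared distance after normalization

print(cos_ab, inner_unit, sq_l2_unit)
```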

Because document extraction takes a long time, the vector DB is saved under the `../data` folder. It can then be loaded in later sessions without extracting the documentation again.

In [None]:
KNOWLEDGE_VECTOR_DATABASE.save_local("../data/ExtractedDocuments/Exploratory/InitialExplorationVectorDB")

The following block repeats code from previous blocks, because the embedding model used to build the vector DB must be recreated when the DB is loaded in a new session. **Note:** The following block can be safely skipped if the whole process is run in a single session.

In [None]:
from langchain.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores.utils import DistanceStrategy
EMBEDDING_MODEL_NAME = "thenlper/gte-small"
embedding_model = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL_NAME,
    multi_process=True,
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True},  # Set `True` for cosine similarity
)
KNOWLEDGE_VECTOR_DATABASE = FAISS.load_local(
    "../data/ExtractedDocuments/Exploratory/InitialExplorationVectorDB", embedding_model, allow_dangerous_deserialization=True
)
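
Before deciding whether to load or rebuild, it can help to check that a saved index is actually present. The sketch assumes the default `index_name="index"`, under which `save_local` is expected to write `index.faiss` and `index.pkl`; verify this against the installed LangChain version. The helper name `vector_db_exists` is hypothetical.

```python
import tempfile
from pathlib import Path


def vector_db_exists(folder: str, index_name: str = "index") -> bool:
    """Return True when both expected index files are present in the folder."""
    base = Path(folder)
    return all((base / f"{index_name}.{ext}").is_file() for ext in ("faiss", "pkl"))


# Demonstration on a temporary folder instead of the real data directory.
with tempfile.TemporaryDirectory() as tmp:
    print(vector_db_exists(tmp))  # nothing saved yet
    for ext in ("faiss", "pkl"):
        (Path(tmp) / f"index.{ext}").touch()
    print(vector_db_exists(tmp))  # both files present
```

In the notebook, the check could guard the `FAISS.load_local` call and fall back to the extraction and embedding steps when it returns `False`.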



In [3]:
user_query="What modules can I use to detect faces without the need of additional files?"
print(f"\nStarting retrieval for {user_query=}...")

retrieved_docs = KNOWLEDGE_VECTOR_DATABASE.similarity_search(query=user_query, k=5)

print("\n==================================Top document==================================")

print(retrieved_docs[0].page_content)

print("==================================Metadata==================================")

print(retrieved_docs[0].metadata)


Starting retrieval for user_query='What modules can I use to detect faces without the need of additional files?'...

Run face detection network to detect faces on input image. function detectFaces(img) {
  netDet.setInputSize(new cv.Size(img.cols, img.rows));
  var out = new cv.Mat();
  netDet.detect(img, out);
  var faces = [];
  for (var i = 0, n = out.data32F.length; i < n; i += 15) {
    var left = out.data32F[i];
    var top = out.data32F[i + 1];
    var right = (out.data32F[i] + out.data32F[i + 2]);
    var bottom = (out.data32F[i + 1] + out.data32F[i + 3]);
    left = Math.min(Math.max(0, left), img.cols - 1);
    top = Math.min(Math.max(0, top), img.rows - 1);
    right = Math.min(Math.max(0, right), img.cols - 1);
    bottom = Math.min(Math.max(0, bottom), img.rows - 1);
 
    if (left < right && top < bottom) {
      faces.push({
        x: left,
        y: top,
        width: right - left,
        height: bottom - top,
        x1: out.data32F[i + 4] < 0 || out.data32F[i + 4
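
The cell above prints only the top hit, although `k=5` documents are retrieved. A small helper can list every hit with its title and source for a quick relevance check. This is a sketch: `summarize_hits` is a hypothetical helper, the `title` and `source` metadata keys are assumed to be present (hence the `.get` defaults), and `SimpleNamespace` objects with made-up content stand in for the retrieved documents.

```python
from types import SimpleNamespace
from typing import List


def summarize_hits(docs, preview_chars: int = 60) -> List[str]:
    """Build one summary line per retrieved document: rank, title, source, preview."""
    lines = []
    for rank, doc in enumerate(docs, start=1):
        title = doc.metadata.get("title", "<no title>")
        source = doc.metadata.get("source", "<no source>")
        # Collapse whitespace so the preview fits on one line.
        preview = " ".join(doc.page_content.split())[:preview_chars]
        lines.append(f"{rank}. {title} ({source}): {preview}")
    return lines


# Stand-in documents with hypothetical metadata and content.
fake_docs = [
    SimpleNamespace(
        metadata={"title": "Page A", "source": "https://example.com/a"},
        page_content="Run face detection\nnetwork to detect faces.",
    ),
    SimpleNamespace(metadata={}, page_content="Another   chunk of text."),
]
for line in summarize_hits(fake_docs):
    print(line)
```

With the real results, `summarize_hits(retrieved_docs)` would work unchanged, since only the `metadata` and `page_content` attributes are used.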

In [4]:
from transformers import pipeline
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

READER_MODEL_NAME = "HuggingFaceH4/zephyr-7b-beta"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(READER_MODEL_NAME, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(READER_MODEL_NAME)

READER_LLM = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    do_sample=True,
    temperature=0.2,
    repetition_penalty=1.1,
    return_full_text=False,
    max_new_tokens=500,
)

`low_cpu_mem_usage` was None, now default to True since model is quantized.



In [5]:
prompt_in_chat_format = [
    {
        "role": "system",
        "content": """Using the information contained in the context,
give a comprehensive answer to the question delimited by <>.
Respond only to the question asked, response should be concise and relevant to the question.
Never include code in your answers. Do not include implementation examples.
If the answer cannot be deduced from the context, do not give an answer.""",
    },
    {
        "role": "user",
        "content": """Context:
{context}
---

<{question}>""",
    },
]
RAG_PROMPT_TEMPLATE = tokenizer.apply_chat_template(
    prompt_in_chat_format, tokenize=False, add_generation_prompt=True
)
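
`RAG_PROMPT_TEMPLATE` is a plain string with `{context}` and `{question}` placeholders, later filled with `str.format`. Since the retrieved chunks contain source code with curly braces, it is worth confirming that braces inside the substituted values are not interpreted as placeholders. The template below is a simplified stand-in, not the exact string returned by `apply_chat_template`, and the context and question are made up.

```python
# Simplified stand-in for the chat-formatted template (assumption).
TEMPLATE = "Context:\n{context}\n---\n\n<{question}>"

# Context containing curly braces, as in the retrieved JavaScript snippet.
context = "function detectFaces(img) { return []; }"

prompt = TEMPLATE.format(context=context, question="How are faces detected?")
print(prompt)
```

Only braces in the template itself are parsed by `str.format`; a template that needs literal braces would have to double them (`{{` and `}}`).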

In [6]:
context = "\nExtracted documents:\n"
# Separate documents with a newline so one document's text does not run into the next header
context += "\n".join(f"Document {i}:::\n{doc.page_content}" for i, doc in enumerate(retrieved_docs))

final_prompt = RAG_PROMPT_TEMPLATE.format(question="What are the best methods to detect faces? For each one, display needed external files.", context=context)

# Generate an answer
answer = READER_LLM(final_prompt)[0]["generated_text"]
print(answer)

There are several methods for detecting faces, and the choice of method depends on the specific application and requirements. Here are some popular techniques:

1. Haar Cascade Classifiers: This is a widely used technique that involves training a classifier using positive and negative samples. The classifier is then applied to an image to determine whether it contains a face or not. External files required: Haar Cascade XML file (such as "haarcascade_frontalface_alt.xml")

2. Convolutional Neural Networks (CNN): CNNs are deep learning algorithms that can learn to recognize faces through a large number of training images. They can also be fine-tuned for specific tasks such as facial landmark detection or age estimation. External files required: Trained CNN weights (such as "resnet50.hdf5")

3. Local Binary Patterns Histograms (LBPH): LBPH is a feature extraction technique that converts an image into a fixed-length vector of features. These vectors can then be compared to a database of k