### Document extraction

This notebook explores in depth the best ways to extract information about OpenCV. In the first exploratory notebook, documents were extracted from OpenCV's documentation using a recursive URL loader. One of the main challenges of this RAG is making the information Python-specific, which is difficult because OpenCV is written in C++ and `opencv-python` is a wrapper that provides Python bindings. Most solutions to this challenge will likely come in the preprocessing and prompt engineering sections of this RAG, but this extraction phase includes an exploratory task: attaching metadata that may help identify and filter out C++-centric code and documentation. With this in mind, and building on the initial exploration in the [ExploringOpenCV](/../Notebooks/ExploringOpenCV.ipynb) notebook, this notebook has the following goals:

1. Further explore the best way to extract information from OpenCV's documentation.
    1. Explore different values for the most relevant `RecursiveUrlLoader` parameters. Analyze document count versus extraction speed.
    2. Explore the effectiveness of using multiple loaders instead of a single one, starting from more specific root URLs.
    3. For each new extraction idea, compare each URL's title with the [OpenCV documentation index](https://docs.opencv.org/4.x/index.html) to assess relevance.
    4. Record and analyze failed or slow requests to refine the `RecursiveUrlLoader` parameters.
2. Find new sources for information regarding OpenCV's features.
    1. Explore the best way to extract information from [OpenCV tutorials](https://docs.opencv.org/4.x/d9/df8/tutorial_root.html). `RecursiveUrlLoader` is probably the best approach again, but alternatives should be compared.
    2. Extract information from [LearnOpenCV](https://learnopencv.com), specifically from its three main guides:
        - [Getting Started with OpenCV](https://learnopencv.com/getting-started-with-opencv/)
        - [Getting Started with PyTorch](https://learnopencv.com/getting-started-with-pytorch/)
        - [Getting Started with Keras & Tensorflow](https://learnopencv.com/getting-started-with-tensorflow-keras/)
    In this case, `RecursiveUrlLoader` might not be the best idea. Instead, an initial approach is to build a list of tutorials for each guide, since these lists are easy to access.
    3. Explore the usefulness of including [FreeCodeCamp's OpenCV Course](https://www.youtube.com/watch?v=oXlwWbU8l2o) as part of the knowledge base. This would require a transcript of the video, which might be computationally expensive to produce and carries a risk of inaccuracies.
3. Create a relevance measurement for each loader. This tool would compare each document's title and summary to a base document (to be defined) and flag potentially irrelevant documents. This step addresses two main concerns regarding data extraction:
    - `RecursiveUrlLoader` may extract documents that are not relevant to a person looking for documentation, such as a contact page.
    - Some documents might be too C++-oriented, such as the use of CMake to install OpenCV. These should be discarded.
4. For each document in each loader, extract relevant metadata that might indicate that the content is not relevant for Python, so that it can be pre-processed in a later stage and converted to the Python equivalent if needed.
5. Measure performance indicators for each loader, such as extraction speed, completeness, and failures. These will determine which loaders are included in the final version.
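
As a starting point for goals 1 and 5, the comparison of document count versus extraction speed could be sketched as below. This is only a minimal sketch under assumptions: `LoaderRun` and `benchmark_loader` are hypothetical names introduced here, and the function accepts any object exposing a `.load()` method, so it is not tied to a specific loader. The commented usage assumes `RecursiveUrlLoader` is imported from `langchain_community.document_loaders`, which may differ depending on the installed LangChain version.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List, Optional


@dataclass
class LoaderRun:
    """Result of one extraction attempt (hypothetical helper type)."""
    params: Dict[str, Any]
    n_documents: int
    seconds: float
    error: Optional[str]


def benchmark_loader(
    make_loader: Callable[..., Any],
    param_grid: Iterable[Dict[str, Any]],
) -> List[LoaderRun]:
    """Build a loader for each parameter set, call .load(), and record
    the document count, elapsed wall-clock time, and any failure."""
    runs: List[LoaderRun] = []
    for params in param_grid:
        start = time.perf_counter()
        try:
            docs = make_loader(**params).load()
            n_docs, error = len(docs), None
        except Exception as exc:  # record the failure instead of stopping
            n_docs, error = 0, repr(exc)
        elapsed = time.perf_counter() - start
        runs.append(LoaderRun(dict(params), n_docs, elapsed, error))
    return runs


# Hypothetical usage (requires network access and LangChain installed):
# from langchain_community.document_loaders import RecursiveUrlLoader
# runs = benchmark_loader(
#     lambda **kw: RecursiveUrlLoader("https://docs.opencv.org/4.x/", **kw),
#     [{"max_depth": depth, "timeout": 10} for depth in (1, 2, 3)],
# )
# for run in runs:
#     print(run.params, run.n_documents, f"{run.seconds:.1f}s", run.error)
```

Keeping failures in the result list, rather than raising, makes it possible to compare completeness and failure counts across loaders alongside speed.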

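Goals 3 and 4 could begin with a very simple heuristic, sketched below. Everything here is an assumption made for illustration: the Jaccard token overlap is only one possible relevance measure, the base document is still to be defined, and the marker list `CPP_MARKERS`, the threshold `min_score`, and the function names are hypothetical placeholders that would need tuning against real documents.

```python
import re
from typing import Any, Dict, Set

# Assumed substrings that hint at C++-oriented content (illustrative only).
CPP_MARKERS = ("cmake", "#include", "cv::", "std::", "makefile")


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def relevance_score(text: str, base_text: str) -> float:
    """Jaccard overlap between the tokens of a document and a base document."""
    doc_tokens, base_tokens = tokenize(text), tokenize(base_text)
    if not doc_tokens or not base_tokens:
        return 0.0
    return len(doc_tokens & base_tokens) / len(doc_tokens | base_tokens)


def python_relevance_metadata(
    title: str, content: str, base_text: str, min_score: float = 0.05
) -> Dict[str, Any]:
    """Return metadata flags that a later preprocessing stage could use."""
    lowered = content.lower()
    cpp_hits = [marker for marker in CPP_MARKERS if marker in lowered]
    # Rough signal that the page already contains Python usage of OpenCV.
    has_python = "import cv2" in lowered or "cv2." in lowered
    score = relevance_score(f"{title} {content}", base_text)
    return {
        "relevance_score": score,
        "low_relevance": score < min_score,
        "cpp_markers": cpp_hits,
        "cpp_only": bool(cpp_hits) and not has_python,
    }
```

The returned dictionary could be merged into each document's metadata, so that documents flagged as `low_relevance` or `cpp_only` can be reviewed, discarded, or converted to a Python equivalent in a later stage.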