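This notebook covers the embedding pipeline: fetching the raw documents from MongoDB, splitting them into word-based chunks, and generating a spaCy embedding for each chunk.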

In [1]:
import logging
import spacy
from pymongo import MongoClient
import uuid

In [23]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl (400.7 MB)
     400.7/400.7 MB 34.0 MB/s eta 0:00:00
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


First, we fetch every document stored in the raw_data collection of the RAG_DB database in MongoDB.

In [2]:
def fetch_raw_data():
    """Fetch raw data from MongoDB."""
    try:
        client = MongoClient("mongodb://localhost:27017/")
        db = client["RAG_DB"]
        collection = db["raw_data"]
        raw_data = list(collection.find())
        logging.info(f"Fetched {len(raw_data)} documents from MongoDB.")
        return raw_data
    except Exception as e:
        logging.error(f"Error fetching data from MongoDB: {e}")
        return []
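
fetch_raw_data reports its result through logging.info, but Python's root logger defaults to the WARNING level, so those messages are not shown unless logging is configured first. The sketch below is one minimal way to enable them. The in-memory stream and the example message are only there to make the effect visible and are assumptions for illustration; in the notebook a plain logging.basicConfig(level=logging.INFO) call would normally be enough.

```python
import io
import logging

# Capture log output in memory so the effect of the configuration can be inspected.
log_stream = io.StringIO()

logging.basicConfig(
    level=logging.INFO,                  # let INFO records through (default is WARNING)
    format="%(levelname)s:%(message)s",
    stream=log_stream,
    force=True,                          # replace handlers installed earlier (Python 3.8+)
)

# Illustrative message only; fetch_raw_data emits its own text.
logging.info("example message")

captured = log_stream.getvalue()
print(captured)
```

Run this once near the top of the notebook, before calling fetch_raw_data, so the info and error messages from the later cells become visible.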

In [3]:
raw_data = fetch_raw_data()

In [4]:
raw_data

[{'_id': ObjectId('6756585b1f13d22a30cd0406'),
  'file_name': 'README.md',
  'content': 'ros 2 documentation this repository contains the sources for the ros 2 documentation that is hosted at [https://docs.ros.org/en](https://docs.ros.org/en). the sources from this repository are built and uploaded to the site nightly by a [jenkins job](https://build.ros.org/job/doc_ros2doc). contributing to the documentation contributions to this site are most welcome. please see the [contributing to ros 2 documentation](https://docs.ros.org/en/rolling/the ros2 project/contributing/contributing to ros 2 documentation.html) page to learn more. contributing to ros 2 to contribute to the ros 2 source code project please refer to the [ros 2 contributing guidelines](https://docs.ros.org/en/rolling/the ros2 project/contributing.html). prerequisites to build this you need to install make graphviz with [venv](https://docs.python.org/3/library/venv.html) pinned versions for development we currently use noble a

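Next, we split each document's content into chunks of at most 200 words.
Chunking is word-based: the text is split on whitespace, and each run of 200 consecutive words is joined back into one chunk.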

In [5]:
def chunk_data(data, chunk_size=200):
    """Chunk data into smaller pieces by words."""
    words = data.split()
    chunks = []
    for i in range(0, len(words), chunk_size):
        chunks.append(" ".join(words[i:i + chunk_size]))
    return chunks
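
To make the chunking behaviour concrete, here is a small worked example. It repeats the same logic as chunk_data so that it runs on its own, and uses a made-up seven-word string with a chunk size of 3 (both chosen only for illustration).

```python
def chunk_data(data, chunk_size=200):
    """Chunk data into smaller pieces by words (same logic as the cell above)."""
    words = data.split()  # split on any whitespace; runs of spaces and newlines collapse
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

# Made-up sample text, used only to show the behaviour.
sample = "one two  three four\nfive six seven"
chunks = chunk_data(sample, chunk_size=3)
print(chunks)  # ['one two three', 'four five six', 'seven']
```

Two things to note: the last chunk holds whatever words are left over, so it can be shorter than chunk_size, and because the text is re-joined with single spaces, the original line breaks and repeated spaces are not preserved. A text with fewer words than chunk_size comes back as a single chunk.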

In [6]:
for doc in raw_data:
    print(doc['file_name'])

README.md
constraints.txt
make_sitemapindex.py
requirements.txt
setup.py
conf.py
global_substitutions.txt
plugins/sphinx_sitemap_ros.py
source/Installation.rst
source/How-To-Guides.rst
source/index.rst
source/Citations.rst
source/Glossary.rst
source/Concepts.rst
source/Related-Projects.rst
source/Contact.rst
source/Package-Docs.rst
source/The-ROS2-Project.rst
source/Releases.rst
source/Tutorials.rst
source/Tutorials/Beginner-CLI-Tools.rst
source/Tutorials/Beginner-Client-Libraries.rst
source/Tutorials/Demos.rst
source/Tutorials/Advanced.rst
source/Tutorials/Intermediate.rst
source/Tutorials/Miscellaneous.rst
source/Tutorials/Intermediate/Monitoring-For-Parameter-Changes-Python.rst
source/Tutorials/Intermediate/Monitoring-For-Parameter-Changes-CPP.rst
source/Tutorials/Intermediate/Rosdep.rst
source/Tutorials/Intermediate/Creating-an-Action.rst
source/Tutorials/Intermediate/Composition.rst
source/Tutorials/Intermediate/Writing-a-Composable-Node.rst
source/Tutorials/Intermediate/URDF/Expo

In [7]:
constraints_text = raw_data[1]['content']
constraints_text

'alabaster==0.7.12 babel==2.14.0 certifi==2020.6.20 chardet==4.0.0 doc8==1.1.1 docutils==0.20.1 idna==2.10 imagesize==1.3.0 jinja2==3.0.3 markupsafe==2.0.1 packaging==21.3 pbr==5.8.0 polib==1.2.0 pygments==2.17.2 pyparsing==2.4.7 pytz==2022.1 requests==2.25.1 restructuredtext_lint==1.3.2 snowballstemmer==2.2.0 sphinx==7.2.6 sphinx copybutton==0.5.2 sphinx lint==0.9.1 sphinx multiversion==0.2.4 sphinx rtd theme==2.0.0 sphinx tabs==3.4.5 sphinxcontrib applehelp==1.0.4 sphinxcontrib devhelp==1.0.2 sphinxcontrib htmlhelp==2.0.1 sphinxcontrib jquery==4.1 sphinxcontrib jsmath==1.0.1 sphinxcontrib mermaid==0.9.2 sphinxcontrib qthelp==1.0.3 sphinxcontrib serializinghtml==1.1.10 stevedore==3.5.0 urllib3==1.26.5'

In [9]:
chunks_constr = chunk_data(constraints_text, 100)
print(chunks_constr)

['alabaster==0.7.12 babel==2.14.0 certifi==2020.6.20 chardet==4.0.0 doc8==1.1.1 docutils==0.20.1 idna==2.10 imagesize==1.3.0 jinja2==3.0.3 markupsafe==2.0.1 packaging==21.3 pbr==5.8.0 polib==1.2.0 pygments==2.17.2 pyparsing==2.4.7 pytz==2022.1 requests==2.25.1 restructuredtext_lint==1.3.2 snowballstemmer==2.2.0 sphinx==7.2.6 sphinx copybutton==0.5.2 sphinx lint==0.9.1 sphinx multiversion==0.2.4 sphinx rtd theme==2.0.0 sphinx tabs==3.4.5 sphinxcontrib applehelp==1.0.4 sphinxcontrib devhelp==1.0.2 sphinxcontrib htmlhelp==2.0.1 sphinxcontrib jquery==4.1 sphinxcontrib jsmath==1.0.1 sphinxcontrib mermaid==0.9.2 sphinxcontrib qthelp==1.0.3 sphinxcontrib serializinghtml==1.1.10 stevedore==3.5.0 urllib3==1.26.5']


Then we generate a 300-dimensional embedding vector for each chunk, using the word vectors of spaCy's en_core_web_lg model.

In [10]:
def generate_embeddings(raw_data):
    """Generate one embedding per chunk using spaCy's en_core_web_lg model."""
    embeddings = []
    nlp = spacy.load("en_core_web_lg")

    for doc in raw_data:
        if "content" in doc:
            chunks = chunk_data(doc['content'])
            for chunk in chunks:
                doc_chunk = nlp(chunk)
                embedding = doc_chunk.vector
                payload = {
                    "source": doc.get("source", ""),
                    "url": doc.get("url", ""),
                    "file_name": doc.get("file_name", ""),
                    "repo_name": doc.get("repo_name", ""),
                    "content": doc["content"],  # full document text, not the list of all chunks
                    "chunk": chunk,
                }
                embeddings.append({
                    'id': str(uuid.uuid4()),
                    'embedding': embedding.tolist(),
                    'payload': payload
                })
        else:
            logging.warning(f"Skipping document without content: {doc}")

    logging.info(f"Generated embeddings for {len(embeddings)} chunks.")
    return embeddings
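
For intuition about what doc_chunk.vector holds: spaCy documents that Doc.vector defaults to the average of the token vectors, so each chunk embedding is a mean over the vectors of its words. The sketch below reproduces that averaging in plain Python on toy 3-dimensional vectors. The mean_pool helper and the toy values are assumptions for illustration only; they are not part of spaCy or of this notebook's pipeline, where the vectors are 300-dimensional.

```python
def mean_pool(token_vectors):
    """Average equal-length vectors component by component."""
    if not token_vectors:
        return []
    count = len(token_vectors)
    # zip(*...) groups the first components together, then the second, and so on.
    return [sum(component) / count for component in zip(*token_vectors)]

# Toy stand-ins for the token vectors of a two-word chunk.
toy_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
]

pooled = mean_pool(toy_vectors)
print(pooled)  # [2.0, 2.0, 1.0]
```

The pooled vector has the same length as each token vector, which is why every chunk ends up with a vector of the same size regardless of how many words it contains.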

In [11]:
embeddings = generate_embeddings(raw_data)

In [49]:
for embedding in embeddings:
    print(embedding['payload']['chunk'])
    print("*" * 50)

ros 2 documentation this repository contains the sources for the ros 2 documentation that is hosted at [https://docs.ros.org/en](https://docs.ros.org/en). the sources from this repository are built an
**************************************************
d uploaded to the site nightly by a [jenkins job](https://build.ros.org/job/doc_ros2doc). contributing to the documentation contributions to this site are most welcome. please see the [contributing to
**************************************************
 ros 2 documentation](https://docs.ros.org/en/rolling/the ros2 project/contributing/contributing to ros 2 documentation.html) page to learn more. contributing to ros 2 to contribute to the ros 2 sourc
**************************************************
e code project please refer to the [ros 2 contributing guidelines](https://docs.ros.org/en/rolling/the ros2 project/contributing.html). prerequisites to build this you need to install make graphviz wi
*******************************************