# Intel Geti Docs Chatbot

## Using SuperDuperDB to Connect to the Database

In [1]:
from superduperdb import superduper
import os
mongodb_uri = os.getenv("SUPERDUPERDB_DATA_BACKEND","mongomock://test")
db = superduper(mongodb_uri)
db.drop(force=True)



2024-Mar-07 11:56:03.66| INFO     | zhouhaha-2.local| superduperdb.base.build:65   | Data Client is ready. MongoClient(host=['mongodb:27017'], document_class=dict, tz_aware=False, connect=True, serverselectiontimeoutms=5000)
2024-Mar-07 11:56:03.67| INFO     | zhouhaha-2.local| superduperdb.base.build:38   | Connecting to Metadata Client with engine:  MongoClient(host=['mongodb:27017'], document_class=dict, tz_aware=False, connect=True, serverselectiontimeoutms=5000)
2024-Mar-07 11:56:03.68| INFO     | zhouhaha-2.local| superduperdb.base.build:148  | Connecting to compute client: local
2024-Mar-07 11:56:03.68| INFO     | zhouhaha-2.local| superduperdb.base.datalayer:85   | Building Data Layer


## Build a RAG data processing workflow step by step

### Step1: Crawling Pages

**Crawl pages based on the provided links.**

In [2]:
import requests
from bs4 import BeautifulSoup
import re
from urllib.parse import urljoin
from superduperdb.misc.retry import Retry
from superduperdb import logging


def process_code_snippets(text):
    soup = BeautifulSoup(text, "html.parser")
    pre_tags = soup.find_all("pre")

    for pre in pre_tags:
        processed_text = str(pre.text)
        new_content = "CODE::" + soup.new_string(processed_text)
        pre.clear()
        pre.append(new_content)
    return str(soup)


def process_py_class(source_html):
    soup = BeautifulSoup(source_html, "html.parser")
    dl_tags = soup.find_all("dl", class_="py class")

    for dl in dl_tags:
        dt_tag = dl.find("dt", class_="sig sig-object py")
        if not dt_tag:
            continue
        headerlinks = dt_tag.find_all("a", class_="headerlink")
        # Use the last header link if one exists; find_all returns [] otherwise
        href = headerlinks[-1]["href"] if headerlinks else ""
        class_id = dt_tag.attrs.get("id", "")
        new_h3 = soup.new_tag("h3")
        new_a_inside_h3 = soup.new_tag("a", href=href)
        new_a_inside_h3.string = f"Class: {class_id}"
        new_h3.append(new_a_inside_h3)

        new_code = soup.new_tag("a")
        new_code.string = dt_tag.text
        dt_tag.insert_before(new_h3)
        dt_tag.insert_before(new_code)
        dt_tag.decompose()

    return str(soup)

def parse_url(seed_url):
    retry = Retry(exception_types=(Exception,))

    @retry
    def get_response(url):
        # Use the function argument rather than the enclosing seed_url
        response = requests.get(url)
        return response
        
    print(f"parse {seed_url}")
    response = get_response(seed_url)
    # Parse the HTML content
    source_html = response.text
    source_html = process_code_snippets(source_html)
    source_html = process_py_class(source_html)

    return source_html


def url2html(url):
    try:
        html = parse_url(url)
    except Exception as e:
        logging.error(e)
        html = ""
    return html
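
The crawler wraps its HTTP call in a retry helper. As a rough illustration of the idea only (this is not the `superduperdb.misc.retry.Retry` implementation), here is a minimal retry decorator with exponential backoff. The names `retry_sketch` and `flaky_fetch` are hypothetical, and `flaky_fetch` stands in for a real network request.

```python
import time
from functools import wraps


def retry_sketch(exception_types=(Exception,), attempts=3, base_delay=0.0):
    """Illustrative retry decorator; not the superduperdb Retry class."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exception_types:
                    if attempt == attempts:
                        raise  # out of attempts: surface the last error
                    # Exponential backoff: base_delay, 2x, 4x, ...
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator


calls = {"count": 0}


@retry_sketch(exception_types=(ConnectionError,), attempts=3)
def flaky_fetch(url):
    # Hypothetical stand-in for requests.get: fails twice, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return f"<html>{url}</html>"


result = flaky_fetch("https://example.com")
print(result, calls["count"])
```

Listing only the exception types you expect to be transient keeps genuine bugs from being retried silently.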

**Now we can test the `url2html` function**

In [3]:
page = url2html("https://openvinotoolkit.github.io/geti-sdk/getting_started.html")

parse https://openvinotoolkit.github.io/geti-sdk/getting_started.html


**After we confirm that this function is working properly, we can add it as a model**

In [4]:
from superduperdb import Model, Listener, Schema
from superduperdb.backends.mongodb import Collection

url_model = Model(
    identifier='url2html',
    object=url2html,
    model_update_kwargs={"document_embedded": False},
)
db.add(url_model)

2024-Mar-07 11:56:04.36| INFO     | zhouhaha-2.local| superduperdb.components.component:333  | Initializing DataType : dill
2024-Mar-07 11:56:04.36| INFO     | zhouhaha-2.local| superduperdb.components.component:336  | Initialized  DataType : dill successfully


([],
 ObjectModel(identifier='url2html', signature='*args,**kwargs', datatype=None, output_schema=None, flatten=False, model_update_kwargs={'document_embedded': False}, metrics=(), validation_sets=None, predict_kwargs={}, object=<function url2html at 0x162cf9990>, num_workers=0))

### Step2: Parse HTML and chunk

**Use `unstructured` to extract the elements of an HTML page**

In [5]:
from unstructured.partition.html import partition_html

def page2elements(page):
    elements = partition_html(text=page, html_assemble_articles=True)
    return elements

In [6]:
elements = page2elements(page)
print('\n\n'.join([e.text for e in elements[:5]]))

[2024-03-07 11:56:05] unstructured INFO Reading document from string ...
[2024-03-07 11:56:05] unstructured INFO Reading document ...


Introduction

Welcome to the Intel® Geti™ SDK! The Intel® Geti™ platform enables
teams to rapidly develop AI models. The platform reduces the time needed to build
models by easing the complexities of model development and harnessing greater
collaboration between teams. Most importantly, the platform unlocks faster
time-to-value for digitization initiatives with AI.

The Intel® Geti™ SDK is a python package which contains tools to interact with an
Intel® Geti™ server via the REST API. It provides functionality for:

Project creation from annotated datasets on disk

Project downloading (images, videos, configuration, annotations, predictions and models)


**In this application, we use titles to segment text, so we first define a function for title recognition**

In [7]:
from unstructured.documents.elements import ElementType

def get_title_data(element):
    data = {}
    if element.category != ElementType.TITLE:
        return data
    if 'link_urls' not in element.metadata.to_dict():
        return data

    if 'category_depth' not in element.metadata.to_dict():
        return data

    # Guard against titles whose link_texts is missing or empty
    link_texts = element.metadata.link_texts
    if not link_texts or not link_texts[0]:
        return data

    link_urls = element.metadata.link_urls
    if not link_urls:
        return data
    category_depth = element.metadata.category_depth
    return {'link': link_urls[0], 'category_depth':category_depth}

In [8]:
print(get_title_data(elements[0]))

{'link': '#introduction', 'category_depth': 0}
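
The `link` and `category_depth` values are what the chunking step later uses to rebuild the heading hierarchy. The sketch below is a simplified version of that bookkeeping: each title replaces everything at or below its own depth in the current path. The function name `section_paths` and the sample anchors are made up for illustration.

```python
def section_paths(titles):
    """Map a flat list of (link, depth) titles to '::'-joined section paths."""
    paths = []
    current = ["root"]
    for link, depth in titles:
        # Keep 'root' plus the ancestors above this depth, then add the title.
        current = current[: depth + 1] + [link]
        paths.append("::".join(current))
    return paths


# Hypothetical headings: (anchor link, category_depth)
titles = [
    ("#getting-started", 0),
    ("#installation", 1),
    ("#python-version", 2),
    ("#usage", 1),
    ("#api", 0),
]
for path in section_paths(titles):
    print(path)
```

A deeper title extends the path, while a title at the same or a shallower depth starts a sibling branch under the matching ancestor.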


**Define a function that converts an element to text, handling each element type differently.**

In [9]:
import pandas as pd
from io import StringIO
def element2text(element):
    title_message = get_title_data(element)
    text = element.text
    if title_message:
        title_tags = '#' * (title_message['category_depth'] + 1)
        text = title_tags + ' ' + text
        text = text.rstrip('#')

    elif element.category == ElementType.LIST_ITEM:
        text = '- ' + text

    elif element.category == ElementType.TABLE:
        html = element.metadata.text_as_html
        html = html.replace('|', '')
        df = pd.read_html(StringIO(html))[0]
        text = df.to_markdown(index=False)
        text = text + '  \n'

    if text.startswith("CODE::"):
        text = f"```\n{text[6:]}\n```"

    return text

In [10]:
print(element2text(elements[1]))

Welcome to the Intel® Geti™ SDK! The Intel® Geti™ platform enables
teams to rapidly develop AI models. The platform reduces the time needed to build
models by easing the complexities of model development and harnessing greater
collaboration between teams. Most importantly, the platform unlocks faster
time-to-value for digitization initiatives with AI.
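
The element printed above is plain narrative text, so it passes through unchanged. To show the other branches, here is a small standalone sketch of the same rendering rules applied to plain `(category, text)` pairs instead of `unstructured` elements. The `render` helper and the sample inputs are illustrative assumptions, and the table branch is left out.

```python
FENCE = "`" * 3  # built at runtime so the example can sit inside a fenced block


def render(category, text, depth=0):
    """Render one (category, text) pair roughly the way element2text does."""
    if category == "Title":
        text = "#" * (depth + 1) + " " + text
    elif category == "ListItem":
        text = "- " + text
    if text.startswith("CODE::"):
        # Strip the marker added during crawling and wrap the code in a fence
        text = f"{FENCE}\n{text[len('CODE::'):]}\n{FENCE}"
    return text


rendered = [
    render("Title", "Installation", depth=1),
    render("ListItem", "Media upload and prediction"),
    render("NarrativeText", "CODE::pip install ."),
]
print("\n\n".join(rendered))
```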


**Define the chunk function: it takes all elements of a page and splits them into chunks**

In [11]:
def get_chunk_texts(text, chunk_size=1000, overlap_size=300):
    chunks = []
    start = 0

    while start < len(text):
        if chunks:
            start -= overlap_size
        end = start + chunk_size
        end = min(end, len(text))
        chunks.append(text[start:end])
        start = end
        if start >= len(text):
            break

    return chunks

from collections import defaultdict
def get_chunks(elements):
    chunk_tree = defaultdict(list)
    now_depth = -1
    now_path = 'root'
    for element in elements:
        title_data = get_title_data(element)
        if not title_data:
            chunk_tree[now_path].append(element)
        else:
            link = title_data['link']
            depth = title_data['category_depth']
            if depth > now_depth:
                now_path = now_path + "::" +link
            else:
                now_path = '::'.join(now_path.split("::")[:depth+1] + [link])
            now_depth = depth
            chunk_tree[now_path].append(element)
     
    chunks = []
    for node_path, node_elements in chunk_tree.items():
        new_elements = []
        nodes = node_path.split("::")
        parent_elements = []
        for i in range(1, len(nodes) - 1):
            [parent_element, *_] = chunk_tree["::".join(nodes[:i+1])] or [None]
            if parent_element:
                parent_elements.append(parent_element)
        node_elements = [*parent_elements, *node_elements]
        content = '\n\n'.join(map(lambda x: element2text(x), node_elements))
        for chunk_text in get_chunk_texts(content):
            # The href field stores the anchor link to jump to
            # The text field is used for vector search
            # The content field is submitted to the LLM as context
            chunk = {"href": nodes[-1], 'text': chunk_text, 'content': content}
            chunks.append(chunk)
    return chunks
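
To make the overlap behaviour of `get_chunk_texts` easier to see, here is a standalone sketch of the same sliding-window idea with small sizes. The name `sliding_chunks` and the explicit `ValueError` are additions for illustration. The guard is there because the loop above appears never to terminate when `overlap_size >= chunk_size` and the text is longer than one chunk.

```python
def sliding_chunks(text, chunk_size=10, overlap_size=3):
    """Split text into windows of chunk_size that share overlap_size chars."""
    if overlap_size >= chunk_size:
        raise ValueError("overlap_size must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap_size
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window reached the end of the text
        start += step
    return chunks


demo = sliding_chunks("abcdefghijklmnopqrstuvwxyz")
print(demo)
```

Each chunk starts with the last three characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk.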

In [12]:
chunks = get_chunks(elements)

In [13]:
for chunk in chunks[:3]:
    print(chunk)

{'href': '#introduction', 'text': '# Introduction\uf0c1\n\nWelcome to the Intel® Geti™ SDK! The Intel® Geti™ platform enables\nteams to rapidly develop AI models. The platform reduces the time needed to build\nmodels by easing the complexities of model development and harnessing greater\ncollaboration between teams. Most importantly, the platform unlocks faster\ntime-to-value for digitization initiatives with AI.\n\nThe Intel® Geti™ SDK is a python package which contains tools to interact with an\nIntel® Geti™ server via the REST API. It provides functionality for:\n\n- Project creation from annotated datasets on disk\n\n- Project downloading (images, videos, configuration, annotations, predictions and models)\n\n- Project creation and upload from a previous download\n\n- Deploying a project for local inference with OpenVINO\n\n- Getting and setting project and model configuration\n\n- Launching and monitoring training jobs\n\n- Media upload and prediction\n\nThis repository also conta

**Finally, we define a function that converts an HTML page into chunks, so that it can consume the page output by the model we defined above.**

In [14]:
def page2chunks(page):
    elements = page2elements(page)
    chunks = get_chunks(elements)
    return chunks

In [15]:
chunks = page2chunks(page)
chunks[0]

[2024-03-07 11:56:06] unstructured INFO Reading document from string ...
[2024-03-07 11:56:06] unstructured INFO Reading document ...


{'href': '#introduction',
 'text': '# Introduction\uf0c1\n\nWelcome to the Intel® Geti™ SDK! The Intel® Geti™ platform enables\nteams to rapidly develop AI models. The platform reduces the time needed to build\nmodels by easing the complexities of model development and harnessing greater\ncollaboration between teams. Most importantly, the platform unlocks faster\ntime-to-value for digitization initiatives with AI.\n\nThe Intel® Geti™ SDK is a python package which contains tools to interact with an\nIntel® Geti™ server via the REST API. It provides functionality for:\n\n- Project creation from annotated datasets on disk\n\n- Project downloading (images, videos, configuration, annotations, predictions and models)\n\n- Project creation and upload from a previous download\n\n- Deploying a project for local inference with OpenVINO\n\n- Getting and setting project and model configuration\n\n- Launching and monitoring training jobs\n\n- Media upload and prediction\n\nThis repository also cont

**After we confirm that this function is working properly, we can add it as a model**

In [16]:
from superduperdb import Model, Listener, Schema

chunk_model = Model(
    identifier='chunk',
    object=page2chunks,
    flatten=True,
    model_update_kwargs={"document_embedded": False},
)
db.add(chunk_model)

([],
 ObjectModel(identifier='chunk', signature='*args,**kwargs', datatype=None, output_schema=None, flatten=True, model_update_kwargs={'document_embedded': False}, metrics=(), validation_sets=None, predict_kwargs={}, object=<function page2chunks at 0x29614c820>, num_workers=0))

### Step3: Embedding

**We will embed all chunks**

In [17]:
from superduperdb.ext.openai import OpenAIEmbedding
from superduperdb import VectorIndex

In [18]:
openai_emb_model = OpenAIEmbedding(
    identifier='text-embedding-ada-002',
    model="text-embedding-ada-002",
)
db.add(openai_emb_model)

[2024-03-07 11:56:07] httpx INFO HTTP Request: GET https://api.openai.com/v1/models "HTTP/1.1 200 OK"


([],
 OpenAIEmbedding(identifier='text-embedding-ada-002', datatype=DataType(identifier='vector[1536]', encoder=None, decoder=None, info=None, shape=(1536,), directory=None, encodable='native', bytes_encoding=<BytesEncoding.BYTES: 'Bytes'>), output_schema=None, flatten=False, model_update_kwargs={}, metrics=(), validation_sets=None, predict_kwargs={}, model='text-embedding-ada-002', client_kwargs={}, shape=(1536,), batch_size=100))

In [19]:
print(len(openai_emb_model.predict_one(chunk["content"])))

[2024-03-07 11:56:08] httpx INFO HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


1536


**The database stores chunks as dicts, while the application passes plain strings, so we add a preprocessing model that accepts both**

In [20]:
content_model = Model(
    identifier="get_content",
    object=lambda x:x['text'] if isinstance(x, dict) else x,
)

print(content_model.predict_one(chunk))

2024-Mar-07 11:56:08.19| INFO     | zhouhaha-2.local| superduperdb.components.component:333  | Initializing ObjectModel : get_content
2024-Mar-07 11:56:08.19| INFO     | zhouhaha-2.local| superduperdb.components.component:336  | Initialized  ObjectModel : get_content successfully
# Getting started


**We can use `SequentialModel` to easily connect multiple models in series**

In [21]:
from superduperdb.components.model import SequentialModel
embed_model = SequentialModel(identifier="embedding", predictors=[content_model, openai_emb_model])

In [22]:
print(len(embed_model.predict_one(chunk)))

  0%|          | 0/1 [00:00<?, ?it/s][2024-03-07 11:56:08] httpx INFO HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
100%|██████████| 1/1 [00:00<00:00,  3.03it/s]

1536





### Step4: LLM

In [23]:
from superduperdb.ext.openai import OpenAIChatCompletion
prompt = """
As an Intel GETI assistant, based on the provided documents and the question, answer the question.
If the document does not provide an answer, offer a safe response without fabricating an answer.

Documents:
{context}

Question: """

llm = OpenAIChatCompletion(identifier='gpt-3.5-turbo', prompt=prompt)

db.add(llm)

print(db.show('model'))

['chunk', 'gpt-3.5-turbo', 'text-embedding-ada-002', 'url2html']
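
The prompt template leaves a `{context}` slot and ends with `Question: `. A plausible way the final prompt is assembled is sketched below. This is an assumption about the mechanics, not a copy of the `OpenAIChatCompletion` internals: retrieved chunks are joined with newlines, substituted into the template, and the question is appended. `build_prompt` and the sample chunks are hypothetical.

```python
PROMPT_TEMPLATE = """
As an Intel GETI assistant, based on the provided documents and the question, answer the question.
If the document does not provide an answer, offer a safe response without fabricating an answer.

Documents:
{context}

Question: """


def build_prompt(question, context_chunks):
    # Assumption: chunks are newline-joined into {context}, and the question
    # is appended directly after the trailing "Question: ".
    return PROMPT_TEMPLATE.format(context="\n".join(context_chunks)) + question


final_prompt = build_prompt(
    "How to install python sdk",
    ["## Installation", "Use pip install . from the repository root."],
)
print(final_prompt)
```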


In [24]:
context = chunks[0]['text']
context

'# Introduction\uf0c1\n\nWelcome to the Intel® Geti™ SDK! The Intel® Geti™ platform enables\nteams to rapidly develop AI models. The platform reduces the time needed to build\nmodels by easing the complexities of model development and harnessing greater\ncollaboration between teams. Most importantly, the platform unlocks faster\ntime-to-value for digitization initiatives with AI.\n\nThe Intel® Geti™ SDK is a python package which contains tools to interact with an\nIntel® Geti™ server via the REST API. It provides functionality for:\n\n- Project creation from annotated datasets on disk\n\n- Project downloading (images, videos, configuration, annotations, predictions and models)\n\n- Project creation and upload from a previous download\n\n- Deploying a project for local inference with OpenVINO\n\n- Getting and setting project and model configuration\n\n- Launching and monitoring training jobs\n\n- Media upload and prediction\n\nThis repository also contains a set of (tutorial style) Jupy

In [25]:
llm.predict_one("Introduce Geti™ SDK!", context=context)

[2024-03-07 11:56:12] httpx INFO HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


'The Intel Geti™ SDK is a platform that enables teams to rapidly develop AI models. It reduces the time needed to build models by simplifying the complexities of model development and promoting greater collaboration between teams. Moreover, the platform unlocks faster time-to-value for digitization initiatives with AI. The SDK is a Python package that includes tools to interact with an Intel Geti™ server via the REST API. It provides functionality for project creation from annotated datasets on disk, project downloading (images, videos, configuration, annotations, predictions, and models), project creation and upload from a previous download, deploying a project for local inference with OpenVINO, getting and setting project and model configuration, launching and monitoring training jobs, and media upload and prediction. Additionally, the repository contains a set of tutorial-style Jupyter notebooks that demonstrate the capabilities of the Geti™ SDK.'

## Chain the workflow steps above together and add them to the CDC service

### Step1: Crawling Pages

In [26]:
url_listener = Listener(
    model=url_model,
    select=Collection("url").find(),
    key="url",
)
db.add(url_listener)
print(url_listener.identifier, url_listener.outputs)


2024-Mar-07 11:56:12.53| INFO     | zhouhaha-2.local| superduperdb.backends.local.compute:32   | Submitting job. function:<function method_job at 0x1246dda20>


0it [00:00, ?it/s]

2024-Mar-07 11:56:12.54| INFO     | zhouhaha-2.local| superduperdb.components.component:333  | Initializing ObjectModel : url2html
2024-Mar-07 11:56:12.55| INFO     | zhouhaha-2.local| superduperdb.components.component:333  | Initializing DataType : dill
2024-Mar-07 11:56:12.55| INFO     | zhouhaha-2.local| superduperdb.components.component:336  | Initialized  DataType : dill successfully
2024-Mar-07 11:56:12.55| INFO     | zhouhaha-2.local| superduperdb.components.component:336  | Initialized  ObjectModel : url2html successfully
2024-Mar-07 11:56:12.55| INFO     | zhouhaha-2.local| superduperdb.components.model:649  | Adding 0 model outputs to `db`
2024-Mar-07 11:56:12.55| SUCCESS  | zhouhaha-2.lo




### Step2: Parse HTML and chunk

In [27]:
chunk_listener = Listener(
    model=chunk_model,
    select=Collection("_outputs.url.url2html").find(),
    key=f'_outputs.url.url2html.{url_listener.model.version}',
)

db.add(chunk_listener)

print(chunk_listener.identifier, chunk_listener.outputs)


2024-Mar-07 11:56:12.70| INFO     | zhouhaha-2.local| superduperdb.backends.local.compute:32   | Submitting job. function:<function method_job at 0x1246dda20>


0it [00:00, ?it/s]

2024-Mar-07 11:56:12.70| INFO     | zhouhaha-2.local| superduperdb.components.component:333  | Initializing ObjectModel : chunk
2024-Mar-07 11:56:12.70| INFO     | zhouhaha-2.local| superduperdb.components.component:336  | Initialized  ObjectModel : chunk successfully
2024-Mar-07 11:56:12.70| INFO     | zhouhaha-2.local| superduperdb.components.model:649  | Adding 0 model outputs to `db`
2024-Mar-07 11:56:12.71| SUCCESS  | zhouhaha-2.local| superduperdb.backends.local.compute:38   | Job submitted.  function:<function method_job at 0x1246dda20> future:bafb1b8e-dbfd-4205-b713-beeb80605c3e
chunk/_outputs.url.url2html.0 _outputs._outputs.url.url2html.0.chunk.0





### Step3: Embedding

In [28]:
embed_listener = Listener(
    select=Collection("_outputs.url.chunk").find(),
    key=f'_outputs.url.chunk.{chunk_listener.model.version}',  # Key for the documents
    model=embed_model,  # Specify the model for processing
    predict_kwargs={"max_chunk_size": 64},
)
print(embed_listener.identifier, embed_listener.outputs)
db.add(embed_listener)

embedding/_outputs.url.chunk.0 _outputs._outputs.url.chunk.0.embedding.None
2024-Mar-07 11:56:13.73| INFO     | zhouhaha-2.local| superduperdb.backends.local.compute:32   | Submitting job. function:<function method_job at 0x1246dda20>


0it [00:00, ?it/s]


2024-Mar-07 11:56:13.76| INFO     | zhouhaha-2.local| superduperdb.components.component:333  | Initializing ObjectModel : get_content
2024-Mar-07 11:56:13.77| INFO     | zhouhaha-2.local| superduperdb.components.component:336  | Initialized  ObjectModel : get_content successfully


0it [00:00, ?it/s]

2024-Mar-07 11:56:13.77| INFO     | zhouhaha-2.local| superduperdb.components.model:649  | Adding 0 model outputs to `db`
2024-Mar-07 11:56:13.77| SUCCESS  | zhouhaha-2.local| superduperdb.backends.local.compute:38   | Job submitted.  function:<function method_job at 0x1246dda20> future:88e17098-2978-4d13-8d91-f827e0f80142





([<superduperdb.jobs.job.ComponentJob at 0x1629eea10>],
 Listener(identifier='embedding/_outputs.url.chunk.0', key='_outputs.url.chunk.0', model=SequentialModel(identifier='embedding', signature='*args,**kwargs', datatype=DataType(identifier='vector[1536]', encoder=None, decoder=None, info=None, shape=(1536,), directory=None, encodable='native', bytes_encoding=<BytesEncoding.BYTES: 'Bytes'>), output_schema=None, flatten=False, model_update_kwargs={}, metrics=(), validation_sets=None, predict_kwargs={}, predictors=[ObjectModel(identifier='get_content', signature='*args,**kwargs', datatype=None, output_schema=None, flatten=False, model_update_kwargs={}, metrics=(), validation_sets=None, predict_kwargs={}, object=<function <lambda> at 0x29614dea0>, num_workers=0), OpenAIEmbedding(identifier='text-embedding-ada-002', datatype=DataType(identifier='vector[1536]', encoder=None, decoder=None, info=None, shape=(1536,), directory=None, encodable='native', bytes_encoding=<BytesEncoding.BYTES: 'By

## Create a vector index

In [29]:
vector_index = VectorIndex(
    identifier="vector_index",
    indexing_listener=embed_listener,)
db.add(vector_index)

2024-Mar-07 11:56:13.81| INFO     | zhouhaha-2.local| superduperdb.backends.local.compute:32   | Submitting job. function:<function callable_job at 0x1246ddcf0>
2024-Mar-07 11:56:13.82| SUCCESS  | zhouhaha-2.local| superduperdb.backends.local.compute:38   | Job submitted.  function:<function callable_job at 0x1246ddcf0> future:76925104-f71f-46dc-8ff3-7d514ebfbced


([<superduperdb.jobs.job.FunctionJob at 0x2973c93f0>],
 VectorIndex(identifier='vector_index', indexing_listener=Listener(identifier='embedding/_outputs.url.chunk.0', key='_outputs.url.chunk.0', model=SequentialModel(identifier='embedding', signature='*args,**kwargs', datatype=DataType(identifier='vector[1536]', encoder=None, decoder=None, info=None, shape=(1536,), directory=None, encodable='native', bytes_encoding=<BytesEncoding.BYTES: 'Bytes'>), output_schema=None, flatten=False, model_update_kwargs={}, metrics=(), validation_sets=None, predict_kwargs={}, predictors=[ObjectModel(identifier='get_content', signature='*args,**kwargs', datatype=None, output_schema=None, flatten=False, model_update_kwargs={}, metrics=(), validation_sets=None, predict_kwargs={}, object=<function <lambda> at 0x29614dea0>, num_workers=0), OpenAIEmbedding(identifier='text-embedding-ada-002', datatype=DataType(identifier='vector[1536]', encoder=None, decoder=None, info=None, shape=(1536,), directory=None, enco

## Create a RAG application

**Insert a web page**

In [30]:
from superduperdb import Document
url = "https://openvinotoolkit.github.io/geti-sdk/getting_started.html"
db.execute(Collection("url").insert_one(Document(**{"url": url})), refresh=True)

2024-Mar-07 11:56:13.87| INFO     | zhouhaha-2.local| superduperdb.base.datalayer:384  | CDC active, skipping refresh


([ObjectId('65e93add75c2fdf140a0a91a')], None)

**Wait a moment**
- the CDC service will run the data pipeline
- the vector search service will update the new vector index.

In [31]:
import time
time.sleep(5)
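
A fixed five-second sleep is fine for a demo, but it can be too short on a slow connection and wastefully long on a fast one. One alternative is to poll a readiness check until it passes or a timeout expires. The sketch below shows the pattern with a stand-in predicate. `wait_until` and `chunks_ready` are hypothetical helpers, and in the notebook the predicate would have to query the pipeline's output collection.

```python
import time


def wait_until(predicate, timeout=30.0, interval=0.5):
    """Poll predicate() until it returns True or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()  # one last check at the deadline


# Hypothetical readiness check: pretend the pipeline is ready on the third poll.
polls = {"count": 0}


def chunks_ready():
    polls["count"] += 1
    return polls["count"] >= 3


ready = wait_until(chunks_ready, timeout=5.0, interval=0.01)
print(ready, polls["count"])
```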

### Vector Search

In [32]:
def vector_search(query):
    outs = db.execute(
        Collection("_outputs.url.chunk")
        .like(Document({"_outputs.url.chunk.0": query}), vector_index="vector_index", n=3)
        .find()
    )
    if outs:
        outs = sorted(outs, key=lambda x: x["score"], reverse=True)
    for out in outs:
        print("-" * 20, "\n")
        data = out.outputs("url", "chunk")
        url = data["href"]
        print(url, out["score"])
        print(data["text"])


In [33]:
vector_search("How to install python sdk")

2024-Mar-07 11:56:18.98| INFO     | zhouhaha-2.local| superduperdb.base.datalayer:1047 | {}


  0%|          | 0/1 [00:00<?, ?it/s][2024-03-07 11:56:19] httpx INFO HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
100%|██████████| 1/1 [00:01<00:00,  1.00s/it]

-------------------- 

#installation 0.8307849168777466
# Getting started

## Installation

Using an environment manager such as
Anaconda or
venv to create a new
Python environment before installing the Intel® Geti™ SDK and its requirements is
highly recommended.

NOTE: If you have installed multiple versions of Python,
use py -3.8 venv -m <env_name> when creating your virtual environment to specify
a supported version (in this case 3.8). Once you activate the
virtual environment <venv_path>/Scripts/activate, make sure to upgrade pip
to the latest version python -m pip install --upgrade pip wheel setuptools.
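
The number printed next to `#installation` is a similarity score between the query embedding and the chunk embedding. As a rough mental model, assuming cosine similarity (the measure actually used depends on how the vector index is configured), a top-k search over a handful of vectors can be sketched as follows. The three-dimensional vectors are toy values, not real embeddings.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return (doc_id, score) pairs for the k most similar vectors."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy stand-ins for the 1536-dimensional embeddings
docs = {
    "#installation": [0.9, 0.1, 0.0],
    "#introduction": [0.1, 0.9, 0.0],
    "#notebooks": [0.0, 0.2, 0.9],
}
results = top_k([1.0, 0.0, 0.0], docs, k=2)
for doc_id, score in results:
    print(doc_id, round(score, 4))
```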





### QA

In [34]:
def qa(query, vector_search_top_k=5):
    collection = Collection("_outputs.url.chunk")
    output, sources = db.predict(
        model_name="gpt-3.5-turbo",
        input=query,
        context_select=collection.like(
            Document({"_outputs.url.chunk.0": query}),
            vector_index="vector_index",
            n=vector_search_top_k,
        ).find({}),
        context_key="_outputs.url.chunk.0.text",
    )
    if sources:
        sources = sorted(sources, key=lambda x: x["score"], reverse=True)
    print(output.unpack())
    for out in sources:
        print("-" * 20, "\n")
        data = out.outputs("url", "chunk")
        url = data["href"]
        print(url, out["score"])
        print(data["text"])


In [35]:
qa("How to install python sdk")

2024-Mar-07 11:56:20.17| INFO     | zhouhaha-2.local| superduperdb.base.datalayer:1047 | {}


  0%|          | 0/1 [00:00<?, ?it/s][2024-03-07 11:56:20] httpx INFO HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
100%|██████████| 1/1 [00:00<00:00,  2.44it/s]
[2024-03-07 11:56:23] httpx INFO HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


To install the Python SDK, it is recommended to use an environment manager such as Anaconda or venv to create a new Python environment. Then, you can follow the steps mentioned in the "Installation" section of the provided documents, depending on your needs:

- For base installation: Navigate to the root directory of the repository and install the SDK using `pip install .`.
- For notebooks installation (optional): If you want to run notebooks, install extra requirements using `pip install .[notebooks]`.
- For development installation (optional): To run tests or build documentation, install the package extra requirements by using `pip install -e .[dev]`. 

Follow these steps to install the Python SDK.
-------------------- 

#installation 0.8307616710662842
# Getting started

## Installation

Using an environment manager such as
Anaconda or
venv to create a new
Python environment before installing the Intel® Geti™ SDK and its requirements is
highly recommended.

NOTE: If you have insta

**Now we crawl the URLs of all Geti documentation pages and add them to the database**

In [36]:
import requests
from requests.adapters import HTTPAdapter
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

retry = Retry(exception_types=(Exception))

def is_toctree_class(tag):
    classes = tag.get('class', [])
    # Raw string avoids an invalid-escape warning for \d
    return any(re.match(r'toctree-l\d+', cls) for cls in classes)

def filter_sub_urls(all_urls):
    # remove the URL with #, for example: http://xxxx.com/xxx#P1
    base_urls_set = {url for url in all_urls if '#' not in url}
    new_urls = []
    for url in all_urls:
        if '#' in url and url.split('#')[0] in base_urls_set:
            continue
        else:
            new_urls.append(url)
    return new_urls

@retry
def get_documentation_links(seed_url):
    response = requests.get(seed_url)
    # Parse the HTML content
    soup = BeautifulSoup(response.content, 'html.parser')
    page_urls = []
    for item in soup.find_all(is_toctree_class):
        anchor = item.find('a')
        # Skip toctree entries that have no link
        href = anchor.get('href', '') if anchor else ''
        if href:
            url = urljoin(seed_url, href)
            page_urls.append(url)

    page_urls = filter_sub_urls(page_urls)
            
    return page_urls


In [37]:
get_documentation_links(url)

['https://openvinotoolkit.github.io/geti-sdk/getting_started.html',
 'https://openvinotoolkit.github.io/geti-sdk/notebooks.html',
 'https://openvinotoolkit.github.io/geti-sdk/contributing_to_the_sdk.html',
 'https://openvinotoolkit.github.io/geti-sdk/api_reference.html']
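
Only whole pages come back because `filter_sub_urls` drops any `page#section` link whose bare page is already in the list. Here is a standalone sketch of that rule using `urllib.parse.urldefrag`. The function name `drop_redundant_fragments` and the `example.com` URLs are made up for illustration.

```python
from urllib.parse import urldefrag


def drop_redundant_fragments(urls):
    """Drop 'page#section' URLs whose bare 'page' URL is also in the list."""
    bare_pages = {u for u in urls if not urldefrag(u).fragment}
    return [
        u for u in urls
        if not urldefrag(u).fragment or urldefrag(u).url not in bare_pages
    ]


sample = [
    "https://example.com/docs/a.html",
    "https://example.com/docs/a.html#install",
    "https://example.com/docs/b.html#usage",
]
kept = drop_redundant_fragments(sample)
print(kept)
```

The `b.html#usage` link survives because its bare page is not in the list, so dropping it would lose the page entirely.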

In [38]:
# Seed pages to start crawling from
url_sets = set()
url_sets.add("https://openvinotoolkit.github.io/geti-sdk/index.html")
url_sets.add("https://docs.geti.intel.com/on-prem/1.8/guide/get-started/introduction.html")
# Worklist of pages whose links have not been collected yet
url_waiting_list = url_sets.copy()
while url_waiting_list:
    url = url_waiting_list.pop()
    print(f'The number to be check {len(url_waiting_list)}. ', url)
    new_urls = get_documentation_links(url)
    # Keep only URLs that have not been seen before
    new_urls = {u for u in new_urls if u not in url_sets}
    url_waiting_list.update(new_urls)
    url_sets.update(new_urls)

# Drop the SDK index page, since it was already added at the beginning
url_sets.remove("https://openvinotoolkit.github.io/geti-sdk/index.html")

The number to be check 1.  https://openvinotoolkit.github.io/geti-sdk/index.html
The number to be check 4.  https://openvinotoolkit.github.io/geti-sdk/api_reference.html
The number to be check 13.  https://openvinotoolkit.github.io/geti-sdk/geti_sdk.rest_clients.html
The number to be check 12.  https://docs.geti.intel.com/on-prem/1.8/guide/get-started/introduction.html
The number to be check 68.  https://openvinotoolkit.github.io/geti-sdk/getting_started.html
The number to be check 67.  https://docs.geti.intel.com/on-prem/1.8/guide/release-notes/1.0-beta/release-1.0-beta.html
The number to be check 66.  https://docs.geti.intel.com/on-prem/1.8/guide/additional-resources/openvino/test-optimize-deploy-openvino.html
The number to be check 65.  https://docs.geti.intel.com/on-prem/1.8/guide/installation-guide/installation.html
The number to be check 64.  https://docs.geti.intel.com/on-prem/1.8/guide/additional-resources/ai-fundamentals/detection-project.html
The number to be check 63.  https
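
The loop above is a worklist traversal: pop a page, collect its links, and queue only the links not seen before. The sketch below shows the same idea in isolation, assuming a hypothetical in-memory link graph (`FAKE_SITE`) in place of real HTTP requests. The names `crawl` and `FAKE_SITE` are illustrative and do not appear in the notebook.

```python
def crawl(seed_urls, get_links):
    """Collect every URL reachable from the seeds using a worklist."""
    seen = set(seed_urls)      # every URL discovered so far
    waiting = set(seed_urls)   # URLs whose links are not yet collected
    while waiting:
        current = waiting.pop()
        # Only URLs never seen before are queued, so each page is visited once
        fresh = {u for u in get_links(current) if u not in seen}
        waiting.update(fresh)
        seen.update(fresh)
    return seen


# Hypothetical link graph standing in for real web pages
FAKE_SITE = {
    "index": ["a", "b"],
    "a": ["b", "c"],
    "b": ["index"],
    "c": [],
}

found = crawl({"index"}, lambda page: FAKE_SITE.get(page, []))
print(sorted(found))
```

Because a URL enters the waiting set at most once, the loop terminates even when pages link back to each other, as `b` does to `index` here.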

In [39]:
qa("What features are released in version 1.8?")

2024-Mar-07 11:57:32.26| INFO     | zhouhaha-2.local| superduperdb.base.datalayer:1047 | {}

[2024-03-07 11:57:32] httpx INFO HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
100%|██████████| 1/1 [00:00<00:00,  1.48it/s]
[2024-03-07 11:57:35] httpx INFO HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


In version 1.8, the supported features include model upload, prediction upload, and exporting datasets to COCO/YOLO/VOC format. Additionally, the export functionality from the Intel® Geti™ user interface can be used for exporting datasets. The features not yet supported in version 1.8 but will be added in future releases are fetching the active dataset, triggering model optimization, running model tests, and creating datasets and retrieving dataset statistics.
-------------------- 

#supported-features 0.7988542914390564
# Supported features
-------------------- 

#what-is-not-supported 0.7762789130210876
# Supported features

## What is not supported

- Model upload

- Prediction upload

- Exporting datasets to COCO/YOLO/VOC format: For this, you can use the export
functionality from the Intel® Geti™ user interface instead.

The following features are not supported yet but will be added to the SDK in future
releases:

- Fetching the active dataset

- Triggering (post-training) mode

In [40]:
datas = [Document(**{"url": url}) for url in url_sets]
db.execute(Collection("url").insert_many(datas), refresh=True)

2024-Mar-07 11:57:35.23| INFO     | zhouhaha-2.local| superduperdb.base.datalayer:384  | CDC active, skipping refresh


([ObjectId('65e93b2f75c2fdf140a0a91b'),
  ObjectId('65e93b2f75c2fdf140a0a91c'),
  ObjectId('65e93b2f75c2fdf140a0a91d'),
  ObjectId('65e93b2f75c2fdf140a0a91e'),
  ObjectId('65e93b2f75c2fdf140a0a91f'),
  ObjectId('65e93b2f75c2fdf140a0a920'),
  ObjectId('65e93b2f75c2fdf140a0a921'),
  ObjectId('65e93b2f75c2fdf140a0a922'),
  ObjectId('65e93b2f75c2fdf140a0a923'),
  ObjectId('65e93b2f75c2fdf140a0a924'),
  ObjectId('65e93b2f75c2fdf140a0a925'),
  ObjectId('65e93b2f75c2fdf140a0a926'),
  ObjectId('65e93b2f75c2fdf140a0a927'),
  ObjectId('65e93b2f75c2fdf140a0a928'),
  ObjectId('65e93b2f75c2fdf140a0a929'),
  ObjectId('65e93b2f75c2fdf140a0a92a'),
  ObjectId('65e93b2f75c2fdf140a0a92b'),
  ObjectId('65e93b2f75c2fdf140a0a92c'),
  ObjectId('65e93b2f75c2fdf140a0a92d'),
  ObjectId('65e93b2f75c2fdf140a0a92e'),
  ObjectId('65e93b2f75c2fdf140a0a92f'),
  ObjectId('65e93b2f75c2fdf140a0a930'),
  ObjectId('65e93b2f75c2fdf140a0a931'),
  ObjectId('65e93b2f75c2fdf140a0a932'),
  ObjectId('65e93b2f75c2fdf140a0a933'),


**We need a longer sleep here: the CDC service processes the new URLs in the background, and crawling dozens of web pages takes time.**

In [41]:
time.sleep(120)
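
A fixed sleep either waits too long or not long enough. One alternative, sketched here under stated assumptions, is to poll a readiness check until it succeeds or a timeout expires. The helper `wait_until` and the stand-in check `chunks_ready` are hypothetical. What a real check should test (for example, whether the background jobs have produced their outputs) depends on the pipeline and is not shown.

```python
import time


def wait_until(predicate, timeout=120.0, interval=5.0):
    """Poll predicate() until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for a real readiness check: reports ready on the third poll
progress = {"polls": 0}


def chunks_ready():
    progress["polls"] += 1
    return progress["polls"] >= 3


ready = wait_until(chunks_ready, timeout=2.0, interval=0.01)
print(ready)
```

The helper returns `False` on timeout instead of raising, so the caller can decide whether to continue with partial data or stop.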

In [42]:
qa("What features are released in version 1.8?")

2024-Mar-07 11:59:35.39| INFO     | zhouhaha-2.local| superduperdb.base.datalayer:1047 | {}

[2024-03-07 11:59:36] httpx INFO HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
100%|██████████| 1/1 [00:00<00:00,  1.26it/s]
[2024-03-07 11:59:39] httpx INFO HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Based on the provided documents, the features released in Intel® Geti™ 1.8.0 include:

- Enhanced labeling experience with the Automatic Segmentation tool
- Sample datasets available
- New storage tab
- Project size display
- Video player improvements
- Removal of Filter Pruning
- Download individual media
- Active model architecture indication

These updates and feature enhancements are part of the Intel® Geti™ 1.8.0 release.
-------------------- 

#release-details 0.8630382418632507
# Intel® Geti™ 1.8.0

## Release Details

This section covers additional details on the new functionality available with Intel® Geti™ 1.8.0.
-------------------- 

#intel-geti-1-8-0 0.8435623049736023
# Intel® Geti™ 1.8.0
-------------------- 

#release-summary 0.826368510723114
# Intel® Geti™ 1.8.0

## Release Summary

Intel® Geti™ 1.8.0 contains several updates and feature enhancements, including key highlights:

- Enhanced labeling experience with the Automatic Segmentation tool — Segmen