This notebook helps us digest work instructions from a variety of online work sites. We use retrieval-augmented generation (RAG) to query these often dense documents.

In [3]:
from tqdm import tqdm
import google.generativeai as palm
from IPython.display import display, Markdown

### Document Loading
Place your document text in `data/document.txt`. The next cell loads that file directly, without prompting for a path. This is crude, but it works for now.

In [4]:
with open("data/document.txt", encoding='utf-8') as f:
    doc = f.read()
doc



### Chunking

In [2]:
from chunker import get_chunks

In [6]:
# get_chunks is called with a list of documents, so wrap the single string in a list
docs = [doc]

In [7]:
texts = get_chunks(docs)
texts

['\ufeffAnteater 3D VPTW, FWP : Spec doc  \nLIVE\nLast update: 10/26/2023\n________________\n\n\nThis document contains instructions for annotating vehicles, pedestrians and two-wheelers for validation.\n\n\nThe document is divided into objects to annotate, vehicles, pedestrians, two-wheelers and nondescript. Objects to annotate describe the general limitations of where a vehicle, pedestrian, and two-wheelers should be annotated. \n\n\nVehicles, pedestrian and two-wheelers annotation starts with a table of all properties that makes up a vehicle, pedestrian or two-wheeler annotation and continues with subchapters further describing all properties. Nondescript vehicles, pedestrians and two-wheelers describe how to annotate vehicles, pedestrians and two-wheelers that are outside the regions where we want to annotate two-wheelers and some special cases. \nKey Sections\n* Summary of task\n* Workflow\n* When to annotate objects?\n* Annotation Rules\n* Table of labels \n* Table of attributes\
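The `chunker` module is local to this project and is not shown here. As a rough idea of what a helper like `get_chunks` might do, the sketch below splits each document into overlapping character windows. The name `split_into_chunks` and the `chunk_size` and `overlap` defaults are assumptions for illustration, not the notebook's actual settings.

```python
def split_into_chunks(docs, chunk_size=1000, overlap=200):
    """Split each document into overlapping character windows (hypothetical helper)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for doc in docs:
        for start in range(0, len(doc), step):
            piece = doc[start:start + chunk_size]
            if piece.strip():  # skip windows that are only whitespace
                chunks.append(piece)
            if start + chunk_size >= len(doc):
                break  # this window already reached the end of the document
    return chunks
```

The overlap means that a sentence cut at one chunk boundary still appears whole in the neighbouring chunk, which tends to help retrieval on documents with long rules and tables.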

### Embeddings

In [9]:
from uuid import uuid4
from embeddings_palm import get_palm_embeddings

In [10]:
chunks = []
for text in tqdm(texts):
    # One record per chunk: a unique ID, the embedding vector, and the source text as metadata
    chunks.append(
        {
            'id': str(uuid4()),
            'values': get_palm_embeddings(text),
            'metadata': {'text': text},
        }
    )

100%|██████████| 46/46 [02:45<00:00,  3.61s/it]
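The `embeddings_palm` module is also local and not shown. A minimal sketch of what such a helper might look like follows, assuming it wraps the legacy PaLM call `palm.generate_embeddings` with the `models/embedding-gecko-001` model (which returns 768-dimensional vectors) and that `palm.configure(api_key=...)` has already been called. The name `get_palm_embeddings_sketch` and the `embed` parameter are hypothetical; `embed` exists only so the function can be exercised without a network call.

```python
def get_palm_embeddings_sketch(text, model="models/embedding-gecko-001", embed=None):
    """Return the embedding vector for one piece of text (hypothetical helper)."""
    if embed is None:
        # Imported lazily so the function can be tested with a stand-in embedder.
        # Assumes palm.configure(api_key=...) was called earlier.
        import google.generativeai as palm
        embed = palm.generate_embeddings

    response = embed(model=model, text=text)
    return response["embedding"]  # the API returns a dict with an 'embedding' list
```

Each call embeds a single string, which is consistent with the loop above taking a few seconds per chunk.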


## Pinecone

In [26]:
import os
import pinecone

### Credentials

In [27]:
pinecone_api_key = os.getenv('PINECONE_API_KEY')

### Creating an Index

In [28]:

index_name = 'work-documents'

# initialize connection to pinecone
pinecone.init(
    api_key=pinecone_api_key,
    environment="gcp-starter"  # shown next to the API key in the console
)

In [29]:
# check if index already exists (it shouldn't if this is first time)
if index_name not in pinecone.list_indexes():
    # if does not exist, create index
    pinecone.create_index(
        index_name,
        dimension=len(chunks[0]['values']),
        metric='dotproduct'
    )
# connect to index
index = pinecone.GRPCIndex(index_name)
# view index stats
index.describe_index_stats()

{'dimension': 768,
 'index_fullness': 0.00046,
 'namespaces': {'': {'vector_count': 46}},
 'total_vector_count': 46}

### Populating the Index

In [15]:
index.upsert(vectors=chunks)

upserted_count: 46
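A single `upsert` call is fine for 46 vectors. For longer documents it may be safer to send the records in smaller batches, since very large requests can be rejected. The sketch below shows one way to do that; `batched`, `upsert_in_batches`, and the batch size of 100 are assumptions for illustration.

```python
def batched(items, batch_size=100):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def upsert_in_batches(index, vectors, batch_size=100):
    """Upsert vectors batch by batch and return how many were sent."""
    total = 0
    for batch in batched(vectors, batch_size):
        index.upsert(vectors=batch)  # same call as above, on a smaller list
        total += len(batch)
    return total
```

With the objects defined in this notebook, usage would be `upsert_in_batches(index, chunks)`.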

### Retrieval

#### Create an Embedding for the Query
This comes in the form of a query vector we will name `xq`.

In [69]:
query = "Is there a minimum sizing for objects? Like when there are not enough points?"

xq = get_palm_embeddings(query)
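Because the index was created with `metric='dotproduct'`, retrieval ranks stored vectors by their dot product with `xq`. The sketch below illustrates that ranking with a brute-force scan over records shaped like the `chunks` list. It is only an illustration of the scoring, not how Pinecone searches internally, and `top_k_by_dot_product` is a hypothetical name.

```python
def top_k_by_dot_product(query_vector, records, k=5):
    """Return the k (score, id) pairs with the highest dot product."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scored = [(dot(query_vector, record["values"]), record["id"]) for record in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest score first
    return scored[:k]
```

For example, `top_k_by_dot_product(xq, chunks, k=5)` would score every chunk embedding in memory against the query vector.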


#### Query Vector Database

In [70]:
res = index.query(vector=xq, top_k=5, include_metadata=True)
res

{'matches': [{'id': '48f65dbe-e006-465e-82a9-86e4f88bb640',
              'metadata': {'text': 'The distance and size conditions are '
                                   'summarized in the following.\n'
                                   '\t\n'
                                   '\n'
                                   '\tObject type\n'
                                   '\tDistance Pixel | height\n'
                                   '\tVehicle\n'
                                   '\n'
                                   '\n'
                                   'Car\n'
                                   'Van\n'
                                   'Suv\n'
                                   'Kcar\n'
                                   'AutoRickshaw\n'
                                   'Pickup\n'
                                   'Trailer\n'
                                   'OtherVehicle\n'
                                   '\t\n'
                                   '\n'
                

## Retrieval Augmented Generation

### Stuff Method
We concatenate the text metadata from the vector store directly with our query. This is simple and it works for small prompts.

In [71]:
# get list of retrieved text
contexts = [item['metadata']['text'] for item in res['matches']]

# Concatenate the retrieved texts with the query, separated by horizontal rules
# Note: this may exceed the model's context length if too many chunks are retrieved
augmented_query = "\n\n---\n\n".join(contexts) + "\n\n-----\n\n" + query
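One way to guard against an oversized prompt is to stop adding contexts once a size budget is reached. The sketch below uses a character count as a rough stand-in for tokens. The function name `build_augmented_query` and the `max_chars` default of 8000 are assumptions, not limits taken from the model's documentation.

```python
def build_augmented_query(contexts, query, max_chars=8000):
    """Join contexts and query, dropping lower-ranked contexts that would exceed max_chars."""
    separator = "\n\n---\n\n"
    tail = "\n\n-----\n\n" + query
    budget = max_chars - len(tail)  # space left for the retrieved contexts

    kept = []
    used = 0
    for text in contexts:  # contexts arrive ordered from best to worst match
        cost = len(text) + (len(separator) if kept else 0)
        if used + cost > budget:
            break
        kept.append(text)
        used += cost
    return separator.join(kept) + tail
```

When everything fits, the result is identical to the plain concatenation above; otherwise the lowest-ranked contexts are dropped first.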

### Primer Template

In [72]:
# system message to 'prime' the model
# A trailing '\' joins lines into one paragraph, so each sentence needs its own period
primer = """
You are a tasker working on a labelling project on an online site called Remotasks. \
These task types are categorized under 3D and are lidar labelling tasks. \
You will be provided with parts of the guidelines describing how to label various objects in the task. \
You MUST be detail-oriented because even the smallest misreading of the instructions could lead to very wrong labelling. \
If for some reason you cannot find text relevant to this kind of task labelling, say 'I DONT KNOW'. \
Do not make guesses or speculate; only say things that have a factual basis in the guidelines. \
Please quote snippets from the guidelines to support your points in each of your responses. \
The response should look as follows, whenever possible:

`
Your summary of the response
"<Quote from the guidelines in double quotation marks>"
`

A query will be provided after the task guidelines for you to answer.
"""

In [73]:
res = palm.chat(context=primer, messages=augmented_query, temperature=0.0)

In [74]:
display(Markdown(res.last))

Yes, there is a minimum sizing for objects. Objects that are smaller than a certain size will not be annotated. This is because objects that are too small are difficult to identify and track accurately. The minimum size for objects varies depending on the type of object. For example, the minimum size for vehicles is 15 pixels, while the minimum size for pedestrians is 10 pixels.

If an object is smaller than the minimum size, it will not be annotated. However, if an object is larger than the minimum size, it will be annotated even if it does not have enough points. This is because objects that are larger than the minimum size are more likely to be identified and tracked accurately.

The minimum size for objects is determined by the resolution of the lidar data. The higher the resolution of the lidar data, the smaller the minimum size for objects can be. This is because lidar data with a higher resolution can capture more detail, which makes it easier to identify and track small objects.
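The primer asks for quoted snippets, yet the response above contains none, so its claims cannot be traced back to the guidelines. A small check like the one below can flag this automatically. It is a sketch under simple assumptions: quotes are delimited by straight double quotation marks, and a quote counts as supported if it appears in the retrieved contexts after whitespace and case are normalized. The name `check_quotes` is hypothetical.

```python
import re


def check_quotes(response_text, contexts):
    """Return (quotes found in the response, quotes not present in any context)."""
    def normalize(text):
        return " ".join(text.split()).lower()

    quotes = re.findall(r'"([^"]+)"', response_text)
    haystack = normalize(" ".join(contexts))
    unsupported = [quote for quote in quotes if normalize(quote) not in haystack]
    return quotes, unsupported
```

An empty first list means the model did not quote the guidelines at all; a non-empty second list means it quoted text that was never retrieved. In this notebook the call would be `check_quotes(res.last, contexts)`.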

### Debugging

In [None]:
# Show every message in the conversation, including the full augmented prompt
for message in res.messages:
    display(Markdown(message['content']))