## Installing Requirements

In [1]:
!pip install --quiet PyPDF2 sentence-transformers

In [1]:
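import os

# Disable Hugging Face tokenizers parallelism to avoid fork-related warnings
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Install the PostgreSQL driver and the pgvector Python client
!pip install --upgrade --quiet pgvector psycopg2-binary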

## Processing Data

### Loading Embeddings Model

In [1]:
from utils import load_embeddings, embed_docs

MODEL_NAME = 'avsolatorio/GIST-small-Embedding-v0'
# MODEL_NAME = 'amazon.titan-embed-text-v1'

load_embeddings(MODEL_NAME)

<embedding_utils.TextEmbedder at 0x7f470d900ac0>

### Creating Document Embeddings

In [10]:
chunks, embeddings = embed_docs("docs/2304.02643.pdf")
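
The `embed_docs` helper lives in `utils` and is not shown in this notebook. As a rough sketch of what such a helper might do, assuming fixed-size character windows with overlap, PyPDF2 for text extraction, and sentence-transformers for encoding: the names `chunk_text` and `embed_pdf` and the default `chunk_size` and `overlap` values are hypothetical and may differ from the real implementation.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


def embed_pdf(path: str, model_name: str):
    """Extract text from a PDF, chunk it, and embed each chunk (assumed workflow)."""
    # Imported lazily so that chunk_text can be used without these packages
    from PyPDF2 import PdfReader
    from sentence_transformers import SentenceTransformer

    reader = PdfReader(path)
    # extract_text() can return an empty value for pages without a text layer
    text = "\n".join((page.extract_text() or "") for page in reader.pages)
    chunks = chunk_text(text)
    model = SentenceTransformer(model_name)
    # encode() returns a NumPy array; convert it to plain Python lists
    embeddings = model.encode(chunks).tolist()
    return chunks, embeddings
```

Overlapping windows reduce the chance that a sentence relevant to a query is split across two chunks with neither half retrievable on its own.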

In [16]:
chunks[0]

'Segment Anything\nAlexander Kirillov1;2;4Eric Mintun2Nikhila Ravi1;2Hanzi Mao2Chloe Rolland3Laura Gustafson3\nTete Xiao3Spencer Whitehead Alexander C. Berg Wan-Yen Lo Piotr Doll ´ar4Ross Girshick4\n1project lead2joint ﬁrst author3equal contribution4directional lead\nMeta AI Research, FAIR\n(b) Model: Segment Anything Model (SAM)promptimagevalid maskimage encoderprompt encoderlightweight mask decoder\n(a) Task: promptable segmentationsegmentation promptimagemodelcat withblack earsvalid mask\n(c) Data: data engine (top) & dataset (bottom)•1+ billion masks•11 million images •privacy respecting•licensed imagesannotatetraindatamodelSegment Anything 1B (SA-1B):\nFigure 1: We aim to build a foundation model for segmentation by introducing three interconnected components: a prompt-\nable segmentation task, a segmentation model (SAM) that powers data annotation and enables zero-shot transfer to a range\nof tasks via prompt engineering, and a data engine for collecting SA-1B, our dataset of ove

In [17]:
embeddings[0]

[-0.0713343545794487,
 -0.08012951165437698,
 0.011772824451327324,
 -0.024659188464283943,
 0.03870829567313194,
 0.021776752546429634,
 0.0551050528883934,
 -0.019230831414461136,
 0.02544032409787178,
 0.03552670776844025,
 0.008509576320648193,
 -0.06481850147247314,
 0.004323433618992567,
 0.04762005805969238,
 0.016339866444468498,
 0.04421736299991608,
 0.0414411686360836,
 0.029023990035057068,
 0.004141478333622217,
 0.0060348426923155785,
 0.023001842200756073,
 0.0025717592798173428,
 -0.015507927164435387,
 -0.04470757022500038,
 -0.05198215693235397,
 0.028300737962126732,
 0.022453276440501213,
 -0.04430174082517624,
 -0.06998388469219208,
 -0.24375535547733307,
 0.014675606042146683,
 -0.0349581316113472,
 0.06653100252151489,
 0.019334441050887108,
 0.02904212847352028,
 0.02059408649802208,
 -0.03977029025554657,
 0.030305657535791397,
 -0.05156799033284187,
 -0.04760480672121048,
 0.025877326726913452,
 0.018653346225619316,
 -0.0022801461163908243,
 -0.01739807426929

In [11]:
print('All Done!')

All Done!


## Ingesting Data into RDS

### Creating Connection

In [12]:
from utils import create_connection, insert_data_into_database, get_secret

secret_name = "RDS-SECRET-NAME"
region = "SECRET-REGION"
secret = get_secret(secret_name, region)
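
`get_secret` is also defined in `utils` and not shown here. A minimal sketch of one way it could work, assuming the credentials are stored in AWS Secrets Manager as a JSON string with `username` and `password` keys (the keys read from `secret` in the next cell). The function names `parse_secret` and `fetch_secret` are hypothetical.

```python
import json


def parse_secret(secret_string: str) -> dict:
    """Decode a JSON secret and check it holds the keys the notebook reads."""
    secret = json.loads(secret_string)
    missing = {"username", "password"} - secret.keys()
    if missing:
        raise KeyError(f"secret is missing keys: {sorted(missing)}")
    return secret


def fetch_secret(secret_name: str, region: str) -> dict:
    """Fetch and decode a secret from AWS Secrets Manager (assumed backend)."""
    # Imported lazily so parse_secret can be used without boto3 installed
    import boto3

    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=secret_name)
    return parse_secret(response["SecretString"])
```

Validating the keys up front turns a misconfigured secret into an immediate, readable error instead of a failure later at connection time.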

In [13]:
auth = {
    "host": 'DB-ENDPOINT',
    "port": 'DB-PORT',
    "database": 'DB-NAME',
    "user": secret['username'],
    "password": secret['password']
}

create_connection(auth)

<knowledgeBase2.DatabaseManager at 0x7f461ec43a60>

### Ingesting Data

In [14]:
metadata = {"file_location": "docs/2304.02643.pdf"}
course_name = 'Course102'  # Filter criterion

insert_data_into_database(chunks, embeddings, metadata, course_name)

Data inserted successfully.
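
The body of `insert_data_into_database` is not shown. The sketch below illustrates how chunks and embeddings could be written to a pgvector column with psycopg2. The table name `documents`, its columns, and the helper names are assumptions made for illustration, not the schema the `utils` module actually uses.

```python
import json

# Hypothetical insert statement; the text literal is cast to pgvector's vector type
INSERT_SQL = (
    "INSERT INTO documents (course_name, chunk, metadata, embedding) "
    "VALUES (%s, %s, %s, %s::vector)"
)


def create_table_sql(dim: int) -> str:
    """Build the DDL for a hypothetical documents table with a vector column."""
    return (
        "CREATE EXTENSION IF NOT EXISTS vector; "
        "CREATE TABLE IF NOT EXISTS documents ("
        "id BIGSERIAL PRIMARY KEY, course_name TEXT NOT NULL, "
        f"chunk TEXT NOT NULL, metadata JSONB, embedding vector({int(dim)}))"
    )


def to_vector_literal(embedding) -> str:
    """Format a list of floats in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_rows(chunks, embeddings, metadata, course_name):
    """Pair each chunk with its embedding and the shared metadata."""
    if len(chunks) != len(embeddings):
        raise ValueError("chunks and embeddings must have the same length")
    meta_json = json.dumps(metadata)
    return [
        (course_name, chunk, meta_json, to_vector_literal(emb))
        for chunk, emb in zip(chunks, embeddings)
    ]


def insert_rows(auth: dict, rows, dim: int) -> None:
    """Create the table if needed and insert the rows in one transaction."""
    import psycopg2  # imported lazily; requires psycopg2-binary

    conn = psycopg2.connect(
        host=auth["host"], port=auth["port"], dbname=auth["database"],
        user=auth["user"], password=auth["password"],
    )
    try:
        # The connection context manager commits on success and rolls back on error
        with conn, conn.cursor() as cur:
            cur.execute(create_table_sql(dim))
            cur.executemany(INSERT_SQL, rows)
    finally:
        conn.close()
```

With the notebook's variables, a call might look like `insert_rows(auth, build_rows(chunks, embeddings, metadata, course_name), len(embeddings[0]))`. Note that `CREATE EXTENSION` needs sufficient database privileges.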


## Creating a Chatbot

In [21]:
from utils import chatbot

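# Off-topic query, used to check how the chatbot responds when the documents contain no answer
query = 'where can I find organic potatoes'
# On-topic alternative:
# query = 'Segment Anything Task'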
course = 'Course102'
# course = 'Course101'

Response = chatbot(query, course)

In [22]:
Response['Chat']

' Unfortunately I do not see any information directly related to where one can find organic potatoes. The passages discuss various image segmentation datasets, regularization methods, zooplankton biomass measurements, iterative segmentation training, database statistics, synthetic indoor scene datasets, parking lot vehicle segmentation, hand-object segmentation, etc. There is no mention of organic potatoes or where to find them.'
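
The response above mentions segmentation passages even though the query is unrelated, which is consistent with a retriever that always returns the nearest chunks for the selected course, however weak the match. The `chatbot` internals are not shown, so the following is only a sketch of course-filtered nearest-neighbour retrieval in plain Python. The `retrieve` function and the row layout `(course_name, chunk, embedding)` are hypothetical.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, rows, course, k=3):
    """Return the k most similar (score, chunk) pairs among rows of one course."""
    scored = [
        (cosine_similarity(query_embedding, emb), chunk)
        for course_name, chunk, emb in rows
        if course_name == course  # filter first, then rank
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# In pgvector the same idea can be expressed in SQL, where <=> is cosine distance
# (table and column names here are hypothetical):
#   SELECT chunk FROM documents
#   WHERE course_name = %s
#   ORDER BY embedding <=> %s::vector
#   LIMIT %s
```

Because the top `k` chunks are returned regardless of score, it falls to the language model (or an added similarity threshold) to recognise that none of them answers the question.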

In [23]:
Response['Documents']

[('released SA-1B.\nCropping. Masks were generated from a regular grid of\n32\x0232 points on the full image and 20 additional zoomed-\nin image crops arising from 2 \x022 and 4\x024 partially over-\nlapping windows using 16 \x0216 and 8\x028 regular point grids,\nrespectively. The original high-resolution images were used\nfor cropping (this was the only time we used them). We re-\nmoved masks that touch the inner boundaries of the crops.\nWe applied standard greedy box-based NMS (boxes were\nused for efﬁciency) in two phases: ﬁrst within each crop and\nsecond across crops. When applying NMS within a crop,\nwe used the model’s predicted IoU to rank masks. When\napplying NMS across crops, we ranked masks from most\nzoomed-in ( i.e., from a 4\x024 crop) to least zoomed-in ( i.e.,\nthe original image), based on their source crop. In both\ncases, we used an NMS threshold of 0.7.\nFiltering. We used three ﬁlters to increase mask qual-\nity. First, to keep only conﬁdent masks we ﬁltered by 