### Clear memory

In [1]:
%reset -f
import gc
gc.collect()

0

### Import

In [2]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import pymupdf, json, requests, re, sys
from sentence_transformers import SentenceTransformer
from pathlib import Path
from tqdm.auto import tqdm
from typing import List, Tuple, Dict, Any, Optional
import chromadb
from chromadb.config import Settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
sns.set_style('whitegrid')
%matplotlib inline

In [3]:
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=UserWarning)

In [4]:
sys.path.append('..')
from src.data_utils import *

# Data Loading
Load the chunks prepared in the first notebook.

In [5]:
chunks = pd.read_json('../data/processed/chunks.json', orient='records')

chunks.head(3)

Unnamed: 0,chunk_id,text,page_num,char_count,start_char,end_char
0,0,User Guide AWS Toolkit for Microsoft Azure Dev...,1,134,0,134
1,1,AWS Toolkit for Microsoft Azure DevOps User Gu...,2,582,0,582
2,2,AWS Toolkit for Microsoft Azure DevOps User Gu...,3,983,0,983


In [6]:
print('Number of chunks:', chunks.shape[0])

Number of chunks: 300


# Embeddings Creation
### Model Selection
This project uses the `all-MiniLM-L6-v2` model because it combines good embedding quality with a relatively small size.


In [7]:
model = SentenceTransformer('all-MiniLM-L6-v2')  # all-mpnet-base-v2

### Text Encoding
Encode the text of each chunk into an embedding vector.

In [8]:
embeddings = model.encode(
    sentences = chunks['text'].tolist(),
    show_progress_bar = True,
)

print('Dimension of each embedding:', embeddings.shape[1])

Dimension of each embedding: 384


### Save Embeddings
Add the embeddings to the DataFrame that holds the chunks.

In [9]:
chunks['embedding'] = list(embeddings)

chunks.head(3)

Unnamed: 0,chunk_id,text,page_num,char_count,start_char,end_char,embedding
0,0,User Guide AWS Toolkit for Microsoft Azure Dev...,1,134,0,134,"[-0.013883891, 0.039782137, -0.03466231, 0.017..."
1,1,AWS Toolkit for Microsoft Azure DevOps User Gu...,2,582,0,582,"[-0.056738224, 0.02857147, -0.036580697, -0.01..."
2,2,AWS Toolkit for Microsoft Azure DevOps User Gu...,3,983,0,983,"[0.011543587, 0.018591559, -0.085423455, 0.008..."
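Storing a 2-D array in a single DataFrame column works because `list(embeddings)` splits the array into one 1-D vector per row. The sketch below uses a small random array as a stand-in for the real model output (the array contents and the names `fake_embeddings`, `rows`, and `restored` are illustrative assumptions) to show the split and how `np.vstack` recovers the original matrix.

```python
import numpy as np

# Stand-in for model output: 4 "chunks" with 384 dimensions each (illustrative values).
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(4, 384)).astype(np.float32)

# list() on a 2-D array yields one 1-D array per row, which is what
# `chunks['embedding'] = list(embeddings)` places in each DataFrame cell.
rows = list(fake_embeddings)

# Stacking the rows again recovers the original matrix unchanged.
restored = np.vstack(rows)

print(len(rows), rows[0].shape, restored.shape)
```

With the real DataFrame, the same idea would presumably be `np.vstack(chunks['embedding'].to_numpy())` whenever the full matrix is needed again.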


# Vector Database Setup
Create a database to store the embeddings and speed up similarity search.
### Initialize Client
This project uses the ChromaDB client because it is free and does not require an external server. `PersistentClient` stores the embeddings on disk rather than in RAM.

In [10]:
client = chromadb.PersistentClient(
    path='../data/chromadb',
    settings=Settings(
        anonymized_telemetry=False,
        allow_reset=True,
    )
)

### Create Collection
Create a collection named `aws_docs` in the database, deleting any existing collection with the same name first. Cosine distance is used as the similarity metric.

In [11]:
try:
    client.delete_collection(name='aws_docs')
except Exception:
    # The collection does not exist yet (e.g. first run), so there is nothing to delete
    pass

collection = client.create_collection(
    name = 'aws_docs',
    configuration = {'hnsw': {'space': 'cosine'}},
    metadata = {'description': 'AWS VSTS documentation chunks'},
)

### Fill Collection

First, unpack the IDs, embeddings, text chunks, and chunk metadata.

In [12]:
ids = list(map(str, chunks['chunk_id'].tolist()))
embeddings_list = chunks['embedding'].tolist()
documents = chunks['text'].tolist()
metadatas = chunks[['page_num', 'char_count', 'start_char', 'end_char']].to_dict('records')
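The four lists must line up item by item before they are passed to the collection. As a hedged sketch, a small pre-flight check such as the hypothetical `check_collection_inputs` below (not part of ChromaDB or of this project's `src.data_utils`) can catch mismatched lengths, duplicate or non-string IDs, and inconsistent embedding sizes early.

```python
from typing import Any, Dict, Sequence


def check_collection_inputs(
    ids: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    documents: Sequence[str],
    metadatas: Sequence[Dict[str, Any]],
) -> None:
    """Raise an error if the parallel lists are inconsistent (hypothetical helper)."""
    # All four lists must describe the same number of chunks.
    lengths = {len(ids), len(embeddings), len(documents), len(metadatas)}
    if len(lengths) != 1:
        raise ValueError(f'Input lists differ in length: {sorted(lengths)}')

    # IDs are expected to be strings, which is why chunk_id is cast with str().
    if not all(isinstance(i, str) for i in ids):
        raise TypeError('All ids must be strings')

    # Each ID must identify exactly one chunk.
    if len(set(ids)) != len(ids):
        raise ValueError('Duplicate ids found')

    # Every embedding must have the same dimensionality.
    dims = {len(e) for e in embeddings}
    if len(dims) > 1:
        raise ValueError(f'Embeddings have inconsistent dimensions: {sorted(dims)}')


# Toy data with the same structure as the notebook's lists (illustrative values).
check_collection_inputs(
    ids=['0', '1'],
    embeddings=[[0.1, 0.2], [0.3, 0.4]],
    documents=['first chunk', 'second chunk'],
    metadatas=[{'page_num': 1}, {'page_num': 2}],
)
```

In the notebook this would be called as `check_collection_inputs(ids, embeddings_list, documents, metadatas)` before `collection.add`.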

Then add everything to the collection.

In [13]:
collection.add(
    ids=ids,
    embeddings=embeddings_list,
    documents=documents,
    metadatas=metadatas,
)

### Verify Storage

Fetch a single document by ID (here `99`) from the database and print it to verify that everything is stored correctly.

In [14]:
sample = collection.get(
    ids=['99'],
    include=['documents', 'metadatas', 'embeddings']
)

print('ID:', *sample['ids'])
print('Page:', sample['metadatas'][0]['page_num'])
print('Char quantity:', sample['metadatas'][0]['char_count'])
print('Embedding shape:', sample['embeddings'][0].shape[0])
print('Text preview:', sample['documents'][0][:200])

ID: 99
Page: 42
Char quantity: 1000
Embedding shape: 384
Text preview: entials to be available from EC2 instance metadata the instance must have been started with an instance proﬁle referencing a role granting permissions to the task to make calls to AWS on your behalf. 


Everything is stored correctly!
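A single spot check can be complemented by a few automatic checks. The sketch below is one possible approach, assuming a collection object that exposes `count()` and `get()` as the ChromaDB collection used above does; the function name `verify_collection` and its return format are hypothetical.

```python
from typing import Any, List


def verify_collection(
    collection: Any,
    expected_count: int,
    expected_dim: int,
    sample_id: str,
) -> List[str]:
    """Return a list of detected problems; an empty list means all checks passed."""
    problems: List[str] = []

    # The number of stored items should equal the number of chunks.
    stored = collection.count()
    if stored != expected_count:
        problems.append(f'expected {expected_count} items, found {stored}')

    # Fetch one item and inspect its embedding and text.
    sample = collection.get(ids=[sample_id], include=['documents', 'embeddings'])
    if len(sample['ids']) == 0:
        problems.append(f'id {sample_id!r} not found')
        return problems

    dim = len(sample['embeddings'][0])
    if dim != expected_dim:
        problems.append(f'expected embedding dimension {expected_dim}, found {dim}')

    if not sample['documents'][0].strip():
        problems.append(f'id {sample_id!r} has empty text')

    return problems
```

In this notebook it could be called as `verify_collection(collection, len(chunks), embeddings.shape[1], '99')`.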

# Semantic Search

Semantic search finds the chunks whose meaning is closest to a query. Closeness is measured between the embedding vectors of the query and of each chunk, usually by cosine similarity.
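To make the metric concrete, here is a minimal NumPy sketch of cosine similarity on toy 2-D vectors. The vectors and the helper name `cosine_similarity` are illustrative assumptions and are not taken from this project's code; real embeddings here have 384 dimensions.

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


query_vec = [1.0, 0.0]
toy_chunks = {
    'same direction': [2.0, 0.0],   # length differs, direction is identical
    'orthogonal': [0.0, 3.0],       # unrelated direction
    'opposite': [-1.0, 0.0],        # opposite direction
}

similarities = {name: cosine_similarity(query_vec, vec) for name, vec in toy_chunks.items()}
for name, sim in similarities.items():
    # Cosine distance is 1 - cosine similarity.
    print(f'{name}: similarity={sim:.1f}, distance={1 - sim:.1f}')
```

Only the direction of a vector matters for this metric, not its length, which is why the first toy chunk scores the maximum similarity despite being twice as long as the query.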

### Formulate a Search Query

In [22]:
query = 'If I do not have an AWS account, what do I do?'

### Perform Semantic Search
Use the custom `semantic_search` function, which returns the 3 best-matching chunks by default.

In [23]:
results = semantic_search(
    query,
    n_results=3,
    model=model,
    collection=collection,
)
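The implementation of `semantic_search` is not shown in this notebook. As a rough sketch of what such a function might do, assuming a model with an `encode` method and a ChromaDB collection configured with cosine distance, the hypothetical `semantic_search_sketch` below encodes the query, asks the collection for the nearest chunks, and converts each cosine distance into a similarity as `1 - distance`. The returned dictionary keys are illustrative assumptions.

```python
from typing import Any, Dict, List


def semantic_search_sketch(
    query: str,
    model: Any,
    collection: Any,
    n_results: int = 3,
) -> List[Dict[str, Any]]:
    """Return the n_results chunks closest to the query (hypothetical sketch)."""
    # Encode the query with the same model that produced the chunk embeddings.
    query_embedding = model.encode([query])[0]

    # Ask the collection for the nearest neighbours of the query vector.
    raw = collection.query(
        query_embeddings=[[float(x) for x in query_embedding]],
        n_results=n_results,
        include=['documents', 'metadatas', 'distances'],
    )

    # Results come back as one list per query; only one query was sent.
    hits = []
    rows = zip(raw['ids'][0], raw['documents'][0], raw['metadatas'][0], raw['distances'][0])
    for rank, (chunk_id, text, meta, distance) in enumerate(rows, start=1):
        hits.append({
            'rank': rank,
            'chunk_id': chunk_id,
            'similarity': 1.0 - distance,  # valid for cosine distance
            'page_num': meta.get('page_num'),
            'text': text,
        })
    return hits
```

Passing `model` and `collection` as arguments, as the notebook's own call does, keeps the function free of global state and makes it easy to test with stand-in objects.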

### Print Results
Use the custom `print_search_results` function to pretty-print the results.

In [24]:
print_search_results(results=results, query=query)

Query: If I do not have an AWS account, what do I do?
----------------------------------------------------------------------------------------------------
Rank 1 | Similarity: 0.696 | Page: 10 | Chunk ID: 39 | Text preview below (first 300 chars):
al.aws.amazon.com/billing/signup. 2. Follow the online instructions. Part of the sign-up procedure involves receiving a phone call or text message and entering a veriﬁcation code on the phone keypad. When you sign up for an AWS account, an AWS account root user is created. The root user has access t...

Rank 2 | Similarity: 0.650 | Page: 109 | Chunk ID: 275 | Text preview below (first 300 chars):
ministrator – If you're in charge of AWS resources at your company, you probably have full access to AWS. It's your job to determine which AWS features and resources your service users should access. You must then submit requests to your IAM administrator to change the permissions of your service us...

Rank 3 | Similarity: 0.596 | Page: 109 | Chunk 

### Test It One More Time

In [27]:
query = 'What if I want to allow people outside of my AWS account to access my AWS resources?'
results = semantic_search(
    query,
    n_results=3,
    model=model,
    collection=collection,
)
print_search_results(results=results, query=query)

Query: What if I want to allow people outside of my AWS account to access my AWS resources?
----------------------------------------------------------------------------------------------------
Rank 1 | Similarity: 0.700 | Page: 114 | Chunk ID: 290 | Text preview below (first 300 chars):
r organization can use to access your resources. You can specify who is trusted to assume the role. For services that support resource-based policies or access control lists (ACLs), you can use those policies to grant people access to your resources. To learn more, consult the following: • To learn ...

Rank 2 | Similarity: 0.692 | Page: 109 | Chunk ID: 275 | Text preview below (first 300 chars):
ministrator – If you're in charge of AWS resources at your company, you probably have full access to AWS. It's your job to determine which AWS features and resources your service users should access. You must then submit requests to your IAM administrator to change the permissions of your service us...

Rank 3 