# Module 2, Activity 1: Initialize, Build, and Populate Vectors

In [1]:
import boto3
import json
import time

from langchain_aws import BedrockEmbeddings
from langchain.chains import RetrievalQA
from langchain_core.vectorstores import InMemoryVectorStore
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [2]:
def get_data_from_s3(bucket_name, key):
    """Download an S3 object and return its contents as a UTF-8 string."""
    s3 = boto3.client(
        's3',
        region_name='us-west-2',
    )
    response = s3.get_object(Bucket=bucket_name, Key=key)
    data = response['Body'].read().decode('utf-8')

    return data

In [3]:
session = boto3.session.Session()
region = session.region_name
bedrock_runtime = boto3.client("bedrock-runtime", region_name='us-west-2')

## Creating embeddings

Embeddings (also known as vectors) are numeric representations of a thing, usually a list of floating-point values that typically range from -1.0 to 1.0.  LLMs operate on numbers rather than text, so we need to convert text to numbers before the model can do any math on it.  Internally, the LLM works with these numeric representations and converts its results back to text for us.  So we start by telling our app which embedding model we would like to use.

In [4]:
embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0", client=bedrock_runtime)

Now let's see the embedding algorithm work on a simple bit of text...

In [5]:
vector = embeddings.embed_query("The quick brown fox jumps over the lazy dog.")
print(vector)

[-0.03825681656599045, 0.00719795748591423, -0.03788420557975769, 0.0037202551029622555, 0.022750595584511757, -0.06565647572278976, 0.035656485706567764, 0.025682972744107246, 0.02723112888634205, -0.06698131561279297, -0.02849867381155491, 0.041108064353466034, -0.03940925747156143, 0.0385926254093647, -0.04858579486608505, 0.02316324971616268, -0.08972940593957901, 0.00561718363314867, 0.047065552324056625, 0.004710956942290068, 0.027569446712732315, 0.0596788115799427, 0.04921810328960419, 0.002292540157213807, -0.03212191164493561, -0.03145447000861168, -0.014937476254999638, -0.01930885948240757, 0.06831284612417221, -0.0229103472083807, 0.00578446127474308, 0.02930871769785881, 0.020311452448368073, -0.017731430009007454, 0.03661038726568222, 0.07913696765899658, 0.03177940845489502, 0.021328318864107132, 0.0034074457362294197, 0.004702174570411444, 0.024750402197241783, 0.047506432980298996, -0.0020175776444375515, 0.03374910354614258, 0.036725807934999466, 0.07485925406217575,

In [6]:
print("Vector length: ", len(vector))

Vector length:  1024


## A note about vector length

We see here that our text was converted to a 1024-dimensional vector.  Any text that we send to this model to be embedded will result in a vector of this size.  Different embedding models return vectors of different lengths; 1536 is another common size.

So now let's get some more text...

In [7]:
s3_data = get_data_from_s3("dpgenaitraining", "q2_results.txt")
s3_data[0:200]

'BILL Reports Second Quarter Fiscal Year 2025 Financial Results\nFebruary 6, 2025\n\n\t•\tQ2 Core Revenue Increased 16% Year-Over-Year\n\t•\tQ2 Total Revenue Increased 14% Year-Over-Year\nSAN JOSE, Calif.--(BUS'

In [8]:
print("Length of text: ", len(s3_data))

Length of text:  33865


## Splitting and chunking

If we put that entire text block into the embedding model, we would get back a single vector.  That is probably not very useful, because the vector would essentially be an average over a very large document.  It would not capture how different paragraphs carry different meanings because they talk about different things.  What we really want is a series of vectors, each capturing the meaning of one portion of the overall text.

This is where splitting and chunking come in.  These terms are frequently used interchangeably, but they mean slightly different things.  Chunking refers to breaking a large amount of text into pieces of a fixed size, usually specified as a number of characters (`chunk_size`).  Splitting, on the other hand, typically refers to breaking the full text into smaller pieces at logical boundaries like `\n`.

However, we run into trouble if we simply fix the chunk size at, say, 1000 characters.  A boundary will frequently fall in the middle of a thought, so information needed to make sense of chunk `i` ends up in chunk `i+1`.  The resulting vectors are then not independent of each other.  To mitigate this, we can also set `chunk_overlap`, which determines the number of characters shared between consecutive chunks `i` and `i+1`.
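
To make the overlap idea concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap.  The function name `chunk_with_overlap` and the toy sentence are hypothetical, made up for illustration; this is not how LangChain implements its splitters.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size pieces (hypothetical helper, for illustration only).

    Each piece after the first starts chunk_overlap characters before the
    previous piece ended, so neighbouring pieces share that many characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return pieces


# Toy input and deliberately tiny sizes so the overlap is easy to see.
sample = "Subscription fees were $67.7 million, up 7% year-over-year."
toy_chunks = chunk_with_overlap(sample, chunk_size=20, chunk_overlap=5)
for piece in toy_chunks:
    print(repr(piece))
```

Note that the last 5 characters of each printed piece reappear as the first 5 characters of the next one.  That repetition is what lets a sentence cut at a boundary stay at least partly intact on both sides.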

There are many options for splitting and chunking available in LangChain.  For this workshop, we will largely use the `RecursiveCharacterTextSplitter`, which recursively tries to split the input using a hierarchy of separators (like `["\n\n", "\n", ".", " ", ""]`) until it produces chunks that are no longer than `chunk_size`, and have `chunk_overlap` between them to preserve context across boundaries.
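
The separator hierarchy can be sketched in a few lines of plain Python.  This is a simplified illustration under stated assumptions, not LangChain's actual implementation: the function name `recursive_split` is hypothetical, it ignores `chunk_overlap`, and it drops the separator wherever a piece boundary falls.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ".", " ", "")):
    """Simplified recursive splitting (hypothetical helper, for illustration only).

    Tries the coarsest separator first and only falls back to finer ones
    for parts that are still longer than chunk_size.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        # This separator does not occur, so try the next finer one.
        return recursive_split(text, chunk_size, finer)

    pieces, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small parts together
            continue
        if current:
            pieces.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # The part alone is too long: split it with finer separators.
            pieces.extend(recursive_split(part, chunk_size, finer))
    if current:
        pieces.append(current)
    return pieces


toy_text = (
    "Total revenue grew.\n\n"
    "Core revenue grew. Float revenue was flat.\n\n"
    "Subscription fees rose."
)
toy_pieces = recursive_split(toy_text, chunk_size=30)
for piece in toy_pieces:
    print(repr(piece))
```

In this toy run, the first and last paragraphs fit within 30 characters and survive whole.  Only the middle paragraph is too long, so it alone is broken further at the `.` separator.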

This concept can be difficult to visualize.  I would encourage you to try out [**this application**](https://chunkviz.up.railway.app/) to visualize chunks in a variety of different chunking/splitting approaches.

In [9]:
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.create_documents([s3_data])
print(f"Total chunks created: {len(chunks)}")

Total chunks created: 43


## Vector databases

There are a multitude of databases that you can use to store your vectors once they are created, including many popular ones such as PostgreSQL, MongoDB, and Pinecone, to name a few.  It is beyond the scope of this course to get into setting up such a database, but we still want to have the experience of working with one.  To that end, LangChain provides an in-memory vector store that we can use to try all of this out.

We will now initialize that in-memory vector store for use in further queries.

(For the interested student, this workshop has access to Amazon OpenSearch Serverless (AOSS).  There is an [optional notebook](./OPTIONAL_AOSS_demo.ipynb) that you can explore to see how you might use this service for your vector store.)

In [10]:
vector_store = InMemoryVectorStore.from_documents(chunks, embeddings)

## Vector similarity

We have now populated our vector database with both the vectors and the chunks of raw text they came from, so a query can return the original text that matches it.  We will now run a similarity search on that database for a starting question.

Here, our question is passed through the embedding model and turned into a vector.  The cosine similarity between this vector and every other vector in the database is then computed to find the 3 most similar vectors (`k=3`).  Finally, the chunks of text corresponding to those 3 vectors are returned.
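
The ranking step described above can be sketched in plain Python.  This is a minimal illustration, assuming a store that is just a list of `(text, vector)` pairs.  The helper names `cosine_similarity` and `top_k` and the 3-dimensional toy vectors are hypothetical; real embeddings from our model have 1024 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k):
    """Return the k (score, text) pairs most similar to query_vec.

    store is a list of (text, vector) pairs (an assumed toy layout,
    not the internal format of any real vector database).
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest score first
    return scored[:k]


# Made-up vectors standing in for embedded chunks.
toy_store = [
    ("chunk about subscription fees", [0.9, 0.1, 0.0]),
    ("chunk about float revenue", [0.1, 0.9, 0.0]),
    ("chunk about depreciation", [0.0, 0.2, 0.9]),
]
toy_query = [0.8, 0.2, 0.1]  # stands in for the embedded question

toy_results = top_k(toy_query, toy_store, k=2)
for score, text in toy_results:
    print(f"{score:.3f}  {text}")
```

Because cosine similarity depends only on the angle between vectors and not on their magnitudes, the chunk whose vector points in nearly the same direction as the query ranks first.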

In [11]:
docs = vector_store.similarity_search("What were the subscription fees?", k=3)

for i, doc in enumerate(docs, start=1):
    print(f"=== Document {i} ===")
    print("Content:")
    print(doc.page_content)
    print('\n\n')

=== Document 1 ===
Content:
“In Q2, we delivered strong financial results, expanded our non-GAAP operating margin, and continued our track record of execution across the company,” said John Rettig, BILL President and CFO. “We are executing on our strategic priorities and are confident that our strong business model will allow us to drive years of durable growth, an attractive long-term profitability profile, and sustained value generation for shareholders.”
Financial Highlights for the Second Quarter of Fiscal 2025:
	•	Total revenue was $362.6 million, an increase of 14% year-over-year.
	•	Core revenue, which consists of subscription and transaction fees, was $319.6 million, an increase of 16% year-over-year. Subscription fees were $67.7 million, up 7% year-over-year. Transaction fees were $251.9 million, up 19% year-over-year.
	•	Float revenue, which consists of interest on funds held for customers, was $42.9 million.



=== Document 2 ===
Content:
31,595



40,443

Depreciation of pr

## Concluding thoughts

We have now seen how we can convert a text document into a series of vectors, store those vectors and their original text in a database, and identify the bits of text most similar to a question.  Next, we will turn this into a more robust question-answering system using an LLM.