In [1]:
import os
from llama_index import ServiceContext, LLMPredictor, OpenAIEmbedding, PromptHelper
from llama_index.llms import OpenAI
from llama_index.text_splitter import TokenTextSplitter, SentenceSplitter
from llama_index.node_parser import SimpleNodeParser
from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index import set_global_service_context
import tiktoken

In [11]:
api_key = os.environ['OPEN_API_KEY']

In [4]:
data_dir = 'C:\\Users\\aaron\\Personal\\Job Docs\\Cover Letters 2023'
documents = SimpleDirectoryReader(input_dir=data_dir).load_data()

To build a simple vector store index using a non-OpenAI LLM, e.g. Llama 2 hosted on Replicate (where you can easily create a free trial API token), uncomment and adapt the cell below.

In [None]:
# import os
# os.environ["REPLICATE_API_TOKEN"] = "YOUR_REPLICATE_API_TOKEN"

# from llama_index.llms import Replicate
# llama2_7b_chat = "meta/llama-2-7b-chat:8e6975e5ed6174911a6ff3d60540dfd4844201974602551e10e9e87ab143d81e"
# llm = Replicate(
#     model=llama2_7b_chat,
#     temperature=0.01,
#     additional_kwargs={"top_p": 1, "max_new_tokens":300}
# )

# from llama_index.embeddings import HuggingFaceEmbedding
# from llama_index import ServiceContext
# embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
# service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)

# from llama_index import VectorStoreIndex, SimpleDirectoryReader
# documents = SimpleDirectoryReader("YOUR_DATA_DIRECTORY").load_data()
# index = VectorStoreIndex.from_documents(documents, service_context=service_context)

Often, the data extracted from knowledge sources is lengthy and exceeds the context window of LLMs. If we send text longer than the context window, the ChatGPT API will either reject the request or the text will have to be truncated, leaving out crucial information. One way to solve this is text chunking, in which longer texts are divided into smaller chunks based on separators.

Text chunking has other benefits besides making it possible to fit texts into a large language model’s context window:

Better embeddings: smaller text chunks result in better embedding accuracy, subsequently improving retrieval accuracy.
Precise context: narrowing the retrieved text down to the relevant chunks gives the LLM more focused context to answer from.
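
To make the idea of chunking with overlap concrete, here is a minimal sketch in plain Python. It is not how LlamaIndex's splitters are implemented: the helper `chunk_tokens` is hypothetical, and whitespace-separated words stand in for real tokenizer output.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most chunk_size tokens,
    where consecutive windows share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Assumption: words split on spaces stand in for tokens, purely for brevity.
text = "one two three four five six seven eight nine ten"
tokens = text.split(" ")
chunks = chunk_tokens(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk repeats the last token of the previous one, so a sentence that straddles a chunk boundary is not cut off without context. The `chunk_size` and `chunk_overlap` arguments play the same role here as in the splitter configured in the next cell.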

In [5]:
text_splitter = TokenTextSplitter(
  separator=" ",
  chunk_size=1024,
  chunk_overlap=20,
  backup_separators=["\n"],
  tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode
)

SimpleNodeParser creates nodes out of text chunks, and the text chunks are created using LlamaIndex’s TokenTextSplitter.

In [6]:

node_parser = SimpleNodeParser.from_defaults(
  text_splitter=text_splitter
)

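We can use a SentenceSplitter as well. Note that the node_parser created above already holds the TokenTextSplitter, so this splitter only takes effect if SimpleNodeParser.from_defaults is called again with it.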

In [7]:
text_splitter = SentenceSplitter(
  separator=" ",
  chunk_size=1024,
  chunk_overlap=20,
  paragraph_separator="\n\n\n",
  secondary_chunking_regex="[^,.;。]+[,.;。]?",
  tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode
)

The texts extracted from the knowledge sources need to be stored somewhere, but RAG-based applications also need embeddings of the data. These embeddings are arrays of floating-point numbers that represent the data as points in a high-dimensional vector space. To store and operate on them, we need vector databases: purpose-built data stores for storing and querying vectors.

This is how embeddings work: two semantically related texts lie close together in the vector space, while dissimilar texts lie far apart. This geometry also lets embeddings capture analogies between different data points.

Embeddings generated from capable deep-learning models can efficiently capture the semantic meaning of text chunks. When a user sends a text query, we convert it to an embedding using the same model, compare its distance to the text embeddings stored in the vector database, and retrieve the closest “n” text chunks. These are the chunks most semantically similar to the query.
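
The retrieval step described above can be sketched in plain Python. This is an illustration only: the three-dimensional vectors are made up by hand rather than produced by an embedding model, the helper names `cosine_similarity` and `top_n` are hypothetical, and a real vector database would use an optimized index instead of a linear scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; values near 1.0 mean 'similar'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n(query_vec, store, n):
    """Return the texts of the n stored chunks most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:n]]


# Assumption: toy hand-written vectors standing in for real embeddings.
store = [
    ("cover letter for a data engineering role", [0.9, 0.1, 0.0]),
    ("cover letter for a VP of AI role", [0.1, 0.9, 0.2]),
    ("recipe for sourdough bread", [0.0, 0.1, 0.9]),
]
query_vec = [0.2, 0.8, 0.1]  # pretend embedding of the user's query

results = top_n(query_vec, store, n=2)
print(results)
```

Cosine similarity is used here as the closeness measure; other distance metrics such as Euclidean distance or dot product are also common choices.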

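To customize the LLM and the embedding model, we pass them to a ServiceContext. The PromptHelper set here controls how much of the context window is available for the prompt and the retrieved text.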

In [12]:
# Using an Open AI LLM 
# with Prompt Helper and Service Context

llm = OpenAI(model='gpt-3.5-turbo',
             temperature=0, 
             max_tokens=256,
             api_key=os.environ['OPEN_API_KEY']
            )

embed_model = OpenAIEmbedding(api_key=api_key)

prompt_helper = PromptHelper(
  context_window=4096,
  num_output=256,
  chunk_overlap_ratio=0.1,
  chunk_size_limit=None
)

service_context = ServiceContext.from_defaults(
  llm=llm,
  embed_model=embed_model,
  node_parser=node_parser,
  prompt_helper=prompt_helper
)

The cell below uses LlamaIndex’s default vector store, which is an in-memory vector database. You can also use other vector stores such as Chroma, Weaviate, Qdrant, Milvus, etc.

In [15]:
index = VectorStoreIndex.from_documents(
    documents, 
    service_context = service_context,
    show_progress=True
    )

Parsing documents into nodes:   0%|          | 0/45 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/45 [00:00<?, ?it/s]

The final step is to query the index and get a response from the LLM. LlamaIndex provides a query engine for querying and a chat engine for a chat-like conversation. The difference between the two is that the chat engine preserves the history of the conversation, while the query engine does not.
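
The difference between a stateless query engine and a stateful chat engine can be illustrated with a toy sketch. Everything here is hypothetical: `fake_llm`, `ToyQueryEngine`, and `ToyChatEngine` are stand-ins invented for this example and do not reflect how LlamaIndex implements its engines.

```python
def fake_llm(prompt):
    """Stand-in for a real LLM call: reports how many prompt lines it received."""
    return f"answer based on {len(prompt.splitlines())} line(s) of prompt"


class ToyQueryEngine:
    """Stateless: every call sees only the current question."""

    def query(self, question):
        return fake_llm(question)


class ToyChatEngine:
    """Stateful: every call sees the earlier turns plus the current message."""

    def __init__(self):
        self.history = []

    def chat(self, message):
        prompt = "\n".join(self.history + [message])
        reply = fake_llm(prompt)
        # Remember both sides of the exchange for the next turn.
        self.history.extend([message, reply])
        return reply


toy_query_engine = ToyQueryEngine()
toy_chat_engine = ToyChatEngine()

first_q = toy_query_engine.query("Who is hiring for an AI VP?")
second_q = toy_query_engine.query("What is the exact title?")
first_c = toy_chat_engine.chat("Who is hiring for an AI VP?")
second_c = toy_chat_engine.chat("What is the exact title?")

print(second_q)
print(second_c)
```

The second chat turn is answered with the first exchange in its prompt, which is what lets a follow-up like "What is the exact title?" be resolved against the earlier question. In LlamaIndex itself, the chat counterpart of `index.as_query_engine()` is `index.as_chat_engine()`.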

In [17]:
query_engine = index.as_query_engine(service_context=service_context)
response = query_engine.query("What companies are looking for VP level?")
print(response)

MetLife is looking for a Vice President for the Data Integration and Third Party Activation position.


In [23]:
question = input()
response = query_engine.query(question)
print(response)

Who is hiring for an AI VP?
Upwork is hiring for a VP, AI & ML.


In [27]:
# Keep answering questions until the user types one of the exit words.
while True:
    question = input("Type q, quit, exit, or stop to leave.")
    if question in ("quit", "q", "exit", "stop"):
        break
    response = query_engine.query(question)
    print(response)
    print('\n\n')

Type q, quit, exit, or stop to leave.q
