<a href="https://colab.research.google.com/github/Ashish-Soni08/Playground/blob/main/LlamaIndex/Customize_RAG_with_LlamaIndex.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%capture

!pip install llama-index transformers

# Setup

#### Download Data

In [2]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

--2023-10-28 05:45:38--  https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’


2023-10-28 05:45:38 (4.75 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]



#### Load Data

In [3]:
from llama_index import SimpleDirectoryReader

reader = SimpleDirectoryReader("./data/paul_graham/")
documents = reader.load_data()

# Building a QA System with an Open-Source LLM

In [4]:
from llama_index.llms.anyscale import Anyscale
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding
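import openai
import os

# Read credentials from environment variables rather than hard-coding them.
# The variable names are an assumed convention; set them before running.
ANYSCALE_ENDPOINT_TOKEN = os.environ["ANYSCALE_ENDPOINT_TOKEN"]
openai.api_key = os.environ.get("OPENAI_API_KEY", "")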

# Define LLM
llm = Anyscale(model = "meta-llama/Llama-2-13b-chat-hf",
                 api_key=ANYSCALE_ENDPOINT_TOKEN)

# Define Embedding Model
embed_model = OpenAIEmbedding(model="text-embedding-ada-002")

# Abstract llm, embedding model
service_context = ServiceContext.from_defaults(
    llm = llm,
    embed_model = embed_model,
)

# Create index
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

[nltk_data] Downloading package punkt to /tmp/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [5]:
# Setup Query Engine
query_engine = index.as_query_engine()

In [6]:
response = query_engine.query("why did paul graham start YC?")

In [7]:
from IPython.display import display, HTML
display(HTML(f'<p style="font-size:20px">{response.response}</p>'))

# Building a QA System with an Open-Source LLM and Embeddings

In [9]:
from llama_index.embeddings import HuggingFaceEmbedding

# Define LLM
llm = Anyscale(model = "meta-llama/Llama-2-13b-chat-hf",
                 api_key=ANYSCALE_ENDPOINT_TOKEN)

# loads BAAI/bge-small-en-v1.5
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Abstract llm, embedding model
service_context = ServiceContext.from_defaults(
    llm = llm,
    embed_model = embed_model,
)

# Create index
index = VectorStoreIndex.from_documents(documents, service_context=service_context)


In [10]:
# Setup Query Engine
query_engine = index.as_query_engine()

In [11]:
%%timeit -r 1 -n 1
response = query_engine.query("why did paul graham start YC?")

from IPython.display import display, HTML
display(HTML(f'<p style="font-size:20px">{response.response}</p>'))

3.41 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


### Let's use Optimum Embeddings from HuggingFace

You can install the dependencies with `pip install transformers optimum[exporters]`.


In [13]:
%%capture
!pip install transformers optimum[exporters]

First, we need to create the ONNX model. ONNX models can provide faster inference and can be used across platforms (e.g. in Transformers.js).

In [16]:
from llama_index.embeddings import OptimumEmbedding

OptimumEmbedding.create_and_save_optimum_model("BAAI/bge-small-en-v1.5", "./bge_onnx")

RuntimeError: ignored

In [None]:
# load the embedding model
embed_model = OptimumEmbedding(folder_name="./bge_onnx")

In [None]:
# Define LLM
llm = Anyscale(model = "meta-llama/Llama-2-13b-chat-hf",
                 api_key=ANYSCALE_ENDPOINT_TOKEN)

# Abstract llm, embedding model
service_context = ServiceContext.from_defaults(
    llm = llm,
    embed_model = embed_model,
)

# Create index
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

In [None]:
# Create Query Engine
query_engine = index.as_query_engine()

In [None]:
%%timeit -r 1 -n 1
response = query_engine.query("why did paul graham start YC?")

from IPython.display import display, HTML
display(HTML(f'<p style="font-size:20px">{response.response}</p>'))

9.35 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


# Customizing the Chunk Size, Chunk Overlap, LLM Context Window, and Number of Output Tokens

In [20]:
from llama_index import ServiceContext, LLMPredictor, PromptHelper, VectorStoreIndex
from llama_index.llms import OpenAI
from llama_index.node_parser import SimpleNodeParser
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.llms.anyscale import Anyscale

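# Define LLM (the token is read from an environment variable, an assumed
# convention, rather than hard-coded in the notebook)
import os
ANYSCALE_ENDPOINT_TOKEN = os.environ["ANYSCALE_ENDPOINT_TOKEN"]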
llm = Anyscale(model = "meta-llama/Llama-2-13b-chat-hf",
                 api_key=ANYSCALE_ENDPOINT_TOKEN)

# loads BAAI/bge-small-en-v1.5
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Create Node Parser
node_parser = SimpleNodeParser.from_defaults(chunk_size=2000, chunk_overlap=100)

# Create PromptHelper
prompt_helper = PromptHelper(
  context_window=4096,
  num_output=512,
  chunk_overlap_ratio=0.1,
)

# Customize the LLM, embedding model, node parser, and PromptHelper
service_context = ServiceContext.from_defaults(
  llm=llm,
  embed_model=embed_model,
  node_parser=node_parser,
  prompt_helper=prompt_helper
)

# Create Index
index = VectorStoreIndex.from_documents(documents, service_context = service_context)
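
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a sliding-window splitter. It is not how `SimpleNodeParser` is implemented: it counts whitespace-separated words instead of model tokens and ignores sentence boundaries. `split_with_overlap` is a hypothetical helper written only for this illustration.

```python
def split_with_overlap(words, chunk_size, chunk_overlap):
    """Split a list of words into windows of at most chunk_size words,
    where consecutive windows share chunk_overlap words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Toy input: ten "words", windows of four, one word shared between neighbours.
words = [f"w{i}" for i in range(10)]
chunks = split_with_overlap(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

A larger overlap repeats more text across neighbouring chunks, which can help when an answer straddles a chunk boundary, at the cost of more chunks to embed.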

In [21]:
# Setup Query Engine
query_engine = index.as_query_engine()

In [22]:
response = query_engine.query("why did paul graham start YC?")

from IPython.display import display, HTML
display(HTML(f'<p style="font-size:20px">{response.response}</p>'))
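
The `context_window` and `num_output` values passed to `PromptHelper` bound how much retrieved text can fit into a single LLM call. As a rough sketch of that budget: the room left for context is approximately the context window, minus the tokens reserved for the answer, minus the tokens used by the prompt template and the query. The 150-token template size below is an assumed placeholder, not a measured value, and `available_context_tokens` is a hypothetical helper; the real `PromptHelper` does its own accounting.

```python
def available_context_tokens(context_window, num_output, prompt_tokens):
    """Tokens left for retrieved context after reserving room for the
    answer (num_output) and for the prompt template plus query."""
    remaining = context_window - num_output - prompt_tokens
    return max(remaining, 0)


budget = available_context_tokens(
    context_window=4096,  # same value as in the PromptHelper above
    num_output=512,       # same value as in the PromptHelper above
    prompt_tokens=150,    # assumed size of template + query
)
full_chunks = budget // 2000  # how many 2000-token chunks fit whole
print(budget, full_chunks)
```

Under these assumptions only one full 2000-token chunk fits in a single call, so a large `chunk_size` combined with a high `similarity_top_k` can exceed what one prompt can hold.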

# Saving and Loading the Index

In [23]:
from llama_index import StorageContext, load_index_from_storage
from llama_index.node_parser import SimpleNodeParser

# create parser and parse document into nodes
node_parser = SimpleNodeParser.from_defaults(chunk_size=2000, chunk_overlap=100)
nodes = node_parser.get_nodes_from_documents(documents)

# create storage context using default stores
storage_context = StorageContext.from_defaults()

# build index
index = VectorStoreIndex(nodes, storage_context=storage_context, service_context = service_context)

# save index
index.storage_context.persist(persist_dir="storage")

In [24]:
# to load the index later, make sure you set up the storage context
# this will load the persisted stores from persist_dir
storage_context = StorageContext.from_defaults(persist_dir="storage")

# then load the index object
# (if the persist dir holds multiple indexes, pass index_id to select one)
loaded_index = load_index_from_storage(storage_context = storage_context, service_context=service_context)

# setup query engine
query_engine = loaded_index.as_query_engine(similarity_top_k=3)
response = query_engine.query("why did paul graham start YC?")

# print the synthesized response.
display(HTML(f'<p style="font-size:20px">{response.response}</p>'))
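
Persisting pays off when you avoid rebuilding on later runs. The sketch below shows the "load if already persisted, otherwise build and save" pattern using only the standard library. It does not use LlamaIndex's storage format: `load_or_build`, `fake_build`, and the `index.json` filename are hypothetical names chosen for illustration.

```python
import json
import os
import tempfile


def load_or_build(persist_dir, build_fn, filename="index.json"):
    """Return (data, was_loaded): reuse the persisted file if it exists,
    otherwise call build_fn() and persist its result."""
    path = os.path.join(persist_dir, filename)
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f), True
    data = build_fn()
    os.makedirs(persist_dir, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data, False


def fake_build():
    # Stand-in for the expensive step (chunking and embedding documents).
    return {"chunk-0": [0.1, 0.2], "chunk-1": [0.3, 0.4]}


with tempfile.TemporaryDirectory() as tmp:
    first, first_loaded = load_or_build(tmp, fake_build)    # builds and saves
    second, second_loaded = load_or_build(tmp, fake_build)  # loads from disk

print(first_loaded, second_loaded)
```

With LlamaIndex the same idea would check whether the `storage` directory exists before deciding between `VectorStoreIndex(...)` and `load_index_from_storage(...)`.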

# Counting Prompt Tokens and Checking the Underlying Prompt

In [25]:
from llama_index import set_global_service_context
from llama_index.callbacks import CallbackManager, TokenCountingHandler
import tiktoken

token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode
)

callback_manager = CallbackManager([token_counter])

llm = OpenAI(model='gpt-3.5-turbo')

service_context = ServiceContext.from_defaults(
    llm=llm, callback_manager=callback_manager
)

# set the global default!
set_global_service_context(service_context)

In [26]:
index = VectorStoreIndex.from_documents(documents)

In [27]:
print(token_counter.total_embedding_token_count)

16662


Let's reset the token counts.



In [28]:
token_counter.reset_counts()

In [29]:
print(token_counter.total_embedding_token_count)

0


In [30]:
query_engine = index.as_query_engine(similarity_top_k=4)
response = query_engine.query("Why did author start YC?")
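
The `similarity_top_k` argument controls how many chunks are retrieved for the query. Conceptually, the query is embedded, compared against every stored chunk embedding, and the `k` best matches are passed to the LLM. The sketch below illustrates that idea with cosine similarity on toy 2-dimensional vectors. It is a simplified illustration, not LlamaIndex's retriever, and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k):
    """Return the ids of the k chunks most similar to the query."""
    scored = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in scored[:k]]


# Toy 2-dimensional "embeddings"; real models produce hundreds of dimensions.
chunk_vecs = {
    "a": [1.0, 0.0],
    "b": [0.8, 0.6],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
result = top_k([1.0, 0.1], chunk_vecs, k=2)
print(result)
```

Raising `k` gives the LLM more context to draw on but also increases the number of prompt tokens per query.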

In [31]:
print(
    "Embedding Tokens: ",
    token_counter.total_embedding_token_count,
    "\n",
    "LLM Prompt Tokens: ",
    token_counter.prompt_llm_token_count,
    "\n",
    "LLM Completion Tokens: ",
    token_counter.completion_llm_token_count,
    "\n",
    "Total LLM Token Count: ",
    token_counter.total_llm_token_count,
    "\n",
)

Embedding Tokens:  7 
 LLM Prompt Tokens:  4504 
 LLM Completion Tokens:  227 
 Total LLM Token Count:  4731 
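
The counts above are additive: the total LLM token count is the prompt tokens plus the completion tokens. A quick check with the numbers printed above also shows how strongly the prompt (mostly retrieved context) dominates the total for this query.

```python
# Values copied from the output above.
prompt_tokens = 4504
completion_tokens = 227

total = prompt_tokens + completion_tokens
prompt_share = prompt_tokens / total

print(total)                  # 4731
print(f"{prompt_share:.1%}")  # share of tokens spent on the prompt
```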



In [32]:
print("prompt: ", token_counter.llm_token_counts[0].prompt, "...\n")
print(
    "prompt token count: ", token_counter.llm_token_counts[0].prompt_token_count, "\n"
)

print("completion: ", token_counter.llm_token_counts[0].completion, "...\n")
print(
    "completion token count: ",
    token_counter.llm_token_counts[0].completion_token_count,
    "\n",
)

print("total token count", token_counter.llm_token_counts[0].total_token_count)

prompt:  system: You are an expert Q&A system that is trusted around the world.
Always answer the query using the provided context information, and not prior knowledge.
Some rules to follow:
1. Never directly reference the given context in your answer.
2. Avoid statements like 'Based on the context, ...' or 'The context information ...' or anything along those lines.
user: Context information is below.
---------------------
[17]

As well as HN, I wrote all of YC's internal software in Arc. But while I continued to work a good deal in Arc, I gradually stopped working on Arc, partly because I didn't have time to, and partly because it was a lot less attractive to mess around with the language now that we had all this infrastructure depending on it. So now my three projects were reduced to two: writing essays and working on YC.

YC was different from other kinds of work I've done. Instead of deciding for myself what to work on, the problems came to me. Every 6 months there was a new batch