# Table of contents
- [1. Retrieval Augmented Generation (RAG)](#1)
    - [1.1 Stages within RAG](#1.1)
    - [1.2 Components within RAG](#1.2)
- [2. What is the LlamaIndex?](#2)
- [3. Build a RAG System using LlamaIndex](#3)
    - [3.1 Load Documents](#3.1)
    - [3.2 Creating Text Chunks](#3.2)
    - [3.3 Building Knowledge Bases](#3.3)
    - [3.4 Query Index](#3.4)
- [4. Build a RAG System with any LLM](#4)
- [5. Build a RAG System from VinaLLaMA](#5)
- [References](#6)

**Note:** This notebook runs on a single GPU (V100 16GB).

In [None]:
# Install packages
!pip install llama-index openai tiktoken pypdf accelerate bitsandbytes


<a name='1' ></a>
# 1. Retrieval Augmented Generation (RAG)
LLMs are trained on extensive datasets, but those datasets do not include a specific user's data. Retrieval-Augmented Generation (RAG) addresses this limitation by supplying user data to the model at generation time. The LLM itself is not retrained; instead, relevant user data is retrieved at query time and passed to the model so that it can give more personalized and contextually appropriate responses.

Within the RAG framework, user data is loaded and prepared for queries, essentially "indexed." User queries interact with this index, refining the user data to the most pertinent context. The refined context and user query are then forwarded to the LLM, accompanied by a prompt, and the LLM generates a response.
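The flow described above (index the data, retrieve the most relevant context, then build a prompt for the LLM) can be sketched in plain Python. This is a toy illustration only: the bag-of-words "index", the function names, and the sample chunks are all assumptions made for this example, and the final LLM call is left out.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_toy_index(chunks):
    """'Index' each chunk as a bag-of-words counter (a stand-in for embeddings)."""
    return [(chunk, Counter(tokenize(chunk))) for chunk in chunks]

def retrieve(toy_index, query, top_k=1):
    """Return the top_k chunks that share the most word occurrences with the query."""
    query_counts = Counter(tokenize(query))
    scored = [
        (sum((query_counts & counts).values()), chunk)
        for chunk, counts in toy_index
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]

def build_prompt(context_chunks, query):
    """Combine the retrieved context and the user query into a single prompt."""
    context = "\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\n"
        "Answer:"
    )

# Hypothetical user data, already split into small chunks.
toy_chunks = [
    "Our refund policy allows returns within 30 days of purchase.",
    "The office is closed on public holidays.",
    "Support is available by email from Monday to Friday.",
]
toy_index = build_toy_index(toy_chunks)
toy_query = "What is the refund policy for returns?"
toy_context = retrieve(toy_index, toy_query, top_k=1)
toy_prompt = build_prompt(toy_context, toy_query)
print(toy_prompt)  # this string is what would be sent to the LLM
```

In a real pipeline the word-overlap score is replaced by embedding similarity, but the three steps stay the same.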

Whether you are constructing a chatbot or an agent, understanding RAG techniques for incorporating data into your application is essential.

<a name='1.1' ></a>
## 1.1 Stages within RAG

![](https://i.imgur.com/JU101gO.png)

There are five key stages within RAG, which in turn will be a part of any larger application you build. These are:

- Loading: this refers to getting your data from where it lives – whether it’s text files, PDFs, another website, a database, or an API – into your pipeline. LlamaHub provides hundreds of connectors to choose from.
- Indexing: this means creating a data structure that allows for querying the data. For LLMs this nearly always means creating vector embeddings, numerical representations of the meaning of your data, as well as numerous other metadata strategies to make it easy to accurately find contextually relevant data.
- Storing: Once your data is indexed, you will want to store your index, along with any other metadata, to avoid the need to re-index it.
- Querying: for any given indexing strategy there are many ways you can utilize LLMs and LlamaIndex data structures to query, including sub-queries, multi-step queries and hybrid strategies.
- Evaluation: a critical step in any pipeline is checking how effective it is relative to other strategies, or when you make changes. Evaluation provides objective measures of how accurate, faithful and fast your responses to queries are.
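As one possible illustration of the evaluation stage, the sketch below computes two common retrieval metrics, hit rate and mean reciprocal rank (MRR), over a handful of queries. The chunk IDs and the choice of these two metrics are assumptions for this example; they measure retrieval quality only, not the faithfulness or speed of the generated answers.

```python
def hit_rate(retrieved_ids, expected_ids):
    """Fraction of queries whose expected chunk ID appears among the retrieved IDs."""
    hits = sum(
        expected in retrieved
        for retrieved, expected in zip(retrieved_ids, expected_ids)
    )
    return hits / len(expected_ids)

def mean_reciprocal_rank(retrieved_ids, expected_ids):
    """Average of 1/rank of the expected chunk ID (0 when it was not retrieved)."""
    total = 0.0
    for retrieved, expected in zip(retrieved_ids, expected_ids):
        if expected in retrieved:
            total += 1.0 / (retrieved.index(expected) + 1)
    return total / len(expected_ids)

# Hypothetical results: the ranked chunk IDs returned for each of three queries,
# and the chunk ID that actually contains the answer to each query.
retrieved_per_query = [["c1", "c7"], ["c4", "c2"], ["c9", "c5"]]
expected_per_query = ["c1", "c2", "c3"]

print(hit_rate(retrieved_per_query, expected_per_query))
print(mean_reciprocal_rank(retrieved_per_query, expected_per_query))
```

Here two of the three queries retrieve the expected chunk (at ranks 1 and 2), so the hit rate is 2/3 and the MRR is (1 + 1/2 + 0) / 3 = 0.5.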

<a name='1.2' ></a>
## 1.2 Components within RAG
In a typical RAG process, we have a few components.

- Text Splitter: Splits documents to accommodate context windows of LLMs.
- Embedding Model: The deep learning model used to get embeddings of documents.
- Vector Stores: The databases where document embeddings are stored and queried along with their metadata.
- LLM: The Large Language Model responsible for generating answers from queries.
- Utility Functions: Additional helpers, such as web retrievers and document parsers, that aid in retrieving and pre-processing files.


<a name='2' ></a>
# 2. What is the LlamaIndex?

LlamaIndex (formerly GPT Index) is a Python framework for building LLM applications. It serves as a simple, flexible data layer that connects custom data sources to large language models. It offers tools for ingesting data from diverse sources, uses vector databases for efficient indexing, and provides query interfaces suited to large documents. In short, LlamaIndex is a comprehensive solution for developing retrieval-augmented generation applications. It also integrates with tools such as LangChain, Flask, and Docker. For more details, see the official GitHub repository at [https://github.com/run-llama/llama_index](https://github.com/run-llama/llama_index).

<a name='3' ></a>
# 3. Build a RAG System using LlamaIndex

In [None]:
import os
from llama_index import ServiceContext, LLMPredictor, OpenAIEmbedding, PromptHelper
from llama_index.llms import OpenAI
from llama_index.text_splitter import TokenTextSplitter
from llama_index.node_parser import SimpleNodeParser
from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index import set_global_service_context

os.environ['OPENAI_API_KEY'] = "YOUR_OPENAI_API_KEY"

<a name='3.1' ></a>
## 3.1 Load Documents
As we know, LLMs lack updated knowledge of the world and information about internal documents. To enhance the capabilities of LLMs, it is necessary to provide them with pertinent information sourced from knowledge repositories. These repositories may comprise structured data like CSV, Spreadsheets, or SQL tables, unstructured data such as texts, Word Docs, Google Docs, PDFs, or PPTs, and semi-structured data like Notion, Slack, Salesforce, etc.

This notebook focuses on PDFs as knowledge sources. LlamaIndex provides a class called `SimpleDirectoryReader`, which reads stored documents from a specified directory or list of files. It automatically chooses a parser based on the file extension.

The code below uses a RAG pipeline to answer questions about the ebook [`How to Build a Career in AI`](https://wordpress.deeplearning.ai/wp-content/uploads/2022/10/eBook-How-to-Build-a-Career-in-AI.pdf).

In [None]:
documents = SimpleDirectoryReader(
    input_files=["./eBook-How-to-Build-a-Career-in-AI.pdf"]
).load_data()

In [None]:
print(type(documents), "\n")
print(len(documents), "\n")
print(type(documents[0]))
print(documents[0])

<class 'list'> 

41 

<class 'llama_index.schema.Document'>
Doc ID: 85fae675-6f1d-4232-a371-a0ccdbc2ce7f
Text: PAGE 1Founder, DeepLearning.AICollected Insights from Andrew Ng
How to  Build Your Career in AIA Simple Guide
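The extension-based parser selection mentioned in this section can be pictured as a simple lookup table. The sketch below is a hypothetical stand-in and not the actual implementation of `SimpleDirectoryReader`; the parser functions and the `EXAMPLE_PARSERS` table are invented for illustration and do not read any files.

```python
from pathlib import Path

def parse_pdf(path):
    """Placeholder PDF parser (hypothetical); a real one would extract page text."""
    return f"<pdf text from {path.name}>"

def parse_plain_text(path):
    """Placeholder plain-text parser (hypothetical), also used as the fallback."""
    return f"<plain text from {path.name}>"

# Hypothetical mapping from file extension to parser function.
EXAMPLE_PARSERS = {
    ".pdf": parse_pdf,
    ".txt": parse_plain_text,
}

def choose_parser(file_path):
    """Pick a parser from the file extension, falling back to plain text."""
    suffix = Path(file_path).suffix.lower()
    return EXAMPLE_PARSERS.get(suffix, parse_plain_text)

example_path = Path("./eBook-How-to-Build-a-Career-in-AI.pdf")
example_parser = choose_parser(example_path)
print(example_parser.__name__)
print(example_parser(example_path))
```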


<a name='3.2' ></a>
## 3.2 Creating Text Chunks
Data extracted from knowledge sources is frequently longer than the context window of an LLM. Text that exceeds the context window cannot be processed in full: depending on the API or framework, the request is rejected or the input is truncated, and essential information is lost. Text chunking addresses this by dividing longer texts into smaller chunks based on separators.

Apart from facilitating the fitting of texts into the context window of large language models, text chunking offers additional advantages:

- Enhanced embedding accuracy: Smaller text chunks contribute to improved embedding accuracy, subsequently elevating retrieval accuracy.
- Precision in context: Smaller chunks narrow the retrieved context to the passages most relevant to the query, which improves the quality of retrieval.

LlamaIndex includes built-in tools for text chunking. The following cell configures a node parser that performs the chunking.

In [None]:
import tiktoken

node_parser = SimpleNodeParser.from_defaults(
  separator=" ",
  chunk_size=1024,
  chunk_overlap=20,
  tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode,
)
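To make the roles of `chunk_size` and `chunk_overlap` more concrete, here is a minimal sketch of a sliding-window splitter. It is an assumption-laden simplification: it splits on single spaces and counts words as tokens, whereas the node parser above counts `tiktoken` tokens and may also take sentence boundaries into account. The function name `chunk_tokens` is made up for this example.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks

sample_text = "one two three four five six seven eight nine ten"
sample_tokens = sample_text.split(" ")  # whitespace "tokens" for illustration
sample_chunks = chunk_tokens(sample_tokens, chunk_size=4, chunk_overlap=1)
for chunk in sample_chunks:
    print(" ".join(chunk))
```

With these toy settings each chunk starts with the last word of the previous one, which is the kind of continuity that `chunk_overlap=20` is meant to provide at a larger scale.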

<a name='3.3' ></a>
## 3.3 Building Knowledge Bases
The texts extracted from the knowledge sources need to be stored somewhere. But in RAG-based applications, we need the embeddings of the data. These embeddings are floating point numbers representing data in a high-dimensional vector space. To store and operate on them, we need vector databases. Vector Databases are purpose-built data stores for storing and querying vectors.
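The idea of storing embeddings and querying them by similarity can be sketched with a tiny in-memory store. This is a toy under stated assumptions: the 3-dimensional vectors are hand-written rather than produced by an embedding model, cosine similarity is used as the ranking measure, and the `TinyVectorStore` class is hypothetical and unrelated to any real vector database API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

class TinyVectorStore:
    """Keeps (vector, text) pairs in a list and ranks them by cosine similarity."""

    def __init__(self):
        self._items = []

    def add(self, vector, text):
        self._items.append((vector, text))

    def query(self, vector, top_k=2):
        ranked = sorted(
            self._items,
            key=lambda item: cosine_similarity(vector, item[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:top_k]]

# Hand-written vectors standing in for real embeddings.
store = TinyVectorStore()
store.add([1.0, 0.0, 0.0], "chunk about careers")
store.add([0.0, 1.0, 0.0], "chunk about projects")
store.add([0.9, 0.1, 0.0], "chunk about job searches")

nearest = store.query([1.0, 0.0, 0.0], top_k=2)
print(nearest)
```

Real vector databases add persistence, metadata filtering, and approximate nearest-neighbour search, but the query contract (vector in, closest items out) is the same.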

In [None]:
# LLM and embedding model
llm = OpenAI(model='gpt-3.5-turbo', temperature=0.7, max_tokens=256)
embed_model = OpenAIEmbedding()

prompt_helper = PromptHelper(
  context_window=4096,
  num_output=256,
  chunk_overlap_ratio=0.1,
  chunk_size_limit=None
)

service_context = ServiceContext.from_defaults(
  llm=llm,
  embed_model=embed_model,
  node_parser=node_parser,
  prompt_helper=prompt_helper
)

In [None]:
# Vector Database
index = VectorStoreIndex.from_documents(
    documents,
    service_context=service_context
)

<a name='3.4' ></a>
## 3.4 Query Index
The final step is to query the index and get a response from the LLM. LlamaIndex provides a query engine for querying and a chat engine for chat-like conversations. The difference between the two is that the chat engine preserves the conversation history and the query engine does not.

In [None]:
query_engine = index.as_query_engine(service_context=service_context)
response = query_engine.query("What are steps to take when finding projects to build your experience?")
print(response)

Consider the technical growth potential of the project and ensure it is challenging but not too difficult. Also, assess whether there are good teammates or people to discuss ideas with, as collaborators can greatly impact your growth. Additionally, determine if the project can act as a stepping stone to larger projects based on its technical complexity and business impact. Finally, avoid spending excessive time on project selection and instead focus on taking action and refining your thinking as you work on multiple projects throughout your career.


In [None]:
response.response

'Consider the technical growth potential of the project and ensure it is challenging but not too difficult. Also, assess whether there are good teammates or people to discuss ideas with, as collaborators can greatly impact your growth. Additionally, determine if the project can act as a stepping stone to larger projects based on its technical complexity and business impact. Finally, avoid spending excessive time on project selection and instead focus on taking action and refining your thinking as you work on multiple projects throughout your career.'
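The difference between a query engine and a chat engine described in this section comes down to whether state is kept between calls. The toy classes below illustrate only that distinction; they are hypothetical, do not use LlamaIndex, and replace the LLM with a function that reports what input it received. A real chat engine would typically also store the assistant's replies in its history.

```python
def describe_input(messages):
    """Stand-in for an LLM call: reports how many messages it was given."""
    return f"saw {len(messages)} message(s); latest: {messages[-1]}"

class ToyQueryEngine:
    """Stateless: every call sees only the current question."""

    def __init__(self, answer_fn):
        self._answer_fn = answer_fn

    def query(self, question):
        return self._answer_fn([question])

class ToyChatEngine:
    """Stateful: every call also sees all earlier user messages."""

    def __init__(self, answer_fn):
        self._answer_fn = answer_fn
        self._history = []

    def chat(self, message):
        self._history.append(message)
        return self._answer_fn(list(self._history))

toy_query_engine = ToyQueryEngine(describe_input)
toy_chat_engine = ToyChatEngine(describe_input)

toy_query_engine.query("What is RAG?")
second_query_answer = toy_query_engine.query("Give an example.")
toy_chat_engine.chat("What is RAG?")
second_chat_answer = toy_chat_engine.chat("Give an example.")

print(second_query_answer)
print(second_chat_answer)
```

The follow-up "Give an example." is only meaningful to the chat engine, because it is the only one that still has the first question available.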

<a name='4' ></a>
# 4. Build a RAG System with any LLM

LlamaIndex supports using LLMs from HuggingFace directly. Note that for a completely private experience, you should also set up a local embedding model.

Many open-source models from HuggingFace require some preamble before each prompt, which is passed as a `system_prompt`. Additionally, queries themselves may need a wrapper around the `query_str`. This information is usually available from the HuggingFace model card for the model you are using.
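As a rough picture of how a system prompt and a query wrapper combine into the text the model finally sees, consider the sketch below. It uses plain `str.format` rather than LlamaIndex's `PromptTemplate`, and the helper `build_full_prompt`, the example strings, and the single-space join are assumptions for illustration; the exact concatenation performed inside LlamaIndex may differ.

```python
example_system_prompt = "System: You are a helpful assistant.<|end_of_turn|>"
example_query_wrapper = "User:{query_str} <|end_of_turn|>\nAssistant: "

def build_full_prompt(system_prompt, query_wrapper, query_str):
    """Wrap the query in the model's chat format and prepend the system prompt."""
    wrapped_query = query_wrapper.format(query_str=query_str)
    return f"{system_prompt} {wrapped_query}"

example_full_prompt = build_full_prompt(
    example_system_prompt,
    example_query_wrapper,
    "What is retrieval-augmented generation?",
)
print(example_full_prompt)
```

The prompt ends with the assistant marker, so the model continues from there with its answer.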

In [None]:
import transformers
import torch

from transformers import AutoTokenizer
from llama_index.llms import HuggingFaceLLM
from llama_index.prompts import PromptTemplate


model_name = "berkeley-nest/Starling-LM-7B-alpha"
tokenizer = AutoTokenizer.from_pretrained(model_name, device_map="cuda")
system_prompt = """System: You are a useful LLM for building a RAG system.<|end_of_turn|>"""

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("User:{query_str} <|end_of_turn|>\nAssistant: ")

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": True},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name=model_name,
    model_name=model_name,
    device_map="cuda",
    stopping_ids=[tokenizer.eos_token_id],
    tokenizer_kwargs={"max_length": 4096},
    # Load the weights in half precision to reduce GPU memory usage
    model_kwargs={"torch_dtype": torch.float16}
)

service_context = ServiceContext.from_defaults(
    chunk_size=1024,
    llm=llm,
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
index = VectorStoreIndex.from_documents(
    documents,
    service_context=service_context
)

query_engine = index.as_query_engine(service_context=service_context)
response = query_engine.query("What are steps to take when finding projects to build your experience?")
print(response)

Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.


Here are some steps to take when finding projects to build your experience:

1. **Identify your interests and goals**: Think about what topics or industries you're passionate about and what career goals you have. This will help you find projects that align with your interests and goals.

2. **Research different types of projects**: Look for projects in your areas of interest that span various levels of difficulty and scope. Some projects might be part of a class or a competition, while others could be personal or professional side projects. Consider the potential impact and technical complexity of each project.

3. **Look for opportunities to collaborate**: Working with a team can help you learn from others and develop your skills more effectively. Connect with people who share your interests and goals, and consider joining a club, attending workshops, or participating in online forums to find potential collaborators.

4. **Evaluate the potential of each project**: Before committing to

<a name='5' ></a>
# 5. Build a RAG System from VinaLLaMA
In this part, we use a RAG pipeline built on `VinaLLaMA - State-of-the-art Vietnamese LLMs` to answer questions about the document [`AI tạo sinh: Sức bật giúp doanh nghiệp Việt Nam về đích tăng trưởng`](https://vinbigdata.com/document) ("Generative AI: A springboard to help Vietnamese businesses reach their growth targets").

In [None]:
# Note: Restart the kernel before loading VinaLLaMA to avoid an OutOfMemoryError
import os
from llama_index import ServiceContext
from llama_index import VectorStoreIndex, SimpleDirectoryReader

os.environ['OPENAI_API_KEY'] = "YOUR_OPENAI_API_KEY"

In [None]:
documents = SimpleDirectoryReader(
    input_files=["./23127_VBDI_Ebook-AI-tao-sinh-Final.pdf"]
).load_data()

In [None]:
print(type(documents), "\n")
print(len(documents), "\n")
print(type(documents[20]))
print(documents[20])

<class 'list'> 

24 

<class 'llama_index.schema.Document'>
Doc ID: 4e4d6223-968b-4274-a1c7-9497bc06e2c9
Text: PHẦN 5 VINBIGDATA/colon.uc TIÊN PHONG PHÁT TRIỂN MÔ HÌNH NGÔN
NGỮ LỚN TIẾNG VIỆT PHẦN 5 /hyphen.uc TIÊN PHONG PHÁT TRIỂN MÔ HÌNH
NGÔN NGỮ LỚN TIẾNG VIỆT 21


In [None]:
import transformers
import torch

from transformers import AutoTokenizer
from llama_index.llms import HuggingFaceLLM
from llama_index.prompts import PromptTemplate


model_name = "vilm/vinallama-7b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_name, device_map="cuda")
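# The system prompt is in Vietnamese: "You are a helpful AI assistant. Answer the user accurately."
system_prompt = """<|im_start|>system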
Bạn là một trợ lí AI hữu ích. Hãy trả lời người dùng một cách chính xác.
<|im_end|>"""

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<|im_start|>user\n{query_str} <|im_end|>\n<|im_start|>assistant")

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": True},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name=model_name,
    model_name=model_name,
    device_map="cuda",
    stopping_ids=[tokenizer.eos_token_id],
    tokenizer_kwargs={"max_length": 4096},
    # Load the weights in half precision to reduce GPU memory usage
    model_kwargs={"torch_dtype": torch.float16}
)

service_context = ServiceContext.from_defaults(
    chunk_size=1024,
    llm=llm,
    # Uncomment this to use a local embedding model
    # embed_model="local"
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
index = VectorStoreIndex.from_documents(
    documents,
    service_context=service_context
)

query_engine = index.as_query_engine(service_context=service_context)
# The query is in Vietnamese: "How do you choose a suitable LLM?"
response = query_engine.query("Làm thế nào để lựa chọn LLM phù hợp?")
print(response)


Để lựa chọn LLM phù hợp, doanh nghiệp nên xem xét các tiêu chí sau và cân nhắc chúng dựa trên chiến lược và chính sách của họ:

1. Hiệu suất hoặc chi phí triển khai: Tùy thuộc vào nhu cầu cụ thể của doanh nghiệp, ưu tiên các tiêu chí này hơn tiêu chí khác.
2. Mô hình công nghệ: Chọn mô hình phù hợp nhất với nhu cầu của doanh nghiệp và chiến lược kinh doanh của họ.
3. Bảo mật dữ liệu: Tập trung vào các mô hình do nước ngoài phát triển có dữ liệu lưu trữ tại các máy chủ bên ngoài Việt Nam hoặc sử dụng các dịch vụ đám mây, vì điều này có thể tạo ra nguy cơ mất dữ liệu và xâm phạm quyền riêng tư.
4. Tính chính xác của thông tin mang tính bản địa: Chọn các mô hình sử dụng nguồn dữ liệu mang tính bản địa cao, vì điều này sẽ đảm bảo mô hình trả về phản hồi phù hợp với bối cảnh văn hóa, kinh tế và xã hội của Việt Nam.
5. Ngân sách và lợi suất dự kiến: Hiểu rõ ngân sách của doanh nghiệp và lợi suất dự kiến, và lựa chọn mô hình phù hợp nhất với những yếu tố này.
6. Phản hồi chính xác: Tập trung

<a name='6' ></a>
# References
- [Evaluate RAG with LlamaIndex](https://cookbook.openai.com/examples/evaluation/evaluate_rag_with_llamaindex)
- [Build a RAG Pipeline With the LLama Index](https://www.analyticsvidhya.com/blog/2023/10/rag-pipeline-with-the-llama-index/)
- [Customizing LLMs within LlamaIndex Abstractions](https://docs.llamaindex.ai/en/stable/module_guides/models/llms/usage_custom.html)
- [RAG with LlamaIndex and DeciLM: A Step-by-Step Tutorial](https://deci.ai/blog/rag-with-llamaindex-and-decilm-a-step-by-step-tutorial/)

See more details on my GitHub - [QuyAnh2005](https://github.com/QuyAnh2005/RAG-LlamaIndex)