In [None]:
%pip install llama-index
%pip install transformers accelerate bitsandbytes
%pip install llama-index-readers-web
%pip install llama-index-llms-huggingface
%pip install llama-index-embeddings-huggingface
# %pip install llama-index-program-openai
# %pip install llama-index-agent-openai

## Setup

### Loading stage

In [17]:
from llama_index.core.response.notebook_utils import display_response
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
documents = SimpleDirectoryReader("data").load_data()

### Indexing Stage
#### LLM


In [2]:
# Hugging Face access token, needed to download the gated Llama 2 weights
hf_token = "Token KEY"  # placeholder: replace with your own token
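
Hard-coding a token in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of reading it from the environment instead; the helper name `load_hf_token` and the variable name `HF_TOKEN` are assumptions for illustration, not part of the original notebook.

```python
import os


def load_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the token stored in the given environment variable.

    Hypothetical helper: the default variable name is an assumption.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable before loading the model."
        )
    return token
```

With the variable exported in the shell that launches the notebook, `hf_token = load_hf_token()` could stand in for the literal assignment.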

In [3]:
import torch
from transformers import BitsAndBytesConfig
from llama_index.core.prompts import PromptTemplate
from llama_index.llms.huggingface import HuggingFaceLLM

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

llm = HuggingFaceLLM(
    model_name="meta-llama/Llama-2-7b-chat-hf",
    tokenizer_name="meta-llama/Llama-2-7b-chat-hf",
    query_wrapper_prompt=PromptTemplate("<s> [INST] {query_str} [/INST] "),
    context_window=3900,
    model_kwargs={"token": hf_token, "quantization_config": quantization_config},
    tokenizer_kwargs={"token": hf_token},
    device_map="auto",
)
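
The reason for `load_in_4bit=True` is memory: the arithmetic below is a rough sketch of the weight footprint only. It assumes about 6.74 billion parameters for the 7B model (an approximate figure) and ignores activations, the KV cache, and the small overhead of the quantization constants.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: roughly 6.74 billion parameters for the 7B chat model.
N_PARAMS = 6.74e9

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # unquantized half precision
nf4_gb = weight_memory_gb(N_PARAMS, 4)    # 4-bit quantized weights

print(f"fp16: {fp16_gb:.2f} GB, 4-bit: {nf4_gb:.2f} GB")
```

Under these assumptions the quantized weights need about a quarter of the half-precision footprint, which is what lets the model fit on a single consumer or free-tier GPU.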





#### Embedding Model

In [4]:
from llama_index.core import Settings

Settings.llm = llm
Settings.embed_model = "local:BAAI/bge-small-en-v1.5"


### Storing Stage

In [5]:
from llama_index.core import VectorStoreIndex

vector_index = VectorStoreIndex.from_documents(documents)
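
Conceptually, the index stores one embedding per document chunk and, at query time, returns the chunks whose embeddings are most similar to the query embedding. The following is a toy sketch of that idea using cosine similarity over hand-made 3-dimensional vectors; the vectors, chunk names, and helper functions are invented for illustration and are not how the library is implemented internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the names of the k chunks most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), name)
        for name, vec in chunk_vecs.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Hypothetical embeddings; real ones come from the embedding model.
toy_chunks = {
    "attention": [0.9, 0.1, 0.0],
    "recurrence": [0.1, 0.9, 0.0],
    "training": [0.5, 0.5, 0.7],
}
toy_query = [1.0, 0.0, 0.0]

nearest = top_k(toy_query, toy_chunks, k=2)
print(nearest)
```

The `k` argument plays the same role as the `similarity_top_k` parameter used in the query stage.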

## Query Stage

In [18]:
query_engine = vector_index.as_query_engine()

response = query_engine.query("What is Transformer?")

display_response(response)

**`Final Response:`** Based on the context information provided, the Transformer is a sequence transduction model that is entirely based on attention. It replaces the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. The model is trained significantly faster than architectures based on recurrent or convolutional layers and achieves a new state of the art on both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks. The Transformer can be extended to problems involving input and output modalities other than text and can be used to make generation less sequential.

In [19]:
query_engine = vector_index.as_query_engine(similarity_top_k=5)
response = query_engine.query("What is Transformer?")

display_response(response)

This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (4096). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.


**`Final Response:`** Given the new context, the original answer "Transformer" can be further refined to better answer the query. The Transformer is a neural network architecture introduced in the paper "Attention is All You Need" by Vaswani et al. in 2017, which solely relies on attention mechanisms instead of recurrent or convolutional neural networks. The authors proposed a new simple network architecture based on attention mechanisms, dispensing with recurrence and convolutions entirely, and demonstrated its superiority in quality, parallelizability, and training time compared to existing sequence transduction models.

In the context provided, the Transformer is a neural network architecture that replaces recurrent or convolutional neural networks with attention mechanisms. This allows the model to attend to different parts of the input sequence simultaneously and weigh their importance when generating the output sequence. The Transformer has been shown to be effective in various sequence-to-sequence tasks, including machine translation and constituency parsing.

In summary, the Transformer is a neural network architecture that relies solely on attention mechanisms to process input sequences, and it has been shown to be superior in various sequence transduction tasks.
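
The length warning above is plausibly a token-budget problem: more retrieved chunks mean a longer prompt. A back-of-the-envelope check, as a sketch only. The `context_window` of 3900 comes from the LLM setup earlier; the chunk size of 1024 tokens and the 256 generated tokens are assumed to match the library defaults, and the 200 tokens of prompt template are a made-up figure.

```python
def prompt_fits(
    num_chunks,
    chunk_tokens=1024,     # assumed default chunk size
    max_new_tokens=256,    # assumed default generation length
    template_tokens=200,   # hypothetical prompt-template overhead
    context_window=3900,   # value passed to HuggingFaceLLM above
):
    """Return (fits, total_tokens) if all chunks went into a single prompt."""
    total = num_chunks * chunk_tokens + template_tokens + max_new_tokens
    return total <= context_window, total


for k in (2, 5):
    fits, total = prompt_fits(k)
    print(f"top_k={k}: about {total} tokens, fits in one call: {fits}")
```

In practice the query engine may split the chunks across several LLM calls rather than overflow, which would also explain why the answer above reads like a refinement of an earlier answer.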

## Response Modes
Reference link: https://docs.llamaindex.ai/en/stable/module_guides/deploying/query_engine/response_modes/

### Refine

Note: Good for more detailed answers. Each retrieved chunk gets its own LLM call, and the answer is refined step by step.

In [20]:
query_engine = vector_index.as_query_engine(response_mode="refine")

response = query_engine.query("What is Transformer?")

display_response(response)

**`Final Response:`** Based on the new context provided, the Transformer model is a sequence transduction model that is entirely based on attention, and it replaces recurrent layers commonly used in encoder-decoder architectures with multi-headed self-attention. The Transformer is designed to handle input and output modalities other than text and can be extended to other tasks such as images, audio, and video. The model is also capable of handling generation less sequential, which is a research goal of the authors.

In more detail, the Transformer model follows an overall architecture that consists of stacked self-attention and point-wise, fully connected layers for both the encoder and decoder. The encoder is composed of a stack of 6 identical layers, each with two sub-layers: a multi-head self-attention mechanism and a simple, position-wise fully connected feed-forward network. The decoder is also composed of a stack of 6 identical layers, with an additional sub-layer that performs multi-head attention over the output of the encoder stack.

To improve the attention mechanism, the Transformer model employs residual connections around each of the sub-layers, followed by layer normalization.
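
The refine strategy can be sketched as a simple loop: answer from the first chunk, then ask the LLM to revise that answer once per remaining chunk. This is a toy illustration with a stand-in `fake_llm` and invented prompt wording, not the library's actual prompts or code.

```python
from typing import Callable, List


def refine_answer(query: str, chunks: List[str], llm: Callable[[str], str]) -> str:
    """Build an answer from the first chunk, then refine it chunk by chunk."""
    if not chunks:
        return ""
    answer = llm(f"Context: {chunks[0]}\nQuestion: {query}\nAnswer:")
    for chunk in chunks[1:]:
        answer = llm(
            f"Question: {query}\nExisting answer: {answer}\n"
            f"New context: {chunk}\nRefine the existing answer:"
        )
    return answer


calls = []


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: records the prompt and returns a tag."""
    calls.append(prompt)
    return f"answer v{len(calls)}"


final = refine_answer("What is Transformer?", ["chunk A", "chunk B", "chunk C"], fake_llm)
print(final, "after", len(calls), "LLM calls")
```

One call per chunk is what makes this mode thorough but slow, and it is why refined answers often begin with phrases such as "Based on the new context provided".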

### Compact (default)

Note: Like refine, but the retrieved chunks are concatenated (compacted) beforehand, so fewer LLM calls are needed.

In [10]:
query_engine = vector_index.as_query_engine(response_mode="compact")

response = query_engine.query("What is Transformer?")

display_response(response)

**`Final Response:`** Based on the context information provided, the answer to the query is:

Transformer is a sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. It was presented in the paper "Attention is All You Need" by Vaswani et al. in 2017. The Transformer model is composed of an encoder and a decoder, each consisting of a stack of identical layers. The encoder layer has two sub-layers: a multi-head self-attention mechanism and a simple, position-wise fully connected feed-forward network. The decoder layer inserts a third sub-layer that performs multi-head attention over the output of the encoder stack. The model uses residual connections and layer normalization to facilitate the attention mechanisms. The Transformer model achieves state-of-the-art results on machine translation tasks and has been widely adopted in natural language processing tasks.
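
The compaction step can be pictured as greedy packing: keep appending chunks to the current prompt until the next one would exceed the budget, then start a new prompt. The sketch below uses a character budget as a stand-in for a token budget; the function name and the budget are assumptions for illustration.

```python
from typing import List


def compact_chunks(chunks: List[str], max_chars: int) -> List[str]:
    """Greedily merge consecutive chunks without exceeding max_chars each."""
    packed: List[str] = []
    current = ""
    for chunk in chunks:
        candidate = f"{current}\n\n{chunk}" if current else chunk
        if len(candidate) <= max_chars or not current:
            # Still fits, or a single oversized chunk that must go alone.
            current = candidate
        else:
            packed.append(current)
            current = chunk
    if current:
        packed.append(current)
    return packed


packed = compact_chunks(["aaaa", "bbbb", "cccc", "dddd"], max_chars=10)
print(len(packed), "packed prompts instead of 4 chunks")
```

Running the refine loop over the packed prompts instead of the raw chunks is what reduces the number of LLM calls.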

### Tree Summarize

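Note: Good for summarization purposes. The chunks are answered in groups, and those answers are summarized recursively until a single answer remains.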

In [12]:
query_engine = vector_index.as_query_engine(response_mode="tree_summarize")

response = query_engine.query("What is Transformer?")

display_response(response)

**`Final Response:`** Based on the information provided in the context, the Transformer is a sequence transduction model that is entirely based on attention. It replaces the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. The Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. It has achieved a new state of the art on both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks. The Transformer model architecture is shown in Figure 1.
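
Tree summarization can be sketched as a bottom-up reduction: answer over small groups of chunks, then treat those answers as the next level's chunks until one remains. This is a toy version with a stand-in `fake_llm`, an invented prompt, and a hypothetical `fan_in` parameter; the library decides its own grouping based on the context window.

```python
from typing import Callable, List


def tree_summarize(
    query: str, chunks: List[str], llm: Callable[[str], str], fan_in: int = 2
) -> str:
    """Recursively summarize groups of chunks until one answer is left."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not chunks:
        return ""
    level = list(chunks)
    while True:
        groups = [level[i:i + fan_in] for i in range(0, len(level), fan_in)]
        level = [
            llm(f"Question: {query}\nContext:\n" + "\n".join(group) + "\nAnswer:")
            for group in groups
        ]
        if len(level) == 1:
            return level[0]


calls = []


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: records the prompt and returns a tag."""
    calls.append(prompt)
    return f"summary{len(calls)}"


result = tree_summarize("What is Transformer?", ["c1", "c2", "c3", "c4"], fake_llm)
print(result, "after", len(calls), "LLM calls")
```

Because every chunk contributes to the final call only through an intermediate summary, this mode tends to produce condensed overviews rather than detail-rich answers.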