# RAG Application using Llama2 and LlamaIndex

- This notebook builds a retrieval-augmented generation (RAG) application that answers questions about the study and lecture material from the CMPE258 Deep Learning class at SJSU.
- The LLM used for querying is Meta's Llama 2 7B chat model (`meta-llama/Llama-2-7b-chat-hf`), distributed through Hugging Face and used together with the LlamaIndex framework.


In [10]:
!pip install -q pypdf transformers einops accelerate langchain bitsandbytes sentence_transformers llama_index llama-index-llms-huggingface llama-index-embeddings-langchain

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, ServiceContext, PromptTemplate
from llama_index.llms.huggingface import HuggingFaceLLM

In [6]:
documents = SimpleDirectoryReader('/content/drive/MyDrive/CMPE258 Slides/1 - 13 merged').load_data()
print(documents)

[Document(id_='7cf2ac8f-9e7a-46c9-b78b-01aa6dd0f393', embedding=None, metadata={'page_label': '1', 'file_name': 'CMPE258 1-13 merged.pdf', 'file_path': '/content/drive/MyDrive/CMPE258 Slides/1 - 13 merged/CMPE258 1-13 merged.pdf', 'file_type': 'application/pdf', 'file_size': 93505339, 'creation_date': '2024-03-10', 'last_modified_date': '2024-03-10'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='CMPE 258 -01 \nDeep Learning\nDr. Kaikai Liu, Ph.D. Associate Professor\nDepartment of Computer Engineering\nSan Jose State University \nEmail: kaikai.liu@sjsu.edu\nWebsite: https://www.sjsu.edu/cmpe/faculty/tenure -\nline/kaikai -liu.phpSpring 2024', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}:

In [7]:
system_prompt = """
You are a Q&A assistant. Your goal is to answer questions as
accurately as possible based on the instructions and context provided.
"""

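# Wrap each query in <|USER|> / <|ASSISTANT|> tags. These are StableLM-style tags;
# Llama 2 chat models were trained on the [INST] ... [/INST] format.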
query_wrapper_prompt = PromptTemplate('<|USER|>{query_str}<|ASSISTANT|>')
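
Llama 2 chat checkpoints were fine-tuned on an `[INST] ... [/INST]` instruction format with an optional `<<SYS>>` system block. The sketch below builds a single-turn prompt string in that format using plain Python. The helper name `build_llama2_prompt` is hypothetical and not part of LlamaIndex, and the tokenizer is assumed to add the beginning-of-sequence token itself.

```python
def build_llama2_prompt(user_message: str, system_message: str = "") -> str:
    """Format a single-turn prompt in the Llama 2 chat style (hypothetical helper)."""
    user_message = user_message.strip()
    system_message = system_message.strip()
    if system_message:
        # The system block sits inside <<SYS>> tags within the first [INST] turn.
        return f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n{user_message} [/INST]"
    return f"[INST] {user_message} [/INST]"


if __name__ == "__main__":
    print(build_llama2_prompt("What is ResNet?", "You are a Q&A assistant."))
```

With LlamaIndex, the same shape could presumably be expressed as a wrapper template such as `PromptTemplate('[INST] {query_str} [/INST]')`. Treat that as a suggestion to try rather than a verified setting for this notebook.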

In [8]:
!huggingface-cli login


    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) Y
Token is valid (permission: read).
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the 'store

In [9]:
import torch

llm = HuggingFaceLLM(
    context_window = 4096,
    max_new_tokens = 256,
    generate_kwargs = {'temperature': 0.0, 'do_sample': False},
    system_prompt = system_prompt,
    query_wrapper_prompt = query_wrapper_prompt,
    tokenizer_name = 'meta-llama/Llama-2-7b-chat-hf',
    model_name = 'meta-llama/Llama-2-7b-chat-hf',
    device_map = 'auto',
    model_kwargs = {'torch_dtype': torch.float16, 'load_in_8bit': True}
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
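
Loading the model in 8-bit mainly matters for GPU memory. The following is a rough back-of-the-envelope estimate, assuming the model has about 6.74 billion parameters (an assumed figure, not taken from this notebook). It counts weights only, ignoring activations, the KV cache, and quantization overhead.

```python
# Rough weight-memory estimate for different numeric precisions.
# ASSUMPTION: ~6.74e9 parameters for the 7B model; all overheads are ignored.
N_PARAMS = 6.74e9

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Return approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype:>8}: {weight_memory_gb(N_PARAMS, dtype):.2f} GB")
```

Under these assumptions, float16 weights need about 13.5 GB and 8-bit weights about 6.7 GB, which is why 8-bit loading helps when only a single Colab GPU is available.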


In [12]:
# Wrap a LangChain Hugging Face embedding model so that LlamaIndex can use it.
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index.embeddings.langchain import LangchainEmbedding

embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name = 'sentence-transformers/all-mpnet-base-v2')
)
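
The embedding model maps each text chunk and each query to a fixed-length vector, and retrieval then compares vectors by similarity, typically cosine similarity. A minimal sketch in plain Python follows. The three-dimensional vectors are made-up toy values, not real `all-mpnet-base-v2` embeddings (those have 768 dimensions).

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for zero vectors in this sketch
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    query = [1.0, 0.0, 1.0]        # toy "query embedding"
    chunk_close = [0.9, 0.1, 0.8]  # toy chunk pointing in a similar direction
    chunk_far = [0.0, 1.0, 0.0]    # toy chunk orthogonal to the query
    print(cosine_similarity(query, chunk_close))
    print(cosine_similarity(query, chunk_far))
```

A score near 1 means the two texts point in nearly the same direction in embedding space, while a score near 0 means they are unrelated under this measure.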

In [13]:
service_context = ServiceContext.from_defaults(
    chunk_size = 1024,
    llm = llm,
    embed_model = embed_model
)


In [14]:
service_context

ServiceContext(llm_predictor=LLMPredictor(system_prompt=None, query_wrapper_prompt=None, pydantic_program_mode=<PydanticProgramMode.DEFAULT: 'default'>), prompt_helper=PromptHelper(context_window=4096, num_output=256, chunk_overlap_ratio=0.1, chunk_size_limit=None, separator=' '), embed_model=LangchainEmbedding(model_name='sentence-transformers/all-mpnet-base-v2', embed_batch_size=10, callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x7814b048c2e0>), transformations=[SentenceSplitter(include_metadata=True, include_prev_next_rel=True, callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x7814b048c2e0>, id_func=<function default_id_func at 0x7815996da5f0>, chunk_size=1024, chunk_overlap=200, separator=' ', paragraph_separator='\n\n\n', secondary_chunking_regex='[^,.;。？！]+[,.;。？！]?')], llama_logger=<llama_index.core.service_context_elements.llama_logger.LlamaLogger object at 0x78146f8a7f40>, callback_manager=<llama_index.core.callbacks.ba
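
The `SentenceSplitter` in the output above is configured with `chunk_size=1024` and `chunk_overlap=200`, so neighbouring chunks share some text. The sketch below illustrates only the sliding-window idea on whitespace-separated words. It is a simplification, since the real splitter counts tokenizer tokens and prefers sentence boundaries, and the function name `chunk_words` is hypothetical.

```python
def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into word windows of chunk_size that share chunk_overlap words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    # Prints ['a b c d', 'c d e f', 'e f g']
    print(chunk_words("a b c d e f g", chunk_size=4, chunk_overlap=2))
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of indexing some text twice.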

In [16]:
index = VectorStoreIndex.from_documents(documents, service_context = service_context)

In [17]:
index

<llama_index.core.indices.vector_store.base.VectorStoreIndex at 0x78148ec62440>

In [18]:
query_engine = index.as_query_engine()
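
Conceptually, a query engine built on a vector index embeds the question, ranks the stored chunks by similarity, and passes the best matches to the LLM as context. The sketch below imitates that flow in plain Python. The bag-of-words `embed` function, the `ToyVectorIndex` class, the sample chunks, and the prompt wording are all hypothetical stand-ins, not LlamaIndex internals.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorIndex:
    """Hypothetical in-memory index that stores (chunk, vector) pairs."""

    def __init__(self, chunks: list[str]):
        self.entries = [(chunk, embed(chunk)) for chunk in chunks]

    def retrieve(self, query: str, top_k: int = 2) -> list[str]:
        """Return the top_k chunks most similar to the query."""
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Assemble the retrieved context and the question into one prompt string."""
    context = "\n---\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


if __name__ == "__main__":
    index = ToyVectorIndex([
        "resnet uses skip connections to ease training of deep networks",
        "densenet reuses feature maps from all earlier layers",
        "auxiliary classifiers add extra loss terms at intermediate layers",
    ])
    question = "what is the core idea in densenet"
    print(build_prompt(question, index.retrieve(question, top_k=1)))
```

In the real pipeline, the dense embedding model replaces `embed`, and the assembled prompt is sent to the Llama 2 model rather than printed.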

In [19]:
print(query_engine.query('What is Resnet'))



ResNet is a deep neural network architecture that was introduced in 2015 by Kaiming He et al. in the paper "Deep Residual Learning for Image Recognition". The ResNet architecture addresses the problem of vanishing gradients in deep neural networks, which can make it difficult to train deep models.

In ResNets, a "shortcut" or "skip connection" is used to allow the gradient to be directly backpropagated to earlier layers, which helps to alleviate the vanishing gradient problem. This allows ResNets to achieve better performance than previous architectures, such as VGG, in extracting features from images.

ResNets have become a popular choice for many computer vision tasks, including image classification, object detection, and segmentation. They are often used as a base network for other architectures, such as Inception ResNet and ResNeXt, which have also been shown to be effective in image recognition tasks.


In [20]:
print(query_engine.query('What are Auxiliary Classifiers and where are they used'))

Auxiliary classifiers are additional classifiers added to the architecture of a neural network, specifically in the intermediate layers, to address the problem of vanishing gradient descent. They are used during training to perform a classification based on the inputs within the network's midsection and add the loss calculated during training back to the total loss of the network. They are only utilized during training and removed during inference.


In [21]:
print(query_engine.query('What is the core idea in densenet'))

The core idea behind DenseNet is feature reuse, which leads to very compact models. As a result, it requires fewer parameters than other CNNs, as there are no repeated feature maps.
