## Using LlamaIndex to index our documents for better retrieval, then using the retrieved data to augment the responses of a Mistral LLM (loaded in memory here)

In [1]:
# installing the required packages
!pip install llama-index-llms-huggingface llama-index

Collecting llama-index-llms-huggingface
  Downloading llama_index_llms_huggingface-0.1.4-py3-none-any.whl (7.2 kB)
Collecting llama-index
  Downloading llama_index-0.10.29-py3-none-any.whl (6.9 kB)
Collecting llama-index-core<0.11.0,>=0.10.1 (from llama-index-llms-huggingface)
  Downloading llama_index_core-0.10.29-py3-none-any.whl (15.4 MB)
Collecting llama-index-agent-openai<0.3.0,>=0.1.4 (from llama-index)
  Downloading llama_index_agent_openai-0.2.2-py3-none-any.whl (12 kB)
Collecting llama-index-cli<0.2.0,>=0.1.2 (from llama-index)
  Downloading llama_index_cli-0.1.11-py3-none-any.whl (26 kB)
Collecting llama-index-embeddings-openai<0.2.0,>=0.1.5 (from llama-index)
  Downloading llama_index_embeddings_openai-0.1.7-py3-none-any.whl (6.0 kB)
Collecting llama-index-indices-managed-llama-cloud<0.2.0,>=0.1.2 (from llama-index)
  Downloading llama_index_indices_manage

In [2]:
!nvidia-smi

Tue Apr 16 06:10:24 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla V100-SXM2-16GB           Off | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0              23W / 300W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [4]:
# Now let's list our folder of research PDFs after mounting Google Drive
!ls ./drive/MyDrive/llms/research_papers

attention.pdf  lora.pdf		peft.pdf	    rag_for_intensive_nlp.pdf  yolo.pdf
llama2.pdf     one_bit_llm.pdf	rag_evaluation.pdf  rag_vs_finetune.pdf


In [5]:
# SimpleDirectoryReader (from LlamaIndex) loads every file in a directory as Document objects
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

data = SimpleDirectoryReader("./drive/MyDrive/llms/research_papers").load_data()

In [8]:
type(data), len(data)

(list, 197)
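
The folder holds 9 PDFs, yet `data` has 197 entries. This is consistent with the PDF reader emitting one `Document` per page (the `page_label` metadata in the response output further down points the same way). A minimal sketch for counting pages per file follows; `pages_per_file` and the stand-in objects are hypothetical and only mimic the `.metadata` dictionary of a loaded document.

```python
from collections import Counter
from types import SimpleNamespace


def pages_per_file(documents):
    """Count how many Document-like objects came from each source file."""
    return Counter(doc.metadata.get("file_name", "<unknown>") for doc in documents)


# Stand-in objects with made-up page counts; in the notebook you would pass `data`.
sample_docs = [
    SimpleNamespace(metadata={"file_name": "attention.pdf", "page_label": "1"}),
    SimpleNamespace(metadata={"file_name": "attention.pdf", "page_label": "2"}),
    SimpleNamespace(metadata={"file_name": "lora.pdf", "page_label": "1"}),
]

for name, count in sorted(pages_per_file(sample_docs).items()):
    print(f"{name}: {count} page(s)")
```

In the notebook itself, `pages_per_file(data)` should sum to `len(data)`.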

In [9]:
# Set up the prompt templates with LlamaIndex

from llama_index.core import PromptTemplate

system_prompt = """<|SYSTEM|># You are a Q&A assistant. Your goal is to answer questions as
accurately as possible based on the instructions and context provided.
"""

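# This will wrap the default prompts that are internal to llama-index.
# Note: the <|SYSTEM|>/<|USER|>/<|ASSISTANT|> markers follow the StableLM convention;
# Mistral-Instruct documents an "[INST] ... [/INST]" format instead.
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")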
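
To see what the model might actually receive, here is a minimal sketch of how the two templates combine. It assumes the system prompt is simply prepended to the wrapped query with a space in between; the exact joining is an internal detail of the library and may differ between versions. `build_prompt` is a hypothetical helper, not a LlamaIndex function, and the system prompt is shortened for readability.

```python
SYSTEM_PROMPT = "<|SYSTEM|># You are a Q&A assistant."
QUERY_WRAPPER = "<|USER|>{query_str}<|ASSISTANT|>"


def build_prompt(query_str, system_prompt=SYSTEM_PROMPT, wrapper=QUERY_WRAPPER):
    """Wrap the query in the user/assistant markers, then prepend the system prompt."""
    wrapped = wrapper.format(query_str=query_str)
    # Assumption: a single space joins the two parts.
    return f"{system_prompt} {wrapped}"


print(build_prompt("What is an attention model?"))
```

During a RAG query, `query_str` is not the bare question but LlamaIndex's internal question-answering prompt with the retrieved context already filled in.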

In [10]:
# Let's load the Mistral 7B Instruct model from Hugging Face into memory (GPU memory when a GPU is available)
import torch
from llama_index.llms.huggingface import HuggingFaceLLM

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"do_sample": False},  # greedy decoding; a temperature has no effect without sampling
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="mistralai/Mistral-7B-Instruct-v0.1",
    model_name="mistralai/Mistral-7B-Instruct-v0.1",
    device_map="auto",
    stopping_ids=[2],  # Mistral's eos_token_id (see the generation log); the 5027x ids belong to StableLM's vocabulary
    tokenizer_kwargs={"max_length": 4096},
    # float16 weights roughly halve memory use on CUDA; remove this line when running on CPU
    model_kwargs={"torch_dtype": torch.float16},
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]



tokenizer_config.json:   0%|          | 0.00/1.47k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]
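
A rough back-of-the-envelope check of whether the weights fit on the 16 GB V100 shown earlier. This sketch assumes the two shard sizes in the download log are decimal gigabytes and that the checkpoint stores 2 bytes per weight; both are assumptions, and activations and the KV cache need additional memory on top of the weights.

```python
GIB = 2**30

# Shard sizes from the download log above (assumed to be decimal GB).
shard_sizes_gb = [9.94, 4.54]
checkpoint_bytes = sum(shard_sizes_gb) * 1e9

# Assumption: the checkpoint stores 16-bit weights, i.e. 2 bytes per parameter.
approx_params = checkpoint_bytes / 2


def weights_gib(num_params, bytes_per_param):
    """Memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


gpu_gib = 16384 / 1024  # 16384 MiB reported by nvidia-smi
fp16_gib = weights_gib(approx_params, 2)
fp32_gib = weights_gib(approx_params, 4)

print(f"approx. parameters: {approx_params / 1e9:.2f} B")
print(f"float16 weights: {fp16_gib:.1f} GiB of {gpu_gib:.0f} GiB")
print(f"float32 weights: {fp32_gib:.1f} GiB of {gpu_gib:.0f} GiB")
```

Under these assumptions the float16 weights fit with a couple of GiB to spare, while float32 weights would not fit at all, which is why the `torch_dtype` setting matters on this GPU.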

In [11]:
!pip install llama-index-embeddings-huggingface

Collecting llama-index-embeddings-huggingface
  Downloading llama_index_embeddings_huggingface-0.2.0-py3-none-any.whl (7.1 kB)
Collecting sentence-transformers<3.0.0,>=2.6.1 (from llama-index-embeddings-huggingface)
  Downloading sentence_transformers-2.6.1-py3-none-any.whl (163 kB)
Installing collected packages: sentence-transformers, llama-index-embeddings-huggingface
Successfully installed llama-index-embeddings-huggingface-0.2.0 sentence-transformers-2.6.1


In [12]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Sentence-transformers model used to embed both the document chunks and the queries
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-mpnet-base-v2")

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]
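
The embedding model maps each chunk and each query to a vector, and retrieval then ranks chunks by how close their vectors are to the query vector. A minimal sketch of cosine similarity, a common closeness measure, using toy 3-dimensional vectors (made up for illustration; `all-mpnet-base-v2` produces 768-dimensional vectors):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings.
query_vec = [1.0, 0.0, 1.0]
close_vec = [0.9, 0.1, 0.8]
far_vec = [0.0, 1.0, 0.0]

print("close:", round(cosine_similarity(query_vec, close_vec), 3))
print("far:  ", round(cosine_similarity(query_vec, far_vec), 3))
```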

In [13]:
from llama_index.core import Settings, VectorStoreIndex

# ServiceContext is deprecated in llama-index 0.10; configure the global Settings instead
Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 1024  # maximum chunk size used when splitting documents into nodes

In [15]:
# Embed the documents into an in-memory vector index and expose a query engine over it
index = VectorStoreIndex.from_documents(data)
query_engine = index.as_query_engine()
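
Conceptually, indexing splits each document into chunks, and a query then pulls back the few chunks that score highest against the question. The sketch below illustrates that flow under loud simplifications: it splits on whitespace words rather than tokens, and it scores by word overlap as a stand-in for embedding similarity. `chunk_words`, `overlap_score` and `top_k` are hypothetical helpers, and `k=2` is only an assumed default.

```python
import heapq


def chunk_words(text, chunk_size, overlap):
    """Split text into windows of `chunk_size` words, each sharing `overlap` words with the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def overlap_score(query, chunk):
    """Fraction of query words that appear in the chunk (stand-in for embedding similarity)."""
    query_words = set(query.lower().split())
    chunk_words_set = set(chunk.lower().split())
    return len(query_words & chunk_words_set) / len(query_words) if query_words else 0.0


def top_k(query, chunks, k=2):
    """Return the k chunks with the highest score for the query."""
    return heapq.nlargest(k, chunks, key=lambda c: overlap_score(query, c))


text = ("attention lets a model weigh input tokens differently "
        "lora adds low rank adapters to frozen weights")
chunks = chunk_words(text, chunk_size=8, overlap=2)
for chunk in top_k("what is attention", chunks, k=1):
    print(chunk)
```

The retrieved chunks are what gets placed into the prompt as context before the LLM generates its answer.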

In [16]:
response1 = query_engine.query("What is an attention model?")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [19]:
response1

Response(response='An attention model is a type of machine learning model that allows a system to focus on certain parts of the input while processing it. It does this by assigning different levels of importance to different parts of the input, based on their relevance to the task at hand. Attention models are commonly used in natural language processing tasks, such as language translation and text summarization.', source_nodes=[NodeWithScore(node=TextNode(id_='14c3ec79-b696-4441-bb4a-c0c97d0117b5', embedding=None, metadata={'page_label': '17', 'file_name': 'llama2.pdf', 'file_path': '/content/drive/MyDrive/llms/research_papers/llama2.pdf', 'file_type': 'application/pdf', 'file_size': 12490211, 'creation_date': '2024-04-15', 'last_modified_date': '2024-04-04'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modifie

In [20]:
response1.response

'An attention model is a type of machine learning model that allows a system to focus on certain parts of the input while processing it. It does this by assigning different levels of importance to different parts of the input, based on their relevance to the task at hand. Attention models are commonly used in natural language processing tasks, such as language translation and text summarization.'
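
The full `Response` object also carries the retrieved chunks in `source_nodes`, which is worth inspecting: in the output above, the first source node for the attention question comes from page 17 of `llama2.pdf`, not from `attention.pdf`. A minimal sketch for listing sources follows; `summarize_sources` is a hypothetical helper that relies only on the attributes visible in the printed `Response` (`source_nodes`, `node.metadata`) plus the `score` field of `NodeWithScore`, and the stand-in object uses a made-up score.

```python
from types import SimpleNamespace


def summarize_sources(response):
    """Return (file_name, page_label, score) for every retrieved source node."""
    rows = []
    for item in response.source_nodes:
        meta = item.node.metadata
        rows.append((meta.get("file_name"), meta.get("page_label"), item.score))
    return rows


# Stand-in for a real Response; the score value is made up for illustration.
fake_response = SimpleNamespace(
    source_nodes=[
        SimpleNamespace(
            node=SimpleNamespace(metadata={"file_name": "llama2.pdf", "page_label": "17"}),
            score=0.5,
        )
    ]
)

for file_name, page, score in summarize_sources(fake_response):
    print(f"{file_name} (page {page}): score={score}")
```

In the notebook, `summarize_sources(response1)` would show which pages the answer was grounded in.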

In [21]:
response2 = query_engine.query("What is llm finetuning?")
response2.response


"LLM fine-tuning is a process of adjusting the parameters of a large language model (LLM) to improve its performance on specific tasks. This process involves training the LLM on a smaller dataset that is relevant to the task at hand, and using techniques such as instruction tuning and RLHF to align the model's responses more closely with human expectations and preferences. The goal of LLM fine-tuning is to improve the model's ability to generate useful and engaging content that is tailored to the needs of the user."

In [22]:
response3 = query_engine.query("What is one bit embedding?")
response3.response


'One bit embedding is a type of embedding model used in large language models (LLMs) that uses ternary weights (-1, 0, 1) instead of full-precision weights (such as FP16 or BF16). This allows for a significant reduction in inference cost (latency, throughput, and energy consumption) while maintaining model performance. The one bit embedding model introduced in the paper BitNet b1.58 is an example of this type of embedding model.'

In [23]:
response = query_engine.query("What are the different ways to evaluate an llm?")
response.response


'There are different ways to evaluate an LLM. One way is to assess its knowledge by measuring its accuracy in answering factual questions. This can be done using a standard accuracy score based on a set of multiple-choice factual questions. Another way is to evaluate its reasoning abilities by considering its performance on unfamiliar knowledge-intensive tasks. Additionally, LLMs can be evaluated based on their safety capabilities by measuring their truthfulness, toxicity, and bias. This can be done using benchmarks such as TruthfulQA, ToxiGen, and BOLD.'

In [24]:
response = query_engine.query("What is better - a rag system or finetuning an llm?")
response.response


'It is difficult to determine which is better, a RAG system or fine-tuning an LLM, as the performance of each approach depends on the specific task and dataset. The results presented in the paper show that RAG can be effective in some cases, such as Anatomy with K=2, but it is not a stable hyperparameter and its performance can vary greatly. Fine-tuning an LLM can also be effective, as seen in the results for various tasks and datasets. Ultimately, the choice between the two approaches will depend on the specific requirements and constraints of the task at hand.'