
# LlamaIndex Demo - By Seth Steele
---

This is a simple demo of retrieval-augmented generation (RAG) with Llama-2, built with LlamaIndex.


## 1. Change to GPU runtime
Click on "Runtime" -> "Change runtime type" and make sure "T4 GPU" is selected (the only GPU available on the free plan).
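
To confirm that the runtime change took effect, a quick check along the following lines can help. This is a minimal sketch, not part of the original demo: the helper name `gpu_summary` is made up for illustration, and it assumes the `nvidia-smi` tool is on the path, which is normally the case on Colab GPU runtimes.

```python
import shutil
import subprocess


def gpu_summary():
    """Return the GPU name and memory reported by nvidia-smi, or None if no GPU is visible."""
    if shutil.which("nvidia-smi") is None:
        return None  # driver tools not installed: most likely a CPU-only runtime
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


print(gpu_summary() or "No GPU detected: check the runtime type.")
```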

## 2. Install the libraries and log in to Hugging Face

The following snippet of code will:
1. Log in to your Hugging Face account with an access token.
2. Install the transformers, accelerate and bitsandbytes libraries that we will use to access and run the Llama model.
3. Install LlamaIndex and its Hugging Face integrations, then mount Google Drive so the notebook can read our documents.

The login step is necessary because, whilst Llama is an openly available model, access to it is still restricted to those who have been approved by Meta. Instructions for getting access to Llama and granting that access to your Hugging Face account can be found here: https://ai.meta.com/llama/get-started/
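
Because the token grants access to your account, it is safer not to paste it into a notebook that other people can read. One possible approach, sketched below, is to store it as a Colab secret or an environment variable and look it up by name. The helper `load_token` and the secret name `HF_TOKEN` are illustrative choices, not part of the original demo; the sketch assumes Colab's `google.colab.userdata` module is available when running on Colab and falls back to the environment otherwise.

```python
import os


def load_token(name):
    """Look up a secret by name: Colab secrets first, then environment variables."""
    try:
        # google.colab only exists inside a Colab runtime.
        from google.colab import userdata

        value = userdata.get(name)
        if value:
            return value
    except Exception:
        # Not running in Colab, or the secret is missing or access was not granted.
        pass
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"No secret named {name!r} in Colab secrets or the environment")
    return value
```

With a secret saved under that name, `hf_token = load_token("HF_TOKEN")` would replace a hard-coded string.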


In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

from getpass import getpass

# Never hard-code access tokens in a shared notebook; prompt for them instead.
hf_token = getpass("Hugging Face token: ")
oAI_token = getpass("OpenAI API key: ")
!huggingface-cli login --token $hf_token

import os
os.environ['OPENAI_API_KEY'] =  oAI_token

!pip3 install transformers
!pip3 install accelerate
!pip3 install bitsandbytes


!pip3 install llama-index
!pip3 install llama-index-llms-huggingface
!pip3 install llama-index-embeddings-huggingface

from google.colab import drive
drive.mount('/content/drive')


Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**Note** - you may have to restart the runtime by clicking "Runtime" -> "Restart runtime" after installing the accelerate library for the subsequent code to run.

## 3. Set up the LLM

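The following cell loads the 7-billion-parameter chat version of Llama-2 in 4-bit quantized form and tells LlamaIndex to use it, together with a small local embedding model, instead of its default models.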
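
The `BitsAndBytesConfig` in the cell below loads the weights in 4-bit precision. A rough back-of-the-envelope calculation shows why that helps on a T4, which has 16 GB of memory. This sketch counts the weights only and assumes about 6.7 billion parameters for the "7B" model; activations, the attention cache and quantization metadata all add to the real footprint.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory needed for the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 6.7e9  # assumed parameter count for the "7B" model

for label, bits in [("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

In half precision the weights alone come to roughly 13 GB, which leaves very little room on the GPU, while at 4 bits they need only a few gigabytes.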

In [None]:
import torch
from transformers import BitsAndBytesConfig
from llama_index.core import PromptTemplate
from llama_index.core import ServiceContext
from llama_index.llms.huggingface import HuggingFaceLLM

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    load_in_8bit=False,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

HFllm = HuggingFaceLLM(
    model_name="meta-llama/Llama-2-7b-chat-hf",
    tokenizer_name="meta-llama/Llama-2-7b-chat-hf",
    query_wrapper_prompt=PromptTemplate("<s> [INST] {query_str} [/INST] "),
    context_window=3900,
    model_kwargs={"token": hf_token, "quantization_config": quantization_config},
    tokenizer_kwargs={"token": hf_token},
    device_map="auto",
)

service_context = ServiceContext.from_defaults(llm=HFllm, embed_model="local:BAAI/bge-small-en-v1.5")


## 4. Load the data and build an index

The following code creates an index over the documents in the data folder on our Google Drive.

Play around with what's in there and see what happens when you change the contents of the folder.
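
To build some intuition for what a vector index does, here is a toy sketch of similarity-based retrieval. It is not how LlamaIndex is implemented: the word-count "embedding" stands in for the real embedding model, the chunks are made-up placeholder text, and the function names are invented for illustration. The idea it shows is the same, though: each chunk becomes a vector, and a query returns the chunks whose vectors are closest to the query's.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': word counts (a real embedding model returns dense floats)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query as (score, chunk) pairs."""
    q = embed(query)
    scored = sorted(((cosine(q, embed(c)), c) for c in chunks), reverse=True)
    return scored[:k]


# Placeholder chunks, invented for this example.
chunks = [
    "the chatbot project uses retrieval to find relevant notes",
    "lisp macros are covered in a separate essay",
    "the demo notebook runs on a free colab gpu",
]
for score, chunk in top_k("which project uses retrieval", chunks):
    print(f"{score:.2f}  {chunk}")
```

Changing the files in the folder changes which chunks exist to be retrieved, which is why the answers change with the folder's contents.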

In [None]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader


documents = SimpleDirectoryReader("/content/drive/Shareddrives/Darwin Team E/DarwinIndexData").load_data()
vector_index = VectorStoreIndex.from_documents(documents, service_context=service_context)

## 5. Use the model to respond to a query
In this section we write a query and get the model to respond.


The following line sets the query. Change it to whatever you would like to ask the model.

In [None]:
prompt = "Tell me more about Chat Academy?"

These final lines of code build a query engine from the index and generate a response.
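
Conceptually, the query engine retrieves the chunks most relevant to the query, places them in the prompt alongside the question, and sends the result to the LLM. The sketch below illustrates that "prompt stuffing" step under simplifying assumptions: the instruction wording and the function name `build_rag_prompt` are invented here and are not LlamaIndex's actual template, while the `[INST]` wrapper mirrors the `query_wrapper_prompt` set in section 3.

```python
def build_rag_prompt(question, retrieved_chunks):
    """Combine retrieved text and the question into one Llama-2 chat-style prompt."""
    context = "\n---\n".join(retrieved_chunks)
    instruction = (
        "Context information is below.\n"
        f"{context}\n"
        "Using only this context, answer the question.\n"
        f"Question: {question}"
    )
    # Same wrapper shape as the query_wrapper_prompt used when setting up the LLM.
    return f"<s> [INST] {instruction} [/INST] "


# Placeholder chunks, invented for this example.
print(build_rag_prompt("What does the demo run on?", ["chunk one", "chunk two"]))
```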

In [None]:
query_engine = vector_index.as_query_engine()
response = query_engine.query(prompt)
print(response)

Based on the context information provided, Chat Academy is a project led by Nafise, a project supervisor at the University of Sheffield, where they are developing a chatbot using Llama-2, a language model developed by Meta AI. The project aims to finish by May, and the team is using RAG to gather information for the chatbot.

The context also mentions a person named Seth who wrote a demonstration of Llama-2 for other team members. Additionally, there is a reference to a book called "On Lisp" that the author wrote during their time in grad school, suggesting that the author has an interest in Lisp programming.

Unfortunately, the context does not provide any additional information about Chat Academy beyond what is mentioned above.
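
The answer above mentions a book called "On Lisp", which may mean that a chunk from an unrelated document was retrieved. One way to check is to look at the chunks attached to the response. The helper below is a sketch, with `summarize_sources` being a made-up name: it assumes the response exposes `source_nodes`, each with a `score` and a `node` carrying `metadata` and `get_content()`, as LlamaIndex responses do, and that the directory reader recorded a `file_name` in the metadata.

```python
def summarize_sources(response, max_chars=200):
    """Return one line per retrieved chunk: similarity score, source file, text preview."""
    lines = []
    for item in response.source_nodes:
        file_name = item.node.metadata.get("file_name", "unknown file")
        preview = item.node.get_content()[:max_chars].replace("\n", " ")
        score = "n/a" if item.score is None else f"{item.score:.3f}"
        lines.append(f"{score}  {file_name}: {preview}")
    return lines
```

Calling `print("\n".join(summarize_sources(response)))` after the query lists what was retrieved. If unrelated files show up, removing them from the data folder or passing a different `similarity_top_k` to `as_query_engine` are two things worth trying.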
