
# LlamaIndex Demo - By Seth Steele
---

This is a simple demo of retrieval-augmented generation (RAG) on Llama-2 using LlamaIndex.


## 1. Change to GPU runtime
Click on "Runtime" -> "Change runtime type" and make sure "T4 GPU" is selected (the only GPU available on the free plan).
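
Once the runtime has been switched, a quick check can confirm that a GPU is actually visible. The following is a minimal sketch that uses only the standard library and the `nvidia-smi` tool that ships with NVIDIA drivers; the helper name `visible_gpus` is hypothetical.

```python
import shutil
import subprocess


def visible_gpus():
    """Return the GPU names reported by nvidia-smi, or [] if none are visible."""
    if shutil.which("nvidia-smi") is None:
        return []  # No NVIDIA driver tools found (e.g. a CPU-only runtime)
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=30,
        )
    except (OSError, subprocess.SubprocessError):
        return []
    if result.returncode != 0:
        return []
    # One GPU name per non-empty output line
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = visible_gpus()
print(gpus if gpus else "No GPU visible - check the runtime type")
```

On a T4 runtime the list would be expected to contain a single entry for the T4.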

## 2. Install the libraries and log in to HuggingFace

The following snippet of code will:
1. Install the transformers and accelerate libraries that we will use to access and run the Llama model, along with the other libraries used later in the notebook.
2. Log in to your HuggingFace account.
3. Install LlamaIndex and its HuggingFace integration packages.

The second step is necessary because, whilst Llama is an open-source model, access to it is restricted to those who have been granted it by Meta. Instructions for getting access to Llama and granting that access to your HuggingFace account can be found here: https://ai.meta.com/llama/get-started/
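
Hard-coding the token in the notebook makes it easy to leak when the notebook is shared. As one possible alternative, the sketch below reads it from an environment variable. The helper name `read_hf_token` is hypothetical, and `HF_TOKEN` is assumed as the variable name (it is the name that the Colab warning later in this notebook refers to).

```python
import os


def read_hf_token(env=None, var_name="HF_TOKEN"):
    """Return the token stored in `var_name`, or raise if it is missing or empty."""
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {var_name} environment variable to your HuggingFace token"
        )
    return token


# Example with an explicit mapping so the sketch runs without a real token
print(read_hf_token({"HF_TOKEN": "hf_example_token"}))
```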


In [None]:
hf_token = "INSERT HUGGING FACE KEY HERE"

# IPython expands $hf_token in "!" shell commands, so the token above is passed to the CLI
!huggingface-cli login --token $hf_token

!pip3 install transformers
!pip3 install accelerate
!pip3 install bitsandbytes
!pip3 install datasets
!pip3 install peft
!pip3 install trl

!pip3 install llama-index
!pip3 install llama-index-llms-anthropic
!pip3 install llama-index-llms-huggingface
!pip3 install llama-index-embeddings-huggingface

from peft import LoraConfig
from datasets import load_dataset
from trl import SFTTrainer
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from llama_index.core import PromptTemplate
from llama_index.core import ServiceContext
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.llms.anthropic import Anthropic

from google.colab import drive
drive.mount('/content/drive')
import torch

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful

**Note** - you may have to restart the runtime by clicking "Runtime" -> "Restart runtime" after installing the accelerate library for the subsequent code to run.

## 3. Set up the LLM

These settings load the 7-billion-parameter chat model of Llama-2 with 4-bit quantization, which reduces its GPU memory footprint.
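
To see why the quantization settings in the next cell matter, the rough arithmetic for the weights alone can be sketched as follows. The figures are approximations under stated assumptions: exactly 7 billion parameters, 16 bits per parameter in float16 and 4 bits per parameter when quantized, ignoring activations, the KV cache and quantization overhead.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7e9  # assumption: "7 billion" taken literally

fp16_gb = weight_memory_gb(N_PARAMS, 16)
nf4_gb = weight_memory_gb(N_PARAMS, 4)

print(f"float16 weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:   ~{nf4_gb:.1f} GB")
```

The T4 has 16 GB of memory, so the 4-bit weights leave far more headroom for activations and training state than float16 weights would.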

In [None]:
compute_dtype = torch.float16

baseModel = "meta-llama/Llama-2-7b-chat-hf"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    load_in_8bit=False,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

llm = AutoModelForCausalLM.from_pretrained(
    baseModel,
    quantization_config=quant_config,
    device_map={"": 0}
)

llm.config.use_cache = False
llm.config.pretraining_tp = 1


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
datasetName = "ApoAlquaary/sau_university"
dataset = load_dataset(datasetName, split="train")

new_model = "llama-2-7b-chat-academy-test"

tokenizer = AutoTokenizer.from_pretrained(baseModel, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

peft_params = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",  # the base model is loaded with AutoModelForCausalLM
)

training_params = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="tensorboard"
)

trainer = SFTTrainer(
    model=llm,
    train_dataset=dataset,
    peft_config=peft_params,
    dataset_text_field="text",
    max_seq_length=None,
    tokenizer=tokenizer,
    args=training_params,
    packing=False,
)

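# Note: trainer.train() is never called in this demo, so the adapter saved
# below still has its initial (untrained) LoRA weights. Call trainer.train()
# before saving if you want the fine-tuning to take effect.
trainer.model.save_pretrained(new_model)
trainer.tokenizer.save_pretrained(new_model)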

# Wrap the saved model for LlamaIndex, using the Llama-2 chat prompt format
HFllm = HuggingFaceLLM(
    model_name=new_model,
    tokenizer_name=new_model,
    query_wrapper_prompt=PromptTemplate("<s> [INST] {query_str} [/INST] "),
    context_window=3900,
    model_kwargs={"token": hf_token, "quantization_config": quant_config},
    tokenizer_kwargs={"token": hf_token},
    device_map="auto",
)

service_context = ServiceContext.from_defaults(llm=HFllm, embed_model="local:BAAI/bge-small-en-v1.5")
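
The `LoraConfig` above trains small low-rank matrices instead of the full weights. As a rough illustration of what `r=64` and `lora_alpha=16` imply, the sketch below counts the added parameters. It assumes Llama-2-7B's hidden size of 4096 and 32 decoder layers, and assumes that LoRA is applied only to the query and value projections (`q_proj`, `v_proj`), which may not match the modules actually targeted in this notebook.

```python
def lora_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


HIDDEN = 4096          # assumption: Llama-2-7B hidden size
N_LAYERS = 32          # assumption: Llama-2-7B decoder layers
MODULES_PER_LAYER = 2  # assumption: q_proj and v_proj only
R, ALPHA = 64, 16      # values from the LoraConfig in this notebook

per_module = lora_params(HIDDEN, HIDDEN, R)
total = per_module * MODULES_PER_LAYER * N_LAYERS
scaling = ALPHA / R    # factor applied to the low-rank update

print(f"trainable LoRA parameters: {total:,}")
print(f"update scaling alpha/r: {scaling}")
```

Under these assumptions the adapters add tens of millions of parameters, a small fraction of the 7 billion in the base model.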





## 4. Load the data and build an index

The following code creates an index over the documents in the data folder in our Google Drive.

Play around with what's in there and see what happens when you change the contents of the folder.
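
Conceptually, a vector index stores one embedding per chunk of text and answers a query by returning the chunks whose embeddings are most similar to the query's. The toy sketch below illustrates that idea with word-count vectors and cosine similarity. It is not how `VectorStoreIndex` is implemented, and the `ToyIndex` class and sample texts are made up for illustration (the real index uses the `BAAI/bge-small-en-v1.5` embedding model configured earlier).

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyIndex:
    def __init__(self, chunks):
        # Store each chunk alongside its vector
        self.entries = [(chunk, embed(chunk)) for chunk in chunks]

    def retrieve(self, query, top_k=2):
        """Return the top_k chunks most similar to the query."""
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:top_k]]


toy_index = ToyIndex([
    "The library opens at nine on weekdays.",
    "The chatbot project uses retrieval to answer questions.",
    "Lunch is served in the main hall.",
])
print(toy_index.retrieve("How does the chatbot answer questions?", top_k=1))
```

Changing the sample texts changes which chunk is returned, which mirrors what happens when the contents of the data folder change.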

In [None]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader


documents = SimpleDirectoryReader("/content/drive/Shareddrives/Darwin Team E/DarwinIndexData").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

## 5. Use the model to respond to a query
In this section we write out our query and then get the model to respond.


The following line sets our query; change it to whatever you would like to ask the model.

In [None]:
prompt = "What is Chat Academy?"

These final lines of code then generate a response, first with a query engine and then with a chat engine.

In [None]:
# One-shot question answering over the index
query_engine = index.as_query_engine(verbose=True)
response = query_engine.query(prompt)
print(response)

# Conversational interface over the same index
chat_engine = index.as_chat_engine(verbose=True)
response = chat_engine.chat(prompt)
print(response)
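
Behind the query engine, the retrieved chunks are combined with the question into a single prompt for the LLM. The sketch below shows the general shape of that step, reusing the `[INST]` wrapper configured for `HFllm`. The template wording and the `build_prompt` helper are hypothetical and simplified, not LlamaIndex's actual internal prompt.

```python
# Same wrapper string as in the HuggingFaceLLM setup
QUERY_WRAPPER = "<s> [INST] {query_str} [/INST] "

# Hypothetical question-answering template; LlamaIndex's own wording differs
QA_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information, answer the query.\n"
    "Query: {query_str}"
)


def build_prompt(question, retrieved_chunks):
    """Stuff the retrieved chunks and the question into one wrapped prompt."""
    context = "\n\n".join(retrieved_chunks)
    qa_text = QA_TEMPLATE.format(context_str=context, query_str=question)
    return QUERY_WRAPPER.format(query_str=qa_text)


chunks = ["Example chunk one.", "Example chunk two."]  # placeholder text
print(build_prompt("What is Chat Academy?", chunks))
```

Because every retrieved chunk is pasted into the context whether or not it is relevant, unrelated documents in the data folder can end up influencing the answer.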



Based on the context information provided, Chat Academy is a project at the University of Sheffield that involves building a chatbot using a language model called Llama-2. The project is being led by Nafise and involves a team of members, including Seth, who wrote a demonstration of the language model for other team members. The project aims to finish by May and uses RAG to gather information for the chatbot. The context also mentions that the team hope to create a program that can understand and respond to natural language, but the author of the passage notes that the current approach to AI is a hoax and that they realized during their graduate studies that the traditional approach to AI, which involves explicit data structures representing concepts, is not going to work. Instead, the author decides to focus on Lisp, a programming language that they find interesting for its own sake, and goes on to write a book about Lisp hacking.
Thought: I need to use a tool to help m