<a href="https://colab.research.google.com/github/Squigspear/xCyberLLM/blob/main/Tutorial_2_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This tutorial showcases **Retrieval-Augmented Generation (RAG)** with a large language model (LLM): relevant passages of a document (CCOP) are retrieved and supplied to the model whenever it is queried. It is split into two parts: (1) the open-source Llama 2 7B chat model, 5-bit quantized, and (2) Microsoft Azure OpenAI.
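The retrieve-then-generate loop behind RAG can be sketched in a few lines. This is a toy sketch, assuming bag-of-words cosine similarity in place of the neural embeddings used later in the notebook; the chunk texts are made-up placeholders (not taken from CCOP) and the helper names `retrieve` and `build_prompt` are hypothetical.

```python
import math
import re
from collections import Counter

# Made-up placeholder chunks standing in for pieces of an uploaded document (not from CCOP).
CHUNKS = [
    "Passwords must be changed every 90 days.",
    "Critical systems must be patched within 14 days of a vendor release.",
    "Visitors must be escorted at all times inside the data centre.",
]

def bag_of_words(text: str) -> Counter:
    """Lower-cased word counts: a crude stand-in for a neural embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, chunks: list[str], k: int = 1) -> list[str]:
    """Retrieval step: return the k chunks most similar to the question."""
    q = bag_of_words(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Augmentation step: insert the retrieved text into a QA-style template."""
    context = "\n".join(context_chunks)
    return f"Context: {context}\nQuestion: {question}\nHelpful answer:"

question = "Within how many days must critical systems be patched?"
toy_prompt = build_prompt(question, retrieve(question, CHUNKS, k=1))
print(toy_prompt)  # generation step: this string is what would be sent to the LLM
```

The variable is named `toy_prompt` so that running this cell does not overwrite the notebook's own `prompt`.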


**Before starting:**

1. **Click Runtime (top menu) > Change runtime type > Python 3, T4 GPU**

2. **Click Connect (top right-hand corner)**

**To run:**

**1. Click the play button to the left of each cell to run it**


## Initialise notebook
Runtime: ~7 min

In [1]:
# Optional: work from a Google Drive folder instead
# %cd /content/drive/MyDrive/Colab Notebooks/Hackathon

# Download the helper notebook from Google Drive
from google_drive_downloader import GoogleDriveDownloader as gdd

# File ID taken from the Google Drive share URL
file_id = '1PFXHKUfiDupWV_K_MLmSWBxajTPvj_B4'
destination = 'dir/dir/utils_tutorial_2_RAG.ipynb'

# If the file is a zip archive, also pass unzip=True
gdd.download_file_from_google_drive(file_id=file_id, dest_path=destination)

Downloading 1PFXHKUfiDupWV_K_MLmSWBxajTPvj_B4 into dir/dir/utils_tutorial_2_RAG.ipynb... Done.


In [2]:
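%cd /content/dir/dir
# Run the helper notebook: it installs and imports packages, downloads the Llama 2 model
# and defines the helper functions used below (e.g. create_prompt, setup_gradio)
%run 'utils_tutorial_2_RAG.ipynb'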

/content/dir/dir
Installing and importing packages
--2024-01-23 09:31:12--  https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf



---

# Running Llama 2 on Colab
(Open-source LLM)

## **Initialise LLM (Llama-2-7b model)**
Optional: Change model and parameters as required




In [3]:
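# LlamaCpp and callback_manager come from the helper notebook run above.
# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="llama-2-7b-chat.Q5_K_M.gguf",  # local GGUF file (could be swapped for e.g. llama-2-13b-chat.Q4_0.gguf)
    n_ctx=4096,                # context window in tokens (Llama 2 supports 4096)
    temperature=0.2,           # low temperature -> more deterministic answers
    max_tokens=2000,           # max. number of tokens to generate
    top_p=0.9,                 # nucleus sampling: smallest token set whose probabilities sum to 0.9
    top_k=30,                  # sample only among the 30 most likely tokens
    n_gpu_layers=200,          # layers to offload to GPU; more than the model has offloads all of them
    n_batch=200,               # number of prompt tokens processed in parallel
    callback_manager=callback_manager,
    verbose=True,              # verbose is required when passing a callback manager
)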

AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
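The `temperature`, `top_k` and `top_p` arguments above control how each next token is sampled. The sketch below applies the three filters to a toy distribution; it applies temperature first, as many libraries do, but llama.cpp's own sampler order may differ. `filter_distribution` and the toy logits are hypothetical.

```python
import math

def filter_distribution(logits: dict[str, float], temperature: float,
                        top_k: int, top_p: float) -> dict[str, float]:
    """Temperature -> softmax -> top-k -> top-p; returns renormalised probabilities."""
    # Temperature scaling: dividing logits by T < 1 sharpens the distribution.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    m = max(scaled.values())  # subtract the max for a numerically stable softmax
    exps = {tok: math.exp(v - m) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = sorted(((tok, e / total) for tok, e in exps.items()),
                   key=lambda kv: kv[1], reverse=True)

    # Top-k: keep only the k most likely tokens.
    probs = probs[:top_k]

    # Top-p (nucleus): keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}

toy_logits = {"yes": 2.0, "no": 1.0, "maybe": 0.5, "banana": -3.0}
print(filter_distribution(toy_logits, temperature=1.0, top_k=30, top_p=0.9))
print(filter_distribution(toy_logits, temperature=0.2, top_k=30, top_p=0.9))
```

With `temperature=0.2`, as in the cell above, almost all probability mass collapses onto the top token, which is why the low setting gives more repeatable answers.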


## **Customise prompt template**
To edit: change `qa_template` as needed.

In [4]:
qa_template = """Use the following pieces of information to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Context: {context}
Question: {question}
Only return the helpful concise answer below and nothing else.
Helpful answer:"""

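# create_prompt (from the helper notebook) wraps the template in Llama 2's chat format, as printed below
prompt_template = create_prompt(qa_template)
print(prompt_template)
prompt = PromptTemplate(template=prompt_template, input_variables=['context', 'question'])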

[INST]<<SYS>>
You are a helpful, respectful and honest assistant.
<</SYS>>

Use the following pieces of information to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Context: {context}
Question: {question}
Only return the helpful concise answer below and nothing else.
Helpful answer:[/INST]
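The printed template follows Llama 2's chat layout: a system prompt inside `<<SYS>>` markers, all wrapped in `[INST] ... [/INST]`. A hypothetical re-implementation that reproduces the output above is sketched here; the helper's real `create_prompt` may differ. Filling `{context}` and `{question}` is then plain string formatting, which is what `PromptTemplate` does at query time.

```python
# Hypothetical sketch; the helper notebook's create_prompt may be implemented differently.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
DEFAULT_SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant."

def llama2_chat_prompt(instruction: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    """Wrap an instruction in Llama 2's [INST]/<<SYS>> chat format."""
    return B_INST + B_SYS + system_prompt + E_SYS + instruction + E_INST

example_template = """Use the following pieces of information to answer the user's question.
Context: {context}
Question: {question}
Helpful answer:"""

wrapped = llama2_chat_prompt(example_template)
# Fill the two placeholders the way the retrieval chain would at query time.
print(wrapped.format(context="<retrieved text>", question="<user question>"))
```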


## Add document QA using LangChain for RAG
Optional: change the embedding model as needed.

In [5]:
model_name = "sentence-transformers/all-MiniLM-L12-v2"
model_kwargs = {'device': 'cuda'}                # requires a GPU runtime; change to 'cpu' otherwise
encode_kwargs = {'normalize_embeddings': True}   # unit-length vectors, so dot product = cosine similarity
# HuggingFaceEmbeddings is imported by the helper notebook
embedding_func = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


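Setting `normalize_embeddings=True` rescales every embedding to unit length. For unit vectors the dot product equals the cosine similarity, so normalised embeddings can be compared with a plain dot product. A pure-Python check with made-up 3-dimensional vectors (real all-MiniLM-L12-v2 embeddings have 384 dimensions):

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    n = math.sqrt(dot(v, v))
    if n == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / n for x in v]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0, 0.0]  # made-up example vectors
b = [4.0, 3.0, 5.0]
print(cosine_similarity(a, b))          # similarity of the raw vectors
print(dot(normalize(a), normalize(b)))  # same value after unit-length normalisation
```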

## Launch Gradio UI

To do:
1. Upload CCOP.pdf
2. Type a question in the chatbox
3. To end, click the stop button at the top left of the cell

(Make sure the icon at the top left of the cell shows that the cell is running.)

If an error occurs, restart the cell by clicking stop, then play.
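Before an uploaded PDF such as CCOP.pdf can be searched, a RAG pipeline typically splits its text into overlapping chunks and embeds each one. The helper notebook's actual splitter and settings are not shown here; the sketch below is a hypothetical fixed-size character splitter (`split_into_chunks`), simpler than LangChain's separator-aware splitters.

```python
def split_into_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Fixed-size character windows; consecutive chunks share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # this window already reached the end of the text
            break
    return chunks

print(split_into_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3))
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some text twice.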

In [6]:
# Reset the global state used by the helper's Gradio app
COUNT = 0
chat_history = []
qa = None

setup_gradio()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://5edb482ae590336bd4.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://5edb482ae590336bd4.gradio.live




---

# Using AzureOpenAI as LLM
(Closed-source LLM)

## **Initialise LLM (GPT-3.5 / GPT-4)**
Optional: change model and parameters as required. Run either the GPT-3.5 cell or the GPT-4 cell.




## GPT-3.5

In [7]:
# GPT-3.5
import os
from getpass import getpass

# Set up API access (AzureOpenAI is imported by the helper notebook)
os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_VERSION"] = "2023-05-15"
os.environ["OPENAI_API_BASE"] = "https://dto-testing.openai.azure.com/"
# Enter the key at the prompt instead of hard-coding it in a shared notebook
os.environ["OPENAI_API_KEY"] = getpass("Azure OpenAI API key: ")

llm_OAI = AzureOpenAI(
    temperature=0.1,
    deployment_name="DTO_demo",     # name of the GPT-3.5 (text-davinci-003) deployment in Azure
    model_name="text-davinci-003",
)

## Or GPT-4


In [None]:
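# GPT-4
import os
from getpass import getpass

os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_VERSION"] = "2023-03-15-preview"
os.environ["OPENAI_API_BASE"] = "https://ccm-3.openai.azure.com/"
# Enter the key at the prompt instead of hard-coding it in a shared notebook
os.environ["OPENAI_API_KEY"] = getpass("Azure OpenAI API key: ")
os.environ["AZURE_ENDPOINT"] = "https://ccm-3.openai.azure.com/openai"

# Note: Azure GPT-4 deployments are typically served through the chat completions API;
# if this completion-style wrapper fails, LangChain's AzureChatOpenAI may be needed instead.
llm_OAI = AzureOpenAI(
    temperature=0.1,
    deployment_name="ccm-gpt4",     # name of the GPT-4 deployment in Azure
    model_name="gpt-4",
)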
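Both cells above configure the client through environment variables. A small hypothetical check (`missing_env_vars`) can catch an unset or empty value before `AzureOpenAI` is constructed; the list mirrors the four variables set in the GPT-3.5 cell and is an assumption about what a given deployment needs.

```python
import os

# Assumed list, mirroring the variables set in the GPT-3.5 cell above.
REQUIRED_AZURE_VARS = ["OPENAI_API_TYPE", "OPENAI_API_VERSION", "OPENAI_API_BASE", "OPENAI_API_KEY"]

def missing_env_vars(names: list[str]) -> list[str]:
    """Return the names that are unset or empty in the current environment."""
    return [name for name in names if not os.environ.get(name)]

missing = missing_env_vars(REQUIRED_AZURE_VARS)
if missing:
    print("Set these before creating llm_OAI:", ", ".join(missing))
else:
    print("All Azure OpenAI settings are present.")
```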



In [8]:
qa_template = """Use the following pieces of information to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Context: {context}
Question: {question}
Only return the helpful concise answer below and nothing else.
Helpful answer:"""

# GPT models do not need the Llama 2 [INST] wrapper, so the template is used as-is
prompt = PromptTemplate(template=qa_template, input_variables=['context', 'question'])
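The `{context}` slot in this template is filled with the retrieved chunks, and the filled prompt must fit in the model's context window (for example `n_ctx=4096` tokens in the Llama 2 setup earlier). A hypothetical `stuff_context` helper that concatenates ranked chunks under a budget is sketched below; it counts characters as a crude proxy for tokens, whereas a real pipeline would count with the model's tokenizer.

```python
def stuff_context(chunks: list[str], max_chars: int, separator: str = "\n\n") -> str:
    """Concatenate chunks in rank order, stopping before the character budget is exceeded."""
    selected, used = [], 0
    for chunk in chunks:
        extra = len(chunk) + (len(separator) if selected else 0)
        if used + extra > max_chars:
            break  # keep higher-ranked chunks; drop the rest
        selected.append(chunk)
        used += extra
    return separator.join(selected)

demo_template = "Context: {context}\nQuestion: {question}\nHelpful answer:"
ranked_chunks = ["first chunk", "second chunk", "third chunk that is long"]
print(demo_template.format(context=stuff_context(ranked_chunks, max_chars=30),
                           question="What does the first chunk say?"))
```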

## Launch Gradio UI

To do:
1. Upload CCOP.pdf
2. Type a question in the chatbox
3. To end, click the stop button at the top left of the cell

(Make sure the icon at the top left of the cell shows that the cell is running.)

If an error occurs, restart the cell by clicking stop, then play.

In [9]:
COUNT = 0
chat_history = []
setup_gradio_OAI()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://bd75dedd767de60306.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://bd75dedd767de60306.gradio.live
