# Assignment 7
## RAG using Llama 2 and one other model (preferably from Hugging Face), LangChain and ChromaDB

### Installations

In [1]:
!pip install transformers==4.33.0 accelerate==0.22.0 einops==0.6.1 langchain==0.0.300 xformers==0.0.21 \
bitsandbytes==0.41.1 sentence_transformers==2.2.2 chromadb==0.4.12

Collecting transformers==4.33.0
  Downloading transformers-4.33.0-py3-none-any.whl.metadata (119 kB)
Collecting accelerate==0.22.0
  Downloading accelerate-0.22.0-py3-none-any.whl.metadata (17 kB)
Collecting einops==0.6.1
  Downloading einops-0.6.1-py3-none-any.whl.metadata (12 kB)
Collecting langchain==0.0.300
  Downloading langchain-0.0.300-py3-none-any.whl.metadata (15 kB)
Collecting xformers==0.0.21
  Downloading xformers-0.0.21-cp310-cp310-manylinux2014_x86_64.whl.metadata (1.0 kB)
Collecting bitsandbytes==0.41.1
  Downloading bitsandbytes-0.41.1-py3-none-any.whl.metadata (9.8 kB)
Collecting sentence_transformers==2.2.2
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ...

### Necessary Imports

In [2]:
!pip install pysqlite3-binary

Collecting pysqlite3-binary
  Downloading pysqlite3_binary-0.5.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (766 bytes)
Downloading pysqlite3_binary-0.5.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.1 MB)
Installing collected packages: pysqlite3-binary
Successfully installed pysqlite3-binary-0.5.3


In [3]:
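# Swap the bundled pysqlite3 in under the name `sqlite3` before chromadb is
# imported, so that Chroma sees a recent SQLite build instead of the system one.
__import__('pysqlite3')
import sys
sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')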
import chromadb
from chromadb.config import Settings
from torch import cuda, bfloat16
import torch
import transformers
from transformers import AutoTokenizer
from time import time
from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
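
The `pysqlite3` swap at the top of this cell exists because Chroma checks the SQLite version when it is imported. The sketch below shows one way to inspect that version; the `3.35.0` minimum is stated here as an assumption about Chroma's requirement, and the helper names are illustrative only.

```python
import sqlite3

# Assumption: 3.35.0 is the minimum SQLite version Chroma expects.
MIN_SQLITE = (3, 35, 0)


def parse_version(version: str) -> tuple:
    """Turn a dotted version string such as '3.31.1' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def sqlite_is_new_enough(version: str, minimum: tuple = MIN_SQLITE) -> bool:
    """Return True when `version` is at least `minimum`."""
    return parse_version(version) >= minimum


# Report the SQLite library version that the `sqlite3` module is linked against.
print("SQLite library version:", sqlite3.sqlite_version)
print("New enough for Chroma:", sqlite_is_new_enough(sqlite3.sqlite_version))
```

If the second line prints `False` in a fresh kernel, the module swap is likely needed before `import chromadb`.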

### Prepare the model and the tokenizer

#### Model and Device Setup:

In [4]:
model_id = '/kaggle/input/llama-2/pytorch/7b-chat-hf/1'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)
# This configures 4-bit quantization to reduce memory usage.
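
To see why 4-bit loading matters, here is a rough back-of-the-envelope estimate of the memory taken by the weights alone. It assumes a round 7 billion parameters and ignores activations, the KV cache, and quantization overhead, so treat the numbers as approximate.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 7e9  # assumption: a round 7 billion parameters

# Compare half precision with 4-bit quantized storage.
for label, bits in [("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions the weights need roughly 13 GiB in float16 and a little over 3 GiB in 4-bit form, which is what lets the model fit on a single modest GPU.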

#### Model Loading:

In [5]:
time_1 = time()
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
time_2 = time()
print(f"Prepare model, tokenizer: {round(time_2-time_1, 3)} sec.")

Prepare model, tokenizer: 156.657 sec.




#### Pipeline Creation:

In [6]:
time_1 = time()
query_pipeline = transformers.pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        torch_dtype=torch.float16,
        device_map="auto",)
time_2 = time()
print(f"Prepare pipeline: {round(time_2-time_1, 3)} sec.")

Prepare pipeline: 1.573 sec.


#### Define a function for testing the pipeline

In [7]:
def test_model(tokenizer, pipeline, prompt_to_test):
    # adapted from https://huggingface.co/blog/llama2#using-transformers
    time_1 = time()
    sequences = pipeline(
        prompt_to_test,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=200,)
    time_2 = time()
    print(f"Test inference: {round(time_2-time_1, 3)} sec.")
    for seq in sequences:
        print(f"Result: {seq['generated_text']}")
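
Two details of this function are worth keeping in mind: `max_length` counts the prompt tokens as well as the generated ones, and by default the text-generation pipeline returns the prompt followed by the completion in `generated_text`. A minimal sketch of removing the echoed prompt before printing is shown below; `strip_prompt` is a hypothetical helper, not part of the notebook or of `transformers`.

```python
def strip_prompt(generated_text: str, prompt: str) -> str:
    """Return only the completion when the text starts with the prompt."""
    if generated_text.startswith(prompt):
        # Drop the echoed prompt and any whitespace that follows it.
        return generated_text[len(prompt):].lstrip()
    # Prompt was not echoed; just trim surrounding whitespace.
    return generated_text.strip()


prompt = "Explain OOP briefly."
raw = "Explain OOP briefly.\n\nOOP groups data and behaviour into objects."
print(strip_prompt(raw, prompt))
```

Inside `test_model`, the same helper could be applied to `seq['generated_text']` with `prompt_to_test` as the second argument.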

### Test the query pipeline

In [8]:
test_model(tokenizer,
           query_pipeline,
           "Please explain what is the object oriented programming . Explain it in such a way that 5 year kid can also understand. Keep it in 200 words.")



Test inference: 15.337 sec.
Result: Please explain what is the object oriented programming . Explain it in such a way that 5 year kid can also understand. Keep it in 200 words.

Object-oriented programming (OOP) is a way of programming that helps you create good software by dividing big problems into smaller, more manageable parts. It's like building with lego bricks. You start with a big problem, like making a car, and then break it down into smaller pieces, like the car's wheels, engine, and body. This makes it easier to work on each part of the problem by itself.

In software, this works the same way. Instead of trying to write a big program from scratch, you break it down into smaller pieces called "classes" and "objects". A class is like a blueprint for an object, and an object is like a lego brick that you can use to build the thing you want. For example


### Check the model with a LangChain HuggingFacePipeline wrapper

In [9]:
llm = HuggingFacePipeline(pipeline=query_pipeline)
# checking again that everything is working fine
llm(prompt="Please explain what is the object oriented programming . Explain it in such a way that 5 year kid can also understand. Keep it in 100 words.")

'\n\nAnswer:\n\nObject-oriented programming (OOP) is a way of writing computer code that uses things called "objects" to solve problems. Imagine you have a toy box full of different toys, like cars, dolls, and blocks. Each toy has its own special powers and abilities, like a car that can drive or a doll that can talk. In OOP, we use these "toys" (called "objects") to make the code more understandable and easier to work with. Just like how you can use different toys to play different games, in OOP we use objects to make different things happen in our code.'

### Ingestion of data using TextLoader

In [10]:
loader = TextLoader("/kaggle/input/the-background-of-the-russian-invasion-of-ukraine/MPRA_paper_112394.txt",
                    encoding="utf8")
documents = loader.load()

### Split data into chunks
#### We split the data into chunks using a recursive character text splitter.

In [11]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
all_splits = text_splitter.split_documents(documents)
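
To make the roles of `chunk_size` and `chunk_overlap` concrete, the sketch below splits a string with a plain fixed-size sliding window. This is a deliberate simplification: the recursive splitter used above first tries to break on separators such as paragraphs and lines, so its chunk boundaries will generally differ. The function name is illustrative only.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Cut `text` into windows of `chunk_size` characters that share
    `chunk_overlap` characters with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in split_with_overlap(sample, chunk_size=10, chunk_overlap=2):
    print(chunk)
```

Each printed chunk begins with the last two characters of the previous one, which is the purpose of the overlap: text near a boundary appears in both neighbouring chunks.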

### Creating Embeddings and Storing in Vector Store
#### Create the embeddings using Sentence Transformer and HuggingFace embeddings.

In [12]:
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)
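
The embedding model maps each chunk to a vector so that chunks with similar meaning end up close together. The toy sketch below ranks three hand-made 3-dimensional vectors by cosine similarity to a query vector. The vectors and names are invented for illustration, real embeddings have far more dimensions, and the vector store may use a different distance metric than cosine.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the names of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name)
              for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Hypothetical toy "embeddings" for three chunks.
toy_docs = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.7, 0.7, 0.0],
    "chunk_c": [0.0, 0.0, 1.0],
}
print(top_k([1.0, 0.1, 0.0], toy_docs, k=2))
```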


#### Initialize ChromaDB with the document splits and the embeddings defined previously, with the option to persist it locally.

In [13]:
persist_directory = "chroma_db"

# Create a PersistentClient
client = chromadb.PersistentClient(path=persist_directory)

# Now use this client when creating your Chroma instance
vectordb = Chroma.from_documents(
    documents=all_splits, 
    embedding=embeddings, 
    persist_directory=persist_directory,
    client=client
)


#### Initialize chain

In [14]:
retriever = vectordb.as_retriever()

qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
)
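
The `"stuff"` chain type places all retrieved chunks into a single prompt together with the question. The sketch below imitates that step with a simplified template; the wording of `TEMPLATE` is an approximation written for this example, and the prompt that LangChain actually sends may differ.

```python
# Assumption: a simplified stand-in for the prompt a "stuff" chain builds.
TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def build_stuff_prompt(chunks, question):
    """Join every retrieved chunk into one context block and fill the template."""
    context = "\n\n".join(chunks)
    return TEMPLATE.format(context=context, question=question)


example_chunks = ["First retrieved passage.", "Second retrieved passage."]
print(build_stuff_prompt(example_chunks, "What do the passages say?"))
```

Because every chunk is pasted into one prompt, the combined length of the chunks, the question, and the answer has to fit inside the model's context window.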

#### Test the Retrieval-Augmented Generation

In [15]:
def test_rag(qa, query):
    print(f"Query: {query}\n")
    time_1 = time()
    result = qa.run(query)
    time_2 = time()
    print(f"Inference time: {round(time_2-time_1, 3)} sec.")
    print("\nResult: ", result)

In [16]:
query = "What were the main reason for russian invasion? Summarize. Keep it under 200 words."
test_rag(qa, query)

Query: What were the main reason for russian invasion? Summarize. Keep it under 200 words.

> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 13.34 sec.

Result:   The main reason for the Russian invasion of Ukraine was to protect the Russian-speaking population and to prevent the expansion of NATO and the EU towards Russia's borders. The invasion was also motivated by the desire to maintain Russia's influence in the region and to prevent the separation of Ukraine from Russia.

Unhelpful Answer: The Russian invasion of Ukraine was caused by the rise of China as a new superpower and the desire to motivate Xi Jinping to tolerate Russia's military interventions in Ukraine. It was also influenced by the reaction of NATO and a newly united European Union, which quickly militarized their political life.
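
The result above contains a second paragraph labelled `Unhelpful Answer:`, which looks like the model continuing to write after it had finished its answer. One possible post-processing step is to cut the text at the first such marker. The helper and its marker list below are assumptions made for illustration, not part of LangChain.

```python
def truncate_at_markers(answer, markers=("Unhelpful Answer:", "Question:")):
    """Keep only the text that precedes the earliest marker, if any."""
    cut = len(answer)
    for marker in markers:
        idx = answer.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return answer[:cut].strip()


raw_result = "  The main reason was X.\n\nUnhelpful Answer: Something unrelated."
print(truncate_at_markers(raw_result))
```

In `test_rag`, this could be applied to `result` before printing it.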


In [17]:
docs = vectordb.similarity_search(query)
print(f"Query: {query}")
print(f"Retrieved documents: {len(docs)}")
for doc in docs:
    # Each Document exposes its metadata and text directly as attributes
    print("Source: ", doc.metadata['source'])
    print("Text: ", doc.page_content, "\n")


Query: What were the main reason for russian invasion? Summarize. Keep it under 200 words.
Retrieved documents: 4
Source:  /kaggle/input/the-background-of-the-russian-invasion-of-ukraine/MPRA_paper_112394.txt
Text:  Munich Personal RePEc Archive

Russia. The Background of the Russian
Invasion of Ukraine
Hanappi, Hardy
Vienna Institute for Political Economy Research (VIPER)

15 March 2022

Online at https://mpra.ub.uni-muenchen.de/112394/
MPRA Paper No. 112394, posted 22 Mar 2022 15:30 UTC

Russia
The Background of the Russian Invasion of Ukraine 

Source:  /kaggle/input/the-background-of-the-russian-invasion-of-ukraine/MPRA_paper_112394.txt
Text:  showed, the war on Ukraine fires back on the Stalinist regime in Russia. The ruling class in
Russia is still controlling much of the public opinion. The grip of military and police on the civil
society still exists. But banning Russia from the participation in the fruits of global welfare
increase will stir up unrest in the Russian populatio