### RAG Introduction with LangChain/ChromaDB and LLAMA.CPP (OpenAI API)

Here we build a RAG application using functions provided by existing libraries.
We will use:
- [LLAMA.CPP](https://github.com/ggml-org/llama.cpp) as the engine
  to run embedding and large language models.
- [ChromaDB](https://www.trychroma.com/) as an open source vector
  database
- [LangChain](https://python.langchain.com) as the framework for prompt
  engineering and RAG

We implement a chatbot that answers questions about the following article
on the evolution of the omicron COVID-19 variant
in France: [Retrospective analysis of SARS-CoV-2 omicron invasion over delta in French regions in 2021–22: a status-based multi-variant model](https://bmcinfectdis.biomedcentral.com/articles/10.1186/s12879-022-07821-5)

To this end we downloaded the PDF version of the article and converted it
to plain text with `pdftotext`, which comes with the [Poppler library](https://poppler.freedesktop.org/). The resulting file is provided as `s12879-022-07821-5.txt`.

For this course we need to run models using the [LLAMA.CPP](https://github.com/ggml-org/llama.cpp) engine in the background.
We shall serve the `all-MiniLM-L6-v2` embedding model:
```
llama-server -m all-MiniLM-L6-v2-Q8_0.gguf --host localhost --port 8081 --embedding
```
and the `Llama-3.2-3B-Instruct` LLM:
```
llama-server -m Llama-3.2-3B-Instruct-Q8_0.gguf -c 10000 --host localhost --port 8080
``` 
with a maximum context size of 10000 tokens.

Here we start with a general initialization of our system. As multiple participants of this course run their code on the same node, we generate random port numbers for our REST services. Further, we obtain the system's IP address to bind our REST servers (embedding, LLM) to.

In [13]:
import random
import socket
ip = socket.gethostbyname(socket.gethostname())
# Draw two distinct ports so that the two servers can never be given the same one
embedding_port, llm_port = (str(p) for p in random.sample(range(40000, 50001), 2))
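
A randomly drawn port can still be in use by another participant on the same node. As a minimal sketch of an alternative, one can ask the operating system for a port that is free at that moment by binding to port 0. The helper `find_free_port` and the `*_alt` variable names are hypothetical and not part of the original notebook, and a small race remains between releasing the port here and starting the server.

```python
import socket

def find_free_port(host="127.0.0.1"):
    """Ask the operating system for a currently unused TCP port on `host`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))  # port 0 lets the OS choose a free port
        return s.getsockname()[1]

# Hypothetical replacements for the randomly drawn ports
embedding_port_alt = str(find_free_port())
llm_port_alt = str(find_free_port())
print(embedding_port_alt, llm_port_alt)
```

In the notebook one would pass `ip` as the `host` argument so that the port is checked on the address the servers bind to.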

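We launch the REST servers (llama.cpp) in the background. The LLM loaded here is the file `llama-3.1-8B-I-Q8.gguf` from the shared course directory rather than the 3B model shown in the command-line example above, and `-ngl 32` asks llama.cpp to offload 32 layers to the GPU.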

In [14]:
from subprocess import Popen, DEVNULL

public_root = '/leonardo/pub/userexternal/thaschka/'
llama_cpp_server = public_root + 'llama.cpp/bin/llama-server'
llm_model = public_root + 'llm/llama-3.1-8B-I-Q8.gguf'
embedding_model = public_root + 'embed/all-MiniLM-L6-v2-Q8_0.gguf' 

embedding_process = Popen([llama_cpp_server, 
                           '-m', embedding_model, 
                           '--host', ip, 
                           '--port', embedding_port, 
                           '--embedding'],
                          stdout=DEVNULL,
                          stderr=DEVNULL)

llm_process = Popen([llama_cpp_server,
                     '-m', llm_model,
                     '--host', ip,
                     '--port', llm_port,
                     '-c', '10000',
                     '-ngl', '32'],
                     stdout=DEVNULL,
                     stderr=DEVNULL)
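
`Popen` returns immediately, so the servers may not yet accept connections when the next cells run. The following is a minimal sketch of a wait loop that polls a TCP port until it opens; `wait_for_port` is a hypothetical helper, and the timeout values are assumptions to be adapted.

```python
import socket
import time

def wait_for_port(host, port, timeout=120.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means some process is listening on the port
            with socket.create_connection((host, int(port)), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

In the notebook this could be called as `wait_for_port(ip, embedding_port)` and `wait_for_port(ip, llm_port)`. An open port only shows that the server process is listening; the model may still be loading, so the first request can take longer or fail.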

To make sure that we have the right encoding on the machine, we set the preferred encoding to UTF-8.

In [15]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [16]:
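from textwrap import wrap
# Read the whole article; the context manager closes the file automatically
with open(public_root + 's12879-022-07821-5.txt', 'r', encoding='utf-8') as f:
    text = f.read()
# Split the text into chunks of at most 1000 characters
wrapped_text = wrap(text, 1000)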

We have wrapped the text into chunks containing a maximum of 1000 characters each. *wrap* only breaks at whitespace, hence most chunks are a bit shorter. We have obtained 53 such chunks.
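
The behaviour of *wrap* can be checked on a small example. The sample string and the width of 60 are made up for illustration; the sketch shows that no chunk exceeds the requested width and that, with the default settings, newlines inside the text are replaced by spaces.

```python
from textwrap import wrap

# Made-up sample text with embedded newlines
sample = "RAG splits a long document into chunks.\nEach chunk is embedded separately. " * 5
chunks = wrap(sample, 60)

# No chunk is longer than the requested width
assert all(len(c) <= 60 for c in chunks)
# Newlines were replaced by spaces (default replace_whitespace=True)
assert all("\n" not in c for c in chunks)

print(len(chunks), [len(c) for c in chunks])
```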

LangChain provides client classes that talk to our local llama.cpp models over REST.
As llama.cpp exposes the same API as OpenAI, we can use LangChain's OpenAI chat model.
The embedding API should be compatible as well, but in practice it has some quirks.
We therefore use LangChain's LocalAIEmbeddings class for the embeddings,
which works together with our llama.cpp server.

In [17]:
from langchain_openai import ChatOpenAI
from langchain_localai import LocalAIEmbeddings
llm = ChatOpenAI(openai_api_base='http://' + ip + ':' + llm_port + '/v1',
                openai_api_key="whatever")

Let us try things out:

In [18]:
print(llm.invoke("Hello how are you today?"))

content="I'm just a computer program, so I don't have feelings, but thank you for asking! I'm functioning properly and ready to help you with any questions or tasks you may have. How about you? How's your day going so far?" additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 51, 'prompt_tokens': 16, 'total_tokens': 67, 'completion_tokens_details': None, 'prompt_tokens_details': None}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': 'b5209-d2b2031e', 'id': 'chatcmpl-AgcQJ1CAm1yUXuocB8O2EQRGAjmKG4cs', 'finish_reason': 'stop', 'logprobs': None} id='run-fd856549-2beb-4f18-9bd1-859f8c3f9301-0' usage_metadata={'input_tokens': 16, 'output_tokens': 51, 'total_tokens': 67, 'input_token_details': {}, 'output_token_details': {}}


In [19]:
vectors = LocalAIEmbeddings(openai_api_base="http://" + ip + ":" +embedding_port + "/v1",
                            openai_api_key="forget",
                            model="all-MiniLM-L6-v2",
                            openai_api_version='v1',)

In [20]:
len(vectors.embed_documents(wrapped_text))

53

In the next step we have to set up a vector database.
As opposed to our first sample, we use ChromaDB here, although many other
databases would work. Only execute this cell once, or delete your `persist_directory`
and restart the notebook's kernel before running it again; otherwise the same chunks may be added to the database a second time.

In [21]:
from langchain.vectorstores import Chroma
vectordb = Chroma.from_texts(texts=wrapped_text, 
                             embedding=vectors, 
                             persist_directory="chroma_db")
retriever = vectordb.as_retriever(search_kwargs={"k": 10})
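
Conceptually, the retriever embeds the question and returns the `k` stored chunks whose vectors are closest to it. The following is a minimal sketch of that idea with cosine similarity on toy two-dimensional vectors. The functions `cosine_similarity` and `top_k` are hypothetical and only illustrate the principle; the index structure and distance metric that Chroma actually uses may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k document vectors most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy data standing in for chunk embeddings and a query embedding
toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
toy_query = [0.9, 0.1]
best = top_k(toy_query, toy_docs, k=2)
print(best)  # [0, 2]
```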

Just as in the last case, we have to deal with a chat template.
Here we can use LangChain's tools to include the history and the contexts.
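
The special tokens that appear in the template of the next cell follow the Llama 3 instruct chat layout. As a minimal sketch, the hypothetical function `render_llama3` below renders a list of (role, content) pairs in that layout by hand; it is independent of LangChain and only meant to show where each token goes.

```python
def render_llama3(messages):
    """Render (role, content) pairs in the Llama 3 instruct layout and
    leave the prompt open for the assistant's reply."""
    parts = ["<|begin_of_text|>"]
    for role, content in messages:
        # Each turn: role header, blank line, content, end-of-turn token
        parts.append(f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>")
    # An open assistant header asks the model to produce the next turn
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

rendered = render_llama3([
    ("system", "You are a helpful AI assistant."),
    ("user", "What is the 20-day window?"),
])
print(rendered)
```

An OpenAI-style chat endpoint usually applies the model's chat template on the server side, so whether the manually written tokens are needed depends on the server setup.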

In [22]:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
prompt = ChatPromptTemplate([
    # System message: instructions plus the retrieved chunks ({context})
    ("system", "<|start_header_id|>system<|end_header_id|> \n\n" + \
    "You are a helpful AI assistant answering prompts, " + \
    "taking the following contexts into account " + \
    "as well as you can as you answer. \n\n" + \
    "Contexts:\n"
    "{context}<|eot_id|>"),
    # Earlier turns of the conversation
    MessagesPlaceholder("chat_history"),
    # Current question, followed by the header that opens the assistant's reply
    ("human","<|start_header_id|>user<|end_header_id|> {input}<|eot_id|> \n\n" + \
    "<|start_header_id|>assistant<|end_header_id|>")])
                            

Here we build our retrieval chain with the template above. For each question, the retriever fetches the 10 most similar chunks (`k=10`) from the Chroma vector database, and the chain inserts them into the `{context}` placeholder of the prompt.

In [23]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

question_answer_chain = create_stuff_documents_chain(llm, prompt)

rag_chain = create_retrieval_chain(retriever,question_answer_chain) 


Finally, we also have to handle the chat history and define a query function.

In [24]:
chat_history = []

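def chat_query(question):
    # Retrieve context for the question and generate an answer
    a = rag_chain.invoke({"input": question, "chat_history": chat_history})

    # Wrap both turns in the Llama 3 header tokens before storing them
    human_msg = "<|start_header_id|>user<|end_header_id|>" + question + "<|eot_id|>"
    ai_answer = "<|start_header_id|>assistant<|end_header_id|>" + a['answer'] + "<|eot_id|>"

    # Remember the exchange so that follow-up questions can refer to it
    chat_history.extend([
        ("human", human_msg),
        ("ai", ai_answer)
        ])
    print(a['answer'])
    return a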
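
Since every question adds two entries to `chat_history` and each request also carries up to 10 retrieved chunks, a long conversation can eventually exceed the context size of the model. One possible remedy, sketched here under the assumption that older turns may simply be dropped, is to keep only the most recent exchanges. The helper `trim_history` and the limit of `max_turns` are hypothetical and not part of the original notebook.

```python
def trim_history(history, max_turns=4):
    """Keep only the last `max_turns` (human, ai) pairs of a flat message list."""
    if max_turns <= 0:
        return []
    # Two entries per turn: one human message and one ai message
    return history[-2 * max_turns:]

# Toy history with six question/answer turns
toy_history = []
for i in range(6):
    toy_history.extend([("human", f"question {i}"), ("ai", f"answer {i}")])

trimmed = trim_history(toy_history, max_turns=2)
print(trimmed)
```

Inside `chat_query` one could then pass `trim_history(chat_history)` instead of the full list when invoking the chain.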

Now we can run our chatbot and ask some questions related to our text.

In [25]:
a = chat_query("Tell me about the 20-day window of opportunity")

The 20-day window of opportunity refers to a specific time period used in the study to analyze the replacement of the delta variant by the omicron variant in metropolitan France. This window is defined as a 20-day period that starts 10 days before the inflection point, where the omicron variant exceeds 50% of reported samples.

The inflection point is considered to be the midpoint of the 20-day window, which is the 10th day. This means that the first 10 days of the window are used to model the initial growth of the omicron variant, and the last 10 days are used to model the replacement of the delta variant by the omicron variant.

The authors of the study chose this 20-day window for several reasons:

1. The omicron variant was still relatively rare 10 days prior to the inflection point, so the initial growth of the omicron variant could be accurately modeled.
2. The 20-day window allows for a sufficient number of days to observe the replacement of the delta variant by the omicron vari

In [26]:
a = chat_query("At what dates did the 20-day window of opportunity happen in france?")

According to the study, the 20-day window of opportunity occurred in metropolitan France during the period from December 2021 to January 2022.

The exact dates of the 20-day window of opportunity varied by region, but the study mentions that the invasion of the omicron variant, specifically the lineage BA.1, occurred approximately 3 weeks after its first detection in metropolitan France.

Given that the omicron variant was first detected in France in early December 2021, the 20-day window of opportunity likely started around December 10-15, 2021, and ended around January 5-10, 2022.

Here are the estimated dates for the 20-day window of opportunity in some of the French metropolitan regions, based on the study:

* Ile-de-France region: December 12, 2021 - January 1, 2022
* Hauts-de-France region: December 15, 2021 - January 4, 2022
* Nouvelle-Aquitaine region: December 10, 2021 - January 30, 2022
* Occitanie region: December 12, 2021 - January 2, 2022

Please note that these dates are 

In [11]:
embedding_process.terminate()

In [12]:
llm_process.terminate()
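
The two cells above stop the background servers once we are done. `terminate()` only sends the termination signal and does not wait for the process to exit. A minimal sketch of a more thorough shutdown follows; `stop_process` is a hypothetical helper, the timeout is an assumption, and a harmless stand-in process is used instead of `llama-server`.

```python
import subprocess
import sys

def stop_process(proc, timeout=10):
    """Terminate `proc`, wait for it to exit, and kill it if it does not."""
    proc.terminate()
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # The process ignored the termination request, so force it to stop
        proc.kill()
        proc.wait()
    return proc.returncode

# Stand-in process that would otherwise sleep for a minute
dummy = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = stop_process(dummy)
print(exit_code)
```

In the notebook this would be called as `stop_process(embedding_process)` and `stop_process(llm_process)`.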