# Jupyter Notebook to set up a RAG application for posing questions to a set of Word documents in a local folder

### Set up the Ollama model and embeddings

In [7]:
# Code for accessing Ollama and posing questions
# Test whether invoking a model and posing a question works

MODEL = "gemma:2b"
# MODEL = "mistral"
# MODEL = "llama2"


from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings

model = Ollama(model=MODEL)
embeddings = OllamaEmbeddings(model=MODEL)

# Actual method to invoke a model and ask a question:
#model.invoke("Tell me a joke")

"What do you call a guy who's always telling jokes but never gets laughed at?\n\nA joke-a-day man!"

## Get Documents to apply the RAG to.
Get a list of paths to the .docx documents I want to load into memory.

In [2]:
import os

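data_path = os.getenv("DOCUMENTS_PATH")
if data_path is None:
    # Fail early with a clear message instead of a cryptic TypeError further down
    raise RuntimeError("Set the DOCUMENTS_PATH environment variable first")

# Retrieve the path names of files in the specified directory
docx_file_paths = [os.path.join(data_path, file) for file in os.listdir(data_path)]
# Keep only the Word documents
docx_file_paths_cleaned = [item for item in docx_file_paths if item.endswith('.docx')]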
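
A possible variation on the listing step, sketched with `pathlib`. The helper name `list_docx_files` and the demo file names are made up for this example. It also skips the `~$...` lock files that Word leaves next to open documents, since those end in `.docx` too but are not real documents.

```python
import os
import tempfile
from pathlib import Path


def list_docx_files(folder):
    """Return sorted paths of .docx files in `folder`, skipping Word lock files (~$...)."""
    return sorted(
        str(p)
        for p in Path(folder).glob("*.docx")
        if p.is_file() and not p.name.startswith("~$")
    )


# Demo on a throwaway folder with hypothetical file names
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("a.docx", "b.txt", "~$a.docx"):
        Path(demo_dir, name).touch()
    demo_names = [os.path.basename(p) for p in list_docx_files(demo_dir)]

print(demo_names)
```

In the notebook this would be called as `list_docx_files(data_path)`.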

### Load in files to apply RAG to
Load the docx files into memory using langchain's Docx2txtLoader class. The loader's load_and_split method then chunks the documents.

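For now it seems to chunk automatically, apparently per page. I'm not sure that is really what happens, though: Docx2txtLoader extracts plain text without page information, and as far as I can tell load_and_split falls back to langchain's default RecursiveCharacterTextSplitter when no splitter is passed. Either way, I want to see if I can reduce the chunk size, as this is hopefully better for creating the embeddings. I'm running into serious performance issues on my MBA when creating embeddings for chunks of this size. Perhaps smaller chunks offer better performance, but I don't know yet.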

It seems that the nltk library offers helpful functionality for tokenization. Perhaps I can get the chunk size down to sentences or

In [10]:
from langchain_community.document_loaders import Docx2txtLoader
import nltk
from nltk.tokenize import sent_tokenize

# Create one loader per document
loaders = [Docx2txtLoader(path) for path in docx_file_paths_cleaned]

# Chunk a single document (index 8) as a test
test_pages = loaders[8].load_and_split()


# To find out the number of chunks
number_of_chunks = len(test_pages)
print(f"The document is split into {number_of_chunks} chunks.")

The document is split into 2 chunks.
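
The idea of smaller, sentence-based chunks mentioned above could be sketched roughly as follows. This is a stdlib-only illustration, not langchain's or nltk's implementation. The helper name `split_into_sentence_chunks`, the `max_chars` parameter and the naive regex sentence boundary are all assumptions made for this example. nltk's `sent_tokenize` would likely cope better with abbreviations and similar edge cases.

```python
import re


def split_into_sentence_chunks(text, max_chars=500):
    """Greedily pack whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept whole as its own chunk.
    """
    # Naive sentence boundary: '.', '!' or '?' followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the current chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "First sentence. Second sentence! Third one? Fourth."
sentence_chunks = split_into_sentence_chunks(sample, max_chars=35)
print(sentence_chunks)
```

The resulting strings could then be wrapped in langchain `Document` objects before they go into the vector store.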


### Create embeddings
For now the vector store lives in memory, using the DocArrayInMemorySearch class from the langchain_community library. The goal is to write it to disk eventually, which would probably free up RAM (check ChromaDB?).

In [9]:
from langchain_community.vectorstores import DocArrayInMemorySearch

vectorstore = DocArrayInMemorySearch.from_documents(test_pages, embedding=embeddings)

retriever = vectorstore.as_retriever()

# Test the vector store with a sample query
retriever.invoke("Gewenste datum realisatie")

[Document(page_content='Initiatief: BRN-06-01a Corrigeren en aanvullen gerelateerdengegevens\n\nFase 1: Binnen gemeentelijk\n\n\nVoordat een initiatief voor impactanalyse aan RvIG kan worden aangeboden moet voldoende duidelijk zijn wat het probleem is en de gewenste ‘business’-oplossing. Ook moet gespecificeerd zijn wat de acceptatiecriteria zijn die worden gesteld aan de oplossingsrichting(en) voor implementatie. \n\nDeze template beschrijft de inhoud van een initiatiefbeschrijving om “ready” te zijn voor impactanalyse.\n\nInitiatieven worden beschreven door het Realisatieteam en aangeboden aan het portfolio-overleg\n\n\n\nOnderwerp \n\nCorrigeren en aanvullen van gerelateerdengegevens op de persoonslijst\n\nSamenvatting van het initiatief\n\nTijdens de ontwikkeling van Operatie BRP is gebleken dat de gerelateerdengegevens op de persoonslijsten in de BRP hiaten en fouten bevatten. In totaal ging het om miljoenen sets gerelateerdengegevens. Dit initiatief richt zich op inconsistenties 
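
Conceptually, the retriever ranks the stored chunks by how similar their embedding vectors are to the embedding of the query. The toy sketch below illustrates that idea with hand-made 3-dimensional vectors. The vectors, the names `cosine_similarity` and `top_k`, and the choice of cosine similarity as the metric are assumptions for illustration. It does not describe DocArrayInMemorySearch's internals.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the names of the k vectors most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Made-up "embeddings" for three chunks
toy_vectors = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.7, 0.7, 0.0],
    "chunk_c": [0.0, 0.0, 1.0],
}

best = top_k([1.0, 0.1, 0.0], toy_vectors, k=2)
print(best)
```

Real embeddings have hundreds or thousands of dimensions, but the ranking step works the same way in this simplified picture.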

## Create Template


Templates help 

In [None]:
from langchain.prompts import PromptTemplate

template = """
Answer the question based on the context below. If you can't 
answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""

prompt = PromptTemplate.from_template(template)

# Inspect what the filled-in prompt looks like:
print(prompt.format(context="Here is some context", question="Here is a question"))
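
To make it visible what ends up in the `{context}` slot, here is a plain-Python sketch that fills the same template text with `str.format`. The helper `build_prompt` and the sample chunks are hypothetical. Joining the chunks with blank lines is one possible choice, not something the chain in this notebook does by itself.

```python
TEMPLATE = """
Answer the question based on the context below. If you can't
answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""


def build_prompt(chunks, question):
    """Join the text chunks with blank lines and fill them into the template."""
    context = "\n\n".join(chunks)
    return TEMPLATE.format(context=context, question=question)


filled = build_prompt(["Chunk one.", "Chunk two."], "What is in chunk two?")
print(filled)
```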


## Create Chain

In [None]:
from operator import itemgetter

chain = (
    {
        "context": itemgetter("question") | retriever,
        "question": itemgetter("question"),
    }
    | prompt
    | model
)
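
The data flow through the chain can be traced step by step with stand-ins. This is only a sketch of the idea: `fake_retriever`, `fake_model` and `run_chain` are hypothetical placeholders, and the shortened prompt text is not the template from this notebook. None of it uses langchain.

```python
from operator import itemgetter


def fake_retriever(question):
    """Hypothetical stand-in for the retriever: returns a list of 'chunks'."""
    return [f"(chunk relevant to: {question})"]


def fake_model(prompt_text):
    """Hypothetical stand-in for the LLM: echoes the prompt it received."""
    return f"MODEL SAW: {prompt_text}"


def run_chain(inputs):
    # Step 1: build the dict that feeds the prompt.
    # "context" is the question passed through the retriever, "question" is passed along as-is.
    question = itemgetter("question")(inputs)
    prompt_inputs = {
        "context": fake_retriever(question),
        "question": question,
    }
    # Step 2: fill the prompt
    prompt_text = "Context: {context}\nQuestion: {question}".format(**prompt_inputs)
    # Step 3: hand the filled prompt to the model
    return fake_model(prompt_text)


answer = run_chain({"question": "What is the code?"})
print(answer)
```

In this sketch the retrieved list lands in the prompt as its string representation, brackets and all. That is worth keeping in mind when inspecting what the real model receives.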


## Run the code

In [None]:

#chain.invoke(
#    {
#
#    "question": "Geef de code van dit initiatief, deze heeft de vorm van AAA-11-11 " 
#    }
#)