# In this notebook I will start to work with RAG using Langchain.
I will try to load some PDF files using a Langchain loader.  The PDF files contain images rather than text, which poses a challenge: the text has to be extracted with OCR.

## VENV Setup
I used Chroma as the vector store.  It required Python 3.11, so I set up a venv specific to this notebook.

```powershell
py -3.11 -m venv winvenv31
```
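
Because the notebook depends on a specific interpreter version, a quick check in the first cell can confirm that the kernel is really running from the new venv. This is only a minimal sketch: the helper name `is_expected_python` is hypothetical, and the 3.11 default simply mirrors the requirement noted above.

```python
import sys

def is_expected_python(version_info=sys.version_info, expected=(3, 11)):
    """Return True when the interpreter's major.minor version matches `expected`."""
    return tuple(version_info[:2]) == tuple(expected)

# Warn rather than fail, so the rest of the notebook can still be inspected.
if not is_expected_python():
    print(f"Warning: running Python {sys.version_info.major}.{sys.version_info.minor}, expected 3.11")
```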

## Langchain Unstructured Data Loader
This loader can be pointed at a folder.  It will attempt to load all files in the folder.
https://python.langchain.com/docs/modules/data_connection/document_loaders/file_directory

The unstructured data loader uses the Python unstructured library.  Its documentation is here
https://github.com/Unstructured-IO/unstructured#installing-the-library

There are some requirements that must be met to do PDF OCR.

Install the following system dependencies if they are not already available on your system. Depending on what document types you're parsing, you may not need all of these.

    libmagic-dev (filetype detection)
    poppler-utils (images and PDFs)
    tesseract-ocr (images and PDFs, install tesseract-lang for additional language support)
    libreoffice (MS Office docs)
    pandoc (EPUBs, RTFs and Open Office docs)



## Installing poppler
Download the Windows Poppler binaries from here
https://github.com/oschwartz10612/poppler-windows/releases

Add the bin folder to the Windows path as described here
https://stackoverflow.com/questions/73669337/how-to-how-to-install-poppler-from-the-tar-file-downloaded-from-poppler-officia

## Installing tesseract
Notes on the Windows install are here
https://tesseract-ocr.github.io/tessdoc/Installation.html

Download the installer from here
https://github.com/UB-Mannheim/tesseract/wiki

Run the installer.

Add C:\Program Files\Tesseract-OCR to the Windows path.
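
Before running the loader it can help to confirm that both tools are actually visible on the path. The sketch below uses only the standard library. The helper name `find_ocr_tools` is hypothetical, and it assumes that `pdftoppm` (one of the executables in the Poppler bin folder) and `tesseract` are the commands that matter here.

```python
import shutil

def find_ocr_tools(tools=("pdftoppm", "tesseract")):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}

for tool, path in find_ocr_tools().items():
    print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

If a tool shows as not found right after editing the path, the Jupyter kernel may need a restart, since a running process keeps the environment it started with.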


In [None]:
%pip install langchain
%pip install cmake
%pip install pytesseract
%pip install unstructured
%pip install "unstructured[pdf]"

## Loading PDF files
The directory loader can import a whole folder of miscellaneous files.  In this example I am loading a folder that contains only PDF files, to test OCR.

In [1]:
from langchain.document_loaders import DirectoryLoader

loader = DirectoryLoader('D:\\Temp\\OldCrsToTestAiRAG', glob="**/*.pdf", use_multithreading=True)
docs = loader.load()
len(docs)



41
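
As a sanity check on that count, the number of files matching the same glob pattern can be counted independently of the loader. This is a rough sketch with a hypothetical helper name (`count_pdfs`). It assumes the loader yields one document per PDF file, which may not hold for every loader configuration.

```python
from pathlib import Path

def count_pdfs(folder):
    """Count files under `folder` (searched recursively) that match **/*.pdf."""
    return sum(1 for p in Path(folder).glob("**/*.pdf") if p.is_file())
```

Calling `count_pdfs('D:\\Temp\\OldCrsToTestAiRAG')` and comparing the result with `len(docs)` would show whether any file was skipped during loading.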

## Splitting
Next the loaded docs are split into chunks, because whole docs won't fit into a context window.  Consecutive chunks overlap (here by 200 of 1000 characters) so that context is kept between chunks.

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

len(splits)


392
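
To make the overlap idea concrete, here is a simplified fixed-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter also tries to break on separators such as paragraphs and lines); it is only a sketch of how `chunk_size` and `chunk_overlap` interact, and the function name `chunk_with_overlap` is made up for illustration.

```python
def chunk_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

sample = "abcdefghij" * 3  # 30 characters
for chunk in chunk_with_overlap(sample, chunk_size=12, chunk_overlap=4):
    print(chunk)
```

Each printed chunk starts with the last four characters of the previous one, which is the property that keeps a sentence cut at a boundary readable in at least one chunk.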

## Vector Store
Next a vector store (vector DB) must be created to store the chunk vectors.  There are a variety of options.  For this notebook I will use chromadb as described here
https://python.langchain.com/docs/use_cases/question_answering/local_retrieval_qa

Chroma requires Python 3.11.  It did not work with 3.12 when I tried it.  The issue to track is here
https://github.com/chroma-core/chroma/issues/1410

In [None]:
%pip install chromadb
%pip install gpt4all

In [3]:
from langchain.vectorstores import Chroma
from langchain.embeddings import GPT4AllEmbeddings

vectorstore = Chroma.from_documents(documents=splits, embedding=GPT4AllEmbeddings())

In [None]:
# Test fetching something from the vector store
question = "Locked high radiation barrier inconsistent with "
docs = vectorstore.similarity_search(question)
print(len(docs))  # only the last expression in a cell is displayed, so print explicitly

# Note: documents do contain metadata, like file name/path
docs[0]
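
Conceptually, a similarity search embeds the question and returns the stored chunks whose vectors are closest to it. The toy example below illustrates that idea with cosine similarity over hand-made three-dimensional vectors. The vectors and labels are invented for illustration, and Chroma's actual index and distance metric may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, indexed, k):
    """Return the k (label, score) pairs most similar to query_vec, best first."""
    scored = [(label, cosine_similarity(query_vec, vec)) for label, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up vectors standing in for real embeddings.
toy_index = [
    ("chunk about radiation barriers", [0.9, 0.1, 0.0]),
    ("chunk about flushing a monitor", [0.1, 0.9, 0.1]),
    ("chunk about survey flags", [0.0, 0.2, 0.9]),
]
print(top_k([1.0, 0.0, 0.0], toy_index, k=2))
```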

## Setting up the llm
In this case I'll use KoboldCPP.
I tried running llama-2-13b.Q6_K, but the results weren't impressive.
Using orca-2-13b.Q5_K_M gives better results.

In [31]:
from langchain.llms import KoboldApiLLM

llm = KoboldApiLLM(endpoint='http://localhost:5001/v1', temperature=0.1, max_context_length=4096)

## Test a query with the docs embedded
This is just a test to see if the LLM and the retrieved docs are both working.

In [None]:
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

prompt = PromptTemplate.from_template(
    "Summarize the main themes in these retrieved docs: {docs}"
)

llm_chain = LLMChain(llm=llm, prompt=prompt)

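question = "What happened when flushing RM-1664?"
# The question is only used for retrieval here; the prompt itself asks for a summary.
docs = vectorstore.similarity_search(question)
# The prompt has a single input variable ({docs}), so the list can be passed directly.
result = llm_chain(docs)

result["text"]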

## Trying to include sources
https://python.langchain.com/docs/use_cases/question_answering/#adding-sources

In [33]:
from operator import itemgetter
from langchain.schema import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.runnables import RunnableParallel

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})

# This was the example template from the docs
# template = """Use the following pieces of context to answer the question at the end.
# If you don't know the answer, just say that you don't know, don't try to make up an answer.
# Use three sentences maximum and keep the answer as concise as possible.
# Always say "thanks for asking!" at the end of the answer.
# {context}
# Question: {question}
# Helpful Answer:"""

template = """Use the following pieces of CONTEXT to ANSWER the QUESTION at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
Always say "thanks for asking!" at the end of the answer.

CONTEXT
{context}

QUESTION
{question}

ANSWER
"""
rag_prompt_custom = PromptTemplate.from_template(template)


rag_chain_from_docs = (
    {
        "context": lambda input: format_docs(input["documents"]),
        "question": itemgetter("question"),
    }
    | rag_prompt_custom
    | llm
    | StrOutputParser()
)
rag_chain_with_source = RunnableParallel(
    {"documents": retriever, "question": RunnablePassthrough()}
) | {
    "documents": lambda input: [doc.metadata for doc in input["documents"]],
    "answer": rag_chain_from_docs,
}

rag_chain_with_source.invoke("What actions were taken when the backhoe ran over the flags at Foxbird Island?")

{'documents': [{'source': 'D:\\Temp\\OldCrsToTestAiRAG\\I3110_04-66.pdf'},
  {'source': 'D:\\Temp\\OldCrsToTestAiRAG\\04-66.pdf'},
  {'source': 'D:\\Temp\\OldCrsToTestAiRAG\\04-62.pdf'},
  {'source': 'D:\\Temp\\OldCrsToTestAiRAG\\04-93.pdf'},
  {'source': 'D:\\Temp\\OldCrsToTestAiRAG\\04-68.pdf'},
  {'source': 'D:\\Temp\\OldCrsToTestAiRAG\\04-89.pdf'}],
 'answer': 'When the backhoe ran over the FSS flags at Foxbird Island, immediate actions included instructing workers to suspend activities that would further disturb the FSS markings, notifying Manafort management, and notifying the FSS and Remediation Superintendent.'}
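
The `documents` entry holds full Windows paths, and several chunks can come from the same file. A small helper can reduce it to a list of unique file names for display. This is a sketch only: `source_file_names` is a hypothetical name, and the `example` dictionary is a made-up result in the same shape as the output above, with a repeated source added to show the de-duplication.

```python
from pathlib import PureWindowsPath

def source_file_names(result):
    """Return the unique file names found in result['documents'], in retrieval order."""
    names = []
    for meta in result["documents"]:
        # PureWindowsPath parses backslash-separated paths on any operating system.
        name = PureWindowsPath(meta["source"]).name
        if name not in names:
            names.append(name)
    return names

# Made-up result in the same shape as the chain output.
example = {
    "documents": [
        {"source": "D:\\Temp\\OldCrsToTestAiRAG\\I3110_04-66.pdf"},
        {"source": "D:\\Temp\\OldCrsToTestAiRAG\\04-66.pdf"},
        {"source": "D:\\Temp\\OldCrsToTestAiRAG\\04-66.pdf"},
    ],
    "answer": "...",
}
print(source_file_names(example))  # ['I3110_04-66.pdf', '04-66.pdf']
```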