### Revolutionizing PDF Analysis: Harnessing Offline LLAMA2 and Milvus for Efficient RAG Creation

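Make sure to install the required libraries first: langchain, langchain-community, pypdf, and pymilvus. The examples also assume a local Ollama installation with the llama2 model already pulled.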

In [144]:
from langchain_community.llms import Ollama
model = Ollama(model="llama2")


Verify that the model is functioning correctly.

In [149]:
from langchain_core.output_parsers import StrOutputParser

parser = StrOutputParser()
chain = model | parser
chain.invoke("How many provinces does Pakistan have?")

'\nPakistan has 4 provinces:\n\n1. Punjab\n2. Sindh\n3. Khyber Pakhtunkhwa (formerly known as the North-West Frontier Province)\n4. Balochistan'

Let's load the PDF and split it into multiple documents based on the chunk size.

In [150]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import CharacterTextSplitter

loader = PyPDFLoader("EJ1172284.pdf")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
pages = text_splitter.split_documents(documents)
pages

[Document(page_content="The EUROCALL Review,  Volume 25, No. 2, September 2017  \n \n 18 Research paper  \n \nA look at advanced learners’ use of mobile devices for \nEnglish language study: Insights from interview data  \nMariusz Kruk  \nUniversity of Zielona Gora, Poland  \n____________________________________________ __________________  \nmkruk @ uz.zgora.pl  \n  \nAbstract  \nThe paper discusses the results of a study which explored advanced learners of English \nengagement  with their mobile devices to develop learning experiences that meet their \nneeds and goals as foreign language learners. The data were collected from 20 students \nby means of a semi -structured interview. The gathered data were subjected to \nqualitative and q uantitative analysis. The results of the study demonstrated that , on the \none hand , some subjects manifested heightened awareness relating to the \nadvantageous role of mobile devices in their learning endeavors, their ability to reach \nfor suitable
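
To make the chunking step more concrete, here is a simplified, pure-Python sketch of the idea behind a character-based splitter: split the text on a separator, then merge the pieces until adding another one would exceed the chunk size. It is an illustration under simplifying assumptions, not LangChain's implementation; `split_text` and its parameters are hypothetical names. The `"\n\n"` separator mirrors the default used by `CharacterTextSplitter`.

```python
def split_text(text, chunk_size=1000, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept as its own oversized chunk.
    """
    chunks = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "alpha\n\nbeta\n\ngamma delta"
print(split_text(sample, chunk_size=12))
```

Because pieces are only cut at the separator, a chunk can exceed the requested size when the text contains a long run without a blank line, which is worth keeping in mind when a PDF page has few paragraph breaks.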

Let's attempt a basic chain using a prompt template.

In [151]:
from langchain.prompts import PromptTemplate

template = """
Answer the question based on the context below. If you
cannot answer the question, reply "Sorry! I am not sure."

context: {context}

Question: {question}

"""

promt = PromptTemplate.from_template(template)
promt.format(context="Here is some context", question="Here is some question")


'\nAnswer the question based on the context below. If you\ncannot answer the question, reply "Sorry! I am not sure."\n\ncontext: Here is some context\n\nQuestion: Here is some question\n\n'

In [152]:
chain = promt | model | parser
#chain.input_schema.schema()

chain.invoke({"context": "This is RAG tutorial with Milvus as Vector Store", "question": "What is this tutorial about?"})

'Sure, I can help you with that! Based on the context you provided, this tutorial is likely about using Milvus as a vector store for object recognition and detection tasks in RAG (Robotics, Automation, and Graphics).'

I have set up a local Docker instance for Milvus. Let's establish a connection with it.

In [154]:
from pymilvus import MilvusClient

milvus_host = "localhost"
milvus_port = "19530"
milvus_uri = f"http://{milvus_host}:{milvus_port}"

client = MilvusClient(
    uri=milvus_uri
)

Define the schema of the collection and index the embeddings field.

In [155]:
from pymilvus import DataType

schema = MilvusClient.create_schema(
    enable_dynamic_field=True
)

# Add fields to schema

schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, auto_id=True)
schema.add_field(field_name="source", datatype=DataType.VARCHAR, max_length=64000)
schema.add_field(field_name="page", datatype=DataType.INT64)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=64000)
schema.add_field(field_name="vector", datatype=DataType.FLOAT_VECTOR, dim=4096)

# Prepare index parameters
index_params = client.prepare_index_params()

# Add indexes
index_params.add_index(
    field_name="id"
)

index_params.add_index(
    field_name="vector", 
    index_type="AUTOINDEX",
    metric_type="IP"
)

# Create a collection
client.create_collection(
    collection_name="custom_llm_rag",
    schema=schema,
    index_params=index_params
)
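
The vector index above uses `metric_type="IP"` (inner product). As a rough illustration of what that metric ranks by, the sketch below does a brute-force inner-product search over a few toy vectors in plain Python. It is not how Milvus or `AUTOINDEX` work internally; `inner_product`, `top_k`, and the sample rows are hypothetical names and data.

```python
def inner_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query, rows, k=2):
    """Score every row against the query and return the k highest-scoring (score, text) pairs."""
    scored = [(inner_product(query, row["vector"]), row["text"]) for row in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


rows = [
    {"text": "chunk A", "vector": [1.0, 0.0]},
    {"text": "chunk B", "vector": [0.5, 0.5]},
    {"text": "chunk C", "vector": [0.0, 3.0]},  # long vector, pointing away from the query
]

print(top_k([2.0, 1.0], rows, k=2))
```

In this toy data "chunk C" ranks first even though it points in a different direction than the query, because inner product rewards vector length as well as direction. The ranking matches cosine similarity only when the vectors are normalized to unit length.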


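We'll use OllamaEmbeddings with the llama2 model. The default llama2 model produces 4096-dimensional embeddings, which is why the schema declares dim=4096 for the vector field.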

In [156]:
from langchain_community.embeddings import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="llama2")

# Embed each chunk's text (embed_documents expects a list of strings)
# and pair the resulting vector with the chunk's metadata.
doc_embeddings = [
    {
        "source": doc.metadata["source"],
        "page": int(doc.metadata["page"]),
        "text": doc.page_content,
        "vector": embeddings.embed_documents([doc.page_content])[0],
    }
    for doc in pages
]

doc_embeddings[0]

{'source': 'EJ1172284.pdf',
 'page': 0,
 'text': "The EUROCALL Review,  Volume 25, No. 2, September 2017  \n \n 18 Research paper  \n \nA look at advanced learners’ use of mobile devices for \nEnglish language study: Insights from interview data  \nMariusz Kruk  \nUniversity of Zielona Gora, Poland  \n____________________________________________ __________________  \nmkruk @ uz.zgora.pl  \n  \nAbstract  \nThe paper discusses the results of a study which explored advanced learners of English \nengagement  with their mobile devices to develop learning experiences that meet their \nneeds and goals as foreign language learners. The data were collected from 20 students \nby means of a semi -structured interview. The gathered data were subjected to \nqualitative and q uantitative analysis. The results of the study demonstrated that , on the \none hand , some subjects manifested heightened awareness relating to the \nadvantageous role of mobile devices in their learning endeavors, their abili
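
Before inserting, it can help to check that every row has the fields the schema expects and that each vector has the declared dimension, since a mismatch would otherwise only surface as an insert error. The helper below is a minimal sketch of such a check; `validate_rows` and `REQUIRED_KEYS` are hypothetical names, and the demo uses a tiny dimension of 3 instead of 4096 to stay readable.

```python
REQUIRED_KEYS = {"source", "page", "text", "vector"}


def validate_rows(rows, dim=4096):
    """Return a list of human-readable problems; an empty list means the rows look consistent."""
    problems = []
    for i, row in enumerate(rows):
        missing = REQUIRED_KEYS - row.keys()
        if missing:
            problems.append(f"row {i}: missing keys {sorted(missing)}")
            continue
        if not isinstance(row["page"], int):
            problems.append(f"row {i}: page must be an int")
        if len(row["vector"]) != dim:
            problems.append(f"row {i}: vector has {len(row['vector'])} values, expected {dim}")
    return problems


good = {"source": "example.pdf", "page": 0, "text": "hello", "vector": [0.1, 0.2, 0.3]}
bad = {"source": "example.pdf", "page": "1", "text": "hello", "vector": [0.1, 0.2]}
print(validate_rows([good, bad], dim=3))
```

In the notebook this would be called as `validate_rows(doc_embeddings)` with the default dimension, proceeding to the insert only if the returned list is empty.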

Insert the rows into the Milvus collection in bulk.

In [157]:

# Insert data
res = client.insert(
    collection_name="custom_llm_rag",
    data=doc_embeddings
)

print(res)


{'insert_count': 11, 'ids': [448530671951237109, 448530671951237110, 448530671951237111, 448530671951237112, 448530671951237113, 448530671951237114, 448530671951237115, 448530671951237116, 448530671951237117, 448530671951237118, 448530671951237119]}
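
With only 11 rows a single insert call is fine. For larger PDFs it may be preferable to send the rows in smaller batches so that no single request grows too large. The generator below is a minimal sketch of that idea; `batched` and the batch size of 4 are illustrative assumptions, not a Milvus requirement.

```python
def batched(rows, batch_size):
    """Yield consecutive slices of rows, each with at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


rows = [{"page": i} for i in range(11)]
sizes = [len(batch) for batch in batched(rows, 4)]
print(sizes)  # [4, 4, 3]
```

In the notebook, each batch could then be passed to `client.insert(collection_name="custom_llm_rag", data=batch)` in a loop.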


Let's define a retriever for the Milvus vector store.

In [158]:
from langchain_community.vectorstores import Milvus

vectorstore = Milvus(
    embeddings,
    connection_args={"host": milvus_host, "port": milvus_port},
    collection_name="custom_llm_rag",
)

In [159]:
vectorstore.as_retriever(
    search_kwargs={"expr": 'source == "EJ1172284.pdf"'}
).get_relevant_documents("Literature Review")

[Document(page_content='The EUROCALL Review,  Volume 25, No. 2, September 2017  \n \n 28 Little, D. (2009). Language learner autonomy and the European Language Portfolio: \nTwo L2 English examples.  Language Teaching , 42, 222-233. \nLittle, D. (1991).  Learner autonomy 1: Definitions, issues and problems . Dublin: \nAuthentik.  \nLittlewood, W. (1996). Autonomy: An anatomy and a framework.  System , 24(4), 427 -\n435. \nMiangah, T.M. & Nezarat, A. (2012). Mobile -assisted language \nlearning.  International  Journal of Distributed and Parallel Systems, 3 (1), 309 -319. \nNah, K.C., White, P. & Sussex, R. (2008). The potential of using a mobile phone to \naccess the internet  for learning EFL listening skills within a Korean \ncontext.  ReCALL , 20(3), 331 -347. \nOz, H. (2015). An investigation of preservice English teachers’ percep tions of mobile \nassisted language learning.  English Language Teaching , 8(2), 22 -34. \nPettit, J. & Kukulska -Hulme, A. (2007). Going with the grain: 

In [160]:
from operator import itemgetter
retriever = vectorstore.as_retriever()
chain = (
    {
        "context": itemgetter("question") | retriever,
        "question": itemgetter("question"),
    }
    | promt
    | model
    | parser
)
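
The chain above feeds the incoming question to two branches: one passes it through the retriever to build the context, the other forwards it unchanged, and both results fill the prompt template before it reaches the model. The sketch below replays that data flow with plain Python stand-ins so it runs without Ollama or Milvus. Everything in it is hypothetical: `fake_retriever` does naive word matching instead of vector search, `fake_model` just reports the prompt length, and `format_docs` (joining chunk texts into one string) is an optional refinement that the chain above does not apply.

```python
from operator import itemgetter

TEMPLATE = (
    "Answer the question based on the context below.\n\n"
    "context: {context}\n\n"
    "Question: {question}\n"
)


def fake_retriever(query):
    """Stand-in for the Milvus retriever: return chunks that share a word with the query."""
    corpus = [
        "Mobile devices support language study.",
        "Data were collected through interviews.",
    ]
    words = set(query.lower().split())
    return [c for c in corpus if words & set(c.lower().rstrip(".").split())]


def format_docs(chunks):
    """Join retrieved chunk texts into a single context string."""
    return "\n\n".join(chunks)


def fake_model(prompt_text):
    """Stand-in for the LLM: report how much text it received."""
    return f"[model saw {len(prompt_text)} characters]"


def run_chain(inputs):
    """Mirror the chain: map the question to context and question, fill the prompt, call the model."""
    question = itemgetter("question")(inputs)
    mapped = {
        "context": format_docs(fake_retriever(question)),
        "question": question,
    }
    prompt_text = TEMPLATE.format(**mapped)
    return prompt_text, fake_model(prompt_text)


prompt_text, answer = run_chain({"question": "How were data collected?"})
print(prompt_text)
print(answer)
```

Printing the filled prompt like this is a quick way to see exactly what context the model receives for a given question.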

Now, let's ask the offline Llama 2 model questions about the PDF document.

In [164]:
questions = [
    "What is this paper about?",
    "Summarize the paper",
    "What are the findings of the research?"
]

for question in questions:
    print(f"Question: {question}")
    print(f"Answer: {chain.invoke({'question': question})}")
    print()

Question: What is this paper about?
Answer: The paper discusses the concept of autonomy in foreign/second language learning and teaching, with a focus on the literature review and analysis of the topic. The author provides an overview of the definition and components of autonomy in language learning, as well as the social dimensions of learner autonomy and the role of classroom-based approaches in fostering autonomy. The paper also discusses the notion of autonomous language learning and how it differs from other concepts in the field. Overall, the paper provides a comprehensive analysis of the concept of autonomy in foreign/second language learning and teaching, drawing on a range of sources and perspectives to provide a nuanced understanding of the topic.

Question: Summarize the paper
Answer: The article discusses a study that explored advanced learners of English engagement with mobile devices for language learning. The study collected data through semi-structured interviews from 2