# Chat With Your Research Paper

Here are the things I want to use: <br>
- Built on the LangChain framework.
- The LLM is Llama 3.1 (70B Instruct Turbo), served through the TogetherAI API.
- Split the documents using `CharacterTextSplitter` with overlap from LangChain.
- Embed the PDF chunks with an embedding model. The code below uses `togethercomputer/m2-bert-80M-32k-retrieval` through the TogetherAI API; all-mpnet-base-v2 from HuggingFace is a local alternative.
- Use FAISS to store the embeddings and to run the vector search.
- Improve the results with a query rewriter and prompt engineering.

Some alternatives or improvements: <br>
- Other frameworks such as LlamaIndex are on par with LangChain.
- You can use any other LLM provider, such as Anthropic (Claude), OpenAI (ChatGPT), Perplexity, and many more.
- Or you can host a model on your own computer using Ollama (although that puts a heavy workload on your PC).
- A semantic chunker might yield better results than a plain text splitter (but it is also heavy on your PC).
- Using an embedding model from an LLM provider keeps the workload on your PC light.
- You don't have to use the query rewriter or prompt engineering; they are just simple improvements.

In [1]:
import os
# Change this to the folder that contains your PDF/ directory
os.chdir("/Users/komangandikawirasantosa/Chat-With-Your-ResearchPaper")

## Importing Library

In [2]:
# For data preprocessing and for loading the chunks into FAISS
from glob import glob
import getpass
import os
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import PyPDFLoader
from langchain_together import TogetherEmbeddings

In [3]:
# ChatOpenAI works with TogetherAI because it exposes an OpenAI-compatible endpoint
from langchain_openai import ChatOpenAI

In [9]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough

## Together API

In [4]:
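# ChatOpenAI reads OPENAI_API_KEY. Since the LLM below points at TogetherAI's
# endpoint, both variables should hold your TogetherAI key.
os.environ["OPENAI_API_KEY"] = getpass.getpass("TogetherAI API key (for ChatOpenAI): ")
os.environ["TOGETHER_API_KEY"] = getpass.getpass("TogetherAI API key (for embeddings): ")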
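
If you rerun the notebook often, typing the key every time gets tedious. Here is a minimal sketch of one option: only prompt when the variable is not already set. `ensure_key` is a hypothetical helper (not part of LangChain or the TogetherAI SDK), and `EXAMPLE_API_KEY` is a dummy name used so the demo does not prompt.

```python
import getpass
import os


def ensure_key(name: str) -> str:
    """Return os.environ[name], prompting for it only if it is missing."""
    if not os.environ.get(name):
        os.environ[name] = getpass.getpass(f"{name}: ")
    return os.environ[name]


# Demo with a dummy variable that is already set, so nothing is prompted here.
os.environ["EXAMPLE_API_KEY"] = "dummy-value"
key = ensure_key("EXAMPLE_API_KEY")
print(len(key))
```

In the notebook you would call `ensure_key("OPENAI_API_KEY")` and `ensure_key("TOGETHER_API_KEY")` instead.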

## Splitting and Chunking the Document

In [5]:
paper_paths = glob("PDF/*.pdf")
pages = []

# Chunks of about 1000 characters; the 200-character overlap keeps text that
# straddles a chunk boundary visible in both neighboring chunks.
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

for path in paper_paths:
    try:
        doc = PyPDFLoader(path).load()  # one Document per PDF page
        pages.extend(text_splitter.split_documents(doc))
    except Exception as e:
        # Skip PDFs that cannot be read, but report why
        print('Skipping', path, e)
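
To make `chunk_size` and `chunk_overlap` more concrete, here is a simplified sliding-window sketch in plain Python. It is not how `CharacterTextSplitter` works internally (that class splits on a separator first and then merges the pieces); `chunk_with_overlap` is a hypothetical helper that only illustrates the overlap idea.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk starts with the last 4 characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk (as long as it is shorter than the overlap).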

## Load the document into the VectorDB

In [6]:
# Embed every chunk through the TogetherAI API and index the vectors in FAISS
embeddings = TogetherEmbeddings(
    model="togethercomputer/m2-bert-80M-32k-retrieval",
)
db = FAISS.from_documents(
    pages,
    embeddings
)

In [7]:
retriever = db.as_retriever()
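
Conceptually, the retriever embeds the query and returns the chunks whose vectors are closest to it. The sketch below shows that idea with a brute-force Euclidean (L2) search in plain Python. It is a stand-in, not FAISS itself: the 3-dimensional vectors are made up for illustration, and `l2_distance` and `top_k` are hypothetical helpers.

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two vectors of the same length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec: list[float], stored: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the texts of the k stored vectors closest to query_vec."""
    ranked = sorted(stored, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Toy "index": (chunk text, made-up embedding)
toy_index = [
    ("chunk about VGG-16", [0.9, 0.1, 0.0]),
    ("chunk about ResNet", [0.7, 0.3, 0.1]),
    ("chunk about cooking", [0.0, 0.1, 0.9]),
]

nearest = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print(nearest)
# ['chunk about VGG-16', 'chunk about ResNet']
```

With the real retriever, the number of returned chunks can be changed with `db.as_retriever(search_kwargs={"k": ...})`.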

## Creating The Chain

In [8]:
llm = ChatOpenAI(
    base_url="https://api.together.xyz/v1",
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    temperature=0.3,
)

In [10]:
# Using XML-style tags (such as <context> </context>) as a way of prompt engineering
template = """
<instruction>
You are an expert in understanding research papers. Answer the question based on the provided context.
</instruction>

Here is the context:
<context>
{context}
</context>

Here is the question:
<question>
{question}
</question>
"""
prompt = ChatPromptTemplate.from_template(template)
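
The `{context}` and `{question}` placeholders are filled in when the chain runs. As a rough illustration, the sketch below does the same substitution with plain `str.format` on a shortened copy of the template. This is a stand-in and not LangChain's implementation (`ChatPromptTemplate` also wraps the result in a chat message); `fill` is a hypothetical helper and the context string is a placeholder.

```python
template = """
<instruction>
Answer the question based on the provided context.
</instruction>

<context>
{context}
</context>

<question>
{question}
</question>
"""


def fill(template: str, context: str, question: str) -> str:
    """Substitute the two placeholders and return the final prompt text."""
    return template.format(context=context, question=question)


filled = fill(
    template,
    context="(retrieved chunks go here)",
    question="What are the disadvantages of VGG-16 models?",
)
print(filled)
```

The tags give the model clear boundaries between the instruction, the retrieved text, and the user's question.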

In [11]:
rewrite_template = """
<instruction>
1. You are a rewriting specialist
2. Rewrite the question into a better search query by removing distractions or by extracting only the question itself
3. Follow the output examples
4. Only output the rewritten question
</instruction>

Here are the output examples:
<output_example>
question 1: "How tall is the Eiffel Tower? It looked so high when i was there last year"
answer 1: "What is the height of the Eiffel Tower?"

question 2: "1 oz is 28 grams, how many cm is 1 inch?"
answer 2: "convert 1 inch to cm"

question 3: "What's the main point of the article? What did the author try to convey?"
answer 3: "What is the main key point of the article"

question 4: "The Bruno Mars concert last night was dope as hell, what is the purpose of EDA in data science?"
answer 4: "What is the purpose of EDA in data science?"
</output_example>

Here is the question:
<question>
{x}
</question>
"""
rewrite_prompt = ChatPromptTemplate.from_template(rewrite_template)

In [12]:
rewrite_retrieve_read_chain = (
    {
        "context": {"x": RunnablePassthrough()} 
                   | rewrite_prompt
                   | llm
                   | StrOutputParser()
                   | retriever,
        "question": RunnablePassthrough()
    }
    | prompt
    | llm
    | StrOutputParser()
)
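
The chain above fans the input out into two branches: the `context` branch rewrites the question and feeds the rewritten text to the retriever, while the `question` branch passes the original question through unchanged. The sketch below mirrors that data flow with plain functions. Everything in it is a stub for illustration: `rewrite`, `retrieve`, and `answer` are hypothetical stand-ins for the LLM rewriter, the FAISS retriever, and the final LLM call.

```python
import re


def rewrite(question: str) -> str:
    """Stub rewriter: keep only the last sentence that ends with '?'."""
    sentences = re.split(r"(?<=[.?!])\s+", question.strip())
    questions = [s for s in sentences if s.endswith("?")]
    return questions[-1] if questions else question


def retrieve(query: str) -> list[str]:
    """Stub retriever: pretend one chunk matched the query."""
    return [f"<chunk matching: {query}>"]


def answer(context: list[str], question: str) -> str:
    """Stub for prompt | llm | StrOutputParser()."""
    return f"Answer to {question!r} using {len(context)} chunk(s)"


def rewrite_retrieve_read(question: str) -> str:
    # Same shape as the dict in the chain: two keys built from one input
    inputs = {
        "context": retrieve(rewrite(question)),  # rewritten query drives the search
        "question": question,                    # original question reaches the final prompt
    }
    return answer(**inputs)


result = rewrite_retrieve_read("The concert was great. What is EDA?")
print(result)
```

Note that only the search uses the rewritten query; the final prompt still sees the user's original wording.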

In [13]:
# First - Normal Question
input_query = "What are the disadvantages of VGG-16 models?"
output = rewrite_retrieve_read_chain.invoke(input_query)

print(output)

The VGG-16 model has some disadvantages, including:

1. The first fully connected layer generates a great number of parameters, which increases the amount of calculation.
2. The small and medium-sized data samples do not perform well in the deep network due to the size limits of the dataset.
3. The limited data scale causes an overfitting problem, which results in the inability of the model to generalize.

These disadvantages are mentioned in the provided context, specifically in the document "Comparative_study_on_the_performance_of_face_recog.pdf" on page 3.
