## RAG: Retrieval Augmented Generation.
- Large language models (LLMs) have a limited context size.
- TLDR
- Not all context is relevant to a given question
- Query -> Search -> Results -> (LLM) -> Answer
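
The flow above can be sketched in a few lines of plain Python. This is only an illustration: `search`, `build_prompt`, `answer`, and `fake_llm` are hypothetical names, the word-overlap ranking stands in for a real search engine, and `fake_llm` stands in for a real model call.

```python
from typing import Callable, List


def search(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy search step: rank passages by how many query words they share."""
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda passage: len(query_words & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, results: List[str]) -> str:
    """Put only the retrieved passages, not the whole corpus, into the prompt."""
    context = "\n".join(results)
    return f"Question: {question}\n---\nContext: {context}"


def answer(question: str, corpus: List[str], llm: Callable[[str], str]) -> str:
    """Query -> Search -> Results -> (LLM) -> Answer."""
    results = search(question, corpus)
    return llm(build_prompt(question, results))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: it only reports how much text it received."""
    return f"(model saw {len(prompt)} characters)"


corpus = [
    "BM25 ranks passages by keyword overlap with the query.",
    "Chunk overlap keeps a concept from being cut at a boundary.",
    "The weather was pleasant on the day of the workshop.",
]

question = "how does bm25 rank passages"
top_passages = search(question, corpus, k=1)
print(build_prompt(question, top_passages))
print(answer(question, corpus, fake_llm))
```

Only the passages returned by `search` reach the prompt, so the prompt stays small no matter how large the corpus grows.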

Note: it is RAG (retrieval), not "SAG" (search), because the retrieving is done by a person.

In [1]:
! pip3 install -qU  markdownify  langchain-upstage rank_bm25

In [9]:
! pip3 install langchain_text_splitters

Collecting langchain_text_splitters
  Downloading langchain_text_splitters-0.0.2-py3-none-any.whl.metadata (2.2 kB)
Downloading langchain_text_splitters-0.0.2-py3-none-any.whl (23 kB)
Installing collected packages: langchain_text_splitters
Successfully installed langchain_text_splitters-0.0.2


In [2]:

%load_ext dotenv
%dotenv
# Loads UPSTAGE_API_KEY from the .env file

In [3]:
import warnings

warnings.filterwarnings("ignore")

In [4]:
from langchain_upstage import UpstageLayoutAnalysisLoader 


layzer = UpstageLayoutAnalysisLoader("pdfs/kim-tse-2008.pdf", output_type="html")
# For improved memory efficiency, consider using the lazy_load method to load documents page by page.
docs = layzer.load()  # or layzer.lazy_load()

In [5]:
from IPython.display import display, HTML

display(HTML(docs[0].page_content[:1000]))

In [6]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_upstage import ChatUpstage

# Adding context means burning more carbon, so pass in only the parts that are needed. That is RAG.

llm = ChatUpstage()

prompt_template = PromptTemplate.from_template(
    """
    Please provide most correct answer from the following context. 
    If the answer is not present in the context, please write "The information is not present in the context."
    ---
    Question: {question}
    ---
    Context: {Context}
    """
)
chain = prompt_template | llm | StrOutputParser()

In [7]:
chain.invoke({"question": "What is bug classficiation?", "Context": docs})  # pass the whole paper (docs) in as context

'To answer the question "What is bug classification?", we can extract the relevant information from the context.\n\nThe context states: "The change classification technique involves two steps: training and classification. The change classification algorithms learn from a training set, that is, a collection of changes that are known to belong to an existing class, that is, the changes are labeled with the known class. Features are extracted from the changes and the classification algorithm learns which features are the most useful for discriminating among the various classes. In this manner, change classification predicts the existence of bugs in software changes."\n\nBased on this information, we can conclude that bug classification is a technique for predicting the existence of bugs in software changes by using machine learning classifiers to determine whether a new software change is more similar to prior buggy changes or clean changes. The classifier is trained using features extrac

In [11]:
from langchain_community.retrievers import BM25Retriever  # LangChain ships with a search engine (retriever); just use it
from langchain_text_splitters import (
    Language,
    RecursiveCharacterTextSplitter,
)

# Chunking: split the documents into smaller pieces.
# A hard cut can split a concept in two, so give the chunks some overlap (usually about 10%, at most 50%).
text_splitter = RecursiveCharacterTextSplitter.from_language(
    chunk_size=1000, chunk_overlap=100, language=Language.HTML
)
splits = text_splitter.split_documents(docs)  # chunk the docs with the splitter

retriever = BM25Retriever.from_documents(splits)  # search over the chunked splits
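
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators); `chunk_with_overlap` is a hypothetical helper that only shows how consecutive chunks share their boundary characters.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, so a concept that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.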

In [12]:
retriever.invoke("What is bug classficiation?")

[Document(page_content="<p id='102' style='font-size:16px'>One assumption of the presentation so far is that a bug is<br>repaired in a single bug-fix change. What happens when a<br>bug is repaired across multiple commits? There are two<br>cases. In the first case, a bug repair is split across multiple<br>commits, with each commit modifying a separate section of<br>the code (code sections are disjoint). Each separate change is<br>tracked back to its initial bug-introducing change, which is<br>then used to train the SVM classifier. In the second case, a bug<br>fix occurs incrementally over multiple commits, with some<br>later fixes modifying earlier ones (the fix code partially<br>overlaps). The first patch in an overlapping code section<br>would be traced back to the original bug-introducing change.<br>Later modifications would not be traced back to the original<br>bug-introducing change. Instead, they would be traced back<br>to an intermediate modification, which is identified as bug",

In [13]:
query = "What is bug classficiation?"
context_docs = retriever.invoke(query)
chain.invoke({"question": query, "Context": context_docs})  # An easy question, yet it reports no answer?
# The reason: this retriever is keyword-based, so it cannot handle a more complex query.

'The information is not present in the context.'
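
A small BM25 sketch helps show why a keyword retriever struggles here. It assumes plain whitespace tokenization (an assumption about the retriever's default preprocessing; check your installed version) and uses a common textbook BM25 formula with a non-negative IDF, which may differ in detail from what `rank_bm25` computes. The two passages and the `bm25_scores` helper are made up for illustration.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, passages: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each passage against the query using exact token matches only."""
    tokenized = [passage.split() for passage in passages]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of passages that contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.split():
            tf = counts[term]
            if tf == 0:
                continue  # a term that never appears verbatim contributes nothing
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


passages = [
    "change classification predicts whether a change is buggy",
    "a bug is repaired in a single bug-fix change",
]

# The token "classficiation?" (misspelled, with a trailing "?") matches nothing.
print(bm25_scores("classficiation?", passages))
# The full question is therefore ranked by "is" and "bug" alone.
misspelled = bm25_scores("What is bug classficiation?", passages)
print(misspelled)
```

In this toy setup the bug-repair passage outranks the definition passage, because the only query tokens that match verbatim are `is` and `bug`.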

In [14]:
query = "What is bug classficiation?"
context_docs = retriever.invoke("bug")
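chain.invoke({"question": query, "Context": context_docs})
# Now an answer comes back. The retriever is crude, so we had to search with a bare keyword ourselves, which does not fit our needs.
# That is why we need embeddings (vector-based search).
# This is a chronic problem of older search engines.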

'Bug classification is a technique that predicts whether there is a bug in any of the lines that were changed in one file in one SCM commit transaction. It is different from previous bug prediction work that focuses on finding prediction or regression models to identify fault-prone or buggy modules, files, and functions. Bug classification can predict buggy changes as soon as a change is made, which is immediate compared to bug predictions at the module, file, or method level.'
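
Vector-based search ranks passages by how close their embedding is to the query's embedding rather than by shared keywords. The sketch below shows only the ranking step with cosine similarity. The three-dimensional vectors and passage labels are invented for illustration; a real system would obtain the vectors from an embedding model.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec: Sequence[float], passage_vecs: Dict[str, Sequence[float]]) -> List[str]:
    """Return passage labels ordered from most to least similar to the query."""
    return sorted(
        passage_vecs,
        key=lambda label: cosine_similarity(query_vec, passage_vecs[label]),
        reverse=True,
    )


# Made-up embeddings: these numbers do not come from any real model.
passage_vectors = {
    "definition of change classification": [0.9, 0.1, 0.0],
    "multi-commit bug repair": [0.3, 0.8, 0.1],
    "author biographies": [0.0, 0.1, 0.9],
}
query_vector = [0.8, 0.3, 0.0]  # hypothetical embedding of the question

ranking = rank_by_similarity(query_vector, passage_vectors)
print(ranking)
```

Because the comparison happens in vector space, a query does not need to share exact tokens with a passage to rank it highly, provided the embedding model places them close together.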

# Exercise
It seems keyword search is not the best for LLM queries. What are some alternatives?