## RAG: Retrieval-Augmented Generation
- Large language models (LLMs) have a limited context size.
- Not all context is relevant to a given question.
- TL;DR: Query -> Search -> Results -> (LLM) -> Answer

## Keyword vs. Semantic Search
![Vector](https://blog.dataiku.com/hs-fs/hubfs/dftt%202.webp?width=1346&height=632&name=dftt%202.webp)

Source: https://blog.dataiku.com/semantic-search-an-overlooked-nlp-superpower

![Emb_search](figures/emb_search.png)

Source: https://sreent.medium.com/llms-embeddings-and-vector-search-d4bd9362df56
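
To make the contrast in the figures concrete, here is a minimal sketch of pure keyword matching. The `tokenize` and `keyword_score` helpers and the two example documents are hypothetical and not part of the notebook; the point is only that literal term overlap cannot see that "car" and "automobile" mean the same thing, which is the gap semantic search is meant to close.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_score(query: str, document: str) -> float:
    """Fraction of query tokens that literally appear in the document."""
    query_tokens = tokenize(query)
    if not query_tokens:
        return 0.0
    return len(query_tokens & tokenize(document)) / len(query_tokens)


# Hypothetical documents: both are about the same topic,
# but only the first one shares words with the query.
docs = [
    "Car repair manual for beginners",
    "Automobile maintenance guide",
]
scores = [keyword_score("car repair", d) for d in docs]
print(scores)
```

The second document scores zero even though it is relevant, because none of its words match the query literally.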

In [2]:
! pip3 install -qU markdownify langchain-upstage rank_bm25

In [3]:

%load_ext dotenv
%dotenv
# Expects UPSTAGE_API_KEY to be defined in the .env file

In [4]:
import warnings

warnings.filterwarnings("ignore")

# Most powerful solar embedding
![Solar Embedding](figures/solar_emb.jpeg)



In [5]:
from langchain_upstage import UpstageEmbeddings
# Is this different from the embeddings I already know? How is it used in a search engine,
# is it more than plain similarity? I think I used embeddings when doing topic modeling.

embeddings_model = UpstageEmbeddings(model="solar-embedding-1-large")
embeddings = embeddings_model.embed_documents(
    [
        "What is the best season to visit Korea?",
    ]
)

len(embeddings), len(embeddings[0])

(1, 4096)
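
One common way such vectors are used in search is to compare a query vector with each document vector by cosine similarity. The sketch below uses hand-made 3-dimensional vectors instead of the 4096-dimensional ones returned above, so it runs without an API key; the vectors and the `cosine_similarity` helper are illustrative assumptions, not part of `langchain_upstage`.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (assumption for illustration).
query_vec = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]  # same direction, different length
orthogonal = [0.0, 1.0, 0.0]      # shares no component with the query

sim_same = cosine_similarity(query_vec, same_direction)
sim_orth = cosine_similarity(query_vec, orthogonal)
print(sim_same, sim_orth)
```

Cosine similarity ignores vector length and only compares direction, which is why the scaled vector is treated as a near-perfect match.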

In [6]:
# RAG steps: 1. load documents (done), 2. split into chunks, 3. embed and index, 4. retrieve

In [7]:
from langchain_upstage import UpstageLayoutAnalysisLoader


layzer = UpstageLayoutAnalysisLoader("pdfs/kim-tse-2008.pdf", output_type="html")
# For improved memory efficiency, consider using the lazy_load method to load documents page by page.
docs = layzer.load()  # or layzer.lazy_load()

In [8]:
from langchain_text_splitters import (
    Language,
    RecursiveCharacterTextSplitter,
)

# 2. Split
# chunk_size values of about 100, 500, or 1000 work well
text_splitter = RecursiveCharacterTextSplitter.from_language(
    chunk_size=1000, chunk_overlap=100, language=Language.HTML
)
splits = text_splitter.split_documents(docs)
print("Splits:", len(splits))

Splits: 125
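
To see what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on language-aware separators); the `chunk_text` function is a hypothetical helper that only illustrates the sliding-window idea.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.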


In [13]:
! pip3 install langchain_chroma

Collecting langchain_chroma
  Using cached langchain_chroma-0.1.1-py3-none-any.whl.metadata (1.3 kB)
Collecting chromadb<0.6.0,>=0.4.0 (from langchain_chroma)
  Using cached chromadb-0.5.0-py3-none-any.whl.metadata (7.3 kB)
Collecting fastapi<1,>=0.95.2 (from langchain_chroma)
  Using cached fastapi-0.111.0-py3-none-any.whl.metadata (25 kB)
Collecting build>=1.0.3 (from chromadb<0.6.0,>=0.4.0->langchain_chroma)
  Using cached build-1.2.1-py3-none-any.whl.metadata (4.3 kB)
Collecting chroma-hnswlib==0.7.3 (from chromadb<0.6.0,>=0.4.0->langchain_chroma)
  Using cached chroma-hnswlib-0.7.3.tar.gz (31 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting uvicorn>=0.18.3 (from uvicorn[standard]>=0.18.3->c

  error: subprocess-exited-with-error
  
  × Building wheel for chroma-hnswlib (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [5 lines of output]
      running bdist_wheel
      running build
      running build_ext
      building 'hnswlib' extension
      error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for chroma-hnswlib
ERROR: Could not build wheels for chroma-hnswlib, which is required to install pyproject.toml-based projects


In [14]:
from langchain_chroma import Chroma  # vector store used to embed and index the splits

# 3. Embed & indexing
vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=UpstageEmbeddings(model="solar-embedding-1-large"),
)

ModuleNotFoundError: No module named 'langchain_chroma'

In [None]:
# 4. Retrieve
retriever = vectorstore.as_retriever()
result_docs = retriever.invoke("What is Bug Classification?")
print(len(result_docs))
print(result_docs[0].page_content[:100])

4
<p id='48' style='font-size:16px'>Similar in spirit to change classification is work that<br>classif
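
Conceptually, steps 3 and 4 amount to storing one vector per chunk and returning the chunks whose vectors are closest to the query vector. The sketch below shows that idea with a tiny in-memory index. Everything in it is an assumption for illustration: `TinyVectorIndex` is not Chroma, and `toy_embed` is a crude letter-frequency stand-in for a real embedding model such as the one used above.

```python
import math
from dataclasses import dataclass, field


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: a 26-dim letter-frequency vector."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class TinyVectorIndex:
    """Minimal in-memory index: keeps each text next to its vector."""
    texts: list[str] = field(default_factory=list)
    vectors: list[list[float]] = field(default_factory=list)

    def add(self, texts: list[str]) -> None:
        """Embed and store each text (the 'embed and index' step)."""
        for text in texts:
            self.texts.append(text)
            self.vectors.append(toy_embed(text))

    def search(self, query: str, k: int = 4) -> list[str]:
        """Return up to k stored texts, most similar to the query first (the 'retrieve' step)."""
        query_vec = toy_embed(query)
        scored = sorted(
            zip(self.texts, self.vectors),
            key=lambda pair: cosine(query_vec, pair[1]),
            reverse=True,
        )
        return [text for text, _ in scored[:k]]


index = TinyVectorIndex()
index.add(["bug classification", "weather forecast", "bug report triage"])
top = index.search("bug classification", k=1)
print(top)
```

A real vector store replaces the linear scan with an approximate nearest-neighbor index, but the inputs and outputs have the same shape.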


In [None]:
result_docs

NameError: name 'result_docs' is not defined

In [None]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_upstage import ChatUpstage


llm = ChatUpstage()

prompt_template = PromptTemplate.from_template(
    """
    Please provide the most correct answer from the following context.
    If the answer is not present in the context, please write "The information is not present in the context."
    ---
    Question: {question}
    ---
    Context: {Context}
    """
)
chain = prompt_template | llm | StrOutputParser()

In [None]:
chain.invoke({"question": "What is bug classification?", "Context": result_docs})

'Bug classification is a process that involves extracting keywords from bug reports or software maintenance requests and using them as features to train a machine learning classifier. The goal of this classification is to place a bug report into a specific category or to find the developer best suited to fix a bug. This work, along with change classification, highlights the potential of using machine learning techniques in software engineering. If an existing concern such as assigning bugs to developers can be recast as a classification problem, then it is possible to leverage the large collection of data stored in bug tracking and SCM systems.'
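
The call above passes the retrieved documents to the prompt as they are. Since context size is limited, one option is to join only the text of each chunk and cap the total length before filling the `Context` slot. The sketch below assumes a hypothetical `format_context` helper and a `Doc` stand-in that mimics only the `page_content` attribute of a retrieved document; the `max_chars` budget is an arbitrary illustrative value.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document (only page_content is used)."""
    page_content: str


def format_context(docs, max_chars: int = 4000) -> str:
    """Join chunk texts into one numbered string, stopping before max_chars is exceeded."""
    parts = []
    used = 0
    for i, doc in enumerate(docs, start=1):
        piece = f"[{i}] {doc.page_content.strip()}"
        if used + len(piece) > max_chars:
            break  # keep the earlier (higher-ranked) chunks, drop the rest
        parts.append(piece)
        used += len(piece)
    return "\n\n".join(parts)


context = format_context([Doc("  first chunk "), Doc("second chunk")])
print(context)
```

With the notebook's objects this would be used as `chain.invoke({"question": ..., "Context": format_context(result_docs)})`, assuming the retrieved items expose `page_content` as shown earlier.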

# Exercise: Hybrid
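Keyword search is sometimes more useful than semantic search, for example when the query contains exact identifiers or rare terms. Design a system that runs both keyword and semantic search, combines the results, and uses them as context for your RAG pipeline.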
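
One possible way to combine the two result lists is reciprocal rank fusion (RRF), which only needs the rank of each document in each list. This is a sketch of one fusion strategy, not the intended solution to the exercise. The document ids are made up; in the notebook the keyword ranking could come from `rank_bm25` and the semantic ranking from the retriever. The constant `k=60` is a commonly used default, assumed here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document ids: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical ranked ids from the two searches (best match first).
keyword_ranking = ["doc_a", "doc_b", "doc_c"]
semantic_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([keyword_ranking, semantic_ranking])
print(fused)
```

Documents that appear near the top of both lists rise to the front, while documents found by only one search are kept but ranked lower. The fused ids can then be mapped back to their chunks and passed to the chain as context.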