## LangChain Multi-Query for RAG

In [16]:
from datasets import load_dataset

data = load_dataset("jamescalam/ai-arxiv-chunked", split="train")
data

Found cached dataset json (file://C:/Users/sy/.cache/huggingface/datasets/jamescalam___json/jamescalam--ai-arxiv-chunked-c0ecde7e34f06e42/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)


NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported.

In [5]:
import pandas as pd

df = pd.read_json("hf://datasets/jamescalam/ai-arxiv-chunked/train.jsonl", lines=True)

In [10]:
df.head()

Unnamed: 0,doi,chunk-id,chunk,id,title,summary,source,authors,categories,comment,journal_ref,primary_category,published,updated,references
0,1910.01108,0,"DistilBERT, a distilled version of BERT: small...",1910.01108,"DistilBERT, a distilled version of BERT: small...",As Transfer Learning from large-scale pre-trai...,http://arxiv.org/pdf/1910.01108,"[Victor Sanh, Lysandre Debut, Julien Chaumond,...",[cs.CL],February 2020 - Revision: fix bug in evaluatio...,,cs.CL,20191002,20200301,[{'id': '1910.01108'}]
1,1910.01108,1,"loss combining language modeling, distillation...",1910.01108,"DistilBERT, a distilled version of BERT: small...",As Transfer Learning from large-scale pre-trai...,http://arxiv.org/pdf/1910.01108,"[Victor Sanh, Lysandre Debut, Julien Chaumond,...",[cs.CL],February 2020 - Revision: fix bug in evaluatio...,,cs.CL,20191002,20200301,[{'id': '1910.01108'}]
2,1910.01108,2,in real-time has the potential to enable novel...,1910.01108,"DistilBERT, a distilled version of BERT: small...",As Transfer Learning from large-scale pre-trai...,http://arxiv.org/pdf/1910.01108,"[Victor Sanh, Lysandre Debut, Julien Chaumond,...",[cs.CL],February 2020 - Revision: fix bug in evaluatio...,,cs.CL,20191002,20200301,[{'id': '1910.01108'}]
3,1910.01108,3,through distillation via the supervision of a ...,1910.01108,"DistilBERT, a distilled version of BERT: small...",As Transfer Learning from large-scale pre-trai...,http://arxiv.org/pdf/1910.01108,"[Victor Sanh, Lysandre Debut, Julien Chaumond,...",[cs.CL],February 2020 - Revision: fix bug in evaluatio...,,cs.CL,20191002,20200301,[{'id': '1910.01108'}]
4,1910.01108,4,generalization capabilities of the model and h...,1910.01108,"DistilBERT, a distilled version of BERT: small...",As Transfer Learning from large-scale pre-trai...,http://arxiv.org/pdf/1910.01108,"[Victor Sanh, Lysandre Debut, Julien Chaumond,...",[cs.CL],February 2020 - Revision: fix bug in evaluatio...,,cs.CL,20191002,20200301,[{'id': '1910.01108'}]


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41584 entries, 0 to 41583
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   doi               41584 non-null  float64
 1   chunk-id          41584 non-null  int64  
 2   chunk             41584 non-null  object 
 3   id                41584 non-null  float64
 4   title             41584 non-null  object 
 5   summary           41584 non-null  object 
 6   source            41584 non-null  object 
 7   authors           41584 non-null  object 
 8   categories        41584 non-null  object 
 9   comment           25617 non-null  object 
 10  journal_ref       2216 non-null   object 
 11  primary_category  41584 non-null  object 
 12  published         41584 non-null  int64  
 13  updated           41584 non-null  int64  
 14  references        41584 non-null  object 
dtypes: float64(2), int64(3), object(10)
memory usage: 4.8+ MB
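
The `df.info()` output above lists `doi` and `id` as `float64`, and those values are later used to build vector IDs. A float cannot keep a trailing zero, so the sketch below shows the loss and one way to rebuild the string. It assumes new-style arXiv identifiers (four digits after the dot before 1501, five digits from 1501 onward); `arxiv_id_from_float` is a hypothetical helper, not part of the notebook.

```python
def arxiv_id_from_float(value: float) -> str:
    """Rebuild a new-style arXiv identifier string from its float form."""
    yymm = int(value)
    # Assumption: identifiers from 1501 onward have five digits after the dot,
    # earlier new-style identifiers have four.
    digits = 5 if yymm >= 1501 else 4
    width = 4 + 1 + digits  # YYMM + "." + sequence number
    return f"{value:0{width}.{digits}f}"


print(str(float("2307.09280")))        # 2307.0928 (trailing zero is lost)
print(arxiv_id_from_float(2307.0928))  # 2307.09280
```

Reading the identifier columns as strings in the first place would avoid the round trip entirely.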


In [20]:
from langchain.docstore.document import Document

docs = []

for index, row in df.iterrows():
    doc = Document(
        page_content=row["chunk"],
        metadata={
            "title": row["title"],
            "source": row["source"],
            "id": row["id"],
            "chunk-id": row["chunk-id"],
            "text": row["chunk"]
        }
    )
    docs.append(doc)

#### Embedding and Vector DB Setup

In [21]:
import os
OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY')

In [22]:
from langchain.embeddings.openai import OpenAIEmbeddings

model_name = "text-embedding-ada-002"

embed = OpenAIEmbeddings(
    model=model_name, openai_api_key=OPENAI_API_KEY, disallowed_special=()
)


In [None]:
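# The key is blank in the source. Reading it from an environment variable
# (the variable name is an assumption) keeps it out of the notebook.
api_key = os.environ.get("PINECONE_API_KEY", "")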

In [24]:
from pinecone import Pinecone

pc = Pinecone(api_key=api_key)

In [25]:
from pinecone import ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-east-1"
)

In [26]:
import time

index_name = "langchain-multi-query-demo"
existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

# check if index already exists (it shouldn't if this is first time)
if index_name not in existing_indexes:
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=1536,  # dimensionality of ada 002
        metric='dotproduct',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
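
The index above uses `metric='dotproduct'`. A small sketch of why that can stand in for cosine similarity, assuming the embedding vectors are unit length (OpenAI describes its embeddings as normalized to length 1, but that is treated as an assumption here). The vectors and helper functions are made up for illustration.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product divided by both vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

# For unit-length vectors both scores agree, so the ranking is the same.
print(math.isclose(dot(a, b), cosine(a, b)))  # True
```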

In [27]:
len(docs)

41584

In [29]:
from tqdm.auto import tqdm

batch_size = 100

for i in tqdm(range(0, len(docs), batch_size)):
    i_end = min(len(docs), i+batch_size)
    docs_batch = docs[i:i_end]
    # get IDs
    ids = [f"{doc.metadata['id']}-{doc.metadata['chunk-id']}" for doc in docs_batch]
    # get text and embed
    texts = [d.page_content for d in docs_batch]
    embeds = embed.embed_documents(texts=texts)
    # get metadata
    metadata = [d.metadata for d in docs_batch]
    to_upsert = zip(ids, embeds, metadata)
    index.upsert(vectors=to_upsert)

  0%|          | 0/416 [00:00<?, ?it/s]
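
The progress bar counts 416 iterations because 41,584 documents are sent in slices of 100. The slicing arithmetic can be checked on its own with a minimal sketch; `batch_ranges` is a hypothetical helper that mirrors the `i` / `i_end` computation in the loop above.

```python
def batch_ranges(total: int, batch_size: int):
    """Yield (start, end) index pairs that cover `total` items in order."""
    for start in range(0, total, batch_size):
        yield start, min(total, start + batch_size)


ranges = list(batch_ranges(41584, 100))
print(len(ranges))  # 416
print(ranges[-1])   # (41500, 41584)
```

The last batch is shorter than the others (84 items), which is why the loop clamps `i_end` with `min`.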

In [30]:
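# Aliased so it does not shadow the `Pinecone` client class imported from `pinecone` earlier
from langchain.vectorstores import Pinecone as PineconeVectorStore

text_field = "text"  # metadata key that holds the chunk text

vectorstore = PineconeVectorStore(index, embed.embed_query, text_field)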



In [31]:
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)


- Initialize the MultiQueryRetriever

In [32]:
from langchain.retrievers.multi_query import MultiQueryRetriever

retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(), llm=llm
)

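#### Logging setup
- Logging records information about how the software is running so it can be inspected for debugging or monitoring.
- In practice it means writing the system so that it emits log messages.
- When a problem occurs, the logs supply the information needed to diagnose and fix it.
- `INFO`: shows informational messages such as state changes.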
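
A minimal sketch of what setting a logger to `INFO` does, using only the standard library. The logger name `demo.multi_query` and the `ListHandler` class are made up for this example. Messages below the configured level are dropped, and everything at `INFO` or above is kept.

```python
import logging


class ListHandler(logging.Handler):
    """Collects formatted log messages in a list instead of printing them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(f"{record.levelname}:{record.getMessage()}")


demo_logger = logging.getLogger("demo.multi_query")  # hypothetical logger name
demo_logger.propagate = False  # keep the demo output out of the root logger
handler = ListHandler()
demo_logger.addHandler(handler)
demo_logger.setLevel(logging.INFO)

demo_logger.debug("not recorded, DEBUG is below INFO")
demo_logger.info("Generated queries: %s", ["q1", "q2"])
demo_logger.warning("recorded, WARNING is above INFO")

print(handler.messages)
```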

In [33]:
import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

In [34]:
question = "tell me about llama 2?"

docs = retriever.get_relevant_documents(query=question)
len(docs)

INFO:langchain.retrievers.multi_query:Generated queries: ['What information can you provide about llama 2?', 'What are some details about llama 2?', 'Can you share some insights on llama 2?']


5

- `retriever.get_relevant_documents` searches for the documents relevant to the query and returns them.
- Number of documents the retriever found for this query: 5.
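
Three queries were generated but only 5 documents came back, which is consistent with the per-query results overlapping and duplicates being dropped. The sketch below shows that merge step with plain strings. It assumes four hits per query, and the document labels are invented; it is an illustration of the idea, not LangChain's own code.

```python
def unique_union(result_lists):
    """Merge several result lists, keeping the first occurrence of each item."""
    seen = set()
    merged = []
    for results in result_lists:
        for doc in results:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Hypothetical hits for the three generated queries (four per query).
per_query_results = [
    ["doc-a", "doc-b", "doc-c", "doc-d"],
    ["doc-a", "doc-b", "doc-c", "doc-e"],
    ["doc-b", "doc-a", "doc-d", "doc-c"],
]

merged = unique_union(per_query_results)
print(len(merged))  # 5
```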

In [35]:
docs

[Document(page_content='Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang\nRoss Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang\nAngela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic\nSergey Edunov Thomas Scialom\x03\nGenAI, Meta\nAbstract\nIn this work, we develop and release Llama 2, a collection of pretrained and ﬁne-tuned\nlarge language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.\nOur ﬁne-tuned LLMs, called L/l.sc/a.sc/m.sc/a.sc /two.taboldstyle-C/h.sc/a.sc/t.sc , are optimized for dialogue use cases. Our\nmodels outperform open-source chat models on most benchmarks we tested, and based on\nourhumanevaluationsforhelpfulnessandsafety,maybeasuitablesubstituteforclosedsource models. We provide a detailed description of our approach to ﬁne-tuning and safety', metadata={'chunk-id': 1.0, 'id': 2307.09288, 'source': 'http://arxiv.org/pdf/2307.09288', 'title': 'Ll

### Adding the Generation in RAG
- So far we have built only the query-driven retrieval part of RAG.
- Next, add a prompt template and create the chain.

In [36]:
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

QA_PROMPT = PromptTemplate(
    input_variables=["query", "contexts"],
    template="""You are a helpful assistant who answers user queries using the
    contexts provided. If the question cannot be answered using the information
    provided say "I don't know".

    Contexts:
    {contexts}

    Question: {query}""",
)

# Chain
qa_chain = LLMChain(llm=llm, prompt=QA_PROMPT)

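- `contexts`: the retrieved chunks are combined into one block of text with a `---` line between them, so the model can tell where one piece of information ends and the next begins.
- `"\n---\n"`: the separator inserted between consecutive chunks.
- All retrieved documents are collected and joined into a single string.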
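
The joining step can be tried in isolation. The chunk texts below are placeholders, not retrieved data.

```python
# Placeholder chunks standing in for the page_content of retrieved documents.
chunks = ["first chunk", "second chunk", "third chunk"]

# One separator line goes between each pair of chunks, none at the ends.
contexts = "\n---\n".join(chunks)
print(contexts)
```

Three chunks produce two separator lines, so the string never starts or ends with `---`.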

In [37]:
out = qa_chain(
    inputs={
        "query": question,
        "contexts": "\n---\n".join([d.page_content for d in docs])
    }
)
out["text"]


'Llama 2 is a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. The fine-tuned LLMs, called L/l.sc/a.sc/m.sc/a.sc/two.taboldstyle-C/h.sc/a.sc/t.sc, are optimized for dialogue use cases. These models outperform open-source chat models on most benchmarks tested and may be a suitable substitute for closed-source models based on humane evaluations for helpfulness and safety. The approach to fine-tuning and safety is detailed in the work.'

### Chaining Everything with a SequentialChain
- Connect several chains so that they run one after another.

In [38]:
from langchain.chains import TransformChain

def retrieval_transform(inputs: dict) -> dict:
    docs = retriever.get_relevant_documents(query=inputs["question"])
    docs = [d.page_content for d in docs]
    docs_dict = {
        "query": inputs["question"],
        "contexts": "\n---\n".join(docs)
    }
    return docs_dict

retrieval_chain = TransformChain(
    input_variables=["question"],
    output_variables=["query", "contexts"],
    transform=retrieval_transform
)

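- `d.page_content`: extracts the text content, building a list with the page content of each retrieved document.
- Combining contexts: all document texts are joined into a single string separated by `\n---\n` and stored in a dictionary together with the original query.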

In [39]:
from langchain.chains import SequentialChain

rag_chain = SequentialChain(
    chains=[retrieval_chain, qa_chain],
    input_variables=["question"],  # named "question" so it does not collide with the "query" output key
    output_variables=["query", "contexts", "text"]
)

In [40]:
out = rag_chain({"question": question})
out["text"]

INFO:langchain.retrievers.multi_query:Generated queries: ['What information can you provide about llama 2?', 'What are some details about llama 2?', 'Can you share some insights on llama 2?']


'Llama 2 is a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. The fine-tuned LLMs, called L/l.sc/a.sc/m.sc/a.sc /two.taboldstyle-C/h.sc/a.sc/t.sc, are optimized for dialogue use cases. These models outperform open-source chat models on most benchmarks tested and may be a suitable substitute for closed-source models based on humane evaluations for helpfulness and safety. The approach to fine-tuning and safety is detailed in the work.'

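- The TransformChain (`retrieval_chain`) retrieves and processes the documents relevant to the query.
- The SequentialChain passes that output (the dictionary holding the query and the contexts) on to `qa_chain`, which generates the final answer.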
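
The data flow described above can be sketched with plain functions. This is a stand-in for illustration only, not how LangChain implements `SequentialChain`; `fake_retrieve`, `qa_step`, and `run_sequential` are hypothetical names, and no retriever or LLM is called.

```python
def fake_retrieve(question):
    """Hypothetical stand-in for retriever.get_relevant_documents."""
    return ["chunk about " + question, "another chunk"]


def retrieval_step(state):
    """Mirrors retrieval_transform: 'question' in, 'query' and 'contexts' out."""
    chunks = fake_retrieve(state["question"])
    return {"query": state["question"], "contexts": "\n---\n".join(chunks)}


def qa_step(state):
    """Hypothetical stand-in for the LLM call: reports what it received."""
    n_chunks = state["contexts"].count("\n---\n") + 1
    return {"text": f"answer to '{state['query']}' from {n_chunks} chunks"}


def run_sequential(steps, inputs):
    """Run steps in order; each step sees the inputs plus all earlier outputs."""
    state = dict(inputs)
    for step in steps:
        state.update(step(state))
    return state


out = run_sequential([retrieval_step, qa_step], {"question": "tell me about llama 2?"})
print(out["text"])  # answer to 'tell me about llama 2?' from 2 chunks
```

The second step only works because the first one produced the `query` and `contexts` keys it reads, which is the same contract the `output_variables` and `input_variables` declarations express in the notebook.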

### Custom Multiquery
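- Compare how the generated queries and answers change with the prompt.

- Create the `LineList` class.
- pydantic: a library that validates the types and ranges of data; it uses type annotations for validation and settings management.
- `Field`: lets you set constraints such as value ranges and length limits.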
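
A short sketch of the validation behaviour described above, assuming pydantic is installed. `SearchSettings` and its fields are invented for this example and are not used elsewhere in the notebook.

```python
from pydantic import BaseModel, Field, ValidationError


class SearchSettings(BaseModel):  # hypothetical model, for illustration only
    # Type annotations declare the expected types; Field adds constraints.
    query: str = Field(min_length=1, description="Search text")
    top_k: int = Field(default=4, ge=1, le=20, description="Hits per query")


ok = SearchSettings(query="llama 2")
print(ok.top_k)  # 4, the declared default

try:
    SearchSettings(query="", top_k=0)  # empty string and out-of-range integer
except ValidationError as err:
    failed_fields = sorted(e["loc"][0] for e in err.errors())
    print(failed_fields)
```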

In [41]:
from typing import List
from langchain.chains import LLMChain
from pydantic import BaseModel, Field
from langchain.prompts import PromptTemplate
from langchain.output_parsers import PydanticOutputParser


# Output parser will split the LLM result into a list of queries
class LineList(BaseModel):
    # "lines" is the key (attribute name) of the parsed output
    lines: List[str] = Field(description="Lines of text")


class LineListOutputParser(PydanticOutputParser):
    def __init__(self) -> None:
        super().__init__(pydantic_object=LineList)

    def parse(self, text: str) -> LineList:
        lines = text.strip().split("\n")
        return LineList(lines=lines)


output_parser = LineListOutputParser()

In [56]:
template = """
Your task is to generate 3 different search queries that aim to
answer the user question from multiple perspectives. The user questions
are focused on Large Language Models, Machine Learning, and related
disciplines.
Each query MUST tackle the question from a different viewpoint, we
want to get a variety of RELEVANT search results.
Provide these alternative questions separated by newlines.
Original question: {question}
"""

- Create the `PromptTemplate`.
- Configure the LLM.
- Create the `LLMChain`.
- Output parser: parses the model output into the desired format.

In [57]:
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template=template,
)
llm = ChatOpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)

llm_chain = LLMChain(llm=llm, prompt=QUERY_PROMPT, output_parser=output_parser)

In [58]:
retriever = MultiQueryRetriever(
    retriever=vectorstore.as_retriever(), llm_chain=llm_chain, parser_key="lines"
)  # "lines" is the key (attribute name) of the parsed output

In [59]:
docs = retriever.get_relevant_documents(
    query=question
)
len(docs)

INFO:langchain.retrievers.multi_query:Generated queries: ['1. What are the key features and capabilities of Large Language Models like GPT-3 and BERT?', '2. How does Machine Learning play a role in the development and advancement of Large Language Models?', '3. What are the ethical considerations and potential risks associated with deploying Large Language Models in various industries and applications?']


12

- Documents retrieved as similar to the question: 12.
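
The generated queries in the log above keep their `1.`, `2.`, `3.` prefixes, because the parser only splits on newlines. A possible variant of the parsing step, sketched here as a standalone function, drops blank lines and strips list numbering before the queries are sent to the vector store. `clean_query_lines` is a hypothetical helper and the sample text is invented.

```python
import re
from typing import List

# Leading "1." / "2)" numbering or "-" / "*" bullets, followed by whitespace.
_PREFIX = re.compile(r"^\s*(?:\d+[.)]|[-*])\s+")


def clean_query_lines(text: str) -> List[str]:
    """Split LLM output into queries, dropping blanks and list prefixes."""
    lines = []
    for raw in text.strip().split("\n"):
        line = _PREFIX.sub("", raw).strip()
        if line:
            lines.append(line)
    return lines


sample = "1. What is Llama 2?\n\n2. How was Llama 2 trained?\n3) Who released Llama 2?"
print(clean_query_lines(sample))
```

The same logic could be placed inside `LineListOutputParser.parse` so that `LineList(lines=...)` receives the cleaned list.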

In [60]:
docs

[Document(page_content='2 Related Work\n2.1 Large Language Models\nIn recent times, Large Language Models (LLMs) have garnered increasing attention for their exceptional performance in diverse natural language processing (NLP) tasks. Initially, transformer\nmodels such as BERT [Devlin et al., 2019], GPT [Radford and Narasimhan, 2018], and T5 [Raffel\net al., 2020] were developed with different pre-training objectives. However, the emergence of GPT3 [Brown et al., 2020], which scales up the number of model parameters and data size, showcases\nsigniﬁcant zero-shot generalization abilities, enabling them to perform commendably on previously\nunseen tasks. Consequently, numerous LLMs such as OPT [Zhang et al., 2022], BLOOM [Scao\net al., 2022], PaLM [Chowdhery et al., 2022], and LLaMA [Touvron et al., 2023] are created, ushering in the success of LLMs. Additionally, Ouyang et al. [Ouyang et al., 2022] propose InstructGPT\nby aligning human instruction and feedback with GPT-3. Furthermore, 

In [61]:
retrieval_chain = TransformChain(
    input_variables=["question"],
    output_variables=["query", "contexts"],
    transform=retrieval_transform
)

rag_chain = SequentialChain(
    chains=[retrieval_chain, qa_chain],
    input_variables=["question"],  # named "question" so it does not collide with the "query" output key
    output_variables=["query", "contexts", "text"]
)

In [62]:
out = rag_chain({"question": question})
out["text"]

INFO:langchain.retrievers.multi_query:Generated queries: ['1. How do Large Language Models like GPT-3 contribute to advancements in natural language processing?', '2. What are the key differences between traditional machine learning algorithms and deep learning models used in language processing?', '3. What are the ethical considerations surrounding the use of Large Language Models in various industries and applications?']


'LLaMA is a Large Language Model created by Touvron et al. in 2023, which is part of the success of Large Language Models (LLMs). It is one of the numerous LLMs like OPT, BLOOM, and PaLM that have been developed to showcase significant zero-shot generalization abilities and perform well on diverse natural language processing tasks.'

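- Question: "Tell me about Llama 2."
- Answer, in translation: 'LLaMA is a large language model developed by Touvron et al. in 2023. It is one of the many LLMs developed to build on the success of models such as GPT-3. Alongside other models such as OPT, BLOOM, and PaLM, LLaMA shows significant zero-shot generalization ability and aims to excel at a variety of natural language processing tasks.'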

In [49]:
template = """
Your task is to generate 3 different search queries that aim to
answer the user question from multiple perspectives.
Each query MUST tackle the question from a different viewpoint,
we want to get a variety of RELEVANT search results.
Provide these alternative questions separated by newlines.
Original question: {question}
""" 

- Parts to revisit: logging / class annotation / contexts