## Retrieval Agents

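- Conversational agent
    - The underlying model is trained on a static dataset, so it cannot access up-to-date data
    - Its knowledge covers only certain fields, not every domain
- Naive (pure) RAG
    - Retrieves related context or information on every query, whether or not that context is needed
    - For queries that do not actually need external knowledge, the system may fetch unnecessary or irrelevant data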
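
The contrast between the two approaches can be sketched in plain Python. This is only an illustration: `fake_retrieve` and `needs_retrieval` are hypothetical stand-ins, and in the agent built later in this notebook the decision is made by the LLM, not by a digit-matching heuristic.

```python
from typing import Callable, List


def naive_rag(query: str, retrieve: Callable[[str], List[str]]) -> dict:
    """Always retrieve, whether or not the query needs external knowledge."""
    return {"query": query, "context": retrieve(query)}


def agent_style(query: str,
                retrieve: Callable[[str], List[str]],
                needs_retrieval: Callable[[str], bool]) -> dict:
    """Retrieve only when the decision function says it is needed."""
    context = retrieve(query) if needs_retrieval(query) else []
    return {"query": query, "context": context}


def fake_retrieve(query: str) -> List[str]:
    # Hypothetical retriever: stands in for a vector store lookup.
    return [f"passage about: {query}"]


def needs_retrieval(query: str) -> bool:
    # Hypothetical heuristic: treat queries containing digits as arithmetic.
    return not any(ch.isdigit() for ch in query)


always = naive_rag("what is 2 * 7?", fake_retrieve)
selective = agent_style("what is 2 * 7?", fake_retrieve, needs_retrieval)
print(len(always["context"]), len(selective["context"]))
```

The naive version fetches a passage even for the arithmetic question, while the selective version skips retrieval for it.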

### Dataset - Hugging Face SQuAD

- Stanford Question Answering Dataset
- Consists of passages from Wikipedia articles, with questions about each passage and the corresponding answers

In [6]:
!pip install -qU \
    openai==1.6.1 \
    pinecone-client==3.1.0 \
    langchain==0.1.1 \
    langchain-community==0.0.13 \
    tiktoken==0.5.2 \
    datasets==2.12.0

In [9]:
from datasets import load_dataset

df = load_dataset("rajpurkar/squad")

ValueError: Invalid pattern: '**' can only be an entire path component

In [10]:
import pandas as pd

splits = {'train': 'plain_text/train-00000-of-00001.parquet', 'validation': 'plain_text/validation-00000-of-00001.parquet'}
df = pd.read_parquet("hf://datasets/rajpurkar/squad/" + splits["train"])

In [11]:
df.head()

| | id | title | context | question | answers |
|---|---|---|---|---|---|
| 0 | 5733be284776f41900661182 | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | To whom did the Virgin Mary allegedly appear i... | {'text': ['Saint Bernadette Soubirous'], 'answ... |
| 1 | 5733be284776f4190066117f | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | What is in front of the Notre Dame Main Building? | {'text': ['a copper statue of Christ'], 'answe... |
| 2 | 5733be284776f41900661180 | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | The Basilica of the Sacred heart at Notre Dame... | {'text': ['the Main Building'], 'answer_start'... |
| 3 | 5733be284776f41900661181 | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | What is the Grotto at Notre Dame? | {'text': ['a Marian place of prayer and reflec... |
| 4 | 5733be284776f4190066117e | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | What sits on top of the Main Building at Notre... | {'text': ['a golden statue of the Virgin Mary'... |


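- SQuAD was not built for use with ChatGPT: it was designed for training a language model itself to answer questions, without RAG, so the same context passage is repeated once for every question asked about it.
- For RAG, each passage only needs to be stored once, so the duplicate rows are removed.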
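
The "keep the first row per context" idea can be sketched without pandas as follows. The three rows are made-up toy data, not records from SQuAD.

```python
def drop_duplicate_contexts(rows):
    """Keep the first row for each distinct context, preserving row order."""
    seen = set()
    unique = []
    for row in rows:
        if row["context"] not in seen:
            seen.add(row["context"])
            unique.append(row)
    return unique


# Toy rows: two questions share "Passage A", as happens in SQuAD.
rows = [
    {"id": "a1", "context": "Passage A", "question": "Q1"},
    {"id": "a2", "context": "Passage A", "question": "Q2"},
    {"id": "b1", "context": "Passage B", "question": "Q3"},
]
unique_rows = drop_duplicate_contexts(rows)
print([row["id"] for row in unique_rows])
```

Only the first row for each passage survives, so every passage would be embedded and stored once.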

#### Remove duplicate rows based on the `context` column, keeping the first occurrence

In [12]:
df.drop_duplicates(subset='context', keep='first', inplace=True)
df.head()

| | id | title | context | question | answers |
|---|---|---|---|---|---|
| 0 | 5733be284776f41900661182 | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | To whom did the Virgin Mary allegedly appear i... | {'text': ['Saint Bernadette Soubirous'], 'answ... |
| 5 | 5733bf84d058e614000b61be | University_of_Notre_Dame | As at most other universities, Notre Dame's st... | When did the Scholastic Magazine of Notre dame... | {'text': ['September 1876'], 'answer_start': [... |
| 10 | 5733bed24776f41900661188 | University_of_Notre_Dame | The university is the major seat of the Congre... | Where is the headquarters of the Congregation ... | {'text': ['Rome'], 'answer_start': [119]} |
| 15 | 5733a6424776f41900660f51 | University_of_Notre_Dame | The College of Engineering was established in ... | How many BS level degrees are offered in the C... | {'text': ['eight'], 'answer_start': [487]} |
| 20 | 5733a70c4776f41900660f64 | University_of_Notre_Dame | All of Notre Dame's undergraduate students are... | What entity provides help with the management ... | {'text': ['Learning Resource Center'], 'answer... |


### Create a vector database to store the vectors
- Pinecone is used as the vector database

In [15]:
import os
OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY')
from langchain.embeddings.openai import OpenAIEmbeddings

In [16]:
model_name = 'text-embedding-ada-002'

embed = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)


In [18]:
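# Pinecone API key (left blank in the original); read from the environment if set
api_key = os.environ.get('PINECONE_API_KEY', '')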

In [19]:
from pinecone import Pinecone

pc = Pinecone(api_key=api_key)

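#### Specify the configuration of the serverless Pinecone instance
- Defines the cloud provider (aws) and region (us-east-1) where the Pinecone service is deployed. Other environments can be chosen
- Initial vector store setup: defines parameters such as the cloud provider, the region, and other deployment-related settings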

In [20]:
from pinecone import ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-east-1"
)

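#### Check that the Pinecone index is created properly and ready to use
- Dimension: when configuring the embeddings and creating the index, set the dimension to 1536 to match the embedding size of the 'text-embedding-ada-002' model
- Index creation: check whether the index already exists, and create it only if it does not
- Vector insertion: vectors generated by the model are inserted into the index
- Advantage: this keeps things consistent, so the index can handle the vectors properly

- Study characteristics by user domain
- Extending RAG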
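
As a small sketch of the dimension rule above, a check like the following could be run on embeddings before they are sent to the index. `validate_dimensions` is a hypothetical helper, not part of Pinecone or LangChain.

```python
EXPECTED_DIM = 1536  # embedding size of text-embedding-ada-002, as stated above


def validate_dimensions(vectors, expected_dim=EXPECTED_DIM):
    """Raise ValueError if any vector's length differs from the index dimension."""
    for position, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            raise ValueError(
                f"vector {position} has dimension {len(vector)}, "
                f"expected {expected_dim}"
            )
    return True


# A correctly sized dummy vector passes the check.
print(validate_dimensions([[0.0] * 1536]))
```

Failing early on the client side gives a clearer error than a rejected upload.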

In [21]:
import time

index_name = "langchain-retrieval-agent"
existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

# check whether the index already exists (it should not on the first run)
if index_name not in existing_indexes:
    # create the index if it does not exist
    pc.create_index(
        index_name,
        dimension=1536,  # dimensionality of ada 002
        metric='dotproduct',
        spec=spec
    )
    # wait until the index has finished initializing
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

- The index has only been created and no vectors have been added yet, so the Pinecone index's `total_vector_count` is 0
- Once vectors are added, `total_vector_count` increases to reflect the number of stored vectors

### Indexing

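- Processing steps
    - Batching: split the dataset into smaller parts (batches) of 100 records each
    - Metadata collection: extract information such as the title and context from the data
    - Embedding generation: use the model to convert the text data (contexts) into numerical representations
    - IDs: a unique identifier for each record in the batch
    - Upload to Pinecone: send the embeddings to the Pinecone index together with their IDs and metadata, so they are stored for later retrieval
    - Note: UPSERT (UPDATE + INSERT) updates the record if a matching value already exists and inserts it if not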
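
The batching and upsert steps above can be sketched with a plain dictionary standing in for the index. `iter_batches` and `upsert` are hypothetical helpers written for this illustration, and the records are dummy data.

```python
import math


def iter_batches(items, batch_size=100):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def upsert(store, records):
    """Insert new ids and overwrite existing ones; return (inserted, updated)."""
    inserted = updated = 0
    for record_id, vector, metadata in records:
        if record_id in store:
            updated += 1
        else:
            inserted += 1
        store[record_id] = (vector, metadata)
    return inserted, updated


store = {}
records = [(f"id-{n}", [float(n)], {"title": "demo"}) for n in range(250)]
batches = list(iter_batches(records, batch_size=100))
for batch in batches:
    upsert(store, batch)

print([len(batch) for batch in batches])   # sizes of the three batches
print(upsert(store, records[:10]))          # same ids again: updates only
print(math.ceil(18891 / 100))               # batches needed for 18,891 rows
```

Sending the same IDs a second time changes no counts, which is why re-running the indexing cell does not create duplicate vectors.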

In [24]:
from tqdm.auto import tqdm

batch_size = 100

for i in tqdm(range(0, len(df), batch_size)):
    # get end of batch
    i_end = min(len(df), i+batch_size)
    batch = df.iloc[i:i_end]
    # build the metadata (title and passage text) for each record in the batch
    metadatas = [{
        'title': record['title'],
        'text': record['context']
    } for _, record in batch.iterrows()]
    # get the list of contexts / documents
    documents = batch['context'].tolist()
    # create document embeddings
    embeds = embed.embed_documents(documents)
    # get IDs
    ids = batch['id'].tolist()
    # add everything to pinecone as (id, vector, metadata) tuples
    index.upsert(vectors=list(zip(ids, embeds, metadatas)))

  0%|          | 0/189 [00:00<?, ?it/s]

In [26]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 18891}},
 'total_vector_count': 18891}

### Create and query the vector store

In [27]:
from langchain.vectorstores import Pinecone

text_field = "text"  

# vector store 
vectorstore = Pinecone(
    index, embed.embed_query, text_field
)



In [28]:
query = "when was the college of engineering in the University of Notre Dame established?"

vectorstore.similarity_search(
    query,  
    k=3  # return the 3 documents most similar to the query
)

[Document(page_content="In 1919 Father James Burns became president of Notre Dame, and in three years he produced an academic revolution that brought the school up to national standards by adopting the elective system and moving away from the university's traditional scholastic and classical emphasis. By contrast, the Jesuit colleges, bastions of academic conservatism, were reluctant to move to a system of electives. Their graduates were shut out of Harvard Law School for that reason. Notre Dame continued to grow over the years, adding more colleges, programs, and sports teams. By 1921, with the addition of the College of Commerce, Notre Dame had grown from a small college to a university with five colleges and a professional law school. The university continued to expand and add new residence halls and buildings with each subsequent president.", metadata={'title': 'University_of_Notre_Dame'}),
 Document(page_content='The College of Engineering was established in 1920, however, early c
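
What a top-k similarity search does can be sketched in plain Python, using the dot product because the index above was created with `metric='dotproduct'`. The 3-dimensional vectors and document IDs are toy values. A real vector database does not scan every vector like this.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vector, stored, k=3):
    """Return the k (doc_id, score) pairs with the highest dot-product score."""
    scored = [(doc_id, dot(query_vector, vec)) for doc_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "index": document id -> embedding.
stored = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.8, 0.6, 0.0],
    "doc-c": [0.0, 1.0, 0.0],
    "doc-d": [0.0, 0.0, 1.0],
}
results = top_k([1.0, 0.2, 0.0], stored, k=3)
print([doc_id for doc_id, _ in results])
```

The query vector is closest to `doc-a`, then `doc-b`, then `doc-c`; `doc-d` falls outside the top 3.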

### Initialize the conversational agent

In [29]:
from langchain.chat_models import ChatOpenAI
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
from langchain.chains import RetrievalQA

# chat completion llm
llm = ChatOpenAI(
    openai_api_key=OPENAI_API_KEY,
    model_name='gpt-3.5-turbo',
    temperature=0.0
)
# conversational memory
conversational_memory = ConversationBufferWindowMemory(
    memory_key='chat_history',
    k=5,
    return_messages=True
)
# retrieval qa chain
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)


In [33]:
qa.run(query)


'The College of Engineering at the University of Notre Dame was established in 1920.'

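- Question: when was the College of Engineering at the University of Notre Dame established?
- Answer: the College of Engineering at the University of Notre Dame was established in 1920.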

#### Define the tool

In [30]:
from langchain.agents import Tool

tools = [
    Tool(
        name='Knowledge Base',
        func=qa.run,
        description=(
            'use this tool when answering general knowledge queries to get '
            'more information about the topic'
        )
    )
]

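#### Agent
- `verbose=True` prints detailed output, and `max_iterations` limits the number of steps the agent may take (3 here)
- `early_stopping_method='generate'`: if the agent reaches the iteration limit without a final answer, the LLM makes one last pass to generate an answer from the steps taken so far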
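
The step limit and the final "generate" pass can be modelled with a toy loop. This is a simplified sketch, not LangChain's implementation. `run_agent`, `decide_once`, and `decide_forever` are hypothetical names, and the final pass here just reuses the last observation where a real agent would call the LLM.

```python
def run_agent(decide, tools, query, max_iterations=3):
    """Toy agent loop: `decide` returns ("final", answer) or (tool_name, tool_input)."""
    steps = []
    for _ in range(max_iterations):
        action, action_input = decide(query, steps)
        if action == "final":
            return {"output": action_input, "steps": steps, "hit_limit": False}
        observation = tools[action](action_input)
        steps.append((action, action_input, observation))
    # Iteration budget exhausted: produce an answer from what was gathered so far.
    last_observation = steps[-1][2] if steps else ""
    return {
        "output": f"Best answer so far: {last_observation}",
        "steps": steps,
        "hit_limit": True,
    }


tools = {"Knowledge Base": lambda text: f"fact about {text}"}


def decide_once(query, steps):
    # Use the tool once, then answer with its observation.
    if steps:
        return "final", steps[-1][2]
    return "Knowledge Base", query


def decide_forever(query, steps):
    # Never finishes on its own, so the iteration limit must stop it.
    return "Knowledge Base", query


print(run_agent(decide_once, tools, "Notre Dame")["output"])
print(len(run_agent(decide_forever, tools, "Notre Dame")["steps"]))
```

The second call stops after exactly three tool calls, which is the behaviour `max_iterations=3` is meant to guarantee.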

In [31]:
from langchain.agents import initialize_agent

agent = initialize_agent(
    agent='chat-conversational-react-description',
    tools=tools,
    llm=llm,
    verbose=True,
    max_iterations=3,
    early_stopping_method='generate',
    memory=conversational_memory
)


In [32]:
agent(query)




> Entering new AgentExecutor chain...
```json
{
    "action": "Knowledge Base",
    "action_input": "Establishment date of the College of Engineering at the University of Notre Dame"
}
```
Observation: The College of Engineering at the University of Notre Dame was established in 1920.
Thought:
```json
{
    "action": "Final Answer",
    "action_input": "The College of Engineering at the University of Notre Dame was established in 1920."
}
```

> Finished chain.


{'input': 'when was the college of engineering in the University of Notre Dame established?',
 'chat_history': [],
 'output': 'The College of Engineering at the University of Notre Dame was established in 1920.'}
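
The trace above shows the model emitting a JSON blob with an `action` and an `action_input`. Handling such a blob can be sketched as below. `parse_action` and `dispatch` are hypothetical helpers, not LangChain's own output parser, and the upper-casing tool is a stand-in for `qa.run`.

```python
import json

FENCE = "`" * 3  # three backticks, the fence the model wraps its JSON in


def parse_action(raw_text):
    """Extract (action, action_input) from a fenced or bare JSON blob."""
    cleaned = raw_text.strip()
    if cleaned.startswith(FENCE):
        cleaned = cleaned.split("\n", 1)[1]    # drop the opening fence line
        cleaned = cleaned.rsplit(FENCE, 1)[0]  # drop the closing fence
    blob = json.loads(cleaned)
    return blob["action"], blob["action_input"]


def dispatch(raw_text, tools):
    """Return the final answer, or call the named tool with its input."""
    action, action_input = parse_action(raw_text)
    if action == "Final Answer":
        return {"final": True, "result": action_input}
    return {"final": False, "result": tools[action](action_input)}


raw = (
    FENCE + "json\n"
    '{\n    "action": "Knowledge Base",\n    "action_input": "Notre Dame"\n}\n'
    + FENCE
)
tools = {"Knowledge Base": str.upper}  # stand-in tool for illustration
print(dispatch(raw, tools))
```

A `Final Answer` action ends the loop, while any other action name is looked up in the tool list, which is why the tool's `name` must match what the model emits.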

In [34]:
agent("can you tell me some facts about the University of Notre Dame?")



> Entering new AgentExecutor chain...
```json
{
    "action": "Knowledge Base",
    "action_input": "University of Notre Dame facts"
}
```
Observation: The University of Notre Dame, located in South Bend, Indiana, is a Catholic research university known for its strong alumni network, top-ranked undergraduate colleges, and graduate programs. It has a large campus with recognizable landmarks like the Golden Dome and the Basilica. The university offers over 50 master's, doctoral, and professional degree programs, including the Notre Dame Law School. The campus has a strong residential community with over 80% of undergraduates living on campus in single-sex residence halls. The university is also known for its research institutes in various fields, such as the Medieval Institute and the Kellogg Institute for International Studies. Notre Dame has a diverse student body representing all 50 states and 100 countries. The university has been recognized fo

{'input': 'can you tell me some facts about the University of Notre Dame?',
 'chat_history': [HumanMessage(content='when was the college of engineering in the University of Notre Dame established?'),
  AIMessage(content='The College of Engineering at the University of Notre Dame was established in 1920.')],
 'output': "The University of Notre Dame, located in South Bend, Indiana, is a Catholic research university known for its strong alumni network, top-ranked undergraduate colleges, and graduate programs. It has a large campus with recognizable landmarks like the Golden Dome and the Basilica. The university offers over 50 master's, doctoral, and professional degree programs, including the Notre Dame Law School. The campus has a strong residential community with over 80% of undergraduates living on campus in single-sex residence halls. The university is also known for its research institutes in various fields, such as the Medieval Institute and the Kellogg Institute for International S

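- Question: can you tell me some facts about the University of Notre Dame?
- Answer (summary): the University of Notre Dame, located in South Bend, Indiana, is a Catholic research university known for its strong alumni network, top-ranked undergraduate colleges, and graduate programs....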

In [35]:
agent("what is 2 * 7?")



> Entering new AgentExecutor chain...
```json
{
    "action": "Final Answer",
    "action_input": "The result of 2 * 7 is 14."
}
```

> Finished chain.


{'input': 'what is 2 * 7?',
 'chat_history': [HumanMessage(content='when was the college of engineering in the University of Notre Dame established?'),
  AIMessage(content='The College of Engineering at the University of Notre Dame was established in 1920.'),
  HumanMessage(content='can you tell me some facts about the University of Notre Dame?'),
  AIMessage(content="The University of Notre Dame, located in South Bend, Indiana, is a Catholic research university known for its strong alumni network, top-ranked undergraduate colleges, and graduate programs. It has a large campus with recognizable landmarks like the Golden Dome and the Basilica. The university offers over 50 master's, doctoral, and professional degree programs, including the Notre Dame Law School. The campus has a strong residential community with over 80% of undergraduates living on campus in single-sex residence halls. The university is also known for its research institutes in various fields, such as the Medieval Insti
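
The growing `chat_history` in these outputs comes from the windowed memory configured with `k=5`, which keeps only the most recent exchanges. The idea can be sketched with a bounded deque. `WindowMemory` is a hypothetical class for illustration, not LangChain's `ConversationBufferWindowMemory`.

```python
from collections import deque


class WindowMemory:
    """Keep only the last k (human, ai) exchanges."""

    def __init__(self, k=5):
        self._turns = deque(maxlen=k)  # older turns are dropped automatically

    def save(self, human, ai):
        self._turns.append((human, ai))

    def history(self):
        """Flatten stored turns into an ordered list of (role, text) messages."""
        messages = []
        for human, ai in self._turns:
            messages.append(("human", human))
            messages.append(("ai", ai))
        return messages


memory = WindowMemory(k=2)
memory.save("q1", "a1")
memory.save("q2", "a2")
memory.save("q3", "a3")  # pushes the first exchange out of the window
print(memory.history())
```

With `k=2`, the first exchange is forgotten once the third arrives, which bounds how much history is sent to the model on each call.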

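#### Process
- Embed the text into 1536-dimensional vectors
- Add these vectors to the Pinecone index, where they are stored
- Create a Pinecone vector store to connect this index to LangChain
- Create an agent that retrieves data from the vector store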

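## Professor's comments
- With RAG, to avoid hallucination, the model is told to say it does not know when it does not know.
- In this code, the agent answers directly when a question can be answered from general knowledge, and when the question concerns a specific domain it retrieves and answers using RAG.
- Why embed? Speed. Texts are embedded in advance so that similar texts can be searched quickly.
- What if the RAG store grows (in size)? Retrieval gets slower.
- To keep studying, try adding an agent to the RAG code written earlier, using an existing agent type.
- e.g., school regulations, math calculations, general answers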
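
For the math-calculation case, one possible building block is a small, restricted arithmetic evaluator. This is a sketch under the assumption that only `+`, `-`, `*`, `/` and parentheses are needed. `calculate` is a hypothetical function, not something from the notebook or from LangChain.

```python
import ast
import operator

# Only these binary operators are allowed.
_OPERATORS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculate(expression: str):
    """Evaluate a basic arithmetic expression without using eval()."""

    def evaluate(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPERATORS:
            return _OPERATORS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -evaluate(node.operand)
        raise ValueError("unsupported expression")

    return evaluate(ast.parse(expression, mode="eval").body)


print(calculate("2 * 7"))
```

A function like this could be passed as `func` to a second `Tool`, alongside the `Knowledge Base` tool, with a description telling the agent to use it for arithmetic.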