### Generating Test Cases

- Evaluation criterion:
  - Accuracy = whether the response correctly answers the actual question
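
As a minimal sketch of how this criterion could be scored, the snippet below computes accuracy as the fraction of responses judged correct. The `accuracy` helper and the `judgments` list are hypothetical and not part of the original notebook; the labels are made-up placeholders standing in for a manual review of each response.

```python
def accuracy(judgments):
    """Return the fraction of responses judged correct (True)."""
    if not judgments:
        raise ValueError("judgments must not be empty")
    return sum(1 for ok in judgments if ok) / len(judgments)


# Hypothetical hand-labeled results for 10 questions: True = correct answer.
judgments = [True, True, False, True, True, True, False, True, True, True]
print(f"Accuracy: {accuracy(judgments):.0%}")  # Accuracy: 80%
```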

In [None]:
import os

from openai import OpenAI


OPENAI_API_KEY = os.environ['OPENAI_API_KEY']
client = OpenAI(api_key=OPENAI_API_KEY)

In [None]:
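# Read the guidebook as UTF-8 so the Korean text decodes the same way on every platform.
with open('./res/guidebook_full.txt', 'r', encoding='utf-8') as f:
    text = f.read()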

In [None]:
text

In [None]:
prompt = f"""You are a professional agent proficient in writing company guidebooks.
A company guidebook contains company mission, internal rules, social welfare and other information needed for employees to work in a company.
You have previously written a company guidebook for a Korean firm named "ABC 컴퍼니" upon request.
This time, you will prepare sample questions that employees may ask.
Given the guidebook text below, you should come up with 8 questions related to information in the guidebook.
You should come up with 2 more questions that are not relevant to the guidebook. In total, you should produce 10 questions.

[GUIDEBOOK START]

{text}

[GUIDEBOOK END]"""

completion = client.chat.completions.create(
    model='gpt-4o-2024-05-13',
    messages=[{'role': 'user', 'content': prompt}],
    temperature=0.0
)

print(completion.choices[0].message.content)

In [None]:
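# Test questions, kept in Korean to match the guidebook; English glosses are given as comments.
questions = [
    "ABC 컴퍼니의 정규 업무 시간은 어떻게 되나요?",  # What are the regular working hours at ABC 컴퍼니?
    "연차 휴가를 사용하려면 어떤 절차를 따라야 하나요?",  # What procedure do I follow to use annual leave?
    "회사에서 제공하는 건강 검진은 어떻게 신청할 수 있나요?",  # How can I apply for the company-provided health checkup?
    "사내 헬스장은 언제 이용할 수 있나요?",  # When can I use the in-house gym?
    "성과 평가는 몇 회 실시되며, 어떤 기준으로 평가되나요?",  # How many times are performance reviews held, and by what criteria?
    "신입사원은 입사 후 어떤 교육을 이수해야 하나요?",  # What training must new hires complete after joining?
    "회사의 비전은 무엇인가요?",  # What is the company's vision?
    "윤리 위반 행위를 발견했을 때 어떻게 신고할 수 있나요?",  # How can I report an ethics violation I have found?
    "회사 근처에 추천할 만한 식당이 있나요?",  # Are there any recommended restaurants near the company?
    "회사에서 제공하는 주차 공간은 어떻게 이용할 수 있나요?"  # How can I use the company-provided parking space?
]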

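### Data Chunking - Why the Data Must Be Split
1. The full text can exceed the context length, so it cannot be sent to the LLM API
2. Passing in large inputs lowers accuracy
3. More input tokens increase cost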
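
One way to keep each chunk under a size budget is sketched below. It splits on blank lines and then packs paragraphs together up to a limit. The `chunk_paragraphs` helper and its `max_chars` parameter are hypothetical, and character count is only a rough stand-in for token count; the notebook itself simply splits on `'\n\n'`.

```python
def chunk_paragraphs(text, max_chars=500):
    """Split text on blank lines, then pack paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # skip empty fragments produced by extra blank lines
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


sample = "A" * 10 + "\n\n" + "B" * 10 + "\n\n" + "C" * 10
print(chunk_paragraphs(sample, max_chars=25))
```

With a 25-character budget, the first two 10-character paragraphs fit in one chunk and the third starts a new one.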

In [None]:
contexts = text.split('\n\n')
len(contexts), contexts

In [None]:
import numpy as np

from utils import get_embeddings, cosine_similarity


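def retrieve_context(question, contexts):
    """Return the context chunk most similar to the question (top-1 retrieval)."""
    # Note: context embeddings are recomputed on every call; cache them if this becomes slow or costly.
    question_embedding = get_embeddings([question], model='text-embedding-3-small')[0]
    context_embeddings = get_embeddings(contexts, model='text-embedding-3-small')

    # Score every chunk by cosine similarity to the question embedding.
    similarities = [cosine_similarity(question_embedding, context_embedding) for context_embedding in context_embeddings]

    # Pick the index of the highest-scoring chunk.
    most_relevant_index = int(np.argmax(similarities))
    return contexts[most_relevant_index]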
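
The retrieval step can be illustrated without any API calls. The sketch below uses hand-written toy vectors in place of real embeddings and a plain-Python cosine similarity. The names `cosine_sim` and `retrieve` are hypothetical; the actual `utils.cosine_similarity` and `utils.get_embeddings` are not shown in this notebook and may be implemented differently.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(question_vec, context_vecs, contexts):
    """Return the context whose vector is most similar to the question vector."""
    scores = [cosine_sim(question_vec, vec) for vec in context_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return contexts[best]


# Toy data (assumed for illustration): one axis per topic.
toy_contexts = ["working hours", "annual leave", "health checkup"]
toy_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
question_vector = [0.1, 0.9, 0.2]  # points mostly along the "annual leave" axis

print(retrieve(question_vector, toy_vectors, toy_contexts))  # annual leave
```

Because the context vectors here are fixed, they are computed once and reused for every question, which is also how real context embeddings could be cached.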

In [None]:
for i, question in enumerate(questions):
    print(i)
    print(question)
    print(retrieve_context(question, contexts))
    print()

In [None]:
from anthropic import Anthropic

anthropic_client = Anthropic(api_key=os.environ['ANTHROPIC_API_KEY'])

response = anthropic_client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    messages=[{"role": "user", "content": questions[0]}],
    temperature=0.0
)
print(questions[0])
print(response.content[0].text)

In [None]:
# Answer each question with the single best-matching chunk prepended as context.
for i, question in enumerate(questions):
    context = retrieve_context(question, contexts)
    prompt = context + '\n\n' + question

    response = anthropic_client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0
    )
    print(f"QUESTION {i+1}: {question}")
    print(f"CONTEXT: {context}")
    print(f"RESPONSE: {response.content[0].text}")
    print()
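
The loop above joins the context and the question with a blank line only. As one possible variation, the sketch below marks the context boundaries explicitly, reusing the bracketed START/END style of the question-generation prompt. The `build_prompt` helper and its instruction wording are assumptions for illustration, not the notebook's method.

```python
def build_prompt(context, question):
    """Wrap the retrieved context in explicit markers and append the question."""
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say that you do not know.\n\n"
        f"[CONTEXT START]\n{context}\n[CONTEXT END]\n\n"
        f"Question: {question}"
    )


# Placeholder context and question, made up for this example.
print(build_prompt("Working hours: 9:00 to 18:00", "When does the workday start?"))
```

An explicit fallback instruction may matter for the questions that are deliberately unrelated to the guidebook, since the retriever still returns some chunk for them.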

In [None]:
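# Build overlapping chunks: each paragraph joined with its previous and next neighbor.
# The '\n\n' separator keeps paragraph boundaries instead of running the texts together.
contexts_overlap = []
for i in range(1, len(contexts) - 1):
    contexts_overlap.append('\n\n'.join(contexts[i-1:i+2]))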
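
The same overlap idea can be written as a small reusable function with a configurable window size. This is a sketch only: `overlapping_windows` and its parameters are hypothetical names, and unlike the loop above it returns a single combined chunk when there are fewer chunks than the window size instead of an empty list.

```python
def overlapping_windows(chunks, size=3, separator="\n\n"):
    """Join every run of `size` consecutive chunks into one overlapping chunk."""
    if len(chunks) <= size:
        # Too few chunks for a full window: return everything as one chunk.
        return [separator.join(chunks)] if chunks else []
    return [
        separator.join(chunks[i:i + size])
        for i in range(len(chunks) - size + 1)
    ]


print(overlapping_windows(["A", "B", "C", "D"], separator=" | "))
# ['A | B | C', 'B | C | D']
```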

In [None]:
contexts_overlap

In [None]:
# Repeat the experiment, retrieving from the overlapping chunks instead.
for i, question in enumerate(questions):
    context = retrieve_context(question, contexts_overlap)
    prompt = context + '\n\n' + question

    response = anthropic_client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0
    )
    print(f"QUESTION {i+1}: {question}")
    print(f"CONTEXT: {context}")
    print(f"RESPONSE: {response.content[0].text}")
    print()