# Part 2: Advanced Techniques and Evaluation

### Example code: Chapter01/RAG_Overview.ipynb

> The environment setup is carried over unchanged from Part 1.

In [1]:
import sys
import os

current_dir = os.path.abspath(os.getcwd())
python_path = os.path.join(current_dir, "..")

if python_path not in sys.path:
    sys.path.append(python_path)

In [2]:
from cllama import send_query

### Retrieval Metrics (Section 1)

> We start with cosine similarity, a key measure for evaluating how relevant text documents are to one another.


In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def calculate_cosine_similarity(text1, text2):
    vectorizer = TfidfVectorizer(
        stop_words='english',
        use_idf=True,
        norm='l2',
        ngram_range=(1, 2),  # Use unigrams and bigrams
        sublinear_tf=True,   # Apply sublinear TF scaling
        analyzer='word'      # You could also experiment with 'char' or 'char_wb' for character-level features
    )
    tfidf = vectorizer.fit_transform([text1, text2])
    similarity = cosine_similarity(tfidf[0:1], tfidf[1:2])
    return similarity[0][0]

> Cosine similarity strictly measures only the angle between the vector representations of two texts, so it has limits when handling ambiguous queries.
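
> To make the "angle only" point concrete, here is a minimal standard-library sketch. The helper `cosine` and the toy vectors are illustrative assumptions, not part of the notebook. Two vectors that point the same way score 1.0 regardless of length, and orthogonal vectors score 0.0.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a, twice the length
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a

print(round(cosine(vec_a, vec_b), 6))  # 1.0
print(round(cosine(vec_a, vec_c), 6))  # 0.0
```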

> Bringing in natural language processing (NLP) techniques that better capture the semantic relationships between words can improve the similarity score.

In [4]:
import spacy
import nltk
nltk.download('wordnet')

from nltk.corpus import wordnet
from collections import Counter
import numpy as np

# Run once: download the spaCy model
spacy.cli.download("en_core_web_sm")
# Load spaCy model
nlp = spacy.load("en_core_web_sm")


[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/jeonghyeseong/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Collecting en-core-web-sm==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [5]:

def get_synonyms(word):
    synonyms = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            synonyms.add(lemma.name())
    return synonyms

def preprocess_text(text):
    doc = nlp(text.lower())
    lemmatized_words = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        lemmatized_words.append(token.lemma_)
    return lemmatized_words

def expand_with_synonyms(words):
    expanded_words = words.copy()
    for word in words:
        expanded_words.extend(get_synonyms(word))
    return expanded_words

def calculate_enhanced_similarity(text1, text2):
    # Preprocess and tokenize texts
    words1 = preprocess_text(text1)
    words2 = preprocess_text(text2)

    # Expand with synonyms
    words1_expanded = expand_with_synonyms(words1)
    words2_expanded = expand_with_synonyms(words2)

    # Count word frequencies
    freq1 = Counter(words1_expanded)
    freq2 = Counter(words2_expanded)

    # Create a set of all unique words
    unique_words = set(freq1.keys()).union(set(freq2.keys()))

    # Create frequency vectors
    vector1 = [freq1[word] for word in unique_words]
    vector2 = [freq2[word] for word in unique_words]

    # Convert lists to numpy arrays
    vector1 = np.array(vector1)
    vector2 = np.array(vector2)

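    # Calculate cosine similarity, guarding against zero-length vectors
    # (e.g. a text made up only of stop words), which would divide by zero
    norm_product = np.linalg.norm(vector1) * np.linalg.norm(vector2)
    if norm_product == 0:
        return 0.0

    return float(np.dot(vector1, vector2) / norm_product)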
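
> The effect of synonym expansion can be seen in a small, self-contained sketch. It assumes a hypothetical hard-coded table `TOY_SYNONYMS` in place of WordNet, and the helper names are illustrative only. Expansion adds shared terms to both word lists, which raises the cosine score of their frequency vectors.

```python
import math
from collections import Counter

# Hypothetical toy synonym table standing in for WordNet lookups
TOY_SYNONYMS = {"store": {"shop", "repository"}, "database": {"repository"}}

def expand(words):
    """Append the toy synonyms of each word to a copy of the word list."""
    expanded = list(words)
    for word in words:
        expanded.extend(sorted(TOY_SYNONYMS.get(word, ())))
    return expanded

def counter_cosine(words1, words2):
    """Cosine similarity of two word-frequency vectors."""
    freq1, freq2 = Counter(words1), Counter(words2)
    dot = sum(freq1[w] * freq2[w] for w in set(freq1) | set(freq2))
    norm1 = math.sqrt(sum(c * c for c in freq1.values()))
    norm2 = math.sqrt(sum(c * c for c in freq2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

toy_query_words = ["rag", "store"]
toy_doc_words = ["rag", "database"]

plain_score = counter_cosine(toy_query_words, toy_doc_words)
expanded_score = counter_cosine(expand(toy_query_words), expand(toy_doc_words))
print(round(plain_score, 3), round(expanded_score, 3))  # 0.5 0.577
```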
     

### Naive RAG (Section 2)

p.26


> Naive RAG, which relies on keyword search and matching, can be efficient for well-defined internal documents such as legal or medical records.

> `query` and `db_records` are carried over exactly as defined earlier.

In [6]:
import textwrap
query = "define a rag store"
db_records = [
    "Retrieval Augmented Generation (RAG) represents a sophisticated hybrid approach in the field of artificial intelligence, particularly within the realm of natural language processing (NLP).",
    "It innovatively combines the capabilities of neural network-based language models with retrieval systems to enhance the generation of text, making it more accurate, informative, and contextually relevant.",
    "This methodology leverages the strengths of both generative and retrieval architectures to tackle complex tasks that require not only linguistic fluency but also factual correctness and depth of knowledge.",
    "At the core of Retrieval Augmented Generation (RAG) is a generative model, typically a transformer-based neural network, similar to those used in models like GPT (Generative Pre-trained Transformer) or BERT (Bidirectional Encoder Representations from Transformers).",
    "This component is responsible for producing coherent and contextually appropriate language outputs based on a mixture of input prompts and additional information fetched by the retrieval component.",
    "Complementing the language model is the retrieval system, which is usually built on a database of documents or a corpus of texts.",
    "This system uses techniques from information retrieval to find and fetch documents that are relevant to the input query or prompt.",
    "The mechanism of relevance determination can range from simple keyword matching to more complex semantic search algorithms which interpret the meaning behind the query to find the best matches.",
    "This component merges the outputs from the language model and the retrieval system.",
    "It effectively synthesizes the raw data fetched by the retrieval system into the generative process of the language model.",
    "The integrator ensures that the information from the retrieval system is seamlessly incorporated into the final text output, enhancing the model's ability to generate responses that are not only fluent and grammatically correct but also rich in factual details and context-specific nuances.",
    "When a query or prompt is received, the system first processes it to understand the requirement or the context.",
    "Based on the processed query, the retrieval system searches through its database to find relevant documents or information snippets.",
    "This retrieval is guided by the similarity of content in the documents to the query, which can be determined through various techniques like vector embeddings or semantic similarity measures.",
    "The retrieved documents are then fed into the language model.",
    "In some implementations, this integration happens at the token level, where the model can access and incorporate specific pieces of information from the retrieved texts dynamically as it generates each part of the response.",
    "The language model, now augmented with direct access to retrieved information, generates a response.",
    "This response is not only influenced by the training of the model but also by the specific facts and details contained in the retrieved documents, making it more tailored and accurate.",
    "By directly incorporating information from external sources, Retrieval Augmented Generation (RAG) models can produce responses that are more factual and relevant to the given query.",
    "This is particularly useful in domains like medical advice, technical support, and other areas where precision and up-to-date knowledge are crucial.",
    "Retrieval Augmented Generation (RAG) systems can dynamically adapt to new information since they retrieve data in real-time from their databases.",
    "This allows them to remain current with the latest knowledge and trends without needing frequent retraining.",
    "With access to a wide range of documents, Retrieval Augmented Generation (RAG) systems can provide detailed and nuanced answers that a standalone language model might not be capable of generating based solely on its pre-trained knowledge.",
    "While Retrieval Augmented Generation (RAG) offers substantial benefits, it also comes with its challenges.",
    "These include the complexity of integrating retrieval and generation systems, the computational overhead associated with real-time data retrieval, and the need for maintaining a large, up-to-date, and high-quality database of retrievable texts.",
    "Furthermore, ensuring the relevance and accuracy of the retrieved information remains a significant challenge, as does managing the potential for introducing biases or errors from the external sources.",
    "In summary, Retrieval Augmented Generation represents a significant advancement in the field of artificial intelligence, merging the best of retrieval-based and generative technologies to create systems that not only understand and generate natural language but also deeply comprehend and utilize the vast amounts of information available in textual form.",
    "A RAG vector store is a database or dataset that contains vectorized data points."
]

def print_formatted_response(response):
    # Format the response with textwrap
    wrapped_text = textwrap.fill(response, width=80)
    print(wrapped_text)

> Implement the keyword search and matching function.

In [7]:
def find_best_match_keyword_search(query, db_records):
    best_score = 0
    best_record = None

    # Split the query into individual keywords
    query_keywords = set(query.lower().split())

    # Iterate through each record in db_records
    for record in db_records:
        # Split the record into keywords
        record_keywords = set(record.lower().split())

        # Calculate the number of common keywords
        common_keywords = query_keywords.intersection(record_keywords)
        current_score = len(common_keywords)

        # Update the best score and record if the current score is higher
        if current_score > best_score:
            best_score = current_score
            best_record = record

    return best_score, best_record

# Assuming 'query' and 'db_records' are defined in previous cells in your Colab notebook
best_keyword_score, best_matching_record = find_best_match_keyword_search(query, db_records)

print(f"Best Keyword Score: {best_keyword_score}")
print_formatted_response(best_matching_record)

Best Keyword Score: 3
A RAG vector store is a database or dataset that contains vectorized data
points.
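
> One caveat about whitespace splitting: punctuation stays attached to words, so `points.` in the record would not match a query word `points`. The sketch below is an illustrative variant, not the notebook's method. The helpers `tokenize` and `keyword_overlap` are assumptions; they strip punctuation with a regular expression before counting shared keywords.

```python
import re

def tokenize(text):
    """Lowercase the text and keep only runs of letters and digits."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_overlap(query_text, record_text):
    """Number of distinct tokens shared by the query and the record."""
    return len(tokenize(query_text) & tokenize(record_text))

sample_record = "A RAG vector store is a database or dataset that contains vectorized data points."

print(keyword_overlap("define a rag store", sample_record))      # 3
print(keyword_overlap("vectorized data points", sample_record))  # 3

# Plain split() misses "points" because the record holds "points."
naive_overlap = len(set("vectorized data points".split()) & set(sample_record.lower().split()))
print(naive_overlap)  # 2
```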


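> Use the cosine similarity function to score the query against the best-matching record.

> The similarity is low because the user input (the query) is short, whereas the response is longer and more complete.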
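
> The length effect can be reproduced with a simplified model. The sketch below assumes binary bag-of-words vectors rather than the notebook's TF-IDF, so the numbers differ from the cell output. With the same two shared words, the score drops as the record gains words the query does not contain.

```python
import math

def binary_cosine(words_a, words_b):
    """Cosine similarity of binary (present/absent) word vectors."""
    set_a, set_b = set(words_a), set(words_b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / math.sqrt(len(set_a) * len(set_b))

toy_query = "rag store".split()
short_record = "rag vector store".split()
long_record = ("a rag vector store is a database or dataset "
               "that contains vectorized data points").split()

short_score = binary_cosine(toy_query, short_record)
long_score = binary_cosine(toy_query, long_record)
print(round(short_score, 3))  # 0.816
print(round(long_score, 3))   # 0.392
```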

In [9]:
# Cosine Similarity
score = calculate_cosine_similarity(query, best_matching_record)
print(f"Best Cosine Similarity Score: {score:.3f}")

Best Cosine Similarity Score: 0.126


> Check the enhanced similarity metric.

In [10]:
# Enhanced Similarity
response = best_matching_record
print(query,": ", response)
similarity_score = calculate_enhanced_similarity(query, response)
print(f"Enhanced Similarity: {similarity_score:.3f}")

define a rag store :  A RAG vector store is a database or dataset that contains vectorized data points.
Enhanced Similarity: 0.642


> Augment the input by concatenating the user input with the best-matching record that keyword search found in the dataset.

In [11]:
augmented_input=query+ ": "+ best_matching_record
print_formatted_response(augmented_input)

define a rag store: A RAG vector store is a database or dataset that contains
vectorized data points.


In [12]:
def call_llm_with_full_text(itext):
    # Accept a single string or a list of strings. Joining a bare string
    # would insert a newline between every character.
    text_input = itext if isinstance(itext, str) else '\n'.join(itext)
    prompt = f"Please elaborate on the following content and translate the result:\n{text_input}"
    try:
        result = send_query([
            {
                "role": "system",
                "content": "You are an expert in Natural Language Processing exercises."
            },
            {
                "role": "assistant",
                "content": "1. You can read the input and answer in detail"
            },
            {
                "role": "user",
                "content": prompt
            }
        ])
        return result
    except Exception as e:
        # Report the failure and signal it to the caller with None
        print(f"Error: {e}")
        return None
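
> A point worth keeping in mind when passing text to a helper like this: `str.join` iterates over its argument, so joining a plain string puts the separator between every character. The sketch below uses a hypothetical helper `to_prompt_text` (not part of the notebook) to show the difference and one way to accept either a string or a list of strings.

```python
def to_prompt_text(itext):
    """Return a string unchanged; join a list of strings with newlines."""
    return itext if isinstance(itext, str) else "\n".join(itext)

print(repr("\n".join("rag")))                      # 'r\na\ng'
print(repr(to_prompt_text("rag")))                 # 'rag'
print(repr(to_prompt_text(["line 1", "line 2"])))  # 'line 1\nline 2'
```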

In [13]:
# Call the function and print the result
llm_response = call_llm_with_full_text(augmented_input)
print_formatted_response(llm_response)

Okay, let's break down this content, elaborate on it, and then provide a more
comprehensive explanation, followed by translations into a few common languages.
**Understanding the Content: RAG Vector Stores**  The provided text introduces
the concept of a "RAG vector store." Let's unpack what that means.  It's a key
component in a modern approach to working with information, especially when
using large language models (LLMs) like me.  * **RAG stands for Retrieval-
Augmented Generation.**  This is the overarching technique. It combines two core
ideas:     * **Retrieval:**  Finding relevant information from a knowledge
source.     * **Generation:**  Using that retrieved information to generate a
response (like a text answer, code snippet, etc.).  The "generation" part is
typically handled by an LLM.  * **Vector Store:** This is *where* the knowledge
for the "Retrieval" part is stored.  Here's a detailed explanation of what a
vector store is and how it works:      * **Traditional Databases

### Advanced RAG (Section 3)

> As the dataset grows, keyword search can take too long to run.

> This section looks at how vector search and index-based search improve retrieval efficiency and processing speed.


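#### Vector Search (Section 3.1)

> Vector search converts the user query and the documents into numeric arrays called vectors, which are then used in mathematical calculations.

> This makes it possible to retrieve relevant data faster when working with large volumes of data.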
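
> As a rough picture of what "converting text into a numeric array" means, the sketch below builds plain word-count vectors over a shared vocabulary. It is a simplification: the notebook uses TF-IDF weights, and the helpers `build_vocabulary` and `to_count_vector` are illustrative assumptions.

```python
from collections import Counter

def build_vocabulary(texts):
    """Return the sorted list of distinct lowercase words across all texts."""
    return sorted({word for text in texts for word in text.lower().split()})

def to_count_vector(text, vocabulary):
    """Map a text to a list of word counts, one entry per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

toy_texts = ["rag store", "vector store store"]
toy_vocab = build_vocabulary(toy_texts)
toy_vectors = [to_count_vector(text, toy_vocab) for text in toy_texts]

print(toy_vocab)    # ['rag', 'store', 'vector']
print(toy_vectors)  # [[1, 1, 0], [0, 2, 1]]
```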


In [14]:
def find_best_match(text_input, records):
    best_score = 0
    best_record = None
    for record in records:
        current_score = calculate_cosine_similarity(text_input, record)
        if current_score > best_score:
            best_score = current_score
            best_record = record
    return best_score, best_record

In [17]:
best_similarity_score, best_matching_record = find_best_match(query, db_records)

print_formatted_response(best_matching_record)

A RAG vector store is a database or dataset that contains vectorized data
points.


> Naive RAG selected this same sentence as the best record, which shows that naive RAG is not necessarily worse than advanced RAG.

In [19]:
print(f"Best Cosine Similarity Score: {best_similarity_score:.3f}") 

# Enhanced Similarity
similarity_score = calculate_enhanced_similarity(query, best_matching_record)
print(f"Enhanced Similarity: {similarity_score:.3f}")

Best Cosine Similarity Score: 0.126
Enhanced Similarity: 0.642


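> Because the same record was retrieved as in naive RAG, both the cosine similarity and the enhanced similarity are necessarily identical.

> With identical results, one may wonder whether vector search is needed at all, but its advantages should become clearer as the dataset grows.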


In [22]:
augmented_input=query+": "+best_matching_record

print_formatted_response(augmented_input)

# Call the function and print the result
llm_response = call_llm_with_full_text(augmented_input)
print_formatted_response(llm_response)

define a rag store: A RAG vector store is a database or dataset that contains
vectorized data points.
Okay, let's break down this content. It's a very short and foundational
explanation of a **RAG vector store**, a core component in modern Natural
Language Processing (NLP) and particularly Retrieval-Augmented Generation (RAG)
systems.  Here's a detailed elaboration, followed by translations into several
languages:  **Elaboration of "Define a RAG store: A RAG vector store is a
database or dataset that contains vectorized data points."**  This statement is
defining a key component used in building applications that can "understand" and
respond to questions based on large amounts of data. Let's unpack it piece by
piece:  * **RAG:** Stands for **Retrieval-Augmented Generation**. It's an
architecture for building language models (like those powered by LLMs - Large
Language Models) that combines the strengths of *information retrieval* with the
power of *text generation*.  Think of it as giv

#### Index-Based Search (Section 3.2)

> Instead of comparing the query vector directly against vectors derived from the document content, index-based search compares it against indexed vectors.

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def setup_vectorizer(records):
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(records)
    return vectorizer, tfidf_matrix

def find_best_match(query, vectorizer, tfidf_matrix):
    query_tfidf = vectorizer.transform([query])
    similarities = cosine_similarity(query_tfidf, tfidf_matrix)
    best_index = similarities.argmax()  # Get the index of the highest similarity score
    best_score = similarities[0, best_index]
    return best_score, best_index

vectorizer, tfidf_matrix = setup_vectorizer(db_records)

best_similarity_score, best_index = find_best_match(query, vectorizer, tfidf_matrix)
best_matching_record = db_records[best_index]

print_formatted_response(best_matching_record)

A RAG vector store is a database or dataset that contains vectorized data
points.


> The best record is the same once again, so the similarity scores are the same as before.

> The difference is that the best record was found faster.
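
> The speed-up comes from vectorizing the records once and reusing the result for every query. The sketch below illustrates that idea with standard-library word-count vectors instead of TF-IDF. The names `unit_vector`, `build_index`, and `search` are assumptions made for this example.

```python
import math
from collections import Counter

def unit_vector(text):
    """Word-count vector scaled to length 1, stored sparsely as a dict."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()} if norm else {}

def build_index(records):
    """Vectorize every record once, ahead of any query."""
    return [unit_vector(record) for record in records]

def search(index, query_text):
    """Return (best_position, best_score) via dot products of unit vectors."""
    q_vec = unit_vector(query_text)
    scores = [sum(q_vec.get(w, 0.0) * x for w, x in doc.items()) for doc in index]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

toy_records = [
    "retrieval finds documents",
    "a vector store holds vectors",
    "generation writes text",
]
toy_index = build_index(toy_records)  # built once

for toy_q in ["vector store", "writes text"]:  # index reused for each query
    print(toy_q, "->", search(toy_index, toy_q)[0])
```

> Only the query is vectorized at search time; the per-record work has already been done by `build_index`.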

In [26]:
# Cosine Similarity
best_cosine_similarity_score = calculate_cosine_similarity(query, best_matching_record)
print(f"Best Cosine Similarity Score: {best_cosine_similarity_score:.3f}")


# Enhanced Similarity
enhanced_similarity_score = calculate_enhanced_similarity(query, best_matching_record)
print(f"Enhanced Similarity: {enhanced_similarity_score:.3f}")

Best Cosine Similarity Score: 0.126
Enhanced Similarity: 0.642


> Inspect what the feature matrix looks like.

In [28]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def setup_vectorizer(records):
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(records)

    # Convert the TF-IDF matrix to a DataFrame for display purposes
    tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

    # Display the DataFrame
    print(tfidf_df)

    return vectorizer, tfidf_matrix

vectorizer, tfidf_matrix = setup_vectorizer(db_records)

     ability    access  accuracy  accurate     adapt  additional  advancement  \
0   0.000000  0.000000  0.000000  0.000000  0.000000    0.000000     0.000000   
1   0.000000  0.000000  0.000000  0.216364  0.000000    0.000000     0.000000   
2   0.000000  0.000000  0.000000  0.000000  0.000000    0.000000     0.000000   
3   0.000000  0.000000  0.000000  0.000000  0.000000    0.000000     0.000000   
4   0.000000  0.000000  0.000000  0.000000  0.000000    0.236479     0.000000   
5   0.000000  0.000000  0.000000  0.000000  0.000000    0.000000     0.000000   
6   0.000000  0.000000  0.000000  0.000000  0.000000    0.000000     0.000000   
7   0.000000  0.000000  0.000000  0.000000  0.000000    0.000000     0.000000   
8   0.000000  0.000000  0.000000  0.000000  0.000000    0.000000     0.000000   
9   0.000000  0.000000  0.000000  0.000000  0.000000    0.000000     0.000000   
10  0.186734  0.000000  0.000000  0.000000  0.000000    0.000000     0.000000   
11  0.000000  0.000000  0.00

> Augment the input and call the LLM.

In [29]:
augmented_input=query+": "+best_matching_record
print_formatted_response(augmented_input)

# Call the function and print the result
llm_response = call_llm_with_full_text(augmented_input)
print_formatted_response(llm_response)

define a rag store: A RAG vector store is a database or dataset that contains
vectorized data points.
Okay, let's break down this content and elaborate on it, then I'll provide a
more comprehensive explanation, effectively "expanding" on what's given.  I'll
aim to explain it in a way that someone unfamiliar with the concept of RAG
vector stores can understand, while also hitting on the important technical
details.  After the elaboration, I'll provide a translation into a few different
languages.  **Elaboration of "RAG Vector Store"**  The provided text defines a
RAG vector store very simply: it's a database (or dataset) holding vectorized
data points. However, that definition needs a lot of context. Here's a more
detailed explanation, built up from the core concept:  **1. What is RAG?
(Retrieval-Augmented Generation)**  Before diving into the *store* itself, let’s
understand the bigger picture: **RAG**. RAG is a technique used with Large
Language Models (LLMs) like GPT-3, GPT-4, Gemini

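### Modular RAG (Section 4)

* `Keyword search` suits simple retrieval
* `Vector search` is ideal for semantically rich documents
* `Index-based search` delivers speed on large datasets

> Each approach has its strengths. What matters is that all three can work together within a single project.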


In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class RetrievalComponent:
    def __init__(self, method='vector'):
        self.method = method
        if self.method == 'vector' or self.method == 'indexed':
            self.vectorizer = TfidfVectorizer()
            self.tfidf_matrix = None

    def fit(self, records):
        self.documents = records  # Initialize self.documents here
        if self.method == 'vector' or self.method == 'indexed':
            self.tfidf_matrix = self.vectorizer.fit_transform(records)

    def retrieve(self, query):
        if self.method == 'keyword':
            return self.keyword_search(query)
        elif self.method == 'vector':
            return self.vector_search(query)
        elif self.method == 'indexed':
            return self.indexed_search(query)

    def keyword_search(self, query):
        best_score = 0
        best_record = None
        query_keywords = set(query.lower().split())
        for index, doc in enumerate(self.documents):
            doc_keywords = set(doc.lower().split())
            common_keywords = query_keywords.intersection(doc_keywords)
            score = len(common_keywords)
            if score > best_score:
                best_score = score
                best_record = self.documents[index]
        return best_record

    def vector_search(self, query):
        query_tfidf = self.vectorizer.transform([query])
        similarities = cosine_similarity(query_tfidf, self.tfidf_matrix)
        best_index = similarities.argmax()
        # Use the records passed to fit(), not the global db_records
        return self.documents[best_index]

    def indexed_search(self, query):
        # Assuming the tfidf_matrix is precomputed and stored
        query_tfidf = self.vectorizer.transform([query])
        similarities = cosine_similarity(query_tfidf, self.tfidf_matrix)
        best_index = similarities.argmax()
        # Use the records passed to fit(), not the global db_records
        return self.documents[best_index]
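
> One possible way to keep retrieval methods swappable is a small registry that maps method names to functions. This is a sketch under assumptions, not the notebook's design: `keyword_retrieve`, `jaccard_retrieve`, `RETRIEVERS`, and `retrieve_with` are hypothetical names, and the two scorers are deliberately simple stand-ins. Unlike an `if`/`elif` chain with no final branch, an unknown method name raises an error instead of quietly returning `None`.

```python
def keyword_retrieve(query_text, documents):
    """Pick the document sharing the most whole words with the query."""
    q_words = set(query_text.lower().split())
    return max(documents, key=lambda d: len(q_words & set(d.lower().split())))

def jaccard_retrieve(query_text, documents):
    """Pick the document with the highest Jaccard word overlap."""
    q_words = set(query_text.lower().split())

    def score(document):
        words = set(document.lower().split())
        union = q_words | words
        return len(q_words & words) / len(union) if union else 0.0

    return max(documents, key=score)

# Registry of available strategies; new ones are added by extending the dict
RETRIEVERS = {"keyword": keyword_retrieve, "jaccard": jaccard_retrieve}

def retrieve_with(method, query_text, documents):
    """Dispatch to the named strategy, failing loudly on unknown names."""
    if method not in RETRIEVERS:
        raise ValueError(f"Unknown retrieval method: {method!r}")
    return RETRIEVERS[method](query_text, documents)

toy_docs = ["rag combines retrieval and generation", "a vector store holds vectors"]
print(retrieve_with("keyword", "vector store", toy_docs))
```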

In [31]:
# Usage example
retrieval = RetrievalComponent(method='vector')  # Choose from 'keyword', 'vector', 'indexed'
retrieval.fit(db_records)
best_matching_record = retrieval.retrieve(query)

print_formatted_response(best_matching_record)

A RAG vector store is a database or dataset that contains vectorized data
points.


In [32]:
augmented_input=query+": "+best_matching_record
print_formatted_response(augmented_input)

# Call the function and print the result
llm_response = call_llm_with_full_text(augmented_input)
print_formatted_response(llm_response)

define a rag store: A RAG vector store is a database or dataset that contains
vectorized data points.
Okay, let's break down this content. It's a very concise definition of a **RAG
vector store** within the context of Natural Language Processing (NLP) and
specifically, Retrieval Augmented Generation (RAG).  I'll elaborate on it,
explaining each component, its purpose, and the broader context. Then I'll
provide translations into several languages.  **Elaboration of the Content:**
The text defines a RAG vector store. Let's dissect that:  * **RAG (Retrieval
Augmented Generation):** This is a key architectural pattern in modern NLP.
Instead of a large language model (LLM) *only* relying on its pre-trained
knowledge, RAG enhances it by *retrieving* relevant information from an external
knowledge source *before* generating a response.  This allows the LLM to provide
more accurate, up-to-date, and contextually relevant answers. Think of it as
giving the LLM "cheat sheets" to consult before it