# Example of a question-answering service built with RAG

### Setup:

1. This guide uses OpenAI to compute some of the metrics, so make sure your OpenAI API key is ready and available in your environment.
2. Put your OpenAI key into the file 'env.txt'.
3. Install the required libraries. Using an isolated environment (virtual environment) is recommended.
    * pip install -r requirements.txt
4. Run the notebook.

# Pipeline

![rag_pipeline](data/rag_pipeline.jpg)

In [4]:
import os
import openai
import pandas as pd
from datasets import Dataset
from dotenv import dotenv_values
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain_openai import OpenAI
from langchain.chains import RetrievalQA

In [5]:
# pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_colwidth', 100)

In [6]:
# Load environment variables from env.txt file
env_vars = dotenv_values('env.txt')

# Explicitly set the environment variables from the env.txt file to override the global variables if set
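for key, value in env_vars.items():
    # dotenv_values yields None for keys that have no value; os.environ only accepts strings
    if value is not None:
        os.environ[key] = value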

# Verify the environment variables are set correctly
if 'OPENAI_API_KEY' in os.environ:
    print("OpenAI API Key successfully loaded")
else:
    print("Failed to load OpenAI API Key")

OpenAI API Key successfully loaded


### Loading documents:

LangChain provides several built-in document loaders that work with PDF files, JSON files, or Python files in your file directory. We can use LangChain's PyPDFLoader to import a PDF.

In [7]:
# load document
loader = PyPDFLoader("data/HAI_AI-Index-Report-2024_Chapter1.pdf")
documents = loader.load()

### Splitting documents into chunks (chunking):

When a document is long, it has to be split into parts. There are various ways to split text. Let's use the simplest one, CharacterTextSplitter, which splits the text on a character separator (`"\n\n"` by default) and measures chunk length as a number of characters.

In [8]:
# split the documents into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
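
To make `chunk_size` concrete, here is a simplified, dependency-free sketch of separator-based chunking. It is an illustration only, not LangChain's implementation: the function name `split_by_separator` is hypothetical, and overlap handling is left out (which matches `chunk_overlap=0` above).

```python
def split_by_separator(text, separator="\n\n", chunk_size=1000):
    """Greedily merge separator-delimited pieces into chunks of at most
    chunk_size characters. A single piece longer than chunk_size is kept
    whole, so a chunk can exceed the limit when no separator is nearby."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Still fits (or this is the first piece of a new chunk).
            current = candidate
        else:
            # Adding the piece would overflow: close the chunk, start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc\n\n" + "d" * 30
for chunk in split_by_separator(sample, chunk_size=12):
    print(len(chunk), repr(chunk))
```

In this sketch the last piece is 30 characters long and contains no separator, so it ends up as one oversized chunk. That is worth keeping in mind when a PDF page contains few blank lines.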

### Creating text embeddings:

The text chunks are then converted into numeric vectors using embeddings, which lets us work with text data, for example in semantic search, in a computationally efficient way. For this task we can choose, for example, OpenAI.

In [9]:
# select which embeddings we want to use
embeddings = OpenAIEmbeddings()

### Creating the vector store:

Next, we need to save the embedding vectors in a vector store, which lets us search for and retrieve the relevant vectors at query time.

In [10]:
# create the vector store to use as the index
db = Chroma.from_documents(texts, embeddings)

### Creating the retriever interface:

We can expose the vector store through a retriever interface. To retrieve text, we choose a search type, for example "similarity", so that the retriever uses similarity search: it selects the text-chunk vectors that are most similar to the question vector. Setting k=2 returns the 2 most relevant text chunks.

In [11]:
# expose this index in a retriever interface
retriever = db.as_retriever(
    search_type="similarity", search_kwargs={"k": 2}
)
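
The idea behind a top-k similarity search can be sketched in plain Python. This is an illustration under assumptions, not what Chroma does internally: cosine similarity is chosen here only as an example (the metric a real vector store uses depends on its configuration), the helper names `cosine_similarity` and `top_k` are hypothetical, and the 3-dimensional vectors are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, chunk_vectors, k=2):
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vec), idx)
        for idx, vec in enumerate(chunk_vectors)
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]


# Toy "embeddings", invented for illustration only.
toy_chunks = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.05, 0.0]
print(top_k(toy_query, toy_chunks, k=2))
```

With `k=2`, only the two chunks that point in nearly the same direction as the query are returned; everything else never reaches the LLM.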

### Creating a RetrievalQA chain for question answering:

The RetrievalQA chain connects a large language model to our retriever interface.

You can also set the chain type to one of four options:

* stuff
* map_reduce
* refine
* map_rerank
    

1. The default chain type, "stuff", puts ALL of the text from the retrieved documents into the prompt.

2. The "map_reduce" type splits the texts into groups, asks the LLM the question for each group separately, and produces the final answer from the per-group answers.

3. The "refine" type splits the texts into chunks, presents the first chunk to the LLM, and then sends the answer together with the second chunk to the LLM. It refines the answer step by step until all chunks have been processed.

4. The "map_rerank" type splits the texts into chunks and presents each of them to the LLM, which returns an answer with a score indicating how fully it answers the question. The final answer is chosen from the highest-scoring ones.
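
The difference between the first three strategies is mostly a difference in how many LLM calls are made and what each call sees. The sketch below illustrates that with a stand-in model; it is not LangChain's code. The function names, the prompt wording, and the `CountingLLM` class are all hypothetical, and "map_rerank" is omitted for brevity.

```python
def stuff(llm, question, docs):
    # One call: every document is placed into a single prompt.
    context = "\n\n".join(docs)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")


def map_reduce(llm, question, docs):
    # Map: one call per document. Reduce: one call over the partial answers.
    partial = [llm(f"Context:\n{d}\n\nQuestion: {question}") for d in docs]
    summary = "\n".join(partial)
    return llm(f"Partial answers:\n{summary}\n\nQuestion: {question}")


def refine(llm, question, docs):
    # Sequential: each call sees the previous answer plus the next document.
    answer = llm(f"Context:\n{docs[0]}\n\nQuestion: {question}")
    for d in docs[1:]:
        answer = llm(
            f"Existing answer: {answer}\nNew context:\n{d}\n\nQuestion: {question}"
        )
    return answer


class CountingLLM:
    """Stand-in for a real model: counts calls and echoes the prompt length."""

    def __init__(self):
        self.calls = 0

    def __call__(self, prompt):
        self.calls += 1
        return f"answer based on {len(prompt)} characters"


docs = ["first chunk", "second chunk", "third chunk"]
for strategy in (stuff, map_reduce, refine):
    llm = CountingLLM()
    strategy(llm, "What is RAG?", docs)
    print(strategy.__name__, "->", llm.calls, "LLM call(s)")
```

For three documents the sketch makes 1 call for `stuff`, 4 for `map_reduce` (three map calls plus one reduce call), and 3 for `refine`. The trade-off is prompt size versus the number of requests.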

In [12]:
# create a chain to answer questions

qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="map_reduce",
    retriever=retriever,
    return_source_documents=True,
    verbose=True, # logging
)

In [13]:
queries = []
rag_answers = []
contexts = []
naive_llm_answers = []
ground_truths = []

In [14]:
def collect_info(result, query_list, rag_answer_list, context_list):
    query_list.append(result["query"])
    rag_answer_list.append(result["result"])
    current_contexts = []
    for document in result["source_documents"]:
        current_contexts.append(document.page_content)
    context_list.append(current_contexts)

In [15]:
result = qa_chain.invoke("In which year the Ecosystem Graphs were introduced?")

# store information for evaluation dataset
collect_info(result, queries, rag_answers, contexts)
ground_truths.append("2023")

result



> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'In which year the Ecosystem Graphs were introduced?',
 'result': ' The Ecosystem Graphs were introduced in 2022.',
 'source_documents': [Document(page_content='49\nArtificial Intelligence\nIndex Report 2024 Chapter 1 PreviewArtificial Intelligence\nIndex Report 2024\nlevel.7 CSET extracted year values from the first \npublication date within a family. Countries are assigned \nto patents based on the country or filing office where \na patent is first filed (e.g., if a patent is filed with the \nUSPTO on January 1, 2020, and then with the German \nPatenting Office on January 2, 2020, the patent is \nclassified as a patent with U.S. inventors).8 Note that \nthe same patent may have multiple countries (but not \nyears) attributed to it if the inventors filed their patent \nin multiple countries on the same first filing date (e.g., \nif a patent is filed with the USPTO on January 1, 2020, \nand then with the German Patenting Office on January \n1, 2020, the patent is classified a

In [16]:
for document in result["source_documents"]:
    print(document.metadata)
    print("++++++++++++++++++++++++++")
    print(document.page_content)
    print("==========================")

{'page': 48, 'source': 'data/HAI_AI-Index-Report-2024_Chapter1.pdf'}
++++++++++++++++++++++++++
49
Artificial Intelligence
Index Report 2024 Chapter 1 PreviewArtificial Intelligence
Index Report 2024
level.7 CSET extracted year values from the first 
publication date within a family. Countries are assigned 
to patents based on the country or filing office where 
a patent is first filed (e.g., if a patent is filed with the 
USPTO on January 1, 2020, and then with the German 
Patenting Office on January 2, 2020, the patent is 
classified as a patent with U.S. inventors).8 Note that 
the same patent may have multiple countries (but not 
years) attributed to it if the inventors filed their patent 
in multiple countries on the same first filing date (e.g., 
if a patent is filed with the USPTO on January 1, 2020, 
and then with the German Patenting Office on January 
1, 2020, the patent is classified as a patent with U.S. 
inventors and as a patent with German inventors).
Note that patents f

In [17]:
result = qa_chain.invoke("What was the training cost of GPT-4?")

# store information for evaluation dataset
collect_info(result, queries, rag_answers, contexts)
ground_truths.append("over over 100 million Dollars")
for document in result["source_documents"]:
    print(document.metadata)

result



> Entering new RetrievalQA chain...

> Finished chain.
{'page': 36, 'source': 'data/HAI_AI-Index-Report-2024_Chapter1.pdf'}
{'page': 37, 'source': 'data/HAI_AI-Index-Report-2024_Chapter1.pdf'}


{'query': 'What was the training cost of GPT-4?',
 'result': ' The training cost for GPT-4 was mentioned to be over $100 million.',
 'source_documents': [Document(page_content='Chapter 1 Preview 37\nArtificial Intelligence\nIndex Report 2024Training Cost\nA prominent topic in discussions about foundation \nmodels is their speculated costs. While AI \ncompanies seldom reveal the expenses involved \nin training their models, it is widely believed that \nthese costs run into millions of dollars and are \nrising. For instance, OpenAI’s CEO, Sam Altman, \nmentioned that the training cost for GPT-4 was over \n$100 million. This escalation in training expenses \nhas effectively excluded universities, traditionally \ncenters of AI research, from developing their own \nleading-edge foundation models. In response, policy \ninitiatives, such as President Biden’s Executive Order \non AI , have sought to level the playing field between \nindustry and academia by creating a National AI \nResearch Re

# Let's generate naive LLM answers with OpenAI, without our RAG

In [18]:
def generate_naive_llm_response(question):
    try:
        chat_response = openai.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": question,
                        },
                    ],
                }
            ],
            max_tokens=100,
            stream=False, # usability vs. performance tradeoff
            temperature=0.01,
        )

        if hasattr(chat_response, "choices"):
            print(chat_response.choices[0].message.content.strip())
            return chat_response.choices[0].message.content.strip()
    except Exception as e:
        raise Exception(f"Failed to generate response for question: {question}") from e


In [19]:
for question in queries:
    naive_llm_answers.append(generate_naive_llm_response(question))

The Ecosystem Graphs were introduced in the year 2019.
The training cost of GPT-4 is not publicly disclosed by OpenAI, the organization behind the development of the GPT series of models. However, it is estimated that training a large language model like GPT-4 could cost millions of dollars in terms of computational resources, electricity, and manpower.


In [20]:
# create pandas df from these lists
my_general_ds = pd.DataFrame(
    {
        "questions": queries,
        "rag_answers": rag_answers,
        "naive_llm_answers": naive_llm_answers,
        "ground_truths": ground_truths,
        "contexts": contexts,
    }
)

In [21]:
my_general_ds

| | questions | rag_answers | naive_llm_answers | ground_truths | contexts |
|---|---|---|---|---|---|
| 0 | In which year the Ecosystem Graphs were introduced? | The Ecosystem Graphs were introduced in 2022. | The Ecosystem Graphs were introduced in the year 2019. | 2023 | [49\nArtificial Intelligence\nIndex Report 2024 Chapter 1 PreviewArtificial Intelligence\nIndex ... |
| 1 | What was the training cost of GPT-4? | The training cost for GPT-4 was mentioned to be over $100 million. | The training cost of GPT-4 is not publicly disclosed by OpenAI, the organization behind the deve... | over over 100 million Dollars | [Chapter 1 Preview 37\nArtificial Intelligence\nIndex Report 2024Training Cost\nA prominent topi... |


# Evaluating the naive LLM answers with RAGAS

For additional comments and explanations, see the file 'ragas_evaluation.ipynb'.

In [22]:
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)
from ragas import evaluate

In [23]:
# NOTE: ragas documentation is not up to date (14.05.2024). The required Dataset structure is:
# Dataset.from_dict(
#         {
#             'question': list[str],
#             'contexts': list[list[str]],
#             'ground_truth': list[str],
#             'answer': list[str]
#         }
#     )

In [24]:
naive_llm_dataset = Dataset.from_dict(
    {
        'question': queries,
        'contexts': contexts,
        'ground_truth': ground_truths,
        'answer': naive_llm_answers,
    }
)

In [25]:
naive_result = evaluate(
    naive_llm_dataset, 
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    ],
)

naive_result


{'context_precision': 0.5000, 'faithfulness': 0.3333, 'answer_relevancy': 0.9800, 'context_recall': 0.5000}

In [26]:
df_naive = naive_result.to_pandas()
df_naive

| | question | contexts | ground_truth | answer | context_precision | faithfulness | answer_relevancy | context_recall |
|---|---|---|---|---|---|---|---|---|
| 0 | In which year the Ecosystem Graphs were introduced? | [49\nArtificial Intelligence\nIndex Report 2024 Chapter 1 PreviewArtificial Intelligence\nIndex ... | 2023 | The Ecosystem Graphs were introduced in the year 2019. | 0.0 | 0.0 | 0.982378 | 0.0 |
| 1 | What was the training cost of GPT-4? | [Chapter 1 Preview 37\nArtificial Intelligence\nIndex Report 2024Training Cost\nA prominent topi... | over over 100 million Dollars | The training cost of GPT-4 is not publicly disclosed by OpenAI, the organization behind the deve... | 1.0 | 0.666667 | 0.977655 | 1.0 |
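
The aggregate scores printed by `evaluate` above are consistent with a plain average of the per-question scores. The sketch below checks that arithmetic using values copied by hand from the output. That the aggregation is an unweighted mean is an inference from these numbers, not something the notebook states.

```python
from statistics import mean

# Per-question scores copied from the naive-LLM results (row 0, row 1).
per_question = {
    "context_precision": [0.0, 1.0],
    "faithfulness": [0.0, 0.666667],
    "answer_relevancy": [0.982378, 0.977655],
    "context_recall": [0.0, 1.0],
}

# Average each metric over the questions and round to 4 decimal places.
aggregate = {name: round(mean(scores), 4) for name, scores in per_question.items()}
print(aggregate)
```

With only two questions, a single wrong retrieval (row 0) pulls `context_precision` and `context_recall` down to 0.5, so these aggregates should be read as a smoke test rather than a benchmark.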


# Evaluating the RAG-generated answers with RAGAS

In [27]:
RAG_dataset = Dataset.from_dict(
    {
        'question': queries,
        'contexts': contexts,
        'ground_truth': ground_truths,
        'answer': rag_answers,
    }
)

In [28]:
rag_result = evaluate(
    RAG_dataset, 
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    ],
)

rag_result


{'context_precision': 0.5000, 'faithfulness': 0.5000, 'answer_relevancy': 0.9902, 'context_recall': 0.5000}

In [29]:
df_rag = rag_result.to_pandas()
df_rag

| | question | contexts | ground_truth | answer | context_precision | faithfulness | answer_relevancy | context_recall |
|---|---|---|---|---|---|---|---|---|
| 0 | In which year the Ecosystem Graphs were introduced? | [49\nArtificial Intelligence\nIndex Report 2024 Chapter 1 PreviewArtificial Intelligence\nIndex ... | 2023 | The Ecosystem Graphs were introduced in 2022. | 0.0 | 0.0 | 0.982378 | 0.0 |
| 1 | What was the training cost of GPT-4? | [Chapter 1 Preview 37\nArtificial Intelligence\nIndex Report 2024Training Cost\nA prominent topi... | over over 100 million Dollars | The training cost for GPT-4 was mentioned to be over $100 million. | 1.0 | 1.0 | 0.997954 | 1.0 |
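
To compare the two runs side by side, a small sketch can compute the per-metric difference between the aggregate scores reported for the naive LLM and for RAG. The dictionaries are typed in by hand from the printed results, and `score_delta` is a hypothetical helper name.

```python
# Aggregate scores copied from the two evaluate() outputs.
naive_scores = {
    "context_precision": 0.5000,
    "faithfulness": 0.3333,
    "answer_relevancy": 0.9800,
    "context_recall": 0.5000,
}
rag_scores = {
    "context_precision": 0.5000,
    "faithfulness": 0.5000,
    "answer_relevancy": 0.9902,
    "context_recall": 0.5000,
}


def score_delta(baseline, candidate):
    """Return candidate minus baseline for every metric, rounded to 4 places."""
    return {name: round(candidate[name] - baseline[name], 4) for name in baseline}


for name, delta in score_delta(naive_scores, rag_scores).items():
    print(f"{name:18s} {delta:+.4f}")
```

The two context metrics do not change because both datasets reuse the same retrieved `contexts` and `ground_truths`; only the `answer` column differs between the runs.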
