# ReAct Implementation for Agentic RAG

- Implement ReAct with LLaMA 3 70B (Groq).
- Test on the IIRC dataset: the first 50 questions that have an answer (test_questions.json, attached).
- Use the LlamaIndex prompt: https://github.com/run-llama/llama_index/blob/a87b63fce3cc3d24dc71ae170a8d431440025565/llama_index/agent/react/prompts.py
- Save the final answers to the 50 questions as JSON for a future evaluation exercise.
- Instruct the model to follow the sequence Thought, Action, Input, Observation (the observation is not produced by the model itself; it is the result of the search).
- The parameter stop_sequence="Observation:" is required so that the model stops generating text and waits for the search result. Implement the search code and return the top-k documents to the model (suggestion: k=5).
- Instruct the model to act step by step (question decomposition).
- You may use LangChain, LlamaIndex, or another framework, or implement it by hand.
- Use search as a tool.
- Use BM25 as the retriever (repeat the indexing from the previous exercise).
- Use the Visconde indexing: https://github.com/neuralmind-ai/visconde/blob/main/iirc_create_indices.ipynb
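
The Thought/Action/Input/Observation cycle described in the list above can be sketched by hand as follows. This is a minimal illustration, not the implementation used in the cells below: `react_loop`, `ACTION_RE`, and `FINAL_RE` are hypothetical names, and `llm` is assumed to be any callable that takes a prompt and a list of stop sequences and returns the generated text.

```python
import re

# Hypothetical helpers for illustration only.
ACTION_RE = re.compile(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)", re.DOTALL)
FINAL_RE = re.compile(r"Final Answer:\s*(.+)", re.DOTALL)


def react_loop(llm, tools, question, max_steps=5):
    """Run a Thought/Action/Observation loop until a final answer appears.

    llm:   callable (prompt, stop) -> generated text (assumed interface)
    tools: dict mapping tool name -> callable taking a string
    """
    scratchpad = f"Question: {question}\nThought:"
    for _ in range(max_steps):
        # Stopping at "Observation:" keeps the model from inventing the
        # search result; the observation is appended by this loop instead.
        output = llm(scratchpad, ["Observation:"]).rstrip()
        final = FINAL_RE.search(output)
        if final:
            return final.group(1).strip()
        match = ACTION_RE.search(output)
        if match is None:
            observation = "could not parse an action"
        else:
            tool_name = match.group(1).strip()
            tool_input = match.group(2).strip().strip('"')
            tool = tools.get(tool_name)
            observation = tool(tool_input) if tool else f"unknown tool {tool_name!r}"
        scratchpad += f"{output}\nObservation: {observation}\nThought:"
    return "No final answer within the step budget."
```

With a real client, `llm` would wrap the chat-completion call and forward the stop list to the provider's stop parameter; the search tool would return the top-k passages as the observation.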

In [70]:
%%capture
%pip install --quiet torch
%pip install --quiet groq
%pip install --quiet llama-index-llms-groq
%pip install --quiet langchain
%pip install --quiet langchain-groq
%pip install --quiet rank_bm25
%pip install --quiet bs4
%pip install --quiet spacy
%pip install --quiet pyserini

In [71]:
import os
import json
from langchain_core.prompts import PromptTemplate

NUM_QUESTIONS = 50
MODEL_NAME = 'llama3-70b-8192'
# Read the key from the environment rather than hard-coding it in the notebook.
GROQ_API_KEY = os.environ.get('GROQ_API_KEY', '')
REACT_CHAT_SYSTEM_HEADER = '''Answer the following questions as best you can. You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: {input}
Thought:{agent_scratchpad}'''

In [72]:
if not os.path.exists("context_articles.json"):
    !wget http://jamesf-incomplete-qa.s3.amazonaws.com/context_articles.tar.gz
    !tar -xzf context_articles.tar.gz

In [73]:
with open('test_questions.json', 'r') as file:
    test_questions = json.load(file)

with open('context_articles.json', 'r') as file:
    articles = json.load(file)

In [74]:
print(f"""len(test_questions) : {len(test_questions) }
test_questions[0]: {test_questions[0]}
test_questions[0]['question']: {test_questions[0]['question']}
test_questions[0]['question'][0]: {test_questions[0]['question'][0]}
""")

print(f"""len(articles) : {len(articles)}

articles["wimboldsley"] : {articles["wimboldsley"]}
""")

len(test_questions) : 50
test_questions[0]: {'answer': {'type': 'span', 'answer_spans': [{'text': 'sky and thunder god', 'passage': 'zeus', 'type': 'answer', 'start': 83, 'end': 102}]}, 'question': 'What is Zeus know for in Greek mythology?', 'context': [{'text': 'he Palici the sons of Zeus', 'passage': 'main', 'indices': [684, 710]}, {'text': 'in Greek mythology', 'passage': 'main', 'indices': [137, 155]}, {'text': 'Zeus (British English , North American English ; , Zeús ) is the sky and thunder god in ancient Greek religion', 'passage': 'Zeus', 'indices': [0, 110]}], 'question_links': ['Greek mythology', 'Zeus'], 'title': 'Palici'}
test_questions[0]['question']: What is Zeus know for in Greek mythology?
test_questions[0]['question'][0]: W

len(articles) : 56550

articles["wimboldsley"] : Wimboldsley is a village in <a href="Cheshire">Cheshire</a>, <a href="England">England</a>, 2 miles south of <a href="Middlewich">Middlewich</a>. The population at the <a href="United%20Kingdom%20C

In [75]:
# Implementation based on the one by student Fabio Grassiotto: https://github.com/fabiograssiotto/IA024/blob/main/Aula_9_10/RA890441%20-%20Aula9_10.ipynb

from bs4 import BeautifulSoup

def extract_clean_text(html_text):
    """Extract the plain text from an HTML string."""
    soup = BeautifulSoup(html_text, 'html.parser')
    return soup.get_text()

def add_document(documents, all_titles, title, content):
    """Add a document if its title is not already in all_titles."""
    title_lower = title.lower()
    if title_lower not in all_titles:
        documents.append({
            "title": title_lower,
            "content": content
        })
        all_titles.add(title_lower)

# Initialize variables
questions_found = []
num_questions_found = 0
documents = []
all_titles = set()

# Iterate over each item in the test questions
for item in test_questions:
    question = item['question']
    answer = item['answer']
    answer_type = answer['type']

    # Determine the answer type and extract the final answer
    if answer_type == 'binary' or answer_type == 'value':
        final_answer = answer['answer_value']
    elif answer_type == 'span':
        final_answer = answer['answer_spans'][0]['text']
    elif answer_type == 'none':
        final_answer = 'none'
    else:
        final_answer = 'An error perhaps, bad type'
        print(f"Unknown answer type: {answer_type}")

    # Skip questions whose answer is 'none'
    if final_answer == 'none':
        continue

    # Add the question's main-passage context as a document
    for context in item['context']:
        if context['passage'] == "main":
            clean_text = extract_clean_text(context['text'])
            add_document(documents, all_titles, item['title'], clean_text)

    # Fetch the text of the linked articles
    for link in item["question_links"]:
        link_lower = link.lower()
        if link_lower in articles:
            clean_text = extract_clean_text(articles[link_lower])
            add_document(documents, all_titles, link, clean_text)

    # Append the question and answer found to the list
    questions_found.append({"Question": question, "Answer": final_answer})
    num_questions_found += 1

    # Stop once the required number of questions has been collected
    if num_questions_found == NUM_QUESTIONS:
        break

# Resulting variables
questions_to_ask = questions_found
all_documents = documents
all_titles_list = all_titles

print(f"Total questions found: {num_questions_found}")
print(f"Total documents: {len(all_documents)}")

Total questions found: 50
Total documents: 87


In [76]:
# Excerpt taken from Visconde: https://github.com/neuralmind-ai/visconde/blob/main/iirc_create_indices.ipynb

import spacy
import tqdm

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

stride = 2
max_length = 3

def window(documents, stride, max_length):
    treated_documents = []

    for j,document in enumerate(tqdm.tqdm(documents)):
        doc_text = document['content']
        doc = nlp(doc_text[:10000])
        sentences = [sent.text.strip() for sent in doc.sents]
        for i in range(0, len(sentences), stride):
            segment = ' '.join(sentences[i:i + max_length])
            treated_documents.append({
                "title": document['title'],
                "contents": document['title']+". "+segment,
                "segment": segment
            })
            if i + max_length >= len(sentences):
                break
    return treated_documents

treated_documents = window(documents, stride, max_length)

100%|██████████| 87/87 [00:00<00:00, 169.45it/s]
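
To make the windowing concrete, the sketch below reproduces only the index arithmetic of the loop in `window`, without spaCy. `window_indices` is a hypothetical helper introduced for illustration; with `stride = 2` and `max_length = 3`, consecutive segments share one sentence.

```python
def window_indices(num_sentences, stride=2, max_length=3):
    """Return the (start, end) sentence ranges produced by the windowing loop."""
    spans = []
    for i in range(0, num_sentences, stride):
        spans.append((i, min(i + max_length, num_sentences)))
        # Same stopping rule as `window`: quit once a segment reaches the end.
        if i + max_length >= num_sentences:
            break
    return spans


sentences = ["S0", "S1", "S2", "S3", "S4", "S5"]
for start, end in window_indices(len(sentences)):
    print(start, end, sentences[start:end])
```

For six sentences this yields the ranges (0, 3), (2, 5), and (4, 6), so sentences S2 and S4 each appear in two segments.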


In [77]:
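os.makedirs("iirc_indices", exist_ok=True)

# Write one JSON document per line. The context manager closes (and flushes)
# the file before the indexer reads it in the next cell.
with open("iirc_indices/contents.jsonl", "w") as file:
    for i, doc in enumerate(treated_documents):
        doc['id'] = i
        if doc['segment'] != "":
            file.write(json.dumps(doc) + "\n")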

In [78]:
!python -m pyserini.index -collection JsonCollection -generator DefaultLuceneDocumentGenerator -threads 1 -input iirc_indices -index iirc_index -storeRaw

pyserini.index is deprecated, please use pyserini.index.lucene.
2024-05-15 19:23:32,395 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:204) - Setting log level to INFO
2024-05-15 19:23:32,396 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:208) - AbstractIndexer settings:
2024-05-15 19:23:32,396 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:209) -  + DocumentCollection path: iirc_indices
2024-05-15 19:23:32,396 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:210) -  + CollectionClass: JsonCollection
2024-05-15 19:23:32,396 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:211) -  + Index path: iirc_index
2024-05-15 19:23:32,397 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:212) -  + Threads: 1
2024-05-15 19:23:32,397 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:213) -  + Optimize (merge segments)? false
May 15, 2024 7:23:32 PM org.apache.lucene.store.MemorySegmentIndexInputProvider <init>
INFO: Using MemorySegmentInde

## ReAct

In [79]:
from langchain.tools import tool
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher('iirc_index')

@tool
def search_tool(query: str, k: int = 5):
    """Search documents and return the top-k results."""
    hits = searcher.search(query, k=k)
    results = []
    for hit in hits:
        doc = searcher.doc(hit.docid).raw()
        doc_content = json.loads(doc)
        clean_text = BeautifulSoup(doc_content['contents'], 'html.parser').get_text()
        # Note: "title" carries the segment id assigned at indexing time, not the article title.
        results.append({"title": doc_content['id'], "content": clean_text})
    return results

# Question decomposition tool
@tool
def question_decomposition(question: str):
    """Decompose a question into sub-questions."""
    # Placeholder: returns the question unchanged instead of splitting it.
    return [question]

In [80]:
query = "Zeus Greek mythology"
results = search_tool.invoke(query)

for result in results:
    print(f"Title: {result['title']}\nContent: {result['content']}\n")

Title: 37
Content: zeus. Zeus (British English , North American English ; , Zeús ) is the sky and thunder god in ancient Greek religion, who rules as king of the gods of Mount Olympus. His name is cognate with the first element of his Roman equivalent Jupiter. His mythologies and powers are similar, though not identical, to those of Indo-European deities such as Jupiter, Perkūnas, Perun, Indra and Thor.

Title: 6
Content: greek mythology. Poets and artists from ancient times to the present have derived inspiration from Greek mythology and have discovered contemporary significance and relevance in the themes. Sources. Greek mythology is known today primarily from Greek literature and representations on visual media dating from the Geometric period from to onward.

Title: 1
Content: greek mythology. Greek mythology is the body of myths originally told by the ancient Greeks and a genre of Ancient Greek folklore. These stories concern the origin and the nature of the world, the lives and a
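
For intuition about what the BM25 searcher computes, here is a small pure-Python sketch of BM25 scoring over whitespace-tokenized documents. It is not Lucene's implementation: `bm25_scores` and `top_k` are hypothetical helpers, the tokenizer is a plain `split()`, and the values `k1=0.9`, `b=0.4` are an assumption about the searcher's defaults.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=0.9, b=0.4):
    """Score every tokenized document against the query with a BM25 formula."""
    n_docs = len(corpus_tokens)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_k(query, docs, k=5):
    """Return the k highest-scoring (document, score) pairs."""
    corpus = [doc.lower().split() for doc in docs]
    scores = bm25_scores(query.lower().split(), corpus)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [(docs[i], scores[i]) for i in ranked[:k]]
```

Documents that share no term with the query score zero, which is why a keyword-style query such as "Zeus Greek mythology" tends to work better with BM25 than a full natural-language question.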

In [81]:
from langchain.agents import AgentExecutor, create_react_agent
from langchain_groq import ChatGroq

prompt = PromptTemplate.from_template(REACT_CHAT_SYSTEM_HEADER)
llm   = ChatGroq(temperature=0, groq_api_key=GROQ_API_KEY, model_name=MODEL_NAME)

tools = [search_tool, question_decomposition]

agent = create_react_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True, handle_parsing_errors=True)

In [83]:
agent_executor.invoke({"input": "How old was Messe when the First World War started?"})



> Entering new AgentExecutor chain...
Thought: I need to find information about Messe and the First World War.

Action: search_tool
Action Input: "Messe First World War"
[{'title': 97, 'content': 'world war i. The term "First World War" was first used in September 1914 by German biologist and philosopher Ernst Haeckel, who claimed that "there is no doubt that the course and character of the feared \'European War\'\xa0... will become the first world war in the full sense of the word," citing a wire service report in The Indianapolis Star on 20 September 1914. Prior to World War II, the events of 1914–1918 were generally known as the Great War or simply the World War. In October 1914, the Canadian magazine Maclean\'s wrote, "Some wars name themselves.'}, {'title': 96, 'content': 'world war i. However, despite the conclusive Allied victory (and the creation of the League of Nations during the Peace Conference, intended to prevent future wars), a seco

{'input': 'How old was Messe when the First World War started?',
 'output': 'Giovanni Messe was born on December 10, 1883, and the First World War started on July 28, 1914. Therefore, Messe was 30 years old when the First World War started.'}

In [84]:
import string
from collections import Counter
import pandas as pd
import re

# Evaluation implementation based on the work of student Fabio Grassiotto: https://github.com/fabiograssiotto/IA024/blob/main/Aula_9_10/RA890441%20-%20Aula9_10.ipynb

def normalize_answer(s):
	def remove_articles(text):
		return re.sub(r"\b(a|an|the)\b", " ", text)

	def white_space_fix(text):
		return " ".join(text.split())

	def remove_punc(text):
		exclude = set(string.punctuation)
		return "".join(ch for ch in text if ch not in exclude)

	def lower(text):
		return text.lower()

	return white_space_fix(remove_articles(remove_punc(lower(s))))

def token_f1_score(prediction, ground_truth):
	prediction_tokens = normalize_answer(prediction).split()
	ground_truth_tokens = normalize_answer(ground_truth).split()
	common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
	num_same = sum(common.values())
	if num_same == 0:
		return 0
	precision = 1.0 * num_same / len(prediction_tokens)
	recall = 1.0 * num_same / len(ground_truth_tokens)
	f1 = (2 * precision * recall) / (precision + recall)
	return f1
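
As a worked example of the token-level F1 above, the compact sketch below re-implements the same normalization steps (lowercase, strip punctuation, drop articles, split on whitespace) under the hypothetical names `normalize` and `f1`, so it runs on its own.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def f1(prediction, ground_truth):
    """Token-overlap F1 between a prediction and a reference answer."""
    pred, gold = normalize(prediction), normalize(ground_truth)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


print(f1("The sky and thunder god of Olympus.", "sky and thunder god"))
```

Here the prediction normalizes to six tokens and the reference to four, all four of which are shared: precision is 4/6, recall is 4/4, and F1 works out to 0.8. A long, sentence-style agent answer therefore loses precision even when it contains the correct span.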

In [88]:
import time

# Run the evaluation
df_results = pd.DataFrame(columns=['question', 'answer', 'LLM answer', 'F1', 'Exact Match'])

for item in questions_to_ask:
	question = item['Question']
	answer = normalize_answer(item['Answer'])

	# The prompt template already prefixes the input with "Question:".
	agent_answer = agent_executor.invoke({"input": question})
	llm_answer = normalize_answer(agent_answer['output'])
	f1_score = token_f1_score(llm_answer, answer)
	e_match_score = 1 if llm_answer == answer else 0
	
	row = pd.Series([question, answer, llm_answer, f1_score, e_match_score], index=df_results.columns)
	df_results = pd.concat([df_results, pd.DataFrame([row])], ignore_index=True)
	time.sleep(3)



> Entering new AgentExecutor chain...
Thought: I need to find information about Zeus in Greek mythology.

Action: search_tool
Action Input: "Zeus Greek mythology"
[{'title': 37, 'content': 'zeus. Zeus (British English , North American English ; , Zeús ) is the sky and thunder god in ancient Greek religion, who rules as king of the gods of Mount Olympus. His name is cognate with the first element of his Roman equivalent Jupiter. His mythologies and powers are similar, though not identical, to those of Indo-European deities such as Jupiter, Perkūnas, Perun, Indra and Thor.'}, {'title': 6, 'content': 'greek mythology. Poets and artists from ancient times to the present have derived inspiration from Greek mythology and have discovered contemporary significance and relevance in the themes. Sources. Greek mythology is known today primarily from Greek literature and representations on visual media dating from the Geometric period from to onward.'}, {'tit

RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for model `llama3-70b-8192` in organization `org_01hwv02r7vf59r3zt1t99zsn56` on tokens per minute (TPM): Limit 3000, Used 2819, Requested ~3143. Please try again in 59.227s. Visit https://console.groq.com/docs/rate-limits for more information.', 'type': 'tokens', 'code': 'rate_limit_exceeded'}}
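
The fixed `time.sleep(3)` was not enough to stay under the tokens-per-minute limit reported above. One possible mitigation, sketched here under stated assumptions, is to retry after the wait suggested in the error message. `suggested_wait` and `invoke_with_retry` are hypothetical helpers; detecting the rate limit by inspecting the exception text is a simplification, and the "try again in Ns" pattern is taken from the message shown above.

```python
import re
import time

# Matches the hint found in the error message above, e.g. "try again in 59.227s".
RETRY_HINT_RE = re.compile(r"try again in (\d+(?:\.\d+)?)s")


def suggested_wait(message, default=60.0):
    """Extract the suggested wait in seconds, or fall back to a default."""
    match = RETRY_HINT_RE.search(message)
    return float(match.group(1)) if match else default


def invoke_with_retry(call, max_attempts=5, sleep=time.sleep, margin=1.0):
    """Call `call()`; on a rate-limit error, wait and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as error:
            text = str(error)
            is_rate_limit = "429" in text or "rate limit" in text.lower()
            # Re-raise anything that is not a rate limit, or the last failure.
            if not is_rate_limit or attempt == max_attempts:
                raise
            sleep(suggested_wait(text) + margin)
```

In the evaluation loop this could wrap the agent call, for example `invoke_with_retry(lambda: agent_executor.invoke({"input": question}))`. Since the failing request was estimated at about 3143 tokens against a limit of 3000, waiting alone may not be sufficient; returning fewer or shorter passages per search may also be needed.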

In [None]:
# Print the DataFrame
print("Evaluation results:")
print(df_results)
df_results.to_csv("results.csv")

# Compute metrics
f1_mean = df_results['F1'].mean()
f1_std = df_results['F1'].std()
exact_match_mean = df_results['Exact Match'].mean()
exact_match_std = df_results['Exact Match'].std()

# Print formatted metrics
print("\nMetrics:")
print(f"F1 Score: Mean: {f1_mean:.2f}, Std Dev: {f1_std:.2f}")
print(f"Exact Match Score: Mean: {exact_match_mean:.2f}, Std Dev: {exact_match_std:.2f}")
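
The assignment also asks for the final answers to be saved as JSON, while the cell above writes a CSV. A minimal sketch of the JSON step follows; `save_answers`, `load_answers`, and the file name `react_answers.json` are hypothetical choices, and the records are assumed to be a list of plain dictionaries.

```python
import json


def save_answers(records, path="react_answers.json"):
    """Write a list of question/answer dictionaries to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, ensure_ascii=False, indent=2)


def load_answers(path="react_answers.json"):
    """Read the records back, e.g. for a later evaluation exercise."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)
```

With the DataFrame built above, the records could be obtained with `df_results.to_dict(orient="records")` and passed to `save_answers`.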