In [None]:
%%capture
!pip install llama-index==0.10.29 ragas==0.1.7 llama-index-embeddings-openai llama-index-llms-openai

In [1]:
import os
import sys
from getpass import getpass
import nest_asyncio

from dotenv import load_dotenv

sys.path.append('../helpers')

nest_asyncio.apply()

load_dotenv()

True

In [2]:
# Use .get() so a missing variable falls through to the prompt instead of raising KeyError
OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY') or getpass("Enter your OpenAI API key: ")

I'm using OpenAI here because Cohere has rate limits on its free tier. You don't need to run this code yourself if you don't want to incur costs from OpenAI. I'll upload the dataset to the Hugging Face Hub and show you how to download it from there when we need it.

In [3]:
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

llm = OpenAI(model="gpt-4-turbo-2024-04-09", temperature=0.25)

embed_model = OpenAIEmbedding(model="text-embedding-3-large")

We cleaned up our data earlier. Recall that we persisted the `Document` objects to disk using a Docstore, so that each `Document` object represents the cleaned text from one page of a book.

In [4]:
from utils import get_documents_from_docstore

documents = get_documents_from_docstore("../data/words-of-the-senpais")

# Create a set of `Documents` for the evaluation set

- 📚 **`group_documents_by_author`**: A utility function that sorts a collection of documents into groups based on who wrote them.

- 🗂️ **How It Works**: It creates a dictionary where each author's name is linked to all the documents they've written.
  - Starts with an empty dictionary ready to be filled with author-document pairs.
  - Goes through each document, checking the author's name and adding the document under the appropriate author in the dictionary.
  - If a document doesn't list an author, it skips adding that document with a warning note.

- 📝 **Input**: Takes a list of `Document` objects, each with metadata that includes the `author` field (the name of its author).

- 🔖 **Output**: Outputs a dictionary that groups all the documents by their respective authors.
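
The helper itself lives in `utils` and isn't shown here, so the following is only a minimal sketch of the logic described above, not the actual implementation. The `Doc` class is a hypothetical stand-in for a LlamaIndex `Document`, and the name `group_by_author_sketch` is made up for illustration.

```python
import warnings
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LlamaIndex `Document` (text plus metadata)."""
    text: str
    metadata: dict = field(default_factory=dict)


def group_by_author_sketch(documents):
    """Group documents by their `author` metadata, skipping any without one."""
    grouped = defaultdict(list)
    for doc in documents:
        author = doc.metadata.get("author")
        if not author:
            # Assumption: documents with no author are skipped with a warning
            warnings.warn("Skipping a document with no author in its metadata.")
            continue
        grouped[author].append(doc)
    return dict(grouped)


docs = [
    Doc("<page text>", {"author": "Seneca"}),
    Doc("<page text>", {"author": "Paul Graham"}),
    Doc("<page text>", {"author": "Seneca"}),
    Doc("<page text>", {}),  # no author, so it is skipped
]
grouped_demo = group_by_author_sketch(docs)
print({author: len(items) for author, items in grouped_demo.items()})
```

A plain dictionary keyed by author keeps the next step (sampling per author) a simple loop over `items()`.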
  

In [5]:
import random
from utils import group_documents_by_author

random.seed(42)

documents_by_author = group_documents_by_author(documents)

- 📚 **`sample_documents`**: Picks a set number of documents randomly from each author's collection within a grouped dictionary.

- 🎲 **Sampling Logic**: It tries to get a specific number of documents for each author. If an author doesn't have enough documents, it alerts you.
  - Begins with an empty list for storing selected samples.
  - Loops through each author, considers only documents longer than 500 characters, and checks whether enough of them remain to fulfill the sampling requirement.
  - Randomly selects the desired number of documents from those available, adding them to the overall sample list.
  - Issues a warning if the documents under an author are too few to meet the sampling number.

- 📝 **Input**: Receives a dictionary where authors are keys and values are lists of their documents, along with an optional number of documents to sample per author.

- 🔖 **Output**: Outputs a list of randomly chosen documents from across all authors, sticking to the specified number per author when possible.
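
As with the grouping helper, the real `sample_documents` is in `utils`. The sketch below only illustrates the sampling logic described above under a few loudly stated assumptions: documents are represented as plain strings, the default of 10 per author is inferred from the sanity check later in this notebook, and an author with too few eligible documents contributes whatever they have. The function name and parameters are hypothetical.

```python
import random
import warnings


def sample_documents_sketch(docs_by_author, num_samples=10, min_chars=500, seed=None):
    """Randomly pick up to `num_samples` sufficiently long texts per author."""
    rng = random.Random(seed)
    sampled = []
    for author, texts in docs_by_author.items():
        # Only texts longer than `min_chars` characters are eligible
        eligible = [t for t in texts if len(t) > min_chars]
        if len(eligible) < num_samples:
            # Assumption: warn, then fall back to taking everything available
            warnings.warn(
                f"{author!r} has only {len(eligible)} eligible documents "
                f"(wanted {num_samples})."
            )
        sampled.extend(rng.sample(eligible, min(num_samples, len(eligible))))
    return sampled


demo = {
    "Seneca": ["x" * 600, "y" * 700, "z" * 100],  # two eligible texts
    "Paul Graham": ["a" * 800],                   # one eligible text
}
picked = sample_documents_sketch(demo, num_samples=2, seed=42)
print(len(picked))
```

Using a dedicated `random.Random(seed)` instance keeps the sketch reproducible without touching the global random state, whereas the notebook seeds the global generator with `random.seed(42)`.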

In [6]:
from utils import sample_documents

docs_for_eval_set = sample_documents(documents_by_author)

In [7]:
exclude_metadata_keys = ['file_name', 'page_number']

for doc in docs_for_eval_set:
    doc.excluded_llm_metadata_keys = exclude_metadata_keys
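
Excluding these keys means the LLM sees the author but not the file name or page number when a document is rendered with `MetadataMode.LLM`. The sketch below imitates that rendering with plain Python. It assumes the default LlamaIndex layout of `key: value` metadata lines followed by a blank line and the text; `render_for_llm`, the file name, and the page number are hypothetical.

```python
def render_for_llm(text, metadata, excluded_keys):
    """Imitate how a document's content might be rendered for the LLM."""
    # Keep only the metadata entries that are not excluded
    visible = [f"{key}: {value}" for key, value in metadata.items() if key not in excluded_keys]
    return "\n".join(visible) + "\n\n" + text


rendered = render_for_llm(
    "<cleaned text of one page>",
    {"author": "Seneca", "file_name": "example.pdf", "page_number": 12},
    excluded_keys=["file_name", "page_number"],
)
print(rendered)
```

Keeping `author` visible matters later, because the custom prompts ask the model to take the identity of the thinker into account.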

# Perform a sanity check

In [8]:
from collections import Counter

def count_documents_by_author(documents):
    """
    Count the number of documents each author has in a list of document objects.

    :param documents: List of document objects with metadata containing 'author'.
    :return: A Counter object with authors as keys and counts of their documents as values.
    """
    # Extract the author from each document's metadata and count occurrences
    author_counts = Counter(doc.metadata['author'] for doc in documents if 'author' in doc.metadata)
    return author_counts

author_counts = count_documents_by_author(docs_for_eval_set)
for author, count in author_counts.items():
    print(f"Author '{author}' has {count} documents.")

Author 'Naval Ravikant' has 10 documents.
Author 'Balaji Srinivasan' has 10 documents.
Author 'Paul Graham' has 10 documents.
Author 'Nassim Nicholas Taleb' has 10 documents.
Author 'Seneca' has 10 documents.


We'll use an in-memory vector database to create an index.

In [9]:
from utils import setup_vector_store, create_index

vector_store = setup_vector_store(qdrant_url=":memory:", qdrant_api_key=None, collection_name="test-set-generation")

index = create_index(vector_store=vector_store, documents=docs_for_eval_set)

# LlamaIndex has a built-in dataset generator

The `RagDatasetGenerator` automatically generates a dataset for RAG evaluation. 

This dataset is created from a set of documents to query. The generated dataset will have a diverse set of queries and their corresponding responses, which we'll use to evaluate the performance of various RAG strategies.

You can look at the [source code](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/llama_dataset/generator.py) for more details.

In [10]:
from llama_index.core.llama_dataset.generator import RagDatasetGenerator

### Question generation prompt

If you look at the source code, you'll see that the `RagDatasetGenerator` takes an optional string argument called `question_gen_query`. This string is used to generate questions from the document context. It instructs the model to act as a teacher/professor and create diverse questions based on the provided context. The default value is:

```text
You are a Teacher/Professor. Your task is to setup 2 questions for an upcoming quiz/examination. The questions should be diverse in nature across the document. Restrict the questions to the context information provided.
```

You can verify this by running:

```python
print(dataset_generator.question_gen_query)
```

This is a good starting point for building an evaluation dataset, but we can be more specific for our use case. The use case I have in mind is drawing on the wisdom and knowledge of my virtual mentors. I want to come to them with questions whenever I need advice, so we should create an evaluation dataset from that perspective.

In [11]:
from prompts import QUESTION_GEN_QUERY

print(QUESTION_GEN_QUERY)

You're a high-performer seeking wisdom and advice. You've aggregated the writings of these influential thinkers:

- Naval Ravikant: Known for his insights on how to build wealth and achieve happiness through developing specific knowledge, embracing accountability, playing long-term games, and understanding the power of compound interest in all areas of life.

- Balaji Srinivasan: Has insights on how to think independently, identify opportunities, and build a better future through the strategic application of technology and clear reasoning.

- Paul Graham: Provides advice on the hacker mindset, arguing that hackers are really makers and creators - akin to painters - who can leverage their unique way of thinking to push boundaries, challenge the status quo, and shape the future through technology and entrepreneurship.

- Nassim Nicholas Taleb: Argues for "Skin in the Game", that is having a personal stake in the outcome is necessary for fairness as it aligns incentives and exposes indivi


There are also a couple of other prompts that are used to generate an evaluation dataset.

The `text_question_template` and `text_qa_template` are prompt templates to generate questions and question-answer pairs, respectively. If not provided, default templates are used:


In [12]:
from utils import display_prompt_dict
display_prompt_dict(RagDatasetGenerator(None).get_prompts())

 **Prompt Key**: text_question_template
**Text:**
```
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge.
generate only questions based on the below query.
{query_str}

```

**Prompt Key**: text_qa_template
**Text:**
```
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer: 
```
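
Both templates have two placeholders, `{context_str}` and `{query_str}`. As a rough illustration of how they get filled in, the sketch below uses plain `str.format` on a copy of the default question template. This is not LlamaIndex's own code; it assumes the generator substitutes the document content for `{context_str}` and the `question_gen_query` for `{query_str}`. The context and query values are placeholders.

```python
# Copy of the default text_question_template shown above
QUESTION_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge.\n"
    "generate only questions based on the below query.\n"
    "{query_str}\n"
)

filled = QUESTION_TEMPLATE.format(
    # Placeholder context: author metadata plus the page text
    context_str="author: Seneca\n\n<cleaned text of one page>",
    # Placeholder query: this is where the question generation query would go
    query_str="<question_gen_query goes here>",
)
print(filled)
```

Any custom template therefore needs to keep both placeholders, otherwise the context or the query never reaches the model.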



## Let's implement some custom prompts

In [13]:
from llama_index.core.prompts.base import PromptTemplate
from prompts import TEXT_QUESTION_STR

text_question_template = PromptTemplate(TEXT_QUESTION_STR)

print(text_question_template.get_template())

You're seeking advice on navigating complex decisions, building wealth and happiness, identifying opportunities, embracing innovation, and living a meaningful life from one of the following influential thinkers:

- Naval Ravikant: Known for his insights on how to build wealth and achieve happiness through developing specific knowledge, embracing accountability, playing long-term games, and understanding the power of compound interest in all areas of life.

- Balaji Srinivasan: Has insights on how to think independently, identify opportunities, and build a better future through the strategic application of technology and clear reasoning.

- Paul Graham: Provides advice on the hacker mindset, arguing that hackers are really makers and creators - akin to painters - who can leverage their unique way of thinking to push boundaries, challenge the status quo, and shape the future through technology and entrepreneurship.

- Nassim Nicholas Taleb: Argues for "Skin in the Game", that is having a

In [14]:
from prompts import TEXT_QA_STRING

text_qa_template = PromptTemplate(TEXT_QA_STRING)

print(text_qa_template.get_template())

You're provided with context, which is the writing from of the following influential thinkers:

- Naval Ravikant: Naval would advise you to focus on building unique knowledge, taking ownership of your actions, playing long-term games with compounding in mind, and aligning your pursuits with genuine passion.

- Balaji Srinivasan: Balaji would encourage you to think critically and independently, leverage technology strategically, and develop a clear vision for the change you want to see.

- Paul Graham: Paul would suggest embracing the hacker mindset – a way of thinking that values problem-solving, building, and continuous learning to push boundaries and challenge conventions.

- Nassim Nicholas Taleb: Nassim would emphasize the importance of having "skin in the game," ensuring your incentives align with the outcomes and you are exposed to both the rewards and consequences of your choices.

- Seneca: Seneca would advise focusing on essential things, mastering your emotions through Stoic 

In [15]:
from llama_index.core.schema import MetadataMode

dataset_generator = RagDatasetGenerator(
    nodes=docs_for_eval_set,
    metadata_mode=MetadataMode.LLM,
    workers=os.cpu_count(),
    num_questions_per_chunk=2,
    show_progress=True,
    llm=llm,
    text_question_template=text_question_template,
    text_qa_template=text_qa_template,
    question_gen_query=QUESTION_GEN_QUERY,
)

In [16]:
display_prompt_dict(dataset_generator.get_prompts())

 **Prompt Key**: text_question_template
**Text:**
```
You're seeking advice on navigating complex decisions, building wealth and happiness, identifying opportunities, embracing innovation, and living a meaningful life from one of the following influential thinkers:

- Naval Ravikant: Known for his insights on how to build wealth and achieve happiness through developing specific knowledge, embracing accountability, playing long-term games, and understanding the power of compound interest in all areas of life.

- Balaji Srinivasan: Has insights on how to think independently, identify opportunities, and build a better future through the strategic application of technology and clear reasoning.

- Paul Graham: Provides advice on the hacker mindset, arguing that hackers are really makers and creators - akin to painters - who can leverage their unique way of thinking to push boundaries, challenge the status quo, and shape the future through technology and entrepreneurship.

- Nassim Nicholas Taleb: Argues for "Skin in the Game", that is having a personal stake in the outcome is necessary for fairness as it aligns incentives and exposes individuals to both the potential rewards and risks of their decisions.

- Seneca: Offers timeless advice on how to cultivate wisdom, build mental resilience, and live a life of purpose and contentment by focusing on what is essential, mastering one's emotions, and aligning oneself with nature.

Context information is below, incuding:
---------------------
{context_str}
---------------------

Given the context information and the identity of the thinker that authored it, generate questions that seek actionable advice or practical tips that you can apply in your own life. 

Restrict the questions to the context information provided and ask the questions from the first-person perspective. 

There is no need to address the thinker directly, they know you're speaking to them specifically. Ask the question directly and as succinctly as possible.

{query_str}

```

**Prompt Key**: text_qa_template
**Text:**
```
You're provided with context, which is the writing from of the following influential thinkers:

- Naval Ravikant: Naval would advise you to focus on building unique knowledge, taking ownership of your actions, playing long-term games with compounding in mind, and aligning your pursuits with genuine passion.

- Balaji Srinivasan: Balaji would encourage you to think critically and independently, leverage technology strategically, and develop a clear vision for the change you want to see.

- Paul Graham: Paul would suggest embracing the hacker mindset – a way of thinking that values problem-solving, building, and continuous learning to push boundaries and challenge conventions.

- Nassim Nicholas Taleb: Nassim would emphasize the importance of having "skin in the game," ensuring your incentives align with the outcomes and you are exposed to both the rewards and consequences of your choices.

- Seneca: Seneca would advise focusing on essential things, mastering your emotions through Stoic principles, and living in accordance with nature and reason.

Context information is below.
---------------------
{context_str}
---------------------

Given the context information and information about who authored it, answer the query. 

When providing advice, speak directly to the asker and use the "you" voice (second-person perspective). There is no need to identify yourself in your response, the asker knows who you are. 

Query: {query_str}
Answer:

```





**Note:** This took ~17 minutes to run and cost ~$7 USD in OpenAI calls.

In [17]:
rag_dataset = dataset_generator.generate_dataset_from_nodes()

100%|██████████| 50/50 [00:17<00:00,  2.83it/s]
  0%|          | 0/3 [00:00<?, ?it/s]

In [None]:
rag_dataset.save_json("rag_dataset.json")

# Using ragas

In [None]:
from ragas.testset.generator import TestsetGenerator

In [None]:
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# generator with openai models
generator_llm = ChatOpenAI(model="gpt-3.5-turbo")
critic_llm = ChatOpenAI(model="gpt-3.5-turbo")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm=generator_llm,
    critic_llm=critic_llm,
    embeddings=embeddings,
)

distributions = {
    simple: 0.3,
    multi_context: 0.5,
    reasoning: 0.2
}

generator_llamaindex = generator.generate_with_llamaindex_docs(
    documents=docs_for_eval_set,
    distributions=distributions,
    test_size=50
)
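
The `distributions` dictionary assigns a weight to each question type, and as far as I can tell ragas expects the weights to sum to 1. The sketch below is plain Python, not ragas code: it uses strings as hypothetical stand-ins for the `simple`, `multi_context`, and `reasoning` evolutions and works out roughly how many of the 50 questions each type should get. The actual counts may differ if some generations fail.

```python
import math

# String keys are stand-ins for the ragas evolution objects
distributions_demo = {"simple": 0.3, "multi_context": 0.5, "reasoning": 0.2}
test_size = 50

# Check the weights form a valid distribution before spending money on API calls
if not math.isclose(sum(distributions_demo.values()), 1.0):
    raise ValueError("Distribution weights should sum to 1.0")

# Approximate number of questions per type
expected_counts = {name: round(weight * test_size) for name, weight in distributions_demo.items()}
print(expected_counts)
```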


In [None]:
generator_llamaindex.to_dataset()[:10]