# Hyperparameter Optimization

This week we will use [Optuna](https://optuna.org/), a library that makes finding the best hyperparameters easy.

We will use it to discover the best approach for chunking documents and indexing the chunks.

In [1]:
%load_ext autoreload
%autoreload 2
%load_ext dotenv
%dotenv

In [2]:
from collections import defaultdict
import os
import re

import chromadb
from llama_index.core import Document, VectorStoreIndex, set_global_handler
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser,
    MarkdownNodeParser,
)
from llama_index.core.vector_stores.types import VectorStoreQueryMode
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.vector_stores.milvus import MilvusVectorStore
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.embeddings.voyageai import VoyageEmbedding
import nest_asyncio
import optuna
import pandas as pd
from pymilvus import MilvusClient
from qdrant_client import QdrantClient, AsyncQdrantClient

In [3]:
# configure
filename = 'everdell.md'
qa_filename = 'everdell-selected.csv'
ngram_size = 2  # use 2 instead of 3 so we don't skip 2-word header chunks
f_beta = 3      # weight recall 3 times as important as precision in f-score
n_trials = 25   # number of Optuna trials

pd.set_option('display.max_colwidth', None)

## Define a few helper functions

In [5]:
def generate_ngrams_from_text(text, ngram_size=3):
    """
    Generate ngrams from a specified text string.
    
    An ngram is a sequence of n words in a row. 
    For example, if ngram_size=3 and the text was "You can not play the ranger.",
    this would result in the following list of ngrams: 
    (you, can, not), (can, not, play), (not, play, the), (play, the, ranger).
    """
    
    # Lowercase and replace non-alphanumeric characters with spaces
    cleaned_text = re.sub(r'[^a-z0-9\s]', ' ', text.lower())
    
    # Split text into words
    words = cleaned_text.split()
    
    # Generate ngrams
    return [tuple(words[i:i+ngram_size]) for i in range(len(words)+1-ngram_size)]

In [6]:
def generate_ngrams_from_texts(texts, ngram_size=3):
    """Generate all ngrams from a list of texts."""
    
    all_ngrams = []
    for text in texts:
        ngrams = generate_ngrams_from_text(text, ngram_size=ngram_size)
        all_ngrams.extend(ngrams)
        
    return all_ngrams

In [7]:
def precision_recall(predicted_ngrams, true_ngrams):
    """
    Return the precision and recall of the predicted ngrams measured against the true ngrams.
    Both lists are converted to sets first, so duplicate ngrams are counted only once.
    """
    
    # Convert lists to sets for easier comparison
    predicted_set = set(predicted_ngrams)
    true_set = set(true_ngrams)
    
    # Calculate true positives, false positives, and false negatives
    true_positives = len(predicted_set & true_set)
    false_positives = len(predicted_set - true_set)
    false_negatives = len(true_set - predicted_set)
    
    # Calculate precision and recall
    precision = true_positives / (true_positives + false_positives) if true_positives + false_positives > 0 else 0
    recall = true_positives / (true_positives + false_negatives) if true_positives + false_negatives > 0 else 0
    
    return precision, recall

In [8]:
def f_score(precision, recall, beta=1.0):
    """
    Calculate the F-score (weighted harmonic mean) of a given precision and recall,
    with an option to weight recall higher.
    
    The F-score collapses precision and recall into a single number, so we can say
    when one (precision, recall) pair is better than another. Setting beta greater
    than 1 gives more weight to recall than to precision.
    
    Parameters:
    precision (float): Precision of the model
    recall (float): Recall of the model
    beta (float): Weight of recall in the harmonic mean (default is 1.0, which means F1 score)
    
    Returns:
    float: The F(beta) score
    """
    if precision + recall == 0:
        return 0.0
    beta_squared = beta ** 2
    return (1 + beta_squared) * (precision * recall) / (beta_squared * precision + recall)
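
To see what a beta above 1 does in practice, here is a small self-contained sketch. It restates the same formula under the hypothetical name `f_beta` so it can run on its own, and the two (precision, recall) pairs are made-up values chosen only for illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """Same formula as f_score above, restated so this snippet runs standalone."""
    if precision + recall == 0:
        return 0.0
    beta_squared = beta ** 2
    return (1 + beta_squared) * (precision * recall) / (beta_squared * precision + recall)


# Two illustrative retrieval results (made-up numbers)
high_recall = (0.2, 0.8)     # lots of extra text, but most of the quote was found
high_precision = (0.8, 0.2)  # little extra text, but most of the quote was missed

for beta in (1, 3):
    score_recall = f_beta(*high_recall, beta=beta)
    score_precision = f_beta(*high_precision, beta=beta)
    print(f'beta={beta}: high-recall={score_recall:.3f}, high-precision={score_precision:.3f}')
```

With beta=1 the two results tie, while with beta=3 the high-recall result scores clearly higher, which is the behaviour we want when missing part of the manual quote is worse than retrieving some extra text.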

## Read question-answers and generate ngrams from manual quotes

The question-answers file has been augmented by a human to include the sentences/paragraphs from the manual that are needed (necessary and sufficient) to answer each question.

To evaluate the quality of a list of chunks retrieved from an index, we want to compare the sentences/paragraphs in the chunks against the sentences/paragraphs specified by the human in the question-answer file.

To do the comparison we can't simply check for equality, because the retrieved chunk may only overlap part of the human-specified sentence/paragraph. So we generate *ngrams* for the retrieved chunks and the human-specified sentence/paragraphs, and compare how many ngrams they have in common using the standard precision and recall metrics.
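
As a rough worked example of this comparison, the sketch below computes bigram overlap between one human-specified quote and one retrieved chunk. The `bigrams` helper is a hypothetical, condensed version of the ngram helper defined earlier, and the chunk text is invented for illustration.

```python
import re


def bigrams(text):
    """Lowercase, strip punctuation, and return the set of 2-word ngrams."""
    words = re.sub(r'[^a-z0-9\s]', ' ', text.lower()).split()
    return {tuple(words[i:i + 2]) for i in range(len(words) - 1)}


# A sentence from the manual, and an invented chunk that only partly overlaps it
quote = "Event cards do not count against this 15 card limit."
chunk = "Each card takes up one space. Event cards do not count."

true_set = bigrams(quote)        # 9 distinct bigrams
predicted_set = bigrams(chunk)   # 10 distinct bigrams
overlap = true_set & predicted_set

precision = len(overlap) / len(predicted_set)  # share of the chunk that was wanted
recall = len(overlap) / len(true_set)          # share of the quote that was retrieved
print(len(overlap), precision, recall)
```

The chunk shares 4 bigrams with the quote, so precision is 4/10 and recall is 4/9, even though neither string contains the other.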

In [9]:
# read question-answers
qa_df = pd.read_csv(f'data/{qa_filename}', na_filter=False)
print(len(qa_df))
qa_df.head(3)

200


| | url | question | answer | manual quote 1 | manual quote 2 | manual quote 3 |
|---|---|---|---|---|---|---|
| 0 | https://boardgamegeek.com/thread/2440267/first-couple-playthroughs-gathered-questions | What is the die for? | It’s for the solo game. | solo rules | To play Rugwort's card, roll the 8-sided die | |
| 1 | https://boardgamegeek.com/thread/2017107/new-question-about-dungeon-card | Let's say I have 15 cards in my city, and I have the Dungeon and someone already in the one cell. Can I play the Ranger to unlock the second cell, putting a different Critter already in my city into the now-unlocked second cell? | You could not play the Ranger as it would get you past the city's limit. However if you have Dungeon with one prisoner and Ranger in your city and you are at 15 played slots, you could use the Ranger's power to get an existing Critter into the 2nd cell. | "Your city has a maximum of 15 spaces \nto play cards into. Each card takes up one \nspace. Recommended layout is 3 rows with \n5 cards in each. Event cards do not count \nagainst this 15 card limit." | | |
| 2 | https://boardgamegeek.com/thread/2017107/new-question-about-dungeon-card | Can I play a NEW card when all 15 seats in the city are occupied, but there is a free cell in the Dungeon? | Yes, this could be done. | "Your city has a maximum of 15 spaces \nto play cards into. Each card takes up one \nspace. Recommended layout is 3 rows with \n5 cards in each. Event cards do not count \nagainst this 15 card limit." | | |


In [10]:
# NOTE: ideally we would hold out the *test* questions here,
# but people are still adding manual quotes and only a few
# questions have them so far, so this demo uses all of them.

# keep only rows with at least 1 manual quote
qa_df = qa_df[qa_df['manual quote 1'].notna() & (qa_df['manual quote 1'] != '')]
print(len(qa_df))

42


In [11]:
# generate bigrams (ngram size=2) for each manual quote
# and store them in the question_ngrams dictionary
question_ngrams = {}
for _, row in qa_df.iterrows():
    question = row['question']
    all_ngrams = []
    for column in qa_df.columns:
        if column in ['url', 'question', 'answer']:
            continue
        text = row[column]
        if not text:
            continue
        all_ngrams.extend(generate_ngrams_from_text(text, ngram_size=ngram_size))
    if len(all_ngrams) == 0:
        print('ERROR: no ngrams in quotes for', question)
        continue
    question_ngrams[question] = all_ngrams
print(len(question_ngrams))

42


## Read the document

In [12]:
# load document
documents = []
with open(f'data/{filename}', 'r', encoding='utf-8') as file:
    document = Document(
        text = file.read(),
        metadata = {"filename": filename},
    )
    # add the document to a single-entry documents list that we will use below
    documents.append(document)
print(len(documents[0].text))

21237


## Optimize hyperparameters by creating an index and evaluating the retrieved chunks

Creating an index involves a sequence of steps (a pipeline). Each step is configured using hyperparameters:
- split each document into chunks
- add metadata - e.g., document title, summary of previous and next chunks, pointer to parent chunk
- add an embedding (vector) - decide whether you want the embedding to include chunk metadata or just the text
- index the chunk - choose a vector store and index the embeddings, keywords, or both

Evaluating the retrieved chunks then takes two steps:
- issue the queries
- compare the ngrams in the retrieved chunks to the ngrams in the human-specified sentences/paragraphs

In [13]:
def objective(trial):
    """
    This function is called by Optuna. It creates an index, runs queries over the index, 
    calculates the precision and recall of the results, and returns the average f-score.
    """

    #
    # Define hyperparameters
    # 
    
    # define embedder
    embed_model_name = trial.suggest_categorical('embed_model', [
        'text-embedding-3-small', 
        'voyage-large-2-instruct',
        # GPU runs out of memory with gte model
        # 'Alibaba-NLP/gte-large-en-v1.5',  # WARNING: this downloads 1.74G of data
    ])
    if embed_model_name == 'text-embedding-3-small':
        embed_model = OpenAIEmbedding(
            model=embed_model_name,
            embed_batch_size=10,
            max_retries=25,
            timeout=180,
            reuse_client=False,
        )
    elif embed_model_name == 'voyage-large-2-instruct':
        embed_model = VoyageEmbedding(
            model_name=embed_model_name,
        )
    elif embed_model_name == 'Alibaba-NLP/gte-large-en-v1.5':
        embed_model = HuggingFaceEmbedding(model_name=embed_model_name, trust_remote_code=True)
        
    # define splitter
    splitter_name = trial.suggest_categorical('splitter', [
        'sentence', 
        'semantic',
        'markdown',
    ])
    if splitter_name == 'sentence':
        chunk_size = trial.suggest_int('chunk_size', 256, 1024)
        chunk_overlap = trial.suggest_int('chunk_overlap', 0, 200)
        splitter = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    elif splitter_name == 'semantic':
        buffer_size = trial.suggest_int('buffer_size', 1, 3)
        breakpoint_percentile_threshold = trial.suggest_int('breakpoint_percentile_threshold', 60, 95)
        include_prev_next_rel = trial.suggest_categorical('include_prev_next_rel', [True, False])
        splitter = SemanticSplitterNodeParser(
            buffer_size=buffer_size, 
            breakpoint_percentile_threshold=breakpoint_percentile_threshold, 
            include_prev_next_rel=include_prev_next_rel,
            embed_model=embed_model,
        )
    elif splitter_name == 'markdown':
        include_prev_next_rel = trial.suggest_categorical('include_prev_next_rel', [True, False])
        splitter = MarkdownNodeParser(
            include_prev_next_rel=include_prev_next_rel,
        )

    # add metadata
    ## nothing for now

    # define index
    query_mode = VectorStoreQueryMode.DEFAULT
    index_type = trial.suggest_categorical('index', [
        'chromadb',
        'qdrant',
        # Milvus cloud doesn't support hybrid indices yet
        # 'milvus',  # need to create a (free) account at https://cloud.zilliz.com/ 
                   # and add MILVUS_URI=your public endpoint and MILVUS_TOKEN=your token (api key) to your .env file
    ])
    if index_type == 'chromadb':
        chroma_client = chromadb.EphemeralClient()
        # delete collection if it exists
        if any(coll.name == 'test' for coll in chroma_client.list_collections()):
            chroma_client.delete_collection('test')        
        chroma_collection = chroma_client.create_collection('test')
        vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
    elif index_type == 'qdrant':
        query_mode = VectorStoreQueryMode.HYBRID
        client = QdrantClient(location=':memory:')
        # delete collection if it exists
        if client.collection_exists('test'):
            client.delete_collection('test')        
        # create our vector store with hybrid indexing enabled
        # batch_size controls how many nodes are encoded with sparse vectors at once
        # hybrid uses Splade v1 for sparse vectors
        vector_store = QdrantVectorStore(
            collection_name='test',
            client=client,
            enable_hybrid=True,
            batch_size=20,
        )
    elif index_type == 'milvus':
        query_mode = VectorStoreQueryMode.HYBRID
        milvus_k = trial.suggest_int('milvus_k', 40, 80)
        # delete collection if it exists
        client = MilvusClient(uri=os.environ['MILVUS_URI'], token=os.environ['MILVUS_TOKEN'])
        if client.has_collection('test'):
            client.drop_collection(collection_name='test')
        client.close()
        # hybrid uses BGE-M3 for sparse vectors
        vector_store = MilvusVectorStore(
            uri=os.environ['MILVUS_URI'], 
            token=os.environ['MILVUS_TOKEN'],
            collection_name='test',
            dim=len(embed_model.get_text_embedding('foo')),
            overwrite=True,
            enable_sparse=True,
            hybrid_ranker='RRFRanker',
            hybrid_ranker_params={'k': milvus_k},
        )

    # define top_k
    top_k = trial.suggest_int('top_k', 2, 5)
    sparse_top_k = top_k * 5

    # 
    # Use hyperparameters to split documents into chunks, generate embeddings, and insert into an index
    #
    
    # create a simple ingestion pipeline: chunk the documents and create embeddings
    pipeline = IngestionPipeline(transformations=[
        splitter,
        embed_model,
    ])

    # create an index from the vector store
    index = VectorStoreIndex.from_vector_store(
        vector_store,
        embed_model=embed_model,
    )

    # run the pipeline to generate nodes
    nodes = pipeline.run(documents=documents)
    # print('nodes', len(nodes))

    # add the nodes to the index
    index.insert_nodes(nodes)

    # assert all nodes have been indexed
    assert len(index.as_retriever(similarity_top_k=len(nodes)).retrieve('foo')) == len(nodes)

    # create a retriever from the index
    retriever = index.as_retriever(
        vector_store_query_mode=query_mode,
        similarity_top_k=top_k,
        sparse_top_k=sparse_top_k,
    )

    #
    # Evaluate the quality of the chunks retrieved from the index for the sample questions
    #
    
    # issue all questions and calculate the f-score on the retrieved chunks
    f_scores = []
    for question, true_ngrams in question_ngrams.items():
        response = retriever.retrieve(question)
        # print([node.id_ for node in response])
        predicted_ngrams = generate_ngrams_from_texts([node.text for node in response], ngram_size=ngram_size)
        precision, recall = precision_recall(predicted_ngrams, true_ngrams)
        score = f_score(precision, recall, beta=f_beta)
        f_scores.append(score)
    avg_f_score = sum(f_scores) / len(f_scores)

    # return the average f-score
    return avg_f_score
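
The Milvus branch above tunes `milvus_k`, which is passed as the `k` parameter of an `RRFRanker` (reciprocal rank fusion). As a rough illustration of what that constant controls, here is a generic standard-library sketch of reciprocal rank fusion. It is not the Milvus or LlamaIndex implementation, and the function name and document ids are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """
    Fuse several ranked lists of ids into one list.
    Each id scores 1 / (k + rank) in every list it appears in (rank starts at 1);
    a larger k flattens the difference between top and lower ranks.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense_results = ['a', 'b', 'c']   # hypothetical ids ranked by embedding similarity
sparse_results = ['c', 'a', 'd']  # hypothetical ids ranked by keyword match
fused = reciprocal_rank_fusion([dense_results, sparse_results], k=60)
print(fused)
```

Chunks that appear in both lists (`a` and `c` here) rise to the top, which is the point of combining dense and sparse retrieval in hybrid mode.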

In [14]:
# ask Optuna to find the best hyperparameters

study_name = 'test'  # Unique identifier of the study.
storage_name = f"sqlite:///optuna-{study_name}.db"
print(f"To see a dashboard, open a terminal, activate the virtual environment, and run: optuna-dashboard {storage_name}")
study = optuna.create_study(
    study_name=study_name, 
    storage=storage_name,
    load_if_exists=True,
    direction='maximize',
)
study.optimize(objective, n_trials=n_trials)

study.best_params

To see a dashboard, open a terminal, activate the virtual environment, and run: optuna-dashboard sqlite:///optuna-test.db


[I 2024-06-23 18:19:30,385] A new study created in RDB with name: test
[I 2024-06-23 18:20:30,617] Trial 0 finished with value: 0.21598551827159682 and parameters: {'embed_model': 'text-embedding-3-small', 'splitter': 'markdown', 'include_prev_next_rel': True, 'index': 'chromadb', 'top_k': 4}. Best is trial 0 with value: 0.21598551827159682.
[I 2024-06-23 18:21:27,156] Trial 1 finished with value: 0.25620708209243037 and parameters: {'embed_model': 'voyage-large-2-instruct', 'splitter': 'semantic', 'buffer_size': 2, 'breakpoint_percentile_threshold': 80, 'include_prev_next_rel': True, 'index': 'chromadb', 'top_k': 3}. Best is trial 1 with value: 0.25620708209243037.
[I 2024-06-23 18:22:00,417] Trial 2 finished with value: 0.24174347524532452 and parameters: {'embed_model': 'voyage-large-2-instruct', 'splitter': 'markdown', 'include_prev_next_rel': False, 'index': 'chromadb', 'top_k': 2}. Best is trial 1 with value: 0.25620708209243037.


WARNI [root] Payload indexes have no effect in the local Qdrant. Please use server Qdrant if you need payload indexes.
[I 2024-06-23 18:22:42,762] Trial 3 finished with value: 0.2548235158629432 and parameters: {'embed_model': 'voyage-large-2-instruct', 'splitter': 'sentence', 'chunk_size': 285, 'chunk_overlap': 123, 'index': 'qdrant', 'top_k': 4}. Best is trial 1 with value: 0.25620708209243037.
[I 2024-06-23 18:25:10,278] Trial 4 finished with value: 0.30502262085574366 and parameters: {'embed_model': 'text-embedding-3-small', 'splitter': 'semantic', 'buffer_size': 2, 'breakpoint_percentile_threshold': 61, 'include_prev_next_rel': False, 'index': 'chromadb', 'top_k': 5}. Best is trial 4 with value: 0.30502262085574366.


WARNI [root] Payload indexes have no effect in the local Qdrant. Please use server Qdrant if you need payload indexes.
[I 2024-06-23 18:26:14,438] Trial 5 finished with value: 0.2261776199930114 and parameters: {'embed_model': 'voyage-large-2-instruct', 'splitter': 'markdown', 'include_prev_next_rel': False, 'index': 'qdrant', 'top_k': 2}. Best is trial 4 with value: 0.30502262085574366.


WARNI [root] Payload indexes have no effect in the local Qdrant. Please use server Qdrant if you need payload indexes.
[I 2024-06-23 18:27:14,207] Trial 6 finished with value: 0.2261776199930114 and parameters: {'embed_model': 'voyage-large-2-instruct', 'splitter': 'markdown', 'include_prev_next_rel': True, 'index': 'qdrant', 'top_k': 2}. Best is trial 4 with value: 0.30502262085574366.
[I 2024-06-23 18:27:48,620] Trial 7 finished with value: 0.17283256284527276 and parameters: {'embed_model': 'voyage-large-2-instruct', 'splitter': 'sentence', 'chunk_size': 721, 'chunk_overlap': 151, 'index': 'chromadb', 'top_k': 3}. Best is trial 4 with value: 0.30502262085574366.


WARNI [root] Payload indexes have no effect in the local Qdrant. Please use server Qdrant if you need payload indexes.
[I 2024-06-23 18:28:50,380] Trial 8 finished with value: 0.1476844907886592 and parameters: {'embed_model': 'voyage-large-2-instruct', 'splitter': 'sentence', 'chunk_size': 1022, 'chunk_overlap': 123, 'index': 'qdrant', 'top_k': 3}. Best is trial 4 with value: 0.30502262085574366.


WARNI [root] Payload indexes have no effect in the local Qdrant. Please use server Qdrant if you need payload indexes.
[I 2024-06-23 18:30:11,159] Trial 9 finished with value: 0.20464838672264968 and parameters: {'embed_model': 'voyage-large-2-instruct', 'splitter': 'sentence', 'chunk_size': 393, 'chunk_overlap': 67, 'index': 'qdrant', 'top_k': 5}. Best is trial 4 with value: 0.30502262085574366.
[I 2024-06-23 18:33:40,489] Trial 10 finished with value: 0.2845149991991019 and parameters: {'embed_model': 'text-embedding-3-small', 'splitter': 'semantic', 'buffer_size': 2, 'breakpoint_percentile_threshold': 61, 'include_prev_next_rel': False, 'index': 'chromadb', 'top_k': 5}. Best is trial 4 with value: 0.30502262085574366.
[I 2024-06-23 18:35:25,866] Trial 11 finished with value: 0.2992974061018135 and parameters: {'embed_model': 'text-embedding-3-small', 'splitter': 'semantic', 'buffer_size': 2, 'breakpoint_percentile_threshold': 62, 'include_prev_next_rel': False, 'index': 'chromadb', 

{'embed_model': 'text-embedding-3-small',
 'splitter': 'semantic',
 'buffer_size': 1,
 'breakpoint_percentile_threshold': 60,
 'include_prev_next_rel': False,
 'index': 'chromadb',
 'top_k': 4}