# Hyperparameter Optimization

This week we will use [Optuna](https://optuna.org/), a library that automates the search for good hyperparameters.

We will use it to discover the best approach for chunking documents and indexing the chunks.


In [None]:
%load_ext autoreload
%autoreload 2
%load_ext dotenv
%dotenv


In [None]:
from collections import defaultdict
import os
import re

from llama_index.core import Document, VectorStoreIndex, set_global_handler
import optuna
import pandas as pd

# project helpers: the Optuna objective function and the quote ngram generator
from utils.retrieve import objective, generate_quote_ngrams


In [None]:
# configure
filename = "everdell.md"
qa_filename = "everdell-selected.csv"
ngram_size = 2  # use 2 instead of 3 so we don't skip 2-word header chunks
f_beta = 3  # beta for the F-beta score: recall counts 3 times as much as precision
n_trials = 25  # number of Optuna trials

pd.set_option("display.max_colwidth", None)


## Read question-answers and generate ngrams from manual quotes

The question-answers file has been augmented by a human to include the sentences/paragraphs from the manual that are needed (necessary and sufficient) to answer each question.

To evaluate the quality of a list of chunks retrieved from an index, we want to compare the sentences/paragraphs in the chunks against the sentences/paragraphs specified by the human in the question-answer file.

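To do the comparison we can't simply check for equality, because a retrieved chunk may overlap only part of a human-specified sentence/paragraph. So we generate _ngrams_ for the retrieved chunks and for the human-specified sentences/paragraphs, and count how many ngrams they have in common using the standard precision and recall metrics: precision is the fraction of retrieved ngrams that also appear in the human-specified quotes, and recall is the fraction of quote ngrams that also appear in the retrieved chunks.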
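The comparison described above can be sketched in a few lines. This is a minimal illustration, not the code in `utils.retrieve` (which is not shown here): the function names `ngrams`, `precision_recall`, and `f_beta_score` are hypothetical, the tokenization is an assumption, and the two sentences are made up for the example.

```python
import re
from typing import Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int = 2) -> Set[NGram]:
    """Return the set of word ngrams in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def precision_recall(retrieved: Set[NGram], expected: Set[NGram]) -> Tuple[float, float]:
    """Precision and recall of retrieved ngrams against the expected ngrams."""
    common = len(retrieved & expected)
    precision = common / len(retrieved) if retrieved else 0.0
    recall = common / len(expected) if expected else 0.0
    return precision, recall


def f_beta_score(precision: float, recall: float, beta: float = 3.0) -> float:
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# made-up example: the quote is fully contained in a longer retrieved chunk
quote = "Shuffle the deck and deal five cards."
chunk = "First, shuffle the deck and deal five cards to each player."

p, r = precision_recall(ngrams(chunk), ngrams(quote))
f = f_beta_score(p, r, beta=3.0)
print(f"precision={p:.4f} recall={r:.4f} f_beta={f:.4f}")
# prints: precision=0.6000 recall=1.0000 f_beta=0.9375
```

The chunk contains every bigram of the quote, so recall is 1.0, but 4 of its 10 bigrams are extra, so precision is 0.6. With beta set to 3 the score stays close to the recall value, which is the behavior we want when missing part of an answer is worse than retrieving some extra text.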


In [None]:
# read question-answers
qa_df = pd.read_csv(f"data/{qa_filename}", na_filter=False)
print(len(qa_df))
qa_df.head(3)


In [None]:
# NOTE: we shouldn't include questions in the *test* set right now,
# but people are still adding the manual quotes,
# and since we have so few questions with manual quotes so far
# we will use all of them for this demo.

# keep only rows with at least 1 manual quote
qa_df = qa_df[qa_df["manual quote 1"].notna() & (qa_df["manual quote 1"] != "")]
print(len(qa_df))


In [None]:
# generate bigrams (ngram size=2) for each manual quote
# and store them in the question_ngrams dictionary
question_ngrams = generate_quote_ngrams(qa_df, ngram_size)
print(len(question_ngrams))


## Read the document


In [None]:
# load document
documents = []
with open(f"data/{filename}", "r", encoding="utf-8") as file:
    document = Document(
        text=file.read(),
        metadata={"filename": filename},
    )
    # add the document to a single-entry documents list that we will use below
    documents.append(document)
print(len(documents[0].text))


## Optimize hyperparameters by creating an index and evaluating the retrieved chunks

Creating an index involves a sequence of steps (a pipeline). Each step is configured using hyperparameters:

- split each document into chunks
- add metadata - e.g., document title, summary of previous and next chunks, pointer to parent chunk
- add an embedding (vector) - decide whether you want the embedding to include chunk metadata or just the text
- index the chunk - choose a vector store and index the embeddings, keywords, or both
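
To make the first step concrete, here is a rough sketch of a splitter with two hyperparameters, chunk size and chunk overlap. It is an assumption for illustration only: `split_into_chunks` is a hypothetical helper that counts whitespace-separated words, whereas the splitter actually used by the pipeline may count tokens and respect sentence or heading boundaries.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into chunks of chunk_size words; neighbors share chunk_overlap words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in split_into_chunks(sample, chunk_size=4, chunk_overlap=1):
    print(chunk)
# prints:
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
```

Smaller chunks tend to raise precision because less unrelated text is retrieved, while larger chunks and more overlap tend to raise recall. That trade-off is why these values are worth searching over rather than guessing.
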

Evaluating the retrieved chunks involves two steps:

- issue the queries
- compare the ngrams in the retrieved chunks to the ngrams in the human-specified sentences/paragraphs
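
The two stages above come together in the objective function that Optuna calls once per trial. The sketch below shows one possible shape for such a function; it is not the `objective` imported from `utils.retrieve`, whose body is not shown here. The helper `make_objective`, the stand-in `toy_evaluate`, and the parameter names and ranges are all assumptions. Only `trial.suggest_int(name, low, high, step=...)` is part of Optuna's trial interface.

```python
from typing import Callable, Dict


def make_objective(evaluate: Callable[[Dict[str, int]], float]) -> Callable:
    """Wrap a scoring function into an Optuna-style objective(trial) -> float."""

    def objective(trial) -> float:
        # declare the search space; names and ranges are hypothetical
        params = {
            "chunk_size": trial.suggest_int("chunk_size", 128, 1024, step=128),
            # the maximum overlap (96) stays below the minimum chunk size (128)
            "chunk_overlap": trial.suggest_int("chunk_overlap", 0, 96, step=32),
            "top_k": trial.suggest_int("top_k", 1, 8),
        }
        # evaluate() is expected to build the index with these settings,
        # issue the queries, and return the mean F-beta score over all questions
        return evaluate(params)

    return objective


def toy_evaluate(params: Dict[str, int]) -> float:
    """Stand-in scorer so the sketch runs without an index; not a real metric."""
    return 1.0 / (1.0 + abs(params["chunk_size"] - 512) / 512.0)
```

An objective built this way, for example `make_objective(toy_evaluate)`, can be passed to `study.optimize(...)` in the same way as the imported `objective`. Because the study is created with `direction="maximize"`, the returned score must be one where higher is better.
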


In [None]:
# ask Optuna to find the best hyperparameters

study_name = "test"  # Unique identifier of the study.
storage_name = f"sqlite:///optuna-{study_name}.db"
print(
    f"To see a dashboard, open a terminal, activate the virtual environment, and run: optuna-dashboard {storage_name}"
)
study = optuna.create_study(
    study_name=study_name,
    storage=storage_name,
    load_if_exists=True,
    direction="maximize",
)
study.optimize(objective, n_trials=n_trials)

study.best_params
