# Data preparation for fine-tuning

In this tutorial, we will show an example of the first step for fine-tuning: dataset preparation.

## 0. Installation

In [1]:
# %pip install -U datasets

In [2]:
import os

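# optional: route Hugging Face Hub downloads through a mirror; skip this if huggingface.co is reachable
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"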

Suppose we want to fine-tune our model for financial tasks, and we have found an open-source dataset that could be useful: [financial-qa-10k](https://huggingface.co/datasets/virattt/financial-qa-10K). Let's see how to properly prepare this dataset for fine-tuning.

The raw dataset has the following structure:
- 5 columns: 'question', 'answer', 'context', 'ticker', and 'filing'.
- 7000 rows.

In [3]:
from datasets import load_dataset

ds = load_dataset("virattt/financial-qa-10K", split="train")
ds



Dataset({
    features: ['question', 'answer', 'context', 'ticker', 'filing'],
    num_rows: 7000
})

## 1. Data for Fine-tuning

Convert the dataset into the following format:

``` python
{"query": str, "pos": List[str], "neg":List[str], "pos_scores": List[int], "neg_scores": List[int], "prompt": str, "type": str}
```

- `query` is the query text.
- `pos` is a list of positive texts and `neg` is a list of negative texts. If a query has no negative texts, you can randomly sample some from the entire corpus.
- `pos_scores` and `neg_scores` are lists of scores for the (`query`, `pos`) and (`query`, `neg`) pairs. They can be ignored if you don't use knowledge distillation.
- `prompt` is the prompt used for the query; it overrides `query_instruction_for_retrieval`.
- `type` is used for bge-en-icl; its values include `normal`, `symmetric_class`, `symmetric_clustering`, etc.
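As a concrete illustration, a single record in this format could look like the one below. The texts are invented and the `check_record` helper is hypothetical (it is not part of FlagEmbedding); the score fields are left out on the assumption that knowledge distillation is not used.

```python
import json

# A made-up record for illustration only; pos_scores / neg_scores are omitted
# because they are only needed for knowledge distillation.
record = {
    "query": "What was the company's revenue in fiscal 2023?",
    "pos": ["Revenue for fiscal 2023 was $10.0 billion."],
    "neg": [
        "The company leases office space in several countries.",
        "See Note 13 for information on legal proceedings.",
    ],
    "prompt": "Represent this sentence for searching relevant passages: ",
    "type": "normal",
}


def is_str_list(value):
    """True if value is a list whose items are all strings."""
    return isinstance(value, list) and all(isinstance(x, str) for x in value)


def check_record(rec):
    """Hypothetical sanity check for the field types described above."""
    return (
        isinstance(rec.get("query"), str)
        and is_str_list(rec.get("pos"))
        and len(rec["pos"]) > 0
        and is_str_list(rec.get("neg"))
        and isinstance(rec.get("prompt"), str)
    )


print(check_record(record))
print(json.dumps(record)[:70])
```

A check like this is cheap to run over the whole dataset before training; a common mistake it catches is leaving `pos` as a plain string instead of a list.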

We select the columns 'question' and 'context' as our query and positive text (`pos`) and rename them accordingly. We then add an 'id' column for use in the later evaluation.

In [4]:
ds = ds.select_columns(column_names=["question", "context"])
ds = ds.rename_column("question", "query")
ds = ds.rename_column("context", "pos")
ds = ds.add_column("id", [str(i) for i in range(len(ds))])
ds[0]

{'query': 'What area did NVIDIA initially focus on before expanding to other computationally intensive fields?',
 'pos': 'Since our original focus on PC graphics, we have expanded to several other large and important computationally intensive fields.',
 'id': '0'}

Negative examples are important when training embedding models. Our initial dataset does not come with negative texts, so we randomly sample a few from the whole corpus.

In [5]:
import numpy as np

np.random.seed(520)
neg_num = 10

def str_to_lst(data):
    data["pos"] = [data["pos"]]
    return data

# sample negative texts: for each row, draw `neg_num` random row indices
new_col = []
for i in range(len(ds)):
    ids = np.random.randint(0, len(ds), size=neg_num)
    # redraw if the row's own index was picked, so a positive is never its own negative
    while i in ids:
        ids = np.random.randint(0, len(ds), size=neg_num)
    neg = [ds[j.item()]["pos"] for j in ids]
    new_col.append(neg)
ds = ds.add_column("neg", new_col)

# wrap each 'pos' string in a list, as the target format requires
ds = ds.map(str_to_lst)

Map: 100%|██████████| 7000/7000 [00:00<00:00, 22336.83 examples/s]
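The loop above draws indices with replacement, so the same negative can appear twice for one query. A minimal alternative sketch, using only the standard library and a toy corpus as a stand-in for the dataset, samples distinct indices and excludes the query's own row. The `sample_negatives` helper is hypothetical.

```python
import random


def sample_negatives(num_rows, row_idx, neg_num, rng):
    """Pick `neg_num` distinct row indices, never `row_idx` itself.

    Samples from range(num_rows - 1) and shifts every pick at or above
    `row_idx` up by one, which skips `row_idx` without building a list.
    """
    picks = rng.sample(range(num_rows - 1), neg_num)
    return [j + 1 if j >= row_idx else j for j in picks]


rng = random.Random(520)
toy_corpus = [f"passage {i}" for i in range(20)]  # stand-in for the 'pos' column

negatives = [
    [toy_corpus[j] for j in sample_negatives(len(toy_corpus), i, 5, rng)]
    for i in range(len(toy_corpus))
]
print(negatives[0])
```

Note that this only excludes the row's own index. If the corpus contains duplicate passages, a text identical to the positive can still be drawn from another row.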


Lastly, we add the prompt used for queries. The same string serves as the `query_instruction_for_retrieval` during inference.

In [6]:
instruction = "Represent this sentence for searching relevant passages: "
ds = ds.add_column("prompt", [instruction]*len(ds))
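Conceptually, the instruction is placed in front of each query before it is encoded. The sketch below only illustrates that idea with plain string concatenation; `with_instruction` is a hypothetical helper, not a FlagEmbedding function, and the library's actual handling may differ in detail.

```python
instruction = "Represent this sentence for searching relevant passages: "


def with_instruction(queries, instruction):
    """Prefix every query with the instruction (illustrative helper only)."""
    return [instruction + q for q in queries]


prepared = with_instruction(
    ["What area did NVIDIA initially focus on?"], instruction
)
print(prepared[0])
```

This is also why the trailing space in the instruction matters: without it, the instruction and the query would run together.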

Now a single row of the dataset is:

In [7]:
ds[0]

{'query': 'What area did NVIDIA initially focus on before expanding to other computationally intensive fields?',
 'pos': ['Since our original focus on PC graphics, we have expanded to several other large and important computationally intensive fields.'],
 'id': '0',
 'neg': ['Kroger expects that its value creation model will deliver total shareholder return within a target range of 8% to 11% over time.',
  'CSB purchased First Mortgages of $2.9 billion during 2023.',
  'See Note 13 to our Consolidated Financial Statements for information on certain legal proceedings for which there are contingencies.',
  'Diluted earnings per share were $16.69 in fiscal 2022 compared to $15.53 in fiscal 2021.',
  'In the year ended December 31, 2023, Total net sales and revenue increased primarily due to: (1) increased net wholesale volumes primarily due to increased sales of crossover vehicles and full-size pickup trucks, partially offset by decreased sales of mid-size pickup trucks; (2) favorable Pri

Then we split the dataset into a training set and a test set.

In [8]:
split = ds.train_test_split(test_size=0.1, shuffle=True, seed=520)
train = split["train"]
test = split["test"]
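To make the effect of `test_size=0.1` with a fixed seed concrete, here is a rough standard-library sketch of a seeded, shuffled split over the row ids. It is not how `datasets` implements `train_test_split` internally and will not reproduce the same partition; `split_ids` is a hypothetical helper.

```python
import random


def split_ids(ids, test_size, seed):
    """Shuffle ids with a fixed seed, then carve off a test fraction."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


all_ids = [str(i) for i in range(7000)]
train_ids, test_ids = split_ids(all_ids, test_size=0.1, seed=520)

print(len(train_ids), len(test_ids))  # 6300 700
print(set(train_ids).isdisjoint(test_ids))  # True
```

The two properties checked here, the 6300/700 sizes and the absence of overlap, are worth verifying on the real `train` and `test` splits as well, since any leaked row would inflate the evaluation scores.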

Now we are ready to store the data for later fine-tuning:

In [15]:
train.to_json("ft_data/training.json")

Creating json from Arrow format: 100%|██████████| 7/7 [00:00<00:00, 39.73ba/s]


16583481
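Before moving on, it can be worth reading the file back to confirm its shape. The sketch below assumes the file holds one JSON object per line (the JSON Lines layout); it writes a tiny toy file to a temporary directory rather than touching `ft_data/`, and `read_jsonl` is a hypothetical helper.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Toy rows standing in for the real training data.
rows = [
    {"query": "q1", "pos": ["p1"], "id": "0", "neg": ["n1", "n2"], "prompt": "instr: "},
    {"query": "q2", "pos": ["p2"], "id": "1", "neg": ["n3", "n4"], "prompt": "instr: "},
]
path = os.path.join(tempfile.mkdtemp(), "training.json")
with open(path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

loaded = read_jsonl(path)
print(len(loaded), sorted(loaded[0].keys()))
```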

## 2. Test Data for Evaluation

The last step is to construct the testing dataset following the [format](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/evaluation#8-custom-dataset) for evaluation.

In [10]:
test

Dataset({
    features: ['query', 'pos', 'id', 'neg', 'prompt'],
    num_rows: 700
})

First select the columns for queries:

In [11]:
queries = test.select_columns(column_names=["id", "query"])
queries = queries.rename_column("query", "text")
queries[0]

{'id': '1289',
 'text': 'How does Starbucks recognize the interest and penalties related to income tax matters on their financial statements?'}

Then select the columns for corpus:

In [12]:
corpus = ds.select_columns(column_names=["id", "pos"])
corpus = corpus.rename_column("pos", "text")

Finally, build the qrels, which indicate the relations between the queries and their corresponding corpus entries:

In [13]:
qrels = test.select_columns(["id"])
qrels = qrels.rename_column("id", "qid")
qrels = qrels.add_column("docid", list(test["id"]))
qrels = qrels.add_column("relevance", [1]*len(test))
qrels[0]

Flattening the indices: 100%|██████████| 700/700 [00:00<00:00, 180956.10 examples/s]


{'qid': '1289', 'docid': '1289', 'relevance': 1}
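Many retrieval evaluation tools (pytrec_eval, for example) take qrels as a nested mapping of the form `{qid: {docid: relevance}}` rather than as flat rows. A minimal sketch of that conversion, using toy rows and a hypothetical `qrels_to_dict` helper:

```python
def qrels_to_dict(rows):
    """Turn flat qrels rows into a nested {qid: {docid: relevance}} mapping."""
    nested = {}
    for row in rows:
        nested.setdefault(row["qid"], {})[row["docid"]] = row["relevance"]
    return nested


toy_qrels = [
    {"qid": "1289", "docid": "1289", "relevance": 1},
    {"qid": "42", "docid": "42", "relevance": 1},
]
print(qrels_to_dict(toy_qrels))
# {'1289': {'1289': 1}, '42': {'42': 1}}
```

Because every query in this dataset has exactly one relevant passage (its own context), each inner dictionary holds a single entry.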

Store the test queries, corpus, and qrels:

In [14]:
queries.to_json("ft_data/test_queries.jsonl")
corpus.to_json("ft_data/corpus.jsonl")
qrels.to_json("ft_data/test_qrels.jsonl")

Creating json from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 210.42ba/s]
Creating json from Arrow format: 100%|██████████| 7/7 [00:00<00:00, 261.19ba/s]
Creating json from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 591.08ba/s]


30574

## 3. Fine-tune

In [10]:
from FlagEmbedding import FlagModel

finetuned_path = "test_encoder_only_base_bge-large-en-v1.5"
model_name = "BAAI/bge-large-en-v1.5"
model = FlagModel(finetuned_path, 
# model = FlagModel(model_name,
                  # same instruction (including the trailing space) as the training prompt
                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages: ",
                  devices=[0,1],
                  use_fp16=False)

In [11]:
# `queries` and `corpus` are datasets with 'id' and 'text' columns
queries_text = list(queries["text"])
corpus_text = list(corpus["text"])

queries_embeddings = model.encode_queries(queries_text)
corpus_embeddings = model.encode_corpus(corpus_text)


In [12]:
import faiss
import numpy as np

# get the length of our embedding vectors; bge-large-en-v1.5 produces vectors of length 1024
dim = corpus_embeddings.shape[-1]

# create the faiss index and store the corpus embeddings into the vector space
index = faiss.index_factory(dim, 'Flat', faiss.METRIC_INNER_PRODUCT)
# corpus_embeddings = corpus_embeddings.astype(np.float32)
# train and add the embeddings to the index
index.train(corpus_embeddings)
index.add(corpus_embeddings)

print(f"total number of vectors: {index.ntotal}")

total number of vectors: 7000
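Conceptually, a `Flat` index with `METRIC_INNER_PRODUCT` compares every query against every stored vector. A rough NumPy equivalent, with a hypothetical `flat_ip_search` helper and toy 2-dimensional vectors, might look like the sketch below. It is only meant to show what is being computed; faiss itself is far more optimized.

```python
import numpy as np


def flat_ip_search(corpus_emb, query_emb, k):
    """Exhaustive inner-product search; returns (scores, indices), best first."""
    scores = query_emb @ corpus_emb.T            # shape: (n_queries, n_corpus)
    top = np.argsort(-scores, axis=1)[:, :k]     # indices of the k largest scores
    return np.take_along_axis(scores, top, axis=1), top


# Toy vectors standing in for the real embeddings.
corpus_emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype=np.float32)
query_emb = np.array([[0.0, 1.0]], dtype=np.float32)

scores, indices = flat_ip_search(corpus_emb, query_emb, k=2)
print(indices)  # [[1 2]]
```

With normalized embeddings the inner product equals cosine similarity, which is why an inner-product index is a natural fit for this kind of retrieval.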


In [13]:
from tqdm import tqdm

query_size = len(queries_embeddings)

all_scores = []
all_indices = []

for i in tqdm(range(0, query_size, 32), desc="Searching"):
    j = min(i + 32, query_size)
    query_embedding = queries_embeddings[i: j]
    score, indice = index.search(query_embedding.astype(np.float32), k=100)
    all_scores.append(score)
    all_indices.append(indice)

all_scores = np.concatenate(all_scores, axis=0)
all_indices = np.concatenate(all_indices, axis=0)

Searching: 100%|██████████| 22/22 [00:00<00:00, 31.84it/s]


In [14]:
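# row i of the search output belongs to query i; column values are positions in the corpus
queries_ids = list(queries["id"])
corpus_ids = list(corpus["id"])
results = {}
for idx, (scores, indices) in enumerate(zip(all_scores, all_indices)):
    results[queries_ids[idx]] = {}
    for score, doc_idx in zip(scores, indices):
        # faiss returns -1 when fewer than k neighbors are found
        if doc_idx != -1:
            results[queries_ids[idx]][corpus_ids[doc_idx]] = float(score)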

In [15]:
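from FlagEmbedding.abc.evaluation.utils import evaluate_metrics, evaluate_mrr

# the evaluation helpers expect qrels as a nested dict: {qid: {docid: relevance}}
qrels_dict = {}
for row in qrels:
    qrels_dict.setdefault(row["qid"], {})[row["docid"]] = row["relevance"]

k_values = [10,100]
eval_res = evaluate_metrics(qrels_dict, results, k_values)
mrr = evaluate_mrr(qrels_dict, results, k_values)

for res in eval_res:
    print(res)
print(mrr)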

defaultdict(<class 'list'>, {'NDCG@10': 0.84061, 'NDCG@100': 0.85484})
defaultdict(<class 'list'>, {'MAP@10': 0.81157, 'MAP@100': 0.81471})
defaultdict(<class 'list'>, {'Recall@10': 0.93, 'Recall@100': 0.99429})
defaultdict(<class 'list'>, {'P@10': 0.093, 'P@100': 0.00994})
defaultdict(<class 'list'>, {'MRR@10': 0.81157, 'MRR@100': 0.81471})
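To make these numbers easier to interpret, here is a small pure-Python sketch of Recall@k and MRR@k over the same nested-dict inputs. It is an illustration with toy data and a hypothetical `recall_and_mrr_at_k` helper, not the implementation behind `evaluate_metrics`. Because each test query here has exactly one relevant passage, MAP and MRR coincide, which is why the two rows above report identical values.

```python
def recall_and_mrr_at_k(qrels, results, k):
    """qrels: {qid: {docid: relevance}}, results: {qid: {docid: score}}."""
    recalls, reciprocal_ranks = [], []
    for qid, judged in qrels.items():
        relevant = {docid for docid, rel in judged.items() if rel > 0}
        if not relevant:
            continue  # queries without relevant documents are skipped
        # rank the retrieved documents by score, best first, and keep the top k
        ranked = sorted(results.get(qid, {}).items(), key=lambda kv: kv[1], reverse=True)
        top_ids = [docid for docid, _ in ranked[:k]]
        recalls.append(sum(1 for d in top_ids if d in relevant) / len(relevant))
        # reciprocal rank of the first relevant hit, or 0 if none is in the top k
        rr = 0.0
        for rank, docid in enumerate(top_ids, start=1):
            if docid in relevant:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    n = len(recalls)
    return sum(recalls) / n, sum(reciprocal_ranks) / n


toy_qrels = {"q1": {"d1": 1}, "q2": {"d2": 1}}
toy_results = {
    "q1": {"d1": 0.9, "d3": 0.5},
    "q2": {"d3": 0.8, "d2": 0.7, "d1": 0.1},
}
print(recall_and_mrr_at_k(toy_qrels, toy_results, k=2))  # (1.0, 0.75)
print(recall_and_mrr_at_k(toy_qrels, toy_results, k=1))  # (0.5, 0.5)
```

In the toy data, `q2` finds its relevant passage at rank 2, so it contributes a reciprocal rank of 0.5 at `k=2` and nothing at `k=1`.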


In [None]:
# Original test result

defaultdict(<class 'list'>, {'NDCG@1': 0.58286, 'NDCG@5': 0.68588, 'NDCG@10': 0.70405})
defaultdict(<class 'list'>, {'Recall@1': 0.58286, 'Recall@5': 0.76714, 'Recall@10': 0.82286})


In [None]:
# Fake test result

defaultdict(<class 'list'>, {'NDCG@1': 0.75571, 'NDCG@5': 0.84706, 'NDCG@10': 0.85623})
defaultdict(<class 'list'>, {'Recall@1': 0.75571, 'Recall@5': 0.92286, 'Recall@10': 0.95143})


In [9]:
from FlagEmbedding import FlagReranker

reranker = FlagReranker(
    'BAAI/bge-reranker-base', 
    query_max_length=256,
    use_fp16=True,
    devices=['cuda:1'],
)

score = reranker.compute_score(['I am happy to help', 'Assisting you is my pleasure'])
print(score)



[6.453125]
