I have a few more questions and would be happy if you could answer them. Here is my fine-tuning code:
from ragatouille import RAGTrainer
from ragatouille.data import CorpusProcessor, llama_index_sentence_splitter
import os
import glob
import random


def main():
    # model_name: name for the newly trained model
    # pretrained_model_name: base model to fine-tune
    trainer = RAGTrainer(
        model_name="ColBERT_1.0",  # ColBERT_1 for the first sample
        # pretrained_model_name="colbert-ir/colbertv2.0",
        pretrained_model_name="intfloat/e5-base",  # ???
        language_code="tr",  # ???
    )

    # Path to the directory containing all the `.txt` files for indexing
    folder_path = "/text"  # the text folder contains several .txt files

    # Lists to store the texts and their corresponding file names
    all_texts = []
    document_ids = []

    # Read every `.txt` file in the folder; use the file name (without
    # extension) as the document ID
    for file_path in glob.glob(os.path.join(folder_path, "*.txt")):
        with open(file_path, "r", encoding="utf-8") as file:
            all_texts.append(file.read())
            document_ids.append(os.path.splitext(os.path.basename(file_path))[0])

    # Chunk the documents into ~256-token passages (overlap of 0.1 chosen)
    corpus_processor = CorpusProcessor(document_splitter_fn=llama_index_sentence_splitter)
    documents = corpus_processor.process_corpus(
        documents=all_texts, document_ids=document_ids, chunk_size=256
    )

    # To train retrieval models like ColBERT, we need training triplets:
    # queries, positive passages, and negative passages for each query.
    # For now, fake query/relevant-passage pairs:
    queries = [
        "document relevant query-1",
        "document relevant query-2",
        "document relevant query-3",
        "document relevant query-4",
        "document relevant query-5",
        "document relevant query-6",
    ] * 3

    pairs = []
    for query in queries:
        fake_relevant_docs = random.sample(documents, 10)
        for doc in fake_relevant_docs:
            pairs.append((query, doc))

    # Prepare the training data, mining hard negatives from the full corpus
    trainer.prepare_training_data(
        raw_data=pairs,
        data_out_path="./data_out_path",
        all_documents=all_texts,
        num_new_negatives=10,
        mine_hard_negatives=True,
    )

    trainer.train(
        batch_size=32,
        nbits=4,  # how many bits the trained model will use
        maxsteps=500000,
        use_ib_negatives=True,  # use in-batch negatives when computing the loss
        dim=128,  # each embedding will have 128 dimensions
        learning_rate=5e-6,  # small values ([3e-6, 3e-5]) work best for BERT-like base models; 5e-6 is often the sweet spot
        doc_maxlen=256,  # maximum document length
        use_relu=False,  # disable ReLU
        warmup_steps="auto",  # defaults to 10% of total steps
    )


if __name__ == "__main__":
    main()
When I run this code, a model with the structure below is saved under checkpoints:
colbert
    vocab.txt
    tokenizer_config.json
    tokenizer.json
    special_tokens_map.json
    model.safetensors
    config.json
    artifact.metadata
I need to fine-tune intfloat/e5-base or intfloat/multilingual-e5-base on my own data with ColBERT. Do you know of any changes I need to make to my code, or to the library's internal code?
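My current assumption, which I have not verified, is that swapping the base model only requires changing the constructor arguments, roughly like this (the model name is a placeholder), but I am not sure whether the library internals also need changes, hence my question:

trainer = RAGTrainer(
    model_name="ColBERT_e5_multilingual",  # placeholder name for the new model
    pretrained_model_name="intfloat/multilingual-e5-base",
    language_code="tr",
)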
Also, how can I try out the fine-tuned model with the structure I shared above? Do you have example code for loading and querying it?
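For reference, here is how I would expect loading and querying to work, based on the RAGPretrainedModel API in the README. This is only a sketch: the checkpoint path, index name, and test collection are placeholders, and I have not verified it against my checkpoint.

from ragatouille import RAGPretrainedModel

# Placeholder path: point this at the `colbert` checkpoint folder listed above
RAG = RAGPretrainedModel.from_pretrained("path/to/checkpoints/colbert")

# Index a small test collection with the fine-tuned encoder
collection = ["example passage one", "example passage two"]  # stand-in documents
RAG.index(
    collection=collection,
    index_name="e5_colbert_test",  # placeholder index name
    max_document_length=256,
    split_documents=True,
)

# Run a test query against the new index
results = RAG.search(query="example query", k=2)
print(results)

Is this the right way to load a checkpoint like mine, or is another step needed?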
Thanks for your interest