## Create initial vector store

In [5]:
!python ../scripts/create_vector_store.py \
  --dataset "covid" \
  --emb_model "BAAI/bge-small-en-v1.5" \
  --cs 150 \
  --co 20 \
  --bs_emb 2048 \
  --output_dir "../vector_stores/covid/base_"

Load model embedding
Using device: cuda
Total number of documents: 2019
Creating passages
Token indices sequence length is longer than the specified maximum sequence length for this model (655 > 512). Running this sequence through the model will result in indexing errors
100%|███████████████████████████████████████| 2019/2019 [02:56<00:00, 11.45it/s]
Total number of passages: 138891
Removing duplicate passages
Total number of passages created: 8397
Creating vector store
Load model embedding : BAAI/bge-small-en-v1.5
Using device: cuda
Generating embeddings: 100%|███████████████████████| 5/5 [00:07<00:00,  1.47s/it]
✅ FAISS index created successfully.
💾 Vector store saved in ../vector_stores/covid/base_vs_covid_150_20
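
The log above shows 138891 candidate passages collapsing to 8397 after duplicate removal. The following is a minimal sketch of overlapping chunking followed by exact-duplicate removal, assuming `--cs` and `--co` stand for chunk size and chunk overlap measured in tokens. The function names `chunk_tokens` and `deduplicate` are hypothetical and are not taken from `create_vector_store.py`, and whitespace splitting stands in for the real tokenizer.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int = 150, overlap: int = 20) -> List[List[str]]:
    """Split a token list into windows of `chunk_size` that share `overlap` tokens."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


def deduplicate(passages: List[str]) -> List[str]:
    """Drop exact duplicate passages while keeping first-seen order."""
    return list(dict.fromkeys(passages))


# Assumption: whitespace tokens replace the model tokenizer for illustration.
doc_tokens = [f"w{i}" for i in range(400)]
chunks = chunk_tokens(doc_tokens, chunk_size=150, overlap=20)
passages = deduplicate([" ".join(chunk) for chunk in chunks])
print(len(passages))
```

With a step of 130 tokens, consecutive windows share their boundary 20 tokens, so a sentence that straddles a chunk boundary still appears whole in at least one passage when it is shorter than the overlap.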


In [7]:
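import sys
sys.path.append('../src')  # make the project's src/ package importable from the notebook
from vector_stores.faiss import VectorStoreFaiss

# Load the base vector store and query it; the second argument (2) matches the number of passages returned
vector_store = VectorStoreFaiss.load_local("../vector_stores/covid/base_vs_covid_150_20")
results = vector_store.buscar_por_batches(['What is the main cause of HIV-1 infection in children?'], 2)
results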

Load model embedding : BAAI/bge-small-en-v1.5
Using device: cuda
💾 Vector store loaded from../vector_stores/covid/base_vs_covid_150_20


🔍 Searching: 100%|███████████████████████████████| 1/1 [00:00<00:00, 107.03it/s]


[(['We carried out an association study of DC-SIGNR polymorphism in 197 infants born to untreated HIV-1-infected mothers recruited in Harare, Zimbabwe. Among them, 97 infants were HIV-1-infected and 100 infants remained uninfected. Of the 97 HIV-1-infected infants, 57 were infected IU, 11 were infected IP, and 17 were infected PP. Timing of infection was not determined for 12 HIV-1-infected infants. Baseline characteristics of mothers and infants are presented in Table 1 . Maternal age and CD4 cell count, child sex, mode of delivery, duration of membrane rupture and gestational age were similar among all groups. However, maternal viral load',
   'The single strongest risk factor for pneumonia is HIV infection, which is especially prevalent in children in sub-Saharan Africa. HIV-infected children have 6 times increased odds of developing severe pneumonia or of death compared to HIV-uninfected children [52] . Since the effective prevention of mother-to-child transmission of HIV, there is

In [15]:
vector_store.embedding_model_name_or_path

'BAAI/bge-small-en-v1.5'

## Train embedding

In [1]:
import sys
sys.path.append('../src')
from utils.data_for_train_emb import load_and_prepare_datasets
train_dataset, val_dataset, test_dataset = load_and_prepare_datasets('covid')
train_dataset[0]

Loading dataset splits for covid
Train: 1292
Val: 323
Test: 404
Datasets loaded and prepared.




{'q_id': 1891,
 'question': 'As of 26 January 2020, what countries had sporadic cases?',
 'relevant_docs': '0. As of 26 January 2020, the still ongoing outbreak had resulted in 2066 (618 of them are in Wuhan) confirmed cases and 56 (45 of them were in Wuhan) deaths in mainland China [4] , and sporadic cases exported from Wuhan were reported in Thailand, Japan, Republic of Korea, Hong Kong, Taiwan, Australia, and the United States, please see the World Health Organization (WHO) news release via https://www.who.int/csr/don/en/ from 14 to 21 Jan'}

In [2]:
!python ../scripts/train_embedding.py \
  --name_dataset "covid" \
  --model_name "BAAI/bge-small-en-v1.5" \
  --new_model_name "bge-small-covid" \
  --epochs 10 \
  --batch_size 128 \
  --output_dir "../models/covid/embedding/"

[34m[1mwandb[0m: Currently logged in as: [33mdinho15971[0m ([33mdinho15971-unicamp[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin
Starting main process...
Loading dataset splits for covid
Train: 1292
Val: 323
Test: 404
Datasets loaded and prepared.
Datasets loaded and prepared.
Creating evaluator...
Evaluator created.
[34m[1mwandb[0m: Tracking run with wandb version 0.19.10
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/local1/ronaldinho/projects/solving_problems/test_sbbd/notebooks/wandb/run-20250430_171033-8x7k36ta[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mbge-small-covid_10e_128bs[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/dinho15971-unicamp/SBBD_embeddings[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/dinho15971-unicamp/SBBD_embeddings/runs/8x7k36ta[0m
Loading model: BAAI/bge-small-en-v1.5
Model and loss f

## Evaluate embedding

In [3]:
!python ../scripts/evaluate_embedding.py \
  --name_dataset "covid" \
  --output_dir "../results/covid/" \
  --models_dir "../models/covid/embedding/"

Loading models...
Models loaded
Loading datasets...
Loading dataset splits for covid
Train: 1292
Val: 323
Test: 404
Datasets loaded and prepared.
Datasets loaded and prepared.
Loaded dataset
Creating evaluator...
Evaluator created.
Evaluating models
Save results...
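
The evaluation log does not show which metrics the evaluator reports. As a rough illustration only, the sketch below computes two common retrieval metrics, recall@k and MRR@k, for the case where each question has a single relevant passage. The function names and the toy rankings are hypothetical and are not taken from `evaluate_embedding.py`.

```python
from typing import Dict, List


def recall_at_k(ranked: Dict[int, List[int]], relevant: Dict[int, int], k: int) -> float:
    """Fraction of queries whose relevant passage appears in the top k results."""
    hits = sum(1 for q_id, docs in ranked.items() if relevant[q_id] in docs[:k])
    return hits / len(ranked)


def mrr_at_k(ranked: Dict[int, List[int]], relevant: Dict[int, int], k: int) -> float:
    """Mean reciprocal rank of the relevant passage, counting 0 when it is outside the top k."""
    total = 0.0
    for q_id, docs in ranked.items():
        for rank, doc_id in enumerate(docs[:k], start=1):
            if doc_id == relevant[q_id]:
                total += 1.0 / rank
                break
    return total / len(ranked)


# Toy data: query id -> ranked passage ids, and query id -> the one relevant passage id.
ranked = {0: [5, 2, 9], 1: [7, 1, 3], 2: [4, 8, 6]}
relevant = {0: 2, 1: 7, 2: 0}

print(recall_at_k(ranked, relevant, k=1))
print(recall_at_k(ranked, relevant, k=3))
print(mrr_at_k(ranked, relevant, k=3))
```

Running the same functions on rankings from the base and the fine-tuned model gives a like-for-like comparison, provided both are scored against the same passage collection.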


## Create new vector store

In [4]:
!python ../scripts/create_vector_store.py \
  --dataset "covid" \
  --emb_model "../models/covid/embedding/bge-small-covid_10e_128bs" \
  --cs 150 \
  --co 20 \
  --bs_emb 2048 \
  --output_dir "../vector_stores/covid/ft_"

Load model embedding
Using device: cuda
Total number of documents: 2019
Creating passages
Token indices sequence length is longer than the specified maximum sequence length for this model (655 > 512). Running this sequence through the model will result in indexing errors
100%|███████████████████████████████████████| 2019/2019 [02:56<00:00, 11.44it/s]
Total number of passages: 138891
Removing duplicate passages
Total number of passages created: 8397
Creating vector store
Load model embedding : ../models/covid/embedding/bge-small-covid_10e_128bs
Using device: cuda
Generating embeddings: 100%|███████████████████████| 5/5 [00:07<00:00,  1.48s/it]
✅ FAISS index created successfully.
💾 Vector store saved in ../vector_stores/covid/ft_vs_covid_150_20


In [8]:
import sys
sys.path.append('../src')  # make the project's src/ package importable from the notebook
from vector_stores.faiss import VectorStoreFaiss

# Load the fine-tuned vector store and repeat the query used against the base store
vector_store = VectorStoreFaiss.load_local("../vector_stores/covid/ft_vs_covid_150_20")
results = vector_store.buscar_por_batches(['What is the main cause of HIV-1 infection in children?'], 2)
results

Load model embedding : ../models/covid/embedding/bge-small-covid_10e_128bs
Using device: cuda
💾 Vector store loaded from../vector_stores/covid/ft_vs_covid_150_20


🔍 Searching: 100%|███████████████████████████████| 1/1 [00:00<00:00, 111.32it/s]


[(['The single strongest risk factor for pneumonia is HIV infection, which is especially prevalent in children in sub-Saharan Africa. HIV-infected children have 6 times increased odds of developing severe pneumonia or of death compared to HIV-uninfected children [52] . Since the effective prevention of mother-to-child transmission of HIV, there is a growing population of HIV-exposed children who are uninfected; their excess risk of pneumonia, compared to HIV unexposed children, has been described as 1.3-to 3.4-fold higher [53] [54] [55] [56] [57] .',
   'We carried out an association study of DC-SIGNR polymorphism in 197 infants born to untreated HIV-1-infected mothers recruited in Harare, Zimbabwe. Among them, 97 infants were HIV-1-infected and 100 infants remained uninfected. Of the 97 HIV-1-infected infants, 57 were infected IU, 11 were infected IP, and 17 were infected PP. Timing of infection was not determined for 12 HIV-1-infected infants. Baseline characteristics of mothers and 

In [9]:
vector_store.embedding_model_name_or_path

'../models/covid/embedding/bge-small-covid_10e_128bs'
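
In the output shown, the base and fine-tuned stores return the same two passages for this query, in swapped order. A small helper like the one below can make such rank changes explicit. It is a sketch that works on plain lists of passage strings; the exact structure returned by `buscar_por_batches` is only partly visible here, so extracting the passage lists from `results` is left out, and the name `rank_changes` is hypothetical.

```python
from typing import List, Optional, Tuple


def rank_changes(base: List[str], tuned: List[str]) -> List[Tuple[str, Optional[int], int]]:
    """For each passage in `tuned`, report (preview, rank in base or None, rank in tuned)."""
    base_rank = {passage: i for i, passage in enumerate(base, start=1)}
    rows = []
    for i, passage in enumerate(tuned, start=1):
        rows.append((passage[:40], base_rank.get(passage), i))
    return rows


# Shortened stand-ins for the two passages seen in the outputs above.
base_top = [
    "We carried out an association study of DC-SIGNR polymorphism",
    "The single strongest risk factor for pneumonia is HIV infection",
]
tuned_top = [
    "The single strongest risk factor for pneumonia is HIV infection",
    "We carried out an association study of DC-SIGNR polymorphism",
]

rows = rank_changes(base_top, tuned_top)
for preview, old_rank, new_rank in rows:
    print(f"{old_rank} -> {new_rank}: {preview}")
```

A `None` in the middle column would mark a passage that the fine-tuned store retrieved but the base store did not return within its top results.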

## Phi fine-tuning

In [1]:
import sys
sys.path.append('../src')
from utils.data_for_train_phi import get_dataset_for_train_phi
from vector_stores.faiss import VectorStoreFaiss
vector_store = VectorStoreFaiss.load_local("../vector_stores/covid/ft_vs_covid_150_20")
train_ds, test_ds = get_dataset_for_train_phi('covid', True, vector_store, 4, 8)

Load model embedding : ../models/covid/embedding/bge-small-covid_10e_128bs
Using device: cuda
💾 Vector store loaded from../vector_stores/covid/ft_vs_covid_150_20
Creating dataset for covid
Loading dataset splits for covid
Train: 1292
Val: 323
Test: 404
Datasets loaded and prepared.


🔍 Searching: 100%|████████████████████████████████████████████████████████████████████| 162/162 [00:02<00:00, 80.48it/s]
🔍 Searching: 100%|██████████████████████████████████████████████████████████████████████| 41/41 [00:00<00:00, 94.63it/s]
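
The progress-bar totals can be checked with simple arithmetic. Assuming the last positional argument of `get_dataset_for_train_phi` (8) is the search batch size, the 1292-example split needs 162 batches and the 323-example split needs 41, which matches the two bars above. The same calculation explains the 5 embedding batches reported when 8397 passages were encoded with `--bs_emb 2048`. The helper name `num_batches` is hypothetical.

```python
import math


def num_batches(n_items: int, batch_size: int) -> int:
    """Number of batches needed to cover n_items, counting a final partial batch."""
    return math.ceil(n_items / batch_size)


search_batches_first = num_batches(1292, 8)    # 1292-example split, assumed batch size 8
search_batches_second = num_batches(323, 8)    # 323-example split, assumed batch size 8
embedding_batches = num_batches(8397, 2048)    # passages encoded with --bs_emb 2048

print(search_batches_first, search_batches_second, embedding_batches)
```

Note that the second total corresponds to the 323-example split, which the loader prints as `Val`, rather than to the 404-example `Test` split.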


In [2]:
print(train_ds[0]['text'])

Instruct:  Using the information in the context, answer the question as concisely and faithfully as possible.If the context does not provide enough information to answer confidently, answer based on the most likely interpretation from the given text.

Context:

Document 0:. As of 26 January 2020, the still ongoing outbreak had resulted in 2066 (618 of them are in Wuhan) confirmed cases and 56 (45 of them were in Wuhan) deaths in mainland China [4] , and sporadic cases exported from Wuhan were reported in Thailand, Japan, Republic of Korea, Hong Kong, Taiwan, Australia, and the United States, please see the World Health Organization (WHO) news release via https://www.who.int/csr/don/en/ from 14 to 21 January 2020
Document 1:The first three cases detected were reported in France on 24 January 2020 and had onset of symptoms on 17, 19 and 23 January respectively [10] . The first death was reported on 15 February in France. As at 21 February, nine countries had reported cases ( Figure) : Be
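
The printed example suggests a layout of an `Instruct:` line, a `Context:` header, and one `Document i:` line per retrieved passage. The sketch below rebuilds only that visible portion. The rest of the template is cut off in the output above and is not reconstructed here, and `build_prompt_prefix` is a hypothetical name rather than a function from `data_for_train_phi`.

```python
from typing import List


def build_prompt_prefix(instruction: str, documents: List[str]) -> str:
    """Assemble the instruction and numbered context documents in the layout printed above."""
    lines = [f"Instruct:  {instruction}", "", "Context:", ""]
    lines += [f"Document {i}:{doc}" for i, doc in enumerate(documents)]
    return "\n".join(lines)


# Placeholder instruction and passages, used only to show the layout.
prefix = build_prompt_prefix(
    "Using the information in the context, answer the question.",
    ["first retrieved passage", "second retrieved passage"],
)
print(prefix)
```

Numbering documents from 0 follows the printed example; whatever the training script appends after the context block (for instance the question and answer fields) would need to be taken from the script itself.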

In [1]:
!python ../scripts/ft_phi.py \
  --new_model_name "phi_2_rag_covid_k3_5e_10bs" \
  --num_epochs 5 \
  --batch_size 10 \
  --dataset_name "covid" \
  --include_docs \
  --top_k 3 \
  --save_path "../models/covid/adapters/" \
  --vector_store_path "../vector_stores/covid/ft_vs_covid_150_20"

[34m[1mwandb[0m: Currently logged in as: [33mdinho15971[0m ([33mdinho15971-unicamp[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Tracking run with wandb version 0.19.10
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/local1/ronaldinho/projects/solving_problems/test_sbbd/notebooks/wandb/run-20250430_232204-hf76a9qp[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mphi_2_rag_covid_k3_5e_10bs[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/dinho15971-unicamp/SBBD_phi-2-adapters[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/dinho15971-unicamp/SBBD_phi-2-adapters/runs/hf76a9qp[0m
Using device: cuda
Load model embedding : ../models/covid/embedding/bge-small-covid_10e_128bs
Using device: cuda
💾 Vector store loaded from../vector_stores/covid/ft_vs_covid_150_20
Using k = 3 passages
Creating dataset for covid


## Inference

In [1]:
!python ../scripts/inference_rag.py \
  --lora_adapter_path "../models/covid/adapters/best_phi_2_rag_covid_k3_5e_10bs" \
  --max_new_tokens 80 \
  --vector_store_path "../vector_stores/covid/ft_vs_covid_150_20" \
  --dataset_name "covid" \
  --output_csv_path "../results/covid/full_ft_covid.csv" \
  --bs_emb 50 \
  --bs_gen 8 \
  --top_k 10 \
  --use_rag

Using device: cuda
Loading base model from LoRA adapter: ../models/covid/adapters/best_phi_2_rag_covid_k3_5e_10bs
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00, 18.40it/s]
Load model embedding : ../models/covid/embedding/bge-small-covid_10e_128bs
Using device: cuda
💾 Vector store loaded from../vector_stores/covid/ft_vs_covid_150_20
Device set to use cuda:0
The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['AriaTextForCausalLM', 'BambaForCausalLM', 'BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CohereForCausalLM', 'Cohere2ForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'DbrxForCausalLM', 'DeepseekV3ForCausalLM', 'DiffLlamaForCausalLM', 'ElectraForCausalLM', 'Emu3F