### 💻 set up your environment

###### your source data should be a csv/json path or a dataframe and contain **at least** one column with plaintext

###### ⚡️ assuming for each transformation that you have the step-wise data ready to go, this block is all you need to initialize!

In [1]:
from magnet.utils import Utils, _f

raw_dir = "./raw"
cleaned_dir = "./data"
source_data_file = "knowledge_base_export.csv"
plaintext_column = "clean"

No GPU being used. Careful, inference might be very slow!


🚨 WARN: CUDA is not available on this machine.
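
###### a quick sanity check of the plaintext-column requirement, as a minimal sketch using only the standard library. it is independent of `magnet`: `has_plaintext_column` is a hypothetical helper and the sample rows are made up for illustration

```python
import csv
import io


def has_plaintext_column(csv_text: str, column: str) -> bool:
    """True if `column` is in the header and at least one row has non-empty text in it."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or column not in reader.fieldnames:
        return False
    return any((row.get(column) or "").strip() for row in reader)


# made-up rows standing in for the real export (assumption, not real data)
sample = "answerId,clean\n1,The sky is blue.\n2,\n"
print(has_plaintext_column(sample, "clean"))  # True
print(has_plaintext_column(sample, "body"))   # False
```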


### 📑 create sentences from plaintext
###### we set an output filename, an input directory, and an output directory with our `Processor` class

###### then we load the specific raw data file into memory

In [None]:
from magnet.processing import Processor
sentences_filename = 'knowledge_base_sentences'
kb_sentence_proc = Processor(sentences_filename, raw_dir, cleaned_dir)
kb_sentence_proc.load(source_data_file)

ℹ️ INFO: Processor init
🌊 SUCCESS: loaded - /Users/dylanalloy/VSCode/LLM/kb-embeddings/lib/magnet.git/raw/knowledge_base_export.csv


##### 🥳 great! let's process our data, _fast_

##### ⚡️ first we extract sentences for our embedding model to get initial scores and examples from

###### don't forget to declare your plaintext column's name!

In [None]:
kb_sentence_proc.export_with_sentences(plaintext_column)

☕️ WAIT: get coffee or tea - 5726 processing...


  0%|          | 0/5725 [00:00<?, ?it/s]

100%|██████████| 5725/5725 [05:53<00:00, 16.17it/s]


🌊 SUCCESS: 🗳️  - /Users/dylanalloy/VSCode/LLM/kb-embeddings/lib/magnet.git/data/knowledge_base_sentences.json
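
###### to make the sentence extraction step concrete, here is a rough sketch of exploding one row per document into one row per sentence. this is not `magnet`'s implementation: the regex splitter, `split_sentences`, `explode_rows`, and the `sentence` output key are all assumptions for illustration

```python
import re
from typing import Dict, List

# naive boundary: ., ! or ? followed by whitespace and an uppercase letter or digit
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9])")


def split_sentences(text: str) -> List[str]:
    """Split plaintext into sentences on simple punctuation boundaries."""
    return [s.strip() for s in _BOUNDARY.split(text) if s.strip()]


def explode_rows(rows: List[Dict], column: str) -> List[Dict]:
    """Turn each document row into one row per sentence, keeping the other fields."""
    out = []
    for row in rows:
        for sentence in split_sentences(row[column]):
            out.append({**row, "sentence": sentence})
    return out


# made-up row for illustration
rows = [{"answerId": 1, "clean": "Rates rose. Markets fell! What next?"}]
for item in explode_rows(rows, "clean"):
    print(item["answerId"], item["sentence"])
```

###### a naive splitter like this breaks on abbreviations such as "Dr.", so a proper sentence tokenizer is the better choice on real data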


##### 📊 now let's score our sentences against those found in random batches of documents!

###### 📖 1️⃣ `split` defaults to 16, meaning 1/16 of your data is used to create examples (with `split=32`, 1/32 is used).

###### 📖 2️⃣ we then create a subsampling of our newly scored data. this is a requirement for sorting positive and negative samples later when we export finetuning datasets. 

###### we're going to use the `FinePrep` class to prepare our data for finetuning training runs

###### don't forget to declare your `group_by` column in `generate_scored_data` as well as the name of the original plaintext column so we can persist it across datasets

###### **✨ this scoring can be done with any `sentence-transformers` model you like, not necessarily the one you are finetuning (`model=''`)! you can also insert a custom** `prompt` **if the model benefits from it ✨**

###### (for example, when using `bge-large-en-v1.5` for `retrieval` instead of `similarity` tuning, a prompt is required)

###### 🚨 `generate_scored_data` can take some time if `use_multiprocessing` is not enabled, and using it is compute-intensive

###### ℹ️ multiprocessing is not needed if you are using CUDA
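
###### a toy sketch of what "scoring" two sentences can look like, assuming the score is the cosine similarity between their embeddings (the usual choice with `sentence-transformers` models, but an assumption about `magnet` here). the short vectors stand in for real embeddings, and `with_prompt` is a hypothetical helper showing where a custom prompt would be prepended

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def with_prompt(sentence: str, prompt: str = "") -> str:
    """Prepend an optional instruction prompt to the text before it is encoded."""
    return f"{prompt}{sentence}"


# toy 2-d "embeddings" (made up for illustration)
print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0
print(with_prompt("what is the policy rate?", "query: "))
```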

In [None]:
from magnet.finetune import FinePrep

sentences_data_file = 'knowledge_base_sentences.json'
scored_sentences_filename = 'scored_knowledge_base_sentences'
group_by = 'answerId'
task = 'similarity'
kb_prepper = FinePrep(
    scored_sentences_filename
    , cleaned_dir
    , cleaned_dir
)
kb_prepper.load(sentences_data_file)
kb_prepper.generate_scored_data(
    group_by
    , plaintext_column
    , split=32
    , use_multiprocessing=True
    , task=task
)

ℹ️ INFO: FinePrep init
🌊 SUCCESS: loaded - /Users/dylanalloy/VSCode/LLM/kb-embeddings/lib/magnet.git/data/knowledge_base_sentences.json
☕️ WAIT: get coffee or tea - 178 (1/32 of your data) processing...
🚨 WARN: 1/10 processes started from index 0 to 17/178 (17)
🚨 WARN: 2/10 processes started from index 17 to 34/178 (17)
🚨 WARN: 3/10 processes started from index 34 to 51/178 (17)
🚨 WARN: 4/10 processes started from index 51 to 68/178 (17)
🚨 WARN: 5/10 processes started from index 68 to 85/178 (17)
🚨 WARN: 6/10 processes started from index 85 to 102/178 (17)
🚨 WARN: 7/10 processes started from index 102 to 119/178 (17)
🚨 WARN: 8/10 processes started from index 119 to 136/178 (17)
🚨 WARN: 9/10 processes started from index 136 to 153/178 (17)
🚨 WARN: 10/10 processes started from index 153 to 178/178 (17)


🌊 SUCCESS: sample 11 - comparing 1023801 🧮 1003579: 100%|██████████| 17/17 [1:32:33<00:00, 326.70s/it]
🌊 SUCCESS: sample 9 - comparing 1021282 🧮 1015704: 100%|██████████| 17/17 [1:39:15<00:00, 350.32s/it]
🌊 SUCCESS: sample 0 - comparing 1024244 🧮 1016594: 100%|██████████| 17/17 [1:57:30<00:00, 414.74s/it]
🌊 SUCCESS: sample 20 - comparing 1020751 🧮 1018459: 100%|██████████| 17/17 [1:58:24<00:00, 417.91s/it]
🌊 SUCCESS: sample 3 - comparing 1021216 🧮 1022778: 100%|██████████| 17/17 [2:24:14<00:00, 509.08s/it]
🌊 SUCCESS: sample 16 - comparing 1016281 🧮 1015251: 100%|██████████| 17/17 [2:34:52<00:00, 546.63s/it]
🌊 SUCCESS: sample 3 - comparing 1011984 🧮 1008179: 100%|██████████| 17/17 [2:45:50<00:00, 585.30s/it]
🌊 SUCCESS: sample 12 - comparing 1003460 🧮 1016268: 100%|██████████| 17/17 [2:48:02<00:00, 593.10s/it]
🌊 SUCCESS: sample 10 - comparing 1011782 🧮 1018454: 100%|██████████| 2

🌊 SUCCESS: saved to - /Users/dylanalloy/VSCode/LLM/kb-embeddings/lib/magnet.git/data/scored_knowledge_base_sentences.json
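
###### the numbers in the log above are consistent with plain floor division: 5725 rows with `split=32` gives 178 samples, and 178 samples over 10 processes gives chunks of 17 with the last process taking the remainder. a small sketch of that arithmetic follows; `sample_size` and `chunk_ranges` are hypothetical helpers, and whether `magnet` computes it exactly this way is an assumption

```python
from typing import List, Tuple


def sample_size(total_rows: int, split: int) -> int:
    """Number of rows used when 1/`split` of the data is sampled."""
    return total_rows // split


def chunk_ranges(n_items: int, n_procs: int) -> List[Tuple[int, int]]:
    """Even (start, end) index ranges per process; the last one absorbs the remainder."""
    size = n_items // n_procs
    ranges = [(i * size, (i + 1) * size) for i in range(n_procs - 1)]
    ranges.append(((n_procs - 1) * size, n_items))
    return ranges


n = sample_size(5725, 32)
print(n)                        # 178
print(chunk_ranges(n, 10)[0])   # (0, 17)
print(chunk_ranges(n, 10)[-1])  # (153, 178)
```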


##### 🤯 amazing and easy! 

###### let's now use these scores to create positive and negative training examples for `BAAI/bge-large-en-v1.5` finetuning by creating a default 0.7 quantile split

###### ℹ️ check out a given model's documentation for more on the quantitative parts!

In [None]:
from magnet.finetune import FinePrep

finetuned_data_file = 'finetune_kb_dataset'

finetune_prepper = FinePrep(finetuned_data_file, cleaned_dir, cleaned_dir)
finetune_prepper.load('scored_knowledge_base_sentences.json')
finetune_prepper.generate_training_data()

ℹ️ INFO: FinePrep init
🌊 SUCCESS: loaded - /Users/dylanalloy/VSCode/LLM/kb-embeddings/lib/magnet.git/data/scored_knowledge_base_sentences.json


💙 ⣷: processed  - "Dr. Philip N. Jefferson took the oath of office as Vice Chair of the Board of Governors of the Federal Reserve System on Wednesday.": 100%|██████████| 75864/75864 [01:08<00:00, 1111.35it/s]

🌊 SUCCESS: written - /Users/dylanalloy/VSCode/LLM/kb-embeddings/lib/magnet.git/data/finetune_kb_dataset.jsonl
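
###### a minimal sketch of a quantile split, to show the idea behind turning scores into positives and negatives. the assumptions are loud here: scores at or above the `quant` quantile become positives and everything below becomes a negative, the `score` and `sentence` keys are hypothetical, and `magnet`'s actual rule may differ

```python
from typing import Dict, List


def quantile(values: List[float], q: float) -> float:
    """Linearly interpolated quantile of `values` for 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def split_by_quantile(scored: List[Dict], quant: float = 0.7) -> Dict[str, List[str]]:
    """Assumed rule: score >= threshold is a positive, anything lower a negative."""
    threshold = quantile([row["score"] for row in scored], quant)
    positives = [row["sentence"] for row in scored if row["score"] >= threshold]
    negatives = [row["sentence"] for row in scored if row["score"] < threshold]
    return {"pos": positives, "neg": negatives}


# made-up scores 0.1 .. 1.0 for ten sentences
scored = [{"sentence": f"s{i}", "score": i / 10} for i in range(1, 11)]
result = split_by_quantile(scored, quant=0.7)
print(result["pos"])  # ['s8', 's9', 's10']
```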


#### "but wait 🫵, there's more! 🤪" _('hard negative' mining)_

##### using Meta's `faiss` we can build a performant vector index to derive 'hard negatives', which helps extract more meaning from the quantile splits. we then collate the two datasets into one mega dataset with very good representation, without overfitting on our knowledge base!

###### 🧠 **intuition**: mining hard positives with a simple quantile split gives our finetuning dataset an edge because, in a specific domain, positive correlations are relatively rare by the 80th percentile (`quant` default). negative examples with relevance scores in the 20th percentile offer less insight, because a low score can be due to length differences in the samples, low standard deviation in the sample set, etc.

###### 📝 **takeaway**: mining hard negatives is an efficient way to get a representation of negatively correlated content from your knowledge base that is as robust for training as the one we already have for positively correlated content

In [None]:
input_file='finetune_kb_dataset.jsonl'
output_file='fn_hn_kb'
finetune_prepper.find_knn_neg(
    model='BAAI/bge-large-en-v1.5'
    , input_file=input_file
    , output_file=output_file
    , use_gpu=Utils().check_cuda()
    , sample_range=[0,500]
    , num_hard_negatives=15
)

🚨 WARN: CUDA is not available on this machine.
☕️ WAIT: inferencing embeddings for corpus - 2529
☕️ WAIT: inferencing embedding for queries - 75852
🌊 SUCCESS: create index and search


Batches: 100%|██████████| 1186/1186 [00:05<00:00, 202.44it/s]


🌊 SUCCESS: written - /Users/dylanalloy/VSCode/LLM/kb-embeddings/lib/magnet.git/data/fn_hn_kb.jsonl
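
###### to show what `sample_range` and `num_hard_negatives` plausibly control, here is a brute-force sketch of a common hard-negative recipe: rank the corpus by similarity to the query, keep the ranks inside `sample_range`, drop known positives, then sample. the brute-force search stands in for a `faiss` index, the toy vectors stand in for real embeddings, and it is an assumption that `find_knn_neg` follows this recipe

```python
import math
import random
from typing import Dict, List, Sequence, Set, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mine_hard_negatives(
    query_vec: Sequence[float],
    positives: Set[str],
    corpus: Dict[str, Sequence[float]],
    sample_range: Tuple[int, int] = (0, 500),
    num_hard_negatives: int = 15,
    seed: int = 0,
) -> List[str]:
    """Sample negatives from the nearest neighbours of the query that are not positives."""
    # nearest first; a real pipeline would query a vector index instead of sorting
    ranked = sorted(corpus, key=lambda text: cosine(query_vec, corpus[text]), reverse=True)
    window = ranked[sample_range[0]:sample_range[1]]
    candidates = [text for text in window if text not in positives]
    k = min(num_hard_negatives, len(candidates))
    return random.Random(seed).sample(candidates, k)


# toy corpus (made up): "a" is the known positive, "d" is far from the query
corpus = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.5, 0.5], "d": [0.0, 1.0]}
negatives = mine_hard_negatives([1.0, 0.0], {"a"}, corpus, sample_range=(0, 3), num_hard_negatives=2)
print(sorted(negatives))  # ['b', 'c']
```

###### the negatives that survive are close to the query without being positives, which is what makes them "hard"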


In [None]:
Utils().upload_to_s3(
    './data/fn_hn_kb.jsonl'
    , ('AWS_CLIENT_KEY', 'AWS_SECRET_KEY')
    , 'bucket_name'
    , 'finetuning_data'
)

🚨 WARN: uploading to S3 - ./data/fn_hn_kb.jsonl
🌊 SUCCESS: uploaded - bucket_name/finetuning_data/fn_hn_kb.jsonl
