### 💻 set up your environment

###### your source data should be a csv/json path or a dataframe and contain **at least** one column with plaintext
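
a quick way to sanity-check that requirement before loading anything is sketched below. it is not part of `magnet`: the helper `find_plaintext_columns`, the `min_words` threshold, and the sample rows are all assumptions made for illustration, and the "looks like plaintext" rule is only a rough heuristic.

```python
import csv
import io


def find_plaintext_columns(rows, min_words=3):
    """Return names of columns whose values look like free text.

    Heuristic (an assumption for this sketch): a column counts as plaintext
    when at least one of its values has `min_words` or more
    whitespace-separated words.
    """
    if not rows:
        return []
    columns = []
    for name in rows[0].keys():
        values = [row[name] for row in rows if row[name]]
        if any(len(value.split()) >= min_words for value in values):
            columns.append(name)
    return columns


# Tiny in-memory stand-in for a real export file (hypothetical content).
sample_csv = io.StringIO(
    "answerId,clean\n"
    "101,Restart the modem. Then wait two minutes.\n"
    "102,Check the cable connection first.\n"
)
rows = list(csv.DictReader(sample_csv))

text_columns = find_plaintext_columns(rows)
if not text_columns:
    raise ValueError("source data needs at least one plaintext column")
print(text_columns)
```

with a real export you would pass rows read from your own file; the column names reported here are the candidates for the plaintext column you declare later.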

###### ⚡️ assuming for each transformation that you have the step-wise data ready to go, this block is all you need to initialize!

In [1]:
from magnet.processing import Processor
source_data_file = "./raw/knowledge_base_export.csv"

### 📑 create sentences from plaintext
###### we set an output filename, an input directory, and an output directory with our `Processor` class

###### then we load the specific raw data file into memory

In [2]:
kb_sentence_proc = Processor()
kb_sentence_proc.load(source_data_file)

🌊 SUCCESS: loaded - ./raw/knowledge_base_export.csv


##### 🥳 great! let's process our data, _fast_

##### ⚡️ first we extract sentences for our embedding model to get initial scores and examples from

###### don't forget to declare your plaintext column's name!

In [3]:
kb_sentence_proc.export_as_sentences('./data/sentences.parquet','clean','answerId')

☕️ WAIT: get coffee or tea - 5726 processing...
🌊 SUCCESS: saved - ./data/sentences.parquet
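
to make the idea of "one row per sentence" concrete, here is a minimal sketch of sentence extraction. it is an illustration under stated assumptions, not how `export_as_sentences` is implemented: the regex splitter is deliberately naive, and the helper `explode_sentences` and the output key `sentences` are hypothetical names.

```python
import re

# Naive boundary rule (an assumption): split on whitespace that follows
# a period, exclamation mark, or question mark.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def explode_sentences(rows, text_column, id_column):
    """Turn each plaintext row into one row per sentence, keeping its id."""
    sentence_rows = []
    for row in rows:
        for part in SENTENCE_BOUNDARY.split(row[text_column].strip()):
            if part:
                sentence_rows.append(
                    {id_column: row[id_column], "sentences": part}
                )
    return sentence_rows


# Hypothetical rows using the same column names as the cell above.
rows = [
    {"answerId": "101", "clean": "Restart the modem. Then wait two minutes."},
    {"answerId": "102", "clean": "Check the cable connection first."},
]

sentence_rows = explode_sentences(rows, text_column="clean", id_column="answerId")
for sentence_row in sentence_rows:
    print(sentence_row)
```

keeping the id column on every sentence is what lets later steps group sentences back to the document they came from.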


#### 🧮 indexing data 

in `magnet`, we have different submodules responsible for different parts of building our "data field" of vectors.

import `charge` to create a "Pole" and index your documents to it. 

In [None]:
from magnet.ize import charge
# name the instance `pole` so it does not shadow the imported `charge` module
pole = charge.Pole()
pole.index_document_embeddings(df=kb_sentence_proc.df)
pole.save_embeddings('./data/sentence_embeddings.index')

##### 📊 now let's score our sentences against those found in random batches of documents!

###### 📖 1️⃣ `split` defaults to 16, so a 1/16 fraction of your data is used to create examples.

###### 📖 2️⃣ we then create a subsampling of our newly scored data. this is a requirement for sorting positive and negative samples later when we export finetuning datasets. 

###### we're going to use the `FinePrep` class to prepare our data for finetuning training runs

###### don't forget to declare your `group_by` column in `generate_scored_data` as well as the name of the original plaintext column so we can persist it across datasets

###### **✨ this scoring can be done with any `sentence-transformers` model you like, not necessarily the one you are finetuning (`model=''`)! you can also insert a custom** `prompt` **if the model benefits from it ✨**

###### (for example, when using `bge-large-en-v1.5` for `retrieval` instead of `similarity` tuning, a prompt is required)

###### 🚨 `generate_scored_data` can take some time if `use_multiprocessing` is not enabled, and using it is compute-intensive

###### ℹ️ multiprocessing is not needed if you are using CUDA
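
the scoring and positive/negative sorting described above can be pictured with a small, dependency-free sketch. everything in it is an assumption for illustration: cosine similarity as the score, the toy 2-d vectors standing in for real sentence embeddings, and the helpers `cosine` and `rank_candidates`. it does not reproduce what `FinePrep` does internally.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_candidates(query_vec, candidates, k):
    """Score every candidate against the query and keep the k best."""
    scored = [(cosine(query_vec, vec), text) for text, vec in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy 2-d vectors standing in for real sentence embeddings (hypothetical).
candidates = [
    ("reset the router", [0.9, 0.1]),
    ("restart the modem", [1.0, 0.0]),
    ("pay your invoice", [0.0, 1.0]),
]

ranked = rank_candidates([1.0, 0.0], candidates, k=3)
positive = ranked[0]   # highest-scoring candidate
negative = ranked[-1]  # lowest-scoring candidate among the k kept
print(positive, negative)
```

once every sentence carries a score like this, picking positives from the top of the ranking and negatives from the bottom is a plain sort, which is why the scored subsample has to exist before the finetuning datasets are exported.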

In [4]:
from magnet.finetune import FinePrep
import os
import pandas as pd

df = pd.read_parquet('./data/sentences.parquet')
sentences_data_file = 'sentences.parquet'
task = 'similarity'
kb_prepper = FinePrep()
kb_prepper.load(os.path.join('./data/', sentences_data_file))
kb_prepper.generate_training_data(out_dir='./data/', k=64, index='./data/sentence_embeddings.index')

🌊 SUCCESS: loaded - ./data/sentences.parquet


  0%|          | 0/16996 [00:00<?, ?it/s]

🌊 SUCCESS: index loaded - ./data/sentence_embeddings.index


💜 ⣽: processed  - "If unable to locate the invoice, Spectrum Mobile devices: For dual-SIM capable devices, be sure to obtain the PRIMARY IMEI.":   4%|▎         | 607/16996 [17:30<7:52:53,  1.73s/it]

In [None]:
# Utils().upload_to_s3(
#     './data/fn_hn_kb.jsonl'
#     , ('AWS_CLIENT_KEY', 'AWS_SECRET_KEY')
#     , 'bucket_name'
#     , 'finetuning_data'
# )

🚨 WARN: uploading to S3 - ./data/fn_hn_kb.jsonl
🌊 SUCCESS: uploaded - bucket_name/finetuning_data/fn_hn_kb.jsonl
