### 💻 set up your environment

import `Processor` to restructure and clean your data so that it can be tuned rigorously and accurately toward your goals

###### ℹ️ your source data must be a csv/json/parquet path OR a dataframe object, and it must contain **at least** one column with plaintext

###### ⚡️🧲 assuming for each transformation that you have the step-wise data ready to go, this block is all you need to initialize!

In [None]:
from magnet.filings import Processor
source_data_file = "./raw/kb_export_clean.parquet"
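a quick pre-flight check can catch a bad input before anything is loaded. the sketch below is not part of `magnet`: `check_source` is a hypothetical helper, and `clean` is assumed to be the name of your plaintext column. it only mirrors the requirement stated above (a csv/json/parquet path, or a dataframe-like object with a plaintext column).

```python
from pathlib import Path

# File types the note above lists as acceptable path inputs.
SUPPORTED_SUFFIXES = {".csv", ".json", ".parquet"}


def check_source(source, text_column):
    """Hypothetical pre-flight check; not a magnet function."""
    if isinstance(source, (str, Path)):
        suffix = Path(source).suffix.lower()
        if suffix not in SUPPORTED_SUFFIXES:
            raise ValueError(f"unsupported file type: {suffix!r}")
        return "path"
    # Anything else is assumed to be DataFrame-like and expose `.columns`.
    columns = list(getattr(source, "columns", []))
    if text_column not in columns:
        raise ValueError(f"missing plaintext column: {text_column!r}")
    return "dataframe"


# 'clean' is an assumed column name; swap in your own.
print(check_source("./raw/kb_export_clean.parquet", "clean"))
```

for a path, this only inspects the file extension; it does not open the file or confirm that the column exists inside it.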

### 📑 create sentences from plaintext

we pass an input file to a `Processor` instance to create `filings` out of our data

###### ℹ️ you do not need an `id` column, we will make a document-level, integer-wise one for each of your sentences automatically, but keep this in mind for re-indexing your embeddings back to sentences or documents!

###### ℹ️ then we load the specific raw data file into memory! this protects your sequential workflow requirements, but if you need to conserve resources between steps, look into the `.unload()` function!

In [None]:
filings = Processor()
filings.load(source_data_file)

##### 🥳 great! let's process our data, _fast_

⚡️🧲 first we extract sentences for our embedding model to get initial scores and examples which we call `filings`

###### ℹ️ don't forget to declare your plaintext column's name! we do not persist this between objects

In [None]:
filings.export_as_sentences('./data/sentences.parquet','clean','id')
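to make the id behaviour concrete, here is a rough sketch of what a sentence export could look like. it is not `magnet`'s implementation: the regex splitter is deliberately naive, the `sentence` key is a hypothetical output name, and it assumes each sentence row carries the integer id of the document it came from.

```python
import re


def split_into_sentences(documents):
    """Hypothetical stand-in for a sentence export; one row per sentence."""
    rows = []
    for doc_id, text in enumerate(documents):
        # Naive split: break after ., ! or ? when followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            if sentence:
                rows.append({"id": doc_id, "sentence": sentence})
    return rows


rows = split_into_sentences(["Reset the router. Wait ten seconds!", "Still offline?"])
for row in rows:
    print(row)
```

because the id repeats for every sentence in a document, you can group by it later to get from sentence-level results back to whole documents.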

#### 🧮 indexing data

in `magnet.ize`, we have different submodules responsible for different parts of building our "data field" of vectors.

import `charge` to create a "Pole" and index your sentences to it using your embedding model.
you can think of the vectors as electrons which belong to charged particles in a magnetic pole, or you can ignore the metaphor completely!

then we save our embeddings to re-use later and upload/share.

In [None]:
from magnet.ize import charge
import pandas as pd

sentences = pd.read_parquet('./data/sentences.parquet')
pole = charge.Pole()  # use a separate name so the `charge` module is not shadowed
pole.index_document_embeddings(df=sentences)
pole.save_embeddings('./data/sentence_embeddings.index')
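when you later search the index, hits usually come back as row positions rather than text. the sketch below shows one way to map them back, assuming the embeddings were stored in the same order as the sentence rows. `hits_to_sentences` and the `sentence` key are hypothetical names, not `magnet` API.

```python
def hits_to_sentences(hit_positions, sentence_rows):
    """Map index row positions back to (document id, sentence text) pairs."""
    return [
        (sentence_rows[pos]["id"], sentence_rows[pos]["sentence"])
        for pos in hit_positions
    ]


# Assumed layout: row i of the index was built from sentence_rows[i].
sentence_rows = [
    {"id": 0, "sentence": "Reset the router."},
    {"id": 0, "sentence": "Wait ten seconds!"},
    {"id": 1, "sentence": "Still offline?"},
]
print(hits_to_sentences([2, 0], sentence_rows))
```

if the sentence table is re-sorted or filtered after indexing, the positions no longer line up, so keep the saved index and the sentence file together.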

##### 🤯 amazing and easy!

let's now use these scores to create positive and negative training examples. here we search, re-index back to our sentences in plaintext with their IDs, and return the results as positive and negative examples

in `magnet.ron`, we have functions to create tuned versions of your `Pole` after charging it with sentences from your knowledge base. You can think of this as energy interacting with a prism and orienting certain features at certain angles (you'll understand this more as we dive into embedding model use-cases!). In our example we will generate positively and negatively correlated samples from our vast volume of data (energy) for finetuning on a `similarity` task.

###### ℹ️ we draw a random sample the size of your `split` argument from your data! you can also change how much of your index is allowed to pass through your prism with the `k` argument

Let's create our `Prism` class from `tune`

###### ℹ️ check out a given model's documentation for more on how to format your data; ours defaults to the format required to tune `BAAI/bge-large-en-v1.5`

In [None]:
from magnet.ron import tune
import pandas as pd

sentences = pd.read_parquet('./data/sentences.parquet')
task = 'similarity'
kb_prepper = tune.Prism()
kb_prepper.load(sentences)
kb_prepper.generate_training_data(out_dir='./data/', index='./data/sentence_embeddings.index')
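to build some intuition for `split` and `k`, here is a small pure-Python sketch of the idea: sample `split` query sentences at random, rank every other sentence by cosine similarity, and keep the `k` closest as positives and the `k` farthest as negatives. this is an illustration under stated assumptions, not how `Prism` works internally. `make_examples` is a hypothetical helper, and the `query`/`pos`/`neg` keys are an assumed layout, so check your model's documentation for the format it expects.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def make_examples(sentences, embeddings, split, k, seed=0):
    """Hypothetical sampler: `split` random queries, `k` positives and negatives each."""
    if k < 1:
        raise ValueError("k must be at least 1")
    rng = random.Random(seed)
    query_rows = rng.sample(range(len(sentences)), split)
    examples = []
    for q in query_rows:
        others = [i for i in range(len(sentences)) if i != q]
        # Most similar first, least similar last.
        ranked = sorted(
            others,
            key=lambda i: cosine(embeddings[q], embeddings[i]),
            reverse=True,
        )
        examples.append({
            "query": sentences[q],
            "pos": [sentences[i] for i in ranked[:k]],
            "neg": [sentences[i] for i in ranked[-k:]],
        })
    return examples


# Toy 2-d vectors standing in for real embeddings.
sentences = ["a", "b", "c", "d"]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
for example in make_examples(sentences, embeddings, split=2, k=1):
    print(example)
```

with a small index and a large `k`, the positive and negative lists start to overlap, which is one reason to keep `k` well below half the number of sentences.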

#### 🎈☁️ upload to S3 on AWS

how easy! not much to say here other than this will upload your entire processed data folder when done (we assume you got your cleaned data from somewhere!)

In [None]:
from magnet.utils import Utils

Utils().upload_to_s3(
    './data/'
    , ('AWS_CLIENT_KEY', 'AWS_SECRET_KEY')
    , 'bucket_name'
    , 'finetuning_data'
)
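rather than typing your keys into a notebook cell, you may prefer to read them from the environment. the sketch below assumes the credential tuple holds the key values themselves, in (access key, secret key) order. `aws_credentials_from_env` is a hypothetical helper, and `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` are the variable names the AWS CLI and SDKs conventionally read.

```python
import os


def aws_credentials_from_env(environ=os.environ):
    """Hypothetical helper: build an (access key, secret key) tuple from env vars."""
    try:
        return (environ["AWS_ACCESS_KEY_ID"], environ["AWS_SECRET_ACCESS_KEY"])
    except KeyError as missing:
        raise RuntimeError(f"set {missing.args[0]} before uploading") from None


# Demo with a stand-in mapping; omit the argument to read the real environment.
demo_env = {"AWS_ACCESS_KEY_ID": "example-id", "AWS_SECRET_ACCESS_KEY": "example-secret"}
print(aws_credentials_from_env(demo_env)[0])
```

the returned tuple could then take the place of the placeholder pair in the upload call above, which keeps secrets out of saved notebook output.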