### 💻 set up your environment

import `Processor` to restructure and clean your data so that it can be tuned rigorously and accurately for your goals

###### ℹ️ your source data must be a csv/json/parquet path OR a dataframe object, and it must contain **at least** one column with plaintext
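if you want to check your source before loading it, a small pre-flight test along these lines can help. this is only a sketch of the requirement stated above: `check_source_path` and `has_plaintext_column` are hypothetical helpers written for illustration, not part of `magnet`, and they use only the standard library.

```python
from pathlib import Path

# File formats named in the note above; this set is an assumption for the sketch.
SUPPORTED_SUFFIXES = {".csv", ".json", ".parquet"}


def check_source_path(path):
    """Return the lowercase file suffix if it is a supported format, else raise."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_SUFFIXES:
        raise ValueError(f"unsupported source format: {suffix or 'no extension'}")
    return suffix


def has_plaintext_column(rows, column):
    """True if `rows` is non-empty and every record holds a non-empty string under `column`."""
    return bool(rows) and all(
        isinstance(row.get(column), str) and bool(row[column].strip())
        for row in rows
    )


print(check_source_path("./raw/kb_export_clean.parquet"))
```

the second helper takes a list of dict records, so it works on whatever you get from reading your file, whichever reader you use.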

###### ⚡️🧲 assuming for each transformation that you have the step-wise data ready to go, this block is all you need to initialize!

In [1]:
from magnet.filings import Processor
source_data_file = "./raw/kb_export_clean.parquet"

2023-09-23 19:41:31.644208: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-23 19:41:32.679424: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355

### 📑 create sentences from plaintext

we hand an input file to a `Processor` instance to create `filings` out of our data

###### ℹ️ you do not need an `id` column: we automatically create a document-level integer `id` for each of your sentences, but keep this in mind when re-indexing your embeddings back to sentences or documents!

###### ℹ️ then we load the specific raw data file into memory! this is to protect your sequential workflow requirements but if you need to conserve resources during steps, look into the `.unload()` function!
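to make the document-level `id` idea concrete, here is a minimal sketch of what "one integer per document, shared by all of its sentences" looks like, and how you can group sentence positions back under their document later. the regex splitter and the `split_into_sentences` helper are assumptions for illustration; `magnet` may split sentences differently.

```python
import re


def split_into_sentences(documents):
    """Split each document into sentences, tagging each with its document's integer id."""
    rows = []
    for doc_id, text in enumerate(documents):
        # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            if sentence:
                rows.append({"sentences": sentence, "id": doc_id})
    return rows


docs = ["First filing. It has two sentences.", "Second filing has one."]
rows = split_into_sentences(docs)

# Re-index: group sentence row positions back under their source document.
by_document = {}
for position, row in enumerate(rows):
    by_document.setdefault(row["id"], []).append(position)

print(by_document)
```

because an embedding index usually returns row positions, a mapping like `by_document` (or simply `rows[position]["id"]`) is what lets you get from a search hit back to the document it came from.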

In [2]:
filings = Processor()
filings.load(source_data_file)

🌊 SUCCESS: loaded - ./raw/kb_export_clean.parquet


##### 🥳 great! let's process our data, _fast_

⚡️🧲 first we extract sentences for our embedding model to get initial scores and examples which we call `filings`

###### ℹ️ don't forget to declare your plaintext column's name! we do not persist this between objects

In [3]:
filings.export_as_sentences('./data/sentences.parquet', 'clean', 'id')

☕️ WAIT: get coffee or tea - 13103 processing...


  1%|          | 120/13103 [00:02<03:34, 60.47it/s]

#### 🧮 indexing data 

in `magnet.ize`, we have different submodules responsible for different parts of building our "data field" of vectors.

import `charge` to create a "Pole" and index your sentences to it using your embedding model.
you can think of the vectors as electrons which belong to charged particles in a magnetic pole, or you can ignore the metaphor completely!

then we save our embeddings to re-use later and upload/share.
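if "indexing sentences to a Pole" feels abstract, the toy sketch below shows the underlying idea: every sentence becomes a vector, and searching means ranking those vectors by similarity to a query vector. this is a brute-force, standard-library stand-in written for illustration, not how `charge.Pole` is implemented; the 2-d vectors are made up, and a real embedding model emits far more dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index, query, k):
    """Return the k (position, score) pairs most similar to `query`, best first."""
    scored = [(position, cosine(vector, query)) for position, vector in enumerate(index)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings": one vector per sentence, stored in sentence order.
index = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

for position, score in search(index, [1.0, 0.0], k=2):
    print(position, round(score, 3))
```

the positions returned by `search` line up with the row order of your sentences, which is why keeping the sentence file and the saved index together matters.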

In [4]:
from magnet.ize import charge
import pandas as pd

sentences = pd.read_parquet('./data/sentences.parquet')
# use a distinct name so the imported `charge` module is not shadowed
pole = charge.Pole()
pole.index_document_embeddings(df=sentences)
pole.save_embeddings('./data/sentence_embeddings.index')

🌊 SUCCESS: embedded sentence 930554: 100%|██████████| 930555/930555 [4:12:35<00:00, 61.40it/s]


☕️ WAIT: indexing 930555 objects
🌊 SUCCESS: index created
🌊 SUCCESS: embeddings saved to ./data/sentence_embeddings.index


##### 🤯 amazing and easy! 

let's now use these scores to create positive and negative training examples. here we search, re-index back to our sentences in plaintext with their IDs, and return the results as positive and negative examples

in `magnet.ron`, we have functions to create tuned versions of your `Pole` after charging with sentences from your knowledge base. You can think of this as energy interacting with a prism and orienting certain features in certain angles (you'll understand this more as we dive into embedding model use-cases!). In our example we will do positively and negatively correlated samples for finetuning on a `similarity` task from our vast volume of data (energy).

###### ℹ️ we draw a random sample the size of your `split` argument from your data! you can also change how much of your index is allowed to pass through your prism with the `k` argument
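one way to picture `split` and `k` is sketched below. this is a guess at the general shape, not the actual `Prism` internals: we assume `split` is the number of query sentences sampled at random, `k` caps how many ranked neighbors are considered per query, and neighbors are sorted into positives and negatives by the sign of their similarity score. both helper names are hypothetical.

```python
import random


def sample_queries(num_rows, split, seed=0):
    """Pick `split` distinct row positions uniformly at random (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(range(num_rows), split)


def split_by_sign(ranked_neighbors, k):
    """Keep the first k (text, score) neighbors; positive scores become positives,
    negative scores become negatives, and zero scores are dropped."""
    considered = ranked_neighbors[:k]
    positives = [text for text, score in considered if score > 0]
    negatives = [text for text, score in considered if score < 0]
    return positives, negatives


queries = sample_queries(num_rows=1000, split=5)
print(len(queries))

# Made-up neighbors for one query, already sorted best first.
neighbors = [("a", 0.9), ("b", 0.2), ("c", -0.4), ("d", -0.8)]
print(split_by_sign(neighbors, k=3))
```

a larger `split` gives you more training queries, and a larger `k` lets more of the index through for each one, so both trade run time for coverage.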

Let's create our `Prism` instance from `tune`

###### ℹ️ check out a given model's documentation for more on how to format your data, ours defaults to the format required to tune `BAAI/bge-large-en-v1.5`
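for reference, the fine-tuning format we understand the FlagEmbedding tooling for `BAAI/bge-large-en-v1.5` to expect is JSON Lines, one object per line with `query`, `pos` and `neg` keys. the sketch below builds one such line with the standard library. whether `generate_training_data` writes exactly this layout is an assumption here, the helper names are hypothetical, and the example sentences are made up.

```python
import json


def to_training_record(query, positives, negatives):
    """Build one training example: a query plus lists of positive and negative texts."""
    return {"query": query, "pos": list(positives), "neg": list(negatives)}


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


# Illustrative text only, not taken from the knowledge base.
record = to_training_record(
    "The director disclaims beneficial ownership of the shares.",
    ["The reporting person disclaims beneficial ownership of these securities."],
    ["Exhibit 10.3 is a consulting agreement."],
)
jsonl = to_jsonl([record])
print(jsonl)
```

if your target model expects a different layout, this is the one place you would change.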

In [5]:
from magnet.ron import tune
import pandas as pd

sentences = pd.read_parquet('./data/sentences.parquet')
task = 'similarity'
kb_prepper = tune.Prism()
kb_prepper.load(sentences)
kb_prepper.generate_training_data(out_dir='./data/', index='./data/sentence_embeddings.index')

🌊 SUCCESS: loaded -                                                  sentences     id
0        Joseph Stilwell disclaims beneficial ownership...      0
1        Joseph Stilwell disclaims beneficial ownership...      0
2        Joseph Stilwell disclaims beneficial ownership...      0
3        ﻿ownershipDocument: schemaVersion: X0508docume...      0
4        ip: isDirector: 0isOfficer: 0isTenPercentOwner...      0
...                                                    ...    ...
1248817  Exhibit Number Exhibit Description 1.1 Form of...  13102
1248818   Pickett dated May 1, 2022 10.3+ Consulting Ag...  13102
1248819  cates a management contract or any compensator...  13102
1248820  By: /s/ Tim Pickett Tim Pickett Chief Executiv...  13102
1248821  d all documents in connection therewith, with ...  13102

[1248822 rows x 2 columns]


  0%|          | 0/78051 [00:00<?, ?it/s]

🌊 SUCCESS: index loaded - ./data/sentence_embeddings.index


📊 ⣽: processed  - "44 Nonfarm animal careta kers...................................................................................................... 163 66.3Gaming services workers........................................................................................................ .":   0%|          | 10/78051 [00:25<46:49:03,  2.16s/it]

❤️ ⢿: processed  - "Any amendment, repeal or modification of this Article IX, or the adoption of any provision of these Bylawsinconsistent with this Article IX, whether by action of the Board of Directors or the shareholders of the Corporation, shall not applyto or adversely affect any right or protection of a director of the Corporation existing at the time of such amendment, repeal, modificationor adoption.":   3%|▎         | 2224/78051 [1:17:34<44:04:42,  2.09s/it]

KeyboardInterrupt: 

#### 🎈☁️ upload to S3 on AWS

how easy! not much to say here other than this will upload your entire processed data folder when done (we assume you got your original data from somewhere!)
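if you are curious what "upload the whole folder" boils down to, here is a sketch under stated assumptions. it is not the `Utils().upload_to_s3` implementation: `plan_uploads`, `upload_folder` and `credentials_from_env` are hypothetical helpers, and `client` is any object with an `upload_file(filename, bucket, key)` method, which is the shape a boto3 S3 client offers. reading keys from the standard `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` environment variables keeps secrets out of your notebook.

```python
import os
from pathlib import Path


def plan_uploads(local_dir, prefix):
    """Map every file under `local_dir` to an object key below `prefix`."""
    root = Path(local_dir)
    return [
        (str(path), f"{prefix}/{path.relative_to(root).as_posix()}")
        for path in sorted(root.rglob("*"))
        if path.is_file()
    ]


def upload_folder(client, local_dir, bucket, prefix):
    """Upload each planned file and return the object keys that were written."""
    uploads = plan_uploads(local_dir, prefix)
    for filename, key in uploads:
        client.upload_file(filename, bucket, key)
    return [key for _, key in uploads]


def credentials_from_env():
    """Read AWS credentials from the environment instead of hard-coding them."""
    return os.environ.get("AWS_ACCESS_KEY_ID"), os.environ.get("AWS_SECRET_ACCESS_KEY")
```

with boto3 installed, `upload_folder(boto3.client("s3"), "./data/", "bucket_name", "finetuning_data")` would be the equivalent call, and passing a stand-in client lets you dry-run the key layout first.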

In [None]:
from magnet.utils import Utils

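Utils().upload_to_s3(
    './data/',  # local folder holding your processed data
    ('AWS_CLIENT_KEY', 'AWS_SECRET_KEY'),  # placeholders: never commit real keys
    'bucket_name',  # placeholder bucket name
    'finetuning_data',
)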