### 💻 set up your environment

import `Processor` to restructure and clean your data so it is ready for rigorous, accurate tuning toward your goals

###### ℹ️ your source data must be a csv/json/parquet path OR a dataframe object, and it must contain **at least** one column with plaintext
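
a quick sanity check before loading can save a failed run. the sketch below is only an illustration built on python's standard `csv` module: `has_text_column` and the column name `text` are hypothetical and not part of magnet.

```python
import csv
import io


def has_text_column(csv_file, column):
    """Return True if `column` exists and at least one row has non-empty text in it."""
    reader = csv.DictReader(csv_file)
    if column not in (reader.fieldnames or []):
        return False
    return any((row.get(column) or "").strip() for row in reader)


# Tiny in-memory stand-in for a real export; the column names are made up.
sample = io.StringIO("file,text\na.txt,Magnets attract iron. Like poles repel.\n")
print(has_text_column(sample, "text"))  # True
```

for a real file, you would pass an open file handle (for example `open(path, newline="")`) instead of the in-memory sample.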

###### ⚡️🧲 assuming the data for each transformation step is ready to go, this block is all you need to initialize!

In [1]:
from magnet.filings import Processor
source_data_file = "./raw/kb_export_clean.csv"

### 📑 create sentences from plaintext

we point a `Processor` instance at an input file to create `filings` out of our data

###### ℹ️ you do not need an `id` column: we automatically create a document-level integer `id` for each of your sentences, but keep this in mind when re-indexing your embeddings back to sentences or documents!
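
to make the re-indexing point concrete, here is a minimal sketch, assuming your embeddings are stored in the same order as the sentence rows. the row layout and `group_by_document` are made up for illustration; check your exported file for the real column names.

```python
from collections import defaultdict

# Hypothetical sentence-level rows: (document id, sentence).
# The real column layout of the exported file is not shown in this notebook.
rows = [
    (0, "Magnets attract iron."),
    (0, "Like poles repel."),
    (1, "NATS is a messaging system."),
]


def group_by_document(rows):
    """Map each document id to the positions of its sentences in `rows`."""
    index = defaultdict(list)
    for position, (doc_id, _sentence) in enumerate(rows):
        index[doc_id].append(position)
    return dict(index)


doc_index = group_by_document(rows)
print(doc_index)  # {0: [0, 1], 1: [2]}
```

with an index like this, the embedding at position `i` belongs to `rows[i]`, and `doc_index[doc_id]` lists every embedding that came from one document.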

###### ℹ️ then we load the specific raw data file into memory! this protects your sequential workflow requirements, but if you need to conserve resources between steps, look into the `.unload()` function!

In [2]:
filings = Processor()
filings.load(source_data_file)

🌊 SUCCESS: loaded - ./raw/kb_export_clean.csv


##### 🥳 great! let's process our data, _fast_

⚡️🧲 first we extract sentences for our embedding model to get initial scores and examples, which we call `filings`

###### ℹ️ don't forget to declare your plaintext column's name! we do not persist this between objects
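
what happens inside `export_as_sentences` is not shown in this notebook. purely as a rough mental model of rule-based sentence extraction, here is a sketch that splits on sentence-ending punctuation with the standard `re` module; `split_sentences` is a hypothetical helper, not magnet's implementation, and real text (abbreviations, decimals) needs more care.

```python
import re


def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?'; drop empty pieces."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return [piece for piece in pieces if piece]


print(split_sentences("Magnets attract iron. Do like poles repel? Yes!"))
# ['Magnets attract iron.', 'Do like poles repel?', 'Yes!']
```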

In [4]:
filings.export_as_sentences('./data/filings.parquet', 'clean', 'file', nlp=False)

☕️ WAIT: get coffee or tea - 65822 processing...


100%|██████████| 65822/65822 [01:57<00:00, 562.42it/s]


🌊 SUCCESS: saved - ./data/filings.parquet


### Next-generation data communication & processing with NATS

What if you have a massive volume of data and the workload needs to be distributed?
Magnet supports [NATS](https://nats.io) through a clever abstraction for sharing data transformation receipts as well as the data itself.

In [None]:
from magnet.ic import field
charge = field.Charge("my-user:T0pS3cr3t@nats-cluster.workspace.svc.cluster.local") # your NATS server
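
you may prefer not to keep credentials inline in a notebook. a minimal sketch, assuming the connection string keeps the `user:password@host` shape shown above; the environment variable names and `nats_server_from_env` are hypothetical:

```python
import os


def nats_server_from_env(env=os.environ):
    """Build a 'user:password@host' string from (hypothetical) environment variables."""
    user = env["NATS_USER"]
    password = env["NATS_PASSWORD"]
    host = env["NATS_HOST"]
    return f"{user}:{password}@{host}"


# Demo with a plain dict standing in for the real environment.
demo_env = {"NATS_USER": "my-user", "NATS_PASSWORD": "s3cr3t", "NATS_HOST": "nats.example.internal"}
print(nats_server_from_env(demo_env))  # my-user:s3cr3t@nats.example.internal
```

the resulting string could then be passed to `field.Charge(...)` in place of the literal.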

#### 🎈☁️ upload to S3 on AWS

how easy! not much to say here other than this will upload your entire processed data folder when done (we assume you got your original data from somewhere!)

In [None]:
from magnet.utils import Utils

Utils().upload_to_s3(
    './data/',
    ('AWS_CLIENT_KEY', 'AWS_SECRET_KEY'),
    'bucket_name',
    'finetuning_data',
)
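
since the whole folder is uploaded, it can help to preview what is in it first. a sketch using only the standard library; `list_upload_candidates` is a hypothetical helper, and whether `upload_to_s3` walks subfolders the same way is an assumption to verify.

```python
import tempfile
from pathlib import Path


def list_upload_candidates(folder):
    """Return every file under `folder`, relative to it, sorted for stable output."""
    root = Path(folder)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file()
    )


# Demo on a throwaway folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "filings.parquet").write_bytes(b"")
    print(list_upload_candidates(tmp))  # ['filings.parquet']
```

in this notebook you would call `list_upload_candidates('./data/')` before running the upload cell.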