### 💻 set up your environment

import `Processor` to restructure and clean your data so it is ready for rigorous, accurate tuning toward your goals

###### ℹ️ your source data must be a path to a csv/json/parquet file OR a dataframe object, and it must contain **at least** one column with plaintext

In [None]:
from magnet.filings import Processor
source_data_file = "./raw/kb_export_clean.csv"
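
before loading, it can help to confirm that the plaintext column is really there. the sketch below uses only the standard library and is not part of `magnet`; the helper name `count_text_rows`, the sample rows, and the assumption that `clean` is your plaintext column are all illustrative.

```python
import csv
import io

def count_text_rows(csv_file, column):
    """Count rows whose `column` holds non-empty text; raise if the column is missing."""
    reader = csv.DictReader(csv_file)
    if column not in (reader.fieldnames or []):
        raise KeyError(f"column {column!r} not found; available: {reader.fieldnames}")
    return sum(1 for row in reader if (row[column] or "").strip())

# Hypothetical sample standing in for a real export; the second row has no text.
sample = io.StringIO(
    "file,clean\n"
    "a.txt,Magnets attract iron filings. Filings align with the field.\n"
    "b.txt,\n"
)
usable = count_text_rows(sample, "clean")
print(usable)
```

for a real file, open it with `open(source_data_file, newline="")` and pass the handle in place of `sample`.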

### 📑 create chunks from documents

we hand an input file to a `Processor` instance to create `filings` out of our data

###### ℹ️ you do not need an `id` column: we automatically create a document-level integer id for each of your sentences. keep this in mind when re-indexing your embeddings back to sentences or documents!

###### ℹ️ then we load the raw data file into memory!

In [None]:
filings = Processor()
filings.load(source_data_file)
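
to make the `id` note above concrete, here is a minimal sketch of one way a document-level integer id can travel with each sentence and later be used to group sentences back into documents. this is an interpretation for illustration, not `magnet`'s implementation; `explode_sentences`, the naive regex splitter, and the sample documents are assumptions.

```python
import re

def explode_sentences(documents):
    """Split each document into sentences, tagging each one with its document's integer id."""
    rows = []
    for doc_id, text in enumerate(documents):
        # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            if sentence:
                rows.append({"id": doc_id, "sentence": sentence})
    return rows

docs = [
    "Magnets attract iron. Filings align with the field.",
    "NATS moves messages between nodes.",
]
rows = explode_sentences(docs)

# Re-index back to documents: collect sentence positions under each document id.
by_doc = {}
for position, row in enumerate(rows):
    by_doc.setdefault(row["id"], []).append(position)
print(by_doc)
```

if embeddings are stored in the same order as `rows`, the positions in `by_doc` pick out every embedding that belongs to a given document.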

##### 🥳 great! let's process our data, _fast_

⚡️🧲 first we extract sentences for our embedding model to get initial scores and examples which we call `filings`

###### ℹ️ don't forget to declare your plaintext column's name! we do not persist this between objects

In [None]:
await filings.process('./data/filings.parquet','clean','file', nlp=False)
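
the bare `await` above works because notebooks already run an event loop. in a plain Python script you would need to start one yourself, for example with `asyncio.run`. the sketch below shows the pattern with a hypothetical stand-in coroutine, `process_stub`, in place of the real `filings.process(...)` call.

```python
import asyncio

async def process_stub(path):
    """Hypothetical stand-in for an awaitable such as `filings.process(...)`."""
    await asyncio.sleep(0)  # yield to the event loop once, as real I/O would
    return path

async def main():
    # In a real script, the awaited call to the processor would go here.
    return await process_stub("./data/filings.parquet")

# A plain script has no running event loop, so start one explicitly.
result = asyncio.run(main())
print(result)
```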

### 🛰️ next-generation data communication & processing with NATS

###### ℹ️ suppose you have a massive volume of data and the workload needs to be distributed

##### `magnet` supports [NATS](https://nats.io) in a clever abstraction for sharing data.

In [None]:
from magnet.ic import field
nats_cluster = field.Charge("my-user:T0pS3cr3t@192.168.2.69") # your NATS cluster hostname & basic auth
clustered_filings = Processor(field=nats_cluster)

##### 🧲 load data

In [None]:
clustered_filings.load(source_data_file)

##### 🧲 at another workstation, run the following to receive processed data 📡

In [None]:
from magnet.ic import field
reso = field.Resonator("my-user:T0pS3cr3t@192.168.2.69")
await reso.on()

##### 🧲 export, but this time your data will also stream to a different node running `magnet` 🧲 as it's being written to the current one

In [None]:
await clustered_filings.process('./data/filings.parquet','clean','file', nlp=False)
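
the "write here, stream there" idea can be sketched without a cluster. the example below swaps NATS for an in-process `asyncio.Queue`, so it only illustrates the pattern and says nothing about how `magnet` or NATS implement it; the function names, the `None` sentinel, and the sample batches are assumptions.

```python
import asyncio

async def producer(batches, local_store, channel):
    """Store each batch locally and publish the same batch to the channel."""
    for batch in batches:
        local_store.append(batch)   # stands in for writing to the local output file
        await channel.put(batch)    # stands in for publishing to the cluster
    await channel.put(None)         # sentinel: no more data

async def consumer(channel, remote_store):
    """Receive batches until the sentinel arrives."""
    while (batch := await channel.get()) is not None:
        remote_store.append(batch)

async def main():
    channel = asyncio.Queue()
    local_store, remote_store = [], []
    batches = [["sentence 1", "sentence 2"], ["sentence 3"]]
    await asyncio.gather(
        producer(batches, local_store, channel),
        consumer(channel, remote_store),
    )
    return local_store, remote_store

local_store, remote_store = asyncio.run(main())
print(local_store == remote_store)
```

both stores end up with the same batches in the same order, which is the property you want from the second node.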

#### 🎈☁️ upload to S3

upload your entire processed data folder when done (we assume you got your original data from somewhere!)

In [None]:
from magnet.utils import Utils

Utils().upload_to_s3(
    './data/',
    ('AWS_CLIENT_KEY', 'AWS_SECRET_KEY'),
    'bucket_name',
    'finetuning_data',
)