In [5]:
# Install wandb
!pip install wandb --quiet

In [6]:
!wandb --version

wandb, version 0.13.9


In [7]:
import os
import wandb
import pandas as pd

from sklearn.model_selection import StratifiedGroupKFold

import params

Initialise a run to track the split

In [9]:
run = wandb.init(
    project=params.WANDB_PROJECT,
    entity=params.ENTITY,
    job_type="data_split"
)

wandb: Currently logged in as: marioparreno. Use `wandb login --relogin` to force relogin


In the previous notebook we saved our data to an Artifact

Declaring that Artifact as an input of this run lets W&B track the lineage of our dataset

In [10]:
raw_data_at = run.use_artifact(
    'marioparreno/mlops-wandb-course/oranges:latest',
    type='raw_data'
)
artifact_dir = raw_data_at.download()

wandb: Downloading large artifact oranges:latest, 2521.61MB. 796 files... 


OSError: [Errno 22] Invalid argument: './artifacts/oranges:v0'
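
The `OSError` above points at the default download directory `./artifacts/oranges:v0`. If the notebook is running on Windows, a likely cause is the `:` in that path, which Windows does not allow in file names. A minimal sketch of one possible workaround follows: build a directory name without the colon and pass it to `download` through its `root` argument. The helper `safe_artifact_root` and the `-` replacement are assumptions made for illustration, not part of the original notebook.

```python
import os


def safe_artifact_root(artifact_name: str, base_dir: str = "artifacts") -> str:
    """Hypothetical helper: build a download directory name without ':'."""
    # "oranges:v0" becomes "oranges-v0"; ':' is not allowed in Windows file names
    return os.path.join(base_dir, artifact_name.replace(":", "-"))


root = safe_artifact_root("oranges:v0")

# In the notebook, the directory could then be passed to the download call:
# artifact_dir = raw_data_at.download(root=root)
```

If the run is on Linux or macOS the colon is normally accepted, so the cause would have to be looked for elsewhere.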

To create the splits we will need the data filenames and labels

That information is already available in our dataset and also in the Table we created for our EDA

Let's retrieve it from our EDA Table

In [None]:
orig_eda_table = raw_data_at.get("eda_table")

Now we can access the information from our Table easily with the `get_column` method

In [None]:
fnames = orig_eda_table.get_column("File_Name")

Given the filenames we are going to split the dataset

To do so, we are going to create an additional CSV file that maps each image to its corresponding split

First we are going to create an auxiliary column `fold`

In [None]:
df = pd.DataFrame()
df['File_Name'] = fnames
df['fold'] = -1

Now we are going to fill the `fold` column by generating the splits

In [None]:
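cv = StratifiedGroupKFold(n_splits=10)
# `y` (one label per file) and `groups` (one group id per file) are not defined
# in the cells above; build them from the EDA table before running this cell.
for i, (train_idxs, test_idxs) in enumerate(cv.split(fnames, y, groups)):
    # Store the index of the fold in which each sample is held out
    df.loc[test_idxs, ['fold']] = i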

Given the `fold` information we are going to define the splits for training, validation and test

In [None]:
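df['Stage'] = 'train'  # default: every sample belongs to the training set
df.loc[df.fold == 0, ['Stage']] = 'test'   # fold 0 is held out for testing
df.loc[df.fold == 1, ['Stage']] = 'valid'  # fold 1 is used for validation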
del df['fold']
df.Stage.value_counts()
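
With 10 folds, reserving one fold for testing and one for validation should give roughly an 80/10/10 split. A minimal sketch on made-up data, assuming the folds are perfectly balanced (real folds will only be approximately equal in size):

```python
import pandas as pd

# Toy frame (hypothetical): 100 samples spread evenly over 10 folds
df = pd.DataFrame({"fold": [i % 10 for i in range(100)]})

df["Stage"] = "train"
df.loc[df.fold == 0, "Stage"] = "test"   # assignment, not comparison
df.loc[df.fold == 1, "Stage"] = "valid"

counts = df.Stage.value_counts().to_dict()
```

Note that the stage is set with `=`. Using `==` in its place only evaluates a comparison and leaves every row as `train`.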

Finally, we save the data split locally

In [None]:
df.to_csv('data_split.csv', index=False)
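
Before logging the file, it can be worth reading it back and checking it. The sketch below is only illustrative: it uses a made-up three-row split and a temporary directory, and it verifies that each file appears once and that `Stage` only takes the three expected values.

```python
import os
import tempfile

import pandas as pd

# Toy split (hypothetical file names)
split_df = pd.DataFrame({
    "File_Name": ["a.jpg", "b.jpg", "c.jpg"],
    "Stage": ["train", "valid", "test"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "data_split.csv")
    split_df.to_csv(csv_path, index=False)
    loaded = pd.read_csv(csv_path)  # read the file back from disk

# Each image must be assigned to exactly one stage
no_duplicates = loaded["File_Name"].is_unique
valid_stages = set(loaded["Stage"]) <= {"train", "valid", "test"}
```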

Then we log the data split as an Artifact. First, create the Artifact

In [None]:
processed_data_at = wandb.Artifact(
    params.PROCESSED_DATA_AT,
    type="split_data"
)

Add the data relevant to the split dataset:
- The raw data (we could process it, etc)
- The split information
- The dataset (labels) information

In [None]:
processed_data_at.add_file('data_split.csv')  # The split information
# `add_dir` needs a local path; here, the directory the raw Artifact was downloaded to
processed_data_at.add_dir(artifact_dir)  # The raw data (we could process it, etc)

We are going to save the split information by using the Table object from W&B

To do so, we are going to take our previous EDA table, which contains valuable information such as the labels, and add the `Stage` information. For that, we create a new W&B Table with the `File_Name` and `Stage` columns

In [None]:
data_split_table = wandb.Table(
    dataframe=df[['File_Name', 'Stage']]
)

Then we create a joined table that merges our previous EDA table with the split table, joining on `File_Name`

In [None]:
join_table = wandb.JoinedTable(
    orig_eda_table,
    data_split_table,
    'File_Name'
)
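
Conceptually, joining the two tables on `File_Name` resembles a pandas merge on that key. The sketch below is an analogy on made-up data, not a description of how W&B implements `JoinedTable`; the `label` column, its values, and the inner join type are assumptions for illustration.

```python
import pandas as pd

# Stand-in for the EDA table (hypothetical column and values)
eda_df = pd.DataFrame({
    "File_Name": ["a.jpg", "b.jpg", "c.jpg"],
    "label": ["fresh", "rotten", "fresh"],
})

# Stand-in for the split table
split_df = pd.DataFrame({
    "File_Name": ["a.jpg", "b.jpg", "c.jpg"],
    "Stage": ["train", "valid", "test"],
})

# One row per file, carrying both the EDA columns and the stage
joined = pd.merge(eda_df, split_df, on="File_Name", how="inner")
```

The result has one row per image with the columns of both inputs, which is the view the joined table is meant to provide in the W&B UI.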

In [None]:
processed_data_at.add(join_table, "eda_table_data_split")

Now we can log our Artifact and finish the run

In [None]:
run.log_artifact(processed_data_at)
run.finish()