## Data Preparation

In this notebook, we prepare the data that we will later use to train our deep learning model. To do so, we will:
- start a new W&B **run** and use our raw data artifact
- split the data and save the splits to a new W&B Artifact
- join information about the split with our EDA Table

In [1]:
import os, warnings
import wandb

import pandas as pd
from fastai.vision.all import *
from sklearn.model_selection import StratifiedGroupKFold
import params
warnings.filterwarnings('ignore')



Use the artifact we previously saved to W&B (we store artifact names and other global parameters in `params`).

In [2]:
run = wandb.init(project=params.WANDB_PROJECT, entity = params.ENTITY, job_type = "data_split")

[34m[1mwandb[0m: Currently logged in as: [33mtwelvve[0m. Use [1m`wandb login --relogin`[0m to force relogin



In [3]:
raw_data_at = run.use_artifact(f'{params.RAW_DATA_AT}:latest')
path = Path(raw_data_at.download())

[34m[1mwandb[0m: Downloading large artifact bdd_simple_1k:latest, 813.75MB. 4007 files... 
[34m[1mwandb[0m:   4007 of 4007 files downloaded.  
Done. 0:0:14.6


In [4]:
path.ls()

(#5) [Path('artifacts/bdd_simple_1k:v0/images'),Path('artifacts/bdd_simple_1k:v0/eda_table.table.json'),Path('artifacts/bdd_simple_1k:v0/media'),Path('artifacts/bdd_simple_1k:v0/LICENSE.txt'),Path('artifacts/bdd_simple_1k:v0/labels')]

To split the data into training, validation, and test sets, we need the file names, the groups (here, video identifiers derived from the file names), and the target (here we use our rare class, bicycle, for stratification). We previously saved these columns to the EDA table, so let's retrieve them from the table now.

In [5]:
orig_eda_table = raw_data_at.get("eda_table")

[34m[1mwandb[0m: Downloading large artifact bdd_simple_1k:latest, 813.75MB. 4007 files... 
[34m[1mwandb[0m:   4007 of 4007 files downloaded.  
Done. 0:0:0.3


In [6]:
fnames = orig_eda_table.get_column("File_Name")
groups = [s.split('-')[0] for s in fnames]
y = orig_eda_table.get_column("bicycle")
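
As a small illustration of the grouping rule used above, the sketch below derives a video identifier from each file name and counts the frames per video. The file names here are hypothetical and only mimic the `<video id>-<frame id>.jpg` pattern that the split on `-` assumes.

```python
from collections import Counter

# Hypothetical file names following the "<video id>-<frame id>.jpg" pattern.
example_fnames = [
    "vid0001-frameA.jpg",
    "vid0001-frameB.jpg",
    "vid0002-frameA.jpg",
]

# Same rule as in the notebook: everything before the first '-' is the group.
example_groups = [s.split('-')[0] for s in example_fnames]

# Count how many frames each video contributes.
frames_per_group = Counter(example_groups)
```

On the real table, a similar count over `groups` would show how many frames each video contributes to the dataset.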

Now we will split the data into train (80%), validation (10%), and test (10%) sets. As we do that, we need to be careful to:
- avoid leakage: for that reason we group the data by video identifier (we want to make sure our model can generalize to new cars or video frames)
- handle the label imbalance: for that reason we stratify the data by our target column

We will use sklearn's **StratifiedGroupKFold** to split the data into 10 folds and assign 1 fold to test, 1 to validation, and the rest to training.

In [14]:
df = pd.DataFrame()
df["File_Name"] = fnames
df['fold'] = -1

In [15]:
cv = StratifiedGroupKFold(n_splits=10)
for i, (train_idxs, test_idxs) in enumerate(cv.split(fnames, y, groups)):
    df.loc[test_idxs, ['fold']] = i

In [16]:
df.head()

               File_Name  fold
0  4e79e7cc-5d215a40.jpg     8
1  75b3cdd3-693bf40c.jpg     9
2  9b656e8f-c53b0000.jpg     4
3  8b4c6631-b27e8388.jpg     9
4  0d207cff-4d92f256.jpg     6


In [17]:
df['Stage'] = 'train'
df.loc[df.fold == 0, ['Stage']] = 'test'
df.loc[df.fold == 1, ['Stage']] = 'valid'
del df['fold']
df.Stage.value_counts()

train    800
test     100
valid    100
Name: Stage, dtype: int64
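
A quick sanity check on the stage assignment could confirm that no video identifier appears in more than one stage. The following is a minimal sketch, not part of the original notebook: `find_leaky_groups` is a hypothetical helper, and the rows are invented for illustration. It reuses the same rule as the split (the text before the first `-` is the group).

```python
from collections import defaultdict

def find_leaky_groups(file_names, stages):
    """Return video identifiers that appear in more than one stage."""
    stages_by_group = defaultdict(set)
    for name, stage in zip(file_names, stages):
        group = name.split('-')[0]  # same grouping rule as in the split
        stages_by_group[group].add(stage)
    return sorted(g for g, s in stages_by_group.items() if len(s) > 1)

# Hypothetical rows for illustration only.
example_names = ["aaaa1111-0001.jpg", "aaaa1111-0002.jpg", "bbbb2222-0001.jpg"]
clean_stages = ["train", "train", "valid"]
leaky_stages = ["train", "test", "valid"]

clean_result = find_leaky_groups(example_names, clean_stages)  # no leakage
leaky_result = find_leaky_groups(example_names, leaky_stages)  # one video in two stages
```

With the notebook's DataFrame, a call such as `find_leaky_groups(df["File_Name"], df["Stage"])` would be expected to return an empty list if the grouped split worked as intended.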

In [18]:
df.to_csv('data_split.csv', index= False)

Now we create a new Artifact and add our data to it.

In [19]:
processed_data_at = wandb.Artifact(params.PROCESSED_DATA_AT, type = "split_data")

In [20]:
processed_data_at.add_file("data_split.csv")
processed_data_at.add_dir(path)

[34m[1mwandb[0m: Adding directory to artifact (./artifacts/bdd_simple_1k:v0)... Done. 81.6s


Finally, the split information may be relevant for our analysis. Rather than uploading the images again, we will save the split information to a new table and join it with the EDA table we created previously.

In [21]:
data_split_table = wandb.Table(dataframe= df[["File_Name", "Stage"]])

In [22]:
join_table = wandb.JoinedTable(orig_eda_table, data_split_table, "File_Name")
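
To make the idea of joining on `File_Name` concrete, here is a purely conceptual sketch in plain Python. It is not how `wandb.JoinedTable` is implemented; `join_on_key` is a hypothetical helper and the rows are invented. It only shows how rows from two tables that share a key column can be combined into one row per key.

```python
# Hypothetical rows standing in for the EDA table and the split table.
eda_rows = [
    {"File_Name": "a-1.jpg", "bicycle": 1},
    {"File_Name": "b-1.jpg", "bicycle": 0},
]
split_rows = [
    {"File_Name": "a-1.jpg", "Stage": "train"},
    {"File_Name": "b-1.jpg", "Stage": "valid"},
]

def join_on_key(left, right, key):
    """Inner-join two lists of dicts on a shared key column."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**l_row, **right_by_key[l_row[key]]}
        for l_row in left
        if l_row[key] in right_by_key
    ]

joined_rows = join_on_key(eda_rows, split_rows, "File_Name")
```

Each joined row carries the original EDA columns plus the `Stage` column, which is why the images do not need to be uploaded a second time.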

Let's add the joined table to our artifact, log it, and finish our **run**.

In [23]:
processed_data_at.add(join_table, "eda_table_data_split")

ArtifactManifestEntry(path='eda_table_data_split.joined-table.json', digest='q4/KsAgEIJsld6wUqrUJcQ==', ref=None, birth_artifact_id=None, size=127, extra={}, local_path='/home/l3gion/.local/share/wandb/artifacts/staging/tmpcacmromk')

In [24]:
run.log_artifact(processed_data_at)
run.finish()