# Data preparation for model training

In this notebook we prepare the dataset to later train our deep learning model. To do this we need to:

- Start a new W&B `run` and use our raw data `artifact`
- Split the data and save the splits into a new W&B `artifact`
- Join the information about the split with the W&B EDA `Table` ([see link - might need permission](https://wandb.ai/doc93/mlops-course-001/reports/Exploration-of-BDD1K-Autonomous-Vehicle-dataset--Vmlldzo1MDUzNjU1)), created in the notebook `01_ExplorDataAnalysis.ipynb`

In [1]:
import os, warnings
import wandb

import pandas as pd
from fastai.vision.all import *
from sklearn.model_selection import StratifiedGroupKFold

import params
warnings.filterwarnings('ignore')

# Download dataset from W&B artifact

Start a new W&B `run` so that we can reproduce the data processing if needed.

In [2]:
run = wandb.init(project=params.WANDB_PROJECT, entity=params.ENTITY, job_type="data_split")

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33md-oliver-cort[0m ([33mdoc93[0m). Use [1m`wandb login --relogin`[0m to force relogin


Use the `artifact` that we previously saved to W&B

In [5]:
# Get the latest version of the artifact (artifact names, etc. are stored in params)
raw_data_artifact = run.use_artifact(f'{params.RAW_DATA_AT}:latest')
path = Path(raw_data_artifact.download())

[34m[1mwandb[0m: Downloading large artifact bdd_simple_1k:latest, 846.60MB. 4007 files... 
[34m[1mwandb[0m:   4007 of 4007 files downloaded.  
Done. 0:0:11.3


In [6]:
# Check that artifact has downloaded properly
path.ls()

(#5) [Path('artifacts/bdd_simple_1k:v1/images'),Path('artifacts/bdd_simple_1k:v1/labels'),Path('artifacts/bdd_simple_1k:v1/LICENSE.txt'),Path('artifacts/bdd_simple_1k:v1/eda_table.table.json'),Path('artifacts/bdd_simple_1k:v1/media')]
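Eyeballing the listing works for a one-off run. As a minimal sketch, the same check could be made explicit with a small helper. The function name `missing_entries` and the choice of which entries count as required are assumptions for illustration; the entry names themselves are taken from the listing above.

```python
from pathlib import Path

def missing_entries(artifact_dir, expected=("images", "labels", "eda_table.table.json")):
    """Return the expected entries that are absent from artifact_dir.

    The default `expected` tuple is an assumption based on the listing above.
    """
    artifact_dir = Path(artifact_dir)
    return [name for name in expected if not (artifact_dir / name).exists()]
```

In this notebook, `missing_entries(path)` returning an empty list would suggest the download contains everything the later cells rely on.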

# Data split 

When splitting the data into train, validation, test sets, we need to take into account what we learned in the EDA stage ([see link - might need permission](https://wandb.ai/doc93/mlops-course-001/reports/Exploration-of-BDD1K-Autonomous-Vehicle-dataset--Vmlldzo1MDUzNjU1)).
 
The EDA results are stored in a W&B `Table`. From this table we need three things: the file names, the groups (derived from the file names) and the target (here we use our rare class, bicycle, for stratification).

In [11]:
# Retrieve the EDA table from the raw data artifact
orig_eda_table = raw_data_artifact.get("eda_table")

[34m[1mwandb[0m: Downloading large artifact bdd_simple_1k:latest, 846.60MB. 4007 files... 
[34m[1mwandb[0m:   4007 of 4007 files downloaded.  
Done. 0:0:10.4


In [12]:
# Get filenames from the EDA table column "File_Name"
fnames = orig_eda_table.get_column("File_Name")
# Or
# fnames = os.listdir(path/'images')

# Get first part of file name (the group)
groups = [s.split('-')[0] for s in fnames]
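To make the group extraction concrete, here is a small sketch on made-up file names. The names below are hypothetical and only assume a `<video_id>-<frame_id>.jpg` pattern; the real names in the dataset may look different.

```python
from collections import Counter

# Hypothetical file names following an assumed "<video_id>-<frame_id>.jpg" pattern
fnames = ["aaa111-0001.jpg", "aaa111-0002.jpg", "bbb222-0001.jpg"]

# Everything before the first "-" is treated as the group (video) identifier
groups = [name.split("-")[0] for name in fnames]
print(groups)  # ['aaa111', 'aaa111', 'bbb222']

# Counting frames per group shows how unevenly the groups are populated
frames_per_group = Counter(groups)
print(frames_per_group)
```

Frames that share a group come from the same video, which is why they should later land in the same split.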

In [14]:
# Use the "bicycle" column from the EDA table for stratification
y = orig_eda_table.get_column("bicycle")

We will split the data into train (80%), validation (10%) and test (10%) sets. We need to be careful to:

- Avoid `leakage`: by grouping data according to video identifier (we want to make sure that the model can generalize to new cars or video frames)

- Handle the label `imbalance`: by stratifying data with our target column ("bicycle")
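The leakage concern can be turned into a check: once every file has a stage, no group should appear in more than one stage. The following is a minimal, dependency-free sketch on toy data; the helper name `groups_in_multiple_stages` and the toy values are assumptions for illustration.

```python
from collections import defaultdict

def groups_in_multiple_stages(groups, stages):
    """Return {group: set_of_stages} for groups that appear in more than one stage."""
    seen = defaultdict(set)
    for group, stage in zip(groups, stages):
        seen[group].add(stage)
    return {group: found for group, found in seen.items() if len(found) > 1}

# Toy data (hypothetical): video "b" has frames in both train and test
groups = ["a", "a", "b", "b", "c"]
stages = ["train", "train", "train", "test", "valid"]
leaks = groups_in_multiple_stages(groups, stages)
print(leaks)
```

An empty result means every video is confined to a single stage. Applied to this notebook, the same helper could be called with the `groups` list and the final stage column once the split has been computed.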

In [29]:
df = pd.DataFrame()
df['File_Name'] = fnames
df['fold'] = -1

In [30]:
# We use sklearn's `StratifiedGroupKFold` to split the data into 10 folds 
# - assign 1 fold for test, 1 for validation and the rest for training
cv = StratifiedGroupKFold(n_splits=10)
for i, (train_idxs, test_idxs) in enumerate(cv.split(fnames, y, groups)):
    df.loc[test_idxs, ['fold']] = i

In [32]:
df['Stage'] = 'train'
df.loc[df.fold == 0, ['Stage']] = 'test'
df.loc[df.fold == 1, ['Stage']] = 'valid'
del df['fold']
df.Stage.value_counts()

Stage
train    800
valid    100
test     100
Name: count, dtype: int64
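The counts above confirm the 80/10/10 proportions, but they do not show whether the rare class is spread evenly. A possible follow-up, sketched here on toy data, is to compute the share of positive labels per stage. The helper name `positive_rate_per_stage` and the toy values are assumptions, and the sketch assumes the target values are truthy for positive rows.

```python
from collections import defaultdict

def positive_rate_per_stage(stages, labels):
    """Return {stage: fraction of rows whose label is truthy}."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for stage, label in zip(stages, labels):
        totals[stage] += 1
        positives[stage] += bool(label)
    return {stage: positives[stage] / totals[stage] for stage in totals}

# Toy data (hypothetical): a deliberately unbalanced assignment
stages = ["train"] * 4 + ["valid"] * 2 + ["test"] * 2
labels = [1, 0, 0, 0, 1, 0, 0, 0]
rates = positive_rate_per_stage(stages, labels)
print(rates)
```

In this notebook, something like `positive_rate_per_stage(df.Stage, y)` should give roughly similar rates across the three stages if stratification worked; exact equality is not expected, because whole groups have to stay together.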

In [36]:
# Save the dataframe to a CSV file
df.to_csv('data_split.csv', index=False)

# Create new artifact

In [37]:
# params.PROCESSED_DATA_AT contains the name of the new dataset artifact
processed_data_at = wandb.Artifact(params.PROCESSED_DATA_AT, type="split_data")

In [39]:
processed_data_at.add_file('data_split.csv', name='data_split.csv')
processed_data_at.add_dir(path)
print(path)

[34m[1mwandb[0m: Adding directory to artifact (./artifacts/bdd_simple_1k:v1)... Done. 3.4s


artifacts/bdd_simple_1k:v1


The split information is relevant for our analyses. 

Rather than uploading the images again, we save the split information to a new table and join it with the EDA table we created previously.

In [42]:
# Create a W&B table containing the stage info (train/valid/test) as a column
data_split_table = wandb.Table(dataframe=df[['File_Name', 'Stage']])

# Join new W&B table with EDA W&B table (avoid saving images again)
# - structure: wandb.JoinedTable(table_1, table_2, join_key)
join_table = wandb.JoinedTable(orig_eda_table, data_split_table, "File_Name")

In [44]:
# Add new W&B table to the artifact
processed_data_at.add(join_table, "eda_table_data_split")

<wandb.sdk.artifacts.artifact_manifest_entry.ArtifactManifestEntry at 0x151122070>

# Log artifact to W&B and finish `run`

In [45]:
run.log_artifact(processed_data_at)
run.finish()

