# Data preparation

In this notebook we will prepare the data that we later use to train our deep learning model. To do so, we will:
* start a new W&B `run` and use our raw data artifact
* split the data and save the splits into a new W&B Artifact
* join information about the split with our Label Table

In [1]:
import os, warnings
import wandb

import pandas as pd
from fastai.vision.all import *
from sklearn.model_selection import StratifiedGroupKFold

# import from file in the parent directory
import sys
sys.path.append('../')
import params
warnings.filterwarnings('ignore')

In [2]:
run = wandb.init(project=params.WANDB_PROJECT, entity=params.ENTITY, job_type="data_split")

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mapopov[0m ([33mijc-amp[0m). Use [1m`wandb login --relogin`[0m to force relogin


Let's use the artifact we previously saved to W&B (we store artifact names and other global parameters in `params`).

In [3]:
raw_data_at = run.use_artifact(f'{params.RAW_DATA_AT}:latest')
path = Path(raw_data_at.download())

[34m[1mwandb[0m: Downloading large artifact TCGA-COAD:latest, 342.03MB. 2831 files... 
[34m[1mwandb[0m:   2831 of 2831 files downloaded.  
Done. 0:0:8.0


In [4]:
path.ls()

(#3) [Path('artifacts/TCGA-COAD:v0/media'),Path('artifacts/TCGA-COAD:v0/labels_table.table.json'),Path('artifacts/TCGA-COAD:v0/patches')]

To split the data into training, validation and test sets, we need the file names, the groups (derived from the file names) and the target labels. We list the file names from the `patches` directory, derive the groups from them, and retrieve the labels from the `labels_table` we saved previously.

In [5]:
fnames = os.listdir(path/'patches') # names of all patches
groups = ['_'.join(s.split('_')[:-3]) for s in fnames] # names of all slides
print(f"We have {len(fnames)} patches from {len(np.unique(groups))} slides")

We have 1415 patches from 626 slides
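
The group name is obtained by dropping the last three underscore-separated fields of each file name. The short sketch below illustrates that rule on made-up file names; `slide_id` and the example names are hypothetical and only assume that every patch name ends with three such fields.

```python
def slide_id(fname: str) -> str:
    """Drop the last three '_'-separated fields to recover the slide name."""
    return '_'.join(fname.split('_')[:-3])

# Hypothetical patch names, for illustration only
examples = ["slideA_12_34_0.png", "slide_B_1_2_3.png"]
ids = [slide_id(f) for f in examples]
print(ids)
```

Underscores that belong to the slide name itself are preserved, because only the trailing three fields are removed.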


In [6]:
orig_labels_table = raw_data_at.get('labels_table')

[34m[1mwandb[0m: Downloading large artifact TCGA-COAD:latest, 342.03MB. 2831 files... 
[34m[1mwandb[0m:   2831 of 2831 files downloaded.  
Done. 0:0:7.9


In [8]:
y = orig_labels_table.get_column('Label')
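
The split further down pairs the i-th file name with the i-th label, so both sequences must be in the same order. `os.listdir` returns entries in arbitrary order, so it may be worth aligning the labels to the file names explicitly. Below is a minimal sketch on toy values. It assumes the table's `Fname` column holds the same names as the files in `patches`, which this notebook does not confirm; all values are hypothetical.

```python
# Toy stand-ins for the table columns and the directory listing (hypothetical values)
table_fnames = ["a_0_0_0.png", "b_0_0_0.png", "c_0_0_0.png"]
table_labels = ["pos", "neg", "neg"]
listed_fnames = ["c_0_0_0.png", "a_0_0_0.png", "b_0_0_0.png"]  # arbitrary order

# Look up each listed file in the table instead of relying on position
label_by_fname = dict(zip(table_fnames, table_labels))
missing = [f for f in listed_fnames if f not in label_by_fname]
if missing:
    raise ValueError(f"{len(missing)} files have no label, e.g. {missing[0]}")

y_aligned = [label_by_fname[f] for f in listed_fnames]
print(y_aligned)
```

In the notebook, the toy lists would presumably be replaced by `orig_labels_table.get_column('Fname')`, `orig_labels_table.get_column('Label')` and `fnames`.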

Now we will split the data into train (80%), validation (10%) and test (10%) sets. As we do that, we need to be careful to:

- *avoid leakage*: for that reason we group patches according to the slide they come from
- handle the *label imbalance*: for that reason we stratify data with our target column

We will use sklearn's `StratifiedGroupKFold` to split the data into 10 folds and assign 1 fold for test, 1 for validation and the rest for training.

In [13]:
df = pd.DataFrame()
df['Fname'] = fnames
df['fold'] = -1

In [14]:
cv = StratifiedGroupKFold(n_splits=10)
for i, (train_idxs, test_idxs) in enumerate(cv.split(fnames, y, groups)):
    df.loc[test_idxs, ['fold']] = i

In [15]:
df['Split'] = 'train'
df.loc[df.fold == 0, ['Split']] = 'test'
df.loc[df.fold == 1, ['Split']] = 'valid'
del df['fold']
df.Split.value_counts()

Split
train    1127
test      145
valid     143
Name: count, dtype: int64

In [16]:
df.to_csv('../data_split.csv', index=False)

In [17]:
df.columns, orig_labels_table.columns

(Index(['Fname', 'Split'], dtype='object'),
 ['Image', 'Label', 'Fname', 'Split'])

We will now create a new artifact and add our data there.

In [30]:
processed_data_at = wandb.Artifact(params.PROCESSED_DATA_AT, type="split_data")

In [31]:
processed_data_at.add_file('../data_split.csv')
processed_data_at.add_dir(path)

[34m[1mwandb[0m: Adding directory to artifact (./artifacts/TCGA-COAD:v0)... Done. 0.8s


Finally, the split information may be relevant for our analyses. Rather than uploading the images again, we will save the split information to a new table and join it with the EDA table we created previously.

In [32]:
data_split_table = wandb.Table(dataframe=df[['Fname', 'Split']])

In [33]:
orig_labels_table.columns

['Image', 'Label', 'Fname', 'Split']

In [36]:
data_split_table.columns

['Fname', 'Split']

In [39]:
join_table = wandb.JoinedTable(orig_labels_table, data_split_table, "Fname")
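
To get a feel for what joining on `Fname` produces, the sketch below performs a comparable key-based join locally with pandas. It is only an analogue on hypothetical rows: the notebook does not show how W&B resolves the join, and the frames and values here are made up.

```python
import pandas as pd

# Hypothetical rows standing in for the two tables
labels = pd.DataFrame({
    "Fname": ["a_0_0_0.png", "b_0_0_0.png", "c_0_0_0.png"],
    "Label": ["pos", "neg", "neg"],
})
split = pd.DataFrame({
    "Fname": ["c_0_0_0.png", "a_0_0_0.png", "b_0_0_0.png"],
    "Split": ["train", "test", "valid"],
})

# Match rows by key rather than by position; validate guards against duplicate keys
joined = labels.merge(split, on="Fname", how="inner", validate="one_to_one")
print(joined)
```

Because rows are matched on the key, the two tables do not need to be in the same order.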

Let's add it to our artifact, log it and finish our `run`.

In [40]:
processed_data_at.add(join_table, "labels_table_data_split")

ArtifactManifestEntry(path='labels_table_data_split.joined-table.json', digest='yIHTX4VMfx/UwBrsnioHXA==', size=126, local_path='/home/anton/.local/share/wandb/artifacts/staging/tmp1i_ab94l')

In [41]:
run.log_artifact(processed_data_at)
run.finish()