<a href="https://colab.research.google.com/github/wandb/edu/blob/main/mlops-001/lesson1/01_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
<!--- @wandbcode{course-lesson1} -->

# EDA 

In this notebook, we will use W&B Artifacts and Tables to version and analyze the audio clips and captions of the Clotho dataset.

In [None]:
DEBUG = False # set this flag to True to use a small subset of data for testing
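
The `DEBUG` flag is not referenced in the cells shown here, so how it restricts the data is left open. One possible approach, sketched below under that assumption, is to cap the number of files taken from each directory. The helper name `limit_items` and the cap of 10 are hypothetical.

```python
from itertools import islice

DEBUG_LIMIT = 10  # hypothetical cap; not part of the original notebook


def limit_items(items, debug, limit=DEBUG_LIMIT):
    """Return only the first `limit` items when `debug` is True, else all of them."""
    return list(islice(items, limit)) if debug else list(items)
```

It could then wrap the directory listing, for example `tqdm(limit_items(audio_dir.iterdir(), DEBUG))`.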

We have defined some global configuration parameters in the `params.py` file. `ENTITY` should correspond to your W&B Team name if you work in a team; replace it with `None` if you work individually.
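
For reference, a `params.py` compatible with the cells below could look like the following sketch. The three names are the ones this notebook reads; the values are placeholders assumed for illustration, not the ones used by the author.

```python
# params.py: placeholder values, assumed for illustration only
WANDB_PROJECT = "audio-captioning-eda"  # hypothetical project name
ENTITY = None                           # or your W&B Team name
RAW_DATA_AT = "clotho_raw"              # hypothetical artifact name
```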

In [None]:
import numpy as np
import csv
import json  # needed by the cell that reads examples.json
import os

from pathlib import Path
from tqdm import tqdm

import params
import wandb
wandb.login()

[34m[1mwandb[0m: Currently logged in as: [33mgcpage[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

In [3]:
path = Path('C:/Users/Griffin/Documents/datasets/nsynth/nsynth-valid')
with open(path/'examples.json') as f:
    labels = json.l

Here we define two helper functions: one creates an empty W&B `Table`, and the other fills it with one row per audio file so we can upload it to W&B.

In [12]:
def _create_table():
    "Create a table with the dataset"
    table = wandb.Table(columns=["File_Name",
                                 "Audio",
                                 "Split",
                                 "Caption_1",
                                 "Caption_2",
                                 "Caption_3",
                                 "Caption_4",
                                 "Caption_5"])
    return table

def _add_data_split(table, audio_dir, captions, split):
    "Add one row per audio file in `audio_dir`, tagged with its split and captions"
    for audio_file in tqdm(audio_dir.iterdir()):
        row = captions[audio_file.name]  # captions are keyed by file name
        table.add_data(audio_file.name,
                       wandb.Audio(str(audio_file)),
                       split,
                       row['caption_1'],
                       row['caption_2'],
                       row['caption_3'],
                       row['caption_4'],
                       row['caption_5'])

We will start a new W&B `run` and put everything into a raw Artifact.

In [5]:
run = wandb.init(project=params.WANDB_PROJECT, entity=params.ENTITY, job_type="upload")
raw_data_at = wandb.Artifact(params.RAW_DATA_AT, type="raw_data")

In [6]:
raw_data_at.add_file(path/'LICENSE', name='LICENSE')

ArtifactManifestEntry(path='LICENSE', digest='ONQirIosnDXCiCMlduLoEA==', ref=None, birth_artifact_id=None, size=1874, extra={}, local_path='C:\\Users\\Griffin\\AppData\\Local\\wandb\\wandb\\artifacts\\staging\\tmpp81dw_yb')

Let's add the audio clips, along with the caption and metadata CSV files, for the development and validation splits.

In [8]:
raw_data_at.add_dir(path/'clotho_audio_development', name='audio_train')
raw_data_at.add_file(path/'clotho_captions_development.csv', name='captions_train')
raw_data_at.add_file(path/'clotho_metadata_development.csv', name='metadata_train')
raw_data_at.add_dir(path/'clotho_audio_validation', name='audio_val')
raw_data_at.add_file(path/'clotho_captions_validation.csv', name='captions_val')
raw_data_at.add_file(path/'clotho_metadata_validation.csv', name='metadata_val')

[34m[1mwandb[0m: Adding directory to artifact (C:\Users\Griffin\Documents\datasets\clotho\clotho_audio_development)... Done. 767.7s
[34m[1mwandb[0m: Adding directory to artifact (C:\Users\Griffin\Documents\datasets\clotho\clotho_audio_validation)... Done. 109.2s


ArtifactManifestEntry(path='metadata_val', digest='LgEEJ8VrHOYAiw8D9BBIzg==', ref=None, birth_artifact_id=None, size=224803, extra={}, local_path='C:\\Users\\Griffin\\AppData\\Local\\wandb\\wandb\\artifacts\\staging\\tmpxi0h8c33')

Let's iterate over the audio files in each split and use the functions we defined above to create a W&B Table.

In [13]:
table = _create_table()
_add_data_split(table, path/'clotho_audio_development', captions_train, 'train')
_add_data_split(table, path/'clotho_audio_validation', captions_val, 'val')

3839it [03:56, 16.25it/s]
1045it [00:50, 20.62it/s]
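
The `captions_train` and `captions_val` lookups passed to `_add_data_split` are not defined in the cells shown. A minimal sketch of how they could be built follows, assuming each captions CSV has a `file_name` column plus `caption_1` to `caption_5` columns; the helper names `parse_captions` and `load_captions` are hypothetical.

```python
import csv


def parse_captions(lines):
    """Map each file name to its CSV row (a dict of column name -> value)."""
    # Assumes a header row containing a `file_name` column.
    return {row["file_name"]: row for row in csv.DictReader(lines)}


def load_captions(csv_path):
    """Read a captions CSV from disk and index its rows by file name."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return parse_captions(f)
```

With these helpers, `captions_train = load_captions(path/'clotho_captions_development.csv')` would give a dictionary that supports the `captions[audio_file.name]['caption_1']` lookups used above, and `captions_val` could be built the same way from the validation CSV.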


Finally, we will add the Table to our Artifact, log it to W&B and finish our `run`. 

In [14]:
raw_data_at.add(table, "eda_table")

ArtifactManifestEntry(path='eda_table.table.json', digest='gYaZkB3hia05NEo2Am+YOw==', ref=None, birth_artifact_id=None, size=2658126, extra={}, local_path='C:\\Users\\Griffin\\AppData\\Local\\wandb\\wandb\\artifacts\\staging\\tmpkzjkgnjk')

In [15]:
run.log_artifact(raw_data_at)
run.finish()
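
Once the artifact is logged, a later run can retrieve the table by name. The sketch below assumes the `eda_table` key used above and the `latest` alias; the helper `fetch_table` is hypothetical and takes the run object as an argument so it does not depend on any particular project.

```python
def fetch_table(run, artifact_name, table_name="eda_table", alias="latest"):
    """Declare the artifact as an input of `run` and return the stored table."""
    artifact = run.use_artifact(f"{artifact_name}:{alias}")
    return artifact.get(table_name)


# Possible usage in a follow-up notebook (the job_type value is an assumption):
# run = wandb.init(project=params.WANDB_PROJECT, entity=params.ENTITY, job_type="eda")
# table = fetch_table(run, params.RAW_DATA_AT)
```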