<a href="https://colab.research.google.com/github/wandb/edu/blob/main/mlops-001/lesson1/01_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
<!--- @wandbcode{course-lesson1} -->

# EDA 

In this notebook, we will take a locally stored copy of the NSynth audio dataset (validation split) and use W&B Artifacts and Tables to version and analyze our data.

In [1]:
DEBUG = False # set this flag to True to use a small subset of data for testing

We have defined some global configuration parameters in the `params.py` file. `ENTITY` should be your W&B team name if you work in a team; set it to `None` if you work individually.
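The contents of `params.py` are not shown in this notebook, so the following is only a minimal sketch of what it might look like. The three names (`WANDB_PROJECT`, `ENTITY`, `RAW_DATA_AT`) are the ones the code cells below read from `params`; the values are placeholders chosen for illustration.

```python
# params.py (sketch; the values below are assumptions, not the course's settings)
WANDB_PROJECT = "nsynth-eda"    # hypothetical project name
ENTITY = None                   # your W&B team name, or None when working individually
RAW_DATA_AT = "nsynth_valid"    # hypothetical name for the raw-data Artifact
```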

In [1]:
import numpy as np
import csv
import os
import json

from pathlib import Path
from tqdm import tqdm

import params
import wandb
wandb.login()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: gcpage. Use `wandb login --relogin` to force relogin


True

In [2]:
path = Path('C:/Users/Griffin/Documents/datasets/nsynth/nsynth-valid')
with open(path/'examples.json') as f:
    examples = json.load(f)
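Before building the table, it can help to check how the metadata is distributed. The sketch below assumes `examples` is a dict keyed by file stem whose values carry the `instrument_family` and `instrument_source` fields used later in this notebook. The three entries and the helper name `count_field` are invented so that the snippet runs on its own.

```python
from collections import Counter

# Hypothetical stand-in for the dict loaded from examples.json.
# Keys and values are made up for illustration only.
sample_examples = {
    "clip_a": {"instrument_family": 0, "instrument_source": 0},
    "clip_b": {"instrument_family": 0, "instrument_source": 1},
    "clip_c": {"instrument_family": 3, "instrument_source": 1},
}

def count_field(examples, field):
    "Count how many entries share each value of `field`."
    return Counter(entry[field] for entry in examples.values())

family_counts = count_field(sample_examples, "instrument_family")
source_counts = count_field(sample_examples, "instrument_source")
print(family_counts.most_common())
```

With the real data, the same helper could be called as `count_field(examples, "instrument_family")`.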

Here we define a helper function that builds a W&B `Table` from the audio files and their metadata.

In [9]:
def _create_table(audio_dir):
    "Create a W&B Table with one row per audio file in `audio_dir`"
    table = wandb.Table(columns=["File_Name",
                                 "Audio",
                                 "Instrument_Family",
                                 "Instrument_Source"])
    for audio_file in tqdm(audio_dir.iterdir()):
        # look up this file's metadata by its name without the extension
        meta = examples[audio_file.stem]
        table.add_data(audio_file.stem,
                       wandb.Audio(str(audio_file)),
                       meta['instrument_family'],
                       meta['instrument_source'])
    return table
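The `DEBUG` flag set at the top of the notebook is not referenced by `_create_table`. One way it could be wired in, shown here as a sketch and not as the course's own approach, is to cap the number of files that get iterated. The names `iter_audio_files` and `debug_limit` are hypothetical.

```python
from itertools import islice
from pathlib import Path

def iter_audio_files(audio_dir, debug=False, debug_limit=10):
    "Return an iterator over files in `audio_dir`, capped at `debug_limit` when `debug` is True."
    files = (p for p in Path(audio_dir).iterdir() if p.is_file())
    return islice(files, debug_limit) if debug else files
```

Inside `_create_table`, the loop would then read `for audio_file in tqdm(iter_audio_files(audio_dir, DEBUG)):`.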

We will start a new W&B `run` and put everything into a raw Artifact.

In [4]:
run = wandb.init(project=params.WANDB_PROJECT, entity=params.ENTITY, job_type="upload")
raw_data_at = wandb.Artifact(params.RAW_DATA_AT, type="raw_data")

Let's add the audio directory and the metadata file.

In [5]:
raw_data_at.add_dir(path/'audio', name='audio_valid')
raw_data_at.add_file(path/'examples.json', name='examples_valid')

wandb: Adding directory to artifact (C:\Users\Griffin\Documents\datasets\nsynth\nsynth-valid\audio)... Done. 173.7s


ArtifactManifestEntry(path='examples_valid', digest='O4e9ze9DpUjSAWEOVVfJ5A==', ref=None, birth_artifact_id=None, size=8838509, extra={}, local_path='C:\\Users\\Griffin\\AppData\\Local\\wandb\\wandb\\artifacts\\staging\\tmp8t_qrctn')

Let's iterate over the audio files in our dataset and use the function we defined above to create a W&B Table.

In [10]:
table = _create_table(path/'audio')

12678it [00:07, 1709.34it/s]


Finally, we will add the Table to our Artifact, log it to W&B and finish our `run`. 

In [11]:
raw_data_at.add(table, "eda_table")

ArtifactManifestEntry(path='eda_table.table.json', digest='e994ycZesQoGW90eIy45sg==', ref=None, birth_artifact_id=None, size=2727565, extra={}, local_path='C:\\Users\\Griffin\\AppData\\Local\\wandb\\wandb\\artifacts\\staging\\tmplpib_6gj')

In [12]:
run.log_artifact(raw_data_at)
run.finish()
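Once the Artifact is logged, a later run can retrieve it by name. The following is a sketch, assuming the Artifact was logged under `params.RAW_DATA_AT` and the table was added under the key `eda_table` as above. The function name `fetch_raw_data` and the `job_type` value are illustrative choices, not part of the course code.

```python
def fetch_raw_data(project, entity, artifact_name):
    "Download the latest version of the raw-data Artifact; return its local directory and table."
    import wandb  # imported lazily so the function can be defined without wandb installed

    # job_type is an arbitrary label chosen for this sketch
    with wandb.init(project=project, entity=entity, job_type="eda_check") as run:
        artifact = run.use_artifact(f"{artifact_name}:latest")
        local_dir = artifact.download()     # fetches every file in the Artifact
        table = artifact.get("eda_table")   # the wandb.Table stored under this key
    return local_dir, table
```

A call might look like `local_dir, table = fetch_raw_data(params.WANDB_PROJECT, params.ENTITY, params.RAW_DATA_AT)`.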