<a href="https://colab.research.google.com/github/tcapelle/edu/blob/main/mlops-001/lesson1/01_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
<!--- @wandbcode{course-lesson1} -->

# EDA 

In this notebook, we will download a sample of the [BDD100K](https://www.bdd100k.com/) semantic segmentation dataset and use W&B Artifacts and Tables to version and analyze our data. 

In [1]:
# INSTALL WANDB
# !pip install wandb -qq

In [2]:
from fastai.vision.all import *
import params
import wandb

In [3]:
import fastai
fastai.__version__


'2.7.10'

We have defined some global configuration parameters in the `params.py` file. `ENTITY` should be your W&B team name if you work in a team; set it to `None` if you work individually.
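For reference, a `params.py` along these lines would provide the attributes this notebook reads (`WANDB_PROJECT`, `ENTITY`, `RAW_DATA_AT`, `BDD_CLASSES`). This is only a sketch: the project and artifact names are placeholders rather than the actual values used here, and the class dictionary is copied from the `print(params.BDD_CLASSES)` output further down.

```python
# params.py -- hypothetical contents; only the attribute names come from this notebook
WANDB_PROJECT = "my-segmentation-project"  # placeholder project name (assumption)
ENTITY = None                              # your W&B team name, or None when working individually
RAW_DATA_AT = "raw_data"                   # placeholder artifact name (assumption)

# integer mask value -> class name, as printed by this notebook
BDD_CLASSES = {
    0: "sky", 1: "tree", 2: "road", 3: "grass", 4: "water",
    5: "building", 6: "mountain", 7: "foreground", 8: "unknown",
}
```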

In the section below, we will use the `untar_data` function from `fastai` to download and unzip our dataset. Here the URL points to a local `archive.zip` file.

In [5]:
URL = 'file:///Users/jobdulo/Documents/mlops/content/endPrototype/archive.zip'

In [6]:
path = Path(untar_data(URL, force_download=True))

In [7]:
path.ls()

(#3) [Path('/Users/jobdulo/.fastai/data/archive/images'),Path('/Users/jobdulo/.fastai/data/archive/labels_class_dict.csv'),Path('/Users/jobdulo/.fastai/data/archive/masks')]

In [8]:
(path/'masks').ls()

(#715) [Path('/Users/jobdulo/.fastai/data/archive/masks/0101492.png'),Path('/Users/jobdulo/.fastai/data/archive/masks/0102170.png'),Path('/Users/jobdulo/.fastai/data/archive/masks/3001667.png'),Path('/Users/jobdulo/.fastai/data/archive/masks/5000191.png'),Path('/Users/jobdulo/.fastai/data/archive/masks/6000166.png'),Path('/Users/jobdulo/.fastai/data/archive/masks/6000172.png'),Path('/Users/jobdulo/.fastai/data/archive/masks/4000086.png'),Path('/Users/jobdulo/.fastai/data/archive/masks/6000199.png'),Path('/Users/jobdulo/.fastai/data/archive/masks/0101121.png'),Path('/Users/jobdulo/.fastai/data/archive/masks/6000012.png')...]

Here we define several functions to help us process the data and upload it as a `Table` to W&B. 

In [22]:
def label_func(fname):
    return (fname.parent.parent/"masks")/f"{fname.stem}.png"

def get_classes_per_image(mask_data, class_labels):
    unique = list(np.unique(mask_data))
    result_dict = {}
    for _class in class_labels.keys():
        result_dict[class_labels[_class]] = int(_class in unique)
    return result_dict

def _create_table(image_files, class_labels):
    "Create a table with the dataset"
    labels = [str(class_labels[_lab]) for _lab in list(class_labels)]
    table = wandb.Table(columns=["File_Name", "Images", "Dataset"] + labels)
    
    for i, image_file in progress_bar(enumerate(image_files), total=len(image_files)):
        image = Image.open(image_file)
        print(image_file)
        print(label_func(image_file))
        mask_data = np.array(Image.open(label_func(image_file)))
        # The masks here load as 3-D arrays (height, width, channels), but W&B expects
        # a 2-D mask, so fold the channel axis into the width axis.
        # NOTE: the reshaped mask is `channels` times wider than the image.
        height, width, channels = mask_data.shape
        mask_data = np.reshape(mask_data, (height, width * channels))
        class_in_image = get_classes_per_image(mask_data, class_labels)
        table.add_data(
            str(image_file.name),
            wandb.Image(
                    image,
                    masks={
                        "predictions": {
                            "mask_data": mask_data,
                            "class_labels": class_labels,
                        }
                    }
            ),
            "archive", # we don't have a dataset split yet
            *[class_in_image[_lab] for _lab in labels]
        )
    
    return table
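
To make the path convention concrete, here is a small standalone check of `label_func`. It assumes the layout shown by `path.ls()` above (an `images` folder next to a `masks` folder, with each mask stored as a `.png` under the same stem as its image). The `data/archive` prefix is a made-up example path.

```python
from pathlib import PurePosixPath

def label_func(fname):
    # same logic as in the notebook: swap the 'images' folder for 'masks'
    # and use a .png extension with the same file stem
    return (fname.parent.parent / "masks") / f"{fname.stem}.png"

image_path = PurePosixPath("data/archive/images/0101492.jpg")  # hypothetical location
mask_path = label_func(image_path)
print(mask_path)  # data/archive/masks/0101492.png
```

If an image has no mask with a matching stem, `Image.open(label_func(image_file))` inside `_create_table` will raise a `FileNotFoundError`, so it can be worth checking the pairing before building the Table.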

We will start a new W&B `run` and put everything into a raw Artifact.

In [10]:
!wandb login

[34m[1mwandb[0m: Currently logged in as: [33mdulo[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [12]:
# start a new wandb run
run = wandb.init(project=params.WANDB_PROJECT, entity=params.ENTITY, job_type="data_upload")

# create an artifact
raw_data_at = wandb.Artifact(params.RAW_DATA_AT, type="raw_data")

[34m[1mwandb[0m: Currently logged in as: [33mdulo[0m. Use [1m`wandb login --relogin`[0m to force relogin



In [13]:
# raw_data_at.add_file(path/'LICENSE.txt', name='LICENSE.txt')

Let's add the images and label masks.

In [14]:
# ADD FOLDERS TO ARTIFACT
raw_data_at.add_dir(path/'images', name='images')
raw_data_at.add_dir(path/'masks', name='masks')

[34m[1mwandb[0m: Adding directory to artifact (/Users/jobdulo/.fastai/data/archive/images)... Done. 0.1s
[34m[1mwandb[0m: Adding directory to artifact (/Users/jobdulo/.fastai/data/archive/masks)... Done. 0.1s


Let's get the file names of images in our dataset and use the function we defined above to create a W&B Table. 

In [15]:
DEBUG = False # set this flag to True to use a small subset of data for testing

In [16]:
image_files = get_image_files(path/"images", recurse=False)
print(image_files)

# sample a subset if DEBUG
# if DEBUG: image_files = image_files[:10]

[Path('/Users/jobdulo/.fastai/data/archive/images/4000086.jpg'), Path('/Users/jobdulo/.fastai/data/archive/images/6000199.jpg'), Path('/Users/jobdulo/.fastai/data/archive/images/0101121.jpg'), Path('/Users/jobdulo/.fastai/data/archive/images/6000172.jpg'), Path('/Users/jobdulo/.fastai/data/archive/images/6000166.jpg'), Path('/Users/jobdulo/.fastai/data/archive/images/0102170.jpg'), Path('/Users/jobdulo/.fastai/data/archive/images/3001667.jpg'), Path('/Users/jobdulo/.fastai/data/archive/images/5000191.jpg'), Path('/Users/jobdulo/.fastai/data/archive/images/0101492.jpg'), Path('/Users/jobdulo/.fastai/data/archive/images/0103468.jpg'), Path('/Users/jobdulo/.fastai/data/archive/images/3001061.jpg'), Path('/Users/jobdulo/.fastai/data/archive/images/2000042.jpg'), Path('/Users/jobdulo/.fastai/data/archive/images/6000238.jpg'), Path('/Users/jobdulo/.fastai/data/archive/images/1001195.jpg'), Path('/Users/jobdulo/.fastai/data/archive/images/9003836.jpg'), Path('/Users/jobdulo/.fastai/data/archi
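
The `DEBUG` flag is defined above, but the subsetting line is commented out. One way to wire it up is sketched below with plain file names instead of the real `image_files` list: take a fixed-seed random sample so that every debug run sees the same files. The helper name `maybe_subsample`, the sample size, and the seed are illustrative choices, not part of the original notebook.

```python
import random

def maybe_subsample(files, debug, n=10, seed=42):
    """Return all files, or a reproducible random sample of n files when debug is True."""
    if not debug or len(files) <= n:
        return list(files)
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    return rng.sample(list(files), n)

# hypothetical file names standing in for the output of get_image_files
files = [f"{i:07d}.jpg" for i in range(100)]
subset = maybe_subsample(files, debug=True)
print(len(subset))  # 10
```

Compared with slicing the first ten entries, a random sample is less likely to pick only files from one part of the dataset.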

In [17]:
print(params.BDD_CLASSES)

{0: 'sky', 1: 'tree', 2: 'road', 3: 'grass', 4: 'water', 5: 'building', 6: 'mountain', 7: 'foreground', 8: 'unknown'}
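
These integer keys are what `get_classes_per_image` looks for in each mask. The toy example below uses a compact rewrite of that function with the same behavior, and assumes a single-channel mask whose pixel values are class indices. The 2x3 array and the shortened label map are invented for illustration.

```python
import numpy as np

def get_classes_per_image(mask_data, class_labels):
    # 1 if the class index occurs anywhere in the mask, else 0
    unique = set(np.unique(mask_data).tolist())
    return {name: int(idx in unique) for idx, name in class_labels.items()}

class_labels = {0: "sky", 1: "tree", 2: "road"}  # shortened label map for the example
toy_mask = np.array([[0, 0, 2],
                     [2, 2, 2]], dtype=np.uint8)

presence = get_classes_per_image(toy_mask, class_labels)
print(presence)  # {'sky': 1, 'tree': 0, 'road': 1}
```

Each entry of the returned dictionary becomes one column in the Table built by `_create_table`, so every row records which classes appear in that image.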


In [23]:
table = _create_table(image_files, params.BDD_CLASSES)

/Users/jobdulo/.fastai/data/archive/images/4000086.jpg
/Users/jobdulo/.fastai/data/archive/masks/4000086.png
/Users/jobdulo/.fastai/data/archive/images/6000199.jpg
/Users/jobdulo/.fastai/data/archive/masks/6000199.png
/Users/jobdulo/.fastai/data/archive/images/0101121.jpg
/Users/jobdulo/.fastai/data/archive/masks/0101121.png
/Users/jobdulo/.fastai/data/archive/images/6000172.jpg
/Users/jobdulo/.fastai/data/archive/masks/6000172.png
/Users/jobdulo/.fastai/data/archive/images/6000166.jpg
/Users/jobdulo/.fastai/data/archive/masks/6000166.png
/Users/jobdulo/.fastai/data/archive/images/0102170.jpg
/Users/jobdulo/.fastai/data/archive/masks/0102170.png
/Users/jobdulo/.fastai/data/archive/images/3001667.jpg
/Users/jobdulo/.fastai/data/archive/masks/3001667.png
/Users/jobdulo/.fastai/data/archive/images/5000191.jpg
/Users/jobdulo/.fastai/data/archive/masks/5000191.png
/Users/jobdulo/.fastai/data/archive/images/0101492.jpg
/Users/jobdulo/.fastai/data/archive/masks/0101492.png
/Users/jobdulo/.fas

/Users/jobdulo/.fastai/data/archive/images/1000882.jpg
/Users/jobdulo/.fastai/data/archive/masks/1000882.png
/Users/jobdulo/.fastai/data/archive/images/6000160.jpg
/Users/jobdulo/.fastai/data/archive/masks/6000160.png
/Users/jobdulo/.fastai/data/archive/images/6000174.jpg
/Users/jobdulo/.fastai/data/archive/masks/6000174.png
/Users/jobdulo/.fastai/data/archive/images/6000148.jpg
/Users/jobdulo/.fastai/data/archive/masks/6000148.png
/Users/jobdulo/.fastai/data/archive/images/5000197.jpg
/Users/jobdulo/.fastai/data/archive/masks/5000197.png
/Users/jobdulo/.fastai/data/archive/images/5000183.jpg
/Users/jobdulo/.fastai/data/archive/masks/5000183.png
/Users/jobdulo/.fastai/data/archive/images/3002340.jpg
/Users/jobdulo/.fastai/data/archive/masks/3002340.png
/Users/jobdulo/.fastai/data/archive/images/3001891.jpg
/Users/jobdulo/.fastai/data/archive/masks/3001891.png
/Users/jobdulo/.fastai/data/archive/images/0000952.jpg
/Users/jobdulo/.fastai/data/archive/masks/0000952.png
/Users/jobdulo/.fas

/Users/jobdulo/.fastai/data/archive/images/0004774.jpg
/Users/jobdulo/.fastai/data/archive/masks/0004774.png
/Users/jobdulo/.fastai/data/archive/images/8003836.jpg
/Users/jobdulo/.fastai/data/archive/masks/8003836.png
/Users/jobdulo/.fastai/data/archive/images/6000139.jpg
/Users/jobdulo/.fastai/data/archive/masks/6000139.png
/Users/jobdulo/.fastai/data/archive/images/9004294.jpg
/Users/jobdulo/.fastai/data/archive/masks/9004294.png
/Users/jobdulo/.fastai/data/archive/images/6000111.jpg
/Users/jobdulo/.fastai/data/archive/masks/6000111.png
/Users/jobdulo/.fastai/data/archive/images/6000105.jpg
/Users/jobdulo/.fastai/data/archive/masks/6000105.png
/Users/jobdulo/.fastai/data/archive/images/5000119.jpg
/Users/jobdulo/.fastai/data/archive/masks/5000119.png
/Users/jobdulo/.fastai/data/archive/images/6000313.jpg
/Users/jobdulo/.fastai/data/archive/masks/6000313.png
/Users/jobdulo/.fastai/data/archive/images/5000125.jpg
/Users/jobdulo/.fastai/data/archive/masks/5000125.png
/Users/jobdulo/.fas

KeyboardInterrupt: 

Finally, we will add the Table to our Artifact, log it to W&B and finish our `run`. 

In [19]:
raw_data_at.add(table, "eda_table")

<ManifestEntry digest: CsM4SHeMU/E1LWvHXkzBAA==>

In [20]:
run.log_artifact(raw_data_at)
run.finish()

wandb: Network error (TransientError), entering retry loop.
wandb: Network error (TransientError), entering retry loop.
wandb: Network error (TransientError), entering retry loop.
wandb: Network error (TransientError), entering retry loop.
wandb: Network error (TransientError), entering retry loop.
wandb: Network error (TransientError), entering retry loop.
