
# Accessing PAD Datasets

**Welcome to our Quick Start Guide!** In this notebook, we'll walk you through how to access datasets from the [PaperAnalyticalDeviceND dataset registry](https://github.com/PaperAnalyticalDeviceND/pad_dataset_registry) for model training.

You'll find detailed instructions on setting up your environment, installing necessary dependencies, exploring available datasets, downloading your chosen dataset, storing it, and visualizing its metadata.

**Compatibility:** This guide is written for Google Colab, but it should work in any Jupyter environment running Python 3.9 or newer. Only the cells that import `google.colab` in the "Save" section are Colab-specific.

Should you have any questions or require further assistance, please feel free to reach out to pmoreira@nd.edu.

Enjoy exploring the datasets and happy modeling!



# Set Up the Environment

In [None]:
# Install dependencies
!pip install dvc dvc-gdrive &> /dev/null

DEV_FNAME = 'metadata_dev.csv'
TEST_FNAME = 'metadata_test.csv'
DEV_IMAGES_PATH = 'dev_images'
TEST_IMAGES_PATH = 'test_images'
REPORT_PATH = 'report'
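
The constants above decide where each file lands on disk. As a rough sketch of the resulting layout, the hypothetical helper `dataset_layout` (not part of the registry or of DVC) joins them with a dataset name. It assumes test images go into `TEST_IMAGES_PATH`, which the constant names suggest but the notebook does not state.

```python
import os

# Same values as the setup cell
DEV_FNAME = 'metadata_dev.csv'
TEST_FNAME = 'metadata_test.csv'
DEV_IMAGES_PATH = 'dev_images'
TEST_IMAGES_PATH = 'test_images'

def dataset_layout(dataset_name):
    """Return the local paths used for one dataset (hypothetical helper)."""
    return {
        'dev_metadata': os.path.join(dataset_name, DEV_FNAME),
        'test_metadata': os.path.join(dataset_name, TEST_FNAME),
        'dev_images': os.path.join(dataset_name, DEV_IMAGES_PATH),
        # Assumption: test images are kept apart from the dev images
        'test_images': os.path.join(dataset_name, TEST_IMAGES_PATH),
    }

layout = dataset_layout('FHI2020_Stratified_Sampling')
for name, path in layout.items():
    print(f"{name}: {path}")
```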

# **List** Datasets

In [None]:
!dvc list  https://github.com/PaperAnalyticalDeviceND/pad_dataset_registry datasets

# **Download** a dataset from the previous dataset list

In [None]:
# Set `dataset_name` to one of the datasets listed above
dataset_name = 'FHI2020_Stratified_Sampling'

In [None]:
import csv, os
import requests
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm

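def download_file(url, filename, images_path):
    """Download a file from a URL and save it inside the images_path folder."""
    # A timeout keeps a stalled connection from blocking a worker thread forever
    response = requests.get(url, stream=True, timeout=30)
    if response.status_code == 200:
        path = os.path.join(images_path, filename)
        with open(path, 'wb') as f:
            for chunk in response.iter_content(1024):
                f.write(chunk)
    else:
        # Report the failure instead of skipping the file silently
        tqdm.write(f"Failed to download {url} (HTTP {response.status_code})")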

def download_files_from_csv_file(file_path, images_path):
    """Download files in parallel based on URLs from a CSV file with a progress bar."""
    # Open the CSV file and parse its content
    with open(file_path, newline='') as csvfile:
        rows = list(csv.DictReader(csvfile)) # Convert to list for tqdm

        # Initialize tqdm for the progress bar
        pbar = tqdm(total=len(rows), desc="Downloading files")

        def update(*args):
            # Update the progress bar by one each time a file is downloaded
            pbar.update()

        # Use ThreadPoolExecutor to download files in parallel
        with ThreadPoolExecutor(max_workers=10) as executor:
            futures = []
            for row in rows:
                url = row['url']
                filename = row['image_name']
                # Schedule the download task
                future = executor.submit(download_file, url, filename, images_path)
                future.add_done_callback(update)
                futures.append(future)

            # Wait for all futures to complete
            for future in futures:
                future.result()

        # Close the progress bar
        pbar.close()


# Create a folder to hold all dataset files (exist_ok allows re-running the cell)
os.makedirs(dataset_name, exist_ok=True)

# Folder for the dev images inside the dataset folder
images_path = os.path.join(dataset_name, DEV_IMAGES_PATH)
os.makedirs(images_path, exist_ok=True)

# Path to save the dev metadata file inside the dataset folder
dev_metadata_path = os.path.join(dataset_name, DEV_FNAME)

# Path to save the test metadata file inside the dataset folder
test_metadata_path = os.path.join(dataset_name, TEST_FNAME)

# Download the DEV metadata file
!dvc get  https://github.com/PaperAnalyticalDeviceND/pad_dataset_registry datasets/$dataset_name/$DEV_FNAME -o  $dataset_name/$DEV_FNAME

# Start downloading image files for the dev set
download_files_from_csv_file(dev_metadata_path, images_path)

# Uncomment to download the TEST metadata file
#!dvc get https://github.com/PaperAnalyticalDeviceND/pad_dataset_registry datasets/$dataset_name/$TEST_FNAME -o  $dataset_name/$TEST_FNAME

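# Uncomment to download the test set images into their own folder
# test_images_path = os.path.join(dataset_name, TEST_IMAGES_PATH)
# os.makedirs(test_images_path, exist_ok=True)
# download_files_from_csv_file(test_metadata_path, test_images_path)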

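Network downloads can fail or be cut short, and `download_file` does not raise on a non-200 response, so it may be worth confirming that every image listed in the metadata actually reached the disk. The sketch below uses a hypothetical helper, `find_missing_images`, and assumes only the `image_name` column that the download code already reads.

```python
import csv
import os

def find_missing_images(metadata_path, images_path):
    """Return the image_name values from the CSV with no non-empty file on disk."""
    missing = []
    with open(metadata_path, newline='') as csvfile:
        for row in csv.DictReader(csvfile):
            path = os.path.join(images_path, row['image_name'])
            # Absent and zero-byte files both point to a failed download
            if not os.path.isfile(path) or os.path.getsize(path) == 0:
                missing.append(row['image_name'])
    return missing
```

Calling `find_missing_images(dev_metadata_path, images_path)` after the download returns an empty list when everything arrived. Any names it does return can be fetched again.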

# **Save** the dataset

> ## Save it in a folder in your Google Drive (recommended)

In [None]:
from google.colab import drive
drive.mount('/content/drive')
my_path = "/content/drive/MyDrive/"

!cp -r $dataset_name/ $my_path

Mounted at /content/drive


> ## Or save it on your computer (slow)



> Uncomment the lines below to download the dataset to your computer as a zip file



In [None]:
# from google.colab import files

# !zip -r $dataset_name.zip $dataset_name/ &> /dev/null
# files.download(f"{dataset_name}.zip")
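
Outside Colab, `google.colab.files` is not available, and the dataset folder is already on the local disk. If a single archive is still wanted, for example to move the dataset to another machine, a minimal sketch with the standard library's `shutil.make_archive` could look like this. `zip_dataset` is a hypothetical helper name.

```python
import os
import shutil

def zip_dataset(dataset_dir):
    """Create <dataset_dir>.zip next to the folder and return the archive path."""
    parent = os.path.dirname(os.path.abspath(dataset_dir))
    folder = os.path.basename(os.path.normpath(dataset_dir))
    # root_dir/base_dir keep the top-level folder name inside the archive
    return shutil.make_archive(
        os.path.join(parent, folder), 'zip', root_dir=parent, base_dir=folder
    )
```

For the dataset downloaded above, `zip_dataset(dataset_name)` would write `<dataset_name>.zip` in the current working directory.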

# Visualize the metadata

In [None]:
# Visualize the metadata using pandas
import pandas as pd

data = pd.read_csv(dev_metadata_path)

data
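
Beyond displaying the table, a per-column overview can show at a glance which fields have gaps and how many distinct values each one holds. The sketch below defines a hypothetical helper, `summarize_columns`, and runs it on a toy frame. Only `image_name` and `url` are column names taken from the download code; `label` is invented for illustration and may not exist in the real metadata.

```python
import pandas as pd

def summarize_columns(df):
    """Return one row per column: dtype, missing-value count, distinct-value count."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'missing': df.isna().sum(),
        'unique': df.nunique(),  # nunique ignores missing values by default
    })

# Toy frame standing in for the real metadata (the `label` column is hypothetical)
toy = pd.DataFrame({
    'image_name': ['a.png', 'b.png', 'c.png'],
    'url': ['u1', 'u2', 'u3'],
    'label': ['x', 'x', None],
})
summary = summarize_columns(toy)
print(summary)
```

On the real data, `summarize_columns(data)` gives the same overview for whatever columns the chosen dataset provides.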