# Accessing PAD Datasets


---

**Welcome to our Quick Start Guide!** This notebook shows how to access datasets from the [Paper Analytical Device (PAD) Dataset Registry](https://github.com/PaperAnalyticalDeviceND/pad_dataset_registry) for model training.

This guide includes detailed steps on setting up your environment, installing necessary dependencies, exploring available datasets, downloading your selected dataset, storing it locally, and visualizing its metadata.

**User-Friendly:** Designed with Google Colab in mind, this guide also runs in any Jupyter environment with Python 3.9 or newer. Only the cells that import `google.colab` (saving to Google Drive or downloading a zip) are Colab-specific.

If you have any questions or need further assistance, please feel free to contact pmoreira@nd.edu.

Enjoy your journey through the datasets and happy modeling!


---



# Setup Environment

In [None]:
# Setup Environment

# Install dependencies required for the project. This includes DVC for data version control,
# and DVC-GDrive to enable storage integration with Google Drive for dataset storage.
!pip install dvc dvc-gdrive &> /dev/null

# The line below is commented out because these packages come preinstalled
# in Colab; uncomment it if you are running on your local computer
# !pip install pandas opencv-python matplotlib

# Define constants for file and directory names to be used in the project.
DEV_FNAME = 'metadata_dev.csv'  # Name of the development dataset metadata file.
TEST_FNAME = 'metadata_test.csv'  # Name of the test dataset metadata file.
DEV_IMAGES_PATH = 'images_dev'  # Directory path for development dataset images.
TEST_IMAGES_PATH = 'images_test'  # Directory path for test dataset images.
REPORT_PATH = 'report'  # Directory path for storing reports generated from analyses.


# **List** Datasets

In [None]:
# Use the `dvc list` command to retrieve a list of datasets available in the PAD Dataset Registry.
!dvc list https://github.com/PaperAnalyticalDeviceND/pad_dataset_registry datasets

# **Download** a dataset from the previous dataset list


## Functions

In [None]:
##****************************************************************************##
# Downloading functions
##****************************************************************************##
import csv, os
import requests
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm


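def download_file(url, filename, images_path):
    """Download a file from a URL and save it under `images_path`.

    Responses with a status other than 200 are skipped silently, so a missing
    file on disk means the download failed.
    """
    # The timeout keeps a stalled connection from blocking a worker thread forever.
    response = requests.get(url, stream=True, timeout=60)
    if response.status_code == 200:
        path = os.path.join(images_path, filename)
        with open(path, 'wb') as f:
            for chunk in response.iter_content(1024):
                f.write(chunk)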

def download_files_from_csv_file(file_path, images_path):
    """Download files in parallel based on URLs from a CSV file with a progress bar."""
    # Open the CSV file and parse its content
    with open(file_path, newline='') as csvfile:
        rows = list(csv.DictReader(csvfile)) # Convert to list for tqdm

        # Initialize tqdm for the progress bar
        pbar = tqdm(total=len(rows), desc="Downloading files")

        def update(*args):
            # Update the progress bar by one each time a file is downloaded
            pbar.update()

        # Use ThreadPoolExecutor to download files in parallel
        with ThreadPoolExecutor(max_workers=10) as executor:
            futures = []
            for row in rows:
                url = row['url']
                filename = row['image_name']
                # Schedule the download task
                future = executor.submit(download_file, url, filename, images_path)
                future.add_done_callback(update)
                futures.append(future)

            # Wait for all futures to complete
            for future in futures:
                future.result()

        # Close the progress bar
        pbar.close()


##****************************************************************************##
# Preprocessing functions: Extract RGB Information for FHI360
##****************************************************************************##
import os
import csv
import math
import warnings
import cv2 as cv
import pandas as pd
import urllib.request
from datetime import datetime
import ssl

SAVE_DIR = './pixel_data/'
REQS = {'LOG':'log.txt'}

HORIZONTAL_BORDER = 12
VERTICAL_BORDER = 0

BLACK_THRESH_S = 35
BLACK_THRESH_V = 70

!touch temp.png # Creates a temporary file for image processing

#Takes a list of pixels and a BGR image and returns the average
# Lab pixel values
def px_avgPixelsLAB(pixels, img):
  """Calculate the average Lab pixel values for a list of pixels in a BGR image."""
  workingImg = cv.cvtColor(img, cv.COLOR_BGR2Lab)
  totalL = 0
  totalA = 0
  totalB = 0
  for pixel in pixels:
    x = pixel[0]
    y = pixel[1]
    l, a, b = workingImg[x,y,:]
    # Cast to int so the running totals cannot wrap around in uint8 arithmetic
    totalL += int(l)
    totalA += int(a)
    totalB += int(b)
  if len(pixels) != 0:
    totalL /= len(pixels)
    totalA /= len(pixels)
    totalB /= len(pixels)
  return int(totalL + 0.5), int(totalA + 0.5), int(totalB + 0.5)


#Takes a list of pixels and a BGR image and returns the average
# RGB pixel values
def px_avgPixels(pixels, img):
  """Calculate the average RGB pixel values for a list of pixels in a BGR image."""
  totalB = 0
  totalG = 0
  totalR = 0
  for pixel in pixels:
    x = pixel[0]
    y = pixel[1]
    b,g,r = img[x,y,:]
    # Cast to int so the running totals cannot wrap around in uint8 arithmetic
    totalB += int(b)
    totalG += int(g)
    totalR += int(r)
  if len(pixels) != 0:
    totalB /= len(pixels)
    totalG /= len(pixels)
    totalR /= len(pixels)
  return int(totalR + 0.5), int(totalG + 0.5), int(totalB + 0.5)


#Takes a distance from a center and returns a weight between 0 and 1
# determined by cosine such that a point at the center has weight 1,
# and a point at the extremes has weight ~0.
def intFind_cosCorrectFactor(dx, dy, centerX, centerY):
  """Determine a weight between 0 and 1 based on distance from center, using a cosine function."""
  relevantD = max((dx/centerX), dy/centerY)
  relevantDRads = (math.pi/2) * relevantD
  return math.cos(relevantDRads)

#Takes a HSV image and returns the coordinates of every pixel that passes
# the dark-pixel filter, which minimizes the black bars on the edges
def intFind_findMaxIntensitiesFiltered(img):
  """Return (row, col) pairs for all pixels whose cosine-weighted S or V exceeds the black thresholds."""
  imgS = img[:,:,1]
  imgV = img[:,:,2]
  maxI = 0
  maxSet = []
  centerX = imgS.shape[0]/2
  centerY = imgS.shape[1]/2

  for i in range(imgS.shape[0]):
    dX = abs(centerX-i)
    for j in range(imgS.shape[1]):
      dY = abs(centerY-j)
      sF = intFind_cosCorrectFactor(dX,dY,centerX,centerY)
      cS = sF*imgS[i,j]
      cV = sF*imgV[i,j]
      if cS <= BLACK_THRESH_S and cV <= BLACK_THRESH_V:
        pass
      else:
        maxSet.append((i,j))
  return maxSet

def fm_genIndex(regions, ColorList = ['R', 'G', 'B']):
  """Generate an index for CSV column headers based on the number of regions and color list."""

  index = ['Image', 'Contains', 'Drug %', 'PAD S#']
  for letter in ['A','B','C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']:
    for j in range(1,regions+1):
      for color in ColorList:
        tempStr = letter+str(j)+'-'+color
        index.append(tempStr)
  return index

def fm_checkFormating(dir=SAVE_DIR, errorsFile=None):
  """Check formatting and required files in the specified directory."""
  if not os.path.isdir(dir):
    os.mkdir(dir)
  if errorsFile is None:
    errors = open(dir+REQS['LOG'], 'a')
  else:
    errors = errorsFile
  files = os.listdir(dir)
  #print(files)
  for item in REQS.values():
    if item not in files:
      errorString = str.format("Required file %s not found, creating.\n" %(item))
      errors.write(errorString)
      warnings.warn(errorString)
      if item == REQS['LOG']:
        temp = open(dir+item, 'w')
        temp.close()
      else:
        os.mkdir(dir+item)
  if errorsFile is None:
    errors.close()

def _regionGen(regions, region):
  """Generate start and end points for a given region."""
  start = 359
  totalLength = 273
  regionStart = start + math.floor(totalLength * (region/regions)) + VERTICAL_BORDER
  regionEnd = start + math.floor(totalLength * ((region+1)/regions)) - VERTICAL_BORDER
  return regionStart, regionEnd

def _fullRoutine(img, roiFunc, df, RGB=True, regions=3):
  """Complete routine for processing an image and extracting pixel information."""

  letters = ['A','B','C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']
  rList = []
  gList = []
  bList = []
  imgHSV = cv.cvtColor(img, cv.COLOR_BGR2HSV)
  for lane in range(1,13):
    laneStart = 17 + (53*lane)+HORIZONTAL_BORDER
    laneEnd = 17 + (53*(lane+1))-HORIZONTAL_BORDER
    letter = letters[lane-1]
    for region in range(regions):
      regionStart, regionEnd = _regionGen(regions, region)
      roi = imgHSV[regionStart:regionEnd,laneStart:laneEnd,:]
      rgbROI = img[regionStart:regionEnd,laneStart:laneEnd,:]
      pixels = roiFunc(roi)
      tempString = letter + str(region+1) + "-"
      #Switches between RGB and Lab
      if(RGB):
        r, g, b = px_avgPixels(pixels, rgbROI)
        df[tempString+'R'] = r
        df[tempString+'G'] = g
        df[tempString+'B'] = b
      else:
        l, a, blu = px_avgPixelsLAB(pixels, rgbROI)
        df[tempString+'L'] = l
        df[tempString+'a'] = a
        df[tempString+'b'] = blu
  return df


def addIndex(runSettings):
  """Add index information to run settings based on regions and color mode (RGB or Lab)."""
  for setting in runSettings:
    regions = runSettings[setting]['regions']
    if(runSettings[setting]['RGB']):
      runSettings[setting]['Index'] = fm_genIndex(regions)
    else:
      runSettings[setting]['Index'] = fm_genIndex(regions, ['L','a','b'])
  return runSettings

def regionRoutine(target, runSettings, save_dir=SAVE_DIR):
  """Read a CSV file and process images according to the run settings."""

  # Note: this disables HTTPS certificate verification for every urllib request in this process.
  ssl._create_default_https_context = ssl._create_unverified_context
  
  startTime = datetime.now()
  url = 'https://pad.crc.nd.edu'
  dest = 'temp.png'
  fm_checkFormating(save_dir)
  errors = open(save_dir+REQS['LOG'], 'a')
  print("Starting...")
  with open(target) as csvfile:
    csvreader = csv.reader(csvfile, )
    i = 0
    next(csvreader)  # Skip the first line
    for row in csvreader:
      cTime = datetime.now()
      i+=1
      try:
        urllib.request.urlretrieve(row[5], dest)
        #urllib.request.urlretrieve(row[7], dest)
        img = cv.imread(dest)
        if (1250, 730, 3) != img.shape and (1220, 730, 3) != img.shape:
          errorString = "Error with file %s. Expected shape %s, found shape %s.\n" % (row[0], '(1250, 730, 3) or (1220, 730, 3)', str(img.shape))
          errors.write(errorString)
          warnings.warn(errorString)
        else:
          for setting in runSettings:
            data = {}
            data = _fullRoutine(img, intFind_findMaxIntensitiesFiltered, data, runSettings[setting]['RGB'], runSettings[setting]['regions'])
            #data['Image'] = row[0]
            #data['Contains'] = row[1]
            #data['Drug %'] = row[18]
            #data['PAD S#'] = row[17]

            data['Image'] = row[0]
            data['Contains'] = row[2]
            data['Drug %'] = row[3]
            data['PAD S#'] = row[1]


            df = pd.DataFrame(data, columns=runSettings[setting]['Index'], index=[data['Image']])
            if(not os.path.exists(save_dir+setting)):
              df.to_csv(save_dir+setting, mode='w', header=True)
            else:
              df.to_csv(save_dir+setting, mode='a', header=False)
          elapsedTime = datetime.now() - cTime
          print("Finished image ",row[0]," in ",elapsedTime)
      except Exception as e:
        errorString = str.format("Error %s with file %s.\n" %(str(e), row[0]))
        errors.write(errorString)
        warnings.warn(errorString)
      # The temporary file is missing when the download itself failed
      if os.path.exists(dest):
        os.remove(dest)
    errors.close()
    endTime = datetime.now()
    regions = 3+12+20
    print('Time: ',endTime-startTime, ' time saved = ',i*regions*13/60.0)
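The pixel coordinates used by `_fullRoutine` and `_regionGen` are easier to follow with concrete numbers. The sketch below copies the constants from those functions into standalone helpers and prints the column range of a lane, the row range of a region, and the cosine weight at the center and edge of a region of interest. The helper names (`lane_bounds`, `region_bounds`, `cos_weight`) are illustrative and not part of the notebook.

```python
import math

# Constants copied from the functions above.
HORIZONTAL_BORDER = 12
VERTICAL_BORDER = 0
LANE_OFFSET, LANE_WIDTH = 17, 53          # used in _fullRoutine
REGION_START, REGION_LENGTH = 359, 273    # used in _regionGen


def lane_bounds(lane):
    """Column range for lane 1..12 (letters A..L), mirroring _fullRoutine."""
    start = LANE_OFFSET + LANE_WIDTH * lane + HORIZONTAL_BORDER
    end = LANE_OFFSET + LANE_WIDTH * (lane + 1) - HORIZONTAL_BORDER
    return start, end


def region_bounds(regions, region):
    """Row range for a zero-based region index, mirroring _regionGen."""
    start = REGION_START + math.floor(REGION_LENGTH * (region / regions)) + VERTICAL_BORDER
    end = REGION_START + math.floor(REGION_LENGTH * ((region + 1) / regions)) - VERTICAL_BORDER
    return start, end


def cos_weight(dx, dy, center_x, center_y):
    """Weight of a pixel at offset (dx, dy) from the center, mirroring intFind_cosCorrectFactor."""
    relative = max(dx / center_x, dy / center_y)
    return math.cos((math.pi / 2) * relative)


print(lane_bounds(1))            # (82, 111): columns of lane A
print(region_bounds(10, 0))      # (359, 386): rows of the first of 10 regions
print(cos_weight(0, 0, 5, 5))    # 1.0 at the center of the region of interest
print(cos_weight(5, 0, 5, 5))    # close to 0 at the edge
```

With `HORIZONTAL_BORDER = 12`, each lane keeps 29 of its 53 columns, which trims the lane boundaries before the pixels are averaged.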

## Define parameters

In [None]:
## Parameters
# Specify the name of the dataset to use. This should match one of the datasets listed in the previous step.
dataset_name = 'FHI2020_Stratified_Sampling'

# Toggle whether to download files from the Development (DEV) set.
download_dev = True

# Toggle whether to download files from the Test (TEST) set.
download_test = False

# Decide if original images from the cards should be downloaded.
# This is useful if detailed analysis of the original images is required.
download_original_images = False

# Enable downloading and preprocessing of pixel data to extract RGB information,
# which is essential for certain types of analysis, such as FHI360.
download_pixel_data = True

# Define the number of regions to segment the image into during preprocessing.
# This is only required if `download_pixel_data` is set to True.
num_regions = 10

## Initial Setup

# Create a directory to store all dataset files.
# If the directory already exists, it will not throw an error due to `exist_ok=True`.
os.makedirs(dataset_name, exist_ok=True)

# Specify the path to save the processed pixel data.
# This organizes the output from preprocessing, facilitating easier analysis.
pixel_data_path = os.path.join(dataset_name, "pixel_data/")
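The value of `num_regions` determines how wide the extracted pixel table becomes. As a rough illustration, the sketch below rebuilds the header logic of `fm_genIndex` in a standalone helper (`column_names` is a hypothetical name) and shows the column count and output path that follow from the parameters above.

```python
import os


def column_names(regions, colors=("R", "G", "B")):
    """Rebuild the CSV header produced by fm_genIndex for a given region count."""
    names = ["Image", "Contains", "Drug %", "PAD S#"]
    for letter in "ABCDEFGHIJKL":            # the 12 lanes of a PAD card
        for region in range(1, regions + 1):
            for color in colors:
                names.append(f"{letter}{region}-{color}")
    return names


num_regions = 10
cols = column_names(num_regions)
print(len(cols))    # 364 = 4 metadata columns + 12 lanes * 10 regions * 3 channels
print(cols[:7])     # ['Image', 'Contains', 'Drug %', 'PAD S#', 'A1-R', 'A1-G', 'A1-B']

# Path of the DEV output file, following the naming used in the DEV cell.
output_path = os.path.join("FHI2020_Stratified_Sampling", "pixel_data",
                           f"{num_regions}_region_rgb__dev.csv")
print(output_path)
```

A smaller `num_regions` averages over taller regions and yields fewer features per lane; a larger value gives finer vertical resolution at the cost of a wider table.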


## Get DEV set

In [None]:
# **** DEV ****

# Proceed if downloading of the development dataset is enabled
if download_dev:
    # Download the DEV metadata file using DVC. `--force` overwrites a local copy left by a previous run.
    !dvc get --force https://github.com/PaperAnalyticalDeviceND/pad_dataset_registry datasets/$dataset_name/$DEV_FNAME -o $dataset_name/$DEV_FNAME

    # Define the path to save the dev metadata file within the dataset directory
    dev_metadata_path = os.path.join(dataset_name, DEV_FNAME)

    # Check if original images from the DEV set should be downloaded
    if download_original_images:
        # Specify the folder within the dataset directory to save the images
        images_path = os.path.join(dataset_name, DEV_IMAGES_PATH)

        # Ensure the directory exists; creates it if it does not
        os.makedirs(images_path, exist_ok=True)

        # Initiate the download of image files for the dev set based on the metadata
        download_files_from_csv_file(dev_metadata_path, images_path)

    # Check if pixel data should be processed and extracted for RGB information analysis
    if download_pixel_data:
        # Name of the output file where the RGB data will be saved
        output_fname = 'rgb__dev.csv'

        # The input file (metadata path) to process for RGB data extraction
        input_fname = dev_metadata_path

        # Define settings for the RGB data extraction process, including number of regions
        runs = {f"{num_regions}_region_{output_fname}": {'RGB': True, 'regions': num_regions}}

        # Process the specified input file and save the RGB data to the defined output directory
        regionRoutine(input_fname, addIndex(runs), save_dir=pixel_data_path)
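When `download_original_images` is enabled, a failed request leaves no file behind, so it can be worth checking the image folder against the metadata afterwards. The helper below is a minimal sketch of such a check; `find_missing_images` is a hypothetical name, and it assumes the metadata CSV has the `image_name` column that `download_files_from_csv_file` reads.

```python
import csv
import os


def find_missing_images(metadata_csv, images_dir):
    """Return the image names listed in the metadata CSV that are absent or empty on disk."""
    missing = []
    with open(metadata_csv, newline="") as f:
        for row in csv.DictReader(f):
            path = os.path.join(images_dir, row["image_name"])
            if not os.path.isfile(path) or os.path.getsize(path) == 0:
                missing.append(row["image_name"])
    return missing


# Example with the default names used in this notebook.
metadata_csv = os.path.join("FHI2020_Stratified_Sampling", "metadata_dev.csv")
images_dir = os.path.join("FHI2020_Stratified_Sampling", "images_dev")
if os.path.isfile(metadata_csv):
    missing = find_missing_images(metadata_csv, images_dir)
    print(f"{len(missing)} image(s) missing from {images_dir}")
```

Rerunning the download only for the names in `missing` avoids fetching the whole set again.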


## Get TEST set

In [None]:
# **** TEST ****

# Proceed if downloading of the test dataset is enabled
if download_test:
    # Download the TEST metadata file using DVC. `--force` overwrites a local copy left by a previous run.
    !dvc get --force https://github.com/PaperAnalyticalDeviceND/pad_dataset_registry datasets/$dataset_name/$TEST_FNAME -o $dataset_name/$TEST_FNAME

    # Define the path to save the test metadata file within the dataset directory
    test_metadata_path = os.path.join(dataset_name, TEST_FNAME)

    # Check if original images from the TEST set should be downloaded
    if download_original_images:
        # Specify the folder within the dataset directory to save the images
        images_path = os.path.join(dataset_name, TEST_IMAGES_PATH)

        # Ensure the directory exists; creates it if it does not
        os.makedirs(images_path, exist_ok=True)

        # Initiate the download of image files for the test set based on the metadata
        download_files_from_csv_file(test_metadata_path, images_path)

    # Check if pixel data should be processed and extracted for RGB information analysis
    if download_pixel_data:
        # Name of the output file where the RGB data will be saved
        output_fname = 'rgb__test.csv'

        # The input file (metadata path) to process for RGB data extraction
        input_fname = test_metadata_path

        # Define settings for the RGB data extraction process, including the number of regions
        runs = {f"{num_regions}_region_{output_fname}": {'RGB': True, 'regions': num_regions}}

        # Process the specified input file and save the RGB data to the defined output directory
        regionRoutine(input_fname, addIndex(runs), save_dir=pixel_data_path)


# **Save** the dataset

## **Save Data to Google Drive (Recommended)**

To ensure the safety and accessibility of your dataset, we recommend saving it directly to a folder in your Google Drive. This section guides you through the process of mounting your Google Drive in this environment and copying the dataset folder to it.

In [None]:
from google.colab import drive

# Mount Google Drive to access its file system.
drive.mount('/content/drive')

# Specify the path in Google Drive where you want to save the dataset.
my_path = "/content/drive/MyDrive/"

# Copy the entire dataset directory to the specified path in Google Drive.
!cp -r $dataset_name/ $my_path

# Confirm the dataset has been copied and provide a direct link to Google Drive.
print(f"\nNow you can find the Dataset folder named `{dataset_name}` in your Google Drive.")
print("For quick access to your Google Drive, visit: https://drive.google.com/drive/u/0/my-drive")


> ## Or save it to your computer (slow)



> Uncomment the lines below to download the dataset to your computer as a zip archive



In [None]:
# from google.colab import files

# !zip -r $dataset_name.zip $dataset_name/ &> /dev/null
# files.download(f"{dataset_name}.zip")

# Visualize the metadata

In [None]:
# Visualize the metadata using pandas
import pandas as pd

data = pd.read_csv(dev_metadata_path)

data
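Beyond displaying the table, a per-column summary gives a quick sense of the metadata's completeness. The sketch below is one possible way to do this with pandas; `summarize_metadata` is a hypothetical helper, and the toy frame only uses the `image_name` and `url` columns that the download code above relies on.

```python
import pandas as pd


def summarize_metadata(df):
    """Return one row per column with its dtype, missing-value count, and distinct-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),
    })


# Toy frame standing in for the real metadata.
toy = pd.DataFrame({
    "image_name": ["a.png", "b.png", "c.png"],
    "url": ["https://example.org/a", "https://example.org/b", None],
})
print(summarize_metadata(toy))
```

On the real data, `summarize_metadata(data)` can be called on the frame loaded above.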