# Downloading raw HRTEM images of nanoparticles (and their corresponding labels)

As described in "Developing Robust Neural Networks for HRTEM Image Analysis" (manuscript link tba), we have acquired and manually labeled multiple HRTEM images of nanoparticles under controlled sample and imaging conditions. This repository of HRTEM images, together with their labels and metadata, is available via NERSC [here](https://portal.nersc.gov/project/m3795/hrtem-generalization/). All images are raw data taken with a OneView camera and saved in the dm3 format, which can be opened using the [ncempy Python package](https://openncem.readthedocs.io/en/development/ncempy.html), or outside of Python using Digital Micrograph (also known as Gatan Microscopy Suite) or ImageJ/Fiji. While the uploaded datasets have associated jpeg images, we do not recommend using those for anything other than quickly viewing the dataset, because they were generated with a variety of colormap mapping procedures.
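As a minimal sketch of opening one of the dm3 files in Python, the snippet below uses `ncempy.io.dm.dmReader`. The helper name `load_dm3` and the example path are hypothetical, and the dictionary keys (`data`, `pixelSize`, `pixelUnit`) are what we understand `dmReader` to return, so check them against your installed ncempy version.

```python
def load_dm3(path):
    """Read a dm3 file and return (image array, pixel sizes, pixel units)."""
    # Imported here so the helper can be defined even before ncempy is installed
    from ncempy.io import dm

    contents = dm.dmReader(path)  # dictionary holding the image and its calibration
    return contents['data'], contents['pixelSize'], contents['pixelUnit']

# Hypothetical usage, assuming a file has already been downloaded:
# image, pixel_size, pixel_unit = load_dm3('hrtem_files/<folder>/<file>.dm3')
# print(image.shape, pixel_size, pixel_unit)
```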

All images are 4096 x 4096 pixels in size and about 67 MB each. All labels are single-channel images, where a pixel value of 1 corresponds to our estimate of the nanoparticle location. Below is an example of an HRTEM image of 5 nm Au nanoparticles taken at 330kX magnification (0.02 nm/pixel), and its corresponding hand-drawn label.
<p align="center">
    <img src="/imgs/20221109_Au_UTC_330kx_2640e_0p1596s_03.jpeg" width="400" height="400" />
    <img src="/imgs/20221109_Au_UTC_330kx_2640e_0p1596s_03_label.png" width="400" height="400" />
</p>
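
To give a feel for the scale of the example above, here is a short worked calculation using only the numbers quoted in the text (4096 pixels per edge, 0.02 nm/pixel, 5 nm particles). The variable names are ours and purely illustrative.

```python
# Values quoted in the text above
image_width_px = 4096          # pixels along one edge
pixel_scale_nm = 0.02          # nm per pixel at 330kX
particle_diameter_nm = 5       # nominal Au nanoparticle size

# Field of view along one edge, in nm
field_of_view_nm = image_width_px * pixel_scale_nm

# Approximate particle diameter in pixels
particle_diameter_px = particle_diameter_nm / pixel_scale_nm

print(f"Field of view: {field_of_view_nm:.2f} nm per edge")
print(f"Particle diameter: about {particle_diameter_px:.0f} pixels")
```

So each image covers roughly 82 nm per edge, and a nominal 5 nm particle spans about 250 pixels.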

Since this is a rather large repository of images, we have written some code below that enables downloading subsets of the database without having to download the entire ~40 GB of data.

This general repository sharing and download structure was made possible with help from Alexander Rakowski.

# Import packages and routines

In [1]:
# Import packages
import pandas as pd
import os
import requests
import shutil
import functools
from tqdm import tqdm

In [2]:
# This is a routine that creates a pandas DataFrame with the data that meets the subset requirements
def subset_datafile(dataframe,subset_reqs):
    # dataframe is the pandas DataFrame with all of the metadata
    # subset_reqs is a list of tuples with the metadata requirements for the subset
    # The tuples either have 2 inputs (metadata, value) or 3 inputs (metadata, value, margin), and value can either be a string or quantitative number
    subset = dataframe
    subset_reqs_list = subset_reqs.copy()
    while len(subset_reqs_list)>0:
        listing = subset_reqs_list.pop()
        if len(listing)==2:
            heading = listing[0]
            value = listing[1]
            threshold = 1e-5
        else:
            heading = listing[0]
            value = listing[1]
            threshold = listing[2]
        if isinstance(value, str):
            subset = subset[subset[heading].isin([value])]
        else:
            diff = subset[heading]-value
            subset = subset[diff.abs()<threshold]
    return subset
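
The same filtering idea can also be written as a single boolean mask. The sketch below is an illustrative alternative, not the routine used in this notebook: the name `filter_metadata` and the toy rows are made up, and it assumes each requirement is a `(heading, value)` or `(heading, value, margin)` tuple as described above.

```python
import pandas as pd

# Toy metadata table; the rows are made up for illustration only
toy = pd.DataFrame({
    'Material': ['Au', 'Ag', 'Ag', 'Ag'],
    'Dosage (e/A2)': [420, 423, 440, 900],
    'Pixel Scale (nm)': [0.033, 0.02, 0.02, 0.02],
})

def filter_metadata(dataframe, subset_reqs, default_margin=1e-5):
    """Return the rows that satisfy every (heading, value[, margin]) requirement."""
    mask = pd.Series(True, index=dataframe.index)
    for heading, value, *rest in subset_reqs:
        if isinstance(value, str):
            # Strings must match exactly
            mask &= dataframe[heading] == value
        else:
            # Numbers must lie within the margin of the requested value
            margin = rest[0] if rest else default_margin
            mask &= (dataframe[heading] - value).abs() < margin
    return dataframe[mask]

toy_subset = filter_metadata(toy, [('Material', 'Ag'),
                                   ('Dosage (e/A2)', 423, 20),
                                   ('Pixel Scale (nm)', 0.02)])
print(toy_subset)
```

On this toy table only the two Ag rows with dosages 423 and 440 remain, since 900 lies outside the requested margin.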

In [3]:
#This function takes the subset and downloads the associated dm3 files and labels from NERSC
def download_images_NERSC(subset,download_location = 'hrtem_files', download_labels=True, label_ending='_label.png'):
    # Make folder in which images and labels will be downloaded 
    os.makedirs(download_location, exist_ok=True)
    
    #NERSC URL without a trailing slash, since '/' is added when the paths are joined below (constant unless project folder gets moved)
    nersc_url = 'https://portal.nersc.gov/project/m3795/hrtem-generalization'
    
    # Tracker variables for how many imgs and labels are downloaded
    num_imgs = 0
    num_lbls = 0
    
    # For every file in the subset dataframe
    for i in range(subset.shape[0]):
        #Get the i-th file
        file = subset.iloc[i] 
        
        # if local folder for this file doesn't exist, make it (including any nested folders)
        os.makedirs(os.path.join(download_location,file['Folder']), exist_ok=True)
        
        # define the locations
        download_url = nersc_url + '/' + file['Folder'] + '/' + file['File name']
        local_path = os.path.join(download_location,file['Folder'],file['File name'])
        
        # grab and download the data
        # only download if file is not already downloaded
        if not os.path.exists(local_path):
            response = requests.get(download_url, stream=True)
            
            # adding a progress bar
            file_size = int(response.headers.get('content-length',0))
            desc = "(Unknown total file size)" if file_size == 0 else ""
            response.raw.read = functools.partial(response.raw.read, decode_content=True)  # Decompress if needed
            
            if response.status_code == 200: #check if there's a file at the download_url provided
                with tqdm.wrapattr(response.raw, "read", total=file_size, desc=desc) as r_raw:
                    with open(local_path, 'wb') as output_file:
                        shutil.copyfileobj(r_raw, output_file)
                        del response
                num_imgs +=1
        
        # if you also want to download the corresponding label
        if download_labels:
            # make Labels folder in session folder, if not already there
            os.makedirs(os.path.join(download_location,file['Folder'],'Labels'), exist_ok=True)
            
            # Label locations
            lbl_name = file['File name'].split('.')[0]+label_ending
            lbl_download_url = nersc_url + '/' + file['Folder'] + '/Labels/' + lbl_name
            lbl_local_path = os.path.join(download_location,file['Folder'],'Labels',lbl_name)
            
            # Download label
            if not os.path.exists(lbl_local_path):
                response = requests.get(lbl_download_url, stream=True)
                
                # adding a progress bar
                file_size = int(response.headers.get('content-length',0))
                desc = "(Unknown total file size)" if file_size == 0 else ""
                response.raw.read = functools.partial(response.raw.read, decode_content=True)  # Decompress if needed
                
                if response.status_code == 200: #check if there's a file at the download_url provided
                    with tqdm.wrapattr(response.raw, "read", total=file_size, desc=desc) as r_raw:
                        with open(lbl_local_path, 'wb') as output_file:
                            shutil.copyfileobj(r_raw, output_file)
                            del response
                    num_lbls +=1
                    
    print(str(num_imgs)+' images and ' + str(num_lbls) + ' labels were downloaded')
    
    return None
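
To check where a given file will come from and where it will land, without touching the network, the path logic can be sketched on its own. The helper `build_paths` below is hypothetical; it assumes the folder layout used by the download routine (images in the session folder, labels in a `Labels` subfolder) and percent-encodes spaces with `urllib.parse.quote` so the printed URL can be pasted into a browser. `requests` normally takes care of this encoding itself.

```python
import os
from urllib.parse import quote

NERSC_URL = 'https://portal.nersc.gov/project/m3795/hrtem-generalization'

def build_paths(folder, file_name, download_location='hrtem_files',
                label_ending='_label.png'):
    """Return the remote URLs and local paths for one image and its label."""
    label_name = os.path.splitext(file_name)[0] + label_ending
    # quote() percent-encodes spaces so the URL is valid when printed or pasted
    image_url = f"{NERSC_URL}/{quote(folder)}/{quote(file_name)}"
    label_url = f"{NERSC_URL}/{quote(folder)}/Labels/{quote(label_name)}"
    image_path = os.path.join(download_location, folder, file_name)
    label_path = os.path.join(download_location, folder, 'Labels', label_name)
    return image_url, label_url, image_path, label_path

paths = build_paths('2022_02_02 5nm Ag nanoparticles on UTC',
                    '20220202_Ag_UTC_330kx_2650e_0p1596s_01.dm3')
for p in paths:
    print(p)
```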

# Example of Downloading Subset of Images from Repository

Here, we walk through an example of downloading a subset of images with specified metadata attributes from the full image repository. As an example, we download only images of 5 nm Ag nanoparticles taken at 330kX magnification (0.02 nm/pixel) at approximately the same dosage of 423 e/A^2 (the dataset used in Figure 3d of the paper).

First, we need to import the metadata information:

In [4]:
#Import the spreadsheet of filenames and image attributes
spreadsheet_file = 'Dataset_metadata.csv'
file_list = pd.read_csv(spreadsheet_file)

Let's look at how the metadata is stored:

In [5]:
file_list.head()

| Unnamed: 0 | File name | Date | Material | Nanoparticle Size (nm) | Nanoparticle Shape | Support | Instrument | Dosage (e/A2) | Pixel Scale (nm) | Folder |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Au_UTC_205kX_2630e_0p1596s_01.dm3 | 08/26/21 | Au | 5.0 | Sphere | UT Carbon | Team05 | 420 | 0.033 | 2021_08_26 5nm Au nanoparticles on C/Ultrathin... |
| 1 | Au_UTC_205kX_2630e_0p1596s_02.dm3 | 08/26/21 | Au | 5.0 | Sphere | UT Carbon | Team05 | 420 | 0.033 | 2021_08_26 5nm Au nanoparticles on C/Ultrathin... |
| 2 | Au_UTC_205kX_2630e_0p1596s_03.dm3 | 08/26/21 | Au | 5.0 | Sphere | UT Carbon | Team05 | 420 | 0.033 | 2021_08_26 5nm Au nanoparticles on C/Ultrathin... |
| 3 | Au_UTC_205kX_2630e_0p1596s_04.dm3 | 08/26/21 | Au | 5.0 | Sphere | UT Carbon | Team05 | 420 | 0.033 | 2021_08_26 5nm Au nanoparticles on C/Ultrathin... |
| 4 | Au_UTC_205kX_2630e_0p1596s_05.dm3 | 08/26/21 | Au | 5.0 | Sphere | UT Carbon | Team05 | 420 | 0.033 | 2021_08_26 5nm Au nanoparticles on C/Ultrathin... |


Here we can see the various attributes over which we can sort and subset the data. As mentioned above, we want to grab all the images of 5nm Ag nanoparticles at specified microscope conditions.
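
Before choosing subset values, it can help to list which values each attribute actually takes. The sketch below does this on a toy stand-in for `file_list` (the rows are made up); the same two calls, `value_counts` and `unique`, should work on the real metadata table.

```python
import pandas as pd

# Toy stand-in for file_list; the rows are made up for illustration only
toy = pd.DataFrame({
    'Material': ['Au', 'Au', 'Ag', 'Ag'],
    'Nanoparticle Size (nm)': [5.0, 5.0, 5.0, 5.0],
    'Dosage (e/A2)': [420, 420, 423, 423],
})

# Number of images per material
material_counts = toy['Material'].value_counts().to_dict()
print(material_counts)

# Distinct values available in each column
distinct_values = {column: sorted(toy[column].unique().tolist()) for column in toy.columns}
for column, values in distinct_values.items():
    print(column, values)
```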

In [6]:
subset_reqs = [('Material', 'Ag'), 
              ('Nanoparticle Size (nm)', 5),
              ('Dosage (e/A2)', 423, 20), 
              ('Pixel Scale (nm)', 0.02)]

The subset requirements need to be a list of tuples, with each tuple specifying how to sort or subset the metadata. These tuples can have either 2 or 3 entries. The first entry specifies the metadata header, i.e. the attribute you wish to sort by. The second entry specifies the value that all subset entries need to have. The third entry (if specified) gives the margin around that value; in this example, we take any images that have a dosage within 423 $\pm$ 20 e/A^2. If no margin is given, numeric values are matched with a default margin of 1e-5.
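
To make the tuple semantics concrete, here is a small sketch that checks a single metadata value against a single requirement. The helper `matches` is hypothetical and only mirrors the rules described above, including the strict "less than" comparison used for numeric margins.

```python
def matches(requirement, entry_value, default_margin=1e-5):
    """Check one metadata value against a (heading, value[, margin]) tuple."""
    heading, value, *rest = requirement
    if isinstance(value, str):
        return entry_value == value            # strings must match exactly
    margin = rest[0] if rest else default_margin
    return abs(entry_value - value) < margin   # numbers must fall inside the margin

dose_req = ('Dosage (e/A2)', 423, 20)
print(matches(dose_req, 410))                      # inside 423 +/- 20
print(matches(dose_req, 450))                      # outside the margin
print(matches(('Pixel Scale (nm)', 0.02), 0.02))   # default margin of 1e-5
print(matches(('Material', 'Ag'), 'Au'))           # string mismatch
```

Because the comparison is strict, a dosage of exactly 443 would not pass a margin of 20.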

Now, we can create a new DataFrame object that only has data with the metadata values we've specified above:

In [7]:
subset_ag = subset_datafile(file_list,subset_reqs)
#Print out the files that are in this subset, makes it easy to double check we have the correct files
display(subset_ag)

| Unnamed: 0 | File name | Date | Material | Nanoparticle Size (nm) | Nanoparticle Shape | Support | Instrument | Dosage (e/A2) | Pixel Scale (nm) | Folder |
|---|---|---|---|---|---|---|---|---|---|---|
| 239 | 20220202_Ag_UTC_330kx_2650e_0p1596s_01.dm3 | 02/02/22 | Ag | 5.0 | Sphere | UT Carbon | Team05 | 423 | 0.02 | 2022_02_02 5nm Ag nanoparticles on UTC |
| 240 | 20220202_Ag_UTC_330kx_2650e_0p1596s_02.dm3 | 02/02/22 | Ag | 5.0 | Sphere | UT Carbon | Team05 | 423 | 0.02 | 2022_02_02 5nm Ag nanoparticles on UTC |
| 241 | 20220202_Ag_UTC_330kx_2650e_0p1596s_03.dm3 | 02/02/22 | Ag | 5.0 | Sphere | UT Carbon | Team05 | 423 | 0.02 | 2022_02_02 5nm Ag nanoparticles on UTC |
| 242 | 20220202_Ag_UTC_330kx_2650e_0p1596s_04.dm3 | 02/02/22 | Ag | 5.0 | Sphere | UT Carbon | Team05 | 423 | 0.02 | 2022_02_02 5nm Ag nanoparticles on UTC |
| 243 | 20220202_Ag_UTC_330kx_2650e_0p1596s_05.dm3 | 02/02/22 | Ag | 5.0 | Sphere | UT Carbon | Team05 | 423 | 0.02 | 2022_02_02 5nm Ag nanoparticles on UTC |
| 244 | 20220202_Ag_UTC_330kx_2650e_0p1596s_06.dm3 | 02/02/22 | Ag | 5.0 | Sphere | UT Carbon | Team05 | 423 | 0.02 | 2022_02_02 5nm Ag nanoparticles on UTC |
| 245 | 20220202_Ag_UTC_330kx_2650e_0p1596s_07.dm3 | 02/02/22 | Ag | 5.0 | Sphere | UT Carbon | Team05 | 423 | 0.02 | 2022_02_02 5nm Ag nanoparticles on UTC |
| 246 | 20220202_Ag_UTC_330kx_2650e_0p1596s_08.dm3 | 02/02/22 | Ag | 5.0 | Sphere | UT Carbon | Team05 | 423 | 0.02 | 2022_02_02 5nm Ag nanoparticles on UTC |
| 247 | 20220202_Ag_UTC_330kx_2650e_0p1596s_09.dm3 | 02/02/22 | Ag | 5.0 | Sphere | UT Carbon | Team05 | 423 | 0.02 | 2022_02_02 5nm Ag nanoparticles on UTC |
| 248 | 20220202_Ag_UTC_330kx_2650e_0p1596s_10.dm3 | 02/02/22 | Ag | 5.0 | Sphere | UT Carbon | Team05 | 423 | 0.02 | 2022_02_02 5nm Ag nanoparticles on UTC |


We see that there are a total of 14 images. Let's download just the image data locally. By default, the code below downloads the files and folder structure into a new folder called "hrtem_files"; you can change this location with the download_location argument. Labels are also downloaded by default, so for now we turn this off with download_labels=False.
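
Before starting a download it is worth estimating how much data will arrive. The back-of-the-envelope calculation below uses the 14 images in this subset and the approximate per-image size of 67 MB quoted earlier; the variable names are ours.

```python
n_images = 14            # number of images in the subset, as stated above
size_per_image_mb = 67   # approximate dm3 size quoted earlier in this notebook

estimated_total_mb = n_images * size_per_image_mb
print(f"Expect roughly {estimated_total_mb} MB ({estimated_total_mb / 1000:.2f} GB)")
```

For this subset that comes to a little under 1 GB, compared with ~40 GB for the whole repository.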

In [9]:
download_images_NERSC(subset_ag, download_location = 'hrtem_files', download_labels=False)

100%|███████████████████████████████████████| 64.6M/64.6M [00:00<00:00, 118MB/s]
100%|███████████████████████████████████████| 64.6M/64.6M [00:00<00:00, 118MB/s]
100%|███████████████████████████████████████| 64.6M/64.6M [00:00<00:00, 117MB/s]
100%|███████████████████████████████████████| 64.6M/64.6M [00:00<00:00, 115MB/s]
100%|███████████████████████████████████████| 64.6M/64.6M [00:00<00:00, 118MB/s]
100%|███████████████████████████████████████| 64.6M/64.6M [00:00<00:00, 117MB/s]
100%|███████████████████████████████████████| 64.6M/64.6M [00:00<00:00, 118MB/s]
100%|███████████████████████████████████████| 64.6M/64.6M [00:00<00:00, 118MB/s]
100%|███████████████████████████████████████| 64.6M/64.6M [00:00<00:00, 117MB/s]
100%|███████████████████████████████████████| 64.6M/64.6M [00:00<00:00, 117MB/s]
100%|███████████████████████████████████████| 64.6M/64.6M [00:00<00:00, 118MB/s]
100%|███████████████████████████████████████| 64.6M/64.6M [00:00<00:00, 118MB/s]
100%|███████████████████████

14 images and 0 labels were downloaded





The raw dm3 files should now be downloaded locally. To also download the corresponding labels, run the function again with download_labels=True. The code checks whether each file already exists locally and skips it if so, so existing files are neither re-downloaded nor overwritten. If you later download another subset that overlaps with files you already have, only the missing files are fetched.
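
The skip-if-present behavior boils down to an `os.path.exists` check per file. As a rough sketch, the hypothetical helper `missing_files` below lists which files of a subset are still absent locally; the folder and file names are made up and the demonstration runs in a temporary directory.

```python
import os
import tempfile

def missing_files(download_location, relative_paths):
    """Return the relative paths that are not yet present under download_location."""
    return [path for path in relative_paths
            if not os.path.exists(os.path.join(download_location, path))]

# Demonstration in a temporary folder with hypothetical file names
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'session_a'))
    # Pretend image_01 was downloaded earlier by creating an empty file
    open(os.path.join(tmp, 'session_a', 'image_01.dm3'), 'wb').close()

    wanted = [os.path.join('session_a', 'image_01.dm3'),
              os.path.join('session_a', 'image_02.dm3')]
    still_needed = missing_files(tmp, wanted)

print(still_needed)
```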

In [10]:
download_images_NERSC(subset_ag,download_location = 'hrtem_files', download_labels=True)

100%|██████████████████████████████████████| 33.7k/33.7k [00:00<00:00, 8.27MB/s]
100%|██████████████████████████████████████| 45.6k/45.6k [00:00<00:00, 26.4MB/s]
100%|██████████████████████████████████████| 54.6k/54.6k [00:00<00:00, 49.6MB/s]
100%|██████████████████████████████████████| 29.8k/29.8k [00:00<00:00, 19.3MB/s]
100%|██████████████████████████████████████| 28.2k/28.2k [00:00<00:00, 19.1MB/s]
100%|██████████████████████████████████████| 36.1k/36.1k [00:00<00:00, 20.9MB/s]
100%|██████████████████████████████████████| 34.3k/34.3k [00:00<00:00, 20.1MB/s]
100%|██████████████████████████████████████| 24.4k/24.4k [00:00<00:00, 19.9MB/s]
100%|██████████████████████████████████████| 24.0k/24.0k [00:00<00:00, 2.42MB/s]
100%|██████████████████████████████████████| 26.6k/26.6k [00:00<00:00, 22.7MB/s]
100%|██████████████████████████████████████| 26.4k/26.4k [00:00<00:00, 1.70MB/s]
100%|██████████████████████████████████████| 35.8k/35.8k [00:00<00:00, 25.8MB/s]
100%|███████████████████████

0 images and 14 labels were downloaded





Note that subset_ag is the DataFrame of the files that make up this dataset, so you can either save it for later use or re-create it at any time from the all-encompassing Dataset_metadata.csv file.
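
One simple way to save a subset for later is to write it to its own CSV file. The sketch below uses a toy stand-in for `subset_ag` (the rows and the output file name `subset_ag.csv` are made up) and writes into a temporary folder. Passing `index=False` avoids storing the row index as an extra unnamed column, which is likely how the `Unnamed: 0` column seen in the metadata table arises.

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for subset_ag; the rows are made up for illustration only
toy_subset = pd.DataFrame({
    'File name': ['image_01.dm3', 'image_02.dm3'],
    'Material': ['Ag', 'Ag'],
    'Folder': ['session_a', 'session_a'],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'subset_ag.csv')
    # index=False avoids writing the row index as an extra unnamed column
    toy_subset.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

print(list(reloaded.columns), len(reloaded))
```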