# PATHOverview file lister
This notebook lists all the .ndpi files in a given directory and its subdirectories and creates the Excel file required for adding slide details. Zoom, crop and rotation information can be read from an .ndpa file (see below).

To use:
1. Change the master_dir to the folder you want to list from.
2. Change output_dir to the folder where you want your PATHOverview output images to be generated. This folder must already exist. An Excel file containing the PATHOverview parameters will be created in this folder.
3. Run all cells
4. Edit the generated Excel file as needed then use it as the input for [pathoverview_interactive_example.ipynb](pathoverview_interactive_example.ipynb) or [pathoverview_create_figures.ipynb](pathoverview_create_figures.ipynb).


The Excel file generated can contain the following parameters for figure generation (* = required):<br>
[Parameters can also be set for all slides at the time of figure generation using pathofigure.fig_defaults[param]]
- Worksheet 'slides'
  - root*: folder containing the slide file
  - file*: filename for the slide file
  - label: label to be placed on figure output
  - page*: page number to put this slide on
  - order*: position of this slide on page (starting from 0)
  - rotation: rotation of this slide image in degrees
  - mirror: True/1 or False/0 for flipping the image
  - zoom_point: tuple; Point on the slide to zoom in on. Relative to width/height of the slide
  - crop: tuple; ((tuple of point to centre crop image on), width, height). Relative to width/height of the slide
  - fig_type:
    - None: overview (cropped and rotated) image with inset zoom
    - 'inverted': zoom with inset overview (cropped and rotated)
    - 'raw': raw overview of the slide (not cropped or rotated)
    - 'slide': raw with an image of the slide label
  - panel_size: tuple; (width, height) of the output image in pixels
  - add_inset: True/1 or False/0 to add an inset image
  - inset_size: tuple; (width, height) of the inset image in pixels
  - zoom_real_size: width in um of the zoom image
  - scale_bar: width in um of the main image scale bar
  - inset_scale_bar: width in um of the inset image scale bar
  - sb_label: True/1 or False/0; add a size label to the scale bars
  - wb_point: a point on the slide to take the background color from
  - mpp_x: microns per pixel scale information, this is read from ndpi files but must be specified for flat .tif images.
  - fill_color: color to use as a background when rotating images
  - sb_position: position of scale bar: 'bl' (bottom-left), 'br', 'tl', 'tr'
  - sb_color: color of scale bar (default '#000000')
  - bw: border width in pixels
  - b_color: border color
  - label_xpad: pixels to nudge label from left/right edge depending on alignment
  - label_ypad: pixels to nudge label from top/bottom edge depending on alignment
  - label_position: as sb_position
  - label_size: in pixels, defaults to 1/12 of image height
  - label_color: color of the label text
- Worksheet 'pages'
  - page*: number for this page (to map slides to)
  - title1: title string. This will be evaluated with \**globals at time of use
  - title2: title 2nd line
  - title3: title 3rd line
  - footer: footer string. Evaluated as title
  - filename*: filename with path for output file
  - figsize: tuple; size of the page in inches (for compatibility with matplotlib)
  - n_x: number of images wide per page
  - n_y: number of images high per page
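
Several of the parameters above are tuples given relative to the slide width and height, and they reach the spreadsheet as text. The sketch below shows one way to read such a cell back and turn a relative crop into an unrotated pixel box. The helper names `parse_cell` and `crop_to_pixels` are hypothetical and are not part of PATHOverview, whose own parsing may differ; the slide size is made up and rotation is ignored.

```python
import ast


def parse_cell(value):
    """Turn spreadsheet text such as '((0.5, 0.4), 0.2, 0.1)' into a tuple.

    Hypothetical helper: values that are not strings are returned unchanged.
    """
    if isinstance(value, str):
        return ast.literal_eval(value)
    return value


def crop_to_pixels(crop, slide_width, slide_height):
    """Convert a relative crop ((cx, cy), w, h) into (left, top, width, height) in pixels.

    Assumes every value is a fraction of the slide width or height and that
    the crop box is not rotated.
    """
    (cx, cy), rel_w, rel_h = crop
    box_w = rel_w * slide_width
    box_h = rel_h * slide_height
    left = cx * slide_width - box_w / 2
    top = cy * slide_height - box_h / 2
    return round(left), round(top), round(box_w), round(box_h)


# A crop centred on the slide, half its width and a quarter of its height
crop = parse_cell("((0.5, 0.5), 0.5, 0.25)")
print(crop_to_pixels(crop, 40000, 20000))  # (10000, 7500, 20000, 5000)
```

`ast.literal_eval` only accepts Python literals, so a mistyped cell raises an error instead of executing arbitrary code.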

In [1]:
# os package handles the file listing
import os

# pandas handles conversion between Excel sheets and dataframes
import pandas as pd
import numpy as np

# pathlib.Path from the standard library handles file paths consistently between Windows and Mac
from pathlib import Path

import pathoverview
# Import pathofigure to use default layout variables to set up pages
from pathoverview import pathofigure
# Import slide_obj to extract slide dimensions for annotation processing
from pathoverview import slide_obj

In [2]:
# With older OpenSlide install methods on Windows, the path to your OpenSlide binaries
# must be defined before the first slide_obj is created.
# This is not required/used on Mac or with newer OpenSlide installs.

# pathoverview.OPENSLIDE_PATH = Path(r'C:\openslide\bin')

In [3]:
# Path to the directory from which ndpi files will be listed
master_dir = Path(r"./test_files")

# Path to the directory to which files will be output. PATHOverview will fail if this does not exist.
output_dir = Path(r"./test_files")

# Name / path to the Excel file for output of the required dataframes
excel_output = Path(output_dir,"slide_lister_output.xlsx")

In [4]:
# Any files with the strings listed below in their filename will be ignored
exclude_contains = ["EXTRA", "EXCLUDE", "duplicates", "controls"]

# Default figure layout to assign slides to pages automatically
# For example, if this notebook is run on a dir containing
# 100 images, 5 pages of 4x6 images will be specified
n_x = pathofigure.fig_defaults["n_x"] # default = 4
n_y = pathofigure.fig_defaults["n_y"] # default = 6

# Overwrite figure layout
# n_x = 1
# n_y = 1

In [5]:
# A blank list to contain the files found, this is converted into a dataframe below
ndpi_list = []

# First walk through directories / folders
for root, dirs, files in os.walk(Path(master_dir)):

    # if root doesn't contain one of the excluded strings...
    if not any(excl in root for excl in exclude_contains):
    
        # Iterate through each file
        for file in files:
            
            # check that it is an ndpi file; this also skips .ndpi.ndpa annotation files
            if file.endswith(".ndpi"):
                
                # if file doesn't contain one of the excluded strings...
                if not any(excl in file for excl in exclude_contains):
                    
                    # add it to the ndpi_list
                    ndpi_list.append((root,file))

# How many files were found?
len(ndpi_list)

2
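
The same listing can also be written with `pathlib` alone. The sketch below is an alternative to the `os.walk` loop above, not the notebook's own method; the function name `list_slides` and the demo file names are made up for illustration.

```python
import tempfile
from pathlib import Path


def list_slides(master_dir, exclude_contains=(), suffix=".ndpi"):
    """Return sorted (root, file) pairs for slide files below master_dir."""
    found = []
    # rglob matches names ending in the suffix, so 'x.ndpi.ndpa' is not returned
    for path in Path(master_dir).rglob(f"*{suffix}"):
        if not path.is_file():
            continue
        # Skip the slide if an excluded string occurs in its folder or its name
        if any(excl in str(path.parent) or excl in path.name
               for excl in exclude_contains):
            continue
        found.append((str(path.parent), path.name))
    return sorted(found)


# Demonstration on a throwaway folder tree
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "batch1" / "controls").mkdir(parents=True)
    for name in ("batch1/OS-1.ndpi", "batch1/OS-1.ndpi.ndpa",
                 "batch1/controls/OS-3.ndpi", "OS-2_EXTRA.ndpi"):
        (base / name).touch()
    demo = [file for _, file in list_slides(base, ["EXTRA", "controls"])]

print(demo)  # ['OS-1.ndpi']
```

As in the loop above, the exclusion test is applied to the whole folder path, so an excluded string anywhere in the path to `master_dir` removes every slide.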

In [6]:
# Make a dataframe from the ndpi_list
# this contains the filename and the directory for each file
slides_df = pd.DataFrame(ndpi_list,columns=["root","file"])

# Show this df
slides_df

Unnamed: 0,root,file
0,test_files,OS-1.ndpi
1,test_files,OS-2.ndpi


In [7]:
# Sort the list by filename
slides_df = slides_df.sort_values(["file"]).reset_index(drop= True)
slides_df

Unnamed: 0,root,file
0,test_files,OS-1.ndpi
1,test_files,OS-2.ndpi


In [8]:
# Add the required columns (as blanks)
slides_df[["label","page","order","rotation","mirror","zoom_point","crop","panel_size",
          "fig_type","add_inset","inset_size","zoom_real_size","scale_bar","inset_scale_bar",
           "sb_label","wb_point"
          ]] = None

# add links to the Excel file for easy navigation to each ndpi file
# to use these links on Mac you need to drag the folder from Finder onto the
# Excel icon after opening the workbook (this adds the folder to the Excel sandbox)
slides_df["rel_path"] = slides_df.apply(lambda L: Path(os.path.relpath(Path(L.root, L.file), excel_output.parent)).as_posix(), axis=1)
slides_df["link"] = '=HYPERLINK("' + slides_df['rel_path'] + '", "Link")'

# assign pages and order for a filled default layout
max_panels = n_x * n_y
# use integer division to assign 'page' number to slides in order
slides_df['page'] = slides_df.groupby(np.arange(len(slides_df))//max_panels).ngroup()
# assign position of slides using cumulative count on each page group
slides_df['order'] = slides_df.groupby("page").cumcount()

slides_df

Unnamed: 0,root,file,label,page,order,rotation,mirror,zoom_point,crop,panel_size,fig_type,add_inset,inset_size,zoom_real_size,scale_bar,inset_scale_bar,sb_label,wb_point,rel_path,link
0,test_files,OS-1.ndpi,,0,0,,,,,,,,,,,,,,OS-1.ndpi,"=HYPERLINK(""OS-1.ndpi"", ""Link"")"
1,test_files,OS-2.ndpi,,0,1,,,,,,,,,,,,,,OS-2.ndpi,"=HYPERLINK(""OS-2.ndpi"", ""Link"")"
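
The page and order assignment is integer division and remainder on the slide's position in the sorted list. A minimal pure-Python sketch of the same arithmetic, using the default 4x6 layout (the function name `assign_layout` is hypothetical):

```python
import math


def assign_layout(n_slides, n_x=4, n_y=6):
    """Return a (page, order) pair for each slide index, filling pages in order."""
    max_panels = n_x * n_y
    # divmod gives (index // max_panels, index % max_panels)
    return [divmod(index, max_panels) for index in range(n_slides)]


layout = assign_layout(100)
n_pages = math.ceil(100 / (4 * 6))

print(n_pages)                 # 5
print(layout[23], layout[24])  # (0, 23) (1, 0)
print(layout[99])              # (4, 3)
```

With 100 slides the first four pages are full and the fifth holds the remaining four slides, at positions 0 to 3.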


### Read annotation files
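This cell reads .ndpa annotation files that sit next to the .ndpi slides and are named after the full slide filename plus an .ndpa suffix (for example OS-1.ndpi.ndpa). Zoom, crop, rotation and background information is extracted from them.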

Annotation information extracted:
- zoom_point: must be a pin named 'zoom'
- crop: can be 
   - rectangle named 'crop'. This will set rotation and crop bounds
   - pin named 'crop': This will be the centre of a crop box of full slide size. Use this if defining fixed 'crop_real_width' figures
- rotation: must be a horizontal ruler (drawn left to right) named 'rotation'. Use this when crop is not defined or is a pin
- background/whitebalance: a pin named 'background'. A 300x300um sample will be used to calculate background color
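
Annotations whose title or type does not match the list above are silently ignored by the cell below, so it can help to check a file first. The sketch below reports which annotations would be picked up. The XML is a hand-written sample that only mimics the elements the cell reads (`ndpviewstate`, `title`, `annotation` with a `displayname` attribute); the root tag and the coordinate values are assumptions, and real .ndpa files contain more fields.

```python
import xml.etree.ElementTree as ET

# (title, annotation type) pairs acted on, and the column each one fills
RECOGNISED = {
    ("zoom", "AnnotatePin"): "zoom_point",
    ("crop", "AnnotatePin"): "crop",
    ("crop", "AnnotateRectangle"): "crop",  # also sets rotation
    ("rotation", "AnnotateRuler"): "rotation",
    ("background", "AnnotatePin"): "wb_point",
}

# Hand-written sample, not a real annotation file
SAMPLE_NDPA = """<annotations>
  <ndpviewstate id="1">
    <title>zoom</title>
    <annotation displayname="AnnotatePin"><x>1200</x><y>-3400</y></annotation>
  </ndpviewstate>
  <ndpviewstate id="2">
    <title>rotation</title>
    <annotation displayname="AnnotateRuler">
      <x1>0</x1><y1>0</y1><x2>1000</x2><y2>1000</y2>
    </annotation>
  </ndpviewstate>
  <ndpviewstate id="3">
    <title>Zoom here</title>
    <annotation displayname="AnnotatePin"><x>5</x><y>5</y></annotation>
  </ndpviewstate>
</annotations>"""


def recognised_annotations(xml_text):
    """Return (title, type, target column or None) for every annotation."""
    report = []
    for viewstate in ET.fromstring(xml_text).findall("ndpviewstate"):
        title = viewstate.find("title").text
        ann_type = viewstate.find("annotation").get("displayname")
        report.append((title, ann_type, RECOGNISED.get((title, ann_type))))
    return report


for entry in recognised_annotations(SAMPLE_NDPA):
    print(entry)
```

The third annotation is reported with `None` because titles must match exactly, including case.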

In [9]:
# use python built in xml.etree to read ndpa file
import xml.etree.ElementTree as ET
from math import degrees, atan2, hypot

n_files = 0
for ix, row in slides_df.iterrows():
    ndpa_file = Path(row["root"], row["file"]+".ndpa")
    
    if ndpa_file.exists():
        with slide_obj(Path(row.get("root",""),row.get("file"))) as sld:
            ETroot = ET.parse(ndpa_file).getroot()
            for viewstate in ETroot.findall("ndpviewstate"):
                title = viewstate.find("title").text
                ann_type = viewstate.find("annotation").get("displayname")
                
                if title == "zoom" and ann_type == "AnnotatePin":
                    zoom_x = int(float(viewstate.find("annotation").find("x").text))
                    zoom_y = int(float(viewstate.find("annotation").find("y").text))
                    zoom_point = sld.ndpa_to_relative((zoom_x, zoom_y))
                    slides_df.at[ix, "zoom_point"] = zoom_point
                    
                elif title == "crop" and ann_type == "AnnotatePin":
                    crop_x = int(float(viewstate.find("annotation").find("x").text))
                    crop_y = int(float(viewstate.find("annotation").find("y").text))
                    crop_point = sld.ndpa_to_relative((crop_x, crop_y))
                    slides_df.at[ix, "crop"] = (crop_point,1,1)
                    
                elif title == "background" and ann_type == "AnnotatePin":
                    bg_x = int(float(viewstate.find("annotation").find("x").text))
                    bg_y = int(float(viewstate.find("annotation").find("y").text))
                    bg_point = sld.ndpa_to_relative((bg_x, bg_y))
                    slides_df.at[ix, "wb_point"] = bg_point
                    
                elif title == "rotation" and ann_type == "AnnotateRuler":
                    x1 = int(float(viewstate.find("annotation").find("x1").text))
                    y1 = int(float(viewstate.find("annotation").find("y1").text))
                    x2 = int(float(viewstate.find("annotation").find("x2").text))
                    y2 = int(float(viewstate.find("annotation").find("y2").text))
                    dx = x2 - x1
                    dy = y2 - y1
                    rot = degrees(atan2(dy,dx)) % 360
                    slides_df.at[ix, "rotation"] = rot
                    
                elif title == "crop" and ann_type == "AnnotateRectangle":
                    points = [] # will be top-left, bl, br, tr
                    for point in viewstate.find("annotation").find("pointlist").iterfind("point"):
                        points.append((int(float(point.find("x").text)), int(float(point.find("y").text))))
                    #do trig for rotation
                    dx = points[3][0] - points[0][0]
                    dy = points[3][1] - points[0][1]
                    rot = degrees(atan2(dy,dx)) % 360
                    slides_df.at[ix, "rotation"] = rot
                    #find midpoint
                    crop_x = (points[0][0] + points[2][0]) / 2
                    crop_y = (points[0][1] + points[2][1]) / 2
                    crop_point = sld.ndpa_to_relative((crop_x, crop_y))
                    #find rel width / height
                    w = hypot(points[3][0] - points[0][0], points[3][1] - points[0][1])
                    h = hypot(points[1][0] - points[0][0], points[1][1] - points[0][1])
                    wnm = sld.slide.dimensions[0] * sld.mpp_x * 1000
                    hnm = sld.slide.dimensions[1] * sld.mpp_x * 1000
                    crop = (crop_point, w/wnm, h/hnm)
                    slides_df.at[ix, "crop"] = crop
                    
        n_files += 1

print(f"NDPA files read: {n_files}")

NDPA files read: 2
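
The rectangle branch above derives rotation, centre and relative size from the four corners. The sketch below repeats that trigonometry on made-up corner coordinates so the result can be checked by hand. It assumes, as the `wnm` and `hnm` names in the cell suggest, that annotation coordinates are in nanometres; the slide size and `mpp_x` value are invented for the example.

```python
from math import atan2, degrees, hypot


def rectangle_geometry(points):
    """points: corners ordered top-left, bottom-left, bottom-right, top-right."""
    tl, bl, br, tr = points
    # Angle of the top edge (top-left to top-right) gives the rotation
    rotation = degrees(atan2(tr[1] - tl[1], tr[0] - tl[0])) % 360
    # Midpoint of the diagonal from top-left to bottom-right is the centre
    centre = ((tl[0] + br[0]) / 2, (tl[1] + br[1]) / 2)
    width = hypot(tr[0] - tl[0], tr[1] - tl[1])
    height = hypot(bl[0] - tl[0], bl[1] - tl[1])
    return rotation, centre, width, height


# A 1000 um x 500 um rectangle whose top edge follows the vector (800, 600) um
corners_nm = [(0, 0), (-300_000, 400_000), (500_000, 1_000_000), (800_000, 600_000)]
rotation, centre, width, height = rectangle_geometry(corners_nm)

# Invented slide: 40000 x 20000 pixels at 0.25 um per pixel
slide_w_px, slide_h_px, mpp_x = 40000, 20000, 0.25
rel_w = width / (slide_w_px * mpp_x * 1000)   # 1000 converts um to nm
rel_h = height / (slide_h_px * mpp_x * 1000)

print(round(rotation, 2), centre, round(rel_w, 3), round(rel_h, 3))
```

Here the top edge rises 600 for every 800 across, so the rotation is atan(600/800), about 36.87 degrees, and the crop covers a tenth of the slide in each direction.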


In [10]:
slides_df

Unnamed: 0,root,file,label,page,order,rotation,mirror,zoom_point,crop,panel_size,fig_type,add_inset,inset_size,zoom_real_size,scale_bar,inset_scale_bar,sb_label,wb_point,rel_path,link
0,test_files,OS-1.ndpi,,0,0,0.0,,"(2.1943359314891498e-08, 6.0167100760837555e-09)","((0.5616008712109375, 0.41987101660427517), 0....",,,,,,,,,,OS-1.ndpi,"=HYPERLINK(""OS-1.ndpi"", ""Link"")"
1,test_files,OS-2.ndpi,,0,1,0.0,,,"((0.48318689522901964, 0.5154225310384115), 1, 1)",,,,,,,,,,OS-2.ndpi,"=HYPERLINK(""OS-2.ndpi"", ""Link"")"


In [11]:
# write slides_df to the excel output, creating the excel file if it does not exist
if os.path.exists(excel_output):
    with pd.ExcelWriter(excel_output, mode="a", engine="openpyxl", if_sheet_exists="replace") as writer:
        slides_df.to_excel(writer, sheet_name="slides") 
else:
    with pd.ExcelWriter(excel_output, engine="openpyxl") as writer:
        slides_df.to_excel(writer, sheet_name="slides")

### Add a pages worksheet

In [12]:
# Create empty df
pages_df = pd.DataFrame()

# Add required columns
pages_df[["page","title1","title2","title3","footer","filename","figsize","n_x","n_y"]] = None

# Add an entry for all pages used in slides_df
pages_df["page"] = slides_df["page"].unique()

# Specify the footer to include page number
pages_df["footer"] = "Page " + pages_df["page"].astype(str) + ". Generated by PATHOverview."

# Generate the output filename
# Using '+' lets pandas build the string for every row of the df at once.
# PATHOverview formats the path when it is used, so the separator written here
# does not need to match the operating system
pages_df["filename"] = str(output_dir) + "\\page" + pages_df["page"].astype(str) + ".png"

pages_df["n_x"] = n_x
pages_df["n_y"] = n_y

# Show the df below
pages_df

Unnamed: 0,page,title1,title2,title3,footer,filename,figsize,n_x,n_y
0,0,,,,Page 0. Generated by PATHOverview.,test_files\page0.png,,4,6
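
The filename above is built with a hard-coded backslash. If forward slashes are preferred in the spreadsheet, `pathlib` can build the same name. This is a sketch of an alternative, not the notebook's method: `page_filename` is a hypothetical helper, and it assumes PATHOverview accepts forward-slash paths when it formats the filename.

```python
from pathlib import Path


def page_filename(output_dir, page, stem="page", suffix=".png"):
    """Build '<output_dir>/<stem><page><suffix>' using forward slashes."""
    return (Path(output_dir) / f"{stem}{page}{suffix}").as_posix()


print(page_filename("./test_files", 0))  # test_files/page0.png
print(page_filename("./test_files", 12))  # test_files/page12.png
```

It could be applied per row with a list comprehension, for example `[page_filename(output_dir, p) for p in pages_df["page"]]`.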


In [13]:
# write pages_df to the excel output, creating the excel file if it does not exist
if os.path.exists(excel_output):
    with pd.ExcelWriter(excel_output, mode="a", engine="openpyxl", if_sheet_exists="replace") as writer:
        pages_df.to_excel(writer, sheet_name="pages") 
else:
    with pd.ExcelWriter(excel_output, engine="openpyxl") as writer:
        pages_df.to_excel(writer, sheet_name="pages")