<font size=8>  Setting up the Collection Space Navigator

In this How-To guide you will produce all necessary files to create a custom version of the Collection Space Navigator (CSN).

Project link: https://collection-space-navigator.github.io/ 

> Note: We highly recommend getting familiar with the example collection before trying it with your own data.

<font size=6> 1) Prepare Collection Data

Loads the dataset and sets up the metadata.
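
As a rough illustration of the inputs, the sketch below pairs a tiny metadata table with an embeddings table and checks that they line up row by row, which is what the loading cell further down verifies. The column names `filename`, `Contrast`, and `d0` to `d2` are made up for this example; only the optional `id` column in the embeddings file is taken from the notebook, where it is dropped before use.

```python
import csv
import io

# Hypothetical two-row inputs; real files are read from the paths set in the dialog.
metadata_csv = "filename,Contrast\nimg_0.jpg,0.4\nimg_1.jpg,0.7\n"
embeddings_csv = "id,d0,d1,d2\n0,0.1,0.2,0.3\n1,0.4,0.5,0.6\n"

meta_rows = list(csv.DictReader(io.StringIO(metadata_csv)))
# Drop the 'id' column and keep the remaining values as one vector per row.
vec_rows = [
    [float(v) for k, v in row.items() if k != "id"]
    for row in csv.DictReader(io.StringIO(embeddings_csv))
]

# Row i of the metadata is assumed to describe the same image as row i of the embeddings.
assert len(meta_rows) == len(vec_rows)
print(len(meta_rows), "rows,", len(vec_rows[0]), "dimensions")
```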



In [None]:
#@title Import Libraries
#@markdown Installs all necessary libraries! Optional libraries will be installed only if needed.

import json, math, os, io

try:
  import pandas as pd
except:
  print("Installing pandas via Pip")
  !pip install pandas --quiet
  import pandas as pd

try:
  import numpy as np
except:
  print("Installing numpy via Pip")
  !pip install numpy --quiet
  import numpy as np

try:
  from tqdm import tqdm
except:
  print("Installing tqdm via Pip")
  !pip install tqdm --quiet
  from tqdm import tqdm

try:
  import ipywidgets as widgets
  from ipywidgets import interactive,HBox,VBox,Label
except:
  print("Installing ipywidgets via Pip")
  !pip install ipywidgets --quiet
  import ipywidgets as widgets
  from ipywidgets import interactive,HBox,VBox,Label

try:
  from IPython.display import display
except:
  print("Installing IPython via Pip")
  !pip install ipython --quiet
  from IPython.display import display

try:
  from sklearn.preprocessing import StandardScaler
except:
  print("Installing scikit-learn via Pip")
  !pip install scikit-learn --quiet
  from sklearn.preprocessing import StandardScaler

In [None]:
#@title (optional) Mount Google Drive
#@markdown >Note: This is not necessary if you work with example data. Mounting Google Drive gives your Colab Notebook instance access to your data, and does not share access with the CSN authors.
def mount_gdrive(v):
    try:
      from google.colab import drive
      # drive.mount(drive_path,force_remount=False)
      buttonGDrive.description="mounting..."
      buttonGDrive.disabled=True
      drive.mount('/content/gdrive',force_remount=True)
    except:
      print("...error mounting drive")
      buttonGDrive.description="mounting failed!"
    else:
      buttonGDrive.description="successfully mounted"

layoutButtons = {'width': '210px'}
buttonGDrive = widgets.Button(description='mount Google Drive',icon='check',indent=True,layout=layoutButtons)
buttonGDrive.on_click(mount_gdrive)

display(buttonGDrive)


In [None]:
#@title Define INPUT
#@markdown >Note: Running this cell opens a dialog in which the input data can be defined. Use the provided example data (recommended first) or your own.


style = {'description_width': '250px'}
layout = {'width': '600px', 'justify-content': 'flex-start'}
# layoutButtons = {'width': '210px'}
useExample = widgets.Checkbox(value=True,description='use example data',indent=True)
datasetTitle = widgets.Text(placeholder='title of the dataset', description='Title:', style=style, layout=layout, value = "Testset")
description = widgets.Textarea(placeholder='Short description of the dataset and method(s)', description='Description (optional):', style=style, layout=layout, value="")
embeddingsLocation = widgets.Text(placeholder='path to embeddings file (.csv)', description='Embeddings Filepath (optional):', style=style, layout=layout, value = "CSN/example_data/embeddings_testset.csv")
metadataLocation = widgets.Text(placeholder='path to metadata file (.csv)', description='Metadata Filepath:', style=style, layout=layout, value = "CSN/example_data/metadata_testset.csv")
imageLocation = widgets.Text(placeholder='path to image collection folder', description='Image Folder:', style=style, layout=layout, value = "CSN/example_data/testset_images/")
imageWebLocation = widgets.Text(placeholder='URL prefix to public image directory', description='Image URL prefix:',style=style, layout=layout, value =  "https://github.com/Collection-Space-Navigator/CSN/raw/main/example_data/testset_images/")
# buildTool = widgets.RadioButtons(options=[('create new tool and dataset',1), ('add new dataset to existing tool',2)],value=1,description='Building:',style=style, layout=layout)
# uploadFile = widgets.FileUpload(accept='.json',multiple=False,description="upload 'datasets_config.json'", layout=layoutButtons)
# buttonGDrive = widgets.Button(description='mount Google Drive',icon='check',indent=True,layout=layoutButtons)
# buttonGDrive.on_click(mount_gdrive)
# i = interactive(getOptions, BUILD = buildTool)
if os.path.exists("/content/gdrive/MyDrive/"):
    imageLocation.value = "/content/gdrive/MyDrive/CSN/example_data/testset_images/"
    embeddingsLocation.value = "/content/gdrive/MyDrive/CSN/example_data/embeddings_testset.csv"
    metadataLocation.value = "/content/gdrive/MyDrive/CSN/example_data/metadata_testset.csv"  
else:
    imageLocation.value = "CSN/example_data/testset_images/"
    embeddingsLocation.value = "CSN/example_data/embeddings_testset.csv"
    metadataLocation.value = "CSN/example_data/metadata_testset.csv"  

imageWebLocation.value =  "https://github.com/Collection-Space-Navigator/CSN/raw/main/example_data/testset_images/"


subset = widgets.Checkbox(value=False,description='make subset',indent=True)
subsetSize = widgets.BoundedIntText(value=2048,min=10,max=9999999, step=1,description='Subset size:')
def makeSubset(SUBSET):
    if SUBSET:
        display(subsetSize)
    else:
        pass  # subset disabled: the size field stays hidden and is ignored
i = interactive(makeSubset, SUBSET = subset)
left = VBox([datasetTitle, description, embeddingsLocation, metadataLocation, imageLocation, imageWebLocation])
right = VBox([useExample, i])
display(HBox([left,right]))



In [None]:
#@title Load INPUT files
#@markdown Loads and checks all files.

#@markdown >Note: Example data will be downloaded only if needed.

if useExample.value == True:

  if os.path.exists("gdrive/MyDrive/"):
    print("using gdrive/MyDrive")
    %cd gdrive/MyDrive

  if not os.path.exists("CSN"):
    os.makedirs("CSN")
    print("created new directory 'CSN'")
  %cd CSN
  try:
    !git init
    !git remote add -f origin https://github.com/Collection-Space-Navigator/CSN
    !git config core.sparseCheckout true
    !echo "example_data" >> .git/info/sparse-checkout
    print("downloading example dataset...")
    !git pull origin main
    print("...done")
  except Exception as e:
    print(e)
  %cd -

imagNumb = len(os.listdir(imageLocation.value))
print(f'found {imagNumb} files in {imageLocation.value}')

metadata = pd.read_csv(metadataLocation.value, skipinitialspace=True)
if subset.value:
    metadata = metadata[:subsetSize.value]
metaNumb = len(metadata)
print(f'found {metaNumb} entries in {metadataLocation.value}')

if embeddingsLocation.value != "":
  embeddings = pd.read_csv(embeddingsLocation.value, skipinitialspace=True)
  embeddings = embeddings.loc[:, embeddings.columns!='id']
  if subset.value:
    embeddings = embeddings[:subsetSize.value]
  vecNumb = len(embeddings)
  print(f'found {vecNumb} entries in {embeddingsLocation.value}')

  if metaNumb == vecNumb:
    if vecNumb <= imagNumb:
      print("Looks ok.")
      print()
      print(f'Embedding file contains {vecNumb} vectors in {len(embeddings.columns)} dimensions.')
      print("Metadata Head:")
      print(metadata.head())
    else:
      print()
      print("ERROR: number of images is smaller than number of vectors")

if metaNumb <= imagNumb:
  print("Looks ok.")
  print("Metadata Head:")
  print(metadata.head())
else:
  print()
  print("ERROR: number of images and metadata elements don't match!")
foldername = datasetTitle.value.lower().replace(" ","_")
print()
print(f'Creating new dataset directory: CSN/{foldername}...')
if not os.path.exists(f"CSN/{foldername}"):
    os.makedirs(f"CSN/{foldername}")
    print("... success")
else:
    print("... folder already exists (might overwrite existing files)")


In [None]:
#@title Assign metadata fields
#@markdown Choose which field names in the metadata file should be used.   

#@markdown >Note: Select multiple values using shift+ctrl+mouseclick, shift+command+mouseclick, or shift+arrow keys.


filenameColumn = widgets.Dropdown(description="Image filenames (JPG or PNG):",options=[mf for mf in metadata.columns if pd.api.types.is_string_dtype(metadata[mf]) and metadata[mf].str.endswith((".jpg",".JPEG",".JPG",".jpeg",".png",".PNG")).all()], style=style, layout=layout)
classColumns = widgets.SelectMultiple(options=[mf for mf in metadata.columns if pd.api.types.is_integer_dtype(metadata[mf]) and mf != "index"],description='optional: Cluster data (integers):', style=style, layout=layout)
infoColumns = widgets.SelectMultiple(options=[mf for mf in metadata.columns if mf != "index"],description='Info fields (display in preview):', style=style, layout=layout)
sliderColumns = widgets.SelectMultiple(options=[mf for mf in metadata.columns if pd.api.types.is_numeric_dtype(metadata[mf]) and mf != "index"],description='Slider data (floats or integers):', style=style, layout=layout)
filterColumns = widgets.SelectMultiple(options=[mf for mf in metadata.columns if pd.api.types.is_string_dtype(metadata[mf]) and mf != 'URL'],description='optional: Filter & Search fields (string):', style=style, layout=layout)
if useExample.value == True:
  infoColumns.value = ("Prompt", "Colors", "Contrast", "File Size")
  sliderColumns.value = ("Colorfulness", "Colors", "Contrast", "File Size")
  filterColumns.value = ("Prompt",)
  classColumns.value = ("Class",)
left = VBox([filenameColumn, infoColumns, sliderColumns])
right = VBox([filterColumns, classColumns])
display(HBox([left,right]))


----------------

<font size=6> 2) Prepare Image Data

To handle large numbers of images efficiently, the CSN uses sprite sheets that pack many thumbnails into a single image. These sprite sheets need to be generated before the tool can display the collection.
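
As a rough sketch of the layout this implies, the snippet below computes where the thumbnail with a given index lands, assuming the defaults used in the cell that follows (2048 px tiles with 32 rows and 32 columns). The helper name `sprite_position` is hypothetical and not part of the CSN code.

```python
def sprite_position(index, tile_size=2048, tile_rows=32):
    """Return (tile number, x, y) of the grid cell that holds thumbnail `index`."""
    square = tile_size // tile_rows         # side length of one thumbnail cell in pixels
    per_tile = tile_rows * tile_rows        # thumbnails per sprite sheet
    tile, local = divmod(index, per_tile)   # which sheet, and the position inside it
    row, col = divmod(local, tile_rows)     # cells are filled row by row
    return tile, col * square, row * square

print(sprite_position(0))     # (0, 0, 0)
print(sprite_position(1025))  # (1, 64, 0)
```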

In [None]:
#@title Generate sprite sheets
#@markdown > Note: only needed for new datasets or to update existing tiles (skip this part if you already generated them)

from PIL import Image

# parameters for tiles
tileSize = 2048  # size of tile
tileRows = 32  # rows per tile
columns = tileRows  # columns per tile
squareSize = int(tileSize/tileRows)
imgPerTile = tileRows*columns
numbTiles = math.ceil(len(metadata)/imgPerTile)



def resizeImgTile(image, squareSize):
    (w,h) = image.size
    max_dim = max(w, h)
    new_w = int(w/max_dim*squareSize)
    new_h = int(h/max_dim*squareSize)
    x_dif = int((squareSize - new_w) / 2)
    y_dif = int((squareSize - new_h) / 2)
    # Image.ANTIALIAS was removed in Pillow 10; Image.LANCZOS is the same filter
    return image.resize((new_w-8, new_h-8), Image.LANCZOS),new_w,new_h,x_dif,y_dif

def generateTiles(ImgPaths,foldername,IMAGE_FOLDER, tileSize=2048, tileRows=32):
    # Create output folder if it doesn't exist
    os.makedirs(f"CSN/{foldername}", exist_ok=True)
    
    # Parameters for tiles
    squareSize = int(tileSize/tileRows)
    imgPerTile = tileRows*tileRows
    numbTiles = math.ceil(len(ImgPaths)/imgPerTile)
    
    for tileNum in tqdm(range(numbTiles), desc = "Generating tiles"):
        result = Image.new("RGBA", (tileSize, tileSize), (255, 0, 0, 0))
        currentIDX = 0
        for i in range(imgPerTile):
            img_idx = i+(tileNum*imgPerTile)
            if img_idx >= len(ImgPaths):
                break
            entry = ImgPaths[img_idx]
            try:
                image = Image.open(os.path.join(IMAGE_FOLDER, entry))
                # print(entry)
            except:
                print(f"Skipping invalid image file: {entry}")
                continue
            resizedImage,w,h,x_dif,y_dif = resizeImgTile(image, squareSize)
            r_result = Image.new("RGBA", (w, h), (1, 1, 1, 1))   # produces an almost transparent border to indicate clusters in the tool
            r_result.paste(resizedImage, (4,4))
            x = i % tileRows * squareSize + x_dif
            y = i // tileRows * squareSize + y_dif
            result.paste(r_result, (x, y, x + w, y + h))
        result = result.resize((tileSize, tileSize), Image.LANCZOS)
        # convert to 256 colors for faster loading online
        result = result.convert("P", palette=Image.ADAPTIVE, colors=256)
        result.save(f'CSN/{foldername}/tile_{tileNum}.png', "PNG", optimize=True)    

generateTiles(metadata[filenameColumn.value],foldername,imageLocation.value)

----------------

<font size=6> 3) Generate Mappings

Mappings are plots containing 2D coordinates (x,y) of the image objects. 

Here are several methods you can run. The Collection Space Navigator can handle many mappings but needs at least one to work.
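
A minimal sketch of what such a mapping contains: a list of `[x, y]` pairs, one per image, rescaled to a common range. The range of -25 to 25 mirrors `minScale` and `maxScale` in the preparation cell below; the function name `rescale_points` and the sample points are illustrative assumptions, not the notebook's own implementation.

```python
import json

def rescale_points(points, lo=-25.0, hi=25.0):
    """Linearly rescale each axis of a list of [x, y] pairs to the range [lo, hi]."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]

    def scale(v, vmin, vmax):
        if vmax == vmin:                 # constant axis: place it in the middle
            return (lo + hi) / 2
        return (v - vmin) / (vmax - vmin) * (hi - lo) + lo

    return [[scale(x, min(xs), max(xs)), scale(y, min(ys), max(ys))]
            for x, y in points]

sample = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
mapping = rescale_points(sample)
print(json.dumps(mapping))  # [[-25.0, -25.0], [0.0, 0.0], [25.0, 25.0]]
```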

In [None]:
#@title Prepare for mappings
#@markdown loads the normalization and scaling functions necessary for the steps below.
# Normalization and Scaling
mappings = []
minScale = -25
maxScale = 25

def normalize(embeddings):
    minX = min(embeddings, key=lambda x: x[0])[0]
    rangeX = max(embeddings, key=lambda x: x[0])[0] - minX
    minY = min(embeddings, key=lambda x: x[1])[1]
    rangeY = max(embeddings, key=lambda x: x[1])[1] - minY
    rangeScale = maxScale + 0.9999999999 - minScale
    for index, e in enumerate(embeddings):
        embeddings[index][0] =  (embeddings[index][0] - minX) / rangeX * rangeScale + minScale
        embeddings[index][1] = (embeddings[index][1] - minY) / rangeY * rangeScale + minScale
    return embeddings

def centerEmbeddings(embeddings):
    offsetA = (max(embeddings, key=lambda x: x[0])[0] + min(embeddings, key=lambda x: x[0])[0]) / 2
    offsetB = (max(embeddings, key=lambda x: x[1])[1] + min(embeddings, key=lambda x: x[1])[1]) / 2
    for index, e in enumerate(embeddings):
        embeddings[index][0] = embeddings[index][0] - offsetA
        embeddings[index][1] = embeddings[index][1] - offsetB
    return embeddings
    
class NumpyEncoder(json.JSONEncoder):
    """ Special json encoder for numpy types """
    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        elif isinstance(obj, np.floating):
            return float(obj)
        elif isinstance(obj, np.ndarray):
            return obj.tolist()
        return json.JSONEncoder.default(self, obj)


<font size=5> 3.1 From metadata (optional)

In [None]:
#@title (optional) Create 2D plots
#@markdown Choose 2 metadata fields (float or integer) and click "make plot". Repeat for every combination you want to add.
#@markdown >Note: Running this step opens a dialog in which you can choose X and Y dimensions from the available data and create additional 2D plots

def makePlot(v):
  A = AColumn.value
  B = BColumn.value
  plot = metadata[[A,B]]
  normalizedPlot = normalize(plot.values)
  centeredEmbedding = centerEmbeddings(normalizedPlot)
  filename = (A + "_" + B).replace(" ","")
  # save file
  with open(f'CSN/{foldername}/{filename}.json', "w") as out_file:
    out = json.dumps(centeredEmbedding, cls=NumpyEncoder)
    out_file.write(out)
  print(f"saved {filename}.json")
  mappings.append({"name": filename, "file": filename + ".json"})

AColumn = widgets.Dropdown(description="x-axis:",options=[mf for mf in metadata.columns if pd.api.types.is_numeric_dtype(metadata[mf]) and mf != "index"], style=style, layout=layout)
BColumn = widgets.Dropdown(description="y-axis:",options=[mf for mf in metadata.columns if pd.api.types.is_numeric_dtype(metadata[mf]) and mf != "index"], style=style, layout=layout)
button2DPlot = widgets.Button(description='make plot',icon='check')
button2DPlot.on_click(makePlot)
left = VBox([AColumn,BColumn])
right = VBox([button2DPlot])
HBox([left,right])

<font size=5> 3.2 From embeddings (optional)

In [None]:
#@title (optional) Run Principal Component Analysis (PCA)
components = 5 #@param {type:"number"}
add_slider = True #@param {type:"boolean"}
#@markdown >See PCA documentation: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html


from sklearn.decomposition import PCA

def generate_PC(df,n,scale):
    print("performing PCA...")
    x = StandardScaler().fit_transform(df)
    pca = PCA(n_components=n)
    embedding = pca.fit_transform(x)
    if scale == True:
      normalized = normalize(embedding)
      centeredEmbedding = centerEmbeddings(normalized)
    else:
      centeredEmbedding = embedding
    print("...done")
    return centeredEmbedding

PCAEembedding = generate_PC(embeddings,components,True)
PCMap = PCAEembedding[:, :2]  # the first two principal components form the 2D mapping


# save file
with open(f'CSN/{foldername}/PCA.json', "w") as out_file:
    out = json.dumps(PCMap, cls=NumpyEncoder)
    out_file.write(out)
print(f"saved PCA.json")
mappings.append({"name": "PCA", "file": "PCA.json"})

# add columns to metadata for each component
for i in range(components):
  metadata[f"PC{i+1}"] = PCAEembedding[:,i]
  print(f"... added PC{i+1} to metadata")

# add slider for each component
if add_slider:
  sliderCols = list(sliderColumns.value)
  for i in range(components):
    sliderCols.append(f"PC{i+1}")


In [None]:
#@title (optional) Run UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
n_neighbors=15 #@param {type:"number"}
min_dist=0.18 #@param {type:"number"}
metric="correlation" #@param {type:"string"}
verbose=True #@param {type:"boolean"}
#@markdown >See UMAP documentation: https://umap-learn.readthedocs.io/en/latest/


try:
  import umap.umap_ as umap
except:
  print("Installing umap-learn via Pip")
  !pip install umap-learn --quiet
  import umap.umap_ as umap

def generateUMAP(df):
    print("generating UMAP...")
    scaled_data = StandardScaler().fit_transform(df)
    reducer = umap.UMAP(n_neighbors=n_neighbors,
                        min_dist=min_dist,
                        metric=metric,
                        verbose=verbose)
    embedding = reducer.fit_transform(scaled_data)
    normalized = normalize(embedding)
    centeredEmbedding = centerEmbeddings(normalized)
    print("...done")
    return centeredEmbedding

fullEmbeddings = generateUMAP(embeddings)
# save file
with open(f'CSN/{foldername}/UMAP.json', "w") as out_file:
    out = json.dumps(fullEmbeddings, cls=NumpyEncoder)
    out_file.write(out)
print(f"saved UMAP.json")
mappings.append({"name": "UMAP", "file": "UMAP.json"})

In [None]:
#@title (optional) Run t-SNE: t-distributed Stochastic Neighbor Embedding
n_components = 2 #@param {type:"number"}
verbose = 1 #@param {type:"number"}
random_state = 123 #@param {type:"number"}
#@markdown >See t-SNE documentation: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

from sklearn.manifold import TSNE

def generateTSNE(df):
    print("generating t-SNE...")
    x = StandardScaler().fit_transform(df)
    tsne = TSNE(n_components=n_components, verbose=verbose, random_state=random_state)
    embedding = tsne.fit_transform(x)
    normalized = normalize(embedding)
    centeredEmbedding = centerEmbeddings(normalized)
    print("...done")
    return centeredEmbedding

tsneEembedding = generateTSNE(embeddings)
# save file
with open(f'CSN/{foldername}/tSNE.json', "w") as out_file:
    out = json.dumps(tsneEembedding, cls=NumpyEncoder)
    out_file.write(out)
print(f"saved tSNE.json")
mappings.append({"name": "t-SNE", "file": "tSNE.json"})

In [None]:
#@title (optional) Create 2D plots
#@markdown Choose 2 metadata fields (float or integer) and click "make plot". Repeat for every combination you want to add.
#@markdown >Note: Running this step opens a dialog in which you can choose X and Y dimensions from the available data and create additional 2D plots

def makePlot(v):
  A = AColumn.value
  B = BColumn.value
  plot = metadata[[A,B]]
  normalizedPlot = normalize(plot.values)
  centeredEmbedding = centerEmbeddings(normalizedPlot)
  filename = (A + "_" + B).replace(" ","")
  # save file
  with open(f'CSN/{foldername}/{filename}.json', "w") as out_file:
    out = json.dumps(centeredEmbedding, cls=NumpyEncoder)
    out_file.write(out)
  print(f"saved {filename}.json")
  mappings.append({"name": filename, "file": filename + ".json"})

AColumn = widgets.Dropdown(description="x-axis:",options=[mf for mf in metadata.columns if pd.api.types.is_numeric_dtype(metadata[mf]) and mf != "index"], style=style, layout=layout)
BColumn = widgets.Dropdown(description="y-axis:",options=[mf for mf in metadata.columns if pd.api.types.is_numeric_dtype(metadata[mf]) and mf != "index"], style=style, layout=layout)
button2DPlot = widgets.Button(description='make plot',icon='check')
button2DPlot.on_click(makePlot)
left = VBox([AColumn,BColumn])
right = VBox([button2DPlot])
HBox([left,right])

----------------

<font size=6> 4) Create Config Files

All customization and component settings are defined in the config files.
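
To give a sense of the result, the sketch below shows the general shape of a dataset `config.json`. The keys mirror those written by the `update_config` function later in this section; every value is a placeholder chosen for illustration, including the URL, colors, and field names.

```python
import json

# Placeholder values only; the real file is produced by the cells in this section.
example_config = {
    "title": "Testset",
    "datasetInfo": "",
    "metadata": "metadata.json",
    "embeddings": [{"name": "UMAP", "file": "UMAP.json"}],
    "clusters": {"clusterList": ["Class"], "clusterColors": ["#ff0000", "#00ff00"]},
    "total": 2048,                 # number of metadata rows
    "sprite_side": 32,             # thumbnails per row/column of a sprite sheet
    "sprite_number": 2,            # 2048 images / (32 * 32) per sheet
    "sprite_image_size": 64,       # thumbnail cell size in pixels
    "sprite_actual_size": 2048,    # sprite sheet size in pixels
    "sliders": [{"id": "Contrast", "title": "Contrast", "info": "",
                 "typeNumber": "float", "color": "#aabbcc"}],
    "info": ["Prompt"],
    "search": [{"columnField": "Prompt", "type": "selection"}],
    "url_prefix": "https://example.org/images/",
}
print(json.dumps(example_config, indent=2))
```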

In [None]:
#@title Set Sliders
#@markdown Set the appearance of the range slider elements and histograms.

try:
  import distinctipy
except:
  print("Installing distinctipy via Pip")
  !pip install distinctipy --quiet
  import distinctipy
  
# check if sliderCols exists
try:
  sliderCols
except NameError:
  sliderCols = list(sliderColumns.value)
  
if len(sliderCols) > 0:    
  layoutCol = {'width': '110px'}
  sliderColorDict = {}
  left = [Label('display name')]
  middle = [Label('description text')]
  right = [Label('histogram color')]
  colors = distinctipy.get_colors(len(sliderCols),pastel_factor=1)
  for i, sliderName in enumerate(sliderCols):
    sliderColorDict[sliderName] = widgets.ColorPicker(concise=False,value=distinctipy.get_hex(colors[i]),layout=layoutCol)
    right.append(sliderColorDict[sliderName])
  sliderInfoDict = {}
  for sliderName in sliderCols:
    sliderInfoDict[sliderName] = widgets.Text(placeholder="info text for slider",layout=layout)
    middle.append(sliderInfoDict[sliderName])
  sliderNameDict = {}
  for sliderName in sliderCols:
    sliderNameDict[sliderName] = widgets.Text(placeholder="name of slider",value=sliderName)
    left.append(sliderNameDict[sliderName])
  print("\nSlider Settings:\n") 
  idx = VBox([Label('')]+[Label(f"{n}:") for n in sliderCols])
  left_box = VBox([l for l in left])
  middle_box = VBox([m for m in middle])
  right_box = VBox([r for r in right])
  display(HBox([idx,left_box,middle_box,right_box]))
else:
  print("No slider fields selected!")

In [None]:
#@title (optional) Set Cluster colors
#@markdown >Note: only necessary if categorical data was assigned for clusters

if len(classColumns.value) > 0:
  classColorDict = {}
  amount = len(classColumns.value)
  styleCol = {'description_width': '25px'}
  layoutCl = {'width': '135px'}
  allClasses = {}
  for className in classColumns.value:
    clusters = metadata[className].unique()
    allClasses[className] = len(clusters)
  # use the largest cluster count so every cluster id gets a color
  length = max(allClasses.values())
  allColors = {}
  colors = distinctipy.get_colors(length)
  col = 5
  row = math.ceil(length/col)
  i=0
  rows = []
  for r in range(0,col):
    newRow = []
    for c in range(0,row):
      # classColorDict[className] = widgets.ColorPicker(concise=True, value=distinctipy.get_hex(colors[i]))
      if i < len(colors):
        allColors[i] = widgets.ColorPicker(concise=False, description=str(i), value=distinctipy.get_hex(colors[i]),layout=layoutCl,style=styleCol)
        newRow.append(allColors[i])
        i+=1
    rows.append(VBox([nr for nr in newRow]))
  display(HBox(rows))
else:
  allColors = {}  # empty dict keeps the later color lookup iterable
  print("No cluster was selected.")


In [None]:
#@title Create metadata.json
#@markdown Creates and saves the metadata.json file
#@markdown >Note: This step is necessary!


# modify image paths

try:
  sliderCols
except NameError:
  sliderCols = list(sliderColumns.value)

imageFolder = f'public/datasets/{foldername}/images/'
if useExample.value == True:
  metadata["URL"] = metadata[filenameColumn.value]
else:
  metadata["URL"] = imageFolder + metadata[filenameColumn.value]
metadataColumns = set(list(infoColumns.value) + sliderCols + list(filterColumns.value) + list(classColumns.value))
metadataColumns.add(filenameColumn.value)
metadata = metadata[list(metadataColumns)]  # recent pandas versions reject a set as a column indexer
metadata.reset_index(inplace=True)
# save metadata file
result = metadata.to_json(orient="records")
with open(f'CSN/{foldername}/metadata.json', "w") as f:
    f.write(result)
print("saved metadata.json")

In [None]:
#@title Calculate Histograms and create config files
#@markdown The CSN features Range Sliders with interactive histograms. This step calculates the necessary bins and prepares the data to display the histograms.
try:
    sliderCols
except NameError:
    sliderCols = list(sliderColumns.value)

def prepareBuckets(MIN,MAX, data):
    # prepare slider bar histogram
    buckets = {}
    bucketsSize = {}
    bucketCount = 50
    # width of one bucket; valid for negative, positive, and mixed ranges
    stepSize = (MAX - MIN) / bucketCount
    for i in range(0, bucketCount):
        buckets[i] = []
        bucketsSize[i] = 0
    for index, e in enumerate(data):
        if (e == MAX):
            targetBucket = bucketCount-1
        else:
            targetBucket = math.floor((e - MIN) / stepSize)
        buckets[targetBucket].append(index)
        bucketsSize[targetBucket]+=1
    return {"histogram":list(bucketsSize.values()), "selections":list(buckets.values()), "range":[int(MIN),int(MAX)]}

def getBarChartData(df, selectionList):
    bucketData =  {} 
    for element in selectionList:
        print("preparing slider bar histogram data", element)
        bucketData[element] = {str(element):{"histogram":[], "selections":[]}}
        bucketData[element] = prepareBuckets(df[element].min(),df[element].max(), df[element].values.tolist())
    return bucketData

def update_config(metadata,mappings):
    configData = {"title": datasetTitle.value, "datasetInfo": description.value, "metadata": "metadata.json", "embeddings": []}
    if mappings:
        configData["embeddings"] = mappings    
    configData["clusters"] = clusters
    configData["total"] = len(metadata)
    if tileSize:
        configData["sprite_side"] = tileRows
        configData["sprite_number"] = numbTiles
        configData["sprite_image_size"] = squareSize
        configData["sprite_actual_size"] = tileSize
    configData["sliders"] = sliderSetting
    if infoColumns.value:
        configData["info"] = infoColumns.value
    configData["search"] = searchFields
    if imageWebLocation.value.endswith("/"):
        configData["url_prefix"] = imageWebLocation.value
    else:
        configData["url_prefix"] = imageWebLocation.value + "/"

    return configData

def save_datasetsJSON():
  with open(f'CSN/datasets_config.json', "w") as fd:
    json.dump(datasetsJSON , fd)
  print("saved datasets_config.json")

def make_default(DEFAULT):
  datasetsJSON["default"] = DEFAULT
  print(f"changed default dataset to {datasetsJSON['data'][DEFAULT]['name']}")
  save_datasetsJSON()
  
BarChartData = getBarChartData(metadata,sliderCols)
with open(f'CSN/{foldername}/barData.json', "w") as f:
    json.dump(BarChartData , f)
print(f'saved barData.json')

sliderSetting = []

for k in sliderCols:
  dtype = 'float'
  if pd.api.types.is_integer_dtype(metadata[k]):
    dtype = 'int'
  slider = {"id":k,"title":sliderNameDict[k].value,"info":sliderInfoDict[k].value,"typeNumber":dtype,"color":sliderColorDict[k].value}
  sliderSetting.append(slider)
searchFields = []
for k in filterColumns.value:
  filter = {"columnField":k,"type":"selection"}
  searchFields.append(filter)
try:
    allColors
except NameError:
    clusters = {"clusterList":[],"clusterColors":[]}
else:
    clusters = {"clusterList":list(classColumns.value),"clusterColors":[allColors[g].value for g in allColors]}

configData = update_config(metadata,mappings)
with open(f'CSN/{foldername}/config.json', "w") as fb:
    json.dump(configData , fb)
print(f'saved config.json')
newDataset = {'name': datasetTitle.value, 'directory': foldername}

datasetsJSON = {"default": 0, "data": [newDataset]}
save_datasetsJSON()

print("\nContinue with building the Collection Space Navigator in the next step...")
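
The bucketing performed above can be sketched in isolation. This is a simplified illustration rather than the notebook's own function: it only counts values per bucket, the name `bucket_counts` is hypothetical, and the clamp on the last bucket is an added safeguard against floating-point rounding.

```python
import math

def bucket_counts(values, bucket_count=50):
    """Count how many values fall into each of `bucket_count` equal-width buckets."""
    lo, hi = min(values), max(values)
    counts = [0] * bucket_count
    if hi == lo:                        # all values identical: use the last bucket
        counts[-1] = len(values)
        return counts
    step = (hi - lo) / bucket_count     # width of one bucket
    for v in values:
        # the maximum value would land one past the end, so clamp it
        idx = min(math.floor((v - lo) / step), bucket_count - 1)
        counts[idx] += 1
    return counts

counts = bucket_counts([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], bucket_count=5)
print(counts)  # [2, 2, 2, 2, 3]
```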

----------------

<font size=6> 5) Build and use your custom Collection Space Navigator

Creates the final CSN tool and provides different options to test or use it.

In [None]:
#@title Build custom CSN
#@markdown >Note: This step pulls the build folder from the official Github repository (https://github.com/Collection-Space-Navigator/CSN)
import shutil

# if not os.path.exists(""):
#     os.makedirs("CSN")
#     print("created new directory 'CSN'")

print("pulling Collection Space Navigator from GitHub")
%cd CSN
!git init
!git remote add -f origin https://github.com/Collection-Space-Navigator/CSN
!git config core.sparseCheckout true
!echo "build" >> .git/info/sparse-checkout
!git read-tree -mu HEAD
!git pull origin main
%cd -

print("moving dataset to the Collection Space Navigator")
shutil.move(f"CSN/{foldername}", f"CSN/build/datasets/{foldername}")
shutil.move(f"CSN/datasets_config.json", f"CSN/build/datasets/datasets_config.json")

In [None]:
#@title (optional) Download
#@markdown >Note: Download your CSN version 'CSN_build.zip' to use it locally. Only needed in Google Colab.

from google.colab import files

!7z a CSN/CSN_build.zip CSN/build

files.download('/content/CSN/CSN_build.zip')


In [None]:
#@title (optional) Run proxy server in Google Colab (for quick testing)
#@markdown >Note: **This will run in Colab until stopped!** Sharing the link won't work, and it might be a bit slow. The connection will close after a few minutes.

try:
  from flask import Flask, render_template,send_from_directory
except:
  print("Installing flask via Pip")
  !pip install flask
  from flask import Flask, render_template,send_from_directory
  

import portpicker
import threading
import socket
import http.server
import socketserver
from functools import partial
from google.colab import output
from google.colab.output import eval_js

def server_entry(static_port):
    Handler = partial(http.server.SimpleHTTPRequestHandler, directory='/content/CSN/build')
    httpd = socketserver.TCPServer(("", static_port), Handler)
    # Serve requests until the runtime is stopped.
    httpd.serve_forever()

# Plain static file server, opened in a separate browser window.
static_port = portpicker.pick_unused_port()
thread = threading.Thread(target=server_entry, args=(static_port,))
thread.start()
output.serve_kernel_port_as_window(static_port)
# Separate port for the Flask app started at the end of this cell.
port = portpicker.pick_unused_port()
proxy_URL = eval_js("google.colab.kernel.proxyPort(%d)" %port)

print("")
print(20*'#')
print("")
print(f"Use this url: {proxy_URL}")
print("")
print(20*'#')
print("")
print('starting server...')
print('Note: this will run in Colab until stopped!')


app = Flask(__name__,static_folder='/content/CSN/build/static',template_folder='/content/CSN/build')

@app.route('/<path:path>')
def send_report(path):
  # serve any other file directly from the build folder
  print("files",path)
  return send_from_directory('/content/CSN/build/', str(path))

@app.route('/CSN/static/<path:path>')
def send_report2(path):
  # map the /CSN/static/ prefix to the static assets of the build
  print("files",path)
  return send_from_directory('/content/CSN/build/static/', str(path))

@app.route('/CSN/datasets/<path:path>')
def send_report3(path):
  # map the /CSN/datasets/ prefix to the dataset files of the build
  print("files",path)
  return send_from_directory('/content/CSN/build/datasets/', str(path))


@app.route("/")
def home():
    return render_template('index.html')
    
if __name__ == "__main__":
  app.run(debug=False, port=port)


In [None]:
#@title (optional) Run ngrok server
ngrok_Authtoken = 'YOUR_NGROK_TOKEN' #@param {type:"string"}
#@markdown >Note: **This will run until stopped!** Requires an ngrok account. See: https://ngrok.com/

try:
  from pyngrok import ngrok
  from flask_ngrok import run_with_ngrok
  from flask import Flask, render_template,send_from_directory
except:
  print("Installing flask, flask_ngrok and pyngrok via Pip")
  !pip install flask flask_ngrok pyngrok
  from flask import Flask, render_template,send_from_directory
  from pyngrok import ngrok
  from flask_ngrok import run_with_ngrok
  
app = Flask(__name__,static_folder='/content/CSN/build/',template_folder='/content/CSN/build/')

ngrok.set_auth_token(ngrok_Authtoken)
run_with_ngrok(app)
@app.route('/<path:path>')
def send_report(path):
  # serve any other file directly from the build folder
  print("files",path)
  return send_from_directory('/content/CSN/build/', str(path))

@app.route('/CSN/static/<path:path>')
def send_report2(path):
  # map the /CSN/static/ prefix to the static assets of the build
  print("files",path)
  return send_from_directory('/content/CSN/build/static/', str(path))
  
@app.route('/CSN/datasets/<path:path>')
def send_report3(path):
  # map the /CSN/datasets/ prefix to the dataset files of the build
  print("files",path)
  return send_from_directory('/content/CSN/build/datasets/', str(path))

@app.route("/")
def home():
    return render_template('index.html')
    
if __name__ == "__main__":
  app.run()

## (optional) Run on localhost

>Note: only works locally (not within Colab)

To run your CSN version on localhost, unzip the downloaded file, open a terminal, navigate to your CSN directory, and run `serve -s build` (the `serve` command comes from the npm package of the same name).

The CSN should then be accessible at http://localhost:3000 in your browser.
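
If the `serve` command is not available, a static file server from Python's standard library can stand in for quick local checks. This is a sketch rather than the documented workflow: the folder name `build` and port 3000 follow the text above, while the helper `make_server` is a hypothetical name. Unlike `serve -s`, it does not fall back to `index.html` for unknown paths.

```python
import http.server
import socketserver
from functools import partial

def make_server(directory="build", port=3000):
    """Create (but do not start) a static file server for `directory`."""
    handler = partial(http.server.SimpleHTTPRequestHandler, directory=directory)
    return socketserver.TCPServer(("127.0.0.1", port), handler)

# Demo: port 0 lets the operating system pick any free port.
httpd = make_server(port=0)
print("bound to port", httpd.server_address[1])
httpd.server_close()

# To actually serve the build folder, run this from the CSN directory:
# make_server(port=3000).serve_forever()
```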



### (optional) Use as production web tool

We recommend using GitHub to host your custom CSN version and an external server to host your image collection. Note that GitHub limits any dataset to 1000 files.

To deploy your version as a web tool in GitHub:

>Note: Make sure your GitHub branch is called `gh-pages` and has the GitHub Pages option set. See more about GitHub Pages here: https://pages.github.com/

1. clone the official CSN repository: https://github.com/Collection-Space-Navigator/CSN
2. install NVM: https://github.com/nvm-sh/nvm
3. install Node.js 14.21.2 by running `nvm install v14.21.2`
4. replace the `CSN/build` folder with your own
5. in `package.json`, change `"homepage": "https://collection-space-navigator.github.io/CSN"` to your GitHub Pages address
6. deploy the build folder to your GitHub pages by running `npm run deploy`

For more information and other deployment options, see https://create-react-app.dev/docs/deployment/

