# Andromeda in Jupyter

### Interactive Inverse Dimension Reduction 

This notebook implements interactive dimension reduction (DR) for exploratory analysis of high-dimensional data.
It uses a Multi-Dimensional Scaling (MDS) algorithm with a weighted distance metric. It enables both forward and inverse DR interaction. 

**MDS** projects high-dimensional data to a 2D scatterplot. A **weighted distance function** with user-specified weights on each dimension enables alternative projections that emphasize different dimensions. An **Inverse-DR** algorithm learns distance function weights for  user-constructed layouts of the data points.

### Instructions:

1. Run All
1. Proceed to the interactive plots near the bottom
1. There are three kinds of interactions:
    1. Select points in the DR plot and click Details to see data values.
    1. **Parametric interaction:** Adjust the weight sliders and click Apply to alter the projection plot.
    1. **Projection interaction:** Drag points in the projection plot, then click Learn to see learned weights, and click Copy to see the updated projection plot.
1. Be patient: this is interactive matplotlib running in Python and Jupyter!

### Credits:

Authors: Han Liu and Chris North, Dept of Computer Science, Virginia Tech.

Based on: *Self JZ, Dowling M, Wenskovitch J, Crandell I, Wang M, House L, Leman S, North C. Observation-Level and Parametric Interaction for High-Dimensional Data Analysis. ACM Transactions on Interactive Intelligent Systems.  8(2), 2018.* https://infovis.cs.vt.edu/sites/default/files/observation-level-parametric_first_look_version.pdf


In [1]:
%matplotlib notebook

# interactive notebook format is required for the interactive plot

import numpy as np
import pandas as pd
import math
import random
import os
import cv2
import csv

from os import listdir
from os.path import isfile, join
from math import isnan

from sklearn.decomposition import PCA
from sklearn.manifold import MDS
import sklearn.metrics.pairwise

import matplotlib
import matplotlib.pyplot as plt
from matplotlib.offsetbox import OffsetImage, AnnotationBbox
from matplotlib.patches import FancyBboxPatch
from functools import partial
import ipywidgets as widgets
from ipywidgets import interact, Layout
from IPython.display import display
from IPython.display import Image
import ipyplot

# Load and Pre-process Data

Change the **filename** to load a dataset.  The CSV data file is expected to have a header row of column names and an identifier column that is used as the index (the code below uses the 'Image' column).  Numeric columns are used for projection.
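As a rough illustration of the expected layout, the sketch below reads a tiny in-memory CSV the same way the cells that follow read the real file. The sample rows and the names `csv_text`, `df_demo`, and `numeric_demo` are made up for this example and are not part of the dataset.

```python
import io

import pandas as pd

# Hypothetical miniature dataset: an identifier column plus numeric feature columns.
csv_text = """Image,1,2,3
sampleB,0.02,0.01,0.03
sampleA,0.01,0.04,0.02
"""

# Use the identifier column as the index, then sort rows by identifier.
df_demo = pd.read_csv(io.StringIO(csv_text), index_col="Image")
df_demo = df_demo.sort_index()

# Only numeric columns take part in the projection.
numeric_demo = df_demo.select_dtypes(include="number")
print(numeric_demo)
```

Note that the column labels are read as strings (`'1'`, `'2'`, ...), which matters later when weights are looked up by label.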

In [2]:
mypath = r"./dataPointMonitor/"
finalDir = [mypath+f for f in listdir(mypath) if isfile(join(mypath, f))]
imageNames = []
for path in finalDir:
    imageNames.append(os.path.basename(os.path.normpath(path)))

In [3]:
filename = './csvFiles/OppoSiftFeatures_batch5.csv'
df = pd.read_csv(filename)
paths =  df["Image"].tolist()
paths[:] = map(lambda x: "./testTrainingData/" + x + ".jpg", paths) 

In [4]:
display(df.shape)
df.head(5)

(60, 65)

Unnamed: 0,Image,1,2,3,4,5,6,7,8,9,...,55,56,57,58,59,60,61,62,63,64
0,2098late4,0.021563,0.010908,0.003805,0.021055,0.004566,0.019533,0.005835,0.015728,0.018519,...,0.014967,0.006342,0.001015,0.025368,0.0,0.015728,0.017757,0.020548,0.015221,0.016489
1,2076late6,0.018315,0.011842,0.003029,0.011016,0.005095,0.013357,0.004544,0.022859,0.019278,...,0.013357,0.003718,0.001652,0.027954,0.002066,0.024373,0.019554,0.024511,0.023547,0.021069
2,2144disease7,0.008608,0.033477,0.005739,0.022956,0.001913,0.016738,0.02439,0.009565,0.005739,...,0.041607,0.015304,0.003826,0.009087,0.042085,0.004304,0.009087,0.005261,0.001913,0.007652
3,2076late5,0.018335,0.011787,0.004948,0.011641,0.007712,0.021682,0.005966,0.021682,0.01557,...,0.010477,0.00422,0.001892,0.021682,0.001601,0.020373,0.024738,0.020081,0.01979,0.017608
4,2044ready5,0.019162,0.003248,0.01429,0.012666,0.000974,0.015265,0.008769,0.024034,0.015914,...,0.004547,0.006496,0.004547,0.018837,0.001299,0.026632,0.016564,0.019812,0.01494,0.012342


In [5]:
# Use the 'Image' column as index
#df.rename(columns={df.columns[0]:'Name'}, inplace=True)
df.set_index('Image', inplace=True)

# Sort rows and columns
# df.sort_index(axis=1, inplace=True)
df.sort_index(inplace=True)

df_numeric = df.select_dtypes(include='number')  #'int32' or 'int64' or 'float32' or 'float64'
df_category = df.select_dtypes(exclude='number') #'object'

# Z-score normalization
# normalized_df = (df_numeric - df_numeric.mean()) / df_numeric.std()
normalized_df = df_numeric  # do not normalize animal dataset, all columns are 0-100 scale

print('Data size (r,c) =', df_numeric.shape)
df_numeric.head(5)

Data size (r,c) = (60, 64)


Image,1,2,3,4,5,6,7,8,9,10,...,55,56,57,58,59,60,61,62,63,64
2024ready2,0.018459,0.00426,0.01136,0.018104,0.003195,0.029109,0.014554,0.011005,0.015264,0.00639,...,0.006035,0.012425,0.01278,0.011005,0.000355,0.01065,0.020234,0.011715,0.011715,0.017394
2028ready4,0.015136,0.002073,0.03898,0.01244,0.002073,0.009123,0.013062,0.017831,0.010989,0.002073,...,0.00311,0.007672,0.026954,0.019075,0.005183,0.0226,0.013062,0.015758,0.01016,0.012855
2029ready20,0.015863,0.006217,0.014362,0.01179,0.003215,0.012219,0.010075,0.016506,0.019078,0.006217,...,0.006217,0.008789,0.008789,0.022294,0.001715,0.021865,0.015648,0.018864,0.015434,0.016935
2031ready1,0.018903,0.003346,0.007528,0.010539,0.002509,0.011542,0.006357,0.030445,0.017732,0.002676,...,0.00368,0.004517,0.008029,0.033456,0.000836,0.027434,0.01723,0.024925,0.02275,0.014721
2033ready9,0.020528,0.004888,0.01002,0.008553,0.002688,0.022239,0.009531,0.01784,0.019306,0.003421,...,0.004154,0.008553,0.012219,0.020528,0.002444,0.015885,0.016373,0.01393,0.013196,0.015152


#  Dimension Reduction Model:  Weighted MDS

For DR, we use the Multi-Dimensional Scaling (MDS) algorithm on a weighted data space. **Dimension weights** are applied to the high-dimensional (HD) data.  Weights are normalized to sum to 1, so that the HD distances stay at a roughly constant scale regardless of p, the number of dimensions.

The **distance function for the high-dimensional (HD) data** is the L1 (Manhattan) distance, a good general-purpose choice for multi-dimensional quantitative datasets. 

The **distance function for the 2D projected points** is L2 Euclidean distance, which matches human perception of distance in the plot.
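To make the effect of the weights concrete, here is a minimal NumPy sketch of a weighted L1 distance matrix. The function name `weighted_manhattan` and the toy data are assumptions for illustration only; the notebook itself uses scikit-learn's `manhattan_distances` on pre-weighted data.

```python
import numpy as np

def weighted_manhattan(data, weights):
    """Pairwise L1 distances after scaling each column by its normalized weight."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # normalize weights to sum to 1
    scaled = np.asarray(data, dtype=float) * w
    # Broadcast to all row pairs, then sum absolute differences over columns.
    return np.abs(scaled[:, None, :] - scaled[None, :, :]).sum(axis=2)

# Toy data: rows 0 and 2 differ only in column 0, rows 1 and 2 only in column 1.
data = np.array([[0.0, 10.0],
                 [1.0, 0.0],
                 [1.0, 10.0]])

equal = weighted_manhattan(data, [1.0, 1.0])
first_only = weighted_manhattan(data, [0.9999, 0.00001])

print(equal)
print(first_only)
```

With equal weights, rows 1 and 2 are far apart because column 1 dominates. With nearly all weight on column 0, the same two rows become almost identical, which is how upweighting a dimension changes the projection.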

In [6]:
# Compute the distance matrix for the weighted high-dimensional data using L1 distance function.
#  Input HD data should already be weighted.
def distance_matrix_HD(dataHDw):  # dataHDw (pandas or numpy) -> distance matrix (numpy)
    dist_matrix = sklearn.metrics.pairwise.manhattan_distances(dataHDw)
    #m = pd.DataFrame(m, columns=dataHD.index, index=dataHD.index)  # keep as np array for performance
    return dist_matrix

# Compute the distance matrix for 2D projected data using L2 distance function.
def distance_matrix_2D(data2D):  # data2d (pandas or numpy) -> distance matrix (numpy)
    dist_matrix = sklearn.metrics.pairwise.euclidean_distances(data2D) 
    #m = pd.DataFrame(m, columns=data2D.index, index=data2D.index) # keep as np array for performance
    return dist_matrix

#def dist(x,y):
#    return np.linalg.norm(x-y, ord=2)

**MDS** projects the weighted high-dimensional data to 2D. Tune the algorithm's parameters for performance.
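How well a 2D layout preserves the HD distances can be summarized by a stress value. The sketch below computes the same normalized squared form as the `stress` helper in the next cell (without the square root), using hand-made toy points. The helper names `pairwise_l1`, `pairwise_l2`, and `normalized_stress` are illustrative and not part of the notebook.

```python
import numpy as np

def pairwise_l1(points):
    """Manhattan distance between every pair of rows."""
    return np.abs(points[:, None, :] - points[None, :, :]).sum(axis=2)

def pairwise_l2(points):
    """Euclidean distance between every pair of rows."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def normalized_stress(dist_hd, dist_2d):
    """Squared distance mismatch, normalized by the squared HD distances."""
    return ((dist_hd - dist_2d) ** 2).sum() / (dist_hd ** 2).sum()

# Three HD points that lie on a line, so a perfect 2D layout exists.
hd = np.array([[0.0, 0.0, 0.0],
               [1.0, 0.0, 0.0],
               [3.0, 0.0, 0.0]])
layout_good = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
layout_bad = np.array([[0.0, 0.0], [3.0, 0.0], [1.0, 0.0]])  # two points swapped

stress_good = normalized_stress(pairwise_l1(hd), pairwise_l2(layout_good))
stress_bad = normalized_stress(pairwise_l1(hd), pairwise_l2(layout_bad))
print(stress_good, stress_bad)
```

A stress of zero means every pairwise distance is reproduced exactly; swapping two points raises it, which is the signal MDS tries to minimize.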

In [7]:
# Calculate the MDS stress metric between HD and 2D distances.  Uses numpy for efficiency.
def stress(distHD, dist2D):  #  distHD, dist2D (numpy) -> stress (float)
    #s = np.sqrt((distHD-dist2D).pow(2).sum().sum() / distHD.pow(2).sum().sum())  # pandas
    #s = np.sqrt(((distHD-dist2D)**2).sum() / (distHD**2).sum())   # numpy
    s = ((distHD-dist2D)**2).sum() / (distHD**2).sum()   # numpy, eliminate sqrt for efficiency
    return s

def compute_mds(dataHDw):  # dataHDw -> data2D (pandas)
    distHD = distance_matrix_HD(dataHDw)
    # Adjust these parameters for performance/accuracy tradeoff
    mds = sklearn.manifold.MDS(n_components=2, dissimilarity='precomputed', n_init=10, max_iter=1000)
    # Reduction algorithm happens here:  data2D is nx2 matrix
    data2D = mds.fit_transform(distHD)
    
    # Rotate the resulting 2D projection to make it more consistent across multiple runs.
    # Set the 1st PC to the y axis, plot looks better to spread data vertically with horizontal text labels
    pca = sklearn.decomposition.PCA(n_components=2)
    data2D = pca.fit_transform(data2D)
    data2D = pd.DataFrame(data2D, columns=['y','x'], index=dataHDw.index)
    
    data2D.stress_value = stress(distHD, distance_matrix_2D(data2D))
    return data2D

def dimension_reduction(dataHD, wts): # dataHD, wts -> data2D (pandas)
    # Normalize the weights to sum to 1
    wts = wts/wts.sum()
    
    # Apply weights to the HD data 
    dataHDw = dataHD * wts
    
    # DR algorithm
    data2D = compute_mds(dataHDw)

    # Compute row relevances as:  data dot weights
    # High relevance means large values in upweighted dimensions
    data2D['relevance'] = dataHDw.sum(axis=1)
    return data2D


min_weight, max_weight = 0.00001, 0.9999
init_weight = min_weight  # 1.0/len(normalized_df.columns) # initialize to min to make the sliders easier to use.
weights = pd.Series(init_weight, index=normalized_df.columns, name="Weight")  # the current weight list

df_2D = dimension_reduction(normalized_df, weights)   # the current projected data

In [8]:
weights.head(2)

1    0.00001
2    0.00001
Name: Weight, dtype: float64

In [9]:
pd.DataFrame(distance_matrix_HD(normalized_df * (weights/weights.sum())), 
             columns=normalized_df.index, index=normalized_df.index).head(2)

Image,2024ready2,2028ready4,2029ready20,2031ready1,2033ready9,2034ready5,2036ready18,2038ready15,2038ready3,2039ready12,...,2147disease15,2147disease2,2147disease4,2147disease5,2147disease6,2147disease7,2148disease11,2148disease12,2148disease13,2148disease17
2024ready2,0.0,0.00747,0.00547,0.007027,0.004327,0.005141,0.006532,0.005383,0.005771,0.005993,...,0.010087,0.007979,0.011489,0.010935,0.011735,0.007408,0.013341,0.00558,0.007071,0.006586
2028ready4,0.00747,0.0,0.004501,0.006786,0.00601,0.006115,0.006713,0.007773,0.007342,0.007029,...,0.014326,0.010439,0.015781,0.013379,0.014949,0.011916,0.017335,0.009778,0.01099,0.010895


In [10]:
print(df_2D.stress_value)
df_2D.head(2)

0.03527245887981413


Image,y,x,relevance
2024ready2,0.000546,0.002817,0.015625
2028ready4,-0.003944,0.00644,0.015625


In [11]:
def centroid(df_2D,cat):
    # Mean (x, y) position of the points whose label equals cat
    x_coords = df_2D[df_2D.label==cat].x.values
    y_coords = df_2D[df_2D.label==cat].y.values
    # Divide by the number of points in this category (not the whole dataset);
    # max(..., 1) avoids a division by zero when the category is empty
    _len = max(len(x_coords), 1)
    centroid_x = sum(x_coords)/_len
    centroid_y = sum(y_coords)/_len
    return (centroid_x, centroid_y)

def var(*centroids):
    # Spread of the centroids: sum of squared deviations from their mean position
    x = np.array([c[0] for c in centroids])
    y = np.array([c[1] for c in centroids])
    new_centroids_x = np.mean(x)
    new_centroids_y = np.mean(y)
    return np.sum((x - new_centroids_x) ** 2) + np.sum((y - new_centroids_y) ** 2)

# labels_df = df_2D.index.str.extract(r'([a-z]+)')
# labels_df.index = df_2D.index
# df_2D['label'] = labels_df

# ready_centroid = centroid(df_2D,'ready')
# disease_centroid = centroid(df_2D,'disease')
# late_centroid = centroid(df_2D,'late')

# var(ready_centroid,disease_centroid,late_centroid)

# Inverse Dimension-Reduction Learning Algorithm

Computes the inverse-Dimension-Reduction: given input 2D points, compute new weights.
Optimizes the MDS stress function that compares 2D pairwise distances ($\lVert x_i - x_j \rVert$) to weighted HD pairwise distances ($d_{ij}$):
![Stress](https://wikimedia.org/api/rest_v1/media/math/render/svg/7989b3afc0d8795a78c1631c7e807f260d9cfe68)

Technically, we compute the inverse weighted distance function. We shortcut the optimization by eliminating MDS from the process, and assume that the user input 2D distances are actually the desired HD distances, not the 2D distances after re-projection. Thus, given the input (HD) distances, we find weights that would produce these distances in the HD space.
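The idea can be sketched as a simple random search: nudge one weight at a time, renormalize, and keep the change only if the stress between the weighted HD distances and the user's 2D distances drops. This is a simplified illustration, not the notebook's `inverse_DR`: it omits the adaptive step size and direction tracking, and the names `learn_weights`, `data_hd`, and `layout_2d` are assumptions made for this example.

```python
import numpy as np

def pairwise_l1(points):
    return np.abs(points[:, None, :] - points[None, :, :]).sum(axis=2)

def pairwise_l2(points):
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def stress(dist_hd, dist_2d):
    return ((dist_hd - dist_2d) ** 2).sum() / (dist_hd ** 2).sum()

def learn_weights(data_hd, layout_2d, n_iter=200, seed=0):
    """Random coordinate search for weights whose HD distances match the 2D layout."""
    rng = np.random.default_rng(seed)
    n_dims = data_hd.shape[1]
    target = pairwise_l2(layout_2d)              # user layout, computed once
    weights = np.full(n_dims, 1.0 / n_dims)      # start from equal weights
    initial = best = stress(pairwise_l1(data_hd * weights), target)
    for _ in range(n_iter):
        for dim in range(n_dims):
            proposal = weights.copy()
            step = rng.uniform(-1.0, 1.0) / n_dims
            proposal[dim] = np.clip(proposal[dim] + step, 1e-5, 0.9999)
            proposal /= proposal.sum()           # keep weights summing to 1
            candidate = stress(pairwise_l1(data_hd * proposal), target)
            if candidate < best:                 # accept only improvements
                weights, best = proposal, candidate
    return weights, initial, best

# Toy case: the "user layout" spreads points by column 0 only.
data_hd = np.array([[0.0, 5.0, 2.0],
                    [1.0, 1.0, 9.0],
                    [2.0, 7.0, 4.0],
                    [3.0, 3.0, 0.0],
                    [4.0, 8.0, 6.0]])
layout_2d = np.column_stack([data_hd[:, 0], np.zeros(len(data_hd))])

learned, initial_stress, final_stress = learn_weights(data_hd, layout_2d)
print(learned, initial_stress, final_stress)
```

Because the layout depends only on column 0, the search should tend to shift weight toward that column, though the exact values depend on the random seed.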

In [12]:
# Propose a new weight for the current column: take a random-sized step in the given direction, clipped to the valid weight range
def new_proposal(current, step, direction):
    return np.clip(current + direction*step*random.random(), 0.00001, 0.9999)

# Repeatedly tries to modify each dim weight to see if it improves the stress, thus
# getting the weighted high-dim distances to more closely match the input 2D distances.
#   dataHD = high-dim data, as pandas
#   data2D = 2D data input, as pandas
#   curWeights = current weights, as pandas Series, or None for weights[i]=1/p
def inverse_DR(dataHD, data2D, curWeights=None):  # -> new weights, as Series
    dist2D = distance_matrix_2D(data2D)  # compute 2D distances only once
    col_names = dataHD.columns
    dataHD = dataHD.to_numpy()  # use numpy for efficiency
    row, col = dataHD.shape
    
    if curWeights is None:   # '== None' on a Series compares element-wise and cannot be used in an if
        curWeights = np.array([1.0/col]*col)  # default weights = 1/p
    else:
        curWeights = curWeights.to_numpy()
        curWeights = curWeights / curWeights.sum()  # Normalize weights to sum to 1
    newWeights = curWeights.copy()  # re-use this array for efficiency
    
    # Initialize state
    flag = [0]*col         # degree of success of a weight change
    direction = [1]*col  # direction to move a weight, pos or neg
    step = [1.0/col]*col   # how much to change each weight
    
    dataHDw = dataHD * curWeights   # weighted space, re-use this array for efficiency
    distHD = distance_matrix_HD(dataHDw)
    curStress = stress(distHD, dist2D)
    print('Starting stress =', curStress, 'Processing...')

    MAX = 500   # default setting of the number of iterations

    # Try to minorly adjust each weight to see if it reduces stress
    for i in range(MAX):
        for dim in range(col):            
            # Get a new weight for current column
            nw = new_proposal(curWeights[dim], step[dim], direction[dim])
            
            # Scale the weight list such that it sums to 1
            #newWeights = curWeights.copy()  # avoid extra copy op using math below
            #newWeights[dim] = nw
            #newWeights = newWeights / s
            s = 1.0  + nw - curWeights[dim]   # 1.0 == curWeights.sum()
            np.true_divide(curWeights, s, out=newWeights)  # transfers to other array, while doing /
            newWeights[dim] = nw / s
            
            # Apply new weights to HD data
            np.multiply(dataHD, newWeights, out=dataHDw)  # dataHDw = dataHD * newWeights; efficiently reuses dataHDw array
            distHD = distance_matrix_HD(dataHDw)

            # Get the new stress
            newStress = stress(distHD, dist2D)
            
            # If new stress is lower, then update weights and flag this success
            if newStress < curStress:
                temp = curWeights
                curWeights = newWeights
                newWeights = temp   # reuse the old array next iteration
                curStress = newStress
                flag[dim] = flag[dim] + 1
            else:
                flag[dim] = flag[dim] - 1
                direction[dim] = -direction[dim]  # Reverse course
    
            # If recent success, then speed up the step rate
            if flag[dim] >= 5:
                step[dim] = step[dim] * 2
                flag[dim] = 0
            elif flag[dim] <= -5:
                step[dim] = step[dim] / 2
                flag[dim] = 0
                
    print('Solution stress =', curStress, 'Done.')
    #print("weight", curWeights)
    #print("flag", flag)
    #print("dir", direction)
    #print("step", step)
    return pd.Series(curWeights, index=col_names, name="Weight")


# Visualization and UI code

Use these functions to create the GUI components in any cell.

## Sliders

In [13]:
def create_sliders(wts):
    # Create sliders, one for each dimension weight
    style = {'description_width': 'initial'}
    sliders = [widgets.FloatSlider(min=min_weight, max=max_weight, step=0.01, value=value, 
                                       description=label, style = style, 
                                   continuous_update=False, readout_format='.5f',
                                  layout=Layout(width='60%', height='80px'))
                   for (label, value) in wts.items()]  # Series.iteritems() was removed in pandas 2.0
    # Display sliders
    for s in sliders:
        display(s)
    return sliders

def create_slider_buttons(sliders, thresh=20, version=None): 
    create_slider_buttons.thresh = thresh
    apply_button = widgets.Button(description='Apply Slider Weights')
    reset_button = widgets.Button(description='Reset Plot')
    download_impbutton = widgets.Button(description='Save Important Weights')
    download_allbutton = widgets.Button(description='Save Weights')
    load_button = widgets.Button(description='Load Weights')
    


    # Callback functions
    def apply_button_clicked(change):
        # Use the slider values to re compute the DR and redraw the plot
        global weights, df_2D, paths   # Update weights and df_2D globals
        weights = pd.Series([s.value for s in sliders], index=normalized_df.columns, name='Weight')
        df_2D = dimension_reduction(normalized_df, weights)   
        
        # Re-draw the plot
        draw_plot(plot_ax, df_2D, paths, toggle_title.value)
    apply_button.on_click(apply_button_clicked)

    def reset_button_clicked(change):
        # Reset all sliders to initial value and re-compute DR and re-draw the plot
        for s in sliders:
            s.value = init_weight
        apply_button_clicked(change)
    reset_button.on_click(reset_button_clicked)
    
    def download_impbutton_clicked(change):
        # Download .csv file with relatively important weights
        thresh = create_slider_buttons.thresh
        csv_columns = ['Index','Weight']
        csv_file = "SliderWeights.csv"
        slider_dict = {}
        final_slider_dict = {}
        for s in sliders:
            slider_dict[s.description] = s.value
            
        for key in slider_dict.keys():
            if slider_dict[key] > (thresh/100)*max(slider_dict.values()):
                final_slider_dict[key] = slider_dict[key]
        (pd.DataFrame.from_dict(data=final_slider_dict, orient='index')
         .to_csv(csv_file, header=True))     
    download_impbutton.on_click(download_impbutton_clicked)
    
    def download_allbutton_clicked(change):
        newpath = r'./WeightsDirectory' 
        if not os.path.exists(newpath):
            os.makedirs(newpath)
        
        list_of_files = sorted( filter( lambda x: os.path.isfile(os.path.join(newpath, x)),
                        os.listdir(newpath)))
        # Start at version 0 when the directory is empty or the last file has no version digit
        if not list_of_files or not list_of_files[-1][-5].isnumeric():
            version_number = 0
        else:
            version_number = int(list_of_files[-1][-5]) + 1

        # Save a .csv file with all slider weights
        thresh = create_slider_buttons.thresh
        csv_columns = ['Index','Weight']
        csv_file = "./WeightsDirectory/AllSliderWeights_Version_{version}.csv".format(version = str(version_number))
        slider_dict = {}
        for s in sliders:
            slider_dict[s.description] = s.value
        (pd.DataFrame.from_dict(data=slider_dict, orient='index')
          .to_csv(csv_file, header=True))     
    download_allbutton.on_click(download_allbutton_clicked)
    
    def load_button_clicked(change):   
        newpath = r'./WeightsDirectory' 
        list_of_files = sorted( filter( lambda x: os.path.isfile(os.path.join(newpath, x)),
                        os.listdir(newpath)))
        file_exists = False
        for file in list_of_files:
            if file[-5] != 'S' and int(file[-5]) == version:
                file_exists = True
                break
        if file_exists == False:
            print("Version not found")
            return
        
        filename = "./WeightsDirectory/AllSliderWeights_Version_{versionNum}.csv".format(versionNum = str(version)) 
        # read_csv(squeeze=True) was removed in pandas 2.0; squeeze the single column explicitly
        dict_from_csv_un = pd.read_csv(filename, header=None, index_col=0).squeeze("columns").to_dict()
        dict_from_csv = {k: dict_from_csv_un[k] for k in dict_from_csv_un if not isnan(k)}
        
        for s in sliders:
            s.value = dict_from_csv[float(s.description)]
    load_button.on_click(load_button_clicked)
    
    
    # Display buttons
    display(apply_button)
    display(reset_button)
    display(download_impbutton)
    display(download_allbutton)
    display(load_button)

def create_checkbox():
    title_checkbox = widgets.Checkbox(True, description='Toggle Titles')
    image_checkbox = widgets.Checkbox(False, description='Toggle Images')
    
    def title_check_clicked(x):   
        global imageNames
        global paths
        image = image_checkbox.value
        draw_plot(plot_ax, df_2D, paths, x, image)
    interact(title_check_clicked, x=title_checkbox)
    
    def image_clicked(x):   
        global imageNames
        global paths
        title = title_checkbox.value
        draw_plot(plot_ax, df_2D, paths, title, x)
    interact(image_clicked, x=image_checkbox)
    return title_checkbox, image_checkbox

## Print Details Button

In [14]:
def create_detail_display():
    # Print selected points
    print_button = widgets.Button(description='Print selected points')
    print_output = widgets.Output()

    def print_button_clicked(change):
        print_output.clear_output()

        # Get list of selected points and print their source data values
        if point_attribute[1].value:
            selset = [c.index for c in plot_ax.ellipse if c.selected]
        else:
            selset = [c.index for c in plot_ax.circles if c.selected]
        with print_output:
            if len(selset) > 0:
                #print(df.iloc[selset, :].transpose())
                path_list = df.iloc[selset, :].transpose().columns.tolist()
                path_list_mod = []
                for image in path_list:
                    path_list_mod.append("./imagePreProcessing/" + image + ".png")
                ipyplot.plot_images(path_list_mod, labels = path_list)
                    
            else:
                print('Select points in the plot to see details here')
    print_button.on_click(print_button_clicked)

    display(print_button)
    display(print_output)
    return print_button, print_output


## Inverse DR Button and Plot

In [15]:
def create_inverse_button():
    inverse_button = widgets.Button(description='Learn New Weights')
    copy_button = widgets.Button(description='Copy to Sliders')

    def inverse_button_clicked(change):
        ax.clear()

        # Check minimum number of points moved
        if point_attribute[1].value:
            n = len([1 for c in plot_ax.ellipse if c.selected])
        else:
            n = len([1 for c in plot_ax.circles if c.selected])
        if n < 2:
            print('Need to select or move at least 2 points in the plot first.')
            return

        # Get selected data points
        if point_attribute[1].value:
            data2Dnew = pd.DataFrame([c.center for c in plot_ax.ellipse if c.selected], columns=['x','y'], 
                                    index=[c.label for c in plot_ax.ellipse if c.selected])
        else:
            data2Dnew = pd.DataFrame([c.center for c in plot_ax.circles if c.selected], columns=['x','y'], 
                                    index=[c.label for c in plot_ax.circles if c.selected])
            
        dataHDpart = normalized_df.loc[data2Dnew.index]

        # Learn new weights
        global weights
        weights = inverse_DR(dataHDpart, data2Dnew)
        
        # Display new weights as a bar chart
        weights.sort_index(ascending=False).plot.barh(ax=ax)
        ax.set_xlabel("Weight")
        fig.tight_layout()
    inverse_button.on_click(inverse_button_clicked)

    def copy_button_clicked(change):
        # Set sliders to reflect the learned weights and update the DR and plot accordingly
        global df_2D
        global paths
        for i,s in enumerate(sliders):
            s.value = weights.iloc[i]   # positional lookup; the Series index holds column names
        df_2D = dimension_reduction(normalized_df, weights)   
        draw_plot(plot_ax, df_2D, paths)
    copy_button.on_click(copy_button_clicked)
    display(inverse_button)
    display(copy_button)
    fig, ax = plt.subplots(figsize=(5,7))   # reserve a fig for the weights bar chart
    return inverse_button, copy_button, ax


## Draggable Dimension-Reduction 2D Plot

In [16]:
# Handles mouse drag interaction events in the plot, users can select and drag points.
class DraggablePoints_Image(object):
    def __init__(self, ax, artists):
        self.ax = ax
        self.artists = artists
        self.current_artist = None
        self.last_selected = None
        ax.selected_text.set_text('Selected: none')
        self.offset = (0, 0)
        # Set up mouse listeners
        ax.figure.canvas.mpl_connect('pick_event', self.on_pick)
        ax.figure.canvas.mpl_connect('motion_notify_event', self.on_motion)
        ax.figure.canvas.mpl_connect('button_release_event', self.on_release)

    def on_pick(self, event):
        # When point is clicked on (mouse down), select it and start the drag
        if self.current_artist is None:  # clicking on overlapped points sends multiple events
            self.last_selected = event.artist.index  # event.ind
            self.current_artist = event.artist
            event.artist.selected = True
            event.artist.set_facecolor('green')
            self.ax.selected_text.set_text("Selected: " + event.artist.label)
            x0, y0 = event.artist.center
            self.offset = (x0 - event.mouseevent.xdata), (y0 - event.mouseevent.ydata)            


    def on_motion(self, event):
        # When dragging, check if point is selected and valid mouse coordinates
        if (self.current_artist is not None) and (event.xdata is not None) and (event.ydata is not None):
            # Drag the point and its text label
            dx, dy = self.offset
            self.current_artist.center = x0, y0 = event.xdata + dx, event.ydata + dy
            self.current_artist.text.set_position((x0 - 0.0003/2, 
                                                   y0 + 0.0005))
            self.current_artist.ab.xybox = (x0,y0)
        
    def on_release(self, event):
        # When mouse is released, stop the drag
        self.current_artist = None

In [17]:
# Handles mouse drag interaction events in the plot, users can select and drag points.
class DraggablePoints_circles(object):
    def __init__(self, ax, artists):
        self.ax = ax
        self.artists = artists
        self.current_artist = None
        self.last_selected = None
        ax.selected_text.set_text('Selected: none')
        self.offset = (0, 0)
        # Set up mouse listeners
        ax.figure.canvas.mpl_connect('pick_event', self.on_pick)
        ax.figure.canvas.mpl_connect('motion_notify_event', self.on_motion)
        ax.figure.canvas.mpl_connect('button_release_event', self.on_release)

    def on_pick(self, event):
        # When point is clicked on (mouse down), select it and start the drag
        if self.current_artist is None:  # clicking on overlapped points sends multiple events
            self.last_selected = event.artist.index  # event.ind
            self.current_artist = event.artist
            event.artist.selected = True
            event.artist.savecolor = event.artist.get_facecolor()
            event.artist.set_facecolor('green')
            #event.artist.set_alpha(1.0)
            self.ax.selected_text.set_text("Selected: " + event.artist.label)
            x0, y0 = event.artist.center
            self.offset = (x0 - event.mouseevent.xdata), (y0 - event.mouseevent.ydata)

    def on_motion(self, event):
        # When dragging, check if point is selected and valid mouse coordinates
        if (self.current_artist is not None) and (event.xdata is not None) and (event.ydata is not None):
            # Drag the point and its text label
            dx, dy = self.offset
            self.current_artist.center = x0, y0 = event.xdata + dx, event.ydata + dy
            self.current_artist.text.set_position((x0 - self.current_artist.radius/2, 
                                                   y0 - self.current_artist.radius/2))
            #self.ax.figure.canvas.draw()  # slow
        
    def on_release(self, event):
        # When mouse is released, stop the drag
        self.current_artist = None
        #self.ax.figure.canvas.draw()

In [18]:
def create_plot(data2D, title=True):    
    # Initialize DR plot figure   
    fig, ax = plt.subplots(figsize= (10,10), dpi=80)

    ax.selected_text = ax.figure.text(0,0.005, 'Selected: none', wrap=True, color='green')
    ax.set_xticks([])
    ax.set_yticks([])
    ax.figure.tight_layout()
    global imageNames
    
    
    # Store state data:
    ax.ellipse = None
    ax.dragpoint = None
    draw_plot(ax, data2D, imageNames, title)
    return ax


def image_preprocessing(paths):
    global finalDir
    newpath = r'./imagePreProcessing' 
    if not os.path.exists(newpath):
        os.makedirs(newpath)
    path = [i for i in finalDir if paths in i][0]

    src = cv2.imread(path, 1)
    tmp = cv2.cvtColor(src, cv2.COLOR_BGR2GRAY)        
    _,alpha = cv2.threshold(tmp,0,255,cv2.THRESH_BINARY)
    b, g, r = cv2.split(src)
    rgba = [b,g,r, alpha]
    dst = cv2.merge(rgba,4)
    cv2.imwrite("./imagePreProcessing/" + paths + ".png", dst)
    return "./imagePreProcessing/" + paths + ".png"

def getImage(path):
    return OffsetImage(plt.imread(path), zoom = .04)
        
def draw_plot(ax, data2D, paths, title=True, image=False):  
    if not image:
        # Re-draws the DR plot in the axes with the updated data2D
        ax.clear()
        labels_df = data2D.index.str.extract(r'([a-z]+)')
        labels_df.index = data2D.index
        data2D['label'] = labels_df

        ready_centroid = centroid(data2D,'ready')
        disease_centroid = centroid(data2D,'disease')
        late_centroid = centroid(data2D,'late')

        # Map data to circles, with x, y, and relevance->color
        wid = max(data2D.x.max() - data2D.x.min(), data2D.y.max() - data2D.y.min())  # max range of x,y axes
        cnorm = matplotlib.colors.Normalize(vmin=data2D.relevance.min(), vmax=data2D.relevance.max())
        ax.circles = data2D.apply(axis=1, func=lambda row: 
            matplotlib.patches.Circle(xy=(row.x, row.y), radius=wid/70, alpha=0.5, 
                                      color=plt.cm.plasma(cnorm(row.relevance)), picker=True))
        for i,c in enumerate(ax.circles):
            # Store state data:
            c.index, c.label, c.selected = i, data2D.index[i], False
            # Draw circles and text labels in plot
            ax.add_patch(c)
            
            if title:
                c.text = ax.text(c.center[0]-c.radius/2, c.center[1]-c.radius/2, c.label, color='black')
            else:
                c.text = ax.text(c.center[0]-c.radius/2, c.center[1]-c.radius/2, "", color='none')
        # Make plot circles draggable
        ax.dragpoint = DraggablePoints_circles(ax, ax.circles)

        # Clean up the plot
        ax.set_xticks([])
        ax.set_yticks([])
        ax.axis('equal')
    else:
        ax.clear()
        wid = max(data2D.x.max() - data2D.x.min(), data2D.y.max() - data2D.y.min())  # max range of x,y axes
        i = 0
        ax.ellipse = []
        for x0, y0, path in zip(data2D.x, data2D.y, data2D.index):
            ax.ellipse.append(matplotlib.patches.Ellipse(xy=(x0, y0), width = wid/40, height = wid/12, alpha=0.5, 
                                      color = 'none', picker=True))
            c = ax.ellipse[-1]
            c.index, c.label, c.selected = i, data2D.index[i], False
            ax.add_patch(c)

            if title:
                c.text = ax.text(c.center[0]-0.0003/2, c.center[1]+0.0005, c.label, color = 'black')
            else:
                c.text = ax.text(c.center[0]-0.0003/2, c.center[1]+0.0005, "", color = 'none')
            path = image_preprocessing(path)
            img = getImage(path)
            c.ab = AnnotationBbox(img, (x0, y0), frameon=False)
            ax.add_artist(c.ab)
            i += 1

        # Make the image patches draggable (set up once, after all patches exist)
        ax.dragpoint = DraggablePoints_Image(ax, ax.ellipse)

        # Clean up the plot
        ax.set_xticks([])
        ax.set_yticks([])
        ax.axis('equal')

# Interactive Visualization

## Sliders
Use the sliders to control the **dimension weights** for the input HD data. The weights indicate which dimensions are given more emphasis in the DR plot distances. For computational reasons, a weight cannot be set to exactly zero. Before the weights are used in DR, they are normalized so that they sum to 1.
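
As a minimal sketch of the normalization described above: the helper name `normalize_weights` and the floor constant `MIN_WEIGHT` are illustrative assumptions, not names taken from this notebook. The floor of `1e-5` is chosen only because the slider outputs below show `value=1e-05`.

```python
import numpy as np

MIN_WEIGHT = 1e-5  # assumed floor; keeps every weight strictly positive


def normalize_weights(raw, floor=MIN_WEIGHT):
    """Clamp raw slider values to a small positive floor, then scale to sum to 1."""
    w = np.maximum(np.asarray(raw, dtype=float), floor)
    return w / w.sum()


# A slider left at zero is lifted to the floor; the others keep their ratio (1 : 3).
w = normalize_weights([0.0, 0.5, 1.5])
print(w, w.sum())
```

Clamping before normalizing means a dimension can be made almost irrelevant but never removed entirely, which avoids degenerate all-zero distances.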

After adjusting the sliders, click the **Apply** button to show the results in the DR plot.

To reset the sliders to their default values, click the **Reset** button.

In [19]:
sliders = create_sliders(weights)

FloatSlider(value=1e-05, continuous_update=False, description='1', layout=Layout(height='80px', width='60%'), …

[63 more FloatSlider outputs follow, with descriptions '2' through '64' and otherwise the same displayed settings as the first.]

In [20]:
create_slider_buttons(sliders, thresh = 20, version=1)

Button(description='Apply Slider Weights', style=ButtonStyle())

Button(description='Reset Plot', style=ButtonStyle())

Button(description='Save Important Weights', style=ButtonStyle())

Button(description='Save Weights', style=ButtonStyle())

Button(description='Load Weights', style=ButtonStyle())

## Dimension Reduction Plot
This shows the HD data in 2D form, such that **proximity == similarity**, based on the current slider weights.  Distances between points in the plot approximately reflect their distances in the weighted HD data.  Thus points near each other have similar HD data values in the up-weighted dimensions, and points far away have very different HD data values in those dimensions.
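
To illustrate how the weights change which points count as similar, here is a small sketch that assumes a weighted Euclidean metric; the notebook's own distance function may differ, and `weighted_distances` is a hypothetical name. A matrix like the one it returns could then be handed to an MDS routine that accepts precomputed dissimilarities.

```python
import numpy as np


def weighted_distances(X, weights):
    """Pairwise weighted Euclidean distances between the rows of X.

    Assumption: d(i, j) = sqrt(sum_k w_k * (x_ik - x_jk)^2), with w normalized to sum to 1.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    diff = X[:, None, :] - X[None, :, :]      # shape (n, n, dims)
    return np.sqrt((w * diff ** 2).sum(axis=-1))


# Three points: point 1 differs from point 0 only in dimension 0,
# point 2 differs from point 0 only in dimension 1.
X = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])

D_equal = weighted_distances(X, [1.0, 1.0])    # both dimensions matter
D_first = weighted_distances(X, [1.0, 1e-5])   # dimension 0 is up-weighted

print(D_equal.round(3))
print(D_first.round(3))
```

With dimension 0 up-weighted, points 0 and 2 become nearly coincident because they differ only in the down-weighted dimension, so they would land close together in the plot.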

The color represents the **relevance** of each point to the current slider weights. Yellower points have larger values in up-weighted dimensions.

Points can be **selected** to highlight them in green and view their details below. Points can also be **dragged** to specify a new projection for learning weights (see Inverse Dimension Reduction below). To reset the plot and clear the selections, click the **Reset Plot** button above.

In [21]:
plot_ax = create_plot(df_2D)

In [22]:
point_attribute = create_checkbox()

interactive(children=(Checkbox(value=True, description='Toggle Titles'), Output()), _dom_classes=('widget-inte…

interactive(children=(Checkbox(value=False, description='Toggle Images'), Output()), _dom_classes=('widget-int…

##  Selected Points Details
Click the **Print selected points** button below to display the detailed data values of the points selected in the DR plot. Selected points are shown in green.

In [23]:
dod = create_detail_display()

Button(description='Print selected points', style=ButtonStyle())

Output()

## Inverse Dimension Reduction
After selecting and/or dragging some points in the DR plot, click the **Learn New Weights** button to learn new dimension weights that would produce a plot whose pairwise distances are similar to those in your layout. **Only the green selected points** in the plot are considered when learning new weights. You must select or move at least two points to specify new desired distances. Use this to create your own clusters and to find out what makes some data points similar to or different from others. The **learned weights** are shown in a bar chart below.

To see the effects of the learned weights, click the **Copy to Sliders** button to apply the learned weights to the sliders and make a new DR plot above.
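
The idea behind learning weights from a layout can be sketched as follows. This is an illustration under stated assumptions, not the algorithm this notebook actually runs: it fits squared weighted Euclidean distances to the squared 2D layout distances using projected gradient descent, and the names `pairwise_sq_diffs` and `learn_weights` are hypothetical.

```python
import numpy as np


def pairwise_sq_diffs(X):
    """Per-dimension squared differences for every pair i < j; shape (n_pairs, n_dims)."""
    i, j = np.triu_indices(len(X), k=1)
    return (X[i] - X[j]) ** 2


def learn_weights(X_hd, Y_2d, n_iter=5000, floor=1e-5):
    """Find non-negative weights w so that A @ w approximates the squared 2D distances."""
    A = pairwise_sq_diffs(X_hd)                 # squared HD differences per dimension
    b = pairwise_sq_diffs(Y_2d).sum(axis=1)     # squared distances in the 2D layout
    w = np.full(A.shape[1], 1.0 / A.shape[1])   # start from equal weights
    step = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ w - b)
        w = np.maximum(w - step * grad, floor)  # project back onto w >= floor
    return w / w.sum()                          # normalize so the weights sum to 1


# Toy check: the "user layout" spreads the points out by dimension 0 only,
# so the learned weights should concentrate on dimension 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
Y = np.column_stack([X[:, 0], np.zeros(len(X))])
w_learned = learn_weights(X, Y)
print(w_learned.round(4))
```

Fitting squared distances keeps the objective linear in the weights, which makes the sketch short; a stress function defined on unsquared distances, as the stress values printed below suggest, would need a general-purpose optimizer instead.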

In [24]:
inverse = create_inverse_button()

Button(description='Learn New Weights', style=ButtonStyle())

Button(description='Copy to Sliders', style=ButtonStyle())

Starting stress = 0.5066013775260267 Processing...
Solution stress = 7.320093139361433e-32 Done.
