### pin_mask demo
- We get a $1^{st}$ embedding (using random init)
- linearly transform the embedding so that the "good" and "bad" samples' x values end up at -1 and +1
    - (in least-squares sense)
- then re-embed "pinning" those x values *exactly* at -1, +1
    - (could random-init by-hand, then fix x values for single pin_mask umap fit)
- umap applies no gradient to the pinned x values (but does apply rescaling)
- we determine the new linear rescaling
- and remap the embedding back to good|bad x-values -1|+1
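
The least-squares step above can be sketched on toy numbers. This is a minimal illustration rather than the notebook's `emb_linear`; the helper name `fit_x_rescale` and the x-values are made up for the example.

```python
import numpy as np

def fit_x_rescale(x_good, x_bad, lo=-1.0, hi=1.0):
    """Least-squares (m, c) so that m*x + c is near lo for good and hi for bad."""
    x = np.concatenate([x_good, x_bad])
    y = np.concatenate([np.full(len(x_good), lo), np.full(len(x_bad), hi)])
    A = np.column_stack([x, np.ones_like(x)])   # columns: x, 1
    m, c = np.linalg.lstsq(A, y, rcond=None)[0]
    return m, c

# toy x-values of the "good" and "bad" samples (illustrative only)
x_good = np.array([2.0, 2.5, 3.0])
x_bad = np.array([6.0, 6.5, 7.0])
m, c = fit_x_rescale(x_good, x_bad)
fitted_good = m * x_good + c   # only approximately -1
fitted_bad = m * x_bad + c     # only approximately +1
```

Because the fit is only approximate, the pinned x values are afterwards overwritten with exactly -1 and +1 before the second UMAP fit.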

### New
- We select a good/bad entry based on feature 0 "sepal length"
  for each iris species.  Lowest pinned to x=-1, highest pinned to x=+1
  
  
- still perhaps better to ROTATE (see Kabsch algorithm) the initial embedding
  such that good|bad points are aligned *towards* (-1,0)|(+1,0) idealized drag
  positions
   - then rescale and shift to put their x-centroids exactly at (-1,0)|(+1,0)
- Now we rotate, scale & shift
- TODO: option to **shear** instead of orthogonally rotate
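
A minimal sketch of the rotate, scale & shift idea for a 2-D embedding, assuming we only require the two *centroids* to land on the targets. The helper name `align_centroids` is hypothetical, and this is a simple two-point construction, not a full Kabsch/Procrustes fit over all pinned points.

```python
import numpy as np

def align_centroids(emb, good_idx, bad_idx):
    """Rotate, scale and shift a 2-D embedding so the good centroid lands at
    (-1, 0) and the bad centroid at (+1, 0). Assumes the centroids differ."""
    g = emb[good_idx].mean(axis=0)
    b = emb[bad_idx].mean(axis=0)
    d = b - g
    length = np.hypot(d[0], d[1])
    cos_t, sin_t = d / length
    # Row-vector convention: (p @ R) rotates direction d onto the +x axis.
    R = np.array([[cos_t, -sin_t],
                  [sin_t,  cos_t]])
    mid = 0.5 * (g + b)          # midpoint of the centroids goes to the origin
    scale = 2.0 / length         # centroid separation becomes 2
    return (emb - mid) @ R * scale

rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 2)) * 5.0
good_idx = [0, 1, 2]
bad_idx = [3, 4, 5]
aligned = align_centroids(emb, good_idx, bad_idx)
```

All points share one rotation and one scale factor, so relative distances in the embedding are preserved up to that factor.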

#### behavior is not (completely) *as expected*!
- Still no *constraint* of having unpinned points having x-values between (-1,+1)

### New
- added 'TRIAL' code blocks to layouts.py to test a dimension-wise clipping bound
- clipping bounds for *x* must be -10,+10 so that spectral init is not fubar.
- *GOOD*: final embedding now does have *unpinned points* "on the inside"
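
A rough NumPy sketch of a dimension-wise clipping bound. This is not the TRIAL code in layouts.py, only an illustration of the bound itself; `clip_dims` is a hypothetical helper and the sample values are made up.

```python
import numpy as np

def clip_dims(emb, lo, hi):
    """Return a copy of emb with column j clipped to [lo[j], hi[j]]."""
    return np.clip(emb, lo, hi)

emb = np.array([[-12.0, 3.0],
                [0.5, -40.0],
                [11.0, 7.0]])
lo = np.array([-10.0, -np.inf])   # bound x only; leave y free
hi = np.array([+10.0, +np.inf])
clipped = clip_dims(emb, lo, hi)
```

Using infinite bounds for a dimension is one way to leave it unconstrained.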

### NEW: add hover tool to inspect bokeh embedding (and data) values

### *TODO*: add a user-passable `constraint` object (default=None) to UMAP.
- or actually, a list of constraint objects
- `constraint.project_onto_constraint( low_embedding_vector )` may do an
  in-place modification of `low_embedding_vector`
- layouts.py has a `jitclass` called `DimLohi` example of a clipping constraint
- other constraint types might project the full data set (during/post-epoch?),
  perhaps like `DimLohi.project_rows_onto_constraint(self, mat)`
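
One way the list-of-constraints idea might look in plain Python (no numba). The class names `ClipConstraint` and `MaxNormConstraint` and the runner `apply_constraints` are hypothetical; only the method name `project_onto_constraint` comes from the notes above.

```python
import numpy as np

class ClipConstraint:
    """Hypothetical constraint: clip dimension j to [lo[j], hi[j]]."""
    def __init__(self, lo, hi):
        self.lo = np.asarray(lo, dtype=np.float32)
        self.hi = np.asarray(hi, dtype=np.float32)

    def project_onto_constraint(self, vec):
        np.clip(vec, self.lo, self.hi, out=vec)   # in-place modification
        return vec

class MaxNormConstraint:
    """Hypothetical constraint: keep the vector inside a ball of given radius."""
    def __init__(self, radius):
        self.radius = float(radius)

    def project_onto_constraint(self, vec):
        n = float(np.linalg.norm(vec))
        if n > self.radius:
            vec *= self.radius / n                # in-place rescale
        return vec

def apply_constraints(vec, constraints=None):
    """Apply each constraint in order; None (the default) means no constraints."""
    for c in constraints or []:
        c.project_onto_constraint(vec)
    return vec

vec = np.array([12.0, -16.0], dtype=np.float32)
constraints = [ClipConstraint([-10, -10], [10, 10]), MaxNormConstraint(5.0)]
apply_constraints(vec, constraints)
```

The order of the list matters here: clipping first and then shrinking to the ball generally gives a different point than the reverse order.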
  
### *LATER*: see if UMAP can crudely be simulated by spring & dashpot physics
- *springs* : equilibrium distances and force constants
- *dashpots* : motion damping proportional to velocity

Why? such force fields can be quickly done client side, all in the low-dim embedding space.

- Init via nearest-neighbors + random sample of distant neighbors.
- Fine-tuning might double-check points with "too-large" force-gradient
  and add some small number of careful long-distance springs that best reduce this.

- Approximating force-field would be generated server-side.
- Client drag-n-drop uses approx force field (perhaps with some expanded *user dims*)
- State is preserved after rerunning umap? Or is it better to save the spring model
  and let the user decide when to run a full UMAP recalc (with new drag constraints etc.)

Shorter distance ~ higher spring constant; longer ~ weak spring constants
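
A toy spring & dashpot integrator, to make the idea concrete. Everything here is an assumption for illustration: unit masses, semi-implicit Euler steps, spring constant chosen as `1 / rest` so that shorter springs are stiffer, and a `pinned` mask standing in for dragged points. It makes no attempt to reproduce UMAP's actual forces.

```python
import numpy as np

def spring_step(pos, vel, edges, rest, k, pinned=None, damping=0.5, dt=0.05):
    """One semi-implicit Euler step for unit-mass points joined by springs."""
    i, j = edges[:, 0], edges[:, 1]
    delta = pos[j] - pos[i]
    dist = np.maximum(np.linalg.norm(delta, axis=1), 1e-12)
    # Hooke's law along each edge: stretched springs pull i towards j.
    pull = (k * (dist - rest) / dist)[:, None] * delta
    force = -damping * vel            # dashpot: damping proportional to velocity
    np.add.at(force, i, pull)
    np.add.at(force, j, -pull)
    vel = vel + dt * force
    if pinned is not None:
        vel[pinned] = 0.0             # pinned (dragged) points do not move
    pos = pos + dt * vel
    return pos, vel

pos = np.array([[0.0, 0.0], [3.0, 0.0]])
vel = np.zeros_like(pos)
edges = np.array([[0, 1]])
rest = np.array([1.0])                # equilibrium distance
k = 1.0 / rest                        # shorter rest length ~ stiffer spring
pinned = np.array([True, False])
for _ in range(2000):
    pos, vel = spring_step(pos, vel, edges, rest, k, pinned=pinned)
final_dist = np.linalg.norm(pos[1] - pos[0])
```

With damping, the free point settles at the spring's equilibrium distance from the pinned one; the same loop would translate fairly directly to client-side code.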

### drag-n-drop & high-D embedding
The current approach embeds high-D euclidean to low-D euclidean.  However, the drag'n'drop *user interest* was generated from a single high-D feature.

#### What rotation of high-D space, when projected onto the first 2 dims, best matches the dragged points?
- i.e. align 1st PCA component of the dragged point data with x axis.
   - 2nd one would be some randomish projection (ok, user doesn't really care about *y* yet,
     except that it should yield "some" separation in lo-D space)
- a weighted-Euclidean metric (in the rotated hi-D space) might work nicely.
- but no dims should get *zero* weight
   - to avoid catastrophic info loss
   - so 'y' axis still allows separation of individual items
   - (do not want all items to appear on a single left-right line)
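
A sketch of the rotate-then-weight idea, under stated assumptions: the rotation is taken from an SVD of the centered dragged rows, the first rotated dimension gets weight 1, and all others get a small but nonzero `minor_weight`. The function name `rotate_and_weight` and the random data are hypothetical.

```python
import numpy as np

def rotate_and_weight(X, dragged_idx, minor_weight=0.25):
    """Rotate X so dim 0 is the first principal direction of the dragged rows,
    then scale dims so plain Euclidean distance in the result equals a
    weighted-Euclidean distance in the rotated space."""
    if minor_weight <= 0:
        raise ValueError("no dimension may get zero weight")
    D = X[dragged_idx] - X[dragged_idx].mean(axis=0)
    # Rows of Vt form an orthonormal basis; Vt[0] is the first principal direction.
    _, _, Vt = np.linalg.svd(D, full_matrices=True)
    w = np.full(X.shape[1], minor_weight)
    w[0] = 1.0
    Z = (X @ Vt.T) * np.sqrt(w)
    return Z, w

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
dragged_idx = [0, 1, 2, 3, 4, 5]
Z, w = rotate_and_weight(X, dragged_idx)
```

`Z` could then be handed to an ordinary Euclidean embedding step; because every weight stays positive, items that differ only in the minor dimensions remain separable.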

In [1]:
from bokeh.plotting import figure, output_file, show
from bokeh.models import CategoricalColorMapper, ColumnDataSource, HoverTool
from bokeh.palettes import Category10, Colorblind, Viridis
from bokeh.io import output_notebook
from bokeh.layouts import column
output_notebook()

import umap
import numpy as np
from sklearn.datasets import load_iris
import pandas as pd

In [2]:
iris = load_iris()

In [3]:
print(type(iris))
print(type(iris.data))
print(iris.data.shape, iris.data[0:5,])
print(iris.target.shape, iris.target[0:5])
print(iris.target_names.size, iris.target_names)
print(len(iris.feature_names), iris.feature_names)

<class 'sklearn.utils.Bunch'>
<class 'numpy.ndarray'>
(150, 4) [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
(150,) [0 0 0 0 0]
3 ['setosa' 'versicolor' 'virginica']
4 ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [4]:
# I'm interested in samples where feature 0 (sepal length) is really small.
feature_of_interest = 0
fi_name = iris.feature_names[feature_of_interest]
di = data_of_interest = iris.data[:,feature_of_interest]
if True:
    # This time, choose good/bad from each iris species
    nFeat = iris.target_names.size
    good10 = np.zeros(nFeat,dtype=np.int32)
    bad10 = np.zeros(nFeat,dtype=np.int32)
    for t,name in enumerate(iris.target_names):
        print(t,name)
        mask = (iris.target==t)
        #print(di[mask])
        dilo = np.argmin(di[mask]) # index within masked group
        dihi = np.argmax(di[mask])
        #print(dilo)
        diilo = np.arange(di.shape[0]) [mask] [dilo] # index within original
        diihi = np.arange(di.shape[0]) [mask] [dihi]
        #print(diilo)
        good10[t] = diilo
        bad10[t] = diihi

    print("\nSelected shortest (good) and longest (bad)",
          fi_name, "of each iris species")
    print(fi_name, "good/bad values:")
    row_names = ["good", "bad"]
    col_names = iris.target_names
    matrix = np.zeros((2,3))
    for t,name in enumerate(col_names):
        matrix[0,t] = iris.data[ good10[t], feature_of_interest ]
        matrix[1,t] = iris.data[ bad10[t], feature_of_interest ]
    df = pd.DataFrame(matrix, columns=col_names, index=row_names)
    print(df)
    print("\n")
    

if False: # older case
    # choose 2 "interesting" examples and 2 uninteresting
    nFeat = 4
    print("feature_of_interest",feature_of_interest)
    best = np.argmin(data_of_interest)
    #good3 = np.argpartition(iris.data[:,0], 3)
    #print("good",good, "good3",good3)
    #print(iris.data[good3,])
    goods = np.argsort(data_of_interest)
    good10 = goods[0:nFeat]
    bad10 = goods[-nFeat:,]

print("good10",good10,"\ndata of goods:\n",iris.data[good10,])
print("bad10",bad10,  "\ndata of bads:\n",iris.data[bad10])
#

0 setosa
1 versicolor
2 virginica

Selected shortest (good) and longest (bad) sepal length (cm) of each iris species
sepal length (cm) good/bad values:
      setosa  versicolor  virginica
good     4.3         4.9        4.9
bad      5.8         7.0        7.9


good10 [ 13  57 106] 
data of goods:
 [[4.3 3.  1.1 0.1]
 [4.9 2.4 3.3 1. ]
 [4.9 2.5 4.5 1.7]]
bad10 [ 14  50 131] 
data of bads:
 [[5.8 4.  1.2 0.2]
 [7.  3.2 4.7 1.4]
 [7.9 3.8 6.4 2. ]]


In [5]:
##%%writefile iris4-emb-init.log
##%%capture iris4-emb-init.log
embedding = umap.UMAP(
    n_neighbors=50, learning_rate=0.5, random_state=12345, init="random", min_dist=0.001
).fit_transform(iris.data)
print(embedding[0:15,])

[[ 7.558583  14.226555 ]
 [ 8.783481  14.9717455]
 [ 8.937262  14.397663 ]
 [ 9.175676  14.721283 ]
 [ 7.775514  14.120574 ]
 [ 7.654318  13.304425 ]
 [ 9.012706  14.493627 ]
 [ 8.078566  14.389912 ]
 [ 9.420021  14.650941 ]
 [ 8.989386  14.855281 ]
 [ 7.4851007 13.48668  ]
 [ 8.417519  14.440206 ]
 [ 9.013475  14.866419 ]
 [ 9.387079  14.607569 ]
 [ 7.455905  13.235369 ]]


In [6]:
print(embedding[0:15,])

[[ 7.558583  14.226555 ]
 [ 8.783481  14.9717455]
 [ 8.937262  14.397663 ]
 [ 9.175676  14.721283 ]
 [ 7.775514  14.120574 ]
 [ 7.654318  13.304425 ]
 [ 9.012706  14.493627 ]
 [ 8.078566  14.389912 ]
 [ 9.420021  14.650941 ]
 [ 8.989386  14.855281 ]
 [ 7.4851007 13.48668  ]
 [ 8.417519  14.440206 ]
 [ 9.013475  14.866419 ]
 [ 9.387079  14.607569 ]
 [ 7.455905  13.235369 ]]


In [7]:
#output_file("iris2a.html")

targets = [str(d) for d in iris.target_names]
targets += ["good","bad"]
source = ColumnDataSource(
    data = dict(
        x0=embedding[:,0],
        y0=embedding[:,1],
        #g=[i in good10 for e,i in enumerate(embedding) # ?
        #b=[i in bad10  for i in range(embedding.shape[0])] # equiv for bad ?
        label=[targets[d] for d in iris.target],
    )
)
#for i in range(len(iris.feature_names)
#    source.data[iris.feature_names[i]] = iris.data[i,]
# 4 ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
source.data["Sepal_Length"] = iris.data[:,0]
source.data["Sepal_Width"]  = iris.data[:,1]
source.data["Petal_Length"] = iris.data[:,2]
source.data["Petal_Width"]  = iris.data[:,3]
tooltips = [
    ("(x,y)",  "(@x0,@y0)"),   # tooltips[1] can be modified in later plots
    ("Iris Sample", "$index: @label"),
    ("Sepal Length,Width", "@Sepal_Length{0.0}, @Sepal_Width{0.0}"),
    ("Petal Length,Width", "@Petal_Length{0.0}, @Petal_Width{0.0}"),
]
#print(tooltips[0])

cmap = CategoricalColorMapper(factors=targets, palette=Category10[10])

p1 = figure(title="Test UMAP on Iris dataset",
            tooltips=tooltips)
circles = p1.circle(source=source, x="x0", y="y0",
    size=8, fill_alpha=0.5,
    color={"field": "label", "transform": cmap},
    #legend_label="species",
    legend_group="label"
)
# tooltips are only for circles
hover = p1.select_one(HoverTool)
hover.renderers = [circles]

# gray boxes around good/bad points and (fake) category
gb = np.vstack([embedding[good10,], embedding[bad10,]])
gbcat = np.hstack((np.repeat("good",nFeat), np.repeat("bad",nFeat)))
gbsource = ColumnDataSource( dict(
        x0 = gb[:,0],
        y0 = gb[:,1],
        label = gbcat,
    ))
p1.square(source=gbsource, x="x0", y="y0",
          size=16, line_alpha=0.7, line_width=4, fill_alpha=0.0,
          color={"field": "label", "transform": cmap},
          legend_group="label",
)
#p1.add_layout(p1.legend[0], 'right') # outside, plot rectangular!
#p1.legend.location = 'top_left'
#p1.legend.location = 'top_right'
p1.legend.location = 'center_center'

show(p1)

In [8]:
print("good10:\n", embedding[good10,])
print("bad10:\n",  embedding[bad10,])

good10:
 [[ 9.387079  14.607569 ]
 [-4.942664   4.6394544]
 [-4.389731   3.9435954]]
bad10:
 [[ 7.455905   13.235369  ]
 [-2.3573585   4.084615  ]
 [-0.19130808  2.3325393 ]]


In [9]:
# simulate drag'n'drop of goods to left, bads to right

print("good10:\n",  embedding[good10,])
print("bad10:\n",   embedding[bad10,])
method_names=["force", "linear", "rot,scale,trans"]
method=2
#
# method 1: "best" linear transform of x-coords of embedding
#
# y = [e 1] @ [m c] st. e'[good10,][0] ~ -1 and e'[bad10,][0] ~ 1
def emb_linear(emb0, emb1, pt0, pt1, coord=0 ):
    """ shift emb0, emb1 towards goals pt0, pt1, returning [m, c] of the m*x+c that best shifts the `coord` values (default: x, coord=0) """
    e = np.hstack((emb0[:,coord], emb1[:,coord]))
    # hoping for first half ~ -1, rest ~ +1
    y = np.hstack((np.repeat(pt0[coord],emb0.shape[0]), np.repeat(pt1[coord],emb1.shape[0])))
    print(type(e), type(y), e.shape, y.shape)
    assert( e.size == y.size )
    A = np.vstack([e, np.ones(len(e))]).T  # add a one's column
    print(type(A), A.shape, "\n")
    print("A", A)
    print("y", y)

    #x, residuals, rank, s = np.linalg.lstsq(A, y, rcond=None)
    #print("lstsq -> x=",x)
    #m,c = x
    m,c = np.linalg.lstsq(A, y, rcond=None)[0]
    print("m,c",np.round(m,3),np.round(c,3))
    print("fit",m*e+c)
    return [m, c]

def emb_linear_apply(x, embedding):
    """ apply x=[m,c] from emb_linear to a copy of `embedding` (the input is left unmodified) """
    m = x[0]
    c = x[1]
    emb2 = embedding.copy()  # copy: plain assignment would alias and overwrite the input
    # re-embed all data w/ "best" linear transform of 'x' values
    # rescale 'y' too, (keep rel. distances, don't care about y shift)
    emb2[:,0] = m * embedding[:,0] + c
    emb2[:,1] = m * embedding[:,1]
    emb2[:,1] -= np.average(emb2[:,1]) # 'y' centroid --> zero
    return emb2

def opa(a, b):
    """ return rot, scale, translation, and rmsd of shifting `b` to concord with `a`.
    
    `a` and `b` are N D-dim vectors.
    
    Suppose we return r, s, t, d.
    
    To apply the recovered transform to other M D-dim vectors X, calculate
    `X.dot(r) * s + t`
    """
    assert( a.shape == b.shape )
    aT = a.mean(0)
    bT = b.mean(0)
    A = a - aT 
    B = b - bT
    aS = np.sum(A * A)**.5
    bS = np.sum(B * B)**.5
    A /= aS
    B /= bS
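    U, _, V = np.linalg.svd(np.dot(B.T, A))  # note: numpy returns V already transposed
    aR = np.dot(U, V)
    if np.linalg.det(aR) < 0:  # reflection: flip the last singular direction to get a proper rotation
        V[-1] *= -1
        aR = np.dot(U, V)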
    aS = aS / bS
    aT-= (bT.dot(aR) * aS)
    # the original only returned a rotation-only "rms"... between scaled+translated points
    aD = (np.sum((A - B.dot(aR))**2) / len(a))**.5
    # the xform in general is : a[1] = a[1].dot(r) * s + t
    # if we actually DO the full transform "LONG HAND"
    #aD = np.sqrt(((a - (b.dot(aR) * aS + aT))**2).sum() / len(a))
    # equivalently, include scaling into previous rmsd as
    aD *= (aS * bS)
    return aR, aS, aT, aD 
        
def emb_opa(emb0, emb1, pt0, pt1, coord=0):
    """ rotate,scale,translate s.t. coord of emb0,emb1 somewhat match 2 points pt0,pt1."""
    print(len(pt0), emb0.shape)
    D = emb0.shape[1]
    assert( len(pt0) == D ) # emb0 and pt0 are both D-dim
    assert( emb1.shape[1] == D )
    assert( len(pt1) == D )
    e = np.vstack((emb0, emb1))
    print(e.shape, e)
    # hoping for first half ~ -1, rest ~ +1
    # if pt0,pt1 were scalar target values for a single coord...
    #y = np.zeros_like(e)
    #y[:,coord] = np.hstack((np.repeat(pt0,emb0.shape[0]), np.repeat(pt1,emb1.shape[0])))
    #print("y[,feat]",y[:,feature_of_interest])
    # if pt0,pt1 are D-dim target points for each class emb0/emb1
    y = np.repeat( np.vstack((pt0,pt1)), [emb0.shape[0],emb1.shape[0]], axis=0 )
    print(y.shape, y)
    
    print(type(e), type(y), e.shape, y.shape)
    assert( e.size == y.size )
    #r,s,t,d = opa(y,e)
    return opa(y,e)

def emb_opa_apply(x, embedding):
    """ given x=[r,s,t,d] from emb_opa, return the transformed array embedding.dot(r)*s + t """
    return embedding.dot(x[0]) * x[1] + x[2]

#
# method 0: naive, brute force
#
if method==0:
    emb2 = embedding.copy()  # copy, so the original embedding is not overwritten
    emb2[good10,0] = -10.0
    emb2[bad10,0] = +10.0
    # --- without clamping, we totally lose the "init" state

#
# method 1: "best" linear transform of x-coords of embedding
#
if method==1:
    print("good10:\n",  embedding[good10,])
    print("bad10:\n",  embedding[bad10,])
    x = emb_linear( embedding[good10,], embedding[bad10,], [-10,0], [10,0] )
    print("x", x)
    emb2 = emb_linear_apply( x, embedding)

#
# method 2: "best" rotate, scale and translate
#
if method==2:
    x = emb_opa( embedding[good10,], embedding[bad10,], [-10,0], [10,0] )
    print("x", x)
    emb2 = emb_opa_apply( x, embedding)

print("UMAP pinning init method", method_names[method])
print("emb2 pinning init good10:\n",  emb2[good10,])
print("emb2 pinning init bad10:\n",  emb2[bad10,])
#embedding = emb2

good10:
 [[ 9.387079  14.607569 ]
 [-4.942664   4.6394544]
 [-4.389731   3.9435954]]
bad10:
 [[ 7.455905   13.235369  ]
 [-2.3573585   4.084615  ]
 [-0.19130808  2.3325393 ]]
2 (3, 2)
(6, 2) [[ 9.387079   14.607569  ]
 [-4.942664    4.6394544 ]
 [-4.389731    3.9435954 ]
 [ 7.455905   13.235369  ]
 [-2.3573585   4.084615  ]
 [-0.19130808  2.3325393 ]]
(6, 2) [[-10   0]
 [-10   0]
 [-10   0]
 [ 10   0]
 [ 10   0]
 [ 10   0]]
<class 'numpy.ndarray'> <class 'numpy.ndarray'> (6, 2) (6, 2)
x (array([[ 0.80802539,  0.58914766],
       [-0.58914766,  0.80802539]]), 1.3469054785339654, array([ 4.76615447, -8.42750914]), 13.154369144479203)
UMAP pinning init method rot,scale,trans
emb2 pinning init good10:
 [[ 3.3909417  14.91929549]
 [-4.29464276 -7.30036755]
 [-3.14068451 -7.61892739]]
emb2 pinning init bad10:
 [[ 2.37805608 11.89344639]
 [-1.04068772 -5.85270992]
 [ 2.70701633 -6.04073566]]


In [10]:
#plot emb2
# Modify existing ColumnDataSources
source.data["x2"] = emb2[:,0]
source.data["y2"] = emb2[:,1]
tooltips[0] = ("(x,y)",  "(@x2,@y2)")
# gray boxes around good/bad points and (fake) category
gb = np.vstack([emb2[good10,], emb2[bad10,]])
gbsource.data["x2"] = gb[:,0]
gbsource.data["y2"] = gb[:,1]
#print(tooltips)
#cmap = CategoricalColorMapper(factors=targets, palette=Category10[10])

p2 = figure(title=("Iris "+method_names[method]+" drag'n'drop init"),
            tooltips=tooltips
           )
circles = p2.circle( source=source, x="x2", y="y2",
    size=8, fill_alpha=0.5,
    color={"field": "label", "transform": cmap},
    legend_group="label"
)
hover = p2.select_one(HoverTool)
hover.renderers = [circles]

p2.square(source=gbsource, x="x2", y="y2",
          size=16, line_alpha=0.7, line_width=4, fill_alpha=0.0,
          color={"field": "label", "transform": cmap},
          legend_group="label"
)
#p2.legend.location = 'top_right'
#p2.legend.location = 'top_left'
p2.legend.location = 'center_center'
show(p2)

In [11]:
# Pinned UMAP (and undo UMAP internal rescaling)
emb3 = emb2.copy()
emb3[good10,0] = -10
emb3[bad10,0]  = +10
print("embedding.shape",emb3.shape)
print("good10:\n", emb3[good10,])
print("bad10:\n",  emb3[bad10,])
pin_mask = np.ones(emb3.shape, dtype=np.float32) # todo: allow float32
pin_mask[good10,0] = 0.0 # zero gradient, so zero 'x' movement of init embedding
pin_mask[bad10,0] = 0.0
for i in range(pin_mask.shape[0]):
    if np.any(pin_mask[i,] == 0.0):
        print("pinned sample",i,"pin_mask",pin_mask[i,],"emb3",emb3[i,])
print("pin_mask.shape",pin_mask.shape)
print("pin_mask good10:\n", pin_mask[good10,])
print("pin_mask bad10:\n",  pin_mask[bad10,])
#   NOTE: should have pin_mask in UMAP constructor !

embedding.shape (150, 2)
good10:
 [[-10.          14.91929549]
 [-10.          -7.30036755]
 [-10.          -7.61892739]]
bad10:
 [[10.         11.89344639]
 [10.         -5.85270992]
 [10.         -6.04073566]]
pinned sample 13 pin_mask [0. 1.] emb3 [-10.          14.91929549]
pinned sample 14 pin_mask [0. 1.] emb3 [10.         11.89344639]
pinned sample 50 pin_mask [0. 1.] emb3 [10.         -5.85270992]
pinned sample 57 pin_mask [0. 1.] emb3 [-10.          -7.30036755]
pinned sample 106 pin_mask [0. 1.] emb3 [-10.          -7.61892739]
pinned sample 131 pin_mask [0. 1.] emb3 [10.         -6.04073566]
pin_mask.shape (150, 2)
pin_mask good10:
 [[0. 1.]
 [0. 1.]
 [0. 1.]]
pin_mask bad10:
 [[0. 1.]
 [0. 1.]
 [0. 1.]]


In [12]:
##%%writefile iris4-reemb.log
print("init=emb3 good10:\n", emb3[good10,])
print("init=emb3 bad10:\n",  emb3[bad10,])
# re-embed just with new init conditions
embedder = umap.UMAP(
    n_neighbors=50, learning_rate=0.5, random_state=12346, init=emb3,
    negative_sample_rate=5, repulsion_strength=0.40,
    min_dist=0.001, spread=3.0,
    #a=0.1, b=0.9,
)
emb3 = embedder.fit_transform(iris.data, pin_mask=pin_mask)
print(emb3[0:15,])
print("pinned umap emb3 good10:\n", emb3[good10,])
print("pinned umap emb3 bad10:\n",  emb3[bad10,])

init=emb3 good10:
 [[-10.          14.91929549]
 [-10.          -7.30036755]
 [-10.          -7.61892739]]
init=emb3 bad10:
 [[10.         11.89344639]
 [10.         -5.85270992]
 [10.         -6.04073566]]
X.shape (150, 4)
pin_mask.shape (150, 2)
TRIAL: opt+mask+version 1
sample 13 pin head[ 0 ] begins at -10.0
sample 14 pin head[ 0 ] begins at 10.0
sample 50 pin head[ 0 ] begins at 10.0
sample 57 pin head[ 0 ] begins at -10.0
sample 106 pin head[ 0 ] begins at -10.0
sample 131 pin head[ 0 ] begins at 10.0
[[  1.5719222   15.80899   ]
 [ -0.45983684  14.223991  ]
 [ -0.37624493  15.260746  ]
 [ -0.59061384  14.797828  ]
 [  1.7790616   15.688201  ]
 [  3.2818418   15.341591  ]
 [ -0.05723464  15.374071  ]
 [  1.1312436   15.292517  ]
 [ -0.9262416   14.986894  ]
 [ -0.04890564  14.408092  ]
 [  2.8703125   15.205132  ]
 [  0.5755368   14.879891  ]
 [ -0.61831445  14.456572  ]
 [-10.          15.192844  ]
 [ 10.          15.066165  ]]
pinned umap emb3 good10:
 [[-10.        15.192844]


In [13]:
# UMAP has rescaled things "behind our back".
# I modified UMAP to avoid the rescale if pin_mask is not None
#    (or maybe if "enough" points have been pinned?)
# There were also mods needed to avoid dimension-wise rescaling
#    factors being applied during 'init='
print("pinned umap good10:\n", emb3[good10,])
print("pinned umap bad10:\n",  emb3[bad10,])

emb4 = emb3.copy()
if False: # old code (this coord-rescale method is actually what we want to do.)
    # Oh-oh.  umap is doing some internal rescaling -- let's undo that.
    goodx = emb3[good10[0],0]
    badx  = emb3[bad10[0],0]
    print("umap --> good,bad=",goodx,badx)
    x = np.array([goodx,badx])
    A = np.array([[goodx,1.0],[badx,1.0]])
    y = np.array([-1.0,1.0])
    print("A\n",A,"\ny\n",y)
    m, c = np.linalg.lstsq(A, y, rcond=None)[0]
    print("m,c",np.round(m,3),np.round(c,3))
    print("fit",m*x+c)
    # scaling factor applies to BOTH x and y
    emb4[:,0] = m*emb3[:,0] + c
    emb4[:,1] = m*emb3[:,1]
    emb4[:,1] -= np.average(emb4[:,1]) # 'y' centroid --> zero

if False: # new: support several "re-project" methods
    # This reproject should ONLY SCALE
    rescale = 1
    if rescale==0:
        emb4[good10,0] = -10.0
        emb4[bad10,0] = +10.0
    if rescale==1: # use x-values to determine space scalings
        print("good10:\n",  emb3[good10,])
        print("bad10:\n",  emb3[bad10,])
        x = emb_linear( emb3[good10,], emb3[bad10,], [-10,0], [10,0] )
        print("x", x)
        emb4 = emb_linear_apply( x, emb3)
    if rescale==2:
        # rotate/scale/translate WILL NOT RE-PIN the x-values as desired!
        x = emb_opa( emb3[good10,], emb3[bad10,], [-10,0], [10,0] )
        print("x", x)
        emb4 = emb_opa_apply( x, emb3)
print("re-shift good10:\n", emb4[good10,])
print("re-shift bad10:\n",  emb4[bad10,])

pinned umap good10:
 [[-10.        15.192844]
 [-10.        -9.085252]
 [-10.        -9.769344]]
pinned umap bad10:
 [[10.       15.066165]
 [10.       -7.254607]
 [10.       -8.48607 ]]
re-shift good10:
 [[-10.        15.192844]
 [-10.        -9.085252]
 [-10.        -9.769344]]
re-shift bad10:
 [[10.       15.066165]
 [10.       -7.254607]
 [10.       -8.48607 ]]


In [14]:
#output_file("iris4.html")

source.data["x4"] = emb4[:,0]
source.data["y4"] = emb4[:,1]
tooltips[0] = ("(x,y)",  "(@x4,@y4)")
# gray boxes around good-bad (fake) drag'n'drop "category"
gb = np.vstack([emb4[good10,], emb4[bad10,]])
gbsource.data["x4"] = gb[:,0]
gbsource.data["y4"] = gb[:,1]

p4 = figure(title="Iris UMAP post drag'n'drop",
            tooltips=tooltips)
circles = p4.circle( source=source, x="x4", y="y4",
    size=8, fill_alpha=0.5,
    color={"field": "label", "transform": cmap},
    legend_group="label"
)
hover = p4.select_one(HoverTool)
hover.renderers = [circles]
p4.square(source=gbsource, x="x4", y="y4",
          size=16, line_alpha=0.7, line_width=4, fill_alpha=0.0,
          color={"field": "label", "transform": cmap},
          legend_group="label"
)
p4.legend.location = 'center_center'


output_notebook()
show(p4)
output_file("iris4.html")
show(column(p1,p2,p4))

In [15]:
x = np.array([0, 1, 2, 3])
y = np.array([-1, 0.2, 0.9, 2.1])
A = np.vstack([x, np.ones(len(x))]).T
print("A\n",A,"\ny\n",y)
m, c = np.linalg.lstsq(A, y, rcond=None)[0]
print("m,c",np.round(m,3),np.round(c,3))
print("fit",m*x+c)

A
 [[0. 1.]
 [1. 1.]
 [2. 1.]
 [3. 1.]] 
y
 [-1.   0.2  0.9  2.1]
m,c 1.0 -0.95
fit [-0.95  0.05  1.05  2.05]


In [16]:
a=[1,2,3]; a+=[4,5]; print(a)
a=np.array([1,2,3]); a = np.hstack((a, [4,5])); print(a)

[1, 2, 3, 4, 5]
[1 2 3 4 5]


In [17]:
# test @jitclass for a "constraint" example
import numba
from numba.experimental import jitclass

dimLohiSpec = [
    ('lo',  numba.types.float32[:]),
    ('hi',  numba.types.float32[:]),
    #('size',numba.types.int32),
]

@jitclass(dimLohiSpec)
class DimLohi(object):
    def __init__(self, lo, hi):
        """ clip dim[i] to range lo[i]..hi[i], for i with lo[i] < hi[i];
            o/w clip to [-10.0,+10.0]
        """
        if lo.size != hi.size:  # compare the arguments; self.lo/self.hi are not assigned yet
            print("warning: DimLohi(lo[],hi[]) lo and hi vectors should have same size")
        #self.size = min(lo.size, hi.size)    
        sz = min(lo.size, hi.size)    
        self.lo = lo[0:sz]
        self.hi = hi[0:sz]
        for i in range(self.lo.size):
            #print("cmp",self.lo[i], self.hi[i])
            if self.lo[i] >= self.hi[i]:
                self.lo[i] = -10.0
                self.hi[i] = +10.0
        #print("DimLohi lo",self.lo)
        #print("DimLohi hi",self.hi)

    #@property
    #def size(self):
    #    return self.lo.size
    
    #@numba.jit(numba.types.float32[:](numba.typeof(dimLohiSpec), numba.types.float32[:]))
    # --> "class members not yet supported"
    def project_onto_constraint(self, vec):
        """ In-place bounding of vec[] dimension-wise """
        if len(vec.shape) == 1 and vec.shape[0] >= self.lo.size:
            for i in range(self.lo.size):
                #print("cmp",self.lo[i], vec[i], self.hi[i])
                if   vec[i] < self.lo[i]:
                     vec[i] = self.lo[i]
                elif vec[i] > self.hi[i]:
                     vec[i] = self.hi[i]
        else:
            print("Mismatch between vec shape",vec.shape,"with DimLohi size", self.lo.size)
        return vec

    def project_rows_onto_constraint(self, mat):
        """ In-place bounding of mat[i,dim] dimension-wise """
        if len(mat.shape) == 2 and mat.shape[1] >= self.lo.size:
            for i in range(mat.shape[0]):
                for j in range(self.lo.size):
                    #print("cmp",self.lo[i], vec[i], self.hi[i])
                    if   mat[i,j] < self.lo[j]:
                         mat[i,j] = self.lo[j]
                    elif mat[i,j] > self.hi[j]:
                         mat[i,j] = self.hi[j]
        else:
            print("Mismatch between mat shape",mat.shape,"with DimLohi size", self.lo.size)
        return mat

# Try it out with 8 samples, each with 5 dims
#        bound the first 3 dims, not the rest.
#bd = DimLohi([-0.25, -0.5, -0.75, -0.5, -0.5], [+0.25, +0.5, +0.75, -1,   -1])
# no... need exactly-matching types for constructor

lows = np.array([-0.25, -0.5, -0.75, -0.5, -0.5],dtype=np.float32)
higs = np.array([+0.25, +0.5, +0.75, -1,   -1],dtype=np.float32)
print("lows\n",lows)
print("higs\n",higs)
print(numba.typeof(lows))
bd = DimLohi(lows, higs)

a = np.floor((np.random.rand(8,5)*2-1) * 100) * 0.01
print("a[2]=",a[2],"shape",a[2].shape)
assert( a[2].shape == (5,))

b = a.copy()
for i in range(b.shape[0]):
    bd.project_onto_constraint(b[i,])
print("a\n",a)
print("b\n",b)

# perhaps slightly slower way to generate b without modifying a
for i in range(a.shape[0]):
    b[i,] = bd.project_onto_constraint(a[i,].copy())
print("a\n",a)
print("b\n",b)

b = a.copy()
bd.project_rows_onto_constraint(b)
print("bd.project_rows_onto_constraint(b)\n",b)


lows
 [-0.25 -0.5  -0.75 -0.5  -0.5 ]
higs
 [ 0.25  0.5   0.75 -1.   -1.  ]
array(float32, 1d, C)
a[2]= [ 0.15  0.37 -0.2   0.36 -0.3 ] shape (5,)
a
 [[-0.69  0.14 -0.71  0.55 -0.2 ]
 [-0.36  0.21 -0.92 -0.33  0.15]
 [ 0.15  0.37 -0.2   0.36 -0.3 ]
 [ 0.1  -0.02  0.27  0.95  0.9 ]
 [-0.69 -0.52  0.58  0.87  0.4 ]
 [ 0.6  -0.23  0.28  0.12 -0.73]
 [ 0.11 -0.42 -0.13 -0.29  0.91]
 [-0.16 -0.24  0.93 -0.38 -0.21]]
b
 [[-0.25  0.14 -0.71  0.55 -0.2 ]
 [-0.25  0.21 -0.75 -0.33  0.15]
 [ 0.15  0.37 -0.2   0.36 -0.3 ]
 [ 0.1  -0.02  0.27  0.95  0.9 ]
 [-0.25 -0.5   0.58  0.87  0.4 ]
 [ 0.25 -0.23  0.28  0.12 -0.73]
 [ 0.11 -0.42 -0.13 -0.29  0.91]
 [-0.16 -0.24  0.75 -0.38 -0.21]]
a
 [[-0.69  0.14 -0.71  0.55 -0.2 ]
 [-0.36  0.21 -0.92 -0.33  0.15]
 [ 0.15  0.37 -0.2   0.36 -0.3 ]
 [ 0.1  -0.02  0.27  0.95  0.9 ]
 [-0.69 -0.52  0.58  0.87  0.4 ]
 [ 0.6  -0.23  0.28  0.12 -0.73]
 [ 0.11 -0.42 -0.13 -0.29  0.91]
 [-0.16 -0.24  0.93 -0.38 -0.21]]
b
 [[-0.25  0.14 -0.71  0.55 -0.2 ]
 [-0.25  0.21

In [18]:
constrain_lo = np.float32(-10.0)
constrain_hi = np.float32(+10.0)

In [19]:
# here is the umap 'euclidean output' rescaling factor
lo = np.min(embedding,0)
spread = np.max(embedding,0) - np.min(embedding,0)
print("lo",lo)
print("spread",spread)
emb2x = 10.0 * (emb2 - lo) / (spread)
print("emb2\n",emb2[0:15,])
print("emb2x\n",emb2x[0:15,])

lo [-4.942664   2.3134077]
spread [14.362685 12.685847]
emb2
 [[ 1.70327164 13.05366534]
 [ 2.44504113 14.83667006]
 [ 3.06795554 14.333906  ]
 [ 3.07062957 14.87530078]
 [ 2.02346396 13.11046348]
 [ 2.53919739 12.12604876]
 [ 3.0739145  14.49821326]
 [ 2.13955864 13.64407204]
 [ 3.39237645 14.99263907]
 [ 2.76155219 14.87330866]
 [ 2.21040874 12.1901246 ]
 [ 2.46854313 13.96777673]
 [ 2.77893169 14.90454638]
 [ 3.3909417  14.91929549]
 [ 2.37805608 11.89344639]]
emb2x
 [[4.62722374 8.46633058]
 [5.14367973 9.87183759]
 [5.57738304 9.47551872]
 [5.57924483 9.90228941]
 [4.8501572  8.51110342]
 [5.2092359  7.73510896]
 [5.58153196 9.60503885]
 [4.93098797 8.93173639]
 [5.80326066 9.99478484]
 [5.36405013 9.90071906]
 [4.98031725 7.78561867]
 [5.16004297 9.18690633]
 [5.37615058 9.92534313]
 [5.80226171 9.93696956]
 [5.09704148 7.55175316]]
