
<br>

---

<div align="center"> 
  <font size="3"><b>Machine Learning</b> </font>
</div>
<br>
<div align="center"> 
  <font size="6">
      <b>Winner Take All Hash<br></b>
    </font>
    <br>
    <font size="4">
        A brief study of the Winner Take All (WTA) hash algorithm, introduced by Jay Yagnik et al. 
    </font>
    <br>
    <br>
    <font size="3">
        Original paper: 
        <a href="http://www.cs.toronto.edu/~dross/YagnikStrelowRossLin_ICCV2011.pdf">
            The Power of Comparative Reasoning <br>Jay Yagnik, Dennis Strelow, David A. Ross, Ruei-sung Lin - 2011
        </a>
    </font>
</div>

---

<div align="center"> 
    <font size="4">
         <b>Konstantinos Nikoletos, BS Student</b>
     </font>
</div>
<div align="center"> 
    <font size="2">Athens 2021</font>
</div>


---

# WTA: Basic idea

## Rank Correlation Spaces

This algorithm exploits the rank ordering of the given embeddings. The embeddings must be rank ordered (partially or fully). What do we mean by a rank-ordered embedding or vector? Let's take an example. Assume $Y$ is our initial vector.
$$
Y = [4, -1.5, 2.1, 5.5, 0]
$$
Then the rank-ordered vector of $Y$, which replaces each component with its 0-based rank (0 for the smallest value), is:
$$
Y_{rankedOrdered} = [3, 0, 2, 4, 1]
$$
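
A minimal sketch of how such a rank vector could be computed with NumPy, assuming the vector has no tied values. The helper name `rank_order` is hypothetical and is not part of the original notebook.

```python
import numpy as np

def rank_order(vector):
    """Return the 0-based rank of each component (0 = smallest value)."""
    # The first argsort lists the indices that would sort the vector;
    # the second argsort inverts that permutation, which yields the ranks.
    return np.argsort(np.argsort(vector))

Y = np.array([4, -1.5, 2.1, 5.5, 0])
print(rank_order(Y).tolist())  # [3, 0, 2, 4, 1]
```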

And what's a __rank correlation__?

Rank correlation is any of several statistics that measure an ordinal association: the relationship between rankings of different ordinal variables or different rankings of the same variable, where a "ranking" is the assignment of the ordering labels 1st, 2nd, 3rd, etc. to different observations of a particular variable. A __rank correlation coefficient__ measures the degree of similarity between two rankings, and can be used to assess the significance of the relation between them. Rank correlation measures are known for their stability under perturbations of the numeric values, while still giving a good indication of the inherent similarity or agreement between the items or vectors being considered.
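
The stability claim can be illustrated with a small sketch. It assumes the example vector $Y$ from above and a hypothetical helper `rank_order`; the scale, offset and noise values are arbitrary choices made only for illustration.

```python
import numpy as np

def rank_order(vector):
    """0-based rank of each component (assumes no ties)."""
    return np.argsort(np.argsort(vector))

Y = np.array([4, -1.5, 2.1, 5.5, 0])

scaled = 3.0 * Y + 7.0                                # positive scale and offset
squashed = np.tanh(Y)                                 # any strictly increasing map
noisy = Y + np.array([0.2, -0.1, 0.3, -0.2, 0.1])     # small perturbation

base = rank_order(Y)
same_after_scaling = np.array_equal(base, rank_order(scaled))
same_after_squashing = np.array_equal(base, rank_order(squashed))
same_after_noise = np.array_equal(base, rank_order(noisy))

print(same_after_scaling, same_after_squashing, same_after_noise)  # True True True
```

The ranking survives the noise here only because the perturbation is smaller than the gaps between the values; a larger perturbation can swap neighbouring ranks.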


__Partial order rankings__

There's no need to have embeddings or vectors with full rankings. For example, $Y_{rankedOrdered}$ is a full ranking, as it contains every index from 0 to 4. A partial ranking is missing some of those indexes.

## Goal

The goal of the WTA algorithm is a feature space transformation whose result is sensitive not to the absolute values of the feature dimensions but to the implicit ordering defined by those values. In effect, the similarity between two points is defined by the degree to which their feature dimension rankings agree.

## Algorithm's main functionality

Takes as input a set of vectors (embeddings) and, for each vector:
- Permutes its components with a random permutation (the same permutation is used for every vector)
- Takes the first K components of the permuted vector
- Finds and outputs the index of the maximum value among those components
- Repeats for every vector in the set

The index of the maximum is the hash code. 
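
The steps above can be sketched for a single vector as follows. This is only an illustrative sketch: the function name `wta_code` is hypothetical, and the vector, permutation and window size are borrowed from the paper's example used later in this notebook.

```python
import numpy as np

def wta_code(vector, theta, K):
    """WTA hash code of one vector for permutation theta and window size K."""
    permuted = np.asarray(vector)[theta]   # step 1: apply the permutation
    window = permuted[:K]                  # step 2: keep the first K components
    return int(np.argmax(window))          # step 3: index of the maximum

theta = np.array([1, 4, 2, 5, 0, 3])
x = [10, 5, 2, 6, 12, 3]

print(wta_code(x, theta, K=4))  # 1
```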


### A simple pairwise-order measure

The main point of the algorithm is a feature space transformation that relies on the ordering of the vector components rather than on their values. In effect, the similarity between two points is defined by the degree to which their feature dimension rankings agree. As an example, Yagnik et al. propose a measure like the one below:

__Equation 1__:
$$
    PO(X,Y) = \sum_{i} \sum_{j<i}  T((x_i - x_j ) (y_i - y_j)) 
$$

where:
- $X, Y$ are the two vectors being compared
- $x_i$ and $y_i$ are the i-th feature dimensions of $X$ and $Y$
- $T$ is simply a threshold function: $T(x) = 1$ if $x > 0$ and $T(x) = 0$ otherwise

Equation 1 simply measures the number of pairs of feature dimensions in X and Y that agree in ordering.



The PO function above and its threshold function, in Python:

In [1]:
def WTA_similarity(vector1,vector2):
    
    PO=0
    for i in range(0,len(vector1),1):
        for j in range(0,i,1):
            ij_1 = vector1[i] - vector1[j]
            ij_2 = vector2[i] - vector2[j]
            PO += WTA_Threshold(ij_1*ij_2)
            
    return PO

def WTA_Threshold(x):    
    if x>0:
        return 1
    else:
        return 0
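
To get a feel for the range of the measure, here is a small self-contained sketch. The name `pairwise_order` is a hypothetical, compact restatement of Equation 1 and is not the notebook's own function. For vectors of length $n$ without ties, the score ranges from 0 (opposite ordering) to $n(n-1)/2$ (identical ordering).

```python
from itertools import combinations

def pairwise_order(x, y):
    """Count the index pairs on which x and y agree in ordering (Equation 1)."""
    return sum(
        1
        for i, j in combinations(range(len(x)), 2)
        if (x[i] - x[j]) * (y[i] - y[j]) > 0
    )

x = [4, -1.5, 2.1, 5.5, 0]
reversed_order = [-v for v in x]        # every pair is ordered the other way
rescaled = [2 * v + 1 for v in x]       # same ordering, different values

print(pairwise_order(x, x))               # 10, i.e. n(n-1)/2 for n = 5
print(pairwise_order(x, reversed_order))  # 0
print(pairwise_order(x, rescaled))        # 10
```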

### Comparison with MinHash

WTA is a generalization of the well-known MinHash and should enjoy the theoretical benefits of LSH schemes. MinHash is the special case of WTA applied to binary features.
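
A hedged sketch of this connection, under two assumptions that are not stated in the notebook: the window covers the whole vector (K equals the dimension), and ties are broken by taking the first maximum, which is what `np.argmax` does. Under those assumptions the WTA code of a binary vector is the position of its first 1 in the permuted order, so two vectors should collide with a probability close to their Jaccard similarity, as with MinHash. The vectors, seed and number of trials are arbitrary illustration values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two binary feature vectors, viewed as the sets {0, 1, 3, 6} and {0, 3, 4, 6}.
x = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y = np.array([1, 0, 0, 1, 1, 0, 1, 0])

# Jaccard similarity: |intersection| / |union| = 3 / 5.
jaccard = np.sum(x & y) / np.sum(x | y)

trials = 5000
collisions = 0
for _ in range(trials):
    theta = rng.permutation(len(x))
    # With K = len(x), argmax returns the position of the first 1
    # in the permuted vector, which plays the role of the MinHash value.
    collisions += int(np.argmax(x[theta]) == np.argmax(y[theta]))

collision_rate = collisions / trials
print(jaccard, collision_rate)  # the two numbers should be close
```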


## __Implementation code in Python__

__Import of libraries__

In [1]:
import numpy as np

__Main code__

In [22]:
'''
@author: Konstantinos Nikoletos, 2021
'''
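def WTA(vectors, K, theta = None):
    '''
      Winner Take All hash - Yagnik
      .............................
      
      vectors: initial vectors (2-D NumPy array, one vector per row)
      K: window size
      theta: permutation of the vector indices; drawn at random if None
      
      Returns the hash codes, the buckets and the permuted vectors
    '''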
    
    new_vectors = []
    buckets = dict()
    num_of_vectors = vectors.shape[0]
    vector_dim    = vectors.shape[1]

    if vector_dim < K:
        K = vector_dim
        print("Window size greater than vector dimension")

    C = np.zeros(num_of_vectors, dtype=int)
    
    if theta is None:                                  # "== None" is ambiguous when theta is a NumPy array
        theta = np.random.permutation(vector_dim)      # randomization is without replacement and has to be consistent 
                                                      # across all samples and hence the notion of permutations

    for v_index in range(num_of_vectors):
        X_new = permuted(vectors[v_index], theta)
        new_vectors.append(X_new)
        # Hash code: index of the largest of the first K permuted components
        C[v_index] = max(range(len(X_new[:K])), key=X_new[:K].__getitem__)
        bucket_insert(buckets, str(C[v_index]), v_index)
        
    return C, buckets, np.array(new_vectors, dtype=np.intp)


def permuted(vector,permutation):
    permuted_vector = [vector[x] for x in permutation]

    return permuted_vector


def bucket_insert(buckets, hash_code, value):
    # Create the bucket on first use, then add the vector index to it
    if hash_code not in buckets:
        buckets[hash_code] = set()
    buckets[hash_code].add(value)
    
    return buckets
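
As a possible alternative to the loop above, the hash codes alone can be computed with NumPy indexing. This is a sketch under the assumption that only the codes are needed (no buckets, no permuted vectors); the name `wta_codes` is hypothetical. The sample rows and permutation are taken from the example in the next section.

```python
import numpy as np

def wta_codes(vectors, theta, K):
    """Vectorized WTA codes for a 2-D array with one vector per row."""
    vectors = np.asarray(vectors)
    theta = np.asarray(theta)
    # Selecting the columns theta[:K] is the same as permuting every row
    # with theta and then keeping its first K components.
    window = vectors[:, theta[:K]]
    # np.argmax returns the first maximum, like max() in the loop version.
    return np.argmax(window, axis=1)

sample = np.array([[10, 5, 2, 6, 12, 3],
                   [4, 5, 10, 3, 2, 1]])
sample_theta = [1, 4, 2, 5, 0, 3]

print(wta_codes(sample, sample_theta, K=4).tolist())  # [1, 2]
```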

## Example and Remarks

### Example (from original paper)

An example with 6-dimensional input vectors, 
- K = 4, and 
- θ = (1, 4, 2, 5, 0, 3). 

X in (a) and (b) are unrelated and result in different output codes, 1 and 2 respectively.
X in (c) is a scaled and offset version of (a) and results in
the same code as (a). X in (d) has each element perturbed
by 1 which results in a different ranking of the elements,
but the maximum of the first K elements is the same, again
resulting in the same code.
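
The relationships described above can be checked directly on the numbers. The sketch below uses the vectors from the code cell of this example; the scale 2 and offset 2 are read off from those values and are not quoted from the paper's text.

```python
import numpy as np

a = np.array([10, 5, 2, 6, 12, 3])
c = np.array([22, 12, 6, 14, 26, 8])
d = np.array([11, 4, 3, 7, 13, 2])

# (c) is a scaled and offset copy of (a): c = 2 * a + 2
c_is_scaled_a = np.array_equal(c, 2 * a + 2)

# (d) differs from (a) by +1 or -1 in every component
perturbation = (d - a).tolist()

print(c_is_scaled_a)   # True
print(perturbation)    # [1, -1, 1, 1, 1, -1]
```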

#### Visualization
<img align="center" src="./wta-1.png" width="600">

#### Code

In [20]:
a = [10,5,2,6,12,3]
b = [4,5,10,3,2,1]
c = [22,12,6,14,26,8]
d = [11,4,3,7,13,2]
theta = [1, 4, 2, 5,0, 3]
vectors = np.array([a,b,c,d])
vectors

array([[10,  5,  2,  6, 12,  3],
       [ 4,  5, 10,  3,  2,  1],
       [22, 12,  6, 14, 26,  8],
       [11,  4,  3,  7, 13,  2]])

In [23]:
K = 4

C, buckets, new_vectors = WTA(vectors, K, theta)

In [24]:
C

array([1, 2, 1, 1])

In [25]:
buckets

{'1': {0, 2, 3}, '2': {1}}

In [26]:
new_vectors

array([[ 5, 12,  2,  3, 10,  6],
       [ 5,  2, 10,  1,  4,  3],
       [12, 26,  6,  8, 22, 14],
       [ 4, 13,  3,  2, 11,  7]], dtype=int64)

### Pairwise agreement of a,b,c,d

In [9]:
print("Similarity on X: ")
print("WTA_similarity(a,b) = ",WTA_similarity(a,b))
print("WTA_similarity(a,c) = ",WTA_similarity(a,c))
print("WTA_similarity(a,d) = ",WTA_similarity(a,d))
print("\n\n")
print("Similarity on X': ")
print("WTA_similarity(a,b) = ",WTA_similarity(new_vectors[0],new_vectors[1]))
print("WTA_similarity(a,c) = ",WTA_similarity(new_vectors[0],new_vectors[2]))
print("WTA_similarity(a,d) = ",WTA_similarity(new_vectors[0],new_vectors[3]))
print("\n\n")
print("Similarity on X'[:K]: ")
print("WTA_similarity(a,b) = ",WTA_similarity(new_vectors[0][:K],new_vectors[1][:K]))
print("WTA_similarity(a,c) = ",WTA_similarity(new_vectors[0][:K],new_vectors[2][:K]))
print("WTA_similarity(a,d) = ",WTA_similarity(new_vectors[0][:K],new_vectors[3][:K]))

Similarity on X: 
WTA_similarity(a,b) =  5
WTA_similarity(a,c) =  15
WTA_similarity(a,d) =  14



Similarity on X': 
WTA_similarity(a,b) =  5
WTA_similarity(a,c) =  15
WTA_similarity(a,d) =  14



Similarity on X'[:K]: 
WTA_similarity(a,b) =  2
WTA_similarity(a,c) =  6
WTA_similarity(a,d) =  5


### Kendall Tau similarity

In [10]:
from scipy.stats import kendalltau

Similarity based on Kendall Tau for the initial vectors ($X$)

In [11]:
similarity_prob, p_value = kendalltau(a,b)
print("kendalltau(a,b) = ", similarity_prob)
similarity_prob, p_value = kendalltau(a,c)
print("kendalltau(a,c) = ", similarity_prob)
similarity_prob, p_value = kendalltau(a,d)
print("kendalltau(a,d) = ", similarity_prob)

kendalltau(a,b) =  -0.3333333333333333
kendalltau(a,c) =  0.9999999999999999
kendalltau(a,d) =  0.8666666666666666


Similarity based on Kendall Tau for the permuted vectors ($X'$)

In [12]:
similarity_prob, p_value = kendalltau(new_vectors[0],new_vectors[1])
print("a,b similarity: ", similarity_prob)
similarity_prob, p_value = kendalltau(new_vectors[0],new_vectors[2])
print("a,c similarity: ", similarity_prob)
similarity_prob, p_value = kendalltau(new_vectors[0],new_vectors[3])
print("a,d similarity: ", similarity_prob)

a,b similarity:  -0.3333333333333333
a,c similarity:  0.9999999999999999
a,d similarity:  0.8666666666666666


In [13]:
similarity_prob, p_value = kendalltau(new_vectors[0][:K],new_vectors[1][:K])
print("a,b similarity: ", similarity_prob)
similarity_prob, p_value = kendalltau(new_vectors[0][:K],new_vectors[2][:K])
print("a,c similarity: ", similarity_prob)
similarity_prob, p_value = kendalltau(new_vectors[0][:K],new_vectors[3][:K])
print("a,d similarity: ", similarity_prob)

a,b similarity:  -0.3333333333333334
a,c similarity:  1.0
a,d similarity:  0.6666666666666669
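
The Kendall Tau values above are closely tied to the pairwise-order counts of the previous section. When there are no ties, every pair is either concordant (counted by $PO$) or discordant, so $\tau = (2 \cdot PO - N) / N$ with $N = n(n-1)/2$. The sketch below checks this on the example vectors; `pairwise_order` and `tau_from_po` are hypothetical helper names, and the formula is only valid under the no-ties assumption.

```python
from itertools import combinations

def pairwise_order(x, y):
    """Number of index pairs on which x and y agree in ordering."""
    return sum(
        1
        for i, j in combinations(range(len(x)), 2)
        if (x[i] - x[j]) * (y[i] - y[j]) > 0
    )

def tau_from_po(x, y):
    """Kendall Tau derived from the pairwise-order count (assumes no ties)."""
    n = len(x)
    total_pairs = n * (n - 1) // 2
    concordant = pairwise_order(x, y)
    discordant = total_pairs - concordant   # only true when there are no ties
    return (concordant - discordant) / total_pairs

a = [10, 5, 2, 6, 12, 3]
b = [4, 5, 10, 3, 2, 1]
c = [22, 12, 6, 14, 26, 8]
d = [11, 4, 3, 7, 13, 2]

print(tau_from_po(a, b))  # about -0.333, from PO = 5 of 15 pairs
print(tau_from_po(a, c))  # 1.0, from PO = 15 of 15 pairs
print(tau_from_po(a, d))  # about 0.867, from PO = 14 of 15 pairs
```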


# Overall Remarks


## Advantages

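1. Very fast hashing method: each hash code needs only a permutation and a maximum over K values. This makes it a candidate, for example, for use in a classification problem.
2. No training needed: the permutation is random and does not depend on the data.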

## Disadvantages

1. Only applicable to rank-ordered vectors, that is, vectors where the relative ordering of the components is meaningful


# References

[1]   Jay Yagnik, Dennis Strelow, David A. Ross, Ruei-sung Lin. [The Power of Comparative Reasoning](http://www.cs.toronto.edu/~dross/YagnikStrelowRossLin_ICCV2011.pdf). 2011.