# Chapter 1: PQk-means
This chapter covers the following:

1. Vector compression by Product Quantization
1. Clustering by PQk-means
1. Comparison to other clustering methods

Prerequisites:
- numpy
- sklearn
- pqkmeans

## 1. Vector compression by Product Quantization

In [1]:
import numpy
import pqkmeans
import sys
import pickle

First, we introduce vector compression by Product Quantization (PQ) [Jegou, TPAMI 11]. The first task is to train an encoder. Let us assume that there are 1000 six-dimensional vectors for training: $X_1 \in \mathbb{R}^{1000\times6}$.



In [2]:
X1 = numpy.random.random((1000, 6))
print("X1.shape =  \n", X1.shape, "\n")
print("X1 = \n", X1)

X1.shape =  
 (1000, 6) 

X1 = 
 [[ 0.70275556  0.51219496  0.76284138  0.13980293  0.86790663  0.59893453]
 [ 0.37008827  0.39502803  0.73402674  0.43158098  0.79462144  0.96854116]
 [ 0.24389028  0.78572064  0.15534625  0.19633922  0.90830083  0.42978262]
 ..., 
 [ 0.78156807  0.04260586  0.50256526  0.82012482  0.73211878  0.76370338]
 [ 0.76902753  0.2173942   0.74703122  0.47439234  0.74531087  0.30878455]
 [ 0.21489385  0.13756352  0.59329169  0.92189439  0.80937742  0.06592726]]


Then we can train a PQEncoder using $X_1$.

In [3]:
encoder = pqkmeans.encoder.PQEncoder(num_subdim=2, Ks=256)
encoder.fit(X1)

The encoder takes two parameters: $num\_subdim$ and $Ks$. In the training step, each vector is split into $num\_subdim$ sub-vectors, and each sub-vector is quantized with $Ks$ codewords. $num\_subdim$ determines the bit length of the PQ-code and is typically set to 4, 8, etc. $Ks$ is usually set to 256, so that each codeword id fits in 8 bits.

In this example, each 6D training vector is split into $num\_subdim(=2)$ sub-vectors (two 3D vectors). Consequently, the 1000 6D training vectors are split into two sets of 1000 3D vectors. k-means clustering is then applied to each set of sub-vectors with $K=256$.
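
The training step described above can be sketched with numpy alone. This is a minimal illustration, not the library's implementation: `simple_kmeans` and `train_pq_codewords` are hypothetical helper names, and the k-means here is a plain Lloyd's iteration with random initialization.

```python
import numpy


def simple_kmeans(data, k, num_iter=10, seed=0):
    """Plain Lloyd's k-means; returns a (k, dim) array of centers."""
    rng = numpy.random.default_rng(seed)
    # Initialize the centers with k distinct training points
    centers = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(num_iter):
        # Squared distances between every point and every center: (N, k)
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:  # Keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers


def train_pq_codewords(X, num_subdim, Ks):
    """Split X column-wise into num_subdim blocks and run k-means on each block."""
    sub_vectors = numpy.split(X, num_subdim, axis=1)  # num_subdim arrays of shape (N, D / num_subdim)
    return numpy.stack([simple_kmeans(sub, Ks) for sub in sub_vectors])


X_train = numpy.random.default_rng(0).random((1000, 6))
codewords = train_pq_codewords(X_train, num_subdim=2, Ks=256)
print(codewords.shape)  # (2, 256, 3)
```

The result has one block of 256 three-dimensional codewords per subspace, which is the layout the encoder is described as storing.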


After the training step, the encoder stores the resulting codewords (2 subspaces $*$ 256 codewords $*$ 3 dimensions):

In [4]:
print(encoder.codewords.shape)

(2, 256, 3)


Note that you can train the encoder in advance using training data, and write/read the encoder via pickle.


In [5]:
# with open('encoder.pkl', 'wb') as f: pickle.dump(encoder, f)  # Write
# with open('encoder.pkl', 'rb') as f: encoder = pickle.load(f)  # Read

Next, let us consider database vectors (2000 six-dimensional vectors, $X_2$) that we'd like to compress. 

In [6]:
X2 = numpy.random.random((2000, 6))
print("X2.shape:\n", X2.shape, "\n")
print("X2:\n", X2, "\n")
print("Data type of each element:\n", type(X2[0][0]), "\n")
print("Memory usage:\n", X2.nbytes, "byte")

X2.shape:
 (2000, 6) 

X2:
 [[ 0.0156      0.01311389  0.35813774  0.3859867   0.21378337  0.59912353]
 [ 0.71678096  0.35654127  0.85522785  0.4384428   0.51480571  0.60912294]
 [ 0.04447281  0.84149141  0.15362455  0.26419114  0.15876926  0.22082696]
 ..., 
 [ 0.25256687  0.58320653  0.15698569  0.08444892  0.02452599  0.64746556]
 [ 0.55939827  0.17863138  0.3245326   0.36510014  0.07182216  0.70163746]
 [ 0.93133231  0.18555859  0.68414746  0.58857661  0.30737436  0.62998395]] 

Data type of each element:
 <class 'numpy.float64'> 

Memory usage:
 96000 byte


We can compress these vectors with the trained PQ encoder.

In [7]:
X2_pqcode = encoder.transform(X2)
print("X2_pqcode.shape:\n", X2_pqcode.shape, "\n")
print("X2_pqcode\n", X2_pqcode, "\n")
print("Data type of each element:\n", type(X2_pqcode[0][0]), "\n")
print("Memory usage:\n", X2_pqcode.nbytes, "byte")

X2_pqcode.shape:
 (2000, 2) 

X2_pqcode
 [[128 109]
 [215  15]
 [174  22]
 ..., 
 [244 206]
 [125 189]
 [ 32 105]] 

Data type of each element:
 <class 'numpy.uint8'> 

Memory usage:
 4000 byte


Each vector is split into $num\_subdim(=2)$ sub-vectors, and the nearest codeword is searched for each sub-vector. The id of the nearest codeword is recorded for each sub-vector, i.e., two integers in this case. This representation is called a PQ-code.
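
The nearest-codeword search can be sketched as follows. This is an illustrative sketch under stated assumptions, not the library's code: `pq_encode` is a hypothetical helper, and the toy codebook is made up so that the result can be checked by hand.

```python
import numpy


def pq_encode(X, codewords):
    """Return the id of the nearest codeword for each sub-vector of each row of X."""
    num_subdim, Ks, sub_dim = codewords.shape
    # uint8 is enough to hold the ids as long as Ks <= 256
    codes = numpy.empty((X.shape[0], num_subdim), dtype=numpy.uint8)
    for m in range(num_subdim):
        sub = X[:, m * sub_dim:(m + 1) * sub_dim]  # (N, sub_dim)
        # Squared distances from each sub-vector to each codeword: (N, Ks)
        dists = ((sub[:, None, :] - codewords[m][None, :, :]) ** 2).sum(axis=2)
        codes[:, m] = dists.argmin(axis=1)
    return codes


# Toy codebook: 2 subspaces, 4 codewords each, 3 dimensions per sub-vector
toy_codewords = numpy.array([
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    [[0.5, 0.5, 0.5], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [1.0, 0.0, 1.0]],
])
x = numpy.array([[0.9, 0.1, 0.0, 0.9, 0.2, 0.8]])
print(pq_encode(x, toy_codewords))  # [[1 3]]
```

The first half of `x` is closest to codeword 1 of the first subspace and the second half is closest to codeword 3 of the second subspace, so the PQ-code is the pair (1, 3).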
 
Note that a PQ-code is a memory-efficient data representation. The original 6D vector requires $6 * 64 = 384$ bits if a 64-bit float is used for each element. On the other hand, a PQ-code requires only $2 * \log_2 256 = 16$ bits.
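
The same arithmetic can be written out as a short calculation. The variable names below are only for illustration.

```python
import math

dim = 6          # dimensionality of the original vector
float_bits = 64  # bits per element for a 64-bit float
num_subdim = 2   # number of sub-vectors per PQ-code
Ks = 256         # number of codewords per subspace

original_bits = dim * float_bits              # 6 * 64 = 384
pq_bits = num_subdim * int(math.log2(Ks))     # 2 * 8 = 16
compression_ratio = original_bits // pq_bits  # 384 / 16 = 24

print(original_bits, pq_bits, compression_ratio)  # 384 16 24
```

The factor of 24 agrees with the memory usage printed above: 96000 bytes for X2 versus 4000 bytes for X2_pqcode.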

We can approximately reconstruct the original vector from a PQ-code by fetching the codewords that the PQ-code refers to:

In [8]:
X2_reconstructed = encoder.inverse_transform(X2_pqcode)
print("original X2:\n", X2, "\n")
print("reconstructed X2:\n", X2_reconstructed)

original X2:
 [[ 0.0156      0.01311389  0.35813774  0.3859867   0.21378337  0.59912353]
 [ 0.71678096  0.35654127  0.85522785  0.4384428   0.51480571  0.60912294]
 [ 0.04447281  0.84149141  0.15362455  0.26419114  0.15876926  0.22082696]
 ..., 
 [ 0.25256687  0.58320653  0.15698569  0.08444892  0.02452599  0.64746556]
 [ 0.55939827  0.17863138  0.3245326   0.36510014  0.07182216  0.70163746]
 [ 0.93133231  0.18555859  0.68414746  0.58857661  0.30737436  0.62998395]] 

reconstructed X2:
 [[ 0.01975923  0.06736936  0.46318343  0.39555917  0.28400721  0.54631893]
 [ 0.66183932  0.33606359  0.82841298  0.45640054  0.54916867  0.53466686]
 [ 0.10339425  0.79627874  0.17225137  0.28756499  0.10818275  0.16972432]
 ..., 
 [ 0.26826145  0.63171795  0.07940289  0.20309724  0.04658785  0.69167184]
 [ 0.50138988  0.06807226  0.26473145  0.3341703   0.0377606   0.65188995]
 [ 0.88204834  0.19634262  0.63536372  0.54954645  0.30228684  0.52894986]]


As can be seen, the reconstructed vectors are similar to the original ones.
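
To make the reconstruction step and the notion of "similar" concrete, here is a small sketch. `pq_decode` is a hypothetical helper that concatenates the codewords selected by each PQ-code, and the toy codebook and vectors are invented for illustration; the library's `inverse_transform` may be implemented differently.

```python
import numpy


def pq_decode(codes, codewords):
    """Concatenate the codewords selected by each PQ-code: (N, M) -> (N, M * sub_dim)."""
    num_subdim = codewords.shape[0]
    return numpy.hstack([codewords[m][codes[:, m]] for m in range(num_subdim)])


# Toy codebook: 2 subspaces, 2 codewords each, 3 dimensions per sub-vector
toy_codewords = numpy.array([
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    [[0.5, 0.5, 0.5], [1.0, 0.0, 1.0]],
])
codes = numpy.array([[1, 1], [0, 0]], dtype=numpy.uint8)
reconstructed = pq_decode(codes, toy_codewords)
print(reconstructed)

original = numpy.array([[0.9, 0.1, 0.0, 0.9, 0.2, 0.8],
                        [0.1, 0.0, 0.1, 0.4, 0.5, 0.6]])
# Mean squared error per element between the original and reconstructed vectors
mse = numpy.mean((original - reconstructed) ** 2)
print(mse)
```

A smaller error means the codewords approximate the data more closely; the error generally decreases as $Ks$ or $num\_subdim$ grows, at the cost of longer codes or larger codebooks.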

In a large-scale data processing scenario where not all data fits in memory, you can compress the input vectors to PQ-codes and store only the PQ-codes (X2_pqcode).

In [9]:
# with open('pqcode.pkl', 'wb') as f: pickle.dump(X2_pqcode, f)  # You can store the PQ-codes only

## 2. Clustering by PQk-means

Let us run the clustering over the PQ-codes. The clustering object is instantiated with the trained encoder. Here, we set the number of clusters to $k=10$.

In [25]:
kmeans = pqkmeans.clustering.PQKMeans(encoder=encoder, k=10)

Let's run PQk-means over X2_pqcode.

In [26]:
clustered = kmeans.fit_predict(X2_pqcode)
print(clustered[:100]) # Show only the first 100 results

[6, 2, 0, 0, 8, 4, 2, 2, 6, 4, 0, 6, 1, 0, 1, 3, 5, 7, 3, 5, 7, 8, 1, 2, 2, 0, 3, 7, 5, 4, 4, 3, 8, 5, 1, 4, 1, 7, 2, 6, 7, 4, 7, 4, 9, 4, 6, 1, 3, 6, 0, 5, 9, 5, 8, 6, 4, 4, 0, 3, 6, 4, 3, 3, 0, 0, 4, 8, 3, 3, 2, 8, 0, 9, 7, 4, 6, 0, 7, 0, 9, 0, 6, 8, 7, 2, 9, 3, 2, 3, 5, 5, 1, 3, 4, 2, 7, 2, 2, 6]


The resulting list (clustered) contains, for each input PQ-code, the id of the cluster center it is assigned to (labeled "assigned codeword" in the printout below).

In [27]:
print("The id of assigned codeword for the 1st PQ-code is ", clustered[0])
print("The id of assigned codeword for the 2nd PQ-code is ", clustered[1])
print("The id of assigned codeword for the 3rd PQ-code is ", clustered[2])

The id of assigned codeword for the 1st PQ-code is  6
The id of assigned codeword for the 2nd PQ-code is  2
The id of assigned codeword for the 3rd PQ-code is  0


You can fetch the cluster centers with:

In [28]:
print("clustering centers:\n", kmeans.cluster_centers_)

clustering centers:
 [[170, 92], [24, 232], [152, 146], [126, 6], [150, 128], [171, 96], [163, 106], [237, 254], [120, 134], [193, 112]]


The centers are also PQ-codes. They can be reconstructed with the PQ encoder.

In [29]:
clustering_centers_numpy = numpy.array(kmeans.cluster_centers_, dtype=encoder.code_dtype)  # Convert to np.array with the proper dtype
clustering_centers_reconstructd = encoder.inverse_transform(clustering_centers_numpy) # From PQ-code to 6D vectors
print("reconstructed clustering centers:\n", clustering_centers_reconstructd)

reconstructed clustering centers:
 [[ 0.53360177  0.77639461  0.63012951  0.26380888  0.31202664  0.29219604]
 [ 0.74692713  0.56350071  0.73656737  0.49674636  0.78332985  0.41896516]
 [ 0.75852336  0.14173008  0.73176168  0.68377146  0.43976632  0.6449249 ]
 [ 0.39630225  0.61564376  0.44836144  0.71438811  0.67933583  0.18204068]
 [ 0.46460275  0.63710803  0.58704328  0.65678022  0.48001491  0.88080196]
 [ 0.59859426  0.28144411  0.53685033  0.22135294  0.19829786  0.67700851]
 [ 0.24306931  0.45709692  0.32156575  0.3753507   0.4738097   0.65171996]
 [ 0.66934827  0.3932376   0.34434023  0.23362457  0.80432322  0.79646046]
 [ 0.79809461  0.38572955  0.2642597   0.64587247  0.24504116  0.22047964]
 [ 0.47851439  0.19963551  0.54408934  0.29084716  0.66420591  0.19116989]]
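
Because both the inputs and the centers are PQ-codes, the distance between them can be approximated without reconstructing any vector, by looking up precomputed distances between codewords. The sketch below illustrates this idea with a toy codebook. `build_distance_tables` and `symmetric_distance` are hypothetical helpers, and this is not taken from the library's source.

```python
import numpy


def build_distance_tables(codewords):
    """For each subspace, squared distances between all pairs of codewords: (M, Ks, Ks)."""
    diffs = codewords[:, :, None, :] - codewords[:, None, :, :]  # (M, Ks, Ks, sub_dim)
    return (diffs ** 2).sum(axis=3)


def symmetric_distance(code_a, code_b, tables):
    """Approximate squared distance between two PQ-codes by summing table entries."""
    return sum(tables[m, code_a[m], code_b[m]] for m in range(len(code_a)))


# Toy codebook: 2 subspaces, 2 codewords each, 3 dimensions per sub-vector
toy_codewords = numpy.array([
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    [[0.5, 0.5, 0.5], [1.0, 0.0, 1.0]],
])
tables = build_distance_tables(toy_codewords)
d = symmetric_distance([1, 1], [0, 0], tables)
print(d)  # 1.75
```

The value equals the squared Euclidean distance between the two reconstructed vectors, but it is obtained with one table lookup per subspace instead of arithmetic over all dimensions.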


Let's summarize the results:

In [30]:
print("13th input vector:\n", X2[12], "\n")
print("13th PQ code:\n", X2_pqcode[12], "\n")
print("reconstructed 13th PQ code:\n", X2_reconstructed[12], "\n")
print("ID of the assigned center:\n", clustered[12], "\n")
print("Assigned center (PQ-code):\n", kmeans.cluster_centers_[clustered[12]], "\n")
print("Assigned center (reconstructed):\n", clustering_centers_reconstructd[clustered[12]])

13th input vector:
 [ 0.99527229  0.42485577  0.69820284  0.62567077  0.84666029  0.70164452] 

13th PQ code:
 [ 63 157] 

reconstructed 13th PQ code:
 [ 0.97862943  0.32630291  0.73809074  0.62566541  0.90514829  0.62460715] 

ID of the assigned center:
 1 

Assigned center (PQ-code):
 [24, 232] 

Assigned center (reconstructed):
 [ 0.74692713  0.56350071  0.73656737  0.49674636  0.78332985  0.41896516]


## 3. Comparison to other clustering methods

Let us compare PQk-means with traditional k-means using high-dimensional data.

In [16]:
from sklearn.cluster import KMeans

In [17]:
X3 = numpy.random.random((1000, 1024))  # 1K 1024-dim vectors, for training 
X4 = numpy.random.random((10000, 1024)) # 10K 1024-dim vectors, for database
K = 100

In [18]:
# Train the encoder
encoder_large = pqkmeans.encoder.PQEncoder(num_subdim=4, Ks=256)
encoder_large.fit(X3)

# Encode the vectors to PQ-code
X4_pqcode = encoder_large.transform(X4)

Let's run PQk-means and measure the computational cost.

In [22]:
%time clustered_pqkmeans = pqkmeans.clustering.PQKMeans(encoder=encoder_large, k=K).fit_predict(X4_pqcode)

CPU times: user 264 ms, sys: 0 ns, total: 264 ms
Wall time: 152 ms


Then, run traditional k-means clustering.

In [23]:
# Note: the n_jobs argument was removed from KMeans in scikit-learn 1.0; drop it on newer versions
%time clustered_kmeans = KMeans(n_clusters=K, n_jobs=-1).fit_predict(X4)

CPU times: user 2.07 s, sys: 68 ms, total: 2.14 s
Wall time: 49.5 s


PQk-means is typically tens to hundreds of times faster than k-means, depending on your machine (in the run above, 152 ms versus 49.5 s of wall time). Next, let's look at the accuracy. Because PQk-means operates on compressed codes, its result approximates that of k-means, so k-means achieves a slightly lower error:

In [24]:
_, pqkmeans_micro_average_error, _ = pqkmeans.evaluation.calc_error(clustered_pqkmeans, X4, K)
_, kmeans_micro_average_error, _ = pqkmeans.evaluation.calc_error(clustered_kmeans, X4, K)

print("PQk-means, micro avg error: ", pqkmeans_micro_average_error)
print("k-means, micro avg error: ", kmeans_micro_average_error)

PQk-means, micro avg error:  9.17064329162
k-means, micro avg error:  9.1461016175
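
For intuition about what such an error measures, the sketch below computes one plausible definition: the mean Euclidean distance from each vector to the mean of the vectors assigned to the same cluster. This definition is an assumption made for illustration, `micro_average_error` is a hypothetical helper, and `pqkmeans.evaluation.calc_error` may compute its values differently.

```python
import numpy


def micro_average_error(assignments, X, k):
    """Mean Euclidean distance from each vector to the mean of its assigned cluster."""
    assignments = numpy.asarray(assignments)
    total = 0.0
    for j in range(k):
        members = X[assignments == j]
        if len(members) == 0:  # Skip empty clusters
            continue
        center = members.mean(axis=0)
        total += numpy.linalg.norm(members - center, axis=1).sum()
    return total / len(X)


X_toy = numpy.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [10.0, 4.0]])
labels = [0, 0, 1, 1]
err = micro_average_error(labels, X_toy, k=2)
print(err)  # 1.5
```

Under this definition, a clustering with tighter groups yields a lower value, which is the sense in which the two methods are compared above.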
