# Product Quantization

### Notebook written by Ryan Awad

Links:
- https://www.pinecone.io/learn/series/faiss/product-quantization/
- https://inria.hal.science/inria-00514462v2/document

In [1]:
from random import randint, uniform
import numpy as np
from math import log2, ceil

### We are defining 5 random 12-dimensional vectors

In [2]:
dim = 12
amount_of_vectors = 5
large_vectors = [[uniform(-100.0, 100.0) for _ in range(dim)] for _ in range(amount_of_vectors)]
#large_vectors

### We are defining some variables for the algorithm

- `m`: dimensions of a compressed vector / how many sub-vectors will be created per vector
- `D_`: dimensions of a sub-vector

In [3]:
m = 4 # dimensions of a compressed vector

# ensure dim is divisible by m
assert dim % m == 0
# length of each subvector will be dim / m (D* in notation)
D_ = dim // m

D_

3

In [4]:
# now create the subvectors
u_vectors = []
for x in large_vectors:
    u = [x[row:row+D_] for row in range(0, dim, D_)]
    u_vectors.append(u)

u_vectors

[[[-54.700138239816056, -43.89459613315494, 42.77922047603576],
  [-47.800174559679334, -67.99537143881791, 34.79883056429435],
  [-70.21771795772436, -72.73921786942144, 19.686458648771833],
  [0.26157904936408727, 31.299117583491352, 88.20935500064192]],
 [[-3.424943174158088, -39.888291875907676, 48.46925835051297],
  [-62.181335415667014, 36.55960586352688, 64.20847369347621],
  [92.65912419442824, -30.124153089185228, 67.9116042993661],
  [56.52168499971839, -12.302040748724252, -5.152890065980628]],
 [[-56.56888535476967, 57.67132045753428, 89.08142456710598],
  [44.50868926315778, 44.654171389976995, -67.61334918291485],
  [10.239987194046392, -31.701813900225815, -51.37523522564924],
  [57.219234299487255, 64.13722325207453, 3.2272766245328626]],
 [[-70.90570218417088, -59.3306374361906, 67.35684075768319],
  [-63.95133667749131, -94.88805725482332, -72.67086359098518],
  [11.572603914238996, 54.06640771328003, -49.92723789639493],
  [44.67143822128193, 76.51382766941924, 59.48

# Important
- `k`: total number of centroids generated
- `k_`: number of centroids to choose from per sub-vector

Notice that `k_ = k / m`, so when you increase `k`, `k_` increases proportionally. As `k_` increases, each centroid ID needs more bits (`ceil(log2(k_))`), so the compressed vector can grow, WHILE the distances between the regenerated vectors and the actual vectors decrease. The time needed to generate the centroids and to compress a vector also increases with `k_`.

This is because as `k_` increases, each sub-vector gets to choose from a larger number of randomly generated centroids. Probabilistically, a sub-vector is more likely to find a close centroid than it would with a smaller `k_`, which gives each sub-vector fewer options. This causes the average distance between regenerated vectors and actual vectors to decrease.
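The effect described above can be checked with a small experiment. This is a minimal sketch, not part of the original notebook: the helper `mean_nearest_distance` and its parameters (`sub_dim`, `trials`) are hypothetical names, and it assumes centroids and sub-vectors are drawn uniformly from the same range used elsewhere in this notebook.

```python
from random import seed, uniform

def euclidean(v, u):
    return sum((x - y) ** 2 for x, y in zip(v, u)) ** 0.5

def mean_nearest_distance(num_centroids, sub_dim=3, trials=200):
    """Average distance from a random sub-vector to its nearest random centroid."""
    # Hypothetical helper: one random codebook, then many random queries.
    centroids = [[uniform(-100.0, 100.0) for _ in range(sub_dim)]
                 for _ in range(num_centroids)]
    total = 0.0
    for _ in range(trials):
        u = [uniform(-100.0, 100.0) for _ in range(sub_dim)]
        total += min(euclidean(c, u) for c in centroids)
    return total / trials

seed(0)  # fixed seed so the run is repeatable
results = {n: mean_nearest_distance(n) for n in (4, 64, 1024)}
for n, d in results.items():
    print(f"k_={n:5d}  mean nearest-centroid distance: {d:.2f}")
```

The printed averages should shrink as the number of centroids grows, which is the trend the paragraph above describes.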

In addition, as `k_` increases, each sub-vector has more centroids to compare against, so the compression algorithm takes longer.

On the other hand, the compression algorithm converts each sub-vector into a centroid ID. Since each sub-vector gets `k_` centroids to choose from, each ID needs enough allocated memory to store a value from 0 to `k_`-1. Therefore, as `k_` increases, the overall size of a compressed vector increases. Because this notebook rounds each ID up to whole bytes, the size grows in steps rather than continuously.
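To make the size argument concrete, here is a small sketch that uses the same whole-byte rounding as the size calculation in this notebook. The helper name `bytes_per_id` is an assumption introduced for illustration.

```python
from math import ceil, log2

def bytes_per_id(num_centroids):
    """Whole bytes needed to store an ID in the range 0..num_centroids-1."""
    # Hypothetical helper; at least 1 bit so a single centroid still takes a byte.
    bits = max(1, ceil(log2(num_centroids)))
    return ceil(bits / 8)

m = 4  # sub-vectors per vector, as in this notebook
for num_centroids in (16, 256, 257, 65536):
    size = bytes_per_id(num_centroids) * m
    print(f"k_={num_centroids:6d}  ->  {size} bytes per compressed vector")
```

With `m = 4`, a `k_` of 256 still fits in 4 bytes per compressed vector, while 257 already needs 8 bytes, because each ID spills into a second byte.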

In [5]:
k = 2**10
assert k % m == 0
k_ = int(k/m)
print(f"{k=}, {k_=}")
print(f"Compressed vector size: {ceil(ceil(log2(k_))/8) * m} bytes")
print(f"Centroid map size: {ceil(k*D_*(32/8))} bytes")

k=1024, k_=256
Compressed vector size: 4 bytes
Centroid map size: 12288 bytes


### Randomly generating the centroids

In [6]:
c = []  # our overall list of reproduction values
for j in range(m):
    # each j represents a subvector (and therefore subquantizer) position
    c_j = []
    for i in range(k_):
        # each i represents a cluster/reproduction value position *inside* each subspace j
        c_ji = [uniform(-100.0,100.0) for _ in range(D_)]
        c_j.append(c_ji)  # add cluster centroid to subspace list
    # add subspace list of centroids to overall list
    c.append(c_j)
#c

In [7]:
def euclidean(v, u):
    distance = sum((x - y) ** 2 for x, y in zip(v, u)) ** .5
    return distance

def nearest(c_j, u_j):
    """Return the index of the centroid in c_j nearest to the sub-vector u_j.

    c_j is the SET of centroids for the specific sub-vector u_j.
    """
    nearest_idx = None
    distance = float("inf")  # any real distance will be smaller
    for i, centroid in enumerate(c_j):
        new_dist = euclidean(centroid, u_j)
        if new_dist < distance:
            nearest_idx = i
            distance = new_dist
    return nearest_idx

### Compressing the vectors

In [8]:
compressed_vectors = [] # ids of the nearest centroid for each sub-vector
for u in u_vectors:
    curr_compressed_vec = []
    for j in range(m):
        i = nearest(c[j], u[j])
        curr_compressed_vec.append(i)
    compressed_vectors.append(curr_compressed_vec)

#print(f'old dimension: {len(x)}\ncompressed dimension: {len(ids)}')
compressed_vectors

[[244, 151, 109, 176],
 [140, 57, 236, 93],
 [102, 243, 21, 47],
 [249, 108, 60, 106],
 [104, 215, 80, 27]]

### Defining an error function

This error function only measures how much the distances between regenerated vectors <u>RELATIVE TO EACH OTHER</u> changed. This is because, in vector similarity search, the relative distance between vectors is the only thing that matters. Their actual position in the vector space means nothing if their distances relative to each other stay the same or similar.

Computations
1. Computes the distance between every pair of uncompressed vectors, and the distance between every pair of regenerated vectors
2. Computes the absolute difference between the two distance arrays, element by element
3. Returns the average absolute difference
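The point about relative distances can be illustrated with a toy case. This is a sketch with made-up data: the "regenerated" set is simply the original set shifted by a constant offset, and `pairwise_distances` is a hypothetical helper that mirrors the `i < j` pairing used by the error function.

```python
def euclidean(v, u):
    return sum((x - y) ** 2 for x, y in zip(v, u)) ** 0.5

def pairwise_distances(vectors):
    """Distances for every unordered pair (i < j)."""
    n = len(vectors)
    return [euclidean(vectors[i], vectors[j])
            for i in range(n) for j in range(i + 1, n)]

original = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
# Hypothetical "regenerated" set: every vector shifted by the same offset.
shifted = [[x + 50.0, y + 50.0] for x, y in original]

per_vector = [euclidean(s, o) for s, o in zip(shifted, original)]
o_dists = pairwise_distances(original)
s_dists = pairwise_distances(shifted)
relative_err = sum(abs(a - b) for a, b in zip(s_dists, o_dists)) / len(o_dists)

print(per_vector)    # each vector moved far from its original position
print(relative_err)  # yet the pairwise distances did not change
```

Every vector is far from its original position, but the pairwise distances are identical, so a nearest-neighbour search over the shifted set would rank the vectors the same way.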

### TODO:
- Since vectors can be of various sizes, the error can sometimes be really big without being significant relative to the vectors' size.
    - The error should take the average absolute difference and make it relative to the size of the vectors
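One possible way to approach this TODO is sketched below. It is not the notebook's method: the function `relative_transformation_err` is a hypothetical name, and dividing by the mean original pairwise distance is an assumption, one of several reasonable normalizations. It expects at least two vectors.

```python
def euclidean(v, u):
    return sum((x - y) ** 2 for x, y in zip(v, u)) ** 0.5

def pairwise_distances(vectors):
    n = len(vectors)
    return [euclidean(vectors[i], vectors[j])
            for i in range(n) for j in range(i + 1, n)]

def relative_transformation_err(regen, old):
    """Mean absolute change in pairwise distance, divided by the
    mean original pairwise distance (assumed normalization)."""
    r_dists = pairwise_distances(regen)
    o_dists = pairwise_distances(old)
    mean_abs_diff = sum(abs(r - o) for r, o in zip(r_dists, o_dists)) / len(o_dists)
    mean_old = sum(o_dists) / len(o_dists)
    return mean_abs_diff / mean_old if mean_old else 0.0

old = [[0.0, 0.0], [3.0, 4.0]]
regen = [[0.0, 0.0], [6.0, 8.0]]
small = relative_transformation_err(regen, old)
# The same geometry scaled by 100 has a 100x larger absolute error,
# but the relative error stays the same.
big = relative_transformation_err([[c * 100 for c in v] for v in regen],
                                  [[c * 100 for c in v] for v in old])
print(small, big)
```

Both calls return the same value, so the metric no longer depends on the overall scale of the vectors.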

In [9]:
def get_transformation_err(regen, old) -> float:
    r_dists = []
    o_dists = []
    for i in range(len(regen)):
        for j in range(len(regen)):
            if i < j:
                r_dists.append(euclidean(regen[i], regen[j]))
    for i in range(len(old)):
        for j in range(len(old)):
            if i < j:
                o_dists.append(euclidean(old[i], old[j]))

    print(r_dists)
    print(o_dists)

    error = sum([abs(r - o) for r, o in zip(r_dists, o_dists)]) / len(r_dists)
    return error

### Regenerating the vectors from the compressed vectors

In [10]:
regen_vecs = []
for comp in compressed_vectors:
    curr_regen = []
    for j in range(m):
        c_ji = c[j][comp[j]] # take the nearest centroid to each sub-vector using their index i
        curr_regen.extend(c_ji)
    regen_vecs.append(curr_regen)

### Results & Statistics
- Printing old and regenerated vectors
- The distance between regenerated and old vectors
- The average distance between regenerated and old vectors (not a great error metric as described in the cell above the error function definition)
- The output of the error function defined above
- The size of a compressed vector vs an old vector
- The compression $\%$

In [11]:
dists = []
for i in range(amount_of_vectors):
    dists.append(euclidean(regen_vecs[i], large_vectors[i]))

print(f'Regenerated vectors from compressed vectors:\n{np.array(regen_vecs)}')
print(f'Old vectors:\n{np.array(large_vectors)}')

print(f'Distances:\n{dists}')
print(f'Average distance: {round(sum(dists)/amount_of_vectors, 2)}')
print(f'ERROR: {get_transformation_err(regen_vecs, large_vectors)}')

new_size = ceil(ceil(log2(k_))/8) * m
old_size = dim * ceil(32/8)

print(f"Compressed vector size: {new_size} bytes")
print(f"Old vector size: {old_size} bytes")
print(f"{round((1 - (new_size/old_size))*100, 2)}% compression")

Regenerated vectors from compressed vectors:
[[-59.15684011 -44.82010568  43.66388861 -44.53905337 -68.89517964
   31.34406289 -99.09862716 -78.34977659   9.31678057   6.67918896
   26.81270135  73.26773636]
 [-15.49505867 -27.68558465  40.98824096 -74.83864206  27.46040693
   47.56061513  76.51076495 -21.74335753  60.7490854   49.63523948
  -11.8562858   -8.96491367]
 [-63.74716034  36.41362832  86.31573235  58.35233552  47.76715061
  -69.23831323   7.54577674 -26.66743136 -60.03297656  64.3545244
   56.18855078   8.0301435 ]
 [-72.92452896 -66.65719551  77.1402021  -71.66862416 -77.28619085
  -76.10712675   1.12486119  59.55184034 -43.49985458  45.56692013
   74.85646808  43.52156801]
 [ 25.85185026  68.41673642  45.7155241   73.46736692  90.80197016
   15.52145289  50.1321212  -13.01635806 -82.04962061  -7.64764034
   -9.53599322  80.9679464 ]]
Old vectors:
[[-54.70013824 -43.89459613  42.77922048 -47.80017456 -67.99537144
   34.79883056 -70.21771796 -72.73921787  19.68645865   0.26