Our vector has length **D=12**. We start by splitting it into **m** subvectors. Here is the vector:

In [14]:
x = [1, 8, 3, 9, 1, 2, 9, 4, 5, 4, 6, 2]


Let's divide the vector into four subvectors, **m=4**, each of length **D_ = D/m**.

In [17]:
m = 4
D = len(x)
# Make sure that D is a multiple of m
assert D%m == 0
D_ = D//m
D_


3

In [23]:
# step through x in strides of D_ so the subvectors do not overlap
u = [x[j : j + D_] for j in range(0, D, D_)]
u

[[1, 8, 3], [9, 1, 2], [9, 4, 5], [4, 6, 2]]
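The split can also be wrapped in a small helper. This is a minimal sketch: `split_vector` is a name made up here, not part of any library, and the demo vector is a stand-in rather than the `x` from above. It checks that the pieces do not overlap and that joining them gives back the input.

```python
def split_vector(vec, m):
    """Split vec into m equal, non-overlapping subvectors (hypothetical helper)."""
    D = len(vec)
    if D % m != 0:
        raise ValueError("vector length must be a multiple of m")
    D_ = D // m
    # stride by D_ so that each element lands in exactly one subvector
    return [vec[j : j + D_] for j in range(0, D, D_)]


demo = list(range(12))  # stand-in vector, not the x from above
parts = split_vector(demo, 4)
print(parts)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]

# joining the pieces must reproduce the input exactly
assert [value for part in parts for value in part] == demo
```

If the stride were 1 instead of `D_`, the slices would overlap and the tail of the vector would never be used, which the final assertion would catch.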

Now we create **k** centroids in total and give each subvector position its own set of **k_ = k/m** centroids.

In [20]:
k = 32  # total number of centroids across all subvector spaces
assert k % m == 0
k_ = k // m  # number of centroids per subvector space
print(f"{k=}, {k_=}")

k=32, k_=8


In [24]:
from random import randint
c = []

for j in range(m):
    # each j represents a subvector
    c_j = []
    for i in range(k_):
        # each subvector needs k_ centroids and i represents a centroid in j
        c_ji = [randint(0,9) for _ in range(D_)]
        c_j.append(c_ji)
    c.append(c_j)

c


[[[8, 2, 8],
  [2, 2, 1],
  [7, 6, 4],
  [0, 9, 3],
  [9, 7, 6],
  [3, 3, 3],
  [8, 5, 1],
  [9, 7, 3]],
 [[5, 9, 1],
  [5, 9, 3],
  [2, 0, 3],
  [4, 4, 3],
  [2, 9, 3],
  [8, 8, 3],
  [9, 9, 8],
  [0, 3, 7]],
 [[1, 4, 2],
  [4, 8, 1],
  [8, 8, 1],
  [7, 4, 2],
  [2, 2, 8],
  [4, 4, 6],
  [9, 7, 6],
  [6, 8, 2]],
 [[9, 8, 2],
  [6, 0, 0],
  [4, 2, 7],
  [1, 6, 6],
  [0, 7, 3],
  [2, 9, 5],
  [1, 2, 4],
  [3, 9, 4]]]

Each of our subvectors will be assigned to the nearest of its own centroids. In PQ terminology these centroids are called reproduction values and are written cⱼ,ᵢ, where j identifies the subvector and i identifies the chosen centroid (there are k_ centroids for each subvector space j).

In [26]:
def euclidean(u, v):
    # distance between u and v
    distance = 0
    for x, x_ in zip(u, v):
        distance += ( x - x_)**2
    return distance**0.5

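def nearest(c_j, u_j):
    # index of the centroid in c_j that is closest to subvector u_j
    distance = float("inf")
    nearest_index = None
    for i, c_ji in enumerate(c_j):
        new_distance = euclidean(c_ji, u_j)
        if new_distance < distance:
            distance = new_distance
            nearest_index = i
    return nearest_index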

# Now we find, for each subvector, the ID of its nearest centroid among the k_ candidates
ids = []
for j in range(m):
    i = nearest(c[j], u[j])
    ids.append(i)
ids

[3, 3, 6, 7]
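The nearest-centroid search can also be written with the standard library alone. The sketch below uses `math.dist` (Python 3.8+) for the Euclidean distance; `nearest_id` and the toy data are made up for illustration and are not the centroids generated above.

```python
from math import dist  # Euclidean distance between two points


def nearest_id(c_j, u_j):
    """Index of the centroid in c_j closest to u_j (the first one wins on ties)."""
    return min(range(len(c_j)), key=lambda i: dist(c_j[i], u_j))


# toy centroids and subvector, made up for illustration
toy_centroids = [[0, 0], [3, 4], [10, 10]]
toy_subvector = [2, 3]
print(nearest_id(toy_centroids, toy_subvector))  # 1
```

Because `min` returns the first minimal element, ties resolve the same way as a loop with a strict `<` comparison.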

`c[j][ids[j]]` is the centroid (reproduction value) chosen for subvector `j`.

With that, we have compressed a 12-dimensional vector into a 4-dimensional vector of centroid IDs.
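The IDs are only meaningful together with the centroids they point to. As a sketch of the reverse direction, an approximate vector can be rebuilt by looking up each ID and concatenating the centroids. The `decode` helper and the toy codebook below are illustrative assumptions, not values from the cells above.

```python
def decode(ids, codebook):
    """Rebuild an approximate vector from one centroid ID per subvector space."""
    approx = []
    for j, i in enumerate(ids):
        # codebook[j][i] is the centroid chosen for subvector space j
        approx.extend(codebook[j][i])
    return approx


# hypothetical toy codebook: 2 subvector spaces, 2 centroids each, 2 dimensions per centroid
toy_codebook = [
    [[0, 0], [5, 5]],
    [[1, 9], [8, 2]],
]
toy_ids = [1, 0]
print(decode(toy_ids, toy_codebook))  # [5, 5, 1, 9]
```

The result has the original dimensionality again, but it is only an approximation: each subvector has been replaced by its nearest centroid.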

Let’s switch from our original 12-dimensional vector of 8-bit integers to a more realistic 128-dimensional vector of 32-bit floats (as we will be using throughout the next section). We can strike a good balance in performance by compressing it to a vector of just eight 8-bit integer IDs.

Original: 128 × 32 = 4096 bits

Quantized: 8 × 8 = 64 bits

That’s a big difference: 64x smaller!
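The arithmetic can be reproduced in a few lines. This is a sketch under one assumption not stated above: an 8-bit ID is taken to mean 256 centroids per subvector space, since 2^8 = 256. The helper name `pq_size_bits` is made up here.

```python
def pq_size_bits(D, bits_per_value, m, k_):
    """Return (original_bits, quantized_bits) for a single vector (hypothetical helper)."""
    original_bits = D * bits_per_value
    # an ID must distinguish k_ centroids, so it needs ceil(log2(k_)) bits
    bits_per_id = (k_ - 1).bit_length()
    quantized_bits = m * bits_per_id
    return original_bits, quantized_bits


# assumption: 8-bit IDs correspond to k_ = 256 centroids per subvector space
original, quantized = pq_size_bits(D=128, bits_per_value=32, m=8, k_=256)
print(original, quantized, original // quantized)  # 4096 64 64
```

The count covers only the per-vector storage; the memory needed for the centroids themselves is shared across all vectors and is not included.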