# Compressing large embeddings

To run the `/neighbours` API we either need to be able to store the entity vectors in RAM, or use [`faiss`'s compression methods](https://github.com/facebookresearch/faiss/wiki/Lower-memory-footprint). These compression methods consist of PCA followed by quantization.

For simplicity, we can do the PCA step offline instead. Reducing the V&A embeddings from 1200 to 400 dimensions, I noticed no drop in the quality of the nearest neighbours returned.
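One way to put a number on "no quality drop" is to compare the top-k neighbours of a few query entities before and after reduction. The sketch below is only illustrative: it assumes cosine similarity (the metric used by `/neighbours` is not stated here), uses synthetic stand-in data rather than the V&A embeddings, and the helper names `top_k_neighbours` and `neighbour_overlap` are hypothetical.

```python
import numpy as np


def top_k_neighbours(matrix, query_idx, k):
    """Indices of the k nearest rows (by cosine similarity) for each query row."""
    # Normalise rows so that a dot product equals cosine similarity.
    normed = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = normed[query_idx] @ normed.T
    # Exclude each query from its own neighbour list.
    sims[np.arange(len(query_idx)), query_idx] = -np.inf
    return np.argsort(-sims, axis=1)[:, :k]


def neighbour_overlap(full, reduced, query_idx, k=10):
    """Mean fraction of top-k neighbours shared between the two matrices."""
    nn_full = top_k_neighbours(full, query_idx, k)
    nn_reduced = top_k_neighbours(reduced, query_idx, k)
    shared = [len(set(a) & set(b)) / k for a, b in zip(nn_full, nn_reduced)]
    return float(np.mean(shared))


# Synthetic stand-in data: 500 vectors of 64 dims whose last 32 dims are tiny noise.
rng = np.random.default_rng(0)
full = np.hstack([rng.normal(size=(500, 32)), 1e-6 * rng.normal(size=(500, 32))])
reduced = full[:, :32]  # stand-in for a PCA projection
queries = rng.choice(500, size=20, replace=False)

print(neighbour_overlap(full, full, queries))     # identical inputs
print(neighbour_overlap(full, reduced, queries))  # nearly identical geometry
```

On the real matrices it would probably make sense to run this against a random subset of rows, since normalising the full 1200-dim matrix creates a second copy of it in memory.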

In [14]:
import sys
sys.path.append("..")

from src.embedding_store import KGEmbeddingStore
import numpy as np
from sklearn.decomposition import PCA

## 1. Get full-dimensional embeddings

In [2]:
emb_store = KGEmbeddingStore.from_dglke(
    embeddings_folder="../data/processed/final_model_dglke_vanda/", 
    embeddings_file_names=["heritageconnector_RotatE_entity.npy", "heritageconnector_RotatE_relation.npy"], 
    mappings_folder="../data/processed/final_model_dglke_vanda/"
)



In [18]:
# (n_entities, n_dims) of the entity embedding matrix, and its size in GB
emb_store.ent_embedding_matrix.shape, emb_store.ent_embedding_matrix.nbytes/1e9

((1208256, 1200), 5.7996288)

## 2. Reduce embeddings

A `t3a.large` EC2 machine has 8GB RAM. Let's aim for the entity embeddings to take up roughly 2GB, which leaves the machine plenty of memory for everything else and keeps open the option of downgrading to a smaller machine in future.

The full-dimensional entity embedding matrix is 5.8GB (see above), so let's reduce it to a third of its original dimension (1200 dim -> 400 dim), which should bring it to roughly 1.9GB.
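The arithmetic behind that choice can be sketched as follows. It assumes the embeddings are stored as 4-byte `float32` values, which is consistent with the 5.8GB reported for a (1208256, 1200) matrix; the helper names are hypothetical.

```python
BYTES_PER_FLOAT32 = 4  # assumption: embeddings are stored as float32


def matrix_size_gb(n_rows, n_dims, bytes_per_value=BYTES_PER_FLOAT32):
    """Size in GB (1e9 bytes) of a dense n_rows x n_dims matrix."""
    return n_rows * n_dims * bytes_per_value / 1e9


def max_dims_within_budget(n_rows, budget_gb, bytes_per_value=BYTES_PER_FLOAT32):
    """Largest number of dimensions whose matrix still fits in budget_gb."""
    return int(budget_gb * 1e9 // (n_rows * bytes_per_value))


n_entities = 1208256
print(matrix_size_gb(n_entities, 1200))       # full-size matrix
print(matrix_size_gb(n_entities, 400))        # after reducing to 400 dims
print(max_dims_within_budget(n_entities, 2))  # 413
```

Under these assumptions anything up to 413 dimensions fits in a 2GB budget, so 400 is a convenient round number just below that limit.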

In [6]:
X_reduced = PCA(n_components=400, random_state=42).fit_transform(emb_store.ent_embedding_matrix)
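For intuition, the projection that PCA performs can be sketched in plain numpy: centre the data, find the principal axes with an SVD, and keep the coordinates along the first few axes. This is a toy illustration on random stand-in data, not a description of scikit-learn's internals (which may use a different solver and may flip component signs), and `pca_reduce` is a hypothetical helper. It also returns the share of variance retained, which is a quick indicator of how much information the reduction discards; scikit-learn exposes the per-component equivalent as `explained_variance_ratio_` on a fitted `PCA` object.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project X onto its top principal components using a full SVD."""
    X_centred = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, singular_values, Vt = np.linalg.svd(X_centred, full_matrices=False)
    X_reduced = X_centred @ Vt[:n_components].T
    # Share of total variance captured by the components that were kept.
    variance = singular_values ** 2
    retained = variance[:n_components].sum() / variance.sum()
    return X_reduced, float(retained)


rng = np.random.default_rng(42)
X = rng.normal(size=(200, 12)).astype(np.float32)  # toy stand-in data
X_small, retained = pca_reduce(X, n_components=4)
print(X_small.shape, X_small.dtype, round(retained, 3))
```

A full SVD like this is fine for a toy matrix but is likely to be slow and memory-hungry on the 1.2M-row entity matrix, so it is not meant as a replacement for the `PCA` call above.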

In [17]:
X_reduced.nbytes / 1e9

1.9332096

In [12]:
with open("../data/processed/final_model_dglke_vanda/heritageconnector_RotatE_entity_reduced_400.npy", "wb") as f:
    np.save(f, X_reduced)
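As a final sanity check, the saved file can be read back and compared against the in-memory matrix. The sketch below uses a small random stand-in for `X_reduced` and a temporary file with a hypothetical name, so it does not touch the real data folder. `np.load` with `mmap_mode="r"` maps the file read-only rather than loading it all into RAM, which can be handy when inspecting a ~2GB file on a small machine.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for X_reduced; the real matrix has shape (1208256, 400).
X_reduced = np.random.default_rng(0).normal(size=(1000, 400)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "entity_reduced_400.npy")  # hypothetical file name
    with open(path, "wb") as f:
        np.save(f, X_reduced)

    # mmap_mode="r" maps the file read-only instead of loading it all into RAM.
    loaded = np.load(path, mmap_mode="r")
    round_trip_ok = (
        loaded.shape == X_reduced.shape
        and loaded.dtype == X_reduced.dtype
        and np.array_equal(loaded, X_reduced)
    )
    del loaded  # release the memory map before the temporary directory is removed

print(round_trip_ok)
```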