## Getting Started

This is a simple tutorial showing how to use spDB. This example makes use of the Fiqa Beir dataset. You can find more information about the Beir datasets [here](https://github.com/beir-cellar/beir)

### Set up the environment

First, we will set up the environment by importing the required libraries and appending the paths needed

In [2]:
import os
import sys
import numpy as np
import pickle

# Load in spDB from the local directory
current_dir = os.getcwd()
sys.path.append(current_dir + "/../")
sys.path.append(current_dir + "/../tests/integration/")

from spdb.spdb import spDB, load_db
import helpers

### Load in test data

In [150]:
# Load in the Fiqa test data
vectors, text, queries, _ = helpers.fiqa_test_data()
with open(current_dir + "/../tests/data/fiqa_queries_text.pickle", "rb") as f:
    query_text = pickle.load(f)

print (len(vectors))
print (type(vectors[0][0]))


30000
<class 'numpy.float32'>


### Create the spDB object

In [None]:
# Create the spDB
db_name = "fiqa_test"
db = spDB(db_name)

### Load in the spDB

This section is not necessary to run, it just shows how to load in an spDB object that has been created

In [13]:
# Optional: Load in the spDB object

import numpy as np
from scipy.optimize import curve_fit

db_name = "fiqa_test_1"
db = load_db(db_name)
print (db.faiss_index.ntotal)
print (db.faiss_index)
print (47304704/(db.faiss_index.ntotal))

# PQ64
#vectors = np.array([60000, 120000, 240000, 360000, 600000])
#memory_per_vector = np.array([142.1, 106.2, 89.1, 83.4, 78.84])

#PQ32
vectors = np.array([60000, 120000, 180000, 240000, 360000, 600000])
memory_per_vector = np.array([110.1, 74.2, 62.8, 57.1, 51.4, 46.8])

#PQ128
#vectors = np.array([60000, 120000, 180000, 240000, 360000, 600000])
#memory_per_vector = np.array([206.1, 170.2, 158.8, 153.1, 147.4, 142.8])

vectors = np.array([60000, 120000, 180000, 240000, 360000, 600000])
memory_per_vector = np.array([8525840, 12744704, 17064704, 21384704, 30024704, 47304704])

def model(n, C, b):
    return C / n + b

params, params_covariance = curve_fit(model, vectors, memory_per_vector)

C, b = params
print(f"Fitted parameters: C = {C}, b = {b}")

30000
<faiss.swigfaiss_avx2.IndexPreTransform; proxy of <Swig Object of type 'faiss::IndexPreTransform *' at 0x7fe0f8aab1b0> >
1576.8234666666667
Fitted parameters: C = 4219994.86512343, b = 39.51947796057294


In [12]:
print (db.faiss_index)

<faiss.swigfaiss_avx2.IndexPreTransform; proxy of <Swig Object of type 'faiss::IndexPreTransform *' at 0x7fe0f8a88d50> >


In [154]:
import numpy as np
import random
import string

def generate_random_vectors_with_text(N, D):
    random_vectors = np.random.rand(N, D).astype(np.float32) 
    random_text = [''.join(random.choices(string.ascii_lowercase, k=D)) for _ in range(N)]
    return random_vectors, random_text

# Specify the number of random vectors (N) and the dimensionality (D)
N = 30000  # Number of random vectors
D = 2048  # Dimensionality of each vector

# Generate N random vectors with D dimensions and random text strings
random_vectors, random_text = generate_random_vectors_with_text(N, D)

In [5]:
def predict_memory(n):
    return model(n, C, b)

def predict_full_memory(n, C, b):
    return C + (b * n)

print (predict_memory(2048))
print (predict_full_memory(2048, C, b))

7785779.197298177
15945275796.066666


### Add data to the spDB

The data must be a list of tuples, where each tuple contains `(vector, metadata)`

In [171]:
# Add the data to the spDB
add_data = [(random_vectors[i], {"text": random_text[i]}) for i in range(len(random_vectors))]
db.add(add_data)

range(570000, 600000)

In [149]:
# Get info
print(db.vector_dimension)

768


### Train the faiss index

For this example, we are using PCA 256, compressed vector bytes of 32, and omitting OPQ

For more information on these parameters, you can visit the Github Wiki [here](https://github.com/SuperpoweredAI/spDB/wiki/Tunable-parameters)

In [None]:
# Train the spDB
db.train(True, pca_dimension=256, compressed_vector_bytes=32, omit_opq=True)

### Query the trained index

Make a test query using the `db.query()` method

In [None]:
# Make a test query
results = db.query(queries[0])
print ("Query text:", query_text[0])
print ("")
print (results["metadata"][0]["text"])