# Notebook for Running the ADB Project Phase 2


**This notebook is divided into two main parts, each focusing on different database sizes:**

- **Part 1: Database Size 10K**

  - Initiate a new database and insert vectors into it.
  - Retrieve vectors from the database.
  - Ensure that the insertion time for this database does not exceed 5 minutes.
  - Allow flexible RAM usage during insertion but ensure it stays within Google Colab limits.
  - Evaluate retrieval time and accuracy.
  - Ensure that the peak RAM usage for retrieval does not exceed 5 MB.

- **Part 2: Database Sizes 100K and More**
  - Generate database vectors using a random seed (refer to the provided code).
  - You must generate the database and its index before submission.
  - Implement a VecDB class that loads the pre-generated database, including its index, and retrieves vectors from it.
  - Evaluate retrieval time and accuracy for different database sizes.
  - The peak RAM usage for retrieval should not exceed:
    - 100K: 10 MB
    - 1M: 25 MB
    - 5M: 75 MB
    - 10M: 150 MB
    - 15M: 225 MB
    - 20M: 300 MB
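
For local checks before submission, the retrieval limits listed above can be kept in a small lookup table. This is only a sketch: the names `RAM_LIMITS_MB` and `within_ram_limit` are hypothetical and are not part of the evaluation code. The 5 MB entry for 10K is taken from Part 1.

```python
# Peak-RAM limits for retrieval, in MB, keyed by number of vectors.
# Values are copied from the limits listed above; the names are illustrative only.
RAM_LIMITS_MB = {
    10_000: 5,
    100_000: 10,
    1_000_000: 25,
    5_000_000: 75,
    10_000_000: 150,
    15_000_000: 225,
    20_000_000: 300,
}


def within_ram_limit(db_size: int, peak_mb: float) -> bool:
    """Return True if a measured retrieval peak fits the limit for db_size."""
    return peak_mb <= RAM_LIMITS_MB[db_size]
```

For example, `within_ram_limit(1_000_000, 24.9)` passes, while `within_ram_limit(100_000, 10.5)` does not.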

**This notebook is structured into two parts:**

- **Part 1 - Modifiable Cells:**
  This section contains cells that teams are allowed to modify. Only the variables may be changed, and their values are to be submitted during the project's final phase. They are:

  - GitHub repository link (including PAT token).
  - Database (DB) variables, providing the path to the directory or file for loading existing databases and indexes (refer to provided code to see how).

- **Part 2 - Non-Modifiable Cells:** This section must not be modified by any team. It includes essential setup and evaluation code. Ensure that the notebook runs smoothly by providing the required inputs in Part 1.


## Part 1 - Modifiable Cells


Each team will provide its own GitHub repository link.
The link should include a PAT token so that I can download the repository.


In [None]:
!git clone https://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:@github.com/abdokaseb/sematic_search_DB.git

Teams are required to provide unique paths for the generated databases of sizes 100K, 1M, 5M, 10M, 15M, and 20M. Follow these steps to submit the databases:

- Once you have the database and index ready, zip the necessary folders/files.
- Upload the zip file to Google Drive.
- Ensure the file is shareable with "anyone with the link."
- Obtain the zip file link (e.g., https://drive.google.com/file/d/1j1gAU3kvdRqcOoKI5K5FgMMUZpOQANah/view?usp=drive_link).
- Extract the zip file ID (e.g., 1j1gAU3kvdRqcOoKI5K5FgMMUZpOQANah).
- Place the ID in the designated variable (to be submitted during the project final phase).
- The code will automatically download the zip file and unzip it inside this directory.
- Provide the local PATH for each database to be passed to the initializer for automatic loading of the database and index (to be submitted during the project final phase).
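
The ID extraction step above can also be done programmatically. The helper below is a minimal sketch that assumes the link has the `/file/d/<ID>/` form shown in the example. The name `extract_drive_file_id` is hypothetical, and other Drive link formats are not handled.

```python
import re
from typing import Optional

# Matches the ID segment in links of the form .../file/d/<ID>/...
_DRIVE_FILE_ID = re.compile(r"/file/d/([A-Za-z0-9_-]+)")


def extract_drive_file_id(link: str) -> Optional[str]:
    """Return the file ID from a Drive share link, or None if it is absent."""
    match = _DRIVE_FILE_ID.search(link)
    return match.group(1) if match else None


link = "https://drive.google.com/file/d/1j1gAU3kvdRqcOoKI5K5FgMMUZpOQANah/view?usp=drive_link"
print(extract_drive_file_id(link))  # 1j1gAU3kvdRqcOoKI5K5FgMMUZpOQANah
```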


In [None]:
TEAM_NUMBER = 2
GDRIVE_ID_DB_100K = "14Gp_C3LxLuYIyF-zL5q-xFXKQ8KOFtZp"#TODO
GDRIVE_ID_DB_1M = "1Kxikp9SHfD5s44PJjuXVrlrFSJU93Dvj"
GDRIVE_ID_DB_5M = "1UpNe1IUO2_rQLsx64zIxbNpMmgedcrpv"
GDRIVE_ID_DB_10M = "1l7kajUqLVk8Ix3Zf3s6hNGTqvbCi57J0"
GDRIVE_ID_DB_15M = "1sPIgNxIuNDUUBnTvAuPMLryd1mqmgwV3" #TODO
GDRIVE_ID_DB_20M = "1j1gAU3kvdRqcOoKI5K5FgMMUZpOQANah" #TODO
PATH_DB_100K = "saved_db_100k.csv"
PATH_DB_1M = "saved_db_1m.csv"
PATH_DB_5M = "saved_db_5m.csv"
PATH_DB_10M = "saved_db_10m.csv"
PATH_DB_15M = "saved_db_15m.csv"
PATH_DB_20M = "saved_db_20m.csv"


I will change these two variables while running the notebook during the discussion.


In [None]:
QUERY_SEED_NUMBER = 10
DB_SEED_NUMBER = 20

This means that the project submission will include the following:

- TEAM_NUMBER
- GitHub clone link
- GDRIVE_ID_DB_100K
- GDRIVE_ID_DB_1M
- GDRIVE_ID_DB_5M
- GDRIVE_ID_DB_10M
- GDRIVE_ID_DB_15M
- GDRIVE_ID_DB_20M
- PATH_DB_100K
- PATH_DB_1M
- PATH_DB_5M
- PATH_DB_10M
- PATH_DB_15M
- PATH_DB_20M
- And, of course, the project document that describes what you did


## Part 2 - Non-Modifiable Cells

#### You can't edit this part, and neither can I.

#### Note: I may edit it if there is a major bug.


In [None]:
%cd sematic_search_DB

This cell installs any additional requirements that your code needs.


In [None]:
%pip install memory-profiler >> log.txt
%pip install -r requirements.txt

This cell downloads the zip files and unzips them here.


In [None]:
!gdown $GDRIVE_ID_DB_100K -O saved_db_100k.zip
!gdown $GDRIVE_ID_DB_1M -O saved_db_1m.zip
!gdown $GDRIVE_ID_DB_5M -O saved_db_5m.zip
!gdown $GDRIVE_ID_DB_10M -O saved_db_10m.zip
!gdown $GDRIVE_ID_DB_15M -O saved_db_15m.zip
!gdown $GDRIVE_ID_DB_20M -O saved_db_20m.zip
!unzip saved_db_100k.zip
!unzip saved_db_1m.zip
!unzip saved_db_5m.zip
!unzip saved_db_10m.zip
!unzip saved_db_15m.zip
!unzip saved_db_20m.zip

These are the functions for running and reporting


In [None]:
import numpy as np
from ivf import VecDB
import time
from dataclasses import dataclass
from typing import List
from memory_profiler import memory_usage
import gc


@dataclass
class Result:
    run_time: float
    top_k: int
    db_ids: List[int]
    actual_ids: List[int]


results = []
to_print_arr = []


def run_queries(db, query, top_k, actual_ids, num_runs):
    global results
    results = []
    for _ in range(num_runs):
        tic = time.time()
        db_ids = db.retrive(query, top_k)
        toc = time.time()
        run_time = toc - tic
        results.append(Result(run_time, top_k, db_ids, actual_ids))
    return results


def memory_usage_run_queries(args):
    global results
    # This part is added to calculate the RAM usage
    mem_before = max(memory_usage())
    mem = memory_usage(proc=(run_queries, args, {}), interval=1e-3)
    return results, max(mem) - mem_before


def evaluate_result(results: List[Result]):
    # scores are negative. So getting 0 is the best score.
    scores = []
    run_time = []
    for res in results:
        run_time.append(res.run_time)
        # case for retrieving a number of ids not equal to top_k: the score will be the lowest
        if len(set(res.db_ids)) != res.top_k or len(res.db_ids) != res.top_k:
            scores.append(-1 * len(res.actual_ids) * res.top_k)
            continue
        score = 0
        for id in res.db_ids:
            try:
                ind = res.actual_ids.index(id)
                if ind > res.top_k * 3:
                    score -= ind
            except ValueError:  # id is not in the ground-truth list
                score -= len(res.actual_ids)
        scores.append(score)

    return sum(scores) / len(scores), sum(run_time) / len(run_time)


def get_actual_ids_first_k(actual_sorted_ids, k):
    return [id for id in actual_sorted_ids if id < k]
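
To make the scoring rules in `evaluate_result` concrete, the sketch below applies the same per-result logic to a toy ground truth. The helper name `score_one` and the toy data are illustrative assumptions and are not part of the evaluation code.

```python
from typing import List


def score_one(db_ids: List[int], actual_ids: List[int], top_k: int) -> int:
    """Score a single query result with the same rules as evaluate_result."""
    # Wrong count or duplicate ids: lowest possible score.
    if len(set(db_ids)) != top_k or len(db_ids) != top_k:
        return -1 * len(actual_ids) * top_k
    score = 0
    for id_ in db_ids:
        try:
            rank = actual_ids.index(id_)
            # Ranks up to 3 * top_k are free; deeper ranks cost their rank.
            if rank > top_k * 3:
                score -= rank
        except ValueError:
            # An id missing from the ground truth costs the full list length.
            score -= len(actual_ids)
    return score


actual = list(range(100))  # toy ground truth: id i sits at rank i
print(score_one([0, 1, 2, 3, 4], actual, 5))    # 0
print(score_one([0, 1, 2, 3, 15], actual, 5))   # 0
print(score_one([0, 1, 2, 3, 40], actual, 5))   # -40
print(score_one([0, 1, 2, 3, 999], actual, 5))  # -100
print(score_one([0, 0, 1, 2, 3], actual, 5))    # -500
```

With `top_k = 5`, any returned id ranked within the true top 15 costs nothing, so an approximate index is only penalized for results that fall far down the true ranking.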

This cell generates the 10K and 100K databases and the query using the seed numbers, which will be changed on submission day.


In [None]:
# rng = np.random.default_rng(DB_SEED_NUMBER)
rng = np.random.default_rng(50)
vectors = rng.random((10**6 *20, 70), dtype=np.float32)
# save with memmap
rng = np.random.default_rng(QUERY_SEED_NUMBER)
query = rng.random((1, 70), dtype=np.float32)

actual_sorted_ids_10k = (
    np.argsort(
        vectors.dot(query.T).T
        / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)),
        axis=1,
    )
    .squeeze()
    .tolist()[::-1]
)
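
The ground-truth ranking above orders every row by cosine similarity to the query, from highest to lowest. The sketch below runs the same expression on three toy 2-D vectors so the result can be checked by hand. The values are made up for illustration.

```python
import numpy as np

# Three toy database rows and one query (illustrative values only).
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], dtype=np.float32)
query = np.array([[1.0, 0.2]], dtype=np.float32)

# Cosine similarity of each row with the query, shape (1, 3).
cos = vectors.dot(query.T).T / (
    np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
)

# argsort is ascending, so reversing the list gives most similar first.
ranked = np.argsort(cos, axis=1).squeeze().tolist()[::-1]
print(ranked)  # [0, 2, 1]
```

Row 0 points almost in the query's direction, row 2 is next, and row 1 is nearly orthogonal, which matches the printed order.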

Open a new DB, add 10K vectors, then retrieve and evaluate. Then add another 90K (100K in total), then retrieve and evaluate.


In [None]:
# records_dict = [{"id": i, "embed": list(row)} for i, row in enumerate(vectors)]
n = vectors.shape[0]
# n = 5* 10**6
db = VecDB(
    file_path=PATH_DB_100K
    if n == 10**5
    else PATH_DB_1M
    if n == 10**6
    else PATH_DB_5M
    if n == 5 * 10**6
    else PATH_DB_10M
    if n == 10 * 10**6
    else PATH_DB_15M
    if n == 15 * 10**6
    else PATH_DB_20M
    if n == 20 * 10**6
    else f"saved_db_{n}.csv",
    new_db=True,
)

batch_size = min(10**6, int(n * 0.1)) if n >= 10**6 else n

print("batch_size:", batch_size)

flag = False
# ------------------ Inserting ------------------
# for i in range(0, n // batch_size):

#     flag = i == (n // batch_size - 1)

#     # print(records_dict[0])
#     # records_dict = [{"id": i, "embed": list(vectors[i])} for i in range(vectors.shape[0])]
#     db.insert_records(vectors, flag)

    # del records_dict[0:batch_size]

db.insert_records(vectors)
# ------------------ Build index ------------------
# to be deleted
# db.build_index()

res = run_queries(
    db, query, 5, actual_sorted_ids_10k, 1
)  # one run to make everything fresh and loaded

res, mem = memory_usage_run_queries(
    (db, query, 5, actual_sorted_ids_10k, 5)
)  # actual runs to compute time, and memory

eval = evaluate_result(res)

to_print = f"10K\tscore\t{eval[0]}\ttime\t{eval[1]:.2f}\tRAM\t{mem:.2f} MB"

to_print_arr.append(to_print)

print(to_print)

Remove existing variables to free some RAM.


In [None]:
del vectors
del query
del actual_sorted_ids_10k
# del records_dict
del db
gc.collect()

This code generates the 20M database. The seed (50) will not be changed. Create the same DB and prepare its files, indexes, and every related file. <br>
Note that at submission I will not run the insert records step. <br>
The query itself will be changed on submission day, but not the DB.


In [None]:
rng = np.random.default_rng(50)
vectors = rng.random((10**7 * 2, 70), dtype=np.float32)

rng = np.random.default_rng(QUERY_SEED_NUMBER)
query = rng.random((1, 70), dtype=np.float32)

actual_sorted_ids_20m = (
    np.argsort(
        vectors.dot(query.T).T
        / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)),
        axis=1,
    )
    .squeeze()
    .tolist()[::-1]
)
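
The single 20M ranking is reused for every smaller database through `get_actual_ids_first_k`. The sketch below shows the filter on a toy ranking. It assumes, as the `id < k` test implies, that a database of size k holds the first k rows of the 20M array. The toy ranking is made up for illustration.

```python
def get_actual_ids_first_k(actual_sorted_ids, k):
    # Same filter as in the evaluation code: keep ids below k, preserving rank order.
    return [id for id in actual_sorted_ids if id < k]


full_ranking = [7, 2, 9, 0, 4, 8, 1, 5, 3, 6]  # toy ranking over 10 rows
print(get_actual_ids_first_k(full_ranking, 4))  # [2, 0, 1, 3]
```

Because the relative order of the surviving ids is unchanged, the filtered list is the ground-truth ranking restricted to the first k rows.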

In [None]:
db = VecDB(file_path=PATH_DB_100K, new_db=False)
actual_ids = get_actual_ids_first_k(actual_sorted_ids_20m, 10**5)
res = run_queries(
    db, query, 5, actual_ids, 1
)  # one run to make everything fresh and loaded
res, mem = memory_usage_run_queries(
    (db, query, 5, actual_ids, 3)
)  # actual runs to compute time, and memory
eval = evaluate_result(res)
to_print = f"100K\tscore\t{eval[0]}\ttime\t{eval[1]:.2f}\tRAM\t{mem:.2f} MB"
to_print_arr.append(to_print)
print(to_print)


db = VecDB(file_path=PATH_DB_1M, new_db=False)
actual_ids = get_actual_ids_first_k(actual_sorted_ids_20m, 10**6)
res = run_queries(
    db, query, 5, actual_ids, 1
)  # one run to make everything fresh and loaded
res, mem = memory_usage_run_queries(
    (db, query, 5, actual_ids, 3)
)  # actual runs to compute time, and memory
eval = evaluate_result(res)
to_print = f"1M\tscore\t{eval[0]}\ttime\t{eval[1]:.2f}\tRAM\t{mem:.2f} MB"
to_print_arr.append(to_print)
print(to_print)


db = VecDB(file_path=PATH_DB_5M, new_db=False)
actual_ids = get_actual_ids_first_k(actual_sorted_ids_20m, 10**6 * 5)
res = run_queries(
    db, query, 5, actual_ids, 1
)  # one run to make everything fresh and loaded
res, mem = memory_usage_run_queries(
    (db, query, 5, actual_ids, 3)
)  # actual runs to compute time, and memory
eval = evaluate_result(res)
to_print = f"5M\tscore\t{eval[0]}\ttime\t{eval[1]:.2f}\tRAM\t{mem:.2f} MB"
to_print_arr.append(to_print)
print(to_print)


db = VecDB(file_path=PATH_DB_10M, new_db=False)
actual_ids = get_actual_ids_first_k(actual_sorted_ids_20m, 10**6 * 10)
res = run_queries(
    db, query, 5, actual_ids, 1
)  # one run to make everything fresh and loaded
res, mem = memory_usage_run_queries(
    (db, query, 5, actual_ids, 3)
)  # actual runs to compute time, and memory
eval = evaluate_result(res)
to_print = f"10M\tscore\t{eval[0]}\ttime\t{eval[1]:.2f}\tRAM\t{mem:.2f} MB"
to_print_arr.append(to_print)
print(to_print)


db = VecDB(file_path=PATH_DB_15M, new_db=False)
actual_ids = get_actual_ids_first_k(actual_sorted_ids_20m, 10**6 * 15)
res = run_queries(
    db, query, 5, actual_ids, 1
)  # one run to make everything fresh and loaded
res, mem = memory_usage_run_queries(
    (db, query, 5, actual_ids, 3)
)  # actual runs to compute time, and memory
eval = evaluate_result(res)
to_print = f"15M\tscore\t{eval[0]}\ttime\t{eval[1]:.2f}\tRAM\t{mem:.2f} MB"
to_print_arr.append(to_print)
print(to_print)


db = VecDB(file_path=PATH_DB_20M, new_db=False)
actual_ids = get_actual_ids_first_k(actual_sorted_ids_20m, 10**6 * 20)
res = run_queries(
    db, query, 5, actual_ids, 1
)  # one run to make everything fresh and loaded
res, mem = memory_usage_run_queries(
    (db, query, 5, actual_ids, 3)
)  # actual runs to compute time, and memory
eval = evaluate_result(res)
to_print = f"20M\tscore\t{eval[0]}\ttime\t{eval[1]:.2f}\tRAM\t{mem:.2f} MB"
to_print_arr.append(to_print)
print(to_print)

In [None]:
print("Team Number", TEAM_NUMBER)
print("\n".join(to_print_arr))