# ADB Phase 2 Project Evaluation Notebook


**Purpose**: This notebook evaluates the performance of a semantic search project by analyzing databases of various sizes.

### Evaluation Focus:
- **Database Sizes**:
  - 1 Million Records
  - 10 Million Records
  - 20 Million Records

For each database size, this notebook will:
- Download the database
- Use the `VecDB` class (implemented by students) to answer the queries
- Evaluate and report retrieval time, accuracy, and RAM usage.

### Project Constraints:
Refer to the project document for details on RAM, Disk, Time, and Score constraints.

### Notebook Structure:
1. **Part 1 - Modifiable Cells**:
   - Includes cells that teams are allowed to modify, specifically for these variables only:
     - GitHub repository link (including PAT token).
     - Google Drive IDs for the index files.
     - Paths for loading existing indexes.

2. **Part 2 - Non-Modifiable Cells**:
   - Contains essential setup and evaluation code that must not be modified.
   - Students should only modify inputs in Part 1 to ensure smooth execution of the notebook.

## Part 1 - Modifiable Cells

Each team must provide a unique GitHub repository link that includes a PAT token. This link will allow the notebook to download the necessary code for evaluation.

In [None]:
!git clone https://github.com/farah-moh/vec_db.git

Cloning into 'vec_db'...
remote: Enumerating objects: 14, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 14 (delta 3), reused 9 (delta 1), pack-reused 0 (from 0)
Receiving objects: 100% (14/14), 6.11 KiB | 6.11 MiB/s, done.
Resolving deltas: 100% (3/3), done.


# Database Path Instructions


Teams need to specify paths for each database (1M, 10M, 20M records) as follows:

1. Zip each database directory/file after generation.
2. Upload the zip file to Google Drive.
3. Share the file with "Anyone with the link."
4. Extract the file ID from the link (e.g., for `https://drive.google.com/file/d/1j1gAU3kvdRqcOoKI5K5FgMMUZpOQANah/view`, the ID is `1j1gAU3kvdRqcOoKI5K5FgMMUZpOQANah`).
5. Assign each ID to the appropriate variable in Part 1.
6. Provide the local path of each database. This path is passed to the `VecDB` initializer so that the database and index load automatically (to be submitted during the final project phase). The path can be a folder name or any other string your code needs.

**Note**: The code will download and unzip these files automatically. Once extracted, the local path for each database should be specified to enable the notebook to load databases and indexes.
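
Step 4 above can be done by hand, but a small helper makes it less error-prone. The following is a minimal sketch, assuming the share link has the `/file/d/<ID>/` shape shown in the example; the function name `extract_drive_file_id` is hypothetical and not part of the notebook.

```python
import re


def extract_drive_file_id(share_link: str) -> str:
    """Return the file ID from a Google Drive link of the form .../file/d/<ID>/..."""
    match = re.search(r"/file/d/([A-Za-z0-9_-]+)", share_link)
    if match is None:
        raise ValueError(f"No Drive file ID found in: {share_link}")
    return match.group(1)


# The example link from the instructions above.
link = "https://drive.google.com/file/d/1j1gAU3kvdRqcOoKI5K5FgMMUZpOQANah/view"
file_id = extract_drive_file_id(link)
print(file_id)
```

The resulting string is what goes into `GDRIVE_ID_DB_1M`, `GDRIVE_ID_DB_10M`, or `GDRIVE_ID_DB_20M`.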

In [None]:
TEAM_NUMBER = 1
GDRIVE_ID_DB_1M = "1XbEP6sU0k0UbuPcQLkHJP7cdPtebpsbD"
GDRIVE_ID_DB_10M = "1Jho9rair77eWdA5-iI_qDT2spJNesuDe"
GDRIVE_ID_DB_20M = "1j1gAU3kvdRqcOoKI5K5FgMMUZpOQANah"
PATH_DB_1M = "saved_db_1m.csv"
PATH_DB_10M = "saved_db_10m.csv"
PATH_DB_20M = "saved_db_20m.csv"

**Seed Number**:
This number will be changed during discussions by the instructor.


In [None]:
SEED_NUMBER = 10
import random
random.seed(SEED_NUMBER)

**Final Submission Checklist**:
Ensure the following items are included in your final submission:
- `TEAM_NUMBER`
- GitHub clone link (with PAT token)
- Google Drive IDs for each database:
  - `GDRIVE_ID_DB_1M`, `GDRIVE_ID_DB_10M`, `GDRIVE_ID_DB_20M`
- Paths for each database:
  - `PATH_DB_1M`, `PATH_DB_10M`, `PATH_DB_20M`
- Project document detailing the work and findings.

## Part 2: Do Not Modify Beyond This Point
### Note:
This section contains setup and evaluation code that should not be edited by students. Only the instructor may modify this section in case of a major bug.


In [None]:
# Autoreload currently does not work on Colab, so it is left disabled
# %load_ext autoreload
# %autoreload 2

In [None]:
%cd vec_db

/content/vec_db


This cell installs any additional requirements that your code needs.


In [None]:
!pip install memory-profiler >> log.txt
!pip install -r requirements.txt



This cell downloads the zip files and unzips them into the current directory.

In [None]:
!gdown $GDRIVE_ID_DB_1M -O saved_db_1m.zip
!gdown $GDRIVE_ID_DB_10M -O saved_db_10m.zip
!gdown $GDRIVE_ID_DB_20M -O saved_db_20m.zip
!unzip saved_db_1m.zip
!unzip saved_db_10m.zip
!unzip saved_db_20m.zip

Downloading...
From: https://drive.google.com/uc?id=14Gp_C3LxLuYIyF-zL5q-xFXKQ8KOFtZp
To: /content/sematic_search_DB/saved_db_100k.zip
100% 28.5M/28.5M [00:00<00:00, 41.9MB/s]
Downloading...
From: https://drive.google.com/uc?id=1XbEP6sU0k0UbuPcQLkHJP7cdPtebpsbD
To: /content/sematic_search_DB/saved_db_1m.zip
100% 28.5M/28.5M [00:00<00:00, 37.6MB/s]
Downloading...
From: https://drive.google.com/uc?id=1DX0tw9YDlvRthjMq3LQ6_aUyp3BvTC1r
To: /content/sematic_search_DB/saved_db_5m.zip
100% 28.5M/28.5M [00:00<00:00, 91.2MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Jho9rair77eWdA5-iI_qDT2spJNesuDe
To: /content/sematic_search_DB/saved_db_10m.zip
100% 28.5M/28.5M [00:00<00:00, 34.3MB/s]
Downloading...
From: https://drive.google.com/uc?id=1sPIgNxIuNDUUBnTvAuPMLryd1mqmgwV3
To: /content/sematic_search_DB/saved_db_15m.zip
100% 28.5M/28.5M [00:00<00:00, 116MB/s]
Downloading...
From: https://drive.google.com/uc?id=1j1gAU3kvdRqcOoKI5K5FgMMUZpOQANah
To: /content/sematic_search_DB/saved_db_2

Download the 20M vectors file and generate the smaller DBs from it

In [None]:
import os

In [None]:
PATH_DB_VECTORS_20M = "OpenSubtitles_en_20M_emb_64.dat"
PATH_DB_VECTORS_10M = "OpenSubtitles_en_10M_emb_64.dat"
PATH_DB_VECTORS_1M = "OpenSubtitles_en_1M_emb_64.dat"
if not os.path.exists(PATH_DB_VECTORS_20M):
    !gdown "1a7KL0BmPeW8SsckllNTtCX42L1gS_8U0" -O "OpenSubtitles_en_20M_emb_64.dat"

In [None]:
import numpy as np
import os

DIMENSION = 64
def create_other_DB_size(input_file, output_file, target_rows, embedding_dim = DIMENSION):
    # Configuration
    dtype = 'float32'

    # 1. Determine the shape of the source file
    # We calculate rows based on file size to be safe, or you can hardcode 20_000_000
    file_size_bytes = os.path.getsize(input_file)
    itemsize = np.dtype(dtype).itemsize
    total_rows = file_size_bytes // (embedding_dim * itemsize)

    print(f"Source detected: {total_rows} rows.")

    # 2. Open source in read mode ('r')
    # This uses almost 0 RAM, it just points to the file on disk
    source_memmap = np.memmap(
        input_file,
        dtype=dtype,
        mode='r',
        shape=(total_rows, embedding_dim)
    )

    # 3. Create the new file in write mode ('w+')
    # We define the shape as the target size (1M, 64)
    dest_memmap = np.memmap(
        output_file,
        dtype=dtype,
        mode='w+',
        shape=(target_rows, embedding_dim)
    )

    # 4. Copy the data
    # This transfers the binary blocks directly
    print("Copying data...")
    dest_memmap[:] = source_memmap[:target_rows]

    # 5. Flush to save changes to disk
    dest_memmap.flush()

    print(f"Success! Saved first {target_rows} rows to {output_file}")

In [None]:
if not os.path.exists(PATH_DB_VECTORS_1M):
    create_other_DB_size(PATH_DB_VECTORS_20M, PATH_DB_VECTORS_1M, 1_000_000)
if not os.path.exists(PATH_DB_VECTORS_10M):
    create_other_DB_size(PATH_DB_VECTORS_20M, PATH_DB_VECTORS_10M, 10_000_000)
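
The truncation performed by `create_other_DB_size` can be checked on a toy file. This is a small sketch under assumed sizes (10 rows of 4 floats instead of 20M rows of 64) and temporary file names; it only illustrates that the copied file holds exactly the first rows of the source.

```python
import os
import tempfile

import numpy as np

dim, total_rows, target_rows = 4, 10, 3  # toy sizes, not the real 20M x 64

with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "source.dat")
    dst_path = os.path.join(tmp, "subset.dat")

    # Write a small float32 matrix as raw bytes, like the .dat embedding files.
    source = np.arange(total_rows * dim, dtype=np.float32).reshape(total_rows, dim)
    source.tofile(src_path)

    # Recover the row count from the file size, as create_other_DB_size does.
    itemsize = np.dtype(np.float32).itemsize
    rows_on_disk = os.path.getsize(src_path) // (dim * itemsize)

    src = np.memmap(src_path, dtype=np.float32, mode="r", shape=(rows_on_disk, dim))
    dst = np.memmap(dst_path, dtype=np.float32, mode="w+", shape=(target_rows, dim))
    dst[:] = src[:target_rows]
    dst.flush()

    # Read the new file back and compare it with the first rows of the source.
    subset = np.fromfile(dst_path, dtype=np.float32).reshape(-1, dim)
    prefix_matches = bool(np.array_equal(subset, source[:target_rows]))
    del src, dst  # release the memory maps before the directory is removed

print(rows_on_disk, subset.shape, prefix_matches)
```

Because the smaller files are prefixes of the 20M file, row `i` refers to the same vector in every DB size.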

Code to generate the queries that will be used for evaluation.

Note: The English sentences will be changed on submission day.

The first sentence is used only for warm-up; the others are used for evaluation.

In [None]:
queries_embed_file = "queries_emb_64.dat"

if not os.path.exists(queries_embed_file):
    from sentence_transformers import SentenceTransformer
    batch_sentences = [
        "Hello World",
        "We are Software Engineering Students",
        "What's the best way to be a good human?",
        "What a good day"
    ]
    model = SentenceTransformer('minishlab/potion-base-2M')
    queries_np = model.encode(batch_sentences, convert_to_numpy=True)
    queries_np = queries_np.astype(np.float32)
    queries_np.tofile(queries_embed_file)
else:
    queries_np = np.fromfile(queries_embed_file, dtype=np.float32).reshape(-1, DIMENSION)

query_dummy = queries_np[0].reshape(1, DIMENSION)
queries = [queries_np[1].reshape(1, DIMENSION), queries_np[2].reshape(1, DIMENSION), queries_np[3].reshape(1, DIMENSION)]
queries_np = queries_np[1:,:]
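
One detail of the split above is worth seeing on stand-in data: `run_queries` indexes its `queries` argument with `queries[i]`, so the warm-up query and the scored queries reach `retrieve` with different shapes. The sketch below uses random values in place of real embeddings.

```python
import numpy as np

DIMENSION = 64
rng = np.random.default_rng(0)
# Stand-in for the four encoded sentences (random values, not real embeddings).
queries_np = rng.random((4, DIMENSION), dtype=np.float32)

query_dummy = queries_np[0].reshape(1, DIMENSION)                   # warm-up query
queries = [queries_np[i].reshape(1, DIMENSION) for i in (1, 2, 3)]  # scored queries

# run_queries calls db.retrieve(queries[i], top_k):
warmup_shape = query_dummy[0].shape  # indexing a (1, 64) array drops the first axis
scored_shape = queries[0].shape      # indexing a list returns the (1, 64) array itself
print(warmup_shape, scored_shape)
```

A `retrieve` implementation that flattens its input first, for example with `np.asarray(query).reshape(-1)`, would accept both shapes.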

Generate the ground-truth sorted IDs from the 20M DB (the smaller DBs reuse them after filtering)

In [None]:
actual_sorted_ids_file = "actual_sorted_ids_20m.dat"
saved_top_k = 30_000
needed_top_k = 10_000
if not os.path.exists(actual_sorted_ids_file):
    vectors = np.memmap(PATH_DB_VECTORS_20M, dtype='float32', mode='r', shape=(20_000_000, DIMENSION))
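    # Cosine similarity of every DB vector against every query: shape (20M, num_queries)
    dot_products = np.dot(vectors, queries_np.T)
    norms = np.linalg.norm(vectors, axis=1)[:, None] * np.linalg.norm(queries_np, axis=1)
    cosine_sim = dot_products / (1e-45 + norms)
    # Sort ascending per query, keep the saved_top_k best, reverse to descending, one row per query
    actual_sorted_ids_20m = np.argsort(cosine_sim, axis=0)[-saved_top_k:][::-1].T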
    actual_sorted_ids_20m = actual_sorted_ids_20m.astype(np.int32)
    actual_sorted_ids_20m.tofile(actual_sorted_ids_file)
else:
    actual_sorted_ids_20m = np.fromfile(actual_sorted_ids_file, dtype=np.int32).reshape(-1, saved_top_k)
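
The ranking expression used for the ground truth can be followed on a toy example. This sketch uses four hand-picked 2-D vectors and two queries (assumed values chosen to avoid ties) and applies the same argsort, slice, reverse, and transpose steps.

```python
import numpy as np

vectors = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 1.0],
                    [-1.0, -1.0]], dtype=np.float32)
queries = np.array([[1.0, 0.0],
                    [0.0, 2.0]], dtype=np.float32)
top_k = 3

# Cosine similarity matrix with one column per query: shape (4, 2).
sims = (vectors @ queries.T) / (
    np.linalg.norm(vectors, axis=1)[:, None] * np.linalg.norm(queries, axis=1)
)

# Ascending argsort per column, keep the last top_k rows, reverse to descending,
# then transpose so that each row holds the ranked IDs for one query.
ranked = np.argsort(sims, axis=0)[-top_k:][::-1].T
print(ranked.tolist())  # [[0, 2, 1], [1, 2, 0]]
```

For the first query the best match is vector 0 (same direction), then vector 2 (45 degrees away), then vector 1 (orthogonal).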

These are the functions for running and reporting

In [None]:
import numpy as np
import os
import time
from dataclasses import dataclass
from typing import List
from memory_profiler import memory_usage
import gc

@dataclass
class Result:
    run_time: float
    top_k: int
    db_ids: List[int]
    actual_ids: List[int]

def run_queries(db, queries, top_k, actual_ids, num_runs):
    """
    Run queries on the database and record results for each query.

    Parameters:
    - db: Database instance to run queries on.
    - queries: List of query vectors.
    - top_k: Number of top results to retrieve.
    - actual_ids: List of actual results to evaluate accuracy.
    - num_runs: Number of query executions to perform for testing.

    Returns:
    - List of Result
    """
    global results
    results = []
    for i in range(num_runs):
        tic = time.time()
        db_ids = db.retrieve(queries[i], top_k)
        toc = time.time()
        run_time = toc - tic
        results.append(Result(run_time, top_k, db_ids, actual_ids[i]))
    return results

def memory_usage_run_queries(args):
    """
    Run queries and measure memory usage during the execution.

    Parameters:
    - args: Arguments to be passed to the run_queries function.

    Returns:
    - results: The results of the run_queries.
    - memory_diff: The peak memory usage during the run minus the memory usage measured just before it.
    """
    global results
    mem_before = max(memory_usage())
    mem = memory_usage(proc=(run_queries, args, {}), interval = 1e-3)
    return results, max(mem) - mem_before

def evaluate_result(results: List[Result]):
    """
    Evaluate the results based on accuracy and runtime.
    Scores are negative. So getting 0 is the best score.

    Parameters:
    - results: A list of Result objects

    Returns:
    - avg_score: The average score across all queries.
    - avg_runtime: The average runtime for all queries.
    """
    scores = []
    run_time = []
    for res in results:
        run_time.append(res.run_time)
        # If the number of retrieved (unique) IDs is not equal to top_k, the score is the lowest possible
        if len(set(res.db_ids)) != res.top_k or len(res.db_ids) != res.top_k:
            scores.append( -1 * len(res.actual_ids) * res.top_k)
            continue
        score = 0
        for id in res.db_ids:
            try:
                ind = res.actual_ids.index(id)
                if ind > res.top_k * 3:
                    score -= ind
            except ValueError:  # the returned id is not in the ground-truth list
                score -= len(res.actual_ids)
        scores.append(score)

    return sum(scores) / len(scores), sum(run_time) / len(run_time)

def get_actual_ids_first_k(actual_sorted_ids, k, out_len = 10_000):
    """
    Build the ground-truth ID lists for a DB of size k.
    The saved IDs are ranked against the 20M database, so for smaller databases the IDs that are >= k (rows that do not exist in that DB) are removed.

    Parameters:
    - actual_sorted_ids: A list of lists containing the sorted actual IDs for each query.
    - k: The DB size (number of rows).
    - out_len: Slice bound applied to the outer list (one entry per query); the per-query lists are not truncated.

    Returns:
    - List of lists containing the actual IDs for each query for this DB.
    """
    return [[id for id in actual_sorted_ids_one_q if id < k] for actual_sorted_ids_one_q in actual_sorted_ids][:out_len]
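
The scoring and filtering rules above are easier to follow on small numbers. The sketch below re-implements the per-query part of `evaluate_result` in a hypothetical helper, `score_one`, and uses an assumed toy ground truth of 100 IDs in which ID `i` has rank `i`. It illustrates the rules and does not replace the evaluation code.

```python
from typing import List


def score_one(db_ids: List[int], actual_ids: List[int], top_k: int) -> int:
    """Score for a single query, following the rules in evaluate_result."""
    # Wrong count or duplicate IDs: lowest possible score.
    if len(set(db_ids)) != top_k or len(db_ids) != top_k:
        return -1 * len(actual_ids) * top_k
    score = 0
    for returned_id in db_ids:
        if returned_id in actual_ids:
            rank = actual_ids.index(returned_id)
            # Ranks up to 3 * top_k are tolerated; beyond that the rank is the penalty.
            if rank > top_k * 3:
                score -= rank
        else:
            # An ID outside the saved ground truth costs the full list length.
            score -= len(actual_ids)
    return score


actual = list(range(100))  # toy ground truth: ID i is the i-th best match
top_k = 5

perfect = score_one([0, 1, 2, 3, 4], actual, top_k)           # 0
tolerated = score_one([10, 11, 12, 13, 15], actual, top_k)    # 0 (rank 15 is not > 15)
one_far = score_one([0, 1, 2, 3, 40], actual, top_k)          # -40
one_missing = score_one([0, 1, 2, 3, 999], actual, top_k)     # -100
wrong_count = score_one([0, 1, 2], actual, top_k)             # -500
print(perfect, tolerated, one_far, one_missing, wrong_count)

# Filtering the 20M ground truth for a smaller DB keeps only IDs below the DB size.
ground_truth_20m = [[5, 12, 3, 7], [9, 1, 15, 2]]
db_size = 10
filtered = [[i for i in per_query if i < db_size] for per_query in ground_truth_20m]
print(filtered)  # [[5, 3, 7], [9, 1, 2]]
```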

This code runs the class you have implemented. The `VecDB` class should take the database path and the index path that you provided.<br>
Note: at submission, I will not run the record insertion.<br>
The queries themselves will be changed on submission day, but the DB will not.
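
The way the notebook calls the class, `VecDB(database_file_path=..., index_file_path=..., new_db=False)` followed by `db.retrieve(query, top_k)`, can be illustrated with a brute-force stand-in. Everything in this sketch is an assumption for illustration: the class name `ToyVecDB`, the unused `index_file_path` and `new_db` arguments, and the exhaustive scan, which ignores the project's RAM and time constraints and is not the expected solution.

```python
import os
import tempfile

import numpy as np

DIMENSION = 64


class ToyVecDB:
    """Hypothetical brute-force stand-in that mimics how the notebook calls VecDB."""

    def __init__(self, database_file_path, index_file_path=None, new_db=False):
        self.db_path = database_file_path
        self.index_path = index_file_path  # unused: a brute-force scan needs no index
        # new_db is accepted only to match the call in the notebook; it is ignored here.
        # Row count derived from the file size (float32 = 4 bytes per value).
        self.n_rows = os.path.getsize(database_file_path) // (DIMENSION * 4)

    def retrieve(self, query, top_k=5):
        # Accept both (64,) and (1, 64) query shapes.
        query = np.asarray(query, dtype=np.float32).reshape(-1)
        # Map the file on every call so that nothing is cached between queries.
        vectors = np.memmap(self.db_path, dtype=np.float32, mode="r",
                            shape=(self.n_rows, DIMENSION))
        norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
        sims = (vectors @ query) / (norms + 1e-12)
        top = np.argsort(sims)[-top_k:][::-1]
        return [int(i) for i in top]


# Usage on a tiny temporary database (100 random vectors, not the real data).
rng = np.random.default_rng(0)
data = rng.standard_normal((100, DIMENSION)).astype(np.float32)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.dat")
    data.tofile(path)
    db = ToyVecDB(database_file_path=path, index_file_path="unused", new_db=False)
    ids = db.retrieve(data[42].reshape(1, DIMENSION), top_k=5)
print(ids)
```

Returning a plain list of `top_k` unique integers matters because `evaluate_result` checks both `len(res.db_ids)` and `len(set(res.db_ids))` against `top_k`.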

In [None]:
# check memory usage for the import line independently
import tracemalloc
tracemalloc.start()
start_snapshot = tracemalloc.take_snapshot()

from vec_db import VecDB

end_snapshot = tracemalloc.take_snapshot()
stats = end_snapshot.compare_to(start_snapshot, 'lineno')
for stat in stats[:5]:  # show top differences
    print(stat)

tracemalloc.stop()
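
The snapshot comparison used above can be tried in isolation. This is a minimal sketch with an assumed toy workload (about 1 MB of `bytes` objects) in place of the `VecDB` import; it shows how `compare_to` reports the allocation difference per source line.

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# Toy workload: 1000 byte strings of 1000 bytes each, roughly 1 MB in total.
payload = [bytes(1000) for _ in range(1000)]

after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Each entry is a StatisticDiff; size_diff is the change in allocated bytes for that line.
stats = after.compare_to(before, "lineno")
total_diff = sum(stat.size_diff for stat in stats)
for stat in stats[:3]:  # show the largest differences
    print(stat)
print(f"total difference: {total_diff / 1e6:.2f} MB")
```

`tracemalloc` follows allocations made through Python's memory allocators, which is presumably why the evaluation loop also measures process memory with `memory_profiler`.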

In [None]:
results = []
to_print_arr = []

In [None]:
print("Team Number", TEAM_NUMBER)
database_info = {
    "1M": {
        "database_file_path": PATH_DB_VECTORS_1M,
        "index_file_path": PATH_DB_1M,
        "size": 10**6
    },
    "10M": {
        "database_file_path": PATH_DB_VECTORS_10M,
        "index_file_path": PATH_DB_10M,
        "size": 10 * 10**6
    },
    "20M": {
        "database_file_path": PATH_DB_VECTORS_20M,
        "index_file_path": PATH_DB_20M,
        "size": 20 * 10**6
    }
}

for db_name, info in database_info.items():
    print("*" * 40)
    print(f"Evaluating DB of size {db_name}")

    # This part added to check RAM usage for the class init function
    tracemalloc.start()
    start_snapshot = tracemalloc.take_snapshot()

    db = VecDB(database_file_path = info["database_file_path"], index_file_path = info["index_file_path"], new_db = False)

    end_snapshot = tracemalloc.take_snapshot()
    stats = end_snapshot.compare_to(start_snapshot, 'lineno')
    for stat in stats[:5]:  # show top differences
        print(stat)
    tracemalloc.stop()

    actual_ids = get_actual_ids_first_k(actual_sorted_ids_20m, info["size"], needed_top_k)
    # Run a dummy query first so that everything is fresh and loaded (warm-up)
    # CRITICAL DON'T CACHE ANYTHING IN THE QUERY FUNCTION

    # This part added to check RAM usage for the run queries with another method
    tracemalloc.start()
    start_snapshot = tracemalloc.take_snapshot()

    res = run_queries(db, query_dummy, 5, actual_ids, 1)

    end_snapshot = tracemalloc.take_snapshot()
    stats = end_snapshot.compare_to(start_snapshot, 'lineno')
    for stat in stats[:5]:  # show top differences
        print(stat)
    tracemalloc.stop()
    # actual runs to evaluate
    res, mem = memory_usage_run_queries((db, queries, 5, actual_ids, 3))
    eval = evaluate_result(res)
    to_print = f"{db_name}\tscore\t{eval[0]}\ttime\t{eval[1]:.2f}\tRAM\t{mem:.2f} MB"
    print(to_print)
    to_print_arr.append(to_print)
    del db
    del actual_ids
    del res
    del mem
    del eval
    gc.collect()

Team Number 1
1M	score	0.0	time	140.53	RAM	156.94 MB


In [None]:
print("Team Number", TEAM_NUMBER)
print("\n".join(to_print_arr))

Team Number 1
1M	score	0.0	time	140.53	RAM	156.94 MB


In [None]:
!git log

In [None]:
!du -h --max-depth=2