# Documentation for Homomorphic Similarity Matrix Computation

## Overview
This script computes a similarity matrix between face image embeddings twice: once with **cleartext operations** and once with **homomorphic encryption** (HE, using the CKKS scheme via TenSEAL). The embeddings are extracted using the **ArcFace model**. The cleartext and decrypted results are compared for accuracy and runtime, and the serialized ciphertext size is reported.

---

## Experimental Setup

### Dataset
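- **Source**: LFW (deep-funneled) development pair lists.
- **Train Pairs**: 2,200 pairs (`pairsDevTrain.txt`).
- **Test Pairs**: 1,000 pairs (`pairsDevTest.txt`).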

### Sampling
- Unique images are collected from each pair list, shuffled, and truncated:
  - **Templates**: 500 images sampled from the train set.
  - **Test Samples**: 300 images sampled from the test set.

### Embedding Details
- **Embedding Extraction**: ArcFace recognition model from the `buffalo_l` pack, applied to each image resized to 112 × 112.
- **Embedding Dimensions**: 512 features per image, L2-normalized.
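
Because `get_arcface_embedding` divides each feature vector by its L2 norm, the cosine similarity between two embeddings reduces to a plain dot product. The following minimal sketch checks this on made-up 4-dimensional vectors; the values and the helper names (`l2_normalize`, `dot`, `cosine`) are illustrative and not part of the script.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Illustrative vectors only; the real embeddings have 512 dimensions.
u = l2_normalize([1.0, 2.0, 3.0, 4.0])
v = l2_normalize([4.0, 3.0, 2.0, 1.0])

# For unit-norm inputs the two quantities agree up to rounding.
assert math.isclose(cosine(u, v), dot(u, v), rel_tol=1e-12)
```

This suggests that the norm terms in a cosine computation carry no extra information once the inputs are normalized.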

### Encryption Scheme
- **CKKS Parameters**:
  - Polynomial modulus degree: 8192.
  - Coefficient modulus bit sizes: `[60, 40, 40, 60]`.
  - Global scale: \( 2^{40} \).
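
A few derived quantities help in reading these parameters. The sketch below is plain arithmetic under the usual CKKS conventions (N/2 slots per ciphertext, one rescale per inner prime); the variable names are illustrative, and the interpretation of the inner primes as the rescale budget is an assumption about how the underlying library uses the chain.

```python
poly_modulus_degree = 8192
coeff_mod_bit_sizes = [60, 40, 40, 60]
embed_dim = 512

# CKKS encodes N/2 real values per ciphertext.
slots = poly_modulus_degree // 2

# Total size of the coefficient modulus chain, in bits.
total_modulus_bits = sum(coeff_mod_bit_sizes)

# Assumption: the first and last primes are reserved, so the inner primes
# give the number of rescale operations available.
rescale_levels = len(coeff_mod_bit_sizes) - 2

# How many 512-d embeddings would fit in one ciphertext if they were packed.
embeddings_per_ciphertext = slots // embed_dim

print(slots, total_modulus_bits, rescale_levels, embeddings_per_ciphertext)
# prints: 4096 200 2 8
```

The script encrypts one embedding per ciphertext, so only 512 of the 4,096 slots carry data.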

---

## Results

### Cleartext Similarity Matrix
- **Runtime**: Computation took **1.02 seconds** for a \( 300 \times 500 \) matrix.
- **Output Files**:
  - **Scores**: Saved in `output_partC/scores.csv`.
  - **Top-10 Similarity Indices**: Saved in `output_partC/top10.csv`.
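
The script fills this matrix with a nested Python loop over all sample/template pairs. Since the embeddings are unit-norm, the same matrix can be obtained with a single matrix product. The following is a minimal sketch on synthetic data: random vectors stand in for the real embeddings, and `unit_rows` is a hypothetical helper, not a function from the script.

```python
import numpy as np

rng = np.random.default_rng(0)  # synthetic data, not LFW embeddings

def unit_rows(x):
    """L2-normalize each row of a 2-D array."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

samples = unit_rows(rng.standard_normal((300, 512))).astype(np.float32)
templates = unit_rows(rng.standard_normal((500, 512))).astype(np.float32)

# For unit-norm rows, cosine similarity is the dot product,
# so the full 300 x 500 matrix is one matrix multiplication.
scores = samples @ templates.T

# Indices of the ten most similar templates per sample, best first.
top10 = np.argsort(-scores, axis=1)[:, :10]

# Spot-check one entry against the explicit cosine definition.
i, j = 3, 7
explicit = np.dot(samples[i], templates[j]) / (
    np.linalg.norm(samples[i]) * np.linalg.norm(templates[j])
)
assert abs(scores[i, j] - explicit) < 1e-5
```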

### Homomorphic Similarity Matrix
- **Runtime**: Computation took **7,800.92 seconds** (~2.17 hours).
- **Output Files**:
  - **Scores**: Saved in `output_partC/scores_dec.csv`.
  - **Top-10 Similarity Indices**: Saved in `output_partC/top10_dec.csv`.

### Accuracy Comparison
The decrypted results from the homomorphic similarity computation were compared to the cleartext results.

- **Absolute Difference** (values as printed, rounded to six decimal places):
  - **Average**: \( 0.000002 \).
  - **Standard Deviation**: \( 0.000000 \).
  - **Maximum**: \( 0.000003 \).
  - **Minimum**: \( 0.000000 \).

- **Top-10 Rank Consistency**:
  - **Column-wise Matching**:
    - All ranks (0 through 9) matched with **100.00%** accuracy.
  - **Exact Top-10 List Matching**:
    - **100.00%** of the lists matched exactly for all test samples.
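
The two rank-consistency metrics can be sketched on a toy example. The score rows below are invented to show a case where the metrics drop below 100%; `top_k` and `rank_consistency` are hypothetical helpers, and the tie-breaking by index is an assumption (the script relies on `np.argsort`).

```python
def top_k(row, k):
    """Indices of the k largest scores, best first (ties broken by index)."""
    return sorted(range(len(row)), key=lambda j: (-row[j], j))[:k]

def rank_consistency(clear, dec, k):
    """Return (per-rank match %, exact-list match %) for two score matrices."""
    n = len(clear)
    top_clear = [top_k(r, k) for r in clear]
    top_dec = [top_k(r, k) for r in dec]
    per_rank = [
        100.0 * sum(tc[c] == td[c] for tc, td in zip(top_clear, top_dec)) / n
        for c in range(k)
    ]
    exact = 100.0 * sum(tc == td for tc, td in zip(top_clear, top_dec)) / n
    return per_rank, exact

# Toy matrices: the second row has two near-tied scores that swap order.
clear = [[0.90, 0.10, 0.50, 0.30],
         [0.20, 0.80, 0.79, 0.10]]
dec = [[0.90, 0.10, 0.50, 0.30],
       [0.20, 0.79, 0.80, 0.10]]

per_rank, exact = rank_consistency(clear, dec, k=2)
print(per_rank, exact)  # prints: [50.0, 50.0] 50.0
```

Near-tied scores are the case where small CKKS approximation errors could reorder a list; no such reordering occurred in the reported run.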

### Ciphertext Size
- **Average Ciphertext Size**: ~334,330 bytes per serialized encrypted embedding (averaged over the first five sample embeddings).
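
To put this figure in context, the arithmetic below compares it with the cleartext size of one embedding and estimates the storage for all 800 encrypted embeddings. It assumes float32 cleartext storage (as in the script) and that every ciphertext has the reported average serialized size.

```python
EMBED_DIM = 512
BYTES_PER_FLOAT32 = 4
avg_ciphertext_bytes = 334_330      # reported average, serialized

cleartext_bytes = EMBED_DIM * BYTES_PER_FLOAT32   # 2,048 bytes
expansion = avg_ciphertext_bytes / cleartext_bytes

n_embeddings = 500 + 300            # templates + test samples
total_bytes = n_embeddings * avg_ciphertext_bytes

print(f"{expansion:.1f}x, {total_bytes / 1e6:.1f} MB")
# prints: 163.2x, 267.5 MB
```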

---

## Runtime Summary
- **Cleartext Similarity**: **1.02 seconds**.
- **Homomorphic Similarity**: **7,800.92 seconds** (~2.17 hours).
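
The totals translate into per-pair costs as follows. This is plain arithmetic on the reported numbers; the variable names are illustrative.

```python
n_samples, n_templates = 300, 500
cleartext_s = 1.02
homomorphic_s = 7800.92

pairs = n_samples * n_templates                  # entries in the matrix
slowdown = homomorphic_s / cleartext_s           # ratio of total runtimes
per_pair_ms = 1000 * homomorphic_s / pairs       # encrypted cost per pair
per_pair_clear_us = 1e6 * cleartext_s / pairs    # cleartext cost per pair

print(f"{pairs} pairs, {slowdown:.0f}x slower, "
      f"{per_pair_ms:.1f} ms vs {per_pair_clear_us:.1f} us per pair")
# prints: 150000 pairs, 7648x slower, 52.0 ms vs 6.8 us per pair
```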

---

## Conclusions
1. **Accuracy**:
   - The homomorphic similarity results matched the cleartext results with minimal differences (absolute difference \( \leq 0.000003 \)).
   - Top-10 ranks were consistent across all test samples.
2. **Performance**:
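   - Homomorphic computation is about 7,650 times slower than cleartext computation. The timed loop excludes encryption itself; the cost comes from three ciphertext multiplications with rotation-based sums and three decryptions for each of the 150,000 pairs.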
3. **Ciphertext Size**:
   - Each encrypted embedding serializes to ~334 KB, about 163 times the 2,048 bytes of its float32 cleartext form, which increases memory usage and communication overhead.

---

## Future Directions
1. Optimize encryption parameters to reduce runtime and ciphertext size.
2. Explore batching strategies for more efficient homomorphic computations.
3. Evaluate performance on larger datasets and real-world scenarios.
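
One possible batching strategy is to pack several templates into the slots of a single ciphertext, so that one slot-wise multiplication scores a query against all of them. The sketch below simulates the idea in cleartext with plain Python lists: it does not call TenSEAL, the sizes are scaled down (`DIM = 4` stands in for 512 and `SLOTS = 16` for 4,096), and `pack` and `packed_dot_products` are hypothetical helpers. Under encryption, the per-segment sums would need rotate-and-add steps rather than a direct slice sum.

```python
DIM = 4      # stand-in for the 512-dimensional embeddings
SLOTS = 16   # stand-in for the 4,096 CKKS slots

def pack(vectors, slots):
    """Concatenate vectors into one slot array, zero-padded to `slots`."""
    flat = [x for v in vectors for x in v]
    return flat + [0.0] * (slots - len(flat))

def packed_dot_products(packed_templates, query, dim, slots):
    """Score one query against every packed template.

    The query is tiled across all segments, multiplied slot-wise with the
    packed templates (one multiplication for the whole batch), and each
    segment of `dim` slots is summed to give one dot product.
    """
    tiled_query = pack([query] * (slots // dim), slots)
    products = [a * b for a, b in zip(packed_templates, tiled_query)]
    return [sum(products[i:i + dim]) for i in range(0, slots, dim)]

templates = [[1.0, 0.0, 0.0, 0.0],
             [0.0, 1.0, 0.0, 0.0],
             [0.5, 0.5, 0.5, 0.5],
             [0.0, 0.0, 0.0, 1.0]]
query = [0.5, 0.5, 0.5, 0.5]

scores = packed_dot_products(pack(templates, SLOTS), query, DIM, SLOTS)
print(scores)  # prints: [0.5, 0.5, 1.0, 0.5]
```

With the parameters used here, eight 512-dimensional embeddings would fit in one ciphertext, so the number of ciphertext multiplications per query could in principle drop by that factor.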

---


In [2]:
# Standard library
import csv
import os
import random
import time

# Third-party
import cv2
import numpy as np
import tenseal as ts

BASE_DIR = os.path.join("dataset", "lfw-deepfunneled", "lfw-deepfunneled")
PAIRS_TRAIN_PATH = "pairsDevTrain.txt"
PAIRS_TEST_PATH = "pairsDevTest.txt"

EMBED_DIM = 512

OUTPUT_DIR = "output_partC"
os.makedirs(OUTPUT_DIR, exist_ok=True)

def normalize_name(name):
    return name.replace(" ", "_")

def load_pairs(pairs_path, base_dir):
    """Load LFW pairs text file -> list of (img1_path, img2_path, label)."""
    pairs = []
    with open(pairs_path, "r") as f:
        lines = f.readlines()[1:]
        for line in lines:
            parts = line.strip().split()
            if len(parts) == 3:
                person, img1, img2 = parts
                person = normalize_name(person)
                img1_path = os.path.join(base_dir, person, f"{person}_{int(img1):04d}.jpg")
                img2_path = os.path.join(base_dir, person, f"{person}_{int(img2):04d}.jpg")
                if os.path.exists(img1_path) and os.path.exists(img2_path):
                    pairs.append((img1_path, img2_path, 1))
            elif len(parts) == 4:
                p1, img1, p2, img2 = parts
                p1, p2 = normalize_name(p1), normalize_name(p2)
                img1_path = os.path.join(base_dir, p1, f"{p1}_{int(img1):04d}.jpg")
                img2_path = os.path.join(base_dir, p2, f"{p2}_{int(img2):04d}.jpg")
                if os.path.exists(img1_path) and os.path.exists(img2_path):
                    pairs.append((img1_path, img2_path, 0))
    return pairs

print("Loading LFW pairs...")
train_pairs = load_pairs(PAIRS_TRAIN_PATH, BASE_DIR)
test_pairs = load_pairs(PAIRS_TEST_PATH, BASE_DIR)
all_pairs = train_pairs + test_pairs
print(f"Train pairs: {len(train_pairs)}, Test pairs: {len(test_pairs)}")

print("Initializing ArcFace model (buffalo_l)...")
import insightface
from insightface.app import FaceAnalysis
app = FaceAnalysis(name="buffalo_l", providers=["CUDAExecutionProvider","CPUExecutionProvider"])
app.prepare(ctx_id=0, det_size=(112, 112))

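def get_arcface_embedding(img_path):
    """Return the L2-normalized ArcFace embedding of one image.

    The image is resized to 112x112 directly, without face detection or alignment.
    """
    bgr_img = cv2.imread(img_path)
    if bgr_img is None:
        raise ValueError(f"Could not load image {img_path}")
    bgr_img = cv2.resize(bgr_img, (112, 112))
    # NOTE: verify the channel order get_feat expects in your insightface version;
    # if it already swaps BGR->RGB internally, this conversion should be dropped.
    rgb_img = cv2.cvtColor(bgr_img, cv2.COLOR_BGR2RGB)
    feat = app.models["recognition"].get_feat(rgb_img)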

    if feat.ndim == 2:
        feat = feat[0]

    feat_norm = feat / np.linalg.norm(feat)
    return feat_norm.astype(np.float32)

def build_embeddings_dict(image_paths):
    """Compute and cache embeddings for each unique image path."""
    emb_dict = {}
    for path in image_paths:
        emb_dict[path] = get_arcface_embedding(path)
    return emb_dict

template_paths = set()
test_sample_paths = set()
for (img1, img2, label) in train_pairs:
    template_paths.add(img1)
    template_paths.add(img2)
for (img1, img2, label) in test_pairs:
    test_sample_paths.add(img1)
    test_sample_paths.add(img2)

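# Sort first: set iteration order is not stable across runs.
template_paths = sorted(template_paths)
test_sample_paths = sorted(test_sample_paths)

print(f"Number of unique template images: {len(template_paths)}")
print(f"Number of unique test-sample images: {len(test_sample_paths)}")

# Fixed seed so the sampled subsets are reproducible (the value 42 is arbitrary).
random.seed(42)
random.shuffle(template_paths)
random.shuffle(test_sample_paths)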

MAX_TEMPLATES = 500
MAX_SAMPLES   = 300

template_paths = template_paths[:MAX_TEMPLATES]
test_sample_paths = test_sample_paths[:MAX_SAMPLES]

print(f"After sampling: #templates = {len(template_paths)}, #samples = {len(test_sample_paths)}")

print("Building embeddings for templates...")
template_emb_dict = build_embeddings_dict(template_paths)

print("Building embeddings for test samples...")
sample_emb_dict = build_embeddings_dict(test_sample_paths)

def cosine_similarity(vec1, vec2):
    """Cosine similarity for 1D arrays."""
    vec1 = vec1.ravel()
    vec2 = vec2.ravel()
    dot = np.dot(vec1, vec2)
    return dot / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

sample_list = test_sample_paths
template_list = template_paths

n = len(sample_list)
m = len(template_list)

scores_matrix = np.zeros((n, m), dtype=np.float32)

print("\nComputing cleartext similarity matrix [n x m] ...")
t0 = time.time()
for i, sample_path in enumerate(sample_list):
    emb_s = sample_emb_dict[sample_path]
    for j, template_path in enumerate(template_list):
        emb_t = template_emb_dict[template_path]
        sim = cosine_similarity(emb_s, emb_t)
        scores_matrix[i, j] = sim
cleartext_time = time.time() - t0
print(f"Cleartext similarity computation took {cleartext_time:.2f} s")

scores_csv_path = os.path.join(OUTPUT_DIR, "scores.csv")
print(f"Writing {scores_csv_path} ...")
np.savetxt(scores_csv_path, scores_matrix, delimiter=",", fmt="%.5f")

top10_indices = []
for i in range(n):
    row = scores_matrix[i, :]
    top10 = np.argsort(-row)[:10]
    top10_indices.append(top10)

top10_csv_path = os.path.join(OUTPUT_DIR, "top10.csv")
print(f"Writing {top10_csv_path} ...")
with open(top10_csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    for i in range(n):
        writer.writerow(top10_indices[i].tolist())

print("\nInitializing CKKS context & encryption keys...")

context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[60, 40, 40, 60]
)
context.global_scale = 2 ** 40
context.generate_galois_keys()

def encrypt_vector(vec):
    return ts.ckks_vector(context, vec)

enc_template_emb_dict = {}
enc_sample_emb_dict = {}

print("Encrypting template embeddings...")
for path in template_list:
    enc_template_emb_dict[path] = encrypt_vector(template_emb_dict[path])

print("Encrypting sample embeddings...")
for path in sample_list:
    enc_sample_emb_dict[path] = encrypt_vector(sample_emb_dict[path])

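def homomorphic_cosine_similarity(enc_vec1, enc_vec2):
    """Cosine similarity of two CKKS-encrypted vectors.

    The dot product and both squared norms are computed under encryption.
    The three scalars are then decrypted (the context holds the secret key),
    and the square roots and division are done in cleartext.
    """
    # Each product-and-sum is one ciphertext multiplication plus rotations.
    dot_product = (enc_vec1 * enc_vec2).sum()
    # The embeddings are already unit-norm, so both norms are close to 1;
    # recomputing them for every pair triples the encrypted work.
    norm1 = (enc_vec1 * enc_vec1).sum()
    norm2 = (enc_vec2 * enc_vec2).sum()

    decrypted_dot = dot_product.decrypt()[0]
    decrypted_norm1 = norm1.decrypt()[0]
    decrypted_norm2 = norm2.decrypt()[0]
    return decrypted_dot / (np.sqrt(decrypted_norm1) * np.sqrt(decrypted_norm2))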

scores_enc_matrix = np.zeros((n, m), dtype=np.float32)

print("\nComputing homomorphic similarity matrix [n x m] ...")
t0 = time.time()
for i, sample_path in enumerate(sample_list):
    enc_s = enc_sample_emb_dict[sample_path]
    for j, template_path in enumerate(template_list):
        enc_t = enc_template_emb_dict[template_path]
        sim_enc = homomorphic_cosine_similarity(enc_s, enc_t)
        scores_enc_matrix[i, j] = sim_enc
homomorphic_time = time.time() - t0
print(f"Encrypted similarity computation took {homomorphic_time:.2f} s")

scores_dec_csv_path = os.path.join(OUTPUT_DIR, "scores_dec.csv")
print(f"Writing {scores_dec_csv_path} ...")
np.savetxt(scores_dec_csv_path, scores_enc_matrix, delimiter=",", fmt="%.5f")

top10_dec_indices = []
for i in range(n):
    row = scores_enc_matrix[i, :]
    top10 = np.argsort(-row)[:10]
    top10_dec_indices.append(top10)

top10_dec_csv_path = os.path.join(OUTPUT_DIR, "top10_dec.csv")
print(f"Writing {top10_dec_csv_path} ...")
with open(top10_dec_csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    for i in range(n):
        writer.writerow(top10_dec_indices[i].tolist())

print("\nComparing cleartext vs. decrypted similarity matrices...")

cleartext_scores = scores_matrix
dec_scores = scores_enc_matrix

diff_matrix = np.abs(cleartext_scores - dec_scores)
avg_diff = diff_matrix.mean()
std_diff = diff_matrix.std()
max_diff = diff_matrix.max()
min_diff = diff_matrix.min()

print(f"Absolute difference: avg={avg_diff:.6f}, std={std_diff:.6f}, "
      f"max={max_diff:.6f}, min={min_diff:.6f}")

top10_clear = []
with open(top10_csv_path, "r") as f:
    reader = csv.reader(f)
    for row in reader:
        top10_clear.append(list(map(int, row)))

top10_dec = []
with open(top10_dec_csv_path, "r") as f:
    reader = csv.reader(f)
    for row in reader:
        top10_dec.append(list(map(int, row)))

match_percentages = []
for col_idx in range(10):
    match_count = 0
    for sample_idx in range(n):
        if top10_clear[sample_idx][col_idx] == top10_dec[sample_idx][col_idx]:
            match_count += 1
    match_percentages.append(100.0 * match_count / n)

print("Top-10 rank consistency (by column):")
for i, pct in enumerate(match_percentages):
    print(f"  Rank {i} match = {pct:.2f}%")

exact_top10_match_count = 0
for sample_idx in range(n):
    if top10_clear[sample_idx] == top10_dec[sample_idx]:
        exact_top10_match_count += 1
exact_top10_pct = 100.0 * exact_top10_match_count / n
print(f"Exact top-10 list match for all columns: {exact_top10_pct:.2f}%")

print("\n=== Runtime Summary ===")
print(f"Cleartext similarity took: {cleartext_time:.2f} s")
print(f"Homomorphic similarity took: {homomorphic_time:.2f} s")

# Serialized ciphertext size, averaged over the first five sample embeddings.
enc_size_samples = []
for path in sample_list[:5]:
    serialized = enc_sample_emb_dict[path].serialize()
    enc_size_samples.append(len(serialized))

avg_ciphertext_size = sum(enc_size_samples)/len(enc_size_samples)
print(f"\nCiphertext size example (~one embedding): ~{avg_ciphertext_size} bytes")

print("\nPart C script completed successfully!")


Loading LFW pairs...
Train pairs: 2200, Test pairs: 1000
Initializing ArcFace model (buffalo_l)...
Applied providers: ['CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}}
find model: C:\Users\Alpha/.insightface\models\buffalo_l\1k3d68.onnx landmark_3d_68 ['None', 3, 192, 192] 0.0 1.0
Applied providers: ['CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}}
find model: C:\Users\Alpha/.insightface\models\buffalo_l\2d106det.onnx landmark_2d_106 ['None', 3, 192, 192] 0.0 1.0
Applied providers: ['CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}}
find model: C:\Users\Alpha/.insightface\models\buffalo_l\det_10g.onnx detection [1, 3, '?', '?'] 127.5 128.0
Applied providers: ['CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}}
find model: C:\Users\Alpha/.insightface\models\buffalo_l\genderage.onnx genderage ['None', 3, 96, 96] 0.0 1.0
Applied providers: ['CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}}
find model: