
Conversation


@DeveloperAmrit DeveloperAmrit commented Jan 7, 2026

Closes #904

Changes done

  • Added face encodings of 1000 celebrities
  • Implemented celebrity detection
  • Marks a cluster with the name of the celebrity if a celebrity is detected in that cluster

Why the second commit, "fix merge issue"?

  • Faced some local merge issues, solved them here

The video was larger than 10 MB, so I have attached a link instead:
https://collection.cloudinary.com/db6cpegxg/5256757d53b4723bee8050226eeaa9ee
or
https://res.cloudinary.com/db6cpegxg/video/upload/v1767805877/Screen_Recording_2026-01-07_at_10.32.52_PM_co6scx.mov

Summary by CodeRabbit

Release Notes

  • New Features
    • Added celebrity identification capability to your photo gallery
    • New "Scan Celebrities" button on the home page triggers background scanning
    • Identified celebrity faces are automatically grouped and organized together
    • Real-time progress indicators show scan status and results


@github-actions github-actions bot added backend enhancement New feature or request frontend labels Jan 7, 2026

coderabbitai bot commented Jan 7, 2026

📝 Walkthrough


This pull request implements a celebrity detection and tagging feature that integrates face embeddings with a pre-loaded celebrity face database. The system identifies detected faces by comparing embeddings against known celebrities using Euclidean distance, automatically associates matched faces with celebrity-named clusters, and provides both backend processing and frontend UI for triggering celebrity scans.
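The Euclidean-distance matching the walkthrough describes can be sketched as follows; this is a minimal illustration with made-up names and random embeddings, not the PR's actual CelebrityMatcher code:

```python
import numpy as np

# Hypothetical pre-loaded celebrity database: one 128-D embedding per name.
names = np.array(["Celebrity A", "Celebrity B", "Celebrity C"])
rng = np.random.default_rng(0)
embeddings_matrix = rng.normal(size=(3, 128))
# L2-normalize rows, mirroring FaceNet-style embeddings.
embeddings_matrix /= np.linalg.norm(embeddings_matrix, axis=1, keepdims=True)

def identify(unknown_embedding, threshold=0.7):
    # Euclidean distance from the unknown face to every known celebrity.
    distances = np.linalg.norm(embeddings_matrix - unknown_embedding, axis=1)
    best = int(np.argmin(distances))
    # Accept the closest match only if it falls within the distance threshold.
    return names[best] if distances[best] < threshold else None

# A slightly perturbed copy of "Celebrity B" should match it.
probe = embeddings_matrix[1] + rng.normal(scale=0.01, size=128)
print(identify(probe))  # → Celebrity B
```

An embedding far from every stored celebrity falls past the threshold and returns None, which is what lets unmatched faces keep their original clusters.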

Changes

  • Database Layer: backend/app/database/face_clusters.py
    Added a db_get_or_create_cluster_by_name() function to retrieve existing clusters by name or create new ones with UUID identifiers.
  • ML/Embedding Models: backend/app/models/CelebrityMatcher.py
    New class loads pre-computed celebrity face encodings from a pickle file, optimizes them into an embeddings matrix, and provides an identify_face() method to match 128D embeddings against celebrities using Euclidean distance with a configurable threshold.
  • Face Detection Integration: backend/app/models/FaceDetector.py
    Integrated a CelebrityMatcher instance; added cluster tracking; after embedding generation, identifies celebrity matches and creates/associates cluster IDs; extended database insertion to pass cluster metadata.
  • Backend Routes: backend/app/routes/celebrity.py
    New APIRouter with a scan_celebrities() endpoint; triggers a background task that processes all stored faces, identifies celebrities via embeddings, queues cluster updates, and performs batched database writes.
  • Demo & Entry Points: backend/celebrity_detection_demo.py, backend/main.py
    New demo script showcases the end-to-end celebrity detection pipeline with error handling; minor formatting adjustment to main.py.
  • Frontend API Layer: frontend/src/api/api-functions/celebrity.ts, frontend/src/api/api-functions/index.ts
    Added a new scanCelebrities() async API helper; re-exported from the index module.
  • Frontend UI: frontend/src/pages/Home/Home.tsx
    Integrated a React Query mutation for the celebrity scan; added a button with a loading state to trigger a background scan from the Home page.

Sequence Diagram(s)

sequenceDiagram
    participant FE as Frontend
    participant API as Backend API
    participant FM as FaceDetector<br/>(CelebrityMatcher)
    participant DB as Database
    
    FE->>API: POST /celebrity/scan
    activate API
    API->>API: schedule background task
    API-->>FE: 202 Accepted
    deactivate API
    
    par Background Processing
        API->>DB: fetch all faces + cluster names
        activate DB
        DB-->>API: [faces with embeddings]
        deactivate DB
        
        loop For each face
            API->>FM: identify_face(embedding)
            activate FM
            FM->>FM: compute Euclidean distances<br/>to celebrity encodings
            FM-->>API: celebrity name (if matched)
            deactivate FM
            
            alt Celebrity matched & differs from current cluster
                API->>DB: get_or_create_cluster_by_name(celebrity_name)
                activate DB
                DB-->>API: cluster_id
                deactivate DB
                API->>API: queue cluster_id update
            end
        end
        
        API->>DB: batch update face_cluster_ids
        activate DB
        DB-->>API: update confirmed
        deactivate DB
    end

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A bunny hops through embeddings bright,
Matching faces to stars of the night,
Cluster by cluster, the celebrities align,
With Euclidean magic—a celebrity design! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 60.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)
  • Title check ✅ Passed: the title clearly and specifically describes the main change, implementing celebrity tagging functionality to address issue #904, which is the primary objective of the PR.
  • Linked Issues check ✅ Passed: the PR implements all core requirements from #904. Celebrity encodings are loaded via CelebrityMatcher, face embeddings are compared against them, and clusters are marked with celebrity names when matches are found.
  • Out of Scope Changes check ✅ Passed: all code changes directly support celebrity detection and tagging. CelebrityMatcher loads encodings, FaceDetector integrates identification, database and API layers support cluster marking, and the frontend provides the scan UI.
  • Description check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

🤖 Fix all issues with AI agents
In @backend/app/database/face_clusters.py:
- Around line 355-377: The db_get_or_create_cluster_by_name function has a race
condition and opens new DB connections per call; change it to attempt the INSERT
first inside a transaction (use "INSERT OR IGNORE" or plain INSERT and catch
sqlite3.IntegrityError), then SELECT the cluster_id if the INSERT was ignored or
failed, ensure commit on success and rollback on error, and accept an optional
cursor/connection parameter (matching the file's existing pattern) so callers
can reuse a transaction/cursor instead of creating many short-lived connections;
update error handling to close or leave the passed-in cursor/connection
untouched only when it was created inside the function.

In @backend/app/models/CelebrityMatcher.py:
- Around line 29-37: The path resolution is overly nested and contains an empty
pass; simplify by replacing the if-not-abs block with a single check that only
tries the fallback when the provided relative path doesn't exist: if not
os.path.isabs(path) and not os.path.exists(path): compute dir_of_file =
os.path.dirname(__file__), potential_path = os.path.join(dir_of_file,
os.path.basename(path)), and if os.path.exists(potential_path): set path =
potential_path. Remove the empty pass and the extra nested if/else.
- Around line 70-97: identify_face currently assumes unknown_embedding is a
128-D vector; add input validation at the start of the method to verify
unknown_embedding is not None and has the expected 128 dimensions (accept either
shape (128,) or (1,128)), optionally converting/flattening to a 1-D numpy array
before computing distances with self.embeddings_matrix, and if the shape is
wrong raise a ValueError or return None and log a clear debug/error message
using logger; reference the identify_face function, self.embeddings_matrix and
self.names when adding the checks so the subsequent np.linalg.norm call is safe.

In @backend/app/models/FaceDetector.py:
- Around line 61-68: The call to db_get_or_create_cluster_by_name in
FaceDetector (after celebrity_matcher.identify_face) can raise DB exceptions and
currently will crash the whole detection flow; wrap the
db_get_or_create_cluster_by_name call in a try/except (catching database-related
exceptions) and on failure log the error with context (including image_id and
name) and append None (or skip) to cluster_ids so processing continues; ensure
the logger used in this file is used for the error message and do not suppress
unexpected exceptions beyond logging unless explicitly intended.

In @backend/app/routes/celebrity.py:
- Around line 10-43: The celebrity scan loads all faces into memory
(db_get_all_faces_with_cluster_names) and calls db_get_or_create_cluster_by_name
per face, causing memory and DB-connection overhead; change
process_celebrity_scan to iterate faces in fixed-size batches (e.g., page
through db_get_all_faces_with_cluster_names or a new db_get_faces_batch
iterator), accumulate updates per batch, open a single DB connection/cursor for
the batch and pass that cursor into db_get_or_create_cluster_by_name to avoid
per-face connections, then call db_update_face_cluster_ids_batch per batch and
continue until all faces are processed.

In @backend/celebrity_detection_demo.py:
- Around line 28-127: The demo never releases resources — call
face_detector.close() and any cleanup on celebrity_matcher (e.g.,
celebrity_matcher.close() if available or celebrity_matcher.shutdown()) before
returning or at the end of run_celebrity_detection_demo; ensure cleanup runs on
all exit paths (after initialization failure, image-not-found/failed-load,
no-faces return, and at normal completion) by invoking the close/shutdown
methods for FaceDetector (face_detector.close()) and CelebrityMatcher
(celebrity_matcher.close()) inside finally blocks or before each early return so
resources are always freed.
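The batched-processing fix suggested for backend/app/routes/celebrity.py can be sketched as follows. This is illustrative only: the in-memory table, the iter_batches helper, and the cursor-reusing get_or_create_cluster function are stand-ins for the PR's real database helpers, not code that exists in the repository:

```python
import sqlite3
import uuid

BATCH_SIZE = 2  # tiny for illustration; a real value might be several hundred

def iter_batches(items, size):
    # Yield fixed-size chunks so we never hold every face in memory at once.
    for i in range(0, len(items), size):
        yield items[i:i + size]

def get_or_create_cluster(cursor, name):
    # Reuses the caller's cursor instead of opening a connection per face.
    # INSERT OR IGNORE makes the get-or-create atomic against the UNIQUE name.
    cursor.execute(
        "INSERT OR IGNORE INTO face_clusters VALUES (?, ?)",
        (str(uuid.uuid4()), name),
    )
    cursor.execute(
        "SELECT cluster_id FROM face_clusters WHERE cluster_name = ?", (name,)
    )
    return cursor.fetchone()[0]

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE face_clusters (cluster_id TEXT, cluster_name TEXT UNIQUE)")

# (face_id, matched celebrity name) pairs standing in for identified faces.
faces = [("face1", "Star X"), ("face2", "Star X"), ("face3", "Star Y")]
updates = []
for batch in iter_batches(faces, BATCH_SIZE):
    for face_id, matched_name in batch:
        updates.append((face_id, get_or_create_cluster(cur, matched_name)))
    conn.commit()  # one commit per batch, not per face

# Both "Star X" faces share one cluster id; "Star Y" gets its own.
print(len({cid for _, cid in updates}))  # → 2
```

The same shape applies when the faces come from a paging query instead of a list: each page is one batch, one connection serves the whole scan, and the batched update call runs once per page.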
🧹 Nitpick comments (7)
frontend/src/api/api-functions/celebrity.ts (1)

3-6: Add TypeScript return type for better type safety.

The function should declare its return type to improve type inference and catch potential API contract changes.

📘 Proposed type definition

Define the response type and add it to the function signature:

+interface ScanCelebritiesResponse {
+  message: string;
+}
+
-export const scanCelebrities = async () => {
+export const scanCelebrities = async (): Promise<ScanCelebritiesResponse> => {
  const response = await apiClient.post('/celebrity/scan');
  return response.data;
};
frontend/src/pages/Home/Home.tsx (1)

35-52: Consider refetching data after successful celebrity scan.

The mutation correctly starts the background scan, but when it completes successfully, the UI displays "Celebrity scan started in background" without indicating when the scan finishes or prompting the user to view results. Users won't see the newly assigned celebrity names without manually refreshing or navigating away and back.

💡 Proposed enhancement with query invalidation

After the scan starts, consider invalidating relevant queries or redirecting to the clusters view so users can see the results:

+import { useQueryClient } from '@tanstack/react-query';
+
 export const Home = () => {
   const dispatch = useDispatch();
+  const queryClient = useQueryClient();
   const images = useSelector(selectImages);
   // ... rest of setup

   const scanMutation = useMutation({
     mutationFn: scanCelebrities,
+    onSuccess: () => {
+      // Invalidate clusters query to trigger refetch with updated celebrity names
+      queryClient.invalidateQueries({ queryKey: ['clusters'] });
+    },
   });

   useMutationFeedback(
     { 
       isPending: scanMutation.isPending, 
       isSuccess: scanMutation.isSuccess, 
       isError: scanMutation.isError, 
       error: scanMutation.error 
     },
     {
       loadingMessage: 'Scanning for celebrities...',
-      successMessage: 'Celebrity scan started in background.',
+      successMessage: 'Celebrity scan complete! Check the Clusters page to see results.',
       errorTitle: 'Scan Failed',
       errorMessage: 'Failed to start celebrity scan.',
     },
   );

Note: This assumes the background scan completes quickly. For longer scans, consider implementing polling or WebSocket updates to notify when the scan finishes.

backend/app/routes/celebrity.py (1)

23-23: Consider making the celebrity matching threshold configurable.

Line 23 uses the default threshold (0.7) from identify_face. Depending on the quality of the celebrity encodings and the detection model, this threshold might need tuning for optimal precision/recall. A threshold that's too low will produce false positives; too high will miss valid matches.

⚙️ Proposed configuration approach

Make the threshold configurable via environment variable or settings:

In backend/app/config/settings.py (or wherever configuration lives):

CELEBRITY_MATCH_THRESHOLD = float(os.getenv("CELEBRITY_MATCH_THRESHOLD", "0.7"))

Then in this file:

+from app.config.settings import CELEBRITY_MATCH_THRESHOLD
+
 def process_celebrity_scan():
     # ... existing setup ...
     
     for face in faces:
         embeddings = face["embeddings"]
         
-        name = matcher.identify_face(embeddings)
+        name = matcher.identify_face(embeddings, threshold=CELEBRITY_MATCH_THRESHOLD)
         
         if name:
             # ... rest of logic ...

This allows tuning without code changes.

backend/app/models/FaceDetector.py (1)

61-68: Consider performance impact of per-face database calls.

Each identified celebrity triggers a synchronous database call to db_get_or_create_cluster_by_name. For images with many faces, this could introduce significant latency. Consider:

  1. Batching cluster creation/retrieval after processing all faces
  2. Caching cluster IDs in memory for frequently identified celebrities
  3. Moving cluster assignment to a background task
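Option 2 (caching cluster IDs in memory) could look roughly like this sketch, where fetch_cluster_id is a hypothetical stand-in for db_get_or_create_cluster_by_name:

```python
from functools import lru_cache

db_calls = 0  # counter only to demonstrate that the cache avoids repeat lookups

@lru_cache(maxsize=None)
def fetch_cluster_id(name):
    # Stand-in for db_get_or_create_cluster_by_name; hits the DB once per name.
    global db_calls
    db_calls += 1
    return f"cluster-{name.lower().replace(' ', '-')}"

# Four identified faces, but only two distinct celebrity names.
for face_name in ["Star X", "Star Y", "Star X", "Star X"]:
    cluster_id = fetch_cluster_id(face_name)

print(db_calls)  # → 2
```

An unbounded cache is only safe here because a cluster name maps stably to one ID; if clusters can be renamed or deleted during a scan, the cache would need invalidation.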
backend/celebrity_detection_demo.py (1)

80-81: Hard-coded threshold duplicates FaceDetector logic.

Line 80 accesses face_detector.yolo_detector.conf_threshold, which is good. However, this creates a dependency on the internal structure of FaceDetector. Consider whether the demo should expose or document this threshold dependency.

backend/app/models/CelebrityMatcher.py (2)

87-90: Performance optimization opportunity: early stopping.

The current implementation computes distances to all celebrity embeddings before finding the minimum. For large celebrity databases, you could potentially stop early once a match below the threshold is found, though this would require different logic (e.g., iterative search or using a spatial index).

For now, the implementation is correct and efficient enough for 1,000 celebrities. Consider optimization only if performance becomes an issue with larger databases.
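If early stopping ever becomes worthwhile, one possible shape (purely illustrative, not the PR's code) is a chunked scan that returns as soon as a chunk yields a distance under the threshold:

```python
import numpy as np

# Synthetic stand-in for a large celebrity database.
rng = np.random.default_rng(1)
embeddings_matrix = rng.normal(size=(1000, 128))
names = np.array([f"celeb_{i}" for i in range(1000)])

def identify_chunked(unknown, threshold=0.7, chunk=128):
    # Scan chunk by chunk; stop at the first chunk containing a close-enough match.
    for start in range(0, len(embeddings_matrix), chunk):
        block = embeddings_matrix[start:start + chunk]
        distances = np.linalg.norm(block - unknown, axis=1)
        best = int(np.argmin(distances))
        if distances[best] < threshold:
            return names[start + best]
    return None

print(identify_chunked(embeddings_matrix[5]))  # → celeb_5
```

Note the tradeoff: this returns the first sufficiently close match, not necessarily the global nearest neighbor, so results can differ from a full argmin scan when several celebrities fall under the threshold.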


70-70: Add documentation for the threshold parameter and provide tuning guidance.

The method already exposes threshold as a parameter with a default of 0.7 (good), but there is no documentation on why this value was chosen or how to tune it for different precision/recall tradeoffs. The embeddings are L2-normalized (confirmed in FaceNet.py), which places 0.7 within the standard range for FaceNet-based matching (typically 0.6–0.8). Update the docstring to explain:

  • The assumption that embeddings are L2-normalized
  • Why 0.7 was selected (empirical basis if available, or note it as a reasonable default)
  • How to adjust the threshold for stricter (lower distance) or looser (higher distance) matching based on your use case
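A docstring covering those three points might read as follows; the exact wording is only a suggestion:

```python
def identify_face(self, unknown_embedding, threshold=0.7):
    """Identify the closest celebrity for a face embedding.

    Assumes embeddings are L2-normalized (as produced by FaceNet.py).
    The default threshold of 0.7 is a reasonable starting point for
    FaceNet-style embeddings, which typically match well in the 0.6-0.8
    distance range. Lower the threshold for stricter matching (fewer
    false positives); raise it for looser matching (fewer missed matches).

    Args:
        unknown_embedding (np.ndarray): The 128-D face embedding to identify.
        threshold (float): Maximum Euclidean distance accepted as a match.

    Returns:
        str: The celebrity name if within threshold, otherwise None.
    """
```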
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a3af96c and 477093e.

⛔ Files ignored due to path filters (3)
  • __pycache__/app.cpython-310.pyc is excluded by !**/*.pyc
  • backend/app/models/celebrity_encodings.pkl is excluded by !**/*.pkl
  • utils/__pycache__/cache.cpython-310.pyc is excluded by !**/*.pyc
📒 Files selected for processing (9)
  • backend/app/database/face_clusters.py
  • backend/app/models/CelebrityMatcher.py
  • backend/app/models/FaceDetector.py
  • backend/app/routes/celebrity.py
  • backend/celebrity_detection_demo.py
  • backend/main.py
  • frontend/src/api/api-functions/celebrity.ts
  • frontend/src/api/api-functions/index.ts
  • frontend/src/pages/Home/Home.tsx
💤 Files with no reviewable changes (1)
  • backend/main.py
🧰 Additional context used
🧬 Code graph analysis (3)
frontend/src/api/api-functions/celebrity.ts (1)
frontend/src/api/axiosConfig.ts (1)
  • apiClient (5-12)
backend/app/routes/celebrity.py (3)
backend/app/models/CelebrityMatcher.py (2)
  • CelebrityMatcher (8-97)
  • identify_face (70-97)
backend/app/database/faces.py (2)
  • db_get_all_faces_with_cluster_names (246-285)
  • db_update_face_cluster_ids_batch (288-341)
backend/app/database/face_clusters.py (1)
  • db_get_or_create_cluster_by_name (355-377)
backend/celebrity_detection_demo.py (4)
backend/app/models/FaceDetector.py (1)
  • FaceDetector (17-91)
backend/app/models/FaceNet.py (2)
  • FaceNet (12-29)
  • get_embedding (20-25)
backend/app/models/CelebrityMatcher.py (2)
  • CelebrityMatcher (8-97)
  • identify_face (70-97)
backend/app/utils/FaceNet.py (1)
  • FaceNet_util_preprocess_image (6-13)
🔇 Additional comments (5)
frontend/src/api/api-functions/index.ts (1)

7-8: LGTM!

The barrel export follows the existing pattern and correctly exposes the new celebrity API functions.

frontend/src/pages/Home/Home.tsx (1)

78-89: LGTM! Clean UI integration.

The button implementation is well-structured with appropriate loading states, disabled state during scanning, and clear visual feedback using the Sparkles icon.

backend/celebrity_detection_demo.py (2)

7-9: LGTM: Path setup is appropriate for a demo script.

The sys.path manipulation correctly adds the backend directory to enable imports. The conditional check prevents duplicate entries.


14-23: LGTM: Robust import error handling.

The try-except block with helpful error messages makes the script user-friendly and easier to debug.

backend/app/models/FaceDetector.py (1)

25-25: CelebrityMatcher.identify_face is thread-safe—no synchronization required.

The identify_face method performs only read-only operations on immutable data (self.embeddings_matrix and self.names), which are loaded once during initialization and never modified. NumPy array reads are inherently thread-safe, so concurrent calls to detect_faces will not cause race conditions in the celebrity matching logic.

Likely an incorrect or invalid review comment.

Comment on lines +355 to +377
def db_get_or_create_cluster_by_name(name: str) -> str:
    """Gets an existing cluster ID by name or creates a new one."""
    conn = sqlite3.connect(DATABASE_PATH)
    cursor = conn.cursor()
    try:
        # Check if exists
        cursor.execute(
            "SELECT cluster_id FROM face_clusters WHERE cluster_name = ?", (name,)
        )
        result = cursor.fetchone()
        if result:
            return result[0]

        # Create new
        cluster_id = str(uuid.uuid4())
        cursor.execute(
            "INSERT INTO face_clusters (cluster_id, cluster_name) VALUES (?, ?)",
            (cluster_id, name),
        )
        conn.commit()
        return cluster_id
    finally:
        conn.close()

⚠️ Potential issue | 🔴 Critical

Critical race condition in get-or-create pattern.

The check-then-insert logic is vulnerable to race conditions when multiple threads call this function concurrently with the same celebrity name. If two calls pass the SELECT check before either completes the INSERT, one will fail with a constraint violation.

Additionally, this function opens a new database connection for every call. When invoked in a loop (as in backend/app/routes/celebrity.py line 32), this creates many short-lived connections, which is inefficient.

🔒 Proposed fix using INSERT OR IGNORE with proper transaction handling
-def db_get_or_create_cluster_by_name(name: str) -> str:
-    """Gets an existing cluster ID by name or creates a new one."""
-    conn = sqlite3.connect(DATABASE_PATH)
-    cursor = conn.cursor()
-    try:
-        # Check if exists
-        cursor.execute(
-            "SELECT cluster_id FROM face_clusters WHERE cluster_name = ?", (name,)
-        )
-        result = cursor.fetchone()
-        if result:
-            return result[0]
-
-        # Create new
-        cluster_id = str(uuid.uuid4())
-        cursor.execute(
-            "INSERT INTO face_clusters (cluster_id, cluster_name) VALUES (?, ?)",
-            (cluster_id, name),
-        )
-        conn.commit()
-        return cluster_id
-    finally:
-        conn.close()
+def db_get_or_create_cluster_by_name(
+    name: str, cursor: Optional[sqlite3.Cursor] = None
+) -> str:
+    """Gets an existing cluster ID by name or creates a new one.
+    
+    Args:
+        name: The cluster name to look up or create
+        cursor: Optional existing database cursor. If None, creates a new connection.
+        
+    Returns:
+        The cluster_id (existing or newly created)
+    """
+    own_connection = cursor is None
+    if own_connection:
+        conn = sqlite3.connect(DATABASE_PATH)
+        cursor = conn.cursor()
+
+    try:
+        # Try to insert first with a new UUID
+        cluster_id = str(uuid.uuid4())
+        try:
+            cursor.execute(
+                "INSERT INTO face_clusters (cluster_id, cluster_name) VALUES (?, ?)",
+                (cluster_id, name),
+            )
+            if own_connection:
+                conn.commit()
+            return cluster_id
+        except sqlite3.IntegrityError:
+            # Cluster name already exists, fetch the existing ID
+            if own_connection:
+                conn.rollback()
+            cursor.execute(
+                "SELECT cluster_id FROM face_clusters WHERE cluster_name = ?", (name,)
+            )
+            result = cursor.fetchone()
+            if result:
+                return result[0]
+            else:
+                # Should not happen, but handle gracefully
+                raise ValueError(f"Failed to create or retrieve cluster for name: {name}")
+    except Exception:
+        if own_connection:
+            conn.rollback()
+        raise
+    finally:
+        if own_connection:
+            conn.close()

This approach:

  1. Uses INSERT first (optimistic path for new names)
  2. Catches IntegrityError if name already exists
  3. Falls back to SELECT to retrieve existing cluster_id
  4. Accepts optional cursor to enable reuse in transactions (matches pattern of other functions in this file)
  5. Includes proper error handling and rollback

Committable suggestion skipped: line range outside the PR's diff.


Comment on lines +29 to +37
        if not os.path.isabs(path):
            if os.path.exists(path):
                pass
            else:
                # Check relative to this file location as fallback
                dir_of_file = os.path.dirname(__file__)
                potential_path = os.path.join(dir_of_file, os.path.basename(path))
                if os.path.exists(potential_path):
                    path = potential_path

🛠️ Refactor suggestion | 🟠 Major

Simplify path resolution logic.

The nested conditionals for path resolution (lines 29-37) are hard to follow and have an unusual structure with an empty pass statement. This logic could be simplified.

♻️ Cleaner path resolution
-        # Resolve absolute path if relative path is provided
-        if not os.path.isabs(path):
-            if os.path.exists(path):
-                pass
-            else:
-                 # Check relative to this file location as fallback
-                 dir_of_file = os.path.dirname(__file__)
-                 potential_path = os.path.join(dir_of_file, os.path.basename(path))
-                 if os.path.exists(potential_path):
-                     path = potential_path
+        # Resolve relative paths: first try as-is, then relative to this module
+        if not os.path.isabs(path) and not os.path.exists(path):
+            module_dir = os.path.dirname(__file__)
+            fallback_path = os.path.join(module_dir, os.path.basename(path))
+            if os.path.exists(fallback_path):
+                path = fallback_path

Comment on lines +43 to +45
        try:
            with open(path, 'rb') as f:
                data = pickle.load(f)

⚠️ Potential issue | 🔴 Critical

CRITICAL: Pickle deserialization security risk.

Using pickle.load() on a file can execute arbitrary code if the pickle file is malicious or has been tampered with. Since the encodings file path is configurable and could potentially be user-controlled or loaded from an untrusted source, this is a security vulnerability.

Consider these alternatives:

  1. Use a safer serialization format like JSON or NumPy's .npz format
  2. If pickle is required, validate the file's cryptographic signature before loading
  3. Document that the encodings file must only come from trusted sources and implement strict file permission checks
🔒 Example fix using NumPy's safer format
# When saving encodings (in your encoding generation script):
np.savez_compressed(
    'celebrity_encodings.npz',
    names=names_array,
    embeddings=embeddings_array
)

# In _load_and_optimize:
try:
    data = np.load(path, allow_pickle=False)  # Disable pickle for safety
    self.names = data['names']
    self.embeddings_matrix = data['embeddings']
    logger.info(f"Loaded {len(self.names)} celebrity encodings.")
except Exception as e:
    logger.error(f"Failed to load celebrity encodings: {e}")

Alternatively, if you must use pickle for backward compatibility, add this warning:

         try:
+            # WARNING: Only load pickle files from trusted sources
+            # Pickle can execute arbitrary code during deserialization
             with open(path, 'rb') as f:
                 data = pickle.load(f)

Comment on lines +70 to +97
    def identify_face(self, unknown_embedding, threshold=0.7):
        """
        Identifies the celebrity from the given embedding using Euclidean distance.

        Args:
            unknown_embedding (np.ndarray): The 128D embedding of the face to identify.
            threshold (float): The distance threshold for a match.

        Returns:
            str: The name of the celebrity if found, otherwise None.
        """
        if self.embeddings_matrix is None or unknown_embedding is None:
            return None

        # Ensure unknown_embedding is the correct shape for broadcasting
        # Calculate Euclidean distance: sqrt(sum((x - y)^2))
        # axis=1 to calculate distance for each row in the matrix
        distances = np.linalg.norm(self.embeddings_matrix - unknown_embedding, axis=1)

        min_distance_idx = np.argmin(distances)
        min_distance = distances[min_distance_idx]

        logger.debug(f"Closest match: {self.names[min_distance_idx]} with distance: {min_distance}")

        if min_distance < threshold:
            return self.names[min_distance_idx]
        else:
            return None

⚠️ Potential issue | 🟠 Major

Add input validation for embedding dimension.

The identify_face method doesn't validate that unknown_embedding has the expected shape (128D). If an incorrect embedding is passed, the distance calculation could fail or produce incorrect results.

✅ Proposed validation
     def identify_face(self, unknown_embedding, threshold=0.7):
         """
         Identifies the celebrity from the given embedding using Euclidean distance.
         
         Args:
             unknown_embedding (np.ndarray): The 128D embedding of the face to identify.
             threshold (float): The distance threshold for a match.
             
         Returns:
             str: The name of the celebrity if found, otherwise None.
         """
         if self.embeddings_matrix is None or unknown_embedding is None:
             return None
+        
+        # Validate embedding shape
+        if unknown_embedding.shape[-1] != 128:
+            logger.error(f"Invalid embedding dimension: {unknown_embedding.shape}. Expected 128D.")
+            return None
+        
+        # Ensure 1D array for distance calculation
+        if unknown_embedding.ndim > 1:
+            unknown_embedding = unknown_embedding.flatten()

         # Calculate Euclidean distance
         distances = np.linalg.norm(self.embeddings_matrix - unknown_embedding, axis=1)

Comment on lines +61 to +68
# Match celebrity
name = self.celebrity_matcher.identify_face(embedding)
if name:
logger.info(f"Identified {name} in image {image_id}")
cluster_id = db_get_or_create_cluster_by_name(name)
cluster_ids.append(cluster_id)
else:
cluster_ids.append(None)
⚠️ Potential issue | 🟠 Major

Add error handling for database operations.

The call to db_get_or_create_cluster_by_name at line 65 could fail due to database errors, but there's no exception handling. If this fails, the entire face detection operation would crash.

🛡️ Proposed fix with error handling
                # Match celebrity
                name = self.celebrity_matcher.identify_face(embedding)
                if name:
                    logger.info(f"Identified {name} in image {image_id}")
-                   cluster_id = db_get_or_create_cluster_by_name(name)
-                   cluster_ids.append(cluster_id)
+                   try:
+                       cluster_id = db_get_or_create_cluster_by_name(name)
+                       cluster_ids.append(cluster_id)
+                   except Exception as e:
+                       logger.error(f"Failed to get/create cluster for {name}: {e}")
+                       cluster_ids.append(None)
                else:
                    cluster_ids.append(None)
🤖 Prompt for AI Agents
In @backend/app/models/FaceDetector.py around lines 61 - 68, The call to
db_get_or_create_cluster_by_name in FaceDetector (after
celebrity_matcher.identify_face) can raise DB exceptions and currently will
crash the whole detection flow; wrap the db_get_or_create_cluster_by_name call
in a try/except (catching database-related exceptions) and on failure log the
error with context (including image_id and name) and append None (or skip) to
cluster_ids so processing continues; ensure the logger used in this file is used
for the error message and do not suppress unexpected exceptions beyond logging
unless explicitly intended.

Comment on lines +10 to +43
def process_celebrity_scan():
try:
logger.info("Starting background celebrity scan...")
matcher = CelebrityMatcher()
faces = db_get_all_faces_with_cluster_names()
logger.info(f"Scanning {len(faces)} faces for celebrity matches...")

updates = []
matches_found = 0
for face in faces:
embeddings = face["embeddings"]

# Identify face
name = matcher.identify_face(embeddings)

if name:
matches_found += 1
# Check if already named correctly
current_name = face.get("cluster_name")

# If current name is different (or None), calculate cluster ID and queue update
if current_name != name:
cluster_id = db_get_or_create_cluster_by_name(name)
updates.append({"face_id": face["face_id"], "cluster_id": cluster_id})
logger.debug(f"Face {face['face_id']} matched as {name}. Queued for update.")

if updates:
db_update_face_cluster_ids_batch(updates)
logger.info(f"Successfully updated {len(updates)} faces with celebrity names.")
else:
logger.info(f"Scan complete. Found {matches_found} matches, but no updates were needed.")

except Exception as e:
logger.error(f"Error during celebrity scan: {e}")
⚠️ Potential issue | 🟠 Major

Performance concern: loading all faces into memory.

Line 14 loads ALL faces from the database into memory at once. For large galleries with thousands or tens of thousands of faces, this could cause memory pressure or even OOM errors. Additionally, each call to db_get_or_create_cluster_by_name (line 32) opens a new database connection, creating connection overhead when processing many faces.

⚡ Proposed optimization with batched processing

Consider processing faces in batches and reusing database connections:

 def process_celebrity_scan():
+    global _scan_in_progress
+    BATCH_SIZE = 100  # Process faces in batches
+    
     try:
         logger.info("Starting background celebrity scan...")
         matcher = CelebrityMatcher()
-        faces = db_get_all_faces_with_cluster_names()
-        logger.info(f"Scanning {len(faces)} faces for celebrity matches...")
+        
+        # Get total count for logging
+        import sqlite3
+        from app.config.settings import DATABASE_PATH
+        conn = sqlite3.connect(DATABASE_PATH)
+        cursor = conn.cursor()
+        cursor.execute("SELECT COUNT(*) FROM faces")
+        total_faces = cursor.fetchone()[0]
+        logger.info(f"Scanning {total_faces} faces for celebrity matches...")
         
         updates = []
         matches_found = 0
-        for face in faces:
+        processed = 0
+        
+        # Process in batches using LIMIT/OFFSET
+        while processed < total_faces:
+            cursor.execute("""
+                SELECT f.face_id, f.embeddings, fc.cluster_name
+                FROM faces f
+                LEFT JOIN face_clusters fc ON f.cluster_id = fc.cluster_id
+                ORDER BY f.face_id
+                LIMIT ? OFFSET ?
+            """, (BATCH_SIZE, processed))
+            
+            import json
+            import numpy as np
+
+            batch_faces = []
+            for row in cursor.fetchall():
+                face_id, embeddings_json, cluster_name = row
+                embeddings = np.array(json.loads(embeddings_json))
+                batch_faces.append({
+                    "face_id": face_id,
+                    "embeddings": embeddings,
+                    "cluster_name": cluster_name,
+                })
+            
+            if not batch_faces:
+                break
+                
+            for face in batch_faces:
+                embeddings = face["embeddings"]
-            
-            # Identify face
-            name = matcher.identify_face(embeddings)
-            
-            if name:
-                matches_found += 1
-                # Check if already named correctly
-                current_name = face.get("cluster_name")
-                
-                # If current name is different (or None), calculate cluster ID and queue update
-                if current_name != name:
-                    cluster_id = db_get_or_create_cluster_by_name(name)
-                    updates.append({"face_id": face["face_id"], "cluster_id": cluster_id})
-                    logger.debug(f"Face {face['face_id']} matched as {name}. Queued for update.")
+                
+                name = matcher.identify_face(embeddings)
+                
+                if name:
+                    matches_found += 1
+                    current_name = face.get("cluster_name")
+                    
+                    if current_name != name:
+                        # Reuse cursor for cluster creation
+                        cluster_id = db_get_or_create_cluster_by_name(name, cursor=cursor)
+                        updates.append({"face_id": face["face_id"], "cluster_id": cluster_id})
+                        logger.debug(f"Face {face['face_id']} matched as {name}. Queued for update.")
+            
+            processed += len(batch_faces)
+            logger.info(f"Processed {processed}/{total_faces} faces ({matches_found} matches found so far)...")
         
         if updates:
-            db_update_face_cluster_ids_batch(updates)
+            db_update_face_cluster_ids_batch(updates, cursor=cursor)
             logger.info(f"Successfully updated {len(updates)} faces with celebrity names.")
         else:
             logger.info(f"Scan complete. Found {matches_found} matches, but no updates were needed.")
-
+        
+        conn.commit()
     except Exception as e:
         logger.error(f"Error during celebrity scan: {e}")
+        if 'conn' in locals():
+            conn.rollback()
+    finally:
+        if 'conn' in locals():
+            conn.close()
+        with _scan_lock:
+            _scan_in_progress = False

Note: This requires updating db_get_or_create_cluster_by_name to accept an optional cursor parameter (as suggested in the earlier review comment for that function).

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In @backend/app/routes/celebrity.py around lines 10 - 43, The celebrity scan
loads all faces into memory (db_get_all_faces_with_cluster_names) and calls
db_get_or_create_cluster_by_name per face, causing memory and DB-connection
overhead; change process_celebrity_scan to iterate faces in fixed-size batches
(e.g., page through db_get_all_faces_with_cluster_names or a new
db_get_faces_batch iterator), accumulate updates per batch, open a single DB
connection/cursor for the batch and pass that cursor into
db_get_or_create_cluster_by_name to avoid per-face connections, then call
db_update_face_cluster_ids_batch per batch and continue until all faces are
processed.
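Independent of this app's schema, the LIMIT/OFFSET paging loop the prompt describes can be sketched as a generator — a minimal example against an in-memory SQLite table (the `faces` table and `face_id` column here are illustrative, not the project's real schema):

```python
import sqlite3

def iter_batches(conn, batch_size=2):
    """Yield rows from a hypothetical `faces` table in fixed-size batches."""
    offset = 0
    while True:
        rows = conn.execute(
            "SELECT face_id FROM faces ORDER BY face_id LIMIT ? OFFSET ?",
            (batch_size, offset),
        ).fetchall()
        if not rows:
            break
        yield rows
        offset += len(rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE faces (face_id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO faces VALUES (?)", [(i,) for i in range(5)])
batches = list(iter_batches(conn, batch_size=2))
assert [len(b) for b in batches] == [2, 2, 1]
```

A single connection serves every page, so per-face connection churn disappears; only the working batch is ever held in memory.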

Comment on lines +45 to +51
@router.post("/scan")
def scan_celebrities(background_tasks: BackgroundTasks):
"""
Triggers a background scan of all existing faces to identify celebrities.
"""
background_tasks.add_task(process_celebrity_scan)
return {"message": "Celebrity scan started in background."}
⚠️ Potential issue | 🟠 Major

Add authentication and prevent concurrent scans.

The endpoint lacks authentication and has no protection against concurrent scans. Any user can trigger multiple celebrity scans simultaneously, which could overwhelm the system or cause database contention.

🔐 Proposed fix with authentication and concurrency control

Add a simple flag to prevent concurrent scans and require authentication:

+import threading
+
+from fastapi import HTTPException
+
+# Module-level lock to prevent concurrent scans
+_scan_lock = threading.Lock()
+_scan_in_progress = False
+
 @router.post("/scan")
 def scan_celebrities(background_tasks: BackgroundTasks):
     """
     Triggers a background scan of all existing faces to identify celebrities.
+    Returns 409 if a scan is already in progress.
     """
+    global _scan_in_progress
+
+    # Check-and-set under the lock; FastAPI ignores Flask-style tuple returns,
+    # so signal the conflict with an HTTPException instead.
+    with _scan_lock:
+        if _scan_in_progress:
+            raise HTTPException(status_code=409, detail="Celebrity scan already in progress.")
+        _scan_in_progress = True
+
     background_tasks.add_task(process_celebrity_scan)
     return {"message": "Celebrity scan started in background."}

Update process_celebrity_scan to reset the flag:

 def process_celebrity_scan():
+    global _scan_in_progress
     try:
         logger.info("Starting background celebrity scan...")
         # ... existing code ...
     except Exception as e:
         logger.error(f"Error during celebrity scan: {e}")
+    finally:
+        with _scan_lock:
+            _scan_in_progress = False
+        logger.info("Celebrity scan completed.")

Authentication note: If this application has authentication middleware, ensure this endpoint is protected. If not, consider adding at least a simple API key or rate limiting.
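The API-key idea can be sketched with the standard library alone — a hedged example using `secrets.compare_digest` for constant-time comparison (the `X-API-Key` header name and the key value are illustrative, not part of this PR):

```python
import secrets

EXPECTED_API_KEY = "change-me"  # illustrative; load from config or an env var in practice

def is_authorized(headers: dict) -> bool:
    """Constant-time check of an X-API-Key header against the expected key."""
    provided = headers.get("X-API-Key", "")
    return secrets.compare_digest(provided, EXPECTED_API_KEY)

assert is_authorized({"X-API-Key": "change-me"})
assert not is_authorized({"X-API-Key": "wrong"})
assert not is_authorized({})
```

`compare_digest` avoids the early-exit timing leak of `==`; in a FastAPI app this check would typically live in a dependency so every protected route shares it.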

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +28 to +127
def run_celebrity_detection_demo(image_path):
print(f"--- Starting Celebrity Detection on {image_path} ---")

# 1. Initialize Models
print("Initializing models...")
try:
# Note: FaceDetector initializes its own FaceNet internally, but for this custom pipeline
# as requested, we might want to use individual components if FaceDetector is too coupled.
# However, FaceDetector.yolo_detector is accessible.
face_detector = FaceDetector()

# We need a separate FaceNet instance for our explicit embedding step
# if we don't want to use face_detector internals,
# OR we can reuse face_detector.facenet if it exposes what we need.
# FaceDetector has self.facenet.
facenet = face_detector.facenet

# Use default path (rel to CelebrityMatcher file)
celebrity_matcher = CelebrityMatcher()

except Exception as e:
print(f"Failed to initialize models: {e}")
return

# 2. Load Image
if not os.path.exists(image_path):
print(f"Image not found at {image_path}")
return

img = cv2.imread(image_path)
if img is None:
print(f"Failed to load image from {image_path}")
return

# 3. Detect Faces
# Using the YOLO detector inside FaceDetector as requested ("Use FaceDetector to find face bounding boxes")
# FaceDetector.detect_faces() does everything, but we want to demonstrate the pipeline steps.
# So we access the underlying YOLO detector.
print("Detecting faces...")
# The yolo_detector call returns boxes, scores, class_ids
boxes, scores, class_ids = face_detector.yolo_detector(img)

if len(boxes) == 0:
print("No faces detected.")
return

print(f"Found {len(boxes)} faces.")

# 4. Process Each Face
for i, (box, score) in enumerate(zip(boxes, scores)):
# Filter by confidence if needed (YOLO class usually handles this internally or returns all)
# FaceDetector uses 0.45 threshold.
if score < face_detector.yolo_detector.conf_threshold:
continue

x1, y1, x2, y2 = map(int, box)
print(f"Processing Face {i+1} at [{x1}, {y1}, {x2}, {y2}]...")

# Crop Face (with some padding like FaceDetector does)
padding = 20
h, w, _ = img.shape
face_img = img[
max(0, y1 - padding) : min(h, y2 + padding),
max(0, x1 - padding) : min(w, x2 + padding)
]

if face_img.size == 0:
print("Empty crop, skipping.")
continue

# 5. Preprocess for FaceNet (Resize to 160x160)
# The user requested: "preprocess it (resize to 160x160)"
# FaceNet_util_preprocess_image handles: Resize -> RGB -> Transpose -> ExpandDims -> Normalize
# We'll use the util to ensure compatibility with the model.
# If manual resize is strictly required separately:
# resized_face = cv2.resize(face_img, (160, 160))
# preprocessed_face = FaceNet_util_preprocess_image_from_resized(resized_face)
# But FaceNet_util_preprocess_image includes the resize, so we use it directly.

try:
# FaceNet expects the processed tensor
preprocessed_face = FaceNet_util_preprocess_image(face_img)

# 6. Get Embedding
embedding = facenet.get_embedding(preprocessed_face)

# 7. Match Celebrity
name = celebrity_matcher.identify_face(embedding)

if name:
print(f"Result: Found {name} at [{x1}, {y1}, {x2}, {y2}]")
# Optional: specific logging format
else:
print(f"Result: Unknown person at [{x1}, {y1}, {x2}, {y2}]")

except Exception as e:
print(f"Error processing face {i+1}: {e}")

print("--- Scanning Complete ---")

⚠️ Potential issue | 🟡 Minor

Add resource cleanup for models.

The demo initializes face_detector and celebrity_matcher but never closes them. The FaceDetector class has a close() method (line 81-90 in FaceDetector.py) that should be called to release resources.

🧹 Proposed fix with proper cleanup
 def run_celebrity_detection_demo(image_path):
     print(f"--- Starting Celebrity Detection on {image_path} ---")

     # 1. Initialize Models
     print("Initializing models...")
     try:
         face_detector = FaceDetector()
         facenet = face_detector.facenet
         celebrity_matcher = CelebrityMatcher()
         
     except Exception as e:
         print(f"Failed to initialize models: {e}")
         return

+    try:
-    # 2. Load Image
-    if not os.path.exists(image_path):
-        print(f"Image not found at {image_path}")
-        return
+        # 2. Load Image
+        if not os.path.exists(image_path):
+            print(f"Image not found at {image_path}")
+            return
+
         # ... rest of the function logic ...
+
+        print("--- Scanning Complete ---")
+    finally:
+        # Clean up resources
+        if face_detector:
+            face_detector.close()

-    print("--- Scanning Complete ---")

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In @backend/celebrity_detection_demo.py around lines 28 - 127, The demo never
releases resources — call face_detector.close() and any cleanup on
celebrity_matcher (e.g., celebrity_matcher.close() if available or
celebrity_matcher.shutdown()) before returning or at the end of
run_celebrity_detection_demo; ensure cleanup runs on all exit paths (after
initialization failure, image-not-found/failed-load, no-faces return, and at
normal completion) by invoking the close/shutdown methods for FaceDetector
(face_detector.close()) and CelebrityMatcher (celebrity_matcher.close()) inside
finally blocks or before each early return so resources are always freed.
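The always-run-cleanup pattern the prompt describes can be shown with a dummy resource — a sketch with a stand-in class, not the project's actual `FaceDetector`:

```python
class DummyDetector:
    """Stand-in for the demo's FaceDetector: records whether close() ran."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

def run_demo(detector, fail=False):
    """Model of the demo body: cleanup runs on every exit path via finally."""
    try:
        if fail:
            raise RuntimeError("simulated processing error")
        return "ok"
    finally:
        detector.close()

good = DummyDetector()
assert run_demo(good) == "ok" and good.closed

bad = DummyDetector()
try:
    run_demo(bad, fail=True)
except RuntimeError:
    pass
assert bad.closed  # close() ran even though processing raised
```

A single `try`/`finally` around the body replaces per-return cleanup calls, so early returns and exceptions cannot leak the detector.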

Successfully merging this pull request may close these issues.

Feat: Add celebrity tagging and search in clusters

1 participant