Fix 904 celebrity tagging #990
base: main
Conversation
📝 Walkthrough

This pull request implements a celebrity detection and tagging feature that integrates face embeddings with a pre-loaded celebrity face database. The system identifies detected faces by comparing embeddings against known celebrities using Euclidean distance, automatically associates matched faces with celebrity-named clusters, and provides both backend processing and frontend UI for triggering celebrity scans.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant FE as Frontend
    participant API as Backend API
    participant FM as FaceDetector<br/>(CelebrityMatcher)
    participant DB as Database
    FE->>API: POST /celebrity/scan
    activate API
    API->>API: schedule background task
    API-->>FE: 202 Accepted
    deactivate API
    par Background Processing
        API->>DB: fetch all faces + cluster names
        activate DB
        DB-->>API: [faces with embeddings]
        deactivate DB
        loop For each face
            API->>FM: identify_face(embedding)
            activate FM
            FM->>FM: compute Euclidean distances<br/>to celebrity encodings
            FM-->>API: celebrity name (if matched)
            deactivate FM
            alt Celebrity matched & differs from current cluster
                API->>DB: get_or_create_cluster_by_name(celebrity_name)
                activate DB
                DB-->>API: cluster_id
                deactivate DB
                API->>API: queue cluster_id update
            end
        end
        API->>DB: batch update face_cluster_ids
        activate DB
        DB-->>API: update confirmed
        deactivate DB
    end
```
Estimated Code Review Effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 8
🤖 Fix all issues with AI agents
In @backend/app/database/face_clusters.py:
- Around line 355-377: The db_get_or_create_cluster_by_name function has a race
condition and opens new DB connections per call; change it to attempt the INSERT
first inside a transaction (use "INSERT OR IGNORE" or plain INSERT and catch
sqlite3.IntegrityError), then SELECT the cluster_id if the INSERT was ignored or
failed, ensure commit on success and rollback on error, and accept an optional
cursor/connection parameter (matching the file's existing pattern) so callers
can reuse a transaction/cursor instead of creating many short-lived connections;
update error handling to close or leave the passed-in cursor/connection
untouched only when it was created inside the function.
In @backend/app/models/CelebrityMatcher.py:
- Around line 29-37: The path resolution is overly nested and contains an empty
pass; simplify by replacing the if-not-abs block with a single check that only
tries the fallback when the provided relative path doesn't exist: if not
os.path.isabs(path) and not os.path.exists(path): compute dir_of_file =
os.path.dirname(__file__), potential_path = os.path.join(dir_of_file,
os.path.basename(path)), and if os.path.exists(potential_path): set path =
potential_path. Remove the empty pass and the extra nested if/else.
- Around line 70-97: identify_face currently assumes unknown_embedding is a
128-D vector; add input validation at the start of the method to verify
unknown_embedding is not None and has the expected 128 dimensions (accept either
shape (128,) or (1,128)), optionally converting/flattening to a 1-D numpy array
before computing distances with self.embeddings_matrix, and if the shape is
wrong raise a ValueError or return None and log a clear debug/error message
using logger; reference the identify_face function, self.embeddings_matrix and
self.names when adding the checks so the subsequent np.linalg.norm call is safe.
In @backend/app/models/FaceDetector.py:
- Around line 61-68: The call to db_get_or_create_cluster_by_name in
FaceDetector (after celebrity_matcher.identify_face) can raise DB exceptions and
currently will crash the whole detection flow; wrap the
db_get_or_create_cluster_by_name call in a try/except (catching database-related
exceptions) and on failure log the error with context (including image_id and
name) and append None (or skip) to cluster_ids so processing continues; ensure
the logger used in this file is used for the error message and do not suppress
unexpected exceptions beyond logging unless explicitly intended.
In @backend/app/routes/celebrity.py:
- Around line 10-43: The celebrity scan loads all faces into memory
(db_get_all_faces_with_cluster_names) and calls db_get_or_create_cluster_by_name
per face, causing memory and DB-connection overhead; change
process_celebrity_scan to iterate faces in fixed-size batches (e.g., page
through db_get_all_faces_with_cluster_names or a new db_get_faces_batch
iterator), accumulate updates per batch, open a single DB connection/cursor for
the batch and pass that cursor into db_get_or_create_cluster_by_name to avoid
per-face connections, then call db_update_face_cluster_ids_batch per batch and
continue until all faces are processed.
In @backend/celebrity_detection_demo.py:
- Around line 28-127: The demo never releases resources — call
face_detector.close() and any cleanup on celebrity_matcher (e.g.,
celebrity_matcher.close() if available or celebrity_matcher.shutdown()) before
returning or at the end of run_celebrity_detection_demo; ensure cleanup runs on
all exit paths (after initialization failure, image-not-found/failed-load,
no-faces return, and at normal completion) by invoking the close/shutdown
methods for FaceDetector (face_detector.close()) and CelebrityMatcher
(celebrity_matcher.close()) inside finally blocks or before each early return so
resources are always freed.
🧹 Nitpick comments (7)
frontend/src/api/api-functions/celebrity.ts (1)
3-6: Add TypeScript return type for better type safety.

The function should declare its return type to improve type inference and catch potential API contract changes.
📘 Proposed type definition
Define the response type and add it to the function signature:
```diff
+interface ScanCelebritiesResponse {
+  message: string;
+}
+
-export const scanCelebrities = async () => {
+export const scanCelebrities = async (): Promise<ScanCelebritiesResponse> => {
   const response = await apiClient.post('/celebrity/scan');
   return response.data;
 };
```

frontend/src/pages/Home/Home.tsx (1)
35-52: Consider refetching data after successful celebrity scan.

The mutation correctly starts the background scan, but when it completes successfully, the UI displays "Celebrity scan started in background" without indicating when the scan finishes or prompting the user to view results. Users won't see the newly assigned celebrity names without manually refreshing or navigating away and back.
💡 Proposed enhancement with query invalidation
After the scan starts, consider invalidating relevant queries or redirecting to the clusters view so users can see the results:
```diff
+import { useQueryClient } from '@tanstack/react-query';
+
 export const Home = () => {
   const dispatch = useDispatch();
+  const queryClient = useQueryClient();
   const images = useSelector(selectImages);
   // ... rest of setup

   const scanMutation = useMutation({
     mutationFn: scanCelebrities,
+    onSuccess: () => {
+      // Invalidate clusters query to trigger refetch with updated celebrity names
+      queryClient.invalidateQueries({ queryKey: ['clusters'] });
+    },
   });

   useMutationFeedback(
     { isPending: scanMutation.isPending, isSuccess: scanMutation.isSuccess, isError: scanMutation.isError, error: scanMutation.error },
     {
       loadingMessage: 'Scanning for celebrities...',
-      successMessage: 'Celebrity scan started in background.',
+      successMessage: 'Celebrity scan complete! Check the Clusters page to see results.',
       errorTitle: 'Scan Failed',
       errorMessage: 'Failed to start celebrity scan.',
     },
   );
```

Note: This assumes the background scan completes quickly. For longer scans, consider implementing polling or WebSocket updates to notify when the scan finishes.
backend/app/routes/celebrity.py (1)
23-23: Consider making the celebrity matching threshold configurable.

Line 23 uses the default threshold (0.7) from `identify_face`. Depending on the quality of the celebrity encodings and the detection model, this threshold might need tuning for optimal precision/recall. A threshold that's too low will produce false positives; too high will miss valid matches.

⚙️ Proposed configuration approach
Make the threshold configurable via environment variable or settings. In `backend/app/config/settings.py` (or wherever configuration lives):

```python
CELEBRITY_MATCH_THRESHOLD = float(os.getenv("CELEBRITY_MATCH_THRESHOLD", "0.7"))
```

Then in this file:

```diff
+from app.config.settings import CELEBRITY_MATCH_THRESHOLD
+
 def process_celebrity_scan():
     # ... existing setup ...
     for face in faces:
         embeddings = face["embeddings"]
-        name = matcher.identify_face(embeddings)
+        name = matcher.identify_face(embeddings, threshold=CELEBRITY_MATCH_THRESHOLD)
         if name:
             # ... rest of logic ...
```

This allows tuning without code changes.
backend/app/models/FaceDetector.py (1)
61-68: Consider performance impact of per-face database calls.

Each identified celebrity triggers a synchronous database call to `db_get_or_create_cluster_by_name`. For images with many faces, this could introduce significant latency. Consider:

- Batching cluster creation/retrieval after processing all faces
- Caching cluster IDs in memory for frequently identified celebrities (see the sketch after this list)
- Moving cluster assignment to a background task
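As a rough illustration of the caching idea, here is a minimal sketch of an in-memory, per-scan cache wrapped around the existing `db_get_or_create_cluster_by_name` helper. The `ClusterCache` class is hypothetical and not part of this PR:

```python
from app.database.face_clusters import db_get_or_create_cluster_by_name


class ClusterCache:
    """Per-scan, in-memory map from celebrity name to cluster ID.

    Saves one DB round-trip per face when the same celebrity appears
    many times in an image set.
    """

    def __init__(self):
        self._cache = {}

    def get_or_create(self, name: str) -> str:
        cluster_id = self._cache.get(name)
        if cluster_id is None:
            # Hit the database only on the first sighting of a name.
            cluster_id = db_get_or_create_cluster_by_name(name)
            self._cache[name] = cluster_id
        return cluster_id
```

A single instance created at the start of a scan (or per `detect_faces` call) could then replace the direct `db_get_or_create_cluster_by_name(name)` calls.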
backend/celebrity_detection_demo.py (1)
80-81: Hard-coded threshold duplicates FaceDetector logic.

Line 80 accesses `face_detector.yolo_detector.conf_threshold`, which is good. However, this creates a dependency on the internal structure of FaceDetector. Consider whether the demo should expose or document this threshold dependency, for example as sketched below.
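One lightweight way to make that dependency explicit, sketched under the assumption that a small helper is acceptable (the function name is hypothetical, not code from this PR):

```python
def resolve_demo_conf_threshold(face_detector, override=None):
    """Pick the confidence threshold the demo should use.

    An explicit override keeps the demo decoupled from FaceDetector's
    internals; otherwise fall back to the detector's own setting, read
    in one documented place instead of inside the processing loop.
    """
    if override is not None:
        return override
    return face_detector.yolo_detector.conf_threshold
```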
backend/app/models/CelebrityMatcher.py (2)

87-90: Performance optimization opportunity: early stopping.

The current implementation computes distances to all celebrity embeddings before finding the minimum. For large celebrity databases, you could potentially stop early once a match below the threshold is found, though this would require different logic (e.g., iterative search or using a spatial index); a rough sketch follows.

For now, the implementation is correct and efficient enough for 1,000 celebrities. Consider optimization only if performance becomes an issue with larger databases.
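For illustration only, a chunked scan with an early exit might look like the sketch below. The function name and `chunk_size` are hypothetical, and note that it returns the first sufficiently close match rather than the guaranteed global minimum:

```python
import numpy as np


def identify_face_chunked(embeddings_matrix, names, unknown_embedding,
                          threshold=0.7, chunk_size=256):
    """Scan celebrity embeddings in chunks, stopping at the first match.

    Trades the guaranteed closest match for an early exit, which only
    pays off when the database is large and matches are common.
    """
    for start in range(0, len(embeddings_matrix), chunk_size):
        chunk = embeddings_matrix[start:start + chunk_size]
        distances = np.linalg.norm(chunk - unknown_embedding, axis=1)
        best = int(np.argmin(distances))
        if distances[best] < threshold:
            return names[start + best]
    return None
```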
70-70: Add documentation for the threshold parameter and provide tuning guidance.

The method already accepts a `threshold` parameter with a default of `0.7` (good), but lacks documentation on why this value was chosen or how to tune it for different precision/recall tradeoffs. The embeddings are L2-normalized (confirmed in FaceNet.py), which places `0.7` within the standard range for FaceNet-based matching (typically 0.6–0.8). Update the docstring to explain (a sketch follows this list):

- The assumption that embeddings are L2-normalized
- Why 0.7 was selected (empirical basis if available, or note it as a reasonable default)
- How to adjust the threshold for stricter (lower distance) or looser (higher distance) matching based on your use case
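A possible docstring capturing that guidance, as a sketch (the tuning notes are reasonable defaults to be confirmed by the authors, not measured results):

```python
def identify_face(self, unknown_embedding, threshold=0.7):
    """Identify the closest celebrity for a face embedding.

    Args:
        unknown_embedding (np.ndarray): 128-D, L2-normalized FaceNet
            embedding; matching assumes the celebrity encodings were
            produced the same way.
        threshold (float): Maximum Euclidean distance for a match.
            0.7 is a common default for normalized FaceNet embeddings
            (typical working range 0.6-0.8). Lower it for stricter,
            higher-precision matching; raise it for looser,
            higher-recall matching.

    Returns:
        str | None: Celebrity name if the closest encoding is within
        the threshold, otherwise None.
    """
```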
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (3)
- `__pycache__/app.cpython-310.pyc` is excluded by `!**/*.pyc`
- `backend/app/models/celebrity_encodings.pkl` is excluded by `!**/*.pkl`
- `utils/__pycache__/cache.cpython-310.pyc` is excluded by `!**/*.pyc`
📒 Files selected for processing (9)
- `backend/app/database/face_clusters.py`
- `backend/app/models/CelebrityMatcher.py`
- `backend/app/models/FaceDetector.py`
- `backend/app/routes/celebrity.py`
- `backend/celebrity_detection_demo.py`
- `backend/main.py`
- `frontend/src/api/api-functions/celebrity.ts`
- `frontend/src/api/api-functions/index.ts`
- `frontend/src/pages/Home/Home.tsx`
💤 Files with no reviewable changes (1)
- backend/main.py
🧰 Additional context used
🧬 Code graph analysis (3)
frontend/src/api/api-functions/celebrity.ts (1)
frontend/src/api/axiosConfig.ts (1)
- `apiClient` (5-12)
backend/app/routes/celebrity.py (3)
backend/app/models/CelebrityMatcher.py (2)
- `CelebrityMatcher` (8-97)
- `identify_face` (70-97)

backend/app/database/faces.py (2)
- `db_get_all_faces_with_cluster_names` (246-285)
- `db_update_face_cluster_ids_batch` (288-341)

backend/app/database/face_clusters.py (1)
- `db_get_or_create_cluster_by_name` (355-377)
backend/celebrity_detection_demo.py (4)
backend/app/models/FaceDetector.py (1)
- `FaceDetector` (17-91)

backend/app/models/FaceNet.py (2)
- `FaceNet` (12-29)
- `get_embedding` (20-25)

backend/app/models/CelebrityMatcher.py (2)
- `CelebrityMatcher` (8-97)
- `identify_face` (70-97)

backend/app/utils/FaceNet.py (1)
- `FaceNet_util_preprocess_image` (6-13)
🔇 Additional comments (5)
frontend/src/api/api-functions/index.ts (1)
7-8: LGTM!

The barrel export follows the existing pattern and correctly exposes the new celebrity API functions.
frontend/src/pages/Home/Home.tsx (1)
78-89: LGTM! Clean UI integration.

The button implementation is well-structured with appropriate loading states, disabled state during scanning, and clear visual feedback using the Sparkles icon.
backend/celebrity_detection_demo.py (2)
7-9: LGTM: Path setup is appropriate for a demo script.

The sys.path manipulation correctly adds the backend directory to enable imports. The conditional check prevents duplicate entries.
14-23: LGTM: Robust import error handling.

The try-except block with helpful error messages makes the script user-friendly and easier to debug.
backend/app/models/FaceDetector.py (1)
25-25: CelebrityMatcher.identify_face is thread-safe; no synchronization required.

The `identify_face` method performs only read-only operations on immutable data (`self.embeddings_matrix` and `self.names`), which are loaded once during initialization and never modified. NumPy array reads are inherently thread-safe, so concurrent calls to `detect_faces` will not cause race conditions in the celebrity matching logic.

Likely an incorrect or invalid review comment.
```python
def db_get_or_create_cluster_by_name(name: str) -> str:
    """Gets an existing cluster ID by name or creates a new one."""
    conn = sqlite3.connect(DATABASE_PATH)
    cursor = conn.cursor()
    try:
        # Check if exists
        cursor.execute(
            "SELECT cluster_id FROM face_clusters WHERE cluster_name = ?", (name,)
        )
        result = cursor.fetchone()
        if result:
            return result[0]

        # Create new
        cluster_id = str(uuid.uuid4())
        cursor.execute(
            "INSERT INTO face_clusters (cluster_id, cluster_name) VALUES (?, ?)",
            (cluster_id, name),
        )
        conn.commit()
        return cluster_id
    finally:
        conn.close()
```
Critical race condition in get-or-create pattern.
The check-then-insert logic is vulnerable to race conditions when multiple threads call this function concurrently with the same celebrity name. If two calls pass the SELECT check before either completes the INSERT, one will fail with a constraint violation.
Additionally, this function opens a new database connection for every call. When invoked in a loop (as in backend/app/routes/celebrity.py line 32), this creates many short-lived connections, which is inefficient.
🔒 Proposed fix using an optimistic INSERT with proper transaction handling
```diff
-def db_get_or_create_cluster_by_name(name: str) -> str:
-    """Gets an existing cluster ID by name or creates a new one."""
-    conn = sqlite3.connect(DATABASE_PATH)
-    cursor = conn.cursor()
-    try:
-        # Check if exists
-        cursor.execute(
-            "SELECT cluster_id FROM face_clusters WHERE cluster_name = ?", (name,)
-        )
-        result = cursor.fetchone()
-        if result:
-            return result[0]
-
-        # Create new
-        cluster_id = str(uuid.uuid4())
-        cursor.execute(
-            "INSERT INTO face_clusters (cluster_id, cluster_name) VALUES (?, ?)",
-            (cluster_id, name),
-        )
-        conn.commit()
-        return cluster_id
-    finally:
-        conn.close()
+def db_get_or_create_cluster_by_name(
+    name: str, cursor: Optional[sqlite3.Cursor] = None
+) -> str:
+    """Gets an existing cluster ID by name or creates a new one.
+
+    Args:
+        name: The cluster name to look up or create
+        cursor: Optional existing database cursor. If None, creates a new connection.
+
+    Returns:
+        The cluster_id (existing or newly created)
+    """
+    own_connection = cursor is None
+    if own_connection:
+        conn = sqlite3.connect(DATABASE_PATH)
+        cursor = conn.cursor()
+
+    try:
+        # Try to insert first with a new UUID
+        cluster_id = str(uuid.uuid4())
+        try:
+            cursor.execute(
+                "INSERT INTO face_clusters (cluster_id, cluster_name) VALUES (?, ?)",
+                (cluster_id, name),
+            )
+            if own_connection:
+                conn.commit()
+            return cluster_id
+        except sqlite3.IntegrityError:
+            # Cluster name already exists, fetch the existing ID
+            if own_connection:
+                conn.rollback()
+            cursor.execute(
+                "SELECT cluster_id FROM face_clusters WHERE cluster_name = ?", (name,)
+            )
+            result = cursor.fetchone()
+            if result:
+                return result[0]
+            else:
+                # Should not happen, but handle gracefully
+                raise ValueError(f"Failed to create or retrieve cluster for name: {name}")
+    except Exception:
+        if own_connection:
+            conn.rollback()
+        raise
+    finally:
+        if own_connection:
+            conn.close()
```

This approach:
- Uses INSERT first (optimistic path for new names)
- Catches IntegrityError if name already exists
- Falls back to SELECT to retrieve existing cluster_id
- Accepts optional cursor to enable reuse in transactions (matches pattern of other functions in this file)
- Includes proper error handling and rollback
Committable suggestion skipped: line range outside the PR's diff.
```python
        if not os.path.isabs(path):
            if os.path.exists(path):
                pass
            else:
                # Check relative to this file location as fallback
                dir_of_file = os.path.dirname(__file__)
                potential_path = os.path.join(dir_of_file, os.path.basename(path))
                if os.path.exists(potential_path):
                    path = potential_path
```
🛠️ Refactor suggestion | 🟠 Major
Simplify path resolution logic.
The nested conditionals for path resolution (lines 29-37) are hard to follow and have an unusual structure with an empty pass statement. This logic could be simplified.
♻️ Cleaner path resolution
```diff
-        # Resolve absolute path if relative path is provided
-        if not os.path.isabs(path):
-            if os.path.exists(path):
-                pass
-            else:
-                # Check relative to this file location as fallback
-                dir_of_file = os.path.dirname(__file__)
-                potential_path = os.path.join(dir_of_file, os.path.basename(path))
-                if os.path.exists(potential_path):
-                    path = potential_path
+        # Resolve relative paths: first try as-is, then relative to this module
+        if not os.path.isabs(path) and not os.path.exists(path):
+            module_dir = os.path.dirname(__file__)
+            fallback_path = os.path.join(module_dir, os.path.basename(path))
+            if os.path.exists(fallback_path):
+                path = fallback_path
```
+ path = fallback_path📝 Committable suggestion
```python
        # Resolve relative paths: first try as-is, then relative to this module
        if not os.path.isabs(path) and not os.path.exists(path):
            module_dir = os.path.dirname(__file__)
            fallback_path = os.path.join(module_dir, os.path.basename(path))
            if os.path.exists(fallback_path):
                path = fallback_path
```
```python
        try:
            with open(path, 'rb') as f:
                data = pickle.load(f)
```
CRITICAL: Pickle deserialization security risk.
Using pickle.load() on a file can execute arbitrary code if the pickle file is malicious or has been tampered with. Since the encodings file path is configurable and could potentially be user-controlled or loaded from an untrusted source, this is a security vulnerability.
Consider these alternatives:

- Use a safer serialization format like JSON or NumPy's `.npz` format
- If pickle is required, validate the file's cryptographic signature before loading
- Document that the encodings file must only come from trusted sources and implement strict file permission checks
🔒 Example fix using NumPy's safer format
```python
# When saving encodings (in your encoding generation script):
np.savez_compressed(
    'celebrity_encodings.npz',
    names=names_array,
    embeddings=embeddings_array
)

# In _load_and_optimize:
try:
    data = np.load(path, allow_pickle=False)  # Disable pickle for safety
    self.names = data['names']
    self.embeddings_matrix = data['embeddings']
    logger.info(f"Loaded {len(self.names)} celebrity encodings.")
except Exception as e:
    logger.error(f"Failed to load celebrity encodings: {e}")
```
logger.error(f"Failed to load celebrity encodings: {e}")Alternatively, if you must use pickle for backward compatibility, add this warning:
```diff
         try:
+            # WARNING: Only load pickle files from trusted sources
+            # Pickle can execute arbitrary code during deserialization
             with open(path, 'rb') as f:
                 data = pickle.load(f)
```

📝 Committable suggestion
```python
        try:
            # WARNING: Only load pickle files from trusted sources
            # Pickle can execute arbitrary code during deserialization
            with open(path, 'rb') as f:
                data = pickle.load(f)
```
```python
    def identify_face(self, unknown_embedding, threshold=0.7):
        """
        Identifies the celebrity from the given embedding using Euclidean distance.
        Args:
            unknown_embedding (np.ndarray): The 128D embedding of the face to identify.
            threshold (float): The distance threshold for a match.
        Returns:
            str: The name of the celebrity if found, otherwise None.
        """
        if self.embeddings_matrix is None or unknown_embedding is None:
            return None

        # Ensure unknown_embedding is the correct shape for broadcasting
        # Calculate Euclidean distance: sqrt(sum((x - y)^2))
        # axis=1 to calculate distance for each row in the matrix
        distances = np.linalg.norm(self.embeddings_matrix - unknown_embedding, axis=1)

        min_distance_idx = np.argmin(distances)
        min_distance = distances[min_distance_idx]

        logger.debug(f"Closest match: {self.names[min_distance_idx]} with distance: {min_distance}")

        if min_distance < threshold:
            return self.names[min_distance_idx]
        else:
            return None
```
Add input validation for embedding dimension.
The identify_face method doesn't validate that unknown_embedding has the expected shape (128D). If an incorrect embedding is passed, the distance calculation could fail or produce incorrect results.
✅ Proposed validation
```diff
 def identify_face(self, unknown_embedding, threshold=0.7):
     """
     Identifies the celebrity from the given embedding using Euclidean distance.
     Args:
         unknown_embedding (np.ndarray): The 128D embedding of the face to identify.
         threshold (float): The distance threshold for a match.
     Returns:
         str: The name of the celebrity if found, otherwise None.
     """
     if self.embeddings_matrix is None or unknown_embedding is None:
         return None
+
+    # Validate embedding shape
+    if unknown_embedding.shape[-1] != 128:
+        logger.error(f"Invalid embedding dimension: {unknown_embedding.shape}. Expected 128D.")
+        return None
+
+    # Ensure 1D array for distance calculation
+    if unknown_embedding.ndim > 1:
+        unknown_embedding = unknown_embedding.flatten()

     # Calculate Euclidean distance
     distances = np.linalg.norm(self.embeddings_matrix - unknown_embedding, axis=1)
```
distances = np.linalg.norm(self.embeddings_matrix - unknown_embedding, axis=1)🤖 Prompt for AI Agents
In @backend/app/models/CelebrityMatcher.py around lines 70 - 97, identify_face
currently assumes unknown_embedding is a 128-D vector; add input validation at
the start of the method to verify unknown_embedding is not None and has the
expected 128 dimensions (accept either shape (128,) or (1,128)), optionally
converting/flattening to a 1-D numpy array before computing distances with
self.embeddings_matrix, and if the shape is wrong raise a ValueError or return
None and log a clear debug/error message using logger; reference the
identify_face function, self.embeddings_matrix and self.names when adding the
checks so the subsequent np.linalg.norm call is safe.
```python
            # Match celebrity
            name = self.celebrity_matcher.identify_face(embedding)
            if name:
                logger.info(f"Identified {name} in image {image_id}")
                cluster_id = db_get_or_create_cluster_by_name(name)
                cluster_ids.append(cluster_id)
            else:
                cluster_ids.append(None)
```
Add error handling for database operations.
The call to db_get_or_create_cluster_by_name at line 65 could fail due to database errors, but there's no exception handling. If this fails, the entire face detection operation would crash.
🛡️ Proposed fix with error handling
```diff
 # Match celebrity
 name = self.celebrity_matcher.identify_face(embedding)
 if name:
     logger.info(f"Identified {name} in image {image_id}")
-    cluster_id = db_get_or_create_cluster_by_name(name)
-    cluster_ids.append(cluster_id)
+    try:
+        cluster_id = db_get_or_create_cluster_by_name(name)
+        cluster_ids.append(cluster_id)
+    except Exception as e:
+        logger.error(f"Failed to get/create cluster for {name}: {e}")
+        cluster_ids.append(None)
 else:
     cluster_ids.append(None)
```
```python
def process_celebrity_scan():
    try:
        logger.info("Starting background celebrity scan...")
        matcher = CelebrityMatcher()
        faces = db_get_all_faces_with_cluster_names()
        logger.info(f"Scanning {len(faces)} faces for celebrity matches...")

        updates = []
        matches_found = 0
        for face in faces:
            embeddings = face["embeddings"]

            # Identify face
            name = matcher.identify_face(embeddings)

            if name:
                matches_found += 1
                # Check if already named correctly
                current_name = face.get("cluster_name")

                # If current name is different (or None), calculate cluster ID and queue update
                if current_name != name:
                    cluster_id = db_get_or_create_cluster_by_name(name)
                    updates.append({"face_id": face["face_id"], "cluster_id": cluster_id})
                    logger.debug(f"Face {face['face_id']} matched as {name}. Queued for update.")

        if updates:
            db_update_face_cluster_ids_batch(updates)
            logger.info(f"Successfully updated {len(updates)} faces with celebrity names.")
        else:
            logger.info(f"Scan complete. Found {matches_found} matches, but no updates were needed.")

    except Exception as e:
        logger.error(f"Error during celebrity scan: {e}")
```
Performance concern: loading all faces into memory.
Line 14 loads ALL faces from the database into memory at once. For large galleries with thousands or tens of thousands of faces, this could cause memory pressure or even OOM errors. Additionally, each call to db_get_or_create_cluster_by_name (line 32) opens a new database connection, creating connection overhead when processing many faces.
⚡ Proposed optimization with batched processing
Consider processing faces in batches and reusing database connections:
```diff
 def process_celebrity_scan():
+    global _scan_in_progress
+    BATCH_SIZE = 100  # Process faces in batches
+
     try:
         logger.info("Starting background celebrity scan...")
         matcher = CelebrityMatcher()
-        faces = db_get_all_faces_with_cluster_names()
-        logger.info(f"Scanning {len(faces)} faces for celebrity matches...")
+
+        # Get total count for logging
+        import sqlite3
+        from app.config.settings import DATABASE_PATH
+        conn = sqlite3.connect(DATABASE_PATH)
+        cursor = conn.cursor()
+        cursor.execute("SELECT COUNT(*) FROM faces")
+        total_faces = cursor.fetchone()[0]
+        logger.info(f"Scanning {total_faces} faces for celebrity matches...")

         updates = []
         matches_found = 0
-        for face in faces:
+        processed = 0
+
+        # Process in batches using LIMIT/OFFSET
+        while processed < total_faces:
+            cursor.execute("""
+                SELECT f.face_id, f.embeddings, fc.cluster_name
+                FROM faces f
+                LEFT JOIN face_clusters fc ON f.cluster_id = fc.cluster_id
+                ORDER BY f.face_id
+                LIMIT ? OFFSET ?
+            """, (BATCH_SIZE, processed))
+
+            batch_faces = []
+            for row in cursor.fetchall():
+                face_id, embeddings_json, cluster_name = row
+                import numpy as np
+                import json
+                embeddings = np.array(json.loads(embeddings_json))
+                batch_faces.append({
+                    "face_id": face_id,
+                    "embeddings": embeddings,
+                    "cluster_name": cluster_name,
+                })
+
+            if not batch_faces:
+                break
+
+            for face in batch_faces:
+                embeddings = face["embeddings"]
-
-            # Identify face
-            name = matcher.identify_face(embeddings)
-
-            if name:
-                matches_found += 1
-                # Check if already named correctly
-                current_name = face.get("cluster_name")
-
-                # If current name is different (or None), calculate cluster ID and queue update
-                if current_name != name:
-                    cluster_id = db_get_or_create_cluster_by_name(name)
-                    updates.append({"face_id": face["face_id"], "cluster_id": cluster_id})
-                    logger.debug(f"Face {face['face_id']} matched as {name}. Queued for update.")
+
+                name = matcher.identify_face(embeddings)
+
+                if name:
+                    matches_found += 1
+                    current_name = face.get("cluster_name")
+
+                    if current_name != name:
+                        # Reuse cursor for cluster creation
+                        cluster_id = db_get_or_create_cluster_by_name(name, cursor=cursor)
+                        updates.append({"face_id": face["face_id"], "cluster_id": cluster_id})
+                        logger.debug(f"Face {face['face_id']} matched as {name}. Queued for update.")
+
+            processed += len(batch_faces)
+            logger.info(f"Processed {processed}/{total_faces} faces ({matches_found} matches found so far)...")

         if updates:
-            db_update_face_cluster_ids_batch(updates)
+            db_update_face_cluster_ids_batch(updates, cursor=cursor)
             logger.info(f"Successfully updated {len(updates)} faces with celebrity names.")
         else:
             logger.info(f"Scan complete. Found {matches_found} matches, but no updates were needed.")
-
+
+        conn.commit()
     except Exception as e:
         logger.error(f"Error during celebrity scan: {e}")
+        if 'conn' in locals():
+            conn.rollback()
+    finally:
+        if 'conn' in locals():
+            conn.close()
+        with _scan_lock:
+            _scan_in_progress = False
```

Note: This requires updating db_get_or_create_cluster_by_name to accept an optional cursor parameter (as suggested in the earlier review comment for that function).
Committable suggestion skipped: line range outside the PR's diff.
```python
@router.post("/scan")
def scan_celebrities(background_tasks: BackgroundTasks):
    """
    Triggers a background scan of all existing faces to identify celebrities.
    """
    background_tasks.add_task(process_celebrity_scan)
    return {"message": "Celebrity scan started in background."}
```
Add authentication and prevent concurrent scans.
The endpoint lacks authentication and has no protection against concurrent scans. Any user can trigger multiple celebrity scans simultaneously, which could overwhelm the system or cause database contention.
🔐 Proposed fix with authentication and concurrency control
Add a simple flag to prevent concurrent scans and require authentication:
```diff
+import threading
+
+# Module-level lock to prevent concurrent scans
+_scan_lock = threading.Lock()
+_scan_in_progress = False
+
 @router.post("/scan")
-def scan_celebrities(background_tasks: BackgroundTasks):
+async def scan_celebrities(background_tasks: BackgroundTasks):
     """
     Triggers a background scan of all existing faces to identify celebrities.
+    Returns 409 if a scan is already in progress.
     """
+    global _scan_in_progress
+
+    # Check if scan is already running
+    if _scan_in_progress:
+        return {"message": "Celebrity scan already in progress."}, 409
+
+    # Mark scan as in progress
+    with _scan_lock:
+        if _scan_in_progress:
+            return {"message": "Celebrity scan already in progress."}, 409
+        _scan_in_progress = True
+
     background_tasks.add_task(process_celebrity_scan)
     return {"message": "Celebrity scan started in background."}
```

Update process_celebrity_scan to reset the flag:
```diff
 def process_celebrity_scan():
+    global _scan_in_progress
     try:
         logger.info("Starting background celebrity scan...")
         # ... existing code ...
     except Exception as e:
         logger.error(f"Error during celebrity scan: {e}")
+    finally:
+        with _scan_lock:
+            _scan_in_progress = False
+        logger.info("Celebrity scan completed.")
```

Authentication note: If this application has authentication middleware, ensure this endpoint is protected. If not, consider adding at least a simple API key or rate limiting.
Committable suggestion skipped: line range outside the PR's diff.
```python
def run_celebrity_detection_demo(image_path):
    print(f"--- Starting Celebrity Detection on {image_path} ---")

    # 1. Initialize Models
    print("Initializing models...")
    try:
        # Note: FaceDetector initializes its own FaceNet internally, but for this custom pipeline
        # as requested, we might want to use individual components if FaceDetector is too coupled.
        # However, FaceDetector.yolo_detector is accessible.
        face_detector = FaceDetector()

        # We need a separate FaceNet instance for our explicit embedding step
        # if we don't want to use face_detector internals,
        # OR we can reuse face_detector.facenet if it exposes what we need.
        # FaceDetector has self.facenet.
        facenet = face_detector.facenet

        # Use default path (rel to CelebrityMatcher file)
        celebrity_matcher = CelebrityMatcher()

    except Exception as e:
        print(f"Failed to initialize models: {e}")
        return

    # 2. Load Image
    if not os.path.exists(image_path):
        print(f"Image not found at {image_path}")
        return

    img = cv2.imread(image_path)
    if img is None:
        print(f"Failed to load image from {image_path}")
        return

    # 3. Detect Faces
    # Using the YOLO detector inside FaceDetector as requested ("Use FaceDetector to find face bounding boxes")
    # FaceDetector.detect_faces() does everything, but we want to demonstrate the pipeline steps.
    # So we access the underlying YOLO detector.
    print("Detecting faces...")
    # The yolo_detector call returns boxes, scores, class_ids
    boxes, scores, class_ids = face_detector.yolo_detector(img)

    if len(boxes) == 0:
        print("No faces detected.")
        return

    print(f"Found {len(boxes)} faces.")

    # 4. Process Each Face
    for i, (box, score) in enumerate(zip(boxes, scores)):
        # Filter by confidence if needed (YOLO class usually handles this internally or returns all)
        # FaceDetector uses 0.45 threshold.
        if score < face_detector.yolo_detector.conf_threshold:
            continue

        x1, y1, x2, y2 = map(int, box)
        print(f"Processing Face {i+1} at [{x1}, {y1}, {x2}, {y2}]...")

        # Crop Face (with some padding like FaceDetector does)
        padding = 20
        h, w, _ = img.shape
        face_img = img[
            max(0, y1 - padding) : min(h, y2 + padding),
            max(0, x1 - padding) : min(w, x2 + padding)
        ]

        if face_img.size == 0:
            print("Empty crop, skipping.")
            continue

        # 5. Preprocess for FaceNet (Resize to 160x160)
        # The user requested: "preprocess it (resize to 160x160)"
        # FaceNet_util_preprocess_image handles: Resize -> RGB -> Transpose -> ExpandDims -> Normalize
        # We'll use the util to ensure compatibility with the model.
        # If manual resize is strictly required separately:
        # resized_face = cv2.resize(face_img, (160, 160))
        # preprocessed_face = FaceNet_util_preprocess_image_from_resized(resized_face)
        # But FaceNet_util_preprocess_image includes the resize, so we use it directly.

        try:
            # FaceNet expects the processed tensor
            preprocessed_face = FaceNet_util_preprocess_image(face_img)

            # 6. Get Embedding
            embedding = facenet.get_embedding(preprocessed_face)

            # 7. Match Celebrity
            name = celebrity_matcher.identify_face(embedding)

            if name:
                print(f"Result: Found {name} at [{x1}, {y1}, {x2}, {y2}]")
                # Optional: specific logging format
            else:
                print(f"Result: Unknown person at [{x1}, {y1}, {x2}, {y2}]")

        except Exception as e:
            print(f"Error processing face {i+1}: {e}")

    print("--- Scanning Complete ---")
```
Add resource cleanup for models.
The demo initializes face_detector and celebrity_matcher but never closes them. The FaceDetector class has a close() method (line 81-90 in FaceDetector.py) that should be called to release resources.
🧹 Proposed fix with proper cleanup
```diff
 def run_celebrity_detection_demo(image_path):
     print(f"--- Starting Celebrity Detection on {image_path} ---")
     # 1. Initialize Models
     print("Initializing models...")
     try:
         face_detector = FaceDetector()
         facenet = face_detector.facenet
         celebrity_matcher = CelebrityMatcher()
     except Exception as e:
         print(f"Failed to initialize models: {e}")
         return

+    try:
-    # 2. Load Image
-    if not os.path.exists(image_path):
+        # 2. Load Image
+        if not os.path.exists(image_path):
+            print(f"Image not found at {image_path}")
+            return
+
         # ... rest of the function logic ...
+
+        print("--- Scanning Complete ---")
+    finally:
+        # Clean up resources
+        if face_detector:
+            face_detector.close()
-    print("--- Scanning Complete ---")
```
Closes #904
Changes done
Why second commit "fix merge issue"
The video was larger than 10 MB, so I have attached a link:
https://collection.cloudinary.com/db6cpegxg/5256757d53b4723bee8050226eeaa9ee
or
https://res.cloudinary.com/db6cpegxg/video/upload/v1767805877/Screen_Recording_2026-01-07_at_10.32.52_PM_co6scx.mov