# Facial Recognition & Vector Database Pipeline

## Table of Contents
1. [Setup & Imports](#setup--imports)
2. [Face Detection Function](#face-detection-function)
3. [Test Face Detection](#test-face-detection)
4. [Upload to Qdrant Function](#upload-to-qdrant-function)
5. [Run Pipeline](#run-pipeline)
6. [Face Count Analysis](#face-count-analysis)


## Setup & Imports
Import required libraries for face detection and vector database operations.


In [None]:
# Core libraries
import os
import glob
import uuid
import cv2

# DeepFace for face detection + embeddings
from deepface import DeepFace

# Qdrant client
from qdrant_client import QdrantClient
from qdrant_client.http import models

## Image Preprocessing Function
This function takes an image in any of several formats (.JPG, .HEIC, etc.) and converts it into a cv2 format that DeepFace can read.
Input: image path
Output: image in a Pillow or cv2 format
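
This step is described but not implemented in the notebook. A minimal sketch, assuming Pillow and NumPy are installed, could look like the following. The function name `load_image_as_bgr` is hypothetical, and HEIC support assumes the optional `pillow-heif` package is available; without it, HEIC files will fail to open.

```python
import numpy as np
from PIL import Image, ImageOps

# Assumption: HEIC decoding relies on the optional pillow-heif package.
try:
    import pillow_heif
    pillow_heif.register_heif_opener()
except ImportError:
    pass  # JPEG/PNG still work; HEIC files will raise on open


def load_image_as_bgr(image_path):
    """Load an image with Pillow and return a BGR uint8 array (cv2 layout)."""
    with Image.open(image_path) as pil_img:
        # Apply the EXIF orientation so rotated phone photos come out upright
        pil_img = ImageOps.exif_transpose(pil_img)
        rgb = np.asarray(pil_img.convert("RGB"))
    # Reverse the channel axis (RGB -> BGR) and copy into contiguous memory
    return np.ascontiguousarray(rgb[:, :, ::-1])
```

The returned array uses the same channel order as `cv2.imread`, so it should be usable wherever the notebook currently works with a cv2 image.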

## Face Detection Function
Extract face crops and embeddings from images using DeepFace (RetinaFace + ArcFace).


In [2]:
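def extract_faces_and_embeddings(image_path):
    """
    Extract face crops + embeddings for a single image.
    Uses DeepFace (RetinaFace for detection, ArcFace for embeddings).

    Returns a list of dicts, one per detected face, with keys
    "face_crop" (BGR numpy array) and "embedding" (list of floats).
    """
    # cv2.imread returns None instead of raising when a file cannot be read
    img = cv2.imread(image_path)
    if img is None:
        raise ValueError(f"Could not read image: {image_path}")

    # enforce_detection=False: do not raise when no face is detected
    results = DeepFace.represent(
        img_path=image_path,
        model_name="ArcFace",
        detector_backend="retinaface",
        enforce_detection=False
    )

    output = []
    for res in results:
        # facial_area is the bounding box of the face in pixel coordinates
        fa = res["facial_area"]
        x, y, w, h = fa["x"], fa["y"], fa["w"], fa["h"]
        face_crop = img[y:y+h, x:x+w]
        output.append({
            "face_crop": face_crop,
            "embedding": res["embedding"]
        })
    return output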
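
Because `enforce_detection=False` is set, DeepFace does not raise when no face is found. Instead, it appears to return a single result whose `facial_area` spans the whole image, which would then be stored as if it were a face. The sketch below is one hedged way to drop such results. The helper name `drop_full_image_results` and the `tolerance` parameter are hypothetical, and the fallback behaviour itself is an assumption worth checking against the installed DeepFace version.

```python
def drop_full_image_results(results, img_width, img_height, tolerance=2):
    """Remove results whose facial_area covers (almost) the entire image.

    `results` is a list of dicts shaped like DeepFace.represent output,
    i.e. each has a "facial_area" dict with x, y, w, h keys.
    """
    kept = []
    for res in results:
        fa = res["facial_area"]
        covers_width = fa["x"] <= tolerance and fa["w"] >= img_width - tolerance
        covers_height = fa["y"] <= tolerance and fa["h"] >= img_height - tolerance
        if covers_width and covers_height:
            continue  # assumed to be the "no face found" fallback
        kept.append(res)
    return kept
```

Inside `extract_faces_and_embeddings`, this could be applied right after the `DeepFace.represent` call, passing `img.shape[1]` as the width and `img.shape[0]` as the height.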


## Test Face Detection
Test the face detection function on sample images to verify it works.


In [34]:
# Test with the correct path
parent_dir = "../data/query_images/"
collection_name = "faces_collection"

# First, test if we can find images
image_paths = glob.glob(os.path.join(parent_dir, "*.[jJpP][pPnN]*[gG]"))
print(f"Found {len(image_paths)} images: {image_paths}")

# Test face detection on first image
if image_paths:
    test_faces = extract_faces_and_embeddings(image_paths[0])
    print(f"Faces detected: {len(test_faces)}")


Found 2 images: ['../data/query_images/27c3e45a-57a4-473a-a927-ccb5f048de12.JPG', '../data/query_images/query.JPG']
Faces detected: 4
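
The glob pattern `*.[jJpP][pPnN]*[gG]` matches `.jpg`, `.jpeg`, and `.png` in any letter case, but not `.heic`, and `glob` makes no promise about the order of its results. An alternative sketch that filters by an explicit extension set and sorts the paths is shown below. The function name `list_images` and the default extension tuple are assumptions, not part of the notebook.

```python
from pathlib import Path


def list_images(parent_dir, extensions=(".jpg", ".jpeg", ".png", ".heic")):
    """Return sorted paths of files in parent_dir whose suffix is in extensions."""
    allowed = {ext.lower() for ext in extensions}
    return sorted(
        str(path)
        for path in Path(parent_dir).iterdir()
        if path.is_file() and path.suffix.lower() in allowed
    )
```

Sorting keeps the image numbering stable between runs. Any HEIC files found this way would still need the image preprocessing step described earlier, since `cv2.imread` generally cannot decode them.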


## Upload to Qdrant Function
Process all images in a directory and upload face embeddings to Qdrant vector database.


In [35]:
def push_directory_to_qdrant(parent_dir, collection_name, qdrant_url="http://localhost:6333"):
    """
    Loop through all images in parent directory, extract embeddings, push to Qdrant.
    Each face gets its own vector with image metadata in payload.
    """
    client = QdrantClient(url=qdrant_url)

    # Create / reset collection (ArcFace = 512 dims).
    # Note: recreate_collection deletes any existing collection with this name.
    client.recreate_collection(
        collection_name=collection_name,
        vectors_config=models.VectorParams(size=512, distance=models.Distance.COSINE)
    )

    # Collect all images in folder
    image_paths = glob.glob(os.path.join(parent_dir, "*.[jJpP][pPnN]*[gG]"))
    print(f"🔍 Found {len(image_paths)} images")

    all_points = []
    for image_id, image_path in enumerate(image_paths, start=1):
        print(f"\n🖼️  Processing image {image_id}: {os.path.basename(image_path)}")
        
        try:
            faces = extract_faces_and_embeddings(image_path)
            print(f"   ✅ Found {len(faces)} faces")
            
            for face_idx, face in enumerate(faces):
                # Generate proper UUID
                point_id = str(uuid.uuid4())
                print(f"   📝 Adding face {face_idx + 1} with ID: {point_id}")
                
                all_points.append(models.PointStruct(
                    id=point_id,  # Proper UUID
                    vector=face["embedding"],
                    payload={
                        "image_id": image_id,
                        "image_url": os.path.abspath(image_path),
                        "face_index": face_idx
                    }
                ))
        except Exception as e:
            print(f"   ❌ Error processing {image_path}: {e}")

    print(f"\n📊 Total points collected: {len(all_points)}")
    
    if len(all_points) == 0:
        print("❌ No points to upload!")
        return

    # Upload to Qdrant
    print(f"⬆️  Uploading {len(all_points)} points to Qdrant...")
    client.upsert(collection_name=collection_name, points=all_points)

    print(f"✅ Inserted {len(all_points)} faces from {len(image_paths)} images into collection '{collection_name}'")
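
The function above sends every point in a single `upsert` call, which is fine for a handful of images. For larger directories, uploading in fixed-size batches is a common pattern. A minimal sketch follows; the helper name `chunked` and the batch size of 256 are assumptions, and the commented usage lines reuse `client`, `collection_name`, and `all_points` from the function above.

```python
def chunked(items, batch_size=256):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical usage inside push_directory_to_qdrant:
# for batch in chunked(all_points):
#     client.upsert(collection_name=collection_name, points=batch)
```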


## Run Pipeline
Execute the complete pipeline to process images and upload to Qdrant.


In [36]:
# Test with verbose output
parent_dir = "../data/query_images/"
collection_name = "detected_faces_collection"

print(f"📁 Directory: {parent_dir}")
print(f"📊 Collection: {collection_name}")
print("🚀 Running push_directory_to_qdrant...")

push_directory_to_qdrant(parent_dir, collection_name)

print("✅ Function completed!")


📁 Directory: ../data/query_images/
📊 Collection: detected_faces_collection
🚀 Running push_directory_to_qdrant...
🔍 Found 2 images

🖼️  Processing image 1: 27c3e45a-57a4-473a-a927-ccb5f048de12.JPG
   ✅ Found 4 faces
   📝 Adding face 1 with ID: 36c932d2-4635-4cb0-81e4-37da49800eee
   📝 Adding face 2 with ID: d3a79381-8230-4456-8829-98d6418b2248
   📝 Adding face 3 with ID: 08b8a479-7c2f-4e44-b429-5f0f055f7b95
   📝 Adding face 4 with ID: 2f5d7299-cc98-4a63-9f24-16115fb52891

🖼️  Processing image 2: query.JPG
   ✅ Found 4 faces
   📝 Adding face 1 with ID: d89bd7b8-1d4f-4fc0-8fe0-535629f77f92
   📝 Adding face 2 with ID: a8cd5a83-8699-431c-9133-af1014c04dd0
   📝 Adding face 3 with ID: 6f2ee0ba-d3cf-4cb6-8de1-fb36635b4fbc
   📝 Adding face 4 with ID: eb7b978c-7bf9-4d85-a48d-571ef372b2a3

📊 Total points collected: 8
⬆️  Uploading 8 points to Qdrant...
✅ Inserted 8 faces from 2 images into collection 'detected_faces_collection'
✅ Function completed!


## Face Count Analysis
Analyze how many faces were detected in each processed image.


In [37]:
def count_faces_per_image(collection_name="detected_faces_collection"):
    """Count how many faces were detected for each image"""
    client = QdrantClient(url="http://localhost:6333")
    
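    # Fetch the first page of points; scroll returns (points, next_page_offset),
    # so collections with more than 1000 points need further calls
    points = client.scroll(collection_name=collection_name, limit=1000)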
    
    # Count faces per image
    face_counts = {}
    for point in points[0]:
        image_url = point.payload['image_url']
        image_name = os.path.basename(image_url)
        
        if image_name not in face_counts:
            face_counts[image_name] = 0
        face_counts[image_name] += 1
    
    # Display results
    print("📊 Faces detected per image:")
    for image_name, count in face_counts.items():
        print(f"  {image_name}: {count} faces")
    
    return face_counts

# Run it
count_faces_per_image()


📊 Faces detected per image:
  27c3e45a-57a4-473a-a927-ccb5f048de12.JPG: 4 faces
  query.JPG: 4 faces


{'27c3e45a-57a4-473a-a927-ccb5f048de12.JPG': 4, 'query.JPG': 4}
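
`client.scroll` returns a `(points, next_page_offset)` tuple, so a single call with `limit=1000` sees at most the first 1000 points. A hedged sketch of paging through the whole collection is shown below. The names `count_faces_per_image_paged` and `batch_size` are hypothetical, and `client` is assumed to be a `QdrantClient` like the one created above.

```python
import os
from collections import Counter


def count_faces_per_image_paged(client, collection_name, batch_size=256):
    """Count points per source image, following scroll offsets to the end."""
    face_counts = Counter()
    offset = None  # None asks Qdrant to start from the first point
    while True:
        points, offset = client.scroll(
            collection_name=collection_name,
            limit=batch_size,
            offset=offset,
            with_payload=True,
        )
        for point in points:
            face_counts[os.path.basename(point.payload["image_url"])] += 1
        if offset is None:  # no further pages
            break
    return dict(face_counts)
```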

## Rough Notes

1. The pipeline:
A function takes in an image and applies DeepFace (which uses RetinaFace and ArcFace in the background).
It crops each face in the image and then embeds each of those crops. An image is taken as input, and the output is a list of face crops, in whatever format DeepFace supports.
The function takes an image_path and returns a list of dictionaries, where each dictionary corresponds to one detected face.

2. Take a list of images, pass each one to the function 'extract_faces_and_embeddings', and add the results to a Qdrant collection where each point is an image and the multiple facial embeddings for that image are stored in that record. (As implemented above, each face gets its own point instead, with the image metadata in its payload.) Also ensure that these embeddings can be matched against a reference collection. Add the image URL and image ID to the payload.

## Next Day Notes
1. Add more images.
2. Ensure that the same pipeline works.
3. See if this can be made a reusable function, so it can also be used for reference images.
4. There are now 2 Qdrant collections, one for reference images and one for query images. Reference_faces_collection holds the reference faces, and detected_faces_collection holds the detected faces. Perform a vector match between them, get the names from the matched reference images, and add those names to the payload of the query images.
5. Take a good look at the payload for each image, and revisit how many collections are needed.
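
The matching step in note 4 can be sketched without Qdrant to show the logic: for each detected face, find the reference embedding with the highest cosine similarity and accept its name only above a threshold. Everything named here (`best_reference_match`, the `name` key, the 0.5 threshold) is a hypothetical illustration. In the actual pipeline the nearest-neighbour lookup would presumably be a Qdrant search against the reference collection, and a suitable threshold would need to be tuned on real data.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_reference_match(query_embedding, references, threshold=0.5):
    """Return (name, score) of the closest reference, or (None, score) below threshold.

    `references` is a list of dicts with hypothetical keys "name" and "embedding".
    """
    best_name, best_score = None, -1.0
    for ref in references:
        score = cosine_similarity(query_embedding, ref["embedding"])
        if score > best_score:
            best_name, best_score = ref["name"], score
    if best_score < threshold:
        return None, best_score  # no reference is similar enough
    return best_name, best_score
```

The returned name could then be written into the payload of the matching point in detected_faces_collection.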
