The goal of this repo is to build a real-time video analytics system capable of solving a variety of tasks:
- Detect and count all people within a frame
- Count unique people using tracking+reid
- Identify number of unique customers in the shop
- Track each customer
- Identify how many people entered and exited shop during the video
- Identify gender (male/female) and age (child/adult as classification) for each customer
- Identify number of unique customers that made a purchase
- Identify if cashier is right or left-handed
- Install poetry and point it at the system environment (if you use conda or another env manager):

  ```
  poetry env use system
  ```

- Install torch and torchvision with conda (or another package manager):

  ```
  conda install pytorch torchvision -c pytorch
  ```

- Install dependencies (poetry is recommended, but a requirements.txt is also available):

  ```
  poetry install
  ```

- Install ffmpeg with your system package manager.
- Fiftyone is used as the main tool for managing data and labels.
- Fire is the main CLI entrypoint to run all scripts.
- User inputs are validated with pydantic.
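
A minimal sketch of how Fire and pydantic fit together, assuming the structure implied by the CLI help below (not the repo's exact code):

```python
import fire
from pydantic import BaseModel


class FiftyoneDataset(BaseModel):
    # pydantic validates these fields; Fire turns them into CLI flags
    name: str                # -> required --name flag
    overwrite: bool = False  # -> optional --overwrite flag

    def create(self, video_dir: str) -> None:
        # every public method becomes a CLI command, e.g. `create`
        print(f"Creating dataset {self.name!r} from {video_dir}")


if __name__ == "__main__":
    fire.Fire(FiftyoneDataset)
```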
The `FiftyoneDataset` class is the main entrypoint to run all commands from the CLI. Check out its usage below:
```
NAME
    main.py FiftyoneDataset - Usage docs: https://docs.pydantic.dev/2.7/concepts/models/

SYNOPSIS
    main.py FiftyoneDataset GROUP | COMMAND | <flags>

FLAGS
    -n, --name=NAME (required)
        Type: str
    -o, --overwrite=OVERWRITE
        Type: bool
        Default: False
```
- Create a fiftyone dataset to visualize data and labels:

  ```
  python main.py FiftyoneDataset --name <name> create <path/to/video/folder>
  # keep fiftyone running to observe changes
  fiftyone app launch <name>
  ```
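
Under the hood, creating a video dataset with the fiftyone API looks roughly like this (a sketch, not the repo's exact code; the dataset name is illustrative):

```python
import fiftyone as fo

# Load every video in a folder into a named dataset
dataset = fo.Dataset.from_videos_dir("path/to/video/folder", name="shop")
dataset.persistent = True  # keep the dataset available across sessions

# Equivalent to `fiftyone app launch shop` from the shell
session = fo.launch_app(dataset)
```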
- Run detection and tracking:

  ```
  python main.py FiftyoneDataset --name <name> track --model yolov8s --label-field yolov8s
  ```
This command will download the corresponding YOLOv8 model to the current directory and create a label field "yolov8s" containing detections with COCO classes and track ids.
- The Ultralytics YOLOv8 family of models is used for object detection.
- BoxSort is used to connect frame-level detections into tracklets.
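
The detection-plus-tracking step boils down to roughly this ultralytics call (a sketch; the video path is a placeholder):

```python
from ultralytics import YOLO

model = YOLO("yolov8s.pt")  # weights are downloaded on first use

# stream=True yields per-frame results instead of building a list in memory
for result in model.track("video.mp4", stream=True):
    boxes = result.boxes.xyxy  # detections with COCO classes
    ids = result.boxes.id      # tracker-assigned ids (None until tracks form)
```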
To run a tracking algorithm with a ReID model, install boxmot and run it on top of the existing detections:

```
pip install boxmot
python main.py FiftyoneDataset --name <name> track_reid \
    --label-field yolov8s \
    --tracking_method deepocsort \
    --reid-model osnet_x0_25_market1501.pt \
    --new_field deepocsort_market
```
A new label field will be created; it contains an index parameter and is available in the fiftyone app for inspection.
Other ReID models and tracking methods are also available: check [the source repo](https://github.com/mikel-brostrom/boxmot?tab=readme-ov-file) for more details.
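
Under the hood, `track_reid` presumably drives boxmot roughly as its README shows; class names and signatures vary across boxmot versions, so treat this as a sketch:

```python
from pathlib import Path

import cv2
import numpy as np
from boxmot import DeepOCSORT  # name may differ in newer boxmot releases

tracker = DeepOCSORT(
    model_weights=Path("osnet_x0_25_market1501.pt"),  # the ReID model
    device="cpu",
    fp16=False,
)

frame = cv2.imread("frame.jpg")
# one row per detection: x1, y1, x2, y2, confidence, class
dets = np.array([[10, 20, 110, 220, 0.9, 0]])
tracks = tracker.update(dets, frame)  # associates detections with track ids
```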
To identify customers, we need to annotate the cashier's zone in a yaml config file. Once a fiftyone dataset is created, find the video sample id in the web UI and add an entry to the config file following the existing example.
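
The config format is defined by this repo, so follow the example shipped in configs/annotations.yaml; a hypothetical entry (field names here are illustrative only) might look like:

```yaml
# configs/annotations.yaml -- hypothetical structure
6665610ec7102da65e0f95ef:   # video sample id copied from the fiftyone app
  cashier:                  # zone name
    points: [[0.10, 0.20], [0.35, 0.20], [0.35, 0.60], [0.10, 0.60]]
```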
- Add these annotations to fiftyone:

  ```
  python main.py FiftyoneDataset --name <name> annotate_zones configs/annotations.yaml --label_field zone
  ```

Verify the correctness of the zone annotations in the fiftyone app.
- After adding zones, run the following command to label each person as a cashier or a customer:

```
❯ python main.py FiftyoneDataset --name <name> identify_customer --tracking_field yolov8s --zone_field zone --iou-threshold 0.5
```

Output:

```
6665610ec7102da65e0f95ef cashier count: 2
6665610ec7102da65e0f95ef customer count: 41
```
A new label field is created that maps a "person" class to either "cashier" or "customer" based on the intersection with the "cashier" zone.
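The underlying rule can be sketched like this (a simplification; the actual command aggregates over a whole tracklet):

```python
def box_iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def visitor_type(person_box, cashier_zone, iou_threshold=0.5):
    """A person overlapping the cashier zone strongly enough is a cashier."""
    return "cashier" if box_iou(person_box, cashier_zone) >= iou_threshold else "customer"
```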
This step assumes that the exit zone was annotated in the previous step and that customer identification has already been run.
```
❯ python main.py FiftyoneDataset --name <name> identify_exit --tracking_field visitor_type --zone_field zone
```

Output:

```
6665610ec7102da65e0f95ef: Number of customers exiting is 4
```
An "exit" is defined as a last frame for each tracklet to intersect with the annotated "exit" zone.
Age and gender classification is handled by the CLIP model with 4 classes:
- adult woman
- adult man
- girl
- boy
The script below will create a new dataset with customer patches, download the CLIP model from the fiftyone zoo, and classify each patch into the defined categories. The predictions are then grouped for each customer, and simple voting decides the final class probabilities.
```
❯ python main.py FiftyoneDataset --name <name> classify_customers --patch-field visitor_type --export-dir data/interim/patches
```

Output:

```
Customer 'customer-1' class probabilities: {'adult man': 0.64, 'boy': 0.18, 'adult woman': 0.18}
Customer 'customer-10' class probabilities: {'adult woman': 0.87, 'girl': 0.13}
Customer 'customer-11' class probabilities: {'adult man': 0.62, 'adult woman': 0.38}
Customer 'customer-14' class probabilities: {'adult man': 0.85, 'adult woman': 0.03, 'boy': 0.13}
Customer 'customer-15' class probabilities: {'adult man': 0.33, 'adult woman': 0.51, 'girl': 0.16}
Customer 'customer-17' class probabilities: {'adult man': 0.87, 'adult woman': 0.1,
```
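
The zero-shot CLIP step and the voting can be sketched as follows; the zoo model name is FiftyOne's standard CLIP entry, while `patches_dataset` and the `vote` helper are illustrative assumptions:

```python
from collections import Counter

import fiftyone.zoo as foz

CLASSES = ["adult woman", "adult man", "girl", "boy"]

# Zero-shot classification with CLIP from the FiftyOne model zoo
clip = foz.load_zoo_model(
    "clip-vit-base32-torch",
    text_prompt="A photo of a",
    classes=CLASSES,
)
patches_dataset.apply_model(clip, label_field="clip")  # patches_dataset: the exported patches


def vote(patch_labels):
    """Turn the per-patch labels of one customer into class probabilities."""
    counts = Counter(patch_labels)
    total = sum(counts.values())
    return {cls: round(n / total, 2) for cls, n in counts.items()}


# vote(["adult man", "adult man", "boy"]) -> {'adult man': 0.67, 'boy': 0.33}
```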
It's also possible to review the CLIP predictions in the fiftyone app by opening the `<name>-patches` dataset. Filter customers by the `customer_id` field and review the label counts in the `clip` field.
Example age and gender prediction for a single customer:
Possible improvements:
- Select top-k patches per customer to optimize speed
- Run a face detection algorithm and apply a model trained on the Adience dataset
There are a few steps to tackle this problem:
- Detect customers spending at least n seconds around a register (see the sketch after this list).
- Estimate customer pose, ensure hands intersect with the cashier's desk.
- Add these identifications for human verification.
- Train a video action recognition model to identify a target event.
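
A sketch of the first step, assuming per-frame boxes for each tracklet and an annotated register zone (`box_iou` as defined earlier; the threshold values are illustrative):

```python
def seconds_near_register(tracklet_boxes, register_zone, fps, min_iou=0.1):
    """How long a customer's box overlaps the register zone."""
    frames = sum(1 for box in tracklet_boxes if box_iou(box, register_zone) >= min_iou)
    return frames / fps


# customers lingering at least 5 seconds become purchase candidates
candidates = [
    tid for tid, boxes in tracklets.items()
    if seconds_near_register(boxes, register_zone, fps=25) >= 5
]
```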
To identify a cashier's handedness, we apply a pose estimation model at the patch level for cashiers and compute the magnitude (i.e. the xy-variance) of each hand's movement across frames.
```
❯ python main.py FiftyoneDataset --name <name> cashier_hand
```

Output:

```
Cashier 'cashier-12' is 'right'-handed: left_var=0.07, right_var=0.12
Cashier 'cashier-16' is 'left'-handed: left_var=0.03, right_var=0.03
```
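
The decision rule itself is a small computation; a minimal sketch, assuming we already have per-frame wrist coordinates for a cashier:

```python
import numpy as np

def handedness(left_wrist_xy, right_wrist_xy):
    """Movement magnitude as the sum of x and y variances of each wrist track."""
    left_var = float(np.var(np.asarray(left_wrist_xy), axis=0).sum())
    right_var = float(np.var(np.asarray(right_wrist_xy), axis=0).sum())
    side = "right" if right_var > left_var else "left"
    return side, left_var, right_var
```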
Keypoint detections can be reviewed in the fiftyone app as well. Filter on a specific cashier with the `cashier_id` field and then select wrist labels in the `keypoints` field. It might be useful to set a confidence threshold around 0.4-0.6.
Example hand detection for a cashier:
Possible improvements:
- Select top patches for each cashier
- Track hand detections across frames and compute the movement magnitude.