This repository implements a complete computer vision pipeline for detecting and counting humans and cars in drone/aerial imagery using the VisDrone dataset and Ultralytics YOLO.
The project is designed for an AI/ML internship technical assessment and covers:
- Dataset understanding and preprocessing.
- VisDrone to YOLO conversion for a 2-class task.
- YOLO training/fine-tuning.
- Human and car detection.
- Per-image and per-frame human counting.
- Bounding-box visualization.
- Evaluation with precision, recall, mAP, and speed.
- Optional tracking with ByteTrack or BoT-SORT.
Local dataset path:
D:\Antlings\archive\VisDrone_Dataset
Expected VisDrone-style folders include:
VisDrone2019-DET-train\images
VisDrone2019-DET-train\annotations or labels
VisDrone2019-DET-val\images
VisDrone2019-DET-val\annotations or labels
VisDrone2019-DET-test-dev\images
VisDrone2019-DET-test-dev\annotations or labels
VisDrone2019-DET-test-challenge\images
The converter supports both raw VisDrone annotation rows and YOLO-style VisDrone labels. For this assessment, classes are remapped to:
0: human <- VisDrone pedestrian + people/person categories
1: car <- VisDrone car category only
Other categories are ignored. Invalid boxes, ignored regions, zero/negative boxes, and boxes outside image bounds are skipped or clipped safely.
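The remapping and box handling described above can be sketched roughly as follows. This is an illustration, not the repository's converter: the function name `visdrone_row_to_yolo` is hypothetical, and the code assumes the usual raw VisDrone-DET row layout (`bbox_left,bbox_top,bbox_width,bbox_height,score,object_category,truncation,occlusion`) with category IDs 1 = pedestrian, 2 = people, and 4 = car.

```python
from typing import Optional

# Assumed VisDrone-DET category IDs: 1 = pedestrian, 2 = people, 4 = car.
# Anything else (including 0 = ignored regions) is dropped.
VISDRONE_TO_TARGET = {1: 0, 2: 0, 4: 1}


def visdrone_row_to_yolo(row: str, img_w: int, img_h: int) -> Optional[str]:
    """Convert one raw VisDrone row to a YOLO label line, or return None to skip it."""
    fields = [f.strip() for f in row.strip().split(",") if f.strip()]
    if len(fields) < 6:
        return None  # malformed row

    left, top, width, height = (float(v) for v in fields[:4])
    target = VISDRONE_TO_TARGET.get(int(float(fields[5])))
    if target is None:
        return None  # category outside the 2-class task

    # Clip the box to the image, then reject empty or inverted boxes.
    x1, y1 = max(0.0, left), max(0.0, top)
    x2, y2 = min(float(img_w), left + width), min(float(img_h), top + height)
    if x2 <= x1 or y2 <= y1:
        return None

    # YOLO format: class, normalized center x/y, normalized width/height.
    cx, cy = (x1 + x2) / 2 / img_w, (y1 + y2) / 2 / img_h
    w, h = (x2 - x1) / img_w, (y2 - y1) / img_h
    return f"{target} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"
```

Clipping before the size check means a box that only partly leaves the frame is kept with its visible part, while a box entirely outside the frame collapses to zero area and is skipped.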
Project structure:
D:\Antlings
  README.md
  REPORT.md
  requirements.txt
  .gitignore
  configs\
    dataset.yaml
    train.yaml
  src\
    convert_visdrone_to_yolo.py
    explore_dataset.py
    train.py
    infer.py
    evaluate.py
    visualize.py
    utils.py
    track.py
  notebooks\
    01_dataset_understanding.ipynb
  outputs\
    samples\
    predictions\
    metrics\
    tracking\
  runs\
    .gitkeep
Run from Windows PowerShell:
cd D:\Antlings
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
With the environment active, run the full pipeline:
python src\convert_visdrone_to_yolo.py --dataset-root "D:\Antlings\archive\VisDrone_Dataset" --output-root "D:\Antlings\data\visdrone_human_car"
python src\explore_dataset.py --dataset-root "D:\Antlings\data\visdrone_human_car" --output-dir "D:\Antlings\outputs\samples"
python src\train.py --data "D:\Antlings\configs\dataset.yaml" --model yolov8n.pt --epochs 50 --imgsz 640 --batch 16 --project "D:\Antlings\runs\train" --name human_car_yolo
python src\infer.py --weights "D:\Antlings\runs\train\human_car_yolo\weights\best.pt" --source "D:\Antlings\data\visdrone_human_car\images\val" --output-dir "D:\Antlings\outputs\predictions" --conf 0.25 --save-csv
python src\evaluate.py --weights "D:\Antlings\runs\train\human_car_yolo\weights\best.pt" --data "D:\Antlings\configs\dataset.yaml" --output-dir "D:\Antlings\outputs\metrics"
Default conversion:
python src\convert_visdrone_to_yolo.py --dataset-root "D:\Antlings\archive\VisDrone_Dataset" --output-root "D:\Antlings\data\visdrone_human_car"
Custom paths:
python src\convert_visdrone_to_yolo.py --dataset-root "PATH_TO_VISDRONE" --output-root "PATH_TO_OUTPUT"
Generated dataset:
D:\Antlings\data\visdrone_human_car\images\train
D:\Antlings\data\visdrone_human_car\images\val
D:\Antlings\data\visdrone_human_car\images\test
D:\Antlings\data\visdrone_human_car\labels\train
D:\Antlings\data\visdrone_human_car\labels\val
D:\Antlings\data\visdrone_human_car\labels\test
The script also writes:
D:\Antlings\configs\dataset.yaml
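The exact contents of the generated file are not shown here. As a rough guide, an Ultralytics dataset config for this two-class layout usually contains a dataset root, the image subfolders for each split, and the class names. The sketch below renders such a file as text; the helper name `render_dataset_yaml` and the use of forward slashes in the path are assumptions made for illustration.

```python
def render_dataset_yaml(root: str) -> str:
    """Return dataset YAML text using the keys Ultralytics reads: path, train, val, test, names."""
    lines = [
        f"path: {root}",        # dataset root; split paths below are relative to it
        "train: images/train",
        "val: images/val",
        "test: images/test",
        "names:",
        "  0: human",
        "  1: car",
    ]
    return "\n".join(lines) + "\n"


yaml_text = render_dataset_yaml("D:/Antlings/data/visdrone_human_car")
print(yaml_text)
```

Ultralytics locates label files by swapping `images` for `labels` in each image path, which is why the generated dataset keeps parallel `images\` and `labels\` folders.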
Explore the converted dataset:
python src\explore_dataset.py --dataset-root "D:\Antlings\data\visdrone_human_car" --output-dir "D:\Antlings\outputs\samples"
Outputs:
- outputs\samples\class_distribution.png
- outputs\samples\objects_per_image_histogram.png
- Annotated sample images with bounding boxes and counts.
- outputs\metrics\dataset_summary.csv
Train the model:
python src\train.py --data "D:\Antlings\configs\dataset.yaml" --model yolov8n.pt --epochs 50 --imgsz 640 --batch 16 --project "D:\Antlings\runs\train" --name human_car_yolo
Default model: yolov8n.pt, chosen because it is lightweight and suitable for Colab or modest GPUs. You can switch to a larger or newer model when hardware allows:
python src\train.py --data "D:\Antlings\configs\dataset.yaml" --model yolov8s.pt --epochs 50
python src\train.py --data "D:\Antlings\configs\dataset.yaml" --model yolo11n.pt --epochs 50
Best weights are saved under:
D:\Antlings\runs\train\human_car_yolo\weights\best.pt
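The command-line flags above correspond closely to keyword arguments of the Ultralytics `YOLO.train` method. The following sketch shows one way a script such as train.py might forward them; the real script is not shown here, so `build_train_kwargs` and `train` are hypothetical names and the defaults simply mirror the command above.

```python
from typing import Any, Dict


def build_train_kwargs(data: str, epochs: int = 50, imgsz: int = 640,
                       batch: int = 16, project: str = "runs/train",
                       name: str = "human_car_yolo") -> Dict[str, Any]:
    """Collect the keyword arguments passed on to YOLO.train()."""
    return {"data": data, "epochs": epochs, "imgsz": imgsz,
            "batch": batch, "project": project, "name": name}


def train(model_name: str, **train_kwargs: Any):
    """Fine-tune a pretrained checkpoint such as "yolov8n.pt"."""
    # Imported lazily so build_train_kwargs works without Ultralytics installed.
    from ultralytics import YOLO

    model = YOLO(model_name)
    return model.train(**train_kwargs)


kwargs = build_train_kwargs("configs/dataset.yaml")
# train("yolov8n.pt", **kwargs)  # uncomment to start training
```

Ultralytics writes the run to `<project>\<name>`, which is where the `weights\best.pt` path above comes from.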
Run on a folder:
python src\infer.py --weights "D:\Antlings\runs\train\human_car_yolo\weights\best.pt" --source "D:\Antlings\data\visdrone_human_car\images\val" --output-dir "D:\Antlings\outputs\predictions" --conf 0.25 --save-csv
Run on one image:
python src\infer.py --weights "D:\Antlings\runs\train\human_car_yolo\weights\best.pt" --source "D:\Antlings\data\visdrone_human_car\images\val\example.jpg" --output-dir "D:\Antlings\outputs\predictions" --conf 0.25 --save-csv
Run on a video:
python src\infer.py --weights "D:\Antlings\runs\train\human_car_yolo\weights\best.pt" --source "D:\Antlings\demo_video.mp4" --output-dir "D:\Antlings\outputs\predictions" --conf 0.25 --save-csv
Each output image/video displays bounding boxes and a readable counter:
Humans: N Cars: M
CSV outputs:
- outputs\predictions\predictions.csv
- outputs\predictions\counts.csv
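Counting reduces to tallying the predicted class IDs for each image or frame. The sketch below assumes the class IDs arrive as a plain list (with Ultralytics, something like `result.boxes.cls.tolist()`, which yields floats); the helper names `count_detections` and `format_counter` are hypothetical and not taken from infer.py.

```python
from collections import Counter
from typing import Dict, Iterable

CLASS_NAMES = {0: "human", 1: "car"}  # matches the remapped 2-class task


def count_detections(class_ids: Iterable[float]) -> Dict[str, int]:
    """Tally detections per class name for one image or video frame."""
    tally = Counter(int(c) for c in class_ids)
    return {name: tally.get(idx, 0) for idx, name in CLASS_NAMES.items()}


def format_counter(counts: Dict[str, int]) -> str:
    """Build the overlay text drawn on each output image or frame."""
    return f"Humans: {counts['human']} Cars: {counts['car']}"


counts = count_detections([0.0, 0.0, 1.0, 0.0])
print(format_counter(counts))
```

Because the count is taken only from detections above the `--conf` threshold, raising the threshold lowers the counts and lowering it risks counting false positives.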
Evaluate the trained model:
python src\evaluate.py --weights "D:\Antlings\runs\train\human_car_yolo\weights\best.pt" --data "D:\Antlings\configs\dataset.yaml" --output-dir "D:\Antlings\outputs\metrics"
Outputs:
- outputs\metrics\metrics_summary.json
- outputs\metrics\metrics_summary.csv
- outputs\metrics\evaluation_notes.md
Metrics printed include precision, recall, mAP50, mAP50-95, and estimated FPS when Ultralytics exposes speed values.
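Ultralytics reports speed as per-image times in milliseconds, keyed by stage. One plausible way to turn that into an FPS estimate is sketched below; the helper name `estimate_fps` and the choice of which stages to sum are assumptions, not necessarily what evaluate.py does.

```python
from typing import Mapping, Optional


def estimate_fps(speed_ms: Mapping[str, float]) -> Optional[float]:
    """Estimate frames per second from per-image stage times in milliseconds."""
    stages = ("preprocess", "inference", "postprocess")
    total_ms = sum(speed_ms.get(stage, 0.0) for stage in stages)
    # Return None when no usable speed values were exposed.
    return 1000.0 / total_ms if total_ms > 0 else None


fps = estimate_fps({"preprocess": 1.0, "inference": 8.0, "postprocess": 1.0})
print(fps)
```

This figure covers model-side processing only; it excludes image loading, drawing, and video encoding, so end-to-end throughput will be lower.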
For a demo video:
python src\track.py --weights "D:\Antlings\runs\train\human_car_yolo\weights\best.pt" --source "D:\Antlings\demo_video.mp4" --output-dir "D:\Antlings\outputs\tracking" --tracker bytetrack.yaml
The tracking script saves:
- Tracked video with IDs.
- outputs\tracking\tracking_counts.csv
If tracker dependencies or configs are unavailable, the script prints a clear message explaining how to enable tracking.
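Tracking matters for counting because per-frame counts see the same person again in every frame, while track IDs let each object be counted once across a video. The sketch below illustrates that idea on plain `(track_id, class_id)` pairs; the columns of tracking_counts.csv are not described here, so the data layout and the name `unique_counts` are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

CLASS_NAMES = {0: "human", 1: "car"}


def unique_counts(frames: Iterable[List[Tuple[int, int]]]) -> Dict[str, int]:
    """Count distinct track IDs per class across all frames of a video."""
    seen: Dict[int, Set[int]] = defaultdict(set)
    for detections in frames:
        for track_id, class_id in detections:
            seen[class_id].add(track_id)
    return {name: len(seen[idx]) for idx, name in CLASS_NAMES.items()}


# Two frames: track 1 (human) appears in both, track 3 (human) only in the second.
frames = [[(1, 0), (2, 1)], [(1, 0), (3, 0)]]
totals = unique_counts(frames)
print(totals)
```

Identity switches in the tracker inflate these totals, so they are best read as an upper-leaning estimate rather than an exact count.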
Use this structure for a 3-5 minute demo:
- Project goal: detect humans and cars from drone images and count humans.
- Dataset: show D:\Antlings\archive\VisDrone_Dataset, annotation format, and class remapping.
- Preprocessing: run or explain convert_visdrone_to_yolo.py and show configs\dataset.yaml.
- Dataset understanding: show class distribution, object-count histogram, and sample annotated images.
- Training: show the YOLO training command and explain why yolov8n.pt is used.
- Inference: run infer.py, show predicted images, bounding boxes, and the "Humans: N" counter.
- Evaluation: show metrics JSON/CSV and discuss precision, recall, mAP, speed, and limitations.
- Bonus: briefly show tracking output if a video is available.
Strengths:
- End-to-end pipeline from dataset conversion to inference and evaluation.
- Clear human counting logic from detection class IDs.
- Works with raw VisDrone annotations and YOLO-style VisDrone label archives.
- Uses small YOLO model by default for practical training and demo speed.
Limitations:
- Small aerial humans are difficult, especially under occlusion.
- Counting accuracy depends on detection recall.
- Dense scenes can create missed detections and duplicate boxes.
- The optional tracker is most useful on video, not independent still images.