![intro](./proj_intro.png)
## Proj 5: Object Detection + Object-level Scene Recognition
Training a full scene-recognition deep network is too computationally expensive for most students, so this project uses an open-source object detection API (MMDetection here) to first detect the common objects in the scene images. A small network or a classical machine learning method can then classify the scene from the semantic information that object detection provides.
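
To make the second stage concrete, here is a minimal sketch (not part of the original project code) of how detections could become a scene label. It assumes each detection is a `(class_name, score)` pair, turns the detections into a bag-of-objects count vector, and picks the nearest scene centroid. The function names, the five-class vocabulary, and the centroid values are all made up for illustration; a real project would learn the classifier from labelled scenes.

```python
from collections import Counter


def bag_of_objects(detections, vocabulary, score_thr=0.3):
    """Turn (class_name, score) detections into a fixed-length count vector."""
    counts = Counter(name for name, score in detections if score >= score_thr)
    return [counts.get(name, 0) for name in vocabulary]


def nearest_centroid(feature, centroids):
    """Return the scene label whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(feature, centroids[label]))


# Hypothetical vocabulary and hand-written centroids, for illustration only.
vocabulary = ["person", "car", "traffic light", "chair", "dining table"]
centroids = {
    "street": [4, 2, 2, 0, 0],
    "dining room": [1, 0, 0, 4, 1],
}

# Hypothetical detector output for one image.
detections = [
    ("person", 0.9),
    ("person", 0.8),
    ("traffic light", 0.7),
    ("chair", 0.2),  # below the threshold, so it is ignored
    ("car", 0.6),
]

feature = bag_of_objects(detections, vocabulary)
scene = nearest_centroid(feature, centroids)
print(feature, scene)
```

The count vector is deliberately simple; score-weighted counts or a small MLP on the same vector are natural next steps.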

### Object Detection API: [MMDetection](https://github.com/open-mmlab/mmdetection)
![mmdetection](./mmdet-logo.png)
1. Get started by following the [official documentation](https://mmdetection.readthedocs.io/zh_CN/latest/get_started.html)
   - pytorch >= 1.6.0 (use the CPU-only build if your device has no NVIDIA GPU)
   - mmcv >= 2.0.0 
      ```
      pip install -U openmim 
      pip install chardet  # optional: only if you encounter "No module named 'chardet'"
      mim install mmengine 
      mim install "mmcv>=2.0.0"
      ```
   - mmdetection
      ```
      mim install mmdet
      ```
   - Verify the installation [link](https://mmdetection.readthedocs.io/en/latest/get_started.html#verify-the-installation)
2. Run inference with [RTMDet](https://arxiv.org/abs/2212.07784) (any other object detection method in MMDet can be tried the same way):

In [None]:
import mmcv
from mmdet.apis import init_detector, inference_detector

config_file = '../data/mmdet/rtmdet_tiny_8xb32-300e_coco.py'
checkpoint_file = '../data/mmdet/rtmdet_tiny_8xb32-300e_coco_20220902_112414-78e30dcc.pth'
img = mmcv.imread('../data/demo.jpg', channel_order='rgb')
model = init_detector(config_file, checkpoint_file, device='cpu')  # or device='cuda:0'

result = inference_detector(model, img)
print(result)
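
The printed result is verbose. In MMDetection 3.x, `inference_detector` returns a `DetDataSample` whose `pred_instances` field holds `bboxes`, `scores`, and `labels`. The sketch below shows one way to reduce those to a short, readable list. It runs on stand-in values so that it works without a model; the helper name `filter_detections` and the 0.3 threshold are assumptions, not part of MMDetection.

```python
def filter_detections(labels, scores, bboxes, class_names, score_thr=0.3):
    """Keep detections scoring at least score_thr, highest score first."""
    kept = []
    for label, score, box in zip(labels, scores, bboxes):
        if score >= score_thr:
            kept.append({"name": class_names[label], "score": score, "bbox": box})
    return sorted(kept, key=lambda d: d["score"], reverse=True)


# Stand-in values. With a real model they would be expected to come from:
#   inst = result.pred_instances
#   labels = inst.labels.tolist()
#   scores = inst.scores.tolist()
#   bboxes = inst.bboxes.tolist()
#   class_names = model.dataset_meta["classes"]
class_names = ["person", "bicycle", "car"]
labels = [0, 2, 0]
scores = [0.91, 0.55, 0.12]
bboxes = [[10, 20, 110, 220], [150, 80, 300, 160], [5, 5, 20, 40]]

detections = filter_detections(labels, scores, bboxes, class_names)
for det in detections:
    print(det["name"], det["score"], det["bbox"])
```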

## Start the main body of this project!

### Collect your scene datasets and run inference on them with object detection APIs
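
One possible way to organise the collected images, sketched here as an assumption rather than a project requirement, is one sub-folder per scene class. The hypothetical helper `list_scene_images` walks such a folder and returns `(image_path, scene_label)` pairs that can then be fed to any of the detectors below. The demo builds a throwaway directory so the snippet runs on its own.

```python
import tempfile
from pathlib import Path


def list_scene_images(root, extensions=(".jpg", ".jpeg", ".png")):
    """Return sorted (image_path, scene_label) pairs from root/<scene>/<image>."""
    samples = []
    for scene_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for img_path in sorted(scene_dir.iterdir()):
            if img_path.suffix.lower() in extensions:
                samples.append((str(img_path), scene_dir.name))
    return samples


# Demo on a temporary directory with made-up scene names.
with tempfile.TemporaryDirectory() as root:
    for scene, name in [("street", "0.jpg"), ("kitchen", "1.jpg"), ("street", "notes.txt")]:
        (Path(root) / scene).mkdir(exist_ok=True)
        (Path(root) / scene / name).touch()
    samples = list_scene_images(root)

scene_labels = [label for _, label in samples]
print(scene_labels)
```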

# mmdet API

In [7]:
from mmdet.registry import VISUALIZERS
import mmcv
from mmdet.apis import init_detector, inference_detector
import cv2

# Initialize the model and the visualizer (execute this block only once)
config_file = '../data/mmdet/rtmdet_tiny_8xb32-300e_coco.py'
checkpoint_file = '../data/mmdet/rtmdet_tiny_8xb32-300e_coco_20220902_112414-78e30dcc.pth'
model = init_detector(config_file, checkpoint_file, device='cpu')  # or device='cuda:0'
visualizer = VISUALIZERS.build(model.cfg.visualizer)
# dataset_meta is loaded from the checkpoint and
# then passed to the model in init_detector
visualizer.dataset_meta = model.dataset_meta

# show the results
for i in range(4):
    img = mmcv.imread(f'../data/{i}.jpg', channel_order='rgb')
    result = inference_detector(model, img)
    visualizer.add_datasample(
        'result',
        img,
        data_sample=result,
        draw_gt=False,
        wait_time=0,
    )
    visualizer.show()
    visualized_img = visualizer.get_image()
    # get_image() returns an RGB array, but cv2.imwrite expects BGR
    cv2.imwrite(f'../data/mmdet/{i}_result.jpg', cv2.cvtColor(visualized_img, cv2.COLOR_RGB2BGR))

Loads checkpoint by local backend from path: ../data/mmdet/rtmdet_tiny_8xb32-300e_coco_20220902_112414-78e30dcc.pth
The model and loaded state dict do not match exactly

unexpected key in source state_dict: data_preprocessor.mean, data_preprocessor.std



# YOLO API

In [4]:
from ultralytics import YOLO

# Create a new YOLO model from scratch
model = YOLO("yolo11.yaml")

# Load a pretrained YOLO model (recommended for training)
model = YOLO("yolo11n.pt")

# Train the model using the 'coco8.yaml' dataset for 3 epochs
results = model.train(data="coco8.yaml", epochs=3)

# Evaluate the model's performance on the validation set
results = model.val()

Run 'pip install torchvision==0.19' to fix torchvision or 'pip install -U torch torchvision' to update both.
For a full compatibility table see https://github.com/pytorch/vision#installation
Ultralytics 8.3.43  Python-3.8.20 torch-2.4.1 CPU (Intel Core(TM) i5-9400F 2.90GHz)
engine\trainer: task=detect, mode=train, model=yolo11n.pt, data=coco8.yaml, epochs=3, time=None, patience=100, batch=16, imgsz=640, save=True, save_period=-1, cache=False, device=None, workers=8, project=None, name=train7, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, a

train: Scanning E:\XQY\Object-Detection\data\yolo\datasets\coco8\labels\train.cache... 4 images, 0 backgrounds, 0 corrupt: 100%|██████████| 4/4 [00:00<?, ?it/s]
val: Scanning E:\XQY\Object-Detection\data\yolo\datasets\coco8\labels\val.cache... 4 images, 0 backgrounds, 0 corrupt: 100%|██████████| 4/4 [00:00<?, ?it/s]


Plotting labels to runs\detect\train7\labels.jpg... 
optimizer: 'optimizer=auto' found, ignoring 'lr0=0.01' and 'momentum=0.937' and determining best 'optimizer', 'lr0' and 'momentum' automatically... 
optimizer: AdamW(lr=0.000119, momentum=0.9) with parameter groups 81 weight(decay=0.0), 88 weight(decay=0.0005), 87 bias(decay=0.0)
Image sizes 640 train, 640 val
Using 0 dataloader workers
Logging results to runs\detect\train7
Starting training for 3 epochs...

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size


        1/3         0G      1.336      2.869      1.644         22        640: 100%|██████████| 1/1 [00:01<00:00,  1.77s/it]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00,  1.68it/s]

                   all          4         17      0.575       0.85      0.878      0.635






      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size


        2/3         0G      1.204      2.694      1.511         23        640: 100%|██████████| 1/1 [00:01<00:00,  1.58s/it]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00,  1.55it/s]

                   all          4         17      0.564       0.85      0.849      0.632






      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size


        3/3         0G      1.165      3.632      1.574         16        640: 100%|██████████| 1/1 [00:01<00:00,  1.52s/it]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00,  1.62it/s]

                   all          4         17      0.559       0.85      0.851      0.616






3 epochs completed in 0.002 hours.
Optimizer stripped from runs\detect\train7\weights\last.pt, 5.5MB
Optimizer stripped from runs\detect\train7\weights\best.pt, 5.5MB

Validating runs\detect\train7\weights\best.pt...
Ultralytics 8.3.43  Python-3.8.20 torch-2.4.1 CPU (Intel Core(TM) i5-9400F 2.90GHz)
YOLO11n summary (fused): 238 layers, 2,616,248 parameters, 0 gradients


                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00,  2.42it/s]


                   all          4         17      0.573       0.85      0.878      0.635
                person          3         10      0.563        0.6      0.594      0.273
                   dog          1          1      0.547          1      0.995      0.697
                 horse          1          2      0.541          1      0.995      0.674
              elephant          1          2       0.37        0.5      0.695      0.275
              umbrella          1          1       0.57          1      0.995      0.995
          potted plant          1          1      0.849          1      0.995      0.895
Speed: 3.4ms preprocess, 91.6ms inference, 0.0ms loss, 3.5ms postprocess per image
Results saved to runs\detect\train7
Ultralytics 8.3.43  Python-3.8.20 torch-2.4.1 CPU (Intel Core(TM) i5-9400F 2.90GHz)
YOLO11n summary (fused): 238 layers, 2,616,248 parameters, 0 gradients


val: Scanning E:\XQY\Object-Detection\data\yolo\datasets\coco8\labels\val.cache... 4 images, 0 backgrounds, 0 corrupt: 100%|██████████| 4/4 [00:00<?, ?it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 1/1 [00:00<00:00,  2.43it/s]


                   all          4         17      0.573       0.85      0.878      0.635
                person          3         10      0.563        0.6      0.594      0.273
                   dog          1          1      0.547          1      0.995      0.697
                 horse          1          2      0.541          1      0.995      0.674
              elephant          1          2       0.37        0.5      0.695      0.275
              umbrella          1          1       0.57          1      0.995      0.995
          potted plant          1          1      0.849          1      0.995      0.895
Speed: 0.0ms preprocess, 96.0ms inference, 0.0ms loss, 0.0ms postprocess per image
Results saved to runs\detect\train72


In [5]:
# Perform object detection on an image using the model
for i in range(4):
    results = model(f"../data/{i}.jpg")
    results[0].save(f'../data/yolo/{i}_result.jpg')

# Export the model to ONNX format
success = model.export(format="onnx")




image 1/1 e:\XQY\Object-Detection\code\..\data\0.jpg: 480x640 1 airplane, 7 benchs, 1 tv, 106.9ms
Speed: 3.0ms preprocess, 106.9ms inference, 0.0ms postprocess per image at shape (1, 3, 480, 640)

image 1/1 e:\XQY\Object-Detection\code\..\data\1.jpg: 480x640 4 persons, 3 motorcycles, 1 umbrella, 66.5ms
Speed: 3.0ms preprocess, 66.5ms inference, 9.2ms postprocess per image at shape (1, 3, 480, 640)

image 1/1 e:\XQY\Object-Detection\code\..\data\2.jpg: 480x640 3 chairs, 63.1ms
Speed: 2.0ms preprocess, 63.1ms inference, 0.0ms postprocess per image at shape (1, 3, 480, 640)

image 1/1 e:\XQY\Object-Detection\code\..\data\3.jpg: 448x640 6 persons, 3 traffic lights, 1 umbrella, 1 chair, 1 dining table, 85.3ms
Speed: 0.0ms preprocess, 85.3ms inference, 1.0ms postprocess per image at shape (1, 3, 448, 640)
Ultralytics 8.3.43  Python-3.8.20 torch-2.4.1 CPU (Intel Core(TM) i5-9400F 2.90GHz)

PyTorch: starting from 'runs\detect\train7\weights\best.pt' with input shape (1, 3, 640, 64

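The per-image summaries in the log (for example `4 persons, 3 motorcycles, 1 umbrella`) can also be recovered programmatically. In Ultralytics, each `Results` object exposes the predicted class ids as `boxes.cls` and an id-to-name dictionary as `names`. The sketch below counts objects per class from stand-in values shaped like those two fields; `count_by_class` is a hypothetical helper, not an Ultralytics function.

```python
from collections import Counter


def count_by_class(class_ids, names):
    """Count detections per class name from numeric class ids."""
    return dict(Counter(names[int(i)] for i in class_ids))


# Stand-in values. With a real model they would be expected to come from:
#   r = results[0]
#   class_ids = r.boxes.cls.tolist()   # floats such as 0.0, 3.0
#   names = r.names                    # e.g. {0: "person", ...}
names = {0: "person", 3: "motorcycle", 25: "umbrella"}
class_ids = [0.0, 0.0, 3.0, 0.0, 25.0, 3.0, 0.0, 3.0]

counts = count_by_class(class_ids, names)
print(counts)
```
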
# Faster R-CNN API (torchvision)

In [1]:
import torch
import torchvision
from torchvision.models.detection import FasterRCNN_ResNet50_FPN_Weights
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.transforms import functional as F
from PIL import Image

# 1. Load a Faster R-CNN model pretrained on the COCO dataset
# The weights argument selects which pretrained weights to load
weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT  # use the default (best available) weights
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=weights)

print(model)

FasterRCNN(
  (transform): GeneralizedRCNNTransform(
      Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
      Resize(min_size=(800,), max_size=1333, mode='bilinear')
  )
  (backbone): BackboneWithFPN(
    (body): IntermediateLayerGetter(
      (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
      (bn1): FrozenBatchNorm2d(64, eps=0.0)
      (relu): ReLU(inplace=True)
      (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
      (layer1): Sequential(
        (0): Bottleneck(
          (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): FrozenBatchNorm2d(64, eps=0.0)
          (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): FrozenBatchNorm2d(64, eps=0.0)
          (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): FrozenBatchNorm2d(256, eps=0.0)
          (relu): ReLU(

In [2]:

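# 2. If your task has a different number of classes than COCO (91, including background), replace the classifier
num_classes = 2  # e.g. 1 target class + 1 background class
# Get the number of input features of the classifier head
in_features = model.roi_heads.box_predictor.cls_score.in_features
# Replace the pretrained classifier head with a new one sized for your task's classes
# (the new head is randomly initialized, so fine-tune it before trusting its predictions)
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# 3. Put the model in evaluation mode (for inference)
# For training, call model.train() instead
model.eval()

# 4. Prepare an example input image tensor
# In practice, load your own image and preprocess it
# Here a fake tensor serves as an example (batch size = 1, 3 channels, 800x800)
# Note that the input here is laid out as (Batch, Channels, Height, Width)
dummy_image = torch.rand(1, 3, 800, 800)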
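
The `num_classes = 2` above counts the background as a class. torchvision's detection models reserve label 0 for background, so custom classes start at 1. A small sketch of that bookkeeping follows; `build_label_map` and the `"__background__"` key are illustrative names, not torchvision API.

```python
def build_label_map(class_names):
    """Map class names to integer labels, reserving 0 for background."""
    label_map = {"__background__": 0}
    for index, name in enumerate(class_names, start=1):
        label_map[name] = index
    return label_map


label_map = build_label_map(["person"])
num_classes = len(label_map)  # value to pass to FastRCNNPredictor
print(label_map, num_classes)
```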

In [None]:
img_path = "../data/3.jpg"
img = Image.open(img_path).convert("RGB")
preprocess = weights.transforms()
image_tensor = preprocess(img).unsqueeze(0) # Add batch dimension

with torch.no_grad(): 
    predictions = model(image_tensor) 
    print(predictions[0])

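# Structure of the output (for dummy_image, the results would be random or empty)
# boxes: [N, 4] Tensor; N is the number of detected objects and 4 are the box coordinates (x1, y1, x2, y2)
# labels: [N] Tensor, the class label of each bounding box
# scores: [N] Tensor, the confidence score of each bounding box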


{'boxes': tensor([[1.3278e+03, 1.4951e+03, 1.6784e+03, 1.6596e+03],
        [6.6823e+02, 7.0854e+02, 8.9868e+02, 7.9751e+02],
        [2.3777e+01, 1.2602e+03, 5.4252e+02, 1.4784e+03],
        [2.5740e+02, 7.9877e+02, 3.4375e+02, 8.8838e+02],
        [1.6363e+03, 7.4854e+02, 1.6630e+03, 8.1343e+02],
        [8.9966e+02, 8.9614e+02, 9.7760e+02, 1.0058e+03],
        [1.3277e+03, 8.8486e+02, 1.9712e+03, 1.6680e+03],
        [1.7997e+03, 7.2787e+02, 2.0464e+03, 8.2396e+02],
        [0.0000e+00, 9.1061e+02, 9.6811e+02, 1.6680e+03],
        [1.9098e+02, 1.5709e+03, 5.3074e+02, 1.6645e+03],
        [1.7703e+03, 6.9046e+02, 2.0895e+03, 8.4730e+02],
        [1.1131e+03, 1.5817e+03, 1.3007e+03, 1.6671e+03],
        [1.6759e+03, 6.9241e+02, 1.7498e+03, 8.0722e+02],
        [6.1619e+02, 6.7785e+02, 1.0083e+03, 8.0257e+02],
        [2.1501e+03, 1.2197e+03, 2.3735e+03, 1.6431e+03],
        [8.5105e+02, 1.5256e+03, 1.3394e+03, 1.6680e+03],
        [1.0981e+03, 5.5331e+02, 1.2884e+03, 8.0066e+02],
    