Import the libraries needed for the code to run successfully. **Please note that we developed our project on a dedicated server with multiple GPUs. Please adjust the system and CUDA settings here to match your environment.**

In [None]:
#
import os,sys,re,math,datetime,time,glob,random
# please adjust this setting accordingly for your environment
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torchvision
# import torchvision.transforms as transforms
from torchvision import datasets, transforms
from torchvision.models import vision_transformer
from torch.utils.data import DataLoader,random_split
from torch import nn, optim
import torch.nn.functional as F
import torch
import numpy as np
import torchvision.models as models
import random
# import matplotlib as plt
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
from ultralytics import YOLO

#
from tqdm import tqdm
import pandas as pd
import cv2
import moviepy.editor as moviepy

# **1. Introduction**

Bicycles are becoming a preferred means of transportation for many young people. According to the American Community Survey (ACS), roughly 780,000 people in America report commuting by bicycle, as it reduces their carbon footprint and is efficient during rush hour. According to data collected by Made in CA, 16% of Canadians cycle at least once a week, and bike ownership in Canada is 36%. In Montreal, the share of people who bike to work grew from 1.3% in 1996 to 3.6% in 2016. In Toronto, the share of cycling commuters grew from 1.1% to 2.7%. However, almost half of Canadians feel that it is too dangerous to cycle in their area, and on average there are around 74 cycling fatalities per year in Canada. Worldwide, around 41,000 cyclists die in road traffic-related crashes every year, according to the WHO. Cycling fatalities are most frequent during the evening rush hour: the largest share, 16%, occurred between 16:01 and 20:00. Environmental conditions that affect visibility, such as darkness, rain, or blinding sunlight, appear to have played a role in 21% of fatal cycling events.

<img src="https://i.postimg.cc/qqKP8BYm/2024-04-05-22-52-05.png"  width="350" height="300">    <img src="https://i.postimg.cc/vHFFG9gC/2024-04-05-22-52-25.png"  width="600" height="300">

Figure 1. Cyclist and Pedestrian Safety
<br><br>
With this reality in mind, the main goal of our project is to develop a lightweight cyclist detection program that performs real-time cyclist detection in a stream-in, stream-out manner with high accuracy. Numerous studies have addressed this topic, and object detection models are now very mature. However, existing work mainly focuses on predicting either the existence/location of cyclists or the direction in which they are moving. We believe that constantly raising alarms about a nearby cyclist, regardless of riding status, erodes the truck driver's attention. For example, when a truck is stopped at a red light at an intersection, such systems alert the driver every time a cyclist is nearby, even though the cyclist is typically not moving and is just waiting for the light. After receiving these alarms repeatedly, the driver may come to assume that nearby cyclists are always waiting rather than riding, and become less cautious. If a cyclist then starts riding, the result could be a tragedy. Our method not only raises alarms for nearby cyclists but also raises different alarms for different riding statuses, so the truck driver can easily tell when cyclists are riding and deserve more attention, thereby decreasing the risk of traffic accidents.

Moreover, our project is designed for (and works well in) this scenario but is not limited to it. Under Canadian law, riding a bike is prohibited in many places, such as some crosswalks and bridges, where riding could cause accidents. Our application can also be used in such settings to detect whether a person is riding the bike or just walking with it. By accurately identifying these behaviors, we can issue targeted warnings in real time, enhancing the safety of pedestrians and cyclists. These settings make the task harder because the two statuses look very similar; nevertheless, our novel classification method achieved relatively good performance on this task.

The proposed system contains two integrated parts: a customized YOLO object detection model and a classification model. The detection model takes in a frame and predicts bounding boxes of possible cyclists, each with a confidence value. We treat the confidence threshold as a hyperparameter and selected it based on empirical analysis. The classification model then takes the cropped bounding-box images and predicts whether each cyclist is riding or not. Finally, we label the results on the corresponding frame. An overview of our system is shown in Fig. 2.

<img src="https://i.postimg.cc/tJcK6Z2n/2024-04-05-22-49-37.png"  width=100%>

Figure 2: Overview of System
<br><br>

Overall, our contributions can be summarized in the following:

(1) We proposed an approach that incorporates the cyclist's stem information into the classifier model to provide extra information. We conducted extensive experiments comparing multiple popular CNN models and showed that our approach leads to significantly increased accuracy in the classification task across all models.

(2) We designed and developed a cyclist riding status detection system that integrates a detection model and a classification model and shows strong performance on the complicated cyclist status classification task.

(3) We proposed an approach that finds good-quality bounding boxes for cyclists in the unride class by minimizing the person-bike objective function.

(4) We curated around 10k high-quality cropped street-view images for the cyclist riding status classification task. We have published our dataset on Kaggle, so other researchers can use our project as a starting point.









## **1.1 Dataset Description**

We collected 97 online videos of cyclists riding or walking with a bike: 37 videos for the riding class, 55 videos for the unriding class, and 5 mixed videos containing both classes. Specifically, riding videos are videos in which cyclists are riding in most scenes, while unriding videos are videos in which cyclists are not riding most of the time.
### Data for training YOLO on a customized class
We used the data from [Cyclist-Detection-Dataset](https://gitlab03.wpi.edu/ctang5/cyclistdetectiondataset/-/tree/main/train) to train our YOLO model on a customized cyclist class. The dataset contains about 15k street-view images with cyclist bounding box information.

### Gathering cropped images for training the classifier model
We developed a pipeline to crop cyclist bounding-box images from raw videos. Our scripts take the raw videos and extract 3 frames per second; each frame is then passed to a pretrained YOLO detection model for bounding box prediction. With this approach we gathered around 5k cropped cyclist images, for the unriding status only, because its bounding box determination criteria are relatively relaxed. Please refer to section 1.2.2 Object Detection for a detailed explanation. To gather high-quality images of cyclists in riding status, we used a customized YOLO detection model to identify the cyclists directly.
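
The frame-sampling step can be sketched as follows. This is a minimal illustration that assumes a constant source frame rate; `sampled_frame_ids` is a hypothetical helper name used only here. It keeps one frame out of every `step`, where `step` is the integer part of the source frame rate divided by the target rate, and it guards against a step of zero when the source rate is lower than the target.

```python
def sampled_frame_ids(n_frames, src_fps, out_fps=3):
    """Return the indices of frames kept when sampling roughly `out_fps` frames per second."""
    # Keep one frame out of every `step`; never let the step drop to zero.
    step = max(1, int(src_fps / out_fps))
    return [i for i in range(n_frames) if i % step == 0]


# A 2-second clip recorded at 30 fps: every 10th frame is kept.
print(sampled_frame_ids(60, 30))
```

For a 30 fps source this keeps 3 frames per second; for frame rates that are not a multiple of the target, the effective rate is only approximately 3.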

After processing the raw videos with a different detection model for each class, we obtained around 30,000 riding images but only around 5,000 unriding images. The varying length of the videos is the primary reason: although we have more unriding videos than riding videos, most riding videos are much longer and include many similar or repeated scenes. We removed a substantial number of near-duplicate images from adjacent frames and manually curated around 5,000 high-quality images for the riding class. With the curated images, we introduce our cyclist riding status detection dataset of around 10k street-view images distributed evenly between the riding and unriding classes. We split this dataset into training, validation, and test sets, then trained and tested our classifier model on it. Besides this dataset, we also gathered some extra unseen cyclist videos from YouTube to test our model; please refer to the results sections for details.

A draft version of our dataset is now available at: https://www.kaggle.com/datasets/optmllab01/bike-riding-cls






Here is the shape of our dataset. There are 10,885 images in total, distributed as shown in the graphs below:

<img src="https://i.postimg.cc/NGPqSSBG/2024-04-05-22-49-21.png"  width="500" height="300"> <img src="https://i.postimg.cc/wTy8p2jX/2024-04-05-22-49-27.png"  width="400" height="300">

Figure 3: Data Structure & Distribution

Here are some data samples in our cropped dataset [Bike riding classification](https://www.kaggle.com/datasets/optmllab01/bike-riding-cls):

<img src="https://i.postimg.cc/1zZZMFPp/IMG-2339.jpg"  width="700" height="600">


Here are some frames from uncropped video:

<img src="https://i.postimg.cc/KvRFYFJg/IMG-2334.jpg"  width="700" height="500">






Figure 4: Data Sample

In [None]:
## The following functions implement the cyclist bounding box detection approach
## described in section 1.2.2 Object Detection.

from scipy.optimize import linear_sum_assignment

# This function takes a bounding box in xyxy format and returns its center coordinates
def get_bbox_center(bbox):
    x_center = (bbox[0] + bbox[2]) / 2
    y_center = (bbox[1] + bbox[3]) / 2
    return np.array([x_center, y_center])

# This function takes two bounding boxes and calculates the Euclidean distance between their centers
def get_bboxPair_dist(bboxA, bboxB, metric="euc"):
    center_bboxA, center_bboxB = get_bbox_center(bboxA), get_bbox_center(bboxB)
    if metric=="euc":
        return np.linalg.norm(center_bboxA - center_bboxB)
    # fail loudly instead of silently returning None for an unknown metric
    raise ValueError(f"Unsupported metric: {metric}")

# This function takes two bounding boxes and calculates their overlapping area
def get_bboxPair_overlap_area(bboxA, bboxB):
    xA = max(bboxA[0], bboxB[0]) # x1
    yA = max(bboxA[1], bboxB[1]) # y1
    xB = min(bboxA[2], bboxB[2]) # x2
    yB = min(bboxA[3], bboxB[3]) # y2
    interArea = max(0, xB - xA) * max(0, yB - yA)
    return interArea


# This function takes a list of person bounding boxes and a list of bike bounding
# boxes, then prepares the pairwise cost matrix based on the objective we proposed.
def prepare_cost_matrix(person_bbox_list, bike_bbox_list):
    num_persons = len(person_bbox_list)
    num_bikes = len(bike_bbox_list)
    cost_matrix = np.zeros((num_persons, num_bikes))
    # iterate over all the person and bike bboxes to fill the cost matrix
    for i, person in enumerate(person_bbox_list):
        for j, bike in enumerate(bike_bbox_list):
            dist = get_bboxPair_dist(person['xyxy'], bike['xyxy'])
            overlap = get_bboxPair_overlap_area(person['xyxy'], bike['xyxy'])
            # here convert the bi-obj optimization into one cost function, that is
            # to find the combo of human and bike objs that have the largest overlapping
            # area and the least center distance
            cost_matrix[i, j] = dist - overlap  # min distance and max overlapping

    return cost_matrix

# This function takes the bounding box lists of persons and bikes and returns the
# best combination based on our objective function; the returned bounding boxes are
# identified as bounding boxes of cyclists
def match_and_merge(person_bbox_list, bike_bbox_list):
    """
    Find the combo that has the min cost, then merge that combo into one obj; we also
    enlarge the bbox accordingly to cover both the person and the bike
    """
    cost_matrix = prepare_cost_matrix(person_bbox_list, bike_bbox_list)
    row_ind, col_ind = linear_sum_assignment(cost_matrix)

    merged_bbox_ind=0
    matched_pairs = []
    for i, j in zip(row_ind, col_ind):
        person = person_bbox_list[i]['xyxy']
        bike = bike_bbox_list[j]['xyxy']
        # note that here we use the xyxy format from yolo
        merged_box = [
            min(person[0], bike[0]),
            min(person[1], bike[1]),
            max(person[2], bike[2]),
            max(person[3], bike[3])
        ]
        matched_pairs.append({'bbox_id':merged_bbox_ind,
                              'person': i,
                              'bike': j,
                              'merged_box': merged_box})
        merged_bbox_ind+=1
    return matched_pairs

"""
Function that takes in a video and then extract n frames, for each frame we apply
our yolo model to process and then return the processed annotated images, videos,
and cropped bboxes of potential cyclists.
"""
def yolo_process_video(data_home_dir, img_out_dir, vid_in_dir, model, cls, yolo_imgsz=640, out_fps=3,vid_out_dir=None):

    vid_in_name,vid_in_ext = os.path.splitext(os.path.basename(vid_in_dir))

    # load video and start receiving output
    cap = cv2.VideoCapture(vid_in_dir)
    print(f"{cap.isOpened()}")
    # get specs of video
    fps = cap.get(cv2.CAP_PROP_FPS)
    # print("here")
    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    # extract part of the frames only to save computation and storage;
    # clamp to 1 so a source fps below out_fps does not cause a modulo-by-zero below
    skip_frames = max(1, int(fps / out_fps))
    ## output to mp4
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    vid_out_dir = os.path.join(img_out_dir ,vid_in_name) # output video to same folder as frames
    os.makedirs(vid_out_dir, exist_ok=True)
    video_out = cv2.VideoWriter(os.path.join(vid_out_dir,f'res_{vid_in_name}{vid_in_ext}'), fourcc, out_fps, (frame_width, frame_height))

    #### process each image with yolo
    frame_id = 0
    while cap.isOpened():
        ret,frame = cap.read()
        # print("cap ret frame read")
        if ret:
            # skip frames
            if frame_id % skip_frames == 0:
                frame_rgb = frame # cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                # save curr frame
                # inference
                results = model([frame_rgb], imgsz=yolo_imgsz, verbose=False)
                #
                # #
                for img_index, result in enumerate(results):
                    frame_folder_dir  = f'{vid_in_name}/res_img_{img_index}_frame_{frame_id}'
                    frame_output_filename = f'{cls}_res_yolo_vid_{vid_in_name}_img_{img_index}_fid_{frame_id}.jpg'
                    frame_folder_dir = os.path.join(img_out_dir, frame_folder_dir)
                    os.makedirs(frame_folder_dir, exist_ok=True)

                    # save yolo annotated result
                    result.save(filename=f"{frame_folder_dir}/{frame_output_filename}")
                    result_array = result.plot(labels=False, probs=False, boxes=False)
                    # retrieve all the bboxes predicted by yolo; here, only classes labeled as person or bike are gathered
                    # these results are stored in dicts which will then be further processed
                    person_bbox_list, bike_bbox_list = [],[]
                    bbox_cls_dict = result.names
                    boxes_labs = result.boxes.cls.tolist()
                    boxes_conf = result.boxes.conf.tolist()
                    for box_id, (box, box_label, box_conf) in enumerate(zip(result.boxes.xyxy, boxes_labs,boxes_conf)):
                        bbox_path=f"{frame_folder_dir}/box_id_{box_id}_{bbox_cls_dict[box_label]}.jpg"
                        x1,y1,x2,y2 = [int(_) for _ in box.tolist()]
                        x1, y1 = max(x1, 0), max(y1, 0)
                        x2, y2 = min(x2, frame_width), min(y2, frame_height)
                        # print(x1,y1,x2,y2)
                        # cropped_img = frame_rgb[y1:y2, x1:x2]
                        # cv2.imwrite(bbox_path, cropped_img)
                        # store the current bounding box
                        if box_label==0:
                            person_bbox_list.append({"bbox_ind":box_id, "conf":box_conf, "xyxy":[x1,y1,x2,y2]})
                        if box_label==1:
                            bike_bbox_list.append({"bbox_ind":box_id, "conf":box_conf, "xyxy":[x1,y1,x2,y2]})


                    # for each image label out the merged person and bike bounding box
                    person_bike_merged_bbox_list = match_and_merge(person_bbox_list, bike_bbox_list)
                    for pb_bbox_ind, pb_bbox in enumerate(person_bike_merged_bbox_list):
                        pb_bbox_path=f"{frame_folder_dir}/{cls}_res_cropbp_vid_{vid_in_name}_img_{img_index}_fid_{frame_id}_bbox_{pb_bbox_ind}.jpg"
                        x1,y1,x2,y2 = pb_bbox['merged_box']
                        label=f"{cls}_bike_person"
                        cv2.rectangle(result_array, (x1, y1), (x2, y2), (0, 255, 0), 2)
                        # Put the label near the rectangle
                        cv2.putText(result_array, label, (x1, max(y1 - 10, 0)), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
                        # save boxes to local
                        pb_bbox_img = result_array[y1:y2, x1:x2]
                        cv2.imwrite(pb_bbox_path, pb_bbox_img)


                    video_out.write(result_array) # f'{cls}_res_img_{img_index}_fid_{frame_id}.jpg'
                    cv2.imwrite(f"{frame_folder_dir}/{cls}_res_yolobp_vid_{vid_in_name}_img_{img_index}_fid_{frame_id}.jpg", result_array)

            frame_id += 1
        else:
            break

    # print("here end")
    cap.release()
    video_out.release()

In [None]:
# This part of the code generates cropped bounding boxes of humans and bikes
# to finetune the classifier. In our dataset, we have scenes with ride, unride and
# mixed. Ride videos are videos in which cyclists are riding in most scenes,
# while unride videos are videos in which cyclists are not riding in most scenes.
# We use this method to gather cropped images mainly for the unride class.
# Please refer to section 1.2.2 Object Detection for more details.

model = YOLO('./model/yolov8m.pt')
#
cls_list=['ride', 'unride', 'mixed']
# cls=cls_list[0]
for cls in cls_list:
    task_name = "bike_data"
    data_home_dir="/home/yuliang/1517_proj/ultralytics/datasets"
    img_out_dir=f"/home/yuliang/1517_proj/ultralytics/datasets/processed_data/{task_name}_img/{cls}"
    vid_out_dir=f"/home/yuliang/1517_proj/ultralytics/datasets/processed_data/{task_name}_vid/{cls}"
    os.makedirs(img_out_dir, exist_ok=True)
    os.makedirs(vid_out_dir, exist_ok=True)

    raw_data_dir = f"/home/yuliang/1517_proj/ultralytics/datasets/raw_data/{task_name}_vid/{cls}/"
    for file in os.listdir(raw_data_dir):
        if file.endswith('.mp4'):
            vid_in_dir = os.path.join(raw_data_dir, file)
            yolo_process_video(data_home_dir, img_out_dir, vid_in_dir, model=model, cls=cls, yolo_imgsz=640, out_fps=3)

In [None]:
# We train a YOLO model on a customized cyclist class using the bounding box annotations of the CDD dataset.

# import os
# os.environ["CUDA_VISIBLE_DEVICES"] = "1"

customized_yolo_model =YOLO('/home/yuliang/1517_proj/ultralytics/checkpoint/yolov8s.pt')
results = customized_yolo_model.train(data='/home/yuliang/1517_proj/ultralytics/datasets/cyclist.yaml',
                      epochs=200,
                      imgsz=640,
                      patience=30,
                      batch=64,
                      lr0=0.01,
                      lrf=0.01, # final lr = (lr0 * lrf)
                      weight_decay=5e-4,
                      dropout=0.1,
                      pretrained=True,
                      seed=0,
                      box=8.5,
                      device="cuda:1",
                      name="detect_s",
                      project="/home/yuliang/1517_proj/ultralytics/checkpoint/yolov8n")

## **1.2 Model structure**




### **1.2.1 Overview**


This section details the architecture and workflow of our system for detecting cyclists and classifying their biking status in images. Our system integrates two main components: an object detection model based on YOLOv8 [2], and a subsequent classification model that determines whether detected individuals are actively riding their bicycles. We process the input video by extracting 'n' frames per second; each frame is passed to the detection model, which identifies potential cyclists through a bounding box prediction task. Each bounding box identified within a frame is then passed to our classifier model, which predicts whether the observed cyclist is in riding or unriding status, a binary classification task.
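
The two-stage flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: `annotate_frame`, `fake_detect`, and `fake_classify` are hypothetical names, the detector and classifier are passed in as plain callables so the sketch runs without model weights, and the default threshold of 0.5 is a placeholder, not the value selected in our experiments.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]      # (x1, y1, x2, y2) in pixels
Detection = Tuple[Box, float]        # bounding box and detector confidence


def annotate_frame(
    frame,
    detect: Callable[[object], List[Detection]],
    classify: Callable[[object, Box], str],
    conf_threshold: float = 0.5,     # placeholder value, tuned empirically in practice
) -> List[Tuple[Box, str]]:
    """Run detection, drop low-confidence boxes, then classify each remaining box."""
    labelled = []
    for box, conf in detect(frame):
        if conf < conf_threshold:
            continue                 # skip boxes the detector is unsure about
        labelled.append((box, classify(frame, box)))
    return labelled


# Stand-in models so the sketch runs without any weights.
def fake_detect(frame):
    return [((10, 20, 110, 220), 0.91), ((300, 40, 340, 90), 0.22)]


def fake_classify(frame, box):
    return "ride"


print(annotate_frame(None, fake_detect, fake_classify))
```

Keeping the detector and classifier behind simple callables makes it easy to swap either model without touching the per-frame loop.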

### **1.2.2 Object Detection**

We adapted the implementation of the YOLOv8n detection model from Ultralytics, a lightweight deep learning model (3.2M parameters) optimized for real-time object detection. The performance and inference time of this model align with our ultimate goal of supporting a stream-in, stream-out scenario for real-time classification tasks.

Initially, we used the vanilla pretrained YOLOv8 model, which cannot detect a cyclist class. To address this issue, for each image we search for all bounding boxes of persons and bikes. We then optimize for person-bike combinations that minimize their center distance and maximize their overlapping area, as in formula (1), where D is a function that calculates the center distance of two objects and A is a function that calculates the overlapping area of two objects.

$$
\begin{equation}
\underset{\substack{i \in |P|, j \in |B| \\ |b| = |B|}}{\arg\min} \left( D(p_i, b_j) - A(p_i, b_j) \right) \tag{1}
\end{equation}
$$


An illustration of this process is shown in Fig. 5.
<br><br>
<img src="https://i.postimg.cc/MKcXjm3v/bbox-opt.png" width=75%>

Fig.5 : process of finding person-bike combinations
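
As a small worked example of formula (1), the sketch below scores every one-to-one person-bike assignment on toy boxes and keeps the cheapest one. It is a simplified stand-in: the notebook code relies on SciPy's `linear_sum_assignment`, whereas this version brute-forces the assignments with `itertools.permutations` to stay dependency-free, which is only practical for a handful of boxes. The helper names and box coordinates are made up for illustration.

```python
from itertools import permutations
from math import hypot


def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def center_distance(a, b):           # D in formula (1)
    (ax, ay), (bx, by) = center(a), center(b)
    return hypot(ax - bx, ay - by)


def overlap_area(a, b):              # A in formula (1)
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, w) * max(0, h)


def best_pairs(persons, bikes):
    """Brute-force the one-to-one assignment with minimum total cost.

    Assumes len(persons) >= len(bikes); returns a list of (person, bike) index pairs.
    """
    best, best_cost = None, float("inf")
    for chosen in permutations(range(len(persons)), len(bikes)):
        cost = sum(
            center_distance(persons[i], bikes[j]) - overlap_area(persons[i], bikes[j])
            for j, i in enumerate(chosen)
        )
        if cost < best_cost:
            best, best_cost = [(i, j) for j, i in enumerate(chosen)], cost
    return best


persons = [(100, 50, 160, 250), (400, 60, 450, 240)]   # toy xyxy boxes
bikes = [(90, 150, 200, 260)]
print(best_pairs(persons, bikes))    # person 0 overlaps the bike, person 1 does not
```

Note that the distance is measured in pixels while the overlap is measured in squared pixels, so for overlapping boxes the overlap term usually dominates the cost.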

Upon experimentation, we observed that the quality of the bounding boxes produced by this method was relatively poor. Specifically, the method could misidentify the combination of a standing bike and a nearby pedestrian as a cyclist, often linking a bike to a random nearby person instead of the actual cyclist. Investigating this issue, we found that since our videos are captured with a single camera, we have no depth information. People who stand closer to the camera appear larger in the image and tend to overlap more with the bike than the actual cyclist does, which leads to such individuals being incorrectly identified as cyclists. For instance, Fig. 6 (a) and Fig. 6 (b) show that when the person in the red shirt stands closer to the camera, she has the same center distance to the bike but a larger overlapping area with it than the actual biker, causing the incorrect identification.
<br><br>
<img src="https://i.postimg.cc/T15cPgrW/demo2-opt-gif.gif" width=55%>

Fig.6 (a): gif illustration of misclassification caused by the lack of depth information
<br><br>
<img src="https://i.postimg.cc/Njy8FYYr/frames.png" width=65%>

Fig.6 (b): image illustration of misclassification caused by the lack of depth information
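
The failure mode can be reproduced numerically. In the sketch below, which uses made-up box coordinates and a hypothetical `cost` helper, the rider and a bystander standing closer to the camera share the same box center, so their center distance to the bike is identical; the bystander's larger box overlaps the bike more and therefore receives the lower cost under formula (1).

```python
def cost(person, bike):
    """Cost from formula (1): center distance minus overlap area, boxes in xyxy format."""
    pcx, pcy = (person[0] + person[2]) / 2, (person[1] + person[3]) / 2
    bcx, bcy = (bike[0] + bike[2]) / 2, (bike[1] + bike[3]) / 2
    dist = ((pcx - bcx) ** 2 + (pcy - bcy) ** 2) ** 0.5
    w = min(person[2], bike[2]) - max(person[0], bike[0])
    h = min(person[3], bike[3]) - max(person[1], bike[1])
    return dist - max(0, w) * max(0, h)


bike = (200, 200, 320, 280)          # center (260, 240)
rider = (240, 120, 280, 260)         # actual cyclist, center (260, 190)
bystander = (180, 80, 340, 300)      # closer to the camera, so larger; center (260, 190)

# The bystander's cost is lower, so the matcher pairs the bike with the wrong person.
print(cost(rider, bike), cost(bystander, bike))
```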
<br><br>
Given this limitation, we determined that this approach was not viable for sourcing high-quality images of riding cyclists because of the increased manual selection workload. However, it proved effective for generating quality images of non-riding scenarios, such as pedestrians near bikes and people standing by or holding bikes. Therefore, we retained this method to generate cropped images of cyclists, from which we manually selected images for the unriding class.

To gather high-quality cropped images for the riding class, we opted to customize a YOLOv8 detection model for our specific cyclist class. We used the CyclistDetectionDataset, which primarily features street-view images with cyclist bounding boxes. This dataset contains around 13k training, 1k testing, and 500 validation images, providing a sufficient data source for our goal. By training the pretrained vanilla YOLOv8n model on this dataset, we gave it the ability to predict our custom class, cyclist. During training, we used a combination of bounding box prediction loss and classification loss, with more weight on bounding box prediction. We selected our best model, with a bounding box loss of 1.396 and a classification loss of 0.64411, which produced generally accurate results in our experiments.
<br><br>
<img src="https://i.postimg.cc/02sYCdVN/system-r1.png" width="60%" >

Fig. 7: left part of system

### **1.2.3 Classifier model**

Classifying cycling status from images alone is a significant challenge because riding and non-riding postures look similar: both are characterized by a person's proximity to and significant overlap with a bicycle. Given factors such as image resolution and background complexity, this similarity often makes it difficult, even for humans, to distinguish between the two states. Current popular models have proven their capacity on more complicated tasks; however, they did not show satisfactory results in our experiments.

Our investigation of hundreds of cyclist videos revealed that the body stem information appears to correlate with the biking status.
In light of this discovery, and as a novelty of our project, we incorporated person stem information into the classifier model, as we believe this extra information could help the model better understand what is happening in the image and potentially make use of the location of arms and legs.

<img src="https://i.postimg.cc/257C0V7q/demo3-stem-gif.gif" width="50%" >

Fig.8 (a): Stem information of a biker

Our classifier contains 3 integrated parts, as shown in Fig. 8 (b). First, a CNN model learns the image representations, and a PoseNet model identifies body key points of the cyclist, which are then further processed by an MLP to derive a meaningful representation. Next, the CNN output is flattened and concatenated with the keypoint representation. This combined vector is then fed into our MLP classifier to perform the binary classification.

<img src="https://i.postimg.cc/GpqGLWR8/model1.png" width="70%" >

Fig.8 (b): Structure of classifier model with 3 integrated parts
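
The fusion of the three parts can be sketched in PyTorch as below. This is a simplified illustration, not the trained architecture: `FusionClassifier` is a hypothetical name, the tiny CNN stands in for whichever backbone is used, all layer sizes are arbitrary, and the keypoints are assumed to be precomputed by a pose model as 17 (x, y) pairs per image (the COCO keypoint layout).

```python
import torch
from torch import nn


class FusionClassifier(nn.Module):
    """Image branch + keypoint branch, concatenated and fed to an MLP head."""

    def __init__(self, num_keypoints=17, kp_dim=32, num_classes=2):
        super().__init__()
        # Small stand-in CNN; any image backbone could take its place.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),      # -> (N, 32, 4, 4)
            nn.Flatten(),                 # -> (N, 512)
        )
        # MLP that embeds the flattened (x, y) keypoint coordinates.
        self.kp_mlp = nn.Sequential(
            nn.Linear(num_keypoints * 2, kp_dim), nn.ReLU(),
        )
        # MLP classifier on the concatenated representation.
        self.head = nn.Sequential(
            nn.Linear(512 + kp_dim, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, images, keypoints):
        img_feat = self.cnn(images)                     # (N, 512)
        kp_feat = self.kp_mlp(keypoints.flatten(1))     # (N, kp_dim)
        return self.head(torch.cat([img_feat, kp_feat], dim=1))


model = FusionClassifier()
logits = model(torch.rand(4, 3, 224, 224), torch.rand(4, 17, 2))
print(logits.shape)
```

The adaptive pooling layer fixes the size of the image feature vector, so the head does not depend on the exact input resolution.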


# **2. Model Training and Comparison**

## **2.1 Preprocessing and Load Image Dataset**
At this stage, we are gathering cropped images from a previous part of our project, segmenting them into training, validation, and test sets. We utilized ImageFolder to load the data, and each image was transformed to a size of 224x224 pixels. Given that the images were cropped using bounding boxes, some are rectangular rather than square. To minimize image distortion and enhance the likelihood that the YOLO model could accurately extract stem information, we resized the images to maintain their aspect ratio, ensuring one dimension was 224 pixels. We then padded the images with zeros to achieve a uniform size of 224x224 pixels.

It's important to note that although the YOLO pose model is trained on 640x640 data, some images in our dataset do not possess such high resolution. Furthermore, due to limited computational resources, our hardware is unable to process 640x640 size images with complex neural network models. Therefore, we tested our model with 224x224 size images.

In [None]:
from PIL import Image, ImageOps

class ResizeAndPad:
    def __init__(self, desired_size):
        self.desired_size = desired_size

    def __call__(self, img):
        # Resize the image, maintaining aspect ratio, so its longer side equals
        # desired_size; the shorter side is then padded up to desired_size below
        aspect_ratio = img.width / img.height
        if img.width >= img.height:  # Width is the longer dimension or they are equal
            new_width = self.desired_size
            new_height = max(1, int(self.desired_size / aspect_ratio))
        else:  # Height is the longer dimension
            new_height = self.desired_size
            new_width = max(1, int(self.desired_size * aspect_ratio))

        # Use Image.Resampling.LANCZOS for high-quality downsampling
        img = img.resize((new_width, new_height), Image.Resampling.LANCZOS)

        # Calculate padding
        delta_width = self.desired_size - new_width
        delta_height = self.desired_size - new_height
        padding = (delta_width // 2, delta_height // 2, delta_width - (delta_width // 2), delta_height - (delta_height // 2))

        # Pad the resized image
        img = ImageOps.expand(img, padding, fill=0)  # Fill is the color for padding, 0 means black

        return img



Define a transformation that resizes and pads images to 224x224 pixels and converts them to tensors (`ToTensor` scales pixel values to [0, 1]). Then, load the training, validation, and test datasets with these settings, ensuring all images are consistently formatted for model training and evaluation.

In [None]:
# Now integrate this into your transforms.Compose
transform = transforms.Compose([
    ResizeAndPad(224),  # Custom transform to resize and pad
    transforms.ToTensor()
])

base_dir = 'F:/UT_MIE/MIE1517/Project/finallllll/finallllll'

# Load each dataset separately
train_dataset = datasets.ImageFolder(root=f'{base_dir}/train', transform=transform)
val_dataset = datasets.ImageFolder(root=f'{base_dir}/val', transform=transform)
test_dataset = datasets.ImageFolder(root=f'{base_dir}/test', transform=transform)

# Display the shape of each dataset
print(f"train: {len(train_dataset)}, val: {len(val_dataset)}, test: {len(test_dataset)}")


The show_random_images_with_labels function displays a selection of transformed images alongside their labels, demonstrating that the images have been correctly transformed and labeled, as illustrated in the picture below. This visualization aids in verifying the accuracy of the image preprocessing and labeling process. The images are loaded properly!

<img src="https://i.postimg.cc/m21b7JvX/show-random-transformed-images.png"  width="1000" height="180">

Figure 9: Transformed Images


In [None]:
def show_random_images_with_labels(dataset, num_images=5):
    # Randomly select `num_images` samples from the dataset
    selected_indices = random.sample(range(len(dataset)), num_images)
    selected_samples = [dataset[i] for i in selected_indices]

    # Create a figure for plotting
    fig, axes = plt.subplots(1, num_images, figsize=(15, 3))
    for i, (image, label) in enumerate(selected_samples):
        # Convert image from (C, H, W) tensor to (H, W, C) numpy array for plotting
        image_np = image.numpy().transpose((1, 2, 0))
        # The transform above applies only ToTensor (no Normalize), so pixel values
        # are already in [0, 1] and no un-normalization is needed
        image_np = np.clip(image_np, 0, 1)  # Clip values to be between 0 and 1

        # Plot
        if num_images == 1:
            ax = axes
        else:
            ax = axes[i]
        ax.imshow(image_np)
        ax.axis('off')  # Turn off axis numbers and ticks
        if label == 1:
            ax.set_title(f"Label: Unride({label})")  # Set the title to the image's label
        else:
            ax.set_title(f"Label: Ride({label})")  # Set the title to the image's label

    plt.show()

# Assuming your `train_dataset` is already loaded and transformed
show_random_images_with_labels(train_dataset)

The count_labels_by_folder function is designed to display the distribution of the dataset by showing how many images in the training, validation, and test datasets belong to the ride and unride classes. This function is crucial for ensuring that the dataset used for training is balanced, allowing for a fair representation of both classes and potentially leading to more accurate and generalized model performance.

In [None]:
def count_labels_by_folder(datasets):
    # Initialize counters for each dataset
    counts = {
        'Train': {'ride': 0, 'unride': 0},
        'Validation': {'ride': 0, 'unride': 0},
        'Test': {'ride': 0, 'unride': 0}
    }

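    for name, dataset in datasets.items():
        # ImageFolder stores (path, class_index) pairs in dataset.imgs
        for _, label in dataset.imgs:
            # Use the class folder name rather than the full path, so a parent
            # directory whose name contains "ride" cannot skew the counts
            class_name = dataset.classes[label]
            if 'unride' in class_name:
                counts[name]['unride'] += 1
            elif 'ride' in class_name:
                counts[name]['ride'] += 1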

    return counts

# Assuming train_dataset, val_dataset, and test_dataset are ImageFolder instances
datasets = {
    'Train': train_dataset,
    'Validation': val_dataset,
    'Test': test_dataset
}

# Count and print the labels
label_counts = count_labels_by_folder(datasets)
for dataset_name, counts in label_counts.items():
    print(f"{dataset_name}: Ride: {counts['ride']}, Unride: {counts['unride']}")


In this section, we compare the performance of various deep learning models in predicting ride and unride classes. Additionally, we'll explore whether incorporating stem or pose information into the models enhances their predictive accuracy.

## **2.2 Models Without Stem**

Since our project is a binary classification task, we use accuracy and the F1 score as performance measures, and we plot the confusion matrix to view detailed classification results.

Initially, we experimented with three models without stem information: a simple CNN model, a SqueezeNet model, and a ResNet18 model.


In [None]:
def set_seed(seed_value=42, use_cuda=True):
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    if use_cuda and torch.cuda.is_available():
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value)


To better understand the training process and interpret the validation results, we use the 'show_misclassified_images' function. It displays images that were incorrectly classified, allowing us to visually assess where and why our model may be making mistakes.

In [None]:
def show_misclassified_images(misclassified_samples, num_images=30):
    samples = misclassified_samples[:num_images]
    if not samples:
        # Nothing was misclassified, so there is nothing to plot
        print("No misclassified images to display.")
        return

    # Calculate the number of rows needed to display the images (5 per row)
    num_rows = math.ceil(len(samples) / 5)
    # squeeze=False keeps `axes` two-dimensional even when there is a single row
    fig, axes = plt.subplots(num_rows, 5, figsize=(20, num_rows * 4), squeeze=False)  # Adjust the size as needed

    # Flatten the axes array for easy indexing
    axes = axes.flatten()

    for i, (image, label, pred) in enumerate(samples):
        image_np = image.numpy().transpose((1, 2, 0))
        mean = np.array([0.485, 0.456, 0.406])
        std = np.array([0.229, 0.224, 0.225])
        image_np = std * image_np + mean  # Unnormalize
        image_np = np.clip(image_np, 0, 1)  # Clip values to be between 0 and 1

        ax = axes[i]
        ax.imshow(image_np)
        ax.set_title(f"Label: {int(label)}\nPred: {int(pred.squeeze().item())}", fontsize=10)
        ax.axis('off')

    # Hide any unused subplots if the number of images is not a multiple of 5
    for j in range(len(samples), num_rows * 5):
        axes[j].axis('off')

    plt.subplots_adjust(wspace=0.4, hspace=0.6)  # Adjust spacing between images
    plt.show()

To train our models, we used 'BCEWithLogitsLoss', which suits the binary ride/unride classification, together with the Adam optimizer. Our training function reports the loss, accuracy, precision, recall, and F1 score on the training and validation sets for each epoch, and it can plot the performance curves, a confusion matrix, and misclassified validation images for a clearer understanding of model performance.
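
As a rough illustration of what this loss computes, the sketch below evaluates binary cross-entropy directly on a raw logit for a single sample and compares it with the two-step version (sigmoid first, then cross-entropy). It uses only the standard library; the helper names and the sample values are assumptions made for illustration, while the training code itself relies on PyTorch's `nn.BCEWithLogitsLoss`.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy computed from a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def bce_on_probability(prob, target):
    """Plain binary cross-entropy computed from a probability."""
    return -(target * math.log(prob) + (1 - target) * math.log(1 - prob))

logit, target = 1.5, 1.0                      # hypothetical model output and label
loss_from_logit = bce_with_logits(logit, target)
loss_from_prob = bce_on_probability(sigmoid(logit), target)
predicted_label = round(sigmoid(logit))       # same thresholding idea as in training

print(loss_from_logit, loss_from_prob, predicted_label)
```

Both routes give the same loss value; working on the logit avoids taking the logarithm of a probability that has been rounded to exactly 0 or 1.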

Accuracy is the most intuitive performance measure; it represents the ratio of correctly predicted observations to the total number of observations. However, accuracy alone can be misleading if the data is imbalanced, as it does not distinguish between the types of errors our model makes. To gain deeper insight, we utilize the F1 Score, which is the harmonic mean of precision and recall. This score is particularly useful in scenarios where class distribution is uneven. Finally, we visualize our results with a confusion matrix, enabling us to observe instances of true positives, true negatives, false positives, and false negatives.
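
The relationship between these metrics can be sketched from confusion-matrix counts. The example below is a minimal illustration on invented toy labels, with 1 standing for unride as in the label convention used earlier; the training function itself relies on scikit-learn's `f1_score`, `precision_score`, and `recall_score`.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for 0/1 labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical imbalanced data: 8 ride (0) samples and 2 unride (1) samples
y_true = [0] * 8 + [1] * 2
always_ride = [0] * 10                      # a model that never predicts unride
metrics = binary_metrics(y_true, always_ride)
print(metrics)
```

In this toy case the accuracy looks reasonable while the F1 score is zero, which is the kind of discrepancy that motivates reporting both.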

In [None]:
# Function that trains the model
from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix

def train_model(train_dataset, val_dataset, model, num_epochs, bsz, lr, patience,
                weight_decay=0, use_gpu=True, plot_=False, confusion = True, misclass = True):
    set_seed()
    #
    device = torch.device("cuda" if torch.cuda.is_available() and use_gpu else "cpu")
    print(f"curr device: {device}")

    # prepare data loader
    train_loader = DataLoader(train_dataset, batch_size=bsz, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=bsz, shuffle=False)
    # move the model to the selected device (no-op if it is already there)
    model.to(device)
    # loss function and optimizer
    criterion = nn.BCEWithLogitsLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    #
    # early stop
    best_val_loss = float('inf')
    epochs_no_improve=0

    # stats
    train_corr_list = np.full(num_epochs, np.nan)
    train_loss_list = np.full(num_epochs, np.nan)
    val_corr_list = np.full(num_epochs, np.nan)
    val_loss_list = np.full(num_epochs, np.nan)
    train_f1_list = np.full(num_epochs, np.nan)
    val_f1_list = np.full(num_epochs, np.nan)
    train_precision_list = np.full(num_epochs, np.nan)
    train_recall_list = np.full(num_epochs, np.nan)
    val_precision_list = np.full(num_epochs, np.nan)
    val_recall_list = np.full(num_epochs, np.nan)

    best_acc_epoch = 0  # Track the epoch with the highest validation accuracy
    best_val_accuracy = 0

    # train loop
    for epoch in range(num_epochs):
        model.train()
        train_loss, train_correct = 0.0,0
        train_bsz, val_bsz = 0,0
        train_preds, train_labels_list = [], []
        with tqdm(enumerate(train_loader), total=len(train_loader), desc=f"Epoch {epoch+1}/{num_epochs}") as t:
            for batch_ind, (inputs, labels) in t:
                inputs, labels = inputs.to(device), labels.to(device)
                optimizer.zero_grad()
                # forward
                outputs = model(inputs)
                loss = criterion(outputs, labels.unsqueeze(1).float())
                # backward
                loss.backward()
                optimizer.step()
                # stats
                train_loss += loss.item()
                # binary classification case
                preds = torch.round(torch.sigmoid(outputs))  # Sigmoid, then round to 0/1
                # squeeze(1) drops only the output dimension, so a final batch of size 1 still works
                train_correct += (preds.squeeze(1) == labels).sum().item()
                train_bsz += labels.size(0)

                # Collecting labels and predictions for metrics calculation
                train_labels_list.extend(labels.cpu().numpy())
                train_preds.extend(preds.squeeze(1).detach().cpu().numpy())

        avg_train_acc = train_correct / train_bsz
        avg_train_loss = train_loss / len(train_loader)
        train_loss_list[epoch] = avg_train_loss
        train_corr_list[epoch] = avg_train_acc
        train_f1 = f1_score(train_labels_list, train_preds)
        train_f1_list[epoch] = train_f1
        train_precision = precision_score(train_labels_list, train_preds)
        train_precision_list[epoch] = train_precision
        train_recall = recall_score(train_labels_list, train_preds)
        train_recall_list[epoch] = train_recall


        # val per epoch
        val_labs, val_preds = [],[]
        #
        val_loss, val_correct = 0.0,0
        misclassified_samples = []

        model.eval()
        with torch.no_grad():
            for inputs, labels in val_loader:
                inputs, labels = inputs.to(device), labels.to(device)
                outputs = model(inputs)
                loss = criterion(outputs, labels.unsqueeze(1).float())
                val_loss += loss.item()

                # stats and confusion matrix
                # binary cls case
                preds = torch.round(torch.sigmoid(outputs))

                misclassified = (preds.squeeze(1) != labels)
                for i, mis in enumerate(misclassified):
                    if mis:
                        misclassified_samples.append((inputs[i].cpu(), labels[i].cpu(), preds[i].cpu()))

                # print(preds.shape, preds.squeeze(1))
                val_correct += (preds.squeeze(1) == labels).sum().item()
                val_bsz += labels.size(0)
                # prepare pres and labs for cf matrix
                val_labs.extend(labels.cpu().numpy())
                val_preds.extend(preds.squeeze(1).cpu().numpy())


        avg_val_acc = val_correct / val_bsz
        avg_val_loss = val_loss / len(val_loader)
        val_loss_list[epoch]=avg_val_loss
        val_corr_list[epoch] = avg_val_acc
        val_f1 = f1_score(val_labs, val_preds)
        val_f1_list[epoch] = val_f1
        val_precision = precision_score(val_labs, val_preds)
        val_precision_list[epoch] = val_precision
        val_recall = recall_score(val_labs, val_preds)
        val_recall_list[epoch] = val_recall

        ## log info
        print(f'epoch {epoch+1}/{num_epochs} | train | loss: {avg_train_loss:.6f}, acc: {avg_train_acc:.4f}, Precision: {train_precision:.4f}, Recall: {train_recall:.4f}, f1: {train_f1:.4f}')
        print(f'epoch {epoch+1}/{num_epochs} | val | loss: {avg_val_loss:.6f} acc: {avg_val_acc:.4f}, Precision: {val_precision:.4f}, Recall: {val_recall:.4f}, f1: {val_f1:.4f}')


        if avg_val_acc > best_val_accuracy:
            best_val_accuracy = avg_val_acc
            best_acc_epoch = epoch
            # Store labels and predictions for the best epoch
            best_val_labels_list = val_labs.copy()
            best_val_preds_list = val_preds.copy()

        ## early stop based on val dataset result
        # check if val loss improved
        if avg_val_loss < best_val_loss:
            best_val_loss = avg_val_loss
            epochs_no_improve = 0
        else:
            epochs_no_improve += 1
            print(f"no improvement in val loss for {epochs_no_improve} epochs.")
            if epochs_no_improve == patience:
                print(f"patience exceeded, early stop triggered at epoch {epoch+1}")
                break
    print(f"\nBest val corr observed at {np.nanargmax(val_corr_list)+1}: {np.nanmax(val_corr_list)}")
    print(f"Best val loss observed at {np.nanargmin(val_loss_list)+1}: {np.nanmin(val_loss_list)}")
    print(f"Best val f1 observed at {np.nanargmax(val_f1_list)+1}: {np.nanmax(val_f1_list)}")

    if confusion == True:
        cm = confusion_matrix(best_val_labels_list, best_val_preds_list)
        plt.figure(figsize=(8, 6))
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=[0, 1], yticklabels=[0, 1])
        plt.title(f'Confusion Matrix for Best Val Acc Epoch: {best_acc_epoch + 1}')
        plt.ylabel('True Label')
        plt.xlabel('Predicted Label')
        plt.show()

    if plot_ == True:
        # Generating the plot
        plt.figure(figsize=(10, 6))
        plt.plot(train_loss_list, label='train Loss')
        plt.plot(val_loss_list, label='val Loss')
        # Adding title and labels
        plt.title(f'loss for bsz:{bsz} epc:{epoch+1} lr:{lr}')
        plt.xlabel('Epochs')
        plt.ylabel('Loss')
        plt.legend()
        # Display the plot
        plt.show()
        # Generating the plot
        plt.figure(figsize=(10, 6))
        plt.plot(train_corr_list, label='train accr')
        plt.plot(val_corr_list, label='val accr')
        # Adding title and labels
        plt.title(f'accr for bsz:{bsz} epc:{epoch+1} lr:{lr}')
        plt.xlabel('Epochs')
        plt.ylabel('Accr')
        plt.legend()
        # Display the plot
        plt.show()

    if misclass:
        # Note: these samples come from the last epoch that ran, not necessarily the best one
        show_misclassified_images(misclassified_samples)
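
The early-stopping rule used inside the training function can be viewed in isolation. The sketch below replays the same patience logic on a made-up list of validation losses; `early_stop_epoch` is a hypothetical helper and not part of the notebook.

```python
def early_stop_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it never stops early."""
    best_val_loss = float('inf')
    epochs_no_improve = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_val_loss:
            # Improvement: remember the new best loss and reset the counter
            best_val_loss = loss
            epochs_no_improve = 0
        else:
            epochs_no_improve += 1
            if epochs_no_improve == patience:
                return epoch
    return None

losses = [0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.40]   # hypothetical validation losses
stop = early_stop_epoch(losses, patience=3)
print(stop)
```

With a patience of 3, training stops after three consecutive epochs without a new best validation loss, so the later improvement in this toy sequence is never reached.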

### **2.2.1 Simple 3-Layer CNN**

The simple CNN model comprises 3 convolutional layers and 2 linear layers, combined with pooling and ReLU activation functions.
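
One way to see where the `128 * 28 * 28` input size of the first linear layer comes from is to trace the spatial size through the three convolution and pooling stages. The sketch below assumes a 224x224 input, 3x3 convolutions with padding 1, and 2x2 max pooling with stride 2, matching the layer settings in the class that follows; `conv_out` is a helper written for this illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size of a convolution or pooling layer along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

size = 224                      # assumed input height and width
channels = [32, 64, 128]        # output channels of the three conv layers

for out_channels in channels:
    size = conv_out(size, kernel=3, stride=1, padding=1)   # convolution keeps the size
    size = conv_out(size, kernel=2, stride=2)              # pooling halves it
    print(f"{out_channels} channels at {size}x{size}")

flattened = channels[-1] * size * size
print(flattened)
```

If the input resolution changes, the flattened size changes with it, so the first linear layer would need to be adjusted accordingly.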

In [None]:
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()

        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1)  # Input channels = 3 (RGB)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2, padding=0)  # Pooling layer
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1)
        self.fc1 = nn.Linear(128 * 28 * 28, 512)  # Adjust the input size accordingly
        self.fc2 = nn.Linear(512, 1)  # Output layer for binary classification

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))  # 224x224x3 -> 112x112x32
        x = self.pool(F.relu(self.conv2(x)))  # 112x112x32 -> 56x56x64
        x = self.pool(F.relu(self.conv3(x)))  # 56x56x64 -> 28x28x128

        x = x.view(-1, 128 * 28 * 28)  # Flatten the output for the dense layer
        x = F.relu(self.fc1(x))
        x = self.fc2(x)  # Raw logit; the sigmoid is applied inside BCEWithLogitsLoss
        return x

The CNN model without stem information is trained as follows:

In [None]:
# Load a Simple CNN model
model = SimpleCNN()

# data parallel
# model = nn.DataParallel(model, device_ids=[0, 1])
model.to(torch.device("cuda"))

train_model(train_dataset=train_dataset,
            val_dataset=val_dataset,
            model=model,
            num_epochs=10,
            bsz=64,
            lr=0.001,
            patience=3,
            use_gpu=True,
            plot_=True)

### **2.2.2 SqueezeNet**

SqueezeNet is a model designed to achieve AlexNet-level accuracy with significantly fewer parameters and a smaller model size, which makes it well suited to environments with limited computational resources or applications requiring real-time processing. We load the pre-trained weights and train the SqueezeNet model without stem information as follows:

In [None]:
# Simple 2 layer MLP for classification
class mlp1(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(mlp1, self).__init__()
        self.fc1 = nn.Linear(input_dim, 512)
        self.fc2 = nn.Linear(512, output_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Load a pre-trained SqueezeNet model (ImageNet weights)
model = models.squeezenet1_0(weights=models.SqueezeNet1_0_Weights.DEFAULT)
model.classifier = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    mlp1(512, 1)
)

# data parallel
# model = nn.DataParallel(model, device_ids=[0, 1])
model.to(torch.device("cuda:0"))

train_model(train_dataset=train_dataset,
            val_dataset=val_dataset,
            model=model,
            num_epochs=10,
            bsz=64,
            lr=0.001,
            patience=3,
            use_gpu=True)

### **2.2.3 ResNet18**

ResNet18 is an 18-layer model known for its residual connections, which help to overcome the vanishing gradient problem and make it a good fit for our project. We start from the pre-trained weights and train the ResNet18 model without stem information as follows:

In [None]:
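# Load a pre-trained ResNet18 model (ImageNet weights)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
# Number of input features of the original fully connected layer
num_ftrs = model.fc.in_features
# Replace the fully connected layer with the MLP head
model.fc = mlp1(num_ftrs, output_dim=1)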

# data parallel
# model = nn.DataParallel(model, device_ids=[0, 1])
model.to(torch.device("cuda:0"))

train_model(train_dataset=train_dataset,
            val_dataset=val_dataset,
            model=model,
            num_epochs=10,
            bsz=64,
            lr=0.001,
            patience=3,
            use_gpu=True)

## **2.3 Models With PoseNet**



The main adjustments to the training function for models with pose information are:
* Find the posture keypoints for every batch
* Feed both the image information and the keypoint information to the model to make a prediction
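
The keypoint handling can be sketched with plain Python lists standing in for the tensors returned by the pose model. The example assumes 17 keypoints per person, takes the first detected person, and falls back to zeros when nobody is detected, mirroring the idea in the training function below; `keypoints_to_vector` and the sample detections are hypothetical.

```python
NUM_KEYPOINTS = 17  # number of (x, y) keypoints per person assumed here

def keypoints_to_vector(detections, num_keypoints=NUM_KEYPOINTS):
    """Flatten the first detected person's (x, y) keypoints; use zeros if nobody was detected."""
    if detections:
        first_person = detections[0]
    else:
        first_person = [(0.0, 0.0)] * num_keypoints
    return [coord for point in first_person for coord in point]

# Hypothetical detections for a batch of two images
person = [(float(k), float(k) + 0.5) for k in range(NUM_KEYPOINTS)]
batch = [[person], []]          # first image: one person, second image: no detection
vectors = [keypoints_to_vector(d) for d in batch]

print(len(vectors[0]), vectors[1][:4])
```

Each image therefore contributes a fixed-length vector of 34 values, which is what allows the keypoints to be fed to a linear layer alongside the image features.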

In [None]:
# Function that trains the model
from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix
import copy

def train_PoseNetCNN(train_dataset, val_dataset, model, num_epochs, bsz, lr, patience,
                     weight_decay=0, use_gpu=True, plot_=False, PoseNetModel=None,
                     confusion=True, misclass=True, model_dir=None):
    set_seed()

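    # A checkpoint directory is required because the best weights are saved during training
    if model_dir is None:
        raise ValueError("model_dir must be provided so the best model weights can be saved")
    os.makedirs(model_dir, exist_ok=True)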

    device = torch.device("cuda" if torch.cuda.is_available() and use_gpu else "cpu")
    print(f"curr device: {device}")

    train_loader = DataLoader(train_dataset, batch_size=bsz, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=bsz, shuffle=False)

    criterion = nn.BCEWithLogitsLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)

    best_val_loss = float('inf')
    epochs_no_improve = 0

    train_corr_list = np.full(num_epochs, np.nan)
    train_loss_list = np.full(num_epochs, np.nan)
    val_corr_list = np.full(num_epochs, np.nan)
    val_loss_list = np.full(num_epochs, np.nan)

    # New metrics storage
    train_precision_list = np.full(num_epochs, np.nan)
    train_recall_list = np.full(num_epochs, np.nan)
    train_f1_list = np.full(num_epochs, np.nan)
    val_precision_list = np.full(num_epochs, np.nan)
    val_recall_list = np.full(num_epochs, np.nan)
    val_f1_list = np.full(num_epochs, np.nan)

    best_acc_epoch = 0  # Track the epoch with the highest validation accuracy
    best_val_accuracy = 0

    best_model_wts=None

    for epoch in range(num_epochs):
        model.train()
        train_loss, train_correct = 0.0, 0
        train_total, val_total = 0, 0
        train_preds, train_labels_list, val_preds, val_labels_list = [], [], [], []

        # Training loop
        with tqdm(enumerate(train_loader), total=len(train_loader), desc=f"Epoch {epoch+1}/{num_epochs}") as t:
            for batch_ind, (inputs, labels) in t:
                inputs, labels = inputs.to(device), labels.to(device)
                with torch.no_grad():
                    # Assuming PoseNetModel and necessary preprocessing are correctly defined
                    res = PoseNetModel(inputs, verbose=False)
                    # Use the first detected person's keypoints; fall back to zeros when none is found
                    res_keypoints = torch.stack([
                        i.keypoints[0].xy if i.keypoints[0].conf is not None
                        else torch.zeros(1, 17, 2).to(device)
                        for i in res
                    ]).squeeze(1)
                    mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1).to(device)
                    std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1).to(device)
                    inputs = (inputs - mean) / std


                optimizer.zero_grad()
                outputs = model(inputs, res_keypoints)
                loss = criterion(outputs, labels.unsqueeze(1).float())
                loss.backward()
                optimizer.step()

                train_loss += loss.item()
                preds = torch.round(torch.sigmoid(outputs))
                # squeeze(1) drops only the output dimension, so a final batch of size 1 still works
                train_correct += (preds.squeeze(1) == labels).sum().item()
                train_total += labels.size(0)

                # Collecting labels and predictions for metrics calculation
                train_labels_list.extend(labels.cpu().numpy())
                train_preds.extend(preds.squeeze(1).detach().cpu().numpy())

        # Validation loop
        model.eval()
        val_loss, val_correct = 0.0, 0
        misclassified_samples = []
        with torch.no_grad():
            for inputs, labels in val_loader:
                inputs, labels = inputs.to(device), labels.to(device)

                res = PoseNetModel(inputs, verbose=False)
                # Use the first detected person's keypoints; fall back to zeros when none is found
                res_keypoints = torch.stack([
                    i.keypoints[0].xy if i.keypoints[0].conf is not None
                    else torch.zeros(1, 17, 2).to(device)
                    for i in res
                ]).squeeze(1)

                ################
                mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1).to(device)
                std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1).to(device)
                inputs = (inputs - mean) / std

                outputs = model(inputs, res_keypoints)
                loss = criterion(outputs, labels.unsqueeze(1).float())
                val_loss += loss.item()

                preds = torch.round(torch.sigmoid(outputs))

                misclassified = (preds.squeeze(1) != labels)
                for i, mis in enumerate(misclassified):
                    if mis:
                        misclassified_samples.append((inputs[i].cpu(), labels[i].cpu(), preds[i].cpu()))

                val_correct += (preds.squeeze(1) == labels).sum().item()
                val_total += labels.size(0)

                # Collecting labels and predictions for metrics calculation
                val_labels_list.extend(labels.cpu().numpy())
                val_preds.extend(preds.squeeze(1).detach().cpu().numpy())

        # Calculating metrics
        avg_train_acc = train_correct / train_total
        avg_val_acc = val_correct / val_total
        train_precision = precision_score(train_labels_list, train_preds)
        train_recall = recall_score(train_labels_list, train_preds)
        train_f1 = f1_score(train_labels_list, train_preds)
        val_precision = precision_score(val_labels_list, val_preds)
        val_recall = recall_score(val_labels_list, val_preds)
        val_f1 = f1_score(val_labels_list, val_preds)

        # Updating lists with metrics
        train_corr_list[epoch] = avg_train_acc
        val_corr_list[epoch] = avg_val_acc
        avg_train_loss = train_loss / len(train_loader)
        train_loss_list[epoch] = avg_train_loss
        avg_val_loss = val_loss / len(val_loader)
        val_loss_list[epoch] = avg_val_loss
        train_precision_list[epoch] = train_precision
        train_recall_list[epoch] = train_recall
        train_f1_list[epoch] = train_f1
        val_precision_list[epoch] = val_precision
        val_recall_list[epoch] = val_recall
        val_f1_list[epoch] = val_f1

        print(f'Epoch {epoch+1}: Train Loss: {train_loss / len(train_loader):.4f}, Acc: {avg_train_acc:.4f}, Precision: {train_precision:.4f}, Recall: {train_recall:.4f}, F1: {train_f1:.4f}')
        print(f'         Val Loss: {val_loss / len(val_loader):.4f}, Acc: {avg_val_acc:.4f}, Precision: {val_precision:.4f}, Recall: {val_recall:.4f}, F1: {val_f1:.4f}')


        if avg_val_acc > best_val_accuracy:
            best_val_accuracy = avg_val_acc
            best_acc_epoch = epoch
            # Store labels and predictions for the best epoch
            best_val_labels_list = val_labels_list.copy()
            best_val_preds_list = val_preds.copy()

        # check if val loss improved
        if avg_val_loss < best_val_loss:
            best_val_loss = avg_val_loss
            epochs_no_improve = 0

            best_model_wts = copy.deepcopy(model.state_dict())
            torch.save(best_model_wts, f"{model_dir}/{model.name}_best.pth")

        else:
            epochs_no_improve += 1
            print(f"no improvement in val loss for {epochs_no_improve} epochs.")
            if epochs_no_improve == patience:
                print(f"patience exceeded, early stop triggered at epoch {epoch+1}")
                break
    print(f"\nBest val corr observed at {np.nanargmax(val_corr_list)+1}: {np.nanmax(val_corr_list)}")
    print(f"Best val loss observed at {np.nanargmin(val_loss_list)+1}: {np.nanmin(val_loss_list)}")
    print(f"Best val F1 observed at {np.nanargmax(val_f1_list)+1}: {np.nanmax(val_f1_list)}")

    # Plotting if requested
    if plot_ == True:
        # Generating the plot
        plt.figure(figsize=(10, 6))
        plt.plot(train_loss_list, label='train Loss')
        plt.plot(val_loss_list, label='val Loss')
        # Adding title and labels
        plt.title(f'loss for {model.name} bsz:{bsz} epc:{epoch+1} lr:{lr}')
        plt.xlabel('Epochs')
        plt.ylabel('Loss')
        plt.legend()
        # Display the plot
        plt.show()
        # Generating the plot
        plt.figure(figsize=(10, 6))
        plt.plot(train_corr_list, label='train accr')
        plt.plot(val_corr_list, label='val accr')
        # Adding title and labels
        plt.title(f'accr for {model.name} bsz:{bsz} epc:{epoch+1} lr:{lr}')
        plt.xlabel('Epochs')
        plt.ylabel('Accr')
        plt.legend()
        # Display the plot
        plt.show()

    # Plotting confusion matrix for the best accuracy epoch
    if confusion:
        # Recalculate or retrieve predictions for the best_acc_epoch if necessary
        cm = confusion_matrix(best_val_labels_list, best_val_preds_list)
        plt.figure(figsize=(8, 6))
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=[0, 1], yticklabels=[0, 1])
        plt.title(f'Confusion Matrix for Best Val Acc Epoch: {best_acc_epoch + 1}')
        plt.ylabel('True Label')
        plt.xlabel('Predicted Label')
        plt.show()

    if misclass == True:
        show_misclassified_images(misclassified_samples)

### **2.3.1 PoseNet + simple CNN**

The PoseNetCNN model combines traditional CNN features with pose key points data. This model processes key points through a separate multi-layer perceptron (MLP), merges them with CNN-derived features, and further processes the combined data through an extensive MLP for classification. The model employs Leaky ReLU activations for the key points MLP to mitigate the dying ReLU problem.
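
The difference between the two activations can be shown on a few values. This is a minimal sketch using the negative slope of 0.1 that matches the default `lrelu_tr` in the class below; the functions are plain-Python stand-ins written for illustration, not the PyTorch modules.

```python
def relu(x):
    """Standard ReLU: negative inputs are mapped to zero."""
    return max(0.0, x)

def leaky_relu(x, negative_slope=0.1):
    """Leaky ReLU: negative inputs keep a small, non-zero slope."""
    return x if x >= 0 else negative_slope * x

inputs = [-2.0, -0.5, 0.0, 1.5]
relu_out = [relu(v) for v in inputs]
leaky_out = [leaky_relu(v) for v in inputs]

print(relu_out)
print(leaky_out)
```

Because negative inputs still produce a non-zero output and gradient, units in the key points MLP are less likely to become permanently inactive.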

In [None]:
class PoseNetCNN(nn.Module):
    def __init__(self, num_keypoints=17, num_classes=1, lrelu_tr=0.1):
        super(PoseNetCNN, self).__init__()
        self.name="PoseNetCNN"

        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),  # 112x112x32
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),  # 56x56x64
            nn.Conv2d(64, 128, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),  # 28x28x128
        )


        self.yoloMLP = nn.Sequential(
            nn.Linear(num_keypoints * 2, 128),  # 17 * 2 (x, y)
            nn.LeakyReLU(lrelu_tr),
            nn.Linear(128, 64),
            nn.LeakyReLU(lrelu_tr)
        )

        self.combinedMLP = nn.Sequential(
            nn.Linear(128 * 28 * 28 + 64, 1024),
            nn.ReLU(),
            nn.Linear(1024, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes)
        )

    def forward(self, x, keypoints):
        # CNN
        x = self.cnn(x)
        x = torch.flatten(x, 1)

        # yoloMLP
        keypoints = keypoints.view(keypoints.size(0), -1) #
        yoloMLP_out = self.yoloMLP(keypoints)

        # concat
        concat_features = torch.cat((x, yoloMLP_out), dim=1)
        #
        output = self.combinedMLP(concat_features)

        return output

In [None]:
yoloPoseModel = YOLO('F:/UT_MIE/MIE1517/Project/yolov8n-pose.pt')

PoseNetCNN_model = PoseNetCNN()
# # data parallel
# # model = nn.DataParallel(model, device_ids=[0, 1])
PoseNetCNN_model.to(torch.device("cuda:0"))

train_PoseNetCNN(train_dataset=train_dataset,
            val_dataset=val_dataset,
            model=PoseNetCNN_model,
            num_epochs=10,
            bsz=64,
            lr=0.001,
            patience=3,
            PoseNetModel=yoloPoseModel,
            use_gpu=True,
            plot_=True,
            model_dir='models')

### **2.3.2 PoseNet + SqueezeNet**

The PoseNetSqueeze class combines SqueezeNet image features with pose key points to classify ride versus unride. SqueezeNet's original classifier is replaced with a 1x1 convolution followed by global average pooling, so that with the default settings the image branch yields a single value per image. A separate MLP processes the key points data, and an additional MLP combines both sets of features for the final prediction.

In [None]:
class PoseNetSqueeze(nn.Module):
    def __init__(self, num_keypoints=17, num_classes=1, lrelu_tr=0.1, pretrained=True):
        super(PoseNetSqueeze, self).__init__()
        self.name = "PoseNetSqueeze"

        # Load a SqueezeNet model, with ImageNet weights when pretrained=True
        weights = models.SqueezeNet1_0_Weights.DEFAULT if pretrained else None
        self.squeezenet = models.squeezenet1_0(weights=weights)

        # Keep SqueezeNet's convolutional feature extractor
        # (the first child module of the network is `features`)
        self.squeezenet.features = nn.Sequential(*list(self.squeezenet.children())[0])

        # SqueezeNet's feature extractor outputs 512 channels with a 13x13 spatial dimension when the input is 224x224
        # The classifier is replaced so that the image branch yields num_classes values per image
        final_conv = nn.Conv2d(512, num_classes, kernel_size=1)
        self.squeezenet.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            final_conv,
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((1, 1))
        )

        # MLP for processing keypoints
        self.yoloMLP = nn.Sequential(
            nn.Linear(num_keypoints * 2, 128),  # 17 * 2 (x, y)
            nn.LeakyReLU(lrelu_tr),
            nn.Linear(128, 64),
            nn.LeakyReLU(lrelu_tr)
        )

        # Combined MLP
        # Adjust the input features of the combinedMLP to match SqueezeNet's output
        self.combinedMLP = nn.Sequential(
            nn.Linear(num_classes + 64, 1024),
            nn.ReLU(),
            nn.Linear(1024, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes)
        )

    def forward(self, x, keypoints):
        # Process images through the SqueezeNet model
        x = self.squeezenet.features(x)
        x = self.squeezenet.classifier(x)
        x = torch.flatten(x, 1)

        # yoloMLP
        keypoints = keypoints.view(keypoints.size(0), -1)
        yoloMLP_out = self.yoloMLP(keypoints)

        # Concatenate SqueezeNet features and keypoints MLP output
        concat_features = torch.cat((x, yoloMLP_out), dim=1)

        # Combined MLP
        output = self.combinedMLP(concat_features)

        return output

In [None]:
yoloPoseModel = YOLO('F:/UT_MIE/MIE1517/Project/yolov8n-pose.pt')

PoseNetCNN_model = PoseNetSqueeze()
# # data parallel
# # model = nn.DataParallel(model, device_ids=[0, 1])
PoseNetCNN_model.to(torch.device("cuda:0"))

train_PoseNetCNN(train_dataset=train_dataset,
            val_dataset=val_dataset,
            model=PoseNetCNN_model,
            num_epochs=10,
            bsz=64,
            lr=0.001,
            patience=3,
            PoseNetModel=yoloPoseModel,
            use_gpu=True,
            plot_=True,
            model_dir='models')

### **2.3.3 PoseNet + ResNet18**


The PoseNetResnet class combines the deep features extracted by ResNet18 with pose key points to classify ride versus unride. It uses a pretrained ResNet18 purely as a feature extractor, dropping its final average-pooling and fully connected layers. A dedicated MLP processes the pose key points data, its output is concatenated with the flattened ResNet18 features, and the combined data is passed through a custom MLP for the final prediction.
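
The input width of the combined MLP follows from simple arithmetic. The sketch below assumes a 224x224 input, for which ResNet18's overall downsampling factor of 32 leaves a 7x7 feature map with 512 channels, plus the 64 features produced by the key points MLP; `combined_input_size` is a hypothetical helper for this illustration.

```python
def combined_input_size(image_channels, height, width, keypoint_features=64):
    """Number of inputs to the combined MLP after flattening and concatenation."""
    return image_channels * height * width + keypoint_features

input_size = 224                    # assumed image resolution
spatial = input_size // 32          # ResNet18 reduces the resolution by a factor of 32

resnet_inputs = combined_input_size(512, spatial, spatial)
simple_cnn_inputs = combined_input_size(128, 28, 28)   # simple CNN variant, for comparison

print(resnet_inputs, simple_cnn_inputs)
```

The image features vastly outnumber the 64 keypoint features in both cases, which is worth keeping in mind when judging how much the pose branch can influence the prediction.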

In [None]:
class PoseNetResnet(nn.Module):
    def __init__(self, num_keypoints=17, num_classes=1, lrelu_tr=0.1, pretrained=True):
        super(PoseNetResnet, self).__init__()
        self.name = "PoseNetResnet"

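        # Load a ResNet18 model, with ImageNet weights when pretrained=True
        weights = models.ResNet18_Weights.DEFAULT if pretrained else None
        self.resnet = models.resnet18(weights=weights)

        # Drop the last two children (average pooling and fc) to keep the 512x7x7 feature map
        self.resnet = nn.Sequential(*(list(self.resnet.children())[:-2]))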

        # MLP for processing keypoints
        self.yoloMLP = nn.Sequential(
            nn.Linear(num_keypoints * 2, 128),  # 17 * 2 (x, y)
            nn.LeakyReLU(lrelu_tr),
            nn.Linear(128, 64),
            nn.LeakyReLU(lrelu_tr)
        )

        # Combined MLP
        # ResNet18 with default settings ends with 512 channels at 7x7 spatial dimensions, when the input is 224x224
        self.combinedMLP = nn.Sequential(
            nn.Linear(512 * 7 * 7 + 64, 1024),
            nn.ReLU(),
            nn.Linear(1024, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes)
        )

    def forward(self, x, keypoints):
        # Process images through the ResNet18 model
        x = self.resnet(x)
        x = torch.flatten(x, 1)

        # yoloMLP
        keypoints = keypoints.view(keypoints.size(0), -1)
        yoloMLP_out = self.yoloMLP(keypoints)

        # concat
        concat_features = torch.cat((x, yoloMLP_out), dim=1)

        # combined MLP
        output = self.combinedMLP(concat_features)

        return output

In [None]:
yoloPoseModel = YOLO('F:/UT_MIE/MIE1517/Project/yolov8n-pose.pt')

PoseNetCNN_model = PoseNetResnet()
# # data parallel
# # model = nn.DataParallel(model, device_ids=[0, 1])
PoseNetCNN_model.to(torch.device("cuda:0"))

train_PoseNetCNN(train_dataset=train_dataset,
            val_dataset=val_dataset,
            model=PoseNetCNN_model,
            num_epochs=10,
            bsz=64,
            lr=0.001,
            patience=3,
            PoseNetModel=yoloPoseModel,
            use_gpu=True,
            plot_=True,
            model_dir='models')

###  **2.3.4 Hyperparameter Tuning**

We also adjusted the hyperparameters of our models, particularly for PoseNet with a simple CNN and PoseNet with ResNet. We experimented with various batch sizes, learning rates, and then incorporated a dropout layer into the model, fine-tuning the dropout rate. We discovered that the optimal model performance was achieved with a batch size of 64, a learning rate of 0.001, and a dropout rate of 0.2. However, the training and validation accuracies for models with different hyperparameter sets did not show significant variation.
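The search described above can be organized as a small grid. The sketch below is illustrative only: the candidate values other than the reported optimum (batch size 64, learning rate 0.001, dropout rate 0.2) are assumptions, and `dummy_evaluate` is a hypothetical placeholder for training a model and returning its validation accuracy, not a measured result.

```python
from itertools import product


def build_grid(batch_sizes, learning_rates, dropout_rates):
    """Enumerate every hyperparameter combination as a dict."""
    return [
        {"bsz": bsz, "lr": lr, "dropout": drop}
        for bsz, lr, drop in product(batch_sizes, learning_rates, dropout_rates)
    ]


def select_best(configs, evaluate):
    """Return the configuration with the highest validation score."""
    return max(configs, key=evaluate)


# Candidate values are assumptions chosen for illustration.
grid = build_grid([32, 64, 128], [0.01, 0.001, 0.0001], [0.1, 0.2, 0.5])


def dummy_evaluate(cfg):
    """Placeholder score; a real run would train and validate a model here."""
    return -abs(cfg["bsz"] - 64) - abs(cfg["lr"] - 0.001) - abs(cfg["dropout"] - 0.2)


best = select_best(grid, dummy_evaluate)
print(len(grid), best)
```

In practice, `evaluate` would wrap a call such as train_PoseNetCNN with the chosen settings and return the best validation accuracy.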

In [None]:
class PoseNetCNN_drop(nn.Module):
    def __init__(self, num_keypoints=17, num_classes=1, lrelu_tr=0.1, dropout_cnn=0.2, dropout_mlp=0.2):
        super(PoseNetCNN_drop, self).__init__()
        self.name = "PoseNetCNN_drop"

        # Convolutional layers with dropout after activation and pooling
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Dropout(dropout_cnn),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Dropout(dropout_cnn),
            nn.Conv2d(64, 128, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Dropout(dropout_cnn))

        # MLP for YOLO keypoints with dropout after LeakyReLU
        self.yoloMLP = nn.Sequential(
            nn.Linear(num_keypoints * 2, 128),
            nn.LeakyReLU(lrelu_tr),
            nn.Dropout(dropout_mlp),
            nn.Linear(128, 64),
            nn.LeakyReLU(lrelu_tr),
            nn.Dropout(dropout_mlp))

        # Combined MLP with dropout after activation
        self.combinedMLP = nn.Sequential(
            nn.Linear(128 * 28 * 28 + 64, 1024),
            nn.ReLU(),
            nn.Dropout(dropout_mlp),
            nn.Linear(1024, 256),
            nn.ReLU(),
            nn.Dropout(dropout_mlp),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(dropout_mlp),
            nn.Linear(128, num_classes))

    def forward(self, x, keypoints):
        # CNN
        x = self.cnn(x)
        x = torch.flatten(x, 1)
        # YOLO MLP
        keypoints = keypoints.view(keypoints.size(0), -1)
        yoloMLP_out = self.yoloMLP(keypoints)
        # Concatenation
        concat_features = torch.cat((x, yoloMLP_out), dim=1)
        # Combined MLP
        output = self.combinedMLP(concat_features)

        return output

In [None]:
class PoseNetResnet_Drop(nn.Module):
    def __init__(self, num_keypoints=17, num_classes=1, lrelu_tr=0.1, pretrained=True, dropout_rate=0.2):
        super(PoseNetResnet_Drop, self).__init__()
        self.name = "PoseNetResnet_Drop"
        # Load a pretrained ResNet18 model
        self.resnet = models.resnet18(pretrained=pretrained)
        # Remove the average-pooling and fully connected layers of ResNet18
        self.resnet = nn.Sequential(*(list(self.resnet.children())[:-2]))
        # MLP for processing keypoints, with dropout after activations
        self.yoloMLP = nn.Sequential(
            nn.Linear(num_keypoints * 2, 128),
            nn.LeakyReLU(lrelu_tr),
            nn.Dropout(dropout_rate),
            nn.Linear(128, 64),
            nn.LeakyReLU(lrelu_tr),
            nn.Dropout(dropout_rate))

        # Combined MLP with dropout after activations
        self.combinedMLP = nn.Sequential(
            nn.Linear(512 * 7 * 7 + 64, 1024),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(1024, 256),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(128, num_classes))

    def forward(self, x, keypoints):
        # Process images through the ResNet18 model
        x = self.resnet(x)
        x = torch.flatten(x, 1)
        # Process keypoints through the yoloMLP
        keypoints = keypoints.view(keypoints.size(0), -1)
        yoloMLP_out = self.yoloMLP(keypoints)
        # Concatenate features
        concat_features = torch.cat((x, yoloMLP_out), dim=1)
        # Final classification through the combined MLP
        output = self.combinedMLP(concat_features)

        return output

###  **2.3.5 Discussion on Validation Result**


The validation accuracy and F1 scores for the different models are summarized as follows:

<img src="https://i.postimg.cc/T2b568Qh/val-acc.jpg" width="1000" height="500">

Figure 10: Summary of Validation Results


A notable observation from the confusion matrices is the significant reduction in false positives and false negatives once pose information is incorporated in the PoseNet+ResNet18 model. The standard ResNet model often misclassified individuals walking with a bicycle, whereas the keypoint information lets the enhanced model discern a person's posture and differentiate between walking and riding.








<img src="https://i.postimg.cc/RVQ3WckB/cm.jpg"  width="1000" height="400">

Figure 11: Comparison between ResNet with and without PoseNet
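For reference, the accuracy and F1 score can be derived directly from the four cells of a binary confusion matrix. The sketch below is a minimal illustration; the counts are made up for the example and are not our measured results, and the function name is hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only.
metrics = binary_metrics(tp=45, fp=5, fn=15, tn=35)
print(metrics)
```

Fewer false positives raise precision and fewer false negatives raise recall, so a reduction in both is reflected in a higher F1 score.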

## **2.4 Model Testing on the Best Model**

### **2.4.1 Model Testing**

Based on the results from training and validation, the ResNet18 model enhanced with pose information (ResNet18 + Pose) demonstrated the best performance. Therefore, we used the test dataset to further evaluate this model's performance on new data.

The get_accuracy function calculates the model's accuracy on the test dataset and provides a confusion matrix along with images that were incorrectly classified.
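The core of this evaluation is turning raw model outputs (logits) into binary predictions and comparing them with the labels. The following plain-Python sketch mirrors that step under the assumption of a 0.5 threshold on the sigmoid output; the helper names and sample values are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def evaluate_logits(logits, labels, threshold=0.5):
    """Return accuracy, predictions, and indices of misclassified samples."""
    preds = [1 if sigmoid(z) > threshold else 0 for z in logits]
    misclassified = [i for i, (p, y) in enumerate(zip(preds, labels)) if p != y]
    accuracy = 1.0 - len(misclassified) / len(labels)
    return accuracy, preds, misclassified


acc, preds, wrong = evaluate_logits([2.0, -1.0, 0.5, -3.0], [1, 0, 0, 0])
print(acc, preds, wrong)  # 0.75 [1, 0, 1, 0] [2]
```

With a 0.5 threshold, a sample is predicted as class 1 exactly when its logit is positive.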

In [None]:
def get_accuracy(model, data_loader, PoseNetModel):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()

    correct, total = 0, 0
    misclassified_samples = []
    label_list, pred_list = [], []
    with torch.no_grad():
        for inputs, labels in data_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            res = PoseNetModel(inputs, verbose=False)
            res_keypoints = torch.stack([i.keypoints[0].xy if i.keypoints[0].conf is not None else torch.zeros(1,17,2).to(device) for i in res ]).squeeze(1)

            # Normalize inputs with the ImageNet mean and std before the classifier
            mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1).to(device)
            std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1).to(device)
            inputs = (inputs - mean) / std

            outputs = model(inputs, res_keypoints)
            preds = torch.round(torch.sigmoid(outputs))

            correct += (preds.squeeze(1) == labels).sum().item()
            total += labels.size(0)

            label_list.extend(labels.cpu().numpy())
            pred_list.extend(preds.squeeze(1).detach().cpu().numpy())

            misclassified = (preds.squeeze(1) != labels)
            for i, mis in enumerate(misclassified):
                if mis:
                    misclassified_samples.append((inputs[i].cpu(), labels[i].cpu(), preds[i].cpu()))

        # Plot the confusion matrix for the test set
        cm = confusion_matrix(label_list, pred_list)
        plt.figure(figsize=(8, 6))
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=[0, 1], yticklabels=[0, 1])
        plt.title('Confusion Matrix for Test Set')
        plt.ylabel('True Label')
        plt.xlabel('Predicted Label')
        plt.show()

        show_misclassified_images(misclassified_samples)

    return correct / total

Load the test data:

In [None]:
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

Test the model with test dataset:

In [None]:
yoloPoseModel = YOLO('F:/UT_MIE/MIE1517/Project/yolov8n-pose.pt')
best_model = PoseNetCNN()
model_dir = 'models'
best_filename = "PoseNetCNN_best.pth"
best_path = os.path.join(model_dir, best_filename)
best_model.load_state_dict(torch.load(best_path))

# Test on the unseen test dataset
test_acc = get_accuracy(best_model, test_loader, yoloPoseModel)
print("The test accuracy of best model is: ", test_acc)

### **2.4.2 Integrated system**

We also developed a function that systematically processes video input, utilizing our pipeline to identify cyclists and classify their riding status. This function, along with our test results, serves as a tool for evaluating our model's performance.
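To save computation and storage, the system processes only a subset of the video frames. A minimal sketch of that sampling rule follows, assuming a source frame rate `fps` and a target rate `out_fps`; the function name is hypothetical, and the `max(1, ...)` guard keeps the step positive when `out_fps` exceeds `fps`.

```python
def frames_to_process(total_frames, fps, out_fps):
    """Indices of frames kept when sampling a video down to roughly out_fps."""
    skip = max(1, int(fps / out_fps))  # step between processed frames
    return [i for i in range(total_frames) if i % skip == 0]


print(frames_to_process(total_frames=30, fps=30, out_fps=5))
# [0, 6, 12, 18, 24]
```

For a 30 fps video and a target of 5 fps, every sixth frame is passed to the detector.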

In [None]:
"""
The function cyclist_detect_system serves as our integrated system that takes in a
raw video and performs cyclist biking status classification. The annotated frames
and videos will be stored in a folder.

Note that in the annotated videos, the bounding boxes are cyclist bounding boxes
and the green dots are the 17 keypoints on the human body. We highlight the riding
class with a red bounding box and the unride class with a green bounding box.


"""
from PIL import Image
import torchvision.transforms as transforms

def resize_and_pad(img_size):
    return transforms.Compose([
        transforms.Resize(img_size, interpolation=transforms.InterpolationMode.BILINEAR),
        transforms.CenterCrop(img_size),
        transforms.ToTensor(),
        # Optional: Apply normalization if needed
        # transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

def pad_to_square(img, fill=0):
    max_size = max(img.size)
    padding = [0, 0, max_size - img.size[0], max_size - img.size[1]]  # Calculate padding
    return transforms.Pad(padding, fill=fill)(img)

def cyclist_detect_system(data_home_dir, img_out_dir, vid_in_dir, model, cls, PoseNetModel,PoseNetCNN, yolo_imgsz=640, out_fps=3,vid_out_dir=None, box_conf_tr=0, device="cpu"):

    ####################
    inference_transform = transforms.Compose([
            transforms.Lambda(lambda img: pad_to_square(img, fill=0)),
            resize_and_pad((224, 224)),
            # transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        ])

    PoseNetModel = PoseNetModel.to(device)
    PoseNetCNN  = PoseNetCNN.to(device)
    #####################

    vid_in_name,vid_in_ext = os.path.splitext(os.path.basename(vid_in_dir))

    # load video and start receiving output
    cap = cv2.VideoCapture(vid_in_dir)
    print(f"processing vid: {vid_in_dir}")
    if not cap.isOpened():
        print(f"cap is not opened: status-{cap.isOpened()}")
        sys.exit()
    # get specs of video
    fps = cap.get(cv2.CAP_PROP_FPS)
    # print("here")
    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    # extract part of the frames only to save computation and storage
    skip_frames = max(1, int(fps / out_fps))  # at least 1 so the modulo below never divides by zero
    ## output to mp4
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    vid_out_dir = os.path.join(img_out_dir ,vid_in_name) # output video to same folder as frames
    os.makedirs(vid_out_dir, exist_ok=True)
    video_out = cv2.VideoWriter(os.path.join(vid_out_dir,f'res_{vid_in_name}{vid_in_ext}'), fourcc, out_fps, (frame_width, frame_height))

    #### process each image with yolo
    frame_id = 0

    while cap.isOpened():
        ret,frame = cap.read()
        # print("cap ret frame read")
        if ret:
            # skip frames
            if frame_id % skip_frames == 0:
                frame_rgb = frame # cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                # save curr frame
                # cv2.imwrite(f'{output_path}/{frame_id}.jpg', frame_rgb)

                # inference
                results = model([frame_rgb], imgsz=yolo_imgsz, verbose=False)
                #
                # out.write(results.numpy())
                # #
                for img_index, result in enumerate(results):
                    frame_folder_dir  = f'{vid_in_name}/res_img_{img_index}_frame_{frame_id}'
                    frame_output_filename = f'{cls}_res_yolo_vid_{vid_in_name}_img_{img_index}_fid_{frame_id}.jpg'
                    frame_folder_dir = os.path.join(img_out_dir, frame_folder_dir)
                    os.makedirs(frame_folder_dir, exist_ok=True)

                    # print(img_index)
                    # save yolo annotated result
                    result.save(filename=f"{frame_folder_dir}/{frame_output_filename}")
                    result_array = result.plot(labels=False, probs=False, boxes=False)
                    # retrieve all the bboxes predicted by yolo; here, only classes labeled as human or bike are gathered
                    # these results are stored in a dict which will then be further processed
                    person_bbox_list, bike_bbox_list = [],[]
                    bbox_cls_dict = result.names
                    boxes_labs = result.boxes.cls.tolist()
                    boxes_conf = result.boxes.conf.tolist()
                    for box_id, (box, box_label, box_conf) in enumerate(zip(result.boxes.xyxy, boxes_labs,boxes_conf)):
                        bbox_path=f"{frame_folder_dir}/box_id_{box_id}_{bbox_cls_dict[box_label]}.jpg"
                        x1,y1,x2,y2 = [int(_) for _ in box.tolist()]
                        x1, y1 = max(x1, 0), max(y1, 0)
                        x2, y2 = min(x2, frame_width), min(y2, frame_height)
                        # print(x1,y1,x2,y2)
                        # cropped_img = frame_rgb[y1:y2, x1:x2]
                        # cv2.imwrite(bbox_path, cropped_img)
                        # store the current bounding box
                        #################################
                        if box_conf>box_conf_tr:
                            pb_bbox_path=f"{frame_folder_dir}/{cls}_res_cropbp_vid_{vid_in_name}_img_{img_index}_fid_{frame_id}_bbox_{box_id}.jpg"
                            pb_bbox_img = frame_rgb[y1:y2, x1:x2]
                            cv2.imwrite(pb_bbox_path, pb_bbox_img)

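                            # convert the BGR crop to RGB for the PIL-based transform
                            pb_bbox_img = cv2.cvtColor(pb_bbox_img, cv2.COLOR_BGR2RGB)
                            # move the transformed crop to the device passed by the caller
                            pb_bbox_img_tran = inference_transform(Image.fromarray(pb_bbox_img)).unsqueeze(0).to(device)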
                            PoseNetCNN.eval()
                            with torch.no_grad():
                                res = PoseNetModel(pb_bbox_path, verbose=False)
                                if res[0].keypoints.conf is not None:
                                    res_keypoints_np = res[0].keypoints[0].xy[0]
                                    # keypoints of the first detected person, shaped (1, 17, 2)
                                    res_keypoints = res[0].keypoints.xy[0].unsqueeze(0).to(device)

                                    keypoints_large_img = [(x + x1, y + y1) for (x, y) in res_keypoints_np if not x==y==0.0]

                                    for (x, y) in keypoints_large_img:
                                        cv2.circle(frame_rgb, (int(x), int(y)), radius=3, color=(0, 255, 0), thickness=-1)
                                        cv2.circle(result_array, (int(x), int(y)), radius=3, color=(0, 255, 0), thickness=-1)
                                else:
                                    res_keypoints = torch.zeros(1,17,2).to(device)
                                    # print("reach 0")

                                ################ normalize the input image
                                mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1).to(device)
                                std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1).to(device)
                                pb_bbox_img_tran = (pb_bbox_img_tran - mean) / std
                                ################

                                curr_bbox_outputs = PoseNetCNN(pb_bbox_img_tran, res_keypoints)
                                preds = torch.round(torch.sigmoid(curr_bbox_outputs))

                                #
                                label={0:"ride",1:"unride"}[preds.item()]
                                color_ = {0:(0, 0, 255), 1:(0, 255, 0)}[preds.item()]
                                cv2.rectangle(result_array, (x1, y1), (x2, y2), color_, 2)
                                cv2.putText(result_array, label, (x1, max(y1 - 10, 0)), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color_, 2)

                        #################################



                    video_out.write(result_array) # f'{cls}_res_img_{img_index}_fid_{frame_id}.jpg'
                    cv2.imwrite(f"{frame_folder_dir}/{cls}_res_yolobp_vid_{vid_in_name}_img_{img_index}_fid_{frame_id}.jpg", result_array)

            frame_id += 1
        else:
            break

    # print("here end")
    cap.release()
    video_out.release()
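
Two small coordinate steps inside the function above are easy to get wrong: clamping a bounding box to the frame, and shifting keypoints predicted on the cropped image back into full-frame coordinates. The plain-Python sketch below isolates both steps; the helper names and sample values are hypothetical and only illustrate the arithmetic.

```python
def clamp_box(x1, y1, x2, y2, frame_width, frame_height):
    """Clip a bounding box so that it lies inside the frame."""
    return max(x1, 0), max(y1, 0), min(x2, frame_width), min(y2, frame_height)


def keypoints_to_frame(keypoints, x1, y1):
    """Shift crop-relative keypoints into frame coordinates.

    Keypoints at exactly (0, 0) are treated as undetected and dropped.
    """
    return [(x + x1, y + y1) for x, y in keypoints if not (x == 0.0 and y == 0.0)]


box = clamp_box(-5, 20, 700, 400, frame_width=640, frame_height=360)
points = keypoints_to_frame([(10.0, 12.0), (0.0, 0.0), (30.5, 40.0)], box[0], box[1])
print(box, points)  # (0, 20, 640, 360) [(10.0, 32.0), (30.5, 60.0)]
```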

In [None]:
# This code serves as our integrated system. It takes in all the raw videos in
# a folder and performs cyclist riding status classification. The results are
# stored in a separate folder containing the results for each frame and the
# annotated video.

objDetectModel = YOLO('/home/yuliang/1517proj/ultralytics/checkpoints/yolov8n_e46_v0.pt')
PoseNetModel = YOLO('/home/yuliang/1517proj/ultralytics/checkpoints/yolov8n-pose.pt')
PoseNetCNN_ = PoseNetCNN3() #torch.load('/home/yuliang/1517_proj/ultralytics/checkpoint/PoseNetResnet_96') #PoseNetResnet_96
PoseNetCNN_.load_state_dict(torch.load('/home/yuliang/1517proj/ultralytics/checkpoints/PoseNetCNN3/PoseNetCNN3_best.pth'))
# PoseNetCNN = torch.load('/home/yuliang/1517_proj/ultralytics/checkpoint/PoseNetCNN/PoseNetCNN_best.pth')

#
cls_list=['test']
# cls=cls_list[0]
for cls in cls_list:
    task_name = "bike_data"
    data_home_dir="/home/yuliang/1517proj/ultralytics/datasets/test_vid"
    img_out_dir=f"/home/yuliang/1517proj/ultralytics/datasets/test_vid/v1/{task_name}_img/"
    vid_out_dir=f"/home/yuliang/1517proj/ultralytics/datasets/test_vid/v1/{task_name}_vid/"
    os.makedirs(img_out_dir, exist_ok=True)
    os.makedirs(vid_out_dir, exist_ok=True)

    raw_data_dir = f"/home/yuliang/1517proj/ultralytics/datasets/test_vid"
    for file in os.listdir(raw_data_dir):
        if file.endswith('.mp4'):
            vid_in_dir = os.path.join(raw_data_dir, file)
            cyclist_detect_system(data_home_dir, img_out_dir, vid_in_dir, device="cuda:0",
                               model=objDetectModel,
                               PoseNetModel=PoseNetModel,
                               PoseNetCNN=PoseNetCNN_,
                                 cls=cls, yolo_imgsz=640, out_fps=5,box_conf_tr=0.5 )

### **2.4.3 Testing Result & Discussion**

Before we dive into the final test results, it is important to note that our current validation accuracy and F1 score differ from those shared during the presentation, where we reported a test accuracy of about 65%. After some troubleshooting we found the problem: the YOLO model applies its own image transforms, which conflicted with ours and caused it to fail to detect pose keypoints. For images without keypoint information, we simply set the keypoint input to zeros. In addition, some images produce multiple sets of keypoints, either because several people appear in one image or because of other detection problems. After the presentation, we modified our image transform and the training loop so that the keypoint information is extracted correctly, which fixed the bug.
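The two keypoint edge cases mentioned above (no person detected, or several people detected in one crop) can be handled with a single selection rule. The sketch below is a simplified plain-Python illustration using lists instead of tensors: it keeps the first returned set of keypoints and falls back to zeros when nothing is detected. The function name is hypothetical.

```python
NUM_KEYPOINTS = 17  # keypoints per person in the pose model used here


def select_keypoints(detections, num_keypoints=NUM_KEYPOINTS):
    """Return one person's (x, y) keypoints, or zeros when nothing is detected.

    `detections` holds one list of (x, y) pairs per detected person.
    """
    if not detections:
        return [(0.0, 0.0)] * num_keypoints  # zero-filled fallback
    return list(detections[0])               # keep only the first person


person_a = [(1.0, 2.0)] * NUM_KEYPOINTS
person_b = [(3.0, 4.0)] * NUM_KEYPOINTS
print(select_keypoints([])[:2], select_keypoints([person_a, person_b])[:2])
```

Either way, the classifier always receives a keypoint input of the same shape.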

After implementing the new image transformation method and adjusting the training loop to extract keypoint information from images, our test accuracy improved significantly. The PoseNet + ResNet18 model reached 84% accuracy on the test set, which aligns with its validation accuracy. Surprisingly, the PoseNet + Simple CNN model achieved 88.8% accuracy on the test set, surpassing not only the PoseNet + ResNet18 model but also its own validation accuracy. Furthermore, the fine-tuned PoseNet + Simple CNN model outperformed PoseNet + ResNet18 in our cyclist detection system when processing new video input.

We also observed that the model does not appear to reach 80% accuracy on new, unseen videos featuring a mix of riding and non-riding individuals. A likely reason is data distribution: the newly picked videos differ from our training data and are harder to classify. Below are example outputs (selected frames from the entire video) from our cyclist detection system. The first image shows perfect performance, but the second and third images exhibit some misclassifications.

<img src="https://i.postimg.cc/5ypVVNRZ/test-res-yolobp-vid-mixed-1-img-0-fid-0.jpg"  width="500" height="280">

Figure 12 (a): Good Performance

<img src="https://i.postimg.cc/wxkHJ0j0/test-res-yolobp-vid-mixed-2-img-0-fid-130.jpg"  width="500" height="280">

Figure 12 (b): Misclassification

<img src="https://i.postimg.cc/3JJhZmy7/test-res-yolobp-vid-mixed-4-img-0-fid-64.jpg"  width="500" height="280">

Figure 12 (c): Misclassification


We identified two main problems that may cause these issues.

Firstly, we found that our model exhibits variable performance across different test videos, facing particular challenges in distinguishing between 'unride' and 'ride' states of cyclists from specific viewpoints. For example, in the second and third images shown above, which contain misclassifications, the locations of the keypoints (green dots) are very similar for ride and unride when viewed from the front or back. This means that from some viewing angles our keypoint data provide only limited useful information.

Interestingly, the model underperforms when interpreting side perspectives, which are depicted in the image below. This difficulty arises from the two-dimensional nature of the images, where it is challenging to differentiate between a cyclist stationary with one foot on the ground and a pedestrian walking alongside a bicycle. Furthermore, the dataset comprises images captured from various orientations. An imbalanced distribution of these orientations may inadvertently lead the model to favor the prediction of the most prevalent orientation rather than accurately classifying 'unride' or 'ride' states.




<img src="https://i.postimg.cc/Nfxw8tNx/sideview2.png"  width="650" height="350">

Figure 13: Misclassification From Side View

Secondly, our team recognizes the need to enhance the diversity of our dataset. Due to time constraints and hardware limitations, we initially sourced data exclusively from online videos. We took different frames from multiple videos and used bounding boxes to crop images containing people and bicycles. This approach, while practical, resulted in a dataset with many similar images from the same videos, potentially reducing the model's ability to generalize across varied data patterns. Moreover, to expedite training, we had to limit the dataset size, further constraining our model's learning potential and leading to results that may not fully reflect its capabilities.

Moving forward, we aim to improve both the quality and quantity of the data to bolster the model's training process. By expanding our dataset with more diverse sources and images, we can better train our model to recognize and differentiate between a wider range of scenarios, enhancing its reliability and performance.

# **3. Related Work & Discussion**

## **3.1 Summary of Related Work**

Our project has greatly benefited from insights provided by the paper titled 'On the safety of vulnerable road users by cyclist orientation detection using Deep Learning' by Garcia-Venegas et al. (2020).

The paper’s emphasis on the safeguarding of vulnerable road users resonated with our mission, deepening our understanding beyond the technical aspects to the project's societal impact.
Garcia-Venegas and colleagues evaluated various models for detecting cyclists on roads, including SSD, Faster R-CNN, and R-FCN, in conjunction with ResNet50 and ResNet101 architectures. While ResNet50 outperformed the others in precision, its processing rate was notably slower. Alternatively, SSD paired with InceptionV2 struck a balance between precision and speed [3]. Because detection speed is crucial for real-time applications, and in light of this paper's findings, our project adopted YOLOv8 (You Only Look Once) as the foundational model to detect both stationary and moving cyclists.

## **3.2 Discussion**

### **3.2.1 Findings & Future Work**

<img src="https://i.postimg.cc/Dwy3BNH3/8-angels.png"  width="400" height="400">

Figure 14: Proposed Eight Orientations [3, Fig. 4]

To address the issues we identified earlier, we have come up with several ideas.

In the future, one strategy for model refinement is to adopt the approach from the paper 'On the safety of vulnerable road users by cyclist orientation detection using Deep Learning.' Systematically categorizing our 'unride/ride' dataset into distinct orientation-based groups, as suggested by Figure 14, may improve our model’s capability to distinguish 'unride' and 'ride' states from varied angles. Implementing this strategy could mitigate the challenges caused by orientation disparities while also supplying our detection system with detailed orientation data, which has the potential to raise the model's accuracy and reliability. However, assimilating this additional information will require a more complex model, likely with more layers and parameters, so that it can process and learn from the richer dataset effectively.
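One way to picture the orientation grouping is as a mapping from a heading angle to one of eight equal sectors. The sketch below is only an illustration: the sector boundaries, the angle convention, and the function name are assumptions on our part, and the exact definition used in [3] may differ.

```python
def orientation_class(angle_deg, num_classes=8):
    """Map a heading angle in degrees to one of num_classes equal sectors.

    Sector 0 is centered on 0 degrees, sector 1 on 45 degrees, and so on.
    """
    width = 360.0 / num_classes
    shifted = (angle_deg % 360.0) + width / 2  # center sector 0 on 0 degrees
    return int(shifted // width) % num_classes


for angle in (0, 44, 90, 180, -90, 350):
    print(angle, orientation_class(angle))
```

Such a label could be stored alongside the 'unride/ride' label so that the balance of orientations in the dataset can be checked before training.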

Beyond this, we can also derive information from the keypoints, rather than using only their raw coordinates, to gain additional insight into the rider's posture. For instance, calculating the angles and lengths between different parts of the torso or limbs could be beneficial. Such information, like the angle between the thigh and calf, may help us predict riding behavior more accurately.
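As an example of such a derived feature, the knee angle can be computed from three keypoints. This is a minimal sketch, assuming the 17 keypoints follow the COCO ordering (left hip 11, left knee 13, left ankle 15); the function names are hypothetical and the feature has not been evaluated in our experiments.

```python
import math

LEFT_HIP, LEFT_KNEE, LEFT_ANKLE = 11, 13, 15  # assumed COCO keypoint indices


def joint_angle(a, b, c):
    """Angle in degrees at point b formed by the segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        return None  # degenerate case: missing or coincident keypoints
    cos_t = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    cos_t = max(-1.0, min(1.0, cos_t))  # guard against rounding error
    return math.degrees(math.acos(cos_t))


def left_knee_angle(keypoints):
    """Angle between thigh and calf of the left leg for one person."""
    return joint_angle(keypoints[LEFT_HIP], keypoints[LEFT_KNEE], keypoints[LEFT_ANKLE])


print(joint_angle((0, 0), (0, 1), (0, 2)))  # straight leg
print(joint_angle((1, 0), (0, 0), (0, 1)))  # right angle
```

A nearly straight leg gives an angle close to 180 degrees, while a bent knee on a pedal gives a smaller angle.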

In addition, we can also continue to look for high-quality training, validation, and test images to increase the diversity of the dataset. The use of online videos for testing, while practical, is not without its limitations, particularly in representing the diversity of real-world conditions. The severe weather that impeded our ability to shoot original footage meant that we could not capture the nuanced variables that occur in natural settings. In future endeavors, if we have the opportunity to test our model with data derived from actual street scenarios, the reliability of our test scores is poised to improve. Real street data would introduce a spectrum of environmental factors—such as varying weather conditions, lighting, and traffic scenarios—that are crucial for a comprehensive assessment of the model's performance and its applicability in real-world situations. On top of that, we can also appropriately apply game screenshots, modeling images, AI-generated images, etc. to enrich our data sources. These avenues can break through our data collection limitations to a certain extent.

### **3.2.2 Lessons Learned**

Throughout this practical project, our group has gained a deeper appreciation for the significance of dataset quality and the necessity of diligent monitoring during the training process. In many instances, suitable datasets are not readily available; our project was no exception. While there was an abundance of data for cyclists in motion (ride state), we faced a scarcity when it came to the stationary (unride state) images, prompting us to source data from online videos and alternative outlets.

The confusion matrix became a pivotal tool for us, revealing misclassified samples and offering insights into the training process, which allowed us to identify and address issues within our dataset. For instance, the initial iterations of our customized YOLO model would occasionally detect only a portion of a cyclist's body. These errors were brought to our attention during training, as depicted in the image below. Through refinement of the customized YOLO model, we were able to enhance the quality of our dataset. This improvement underscored a fundamental lesson: the quality of the dataset is paramount for optimal model performance.

<img src="https://i.postimg.cc/mg6JVg77/image-improve.png"  width="800" height="400">

Figure 15: Improved YOLO Image Quality

# **4. Final Words**


Our group really enjoyed this project and learned a lot from it. First, we want to say a big thank you to Professor Sinisa Colic and our teaching assistants. They gave us a lot of good advice and showed us how to improve our model, like adding pose information and the best ways to train it. They also gave us flowcharts which made things clearer and helped us stay on the right path.

Getting the chance to work on a project from start to finish was a great experience. Even though we faced many challenges and difficulties, we learned a lot about applying what we know about deep learning in real situations. We're also really thankful for the deep learning community online. We found a lot of projects on GitHub that inspired us and helped us with our project.



<img src="https://i.postimg.cc/QM9PzZqw/4ebf8e03ad65ef21f429602203aed3f.jpg"  width="600" height="400">

Figure 16: Yuliang's Discussion with Professor on How to Add Pose Information.

# **5. Citation**

[1] C. Y. Tang, “cyclistdetectiondataset · GITLAB - dataset,” GitLab, https://gitlab03.wpi.edu/ctang5/cyclistdetectiondataset/-/tree/main (accessed Apr. 6, 2024).

[2] G. Jocher, A. Chaurasia, and J. Qiu, “Ultralytics YOLO,” version 8.0.0, Jan. 10, 2023. [Online]. Available: https://github.com/ultralytics/ultralytics (accessed Apr. 6, 2024).

[3] M. García-Venegas, D. A. Mercado-Ravell, L. A. Pinedo-Sánchez, and C. A. Carballo-Monsivais, “On the safety of vulnerable road users by cyclist detection and tracking,” Machine Vision and Applications, vol. 32, no. 5, Aug. 2021. doi:10.1007/s00138-021-01231-4 (accessed Apr. 6, 2024).



