# Mini-project 1:
Write a program that recognizes several different body positions. (Assume a person is in the image)

To simplify the problem, you can focus on two positions: standing and sitting. If you are interested, you can add positions.

Depending on your taste, you can use a classifier, but it is not required.

Please work on the program on the video and write your prediction on each frame.

Test the output on a small video and send the code along with the input video.

Similar work to this article:

https://ieeexplore.ieee.org/abstract/document/9116911/

# Mini-project 2:
Write a program that compares all faces on a webcam or video with a reference face and marks the faces that are different from the reference face with a red rectangle and the person who is the same as the reference image with a green rectangle.

# Mini-project 3:
a) Write a program that first finds a desired object with an object detection model and then tracks it with an object tracking method that we learned in the previous module.

b) Search the Internet to find a method that covers the shortcomings of the method you wrote in the first section. What is the difference?

(Hint: What happens if the object goes off the page in your method? What is the solution?)

In [8]:
pip install opencv-python mediapipe numpy

Note: you may need to restart the kernel to use updated packages.


In [None]:
import cv2
import mediapipe as mp
import numpy as np
from collections import deque

# Initialize MediaPipe
mp_pose = mp.solutions.pose
mp_hands = mp.solutions.hands
mp_drawing = mp.solutions.drawing_utils

class BodyPositionRecognizer:
    def __init__(self):
        self.pose = mp_pose.Pose(static_image_mode=False, min_detection_confidence=0.7)
        self.hands = mp_hands.Hands(static_image_mode=False, max_num_hands=2, min_detection_confidence=0.5)
        self.next_person_id = 0
        self.people_data = {}
        
        # Constants
        self.HANDSHAKE_DIST_THRESH = 0.15
        self.GREETING_DIST_THRESH = 0.25
        self.OBJECT_HOLD_DIST_THRESH = 0.1
        self.HISTORY_LENGTH = 10
    
    def calculate_body_metrics(self, landmarks):
        """Calculate key body metrics for position detection"""
        left_shoulder = landmarks[mp_pose.PoseLandmark.LEFT_SHOULDER.value]
        right_shoulder = landmarks[mp_pose.PoseLandmark.RIGHT_SHOULDER.value]
        left_hip = landmarks[mp_pose.PoseLandmark.LEFT_HIP.value]
        right_hip = landmarks[mp_pose.PoseLandmark.RIGHT_HIP.value]
        left_knee = landmarks[mp_pose.PoseLandmark.LEFT_KNEE.value]
        right_knee = landmarks[mp_pose.PoseLandmark.RIGHT_KNEE.value]
        left_ankle = landmarks[mp_pose.PoseLandmark.LEFT_ANKLE.value]
        left_elbow = landmarks[mp_pose.PoseLandmark.LEFT_ELBOW.value]
        left_wrist = landmarks[mp_pose.PoseLandmark.LEFT_WRIST.value]
        right_wrist = landmarks[mp_pose.PoseLandmark.RIGHT_WRIST.value]
        nose = landmarks[mp_pose.PoseLandmark.NOSE.value]
        
        # Calculate key metrics
        shoulder_hip_ratio = ((left_shoulder.y + right_shoulder.y) / 
                            (left_hip.y + right_hip.y))
        hip_knee_ratio = ((left_hip.y + right_hip.y) / 
                         (left_knee.y + right_knee.y))
        torso_angle = np.degrees(np.arctan2(left_hip.y - left_shoulder.y, 
                                          left_hip.x - left_shoulder.x))
        arm_angle = np.degrees(np.arctan2(left_elbow.y - left_shoulder.y,
                                        left_elbow.x - left_shoulder.x))
        
        # Hand to face distance for greeting detection
        hand_to_face_dist = min(
            np.sqrt((left_wrist.x - nose.x)**2 + (left_wrist.y - nose.y)**2),
            np.sqrt((right_wrist.x - nose.x)**2 + (right_wrist.y - nose.y)**2)
        )
        
        # Wrist to elbow distance for object holding
        left_wrist_elbow_dist = np.sqrt((left_wrist.x - left_elbow.x)**2 + 
                                      (left_wrist.y - left_elbow.y)**2)
        right_wrist_elbow_dist = np.sqrt((right_wrist.x - left_elbow.x)**2 + 
                                       (right_wrist.y - left_elbow.y)**2)
        
        return {
            'shoulder_hip_ratio': shoulder_hip_ratio,
            'hip_knee_ratio': hip_knee_ratio,
            'torso_angle': torso_angle,
            'arm_angle': arm_angle,
            'ankle_y': left_ankle.y,
            'hand_to_face_dist': hand_to_face_dist,
            'wrist_elbow_dist': (left_wrist_elbow_dist + right_wrist_elbow_dist)/2,
            'wrist_y': min(left_wrist.y, right_wrist.y)
        }
    
    def detect_interactions(self, person1, person2):
        """Detect interactions between two people"""
        if not person1['hand_landmarks'] or not person2['hand_landmarks']:
            return None
        
        p1_wrist = person1['hand_landmarks'][0].landmark[mp_hands.HandLandmark.WRIST]
        p2_wrist = person2['hand_landmarks'][0].landmark[mp_hands.HandLandmark.WRIST]
        
        distance = np.sqrt((p1_wrist.x - p2_wrist.x)**2 + 
                          (p1_wrist.y - p2_wrist.y)**2)
        
        if distance < self.HANDSHAKE_DIST_THRESH:
            return "Handshake"
        elif distance < self.GREETING_DIST_THRESH:
            return "Greeting"
        return None
    
    def detect_hand_gestures(self, metrics, hand_landmarks):
        """Detect hand gestures and positions"""
        gestures = []
        
        # Check for hand raising
        if metrics['wrist_y'] < 0.3:  # Normalized y-coordinate threshold
            gestures.append("Hands Raised")
            
        # Check for greeting (hand near face)
        if metrics['hand_to_face_dist'] < 0.2:
            gestures.append("Greeting")
            
        # Check for object holding (closed fist)
        if hand_landmarks:
            for hand in hand_landmarks:
                # Calculate distance between wrist and middle finger tip
                wrist = hand.landmark[mp_hands.HandLandmark.WRIST]
                middle_tip = hand.landmark[mp_hands.HandLandmark.MIDDLE_FINGER_TIP]
                dist = np.sqrt((wrist.x - middle_tip.x)**2 + (wrist.y - middle_tip.y)**2)
                
                if dist < self.OBJECT_HOLD_DIST_THRESH:
                    gestures.append("Holding Object")
                    break
                    
        return gestures
    
    def detect_body_position(self, landmarks, person_id, hand_landmarks=None):
        """Detect body position with multiple states"""
        metrics = self.calculate_body_metrics(landmarks)
        gestures = self.detect_hand_gestures(metrics, hand_landmarks)
        
        if person_id not in self.people_data:
            self.people_data[person_id] = {
                'history': deque(maxlen=self.HISTORY_LENGTH),
                'hand_landmarks': None,
                'position_history': deque(maxlen=5)
            }
        
        current_pos = (landmarks[mp_pose.PoseLandmark.LEFT_HIP.value].x,
                      landmarks[mp_pose.PoseLandmark.LEFT_HIP.value].y)
        self.people_data[person_id]['history'].append(current_pos)
        self.people_data[person_id]['hand_landmarks'] = hand_landmarks
        
        # Improved sitting detection using multiple metrics
        sitting_condition = (
            metrics['hip_knee_ratio'] < 1.2 and
            metrics['ankle_y'] > 0.85 and
            abs(metrics['torso_angle']) < 30
        )
        
        if sitting_condition:
            position = "Sitting"
        elif abs(metrics['torso_angle']) > 45:
            position = "Bending"
        elif metrics['shoulder_hip_ratio'] > 1.25:
            if len(self.people_data[person_id]['history']) == self.HISTORY_LENGTH:
                movement = sum(
                    np.sqrt((self.people_data[person_id]['history'][i][0] - 
                            self.people_data[person_id]['history'][i-1][0])**2 + 
                           (self.people_data[person_id]['history'][i][1] - 
                            self.people_data[person_id]['history'][i-1][1])**2)
                    for i in range(1, self.HISTORY_LENGTH))
                if movement > 0.08:
                    position = "Walking"
                else:
                    position = "Standing"
            else:
                position = "Standing"
        elif metrics['shoulder_hip_ratio'] < 0.9:
            position = "Lying Down"
        else:
            position = "Unknown"
        
        # Combine position with gestures
        if gestures:
            position += " + " + " + ".join(gestures)
        
        # Store position history for smoothing
        self.people_data[person_id]['position_history'].append(position)
        
        # Get most frequent position in history
        if len(self.people_data[person_id]['position_history']) > 0:
            position = max(set(self.people_data[person_id]['position_history']), 
                          key=self.people_data[person_id]['position_history'].count)
        
        return position

    def process_frame(self, frame):
        """Process a single frame and return annotated frame"""
        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        pose_results = self.pose.process(frame_rgb)
        hand_results = self.hands.process(frame_rgb)
        
        current_people = {}
        
        if pose_results.pose_landmarks:
            person_id = self.next_person_id
            self.next_person_id += 1
            
            # Associate hands with person (simplified approach)
            person_hand_landmarks = []
            if hand_results.multi_hand_landmarks:
                for hand in hand_results.multi_hand_landmarks:
                    hand_x = hand.landmark[mp_hands.HandLandmark.WRIST].x
                    hand_y = hand.landmark[mp_hands.HandLandmark.WRIST].y
                    body_x = pose_results.pose_landmarks.landmark[mp_pose.PoseLandmark.LEFT_SHOULDER].x
                    body_y = pose_results.pose_landmarks.landmark[mp_pose.PoseLandmark.LEFT_SHOULDER].y
                    
                    if abs(hand_x - body_x) < 0.3 and abs(hand_y - body_y) < 0.3:
                        person_hand_landmarks.append(hand)
            
            position = self.detect_body_position(
                pose_results.pose_landmarks.landmark, 
                person_id,
                person_hand_landmarks)
            
            current_people[person_id] = {
                'position': position,
                'landmarks': pose_results.pose_landmarks,
                'hand_landmarks': person_hand_landmarks
            }
            
            # Draw pose landmarks
            mp_drawing.draw_landmarks(
                frame, pose_results.pose_landmarks, mp_pose.POSE_CONNECTIONS)
            
            # Draw hand landmarks
            if person_hand_landmarks:
                for hand in person_hand_landmarks:
                    mp_drawing.draw_landmarks(
                        frame, hand, mp_hands.HAND_CONNECTIONS)
            
            # Choose color based on position
            color = (0, 255, 0)  # Default green
            if "Sitting" in position:
                color = (0, 0, 255)  # Red
            elif "Walking" in position:
                color = (255, 255, 0)  # Cyan
            elif "Bending" in position:
                color = (0, 255, 255)  # Yellow
            elif "Lying Down" in position:
                color = (255, 0, 255)  # Purple
            elif "Hands Raised" in position:
                color = (255, 165, 0)  # Orange
            elif "Holding Object" in position:
                color = (0, 165, 255)  # Blue
            elif "Greeting" in position:
                color = (255, 192, 203)  # Pink
            
            # Display position
            cv2.putText(frame, f"Person {person_id}: {position}", 
                       (10, 30 + person_id * 30),
                       cv2.FONT_HERSHEY_SIMPLEX, 0.7, color, 2)
        
        # Detect interactions between people
        if len(current_people) >= 2:
            person_ids = list(current_people.keys())
            interaction = self.detect_interactions(
                current_people[person_ids[0]],
                current_people[person_ids[1]])
            
            if interaction:
                cv2.putText(frame, f"Interaction: {interaction}", 
                           (frame.shape[1]//2 - 100, 30),
                           cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255), 2)
        
        return frame

def process_video(input_video, output_video):
    recognizer = BodyPositionRecognizer()
    
    cap = cv2.VideoCapture(input_video)
    if not cap.isOpened():
        print(f"Error: Could not open video {input_video}")
        return
    
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = cap.get(cv2.CAP_PROP_FPS)
    
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out = cv2.VideoWriter(output_video, fourcc, fps, (width, height))
    
    while cap.isOpened():
        success, frame = cap.read()
        if not success:
            break
            
        processed_frame = recognizer.process_frame(frame)
        out.write(processed_frame)
        
    cap.release()
    out.release()
    cv2.destroyAllWindows()

# Example usage
input_vid = r'D:\exam\vtest.avi'
output_vid = "output_video_enhanced.mp4"
process_video(input_vid, output_vid)
print(f"Processing complete. Output saved to {output_vid}")

In [12]:
input_vid = r'D:\exam\vtest.avi'
output_vid = "output_video_enhanced.mp4"
process_video(input_vid, output_vid)

# Mini-project 2:
Write a program that compares all faces on a webcam or video with a reference face and marks the faces that are different from the reference face with a red rectangle and the person who is the same as the reference image with a green rectangle.

In [15]:
import cv2
import numpy as np

# تنظیمات اولیه
REFERENCE_IMAGE_PATH = "reference.jpg"  # مسیر تصویر مرجع
MODEL_PATH = "C:/Users/Matin/face_detection_yunet_2023mar.onnx"  # مسیر مدل تشخیص چهره
RECOGNIZER_PATH = "C:/Users/Matin/face_recognition_sface_2021dec.onnx"  # مسیر مدل تشخیص هویت

# آستانه‌های شباهت
L2_THRESHOLD = 1.128
COSINE_THRESHOLD = 0.363

# بارگذاری مدل‌ها
detector = cv2.FaceDetectorYN.create(MODEL_PATH, "C:/Users/Matin/face_detection_yunet_2023mar.onnx", (320, 320), 0.9, 0.3, 5000)
recognizer = cv2.FaceRecognizerSF.create(RECOGNIZER_PATH, "C:/Users/Matin/face_recognition_sface_2021dec.onnx")

# بارگذاری تصویر مرجع و استخراج ویژگی‌ها
ref_image = cv2.imread(r'D:\exam\refrence_image.jpg')
if ref_image is None:
    raise ValueError("تصویر مرجع یافت نشد")

detector.setInputSize((ref_image.shape[1], ref_image.shape[0]))
ref_faces = detector.detect(ref_image)

if ref_faces[1] is None:
    raise ValueError("هیچ چهره‌ای در تصویر مرجع یافت نشد")

ref_face_align = recognizer.alignCrop(ref_image, ref_faces[1][0])
ref_feature = recognizer.feature(ref_face_align)

# راه‌اندازی وبکم
cap = cv2.VideoCapture(0)
if not cap.isOpened():
    raise RuntimeError("وبکم راه‌اندازی نشد")

while True:
    ret, frame = cap.read()
    if not ret:
        break
    
    # تشخیص چهره‌ها در فریم
    height, width = frame.shape[:2]
    detector.setInputSize((width, height))
    faces = detector.detect(frame)
    
    if faces[1] is not None:
        for face in faces[1]:
            # ترازبندی و استخراج ویژگی چهره
            face_align = recognizer.alignCrop(frame, face)
            face_feature = recognizer.feature(face_align)
            
            # محاسبه شباهت با تصویر مرجع
            l2_score = recognizer.match(ref_feature, face_feature, cv2.FaceRecognizerSF_FR_NORM_L2)
            cosine_score = recognizer.match(ref_feature, face_feature, cv2.FaceRecognizerSF_FR_COSINE)
            
            # تعیین رنگ مستطیل بر اساس شباهت
            if l2_score <= L2_THRESHOLD and cosine_score >= COSINE_THRESHOLD:
                color = (0, 255, 0)  # سبز برای چهره مشابه
                label = "Match"
            else:
                color = (0, 0, 255)  # قرمز برای چهره متفاوت
                label = "Unknown"
            
            # رسم مستطیل و برچسب
            coords = face[:-1].astype(np.int32)
            cv2.rectangle(frame, (coords[0], coords[1]), 
                         (coords[0]+coords[2], coords[1]+coords[3]), color, 2)
            cv2.putText(frame, label, (coords[0], coords[1]-10), 
                        cv2.FONT_HERSHEY_SIMPLEX, 0.9, color, 2)
    
    # نمایش فریم
    cv2.imshow('Face Recognition', frame)
    
    # خروج با کلید ESC
    if cv2.waitKey(1) & 0xFF == 27:
        break

# آزادسازی منابع
cap.release()
cv2.destroyAllWindows()

# Mini-project 3:
a) Write a program that first finds a desired object with an object detection model and then tracks it with an object tracking method that we learned in the previous module.

b) Search the Internet to find a method that covers the shortcomings of the method you wrote in the first section. What is the difference?

(Hint: What happens if the object goes off the page in your method? What is the solution?)

# Part a: Object Detection and Tracking


In [1]:
import cv2
import numpy as np

# Initialize object detection model (YOLOv3)
net = cv2.dnn.readNet("yolov3.weights", "yolov3.cfg")
classes = []
with open("coco.names", "r") as f:
    classes = [line.strip() for line in f.readlines()]
layer_names = net.getLayerNames()
output_layers = [layer_names[i-1] for i in net.getUnconnectedOutLayers()]

# Initialize tracker (using CSRT)
tracker = cv2.legacy.TrackerCSRT.create()

# Video capture
cap = cv2.VideoCapture("D:/exam/race_car.mp4")

# Detection phase
ret, frame = cap.read()
height, width = frame.shape[:2]

# Detect objects using YOLO
blob = cv2.dnn.blobFromImage(frame, 0.00392, (416, 416), (0, 0, 0), True, crop=False)
net.setInput(blob)
outs = net.forward(output_layers)

# Process detections
class_ids = []
confidences = []
boxes = []
for out in outs:
    for detection in out:
        scores = detection[5:]
        class_id = np.argmax(scores)
        confidence = scores[class_id]
        if confidence > 0.5 and class_id == 0:  # class_id 0 is for person
            center_x = int(detection[0] * width)
            center_y = int(detection[1] * height)
            w = int(detection[2] * width)
            h = int(detection[3] * height)
            x = int(center_x - w / 2)
            y = int(center_y - h / 2)
            boxes.append([x, y, w, h])
            confidences.append(float(confidence))
            class_ids.append(class_id)

# Apply non-max suppression
indices = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)

# Select first detected object to track
if len(indices) > 0:
    bbox = boxes[indices[0]]
    tracker.init(frame, tuple(bbox))
else:
    print("No objects detected!")
    exit()

# Tracking phase
while True:
    ret, frame = cap.read()
    if not ret:
        break
    
    # Update tracker
    success, bbox = tracker.update(frame)
    
    # Draw bounding box if tracking was successful
    if success:
        x, y, w, h = [int(v) for v in bbox]
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    else:
        cv2.putText(frame, "Tracking failure", (100, 80), 
                   cv2.FONT_HERSHEY_SIMPLEX, 0.75, (0, 0, 255), 2)
    
    cv2.imshow("Tracking", frame)
    
    # Exit if ESC pressed
    if cv2.waitKey(1) & 0xFF == 27:
        break

cap.release()
cv2.destroyAllWindows()

No objects detected!


# Part b: Improved Method with Re-detection Capabilit

In [3]:
import cv2
import numpy as np

# Initialize YOLO
net = cv2.dnn.readNet("yolov3.weights", "yolov3.cfg")
classes = []
with open("coco.names", "r") as f:
    classes = [line.strip() for line in f.readlines()]
layer_names = net.getLayerNames()
output_layers = [layer_names[i-1] for i in net.getUnconnectedOutLayers()]

# Initialize tracker (using CSRT)
tracker = cv2.legacy.TrackerCSRT.create()
tracking = False
frames_since_last_detection = 0
DETECTION_INTERVAL = 30  # Re-detect every 30 frames

cap = cv2.VideoCapture("D:/exam/race_car.mp4")

while True:
    ret, frame = cap.read()
    if not ret:
        break
    
    if not tracking or frames_since_last_detection >= DETECTION_INTERVAL:
        # Detection phase
        height, width = frame.shape[:2]
        blob = cv2.dnn.blobFromImage(frame, 0.00392, (416, 416), (0, 0, 0), True, crop=False)
        net.setInput(blob)
        outs = net.forward(output_layers)
        
        # Process detections
        boxes = []
        confidences = []
        for out in outs:
            for detection in out:
                scores = detection[5:]
                class_id = np.argmax(scores)
                confidence = scores[class_id]
                if confidence > 0.5 and class_id == 0:  # Person class
                    center_x = int(detection[0] * width)
                    center_y = int(detection[1] * height)
                    w = int(detection[2] * width)
                    h = int(detection[3] * height)
                    x = int(center_x - w / 2)
                    y = int(center_y - h / 2)
                    boxes.append([x, y, w, h])
                    confidences.append(float(confidence))
        
        # Apply non-max suppression
        indices = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)
        
        if len(indices) > 0:
            bbox = boxes[indices[0]]
            tracker = cv2.legacy.TrackerCSRT.create()  # Reinitialize tracker
            tracker.init(frame, tuple(bbox))
            tracking = True
            frames_since_last_detection = 0
        else:
            tracking = False
    else:
        # Tracking phase
        success, bbox = tracker.update(frame)
        frames_since_last_detection += 1
        
        if success:
            x, y, w, h = [int(v) for v in bbox]
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        else:
            tracking = False
    
    cv2.imshow("Improved Tracking", frame)
    
    if cv2.waitKey(1) & 0xFF == 27:
        break

cap.release()
cv2.destroyAllWindows()

# مقایسه روش‌های ردیابی شیء و بهبودها

## روش اولیه (ردیابی ساده)
```python
# کد پیاده‌سازی
tracker = cv2.legacy.TrackerCSRT.create()
success, bbox = tracker.update(frame)

# Differences and Improvements

## Problem with Initial Method:
- If the object leaves the frame, tracking fails permanently.
- No recovery mechanism when tracking fails.
- Drift can accumulate over time.

## Improved Method Features:
- **Re-detection capability**: Performs object detection periodically (every 30 frames).
- **Tracking recovery**: Can re-acquire the object if it returns to the frame.
- **Adaptive switching**: Automatically switches between detection and tracking modes.
- **Tracker reinitialization**: Creates a fresh tracker after each detection.

## Key Differences:

| Feature               | Initial Method | Improved Method |
|-----------------------|----------------|-----------------|
| Handles occlusions    | ❌ No          | ✅ Yes          |
| Recovers from loss    | ❌ No          | ✅ Yes          |
| Periodic re-detection | ❌ No          | ✅ Yes          |
| Computational cost    | Lower          | Higher          |