# Notebook Explanation and Important Links
This notebook follows the same data format as Aditya's, so you should be able to use the data by simply changing the directory or link, if necessary. <br>

Three main changes have been made:
1.   Instead of B's videos (which were heavily optimized for the CNN), Wiame's videos are used. They keep the original resolution but have a standardized length of 113 frames. If the sequence still seems too long, you can keep 1 frame out of every 2 or 3 to shorten it.
2.   Face landmark removal. Previously, in addition to Pose (33 landmarks) and Hands (21x2 landmarks), the Face model with 468 landmarks was included. It was removed because of the imbalance in the number of landmarks and the limited contribution of facial data to sign language recognition.
3.   When the pose or a hand is not detected, the previously detected coordinates are reused instead of a zero array as padding, to maintain continuity.

Data Format<br>
The output numpy array has the shape (113, 75, 3):<br>
113 = Frame count <br>
75 = Key points (0-32 pose, 33-53 left hand, 54-74 right hand). You can select specific hand indices if needed.<br>
3 = Coordinates (x,y,z).
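
As a rough illustration of the layout above, the sketch below slices a (113, 75, 3) array into its pose and hand blocks and keeps every second frame. It uses a synthetic array as a stand-in for a saved file; with real data you would load one of the `.npy` files with `np.load` instead. The variable names are illustrative only.

```python
import numpy as np

# Synthetic stand-in for one saved file (assumption: real data comes from np.load).
keypoints = np.random.rand(113, 75, 3)

# Split the 75 key points into the three blocks described above.
pose = keypoints[:, 0:33]         # indices 0-32
left_hand = keypoints[:, 33:54]   # indices 33-53
right_hand = keypoints[:, 54:75]  # indices 54-74

# Keep 1 frame out of every 2 to shorten the sequence.
subsampled = keypoints[::2]

# Hands only, flattened to one feature vector per frame.
hands_flat = keypoints[:, 33:75].reshape(len(keypoints), -1)

print(pose.shape, left_hand.shape, right_hand.shape)
print(subsampled.shape, hands_flat.shape)
```

Slicing with `[::3]` instead keeps every third frame.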

Links:
1.   [All data for and from this notebook, drive](https://drive.google.com/drive/folders/1rTRZxMkvAyf805AuPoVvrfw8KnB3Ttod?usp=share_link)
2.   [Aditya's original notebook, slack post](https://omdenaindones-9mu9399.slack.com/archives/C07MH4C0YLF/p1732443924936359)
3.   [Wiame's processed videos, slack post](https://omdenaindones-9mu9399.slack.com/archives/C07N05MQNCC/p1732105984337299)




# Future Improvements

1.   **Landmark-Level Augmentation.** Similar to video augmentation, but applied only to the coordinates. This includes mirroring, rotation, and adding noise.
2.   **Model Result Comparison (Zero vs. Non-Zero).** A reference for future extraction: compare model results on zero-filled coordinates against results on coordinates filled with the previously detected values.
3.   **Specific Hand and Pose Detection (vs. Holistic).** Focusing on specific hands and poses rather than holistic detection could allow for more flexible parameters, improving extraction performance and reducing landmark extraction duration.
4.   **Confidence Parameter Adjustment.** Instead of using the default confidence threshold of 0.5, adjust it depending on the hand detection frequency. Lower it if hands are often not detected, or increase it for more precise results.
5. **GPU Version?**
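
A minimal sketch of what landmark-level augmentation (item 1) could look like on a (frames, 75, 3) array is shown below. It is not part of this notebook's pipeline. The function names, the noise scale, and the rotation center are assumptions, and `POSE_LR_PAIRS` assumes the standard MediaPipe Pose ordering of left/right landmarks. Rotation is done directly in normalized coordinates, which ignores the video's aspect ratio.

```python
import numpy as np

# Left/right landmark pairs, assuming MediaPipe Pose's 33-landmark ordering.
POSE_LR_PAIRS = [(1, 4), (2, 5), (3, 6), (7, 8), (9, 10)] + [
    (i, i + 1) for i in range(11, 33, 2)
]

def mirror(seq):
    """Flip horizontally (x -> 1 - x) and swap left/right landmarks."""
    out = seq.copy()
    missing = ~seq.any(axis=-1)          # landmarks that are still all-zero
    out[..., 0] = 1.0 - out[..., 0]
    out[missing] = 0.0                   # keep undetected landmarks at zero
    for left, right in POSE_LR_PAIRS:
        out[:, [left, right]] = out[:, [right, left]]
    left_hand = out[:, 33:54].copy()
    out[:, 33:54] = out[:, 54:75]
    out[:, 54:75] = left_hand
    return out

def rotate(seq, degrees, center=(0.5, 0.5)):
    """Rotate x and y around `center` in the image plane; z is unchanged."""
    theta = np.deg2rad(degrees)
    c, s = np.cos(theta), np.sin(theta)
    x = seq[..., 0] - center[0]
    y = seq[..., 1] - center[1]
    out = seq.copy()
    out[..., 0] = c * x - s * y + center[0]
    out[..., 1] = s * x + c * y + center[1]
    return out

def add_noise(seq, sigma=0.005, rng=None):
    """Add Gaussian jitter to every coordinate."""
    rng = np.random.default_rng() if rng is None else rng
    return seq + rng.normal(0.0, sigma, size=seq.shape)

# Usage on a synthetic sequence.
seq = np.random.default_rng(0).random((113, 75, 3))
augmented = add_noise(rotate(mirror(seq), degrees=5))
print(augmented.shape)
```

Note that `rotate` and `add_noise` do not treat all-zero (undetected) landmarks specially, so a real pipeline would probably want to mask those the way `mirror` does.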




# Install and Import Dependencies

In [None]:
!pip install -q mediapipe

In [None]:
import os
import cv2
import numpy as np
import mediapipe as mp

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Extract and Save Keypoints

## Non-Zero Extraction

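When the pose or a hand is not detected, the previously detected coordinates are reused instead of filling the coordinates with [0, 0, 0]. This preserves the continuity of movement. The buffers start as zero arrays for each video, so frames before the first detection still contain zeros.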

In [7]:
# Initialize Mediapipe Holistic
mp_holistic = mp.solutions.holistic
holistic = mp_holistic.Holistic(static_image_mode=False,
                                min_detection_confidence=0.3,
                                min_tracking_confidence=0.3)

def extract_keypoints(video_path):

    left_hand_keypoints = np.zeros((21, 3))
    right_hand_keypoints = np.zeros((21, 3))
    pose_keypoints = np.zeros((33, 3))

    cap = cv2.VideoCapture(video_path)
    keypoints_sequence = []

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        results = holistic.process(frame_rgb)

        # If detected update the keypoints
        # Extract pose landmarks
        if results.pose_landmarks:
            pose_keypoints = np.array([[lm.x, lm.y, lm.z] for lm in results.pose_landmarks.landmark])

        # Extract left hand landmarks
        if results.left_hand_landmarks:
            left_hand_keypoints = np.array([[lm.x, lm.y, lm.z] for lm in results.left_hand_landmarks.landmark])
        # Extract right hand landmarks
        if results.right_hand_landmarks:
            right_hand_keypoints = np.array([[lm.x, lm.y, lm.z] for lm in results.right_hand_landmarks.landmark])

        # Concatenate all keypoints into a single vector
        keypoints = np.concatenate([pose_keypoints, left_hand_keypoints, right_hand_keypoints])
        keypoints_sequence.append(keypoints)

    cap.release()

    return keypoints_sequence # Shape: (num_frames, total_keypoints, 3)
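
The carry-forward logic in `extract_keypoints` can also be applied after the fact to a zero-filled array, which may be convenient for the zero vs. non-zero comparison since a single extraction pass could then serve both variants. The sketch below is illustrative only: `forward_fill` and `BLOCKS` are hypothetical names, and it assumes a block counts as "not detected" exactly when all of its coordinates are zero.

```python
import numpy as np

# Key point ranges from the data format: pose, left hand, right hand.
BLOCKS = {
    "pose": slice(0, 33),
    "left_hand": slice(33, 54),
    "right_hand": slice(54, 75),
}

def forward_fill(zero_filled):
    """Replace all-zero blocks with the most recent detected block."""
    out = np.array(zero_filled, dtype=float)
    for block in BLOCKS.values():
        last = None
        for t in range(len(out)):
            if out[t, block].any():
                last = out[t, block].copy()   # block was detected in this frame
            elif last is not None:
                out[t, block] = last          # reuse the previous detection
    return out

# Tiny example: the left hand is detected only in frame 1 of 4.
seq = np.zeros((4, 75, 3))
seq[1, 33:54] = 0.5
filled = forward_fill(seq)
print(filled[0, 33:54].any(), filled[3, 33:54].min())
```

As in `extract_keypoints`, frames before the first detection stay at zero.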

In [8]:
DATA_DIR = '/content/drive/MyDrive/Omdena/sign_language_recognition/enhanced_videos_v2'
SAVE_DIR = '/content/drive/MyDrive/Omdena/sign_language_recognition/landmark_non_zero'

os.makedirs(SAVE_DIR, exist_ok=True)

for word in os.listdir(DATA_DIR):
    word_dir = os.path.join(DATA_DIR, word)
    save_word_dir = os.path.join(SAVE_DIR, word)

    os.makedirs(save_word_dir, exist_ok=True)
    print("Processing", word, "folder")
    for video_file in os.listdir(word_dir):
        save_path = os.path.join(save_word_dir, video_file.replace('.mp4', '.npy'))

        # Skip if the keypoints file already exists
        if os.path.exists(save_path):
            continue

        video_path = os.path.join(word_dir, video_file)
        keypoints = extract_keypoints(video_path)
        np.save(save_path, keypoints)  # Save as .npy

Processing maaf folder


KeyboardInterrupt: 

# Push to Dagshub



In [None]:
# Install the DagsHub python client
!pip install -q dagshub

from dagshub.notebook import save_notebook

save_notebook(repo="Omdena/JakartaIndonesia_SignLanguageTranslation", path="preprocessing", branch="kenji")

# Landmark Level Augmentation
