
# Saudi Sign Gesture Recognition System

## Introduction

This project is a **Saudi Sign Gesture Recognition System** that detects hand gestures and converts them into corresponding speech using the **Google Cloud Text-to-Speech API**. The system recognizes both **sign language gestures** (using one hand) and **numbers** (using two hands). The detected gestures or numbers are then converted into audio, and the system plays the spoken output in English/Arabic.

The core of the system relies on **MediaPipe** for hand landmark detection, **pre-trained machine learning models** for recognizing gestures and numbers, and Google’s **Text-to-Speech (TTS)** service for converting the recognized gestures/numbers into spoken words. Both the sign and number models are **multilayer perceptrons** trained on a large dataset, with **hyperparameter tuning** applied to optimize performance.

## How the System Works

1. **Hand Detection and Tracking**:
   - The system captures real-time video from a webcam using **OpenCV**.
   - It uses **MediaPipe** to detect hand landmarks and track hand movements. It works with either one or two hands at a time.

2. **Gesture and Number Recognition**:
   - For **sign language gestures**: If only one hand is detected, the system predicts the hand gesture using a pre-trained model (`finalized_model_hyp_onlysigns.sav`), which classifies the gesture based on hand landmarks. The landmarks are preprocessed (made relative to the first landmark, flattened, and scaled) before being passed to the model.
   - For **numbers**: If two hands are detected, the system uses another pre-trained model (`numbers_model_iter2.sav`) to predict the number represented by the hand signs.

3. **Text-to-Speech Conversion**:
   - Once a gesture or number is recognized and held for 5 seconds, the corresponding text is passed to the Google Cloud TTS API, which synthesizes speech in English/Arabic.
   - The synthesized audio is saved as an MP3 file and played back through the speakers.
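
The preprocessing and model selection described in steps 1 and 2 can be sketched in isolation. This is a minimal illustration, not the notebook's exact implementation: `normalize_landmarks` and `build_features` are hypothetical helper names, and the points are plain `(x, y)` pixel tuples rather than MediaPipe objects. MediaPipe Hands reports 21 landmarks per hand, so under this scheme one hand would yield 42 features and two hands 84; whether the saved models expect exactly those sizes is an assumption.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, int]


def normalize_landmarks(points: Sequence[Point]) -> List[float]:
    """Make points relative to the first landmark, flatten, and scale to [-1, 1]."""
    base_x, base_y = points[0]  # landmark 0 is the wrist in MediaPipe Hands
    flat: List[float] = []
    for x, y in points:
        flat.extend([x - base_x, y - base_y])
    max_abs = max(abs(v) for v in flat)
    if max_abs == 0:
        # All landmarks coincide; avoid dividing by zero
        return [0.0] * len(flat)
    return [v / max_abs for v in flat]


def build_features(hands: Sequence[Sequence[Point]]) -> Tuple[str, List[float]]:
    """Choose a model name by hand count and build that model's input vector."""
    if len(hands) == 1:
        return "signs", normalize_landmarks(hands[0])
    if len(hands) == 2:
        # Each hand is normalized on its own, then the vectors are concatenated
        return "numbers", normalize_landmarks(hands[0]) + normalize_landmarks(hands[1])
    raise ValueError("expected one or two hands")


# Three points stand in for the 21 landmarks of a real hand
hand = [(100, 200), (110, 180), (140, 160)]
model_name, features = build_features([hand])
print(model_name, features)  # prints: signs [0.0, 0.0, 0.25, -0.5, 1.0, -1.0]
```

Normalizing each hand relative to its own wrist makes the features independent of where the hand sits in the frame, and dividing by the largest absolute value reduces the effect of distance from the camera.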

By combining gesture recognition with speech synthesis, the system aims to support communication, making it particularly useful for educational and assistive technologies.
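
The "held for 5 seconds" rule from step 3 can be expressed as a small state machine. The sketch below is one possible way to write it, not the notebook's own code: `HoldDetector` is a hypothetical class, and the current time is passed in explicitly (in a live loop it could come from `time.monotonic()`) so the behavior is easy to check.

```python
from typing import Hashable, Optional


class HoldDetector:
    """Report when the same label has been seen continuously for `hold_seconds`."""

    def __init__(self, hold_seconds: float = 5.0) -> None:
        self.hold_seconds = hold_seconds
        self._label: Optional[Hashable] = None
        self._since = 0.0

    def update(self, label: Hashable, now: float) -> bool:
        """Feed the latest prediction; return True once per completed hold."""
        if label != self._label:
            # A new label restarts the timer from zero
            self._label = label
            self._since = now
            return False
        if now - self._since >= self.hold_seconds:
            # Restart so the label fires again only after another full hold
            self._since = now
            return True
        return False

    def reset(self) -> None:
        """Forget the current label, e.g. when no hand is visible."""
        self._label = None


detector = HoldDetector(hold_seconds=5.0)
for t, label in [(0.0, "Alif"), (2.5, "Alif"), (5.0, "Alif"), (6.0, "Ba")]:
    if detector.update(label, now=t):
        print(f"'{label}' held for 5 seconds at t={t}")
```

Calling `reset()` whenever no hand is detected keeps time spent with an empty frame from counting toward the hold.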


In [1]:
#Reference: https://github.com/kinivi/hand-gesture-recognition-mediapipe

# Standard library
import copy
import os
import pickle
import threading
import time

# Third-party
import cv2
import mediapipe as mp
import numpy as np
from google.cloud import texttospeech
from playsound import playsound

class HandGestureRecognition:
    def __init__(self):
        # Initialize mediapipe
        self.mp_drawing = mp.solutions.drawing_utils
        self.mp_hands = mp.solutions.hands

        # Load pre-trained models for gesture recognition (files are closed after loading)
        with open('model/finalized_model_hyp_onlysigns.sav', 'rb') as f:
            self.model_signs = pickle.load(f)
        with open('model/numbers_model_iter2.sav', 'rb') as f:
            self.model_numbers = pickle.load(f)

        # Text-to-speech client initialization
        os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = 'text-to-speech-435015-82e32527fb00.json'
        self.client = texttospeech.TextToSpeechClient()

        # Text-to-speech parameters
        self.voice = texttospeech.VoiceSelectionParams(
            language_code="ar-XA", ssml_gender=texttospeech.SsmlVoiceGender.NEUTRAL)
        self.audio_config = texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.MP3)
        
        
#         # Dictionary for sign-name-to-Arabic-letter mapping (disabled; still needs to be wired in)
#         self.sign_to_word = {
#                     "Alif": 'أ', "Ba": 'ب', "Ta": 'ت', "Tha": 'ث', "Jim": 'ج', "Ha": 'ح', "Kha": 'خ',
#                     "Dal": 'د', "Zay": 'ز', "Sien": 'س', "Shien": 'ش', "Sad": 'ص', "Dhad": 'ض',
#                     "Tah": 'ط', "Thah": 'ظ', "Ayn": 'ع', "Ghayn": 'غ', "Fa": 'ف', "Qaf": 'ق',
#                     "Kaf": 'ك', "Lam": 'ل', "Miem": 'م', "Noon": 'ن', "He": 'هـ', "Waw": 'و',
#                     "Taa": 'ت', "La": 'لا'}

        # Dictionary for number-to-english word mapping
        self.num_to_word = {
            1: 'one', 2: 'two', 3: 'three', 4: 'four',
            5: 'five', 6: 'six', 7: 'seven', 8: 'eight',
            9: 'nine', 10: 'ten'
        }
        
        # Alternative (disabled): number-to-Arabic-numeral mapping
#         self.num_to_word = { "1": '١', "2": '٢', "3": '٣', "4": '٤', "5": '٥', 
#                             "6": '٦', "7": '٧', "8": '٨', "9": '٩', "10": '١٠'}

        # Global states for gesture tracking and speech synthesis
        self.prev_result = None
        self.response = None
        self.lock = threading.Lock()
        self.prev_time = time.time()
        self.sign_count = 0
        self.same_gesture_duration = 0

    def calc_landmark_list(self, image, landmarks):
        """Calculate the hand landmarks in pixel coordinates."""
        image_width, image_height = image.shape[1], image.shape[0]
        landmark_point = []
        for landmark in landmarks.landmark:
            landmark_x = min(int(landmark.x * image_width), image_width - 1)
            landmark_y = min(int(landmark.y * image_height), image_height - 1)
            landmark_point.append([landmark_x, landmark_y])
        return landmark_point

    def pre_process_landmark(self, landmark_list):
        """Normalize the hand landmarks and flatten into a single list."""
        # Make every point relative to the first landmark (the wrist)
        base_x, base_y = landmark_list[0][0], landmark_list[0][1]
        relative = [[x - base_x, y - base_y] for x, y in landmark_list]
        # Flatten [[x0, y0], [x1, y1], ...] into [x0, y0, x1, y1, ...]
        flat = [coord for point in relative for coord in point]
        # Scale to [-1, 1]; fall back to 1 if every landmark coincides
        max_value = max(map(abs, flat)) or 1
        return [n / max_value for n in flat]

    def text_to_speech(self, text):
        """Convert text to speech, save audio incrementally, and play it."""
        # Map numeric predictions to words; anything else is passed on as a plain str
        if isinstance(text, (int, np.integer)):
            text = self.num_to_word.get(int(text), text)

        synthesis_input = texttospeech.SynthesisInput(text=str(text))

        # Retry the API call up to three times before giving up
        response = None
        for _ in range(3):
            try:
                with self.lock:
                    response = self.client.synthesize_speech(
                        input=synthesis_input,
                        voice=self.voice,
                        audio_config=self.audio_config
                    )
                    self.response = response
                    # Reserve a unique file number while still holding the lock
                    self.sign_count += 1
                    audio_file = f"response_{self.sign_count}.mp3"
                break
            except Exception as e:
                print(f"Text-to-speech error: {e}, retrying...")
        if response is None:
            print("Failed after 3 retries.")
            return

        # Save this thread's own response so a concurrent call cannot overwrite it
        try:
            # Writing the synthesized speech content to a file
            with open(audio_file, 'wb') as out:
                out.write(response.audio_content)

            print(f"Audio content saved to {audio_file}")

            # Ensure a small delay to avoid potential file access issues
            time.sleep(0.5)

            # Play the saved audio file using playsound
            playsound(audio_file)
            print(f"Played {audio_file}")
        except Exception as e:
            print(f"Failed to save or play {audio_file}: {e}")

    def detect_hands(self):
        """Main method to detect hand gestures and trigger actions based on recognized gestures."""
        cap = cv2.VideoCapture(0)
        with self.mp_hands.Hands(min_detection_confidence=0.7, min_tracking_confidence=0.7, max_num_hands=2) as hands:
            while cap.isOpened():
                success, image = cap.read()
                if not success:
                    print("Ignoring empty camera frame.")
                    continue

                # Flip the image for a selfie view and convert color to RGB
                image = cv2.flip(image, 1)
                image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

                # Process the image for hand landmarks
                results = hands.process(image_rgb)
                debug_image = copy.deepcopy(image)

                if results.multi_hand_landmarks:
                    if len(results.multi_hand_landmarks) == 1:
                        # One hand detected, use the model_signs for gesture prediction (sign language)
                        hand_landmarks = results.multi_hand_landmarks[0]
                        landmark_list = self.calc_landmark_list(debug_image, hand_landmarks)
                        pre_processed_landmark_list = self.pre_process_landmark(landmark_list)

                        # Predict gesture using the pre-trained model for signs
                        result = self.model_signs.predict([pre_processed_landmark_list])[0]

                        # Restart the hold timer whenever the predicted gesture changes
                        if result != self.prev_result:
                            self.prev_time = time.time()
                        self.same_gesture_duration = time.time() - self.prev_time

                        # Display gesture on the frame
                        cv2.putText(image, str(result), (50, 50), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2, cv2.LINE_AA)

                        # If the gesture stays the same for 5 seconds, trigger text-to-speech and save it
                        if self.same_gesture_duration >= 5:
                            print(f"Gesture '{result}' detected for 5 seconds. Saving and playing sound.")
                            threading.Thread(target=self.text_to_speech, args=(result,)).start()
                            self.prev_time = time.time()

                        self.prev_result = result

                    elif len(results.multi_hand_landmarks) == 2:
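                        # Two hands detected, use the model_numbers for number prediction
                        # Note: MediaPipe lists hands in detection order, so index 0 is
                        # not guaranteed to be the left hand
                        left_hand = results.multi_hand_landmarks[0]
                        right_hand = results.multi_hand_landmarks[1]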

                        # Calculate and normalize landmarks for both hands
                        left_landmark_list = self.calc_landmark_list(debug_image, left_hand)
                        right_landmark_list = self.calc_landmark_list(debug_image, right_hand)

                        # Pre-process both hand landmark lists
                        left_pre_processed = self.pre_process_landmark(left_landmark_list)
                        right_pre_processed = self.pre_process_landmark(right_landmark_list)

                        # Combine both hands landmarks for model input
                        combined_landmarks = left_pre_processed + right_pre_processed

                        # Predict numbers using the pre-trained model for numbers
                        result = self.model_numbers.predict([combined_landmarks])[0]

                        # Convert number result to corresponding text before passing to TTS
                        result_text = self.num_to_word.get(int(result), str(result))

                        # Restart the hold timer whenever the predicted number changes
                        if result != self.prev_result:
                            self.prev_time = time.time()
                        self.same_gesture_duration = time.time() - self.prev_time

                        # Display the number prediction on the frame
                        cv2.putText(image, str(result), (50, 50), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2, cv2.LINE_AA)

                        # If the gesture stays the same for 5 seconds, trigger text-to-speech and save it
                        if self.same_gesture_duration >= 5:
                            print(f"Number '{result_text}' detected for 5 seconds. Saving and playing sound.")
                            threading.Thread(target=self.text_to_speech, args=(result_text,)).start()
                            self.prev_time = time.time()

                        self.prev_result = result

                    # Draw landmarks for all hands detected
                    for hand_landmarks in results.multi_hand_landmarks:
                        self.mp_drawing.draw_landmarks(image, hand_landmarks, mp.solutions.hands.HAND_CONNECTIONS)
                else:
                    # No hands in view: clear the last gesture and restart the hold timer
                    self.prev_result = None
                    self.prev_time = time.time()

                # Display the image with detected landmarks
                cv2.imshow('Hand Gesture Recognition', image)

                # Exit if the Esc key is pressed
                if cv2.waitKey(1) & 0xFF == 27:  # 27 is the ASCII code for Esc
                    break

        cap.release()
        cv2.destroyAllWindows()

if __name__ == "__main__":
    hand_gesture_recognition = HandGestureRecognition()
    hand_gesture_recognition.detect_hands()


Number '1' detected for 5 seconds. Saving and playing sound.
Audio content saved to response_1.mp3
Played response_1.mp3
Number '2' detected for 5 seconds. Saving and playing sound.
Audio content saved to response_2.mp3
Played response_2.mp3
Number '3' detected for 5 seconds. Saving and playing sound.
Audio content saved to response_3.mp3
Played response_3.mp3
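
The retry loop inside `text_to_speech` can also be factored into a reusable helper. The following is a sketch under stated assumptions rather than part of the original notebook: `call_with_retries` is a hypothetical function, it treats every exception as retryable, and it returns `None` on failure, so it only suits calls that never legitimately return `None`.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T], attempts: int = 3, delay_seconds: float = 0.0
) -> Optional[T]:
    """Call `func` until it succeeds or `attempts` is exhausted; None on failure."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            print(f"Attempt {attempt}/{attempts} failed: {exc}")
            # Pause between attempts, but not after the final one
            if attempt < attempts and delay_seconds > 0:
                time.sleep(delay_seconds)
    return None


# Stand-in for a network call that fails twice before succeeding
state = {"calls": 0}


def flaky_request() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("temporary failure")
    return "audio-bytes"


result = call_with_retries(flaky_request, attempts=3)
print(result, state["calls"])
```

In the class above, the wrapped callable would be a `lambda` around `self.client.synthesize_speech(...)`, with a small `delay_seconds` to space out attempts against the API.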
