# Indian Sign Language Recognition  
**By:** Ipsita Jain (210102039)

## Motivation

Sign language is a vital communication tool for the deaf and hard-of-hearing community. In India, where linguistic diversity is vast, Indian Sign Language (ISL) remains underrepresented in AI/ML applications. This project aims to build a gesture recognition system for ISL using deep learning, fostering inclusivity and accessibility.

**Why this topic?**

- **Social Impact:** Automation of ISL recognition can bridge communication gaps in education, healthcare, and public services.
- **Technical Challenge:** Gesture recognition involves complexities like background noise, lighting variations, and real-time processing.

## Historical Perspective: Multimodal Learning

Gesture recognition has moved through several generations of techniques, with multimodal learning as the most recent:

- **Early 2010s:** Handcrafted features (e.g., HOG, SIFT) combined with classifiers like SVMs.
- **Mid-2010s:** CNNs dominated image-based tasks, but required large labeled datasets.
- **2020s:** Transformer-based models such as ViT brought attention-based architectures to vision, and models such as CLIP enabled cross-modal learning by aligning text and images.

**Connection to this work:**  
This project adopts a CNN-based approach, inspired by modern architectures like ResNet, but tailored for ISL’s unique gestures and limited dataset sizes.

## Key Learnings

1. **Data Limitations:** ISL datasets are scarce compared to those for American Sign Language (ASL). Data augmentation (rotation, flipping) was critical for generalization.
2. **Model Trade-offs:** Lightweight CNNs (e.g., MobileNet) outperformed deeper models: they overfit less on the small dataset and also offered faster inference.
3. **Real-World Gaps:** Accuracy dropped in cluttered backgrounds, highlighting the need for robust preprocessing.
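The augmentation mentioned in point 1 can be sketched with plain NumPy. The write-up does not give the exact settings, so the ±10° rotation range and 50% flip probability below are assumptions, and `rotate_nearest` / `augment_batch` are hypothetical helpers rather than code from the project. One ISL-specific caution: a horizontal flip mirrors the signer, which swaps the apparent dominant hand, so it may need to be disabled for signs whose meaning depends on handedness.

```python
import numpy as np


def rotate_nearest(img: np.ndarray, angle_deg: float) -> np.ndarray:
    """Rotate an H x W (x C) image about its centre by angle_deg degrees.

    Uses inverse mapping with nearest-neighbour sampling; output pixels whose
    source falls outside the image are filled with zeros.
    """
    h, w = img.shape[:2]
    theta = np.deg2rad(angle_deg)
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.indices((h, w))
    # For every output pixel, compute which source pixel maps onto it.
    src_x = np.rint(cos_t * (xs - cx) + sin_t * (ys - cy) + cx).astype(int)
    src_y = np.rint(-sin_t * (xs - cx) + cos_t * (ys - cy) + cy).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.zeros_like(img)
    out[valid] = img[src_y[valid], src_x[valid]]
    return out


def augment_batch(images, rng, max_angle=10.0, flip_prob=0.5):
    """Return a randomly rotated and (optionally) mirrored copy of a batch."""
    out = np.empty_like(images)
    for i, img in enumerate(images):
        aug = rotate_nearest(img, rng.uniform(-max_angle, max_angle))
        if rng.random() < flip_prob:
            aug = aug[:, ::-1]  # horizontal flip (mirror left/right)
        out[i] = aug
    return out
```

In the notebook this would be applied to the training images only, e.g. `X_train_aug = augment_batch(X_train, np.random.default_rng(0))`. Keras also ships preprocessing layers (`RandomFlip`, `RandomRotation`) that apply the same kind of transform on the fly during training.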

## Code & Experiments

### Importing Required Libraries

```python
import os

import cv2
import numpy as np
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from keras.utils import to_categorical
from keras.optimizers import Adam
```

### Data Preparation (from `function.py`)

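```python
def load_data(data_dir, img_size=(64, 64)):
    """Load images from data_dir/<class_name>/<image> into arrays.

    Returns (images, one-hot labels, {class index: class name}).
    """
    images = []
    labels = []
    label_map = {}
    # Sort so class indices are stable across runs and machines
    # (os.listdir order is arbitrary); skip stray files in data_dir.
    class_names = sorted(
        d for d in os.listdir(data_dir) if os.path.isdir(os.path.join(data_dir, d))
    )
    for idx, folder in enumerate(class_names):
        label_map[idx] = folder
        for file in sorted(os.listdir(os.path.join(data_dir, folder))):
            img_path = os.path.join(data_dir, folder, file)
            img = cv2.imread(img_path)  # BGR, same channel order as webcam frames
            if img is None:  # unreadable or non-image file
                continue
            img = cv2.resize(img, img_size)
            images.append(img)
            labels.append(idx)
    images = np.array(images)
    labels = to_categorical(labels, num_classes=len(label_map))
    return images, labels, label_map


# Example usage:
# X_train, y_train, label_map = load_data('data/train')
```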
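`load_data` returns a single set of arrays, while the training cell expects separate train and test splits. If the dataset has no separate test folder (an assumption; only `data/train` appears above), a per-class split keeps every gesture represented in roughly the same proportion in both sets. `stratified_split` is a hypothetical helper; scikit-learn's `train_test_split(..., stratify=...)` does the same job if that dependency is acceptable.

```python
import numpy as np


def stratified_split(X, y, test_frac=0.2, seed=42):
    """Split (X, y) class by class so class proportions match in both sets.

    y is expected to be one-hot encoded, as returned by load_data().
    """
    rng = np.random.default_rng(seed)
    classes = np.argmax(y, axis=1)
    train_idx, test_idx = [], []
    for c in np.unique(classes):
        idx = np.flatnonzero(classes == c)
        rng.shuffle(idx)
        # At least one test sample per class, otherwise ~test_frac of the class.
        n_test = max(1, int(round(len(idx) * test_frac)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    train_idx = np.array(train_idx, dtype=int)
    test_idx = np.array(test_idx, dtype=int)
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]
```

Typical use: `X, y, label_map = load_data('data/train')` followed by `X_train, X_test, y_train, y_test = stratified_split(X, y)`.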

### Model Architecture (from `train.py`)

```python
NUM_CLASSES = 10  # set to the number of gesture classes, e.g. len(label_map)

model = Sequential()
# Two conv/pool blocks extract spatial features from 64x64 BGR images
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
# Dense classifier head with dropout to limit overfitting
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(NUM_CLASSES, activation='softmax'))
model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
```
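Tracing the layer arithmetic by hand shows where this model's capacity sits. With Keras' default 'valid' padding, each 3×3 convolution reduces each spatial dimension by 2, and each 2×2 pool halves it (rounding down): 64 → 62 → 31 → 29 → 14. Flattening 14×14×64 gives 12,544 features, so with 10 classes the first `Dense(128)` layer holds 1,605,760 of the 1,626,442 trainable parameters. The helper below is a hypothetical sketch that reproduces these figures; `model.summary()` should report the same counts.

```python
def cnn_shape_and_params(input_hw=64, in_channels=3, conv_filters=(32, 64),
                         kernel=3, dense_units=128, num_classes=10):
    """Trace output shapes and trainable parameter counts layer by layer.

    Assumes 'valid' padding, stride-1 convolutions and 2x2 max pooling with
    stride 2, as in the model above. Dropout has no parameters and does not
    change the shape, so it is omitted.
    """
    rows = []
    size, channels = input_hw, in_channels
    for filters in conv_filters:
        params = kernel * kernel * channels * filters + filters  # weights + biases
        size = size - kernel + 1  # 'valid' convolution shrinks by kernel - 1
        rows.append((f"Conv2D({filters})", (size, size, filters), params))
        size = size // 2  # pooling floors odd sizes (29 -> 14)
        rows.append(("MaxPooling2D", (size, size, filters), 0))
        channels = filters
    flat = size * size * channels
    rows.append(("Flatten", (flat,), 0))
    rows.append((f"Dense({dense_units})", (dense_units,), flat * dense_units + dense_units))
    rows.append((f"Dense({num_classes})", (num_classes,), dense_units * num_classes + num_classes))
    return rows


for name, shape, params in cnn_shape_and_params():
    print(f"{name:<14}{str(shape):<16}{params:>10,}")
print("total:", sum(p for _, _, p in cnn_shape_and_params()))  # total: 1626442
```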

### Training the Model (from `train.py`)

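```python
# Assumes X_train, y_train, X_test, y_test are already prepared.
# X_test doubles as the validation set here, so per-epoch val_accuracy is
# measured on the same data as the final evaluation below.
history = model.fit(X_train, y_train, epochs=10, batch_size=32,
                    validation_data=(X_test, y_test))
```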
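Because overfitting was one of the main trade-offs observed, it helps to read the gap between training and validation accuracy from the returned `history`. The sketch assumes the metric key is `'accuracy'`, which is what recent Keras versions use for `metrics=['accuracy']` (older standalone Keras used `'acc'`); `summarize_history` is a hypothetical helper.

```python
def summarize_history(history_dict, metric="accuracy"):
    """Report the best validation epoch and the train/validation gap.

    history_dict is the `history.history` dict returned by model.fit().
    """
    train = history_dict[metric]
    val = history_dict["val_" + metric]
    best = max(range(len(val)), key=lambda i: val[i])  # first epoch with top val score
    return {
        "best_epoch": best + 1,  # 1-based, matching Keras' epoch printout
        "best_val": val[best],
        "gap_at_best": train[best] - val[best],
        "final_gap": train[-1] - val[-1],
    }
```

For example, `summarize_history(history.history)`. A large and growing `final_gap` is the usual overfitting signal, and a `best_epoch` well before the last epoch suggests training for fewer epochs or adding `keras.callbacks.EarlyStopping`.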

### Model Evaluation (from `train.py`)

```python
loss, acc = model.evaluate(X_test, y_test)
print(f"Test Loss: {loss:.4f}, Test Accuracy: {acc:.4f}")
```
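A single accuracy figure can hide gestures that are systematically confused with each other (for example, signs with similar hand shapes). A confusion matrix built from the model's predicted probabilities shows this per class. The helpers below are a NumPy-only sketch, not part of the original project.

```python
import numpy as np


def confusion_matrix(y_true_onehot, y_prob, num_classes):
    """Rows = true class, columns = predicted class."""
    true = np.argmax(y_true_onehot, axis=1)
    pred = np.argmax(y_prob, axis=1)
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (true, pred), 1)  # counts repeated (true, pred) pairs correctly
    return cm


def per_class_accuracy(cm):
    """Recall per class; classes with no test samples get 0.0."""
    support = cm.sum(axis=1)
    return np.divide(np.diag(cm), support,
                     out=np.zeros(len(cm), dtype=float), where=support > 0)


# Usage in the notebook (assumes model, X_test, y_test, label_map exist):
# cm = confusion_matrix(y_test, model.predict(X_test, verbose=0), y_test.shape[1])
# for idx, acc in enumerate(per_class_accuracy(cm)):
#     print(label_map[idx], f"{acc:.1%}")
```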

### Real-time Prediction Demo (from `app.py`)

*This is a simplified version for illustration. See `app.py` for the full webcam-based demo.*

```python
cap = cv2.VideoCapture(0)  # default webcam
if not cap.isOpened():
    raise RuntimeError("Could not open webcam")
while True:
    ret, frame = cap.read()
    if not ret:
        break
    # Same preprocessing as training: resize to 64x64, then add a batch axis
    img = cv2.resize(frame, (64, 64))
    img = np.expand_dims(img, axis=0)
    prediction = model.predict(img, verbose=0)  # no progress bar per frame
    predicted_class = int(np.argmax(prediction[0]))
    # Draw the predicted label on the frame
    cv2.putText(frame, label_map[predicted_class], (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
    cv2.imshow('ISL Recognition', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):  # press q to quit
        break
cap.release()
cv2.destroyAllWindows()
```
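The demo resizes the whole camera frame straight to 64×64. A typical 4:3 webcam frame (for example 640×480) is squeezed horizontally in the process, so hand shapes look narrower than in square training images. Assuming the training images are roughly square, cropping the centre square first keeps proportions closer to what the model saw. `center_square_crop` is a hypothetical helper.

```python
import numpy as np


def center_square_crop(frame):
    """Crop the largest centred square from an H x W (x C) frame."""
    h, w = frame.shape[:2]
    side = min(h, w)
    top = (h - side) // 2
    left = (w - side) // 2
    return frame[top:top + side, left:left + side]
```

Inside the loop, `img = cv2.resize(center_square_crop(frame), (64, 64))` would replace the plain resize; the label can still be drawn on the full `frame`.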

## Reflections

### Surprises

- Background subtraction alone wasn’t sufficient for reliable predictions.
- Transfer learning (e.g., using pretrained VGG16) initially hurt performance due to domain mismatch.
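The write-up does not say which background-subtraction method was tried. The sketch below assumes the simplest one, differencing each frame against a stored image of the empty scene, and uses synthetic arrays to show one plausible reason it falls short: a global lighting change shifts every pixel past the threshold, so the whole frame is flagged as foreground. `foreground_mask` and its threshold of 30 are arbitrary, hypothetical choices. Adaptive methods such as OpenCV's `cv2.createBackgroundSubtractorMOG2` update the background model over time and cope better with gradual changes, but can still be thrown off by sudden ones.

```python
import numpy as np


def foreground_mask(frame, background, threshold=30):
    """Mark pixels that differ from a reference background image.

    A pixel is foreground if any channel differs by more than `threshold`.
    """
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return diff.max(axis=-1) > threshold


# Toy scene: flat grey background with a bright 20x20 square standing in for a hand.
background = np.full((64, 64, 3), 100, dtype=np.uint8)
frame = background.copy()
frame[20:40, 20:40] = 200
print(foreground_mask(frame, background).mean())    # 0.09765625 (only the square)

# Same scene under brighter lighting: every pixel shifts by +40.
brighter = np.clip(frame.astype(np.int16) + 40, 0, 255).astype(np.uint8)
print(foreground_mask(brighter, background).mean())  # 1.0 (whole frame flagged)
```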

### Improvements

1. **Dataset Expansion:** Collaborate with ISL communities to collect more diverse samples.
2. **Multimodal Fusion:** Integrate pose estimation (e.g., MediaPipe) to isolate hand movements.
3. **Edge Deployment:** Optimize the model with TensorFlow Lite for mobile use.
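For improvement 2, MediaPipe Hands reports 21 landmarks per detected hand, with x and y normalized to [0, 1] by the image width and height. Turning those into a pixel crop is simple arithmetic; the sketch below assumes the landmark coordinates have already been extracted into two lists and leaves out the MediaPipe call itself. `hand_bbox_from_landmarks` and its 15% default margin are hypothetical choices.

```python
import numpy as np


def hand_bbox_from_landmarks(xs, ys, frame_w, frame_h, margin=0.15):
    """Square pixel box (left, top, right, bottom) around normalized landmarks.

    xs, ys: landmark coordinates normalized to [0, 1] by frame width/height.
    margin: padding added on each edge, as a fraction of the box side.
    """
    px = np.clip(np.asarray(xs, dtype=float), 0.0, 1.0) * frame_w
    py = np.clip(np.asarray(ys, dtype=float), 0.0, 1.0) * frame_h
    cx, cy = (px.min() + px.max()) / 2, (py.min() + py.max()) / 2
    # Half the side of a square covering the longer landmark extent, plus margin.
    half = max(px.max() - px.min(), py.max() - py.min()) * (0.5 + margin)
    left = max(0, int(np.floor(cx - half)))
    top = max(0, int(np.floor(cy - half)))
    right = min(frame_w, int(np.ceil(cx + half)))
    bottom = min(frame_h, int(np.ceil(cy + half)))
    return left, top, right, bottom
```

The crop `frame[top:bottom, left:right]` would then be resized to 64×64 in place of the full frame, so the classifier sees far less of the cluttered background.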

## References

1. MediaPipe Hands: On-device Real-time Hand Tracking (https://arxiv.org/abs/2006.10214)
2. Keras Documentation (https://keras.io/)
3. ASL Alphabet Dataset (Kaggle) (https://www.kaggle.com/datasets/grassknoted/asl-alphabet)
4. Visual Haystacks: Berkeley AI Research Blog (https://bair.berkeley.edu/blog/2024/07/20/visual-haystacks/)