# 01 - Data Exploration
## Real-Time Sign Language Translator

This notebook covers:
1. Environment setup and package installation
2. Dataset download and setup
3. Data exploration and visualization
4. Class distribution analysis
5. Sample image visualization

## 0. Install Required Packages

**IMPORTANT:** Run this cell first, then **RESTART THE KERNEL** before continuing.

Pinning NumPy below 1.24 avoids the NumPy-TensorFlow compatibility issue. The restart is needed so the kernel loads the newly installed NumPy version.

In [None]:
# Install required packages with compatible versions
# Fix NumPy version for TensorFlow compatibility
!pip install "numpy<1.24" pandas opencv-python pillow matplotlib seaborn kaggle -q

print("[OK] All packages installed successfully!")
print("")
print("[!] IMPORTANT: Please RESTART the kernel now!")
print("    Kernel -> Restart Kernel")
print("")
print("[!] After restart, SKIP this cell and run from Section 1 onwards")
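
After the restart, it can help to confirm that the pin actually took effect. The following is a minimal sketch using only the standard library; the helper names `parse_major_minor` and `numpy_pin_satisfied` are hypothetical and not part of the original notebook.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_major_minor(version_string):
    """Return (major, minor) as integers from a string such as '1.23.5'."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def numpy_pin_satisfied(version_string, upper_bound=(1, 24)):
    """True if the version is strictly below the bound used in `numpy<1.24`."""
    return parse_major_minor(version_string) < upper_bound


try:
    installed = version("numpy")
    status = "[OK]" if numpy_pin_satisfied(installed) else "[!]"
    print(f"{status} NumPy {installed} (expected < 1.24)")
except PackageNotFoundError:
    print("[X] NumPy is not installed in this kernel's environment")
```

If the check prints `[!]`, the kernel is probably still running against a different environment than the one `pip` installed into.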

## 1. Import Libraries

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import cv2
from PIL import Image
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("[OK] Libraries imported successfully!")

## 2. Check CUDA Availability

In [None]:
import tensorflow as tf

print(f"TensorFlow Version: {tf.__version__}")
print(f"Built with CUDA: {tf.test.is_built_with_cuda()}")  # build flag only; the next line shows whether a GPU is actually visible
print(f"GPU Available: {tf.config.list_physical_devices('GPU')}")

# Enable GPU memory growth
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        print(f"[OK] {len(gpus)} GPU(s) configured successfully!")
    except RuntimeError as e:
        print(e)

## 3. Setup Kaggle API (for dataset download)

**Instructions:**
1. Go to https://www.kaggle.com/
2. Click on your profile → Account → Create New API Token
3. Download `kaggle.json`
4. Place it at `~/.kaggle/kaggle.json` (on Windows: `C:\Users\<username>\.kaggle\kaggle.json`), which is the location the cell below checks

In [None]:
# Check if Kaggle API is configured
kaggle_dir = Path.home() / '.kaggle'
kaggle_json = kaggle_dir / 'kaggle.json'

if kaggle_json.exists():
    print("[OK] Kaggle API configured!")
else:
    print("[X] Kaggle API not configured. Please follow the instructions above.")
    print(f"Expected location: {kaggle_json}")
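
The check above only confirms that the file exists. A slightly stricter sketch is shown below; it assumes the token file is a JSON object with `username` and `key` fields, and the helper name `check_kaggle_credentials` is hypothetical.

```python
import json
from pathlib import Path


def check_kaggle_credentials(path):
    """Return a list of problems found with a kaggle.json file (empty if none)."""
    path = Path(path)
    if not path.is_file():
        return [f"file not found: {path}"]
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return ["file is not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    # Assumption: the token file stores these two fields.
    return [
        f"missing or empty field: {field}"
        for field in ("username", "key")
        if not data.get(field)
    ]


problems = check_kaggle_credentials(Path.home() / ".kaggle" / "kaggle.json")
print("[OK] Credentials look complete" if not problems else f"[X] {problems}")
```

The function never prints the token itself, so the cell output is safe to share.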

## 4. Download ASL Alphabet Dataset

In [None]:
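# Download dataset using Kaggle API
!kaggle datasets download -d grassknoted/asl-alphabet -p ../data/raw/ --unzip

# A failed shell command does not stop the cell, so confirm the data actually arrived
if Path('../data/raw/asl_alphabet_train').exists():
    print("[OK] Dataset downloaded successfully!")
else:
    print("[X] Dataset folder not found. Check the Kaggle output above for errors.")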

## 5. Explore Dataset Structure

In [None]:
# Define paths
data_dir = Path('../data/raw/asl_alphabet_train/asl_alphabet_train')

# Get class names
class_names = sorted([d.name for d in data_dir.iterdir() if d.is_dir()])
num_classes = len(class_names)

print(f"Number of classes: {num_classes}")
print(f"\nClass names: {class_names}")

## 6. Analyze Class Distribution

In [None]:
# Count images per class
class_counts = {}
for class_name in class_names:
    class_path = data_dir / class_name
    count = len(list(class_path.glob('*.jpg')))
    class_counts[class_name] = count

# Create DataFrame
df_counts = pd.DataFrame(list(class_counts.items()), columns=['Class', 'Count'])
df_counts = df_counts.sort_values('Count', ascending=False)

print(df_counts)
print(f"\nTotal images: {df_counts['Count'].sum():,}")
print(f"Average images per class: {df_counts['Count'].mean():.0f}")
print(f"Min images: {df_counts['Count'].min()}")
print(f"Max images: {df_counts['Count'].max()}")
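
To put a single number on how balanced the classes are, one option is the ratio of the largest class to the smallest. This is a minimal sketch; `imbalance_ratio` is a hypothetical helper and the counts below are toy values, not results from the dataset.

```python
def imbalance_ratio(counts):
    """Return max/min of the per-class counts; 1.0 means perfectly balanced."""
    values = list(counts.values())
    if not values or min(values) == 0:
        raise ValueError("every class needs at least one image")
    return max(values) / min(values)


# Toy counts for illustration only; in the notebook, pass `class_counts` instead.
toy_counts = {"A": 300, "B": 300, "C": 150}
print(f"Imbalance ratio: {imbalance_ratio(toy_counts):.2f}")
```

A ratio close to 1.0 suggests that class weighting or resampling is probably unnecessary.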

## 7. Visualize Class Distribution

In [None]:
# Plot class distribution
plt.figure(figsize=(16, 6))
plt.bar(df_counts['Class'], df_counts['Count'], color='steelblue', edgecolor='black')
plt.xlabel('Class', fontsize=12, fontweight='bold')
plt.ylabel('Number of Images', fontsize=12, fontweight='bold')
plt.title('ASL Alphabet Dataset - Class Distribution', fontsize=14, fontweight='bold')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

## 8. Visualize Sample Images

In [None]:
# Display one sample image from each class
fig, axes = plt.subplots(5, 6, figsize=(18, 15))
axes = axes.ravel()

# Hide every axis up front so unused grid cells (30 slots, 29 classes) stay blank
for ax in axes:
    ax.axis('off')

for idx, class_name in enumerate(class_names[:len(axes)]):
    class_path = data_dir / class_name
    # Take the first image in sorted order so the choice is reproducible
    img_path = sorted(class_path.glob('*.jpg'))[0]
    img = Image.open(img_path)

    axes[idx].imshow(img)
    axes[idx].set_title(f"Class: {class_name}", fontsize=10, fontweight='bold')

plt.suptitle('Sample Images from Each Class', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

## 9. Analyze Image Properties

In [None]:
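# Load the first image from each of the first five classes to check properties
sample_images = []
for class_name in class_names[:5]:
    class_path = data_dir / class_name
    img_path = sorted(class_path.glob('*.jpg'))[0]
    img = cv2.imread(str(img_path))  # OpenCV returns channels in BGR order
    if img is None:  # cv2.imread returns None instead of raising on failure
        raise IOError(f"Could not read {img_path}")
    sample_images.append(img)

# Check dimensions
print("Image dimensions (height, width, channels):")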
for i, img in enumerate(sample_images):
    print(f"  Class {class_names[i]}: {img.shape}")

# Check data type and range
print(f"\nData type: {sample_images[0].dtype}")
print(f"Pixel value range: [{sample_images[0].min()}, {sample_images[0].max()}]")
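
The cell above inspects only the first image of five classes. For a broader check, a random sample across all classes could be drawn first. The sketch below uses only the standard library; `sample_image_paths` is a hypothetical helper, and the default sample size and seed are arbitrary choices.

```python
import random
from pathlib import Path


def sample_image_paths(data_dir, per_class=5, seed=42):
    """Pick up to `per_class` random .jpg paths from each class folder."""
    rng = random.Random(seed)  # fixed seed so reruns pick the same files
    sampled = {}
    for class_dir in sorted(p for p in Path(data_dir).iterdir() if p.is_dir()):
        # Sort first: glob order is not guaranteed to be stable across systems
        paths = sorted(class_dir.glob("*.jpg"))
        sampled[class_dir.name] = rng.sample(paths, min(per_class, len(paths)))
    return sampled
```

In the notebook, `sample_image_paths(data_dir)` would return a dict mapping each class name to a list of paths, which can then be read with `cv2.imread` and compared by `shape`.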

## 10. Display Multiple Samples from One Class

In [None]:
# Show variations within a single class
selected_class = 'A'  # Change this to any class
class_path = data_dir / selected_class
image_paths = list(class_path.glob('*.jpg'))[:12]

fig, axes = plt.subplots(3, 4, figsize=(12, 9))
axes = axes.ravel()

# Hide all frames first so slots stay blank if the class has fewer than 12 images
for ax in axes:
    ax.axis('off')

for idx, img_path in enumerate(image_paths):
    img = Image.open(img_path)
    axes[idx].imshow(img)

plt.suptitle(f'Variations in Class "{selected_class}"', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## 11. Summary & Next Steps

### Key Findings:
- Dataset contains 29 classes (A-Z + space, delete, nothing)
- ~87,000 total images
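- Images are 200x200 pixels with 3 color channels (checked on one sample from each of the first five classes)
- Balanced class distribution (roughly 3,000 images per class, i.e. ~87,000 / 29)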

### Next Steps:
1. **Data Preprocessing** (Notebook 02)
   - Resize images to 224x224 for transfer learning
   - Normalize pixel values
   - Create train/val/test splits
   
2. **Data Augmentation** (Notebook 02)
   - Apply transformations to increase dataset size
   - Improve model generalization
   
3. **Model Training** (Notebook 03)
   - Build and train CNN model
   - Use transfer learning (MobileNetV2/EfficientNet)
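
As a preview of the train/val/test split listed above, the following is one possible sketch of a per-class (stratified) split in plain Python. It is not necessarily the approach taken in Notebook 02; the function name `stratified_split`, the default fractions, and the toy file names are all assumptions made for illustration.

```python
import random


def stratified_split(paths_by_class, val_frac=0.15, test_frac=0.15, seed=42):
    """Split each class separately so every split keeps the class proportions."""
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for label in sorted(paths_by_class):
        items = list(paths_by_class[label])
        rng.shuffle(items)
        n_val = int(round(len(items) * val_frac))
        n_test = int(round(len(items) * test_frac))
        splits["val"] += [(p, label) for p in items[:n_val]]
        splits["test"] += [(p, label) for p in items[n_val:n_val + n_test]]
        splits["train"] += [(p, label) for p in items[n_val + n_test:]]
    return splits


# Toy file names for illustration only
toy = {"A": [f"A{i}.jpg" for i in range(10)], "B": [f"B{i}.jpg" for i in range(10)]}
result = stratified_split(toy, val_frac=0.2, test_frac=0.2)
print({name: len(items) for name, items in result.items()})
```

Splitting per class rather than over the whole pool keeps each split balanced, which matches the balanced distribution found above.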

In [None]:
print("[OK] Data exploration complete!")
print("[>>] Ready to proceed to data preprocessing.")