# Urdu Writing Dataset Compiler

This notebook extracts and compiles the Urdu handwriting dataset collected through our Android/Flutter application. The dataset contains handwritten samples of single letters, 2-letter words, and 3-letter words.

### Overview
- **Source**: Firebase Firestore database
- **Data Type**: Handwritten Urdu text samples
- **Output**: Individual JSON files for each word, packaged as a ZIP file
- **Word Types**: Single letters, 2-letter words, and 3-letter words

### Prerequisites
- Firebase service account key (JSON file)
  - A read-only key file **`urdu-dataset-access-key.json`** is available in this repository for public access
- Access to the Firestore database containing the dataset

### Python Requirements
For **Google Colab** (recommended):
- `firebase-admin` (install via: `!pip install firebase-admin`)

For **local environments**:
```bash
pip install firebase-admin
```

**Optional for local development:**
```bash
pip install jupyter notebook  # If using Jupyter locally
```

### Environment Compatibility
- **Designed for**: Google Colab environment
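- **Local Jupyter or other environments**: the `google.colab` module is not available outside Colab, so a few changes are needed:
  - Replace `files.upload()` and `files.download()` with ordinary file paths (read the key from disk and leave `dataset.zip` in the working directory)
  - Install required packages: `pip install firebase-admin`
  - Replace the `!zip` shell call if the `zip` command is not installed on your system
  - Adjust file paths as needed for your local setup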
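
Outside Colab, the upload step can be swapped for a plain file path. The sketch below is one possible way to do that and is not part of the original notebook: the helper `resolve_key_path` and the environment variable name `URDU_DATASET_KEY` are hypothetical names chosen for illustration.

```python
import os
from pathlib import Path

# Key file name taken from the Prerequisites section.
DEFAULT_KEY_NAME = "urdu-dataset-access-key.json"

def resolve_key_path(env_var="URDU_DATASET_KEY", default=DEFAULT_KEY_NAME):
    """Return the key path from an environment variable, else the default name.

    `env_var` is a hypothetical variable name used only in this sketch.
    """
    return Path(os.environ.get(env_var, default)).expanduser()

key_path = resolve_key_path()
print(f"Credentials would be loaded from: {key_path}")

# In the notebook, this would stand in for the files.upload() cell:
#     cred = credentials.Certificate(str(key_path))
```

With no environment variable set, the helper falls back to the key file name in the working directory, which matches what the Colab upload produces.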

### ⚠️ Technical Note: Urdu Text Handling
Since this dataset contains **Urdu text**, you may encounter encoding or display issues in:
- **Python console output** (depending on terminal/OS support for RTL scripts)
- **JSON viewers/IDEs** (may not render Urdu characters properly)
- **File systems** (filename encoding issues on some OS)

**Recommendations for developers:**
- Use UTF-8 compatible editors and terminals
- Ensure your system has Urdu font support installed
- Display problems do not affect the data: the JSON files are written as UTF-8 and their contents stay intact even when a viewer renders them incorrectly
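
As a small illustration of that last point, the sketch below serializes a record both with and without `ensure_ascii=False` and checks that the two forms decode to the same object. The record itself is a made-up example, not the real sample format.

```python
import json

# Hypothetical record for illustration; real samples have their own structure.
sample = {"word": "عطر", "samples": 3}

escaped = json.dumps(sample)                       # non-ASCII written as \uXXXX escapes
readable = json.dumps(sample, ensure_ascii=False)  # Urdu characters written as-is

# Both encodings decode back to the same Python object.
same = json.loads(escaped) == json.loads(readable) == sample
print(same)      # True
print(escaped)   # ASCII-only text with \u escapes
```

The export step in this notebook uses `ensure_ascii=False`, so the files hold readable Urdu text. A tool that shows escapes or boxes instead is a display limitation rather than data loss.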

### 1. Firebase Authentication Setup

Upload your Firebase service account key file and initialize the Firebase Admin SDK.

In [None]:
# Import required libraries
import firebase_admin
from firebase_admin import credentials
from google.colab import files

# Upload the Firebase service account key file
print("Please upload your Firebase service account key (urdu-dataset-access-key.json):")
uploaded = files.upload()
uploaded_file = next(iter(uploaded))

# Initialize Firebase Admin SDK
cred = credentials.Certificate(uploaded_file)
app = firebase_admin.initialize_app(cred)
print("✅ Firebase initialized successfully!")

### 2. Database Connection

Connect to the Firestore database and access the dataset collection.

In [None]:
# Connect to Firestore database
from firebase_admin import firestore
db = firestore.client(app=app)

# Access the dataset collection
collection_ref = db.collection("dataset")
print("✅ Connected to Firestore database")

### 3. Data Retrieval

Fetch all dataset documents from Firestore with version 2 (latest version).

In [None]:
# Query documents with version 2 (latest dataset version)
from firebase_admin.firestore import FieldFilter

docs = collection_ref.where(filter=FieldFilter("version", '==', 2)).stream()
all_documents = [doc.to_dict() for doc in docs]

print(f"📊 Retrieved {len(all_documents)} documents from the dataset")

### 4. Dataset Analysis

Analyze the collected data to understand the dataset composition and contributor statistics.

In [None]:
# Analyze unique contributors
user_ids = set()
for doc in all_documents:
    if 'userID' in doc:
        user_ids.add(doc['userID'])

print(f"👥 Number of unique contributors: {len(user_ids)}")

In [None]:
# Analyze contribution sessions (first 4 characters of user ID represent session)
from collections import Counter

first_four_chars = [user_id[:4] for user_id in user_ids]
session_counts = Counter(first_four_chars)

print("📈 Contributors per session:")
for session, count in sorted(session_counts.items()):
    print(f"  Session {session}: {count} contributors")

In [None]:
# Display sample document structure
print(f"📋 Sample document structure:")
print(f"Total documents: {len(all_documents)}")
if all_documents:
    sample_doc = all_documents[0]
    print(f"Sample document keys: {list(sample_doc.keys())}")
    if 'data' in sample_doc:
        print(f"Words in sample document: {list(sample_doc['data'].keys())[:5]}...")  # Show first 5 words

### 5. Data Processing and Grouping

Process the raw data and group handwriting samples by word for easier analysis and usage.

In [None]:
# Group handwriting samples by word
from collections import defaultdict

grouped_data = defaultdict(list)
total_samples = 0

# Process each document and extract handwriting data
for doc in all_documents:
    if 'data' in doc:
        data = doc['data']
        for word in data:
            # Each word contains handwriting sample data
            grouped_data[word].append(data[word])
            total_samples += 1

print(f"🔤 Processed {total_samples} handwriting samples")
print(f"📝 Number of unique words: {len(grouped_data)}")
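
To make the resulting structure concrete, here is the same grouping applied to a few toy documents. Only the `userID`, `version`, and `data` keys come from this notebook. The user IDs and the `strokes-*` payloads are placeholders, since the real sample format is not shown here, and `group_by_word` is a hypothetical helper name.

```python
from collections import defaultdict

# Toy documents: placeholder IDs and payloads, not real dataset content.
toy_documents = [
    {"userID": "S001-a", "version": 2, "data": {"اب": "strokes-1", "عطر": "strokes-2"}},
    {"userID": "S001-b", "version": 2, "data": {"اب": "strokes-3"}},
    {"userID": "S002-c", "version": 2},  # no 'data' key, so it contributes nothing
]

def group_by_word(documents):
    """Collect every sample under its word, across all documents."""
    grouped = defaultdict(list)
    for doc in documents:
        for word, sample in doc.get("data", {}).items():
            grouped[word].append(sample)
    return dict(grouped)

toy_grouped = group_by_word(toy_documents)
print(len(toy_grouped))                                   # 2
print(sorted(len(s) for s in toy_grouped.values()))       # [1, 2]
```

Each key is a word and each value is a list with one entry per contributor who wrote that word, which is the list later written to `{word}.json`.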

In [None]:
# Display words categorized by length
mono_letters = []
two_letters = []
three_letters = []

# Note: len() counts Unicode code points, so a word written with diacritics
# is longer than its letter count and may fall outside these three buckets.
for word in grouped_data:
    if len(word) == 1:
        mono_letters.append(word)
    elif len(word) == 2:
        two_letters.append(word)
    elif len(word) == 3:
        three_letters.append(word)

print("📊 Dataset composition:")
print(f"  • Single letters: {len(mono_letters)}")
print(f"  • Two-letter words: {len(two_letters)}")
print(f"  • Three-letter words: {len(three_letters)}")

print(f"\n🔤 Sample words:")
print(f"  • Single letters: {mono_letters[:10]}")
print(f"  • Two-letter: {two_letters[:10]}")
print(f"  • Three-letter: {three_letters[:10]}")
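
The categorization above relies on `len(word)`, which counts Unicode code points rather than visible letters. Whether the dataset words carry diacritics is not stated here, so the following is only a sketch of how the two counts can differ. `base_letter_count` is a hypothetical helper, not part of the notebook.

```python
import unicodedata

def base_letter_count(word):
    """Count code points that are not combining marks (categories Mn, Mc, Me)."""
    return sum(1 for ch in word if not unicodedata.category(ch).startswith("M"))

plain = "\u0639\u0637\u0631"    # the word عطر: three letters, no diacritics
marked = "\u0627\u064E\u0628"   # alef + fatha (a combining mark) + beh

print(len(plain), base_letter_count(plain))     # 3 3
print(len(marked), base_letter_count(marked))   # 3 2
```

If the totals for the three categories do not add up to the number of unique words, counting base letters this way is one thing to try.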

In [None]:
# Display sample count for a specific word (example: عطر)
example_word = 'عطر'
if example_word not in grouped_data and grouped_data:
    # Fall back to the first available word
    example_word = next(iter(grouped_data))

if example_word in grouped_data:
    print(f"✏️ Number of handwriting samples for '{example_word}': {len(grouped_data[example_word])}")
else:
    print("⚠️ No words found in the dataset")

### 6. Export to JSON Files

Create one JSON file per word containing all of its handwriting samples, then package the files into a flat ZIP archive (JSON files at the top level, not in a subdirectory) for download.

In [None]:
# Create JSON files for each word
import json
import os

print("📁 Creating JSON files in current directory...")

# Export each word's data to a separate JSON file
exported_count = 0
for word in grouped_data:
    try:
        filename = f'{word}.json'
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(grouped_data[word], f, indent=2, ensure_ascii=False)
        exported_count += 1
        
        # Show progress for every 10 files
        if exported_count % 10 == 0:
            print(f"  ✅ Exported {exported_count}/{len(grouped_data)} files...")
            
    except Exception as e:
        print(f"  ❌ Error exporting '{word}': {str(e)}")

print(f"🎉 Successfully exported {exported_count} JSON files")

In [None]:
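# Create ZIP archive of the word JSON files
# The uploaded service account key is also a .json file in this directory,
# so it is excluded explicitly to keep it out of the archive.
print("📦 Creating ZIP archive...")
!zip -r dataset.zip *.json -x "{uploaded_file}"
print("✅ ZIP archive 'dataset.zip' created successfully")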
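
Where the `zip` command is not available (for example on some local setups), the standard-library `zipfile` module can build an equivalent flat archive. This is a sketch under that assumption rather than the notebook's own method. `zip_json_files` is a hypothetical helper, and its `exclude` argument is there to keep other `.json` files, such as the service account key, out of the archive.

```python
import zipfile
from pathlib import Path

def zip_json_files(directory=".", archive_name="dataset.zip", exclude=()):
    """Write every *.json file in `directory` to a flat ZIP, skipping `exclude` names."""
    directory = Path(directory)
    skipped = set(exclude)
    added = []
    with zipfile.ZipFile(directory / archive_name, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(directory.glob("*.json")):
            if path.name in skipped:
                continue  # e.g. the service account key file
            # arcname=path.name stores the file at the top level of the archive
            zf.write(path, arcname=path.name)
            added.append(path.name)
    return added

# Possible use in this notebook (uploaded_file is the key file name from step 1):
#     zip_json_files(exclude={uploaded_file})
```

The function returns the names it archived, which makes it easy to confirm that the count matches the number of exported words.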

In [None]:
# Download the ZIP file
print("⬇️ Initiating download...")
files.download('dataset.zip')
print("🎊 Dataset compilation complete! Your ZIP file has been downloaded.")

### Summary

This notebook compiles the Urdu handwriting dataset from Firestore into individual JSON files. Each JSON file contains all handwriting samples for a specific word, making it easy to:

- **Train machine learning models** for Urdu handwriting recognition
- **Analyze handwriting patterns** across different contributors
- **Study character formation** in Urdu script
- **Research linguistic patterns** in single letters and in 2-letter and 3-letter words

#### Output Structure
- **File format**: One JSON file per word
- **Naming convention**: `{word}.json`
- **Content**: Array of handwriting sample data for each word
- **Encoding**: UTF-8 to properly handle Urdu text
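
Given that layout, reading the archive back into memory could look like the sketch below. `load_dataset` is a hypothetical helper. It assumes a flat ZIP of `{word}.json` files, each holding a JSON array, as described above.

```python
import json
import zipfile

def load_dataset(zip_source):
    """Return {word: samples} from a flat ZIP of `{word}.json` files.

    `zip_source` may be a path or a binary file-like object.
    """
    dataset = {}
    with zipfile.ZipFile(zip_source) as zf:
        for name in zf.namelist():
            if not name.endswith(".json"):
                continue
            word = name[: -len(".json")]  # strip the extension to recover the word
            dataset[word] = json.loads(zf.read(name).decode("utf-8"))
    return dataset

# Possible use after downloading the archive:
#     dataset = load_dataset("dataset.zip")
#     print(len(dataset), "words loaded")
```

The file contents are always decoded as UTF-8 here. The word keys, however, come from the file names stored in the archive, so if they look garbled the archiving tool's filename encoding is the first thing to check.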

#### Dataset Information
- **Source**: Flutter mobile application for Urdu handwriting collection
- **Contributors**: Multiple users across different data collection sessions
- **Word types**: Single letters, 2-letter words, and 3-letter words
- **Data version**: Version 2 (latest)

The compiled dataset is now ready for research and development purposes!