# Glimmer

An AI-Powered Visual Assistant for Visually Impaired Users
Features • Installation • Usage • Architecture • Development
Glimmer is a native iOS accessibility application prototype designed specifically for visually impaired users. It combines real-time computer vision with voice interaction to provide an intelligent visual assistant that runs entirely on-device using Apple's MLX framework and Qwen 3.5 Vision Language Model.
- Real-time Visual Description: Continuous scene understanding using on-device AI
- Voice-First Interaction: Press-and-hold voice input with automatic TTS management
- On-Device Processing: Privacy-focused local inference using MLX
- Adaptive Captioning: Smart deduplication and throttling to prevent audio queue overflow
- Bilingual Support: Native Chinese language support with English capabilities
- Extensible Architecture: Ready for cloud-backend integration when needed
- Live Camera Feed: Upper screen displays real-time camera preview
- AI-Powered Descriptions: Near real-time scene analysis using Qwen 3.5 VLM (0.8B quantized)
- Smart Captioning: Automatic filtering of redundant descriptions
- Visual Context Awareness: Maintains short-term visual memory for Q&A
- Priority-Based Audio: Voice input automatically pauses visual descriptions
- System TTS Integration: Native iOS text-to-speech with on-screen captions
- Multimodal Q&A: Combines user questions with visual context for intelligent responses
- Hands-Free Operation: Large touch-and-hold button for easy voice activation
- 100% On-Device: No data leaves your iPhone
- Optimized for iPhone: Tested on iPhone 17 Pro with iOS 26
- Battery Efficient: Adaptive throttling balances performance and power consumption
- Offline Capable: Works without internet connection after initial model download
- Hardware: iPhone 15 or later (iPhone 17 Pro recommended)
- OS: iOS 17.0+
- Tools: Xcode 15.0+, XcodeGen
1. **Clone the repository**

   ```bash
   git clone https://github.com/yourusername/glimmer.git
   cd glimmer
   ```

2. **Generate Xcode project**

   ```bash
   xcodegen generate
   ```

3. **Open in Xcode**

   ```bash
   open Glimmer.xcodeproj
   ```

4. **Configure signing**
   - Select your development team in project settings
   - Ensure proper code signing certificates

5. **Build and run**
   - Connect your iPhone
   - Select the device as the run destination
   - Press `Cmd+R` to build and run
On first run, the app will:
- Request camera and microphone permissions
- Download the Qwen 3.5 model (~500MB) - requires stable internet
- Cache the model locally for offline use
- Launch the app - Grant camera and microphone permissions when prompted
- Point camera at scene - AI will automatically start describing what it sees
- Listen to descriptions - Captions appear on screen while being spoken
- Ask questions:
- Press and hold the large button (bottom half of screen)
- Speak your question in Chinese or English
- Release to get AI-powered answer based on current visual context
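The press-and-hold interaction above can be sketched in SwiftUI. This is illustrative only: the `onStart`/`onRelease` callbacks stand in for whatever the app's real speech service exposes.

```swift
import SwiftUI

// Sketch of a press-and-hold voice button (not the app's actual view).
// A zero-distance DragGesture fires on touch-down and again on release,
// which is the standard SwiftUI pattern for hold-to-talk buttons.
struct VoiceButton: View {
    @State private var isRecording = false
    let onStart: () -> Void   // begin speech recognition, pause visual TTS
    let onRelease: () -> Void // stop listening, query the engine, speak answer

    var body: some View {
        Circle()
            .fill(isRecording ? Color.red : Color.blue)
            .frame(maxWidth: .infinity, maxHeight: .infinity)
            .gesture(
                DragGesture(minimumDistance: 0)
                    .onChanged { _ in
                        if !isRecording {
                            isRecording = true
                            onStart()
                        }
                    }
                    .onEnded { _ in
                        isRecording = false
                        onRelease()
                    }
            )
            .accessibilityLabel("Hold to ask a question")
    }
}
```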
- Stable Internet: Ensure good connectivity for first-time model download
- Portrait Mode: Keep phone vertical for optimal camera framing
- Lighting: Works best in well-lit environments
- Distance: Hold phone at comfortable viewing distance (30-100cm from objects)
If you experience lag:
- Switch to 5-bit model for faster inference
- Increase `captionProcessingInterval` to 2.2s in settings
For better quality:
- Use 8-bit model for improved accuracy
- Consider backend integration for complex queries
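As a sketch, the tuning knobs above might live in a single configuration value. Only `captionProcessingInterval` is named in this README; the type and the quantization property are illustrative assumptions.

```swift
import Foundation

// Hypothetical tuning configuration — only `captionProcessingInterval`
// is documented; the rest illustrates the trade-off described above.
struct CaptioningTuning {
    /// Minimum seconds between caption generations (raise to ~2.2 s on lag).
    var captionProcessingInterval: TimeInterval = 1.5
    /// Model quantization: 5-bit is faster, 8-bit is more accurate.
    var modelQuantizationBits: Int = 8
}

// Lower-latency preset for devices that lag.
let lowLatency = CaptioningTuning(captionProcessingInterval: 2.2,
                                  modelQuantizationBits: 5)
```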
```
Glimmer/
├── App/                     # iOS application layer
│   ├── Views/               # SwiftUI views
│   ├── ViewModels/          # Application state management
│   └── Services/            # Camera, Speech, TTS services
├── Sources/GlimmerCore/     # Core reusable framework
│   ├── Configuration/       # Model & prompt configs
│   ├── Domain/              # Protocol definitions
│   ├── Inference/           # MLX engine & backends
│   └── Speech/              # Caption & TTS policies
├── Tests/GlimmerCoreTests/  # Unit tests
└── docs/                    # Documentation
```
Visual description loop:

```
CameraService → AppViewModel → FrameSnapshotWriter → LocalMLXAssistantEngine
                                                              ↓
                                                     Qwen 3.5 VLM (MLX)
                                                              ↓
                                                CaptionSpeechPolicy → TTS
```

Voice Q&A flow:

```
User Press → Speech Recognition → Question + Visual Summary + Frame
                                              ↓
                                   LocalMLXAssistantEngine
                                              ↓
                             Answer → TTS → Resume Visual Loop
```
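The `CaptionSpeechPolicy` stage deduplicates and throttles captions before they reach TTS, preventing the audio queue overflow mentioned under Features. A minimal sketch of that idea (the real policy's API is not shown here):

```swift
import Foundation

// Minimal dedup + throttle gate for captions. The actual
// CaptionSpeechPolicy may differ; this only illustrates the idea.
struct CaptionGate {
    private var lastCaption = ""
    private var lastSpokenAt = Date.distantPast
    let minimumInterval: TimeInterval  // e.g. 1.5–2.2 s

    /// Returns true only if the caption is new AND enough time has passed.
    mutating func shouldSpeak(_ caption: String, at now: Date = Date()) -> Bool {
        let trimmed = caption.trimmingCharacters(in: .whitespacesAndNewlines)
        guard trimmed != lastCaption,                                 // drop duplicates
              now.timeIntervalSince(lastSpokenAt) >= minimumInterval  // throttle
        else { return false }
        lastCaption = trimmed
        lastSpokenAt = now
        return true
    }
}
```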
Current: Qwen3.5-0.8B
Why this model?
- True vision-language model (image-text-to-text)
- Optimized for MLX framework
- 0.8B parameters balance quality and speed on iPhone
Change model in: `Sources/GlimmerCore/Configuration/AssistantModelConfiguration.swift`
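Swapping models could look roughly like the following; the type shape and property names are assumptions about that file, not its actual contents.

```swift
// Illustrative only — check AssistantModelConfiguration.swift for the real shape.
struct AssistantModelConfiguration {
    var modelID: String        // Hugging Face-style repo identifier
    var quantizationBits: Int  // 5 for speed, 8 for accuracy
}

// Default per this README; the identifier string is a placeholder.
let current = AssistantModelConfiguration(modelID: "Qwen3.5-0.8B",
                                          quantizationBits: 8)
```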
```bash
# Core logic tests
swift test

# Full build verification
xcodebuild \
  -project Glimmer.xcodeproj \
  -scheme Glimmer \
  -destination 'generic/platform=iOS' \
  build
```

Build and publish directly from your local machine:
```bash
# 1) Build unsigned archive
mkdir -p build
xcodebuild -project Glimmer.xcodeproj \
  -scheme Glimmer \
  -configuration Release \
  -destination 'generic/platform=iOS' \
  -archivePath build/Glimmer.xcarchive \
  archive \
  CODE_SIGNING_ALLOWED=NO \
  CODE_SIGNING_REQUIRED=NO \
  CODE_SIGN_IDENTITY=""

# 2) Package IPA
APP_PATH="build/Glimmer.xcarchive/Products/Applications/Glimmer.app"
mkdir -p build/Payload
cp -R "$APP_PATH" build/Payload/
(cd build && /usr/bin/zip -qry Glimmer-unsigned.ipa Payload)

# 3) Create release from local artifact (replace vX.Y.Z)
gh release create vX.Y.Z build/Glimmer-unsigned.ipa \
  --title "vX.Y.Z" \
  --generate-notes
```

Requirements:
- GitHub CLI installed and authenticated (`gh auth login`)
- Tag `vX.Y.Z` should match your release version
The architecture supports seamless backend switching:
- Protocol: `AssistantEngine` (in `GlimmerCore/Domain/`)
- Local Implementation: `LocalMLXAssistantEngine`
- Remote Stub: `RemoteAssistantEngine` (ready for implementation)

To integrate a cloud backend:

```swift
// In AppViewModel initialization
let engine = RemoteAssistantEngine(apiEndpoint: "https://your-api.com")
```

No changes needed in UI, camera, or speech layers.
We welcome contributions! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Audio Session Conflicts: Simultaneous TTS and speech recognition may interfere on some iOS versions
- Metal Performance: Inference speed varies by device thermal state
- Model Download: First launch requires ~500MB download over stable network
- Language Support: Optimized for Chinese; English support is experimental
- Camera Orientation: Currently designed for portrait mode only
Before release, verify on physical iPhone:
- Camera permission granted and live preview working
- Microphone permission granted
- Model downloaded successfully
- Visual descriptions play through TTS
- Captions display correctly on screen
- Voice input activates and transcribes accurately
- Q&A responses are contextually relevant
- No audio conflicts between TTS and speech recognition
- App remains responsive under continuous use (10+ minutes)
- Memory usage stable without leaks
Full checklist: docs/DEVICE_TEST_CHECKLIST.md
- Multi-language support (English, Spanish, etc.)
- Landscape mode optimization
- Object detection with haptic feedback
- Customizable voice profiles
- Cloud sync for conversation history
- watchOS companion app
- Accessibility shortcuts integration
This project is licensed under the MIT License - see the LICENSE file for details.
- MLX: Apple's ML framework for efficient on-device inference
- Qwen Team: For open-sourcing the Qwen 3.5 vision-language models
- Hugging Face: Model hosting and community support
- Accessibility Community: For invaluable feedback and testing
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Made with ❤️ for the visually impaired community