AdaSeg4MR is a multi-modal AI assistant named after Ada Lovelace. It combines real-time object segmentation, voice interaction, and visual analysis: a YOLO-based segmentation model, LLM-driven natural language processing, and computer vision work together to answer user queries about visual content.
- YOLO11M Integration: Uses YOLO11M segmentation model for precise object detection and instance segmentation
- Multi-Class Support: Detects 80 COCO classes with customizable target selection
- Real-Time Processing: Continuous frame analysis with live overlay visualization
- Category Label Mapping: Intelligent translation of natural language to YOLO classes
- Voice Recognition: Speech-to-text using Whisper model
- Text-to-Speech: OpenAI TTS with natural voice synthesis
- Keyboard Input: Alternative text-based interaction mode
- Visual Analysis: Screenshot and webcam capture capabilities
- Intent Detection: AI-powered understanding of user intentions
- Context-Aware Responses: Maintains conversation history and visual context
- Multi-Language Support: Handles various natural language expressions
- Function Calling: Automatic selection of appropriate system functions
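The Category Label Mapping feature listed above can be pictured as a small lookup from generic words to COCO class names. The sketch below is illustrative only: `CATEGORY_LABELS`, `KNOWN_CLASSES`, and `resolve_targets` are hypothetical names, and the real mappings live in category_labels.json, whose schema may differ. The fruit entry follows the "find fruit" example in this README; the vehicle entry is an assumed subset of the COCO classes.

```python
# Hypothetical mapping from generic category words to COCO class names.
# The real mappings live in category_labels.json; this schema is an assumption.
CATEGORY_LABELS = {
    "fruit": ["apple", "orange", "banana"],
    "vehicles": ["bicycle", "car", "motorcycle", "bus", "truck"],
}

# A small subset of COCO class names, used to recognise direct mentions.
KNOWN_CLASSES = {
    "person", "car", "dog", "laptop", "apple", "orange", "banana",
    "bicycle", "motorcycle", "bus", "truck",
}


def resolve_targets(query):
    """Return the sorted COCO class names mentioned in a query."""
    targets = set()
    for word in query.lower().replace("?", " ").replace(",", " ").split():
        if word in CATEGORY_LABELS:
            targets.update(CATEGORY_LABELS[word])
        elif word in KNOWN_CLASSES:
            targets.add(word)
        elif word.endswith("s") and word[:-1] in KNOWN_CLASSES:
            targets.add(word[:-1])  # naive plural handling: "cars" -> "car"
    return sorted(targets)


print(resolve_targets("find fruit"))                   # ['apple', 'banana', 'orange']
print(resolve_targets("find the person and the car"))  # ['car', 'person']
```

A plain dictionary keeps the mapping easy to edit by hand; irregular plurals such as "people" would need explicit entries or an LLM-based step.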
from ultralytics import YOLO
segmentation_model = YOLO("yolo11m-seg.pt")

- Model: YOLO11M segmentation model
- Classes: 80 COCO object categories
- Confidence Threshold: 0.15 (configurable)
- Output: Bounding boxes, masks, confidence scores
- Groq API: Primary LLM for intent detection and response generation
- OpenAI API: Secondary LLM for complex reasoning tasks
- Google Gemini: Vision analysis and image understanding
- Whisper: Speech recognition and transcription
- Frame Buffer: Real-time video frame storage
- Overlay System: Segmentation result visualization
- Image Mode: Static image analysis capability
- Recording System: Video and audio capture functionality
- OS: Windows 10/11, Linux, macOS
- Python: 3.8 or higher
- RAM: Minimum 8GB, recommended 16GB+
- GPU: Optional but recommended for faster processing
- Storage: 2GB+ free space
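Two of the requirements above, the Python version and the free disk space, can be checked before launch. The following is a minimal sketch using only the standard library; `check_requirements` is a hypothetical helper, not part of the project, and its defaults simply mirror the numbers listed above.

```python
import shutil
import sys


def check_requirements(min_python=(3, 8), min_free_gb=2.0, path="."):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info[:2] < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    # Free space on the drive that holds `path`, converted from bytes to GiB.
    free_gb = shutil.disk_usage(path).free / 1024**3
    if free_gb < min_free_gb:
        problems.append(f"{min_free_gb} GB free space required, found {free_gb:.1f} GB")
    return problems


for problem in check_requirements():
    print("WARNING:", problem)
```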
pip install -r requirements.txt

Required packages:
- ultralytics: YOLO model framework
- opencv-python: Computer vision operations
- groq: LLM API client
- openai: OpenAI API client
- google-generativeai: Gemini API client
- faster-whisper: Speech recognition
- speech_recognition: Audio processing (PyPI package: SpeechRecognition)
- pyaudio: Audio I/O
- pyperclip: Clipboard operations
- numpy: Numerical computations
- PIL: Image processing (PyPI package: Pillow)
- pynput: Keyboard monitoring
Create a config.json file:
{
"groq_api_key": "your_groq_api_key",
"google_api_key": "your_google_api_key",
"openai_api_key": "your_openai_api_key"
}
# Download YOLO11M model (automatic on first run)
# Model will be cached locally

AdaSeg4MR/
├── ada.py # Main camera-based assistant
├── ada_img.py # Image analysis mode
├── lovelace_img.py # Evaluation framework
├── category_labels.json # Generic label mappings
├── config.json # API configuration
├── test_images/ # Test image directory
├── ../images/ # COCO dataset (for evaluation)
└── ../test_results/ # Evaluation results
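The config.json file in the tree above holds the three API keys. A minimal sketch of loading it and failing early on missing or placeholder values might look like the following; `load_config` and `REQUIRED_KEYS` are hypothetical names, and the project's own loading code may work differently.

```python
import json
from pathlib import Path

# Key names taken from the config.json example in this README.
REQUIRED_KEYS = ("groq_api_key", "google_api_key", "openai_api_key")


def load_config(path="config.json"):
    """Load API keys from a JSON file and fail early if any are missing."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    # Treat empty values and the "your_..." placeholders as missing.
    missing = [
        key for key in REQUIRED_KEYS
        if not config.get(key) or str(config[key]).startswith("your_")
    ]
    if missing:
        raise ValueError(f"{path} is missing values for: {', '.join(missing)}")
    return config
```

Calling `load_config()` once at startup surfaces a missing key immediately instead of at the first API request.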
python ada.py

- Activation: Press spacebar to start/stop recording
- Commands: Natural language queries about visual content
Command: find people and cars
Command: how many dogs are there?
Command: where is the laptop?
Command: describe what you see
Command: take screenshot
Command: quit

# Find specific objects
"find the person and the car"
# Find by category
"find fruit" # Detects apple, orange, banana
"find vehicles" # Detects all transportation types
"find animals" # Detects all animal classes

# Count specific objects
"how many people are there?"
"count the cars and trucks"

# Get object locations
"where is the laptop?"
"where are the people?"

# General description
"describe what you see"
"analyze the scene"

# Focused analysis
"What is the color of the hat?"
"Where is the dog looking?"

Command: image mode # Switch to image analysis
Command: camera mode # Switch to camera mode
Command: next # Next image
Command: previous # Previous image

- Batch Processing: Analyze multiple images sequentially
- Persistent Results: Segmentation results remain visible
- Navigation: Easy image browsing with keyboard commands
- Export: Save analysis results and visualizations
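The `next` and `previous` commands above imply some notion of a current position in a list of images. The sketch below shows one way to model that; `ImageBrowser` is a hypothetical class, and the wrap-around at both ends and the set of file suffixes are assumptions rather than documented behavior.

```python
from pathlib import Path

# Assumed set of image file suffixes; adjust to match the actual test images.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


class ImageBrowser:
    """Cycle through image paths with wrap-around at both ends."""

    def __init__(self, paths):
        self.paths = sorted(paths)
        if not self.paths:
            raise ValueError("no images to browse")
        self.index = 0

    @classmethod
    def from_directory(cls, directory="test_images"):
        """Collect image files from a directory such as test_images/."""
        files = [p for p in Path(directory).iterdir()
                 if p.suffix.lower() in IMAGE_SUFFIXES]
        return cls(files)

    @property
    def current(self):
        return self.paths[self.index]

    def handle(self, command):
        """Apply a 'next' or 'previous' command and return the current path."""
        if command == "next":
            self.index = (self.index + 1) % len(self.paths)
        elif command == "previous":
            self.index = (self.index - 1) % len(self.paths)
        return self.current


browser = ImageBrowser(["a.jpg", "b.jpg", "c.jpg"])
print(browser.handle("next"))      # b.jpg
print(browser.handle("previous"))  # a.jpg
print(browser.handle("previous"))  # c.jpg (wraps around)
```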
The evaluation framework measures system performance on the COCO dataset.
# Download COCO dataset
# Place in ../images/ directory
python lovelace_img.py

- Bounding Box IoU: Intersection over Union for detection accuracy
- Mask IoU: Segmentation mask accuracy
- Class Accuracy: Correct object classification rate
- Response Time: System latency measurements
- Count Accuracy: Object counting precision
- Dataset Preparation: Select random classes and images
- Question Generation: Create standardized test queries
- Automated Testing: Run all queries on selected images
- Result Analysis: Calculate comprehensive metrics
- Report Generation: Create detailed performance reports
- CPU Threading: Multi-core processing for audio and video
- GPU Acceleration: Optional CUDA support for YOLO model
- Memory Management: Efficient buffer management
- Frame Skipping: Adaptive processing rate
- Parallel Processing: Simultaneous audio and visual analysis
- Caching: Model and result caching
- Streaming: Real-time audio processing
- Async Operations: Non-blocking API calls
- Multi-Frame Averaging: Reduces detection noise
- Confidence Thresholding: Adaptive confidence levels
- Class-Specific Tuning: Optimized parameters per object type
- Context Awareness: Scene understanding for better detection
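To illustrate the Multi-Frame Averaging item above, the sketch below averages each class's confidence over a sliding window of recent frames, so that a detection flickering in for a single frame is suppressed. `DetectionSmoother` is a hypothetical class, and both the window size and the rule of counting a missing class as zero confidence are assumptions, not the project's actual method.

```python
from collections import deque


class DetectionSmoother:
    """Average per-class confidence over the last `window` frames."""

    def __init__(self, window=5, threshold=0.15):
        self.frames = deque(maxlen=window)  # oldest frame drops out automatically
        self.threshold = threshold

    def update(self, detections):
        """Add one frame of {class_name: confidence} and return the stable classes."""
        self.frames.append(detections)
        names = {name for frame in self.frames for name in frame}
        # A class absent from a frame contributes 0.0 to its average.
        averages = {
            name: sum(frame.get(name, 0.0) for frame in self.frames) / len(self.frames)
            for name in names
        }
        return {name: avg for name, avg in averages.items() if avg >= self.threshold}


smoother = DetectionSmoother(window=3, threshold=0.3)
smoother.update({"dog": 0.9})
smoother.update({"dog": 0.8, "cat": 0.2})  # "cat" flickers in for one frame
print(sorted(smoother.update({"dog": 0.85})))  # ['dog']
```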
class AdaImgTester:
    def __init__(self):
        # Initialize COCO dataset
        # Setup evaluation metrics
        # Create result directories
        pass

    def run_test(self, num_classes=10, num_images_per_class=5):
        # Automated test execution
        # Metric calculation
        # Report generation
        pass

- Object Detection: "find the [object1], [object2], and [object3]"
- Object Counting: "how many [object]s are there?"
- Position Query: "where is the [object]?"
- Description: "what is the [object] like?"
- Bounding Box IoU: Measures detection precision
- Mask IoU: Evaluates segmentation quality
- Class Accuracy: Classification correctness
- Detection Rate: Percentage of objects found
- Response Time: End-to-end processing latency
- Throughput: Images processed per second
- Memory Usage: System resource consumption
- CPU/GPU Utilization: Computational efficiency
- Query Success Rate: Percentage of successful queries
- Response Quality: Relevance and accuracy of answers
- Interaction Fluency: Natural conversation flow
- Error Recovery: System robustness
- by_image.csv: Per-image performance metrics
- by_classes.csv: Class-specific accuracy analysis
- total.csv: Overall system performance
- HTML Reports: Visual result presentation
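The Bounding Box IoU and Mask IoU metrics listed above both divide the area of overlap by the area of the union. The sketch below is a plain-Python illustration, not the evaluation framework's own code: `bbox_iou` and `mask_iou` are hypothetical helpers, boxes are assumed to be in (x1, y1, x2, y2) corner format (COCO annotations store x, y, width, height and would need converting), and masks are assumed to be equally sized nested lists of 0/1.

```python
def bbox_iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mask_iou(mask_a, mask_b):
    """IoU of two binary masks given as equally sized nested lists of 0/1."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return inter / union if union else 0.0


print(bbox_iou((0, 0, 2, 2), (1, 1, 3, 3)))          # 1/7, about 0.143
print(mask_iou([[1, 1], [0, 0]], [[1, 0], [1, 0]]))  # 1/3, about 0.333
```

For full-resolution masks, a vectorized version using numpy's logical AND and OR would be considerably faster than the nested loops shown here.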
# Example metric calculation
avg_bbox_iou = sum(bbox_ious) / len(bbox_ious)
avg_mask_iou = sum(mask_ious) / len(mask_ious)
class_accuracy = correct_predictions / total_predictions
response_time = end_time - start_time

# Check API keys in config.json
# Verify internet connection
# Test API endpoints individually

# Ensure sufficient disk space
# Check model file integrity
# Verify CUDA installation (if using GPU)

# Check microphone permissions
# Verify audio drivers
# Test with different audio devices

# Reduce image resolution
# Lower confidence thresholds
# Disable unnecessary features
# Check system resources

# Enable debug logging
import logging
logging.basicConfig(level=logging.DEBUG)
# Monitor system resources
# Check API response times
# Verify model predictions

# Camera settings
CAMERA_SOURCE = 0 # 0: built-in, 1: external, 2: DroidCam
# Voice interaction
use_voice_interaction = True # Enable/disable voice
# Recording
ENABLE_RECORDING = True # Video/audio recording
# Processing
continuous_segmentation = True # Real-time processing

# YOLO configuration
conf_threshold = 0.15 # Detection confidence
iou_threshold = 0.5 # Non-maximum suppression
classes = None # Target classes (None = all)
# Whisper settings
whisper_size = 'base' # Model size
device = 'cpu' # Processing device

- Multi-Language Support: International language support
- Advanced Segmentation: Instance-aware segmentation
- 3D Understanding: Depth perception and 3D analysis
- Learning Capabilities: Adaptive model improvement
- Cloud Integration: Remote processing and storage
- Model Optimization: Quantization and pruning
- Hardware Acceleration: Better GPU utilization
- Parallel Processing: Enhanced multi-threading
- Memory Optimization: Reduced memory footprint
# Clone repository
git clone [repository-url]
cd AdaSeg4MR
# Install development dependencies
pip install -r requirements-dev.txt
# Run tests
python -m pytest tests/
# Code formatting
black ada.py ada_img.py

- Modular Design: Separate concerns and responsibilities
- Documentation: Comprehensive docstrings and comments
- Error Handling: Robust exception management
- Testing: Automated test coverage
Elastic License 2.0
URL: https://www.elastic.co/licensing/elastic-license
- YOLO: Ultralytics for object detection models
- COCO: Microsoft for evaluation dataset
- OpenAI: Language models and TTS
- Groq: Fast inference API
- Google: Gemini vision models
- API Reference: Detailed function documentation
- Examples: Usage examples and tutorials
- Troubleshooting: Common issues and solutions
- Issues: GitHub issue tracker
- Discussions: Community forum
- Email: Direct support contact
This README provides comprehensive documentation for the AdaSeg4MR system. For specific implementation details, refer to the individual source files and their inline documentation.