
🎙️🤖 Voice-Controlled Vision-Guided Robotic Arm

Part of the Intuex Organization



📖 Overview

A real-time perception–action loop that allows you to control a robotic arm using natural voice commands and live computer vision. This system integrates speech-to-text, vision-language modeling (Gemini), and low-level serial control.


✨ Key Features

  • 🎙️ Push-to-Talk: Record commands by holding the SPACE bar
  • 🧠 Gemini Reasoning: Multi-step task planning using vision + logic
  • 👁️ Live Vision: Compatible with ESP32-CAM, Android IP Webcam, or CCTV
  • 🗣️ Voice Feedback: Real-time spoken updates on what the robot is thinking
  • 🔁 Stateful Loop: Remembers previous actions for complex, multi-stage tasks

🚀 Quick Start

1. Installation

git clone https://github.com/Ansh-droid-glitch/GPTArm.git
cd GPTArm
pip install -r requirements.txt

2. Configure Environment

Create a .env file in the root directory:

ASSEMBLYAI_API_KEY=your_key_here
GOOGLE_API=your_key_here

3. Launch the System

Run the specific script for your operating system:

| Platform | Execution Command |
| --- | --- |
| 🪟 Windows | .\run.bat |
| 🐧 Linux | sh run.sh |
| 🍎 macOS | sh run_mac.sh |

🛠️ System Architecture

Tip

Each iteration of the loop is one reasoning step. The robot "sees," "thinks," "speaks," and then "moves."

  1. Input: Voice (AssemblyAI) + Vision (Live Camera Frame)
  2. Analysis: The Gemini model (gemini-robotics-er-1.5-preview by default) analyzes the scene and command
  3. Planning: AI generates text-to-speech feedback and serial commands
  4. Execution: Arduino moves the servos based on MOVE or GRIP strings
┌─────────────┐     ┌──────────────┐     ┌─────────────┐     ┌──────────────┐
│   Voice     │────>│   Gemini     │────>│  TTS + UI   │────>│   Arduino    │
│  Command    │     │   Vision AI  │     │  Feedback   │     │   Control    │
└─────────────┘     └──────────────┘     └─────────────┘     └──────────────┘
       │                    │
       └────────────┬───────┘
                    │
              ┌─────▼──────┐
              │   Camera   │
              │   Stream   │
              └────────────┘

📂 Project Structure

GPTArm/
├── main.py                 # Main control script
├── arduino/
│   └── servo_control.ino   # Arduino firmware
├── images/
│   └── input.jpg           # Camera captures stored here
├── audio/
│   └── recording.wav       # Voice recordings stored here
├── .env                    # API keys (create this)
├── requirements.txt        # Python dependencies
├── run.bat                 # Windows launcher
├── run.sh                  # Linux launcher
└── run_mac.sh              # macOS launcher

📝 Code Overview

Main Components

The system consists of three main components:

1. Voice Input (record_audio())

  • Hold SPACE to record your voice command
  • Uses sounddevice for real-time audio capture
  • Saves to audio/recording.wav

2. Vision Processing

  • Fetches live image from camera URL
  • Sends image + voice command to Gemini AI
  • AI analyzes scene and plans robot movements

3. Robot Control

  • Parses AI response for MOVE and GRIP commands
  • Sends commands via serial to Arduino
  • Maintains command history for context
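main.py below sleeps a fixed second after each command, even though the Arduino sketch replies with an acknowledgement ("MOVED", "OPENED", "CLOSED"). A variant could wait for that reply instead; a minimal sketch, assuming any object with write()/readline() methods (a pyserial Serial qualifies, and so does a fake for hardware-free testing). The send_command helper and its ack list are illustrative, not part of main.py:

```python
def send_command(ser, cmd, acks=("MOVED", "OPENED", "CLOSED")):
    """Send one command line, then block for the Arduino's acknowledgement."""
    ser.write((cmd + "\n").encode())
    reply = ser.readline().decode().strip()  # empty string on timeout
    return reply in acks
```

With real hardware this would be called as send_command(serial.Serial("COM7", 9600, timeout=2), "GRIP OPEN").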

Main Script (main.py)

import os
import time
import numpy as np
import sounddevice as sd
import keyboard
from scipy.io.wavfile import write
import assemblyai as aai
import pyttsx3
from dotenv import load_dotenv
from google import genai
from google.genai import types
import serial
import requests

# Load environment variables
load_dotenv()
aai.settings.api_key = os.getenv("ASSEMBLYAI_API_KEY")
GOOGLE_API_KEY = os.getenv("GOOGLE_API")

# Initialize Gemini client
MODEL_ID = "gemini-robotics-er-1.5-preview"
client = genai.Client(api_key=GOOGLE_API_KEY)

# Camera and file paths
URL = "http://192.168.1.15:8080/shot.jpg"  # Update with your camera's snapshot URL
IMAGE_PATH = "images/input.jpg"
AUDIO_PATH = "audio/recording.wav"
SAMPLE_RATE = 44100
CHANNELS = 1

# Text-to-speech setup
engine = pyttsx3.init()
engine.setProperty("rate", 170)

def speak(text):
    """Convert text to speech"""
    engine.say(text)
    engine.runAndWait()

def record_audio():
    """Record audio while SPACE is held"""
    audio_chunks = []
    recording = False

    def callback(indata, frames, time, status):
        if recording:
            audio_chunks.append(indata.copy())

    print("Hold SPACE to record")

    with sd.InputStream(
        samplerate=SAMPLE_RATE,
        channels=CHANNELS,
        dtype="int16",
        callback=callback
    ):
        while True:
            if keyboard.is_pressed("space"):
                if not recording:
                    print("Recording...")
                    recording = True
            else:
                if recording:
                    print("Stopped recording")
                    break
            time.sleep(0.01)  # brief pause so the polling loop doesn't peg a CPU core

    if not audio_chunks:
        print("No audio captured")
        return
    audio = np.concatenate(audio_chunks, axis=0)
    write(AUDIO_PATH, SAMPLE_RATE, audio)
    print("Audio saved")

# Main control loop
commands_history = []
count = 0

while True:
    print(f"\n===== STEP {count} =====")

    # Capture image from camera
    r = requests.get(URL, timeout=5)
    if r.status_code == 200:
        with open(IMAGE_PATH, "wb") as f:
            f.write(r.content)
        print("Image updated")
    else:
        print("Image download failed")
        continue

    # Record voice command
    record_audio()

    # Transcribe audio
    transcript = aai.Transcriber().transcribe(AUDIO_PATH)
    if transcript.status == "error":
        print("Transcription failed")
        continue

    base_prompt = transcript.text
    print("You said:", base_prompt)

    speak("Processing")

    # Build prompt with or without history
    if count == 0:
        prompt = f"""
You are a robotic arm.
User request: "{base_prompt}"

First line: explain briefly what you are doing.
Next lines: commands only:
MOVE x y z
GRIP OPEN
GRIP CLOSE
"""
    else:
        prompt = f"""
You are a robotic arm.
User request: "{base_prompt}"

Previous actions:
{commands_history}

First line: explain briefly what you are doing.
Next lines: commands only:
MOVE x y z
GRIP OPEN
GRIP CLOSE
"""

    # Send to Gemini with image
    with open(IMAGE_PATH, "rb") as f:
        image_bytes = f.read()

    response = client.models.generate_content(
        model=MODEL_ID,
        contents=[
            types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
            prompt
        ],
        config=types.GenerateContentConfig(
            temperature=0.5,
            thinking_config=types.ThinkingConfig(thinking_budget=0)
        )
    )

    raw_text = response.text or ""
    raw_text = raw_text.strip()

    print("\nMODEL RESPONSE:\n", raw_text)

    # Parse response
    lines = [l.strip() for l in raw_text.split("\n") if l.strip()]
    spoken = []
    commands = []

    for line in lines:
        if line.startswith("MOVE") or line.startswith("GRIP"):
            commands.append(line)
        else:
            spoken.append(line)

    # Speak feedback
    if spoken:
        speak(spoken[0])

    if not commands:
        print("No commands generated")
        continue

    # Send commands to Arduino
    ser = serial.Serial("COM7", 9600, timeout=2)  # Update COM port
    time.sleep(2)

    for cmd in commands:
        print("Sending:", cmd)
        ser.write((cmd + "\n").encode())
        time.sleep(1)

    ser.close()

    # Update history
    commands_history.extend(commands)
    count += 1

🔌 Arduino Setup

Required Arduino Code (servo_control.ino)

#include <Servo.h>

Servo servo1, servo2, servo3, servo4;

void setup() {
  Serial.begin(9600);
  servo1.attach(9);   // Base
  servo2.attach(10);  // Shoulder
  servo3.attach(11);  // Elbow
  servo4.attach(6);   // Gripper
  
  // Set to home position
  servo1.write(90);
  servo2.write(90);
  servo3.write(90);
  servo4.write(90);
}

void loop() {
  if (Serial.available() > 0) {
    String command = Serial.readStringUntil('\n');
    command.trim();
    
    if (command.startsWith("MOVE")) {
      // Parse: MOVE x y z
      int x, y, z;
      sscanf(command.c_str(), "MOVE %d %d %d", &x, &y, &z);
      
      // Map coordinates to servo angles (adjust as needed)
      servo1.write(constrain(x, 0, 180));
      servo2.write(constrain(y, 0, 180));
      servo3.write(constrain(z, 0, 180));
      
      Serial.println("MOVED");
    }
    else if (command == "GRIP OPEN") {
      servo4.write(180);  // Open position
      Serial.println("OPENED");
    }
    else if (command == "GRIP CLOSE") {
      servo4.write(0);    // Close position
      Serial.println("CLOSED");
    }
  }
}

Wiring Diagram

Arduino Uno/Mega
├── Pin 9  → Base Servo (Signal)
├── Pin 10 → Shoulder Servo (Signal)
├── Pin 11 → Elbow Servo (Signal)
├── Pin 6  → Gripper Servo (Signal)
├── 5V     → Servo Power (or external 5V supply)
└── GND    → Common Ground

Important

Use an external power supply for servos if using more than 2 servos to avoid brownouts.


📋 Detailed Setup Guide

Step 1: Create Project Directories

mkdir GPTArm
cd GPTArm
mkdir images audio arduino

Step 2: Install Python Dependencies

Create requirements.txt:

numpy==1.24.3
sounddevice==0.4.6
keyboard==0.13.5
scipy==1.11.3
assemblyai==0.25.0
pyttsx3==2.90
python-dotenv==1.0.0
google-genai==0.2.0
pyserial==3.5
requests==2.31.0

Install dependencies:

pip install -r requirements.txt

Step 3: Configure Camera

Option A: Android IP Webcam

  1. Download "IP Webcam" app from Play Store
  2. Open app and click "Start server"
  3. Note the IP address (e.g., 192.168.1.15:8080)
  4. Update URL in main.py:
    URL = "http://192.168.1.15:8080/shot.jpg"

Option B: ESP32-CAM

  1. Flash ESP32-CAM with camera server firmware
  2. Connect to your WiFi
  3. Note the IP address
  4. Update URL in main.py to the still-image endpoint (the script expects a single JPEG per request, not the MJPEG stream):
    URL = "http://192.168.1.20/capture"

Step 4: Get API Keys

AssemblyAI (Speech-to-Text)

  1. Go to AssemblyAI
  2. Sign up for free account
  3. Copy your API key from dashboard

Google Gemini (Vision AI)

  1. Go to Google AI Studio
  2. Create API key
  3. Copy the key

Create .env file:

ASSEMBLYAI_API_KEY=your_assemblyai_key_here
GOOGLE_API=your_google_api_key_here

Step 5: Upload Arduino Code

  1. Open Arduino IDE
  2. Copy the servo control code to a new sketch
  3. Select your board (Tools → Board)
  4. Select correct COM port (Tools → Port)
  5. Upload the sketch

Step 6: Find Your Serial Port

Windows

  1. Open Device Manager
  2. Expand "Ports (COM & LPT)"
  3. Note the COM port (e.g., COM7)
  4. Update in main.py:
    ser = serial.Serial("COM7", 9600, timeout=2)

Linux/macOS

ls /dev/tty*
# Look for /dev/ttyUSB0 or /dev/ttyACM0

# Grant permissions (Linux)
sudo usermod -a -G dialout $USER
sudo chmod 666 /dev/ttyUSB0

Update in main.py:

ser = serial.Serial("/dev/ttyUSB0", 9600, timeout=2)

Step 7: Test Individual Components

Test Camera

import requests
URL = "http://192.168.1.15:8080/shot.jpg"
r = requests.get(URL)
with open("test.jpg", "wb") as f:
    f.write(r.content)
print("Image saved as test.jpg")

Test Microphone

import sounddevice as sd
print(sd.query_devices())  # List all audio devices

Test Serial Connection

import serial
ser = serial.Serial("COM7", 9600, timeout=2)
ser.write(b"GRIP OPEN\n")
ser.close()

Step 8: Run the System

python main.py

🎮 Usage Workflow

Basic Operation

  1. Start the program: python main.py
  2. Wait for prompt: "Hold SPACE to record"
  3. Give command: Hold SPACE, speak clearly, release SPACE
  4. AI processes: Analyzes image + voice command
  5. Robot responds: Speaks action and executes movement
  6. Repeat: System ready for next command

Example Commands

| Command | What Happens |
| --- | --- |
| "Move to the red ball" | Camera finds the red ball; the arm moves there |
| "Pick up the cube" | Closes the gripper on the detected cube |
| "Place it over there" | Moves to the indicated position |
| "Go to home position" | Returns to the starting pose |
| "Open the gripper" | Opens the claw |

🎛️ Configuration Options

Camera Settings

# In main.py, update these variables:
URL = "http://YOUR_CAMERA_IP:PORT/endpoint"
IMAGE_PATH = "images/input.jpg"

Audio Settings

SAMPLE_RATE = 44100  # CD quality
CHANNELS = 1         # Mono recording

Voice Settings

engine.setProperty("rate", 170)   # Speech speed (100-300)
engine.setProperty("volume", 1.0) # Volume (0.0-1.0)

Serial Settings

ser = serial.Serial(
    port="COM7",      # Your port
    baudrate=9600,    # Must match Arduino
    timeout=2         # Read timeout in seconds
)

🔍 Code Explanation

Voice Recording Function

def record_audio():
    """
    Records audio while SPACE key is held down.
    Uses real-time callback to capture audio chunks.
    Saves as WAV file when recording stops.
    """
    audio_chunks = []
    recording = False

    def callback(indata, frames, time, status):
        if recording:
            audio_chunks.append(indata.copy())
    # ... rest of implementation

Main Control Loop

while True:
    # 1. Capture current scene
    r = requests.get(URL, timeout=5)
    
    # 2. Get voice command
    record_audio()
    
    # 3. Transcribe speech
    transcript = aai.Transcriber().transcribe(AUDIO_PATH)
    
    # 4. Send to Gemini AI with image
    response = client.models.generate_content(...)
    
    # 5. Parse commands
    commands = [line for line in response.text.split("\n")
                if line.startswith("MOVE") or line.startswith("GRIP")]
    
    # 6. Execute on Arduino
    ser.write((cmd + "\n").encode())

Command History System

commands_history = []  # Stores all previous commands

# AI uses this context to plan multi-step tasks
prompt = f"""
Previous actions:
{commands_history}

New request: "{base_prompt}"
"""

💻 Platform Requirements

| OS | Extra Steps |
| --- | --- |
| Linux | sudo apt install portaudio19-dev |
| macOS | brew install portaudio |
| Windows | None (works out of the box) |

🤖 Robot Command Format

The system communicates with the Arduino using a simple string protocol:

MOVE x y z          # Move to coordinates
GRIP OPEN           # Open the claw
GRIP CLOSE          # Close the claw

Example Commands

# Move arm to position
"MOVE 100 50 75"

# Pick up object
"GRIP CLOSE"

# Release object
"GRIP OPEN"

📦 Hardware Requirements

  • Arduino board (Uno, Mega, or compatible)
  • 4-6 DOF robotic arm with servo motors
  • Camera (ESP32-CAM, IP Webcam, or USB webcam)
  • USB cable for Arduino connection
  • Power supply for servos (5V or as required)

🔧 Configuration

Camera Setup

Edit the camera URL (the URL variable in main.py) so it points to a snapshot endpoint; the script fetches one JPEG per loop iteration:

# For ESP32-CAM (still-image endpoint of the CameraWebServer sketch)
URL = "http://192.168.1.100/capture"

# For Android IP Webcam (snapshot endpoint)
URL = "http://192.168.1.101:8080/shot.jpg"

Serial Port Configuration

# Windows
ser = serial.Serial("COM3", 9600, timeout=2)

# Linux/macOS
ser = serial.Serial("/dev/ttyUSB0", 9600, timeout=2)

🎯 Usage Examples

Basic Pick and Place

  1. Hold SPACE and say: "Pick up the red cube"
  2. Robot analyzes the scene and moves to object
  3. Hold SPACE and say: "Place it on the blue plate"
  4. Robot completes the task

Multi-Step Task

  1. "Sort the blocks by color"
  2. Robot plans and executes multiple pick-and-place operations
  3. Provides voice feedback for each step

🛡️ Safety & Disclaimer

Warning

This project controls physical hardware. Please observe the following safety guidelines:

  • ⚠️ Keep the robot's workspace clear of obstacles
  • ⚠️ Ensure your Arduino has safety limits to prevent servo damage
  • ⚠️ Always supervise the robot during operation
  • ⚠️ Test in a controlled environment first
  • ⚠️ Implement emergency stop functionality

📚 Dependencies Explained

| Package | Purpose | Why We Need It |
| --- | --- | --- |
| numpy | Numerical operations | Audio array manipulation |
| sounddevice | Audio recording | Capture microphone input |
| keyboard | Keyboard input | Detect SPACE key press |
| scipy | Scientific computing | Save audio as WAV file |
| assemblyai | Speech-to-text | Convert voice to text |
| pyttsx3 | Text-to-speech | Robot voice feedback |
| python-dotenv | Environment variables | Secure API key storage |
| google-genai | Gemini AI | Vision + language reasoning |
| pyserial | Serial communication | Talk to Arduino |
| requests | HTTP requests | Fetch camera images |

❓ Frequently Asked Questions

Q: Can I use a USB webcam instead of IP camera?

Yes! Replace the image capture code with:

import cv2

cap = cv2.VideoCapture(0)  # 0 for default webcam
ret, frame = cap.read()
cv2.imwrite(IMAGE_PATH, frame)
cap.release()

Q: The robot moves too fast/slow. How do I adjust?

In the Arduino code, add delays between servo movements:

servo1.write(x);
delay(500);  // Wait 500ms
servo2.write(y);
delay(500);

Q: Can I add more servos?

Yes! Just add more servo objects and pins:

Servo servo5, servo6;
servo5.attach(5);
servo6.attach(3);

Then update the command parsing in Arduino.

Q: Why does speech recognition fail sometimes?

Common causes:

  • Background noise (use in quiet environment)
  • Microphone too far away
  • Speaking too fast
  • Poor internet connection (AssemblyAI is cloud-based)

Q: Can this work offline?

Partially. You need internet for:

  • AssemblyAI (speech-to-text)
  • Gemini AI (vision reasoning)

Local alternatives:

  • Use speech_recognition with Sphinx for offline STT
  • Use local vision models (though less capable)

Q: How accurate is the object detection?

Gemini's vision models are accurate for:

  • Object identification
  • Spatial reasoning
  • Color recognition

Limitations:

  • Very small objects (<1cm)
  • Poor lighting conditions
  • Highly reflective surfaces

Q: Can I control multiple robots?

Yes! Open multiple serial connections:

robot1 = serial.Serial("COM7", 9600)
robot2 = serial.Serial("COM8", 9600)

# Send different commands to each
robot1.write(b"MOVE 90 90 90\n")
robot2.write(b"GRIP OPEN\n")

Q: How do I add custom commands?

  1. Update the Arduino code to handle new commands
  2. Update the Gemini prompt to include new commands
  3. Example for a new "ROTATE" command:

Arduino:

else if (command.startsWith("ROTATE")) {
  int angle;
  sscanf(command.c_str(), "ROTATE %d", &angle);
  servo1.write(angle);
}

Prompt:

prompt = f"""
Commands available:
MOVE x y z
GRIP OPEN
GRIP CLOSE
ROTATE angle
"""

🐛 Troubleshooting

Common Issues

Camera not connecting:

# Verify camera URL is accessible
curl http://your-camera-ip:port/shot.jpg

# Or use browser to test
# http://192.168.1.15:8080/shot.jpg

Audio input not working:

# Test microphone
python -m sounddevice

# List audio devices
python -c "import sounddevice as sd; print(sd.query_devices())"

Arduino not responding:

  • Check serial port permissions (Linux/macOS): sudo chmod 666 /dev/ttyUSB0
  • Verify baud rate matches Arduino sketch (9600)
  • Ensure correct port is selected
  • Try unplugging and replugging USB cable
  • Check if another program is using the serial port

Servos jittering or not moving:

  • Use external power supply (5V, 2A minimum)
  • Check all ground connections
  • Verify servo wire connections
  • Reduce load on servos
  • Add capacitors (100μF) across power lines

"ModuleNotFoundError" errors:

# Reinstall all dependencies
pip install -r requirements.txt --force-reinstall

# If using Anaconda
conda install -c conda-forge <package-name>

Gemini API errors:

  • Check API key is valid
  • Verify you're using correct model ID
  • Check API quota limits
  • Ensure internet connection is stable

Voice transcription is inaccurate:

  • Speak clearly and at moderate pace
  • Reduce background noise
  • Move microphone closer
  • Use a better quality microphone
  • Check microphone input levels in system settings

⚡ Performance Optimization

Speed Improvements

# 1. Reduce image resolution before sending to AI
from PIL import Image

img = Image.open(IMAGE_PATH)
img = img.resize((640, 480))  # Smaller = faster
img.save(IMAGE_PATH)
# 2. Use threading for parallel operations
import threading

def capture_image_async():
    # capture_image() would wrap the requests.get(URL) block from main.py
    threading.Thread(target=capture_image).start()
# 3. Cache repeated AI responses
import hashlib

response_cache = {}
image_hash = hashlib.md5(image_bytes).hexdigest()  # fingerprint the frame
cache_key = f"{base_prompt}_{image_hash}"

if cache_key in response_cache:
    response = response_cache[cache_key]
else:
    response = client.models.generate_content(...)
    response_cache[cache_key] = response

Memory Optimization

# Clear audio chunks after saving
audio_chunks.clear()

# Close serial connection when not in use
ser.close()

Latency Reduction

# Use lower quality audio recording
SAMPLE_RATE = 16000  # Instead of 44100

# Reduce thinking budget for faster responses
thinking_config=types.ThinkingConfig(thinking_budget=0)

🎥 Demo Videos

Example Use Cases

1. Pick and Place Task

User: "Pick up the red cube and place it in the box"
Robot: "I'll pick up the red cube and move it to the box"
Actions: MOVE 120 80 60 → GRIP CLOSE → MOVE 90 90 90 → GRIP OPEN

2. Color Sorting

User: "Sort the blocks by color"
Robot: "I'll sort the blocks, starting with blue ones"
Actions: Multiple MOVE and GRIP sequences

3. Object Stacking

User: "Stack the three cubes on top of each other"
Robot: "I'll stack them starting from the bottom"
Actions: Precise MOVE commands with increasing Z values

🚀 Advanced Features

Adding Camera Calibration

import cv2
import numpy as np

# Calibrate camera to real-world coordinates
def pixel_to_coordinates(x_pixel, y_pixel):
    # Add your calibration matrix here
    x_real = (x_pixel - 320) * 0.1
    y_real = (y_pixel - 240) * 0.1
    return x_real, y_real
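Assuming the linear placeholder above (a 640×480 frame, 0.1 units per pixel, origin at the image center (320, 240)), a quick sanity check of the mapping:

```python
def pixel_to_coordinates(x_pixel, y_pixel):
    # Placeholder linear calibration from the snippet above
    x_real = (x_pixel - 320) * 0.1
    y_real = (y_pixel - 240) * 0.1
    return x_real, y_real

print(pixel_to_coordinates(320, 240))  # frame center maps to the origin
print(pixel_to_coordinates(640, 480))  # bottom-right corner
```

A real setup would replace this with a calibration matrix or homography measured for the actual camera pose.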

Implementing Safety Limits

// In the Arduino sketch
int constrain_safe(int value, int min_val, int max_val) {
  if (value < min_val) return min_val;
  if (value > max_val) return max_val;
  return value;
}

// Usage
servo1.write(constrain_safe(x, 20, 160));  // Prevent extreme angles

Adding Emergency Stop

import threading

def emergency_stop_listener():
    while True:
        if keyboard.is_pressed('esc'):
            # Assumes the Arduino sketch is extended to handle a STOP
            # command (the sketch above does not implement one)
            ser.write(b"STOP\n")
            print("EMERGENCY STOP ACTIVATED")
            break

# Start in background
threading.Thread(target=emergency_stop_listener, daemon=True).start()

Logging System

import logging
from datetime import datetime

logging.basicConfig(
    filename=f'robot_log_{datetime.now().strftime("%Y%m%d")}.txt',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

# Log all commands
logging.info(f"User command: {base_prompt}")
logging.info(f"AI response: {raw_text}")
logging.info(f"Executed: {commands}")

🔧 Alternative Hardware Configurations

Budget Setup (~$50)

  • Arduino Uno - $25
  • SG90 Micro Servos (4x) - $10
  • USB Webcam - $15
  • 3D Printed Parts - $5 (or cardboard)

# Use a USB webcam instead of an IP camera
import cv2
cap = cv2.VideoCapture(0)
ret, frame = cap.read()
cv2.imwrite(IMAGE_PATH, frame)

Mid-Range Setup (~$150)

  • Arduino Mega - $40
  • MG996R Servos (6x) - $50
  • ESP32-CAM - $10
  • 5V 10A Power Supply - $20
  • Custom PCB Shield - $15
  • Aluminum Frame - $15

Professional Setup (~$500+)

  • Arduino Due / Raspberry Pi 4 - $50
  • Dynamixel Servos - $300
  • High-res IP Camera - $100
  • Linear Actuators - $50+
  • CNC Machined Parts - Variable

🔌 Wiring Best Practices

Power Distribution

External 5V Supply (10A)
    ├── Servo 1 (VCC, GND)
    ├── Servo 2 (VCC, GND)
    ├── Servo 3 (VCC, GND)
    ├── Servo 4 (VCC, GND)
    └── Arduino 5V pin (optional, if not USB-powered; VIN needs 7-12V)

Arduino
    ├── Digital Pin 9  → Servo 1 Signal (Yellow)
    ├── Digital Pin 10 → Servo 2 Signal (Yellow)
    ├── Digital Pin 11 → Servo 3 Signal (Yellow)
    └── Digital Pin 6  → Servo 4 Signal (Yellow)

Common Ground Connection: Arduino GND ↔ Power Supply GND

Safety Circuit

// Add voltage monitoring in Arduino
int voltage_pin = A0;

void checkPower() {
  int voltage = analogRead(voltage_pin);
  if (voltage < 200) {  // ~1V threshold
    // Stop all servos
    servo1.detach();
    servo2.detach();
    servo3.detach();
    servo4.detach();
    Serial.println("LOW POWER - STOPPED");
  }
}

📱 Mobile App Integration (Optional)

Create a simple web interface for remote control:

from flask import Flask, render_template, request

app = Flask(__name__)

@app.route('/')
def index():
    return render_template('control.html')

@app.route('/command', methods=['POST'])
def handle_command():
    text_command = request.json['command']
    # Process command without voice
    # ... existing logic
    return {'status': 'success'}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

🧪 Testing Framework

Unit Tests

Create test_robot.py:

import unittest
from main import parse_commands, validate_coordinates

class TestRobotControl(unittest.TestCase):
    
    def test_command_parsing(self):
        response = "I'll move to the object\nMOVE 90 90 90\nGRIP CLOSE"
        commands = parse_commands(response)
        self.assertEqual(len(commands), 2)
        self.assertEqual(commands[0], "MOVE 90 90 90")
    
    def test_coordinate_limits(self):
        self.assertTrue(validate_coordinates(90, 90, 90))
        self.assertFalse(validate_coordinates(200, 90, 90))

if __name__ == '__main__':
    unittest.main()
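Note that main.py as listed earlier parses commands inline, so parse_commands and validate_coordinates are not yet importable; the tests assume a refactor along these lines (names chosen to match the test file):

```python
def parse_commands(response_text):
    """Extract MOVE/GRIP lines from a model response, dropping spoken text."""
    lines = [l.strip() for l in response_text.split("\n") if l.strip()]
    return [l for l in lines if l.startswith("MOVE") or l.startswith("GRIP")]

def validate_coordinates(x, y, z):
    """True when every coordinate fits the 0-180 servo range."""
    return all(0 <= v <= 180 for v in (x, y, z))
```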

Integration Test

def test_full_pipeline():
    # Sketch: assumes main.py is refactored into the helpers called below
    # 1. Test camera
    assert capture_image() == True
    
    # 2. Test audio
    assert record_audio() == True
    
    # 3. Test transcription
    transcript = transcribe_audio()
    assert len(transcript) > 0
    
    # 4. Test AI response
    response = get_ai_response(transcript, IMAGE_PATH)
    assert response is not None
    
    # 5. Test serial (with mock)
    commands = parse_commands(response)
    assert len(commands) > 0

🤝 Contributing

We welcome contributions from the community! This project is part of the Intuex organization.

How to Contribute

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Contribution Guidelines

  • Follow the existing code style
  • Write clear commit messages
  • Add tests for new features
  • Update documentation as needed
  • Be respectful and constructive

For more details, see the Intuex Contributing Guidelines.

Types of Contributions Welcome

  • 🐛 Bug fixes
  • ✨ New features
  • 📝 Documentation improvements
  • 🎨 UI/UX enhancements
  • 🧪 Test coverage
  • 🌐 Translations
  • 💡 Ideas and suggestions


🗺️ Roadmap

Current Features (v1.0)

  • ✅ Voice control with push-to-talk
  • ✅ Real-time vision processing
  • ✅ Command history and context
  • ✅ Text-to-speech feedback
  • ✅ Serial Arduino control

Planned Features (v2.0)

  • 🔄 Continuous voice activation ("Hey Robot...")
  • 🔄 Object tracking and following
  • 🔄 Multiple camera angles
  • 🔄 Web-based control interface
  • 🔄 Gesture recognition
  • 🔄 Offline mode with local AI models

Future Enhancements (v3.0)

  • 📋 Task queue and scheduling
  • 📋 Multi-robot coordination
  • 📋 AR visualization overlay
  • 📋 Machine learning for custom tasks
  • 📋 ROS integration
  • 📋 Cloud telemetry and monitoring

🙏 Acknowledgments

Organization

  • Intuex - Building the future of intelligent automation
  • Open Source Community - For continuous support and contributions

AI & APIs

  • Gemini AI by Google - Powerful vision-language reasoning
  • AssemblyAI - Accurate speech-to-text transcription
  • OpenAI - Inspiration for conversational AI interfaces

Hardware & Libraries

  • Arduino Community - Extensive servo control resources
  • Python Community - Amazing open-source libraries
  • PySerial Contributors - Reliable serial communication
  • SoundDevice Developers - Cross-platform audio recording

Inspiration & Resources

  • MistralRobotics - AI-powered robotics examples
  • ROS (Robot Operating System) - Robotics best practices
  • OpenCV Community - Computer vision techniques

Special Thanks

  • All contributors who submitted issues and pull requests
  • Beta testers who helped identify bugs
  • The robotics community for continuous support

📄 License

This project is licensed under the Intuex 0.1 License.

Intuex 0.1 License Summary

This license allows you to:

  • ✅ Use for personal projects
  • ✅ Use for educational purposes
  • ✅ Use for research and academic work
  • ✅ Modify and distribute
  • ✅ Rely on the contributors' patent grant

Requirements:

  • 📝 Include license and copyright notice
  • 📝 Document any changes made
  • 📝 Distribute derivatives under the same license
  • 📝 Make source code available

Restrictions:

  • ❌ No commercial use without a separate license
  • ❌ Cannot use "Intuex" trademark for endorsement

Full License Text

Intuex Open Source License v0.1

Copyright (c) 2024 Intuex Organization & Ansh

Permission is hereby granted, free of charge, to any person or organization
obtaining a copy of this software and associated documentation files (the
"Software"), to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, subject to the following conditions:

1. Attribution Requirement:
   - The above copyright notice and this permission notice shall be included
     in all copies or substantial portions of the Software.
   - Any modifications must be clearly documented and attributed.

2. Open Source Commitment:
   - Derivative works must be distributed under the same Intuex 0.1 License.
   - Source code must be made available for any distributed binary forms.

3. Patent Grant:
   - Contributors grant a perpetual, worldwide, royalty-free patent license
     to use any patent claims implemented in their contributions.

4. Trademark Protection:
   - The name "Intuex" and associated logos may not be used to endorse
     derived products without explicit written permission.

5. No Warranty:
   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
   OR IMPLIED, INCLUDING BUT NOT LIMITED TO WARRANTIES OF MERCHANTABILITY,
   FITNESS FOR A PARTICULAR PURPOSE, AND NONINFRINGEMENT.

6. Limitation of Liability:
   IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
   CLAIM, DAMAGES, OR OTHER LIABILITY ARISING FROM THE SOFTWARE.

For licensing inquiries: https://github.com/Intuex

🏢 Part of Intuex

This project is maintained by the Intuex organization - building the future of intelligent automation.

Intuex


💬 Community & Support
Stay Updated

  • ⭐ Star this repository to follow updates
  • 👁️ Watch for new releases
  • 🔔 Enable notifications for important changes

Share Your Build

Made something cool with this project? We'd love to see it!

  • Tag us on social media
  • Submit a pull request with your modifications
  • Share photos/videos in the discussions


Built with ❤️ by Intuex • Powered by Gemini AI & AssemblyAI

⭐ Star this repo • 🚀 Explore Intuex • 🐛 Report Issues

Open source robotics • AI-powered automation • Community-driven innovation
