Real-time, vision-enabled AI tutor powered by Google Gemini Live API.
Talk naturally, show your homework, and get step-by-step guidance.

Built for the Gemini Live Agent Challenge hackathon. Category: Live Agents 🗣️

| Feature | Description |
|---|---|
| 🗣️ Real-time Voice | Natural conversation with your tutor: speak and get spoken responses via Gemini Live API (native audio) |
| 📸 Vision-Enabled | Show your homework via camera or upload an image, analyzed by the Gemini 2.5 Flash vision model |
| 🧠 Balanced Teaching | Guides you to understand concepts, then provides complete solutions with step-by-step explanations |
| ⚡ Interruptible | Break in at any time; the tutor handles interruptions gracefully |
| 🌐 Language Selection | Choose from 20+ languages (English, Hindi, Spanish, French, etc.); the tutor responds in your preferred language |
| 📚 Multi-Subject | Mathematics, Physics, Chemistry, Biology, CS, Language Arts, History |
| 🛠️ ADK Agent Tools | Structured tools for practice problems, concept explanations, study plans |
| 🔐 User Authentication | Sign up / log in with username & password; passwords hashed with SHA-256 + salt |
| 👤 Student Profiles | Name, grade, gender, age, language; persisted to Google Cloud Firestore |
| ☁️ Google Cloud | Deployed on Cloud Run with Terraform IaC, user data stored in Firestore |

    ┌───────────────────────────────────────────────────────────────┐
    │                      Frontend (Browser)                       │
    │  ┌──────────┐  ┌──────────┐  ┌─────────────────────────────┐  │
    │  │ Mic/Audio│  │  Camera  │  │     Chat UI / Controls      │  │
    │  └────┬─────┘  └────┬─────┘  └─────────────┬───────────────┘  │
    │       └─────────────┴──────┬───────────────┘                  │
    │                            │ WebSocket (wss://)               │
    └────────────────────────────┼──────────────────────────────────┘
                                 │
    ┌────────────────────────────┼──────────────────────────────────┐
    │                       Google Cloud Run                        │
    │  ┌─────────────────────────┴───────────────────────────────┐  │
    │  │               FastAPI + WebSocket Server                │  │
    │  │  ┌────────────────┐   ┌─────────────────────────────┐   │  │
    │  │  │ Session Mgr    │   │       ADK Tutor Agent       │   │  │
    │  │  │ (Live API      │   │   ┌─────────────────────┐   │   │  │
    │  │  │ sessions)      │   │   │ Tools:              │   │   │  │
    │  │  │                │   │   │ • Practice Probs    │   │   │  │
    │  │  │                │   │   │ • Concept Explain   │   │   │  │
    │  │  │                │   │   │ • Check Solution    │   │   │  │
    │  │  │                │   │   │ • Study Plan        │   │   │  │
    │  │  │                │   │   │ • Step-by-Step      │   │   │  │
    │  │  └────────┬───────┘   │   └─────────────────────┘   │   │  │
    │  │           │           └─────────────────────────────┘   │  │
    │  └───────────┼───────────────────────────────┬─────────────┘  │
    │              │                               │                │
    └──────────────┼───────────────────────────────┼────────────────┘
                   │ Bidirectional Audio           │ User Auth &
                   │ Streaming                     │ Profile Storage
                   ▼                               ▼
    ┌────────────────────────────┐    ┌──────────────────────────┐
    │     Google AI Platform     │    │  Google Cloud Firestore  │
    │  ┌──────────────────────┐  │    │                          │
    │  │ Gemini 2.5 Flash     │  │    │  • User profiles         │
    │  │ Native Audio         │  │    │  • Username / password   │
    │  │ (Live API: voice)    │  │    │  • Grade, age, gender    │
    │  │ • Real-time audio    │  │    │  • Language preference   │
    │  │ • Interruptions      │  │    │  • Subject preference    │
    │  │ • Transcription      │  │    │                          │
    │  └──────────────────────┘  │    │  (Fallback: local JSON)  │
    │  ┌──────────────────────┐  │    └──────────────────────────┘
    │  │ Gemini 2.5 Flash     │  │
    │  │ Vision Model         │  │
    │  │ (image analysis)     │  │
    │  │ • Homework photos    │  │
    │  │ • Image upload       │  │
    │  └──────────────────────┘  │
    └────────────────────────────┘

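
The relay at the centre of this architecture (browser WebSocket on one side, Live API session on the other) can be sketched with plain `asyncio`. This is a hedged illustration, not the code in `main.py` or `live_api.py`: `pump`, `bridge`, and the `Receive`/`Send` callables are hypothetical stand-ins for the real WebSocket and Live API session objects.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical stand-ins: receive() returns None once the peer has closed.
Receive = Callable[[], Awaitable[Optional[bytes]]]
Send = Callable[[bytes], Awaitable[None]]


async def pump(receive: Receive, send: Send) -> None:
    """Forward messages in one direction until the source closes."""
    while (message := await receive()) is not None:
        await send(message)


async def bridge(client_recv: Receive, client_send: Send,
                 model_recv: Receive, model_send: Send) -> None:
    """Relay both directions; stop both as soon as either side closes."""
    tasks = [
        asyncio.create_task(pump(client_recv, model_send)),  # browser -> model
        asyncio.create_task(pump(model_recv, client_send)),  # model -> browser
    ]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    for task in done:
        task.result()  # re-raise any error from the direction that finished
```

Cancelling the surviving direction when the other one ends keeps a closed browser tab from leaving an upstream session open.
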
| Layer | Technology |
|---|---|
| AI Models | Gemini 2.5 Flash Native Audio (Live API, voice) + Gemini 2.5 Flash (vision/image analysis) |
| Agent Framework | Google ADK (Agent Development Kit) |
| SDK | Google GenAI SDK (google-genai v1.x) |
| Backend | Python 3.12, FastAPI, uvicorn, WebSockets |
| Database | Google Cloud Firestore (user profiles & session data) |
| Frontend | Vanilla HTML/CSS/JS (no build step) |
| Google Cloud | Cloud Run, Vertex AI, Cloud Build, Firestore |
| IaC | Terraform + Cloud Build YAML |
| Containerization | Docker (multi-stage build) |
- Python 3.11+
- A Google AI API Key or a Google Cloud project with Vertex AI enabled
- Microphone + Camera (for full experience)

Clone the repository, set up the environment, and configure credentials:

    git clone https://github.com/Sumit231292/Gemini_AI_Tutor.git
    cd Gemini_AI_Tutor

    # Create and activate virtual environment
    python -m venv venv
    # Windows
    venv\Scripts\activate
    # macOS/Linux
    source venv/bin/activate

    # Install dependencies
    pip install -r backend/requirements.txt

    # Copy the example env file
    cp .env.example .env
    # Edit .env and add your API key
    GOOGLE_API_KEY=your-api-key-here
    # Set your GCP project for Firestore (user storage)
    GOOGLE_CLOUD_PROJECT=your-gcp-project-id

    # Authenticate with GCP
    gcloud auth application-default login --project YOUR_PROJECT_ID
    # Create Firestore database (one-time)
    gcloud firestore databases create --project=YOUR_PROJECT_ID --location=nam5 --type=firestore-native

Without Firestore, the app automatically falls back to a local JSON file (`backend/data/users.json`).
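
The fallback behaviour described above can be sketched as a small factory. This is an illustration under stated assumptions rather than the actual `user_store.py`: `UserStore`, `JsonUserStore`, `open_user_store`, and the `firestore_factory` parameter are hypothetical names, and the Firestore-backed store is represented only by a callable that may fail.

```python
import json
from pathlib import Path
from typing import Callable, Optional, Protocol


class UserStore(Protocol):
    def get(self, username: str) -> Optional[dict]: ...
    def put(self, username: str, profile: dict) -> None: ...


class JsonUserStore:
    """Local fallback: keeps all users in a single JSON file."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def _load(self) -> dict:
        if not self.path.exists():
            return {}
        return json.loads(self.path.read_text(encoding="utf-8"))

    def get(self, username: str) -> Optional[dict]:
        return self._load().get(username)

    def put(self, username: str, profile: dict) -> None:
        users = self._load()
        users[username] = profile
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(users, indent=2), encoding="utf-8")


def open_user_store(firestore_factory: Callable[[], UserStore],
                    json_path: Path) -> UserStore:
    """Try the Firestore-backed store first; use the JSON file on any failure."""
    try:
        return firestore_factory()
    except Exception:  # e.g. missing credentials or project ID
        return JsonUserStore(json_path)
```

Because both stores expose the same `get`/`put` interface, the rest of the app does not need to know which backend is active.
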

    cd backend
    python -m uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload

Navigate to http://localhost:8000, select a subject, and start learning!

Note: For microphone/camera access, use `localhost` or HTTPS. Browsers block media APIs on non-secure origins.

Using the deploy script:

    # Authenticate with Google Cloud
    gcloud auth login
    gcloud config set project YOUR_PROJECT_ID
    # Deploy
    chmod +x deploy/deploy.sh
    ./deploy/deploy.sh YOUR_PROJECT_ID us-central1

Using Cloud Build:

    # Trigger a build from the repo root
    gcloud builds submit \
      --config deploy/cloudbuild.yaml \
      --project YOUR_PROJECT_ID \
      --substitutions _REGION=us-central1

Using Terraform:

    cd deploy
    # Initialize Terraform
    terraform init
    # Plan the deployment
    terraform plan -var="project_id=YOUR_PROJECT_ID"
    # Apply
    terraform apply -var="project_id=YOUR_PROJECT_ID"

After deployment, set your API key:

    gcloud run services update edunova \
      --region us-central1 \
      --set-env-vars GOOGLE_API_KEY=your-api-key

Project layout:

    Gemini_AI_Tutor/
    ├── backend/
    │   ├── app/
    │   │   ├── __init__.py          # Package init
    │   │   ├── config.py            # Environment configuration
    │   │   ├── main.py              # FastAPI server + WebSocket endpoints
    │   │   ├── live_api.py          # Gemini Live API session manager
    │   │   ├── tutor_agent.py       # ADK agent with tutoring tools
    │   │   └── user_store.py        # User auth & Firestore/JSON storage
    │   ├── data/
    │   │   └── users.json           # Local fallback user storage
    │   ├── requirements.txt         # Python dependencies
    │   └── Dockerfile               # Container for Cloud Run
    ├── frontend/
    │   ├── index.html               # Main UI (sign-in, landing, session)
    │   ├── style.css                # Styling (glass-morphism design)
    │   └── app.js                   # Client-side logic & WebSocket
    ├── deploy/
    │   ├── deploy.sh                # One-click deploy script
    │   ├── cloudbuild.yaml          # Cloud Build CI/CD config
    │   └── main.tf                  # Terraform IaC config
    ├── .env.example                 # Environment template
    ├── .gitignore
    └── README.md                    # This file

- Create an account: sign up with a username, password, name, grade, and preferred language
- Log in: returning users log in with username and password
- Select a subject from the landing page
- Talk to your tutor: click the mic button and speak naturally
- Show your homework: use the camera button to capture a photo, or upload an image
- Type messages: use the text input for typed questions
- Change language: switch language anytime from the landing page
- Get guided: the tutor explains concepts step-by-step and gives you the final answer
- Interrupt anytime: the tutor handles interruptions gracefully
- Log out: click the logout button to switch accounts

| Requirement | Status | Details |
|---|---|---|
| ✅ Gemini model | ✔️ | Gemini 2.5 Flash Native Audio (Live API) + Gemini 2.5 Flash (vision) |
| ✅ Google GenAI SDK or ADK | ✔️ | Both: GenAI SDK for Live API + ADK for agent tools |
| ✅ Google Cloud service | ✔️ | Cloud Run, Vertex AI, Cloud Build, Firestore |
| ✅ Multimodal input | ✔️ | Voice (audio) + Vision (camera/image) + Text |
| ✅ Beyond text-in/text-out | ✔️ | Real-time voice conversation + image understanding |
| ✅ Multi-language | ✔️ | 20+ languages with per-session language selection |
| ✅ User persistence | ✔️ | Student profiles stored in Google Cloud Firestore |
| ✅ Live Agent category | ✔️ | Real-time interruptible voice + vision tutor |
| ✅ Public code repo | ✔️ | This repository |
| ✅ Spin-up instructions | ✔️ | See Quick Start above |
| ✅ Architecture diagram | ✔️ | See Architecture section |
| ✅ IaC deployment (bonus) | ✔️ | Terraform + Cloud Build |

- Gemini Live API with native audio model provides remarkably natural real-time conversations with low latency
- Hybrid vision approach: using Gemini 2.5 Flash for image analysis and feeding results into the native audio session creates a seamless "sees and speaks" experience
- Vision + Voice combo creates a powerful tutoring experience: students can literally "show" their homework
- Balanced teaching approach combines guided learning with actual answers: the tutor explains step-by-step AND gives the final answer
- ADK tools provide structured capabilities (practice problems, study plans) beyond free-form conversation
- Language selection ensures the tutor always responds in the student's preferred language
- Firestore integration provides durable, serverless storage for student profiles across sessions
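
The hybrid vision approach in the list above (analyze the image with one model, then feed the result into the voice session as text) can be sketched as below. The two callables are hypothetical stand-ins for the vision model call and for the live session's text input, and the wording of the injected context is an assumption, not the project's actual prompt.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical stand-ins for the real SDK calls.
DescribeImage = Callable[[bytes, str], Awaitable[str]]  # vision model: image -> text
SendText = Callable[[str], Awaitable[None]]             # live session: text input


async def relay_image(image: bytes, mime_type: str,
                      describe_image: DescribeImage,
                      send_text: SendText) -> str:
    """Describe the image with the vision model, then hand the text to the voice session."""
    description = await describe_image(image, mime_type)
    context = (
        "The student just shared an image of their work. "
        f"Vision analysis: {description}"
    )
    await send_text(context)
    return context
```

Keeping the two model calls behind plain callables also makes the flow easy to exercise in tests without network access.
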
- Native audio model limitations: the `gemini-2.5-flash-native-audio-latest` model doesn't support direct image input, requiring a hybrid approach with a separate vision model call
- Audio format handling: bridging browser MediaRecorder PCM format to Gemini's expected input required careful sample rate and encoding management
- WebSocket lifecycle: managing the bidirectional bridge between client WebSocket and Gemini Live API session required careful async handling
- Interruption handling: ensuring smooth interruption UX when the student speaks while the tutor is responding
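
The sample-rate and encoding work mentioned under audio format handling can be illustrated with a standard-library sketch. It assumes the browser delivers float samples in [-1, 1] at 48 kHz and that the model expects 16-bit little-endian PCM at 16 kHz; both figures are assumptions to check against the current Live API documentation. The function names are hypothetical.

```python
import struct
from typing import Sequence


def resample_linear(samples: Sequence[float], src_rate: int, dst_rate: int) -> list[float]:
    """Resample by linear interpolation (no anti-alias filter; sketch only)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    out_len = max(1, int(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate
    out = []
    for i in range(out_len):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


def float_to_pcm16(samples: Sequence[float]) -> bytes:
    """Clamp to [-1, 1] and pack as signed 16-bit little-endian PCM."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)
```

A chunk from the browser would pass through `resample_linear(chunk, 48000, 16000)` and then `float_to_pcm16` before being sent upstream. A production pipeline would normally low-pass filter before downsampling.
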
- Real-time whiteboard/drawing for working through math problems visually
- Progress tracking across sessions with Firestore persistence
- Integration with curriculum standards (Common Core, etc.)
- Google OAuth as alternative login method
- Support for more Gemini model variants as they become available

MIT License. See LICENSE for details.

Built with ❤️ using Google Gemini Live API · ADK · Google Cloud
#GeminiLiveAgentChallenge