A Flask app that watches your face and hands. Recognizes expressions, lets you control system volume by pinching your fingers, and detects basic hand gestures. Runs on M-series Macs using MediaPipe and OpenCV.
The app watches your face and tries to figure out what you're feeling. It looks at the positions of key points on your face (your mouth corners, eyelids, and eyebrows) and makes a guess about whether you're happy, sad, angry, surprised, or just neutral. It's not perfect, especially in bad lighting, but it works surprisingly well once you figure out how close to sit to the camera.
- Happy: When you smile. It looks for an open mouth and corners that curve up.
- Sad: When your mouth is closed or the corners point down.
- Angry: When your eyebrows are close together and your expression gets tight.
- Surprise: When you open your mouth and eyes wide.
- Neutral: Everything else. The default state.
Pinch your thumb and index finger together and the system volume drops. Open them up and it gets louder. Make a fist and the system mutes. The distance between your fingertips gets mapped to 0-100% volume. It's a bit touchy at first β the detection can be finicky if you move too fast or if the room is dark β but you'll get used to it.
The app recognizes a few hand shapes:
- OK sign: Touch your thumb to your index finger to form a circle.
- Wave: Move your hand back and forth horizontally a few times. Useful for triggering actions.
- Fist: Close your hand. This triggers the mute function.
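As a rough sketch of how a fist can be recognized from MediaPipe's 21 hand landmarks (the tip/PIP index pairs below follow MediaPipe's hand model; the curl test itself is a simplification, not necessarily what app.py does):

```python
def is_fist(landmarks):
    """Rough fist check on MediaPipe's 21 hand landmarks, given as
    normalized (x, y) pairs.

    In image coordinates y grows downward, so a curled fingertip sits
    *below* (greater y than) its PIP joint. All four non-thumb fingers
    curled counts as a fist.
    """
    # (tip, pip) landmark index pairs: index, middle, ring, pinky
    finger_joints = [(8, 6), (12, 10), (16, 14), (20, 18)]
    return all(landmarks[tip][1] > landmarks[pip][1] for tip, pip in finger_joints)
```

A real detector would usually also check the thumb and smooth the result over a few frames to avoid flicker.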
stellaris-hack/
├── app.py              Main backend code
├── requirements.txt    What you need to install
├── start.sh            Quick launch script
├── check_system.py     Checks if everything's set up right
├── README.md           This file
├── TECHNICAL.md        Details about how it works
├── QUICKSTART.md       Quick reference
├── templates/
│   └── index.html      The web dashboard
└── static/             Place for extra files if needed
- A Mac with an M1, M2, or M3 chip (or newer). Older Intel Macs might work but aren't tested.
- Python 3.8 or higher. Check with `python3 --version`.
- A working webcam.
- About 5 minutes and a decent internet connection for the first install (downloading dependencies is the main wait).
Step 1: Open Terminal and go to the project folder.
cd ~/Documents/stellaris-hack

Step 2: If you want to keep things clean, create a virtual environment. This keeps the app's dependencies separate from your system Python.
python3 -m venv venv
source venv/bin/activate

Step 3: Install what the app needs.
pip install --upgrade pip
pip install -r requirements.txt

The first time, this takes a couple of minutes because it's downloading MediaPipe and OpenCV. Subsequent installs are much faster.
Step 4: Give the app permission to use your camera.
- Go to System Preferences → Security & Privacy → Camera
- Find Terminal (or whatever you're using to run the app) and allow it
Step 5: Run it.
python app.py

You should see something like:
Starting Real-time Computer Vision System...
Open browser to: http://localhost:5001
Step 6: Open your browser to http://localhost:5001 and you're good. It'll ask for camera permission one more time in the browser itself β click allow.
Sit in front of your webcam and try different expressions. The app updates every frame, so you should see it respond quickly. Fair warning: it works better with good lighting. If you're in a dark room, it'll struggle. Also, don't expect movie-level accuracy β it's pretty good but not perfect. Sometimes it confuses Surprise with a genuine smile.
Make a thumb-index pinch (tips touching) and then slide your hand horizontally across the camera view. Volume is mapped to the x coordinate of the pinch: left edge is 0%, right edge is 100%. Pinch and move left-to-right to raise the volume, right-to-left to lower it. A full fist still mutes the system.
A manual slider is available under the volume card if the gesture isn't behaving. The pinch position must remain visible to the camera, and smooth, deliberate motion works best; very fast swipes or poor lighting will still produce jitter.
- OK Sign: Touch your thumb to your index finger. Hold it steady for a second so the app registers it.
- Wave: Move your hand left and right a few times. The app looks for repeated direction changes, so make it obvious.
- Fist: Close your hand. All fingers should curl in. This triggers mute.
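The "repeated direction changes" idea behind wave detection can be sketched like this (the window size and thresholds are illustrative guesses, not the app's actual values):

```python
from collections import deque


class WaveDetector:
    """Count horizontal direction reversals of the hand's x position.

    A wave is reported once the recent history contains enough
    significant left/right reversals. Thresholds are illustrative.
    """

    def __init__(self, min_reversals=3, min_step=0.02, window=20):
        self.xs = deque(maxlen=window)       # recent normalized x positions
        self.min_reversals = min_reversals
        self.min_step = min_step             # ignore tiny tremors

    def update(self, x):
        self.xs.append(x)
        # Directions of significant movements: +1 right, -1 left.
        dirs = []
        for a, b in zip(self.xs, list(self.xs)[1:]):
            if abs(b - a) >= self.min_step:
                dirs.append(1 if b > a else -1)
        # A reversal is any sign flip between consecutive movements.
        reversals = sum(1 for d, e in zip(dirs, dirs[1:]) if d != e)
        return reversals >= self.min_reversals
```

Feeding it an oscillating x position (e.g. 0.2, 0.4, 0.2, 0.4, ...) trips the detector after a few swings, while a steady hand never does.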
When you open the app in your browser, you'll see:
- Video stream: Your face and hands with overlays showing the landmarks the app is tracking.
- Expression display: Shows what the app thinks you're expressing right now.
- Volume indicator: Shows the current system volume (0-100) and a visual bar that fills up.
- Status info: Shows the detected gesture and other real-time data.
The whole page is responsive, so it should work fine on your phone too if you want to view it remotely on the same network.
MediaPipe gives us a bunch of points on your face (468 landmarks) and on your hands (21 per hand). The app calculates ratios and distances between these points:
- For expressions, it looks at mouth opening, eye opening, and the curve of your smile.
- For volume, it measures the distance between your thumb tip and index tip.
- For gestures, it checks if fingers are curled, if certain fingers are touching, or if your hand is moving in a pattern.
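Those ratio-and-distance measurements can be sketched with a couple of small helpers (landmark arguments are normalized (x, y) pairs; the function names are illustrative, not the app's actual API):

```python
import math


def dist(a, b):
    """Euclidean distance between two normalized (x, y) landmarks."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def mouth_opening_ratio(upper_lip, lower_lip, left_corner, right_corner):
    """Lip gap normalized by mouth width.

    Dividing by the mouth width makes the ratio roughly invariant to
    how far you sit from the camera.
    """
    width = dist(left_corner, right_corner)
    return dist(upper_lip, lower_lip) / width if width > 0 else 0.0
```

The same normalize-by-a-reference-distance trick applies to eye openness and finger spacing, which is why the thresholds later in this README are small unitless numbers like 0.15.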
OpenCV handles the video capture and drawing the landmarks on the screen. Flask serves the video stream to your browser as MJPEG (basically a series of JPEG images sent one after another).
For volume control, the app uses osascript to send AppleScript commands to macOS. This is direct system integration: it actually changes your Mac's volume.
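A minimal Python wrapper around those osascript calls might look like this (the helper names are illustrative; the app's actual functions may differ):

```python
import subprocess


def volume_command(level):
    """Build the osascript invocation for a clamped 0-100 volume level."""
    level = max(0, min(100, int(level)))
    return ["osascript", "-e", f"set volume output volume {level}"]


def set_system_volume(level):
    """Apply the volume via AppleScript (macOS only)."""
    subprocess.run(volume_command(level), check=True)


def set_muted(muted):
    """Mute or unmute system output via AppleScript (macOS only)."""
    state = "true" if muted else "false"
    subprocess.run(["osascript", "-e", f"set volume output muted {state}"],
                   check=True)
```

Clamping before building the command keeps a noisy gesture reading from ever sending an out-of-range value to the OS.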
First check: Is your webcam actually connected? Try FaceTime or Zoom to see if it works there.
If that works but this app doesn't:
- Check System Preferences → Security & Privacy → Camera. Make sure Terminal (or your IDE) is in the allowed apps list.
- Run the diagnostics:
python check_system.py

- If it still doesn't work, try restarting Terminal or your IDE.
Test osascript directly:
osascript -e "set volume output volume 50"

If you get an error, check System Preferences → Security & Privacy → Microphone. The app might need microphone permission to control audio.
If that works in Terminal but not in the app, it might be a permission issue with the Python process. Try running the app directly without any virtualenv stuff and see if that helps.
A manual slider has been added to the dashboard beneath the volume card; you can use it with your mouse to set the volume directly if the pinch gesture is unreliable. It always reflects the current system volume.
This is common. A few things to try:
- Better lighting: The app works much better in well-lit rooms. Shadow on your face confuses it. The live feed will display a red warning if the average brightness drops too low.
- Low-light enhancement: A CLAHE (adaptive histogram equalization) step is applied when the feed is very dark. This improves contrast without turning the image monochrome, so you should see more color and detail even in dim conditions. It's not a miracle cure, though: if the sensor only captures noise, software can't invent color.
- Higher JPEG quality: The stream uses 95% quality to reduce compression artifacts and preserve color fidelity. This slightly increases bandwidth on the local loop but makes the picture look much sharper.
- Smoothed volume control: Volume readings are filtered over a short history and only updated when the change exceeds a small threshold, preventing jitter when you move or wave your hand. Volume is also frozen while performing a wave gesture.
- Pinch-slide control: Volume is determined by the horizontal position of a thumb-index pinch. Keep the tips together and move your hand left/right to adjust the volume. This is more intuitive and avoids erratic readings from finger spacing.
- Gesture robustness/visibility: Wave detection no longer affects volume; a dedicated Gesture card appears on the dashboard (next to Expression and Volume). The in-camera overlay also shows the gesture clearly.
- Processing safety: Timeouts and exception handlers wrap face/hand processing. If MediaPipe ever hangs or throws an error while you move your hands, the stream continues with the last good frame instead of going black. Flaky camera reads are also retried.
- Overlay text: The in-camera overlay has a dark background and a larger font so expression/gesture/volume values are legible. The web dashboard also uses larger font sizes.
- Camera access: macOS may ask you to grant permission when the server first tries to open the camera. If the feed stays black, check System Preferences → Security & Privacy → Camera and make sure the terminal/Python process is allowed.
- Exposure/brightness controls: The app attempts to tweak the webcam's exposure/brightness properties when the stream starts, but not all cameras respect these settings. Try an external USB webcam if the built-in one remains dark.
- Browser support: The video window uses an MJPEG stream; Safari sometimes only shows the first frame. If the image freezes after a moment, try Chrome or Firefox, or reload the page. The server adds cache-control headers to mitigate this, but switching browsers is the most reliable fix.
- Closer to camera: Sit about 12-18 inches from the camera. Too far and the landmarks aren't detailed enough.
- Different background: If your background is all white or all black, try sitting somewhere with more contrast.
- Relax: Sometimes the app mistakes a neutral face for sadness if you're just sitting there tired. Exaggerate your expressions a bit.
The app is pushing a lot of ML models through your Mac. A few things that help:
- Close other apps, especially browsers with lots of tabs.
- The video resolution is set to 800x600 by default. If your Mac is struggling, you can lower it further in app.py around line 150.
- JPEG compression is set to quality 80. You can lower it to 70 if streaming is laggy.
- Face Mesh and Hand detection thresholds can be tweaked for speed vs. accuracy in app.py, lines 25-35.
Some other app is using the port (5001 by default). Edit the last line of app.py to use a different one:
app.run(debug=True, port=8080)  # Use 8080 instead

All the thresholds and parameters are adjustable. Here are the ones you'll probably want to tweak:
In app.py, the detect_expression() function has numbers like:
if mouth_opening > 0.15:  # Lower this to 0.12 for easier happy detection

Lower the number to make detection easier, raise it to make it stricter.
The pinch distance gets mapped to volume. Currently it goes from 0.02 (pinched) to 0.20 (open). You can adjust these:
volume = (thumb_index_dist - 0.02) / 0.18 * 100
# 0.02 = minimum distance, 0.18 = range

Smaller range = more sensitive. Larger range = more forgiving.
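If you retune those constants, it also helps to clamp the result so distances outside the range can't push the volume below 0 or above 100. One hedged way to do that, reusing the same constants as above:

```python
import numpy as np


def dist_to_volume(d, min_d=0.02, max_d=0.20):
    """Map a thumb-index distance onto 0-100 volume, clamped at both ends.

    np.interp clamps values outside [min_d, max_d] to the endpoint
    volumes automatically, so no manual min/max is needed.
    """
    return float(np.interp(d, [min_d, max_d], [0.0, 100.0]))
```

With the defaults, a fully pinched hand (d ≤ 0.02) gives 0 and a wide-open hand (d ≥ 0.20) gives 100, with a linear ramp in between.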
Around line 150 in app.py:
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 800)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 600)

Higher resolution = better quality, slower. Lower = faster, grainier.
If you don't want to see all the landmarks and lines on the video, comment out the drawing calls in process_frame():
# mp_drawing.draw_landmarks(frame, face_landmarks, ...)

On an M2 Mac, expect:
- 25-30 actual frames per second (targets 30, usually achieves 25-28)
- About 15-20ms per frame for detection
- 250-300MB of RAM in use
- 25-35% CPU usage
- End-to-end latency of roughly 50ms from real world to display
This isn't a problem for what the app does. It feels responsive even at 25 FPS.
The app runs with three main Python libraries:
MediaPipe β Does the heavy lifting. It uses pre-trained neural networks to find 468 points on your face and 21 points on each hand. Google's team trained these models and optimized them for actual phones and laptops, so they're pretty efficient. Apple Silicon specifically gets some optimizations.
OpenCV β A classic computer vision library. Handles reading from your webcam, drawing the landmarks on the frames, and encoding the video as JPEG.
Flask β A lightweight web framework. Serves the dashboard HTML page and streams the video to your browser.
The app captures frames from your camera, runs them through both MediaPipe models (face and hands), then calculates distances and ratios between the landmarks. For expressions, it's looking at about 5-6 key measurements. For gestures, it's checking finger positions and distances. These get converted to the expression name or gesture type, which then updates the web dashboard.
# Face Mesh: 468 facial landmark points
face_mesh = mp_face_mesh.FaceMesh(
    static_image_mode=False,
    max_num_faces=1,
    refine_landmarks=True,
    min_detection_confidence=0.5
)

# Hands: 21 landmark points per hand
hands = mp_hands.Hands(
    static_image_mode=False,
    max_num_hands=2,
    model_complexity=0  # Lightweight for M2
)

def detect_expression(landmarks):
    mouth_opening = distance(upper_lip, lower_lip) / mouth_width
    eye_openness = distance(top_eyelid, bottom_eyelid)
    smile_curve = mouth_upper_y - avg(smile_corners_y)
    if mouth_opening > 0.15 and eye_openness > 0.012:
        return "Happy" if smile_curve < -0.01 else "Surprise"
    # ... more logic

def pinch_center_to_volume(hand_landmarks):
    # assume thumb tip (4) and index tip (8) are touching
    thumb = hand_landmarks[4]
    index = hand_landmarks[8]
    center_x = (thumb.x + index.x) / 2.0  # normalized 0-1
    volume = int(max(0, min(100, center_x * 100)))
    set_system_volume(volume)  # osascript command

# Set volume to 75%
osascript -e "set volume output volume 75"
# Mute system
osascript -e "set volume output muted true"
# Unmute system
osascript -e "set volume output muted false"

| Route | Method | Purpose |
|---|---|---|
| / | GET | Render main dashboard (index.html) |
| /video_feed | GET | Stream MJPEG video (multipart/x-mixed-replace) |
| /status | GET | Return JSON: {expression, volume} |
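The /status endpoint from the table above can be sketched as a minimal Flask route (the shared `state` dict is illustrative; in the real app the vision loop updates it continuously):

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Shared state that the vision loop would update (names illustrative).
state = {"expression": "Neutral", "volume": 50}


@app.route("/status")
def status():
    # Polled by the dashboard every 500 ms to refresh the cards.
    return jsonify(state)
```

Keeping /status as a tiny JSON endpoint, separate from the heavy MJPEG stream, is what lets the dashboard poll it cheaply twice a second.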
Uses generator function for efficient multipart streaming:
def video_feed_generator():
    while True:
        ret, frame = cap.read()
        processed = process_frame(frame)
        ret, buffer = cv2.imencode('.jpg', processed)
        yield (b'--frame\r\n'
               b'Content-Type: image/jpeg\r\n\r\n' +
               buffer.tobytes() + b'\r\n')

- Video Stream: Centered img tag with MJPEG from /video_feed
- Status Cards: Expression & Volume with live updates
- Info Grid: Gesture descriptions and controls
- Legend: Expression states reference
- Connection Status: Real-time indicator
- AJAX requests to /status every 500ms
- Updates expression display and volume bar
- Fallback for disconnection handling
Edit detect_expression() in app.py:
if mouth_opening > 0.15:  # Decrease for more sensitive happy detection
    return "Happy"

The mapping now uses the x-coordinate of the thumb-index pinch center. You can tweak process_frame() where center_x is calculated, or adjust the detect_pinch threshold for how close the fingers must be.
# in process_frame, after gesture == "Pinch":
center_x, _ = get_thumb_index_center(hand_landmarks.landmark)
volume = int(max(0, min(100, center_x * 100)))

Edit video_feed_generator():
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280) # Increase for better quality
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 960)

Comment out in process_frame():
# mp_drawing.draw_landmarks(frame, face_landmarks, ...)

# Check camera access (note: /dev/video* is Linux-only; macOS lists cameras here)
system_profiler SPCameraDataType
# Grant permissions in System Preferences → Security & Privacy
# Add Terminal to Camera allowed apps

# Test osascript directly
osascript -e "set volume output volume 50"
osascript -e "set volume output muted true"
# Verify permissions: Settings → Privacy & Security → Microphone

- Reduce camera resolution (see customization above)
- Decrease detection confidence thresholds
- Use model_complexity=0 for hands (already done)
- Close unnecessary browser tabs
- Ensure good lighting
- If the video box appears very dark or almost black, the app will overlay a warning message in red text.
- The stream is local (no network involved), so "poor connection" usually means the camera feed is low-quality or blocked.
- Position yourself near a window or lamp and remove any obstructions from the webcam.
- Increase min_detection_confidence (default 0.7)
- Keep hands fully visible in frame
| Metric | Value |
|---|---|
| Camera Resolution | 800x600 |
| Target FPS | 30 |
| Face Detection Latency | ~15ms |
| Hand Detection Latency | ~12ms |
| Memory Usage | ~250-300MB |
| CPU Usage | 25-35% |
- No data is sent to external servers
- Video processing happens entirely on your machine
- Camera feed only used for local processing
- No recording or storage of frames
| Package | Version | Purpose |
|---|---|---|
| Flask | 2.3.2 | Web framework |
| OpenCV | 4.8.0.76 | Computer vision processing |
| MediaPipe | 0.10.0 | ML models for face/hand detection |
| NumPy | 1.24.3 | Numerical computations |
| Werkzeug | 2.3.6 | WSGI utilities |
- Create a detection function in app.py:

def detect_pinch(hand_landmarks):
    thumb_tip = hand_landmarks[4]
    middle_tip = hand_landmarks[12]
    distance = get_distance(thumb_tip, middle_tip)
    return distance < 0.05

- Call it in process_frame() and return the gesture name
# Replace the port in the last line of app.py:
app.run(debug=True, port=8080)

- Use a production WSGI server (Gunicorn):
pip install gunicorn
gunicorn --workers 4 --bind 0.0.0.0:5000 app:app

- Update the camera source to a network stream or handle headless mode
- Single Face: Current setup supports 1 face per frame
- Hand Count: Supports up to 2 hands (can be increased)
- Lighting: Works best in well-lit environments
- Background: Works better with contrasting backgrounds
- Occlusion: Handles partial hand occlusion well
This project is provided as-is for educational and personal use.
For issues or questions:
- Check the Troubleshooting section
- Verify all dependencies are installed
- Ensure camera permissions are granted
- Check console output for specific error messages
Built for Apple Silicon | Optimized for M2 & M3
Last Updated: March 5, 2026