A comprehensive system for predicting advertisement performance using audio analysis, speech recognition, and text analysis from MP4 video files.
This system implements a modular audio-focused approach to advertisement performance prediction:
- Audio Extraction โ Extract audio from MP4 files using FFmpeg
- Audio Feature Analysis โ Extract 10 critical audio features (RMS, pitch, MFCCs, etc.)
- Speech-to-Text โ Convert speech to text using SpeechRecognition
- Text Feature Analysis โ Extract 10 critical text features (sentiment, CTAs, hooks)
- Content-Based Model โ Predict performance using trained Random Forest classifier
MP4 Video โ Audio Extraction โ Audio Features โ
โ Content-Based Model โ Performance Prediction
MP4 Video โ Audio Extraction โ Transcript โ Text Features โ
- ContentBasedPredictor: Main predictor using 313-trained Random Forest model
- AudioExtractor: FFmpeg-based audio extraction from video files
- AudioFeatureExtractor: 10 critical audio features using librosa
- TranscriptExtractor: Speech recognition using SpeechRecognition
- TranscriptFeatureExtractor: 10 critical text features for marketing effectiveness
- Web App: Flask-based upload and prediction interface
# Clone the repository
git clone <repository-url>
cd AUDIO_FEATURES
# Install with uv (recommended)
uv sync
# Or install with pip
pip install -e .# Start the Flask web app
cd web_app
python app.py
# Open browser to http://localhost:5003
# Upload MP4 video and get performance predictionfrom content_based_predictor import ContentBasedPredictor
# Initialize predictor
predictor = ContentBasedPredictor(models_dir="models")
# Predict performance for a video
result = predictor.predict_performance(video_path="advertisement.mp4")
print(f"Prediction: {result['prediction']}")
print(f"Confidence: {result['confidence']:.3f}")
print(f"Probabilities: {result['probabilities']}")- Model Type: Random Forest Classifier
- Accuracy: 96.8%
- Cross-validation: 94.4% ยฑ 2.0%
- Training Samples: 313 videos from YouTube dataset
- Class Distribution:
- High Performance: 186 videos (59.4%)
- Low Performance: 127 videos (40.6%)
Audio Features (10 critical features):
rms_mean: Overall energy/loudnessdynamic_range: Energy variationpitch_mean: Voice pitchspeech_rate: Speaking pacespectral_centroid_mean: Audio brightnessmfcc_1_mean,mfcc_2_mean: Speech characteristicspause_duration_mean: Speaking rhythmonset_rate: Event densityzero_crossing_rate: Speech vs music
Text Features (10 critical features):
hook_curiosity_words_count: "Secret", "shocking" wordshook_action_words_count: "Get", "try", "buy" CTAshook_personal_pronouns_count: "You", "your" pronounshook_sentiment_polarity: Opening emotional tonecall_to_action_count: Direct conversion triggersaction_words_count: Action-oriented languagesentiment_polarity: Overall emotional toneexclamation_count: Energy levelquestion_count: Engagement techniquesword_count: Content density
The system uses a trained model with these settings:
# Model configuration (already trained)
model_type = "random_forest"
feature_count = 20 # 10 audio + 10 text features
performance_threshold = 15.0 # Combined score threshold# Train new model with your data
python scripts/content_based_trainer_313.py
# This will:
# 1. Load training data from results/content_based_313_training_data.csv
# 2. Extract audio and transcript features
# 3. Train Random Forest classifier
# 4. Save model to models/content_based_313_model.pklYour training data should be in CSV format with these columns:
video_id,brand_name,channel_title,duration_seconds,view_count,like_count,comment_count,channel_subscriber_count,publish_age_days,performance_label
video_001,nike,Nike Official,30,1000000,50000,5000,1000000,7,high
video_002,apple,Apple,45,500000,10000,1000,5000000,14,low
The web app provides a simple API:
import requests
# Upload video and get prediction
with open("advertisement.mp4", "rb") as f:
files = {"video_file": f}
data = {"brand_name": "nike"}
response = requests.post("http://localhost:5003/api/predict", files=files, data=data)
result = response.json()
print(f"Prediction: {result['prediction']}")
print(f"Confidence: {result['confidence']}")# Get prediction statistics
predictor = ContentBasedPredictor()
stats = predictor.get_stats()
print(f"Total Predictions: {stats['total_predictions']}")
print(f"Success Rate: {stats['success_rate']:.3f}")
print(f"Average Processing Time: {stats['avg_processing_time']:.2f}s")AUDIO_FEATURES/
โโโ content_based_predictor.py # Main predictor class
โโโ web_app/ # Flask web application
โ โโโ app.py # Flask server
โ โโโ templates/index.html # Upload interface
โ โโโ static/ # CSS/JS assets
โโโ src/audio_features/ # Core audio processing
โ โโโ audio_extractor.py # Audio extraction
โ โโโ audio_features.py # Audio feature extraction
โ โโโ transcript_extractor.py # Speech-to-text
โ โโโ transcript_features.py # Text feature extraction
โ โโโ main_pipeline.py # Complete pipeline
โโโ scripts/ # Training scripts
โ โโโ content_based_trainer_313.py
โโโ models/ # Trained models
โ โโโ content_based_313_model.pkl
โ โโโ content_based_313_scaler.pkl
โ โโโ content_based_313_features.pkl
โโโ results/ # Training results
โโโ pyproject.toml # Project configuration
# Install development dependencies
uv sync --group dev
# Run tests
python test_types_only.py
# Run linting
uv run ruff check .
uv run ruff format .- Fork the repository
- Create a feature branch:
git checkout -b feature-name - Make changes and add tests
- Run linting and tests:
uv run ruff check . && python test_types_only.py - Commit changes:
git commit -m "Add feature" - Push to branch:
git push origin feature-name - Create a Pull Request
- Python 3.9+
- FFmpeg (for audio extraction)
- librosa (audio analysis)
- scikit-learn (machine learning)
- SpeechRecognition (speech-to-text)
- Flask (web application)
# Install FFmpeg (macOS)
brew install ffmpeg
# Install Python dependencies
uv syncMIT License - see LICENSE file for details.
- Issues: [GitHub Issues]
- Discussions: [GitHub Discussions]
Perfect for:
- โ Content Strategy: Should we publish this video?
- โ A/B Testing: Which version will perform better?
- โ Quality Control: Flag potentially low-performing content
- โ Campaign Planning: Predict which videos need promotion
- โ Budget Allocation: Focus resources on high-potential content
Practical Example:
Input: Nike running shoe ad, 30s, 100K subscribers
Output: "HIGH performance predicted (97% confidence)"
โ Decision: Publish with standard promotion budget
When ready to enhance the model:
- Add Visual Features: Thumbnail analysis, face detection, scene analysis
- Advanced Text Features: Sentiment analysis, topic modeling, viral hooks
- External Data: Trending topics, competitor analysis, seasonality
- More Data: Expand beyond 313 samples for better generalization
- Real-time Processing: Support for live video analysis
- Dataset Size: 313 videos (good for proof-of-concept)
- Feature Set: Audio + text only (no visual analysis)
- Threshold: Fixed threshold of 15.0 (could be optimized per brand)
- Language: Optimized for English content
This system successfully demonstrates:
- โ High accuracy (96.8%) binary classification
- โ Robust cross-validation performance
- โ Production-ready model artifacts
- โ Easy-to-use web interface
- โ Comprehensive audio + text feature extraction
Ready for production use in video performance prediction workflows!