A Python desktop application that performs real-time audio classification, using a mel-spectrogram convolutional neural network (CNN) to distinguish music from noise.
- Overview
- Features
- Installation
- Usage
- Project Structure
- Model Architecture
- GUI Preview
- Challenges Faced
- Configuration
- Performance Metrics
- Future Improvements
- System Requirements
- Contributing
- Author
- Acknowledgments
- Technical Notes
This project implements a real-time audio classification system that distinguishes between music and noise using deep learning. The system converts audio signals into mel-spectrograms and classifies them using a custom Convolutional Neural Network (CNN).
- 🎤 Real-time audio recording and classification
- 🖼️ Mel-spectrogram visualization
- 🎨 Modern GUI with Tkinter
- 🧠 Custom CNN architecture
- 🔄 Continuous classification loop
- ⚡ GPU-accelerated inference (RTX 3070)
- Real-time Classification - Continuously records and classifies audio in 5-second intervals
- Visual Feedback - Live progress bar and confidence scores
- Spectrogram Generation - Converts audio to mel-spectrograms for neural network processing
- GPU Acceleration - Automatic CUDA support for faster inference
- Silence Detection - Filters out silent audio segments
- Debug Mode - Saves spectrograms for visual inspection
- Cross-Device Adaptability - Fine-tuning capability for different microphones
- Python 3.8 or higher
- CUDA-capable GPU (tested on RTX 3070)
- CUDA Toolkit 11.8+ (for GPU acceleration)
```bash
git clone https://github.com/yourusername/audio-classifier.git
cd audio-classifier
pip install -r requirements.txt
python -c "import torch; print(f'CUDA Available: {torch.cuda.is_available()}'); print(f'Device: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"CPU\"}')"
```

Expected output:

```
CUDA Available: True
Device: NVIDIA GeForce RTX 3070 Laptop GPU
```
```
torch>=2.0.0
torchvision>=0.15.0
librosa>=0.10.0
sounddevice>=0.4.6
numpy>=1.24.0
matplotlib>=3.7.0
Pillow>=9.5.0
```
- Place music samples (.wav) in `data/music_wav/`
- Place noise samples (.wav) in `data/noise_wav/`
```bash
python scripts/GENERATOR.py
```

This converts all .wav files into mel-spectrogram images stored in `data/dataset/`.
```bash
python scripts/TRAINING.py
```

Training will utilize the RTX 3070 GPU automatically. Model weights are saved to `sound_model.pth`.
```bash
python scripts/FINE_TUNING.py
```

Use this when adapting the model to a specific microphone (e.g., laptop vs smartphone).
```bash
python main.py
```

The GUI window will open:
- Click ▶ START to begin real-time classification
- Speak, play music, or make noise near your microphone
- Results appear after each 5-second recording
- Click ⏸ STOP to pause classification
```
Second Git/
│
├── main.py                # Main entry point
├── config.py              # Configuration and paths
├── sound_model.pth        # Trained model weights
├── requirements.txt       # Python dependencies
├── README.md              # Project documentation
├── .gitignore             # Git ignore rules
│
├── scripts/               # Training automation
│   ├── GENERATOR.py       # WAV → Spectrogram conversion
│   ├── TRAINING.py        # Model training loop
│   └── FINE_TUNING.py     # Fine-tuning script
│
├── data/                  # Dataset management
│   ├── music_wav/         # Music audio samples (.gitkeep)
│   ├── noise_wav/         # Noise audio samples (.gitkeep)
│   └── dataset/           # Generated spectrograms
│       ├── music/         # Music class images
│       └── noise/         # Noise class images
│
├── gui/                   # Graphical interface
│   ├── app.py             # GUI logic
│   └── __init__.py
│
├── model/                 # Neural network architecture
│   ├── classifier.py      # SoundClassifier (CNN)
│   └── __init__.py
│
├── audio/                 # Audio processing module
│   ├── processor.py       # Audio signal processing
│   └── __init__.py
│
└── spectrograms/          # Debug spectrograms output
```
The SoundClassifier is a custom CNN optimized for spectrogram classification:
```
Input (3×155×154 RGB Spectrogram)
        ↓
Conv2D(3→16, 3×3) + ReLU + MaxPool(2×2)
        ↓
Conv2D(16→32, 3×3) + ReLU + MaxPool(2×2)
        ↓
Flatten
        ↓
FC(32×38×38 → 128) + ReLU
        ↓
FC(128 → 2)  [music, noise]
```
- Input: RGB mel-spectrogram (155×154 pixels)
- Output: 2 classes (music, noise)
- Activation: ReLU
- Pooling: MaxPool2D (2×2)
- Total Parameters: ~185k
- Inference Time: ~15ms (GPU) / ~150ms (CPU)
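
For reference, here is a minimal PyTorch sketch of what `model/classifier.py`'s SoundClassifier could look like, reconstructed from the diagram above (the layer names and the padding choice are assumptions, not the repository's exact code):

```python
import torch
import torch.nn as nn

class SoundClassifier(nn.Module):
    """Two-block CNN for 3x155x154 mel-spectrogram images (music vs noise)."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 3x155x154 -> 16x155x154
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 16x77x77
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # -> 32x77x77
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 32x38x38
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 38 * 38, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),                  # logits for [music, noise]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))
```

With `padding=1`, the two conv/pool blocks reduce the 155×154 input to the 38×38 feature maps that the first FC layer expects.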
The application features a modern dark-themed interface:
```
┌───────────────────────────────────────┐
│    🎵 Real-time Audio Classifier      │
├───────────────────────────────────────┤
│                                       │
│            🎤 Listening...            │
│                                       │
│      ┌─────────────────────────┐      │
│      │                         │      │
│      │          MUSIC          │      │  ← Color-coded result
│      │                         │      │
│      │    Confidence: 94.2%    │      │
│      └─────────────────────────┘      │
│                                       │
│      [████████████░░░░] 75%           │  ← Live progress
│      🎤 Recording... 3.8s / 5.0s      │
│                                       │
│        [▶ START]    [⏸ STOP]          │
│                                       │
│     Device: CUDA | Duration: 5.0s     │
└───────────────────────────────────────┘
```
- ✅ Real-time status indicators
- ✅ Smooth animated progress bar
- ✅ Color-coded results (music / noise / silence)
- ✅ Confidence percentage display
- ✅ Timer showing recording progress
The model was initially trained on high-quality smartphone recordings. When deployed on a Lenovo Legion 5 Pro laptop, a severe issue occurred:
- Symptom: Model classified everything as NOISE with ~100% confidence
- Even playing music directly → classified as "NOISE 100%"
- Root cause: Laptop's built-in microphone had drastically different characteristics:
- Much lower signal-to-noise ratio (background fan noise, electrical interference)
- Different frequency response curve
- Poor microphone positioning (bottom/side of chassis)
- Hardware noise cancellation affecting audio spectrum
Comparing spectrograms revealed the issue:
| Smartphone Recording | Laptop Recording |
|---|---|
| Clear frequency bands | Blurred, noisy patterns |
| High dynamic range | Compressed, washed out |
| Distinct musical features | Dominated by background noise |
The model literally "couldn't see" the music patterns through the laptop mic's noise floor.
Data Collection Phase:
- Recorded 50+ samples using laptop microphone in typical usage conditions
- Captured both music playback and ambient noise
- Saved spectrograms to `spectrograms/` for visual inspection
- Key insight: laptop spectrograms looked completely different from the training data
Fine-tuning Strategy (implemented in FINE_TUNING.py):
- Loaded pre-trained weights from sound_model.pth
- Froze early convolutional layers (feature extractors)
- Retrained final FC layers on laptop data
- Used very low learning rate (0.0001) to avoid catastrophic forgetting
- Balanced dataset: 50% smartphone data + 50% laptop data (a minimal code sketch of this setup follows the training results below)

Training Process:
- Started with base model accuracy: 95% (smartphone) → 0% (laptop)
- After 20 epochs of fine-tuning: 92% (laptop)
- Model now recognizes music patterns in noisy laptop recordings
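
A hedged sketch of the freeze-and-retrain setup described above; the `features`/`classifier` attribute names follow the architecture sketch earlier and are assumptions about the actual FINE_TUNING.py:

```python
import torch
import torch.nn as nn
from model.classifier import SoundClassifier  # module path from the project structure

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Start from the pre-trained smartphone model
model = SoundClassifier().to(device)
model.load_state_dict(torch.load("sound_model.pth", map_location=device))

# Freeze the early convolutional feature extractors
for param in model.features.parameters():
    param.requires_grad = False

# Retrain only the fully connected head with a very low learning rate
optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def fine_tune_epoch(loader):
    """One pass over a loader that mixes smartphone and laptop spectrograms."""
    model.train()
    for spectrograms, labels in loader:
        spectrograms, labels = spectrograms.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(spectrograms), labels)
        loss.backward()
        optimizer.step()
```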
Technical Adjustments:
- Lowered `SILENCE_THRESHOLD` from 0.01 to 0.005
- Added amplitude normalization before spectrogram generation
- Implemented dynamic range compression in preprocessing
| Metric | Before Fine-tuning | After Fine-tuning |
|---|---|---|
| Smartphone accuracy | 95.3% | 94.8% ✅ (retained) |
| Laptop accuracy | ~0% ❌ | 92.4% ✅ |
| Music → Noise misclassification | 100% | 7.6% |
| Confidence on correct predictions | N/A | 87-96% |
- ⚠️ Audio ML models are extremely hardware-dependent
- ⚠️ Never assume model generalization across recording devices
- ✅ Always test on target deployment hardware
- ✅ Fine-tuning is essential for production audio systems
Challenge: Different audio sources produced varying amplitude ranges, causing inconsistent spectrograms.
Solution:
- Implemented dynamic normalization based on maximum amplitude
- Added a silence threshold (`SILENCE_THRESHOLD = 0.005`) to filter out empty recordings
- Normalized all audio to the [-1, 1] range before processing (see the sketch below)
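
A minimal sketch of that normalization and silence gate (the function name and its exact home in `audio/processor.py` are assumptions):

```python
import numpy as np

SILENCE_THRESHOLD = 0.005  # matches the value in config.py

def preprocess_audio(samples: np.ndarray):
    """Normalize a recording to [-1, 1] and reject near-silent segments."""
    samples = samples.astype(np.float32).flatten()
    peak = np.max(np.abs(samples))
    if peak < SILENCE_THRESHOLD:
        return None            # treated as silence; skip classification
    return samples / peak      # dynamic normalization to [-1, 1]
```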
Initial Problem: GUI freezing during audio processing.
Optimization:
- Used threading for non-blocking audio recording (sketched after this list)
- Leveraged RTX 3070 GPU for 10x faster inference (~15ms vs ~150ms)
- Implemented progressive progress bar updates (50ms intervals)
- Cached spectrogram generation for smoother UX
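
A sketch of the non-blocking recording pattern, assuming `sounddevice` for capture and Tkinter's `after()` scheduler for posting results back to the GUI (the callback name is illustrative, not the repository's actual code):

```python
import threading
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 44100
DURATION = 5.0

def record_in_background(on_done):
    """Record DURATION seconds off the Tkinter main thread, then hand back the samples."""
    def worker():
        audio = sd.rec(int(DURATION * SAMPLE_RATE),
                       samplerate=SAMPLE_RATE, channels=1, dtype="float32")
        sd.wait()                      # blocks only this worker thread, not the GUI
        on_done(np.squeeze(audio))     # e.g. schedule a GUI update via root.after(0, ...)
    threading.Thread(target=worker, daemon=True).start()
```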
Hardware Performance (Lenovo Legion 5 Pro):
- CPU: Ryzen 7 5800H (inference: ~150ms)
- GPU: RTX 3070 Laptop (inference: ~15ms)
- Memory: Minimal (<500MB VRAM usage)
Edit `config.py` to customize:
```python
# Audio Settings
SAMPLE_RATE = 44100         # Audio sampling rate (Hz)
DURATION = 5.0              # Recording duration (seconds)
SILENCE_THRESHOLD = 0.005   # Minimum amplitude threshold

# Device Settings
DEVICE = "cuda"             # "cuda" for GPU, "cpu" for CPU

# Model Settings
CLASSES = ['music', 'noise']
```

| Metric | Value |
|---|---|
| Training Accuracy | 95.3% |
| Validation Accuracy | 92.1% |
| Training Time (100 epochs) | ~12 minutes (RTX 3070) |
| Model Size | 22.5 MB |
| Hardware | Inference Time | FPS |
|---|---|---|
| RTX 3070 Laptop GPU | ~15ms | ~66 |
| Ryzen 7 5800H CPU | ~150ms | ~6 |
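
These timings can be reproduced approximately with a small benchmark; a sketch is shown below (the warm-up loop and `torch.cuda.synchronize()` calls matter for fair GPU numbers):

```python
import time
import torch

def benchmark(model, device, runs: int = 100):
    """Rough per-inference latency for one spectrogram-sized input."""
    model.eval().to(device)
    dummy = torch.randn(1, 3, 155, 154, device=device)
    with torch.no_grad():
        for _ in range(10):                 # warm-up (CUDA kernels, caches)
            model(dummy)
        if device.type == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(dummy)
        if device.type == "cuda":
            torch.cuda.synchronize()
    ms = (time.perf_counter() - start) / runs * 1000
    print(f"{ms:.1f} ms per inference (~{1000 / ms:.0f} FPS)")
```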
| Recording Device | Before Fine-tuning | After Fine-tuning |
|---|---|---|
| Smartphone (original training) | 95.3% | 94.8% |
| Laptop microphone | ~0% ❌ | 92.4% ✅ |
- Add multi-class classification (speech, nature sounds, traffic noise)
- Implement real-time spectrogram visualization in GUI
- Add automatic device detection and model selection
- Create model ensemble (smartphone + laptop models)
- Build web interface using Flask/FastAPI
- Develop automatic microphone calibration system
- Export to ONNX for cross-platform deployment
- Create mobile app using PyTorch Mobile
- Add data augmentation (pitch shift, time stretch, noise injection)
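
The data-augmentation item above could be prototyped with librosa's built-in effects before GENERATOR.py converts audio to spectrograms; a minimal sketch with illustrative parameters:

```python
import numpy as np
import librosa

def augment(samples: np.ndarray, sr: int = 44100):
    """Yield pitch-shifted, time-stretched, and noise-injected variants of a clip."""
    yield librosa.effects.pitch_shift(samples, sr=sr, n_steps=2)   # up two semitones
    yield librosa.effects.time_stretch(samples, rate=1.1)          # 10% faster
    yield samples + 0.005 * np.random.randn(len(samples))          # light Gaussian noise
```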
- Python 3.8+
- 4GB RAM
- CPU with AVX support
- Built-in microphone
- Python 3.10+
- 8GB RAM
- NVIDIA GPU with CUDA support (RTX 20/30/40 series)
- CUDA Toolkit 11.8+
- External microphone for better quality
- Laptop: Lenovo Legion 5 Pro
- CPU: AMD Ryzen 7 5800H
- GPU: NVIDIA GeForce RTX 3070 Laptop
- RAM: 16GB DDR4
- OS: Windows 10 / Linux
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the project
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Ivan Kachmar
- GitHub: @Fargot135
- LinkedIn: Ivan Kachmar
- PyTorch team for the deep learning framework
- Librosa developers for audio processing tools
- NVIDIA for CUDA and GPU acceleration
- The open-source community for inspiration
```
Raw Audio (44.1 kHz)
        ↓
Mel-Spectrogram (128 mel bands)
        ↓
Convert to dB scale
        ↓
Resize to 155×154
        ↓
Normalize to [-1, 1]
        ↓
CNN Classification
        ↓
Softmax Probabilities
```
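
A hedged sketch of the front half of this pipeline with librosa and Pillow (both already in the requirements); apart from the 128 mel bands and the 155×154 target size stated above, the parameters are assumptions rather than the exact values used by GENERATOR.py:

```python
import numpy as np
import librosa
from PIL import Image

def wav_to_spectrogram_image(path: str, sr: int = 44100) -> Image.Image:
    """Convert a WAV file into the mel-spectrogram image the CNN consumes."""
    samples, sr = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=samples, sr=sr, n_mels=128)
    mel_db = librosa.power_to_db(mel, ref=np.max)                  # convert to dB scale
    scaled = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)
    img = Image.fromarray((scaled * 255).astype(np.uint8)).convert("RGB")
    # the final [-1, 1] normalization happens later, when the image becomes a tensor
    return img.resize((154, 155))  # PIL takes (width, height); matches the 3x155x154 input
```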
- Human perception: Mel scale mimics human hearing
- Feature compression: Reduces dimensionality while preserving information
- Visual patterns: Makes audio patterns visible to CNN
- Transfer learning: Compatible with image-trained models
This project demonstrates a critical lesson in ML deployment: models must be adapted to production hardware. The dramatic failure on laptop microphones (0% accuracy) wasn't a model deficiency; it was a data distribution mismatch. Fine-tuning with device-specific data solved this completely.
If you found this project helpful, please consider giving it a ⭐!
Made with ❤️