# Module 02: Audio Analysis and Quality Detection

**Difficulty**: ⭐⭐
**Estimated Time**: 45 minutes
**Prerequisites**: [Module 00: Music Library Manager](00_music_library_manager.ipynb), [Module 01: Metadata Management](01_metadata_management.ipynb)

## Learning Objectives
By the end of this notebook, you will be able to:
1. Extract audio properties (bitrate, sample rate, duration, channels)
2. Categorize files by quality tier (Poor/Low/Medium/High/Lossless)
3. Identify low-quality files that could be upgraded
4. Compare audio quality of duplicate files
5. Make intelligent decisions about which duplicates to keep
6. Generate audio quality reports for your library

## Overview
Audio quality matters! This module helps you understand and manage the technical quality of your music files. You'll learn to:

- **Analyze Properties**: Understand bitrate, sample rate, and codec information
- **Quality Tiers**: Categorize files from Low to Lossless quality
- **Smart Duplicates**: When you have duplicates, keep the best quality version
- **Upgrade Targets**: Find low-quality files worth re-downloading

### Why Audio Quality Matters
- **Storage vs Quality**: Balance file size with audio fidelity
- **Future-proofing**: Lossless files maintain quality for any conversion
- **Listening Experience**: Higher bitrates generally sound better, with diminishing returns at the top end
- **Smart Cleanup**: Remove low-quality duplicates automatically
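
The storage-versus-quality trade-off above comes down to simple arithmetic: for a roughly constant-bitrate stream, size is about bitrate times duration. A minimal sketch, where `estimate_size_mb` is a hypothetical helper (not part of this notebook) and tag, cover art, and container overhead are ignored:

```python
def estimate_size_mb(bitrate_kbps: float, duration_seconds: float) -> float:
    """Approximate audio payload size in MB (1 MB = 1024 * 1024 bytes).

    Hypothetical helper for illustration; ignores tags, cover art,
    and container overhead.
    """
    total_bits = bitrate_kbps * 1000 * duration_seconds  # kbps -> bits
    total_bytes = total_bits / 8
    return total_bytes / (1024 * 1024)


# A 4-minute (240 s) track at common MP3 bitrates
for kbps in (128, 192, 320):
    print(f"{kbps} kbps: {estimate_size_mb(kbps, 240):.2f} MB")
```

The same MB convention is used for `size_mb` later in this notebook, so the estimates are directly comparable with scanned file sizes.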

## 1. Setup and Configuration

In [None]:
# Import required libraries
import os
from pathlib import Path
from collections import defaultdict, Counter
from typing import List, Dict, Optional, Tuple
import hashlib
from datetime import datetime

# Audio analysis
from mutagen import File
from mutagen.mp3 import MP3, BitrateMode
from mutagen.flac import FLAC

# For better display
import pandas as pd
from IPython.display import display, HTML
import warnings

# Configuration
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', 100)
warnings.filterwarnings('ignore')

print("✓ Libraries imported successfully")

In [None]:
# Configure your music library path
MUSIC_LIBRARY_PATH = Path('../../music')

# Supported audio file extensions
AUDIO_EXTENSIONS = {'.mp3', '.flac', '.wav', '.m4a', '.aac', '.ogg', '.wma', '.opus'}

# Quality tier thresholds (in kbps)
QUALITY_TIERS = {
    'lossless': float('inf'),  # FLAC, WAV, etc.
    'high': 256,               # 256+ kbps
    'medium': 192,             # 192-255 kbps
    'low': 128,                # 128-191 kbps
    'poor': 0                  # <128 kbps
}

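# Lossless formats, detected by file extension only.
# Note: ALAC audio is usually stored in .m4a files, which are tiered by bitrate here.
LOSSLESS_FORMATS = {'.flac', '.wav', '.aiff', '.ape', '.alac'}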

# Verify the path exists
if not MUSIC_LIBRARY_PATH.exists():
    print(f"⚠️ Warning: Music library path does not exist: {MUSIC_LIBRARY_PATH.absolute()}")
    MUSIC_LIBRARY_PATH.mkdir(parents=True, exist_ok=True)
else:
    print(f"✓ Music library path: {MUSIC_LIBRARY_PATH.absolute()}")

## 2. Audio Properties Extraction

In [None]:
def get_audio_properties(file_path: Path) -> Dict:
    """
    Extract technical audio properties from a file.
    
    Returns:
        Dictionary with audio properties:
        - bitrate (kbps)
        - sample_rate (Hz)
        - duration (seconds)
        - channels (1=mono, 2=stereo)
        - bitrate_mode (CBR/VBR/ABR)
        - codec
    """
    properties = {
        'bitrate': None,
        'sample_rate': None,
        'duration': None,
        'channels': None,
        'bitrate_mode': None,
        'codec': None
    }
    
    try:
        audio = File(file_path)
        
        if audio is None:
            return properties
        
        # Get audio info
        info = audio.info
        
        # Bitrate (convert to kbps)
        if hasattr(info, 'bitrate') and info.bitrate:
            properties['bitrate'] = round(info.bitrate / 1000, 0)  # bits/s to kbps
        
        # Sample rate
        if hasattr(info, 'sample_rate') and info.sample_rate:
            properties['sample_rate'] = info.sample_rate
        
        # Duration (length in seconds)
        if hasattr(info, 'length') and info.length:
            properties['duration'] = round(info.length, 2)
        
        # Channels
        if hasattr(info, 'channels') and info.channels:
            properties['channels'] = info.channels
        
        # Bitrate mode (for MP3)
        if isinstance(audio, MP3):
            if hasattr(info, 'bitrate_mode'):
                mode_map = {
                    BitrateMode.CBR: 'CBR',
                    BitrateMode.VBR: 'VBR',
                    BitrateMode.ABR: 'ABR',
                    BitrateMode.UNKNOWN: 'Unknown'
                }
                properties['bitrate_mode'] = mode_map.get(info.bitrate_mode, 'Unknown')
        
        # Codec/format (inferred from the file extension, not read from the stream)
        properties['codec'] = file_path.suffix.lstrip('.').upper()
        
    except Exception:
        # Extraction failed partway: return whatever properties were collected
        pass
    
    return properties


def format_duration(seconds: float) -> str:
    """
    Format duration in seconds to MM:SS format.
    """
    if seconds is None:
        return "Unknown"
    
    minutes = int(seconds // 60)
    secs = int(seconds % 60)
    return f"{minutes}:{secs:02d}"


def get_channel_description(channels: int) -> str:
    """
    Get human-readable channel description.
    """
    if channels is None:
        return "Unknown"
    elif channels == 1:
        return "Mono"
    elif channels == 2:
        return "Stereo"
    else:
        return f"{channels} channels"

print("✓ Audio properties extraction functions loaded")
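
`format_duration()` reports only minutes and seconds, so a 75-minute recording is shown as `75:03`. As a sketch of one possible extension, the hypothetical `format_duration_hms` below (not part of this notebook) switches to `H:MM:SS` once a track reaches an hour and otherwise keeps the same `M:SS` output:

```python
from typing import Optional


def format_duration_hms(seconds: Optional[float]) -> str:
    """Format seconds as M:SS, or H:MM:SS once the track reaches an hour.

    Hypothetical variant of format_duration() for illustration.
    """
    if seconds is None:
        return "Unknown"

    total = int(seconds)  # drop fractional seconds, as format_duration() does
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)

    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


print(format_duration_hms(245.7))  # a typical song
print(format_duration_hms(4503))   # a long mix or audiobook chapter
```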

## 3. Quality Analysis and Categorization

In [None]:
def categorize_audio_quality(file_path: Path, bitrate: Optional[float] = None) -> str:
    """
    Categorize audio file into quality tiers.
    
    Args:
        file_path: Path to audio file
        bitrate: Optional bitrate in kbps (if already known)
    
    Returns:
        Quality tier: 'lossless', 'high', 'medium', 'low', or 'poor'
    """
    # Check if lossless format
    if file_path.suffix.lower() in LOSSLESS_FORMATS:
        return 'lossless'
    
    # Get bitrate if not provided
    if bitrate is None:
        props = get_audio_properties(file_path)
        bitrate = props.get('bitrate')
    
    if bitrate is None:
        return 'unknown'
    
    # Categorize based on bitrate
    if bitrate >= QUALITY_TIERS['high']:
        return 'high'
    elif bitrate >= QUALITY_TIERS['medium']:
        return 'medium'
    elif bitrate >= QUALITY_TIERS['low']:
        return 'low'
    else:
        return 'poor'


def scan_library_with_audio_analysis(library_path: Path) -> List[Dict]:
    """
    Scan library and include audio analysis for each file.
    Combines file info, metadata, and audio properties.
    
    Returns:
        List of dictionaries with complete file information
    """
    music_files = []
    
    print(f"Scanning library with audio analysis: {library_path}")
    
    file_count = 0
    for root, dirs, files in os.walk(library_path):
        for file in files:
            file_path = Path(root) / file
            
            if file_path.suffix.lower() in AUDIO_EXTENSIONS:
                file_count += 1
                
                if file_count % 50 == 0:
                    print(f"  Analyzed {file_count} files...")
                
                # Get file info
                stat_info = file_path.stat()
                
                # Get audio properties
                audio_props = get_audio_properties(file_path)
                
                # Determine quality tier
                quality_tier = categorize_audio_quality(file_path, audio_props.get('bitrate'))
                
                # Combine all information
                file_info = {
                    'filename': file,
                    'path': str(file_path),
                    'folder': str(file_path.parent.relative_to(library_path)),
                    'size_mb': stat_info.st_size / (1024 * 1024),
                    'extension': file_path.suffix.lower(),
                    # Audio properties
                    'bitrate': audio_props.get('bitrate'),
                    'sample_rate': audio_props.get('sample_rate'),
                    'duration': audio_props.get('duration'),
                    'duration_formatted': format_duration(audio_props.get('duration')),
                    'channels': audio_props.get('channels'),
                    'channel_desc': get_channel_description(audio_props.get('channels')),
                    'bitrate_mode': audio_props.get('bitrate_mode'),
                    'codec': audio_props.get('codec'),
                    'quality_tier': quality_tier
                }
                
                music_files.append(file_info)
    
    print(f"✓ Analysis complete: {file_count} files analyzed")
    return music_files


def get_quality_statistics(music_files: List[Dict]) -> Dict:
    """
    Generate statistics about audio quality across the library.
    """
    if not music_files:
        return {'error': 'No files to analyze'}
    
    total_files = len(music_files)
    
    # Quality tier distribution
    quality_counts = Counter(f.get('quality_tier', 'unknown') for f in music_files)
    
    # Codec distribution
    codec_counts = Counter(f.get('codec', 'Unknown') for f in music_files)
    
    # Bitrate statistics (excluding lossless)
    bitrates = [f['bitrate'] for f in music_files 
                if f.get('bitrate') is not None and f.get('quality_tier') != 'lossless']
    
    avg_bitrate = sum(bitrates) / len(bitrates) if bitrates else 0
    min_bitrate = min(bitrates) if bitrates else 0
    max_bitrate = max(bitrates) if bitrates else 0
    
    # Total duration
    total_duration = sum(f.get('duration', 0) for f in music_files if f.get('duration'))
    total_hours = total_duration / 3600
    
    return {
        'total_files': total_files,
        'quality_distribution': dict(quality_counts),
        'codec_distribution': dict(codec_counts),
        'bitrate_stats': {
            'average': round(avg_bitrate, 1),
            'min': min_bitrate,
            'max': max_bitrate
        },
        'total_duration_hours': round(total_hours, 2)
    }

print("✓ Quality analysis functions loaded")
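
The average bitrate reported by `get_quality_statistics()` can be pulled down by a handful of very low-bitrate files. A small sketch with made-up values, showing how the median from Python's standard `statistics` module could complement the average:

```python
from statistics import mean, median

# Synthetic bitrates in kbps (made-up values, not from a real library)
bitrates = [320, 320, 320, 320, 256, 64]

# One 64 kbps outlier lowers the mean noticeably but leaves the median alone
print(f"Mean:   {mean(bitrates):.1f} kbps")
print(f"Median: {median(bitrates):.1f} kbps")
```

When the two values diverge, the gap is a hint that a few outliers, rather than the library as a whole, are driving the average.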

## 4. Enhanced Duplicate Detection with Quality Comparison

In [None]:
def calculate_file_hash(file_path: Path, chunk_size: int = 8192) -> str:
    """
    Calculate MD5 hash of a file.
    """
    md5_hash = hashlib.md5()
    
    with open(file_path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5_hash.update(chunk)
    
    return md5_hash.hexdigest()


def compare_file_quality(file1: Dict, file2: Dict) -> Dict:
    """
    Compare two files and determine which has better quality.
    
    Returns:
        Dictionary with comparison results and recommendation
    """
    quality_ranking = {
        'lossless': 5,
        'high': 4,
        'medium': 3,
        'low': 2,
        'poor': 1,
        'unknown': 0
    }
    
    q1_score = quality_ranking.get(file1.get('quality_tier', 'unknown'), 0)
    q2_score = quality_ranking.get(file2.get('quality_tier', 'unknown'), 0)
    
    # Determine better file
    if q1_score > q2_score:
        better_file = 1
        reason = f"{file1['quality_tier']} > {file2['quality_tier']}"
    elif q2_score > q1_score:
        better_file = 2
        reason = f"{file2['quality_tier']} > {file1['quality_tier']}"
    else:
        # Same quality tier, compare bitrates
        br1 = file1.get('bitrate', 0) or 0
        br2 = file2.get('bitrate', 0) or 0
        
        if br1 > br2:
            better_file = 1
            reason = f"{br1} kbps > {br2} kbps"
        elif br2 > br1:
            better_file = 2
            reason = f"{br2} kbps > {br1} kbps"
        else:
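            # Identical quality: prefer the smaller file (a tie keeps file 2)
            if file1['size_mb'] < file2['size_mb']:
                better_file = 1
                reason = "Smaller file size"
            elif file2['size_mb'] < file1['size_mb']:
                better_file = 2
                reason = "Smaller file size"
            else:
                better_file = 2
                reason = "Identical quality and size"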
    
    return {
        'file1': {
            'name': file1['filename'],
            'quality': file1.get('quality_tier', 'unknown'),
            'bitrate': file1.get('bitrate'),
            'size_mb': round(file1['size_mb'], 2)
        },
        'file2': {
            'name': file2['filename'],
            'quality': file2.get('quality_tier', 'unknown'),
            'bitrate': file2.get('bitrate'),
            'size_mb': round(file2['size_mb'], 2)
        },
        'recommendation': f"Keep File {better_file}",
        'reason': reason,
        'keep_file': better_file,
        'remove_file': 2 if better_file == 1 else 1
    }


def find_duplicates_with_quality(music_files: List[Dict], 
                                 method: str = 'size') -> List[Dict]:
    """
    Find duplicate files and provide quality-based recommendations.
    
    Args:
        method: 'size' for quick check, 'hash' for exact duplicates
    
    Returns:
        List of duplicate groups with quality comparisons
    """
    duplicates = []
    
    if method == 'size':
        # Group by file size
        size_groups = defaultdict(list)
        for file in music_files:
            # Round to 2 decimal places for grouping
            size_key = round(file['size_mb'], 2)
            size_groups[size_key].append(file)
        
        # Find groups with multiple files
        for size, files in size_groups.items():
            if len(files) > 1:
                # Only the first two files are compared; larger groups need further checks
                comparison = compare_file_quality(files[0], files[1])
                duplicates.append({
                    'type': 'Same size',
                    'size_mb': size,
                    'count': len(files),
                    'files': files,
                    'comparison': comparison
                })
    
    elif method == 'hash':
        # Group by file hash (exact duplicates)
        print("Calculating file hashes (this may take a while)...")
        hash_groups = defaultdict(list)
        
        for i, file in enumerate(music_files, 1):
            if i % 100 == 0:
                print(f"  Processed {i}/{len(music_files)} files...")
            
            file_hash = calculate_file_hash(Path(file['path']))
            hash_groups[file_hash].append(file)
        
        # Find groups with multiple files
        for file_hash, files in hash_groups.items():
            if len(files) > 1:
                comparison = compare_file_quality(files[0], files[1])
                duplicates.append({
                    'type': 'Exact duplicate',
                    'hash': file_hash[:8],
                    'count': len(files),
                    'files': files,
                    'comparison': comparison
                })
    
    return duplicates


def find_low_quality_files(music_files: List[Dict], 
                          threshold: str = 'low') -> pd.DataFrame:
    """
    Find files below a quality threshold.
    
    Args:
        threshold: 'low' (finds poor quality), 'medium' (finds low and poor), etc.
    """
    quality_order = ['poor', 'low', 'medium', 'high', 'lossless']
    threshold_index = quality_order.index(threshold)
    
    # Keep only tiers strictly below the threshold, as the docstring describes
    low_quality = [
        file for file in music_files
        if file.get('quality_tier') in quality_order[:threshold_index]
    ]
    
    if low_quality:
        return pd.DataFrame(low_quality)
    else:
        return pd.DataFrame()

print("✓ Enhanced duplicate detection functions loaded")
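
Hashing every file is slow on a large library, while matching on rounded size alone produces false positives. One common way to combine the two, sketched here with a hypothetical `find_exact_duplicates` helper (not part of this notebook), is to group by exact byte size first and hash only the files whose sizes collide:

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def find_exact_duplicates(paths: List[Path]) -> List[List[Path]]:
    """Group byte-identical files, hashing only files that share a size.

    Hypothetical helper for illustration; not part of this notebook.
    """
    # Stage 1: files with different byte sizes cannot be identical
    by_size: Dict[int, List[Path]] = defaultdict(list)
    for path in paths:
        by_size[path.stat().st_size].append(path)

    # Stage 2: hash only the size collisions
    groups: List[List[Path]] = []
    for candidates in by_size.values():
        if len(candidates) < 2:
            continue
        by_hash: Dict[str, List[Path]] = defaultdict(list)
        for path in candidates:
            digest = hashlib.md5()
            with open(path, 'rb') as f:
                for chunk in iter(lambda: f.read(8192), b''):
                    digest.update(chunk)
            by_hash[digest.hexdigest()].append(path)
        groups.extend(g for g in by_hash.values() if len(g) > 1)
    return groups


# Demo on throwaway files with made-up contents
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'a.mp3').write_bytes(b'same-bytes')
    (root / 'b.mp3').write_bytes(b'same-bytes')
    (root / 'c.mp3').write_bytes(b'diff-bytes')  # same size, different content
    (root / 'd.mp3').write_bytes(b'other')       # unique size, never hashed
    dupes = find_exact_duplicates(sorted(root.iterdir()))
    print([[p.name for p in group] for group in dupes])
```

Files with a unique size are never read, so the cost of hashing scales with the number of size collisions rather than with the whole library.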

---
## 5. Usage Examples

### 5.1 Scan Library with Audio Analysis

In [None]:
# Scan library with complete audio analysis
print("Scanning library with audio analysis...\n")
all_music_files = scan_library_with_audio_analysis(MUSIC_LIBRARY_PATH)

print(f"\nAnalyzed {len(all_music_files)} audio files")

# Display sample
if all_music_files:
    sample_df = pd.DataFrame(all_music_files[:10])
    display_cols = ['filename', 'quality_tier', 'bitrate', 'codec', 'duration_formatted', 'size_mb']
    display(sample_df[display_cols])
else:
    print("\nNo music files found.")

### 5.2 Audio Quality Statistics Report

In [None]:
# Generate quality statistics
if all_music_files:
    print("🎵 Audio Quality Report")
    print("=" * 60 + "\n")
    
    stats = get_quality_statistics(all_music_files)
    
    print(f"Total Files: {stats['total_files']}")
    print(f"Total Duration: {stats['total_duration_hours']:.2f} hours\n")
    
    print("Quality Distribution:")
    for tier, count in sorted(stats['quality_distribution'].items(), 
                              key=lambda x: x[1], reverse=True):
        percentage = (count / stats['total_files']) * 100
        print(f"  {tier.capitalize()}: {count} files ({percentage:.1f}%)")
    
    print("\nCodec Distribution:")
    for codec, count in sorted(stats['codec_distribution'].items(), 
                               key=lambda x: x[1], reverse=True):
        print(f"  {codec}: {count} files")
    
    print("\nBitrate Statistics (lossy files):")
    print(f"  Average: {stats['bitrate_stats']['average']} kbps")
    print(f"  Range: {stats['bitrate_stats']['min']} - {stats['bitrate_stats']['max']} kbps")
else:
    print("No files to analyze.")

### 5.3 Find Low Quality Files

In [None]:
# Find poor quality files (< 128 kbps)
if all_music_files:
    print("Finding low quality files...\n")
    
    low_quality = find_low_quality_files(all_music_files, threshold='low')
    
    if len(low_quality) > 0:
        print(f"⚠️ Found {len(low_quality)} low quality files:\n")
        cols = ['filename', 'quality_tier', 'bitrate', 'codec', 'size_mb']
        display(low_quality[cols].head(20))

        print("\n💡 These files are good candidates for re-downloading in higher quality")
    else:
        print("✓ No low quality files found!")
else:
    print("No files to check.")

### 5.4 Find Duplicates with Quality Comparison

In [None]:
# Quick duplicate check by file size
if all_music_files:
    print("Finding duplicates with quality analysis...\n")
    
    duplicates = find_duplicates_with_quality(all_music_files, method='size')
    
    if duplicates:
        print(f"Found {len(duplicates)} potential duplicate groups:\n")
        
        for i, dup in enumerate(duplicates[:5], 1):  # Show first 5
            print(f"Group {i}: {dup['type']} ({dup['count']} files)")
            comp = dup['comparison']
            print(f"  File 1: {comp['file1']['name']}")
            print(f"    Quality: {comp['file1']['quality']}, "
                  f"Bitrate: {comp['file1']['bitrate']} kbps, "
                  f"Size: {comp['file1']['size_mb']} MB")
            print(f"  File 2: {comp['file2']['name']}")
            print(f"    Quality: {comp['file2']['quality']}, "
                  f"Bitrate: {comp['file2']['bitrate']} kbps, "
                  f"Size: {comp['file2']['size_mb']} MB")
            print(f"  ✓ Recommendation: {comp['recommendation']} ({comp['reason']})")
            print()
        
        if len(duplicates) > 5:
            print(f"... and {len(duplicates) - 5} more duplicate groups")
    else:
        print("✓ No duplicates found")
else:
    print("No files to check.")
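
The report above compares only the first two files of each group. For groups with three or more candidates, the same ranking rules can be applied to the whole group at once. A minimal sketch, assuming records shaped like the scanner's output; `pick_best_in_group` and `QUALITY_RANKING` are hypothetical names introduced here for illustration:

```python
from typing import Dict, List

# Same tier ranking that compare_file_quality() uses
QUALITY_RANKING = {'lossless': 5, 'high': 4, 'medium': 3,
                   'low': 2, 'poor': 1, 'unknown': 0}


def pick_best_in_group(files: List[Dict]) -> Dict:
    """Return the file to keep from a duplicate group of any size.

    Hypothetical helper: higher tier wins, then higher bitrate,
    then smaller file size.
    """
    return max(
        files,
        key=lambda f: (
            QUALITY_RANKING.get(f.get('quality_tier', 'unknown'), 0),
            f.get('bitrate') or 0,
            -f['size_mb'],  # negate so that smaller files rank higher
        ),
    )


# Made-up records for three versions of the same song
group = [
    {'filename': 'song_128.mp3', 'quality_tier': 'low', 'bitrate': 128.0, 'size_mb': 3.7},
    {'filename': 'song_320.mp3', 'quality_tier': 'high', 'bitrate': 320.0, 'size_mb': 9.2},
    {'filename': 'song_256.mp3', 'quality_tier': 'high', 'bitrate': 256.0, 'size_mb': 7.3},
]
best = pick_best_in_group(group)
print(f"Keep: {best['filename']}")
```

Sorting on a tuple keeps the tie-break order explicit: tier first, then bitrate, then size.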

### 5.5 Filter by Quality Tier

In [None]:
# Show all lossless files
if all_music_files:
    lossless = [f for f in all_music_files if f.get('quality_tier') == 'lossless']
    
    if lossless:
        print(f"Found {len(lossless)} lossless files:\n")
        df = pd.DataFrame(lossless)
        display(df[['filename', 'codec', 'sample_rate', 'duration_formatted', 'size_mb']].head(20))
    else:
        print("No lossless files in library")
else:
    print("No files to display.")

### 5.6 Detailed File Properties

In [None]:
# Show detailed properties for a specific file
if all_music_files:
    file_info = all_music_files[0]  # First file as example
    
    print("Detailed Audio Properties\n" + "=" * 60)
    print(f"File: {file_info['filename']}")
    print(f"Folder: {file_info['folder']}")
    print()
    print(f"Quality Tier: {file_info['quality_tier'].upper()}")
    # Keys always exist but may hold None, so fall back with `or` instead of a .get() default
    print(f"Codec: {file_info.get('codec') or 'Unknown'}")
    print(f"Bitrate: {file_info.get('bitrate') or 'N/A'} kbps")
    print(f"Bitrate Mode: {file_info.get('bitrate_mode') or 'N/A'}")
    print(f"Sample Rate: {file_info.get('sample_rate') or 'N/A'} Hz")
    print(f"Channels: {file_info.get('channel_desc', 'Unknown')}")
    print(f"Duration: {file_info.get('duration_formatted', 'Unknown')}")
    print(f"File Size: {file_info['size_mb']:.2f} MB")
else:
    print("No files to display.")

## 6. Summary

### What We've Learned

In this module, we've added comprehensive audio analysis capabilities:

1. **Audio Properties**: Extract bitrate, sample rate, duration, channels
2. **Quality Tiers**: Categorize files as Poor/Low/Medium/High/Lossless
3. **Quality Statistics**: Understand the overall quality of your library
4. **Smart Duplicates**: Compare duplicates and know which to keep
5. **Upgrade Targets**: Identify low-quality files worth re-downloading
6. **Codec Analysis**: See what formats you have

### Key Functions

**Analysis:**
- `get_audio_properties()` - Extract all technical properties
- `categorize_audio_quality()` - Determine quality tier
- `scan_library_with_audio_analysis()` - Complete library scan

**Quality Insights:**
- `get_quality_statistics()` - Library-wide quality report
- `find_low_quality_files()` - Find files below threshold

**Duplicates:**
- `compare_file_quality()` - Compare two files
- `find_duplicates_with_quality()` - Find duplicates with recommendations

### Quality Tier Reference

- **Lossless**: FLAC, WAV, ALAC - No audio data discarded, large files
- **High**: 256+ kbps - Excellent quality; most listeners cannot tell it apart from lossless
- **Medium**: 192-255 kbps - Good quality, acceptable for most use cases
- **Low**: 128-191 kbps - Noticeable quality loss, consider upgrading
- **Poor**: <128 kbps - Low quality, definitely worth upgrading
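
The boundaries listed above are easy to get off by one, so it can help to check them directly. A table-driven sketch that mirrors the thresholds in `QUALITY_TIERS`; `TIER_FLOORS` and `tier_for_bitrate` are hypothetical names used only for this illustration, and lossless detection by extension is left out:

```python
from typing import Optional

# Lower bounds in kbps, highest tier first (mirrors QUALITY_TIERS)
TIER_FLOORS = [('high', 256), ('medium', 192), ('low', 128), ('poor', 0)]


def tier_for_bitrate(bitrate_kbps: Optional[float]) -> str:
    """Map a lossy bitrate to a tier name (hypothetical helper)."""
    if bitrate_kbps is None:
        return 'unknown'
    for name, floor in TIER_FLOORS:
        if bitrate_kbps >= floor:
            return name
    return 'poor'


# Print the tier on each side of every boundary
for kbps in (320, 256, 255, 192, 191, 128, 127):
    print(f"{kbps:>3} kbps -> {tier_for_bitrate(kbps)}")
```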

### Best Practices

1. **Storage vs Quality**: Lossless for archival, high quality for daily listening
2. **Duplicate Resolution**: Keep highest quality, remove lower versions
3. **Upgrade Strategy**: Focus on favorite songs first when upgrading quality
4. **Format Choice**: MP3 at 320 kbps or FLAC are the safest bets
5. **Consistency**: Try to maintain consistent quality across the library

### Next Steps

Continue to:
- **Module 04: Advanced Organization** - Organize by quality tier
- **Module 06: Visualizations** - Chart quality distribution
- **Module 08: Quality Validation** - Detect corrupted files

### Additional Resources

- [Audio Bitrate Comparison](https://en.wikipedia.org/wiki/Bit_rate#Audio)
- [Lossless vs Lossy Compression](https://en.wikipedia.org/wiki/Data_compression#Audio)
- [Audio Codec Comparison](https://en.wikipedia.org/wiki/Comparison_of_audio_coding_formats)