# Acoustic Shield - Training & Deployment

This notebook trains and deploys the audio classification model using Amazon SageMaker.

## Pipeline Overview
1. **Setup Configuration** - Define S3 paths, IAM role, and hyperparameters
2. **Create Training Job** - Fine-tune wav2vec2-base on audio data
3. **Deploy Endpoint** - Deploy real-time inference endpoint
4. **Test Endpoint** - Smoke test with sample audio

## Dataset Structure
Training data must be organized in audiofolder format:
```
s3://acousticshield-ml/train/
├── Normal/*.wav
├── TireSkid/*.wav
├── EmergencyBraking/*.wav
└── CollisionImminent/*.wav
```
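
Before starting a job, it can help to confirm that every class folder actually contains WAV files. The sketch below counts `.wav` objects per class from a plain list of S3 keys (for example, gathered with a `list_objects_v2` paginator). The helper name `count_wavs_per_class` and the sample keys are hypothetical and not part of the notebook.

```python
from collections import Counter

# Class folders taken from the dataset layout above.
EXPECTED_CLASSES = ["Normal", "TireSkid", "EmergencyBraking", "CollisionImminent"]


def count_wavs_per_class(keys, prefix="train/"):
    """Count .wav objects per top-level class folder under `prefix`."""
    counts = Counter()
    for key in keys:
        if not key.startswith(prefix) or not key.lower().endswith(".wav"):
            continue
        parts = key[len(prefix):].split("/")
        if len(parts) >= 2:  # class folder + file name
            counts[parts[0]] += 1
    return counts


# Example with hypothetical keys; real keys would come from S3.
sample_keys = [
    "train/Normal/a.wav",
    "train/Normal/b.wav",
    "train/TireSkid/c.wav",
    "train/README.txt",
]
counts = count_wavs_per_class(sample_keys)
missing = [c for c in EXPECTED_CLASSES if counts[c] == 0]
print(dict(counts))
print("Missing classes:", missing)
```

An empty `missing` list means each expected class has at least one training clip.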

## Audio Requirements
- **Format**: WAV (mono or stereo)
- **Sample Rate**: 16 kHz (auto-resampled if different)
- **Duration**: 1-5 seconds recommended
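
These requirements can be checked locally before uploading. The following is a minimal sketch using Python's standard `wave` module, which reads uncompressed PCM WAV files only; other encodings would need a library such as `soundfile`. The helper names and the 1-5 second bounds as hard limits are assumptions for illustration.

```python
import io
import wave


def describe_wav(wav_bytes):
    """Return channel count, sample rate (Hz), and duration (s) of a PCM WAV."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_rate": rate,
            "duration_s": wf.getnframes() / rate,
        }


def meets_recommendations(info, min_s=1.0, max_s=5.0):
    """True if the clip is mono/stereo and within the recommended duration."""
    return info["channels"] in (1, 2) and min_s <= info["duration_s"] <= max_s


# Build a 2-second silent 16 kHz mono clip in memory to exercise the helpers.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)      # 16-bit PCM
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 32000)

info = describe_wav(buf.getvalue())
print(info, meets_recommendations(info))
```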

## Customization Options
- **Change epochs**: Modify `EPOCHS` parameter (default: 4)
- **Change learning rate**: Modify `LEARNING_RATE` (default: 3e-5)
- **Change batch size**: Modify `BATCH_SIZE` (default: 8)
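- **Skip the separate validation set**: Set `VAL_S3 = None` (per the Step 4 message, validation data is then split 90/10 from the training set)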

## Step 1: Configuration

⚠️ **IMPORTANT**: Do not hardcode a region. The notebook reads the region from the training bucket's location (see Step 2) and passes it to the SageMaker session.

In [None]:
import boto3
import sagemaker
from sagemaker.huggingface import HuggingFace
from sagemaker.serializers import DataSerializer
from sagemaker.deserializers import JSONDeserializer
import json
from datetime import datetime

print("✓ Imports complete")

In [None]:
# ============================================================================
# CONFIGURATION - Modify these parameters as needed
# ============================================================================

# S3 paths
TRAIN_S3 = "s3://acousticshield-ml/train/"           # Training data (required)
VAL_S3 = "s3://acousticshield-ml/val/"               # Validation data (optional, set to None to skip)
MODEL_OUTPUT_S3 = "s3://acousticshield-ml/models/"   # Model artifacts output

# IAM Role
ROLE_NAME = "role-sagemaker-train"                    # IAM role name (not ARN)

# Hyperparameters
EPOCHS = 4                 # Number of training epochs (more can improve accuracy but may overfit)
LEARNING_RATE = 3e-5       # Learning rate (decrease if training is unstable)
BATCH_SIZE = 8             # Batch size per device (reduce if OOM errors)
WARMUP_STEPS = 500         # Number of warmup steps
GRADIENT_ACCUMULATION = 1  # Gradient accumulation steps (increase for effective larger batch)

# Instance configuration
TRAIN_INSTANCE_TYPE = "ml.g4dn.xlarge"  # GPU instance for training
TRAIN_INSTANCE_COUNT = 1                 # Number of training instances
ENDPOINT_INSTANCE_TYPE = "ml.m5.xlarge" # CPU instance for inference
ENDPOINT_INSTANCE_COUNT = 1              # Number of endpoint instances

# Model configuration
TRANSFORMERS_VERSION = "4.44"  # HuggingFace Transformers version
PYTORCH_VERSION = "2.3"        # PyTorch version
PYTHON_VERSION = "py311"       # Python version

print("="*80)
print("Acoustic Shield - Training Configuration")
print("="*80)
print(f"Training data: {TRAIN_S3}")
print(f"Validation data: {VAL_S3 if VAL_S3 else 'None (will split from train)'}")
print(f"Model output: {MODEL_OUTPUT_S3}")
print(f"IAM Role: {ROLE_NAME}")
print(f"\nHyperparameters:")
print(f"  Epochs: {EPOCHS}")
print(f"  Learning Rate: {LEARNING_RATE}")
print(f"  Batch Size: {BATCH_SIZE}")
print(f"  Warmup Steps: {WARMUP_STEPS}")
print(f"\nInstances:")
print(f"  Training: {TRAIN_INSTANCE_TYPE} x {TRAIN_INSTANCE_COUNT}")
print(f"  Endpoint: {ENDPOINT_INSTANCE_TYPE} x {ENDPOINT_INSTANCE_COUNT}")
print("="*80)
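
When adjusting these hyperparameters, it is worth comparing `WARMUP_STEPS` with the total number of optimizer steps: on a small dataset, 500 warmup steps can span most or all of the run. The sketch below is an approximation that assumes the training script keeps the last partial batch and applies warmup per optimizer step. The function name and the sample count of 2000 are made up for illustration.

```python
import math


def total_optimizer_steps(num_samples, batch_size, grad_accum, epochs, num_devices=1):
    """Approximate optimizer steps for a run (last partial batch kept)."""
    batches_per_epoch = math.ceil(num_samples / (batch_size * num_devices))
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs


# Defaults from the configuration cell; the sample count is a made-up example.
num_samples = 2000
steps = total_optimizer_steps(num_samples, batch_size=8, grad_accum=1, epochs=4)
effective_batch = 8 * 1 * 1  # BATCH_SIZE * GRADIENT_ACCUMULATION * devices
print(f"Total steps: {steps}, effective batch size: {effective_batch}")
if steps <= 500:
    print("WARMUP_STEPS=500 covers the whole run; consider a smaller value.")
```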

## Step 2: Initialize SageMaker Session

Auto-detect region from S3 bucket location.

In [None]:
# Auto-detect region from S3 bucket
s3_client = boto3.client('s3')
bucket_name = TRAIN_S3.split('/')[2]  # Extract bucket from s3://bucket/path
bucket_location = s3_client.get_bucket_location(Bucket=bucket_name)['LocationConstraint']
region = bucket_location if bucket_location else 'us-east-1'

print(f"🌍 Detected region: {region}")

# Initialize boto3 session with detected region
boto_session = boto3.Session(region_name=region)
sagemaker_session = sagemaker.Session(boto_session=boto_session)

# Get IAM role ARN
iam_client = boto_session.client('iam')
role_response = iam_client.get_role(RoleName=ROLE_NAME)
TRAIN_ROLE_ARN = role_response['Role']['Arn']

print(f"✓ SageMaker session initialized")
print(f"✓ Region: {region}")
print(f"✓ Role ARN: {TRAIN_ROLE_ARN}")
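
The bucket parsing and region fallback above can be factored into two small helpers. This sketch also handles the legacy `EU` value that `get_bucket_location` can return for buckets in `eu-west-1`. The function names are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key/prefix' into (bucket, key_prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def normalize_location(location_constraint):
    """Map a GetBucketLocation LocationConstraint value to a region name."""
    if not location_constraint:      # None or empty means us-east-1
        return "us-east-1"
    if location_constraint == "EU":  # legacy alias for eu-west-1
        return "eu-west-1"
    return location_constraint


bucket, prefix = split_s3_uri("s3://acousticshield-ml/train/")
print(bucket, prefix, normalize_location(None))
```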

## Step 3: Create HuggingFace Estimator

Configure the training job with the HuggingFace estimator.

In [None]:
# Create HuggingFace estimator
huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='training.py',  # NOTE: source_dir must be a directory containing train.py; verify this path
    role=TRAIN_ROLE_ARN,
    instance_type=TRAIN_INSTANCE_TYPE,
    instance_count=TRAIN_INSTANCE_COUNT,
    transformers_version=TRANSFORMERS_VERSION,
    pytorch_version=PYTORCH_VERSION,
    py_version=PYTHON_VERSION,
    hyperparameters={
        'epochs': EPOCHS,
        'learning-rate': LEARNING_RATE,
        'batch-size': BATCH_SIZE,
        'warmup-steps': WARMUP_STEPS,
        'gradient-accumulation-steps': GRADIENT_ACCUMULATION,
    },
    output_path=MODEL_OUTPUT_S3,
    base_job_name='acousticshield-train',
    sagemaker_session=sagemaker_session,
    disable_profiler=True,  # Disable profiler to reduce overhead
    debugger_hook_config=False,  # Disable debugger to reduce overhead
)

print("✓ HuggingFace estimator created")
print(f"  Base job name: acousticshield-train")
print(f"  Output path: {MODEL_OUTPUT_S3}")

## Step 4: Start Training Job

⏱️ **Expected duration**: 30-40 minutes on ml.g4dn.xlarge

The training job will:
1. Load audio data from S3 using audiofolder format
2. Resample all audio to 16 kHz
3. Extract features using wav2vec2 feature extractor
4. Fine-tune the model for `EPOCHS` epochs (default: 4)
5. Evaluate on validation set each epoch
6. Save best model based on F1 score
7. Upload model artifacts to S3
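
Step 6 selects the best checkpoint by F1 score. The notebook does not say which averaging the training script uses. The sketch below assumes macro averaging (the unweighted mean of per-class F1), which is a common choice for imbalanced classes. The function name and sample labels are illustrative only.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


labels = ["Normal", "TireSkid"]
y_true = ["Normal", "Normal", "TireSkid", "TireSkid"]
y_pred = ["Normal", "TireSkid", "TireSkid", "TireSkid"]
score = macro_f1(y_true, y_pred, labels)
print(round(score, 4))
```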

In [None]:
# Prepare training channels
training_channels = {'train': TRAIN_S3}

# Add validation channel if provided
if VAL_S3:
    training_channels['validation'] = VAL_S3
    print(f"📊 Using separate validation set: {VAL_S3}")
else:
    print(f"📊 Validation set will be split from training data (90/10)")

print(f"\n🚀 Starting training job...")
print(f"⏰ Started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("\n" + "="*80)

# Start training
huggingface_estimator.fit(training_channels, wait=True)

print("\n" + "="*80)
print("✅ Training job completed!")
print(f"⏰ Finished at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"📦 Model artifacts: {huggingface_estimator.model_data}")
print("="*80)

## Step 5: Deploy Real-Time Endpoint

⏱️ **Expected duration**: 5-8 minutes

The endpoint will:
- Accept audio/wav input (any sample rate, mono or stereo)
- Auto-resample to 16 kHz if needed
- Return JSON with label, confidence, and probabilities
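
The response shape described above can be illustrated with a softmax over the model's class logits. This is a sketch of the idea, not the endpoint's actual inference code, which is not shown in this notebook. The class order in `CLASSES` and the example logits are assumptions.

```python
import math

# Assumed class order; the deployed model's own label mapping takes precedence.
CLASSES = ["Normal", "TireSkid", "EmergencyBraking", "CollisionImminent"]


def logits_to_response(logits, classes=CLASSES):
    """Turn raw class logits into the label/confidence/probs response shape."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = {name: e / total for name, e in zip(classes, exps)}
    label = max(probs, key=probs.get)
    return {"label": label, "confidence": probs[label], "probs": probs}


example = logits_to_response([0.1, 2.5, 0.4, -1.0])
print(example["label"], round(example["confidence"], 3))
```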

In [None]:
# Generate unique endpoint name
timestamp = datetime.now().strftime('%Y%m%d-%H%M%S')
endpoint_name = f'acousticshield-endpoint-{timestamp}'

print(f"🚀 Deploying endpoint: {endpoint_name}")
print(f"⏰ Started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"💻 Instance: {ENDPOINT_INSTANCE_TYPE}")
print("\nThis will take 5-8 minutes...\n")

# Deploy endpoint
predictor = huggingface_estimator.deploy(
    initial_instance_count=ENDPOINT_INSTANCE_COUNT,
    instance_type=ENDPOINT_INSTANCE_TYPE,
    endpoint_name=endpoint_name,
    serializer=DataSerializer(content_type='audio/wav'),
    deserializer=JSONDeserializer(),
)

print("\n" + "="*80)
print("✅ Endpoint deployed successfully!")
print(f"⏰ Finished at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"🌐 Endpoint name: {endpoint_name}")
print(f"📍 Status: InService")
print("="*80)

## Step 6: Test Endpoint

### Option A: Test with Sample Audio from S3

In [None]:
# Download sample audio from S3 for testing
import io

# List available test files
test_bucket = 'acousticshield-ml'
test_prefix = 'train/'  # Or 'val/' if validation set exists

s3 = boto3.client('s3', region_name=region)
response = s3.list_objects_v2(Bucket=test_bucket, Prefix=test_prefix, MaxKeys=10)

if 'Contents' in response:
    # Find first WAV file
    test_files = [obj['Key'] for obj in response['Contents'] if obj['Key'].lower().endswith('.wav')]
    
    if test_files:
        test_file_key = test_files[0]
        print(f"📁 Using test file: s3://{test_bucket}/{test_file_key}")
        
        # Download file
        wav_buffer = io.BytesIO()
        s3.download_fileobj(test_bucket, test_file_key, wav_buffer)
        wav_bytes = wav_buffer.getvalue()
        
        print(f"✓ Downloaded {len(wav_bytes)} bytes")
    else:
        print("❌ No WAV files found in S3 bucket")
        wav_bytes = None
else:
    print(f"❌ No objects found at s3://{test_bucket}/{test_prefix}")
    wav_bytes = None

### Option B: Generate Synthetic Test Audio

In [None]:
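# Generate synthetic test audio (1 second sine wave at 440 Hz)
# Note: running this cell replaces any wav_bytes downloaded in Option A.
import io  # needed here if the Option A cell was skipped
import numpy as np
import soundfile as sf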

print("🎵 Generating synthetic test audio...")

sample_rate = 16000
duration = 1.0
frequency = 440.0  # A4 note

t = np.linspace(0, duration, int(sample_rate * duration), endpoint=False)  # exclude the endpoint so spacing is exactly 1/sample_rate
test_audio = 0.3 * np.sin(2 * np.pi * frequency * t)

# Convert to WAV bytes
wav_buffer = io.BytesIO()
sf.write(wav_buffer, test_audio, sample_rate, format='WAV')
wav_bytes = wav_buffer.getvalue()

print(f"✓ Generated {len(wav_bytes)} bytes of test audio")
print(f"  Format: 16 kHz mono, {duration} second sine wave")

### Invoke Endpoint

In [None]:
if wav_bytes:
    print("\n🔮 Invoking endpoint...")
    print("="*80)
    
    # Predict using the deployed endpoint
    response = predictor.predict(wav_bytes)
    
    # Display results
    print("\n📊 Prediction Results:")
    print(json.dumps(response, indent=2))
    
    print("\n" + "="*80)
    print("🏷️  PREDICTION SUMMARY")
    print("="*80)
    print(f"Predicted Class: {response['label']}")
    print(f"Confidence: {response['confidence']:.2%}")
    print("\nClass Probabilities:")
    for class_name, prob in sorted(response['probs'].items(), key=lambda x: x[1], reverse=True):
        bar = '█' * int(prob * 50)
        print(f"  {class_name:20s} {prob:.2%} {bar}")
    print("="*80)
    
    print("\n✅ Endpoint test successful!")
else:
    print("⚠️  No test audio available. Please provide a WAV file.")
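
The cell above indexes `response['label']`, `response['confidence']`, and `response['probs']` directly, so an unexpected payload would raise a `KeyError`. A small validator can make a smoke test fail with a clearer message. The sketch below is one possible check. The function name and tolerances are assumptions, and the sample payload mirrors the expected output printed in Step 8.

```python
def validate_prediction(payload, tol=1e-6):
    """Return a list of problems found in an endpoint response (empty if OK)."""
    problems = [f"missing key: {k}" for k in ("label", "confidence", "probs")
                if k not in payload]
    if problems:
        return problems
    probs = payload["probs"]
    if payload["label"] not in probs:
        problems.append("label not present in probs")
    elif abs(probs[payload["label"]] - payload["confidence"]) > tol:
        problems.append("confidence does not match probs[label]")
    if abs(sum(probs.values()) - 1.0) > 1e-3:
        problems.append("probabilities do not sum to 1")
    return problems


sample = {
    "label": "TireSkid",
    "confidence": 0.85,
    "probs": {"Normal": 0.05, "TireSkid": 0.85,
              "EmergencyBraking": 0.08, "CollisionImminent": 0.02},
}
print(validate_prediction(sample))
```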

## Step 7: Test with boto3 SageMaker Runtime (Alternative Method)

This demonstrates how to invoke the endpoint using raw boto3 client.

In [None]:
if wav_bytes:
    print("🔧 Testing with boto3 SageMaker Runtime client...\n")
    
    # Create SageMaker Runtime client
    runtime_client = boto_session.client('sagemaker-runtime')
    
    # Invoke endpoint
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='audio/wav',
        Accept='application/json',
        Body=wav_bytes
    )
    
    # Parse response
    result = json.loads(response['Body'].read().decode())
    
    print("📊 Response from boto3 client:")
    print(json.dumps(result, indent=2))
    print(f"\n✅ boto3 invocation successful!")
    print(f"   Predicted: {result['label']} ({result['confidence']:.2%})")

## Step 8: Endpoint Information

Save endpoint details for future use.

In [None]:
print("\n" + "="*80)
print("📋 ENDPOINT INFORMATION")
print("="*80)
print(f"Endpoint Name: {endpoint_name}")
print(f"Region: {region}")
print(f"Instance Type: {ENDPOINT_INSTANCE_TYPE}")
print(f"Instance Count: {ENDPOINT_INSTANCE_COUNT}")
print(f"Model Artifacts: {huggingface_estimator.model_data}")
print(f"\nInput Format: audio/wav (16 kHz mono recommended, auto-resampled)")
print(f"Output Format: application/json")
print(f"\nExpected Output:")
print(f"  {{")
print(f"    \"label\": \"TireSkid\",")
print(f"    \"confidence\": 0.85,")
print(f"    \"probs\": {{")
print(f"      \"Normal\": 0.05,")
print(f"      \"TireSkid\": 0.85,")
print(f"      \"EmergencyBraking\": 0.08,")
print(f"      \"CollisionImminent\": 0.02")
print(f"    }}")
print(f"  }}")
print("="*80)

# Save endpoint info to file
endpoint_info = {
    'endpoint_name': endpoint_name,
    'region': region,
    'instance_type': ENDPOINT_INSTANCE_TYPE,
    'model_artifacts': huggingface_estimator.model_data,
    'created_at': datetime.now().isoformat(),
    'classes': ['Normal', 'TireSkid', 'EmergencyBraking', 'CollisionImminent']
}

with open('endpoint_info.json', 'w') as f:
    json.dump(endpoint_info, f, indent=2)

print("\n✓ Endpoint information saved to endpoint_info.json")
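
An application can read `endpoint_info.json` back to find the endpoint without hardcoding its name. The sketch below loads the file and checks the keys a client would need. The helper name, the required-key set, and the placeholder values in the demo are assumptions.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"endpoint_name", "region", "classes"}


def load_endpoint_info(path="endpoint_info.json"):
    """Read the saved endpoint details and check the keys a client needs."""
    with open(path) as f:
        info = json.load(f)
    missing = REQUIRED_KEYS - info.keys()
    if missing:
        raise KeyError(f"endpoint info is missing: {sorted(missing)}")
    return info


# Round-trip demo with placeholder values written to a temporary file.
demo = {"endpoint_name": "example-endpoint", "region": "us-east-1",
        "classes": ["Normal", "TireSkid"]}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "endpoint_info.json")
    with open(demo_path, "w") as f:
        json.dump(demo, f)
    loaded = load_endpoint_info(demo_path)
print(loaded["endpoint_name"], loaded["region"])

# In an application, the loaded values would feed a runtime client, e.g.:
#   runtime = boto3.client("sagemaker-runtime", region_name=loaded["region"])
#   runtime.invoke_endpoint(EndpointName=loaded["endpoint_name"], ...)
```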

## Optional: Cleanup

⚠️ **WARNING**: The endpoint is billed for as long as it is running, so delete it when you are done testing. Deletion cannot be undone, but the model can be redeployed from the artifacts in S3.

Uncomment and run the cell below to delete the endpoint when done testing.

In [None]:
# # Uncomment to delete endpoint
# print(f"🗑️  Deleting endpoint: {endpoint_name}")
# predictor.delete_endpoint()
# print("✓ Endpoint deleted successfully")
# print("\n⚠️  Model artifacts remain in S3 and can be redeployed anytime.")
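
`predictor.delete_endpoint()` leaves the SageMaker model resource in place. If a full teardown is wanted, the low-level SageMaker client can remove the endpoint, its endpoint configuration, and the referenced models. The sketch below assumes a client created with `boto3.client("sagemaker", region_name=region)`. To stay runnable without AWS access, the demo uses a hypothetical `RecordingClient` stand-in that only records calls.

```python
def delete_endpoint_stack(sm_client, endpoint_name):
    """Delete an endpoint, its endpoint config, and the models it references."""
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [v["ModelName"] for v in config["ProductionVariants"]]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for name in model_names:
        sm_client.delete_model(ModelName=name)
    return {"endpoint": endpoint_name, "config": config_name, "models": model_names}


class RecordingClient:
    """Stand-in for the SageMaker client that records delete calls (demo only)."""

    def __init__(self):
        self.calls = []

    def describe_endpoint(self, EndpointName):
        return {"EndpointConfigName": f"{EndpointName}-config"}

    def describe_endpoint_config(self, EndpointConfigName):
        return {"ProductionVariants": [{"ModelName": "example-model"}]}

    def delete_endpoint(self, EndpointName):
        self.calls.append(("delete_endpoint", EndpointName))

    def delete_endpoint_config(self, EndpointConfigName):
        self.calls.append(("delete_endpoint_config", EndpointConfigName))

    def delete_model(self, ModelName):
        self.calls.append(("delete_model", ModelName))


fake = RecordingClient()
summary = delete_endpoint_stack(fake, "example-endpoint")
print(summary)
```

The model artifacts in S3 are untouched by any of these calls.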

---

## Summary

✅ **Training Complete**: Model fine-tuned on audio classification task  
✅ **Endpoint Deployed**: Real-time inference endpoint is running  
✅ **Testing Complete**: Endpoint responds with predictions  

### Next Steps
1. Integrate endpoint into your application
2. Monitor endpoint metrics in CloudWatch
3. Set up auto-scaling if needed
4. Retrain periodically with new data
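
For step 3, SageMaker endpoints scale through Application Auto Scaling. The sketch below only builds the request arguments for `register_scalable_target` and `put_scaling_policy`. The variant name `AllTraffic` (the usual default for endpoints deployed through the SageMaker Python SDK), the capacity limits, and the target of 100 invocations per instance are assumptions to adjust for a real workload.

```python
def autoscaling_requests(endpoint_name, variant="AllTraffic",
                         min_capacity=1, max_capacity=2, target_invocations=100.0):
    """Build kwargs for register_scalable_target and put_scaling_policy."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    target = {**common, "MinCapacity": min_capacity, "MaxCapacity": max_capacity}
    policy = {
        **common,
        "PolicyName": f"{endpoint_name}-invocations",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_invocations,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return target, policy


target, policy = autoscaling_requests("example-endpoint")
print(target["ResourceId"])

# With AWS credentials, the requests would be applied like this:
#   aas = boto3.client("application-autoscaling", region_name=region)
#   aas.register_scalable_target(**target)
#   aas.put_scaling_policy(**policy)
```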

### Cost Management
- **Training**: One-time cost per run (~$0.50-1.00)
- **Endpoint**: Ongoing cost while deployed (~$0.23/hour for ml.m5.xlarge; on-demand rates vary by region)
- **Storage**: Model artifacts in S3 (~$0.02/month)

💡 **Tip**: Delete the endpoint when not in use and redeploy when needed to save costs!
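
To put the hourly rate in perspective, the arithmetic below compares an always-on endpoint with one that runs only during working hours. It uses the ~$0.23/hour figure quoted above. The 30-day month and the 8 hours x 22 days schedule are illustrative assumptions.

```python
def endpoint_cost(hourly_rate_usd, hours, instance_count=1):
    """Cost of keeping an endpoint running for a number of hours."""
    return hourly_rate_usd * hours * instance_count


HOURLY_RATE = 0.23  # approximate ml.m5.xlarge rate quoted above

always_on = endpoint_cost(HOURLY_RATE, 24 * 30)      # 30-day month
business_hours = endpoint_cost(HOURLY_RATE, 8 * 22)  # 8 h/day, 22 days

print(f"Always on:      ${always_on:.2f}/month")
print(f"Business hours: ${business_hours:.2f}/month")
print(f"Savings:        ${always_on - business_hours:.2f}/month")
```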

### Documentation
- Endpoint name saved in `endpoint_info.json`
- Model artifacts: `s3://acousticshield-ml/models/`
- Training logs: Available in CloudWatch