# ASR Attacks with Codec Compression Testing

## Overview
This notebook runs two targeted adversarial attacks against the Whisper ASR model (Carlini & Wagner and Qin et al.) and tests whether each adversarial example survives lossy codec compression.

## Workflow
For each attack:
1. Generate adversarial audio → compute metrics (WER, CER, SNR, PESQ, STOI)
2. Compress to OPUS → transcribe → compute metrics
3. Compress to AMR-WB → transcribe → compute metrics

## Configuration
- **c_weight**: 1e-2 (weight on the MSE distortion penalty in the attack loss)
- **OPUS bitrate**: 64 kbps
- **AMR-WB bitrate**: 23.85 kbps

## Output Files
- `adv_cw.wav` - Carlini & Wagner adversarial
- `adv_cw_opus.wav` - C&W compressed with OPUS
- `adv_cw_amrwb.wav` - C&W compressed with AMR-WB
- `adv_qin.wav` - Qin adversarial
- `adv_qin_opus.wav` - Qin compressed with OPUS
- `adv_qin_amrwb.wav` - Qin compressed with AMR-WB
- `results_{filename}_{timestamp}.json` - All metrics and results
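
The WAV names listed above follow an attack-by-stage pattern: one adversarial file per attack, plus one decoded file per codec. A minimal sketch of that naming scheme follows. The helper `expected_outputs` and the constants `ATTACKS` and `CODECS` are hypothetical names introduced here for illustration; they do not appear in the notebook.

```python
ATTACKS = ("cw", "qin")      # short attack names used in the output file names
CODECS = ("opus", "amrwb")   # codec suffixes used in the output file names


def expected_outputs(attacks=ATTACKS, codecs=CODECS):
    """Return the WAV file names the workflow is expected to produce, in order."""
    names = []
    for attack in attacks:
        # Workflow step 1: the adversarial audio itself
        names.append(f"adv_{attack}.wav")
        # Workflow steps 2-3: the adversarial audio after a codec round trip
        for codec in codecs:
            names.append(f"adv_{attack}_{codec}.wav")
    return names


for name in expected_outputs():
    print(name)
```

A list like this can be checked against the working directory after a run to spot a stage that failed silently.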

## Important Notes:
- Update `AUDIO_FILE_PATH` in the audio-loading cell (cell 6 by position, shown as `In [5]` below) to point to your audio file
- All results are saved to JSON for analysis


## Carlini & Wagner (2018) - Targeted Attack on Whisper

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!apt-get update
!apt-get install -y ffmpeg libavcodec-extra


In [3]:

!apt install ffmpeg -y

!pip install openai-whisper librosa soundfile pesq pystoi --quiet

import numpy as np
import torch
import librosa
import soundfile as sf
import IPython.display as ipd
from pesq import pesq
from pystoi import stoi

# For reproducibility
torch.random.manual_seed(0)
np.random.seed(0)


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
ffmpeg is already the newest version (7:4.4.2-0ubuntu0.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 58 not upgraded.
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 803.2/803.2 kB 19.3 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Preparing metadata (setup.py) ... done
  Building wheel for openai-whisper (pyproject.toml) ... done
  Building wheel for pesq (setup.py) ... done


In [4]:
import whisper
model = whisper.load_model("base")  # Load the Whisper "base" ASR model (multilingual; "base.en" is the English-only variant)
print("Whisper model loaded.")

100%|████████████████████████████████████████| 139M/139M [00:01<00:00, 120MiB/s]


Whisper model loaded.


In [5]:
# Load audio from drive path
# UPDATE THIS PATH to your audio file location
AUDIO_FILE_PATH = "/content/drive/MyDrive/adversarial-audio/Normal-Examples/long-signals/sample-070236.wav"  # Change this to your path

# Load and resample to 16000 Hz (Whisper expects 16k audio)
audio, sr = librosa.load(AUDIO_FILE_PATH, sr=16000)
sf.write("original.wav", audio, 16000)
print(f"Loaded audio from: {AUDIO_FILE_PATH}")
print(f"Saved original.wav with sampling rate {16000} Hz and length {audio.shape[0]/16000:.2f} seconds.")


Loaded audio from: /content/drive/MyDrive/adversarial-audio/Normal-Examples/long-signals/sample-070236.wav
Saved original.wav with sampling rate 16000 Hz and length 6.60 seconds.


In [6]:
# Transcribe the original audio using our ASR model
result_orig = model.transcribe("original.wav")
print("Original Transcription:", result_orig["text"].strip())


Original Transcription: We are part of that soul, so we really recognize that it is working for us.


In [7]:
# Helper Functions for Codec Compression and Metrics

import subprocess
import json
from datetime import datetime
from pathlib import Path

def compress_audio_codec(audio_path, codec_name, bitrate, output_path, sr=16000):
    """
    Compress audio using OPUS or AMR-WB codec and decode back to WAV.

    Args:
        audio_path: Path to input WAV file
        codec_name: 'opus' or 'amr-wb'
        bitrate: Bitrate in kbps
        output_path: Path for output WAV file (decoded)
        sr: Sample rate (default 16000)

    Returns:
        Path to decoded WAV file if successful, None otherwise
    """
    codec_map = {
        'opus': {'codec': 'libopus', 'ext': '.opus'},
        'amr-wb': {'codec': 'libvo_amrwbenc', 'ext': '.amr'}
    }

    if codec_name not in codec_map:
        print(f"Unknown codec: {codec_name}")
        return None

    codec_info = codec_map[codec_name]
    encoded_path = str(output_path).replace('.wav', codec_info['ext'])

    try:
        # Encode
        encode_cmd = [
            'ffmpeg', '-y', '-i', str(audio_path),
            '-acodec', codec_info['codec'],
            '-b:a', f'{bitrate}k',
            '-ar', str(sr),
            '-ac', '1',
            encoded_path
        ]
        subprocess.run(encode_cmd, capture_output=True, check=True)

        # Decode back to WAV
        decode_cmd = [
            'ffmpeg', '-y', '-i', encoded_path,
            '-acodec', 'pcm_s16le',
            '-ar', str(sr),
            '-ac', '1',
            str(output_path)
        ]
        subprocess.run(decode_cmd, capture_output=True, check=True)

        return Path(output_path)
    except subprocess.CalledProcessError as e:
        print(f"Codec compression failed: {e}")
        print((e.stderr or b"").decode(errors="replace"))  # ffmpeg's own error message
        return None

def compute_wer(reference: str, hypothesis: str) -> float:
    """Compute Word Error Rate (WER)."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()

    if len(ref_words) == 0:
        return 1.0 if len(hyp_words) > 0 else 0.0

    d = np.zeros((len(ref_words) + 1, len(hyp_words) + 1))
    for i in range(len(ref_words) + 1):
        d[i, 0] = i
    for j in range(len(hyp_words) + 1):
        d[0, j] = j

    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            if ref_words[i-1] == hyp_words[j-1]:
                d[i, j] = d[i-1, j-1]
            else:
                d[i, j] = min(d[i-1, j] + 1, d[i, j-1] + 1, d[i-1, j-1] + 1)

    return d[len(ref_words), len(hyp_words)] / len(ref_words)

def compute_cer(reference: str, hypothesis: str) -> float:
    """Compute Character Error Rate (CER)."""
    ref_chars = list(reference.lower().replace(" ", ""))
    hyp_chars = list(hypothesis.lower().replace(" ", ""))

    if len(ref_chars) == 0:
        return 1.0 if len(hyp_chars) > 0 else 0.0

    d = np.zeros((len(ref_chars) + 1, len(hyp_chars) + 1))
    for i in range(len(ref_chars) + 1):
        d[i, 0] = i
    for j in range(len(hyp_chars) + 1):
        d[0, j] = j

    for i in range(1, len(ref_chars) + 1):
        for j in range(1, len(hyp_chars) + 1):
            if ref_chars[i-1] == hyp_chars[j-1]:
                d[i, j] = d[i-1, j-1]
            else:
                d[i, j] = min(d[i-1, j] + 1, d[i, j-1] + 1, d[i-1, j-1] + 1)

    return d[len(ref_chars), len(hyp_chars)] / len(ref_chars)

def compute_all_metrics(original_audio_path, processed_audio_path, original_transcript, processed_transcript, sr=16000):
    """
    Compute all metrics: WER, CER, SNR, PESQ, STOI.

    Returns:
        Dictionary with all metrics
    """
    # Load audio
    orig_audio, _ = librosa.load(original_audio_path, sr=sr)
    proc_audio, _ = librosa.load(processed_audio_path, sr=sr)

    # Ensure same length
    min_len = min(len(orig_audio), len(proc_audio))
    orig_audio = orig_audio[:min_len]
    proc_audio = proc_audio[:min_len]

    # Compute WER and CER
    wer = compute_wer(original_transcript, processed_transcript)
    cer = compute_cer(original_transcript, processed_transcript)

    # Calculate SNR
    signal_power = np.mean(orig_audio ** 2)
    noise_power = np.mean((proc_audio - orig_audio) ** 2)
    snr_db = 10 * np.log10(signal_power / noise_power) if noise_power > 0 else float('inf')

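    # Calculate PESQ (wideband mode, which expects 16 kHz audio)
    try:
        pesq_score = pesq(sr, orig_audio, proc_audio, 'wb')
    except Exception as e:
        print(f"PESQ failed, falling back to 0.0: {e}")
        pesq_score = 0.0

    # Calculate STOI
    try:
        stoi_score = stoi(orig_audio, proc_audio, sr, extended=False)
    except Exception as e:
        print(f"STOI failed, falling back to 0.0: {e}")
        stoi_score = 0.0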

    return {
        'wer': float(wer),
        'cer': float(cer),
        'snr': float(snr_db) if snr_db != float('inf') else None,
        'pesq': float(pesq_score),
        'stoi': float(stoi_score)
    }

print("Helper functions loaded!")


Helper functions loaded!
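
As a worked check of the WER definition used by the helpers above (word-level edit distance divided by the number of reference words), here is a small standalone sketch. It reimplements the edit distance without NumPy and is not the notebook's `compute_wer`. The `strip_punct` option is an addition made here for illustration. The adversarial string is the transcription that the C&W attack produces later in this notebook.

```python
import string


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, keeping one DP row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis, strip_punct=False):
    """Word error rate; optionally drop ASCII punctuation first (illustrative option)."""
    if strip_punct:
        table = str.maketrans("", "", string.punctuation)
        reference = reference.translate(table)
        hypothesis = hypothesis.translate(table)
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        return 1.0 if hyp_words else 0.0
    return edit_distance(ref_words, hyp_words) / len(ref_words)


original = "We are part of that soul, so we really recognize that it is working for us."
target = "We are cart of that coal, so we really recognize that it is working for us."
adversarial = "We are cart of that coal so we really recognize that it is working for us."

print(wer(original, adversarial))                  # 2 substitutions over 16 words
print(wer(target, adversarial))                    # "coal," vs "coal" counts as one error
print(wer(target, adversarial, strip_punct=True))  # identical once punctuation is removed
```

The notebook reports WER against the original transcript, so a higher value means a larger change. Measuring against the target phrase instead gives a direct view of attack success, and because the split is on whitespace only, a single comma is enough to register as a word error unless punctuation is normalized.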


In [8]:
target_phrase = "We are cart of that coal, so we really recognize that it is working for us."
print("Target Transcription (attack goal):", target_phrase)


Target Transcription (attack goal): We are cart of that coal, so we really recognize that it is working for us.


In [9]:
import torch
import torch.nn.functional as F

# Targeted Carlini & Wagner-style attack optimized directly on Whisper
# (gradient-based search that trades off ASR target loss and small perturbation)

# Prepare tokens for the target phrase
device = next(model.parameters()).device
try:
    tokenizer = whisper.tokenizer.get_tokenizer(multilingual=model.is_multilingual, language="en", task="transcribe")
except Exception:
    tokenizer = whisper.tokenizer.get_tokenizer(multilingual=False, language="en", task="transcribe")

target_token_ids = [tokenizer.sot] + tokenizer.encode(target_phrase) + [tokenizer.eot]
target_tokens = torch.tensor(target_token_ids, dtype=torch.long, device=device)

# Load audio and create trainable perturbation (ensure mono 1D)
audio_np, sr = sf.read("original.wav")
if audio_np.ndim > 1:
    audio_np = audio_np.mean(axis=1)
orig_audio = torch.tensor(audio_np, dtype=torch.float32, device=device)
delta = torch.zeros_like(orig_audio, requires_grad=True)

optimizer = torch.optim.Adam([delta], lr=2e-3)
c_weight = 1e-2  # Weight on the MSE distortion term in the total loss
num_iterations = 200

def whisper_targeted_loss(audio_wave: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between Whisper logits and target tokens (teacher forcing)."""
    # .cpu() moves the tensor but does not detach it, so gradients still flow back to delta.
    audio_for_mel = whisper.pad_or_trim(audio_wave.cpu())
    mel = whisper.log_mel_spectrogram(audio_for_mel).unsqueeze(0).to(device)
    tokens_in = tokens[:-1].unsqueeze(0)
    targets = tokens[1:].unsqueeze(0)

    # Gradients w.r.t. the input do not require training mode, so the model stays in eval mode.
    logits = model(mel, tokens_in)
    logits = logits[:, -targets.shape[1]:, :]
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return loss

# Freeze the model: eval mode, and no gradients for the weights (only delta is optimized)
model.eval()
for param in model.parameters():
    param.requires_grad = False

for iteration in range(num_iterations):
    optimizer.zero_grad()
    adv_wave = torch.clamp(orig_audio + delta, -1.0, 1.0)
    loss_asr = whisper_targeted_loss(adv_wave, target_tokens)
    distortion = torch.mean((adv_wave - orig_audio) ** 2)
    loss = loss_asr + c_weight * distortion
    loss.backward()
    optimizer.step()
    if iteration % 50 == 0 or iteration == num_iterations - 1:
        print(f"Iter {iteration}: ASR loss={loss_asr.item():.4f}, distortion={distortion.item():.6f}, total={loss.item():.4f}")

# Save the adversarial example
adv_audio = torch.clamp(orig_audio + delta, -1.0, 1.0).detach().cpu().numpy()
sf.write("adv_cw.wav", adv_audio, sr)
print(f"\nAdversarial audio saved to adv_cw.wav")
print(f"Perturbation stats: max={np.max(np.abs(adv_audio - audio_np)):.6f}, mean={np.mean(np.abs(adv_audio - audio_np)):.6f}")


Iter 0: ASR loss=6.8088, distortion=0.000000, total=6.8088
Iter 50: ASR loss=0.7935, distortion=0.000151, total=0.7935
Iter 100: ASR loss=0.1348, distortion=0.000189, total=0.1348
Iter 150: ASR loss=0.0277, distortion=0.000198, total=0.0277
Iter 199: ASR loss=0.0159, distortion=0.000199, total=0.0159

Adversarial audio saved to adv_cw.wav
Perturbation stats: max=0.078460, mean=0.011662
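
Read directly off the optimization loop in this cell, the objective being minimized can be written as below. Here $x$ is the original waveform with $N$ samples, $\delta$ the trainable perturbation, $t$ the target token sequence, $f$ the Whisper model, and $c$ is `c_weight`. This is a transcription of the code as written, not the formulation from the original paper.

```latex
\tilde{x} = \operatorname{clip}(x + \delta,\, -1,\, 1), \qquad
\mathcal{L}(\delta) = \mathrm{CE}\big(f(\tilde{x}),\, t\big)
  + c \cdot \frac{1}{N} \sum_{n=1}^{N} \big(\tilde{x}_n - x_n\big)^2
```

With $c = 10^{-2}$ and a distortion of about $2 \times 10^{-4}$ at the final iteration, the penalty term contributes roughly $2 \times 10^{-6}$ to the loss. This is why the logged total equals the ASR loss to four decimal places at every printed step.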


In [13]:
# Initialize results dictionary
original_transcript = result_orig["text"].strip()
results = {
    "audio_file": AUDIO_FILE_PATH,
    "original_transcript": original_transcript,
    "timestamp": datetime.now().isoformat(),
    "attacks": {
        "carlini_wagner": {},
        "qin": {}
    }
}

# Carlini & Wagner Attack - Adversarial Audio Metrics
print("="*80)
print("CARLINI & WAGNER ATTACK - Adversarial Audio")
print("="*80)
result_adv_cw = model.transcribe("adv_cw.wav")
adv_cw_transcript = result_adv_cw["text"].strip()
print(f"Original Transcription:  {original_transcript}")
print(f"Adversarial Transcription: {adv_cw_transcript}")

# Compute metrics on adversarial audio
metrics_adv = compute_all_metrics("original.wav", "adv_cw.wav", original_transcript, adv_cw_transcript)
results["attacks"]["carlini_wagner"]["adversarial"] = {
    "file": "adv_cw.wav",
    "transcript": adv_cw_transcript,
    "metrics": metrics_adv
}

print("Metrics (Original vs Adversarial):")
print(f"  WER: {metrics_adv['wer']:.4f} ({metrics_adv['wer']*100:.2f}%)")
print(f"  CER: {metrics_adv['cer']:.4f} ({metrics_adv['cer']*100:.2f}%)")
print(f"  SNR: {metrics_adv['snr']:.2f} dB" if metrics_adv['snr'] is not None else "  SNR: inf dB")
print(f"  PESQ: {metrics_adv['pesq']:.4f}")
print(f"  STOI: {metrics_adv['stoi']:.4f}")

# Path 1: Compress to OPUS
print("" + "="*80)
print("CARLINI & WAGNER ATTACK - OPUS Compression")
print("="*80)
opus_path = compress_audio_codec("adv_cw.wav", "opus", 64, "adv_cw_opus.wav")
if opus_path and opus_path.exists():
    result_opus = model.transcribe(str(opus_path))
    opus_transcript = result_opus["text"].strip()
    print(f"OPUS Compressed Transcription: {opus_transcript}")

    metrics_opus = compute_all_metrics("original.wav", str(opus_path), original_transcript, opus_transcript)
    results["attacks"]["carlini_wagner"]["opus"] = {
        "file": "adv_cw_opus.wav",
        "transcript": opus_transcript,
        "metrics": metrics_opus
    }

    print("Metrics (Original vs OPUS Compressed):")
    print(f"  WER: {metrics_opus['wer']:.4f} ({metrics_opus['wer']*100:.2f}%)")
    print(f"  CER: {metrics_opus['cer']:.4f} ({metrics_opus['cer']*100:.2f}%)")
    print(f"  SNR: {metrics_opus['snr']:.2f} dB" if metrics_opus['snr'] is not None else "  SNR: inf dB")
    print(f"  PESQ: {metrics_opus['pesq']:.4f}")
    print(f"  STOI: {metrics_opus['stoi']:.4f}")
else:
    print("OPUS compression failed!")
    results["attacks"]["carlini_wagner"]["opus"] = {"error": "Compression failed"}

# Path 2: Compress to AMR-WB
print("" + "="*80)
print("CARLINI & WAGNER ATTACK - AMR-WB Compression")
print("="*80)
amrwb_path = compress_audio_codec("adv_cw.wav", "amr-wb", 23.85, "adv_cw_amrwb.wav")
if amrwb_path and amrwb_path.exists():
    result_amrwb = model.transcribe(str(amrwb_path))
    amrwb_transcript = result_amrwb["text"].strip()
    print(f"AMR-WB Compressed Transcription: {amrwb_transcript}")

    metrics_amrwb = compute_all_metrics("original.wav", str(amrwb_path), original_transcript, amrwb_transcript)
    results["attacks"]["carlini_wagner"]["amr_wb"] = {
        "file": "adv_cw_amrwb.wav",
        "transcript": amrwb_transcript,
        "metrics": metrics_amrwb
    }

    print("Metrics (Original vs AMR-WB Compressed):")
    print(f"  WER: {metrics_amrwb['wer']:.4f} ({metrics_amrwb['wer']*100:.2f}%)")
    print(f"  CER: {metrics_amrwb['cer']:.4f} ({metrics_amrwb['cer']*100:.2f}%)")
    print(f"  SNR: {metrics_amrwb['snr']:.2f} dB" if metrics_amrwb['snr'] is not None else "  SNR: inf dB")
    print(f"  PESQ: {metrics_amrwb['pesq']:.4f}")
    print(f"  STOI: {metrics_amrwb['stoi']:.4f}")
else:
    print("AMR-WB compression failed!")
    results["attacks"]["carlini_wagner"]["amr_wb"] = {"error": "Compression failed"}


CARLINI & WAGNER ATTACK - Adversarial Audio
Original Transcription:  We are part of that soul, so we really recognize that it is working for us.
Adversarial Transcription: We are cart of that coal so we really recognize that it is working for us.
Metrics (Original vs Adversarial):
  WER: 0.1250 (12.50%)
  CER: 0.0667 (6.67%)
  SNR: 6.66 dB
  PESQ: 1.1582
  STOI: 0.5700
CARLINI & WAGNER ATTACK - OPUS Compression
OPUS Compressed Transcription: We are caught of that coal so we really recognize that it is working for us.
Metrics (Original vs OPUS Compressed):
  WER: 0.1250 (12.50%)
  CER: 0.1167 (11.67%)
  SNR: 6.92 dB
  PESQ: 1.1687
  STOI: 0.5696
CARLINI & WAGNER ATTACK - AMR-WB Compression
AMR-WB Compressed Transcription: We are caught of that soul, so we really recognize that it is working for us.
Metrics (Original vs AMR-WB Compressed):
  WER: 0.0625 (6.25%)
  CER: 0.0667 (6.67%)
  SNR: -1.75 dB
  PESQ: 1.2123
  STOI: 0.5072


## Imperceptible Adversarial Examples (Qin et al., 2019) - Psychoacoustic Masking

In [17]:
import torch
import torch.nn.functional as F

# Same targeted C&W-style optimization as the previous attack cell, run for 400 iterations.
# Note: as written, this cell adds no psychoacoustic masking loss; the objective is the
# ASR target loss plus c_weight times the MSE distortion.

# Prepare tokens for the target phrase
device = next(model.parameters()).device
try:
    tokenizer = whisper.tokenizer.get_tokenizer(multilingual=model.is_multilingual, language="en", task="transcribe")
except Exception:
    tokenizer = whisper.tokenizer.get_tokenizer(multilingual=False, language="en", task="transcribe")

target_token_ids = [tokenizer.sot] + tokenizer.encode(target_phrase) + [tokenizer.eot]
target_tokens = torch.tensor(target_token_ids, dtype=torch.long, device=device)

# Load audio and create trainable perturbation (ensure mono 1D)
audio_np, sr = sf.read("original.wav")
if audio_np.ndim > 1:
    audio_np = audio_np.mean(axis=1)
orig_audio = torch.tensor(audio_np, dtype=torch.float32, device=device)
delta = torch.zeros_like(orig_audio, requires_grad=True)

optimizer = torch.optim.Adam([delta], lr=2e-3)
c_weight = 1e-2  # Weight on the MSE distortion term in the total loss
num_iterations = 400

def whisper_targeted_loss(audio_wave: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between Whisper logits and target tokens (teacher forcing)."""
    # .cpu() moves the tensor but does not detach it, so gradients still flow back to delta.
    audio_for_mel = whisper.pad_or_trim(audio_wave.cpu())
    mel = whisper.log_mel_spectrogram(audio_for_mel).unsqueeze(0).to(device)
    tokens_in = tokens[:-1].unsqueeze(0)
    targets = tokens[1:].unsqueeze(0)

    # Gradients w.r.t. the input do not require training mode, so the model stays in eval mode.
    logits = model(mel, tokens_in)
    logits = logits[:, -targets.shape[1]:, :]
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return loss

# Freeze the model: eval mode, and no gradients for the weights (only delta is optimized)
model.eval()
for param in model.parameters():
    param.requires_grad = False

for iteration in range(num_iterations):
    optimizer.zero_grad()
    adv_wave = torch.clamp(orig_audio + delta, -1.0, 1.0)
    loss_asr = whisper_targeted_loss(adv_wave, target_tokens)
    distortion = torch.mean((adv_wave - orig_audio) ** 2)
    loss = loss_asr + c_weight * distortion
    loss.backward()
    optimizer.step()
    if iteration % 50 == 0 or iteration == num_iterations - 1:
        print(f"Iter {iteration}: ASR loss={loss_asr.item():.4f}, distortion={distortion.item():.6f}, total={loss.item():.4f}")

# Save the adversarial example
adv_audio = torch.clamp(orig_audio + delta, -1.0, 1.0).detach().cpu().numpy()
try:
    sf.write("adv_qin.wav", adv_audio, sr)
    # Verify file was saved
    import os
    if os.path.exists("adv_qin.wav"):
        file_size = os.path.getsize("adv_qin.wav")
        print(f"\n✓ Adversarial audio saved to adv_qin.wav (size: {file_size} bytes)")
        print(f"Perturbation stats: max={np.max(np.abs(adv_audio - audio_np)):.6f}, mean={np.mean(np.abs(adv_audio - audio_np)):.6f}")
    else:
        print("\n⚠️ WARNING: File adv_qin.wav was not created! Attempting to save again...")
        # Try saving with absolute path
        sf.write("/content/adv_qin.wav", adv_audio, sr)
        if os.path.exists("/content/adv_qin.wav"):
            print("✓ File saved to /content/adv_qin.wav")
        else:
            print("✗ ERROR: Failed to save adv_qin.wav")
except Exception as e:
    print(f"\n✗ ERROR saving adv_qin.wav: {e}")
    import traceback
    traceback.print_exc()


Iter 0: ASR loss=6.8088, distortion=0.000000, total=6.8088
Iter 50: ASR loss=0.8822, distortion=0.000151, total=0.8822
Iter 100: ASR loss=0.2035, distortion=0.000194, total=0.2035
Iter 150: ASR loss=0.0299, distortion=0.000208, total=0.0299
Iter 200: ASR loss=0.0165, distortion=0.000209, total=0.0165
Iter 250: ASR loss=0.0120, distortion=0.000210, total=0.0120
Iter 300: ASR loss=0.0088, distortion=0.000211, total=0.0088
Iter 350: ASR loss=0.0071, distortion=0.000211, total=0.0071
Iter 399: ASR loss=0.0057, distortion=0.000212, total=0.0057

✓ Adversarial audio saved to adv_qin.wav (size: 211244 bytes)
Perturbation stats: max=0.092357, mean=0.011918


In [18]:
# Qin Attack - Adversarial Audio Metrics
print("="*80)
print("QIN ATTACK - Adversarial Audio")
print("="*80)

# Check if adv_qin.wav exists, if not, the attack may have failed
import os
qin_file = None
if os.path.exists("adv_qin.wav"):
    qin_file = "adv_qin.wav"
elif os.path.exists("/content/adv_qin.wav"):
    qin_file = "/content/adv_qin.wav"
    print("Found adv_qin.wav at /content/adv_qin.wav")

if qin_file is None:
    print("ERROR: adv_qin.wav not found! The Qin attack may have failed.")
    print("Please check the output of the Qin attack cell to see if the attack completed successfully.")
    print("Skipping Qin attack metrics...")
    results["attacks"]["qin"]["adversarial"] = {"error": "adv_qin.wav file not found - attack may have failed"}
else:
    try:
        print(f"Transcribing {qin_file}...")
        result_adv_qin = model.transcribe(qin_file)
        adv_qin_transcript = result_adv_qin["text"].strip()
        print(f"Original Transcription:  {original_transcript}")
        print(f"Adversarial Transcription: {adv_qin_transcript}")

        # Compute metrics on adversarial audio
        metrics_adv_qin = compute_all_metrics("original.wav", qin_file, original_transcript, adv_qin_transcript)
        results["attacks"]["qin"]["adversarial"] = {
            "file": "adv_qin.wav",
            "transcript": adv_qin_transcript,
            "metrics": metrics_adv_qin
        }

        print("\nMetrics (Original vs Adversarial):")
        print(f"  WER: {metrics_adv_qin['wer']:.4f} ({metrics_adv_qin['wer']*100:.2f}%)")
        print(f"  CER: {metrics_adv_qin['cer']:.4f} ({metrics_adv_qin['cer']*100:.2f}%)")
        print(f"  SNR: {metrics_adv_qin['snr']:.2f} dB" if metrics_adv_qin['snr'] is not None else "  SNR: inf dB")
        print(f"  PESQ: {metrics_adv_qin['pesq']:.4f}")
        print(f"  STOI: {metrics_adv_qin['stoi']:.4f}")

        # Path 1: Compress to OPUS
        print("\n" + "="*80)
        print("QIN ATTACK - OPUS Compression")
        print("="*80)
        opus_path_qin = compress_audio_codec(qin_file, "opus", 64, "adv_qin_opus.wav")
        if opus_path_qin and opus_path_qin.exists():
            result_opus_qin = model.transcribe(str(opus_path_qin))
            opus_transcript_qin = result_opus_qin["text"].strip()
            print(f"OPUS Compressed Transcription: {opus_transcript_qin}")

            metrics_opus_qin = compute_all_metrics("original.wav", str(opus_path_qin), original_transcript, opus_transcript_qin)
            results["attacks"]["qin"]["opus"] = {
                "file": "adv_qin_opus.wav",
                "transcript": opus_transcript_qin,
                "metrics": metrics_opus_qin
            }

            print("\nMetrics (Original vs OPUS Compressed):")
            print(f"  WER: {metrics_opus_qin['wer']:.4f} ({metrics_opus_qin['wer']*100:.2f}%)")
            print(f"  CER: {metrics_opus_qin['cer']:.4f} ({metrics_opus_qin['cer']*100:.2f}%)")
            print(f"  SNR: {metrics_opus_qin['snr']:.2f} dB" if metrics_opus_qin['snr'] is not None else "  SNR: inf dB")
            print(f"  PESQ: {metrics_opus_qin['pesq']:.4f}")
            print(f"  STOI: {metrics_opus_qin['stoi']:.4f}")
        else:
            print("OPUS compression failed!")
            results["attacks"]["qin"]["opus"] = {"error": "Compression failed"}

        # Path 2: Compress to AMR-WB
        print("\n" + "="*80)
        print("QIN ATTACK - AMR-WB Compression")
        print("="*80)
        amrwb_path_qin = compress_audio_codec(qin_file, "amr-wb", 23.85, "adv_qin_amrwb.wav")
        if amrwb_path_qin and amrwb_path_qin.exists():
            result_amrwb_qin = model.transcribe(str(amrwb_path_qin))
            amrwb_transcript_qin = result_amrwb_qin["text"].strip()
            print(f"AMR-WB Compressed Transcription: {amrwb_transcript_qin}")

            metrics_amrwb_qin = compute_all_metrics("original.wav", str(amrwb_path_qin), original_transcript, amrwb_transcript_qin)
            results["attacks"]["qin"]["amr_wb"] = {
                "file": "adv_qin_amrwb.wav",
                "transcript": amrwb_transcript_qin,
                "metrics": metrics_amrwb_qin
            }

            print("\nMetrics (Original vs AMR-WB Compressed):")
            print(f"  WER: {metrics_amrwb_qin['wer']:.4f} ({metrics_amrwb_qin['wer']*100:.2f}%)")
            print(f"  CER: {metrics_amrwb_qin['cer']:.4f} ({metrics_amrwb_qin['cer']*100:.2f}%)")
            print(f"  SNR: {metrics_amrwb_qin['snr']:.2f} dB" if metrics_amrwb_qin['snr'] is not None else "  SNR: inf dB")
            print(f"  PESQ: {metrics_amrwb_qin['pesq']:.4f}")
            print(f"  STOI: {metrics_amrwb_qin['stoi']:.4f}")
        else:
            print("AMR-WB compression failed!")
            results["attacks"]["qin"]["amr_wb"] = {"error": "Compression failed"}
    except Exception as e:
        print(f"ERROR processing Qin attack: {e}")
        import traceback
        traceback.print_exc()
        results["attacks"]["qin"]["adversarial"] = {"error": f"Processing failed: {str(e)}"}

QIN ATTACK - Adversarial Audio
Transcribing adv_qin.wav...
Original Transcription:  We are part of that soul, so we really recognize that it is working for us.
Adversarial Transcription: We are cart of that coal so we really recognize that it is working for us.

Metrics (Original vs Adversarial):
  WER: 0.1250 (12.50%)
  CER: 0.0667 (6.67%)
  SNR: 6.41 dB
  PESQ: 1.1504
  STOI: 0.5714

QIN ATTACK - OPUS Compression
OPUS Compressed Transcription: We are cart of that coal so we really recognize that it is working for us.

Metrics (Original vs OPUS Compressed):
  WER: 0.1250 (12.50%)
  CER: 0.0667 (6.67%)
  SNR: 6.65 dB
  PESQ: 1.1601
  STOI: 0.5699

QIN ATTACK - AMR-WB Compression
AMR-WB Compressed Transcription: We are caught of that soul. So we really recognize that it is working for us.

Metrics (Original vs AMR-WB Compressed):
  WER: 0.1250 (12.50%)
  CER: 0.0833 (8.33%)
  SNR: -1.83 dB
  PESQ: 1.1989
  STOI: 0.5033
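
WER against the original transcript does not show directly whether the attack's target survived a codec. One way to read the logged transcripts is to check which of the two substituted target words ("cart" and "coal") still appear at each stage. The sketch below does this with the transcripts copied from the outputs above. The helper `surviving_target_words` and the dictionary layout are hypothetical and are not part of the notebook.

```python
import re

# The two words that differ between the original transcript and the target phrase
TARGET_WORDS = {"cart", "coal"}

# Transcripts copied from the cell outputs in this notebook
transcripts = {
    ("cw", "adversarial"): "We are cart of that coal so we really recognize that it is working for us.",
    ("cw", "opus"): "We are caught of that coal so we really recognize that it is working for us.",
    ("cw", "amr_wb"): "We are caught of that soul, so we really recognize that it is working for us.",
    ("qin", "adversarial"): "We are cart of that coal so we really recognize that it is working for us.",
    ("qin", "opus"): "We are cart of that coal so we really recognize that it is working for us.",
    ("qin", "amr_wb"): "We are caught of that soul. So we really recognize that it is working for us.",
}


def surviving_target_words(transcript, target_words=TARGET_WORDS):
    """Return the target words present in the transcript, ignoring case and punctuation."""
    found = set(re.findall(r"[a-z]+", transcript.lower()))
    return sorted(target_words & found)


for (attack, stage), text in transcripts.items():
    print(f"{attack:>3} {stage:<12} {surviving_target_words(text)}")
```

On these transcripts, both target words survive OPUS at 64 kbps for the 400-iteration run, only "coal" survives for the 200-iteration run, and neither survives AMR-WB at 23.85 kbps in either run.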


In [19]:
# Save results to JSON
import os
audio_basename = os.path.splitext(os.path.basename(AUDIO_FILE_PATH))[0]
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
results_filename = f"results_{audio_basename}_{timestamp}.json"

with open(results_filename, 'w') as f:
    json.dump(results, f, indent=2)

print("\n" + "="*80)
print(f"Results saved to: {results_filename}")
print("="*80)



Results saved to: results_sample-070236_20251202_031854.json
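
For later analysis, the saved JSON can be flattened into one row per attack and stage. The sketch below assumes the structure built in this notebook: `results["attacks"][attack][stage]` holds either `file`, `transcript` and `metrics`, or an `error` message. The helper `summarize` is a hypothetical name, and `example` is a hand-built stand-in whose metric values are the rounded C&W figures printed above, not the contents of the saved file.

```python
import json


def summarize(results):
    """Flatten results["attacks"] into (attack, stage, wer, note) rows."""
    rows = []
    for attack, stages in results["attacks"].items():
        for stage, entry in stages.items():
            if "error" in entry:
                # Failed stages carry only an error message
                rows.append((attack, stage, None, entry["error"]))
            else:
                rows.append((attack, stage, entry["metrics"]["wer"], entry["transcript"]))
    return rows


# Hand-built stand-in for a results file (values rounded as printed in the outputs above)
example = {
    "attacks": {
        "carlini_wagner": {
            "adversarial": {
                "file": "adv_cw.wav",
                "transcript": "We are cart of that coal so we really recognize that it is working for us.",
                "metrics": {"wer": 0.125, "cer": 0.0667, "snr": 6.66, "pesq": 1.1582, "stoi": 0.57},
            },
            # Hypothetical failure, included only to exercise the error branch
            "opus": {"error": "Compression failed"},
        },
        "qin": {},
    }
}

# Round-trip through JSON to mimic loading the saved file
loaded = json.loads(json.dumps(example))
for row in summarize(loaded):
    print(row)
```

To use it on a real run, replace `loaded` with `json.load(open(results_filename))`, where `results_filename` is the name printed by the cell above.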
