# MMLU-Pro Audio Dataset Pipeline

This notebook converts the questions in the MMLU-Pro validation split into an organized audio dataset.

**Pipeline Steps:**
1. Extract questions from HuggingFace MMLU-Pro validation split (70 questions)
2. Convert each question to MP3 audio
3. Generate audio variations with different effects
4. Upload to Google Drive with organized folder structure
5. Track all files in Google Sheets

**Output Structure:**
- **Google Drive**: `/{parent_folder}/{question_id}/` containing all MP3 variants
- **Google Sheets**: One worksheet per effect type, each listing the files generated for that effect

---
## 1. Setup & Installation

In [None]:
# Install required packages
# Note: pydub relies on ffmpeg for MP3 export (typically preinstalled on Colab)
!pip install datasets gtts librosa soundfile pydub scipy numpy gspread google-auth google-auth-oauthlib google-auth-httplib2 google-api-python-client -q

---
## 2. Import Libraries

In [None]:
import os
from pathlib import Path
from typing import List, Dict, Optional

# Audio processing
import librosa
import soundfile as sf
import numpy as np
from scipy.signal import fftconvolve
from pydub import AudioSegment
from gtts import gTTS

# Dataset
from datasets import load_dataset

# Google services
import gspread
from google.colab import auth
from google.auth import default
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

print("✓ All libraries imported successfully!")

---
## 3. Configuration

**IMPORTANT**: Update these values before running!

In [None]:
# ==== USER CONFIGURATION ====

# Google Sheets link (the authenticated account must have edit access)
GOOGLE_SHEET_LINK = "https://docs.google.com/spreadsheets/d/YOUR_SHEET_ID/edit"

# Google Drive parent folder ID (where question folders will be created)
# To get this: open the folder in Drive; the ID is the part of the URL after /folders/
DRIVE_PARENT_FOLDER_ID = "YOUR_FOLDER_ID"

# Path to your saved background effect files (optional)
# Set to None if you don't have these files
EFFECTS_FOLDER = "./saved_effects"  # or None

# Question range to process (0-69 for full validation set)
START_QUESTION = 0
END_QUESTION = 69  # Inclusive

# Local output directory
OUTPUT_DIR = "./outputs"

print("✓ Configuration loaded!")
print(f"  Will process questions {START_QUESTION} to {END_QUESTION}")
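
The IDs can also be pulled out of pasted URLs instead of being copied by hand. The following is a minimal sketch, assuming the usual `/folders/<id>` and `/spreadsheets/d/<id>` URL shapes. The helper names (`extract_drive_folder_id`, `extract_sheet_id`, `validate_question_range`) are hypothetical and not part of this notebook, and the default of 70 questions comes from the validation split size stated above.

```python
import re
from typing import Optional

# Hypothetical helpers (not part of the notebook) for checking configuration values.
_FOLDER_RE = re.compile(r"/folders/([A-Za-z0-9_-]+)")
_SHEET_RE = re.compile(r"/spreadsheets/d/([A-Za-z0-9_-]+)")


def extract_drive_folder_id(url: str) -> Optional[str]:
    """Return the ID that follows /folders/ in a Drive URL, or None if absent."""
    match = _FOLDER_RE.search(url)
    return match.group(1) if match else None


def extract_sheet_id(url: str) -> Optional[str]:
    """Return the ID that follows /spreadsheets/d/ in a Sheets URL, or None."""
    match = _SHEET_RE.search(url)
    return match.group(1) if match else None


def validate_question_range(start: int, end: int, total: int = 70) -> int:
    """Check an inclusive [start, end] range and return how many questions it covers."""
    if not (0 <= start <= end < total):
        raise ValueError(
            f"Invalid range {start}..{end}; expected 0 <= start <= end < {total}"
        )
    return end - start + 1
```

Calling `validate_question_range(START_QUESTION, END_QUESTION)` before authenticating would surface a bad range early, before any Drive folders are created.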

---
## 4. Google Authentication

In [None]:
# Authenticate with Google (will prompt for authorization)
print("Authenticating with Google...")
auth.authenticate_user()
creds, _ = default()

# Initialize Google APIs
gc = gspread.authorize(creds)
drive_service = build('drive', 'v3', credentials=creds)

print("✓ Google authentication successful!")

---
## 5. Core Audio Processing Classes

In [None]:
class AudioProcessor:
    """Handles all audio generation and transformation operations."""
    
    def __init__(self, output_dir: str = "./outputs"):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
    
    def text_to_audio(self, text: str, output_path: str) -> bool:
        """Convert text to MP3 using Google TTS."""
        try:
            tts = gTTS(text=text, lang='en', slow=False)
            tts.save(output_path)
            return True
        except Exception as e:
            print(f"Error converting text to audio: {e}")
            return False
    
    @staticmethod
    def _shift_pitch(data, sr, n_steps=0):
        """Shift pitch by n_steps semitones."""
        return librosa.effects.pitch_shift(y=data, sr=sr, n_steps=n_steps)
    
    @staticmethod
    def _stretch_time(data, rate=1.0):
        """Change speed by rate factor."""
        return librosa.effects.time_stretch(y=data, rate=rate)
    
    @staticmethod
    def _apply_reverb(data, sr, room_size=0.5, wet_dry=0.3):
        """Add reverb effect."""
        reverb_duration = room_size * 2.0
        ir_length = int(reverb_duration * sr)
        t = np.linspace(0, reverb_duration, ir_length)
        decay = np.exp(-3.0 * t / reverb_duration)
        impulse = decay * np.random.randn(ir_length) * 0.1
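        # Truncating mode='full' to len(data) keeps the tail after the dry signal;
        # mode='same' would center the result and shift the reverb earlier in time
        reverb_signal = fftconvolve(data, impulse, mode='full')[:len(data)]
        output = (1 - wet_dry) * data + wet_dry * reverb_signal
        # Peak-normalize, guarding against division by zero on silent input
        peak = np.max(np.abs(output))
        return output / peak if peak > 0 else output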
    
    @staticmethod
    def _overlay_audio(data, sr, overlay_path, volume_ratio=1.0):
        """Overlay background audio onto original."""
        overlay_data, _ = librosa.load(overlay_path, sr=sr)
        original_length = len(data)
        overlay_length = len(overlay_data)
        
        # Match lengths
        if overlay_length < original_length:
            num_repeats = int(np.ceil(original_length / overlay_length))
            overlay_data = np.tile(overlay_data, num_repeats)[:original_length]
        else:
            overlay_data = overlay_data[:original_length]
        
        # Mix and normalize
        overlay_data = overlay_data * volume_ratio
        mixed = data + overlay_data
        max_val = np.max(np.abs(mixed))
        if max_val > 1.0:
            mixed = mixed / max_val
        return mixed
    
    @staticmethod
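    def _save_as_mp3(data, sr, output_path):
        """Save audio data as MP3 via a temporary WAV file."""
        output_path = Path(output_path)
        temp_wav = output_path.with_name(output_path.stem + '_temp.wav')
        sf.write(str(temp_wav), data, sr)
        try:
            audio = AudioSegment.from_wav(str(temp_wav))
            audio.export(str(output_path), format='mp3', bitrate='192k')
        finally:
            # Remove the temporary WAV even if the MP3 export fails
            if temp_wav.exists():
                temp_wav.unlink()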
    
    def apply_effect(self, input_path: str, output_path: str, effect_type: str, 
                    effect_value: Optional[float] = None,
                    overlay_path: Optional[str] = None) -> bool:
        """Apply a single effect to an audio file.
        
        Args:
            input_path: Path to input MP3
            output_path: Path to save output MP3
            effect_type: One of 'pitch_up', 'pitch_down', 'speed_up', 'speed_down', 
                        'reverb', or 'overlay'
            effect_value: Parameter value for the effect
            overlay_path: Path to overlay audio (for overlay effect)
        """
        try:
            data, sr = librosa.load(input_path, sr=None)
            
            if effect_type == 'pitch_up':
                data = self._shift_pitch(data, sr, n_steps=effect_value or 4)
            elif effect_type == 'pitch_down':
                data = self._shift_pitch(data, sr, n_steps=-(effect_value or 4))
            elif effect_type == 'speed_up':
                data = self._stretch_time(data, rate=effect_value or 1.5)
            elif effect_type == 'speed_down':
                data = self._stretch_time(data, rate=effect_value or 0.7)
            elif effect_type == 'reverb':
                data = self._apply_reverb(data, sr, room_size=effect_value or 0.5)
            elif effect_type == 'overlay' and overlay_path:
                data = self._overlay_audio(data, sr, overlay_path, 
                                          volume_ratio=effect_value or 1.0)
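            else:
                # Unknown effect name, or 'overlay' requested without an overlay file
                print(f"Unsupported effect '{effect_type}' or missing overlay path")
                return False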
            
            self._save_as_mp3(data, sr, output_path)
            return True
        except Exception as e:
            print(f"Error applying {effect_type} effect: {e}")
            return False

print("✓ AudioProcessor class loaded!")
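
The arithmetic behind the default effect values can be checked without any audio libraries. The following is a rough sketch in plain Python, not the notebook's implementation: a shift of `n` semitones multiplies frequency by `2 ** (n / 12)`, a stretch rate above 1 shortens the clip, and the overlay logic loops or trims the background before mixing. All function names here are hypothetical, and lists of floats stand in for the NumPy arrays used in `AudioProcessor`.

```python
from typing import List


def semitones_to_ratio(n_steps: float) -> float:
    """Frequency ratio for a shift of n_steps equal-tempered semitones."""
    return 2.0 ** (n_steps / 12.0)


def stretched_duration(duration_s: float, rate: float) -> float:
    """Approximate duration after time-stretching: rate > 1 shortens the clip."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return duration_s / rate


def loop_to_length(samples: List[float], length: int) -> List[float]:
    """Repeat or truncate samples so the result has exactly `length` items."""
    if not samples:
        raise ValueError("samples must not be empty")
    repeats = -(-length // len(samples))  # ceiling division
    return (samples * repeats)[:length]


def mix_with_overlay(speech: List[float], overlay: List[float],
                     volume_ratio: float = 1.0) -> List[float]:
    """Add a scaled, length-matched overlay and rescale only if the mix clips."""
    bed = loop_to_length(overlay, len(speech))
    mixed = [s + volume_ratio * b for s, b in zip(speech, bed)]
    peak = max((abs(x) for x in mixed), default=0.0)
    if peak > 1.0:
        mixed = [x / peak for x in mixed]
    return mixed


print(round(semitones_to_ratio(4), 4))          # 1.2599 (pitch_up default)
print(round(stretched_duration(10.0, 1.5), 2))  # 6.67 seconds (speed_up default)
```

Because the mix is rescaled only when its peak exceeds 1.0, a `volume_ratio` above 1 (as used for `rain` and `coffee_shop`) makes the background louder relative to the speech rather than louder overall.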

---
## 6. Google Drive & Sheets Manager

In [None]:
class GoogleManager:
    """Manages Google Drive uploads and Sheets updates."""
    
    def __init__(self, sheet_link: str, drive_parent_id: str, 
                 drive_service, sheets_client):
        self.drive_service = drive_service
        self.gc = sheets_client
        self.drive_parent_id = drive_parent_id
        
        # Open the Google Sheet
        try:
            self.sheet = self.gc.open_by_url(sheet_link)
            print("✓ Connected to Google Sheet")
        except Exception as e:
            # Chain the original exception so the root cause stays in the traceback
            raise RuntimeError(f"Error opening Google Sheet: {e}") from e
    
    def create_question_folder(self, question_id: int) -> str:
        """Create a folder for a question in Google Drive.
        
        Returns:
            folder_id: The ID of the created folder
        """
        folder_metadata = {
            'name': str(question_id),
            'mimeType': 'application/vnd.google-apps.folder',
            'parents': [self.drive_parent_id]
        }
        folder = self.drive_service.files().create(
            body=folder_metadata, 
            fields='id'
        ).execute()
        return folder.get('id')
    
    def upload_file(self, local_path: str, filename: str, 
                   folder_id: str) -> str:
        """Upload a file to Google Drive.
        
        Returns:
            webViewLink: Browser link to the uploaded file (access follows the
                file's sharing settings; the link is not public by default)
        """
        file_metadata = {
            'name': filename,
            'parents': [folder_id]
        }
        media = MediaFileUpload(local_path, resumable=True)
        uploaded_file = self.drive_service.files().create(
            body=file_metadata,
            media_body=media,
            fields='id, webViewLink'
        ).execute()
        return uploaded_file.get('webViewLink')
    
    def update_sheet(self, sheet_name: str, question_id: int, 
                    filename: str, file_url: str):
        """Add a row to the specified sheet."""
        try:
            worksheet = self.sheet.worksheet(sheet_name)
        except gspread.exceptions.WorksheetNotFound:
            # Create the worksheet if it doesn't exist
            worksheet = self.sheet.add_worksheet(
                title=sheet_name, 
                rows=100, 
                cols=3
            )
            # Add headers
            worksheet.append_row(["Question ID", "Filename", "Drive Link"])
        
        # Append the data
        worksheet.append_row([question_id, filename, file_url])

print("✓ GoogleManager class loaded!")
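
Drive and Sheets calls can fail intermittently, for example when a request quota is hit, and a full run makes several such calls per question. One option is to wrap each call in a retry with exponential backoff. The sketch below is an assumption-laden illustration, not part of `GoogleManager`: `call_with_retries` is a hypothetical helper, and it retries on any exception, whereas a real run would likely want to retry only on errors known to be transient.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(func: Callable[[], T], max_attempts: int = 4,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func(), retrying with exponentially growing delays on failure.

    Hypothetical helper: retries on ANY exception (an assumption of this sketch).
    The sleep function is injectable so the helper can be tested without waiting.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

In this notebook it could be used as `call_with_retries(lambda: google_manager.upload_file(local_path, filename, folder_id))`.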

---
## 7. Main Pipeline Orchestrator

In [None]:
class AudioDatasetPipeline:
    """Main pipeline that orchestrates the entire process."""
    
    def __init__(self, output_dir: str, google_manager: GoogleManager, 
                 effects_folder: Optional[str] = None):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.audio_processor = AudioProcessor(output_dir)
        self.google_manager = google_manager
        self.effects_folder = Path(effects_folder) if effects_folder else None
        
        # Define all effects to apply
        self.effects_config = self._build_effects_config()
    
    def _build_effects_config(self) -> List[Dict]:
        """Build the configuration for all effects to apply."""
        config = [
            {'name': 'original', 'type': None},  # No effect, just the original
            {'name': 'pitch_up', 'type': 'pitch_up', 'value': 4},
            {'name': 'pitch_down', 'type': 'pitch_down', 'value': 4},
            {'name': 'speed_up', 'type': 'speed_up', 'value': 1.5},
            {'name': 'speed_down', 'type': 'speed_down', 'value': 0.7},
            {'name': 'reverb', 'type': 'reverb', 'value': 0.5},
        ]
        
        # Add overlay effects if effects folder is provided
        if self.effects_folder and self.effects_folder.exists():
            overlays = [
                ('wind', 'wind.mp3', 0.75),
                ('rain', 'rain.mp3', 1.25),
                ('coffee_shop', 'coffee_shop.mp3', 1.4),
                ('busy_street', 'busy_street.mp3', 0.6),
                ('music', 'song1.mp3', 0.35),
            ]
            for name, filename, volume in overlays:
                overlay_path = self.effects_folder / filename
                if overlay_path.exists():
                    config.append({
                        'name': name,
                        'type': 'overlay',
                        'value': volume,
                        'overlay_path': str(overlay_path)
                    })
        
        return config
    
    def process_question(self, question_id: int, question_text: str) -> bool:
        """Process a single question through the entire pipeline.
        
        Steps:
        1. Generate original MP3
        2. Create the question's Google Drive folder
        3. Apply all effects
        4. Upload to Google Drive and update Google Sheets
        5. Clean up local files
        """
        print(f"\n{'='*60}")
        print(f"Processing Question {question_id}")
        print(f"{'='*60}")
        print(f"Text: {question_text[:100]}...")
        
        try:
            # Step 1: Generate original MP3
            original_filename = f"{question_id}_original.mp3"
            original_path = self.output_dir / original_filename
            
            print(f"\n[1/4] Generating original audio...")
            if not self.audio_processor.text_to_audio(question_text, str(original_path)):
                print(f"  ✗ Failed to generate audio for question {question_id}")
                return False
            print(f"  ✓ Saved: {original_filename}")
            
            # Step 2: Create Google Drive folder
            print(f"\n[2/4] Creating Google Drive folder...")
            folder_id = self.google_manager.create_question_folder(question_id)
            print(f"  ✓ Folder created for question {question_id}")
            
            # Step 3: Generate all effects (uploads happen in the next step)
            print(f"\n[3/4] Generating effects...")
            files_to_upload = []  # (local_path, filename, effect_name)
            
            for effect_config in self.effects_config:
                effect_name = effect_config['name']
                filename = f"{question_id}_{effect_name}.mp3"
                output_path = self.output_dir / filename
                
                # For original, just use the already generated file
                if effect_config['type'] is None:
                    files_to_upload.append((str(original_path), filename, effect_name))
                    continue
                
                # Generate effect
                print(f"  • Generating {effect_name}...")
                success = self.audio_processor.apply_effect(
                    str(original_path),
                    str(output_path),
                    effect_config['type'],
                    effect_config.get('value'),
                    effect_config.get('overlay_path')
                )
                
                if success:
                    files_to_upload.append((str(output_path), filename, effect_name))
                else:
                    print(f"    ✗ Failed to generate {effect_name}")
            
            # Upload all files and update sheets
            print(f"\n[4/4] Uploading to Drive and updating Sheets...")
            for local_path, filename, effect_name in files_to_upload:
                # Upload to Drive
                file_url = self.google_manager.upload_file(
                    local_path, filename, folder_id
                )
                
                # Update corresponding sheet
                self.google_manager.update_sheet(
                    effect_name, question_id, filename, file_url
                )
                print(f"  ✓ Uploaded {filename} → {effect_name} sheet")
            
            # Step 5: Clean up local files
            print(f"\n  Cleaning up local files...")
            for local_path, _, _ in files_to_upload:
                if os.path.exists(local_path):
                    os.remove(local_path)
            
            print(f"\n✓ Question {question_id} completed successfully!")
            return True
            
        except Exception as e:
            print(f"\n✗ Error processing question {question_id}: {e}")
            return False

print("✓ AudioDatasetPipeline class loaded!")
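
Each output file is named `{question_id}_{effect_name}.mp3` and is recorded in the worksheet named after the effect. The sketch below lists the files a question should produce, which can help when comparing a Drive folder against expectations. `expected_filenames` and `BASE_EFFECTS` are hypothetical names introduced for illustration; the overlay effects are left out of the list because they depend on which files exist in the effects folder.

```python
from typing import Dict, List

# The six effects that are always configured (overlays depend on available files).
BASE_EFFECTS: List[str] = [
    "original", "pitch_up", "pitch_down", "speed_up", "speed_down", "reverb",
]


def expected_filenames(question_id: int, effect_names: List[str]) -> Dict[str, str]:
    """Map each effect (and worksheet) name to the MP3 filename the pipeline uses."""
    return {name: f"{question_id}_{name}.mp3" for name in effect_names}


for sheet_name, filename in expected_filenames(7, BASE_EFFECTS).items():
    print(f"{sheet_name:>10} -> {filename}")
```

Inside the notebook, passing `[e['name'] for e in pipeline.effects_config]` instead of `BASE_EFFECTS` would include whichever overlay effects were actually configured.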

---
## 8. Load Dataset

In [None]:
print("Loading MMLU-Pro validation dataset...")
dataset = load_dataset("TIGER-Lab/MMLU-Pro", split="validation", streaming=True)

# Peek at first row
first_row = next(iter(dataset))
print(f"\n✓ Dataset loaded successfully!")
print(f"  Available columns: {list(first_row.keys())}")
print(f"  Sample question: {first_row['question'][:100]}...")

---
## 9. Initialize Pipeline

In [None]:
# Initialize Google Manager
print("Initializing Google Manager...")
google_manager = GoogleManager(
    sheet_link=GOOGLE_SHEET_LINK,
    drive_parent_id=DRIVE_PARENT_FOLDER_ID,
    drive_service=drive_service,
    sheets_client=gc
)

# Initialize Pipeline
print("Initializing Audio Dataset Pipeline...")
pipeline = AudioDatasetPipeline(
    output_dir=OUTPUT_DIR,
    google_manager=google_manager,
    effects_folder=EFFECTS_FOLDER
)

print(f"\n✓ Pipeline initialized!")
print(f"  Effects to apply: {[e['name'] for e in pipeline.effects_config]}")

---
## 10. Run Pipeline

This cell processes all questions in the specified range.

In [None]:
# Process all questions
print(f"\n{'#'*60}")
print(f"STARTING PIPELINE")
print(f"Processing questions {START_QUESTION} to {END_QUESTION}")
print(f"{'#'*60}\n")

success_count = 0
fail_count = 0

# Skip to start position and take the range we want
num_questions = END_QUESTION - START_QUESTION + 1
questions_to_process = dataset.skip(START_QUESTION).take(num_questions)

for idx, row in enumerate(questions_to_process):
    question_id = START_QUESTION + idx
    question_text = row['question']
    
    success = pipeline.process_question(question_id, question_text)
    
    if success:
        success_count += 1
    else:
        fail_count += 1

# Final summary
print(f"\n{'#'*60}")
print(f"PIPELINE COMPLETE")
print(f"{'#'*60}")
print(f"  ✓ Successful: {success_count}")
print(f"  ✗ Failed: {fail_count}")
print(f"  Total: {success_count + fail_count}")
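
The skip-then-take pattern and the inclusive `END_QUESTION` are easy to get off by one. As an analogy only, the same selection on a plain Python iterable can be sketched with `itertools.islice`; `select_range` is a hypothetical helper and does not touch the streaming dataset API.

```python
from itertools import islice
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def select_range(rows: Iterable[T], start: int,
                 end_inclusive: int) -> Iterator[Tuple[int, T]]:
    """Yield (question_id, row) pairs for the inclusive range [start, end_inclusive]."""
    count = end_inclusive - start + 1  # inclusive end, so add one
    for offset, row in enumerate(islice(rows, start, start + count)):
        yield start + offset, row


# Rows 2..4 of a toy "dataset" keep their original positions as IDs.
print(list(select_range("abcdefg", 2, 4)))  # [(2, 'c'), (3, 'd'), (4, 'e')]
```

If the range extends past the end of the data, the iteration simply stops early, so the final summary reports fewer than `END_QUESTION - START_QUESTION + 1` questions.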

---
## 11. Process Single Question (Optional)

Use this cell to process a single question for testing.

In [None]:
# Test with a single question
TEST_QUESTION_ID = 0

# Get the question
test_question = next(iter(dataset.skip(TEST_QUESTION_ID).take(1)))
question_text = test_question['question']

print(f"Testing with question {TEST_QUESTION_ID}:")
print(f"Text: {question_text}\n")

# Process
pipeline.process_question(TEST_QUESTION_ID, question_text)

---
## 12. Utility Functions

In [None]:
def check_sheet_status(sheet_name: str):
    """Check how many entries are in a specific sheet."""
    try:
        worksheet = google_manager.sheet.worksheet(sheet_name)
        all_values = worksheet.get_all_values()
        # max() avoids reporting -1 entries when the sheet is completely empty
        print(f"Sheet '{sheet_name}': {max(len(all_values) - 1, 0)} entries (excluding header)")
        return all_values
    except gspread.exceptions.WorksheetNotFound:
        print(f"Sheet '{sheet_name}' does not exist yet.")
        return []

def check_all_sheets():
    """Check status of all effect sheets."""
    print("\nChecking all sheets:")
    print("="*40)
    for effect in pipeline.effects_config:
        check_sheet_status(effect['name'])

# Run check
# check_all_sheets()
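
The rows returned by `check_sheet_status` can also show which questions still need processing after a partial run. The following is a minimal sketch, assuming the first row is the header written by `update_sheet` and the first column holds the question ID as text; `missing_question_ids` is a hypothetical helper, not part of the notebook.

```python
from typing import List, Set


def missing_question_ids(all_values: List[List[str]], start: int,
                         end_inclusive: int) -> List[int]:
    """Return question IDs in [start, end_inclusive] with no row in the sheet."""
    seen: Set[int] = set()
    for row in all_values[1:]:  # skip the header row (assumed present)
        if row and row[0].strip().isdigit():
            seen.add(int(row[0]))
    return [qid for qid in range(start, end_inclusive + 1) if qid not in seen]


rows = [
    ["Question ID", "Filename", "Drive Link"],
    ["0", "0_reverb.mp3", "https://example.invalid/a"],
    ["2", "2_reverb.mp3", "https://example.invalid/b"],
]
print(missing_question_ids(rows, 0, 3))  # [1, 3]
```

For example, `missing_question_ids(check_sheet_status('reverb'), START_QUESTION, END_QUESTION)` would list the questions that have no `reverb` entry yet.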

---
## Summary

This notebook provides a complete pipeline with five stages:

1. **Data Extraction**: Loads MMLU-Pro validation dataset from HuggingFace
2. **Audio Generation**: Converts questions to MP3 using Google TTS
3. **Effect Application**: Applies various audio effects (pitch, speed, reverb, overlays)
4. **Cloud Storage**: Uploads organized files to Google Drive
5. **Tracking**: Maintains detailed records in Google Sheets

**Output Structure:**
- **Google Drive**: `/{parent_folder}/{question_id}/` contains all MP3 variants
- **Google Sheets**: Each effect has its own sheet with question_id, filename, and drive link

**Supported Effects:**
- `original` - No effect applied
- `pitch_up` - Increase pitch by 4 semitones
- `pitch_down` - Decrease pitch by 4 semitones  
- `speed_up` - Increase speed to 1.5x
- `speed_down` - Decrease speed to 0.7x
- `reverb` - Add reverb effect
- Background overlays (if effect files provided): `wind`, `rain`, `coffee_shop`, `busy_street`, `music`