# Vertex Inference Unified

This notebook uses the unified `google-genai` library (imported as `from google import genai`). It supports two backends:
- **Vertex AI Backend:** Uploads videos to GCS during the 'Prepare' step. Uses your own Google Cloud account.
- **Gemini API Backend:** Uploads videos through the **File API** during the 'Prepare' step. Uses an API key.
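
Both backends go through the same `genai.Client` constructor; only the keyword arguments differ. A minimal sketch of that switch follows. `client_kwargs` is a hypothetical helper (it is not part of the SDK or of this notebook), and the project, region, and key values are placeholders.

```python
from typing import Any, Dict, Optional


def client_kwargs(use_vertex: bool,
                  project: Optional[str] = None,
                  location: Optional[str] = None,
                  api_key: Optional[str] = None) -> Dict[str, Any]:
    """Return the keyword arguments for genai.Client for the chosen backend."""
    if use_vertex:
        if not project or not location:
            raise ValueError("Vertex AI mode needs both project and location.")
        return {"vertexai": True, "project": project, "location": location}
    if not api_key:
        raise ValueError("Gemini API mode needs an API key.")
    return {"vertexai": False, "api_key": api_key}


# Placeholder values, for illustration only.
vertex_args = client_kwargs(True, project="my-project", location="us-central1")
gemini_args = client_kwargs(False, api_key="PLACEHOLDER_KEY")
print(vertex_args)
print(gemini_args)

# With the SDK installed, the client would then be created as:
#   from google import genai
#   client = genai.Client(**vertex_args)
```

Keeping the backend choice in one place like this makes it harder to mix Vertex credentials with Gemini API uploads by accident.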

**Pipeline Steps:**
1.  **Import Libraries & Configure:** Import dependencies and set up the pipeline configuration, including the model, model config, prompts, and prompt config.
2.  **Initialize Clients:** Set up the AI client and the storage client.
3.  **(Only the first time) Fetch Dataset:** Downloads metadata from HuggingFace.
4.  **(Only the first time for each backend) Download, Extract & Prepare Videos:** Downloads and extracts the videos, uploads them (GCS or File API), and updates the metadata.
5.  **Bulk Inference (Async):** Runs inference on the pre-uploaded video resources with the baseline models (Gemini 2.5 Pro / Gemini 2.5 Flash).

### Notes:

1. After switching from Vertex to Gemini or vice versa:
    - Run all cells in order to re-upload the videos to the correct storage client. You can set the `SKIP_DOWNLOAD_ZIP` and `SKIP_EXTRACT` flags to `True` to skip the download and extraction steps; only the upload step is needed.

2. Files uploaded through the Gemini API's File API expire after a limited time (about 48 hours, per the Gemini API documentation). Once they expire, repeat the step above to re-upload them.
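
Note 2 means a stored `file_api_name` can go stale between runs. One way to decide when to redo the upload step is to track the upload time yourself. The sketch below assumes a hypothetical `uploaded_at` timestamp (the notebook's metadata file does not record one) and a conservative 24-hour window; the real retention period may be longer, so treat `ttl` as a value to tune.

```python
from datetime import datetime, timedelta, timezone


def needs_reupload(uploaded_at: datetime,
                   now: datetime,
                   ttl: timedelta = timedelta(hours=24)) -> bool:
    """Return True once an upload is at least `ttl` old (ttl is an assumed value)."""
    return now - uploaded_at >= ttl


# Hypothetical upload time, for illustration only.
uploaded = datetime(2025, 4, 24, 4, 0, tzinfo=timezone.utc)
fresh = needs_reupload(uploaded, uploaded + timedelta(hours=3))
stale = needs_reupload(uploaded, uploaded + timedelta(hours=30))
print(fresh, stale)
```

A check like this is only a heuristic; asking the File API whether the resource still exists, as `verify_file_api_resource_exists` does later in the notebook, remains the authoritative test.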

## Import Libraries

In [1]:
# Cell 1: Imports (Corrected for `google.genai`)
import os
import csv
import json
import logging
import time
import random
import requests
import datetime
import zipfile
import math
import sys
import asyncio
from typing import Dict, List, Optional, Set, Tuple, Any
from collections import defaultdict
from pathlib import Path
import shutil
import subprocess
import tempfile
import fractions

# Google Cloud & AI Libraries (Unified SDK)
try:
    import google.genai as genai
    from google.genai import types
    from google.genai import errors as genai_errors
    from google.api_core import exceptions as api_core_exceptions
    # GCS Client (Optional, for Vertex Mode)
    try:
        from google.cloud import storage
        GCS_AVAILABLE = True
    except ImportError:
        print("INFO: google-cloud-storage not found. Vertex AI GCS operations unavailable.")
        storage = None
        GCS_AVAILABLE = False
    print("`google.genai` SDK and helpers imported successfully.")
except ImportError as e:
    print(f"ERROR: Failed to import Google libraries: {e}. Install: pip install google-genai google-api-core google-cloud-storage")
    genai = None; types = None; genai_errors = None; api_core_exceptions = None
    storage = None; GCS_AVAILABLE = False
    # Chain the original error so the missing package stays visible in the traceback
    raise ImportError("FATAL: `google.genai` or `google-api-core` SDK not found.") from e

# Data Handling & Progress
from datasets import load_dataset
import pandas as pd
from tqdm.notebook import tqdm

# UI Elements
import ipywidgets as widgets
from IPython.display import display, Markdown, HTML, clear_output

# Async in Notebook
import nest_asyncio
nest_asyncio.apply()

`google.genai` SDK and helpers imported successfully.


## Config Settings

In [None]:
# --- GCP Configuration ---
# Placeholder values below match the ones the validation code checks for.
# PROJECT_ID = "your-gcp-project-id"   # Your Google Cloud Project ID (Needed for GCS and Vertex AI mode)
# LOCATION = "your_google_cloud_region" # Your Google Cloud Region (Needed for Vertex AI mode)
# GCS_BUCKET = "your-gcs-bucket-name"   # Your GCS bucket name (Needed for video storage), required for Vertex AI mode

PROJECT_ID = "tiktokllm" # Your Google Cloud Project ID (Needed for GCS and Vertex AI mode)
LOCATION = "us-central1"      # Your Google Cloud Region (Needed for Vertex AI mode)
GCS_BUCKET = "seekdeep-ml-storage" # Your GCS bucket name (Needed for video storage)

# --- Choose Backend Mode ---
# Set USE_VERTEX to True to use the Vertex AI backend (requires ADC or service account auth).
# Set USE_VERTEX to False to use the Gemini API backend (requires GEMINI_API_KEY).
USE_VERTEX = True  # True: Vertex AI backend; False: Gemini API backend

# --- Gemini API Key (Only required if USE_VERTEX is False) ---
# IMPORTANT: Replace with your actual Gemini API Key if USE_VERTEX is False.
# Consider loading from environment variables (GOOGLE_API_KEY) or a secure secrets manager.
GEMINI_API_KEY = "YOUR_API_KEY_HERE"  # Replace with your actual Gemini API Key


# --- File Paths ---
DATASET_CSV = "dataset.csv"               # Input dataset metadata from HuggingFace
METADATA_FILE = "video_metadata_vertex_inj_correct_mcq.csv" if USE_VERTEX else "video_metadata_non_vertex_inj_correct_mcq.csv"      # Stores video info: video_id, local_path, gcs_uri (if Vertex), question data
RESULTS_FILE = "results_noncot_full_inference.csv"              # Output file for inference predictions
DOWNLOADS_DIR = "downloads"               # Directory for downloaded zip file
EXTRACTED_VIDEOS_DIR = "extracted_videos" # Directory storing extracted .mp4 files locally
SPEED_VIDEOS_DIR = "speed_videos"         # Stores sped up/slowed down videos
HF_CACHE_DIR = "./hf_cache"               # Cache directory for HuggingFace datasets

# --- Step 1: Fetch Dataset Configuration ---
HF_DATASET_NAME = "lmms-lab/AISG_Challenge" # HuggingFace dataset identifier
HF_DATASET_SPLIT = "test"                 # Dataset split to use
SKIP_FETCH = False                        # Set True to skip fetching if DATASET_CSV exists

# --- Step 2: Download & Prepare Videos Configuration ---
VIDEO_ZIP_URL = "https://huggingface.co/datasets/lmms-lab/AISG_Challenge/resolve/main/Benchmark-AllVideos-HQ-Encoded-challenge.zip?download=true"
ZIP_FILE_NAME = "all_videos.zip"
SKIP_DOWNLOAD_ZIP = True                 # Set True to skip downloading if zip exists
SKIP_EXTRACT = True                      # Set True to skip extraction if videos exist locally
SKIP_PREPARE = False                      # Set True to skip video preparation (GCS upload for Vertex, metadata update)
MAX_VIDEOS_TO_PROCESS = None              # Limit videos for testing (e.g., 5), None for all
UPLOAD_BATCH_SIZE_GCS = 10                # Batch size for GCS uploads (Vertex mode only)

# --- Inference Configuration ---
# Choose a model name compatible with your selected method (Vertex AI or Gemini API)

# Examples:

# Vertex AI: gemini-2.0-flash, gemini-2.0-flash-lite, gemini-2.0-pro-exp-02-05, gemini-2.0-flash-thinking-exp-01-21
# Rate limits: https://cloud.google.com/vertex-ai/generative-ai/docs/quotas#gemini-2.0-flash
# Roughly 500 requests per minute for 2.0-flash and 2.0-flash-lite (unlimited);
# 10 requests per minute for 2.0-pro-exp-02-05 and gemini-2.5-flash-preview-04-17.

# Gemini API: gemini-2.0-flash, gemini-2.0-flash-lite, gemini-2.0-flash-thinking-exp-01-21, gemini-2.5-pro-exp-03-25
# Rate limits: https://ai.google.dev/gemini-api/docs/rate-limits#tier-1
# Free tier: 30 requests per minute for 2.0-flash and 2.0-flash-lite; 10 requests per minute for 2.0-pro-exp-02-05.
# Tier 1 (paid): 2000 requests per minute for 2.0-flash and 2.0-flash-lite;
# 10 requests per minute for 2.0-pro-exp-02-05, gemini-2.0-flash-thinking-exp-01-21, and gemini-2.5-flash-preview-04-17.

# 1.0=normal speed, 0.5=half speed, etc.
VIDEO_SPEED_FACTOR = 0.5

# --- Setup Derived Paths & Directories ---
zip_file_path = Path(DOWNLOADS_DIR) / ZIP_FILE_NAME
extracted_videos_path = Path(EXTRACTED_VIDEOS_DIR)
speed_videos_path = Path(SPEED_VIDEOS_DIR) / str(VIDEO_SPEED_FACTOR)
Path(DOWNLOADS_DIR).mkdir(parents=True, exist_ok=True)
extracted_videos_path.mkdir(parents=True, exist_ok=True)
speed_videos_path.mkdir(parents=True, exist_ok=True)
Path(HF_CACHE_DIR).mkdir(parents=True, exist_ok=True)

# --- Logging Configuration ---
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', handlers=[logging.StreamHandler(sys.stdout)])
logger = logging.getLogger(__name__)

# --- Configuration Validation & Display --- #
warnings_found = False
if USE_VERTEX:
    if not PROJECT_ID or PROJECT_ID == "your-gcp-project-id":
        logger.error("Vertex AI mode requires PROJECT_ID to be set.")
        warnings_found = True
    if not LOCATION:
        logger.error("Vertex AI mode requires LOCATION to be set.")
        warnings_found = True
    if not GCS_BUCKET or GCS_BUCKET == "your-gcs-bucket-name":
        logger.error("Vertex AI mode requires GCS_BUCKET for video uploads.")
        warnings_found = True
    if not GCS_AVAILABLE:
        logger.error("Vertex AI mode requires 'google-cloud-storage', but it's not installed.")
        warnings_found = True
else: # Gemini API Mode
    # Check API Key (explicit or env var)
    effective_api_key = GEMINI_API_KEY if GEMINI_API_KEY != "YOUR_API_KEY_HERE" else os.environ.get("GOOGLE_API_KEY")
    if not effective_api_key:
        logger.error("Gemini API mode requires GEMINI_API_KEY or GOOGLE_API_KEY environment variable.")
        warnings_found = True
    else:
        # Keep GEMINI_API_KEY unchanged here: the client-initialization cell derives
        # the effective key from it again, so replacing it with a display string
        # would cause that string to be sent as the API key.
        logger.info("Gemini API mode configured. Videos will be uploaded via File API.")

if warnings_found:
    print("\n\n************************* WARNING *************************")
    print("Configuration errors detected above. Execution might fail.")
    print("***********************************************************\n")

2025-04-24 04:22:58,032 - INFO - Gemini API mode configured. Videos will be uploaded via File API.
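
The key lookup above (explicit value first, then the `GOOGLE_API_KEY` environment variable) can be isolated into a small function, with a separate masking helper for anything that gets logged or displayed. This is a sketch rather than the notebook's code: `resolve_api_key` and `mask` are hypothetical names, and unlike the cell above it also treats an empty string as "not set".

```python
import os
from typing import Mapping, Optional

PLACEHOLDER = "YOUR_API_KEY_HERE"


def resolve_api_key(explicit: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Prefer an explicitly configured key; otherwise fall back to GOOGLE_API_KEY."""
    if explicit and explicit != PLACEHOLDER:
        return explicit
    return env.get("GOOGLE_API_KEY") or None


def mask(key: Optional[str]) -> str:
    """Return a display-safe form of the key without changing the key itself."""
    if not key:
        return "(not set)"
    return key[:4] + "..." if len(key) > 8 else "****"


# Fake values, for illustration only.
key = resolve_api_key(PLACEHOLDER, {"GOOGLE_API_KEY": "abcd1234efgh"})
print(mask(key))
```

Masking a copy for display, instead of overwriting the configured variable, keeps the real key available to later cells.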


# Main Model Selection For Bulk Inference
Non CoT - Direct Answer

## 1. Select Model and Instruction Prompt

### NonCoT output models for Bulk Inferences

In [3]:
# To see all available models, go into the NonCoT_output_models.py file
# NonCoT models are used for Bulk Inference
from models.NonCoT_output_models import get_non_cot_model
non_cot_model_list = ["gemini-2.5-flash-preview-04-17", "gemini-2.5-pro-exp-03-25"]

(MODEL_NAME, SYSTEM_PROMPT, PROMPT_TEMPLATES, CONFIG,
 REQUESTS_PER_MINUTE, MAX_RETRIES, MAX_ASYNC_WORKERS) = get_non_cot_model(non_cot_model_list[0])

# Basic Initialization

## Initialize Google Cloud Clients

In [4]:
storage_client = None
ai_client = None

# --- Initialize Generative AI Client (`google.genai`) --- #
display(Markdown("### Initializing Generative AI Client (`google.genai`)"))
try:
    if USE_VERTEX:
        display(Markdown(f"Vertex AI backend (Project: {PROJECT_ID}, Loc: {LOCATION})..."))
        if not PROJECT_ID or not LOCATION or PROJECT_ID == "your-gcp-project-id":
             raise ValueError("PROJECT_ID/LOCATION invalid for Vertex AI.")
        # Initialize Client for Vertex
        ai_client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)
        display(Markdown(f"✅ Vertex AI Client Initialized."))
    else: # Gemini API Mode
        display(Markdown("Gemini API backend (using API Key)..."))
        effective_api_key = GEMINI_API_KEY if GEMINI_API_KEY != "YOUR_API_KEY_HERE" else os.environ.get("GOOGLE_API_KEY")
        if not effective_api_key:
            if os.environ.get("GOOGLE_API_KEY"):
                effective_api_key = None  # Client reads the env var itself
            else:
                raise ValueError("Gemini API Key required but not found.")
        # Initialize Client for Gemini API
        ai_client = genai.Client(api_key=effective_api_key, vertexai=False)
        display(Markdown(f"✅ Gemini API Client Initialized."))

except ValueError as ve:
    display(Markdown(f"❌ **Config Error:** {ve}"))
    ai_client = None
except Exception as e:
    display(Markdown(f"❌ **AI Client Error:** {e}."))
    logger.error("AI Client Init Failed", exc_info=True)
    ai_client = None

# --- Initialize Storage Client (ONLY for Vertex AI mode) --- #
if USE_VERTEX:
    display(Markdown("### Initializing GCS Client (Vertex Mode Only)"))
    if not GCS_AVAILABLE:
        display(Markdown("❌ GCS lib missing."))
        raise RuntimeError("Missing GCS lib.")
    if not GCS_BUCKET or GCS_BUCKET == "your-gcs-bucket-name":
        display(Markdown("❌ GCS_BUCKET needed."))
        raise ValueError("GCS_BUCKET required.")
    try:
        storage_client = storage.Client(project=PROJECT_ID)
        if not storage_client.bucket(GCS_BUCKET).exists():
            display(Markdown(f"⚠️ GCS Bucket `{GCS_BUCKET}` inaccessible."))
        else:
            display(Markdown(f"✅ GCS Client Initialized (Bucket: '{GCS_BUCKET}')."))
    except Exception as e:
        display(Markdown(f"❌ **GCS Client Error:** {e}."))
        logger.error("GCS Client Init Failed", exc_info=True)
        if not SKIP_PREPARE:
            raise RuntimeError("GCS client failed.")
        else:
            display(Markdown("⚠️ GCS client failed, but skipping prep."))
else:
    display(Markdown("### Initializing Gemini API Client (File API)"))
    try:
        storage_client = ai_client.files
    except Exception as e:
        display(Markdown(f"❌ **Gemini File API Client Error:** {e}."))
        logger.error("Gemini API Client Init Failed", exc_info=True)
    else:
        # Report success only when the File API handle was actually obtained
        display(Markdown("✅ Gemini File API Client Initialized."))

# --- Final Checks --- #
if ai_client is None: raise RuntimeError("AI client failed.")

if USE_VERTEX and storage_client is None and not SKIP_PREPARE: raise RuntimeError("GCS client failed for Vertex prep.")
display(Markdown("✅ Client initialization complete."))


### Initializing Generative AI Client (`google.genai`)

Gemini API backend (using API Key)...

✅ Gemini API Client Initialized.

### Initializing Gemini API Client (File API)

✅ Gemini File API Client Initialized.

✅ Client initialization complete.

## Utility Functions

In [5]:
# --- File/Data Handling ---
def load_processed_qids(filename: str) -> Set[str]:
    processed_qids = set()
    if Path(filename).is_file():
        try:
            df = pd.read_csv(filename, usecols=['qid'], dtype={'qid': str}, on_bad_lines='warn')
            processed_qids = set(df['qid'].dropna().unique())
            logger.info(f"Loaded {len(processed_qids)} processed QIDs from {filename}")
        except Exception as e:
            logger.warning(f"Could not read QIDs from {filename}: {e}. Assuming zero processed.")
    return processed_qids

def download_file_with_progress(url: str, destination: Path):
    logger.info(f"Downloading {url} to {destination}...")
    try:
        response = requests.get(url, stream=True, timeout=600)
        response.raise_for_status()
        total_size = int(response.headers.get('content-length', 0))
        block_size = 1024 * 1024
        with open(destination, 'wb') as f, tqdm(
            desc=f"Downloading {destination.name}", total=total_size, unit='iB', unit_scale=True, unit_divisor=1024
        ) as bar:
            for data in response.iter_content(block_size):
                size = f.write(data)
                bar.update(size)
        if total_size != 0 and bar.n != total_size:
            destination.unlink(missing_ok=True)
            raise RuntimeError(f"Download size mismatch for {destination.name}.")
        logger.info(f"Successfully downloaded {destination}")
    except Exception as e:
        destination.unlink(missing_ok=True)
        logger.error(f"Download failed for {url}: {e}")
        raise

def extract_zip(zip_path: Path, extract_to: Path):
    logger.info(f"Extracting {zip_path.name} to {extract_to}...")
    try:
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            members = [m for m in zip_ref.namelist() if not m.startswith('__MACOSX/') and not m.endswith('.DS_Store')]
            with tqdm(total=len(members), desc=f"Extracting {zip_path.name}") as pbar:
                for member in members:
                    zip_ref.extract(member=member, path=extract_to)
                    pbar.update(1)
        logger.info(f"Successfully extracted {zip_path} to {extract_to}")
    except Exception as e:
        logger.error(f"Extraction error: {e}")
        raise
    
def move_videos_to_main_directory(base_path):
    """Find all MP4 files in subdirectories and move them to the main directory."""
    logger.info(f"Moving all videos to main directory: {base_path}")
    moved_count = 0
    failed_count = 0
    
    # Find all MP4 files in subdirectories (but not in the main directory)
    for file_path in list(base_path.glob('**/*.mp4')):
        # Skip files already in the main directory or hidden Mac files
        if file_path.parent == base_path or file_path.name.startswith('._'):
            continue
            
        # Destination in the main directory
        dest_path = base_path / file_path.name
        
        try:
            # Move the file
            shutil.move(str(file_path), str(dest_path))
            moved_count += 1
            if moved_count % 50 == 0:
                logger.info(f"Moved {moved_count} videos so far...")
        except Exception as e:
            logger.error(f"Error moving {file_path}: {e}")
            failed_count += 1
    
    logger.info(f"Moved {moved_count} videos to main directory. Failed: {failed_count}")
    

def create_or_update_metadata(metadata_path: str, dataset_df: pd.DataFrame, video_updates: Dict[str, Dict]):
    try:
        required_cols = ['video_id', 'qid']
        update_cols = ['local_path', 'gcs_uri', 'file_api_name', 'status']
        dtype_map = {'video_id': str, 'qid': str} # Ensure IDs are strings

        if not Path(metadata_path).is_file():
            logger.info(f"Creating metadata file: {metadata_path}")
            meta_df = dataset_df.copy()
            for col in update_cols: meta_df[col] = pd.NA
            meta_df['status'] = 'pending'
        else:
            logger.debug(f"Loading existing metadata: {metadata_path}")
            meta_df = pd.read_csv(metadata_path, dtype=dtype_map)
            for col in update_cols: # Add missing update columns if needed
                 if col not in meta_df.columns: meta_df[col] = pd.NA

        if not all(col in meta_df.columns for col in required_cols):
            raise ValueError(f"Metadata missing required columns ({required_cols}).")

        updates_df = pd.DataFrame.from_dict(video_updates, orient='index')
        updates_df.index.name = 'video_id'
        updates_df.reset_index(inplace=True)
        updates_df['video_id'] = updates_df['video_id'].astype(str)

        # Use merge for robust updating across potentially multiple rows per video_id
        # First, prepare updates DF with only the necessary columns (video_id + update_cols)
        merge_cols = ['video_id'] + [col for col in update_cols if col in updates_df.columns]
        updates_to_merge = updates_df[merge_cols].drop_duplicates(subset=['video_id'], keep='last')

        # Merge, prioritizing updates
        # Suffixes help identify original vs update cols if needed, but update will overwrite
        merged_df = pd.merge(meta_df, updates_to_merge, on='video_id', how='left', suffixes=('', '_update'))

        # Apply the updates
        for col in update_cols:
            update_col_name = col + '_update'
            if update_col_name in merged_df.columns:
                # Prefer the update value; keep the existing value where the update is NA
                meta_df[col] = merged_df[update_col_name].fillna(merged_df[col])
                # Alternative: Directly update where update is not NA
                # meta_df[col] = np.where(merged_df[update_col_name].notna(), merged_df[update_col_name], merged_df[col])

        meta_df.to_csv(metadata_path, index=False, encoding='utf-8')
        logger.info(f"Metadata file '{metadata_path}' updated with {len(video_updates)} video records.")

    except Exception as e:
        logger.error(f"Error updating metadata {metadata_path}: {e}", exc_info=True)
        raise

def load_metadata_for_inference(metadata_file: str = METADATA_FILE) -> Dict[str, List[Dict]]:
    if not Path(metadata_file).is_file(): return {}
    video_questions = defaultdict(list)
    required_col = 'gcs_uri' if USE_VERTEX else 'file_api_name'
    try:
        df = pd.read_csv(metadata_file, dtype=str).fillna('')
        if 'video_id' not in df.columns or required_col not in df.columns:
            logger.error(f"Metadata missing 'video_id' or '{required_col}'.")
            return {}
        valid_df = df[df['video_id'].astype(bool) & df[required_col].astype(bool)]
        if len(valid_df) == 0:
             logger.warning(f"No videos found with '{required_col}' in {metadata_file}. Check Step 4.")
             return {}
        for video_id, group in valid_df.groupby('video_id'):
             video_questions[video_id] = group.to_dict('records')
        logger.info(f"Loaded {len(video_questions)} videos ({len(valid_df)} questions) with valid IDs for inference.")
        return dict(video_questions)
    except Exception as e:
        logger.error(f"Error loading metadata for inference: {e}", exc_info=True)
        return {}

# --- Upload/Verification Helpers ---
def upload_to_gcs(storage_client, bucket_name: str, source_file_path: Path, destination_blob_name: str) -> Optional[str]:
    if not GCS_AVAILABLE or storage_client is None or not source_file_path.is_file(): return None
    try:
        blob = storage_client.bucket(bucket_name).blob(destination_blob_name)
        blob.upload_from_filename(str(source_file_path))
        gcs_uri = f"gs://{bucket_name}/{destination_blob_name}"
        logger.debug(f"GCS OK: {source_file_path} -> {gcs_uri}")
        return gcs_uri
    except Exception as e:
        logger.error(f"GCS Fail: {source_file_path}. Error: {e}")
        return None

def upload_via_file_api(storage_client, local_path: Path, display_name: str) -> Optional[str]:
    if storage_client is None or not local_path.is_file(): return None
    try:
        logger.debug(f"Uploading {local_path} via File API...")
        uploaded_file = storage_client.upload(file=local_path)
        logger.info(f"File API OK: {local_path} -> {uploaded_file.name}")
        return uploaded_file.name
    except Exception as e:
        logger.error(f"File API Fail: {local_path}. Error: {e}", exc_info=True)
        return None

def verify_gcs_file_exists(storage_client, gcs_uri: str) -> bool:
    if not GCS_AVAILABLE or storage_client is None or not gcs_uri: return False
    try:
        exists = storage.Blob.from_string(gcs_uri, client=storage_client).exists()
        if not exists: logger.warning(f"GCS verify failed: {gcs_uri}")
        return exists
    except Exception as e:
        logger.error(f"Error verifying GCS {gcs_uri}: {e}")
        return False

def verify_file_api_resource_exists(storage_client, file_api_name: str) -> bool:
    if not storage_client or not file_api_name: return False
    try:
        _ = storage_client.get(name=file_api_name) # Sync get for verification
        return True
    except Exception as e:
        logger.error(f"Error verifying File API {file_api_name}: {e}")
        return False

def verify_local_file_exists(local_path: str) -> bool:
    exists = Path(local_path).is_file() if local_path else False
    if not exists: logger.warning(f"Local verify failed: {local_path}")
    return exists

# --- Prompt Building ---
def build_prompt(question_info: dict) -> str:
    question = question_info.get("question", "")
    q_type = question_info.get("question_type", "default")
    template = PROMPT_TEMPLATES.get(q_type, PROMPT_TEMPLATES["default"])
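    # MCQ prompts get only the extra "E" option appended
    if q_type == "Multiple-choice Question with a Single Correct Answer":
        return template.format(question=question).strip() + "\n" + "E. None of the above"
    # Other question types append the dataset's own answer instructions, if present
    question_prompt = question_info.get("question_prompt") or ""
    return template.format(question=question).strip() + "\n" + question_prompt.strip()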

# --- Rate Limiter ---
class AsyncRateLimiter:
    """
    An asyncio-compatible token bucket rate limiter.

    Args:
        rate (int): The maximum number of requests allowed per period.
        period (float): The time period in seconds (default: 60 for RPM).
        capacity (int, optional): The maximum burst capacity. Defaults to `rate`.
    """
    def __init__(self, rate: int, period: float = 60.0, capacity: Optional[int] = None):
        if rate <= 0:
            raise ValueError("Rate must be positive")
        if period <= 0:
            raise ValueError("Period must be positive")

        self.rate = rate
        self.period = float(period)
        self.capacity = float(capacity if capacity is not None else rate)
        self._tokens = self.capacity # Start full
        self._last_refill_time = time.monotonic()
        self._lock = asyncio.Lock()

    def _get_tokens_per_second(self) -> float:
        return self.rate / self.period

    async def _refill(self):
        """Replenishes tokens based on elapsed time. Must be called under lock."""
        now = time.monotonic()
        elapsed = now - self._last_refill_time
        if elapsed > 0:
            tokens_to_add = elapsed * self._get_tokens_per_second()
            self._tokens = min(self.capacity, self._tokens + tokens_to_add)
            self._last_refill_time = now

    async def acquire(self):
        """
        Acquires a token, waiting if necessary.
        """
        async with self._lock:
            await self._refill() # Refill based on time since last acquire/refill

            while self._tokens < 1:
                # Calculate how long to wait for 1 token
                tokens_needed = 1.0 - self._tokens
                wait_time = tokens_needed / self._get_tokens_per_second()

                # Release the lock while sleeping so other tasks can proceed,
                # then re-acquire it before touching the token count again.
                self._lock.release()
                try:
                    logger.debug(f"Rate limit hit. Waiting for {wait_time:.3f}s for next token.")
                    await asyncio.sleep(wait_time)
                finally:
                    await self._lock.acquire()

                # Refill again after waiting, as more time has passed
                await self._refill()

            # Consume a token
            self._tokens -= 1.0
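

The arithmetic behind `AsyncRateLimiter` can be checked without any async machinery. The sketch below restates the two formulas the class uses, refill and wait time, as plain functions. `refill` and `wait_for_one_token` are hypothetical names introduced here for illustration; the formulas themselves are taken from `_refill` and `acquire` above.

```python
def refill(tokens: float, elapsed: float, rate: int, period: float, capacity: float) -> float:
    """Tokens available after `elapsed` seconds, capped at the bucket capacity."""
    return min(capacity, tokens + elapsed * (rate / period))


def wait_for_one_token(tokens: float, rate: int, period: float) -> float:
    """Seconds to wait until at least one whole token is available."""
    if tokens >= 1.0:
        return 0.0
    return (1.0 - tokens) / (rate / period)


# 30 requests per 60 seconds means 0.5 tokens are added per second.
empty_wait = wait_for_one_token(0.0, rate=30, period=60.0)
after_one_second = refill(0.0, elapsed=1.0, rate=30, period=60.0, capacity=30.0)
capped = refill(29.5, elapsed=10.0, rate=30, period=60.0, capacity=30.0)
print(empty_wait, after_one_second, capped)
```

Because the bucket starts full, the first `capacity` requests go out immediately as a burst; only after that does the steady rate of `rate / period` requests per second apply.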


# Download, Extract & Prepare Dataset

## Fetch Dataset from HuggingFace

In [6]:
dataset_path = Path(DATASET_CSV)

if dataset_path.is_file() and SKIP_FETCH:
    logger.info(f"Dataset file '{DATASET_CSV}' exists and SKIP_FETCH is True. Skipping.")
    display(Markdown(f"✅ Skipping fetch: Found existing `{DATASET_CSV}`."))
    # Load the existing dataframe for use in Step 2
    try:
        dataset_df = pd.read_csv(dataset_path, dtype=str) # Load all as string initially
        logger.info(f"Loaded existing dataset from {DATASET_CSV} ({len(dataset_df)} rows).")
    except Exception as e:
        logger.error(f"Failed to load existing dataset file {DATASET_CSV}: {e}")
        display(Markdown(f"❌ Error loading existing `{DATASET_CSV}`: {e}. Please delete the file or set SKIP_FETCH=False."))
        raise
else:
    logger.info(f"Fetching dataset '{HF_DATASET_NAME}' (split: '{HF_DATASET_SPLIT}') from HuggingFace...")
    try:
        dataset = load_dataset(HF_DATASET_NAME, split=HF_DATASET_SPLIT, cache_dir=HF_CACHE_DIR)
        dataset_df = dataset.to_pandas()
        # Ensure key columns are strings
        for col in ['qid', 'video_id', 'question', 'question_type']:
             if col in dataset_df.columns:
                 dataset_df[col] = dataset_df[col].astype(str)
        dataset_df.to_csv(dataset_path, index=False, encoding='utf-8')
        logger.info(f"Successfully fetched dataset and saved to {DATASET_CSV} ({len(dataset_df)} rows).")
        display(Markdown(f"✅ Dataset fetched and saved to `{DATASET_CSV}` ({len(dataset_df)} rows)."))
        display(dataset_df.head())
    except Exception as e:
        logger.error(f"Failed to fetch or save dataset: {e}", exc_info=True)
        display(Markdown(f"❌ **Error fetching dataset:** {e}. Check connection, dataset name/split, cache dir permissions."))
        raise RuntimeError("Dataset fetching failed. Cannot continue.")

# Ensure dataset_df is loaded if skipping fetch didn't load it (e.g., first run with skip=True and no file)
if 'dataset_df' not in locals():
    if dataset_path.is_file():
        try:
            dataset_df = pd.read_csv(dataset_path, dtype=str)
        except Exception as e:
            logger.error(f"Critical error: Could not load dataset from {DATASET_CSV} after attempting fetch/skip: {e}")
            raise
    else:
        raise RuntimeError(f"Critical error: Dataset DataFrame not loaded and file {DATASET_CSV} not found.")

2025-04-24 04:22:58,281 - INFO - Fetching dataset 'lmms-lab/AISG_Challenge' (split: 'test') from HuggingFace...
2025-04-24 04:23:02,466 - INFO - Successfully fetched dataset and saved to dataset.csv (1500 rows).


✅ Dataset fetched and saved to `dataset.csv` (1500 rows).

| | qid | video_id | question_type | capability | question | duration | question_prompt | answer | youtube_url |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0008-0 | sj81PWrerDk | Primary Open-ended Question | Plot Attribute (Montage) | What is the difference between the action of t... | 8.85 | Please state your answer with a brief explanat... | | https://www.youtube.com/shorts/sj81PWrerDk |
| 1 | 0008-1 | sj81PWrerDk | Paraphrased Open-ended Question | Plot Attribute (Montage) | Can you describe how the actions of the last p... | 8.85 | Please state your answer with a brief explanat... | | https://www.youtube.com/shorts/sj81PWrerDk |
| 2 | 0008-2 | sj81PWrerDk | Correctly-led Open-ended Question | Plot Attribute (Montage) | Did the last person open the bottle without us... | 8.85 | Please state your answer with a brief explanat... | | https://www.youtube.com/shorts/sj81PWrerDk |
| 3 | 0008-3 | sj81PWrerDk | Wrongly-led Open-ended Question | Plot Attribute (Montage) | Did the last person in the video open the bott... | 8.85 | Please state your answer with a brief explanat... | | https://www.youtube.com/shorts/sj81PWrerDk |
| 4 | 0008-7 | sj81PWrerDk | Multiple-choice Question with a Single Correct... | Plot Attribute (Montage) | How does the last person in the video open the... | 8.85 | E. None of the above\nSelect one best answer t... | | https://www.youtube.com/shorts/sj81PWrerDk |
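
Each video appears in several rows, one per question, which is why `load_metadata_for_inference` groups rows by `video_id` before inference. A stdlib-only sketch of that grouping follows (the notebook itself uses pandas `groupby`). `group_by_video` is a hypothetical helper, and the third row is a made-up entry added only to show a second group.

```python
from collections import defaultdict
from typing import Dict, List

rows = [
    {"qid": "0008-0", "video_id": "sj81PWrerDk"},
    {"qid": "0008-1", "video_id": "sj81PWrerDk"},
    {"qid": "9999-0", "video_id": "made_up_video"},  # hypothetical row
    {"qid": "9999-1", "video_id": ""},               # no video id: skipped
]


def group_by_video(records: List[Dict[str, str]]) -> Dict[str, List[Dict[str, str]]]:
    """Collect question rows under their video_id, skipping rows without one."""
    grouped = defaultdict(list)
    for record in records:
        if record.get("video_id"):
            grouped[record["video_id"]].append(record)
    return dict(grouped)


by_video = group_by_video(rows)
for video_id, questions in by_video.items():
    print(video_id, [q["qid"] for q in questions])
```

Grouping this way lets one uploaded video resource be reused for every question that refers to it.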


## Download, Extract, and Prepare Videos

Downloads, extracts, and uploads videos (to GCS or the File API). Updates the metadata CSV named by `METADATA_FILE`.
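
Which of these stages actually run depends on the `SKIP_*` flags and on what is already on disk. The sketch below summarizes that decision as a pure function. `plan_steps` is a hypothetical helper; the download and extract conditions mirror the cells that follow, while the prepare condition is inferred only from the description of `SKIP_PREPARE` in the config cell.

```python
from typing import List


def plan_steps(zip_exists: bool, videos_extracted: bool,
               skip_download: bool, skip_extract: bool, skip_prepare: bool) -> List[str]:
    """List the stages that would run for a given state and flag combination."""
    steps = []
    # A skip flag only takes effect when the artifact it protects already exists.
    if not (zip_exists and skip_download):
        steps.append("download")
    if not (videos_extracted and skip_extract):
        steps.append("extract")
    if not skip_prepare:
        steps.append("prepare")
    return steps


# Switching backends with everything already on disk: only the upload stage runs.
switch_backend = plan_steps(True, True, skip_download=True, skip_extract=True, skip_prepare=False)
# First run on a clean machine: the skip flags have nothing to skip.
first_run = plan_steps(False, False, skip_download=True, skip_extract=True, skip_prepare=False)
print(switch_backend)
print(first_run)
```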

In [7]:
dataset_path = Path(DATASET_CSV)
dataset_df = pd.read_csv(dataset_path, dtype=str)

### Download Video Archive

In [8]:
display(Markdown("### Downloading Archive"))
# Remember to set SKIP_DOWNLOAD_ZIP to True if you want to skip the download and you already have the zip file
if zip_file_path.is_file() and SKIP_DOWNLOAD_ZIP:
    display(Markdown(f"✅ Skipping download: Found `{zip_file_path}`."))
else:
    try:
        download_file_with_progress(VIDEO_ZIP_URL, zip_file_path)
        display(Markdown(f"✅ Downloaded: `{zip_file_path}`."))
    except Exception as e:
        display(Markdown(f"❌ **Download Error:** {e}."))
        if not SKIP_EXTRACT or not SKIP_PREPARE:
            raise RuntimeError("Download failed.")
        else:
            display(Markdown("⚠️ Download failed, skipping steps."))

### Downloading Archive

✅ Skipping download: Found `downloads/all_videos.zip`.

### Extract Video Archive

In [9]:
display(Markdown("### Extracting Archive"))
# Skip extraction when extracted videos already exist and SKIP_EXTRACT is set
if any(extracted_videos_path.glob('*.mp4')) and SKIP_EXTRACT:
    display(Markdown(f"✅ Skipping extraction: Files in `{extracted_videos_path}`."))
elif not zip_file_path.is_file():
    display(Markdown(f"❌ Cannot extract: `{zip_file_path}` missing."))
    if not SKIP_PREPARE:
        raise RuntimeError("Zip missing.")
    else:
        display(Markdown("⚠️ Extraction skipped (no zip)."))
else:
    try:
        extract_zip(zip_file_path, extracted_videos_path)
        # Move all videos to main directory
        move_videos_to_main_directory(extracted_videos_path)
        display(Markdown(f"✅ Extracted to `{extracted_videos_path}` and moved all videos to main directory."))
    except Exception as e:
        display(Markdown(f"❌ **Extraction Error:** {e}."))
        if not SKIP_PREPARE:
            raise RuntimeError("Extraction failed.")
        else:
            display(Markdown("⚠️ Extraction failed, skipping prep."))

### Extracting Archive

✅ Skipping extraction: Files in `extracted_videos`.
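
The "move all videos to main directory" step can be pictured with a small sketch. The helper `flatten_videos` below is a hypothetical stand-in, not the notebook's `move_videos_to_main_directory`, and it assumes that an archive may unpack into nested folders and that a name collision should keep the file already present in the main directory.

```python
from pathlib import Path
import shutil


def flatten_videos(root: Path, pattern: str = "*.mp4") -> int:
    """Move every matching file found below `root` into `root` itself.

    Hypothetical helper for illustration. Returns the number of files moved.
    """
    moved = 0
    # sorted() materializes the listing first, so moving files while looping is safe
    for src in sorted(root.rglob(pattern)):
        if src.parent == root:
            continue  # already in the main directory
        dest = root / src.name
        if dest.exists():
            continue  # assumption: keep the existing file on a name collision
        shutil.move(str(src), str(dest))
        moved += 1
    return moved
```

After a call such as `flatten_videos(Path("extracted_videos"))`, a plain `glob("*.mp4")` on the main directory would see every video, which is what the later cells rely on.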

### Slow/Speed Up Videos
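Changes the video speed without re-encoding the video stream: the H.264 bitstream is
copied as-is and remuxed at a new frame rate, while the audio is re-encoded so that its
pitch is preserved. Because the video is never re-encoded, this step is fast. The cell
below runs the ffmpeg jobs concurrently with asyncio, limited by `MAX_ASYNC_WORKERS`.
The resulting videos are saved into **speed_videos/0.5**.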
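
The retiming arithmetic can be sketched in isolation. The names `atempo_chain` and `retimed_fps` are illustrative assumptions rather than functions from this notebook. The sketch keeps every `atempo` stage within 0.5 to 2.0, the range the cell below works with, so a larger change is split across several chained stages.

```python
from fractions import Fraction


def atempo_chain(speed: float) -> list[str]:
    """Split a speed factor into atempo stages that each stay within [0.5, 2.0]."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    stages = []
    remaining = speed
    while remaining > 2.0:          # peel off 2x stages for large speed-ups
        stages.append("atempo=2.0")
        remaining /= 2.0
    while remaining < 0.5:          # peel off 0.5x stages for large slow-downs
        stages.append("atempo=0.5")
        remaining /= 0.5
    if abs(remaining - 1.0) > 1e-6:  # final stage carries whatever factor is left
        stages.append(f"atempo={remaining:.6f}")
    return stages


def retimed_fps(r_frame_rate: str, speed: float) -> float:
    """Turn an ffprobe-style rate such as '30000/1001' into the retimed frame rate."""
    return float(Fraction(r_frame_rate)) * speed


print(",".join(atempo_chain(0.25)))  # atempo=0.5,atempo=0.500000
print(retimed_fps("30/1", 0.5))      # 15.0
```

The product of the stages equals the requested factor, so audio and video stay the same length after remuxing. A factor of exactly 1.0 yields an empty chain, in which case the audio can simply be stream-copied.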

In [10]:
async def run_subprocess(cmd, check=True, capture_output=False):
    """Helper function to run subprocess asynchronously."""
    stdout_pipe = asyncio.subprocess.PIPE if capture_output else asyncio.subprocess.DEVNULL
    # Capture stderr only if check is True or capture_output is True, otherwise DEVNULL
    stderr_pipe = asyncio.subprocess.PIPE if check or capture_output else asyncio.subprocess.DEVNULL

    process = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=stdout_pipe,
        stderr=stderr_pipe
    )
    stdout, stderr = await process.communicate()

    if check and process.returncode != 0:
        # Raise CalledProcessError with the captured streams attached,
        # so callers can decode and log e.stderr themselves
        raise subprocess.CalledProcessError(process.returncode, cmd, output=stdout, stderr=stderr)

    return stdout, stderr, process.returncode

async def process_single_video(vid_path, speed_videos_path, VIDEO_SPEED_FACTOR, semaphore):
    """Asynchronously processes a single video. Returns status string."""
    vid_path_str = str(vid_path.resolve())
    out_path = speed_videos_path / vid_path.name
    out_path_str = str(out_path.resolve())

    async with semaphore: # Limit concurrency
        if out_path.is_file():
            return 'skipped'

        if VIDEO_SPEED_FACTOR == 1.0:
            try:
                # Use asyncio.to_thread for potentially blocking I/O
                await asyncio.to_thread(shutil.copy, vid_path_str, out_path_str)
                return 'processed'
            except Exception as e:
                try:
                    logger.error(f"Error copying {vid_path.name}: {e}")
                except NameError:
                    print(f"Error copying {vid_path.name}: {e}")
                return 'error'

        # --- Process video with speed change ---
        tf_bitstream_path = None
        tf_audio_path = None
        tf_final_path = None
        try:
            # Create temporary files (synchronous part is okay here)
            # Context manager ensures files are closed before ffmpeg uses them
            with tempfile.NamedTemporaryFile(delete=False, suffix=".h264") as tf_b, \
                 tempfile.NamedTemporaryFile(delete=False, suffix=".aac") as tf_a, \
                 tempfile.NamedTemporaryFile(delete=False, suffix=".mp4") as tf_f:
                tf_bitstream_name = tf_b.name
                tf_audio_name = tf_a.name
                tf_final_name = tf_f.name
            # Store paths for cleanup
            tf_bitstream_path = Path(tf_bitstream_name)
            tf_audio_path = Path(tf_audio_name)
            tf_final_path = Path(tf_final_name)


            # Get original FPS
            ffprobe_cmd = [
                "ffprobe", "-v", "error", "-select_streams", "v", "-of", "default=noprint_wrappers=1:nokey=1",
                "-show_entries", "stream=r_frame_rate", vid_path_str
            ]
            stdout, _, _ = await run_subprocess(ffprobe_cmd, check=True, capture_output=True)
            fps = float(fractions.Fraction(stdout.decode().strip()))
            new_fps = fps * VIDEO_SPEED_FACTOR

            # Extract and speed up audio
            factor = VIDEO_SPEED_FACTOR
            filter_parts = []
            while factor > 2.0:
                filter_parts.append("atempo=2.0")
                factor /= 2.0
            while factor < 0.5:
                filter_parts.append("atempo=0.5")
                factor /= 0.5
            if abs(factor - 1.0) > 1e-6:
                 filter_parts.append(f"atempo={factor:.6f}")

            if not filter_parts:
                 audio_cmd = ["ffmpeg", "-y", "-i", vid_path_str, "-vn", "-c:a", "copy", tf_audio_name]
            else:
                audio_filter = ",".join(filter_parts)
                audio_cmd = ["ffmpeg", "-y", "-i", vid_path_str, "-vn", "-filter:a", audio_filter, "-c:a", "aac", "-b:a", "128k", tf_audio_name]
            await run_subprocess(audio_cmd, check=True)


            # Extract h264 bitstream
            extract_cmd = ["ffmpeg", "-y", "-i", vid_path_str, "-map", "0:v", "-c:v", "copy", "-bsf:v", "h264_mp4toannexb", tf_bitstream_name]
            await run_subprocess(extract_cmd, check=True)

            # Remux bitstream with new audio and FPS
            remux_cmd = ["ffmpeg", "-y", "-fflags", "+genpts", "-r", f"{new_fps:.6f}", "-i", tf_bitstream_name, "-i", tf_audio_name, "-map", "0:v", "-map", "1:a", "-c:v", "copy", "-c:a", "copy", tf_final_name]
            await run_subprocess(remux_cmd, check=True)

            # Move final file (use asyncio.to_thread)
            await asyncio.to_thread(shutil.move, tf_final_name, out_path_str)
            return 'processed'

        except Exception as e:
            err_msg = f"Error processing {vid_path.name}: {e}"
            # Include ffmpeg stderr if available
            if isinstance(e, subprocess.CalledProcessError) and e.stderr:
                 err_msg += f"\nFFmpeg/FFprobe Stderr:\n{e.stderr.decode(errors='ignore')}"
            try:
                logger.error(err_msg)
            except NameError:
                print(err_msg)
            return 'error'
        finally:
            # Clean up temporary files in a worker thread.
            # _cleanup must be a plain function: asyncio.to_thread() calls it synchronously
            # in the thread, so an `async def` here would only create a coroutine that never runs.
            def _cleanup():
                # tf_final is normally moved to out_path; it only remains if an error occurred before the move
                for tmp_path in (tf_bitstream_path, tf_audio_path, tf_final_path):
                    if tmp_path is not None:
                        tmp_path.unlink(missing_ok=True)
            # Run cleanup only if the temp paths were assigned
            if tf_bitstream_path or tf_audio_path or tf_final_path:
                await asyncio.to_thread(_cleanup)


# --- Main Cell Logic ---

async def run_processing(): # Wrap in an async function to use await
    display(Markdown("### Preparing Videos"))
    if dataset_df is None:
        raise RuntimeError("Dataset DF unavailable.")

    # Ensure output directory exists
    speed_videos_path.mkdir(parents=True, exist_ok=True)

    all_video_ids = sorted(list(dataset_df['video_id'].dropna().unique()))
    # Use logging if available, otherwise print
    try:
        logger.info(f"Processing {len(all_video_ids)} unique video IDs.")
    except NameError:
        print(f"Processing {len(all_video_ids)} unique video IDs.")

    vid_paths = list(extracted_videos_path.glob("*.mp4"))

    # Limit concurrency
    concurrency_limit = MAX_ASYNC_WORKERS
    try:
        logger.info(f"Using concurrency limit: {concurrency_limit}")
    except NameError:
        print(f"Using concurrency limit: {concurrency_limit}")
    semaphore = asyncio.Semaphore(concurrency_limit)

    tasks = []
    # Keep the familiar loop structure for creating tasks
    print(f"Preparing tasks for {len(vid_paths)} videos...")
    for vid_path in vid_paths:
         # Create a task for each video processing job
         # Pass necessary arguments to the task creator
         task = asyncio.create_task(process_single_video(vid_path, speed_videos_path, VIDEO_SPEED_FACTOR, semaphore))
         tasks.append(task)

    # Now, run all the created tasks concurrently and display progress
    # Use asyncio.as_completed with a standard tqdm progress bar
    print(f"Transforming {len(tasks)} Videos...")
    results = []
    # Use the imported tqdm (now tqdm.auto) to create a standard progress bar instance
    with tqdm(total=len(tasks), desc="Transforming Videos", unit="video") as pbar:
        for future in asyncio.as_completed(tasks):
            try:
                result = await future # Get result from completed task
                results.append(result)
            except Exception as exc:
                # Log errors from tasks that failed internally if not caught by process_single_video
                # (process_single_video should ideally return 'error' status instead of raising)
                try:
                    logger.error(f"Task for a video failed: {exc}")
                except NameError:
                    print(f"Task for a video failed: {exc}")
                results.append('error') # Count as error if task itself fails unexpectedly
            finally:
                 pbar.update(1) # Increment progress bar regardless of outcome


    # Count results
    processed = results.count('processed')
    skipped = results.count('skipped')
    errors = results.count('error')

    print(f"\n\n{skipped} videos skipped, {processed} videos processed, {errors} errors, {len(vid_paths)} total.")

# --- Execute the async processing ---
# In a Jupyter Notebook, you usually need to await the top-level async function.
# If top-level await isn't enabled, you might need nest_asyncio or run manually.
# Using await directly is the most common way in modern notebooks.

await run_processing()


### Preparing Videos

2025-04-24 04:23:02,700 - INFO - Processing 289 unique video IDs.
2025-04-24 04:23:02,701 - INFO - Using concurrency limit: 20
Preparing tasks for 289 videos...
Transforming 289 Videos...


Transforming Videos:   0%|          | 0/289 [00:00<?, ?video/s]



289 videos skipped, 0 videos processed, 0 errors, 289 total.
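
The concurrency pattern used by this cell can be reduced to a minimal sketch: a semaphore caps how many jobs run at once, and `asyncio.as_completed` yields results in finishing order. `run_bounded`, `fake_job`, and `failing_job` are hypothetical names used only for illustration; the jobs sleep instead of calling ffmpeg.

```python
import asyncio


async def run_bounded(jobs, limit):
    """Run zero-argument coroutine functions concurrently, at most `limit` at a time.

    Returns (results in completion order, peak number of jobs running at once).
    """
    semaphore = asyncio.Semaphore(limit)
    running = 0
    peak = 0

    async def guarded(job):
        nonlocal running, peak
        async with semaphore:  # blocks while `limit` jobs are already running
            running += 1
            peak = max(peak, running)
            try:
                return await job()
            except Exception:
                return "error"  # report a status instead of propagating
            finally:
                running -= 1

    tasks = [asyncio.create_task(guarded(job)) for job in jobs]
    results = []
    for finished in asyncio.as_completed(tasks):
        results.append(await finished)
    return results, peak


async def fake_job():
    await asyncio.sleep(0.01)  # stands in for an ffmpeg invocation
    return "processed"


async def failing_job():
    raise RuntimeError("simulated failure")


if __name__ == "__main__":
    statuses, peak = asyncio.run(run_bounded([fake_job] * 5 + [failing_job], limit=2))
    print(statuses.count("processed"), statuses.count("error"), peak)
```

Inside a notebook, where an event loop is already running, the call would be `await run_bounded(...)` rather than `asyncio.run(...)`.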


### Prepare and Upload Videos to GCS or File API

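Uploads every video to GCS (Vertex AI backend) or to the File API (Gemini API backend) and records the resulting GCS URI or File API name for each video in `video_metadata.csv`. Videos whose previous upload can still be verified are skipped.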
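
The resume logic boils down to one decision per video plus fixed-size batching. The sketch below isolates both under stated assumptions: `needs_upload` and `batches` are hypothetical helpers, `verify` stands in for whichever existence check the active backend provides, and the batch size of 10 is chosen only for illustration. The status strings are the ones the cell below writes.

```python
import math


def needs_upload(status, resource_id, verify):
    """Return True when a video must be uploaded or re-uploaded.

    `verify` is any callable that reports whether the remote resource still exists.
    """
    uploaded = status in ("uploaded_gcs", "uploaded_file_api")
    if uploaded and resource_id:
        return not verify(resource_id)  # re-upload only if the remote copy is gone
    return True  # pending, failed, or missing resource ID


def batches(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


remote_files = {"files/abc"}  # pretend remote state for the example
print(needs_upload("uploaded_file_api", "files/abc", remote_files.__contains__))      # False
print(needs_upload("uploaded_file_api", "files/expired", remote_files.__contains__))  # True
print(len(list(batches(list(range(289)), 10))), math.ceil(289 / 10))                  # 29 29
```

Verifying before skipping matters for the File API backend in particular, because uploaded files expire and a stale metadata entry would otherwise point at a resource that no longer exists.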

In [11]:
# --- Prepare Videos (Upload GCS/File API) & Update Metadata --- #
display(Markdown("### Preparing Videos & Updating Metadata"))
if SKIP_PREPARE:
    display(Markdown("✅ Skipping video preparation."))
elif storage_client is None:
    display(Markdown("❌ Cannot prepare: Client not ready."))
    raise RuntimeError("Client missing.")
else:
    if dataset_df is None:
        raise RuntimeError("Dataset DF unavailable.")
    all_video_ids = sorted(list(dataset_df['video_id'].dropna().unique()))
    logger.info(f"Processing {len(all_video_ids)} unique video IDs.")

    videos_to_process_ids = all_video_ids
    if MAX_VIDEOS_TO_PROCESS is not None:
        videos_to_process_ids = all_video_ids[:MAX_VIDEOS_TO_PROCESS]
        logger.info(f"Limiting to {len(videos_to_process_ids)} videos.")

    # Load existing metadata to check status
    existing_statuses = {}
    resource_ids = {}
    required_id_col = 'gcs_uri' if USE_VERTEX else 'file_api_name'
    if Path(METADATA_FILE).is_file():
        try:
            existing_df = pd.read_csv(METADATA_FILE, dtype=str)
            if 'video_id' in existing_df.columns and 'status' in existing_df.columns:
                existing_statuses = pd.Series(existing_df.status.values, index=existing_df.video_id).to_dict()
            if 'video_id' in existing_df.columns and required_id_col in existing_df.columns:
                 resource_ids = pd.Series(existing_df[required_id_col].values, index=existing_df.video_id).dropna().to_dict()
            logger.info("Checked existing metadata statuses/IDs.")
        except Exception as e: logger.warning(f"Could not load existing metadata: {e}")

    video_metadata_updates = {}
    processed_count, upload_failures, missing_local, skipped_count = 0, 0, 0, 0
    num_batches = math.ceil(len(videos_to_process_ids) / UPLOAD_BATCH_SIZE_GCS)
    prep_mode = "GCS Upload" if USE_VERTEX else "File API Upload"

    with tqdm(total=len(videos_to_process_ids), desc=f"Preparing ({prep_mode})") as pbar:
        for i in range(0, len(videos_to_process_ids), UPLOAD_BATCH_SIZE_GCS):
            batch_ids = videos_to_process_ids[i : i + UPLOAD_BATCH_SIZE_GCS]
            batch_num = (i // UPLOAD_BATCH_SIZE_GCS) + 1
            logger.info(f"Prep Batch {batch_num}/{num_batches}...")
            current_batch_updates = {}

            for video_id in batch_ids:
                pbar.set_postfix_str(f"ID: {video_id}")
                update_data = {"local_path": None, "gcs_uri": None, "file_api_name": None, "status": "error_unknown"}
                local_video_path = speed_videos_path / f"{video_id}.mp4"
                current_status = existing_statuses.get(video_id, 'pending')
                existing_resource_id = resource_ids.get(video_id)
                is_already_processed = False

                # Check if already uploaded and verified
                if current_status in ['uploaded_gcs', 'uploaded_file_api'] and existing_resource_id:
                     verified = False
                     if USE_VERTEX: verified = verify_gcs_file_exists(storage_client, existing_resource_id)
                     else: verified = verify_file_api_resource_exists(storage_client, existing_resource_id)
                     if verified:
                         logger.debug(f"Skipping verified video {video_id} ('{current_status}').")
                         is_already_processed = True
                         skipped_count += 1
                         update_data.update({ # Ensure metadata is consistent
                             'local_path': str(local_video_path) if local_video_path.is_file() else None,
                             'status': current_status,
                             required_id_col: existing_resource_id
                         })
                     else:
                         logger.warning(f"Video {video_id} ({current_status}) needs re-processing (verification failed).")
                elif current_status != 'pending':
                     logger.debug(f"Video {video_id} has non-pending status '{current_status}' but no verified resource ID. Re-processing.")

                if is_already_processed:
                    processed_count += 1
                    current_batch_updates[video_id] = update_data
                    pbar.update(1)
                    continue

                # Process if needed
                if local_video_path.is_file():
                    update_data["local_path"] = str(local_video_path)
                    resource_id_result = None
                    if USE_VERTEX:
                        blob_name = f"videos/{video_id}.mp4"
                        resource_id_result = upload_to_gcs(storage_client, GCS_BUCKET, local_video_path, blob_name)
                        if resource_id_result: update_data.update({"gcs_uri": resource_id_result, "status": "uploaded_gcs"})
                        else: update_data["status"] = "gcs_upload_failed"; upload_failures += 1
                    else: # Gemini API
                        resource_id_result = upload_via_file_api(storage_client, local_video_path, f"vid_{video_id}")
                        if resource_id_result: update_data.update({"file_api_name": resource_id_result, "status": "uploaded_file_api"})
                        else: update_data["status"] = "file_api_upload_failed"; upload_failures += 1
                else:
                    logger.warning(f"Local file missing: {local_video_path}")
                    missing_local += 1
                    update_data["status"] = "local_missing"

                current_batch_updates[video_id] = update_data
                processed_count += 1
                pbar.update(1)

            # Update metadata after batch
            if current_batch_updates:
                 try: create_or_update_metadata(METADATA_FILE, dataset_df, current_batch_updates)
                 except Exception as e: logger.error(f"Metadata update failed batch {batch_num}: {e}")
                 video_metadata_updates.update(current_batch_updates)

    logger.info(f"Prep finished. Checked: {processed_count}, Skipped(verified): {skipped_count}, Missing Local: {missing_local}, Upload Failures: {upload_failures}")
    display(Markdown(f"✅ Video preparation complete. See logs. Metadata: `{METADATA_FILE}`."))


### Preparing Videos & Updating Metadata

2025-04-24 04:23:02,796 - INFO - Processing 289 unique video IDs.
2025-04-24 04:23:02,805 - INFO - Checked existing metadata statuses/IDs.


Preparing (File API Upload):   0%|          | 0/289 [00:00<?, ?it/s]

2025-04-24 04:23:02,811 - INFO - Prep Batch 1/29...
2025-04-24 04:23:03,066 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/og0a6do1tpwp "HTTP/1.1 200 OK"
2025-04-24 04:23:03,304 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/18o7oiqwp1m4 "HTTP/1.1 200 OK"
2025-04-24 04:23:03,546 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/nwd6dcssxn5g "HTTP/1.1 200 OK"
2025-04-24 04:23:04,040 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/5rugpm724dos "HTTP/1.1 200 OK"
2025-04-24 04:23:04,279 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/uv6xrrdirs62 "HTTP/1.1 200 OK"
2025-04-24 04:23:04,507 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/2uvktznt7pa6 "HTTP/1.1 200 OK"
2025-04-24 04:23:04,781 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/9rzil2rzalbo "HTTP/1

  meta_df[col] = merged_df[update_col_name].fillna(merged_df[col])


2025-04-24 04:23:06,508 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/lpjysj4nfk2j "HTTP/1.1 200 OK"
2025-04-24 04:23:06,753 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/rn3qhuqto9x2 "HTTP/1.1 200 OK"
2025-04-24 04:23:07,002 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/zrjs3cwo111w "HTTP/1.1 200 OK"
2025-04-24 04:23:07,235 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/3xat2tk0x5ou "HTTP/1.1 200 OK"
2025-04-24 04:23:08,237 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/20dv3z6af9zz "HTTP/1.1 200 OK"
2025-04-24 04:23:08,484 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/cvb9esfjtt6c "HTTP/1.1 200 OK"
2025-04-24 04:23:08,726 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/cum701k1eq5e "HTTP/1.1 200 OK"
2025-04-24 04:23:08,971 - INFO - HTTP Req

  meta_df[col] = merged_df[update_col_name].fillna(merged_df[col])


2025-04-24 04:23:09,957 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/edigwvqw9av9 "HTTP/1.1 200 OK"
2025-04-24 04:23:10,212 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/w6mpz5ghecx "HTTP/1.1 200 OK"
2025-04-24 04:23:10,449 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/uubrz6yx11kr "HTTP/1.1 200 OK"
2025-04-24 04:23:10,700 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/41cjipgnj7qx "HTTP/1.1 200 OK"
2025-04-24 04:23:10,933 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/s4p21njekab1 "HTTP/1.1 200 OK"
2025-04-24 04:23:11,178 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/5izcnmjnqq1l "HTTP/1.1 200 OK"
2025-04-24 04:23:11,418 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/pjjgf7883z4 "HTTP/1.1 200 OK"
2025-04-24 04:23:11,659 - INFO - HTTP Reque

  meta_df[col] = merged_df[update_col_name].fillna(merged_df[col])


2025-04-24 04:23:12,399 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/g6q7aifdwv4g "HTTP/1.1 200 OK"
2025-04-24 04:23:12,635 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/xwt72aglcv57 "HTTP/1.1 200 OK"
2025-04-24 04:23:12,878 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/f6rtxz6omw09 "HTTP/1.1 200 OK"
2025-04-24 04:23:13,126 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/5angqwlkl9r0 "HTTP/1.1 200 OK"
2025-04-24 04:23:13,374 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/1w1strhwfrxk "HTTP/1.1 200 OK"
2025-04-24 04:23:13,605 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/i481ikq7ot66 "HTTP/1.1 200 OK"
2025-04-24 04:23:13,842 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/67d8iijtb2za "HTTP/1.1 200 OK"
2025-04-24 04:23:14,074 - INFO - HTTP Req

  meta_df[col] = merged_df[update_col_name].fillna(merged_df[col])


2025-04-24 04:23:14,806 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/ax94noo6ktgp "HTTP/1.1 200 OK"
2025-04-24 04:23:15,046 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/aui3e5ophi7k "HTTP/1.1 200 OK"
2025-04-24 04:23:15,282 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/zyncqyb7zs39 "HTTP/1.1 200 OK"
2025-04-24 04:23:15,531 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/j0zyodjypu6y "HTTP/1.1 200 OK"
2025-04-24 04:23:15,762 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/749ns7wd08mj "HTTP/1.1 200 OK"
2025-04-24 04:23:16,026 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/c01t64ksxcsf "HTTP/1.1 200 OK"
2025-04-24 04:23:16,300 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/ta2hl1tmh1yu "HTTP/1.1 200 OK"
2025-04-24 04:23:16,536 - INFO - HTTP Req

  meta_df[col] = merged_df[update_col_name].fillna(merged_df[col])


2025-04-24 04:23:17,273 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/l5bez9tkslkn "HTTP/1.1 200 OK"
2025-04-24 04:23:17,511 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/juzenky80gnx "HTTP/1.1 200 OK"
2025-04-24 04:23:17,747 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/x4e7940m2w6y "HTTP/1.1 200 OK"
2025-04-24 04:23:17,982 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/qgdi14i3nyc "HTTP/1.1 200 OK"
2025-04-24 04:23:18,219 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/isv1d80wasnj "HTTP/1.1 200 OK"
2025-04-24 04:23:18,497 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/64a62jztada2 "HTTP/1.1 200 OK"
2025-04-24 04:23:18,730 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/6cfe5hg6snoo "HTTP/1.1 200 OK"
2025-04-24 04:23:18,969 - INFO - HTTP Requ

  meta_df[col] = merged_df[update_col_name].fillna(merged_df[col])


2025-04-24 04:23:19,708 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/ff9jigeowz45 "HTTP/1.1 200 OK"
2025-04-24 04:23:19,962 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/mr1613eza0xe "HTTP/1.1 200 OK"
2025-04-24 04:23:20,202 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/mpgt10hc8lch "HTTP/1.1 200 OK"
2025-04-24 04:23:20,444 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/jdk6x0262403 "HTTP/1.1 200 OK"
2025-04-24 04:23:20,678 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/gvjwjlipxjur "HTTP/1.1 200 OK"
2025-04-24 04:23:20,917 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/l9sni81jwnmf "HTTP/1.1 200 OK"
2025-04-24 04:23:21,152 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/5d7lmp57wrz6 "HTTP/1.1 200 OK"
2025-04-24 04:23:21,386 - INFO - HTTP Req

  meta_df[col] = merged_df[update_col_name].fillna(merged_df[col])


2025-04-24 04:23:22,151 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/moyq9z0ymp93 "HTTP/1.1 200 OK"
2025-04-24 04:23:22,395 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/nw540f1nn5q5 "HTTP/1.1 200 OK"
2025-04-24 04:23:22,630 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/kdozm0rby5su "HTTP/1.1 200 OK"
2025-04-24 04:23:22,866 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/f07n3xut2u8i "HTTP/1.1 200 OK"
2025-04-24 04:23:23,103 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/lfw5qaix3onf "HTTP/1.1 200 OK"
2025-04-24 04:23:23,344 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/44nynnhxiwfa "HTTP/1.1 200 OK"
2025-04-24 04:23:23,592 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/sz9q0h8pfupe "HTTP/1.1 200 OK"
2025-04-24 04:23:23,854 - INFO - HTTP Req

  meta_df[col] = merged_df[update_col_name].fillna(merged_df[col])


2025-04-24 04:23:24,602 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/vkxvscus9rxk "HTTP/1.1 200 OK"
2025-04-24 04:23:24,860 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/ssqb0wt6ep1k "HTTP/1.1 200 OK"
2025-04-24 04:23:25,088 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/guvw073sa871 "HTTP/1.1 200 OK"
2025-04-24 04:23:25,327 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/s06001cxl1nt "HTTP/1.1 200 OK"
2025-04-24 04:23:25,575 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/xpt9rnzz4hnv "HTTP/1.1 200 OK"
2025-04-24 04:23:25,811 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/hjj4wckl6jtt "HTTP/1.1 200 OK"
2025-04-24 04:23:26,051 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/9ygsugwxj24a "HTTP/1.1 200 OK"
2025-04-24 04:23:26,300 - INFO - HTTP Req

  meta_df[col] = merged_df[update_col_name].fillna(merged_df[col])


2025-04-24 04:23:27,085 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/fgwzyxbdimtz "HTTP/1.1 200 OK"
2025-04-24 04:23:27,327 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/krth2law0nlv "HTTP/1.1 200 OK"
2025-04-24 04:23:27,561 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/6m0dsvyflwha "HTTP/1.1 200 OK"
2025-04-24 04:23:27,808 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/2fwyvt5603kp "HTTP/1.1 200 OK"
2025-04-24 04:23:28,042 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/cuivj7thi5n1 "HTTP/1.1 200 OK"
2025-04-24 04:23:28,276 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/3ipes8iguz2h "HTTP/1.1 200 OK"
2025-04-24 04:23:28,518 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/jl2sphmalzx2 "HTTP/1.1 200 OK"
2025-04-24 04:23:28,753 - INFO - HTTP Req

  meta_df[col] = merged_df[update_col_name].fillna(merged_df[col])


2025-04-24 04:23:29,487 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/fbcid9zo6wcp "HTTP/1.1 200 OK"
2025-04-24 04:23:29,723 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/5fw473mcgg91 "HTTP/1.1 200 OK"
2025-04-24 04:23:29,952 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/7gqdsyn1umib "HTTP/1.1 200 OK"
2025-04-24 04:23:30,188 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/f34stx7asnvf "HTTP/1.1 200 OK"
2025-04-24 04:23:30,421 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/jtp2ynggrq5o "HTTP/1.1 200 OK"
2025-04-24 04:23:30,657 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/fiagw3pnl1wv "HTTP/1.1 200 OK"
2025-04-24 04:23:30,897 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/5ptja9ygcguo "HTTP/1.1 200 OK"
2025-04-24 04:23:31,130 - INFO - HTTP Req

  meta_df[col] = merged_df[update_col_name].fillna(merged_df[col])


2025-04-24 04:23:31,875 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/ybfidik6de7d "HTTP/1.1 200 OK"
2025-04-24 04:23:32,112 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/qsv79zfg8lwd "HTTP/1.1 200 OK"
2025-04-24 04:23:32,351 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/ui4431gcskzf "HTTP/1.1 200 OK"
2025-04-24 04:23:32,585 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/elaqlfnpas3d "HTTP/1.1 200 OK"
2025-04-24 04:23:32,825 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/e3v1w5c7i3k7 "HTTP/1.1 200 OK"
2025-04-24 04:23:33,069 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/7kp6z721jm58 "HTTP/1.1 200 OK"
2025-04-24 04:23:33,299 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/lzse7xbbutqu "HTTP/1.1 200 OK"
2025-04-24 04:23:33,534 - INFO - HTTP Req

  meta_df[col] = merged_df[update_col_name].fillna(merged_df[col])


2025-04-24 04:23:34,276 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/de2pc5yot9g9 "HTTP/1.1 200 OK"
2025-04-24 04:23:34,501 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/zc9v2ifp4h6s "HTTP/1.1 200 OK"
2025-04-24 04:23:34,748 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/81dv0ka2ztyl "HTTP/1.1 200 OK"
2025-04-24 04:23:34,999 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/jx2judq7lv1g "HTTP/1.1 200 OK"
2025-04-24 04:23:35,240 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/zeku67m0zywe "HTTP/1.1 200 OK"
2025-04-24 04:23:35,501 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/64ya6kbr5d43 "HTTP/1.1 200 OK"
2025-04-24 04:23:35,737 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/d7irnbhxrldn "HTTP/1.1 200 OK"
2025-04-24 04:23:35,982 - INFO - HTTP Req

  meta_df[col] = merged_df[update_col_name].fillna(merged_df[col])


2025-04-24 04:23:36,755 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/viv0bl21q5c4 "HTTP/1.1 200 OK"
2025-04-24 04:23:36,988 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/1fm9g1exb705 "HTTP/1.1 200 OK"
2025-04-24 04:23:37,239 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/sqyc54g7pje7 "HTTP/1.1 200 OK"
2025-04-24 04:23:37,469 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/omonipqfw2tb "HTTP/1.1 200 OK"
2025-04-24 04:23:37,708 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/5qe03y87hoal "HTTP/1.1 200 OK"
2025-04-24 04:23:37,944 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/edm4z0qywgp5 "HTTP/1.1 200 OK"
2025-04-24 04:23:38,193 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/zrp7n7wsoyyp "HTTP/1.1 200 OK"
2025-04-24 04:23:38,436 - INFO - HTTP Req

  meta_df[col] = merged_df[update_col_name].fillna(merged_df[col])


2025-04-24 04:23:39,169 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/xx457xl75kla "HTTP/1.1 200 OK"
2025-04-24 04:23:39,415 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/w7jr2cli9uqj "HTTP/1.1 200 OK"
2025-04-24 04:23:39,701 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/v8v6owqzxyg3 "HTTP/1.1 200 OK"
2025-04-24 04:23:39,986 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/2yx4bm9hzx5n "HTTP/1.1 200 OK"
2025-04-24 04:23:40,220 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/h47wxg6qa32j "HTTP/1.1 200 OK"
2025-04-24 04:23:40,476 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/dzexelrbd2ar "HTTP/1.1 200 OK"
2025-04-24 04:23:40,734 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/3cgeyc14cpso "HTTP/1.1 200 OK"
2025-04-24 04:23:40,976 - INFO - HTTP Req

  meta_df[col] = merged_df[update_col_name].fillna(merged_df[col])


2025-04-24 04:23:41,696 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/aupetd26yrvn "HTTP/1.1 200 OK"
2025-04-24 04:23:41,938 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/wo9fswzfct76 "HTTP/1.1 200 OK"
2025-04-24 04:23:42,166 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/fioxs1p3hif1 "HTTP/1.1 200 OK"
2025-04-24 04:23:42,416 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/dkyy96tfp45m "HTTP/1.1 200 OK"
2025-04-24 04:23:42,653 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/5apkgtn4fhyq "HTTP/1.1 200 OK"
2025-04-24 04:23:42,889 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/24a4hlc0d0s5 "HTTP/1.1 200 OK"
2025-04-24 04:23:43,125 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/f5moejgkggjp "HTTP/1.1 200 OK"
2025-04-24 04:23:43,364 - INFO - HTTP Req

  meta_df[col] = merged_df[update_col_name].fillna(merged_df[col])


2025-04-24 04:23:44,119 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/kd2lb9gs7e89 "HTTP/1.1 200 OK"
2025-04-24 04:23:44,355 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/18ybnazqj7ky "HTTP/1.1 200 OK"
2025-04-24 04:23:44,591 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/3ronraq7p84r "HTTP/1.1 200 OK"
2025-04-24 04:23:44,826 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/28afjdym8lut "HTTP/1.1 200 OK"
2025-04-24 04:23:45,058 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/8mt3ofwyi54 "HTTP/1.1 200 OK"
2025-04-24 04:23:45,296 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/3j8n5pyiw7tl "HTTP/1.1 200 OK"
2025-04-24 04:23:45,576 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/7u9ssd9q06ho "HTTP/1.1 200 OK"
2025-04-24 04:23:45,801 - INFO - HTTP Requ

  meta_df[col] = merged_df[update_col_name].fillna(merged_df[col])


2025-04-24 04:23:46,523 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/6h9azggd7ef "HTTP/1.1 200 OK"
2025-04-24 04:23:46,760 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/t4wftplvq2lc "HTTP/1.1 200 OK"
2025-04-24 04:23:47,036 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/3ni1g7ro2l2w "HTTP/1.1 200 OK"
2025-04-24 04:23:47,274 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/crnn8mn0j6mt "HTTP/1.1 200 OK"
2025-04-24 04:23:47,504 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/hzzmez8g7b25 "HTTP/1.1 200 OK"
2025-04-24 04:23:47,748 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/7pm4forvmshc "HTTP/1.1 200 OK"
2025-04-24 04:23:47,997 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/jhui0ifm04g6 "HTTP/1.1 200 OK"
2025-04-24 04:23:48,231 - INFO - HTTP Requ

  meta_df[col] = merged_df[update_col_name].fillna(merged_df[col])


2025-04-24 04:23:48,959 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/zoumj5vyao2k "HTTP/1.1 200 OK"
2025-04-24 04:23:49,198 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/nlxa3i3gdnrp "HTTP/1.1 200 OK"
2025-04-24 04:23:49,440 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/770nw6ns9w2e "HTTP/1.1 200 OK"
2025-04-24 04:23:49,680 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/r6c67axmqz4y "HTTP/1.1 200 OK"
2025-04-24 04:23:49,927 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/o4t3bvlnb9kw "HTTP/1.1 200 OK"
2025-04-24 04:23:50,188 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/gk5eplklu96z "HTTP/1.1 200 OK"
2025-04-24 04:23:50,422 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/gzpsycjgi0w1 "HTTP/1.1 200 OK"
2025-04-24 04:23:50,657 - INFO - HTTP Req

  meta_df[col] = merged_df[update_col_name].fillna(merged_df[col])


2025-04-24 04:23:51,421 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/hvb2ri4nqtgv "HTTP/1.1 200 OK"
2025-04-24 04:23:51,668 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/c1092etjxzea "HTTP/1.1 200 OK"
2025-04-24 04:23:51,903 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/27n3ttxfby26 "HTTP/1.1 200 OK"
2025-04-24 04:23:52,157 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/ptq8rujm6m50 "HTTP/1.1 200 OK"
2025-04-24 04:23:52,396 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/z0yeqldowohe "HTTP/1.1 200 OK"
2025-04-24 04:23:52,636 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/i5b067g4egyx "HTTP/1.1 200 OK"
2025-04-24 04:23:52,878 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/inok5coguibv "HTTP/1.1 200 OK"
2025-04-24 04:23:53,147 - INFO - HTTP Req

  meta_df[col] = merged_df[update_col_name].fillna(merged_df[col])


2025-04-24 04:23:53,940 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/lfx1ra9ylf5r "HTTP/1.1 200 OK"
2025-04-24 04:23:54,178 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/t6qsw18kpkdc "HTTP/1.1 200 OK"
2025-04-24 04:23:54,450 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/elnva6u749ud "HTTP/1.1 200 OK"
2025-04-24 04:23:54,683 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/5zgheweou0ts "HTTP/1.1 200 OK"
2025-04-24 04:23:54,964 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/h7p23c64rm6l "HTTP/1.1 200 OK"
2025-04-24 04:23:55,196 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/sxj6cth6k62v "HTTP/1.1 200 OK"
2025-04-24 04:23:55,435 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/d0l4ocpx1fnp "HTTP/1.1 200 OK"
2025-04-24 04:23:55,670 - INFO - HTTP Req

  meta_df[col] = merged_df[update_col_name].fillna(merged_df[col])


2025-04-24 04:23:56,414 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/mxcgnkhaewlx "HTTP/1.1 200 OK"
2025-04-24 04:23:56,671 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/2ahf7ffvndsn "HTTP/1.1 200 OK"
2025-04-24 04:23:56,915 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/v6t4gqef81af "HTTP/1.1 200 OK"
2025-04-24 04:23:57,157 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/auf3vycmk135 "HTTP/1.1 200 OK"
2025-04-24 04:23:57,436 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/en7jkft3qc97 "HTTP/1.1 200 OK"
2025-04-24 04:23:57,677 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/4a5e7luohb2e "HTTP/1.1 200 OK"
2025-04-24 04:23:57,921 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/1yvs34t9k8s8 "HTTP/1.1 200 OK"
2025-04-24 04:23:58,157 - INFO - HTTP Req

  meta_df[col] = merged_df[update_col_name].fillna(merged_df[col])


2025-04-24 04:23:58,928 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/p04rl9zcrohm "HTTP/1.1 200 OK"
2025-04-24 04:23:59,166 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/2wv1p28ovc3p "HTTP/1.1 200 OK"
2025-04-24 04:23:59,396 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/bxg5m5v4isjo "HTTP/1.1 200 OK"
2025-04-24 04:23:59,635 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/abxlk2lkk3x2 "HTTP/1.1 200 OK"
2025-04-24 04:23:59,901 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/7e5e1oz1nh7l "HTTP/1.1 200 OK"
2025-04-24 04:24:00,167 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/wwgh8qkmcodm "HTTP/1.1 200 OK"
2025-04-24 04:24:00,402 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/79lbmywn59je "HTTP/1.1 200 OK"
2025-04-24 04:24:00,644 - INFO - HTTP Req

  meta_df[col] = merged_df[update_col_name].fillna(merged_df[col])


2025-04-24 04:24:01,378 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/j7wyq70r52wl "HTTP/1.1 200 OK"
2025-04-24 04:24:01,610 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/gsx98w4lqjop "HTTP/1.1 200 OK"
2025-04-24 04:24:01,855 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/haxc4mh7159w "HTTP/1.1 200 OK"
2025-04-24 04:24:02,129 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/ibb4wu1d8brx "HTTP/1.1 200 OK"
2025-04-24 04:24:02,364 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/nx4av2hp1naq "HTTP/1.1 200 OK"
2025-04-24 04:24:02,611 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/656c2m2cju5k "HTTP/1.1 200 OK"
2025-04-24 04:24:02,843 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/l3h1dt75xk8b "HTTP/1.1 200 OK"
2025-04-24 04:24:03,080 - INFO - HTTP Req

  meta_df[col] = merged_df[update_col_name].fillna(merged_df[col])


2025-04-24 04:24:03,885 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/al8tpf053xz5 "HTTP/1.1 200 OK"
2025-04-24 04:24:04,145 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/apzkn0sq1f6l "HTTP/1.1 200 OK"
2025-04-24 04:24:04,406 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/9ctqncrgmef4 "HTTP/1.1 200 OK"
2025-04-24 04:24:04,639 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/hllkc3gx9c89 "HTTP/1.1 200 OK"
2025-04-24 04:24:04,911 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/55yjhjkg2vyu "HTTP/1.1 200 OK"
2025-04-24 04:24:05,149 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/92yj10hjgijq "HTTP/1.1 200 OK"
2025-04-24 04:24:05,386 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/gxp4rxef6l1f "HTTP/1.1 200 OK"
2025-04-24 04:24:05,621 - INFO - HTTP Req

  meta_df[col] = merged_df[update_col_name].fillna(merged_df[col])


2025-04-24 04:24:06,365 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/rhoqttkyc9lj "HTTP/1.1 200 OK"
2025-04-24 04:24:06,600 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/ouf9sfbt8osa "HTTP/1.1 200 OK"
2025-04-24 04:24:06,841 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/60jqlomto0sp "HTTP/1.1 200 OK"
2025-04-24 04:24:07,079 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/vg2lr1485xan "HTTP/1.1 200 OK"
2025-04-24 04:24:07,312 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/ruv82vo822y8 "HTTP/1.1 200 OK"
2025-04-24 04:24:07,547 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/ssnesw8v1aja "HTTP/1.1 200 OK"
2025-04-24 04:24:07,787 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/62a4c83z240y "HTTP/1.1 200 OK"
2025-04-24 04:24:08,023 - INFO - HTTP Req

  meta_df[col] = merged_df[update_col_name].fillna(merged_df[col])


2025-04-24 04:24:08,773 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/1xfiwn8zks25 "HTTP/1.1 200 OK"
2025-04-24 04:24:09,016 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/4p3zx60og2tw "HTTP/1.1 200 OK"
2025-04-24 04:24:09,253 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/z1utf9rbn55e "HTTP/1.1 200 OK"
2025-04-24 04:24:09,513 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/ty274jujpbvx "HTTP/1.1 200 OK"
2025-04-24 04:24:09,770 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/5mzo60pt3tsw "HTTP/1.1 200 OK"
2025-04-24 04:24:10,004 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/qaftc0hm2rf3 "HTTP/1.1 200 OK"
2025-04-24 04:24:10,247 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/t266onrqweg9 "HTTP/1.1 200 OK"
2025-04-24 04:24:10,488 - INFO - HTTP Req

  meta_df[col] = merged_df[update_col_name].fillna(merged_df[col])


2025-04-24 04:24:11,233 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/2t50vifybaha "HTTP/1.1 200 OK"
2025-04-24 04:24:11,465 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/tw9ac0npfumg "HTTP/1.1 200 OK"
2025-04-24 04:24:11,740 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/1td14ht7cwlr "HTTP/1.1 200 OK"
2025-04-24 04:24:11,969 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/ageb02kiwsjw "HTTP/1.1 200 OK"
2025-04-24 04:24:12,233 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/k6e0f9ynvi04 "HTTP/1.1 200 OK"
2025-04-24 04:24:12,477 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/xusxleowu8hb "HTTP/1.1 200 OK"
2025-04-24 04:24:12,738 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/vu5ek6x1b86k "HTTP/1.1 200 OK"
2025-04-24 04:24:12,971 - INFO - HTTP Req

  meta_df[col] = merged_df[update_col_name].fillna(merged_df[col])


2025-04-24 04:24:13,726 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/wue5cngqnvre "HTTP/1.1 200 OK"
2025-04-24 04:24:13,960 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/8hzmxlyotttf "HTTP/1.1 200 OK"
2025-04-24 04:24:14,193 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/2n2io51nh791 "HTTP/1.1 200 OK"
2025-04-24 04:24:14,435 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/uz5b9csx1u3p "HTTP/1.1 200 OK"
2025-04-24 04:24:14,673 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/ckki55yyn8dx "HTTP/1.1 200 OK"
2025-04-24 04:24:14,907 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/amtkogpt3rxj "HTTP/1.1 200 OK"
2025-04-24 04:24:15,149 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/p5cn26d6wodz "HTTP/1.1 200 OK"
2025-04-24 04:24:15,385 - INFO - HTTP Req

  meta_df[col] = merged_df[update_col_name].fillna(merged_df[col])


✅ Video preparation complete. See logs. Metadata: `video_metadata_non_vertex_inj_correct_mcq.csv`.

# Perform Bulk Inference (Asynchronous)

Uses pre-uploaded resources (GCS URI or File API name from metadata).
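
The inference cell in this section pairs many concurrent workers with a single writer: each worker puts its result on an `asyncio.Queue`, and a `None` sentinel tells the writer to stop. Below is a minimal sketch of that pattern, assuming nothing about the real API; `worker`, `writer`, `demo`, and the list used as a sink are hypothetical stand-ins, and `asyncio.sleep(0)` stands in for the model call.

```python
import asyncio

async def worker(item: int, sem: asyncio.Semaphore, queue: asyncio.Queue) -> None:
    # Bound concurrency with the semaphore, then hand the result to the writer.
    async with sem:
        await asyncio.sleep(0)  # stand-in for the real API call
        await queue.put({"qid": item, "pred": f"answer-{item}"})

async def writer(queue: asyncio.Queue, sink: list) -> None:
    # Drain the queue until the None sentinel arrives.
    while True:
        result = await queue.get()
        if result is None:
            break
        sink.append(result)

async def demo(n: int = 5) -> list:
    queue: asyncio.Queue = asyncio.Queue()
    sink: list = []
    writer_task = asyncio.create_task(writer(queue, sink))
    sem = asyncio.Semaphore(2)  # at most two workers run at once
    await asyncio.gather(*(worker(i, sem, queue) for i in range(n)))
    await queue.put(None)  # signal that no more results are coming
    await writer_task      # wait for the writer to finish draining
    return sink

results = asyncio.run(demo())
```

Sending the sentinel only after all workers have finished ensures that every result is queued ahead of it. Inside a notebook that already runs an event loop, `await demo()` would likely be needed in place of `asyncio.run(demo())`.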

In [12]:
RESULTS_FINAL_DIR = os.path.join(f"all_results/full_inference_nonCoT/{MODEL_NAME}", RESULTS_FILE)
os.makedirs(os.path.dirname(RESULTS_FINAL_DIR), exist_ok=True)


In [13]:
# --- Inference Functions ---

async def perform_inference_single_async(
    question_info: Dict,
    client: Any,
    semaphore: asyncio.Semaphore,
    rate_limiter: Optional[AsyncRateLimiter],
    results_queue: asyncio.Queue
) -> None: # Return None as result is put in queue
    """
    Async inference for one question, putting the result into a queue.
    """
    qid = question_info.get("qid", "?")
    prompt_text = build_prompt(question_info)
    gcs_uri = question_info.get("gcs_uri")
    file_api_name = question_info.get("file_api_name")
    start_time = time.time()
    result = None # Default result in case of early exit

    # --- Prepare Inputs ---
    try:
        video_part = None
        if USE_VERTEX:
            # Vertex AI: reference the pre-uploaded video by its GCS URI
            if not gcs_uri: raise ValueError("Missing GCS URI.")
            video_part = types.Part.from_uri(mime_type='video/mp4', file_uri=gcs_uri)
        else: # Gemini API Mode
            # Gemini API: fetch the File API object uploaded during the 'Prepare' step
            if not file_api_name: raise ValueError("Missing File API name.")
            try:
                file_object = await client.aio.files.get(name=file_api_name)
                video_part = file_object
            except genai_errors.ClientError as e:
                # HTTP 4xx responses surface as ClientError; a 404 typically means the file is missing or expired
                if getattr(e, "code", None) == 404: raise FileNotFoundError(f"File API '{file_api_name}' not found.")
                raise RuntimeError(f"Failed get File API obj: {e}")
            except Exception as e: raise RuntimeError(f"Failed get File API obj: {e}")

        question_content = types.Content(role="user", parts=[types.Part.from_text(text=prompt_text)])
        if video_part is None: raise RuntimeError("Video part preparation failed.")
        contents = [question_content, video_part]

    except (ValueError, FileNotFoundError, RuntimeError) as e:
        logger.error(f"QID {qid} (Async): Input Error - {e}")
        # Put error result in queue
        result = {"qid": qid, "pred": f"ERROR: Input Fail - {e}", "duration": 0, "finish_reason": "N/A", "status": "Failed (Input)"}
        await results_queue.put(result)
        return # Exit the function
    except Exception as e:
         logger.error(f"QID {qid} (Async): Unexpected Input Prep Error: {e}", exc_info=True)
         result = {"qid": qid, "pred": f"ERROR: Input Prep Failed Unexpectedly - {e}", "duration": 0, "finish_reason": "N/A", "status": "Failed (Input Prep)"}
         await results_queue.put(result)
         return # Exit the function

    # --- Perform Inference with Retries, Semaphore, and Rate Limiting ---
    async with semaphore:
        for attempt in range(MAX_RETRIES + 1):
            try:
                if rate_limiter: await rate_limiter.acquire()
                api_start = time.time()
                logger.debug(f"QID {qid} (Async): Attempt {attempt + 1} sending request...")
                response = await client.aio.models.generate_content(
                    model=MODEL_NAME,
                    contents=contents,
                    config=CONFIG
                )

                # Process Response
                answer, reason, status, err_detail = "ERROR", "UNKNOWN", "Success", ""
                try:
                    answer = response.text.strip()
                    if response.candidates and hasattr(response.candidates[0], 'finish_reason') and response.candidates[0].finish_reason is not None:
                        reason = response.candidates[0].finish_reason.name
                except ValueError as ve:
                    status, err_detail = "Blocked/Empty", f"ValueError: {ve}. "
                    # ... (block/safety reason extraction) ...
                    answer = f"ERROR: {status}. {err_detail}"

                result = {"qid": qid, "pred": answer, "duration": time.time()-start_time, "finish_reason": reason, "status": status}
                await results_queue.put(result)
                logger.debug(f"QID {qid} (Async): Attempt {attempt + 1} {status} ({time.time()-api_start:.2f}s API / {time.time()-start_time:.2f}s Total). Result queued.")
                return # Exit function after success


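            except Exception as e:
                 # `reason` is unbound if generate_content itself raised, so it must not be referenced here.
                 if attempt < MAX_RETRIES:
                     delay = min(2 ** attempt, 30) + random.random() # Exponential backoff with jitter
                     logger.warning(f"QID {qid} (Async): Attempt {attempt + 1} failed ({e}). Retrying in {delay:.1f}s...")
                     await asyncio.sleep(delay)
                     continue
                 logger.error(f"QID {qid} (Async): Giving up after {attempt + 1} attempts: {e}")
                 result = {"qid": qid, "pred": f"ERROR: - {e}", "duration": time.time()-start_time, "finish_reason": "N/A", "status": "Failed (Unexpected Error)"}
                 await results_queue.put(result)
                 return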

        # Fallback (Should not be reached if logic above is correct)
        logger.error(f"QID {qid} (Async): Exited retry loop unexpectedly.")
        result = {"qid": qid, "pred": "ERROR: Unknown after retries", "duration": time.time()-start_time, "finish_reason": "UNKNOWN", "status": "Failed (Unknown)"}
        await results_queue.put(result)

async def results_writer_task(
    queue: asyncio.Queue,
    filename: str,
    write_batch_size: int = 20, # How many results to buffer before writing
    write_interval_sec: float = 10.0 # Max time between writes
):
    """Gets results from queue and writes them to CSV in batches."""
    results_buffer = []
    last_write_time = time.monotonic()
    # Define expected header based on the dict keys put in the queue
    fieldnames = ["qid", "pred", "status", "duration_sec", "finish_reason"]
    file_exists = Path(filename).is_file()

    logger.info(f"Writer task started. Writing results to {filename}")

    while True:
        try:
            # Wait for an item with a timeout
            result = await asyncio.wait_for(queue.get(), timeout=write_interval_sec)

            if result is None: # Signal to terminate
                logger.info("Writer task received termination signal.")
                break

            if isinstance(result, dict):
                 # Ensure duration is rounded here before adding to buffer
                 result["duration_sec"] = round(result.get("duration", -1), 2)
                 # Remove the original 'duration' key if desired
                 result.pop('duration', None)
                 results_buffer.append(result)
            else:
                logger.warning(f"Writer task received non-dict item: {result}")

            queue.task_done() # Signal that the item was processed

        except asyncio.TimeoutError:
            # Timeout occurred, write buffer if not empty, even if batch size not reached
            logger.debug("Writer task timeout reached.")
            pass # Continue to buffer check below

        except Exception as e:
            logger.error(f"Writer task encountered error getting from queue: {e}", exc_info=True)
            # Decide if this is fatal or if we should try to continue
            await asyncio.sleep(1) # Prevent fast spinning on error
            continue # Try to continue processing

        # Check if we should write the buffer
        buffer_size = len(results_buffer)
        time_since_last_write = time.monotonic() - last_write_time
        should_write = (
             buffer_size > 0 and
             (buffer_size >= write_batch_size or time_since_last_write >= write_interval_sec)
        )

        if should_write:
            logger.info(f"Writing batch of {buffer_size} results to {filename}...")
            try:
                # Use 'with open' for proper handling
                with open(filename, 'a', newline='', encoding='utf-8') as csvfile:
                    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
                    # Write header only if file didn't exist at the start
                    if not file_exists:
                        writer.writeheader()
                        file_exists = True # Prevent writing header again
                    writer.writerows(results_buffer)

                results_buffer = [] # Clear buffer after successful write
                last_write_time = time.monotonic()
                logger.info(f"Batch written successfully.")
            except IOError as e:
                logger.error(f"IOError writing results batch to {filename}: {e}. Results may be lost.")
                # Optional: Add retry logic here, or store failed batch elsewhere
            except Exception as e:
                logger.error(f"Unexpected error writing results batch: {e}", exc_info=True)


    # --- Cleanup: Write any remaining items after receiving None signal ---
    if results_buffer:
        logger.info(f"Writing final remaining {len(results_buffer)} results...")
        try:
            with open(filename, 'a', newline='', encoding='utf-8') as csvfile:
                writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
                if not file_exists: writer.writeheader() # Check again in case file was deleted mid-run
                writer.writerows(results_buffer)
            logger.info("Final results written.")
        except Exception as e:
            logger.error(f"Error writing final results batch: {e}")

    logger.info("Writer task finished.")

# --- Modified Main Execution Function with UI ---
async def run_bulk_inference_async():
    """Runs the bulk inference process asynchronously with queue writer, rate limiting, and UI."""

    # --- Standard Setup ---
    if ai_client is None: logger.error("AI Client not initialized."); display(Markdown("❌ AI Client not initialized.")); return

    # --- Load Data and Prepare Tasks ---
    logger.info("Loading metadata for inference...")
    video_questions_for_inference = load_metadata_for_inference(METADATA_FILE)
    if not video_questions_for_inference: logger.warning("No video data ready."); return
    processed_qids = load_processed_qids(RESULTS_FINAL_DIR)

    inference_tasks_input = []
    tasks_skipped = 0
    required_col = 'gcs_uri' if USE_VERTEX else 'file_api_name'
    for video_id, questions in video_questions_for_inference.items():
        for question_info in questions:
             qid = question_info.get('qid')
             if not qid or qid in processed_qids or not question_info.get(required_col):
                 tasks_skipped += 1; continue
             inference_tasks_input.append(question_info)

    total_tasks = len(inference_tasks_input)
    logger.info(f"Prepared {total_tasks} new inference tasks. Skipped {tasks_skipped}." )
    if total_tasks == 0: logger.info("No new questions to process."); return

    # --- Setup Rate Limiter ---
    rate_limiter = None
    if REQUESTS_PER_MINUTE is not None and REQUESTS_PER_MINUTE > 0:
        capacity = 10 # Example: Keep burst capacity moderate
        rate_limiter = AsyncRateLimiter(rate=REQUESTS_PER_MINUTE, period=60.0, capacity=capacity)
        logger.info(f"Rate limiting enabled: {REQUESTS_PER_MINUTE} RPM, Capacity: {capacity}")
    else:
        logger.info("Rate limiting disabled.")

    # --- Create Queue and Start Writer Task ---
    results_queue = asyncio.Queue()
    writer_handle = asyncio.create_task(
        results_writer_task(results_queue, RESULTS_FINAL_DIR)
    )

    # --- Schedule and Run Inference Tasks with Progress Bar ---
    logger.info(f"Starting async inference for {total_tasks} questions (Concurrency: {MAX_ASYNC_WORKERS})...")
    semaphore = asyncio.Semaphore(MAX_ASYNC_WORKERS)
    start_bulk_time = time.time()

    # Create coroutines
    inference_coroutines = [
        perform_inference_single_async(q_info, ai_client, semaphore, rate_limiter, results_queue)
        for q_info in inference_tasks_input
    ]

    # --- Use asyncio.as_completed with tqdm_notebook for live progress ---
    completed_count = 0
    gather_exception = None
    try:
        # Create a future for each coroutine to track completion
        tasks = [asyncio.ensure_future(coro) for coro in inference_coroutines]
        # Use tqdm with as_completed
        for future in tqdm(asyncio.as_completed(tasks), total=total_tasks, desc="Async Inference"):
            try:
                await future # Wait for the next task to complete, raise exception if task failed
                completed_count += 1
            except Exception as task_exc:
                 # Log error from individual task if it wasn't caught inside
                 logger.error(f"Error surfaced from an inference task: {task_exc}", exc_info=False)
                 # Optionally decide if you want to stop processing other tasks
                 # gather_exception = task_exc # Store first exception
                 # break # Or continue processing others
    except Exception as outer_exc:
        # Catch errors during the setup/iteration of as_completed itself
        logger.error(f"Error during as_completed processing: {outer_exc}", exc_info=True)
        gather_exception = outer_exc

    # Ensure all tasks are awaited even if we broke early due to an error in one task
    # (This might be redundant if as_completed handles cancellation correctly, but safer)
    await asyncio.gather(*tasks, return_exceptions=True)

    bulk_duration = time.time() - start_bulk_time
    logger.info(f"Async inference task processing finished in {bulk_duration:.2f} seconds. Completed: {completed_count}/{total_tasks}.")
    if gather_exception:
         logger.error(f"Bulk inference encountered errors: {gather_exception}")

    # --- Signal Writer Task to Finish (ALWAYS DO THIS) ---
    logger.info("Signaling writer task to complete...")
    await results_queue.put(None)

    # --- Wait for Writer Task to Finish (ALWAYS DO THIS) ---
    logger.info("Waiting for writer task to finish writing remaining results...")
    try:
        await writer_handle # Wait until the writer processes the None signal and exits
        logger.info("Writer task has finished.")
    except Exception as writer_exc:
        logger.error(f"Error waiting for writer task: {writer_exc}", exc_info=True)

    # --- Final Summary ---
    logger.info(f"Bulk inference process complete. See logs and {RESULTS_FINAL_DIR}.")
    # Remove handler to prevent duplicate logs on re-run

# --- Run the async function ---
asyncio.run(run_bulk_inference_async())

2025-04-24 04:24:15,816 - INFO - Loading metadata for inference...
2025-04-24 04:24:15,978 - INFO - Loaded 289 videos (1500 questions) with valid IDs for inference.
2025-04-24 04:24:15,984 - INFO - Loaded 1440 processed QIDs from all_results/full_inference_nonCoT/gemini-2.5-flash-preview-04-17/results_noncot_full_inference.csv
2025-04-24 04:24:15,984 - INFO - Prepared 60 new inference tasks. Skipped 1440.
2025-04-24 04:24:15,985 - INFO - Rate limiting enabled: 100 RPM, Capacity: 10
2025-04-24 04:24:15,985 - INFO - Starting async inference for 60 questions (Concurrency: 20)...


Async Inference:   0%|          | 0/60 [00:00<?, ?it/s]

2025-04-24 04:24:15,991 - INFO - Writer task started. Writing results to all_results/full_inference_nonCoT/gemini-2.5-flash-preview-04-17/results_noncot_full_inference.csv
2025-04-24 04:24:16,342 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/80mvibtw4t4w "HTTP/1.1 200 OK"
2025-04-24 04:24:16,345 - INFO - AFC is enabled with max remote calls: 10.
2025-04-24 04:24:16,350 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/20dv3z6af9zz "HTTP/1.1 200 OK"
2025-04-24 04:24:16,351 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/ue72mq5mpijm "HTTP/1.1 200 OK"
2025-04-24 04:24:16,354 - INFO - AFC is enabled with max remote calls: 10.
2025-04-24 04:24:16,355 - INFO - AFC is enabled with max remote calls: 10.
2025-04-24 04:24:16,376 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/files/f07n3xut2u8i "HTTP/1.1 200 OK"
2025-04-24 04:24:16,378 - INFO - AFC is enabled with ma
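
The log above reports "Rate limiting enabled: 100 RPM, Capacity: 10". The notebook's `AsyncRateLimiter` is defined outside this section, so the following is only a sketch of how a limiter with the same `rate`, `period`, and `capacity` parameters could work as a token bucket. The class name `TokenBucket` and its internals are assumptions, not the notebook's implementation.

```python
import asyncio
import time

class TokenBucket:
    """Hypothetical stand-in for AsyncRateLimiter (not the notebook's actual code)."""

    def __init__(self, rate: float, period: float, capacity: int):
        self.refill_per_sec = rate / period  # tokens added per second
        self.capacity = capacity             # maximum burst size
        self.tokens = float(capacity)
        self.updated = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        # Callers queue on the lock, so requests are released one at a time.
        async with self._lock:
            while True:
                now = time.monotonic()
                refill = (now - self.updated) * self.refill_per_sec
                self.tokens = min(self.capacity, self.tokens + refill)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                # Sleep just long enough for one whole token to accumulate.
                await asyncio.sleep((1 - self.tokens) / self.refill_per_sec)

async def demo() -> float:
    # 600 requests per 60 s is 10 tokens per second, with a burst of 2.
    limiter = TokenBucket(rate=600, period=60.0, capacity=2)
    start = time.monotonic()
    for _ in range(4):
        await limiter.acquire()
    return time.monotonic() - start

elapsed = asyncio.run(demo())
```

In this sketch the first two calls pass immediately from the burst capacity and the remaining two each wait for a refill. With the values from the log (100 RPM, capacity 10), the same logic would allow a burst of 10 requests and then roughly one request every 0.6 seconds.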

# Cleanup and reporting

## Interim Cleanup

In [14]:
def interim_cleanup_and_report():
    """Interim cleanup: split results into successful and failed rows and report the counts."""
    if not Path(RESULTS_FINAL_DIR).is_file(): display(Markdown(f"❌ Results file `{RESULTS_FINAL_DIR}` not found.")); return

    try:
        results_df = pd.read_csv(RESULTS_FINAL_DIR, dtype=str)
        logger.info(f"Loaded results from {RESULTS_FINAL_DIR} ({len(results_df)} rows).")
        
        # Correctly identify failed entries
        failed = results_df[results_df['pred'].isna() | (results_df['status'] != 'Success')]
        
        # Success is everything not in failed
        success = results_df[~results_df.index.isin(failed.index)]

        display(Markdown(f"### Results Summary: {len(success)} Success, {len(failed)} Failed."))

        # Sort by 'qid'
        results_df.sort_values(by=['qid'], inplace=True)

        # Count unique QIDs (first occurrence kept). Note: this only feeds the log line
        # below; the success/failed splits saved further down were taken before this step.
        results_df.drop_duplicates(subset=['qid'], keep='first', inplace=True)
        logger.info(f"Removed duplicates, remaining {len(results_df)} unique QIDs.")

        # Overwrite the results file with the successful rows only, and write the
        # failed rows to a separate `*_failed.csv` file next to it.
        original_filename = RESULTS_FINAL_DIR.replace('.csv', "")
        success_filename = f"{original_filename}.csv"
        failed_filename = f"{original_filename}_failed.csv"
        failed.to_csv(failed_filename, index=False, encoding='utf-8')
        success.to_csv(success_filename, index=False, encoding='utf-8')
     
    except Exception as e:
        logger.error(f"Error processing results file: {e}", exc_info=True)
        display(Markdown(f"❌ Error processing results file: {e}."))
        return
  
interim_cleanup_and_report()

2025-04-24 04:24:49,231 - INFO - Loaded results from all_results/full_inference_nonCoT/gemini-2.5-flash-preview-04-17/results_noncot_full_inference.csv (1500 rows).


### Results Summary: 1440 Success, 60 Failed.

2025-04-24 04:24:49,235 - INFO - Removed duplicates, remaining 1500 unique QIDs.


After the interim cleanup, if some questions still failed (for example, because of rate limits), rerun the bulk inference cell to retry them. With the current setup, you need to rerun the previous bulk inference 3 times, running the interim cleanup once after each run.
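
The rerun-then-clean cycle described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the notebook's implementation: `split_rows` and `pending_qids` are hypothetical helper names, and the rows are made-up sample data. It applies the same rule as the interim cleanup (a row counts as failed when `pred` is empty or `status` is not `Success`), then lists the QIDs that a subsequent bulk-inference run would still need to process.

```python
import csv
import io


def split_rows(rows):
    """Split result rows into (success, failed) using the interim-cleanup rule."""
    success, failed = [], []
    for row in rows:
        # An empty `pred` plays the role of NaN in the pandas version.
        ok = bool(row.get("pred")) and row.get("status") == "Success"
        (success if ok else failed).append(row)
    return success, failed


def pending_qids(all_qids, success_rows):
    """Return the QIDs that have no successful result yet, in their original order."""
    done = {row["qid"] for row in success_rows}
    return [qid for qid in all_qids if qid not in done]


# Made-up sample data standing in for the results CSV (hypothetical values).
sample_csv = """qid,pred,status
q1,A,Success
q2,,Error: 429 rate limit
q3,C,Success
q4,,Success
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
success, failed = split_rows(rows)
todo = pending_qids(["q1", "q2", "q3", "q4"], success)
print(len(success), len(failed), todo)
```

Repeating "run inference, then clean up" until `todo` is empty (or a fixed number of rounds is reached) is the manual loop the note above describes.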

## Process Results file after inference

In [15]:
def results_cleanup_and_report():
    """Final cleanup and report of results."""
    if not Path(RESULTS_FINAL_DIR).is_file():
        display(Markdown(f"❌ Results file `{RESULTS_FINAL_DIR}` not found."))
        return

    try:
        results_df = pd.read_csv(RESULTS_FINAL_DIR, dtype=str)
        logger.info(f"Loaded results from {RESULTS_FINAL_DIR} ({len(results_df)} rows).")
        
        success = results_df[results_df['status'] == 'Success']
        failed = results_df[results_df['status'] != 'Success']
        logger.info(f"Results Summary: {len(success)} Success, {len(failed)} Failed.")
        display(Markdown(f"### Results Summary: {len(success)} Success, {len(failed)} Failed."))

        # Average over successful rows only, matching the log message below
        avg_duration = success['duration_sec'].astype(float).mean()
        logger.info(f"Average duration for successful tasks: {avg_duration:.2f} seconds.")
        display(Markdown(f"### Average Duration: {avg_duration:.2f} seconds."))

        # Sort by 'qid'
        results_df.sort_values(by=['qid'], inplace=True)

        # remove duplicates based on 'qid' and keep the first occurrence
        results_df.drop_duplicates(subset=['qid'], keep='first', inplace=True)
        logger.info(f"Removed duplicates, remaining {len(results_df)} unique QIDs.")

        # select only the relevant columns
        results_df = results_df[['qid', 'pred']]

        # Save the cleaned-up results to a new file with datetime suffix
        original_filename = RESULTS_FINAL_DIR.replace(".csv", "")
        new_name = original_filename + "_" + time.strftime("%Y%m%d_%H%M%S") + ".csv"
        results_df.to_csv(new_name, index=False, encoding='utf-8')
        logger.info(f"Cleaned results saved to {new_name}.")
        display(Markdown(f"✅ Cleaned results saved to `{new_name}`."))
     
    except Exception as e:
        logger.error(f"Error processing results file: {e}", exc_info=True)
        display(Markdown(f"❌ Error processing results file: {e}."))
        return
  
results_cleanup_and_report()

2025-04-24 04:24:49,300 - INFO - Loaded results from all_results/full_inference_nonCoT/gemini-2.5-flash-preview-04-17/results_noncot_full_inference.csv (1440 rows).
2025-04-24 04:24:49,301 - INFO - Results Summary: 1440 Success, 0 Failed.


### Results Summary: 1440 Success, 0 Failed.

2025-04-24 04:24:49,303 - INFO - Average duration for successful tasks: 272.74 seconds.


### Average Duration: 272.74 seconds.

2025-04-24 04:24:49,305 - INFO - Removed duplicates, remaining 1440 unique QIDs.
2025-04-24 04:24:49,311 - INFO - Cleaned results saved to all_results/full_inference_nonCoT/gemini-2.5-flash-preview-04-17/results_noncot_full_inference_20250424_042449.csv.


✅ Cleaned results saved to `all_results/full_inference_nonCoT/gemini-2.5-flash-preview-04-17/results_noncot_full_inference_20250424_042449.csv`.

### Delete original results file if needed

In [16]:
os.remove(RESULTS_FINAL_DIR) # Remove the original results file
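
Because the deletion above is irreversible, it may be worth guarding it. The following is a minimal sketch, assuming the cleaned copy follows the `<name>_<YYYYmmdd>_<HHMMSS>.csv` naming used by `results_cleanup_and_report()`; the helper name `remove_if_cleaned_copy_exists` is hypothetical and not part of the notebook. It removes the original file only when at least one timestamped copy sits next to it.

```python
from pathlib import Path


def remove_if_cleaned_copy_exists(results_path):
    """Delete the results file only if a timestamped cleaned copy exists beside it.

    Returns True if the file was deleted, False otherwise.
    """
    path = Path(results_path)
    if not path.is_file():
        return False

    # Cleaned copies are assumed to be named <stem>_<8 chars>_<6 chars>.csv,
    # e.g. results_20250424_042449.csv; `?` matches exactly one character.
    pattern = f"{path.stem}_????????_??????.csv"
    if not any(path.parent.glob(pattern)):
        return False

    path.unlink()
    return True


# Usage in the notebook (assumes RESULTS_FINAL_DIR is defined as in earlier cells):
# remove_if_cleaned_copy_exists(RESULTS_FINAL_DIR)
```

The `_failed.csv` file written by the interim cleanup does not match this pattern, so it is not mistaken for a cleaned copy.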