# 02 - Augmentation, Pre-training, and Fine-tuning Pipeline

## Notebook Overview

This notebook implements and evaluates a complete pipeline for enhancing a Natural Language Queries (NLQ) model through data augmentation. The entire workflow, central to our project's extension, is divided into three major phases:

1.  **Phase I: LLM-Powered Data Augmentation:** We begin by leveraging a Large Language Model (LLM) to generate a new, synthetic training dataset. Starting from the timestamped narrations in Ego4D, we create NLQ-style questions and automatically associate them with precise temporal ground-truth windows. This phase includes a robust data filtering and validation process to ensure the quality of the synthetic data.

2.  **Phase II: Pre-training on Augmented Data:** The newly generated dataset is used to pre-train a baseline NLQ model (VSLNet). The goal of this phase is to teach the model the fundamental patterns of egocentric question-answering on a large and diverse set of synthetic examples, providing it with a powerful head start before it sees any human-annotated data.

3.  **Phase III: Fine-tuning on Official Data:** Finally, the model pre-trained on our synthetic data is fine-tuned on the official `nlq_train.json` dataset. This step adapts the generalized knowledge acquired during pre-training to the specific distribution and nuances of the official benchmark data. The ultimate goal is to demonstrate that this pre-training/fine-tuning strategy improves performance compared to training on the official data alone.

## 1. Environment and Data Setup
This initial section handles all the necessary setup to prepare our Colab environment. We will mount Google Drive, clone the model repository, install dependencies, and unpack the dataset into the local runtime for fast access.

### 1.1. Mount Google Drive
We begin by mounting Google Drive. This step is needed only to access the `ego4d_data.zip` archive stored there and to save results persistently across sessions.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### 1.2. Clone Model Repository and Set Directory
Next, we clone our `VSLNet_Code` repository, a modified version of the baseline provided for the project, and then set it as the working directory for this notebook. This allows us to call its scripts directly.

In [2]:
%%bash
# Clone the repository (if it doesn't already exist)
if [ ! -d "VSLNet_Code" ]; then
  git clone https://github.com/pietrogiancristofaro2001/ego4d-nlq-project.git
  # We only need the VSLNet_Code folder
  mv ego4d-nlq-project/VSLNet_Code .
  rm -rf ego4d-nlq-project
  echo "Repository cloned successfully."
else
  echo "Repository already exists."
fi

Repository cloned successfully.


Cloning into 'ego4d-nlq-project'...


In [3]:
# Change the notebook's working directory
%cd VSLNet_Code


/content/VSLNet_Code


### 1.3. Configure Environment for Augmentation and Pre-training
This is the main control cell for the first two phases of our project. It generates a `vars.sh` file **inside the current directory (`VSLNet_Code/`)**. This script defines all paths and parameters needed for data augmentation and for the subsequent pre-training run.

In [33]:
# --- Main Configuration ---
# We use our best model configuration for data augmentation; change these parameters to try a different setup.
PRETRAIN_MODEL_USED = "vslnet"  # Options: "vslnet", "vslbase"
PRETRAIN_FEATURE_TYPE = "egovlp" # Options: "egovlp", "omnivore"
PRETRAIN_TEXT_ENCODER = "bert"   # Options: "bert", "glove"
RUN_NUMBER = 2  # Useful to distinguish different experiments

# --- Auto-generated settings based on configuration ---
if PRETRAIN_FEATURE_TYPE == "egovlp":
    feature_dir_name = "egovlp_fp16"
    visual_feature_dim = 256
elif PRETRAIN_FEATURE_TYPE == "omnivore":
    feature_dir_name = "omnivore_video_swinl_fp16"
    visual_feature_dim = 1536
else:
    raise ValueError("Invalid FEATURE_TYPE selected.")

pretrain_experiment_name = f"pretrain_{PRETRAIN_MODEL_USED}_{PRETRAIN_FEATURE_TYPE}_{PRETRAIN_TEXT_ENCODER}_run{RUN_NUMBER}"

# --- vars.sh content ---
vars_sh_content = f"""#!/bin/bash

# --- I. SHARED PATH CONFIGURATION ---
export FEATURE_SOURCE_ZIP_PATH=/content/drive/MyDrive/EgoVisionProject/Data
export DRIVE_ZIP_FILENAME=ego4d_data.zip
export LOCAL_DATA_ROOT=/content/data
export EXPERIMENTS_BASE_DIR=$LOCAL_DATA_ROOT/experiments

# --- II. DATA AUGMENTATION & PRE-TRAINING SHARED PATHS ---
export LOCAL_ANNOTATIONS_DIR=$LOCAL_DATA_ROOT/ego4d_data/v1/annotations
export AUGMENTED_JSON_PATH=$LOCAL_ANNOTATIONS_DIR/nlq_train_augmented.json
export NARRATION_JSON_PATH=$LOCAL_ANNOTATIONS_DIR/narration.json
export LOCAL_VAL_SPLIT=$LOCAL_ANNOTATIONS_DIR/nlq_val.json
export LOCAL_TEST_SPLIT=$LOCAL_ANNOTATIONS_DIR/nlq_test_unannotated.json

# --- III. PRE-TRAINING SPECIFIC CONFIGURATION ---
export PRETRAIN_EXPERIMENT_NAME={pretrain_experiment_name}
export PRETRAIN_MODEL_NAME={PRETRAIN_MODEL_USED}
export PRETRAIN_VISUAL_FEATURE_TYPE={PRETRAIN_FEATURE_TYPE}
export PRETRAIN_TEXT_ENCODER_TYPE={PRETRAIN_TEXT_ENCODER}
export PRETRAIN_VISUAL_FEATURE_DIM={visual_feature_dim}
export PRETRAIN_FEATURE_DIR=$LOCAL_DATA_ROOT/ego4d_data/v1/{feature_dir_name}
export PRETRAIN_TRAIN_SPLIT=$AUGMENTED_JSON_PATH
export PRETRAIN_DATASET_DIR=$LOCAL_DATA_ROOT/dataset/$PRETRAIN_EXPERIMENT_NAME
export PRETRAIN_FEATURE_DIR_PROC=$LOCAL_DATA_ROOT/features/$PRETRAIN_EXPERIMENT_NAME/official
export PRETRAINED_CHECKPOINT_PATH=$EXPERIMENTS_BASE_DIR/{pretrain_experiment_name}
"""

# Write the content to vars.sh in the current directory (VSLNet_Code/)
with open("vars.sh", "w") as f:
    f.write(vars_sh_content)
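
Because `vars.sh` is sourced only inside `%%bash` cells, its exports are not visible to Python cells that call `os.environ.get(...)`. The following is a minimal sketch, not part of the original notebook, of one way to read simple `export KEY=VALUE` lines into a Python dictionary. The helper name `load_exports` is hypothetical, and the sketch assumes that values contain only plain text and `$NAME` references (no `${NAME}` syntax, quoting rules, or command substitution).

```python
import os
import re

def load_exports(script_text, env=None):
    """Parse simple `export KEY=VALUE` lines and expand $NAME references.

    Hypothetical helper: handles only the plain export style used in vars.sh.
    """
    # Start from the given environment (or a copy of the real one)
    env = dict(os.environ if env is None else env)
    loaded = {}
    for line in script_text.splitlines():
        match = re.match(r"^\s*export\s+([A-Za-z_][A-Za-z0-9_]*)=(.*)$", line)
        if not match:
            continue  # Skip comments, blank lines, and anything that is not an export
        key, raw = match.group(1), match.group(2).strip().strip('"')
        # Replace each $NAME with the value defined so far (empty string if unknown)
        value = re.sub(
            r"\$([A-Za-z_][A-Za-z0-9_]*)",
            lambda m: env.get(m.group(1), ""),
            raw,
        )
        env[key] = value
        loaded[key] = value
    return loaded

# Two lines in the same style as the generated vars.sh
sample = """
export LOCAL_DATA_ROOT=/content/data
export EXPERIMENTS_BASE_DIR=$LOCAL_DATA_ROOT/experiments
"""
exports = load_exports(sample, env={})
print(exports)
```

Under these assumptions, `os.environ.update(load_exports(open("vars.sh").read()))` would make the same values available to later Python cells.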


### 1.4. Install Dependencies
We install all required Python libraries from the repository's `requirements.txt`.

In [6]:
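%%bash
# `%%capture` is an IPython cell magic, not a shell command, so it is removed here:
# inside a %%bash cell, bash treats `%%` as a job spec and prints "fg: no job control".

pip install -r requirements.txt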

Collecting submitit (from -r requirements.txt (line 7))
  Downloading submitit-1.5.3-py3-none-any.whl.metadata (7.9 kB)
Collecting terminaltables (from -r requirements.txt (line 9))
  Downloading terminaltables-3.1.10-py2.py3-none-any.whl.metadata (3.5 kB)
Collecting bitsandbytes (from -r requirements.txt (line 16))
  Downloading bitsandbytes-0.46.0-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch->-r requirements.txt (line 3))
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch->-r requirements.txt (line 3))
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch->-r requirements.txt (line 3))
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.7

bash: line 1: fg: no job control


### 1.5. Extract Dataset from Google Drive
We use the variables defined in our `vars.sh` file to copy and extract the main dataset from Drive to the local Colab storage.

In [7]:
%%bash

source vars.sh

# Create local directory and extract data
mkdir -p "$LOCAL_DATA_ROOT"
DRIVE_ZIP_FILE_PATH="$FEATURE_SOURCE_ZIP_PATH/$DRIVE_ZIP_FILENAME"
LOCAL_TEMP_ZIP_FILE="/content/$DRIVE_ZIP_FILENAME"

if [ -f "$DRIVE_ZIP_FILE_PATH" ]; then
    echo "Copying $DRIVE_ZIP_FILENAME..."
    cp "$DRIVE_ZIP_FILE_PATH" "$LOCAL_TEMP_ZIP_FILE"
    echo "Extracting data..."
    unzip -o -q "$LOCAL_TEMP_ZIP_FILE" -d "$LOCAL_DATA_ROOT"
    rm "$LOCAL_TEMP_ZIP_FILE"
    echo "Data setup complete."
else
    echo "ERROR: File not found at $DRIVE_ZIP_FILE_PATH"
fi

Copying ego4d_data.zip...
Extracting data...
Data setup complete.


### 1.6. Load Metadata and Create Valid Narration Groups
This cell performs the core pre-computation. It loads the required annotation files, excludes videos that appear in the validation or test sets or that lack pre-extracted features, and builds lookup maps for clips. It then computes the `beta_map` (the average time between consecutive narrations in each video), which is later used to turn the single timestamp of each narration into a ground-truth time window. Finally, it constructs the list of all valid groups of `k` consecutive narrations.

In [None]:
import json
import os
import random
import uuid
from tqdm.auto import tqdm
import glob
import numpy as np

print("--- Starting Pre-computation and Filtering ---")

# Define all necessary paths relative to the Colab content root
repo_root = "/content"
local_data_root = os.path.join(repo_root, "data")
ego4d_json_path = os.path.join(local_data_root, 'ego4d_data', 'ego4d.json')
narration_path = os.path.join(local_data_root, 'ego4d_data', 'v1', 'annotations', 'narration.json')
val_json_path = os.path.join(local_data_root, 'ego4d_data', 'v1', 'annotations', 'nlq_val.json')
test_json_path = os.path.join(local_data_root, 'ego4d_data', 'v1', 'annotations', 'nlq_test_unannotated.json')
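# Note: vars.sh is sourced only inside %%bash cells, so PRETRAIN_FEATURE_DIR is normally
# unset in the Python kernel and the default EgoVLP path below is used.
feature_dir_path = os.environ.get('PRETRAIN_FEATURE_DIR', os.path.join(local_data_root, 'ego4d_data/v1/egovlp_fp16'))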

# Load core JSON files
print("Loading core JSON files...")
with open(ego4d_json_path, 'r') as f: ego4d_data = json.load(f)
with open(narration_path, 'r') as f: all_narrations_data = json.load(f)
print("Files loaded successfully.")


# 1. Exclude videos from val/test sets
print("\nFiltering out validation/test set videos...")
excluded_video_uids = set()
try:
    with open(val_json_path, 'r') as f: val_data = json.load(f)
    for video in val_data.get('videos', []): excluded_video_uids.add(video['video_uid'])
    with open(test_json_path, 'r') as f: test_data = json.load(f)
    for video in test_data.get('videos', []): excluded_video_uids.add(video['video_uid'])
    print(f"Found {len(excluded_video_uids)} unique videos to exclude.")
except FileNotFoundError:
    print("Warning: Could not find val/test JSON files.")


# 2. Check for existing features
print("\nFiltering out videos without pre-extracted features...")
existing_video_ids = {os.path.basename(f).split('.')[0] for f in glob.glob(os.path.join(feature_dir_path, '*.pt'))}
print(f"Found {len(existing_video_ids)} videos with features.")


# 3. Create a lookup map for clips for efficient access
print("\nCreating clip lookup maps...")
all_clips_map = {clip['clip_uid']: clip for clip in ego4d_data.get('clips', [])}
video_to_clips_map = {}
for clip in ego4d_data.get('clips', []):
    vid_uid = clip.get('video_uid')
    if vid_uid not in video_to_clips_map: video_to_clips_map[vid_uid] = []
    video_to_clips_map[vid_uid].append(clip)
print("Lookup maps created.")


# 4. Pre-compute Beta map (avg. time between narrations per video)
print("\nPre-computing beta map...")
video_to_beta_map = {}
for video_uid, video_content in all_narrations_data.items():
    if video_uid in excluded_video_uids or video_uid not in existing_video_ids: continue
    narrations_list = video_content.get("narration_pass_1", {}).get("narrations", [])
    if len(narrations_list) < 2: continue
    narrations_list.sort(key=lambda x: x['timestamp_sec'])
    diffs = [narrations_list[i+1]['timestamp_sec'] - narrations_list[i]['timestamp_sec'] for i in range(len(narrations_list)-1)]
    positive_diffs = [d for d in diffs if d > 0]
    if positive_diffs: video_to_beta_map[video_uid] = np.mean(positive_diffs)
print("Beta map computed.")


# 5. Create all possible valid narration groups
print("\nConstructing valid narration groups...")
k_narrations = 5
all_valid_groups = []
for video_uid, video_content in tqdm(all_narrations_data.items(), desc="Processing Videos"):
    # Additional filter: process only videos for which we have a beta value
    if video_uid not in video_to_beta_map: continue

    clips_for_this_video = video_to_clips_map.get(video_uid, [])
    narrations_list = video_content.get("narration_pass_1", {}).get("narrations", [])
    if len(narrations_list) < k_narrations: continue

    narrations_list.sort(key=lambda x: x['timestamp_sec'])

    for i in range(len(narrations_list) - k_narrations + 1):
        current_group = narrations_list[i : i + k_narrations]
        group_start_time = current_group[0]['timestamp_sec']
        group_end_time = current_group[-1]['timestamp_sec']

        # This is the key check: ensure the group is fully contained in a single parent clip
        parent_clip = next((c for c in clips_for_this_video if c['video_start_sec'] <= group_start_time and c['video_end_sec'] >= group_end_time), None)

        if parent_clip:
            all_valid_groups.append({
                "video_uid": video_uid,
                "narrations": current_group,
                "parent_clip_uid": parent_clip['clip_uid']
            })

print(f"\nPreprocessing complete. Found {len(all_valid_groups)} total valid groups.")

--- Starting Pre-computation and Filtering ---
Loading core JSON files...
Files loaded successfully.

Filtering out validation/test set videos...
Found 505 unique videos to exclude.

Filtering out videos without pre-extracted features...
Found 9611 videos with features.

Creating clip lookup maps...
Lookup maps created.

Pre-computing beta map...
Beta map computed.

Constructing valid narration groups...


Processing Videos:   0%|          | 0/9645 [00:00<?, ?it/s]


Preprocessing complete. Found 591181 total valid groups.
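
To make the last two steps concrete, here is a small self-contained sketch on made-up timestamps. It mirrors the logic of the cell above (mean of the positive gaps, sliding windows of `k` consecutive narrations, containment in a single parent clip), but the helper names and the toy data are illustrative assumptions, not code from the repository.

```python
from statistics import fmean

def mean_positive_gap(timestamps):
    """Average gap between consecutive timestamps, ignoring zero-length gaps."""
    ordered = sorted(timestamps)
    gaps = [b - a for a, b in zip(ordered, ordered[1:]) if b > a]
    return fmean(gaps) if gaps else None

def sliding_groups(items, k):
    """All windows of k consecutive items."""
    return [items[i:i + k] for i in range(len(items) - k + 1)]

def find_parent_clip(clips, start, end):
    """First clip whose boundaries fully contain [start, end], or None."""
    return next(
        (c for c in clips
         if c["video_start_sec"] <= start and c["video_end_sec"] >= end),
        None,
    )

# Toy data, invented for illustration only
timestamps = [0.0, 2.0, 2.0, 5.0, 9.0, 10.0]
clips = [{"clip_uid": "clip-a", "video_start_sec": 0.0, "video_end_sec": 9.5}]

# Gaps of 2, 3, 4, and 1 seconds are kept; the zero gap between the two 2.0 entries is dropped
beta = mean_positive_gap(timestamps)

# Six narrations with k=5 give two candidate groups
groups = sliding_groups(sorted(timestamps), k=5)

# Keep only groups whose first and last timestamps fall inside one clip
valid = [g for g in groups if find_parent_clip(clips, g[0], g[-1]) is not None]

print(beta, len(groups), len(valid))
```

In this toy case the second group ends at 10.0 s, beyond the clip boundary at 9.5 s, so only the first group is kept.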


## 2. Timestamp Window Analysis & Debugging
This is a critical step. Before running the expensive LLM generation, we must make sure that our logic for creating timestamp windows is robust. In this section, we analyze the `window_duration` calculation, verify that it produces valid, non-collapsing time intervals, and experiment with the formula to find a stable configuration.

In [None]:
# --- DEBUGGING SCRIPT ---
# This cell now only performs the analysis, assuming all data structures were created above.

print("--- Starting Timestamp Debugging Analysis ---")

# Parameters from the EgoVLP paper
alpha = 4.9
# NEW PARAMETER: Let's define a minimum duration to prevent windows from collapsing.
MIN_WINDOW_DURATION_SEC = 1.0

# --- Analysis Loop ---
num_groups_to_inspect = 10 # Let's inspect a few random groups
successful_windows = 0
total_narrations_inspected = 0

random.shuffle(all_valid_groups) # Process in random order

for group_data in all_valid_groups[:num_groups_to_inspect]:
    video_uid = group_data['video_uid']
    parent_clip_uid = group_data['parent_clip_uid']
    parent_clip_info = all_clips_map.get(parent_clip_uid)
    beta_i = video_to_beta_map.get(video_uid)

    if not parent_clip_info or not beta_i: continue

    print(f"\n--- Inspecting Group from Video: {video_uid} | Parent Clip: {parent_clip_uid} ---")
    print(f"Parent Clip Boundaries: [{parent_clip_info['video_start_sec']:.2f}, {parent_clip_info['video_end_sec']:.2f}] | Avg. narration gap (beta): {beta_i:.2f}s")

    for narration_obj in group_data["narrations"]:
        total_narrations_inspected += 1
        t_i = narration_obj['timestamp_sec']

        # Original calculation from EgoVLP
        calculated_duration = beta_i / alpha

        # Our new robust calculation
        window_duration = max(MIN_WINDOW_DURATION_SEC, calculated_duration)

        # Calculate and clip the window to the parent clip's boundaries
        start_time_abs = max(parent_clip_info['video_start_sec'], t_i - (window_duration / 2))
        end_time_abs = min(parent_clip_info['video_end_sec'], t_i + (window_duration / 2))

        is_valid = "✅ VALID" if start_time_abs < end_time_abs else "❌ INVALID"
        if start_time_abs < end_time_abs: successful_windows += 1

        print(f"  Narration at {t_i:.2f}s -> "
              f"Proposed duration: {calculated_duration:.2f}s -> "
              f"Final duration: {window_duration:.2f}s -> "
              f"Final Window: [{start_time_abs:.2f}, {end_time_abs:.2f}] -> {is_valid}")

print(f"\n--- Analysis Complete ---")
print(f"Successfully created {successful_windows} valid windows out of {total_narrations_inspected} narrations inspected.")

--- Starting Timestamp Debugging Analysis ---

--- Inspecting Group from Video: ea49e17c-7405-4f0d-860f-6a16b3b83a28 | Parent Clip: f3075c6e-2e8f-46c7-b935-ba6c8ab7aae0 ---
Parent Clip Boundaries: [0.00, 480.00] | Avg. narration gap (beta): 4.78s
  Narration at 210.94s -> Proposed duration: 0.98s -> Final duration: 1.00s -> Final Window: [210.44, 211.44] -> ✅ VALID
  Narration at 235.86s -> Proposed duration: 0.98s -> Final duration: 1.00s -> Final Window: [235.36, 236.36] -> ✅ VALID
  Narration at 236.90s -> Proposed duration: 0.98s -> Final duration: 1.00s -> Final Window: [236.40, 237.40] -> ✅ VALID
  Narration at 244.32s -> Proposed duration: 0.98s -> Final duration: 1.00s -> Final Window: [243.82, 244.82] -> ✅ VALID
  Narration at 245.14s -> Proposed duration: 0.98s -> Final duration: 1.00s -> Final Window: [244.64, 245.64] -> ✅ VALID

--- Inspecting Group from Video: d66f42bb-822b-444a-bce0-ddd15b29bd1b | Parent Clip: 9db68bbc-c349-47e8-941f-e297981b4e8d ---
Parent Clip Boundarie
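
The window rule used in the analysis can be summarized as: duration `d = max(MIN_WINDOW_DURATION_SEC, beta_i / alpha)`, window `[t_i - d/2, t_i + d/2]`, clipped to the parent clip boundaries. The sketch below packages this as a standalone function so it can be checked in isolation; the function name `narration_window` is a hypothetical choice, and the defaults simply repeat the values used in the cell above.

```python
def narration_window(t_i, beta_i, clip_start, clip_end, alpha=4.9, min_duration=1.0):
    """Return (start, end) in video time for a narration at t_i, or None if empty.

    Hypothetical helper that repeats the calculation from the debugging cell.
    """
    # Duration from the average narration gap, with a floor to avoid collapsing windows
    duration = max(min_duration, beta_i / alpha)
    # Center the window on the narration and clip it to the parent clip
    start = max(clip_start, t_i - duration / 2)
    end = min(clip_end, t_i + duration / 2)
    if start >= end:
        return None
    return start, end

# First narration from the analysis output above: beta = 4.78 s, clip [0, 480]
window = narration_window(210.94, 4.78, 0.0, 480.0)

# A narration close to the clip start gets a shortened window
clipped = narration_window(0.2, 4.78, 0.0, 480.0)

# A narration outside the clip yields no valid window
outside = narration_window(500.0, 4.78, 0.0, 480.0)

print(window, clipped, outside)
```

For the first case, `beta_i / alpha` is about 0.98 s, so the 1.0 s floor applies and the window is approximately `[210.44, 211.44]`, matching the logged output.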

## 3. LLM-Powered Query Generation
Now that we have a robust method for creating valid timestamp windows, we can proceed with generating the synthetic queries. This section covers loading the Large Language Model (LLM), defining the prompt, and running the main generation loop.

### 3.1. Configure and Load the LLM
We will load the `Gemma-2b-it` model from Google. We use 4-bit quantization (`bitsandbytes`) to load the model efficiently on a Colab GPU.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Define the model ID and quantization configuration
model_id = "google/gemma-2b-it"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

print(f"Loading model: {model_id}...")
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
# The device_map="auto" argument will intelligently place model parts on GPU and CPU.
llm_model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
print("LLM loaded successfully.")

Loading model: google/gemma-2b-it...


LLM loaded successfully.


### 3.2. Define the Generation Prompt
We create a function that encapsulates the prompt engineering strategy. This function takes a narration text and formats it into a detailed prompt, including role-playing, constraints, and few-shot examples based on the most common query templates from the NLQ benchmark, to guide the LLM's output effectively.

In [None]:
def create_generation_prompt(narration_text):
    """
    Creates the final, optimized few-shot prompt to guide gemma-2b-it in
    generating a diverse and high-quality NLQ-style query.
    """
    # This prompt is the result of our iterative refinement process.
    prompt = f"""<start_of_turn>user
You are a creative assistant specializing in generating training data for an AI that understands egocentric videos. Your goal is to convert a simple action description into a clear, concise, and relevant question.

### CONTEXT
The question must be answerable by watching a short video clip corresponding to the action. The action is performed by a person wearing the camera, referred to as "#C C".

### TASK
Read the following action description and generate a SINGLE, high-quality question.

### GUIDELINES & CONSTRAINTS
1.  **Simplicity is Key:** The question must be simple, direct, and under 15 words.
2.  **Stay Grounded:** The question MUST be strictly about the objects and actions mentioned in the description. Do not invent details or infer information that isn't present.
3.  **Forbidden Questions:**
    - DO NOT ask "why".
    - DO NOT ask Yes/No questions.
    - DO NOT ask about the person's internal state (e.g., "What was the person thinking?").
4.  **Question Style Examples (Based on common templates):**
    - **"What/Which" questions (about objects/actions):**
        - Input: "#C C picks up the remote" -> Question: "Which did #C C pick up?"
        - Input: "#C C is slicing a tomato" -> Question: "What is #C C slicing?"
    - **"Where" questions (about location):**
        - Input: "#C C puts the book on the shelf" -> Question: "Where did #C C put the book?"
    - **"How" questions (about the manner of an action):**
        - Input: "#C C opens the jar with a cloth" -> Question: "How did #C C open the jar?"

### ACTION DESCRIPTION
"{narration_text}"

### YOUR GENERATED QUESTION
Question:
<end_of_turn>
<start_of_turn>model
"""
    return prompt

# --- Test the final prompt function ---
test_narration = "#C C turns the faucet knob"
print("--- Example of the Final, Optimized Prompt ---")
print(create_generation_prompt(test_narration))

### 3.3. Run the Generation Loop
This is the main loop for data augmentation. We iterate through a shuffled list of the valid narration groups, calculate the robust timestamp window for each narration, and call the LLM to generate a query. The results are collected into a final list.

In [None]:
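# Set how many groups of narrations we want to process to generate queries.
# Each group has 5 narrations, so this will generate up to num_groups_to_generate * 5 queries.
# Note: `alpha` and `MIN_WINDOW_DURATION_SEC` are reused from the analysis cell in Section 2.
num_groups_to_generate = 2300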
print(f"\nStarting query generation. Goal: Process {num_groups_to_generate} narration groups.")

generated_data_groups = []
random.shuffle(all_valid_groups) # Shuffle to process groups in random order

# This progress bar will track the number of processed groups
with tqdm(total=num_groups_to_generate, desc="Generating Annotations") as pbar:
    for group_data in all_valid_groups:
        # Stop when we have processed enough groups
        if len(generated_data_groups) >= num_groups_to_generate:
            break

        # Get all necessary info for the group
        video_uid = group_data['video_uid']
        beta_i = video_to_beta_map.get(video_uid)
        parent_clip_info = all_clips_map.get(group_data['parent_clip_uid'])
        if not parent_clip_info or not beta_i: continue

        queries_for_this_group = []
        for narration_obj in group_data["narrations"]:
            # 1. Calculate the robust timestamp window
            t_i = narration_obj['timestamp_sec']
            calculated_duration = beta_i / alpha
            window_duration = max(MIN_WINDOW_DURATION_SEC, calculated_duration)
            start_time_abs = max(parent_clip_info['video_start_sec'], t_i - (window_duration / 2))
            end_time_abs = min(parent_clip_info['video_end_sec'], t_i + (window_duration / 2))

            # Safety check to avoid logical errors on the interval
            if start_time_abs >= end_time_abs: continue

            # 2. Generate and parse the query
            final_query = None
            try:
                # Create the few-shot prompt for this narration
                prompt = create_generation_prompt(narration_obj['narration_text'])
                inputs = tokenizer(prompt, return_tensors="pt").to(llm_model.device)

                # Get the length of the input prompt in tokens for robust separation
                prompt_token_length = inputs['input_ids'].shape[1]

                # Generate the output tokens (prompt + response with max 50 tokens)
                outputs = llm_model.generate(**inputs, max_new_tokens=50, do_sample=False)

                # Get only the newly generated tokens by slicing
                response_tokens = outputs[0][prompt_token_length:]

                # Decode only the response tokens to get the model's raw answer
                raw_response = tokenizer.decode(response_tokens, skip_special_tokens=True).strip()

                # Apply the robust parsing and validation logic
                # Get the first line of the response and clean it
                query_candidate = raw_response.split('\n')[0].strip().strip('\"* ')

                # Validate that the result is a non-empty string and contains a question mark
                if query_candidate and "?" in query_candidate:
                    final_query = query_candidate

            except Exception as e:
                print(f"An error occurred during generation: {e}")
                final_query = None # Ensure it's None in case of an error

            # 3. Collect valid results
            if final_query:
                queries_for_this_group.append({
                    "query": final_query,
                    "template": "LLM-Generated",
                    "video_start_sec": start_time_abs,
                    "video_end_sec": end_time_abs,
                    "clip_start_sec": start_time_abs - parent_clip_info['video_start_sec'],
                    "clip_end_sec": end_time_abs - parent_clip_info['video_start_sec']
                })

        # If we successfully generated at least one query for this group, add it
        if queries_for_this_group:
            generated_data_groups.append({
                "video_uid": video_uid,
                "parent_clip_uid": parent_clip_info['clip_uid'],
                "language_queries": queries_for_this_group
            })
            pbar.update(1)

print(f"\nGeneration complete. Successfully created {len(generated_data_groups)} annotation blocks.")


Starting query generation. Goal: Process 2300 narration groups.


Generating Annotations:   0%|          | 0/2300 [00:00<?, ?it/s]


Generation complete. Successfully created 2300 annotation blocks.
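
The parsing and validation step inside the loop can be isolated and tested without loading the LLM. The following sketch applies the same rule (take the first line, strip quotes, asterisks, and spaces, and require a question mark); the function name `parse_query` and the sample responses are invented for illustration.

```python
def parse_query(raw_response):
    """Return the first line of the response as a query, or None if it is not a question.

    Hypothetical helper that repeats the parsing rule from the generation loop.
    """
    first_line = raw_response.strip().split("\n")[0]
    # Remove surrounding whitespace, quotes, and markdown asterisks
    candidate = first_line.strip().strip('"* ')
    return candidate if candidate and "?" in candidate else None

# Made-up raw responses covering the accepted and rejected cases
examples = [
    '**"What did #C C pick up?"**\nExplanation: ...',  # markdown and quotes are stripped
    "Sure, here is a question",                         # no question mark, rejected
    "",                                                 # empty response, rejected
]
parsed = [parse_query(e) for e in examples]
print(parsed)
```

This check validates only the form of the output. It does not verify that the question is grounded in the narration.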


In [None]:
# Optional cell to inspect the results
print(f"Generated {len(generated_data_groups)} groups.")
# Adjust the slice below to choose how many groups to display
if generated_data_groups:
    print(json.dumps(generated_data_groups[0:40], indent=2))

Generated 2300 groups.
[
  {
    "video_uid": "49b14463-2172-4b0e-ad13-fdf9383e9a77",
    "parent_clip_uid": "90caba4b-0e92-4ed6-b53f-0f698b56f795",
    "language_queries": [
      {
        "query": "Which object did the woman interact with in the video segment?",
        "template": "LLM-Generated",
        "video_start_sec": 638.1311866329004,
        "video_end_sec": 639.5040357670996,
        "clip_start_sec": 8.099155382900335,
        "clip_end_sec": 9.472004517099549
      },
      {
        "query": "Which object did she place on the dining table?",
        "template": "LLM-Generated",
        "video_start_sec": 640.4404066329004,
        "video_end_sec": 641.8132557670996,
        "clip_start_sec": 10.408375382900317,
        "clip_end_sec": 11.781224517099531
      },
      {
        "query": "Which object did the woman remove from the dining table?",
        "template": "LLM-Generated",
        "video_start_sec": 654.4853766329004,
        "video_end_sec": 655.8582257670996

### 3.4. Format and Save the Augmented Dataset
Finally, we convert the collected data into the official NLQ JSON format and save it to a file. This file can then be used as input for the pre-training phase.

In [None]:
print(f"\nConverting {len(generated_data_groups)} annotation blocks to final format...")

# Output path: matches AUGMENTED_JSON_PATH in vars.sh (the default is used if the variable is not set in the Python kernel)
output_json_path = os.environ.get('AUGMENTED_JSON_PATH', '/content/data/ego4d_data/v1/annotations/nlq_train_augmented.json')

final_output = {"version": "1.0", "description": "Augmented NLQ dataset - Generated with LLM", "videos": []}
output_videos_map = {}

for datum in tqdm(generated_data_groups, desc="Final Conversion"):
    video_uid = datum['video_uid']
    parent_clip_uid = datum['parent_clip_uid']
    parent_clip_info = all_clips_map.get(parent_clip_uid)
    if not parent_clip_info: continue

    # Create video entry if it doesn't exist
    if video_uid not in output_videos_map:
        output_videos_map[video_uid] = {"video_uid": video_uid, "clips": []}

    video_entry = output_videos_map[video_uid]

    # Find or create clip entry
    output_clip_entry = next((c for c in video_entry["clips"] if c["clip_uid"] == parent_clip_uid), None)
    if not output_clip_entry:
        output_clip_entry = {
            "clip_uid": parent_clip_uid,
            "video_start_sec": parent_clip_info['video_start_sec'],
            "video_end_sec": parent_clip_info['video_end_sec'],
            "annotations": []
        }
        video_entry["clips"].append(output_clip_entry)

    # Add the block of generated queries as a new annotation
    new_annotation_block = {
        "annotation_uid": str(uuid.uuid4()),
        "language_queries": datum['language_queries']
    }
    output_clip_entry["annotations"].append(new_annotation_block)

final_output['videos'] = list(output_videos_map.values())

with open(output_json_path, 'w') as f:
    json.dump(final_output, f, indent=2)

print(f"\nProcess complete! Final augmented dataset saved to: {output_json_path}")


Converting 2300 annotation blocks to final format...


Final Conversion:   0%|          | 0/2300 [00:00<?, ?it/s]


Process complete! Final augmented dataset saved to: /content/data/ego4d_data/v1/annotations/nlq_train_augmented.json
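
As a quick sanity check on the nested structure written above (`videos` → `clips` → `annotations` → `language_queries`), the sketch below counts the queries in a dictionary of that shape. The helper name `count_queries` and the toy dictionary are assumptions for illustration; the same function could be applied to the JSON loaded from `output_json_path`.

```python
def count_queries(nlq_data):
    """Count all language queries in an NLQ-style dictionary (hypothetical helper)."""
    return sum(
        len(annotation["language_queries"])
        for video in nlq_data["videos"]
        for clip in video["clips"]
        for annotation in clip["annotations"]
    )

# Toy structure with the same nesting as the saved file (all values are made up)
toy = {
    "version": "1.0",
    "videos": [
        {
            "video_uid": "video-1",
            "clips": [
                {
                    "clip_uid": "clip-1",
                    "video_start_sec": 0.0,
                    "video_end_sec": 480.0,
                    "annotations": [
                        {"annotation_uid": "a1", "language_queries": [{"query": "q1?"}, {"query": "q2?"}]},
                        {"annotation_uid": "a2", "language_queries": [{"query": "q3?"}]},
                    ],
                }
            ],
        }
    ],
}
total = count_queries(toy)
print(total)
```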


### 3.5. Save Augmented Dataset to Google Drive (Optional)
As a final step for the data augmentation phase, this optional cell copies the generated `nlq_train_augmented.json` file from the local Colab storage to our specified folder on Google Drive.

In [None]:
%%bash
source vars.sh

# The source file is the local path where we saved our augmented data
SOURCE_FILE="$AUGMENTED_JSON_PATH"

# The destination directory on Google Drive. We can reuse the path where the original data zip is located.
DEST_DIR="$FEATURE_SOURCE_ZIP_PATH"

# Check if the source file actually exists before trying to copy
if [ -f "$SOURCE_FILE" ]; then
  echo "Copying augmented dataset from:"
  echo "$SOURCE_FILE"
  echo "to Google Drive directory:"
  echo "$DEST_DIR"

  # Ensure the destination directory exists
  mkdir -p "$DEST_DIR"

  # Copy the file
  cp "$SOURCE_FILE" "$DEST_DIR"

  echo -e "\n✅ Copy complete!"
  echo "You can now find your file in your Google Drive."
else
  echo " ERROR: Source file $SOURCE_FILE not found."
  echo "Please ensure the previous cell (3.4) ran successfully and created the file."
fi

Copying augmented dataset from:
/content/data/ego4d_data/v1/annotations/nlq_train_augmented.json
to Google Drive directory:
/content/drive/MyDrive/EgoVisionProject/Data

✅ Copy complete!
You can now find your file in your Google Drive.


## 4. Setup for Pre-training on Augmented Data
Now that we have our augmented dataset, we need to prepare it for the VSLNet model. This involves running the `prepare_ego4d_dataset.py` script, which preprocesses the JSON file and the corresponding features into a format optimized for the data loader.

We will use the configuration variables (prefixed with `PRETRAIN_...`) that we defined in our main `vars.sh` file at the beginning of the notebook.

In [12]:
%%bash
source vars.sh

# Useful if the augmented JSON was already generated in a previous run and saved to Drive; otherwise, comment out this line.
AUGMENTED_JSON_PATH="/content/drive/MyDrive/EgoVisionProject/Data/nlq_train_augmented.json"

echo "Creating output directories for processed pre-training data..."
mkdir -p "$PRETRAIN_DATASET_DIR"
mkdir -p "$PRETRAIN_FEATURE_DIR_PROC"

echo "Running data preparation script for the pre-training phase..."
# Run the script to prepare the dataset
python utils/prepare_ego4d_dataset.py \
    --input_train_split "$AUGMENTED_JSON_PATH" \
    --input_val_split "$LOCAL_VAL_SPLIT" \
    --input_test_split "$LOCAL_TEST_SPLIT" \
    --video_feature_read_path "$PRETRAIN_FEATURE_DIR" \
    --clip_feature_save_path "$PRETRAIN_FEATURE_DIR_PROC" \
    --output_save_path "$PRETRAIN_DATASET_DIR"

echo "Pre-training data preparation finished."

Creating output directories for processed pre-training data...
Running data preparation script for the pre-training phase...
Reading [train]: /content/drive/MyDrive/EgoVisionProject/Data/nlq_train_augmented.json
# train: 11350
Writing [train]: /content/data/dataset/pretrain_vslnet_egovlp_bert_run2/train.json
Reading [val]: /content/data/ego4d_data/v1/annotations/nlq_val.json
# val: 3874
Writing [val]: /content/data/dataset/pretrain_vslnet_egovlp_bert_run2/val.json
Reading [test]: /content/data/ego4d_data/v1/annotations/nlq_test_unannotated.json
# test: 4004
Writing [test]: /content/data/dataset/pretrain_vslnet_egovlp_bert_run2/test.json
Pre-training data preparation finished.


Extracting features: 0/2543 ... [progress bar output truncated]

### 4.2. Create Symbolic Links
The training scripts expect the data and feature directories to be in specific locations within the working directory. We create symbolic links (`ln -sfn`) to point from these expected locations to our actual data folders in the local Colab storage. This is a crucial step for the data preparation and training scripts to work properly.
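
For readers who prefer to manage the links from Python, the same idea can be sketched as below. This is an illustrative equivalent, not the code the notebook runs: `force_symlink` is a hypothetical helper, and unlike `ln -sfn` it refuses to touch a path that exists but is not a symlink.

```python
import os
import tempfile
from pathlib import Path


def force_symlink(target, link_path):
    """Point link_path at target, replacing an existing symlink if present."""
    link_path = Path(link_path)
    # Make sure the directory that will contain the link exists.
    link_path.parent.mkdir(parents=True, exist_ok=True)
    if link_path.is_symlink():
        # Covers both valid and dangling links left over from earlier runs.
        link_path.unlink()
    elif link_path.exists():
        raise FileExistsError(f"{link_path} exists and is not a symlink")
    os.symlink(target, link_path, target_is_directory=True)
    return link_path


# Demo in a temporary directory: re-pointing the link is safe to repeat.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first = root / "storage" / "run_a"
    second = root / "storage" / "run_b"
    first.mkdir(parents=True)
    second.mkdir(parents=True)

    link = root / "workdir" / "data" / "dataset" / "demo_experiment"
    force_symlink(first, link)
    force_symlink(second, link)  # replaces the link created above
    resolved_ok = link.resolve() == second.resolve()

print(resolved_ok)
```

Raising on a real directory is a deliberate choice in this sketch: it avoids silently nesting a link inside an existing folder when the expected location is already occupied.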

In [16]:
%%bash

source vars.sh

CWD=$(pwd)
# Base directory for the symbolic links
mkdir -p "$CWD/data/dataset"
# Also create the per-experiment subdirectory ($PRETRAIN_EXPERIMENT_NAME) under features
mkdir -p "$CWD/data/features/$PRETRAIN_EXPERIMENT_NAME"

# 1. Annotations link

# Remove the previous link if it exists and create the new one
rm -f "$CWD/data/dataset/$PRETRAIN_EXPERIMENT_NAME"
ln -sfn "$PRETRAIN_DATASET_DIR" "$CWD/data/dataset/$PRETRAIN_EXPERIMENT_NAME"
echo "Annotations link: $CWD/data/dataset/$PRETRAIN_EXPERIMENT_NAME -> $PRETRAIN_DATASET_DIR"

# 2. Processed features link

# Remove the previous link if it exists and create the new one
rm -f "$CWD/data/features/$PRETRAIN_EXPERIMENT_NAME/official"
ln -sfn "$PRETRAIN_FEATURE_DIR_PROC" "$CWD/data/features/$PRETRAIN_EXPERIMENT_NAME/official"
echo "Features link: $CWD/data/features/$PRETRAIN_EXPERIMENT_NAME/official -> $PRETRAIN_FEATURE_DIR_PROC"

echo "--- Setup completed. Checks below: ---"
echo "Annotations target PRETRAIN_DATASET_DIR exists?"
ls -ld "$PRETRAIN_DATASET_DIR"
echo "Annotations link $CWD/data/dataset/$PRETRAIN_EXPERIMENT_NAME points to:"
ls -ld "$CWD/data/dataset/$PRETRAIN_EXPERIMENT_NAME"

echo "Features target PRETRAIN_FEATURE_DIR_PROC exists?"
ls -ld "$PRETRAIN_FEATURE_DIR_PROC"
echo "Features link $CWD/data/features/$PRETRAIN_EXPERIMENT_NAME/official points to:"
ls -ld "$CWD/data/features/$PRETRAIN_EXPERIMENT_NAME/official"

Annotations link: /content/VSLNet_Code/data/dataset/pretrain_vslnet_egovlp_bert_run2 -> /content/data/dataset/pretrain_vslnet_egovlp_bert_run2
Features link: /content/VSLNet_Code/data/features/pretrain_vslnet_egovlp_bert_run2/official -> /content/data/features/pretrain_vslnet_egovlp_bert_run2/official
--- Setup completed. Checks below: ---
Annotations target PRETRAIN_DATASET_DIR exists?
drwxr-xr-x 2 root root 4096 Jun 19 15:05 /content/data/dataset/pretrain_vslnet_egovlp_bert_run2
Annotations link /content/VSLNet_Code/data/dataset/pretrain_vslnet_egovlp_bert_run2 points to:
lrwxrwxrwx 1 root root 54 Jun 19 15:11 /content/VSLNet_Code/data/dataset/pretrain_vslnet_egovlp_bert_run2 -> /content/data/dataset/pretrain_vslnet_egovlp_bert_run2
Features target PRETRAIN_FEATURE_DIR_PROC exists?
drwxr-xr-x 2 root root 172032 Jun 19 15:05 /content/data/features/pretrain_vslnet_egovlp_bert_run2/official
Features link /content/VSLNet_Code/data/features/pretrain_vslnet_egovlp_bert_run2/official points

## 5. Launch Pre-training
With the data prepared, we can now launch the pre-training script `main.py`. This script will train the VSLNet model from scratch using only our synthetic dataset. The resulting model checkpoint (the last checkpoint) will be saved locally and can then be optionally copied to Google Drive. This checkpoint will serve as the starting point for the final fine-tuning phase.

In [34]:
%%bash

source vars.sh

# --- Hyper-parameter Configuration for Pre-training ---
export DATALOADER_WORKERS=1
export NUM_WORKERS=2
export BATCH_SIZE=32
export DIM=128
export NUM_EPOCH=10 # Adjust epochs as needed for pre-training
export MAX_POS_LEN=128
export INIT_LR=0.0025

# --- Construct TensorBoard Log Name ---
export TB_LOG_NAME="${PRETRAIN_EXPERIMENT_NAME}_bs${BATCH_SIZE}_dim${DIM}_epoch${NUM_EPOCH}_ilr${INIT_LR}"


mkdir -p "$PRETRAINED_CHECKPOINT_PATH"

echo "--- Starting PRE-TRAINING ---"
echo "Experiment Name: $PRETRAIN_EXPERIMENT_NAME"
echo "Model: $PRETRAIN_MODEL_NAME"
echo "Video Features: $PRETRAIN_VISUAL_FEATURE_TYPE (Dim: $PRETRAIN_VISUAL_FEATURE_DIM)"
echo "Text Encoder: $PRETRAIN_TEXT_ENCODER_TYPE"
echo "Training Data: AUGMENTED"
echo "--------------------------"

# --pretrain yes makes the script save the last checkpoint. --eval_gt_json is not needed here, but we keep it to avoid errors in the script.
python main.py \
    --task "$PRETRAIN_EXPERIMENT_NAME" \
    --mode train \
    --pretrain "yes" \
    --predictor "$PRETRAIN_TEXT_ENCODER_TYPE" \
    --dim $DIM \
    --video_feature_dim $PRETRAIN_VISUAL_FEATURE_DIM \
    --max_pos_len $MAX_POS_LEN \
    --init_lr $INIT_LR \
    --epochs $NUM_EPOCH \
    --batch_size $BATCH_SIZE \
    --fv official \
    --eval_gt_json "$LOCAL_VAL_SPLIT" \
    --num_workers $NUM_WORKERS \
    --data_loader_workers $DATALOADER_WORKERS \
    --model_dir "$PRETRAINED_CHECKPOINT_PATH" \
    --log_to_tensorboard $TB_LOG_NAME \
    --tb_log_freq 5 \
    --remove_empty_queries_from train



--- Starting PRE-TRAINING ---
Experiment Name: pretrain_vslnet_egovlp_bert_run2
Model: vslnet
Video Features: egovlp (Dim: 256)
Text Encoder: bert
Training Data: AUGMENTED
--------------------------
Running with Namespace(save_dir='datasets', model_type='vslnet', resume_from_checkpoint=None, task='pretrain_vslnet_egovlp_bert_run2', eval_gt_json=None, fv='official', max_pos_len=128, num_workers=2, data_loader_workers=1, word_size=None, char_size=None, word_dim=300, video_feature_dim=256, char_dim=50, dim=128, highlight_lambda=5.0, num_heads=8, drop_rate=0.2, predictor='bert', gpu_idx='0', seed=12345, mode='train', epochs=10, batch_size=32, num_train_steps=None, init_lr=0.0025, clip_norm=1.0, warmup_proportion=0.0, extend=0.1, period=100, text_agnostic=False, video_agnostic=False, model_dir='/content/data/experiments/pretrain_vslnet_egovlp_bert_run2', model_name='vslnet', suffix=None, log_to_tensorboard='pretrain_vslnet_egovlp_bert_run2_bs32_dim128_epoch10_ilr0.0025', tb_log_dir='./runs'

2025-06-19 17:17:52.014723: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-06-19 17:17:52.035942: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1750353472.058347   44450 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1750353472.065334   44450 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-06-19 17:17:52.089655: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instr

CalledProcessError: Command 'b'# Source the main configuration file\nsource vars.sh\n\n# --- Hyper-parameter Configuration for Pre-training ---\nexport DATALOADER_WORKERS=1\nexport NUM_WORKERS=2\nexport BATCH_SIZE=32\nexport DIM=128\nexport NUM_EPOCH=10 # Adjust epochs as needed for pre-training\nexport MAX_POS_LEN=128\nexport INIT_LR=0.0025\n\n# --- Construct TensorBoard Log Name ---\nexport TB_LOG_NAME="${PRETRAIN_EXPERIMENT_NAME}_bs${BATCH_SIZE}_dim${DIM}_epoch${NUM_EPOCH}_ilr${INIT_LR}"\n\n\nmkdir -p "$PRETRAINED_CHECKPOINT_PATH"\n\necho "--- Starting PRE-TRAINING ---"\necho "Experiment Name: $PRETRAIN_EXPERIMENT_NAME"\necho "Model: $PRETRAIN_MODEL_NAME"\necho "Video Features: $PRETRAIN_VISUAL_FEATURE_TYPE (Dim: $PRETRAIN_VISUAL_FEATURE_DIM)"\necho "Text Encoder: $PRETRAIN_TEXT_ENCODER_TYPE"\necho "Training Data: AUGMENTED"\necho "--------------------------"\n\npython main.py \\\n    --task "$PRETRAIN_EXPERIMENT_NAME" \\\n    --mode train \\\n    --predictor "$PRETRAIN_TEXT_ENCODER_TYPE" \\\n    --dim $DIM \\\n    --video_feature_dim $PRETRAIN_VISUAL_FEATURE_DIM \\\n    --max_pos_len $MAX_POS_LEN \\\n    --init_lr $INIT_LR \\\n    --epochs $NUM_EPOCH \\\n    --batch_size $BATCH_SIZE \\\n    --fv official \\\n    --num_workers $NUM_WORKERS \\\n    --data_loader_workers $DATALOADER_WORKERS \\\n    --model_dir "$PRETRAINED_CHECKPOINT_PATH" \\\n    --log_to_tensorboard $TB_LOG_NAME \\\n    --tb_log_freq 5 \\\n    --remove_empty_queries_from train\n\n    #--eval_gt_json "$LOCAL_VAL_SPLIT" \\\n'' returned non-zero exit status 1.

### 5.1. Save Pre-training Results to Google Drive (Optional)
This optional step copies the entire pre-training experiment folder (containing model checkpoints and logs) from the local Colab storage to our specified directory on Google Drive for permanent storage.

In [19]:
%%bash
# Source the main configuration file
source vars.sh

# Source directory (local) for the pre-training run
SOURCE_DIR="$LOCAL_DATA_ROOT/experiments/$PRETRAIN_EXPERIMENT_NAME"

# Destination directory (on Google Drive)
DEST_DIR="/content/drive/MyDrive/EgoVisionProject/Experiments"

if [ -d "$SOURCE_DIR" ]; then
  echo "Copying pre-training results from $SOURCE_DIR to $DEST_DIR..."
  mkdir -p "$DEST_DIR"
  cp -r "$SOURCE_DIR" "$DEST_DIR"
  echo "Copy complete!"
  echo "You can find your results in: $DEST_DIR/$PRETRAIN_EXPERIMENT_NAME"
else
  echo "ERROR: Source directory $SOURCE_DIR not found. Was the pre-training completed?"
fi

Copying pre-training results from /content/data/experiments/pretrain_vslnet_egovlp_bert_run2 to /content/drive/MyDrive/EgoVisionProject/Experiments...
Copy complete!
You can find your results in: /content/drive/MyDrive/EgoVisionProject/Experiments/pretrain_vslnet_egovlp_bert_run2


## 6. Setup for Fine-tuning
This section prepares the environment for the final fine-tuning phase. The key step here is to re-configure our `vars.sh` file. We will update it to:
1.  Define a new, unique name for the fine-tuning experiment.
2.  Point the training script to the **official `nlq_train.json`** dataset.
3.  Reference the path to the **pre-trained model checkpoint** that we just saved.
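
Because every later cell depends on this file, a quick check that each variable only references names defined earlier can catch typos before a long run. The following is a minimal sketch under simplifying assumptions: it only understands plain `export NAME=value` lines, ignores quoting, and does not consult variables set outside the file. The helper `resolve_exports` is hypothetical and not part of the VSLNet code.

```python
import re
from string import Template

# Matches lines such as: export NAME=value
EXPORT_LINE = re.compile(r"^export\s+([A-Za-z_][A-Za-z0-9_]*)=(.*)$")


def resolve_exports(script_text):
    """Resolve simple `export NAME=value` lines in order of appearance.

    Raises KeyError when a value references a variable that has not
    been defined by an earlier export line.
    """
    resolved = {}
    for line in script_text.splitlines():
        match = EXPORT_LINE.match(line.strip())
        if not match:
            continue  # comments, blank lines, and anything else are skipped
        name, raw_value = match.groups()
        # string.Template understands both $NAME and ${NAME}.
        resolved[name] = Template(raw_value).substitute(resolved)
    return resolved


demo = """
export LOCAL_DATA_ROOT=/content/data
export EXPERIMENTS_BASE_DIR=$LOCAL_DATA_ROOT/experiments
export BEST_CHECKPOINT=vslnet_177.t7
"""
paths = resolve_exports(demo)
print(paths["EXPERIMENTS_BASE_DIR"])

# A misspelled reference is reported instead of expanding to an empty string.
try:
    resolve_exports("export SAVE_DIR=$EXPERIMENT_BASE_DIR/run")
    undefined_detected = False
except KeyError:
    undefined_detected = True
print(undefined_detected)
```

In a shell, an undefined variable silently expands to an empty string, so a path like `$MISSING/run` becomes `/run`; failing loudly here is the point of the check.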

### 6.1. Configure the environment for fine-tuning
This is the main control cell for the last step of our project. It generates a `vars.sh` file **inside the current directory (`VSLNet_Code/`)**. This script defines all paths and parameters needed for fine-tuning.
Set `BEST_CHECKPOINT` to the filename of the last checkpoint produced during the pre-training step.
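
Rather than typing the checkpoint name by hand, it can be looked up in the model directory. The sketch below assumes checkpoints are named `vslnet_<number>.t7` (inferred from `vslnet_177.t7`) and that a larger number means a later checkpoint; both are assumptions, and `latest_checkpoint` is a hypothetical helper.

```python
import re
import tempfile
from pathlib import Path

# Assumed naming scheme: vslnet_<number>.t7
CHECKPOINT_PATTERN = re.compile(r"^vslnet_(\d+)\.t7$")


def latest_checkpoint(model_dir):
    """Return the checkpoint filename with the highest numeric suffix, or None."""
    best_step, best_name = -1, None
    for path in Path(model_dir).iterdir():
        match = CHECKPOINT_PATTERN.match(path.name)
        if match is None:
            continue  # ignore logs, configs, and other files
        step = int(match.group(1))  # compare as integers, not as strings
        if step > best_step:
            best_step, best_name = step, path.name
    return best_name


# Demo with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["vslnet_9.t7", "vslnet_177.t7", "vslnet_88.t7", "notes.txt"]:
        (Path(tmp) / name).touch()
    demo_latest = latest_checkpoint(tmp)

print(demo_latest)
```

Comparing the suffix as an integer matters: a plain string sort would rank `vslnet_9.t7` after `vslnet_177.t7`.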

In [20]:
# --- Main Configuration ---
# We reuse the best configuration from pre-training; change the parameters below to try a different one.
FINETUNING_MODEL_USED = "vslnet"  # Options: "vslnet", "vslbase"
FINETUNING_FEATURE_TYPE = "egovlp" # Options: "egovlp", "omnivore"
FINETUNING_TEXT_ENCODER = "bert"   # Options: "bert", "glove"
RUN_NUMBER = 2

# --- Auto-generated settings based on configuration ---
if FINETUNING_FEATURE_TYPE == "egovlp":
    feature_dir_name = "egovlp_fp16"
    visual_feature_dim = 256
elif FINETUNING_FEATURE_TYPE == "omnivore":
    feature_dir_name = "omnivore_video_swinl_fp16"
    visual_feature_dim = 1536
else:
    raise ValueError(f"Invalid FINETUNING_FEATURE_TYPE: {FINETUNING_FEATURE_TYPE!r}")

finetuning_experiment_name = f"finetuning_{FINETUNING_MODEL_USED}_{FINETUNING_FEATURE_TYPE}_{FINETUNING_TEXT_ENCODER}_run{RUN_NUMBER}"

#pretrained model configuration
pretrain_model_used = "vslnet"
pretrain_feature_type = "egovlp"
pretrain_text_encoder = "bert"
pretrain_run_number = 2
pretrain_experiment_name = f"pretrain_{pretrain_model_used}_{pretrain_feature_type}_{pretrain_text_encoder}_run{pretrain_run_number}"

# --- vars.sh content: change EXPERIMENTS_BASE_DIR if the checkpoint is already on Drive ---
vars_sh_content = f"""
#!/bin/bash

# --- I. SHARED PATH CONFIGURATION ---
export FEATURE_SOURCE_ZIP_PATH=/content/drive/MyDrive/EgoVisionProject/Data
export DRIVE_ZIP_FILENAME=ego4d_data.zip
export LOCAL_DATA_ROOT=/content/data
export EXPERIMENTS_BASE_DIR=$LOCAL_DATA_ROOT/experiments

# --- II. FINE-TUNING SHARED PATHS ---
export LOCAL_ANNOTATIONS_DIR=$LOCAL_DATA_ROOT/ego4d_data/v1/annotations
export LOCAL_TRAIN_SPLIT=$LOCAL_ANNOTATIONS_DIR/nlq_train.json
export LOCAL_VAL_SPLIT=$LOCAL_ANNOTATIONS_DIR/nlq_val.json
export LOCAL_TEST_SPLIT=$LOCAL_ANNOTATIONS_DIR/nlq_test_unannotated.json

# --- III. FINE-TUNING SPECIFIC CONFIGURATION ---
export FINETUNING_EXPERIMENT_NAME={finetuning_experiment_name}
export FINETUNING_MODEL_NAME={FINETUNING_MODEL_USED}
export FINETUNING_VISUAL_FEATURE_TYPE={FINETUNING_FEATURE_TYPE}
export FINETUNING_TEXT_ENCODER_TYPE={FINETUNING_TEXT_ENCODER}
export FINETUNING_VISUAL_FEATURE_DIM={visual_feature_dim}
export FINETUNING_FEATURE_DIR=$LOCAL_DATA_ROOT/ego4d_data/v1/{feature_dir_name}
export FINETUNING_DATASET_DIR=$LOCAL_DATA_ROOT/dataset/$FINETUNING_EXPERIMENT_NAME
export FINETUNING_FEATURE_DIR_PROC=$LOCAL_DATA_ROOT/features/$FINETUNING_EXPERIMENT_NAME/official
export BEST_CHECKPOINT=vslnet_177.t7
export PRETRAINED_CHECKPOINT_PATH=$EXPERIMENTS_BASE_DIR/{pretrain_experiment_name}/vslnet_{pretrain_experiment_name}_official_128_bert/model/$BEST_CHECKPOINT
"""

# Write the content to vars.sh in the current directory (VSLNet_Code/)
with open("vars.sh", "w") as f:
    f.write(vars_sh_content)


In [25]:
%%bash

source vars.sh

echo "Creating output directories for processed pre-training data..."
mkdir -p "$FINETUNING_DATASET_DIR"
mkdir -p "$FINETUNING_FEATURE_DIR_PROC"

echo "Running data preparation script for the OFFICIAL training data..."
python utils/prepare_ego4d_dataset.py \
    --input_train_split "$LOCAL_TRAIN_SPLIT" \
    --input_val_split "$LOCAL_VAL_SPLIT" \
    --input_test_split "$LOCAL_TEST_SPLIT" \
    --video_feature_read_path "$FINETUNING_FEATURE_DIR" \
    --clip_feature_save_path "$FINETUNING_FEATURE_DIR_PROC" \
    --output_save_path "$FINETUNING_DATASET_DIR"
echo "Official data preparation finished."

Creating output directories for processed pre-training data...
Running data preparation script for the OFFICIAL training data...
Reading [train]: /content/data/ego4d_data/v1/annotations/nlq_train.json
# train: 11291
Writing [train]: /content/data/dataset/finetuning_vslnet_egovlp_bert_run2/train.json
Reading [val]: /content/data/ego4d_data/v1/annotations/nlq_val.json
# val: 3874
Writing [val]: /content/data/dataset/finetuning_vslnet_egovlp_bert_run2/val.json
Reading [test]: /content/data/ego4d_data/v1/annotations/nlq_test_unannotated.json
# test: 4004
Writing [test]: /content/data/dataset/finetuning_vslnet_egovlp_bert_run2/test.json
Official data preparation finished.


Extracting features: 0/1659 ... [progress bar output truncated]

Create the symbolic links for the fine-tuning experiment, mirroring Section 4.2.

In [30]:
%%bash

source vars.sh

CWD=$(pwd)
# Base directory for the symbolic links
mkdir -p "$CWD/data/dataset"
# Also create the per-experiment subdirectory ($FINETUNING_EXPERIMENT_NAME) under features
mkdir -p "$CWD/data/features/$FINETUNING_EXPERIMENT_NAME"

# 1. Annotations link

# Remove the previous link if it exists and create the new one
rm -f "$CWD/data/dataset/$FINETUNING_EXPERIMENT_NAME"
ln -sfn "$FINETUNING_DATASET_DIR" "$CWD/data/dataset/$FINETUNING_EXPERIMENT_NAME"
echo "Annotations link: $CWD/data/dataset/$FINETUNING_EXPERIMENT_NAME -> $FINETUNING_DATASET_DIR"

# 2. Processed features link

# Remove the previous link if it exists and create the new one
rm -f "$CWD/data/features/$FINETUNING_EXPERIMENT_NAME/official"
ln -sfn "$FINETUNING_FEATURE_DIR_PROC" "$CWD/data/features/$FINETUNING_EXPERIMENT_NAME/official"
echo "Features link: $CWD/data/features/$FINETUNING_EXPERIMENT_NAME/official -> $FINETUNING_FEATURE_DIR_PROC"

echo "--- Setup completed. Checks below: ---"
echo "Annotations target FINETUNING_DATASET_DIR exists?"
ls -ld "$FINETUNING_DATASET_DIR"
echo "Annotations link $CWD/data/dataset/$FINETUNING_EXPERIMENT_NAME points to:"
ls -ld "$CWD/data/dataset/$FINETUNING_EXPERIMENT_NAME"

echo "Features target FINETUNING_FEATURE_DIR_PROC exists?"
ls -ld "$FINETUNING_FEATURE_DIR_PROC"
echo "Features link $CWD/data/features/$FINETUNING_EXPERIMENT_NAME/official points to:"
ls -ld "$CWD/data/features/$FINETUNING_EXPERIMENT_NAME/official"

Annotations link: /content/VSLNet_Code/data/dataset/finetuning_vslnet_egovlp_bert_run2 -> /content/data/dataset/finetuning_vslnet_egovlp_bert_run2
Features link: /content/VSLNet_Code/data/features/finetuning_vslnet_egovlp_bert_run2/official -> /content/data/features/finetuning_vslnet_egovlp_bert_run2/official
--- Setup completed. Checks below: ---
Annotations target FINETUNING_DATASET_DIR exists?
drwxr-xr-x 2 root root 4096 Jun 19 16:15 /content/data/dataset/finetuning_vslnet_egovlp_bert_run2
Annotations link /content/VSLNet_Code/data/dataset/finetuning_vslnet_egovlp_bert_run2 points to:
lrwxrwxrwx 1 root root 56 Jun 19 16:27 /content/VSLNet_Code/data/dataset/finetuning_vslnet_egovlp_bert_run2 -> /content/data/dataset/finetuning_vslnet_egovlp_bert_run2
Features target FINETUNING_FEATURE_DIR_PROC exists?
drwxr-xr-x 2 root root 131072 Jun 19 16:27 /content/data/features/finetuning_vslnet_egovlp_bert_run2/official
Features link /content/VSLNet_Code/data/features/finetuning_vslnet_egovlp_b

## 7. Launch Fine-tuning
With the environment re-configured and the official dataset prepared, we launch the `main.py` script. The critical difference in this run is the addition of the `--resume_from_checkpoint` argument, which loads the weights from our pre-trained model. We also use a lower learning rate (`1e-5` instead of `0.0025`) and fewer epochs (5 instead of 10) than in pre-training.

In [32]:
%%bash
# Source the final, correct configuration file for fine-tuning
source vars.sh

# --- Hyper-parameter Configuration for Fine-tuning ---
# It's common practice to use a smaller learning rate for fine-tuning
export INIT_LR=0.00001
# Fine-tuning often requires fewer epochs to converge
export NUM_EPOCH=5
# Other parameters can remain the same
export DATA_LOADER_WORKERS=1
export NUM_WORKERS=2
export BATCH_SIZE=32
export DIM=128
export MAX_POS_LEN=128


# The output directory for this run, using the unique fine-tuning name
FINETUNING_MODEL_SAVE_DIR="$EXPERIMENTS_BASE_DIR/$FINETUNING_EXPERIMENT_NAME"
mkdir -p "$FINETUNING_MODEL_SAVE_DIR"

echo "--- Starting FINE-TUNING ---"
echo "Experiment Name: $FINETUNING_EXPERIMENT_NAME"
echo "Model: $FINETUNING_MODEL_NAME"
echo "Features: $FINETUNING_VISUAL_FEATURE_TYPE"
echo "Loading pre-trained checkpoint from: $PRETRAINED_CHECKPOINT_PATH"
echo "-------------------------------------"

# --- Safety Check for Checkpoint ---
# Before launching a long training run, let's verify the checkpoint file exists
if [ ! -f "$PRETRAINED_CHECKPOINT_PATH" ]; then
    echo "❌ ERROR: Pre-trained checkpoint not found at the specified path."
    echo "Path: $PRETRAINED_CHECKPOINT_PATH"
    echo "Please check your 'vars.sh' configuration and ensure the file exists at that location."
    exit 1
fi

# --- Launch main.py ---
# This command uses the fine-tuning configuration and, most importantly,
# the --resume_from_checkpoint flag to load the pre-trained model.
python main.py \
    --task "$FINETUNING_EXPERIMENT_NAME" \
    --mode train \
    --predictor "$FINETUNING_TEXT_ENCODER_TYPE" \
    --dim $DIM \
    --video_feature_dim "$FINETUNING_VISUAL_FEATURE_DIM" \
    --max_pos_len $MAX_POS_LEN \
    --init_lr $INIT_LR \
    --epochs $NUM_EPOCH \
    --batch_size $BATCH_SIZE \
    --fv official \
    --num_workers $NUM_WORKERS \
    --data_loader_workers $DATA_LOADER_WORKERS \
    --model_dir "$FINETUNING_MODEL_SAVE_DIR" \
    --eval_gt_json "$LOCAL_VAL_SPLIT" \
    --remove_empty_queries_from train \
    --resume_from_checkpoint "$PRETRAINED_CHECKPOINT_PATH"



--- Starting FINE-TUNING ---
Experiment Name: finetuning_vslnet_egovlp_bert_run2
Model: vslnet
Features: egovlp
Loading pre-trained checkpoint from: /content/data/experiments/pretrain_vslnet_egovlp_bert_run2/vslnet_pretrain_vslnet_egovlp_bert_run2_official_128_bert/model/vslnet_177.t7
-------------------------------------
Running with Namespace(save_dir='datasets', model_type='vslnet', resume_from_checkpoint='/content/data/experiments/pretrain_vslnet_egovlp_bert_run2/vslnet_pretrain_vslnet_egovlp_bert_run2_official_128_bert/model/vslnet_177.t7', task='finetuning_vslnet_egovlp_bert_run2', eval_gt_json='/content/data/ego4d_data/v1/annotations/nlq_val.json', fv='official', max_pos_len=128, num_workers=2, data_loader_workers=1, word_size=None, char_size=None, word_dim=300, video_feature_dim=256, char_dim=50, dim=128, highlight_lambda=5.0, num_heads=8, drop_rate=0.2, predictor='bert', gpu_idx='0', seed=12345, mode='train', epochs=5, batch_size=32, num_train_steps=None, init_lr=1e-05, clip_n

2025-06-19 16:39:00.173410: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-06-19 16:39:00.194529: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1750351140.216754   34626 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1750351140.223638   34626 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-06-19 16:39:00.247745: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instr