# 02 - Augmentation, Pre-training, and Fine-tuning Pipeline

## Notebook Overview

This notebook implements and evaluates a complete pipeline for enhancing a Natural Language Queries (NLQ) model through data augmentation. The entire workflow, central to our project's extension, is divided into three major phases:

1.  **Phase I: LLM-Powered Data Augmentation:** We begin by leveraging a Large Language Model (LLM) to generate a new, synthetic training dataset. Starting from the timestamped narrations in Ego4D, we create NLQ-style questions and automatically associate them with precise temporal ground-truth windows. This phase includes a robust data filtering and validation process to ensure the quality of the synthetic data.

2.  **Phase II: Pre-training on Augmented Data:** The newly generated dataset is used to pre-train a baseline NLQ model (e.g., VSLNet). The goal of this phase is to teach the model the fundamental patterns of egocentric question-answering on a large and diverse set of synthetic examples, providing it with a powerful head start before it sees any human-annotated data.

3.  **Phase III: Fine-tuning on Official Data:** Finally, the model pre-trained on our synthetic data is fine-tuned on the official `nlq_train.json` dataset. This step adapts the generalized knowledge acquired during pre-training to the specific distribution and nuances of the official benchmark data. The ultimate goal is to demonstrate that this pre-training/fine-tuning strategy improves performance compared to training on the official data alone.

## 1. Environment and Data Setup
This initial section handles all the necessary setup to prepare our Colab environment. We will mount Google Drive, clone the model repository, install dependencies, and unpack the dataset into the local runtime for fast access.

### 1.1. Mount Google Drive
We begin by mounting Google Drive to access our datasets.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### 1.2. Clone Model Repository and Set Directory
Next, we clone the `VSLNet_Code` repository and set it as the main working directory for this notebook. This allows us to call scripts directly.

In [2]:
%%bash
# Clone the repository (if it doesn't already exist)
if [ ! -d "VSLNet_Code" ]; then
  git clone https://github.com/pietrogiancristofaro2001/ego4d-nlq-project.git
  # We only need the VSLNet_Code folder
  mv ego4d-nlq-project/VSLNet_Code .
  rm -rf ego4d-nlq-project
  echo "Repository cloned successfully."
else
  echo "Repository already exists."
fi

Repository cloned successfully.


Cloning into 'ego4d-nlq-project'...


In [4]:
# Change the notebook's working directory
%cd VSLNet_Code


/content/VSLNet_Code


### 1.3. Configure Environment for Augmentation and Pre-training
This is the main control cell for the first two phases of our project. It generates a `vars.sh` file **inside the current directory (`VSLNet_Code/`)**. This script defines all paths and parameters needed for data augmentation and for the subsequent pre-training run.

In [6]:
# --- Main Configuration ---
# We use our best model configuration for data augmentation; change the parameters below to try a different one.
PRETRAIN_MODEL_USED = "vsl_net"  # Options: "vsl_net", "vsl_base"
PRETRAIN_FEATURE_TYPE = "egovlp" # Options: "egovlp", "omnivore"
PRETRAIN_TEXT_ENCODER = "bert"   # Options: "bert", "glove"

# --- Auto-generated settings based on configuration ---
if PRETRAIN_FEATURE_TYPE == "egovlp":
    feature_dir_name = "egovlp_fp16"
    visual_feature_dim = 256
elif PRETRAIN_FEATURE_TYPE == "omnivore":
    feature_dir_name = "omnivore_video_swinl_fp16"
    visual_feature_dim = 1536
else:
    raise ValueError("Invalid FEATURE_TYPE selected.")

pretrain_experiment_name = f"pretrain_{PRETRAIN_MODEL_USED}_{PRETRAIN_FEATURE_TYPE}_{PRETRAIN_TEXT_ENCODER}"

# --- vars.sh content ---
vars_sh_content = f"""#!/bin/bash

# --- I. SHARED PATH CONFIGURATION ---
export FEATURE_SOURCE_ZIP_PATH=/content/drive/MyDrive/EgoVisionProject/Data
export DRIVE_ZIP_FILENAME=ego4d_data.zip
export LOCAL_DATA_ROOT=/content/data
export EXPERIMENTS_BASE_DIR=$LOCAL_DATA_ROOT/experiments

# --- II. DATA AUGMENTATION & PRE-TRAINING SHARED PATHS ---
export LOCAL_ANNOTATIONS_DIR=$LOCAL_DATA_ROOT/ego4d_data/v1/annotations
export AUGMENTED_JSON_PATH=$LOCAL_ANNOTATIONS_DIR/nlq_train_augmented.json
export NARRATION_JSON_PATH=$LOCAL_ANNOTATIONS_DIR/narration.json
export LOCAL_VAL_SPLIT=$LOCAL_ANNOTATIONS_DIR/nlq_val.json

# --- III. PRE-TRAINING SPECIFIC CONFIGURATION ---
export PRETRAIN_EXPERIMENT_NAME={pretrain_experiment_name}
export PRETRAIN_MODEL_NAME={PRETRAIN_MODEL_USED}
export PRETRAIN_VISUAL_FEATURE_TYPE={PRETRAIN_FEATURE_TYPE}
export PRETRAIN_TEXT_ENCODER_TYPE={PRETRAIN_TEXT_ENCODER}
export PRETRAIN_VISUAL_FEATURE_DIM={visual_feature_dim}
export PRETRAIN_FEATURE_DIR=$LOCAL_DATA_ROOT/ego4d_data/v1/{feature_dir_name}
export PRETRAIN_TRAIN_SPLIT=$AUGMENTED_JSON_PATH
export PRETRAINED_CHECKPOINT_PATH=$EXPERIMENTS_BASE_DIR/{pretrain_experiment_name}/best.pth
"""

# Write the content to vars.sh in the current directory (VSLNet_Code/)
with open("vars.sh", "w") as f:
    f.write(vars_sh_content)
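
Variables exported by `source vars.sh` inside a `%%bash` cell live only in that cell's subprocess, so later Python cells cannot see them through `os.environ`. The sketch below is one possible way to load them into the notebook process. `load_vars_sh` is a hypothetical helper, not part of the repository, and it assumes the file contains only simple `export KEY=VALUE` lines like the ones written above.

```python
import os
import re
from string import Template

def load_vars_sh(path="vars.sh", apply_to_environ=True):
    """Parse simple `export KEY=VALUE` lines and expand $VAR references.

    Hypothetical helper: it does not run bash, so command substitution,
    conditionals, and other shell features are not supported.
    """
    values = {}
    pattern = re.compile(r"^\s*export\s+([A-Za-z_][A-Za-z0-9_]*)=(.*)$")
    with open(path) as f:
        for line in f:
            match = pattern.match(line)
            if not match:
                continue  # skip the shebang, comments, and blank lines
            key = match.group(1)
            raw = match.group(2).strip().strip('"')
            # Earlier exports take precedence over the real environment
            values[key] = Template(raw).safe_substitute({**os.environ, **values})
    if apply_to_environ:
        os.environ.update(values)
    return values
```

Calling `load_vars_sh("vars.sh")` right after the file is written would make values such as `PRETRAIN_FEATURE_DIR` available to `os.environ.get(...)` in later Python cells.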


### 1.4. Install Dependencies
We install all required Python libraries from the repository's `requirements.txt`.

In [7]:
%%bash

pip install -r requirements.txt

Collecting numpy==1.22.4 (from -r requirements.txt (line 2))
  Downloading numpy-1.22.4.zip (11.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11.5/11.5 MB 72.3 MB/s eta 0:00:00
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'


bash: line 1: fg: no job control
ERROR: Ignored the following versions that require a different python version: 1.21.2 Requires-Python >=3.7,<3.11; 1.21.3 Requires-Python >=3.7,<3.11; 1.21.4 Requires-Python >=3.7,<3.11; 1.21.5 Requires-Python >=3.7,<3.11; 1.21.6 Requires-Python >=3.7,<3.11
ERROR: Could not find a version that satisfies the requirement torch==1.11.0 (from versions: 1.13.0, 1.13.1, 2.0.0, 2.0.1, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.4.0, 2.4.1, 2.5.0, 2.5.1, 2.6.0, 2.7.0, 2.7.1)
ERROR: No matching distribution found for torch==1.11.0


CalledProcessError: Command 'b'%%capture\n\npip install -r requirements.txt\n'' returned non-zero exit status 1.

### 1.5. Extract Dataset from Google Drive
We use the variables defined in our `vars.sh` file to copy and extract the main dataset from Drive to the local Colab storage.

In [None]:
%%bash

source vars.sh

# Create local directory and extract data
mkdir -p "$LOCAL_DATA_ROOT"
DRIVE_ZIP_FILE_PATH="$FEATURE_SOURCE_ZIP_PATH/$DRIVE_ZIP_FILENAME"
LOCAL_TEMP_ZIP_FILE="/content/$DRIVE_ZIP_FILENAME"

if [ -f "$DRIVE_ZIP_FILE_PATH" ]; then
    echo "Copying $DRIVE_ZIP_FILENAME..."
    cp "$DRIVE_ZIP_FILE_PATH" "$LOCAL_TEMP_ZIP_FILE"
    echo "Extracting data..."
    unzip -o -q "$LOCAL_TEMP_ZIP_FILE" -d "$LOCAL_DATA_ROOT"
    rm "$LOCAL_TEMP_ZIP_FILE"
    echo "Data setup complete."
else
    echo "ERROR: File not found at $DRIVE_ZIP_FILE_PATH"
fi
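
The extraction cell only prints a message when the archive is missing, so a quick check before the later cells run can save time. The sketch below is a minimal example; `missing_paths` is a hypothetical helper, and the listed paths are the ones the following cells read, assuming the archive unpacks into `ego4d_data/` under `/content/data`.

```python
import os

def missing_paths(paths):
    """Return the subset of `paths` that does not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]

# Paths used by vars.sh and by the pre-computation cell in section 1.6.
annotations_dir = "/content/data/ego4d_data/v1/annotations"
expected = [
    "/content/data/ego4d_data/ego4d.json",
    os.path.join(annotations_dir, "narration.json"),
    os.path.join(annotations_dir, "nlq_val.json"),
    "/content/data/ego4d_data/v1/egovlp_fp16",
]

missing = missing_paths(expected)
print("All expected paths found." if not missing else f"Missing: {missing}")
```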

### 1.6. Load Metadata and Create Valid Narration Groups
This cell performs the core pre-computation for the data augmentation phase. It loads all necessary annotation files, filters out videos that are present in the validation/test sets or that lack pre-extracted features, and constructs a list of all valid groups of `k=5` consecutive narrations that fall within a single video clip.

In [None]:
import json
import os
import random
import uuid
from tqdm.auto import tqdm
import glob
import numpy as np

print("--- Starting Pre-computation and Filtering ---")

repo_root = "/content"
local_data_root = os.path.join(repo_root, "data")
ego4d_json_path = os.path.join(local_data_root, 'ego4d_data', 'ego4d.json')
narration_path = os.path.join(local_data_root, 'ego4d_data', 'v1', 'annotations', 'narration.json')
val_json_path = os.path.join(local_data_root, 'ego4d_data', 'v1', 'annotations', 'nlq_val.json')
test_json_path = os.path.join(local_data_root, 'ego4d_data', 'v1', 'annotations', 'nlq_test_unannotated.json')

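# Read the feature directory from the PRETRAIN_FEATURE_DIR environment variable when it is set.
# Variables exported by `source vars.sh` in a %%bash cell do not reach this Python process,
# so the EgoVLP default below is used unless the variable was set in the notebook itself.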
feature_dir_path = os.environ.get('PRETRAIN_FEATURE_DIR', os.path.join(local_data_root, 'ego4d_data/v1/egovlp_fp16'))


# Load JSON files
print("Loading core JSON files...")
with open(ego4d_json_path, 'r') as f: ego4d_data = json.load(f)
with open(narration_path, 'r') as f: all_narrations_data = json.load(f)
print("Files loaded successfully.")


# 1. Exclude videos from val/test sets
print("\nFiltering out validation/test set videos...")
excluded_video_uids = set()
try:
    with open(val_json_path, 'r') as f: val_data = json.load(f)
    for video in val_data.get('videos', []): excluded_video_uids.add(video['video_uid'])
    with open(test_json_path, 'r') as f: test_data = json.load(f)
    for video in test_data.get('videos', []): excluded_video_uids.add(video['video_uid'])
    print(f"Found {len(excluded_video_uids)} unique videos to exclude.")
except FileNotFoundError:
    print("Warning: Could not find one of the val/test JSON files; the exclusion list may be incomplete.")


# 2. Check for existing features
print("\nFiltering out videos without pre-extracted features...")
existing_video_ids = {os.path.basename(f).split('.')[0] for f in glob.glob(os.path.join(feature_dir_path, '*.pt'))}
print(f"Found {len(existing_video_ids)} videos with features.")


# 3. Create a lookup map for clips for efficient access
print("\nCreating clip lookup maps...")
all_clips_map = {clip['clip_uid']: clip for clip in ego4d_data.get('clips', [])}
video_to_clips_map = {}
for clip in ego4d_data.get('clips', []):
    vid_uid = clip.get('video_uid')
    if vid_uid not in video_to_clips_map: video_to_clips_map[vid_uid] = []
    video_to_clips_map[vid_uid].append(clip)
print("Lookup maps created.")


# 4. Create all possible valid narration groups
print("\nConstructing valid narration groups...")
k_narrations = 5
all_valid_groups = []
for video_uid, video_content in tqdm(all_narrations_data.items(), desc="Processing Videos"):
    # Apply all filters
    if video_uid in excluded_video_uids or video_uid not in existing_video_ids: continue

    clips_for_this_video = video_to_clips_map.get(video_uid, [])
    if not clips_for_this_video: continue

    narrations_list = video_content.get("narration_pass_1", {}).get("narrations", [])
    if len(narrations_list) < k_narrations: continue

    # Sort narrations by time to be safe
    narrations_list.sort(key=lambda x: x['timestamp_sec'])

    # Iterate through all possible consecutive groups of size k
    for i in range(len(narrations_list) - k_narrations + 1):
        current_group = narrations_list[i : i + k_narrations]
        group_start_time = current_group[0]['timestamp_sec']
        group_end_time = current_group[-1]['timestamp_sec']

        # This is the key check: ensure the group is fully contained in a single parent clip
        parent_clip = next((c for c in clips_for_this_video if c['video_start_sec'] <= group_start_time and c['video_end_sec'] >= group_end_time), None)

        if parent_clip:
            all_valid_groups.append({
                "video_uid": video_uid,
                "narrations": current_group,
                "parent_clip_uid": parent_clip['clip_uid']
            })

print(f"\nPreprocessing complete. Found {len(all_valid_groups)} total valid groups.")
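
To make the grouping logic easier to follow, here is a small self-contained illustration of the same sliding-window idea on toy data. `consecutive_groups` is a hypothetical helper written for this example; it works on bare timestamps and `(start, end)` tuples rather than on the Ego4D narration and clip dictionaries.

```python
def consecutive_groups(timestamps, k, clips):
    """Return (group, clip_index) for every run of k consecutive timestamps
    that lies entirely inside one (start, end) clip."""
    ordered = sorted(timestamps)
    groups = []
    for i in range(len(ordered) - k + 1):
        group = ordered[i:i + k]
        for idx, (start, end) in enumerate(clips):
            if start <= group[0] and group[-1] <= end:
                groups.append((group, idx))
                break  # keep the first containing clip, like next(...) in the cell above
    return groups

toy_timestamps = [1.0, 2.0, 3.0, 4.0, 11.0, 12.0, 13.0]
toy_clips = [(0.0, 5.0), (10.0, 15.0)]

for group, clip_idx in consecutive_groups(toy_timestamps, k=3, clips=toy_clips):
    print(clip_idx, group)
```

With `k=3`, the two windows that straddle the gap between the clips (`[3.0, 4.0, 11.0]` and `[4.0, 11.0, 12.0]`) are dropped, which is the behaviour the containment check is meant to enforce.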

## 2. Timestamp Window Analysis & Debugging
This is a critical step. Before running the expensive LLM, we must ensure our logic for creating timestamp windows is robust. In this section, we will analyze the `window_duration` calculation and verify that it produces valid, non-collapsing time intervals. We will experiment with the formula to find a stable configuration.
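
Read off the debugging cell in this section, the window for a narration at time $t_i$ can be written as follows. The symbols $d_{\min}$, $c_{\text{start}}$ and $c_{\text{end}}$ are shorthand introduced here for `MIN_WINDOW_DURATION_SEC` and the parent clip's `video_start_sec` and `video_end_sec`; $\beta_i$ is the video's average gap between narrations and $\alpha = 4.9$.

$$
d_i = \max\left(d_{\min},\ \frac{\beta_i}{\alpha}\right), \qquad
[s_i,\ e_i] = \left[\max\left(c_{\text{start}},\ t_i - \frac{d_i}{2}\right),\ \min\left(c_{\text{end}},\ t_i + \frac{d_i}{2}\right)\right]
$$

As a worked example with made-up numbers: for $\beta_i = 2.45$ s, $\beta_i / \alpha = 0.5$ s, which is raised to $d_i = 1.0$ s by the minimum duration. A narration at $t_i = 10.0$ s inside a clip spanning $[9.8, 480.0]$ s then gets the window $[\max(9.8, 9.5),\ \min(480.0, 10.5)] = [9.8, 10.5]$ s.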

In [None]:
# --- DEBUGGING SCRIPT ---
# Our "laboratory" to inspect the calculated values without calling the LLM.

print("--- Starting Timestamp Debugging Analysis ---")

# Parameters from the EgoVLP paper
alpha = 4.9
# NEW PARAMETER: Let's define a minimum duration to prevent windows from collapsing.
MIN_WINDOW_DURATION_SEC = 1.0

# Pre-compute Beta map (average time between narrations per video)
video_to_beta_map = {}
for video_uid, video_content in all_narrations_data.items():
    if video_uid in excluded_video_uids or video_uid not in existing_video_ids: continue
    narrations_list = video_content.get("narration_pass_1", {}).get("narrations", [])
    if len(narrations_list) < 2: continue
    narrations_list.sort(key=lambda x: x['timestamp_sec'])
    diffs = [narrations_list[i+1]['timestamp_sec'] - narrations_list[i]['timestamp_sec'] for i in range(len(narrations_list)-1)]
    positive_diffs = [d for d in diffs if d > 0]
    if positive_diffs: video_to_beta_map[video_uid] = np.mean(positive_diffs)

# --- Analysis Loop ---
num_groups_to_inspect = 10 # Let's inspect a few random groups
successful_windows = 0
total_narrations_inspected = 0

random.shuffle(all_valid_groups) # Process in random order

for group_data in all_valid_groups[:num_groups_to_inspect]:
    video_uid = group_data['video_uid']
    parent_clip_uid = group_data['parent_clip_uid']
    parent_clip_info = all_clips_map.get(parent_clip_uid)
    beta_i = video_to_beta_map.get(video_uid)

    if not parent_clip_info or not beta_i: continue

    print(f"\n--- Inspecting Group from Video: {video_uid} | Parent Clip: {parent_clip_uid} ---")
    print(f"Parent Clip Boundaries: [{parent_clip_info['video_start_sec']:.2f}, {parent_clip_info['video_end_sec']:.2f}] | Avg. narration gap (beta): {beta_i:.2f}s")

    for narration_obj in group_data["narrations"]:
        total_narrations_inspected += 1
        t_i = narration_obj['timestamp_sec']

        # Original calculation from EgoVLP
        calculated_duration = beta_i / alpha

        # Our new robust calculation
        window_duration = max(MIN_WINDOW_DURATION_SEC, calculated_duration)

        # Calculate and clip the window to the parent clip's boundaries
        start_time_abs = max(parent_clip_info['video_start_sec'], t_i - (window_duration / 2))
        end_time_abs = min(parent_clip_info['video_end_sec'], t_i + (window_duration / 2))

        is_valid = "✅ VALID" if start_time_abs < end_time_abs else "❌ INVALID"
        if start_time_abs < end_time_abs: successful_windows += 1

        print(f"  Narration at {t_i:.2f}s -> "
              f"Proposed duration: {calculated_duration:.2f}s -> "
              f"Final duration: {window_duration:.2f}s -> "
              f"Final Window: [{start_time_abs:.2f}, {end_time_abs:.2f}] -> {is_valid}")

print(f"\n--- Analysis Complete ---")
print(f"Successfully created {successful_windows} valid windows out of {total_narrations_inspected} narrations inspected.")
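
Inspecting ten random groups gives a feel for the windows, but it does not show how often the minimum duration actually takes effect. One possible complement, sketched below, is to measure the fraction of videos whose `beta / alpha` falls below the floor. `floor_rate` is a hypothetical helper and the beta values are made up for illustration.

```python
import numpy as np

def floor_rate(betas, alpha=4.9, min_duration=1.0):
    """Fraction of videos whose proposed duration beta / alpha is below min_duration."""
    betas = np.asarray(list(betas), dtype=float)
    if betas.size == 0:
        return 0.0
    return float(np.mean(betas / alpha < min_duration))

# Made-up average narration gaps (seconds) for four videos
toy_betas = [2.45, 4.9, 9.8, 19.6]
print(f"Share of videos hitting the floor: {floor_rate(toy_betas):.2f}")
```

In the notebook, the same function could be called as `floor_rate(video_to_beta_map.values(), alpha, MIN_WINDOW_DURATION_SEC)` to compare different values of the minimum duration before running the LLM.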