# 02 - Augmentation, Pre-training, and Fine-tuning Pipeline

## Notebook Overview

This notebook implements and evaluates a complete pipeline for enhancing a Natural Language Queries (NLQ) model through data augmentation. The entire workflow, central to our project's extension, is divided into three major phases:

1.  **Phase I: LLM-Powered Data Augmentation:** We begin by using a Large Language Model (LLM) to generate a new, synthetic training dataset. Starting from the timestamped narrations in Ego4D, we create NLQ-style questions and automatically associate each one with a temporal ground-truth window. This phase also filters and validates the generated samples so that low-quality synthetic data does not reach the training set.

2.  **Phase II: Pre-training on Augmented Data:** The newly generated dataset is used to pre-train a baseline NLQ model (e.g., VSLNet). The goal of this phase is to teach the model the basic patterns of egocentric question-answering on a large and diverse set of synthetic examples, giving it a useful initialization before it sees any human-annotated data.

3.  **Phase III: Fine-tuning on Official Data:** Finally, the model pre-trained on our synthetic data is fine-tuned on the official `nlq_train.json` dataset. This step adapts the generalized knowledge acquired during pre-training to the specific distribution and nuances of the official benchmark data. The ultimate goal is to demonstrate that this pre-training/fine-tuning strategy improves performance compared to training on the official data alone.
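
To make the Phase I output more concrete, the following is a minimal sketch of how one timestamped narration could be turned into a synthetic sample with a temporal window and a basic validity check. Everything in it is an assumption made for illustration: the `SyntheticQuery` schema, the function names, the fixed `half_width_sec` window, and the specific filtering rules are hypothetical and are not taken from the notebook. The question text is passed in as a plain string standing in for the LLM output.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SyntheticQuery:
    """One synthetic NLQ sample (hypothetical schema, for illustration only)."""
    clip_uid: str
    query: str
    start_sec: float
    end_sec: float


def narration_to_window(timestamp_sec: float,
                        clip_duration_sec: float,
                        half_width_sec: float = 2.0) -> Tuple[float, float]:
    """Center a window on the narration timestamp and clamp it to the clip bounds."""
    start = max(0.0, timestamp_sec - half_width_sec)
    end = min(clip_duration_sec, timestamp_sec + half_width_sec)
    return start, end


def build_sample(clip_uid: str,
                 query: str,
                 timestamp_sec: float,
                 clip_duration_sec: float,
                 half_width_sec: float = 2.0) -> Optional[SyntheticQuery]:
    """Return a sample, or None when it fails the (assumed) validity checks."""
    query = query.strip()
    # Assumed filter: keep only text that is phrased as a question.
    if not query.endswith("?"):
        return None
    # Assumed filter: the narration timestamp must fall inside the clip.
    if not 0.0 <= timestamp_sec <= clip_duration_sec:
        return None
    start, end = narration_to_window(timestamp_sec, clip_duration_sec, half_width_sec)
    # Assumed filter: discard empty windows.
    if end <= start:
        return None
    return SyntheticQuery(clip_uid, query, start, end)
```

In this sketch, a window that would extend past the start or end of the clip is clamped rather than discarded, which keeps narrations near the clip boundaries usable.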

## 1. Environment and Data Setup
This initial section handles all the necessary setup to prepare our Colab environment. We will mount Google Drive, clone the model repository, install dependencies, and unpack the dataset into the local runtime for fast access.

### 1.1. Mount Google Drive
We begin by mounting Google Drive to access our datasets.

In [None]:
from google.colab import drive
drive.mount('/content/drive')
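
After mounting, it can help to confirm that the expected folders are actually visible before the later cells depend on them. The check below is a small sketch under stated assumptions: `find_missing` is a hypothetical helper, and the `ego4d_data` folder name is a placeholder to be replaced with the real dataset location. Only `/content/drive/MyDrive` follows from the mount point used above.

```python
from pathlib import Path


def find_missing(paths):
    """Return the subset of the given paths that do not exist, as strings."""
    return [str(p) for p in map(Path, paths) if not p.exists()]


# "ego4d_data" is a placeholder folder name (assumption); adjust it to your Drive layout.
expected_paths = [
    "/content/drive/MyDrive",
    "/content/drive/MyDrive/ego4d_data",
]

missing = find_missing(expected_paths)
if missing:
    print("Missing paths:", missing)
else:
    print("All expected paths are available.")
```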

### 1.2. Clone Model Repository and Set Directory
Next, we clone the `ego4d-nlq-project` repository and keep only its `VSLNet_Code` folder, which serves as the main working directory for this notebook. This allows us to call its scripts directly.

In [None]:
%%bash
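# Stop at the first failing command so a failed clone is not reported as a success
set -e
# Clone the repository only if the VSLNet_Code folder is not already present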
if [ ! -d "VSLNet_Code" ]; then
  git clone https://github.com/pietrogiancristofaro2001/ego4d-nlq-project.git
  # We only need the VSLNet_Code folder
  mv ego4d-nlq-project/VSLNet_Code .
  rm -rf ego4d-nlq-project
  echo "Repository cloned successfully."
else
  echo "Repository already exists."
fi