# 1. Environment Setup and Data Unpacking

This section sets up the Colab environment in three steps:

1.  **Configuration**: Key paths and filenames are defined as variables at the top for easy modification and clarity.
2.  **Mount Google Drive**: The Colab environment is connected to the user's Google Drive to access the project's data archive.
3.  **Unpack Dataset**: The data (`.zip` archive) is located on Drive and then unpacked into the local Colab filesystem for improved performance during analysis and training.

In [None]:
import os
from google.colab import drive

# --- 1. Configuration ---
# Define all necessary paths and filenames here.
GDRIVE_PROJECT_PATH = "/content/drive/MyDrive/EgoVisionProject/Data" # Main project folder on Google Drive
ZIP_FILE_NAME    = "ego4d_data.zip"
LOCAL_DATA_PATH  = "/content/data" # Local temporary workspace in Colab

print("Configuration set.")

# --- 2. Mount Google Drive ---
print("Connecting to Google Drive...")
drive.mount('/content/drive', force_remount=True)
print("Drive connected.")

# --- 3. Unpack the Dataset ---
gdrive_zip_file = os.path.join(GDRIVE_PROJECT_PATH, ZIP_FILE_NAME)
print(f"\nLooking for dataset archive at: {gdrive_zip_file}")

if os.path.exists(gdrive_zip_file):
    print("Dataset archive found. Unpacking to local storage...")
    os.makedirs(LOCAL_DATA_PATH, exist_ok=True)

    # Unzip directly from Drive to the local path.
    # -o flag overwrites files without asking, -q flag for quiet mode.
    !unzip -o -q "{gdrive_zip_file}" -d "{LOCAL_DATA_PATH}"
    print("Unpacking complete.")

    # Verification Step
    print("\n--- Verifying contents of the annotations directory: ---")
    !ls -lH "{LOCAL_DATA_PATH}/ego4d_data/v1/annotations"
    print("---------------------------------------------------------")
else:
    print(f"ERROR: Dataset archive not found at '{gdrive_zip_file}'. Please check the path.")
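
The `!unzip` line above relies on IPython shell magic, which is only available inside a notebook. As a rough alternative, the same unpacking step can be sketched with Python's standard `zipfile` module. The helper name `unpack_archive` and the demo file names below are hypothetical, chosen only for illustration; the demo runs against a throwaway archive rather than the real dataset.

```python
import os
import tempfile
import zipfile


def unpack_archive(zip_path, dest_dir):
    """Extract zip_path into dest_dir and return the archive's member names."""
    if not os.path.exists(zip_path):
        raise FileNotFoundError(f"Archive not found: {zip_path}")
    os.makedirs(dest_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # extractall overwrites files that already exist, similar to `unzip -o`.
        archive.extractall(dest_dir)
        return archive.namelist()


# Demonstration on a temporary archive (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("ego4d_data/v1/annotations/example.json", "{}")

    extracted = unpack_archive(demo_zip, os.path.join(tmp, "data"))
    demo_file_exists = os.path.exists(
        os.path.join(tmp, "data", "ego4d_data", "v1", "annotations", "example.json")
    )
    print(extracted)
```

In the notebook itself, this would be called as `unpack_archive(gdrive_zip_file, LOCAL_DATA_PATH)`. For large archives the shell `unzip` may well be faster, so this is an option rather than a replacement.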

# 2. Exploratory Data Analysis (EDA)

This section is dedicated to the exploratory analysis of the Ego4D-NLQ dataset. We will load the training data prepared by the `00_Setup_Environment.ipynb` notebook, examine its structure, and visualize key statistics. The main goals are:

1.  **Understand the Data Structure**: Inspect the format of the annotations, including video clips, queries, and response timestamps.
2.  **Analyze Query Characteristics**: Investigate the distribution of query templates and the number of words per query.
3.  **Analyze Temporal Characteristics**: Examine the duration of video clips and the corresponding answer segments.
4.  **Analyze Video Scenarios**: See how query templates are distributed within the most common video scenarios.



## 2.1 DataFrame Preparation and Initial Inspection

In this first step, we import the Python libraries needed for the analysis, including `pandas` for data manipulation and `seaborn` for visualization. We then load the `nlq_train.json` annotation file and flatten its nested structure (videos, clips, annotations, language queries) into a pandas DataFrame with one row per query. Finally, we display the first few rows of the DataFrame to verify its structure.

In [None]:
import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import os

# Set plot style
sns.set(style='darkgrid')

# --- Path Definition ---
# The setup notebook unpacks the data into '/content/data'.
ANNOTATIONS_DIR = "/content/data/ego4d_data/v1/annotations"
JSON_FILE_NAME = "nlq_train.json"
json_file_path = os.path.join(ANNOTATIONS_DIR, JSON_FILE_NAME)

print(f"Attempting to load data from: {json_file_path}")

# Check if the file exists before proceeding
if not os.path.exists(json_file_path):
    print(f"ERROR: File not found at '{json_file_path}'.")
    print("Please ensure that the '00_Setup_Environment.ipynb' notebook has been run successfully and the path is correct.")
else:
    # Load the JSON file
    with open(json_file_path, 'r') as f:
        data = json.load(f)

    records = []
    for video_info in data['videos']:
        video_uid = video_info['video_uid']
        for clip_info in video_info['clips']:
            clip_uid = clip_info['clip_uid']
            for ann_info in clip_info['annotations']:
                language_queries = ann_info.get('language_queries')
                if language_queries:
                    for query_info in language_queries:
                        records.append({
                            'video_uid': video_uid,
                            'clip_uid': clip_uid,
                            'annotation_uid': ann_info['annotation_uid'],
                            'query': query_info['query'],
                            'template': query_info['template'],
                            'start_time': query_info['clip_start_sec'],
                            'end_time': query_info['clip_end_sec'],
                            'response_start_time': query_info['response_track_start_sec'],
                            'response_end_time': query_info['response_track_end_sec'],
                        })

    df = pd.DataFrame(records)
    print("DataFrame created successfully.")
    display(df.head())
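
To make the flattening logic easier to follow, here is a minimal sketch on a tiny synthetic annotation dictionary. All values are made up, and the helper `iter_query_records` is a hypothetical name. Unlike the cell above, it uses `.get()` for the per-query fields, on the assumption that some queries might lack a key; whether that happens in the real file is not confirmed here.

```python
import pandas as pd

# Synthetic stand-in for the annotation file; every value is invented.
toy_data = {
    "videos": [
        {
            "video_uid": "vid-1",
            "clips": [
                {
                    "clip_uid": "clip-1",
                    "annotations": [
                        {
                            "annotation_uid": "ann-1",
                            "language_queries": [
                                {"query": "where is the cup?", "template": "template-A"},
                                {"query": "what did I pick up?"},  # template missing
                            ],
                        },
                        {"annotation_uid": "ann-2"},  # no language_queries at all
                    ],
                }
            ],
        }
    ]
}


def iter_query_records(data):
    """Yield one flat dict per language query in the nested structure."""
    for video in data["videos"]:
        for clip in video["clips"]:
            for ann in clip["annotations"]:
                # `or []` covers both a missing key and an explicit None.
                for query in ann.get("language_queries") or []:
                    yield {
                        "video_uid": video["video_uid"],
                        "clip_uid": clip["clip_uid"],
                        "annotation_uid": ann["annotation_uid"],
                        "query": query.get("query"),
                        "template": query.get("template"),
                    }


toy_df = pd.DataFrame(list(iter_query_records(toy_data)))
print(toy_df)
```

Missing fields show up as null values in the resulting frame, so they can be counted with `isna()` instead of raising a `KeyError` during loading.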

## 2.2 Analysis of Query Templates

Here, we visualize the counts of each query template using a bar chart to clearly see their frequencies.


In [None]:
# Calculate the frequency of each query template
template_counts = df['template'].value_counts()

# Create the plot
plt.figure(figsize=(12, 8))
ax = sns.barplot(x=template_counts.index, y=template_counts.values)

# Add titles and labels for clarity
plt.title('Distribution of Query Templates', fontsize=16)
plt.xlabel('Query Template', fontsize=12)
plt.ylabel('Number of Queries', fontsize=12)
plt.xticks(rotation=45, ha='right') # Rotate labels to prevent overlap

# Add the count annotations on top of each bar
for p in ax.patches:
    ax.annotate(f'{int(p.get_height())}',
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center',
                xytext=(0, 9),
                textcoords='offset points',
                fontsize=10)

# Ensure everything fits nicely
plt.tight_layout()
plt.show()
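
Raw counts can be complemented with relative frequencies. The sketch below uses a small made-up Series (the template labels `A`, `B`, `C` are placeholders, not real Ego4D templates) to show how `value_counts(normalize=True)` gives each template's share.

```python
import pandas as pd

# Placeholder template labels; not taken from the dataset.
toy_templates = pd.Series(["A", "A", "B", "A", "C", "B"], name="template")

counts = toy_templates.value_counts()
# normalize=True returns fractions that sum to 1; scale to percentages.
shares = (toy_templates.value_counts(normalize=True) * 100).round(1)

summary = pd.DataFrame({"count": counts, "percent": shares})
print(summary)
```

Applied to the notebook's data, the same two calls on `df['template']` would give a table that can be read alongside the bar chart.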

## 2.3 Analysis of Clip and Response Durations

We visualize the distributions of the full clip durations and the ground-truth response durations using box plots, annotated with key statistics.

In [None]:
# Calculate the duration of the full clips and the response segments in seconds
df['clip_duration'] = df['end_time'] - df['start_time']
df['response_duration'] = df['response_end_time'] - df['response_start_time']

# Prepare data for plotting
duration_data = df[['clip_duration', 'response_duration']]

# Create the figure with two subplots side-by-side
fig, axes = plt.subplots(1, 2, figsize=(16, 7))
fig.suptitle('Distribution of Clip and Response Durations (in seconds)', fontsize=18)

# --- Plot 1: Clip Duration ---
sns.boxplot(y=duration_data['clip_duration'], ax=axes[0])
axes[0].set_title('Full Clip Duration', fontsize=14)
axes[0].set_ylabel('Duration (s)', fontsize=12)

# Calculate statistics for Clip Duration
mean_clip = duration_data['clip_duration'].mean()
median_clip = duration_data['clip_duration'].median()
std_clip = duration_data['clip_duration'].std()

# Add text annotations for Clip Duration statistics
axes[0].text(0.05, 0.95, f'Mean: {mean_clip:.2f}s\nStd Dev: {std_clip:.2f}s\nMedian: {median_clip:.2f}s',
             transform=axes[0].transAxes, fontsize=12,
             verticalalignment='top', bbox=dict(boxstyle='round,pad=0.5', fc='wheat', alpha=0.5))


# --- Plot 2: Response Duration ---
sns.boxplot(y=duration_data['response_duration'], ax=axes[1])
axes[1].set_title('Response Span Duration', fontsize=14)
axes[1].set_ylabel('Duration (s)', fontsize=12)

# Calculate statistics for Response Duration
mean_resp = duration_data['response_duration'].mean()
median_resp = duration_data['response_duration'].median()
std_resp = duration_data['response_duration'].std()

# Add text annotations for Response Duration statistics
axes[1].text(0.05, 0.95, f'Mean: {mean_resp:.2f}s\nStd Dev: {std_resp:.2f}s\nMedian: {median_resp:.2f}s',
             transform=axes[1].transAxes, fontsize=12,
             verticalalignment='top', bbox=dict(boxstyle='round,pad=0.5', fc='wheat', alpha=0.5))


# Show the plots
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
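
Before reading too much into the box plots, it can help to check the derived durations for empty or inverted spans. The following is a sketch on made-up numbers, assuming the same column names as the DataFrame above; the `response_fraction` column is an illustrative addition, not part of the original analysis.

```python
import pandas as pd

# Invented timestamps in seconds; the second row has an empty response span.
toy_spans = pd.DataFrame({
    "start_time": [0.0, 0.0, 0.0],
    "end_time": [480.0, 480.0, 300.0],
    "response_start_time": [10.0, 200.0, 50.0],
    "response_end_time": [12.5, 200.0, 80.0],
})

toy_spans["clip_duration"] = toy_spans["end_time"] - toy_spans["start_time"]
toy_spans["response_duration"] = (
    toy_spans["response_end_time"] - toy_spans["response_start_time"]
)

# Flag rows whose response span has zero or negative length.
invalid = toy_spans["response_duration"] <= 0

# Share of each clip that the response span covers.
toy_spans["response_fraction"] = (
    toy_spans["response_duration"] / toy_spans["clip_duration"]
)

print(f"Rows with non-positive response duration: {int(invalid.sum())}")
print(toy_spans.loc[~invalid, ["clip_duration", "response_duration", "response_fraction"]])
```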

## 2.4 Analysis of Scenarios and Templates

We now load the `ego4d.json` metadata file to get the official "scenario" for each video. We then analyze how query templates are distributed as a percentage within the top 5 most common scenarios.

In [None]:
# --- 1. Load Scenario Metadata ---
metadata_base_dir = "/content/data/ego4d_data"
metadata_file_name = "ego4d.json"
metadata_path = os.path.join(metadata_base_dir, metadata_file_name)
print(f"Loading scenario metadata from: {metadata_path}")
with open(metadata_path, 'r') as f:
    metadata = json.load(f)

# --- 2. Create and apply mapping ---
video_uid_to_scenario = {v['video_uid']: v['scenario'] for v in metadata['videos']}
df['scenario'] = df['video_uid'].map(video_uid_to_scenario)

# --- 3. Perform the analysis and plotting ---
top_scenarios = df['scenario'].value_counts().nlargest(5).index
df_top_scenarios = df[df['scenario'].isin(top_scenarios)]
scenario_template_dist = pd.crosstab(df_top_scenarios['scenario'],
                                     df_top_scenarios['template'],
                                     normalize='index') * 100


ax = scenario_template_dist.plot(kind='bar', stacked=True, figsize=(14, 8),
                                 title='Percentage of Query Templates in Top 5 Scenarios')

plt.xlabel('Scenario', fontsize=12)
plt.ylabel('Percentage of Queries (%)', fontsize=12)
plt.xticks(rotation=0)
plt.legend(title='Template', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout(rect=[0, 0, 0.85, 1])
plt.show()
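
The key step in the cell above is `pd.crosstab(..., normalize='index')`, which divides each row by its own total so that every scenario sums to 100%. A minimal sketch on made-up data (the scenario and template labels are placeholders) illustrates this:

```python
import pandas as pd

# Placeholder labels; not real dataset values.
toy = pd.DataFrame({
    "scenario": ["Cooking", "Cooking", "Cooking", "Cleaning", "Cleaning"],
    "template": ["A", "A", "B", "B", "B"],
})

# normalize="index" turns each row into fractions of that row's total.
pct = pd.crosstab(toy["scenario"], toy["template"], normalize="index") * 100
row_totals = pct.sum(axis=1)

print(pct.round(1))
print(row_totals)
```

Because each row is normalized independently, the stacked bars have equal height regardless of how many queries a scenario contains; the absolute counts have to be read from `value_counts()` instead.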

## 2.5 Analysis of Query Length

Finally, we analyze the complexity of the natural language queries themselves. A simple and effective metric for this is the number of words in each query. This analysis helps us understand the verbosity of the questions our model must interpret. We will create a new column in our DataFrame to store the word count for each query and then visualize the distribution of these lengths using a histogram.

In [None]:
df['query_length'] = df['query'].apply(lambda x: len(x.split()))

plt.figure(figsize=(10, 6))

# Plot a histogram of query lengths with matplotlib.
plt.hist(df['query_length'], bins=20, color='darkviolet', alpha=0.7, rwidth=0.85)
plt.grid(axis='y', alpha=0.75) # Show grid lines on the y-axis only

plt.title('Distribution of Number of Words per Query', fontsize=16)
plt.xlabel('Number of Words', fontsize=12)
plt.ylabel('Number of Queries', fontsize=12)
plt.show()

display(df['query_length'].describe())
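
As a small illustration of the word-count metric, the sketch below applies the vectorized equivalent `str.split().str.len()` to a few made-up queries. Splitting on whitespace treats punctuation as part of the adjacent word and collapses repeated spaces.

```python
import pandas as pd

# Invented example queries, including one with extra spaces and one empty string.
toy_queries = pd.Series([
    "where is the cup?",
    "what did I put   in the drawer?",
    "",
])

# str.split() with no arguments splits on any run of whitespace;
# str.len() then counts the resulting tokens per row.
word_counts = toy_queries.str.split().str.len()
print(word_counts.tolist())
```

The vectorized form gives the same counts as `apply(lambda x: len(x.split()))` for string inputs, and it propagates missing values as NaN instead of raising an error.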