# Notebook 2: The First Hypothesis - A Simple Frame-by-Frame Baseline

**Overall Goal:** To build our first complete, end-to-end machine learning pipeline. We will create a simple model that gets a score on the leaderboard, establishing a "baseline" that all our future, more complex models must beat.

**The Strategy (and Deliberate Simplification):**
For this notebook, we will make a major simplifying assumption: **we will treat every single frame of video as an independent data point.** We will ignore the fact that behaviors are sequences. This is technically the "wrong" way to model this problem, but it's the perfect way to start because it's simple and fast.

*   **Model Choice:** We will use **LightGBM**, a powerful and fast gradient-boosting model that is excellent for tabular data.
*   **Features:** We will use only the raw `x` and `y` coordinates of each tracked body part, which describe the scene at a single moment in time. No engineered features yet.
*   **The Pipeline:** We will build all the necessary steps: data loading, feature engineering, training, prediction, and crucially, the **post-processing** required to convert frame-by-frame predictions into the `start_frame, stop_frame` submission format.

For background, see the EDA [notebook](https://www.kaggle.com/code/wafaaalayoubi/1-full-eda).

# Step 1: Setup and Reusable Processing Functions

**Goal:** To set up our environment and create reusable functions for our data processing pipeline. In Notebook 1, we learned that we need to load parquet files and pivot them from a "long" to a "wide" format. Since we will be doing this for many videos, it's best practice to encapsulate this logic into a clean, reusable function.

**Action:**
1.  Import all the necessary libraries, including `lightgbm` for our model.
2.  Define a function `load_and_process_video` that takes a `video_id`, a `lab_id`, and a `data_path` as input and returns the "wide" DataFrame, ready for feature engineering.
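
To make the "long" to "wide" pivot concrete, here is a minimal sketch on a toy table. The column names mirror the ones used in this notebook, but the values and the `nose` body part are made up for illustration.

```python
import pandas as pd

# Toy "long" tracking table: one row per (frame, mouse, bodypart).
# Values and the 'nose' body part are illustrative assumptions.
df_long_toy = pd.DataFrame({
    "video_frame": [0, 0, 1, 1],
    "mouse_id":    [1, 2, 1, 2],
    "bodypart":    ["nose", "nose", "nose", "nose"],
    "x":           [10.0, 50.0, 11.0, 49.0],
    "y":           [20.0, 60.0, 21.0, 59.0],
})

# One pivot per coordinate; columns become a (mouse_id, bodypart) MultiIndex.
wide_x = df_long_toy.pivot(index="video_frame", columns=["mouse_id", "bodypart"], values="x")
wide_y = df_long_toy.pivot(index="video_frame", columns=["mouse_id", "bodypart"], values="y")

# Flatten the MultiIndex into readable column names.
wide_x.columns = [f"mouse{m}_{bp}_x" for m, bp in wide_x.columns]
wide_y.columns = [f"mouse{m}_{bp}_y" for m, bp in wide_y.columns]

# One row per frame, one column per (mouse, bodypart, coordinate).
df_wide_toy = pd.concat([wide_x, wide_y], axis=1).sort_index(axis=1)
print(df_wide_toy)
```

Each frame ends up as a single row, which is the shape a frame-by-frame model needs.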

In [None]:
# --- Core Libraries ---
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from tqdm.auto import tqdm

# --- Modeling Libraries ---
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report

In [None]:
# --- Set display options ---
pd.set_option('display.max_columns', 200)
sns.set_style('whitegrid')

In [None]:
# --- Define Constants and Paths ---
DATA_PATH = '/kaggle/input/MABe-mouse-behavior-detection/' 

In [None]:
# --- Reusable Function Definition ---

def load_and_process_video(video_id, lab_id, data_path):
    """
    Loads the tracking data for a single video and pivots it into a wide format.
    """
    # Using os.path.join is the most robust way to build paths
    tracking_path = os.path.join(data_path, 'train_tracking', lab_id, f'{video_id}.parquet')
    
    if not os.path.exists(tracking_path):
        print(f"Warning: File not found at {tracking_path}")
        return None
        
    df_long = pd.read_parquet(tracking_path)
    
    pivot_x = df_long.pivot(index='video_frame', columns=['mouse_id', 'bodypart'], values='x')
    pivot_y = df_long.pivot(index='video_frame', columns=['mouse_id', 'bodypart'], values='y')
    
    pivot_x.columns = [f"mouse{m}_{bp}_x" for m, bp in pivot_x.columns]
    pivot_y.columns = [f"mouse{m}_{bp}_y" for m, bp in pivot_y.columns]
    
    df_wide = pd.concat([pivot_x, pivot_y], axis=1)
    df_wide = df_wide.sort_index(axis=1)
    
    return df_wide

In [None]:
# --- Test the function with our sample video from Notebook 1 ---
print("Testing the data processing function...")
df_train_meta = pd.read_csv(os.path.join(DATA_PATH, 'train.csv')) # Using os.path.join here for safety
sample_video_meta = df_train_meta.iloc[0]

df_wide_sample = load_and_process_video(sample_video_meta['video_id'], sample_video_meta['lab_id'], DATA_PATH)

print("\n--- Function Test Output ---")
if df_wide_sample is not None:
    print(f"Successfully loaded and processed video {sample_video_meta['video_id']}")
    print(f"Shape of the resulting wide DataFrame: {df_wide_sample.shape}")
    display(df_wide_sample.head())
else:
    print("Failed to load the sample video.")

# Step 2: Preparing Data for a Simple Model

**Goal:** To prepare our data for LightGBM with the absolute minimum of processing. According to our "simplest possible baseline" plan, we will **not** engineer any new features in this notebook. We will feed the raw `x` and `y` coordinates directly to the model.

**Action:**
1.  **Select a Subset:** Training a model on all 8,790 videos would take a very long time. For this baseline, we will select a small, manageable subset of just **50 videos** to prove our pipeline works.
2.  **Load and Combine:** We will loop through our subset of videos, load the tracking data for each, and combine them into one large training DataFrame.
3.  **Create Frame-wise Labels:** The annotations are in a `start_frame, stop_frame` format. We need to convert this into a label for *every single frame*. For now, we keep this as simple as possible: every annotated interval, whether it is a single-agent behavior (like `rear`) or a two-agent behavior (`agent_id` != `target_id`), is written into one `behavior` column in our main DataFrame. We do not yet record which mouse is the agent or the target.
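
The unrolling of intervals into per-frame labels can be sketched on toy data as follows. The annotations here are invented, and the intervals are treated as inclusive on both ends, matching the labeling loop in this notebook. Note what happens where two intervals overlap: with a single target column, the later annotation overwrites the earlier one.

```python
import pandas as pd

# Toy per-frame table for one video, indexed by frame number (0..9).
frames_toy = pd.DataFrame(index=pd.RangeIndex(10, name="video_frame"))
frames_toy["behavior"] = "no_behavior"

# Toy annotations (made up); the second interval overlaps the first on frames 4-5.
annotations_toy = pd.DataFrame({
    "action":      ["rear", "sniff"],
    "start_frame": [2, 4],
    "stop_frame":  [5, 7],
})

# Unroll each interval into per-frame labels (both ends inclusive).
for row in annotations_toy.itertuples(index=False):
    in_interval = (frames_toy.index >= row.start_frame) & (frames_toy.index <= row.stop_frame)
    frames_toy.loc[in_interval, "behavior"] = row.action

print(frames_toy["behavior"].tolist())
```

Frames 4 and 5 end up labeled `sniff` only, which is the "single target column" simplification in action.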

In [None]:
# --- 1. Select a Subset of Videos ---
# Let's use the first 50 videos for our baseline model: small enough to process quickly.
# Note: head() takes the first rows of train.csv, so this is not a random sample of videos.
N_VIDEOS_TO_USE = 50
df_subset_meta = df_train_meta.head(N_VIDEOS_TO_USE)

print(f"Using a subset of {len(df_subset_meta)} videos for this baseline model.")

In [None]:
# --- 2. Load and Combine Data for the Subset ---
all_wide_dfs = []
for index, row in tqdm(df_subset_meta.iterrows(), total=len(df_subset_meta)):
    df_wide = load_and_process_video(row['video_id'], row['lab_id'], DATA_PATH)
    if df_wide is not None:
        # Add a video_id column so we can link back to annotations
        df_wide['video_id'] = row['video_id']
        all_wide_dfs.append(df_wide)

# Combine all individual video dataframes into one big one
df_train_full = pd.concat(all_wide_dfs)

print(f"\nLoaded and combined data for all videos. Full training shape: {df_train_full.shape}")


In [None]:
# --- 3. Load Annotations and Create Frame-wise Labels ---
# Load all annotations for our subset of videos
all_annotations_list = []
for video_id in tqdm(df_subset_meta['video_id'].unique(), desc="Loading annotations"):
    row = df_subset_meta[df_subset_meta['video_id'] == video_id].iloc[0]
    annot_path = os.path.join(DATA_PATH, 'train_annotation', row['lab_id'], f"{row['video_id']}.parquet")
    if os.path.exists(annot_path):
        df_annot = pd.read_parquet(annot_path)
        df_annot['video_id'] = video_id
        all_annotations_list.append(df_annot)

df_annotations_subset = pd.concat(all_annotations_list)

# Initialize the target column with a "no_behavior" label
df_train_full['behavior'] = 'no_behavior'

# --- Apply labels to each frame ---
# This is a complex loop, but it's the core of the labeling process
print("\nApplying annotations to each frame...")
for index, row in tqdm(df_annotations_subset.iterrows(), total=len(df_annotations_subset)):
    video_id = row['video_id']
    start_frame = row['start_frame']
    stop_frame = row['stop_frame']
    action = row['action']
    
    # This is a simplification for the baseline: we create a single 'behavior' target.
    # We are not yet handling multiple simultaneous behaviors.
    df_train_full.loc[
        (df_train_full['video_id'] == video_id) & 
        (df_train_full.index >= start_frame) & 
        (df_train_full.index <= stop_frame),
        'behavior'
    ] = action

print("Labeling complete.")
print("\nValue counts of our new 'behavior' target column:")
print(df_train_full['behavior'].value_counts())

## What We Learned in Step 2

*   **Successful Data Assembly:** We have successfully created a complete training dataset for our baseline model. It contains the raw coordinate data and a corresponding `behavior` label for over 5.7 million frames, sampled from our first 50 videos.

*   **The "Nothing" Class:** The `value_counts()` output immediately reveals a crucial insight. The `no_behavior` class, which we created, is by far the largest, with ~5.4 million of the ~5.7 million frames (roughly 95%). It dwarfs even the most common labeled behavior (`rear`). This means our model will be heavily incentivized to just predict "nothing is happening."

*   **Labeling Looks Plausible:** The presence of many different behavior classes (rear, attack, sniff, etc.) with reasonable counts is consistent with our logic for "unrolling" the `start_frame, stop_frame` annotations into per-frame labels working as intended. This is a sanity check rather than a proof.

*   **Simplification Check:** For this baseline, we've created a single target column. This means we are not currently handling frames where multiple behaviors might be happening at once. This is a deliberate simplification we can address in a later, more advanced notebook.

**With our features (the raw coordinates) and our target (`behavior` column) now in a single, clean DataFrame, we are finally ready to train our first model.**

# Step 3: Model Training

**Goal:** To train our LightGBM model on the data we just prepared. We will perform a simple split of our data into a training set and a validation set to evaluate its performance.

**Action:**
1.  **Define Features and Target:** Explicitly separate our `X` (features, i.e., the coordinates) and `y` (target, i.e., the behavior) variables.
2.  **Handle Missing Data:** LightGBM handles `NaN` values natively, so imputation is optional. For this baseline we fill them with an explicit placeholder, -1, so that "missing" is encoded the same way in the training and test data.
3.  **Encode Labels:** Machine learning models require numerical labels. We will use `LabelEncoder` to convert our behavior strings (e.g., "attack") into integers (e.g., 0, 1, 2).
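4.  **Split Data:** We will perform a simple random `train_test_split`. We know from our EDA that this is not the *best* way to validate: neighboring frames from the same video are nearly identical, so a random split puts close copies of the training frames into the validation set and makes the scores look better than they are. It is the *simplest* option, though, and is acceptable for our first baseline.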
5.  **Train the Model:** We will initialize and train a `LGBMClassifier`.
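
For comparison, here is a rough sketch of a split that holds out whole videos instead of individual frames, so no video contributes frames to both sides. This is not what the notebook does below; the toy table and its ids are made up.

```python
import numpy as np
import pandas as pd

# Toy frame table: 5 videos with 4 frames each (ids are illustrative).
frames = pd.DataFrame({
    "video_id": np.repeat([101, 102, 103, 104, 105], 4),
    "feature":  np.arange(20, dtype=float),
})

# Pick roughly 20% of the VIDEOS (not frames) for validation.
rng = np.random.default_rng(42)
video_ids = frames["video_id"].unique()
n_val = max(1, int(round(0.2 * len(video_ids))))
val_videos = rng.choice(video_ids, size=n_val, replace=False)

# Every frame follows its video into either the train or the validation part.
is_val = frames["video_id"].isin(val_videos)
train_part, val_part = frames[~is_val], frames[is_val]

# Sanity check: no video appears on both sides of the split.
shared = set(train_part["video_id"]) & set(val_part["video_id"])
print(f"validation videos: {sorted(val_videos.tolist())}, shared videos: {len(shared)}")
```

scikit-learn's `GroupShuffleSplit` offers the same idea with `video_id` passed as the `groups` argument.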

In [None]:
# --- 1. Define Features (X) and Target (y) ---
# Our features are all columns EXCEPT 'video_id' and our target 'behavior'
features = [col for col in df_train_full.columns if col not in ['video_id', 'behavior']]
X = df_train_full[features]
y = df_train_full['behavior']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

In [None]:
# --- 2. Handle Missing Data ---
# LightGBM can handle NaNs natively; we fill them explicitly so the placeholder is the same everywhere.
# We use -1, assuming tracked coordinates are non-negative so -1 cannot be mistaken for a real position.
X = X.fillna(-1)
print("\nFilled NaN values with -1.")

In [None]:
# --- 3. Encode String Labels into Numbers ---
# The model needs numerical targets, so 'attack' -> 0, 'chase' -> 1, etc.
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Let's see the mapping
print("\nLabel Encoding Mapping:")
for i, class_name in enumerate(label_encoder.classes_):
    print(f"{class_name} -> {i}")

In [None]:
# --- 4. Split Data into Training and Validation Sets ---
# We use a simple 80/20 split. stratify=y_encoded ensures that the proportion
# of each behavior is the same in both the train and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X, y_encoded, 
    test_size=0.2, 
    random_state=42, 
    stratify=y_encoded
)
print(f"\nTraining data shape: {X_train.shape}")
print(f"Validation data shape: {X_val.shape}")

In [None]:
# --- 5. Train the LightGBM Model ---
print("\nTraining LightGBM model...")

lgbm = lgb.LGBMClassifier(
    objective='multiclass',
    n_estimators=500,  # Number of trees to build
    learning_rate=0.05,
    num_leaves=31,
    random_state=42,
    n_jobs=-1,         # Use all available CPU cores
    colsample_bytree=0.8, # Subsample columns to prevent overfitting
    subsample=0.8       # Row subsampling; has no effect unless subsample_freq > 0 (default is 0)
)

# We use the validation set to monitor for early stopping
# This prevents the model from training for too long and overfitting.
lgbm.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    eval_metric='multi_logloss',
    callbacks=[lgb.early_stopping(10, verbose=True)]
)

print("\nModel training complete.")

## What We Learned in Step 3

*   **Successful Training:** Our LightGBM model trained successfully on nearly 4.6 million frames of data. The process was efficient, taking only a few minutes.

*   **Initial Scores Confirm Imbalance:** The "Start training from score..." lines show the initial bias of the model. The score for `no_behavior` (class 8) is `-0.052374`, which is very close to zero. All other scores are large negative numbers (e.g., `-5.77` for `approach`). These values are consistent with the log of each class's share of the training frames: a score near zero means a share near 100%, and a large negative score means a tiny share. The model's initial guess is overwhelmingly "no_behavior" because it's the most common class.

*   **Early Stopping is Powerful:** The model didn't need to run for all 500 rounds (estimators). Its validation loss was best at round **39**, and after 10 more rounds without improvement, training stopped automatically. This is a crucial technique that saved us time and, more importantly, limited overfitting to the training data.

*   **Validation Score:** The best score on our validation set was a `multi_logloss` of **0.144441**. On its own, this number is a bit abstract, but it will be our key internal benchmark. When we build our next model with feature engineering, we will aim to get this number even lower.
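
As a quick check on the initial scores quoted above, this sketch assumes they are the natural log of each class's prior and converts them back into shares with `exp()`. The validation-set counts used for comparison are the ones quoted from the classification report in Step 4.

```python
import math

# Initial scores quoted from the training log.
init_score_no_behavior = -0.052374
init_score_approach = -5.77

# Assumption: each initial score is log(class prior), so exp() recovers the prior.
prior_no_behavior = math.exp(init_score_no_behavior)
prior_approach = math.exp(init_score_approach)

# Share of no_behavior frames in the validation set, from the Step 4 report.
val_share_no_behavior = 1_090_760 / 1_149_409

print(f"no_behavior prior from score: {prior_no_behavior:.3f}")
print(f"no_behavior share in validation: {val_share_no_behavior:.3f}")
print(f"approach prior from score: {prior_approach:.4f}")
```

The two `no_behavior` figures agree at about 0.949, which is what we would expect from a stratified split.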

**The model now exists in our notebook's memory, ready to make predictions. The next step is to evaluate its performance in a more human-readable way and then prepare a submission file.**

# Step 4: Evaluation and Post-Processing

**Goal:** To understand how well our simple model performs and to convert its frame-by-frame predictions into the required `start_frame, stop_frame` format for submission.

**Action:**
1.  **Evaluate Performance:** We will use the trained model to make predictions on our validation set (`X_val`). Then, we'll generate a `classification_report`, which shows us key metrics like precision, recall, and F1-score for each individual behavior. This will clearly show us which behaviors the model learned and which it ignored.
2.  **Develop Post-Processing Logic:** This is a critical step. Our model outputs a single prediction for each frame. We need a function that can take a long sequence of these predictions (e.g., `[8, 8, 9, 9, 9, 8, ...]`) and convert them into submission-ready rows like `(rear, start_frame=2, stop_frame=4)`.
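
The conversion described in step 2 is essentially run-length encoding. Here is a minimal, standard-library sketch on the example sequence above, taking class 8 as `no_behavior` and class 9 as `rear`. The `class_names` mapping is a stand-in for the fitted label encoder.

```python
from itertools import groupby

# Per-frame class ids from the example: 8 = no_behavior, 9 = rear.
frame_preds = [8, 8, 9, 9, 9, 8]
class_names = {8: "no_behavior", 9: "rear"}  # stand-in for the label encoder

events = []
frame = 0
# groupby yields one run per stretch of identical consecutive values.
for class_id, run in groupby(frame_preds):
    run_length = len(list(run))
    if class_names[class_id] != "no_behavior":
        # Record (action, start_frame, stop_frame), with both ends inclusive.
        events.append((class_names[class_id], frame, frame + run_length - 1))
    frame += run_length

print(events)  # [('rear', 2, 4)]
```

Because the runs are found on the full sequence before `no_behavior` is skipped, two events of the same action separated by a gap stay separate.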

In [None]:
# --- 1. Evaluate Performance on the Validation Set ---
print("--- Model Performance on Validation Set ---")

# Make predictions
y_pred = lgbm.predict(X_val)

# Convert the numerical predictions back to string labels for the report
y_pred_labels = label_encoder.inverse_transform(y_pred)
y_val_labels = label_encoder.inverse_transform(y_val)

# Generate and print the classification report
report = classification_report(y_val_labels, y_pred_labels)
print(report)

In [None]:
# --- 2. Develop Post-Processing Logic ---
def predictions_to_submission(df_preds, video_id):
    """
    Converts frame-by-frame predictions into a submission-ready format.
    
    Args:
        df_preds (pd.DataFrame): DataFrame with 'frame' and 'behavior' columns.
        video_id (int or str): The ID of the video being processed.
        
    Returns:
        pd.DataFrame: A submission-formatted DataFrame for this video.
    """
    submission_rows = []
    
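    # Work in frame order so that consecutive rows are consecutive frames
    df_preds = df_preds.sort_values('frame').copy()
    
    # Find contiguous blocks of the same behavior on the FULL sequence.
    # A new block starts wherever the label differs from the previous frame.
    df_preds['block'] = (df_preds['behavior'] != df_preds['behavior'].shift()).cumsum()
    
    # Only now drop 'no_behavior'. Filtering first would merge two events of the
    # same behavior that are separated only by 'no_behavior' frames.
    df_preds = df_preds[df_preds['behavior'] != 'no_behavior']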
    
    for _, group in df_preds.groupby('block'):
        # For our simple baseline, we'll assign mouse1 as agent and mouse2 as target
        # This is a major simplification we'll improve later.
        submission_rows.append({
            'video_id': video_id,
            'agent_id': 'mouse1',
            'target_id': 'mouse2',
            'action': group['behavior'].iloc[0],
            'start_frame': group['frame'].min(),
            'stop_frame': group['frame'].max(),
        })
        
    return pd.DataFrame(submission_rows)

In [None]:
# --- Test the post-processing function ---
print("\n--- Testing Post-Processing ---")

# Create a dummy prediction dataframe to test the logic
dummy_data = {
    'frame': [0, 1, 2, 3, 4, 5, 6, 7, 8],
    'behavior': ['no_behavior', 'attack', 'attack', 'attack', 'no_behavior', 'rear', 'rear', 'no_behavior', 'attack']
}
dummy_df = pd.DataFrame(dummy_data)
dummy_submission = predictions_to_submission(dummy_df, 'dummy_video')

print("Dummy predictions converted to submission format:")
display(dummy_submission)

## What We Learned in Step 4

**From the Classification Report:**

*   **The "Accuracy" Lie:** The overall accuracy is **96%**, which looks amazing at first glance. However, this number is completely misleading. Our model could achieve ~95% accuracy by *only* predicting `no_behavior` (`1090760 / 1149409`). This high accuracy score is a classic trap in imbalanced datasets.

*   **Precision vs. Recall:** This is where the real story is.
    *   **Recall** tells us "Of all the actual 'attack' frames, what percentage did our model find?" For most classes like `attack` (18%), `avoid` (16%), and `shepherd` (14%), the recall is **very low**. Our model is missing the vast majority of these behaviors.
    *   **Precision** tells us "When the model predicted 'attack', how often was it correct?" The precision is generally much higher (e.g., 66% for `attack`). This means that *when* our model decides to predict a behavior, it's right more often than not, but it's very shy and hesitant to do so.

*   **Successes and Failures:**
    *   **Successes:** The model did reasonably well on distinct, long-duration, or high-count behaviors like `mount`, `submit`, and `sniffgenital`, achieving a good balance of precision and recall (F1-scores of 0.70+).
    *   **Failures:** It performed very poorly on rare and subtle behaviors. Look at `sniffface`: an F1-score of just 0.15. It has almost no idea what this behavior looks like.
    *   **The `no_behavior` Crutch:** The model's strategy is clear: "When in doubt, predict `no_behavior`." It only predicts a real behavior when it's extremely confident.
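
The arithmetic behind two of the points above can be checked directly. This sketch uses only the numbers quoted from the classification report: the `no_behavior` count, the total frame count, and the precision and recall for `attack`.

```python
# Counts and rates quoted from the classification report.
n_no_behavior = 1_090_760
n_total = 1_149_409

# Accuracy of a "model" that always predicts 'no_behavior'.
majority_accuracy = n_no_behavior / n_total

# F1 is the harmonic mean of precision and recall (values for 'attack').
precision_attack = 0.66
recall_attack = 0.18
f1_attack = 2 * precision_attack * recall_attack / (precision_attack + recall_attack)

print(f"always-no_behavior accuracy: {majority_accuracy:.3f}")
print(f"attack F1 from P=0.66, R=0.18: {f1_attack:.2f}")
```

The harmonic mean is pulled toward the smaller of the two values, which is why a decent precision cannot rescue a low recall.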

**From the Post-Processing Test:**

*   **Logic is Sound:** On the dummy data, our `predictions_to_submission` function behaves as intended. It ignores the `no_behavior` class and groups consecutive frames of the same action into a single event with the correct `start_frame` and `stop_frame`, including the single-frame `attack` at frame 8. One case this test does not exercise is two events of the same action separated only by `no_behavior` frames, so that is worth checking separately.

**Conclusion: We have a working, but very conservative, baseline model. Its main weakness is low recall on most behaviors. This gives us a clear goal for Notebook 3: improve recall by giving the model better features!**

# Step 5: Generate Submission File

**Goal:** To use our trained model and post-processing function to generate a `submission.csv` file in the correct format for the competition.

**Action:**
1.  **Load Test Metadata:** Get the list of test videos we need to predict on.
2.  **Iterate and Predict:** Loop through each test video. For each one, we will:
    *   Load its tracking data using our `load_and_process_video` function.
    *   Fill `NaN` values just like we did for the training data.
    *   Use our trained `lgbm` model to predict the behavior for every frame.
    *   Convert the numerical predictions back to string labels.
    *   Use our `predictions_to_submission` function to create the submission rows for that video.
3.  **Combine and Save:** Combine the results from all test videos into a single DataFrame and save it as `submission.csv`.
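
One detail in step 2 deserves a closer look: a test video may not have exactly the same columns as the training data. The toy sketch below shows how `reindex` plus `fillna` brings a test frame into the training layout. The `train_features` list and all column names are hypothetical stand-ins for the real `features` list.

```python
import numpy as np
import pandas as pd

# Hypothetical training feature list (stand-in for `features`).
train_features = ["mouse1_nose_x", "mouse1_nose_y", "mouse2_nose_x", "mouse2_nose_y"]

# Toy test video: lacks mouse2, has one unseen column and one NaN.
df_wide_test_toy = pd.DataFrame({
    "mouse1_nose_x": [10.0, np.nan],
    "mouse1_nose_y": [20.0, 21.0],
    "mouse1_tail_x": [5.0, 6.0],   # not seen in training, will be dropped
})

# reindex adds missing columns (filled with -1), drops unknown ones, and
# enforces the training column order; fillna handles NaNs in existing columns.
X_test_toy = df_wide_test_toy.reindex(columns=train_features, fill_value=-1).fillna(-1)
print(X_test_toy)
```

Note that `fill_value` only applies to columns that `reindex` creates, which is why the separate `fillna(-1)` is still needed.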

In [None]:
# --- 1. Load Test Metadata ---
print("Loading test metadata...")
df_test_meta = pd.read_csv(os.path.join(DATA_PATH, 'test.csv'))
print(f"Found {len(df_test_meta)} videos in the test set.")

all_submissions = []

In [None]:
# --- 2. Iterate and Predict on Test Set ---
for index, row in tqdm(df_test_meta.iterrows(), total=len(df_test_meta)):
    video_id = row['video_id']
    lab_id = row['lab_id']
    
    print(f"\nProcessing video: {video_id}")
    
    # Load the video's tracking data
    # NOTE: The test tracking files are in the 'test_tracking' folder
    test_tracking_path = os.path.join(DATA_PATH, 'test_tracking', lab_id, f'{video_id}.parquet')
    
    # A bit of code duplication here, but it's safer to be explicit for the test set
    if not os.path.exists(test_tracking_path):
        print(f"  Warning: Test file not found at {test_tracking_path}. Skipping.")
        continue
        
    df_long_test = pd.read_parquet(test_tracking_path)
    
    # Pivot to wide format
    pivot_x = df_long_test.pivot(index='video_frame', columns=['mouse_id', 'bodypart'], values='x')
    pivot_y = df_long_test.pivot(index='video_frame', columns=['mouse_id', 'bodypart'], values='y')
    pivot_x.columns = [f"mouse{m}_{bp}_x" for m, bp in pivot_x.columns]
    pivot_y.columns = [f"mouse{m}_{bp}_y" for m, bp in pivot_y.columns]
    df_wide_test = pd.concat([pivot_x, pivot_y], axis=1).sort_index(axis=1)
    
    # Ensure test set has the same columns as the training set
    X_test = df_wide_test.reindex(columns=features, fill_value=-1)
    
    # Preprocess (fill NaNs)
    X_test = X_test.fillna(-1)
    
    # Predict
    print(f"  Predicting {len(X_test)} frames...")
    preds_encoded = lgbm.predict(X_test)
    preds_labels = label_encoder.inverse_transform(preds_encoded)
    
    # Post-process
    df_preds = pd.DataFrame({
        'frame': X_test.index,
        'behavior': preds_labels
    })
    
    video_submission = predictions_to_submission(df_preds, video_id)
    all_submissions.append(video_submission)
    print(f"  Found {len(video_submission)} behavior events in this video.")

In [None]:
# --- 3. Combine and Save ---
# Define the column order up front so both branches below can use it
final_columns = ['video_id', 'agent_id', 'target_id', 'action', 'start_frame', 'stop_frame']

# Keep only videos that produced at least one event (an empty DataFrame has no columns)
non_empty_submissions = [df for df in all_submissions if not df.empty]

if non_empty_submissions:
    df_submission = pd.concat(non_empty_submissions, ignore_index=True)
    
    # Make sure columns are in the exact order required
    df_submission = df_submission[final_columns]
    
    # The submission requires a 'row_id' column; we write the index under that name
    df_submission.index.name = 'row_id'
    df_submission.to_csv('submission.csv', index=True)
    
    print("\n--- Submission File Generated ---")
    print(f"Total events predicted: {len(df_submission)}")
    display(df_submission.head())
else:
    # If no events were predicted for any video, create an empty submission file
    print("\nWarning: No behavior events were predicted. Creating an empty submission file.")
    pd.DataFrame(columns=['row_id'] + final_columns).to_csv('submission.csv', index=False)

# Notebook 2 Conclusion: A Working (But Flawed) Baseline

We have successfully achieved our goal for this notebook: we built a complete, end-to-end pipeline that can process raw data, train a model, and generate a valid submission file.

### Key Accomplishments:
1.  **End-to-End Pipeline:** We have code that can handle every step of the process. This is a massive achievement and the foundation for all future experiments.
2.  **First Submission:** We generated a `submission.csv` file, showing that the pipeline runs end to end and produces output in the required format.
3.  **Established a Baseline:** Our model achieved a **macro average F1-score of 0.49** on our validation set. This is our score to beat. We now know that any future changes must improve upon this number.

### Major Weaknesses Identified:
1.  **Low Recall:** The model is "shy." When it does predict a behavior it is usually right (high precision), but it misses the majority of the actual events (low recall for most classes).
2.  **Ignoring Time:** Our frame-by-frame approach is fundamentally limited. Behaviors are sequences, and by ignoring this, we are leaving a huge amount of information on the table.
3.  **Simplistic Features:** We used only raw coordinates. The model had to learn all spatial relationships from scratch, which is a very difficult task.

This baseline has served its purpose perfectly. It works, it gives us a score, and it has clearly illuminated the path forward.

**Next Up: Notebook 3 - The Ethologist's Toolkit: Advanced Feature Engineering**
In the next notebook, we will directly address the weaknesses of this model by engineering a rich set of features (distances, speeds, angles) to help the model better understand the interactions between the mice.
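
As a small taste of what such a feature might look like, here is a hedged sketch of a nose-to-nose distance computed from wide-format columns. The `nose` body part and the resulting column names are assumptions for illustration, not necessarily what the dataset or Notebook 3 will use.

```python
import numpy as np
import pandas as pd

# Toy wide table; 'nose' is a hypothetical body part name.
df_wide_toy = pd.DataFrame({
    "mouse1_nose_x": [0.0, 3.0],
    "mouse1_nose_y": [0.0, 4.0],
    "mouse2_nose_x": [3.0, 3.0],
    "mouse2_nose_y": [4.0, 4.0],
})

# Euclidean distance between the two noses, one value per frame.
dx = df_wide_toy["mouse1_nose_x"] - df_wide_toy["mouse2_nose_x"]
dy = df_wide_toy["mouse1_nose_y"] - df_wide_toy["mouse2_nose_y"]
df_wide_toy["nose_to_nose_dist"] = np.hypot(dx, dy)

print(df_wide_toy["nose_to_nose_dist"].tolist())  # [5.0, 0.0]
```

A single column like this hands the model a spatial relationship it would otherwise have to piece together from four raw coordinates.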