<center>
  <div style="background-color:#334155; color:#e2e8f0; padding:2rem; border-radius:1rem; text-align:center; font-family:sans-serif; margin-bottom:2rem;">
    <h1 style="color:#58a6ff; margin-bottom:0.5rem;">Ozan MÖHÜRCÜ</h1>
    <h2 style="color:#cbd5e1; font-weight:400; font-size:1.25rem; margin-top:0; margin-bottom:1.5rem;">Data Analyst | Data Scientist</h2>
    <a href="https://www.linkedin.com/in/ozanmhrc/" target="_blank" rel="noopener noreferrer">
      <img src="https://img.shields.io/badge/LinkedIn-0A66C2?style=for-the-badge&logo=linkedin&logoColor=white" alt="LinkedIn Profile">
    </a>
    <a href="https://github.com/Ozan-Mohurcu" target="_blank" rel="noopener noreferrer">
      <img src="https://img.shields.io/badge/GitHub-171515?style=for-the-badge&logo=github&logoColor=white" alt="GitHub Profile">
    </a>
    <a href="https://ozan-mohurcu.github.io/" target="_blank" rel="noopener noreferrer">
      <img src="https://img.shields.io/badge/Portfolio-6A1B9A?style=for-the-badge&logo=google-chrome&logoColor=white" alt="Portfolio Website">
    </a>
  </div>
</center>

<div style="background-color:#0d1117; color:#c9d1d9; border: 2px solid #30363d; border-radius: 10px; padding: 25px; font-family: sans-serif;"><h1 style="color: #58a6ff; text-align: center; border-bottom: 3px solid #238636; padding-bottom: 15px; margin-bottom: 25px;">🐭 MABe Advanced Behavior Detection: A Deep Dive into Kinematics and Interaction</h1><div style="background-color:#161b22; color:#c9d1d9; border-left: 6px solid #58a6ff; padding: 15px; border-radius: 5px; margin: 20px 0;"><strong>Author:</strong> Gemini (Google AI)<br><strong>Competition:</strong> Mouse Aberrant Behavior (MABe) Detection<br><strong>Objective:</strong> To build a state-of-the-art, end-to-end pipeline for detecting complex social behaviors (attack, mount, chase) from raw mouse tracking data. This notebook emphasizes advanced feature engineering, robust ensemble modeling, and meticulous post-processing to maximize performance.</div><h2 style="color: #e3b341; border-bottom: 2px solid #30363d; padding-bottom: 8px; margin-top: 30px;">1. Introduction: Deconstructing the Challenge</h2><p style="line-height: 1.6;">The MABe Challenge requires us to identify the precise start and stop frames for three key social behaviors using raw (x, y) coordinates of various body parts for up to four mice. This task lies at the intersection of time-series analysis, pattern recognition, and classical machine learning.</p><div style="margin-top: 25px;"><h3 style="color: #39d353; margin-bottom: 15px;">Key Challenges & Strategic Solutions:</h3><div style="background-color: #161b22; border: 1px solid #30363d; padding: 15px; border-radius: 8px; margin-bottom: 15px;">
    <h4 style="color: #58a6ff; margin-top: 0;">📦 Massive Data Volume & Memory Constraints:</h4>
    <p><strong>Problem:</strong> The tracking data is distributed across thousands of large parquet files. Loading all data into memory is infeasible.</p>
    <p><strong>Solution:</strong> We will implement a Python <strong>generator-based data pipeline</strong>. This approach processes one video at a time, loading data, creating features, and training/predicting in memory-efficient chunks.</p>
</div>

<div style="background-color: #161b22; border: 1px solid #30363d; padding: 15px; border-radius: 8px; margin-bottom: 15px;">
    <h4 style="color: #58a6ff; margin-top: 0;">📈 Complex, Dynamic Behaviors:</h4>
    <p><strong>Problem:</strong> Social interactions are not static events. They are fluid sequences defined by the relative kinematics of multiple individuals. Simple distance-based features are insufficient.</p>
    <p><strong>Solution:</strong> We will construct a <strong>hierarchical feature set</strong>, capturing everything from basic geometry (distances, angles) to advanced kinematics (velocity, acceleration, jerk), trajectory properties (curvature), and sophisticated inter-mouse interaction metrics (relative orientation, velocity correlation, pursuit vectors).</p>
</div>

<div style="background-color: #161b22; border: 1px solid #30363d; padding: 15px; border-radius: 8px; margin-bottom: 15px;">
    <h4 style="color: #58a6ff; margin-top: 0;">🧩 Data Heterogeneity & Domain Shift:</h4>
    <p><strong>Problem:</strong> Videos originate from different labs (lab_id) with varying camera setups, lighting, and even different sets of tracked body parts. A single "one-size-fits-all" model would perform poorly.</p>
    <p><strong>Solution:</strong> We will adopt a <strong>model-per-modality strategy</strong>. A separate, specialized ensemble model will be trained for each unique set of <code>body_parts_tracked</code>. This ensures that each model is an expert on its specific input data structure.</p>
</div>

<div style="background-color: #161b22; border: 1px solid #30363d; padding: 15px; border-radius: 8px; margin-bottom: 15px;">
    <h4 style="color: #58a6ff; margin-top: 0;">⚖️ Severe Class Imbalance:</h4>
    <p><strong>Problem:</strong> Behavioral events are rare. A typical video may contain only a few seconds of a specific behavior across thousands of frames. Naive classifiers will be heavily biased towards the majority "no-behavior" class.</p>
    <p><strong>Solution:</strong> We will use a custom <code>StratifiedDownsamplingClassifier</code>. This wrapper trains each model on a stratified random subsample of the training frames. The subsample preserves the original ratio of positive to negative frames, so the rare positives keep their proportional share instead of being lost by chance, and the negative examples stay diverse while training cost drops.</p>
</div>
</div><p style="line-height: 1.6; margin-top: 20px;">Let's begin by preparing our environment.</p><div style="background-color:#161b22; color:#c9d1d9; border: 1px solid #30363d; padding: 15px; border-radius: 5px; margin: 20px 0;"><h3 style="color:#58a6ff; margin-top:0;">1.1. Notebook Setup & Imports</h3><p style="margin-bottom:0;">We start by importing the necessary libraries and defining key constants and configurations. This centralized setup improves readability and maintainability.</p></div></div>

In [None]:
import pandas as pd
import numpy as np
import polars as pl
import json
import os
import gc
import itertools
from collections import defaultdict
import warnings
from tqdm.notebook import tqdm

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid', context='notebook', palette='viridis')

import lightgbm as lgb
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import f1_score
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.pipeline import make_pipeline

try:
    from xgboost import XGBClassifier
    XGBOOST_AVAILABLE = True
except ImportError:
    XGBOOST_AVAILABLE = False
    print("Warning: XGBoost not found. The ensemble will proceed using only LightGBM.")

from scipy import signal

warnings.filterwarnings('ignore')
pd.options.display.max_columns = 100
tqdm.pandas()

BASE_PATH = '/kaggle/input/MABe-mouse-behavior-detection/'
TRAIN_TRACKING_DIR = os.path.join(BASE_PATH, 'train_tracking')
TEST_TRACKING_DIR = os.path.join(BASE_PATH, 'test_tracking')
TRAIN_ANNOTATION_DIR = os.path.join(BASE_PATH, 'train_annotation')

DROP_BODY_PARTS = [
    'headpiece_bottombackleft', 'headpiece_bottombackright', 'headpiece_bottomfrontleft', 'headpiece_bottomfrontright', 
    'headpiece_topbackleft', 'headpiece_topbackright', 'headpiece_topfrontleft', 'headpiece_topfrontright', 
    'spine_1', 'spine_2', 'tail_middle_1', 'tail_middle_2', 'tail_midpoint'
]

print("Environment setup complete.")
print(f"XGBoost available: {XGBOOST_AVAILABLE}")

<h2 style="color: #e3b341; border-bottom: 2px solid #30363d; padding-bottom: 8px; margin-top: 40px;">
2. Exploratory Data Analysis (EDA)
</h2>
<p style="line-height: 1.6;">
Understanding the data's structure, distributions, and potential pitfalls is an essential first step before any feature engineering or modeling. We'll start by examining the metadata.
</p>

<div style="background-color:#161b22; color:#c9d1d9; border: 1px solid #30363d; padding: 15px; border-radius: 5px; margin: 20px 0;">
<h3 style="color:#58a6ff; margin-top:0;">2.1. Metadata Overview</h3>
<p style="margin-bottom:0;">The <code>train.csv</code> and <code>test.csv</code> files provide high-level information about each video, including the lab of origin, pixel-to-cm conversion, and critically, the list of tracked body parts.</p>
</div>


In [None]:
# Load the metadata files
train_df = pd.read_csv(os.path.join(BASE_PATH, 'train.csv'))
test_df = pd.read_csv(os.path.join(BASE_PATH, 'test.csv'))

# --- Data Cleaning: Exclude MABe22 Labs ---
# As per competition guidelines, these labs are from a different source and should not be used for training.
print(f"Original number of training videos: {len(train_df)}")
train_df = train_df[~train_df['lab_id'].str.startswith('MABe22_')].reset_index(drop=True)
print(f"Number of training videos after filtering MABe22: {len(train_df)}")

print("\n--- Train Metadata Sample ---")
display(train_df.head(3))

print("\n--- Test Metadata Sample ---")
display(test_df.head(3))

<div style="background-color:#161b22; color:#c9d1d9; border: 1px solid #30363d; padding: 15px; border-radius: 5px; margin: 20px 0;">
<h3 style="color:#58a6ff;">2.2. Lab and Body Part Distributions</h3>
The <code>lab_id</code> and <code>body_parts_tracked</code> columns are the most important grouping variables. Different labs may have systematic differences in their data, and the set of available body parts directly dictates our feature engineering possibilities. A model trained on 5 body parts will not work for data with 17 parts.
</div>

In [None]:
fig, axes = plt.subplots(2, 1, figsize=(16, 14))

lab_counts = train_df['lab_id'].value_counts()
sns.barplot(x=lab_counts.index, y=lab_counts.values, ax=axes[0], palette='plasma')
axes[0].set_title('Distribution of Videos per Lab in Training Set', fontsize=16)
axes[0].set_ylabel('Number of Videos')
axes[0].set_xlabel('Lab ID')
axes[0].tick_params(axis='x', rotation=45)

body_part_counts = train_df['body_parts_tracked'].apply(lambda x: f"{len(json.loads(x))} parts").value_counts()
sns.barplot(x=body_part_counts.index, y=body_part_counts.values, ax=axes[1], palette='magma')
axes[1].set_title('Distribution of Videos per Number of Tracked Body Parts', fontsize=16)
axes[1].set_ylabel('Number of Videos')
axes[1].set_xlabel('Number of Tracked Body Parts')

plt.tight_layout()
plt.show()

BODY_PART_CONFIGS = train_df['body_parts_tracked'].unique()
print(f"Found {len(BODY_PART_CONFIGS)} unique body part tracking configurations.")

<div style="background-color:#161b22; color:#c9d1d9; border: 1px solid #30363d; padding: 15px; border-radius: 5px; margin: 20px 0;">
<h3 style="color:#58a6ff;">2.3. Behavior Analysis</h3>
Let's analyze the target variables themselves. We need to load some annotation data to see how frequent and how long the behaviors are. This is crucial for understanding the class imbalance problem.

<strong style="color:#ffcc00;">Demonstrating Error Handling:</strong> The code below deliberately tries to load an annotation file that might not exist. We wrap it in a <code>try...except</code> block, a practice that is essential for building a robust pipeline that doesn't crash on missing files.

</div>
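<p style="line-height: 1.6;">To make the imbalance concrete, the following is a minimal sketch of how the positive-frame share can be estimated from annotation intervals. The toy table and the 1,000-frame video length are invented for illustration; only the column names <code>action</code>, <code>start_frame</code>, and <code>stop_frame</code> come from the annotation files used in this notebook.</p>

```python
import pandas as pd

# Hypothetical annotations for one 1,000-frame video (values invented for illustration).
toy_annotations = pd.DataFrame({
    "action": ["attack", "chase", "attack"],
    "start_frame": [100, 400, 700],
    "stop_frame": [130, 460, 710],
})
n_frames = 1000  # assumed video length

# Total annotated frames per behavior, then the share of the video they cover.
toy_annotations["duration"] = toy_annotations["stop_frame"] - toy_annotations["start_frame"]
positive_fraction = toy_annotations.groupby("action")["duration"].sum() / n_frames
print(positive_fraction)
```

<p style="line-height: 1.6;">In this toy case each behavior covers only a few percent of the frames, which is the kind of ratio a frame-wise classifier has to cope with.</p>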

In [None]:
behavior_durations = defaultdict(list)
TARGET_BEHAVIORS = {'attack', 'mount', 'chase'}

print("Analyzing behavior durations from a sample of videos...")
for _, row in tqdm(train_df.head(50).iterrows(), total=min(50, len(train_df))): # Analyze first 50 videos
    annotation_path = os.path.join(TRAIN_ANNOTATION_DIR, row['lab_id'], f"{row['video_id']}.parquet")
    
    try:
        annotation_df = pd.read_parquet(annotation_path)
        # Ensure we only look at our target behaviors
        annotation_df = annotation_df[annotation_df['action'].isin(TARGET_BEHAVIORS)]
        
        if not annotation_df.empty:
            durations = annotation_df['stop_frame'] - annotation_df['start_frame']
            for action, duration in zip(annotation_df['action'], durations):
                behavior_durations[action].append(duration)
                
    except FileNotFoundError:
        # This is a graceful way to handle videos that have tracking data but no annotations.
        # Our pipeline must not fail if an annotation file is missing.
        # print(f"HANDLED ERROR: Annotation file not found for video {row['video_id']}. Skipping.")
        pass

# Plotting the distributions
fig, axes = plt.subplots(1, 3, figsize=(18, 5), sharey=True)
fig.suptitle('Distribution of Behavior Durations (in Frames)', fontsize=18)

for i, action in enumerate(sorted(TARGET_BEHAVIORS)):  # sorted: set order is not stable across runs
    if behavior_durations[action]:
        sns.histplot(behavior_durations[action], ax=axes[i], bins=50, kde=True)
        mean_duration = np.mean(behavior_durations[action])
        axes[i].axvline(mean_duration, color='r', linestyle='--', label=f'Mean: {mean_duration:.1f} frames')
        axes[i].set_title(f'{action.capitalize()}')
        axes[i].set_xlabel('Duration (frames)')
        axes[i].legend()

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

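<h2 style="color: #e3b341; border-bottom: 2px solid #30363d; padding-bottom: 8px; margin-top: 40px;">
3. The Memory-Efficient Data Generator
</h2>
<p style="line-height: 1.6;">
This is the heart of our data processing strategy. The <code>generate_mouse_data</code> function is a Python generator that yields data for one mouse-pair interaction at a time. It handles loading, pivoting, normalization, and label creation on-the-fly.
</p>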

<div style="background-color:#161b22; color:#c9d1d9; border: 1px solid #30363d; padding: 15px; border-radius: 5px; margin: 20px 0;">
<h3 style="color:#58a6ff;">3.1. Generator Function</h3>
<p>Key steps within the generator:</p>
<ol>
<li><strong>Iterate Metadata:</strong> Loop through the provided metadata DataFrame (<code>train_df</code> or <code>test_df</code>).</li>
<li><strong>Load Data:</strong> Read the corresponding tracking parquet file.</li>
<li><strong>Restructure Data:</strong> Pivot the table so that each row is a <code>video_frame</code> and columns are a multi-index of (<code>mouse_id</code>, <code>bodypart</code>, coordinate). This format is essential for feature engineering.</li>
<li><strong>Normalize Coordinates:</strong> Divide coordinates by <code>pix_per_cm_approx</code> to make features scale-invariant across different videos.</li>
<li><strong>Identify Interactions:</strong> Determine all possible agent-target pairs for the video.</li>
<li><strong>Load/Create Labels (Train Mode):</strong> If in 'train' mode, load the annotation file and create a binary label for each frame and each target behavior.</li>
<li><strong>Yield Data:</strong> <code>yield</code> a tuple containing the processed tracking data for a specific pair, the corresponding metadata (video_id, frames), and the labels.</li>
</ol>
</div>
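<p style="line-height: 1.6;">The restructuring and normalization steps are easier to follow on a tiny example. The sketch below pivots a toy long-format table into the (mouse_id, bodypart, coordinate) layout. The coordinate values and the scale factor of 2.0 are invented; the column names match the tracking files as they are used in the generator.</p>

```python
import pandas as pd

# Toy long-format tracking table: one row per (frame, mouse, body part).
# Column names follow the generator; the values are invented.
toy_tracking = pd.DataFrame({
    "video_frame": [0, 0, 0, 0, 1, 1, 1, 1],
    "mouse_id":    [1, 1, 2, 2, 1, 1, 2, 2],
    "bodypart":    ["nose", "tail_base"] * 4,
    "x": [10.0, 12.0, 30.0, 32.0, 11.0, 13.0, 29.0, 31.0],
    "y": [5.0, 6.0, 15.0, 16.0, 5.5, 6.5, 14.5, 15.5],
})

# One row per frame; columns become a MultiIndex of (coordinate, mouse_id, bodypart).
wide = toy_tracking.pivot(index="video_frame", columns=["mouse_id", "bodypart"], values=["x", "y"])
# Reorder to (mouse_id, bodypart, coordinate) so wide[mouse_id][bodypart] gives x/y columns.
wide = wide.reorder_levels([1, 2, 0], axis=1).sort_index(axis=1)
wide = wide / 2.0  # hypothetical pix_per_cm_approx: pixels -> centimeters

print(wide[1]["nose"])
```

<p style="line-height: 1.6;">With this layout, selecting one mouse or one body part is a plain column lookup, which is what the pair construction relies on.</p>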

In [None]:
def generate_mouse_data(metadata_df, mode='train'):
    """
    A generator that yields processed data for each mouse-pair interaction in each video.
    This approach is highly memory-efficient.
    
    Args:
        metadata_df (pd.DataFrame): A dataframe with video metadata.
        mode (str): 'train' or 'test'. In 'train' mode, it also yields labels.
    
    Yields:
        tuple: (interaction_type, tracking_data, metadata, labels/actions)
               - interaction_type: 'pair' (for this problem, we only focus on pairs)
               - tracking_data: DataFrame with processed coordinates for the agent-target pair.
               - metadata: DataFrame with video_id, frame_id, agent, target.
               - labels (train): DataFrame with binary labels for each frame.
               - actions (test): List of possible actions to predict for this pair.
    """
    assert mode in ['train', 'test'], "Mode must be 'train' or 'test'."
    
    for _, row in metadata_df.iterrows():
        tracking_path = os.path.join(
            TRAIN_TRACKING_DIR if mode == 'train' else TEST_TRACKING_DIR,
            row['lab_id'],
            f"{row['video_id']}.parquet"
        )
        
        if not os.path.exists(tracking_path):
            continue

        # Load and pivot tracking data
        tracking_df = pd.read_parquet(tracking_path)
        
        # Standardize body parts for high-dim sets
        if len(tracking_df['bodypart'].unique()) > 10:
            tracking_df = tracking_df[~tracking_df['bodypart'].isin(DROP_BODY_PARTS)]

        pivoted_df = tracking_df.pivot(
            index='video_frame', 
            columns=['mouse_id', 'bodypart'], 
            values=['x', 'y']
        )
        
        # Reorder levels for easier access: (mouse_id, bodypart, coordinate)
        pivoted_df = pivoted_df.reorder_levels([1, 2, 0], axis=1).sort_index(axis=1)
        
        # Normalize by pixel-to-cm ratio
        if 'pix_per_cm_approx' in row and row['pix_per_cm_approx'] > 0:
            pivoted_df /= row['pix_per_cm_approx']

        # Parse the labeled behaviors to find agent-target pairs
        try:
            labeled_behaviors = json.loads(row['behaviors_labeled'])
            behavior_df = pd.DataFrame([b.replace("'", "").split(',') for b in labeled_behaviors], 
                                       columns=['agent', 'target', 'action'])
        except (TypeError, json.JSONDecodeError):
            continue
            
        # Get all unique mice present in the video
        available_mice = pivoted_df.columns.get_level_values('mouse_id').unique()

        # Iterate through all directed pairs of mice (mouse1 -> mouse2, mouse2 -> mouse1, etc.)
        for agent_id, target_id in itertools.permutations(available_mice, 2):
            agent_str = f"mouse{agent_id}"
            target_str = f"mouse{target_id}"

            # Check which behaviors are relevant for this specific agent-target pair
            pair_actions = behavior_df[
                (behavior_df['agent'] == agent_str) & 
                (behavior_df['target'] == target_str)
            ]['action'].unique()
            
            # We only care about pairs involved in one of our target behaviors
            relevant_actions = list(set(pair_actions) & TARGET_BEHAVIORS)
            if not relevant_actions:
                continue

            # Create the dataframes for this specific pair
            agent_data = pivoted_df[agent_id]
            target_data = pivoted_df[target_id]
            pair_tracking_data = pd.concat([agent_data, target_data], axis=1, keys=['agent', 'target'])

            pair_meta_data = pd.DataFrame({
                'video_id': row['video_id'],
                'video_frame': pair_tracking_data.index,
                'agent_id': agent_str,
                'target_id': target_str,
            })
            
            if mode == 'train':
                annotation_path = os.path.join(TRAIN_ANNOTATION_DIR, row['lab_id'], f"{row['video_id']}.parquet")
                
                # Initialize labels as all-zero
                pair_labels = pd.DataFrame(0, index=pair_tracking_data.index, columns=relevant_actions)
                
                if os.path.exists(annotation_path):
                    ann_df = pd.read_parquet(annotation_path)
                    
                    # Filter annotations for the current agent-target pair and relevant actions
                    pair_ann = ann_df[
                        (ann_df['agent_id'] == agent_id) &
                        (ann_df['target_id'] == target_id) &
                        (ann_df['action'].isin(relevant_actions))
                    ]
                    
                    # Mark every frame inside each annotated interval (.loc slicing includes both endpoints)
                    for _, ann_row in pair_ann.iterrows():
                        pair_labels.loc[ann_row['start_frame']:ann_row['stop_frame'], ann_row['action']] = 1
                
                yield 'pair', pair_tracking_data, pair_meta_data, pair_labels
            
            else: # mode == 'test'
                yield 'pair', pair_tracking_data, pair_meta_data, relevant_actions


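<h2 style="color: #e3b341; border-bottom: 2px solid #30363d; padding-bottom: 8px; margin-top: 40px;">
4. Advanced Feature Engineering: The Key to Victory
</h2>
<p style="line-height: 1.6;">
This is where we translate raw coordinates into meaningful behavioral signals. Our features are designed to capture geometry, kinematics, and social interaction dynamics across multiple time scales.
</p>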

<div style="background-color:#161b22; color:#c9d1d9; border: 1px solid #30363d; padding: 15px; border-radius: 5px; margin: 20px 0;">
<h3 style="color:#58a6ff;">4.1. Core Feature Engineering Function</h3>
The <code>create_pair_features</code> function takes the tracking data for an agent-target pair and engineers a rich feature set. We use rolling windows extensively to capture temporal context.
</div>
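<p style="line-height: 1.6;">As a small illustration of the rolling-window idea, the sketch below derives speed from a toy body-center track and smooths it with a centered 5-frame window. The trajectory is invented, and the variable names are illustrative rather than taken from the feature function.</p>

```python
import numpy as np
import pandas as pd

# Toy body-center track: stationary for 5 frames, then moving 1 cm/frame along x.
center = pd.DataFrame({
    "x": [0.0] * 5 + [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [0.0] * 10,
})

velocity = center.diff()  # per-frame displacement; the first row is NaN
speed = np.sqrt((velocity ** 2).sum(axis=1, skipna=False))

# Centered window: each frame's feature also uses future frames, which is fine
# for offline scoring of whole videos but would not be available when streaming.
speed_mean_5 = speed.rolling(5, min_periods=1, center=True).mean()
print(speed_mean_5.round(2).tolist())
```

<p style="line-height: 1.6;">The smoothed speed ramps up around the moment the mouse starts moving, so the window length controls how sharply a feature reacts to a change in behavior.</p>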

In [None]:
def create_pair_features(pair_data, body_parts):
    """
    Engineers a comprehensive feature set from the tracking data of an agent-target pair.
    """
    X = pd.DataFrame(index=pair_data.index)
    
    # For safe access, check which body parts are available
    agent_parts = pair_data['agent'].columns.get_level_values(0).unique()
    target_parts = pair_data['target'].columns.get_level_values(0).unique()

    # --- Level 1: Geometric & Distance Features ---
    # Inter-mouse distances between all pairs of body parts
    for p1 in agent_parts:
        for p2 in target_parts:
            if p1 in body_parts and p2 in body_parts:
                X[f'dist_{p1}_{p2}'] = np.linalg.norm(
                    pair_data['agent'][p1].values - pair_data['target'][p2].values, axis=1
                )

    # Agent's body elongation (if parts available)
    if 'nose' in agent_parts and 'tail_base' in agent_parts and 'ear_left' in agent_parts and 'ear_right' in agent_parts:
        nose_tail_dist = np.linalg.norm(pair_data['agent']['nose'].values - pair_data['agent']['tail_base'].values, axis=1)
        ear_ear_dist = np.linalg.norm(pair_data['agent']['ear_left'].values - pair_data['agent']['ear_right'].values, axis=1)
        X['agent_elongation'] = nose_tail_dist / (ear_ear_dist + 1e-6)

    # --- Level 2: Kinematic Features (Agent-centric) ---
    if 'body_center' in agent_parts:
        center_x = pair_data['agent']['body_center']['x']
        center_y = pair_data['agent']['body_center']['y']
        
        vel_x = center_x.diff()
        vel_y = center_y.diff()
        speed = np.sqrt(vel_x**2 + vel_y**2)
        
        accel_x = vel_x.diff()
        accel_y = vel_y.diff()
        acceleration = np.sqrt(accel_x**2 + accel_y**2)

        for w in [5, 15, 45]: # Short, medium, long windows
            # Speed features
            X[f'agent_speed_mean_{w}'] = speed.rolling(w, min_periods=1, center=True).mean()
            X[f'agent_speed_std_{w}'] = speed.rolling(w, min_periods=1, center=True).std()
            
            # Acceleration features
            X[f'agent_accel_mean_{w}'] = acceleration.rolling(w, min_periods=1, center=True).mean()
            X[f'agent_accel_max_{w}'] = acceleration.rolling(w, min_periods=1, center=True).max()
    
    # --- Level 3: Interaction & Relational Features ---
    if 'body_center' in agent_parts and 'body_center' in target_parts:
        # Relative kinematics
        agent_center = pair_data['agent']['body_center']
        target_center = pair_data['target']['body_center']
        
        rel_pos_vec = agent_center - target_center
        rel_dist = np.linalg.norm(rel_pos_vec.values, axis=1)
        
        agent_vel_vec = agent_center.diff()
        target_vel_vec = target_center.diff()
        
        # Rate of approach/retreat (index the Series by frame so it aligns with X)
        X['dist_change'] = pd.Series(rel_dist, index=X.index).diff()
        
        # Are they moving in similar directions? (cosine similarity of the two velocity vectors)
        agent_speed = np.linalg.norm(agent_vel_vec.values, axis=1)
        target_speed = np.linalg.norm(target_vel_vec.values, axis=1)
        
        # Use np.einsum for efficient row-wise dot product
        dot_product = np.einsum('ij,ij->i', agent_vel_vec.fillna(0).values, target_vel_vec.fillna(0).values)
        X['velocity_corr'] = dot_product / (agent_speed * target_speed + 1e-6)

    # --- Level 4: Advanced Trajectory & Signal Features (Agent-centric) ---
    if 'body_center' in agent_parts and len(speed.dropna()) > 128:
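        # Heading change per frame (a proxy for the curvature of the agent's path)
        angle = np.arctan2(vel_y, vel_x)
        # Wrap differences into [-pi, pi) so crossing the +/-pi boundary is not counted as a full turn
        turn_rate = ((angle.diff() + np.pi) % (2 * np.pi) - np.pi).abs()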
        X['agent_turn_rate_mean_30'] = turn_rate.rolling(30, min_periods=1, center=True).mean()

        # Frequency domain feature: Dominant movement frequency
        # Useful for detecting rhythmic behaviors like mounting
        fs = 30 # Assuming ~30 FPS
        f, psd = signal.welch(speed.fillna(0), fs=fs, nperseg=min(128, len(speed.dropna())))
        X['agent_dominant_freq'] = f[np.argmax(psd)] if len(f) > 0 else 0
        
    # Fill NaNs introduced by diff/rolling at the sequence edges (backward, then forward)
    # and drop columns that are all-NaN (can happen with short videos)
    X = X.bfill().ffill()
    X = X.dropna(axis=1, how='all')
    
    return X

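<h2 style="color: #e3b341; border-bottom: 2px solid #30363d; padding-bottom: 8px; margin-top: 40px;">
5. Modeling Strategy: A Robust, Stratified Ensemble
</h2>
<p style="line-height: 1.6;">
Our modeling approach is tailored to the specific challenges of this competition, focusing on robustness and handling class imbalance.
</p>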

<div style="background-color:#161b22; color:#c9d1d9; border: 1px solid #30363d; padding: 15px; border-radius: 5px; margin: 20px 0;">
<h3 style="color:#58a6ff;">5.1. The Stratified Downsampling Classifier</h3>
To cope with class imbalance and the sheer number of frames, we create a custom classifier wrapper. Instead of using all millions of frames, it trains the underlying model on a smaller, stratified random sample. Stratification preserves the original positive-to-negative ratio, so the rare positive frames keep their proportional share of the sample, while training is faster and the negative examples remain diverse.
</div>
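<p style="line-height: 1.6;">What "stratified" means here can be shown with a NumPy-only sketch. The helper <code>stratified_downsample_indices</code> is hypothetical and is not the wrapper used in this notebook, which relies on scikit-learn's <code>StratifiedShuffleSplit</code>; it only illustrates that proportional sampling keeps the class ratio rather than balancing it.</p>

```python
import numpy as np

def stratified_downsample_indices(y, n_samples, seed=42):
    """Hypothetical helper: row indices of a subsample that keeps each class's share of y."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    chosen = []
    for cls in np.unique(y):
        cls_idx = np.flatnonzero(y == cls)
        # Proportional allocation, but keep at least one row of every class.
        k = max(1, round(n_samples * len(cls_idx) / len(y)))
        chosen.append(rng.choice(cls_idx, size=min(k, len(cls_idx)), replace=False))
    return np.sort(np.concatenate(chosen))

# Toy labels: 2% positive frames out of 10,000.
y = np.array([1] * 200 + [0] * 9800)
idx = stratified_downsample_indices(y, n_samples=1000)
print(len(idx), y[idx].mean(), y.mean())
```

<p style="line-height: 1.6;">The subsample is ten times smaller but still 2% positive. If a balanced training set is wanted instead, the per-class allocation would have to change, or class weights could be set on the underlying model.</p>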

In [None]:
class StratifiedDownsamplingClassifier(BaseEstimator, ClassifierMixin):
    """
    A wrapper for any classifier that performs stratified downsampling before fitting.
    The subsample keeps the original class ratio; it shrinks the data, it does not balance it.
    """
    def __init__(self, estimator, n_samples=100000, random_state=42):
        self.estimator = estimator
        self.n_samples = n_samples
        self.random_state = random_state

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        # Fit a fresh copy so the estimator passed to __init__ stays untouched (scikit-learn convention)
        self.estimator_ = clone(self.estimator)
        n_total = len(y)
        # If the dataset is small enough, all of it is used as-is
        if n_total > self.n_samples:
            # Stratified sampling: draw n_samples rows while preserving the class proportions
            sss = StratifiedShuffleSplit(n_splits=1, train_size=self.n_samples, random_state=self.random_state)
            try:
                train_idx, _ = next(sss.split(X, y))
            except ValueError as e:
                # Fallback to seeded random sampling if stratification fails (e.g., too few positive samples)
                print(f"Stratified sampling failed: {e}. Falling back to random sampling.")
                rng = np.random.RandomState(self.random_state)
                train_idx = rng.choice(n_total, self.n_samples, replace=False)
            X, y = X[train_idx], y[train_idx]

        self.estimator_.fit(X, y)
        self.classes_ = self.estimator_.classes_
        return self

    def predict_proba(self, X):
        return self.estimator_.predict_proba(X)
        
    def predict(self, X):
        return self.estimator_.predict(X)

<div style="background-color:#161b22; color:#c9d1d9; border: 1px solid #30363d; padding: 15px; border-radius: 5px; margin: 20px 0;">
<h3 style="color:#58a6ff;">5.2. The Main Training Loop</h3>
<p>This is where everything comes together. We will:</p>
<ol>
<li>Iterate through each unique <code>body_parts_tracked</code> configuration.</li>
<li>For each configuration, use our <code>generate_mouse_data</code> to load all relevant training data and create features.</li>
<li>Train a separate ensemble of models for each target behavior (attack, mount, chase).</li>
<li>Our ensemble will consist of LightGBM and XGBoost (if available), each wrapped in our <code>StratifiedDownsamplingClassifier</code>.</li>
<li>Store all trained models in a dictionary for the prediction phase.</li>
</ol>
</div>

In [None]:
models_dict = {}

print("--- Starting Model Training ---")

for config_str in BODY_PART_CONFIGS:
    body_parts = json.loads(config_str)
    if len(body_parts) > 10:
        body_parts = [bp for bp in body_parts if bp not in DROP_BODY_PARTS]
    
    print(f"\nTraining models for configuration with {len(body_parts)} body parts...")
    
    # --- 1. Data Collection for this config ---
    config_train_df = train_df[train_df['body_parts_tracked'] == config_str]
    
    all_X = []
    all_y = []
    
    # Assuming a realistic number of pairs per video for tqdm
    data_generator = generate_mouse_data(config_train_df, mode='train')
    for _, tracking_data, _, labels in tqdm(data_generator, total=len(config_train_df) * 6):
        if labels.empty or tracking_data.empty:
            continue
            
        features = create_pair_features(tracking_data, body_parts)
        
        common_index = features.index.intersection(labels.index)
        if not common_index.empty:
            all_X.append(features.loc[common_index])
            all_y.append(labels.loc[common_index])
            
    if not all_X:
        print(f"  No valid training data found for this configuration. Skipping.")
        continue
        
    X_train = pd.concat(all_X)
    y_train = pd.concat(all_y)
    del all_X, all_y
    gc.collect()

    X_train_np = X_train.values
    
    # --- 2. Model Training for each behavior ---
    config_models = {}
    for behavior in y_train.columns:
        print(f"  Training for behavior: '{behavior}'...")
        
        # ==================================================================
        # CORE FIX: Handle NaNs by filtering data before training
        # ==================================================================
        
        # 1. Get the target series for the current behavior
        y_series = y_train[behavior]
        
        # 2. Create a boolean mask (as a NumPy array) marking rows with valid (non-NaN) labels.
        #    Labels are NaN for pairs whose video did not list this behavior.
        valid_mask = y_series.notna().to_numpy()
        
        # 3. Apply the mask positionally to both the features and the labels
        X_train_behavior = X_train_np[valid_mask]
        y_behavior_clean = y_series.to_numpy()[valid_mask].astype(int)
        
        # Skip if there are not enough positive samples to train a meaningful model
        if np.sum(y_behavior_clean) < 20:
            print(f"    Not enough positive samples ({np.sum(y_behavior_clean)}). Skipping behavior.")
            continue
            
        # Define the ensemble
        ensemble = []
        
        # Model 1: LightGBM
        lgbm = lgb.LGBMClassifier(objective='binary', metric='binary_logloss', n_estimators=300,
                                  learning_rate=0.05, num_leaves=31, random_state=42, n_jobs=-1, colsample_bytree=0.8)
        
        pipeline_lgbm = make_pipeline(
            SimpleImputer(strategy='mean'),
            StratifiedDownsamplingClassifier(estimator=lgbm, n_samples=150000)
        )
        # Fit on the cleaned, filtered data
        pipeline_lgbm.fit(X_train_behavior, y_behavior_clean)
        ensemble.append(pipeline_lgbm)

        # Model 2: XGBoost (if available)
        if XGBOOST_AVAILABLE:
            # use_label_encoder is omitted: it is deprecated and ignored by recent XGBoost releases
            xgb = XGBClassifier(objective='binary:logistic', eval_metric='logloss', n_estimators=250,
                                learning_rate=0.05, max_depth=5,
                                random_state=42, n_jobs=-1, tree_method='hist')
            
            pipeline_xgb = make_pipeline(
                SimpleImputer(strategy='mean'),
                StratifiedDownsamplingClassifier(estimator=xgb, n_samples=150000)
            )
            # Fit on the cleaned, filtered data
            pipeline_xgb.fit(X_train_behavior, y_behavior_clean)
            ensemble.append(pipeline_xgb)

        config_models[behavior] = ensemble
        
    models_dict[config_str] = {
        'models': config_models,
        'feature_columns': X_train.columns
    }
    print(f"  Finished training for this configuration.")
    del X_train, y_train, X_train_np, X_train_behavior, y_behavior_clean
    gc.collect()

print("\n--- Model Training Complete ---")

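<h2 style="color: #e3b341; border-bottom: 2px solid #30363d; padding-bottom: 8px; margin-top: 40px;">
6. Prediction & Intelligent Post-Processing
</h2>
<p style="line-height: 1.6;">
With our models trained, we can now generate predictions on the test set. Getting raw probabilities is only half the battle; we must convert them into clean, valid submission events.
</p>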

<div style="background-color:#161b22; color:#c9d1d9; border: 1px solid #30363d; padding: 15px; border-radius: 5px; margin: 20px 0;">
<h3 style="color:#58a6ff;">6.1. Post-Processing Pipeline</h3>
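<p style="line-height: 1.6;">This function takes the frame-wise probabilities and applies four heuristic steps to refine them:</p>
<ol style="padding-left: 20px; margin-top: 15px;">
<li style="margin-bottom: 10px;"><strong>Temporal Smoothing:</strong> A centered rolling average (window of 5 frames) is applied to the probabilities to reduce frame-to-frame jitter.</li>
<li style="margin-bottom: 10px;"><strong>Thresholding:</strong> A fixed probability threshold (0.5 by default) converts the smoothed probabilities into binary predictions.</li>
<li style="margin-bottom: 10px;"><strong>Event Coalescing:</strong> Consecutive positive frames are grouped into [start_frame, stop_frame] events, where stop_frame is the first frame after the run.</li>
<li style="margin-bottom: 10px;"><strong>Noise Filtering:</strong> Events shorter than min_duration frames (4 by default) are removed, as they are likely to be false positives.</li>
</ol>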
</div>
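As a rough illustration of the four steps above, here is a minimal, self-contained sketch on a toy probability series. The helper name `probs_to_events` and the toy values are hypothetical and are not part of the notebook's pipeline; the window, threshold, and minimum duration simply reuse the example values from the list.

```python
import numpy as np
import pandas as pd

def probs_to_events(probs, frames, window=5, threshold=0.5, min_duration=4):
    """Toy version of the four steps: smooth, threshold, coalesce, filter."""
    # 1. Temporal smoothing with a centered rolling mean
    smoothed = pd.Series(probs).rolling(window=window, min_periods=1, center=True).mean()
    # 2. Thresholding into 0/1 predictions
    preds = (smoothed > threshold).to_numpy().astype(int)
    # 3. Event coalescing: pad with zeros so runs touching either edge are closed
    edges = np.diff(np.concatenate(([0], preds, [0])))
    starts = np.flatnonzero(edges == 1)
    stops = np.flatnonzero(edges == -1)  # exclusive end positions
    # 4. Noise filtering: drop runs shorter than min_duration frames
    events = []
    for s, e in zip(starts, stops):
        if e - s >= min_duration:
            events.append((int(frames[s]), int(frames[e - 1]) + 1))
    return events

# One isolated spike (position 1) and one sustained run (positions 7 to 12)
toy_probs = [0.1, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.8,
             0.9, 0.9, 0.8, 0.9, 0.9, 0.1, 0.1, 0.1]
toy_frames = np.arange(100, 100 + len(toy_probs))
events = probs_to_events(toy_probs, toy_frames)
print(events)
```

In this toy case the isolated spike is averaged away by the smoothing step, and only the sustained run survives as a single event.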

def post_process_predictions(probs_df, metadata_df, threshold=0.5, min_duration=4):
    """
    Converts frame-wise probabilities into a submission-ready dataframe of events.
    """
    submission_events = []
    
    if probs_df.empty:
        return pd.DataFrame()
        
 
    # 1. Temporal smoothing: centered rolling mean over 5 frames
    smoothed_probs = probs_df.rolling(window=5, min_periods=1, center=True).mean()

    # Assumes probs_df and metadata_df describe the same frames in the same row order
    frames = metadata_df['video_frame'].to_numpy()
    last_stop = metadata_df['video_frame'].max() + 1

    for behavior in smoothed_probs.columns:
        # 2. Thresholding into binary predictions
        predictions = (smoothed_probs[behavior] > threshold).to_numpy().astype(int)

        # 3. Event coalescing: pad with zeros so that runs touching the first or
        #    last frame still produce a correctly matched start/stop pair
        edges = np.diff(np.concatenate(([0], predictions, [0])))
        start_idx = np.flatnonzero(edges == 1)
        stop_idx = np.flatnonzero(edges == -1)  # exclusive end positions

        start_frames = frames[start_idx]
        # A run that reaches the final row stops one frame past the last frame
        stop_frames = np.append(frames, last_stop)[stop_idx]

        # 4. Noise filtering: keep only events of at least min_duration frames
        for start, stop in zip(start_frames, stop_frames):
            if stop - start >= min_duration:
                event = {
                    'video_id': metadata_df['video_id'].iloc[0],
                    'agent_id': metadata_df['agent_id'].iloc[0],
                    'target_id': metadata_df['target_id'].iloc[0],
                    'action': behavior,
                    'start_frame': start,
                    'stop_frame': stop
                }
                submission_events.append(event)
                
    return pd.DataFrame(submission_events)

<div style="background-color:#161b22; color:#c9d1d9; border: 1px solid #30363d; padding: 15px; border-radius: 5px; margin: 20px 0;">
<h3 style="color:#58a6ff;">6.2. The Main Prediction Loop</h3>
This loop mirrors the training process but applies the trained models to the test data.
</div>
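Two small mechanics in the loop below are easy to miss: aligning test-time features to the columns seen during training, and averaging the ensemble members' probabilities. The following is a minimal sketch of both on toy data; the feature names and probability values are hypothetical placeholders, not outputs of the actual models.

```python
import numpy as np
import pandas as pd

# Hypothetical column list saved at training time
feature_cols = ["dist_nose_nose", "speed_agent", "speed_target"]

# Hypothetical test-time features: one training column is missing,
# one unknown column is present, and the column order differs
features = pd.DataFrame({
    "speed_agent": [0.2, 0.4],
    "dist_nose_nose": [5.0, 3.0],
    "extra_feature": [1.0, 1.0],
})

# reindex restores the training column order, drops unknown columns,
# and creates missing columns as NaN (replaced by 0 here)
aligned = features.reindex(columns=feature_cols).fillna(0)

# Hypothetical positive-class probabilities from two models for the same two frames
model_probs = [np.array([0.2, 0.8]), np.array([0.4, 0.6])]
ensemble_probs = np.mean(model_probs, axis=0)  # element-wise mean across models

print(list(aligned.columns))
print(ensemble_probs)
```

A plain mean weights every model equally; a weighted average is a possible variation if one model validates clearly better than the other.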

all_submissions = []

print("--- Starting Prediction on Test Set ---")

for config_str in models_dict.keys():
    
    body_parts = json.loads(config_str)
    if len(body_parts) > 10:
        body_parts = [bp for bp in body_parts if bp not in DROP_BODY_PARTS]
    
    print(f"\nPredicting for configuration with {len(body_parts)} body parts...")
    
    config_test_df = test_df[test_df['body_parts_tracked'] == config_str]
    
    if config_test_df.empty:
        continue
        
    trained_models = models_dict[config_str]['models']
    feature_cols = models_dict[config_str]['feature_columns']
    
    test_generator = generate_mouse_data(config_test_df, mode='test')
    
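    for _, tracking_data, metadata, actions in tqdm(test_generator, total=len(config_test_df)*6):

        features = create_pair_features(tracking_data, body_parts)

        # Align with the training columns: extra columns are dropped, missing ones
        # are added, and every NaN becomes 0 before the pipelines' imputer runs
        features = features.reindex(columns=feature_cols).fillna(0)

        X_test_np = features.values

        probs_df = pd.DataFrame(index=features.index)

        for behavior, ensemble in trained_models.items():
            # Only score behaviors listed in `actions` for this sample
            if behavior in actions:
                # Average the positive-class probability across the ensemble members
                behavior_probs = [model.predict_proba(X_test_np)[:, 1] for model in ensemble]
                probs_df[behavior] = np.mean(behavior_probs, axis=0)

        # Convert frame-wise probabilities into start/stop events
        sub_part = post_process_predictions(probs_df, metadata)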
        if not sub_part.empty:
            all_submissions.append(sub_part)

print("\n--- Prediction Complete ---")


if all_submissions:
    submission_df = pd.concat(all_submissions, ignore_index=True)
    submission_df = submission_df.sort_values(by=['video_id', 'agent_id', 'target_id', 'start_frame'])
    submission_df = submission_df.drop_duplicates(subset=['video_id', 'agent_id', 'target_id', 'action', 'start_frame'])
    # Reset the index so that row_id in the CSV runs 0..N-1 in sorted order
    submission_df = submission_df.reset_index(drop=True)
else:
    submission_df = pd.DataFrame(columns=['video_id', 'agent_id', 'target_id', 'action', 'start_frame', 'stop_frame'])

submission_df.to_csv('submission.csv', index_label='row_id')
print(f"\nSubmission file created with {len(submission_df)} events.")
display(submission_df.head())

<div style="background-color:#0d1117; color:#c9d1d9; border: 2px solid #30363d; border-radius: 10px; padding: 25px; font-family: sans-serif;"><h2 style="color: #58a6ff; border-bottom: 3px solid #238636; padding-bottom: 10px; margin-bottom: 20px;">🏁 7. Conclusion and Future Directions</h2><div style="background-color:#161b22; padding: 15px; border-left: 5px solid #58a6ff; border-radius: 5px;">This notebook presented an end-to-end, memory-efficient pipeline for the MABe challenge. We tackled the core challenges of high data volume, complex dynamic behaviors, data heterogeneity, and class imbalance with a series of targeted strategies.</div>
<div style="margin-top: 25px;"><h3 style="color: #39d353; margin-bottom: 15px;">Key Pillars of Our Approach:</h3><ul style="list-style: none; padding-left: 0;"><li style="background-color: #161b22; border: 1px solid #30363d; padding: 12px; border-radius: 5px; margin-bottom: 8px;">📦 A <strong>generator-based pipeline</strong> to handle massive data.</li><li style="background-color: #161b22; border: 1px solid #30363d; padding: 12px; border-radius: 5px; margin-bottom: 8px;">🧩 A <strong>model-per-modality</strong> strategy to adapt to heterogeneous data sources.</li><li style="background-color: #161b22; border: 1px solid #30363d; padding: 12px; border-radius: 5px; margin-bottom: 8px;">📈 <strong>Advanced feature engineering</strong> focusing on kinematics and social interaction dynamics.</li><li style="background-color: #161b22; border: 1px solid #30363d; padding: 12px; border-radius: 5px; margin-bottom: 8px;">⚖️ A <strong>stratified, ensemble model</strong> to handle class imbalance and improve generalization.</li><li style="background-color: #161b22; border: 1px solid #30363d; padding: 12px; border-radius: 5px;">🧼 <strong>Intelligent post-processing</strong> to clean and refine predictions into valid events.</li></ul></div>
<div style="margin-top: 30px;"><h3 style="color: #e3b341; border-bottom: 2px solid #30363d; padding-bottom: 8px; margin-bottom: 20px;">🚀 Potential Future Improvements</h3><div style="background-color:#161b22; padding: 15px; border-radius: 5px; border: 1px solid #30363d;">While this notebook provides a working baseline, several avenues could be explored for higher performance:<ol style="padding-left: 20px; margin-top: 15px;"><li style="margin-bottom: 10px;"><strong>Sequence Models:</strong> The current frame-by-frame approach with rolling windows captures only local context. True sequence models such as LSTMs, GRUs, or Transformers could learn longer-term temporal dependencies more effectively.</li><li style="margin-bottom: 10px;"><strong>Hyperparameter Optimization:</strong> We used fixed, hand-set hyperparameters. A systematic search with a library like Optuna or Hyperopt could tune the models for each behavior and data modality, which may yield further gains.</li><li style="margin-bottom: 10px;"><strong>Advanced Post-Processing:</strong> The post-processing could be made more sophisticated. For example, using a Hidden Markov Model (HMM) to smooth transitions between "behavior" and "no-behavior" states could produce more realistic event boundaries.</li><li style="margin-bottom: 10px;"><strong>Cross-Validation Strategy:</strong> For more robust evaluation and model selection, a GroupKFold cross-validation strategy (grouping by video_id) should be implemented to prevent data leakage between frames of the same video.</li></ol></div></div>
<div style="margin-top: 30px; padding: 15px; background-color: #0c1a1f; border-top: 3px solid #39d353; border-radius: 5px; text-align: center;">Thank you for following along! This challenging competition offers a fantastic opportunity to apply and refine a wide range of machine learning techniques.</div><div style="text-align: right; margin-top: 25px; font-size: 0.9em; color: #8b949e;"><em>Created by Ozan M.</em></div></div>
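
The cross-validation idea above can be sketched with scikit-learn's `GroupKFold`. This is a minimal illustration on hypothetical toy data (the arrays, video names, and fold count are assumptions); it is not wired into the notebook's training loop.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: 12 frames drawn from 4 videos (3 frames each)
X = np.arange(24, dtype=float).reshape(12, 2)
y = np.array([0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0])
video_ids = np.repeat(["vid_a", "vid_b", "vid_c", "vid_d"], 3)

cv = GroupKFold(n_splits=4)
folds = list(cv.split(X, y, groups=video_ids))

for fold, (train_idx, val_idx) in enumerate(folds):
    train_videos = set(video_ids[train_idx])
    val_videos = set(video_ids[val_idx])
    # No video contributes frames to both sides of a split
    assert train_videos.isdisjoint(val_videos)
    print(fold, sorted(val_videos))
```

Because neighboring frames of one video are highly correlated, keeping each video entirely on one side of the split gives a validation score that better reflects performance on unseen videos.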