# Exploring the ESA Conjunction Data Messages (CDM) Dataset

I found a really interesting dataset from the European Space Agency (ESA). They released real conjunction data messages (CDMs) from their space debris monitoring operations as part of a collision avoidance challenge. This is operational data from 2015-2019 that was used to make real decisions about whether satellites needed to dodge debris.

The idea is that when two objects in orbit are predicted to pass close to each other, the tracking system generates a series of CDMs over the days leading up to the closest approach. Each CDM refines the prediction as more tracking data comes in. The question is -- can we predict whether a conjunction will be dangerous enough to require action?

Let's see what we're working with.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

sns.set_theme(style='whitegrid', palette='deep')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['figure.dpi'] = 100

DATA_DIR = Path('../data/cdm')
print('Files available:', list(DATA_DIR.glob('*')))

In [None]:
# Load the training data
train = pd.read_csv(DATA_DIR / 'train_data.csv')
print(f'Training set: {train.shape[0]:,} rows, {train.shape[1]} columns')
print(f'\nColumn names ({len(train.columns)}):')
for i, col in enumerate(train.columns):
    print(f'  {i:3d}. {col}')

## First look at the data structure

103 columns is a lot. Let me get a sense of what types of features we have before diving into individual ones.

In [None]:
# Quick overview of data types and missing values
print('Data types:')
print(train.dtypes.value_counts())
print(f'\nTotal missing values: {train.isnull().sum().sum():,}')
print(f'Columns with any missing: {(train.isnull().sum() > 0).sum()}')

# Which columns have missing data?
missing = train.isnull().sum()
missing_cols = missing[missing > 0].sort_values(ascending=False)
if len(missing_cols) > 0:
    print(f'\nMissing value breakdown:')
    for col, count in missing_cols.items():
        print(f'  {col}: {count:,} ({100*count/len(train):.1f}%)')

In [None]:
train.head(3)

In [None]:
train.describe()

## Understanding the event structure

Each row is a single CDM (conjunction data message), but multiple CDMs belong to the same "event" -- meaning the same pair of objects approaching each other. The CDMs within an event form a time series as predictions get refined closer to the time of closest approach (TCA).

Let me figure out how events are structured.

In [None]:
# Check if there's an event_id column or similar
event_cols = [c for c in train.columns if 'event' in c.lower() or 'id' in c.lower()]
print('Columns that might identify events:', event_cols)

# Also look for time-related columns
time_cols = [c for c in train.columns if 'time' in c.lower() or 'tca' in c.lower() or 'date' in c.lower()]
print('Time-related columns:', time_cols)

In [None]:
# Let's figure out how many unique events there are
# and how many CDMs per event
event_col = None
for candidate in ['event_id', 'event', 'conjunction_id']:
    if candidate in train.columns:
        event_col = candidate
        break

if event_col is None:
    # Fall back to cardinality: we expect ~13k unique events in 162k rows
    for col in train.columns:
        nunique = train[col].nunique()
        if 10000 < nunique < 20000:
            print(f'Potential event column: {col} ({nunique:,} unique values)')
    print('\nNo exact match found -- pick one of the candidates above by hand.')
else:
    n_events = train[event_col].nunique()
    print(f'Using event column: {event_col} ({n_events:,} events, '
          f'{len(train) / n_events:.1f} CDMs per event on average)')

In [None]:
# Unique value counts for every column, to help spot the event grouping
# column by its cardinality (the CDMs-per-event distribution comes after that)
print('Unique value counts for each column:')
for col, n in train.nunique().items():
    print(f'  {col}: {n:,} unique')
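
Here is a minimal sketch of the grouping this section is after, run on a tiny made-up frame rather than the real file. The column names `event_id` and `time_to_tca` are assumptions on my part; swap in whatever the checks above actually turn up.

```python
import pandas as pd

# Toy stand-in for the CDM table; column names are assumed, not confirmed.
toy = pd.DataFrame({
    'event_id':    [0, 0, 0, 1, 1, 2],
    'time_to_tca': [5.2, 3.1, 0.9, 6.0, 2.4, 1.5],
})

# Number of CDMs in each event
cdms_per_event = toy.groupby('event_id').size()
print(cdms_per_event.describe())

# Position of each CDM within its event, ordered from earliest to latest update
toy = toy.sort_values(['event_id', 'time_to_tca'], ascending=[True, False])
toy['cdm_index'] = toy.groupby('event_id').cumcount()
print(toy)
```

Sorting by descending `time_to_tca` puts the earliest prediction first, so `cdm_index` 0 is always the first CDM issued for an event.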

## Identifying the target variable

The challenge ran on ESA's Kelvins competition platform and was about predicting collision risk. Let me find the target column and understand the class distribution. Since this is a safety-critical problem, I expect the classes to be heavily imbalanced -- most conjunctions are safe.

In [None]:
# Look for target/risk/label columns
target_candidates = [c for c in train.columns if any(
    kw in c.lower() for kw in ['risk', 'label', 'target', 'class', 'danger', 'collision']
)]
print('Potential target columns:', target_candidates)

# For each candidate, show value distribution
for col in target_candidates:
    print(f'\n{col}:')
    print(train[col].value_counts())
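
Since the imbalance really lives at the event level, one way to measure it is to label each event by its last CDM. The sketch below uses made-up rows and assumes that a `risk` column holds log10 of the collision probability, with -6 as the high-risk cutoff. Both are assumptions to check against the challenge description, not something confirmed by the data yet.

```python
import pandas as pd

# Made-up rows; `event_id`, `time_to_tca`, and `risk` are assumed column names,
# and `risk` is assumed to be log10(collision probability).
toy = pd.DataFrame({
    'event_id':    [0, 0, 1, 1, 2, 2],
    'time_to_tca': [4.0, 1.0, 5.0, 0.5, 3.0, 2.0],
    'risk':        [-7.5, -5.2, -9.0, -30.0, -6.5, -8.1],
})

HIGH_RISK_THRESHOLD = -6.0  # assumed cutoff, i.e. a probability of 1e-6

# Last CDM of each event = the row with the smallest time_to_tca
last_cdm = toy.loc[toy.groupby('event_id')['time_to_tca'].idxmin()]

# Event-level label and class balance
labels = last_cdm.set_index('event_id')['risk'] >= HIGH_RISK_THRESHOLD
print(labels.value_counts(normalize=True))
```

Counting labels per event instead of per row keeps long events from dominating the class balance.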

In [None]:
# Let's also look at miss_distance -- this is the key physical quantity
miss_cols = [c for c in train.columns if 'miss' in c.lower() or 'distance' in c.lower()]
print('Miss distance related columns:', miss_cols)

for col in miss_cols:
    if train[col].dtype in ['float64', 'float32', 'int64']:
        print(f'\n{col}:')
        print(f'  min:    {train[col].min():.6f}')
        print(f'  median: {train[col].median():.6f}')
        print(f'  mean:   {train[col].mean():.6f}')
        print(f'  max:    {train[col].max():.6f}')
        print(f'  std:    {train[col].std():.6f}')

## Miss distance distribution

Miss distance is the key quantity here: it's the predicted separation between the two objects at closest approach. The smaller the miss distance, the scarier the conjunction. Let me visualize the distribution.

In [None]:
# Find the main miss distance column
miss_col = None
for candidate in ['miss_distance', 'MISS_DISTANCE', 'miss_dist']:
    if candidate in train.columns:
        miss_col = candidate
        break

# If we can't find an exact match, look for the most likely one
if miss_col is None:
    for col in miss_cols:
        if train[col].dtype in ['float64', 'float32']:
            miss_col = col
            break

if miss_col:
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Raw distribution
    axes[0].hist(train[miss_col].dropna(), bins=100, edgecolor='black', alpha=0.7)
    axes[0].set_xlabel('Miss Distance')
    axes[0].set_ylabel('Count')
    axes[0].set_title('Miss Distance Distribution (raw)')
    
    # Log-scale -- miss distances often span orders of magnitude
    log_miss = np.log10(train[miss_col].dropna().clip(lower=1e-10))
    axes[1].hist(log_miss, bins=100, edgecolor='black', alpha=0.7, color='coral')
    axes[1].set_xlabel('log10(Miss Distance)')
    axes[1].set_ylabel('Count')
    axes[1].set_title('Miss Distance Distribution (log scale)')
    
    plt.tight_layout()
    plt.show()
else:
    print('Could not identify miss distance column. Columns available:')
    print([c for c in train.columns if train[c].dtype in ['float64', 'float32']])
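
To see why the log-scale panel is worth having, here is a small sketch comparing skewness before and after the transform. The values are synthetic log-normal draws standing in for miss distances, so the numbers say nothing about the real data; only the mechanics carry over.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in: log-normal values spanning several orders of magnitude
miss = pd.Series(rng.lognormal(mean=8.0, sigma=1.5, size=10_000))

log_miss = np.log10(miss.clip(lower=1e-10))  # same guard as the plotting cell

print(f'raw skew:   {miss.skew():.2f}')
print(f'log10 skew: {log_miss.skew():.2f}')
print(f'orders of magnitude covered: {log_miss.max() - log_miss.min():.1f}')
```

If the real column behaves similarly, the raw histogram will be dominated by one bar near zero while the log10 version looks roughly symmetric.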

## Time-to-TCA analysis

One of the most interesting aspects of this dataset is the temporal structure. CDMs arrive at different times before the predicted closest approach. Earlier CDMs carry more uncertainty; later ones are more precise. How does the prediction quality evolve over time?

In [None]:
# Find the time_to_tca column
tca_col = None
for candidate in ['time_to_tca', 'TIME_TO_TCA', 'days_to_tca', 't_to_tca']:
    if candidate in train.columns:
        tca_col = candidate
        break

# If not found, search more broadly
if tca_col is None:
    for col in time_cols:
        if train[col].dtype in ['float64', 'float32'] and train[col].min() >= 0:
            print(f'Candidate time column: {col}')
            print(f'  Range: [{train[col].min():.4f}, {train[col].max():.4f}]')
            print(f'  Mean: {train[col].mean():.4f}')
            tca_col = col  # take the first reasonable one
            break

if tca_col:
    print(f'Using time-to-TCA column: {tca_col}')
    
    fig, ax = plt.subplots(figsize=(12, 5))
    ax.hist(train[tca_col].dropna(), bins=100, edgecolor='black', alpha=0.7, color='steelblue')
    ax.set_xlabel('Time to TCA (days)')
    ax.set_ylabel('Number of CDMs')
    ax.set_title('When do CDMs arrive relative to closest approach?')
    ax.axvline(x=1.0, color='red', linestyle='--', alpha=0.7, label='1 day before TCA')
    ax.axvline(x=3.0, color='orange', linestyle='--', alpha=0.7, label='3 days before TCA')
    ax.legend()
    plt.tight_layout()
    plt.show()
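
One way to probe how prediction quality evolves is to bin CDMs by whole days before TCA and look at the spread of the estimates in each bin. The data below is synthetic, with noise that grows with lead time by construction, so the output only illustrates the mechanics. The column names are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 5_000
time_to_tca = rng.uniform(0, 7, size=n)  # days before closest approach
# Synthetic estimate: fixed true value plus noise that grows with lead time
estimate = 1000 + rng.normal(0, 1, size=n) * 50 * (1 + time_to_tca)
toy = pd.DataFrame({'time_to_tca': time_to_tca, 'miss_distance': estimate})

# Bin CDMs into whole days before TCA: [0, 1), [1, 2), ...
toy['day_bin'] = pd.cut(toy['time_to_tca'], bins=np.arange(0, 8), right=False)

# Per-bin count, median, and spread of the estimates
spread = toy.groupby('day_bin', observed=True)['miss_distance'].agg(['count', 'median', 'std'])
print(spread)
```

On the real data, the interesting question is whether `std` actually shrinks toward TCA or stays flat.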

## Object types

Not all space objects are the same. There are active satellites (payloads), rocket bodies left over from launches, and debris fragments from collisions or explosions. The type of objects involved in a conjunction matters a lot -- two active satellites can both maneuver, but debris can't dodge.

Let me see what object types are in the data.

In [None]:
# Find object type columns
type_cols = [c for c in train.columns if 'type' in c.lower() or 'object' in c.lower()]
print('Object-related columns:', type_cols)

for col in type_cols:
    if train[col].nunique() < 20:  # likely categorical
        print(f'\n{col} value counts:')
        print(train[col].value_counts())

In [None]:
# If we found object types, let's see how risk varies by object type
obj_type_cols = [c for c in type_cols if train[c].nunique() < 10]
# A cross-tab needs a discrete target, so skip continuous candidates
discrete_targets = [c for c in target_candidates if train[c].nunique() < 10]

if obj_type_cols and discrete_targets:
    obj_col = obj_type_cols[0]
    tgt_col = discrete_targets[0]
    
    # Cross-tabulate object type vs risk
    ct = pd.crosstab(train[obj_col], train[tgt_col], normalize='index')
    print(f'Risk rate by {obj_col}:')
    print(ct)
    
    ct.plot(kind='bar', stacked=True, figsize=(10, 5))
    plt.title(f'Risk Distribution by {obj_col}')
    plt.ylabel('Proportion')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
else:
    print('No low-cardinality object type / target pair found.')
    print('A continuous target needs binning before it can be cross-tabulated.')
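
One caveat with row-level counts: an event with many CDMs contributes its object type many times. A small sketch of the difference follows, using made-up rows. `c_object_type` and `high_risk` are assumed column names, and the sketch assumes the object type is constant within an event.

```python
import pandas as pd

# Made-up rows; `event_id`, `c_object_type`, and `high_risk` are assumed names.
toy = pd.DataFrame({
    'event_id':      [0, 0, 0, 0, 1, 2, 3, 3],
    'c_object_type': ['DEBRIS'] * 4 + ['PAYLOAD'] + ['DEBRIS'] + ['ROCKET BODY'] * 2,
    'high_risk':     [True] * 4 + [False] * 4,
})

# Row-level shares over-weight events that happen to have many CDMs
row_share = toy['c_object_type'].value_counts(normalize=True)

# One row per event gives each conjunction equal weight
per_event = toy.drop_duplicates('event_id')
event_share = per_event['c_object_type'].value_counts(normalize=True)
risk_rate = per_event.groupby('c_object_type')['high_risk'].mean()

print(row_share, event_share, risk_rate, sep='\n\n')
```

Here debris makes up 62.5% of rows but only 50% of events, which is the kind of gap to watch for before reading too much into the bar chart.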

## Correlation analysis

With 103 features, there's probably a lot of redundancy. Let me look at correlations between the numeric features to understand the structure better and identify which features might be most predictive.

In [None]:
# Get numeric columns only
numeric_cols = train.select_dtypes(include=[np.number]).columns.tolist()
print(f'Number of numeric features: {len(numeric_cols)}')

# Correlation matrix -- might be big so let's just look at the top correlations
corr = train[numeric_cols].corr()

# Find the most correlated feature pairs (excluding self-correlation)
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
high_corr = []
for col in upper.columns:
    for idx in upper.index:
        val = upper.loc[idx, col]
        if abs(val) > 0.9:
            high_corr.append((idx, col, val))

high_corr.sort(key=lambda x: abs(x[2]), reverse=True)
print(f'\nHighly correlated pairs (|r| > 0.9): {len(high_corr)}')
for a, b, r in high_corr[:20]:
    print(f'  {a} <-> {b}: {r:.4f}')

In [None]:
# Heatmap of a subset of interesting features
# Pick features that seem most relevant to conjunction risk
interesting_keywords = ['miss', 'mahalanobis', 'relative', 'collision', 'risk',
                         'time', 'tca', 'cov', 'semi', 'ecc', 'inc']

interesting_cols = [c for c in numeric_cols if any(
    kw in c.lower() for kw in interesting_keywords
)]

if len(interesting_cols) > 3:
    # Limit to ~20 for readability
    interesting_cols = interesting_cols[:20]
    
    fig, ax = plt.subplots(figsize=(14, 12))
    sub_corr = train[interesting_cols].corr()
    sns.heatmap(sub_corr, annot=True, fmt='.2f', cmap='RdBu_r', center=0,
                square=True, ax=ax, cbar_kws={'shrink': 0.8})
    ax.set_title('Correlation Matrix - Key Features')
    plt.tight_layout()
    plt.show()
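
If the redundancy turns out to be real, a simple way to act on it is to drop one column from each highly correlated pair. This is a sketch on synthetic columns, not a decision about which CDM features to keep; the 0.9 threshold just mirrors the one used above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
a = rng.normal(size=500)
toy = pd.DataFrame({
    'a': a,
    'a_scaled': 3 * a + rng.normal(scale=0.01, size=500),  # near-duplicate of a
    'b': rng.normal(size=500),                              # independent
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of every pair above the threshold
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = toy.drop(columns=to_drop)
print('Dropped:', to_drop)
print('Kept:', list(reduced.columns))
```

Which member of a pair gets dropped depends purely on column order here, so for the real features it is worth checking the pairs by hand first.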

## Looking at individual event trajectories

This is the part I'm most curious about. Each event is a time series of CDM updates. As we get closer to the time of closest approach, the miss distance estimate should converge. But does it converge smoothly, or does it jump around?

Let me pick a few events and plot their CDM sequences.

In [None]:
# We need to identify the event grouping and then plot trajectories
# Let's check what column groups CDMs into events
print('Columns and sample values from first 5 rows:')
print(train.iloc[:5, :10].to_string())
print('...')
print(train.iloc[:5, -10:].to_string())

In [None]:
# Once we identify the event column, plot some example trajectories
# This is a placeholder that adapts to whatever the event column turns out to be

# Try to find grouping -- look for columns where consecutive rows share values
# (CDMs from same event should be grouped together in the CSV)
potential_group_cols = []
for col in train.columns:
    if train[col].dtype == 'object' or train[col].nunique() < 20000:
        # Check if consecutive rows often share values
        same_as_next = (train[col] == train[col].shift(1)).mean()
        if 0.5 < same_as_next < 0.99:
            potential_group_cols.append((col, same_as_next, train[col].nunique()))

potential_group_cols.sort(key=lambda x: x[2])
print('Potential grouping columns (sorted by unique count):')
for col, pct, nunique in potential_group_cols:
    print(f'  {col}: {nunique:,} unique, {pct:.1%} consecutive matches')

In [None]:
# Plot CDM trajectories for a sample of events
# This will adapt once we know the exact column names

def plot_event_trajectories(df, event_col, time_col, value_col, n_events=6):
    """Plot how a value evolves across CDM updates for sample events."""
    # Pick events with at least 5 CDMs for interesting trajectories
    event_sizes = df.groupby(event_col).size()
    good_events = event_sizes[event_sizes >= 5].index
    sample = np.random.choice(good_events, size=min(n_events, len(good_events)), replace=False)
    
    # Size the grid to the sample instead of hard-coding 2x3
    n_cols = 3
    n_rows = max(1, int(np.ceil(len(sample) / n_cols)))
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 4 * n_rows), squeeze=False)
    axes = axes.flatten()
    
    for ax, eid in zip(axes, sample):
        event_data = df[df[event_col] == eid].sort_values(time_col, ascending=False)
        ax.plot(event_data[time_col], event_data[value_col], 'o-', markersize=4)
        ax.set_xlabel('Time to TCA (days)')
        ax.set_ylabel(value_col)
        ax.set_title(f'Event {eid} ({len(event_data)} CDMs)')
        ax.invert_xaxis()  # Time counts down to TCA
    
    # Hide any panels left over when fewer events were sampled
    for ax in axes[len(sample):]:
        ax.set_visible(False)
    
    plt.suptitle(f'CDM Trajectories: How {value_col} evolves before closest approach', y=1.02)
    plt.tight_layout()
    plt.show()

# We'll call this once we confirm the column names
print('Ready to plot -- just need to confirm event/time/value column names')
print(f'Candidate event cols: {[x[0] for x in potential_group_cols[:3]]}')
print(f'Candidate time col: {tca_col}')
print(f'Candidate value col: {miss_col}')

In [None]:
# Let's just try plotting with our best guesses
# Prefer the event column found by name; otherwise fall back to the grouping
# candidate with the most unique values (closest to the ~13k events we expect)
event_id_col = event_col or (potential_group_cols[-1][0] if potential_group_cols else None)

if event_id_col and tca_col and miss_col:
    print(f'Plotting with: event={event_id_col}, time={tca_col}, value={miss_col}')
    np.random.seed(42)
    plot_event_trajectories(train, event_id_col, tca_col, miss_col)
else:
    print('Need to manually identify column names first.')
    print('Available columns:', list(train.columns))
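
The plots answer "smooth or jumpy?" by eye. A rough way to put a number on it is the average step in log10(miss distance) between consecutive CDMs of the same event. This is just one possible metric, sketched on made-up events with assumed column names.

```python
import numpy as np
import pandas as pd

# Made-up events; column names are assumptions.
toy = pd.DataFrame({
    'event_id':      [0, 0, 0, 0, 1, 1, 1, 1],
    'time_to_tca':   [6.0, 4.0, 2.0, 0.5, 6.0, 4.0, 2.0, 0.5],
    'miss_distance': [1000., 1100., 1050., 1060., 1000., 100., 5000., 800.],
})

# Order each event from earliest to latest CDM
toy = toy.sort_values(['event_id', 'time_to_tca'], ascending=[True, False])

# Absolute change in log10(miss distance) between consecutive CDMs of an event
toy['log_miss'] = np.log10(toy['miss_distance'])
toy['abs_step'] = toy.groupby('event_id')['log_miss'].diff().abs()

# Mean step size per event: small = smooth convergence, large = jumpy
jumpiness = toy.groupby('event_id')['abs_step'].mean()
print(jumpiness)
```

Event 0 drifts by a few percent per update while event 1 swings by an order of magnitude, and the metric separates the two clearly.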

## Summary of CDM dataset

Initial takeaways from exploring this data:

1. **Scale**: ~162k CDM records across ~13k unique conjunction events
2. **Features**: 103 columns covering orbital elements, miss distance components, covariance matrices, relative velocities, and metadata
3. **Temporal structure**: Each event is a sequence of CDMs with decreasing time-to-TCA. Average ~12 CDMs per event.
4. **Class imbalance**: Expected to be severe -- most conjunctions are safe. This is realistic and important to handle correctly.
5. **Miss distances span orders of magnitude**: Will need log-scale treatment for regression.

Next steps:
- Look at the CelesTrak TLE data for the visualization side
- Understand the covariance matrix features (these encode prediction uncertainty)
- Start thinking about feature engineering for the XGBoost model
- Figure out the sequence padding/truncation strategy for the Transformer
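
For the padding/truncation question, one candidate strategy is to keep the last `max_len` CDMs of each event (the ones closest to TCA) and left-pad shorter events with zeros, returning a mask for the real entries. This is only a sketch of that option: the helper `pad_event_sequences` and the column names are hypothetical, and the data is made up.

```python
import numpy as np
import pandas as pd

def pad_event_sequences(df, event_col, time_col, feature_cols, max_len):
    """Return (X, mask): X is (n_events, max_len, n_features), mask marks real CDMs.

    Keeps the last `max_len` CDMs of each event (closest to TCA) and left-pads
    shorter events with zeros.
    """
    ordered = df.sort_values([event_col, time_col], ascending=[True, False])
    groups = ordered.groupby(event_col)
    X = np.zeros((groups.ngroups, max_len, len(feature_cols)), dtype=np.float32)
    mask = np.zeros((groups.ngroups, max_len), dtype=bool)
    for i, (_, g) in enumerate(groups):
        seq = g[feature_cols].to_numpy(dtype=np.float32)[-max_len:]  # truncate
        X[i, max_len - len(seq):] = seq                               # left-pad
        mask[i, max_len - len(seq):] = True
    return X, mask

# Made-up data; column names are assumptions
toy = pd.DataFrame({
    'event_id':      [0, 0, 0, 0, 1, 1],
    'time_to_tca':   [6.0, 4.0, 2.0, 0.5, 3.0, 1.0],
    'miss_distance': [10., 11., 12., 13., 20., 21.],
})
X, mask = pad_event_sequences(toy, 'event_id', 'time_to_tca', ['miss_distance'], max_len=3)
print(X[..., 0])
print(mask)
```

Left-padding keeps the most recent CDM in the final position for every event, which is convenient if the model reads its prediction off the last time step. Right-padding with an attention mask would work too.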