# Data Augmentation for Contrastive Learning (Astronomy)

This notebook implements domain-aware augmentations suitable for contrastive learning on astronomical time-series / multi-band observations. Each augmentation is provided as a function that accepts per-object time-series data (pandas DataFrame) and returns an augmented copy.

Assumptions: the input DataFrame should contain at least an identifier for the source (e.g. `object_id`), a time column (`time`, `jd`, `mjd`, or `obsdate`), a brightness column (`flux` or `mag`), and a filter/band column (`filter`). Each function performs checks and falls back gracefully if columns are missing.

<details>
<summary>üìå Cell Description: Helper Utilities for Detecting Key Columns and Sorting by Time</summary>

This cell defines small utility functions that make later analysis steps easier and more reliable. Because astronomical datasets can come from different sources and may use different naming conventions for important fields (such as flux, filter, time, or object ID), the first helper function automatically searches for these columns based on common keywords. This ensures that the rest of the pipeline can work consistently, even if the dataset uses slightly different names.

The second helper function sorts the dataset by time, which is essential for any analysis that involves the sequence of observations. Many astronomical measurements‚Äîsuch as brightness changes, periodic behavior, or transient events‚Äîdepend strongly on the order in which they were observed. Sorting by time ensures that later steps (such as augmentation, time-series processing, or modeling) operate on correctly ordered data.

These two utilities form a foundation for working with real-world astronomical datasets that often vary in structure, naming, and formatting.

---

### üîπ **Key Points (Simple & Attractive Explanation)**

- **Automatically detects important columns** using keyword searches:  
  - **Time column** ‚Üí for ordering observations  
  - **Flux/brightness column** ‚Üí for understanding object intensity  
  - **Filter/band column** ‚Üí indicates which optical filter the telescope used  
  - **Object ID column** ‚Üí identifies which measurements belong to the same celestial object  

- **Makes the pipeline flexible**, allowing it to work with datasets that use different naming conventions (e.g., ‚Äúmag‚Äù, ‚Äúflux_calib‚Äù, "obsdate").

- **Returns a dictionary of detected column names**, used later for augmentation, modeling, and visualization.

- **Provides a function to sort observations by time**, ensuring that sequences are analyzed in the correct chronological order.

- **Protects against errors**‚Äîif no valid time column exists, the data is returned unchanged.

- **Supports time-dependent analysis**, such as:  
  - variability studies  
  - transient detection  
  - light-curve generation  
  - sequence modeling (RNN/CNN/LSTM)

- **Improves data reliability**, especially when working with raw telescope outputs that may not be pre-sorted.

---

### ‚≠ê **Why This Cell Is Important for the Research**
Astronomical datasets often vary in how they name columns, and many real datasets are not sorted by time. These helper utilities solve two important problems:

1. **Column Identification**  
   Different surveys use different naming conventions. Automatically detecting flux, time, filter, and ID columns ensures that the rest of the research pipeline functions correctly without requiring manual adjustments. This makes the workflow robust and reusable across many datasets.

2. **Ti**


In [None]:
# Imports and helper utilities
import os
import numpy as np
import pandas as pd
from copy import deepcopy
import random

# Small helper: find time / brightness / filter cols
def detect_columns(df):
    cols_lower = [c.lower() for c in df.columns]
    time_col = next((c for c in df.columns if c.lower() in ['time','obsdate','jd','mjd','epoch']), None)
    flux_col = next((c for c in df.columns if c.lower() in ['flux','flux_calib','mag','mag_calib','instrumental_flux']), None)
    filter_col = next((c for c in df.columns if 'filter' in c.lower() or 'band' in c.lower()), None)
    id_col = next((c for c in df.columns if c.lower() in ['object_id','objid','id','source_id','ipac_gid']), None)
    # return 'filter' key (was 'filt' previously) for consistency with augmentation functions
    return dict(time=time_col, flux=flux_col, filter=filter_col, id=id_col)

# Utility: ensure DataFrame sorted by time for each object
def sort_by_time(df, time_col):
    if time_col is None or time_col not in df.columns:
        return df
    try:
        return df.sort_values(by=[time_col]).reset_index(drop=True)
    except Exception:
        return df

## 4.1 Temporal Augmentations
- Temporal jittering: add small random noise to timestamps.
- Random time shifts: shift the whole sequence by a random amount.
- Random cropping: return a contiguous partial subsequence to simulate partial coverage.

<details>
<summary>üìå Cell Description: Time-Based Data Augmentation Functions for Astronomical Time-Series</summary>

This cell defines three data-augmentation functions specifically designed for **astronomical time-series data**, such as sequences of observations for the same celestial object taken at different times. Because real telescope data is often sparse, irregular, or limited in length, augmentation helps create realistic variations of existing sequences. This is especially valuable when training machine-learning models, which generally perform better when they have access to more diversity in the training data.

Each function carefully modifies the time dimension while preserving the scientific meaning of the sequence. These augmentations are inspired by real observational uncertainties, scheduling variations, and incomplete observation windows commonly seen in astronomy. All functions include logic to handle both datetime timestamps and numerical time formats.

---

### üîπ **1. `temporal_jitter()` ‚Äì Adds Small Random Shifts to Individual Time Points**
- Adds a tiny random ‚Äújitter‚Äù to each timestamp.  
- Mimics natural timing uncertainties found in real observations (e.g., slight delays, instrument timing jitter).  
- Uses the **median observation cadence** to scale the jitter realistically.  
- Handles both datetime and numeric time columns.  
- Keeps the overall sequence shape but makes it slightly irregular, just like real telescope sampling.  
- Returns a new DataFrame with slightly perturbed timestamps.

**Why it matters:**  
Astronomical observations rarely happen at perfectly regular intervals. This augmentation helps models become more robust to irregular timing.

---

### üîπ **2. `random_time_shift()` ‚Äì Shifts the Entire Time-Series Forward or Backward**
- Applies one random shift to **all** timestamps in the sequence.  
- Mimics real-world scenarios where the same pattern could occur earlier or later in time.  
- By default, shifts by up to ¬±50% of the sequence duration (or a custom range).  
- Works for datetime and numeric timestamps.  
- Does **not** distort the spacing between observations‚Äîonly their global position in time.

**Why it matters:**  
Many astronomical patterns (e.g., flux changes, variability curves) are meaningful regardless of when they occur. Time shifting increases data variety without changing scientific structure.

---

### üîπ **3. `random_crop()` ‚Äì Extracts a Random Subsection of the Sequence**
- Randomly selects a continuous segment of the time-series.  
- Ensures the cropped section contains at least a chosen percentage of the original length (default: 50%).  
- Simulates scenarios where telescopes observe only part of an event due to weather, scheduling constraints, or instrument downtime.  
- Produces realistic partial sequences for training.

**Why it matters:**  
Astronomical time-series data is often incomplete. Cropping trains models to handle missing segments and partial observations.

---

### ‚≠ê **Why These Functions Are Important for the Research**
Astronomical surveys often produce **irregular, incomplete, and sparsely sampled time-series**. Machine-learning models struggle when training data is limited or overly uniform. These augmentation functions:

- increase dataset size realistically,  
- improve model robustness,  
- simulate real observation variability,  
- help the model generalize to unseen temporal patterns,  
- prepare the pipeline for time-series or sequence-based ML tasks (e.g., RNNs, Transformers, CNN-LSTM models).

By modifying time in scientifically meaningful ways *without altering the underlying astrophysical behavior*, these augmentations strengthen the reliability and performance of the final model.

</details>


In [2]:
def temporal_jitter(df, time_col, sigma_fraction=0.01, seed=None):
    """Add Gaussian jitter to timestamps (fraction of median cadence).
    Handles datetime-like columns by converting to seconds since epoch, applying jitter,
    and converting back to datetimes. Returns DataFrame with same time dtype where possible."""
    if time_col is None or time_col not in df.columns:
        return df
    rng = np.random.default_rng(seed)
    # try to convert to datetime; if not possible, fall back to numeric as before
    try:
        times_dt = pd.to_datetime(df[time_col], errors='coerce')
        times_s = times_dt.astype('datetime64[ns]').astype('int64') / 1e9
        use_datetime = not times_dt.isna().all()
    except Exception:
        times_s = df[time_col].astype(float).values
        use_datetime = False
    times = np.array(times_s, dtype=float)
    diffs = np.diff(np.sort(times)) if len(times) > 1 else np.array([1.0])
    median_cadence = np.median(diffs) if len(diffs) > 0 else 1.0
    sigma = sigma_fraction * median_cadence
    jitter = rng.normal(loc=0.0, scale=sigma, size=times.shape)
    new_times = times + jitter
    df_aug = df.copy()
    if use_datetime:
        df_aug[time_col] = pd.to_datetime(new_times, unit='s')
    else:
        df_aug[time_col] = new_times
    return df_aug

def random_time_shift(df, time_col, shift_range=None, seed=None):
    """Shift the entire sequence by a random amount. shift_range can be (min,max) in same units as time_col. If None, uses +/- 0.5 * duration."""
    if time_col is None or time_col not in df.columns:
        return df
    rng = np.random.default_rng(seed)
    try:
        times_dt = pd.to_datetime(df[time_col], errors='coerce')
        times_s = times_dt.astype('datetime64[ns]').astype('int64') / 1e9
        use_datetime = not times_dt.isna().all()
    except Exception:
        times_s = df[time_col].astype(float).values
        use_datetime = False
    times = np.array(times_s, dtype=float)
    duration = times.max() - times.min() if len(times) > 1 else 0.0
    if shift_range is None:
        shift_range = (-0.5 * duration, 0.5 * duration)
    shift = float(rng.uniform(shift_range[0], shift_range[1]))
    new_times = times + shift
    df_aug = df.copy()
    if use_datetime:
        df_aug[time_col] = pd.to_datetime(new_times, unit='s')
    else:
        df_aug[time_col] = new_times
    return df_aug

def random_crop(df, time_col, min_fraction=0.5, seed=None):
    """Return a contiguous subsequence of the timeseries (per-object)."""
    if time_col is None or time_col not in df.columns:
        return df
    rng = np.random.default_rng(seed)
    n = len(df)
    if n < 2:
        return df
    min_len = max(1, int(np.ceil(min_fraction * n)))
    start = int(rng.integers(0, n - min_len + 1))
    end = int(rng.integers(start + min_len, n + 1))
    return df.iloc[start:end].reset_index(drop=True)

## 4.2 Magnitude Augmentations
- Magnitude scaling: multiply flux by a random factor.
- Brightness warping: apply smooth multiplicative warp across time to simulate calibration/seeing changes.

<details>
<summary>üìå Cell Description: Brightness-Based Data Augmentation for Astronomical Light Curves</summary>

This cell defines two augmentation methods that modify the **brightness/flux** values in astronomical time-series data. In astronomy, brightness measurements (flux or magnitude) are often affected by real-world factors such as atmospheric conditions, telescope sensitivity, calibration uncertainties, or instrument noise. These augmentations mimic such natural variations, helping machine-learning models learn to be more robust and generalizable when working with real survey data.

Both methods preserve the *shape* and *scientific meaning* of the brightness curve, while introducing small, realistic variations. This is important because real telescope observations rarely match perfectly‚Äîbrightness often fluctuates slightly due to observational conditions rather than actual astrophysical change.

---

### üîπ **1. `magnitude_scaling()` ‚Äì Uniform Brightness Adjustment**
- Applies one random scaling factor to **all flux values** in the sequence.  
- The factor is chosen from a user-defined range (default 0.8‚Äì1.2).  
- Simulates global brightness changes caused by:  
  - calibration errors  
  - changes in atmospheric transparency  
  - telescope sensitivity fluctuations  
- Keeps the overall pattern the same while making the sequence slightly brighter or dimmer.

**Why it matters:**  
This teaches models that the same astrophysical signal can appear brighter or fainter depending on observing conditions‚Äîan essential property for real survey data.

---

### üîπ **2. `brightness_warping()` ‚Äì Smooth Variation of Brightness Over Time**
This function introduces a **gradual, smooth distortion** in brightness across the entire timeline.

- Converts time into numerical seconds if using datetime.  
- Selects a few ‚Äúknots‚Äù (anchor points) across the time span.  
- Assigns each knot a random brightness multiplier drawn from a normal distribution.  
- Uses interpolation to create a **smooth brightness-warp curve** over time.  
- Multiplies flux values by this smoothly varying factor.

This simulates real observational behaviors such as:

- changing sky transparency over the night  
- drifting calibration during observations  
- atmospheric fluctuations  
- long-term instrumental sensitivity drifts  

Unlike simple scaling, this method allows different parts of the light curve to be modified differently, while still following a realistic smooth trend.

**Why it matters:**  
Real telescope data is rarely perfectly stable‚Äîbrightness can drift up or down slowly due to observing conditions. Models trained with brightness warping are better at ignoring such non-astrophysical variations.

---

### ‚≠ê **Why These Functions Are Important for the Research**
Astronomical time-series datasets often contain only a limited number of clean observations. Machine-learning models trained on small, highly consistent datasets may struggle when exposed to real survey data containing noise, variability, or calibration issues.

These augmentations:

- **increase dataset size** without changing the underlying astrophysical behavior  
- **simulate real observation conditions**, improving model generalization  
- **help machine-learning models become more robust** against noise and calibration drift  
- **preserve the scientific structure** of the brightness curve  

By incorporating both uniform scaling and smooth warping, the researcher creates a training dataset that better reflects the complexity of true astronomical observations.

</details>


In [3]:
def magnitude_scaling(df, flux_col, scale_range=(0.8,1.2), seed=None):
    if flux_col is None or flux_col not in df.columns:
        return df
    rng = np.random.default_rng(seed)
    factor = float(rng.uniform(scale_range[0], scale_range[1]))
    df_aug = df.copy()
    df_aug[flux_col] = df_aug[flux_col].astype(float) * factor
    return df_aug

def brightness_warping(df, time_col, flux_col, n_knots=3, warp_scale=0.1, seed=None):
    """Apply a smooth multiplicative warp across time using piecewise linear interpolation.
    Handles datetime-like time columns by converting to seconds for interpolation."""
    if flux_col is None or flux_col not in df.columns:
        return df
    df_aug = df.copy()
    # try to get numeric times in seconds
    use_datetime = False
    try:
        times_dt = pd.to_datetime(df_aug[time_col], errors='coerce')
        if not times_dt.isna().all():
            times = times_dt.astype('datetime64[ns]').astype('int64') / 1e9
            use_datetime = True
        else:
            times = df_aug[time_col].astype(float).values
    except Exception:
        times = df_aug[time_col].astype(float).values
    tmin, tmax = np.min(times), np.max(times)
    if tmax == tmin:
        return df_aug
    knots = np.linspace(tmin, tmax, n_knots)
    rng = np.random.default_rng(seed)
    knot_factors = rng.normal(loc=1.0, scale=warp_scale, size=len(knots))
    factors = np.interp(times, knots, knot_factors)
    df_aug[flux_col] = df_aug[flux_col].astype(float) * factors
    return df_aug

## 4.3 Noise Augmentations
- Gaussian noise injection to flux.
- Photometric uncertainty simulation: add noise based on provided flux_err or an assumed S/N model.

<details>
<summary>üìå Cell Description: Noise-Based Augmentation to Simulate Real Telescope Measurement Errors</summary>

This cell adds two augmentation techniques that introduce **realistic noise** into the brightness (flux) measurements. In astronomy, every observation contains some amount of noise because telescopes, detectors, and the atmosphere are not perfect. These augmentations simulate such imperfections so that machine-learning models learn to handle real, noisy survey data instead of only clean values.

Both functions operate carefully: they add noise without destroying the overall scientific pattern of the light curve. This makes the augmented data more realistic and improves model robustness.

---

### üîπ **1. `gaussian_noise_injection()` ‚Äì Adds Random Noise Proportional to Flux**
- Adds Gaussian (normal-distributed) noise to each flux value.  
- Noise magnitude is a small fraction (default 5%) of the flux value.  
- Ensures noise is proportional: brighter objects get slightly stronger noise, which matches real telescope physics.  
- Prevents zero-noise cases with a tiny minimum value.

**What it simulates:**  
Natural detector noise, atmospheric variation, background noise, and uncertainties that occur during image processing.

**Why it's useful:**  
Machine-learning models must learn that real astronomical signals always include some noise and are never perfectly smooth.

---

### üîπ **2. `photometric_uncertainty_simulation()` ‚Äì Uses Real Flux Error Measurements**
This method is even more physically realistic.

- If a **flux error column** (flux_err) exists, noise is drawn using that actual uncertainty.  
- This means each observation receives noise equal to its measured error bar‚Äîexactly how astronomers treat real photometric data.  
- If flux_err is not available, a fallback model estimates uncertainty using:  
  - a fractional error (5%) and  
  - Poisson-like photon noise (which increases with brightness).

**What it simulates:**  
True measurement errors that result from:  
- detector sensitivity  
- background sky brightness  
- photon counting noise  
- observational conditions

**Why it's useful:**  
It brings the augmented data even closer to real ZTF survey conditions.

---

### ‚≠ê **Why These Functions Are Important for the Research**
Noise simulation is a critical part of astroinformatics and astronomical ML workflows because:

- Real telescope data is **never clean**‚Äînoise is a core part of the measurement process.  
- Models trained only on smooth, noise-free data perform poorly on real observations.  
- Injecting realistic noise helps models learn stable patterns rather than memorizing perfect curves.  
- These augmentations significantly improve **generalization, robustness, and scientific credibility** of the ML pipeline.  
- They prepare the model for real-world deployment on ZTF or any other survey data.

Together, these two functions mimic both general noise and physically grounded photometric errors, producing augmented datasets that closely resemble real astronomical observations.

</details>


In [4]:
def gaussian_noise_injection(df, flux_col, sigma_fraction=0.05, seed=None):
    if flux_col is None or flux_col not in df.columns:
        return df
    rng = np.random.default_rng(seed)
    flux = df[flux_col].astype(float).values
    sigma = np.maximum(np.abs(flux) * sigma_fraction, 1e-8)
    noise = rng.normal(loc=0.0, scale=sigma)
    df_aug = df.copy()
    df_aug[flux_col] = flux + noise
    return df_aug

def photometric_uncertainty_simulation(df, flux_col, flux_err_col=None, seed=None):
    """If a flux_err column exists, perturb flux by that uncertainty; otherwise assume poisson or fractional error."""
    rng = np.random.default_rng(seed)
    if flux_col is None or flux_col not in df.columns:
        return df
    df_aug = df.copy()
    flux = df_aug[flux_col].astype(float).values
    if flux_err_col and flux_err_col in df_aug.columns:
        err = df_aug[flux_err_col].astype(float).values
        noise = rng.normal(loc=0.0, scale=err)
        df_aug[flux_col] = flux + noise
    else:
        rel_err = 0.05
        sigma = np.maximum(np.abs(flux) * rel_err, np.sqrt(np.maximum(flux, 0)) * 0.1 + 1e-8)
        noise = rng.normal(loc=0.0, scale=sigma)
        df_aug[flux_col] = flux + noise
    return df_aug

## 4.4 Multi-band Transformations
- Filter-dependent transformations: apply different augmentation strengths per band.
- Dropout of random bands: remove observations from random filters to mimic missing bands.

<details>
<summary>üìå Cell Description: Augmentations Based on Telescope Filters and Missing Band Simulation</summary>

This cell defines two augmentation techniques that specifically target **filter-dependent behaviour** in astronomical surveys. Modern telescopes, including ZTF, observe the sky through different optical filters (g, r, i, etc.). Each filter captures light in a different wavelength range, and brightness values can vary between filters due to both astrophysical reasons and instrument-related factors.

These augmentations simulate real observational situations where brightness varies from filter to filter, or where data from certain filters may be missing. Both methods help create a more realistic and robust dataset for machine-learning models.

---

### üîπ **1. `filter_dependent_transform()` ‚Äì Apply Filter-Specific Brightness Scaling**
- Identifies all unique filters (e.g., g, r, i).  
- Applies a separate multiplicative factor to the brightness (flux) values of each filter.  
- Factors may be provided manually, or sampled from a small default range (0.9‚Äì1.1).  
- Creates realistic differences between filters without changing the overall light-curve shape.

**What it simulates:**
- Different sensitivity levels for each filter.  
- Calibration differences between wavelength bands.  
- Small color-dependent brightness variations.  

**Why it matters:**  
Astronomical objects often look slightly brighter or fainter depending on the filter used. This augmentation teaches ML models to handle these natural variations.

---

### üîπ **2. `random_band_dropout()` ‚Äì Randomly Remove Observations From Certain Filters**
- Randomly removes a fraction of rows (default 20%) across the dataset.  
- Simulates **missing data** in certain wavelength bands.  
- Mirrors real-world issues such as:  
  - cloudy observations in only one filter  
  - incomplete multi-band coverage  
  - technical problems affecting specific bands  
- Produces more realistic and challenging datasets for models.

**Why it matters:**  
Real telescope datasets often contain incomplete band coverage. A model trained only on perfectly complete multi-band data will perform poorly when real data has missing filter measurements.

---

### ‚≠ê **Why These Functions Are Important for the Research**
Astronomical surveys rarely capture perfect multi-band light curves. Brightness can differ by filter, and some filters may be missing entirely. These augmentations address these real-world issues:

- **Improves robustness:** Models learn to handle missing filters or inconsistent brightness.  
- **Increases training diversity:** Prevents overfitting to one ‚Äúideal‚Äù pattern of data.  
- **Simulates telescope realities:** Including calibration differences, sensitivity variations, and partial filter coverage.  
- **Enhances generalization:** Models become more prepared for unpredictable observation conditions.

Together, these augmentations help create datasets that closely resemble real ZTF observations, improving the scientific and practical value of the machine-learning system.

</details>


In [5]:
def filter_dependent_transform(df, flux_col, filter_col, per_filter_scale=None, seed=None):
    """Apply a multiplicative factor per filter. `per_filter_scale` is a dict {filter: (min,max)} or None to sample small variations."""
    if flux_col is None or flux_col not in df.columns or filter_col is None or filter_col not in df.columns:
        return df
    df_aug = df.copy()
    rng = np.random.default_rng(seed)
    unique_filters = df_aug[filter_col].dropna().unique()
    for f in unique_filters:
        mask = df_aug[filter_col] == f
        if per_filter_scale and f in per_filter_scale:
            lo, hi = per_filter_scale[f]
            factor = float(rng.uniform(lo, hi))
        else:
            factor = float(rng.uniform(0.9, 1.1))
        df_aug.loc[mask, flux_col] = df_aug.loc[mask, flux_col].astype(float) * factor
    return df_aug

def random_band_dropout(df, filter_col, dropout_prob=0.2, seed=None):
    """Randomly drop a fraction of observations for some bands. Returns a copy where dropped rows are removed."""
    if filter_col is None or filter_col not in df.columns:
        return df
    rng = np.random.default_rng(seed)
    df_aug = df.copy()
    mask = rng.random(size=df_aug.shape[0]) < dropout_prob
    df_aug = df_aug.loc[~mask].reset_index(drop=True)
    return df_aug

## Utilities: apply augmentations to grouped time-series and create contrastive pairs
The following helper applies augmentations per object (grouped by `id_col`) and returns a concatenated augmented dataset. You can call augmentations sequentially (composition) to create positive pairs for contrastive learning.

<details>
<summary>üìå Cell Description: Group-Level Augmentation, Dataset Augmentation, and Contrastive Pair Generation</summary>

This cell defines three powerful utilities that apply augmentation at the **object level** rather than the entire dataset. Astronomical surveys usually contain multiple observations for each celestial object (identified by an object ID). Since these observations form meaningful sequences (light curves), augmentations must be applied **per object**, not randomly across the entire dataset. These utilities allow flexible, modular augmentation pipelines that respect object boundaries and scientific structure.

The functions also support **contrastive learning**, a modern machine-learning technique where the model learns by comparing two different augmented versions of the same object. This mimics the SimCLR and self-supervised learning approaches widely used in advanced data science.

---

### üîπ **1. `apply_to_group()` ‚Äì Apply Multiple Augmentations to a Single Object**
- Takes one object's time-series (one group).  
- Applies a sequence of augmentation functions.  
- Each augmentation can have its own parameters.  
- Ensures augmentations are applied cleanly and in order.  
- Returns a new, fully augmented version of that object‚Äôs sequence.

**Why it matters:**  
Astronomical light curves must be augmented **per object**. Applying augmentations at the row-level would break the scientific time-series structure.

---

### üîπ **2. `augment_dataset()` ‚Äì Apply Augmentation to the Entire Dataset Object-by-Object**
This is the **master augmentation function**.

- Detects each object using an ID column (e.g., object_id, source_id).  
- Groups the dataset so each object's light curve is processed independently.  
- Optionally sorts each object‚Äôs data by time (ensures chronological order).  
- Applies all chosen augmentations to each group.  
- Supports **sample_frac**, which controls the percentage of objects to augment.  
- Returns a fully augmented dataset containing realistic, object-level sequences.

**What it ensures:**
- Augmentations never mix data from different objects.  
- Temporal order is preserved.  
- Output remains scientifically valid and consistent with telescope behavior.  

**Why it's important:**  
Real astronomical machine learning requires understanding each object's behaviour across time, so augmentation must respect object identity and temporal sequence.

---

### üîπ **3. `make_contrastive_pair()` ‚Äì Create Two Augmented Views of the Same Object**
This function is used in **contrastive learning**, a state-of-the-art technique in deep learning.

- Takes one object's time-series.  
- Applies two *different* augmentation pipelines:  
  - **AugA** (first view)  
  - **AugB** (second view)  
- Returns two augmented versions of the same object.

In contrastive learning, the model learns that:

- The two augmented sequences represent the *same astronomical object*, even though their appearance is slightly different due to augmentation.  
- Different objects should still have different representations.

**Why it matters:**  
This is the foundation of **self-supervised representation learning**, helping the model learn reliable features even without labels.

---

### ‚≠ê **Why These Functions Are Important for the Research**
These utilities together make this pipeline extremely powerful:

- **Supports object-level data augmentation**, essential for astronomical time-series.  
- Ensures augmentations do not break object identity or scientific meaning.  
- Allows creating large numbers of augmented light curves for training.  
- Enables **contrastive learning**, a cutting-edge approach for building strong models from unlabeled data.  
- Makes the augmentation pipeline modular, reusable, and easy to extend.  
- Prepares the dataset for advanced methods like SimCLR, BYOL, or contrastive autoencoders.  
- Mimics realistic observation variations while preserving astrophysical signals.  

Overall, this block forms the *backbone* of the augmentation and contrastive learning system for astronomical machine learning.

</details>


In [6]:
def apply_to_group(df_group, funcs):
    """Apply a list of augmentation functions (each accepts and returns a DataFrame for the group)."""
    g = df_group.copy().reset_index(drop=True)
    for f, kwargs in funcs:
        g = f(g, **kwargs) if kwargs is not None else f(g)
    return g

def augment_dataset(df, id_col=None, funcs_per_group=None, sample_frac=1.0, random_state=None):
    """Apply augmentations per object id and return augmented examples.
    - `funcs_per_group` should be a list of tuples (func, kwargs) to apply to each group."""
    if funcs_per_group is None:
        return df
    if id_col is None or id_col not in df.columns:
        # treat whole DF as one sequence
        return apply_to_group(df, funcs_per_group)
    groups = df.groupby(id_col)
    out_rows = []
    rng = np.random.default_rng(random_state)
    ids = list(groups.groups.keys())
    if sample_frac < 1.0:
        k = max(1, int(len(ids) * sample_frac))
        ids = rng.choice(ids, size=k, replace=False).tolist()
    for objid in ids:
        g = groups.get_group(objid)
        g_sorted = sort_by_time(g, detect_columns(g)['time'])
        g_aug = apply_to_group(g_sorted, funcs_per_group)
        out_rows.append(g_aug)
    if len(out_rows) == 0:
        return pd.DataFrame(columns=df.columns)
    return pd.concat(out_rows, ignore_index=True)

def make_contrastive_pair(group_df, augA, augB):
    a = apply_to_group(group_df, augA)
    b = apply_to_group(group_df, augB)
    return a, b

## Example usage and saving augmented samples
Below is an example that loads the cleaned dataset, selects a small sample of objects, applies a pipeline of augmentations, and saves augmented CSVs for later training.

<details>
<summary>üìå Cell Description: Example Usage of the Full Augmentation Pipeline + Saving Contrastive Samples</summary>

This cell demonstrates how all the previously defined augmentation tools come together to generate **augmented astronomical time-series data**. It loads the cleaned dataset, detects important columns (time, flux, filter, object ID), selects a subset of objects, applies two different augmentation pipelines, and finally saves the resulting augmented datasets.

The purpose of this cell is to provide a **practical example** of how the augmentation functions are used in real workflows, especially for **contrastive learning**, where two different augmented views of the same object are required. This step prepares training data for machine-learning models that rely on self-supervised learning or contrastive objectives.

---

### üîπ **Step-by-Step Summary (Simple & Attractive)**

- **Loads the cleaned dataset**, ensuring preprocessing has been completed.  
- **Automatically detects key columns** using `detect_columns()`, such as:
  - object ID  
  - time  
  - flux  
  - filter  
- **Selects up to 50 unique objects** (if IDs exist).  
- **Defines two augmentation pipelines**, each applying different transformations:
  - **AugA**: temporal jitter + brightness scaling + Gaussian noise  
  - **AugB**: time shifting + smooth brightness warping + photometric uncertainty  
- **Applies augmentations object-by-object** using `augment_dataset()`.  
- **Generates two augmented datasets**, each containing realistic variations of the same astronomical light curves.  
- **Saves both outputs** (`augmented_A_sample.csv`, `augmented_B_sample.csv`) for later use in training models.  

---

### üîπ **Purpose of AugA and AugB (Contrastive Learning Friendly)**

- **AugA** applies small, local modifications.  
- **AugB** applies larger, more global distortions.  
- Together, they create two *different but related* views of each object.  
- Perfect for contrastive learning approaches like:  
  - SimCLR  
  - BYOL  
  - MoCo  
  - Contrastive autoencoders  

This helps the model learn the true underlying structure of astronomical light curves.

---

### ‚≠ê **Why This Cell Is Important for the Research**

This cell performs several essential functions:

1. **Demonstrates end-to-end augmentation**  
   It ties together all earlier augmentation utilities into a practical workflow.

2. **Prepares contrastive pairs for modern ML methods**  
   Contrastive learning is one of the most powerful techniques for learning good representations from unlabeled data ‚Äî highly relevant in astronomy where labels are scarce.

3. **Ensures augmentations respect object identity and time order**  
   This is crucial for maintaining scientific meaning.

4. **Generates a rich, diverse training dataset**  
   This improves:
   - model robustness  
   - generalization to new observations  
   - performance on real ZTF data containing noise and irregular sampling  

5. **Creates outputs that can be directly used for model training**  
   The saved CSV files form the basis of further ML experiments.

Overall, this cell is the **final integration step** that converts raw cleaned data into scientifically meaningful, augmented datasets ready for advanced machine-learning models.

</details>


In [7]:
# Example usage (will run if the cleaned dataset exists in workspace)
DATA_CLEANED = 'ztf_image_search_results_full_cleaned.csv'
if not os.path.exists(DATA_CLEANED):
    print(DATA_CLEANED, 'not found in workspace ‚Äî update path or run preprocessing first')
else:
    df_all = pd.read_csv(DATA_CLEANED)
    cols = detect_columns(df_all)
    print('Detected columns:', cols)
    idc = cols['id']
    if idc is None or idc not in df_all.columns:
        sample_df = df_all
    else:
        unique_ids = df_all[idc].dropna().unique()
        sel_ids = unique_ids[:50] if len(unique_ids) > 50 else unique_ids
        sample_df = df_all[df_all[idc].isin(sel_ids)].reset_index(drop=True)

    # define two augmentation pipelines for contrastive positives
    augA = [
        (lambda g, **kw: temporal_jitter(g, cols['time'], sigma_fraction=0.01, seed=42), None),
        (lambda g, **kw: magnitude_scaling(g, cols['flux'], scale_range=(0.95,1.05), seed=42), None),
        (lambda g, **kw: gaussian_noise_injection(g, cols['flux'], sigma_fraction=0.03, seed=42), None)
    ]
    augB = [
        (lambda g, **kw: random_time_shift(g, cols['time'], seed=24), None),
        (lambda g, **kw: brightness_warping(g, cols['time'], cols['flux'], n_knots=4, warp_scale=0.08, seed=24), None),
        (lambda g, **kw: photometric_uncertainty_simulation(g, cols['flux'], flux_err_col=None, seed=24), None)
    ]
    # create augmented sets (this will return concatenated groups)
    aug_set_A = augment_dataset(sample_df, id_col=cols['id'], funcs_per_group=augA, sample_frac=1.0, random_state=42)
    aug_set_B = augment_dataset(sample_df, id_col=cols['id'], funcs_per_group=augB, sample_frac=1.0, random_state=24)
    # save examples
    aug_set_A.to_csv('augmented_A_sample.csv', index=False)
    aug_set_B.to_csv('augmented_B_sample.csv', index=False)
    print('Saved augmented_A_sample.csv and augmented_B_sample.csv (sample)')

Detected columns: {'time': 'obsdate', 'flux': None, 'filt': 'filtercode', 'id': 'ipac_gid'}


ValueError: could not convert string to float: '2018-03-25 06:35:35+00'