---
title: Finding magnetic discontinuities
order: 0
---

Finding magnetic discontinuities can be divided into two parts:

1. Finding the discontinuities, see [this notebook](./01_ids_detection.ipynb)
    - This corresponds to limited feature extraction / anomaly detection

    - Output should contain the following:
        - "tstart" and "tstop" of the event

2. Calculating the properties of the discontinuities, see [this notebook](./02_ids_properties.ipynb)
    - One can use higher time resolution data
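
The two-stage split can be illustrated with a toy sketch that uses only the standard library. The threshold rule, the function names `detect_events` and `event_properties`, and the `jump` property are assumptions made for illustration; they are not the detection criterion or the properties computed in the notebooks linked above.

```python
from datetime import datetime, timedelta
from statistics import pstdev


def detect_events(times, values, window, threshold):
    """Stage 1 (toy): flag non-overlapping windows whose spread exceeds `threshold`."""
    events = []
    for i in range(0, len(values) - window + 1, window):
        chunk = values[i : i + window]
        if pstdev(chunk) > threshold:
            events.append({"tstart": times[i], "tstop": times[i + window - 1]})
    return events


def event_properties(event, times, values):
    """Stage 2 (toy): compute the change in the signal across one event."""
    inside = [v for t, v in zip(times, values) if event["tstart"] <= t <= event["tstop"]]
    return {**event, "jump": inside[-1] - inside[0]}


t0 = datetime(2020, 1, 1)
times = [t0 + timedelta(seconds=i) for i in range(12)]
values = [1.0] * 6 + [5.0] * 6  # a single step between samples 5 and 6

events = detect_events(times, values, window=4, threshold=1.0)
properties = [event_properties(e, times, values) for e in events]
```

Stage 2 only needs the samples inside each `tstart` to `tstop` interval, which is why it can be given a higher-cadence series than stage 1.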

In [None]:
# | default_exp core/pipeline

In [None]:
# | export
# | code-summary: "Import all the packages needed for the project"
import polars as pl
from discontinuitypy.detection.variance import detect_variance
from discontinuitypy.core.propeties import process_events
from space_analysis.ds.ts.io import df2ts
from loguru import logger

from datetime import timedelta

from typing import Callable

## Processing the whole dataset

Note that the candidates only require a small portion of the data, so we can compress the data to speed up the processing.

In [None]:
# | export
from beforerr.polars import filter_df_by_ranges


def compress_data_by_events(data: pl.DataFrame, events: pl.DataFrame):
    """Compress the data for parallel processing"""
    starts = events["tstart"]
    ends = events["tstop"]
    return filter_df_by_ranges(data, starts, ends)

In [None]:
# | export
def ids_finder(
    detection_df: pl.LazyFrame,  # data used for anomaly detection (typically low cadence data)
    bcols=None,
    detect_func: Callable[..., pl.LazyFrame] = detect_variance,
    detect_kwargs: dict = None,
    extract_df: pl.LazyFrame = None,  # data used for feature extraction (typically high cadence data)
    **kwargs,
):
    "Detect candidate events in `detection_df`, then compute their properties from `extract_df`"
    if bcols is None:
        # Default to every column except the time axis
        bcols = detection_df.collect_schema().names()
        bcols.remove("time")
    if len(bcols) != 3:
        logger.error("Expect 3 field components")

    detection_df = detection_df.select(bcols + ["time"])
    # Avoid `extract_df or detection_df`: a LazyFrame has no unambiguous truth value
    if extract_df is None:
        extract_df = detection_df

    detection_df = detection_df.sort("time")
    extract_df = extract_df.sort("time")

    # `detect_kwargs` defaults to None to avoid a shared mutable default argument
    events = detect_func(detection_df, bcols=bcols, **(detect_kwargs or {}))

    # Keep only the data around the detected events before computing properties
    data_c = compress_data_by_events(extract_df.collect(), events)
    sat_fgm = df2ts(data_c, bcols)
    ids = process_events(events, sat_fgm, **kwargs)
    return ids

Wrapper function for partitioned input, as used in `Kedro`.

In [None]:
# | export
def extract_features(
    partitioned_input: dict[str, Callable[..., pl.LazyFrame]],
    tau: float,  # in seconds, yaml input
    ts: float,  # in seconds, yaml input
    **kwargs,
) -> pl.DataFrame:
    "wrapper function for partitioned input"

    _tau = timedelta(seconds=tau)
    _ts = timedelta(seconds=ts)

    ids = pl.concat(
        [
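            # NOTE: assumes the detection function accepts `tau` and `ts` keyword arguments;
            # passing them positionally would bind them to `bcols` and `detect_func`
            ids_finder(
                partition_load(),
                detect_kwargs={"tau": _tau, "ts": _ts},
                **kwargs,
            )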
            for partition_load in partitioned_input.values()
        ]
    )
    return ids.unique(["d_time", "t.d_start", "t.d_end"])

## Conventions

As we are dealing with multiple spacecraft, we need to be careful about naming conventions. Here are the conventions we use in this project.

-   `sat_id`: name of the spacecraft. We also use abbreviations, for example
    -   `sta` for `STEREO-A`
    -   `thb` for `ARTEMIS-B`
-   `sat_state`: state data of the spacecraft
-   `b_vl`: maximum variance vector of the magnetic field (major eigenvector)

Data levels:

-   l0: unprocessed

-   l1: cleaned data, fill null value, add useful columns

-   l2: time-averaged data

### Columns naming conventions

-   `radial_distance`: radial distance of the spacecraft, in units of $AU$

-   `plasma_speed`: solar wind plasma speed, in units of $km/s$

-   `sw_elevation`: solar wind elevation angle, in units of $\degree$

-   `sw_azimuth`: solar wind azimuth angle, in units of $\degree$

-   `v_{x,y,z}` or `sw_vel_{X,Y,Z}`: solar wind plasma velocity in *ANY* coordinate system, in units of $km/s$

    -   `sw_vel_{r,t,n}`: solar wind plasma velocity in the RTN coordinate system, in units of $km/s$
    -   `sw_vel_gse_{x,y,z}`: solar wind plasma velocity in the GSE coordinate system, in units of $km/s$
    -   `sw_vel_lmn_{x,y,z}`: solar wind plasma velocity in the LMN coordinate system, in units of $km/s$
        -   `v_l` or `sw_vel_l`: abbreviation for `sw_vel_lmn_1`
        -   `v_mn` or `sw_vel_mn` (deprecated)

-   `plasma_density`: plasma density, in units of $1/cm^{3}$

-   `plasma_temperature`: plasma temperature, in units of $K$

-   `B_{x,y,z}`: magnetic field in *ANY* coordinate system

    -   `b_rtn_{x,y,z}` or `b_{r,t,n}`: magnetic field in the RTN coordinate system
    -   `b_gse_{x,y,z}`: magnetic field in the GSE coordinate system

-   `B_mag`: magnetic field magnitude

-   `Vl_{x,y,z}` or `b_vecL_{X,Y,Z}`: maximum variance vector of the magnetic field in *ANY* coordinate system

    -   `b_vecL_{r,t,n}`: maximum variance vector of the magnetic field in the RTN coordinate system

-   `model_b_{r,t,n}`: modelled magnetic field in the RTN coordinate system

-   `state` : *1* for *solar wind*, *0* for *non-solar wind*

-   `L_mn{_norm}`: thickness of the current sheet in MN direction, in units of $km$

-   `j0{_norm}`: current density, in units of $nA/m^2$

Note: we recommend using a unique name for each variable, for example `plasma_speed` instead of `speed`, because this makes it easier to search for and replace variable names in the code whenever necessary.

For units, by default we use

-   length : $km$
-   time : $s$
-   magnetic field : $nT$
-   current density : $nA/m^2$
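
With these defaults, derived quantities need explicit unit factors. As a worked example, and assuming the common one-dimensional current sheet estimate $j \approx \Delta B / (\mu_0 L)$ (an illustrative assumption, not necessarily how `j0` is computed in this project), a field change in $nT$ across a thickness in $km$ converts to $nA/m^2$ as follows. The function name `current_density_nA_m2` is hypothetical.

```python
import math

MU0 = 4 * math.pi * 1e-7  # vacuum permeability in T m / A


def current_density_nA_m2(delta_b_nT: float, thickness_km: float) -> float:
    """Estimate j = dB / (mu0 * L), with dB in nT and L in km, returned in nA/m^2."""
    delta_b_T = delta_b_nT * 1e-9  # nT -> T
    thickness_m = thickness_km * 1e3  # km -> m
    j_A_m2 = delta_b_T / (MU0 * thickness_m)
    return j_A_m2 * 1e9  # A/m^2 -> nA/m^2


j = current_density_nA_m2(delta_b_nT=10.0, thickness_km=1000.0)
```

Under this estimate, a 10 $nT$ change across 1000 $km$ gives roughly 8 $nA/m^2$.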

## Test

### Test feature engineering

In [None]:
# from tsflex.features import MultipleFeatureDescriptors, FeatureCollection

# from tsflex.features.integrations import catch22_wrapper
# from pycatch22 import catch22_all

In [None]:
# tau_pd = pd.Timedelta(tau)

# catch22_feats = MultipleFeatureDescriptors(
#     functions=catch22_wrapper(catch22_all),
#     series_names=bcols,  # list of signal names
#     windows = tau_pd, strides=tau_pd/2,
# )

# fc = FeatureCollection(catch22_feats)
# features = fc.calculate(data, return_df=True)  # calculate the features on your data

In [None]:
# features_pl = pl.DataFrame(features.reset_index()).sort('time')
# df = candidates_pl.join_asof(features_pl, on='time').to_pandas()

In [None]:
# profile = ProfileReport(df, title="JUNO Candidates Report")
# profile.to_file("jno.html")

### Benchmark

## Notes

### TODOs

1. Feature engineering
2. Feature selection

## Obsolete code