# ACE Exploration

ACE (Advanced Composition Explorer) is equipped with nine scientific instruments to make comprehensive and coordinated in situ measurements. These instruments are categorized into two groups: High Resolution Spectrometers and Monitoring Instruments.

## High Resolution Spectrometers
- **CRIS** - Cosmic Ray Isotope Spectrometer
- **SIS** - Solar Isotope Spectrometer
- **ULEIS** - Ultra Low Energy Isotope Spectrometer
- **SEPICA** - Solar Energetic Particle Ionic Charge Analyzer
- **SWICS** - Solar Wind Ion Composition Spectrometer
- **SWIMS** - Solar Wind Ion Mass Spectrometer

## Monitoring Instruments
- **MAG** - Magnetic Field Monitor
- **SWEPAM** - Solar Wind Electron, Proton and Alpha Monitor
- **EPAM** - Electron, Proton and Alpha Monitor

All open-source ACE data are stored in Hierarchical Data Format (HDF) and organized by instrument and by time-averaging period. Each instrument has its own HDF files, with a separate file for each averaging period. For most instruments, the data are averaged over one hour, one day, and 27 days (one Bartels rotation).
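To make the averaging periods concrete, the sketch below shows how hourly values relate to daily and 27-day averages using pandas resampling. The series is synthetic and the bins are simply anchored to the first day of the data, which is an assumption; it is not how the ACE team produces its files, and official Bartels rotations follow a fixed calendar.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series (made-up values) covering exactly 54 days.
n_hours = 24 * 54
index = pd.date_range("2020-01-01", periods=n_hours, freq=pd.Timedelta(hours=1))
hourly = pd.Series(np.arange(n_hours, dtype=float), index=index, name="value")

# Daily averages: one value per calendar day.
daily = hourly.resample("1D").mean()

# 27-day averages: bins anchored to the first day of the series (assumption).
bartels = hourly.resample("27D").mean()

print(len(hourly), len(daily), len(bartels))
```

Each coarser product is just a mean over a longer window, which is why the three averaging periods live in separate files of very different sizes.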

## About Hierarchical Data Formats
Hierarchical Data Formats (HDF) are open-source file formats that support large, complex, heterogeneous data. HDF files use a directory-like structure, so you can organize data within a file in much the same way you organize files and folders on your computer. HDF files can also embed metadata, which makes them self-describing.

---

## Analytical Questions
How can we apply novel dimension reduction methods, such as PCA and t-SNE, to obtain an informative low-dimensional representation of solar wind in situ data? How can this low-dimensional representation support better 2D/3D visualization than traditional dimension reduction techniques?
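As a reference point for the first question, the sketch below implements PCA with a plain NumPy SVD and projects a feature matrix onto two components. The data are synthetic stand-ins for in situ features, and `pca_project` is a hypothetical helper name, not part of this project's utilities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for in situ features: 500 timestamps, 6 features
# generated from 2 hidden factors plus a little noise (all values made up).
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 6))


def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    return X_centered @ Vt[:n_components].T, explained[:n_components]


embedding, explained_ratio = pca_project(X, n_components=2)
print(embedding.shape)
print(explained_ratio.sum())
```

The two columns of `embedding` can be passed straight to a scatter plot. The explained-variance ratio gives a quick check of how much structure a 2D view retains.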

## Libraries and global variables

In [None]:
# Standard library imports
import sys
import warnings
from contextlib import suppress
from functools import reduce

# Third-party imports
import numpy as np

# Local application imports
sys.path.append("../src/scripts")
from utilities import (
    parse_hdf_data,
    merge_dataframes,
    missing_occurrences,
    sort_columns_except_key,
    visualize_flag,
    add_datetime_column,
)

# Set the warning filter to ignore all warnings
warnings.filterwarnings("ignore")

In [None]:
# global variables
MISSING_FLAG = -999.900

## Data Import

In [None]:
# read data
data_dir = "../data/ace/raw"
mag_df = parse_hdf_data(f"{data_dir}/MAG_data_1hr.txt")
swepam_df = parse_hdf_data(f"{data_dir}/SWEPAM_data_1hr.txt")
epam_df = parse_hdf_data(f"{data_dir}/EPAM_data_1hr.txt")
swics_df = parse_hdf_data(f"{data_dir}/SWICS_data_1day.txt")

ACE_DATASETS = [mag_df, swepam_df, epam_df, swics_df]
ACE_DATASETS_NAMES = ["MAG", "SWEPAM", "EPAM", "SWICS"]

In [None]:
# dtype conversion
for df in ACE_DATASETS:
    df[["year", "day", "hr", "min", "sec"]] = df[
        ["year", "day", "hr", "min", "sec"]
    ].astype(int)

    with suppress(KeyError):
        df['Quality'] = df['Quality'].astype(str)

In [None]:
# datetime conversion and drop redundant features
for df in ACE_DATASETS:
    add_datetime_column(df).drop(
        columns=["year", "day", "hr", "min", "sec", "fp_year", "fp_doy"],
        inplace=True,
        axis=1,
    )
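`add_datetime_column` lives in the local `utilities` module and is not shown here. The sketch below is one possible way to build a timestamp from the year, day-of-year, hour, minute, and second columns. `add_datetime_from_doy` is a hypothetical stand-in, and the real helper may behave differently, for example by modifying the frame in place.

```python
import pandas as pd


def add_datetime_from_doy(df):
    """Hypothetical stand-in: build a datetime column from year/day-of-year parts."""
    out = df.copy()
    out["datetime"] = (
        pd.to_datetime(out["year"].astype(str), format="%Y")  # Jan 1 of each year
        + pd.to_timedelta(out["day"] - 1, unit="D")  # day-of-year is 1-based
        + pd.to_timedelta(out["hr"], unit="h")
        + pd.to_timedelta(out["min"], unit="min")
        + pd.to_timedelta(out["sec"], unit="s")
    )
    return out


# Made-up sample rows; day 60 of a leap year is 29 February.
sample = pd.DataFrame(
    {"year": [2020, 2020], "day": [1, 60], "hr": [0, 12], "min": [0, 30], "sec": [0, 0]}
)
result = add_datetime_from_doy(sample)
print(result["datetime"].tolist())
```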

## Data Cleaning

### Retain *Good* Quality data

The researchers flag good-quality data with a `Quality` value of 0. Because `Quality` was cast to a string above, the filter compares against `"0.0"`.

In [None]:
for c, (df, df_name) in enumerate(zip(ACE_DATASETS, ACE_DATASETS_NAMES)):
    with suppress(KeyError):  # not all datasets have the quality flag
        # keep only good rows, then drop the flag column from the filtered copy
        df = df[df["Quality"] == "0.0"].drop(columns=["Quality"])
    ACE_DATASETS[c] = df

mag_df, swepam_df, epam_df, swics_df = ACE_DATASETS

### Join data

In [None]:
# find unique timestamps
mag_dates, swepam_dates, epam_dates, swics_dates = [
    df.datetime.unique() for df in ACE_DATASETS
]

# find the common dates for 1hr interval data
common_dates_1hr = reduce(
    np.intersect1d, (mag_dates, swepam_dates, epam_dates)
)

# find the common dates for 1day interval data
common_date_1d = reduce(
    np.intersect1d, (mag_dates, swepam_dates, epam_dates, swics_dates)
)

print(len(common_dates_1hr))
print(len(common_date_1d))
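The `reduce(np.intersect1d, ...)` pattern above folds a pairwise intersection over any number of arrays. A minimal sketch on made-up timestamps shows the behavior. Only exactly equal timestamps survive, so arrays sampled at different cadences overlap only where their stamps coincide.

```python
from functools import reduce

import numpy as np

# Made-up hourly timestamps for three hypothetical instruments.
a = np.array(
    ["2020-01-01T00", "2020-01-01T01", "2020-01-01T02", "2020-01-01T03"],
    dtype="datetime64[h]",
)
b = np.array(
    ["2020-01-01T01", "2020-01-01T02", "2020-01-01T03", "2020-01-01T04"],
    dtype="datetime64[h]",
)
c = np.array(["2020-01-01T02", "2020-01-01T03", "2020-01-01T05"], dtype="datetime64[h]")

# Equivalent to np.intersect1d(np.intersect1d(a, b), c).
common = reduce(np.intersect1d, (a, b, c))
print(common)
```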

In [None]:
# join the 1hr interval datasets
swics_i = ACE_DATASETS_NAMES.index("SWICS")
ACE_DATASETS_1HR = ACE_DATASETS[:swics_i] + ACE_DATASETS[swics_i + 1 :]
insitu_1hr_df = merge_dataframes(ACE_DATASETS_1HR, "datetime")
df = sort_columns_except_key(insitu_1hr_df, "datetime")
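`merge_dataframes` comes from the local `utilities` module and is not shown here. One plausible way to join several frames on a shared key is to fold `pd.merge` over the list, as sketched below. `merge_on_key`, the inner-join choice, and the column names are assumptions for illustration only.

```python
from functools import reduce

import pandas as pd


def merge_on_key(frames, key, how="inner"):
    """Hypothetical stand-in: merge a list of DataFrames on a shared key column."""
    return reduce(lambda left, right: pd.merge(left, right, on=key, how=how), frames)


t = pd.to_datetime(
    ["2020-01-01 00:00", "2020-01-01 01:00", "2020-01-01 02:00", "2020-01-01 03:00"]
)

# Made-up frames with hypothetical column names.
mag = pd.DataFrame({"datetime": t[:3], "b_mag": [5.1, 4.8, 5.3]})
swepam = pd.DataFrame({"datetime": t[1:], "proton_speed": [410.0, 395.0, 402.0]})
epam = pd.DataFrame({"datetime": t, "e_flux": [1.2, 1.1, 1.4, 1.3]})

merged = merge_on_key([mag, swepam, epam], "datetime")
print(merged)
```

With an inner join, only timestamps present in every frame are kept, which matches the idea of restricting the analysis to the common dates computed earlier.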

### Handling Missing Values

Missing data are encoded with the flag value -999.900. Check whether any flagged values remain after dropping the rows that were not marked as good quality.

In [None]:
# rows that still contain the numeric missing-value flag
missing_rows = df[df.eq(MISSING_FLAG).any(axis=1)]
missing_occurrences(df, MISSING_FLAG).sort_values(
    ascending=False, by="Missing_Count"
)
visualize_flag(df, "MAG + SWEPAM + EPAM (1hr)")
df.shape
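`missing_occurrences` is a local helper whose implementation is not shown. As a rough sketch of the same idea, the function below counts how often the flag value appears in each numeric column. `count_flagged` and its output column name are hypothetical, and the sample data are made up.

```python
import numpy as np
import pandas as pd

MISSING_FLAG = -999.9


def count_flagged(df, flag):
    """Hypothetical stand-in: count flag occurrences per numeric column."""
    numeric = df.select_dtypes(include="number")
    # np.isclose avoids surprises from floating-point representation of the flag.
    counts = np.isclose(numeric.to_numpy(dtype=float), flag).sum(axis=0)
    return pd.DataFrame({"column": numeric.columns, "flag_count": counts})


sample = pd.DataFrame(
    {
        "datetime": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
        "a": [1.0, -999.9, 3.0],
        "b": [-999.9, -999.9, 0.5],
    }
)
summary = count_flagged(sample, MISSING_FLAG).sort_values("flag_count", ascending=False)
print(summary)
```

Comparing against the numeric flag rather than the string `"-999.900"` matters here, because a string comparison never matches values stored as floats.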

## Exploratory Data Analysis

### Descriptives

In [None]:
for df, df_name in zip(ACE_DATASETS, ACE_DATASETS_NAMES):
    print(f"Dataframe: {df_name}")
    df.info()  # prints its summary directly and returns None
    display(df.describe())
    print("\n" + ("-" * 20))

### Univariate Analysis

### Multivariate Analysis

## Data Transformation

### Normalization and Standardization

### Handling Outliers

## Dimensionality Reduction

## Joins
- Is there any way to join these features informatively?

## Data Quality Checks

### Addressing missing values

In [None]:
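# rows that still contain the numeric missing-value flag
missing_rows = mag_df[mag_df.eq(MISSING_FLAG).any(axis=1)]
# keep the summary in its own variable so mag_df is not overwritten
mag_missing = missing_occurrences(mag_df, MISSING_FLAG).sort_values(
    ascending=False, by="Flag_Count"
)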


## Self-Organizing Maps