# ACE Exploration

ACE (Advanced Composition Explorer) is equipped with nine scientific instruments to make comprehensive and coordinated in situ measurements. These instruments are categorized into two groups: High Resolution Spectrometers and Monitoring Instruments.

## High Resolution Spectrometers
- **CRIS** - Cosmic Ray Isotope Spectrometer
- **SIS** - Solar Isotope Spectrometer
- **ULEIS** - Ultra Low Energy Isotope Spectrometer
- **SEPICA** - Solar Energetic Particle Ionic Charge Analyzer
- **SWICS** - Solar Wind Ion Composition Spectrometer
- **SWIMS** - Solar Wind Ion Mass Spectrometer

## Monitoring Instruments
- **MAG** - Magnetic Field Monitor
- **SWEPAM** - Solar Wind Electron, Proton and Alpha Monitor
- **EPAM** - Electron, Proton and Alpha Monitor
- **SWICS** - Solar Wind Ion Composition Spectrometer

All publicly available ACE data are stored in Hierarchical Data Format (HDF). The data are organized by instrument and by time-averaging period: each instrument's data are stored in separate HDF files, and each averaging period has its own files as well. For most instruments, the data are averaged hourly, daily, and over 27 days (one Bartels rotation).
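To make the averaging periods concrete, here is a minimal sketch that bins a synthetic hourly series into daily and 27-day means with pandas. The values and the `Bmag` name are placeholders, and the 27-day bins simply start at the first timestamp; official Bartels rotation numbering is anchored to a fixed historical epoch, which this sketch ignores.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series spanning 54 days (two 27-day windows); values are made up
hours = pd.date_range("2000-01-01", periods=24 * 54, freq=pd.Timedelta(hours=1))
hourly = pd.Series(np.arange(len(hours), dtype=float), index=hours, name="Bmag")

# Average the hourly values into daily and 27-day bins
daily = hourly.resample("1D").mean()
rotation_27d = hourly.resample("27D").mean()

print(len(hourly), len(daily), len(rotation_27d))  # 1296 54 2
```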

## About Hierarchical Data Formats
Hierarchical Data Format (HDF) is a family of open file formats that support large, complex, heterogeneous data. An HDF file uses a directory-like structure, so data can be organized within the file in many different ways, much as you would organize files on your computer. HDF files can also embed metadata, which makes them self-describing.

---

## Analytical Questions
How can we apply novel dimension reduction methods, such as PCA and t-SNE, to obtain an informative low-dimensional representation of solar wind in situ data? How can this low-dimensional representation provide better 2D/3D visualization support than traditional dimension reduction techniques?

## Libraries and global variables

In [56]:
# Standard library imports
import os
import sys
import warnings
from contextlib import suppress
from functools import reduce

# Third-party imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA, KernelPCA
from sklearn.ensemble import RandomForestRegressor
# enable_iterative_imputer must be imported before IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
# from keras.layers import Input, Dense
# from keras.models import Model

# Local application imports
sys.path.append("../src/scripts")
from utilities import (
    parse_hdf_data,
    merge_dataframes,
    missing_occurrences,
    sort_columns_except_key,
    visualize_flag,
    add_datetime_column,
)

# Set the warning filter to ignore all warnings
warnings.filterwarnings("ignore")

In [57]:
# global variables
MISSING_FLAG = -999.900
N_SPLITS = 4

## Data Import

In [58]:
# read data
data_dir = "../data/ace/raw"
swics_1hr_dir = f"{data_dir}/swics_1hr"
swics_2hr_dir = f"{data_dir}/swics_2hr"

mag_df = parse_hdf_data(f"{data_dir}/MAG_data_1hr.txt")
swepam_df = parse_hdf_data(f"{data_dir}/SWEPAM_data_1hr.txt")
epam_df = parse_hdf_data(f"{data_dir}/EPAM_data_1hr.txt")

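# avoid shadowing the built-ins `dir` and `file`; sort for a reproducible order
swics_dfs = []
for swics_dir in [swics_1hr_dir, swics_2hr_dir]:
    for filename in sorted(os.listdir(swics_dir)):
        swics_dfs.append(parse_hdf_data(os.path.join(swics_dir, filename)))
swics_df = pd.concat(swics_dfs)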

In [59]:
ACE_DATASETS = [mag_df, swepam_df, epam_df, swics_df]
ACE_DATASETS_NAMES = ["MAG", "SWEPAM", "EPAM", "SWICS"]

In [60]:
# dtype conversion
for df in ACE_DATASETS:
    df[["year", "day", "hr", "min", "sec"]] = df[
        ["year", "day", "hr", "min", "sec"]
    ].astype(int)

    with suppress(KeyError):
        df['Quality'] = df['Quality'].astype(str)

In [61]:
# datetime conversion and drop redundant features
for df in ACE_DATASETS:
    add_datetime_column(df).drop(
        columns=["year", "day", "hr", "min", "sec", "fp_year", "fp_doy"],
        inplace=True,
    )

# swics_df may contain duplicate records due to the nature of 1.0 and 2.0 data collection
swics_df.drop_duplicates(subset="datetime", inplace=True)
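The `add_datetime_column` helper lives in the local `utilities` module and is not shown here. As a rough idea of what such a conversion involves, the sketch below combines year, day-of-year, hour, minute, and second columns into timestamps. The function name `build_datetime` and the sample values are hypothetical, and the real helper may work differently.

```python
import pandas as pd

def build_datetime(df):
    """Combine year, day-of-year, hour, minute and second columns into timestamps."""
    start_of_year = pd.to_datetime(df["year"].astype(str), format="%Y")
    offset = (
        pd.to_timedelta(df["day"] - 1, unit="D")  # day-of-year is 1-based
        + pd.to_timedelta(df["hr"], unit="h")
        + pd.to_timedelta(df["min"], unit="min")
        + pd.to_timedelta(df["sec"], unit="s")
    )
    return start_of_year + offset

# Made-up sample rows: day 60 of a leap year is 29 February
sample = pd.DataFrame(
    {"year": [2000, 2001], "day": [60, 1], "hr": [12, 0], "min": [30, 0], "sec": [0, 0]}
)
sample["datetime"] = build_datetime(sample)
print(sample["datetime"].tolist())
```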

## Data Cleaning

### Descriptives

In [11]:
for df, df_name in zip(ACE_DATASETS, ACE_DATASETS_NAMES):
    print(f"Dataframe: {df_name}")
    display(df.info())
    display(df.describe())
    print("\n" + ("-" * 20))

Dataframe: MAG
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 231336 entries, 0 to 231335
Data columns (total 26 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   datetime       231336 non-null  datetime64[ns]
 1   ACEepoch       231336 non-null  float64       
 2   SCclock        231336 non-null  float64       
 3   Br             231336 non-null  float64       
 4   Bt             231336 non-null  float64       
 5   Bn             231336 non-null  float64       
 6   Bmag           231336 non-null  float64       
 7   Delta          231336 non-null  float64       
 8   Lambda         231336 non-null  float64       
 9   Bgse_x         231336 non-null  float64       
 10  Bgse_y         231336 non-null  float64       
 11  Bgse_z         231336 non-null  float64       
 12  Bgsm_x         231336 non-null  float64       
 13  Bgsm_y         231336 non-null  float64       
 14  Bgsm_z         231336 non-null  float

None

Unnamed: 0,ACEepoch,SCclock,Br,Bt,Bn,Bmag,Delta,Lambda,Bgse_x,Bgse_y,...,dBrms,sigma_B,fraction_good,N_vectors,pos_gse_x,pos_gse_y,pos_gse_z,pos_gsm_x,pos_gsm_y,pos_gsm_z
count,231336.0,231336.0,231336.0,231336.0,231336.0,231336.0,231336.0,231336.0,231336.0,231336.0,...,231336.0,231336.0,231336.0,231336.0,231336.0,231336.0,231336.0,231336.0,231336.0,231336.0
mean,467465400.0,420595500.0,-3.871007,-3.808264,-3.784931,1.979197,-3.86975,196.062699,-3.718902,-3.781131,...,-1.521951,-3.398425,0.995905,224.077368,1482788.0,-2437.61737,1683.444487,1482804.0,-1896.171801,4651.744091
std,240411900.0,275279900.0,61.575025,61.605103,61.550178,61.91569,67.842521,124.230072,61.5844,61.606392,...,61.575131,61.43907,0.062182,13.990851,69298.38,182760.507192,106453.364131,69320.1,176318.105686,116696.83518
min,51062400.0,697.0,-999.9,-999.9,-999.9,-999.9,-999.9,-999.9,-999.9,-999.9,...,-999.9,-999.9,0.0,0.0,34420.0,-475680.0,-165980.0,34420.0,-463990.0,-265270.0
25%,259263900.0,208202500.0,-2.659,-2.53225,-1.317,3.87775,-18.26825,122.09075,-2.522,-2.513,...,1.165,0.156,1.0,225.0,1432400.0,-182880.0,-104232.5,1432400.0,-169330.0,-88805.0
50%,467465400.0,416404400.0,-0.1415,0.01,0.005,5.077,0.082,178.0555,0.099,-0.052,...,1.895,0.261,1.0,225.0,1480800.0,-2113.15,3099.2,1480800.0,-2782.25,10900.0
75%,675666900.0,624606200.0,2.484,2.481,1.306,6.851,17.9,300.7335,2.624,2.484,...,2.916,0.46,1.0,225.0,1538900.0,179700.0,107750.0,1538900.0,166180.0,107210.0
max,883868400.0,4294967000.0,41.12,35.083,36.638,71.993,89.107,359.998,33.287,45.473,...,40.161,28.231,1.0,226.0,1594800.0,268000.0,164090.0,1594800.0,298140.0,308840.0



--------------------
Dataframe: SWEPAM
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 204768 entries, 0 to 204767
Data columns (total 30 columns):
 #   Column                        Non-Null Count   Dtype         
---  ------                        --------------   -----         
 0   datetime                      204768 non-null  datetime64[ns]
 1   ACEepoch                      204768 non-null  float64       
 2   proton_density                204768 non-null  float64       
 3   proton_temp                   204768 non-null  float64       
 4   He4toprotons                  204768 non-null  float64       
 5   proton_speed                  204768 non-null  float64       
 6   x_dot_GSE                     204768 non-null  float64       
 7   y_dot_GSE                     204768 non-null  float64       
 8   z_dot_GSE                     204768 non-null  float64       
 9   x_dot_RTN                     204768 non-null  float64       
 10  y_dot_RTN                     204768 non

None

Unnamed: 0,ACEepoch,proton_density,proton_temp,He4toprotons,proton_speed,x_dot_GSE,y_dot_GSE,z_dot_GSE,x_dot_RTN,y_dot_RTN,...,pos_gsm_z,Electron_temp,fraction_time_proton_density,fraction_time_proton_temp,fraction_time_He4toprotons,fraction_time_proton_speed,fraction_time_dot_GSE,fraction_time_dot_RTN,fraction_time_Electron_temp,weight
count,204768.0,204768.0,204768.0,204768.0,204768.0,204768.0,204768.0,204768.0,204768.0,204768.0,...,204768.0,204768.0,204768.0,204768.0,204768.0,204768.0,204768.0,204768.0,204768.0,204768.0
mean,433639800.0,-3583.19262,77292.77,-4039.195095,284.306873,-561.650643,-140.401359,-144.888509,280.170499,-141.045238,...,3640.350963,-10000.0,0.580415,0.799835,0.518758,0.931895,0.930986,0.930986,0.0,32.649042
std,212801700.0,4798.960802,76444.3,4906.807287,1221.74197,1132.031778,1178.29515,1177.696664,1232.25754,1178.21703,...,115468.042694,0.0,0.45059,0.333071,0.448238,0.130196,0.131082,0.131082,0.0,25.348204
min,65059200.0,-9999.9,-9999.9,-9999.9004,-9999.9,-9999.9,-9999.9,-9999.9,-9999.9,-9999.9,...,-265280.0,-10000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,249349500.0,-9999.9,27434.75,-9999.9004,353.28,-484.3925,-16.77,-19.19,352.0375,-15.2325,...,-87177.0,-10000.0,0.0,0.8929,0.0,0.9464,0.9464,0.9464,0.0,0.0
50%,433639800.0,2.873,59121.0,0.0131,405.26,-407.06,-1.57,-5.09,403.78,0.79,...,9371.95,-10000.0,0.9123,0.9474,0.7895,0.9643,0.9643,0.9643,0.0,52.0
75%,617930100.0,5.424,110532.5,0.0322,479.88,-354.8775,14.21,9.96,478.0,15.76,...,104800.0,-10000.0,0.9643,0.9643,0.9464,0.9643,0.9643,0.9643,0.0,54.0
max,802220400.0,104.335,1156600.0,0.3454,1187.89,-226.7,290.84,358.4,1183.73,212.81,...,241120.0,-10000.0,0.9825,0.9825,0.9825,0.9825,0.9825,0.9825,0.0,56.0



--------------------
Dataframe: EPAM
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 232351 entries, 0 to 232350
Data columns (total 91 columns):
 #   Column    Non-Null Count   Dtype         
---  ------    --------------   -----         
 0   datetime  232351 non-null  datetime64[ns]
 1   ACEepoch  232351 non-null  float64       
 2   P1        232351 non-null  float64       
 3   P2        232351 non-null  float64       
 4   P3        232351 non-null  float64       
 5   P4        232351 non-null  float64       
 6   P5        232351 non-null  float64       
 7   P6        232351 non-null  float64       
 8   P7        232351 non-null  float64       
 9   P8        232351 non-null  float64       
 10  unc_P1    232351 non-null  float64       
 11  unc_P2    232351 non-null  float64       
 12  unc_P3    232351 non-null  float64       
 13  unc_P4    232351 non-null  float64       
 14  unc_P5    232351 non-null  float64       
 15  unc_P6    232351 non-null  float64       
 16  

None

Unnamed: 0,ACEepoch,P1,P2,P3,P4,P5,P6,P7,P8,unc_P1,...,FP6,FP7,unc_E1,unc_E2,unc_E3,unc_E4,unc_FP5,unc_FP6,unc_FP7,livetime
count,232351.0,232351.0,232351.0,232351.0,232351.0,232351.0,232351.0,232351.0,232351.0,232351.0,...,232351.0,232351.0,232351.0,232351.0,232351.0,232351.0,232351.0,232351.0,232351.0,232351.0
mean,470736000.0,1039.626,834.4632,-13.9947,-491.177701,-639.511981,-714.147444,46.249265,9.983937,-841.980513,...,3.289076e+31,14.952518,-980.091116,-980.091059,-980.091038,-3.066417,-3.060326,-3.04557,-3.052071,3327.123985
std,241466800.0,29672.06,33298.43,12050.79,4048.188829,2237.219437,1142.555217,631.786597,222.343196,364.647106,...,1.585427e+34,275.195568,139.337412,139.337814,139.337962,55.460276,55.460615,55.422628,55.499859,256.586417
min,52506000.0,-999.9,-999.9,-999.9,-999.9,-999.9,-999.9,-999.9,-999.9,-999.9,...,-999.9,-999.9,-999.9,-999.9,-999.9,-999.9,-999.9,-999.9,-999.9,-999.9
25%,261621000.0,-999.9,-999.9,-999.9,-999.9,-999.9,-999.9,0.15842,0.031849,-999.9,...,1.0812,0.078608,-999.9,-999.9,-999.9,0.01403,0.01809,0.02273,0.02249,3324.0
50%,470736000.0,-999.9,-999.9,-999.9,-999.9,-999.9,-999.9,0.22213,0.042992,-999.9,...,1.3965,0.10208,-999.9,-999.9,-999.9,0.02164,0.02831,0.04112,0.04393,3351.0
75%,679851000.0,-999.9,-999.9,-999.9,-999.9,-999.9,-999.9,1.3526,0.20875,-999.9,...,4.4477,0.37869,-999.9,-999.9,-999.9,0.0243,0.03225,0.04686,0.05021,3380.0
max,888966000.0,2781500.0,3976300.0,1072400.0,289950.0,214950.0,119990.0,67757.0,28195.0,0.189,...,7.6422e+36,32541.0,0.09325,0.1021,0.1429,0.4472,0.378,1.0,0.4264,3462.0



--------------------
Dataframe: SWICS
<class 'pandas.core.frame.DataFrame'>
Int64Index: 158627 entries, 0 to 2563
Data columns (total 70 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   datetime     158627 non-null  datetime64[ns]
 1   ACEepoch     158627 non-null  float64       
 2   nHe2         109672 non-null  float64       
 3   nHe2_err     109672 non-null  float64       
 4   vHe2         158627 non-null  float64       
 5   vthHe2       158627 non-null  float64       
 6   vHe2_err     158627 non-null  float64       
 7   vthHe2_err   158627 non-null  float64       
 8   qf_He        158627 non-null  float64       
 9   vC5          109672 non-null  float64       
 10  vthC5        109672 non-null  float64       
 11  vC5_err      109672 non-null  float64       
 12  vthC5_err    109672 non-null  float64       
 13  qf_C5        109672 non-null  float64       
 14  vO6          109672 non-null  float64       
 1

None

Unnamed: 0,ACEepoch,nHe2,nHe2_err,vHe2,vthHe2,vHe2_err,vthHe2_err,qf_He,vC5,vthC5,...,qf_NetoO,MgtoO,MgtoO_err,qf_MgtoO,SitoO,SitoO_err,qf_SitoO,O8to6,O8to6_err,qf_O8to6
count,158627.0,109672.0,109672.0,158627.0,158627.0,158627.0,158627.0,158627.0,109672.0,109672.0,...,8695.0,8695.0,8695.0,8695.0,8695.0,8695.0,8695.0,48955.0,48955.0,48955.0
mean,411259100.0,-301.459735,-301.62323,87.638222,-305.918902,-9999.9,-9999.9,-0.031186,52.110503,-349.563415,...,0.00575,-48.142825,-48.298045,0.007476,-52.722181,-52.898641,0.011041,-442.00187,-438.643672,-0.040466
std,224176200.0,1710.366953,1710.338113,1894.14628,1803.955726,3.63799e-12,3.63799e-12,0.251466,2001.469757,1918.414396,...,0.262202,693.371588,693.360769,0.273539,725.472285,725.459412,0.295457,2055.484937,2054.715604,0.197051
min,66096560.0,-9999.9,-9999.9,-9999.9,-9999.9,-9999.9,-9999.9,-1.0,-9999.9,-9999.9,...,-1.0,-9999.9,-9999.9,-1.0,-9999.9,-9999.9,-1.0,-9999.9,-9999.9,-1.0
25%,229392800.0,0.060517,0.000266,351.1,18.145,-9999.9,-9999.9,0.0,351.5375,17.26775,...,0.0,0.11504,0.002677,0.0,0.137785,0.002768,0.0,0.002583,2.2036,0.0
50%,384123100.0,0.1243,0.000386,411.94,27.646,-9999.9,-9999.9,0.0,417.625,30.825,...,0.0,0.1444,0.003699,0.0,0.16976,0.003776,0.0,0.008336,2.6644,0.0
75%,585256400.0,0.2192,0.00054,502.785,38.149,-9999.9,-9999.9,0.0,514.6125,42.32825,...,0.0,0.18658,0.005508,0.0,0.211245,0.005368,0.0,0.021736,3.4181,0.0
max,872212700.0,2.3191,0.002424,1844.9,201.0,-9999.9,-9999.9,16.0,1877.3,222.29,...,8.0,1.879,0.16378,8.0,1.4423,0.1614,8.0,1.5981,4.1661,0.0



--------------------


### Retain *Good* Quality data

Good data is flagged by the researchers with a value of 0. 

In [62]:
for c, (df, df_name) in enumerate(zip(ACE_DATASETS, ACE_DATASETS_NAMES)):

    with suppress(KeyError):  # not all datasets have the quality flag
        if df_name != "SWICS":
            # keep rows flagged as good, then drop the flag column
            df = df[df["Quality"] == "0.0"].drop(columns=["Quality"])
        else:
            # SWICS carries one quality flag per quantity (qf_* columns).
            # A row is kept if at least one flag is 0 (good) or missing;
            # use .all(axis=1) instead to require every flag to be good.
            qf_cols = df.filter(regex="^qf_").columns
            df = df[(df[qf_cols].isna() | df[qf_cols].eq(0)).any(axis=1)]

    ACE_DATASETS[c] = df

mag_df, swepam_df, epam_df, swics_df = ACE_DATASETS
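The choice between `.any(axis=1)` and `.all(axis=1)` changes which SWICS rows survive the filter. The toy example below, with made-up flag values, shows the difference: `any` keeps a row when at least one flag is good or missing, while `all` keeps it only when every flag is.

```python
import numpy as np
import pandas as pd

# Made-up quality flags: 0 means good, NaN means the flag is absent
flags = pd.DataFrame(
    {
        "qf_He": [0.0, 0.0, 1.0, np.nan],
        "qf_C5": [0.0, 2.0, 1.0, np.nan],
    }
)

good_or_missing = flags.isna() | flags.eq(0)
keep_any = good_or_missing.any(axis=1)  # at least one flag is good or missing
keep_all = good_or_missing.all(axis=1)  # every flag is good or missing

print(keep_any.tolist())  # [True, True, False, True]
print(keep_all.tolist())  # [True, False, False, True]
```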

### Join data

In [40]:
# find unique timestamps
mag_dates, swepam_dates, epam_dates, swics_dates = [
    df.datetime.unique() for df in ACE_DATASETS
]

# find the common dates for 1hr interval data
common_dates_1hr = reduce(
    np.intersect1d, (mag_dates, swepam_dates, epam_dates)
)

# find the common dates for 2hr interval data
common_dates_2hr = reduce(
    np.intersect1d, (mag_dates, swepam_dates, epam_dates, swics_dates)
)

print(len(common_dates_1hr))
print(len(common_dates_2hr))

187272
139207


In [63]:
# join the 1hr and 2hr interval datasets on their timestamps
insitu_df = merge_dataframes(ACE_DATASETS, "datetime")
df = sort_columns_except_key(insitu_df, "datetime")
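`merge_dataframes` comes from the local `utilities` module, so its join semantics are not visible in this notebook. One plausible way to write such a helper is to fold `pd.merge` over the list of frames. The sketch below uses the hypothetical name `merge_on_key` and toy data to show how the `how` argument decides whether timestamps missing from one instrument are dropped or kept as NaN.

```python
from functools import reduce

import pandas as pd

def merge_on_key(frames, key, how="inner"):
    """Merge a list of DataFrames pairwise on a shared key column."""
    return reduce(lambda left, right: pd.merge(left, right, on=key, how=how), frames)

# Toy frames with partially overlapping timestamps (values are made up)
t = pd.to_datetime(["2000-01-01 00:00", "2000-01-01 01:00", "2000-01-01 02:00"])
mag = pd.DataFrame({"datetime": t, "Bmag": [5.0, 5.1, 4.9]})
swepam = pd.DataFrame({"datetime": t[:2], "proton_speed": [400.0, 410.0]})
swics = pd.DataFrame({"datetime": t[[0, 2]], "vHe2": [405.0, 395.0]})

inner = merge_on_key([mag, swepam, swics], "datetime", how="inner")
outer = merge_on_key([mag, swepam, swics], "datetime", how="outer")

print(len(inner), len(outer))  # 1 3
```

An inner join keeps only timestamps present in every dataset, which mirrors the `np.intersect1d` counts computed earlier; an outer join keeps every timestamp and leaves gaps to be imputed.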


### Handling Missing Values

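Missing data are flagged with the fill value -999.900 (`MISSING_FLAG`). The descriptive statistics above also show -9999.9 as the minimum of several SWEPAM and SWICS columns, so that value has to be treated as missing too. Check whether any fill values remain after dropping the records that are not labeled as good quality.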

In [None]:
df_missing = missing_occurrences(df, MISSING_FLAG).sort_values(
    ascending=False, by="Missing_Count"
)
visualize_flag(df_missing)
df.shape
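`missing_occurrences` is another helper from the local `utilities` module. A minimal sketch of the idea, assuming it counts fill values per column, could look like this. The function name `count_fill_values` and the `Missing_Percent` column are hypothetical; only `Missing_Count` is taken from the cell above.

```python
import numpy as np
import pandas as pd

def count_fill_values(df, fill_values):
    """Count, per numeric column, how many entries equal one of the fill values."""
    numeric = df.select_dtypes(include=[np.number])
    counts = numeric.isin(fill_values).sum()
    summary = pd.DataFrame(
        {"Missing_Count": counts, "Missing_Percent": 100 * counts / len(df)}
    )
    return summary.sort_values(by="Missing_Count", ascending=False)

# Toy data with both fill values seen in the descriptive statistics
toy = pd.DataFrame(
    {
        "Bmag": [5.0, -999.9, 4.8, -999.9],
        "proton_speed": [400.0, 410.0, -9999.9, 395.0],
        "vHe2": [405.0, 412.0, 398.0, 390.0],
    }
)
summary = count_fill_values(toy, fill_values=[-999.9, -9999.9])
print(summary)
```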

#### Imputation Methods 

Univariate time series imputation methods: 
- Mean (median)
- Last Observed Carried Forward
- Linear Interpolation
- Polynomial Interpolation
- Kalman Filter
- Moving Average
- Random
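
Several of the univariate methods listed above are available directly in pandas. The sketch below applies four of them to a short made-up series with gaps. The window size and the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Short series with gaps (values are made up)
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])

mean_filled = s.fillna(s.mean())              # mean imputation
locf_filled = s.ffill()                       # last observation carried forward
linear_filled = s.interpolate(method="linear")  # linear interpolation
# centered moving average over the available neighbours
rolling_filled = s.fillna(s.rolling(window=3, min_periods=1, center=True).mean())

print(locf_filled.tolist())     # [1.0, 1.0, 3.0, 3.0, 3.0, 6.0]
print(linear_filled.tolist())   # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(rolling_filled.tolist())  # [1.0, 2.0, 3.0, 3.0, 6.0, 6.0]
```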

Multivariate time series imputation methods:
- K-Nearest Neighbors
- Random Forest
- Multivariate Singular Spectrum Analysis
- Expectation-Maximization
- Multiple Imputation by Chained Equations

In [None]:
# Prepare for imputation: mark both fill values as NaN and keep numeric features
df = df.replace([MISSING_FLAG, -9999.9], np.nan)
X_incomplete = df.select_dtypes(include=[np.number])
# Imputers drop features that are entirely missing, so remove them up front
X_incomplete = X_incomplete.dropna(axis=1, how="all")

# KNN imputer with 3 neighbors
# (add_indicator is left off so the output columns match the input columns)
knn_imputer = KNNImputer(n_neighbors=3)
knn_imputed_data = knn_imputer.fit_transform(X_incomplete)
df_knn_imputed = pd.DataFrame(
    knn_imputed_data, columns=X_incomplete.columns, index=X_incomplete.index
)
df_knn_imputed.insert(0, "datetime", df["datetime"])

In [None]:
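# Create the Random Forest imputer
rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(), random_state=0, max_iter=10
)

# Perform the imputation and keep the result in its own DataFrame
imputed_data = rf_imputer.fit_transform(X_incomplete)
df_rf_imputed = pd.DataFrame(
    imputed_data, columns=X_incomplete.columns, index=X_incomplete.index
)
df_rf_imputed.insert(0, "datetime", df["datetime"])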
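
The notebook does not show how the imputers are compared. One possible check, sketched here on synthetic data rather than the ACE measurements, is to hide a fraction of known values, impute them, and measure the error on the hidden entries. The helper `masked_rmse`, the 10% masking rate, and the synthetic features are all assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(0)

# Synthetic, strongly correlated features standing in for real measurements
base = rng.normal(size=(200, 1))
X_true = np.hstack([base + 0.1 * rng.normal(size=(200, 1)) for _ in range(4)])

# Hide roughly 10% of the entries at random
mask = rng.random(X_true.shape) < 0.10
X_missing = X_true.copy()
X_missing[mask] = np.nan

def masked_rmse(imputer):
    """RMSE between imputed and true values on the artificially hidden entries."""
    X_filled = imputer.fit_transform(X_missing)
    return float(np.sqrt(np.mean((X_filled[mask] - X_true[mask]) ** 2)))

scores = {
    "mean": masked_rmse(SimpleImputer(strategy="mean")),
    "knn": masked_rmse(KNNImputer(n_neighbors=3)),
}
print(scores)
```

On real data the same masking can be applied to rows that are fully observed, which gives a like-for-like error for each candidate imputer before committing to one.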

## Data Transformation

### Log transformation of all quantities

In [None]:
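# TODO: select the subset that needs log transformation
# Placeholder: log-transform only numeric columns whose values are all positive
numeric_df = df.select_dtypes(include=[np.number])
df_log = np.log10(numeric_df.loc[:, (numeric_df.fillna(1) > 0).all()])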

## Dimensionality Reduction

### Dimensionality Reduction Using PCA

In [None]:
# Assumption: X is the imputed feature matrix from the imputation step above
X = imputed_data

# Fit PCA
pca = PCA().fit(X)

# Plot the explained variance ratio of each component
features = range(pca.n_components_)
plt.bar(features, pca.explained_variance_ratio_, color="black")
plt.xlabel("PCA component")
plt.ylabel("explained variance ratio")
plt.xticks(features)

# Save components to a DataFrame
PCA_components = pd.DataFrame(pca.transform(X))

plt.show()

In [None]:
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
num_components = np.where(cumulative_variance > 0.95)[0][0] + 1
print("Number of components to explain 95% Variance: ", num_components)
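
scikit-learn can also pick the component count itself: passing a float between 0 and 1 as `n_components` keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below checks this against the manual cumulative-sum approach on synthetic data. The data and the standardization step are assumptions for illustration; PCA is scale-sensitive, so standardizing is a common precaution when features have very different units.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 10 observed features driven by 3 latent factors plus noise
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(500, 10))
X_scaled = StandardScaler().fit_transform(X_demo)

# Manual selection from the cumulative explained variance
cumulative = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)
n_manual = int(np.argmax(cumulative > 0.95)) + 1

# Let PCA choose the number of components for a 95% variance target
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X_scaled)

print(n_manual, pca_95.n_components_)
```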

In [None]:
# Create a PCA that retains enough components to explain 95% of the variance
pca = PCA(n_components=num_components, whiten=True)

# Conduct PCA
X_pca = pca.fit_transform(X)

# Show the new data
print("original shape:   ", X.shape)
print("transformed shape:", X_pca.shape)

# The transformed data has been reduced to num_components dimensions
pca_df = pd.DataFrame(
    data=X_pca,
)
print(pca_df.head())

### Dimensionality Reduction Using Kernel PCA

In [None]:
# Fit Kernel PCA with n_components=None to compute all components
kpca = KernelPCA(n_components=None, kernel="rbf")
kpca.fit(X)

# Get eigenvalues (exposed as eigenvalues_ since scikit-learn 1.0; formerly lambdas_)
eigenvalues = kpca.eigenvalues_

# Plot eigenvalues
plt.plot(eigenvalues, "bo-")
plt.xlabel("Index")
plt.ylabel("Eigenvalue")
plt.show()

In this plot, the x-axis represents the index of each component (in descending order of eigenvalue), and the y-axis represents the corresponding eigenvalue. A common heuristic is to keep the components before the "elbow" of the plot, the point where the eigenvalues level off and each additional component contributes little.
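
Kernel PCA builds an n-by-n kernel matrix, so fitting it on every row of a large dataset can be impractical. A hedged sketch of one workaround is to inspect the eigenvalue spectrum on a random subsample with a capped number of components. The sample size, `gamma`, and the synthetic data below are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)

# Stand-in for a random subsample of the feature matrix (synthetic values)
X_demo = rng.normal(size=(300, 5))

# Compute only the leading 20 components
kpca = KernelPCA(n_components=20, kernel="rbf", gamma=0.1).fit(X_demo)

# eigenvalues_ is the attribute name in scikit-learn 1.0+; older releases used lambdas_
eigenvalues = kpca.eigenvalues_ if hasattr(kpca, "eigenvalues_") else kpca.lambdas_

# Share of each eigenvalue among the retained components only
share = eigenvalues / eigenvalues.sum()
print(np.round(share[:5], 3))
```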

### Dimensionality Reduction Using Autoencoders

In [None]:
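# Keras imports (left commented out in the imports cell at the top of the notebook)
from keras.layers import Input, Dense
from keras.models import Model

# Define the size of the encoded representation
encoding_dim = 2  # 2-dimensional encoded representation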

# Define the input layer
input_img = Input(shape=(X.shape[1],))

# Define the encoded layer
encoded = Dense(encoding_dim, activation="relu")(input_img)

# Define the decoded layer
decoded = Dense(X.shape[1], activation="sigmoid")(encoded)

# Define the autoencoder model
autoencoder = Model(input_img, decoded)

# Define the encoder model
encoder = Model(input_img, encoded)

# Define the decoder model
encoded_input = Input(shape=(encoding_dim,))
decoder_layer = autoencoder.layers[-1]
decoder = Model(encoded_input, decoder_layer(encoded_input))

# Compile the autoencoder
# Note: a sigmoid output with binary cross-entropy assumes X is scaled to [0, 1]
autoencoder.compile(optimizer="adadelta", loss="binary_crossentropy")

# Train the autoencoder
autoencoder.fit(X, X, epochs=50, batch_size=256, shuffle=True)

# Use the encoder to reduce the dimensionality of the data
X_encoded = encoder.predict(X)

print("original shape:   ", X.shape)
print("transformed shape:", X_encoded.shape)

## Self-Organizing Maps