# Exploratory Data Analysis of ACE Satellite Mission Data

## Imports and Configuration

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from datetime import date
from sklearn.impute import KNNImputer
from lets_plot import *

In [None]:
# Set Seaborn style and color palette
sns.set_style("darkgrid")
sns.set_palette("Pastel1")

## Read Data

## ACE Exploration

In [None]:
print(f"Shape: {master_clean.shape}")
master_clean.info()

In [None]:
# investigate missing values
print(master_clean.isna().sum())

In [None]:

(
    ggplot()
    + geom_bar(
        aes(x="timestamp", y="Anis_Index"),
        data=master_clean,
        sampling="none",
        color="#1100FF",
        fill="#1100FF",
        stat="identity",
    )
    + ggtitle("Anisotropy Index (August 2000 - Present)")
    + scale_y_continuous()
)

The anisotropy index has not been calculated since October 2011; the SWPC README
for the data flags these observations with the missing-value code -1.00. Because
more than half of the observations are missing, this feature will be dropped.
The extent of the missing data was not caught in the first data cleaning pass
because, unlike the other features in the data set, the Anisotropy Index has no
Status Index to flag bad data.
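
Because the gap is coded as -1.00 rather than NaN, `isna()` does not count it. The sketch below shows one way to measure how much of a column holds the sentinel. The `demo` frame and its values are made up for illustration and stand in for `master_clean`; only the -1.00 code comes from the README.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for master_clean; the values are invented for illustration.
demo = pd.DataFrame(
    {
        "timestamp": pd.date_range("2011-09-01", periods=6, freq="MS"),
        "Anis_Index": [0.42, 0.37, -1.00, -1.00, -1.00, -1.00],
    }
)

SENTINEL = -1.00  # missing-value code described in the SWPC README

# isna() reports no gaps because the sentinel is an ordinary float
nan_count = int(demo["Anis_Index"].isna().sum())

# Fraction of rows that hold the sentinel value
sentinel_share = float((demo["Anis_Index"] == SENTINEL).mean())

# Recode the sentinel as NaN so later summaries treat it as missing
recoded = demo["Anis_Index"].replace(SENTINEL, np.nan)

print(f"NaN count before recoding: {nan_count}")
print(f"Share of sentinel values: {sentinel_share:.1%}")
print(f"NaN count after recoding: {int(recoded.isna().sum())}")
```

Running the same comparison on the real column before dropping it would document how much data is actually missing.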


In [None]:
# Dropping Anis_Index Feature
master_clean = master_clean.drop(columns=["Anis_Index"])

In [None]:
# percentage of unique values in each column, highest first
(master_clean.nunique() / len(master_clean) * 100).sort_values(ascending=False)

In [None]:
master_clean.describe()

### Univariate Analysis


In [None]:
# Identify numerical columns
ACE_NUMERICAL_COLUMNS = master_clean.select_dtypes(include=["int64", "float64"]).columns

In [None]:
# Plot distribution of each numerical feature on a two-column grid
n_features = len(ACE_NUMERICAL_COLUMNS)
n_rows = (n_features + 1) // 2  # two plots per row, rounded up
fig = plt.figure(figsize=(14, n_rows * 3))
for idx, feature in enumerate(np.sort(ACE_NUMERICAL_COLUMNS), 1):
    plt.subplot(n_rows, 2, idx)
    sns.histplot(master_clean[feature], kde=True, bins=50, color="skyblue")

    # Add lines for mean and median
    plt.axvline(master_clean[feature].mean(),
                color="r", linestyle="--", label="Mean")
    plt.axvline(
        master_clean[feature].median(), color="g", linestyle="-", label="Median"
    )

    plt.title(
        f"{feature} | Skewness: {round(master_clean[feature].skew(), 2)}")
    plt.legend()  # Add a legend

# Adjust layout and show plots
plt.tight_layout()
fig.suptitle(
    "Distribution of ACE numerical features", fontsize=20, weight="bold", y=1.02
)
plt.show()

### Bivariate Analysis


In [None]:
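# Pair plot of the numerical features; pairplot creates its own figure,
# so a separate plt.figure() call would only leave an empty figure behind
grid = sns.pairplot(master_clean[ACE_NUMERICAL_COLUMNS], corner=True, diag_kind="kde")

# Title the whole grid rather than only the last subplot
grid.fig.suptitle("Pair Plot for DataFrame", y=1.02)
plt.show()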

In [None]:
# Correlation heatmap of the ACE numerical features
plt.figure(figsize=(15, 10))

# Using Seaborn to create a heatmap
sns.heatmap(
    master_clean[ACE_NUMERICAL_COLUMNS].corr(),
    annot=True,
    fmt=".2f",
    cmap="Pastel2",
    linewidths=2,
)

plt.title("ACE Data Correlation Heatmap")
plt.show()

Based on the correlation matrix, the following patterns were noted:

- Integral Proton Flux (`>10 MeV` vs. `>30 MeV`)
  - The two variables are highly positively correlated (0.95): if the proton flux is high at `>10 MeV`, it is very likely also high for the integral calculated at `>30 MeV`. Algorithm training may therefore need only one proton flux variable.
- GSM Coordinates (`Bx`, `By`, `Bz`, `Bt`, `Long`, `Lat`)
  - To reduce the risk of including confounding variables in the algorithm, we may not need the Longitude and Latitude coordinates. They relate to where the activity is located relative to Earth, while the "B" variables explicitly describe the amplitude and direction of magnetic activity in the Sun's magnetic field in addition to locational coordinates.
- GSE Coordinates (`X`, `Y`, `Z`)
  - These coordinates are the predicted satellite locations based on location coordinates on Earth. They do not appear to have a strong relationship with the key features of interest, which suggests that the captured data is not heavily influenced by the satellite location. It is safe to drop these features during algorithm training because they only describe satellite location and do not directly measure solar wind properties.
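
Reading pairs off a heatmap by eye is easy to get wrong, so a short helper can list the feature pairs whose absolute correlation exceeds a cutoff. This is a minimal sketch: the `high_corr_pairs` helper, the 0.9 threshold, the synthetic data, and the column names are all assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=500)

# Synthetic stand-in for the ACE features; column names are hypothetical.
demo = pd.DataFrame(
    {
        "proton_gt10": base,
        "proton_gt30": base * 0.9 + rng.normal(scale=0.1, size=500),
        "gse_x": rng.normal(size=500),
    }
)


def high_corr_pairs(frame, threshold=0.9):
    """Return (feature_a, feature_b, r) for pairs with |r| >= threshold."""
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # The comparison is False for NaN, so masked cells drop out here too
    pairs = pairs[pairs.abs() >= threshold]
    return [(a, b, float(r)) for (a, b), r in pairs.items()]


pairs = high_corr_pairs(demo)
for feature_a, feature_b, r in pairs:
    print(f"{feature_a} vs {feature_b}: r = {r:.2f}")
```

Applied to `master_clean[ACE_NUMERICAL_COLUMNS]`, the same helper would give a shortlist of candidate features to drop before training.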

## HCS Indexes Exploration

Data provided by University of Michigan Climate & Space Sciences and Engineering, Liang Zhao, PhD

The heliospheric current sheet (HCS) is a surface separating regions of the
heliosphere where the interplanetary magnetic field points toward and away from
the Sun. An electrical current flows within this surface, forming a current
sheet confined to this surface. The shape of the current sheet results from the
influence of the Sun's rotating magnetic field on the plasma in the
interplanetary medium.

It can be very challenging to evaluate activity occurring within the HCS. Dr.
Liang Zhao, a research professor at the University of Michigan, introduced two
novel parameters that evaluate the global complexity of the Sun's magnetic field
and track the solar cycle:

- SD Index: The standard deviation of the latitude of the HCS
- SL Index: Integrated slope of the HCS
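
To make the two definitions concrete, the sketch below computes both quantities for a toy current sheet. It is only an illustration under stated assumptions: the sine-shaped HCS, the one-degree longitude grid, and the reading of "integrated slope" as the summed absolute change in latitude around the Sun are all assumptions here, not Dr. Zhao's published procedure.

```python
import numpy as np

# Hypothetical HCS shape: one latitude value (degrees) per degree of longitude.
# A sine wave is used purely for illustration.
lon_deg = np.arange(360)
hcs_lat_deg = 30.0 * np.sin(np.radians(lon_deg))

# SD-style index: standard deviation of the HCS latitude
sd_index = float(np.std(hcs_lat_deg))

# SL-style index (ASSUMPTION): sum of absolute latitude changes around the
# full 360-degree loop, i.e. a discrete integral of |d(lat)/d(lon)|
closed_lat = np.append(hcs_lat_deg, hcs_lat_deg[0])
sl_index = float(np.sum(np.abs(np.diff(closed_lat))))

print(f"SD-style index: {sd_index:.2f} deg")
print(f"SL-style index: {sl_index:.2f} deg")
```

Under these assumptions, a flatter sheet yields smaller values for both quantities, and a more warped sheet yields larger ones.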

The HCS SL and SD indices were provided by Dr. Liang Zhao. Monthly average
sunspot numbers dating back to 1749 can be found on the
[Solar Influences Data Analysis Center](https://www.sidc.be/SILSO/infosnmtot)
website.

### Why does this matter?

Solar activity such as sunspots can be used to help predict space weather, the
state of the ionosphere, and conditions relevant to radio and satellite
communications.

The sunspot cycle is a nearly 11-year change in the Sun's activity, measured by
variations in the number of sunspots observed on the Sun's surface.
Sunspots are temporary dark spots on the Sun's surface caused by concentrations
of magnetic flux that inhibit convection. Sunspots typically appear in the
active latitude regions close to the Sun's equator.

For more information on SD and SL index calculations, please read the PowerPoint
README file in the GitHub Repo.


In [None]:
# importing HCS data as dataframe

index_data = pd.read_csv(
    "/data/workspace_files/HCS_Data/HCS_parameters_update_CR2257.txt",
    engine="python",
    header=0,
    sep=r",|\s+",  # raw string: split on commas or runs of whitespace
)
sunspot_data = pd.read_csv(
    "/data/workspace_files/HCS_Data/SN_m_tot_V2.0.csv",
    skiprows=2714,
    sep=";",
    header=0,
    names=[
        "year",
        "month",
        "fyear_CS",
        "avg_spNum",
        "sd_spNum",
        "num_obvs",
        "definitve_marker",
    ],
)

For this project, we are directed to use only the fractional year (`fyear_CS`),
SD Index (`SD_70`), SL Index (`SL_70`), and monthly average sunspot number
(`avg_spNum`). We will still take a quick look for patterns in the data outside
of that scope.
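
One way to narrow both tables to the in-scope columns and line them up on fractional year is sketched below. The small frames and their values are invented for illustration, and the nearest-timestamp matching via `pd.merge_asof` is an assumption about how the two series could be aligned, not a step taken in this notebook.

```python
import pandas as pd

# Hypothetical stand-ins for index_data and sunspot_data; values are invented.
hcs_demo = pd.DataFrame(
    {
        "fyear_CS": [2000.04, 2000.12, 2000.19],
        "SD_70": [20.1, 22.4, 21.7],
        "SL_70": [3.1, 3.4, 3.3],
        "out_of_scope": [1, 2, 3],
    }
)
sunspot_demo = pd.DataFrame(
    {
        "fyear_CS": [2000.042, 2000.124, 2000.206],
        "avg_spNum": [133.1, 165.7, 217.7],
        "num_obvs": [600, 620, 640],
    }
)

# Keep only the in-scope columns; merge_asof requires sorted keys
hcs_scope = hcs_demo[["fyear_CS", "SD_70", "SL_70"]].sort_values("fyear_CS")
sunspot_scope = sunspot_demo[["fyear_CS", "avg_spNum"]].sort_values("fyear_CS")

# Attach to each HCS row the sunspot number closest in fractional year
merged = pd.merge_asof(hcs_scope, sunspot_scope, on="fyear_CS", direction="nearest")
print(merged)
```

The merged frame keeps the fractional-year values from the left-hand (HCS) table, so the sunspot numbers are matched to the HCS sampling times rather than the reverse.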


In [None]:
index_data.describe()

In [None]:
sunspot_data.describe()

In [None]:
from lets_plot.bistro import corr
from lets_plot import *

(
    corr.corr_plot(index_data).tiles().build()
    + ggsize(500, 370)
    + ggtitle("HCS Index Data Correlation Matrix")
)

In [None]:
from lets_plot.bistro import corr
from lets_plot import *

(
    corr.corr_plot(sunspot_data).tiles().build()
    + ggsize(500, 370)
    + ggtitle("Total Monthly Sunspot Number Correlation Matrix")
)

In [None]:
from lets_plot import *

(
    ggplot()
    # SD index in blue
    + geom_point(
        aes(x="fyear_CS", y="SD_70"), data=index_data, sampling="none", color="#1100FF"
    )
    # SL index in orange
    + geom_point(
        aes(x="fyear_CS", y="SL_70"), data=index_data, sampling="none", color="#ff8800"
    )
    + ggtitle("SD and SL Index vs. Fractional Year")
    + scale_y_log10()
)

In [None]:
from lets_plot import *

(
    ggplot()
    + geom_point(
        aes(x="fyear_CS", y="avg_spNum"),
        data=sunspot_data,
        # len() counts rows (plotted points); .size would count rows x columns
        sampling="none" if len(sunspot_data) < 2500 else sampling_systematic(n=2500),
    )
    + ggtitle("Monthly Total Average Sunspot Number vs. Fractional Year")
)