# In-Situ Data Dimensionality Reduction

**Analytical Questions**

- How can we apply dimensionality reduction methods, such as PCA and t-SNE, to obtain an informative low-dimensional representation of solar wind in-situ data?
- How can this low-dimensional representation provide better 2D/3D visualization support than traditional dimension reduction techniques?


For this project, we will explore three dimensionality reduction methods on the ACE spacecraft in-situ data:

- **Principal Component Analysis (PCA)**: PCA is a technique that transforms the data into a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.
  
- **t-Distributed Stochastic Neighbor Embedding (t-SNE)**: t-SNE is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets. It uses a probabilistic approach to model the similarity between points in high-dimensional and low-dimensional space.
   
- **Functional PCA (FPCA)**: FPCA extends PCA to functional data. Each time series is treated as a single curve, and the principal components are themselves functions of time that describe the dominant modes of variation across curves. This method can help visualize the variation of solar wind properties over time.
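The variance-ordering property of PCA described in the list above can be checked on a small example. This is a minimal sketch on synthetic data (the array `X_demo` and its feature spreads are made up for illustration and are not the ACE measurements):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in data: 3 independent features with very different spreads
X_demo = rng.normal(size=(500, 3)) * np.array([5.0, 2.0, 0.5])

pca_demo = PCA(n_components=3)
scores = pca_demo.fit_transform(X_demo)

# Fraction of the total variance carried by each principal component
ratios = pca_demo.explained_variance_ratio_
print(ratios)

# Sample variance of the projected data along each component
score_var = scores.var(axis=0, ddof=1)
print(score_var)
```

The ratios come out in descending order and sum to 1, and the variance of the projected data along each component matches `explained_variance_`.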
  
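Each method was chosen for its ability to handle high-dimensional datasets. PCA is a linear method, so it can only capture linear structure in the data. t-SNE can also capture non-linear structure, and the FPCA pipeline used here includes a non-linear time-warping alignment step. Before applying any of the methods, we rescale the features with RobustScaler. This is a linear, per-feature transformation (centering on the median and dividing by the interquartile range), so it puts the features on comparable scales without changing the relationships between them.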
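Because robust scaling is a linear, per-feature transformation, it leaves the linear correlation between features unchanged. The sketch below checks this on synthetic data. The scaling is written out by hand in NumPy (median and interquartile range), which is assumed here to mirror the default behaviour of `RobustScaler`:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two correlated synthetic features (illustrative values only)
a = rng.normal(size=1000)
b = 0.8 * a + 0.6 * rng.normal(size=1000)
X_demo = np.column_stack([a, b])

# Robust scaling by hand: subtract the median, divide by the interquartile range
median = np.median(X_demo, axis=0)
q25, q75 = np.percentile(X_demo, [25, 75], axis=0)
X_scaled = (X_demo - median) / (q75 - q25)

# Pearson correlation between the two features before and after scaling
corr_before = np.corrcoef(X_demo, rowvar=False)[0, 1]
corr_after = np.corrcoef(X_scaled, rowvar=False)[0, 1]
print(corr_before, corr_after)
```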

**Min-max normalization vs. StandardScaler vs. RobustScaler**

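Min-max normalization is very sensitive to outliers: a single extreme value stretches the range and compresses the remaining data into a narrow interval. StandardScaler centers each feature on its mean and scales it to unit variance. It does not make the data normally distributed, and both the mean and the standard deviation are sensitive to outliers. From the EDA we know that the data is not normally distributed and that several features contain outliers.
RobustScaler centers each feature on its median and scales it by the interquartile range, so it is resilient to outliers while, like min-max normalization, preserving the relationships in the original data.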
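The effect of an outlier on the three scalers can be seen on a toy column. This is a minimal sketch with made-up values (five ordinary readings plus one extreme value), not the ACE data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Five ordinary readings plus one extreme outlier (synthetic values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0]).reshape(-1, 1)

x_minmax = MinMaxScaler().fit_transform(x).ravel()
x_standard = StandardScaler().fit_transform(x).ravel()
x_robust = RobustScaler().fit_transform(x).ravel()

# Spread of the five ordinary readings after each scaling
for name, scaled in [
    ("min-max", x_minmax),
    ("standard", x_standard),
    ("robust", x_robust),
]:
    inlier_range = scaled[:5].max() - scaled[:5].min()
    print(f"{name:>8}: inlier range = {inlier_range:.4f}")
```

With min-max and standard scaling the five ordinary readings end up squeezed into a very narrow interval, whereas robust scaling keeps them well separated because the median and interquartile range are barely affected by the outlier.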

The performance of each method will be assessed on _data visualization quality_: how well the reduced data captures the original data's patterns, clusters, and outliers, as judged from plots such as scatter plots, heat maps, or silhouette plots.
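Visual inspection can be complemented with a number. One possible option, which is not used in this notebook and is shown only as an illustration, is scikit-learn's `trustworthiness` score, which measures how well local neighbourhoods are preserved in the reduced space (1.0 is best). The sketch below uses synthetic data in place of the scaled ACE features:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

rng = np.random.default_rng(0)

# Synthetic stand-in for a scaled feature matrix: 200 samples, 6 features
X_demo = rng.normal(size=(200, 6))

# A lossy 2-D embedding and a lossless 6-D rotation of the same data
emb_2d = PCA(n_components=2).fit_transform(X_demo)
emb_full = PCA(n_components=6).fit_transform(X_demo)

# Share of each point's nearest neighbours that are preserved by the embedding
t_2d = trustworthiness(X_demo, emb_2d, n_neighbors=5)
t_full = trustworthiness(X_demo, emb_full, n_neighbors=5)
print(f"2 components: {t_2d:.3f}, 6 components: {t_full:.3f}")
```

Keeping all components only rotates the data, so neighbourhoods are preserved and the score is at its maximum. Dropping components lowers it.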


In [10]:
# Data Processing Libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import RobustScaler

# Dimension Reduction Libraries
from sklearn.decomposition import PCA
from sklearn import manifold
from fdasrsf import fPCA, time_warping, fdawarp, fdahpca

# Visualization Libraries
from dash import Dash, dcc, html, Input, Output
import plotly.express as px
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

import seaborn as sns

## Data Preparation

In [2]:
df = pd.read_csv(
    "/data/workspace_files/ACE_Data/2024-04-19_ace_master_clean_1hr.csv", sep=","
)

In [3]:
# Dropping Features we do not need in training data. Refer to notebook 02_EDA for more information.
X = df.drop(
    columns=[
        "Year",
        "Day",
        "Month",
        "Seconds_OTD",
        "gmt",
        "timestamp",
        "Anis_Index",
        ">30_MeV",
        "Long",
        "Lat",
        "X",
        "Y",
        "Z",
    ]
)
Y = df["timestamp"]

In [4]:
# Feature normalization
scaler = RobustScaler()
scaler.fit(X)  # fit() returns the fitted scaler itself, not the scaled data

print("number of features: {}".format(scaler.n_features_in_))
print("number of observations: {}".format(X.shape[0]))

In [5]:
normalized_Xtrain = scaler.fit_transform(X)

## Creating low dimensional data spaces

### 1. PCA

In [6]:
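# PCA: fit with all components to inspect the explained variance
pca = PCA()
pca.fit(normalized_Xtrain)  # fit() returns the fitted estimator, not the projected data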

In [7]:
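# Determine how many components are needed to describe the data
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
# Plot against 1..N so the x-axis reads as a component count, not a 0-based index
plt.plot(np.arange(1, len(cumulative_variance) + 1), cumulative_variance)
plt.xlabel("number of components")
plt.ylabel("cumulative explained variance")
plt.title("Principal Component Analysis (PCA)")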

The curve above shows how much of the variance is contained in the first N components. It begins to level out at around 99% cumulative explained variance (5 components). This means we can reduce the data from 15 dimensions to 5 with little loss of information.
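Reading the number of components off the curve can also be done programmatically. The sketch below uses synthetic data (5 latent factors mixed into 15 features plus a little noise; all names and values are made up for illustration) and a 99% threshold chosen to match the discussion above:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 5 latent factors mixed into 15 observed features, plus small noise
latent = rng.normal(size=(500, 5)) * np.array([5.0, 4.0, 3.0, 2.0, 1.0])
mixing = rng.normal(size=(5, 15))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(500, 15))

pca_demo = PCA().fit(X_demo)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance reaches the threshold
threshold = 0.99  # illustrative cut-off
n_keep = int(np.argmax(cumulative >= threshold)) + 1
print(n_keep)
```

scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components` (with `svd_solver="full"`), in which case it selects the number of components needed to exceed that share of the variance.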

In [8]:
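# Find the 3 most important features of each component, ranked by absolute loading
num_components = 5
pca = PCA(n_components=num_components)
pca.fit(normalized_Xtrain)

# Put the loadings in a dataframe (rows: components, columns: features)
pca_results = pd.DataFrame(pca.components_, columns=X.columns)

pca_results["Component"] = pca_results.index
pca_results = pca_results.set_index("Component")

# Rank by absolute value so that large negative loadings also count as important
top3_idx = np.argsort(-np.abs(pca_results.values), axis=1)[:, :3]
top3feat_df = pd.DataFrame(
    # Index with a NumPy array: recent pandas versions reject 2-D indexing on an Index
    pca_results.columns.to_numpy()[top3_idx],
    columns=["1st", "2nd", "3rd"],
)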

top3feat_df

In [14]:
pca_results
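When ranking the features that contribute most to a component, the sign of a loading only indicates direction, so ranking by absolute value avoids overlooking strongly negative loadings. A small sketch with made-up loadings and hypothetical feature names (not the actual ACE columns):

```python
import numpy as np

# Hypothetical feature names and loadings for two components
feature_names = np.array(["Bx", "By", "Bz", "speed"])
loadings = np.array(
    [
        [0.10, -0.90, 0.40, 0.05],
        [0.70, 0.05, -0.10, -0.68],
    ]
)

# Rank by absolute loading: the strongly negative "By" is ranked first for component 0
abs_order = np.argsort(-np.abs(loadings), axis=1)[:, :3]
top3_abs = feature_names[abs_order]

# Rank by signed loading: "By" drops out of the top 3 for component 0
signed_order = np.argsort(-loadings, axis=1)[:, :3]
top3_signed = feature_names[signed_order]

print(top3_abs)
print(top3_signed)
```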

### 2. t-SNE

In [9]:
# t-SNE: embed the scaled features into 3 dimensions
tsne = manifold.TSNE(n_components=3, init="pca", random_state=42)
X_tsne = tsne.fit_transform(normalized_Xtrain)
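t-SNE can be slow on large datasets, so it may be useful to try out settings on a random subsample before embedding the full feature matrix. The sketch below is only an illustration: `X_full` is a synthetic stand-in for the scaled features, and the subsample size and perplexity are arbitrary choices rather than values used in this notebook.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)

# Synthetic stand-in for the scaled feature matrix (2000 samples, 15 features)
X_full = rng.normal(size=(2000, 15))

# Draw a random subsample without replacement
n_sub = 300
idx = rng.choice(X_full.shape[0], size=n_sub, replace=False)
X_sub = X_full[idx]

# Same settings as the cell above, with an explicit perplexity (must be below n_sub)
tsne_demo = TSNE(n_components=3, init="pca", perplexity=30, random_state=42)
X_sub_tsne = tsne_demo.fit_transform(X_sub)
print(X_sub_tsne.shape)
```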

### 3. FPCA

The pipeline below follows a [Medium](https://towardsdatascience.com/beyond-classic-pca-functional-principal-components-analysis-fpca-applied-to-time-series-with-python-914c058f47a0) article by Pierre-Louis Bescond.

In [11]:
# Data processing specific to FPCA
file_path = "/data/workspace_files/ACE_Data/2024-04-19_ace_master_clean_1hr.csv"
df = pd.read_csv(file_path, sep=",")

# Dropping Features we do not need in training data. Refer to notebook 02_EDA for more information.
df = df.drop(
    columns=[
        "Year",
        "Day",
        "Month",
        "Seconds_OTD",
        "gmt",
        "Anis_Index",
        ">30_MeV",
        "Long",
        "Lat",
        "X",
        "Y",
        "Z",
    ]
)
df = df.set_index("timestamp")

In [12]:
# Feature Normalization
scaler = RobustScaler()
normalized_df = scaler.fit_transform(df)

ace_df = pd.DataFrame(normalized_df, index=df.index, columns=df.columns)

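# Convert the DataFrame to a NumPy array, keeping rows from position 140000 onward
# (rows are time steps, columns are features, so each feature is one curve)
f = ace_df.iloc[140000:].to_numpy().astype(float)

# Create a float vector between 0 and 1 to serve as the time index
time = np.linspace(0, 1, len(f))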

In [16]:
# Align time-series
warp_f = time_warping.fdawarp(f, time)
warp_f.srsf_align()

warp_f.plot()

In [17]:
# Functional Principal Components Analysis

# Define the FPCA as a vertical analysis
fPCA_analysis = fPCA.fdavpca(warp_f)

# Run the FPCA on a 5 components basis
fPCA_analysis.calc_fpca(no=5)
fPCA_analysis.plot()
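The idea behind functional PCA can be sketched with NumPy alone. This is a deliberately simplified illustration on synthetic curves: it skips the time-warping alignment step and works on the raw discretized curves, so it is not a reproduction of what `fdasrsf` computes. All names and values below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_curves = 200, 30
t = np.linspace(0, 1, n_points)

# Synthetic curves: a shared trend plus a dominant sine mode and a weak cosine mode
a = rng.normal(scale=1.0, size=n_curves)
b = rng.normal(scale=0.1, size=n_curves)
curves = (
    2.0 * t[:, None]
    + np.sin(2 * np.pi * t)[:, None] * a
    + np.cos(2 * np.pi * t)[:, None] * b
)  # shape (n_points, n_curves): one column per curve

# Center the curves around the mean curve
mean_curve = curves.mean(axis=1, keepdims=True)
centered = curves - mean_curve

# SVD of the centered curves: left singular vectors are the principal component functions
U, s, _ = np.linalg.svd(centered, full_matrices=False)
explained = s**2 / np.sum(s**2)
eigenfunctions = U[:, :2]           # first two component functions over time
scores = centered.T @ eigenfunctions  # coordinates of each curve on those components

print(explained[:2])
```

Here the first component function recovers the sine-shaped mode that dominates the variation between curves, and each curve is summarized by two scores instead of 200 samples.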

## Visualizations

In [None]:
# t-SNE visualization
fig = px.scatter_3d(
    X_tsne,
    x=0,
    y=1,
    z=2,
)
fig.update_traces(marker_size=1)
fig.show()

In [None]:
# PCA Component Analysis
pca = PCA()
components = pca.fit_transform(normalized_Xtrain)
labels = {
    str(i): f"PC {i+1} ({var:.1f}%)"
    for i, var in enumerate(pca.explained_variance_ratio_ * 100)
}

fig = px.scatter_matrix(
    components,
    labels=labels,
    dimensions=range(5),
)
fig.update_traces(diagonal_visible=False)
fig.show()

In [19]:
# PCA Component analysis Interactive
app = Dash(__name__)

app.layout = html.Div(
    [
        html.H4("Visualization of PCA's explained variance"),
        dcc.Graph(id="graph"),
        html.P("Number of components:"),
        dcc.Slider(id="slider", min=2, max=15, value=3, step=1),
    ]
)


@app.callback(Output("graph", "figure"), Input("slider", "value"))
def run_and_plot(n_components):
    # Fit PCA on the scaled features with the number of components set on the slider
    data = normalized_Xtrain
    pca = PCA(n_components=n_components)
    components = pca.fit_transform(data)

    var = pca.explained_variance_ratio_.sum() * 100

    labels = {str(i): f"PC {i+1}" for i in range(n_components)}

    fig = px.scatter_matrix(
        components,
        dimensions=range(n_components),
        labels=labels,
        title=f"Total Explained Variance: {var:.2f}%",
    )
    fig.update_traces(diagonal_visible=False)
    return fig


# Recent Dash releases use app.run; older ones may still require app.run_server
app.run(debug=False, host="0.0.0.0", port=8890)

In [20]:
# FPCA Visualization
import plotly.graph_objects as go

# Plot the 5 principal component functions
fig = go.Figure()

# Add one trace per principal component
for i in range(5):
    fig.add_trace(
        go.Scatter(y=fPCA_analysis.f_pca[:, 0, i], mode="lines", name=f"PC{i + 1}")
    )


fig.update_layout(
    title_text="<b>Functional Principal Components of Jan. 2019 - Apr. 2024 Solar Wind In-Situ Measurements</b>",
    title_x=0.5,
)

fig.show()