# 01_eda_correlation_threshold.ipynb

This notebook explores the distribution of Pearson correlation coefficients across CMAPSS sensor time-windows, so we can choose an appropriate threshold τ for graph construction.


In [None]:
# Cell 1: Imports and setup

# Standard libraries
import numpy as np
import matplotlib.pyplot as plt

# Project-specific loader that returns CMAPSS windows as torch_geometric Data objects
from phase2_data_loader import load_cmapss

# Ensure plots display inline in the notebook
%matplotlib inline



## 1. Load CMAPSS Sliding Windows

We’ll load a subset of windows from the FD003 dataset:
- `window_size=30`: each sample is 30 time-steps long  
- `stride=30`: non-overlapping windows, for speed during EDA
- We then keep only the first 50 windows to make plotting fast
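
For intuition about how `window_size` and `stride` interact, here is a minimal sketch of sliding-window extraction on a synthetic array. The `sliding_windows` helper is hypothetical and is not taken from `load_cmapss`, whose internals are not shown here. It only illustrates that a stride equal to the window size yields non-overlapping windows.

```python
import numpy as np

def sliding_windows(series, window_size, stride):
    """Hypothetical helper: cut a [time, sensors] array into fixed-length windows."""
    starts = range(0, series.shape[0] - window_size + 1, stride)
    return [series[s:s + window_size] for s in starts]

# Synthetic stand-in for one engine's readings: 100 time-steps, 4 sensors
series = np.arange(400, dtype=float).reshape(100, 4)

non_overlapping = sliding_windows(series, window_size=30, stride=30)
overlapping = sliding_windows(series, window_size=30, stride=1)

print(f"stride=30: {len(non_overlapping)} windows of shape {non_overlapping[0].shape}")
print(f"stride=1:  {len(overlapping)} windows")
```

With `stride=30` the trailing 10 time-steps are discarded because they do not fill a complete window.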


In [None]:
# Cell 2: Load a small subset of windows for exploratory analysis

# Path to the CMAPSS data directory (adjust if needed)
DATA_DIR = 'data/CMAPSS'

# Load windows: returns a list of `torch_geometric.data.Data` objects
windows = load_cmapss(
    data_dir=DATA_DIR,
    dataset='FD003',
    window_size=30,
    stride=30
)

# Keep only the first 50 windows for speed
windows = windows[:50]

print(f"Loaded {len(windows)} windows, each with shape {windows[0].x.shape}")


## 2. Compute Pairwise Sensor Correlations

For each window:
1. Extract the sensor feature matrix `x` of shape `(window_size, num_sensors)`.  
2. Compute the Pearson correlation matrix across sensor channels.  
3. Collect the upper-triangle (off-diagonal) correlations for histogramming.


In [None]:
# Cell 3: Compute correlations

all_corrs = []  # will store all pairwise correlations across windows

for data in windows:
    # `data.x` is a torch.Tensor of shape [window_size, num_sensors]
    x = data.x.detach().cpu().numpy()  # move to CPU and convert to NumPy for correlation
    corr_matrix = np.corrcoef(x.T)  # shape [num_sensors, num_sensors]

    # Extract only the off-diagonal (i < j) entries
    triu_i, triu_j = np.triu_indices_from(corr_matrix, k=1)
    pairwise_corrs = corr_matrix[triu_i, triu_j]

    all_corrs.extend(pairwise_corrs)

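all_corrs = np.array(all_corrs)
# A channel that is constant within a window has zero variance, so its correlations are NaN;
# drop those entries so the histogram and percentiles below stay well-defined
all_corrs = all_corrs[np.isfinite(all_corrs)]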
print(f"Total correlations collected: {all_corrs.shape[0]}")
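
One caveat when computing correlations per window: a sensor that does not change within a window has zero variance, so its Pearson correlation with every other channel is undefined and `np.corrcoef` returns NaN. The sketch below reproduces this on synthetic data (not CMAPSS) and shows one way to mask the undefined pairs. Whether this happens in practice depends on which channels `load_cmapss` keeps, which is an assumption here.

```python
import numpy as np

rng = np.random.default_rng(0)
window = rng.normal(size=(30, 4))  # synthetic [window_size, num_sensors]
window[:, 2] = 1.0                 # make one channel constant

# Silence the floating-point warnings NumPy raises for the zero-variance channel
with np.errstate(invalid='ignore', divide='ignore'):
    corr = np.corrcoef(window.T)

i, j = np.triu_indices_from(corr, k=1)
pairs = corr[i, j]
finite_pairs = pairs[np.isfinite(pairs)]  # keep only well-defined correlations

print(f"{np.isnan(pairs).sum()} of {pairs.size} pairs are NaN")
```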


## 3. Plot Distribution of |Correlation|

We take the absolute value of the correlations, since both strong positive and strong negative relationships are informative.  
We then plot a histogram to see where most sensor-sensor correlations lie.


In [None]:
# Cell 4: Plot histogram

# Compute absolute correlations
abs_corrs = np.abs(all_corrs)

plt.figure(figsize=(8, 4))
plt.hist(abs_corrs, bins=30, edgecolor='black')
plt.title('Distribution of |Pearson Correlations| Across CMAPSS Windows')
plt.xlabel('|Correlation|')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()


## 4. Choose Threshold τ

From the histogram, identify a τ that captures the top ~10% of |corr| values.  
Let’s compute several upper percentiles (75th through 95th) to guide our choice.


In [None]:
# Cell 5: Compute percentiles

for p in [75, 80, 85, 90, 95]:
    val = np.percentile(abs_corrs, p)
    print(f"{p}th percentile of |corr|: {val:.3f}")


**Interpretation:**  
- E.g., if the 90th percentile is around 0.80, setting τ = 0.8 will retain roughly the strongest 10% of sensor-sensor pairs, pooled over the sampled windows. The share kept within any single window can differ.
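
To sanity-check a candidate τ, it can help to go the other way and measure what fraction of pairs a given threshold keeps. This is a minimal sketch on synthetic values; `abs_corrs_demo` is a stand-in for the notebook's `abs_corrs`, and the candidate thresholds are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for `abs_corrs`; the real values come from Cell 4
abs_corrs_demo = np.abs(rng.uniform(-1.0, 1.0, size=10_000))

# Threshold at the 90th percentile, then verify the retained share
tau = np.percentile(abs_corrs_demo, 90)
retained = np.mean(abs_corrs_demo >= tau)
print(f"tau = {tau:.3f} keeps {retained:.1%} of pairs")

# Retained share for a few fixed candidate thresholds
for candidate in [0.5, 0.7, 0.8, 0.9]:
    frac = np.mean(abs_corrs_demo >= candidate)
    print(f"tau = {candidate:.1f} keeps {frac:.1%} of pairs")
```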

You can now use this τ in your `build_graph(..., threshold=τ)` function during graph construction.
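
As an illustration of what thresholding does to a single window, the sketch below turns a small hand-written correlation matrix into an edge list. The `edges_from_correlation` helper is hypothetical and is not the project's `build_graph`, whose signature beyond `threshold` is not shown here. The `[2, num_edges]` layout is chosen to mirror the `edge_index` convention used by `torch_geometric`.

```python
import numpy as np

def edges_from_correlation(corr_matrix, threshold):
    """Hypothetical helper: [2, num_edges] array of sensor pairs with |corr| >= threshold."""
    strong = np.abs(corr_matrix) >= threshold  # NaN compares as False, so undefined pairs are dropped
    np.fill_diagonal(strong, False)            # exclude self-loops
    src, dst = np.nonzero(strong)
    return np.stack([src, dst])

# Hand-written symmetric correlation matrix for 4 sensors
corr = np.array([
    [ 1.00, 0.90, -0.85, 0.10],
    [ 0.90, 1.00,  0.20, 0.00],
    [-0.85, 0.20,  1.00, 0.30],
    [ 0.10, 0.00,  0.30, 1.00],
])

edge_index = edges_from_correlation(corr, threshold=0.8)
print(edge_index)
```

Because the correlation matrix is symmetric, each retained pair appears in both directions, which gives an undirected graph. The strongly negative pair (0, 2) is kept because the comparison uses the absolute value.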
