# Visualize Data

## Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a dimensionality reduction technique used to simplify complex datasets while preserving as much variance (information) as possible.  
It works by finding new axes, called **principal components (PCs)**, which represent directions of maximum variance in the data. These components are linear combinations of the original features.

### How PCA Works
1. The data is standardized (mean-centered and scaled to unit variance).
2. PCA computes the covariance matrix and extracts its eigenvectors and eigenvalues.
3. The eigenvectors define the new axes (principal components), and the eigenvalues indicate how much variance each axis explains.
4. The data is then projected onto these new components to obtain lower-dimensional representations.
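
The four steps above can be sketched directly in NumPy. This is a minimal illustration and not the notebook's actual pipeline (which uses scikit-learn's `PCA`): the function name `pca_eig` and the synthetic matrix `X_demo` are made up for this example, and the sketch assumes no column is constant (otherwise the standardization would divide by zero).

```python
import numpy as np

def pca_eig(X, n_components=2):
    """PCA via eigendecomposition of the covariance matrix of standardized data."""
    # Step 1: standardize each column (zero mean, unit variance)
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix and its eigendecomposition
    cov = np.cov(X_std, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order
    # Step 3: sort axes by descending eigenvalue (variance explained)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained_ratio = eigvals / eigvals.sum()
    # Step 4: project the standardized data onto the leading axes
    components = eigvecs[:, :n_components]
    scores = X_std @ components
    return scores, components, explained_ratio

# Hypothetical demo data: 200 samples, 5 features, two of them strongly correlated
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
X_demo[:, 1] = 2 * X_demo[:, 0] + 0.1 * X_demo[:, 1]

scores, components, explained_ratio = pca_eig(X_demo, n_components=2)
print(explained_ratio.round(3))
```

Because two of the five columns are nearly collinear, the first ratio comes out clearly larger than the others; with fully independent columns the ratios would be roughly equal.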

### Interpretation of PC1 and PC2
- **PC1 (Principal Component 1)**  
  The first principal component captures the **largest amount of variance** in the dataset. It represents the dominant linear pattern across all features.  
  For example, if your mel features dominate PC1, it means that variations in the mel spectrum are the primary source of difference between samples.

- **PC2 (Principal Component 2)**  
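  The second component captures the **next largest amount of variance** along a direction that is **orthogonal** to PC1, so its scores are uncorrelated with those of PC1 (uncorrelated, which is weaker than statistically independent).  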
  It often reveals secondary patterns or contrasts that PC1 cannot describe, such as variations related to rhythm or timbre when analyzing audio.
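
The orthogonality of PC1 and PC2 can be checked numerically. The sketch below uses a synthetic feature matrix (`X_demo` is a stand-in, not the dance dataset) and verifies two things: that the component directions have a dot product of about zero, and that the projected scores are uncorrelated.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic stand-in for a feature matrix: 300 samples, 6 correlated features
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
X_demo = latent @ mixing + 0.3 * rng.normal(size=(300, 6))

pca_demo = PCA(n_components=2)
scores = pca_demo.fit_transform(X_demo)

# The component directions are orthogonal unit vectors ...
direction_dot = float(pca_demo.components_[0] @ pca_demo.components_[1])
# ... and the projected scores are uncorrelated
score_corr = float(np.corrcoef(scores[:, 0], scores[:, 1])[0, 1])

print(f"PC1 . PC2 = {direction_dot:.2e}")
print(f"corr(PC1 scores, PC2 scores) = {score_corr:.2e}")
```

Both values should be zero up to floating-point error.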

In [None]:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np

# Load your dataset
df = pd.read_csv("features_renamed.csv")

label_col = "label"  # adjust if needed

# --- select numeric feature columns ---
numeric_cols = df.select_dtypes(include=["number"]).columns.tolist()

if label_col in numeric_cols:
    numeric_cols.remove(label_col)

X = df[numeric_cols].values

# --- scale numeric features ---
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- perform PCA ---
n_components = 10
pca = PCA(n_components=n_components)
X_pca = pca.fit_transform(X_scaled)

# --- explained variance summary ---
explained = pca.explained_variance_ratio_
cum_explained = np.cumsum(explained)

print("\n=== PCA EXPLAINED VARIANCE ===")
for i, var in enumerate(explained, start=1):
    print(f"PC{i:>2}: {var*100:6.2f}% (cumulative {cum_explained[i-1]*100:6.2f}%)")

# --- feature loadings (influence per original variable) ---
loadings = pd.DataFrame(
    pca.components_.T,
    columns=[f"PC{i+1}" for i in range(n_components)],
    index=numeric_cols
)

# show top contributing features for PC1 and PC2
print("\n=== TOP 10 FEATURES CONTRIBUTING TO PC1 ===")
print(loadings["PC1"].abs().sort_values(ascending=False).head(10))

print("\n=== TOP 10 FEATURES CONTRIBUTING TO PC2 ===")
print(loadings["PC2"].abs().sort_values(ascending=False).head(10))

# --- optional: aggregate variance by feature group prefix ---
group_prefixes = ["mfcc_", "chroma_", "mel_", "contrast_", "tonnetz_"]
group_contrib = {}

for prefix in group_prefixes:
    cols = [c for c in numeric_cols if c.startswith(prefix)]
    if cols:
        total = loadings.loc[cols, "PC1"].abs().sum() + loadings.loc[cols, "PC2"].abs().sum()
        group_contrib[prefix.rstrip("_")] = total

if group_contrib:
    print("\n=== GROUP CONTRIBUTIONS (|loadings| summed over PC1+PC2) ===")
    for g, val in sorted(group_contrib.items(), key=lambda x: x[1], reverse=True):
        print(f"{g:10s}: {val:.4f}")

# --- per-label mean PCA coordinates (for cluster separation insight) ---
if label_col in df.columns:
    pca_df = pd.DataFrame(X_pca[:, :2], columns=["PC1", "PC2"])
    pca_df[label_col] = df[label_col]
    print("\n=== PCA MEANS PER LABEL (first 2 PCs) ===")
    print(pca_df.groupby(label_col)[["PC1", "PC2"]].mean())
    



# 1. Explained variance

| Component | Variance (%) | Cumulative (%) |
| --------- | ------------ | -------------- |
| PC1       | 11.41        | 11.41          |
| PC2       | 5.08         | 16.49          |
| PC3       | 4.41         | 20.91          |
| PC4–PC10  | ≈2–3 each    | 35.64 (through PC10) |

## Interpretation:

- The first principal component explains **~11% of total variance**, and the first 10 components together explain **only ~36%**.

- The variance is therefore spread across many orthogonal directions: the standardized features are only weakly correlated overall, so **no small set of linear components summarizes the data**.

- This is common for **high-dimensional audio feature sets** (e.g. Mel-band energies and MFCCs), where many coefficients carry partly distinct spectral information.

- So there is **no dominant low-dimensional linear structure**. Explained variance says nothing about the labels, so class separability has to be assessed separately; it may well be subtle or nonlinear.
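
One way to quantify how spread out the variance is: count how many leading components are needed to reach a target share of it. The helper below is a hypothetical sketch (`n_components_for` is not part of scikit-learn); it takes any array of explained-variance ratios, such as `pca.explained_variance_ratio_`.

```python
import numpy as np

def n_components_for(explained_ratio, threshold=0.90):
    """Smallest number of leading components whose cumulative ratio reaches
    `threshold`, or None if the given ratios never reach it."""
    cum = np.cumsum(explained_ratio)
    reached = np.nonzero(cum >= threshold)[0]
    return int(reached[0]) + 1 if reached.size else None

# The first three ratios reported in the table above
ratios = np.array([0.1141, 0.0508, 0.0441])
print(n_components_for(ratios, threshold=0.15))  # 2 components reach 15%
print(n_components_for(ratios, threshold=0.90))  # None: three components are far from 90%
```

To apply this to the full dataset, the PCA would need to be fitted with all components kept. scikit-learn can also make this selection itself: passing a float between 0 and 1 as `n_components` together with `svd_solver="full"` keeps as many components as are needed to explain that fraction of the variance.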


# 2. Feature loadings (which features drive the variance)

**PC1 dominated by Mel features**

`mel_105`–`mel_116` are the strongest contributors. These are mid-to-high Mel filter bands.

Interpretation:
→ The first principal axis mostly reflects energy variation in the **mid-to-high Mel bands**, not rhythm or timbre directly.

**PC2 mixes MFCC and Mel + some Contrast**
Top features include:

- `mfcc_3` (a spectral shape coefficient, roughly related to spectral slope/brightness)

- several `mel_6x–7x` filters (lower–mid band energy)

- and `contrast_4`, `contrast_5` (spectral contrast between peaks/valleys)

Interpretation:
→ The second component represents a blend of **spectral shape + low–mid Mel energy**, possibly correlating with timbral brightness or instrumentation differences between dance styles.

# 3. Group contribution summary

| Group    | Combined absolute loadings (PC1+PC2) | Relative importance     |
| -------- | ------------------------------------ | ----------------------- |
| **Mel**  | 24.42                                | overwhelmingly dominant |
| MFCC     | 1.81                                 | minor                   |
| Chroma   | 1.38                                 | small                   |
| Contrast | 1.17                                 | small                   |
| Tonnetz  | 0.55                                 | smallest                |


## Interpretation:

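- The **Mel features dominate** the first two variance directions; most of the variation these directions capture comes from the raw spectral energy distribution, not from higher-level harmonic descriptors.

- Part of this dominance is a group-size effect: summed absolute loadings grow with the number of columns in a group, and the Mel group is large (indices up to at least `mel_116` appear among the top loadings), likely much larger than the other groups.

- MFCC, Chroma, Contrast, and Tonnetz contribute much less to PC1 and PC2.

- This doesn't mean they're unimportant for classification, only that they are less aligned with the dominant shared directions of variation (after standardization, every feature has unit variance). Those smaller groups might encode class-specific cues that PCA (being unsupervised) doesn't emphasize.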
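
The group totals above sum absolute loadings, which favors groups with many columns. A size-aware alternative is to work with squared loadings: each principal component is a unit vector, so its squared loadings sum to 1 and can be read as shares. The sketch below is one possible way to do this, not the notebook's method; `group_variance_share` and the toy `loadings` table are hypothetical.

```python
import pandas as pd

def group_variance_share(loadings, prefixes, pcs=("PC1", "PC2")):
    """Per feature group: share of squared loadings (averaged over `pcs`)
    and the same share divided by the number of features in the group."""
    rows = []
    for prefix in prefixes:
        cols = [c for c in loadings.index if c.startswith(prefix)]
        if not cols:
            continue
        # Squared loadings of one PC sum to 1, so dividing by len(pcs) gives a share
        share = (loadings.loc[cols, list(pcs)] ** 2).sum().sum() / len(pcs)
        rows.append({
            "group": prefix.rstrip("_"),
            "n_features": len(cols),
            "squared_share": share,
            "mean_per_feature": share / len(cols),
        })
    columns = ["group", "n_features", "squared_share", "mean_per_feature"]
    return pd.DataFrame(rows, columns=columns).set_index("group")

# Toy loadings: three "mel" features and one "mfcc" feature, all equally loaded
toy = pd.DataFrame(
    {"PC1": [0.5, 0.5, 0.5, 0.5], "PC2": [0.5, -0.5, 0.5, -0.5]},
    index=["mel_0", "mel_1", "mel_2", "mfcc_0"],
)
summary = group_variance_share(toy, ["mel_", "mfcc_"])
print(summary)
```

In the toy table every feature loads equally, yet the `mel` group gets three times the share of `mfcc` simply because it has three times as many columns; the `mean_per_feature` column removes that effect. Passing the notebook's `loadings` DataFrame and `group_prefixes` list would give the corresponding view of the real data.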

# 4. Label separation in PCA space

Mean coordinates:

| Label       | PC1   | PC2   |
| ----------- | ----- | ----- |
| discofox    | 2.15  | 1.70  |
| jive        | 2.01  | 0.11  |
| samba       | 0.48  | 2.38  |
| chacha      | 0.82  | 0.84  |
| salsa       | 0.72  | -1.31 |
| rumba       | -0.87 | -0.71 |
| foxtrott    | 0.10  | 0.17  |
| quickstep   | -0.21 | -0.38 |
| viennawaltz | -2.19 | -1.46 |
| slowwaltz   | -6.21 | -2.99 |


## Interpretation:

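- Clear gradient along PC1:

    - **Positive side (high PC1)** → upbeat, rhythmically strong dances: discofox and jive sit near +2, with chacha, salsa, and samba moderately positive.

    - **Negative side (low PC1)** → slower, smoother waltzes (viennawaltz at −2.19, slowwaltz at −6.21).

    → PC1 appears to track tempo or rhythmic energy, even though its loadings are dominated by Mel-band energy rather than explicit rhythm features.

- PC2 separates some Latin styles (samba high, salsa low) and shows a secondary contrast possibly tied to **timbre or instrumentation density**.

- So the first two components reflect musical style structure reasonably well at the level of label means. These are means only; the spread within each label determines how much the classes actually overlap.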
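
Label means alone do not show how much the classes overlap. A rough check is to compare the variance between label means with the average variance within labels, per component. The sketch below uses synthetic scores; `separation_ratio` is a hypothetical helper, and the ratio is an informal indicator, not a formal test.

```python
import numpy as np
import pandas as pd

def separation_ratio(scores, labels):
    """Variance of the per-label means divided by the mean within-label
    variance, computed separately for each column of `scores`."""
    grouped = scores.groupby(labels)
    between = grouped.mean().var(ddof=0)
    within = grouped.var(ddof=0).mean()
    return between / within

# Synthetic example: PC1 separates three labels, PC2 carries no label structure
rng = np.random.default_rng(1)
labels = pd.Series(np.repeat(["a", "b", "c"], 50), name="label")
centers = {"a": -3.0, "b": 0.0, "c": 3.0}
demo = pd.DataFrame({
    "PC1": labels.map(centers).to_numpy() + rng.normal(size=150),
    "PC2": rng.normal(size=150),
})

ratios = separation_ratio(demo, labels)
print(ratios)
```

A ratio well above 1 means the label means are far apart relative to the spread within labels; a ratio near 0 means the component carries little label information. On the real data, the inputs would be the `PC1`/`PC2` columns of `pca_df` and its label column.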

# 5. Summary of what this tells you

| Aspect                 | Observation                                         | Implication                                                                 |
| ---------------------- | --------------------------------------------------- | --------------------------------------------------------------------------- |
| Variance distribution  | Spread over many components; no small set dominates | Models will likely need many latent factors                                 |
| Dominant feature group | Mel features                                        | Ensure Mel scaling/normalization is appropriate; consider balancing         |
| Class structure        | Visible grouping by tempo/energy                    | PC1 appears to capture a meaningful musical axis                            |
| PCA as preprocessing   | Not ideal for deep learning input                   | Use raw standardized features; let the network learn nonlinear combinations |
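
Whether PCA helps or hurts as a preprocessing step can be tested empirically and need not be assumed. The sketch below compares standardized features with and without a PCA step using cross-validation. It rests on several assumptions: the data is synthetic (`make_classification`), the classifier is a logistic regression standing in for whatever model is used downstream, and the choice of 10 components mirrors the analysis above but is otherwise arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature table: 400 samples, 40 features, 3 classes
X_demo, y_demo = make_classification(
    n_samples=400, n_features=40, n_informative=10, n_classes=3, random_state=0
)

# Pipelines keep scaling (and PCA) inside each CV fold, avoiding leakage
candidates = {
    "standardized": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "standardized + PCA(10)": make_pipeline(
        StandardScaler(), PCA(n_components=10), LogisticRegression(max_iter=1000)
    ),
}

results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    results[name] = scores.mean()
    print(f"{name:25s} mean accuracy = {results[name]:.3f}")
```

The numbers from this synthetic run say nothing about the dance dataset; the point is the comparison setup, which can be reused with the real features and labels.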


## Import Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

## Group Features by Type

In [None]:
df = pd.read_csv('features.csv')

feature_cols = [c for c in df.columns if c not in ["filename", "label"]]

feature_groups = {
        "mfcc": [c for c in feature_cols if c.startswith("mfcc_")],
        "chroma": [c for c in feature_cols if c.startswith("chroma_")],
        "mel": [c for c in feature_cols if c.startswith("mel_")],
        "contrast": [c for c in feature_cols if c.startswith("contrast_")],
        "tonnetz": [c for c in feature_cols if c.startswith("tonnetz_")],
        "tempogram": [c for c in feature_cols if c.startswith("tempogram_")],
        "rms": [c for c in feature_cols if c.startswith("rms_")],
        "spectral_flux": [c for c in feature_cols if c.startswith("spectral_flux_")],
        "onset": [c for c in feature_cols if c.startswith("onset_strength_")],
        "tempo": [c for c in feature_cols if c == "tempo_bpm"],
}

# Fail early if the label column is missing
if "label" not in df.columns:
    raise ValueError("No 'label' column found in dataset.")
target_col = "label"

sns.set(style="whitegrid", context="notebook")

# Display the resulting column groups
feature_groups

## Visualization Grid

In [None]:
for name, cols in feature_groups.items():
    if len(cols) == 0:
        print(f"Skipping {name}: No columns found.")
        continue

    X_group = df[cols].to_numpy()
    n_features = X_group.shape[1]

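    # Case 1: PCA possible (two or more features)
    # Note: PCA centers the data but does not scale it, so the features in
    # each group enter on their original scale here (no StandardScaler step)
    if n_features >= 2:
        pca = PCA(n_components=2)
        X_pca = pca.fit_transform(X_group)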
        df_plot = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
        df_plot['label'] = df[target_col]

        fig = plt.figure(figsize=(18, 10))
        fig.suptitle(f"PCA Visualization for {name} Features", fontsize=18, y=1.02)

        # Boxplots
        ax1 = plt.subplot2grid((2, 3), (0, 0))
        sns.boxplot(x='label', y='PC1', data=df_plot, ax=ax1)
        ax1.set_title('Boxplot of PC1')
        ax1.tick_params(axis='x', rotation=45)

        ax2 = plt.subplot2grid((2, 3), (0, 1))
        sns.boxplot(x='label', y='PC2', data=df_plot, ax=ax2)
        ax2.set_title('Boxplot of PC2')
        ax2.tick_params(axis='x', rotation=45)

        # Histograms
        ax3 = plt.subplot2grid((2, 3), (0, 2))
        sns.histplot(data=df_plot, x="PC1", hue="label", bins=30, kde=True, element="step", ax=ax3)
        ax3.set_title('Histogram of PC1')

        ax4 = plt.subplot2grid((2, 3), (1, 0))
        sns.histplot(data=df_plot, x="PC2", hue="label", bins=30, kde=True, element="step", ax=ax4)
        ax4.set_title('Histogram of PC2')

        # Scatter
        ax5 = plt.subplot2grid((2, 3), (1, 1), colspan=2)
        sns.scatterplot(x='PC1', y='PC2', data=df_plot,
                        hue='label', palette="tab10", ax=ax5)
        ax5.set_title(f'Scatter Plot (PC1 vs PC2): {name} Features')
        ax5.legend(title="Label", bbox_to_anchor=(1.05, 1), loc='upper left')

        plt.tight_layout()
        plt.show()
        continue

    # Case 2: Single feature (e.g., BPM)
    if n_features == 1:
        feature = cols[0]
        df_plot = pd.DataFrame({
            "value": df[feature],
            "label": df[target_col]
        })

        fig = plt.figure(figsize=(18, 10))
        fig.suptitle(f"Single Feature Visualization for {name} ({feature})", fontsize=18, y=1.02)

        # Boxplot
        ax1 = plt.subplot2grid((2, 3), (0, 0))
        sns.boxplot(x="label", y="value", data=df_plot, ax=ax1)
        ax1.set_title(f"Boxplot of {feature}")
        ax1.tick_params(axis="x", rotation=45)

        # Histogram
        ax2 = plt.subplot2grid((2, 3), (0, 1))
        sns.histplot(data=df_plot, x="value", hue="label", bins=30,
                     kde=True, element="step", ax=ax2)
        ax2.set_title(f"Histogram of {feature}")

        # Scatter not meaningful → replace with stripplot
        ax3 = plt.subplot2grid((2, 3), (1, 0), colspan=2)
        sns.stripplot(x="label", y="value", data=df_plot, ax=ax3, jitter=0.2)
        ax3.set_title(f"Distribution by Label: {feature}")
        ax3.tick_params(axis="x", rotation=45)

        plt.tight_layout()
        plt.show()

In [None]:
combined_pca = pd.DataFrame()
combined_pca[target_col] = df[target_col]

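# Perform PCA for each feature group
for name, cols in feature_groups.items():
    if len(cols) == 0:
        print(f"Skipping {name}: No columns found.")
        continue

    # Single-feature groups (e.g., tempo) cannot be reduced to 2 components:
    # PCA(n_components=2) would raise a ValueError, so keep the raw column
    if len(cols) == 1:
        combined_pca[cols[0]] = df[cols[0]].to_numpy()
        continue

    X_group = df[cols].to_numpy()
    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(X_group)

    # Add to combined DataFrame
    combined_pca[f"{name}_PC1"] = X_pca[:, 0]
    combined_pca[f"{name}_PC2"] = X_pca[:, 1]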

# Generate pairplot for all PC components
selected_cols = [col for col in combined_pca.columns if col != target_col]

g = sns.pairplot(
    combined_pca,
    vars=selected_cols,
    hue=target_col,
    diag_kind="kde",
    palette="tab10",
    corner=True,                  # shows only lower triangle for clarity
    plot_kws=dict(alpha=0.6, s=35, edgecolor="none")
)

g.fig.suptitle("Combined PCA Pairplot Across All Parent Feature Groups", fontsize=18, y=1.02)
plt.tight_layout()
plt.show()