<a href="https://colab.research.google.com/github/GemmaGorey/Dissertation/blob/main/Similarity_Analysis_Step_By_Step.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Similarity Analysis: Audio vs Lyrics Features

This notebook performs similarity analysis between audio and lyrics features using your trained MODEL 4.

## Methodology
- Uses the same environment setup as MODEL 4
- Uses the same variable names as MODEL 4
- References preprocessed data from your dissertation folder

## Code Markings
- ‚úÖ **EXISTING CODE** = Same as MODEL 4 (copy-paste directly)
- ‚≠ê **NEW CODE** = Similarity analysis code (you need to add this)

---

# STEP 1: Environment Setup

## ‚úÖ EXISTING CODE (Same as MODEL 4)

This code is identical to MODEL 4 - just copy-paste it.

In [None]:
# ‚úÖ EXISTING CODE - Install condacolab
!pip install -q condacolab
import condacolab
condacolab.install()

In [None]:
# ‚úÖ EXISTING CODE - Create environment.yml and build environment
yaml_content = """
name: dissertation
channels:
  - pytorch
  - conda-forge
dependencies:
  - python=3.11
  - pytorch=2.2.2
  - torchvision=0.17.2
  - torchaudio
  - librosa
  - numpy<2
  - pandas
  - jupyter
  - wandb
"""

with open('environment.yml', 'w') as f:
    f.write(yaml_content)

print("environment.yml file created successfully.")
print("\nCreating environment")

!mamba env create -f environment.yml --quiet && echo -e "\n'dissertation' environment is ready to use."

---
# STEP 2: Import Libraries and Setup

## ‚úÖ EXISTING CODE (Same as MODEL 4)

In [None]:
# ‚úÖ EXISTING CODE - Clone GitHub repo and mount Google Drive
print("‚è≥ Cloning GitHub repository...")
!git clone https://github.com/GemmaGorey/Dissertation.git
print("Repository cloned.")

from google.colab import drive
drive.mount('/content/drive')

In [None]:
# ‚úÖ EXISTING CODE - Import standard libraries (from MODEL 4)
import pandas as pd
import librosa
import os
import numpy as np
import matplotlib.pyplot as plt
import librosa.display
from transformers import AutoTokenizer
from tqdm.auto import tqdm
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
from transformers import AutoModel
import torch.optim as optim
import subprocess

print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
print("Tokenizer loaded.")

## ‚≠ê NEW CODE - Additional imports for similarity analysis

In [None]:
# ‚≠ê NEW CODE - Additional imports for similarity analysis
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cross_decomposition import CCA
from scipy.stats import pearsonr
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)

print("‚úì Similarity analysis libraries loaded.")

---
# STEP 3: Define Model Classes

## ‚úÖ EXISTING CODE (Same as MODEL 4)

These are the exact same classes from MODEL 4.

In [None]:
# ‚úÖ EXISTING CODE - MER_Dataset class (from MODEL 4)
class MER_Dataset(Dataset):
    """ Custom PyTorch Dataset for loading MER data. """
    def __init__(self, annotations_df, tokenizer):
        """ Creation of the Dataset from the dataframe (predefined splits in MERGE dataset) """
        self.annotations = annotations_df
        self.tokenizer = tokenizer

    def __len__(self):
        """
        Function to return the total number of songs in the dataset.
        """
        return len(self.annotations)

    def __getitem__(self, index):
        """
        Function to get a song from the dataset.
        """
        song_info = self.annotations.iloc[index]

        spectrogram_path = song_info['spectrogram_path']
        lyrics_path = song_info['lyrics_path']
        valence = song_info['valence']
        arousal = song_info['arousal']

        # Change spectrogram into a tensor
        spectrogram = np.load(spectrogram_path)
        spectrogram_tensor = torch.from_numpy(spectrogram).float()
        spectrogram_tensor = spectrogram_tensor.unsqueeze(0)  # Adding a "channel" dimension for CNN

        # Load the lyric tokens
        encoded_lyrics = torch.load(lyrics_path, weights_only=False)
        input_ids = encoded_lyrics['input_ids'].squeeze(0)
        attention_mask = encoded_lyrics['attention_mask'].squeeze(0)

        labels = torch.tensor([valence, arousal], dtype=torch.float32)

        return spectrogram_tensor, input_ids, attention_mask, labels

print("‚úì MER_Dataset class defined.")

In [None]:
# ‚úÖ EXISTING CODE - AttentionModule class (from MODEL 4)
class AttentionModule(nn.Module):
    def __init__(self, feature_dim):
        super(AttentionModule, self).__init__()
        '''
        Attention mechanism to weight the importance of different features
        '''
        self.attention = nn.Sequential(
            nn.Linear(feature_dim, feature_dim // 4),  # input is 64 will map to 16
            nn.ReLU(),
            nn.Linear(feature_dim // 4, feature_dim),  # reverts back to 64 from 16
            nn.Sigmoid()
        )

    def forward(self, x):
        # x shape: [batch_size, 64]
        attention_weights = self.attention(x)  # [batch_size, 64]
        weighted_features = x * attention_weights  # Element-wise multiplication
        return weighted_features

print("‚úì AttentionModule class defined.")

In [None]:
# ‚úÖ EXISTING CODE - VGGish_Audio_Model class (from MODEL 4)
class VGGish_Audio_Model(nn.Module):
    '''
    A VGG-style model for the audio tower.
    V1.2 implements true VGG-style blocks with multiple convolutions per block.
    '''
    def __init__(self):
        super(VGGish_Audio_Model, self).__init__()
        
        self.features = nn.Sequential(
            # Block 1 - 2 convolutions
            nn.Conv2d(1, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),

            # Block 2 - 2 convolutions
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),

            # Block 3 - 2 convolutions
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),

            # Block 4 - 2 convolutions
            nn.Conv2d(256, 512, kernel_size=3, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((1, 1))
        )

        self.dropout1 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(512, 256)
        self.relu1 = nn.ReLU(inplace=True)
        self.dropout2 = nn.Dropout(0.5)
        self.attention = AttentionModule(256)
        self.fc2 = nn.Linear(256, 64)  # Final feature vector size should be 64

    def forward(self, x):
        x = self.features(x)
        # Flatten the features for the classifier
        x = x.view(x.size(0), -1)
        x = self.dropout1(x)
        x = self.fc1(x)
        x = self.relu1(x)
        x = self.dropout2(x)
        x = self.attention(x)
        x = self.fc2(x)
        return x

print("‚úì VGGish_Audio_Model class defined.")

In [None]:
# ‚úÖ EXISTING CODE - BimodalClassifier class (from MODEL 4)
class BimodalClassifier(nn.Module):
    """
    The final bimodal model.
    """
    def __init__(self):
        super(BimodalClassifier, self).__init__()

        # Initiate audio tower
        self.audio_tower = VGGish_Audio_Model()

        # Use transformer for lyrics (using bert base uncased)
        self.lyrics_tower = AutoModel.from_pretrained('bert-base-uncased')
        for param in self.lyrics_tower.parameters():
            param.requires_grad = False

        # Define feature sizes
        AUDIO_FEATURES_OUT = 64
        LYRICS_FEATURES_OUT = 768
        COMBINED_FEATURES = AUDIO_FEATURES_OUT + LYRICS_FEATURES_OUT

        self.classifier_head = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(in_features=COMBINED_FEATURES, out_features=100),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(in_features=100, out_features=2)  # 2 Outputs for Valence and Arousal
        )

    def forward(self, x_audio, input_ids, attention_mask):
        # Process audio input
        audio_features = self.audio_tower(x_audio)

        # Get lyric features
        lyrics_outputs = self.lyrics_tower(input_ids=input_ids, attention_mask=attention_mask)

        # Use the embedding of the [CLS] token as the feature vector for whole lyrics
        lyrics_features = lyrics_outputs.last_hidden_state[:, 0, :]

        # Combine the features from both towers
        combined_features = torch.cat((audio_features, lyrics_features), dim=1)

        # Pass the combined features to the final classifier head
        output = self.classifier_head(combined_features)

        return output

print("‚úì BimodalClassifier class defined.")

## ‚≠ê NEW CODE - Modified BimodalClassifier to return intermediate features

We need to extract audio and lyrics features BEFORE they're combined. This requires a small modification.

In [None]:
# ‚≠ê NEW CODE - Add a method to extract features (modification to BimodalClassifier)
# We'll add this method to the existing model after loading

def get_features(self, x_audio, input_ids, attention_mask):
    """
    Extract audio and lyrics features separately (before fusion).
    Returns: (audio_features, lyrics_features, predictions)
    """
    # Process audio input
    audio_features = self.audio_tower(x_audio)  # [batch_size, 64]

    # Get lyric features
    lyrics_outputs = self.lyrics_tower(input_ids=input_ids, attention_mask=attention_mask)
    lyrics_features = lyrics_outputs.last_hidden_state[:, 0, :]  # [batch_size, 768]

    # Combine features and get predictions
    combined_features = torch.cat((audio_features, lyrics_features), dim=1)
    predictions = self.classifier_head(combined_features)

    return audio_features, lyrics_features, predictions

# We'll add this method to the model after loading
print("‚úì Feature extraction method defined (will be added to model later).")

---
# STEP 4: Load Data

## ‚úÖ EXISTING CODE (Same as MODEL 4)

This loads data exactly as in MODEL 4.

In [None]:
# ‚úÖ EXISTING CODE - Data loading (from MODEL 4)
print("Starting data transfer from Google Drive to local Colab storage...")

# Get paths for old file location and new colab one
gdrive_zip_path = '/content/drive/MyDrive/dissertation/merge_dataset_zipped.zip'
local_storage_path = '/content/local_dissertation_data/'
local_zip_path = os.path.join(local_storage_path, 'merge_dataset_zipped.zip')
os.makedirs(local_storage_path, exist_ok=True)

# Copy zip file from Drive to Colab
print("Copying single archive file from Google Drive...")
!rsync -ah --progress "{gdrive_zip_path}" "{local_storage_path}"

# Get total number of files for progress
total_files = int(subprocess.check_output(f"zipinfo -1 {local_zip_path} | wc -l", shell=True))

# Unzip the file
print("Extracting files locally")
!unzip -o "{local_zip_path}" -d "{local_storage_path}" | tqdm --unit=files --total={total_files} > /dev/null

print("Data transfer and extraction complete.")

In [None]:
# ‚úÖ EXISTING CODE - Load master data and update paths (from MODEL 4)
local_output_path = os.path.join(local_storage_path, 'merge_dataset/output_from_code/')
master_file_path = os.path.join(local_output_path, 'master_processed_file_list.csv')
master_df = pd.read_csv(master_file_path)

# Checking the valence and arousal range in the dataset
print(f"\nValence range in data: [{master_df['valence'].min()}, {master_df['valence'].max()}]")
print(f"Arousal range in data: [{master_df['arousal'].min()}, {master_df['arousal'].max()}]")
print(f"Valence mean: {master_df['valence'].mean():.4f}, std: {master_df['valence'].std():.4f}")
print(f"Arousal mean: {master_df['arousal'].mean():.4f}, std: {master_df['arousal'].std():.4f}")
print(f"Total samples in master_df: {len(master_df)}")

# Update the paths in the csv
print("\nUpdating dataframe paths to use fast local storage...")
gdrive_output_path = '/content/drive/MyDrive/dissertation/output_from_code/'
master_df['spectrogram_path'] = master_df['spectrogram_path'].str.replace(gdrive_output_path, local_output_path, regex=False)
master_df['lyrics_path'] = master_df['lyrics_path'].str.replace(gdrive_output_path, local_output_path, regex=False)
print("Dataframe paths updated.")

In [None]:
# ‚úÖ EXISTING CODE - Load train/val/test splits (from MODEL 4)
local_split_folder_path = os.path.join(local_storage_path, 'merge_dataset/MERGE_Bimodal_Complete/tvt_dataframes/tvt_70_15_15/')
train_split_df = pd.read_csv(os.path.join(local_split_folder_path, 'tvt_70_15_15_train_bimodal_complete.csv'))
val_split_df = pd.read_csv(os.path.join(local_split_folder_path, 'tvt_70_15_15_validate_bimodal_complete.csv'))
test_split_df = pd.read_csv(os.path.join(local_split_folder_path, 'tvt_70_15_15_test_bimodal_complete.csv'))
print("\nSplit files loaded from local storage.")

# Merge the files
id_column_name = 'song_id'
train_split_df.rename(columns={'Song': id_column_name}, inplace=True)
val_split_df.rename(columns={'Song': id_column_name}, inplace=True)
test_split_df.rename(columns={'Song': id_column_name}, inplace=True)

train_df = pd.merge(master_df, train_split_df, on=id_column_name)
val_df = pd.merge(master_df, val_split_df, on=id_column_name)
test_df = pd.merge(master_df, test_split_df, on=id_column_name)

# Checking no files are lost in merging
print("\nChecking data")

if len(train_df) == len(train_split_df):
    print("\nTraining split: Merge successful. All songs accounted for.")
else:
    print(f"\nWARNING: Training split lost {len(train_split_df) - len(train_df)} songs during merge.")

if len(val_df) == len(val_split_df):
    print("Validation split: Merge successful. All songs accounted for.")
else:
    print(f"WARNING: Validation split lost {len(val_split_df) - len(val_df)} songs during merge.")

if len(test_df) == len(test_split_df):
    print("Test split: Merge successful. All songs accounted for.")
else:
    print(f"WARNING: Test split lost {len(test_split_df) - len(test_df)} songs during merge.")

# Check length
expected_train_len = 1552
expected_val_len = 332
expected_test_len = 332

assert len(train_df) == expected_train_len, f"Expected {expected_train_len} training samples, but found {len(train_df)}"
assert len(val_df) == expected_val_len, f"Expected {expected_val_len} validation samples, but found {len(val_df)}"
assert len(test_df) == expected_test_len, f"Expected {expected_test_len} test samples, but found {len(test_df)}"

print(f"\nFinal dataset lengths are correct: Train({len(train_df)}), Val({len(val_df)}), Test({len(test_df)})")
print("Data Check Complete")

## ‚≠ê NEW CODE - Choose which dataset to analyze

You can analyze train, validation, or test set. We'll use test set by default.

In [None]:
# ‚≠ê NEW CODE - Select dataset for similarity analysis
# Choose which split to analyze: train_df, val_df, or test_df
# We use test_df by default (the same set used for MODEL 4 evaluation)

analysis_df = test_df.copy()  # Change this to train_df or val_df if needed

print(f"\n‚úì Selected dataset for similarity analysis: TEST SET")
print(f"  Total songs to analyze: {len(analysis_df)}")
print(f"  Song IDs: {analysis_df[id_column_name].head(10).tolist()}...")

In [None]:
# ‚úÖ EXISTING CODE - Create datasets and dataloaders (from MODEL 4)
train_dataset = MER_Dataset(annotations_df=train_df, tokenizer=tokenizer)
val_dataset = MER_Dataset(annotations_df=val_df, tokenizer=tokenizer)
test_dataset = MER_Dataset(annotations_df=test_df, tokenizer=tokenizer)

BATCH_SIZE = 16
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

print("\nDataLoaders created successfully.")

---
# STEP 5: Load Trained Model

## ‚úÖ EXISTING CODE (Similar to MODEL 4, but loading saved model)

In [None]:
# ‚úÖ EXISTING CODE - Check GPU availability (from MODEL 4)
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("GPU is available. Using CUDA device.")
else:
    raise RuntimeError("Error: No GPU found. This script requires a GPU to run.")

In [None]:
# ‚úÖ EXISTING CODE - Initialize model (from MODEL 4)
model = BimodalClassifier()
model.to(device)
print("Model initialized.")

In [None]:
# ‚úÖ EXISTING CODE - Load trained model weights
# This loads the model you saved in MODEL 4

model_path = '/content/drive/MyDrive/dissertation/bimodal_regression_model.pth'
model.load_state_dict(torch.load(model_path, map_location=device))
model.eval()  # Set to evaluation mode

print(f"‚úì Model loaded successfully from: {model_path}")
print("‚úì Model set to evaluation mode.")

## ‚≠ê NEW CODE - Add feature extraction method to model

In [None]:
# ‚≠ê NEW CODE - Add the get_features method to the loaded model
import types

# Add the method we defined earlier to the model instance
model.get_features = types.MethodType(get_features, model)

print("‚úì Feature extraction method added to model.")
print("‚úì Model is ready for similarity analysis.")

---
# STEP 6: Extract Features from Dataset

## ‚≠ê NEW CODE - Extract audio and lyrics features

This extracts features for all songs in the selected dataset.

In [None]:
# ‚≠ê NEW CODE - Function to extract features from the dataset

def extract_features_from_dataset(model, dataloader, device):
    """
    Extract audio and lyrics features for all songs in the dataloader.
    
    Args:
        model: Trained BimodalClassifier with get_features method
        dataloader: DataLoader (test_loader, train_loader, or val_loader)
        device: torch device (cuda or cpu)
    
    Returns:
        Dictionary containing:
        - audio_features: [N, 64] numpy array
        - lyrics_features: [N, 768] numpy array
        - predictions: [N, 2] numpy array (valence, arousal)
        - ground_truth: [N, 2] numpy array (true valence, arousal)
    """
    print("\n" + "="*70)
    print("EXTRACTING FEATURES FROM DATASET")
    print("="*70)
    
    # Initialize lists to store results
    audio_features_list = []
    lyrics_features_list = []
    predictions_list = []
    ground_truth_list = []
    
    # Set model to evaluation mode
    model.eval()
    
    # Extract features without computing gradients
    with torch.no_grad():
        for spectrogram_batch, input_ids_batch, attention_mask_batch, labels_batch in tqdm(dataloader, desc="Extracting features"):
            # Move data to device
            spectrogram_batch = spectrogram_batch.to(device)
            input_ids_batch = input_ids_batch.to(device)
            attention_mask_batch = attention_mask_batch.to(device)
            
            # Extract features using our new method
            audio_feat, lyrics_feat, preds = model.get_features(
                spectrogram_batch, 
                input_ids_batch, 
                attention_mask_batch
            )
            
            # Move to CPU and convert to numpy
            audio_features_list.append(audio_feat.cpu().numpy())
            lyrics_features_list.append(lyrics_feat.cpu().numpy())
            predictions_list.append(preds.cpu().numpy())
            ground_truth_list.append(labels_batch.cpu().numpy())
    
    # Concatenate all batches
    audio_features = np.concatenate(audio_features_list, axis=0)      # [N, 64]
    lyrics_features = np.concatenate(lyrics_features_list, axis=0)    # [N, 768]
    predictions = np.concatenate(predictions_list, axis=0)            # [N, 2]
    ground_truth = np.concatenate(ground_truth_list, axis=0)          # [N, 2]
    
    # Print summary
    print(f"\n‚úì Feature extraction complete!")
    print(f"  Total songs processed: {len(audio_features)}")
    print(f"  Audio features shape:  {audio_features.shape}")
    print(f"  Lyrics features shape: {lyrics_features.shape}")
    print(f"  Predictions shape:     {predictions.shape}")
    print(f"  Ground truth shape:    {ground_truth.shape}")
    
    return {
        'audio_features': audio_features,
        'lyrics_features': lyrics_features,
        'predictions': predictions,
        'ground_truth': ground_truth
    }

print("‚úì Feature extraction function defined.")

In [None]:
# ‚≠ê NEW CODE - Extract features from test set
# This will take a few minutes depending on dataset size

features_dict = extract_features_from_dataset(model, test_loader, device)

# Store variable names that match MODEL 4
audio_features = features_dict['audio_features']
lyrics_features = features_dict['lyrics_features']
predictions = features_dict['predictions']
ground_truth = features_dict['ground_truth']

print("\n‚úì Features stored in variables:")
print("  - audio_features")
print("  - lyrics_features")
print("  - predictions")
print("  - ground_truth")

---
---
# SIMILARITY ANALYSIS SECTION
---
---

All code below is ‚≠ê **NEW CODE** for similarity analysis.

You now have:
- `audio_features`: [332, 64] - Audio features for each song
- `lyrics_features`: [332, 768] - Lyrics features for each song
- `predictions`: [332, 2] - Predicted valence and arousal
- `ground_truth`: [332, 2] - True valence and arousal

We'll perform 3 similarity analyses:
1. **Cosine Similarity**
2. **Canonical Correlation Analysis (CCA)**
3. **Cross-Modal Retrieval**

---
# METHOD 1: Cosine Similarity

## ‚≠ê NEW CODE

**What it measures**: Angular similarity between feature vectors (-1 to 1)

**Key metric**: Diagonal of cross-modal matrix = how similar is each song's audio to its OWN lyrics

In [None]:
# ‚≠ê NEW CODE - Compute cosine similarity

def compute_cosine_similarity_analysis(audio_features, lyrics_features):
    """
    Compute pairwise cosine similarities.
    """
    print("\n" + "="*70)
    print("METHOD 1: COSINE SIMILARITY ANALYSIS")
    print("="*70)
    
    # Audio-to-audio similarity [N, N]
    # Entry [i,j] = similarity between audio_i and audio_j
    audio_sim = cosine_similarity(audio_features, audio_features)
    
    # Lyrics-to-lyrics similarity [N, N]
    lyrics_sim = cosine_similarity(lyrics_features, lyrics_features)
    
    # CROSS-MODAL similarity [N, N]
    # Entry [i,j] = similarity between audio_i and lyrics_j
    # KEY METRIC: Diagonal = audio_i vs lyrics_i (same song)
    cross_modal_sim = cosine_similarity(audio_features, lyrics_features)
    
    # Extract diagonal (self-similarity)
    self_similarity = np.diag(cross_modal_sim)
    
    # Extract off-diagonal (cross-song similarity)
    mask = np.ones_like(cross_modal_sim, dtype=bool)
    np.fill_diagonal(mask, False)
    cross_song_sim = cross_modal_sim[mask]
    
    # Print results
    print(f"\n1. SELF-SIMILARITY (audio vs own lyrics):")
    print(f"   Mean:  {self_similarity.mean():.4f}")
    print(f"   Std:   {self_similarity.std():.4f}")
    print(f"   Range: [{self_similarity.min():.4f}, {self_similarity.max():.4f}]")
    
    print(f"\n2. CROSS-SONG SIMILARITY (audio_i vs lyrics_j, i‚â†j):")
    print(f"   Mean:  {cross_song_sim.mean():.4f}")
    print(f"   Std:   {cross_song_sim.std():.4f}")
    
    print(f"\n3. WITHIN-MODALITY SIMILARITY:")
    print(f"   Audio-to-audio mean:   {audio_sim[mask].mean():.4f}")
    print(f"   Lyrics-to-lyrics mean: {lyrics_sim[mask].mean():.4f}")
    
    # Interpretation
    print(f"\n4. INTERPRETATION:")
    if self_similarity.mean() > 0.7:
        print(f"   ‚úì STRONG alignment: Audio and lyrics are highly similar")
    elif self_similarity.mean() > 0.5:
        print(f"   ‚úì MODERATE alignment: Some similarity between audio and lyrics")
    else:
        print(f"   ! WEAK alignment: Audio and lyrics encode different information")
    
    return audio_sim, lyrics_sim, cross_modal_sim, self_similarity

# Run analysis
audio_sim, lyrics_sim, cross_modal_sim, self_sim = compute_cosine_similarity_analysis(
    audio_features,
    lyrics_features
)

In [None]:
# ‚≠ê NEW CODE - Visualize similarity matrices

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Plot audio similarity
im1 = axes[0].imshow(audio_sim, cmap='coolwarm', vmin=0, vmax=1, aspect='auto')
axes[0].set_title('Audio-to-Audio Similarity', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Song Index')
axes[0].set_ylabel('Song Index')
plt.colorbar(im1, ax=axes[0], fraction=0.046)

# Plot lyrics similarity
im2 = axes[1].imshow(lyrics_sim, cmap='coolwarm', vmin=0, vmax=1, aspect='auto')
axes[1].set_title('Lyrics-to-Lyrics Similarity', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Song Index')
axes[1].set_ylabel('Song Index')
plt.colorbar(im2, ax=axes[1], fraction=0.046)

# Plot cross-modal similarity (KEY PLOT)
im3 = axes[2].imshow(cross_modal_sim, cmap='coolwarm', vmin=0, vmax=1, aspect='auto')
axes[2].set_title('Audio-to-Lyrics Cross-Modal Similarity', fontsize=14, fontweight='bold')
axes[2].set_xlabel('Lyrics Index')
axes[2].set_ylabel('Audio Index')
plt.colorbar(im3, ax=axes[2], fraction=0.046)

plt.tight_layout()
plt.show()

# Plot histogram of self-similarity
plt.figure(figsize=(10, 6))
plt.hist(self_sim, bins=30, color='steelblue', alpha=0.7, edgecolor='black')
plt.axvline(self_sim.mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: {self_sim.mean():.3f}')
plt.xlabel('Cosine Similarity (audio vs own lyrics)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Distribution of Self-Similarity Scores', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(axis='y', alpha=0.3)
plt.show()

---
# METHOD 2: Canonical Correlation Analysis (CCA)

## ‚≠ê NEW CODE

**What it does**: Finds linear transformations that maximize correlation

**Key insight**: Discovers shared latent dimensions (e.g., "energy" in both modalities)

In [None]:
# ‚≠ê NEW CODE - Perform CCA analysis

def perform_cca_analysis(audio_features, lyrics_features, n_components=10):
    """
    Canonical Correlation Analysis between audio and lyrics.
    """
    print("\n" + "="*70)
    print("METHOD 2: CANONICAL CORRELATION ANALYSIS (CCA)")
    print("="*70)
    
    # Initialize CCA
    cca = CCA(n_components=n_components, max_iter=1000)
    
    # Fit CCA to learn transformations
    cca.fit(audio_features, lyrics_features)
    
    # Transform features to canonical space
    audio_canonical, lyrics_canonical = cca.transform(audio_features, lyrics_features)
    
    # Compute correlation for each canonical component
    correlations = []
    for i in range(n_components):
        corr, _ = pearsonr(audio_canonical[:, i], lyrics_canonical[:, i])
        correlations.append(corr)
    
    correlations = np.array(correlations)
    
    # Print results
    print(f"\nCanonical correlations (n={n_components}):")
    for i, corr in enumerate(correlations):
        print(f"  Component {i+1}: {corr:.4f}")
    
    print(f"\nSummary statistics:")
    print(f"  Mean correlation: {correlations.mean():.4f}")
    print(f"  Max correlation:  {correlations.max():.4f}")
    print(f"  Std:              {correlations.std():.4f}")
    
    # Interpretation
    print(f"\nINTERPRETATION:")
    if correlations[0] > 0.7:
        print(f"  ‚úì STRONG shared structure: First component correlation = {correlations[0]:.3f}")
    elif correlations[0] > 0.5:
        print(f"  ‚úì MODERATE shared structure: Some shared latent dimensions")
    else:
        print(f"  ! LIMITED shared structure: Modalities may be complementary")
    
    return cca, correlations, audio_canonical, lyrics_canonical

# Run CCA
cca_model, cca_corrs, audio_can, lyrics_can = perform_cca_analysis(
    audio_features,
    lyrics_features,
    n_components=10
)

In [None]:
# ‚≠ê NEW CODE - Visualize CCA results

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Bar chart of canonical correlations
axes[0].bar(range(1, len(cca_corrs) + 1), cca_corrs, color='steelblue', alpha=0.7, edgecolor='black')
axes[0].axhline(y=0.5, color='red', linestyle='--', linewidth=2, label='Moderate (0.5)')
axes[0].axhline(y=0.7, color='darkred', linestyle='--', linewidth=2, label='Strong (0.7)')
axes[0].set_xlabel('Canonical Component', fontsize=12)
axes[0].set_ylabel('Correlation Coefficient', fontsize=12)
axes[0].set_title('Canonical Correlations', fontsize=14, fontweight='bold')
axes[0].set_ylim([0, 1])
axes[0].legend()
axes[0].grid(axis='y', alpha=0.3)

# Scatter plot of first canonical component
axes[1].scatter(audio_can[:, 0], lyrics_can[:, 0], alpha=0.6, s=50, color='purple', edgecolors='black', linewidth=0.5)
axes[1].set_xlabel('Audio Canonical Component 1', fontsize=12)
axes[1].set_ylabel('Lyrics Canonical Component 1', fontsize=12)
axes[1].set_title(f'First Canonical Component (r={cca_corrs[0]:.3f})', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

# Add correlation line
z = np.polyfit(audio_can[:, 0], lyrics_can[:, 0], 1)
p = np.poly1d(z)
axes[1].plot(audio_can[:, 0], p(audio_can[:, 0]), "r--", linewidth=2, label='Linear fit')
axes[1].legend()

plt.tight_layout()
plt.show()

---
# METHOD 3: Cross-Modal Retrieval

## ‚≠ê NEW CODE

**What it does**: Tests if audio features can retrieve matching lyrics

**Key metric**: Top-K accuracy (% of times correct match is in top K)

In [None]:
# ‚≠ê NEW CODE - Cross-modal retrieval analysis

def cross_modal_retrieval_analysis(audio_features, lyrics_features, top_k=5):
    """
    Perform cross-modal retrieval task.
    """
    print("\n" + "="*70)
    print("METHOD 3: CROSS-MODAL RETRIEVAL ANALYSIS")
    print("="*70)
    
    # Compute cross-modal similarity matrix
    sim_matrix = cosine_similarity(audio_features, lyrics_features)
    n_samples = len(audio_features)
    
    # Audio ‚Üí Lyrics retrieval
    audio_to_lyrics_top_k = np.argsort(sim_matrix, axis=1)[:, ::-1][:, :top_k]
    
    # Check if correct match is in top-k
    audio_to_lyrics_hits = []
    for i in range(n_samples):
        if i in audio_to_lyrics_top_k[i]:
            audio_to_lyrics_hits.append(1)
        else:
            audio_to_lyrics_hits.append(0)
    
    audio_to_lyrics_acc = np.mean(audio_to_lyrics_hits)
    
    # Lyrics ‚Üí Audio retrieval
    lyrics_to_audio_top_k = np.argsort(sim_matrix.T, axis=1)[:, ::-1][:, :top_k]
    
    lyrics_to_audio_hits = []
    for i in range(n_samples):
        if i in lyrics_to_audio_top_k[i]:
            lyrics_to_audio_hits.append(1)
        else:
            lyrics_to_audio_hits.append(0)
    
    lyrics_to_audio_acc = np.mean(lyrics_to_audio_hits)
    
    # Top-1 (exact match)
    audio_to_lyrics_top1 = np.argmax(sim_matrix, axis=1)
    top1_acc = np.mean(audio_to_lyrics_top1 == np.arange(n_samples))
    
    # Top-10
    if n_samples >= 10:
        audio_to_lyrics_top_10 = np.argsort(sim_matrix, axis=1)[:, ::-1][:, :10]
        top10_hits = [i in audio_to_lyrics_top_10[i] for i in range(n_samples)]
        top10_acc = np.mean(top10_hits)
    else:
        top10_acc = None
    
    # Print results
    print(f"\n1. RETRIEVAL ACCURACY:")
    print(f"   Audio ‚Üí Lyrics (Top-{top_k}): {audio_to_lyrics_acc:.2%}")
    print(f"   Lyrics ‚Üí Audio (Top-{top_k}): {lyrics_to_audio_acc:.2%}")
    
    print(f"\n2. ADDITIONAL METRICS:")
    print(f"   Top-1 accuracy (exact match):  {top1_acc:.2%}")
    if top10_acc:
        print(f"   Top-10 accuracy:               {top10_acc:.2%}")
    
    # Interpretation
    print(f"\n3. INTERPRETATION:")
    if audio_to_lyrics_acc > 0.5:
        print(f"   ‚úì GOOD alignment: Audio features predict matching lyrics well")
    elif audio_to_lyrics_acc > 0.2:
        print(f"   ‚úì MODERATE alignment: Some predictive power")
    else:
        print(f"   ! WEAK alignment: Limited cross-modal predictability")
    
    print(f"\n   Meaning: {audio_to_lyrics_acc:.1%} of the time, given a song's audio,")
    print(f"   the correct lyrics are in the top-{top_k} most similar lyrics.")
    
    return {
        'audio_to_lyrics_acc': audio_to_lyrics_acc,
        'lyrics_to_audio_acc': lyrics_to_audio_acc,
        'top1_acc': top1_acc,
        'top10_acc': top10_acc
    }

# Run retrieval analysis
retrieval_results = cross_modal_retrieval_analysis(
    audio_features,
    lyrics_features,
    top_k=5
)

In [None]:
# ‚≠ê NEW CODE - Visualize retrieval accuracy at different k values

k_values = [1, 2, 3, 5, 10, 20]
accuracies = []

sim_matrix = cosine_similarity(audio_features, lyrics_features)
n_samples = len(audio_features)

# Compute accuracy for different k values
for k in k_values:
    if k <= n_samples:
        top_k_indices = np.argsort(sim_matrix, axis=1)[:, ::-1][:, :k]
        hits = [i in top_k_indices[i] for i in range(n_samples)]
        accuracies.append(np.mean(hits))
    else:
        accuracies.append(None)

# Plot
plt.figure(figsize=(10, 6))
valid_k = [k for k, acc in zip(k_values, accuracies) if acc is not None]
valid_acc = [acc for acc in accuracies if acc is not None]

plt.plot(valid_k, valid_acc, marker='o', linewidth=2, markersize=8, color='steelblue')
plt.xlabel('Top-K', fontsize=12)
plt.ylabel('Retrieval Accuracy', fontsize=12)
plt.title('Cross-Modal Retrieval Accuracy (Audio ‚Üí Lyrics)', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.ylim([0, 1])

# Add value labels
for k, acc in zip(valid_k, valid_acc):
    plt.text(k, acc + 0.03, f'{acc:.2%}', ha='center', fontsize=10)

plt.tight_layout()
plt.show()

---
# COMPREHENSIVE SUMMARY REPORT

## ‚≠ê NEW CODE

In [None]:
# ‚≠ê NEW CODE - Generate comprehensive summary report

def generate_comprehensive_report(self_sim, cca_corrs, retrieval_results, analysis_df):
    """
    Generate final summary report.
    """
    print("\n\n")
    print("="*80)
    print(" "*20 + "COMPREHENSIVE SIMILARITY REPORT")
    print("="*80)
    
    print(f"\nüìä DATASET SUMMARY")
    print(f"{'-'*80}")
    print(f"   Total songs analyzed: {len(analysis_df)}")
    print(f"   Audio feature dim:    64")
    print(f"   Lyrics feature dim:   768")
    
    print(f"\nüìè METHOD 1: COSINE SIMILARITY")
    print(f"{'-'*80}")
    print(f"   Self-similarity (audio vs own lyrics):")
    print(f"     Mean: {self_sim.mean():.4f}")
    print(f"     Std:  {self_sim.std():.4f}")
    
    if self_sim.mean() > 0.7:
        print(f"   ‚úì STRONG alignment between audio and lyrics")
    elif self_sim.mean() > 0.5:
        print(f"   ‚úì MODERATE alignment")
    else:
        print(f"   ! WEAK alignment - modalities encode different aspects")
    
    print(f"\nüîó METHOD 2: CANONICAL CORRELATION ANALYSIS")
    print(f"{'-'*80}")
    print(f"   Top 3 canonical correlations:")
    for i in range(min(3, len(cca_corrs))):
        print(f"     Component {i+1}: {cca_corrs[i]:.4f}")
    
    if cca_corrs[0] > 0.7:
        print(f"   ‚úì STRONG shared latent structure")
    elif cca_corrs[0] > 0.5:
        print(f"   ‚úì MODERATE shared structure")
    else:
        print(f"   ! LIMITED shared structure - complementary modalities")
    
    print(f"\nüîç METHOD 3: CROSS-MODAL RETRIEVAL")
    print(f"{'-'*80}")
    print(f"   Audio ‚Üí Lyrics (Top-5): {retrieval_results['audio_to_lyrics_acc']:.2%}")
    print(f"   Lyrics ‚Üí Audio (Top-5): {retrieval_results['lyrics_to_audio_acc']:.2%}")
    print(f"   Top-1 exact match:      {retrieval_results['top1_acc']:.2%}")
    
    if retrieval_results['audio_to_lyrics_acc'] > 0.5:
        print(f"   ‚úì GOOD cross-modal predictability")
    elif retrieval_results['audio_to_lyrics_acc'] > 0.2:
        print(f"   ‚úì MODERATE predictability")
    else:
        print(f"   ! WEAK predictability")
    
    print(f"\nüí° OVERALL CONCLUSION")
    print(f"{'-'*80}")
    
    # Overall assessment
    scores = [
        self_sim.mean(),
        cca_corrs[0],
        retrieval_results['audio_to_lyrics_acc']
    ]
    avg_score = np.mean(scores)
    
    if avg_score > 0.6:
        print(f"   The audio and lyrics features show STRONG similarity and alignment.")
        print(f"   They encode related semantic information and have predictive power.")
    elif avg_score > 0.4:
        print(f"   The audio and lyrics features show MODERATE similarity.")
        print(f"   They share some structure but also encode unique information.")
    else:
        print(f"   The audio and lyrics features show LIMITED similarity.")
        print(f"   They appear to encode COMPLEMENTARY rather than redundant information.")
        print(f"   This suggests both modalities contribute unique value to emotion prediction.")
    
    print("\n" + "="*80)

# Generate report
generate_comprehensive_report(self_sim, cca_corrs, retrieval_results, analysis_df)

---
# SAVE RESULTS

## ‚≠ê NEW CODE

In [None]:
# ‚≠ê NEW CODE - Save all results to Google Drive
import json

# Create output directory
output_dir = '/content/drive/MyDrive/dissertation/similarity_analysis_results/'
os.makedirs(output_dir, exist_ok=True)

# Save similarity matrices
np.save(os.path.join(output_dir, 'audio_similarity_matrix.npy'), audio_sim)
np.save(os.path.join(output_dir, 'lyrics_similarity_matrix.npy'), lyrics_sim)
np.save(os.path.join(output_dir, 'cross_modal_similarity_matrix.npy'), cross_modal_sim)

# Save CCA results
np.save(os.path.join(output_dir, 'cca_correlations.npy'), cca_corrs)
np.save(os.path.join(output_dir, 'audio_canonical.npy'), audio_can)
np.save(os.path.join(output_dir, 'lyrics_canonical.npy'), lyrics_can)

# Save extracted features
np.save(os.path.join(output_dir, 'audio_features.npy'), audio_features)
np.save(os.path.join(output_dir, 'lyrics_features.npy'), lyrics_features)

# Create summary CSV with per-song similarity scores
results_df = analysis_df[[id_column_name, 'valence', 'arousal']].copy()
results_df['self_similarity'] = self_sim
results_df['valence_predicted'] = predictions[:, 0]
results_df['arousal_predicted'] = predictions[:, 1]
results_df.to_csv(os.path.join(output_dir, 'similarity_summary.csv'), index=False)

# Save metrics summary as JSON
metrics = {
    'dataset': 'test_set',
    'n_songs': len(analysis_df),
    'mean_self_similarity': float(self_sim.mean()),
    'std_self_similarity': float(self_sim.std()),
    'cca_correlation_1': float(cca_corrs[0]),
    'cca_mean_correlation': float(cca_corrs.mean()),
    'retrieval_audio_to_lyrics': float(retrieval_results['audio_to_lyrics_acc']),
    'retrieval_lyrics_to_audio': float(retrieval_results['lyrics_to_audio_acc']),
    'retrieval_top1': float(retrieval_results['top1_acc'])
}

with open(os.path.join(output_dir, 'metrics_summary.json'), 'w') as f:
    json.dump(metrics, f, indent=2)

print(f"‚úì Results saved to: {output_dir}")
print(f"\nFiles created:")
print(f"  - audio_similarity_matrix.npy")
print(f"  - lyrics_similarity_matrix.npy")
print(f"  - cross_modal_similarity_matrix.npy")
print(f"  - cca_correlations.npy")
print(f"  - audio_canonical.npy")
print(f"  - lyrics_canonical.npy")
print(f"  - audio_features.npy")
print(f"  - lyrics_features.npy")
print(f"  - similarity_summary.csv")
print(f"  - metrics_summary.json")

---
---
# ‚úÖ ANALYSIS COMPLETE!
---
---

## Summary of What You Have

You've now completed similarity analysis using 3 complementary methods:

1. **Cosine Similarity**: Measured how similar audio and lyrics features are
2. **CCA**: Found shared latent dimensions between modalities
3. **Cross-Modal Retrieval**: Tested if audio can predict lyrics

## Next Steps for Your Dissertation

1. **Interpret Results**: Look at the metrics and visualizations
2. **Write Analysis Section**: Use the comprehensive report as a starting point
3. **Consider Siamese Networks** (optional): See the guidance below

All results are saved to your Google Drive for later analysis!

---
---
# OPTIONAL: Siamese Networks for Improved Similarity
---
---

## What is a Siamese Network?

Your current analysis uses features optimized for **emotion prediction** (valence/arousal). A Siamese network would learn features specifically optimized for **cross-modal similarity**.

### Key Differences:

| Aspect | Current Analysis | Siamese Network |
|--------|------------------|------------------|
| **Features** | From emotion prediction model | Learned for similarity |
| **Training** | No additional training | Requires training (~4-6 hours) |
| **Retrieval** | Baseline (e.g., 20-40%) | Improved (e.g., 60-80%) |
| **Complexity** | Simple (done!) | Moderate (~2-3 days work) |

### What Would You Gain?

1. **Better retrieval accuracy**: Features explicitly trained to match audio-lyrics pairs
2. **Learned embeddings**: New embedding space optimized for similarity
3. **Novel contribution**: Shows you can design and implement new architectures

### Implementation Complexity

**Effort**: MODERATE (2-3 days, ~250 new lines of code)

**What you'd reuse from MODEL 4**:
- ‚úÖ Your VGGish audio model
- ‚úÖ Your BERT lyrics model
- ‚úÖ Your data loading pipeline
- ‚úÖ Your preprocessing code

**What you'd need to add**:
- Projection heads (map to shared 256-dim space)
- InfoNCE loss function (~20 lines)
- Training loop (~100 lines)
- Evaluation code (~50 lines)

### Recommendation

**For your dissertation, I recommend doing BOTH:**

1. **Chapter Section 1**: "Similarity Analysis of Emotion Features" (what you just did)
   - Shows baseline similarity with existing features
   - Quick to complete ‚úÖ

2. **Chapter Section 2**: "Learning Similarity-Optimized Embeddings" (optional Siamese network)
   - Shows improved similarity with learned features
   - Demonstrates advanced ML skills

This creates a strong narrative: **analyze ‚Üí design improvement ‚Üí demonstrate success**

---

**Want me to create the Siamese network code?** Let me know and I'll provide:
- Complete implementation adapted to MODEL 4
- Training script using your existing setup
- Evaluation code for comparison