# Audio Feature Extraction and Visualization

This notebook demonstrates the extraction and visualization of audio features for emotion recognition using the CREMA-D dataset.

We'll explore:
1. Audio preprocessing techniques
2. Feature extraction methods
3. Visualization of features across different emotions
4. Feature distribution analysis

In [15]:
import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import librosa
import librosa.display
from IPython.display import Audio, display
from tqdm.notebook import tqdm
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Add the src directory to the path
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from src.utils import load_config
from src.cremad_loader import CREMADDataLoader
from src.preprocessing import AudioPreprocessor
from src.features import FeatureExtractor

# Set some plotting parameters
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context("notebook", font_scale=1.5)

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

## 1. Load the Dataset

In [16]:
# Load configuration
config = load_config('../config.yaml')
print("Configuration loaded successfully")

# Create data loader
data_loader = CREMADDataLoader('../config.yaml')

# Load a subset of the dataset (stratified by emotion)
metadata_df, audio_data = data_loader.load_dataset(limit=60, stratify_by='emotion')

print(f"Loaded {len(metadata_df)} audio samples")

2025-04-09 14:24:30,402 - src.cremad_loader - INFO - Found 0 audio files


Configuration loaded successfully


Processing metadata: 0it [00:00, ?it/s]


Loaded 0 audio samples
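
The log above reports 0 audio files, so `metadata_df` is empty and the next cell fails with a `KeyError`. A minimal sketch of a pre-flight check, assuming the recordings are `.wav` files stored under a single directory. `find_audio_files` is a hypothetical helper and is not part of `src.cremad_loader`:

```python
from pathlib import Path


def find_audio_files(root, extensions=(".wav",)):
    """Return sorted audio file paths under `root` (hypothetical helper)."""
    root = Path(root)
    if not root.is_dir():
        raise FileNotFoundError(f"Audio directory does not exist: {root.resolve()}")
    # Search recursively and compare suffixes case-insensitively
    files = sorted(
        p for p in root.rglob("*") if p.is_file() and p.suffix.lower() in extensions
    )
    if not files:
        raise FileNotFoundError(f"No {extensions} files found under {root.resolve()}")
    return files
```

Running a check like this on the dataset directory named in `config.yaml` before calling `load_dataset` would turn a silent empty result into an immediate, descriptive error.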


In [17]:
# Display dataset statistics
print("\nEmotion distribution:")
emotion_counts = metadata_df['emotion'].value_counts()
print(emotion_counts)

print("\nIntensity distribution:")
intensity_counts = metadata_df['intensity'].value_counts()
print(intensity_counts)

print("\nGender distribution:")
gender_counts = metadata_df['gender'].value_counts()
print(gender_counts)

# Plot emotion distribution
plt.figure(figsize=(10, 6))
sns.countplot(data=metadata_df, x='emotion', palette='viridis')
plt.title('Emotion Distribution')
plt.xlabel('Emotion')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


Emotion distribution:


KeyError: 'emotion'

## 2. Audio Preprocessing

Let's apply our preprocessing pipeline to one sample from each emotion and visualize the results.

In [None]:
# Initialize preprocessor
preprocessor = AudioPreprocessor(config_path=os.path.join('..', 'config.yaml'))

# Get one sample from each emotion
emotion_samples = {}
for emotion in metadata_df['emotion'].unique():
    sample_idx = metadata_df[metadata_df['emotion'] == emotion].index[0]
    emotion_samples[emotion] = {
        'metadata': metadata_df.loc[sample_idx],
        'audio_data': audio_data[sample_idx]
    }

In [None]:
# Preprocess samples and visualize
for emotion, sample in emotion_samples.items():
    y, sr = sample['audio_data']
    
    # Preprocess audio
    y_preprocessed = preprocessor.preprocess_audio(y, sr)
    
    # Plot original and preprocessed waveforms
    plt.figure(figsize=(15, 6))
    
    plt.subplot(2, 1, 1)
    librosa.display.waveshow(y, sr=sr)
    plt.title(f"Original Waveform - {emotion.capitalize()}")
    
    plt.subplot(2, 1, 2)
    librosa.display.waveshow(y_preprocessed, sr=sr)
    plt.title(f"Preprocessed Waveform - {emotion.capitalize()}")
    
    plt.tight_layout()
    plt.show()
    
    # Listen to original and preprocessed audio
    print(f"Original audio - {emotion.capitalize()}:")
    display(Audio(y, rate=sr))
    
    print(f"Preprocessed audio - {emotion.capitalize()}:")
    display(Audio(y_preprocessed, rate=sr))
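
`AudioPreprocessor` lives in `src.preprocessing`, and its internals are not shown in this notebook. To illustrate what such a pipeline commonly does, here is a minimal NumPy sketch of two typical steps: silence trimming and peak normalization. The function names, frame sizes, and the -40 dB threshold are assumptions for illustration, not the project's actual settings:

```python
import numpy as np


def peak_normalize(y, peak=1.0, eps=1e-8):
    """Scale the signal so its largest absolute sample equals `peak`."""
    max_abs = np.max(np.abs(y)) if y.size else 0.0
    if max_abs < eps:
        return y.astype(np.float32)  # silent clip: nothing to scale
    return (y * (peak / max_abs)).astype(np.float32)


def trim_silence(y, frame_length=2048, hop_length=512, threshold_db=-40.0):
    """Drop leading/trailing frames quieter than `threshold_db` relative to the loudest frame."""
    if y.size < frame_length:
        return y
    n_frames = 1 + (y.size - frame_length) // hop_length
    # Root-mean-square energy of each frame
    rms = np.array([
        np.sqrt(np.mean(y[i * hop_length: i * hop_length + frame_length] ** 2))
        for i in range(n_frames)
    ])
    ref = rms.max()
    if ref == 0:
        return y
    db = 20.0 * np.log10(np.maximum(rms, 1e-10) / ref)
    keep = np.flatnonzero(db > threshold_db)
    start = keep[0] * hop_length
    end = min(y.size, keep[-1] * hop_length + frame_length)
    return y[start:end]


# Synthetic example: one second of tone padded by one second of silence on each side
sr_demo = 16000
t = np.arange(sr_demo) / sr_demo
tone = 0.5 * np.sin(2 * np.pi * 220 * t)
y_demo = np.concatenate([np.zeros(sr_demo), tone, np.zeros(sr_demo)])
y_clean = peak_normalize(trim_silence(y_demo))
print(len(y_demo), len(y_clean))
```

Trimming before normalizing keeps the scale factor tied to the speech segment rather than to any stray click in the padding.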

## 3. Feature Extraction

Let's extract features from our samples and examine them.

In [None]:
# Initialize feature extractor
feature_extractor = FeatureExtractor(config_path=os.path.join('..', 'config.yaml'))

In [None]:
# Extract and visualize raw features for each emotion
for emotion, sample in emotion_samples.items():
    y, sr = sample['audio_data']
    
    # Preprocess audio
    y_preprocessed = preprocessor.preprocess_audio(y, sr)
    
    # Extract features
    features = feature_extractor.extract_all_features(y_preprocessed, sr)
    
    # Plot MFCC features
    plt.figure(figsize=(12, 8))
    
    plt.subplot(3, 1, 1)
    # Pass sr so the time axis uses the clip's sample rate
    # (assumes the extractor uses librosa's default hop length)
    librosa.display.specshow(features['mfcc'][:13], sr=sr, x_axis='time')
    plt.colorbar()
    plt.title(f"MFCC Features - {emotion.capitalize()}")
    
    if features['mfcc'].shape[0] > 13:
        plt.subplot(3, 1, 2)
        librosa.display.specshow(features['mfcc'][13:26], sr=sr, x_axis='time')
        plt.colorbar()
        plt.title(f"MFCC Delta Features - {emotion.capitalize()}")
        
        plt.subplot(3, 1, 3)
        librosa.display.specshow(features['mfcc'][26:], sr=sr, x_axis='time')
        plt.colorbar()
        plt.title(f"MFCC Delta-Delta Features - {emotion.capitalize()}")
    
    plt.tight_layout()
    plt.show()
    
    # Plot prosodic features
    if features['prosodic'].size > 0:
        plt.figure(figsize=(15, 5))
        for i in range(features['prosodic'].shape[0]):
            plt.plot(features['prosodic'][i], label=f"Feature {i+1}")
        plt.title(f"Prosodic Features - {emotion.capitalize()}")
        plt.xlabel("Time Frame")
        plt.ylabel("Value")
        plt.legend()
        plt.tight_layout()
        plt.show()
    
    # Plot spectral features
    if features['spectral'].size > 0:
        plt.figure(figsize=(15, 5))
        for i in range(features['spectral'].shape[0]):
            plt.plot(features['spectral'][i], label=f"Feature {i+1}")
        plt.title(f"Spectral Features - {emotion.capitalize()}")
        plt.xlabel("Time Frame")
        plt.ylabel("Value")
        plt.legend()
        plt.tight_layout()
        plt.show()
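
The slicing in the cell above (`[:13]`, `[13:26]`, `[26:]`) implies that `features['mfcc']` stacks 13 static coefficients with their first- and second-order deltas, 39 rows in total. That layout is inferred from the plotting code, not from `FeatureExtractor` itself. A minimal sketch of how such a matrix can be assembled, using `np.gradient` as a simplified substitute for a smoothed delta such as `librosa.feature.delta`; `stack_with_deltas` is a hypothetical name:

```python
import numpy as np


def stack_with_deltas(mfcc):
    """Stack static, delta, and delta-delta coefficients along the feature axis.

    `mfcc` has shape (n_mfcc, n_frames). np.gradient is a plain finite
    difference over time, used here in place of a smoothed delta filter.
    """
    delta = np.gradient(mfcc, axis=1)
    delta2 = np.gradient(delta, axis=1)
    return np.vstack([mfcc, delta, delta2])


rng = np.random.default_rng(0)
mfcc_demo = rng.normal(size=(13, 100))  # placeholder for real MFCCs
stacked = stack_with_deltas(mfcc_demo)

# The same slices the plotting cell uses
static, delta, delta2 = stacked[:13], stacked[13:26], stacked[26:]
print(stacked.shape)  # (39, 100)
```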

## 4. Extract Feature Statistics

Now let's extract statistical features from all our samples.

In [None]:
# Preprocess all audio samples
preprocessed_audio = []
for y, sr in tqdm(audio_data, desc="Preprocessing audio"):
    y_preprocessed = preprocessor.preprocess_audio(y, sr)
    preprocessed_audio.append((y_preprocessed, sr))

In [None]:
# Extract features from all preprocessed samples
all_feature_vectors = []
for y, sr in tqdm(preprocessed_audio, desc="Extracting features"):
    # Extract features
    features = feature_extractor.extract_all_features(y, sr)
    
    # Compute statistics
    statistics = feature_extractor.compute_feature_statistics(features)
    
    # Concatenate all statistics into a single feature vector
    feature_vectors = []
    for feature_name in sorted(statistics.keys()):
        feature_vectors.append(statistics[feature_name])
    
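    if not feature_vectors:
        # Fail loudly: an empty row would make the matrix ragged and misalign it with metadata_df
        raise ValueError("No feature statistics were computed for a sample")
    all_feature_vectors.append(np.concatenate(feature_vectors))

# Stack into a 2-D array of shape (n_samples, n_features)
X = np.vstack(all_feature_vectors)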
print(f"Feature matrix shape: {X.shape}")
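
Frame-level features have a different number of columns for every clip, so they must be summarized before they can be stacked into one matrix. The internals of `compute_feature_statistics` are not shown here; the sketch below assumes a common choice (per-feature mean, standard deviation, minimum, and maximum) purely for illustration, and `summarize_frames` is a hypothetical name:

```python
import numpy as np


def summarize_frames(feature_matrix):
    """Collapse a (n_features, n_frames) matrix into per-feature mean, std, min, max."""
    feature_matrix = np.atleast_2d(feature_matrix)
    return np.concatenate([
        feature_matrix.mean(axis=1),
        feature_matrix.std(axis=1),
        feature_matrix.min(axis=1),
        feature_matrix.max(axis=1),
    ])


# Two clips of different duration yield vectors of the same length
short_clip = np.random.default_rng(0).normal(size=(13, 80))
long_clip = np.random.default_rng(1).normal(size=(13, 240))
X_demo = np.vstack([summarize_frames(short_clip), summarize_frames(long_clip)])
print(X_demo.shape)  # (2, 52)
```

Because every clip maps to the same number of statistics, the rows line up one-to-one with the rows of `metadata_df`.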

## 5. Feature Visualization

Let's visualize the features using dimensionality reduction techniques.

In [None]:
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Create DataFrame with PCA results and emotion labels
pca_df = pd.DataFrame({
    'pca1': X_pca[:, 0],
    'pca2': X_pca[:, 1],
    'emotion': metadata_df['emotion'].values,
    'gender': metadata_df['gender'].values,
    'intensity': metadata_df['intensity'].values
})

# Plot PCA results by emotion
plt.figure(figsize=(12, 8))
sns.scatterplot(data=pca_df, x='pca1', y='pca2', hue='emotion', palette='viridis', s=100)
plt.title('PCA of Audio Features by Emotion')
plt.xlabel(f'Principal Component 1 ({pca.explained_variance_ratio_[0]:.2%} variance)')
plt.ylabel(f'Principal Component 2 ({pca.explained_variance_ratio_[1]:.2%} variance)')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

In [None]:
# Apply t-SNE (non-linear embedding); perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, random_state=42, perplexity=5)
X_tsne = tsne.fit_transform(X_scaled)

# Create DataFrame with t-SNE results and emotion labels
tsne_df = pd.DataFrame({
    'tsne1': X_tsne[:, 0],
    'tsne2': X_tsne[:, 1],
    'emotion': metadata_df['emotion'].values,
    'gender': metadata_df['gender'].values,
    'intensity': metadata_df['intensity'].values
})

# Plot t-SNE results by emotion
plt.figure(figsize=(12, 8))
sns.scatterplot(data=tsne_df, x='tsne1', y='tsne2', hue='emotion', palette='viridis', s=100)
plt.title('t-SNE of Audio Features by Emotion')
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

In [None]:
# Plot t-SNE results by intensity
plt.figure(figsize=(12, 8))
sns.scatterplot(data=tsne_df, x='tsne1', y='tsne2', hue='intensity', palette='plasma', s=100)
plt.title('t-SNE of Audio Features by Intensity')
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

In [None]:
# Plot t-SNE results by gender
plt.figure(figsize=(12, 8))
sns.scatterplot(data=tsne_df, x='tsne1', y='tsne2', hue='gender', palette='Set1', s=100)
plt.title('t-SNE of Audio Features by Gender')
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

## 6. Feature Importance Analysis

Let's examine how much variance each principal component explains and how individual features are distributed across emotions.

In [None]:
# Apply PCA with more components for analysis
n_components = min(10, X_scaled.shape[1])
pca_analysis = PCA(n_components=n_components)
pca_analysis.fit(X_scaled)

# Plot explained variance ratio
plt.figure(figsize=(10, 6))
plt.bar(range(1, n_components+1), pca_analysis.explained_variance_ratio_)
plt.plot(range(1, n_components+1), np.cumsum(pca_analysis.explained_variance_ratio_), marker='o', linestyle='-', color='r')
plt.title('Explained Variance by Principal Components')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.xticks(range(1, n_components+1))
plt.axhline(y=0.9, color='k', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

print(f"Cumulative explained variance: {np.cumsum(pca_analysis.explained_variance_ratio_)}")
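
The explained-variance plot shows how much each component captures, but not which features drive it. That information is in the loadings, which scikit-learn exposes as `PCA.components_`. The sketch below computes the same quantity (up to sign) with a plain SVD on synthetic data; `pca_loadings` and `top_features` are hypothetical helper names, and the toy matrix stands in for `X_scaled`:

```python
import numpy as np


def pca_loadings(X, n_components=2):
    """Return (components, explained_variance_ratio) via SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2
    return vt[:n_components], (var / var.sum())[:n_components]


def top_features(components, k=3):
    """For each component, list the k feature indices with the largest absolute loading."""
    return [np.argsort(-np.abs(row))[:k].tolist() for row in components]


# Toy data: features 0 and 1 share a common signal, the rest are independent noise
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X_demo = rng.normal(size=(200, 5))
X_demo[:, 0] = base + 0.1 * rng.normal(size=200)
X_demo[:, 1] = base + 0.1 * rng.normal(size=200)
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)  # standardize each column

components, ratio = pca_loadings(X_demo, n_components=2)
print(top_features(components, k=2)[0])  # strongest loadings on the first component
```

With a fitted scikit-learn model, the same ranking can be read from `pca_analysis.components_` without recomputing the SVD.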

In [None]:
# Create a dataset with all features and metadata
feature_dataset = pd.DataFrame(X_scaled)
feature_dataset['emotion'] = metadata_df['emotion'].values
feature_dataset['gender'] = metadata_df['gender'].values
feature_dataset['intensity'] = metadata_df['intensity'].values

# Let's look at the distribution of features by emotion
# We'll examine the first few feature columns (column order, not ranked by importance)
top_features = 5
plt.figure(figsize=(15, 12))

for i in range(top_features):
    plt.subplot(top_features, 1, i+1)
    sns.boxplot(data=feature_dataset, x='emotion', y=i)
    plt.title(f'Distribution of Feature {i+1} by Emotion')
    plt.xlabel('Emotion')
    plt.ylabel(f'Feature {i+1} Value')
    plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

## 7. Audio Feature Correlations

Let's examine the correlations between our extracted features.

In [None]:
# Calculate correlation matrix for the first 20 features
n_features = min(20, X_scaled.shape[1])
corr_matrix = pd.DataFrame(X_scaled[:, :n_features]).corr()

# Plot correlation matrix
plt.figure(figsize=(14, 12))
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()
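
A heatmap gives the overall picture; listing the strongly correlated pairs makes the redundancy concrete. A minimal sketch on synthetic data, where the 0.9 threshold and the name `highly_correlated_pairs` are assumptions chosen for illustration:

```python
import numpy as np


def highly_correlated_pairs(X, threshold=0.9):
    """Return (i, j, r) for feature pairs with |Pearson r| >= threshold, i < j."""
    corr = np.corrcoef(X, rowvar=False)
    # Upper triangle only, so each pair is reported once and the diagonal is skipped
    rows, cols = np.triu_indices_from(corr, k=1)
    return [
        (int(i), int(j), float(corr[i, j]))
        for i, j in zip(rows, cols)
        if abs(corr[i, j]) >= threshold
    ]


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
X_demo[:, 1] = 2.0 * X_demo[:, 0] + 0.05 * rng.normal(size=300)  # near-duplicate of feature 0
pairs = highly_correlated_pairs(X_demo, threshold=0.9)
print(pairs)
```

Applied to `X_scaled`, a list like this could serve as a starting point for dropping one feature from each redundant pair.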

## 8. Prepare Complete Feature Dataset for Training

Let's prepare the complete dataset for model training.

In [None]:
# Create the complete feature dataset
feature_dataset = pd.DataFrame(X_scaled)
for col in ['actor_id', 'emotion', 'intensity', 'gender', 'sentence_id']:
    if col in metadata_df.columns:
        feature_dataset[col] = metadata_df[col].values

# Save the dataset
features_dir = os.path.join('..', 'data', 'features')
os.makedirs(features_dir, exist_ok=True)
feature_dataset.to_pickle(os.path.join(features_dir, 'full_features.pkl'))

print(f"Saved feature dataset with shape {feature_dataset.shape}")

In [None]:
# Display sample of the dataset
feature_dataset[['emotion', 'intensity', 'gender']].head()

## 9. Summary and Observations

Based on our feature extraction and visualization, we can make several observations:

1. **Preprocessing Effects**: Our preprocessing pipeline successfully normalizes the audio and reduces noise, resulting in cleaner waveforms.

2. **Feature Differentiation**: The PCA and t-SNE visualizations show some clustering of emotions, indicating that our features contain information useful for emotion classification.

3. **Feature Importance**: The first few principal components capture a significant portion of the variance in our dataset, suggesting that dimensionality reduction might be beneficial for model training.

4. **Emotion Patterns**: Certain emotions, such as 'angry' and 'happy', appear to have feature patterns distinct from those of 'neutral' and 'sad'.

5. **Gender and Intensity Effects**: The visualizations suggest that gender and intensity level also influence the acoustic features, which could be important confounding variables to consider during model development.

6. **Feature Correlations**: Many of our extracted features are highly correlated, which indicates redundancy in the feature set. Feature selection or dimensionality reduction may therefore improve model performance.

7. **Feature Distribution**: The boxplots show that some features have different distributions across emotions, which is promising for classification tasks.

These insights will guide our approach to model development in the next phase of the project.

## Next Steps

1. Implement feature selection techniques to identify the most discriminative features
2. Train baseline models (SVM, Random Forest, XGBoost) on the extracted features
3. Perform hyperparameter optimization to improve model performance
4. Evaluate models using cross-validation and various metrics
5. Analyze model results to gain insights into emotion recognition patterns