# NISQA_TRAIN_SIM Dataset Demo

This notebook provides an exploratory data analysis (EDA) and audio demonstration of the **NISQA_TRAIN_SIM** dataset.

## Dataset Overview
According to the dataset readme, NISQA_TRAIN_SIM contains speech files with simulated distortions applied. It is intended for training multidimensional speech quality prediction models.

- **Files:** 10,000 training files
- **Distortions:** AWGN, MNRU, DNS-Challenge Noise, Filters, Clipping, Codecs (AMR, EVS, Opus, etc.), and Packet Loss.
- **Targets:** 
    - `mos`: Overall Mean Opinion Score
    - `noi`: Noisiness
    - `col`: Coloration
    - `dis`: Discontinuity
    - `loud`: Loudness

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import IPython.display as ipd
from pathlib import Path

# Set plot style
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

## 1. Configuration and Data Loading

We will load the main metadata file: `NISQA_TRAIN_SIM_file.csv`.

In [None]:
# Define paths based on your directory structure
BASE_DIR = Path("data/raw/NISQA_Corpus/NISQA_TRAIN_SIM")
CSV_PATH = BASE_DIR / "NISQA_TRAIN_SIM_file.csv"

# Fail early if the metadata file is missing, so later cells do not run with an undefined `df`
if not CSV_PATH.exists():
    raise FileNotFoundError(f"CSV not found at {CSV_PATH}. Please check the path.")

# Load the dataframe
df = pd.read_csv(CSV_PATH)
print(f"Dataset loaded successfully with {len(df)} rows and {len(df.columns)} columns.")
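
Before going further, it can help to confirm that the loaded table has the columns the later cells rely on. The sketch below is one way to do that. The `EXPECTED_COLUMNS` list and the `missing_columns` helper are illustrative names, and the list is only inferred from the columns this notebook reads, not from an official schema. A toy frame stands in for `df` so the snippet runs on its own.

```python
import pandas as pd

# Columns read by later cells of this notebook (an assumption drawn from the
# notebook itself, not an official schema of the corpus).
EXPECTED_COLUMNS = ["mos", "noi", "col", "dis", "loud",
                    "filepath_deg", "filepath_ref"]


def missing_columns(frame, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from `frame`."""
    return [name for name in expected if name not in frame.columns]


# Toy frame standing in for the real metadata table.
toy = pd.DataFrame({"mos": [3.2], "noi": [2.9], "filepath_deg": ["a.wav"]})
print("Missing columns:", missing_columns(toy))
```

In the notebook itself, `missing_columns(df)` should return an empty list if the CSV matches what the remaining cells expect.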

## 2. Data Inspection
Let's look at the available columns and the first few rows. The dataset contains both metadata (distortion parameters) and target labels (MOS scores).

In [None]:
pd.set_option('display.max_columns', None)
display(df.head())
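
Since distortion parameters are typically filled in only for the rows where that distortion applies, a per-column summary of missing values can complement `df.head()`. The following is a minimal sketch. `column_summary` is a hypothetical helper and the toy values are made up for illustration.

```python
import pandas as pd


def column_summary(frame):
    """One row per column: dtype, fraction of missing values, distinct count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing_frac": frame.isna().mean(),
        "n_unique": frame.nunique(),  # NaN is not counted as a distinct value
    })


# Toy frame with made-up values; in the notebook you would pass `df`.
toy = pd.DataFrame({
    "mos": [3.1, 4.2, 2.0, 3.3],
    "codec1": ["codec_a", None, "codec_b", None],
})
summary = column_summary(toy)
print(summary)
```

Columns with a high `missing_frac` are likely distortion parameters that were not used for most files, which is worth knowing before filtering on them.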

## 3. Distribution of Target Scores (MOS)

The dataset provides an overall quality score (`mos`) and four dimension-specific scores. 
- `mos`: Mean Opinion Score (1-5)
- `noi`: Noisiness
- `col`: Coloration
- `dis`: Discontinuity
- `loud`: Loudness

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Overall MOS Distribution
sns.histplot(df['mos'], bins=30, kde=True, ax=axes[0], color='skyblue')
axes[0].set_title('Distribution of Overall MOS')
axes[0].set_xlabel('MOS (Mean Opinion Score)')

# Plot 2: Dimensions Boxplot
dimensions = df[['noi', 'col', 'dis', 'loud']]
sns.boxplot(data=dimensions, ax=axes[1], palette="Set2")
axes[1].set_title('Distribution of Speech Quality Dimensions')
axes[1].set_ylim(1, 5)

plt.tight_layout()
plt.show()
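
To see how closely each dimension tracks the overall score, one option is to compute the Pearson correlation of each dimension with `mos`. This is a sketch under the assumption that all five score columns are numeric. `correlation_with_mos` is an illustrative helper, and the toy numbers are invented, not taken from the dataset.

```python
import pandas as pd

SCORE_COLUMNS = ["mos", "noi", "col", "dis", "loud"]


def correlation_with_mos(frame, columns=SCORE_COLUMNS):
    """Pearson correlation of each dimension score with `mos`, highest first."""
    corr = frame[columns].corr(method="pearson")
    return corr["mos"].drop("mos").sort_values(ascending=False)


# Toy scores (invented values); in the notebook you would pass `df`.
toy = pd.DataFrame({
    "mos":  [1.0, 2.0, 3.0, 4.0, 5.0],
    "noi":  [1.0, 2.0, 3.0, 4.0, 5.0],
    "col":  [5.0, 4.0, 3.0, 2.0, 1.0],
    "dis":  [2.0, 2.0, 3.0, 3.0, 4.0],
    "loud": [3.0, 1.0, 4.0, 1.0, 5.0],
})
ranking = correlation_with_mos(toy)
print(ranking)
```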

## 4. Analyzing Distortion Effects

The dataset includes various simulated distortions. Let's analyze how specific parameters affect the MOS.

### 4.1 Impact of Codecs
We compare the MOS distribution across different primary codecs (`codec1`).

In [None]:
# value_counts() ignores NaN, so rows with no codec1 entry never appear in top_codecs
# Note: if your CSV marks 'no codec' with a placeholder instead (e.g., 'none' or 0), filter those rows out as well
if 'codec1' in df.columns:
    plt.figure(figsize=(14, 6))
    
    # Get top codecs by frequency to avoid clutter
    top_codecs = df['codec1'].value_counts().index[:10]
    df_codecs = df[df['codec1'].isin(top_codecs)]
    
    sns.boxplot(x='codec1', y='mos', data=df_codecs, palette='viridis')
    plt.title('Impact of Different Codecs on MOS')
    plt.xticks(rotation=45)
    plt.show()
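
A numeric table can back up the boxplot. The sketch below groups by codec and reports count, mean, and median MOS. `mos_by_codec` and its `min_count` parameter are hypothetical, and the codec labels in the toy frame are placeholders rather than actual values from the CSV.

```python
import pandas as pd


def mos_by_codec(frame, codec_col="codec1", min_count=1):
    """Count, mean and median MOS per codec, best mean first."""
    stats = (
        frame.dropna(subset=[codec_col])   # skip rows without a codec entry
             .groupby(codec_col)["mos"]
             .agg(["count", "mean", "median"])
    )
    # Drop codecs with too few files to give a meaningful average.
    stats = stats[stats["count"] >= min_count]
    return stats.sort_values("mean", ascending=False)


# Toy frame with placeholder codec labels; in the notebook you would pass `df`.
toy = pd.DataFrame({
    "codec1": ["codec_a", "codec_a", "codec_b", "codec_b", None],
    "mos": [2.0, 3.0, 4.0, 4.5, 5.0],
})
codec_stats = mos_by_codec(toy)
print(codec_stats)
```

Raising `min_count` is a simple way to keep rarely used codecs from dominating the ranking.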

### 4.2 Impact of White Noise (SNR)
We plot the white Gaussian noise SNR (`wbgn_snr`) against the Noisiness (`noi`) and overall (`mos`) scores.

In [None]:
# Plot only the rows that have an SNR value for white background noise
if 'wbgn_snr' in df.columns:
    # NaN is treated here as 'no noise applied'. If your CSV uses a sentinel instead
    # (e.g., 0 or a very high number), check the distribution and filter that value too.
    df_noise = df[df['wbgn_snr'].notna()]

    plt.figure(figsize=(10, 6))
    sns.scatterplot(x='wbgn_snr', y='mos', data=df_noise, alpha=0.5, label='Overall MOS')
    sns.scatterplot(x='wbgn_snr', y='noi', data=df_noise, alpha=0.5, color='orange', label='Noisiness Score')
    
    plt.title('SNR vs Quality Scores')
    plt.xlabel('White Noise SNR (dB)')
    plt.ylabel('Score')
    plt.legend()
    plt.show()
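
Scatter plots with many points can hide the trend, so averaging the scores per SNR bin may make the relationship easier to read. This is a sketch only. `mean_scores_by_snr_bin` is an illustrative helper, and the bin edges are an assumption that should be adjusted to the SNR range actually present in the CSV. The toy values are invented.

```python
import pandas as pd


def mean_scores_by_snr_bin(frame, snr_col="wbgn_snr",
                           bin_edges=(0, 10, 20, 30, 40, 50)):
    """Mean `mos` and `noi` per SNR bin, using only rows with an SNR value."""
    noisy = frame.dropna(subset=[snr_col]).copy()
    # Values outside the bin edges become NaN and are left out by groupby.
    noisy["snr_bin"] = pd.cut(noisy[snr_col], bins=list(bin_edges),
                              include_lowest=True)
    return noisy.groupby("snr_bin", observed=True)[["mos", "noi"]].mean()


# Toy frame with invented values; in the notebook you would pass `df`.
toy = pd.DataFrame({
    "wbgn_snr": [5.0, 8.0, 25.0, 28.0, None],
    "mos": [1.5, 2.5, 4.0, 4.4, 4.8],
    "noi": [1.0, 2.0, 4.0, 4.5, 4.9],
})
binned = mean_scores_by_snr_bin(toy)
print(binned)
```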

## 5. Audio Playback Demo

Compare the **Reference** (Clean) audio with the **Degraded** (Simulated) audio side-by-side.

**Note:** This requires the audio files to be present in the `deg/` and `ref/` subfolders.

In [None]:
def listen_to_sample(row_index=None):
    """
    Selects a sample and displays audio players for Ref and Deg files.
    """
    if row_index is None:
        sample = df.sample(1).iloc[0]
    else:
        sample = df.iloc[row_index]
        
    # Construct full paths
    # The CSV columns 'filepath_deg' and 'filepath_ref' include the dataset folder name (NISQA_TRAIN_SIM/...). 
    # So we join them with the parent of BASE_DIR to avoid duplication: data/raw/NISQA_Corpus + NISQA_TRAIN_SIM/...
    
    path_deg = BASE_DIR.parent / str(sample['filepath_deg'])
    path_ref = BASE_DIR.parent / str(sample['filepath_ref'])

    print("--- File Info ---")
    print(f"File Name: {sample['file']}")
    print(f"Condition: {sample['con_description']}")
    print(f"MOS: {sample['mos']:.2f} | Noi: {sample['noi']:.2f} | Dis: {sample['dis']:.2f} | Col: {sample['col']:.2f} | Loud: {sample['loud']:.2f}")
    
    print("\n🎧 Reference (Clean):")
    if path_ref.exists():
        display(ipd.Audio(filename=path_ref))
    else:
        print(f"File not found: {path_ref}")

    print("\n🎧 Degraded (Simulated):")
    if path_deg.exists():
        display(ipd.Audio(filename=path_deg))
    else:
        print(f"File not found: {path_deg}")

# Run the player with a random sample
listen_to_sample()

In [None]:
# You can also listen to a specific file by index if you found an interesting outlier in the plots
# listen_to_sample(0) 
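
One way to find candidate outliers without reading them off the plots is to pick the rows with the most extreme scores. The sketch below returns positional indices, which is what `listen_to_sample` expects because it uses `df.iloc`. `extreme_positions` is a hypothetical helper, and the toy values are made up.

```python
import numpy as np
import pandas as pd


def extreme_positions(frame, column="mos", n=3, lowest=True):
    """Positional row indices (for .iloc) of the n lowest or highest values."""
    values = frame[column].to_numpy(dtype=float)
    order = np.argsort(values, kind="stable")  # NaN values sort to the end
    valid = order[: np.count_nonzero(~np.isnan(values))]  # drop NaN rows
    picked = valid[:n] if lowest else valid[::-1][:n]
    return picked.tolist()


# Toy frame with made-up scores; in the notebook you would pass `df`.
toy = pd.DataFrame({"mos": [3.5, 1.2, 4.8, float("nan"), 2.0]})
worst = extreme_positions(toy, n=2)
best = extreme_positions(toy, n=2, lowest=False)
print("Lowest MOS rows:", worst)
print("Highest MOS rows:", best)
```

In the notebook, something like `listen_to_sample(extreme_positions(df)[0])` would then play the file with the lowest MOS.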