![image.png](https://i.imgur.com/a3uAqnb.png)

# 🎛️ 1D CNN for Tajweed Audio Classification

In this notebook, we will implement a **1D Convolutional Neural Network (CNN)** for classifying **Tajweed rules** in Quranic recitation audio.

A **1D CNN** is a convolutional architecture for sequential data such as audio waveforms. Unlike the 2D CNNs used for images (or spectrograms), a 1D CNN can operate directly on the raw audio signal, learning temporal patterns that distinguish different Tajweed rules.

This notebook demonstrates end-to-end audio classification using PyTorch and torchaudio.

## 📌 **What Is a 1D CNN for Audio?**

A **1D CNN** processes audio data by applying convolutions along the time dimension, making it well suited for:
- Raw audio waveform analysis
- Temporal pattern recognition
- Feature extraction from sequential data
- Real-time audio processing

### 🔹 **Key Concepts:**
1️⃣ **Raw Waveform Processing**: Direct analysis of audio samples without spectrograms

2️⃣ **Temporal Convolutions**: 1D kernels slide across time dimension

3️⃣ **Hierarchical Features**: Early layers capture local patterns, deeper layers learn complex temporal structures

4️⃣ **End-to-End Learning**: No manual feature engineering required
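
To make "kernels slide across the time dimension" concrete, here is a minimal sketch using `torch.nn.functional.conv1d` on a toy signal. The signal and the hand-written edge-detecting kernel are illustrative assumptions; in the actual model the kernel weights are learned during training.

```python
import torch
import torch.nn.functional as F

# Toy "waveform": shape (batch=1, channels=1, time=8)
signal = torch.tensor([[[0., 0., 1., 1., 1., 0., 0., 0.]]])

# Hand-written length-3 kernel that responds to rising and falling edges.
# Shape is (out_channels=1, in_channels=1, width=3).
kernel = torch.tensor([[[-1., 0., 1.]]])

# The kernel slides along the time axis; padding=1 keeps the length at 8.
response = F.conv1d(signal, kernel, padding=1)
print(response.squeeze().tolist())
# [0.0, 1.0, 1.0, 0.0, -1.0, -1.0, 0.0, 0.0]
```

The output is positive where the signal rises and negative where it falls, which is the kind of local temporal pattern an early convolutional layer can pick up.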

### 🔹 **1D CNN vs Traditional Audio Processing:**
| **1D CNN** | **Traditional Methods** |
|------------|-------------------------|
| Raw waveform input | Hand-crafted features (MFCC, spectrograms) |
| Automatic feature learning | Manual feature engineering |
| End-to-end optimization | Multi-stage pipeline |
| Temporal pattern recognition | Frequency-domain analysis |
| Typically trained on a GPU | Feature extraction often runs on the CPU |

## 🎯 **Tajweed Classification Task**

**Tajweed** refers to the rules governing pronunciation during Quranic recitation. Our task is to classify audio segments according to different Tajweed rules based on the acoustic patterns in the recitation.

## 🎛️ Configuration & Imports

Below we set all hyperparameters in one place, as module-level constants, and import the libraries we need.

**Key Parameters Explained:**
- **`SAMPLE_RATE`**: 16 kHz - A common rate for speech processing that balances quality and computational cost
- **`MAX_SEC`**: 4.0 seconds - Fixed duration that every clip is padded or cropped to, so all inputs have the same size
- **`MAX_LEN`**: 64,000 samples - Calculated as `SAMPLE_RATE × MAX_SEC`
- **`BATCH`**, **`EPOCHS`**, **`LR`**: Training hyperparameters (batch size 32, 3 epochs, learning rate 3e-4)
- **`K_FOLDS`**: 5-fold cross-validation for a more reliable performance estimate

We use:
- **torchaudio** for audio I/O and preprocessing
- **torchmetrics** for F1-score calculation (important for potentially imbalanced classes)
- **tqdm** for training progress visualization

In [None]:
from IPython.display import clear_output
# Run this if you are working on Colab (disgusting 🤮)
!pip install torchmetrics

clear_output()


In [None]:
import kagglehub

path = kagglehub.dataset_download("mohammad2012191/tajweed-dataset")

print("Path to dataset files:", path)

In [None]:
from pathlib import Path
import random, os
import numpy as np, pandas as pd
import torch
import torch.nn as nn, torch.nn.functional as F
import torchaudio
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import StratifiedKFold
from torchmetrics.classification import F1Score
from tqdm import tqdm
import matplotlib.pyplot as plt

# 🎲 Set random seeds for reproducibility across all libraries
SEED = 42
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)

# 🎵 Audio processing parameters
SAMPLE_RATE = 16_000  # 16 kHz sampling rate - a common choice for speech
MAX_SEC     = 4.0     # Fixed duration for all audio clips
MAX_LEN     = int(SAMPLE_RATE * MAX_SEC)  # 64,000 samples per clip

# 🏋️ Training hyperparameters
BATCH    = 32   # Batch size - balance between memory usage and gradient stability
EPOCHS   = 3   # Number of training epochs
LR       = 3e-4 # Learning rate - conservative value for stable training
K_FOLDS  = 5    # Number of cross-validation folds

# 📁 Dataset paths
WORK_DIR  = Path(path)
TRAIN_CSV = WORK_DIR/"train.csv"  # CSV with audio file IDs and labels
TRAIN_DIR = WORK_DIR/"train"      # Directory containing .wav files

# 🖥️ Device configuration - use GPU if available for faster training
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(f"🖥️ Using device: {DEVICE}")

## 🔍 Exploratory Data Analysis (EDA)

Understanding your dataset is crucial for successful audio classification. Let's explore:

1. **Reciter Distribution**: Different speakers may have varying vocal characteristics
2. **Tajweed Rule Distribution**: Class imbalance can affect model performance
3. **Audio Characteristics**: Understanding the diversity in our audio data

**Why EDA matters for audio:**
- **Speaker Variability**: Different reciters have unique vocal characteristics
- **Class Imbalance**: Some Tajweed rules might be more common than others
- **Audio Quality**: Consistent preprocessing ensures fair comparison across samples
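
The class-imbalance point can be quantified with a few lines of pandas. This is a minimal sketch on made-up labels (`rule_a`, `rule_b`, `rule_c` are hypothetical); in the notebook the same calls would presumably be applied to `train_df["label_name"]`.

```python
import pandas as pd

# Hypothetical toy labels standing in for train_df["label_name"]
labels = pd.Series(["rule_a"] * 6 + ["rule_b"] * 3 + ["rule_c"] * 1)

counts = labels.value_counts()                  # samples per rule
shares = counts / counts.sum()                  # fraction of the dataset per rule
imbalance_ratio = counts.max() / counts.min()   # largest class vs smallest class

print(shares.round(2).to_dict())
print(f"Imbalance ratio (largest / smallest class): {imbalance_ratio:.1f}")
```

A ratio close to 1 means the rules are roughly balanced; a large ratio is a hint that accuracy alone would be misleading and that a metric such as macro-F1 is worth tracking.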

In [None]:
# 📊 Load training data and perform exploratory analysis
train_df = pd.read_csv(TRAIN_CSV)

print(f"📈 Dataset Overview:")
print(f"Total samples: {len(train_df):,}")
print(f"Number of reciters: {train_df.sheikh_name.nunique()}")
print(f"Number of Tajweed rules: {train_df.label_name.nunique()}")
print(f"\n📋 Tajweed rules: {sorted(train_df.label_name.unique())}")

plt.figure(figsize=(12,5))

plt.subplot(1, 2, 1)
reciter_counts = train_df.sheikh_name.value_counts()
reciter_counts.plot.bar(color='skyblue', alpha=0.8)
plt.title("Distribution of Reciters", fontsize=14, pad=15)
plt.xlabel("Reciter (Sheikh)")
plt.ylabel("Number of Audio Samples")
plt.xticks(rotation=45)
plt.grid(axis='y', alpha=0.3)

plt.subplot(1, 2, 2)
rule_counts = train_df.label_name.value_counts()
rule_counts.plot.bar(color='lightcoral', alpha=0.8)
plt.title("Distribution of Tajweed Rules", fontsize=14, pad=15)
plt.xlabel("Tajweed Rule")
plt.ylabel("Number of Audio Samples")
plt.xticks(rotation=45)
plt.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

## 🎙️ Dataset & Audio Preprocessing

Our PyTorch Dataset class handles the complete audio preprocessing pipeline:

### 🔹 **Audio Processing Steps:**
1. **Loading**: Read `.wav` files using torchaudio
2. **Resampling**: Resample to 16 kHz when a file uses a different rate
3. **Augmentation**: Apply random transformations (training only)
4. **Mono Conversion**: Average the channels into a single mono channel
5. **Length Normalization**: Zero-pad short clips and center-crop long clips to exactly 4 seconds
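
The length-normalization step is easiest to see on tiny tensors. The sketch below mirrors the pad-at-the-end and center-crop logic of the dataset class further down; `pad_or_center_crop` is a hypothetical standalone helper written for this illustration, with the target length passed in rather than read from `MAX_LEN`.

```python
import torch
import torch.nn.functional as F

def pad_or_center_crop(wav: torch.Tensor, target_len: int) -> torch.Tensor:
    """Zero-pad at the end or center-crop a (channels, time) tensor to target_len."""
    n = wav.size(1)
    if n < target_len:
        return F.pad(wav, (0, target_len - n))   # pad only on the right
    start = (n - target_len) // 2                # 0 when n == target_len
    return wav[:, start:start + target_len]

short = torch.arange(1., 4.).unsqueeze(0)   # [[1, 2, 3]]
long = torch.arange(1., 9.).unsqueeze(0)    # [[1, 2, ..., 8]]

padded = pad_or_center_crop(short, 5)
cropped = pad_or_center_crop(long, 4)
print(padded.tolist())    # [[1.0, 2.0, 3.0, 0.0, 0.0]]
print(cropped.tolist())   # [[3.0, 4.0, 5.0, 6.0]]
```

Center-cropping keeps the middle of a long clip, which assumes the relevant part of the recitation is not at the very start or end.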

### 🔹 **Audio Augmentations Explained:**
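- **Volume Adjustment**: `Vol(gain=6.0, gain_type='db')` applies a fixed +6 dB gain (output clamped to [-1, 1]) with 50% probability, simulating a louder recording level. It does not draw a random gain in ±6 dB.
- **Frequency Masking**: `FrequencyMasking` is designed for spectrograms shaped `(..., freq, time)`. A raw waveform shaped `(channels, time)` has no frequency axis, so the mask lands on the channel axis and can zero out an entire channel instead of a frequency band.
- **Time Masking**: `TimeMasking(time_mask_param=80)` zeroes one random span of up to 80 samples (about 5 ms at 16 kHz), simulating brief dropouts.

The masking transforms come from SpecAugment, which targets spectrograms. Used on raw waveforms they are only a loose adaptation, so it is worth inspecting a few augmented clips before relying on them.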
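
For comparison, a waveform-native version of the volume and time-masking augmentations could be sketched as follows. `random_gain` and `random_time_mask` are hypothetical helpers written in plain PyTorch for this illustration; they are not torchaudio functions and not what the notebook's pipeline uses.

```python
import torch

def random_gain(wav: torch.Tensor, max_db: float = 6.0) -> torch.Tensor:
    """Scale the waveform by a gain drawn uniformly from [-max_db, +max_db] dB."""
    gain_db = (torch.rand(1).item() * 2 - 1) * max_db
    return (wav * 10 ** (gain_db / 20)).clamp(-1.0, 1.0)

def random_time_mask(wav: torch.Tensor, max_width: int = 80) -> torch.Tensor:
    """Zero out one random span of at most max_width samples along the time axis."""
    n = wav.size(-1)
    width = int(torch.randint(0, max_width + 1, (1,)))
    start = int(torch.randint(0, max(1, n - width + 1), (1,)))
    out = wav.clone()                      # leave the input untouched
    out[..., start:start + width] = 0.0
    return out

torch.manual_seed(0)
wav = torch.full((1, 16_000), 0.1)         # 1 s of constant signal at 16 kHz
aug = random_time_mask(random_gain(wav))
print(aug.shape, int((aug == 0).sum()))    # same shape, at most 80 zeroed samples
```

Both helpers act directly on the time axis of a `(channels, time)` tensor, so no spectrogram is involved at any point.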

In [None]:
from sklearn.preprocessing import LabelEncoder

# 🏷️ Encode text labels to integers for model training
# LabelEncoder converts string labels to consecutive integers (0, 1, 2, ...)
le = LabelEncoder().fit(train_df["label_name"])
train_df["y"] = le.transform(train_df["label_name"])

print(f"🏷️ Label Encoding Mapping:")
for i, label in enumerate(le.classes_):
    count = (train_df["y"] == i).sum()
    print(f"  {i}: {label} ({count:,} samples)")

print(f"\n✅ Labels encoded successfully! Shape: {train_df['y'].shape}")

In [None]:
from pathlib import Path
import random, torch
import torch.nn.functional as F
import torchaudio
from torch.utils.data import Dataset
from torchvision.transforms import Compose, RandomApply
from torchaudio.transforms import Vol, FrequencyMasking, TimeMasking

# 🎵 Define audio augmentation pipeline
# Each transform is applied independently with 50% probability
audio_aug = Compose([
    # 🔊 Volume augmentation: fixed +6 dB gain, output clamped to [-1, 1]
    # Simulates a louder recording level (Vol does not randomize the gain)
    RandomApply([Vol(gain=6.0, gain_type='db')], p=0.5),

    # 🎼 Frequency masking: designed for spectrograms shaped (..., freq, time)
    # NOTE: on a raw (channels, time) waveform it masks along the channel axis,
    # so it can zero a whole channel rather than a frequency band
    RandomApply([FrequencyMasking(freq_mask_param=30)], p=0.5),

    # ⏰ Time masking: zero one random span of up to 80 samples (about 5 ms at 16 kHz)
    # Simulates very brief audio dropouts
    RandomApply([TimeMasking(time_mask_param=80)], p=0.5),
])

class TajweedDataset(Dataset):
    """
    🎙️ PyTorch Dataset for Tajweed Audio Classification

    This dataset handles loading, preprocessing, and augmentation of audio files
    for training a 1D CNN on Tajweed rule classification.

    Key Features:
    - Automatic resampling to target sample rate
    - Length normalization (pad/trim to fixed duration)
    - Optional augmentations for improved generalization
    - Flexible return format for training vs inference

    Args:
        df (pd.DataFrame): DataFrame with 'id' column and optionally 'y' (labels)
        folder (str/Path): Directory containing audio files named as '{id}.wav'
        transforms (callable): Audio augmentation pipeline (None for inference)
        return_id (bool): If True, return (audio, id) for inference mode
                         If False, return (audio, label) for training mode
    """

    def __init__(self, df, folder, transforms=None, return_id=False):
        self.df = df.reset_index(drop=True)
        self.folder = Path(folder)
        self.transforms = transforms
        self.return_id = return_id

        print(f"📁 Dataset initialized: {len(self.df)} samples from {self.folder}")

    def _pad_trim(self, wav: torch.Tensor) -> torch.Tensor:
        """
        🔧 Normalize audio length to exactly MAX_LEN samples

        - If too short: Zero-pad at the end
        - If too long: Center-crop to target length

        This ensures all inputs have identical shape for batch processing.
        """
        current_len = wav.size(1)

        if current_len < MAX_LEN:
            # Pad with zeros at the end
            padding = MAX_LEN - current_len
            return F.pad(wav, (0, padding))
        elif current_len > MAX_LEN:
            # Center crop to target length
            start_idx = (current_len - MAX_LEN) // 2
            return wav[:, start_idx:start_idx + MAX_LEN]
        else:
            return wav

    def __getitem__(self, idx):
        row = self.df.iloc[idx]

        # 📂 Load audio file
        audio_path = self.folder / f"{row['id']}.wav"

        try:
            # Load waveform and sample rate
            wav, sr = torchaudio.load(str(audio_path))

            # 🔄 Resample if necessary to ensure consistent sample rate
            if sr != SAMPLE_RATE:
                wav = torchaudio.functional.resample(wav, sr, SAMPLE_RATE)

            # 🎛️ Apply augmentations if provided (only during training)
            if self.transforms is not None:
                wav = self.transforms(wav)

            # 🎵 Convert to mono by averaging channels
            wav = wav.mean(dim=0, keepdim=True)

            # 📏 Normalize length to exactly MAX_LEN samples
            wav = self._pad_trim(wav)

            # 🔄 Return format depends on mode
            if self.return_id:
                # Inference mode: return audio and file ID
                return wav, row["id"]
            else:
                # Training mode: return audio and encoded label
                return wav, torch.tensor(row["y"], dtype=torch.long)

        except Exception as e:
            print(f"❌ Error loading {audio_path}: {e}")
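            # Return zeros if file loading fails.
            # NOTE: in training mode this yields a silent clip labeled as class 0,
            # which can bias training if many files fail to load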
            zero_wav = torch.zeros(1, MAX_LEN)
            if self.return_id:
                return zero_wav, row["id"]
            else:
                return zero_wav, torch.tensor(0, dtype=torch.long)

    def __len__(self):
        return len(self.df)

## 🔊 Audio Waveform Visualization

Visualizing raw audio waveforms helps us understand:
- **Amplitude Patterns**: How loud different segments are
- **Temporal Structure**: How audio evolves over time  
- **Variability**: Differences between samples and Tajweed rules
- **Quality Check**: Ensuring our preprocessing works correctly

Unlike spectrograms which show frequency content, raw waveforms display the actual signal our 1D CNN will process.

In [None]:
# 🖼️ Visualize sample waveforms to understand our data

# Take the first sample of each label, then keep the first 4 (deterministic, one clip per rule)
sample_data = train_df.groupby('label_name').head(1).head(4)
sample_dataset = TajweedDataset(sample_data, TRAIN_DIR, transforms=None)

plt.figure(figsize=(15, 10))

for i in range(len(sample_dataset)):
    wav, label_idx = sample_dataset[i]
    row = sample_dataset.df.iloc[i]

    file_id = row["id"]
    label_name = row["label_name"]
    sheikh_name = row["sheikh_name"]

    # Create time axis for x-axis (in seconds)
    time_axis = np.linspace(0, MAX_SEC, len(wav.squeeze()))

    plt.subplot(2, 2, i+1)
    plt.plot(time_axis, wav.squeeze().numpy(), color='steelblue', alpha=0.8)
    plt.title(f"{file_id}\nRule: {label_name}\nSheikh: {sheikh_name}",
              fontsize=10, pad=10)
    plt.xlabel("Time (seconds)")
    plt.ylabel("Amplitude")
    plt.grid(True, alpha=0.3)
    plt.xlim(0, MAX_SEC)

    # Add some statistics
    max_amp = wav.abs().max().item()
    mean_amp = wav.abs().mean().item()
    plt.text(0.02, 0.95, f"Max: {max_amp:.3f}\nMean: {mean_amp:.3f}",
             transform=plt.gca().transAxes, fontsize=8,
             bbox=dict(boxstyle="round,pad=0.3", facecolor="white", alpha=0.8))

plt.suptitle("🎵 Sample Tajweed Audio Waveforms", fontsize=16, y=0.98)
plt.tight_layout()
plt.show()

## 🏗️ 1D CNN Architecture

Our 1D CNN is specifically designed for audio classification with the following key components:

### 🔹 **Architecture Design Principles:**

1. **Progressive Downsampling**: Each conv layer reduces temporal resolution while increasing channel depth
2. **Batch Normalization**: Stabilizes training and accelerates convergence  
3. **ReLU Activation**: Introduces non-linearity for complex pattern learning
4. **Global Average Pooling**: Reduces overfitting compared to fully connected layers
5. **Moderate Depth**: 4 conv layers balance capacity with computational efficiency

### 🔹 **Layer-by-Layer Breakdown:**
- **Input**: (batch, 1, 64000) - Raw mono audio waveform
- **Conv1**: 1→32 channels, kernel=11, stride=2 → (batch, 32, 32000)
- **Conv2**: 32→64 channels, kernel=11, stride=2 → (batch, 64, 16000)  
- **Conv3**: 64→128 channels, kernel=11, stride=2 → (batch, 128, 8000)
- **Conv4**: 128→256 channels, kernel=11, stride=2 → (batch, 256, 4000)
- **GlobalAvgPool**: → (batch, 256, 1)
- **Classifier**: 256 → num_classes

### 🔹 **Why These Choices:**
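- **Kernel Size 11**: Each first-layer filter spans 11 samples (about 0.7 ms at 16 kHz); stacking layers widens the effective window
- **Stride 2**: Halves the temporal length at every layer (64,000 → 4,000 after four layers), which cuts computation in the deeper layers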
- **Channel Progression**: 1→32→64→128→256 follows common CNN scaling patterns
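
The shapes in the layer-by-layer breakdown follow from the standard `Conv1d` output-length formula, and the same loop gives the receptive field of the final feature map. This is a small pure-Python check using the kernel, stride, and padding values listed above; `conv1d_out_len` is a helper name introduced here for illustration.

```python
def conv1d_out_len(n: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a Conv1d layer with dilation 1."""
    return (n + 2 * padding - kernel) // stride + 1

n, receptive_field, jump = 64_000, 1, 1
lengths = []
for _ in range(4):                       # four identical conv layers
    n = conv1d_out_len(n, kernel=11, stride=2, padding=5)
    receptive_field += (11 - 1) * jump   # each layer widens the field by (k - 1) * jump
    jump *= 2                            # stride 2 doubles the spacing between outputs
    lengths.append(n)

print(lengths)                           # [32000, 16000, 8000, 4000]
print(receptive_field)                   # 151 samples
print(receptive_field / 16_000 * 1000)   # 9.4375 ms at 16 kHz
```

So each position in the last feature map sees only about 9 ms of audio; longer-range context enters through the global average pooling that follows.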

In [None]:
class CNN1D(nn.Module):
    """
    🏗️ 1D Convolutional Neural Network for Audio Classification

    This architecture processes raw audio waveforms using temporal convolutions
    to learn hierarchical features for Tajweed rule classification.

    Architecture Flow:
    Raw Audio → Conv Blocks → Global Pooling → Classification

    Key Features:
    - Direct waveform processing (no spectrograms needed)
    - Progressive feature extraction through conv layers
    - Batch normalization for stable training
    - Global average pooling for translation invariance
    - Compact classifier head
    """

    def __init__(self, n_classes):
        super().__init__()

        # 🔧 Feature extraction backbone
        self.features = nn.Sequential(
            # 🔵 Block 1: 1 → 32 channels (kernel=11, stride=2)
            # Captures basic temporal patterns in raw audio
            nn.Conv1d(in_channels=1, out_channels=32, kernel_size=11, stride=2, padding=5),
            nn.BatchNorm1d(32),  # Normalize for stable training
            nn.ReLU(inplace=True),  # Non-linear activation

            # 🟡 Block 2: 32 → 64 channels
            # Learns combinations of basic patterns
            nn.Conv1d(32, 64, kernel_size=11, stride=2, padding=5),
            nn.BatchNorm1d(64),
            nn.ReLU(inplace=True),

            # 🟠 Block 3: 64 → 128 channels
            # Captures mid-level temporal structures
            nn.Conv1d(64, 128, kernel_size=11, stride=2, padding=5),
            nn.BatchNorm1d(128),
            nn.ReLU(inplace=True),

            # 🔴 Block 4: 128 → 256 channels
            # High-level feature representation
            nn.Conv1d(128, 256, kernel_size=11, stride=2, padding=5),
            nn.BatchNorm1d(256),
            nn.ReLU(inplace=True),

            # 🌐 Global Average Pooling
            # Reduces each channel to single value, provides translation invariance
            nn.AdaptiveAvgPool1d(1)
        )

        # 🎯 Classification head
        self.classifier = nn.Linear(256, n_classes)

        # 📊 Print model info
        total_params = sum(p.numel() for p in self.parameters())
        print(f"🏗️ Model created with {total_params:,} parameters")

    def forward(self, x):
        """
        Forward pass through the 1D CNN

        Args:
            x: Input tensor of shape (batch_size, 1, sequence_length)

        Returns:
            Logits tensor of shape (batch_size, n_classes)
        """
        # Extract features through conv blocks
        features = self.features(x)  # (batch, 256, 1)

        # Remove length dimension and apply classifier
        features = features.squeeze(-1)  # (batch, 256)
        logits = self.classifier(features)  # (batch, n_classes)

        return logits

# 🧪 Test model with dummy input to verify shapes
n_classes = len(train_df.label_name.unique())
model = CNN1D(n_classes)

# Create dummy input batch
dummy_input = torch.randn(2, 1, MAX_LEN)  # 2 samples for testing
with torch.no_grad():
    dummy_output = model(dummy_input)

print(f"✅ Model test successful!")
print(f"📥 Input shape: {dummy_input.shape}")
print(f"📤 Output shape: {dummy_output.shape}")
print(f"🎯 Number of classes: {n_classes}")

## 🔄 Training with K-Fold Cross-Validation

We use **5-fold stratified cross-validation** to ensure robust model evaluation:

### 🔹 **Why K-Fold CV for Audio?**
1. **Robust Evaluation**: Reduces dependence on specific train/test splits
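2. **Reciter Mix**: Splits are stratified by Tajweed rule only, so the same reciter can appear in both training and validation folds; the scores do not measure generalization to unseen reciters
3. **Class Balance**: Stratified splitting maintains label distribution in each fold
4. **Score Spread**: Multiple folds yield a mean and standard deviation instead of a single number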

### 🔹 **Training Process per Fold:**
1. **Data Split**: 80% train, 20% validation (stratified by Tajweed rule)
2. **Augmentation**: Apply audio augmentations only to training data
3. **Optimization**: Adam optimizer with cosine annealing learning rate schedule
4. **Evaluation**: Macro-F1 score (equal weight to all classes)
5. **Checkpointing**: Save best model based on validation F1
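
To see what the cosine annealing schedule does with only 3 epochs, the learning rate per epoch can be computed from the closed-form cosine formula. This is a pure-Python sketch assuming `eta_min = 0` (the PyTorch default) and the `LR` and `EPOCHS` values from the configuration cell; `cosine_lr` is a helper name introduced here.

```python
import math

LR, EPOCHS, ETA_MIN = 3e-4, 3, 0.0   # ETA_MIN mirrors the default eta_min

def cosine_lr(steps_taken: int) -> float:
    """Learning rate after `steps_taken` scheduler steps (one step per epoch)."""
    return ETA_MIN + 0.5 * (LR - ETA_MIN) * (1 + math.cos(math.pi * steps_taken / EPOCHS))

# Learning rate used during epochs 1, 2 and 3
schedule = [cosine_lr(e) for e in range(EPOCHS)]
print([f"{lr:.2e}" for lr in schedule])   # ['3.00e-04', '2.25e-04', '7.50e-05']
```

With so few epochs the rate drops to a quarter of its starting value by the last epoch, so increasing `EPOCHS` also stretches the schedule.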

### 🔹 **Key Metrics:**
- **Macro-F1**: Unweighted average of per-class F1, so rare classes count as much as common ones
- **Loss Tracking**: Monitor training convergence
- **Best Model**: Saved when validation F1 improves
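
A toy example shows why macro-F1 is more informative than accuracy under class imbalance. The labels below are made up, and the sketch uses scikit-learn's `f1_score` for brevity rather than the torchmetrics `F1Score` used in the training loop.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical toy labels: 8 clips of rule 0, 2 clips of rule 1
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10            # a model that always predicts the majority rule

accuracy = accuracy_score(y_true, y_pred)
per_class_f1 = f1_score(y_true, y_pred, average=None, zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy = {accuracy:.3f}")                         # 0.800
print(f"per-class F1 = {per_class_f1.round(3).tolist()}")   # [0.889, 0.0]
print(f"macro-F1 = {macro_f1:.3f}")                         # 0.444
```

The majority-only model looks fine at 80% accuracy, while macro-F1 exposes that one rule is never predicted.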

In [None]:
# Initialize cross-validation splitter
skf = StratifiedKFold(n_splits=K_FOLDS, shuffle=True, random_state=SEED)
val_scores = []  # Store validation F1 scores for each fold

# 📊 Track training progress
all_fold_results = []

for fold, (tr_idx, va_idx) in enumerate(skf.split(train_df, train_df.label_name), 1):



    # Training loader with augmentations
    tr_dataset = TajweedDataset(train_df.iloc[tr_idx], TRAIN_DIR,
                               transforms=audio_aug, return_id=False)
    tr_dl = DataLoader(tr_dataset, batch_size=BATCH, shuffle=True,
                      num_workers=2, pin_memory=True)

    # Validation loader without augmentations
    va_dataset = TajweedDataset(train_df.iloc[va_idx], TRAIN_DIR,
                               transforms=None, return_id=False)
    va_dl = DataLoader(va_dataset, batch_size=BATCH, shuffle=False,
                      num_workers=2, pin_memory=True)

    # 🏗️ Initialize model and training components
    model = CNN1D(n_classes=len(train_df.label_name.unique())).to(DEVICE)
    optimizer = torch.optim.Adam(model.parameters(), lr=LR, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)
    criterion = nn.CrossEntropyLoss()

    # 📊 F1 score metric for evaluation (macro averaging for balanced evaluation)
    f1_metric = F1Score(task="multiclass",
                       num_classes=len(train_df.label_name.unique()),
                       average="macro").to(DEVICE)

    # 🏆 Track best performance
    best_val_f1 = 0.0
    fold_train_losses = []
    fold_val_f1s = []

    # 🏋️ Training loop for current fold
    for epoch in range(1, EPOCHS + 1):

        # 🔥 TRAINING PHASE
        model.train()
        epoch_train_losses = []

        train_pbar = tqdm(tr_dl, desc=f"Fold {fold} Epoch {epoch:2d}/{EPOCHS} [Train]")
        for batch_idx, (audio_batch, label_batch) in enumerate(train_pbar):
            # Move to device
            audio_batch = audio_batch.to(DEVICE, non_blocking=True)
            label_batch = label_batch.to(DEVICE, non_blocking=True)

            # Forward pass
            optimizer.zero_grad()
            logits = model(audio_batch)
            loss = criterion(logits, label_batch)

            # Backward pass
            loss.backward()
            optimizer.step()

            # Track loss
            batch_loss = loss.item()
            epoch_train_losses.append(batch_loss)
            fold_train_losses.append(batch_loss)

            # Update progress bar
            train_pbar.set_postfix({
                'Loss': f'{batch_loss:.4f}',
                'LR': f'{optimizer.param_groups[0]["lr"]:.2e}'
            })

        # 📊 VALIDATION PHASE
        def evaluate_model(dataloader, metric):
            """Helper function to evaluate model on given dataloader"""
            model.eval()
            metric.reset()
            total_loss = 0.0
            num_batches = 0

            with torch.no_grad():
                for audio_batch, label_batch in dataloader:
                    audio_batch = audio_batch.to(DEVICE, non_blocking=True)
                    label_batch = label_batch.to(DEVICE, non_blocking=True)

                    # Forward pass
                    logits = model(audio_batch)
                    loss = criterion(logits, label_batch)

                    # Update metrics
                    predictions = logits.softmax(dim=1)
                    metric.update(predictions, label_batch)
                    total_loss += loss.item()
                    num_batches += 1

            return metric.compute().item(), total_loss / num_batches

        # Evaluate on validation set
        val_f1, val_loss = evaluate_model(va_dl, f1_metric)
        fold_val_f1s.append(val_f1)

        # Update learning rate
        scheduler.step()

        # 🏆 Save best model
        if val_f1 > best_val_f1:
            best_val_f1 = val_f1
            checkpoint_path = f"fold{fold}_best.pt"
            torch.save({
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'epoch': epoch,
                'val_f1': val_f1,
                'fold': fold
            }, checkpoint_path)
            print(f"🏆 New best model saved! Val F1: {val_f1:.4f}")

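        # 📊 Epoch summary
        avg_train_loss = np.mean(epoch_train_losses)
        current_lr = optimizer.param_groups[0]['lr']  # LR for the next epoch (scheduler already stepped)
        print(f"Fold {fold} Epoch {epoch}/{EPOCHS} | train loss {avg_train_loss:.4f} | "
              f"val loss {val_loss:.4f} | val F1 {val_f1:.4f} | next LR {current_lr:.2e}")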



    # 📈 Fold summary
    val_scores.append(best_val_f1)
    fold_results = {
        'fold': fold,
        'best_val_f1': best_val_f1,
        'train_losses': fold_train_losses,
        'val_f1s': fold_val_f1s
    }
    all_fold_results.append(fold_results)


In [None]:
print(f"📊 Individual Fold Results:")
for i, score in enumerate(val_scores, 1):
    print(f"  Fold {i}: {score:.4f}")

mean_cv_score = np.mean(val_scores)
std_cv_score = np.std(val_scores)

print(f"\n🎯 Final Cross-Validation Results:")
print(f"📈 Mean F1 Score: {mean_cv_score:.4f} ± {std_cv_score:.4f}")
print(f"📊 Best Single Fold: {max(val_scores):.4f}")
print(f"📉 Worst Single Fold: {min(val_scores):.4f}")

In [None]:
# 🎨 Visualize training progress
plt.figure(figsize=(15, 5))

# Plot 1: F1 scores across folds
plt.subplot(1, 3, 1)
plt.bar(range(1, K_FOLDS + 1), val_scores, color='skyblue', alpha=0.8)
plt.axhline(mean_cv_score, color='red', linestyle='--', alpha=0.8,
           label=f'Mean: {mean_cv_score:.4f}')
plt.title('Validation F1 Scores by Fold')
plt.xlabel('Fold')
plt.ylabel('F1 Score')
plt.legend()
plt.grid(axis='y', alpha=0.3)

# Plot 2: Training loss progression (last fold)
plt.subplot(1, 3, 2)
if all_fold_results:
    last_fold_losses = all_fold_results[-1]['train_losses']
    # Smooth the losses for better visualization
    window_size = max(1, len(last_fold_losses) // 50)
    if len(last_fold_losses) > window_size:
        smoothed_losses = np.convolve(last_fold_losses,
                                    np.ones(window_size)/window_size, mode='valid')
        plt.plot(smoothed_losses, color='orange', alpha=0.8)
    else:
        plt.plot(last_fold_losses, color='orange', alpha=0.8)
plt.title(f'Training Loss (Fold {K_FOLDS})')
plt.xlabel('Training Step')
plt.ylabel('Loss')
plt.grid(True, alpha=0.3)

# Plot 3: Validation F1 progression (last fold)
plt.subplot(1, 3, 3)
if all_fold_results:
    last_fold_val_f1s = all_fold_results[-1]['val_f1s']
    plt.plot(range(1, len(last_fold_val_f1s) + 1), last_fold_val_f1s,
             color='green', marker='o', alpha=0.8)
plt.title(f'Validation F1 (Fold {K_FOLDS})')
plt.xlabel('Epoch')
plt.ylabel('F1 Score')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


## 🎊 Training Results & Model Performance

### 🔹 **Understanding the Results:**

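**Cross-Validation F1 Scores**: Compare the per-fold scores printed above. If they are close to each other, that suggests:
- **Robustness**: Performance does not hinge on one particular data split
- **Stability**: Low variance between folds suggests reliable training
- **Balance**: Macro-F1 weights every Tajweed rule equally, so a high score means no rule is being ignored

Keep in mind that each fold trains for only 3 epochs, so these scores are a baseline rather than a tuned result.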

### 🔹 **What Makes This Approach Effective:**

1. **Direct Waveform Processing**: No feature engineering required - the CNN learns its own representations from the raw signal
2. **Temporal Pattern Recognition**: 1D convolutions capture time-dependent acoustic patterns
3. **Data Augmentation**: Audio transformations are intended to improve generalization to new recording conditions
4. **Cross-Validation**: Evaluates on five stratified splits instead of a single hold-out set

### 🔹 **Next Steps for Improvement:**

🚀 **Model Enhancements:**
- **Deeper Networks**: Add more convolutional layers
- **Residual Connections**: Skip connections for better gradient flow
- **Attention Mechanisms**: Focus on important temporal regions
- **Multi-Scale Processing**: Parallel paths with different kernel sizes

🎵 **Audio Processing:**
- **Longer Clips**: Process 8-10 second segments for more context
- **Mel-Spectrograms**: Combine with frequency-domain features
- **Advanced Augmentations**: Pitch shifting, speed changes, noise addition

📊 **Training Strategies:**
- **Transfer Learning**: Pre-train on larger speech datasets
- **Ensemble Methods**: Combine predictions from multiple folds
- **Class Balancing**: Address any remaining class imbalance issues
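
As one illustration of the ensemble idea, predictions from the per-fold models can be combined by averaging their softmax probabilities. The sketch below uses tiny linear models as hypothetical stand-ins; in the notebook each model would presumably be a `CNN1D` whose weights are loaded from the `model_state_dict` entry of the corresponding `fold{k}_best.pt` checkpoint. `ensemble_predict` is a helper name introduced here.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def ensemble_predict(models, batch):
    """Average softmax probabilities over models; return (probs, predicted class)."""
    probs = torch.stack([m.eval()(batch).softmax(dim=1) for m in models]).mean(dim=0)
    return probs, probs.argmax(dim=1)

# Hypothetical stand-ins for the five fold models: tiny linear classifiers
torch.manual_seed(0)
fold_models = [nn.Sequential(nn.Flatten(), nn.Linear(16, 3)) for _ in range(5)]
batch = torch.randn(4, 1, 16)            # (batch, channels, time)

probs, preds = ensemble_predict(fold_models, batch)
print(probs.shape, preds.shape)          # torch.Size([4, 3]) torch.Size([4])
```

Averaging probabilities rather than hard votes lets a confident model outweigh several uncertain ones, which is a common default but not the only way to combine folds.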



# Contributed by: Mohamed Eltayeb (He did everything) - Edited by: Ali Habibullah (I added comments and markdowns 😶‍🌫️)
