### Importing Libraries and Dependencies

This section imports all the necessary libraries and modules required for the project:

- **Core libraries**: `numpy`, `pandas`, `os`, `sys` for data manipulation and system interactions.
- **Path and progress utilities**: `Path` from `pathlib` and `tqdm` for handling paths and progress bars.
- **Audio processing**: `torchaudio` and `librosa` for audio loading and transformations.
- **IPython display**: `IPython.display` (imported as `ipd`) for audio playback in the notebook.
- **Data handling**: `datasets` for loading datasets (`load_dataset`) and controlling caching (`set_caching_enabled`).
- **Machine learning tools**: `scikit-learn` for splitting data into training and testing sets.
- **Deep learning frameworks**:
  - `torch` and `torch.nn` for creating and managing neural networks.
  - Transformers library for leveraging the Wav2Vec2 processor and feature extractor.
- **Modeling utilities**:
  - Data class structures for model outputs.
  - Training-related components like `Trainer` and `TrainingArguments`.
  - Loss functions like `CrossEntropyLoss` for classification tasks.
- **Version checks**:
  - Imports NVIDIA Apex's `amp` when Apex is installed, and PyTorch's native `autocast` when `torch` is version 1.6 or newer, for mixed-precision training.

These imports lay the groundwork for the project, enabling the integration of Wav2Vec2 for language identification from audio files.


In [None]:
# Standard library
import os
import sys
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple, Union

# Data handling and audio
import numpy as np
import pandas as pd
import librosa
import torchaudio
import IPython.display as ipd
from datasets import load_dataset, set_caching_enabled
from packaging import version
from sklearn.model_selection import train_test_split
from tqdm import tqdm

# Deep learning
import torch
import torch.nn as nn
from torch.nn import CrossEntropyLoss
from transformers import (
    AutoConfig,
    EvalPrediction,
    Trainer,
    TrainingArguments,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Processor,
    is_apex_available,
)
from transformers.file_utils import ModelOutput
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    Wav2Vec2Model,
    Wav2Vec2PreTrainedModel,
)

# Optional mixed-precision backends
if is_apex_available():
    from apex import amp

# Default to False so the flag is always defined, even on torch < 1.6
_is_native_amp_available = False
if version.parse(torch.__version__) >= version.parse("1.6"):
    _is_native_amp_available = True
    from torch.cuda.amp import autocast

### Data Collection and Organization

In this section, we organize the dataset for language identification:

1. **Dataset Directory**:
   - The variable `data_dir` points to the directory containing audio files. Here, the dataset is expected to be organized with subdirectories named after the language labels.

2. **Data Structure**:
   - A list named `data` is initialized to store metadata for each audio file, including:
     - `path`: Full file path to the `.wav` audio file.
     - `label`: The language label derived from the subdirectory name.

3. **Iterating Through Files**:
   - `os.walk()` recursively traverses the `data_dir` directory.
   - For each `.wav` file encountered:
     - The `path` is constructed using `os.path.join`.
     - The `label` is extracted from the parent directory name.
     - These details are stored as a dictionary in the `data` list.

4. **Progress Tracking**:
   - `tqdm` wraps `os.walk()`, so the progress bar advances once per directory visited rather than once per file.

This process prepares a structured dataset with audio file paths and their respective language labels, which will be used for further processing and training.


In [2]:
data_dir = "/kaggle/input/languages"

data = []

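# os.walk yields one (root, dirs, files) tuple per directory, so the
# progress bar counts directories visited, not individual files.
for root, dirs, files in tqdm(os.walk(data_dir)):
    for file in files:
        if file.endswith(".wav"):
            path = os.path.join(root, file)
            # The parent directory name is the language label.
            label = os.path.basename(root)
            data.append({
                "path": path,
                "label": label
            })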

9it [00:42,  4.69s/it]
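
As an alternative sketch of the same traversal, `pathlib` (already imported above) can locate the `.wav` files and derive each label from the parent directory name. The helper `collect_wav_files`, the temporary directory, and the file names below are hypothetical stand-ins for the Kaggle dataset layout, not part of the original notebook.

```python
import tempfile
from pathlib import Path


def collect_wav_files(data_dir):
    """Return one {"path", "label"} record per .wav file under data_dir."""
    records = []
    # rglob walks the tree recursively; sorting makes the order deterministic.
    for wav_path in sorted(Path(data_dir).rglob("*.wav")):
        records.append({"path": str(wav_path), "label": wav_path.parent.name})
    return records


# Hypothetical miniature dataset: one sub-directory per language.
with tempfile.TemporaryDirectory() as tmp:
    for language, stem in [("german", "german_0001"), ("hindi", "hindi_0001")]:
        folder = Path(tmp) / language
        folder.mkdir()
        (folder / f"{stem}.wav").touch()
        (folder / "notes.txt").touch()  # non-audio files are ignored

    records = collect_wav_files(tmp)
    labels = [record["label"] for record in records]
    print(labels)
```

The resulting list of dictionaries has the same shape as `data`, so it could be passed to `pd.DataFrame` in the same way.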


### Creating and Previewing the Dataset

1. **Creating a DataFrame**:
   - The collected metadata is converted into a Pandas DataFrame using `pd.DataFrame(data)`. This provides a tabular structure, making it easier to manage and analyze the dataset.

2. **Previewing the Data**:
   - `df.head()` displays the first five rows of the DataFrame, allowing for a quick inspection of its structure and content. The expected columns are:
     - `path`: File path of the audio sample.
     - `label`: Language label associated with the audio sample.

This step ensures that the dataset is correctly structured and ready for further processing.


In [3]:
df = pd.DataFrame(data)
df.head()

| | path | label |
|---|---|---|
| 0 | /kaggle/input/languages/japanese/japanese_3123... | japanese |
| 1 | /kaggle/input/languages/japanese/japanese_3220... | japanese |
| 2 | /kaggle/input/languages/japanese/japanese_2391... | japanese |
| 3 | /kaggle/input/languages/japanese/japanese_3613... | japanese |
| 4 | /kaggle/input/languages/japanese/japanese_2229... | japanese |


### Dataset Label Inspection and Distribution Analysis

1. **Listing Unique Labels**:
   - `df["label"].unique()` identifies and displays the unique language labels present in the dataset. This helps verify the categories included for language identification.

2. **Analyzing Label Distribution**:
   - `df.groupby("label").count()[["path"]]` calculates the count of audio files for each label (language).
   - This step provides insights into the dataset balance, helping to identify any underrepresented or overrepresented languages.

Understanding the label distribution is crucial for designing a robust and fair training process.


In [4]:
print("Labels: ", df["label"].unique())
print()
df.groupby("label").count()[["path"]]

Labels:  ['japanese' 'hindi' 'gujarati' 'russian' 'german' 'sanskrit' 'italian'
 'spanish']



| label | path |
|---|---|
| german | 5000 |
| gujarati | 5000 |
| hindi | 5000 |
| italian | 5000 |
| japanese | 5000 |
| russian | 5000 |
| sanskrit | 5000 |
| spanish | 5000 |


### Splitting the Dataset into Training and Testing Sets

1. **Defining the Save Path**:
   - `save_path` specifies the directory where the train and test DataFrames will be saved as `.csv` files.

2. **Splitting the Dataset**:
   - `train_test_split()` splits the DataFrame into training and testing sets:
     - `test_size=0.2` allocates 20% of the data for testing.
     - `stratify=df["label"]` ensures that the label distribution is maintained in both splits.
   - `random_state=4` ensures reproducibility of the split.

3. **Resetting Indexes**:
   - The `reset_index(drop=True)` method resets the indices of the training and testing DataFrames, making them continuous and independent of the original DataFrame.

4. **Saving the Splits**:
   - Both DataFrames are saved as tab-separated `.csv` files in the specified directory:
     - `train.csv`: Contains training data.
     - `test.csv`: Contains testing data.

5. **Shape Verification**:
   - The dimensions of `train_df` and `test_df` are printed to confirm the split and verify the number of samples in each set.

This step prepares the training and testing datasets for model training and evaluation, ensuring an organized workflow.


In [None]:
save_path = "/kaggle/working/"

train_df, test_df = train_test_split(df, test_size=0.2, random_state=4, stratify=df["label"])

train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

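# os.path.join avoids the doubled slash that f"{save_path}/train.csv" would produce
train_df.to_csv(os.path.join(save_path, "train.csv"), sep="\t", encoding="utf-8", index=False)
test_df.to_csv(os.path.join(save_path, "test.csv"), sep="\t", encoding="utf-8", index=False)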


print(train_df.shape)
print(test_df.shape)

(32000, 2)
(8000, 2)
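
To illustrate what `stratify` guarantees, here is a minimal pure-Python sketch of a stratified split on toy labels. It is an illustration only and not scikit-learn's implementation; the helper `stratified_split` and the toy label list are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=4):
    """Split row indices so each label keeps the same share in both parts.

    Illustration only: this is not scikit-learn's implementation.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Toy stand-in for the real data: 3 balanced labels with 50 rows each.
labels = ["german"] * 50 + ["hindi"] * 50 + ["spanish"] * 50
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts)
print(test_counts)
```

The same arithmetic applies to the real dataset: 8 labels × 5000 files × 0.2 gives 8000 test rows and 32000 training rows, which matches the shapes printed above.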


### Loading the Dataset into Hugging Face's `datasets` Library

1. **Specifying Data Files**:
   - A dictionary `data_files` is created to specify the paths of the training and validation `.csv` files:
     - `"train"`: Path to the training dataset.
     - `"validation"`: Path to the testing dataset.

2. **Loading the Dataset**:
   - `load_dataset()` reads the `.csv` files into a `datasets.DatasetDict` object:
     - The `delimiter="\t"` parameter ensures that the tab-separated format of the files is correctly parsed.
   - The loaded datasets are split into:
     - `train_dataset`: Training dataset.
     - `eval_dataset`: Validation (testing) dataset.

3. **Dataset Summary**:
   - `print(train_dataset)` and `print(eval_dataset)` provide a summary of the training and validation datasets, including the number of samples and the column structure.

This step integrates the preprocessed data with the Hugging Face `datasets` library, enabling streamlined data manipulation and compatibility with the training pipeline.


In [None]:
data_files = {
    "train": "/kaggle/working/train.csv", 
    "validation": "/kaggle/working/test.csv",
}

dataset = load_dataset("csv", data_files=data_files, delimiter="\t")
train_dataset = dataset["train"]
eval_dataset = dataset["validation"]

print(train_dataset)
print(eval_dataset)

Dataset({
    features: ['path', 'label'],
    num_rows: 32000
})
Dataset({
    features: ['path', 'label'],
    num_rows: 8000
})


### Preparing Dataset Columns and Labels for Model Training

1. **Defining Input and Output Columns**:
   - `input_column`: Specifies the column containing the paths to audio files (features) for model input.
   - `output_column`: Specifies the column containing the language labels (targets) for model training.

2. **Extracting Unique Labels**:
   - `train_dataset.unique(output_column)` retrieves all unique language labels from the training dataset.
   - `label_list.sort()` ensures the labels are sorted in ascending order for consistency.

3. **Counting Labels**:
   - `num_labels` calculates the total number of unique labels, which is used to configure the model's output layer for classification.

This step organizes the dataset's structure and prepares label-related metadata required for model training and evaluation.


In [8]:
input_column = "path"
output_column = "label"
label_list = train_dataset.unique(output_column)
label_list.sort()  
num_labels = len(label_list)

### Configuring the Wav2Vec2 Model and Pooling Strategy

1. **Model Name or Path**:
   - `model_name_or_path` specifies the pre-trained Wav2Vec2 model to be used:
     - `"facebook/wav2vec2-base-100k-voxpopuli"` is a Wav2Vec2 base model pre-trained on the 100k unlabeled subset of the VoxPopuli corpus. It has no task-specific head, so it must be fine-tuned before it can classify languages.

2. **Pooling Mode**:
   - `pooling_mode` determines how the output embeddings from the Wav2Vec2 model are aggregated:
     - `"mean"`: The embeddings are averaged across the sequence length, resulting in a fixed-size representation.

These configurations define the backbone model and feature aggregation method for the language identification task.


In [9]:
model_name_or_path = "facebook/wav2vec2-base-100k-voxpopuli"
pooling_mode = "mean" 

### Initializing the Model Configuration

1. **Loading Pre-trained Model Configuration**:
   - `AutoConfig.from_pretrained()` initializes the configuration of the pre-trained Wav2Vec2 model using:
     - `model_name_or_path`: The specified Wav2Vec2 model path or identifier.

2. **Customizing Configuration for Classification**:
   - `num_labels`: Sets the number of unique labels in the dataset.
   - `label2id`: Maps each label to a unique integer ID.
   - `id2label`: Maps each integer ID back to its corresponding label.
   - `finetuning_task`: Specifies the fine-tuning task as `"wav2vec2_clf"`, indicating this model will be fine-tuned for classification.

3. **Adding Pooling Mode**:
   - `setattr(config, 'pooling_mode', pooling_mode)`: Dynamically adds the `pooling_mode` attribute to the configuration, defining how the model will aggregate its output embeddings.

This step customizes the pre-trained Wav2Vec2 model for the language identification task, ensuring that the configuration aligns with the dataset and task requirements.
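
As a small sketch of the two mappings, the snippet below builds `label2id` and `id2label` from the eight labels printed in the label-inspection step and checks that they invert each other. The hard-coded label list is copied from that output for illustration; in the notebook it comes from `train_dataset.unique(output_column)`.

```python
# Labels as printed in the label-inspection step, sorted alphabetically.
label_list = sorted([
    "japanese", "hindi", "gujarati", "russian",
    "german", "sanskrit", "italian", "spanish",
])

label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for i, label in enumerate(label_list)}

print(label2id)

# Round trip: every ID maps back to the label it came from.
round_trip_ok = all(id2label[label2id[label]] == label for label in label_list)
print(round_trip_ok)
```

Because the list is sorted before enumeration, the IDs stay stable across runs as long as the set of labels does not change.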


In [10]:
config = AutoConfig.from_pretrained(
    model_name_or_path,
    num_labels=num_labels,
    label2id={label: i for i, label in enumerate(label_list)},
    id2label={i: label for i, label in enumerate(label_list)},
    finetuning_task="wav2vec2_clf",
)
setattr(config, 'pooling_mode', pooling_mode)

### Initializing the Feature Extractor and Target Sampling Rate

1. **Loading the Wav2Vec2 Feature Extractor**:
   - `Wav2Vec2FeatureExtractor.from_pretrained()` loads the feature extractor associated with the pre-trained Wav2Vec2 model. This extractor is responsible for preprocessing the audio input, converting it into the appropriate format for the model.

2. **Target Sampling Rate**:
   - `feature_extractor.sampling_rate` retrieves the target sampling rate of the pre-trained model.
   - The target sampling rate represents the frequency at which audio samples are expected to be processed by the model.

3. **Printing the Sampling Rate**:
   - The target sampling rate is printed to provide insight into the audio preprocessing requirements.

This step ensures that the audio data is processed at the correct sampling rate, aligning with the expectations of the pre-trained model.


In [11]:
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name_or_path,)
target_sampling_rate = feature_extractor.sampling_rate
print(f"The target sampling rate: {target_sampling_rate}")

The target sampling rate: 16000
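
To make the 16 kHz figure concrete, the sketch below estimates how many feature frames Wav2Vec2's convolutional encoder produces for a clip of a given length. The kernel sizes and strides are the default `Wav2Vec2Config` values (`conv_kernel`, `conv_stride`); that this checkpoint uses those defaults is an assumption, and `num_frames` is a hypothetical helper.

```python
# Default Wav2Vec2Config values; assumed (not verified) for this checkpoint.
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples):
    """Output length of a stack of unpadded 1-D convolutions."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        length = (length - kernel) // stride + 1
    return length


sampling_rate = 16000
one_second = num_frames(sampling_rate)
five_seconds = num_frames(5 * sampling_rate)
print(one_second, five_seconds)
```

Under these assumptions the strides multiply to 320 samples, so the encoder emits roughly one frame every 20 ms. This frame axis is the sequence dimension that the pooling step later collapses.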


### Defining Data Preprocessing Functions

1. **`speech_file_to_array_fn()`**:
   - This function converts an audio file to a NumPy array:
     - `torchaudio.load(path)` loads the audio file from the specified `path` and returns the audio as a tensor (`speech_array`) along with its sampling rate.
     - `torchaudio.transforms.Resample(sampling_rate, target_sampling_rate)` creates a resampler to convert the audio's sampling rate to the target sampling rate.
     - The audio tensor is then resampled and converted into a NumPy array, which is returned as `speech`.

2. **`label_to_id()`**:
   - This function maps a language label to its corresponding ID from the `label_list`:
     - If the label is found in the list, its index is returned; otherwise, it returns `-1`. If `label_list` is empty, the label is returned unchanged.

3. **`preprocess_function()`**:
   - This function applies preprocessing to the dataset:
     - **Speech Data**: A list of processed audio arrays is created by applying `speech_file_to_array_fn()` to each audio file path in `examples[input_column]`.
     - **Labels**: A list of label IDs is created by applying `label_to_id()` to each language label in `examples[output_column]`.
     - The `feature_extractor` is applied to the speech data to extract features, with the sampling rate set to the target value.
     - The labels are added to the result as a new key `"labels"`, forming the final processed output.

This set of functions prepares the dataset by converting audio files to a suitable format and encoding labels for training.
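
The label-encoding step can be exercised on its own, without any audio. The sketch below copies the `label_to_id` logic described above and flags labels that would be encoded as `-1`; the label lists are hypothetical examples.

```python
def label_to_id(label, label_list):
    """Return the index of label in label_list, or -1 if it is missing.

    With an empty label_list the label is returned unchanged.
    """
    if len(label_list) > 0:
        return label_list.index(label) if label in label_list else -1
    return label


label_list = ["german", "hindi", "spanish"]  # hypothetical subset of labels
batch_labels = ["hindi", "german", "french"]  # "french" is not in label_list

ids = [label_to_id(label, label_list) for label in batch_labels]
unknown = [label for label, i in zip(batch_labels, ids) if i == -1]
print(ids, unknown)
```

A target of `-1` is not a valid class index for a classification loss, so checking for unknown labels before training is a cheap guard.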


In [12]:
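def speech_file_to_array_fn(path):
    # Load the waveform as a (channels, samples) tensor plus its native sampling rate.
    speech_array, sampling_rate = torchaudio.load(path)
    # Resample to the rate the feature extractor expects.
    resampler = torchaudio.transforms.Resample(sampling_rate, target_sampling_rate)
    # squeeze() drops the channel axis only when the audio is mono.
    speech = resampler(speech_array).squeeze().numpy()
    return speech


def label_to_id(label, label_list):
    # Map a label to its index in label_list; unknown labels map to -1.
    if len(label_list) > 0:
        return label_list.index(label) if label in label_list else -1

    # With an empty label_list, return the label unchanged.
    return label


def preprocess_function(examples):
    # `examples` is a batch: a dict mapping column names to lists of values.
    speech_list = [speech_file_to_array_fn(path) for path in examples[input_column]]
    target_list = [label_to_id(label, label_list) for label in examples[output_column]]

    # Convert the raw waveforms into model inputs and attach the integer targets.
    result = feature_extractor(speech_list, sampling_rate=target_sampling_rate)
    result["labels"] = list(target_list)

    return result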

### Limiting the Dataset Size for Training and Evaluation

1. **Defining Maximum Samples**:
   - `max_samples = 3200` sets the upper limit for the number of samples to be used from both the training and evaluation datasets. This helps manage memory usage and training time.

2. **Selecting a Subset of the Dataset**:
   - `train_dataset.select(range(max_samples))` selects the first `max_samples` samples from the training dataset.
   - `eval_dataset.select(range(max_samples))` similarly selects the first `max_samples` samples from the evaluation dataset.

This step ensures that only a manageable portion of the dataset is used, allowing for more efficient experimentation and model training.


In [13]:
max_samples = 3200
train_dataset = train_dataset.select(range(max_samples))
eval_dataset = eval_dataset.select(range(max_samples))

### Applying Preprocessing to the Datasets

1. **Preprocessing the Training Dataset**:
   - `train_dataset.map(preprocess_function, batch_size=100, batched=True)` applies the `preprocess_function` to the training dataset.
     - `batch_size=100` processes the data in batches of 100 examples, improving efficiency.
     - `batched=True` ensures that the `preprocess_function` operates on batches of data, rather than individual examples.

2. **Preprocessing the Evaluation Dataset**:
   - Similarly, `eval_dataset.map(preprocess_function, batch_size=100, batched=True)` applies the same preprocessing to the evaluation dataset.

This step preprocesses both datasets, extracting features from the audio files and encoding the labels, making the data ready for model training and evaluation.


In [None]:
train_dataset = train_dataset.map(
    preprocess_function,
    batch_size=100,
    batched=True
)

eval_dataset = eval_dataset.map(
    preprocess_function,
    batch_size=100,
    batched=True
)

### Defining the Custom Output Class for the Speech Classifier

1. **`SpeechClassifierOutput` Class**:
   - This class extends `ModelOutput` from the Hugging Face Transformers library to define the structure of the output returned by the speech classification model.
   
2. **Attributes**:
   - `loss`: An optional attribute to store the loss value (used during training).
   - `logits`: The model's raw output logits (predictions before applying softmax) for each input, which is crucial for classification tasks.
   - `hidden_states`: An optional tuple of hidden states from intermediate layers of the model. These can be useful for understanding the model's internal representation.
   - `attentions`: An optional tuple of attention weights from the model's attention layers, useful for analyzing how the model attends to different parts of the input.

This custom output class allows for structured and organized handling of model predictions and internal states, facilitating model evaluation and analysis.


In [None]:
@dataclass
class SpeechClassifierOutput(ModelOutput):
    loss: Optional[torch.FloatTensor] = None
    logits: torch.FloatTensor = None
    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    attentions: Optional[Tuple[torch.FloatTensor]] = None

### Defining the Wav2Vec2 Model for Speech Classification

1. **`Wav2Vec2ClassificationHead` Class**:
   - This class defines the classification head that will be added on top of the pre-trained Wav2Vec2 model:
     - `dense`: A fully connected layer that maps the pooled features to a vector of the same size (`config.hidden_size`); `forward()` applies a `tanh` activation to its output.
     - `dropout`: A dropout layer with rate `config.final_dropout`, applied before `dense` and again before `out_proj` to reduce overfitting.
     - `out_proj`: A linear layer that projects the hidden vector to the classification logits, one per label (`config.num_labels`).

2. **`Wav2Vec2ForSpeechClassification` Class**:
   - This class extends `Wav2Vec2PreTrainedModel` and combines the pre-trained Wav2Vec2 model with the `Wav2Vec2ClassificationHead`:
     - `self.wav2vec2`: The pre-trained Wav2Vec2 model that processes the audio input.
     - `self.classifier`: The classification head added on top of the Wav2Vec2 model.
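     - `freeze_feature_extractor()`: A method that freezes the parameters of the convolutional feature encoder (`self.wav2vec2.feature_extractor`) so they are not updated during training. It is not called in `__init__`; the notebook calls it explicitly after loading the model.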

3. **Pooling Strategy (`merged_strategy`)**:
   - This function aggregates the hidden states from the model's outputs using the specified pooling method (`mean`, `sum`, or `max`):
     - **"mean"**: Averages the hidden states along the sequence dimension.
     - **"sum"**: Sums the hidden states along the sequence dimension.
     - **"max"**: Takes the maximum hidden state along the sequence dimension.

4. **Forward Method**:
   - The `forward()` method defines how the model processes inputs:
     - It first passes the input values through the Wav2Vec2 model to get the hidden states.
     - Then, it applies the selected pooling strategy (`mean`, `sum`, or `max`) to these hidden states.
     - The pooled features are passed through the classification head to produce logits.
     - If labels are provided, the loss is computed using `CrossEntropyLoss`.

5. **Output**:
   - If `return_dict=False`, the method returns a tuple: the logits followed by any optional hidden states and attentions, with the loss prepended when labels are provided.
   - If `return_dict=True`, it returns a `SpeechClassifierOutput` containing the loss, logits, hidden states, and attentions, providing a detailed output for further analysis.

This model architecture is designed to fine-tune Wav2Vec2 for speech classification tasks, integrating the pre-trained speech processing capabilities with a custom classification head and loss computation.
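
The three pooling modes can be checked on a tiny array. The sketch below uses NumPy as a stand-in for the `torch` operations in `merged_strategy`, with a toy input of shape `(batch, frames, hidden)`; the values are made up for illustration.

```python
import numpy as np


def merged_strategy(hidden_states, mode="mean"):
    """Collapse the sequence axis of a (batch, frames, hidden) array."""
    if mode == "mean":
        return hidden_states.mean(axis=1)
    if mode == "sum":
        return hidden_states.sum(axis=1)
    if mode == "max":
        return hidden_states.max(axis=1)
    raise ValueError("pooling mode must be one of 'mean', 'sum', 'max'")


# Toy input: batch of 1 clip, 3 frames, hidden size 2.
hidden_states = np.array([[[1.0, 4.0], [2.0, 5.0], [3.0, 9.0]]])

pooled = {mode: merged_strategy(hidden_states, mode) for mode in ("mean", "sum", "max")}
for mode, values in pooled.items():
    print(mode, values.shape, values.tolist())
```

Whatever the mode, the frame axis disappears and each clip is reduced to a single vector of size `hidden`, which is what the classification head expects.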


In [None]:
class Wav2Vec2ClassificationHead(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.final_dropout)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, features, **kwargs):
        x = features
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x


class Wav2Vec2ForSpeechClassification(Wav2Vec2PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.pooling_mode = config.pooling_mode
        self.config = config

        self.wav2vec2 = Wav2Vec2Model(config)
        self.classifier = Wav2Vec2ClassificationHead(config)

        self.init_weights()

    def freeze_feature_extractor(self):
        self.wav2vec2.feature_extractor._freeze_parameters()

    def merged_strategy(
            self,
            hidden_states,
            mode="mean"
    ):
        if mode == "mean":
            outputs = torch.mean(hidden_states, dim=1)
        elif mode == "sum":
            outputs = torch.sum(hidden_states, dim=1)
        elif mode == "max":
            outputs = torch.max(hidden_states, dim=1)[0]
        else:
            raise Exception(
                "The pooling method hasn't been defined! Your pooling mode must be one of these ['mean', 'sum', 'max']")

        return outputs

    def forward(
            self,
            input_values,
            attention_mask=None,
            output_attentions=None,
            output_hidden_states=None,
            return_dict=None,
            labels=None,
    ):
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        outputs = self.wav2vec2(
            input_values,
            attention_mask=attention_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        hidden_states = outputs[0]
        hidden_states = self.merged_strategy(hidden_states, mode=self.pooling_mode)
        logits = self.classifier(hidden_states)

        loss = None
        if labels is not None:
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

        if not return_dict:
            output = (logits,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return SpeechClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )
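
For intuition about the loss values to expect, here is a NumPy sketch of softmax cross-entropy averaged over the batch, which is the default reduction of `torch.nn.CrossEntropyLoss`. It is a stand-in for illustration, not the PyTorch implementation, and the toy logits are hypothetical.

```python
import numpy as np


def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the true classes (softmax over logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


num_labels = 8  # the eight languages in this dataset
labels = np.array([0, 3])

# All-zero logits give a uniform prediction, so the loss is ln(8).
uniform_loss = cross_entropy(np.zeros((2, num_labels)), labels)

# Confident, correct logits drive the loss towards zero.
confident = np.full((2, num_labels), -10.0)
confident[np.arange(2), labels] = 10.0
confident_loss = cross_entropy(confident, labels)

print(round(uniform_loss, 4), confident_loss < 0.001)
```

A freshly initialized head that predicts all eight classes about equally should therefore start near ln(8) ≈ 2.08, and the loss should fall from there as training progresses.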

### Defining the Data Collator for CTC with Padding

1. **`DataCollatorCTCWithPadding` Class**:
   - This class defines a custom data collator that assembles batches of audio inputs and labels for training. The name follows the collator commonly used in Wav2Vec2 CTC fine-tuning examples, but no CTC loss is involved here: each example carries a single integer class label.
   - The collator pads the input sequences to a common length within each batch. The labels are scalars, so they need no padding.

2. **Attributes**:
   - `feature_extractor`: The Wav2Vec2 feature extractor, used for padding the input sequences.
   - `padding`: Controls whether padding should be applied (`True`, `False`, or a string like `'max_length'`).
   - `max_length`: The maximum length for padding input sequences.
   - `max_length_labels`: Intended as the maximum length for padding labels; it is not used by `__call__()` in this implementation.
   - `pad_to_multiple_of`: Optional parameter to pad sequences to a multiple of a given value.
   - `pad_to_multiple_of_labels`: Intended to pad labels to a multiple of a given value; it is also unused in this implementation.

3. **`__call__()` Method**:
   - This method is invoked when the collator is called during batching:
     - `input_features`: Extracts and prepares the `input_values` from each feature in the dataset.
     - `label_features`: Collects the label data from the features.
     - `d_type`: Determines the data type for the labels (`torch.long` for integer labels or `torch.float` for continuous labels).
     - `batch`: Uses the `feature_extractor` to pad the `input_values` to the desired length and format.
     - The labels are converted into a PyTorch tensor and added to the batch.

4. **Output**:
   - The method returns a dictionary containing:
     - `input_values`: The padded input audio features.
     - `labels`: A tensor with one class ID per example (no padding is needed).

This data collator pads the audio inputs in each batch to a common length, so that clips of different durations can be stacked into a single tensor for the Wav2Vec2-based model.


In [None]:
@dataclass
class DataCollatorCTCWithPadding:
    feature_extractor: Wav2Vec2FeatureExtractor
    padding: Union[bool, str] = True
    max_length: Optional[int] = None
    max_length_labels: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    pad_to_multiple_of_labels: Optional[int] = None

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [feature["labels"] for feature in features]

        d_type = torch.long if isinstance(label_features[0], int) else torch.float

        batch = self.feature_extractor.pad(
            input_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )

        batch["labels"] = torch.tensor(label_features, dtype=d_type)

        return batch
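
The batching logic can be sketched without `transformers`. The NumPy version below pads each sequence to the longest one in the batch and stacks the integer labels. The helper `pad_batch` is hypothetical, the padding value of `0.0` is an assumption about the feature extractor's default, and the real `feature_extractor.pad` call may also return an attention mask, which is omitted here.

```python
import numpy as np


def pad_batch(features, padding_value=0.0):
    """Pad each "input_values" sequence to the longest one in the batch."""
    longest = max(len(f["input_values"]) for f in features)
    input_values = np.full((len(features), longest), padding_value, dtype=np.float32)
    for row, feature in enumerate(features):
        values = feature["input_values"]
        input_values[row, : len(values)] = values
    # One integer class ID per example, so labels need no padding.
    labels = np.array([f["labels"] for f in features], dtype=np.int64)
    return {"input_values": input_values, "labels": labels}


# Hypothetical features: two clips of different lengths.
features = [
    {"input_values": [0.1, 0.2, 0.3], "labels": 2},
    {"input_values": [0.5], "labels": 0},
]
batch = pad_batch(features)
print(batch["input_values"].shape, batch["labels"].tolist())
```

Padding per batch rather than to a global maximum keeps short clips from being inflated to the length of the longest clip in the whole dataset.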


### Creating the Data Collator Instance

1. **`DataCollatorCTCWithPadding` Instance**:
   - `data_collator = DataCollatorCTCWithPadding(feature_extractor=feature_extractor, padding=True)` creates an instance of the `DataCollatorCTCWithPadding` class.
   
2. **Parameters**:
   - `feature_extractor`: The `Wav2Vec2FeatureExtractor` is passed to the data collator, enabling it to pad input sequences accordingly.
   - `padding=True`: Pads every input sequence to the length of the longest sequence in its batch.

This `data_collator` is then used during training to prepare batches of input data, padding the inputs so that all sequences within a batch share the same length and can be processed by the model.


In [18]:
data_collator = DataCollatorCTCWithPadding(feature_extractor=feature_extractor, padding=True)

### Defining the Metric Computation Function

1. **`compute_metrics` Function**:
   - This function calculates the accuracy metric for model evaluation.
   - It is designed to be used with the Hugging Face `Trainer` API to evaluate the model's performance on the validation or test dataset.

2. **Parameters**:
   - `p: EvalPrediction`: The function takes an `EvalPrediction` object, which contains the model's predictions and the true labels.

3. **Steps**:
   - `preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions`: This line extracts the predictions. If the predictions are in a tuple (e.g., logits and hidden states), it selects the first element, which is the logits.
   - `preds = np.argmax(preds, axis=1)`: This applies the `argmax` function along the axis of the logits to get the predicted class labels (the class with the highest probability).
   - `(preds == p.label_ids).astype(np.float32).mean().item()`: This compares the predicted labels with the true labels (`p.label_ids`), averages the matches to obtain the accuracy, and converts the result into a Python scalar using `.item()`.

4. **Output**:
   - The function returns a dictionary containing the calculated accuracy value:
     - `{"accuracy": accuracy}`

This function is used during model evaluation to compute the accuracy of predictions, providing a key metric for model performance.


In [None]:
def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.argmax(preds, axis=1)

    return {"accuracy": (preds == p.label_ids).astype(np.float32).mean().item()}
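
The function can be tried on toy logits without running the `Trainer`. In the sketch below, `FakeEvalPrediction` is a hypothetical stand-in for `transformers.EvalPrediction` that exposes the same two field names, and the logits are made up.

```python
from collections import namedtuple

import numpy as np

# Minimal stand-in for transformers.EvalPrediction (same two field names).
FakeEvalPrediction = namedtuple("FakeEvalPrediction", ["predictions", "label_ids"])


def compute_metrics(p):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.argmax(preds, axis=1)
    return {"accuracy": (preds == p.label_ids).astype(np.float32).mean().item()}


logits = np.array([
    [2.0, 0.1, 0.3],  # predicts class 0
    [0.2, 1.5, 0.1],  # predicts class 1
    [0.9, 0.8, 0.1],  # predicts class 0
    [0.1, 0.2, 3.0],  # predicts class 2
])
label_ids = np.array([0, 1, 2, 2])

metrics = compute_metrics(FakeEvalPrediction(predictions=logits, label_ids=label_ids))
print(metrics)
```

Three of the four predictions match their labels, so the reported accuracy is 0.75.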

### Loading the Pre-trained Wav2Vec2 Model for Speech Classification

1. **`Wav2Vec2ForSpeechClassification.from_pretrained()`**:
   - This method is used to load a pre-trained Wav2Vec2 model with a custom speech classification head.
   
2. **Parameters**:
   - `model_name_or_path`: Specifies the name or path of the pre-trained model to load. In this case, it points to `"facebook/wav2vec2-base-100k-voxpopuli"`, a Wav2Vec2 base model pre-trained on unlabeled speech from the VoxPopuli corpus.
   - `config`: The configuration object that defines model settings, such as the number of labels and pooling mode. This configuration is passed to ensure the model is correctly adapted to your task.

3. **Result**:
   - The method returns a model made of the pre-trained Wav2Vec2 encoder (convolutional feature encoder plus Transformer layers) and the custom classification head. The head's weights are not in the checkpoint, so they are newly initialized, which is why the loader lists them in its warning and recommends training. The model is now ready for fine-tuning on your specific dataset.

By calling this method, you initialize a pre-trained Wav2Vec2 model, which can now be used for further training, fine-tuning, or evaluation on the speech classification task.


In [None]:
model = Wav2Vec2ForSpeechClassification.from_pretrained(
    model_name_or_path,
    config=config,
)

Some weights of Wav2Vec2ForSpeechClassification were not initialized from the model checkpoint at facebook/wav2vec2-base-100k-voxpopuli and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Freezing the Feature Extractor

1. **`model.freeze_feature_extractor()`**:
   - This method freezes the parameters of the Wav2Vec2 feature extractor, that is, the convolutional layers that turn the raw waveform into latent features, so they are not updated during training.
   - Freezing these layers is common when fine-tuning Wav2Vec2, because they are already well trained during pre-training.
   
2. **Purpose**:
   - Freezing the feature extractor retains the low-level audio features learned during pre-training and reduces the risk of overfitting when labeled data is limited.
   - It also reduces computation and memory use, since no gradients are computed or stored for the frozen layers.

3. **Result**:
   - After this call, the feature extractor stays fixed. The classification head is still trained, and so are the Transformer encoder layers unless they are frozen separately; the method name refers only to the convolutional front end.


In [None]:
model.freeze_feature_extractor()
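
To check what a freeze actually affects, it helps to count parameters by their `requires_grad` flag. The sketch below uses a toy two-part module, not the real Wav2Vec2 model; `TinyClassifier`, `encoder`, `head`, and `count_parameters` are hypothetical names introduced only for illustration.

```python
import torch.nn as nn


class TinyClassifier(nn.Module):
    """Toy stand-in: a 'pre-trained' encoder followed by a classification head."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(16, 8)  # 16*8 weights + 8 biases = 136 parameters
        self.head = nn.Linear(8, 3)      # 8*3 weights + 3 biases = 27 parameters

    def forward(self, x):
        return self.head(self.encoder(x))


def count_parameters(module: nn.Module):
    """Return (trainable, frozen) parameter counts for a module."""
    trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in module.parameters() if not p.requires_grad)
    return trainable, frozen


toy = TinyClassifier()
before = count_parameters(toy)

# Freezing means switching off gradient tracking for the chosen sub-module.
for param in toy.encoder.parameters():
    param.requires_grad = False

after = count_parameters(toy)
print("before freezing (trainable, frozen):", before)
print("after freezing  (trainable, frozen):", after)
```

Calling the same helper on the notebook's `model` after the freeze should show how many weights remain trainable, which is a direct way to confirm which parts of the network will be updated.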

### Setting Up the Training Arguments

1. **`TrainingArguments`**:
   - This class is used to define the training configuration for the Hugging Face `Trainer` API, specifying the parameters for model training and evaluation.

2. **Parameters**:
   - `output_dir`: The directory where the model checkpoints and logs will be saved. In this case, it is set to `"/kaggle/working"`.
   - `per_device_train_batch_size`: The batch size used during training on each device (GPU or CPU). Set to `4` here.
   - `per_device_eval_batch_size`: The batch size used during evaluation on each device. Also set to `4`.
   - `gradient_accumulation_steps`: Number of forward/backward passes whose gradients are accumulated before each optimizer update. Set to `2` here, which gives an effective batch size of 4 × 2 = 8 per device without the memory cost of a larger batch.
   - `evaluation_strategy`: Defines when to run evaluation during training. Set to `"steps"`, meaning evaluation will occur every few steps as specified by `eval_steps`.
   - `num_train_epochs`: The number of training epochs. Here, it is set to `1.0` (just one epoch of training).
   - `fp16`: Enables 16-bit mixed-precision training, which can speed up training and reduce memory usage on a supported GPU. Set to `True`.
   - `save_steps`: Defines the frequency (in terms of steps) to save model checkpoints. Here, it is set to every `10` steps.
   - `eval_steps`: Defines the frequency (in terms of steps) to run evaluation during training. Set to `10` steps.
   - `logging_steps`: Defines the frequency (in terms of steps) to log training information. Here, it is set to `10` steps.
   - `learning_rate`: The learning rate for the optimizer. Set to `1e-4` (0.0001).
   - `save_total_limit`: The maximum number of checkpoints to keep. Set to `2`, so older checkpoints are deleted and only the two most recent remain.
   
3. **Result**:
   - These settings help control how the training process will proceed, including how often to save and evaluate the model, the batch size, and the learning rate. The arguments are passed to the `Trainer` to train and evaluate the model effectively.


In [None]:
training_args = TrainingArguments(
    output_dir="/kaggle/working",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    num_train_epochs=1.0,
    fp16=True,
    save_steps=10,
    eval_steps=10,
    logging_steps=10,
    learning_rate=1e-4,
    save_total_limit=2,
)
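
The interaction between batch size, gradient accumulation, and the step-based settings can be worked out with simple arithmetic. The sketch below approximates how the number of optimizer updates per epoch follows from these arguments. The dataset size of 3200 and the single device are assumptions made for illustration; the real dataset size is not stated in this section.

```python
import math

# Values taken from the TrainingArguments above.
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
eval_steps = 10
save_steps = 10
save_total_limit = 2

# Assumptions, for illustration only.
num_devices = 1
num_train_examples = 3200

# Examples consumed per optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Mini-batches per epoch, then optimizer updates per epoch.
batches_per_epoch = math.ceil(
    num_train_examples / (per_device_train_batch_size * num_devices)
)
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps

# How often evaluation and checkpointing fire within one epoch.
evaluations = updates_per_epoch // eval_steps
checkpoints_written = updates_per_epoch // save_steps
checkpoints_kept = min(checkpoints_written, save_total_limit)

print("effective batch size:", effective_batch_size)
print("optimizer updates per epoch:", updates_per_epoch)
print("evaluation runs:", evaluations)
print("checkpoints written / kept:", checkpoints_written, "/", checkpoints_kept)
```

With the hypothetical 3200 examples this gives 400 updates, which happens to match the `global_step=400` reported by `trainer.train()` later in the notebook. It also shows that evaluating every 10 steps triggers dozens of full passes over the evaluation set within a single epoch, so raising `eval_steps` is one way to shorten the run.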

### Printing the Model Save Location

1. **`print("Model saved at:", training_args.output_dir)`**:
   - This line outputs the directory where the trained model and its checkpoints will be saved during the training process.
   
2. **Purpose**:
   - It tells the user where checkpoints will be written. Note that nothing has been saved at this point: despite the "Model saved at" wording, the line only echoes the configured directory.
   
3. **Result**:
   - The output will be the path to the directory specified by `output_dir` in the `TrainingArguments`. In this case, it will print:
     - `"Model saved at: /kaggle/working"`


In [None]:
print("Model saved at:", training_args.output_dir)

Model saved at: /kaggle/working


### Custom Trainer Class: CTCTrainer

1. **`CTCTrainer`**:
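   - This class extends the Hugging Face `Trainer` and overrides `training_step` to control how the loss is computed and backpropagated. The name appears to be carried over from CTC (Connectionist Temporal Classification) fine-tuning examples; here the loss comes from `self.compute_loss`, that is, whatever loss the classification model returns, not a CTC loss.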

2. **Parameters**:
   - The `training_step` method is called during each training iteration to compute and backpropagate the loss.

3. **Key Modifications**:
   - **Input Preparation**: `inputs = self._prepare_inputs(inputs)` ensures that inputs are properly formatted for the model.
   - **Automatic Mixed Precision (AMP)**: If AMP is enabled, the forward pass and loss run inside `autocast()`, which reduces memory usage and speeds up training. Before backpropagation the loss is multiplied by a scale factor via `self.scaler.scale(loss).backward()`, so that small gradients do not underflow to zero in 16-bit precision.
   - **Gradient Accumulation**: If `gradient_accumulation_steps > 1`, the loss is divided by the number of accumulation steps, so that the gradients summed over the micro-batches equal the gradient of the average loss. This simulates a larger batch size without increasing memory usage.
   - **Backpropagation**: Depending on the training setup (native AMP, Apex, DeepSpeed, or standard PyTorch), the loss is backpropagated through the matching code path.

4. **Purpose**:
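   - The custom `CTCTrainer` class trains the model with the model's own loss while supporting gradient accumulation and automatic mixed precision. It relies on `Trainer` attributes such as `self.use_amp`, `self.use_apex`, and `self.scaler`, and on `nn`, `autocast`, and `amp` being imported earlier in the notebook; these attributes may not be available in every `transformers` release.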
   
5. **Result**:
   - The method returns the loss after backpropagation, which is detached from the computation graph to avoid unnecessary tracking of gradients.


In [None]:
class CTCTrainer(Trainer):
    def training_step(self, model: nn.Module, inputs: Dict[str, Union[torch.Tensor, Any]]) -> torch.Tensor:
        model.train()
        inputs = self._prepare_inputs(inputs)

        if self.use_amp:
            with autocast():
                loss = self.compute_loss(model, inputs)
        else:
            loss = self.compute_loss(model, inputs)

        if self.args.gradient_accumulation_steps > 1:
            loss = loss / self.args.gradient_accumulation_steps

        if self.use_amp:
            self.scaler.scale(loss).backward()
        elif self.use_apex:
            with amp.scale_loss(loss, self.optimizer) as scaled_loss:
                scaled_loss.backward()
        elif self.deepspeed:
            self.deepspeed.backward(loss)
        else:
            loss.backward()

        return loss.detach()
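
The reason for dividing the loss by `gradient_accumulation_steps` can be checked numerically. The sketch below is a pure-Python illustration on a one-parameter model `y = w * x` with a mean-squared-error loss, not the notebook's model; `mse_grad` and the sample data are hypothetical. It compares the gradient of one full batch with the sum of scaled gradients from two equally sized micro-batches.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Reference: the gradient computed on the full batch of four examples.
full_batch_grad = mse_grad(w, xs, ys)

# Accumulation: two micro-batches of two examples each.
accumulation_steps = 2
micro_batch_size = len(xs) // accumulation_steps
accumulated_grad = 0.0
for start in range(0, len(xs), micro_batch_size):
    stop = start + micro_batch_size
    micro_grad = mse_grad(w, xs[start:stop], ys[start:stop])
    # Dividing the loss by accumulation_steps divides its gradient as well,
    # so the running sum ends up as the average over micro-batches.
    accumulated_grad += micro_grad / accumulation_steps

print("full-batch gradient: ", full_batch_grad)
print("accumulated gradient:", accumulated_grad)
```

The two values agree because the micro-batches have equal size; without the division, the accumulated gradient would be `accumulation_steps` times too large, which acts like an unintended increase in the learning rate.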


### Initializing the Trainer

1. **`Trainer`**:
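   - The Hugging Face `Trainer` class runs the training and evaluation loops, handling batching, optimization, logging, and checkpointing. Note that the cell below instantiates the stock `Trainer`, not the `CTCTrainer` defined above, so the custom `training_step` is not used in this run.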
   
2. **Parameters**:
   - `model`: The model to be trained, in this case, a custom `Wav2Vec2ForSpeechClassification` model.
   - `data_collator`: The data collator used to process and pad the inputs to the model, defined earlier as `DataCollatorCTCWithPadding`.
   - `args`: The training arguments, passed from the `TrainingArguments` object that defines how the training will proceed (batch size, evaluation strategy, etc.).
   - `compute_metrics`: A function to compute evaluation metrics (e.g., accuracy) after each evaluation phase, which uses the `compute_metrics` function defined earlier.
   - `train_dataset`: The training dataset, which contains the training examples (audio files and labels).
   - `eval_dataset`: The evaluation dataset, containing the validation examples.
   - `tokenizer`: The tokenizer or feature extractor used to preprocess the input data, in this case, `Wav2Vec2FeatureExtractor`, which handles the preprocessing of audio data.

3. **Purpose**:
   - The `Trainer` object simplifies the process of training a model, handling many of the common steps like forward/backward passes, logging, and evaluation. It uses the provided datasets, model, and arguments to run training and evaluation loops.
   
4. **Result**:
   - The `trainer` object is now ready to perform training and evaluation using the provided datasets and configurations. The model will be trained based on the parameters defined in the `TrainingArguments`.


In [None]:
trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=feature_extractor,
)



In [None]:
trainer.train()


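| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 10   | 2.0784        | 2.101141        | 0.1275   |
| 20   | 2.0828        | 2.108113        | 0.126875 |
| 30   | 2.0638        | 2.119702        | 0.128125 |
| 40   | 2.0449        | 2.111104        | 0.128125 |
| 50   | 2.1043        | 2.097295        | 0.128125 |
| 60   | 2.0697        | 2.090898        | 0.128125 |
| 70   | 2.0606        | 2.089742        | 0.128125 |
| 80   | 2.1049        | 2.093284        | 0.128125 |
| 90   | 2.129         | 2.08718         | 0.128125 |
| 100  | 2.0883        | 2.08587         | 0.128125 |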




TrainOutput(global_step=400, training_loss=2.0884689331054687, metrics={'train_runtime': 21568.909, 'train_samples_per_second': 0.148, 'train_steps_per_second': 0.019, 'total_flos': 4.2624103378027565e+17, 'train_loss': 2.0884689331054687, 'epoch': 1.0})
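
One way to read these numbers is to compare them with a chance-level baseline: a classifier that outputs a uniform distribution over `K` classes has a cross-entropy loss of `ln(K)` and, with balanced classes, an expected accuracy of `1/K`. The sketch below assumes `K = 8`, which is not stated in this section and was chosen only because `1/8` is close to the logged accuracy; the real value should be read from `config.num_labels`.

```python
import math

num_classes = 8  # assumption: replace with the real number of labels

# Baseline of a model that predicts every class with equal probability.
chance_loss = math.log(num_classes)
chance_accuracy = 1.0 / num_classes

# Values copied from the training output above.
reported_train_loss = 2.0884689331054687
reported_accuracy = 0.128125

print(f"chance loss:     {chance_loss:.4f}  reported: {reported_train_loss:.4f}")
print(f"chance accuracy: {chance_accuracy:.4f}  reported: {reported_accuracy:.4f}")
```

If the class count is indeed 8, both reported values sit very close to the baseline, which would suggest that the model has not yet learned to separate the classes and that the learning rate, the amount of training, or the data pipeline may be worth revisiting.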

In [None]:
output_dir = "/kaggle/working/trained_model"

In [None]:
os.makedirs(output_dir, exist_ok=True)

### Saving the Model

1. **`trainer.save_model(output_dir)`**:
   - This method saves the trained model to the specified directory after training or during any point in the training process.
   
2. **Parameters**:
   - `output_dir`: The directory where the model will be saved. Here it is the `output_dir` variable defined just above (`"/kaggle/working/trained_model"`), a subdirectory of the `output_dir` passed to `TrainingArguments` (`"/kaggle/working"`).
   
3. **Purpose**:
   - The purpose of this line is to persist the trained model so that it can be reused later for inference or fine-tuning. This step is crucial for saving the model's weights and configurations after training.
   
4. **Result**:
   - The trained model will be saved to the specified directory, and this model can later be loaded using `from_pretrained` for further use, such as making predictions or fine-tuning on other datasets.


In [None]:
trainer.save_model(output_dir)

In [None]:
type(model)

__main__.Wav2Vec2ForSpeechClassification