# Automatic Speech Recognition Project - Part 5
After preprocessing the data into the correct format, the next step is to train the model. Here we fine-tune the pre-trained Whisper model on the Dholuo dataset so that it can transcribe the Luo language. Fine-tuning adjusts the model's weights so that it learns the linguistic patterns, phonetic characteristics, and vocabulary specific to Luo, with the aim of improving transcription accuracy on spoken Luo audio.

## Step 1: Load the Necessary Libraries


In [None]:
#Libraries
!pip install transformers datasets jiwer openai-whisper torch torchvision torchaudio streamlit
!apt-get install ffmpeg
!pip install pydub

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
ffmpeg is already the newest version (7:4.4.2-0ubuntu0.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.


In [None]:
#Loading necessary libraries
from transformers import WhisperForConditionalGeneration, WhisperProcessor, Seq2SeqTrainer, Seq2SeqTrainingArguments
from datasets import Dataset, Audio
from jiwer import wer

In [None]:
!pip install wandb
import wandb

# Log in to Weights & Biases. Do not hard-code the API key in the notebook:
# wandb.login() reads the WANDB_API_KEY environment variable or prompts for the key.
wandb.login()





True

In [None]:
# Load Whisper model and processor
model_name = "openai/whisper-base"  # Change to "openai/whisper-large" for better accuracy
model = WhisperForConditionalGeneration.from_pretrained(model_name)
processor = WhisperProcessor.from_pretrained(model_name)

In [None]:
# Import the Colab helper used to mount Google Drive
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Step 2: Load the Preprocessed Dataset

In [None]:
from datasets import load_from_disk

#load dataset from disk
train_dataset = load_from_disk("/content/drive/My Drive/ASR/preprocessed_train")
test_dataset = load_from_disk("/content/drive/My Drive/ASR/preprocessed_test")

#verify train_dataset and test_dataset
print(train_dataset)
print(test_dataset)

# Preview the first example of each split
print("Head of train_dataset:", train_dataset[:1])
print("Head of test_dataset:", test_dataset[:1])

Dataset({
    features: ['input_features', 'labels'],
    num_rows: 2498
})
Dataset({
    features: ['input_features', 'labels'],
    num_rows: 734
})
Head of train_dataset: {'input_features': tensor([[[-0.6821, -0.6821, -0.6821,  ..., -0.6821, -0.6821, -0.6821],
         [-0.6821, -0.6821, -0.6821,  ..., -0.6821, -0.6821, -0.6821],
         [-0.6821, -0.6821, -0.6821,  ..., -0.5034, -0.6821, -0.6821],
         ...,
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]]]), 'labels': tensor([[50258, 50363, 20106,     6,   389, 29319,   281,   297, 39754,   826,
          1735,   274,    71, 18501,   257, 19515, 10390,    84,  6120,  8550,
             6,    68,   281, 44299,   287,  5827,    78,  6051, 40904, 50257,
         50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257,
         50257, 50257, 50257, 50257, 5

In [None]:
print(type(train_dataset[0]["input_features"]))
print(type(train_dataset[0]["labels"]))


<class 'torch.Tensor'>
<class 'torch.Tensor'>


## Step 3: Train the Model
In order to train the model effectively, several important steps need to be completed:

1. **Custom Data Collator**:
   - A custom data collator handles batching during training. It packs the audio features and their transcriptions into batches and pads or truncates them so that every sequence in a batch has the same length. For Whisper, the audio features must additionally span exactly 3000 mel frames per example, otherwise the model raises an error.

2. **Setting Training Arguments**:
   - Training arguments define the configuration of the training run: learning rate, batch size, number of epochs, gradient accumulation steps, and other hyperparameters. These settings control how the model learns. Tuning them helps the model converge and reduces the risk of overfitting or underfitting.

3. **Training the Model**:
   - With the custom data collator and training arguments in place, the next step is to begin the actual training process. During training, the model learns from the input data by adjusting its weights to minimize the loss function, which measures how accurately the model’s predictions match the ground truth transcriptions. This iterative process continues for the specified number of epochs, gradually improving the model’s ability to transcribe audio data in the Luo language.
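
The loss mentioned in step 3 is a token-level cross-entropy, and the label value `-100` used by the collator in this section marks padding positions that should not contribute to it. The following is a minimal pure-Python sketch of that masking idea, using made-up probabilities and a hypothetical helper name `masked_cross_entropy`. It illustrates the principle only and is not the library's implementation.

```python
import math

IGNORE_INDEX = -100  # label value that the loss skips


def masked_cross_entropy(token_probs, labels):
    """Average negative log-likelihood over non-ignored positions.

    token_probs[i] maps candidate token ids to predicted probabilities
    at position i; labels[i] is the target id or IGNORE_INDEX.
    """
    losses = [
        -math.log(probs[label])
        for probs, label in zip(token_probs, labels)
        if label != IGNORE_INDEX
    ]
    return sum(losses) / len(losses) if losses else 0.0


# Toy example: two real tokens followed by one padding position.
probs = [{7: 0.5, 9: 0.5}, {7: 0.25, 9: 0.75}, {7: 0.9, 9: 0.1}]
labels = [7, 9, IGNORE_INDEX]
loss = masked_cross_entropy(probs, labels)
print(f"loss over the two real tokens: {loss:.4f}")
```

Because the third position is labelled `-100`, the prediction made there has no effect on the result, which is why padded label positions do not distort training.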

In [None]:
##Data Collator
import torch
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence

N_FRAMES = 3000  # Whisper requires exactly 3000 mel frames per example

def custom_data_collator(batch):
    # Extract input_features and labels as tensors from the batch
    input_features = [torch.as_tensor(example["input_features"]) for example in batch]
    labels = [torch.as_tensor(example["labels"]) for example in batch]

    # Truncate or zero-pad the time (last) axis of each spectrogram to N_FRAMES,
    # assuming each example is shaped (n_mels, n_frames)
    input_features_padded = torch.stack([
        F.pad(f[..., :N_FRAMES], (0, max(0, N_FRAMES - f.shape[-1])))
        for f in input_features
    ])

    # Pad labels to the longest sequence in the batch; -100 is ignored by the loss
    labels_padded = pad_sequence(labels, batch_first=True, padding_value=-100)

    # No attention mask is returned: with every input at the fixed length,
    # the Whisper encoder does not need one
    return {
        "input_features": input_features_padded,
        "labels": labels_padded,
    }

In [None]:
##Training Arguments
import torch
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

training_args = Seq2SeqTrainingArguments(
    output_dir="/content/drive/My Drive/ASR/whisper_finetune",
    evaluation_strategy="steps",
    save_strategy="steps",
    save_steps=500,
    eval_steps=500,
    logging_steps=500,
    per_device_train_batch_size=8,  # Reduce batch size if memory issues occur
    per_device_eval_batch_size=8,   # Same here for evaluation batch size
    gradient_accumulation_steps=4,  # Increase gradient accumulation to simulate larger batch size
    num_train_epochs=5,
    learning_rate=1e-4,
    predict_with_generate=True,
    generation_max_length=128,
    save_total_limit=2,
    fp16=True,  # Keep mixed precision enabled for performance
    lr_scheduler_type="linear",  # Add learning rate scheduler to improve fine-tuning
)
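
With 2,498 training examples, the arguments above yield relatively few optimizer updates, which is worth checking against `save_steps`, `eval_steps`, and `logging_steps`. The following is a rough back-of-the-envelope sketch. It assumes a single GPU and ignores how the last partial accumulation window is counted, so the exact number can differ slightly between library versions.

```python
import math

num_train_examples = 2498          # from the train_dataset printout
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
num_train_epochs = 5
eval_steps = 500

# One optimizer update consumes batch_size * accumulation_steps examples.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

batches_per_epoch = math.ceil(num_train_examples / per_device_train_batch_size)
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_updates = updates_per_epoch * num_train_epochs

print("effective batch size:", effective_batch_size)
print("approximate optimizer updates:", total_updates)
print("reaches first evaluation step:", total_updates >= eval_steps)
```

Under these assumptions the run ends after roughly 390 updates, before step 500 is reached, so no intermediate evaluation, checkpoint, or loss log would be produced. Smaller values for those three settings (for example 50 or 100) may be more informative for a dataset of this size.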



In [None]:
#Define the trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=custom_data_collator,  # Use the custom collator
)


In [None]:
#train the model
trainer.train()


  input_features = [torch.tensor(example["input_features"]) for example in batch]
  labels = [torch.tensor(example["labels"]) for example in batch]


ValueError: Whisper expects the mel input features to be of length 3000, but found 810. Make sure to pad the input mel features to 3000.
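
The error follows from Whisper's fixed input window: the model works on 30-second chunks of 16 kHz audio, and its log-mel front end advances 160 samples per frame, which gives 3000 frames. The short calculation below reproduces that number and shows that 810 frames correspond to about 8.1 seconds of audio, so the features here appear to have been stored without padding to the full window. The constant names are illustrative.

```python
SAMPLING_RATE = 16_000   # Hz, Whisper's expected input rate
CHUNK_SECONDS = 30       # length of the fixed input window
HOP_LENGTH = 160         # samples between consecutive mel frames

frames_per_second = SAMPLING_RATE // HOP_LENGTH
expected_frames = CHUNK_SECONDS * frames_per_second

found_frames = 810       # value reported in the error message
found_seconds = found_frames / frames_per_second
missing_frames = expected_frames - found_frames

print("expected frames:", expected_frames)
print("seconds covered by 810 frames:", found_seconds)
print("frames of padding needed:", missing_frames)
```

Two plausible ways to satisfy the requirement are to let `WhisperFeatureExtractor` pad to its default 30-second length during preprocessing, or to pad the time axis of each example to 3000 frames inside the data collator.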