# **Fine-Tuning Whisper-Small for Medical Speech Recognition**

## **Project Overview**

This project focuses on fine-tuning OpenAI’s Whisper-small model for automatic speech recognition (ASR) in the medical domain. The goal is to improve transcription accuracy for medical conversations, reports, and patient dialogues.

This is a **non-commercial initiative** developed as part of my **portfolio** to showcase my skills in **machine learning, natural language processing (NLP), and automatic speech recognition (ASR)**. It demonstrates my ability to fine-tune state-of-the-art models like OpenAI's **Whisper** for a specific domain: **medical speech recognition**.

The project focuses on building a robust ASR system that can accurately transcribe medical-related audio data, such as doctor-patient conversations or medical dictations. By fine-tuning the Whisper model on a medical dataset, I aim to highlight my expertise in:

*   **Data preprocessing and preparation** for machine learning tasks.
*   **Fine-tuning pre-trained models** for domain-specific applications.
*   **Deploying machine learning models** using user-friendly interfaces like Gradio.
*   **Evaluating model performance** using metrics like Word Error Rate (WER).

This project is not intended for commercial use but serves as a proof of concept for how advanced ASR models can be adapted to specialized domains like healthcare.

## **1. GPU Setup & Environment Configuration**

### Checking GPU Availability

Before starting, we check whether a GPU is available. GPUs are critical for training deep learning models efficiently; without one, fine-tuning will be much slower.

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Sat Mar 15 08:19:05 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   66C    P8             12W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
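
The `!nvidia-smi` syntax above only works inside IPython or Colab. As a minimal sketch for plain Python scripts, the same check can be approximated with the standard library. The helper name `gpu_summary` is an assumption introduced here for illustration, not part of the original notebook.

```python
import shutil
import subprocess


def gpu_summary() -> str:
    """Return the nvidia-smi report, or a short message if no GPU is visible."""
    # shutil.which returns None when the nvidia-smi binary is not on PATH
    if shutil.which("nvidia-smi") is None:
        return "Not connected to a GPU"
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return "Not connected to a GPU"
    # A non-zero exit code usually means the driver could not reach a device
    if result.returncode != 0:
        return "Not connected to a GPU"
    return result.stdout


print(gpu_summary())
```

Checking the exit code is a little more robust than searching the output for the word "failed", since it does not depend on the exact wording of the error message.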
                                                

### Installing Required Libraries

We install the necessary Python libraries to handle datasets, load pre-trained models, optimize training, and evaluate the model's performance. These libraries are essential for the entire workflow.

In [1]:
!pip install --upgrade --quiet pip
!pip install --upgrade --quiet datasets[audio] transformers accelerate evaluate jiwer tensorboard gradio


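*   **datasets**: For loading and managing datasets (the `audio` extra adds audio decoding support).
*   **transformers**: For using pre-trained models like Whisper.
*   **accelerate**: For optimizing training on GPUs.
*   **evaluate**: For computing evaluation metrics like Word Error Rate (WER).
*   **jiwer**: The backend that `evaluate` uses to compute WER.
*   **tensorboard**: For logging training metrics during fine-tuning.
*   **gradio**: For creating a user-friendly interface to test the model.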

**Also, log in to Hugging Face to access the dataset and to push the fine-tuned model to the Hub.**

In [None]:
from huggingface_hub import notebook_login

notebook_login()


## **2. Dataset Loading & Preparing**

**Loading and Preparing the Dataset**

We load the medical ASR dataset, which contains audio files and their corresponding transcriptions. Note that the Hub dataset's `test` split is the larger one (5,895 rows), so it is used here as the training set, while the smaller `train` split (381 rows) serves as the test set. The audio is resampled to 16,000 Hz, the sampling rate the Whisper model expects, and unnecessary columns are removed to streamline the data.

In [None]:
from datasets import load_dataset, DatasetDict, Audio

# Load dataset
# Note: the Hub "test" split is the larger one (5,895 rows vs. 381), so it is
# used as the training set and the Hub "train" split as the test set.
medical_dataset = DatasetDict()
medical_dataset["train"] = load_dataset("yashtiwari/PaulMooney-Medical-ASR-Data", split="test")
medical_dataset["test"] = load_dataset("yashtiwari/PaulMooney-Medical-ASR-Data", split="train")

# Decode the 'path' column as audio resampled to 16 kHz (the column keeps the name 'path')
medical_dataset = medical_dataset.cast_column("path", Audio(sampling_rate=16000))

# Remove unnecessary columns
medical_dataset = medical_dataset.remove_columns(["id", "prompt", "speaker_id"])

# Print dataset info
print(medical_dataset)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


DatasetDict({
    train: Dataset({
        features: ['sentence', 'path'],
        num_rows: 5895
    })
    test: Dataset({
        features: ['sentence', 'path'],
        num_rows: 381
    })
})
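
To make the resampling step concrete, here is a deliberately naive sketch of what changing the sampling rate means. It uses plain linear interpolation and no anti-aliasing filter, so it is only an illustration; the `Audio` feature relies on a proper resampler. The function name `resample_linear` and the 48 kHz source rate are assumptions for the example, since the original sampling rate of the recordings is not stated here.

```python
def resample_linear(samples, source_rate, target_rate):
    """Naive linear-interpolation resampler, for illustration only."""
    if not samples:
        return []
    # Number of output samples needed to cover the same duration
    n_out = max(1, round(len(samples) * target_rate / source_rate))
    step = source_rate / target_rate  # source samples per output sample
    out = []
    for i in range(n_out):
        pos = i * step
        left = min(int(pos), len(samples) - 1)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        # Blend the two neighbouring source samples
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


# One second of a ramp signal at 48 kHz becomes 16,000 samples at 16 kHz
one_second = [i / 48000 for i in range(48000)]
resampled = resample_linear(one_second, 48000, 16000)
print(len(resampled))  # 16000
```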


## **3. Feature Extraction & Tokenization**

The audio files are converted into log-Mel spectrograms, which are the input features for the Whisper model. The text transcriptions are tokenized into numerical IDs. This preprocessing step is crucial for preparing the data for training.

In [None]:
# Import the WhisperFeatureExtractor class from the Hugging Face Transformers library
from transformers import WhisperFeatureExtractor

# Initialize the feature extractor using the pre-trained Whisper-small model
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")

*   The `WhisperFeatureExtractor` converts raw audio into log-Mel spectrograms, which are the input features for the Whisper model.
*   The `from_pretrained("openai/whisper-small")` call loads the feature extractor's preprocessing configuration from the Whisper-small repository on the Hugging Face Hub. The feature extractor itself has no trained weights.
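
As a quick sanity check on what the feature extractor produces, the shape of each spectrogram can be worked out from Whisper's preprocessing settings: 16 kHz audio, padded or truncated to 30 seconds, with one frame every 160 samples and 80 Mel bins for Whisper-small. The sketch below only does the arithmetic; the constant and function names are assumptions for illustration.

```python
SAMPLING_RATE = 16_000  # Hz
CHUNK_SECONDS = 30      # every clip is padded or truncated to this length
HOP_LENGTH = 160        # samples between consecutive spectrogram frames
N_MELS = 80             # Mel bins used by whisper-small


def whisper_feature_shape():
    """Return (n_mels, n_frames) for one preprocessed example."""
    samples_per_chunk = SAMPLING_RATE * CHUNK_SECONDS  # 480,000 samples
    n_frames = samples_per_chunk // HOP_LENGTH         # 3,000 frames
    return (N_MELS, n_frames)


print(whisper_feature_shape())  # (80, 3000)
```

Because every example is padded to the same 30-second window, all `input_features` arrays share this shape regardless of how short the original recording was.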



In [None]:
# Import the WhisperTokenizer class from the Hugging Face Transformers library
from transformers import WhisperTokenizer

# Initialize the tokenizer using the pre-trained Whisper-small model
# Specify the language as "English" and the task as "transcribe"
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="English", task="transcribe")

In [None]:
# Import the WhisperProcessor class from the Hugging Face Transformers library
from transformers import WhisperProcessor

# Initialize the processor using the pre-trained Whisper-small model
# Specify the language as "English" and the task as "transcribe"
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="English", task="transcribe")



*   The `WhisperProcessor` is a convenience class that combines the `WhisperFeatureExtractor` and the `WhisperTokenizer`. It handles both audio feature extraction and text tokenization in a single object.

*   The `from_pretrained("openai/whisper-small")` call loads the processor's configuration and the tokenizer vocabulary from the Whisper-small repository. No model weights are loaded at this stage.
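
The `language` and `task` arguments matter because the Whisper tokenizer wraps every transcription in special tokens that tell the decoder what to do. The sketch below mimics the decoded form of a label sequence as a plain string. It is not the tokenizer's own code; the helper `whisper_label_template` and the one-entry language table are assumptions for illustration.

```python
LANGUAGE_CODES = {"english": "en"}  # tiny subset, enough for this project


def whisper_label_template(text, language="English", task="transcribe"):
    """Show how a transcription looks once Whisper's special tokens are added."""
    lang = LANGUAGE_CODES[language.lower()]
    prefix = f"<|startoftranscript|><|{lang}|><|{task}|><|notimestamps|>"
    return f"{prefix}{text}<|endoftext|>"


print(whisper_label_template("I need a good treatment"))
# <|startoftranscript|><|en|><|transcribe|><|notimestamps|>I need a good treatment<|endoftext|>
```

Decoding `tokenizer(sentence).input_ids` with `skip_special_tokens=False` should show the same structure for a real example.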



In [None]:
print(medical_dataset["train"][0])

{'sentence': 'All my body is in a bad case and i need a good treatment', 'path': {'path': '1249120_44101988_103474667.wav', 'array': array([0.00867404, 0.01328385, 0.01013007, ..., 0.02171656, 0.02924957,
       0.04547866]), 'sampling_rate': 16000}}


## **4. Data Preprocessing for Model Training**

In [None]:
def prepare_medical_dataset(batch):
    # Load and resample audio
    audio = batch["path"]  # Use "path" since it contains the audio array

    # Compute log-Mel spectrogram features from the audio
    batch["input_features"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]

    # Tokenize the text into labels
    batch["labels"] = tokenizer(batch["sentence"]).input_ids  # Use "sentence" for transcription
    return batch

In [None]:
medical_dataset = medical_dataset.map(
    prepare_medical_dataset,
    remove_columns=medical_dataset.column_names["train"],  # Remove old columns
    num_proc=2  # Parallel processing
)


## **5. Model Setup & Training Configuration**

We load the pre-trained Whisper model from OpenAI. The model is configured for English transcription, which is the task we are focusing on.

In [None]:
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

In [None]:
model.generation_config.language = "English"
model.generation_config.task = "transcribe"

model.generation_config.forced_decoder_ids = None

In [None]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if the bos token was added in the previous tokenization step,
        # cut it here because it is added again later anyway
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

In [None]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(
    processor=processor,
    decoder_start_token_id=model.config.decoder_start_token_id,
)
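
The label handling in the collator can be sketched without tensors. The version below works on plain Python lists: it pads every label sequence to the longest one using `-100` (the index the loss function ignores) and drops the leading start token when every sequence begins with it. The function name `pad_labels` and the token IDs in the example are made up for illustration.

```python
def pad_labels(label_seqs, start_id, ignore_index=-100):
    """Pad label sequences with ignore_index and strip a shared leading start token."""
    max_len = max(len(seq) for seq in label_seqs)
    padded = [
        list(seq) + [ignore_index] * (max_len - len(seq))
        for seq in label_seqs
    ]
    # Drop the first position only if every sequence starts with the start token
    if all(row and row[0] == start_id for row in padded):
        padded = [row[1:] for row in padded]
    return padded


# Two toy label sequences; 1 plays the role of the decoder start token
print(pad_labels([[1, 7, 8, 2], [1, 9, 2]], start_id=1))
# [[7, 8, 2], [9, 2, -100]]
```

The real collator reaches the same result in two steps: it pads with the tokenizer's pad token and then uses the attention mask to overwrite the padded positions with `-100`.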

## **6. Model Training & Evaluation**

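The model is fine-tuned on the medical ASR dataset for 500 steps (`max_steps=500`) with a batch size of 16. Evaluation and checkpointing are scheduled by step count. Note that `eval_steps=1000` and `save_steps=1000` are larger than `max_steps`, so as configured no intermediate evaluation is triggered during this run; lowering both values (for example to 250) would give periodic WER measurements. The final model is pushed to the Hugging Face Hub for easy access.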

In [None]:
import evaluate

metric = evaluate.load("wer")

In [None]:
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
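
For intuition about the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference. The sketch below is a minimal pure-Python version written for illustration. It is not the implementation behind `evaluate.load("wer")`, and the function name `word_error_rate` and the example sentences are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word ("a") and one substitution ("fever" -> "fewer") over 6 reference words
wer = word_error_rate("the patient has a mild fever", "the patient has mild fewer")
print(round(100 * wer, 2))  # 33.33
```

As in `compute_metrics`, multiplying by 100 reports the score as a percentage. WER can exceed 100% when the hypothesis contains many inserted words.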

In [None]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./Whisper-Small-Medical-ASR_BH-1_1",  # change to a repo name of your choice
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=500,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
)
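
Two consequences of these settings can be checked with simple arithmetic: how many passes over the training set 500 steps represent, and how the learning rate evolves. The sketch below assumes a single GPU and the linear warmup-then-decay schedule that the Trainer uses by default; the function names are assumptions for illustration.

```python
import math

TRAIN_ROWS = 5895     # training set size from the dataset printout
BATCH_SIZE = 16       # per_device_train_batch_size
GRAD_ACCUM = 1        # gradient_accumulation_steps
MAX_STEPS = 500
WARMUP_STEPS = 500
PEAK_LR = 1e-5


def epochs_covered(rows, batch_size, grad_accum, max_steps):
    """Number of passes over the data made in max_steps optimizer steps."""
    steps_per_epoch = max(1, math.ceil(rows / batch_size) // grad_accum)
    return max_steps / steps_per_epoch


def linear_schedule_lr(step, peak_lr, warmup_steps, max_steps):
    """Linear warmup to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * max(0.0, (max_steps - step) / max(1, max_steps - warmup_steps))


print(round(epochs_covered(TRAIN_ROWS, BATCH_SIZE, GRAD_ACCUM, MAX_STEPS), 3))  # 1.355
print(linear_schedule_lr(250, PEAK_LR, WARMUP_STEPS, MAX_STEPS))
```

With `warmup_steps` equal to `max_steps`, the learning rate is still ramping up for the entire run and only approaches `1e-5` at the very end. A shorter warmup (for instance 50 steps) would let the model train at the full learning rate for most of the run.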



In [None]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=medical_dataset["train"],
    eval_dataset=medical_dataset["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)



In [None]:
processor.save_pretrained(training_args.output_dir)

[]

In [None]:
trainer.train()

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
`use_cache = True` is incompatible with gradient checkpointing. Setting `use_cache = False`...






TrainOutput(global_step=500, training_loss=0.32798467111587526, metrics={'train_runtime': 2078.8302, 'train_samples_per_second': 3.848, 'train_steps_per_second': 0.241, 'total_flos': 2.30608593395712e+18, 'train_loss': 0.32798467111587526, 'epoch': 1.3550135501355014})

## **7. Deployment & Gradio Interface**

In [None]:
kwargs = {
    "dataset_tags": "yashtiwari/PaulMooney-Medical-ASR-Data",
    "dataset": "Paul Mooney Medical ASR Data",  # a 'pretty' name for the training dataset
    "dataset_args": "split: train, test",
    "language": "en",
    "model_name": "Whisper Small Medical ASR -BH",  # replace with your name
    "finetuned_from": "openai/whisper-small",
    "tasks": "automatic-speech-recognition",
}


In [None]:
trainer.push_to_hub(**kwargs)

CommitInfo(commit_url='https://huggingface.co/Bakhshial/Whisper-Small-Medical-ASR_BH-1_1/commit/1b2470893714062eab1728ce90e9c3fad1219460', commit_message='End of training', commit_description='', oid='1b2470893714062eab1728ce90e9c3fad1219460', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Bakhshial/Whisper-Small-Medical-ASR_BH-1_1', endpoint='https://huggingface.co', repo_type='model', repo_id='Bakhshial/Whisper-Small-Medical-ASR_BH-1_1'), pr_revision=None, pr_num=None)

In [2]:
from transformers import pipeline
import gradio as gr

# Load the fine-tuned Whisper model from the Hub (replace with your own repo ID if needed)
pipe = pipeline("automatic-speech-recognition", model="Bakhshial/Whisper-Small-Medical-ASR_BH-1_1")

# Function to transcribe uploaded audio files
def transcribe(audio_path):
    if not audio_path:
        return "Please upload an audio file."

    result = pipe(audio_path)  # Transcribe audio
    return result["text"]

# Gradio interface for testing
iface = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(type="filepath"),  # Pass the uploaded audio to transcribe() as a file path
    outputs="text",
    title="Whisper Medical ASR Test",
    description="Upload an audio file to test the fine-tuned Whisper-small model for medical speech recognition.",
)

# Launch the interface
iface.launch()


Device set to use cpu


Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://8728989261fba7248b.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
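
To test the interface logic without downloading the model, the handler can be separated from the pipeline. The sketch below wraps any callable that maps an audio path to a `{"text": ...}` dictionary, which is the shape the ASR pipeline returns above. The names `make_transcriber` and `fake_pipe` are hypothetical and exist only for this example.

```python
def make_transcriber(pipe):
    """Build a Gradio-friendly handler around any path -> {'text': ...} callable."""
    def transcribe(audio_path):
        if not audio_path:
            return "Please upload an audio file."
        try:
            result = pipe(audio_path)
        except Exception as exc:
            # Show the failure in the UI instead of raising inside the handler
            return f"Transcription failed: {exc}"
        return result["text"].strip()
    return transcribe


# Hypothetical stand-in for the real pipeline, used only for a quick check
def fake_pipe(audio_path):
    return {"text": " patient reports chest pain "}


transcribe = make_transcriber(fake_pipe)
print(transcribe("sample.wav"))  # patient reports chest pain
print(transcribe(None))          # Please upload an audio file.
```

Passing `make_transcriber(pipe)` as `fn` to `gr.Interface` would then behave like the handler above, with the added benefit that errors are reported as text in the output box.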


