## Introduction to MLflow and OpenAI's Whisper

In this tutorial, we explore the integration of [OpenAI's Whisper](https://huggingface.co/openai), an automatic speech recognition (ASR) system, with MLflow. This guide is intended for those who are familiar with machine learning workflows and are interested in managing and deploying ASR models. We will demonstrate how to use MLflow to log, manage, and serve a state-of-the-art speech-to-text model provided by the [🤗 Hugging Face](https://huggingface.co/) [Transformers](https://huggingface.co/docs/transformers/model_doc/whisper) library.

### What is Whisper?

Whisper is an ASR model developed by [OpenAI](https://openai.com/) that has been trained on a diverse range of accents and environments. It is designed to convert spoken language into written text with high accuracy. The model is available through the Transformers library, which facilitates the use of pre-trained models for various tasks.

### Why MLflow with Whisper?

Integrating MLflow with Whisper provides several advantages:

- **Experiment Tracking**: Track and compare Whisper model configurations and performance across different experiments.
- **Model Management**: Maintain a centralized repository for different versions of Whisper models and their configurations.
- **Reproducibility**: Ensure consistent results by recording all necessary components to reproduce a transcription.
- **Deployment**: Simplify the deployment process of Whisper models into production environments.

### Learning Objectives

In this tutorial, you will:

- Set up an audio transcription **pipeline** using Whisper from the Transformers library.
- **Log** the Whisper model along with its configurations using MLflow.
- Automatically infer the input and output **signature** of the Whisper model.
- **Load** a stored Whisper model from MLflow in its native format for interactive usage.
- Simulate serving the Whisper model using MLflow's **pyfunc** model flavor and perform audio transcriptions.

By the end of this tutorial, you will have a comprehensive understanding of how to manage and deploy Whisper ASR models with MLflow, enhancing your machine learning operations for speech-to-text tasks.

Let's dive into the world of automatic speech recognition with MLflow and Whisper!

## Setting Up the Environment and Acquiring Audio Data

Before we start transcribing audio with Whisper, we need to set up our environment and acquire an audio file to work with. We'll also initialize our MLflow environment to track our experiments.

The following code block performs these initial steps:

1. **Audio Acquisition**: We download a sample audio file from NASA's public domain resources.
2. **Model Initialization**: We load the Whisper model, tokenizer, and feature extractor from the Transformers library.
3. **Pipeline Creation**: We create a transcription pipeline with the loaded Whisper model and its components.


In [1]:
import requests
import transformers

import mlflow


# Acquire an audio file that is in the public domain
resp = requests.get(
    "https://www.nasa.gov/wp-content/uploads/2015/01/590325main_ringtone_kennedy_WeChoose.mp3"
)
resp.raise_for_status()
audio = resp.content

task = "automatic-speech-recognition"
architecture = "openai/whisper-large-v3"

model = transformers.WhisperForConditionalGeneration.from_pretrained(architecture)
tokenizer = transformers.WhisperTokenizer.from_pretrained(architecture)
feature_extractor = transformers.WhisperFeatureExtractor.from_pretrained(architecture)
model.generation_config.alignment_heads = [[2, 2], [3, 0], [3, 2], [3, 3], [3, 4], [3, 5]]
audio_transcription_pipeline = transformers.pipeline(
    task=task, model=model, tokenizer=tokenizer, feature_extractor=feature_extractor
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
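Since `raise_for_status()` only guards against HTTP errors, it can be worth confirming that the downloaded payload at least resembles an MP3 file before passing it to the pipeline. The sketch below uses a hypothetical helper, `looks_like_mp3`, which is not part of this tutorial's code or of any library. It checks for an ID3 tag or an MPEG frame-sync pattern at the start of the data, so treat it as a rough heuristic rather than a validator.

```python
def looks_like_mp3(data: bytes) -> bool:
    """Heuristically check whether a byte string starts like an MP3 file."""
    # Many MP3 files begin with an ID3v2 metadata tag.
    if data[:3] == b"ID3":
        return True
    # Otherwise an MPEG audio frame starts with 11 set bits (the frame sync).
    return len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0


# Hand-written byte strings for illustration only (not real audio).
print(looks_like_mp3(b"ID3\x03\x00\x00"))
print(looks_like_mp3(b"\xff\xfb\x90\x00"))
print(looks_like_mp3(b"<html>Not Found</html>"))
```

In the notebook, this could be called as `looks_like_mp3(audio)` right after the download.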


## Formatting the Transcription Output

In this section, we introduce a utility function whose only purpose is to make the transcription output easier to read in this Jupyter notebook demo. It is intended for demonstration only and should not be included in production code.

The `format_transcription` function takes a long string of transcribed text and formats it by splitting it into sentences and inserting newline characters. This makes the output easier to read when printed in the notebook environment.


In [2]:
def format_transcription(transcription):
    """
    Function for formatting a long string by splitting into sentences and adding newlines.
    """
    # Split on ". " (a naive heuristic: abbreviations such as "Dr. Smith" are split as well)
    sentences = [
        sentence.strip() + ("." if not sentence.strip().endswith(".") else "")
        for sentence in transcription.split(". ")
        if sentence.strip()
    ]

    # Join the sentences with a newline character
    formatted_text = "\n".join(sentences)

    return formatted_text
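The helper above splits only on a period followed by a space, so a sentence ending in `?` or `!` stays attached to the one that follows it. One possible variant, sketched here with a hypothetical `split_sentences` helper and an invented sample string, uses a regular expression to split after any of the three punctuation marks. It still splits on abbreviations and is meant only as an illustration.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty strings, which appear when the input is empty.
    return [part for part in parts if part]


# Invented sample text, not model output.
sample = "We choose to go to the moon. Not because they are easy! Liftoff?"
print("\n".join(split_sentences(sample)))
```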

## Executing the Transcription Pipeline

After setting up the Whisper model and the audio transcription pipeline, we can now process our audio file to extract the transcription. The following code block will feed the audio file into the pipeline and then print the formatted transcription. We're showing this transcription here so that we can see how loading the saved model back from MLflow achieves the exact same result.

The `format_transcription` function we defined earlier will be used here to split the transcription into sentences and add newline characters, which improves the readability of the output in our Jupyter notebook.

Execute the following code to perform the transcription and see the formatted output:

In [3]:
transcription = audio_transcription_pipeline(audio)

print(format_transcription(transcription["text"]))

We choose to go to the moon in this decade and do the other things.
Not because they are easy, but because they are hard.
3, 2, 1, 0.
All engines running.
Liftoff.
We have a liftoff.
32 minutes past the hour.
Liftoff on Apollo 11.


## Model Signature and Configuration

In this section, we generate a signature for our model. The signature defines the schema of the model's inputs and outputs, which documents the data types and structures that the model expects and produces. For raw binary audio data, the signature inferred here is appropriate. For other supported input types, such as a float32 NumPy array of audio samples at the sampling rate the model expects, we would need to specify a signature explicitly to override the default "binary" input type.

The `transformers` flavor for audio-based pipelines supports multiple audio input formats (provided that the pipeline's underlying feature extractor supports the format), as well as URL-based input for transcribing audio files hosted on the internet. We don't demonstrate URL input here because of potential copyright concerns with hosted audio files, but if the source URL points to a hosted sound file, Whisper will transcribe it.

Additionally, we will define the model configuration, which includes parameters such as the chunk length and stride length for processing the audio data. These configurations can be tuned based on the requirements of your specific use case.

Execute the following code to infer the model signature and set the model configuration:


In [4]:
model_config = {
    "chunk_length_s": 20,
    "stride_length_s": [5, 3],
}

signature = mlflow.models.infer_signature(
    audio,
    mlflow.transformers.generate_signature_output(audio_transcription_pipeline, audio),
    params=model_config,
)
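To build intuition for `chunk_length_s` and `stride_length_s`, the following sketch lays out the overlapping windows that a long recording would be cut into. It assumes that each window starts one "step" after the previous one, where the step is the chunk length minus the left and right strides. This is a simplified reading of how the Transformers pipeline chunks long audio, not its actual implementation. The `chunk_windows` helper is hypothetical and the 50-second duration is invented for the example.

```python
from typing import List, Tuple


def chunk_windows(
    duration_s: float,
    chunk_length_s: float,
    stride_left_s: float,
    stride_right_s: float,
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds for each overlapping window."""
    step = chunk_length_s - stride_left_s - stride_right_s
    if step <= 0:
        raise ValueError("chunk length must exceed the combined stride lengths")
    windows = []
    start = 0.0
    while start < duration_s:
        # The last window is clipped to the end of the recording.
        end = min(start + chunk_length_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break
        start += step
    return windows


# Same values as model_config above, applied to a made-up 50-second clip.
for start, end in chunk_windows(50, 20, 5, 3):
    print(f"{start:5.1f}s -> {end:5.1f}s")
```

Larger strides mean more overlap between neighboring windows, which gives the model more context at the seams at the cost of processing more audio overall.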

## Setting the Tracking Server and Creating an Experiment

For this tutorial, we use a local MLflow tracking server so that we can view the results in the MLflow UI. Start one by running the following command from a terminal:

``` bash
    mlflow server --host 127.0.0.1 --port 8080
```

With the server started, the following code ensures that all experiments, runs, models, parameters, and metrics that we log are tracked within that server instance, which also serves the MLflow UI when you navigate to that URL in a browser.

After setting the tracking URI, we create a new MLflow Experiment to hold the run we're about to create.

In [5]:
mlflow.set_tracking_uri("http://127.0.0.1:8080")

mlflow.set_experiment("WhisperTranscription")

<Experiment: artifact_location='mlflow-artifacts:/224567153691341564', creation_time=1699495264421, experiment_id='224567153691341564', last_update_time=1699495264421, lifecycle_stage='active', name='WhisperTranscription', tags={}>
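If the tracking server is not running, logging calls against that URI will fail later in the notebook, so a quick connectivity check can save some confusion. The sketch below is one way to do that with the standard library only. `tracking_server_reachable` is a hypothetical helper, not an MLflow function, and it only tests that a TCP connection can be opened, not that an MLflow server is the one answering.

```python
import socket
from urllib.parse import urlparse


def tracking_server_reachable(uri: str, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to the URI's host and port succeeds."""
    parsed = urlparse(uri)
    if parsed.hostname is None:
        return False
    # Fall back to the scheme's default port when none is given.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False


# The address used in this tutorial; the result depends on your machine.
print(tracking_server_reachable("http://127.0.0.1:8080"))
```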

## Logging the Model with MLflow

The next step in our MLflow tutorial is to log the model along with its configuration. Logging the model captures all the necessary information to reproduce the work and configure an environment suitable for inference in deployment infrastructure. 

This includes the model itself, the signature that defines the input and output formats, an example of the input data, and any additional configuration parameters that were set.

By logging this information, we create a record within MLflow that can be referred to for future experiments, shared with colleagues, or used to deploy the model into a production environment.

The following cell demonstrates how to log the model using MLflow's `log_model` function within the context of an MLflow run.

This code will create a new run in the current MLflow experiment, log the model, and store it in the specified artifact path. The `signature` and `input_example` help document how the model should be used, and the `model_config` captures any model-specific configurations that will be applied by default when using the pipeline for inference.

In [6]:
# Log the pipeline
with mlflow.start_run():
    model_info = mlflow.transformers.log_model(
        transformers_model=audio_transcription_pipeline,
        artifact_path="whisper_transcriber",
        signature=signature,
        input_example=audio,
        model_config=model_config,
    )



## Loading and Using the Model Pipeline

Once we have logged our model with MLflow, we can load and use it just as we would with the original model pipeline. This step is crucial for ensuring that our logged model behaves as expected and can be used for inference.

The code below demonstrates how to load the model in its native format using MLflow's `load_model` function. We then pass an audio input to the loaded model to obtain a transcription. In the case of this example, we're passing in an MP3 audio file.

We include native loading and inference here as a validation step before evaluating the pyfunc loading that we do next. Confirming that the model (particularly one as large and complex as this) works in its native format reduces the troubleshooting we might otherwise face, especially when working with something more complex than an unmodified pre-trained model from the Hugging Face Hub.

In [7]:
# Load the pipeline in its native format
loaded_transcriber = mlflow.transformers.load_model(model_uri=model_info.model_uri)

transcription = loaded_transcriber(audio)

print(f"\nWhisper native output transcription:\n{format_transcription(transcription['text'])}")

2023/11/08 22:53:55 INFO mlflow.store.artifact.artifact_repo: The progress bar can be disabled by setting the environment variable MLFLOW_ENABLE_ARTIFACTS_PROGRESS_BAR to false
2023/11/08 22:54:33 INFO mlflow.transformers: 'runs:/a77b9f3c037948228dd24787e33f91b4/whisper_transcriber' resolved as 'mlflow-artifacts:/224567153691341564/a77b9f3c037948228dd24787e33f91b4/artifacts/whisper_transcriber'


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



Whisper native output transcription:
We choose to go to the moon in this decade and do the other things.
Not because they are easy, but because they are hard.
3, 2, 1, 0.
All engines running.
Liftoff.
We have a liftoff.
32 minutes past the hour.
Liftoff on Apollo 11.
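Comparing the two transcriptions by eye works for a short clip, but the check can also be done programmatically. The sketch below uses two hypothetical helpers, `normalize_text` and `transcriptions_match`, that ignore differences in case and whitespace. The two strings are taken from the first sentence of the output above, with whitespace varied by hand to show the normalization.

```python
def normalize_text(text: str) -> str:
    """Lowercase and collapse whitespace so cosmetic differences are ignored."""
    return " ".join(text.lower().split())


def transcriptions_match(first: str, second: str) -> bool:
    """Compare two transcriptions after normalization."""
    return normalize_text(first) == normalize_text(second)


# Whitespace differs between the two strings on purpose.
original_text = " We choose to go to the moon in this decade and do the other things."
reloaded_text = "We choose to go to the moon in this decade\nand do the other things."
print(transcriptions_match(original_text, reloaded_text))
```

In the notebook, the same comparison could be applied to the `text` field of the original and reloaded pipeline outputs.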


## Using the Pyfunc Flavor for Inference

MLflow provides a `pyfunc` flavor for models, which allows for a more generic interface that can work across different ML frameworks. This can be particularly useful when deploying models to different environments where the original framework may not be available or when a more flexible interface is required.

The following code demonstrates how to load the Whisper model as a `pyfunc` and perform a prediction. 

This approach showcases the versatility of MLflow in adapting models for different deployment scenarios.

Notice the slightly different output format for the pyfunc as compared to the native output. 
In order to conform to the standards of pyfunc output signatures, the output is represented as a `List[str]` type.

In [8]:
pyfunc_transcriber = mlflow.pyfunc.load_model(model_uri=model_info.model_uri)

pyfunc_transcription = pyfunc_transcriber.predict([audio])

# Note: when `return_timestamps` is set, the pyfunc returns a JSON-encoded string instead.
print(f"\nPyfunc output transcription:\n{format_transcription(pyfunc_transcription[0])}")

2023/11/08 22:54:45 INFO mlflow.store.artifact.artifact_repo: The progress bar can be disabled by setting the environment variable MLFLOW_ENABLE_ARTIFACTS_PROGRESS_BAR to false


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



Pyfunc output transcription:
We choose to go to the moon in this decade and do the other things.
Not because they are easy, but because they are hard.
3, 2, 1, 0.
All engines running.
Liftoff.
We have a liftoff.
32 minutes past the hour.
Liftoff on Apollo 11.
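As the comment in the cell above notes, enabling `return_timestamps` changes the pyfunc result to a JSON-encoded string. The sketch below shows how such a string might be decoded. It assumes the decoded payload mirrors the Transformers pipeline layout of a `text` field plus a list of `chunks`, each with a `timestamp` pair. That layout is an assumption here, the `decode_timestamped` helper is hypothetical, and the sample string is written by hand rather than produced by the model.

```python
import json
from typing import List, Tuple

# Hand-written stand-in for a pyfunc result; not real model output.
sample_output = json.dumps(
    {
        "text": "Liftoff. We have a liftoff.",
        "chunks": [
            {"timestamp": [0.0, 1.2], "text": "Liftoff."},
            {"timestamp": [1.2, 3.0], "text": " We have a liftoff."},
        ],
    }
)


def decode_timestamped(result: str) -> List[Tuple[float, float, str]]:
    """Decode the JSON string and return (start, end, text) tuples."""
    payload = json.loads(result)
    return [
        (chunk["timestamp"][0], chunk["timestamp"][1], chunk["text"].strip())
        for chunk in payload.get("chunks", [])
    ]


for start, end, text in decode_timestamped(sample_output):
    print(f"[{start:.1f}s - {end:.1f}s] {text}")
```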


## Tutorial Roundup

Throughout this tutorial, we've explored how to:

- Set up an audio transcription pipeline using the OpenAI Whisper model.
- Acquire audio data for transcription.
- Log, load, and use the model with MLflow, leveraging both the native and pyfunc flavors for inference.
- Format the output for readability and practical use in a Jupyter Notebook environment.

We've seen the benefits of using MLflow for managing the machine learning lifecycle, including experiment tracking, model versioning, reproducibility, and deployment. By integrating MLflow with the Transformers library, we've streamlined the process of working with state-of-the-art speech recognition models, making it easier to track, manage, and deploy speech-to-text applications.