# Automatically Detecting Language in Speech Using OpenAI Whisper: A Step-by-Step Guide

## Introduction

In this tutorial, we'll learn how to automatically detect the language spoken in an audio file using OpenAI’s Whisper model. Whisper is an open-source speech recognition model that transcribes audio in many languages and can identify which language is being spoken.

We'll walk through the entire process: setting up a Google Colab environment, installing the necessary dependencies, loading the Whisper model, and analyzing an audio file to detect its language.

By the end of this tutorial, you'll have a working Google Colab notebook that can detect the language spoken in any given audio file.

## Prerequisites

To follow along, you'll need:

- A basic understanding of Python.
- A Google account for accessing Google Colab.
- Audio files for testing (optional, as we'll provide some sample files).

---

## Step 1: Setting up Google Colab Environment

First, open [Google Colab](https://colab.research.google.com) and create a new notebook.

### 1.1 Install the Required Libraries

We need two packages: `openai-whisper` (imported as `whisper`) and `torch`. We'll use Whisper's Python API to detect the language in speech, and PyTorch as the backend that runs the model.

In [None]:
# Install OpenAI's Whisper and PyTorch
!pip install git+https://github.com/openai/whisper.git
!pip install torch --extra-index-url https://download.pytorch.org/whl/cu117


Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-4l0f7inf
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-4l0f7inf
  Resolved https://github.com/openai/whisper.git to commit ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tiktoken (from openai-whisper==20231117)
  Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting triton<3,>=2.0.0 (from openai-whisper==20231117)
  Downloading triton-2.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.4 kB)
Downloading triton-2.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (168.1 MB)

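Explanation:
- `whisper` (published as `openai-whisper`) provides the Whisper models and the Python API.
- `torch` (PyTorch) runs the model computation. Colab ships with PyTorch preinstalled and `openai-whisper` declares it as a dependency, so the second command usually just confirms that the requirement is already satisfied.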
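
Whisper decodes audio files by calling the `ffmpeg` command-line tool, so it is worth confirming that it is available before going further. The check below is a minimal sketch using only the standard library; the helper name `ffmpeg_available` is made up for this example, and the suggested install command assumes a Debian-based environment such as Colab.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable can be found on the PATH."""
    return shutil.which("ffmpeg") is not None


if ffmpeg_available():
    print("ffmpeg found")
else:
    # Assumption: a Debian-based machine (as in Colab), where apt-get is available.
    print("ffmpeg not found; in Colab you could try: !apt-get install -y ffmpeg")
```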

---

## Step 2: Importing Libraries

After installing the libraries, we’ll import the required modules.

In [None]:
import whisper
import torch
from google.colab import files
import os

Explanation:
- `whisper`: For loading and running the Whisper model.
- `torch`: To manage model operations (e.g., checking if a GPU is available).
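- `files`: Colab's helper for uploading files from your computer.
- `os`: Standard-library file-system utilities. The steps below don't call it directly, but it is handy for checking paths.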
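
As a small illustration of the GPU check mentioned above, the sketch below picks a device and derives a matching half-precision setting. The names `pick_device` and `use_fp16` are introduced here for illustration only; the import is wrapped in `try`/`except` so the snippet also runs where PyTorch is missing.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # No PyTorch at all: fall back to CPU so the snippet still runs.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
# Half precision only makes sense on a GPU; on CPU Whisper uses 32-bit floats.
use_fp16 = device == "cuda"
print(f"device={device}, fp16={use_fp16}")
```

`whisper.load_model()` accepts a `device` argument and `model.transcribe()` accepts `fp16`, so these two values could be passed along in the later steps. In Colab, a GPU is only visible after selecting a GPU runtime under Runtime > Change runtime type.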

---

## Step 3: Load the Whisper Model

Whisper ships in several model sizes that trade speed for accuracy (e.g., `tiny`, `base`, `small`, `medium`, and `large`). For this tutorial, we'll use the `base` model, which strikes a balance between the two.

In [None]:
# Load the Whisper model
model = whisper.load_model("base")

100%|███████████████████████████████████████| 139M/139M [00:03<00:00, 40.5MiB/s]


Explanation:
- The `base` model is fast enough for general purposes. If you need higher accuracy, you can switch to `medium` or `large` models, but they may take longer to process.
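
To make the size trade-off concrete, here is a rough sketch that picks the largest model fitting a given amount of GPU memory. The parameter counts and memory figures are the approximate values listed in the Whisper README and should be treated as guidance rather than guarantees; the helper `largest_model_that_fits` is hypothetical.

```python
# (name, parameters in millions, approximate VRAM needed in GB)
# Figures as listed in the Whisper README; treat them as rough estimates.
MODEL_SPECS = [
    ("tiny", 39, 1),
    ("base", 74, 1),
    ("small", 244, 2),
    ("medium", 769, 5),
    ("large", 1550, 10),
]


def largest_model_that_fits(vram_gb: float) -> str:
    """Return the largest model whose approximate VRAM need is within budget."""
    fitting = [name for name, _params, need_gb in MODEL_SPECS if need_gb <= vram_gb]
    # Fall back to the smallest model when nothing fits the budget.
    return fitting[-1] if fitting else "tiny"


print(largest_model_that_fits(4))   # small
print(largest_model_that_fits(15))  # large
```

The returned name can be passed straight to `whisper.load_model()`.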

---

## Step 4: Upload an Audio File

To test the language detection, you can either upload an audio file or use one from the internet.

In [None]:
# Upload audio files
uploaded = files.upload()
audio_path = next(iter(uploaded))  # Name of the first uploaded file

Saving Derniere Danse_64.mp3 to Derniere Danse_64.mp3



Explanation:
- The `files.upload()` function allows you to upload audio files directly in Colab.
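- `uploaded` is a dictionary that maps each uploaded file name to its contents. `audio_path` is the name of the first uploaded file, which Colab saves in the current working directory; this is the path we pass to Whisper.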
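
If several files are uploaded at once, taking the first key may pick up something that is not audio. The sketch below filters by file extension instead; the extension set is an assumption and far from exhaustive, and `first_audio_file` is a helper invented for this example. A plain dictionary stands in for the value returned by `files.upload()`.

```python
from pathlib import Path

# Assumption: a few common audio extensions; extend as needed.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}


def first_audio_file(uploaded: dict) -> str:
    """Return the first uploaded file name that looks like an audio file."""
    for name in uploaded:
        if Path(name).suffix.lower() in AUDIO_EXTENSIONS:
            return name
    raise ValueError("No audio file found among: " + ", ".join(uploaded))


# Stand-in for the dict returned by files.upload() (file name -> bytes).
fake_upload = {"notes.txt": b"", "Derniere Danse_64.mp3": b""}
print(first_audio_file(fake_upload))  # Derniere Danse_64.mp3
```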

---

## Step 5: Transcribing and Detecting Language

Now, let’s use Whisper to transcribe the audio and detect the language spoken in the file.

In [None]:
# Transcribe the audio and detect language
result = model.transcribe(audio_path, task="transcribe", fp16=False)

# Get the detected language
detected_language = result['language']
print(f"Detected language: {detected_language}")

Detected language: fr


Explanation:
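- `model.transcribe()` takes the path of the audio file, decodes it, and runs the model over it.
  - `task="transcribe"`: Transcribes in the spoken language. This is the default; `task="translate"` would translate the speech into English instead.
  - `fp16=False`: Runs inference in 32-bit floating point. Half precision is not supported on CPU, so this avoids a warning there; on a GPU, leaving the default (`fp16=True`) is faster.
- The result is a dictionary holding the transcription and the detected language. Because no `language` argument was passed, Whisper detected the language from the first 30 seconds of audio; we read it from `result['language']`.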
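
If only the language is needed, transcribing the whole file is more work than necessary. The sketch below follows the language-detection example from the Whisper README: it loads the audio, keeps the first 30 seconds, and asks the model for per-language probabilities. The wrapper `detect_language_only` and the helper `top_languages` are names introduced here, and the probabilities in the demo are made up rather than real model output.

```python
def detect_language_only(model, audio_path):
    """Return (language_code, probabilities) from the first 30 seconds of audio."""
    import whisper  # imported lazily so the helper below works without Whisper

    audio = whisper.load_audio(audio_path)
    audio = whisper.pad_or_trim(audio)  # pad or cut to exactly 30 seconds
    # Build the log-Mel spectrogram with the number of Mel bins this model expects.
    mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
    _, probs = model.detect_language(mel)
    return max(probs, key=probs.get), probs


def top_languages(probs: dict, k: int = 3) -> list:
    """Return the k most probable (code, probability) pairs, best first."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


# Made-up probabilities for illustration only (not real model output).
example_probs = {"fr": 0.93, "en": 0.04, "es": 0.02, "it": 0.01}
print(top_languages(example_probs, k=2))  # [('fr', 0.93), ('en', 0.04)]
```

In the notebook, this could be called as `code, probs = detect_language_only(model, audio_path)`. Looking at the runner-up languages gives a rough sense of how confident the model is.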

---

## Step 6: Display the Language Detection Results

Whisper reports the language as a short code (usually two letters, such as `'en'` for English or `'es'` for Spanish; a few languages, such as Hawaiian with `'haw'`, use three). To make this more human-readable, we can map the language code to the corresponding language name.

In [None]:
# A hand-written mapping from language codes to names for a few common languages
LANGUAGE_CODES = {
    'en': 'English',
    'es': 'Spanish',
    'fr': 'French',
    'de': 'German',
    'it': 'Italian',
    'pt': 'Portuguese',
    'ru': 'Russian',
    'hi': 'Hindi',
    # Add more languages as needed
}

# Map detected language code to language name
detected_language_name = LANGUAGE_CODES.get(detected_language, 'Unknown')
print(f"Detected language: {detected_language_name}")

Detected language: French


Explanation:
- The `LANGUAGE_CODES` dictionary maps Whisper’s language codes to full language names.
- We use `get()` to find the corresponding language name or display 'Unknown' if the language code is not in the dictionary.
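
Instead of maintaining the mapping by hand, the full table can be taken from Whisper itself: `whisper.tokenizer.LANGUAGES` maps each code to a lowercase English name. The sketch below uses it with a small fallback dictionary so the snippet also runs where Whisper is not installed; `language_name` is a helper made up for this example.

```python
try:
    # Whisper's own table: language code -> lowercase English name.
    from whisper.tokenizer import LANGUAGES
except ImportError:
    # Fallback so the snippet still runs where Whisper is not installed.
    LANGUAGES = {"en": "english", "fr": "french", "es": "spanish"}


def language_name(code: str) -> str:
    """Return a capitalized language name, or "Unknown" for unrecognized codes."""
    return LANGUAGES.get(code, "unknown").title()


print(language_name("fr"))  # French
print(language_name("zz"))  # Unknown
```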

---

## Step 7: Display the Transcription

Let’s also display the transcription produced in Step 5 so we can sanity-check the model’s output.

In [None]:
# Display the transcription
transcription = result['text']
print(f"Transcription:\n{transcription}")

Transcription:
 ... Oh ma douce souffrance Pourquoi c'est charlée, tu recommences Je ne suis qu'un lettre sans importance Sans lui je suis un peu parou Je déambuse, seule dans le métro Le dernier temps pour oblié ma pénémence Je veux m'entruire que tout commence Oh ma douce souffrance Je remercie le jour à nuit Je danse avec plus de voilà plus Ah vous pouvez l'amour à un bras de miel Je danse dans dans dans dans dans dans dans Et dans les bruits je couvrai ces peurs Et dans tout au bien là, dans tout les paris Je me la fande de des demandes volvole volvole volvole volvole Je déconse sur ce chemin, ton absence J'ai pas pris mes sentois ma vignée Qu'un des corps qui bruit vies de sens Je remercie le jour à nuit Je danse avec plus de voilà plus Ah vous pouvez l'amour à un bras de miel Je danse dans dans dans dans dans dans dans Et dans les bruits je couvrai ces peurs Et dans tout au bien là, dans tout paris Je me la fande de des demandes volvole volvole volvole volvole Et dans tout au bie

Explanation:
- `result['text']` contains the text transcription of the audio. This helps verify both the transcription and language detection.
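
Besides the full text, the result dictionary also carries a `segments` list, where each entry has `start` and `end` times in seconds and its own `text`. The sketch below prints segments with timestamps, which makes it easier to spot where a transcription goes wrong (for example, the repeated words in the output above). The helper names are invented for this example, and `fake_result` is a made-up stand-in for a real result.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as MM:SS (fractions of a second are dropped)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def format_segments(result: dict) -> list[str]:
    """Return one "[start - end] text" line per transcription segment."""
    return [
        f"[{format_timestamp(seg['start'])} - {format_timestamp(seg['end'])}] "
        f"{seg['text'].strip()}"
        for seg in result.get("segments", [])
    ]


# Made-up stand-in for the dictionary returned by model.transcribe().
fake_result = {
    "segments": [
        {"start": 0.0, "end": 4.5, "text": " Oh ma douce souffrance"},
        {"start": 64.2, "end": 69.0, "text": " Je danse"},
    ]
}
for line in format_segments(fake_result):
    print(line)
```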

## Conclusion

In this tutorial, we successfully implemented automatic language detection in speech using OpenAI's Whisper model. We demonstrated how to load the Whisper model, upload an audio file, detect the spoken language, and transcribe the audio.

Whisper supports nearly 100 languages, making it a practical tool for multilingual speech applications. You can extend this notebook by experimenting with different audio files or using Whisper's other features, such as translating speech into English with `task="translate"` or word-level timestamps with `word_timestamps=True`.

Feel free to use this workflow as a starting point for more complex speech analysis projects!

---

## Next Steps

- **Experiment with Larger Models**: Try the `medium` or `large` versions of Whisper to improve accuracy.
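- **Handle Long Audio Files**: `model.transcribe()` already works through long recordings in 30-second windows, but it detects the language only once, from the start of the file. For recordings that switch languages, try splitting the audio into chunks and detecting the language of each part.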
- **Real-time Language Detection**: Implement a system that detects language in real-time using live audio streams.
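
As a starting point for chunked detection, the sketch below computes 30-second window boundaries over a 16 kHz sample array (the rate Whisper resamples to) and combines per-window results with a simple majority vote. Both helpers are hypothetical, and the majority vote is one possible strategy rather than something Whisper does itself; in practice, the per-window codes would come from running the model's language detection on each slice.

```python
from collections import Counter

SAMPLE_RATE = 16_000  # Whisper resamples audio to 16 kHz
WINDOW_SECONDS = 30   # Whisper processes audio in 30-second windows


def window_bounds(n_samples: int,
                  window_seconds: int = WINDOW_SECONDS,
                  sample_rate: int = SAMPLE_RATE) -> list:
    """Return (start, end) sample indices of consecutive windows."""
    step = window_seconds * sample_rate
    return [(start, min(start + step, n_samples))
            for start in range(0, n_samples, step)]


def majority_language(codes: list) -> str:
    """Return the most frequent language code (expects a non-empty list)."""
    return Counter(codes).most_common(1)[0][0]


# A 70-second recording splits into two full windows and one shorter one.
print(window_bounds(70 * SAMPLE_RATE))
# Made-up per-window detections for illustration.
print(majority_language(["fr", "fr", "en"]))  # fr
```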

Enjoy coding!