# **Speech Recognition Inference - Language Detection from Audio Files**

> This notebook demonstrates how to preprocess `.flac` audio files and use pre-trained models (FNN, CNN, CNN-LSTM) for inference.

> The models themselves are built and trained in ['**speech recognition 01**'](https://colab.research.google.com/drive/1b2iAc8ye8DPcHz3LIHYa2N2o1oqQZgXs?usp=drive_link).


## Mount Google Drive

In [88]:
# Mount Google Drive
from google.colab import drive
import os

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Define Paths to Models and Select Audio Files to Test

In [99]:
# Define paths to models
MODEL_PATHS = {
    "FNN": "/content/drive/MyDrive/Language_Detection/models/FNN.h5",
    "CNN": "/content/drive/MyDrive/Language_Detection/models/CNN.h5",
    "CNN_LSTM": "/content/drive/MyDrive/Language_Detection/models/CNN_LSTM.h5"
}
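
Because the model paths point into a mounted Drive folder, a quick existence check can surface a typo or an unmounted drive before any model is loaded. The helper below is a minimal sketch; `find_missing_files` is a hypothetical name and is not part of the original notebook.

```python
import os

def find_missing_files(paths):
    """Return the keys of `paths` whose file does not exist on disk."""
    return [name for name, path in paths.items() if not os.path.isfile(path)]

# Hypothetical usage with the MODEL_PATHS dictionary defined above:
# missing = find_missing_files(MODEL_PATHS)
# if missing:
#     raise FileNotFoundError(f"Model files not found for: {missing}")
```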

In [90]:
# Define paths to audio files
AUDIO_FILES = [
    "/content/drive/MyDrive/Language_Dataset/test/de_f_63f5b79c76cf5a1a4bbd1c40f54b166e.fragment20.flac",
    "/content/drive/MyDrive/Language_Dataset/test/en_f_67a0cba10d171b24039a79faa1d4d603.fragment50.flac",
    "/content/drive/MyDrive/Language_Dataset/test/es_f_50298ab71aaba8508ebeef49d853df11.fragment82.flac",
    "/content/drive/MyDrive/Language_Dataset/test/de_m_923551d571cc437382d0294dda2dd0aa.fragment49.flac",
    "/content/drive/MyDrive/Language_Dataset/test/en_m_b74b2bf2af570393cae91f4ed89cece7.fragment17.flac",
]
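
The test file names appear to follow the pattern `<language>_<gender>_<hash>.fragment<N>.flac`. Assuming that convention holds (it is inferred from the five paths above, not documented in the notebook), the expected language can be read from the file name and later compared with the model predictions. `parse_audio_filename` and `LANGUAGE_NAMES` are hypothetical names introduced for this sketch.

```python
import os
import re

# Assumed naming convention, inferred from the listed test files:
# <language>_<gender>_<hash>.fragment<N>.flac
FILENAME_PATTERN = re.compile(
    r"^(?P<lang>[a-z]{2})_(?P<gender>[fm])_(?P<clip>[0-9a-f]+)"
    r"\.fragment(?P<fragment>\d+)\.flac$"
)

# Language codes seen in the file names, mapped to the labels the notebook prints.
LANGUAGE_NAMES = {"en": "English", "es": "Spanish", "de": "German"}

def parse_audio_filename(file_path):
    """Extract language, speaker gender, and fragment number from a file name."""
    name = os.path.basename(file_path)
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unexpected file name format: {name}")
    code = match["lang"]
    return {
        "language": LANGUAGE_NAMES.get(code, code),  # fall back to the raw code
        "gender": match["gender"],
        "fragment": int(match["fragment"]),
    }
```

For the first entry of `AUDIO_FILES`, this returns German, `f`, and fragment 20.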

## Load Necessary Libraries

In [91]:
# Load necessary libraries
import tensorflow as tf
import numpy as np
import librosa
from tabulate import tabulate  # To create formatted tables
from IPython.display import Audio, display

## Define Audio Preprocessing Function
The audio files must be converted into a form the models accept. Here we extract MFCC (Mel-frequency cepstral coefficient) features.

> The function `preprocess_audio` loads up to `duration` seconds of audio, extracts `n_mfcc` MFCCs per frame, and averages them over time into a single feature vector. Reshaping for each model happens later, in `predict_for_file`.

In [92]:
def preprocess_audio(file_path, duration=10, sr=22050, n_mfcc=13):
    """
    Preprocess audio file for inference.
    Args:
        file_path (str): Path to the audio file.
        duration (int): Duration to load (seconds).
        sr (int): Sampling rate.
        n_mfcc (int): Number of MFCCs to extract.
    Returns:
        np.ndarray: Feature vector of shape (n_mfcc,), i.e. (13,) by default.
            The caller adds the batch and channel axes each model expects.
    """
    audio, _ = librosa.load(file_path, sr=sr, duration=duration)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    mfcc_scaled = np.mean(mfcc.T, axis=0)  # Average across time dimension
    return mfcc_scaled
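
To make the shapes concrete, the sketch below replaces the real MFCC matrix with random numbers of the same layout, `(n_mfcc, n_frames)`, and applies the same averaging and `np.expand_dims` calls used in this notebook. The frame count is illustrative only; the actual value depends on the clip length and librosa's hop length.

```python
import numpy as np

n_mfcc, n_frames = 13, 431  # illustrative sizes, not measured from a real file
rng = np.random.default_rng(0)

# Stand-in for the output of librosa.feature.mfcc: one row per coefficient,
# one column per time frame.
mfcc = rng.normal(size=(n_mfcc, n_frames))

# Same reduction as preprocess_audio: average each coefficient over time.
features = np.mean(mfcc.T, axis=0)

# Same reshaping as predict_for_file.
fnn_input = np.expand_dims(features, axis=0)        # add batch axis
cnn_input = np.expand_dims(features, axis=(0, -1))  # add batch and channel axes

print(features.shape, fnn_input.shape, cnn_input.shape)
# → (13,) (1, 13) (1, 13, 1)
```

Averaging over time discards all temporal structure, so every clip is reduced to 13 numbers regardless of its length.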


## Make Predictions Using All Models
Use the preprocessed audio data as input to the models for language detection.

> `predict_for_file` preprocesses the audio file, runs all three models on it, and maps each numeric prediction back to a language label.

> Each model predicts a language with a confidence score; the results are printed as a table rather than returned.

In [93]:
def predict_for_file(models, audio_file):
    """
    Perform inference with the given models for a single audio file and present the results in a formatted output,
    including an audio player for playback.
    Args:
        models (dict): Mapping from model name to the path of its saved `.h5` file.
        audio_file (str): Path to the audio file.
    """
    # Display the audio file
    print(f"\n🎵 Processing File: {os.path.basename(audio_file)} 🎵")
    display(Audio(audio_file))  # Embed the audio player for the given file

    # Load models
    loaded_models = {name: tf.keras.models.load_model(path) for name, path in models.items()}

    # Preprocess the audio file
    audio_features = preprocess_audio(audio_file)  # Preprocess the audio

    # Prepare a table for the results
    results = []

    # Make predictions with each model
    for model_name, model in loaded_models.items():
        # Reshape input based on model type
        if model_name == "FNN":
            input_data = np.expand_dims(audio_features, axis=0)  # (1, 13)
        else:
            input_data = np.expand_dims(audio_features, axis=(0, -1))  # (1, 13, 1)

        # Make prediction
        prediction = model.predict(input_data)
        predicted_class = np.argmax(prediction, axis=-1)
        confidence = np.max(prediction)

        # Map predicted class to language label
        label_map = {0: "English", 1: "Spanish", 2: "German"}
        language = label_map[predicted_class[0]]

        # Add result to the table
        results.append([model_name, language, f"{confidence:.2%}"])

    # Print results in a tabular format
    headers = ["Model", "Predicted Language", "Confidence"]
    print(tabulate(results, headers=headers, tablefmt="fancy_grid"))
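
The decoding step inside the loop (arg-max over the class probabilities, then a lookup in the label map) can be isolated as a small function. This is a sketch using plain Python lists; `decode_prediction` is a hypothetical helper, the probabilities are made up for illustration, and the label order is copied from `label_map` above and is assumed to match the order used during training.

```python
# Label order copied from the notebook; assumed to match the training encoding.
LABEL_MAP = {0: "English", 1: "Spanish", 2: "German"}

def decode_prediction(probabilities, label_map=LABEL_MAP):
    """Return (label, confidence) for the most probable class."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return label_map[best_index], probabilities[best_index]

# Made-up probability vector for one clip.
language, confidence = decode_prediction([0.02, 0.10, 0.88])
print(f"{language} ({confidence:.2%})")
# → German (88.00%)
```

Note that the confidence is simply the largest output probability; a high value does not guarantee that the prediction is correct.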

## Display Audio and Predictions
Call `predict_for_file` on each test file to play the audio and display the predictions in a clear format.

> The `predict_for_file` function first plays the audio file for listening, then displays the predictions from all three models.

> The predictions are printed in a readable format, indicating the language each model predicts for the given audio file.

In [100]:
# Run predictions
predict_for_file(MODEL_PATHS, AUDIO_FILES[0])



🎵 Processing File: de_f_63f5b79c76cf5a1a4bbd1c40f54b166e.fragment20.flac 🎵




[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 91ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 143ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 346ms/step
╒══════════╤══════════════════════╤══════════════╕
│ Model    │ Predicted Language   │ Confidence   │
╞══════════╪══════════════════════╪══════════════╡
│ FNN      │ German               │ 99.46%       │
├──────────┼──────────────────────┼──────────────┤
│ CNN      │ English              │ 99.36%       │
├──────────┼──────────────────────┼──────────────┤
│ CNN_LSTM │ German               │ 99.89%       │
╘══════════╧══════════════════════╧══════════════╛


In [101]:
# Run predictions
predict_for_file(MODEL_PATHS, AUDIO_FILES[1])


🎵 Processing File: en_f_67a0cba10d171b24039a79faa1d4d603.fragment50.flac 🎵




[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 85ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 113ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 363ms/step
╒══════════╤══════════════════════╤══════════════╕
│ Model    │ Predicted Language   │ Confidence   │
╞══════════╪══════════════════════╪══════════════╡
│ FNN      │ Spanish              │ 45.09%       │
├──────────┼──────────────────────┼──────────────┤
│ CNN      │ English              │ 100.00%      │
├──────────┼──────────────────────┼──────────────┤
│ CNN_LSTM │ English              │ 99.98%       │
╘══════════╧══════════════════════╧══════════════╛


In [102]:
# Run predictions
predict_for_file(MODEL_PATHS, AUDIO_FILES[2])


🎵 Processing File: es_f_50298ab71aaba8508ebeef49d853df11.fragment82.flac 🎵




[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 76ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 111ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 337ms/step
╒══════════╤══════════════════════╤══════════════╕
│ Model    │ Predicted Language   │ Confidence   │
╞══════════╪══════════════════════╪══════════════╡
│ FNN      │ Spanish              │ 67.56%       │
├──────────┼──────────────────────┼──────────────┤
│ CNN      │ Spanish              │ 99.67%       │
├──────────┼──────────────────────┼──────────────┤
│ CNN_LSTM │ Spanish              │ 88.86%       │
╘══════════╧══════════════════════╧══════════════╛


In [103]:
# Run predictions
predict_for_file(MODEL_PATHS, AUDIO_FILES[3])


🎵 Processing File: de_m_923551d571cc437382d0294dda2dd0aa.fragment49.flac 🎵




[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 82ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 146ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 339ms/step
╒══════════╤══════════════════════╤══════════════╕
│ Model    │ Predicted Language   │ Confidence   │
╞══════════╪══════════════════════╪══════════════╡
│ FNN      │ German               │ 97.31%       │
├──────────┼──────────────────────┼──────────────┤
│ CNN      │ German               │ 99.99%       │
├──────────┼──────────────────────┼──────────────┤
│ CNN_LSTM │ German               │ 100.00%      │
╘══════════╧══════════════════════╧══════════════╛


In [104]:
# Run predictions
predict_for_file(MODEL_PATHS, AUDIO_FILES[4])


🎵 Processing File: en_m_b74b2bf2af570393cae91f4ed89cece7.fragment17.flac 🎵




[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 61ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 106ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 326ms/step
╒══════════╤══════════════════════╤══════════════╕
│ Model    │ Predicted Language   │ Confidence   │
╞══════════╪══════════════════════╪══════════════╡
│ FNN      │ English              │ 88.56%       │
├──────────┼──────────────────────┼──────────────┤
│ CNN      │ English              │ 100.00%      │
├──────────┼──────────────────────┼──────────────┤
│ CNN_LSTM │ English              │ 100.00%      │
╘══════════╧══════════════════════╧══════════════╛
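
The five result tables can be summarized by comparing each prediction with the language implied by the file name prefix (`de`, `en`, `es`). The sketch below hard-codes the predictions printed above; treating the prefix as the ground-truth label is an assumption, and five clips are far too few to rank the models reliably.

```python
# Expected languages, inferred from the file name prefixes (de, en, es, de, en).
expected = ["German", "English", "Spanish", "German", "English"]

# Predictions copied from the five tables above, in file order.
predictions = {
    "FNN":      ["German", "Spanish", "Spanish", "German", "English"],
    "CNN":      ["English", "English", "Spanish", "German", "English"],
    "CNN_LSTM": ["German", "English", "Spanish", "German", "English"],
}

def accuracy(predicted, truth):
    """Fraction of positions where the predicted label equals the true label."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)

scores = {name: accuracy(labels, expected) for name, labels in predictions.items()}
for name, score in scores.items():
    print(f"{name}: {score:.0%}")
# → FNN: 80%
# → CNN: 80%
# → CNN_LSTM: 100%
```

Both misclassifications are worth a closer look: the CNN labels the first German clip as English with 99.36% confidence, while the FNN's wrong answer on the English clip comes with only 45.09% confidence.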
