# Project 2: Who is talking?
## Model 1: Taken from TensorFlow
### Obtained 86% Accuracy

1.  Ximena Vazquez-Mellado Flores  171319
2.  Alejandro Sánchez Gónzalez 167299
3.  Ricardo Díaz Mendez 166435
4.  Juan Pablo Morales Durante 171657


# Introduction

This documentation offers an in-depth look at our machine learning project 'Who is talking?', which is dedicated to developing a robust model for speaker identification. Building on our prior project, in which we recorded diverse audio samples from multiple speakers, the core objective of this work is to develop a model capable of distinguishing individual speakers.

This notebook belongs to our second effort, in which two pre-trained models were used. It uses the pre-trained YAMNet model, obtained from TensorFlow Hub, which can recognize 521 audio events. We use YAMNet to extract high-quality embeddings from the audio, and then train a custom classifier on those embeddings. In this way we adapt YAMNet to our specific audio classification task without retraining YAMNet itself, and we save the resulting model to Google Drive. The program was designed for speaker recognition, but it can be adapted to other audio classification tasks.



# Libraries and dependencies
### Dependencies to override the version of TensorFlow and TensorFlow I/O

We begin the program by installing and importing the necessary extensions that will be used in the code.

TensorFlow is an open-source library for numerical computation and machine learning. The notebook pins TensorFlow to version 2.11.
TensorFlow I/O is an extension library for TensorFlow that provides support for various file formats and file systems; it is pinned to version 0.28.
These two versions are installed together to have an environment compatible with the YAMNet model.

PyDub is a simple and easy-to-use Python library for audio processing. It allows for manipulation of audio with a simple and Pythonic interface.

### Imported Libraries:
- **os:** Provides a portable way of using operating system-dependent functionality to interact with the file system.
- **random:** Implements pseudo-random number generators for various distributions.
- **numpy (as np):** Adds support for large, multi-dimensional arrays and matrices, along with a collection of high-level mathematical functions to operate on these arrays.
- **pandas (as pd):** Offers data structures and operations for manipulating numerical tables and time series.
- **matplotlib.pyplot (as plt):** Provides a MATLAB-like plotting framework for creating static, interactive, and animated visualizations in Python.
- **pydub.AudioSegment:** Used for manipulating audio with an easy-to-use interface.
- **tensorflow_io (as tfio):** Extends TensorFlow functionality to handle audio processing and I/O operations.
- **tensorflow (as tf):** Serves as the backbone of machine learning and neural network operations.
- **tensorflow_hub (as hub):** Facilitates the transfer learning by allowing the use of reusable machine learning modules.
- **keras.models.load_model:** Used to load a saved Keras model from disk.
- **sklearn.model_selection.train_test_split:** Utility function to split data arrays into two subsets: for training data and for testing data.

In [None]:
!pip install -q "tensorflow==2.11.*"
!pip install -q "tensorflow_io==0.28.*"
!pip install pydub


In [None]:
import os
from IPython import display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_io as tfio
import random
from pydub import AudioSegment
from sklearn.model_selection import train_test_split
from keras.models import load_model

# Classes & data

In this step, we prepare the audio data for processing and feeding into a machine learning model. This involves defining the classes, loading audio files, segmenting them, and creating a structured dataset.

The classes represent different categories or labels in the dataset, each corresponding to a different speaker.

The TensorFlow decorator @tf.function is applied to the function "load_wav_16k_mono", which loads audio files as 16 kHz mono signals. This function is vital for our model to work correctly, since YAMNet requires mono audio input with a sample rate of 16 kHz.
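
As a quick sanity check before loading, Python's standard-library `wave` module can report a file's channel count and sample rate. The sketch below is only illustrative: `describe_wav`, `write_test_tone`, and the file name `example_tone.wav` are hypothetical names, and the tone is synthetic data, not part of the project's dataset.

```python
import math
import os
import struct
import tempfile
import wave


def describe_wav(path):
    """Return (channels, sample_rate_hz, duration_s) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()
        duration = wav_file.getnframes() / sample_rate
    return channels, sample_rate, duration


def write_test_tone(path, sample_rate=44100, seconds=1.0, channels=2):
    """Write a synthetic 440 Hz, 16-bit PCM tone (hypothetical test data)."""
    frames = bytearray()
    for n in range(int(sample_rate * seconds)):
        sample = int(32767 * 0.5 * math.sin(2 * math.pi * 440 * n / sample_rate))
        # Repeat the same sample for every channel (interleaved frame).
        frames += struct.pack("<h", sample) * channels
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


tone_path = os.path.join(tempfile.gettempdir(), "example_tone.wav")
write_test_tone(tone_path)
channels, sample_rate, duration = describe_wav(tone_path)

# A file like this one (stereo, 44.1 kHz) is not yet in the format YAMNet expects.
needs_conversion = channels != 1 or sample_rate != 16000
print(channels, sample_rate, duration, needs_conversion)
```

A file flagged this way is exactly the case that "load_wav_16k_mono" handles, by requesting a single channel and resampling to 16 kHz.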

The "segment_audios" function splits each audio file into one-second clips and saves them in a specified folder. Each clip keeps the base name of its source file followed by a running index, so the speaker can still be identified from the file name.
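
To make the segmentation arithmetic concrete, the following minimal sketch computes the start and end of each clip for a recording of a given length. `segment_bounds` is a hypothetical helper, not part of the notebook, and the clamping of the last clip is written out explicitly here on the assumption that slicing past the end of the audio simply yields a shorter final clip.

```python
def segment_bounds(total_ms, segment_ms=1000):
    """Return (start_ms, end_ms) pairs covering a clip of total_ms milliseconds."""
    return [
        (start, min(start + segment_ms, total_ms))
        for start in range(0, total_ms, segment_ms)
    ]


bounds = segment_bounds(3500)
print(bounds)  # [(0, 1000), (1000, 2000), (2000, 3000), (3000, 3500)]
```

Under that assumption, a 3.5-second recording yields three full clips and one half-second remainder, so the last clip of most recordings is shorter than one second.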

"drive.mount('/content/drive')": After "from google.colab import drive" imports the helper, this line mounts Google Drive so the notebook can access the dataset, whose path is saved in the variable "dataset_paths".
"segment_audios("divided_audios", dataset_paths)": The audio segmenting function is called with the dataset path to preprocess the audio files and save the clips in a local folder named "divided_audios".

"for f in os.listdir(path):": The audio files are sorted into lists based on the initial characters of the filenames, corresponding to the predefined classes.
"dataframe = pd.DataFrame(data)": A dictionary, and from it a dataframe, is created to map each audio file to its class label and to a randomly generated key that is used later to split the dataset.
"ds = tf.data.Dataset.from_tensor_slices((filenames, labels, keys))": The columns of filenames, labels, and keys are transformed into a TensorFlow dataset for efficient input to a machine learning model.

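The prefix-based labeling can also be expressed as a lookup against the class dictionary. This is a sketch of an alternative, not the notebook's code: `label_for` is a hypothetical helper, it assumes every file name begins with the full class name, and unlike the notebook (which assigns any unmatched file to "jano") it raises an error for unknown prefixes.

```python
mapped_classes = {"jp": 0, "xime": 1, "jano": 2, "rich": 3,
                  "hele": 4, "gaby": 5, "fede": 6, "faro": 7}


def label_for(filename):
    """Return the integer label whose class name prefixes the file name."""
    for name, label in mapped_classes.items():
        if filename.startswith(name):
            return label
    raise ValueError(f"No known speaker prefix in {filename!r}")


# Hypothetical clip names in the "{base}_{index}.wav" pattern.
print(label_for("jp_0.wav"), label_for("jano_12.wav"), label_for("faro_3.wav"))  # 0 2 7
```

Raising on an unknown prefix makes a misnamed file visible immediately instead of silently adding it to one speaker's class.
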
The dataset is then mapped using the load_wav_for_map function, which applies the load_wav_16k_mono function to each element of the dataset.
"ds = ds.map(load_wav_for_map)": With this preparation, the audio dataset is now in a suitable format to be used to train the model, with audio files segmented, categorized, and encoded into tensors.

In [None]:
classes = ["jp", "xime", "jano", "rich", "hele", "gaby", "fede", "faro"]
mapped_classes = {
    "jp": 0,
    "xime": 1,
    "jano": 2,
    "rich": 3,
    "hele": 4,
    "gaby": 5,
    "fede": 6,
    "faro": 7
}

In [None]:
# -- Utility Function to load file as 16KHz mono --
@tf.function
def load_wav_16k_mono(filename):
    """ Load a WAV file, convert it to a float tensor, resample to 16 kHz single-channel audio. """
    file_contents = tf.io.read_file(filename)
    wav, sample_rate = tf.audio.decode_wav(
          file_contents,
          desired_channels=1)
    wav = tf.squeeze(wav, axis=-1)
    sample_rate = tf.cast(sample_rate, dtype=tf.int64)
    wav = tfio.audio.resample(wav, rate_in=sample_rate, rate_out=16000)
    return wav

In [None]:
def segment_audios(output_folder, root_folder):
    # Segment length in milliseconds (1 second = 1000 ms)
    duracion_segmento = 1000

    # Create the output directory if it does not exist
    os.makedirs(output_folder, exist_ok=True)

    for folder, _, files in os.walk(root_folder):
        for file in files:
            if file.endswith(".wav"):  # Only process WAV files; adjust the extension if needed
                file_path = os.path.join(folder, file)
                i = 0  # Segment counter, restarted for every source file
                # Base name of the file; its prefix identifies the speaker (class)
                nombre_base = os.path.splitext(os.path.basename(file_path))[0]

                # Load the original audio file
                audio = AudioSegment.from_file(file_path)

                # Total duration of the audio in milliseconds
                duracion_total = len(audio)

                # Split the audio into 1-second segments and export each one
                for inicio_ms in range(0, duracion_total, duracion_segmento):
                    fin_ms = inicio_ms + duracion_segmento

                    # File name for this segment: {nombre_base}_{i}.wav
                    nombre_archivo = f"{nombre_base}_{i}.wav"
                    i += 1
                    # TODO: save each segment in a per-speaker folder
                    segmento = audio[inicio_ms:fin_ms]
                    segmento.export(os.path.join(output_folder, nombre_archivo), format="wav")

    print(f"Segmentos de audio guardados con éxito en '{output_folder}'.")

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Path to the audio files
dataset_paths = '/content/drive/MyDrive/SelectedTopics/project2'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!rm -rf /content/divided_audios
segment_audios("divided_audios", dataset_paths)

Segmentos de audio guardados con éxito en 'divided_audios'.


In [None]:
# get all audios starting by name
path = "/content/divided_audios"
jp_audios = []
xime_audios = []
jano_audios = []
rich_audios = []
hele_audios = []
gaby_audios = []
fede_audios = []
faro_audios = []

for f in os.listdir(path):
  if f[0] == "j" and f[1] == "p":
    jp_audios.append(f)
  elif f[0] == "x":
    xime_audios.append(f)
  elif f[0] == "r":
    rich_audios.append(f)
  elif f[0] == "h":
    hele_audios.append(f)
  elif f[0] == "g":
    gaby_audios.append(f)
  elif f[0] == "f" and f[1] == "e":
    fede_audios.append(f)
  elif f[0] == "f" and f[1] == "a":
    faro_audios.append(f)
  else:
    jano_audios.append(f)

# create dataframe with filename, target
df = {}
for f in os.listdir(path):
  if f in jp_audios:
    df[f"/content/divided_audios/{f}"] = 0
  elif f in xime_audios:
    df[f"/content/divided_audios/{f}"] = 1
  elif f in jano_audios:
    df[f"/content/divided_audios/{f}"] = 2
  elif f in rich_audios:
    df[f"/content/divided_audios/{f}"] = 3
  elif f in hele_audios:
    df[f"/content/divided_audios/{f}"] = 4
  elif f in gaby_audios:
    df[f"/content/divided_audios/{f}"] = 5
  elif f in fede_audios:
    df[f"/content/divided_audios/{f}"] = 6
  else:
    df[f"/content/divided_audios/{f}"] = 7

# assign each clip a random key in 0..4 (randint is inclusive); the keys are used later to split the data
keys = []
for i in range(len(os.listdir("/content/divided_audios"))):
  keys.append(random.randint(0,4))

data = {
    'File': list(df.keys()),
    'Label': list(df.values()),
    'Key': keys
}

dataframe = pd.DataFrame(data)

# transform into tensors
filenames = dataframe["File"]
labels = dataframe["Label"]
keys = dataframe["Key"]
ds = tf.data.Dataset.from_tensor_slices((filenames, labels, keys))

In [None]:
ds.element_spec

(TensorSpec(shape=(), dtype=tf.string, name=None),
 TensorSpec(shape=(), dtype=tf.int64, name=None),
 TensorSpec(shape=(), dtype=tf.int64, name=None))

In [None]:
def load_wav_for_map(filename, label, key):
  return load_wav_16k_mono(filename), label, key

ds = ds.map(load_wav_for_map)
ds.element_spec

(TensorSpec(shape=<unknown>, dtype=tf.float32, name=None),
 TensorSpec(shape=(), dtype=tf.int64, name=None),
 TensorSpec(shape=(), dtype=tf.int64, name=None))

# Downloading the YAMNet model

Next, the code uses YAMNet, a deep learning model trained on the AudioSet corpus of labeled audio events, to extract features from the audio data. These features are then used to train a custom model and to perform inference.

We begin by setting the URL where the YAMNet model is hosted on TensorFlow Hub to a variable called yamnet_model_handle.

The "extract_embedding" function applies YAMNet to the audio data to extract embeddings, which are rich representations of the audio features.
The dataset is processed to extract features using the map function, which applies the "extract_embedding" function to each element.
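
Because YAMNet returns one embedding per audio frame, a single clip can produce several embeddings, and each of them needs its own copy of the clip's label and key. The sketch below mirrors that repeat-then-flatten step with plain Python lists; `expand_and_flatten` is a hypothetical helper, and the 3-dimensional vectors are toy stand-ins for the real 1024-dimensional embeddings.

```python
def expand_and_flatten(clips):
    """Pair every frame embedding with its clip's label and key.

    `clips` is a list of (embeddings, label, key) tuples, where `embeddings`
    is a list of per-frame vectors. The result has one row per frame.
    """
    rows = []
    for embeddings, label, key in clips:
        for embedding in embeddings:
            rows.append((embedding, label, key))
    return rows


# Toy data: the first clip has two frames, the second has one.
clips = [
    ([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], 0, 3),
    ([[0.7, 0.8, 0.9]], 5, 1),
]
rows = expand_and_flatten(clips)
print(len(rows))  # 3
```

After this step every row is an independent training example, which is why the element spec of the dataset shows a single vector of shape (1024,) per element.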

In [None]:
yamnet_model_handle = 'https://tfhub.dev/google/yamnet/1'
yamnet_model = hub.load(yamnet_model_handle)

In [None]:
# applies the embedding extraction model to a wav data
def extract_embedding(wav_data, label, key):
  ''' run YAMNet to extract embedding from the wav data '''
  scores, embeddings, spectrogram = yamnet_model(wav_data)
  num_embeddings = tf.shape(embeddings)[0]
  return (embeddings, tf.repeat(label, num_embeddings), tf.repeat(key, num_embeddings))

# extract embedding
ds = ds.map(extract_embedding).unbatch()
ds.element_spec

(TensorSpec(shape=(1024,), dtype=tf.float32, name=None),
 TensorSpec(shape=(), dtype=tf.int64, name=None),
 TensorSpec(shape=(), dtype=tf.int64, name=None))

# Split data and remove "keys" column

The dataset, now stored in the variable "cached_ds", is split into training, validation, and test datasets according to each element's key: keys 0 and 1 go to training, key 2 to validation, and key 3 to testing. Because the keys were drawn from 0 to 4 inclusive, elements with key 4 are not assigned to any subset.
With the chained calls **"train_ds.map(remove_fold_column)"** and **"train_ds.cache().shuffle(1000).batch(32).prefetch(tf.data.AUTOTUNE)"**, the datasets are further prepared:
- train_ds.map(remove_fold_column): This applies the remove_fold_column function to each element in the train_ds dataset. The function removes the keys field from the dataset, which is no longer needed after splitting the data.
- .cache(): This method caches the dataset in memory. This is done after mapping and before shuffling, batching, and prefetching to ensure that these operations do not need to be re-executed when the dataset is iterated over multiple epochs. Caching can speed up training by reducing the time spent on data loading and preprocessing during each epoch.
- .shuffle(1000): This method randomly shuffles the elements of the dataset. 1000 is the size of the buffer that shuffle will use to sample elements. A large buffer size ensures better randomness but uses more memory, while a smaller buffer size uses less memory but may reduce randomness. The chosen buffer size here is 1000, which means that the shuffling will happen within a window of 1000 elements.
- .batch(32): This method combines consecutive elements of the dataset into batches. The argument "32" specifies the batch size, meaning each batch will contain 32 elements (i.e., training examples). Batching is required for training neural networks, as it allows for more efficient gradient calculations by processing multiple data points in parallel.
- .prefetch(tf.data.AUTOTUNE): This method allows the dataset to prepare subsequent batches while the current batch is being processed. This can improve latency and throughput at the cost of using additional memory to store the prefetched batches. "tf.data.AUTOTUNE" is an argument that allows TensorFlow to automatically adjust the number of batches to prefetch dynamically, based on available resources and runtime conditions. This helps to optimize the prefetching process without manual tuning.
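
The key-based split can be simulated without TensorFlow to see roughly what share of the data each subset receives. This is a sketch under stated assumptions: `split_by_key`, `SPLIT_FOR_KEY`, the fixed seed, and the 10,000 stand-in items are all illustrative, while the key range and the key-to-subset mapping follow the notebook.

```python
import random

# Mapping used by the notebook's filters; key 4 matches none of them.
SPLIT_FOR_KEY = {0: "train", 1: "train", 2: "val", 3: "test"}


def split_by_key(items, seed=0):
    """Assign each item a random key in 0..4 and group items by subset."""
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": [], "unused": []}
    for item in items:
        key = rng.randint(0, 4)  # inclusive on both ends: 0, 1, 2, 3 or 4
        splits[SPLIT_FOR_KEY.get(key, "unused")].append(item)
    return splits


splits = split_by_key(range(10_000))
sizes = {name: len(members) for name, members in splits.items()}
print(sizes)
```

With five equally likely keys, about 40% of the items land in training and about 20% each in validation, test, and the unused group.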

In [None]:
### split data
cached_ds = ds.cache()
train_ds = cached_ds.filter(lambda embedding, label, keys: keys < 2)  # keys 0 and 1
val_ds = cached_ds.filter(lambda embedding, label, keys: keys == 2)
test_ds = cached_ds.filter(lambda embedding, label, keys: keys == 3)

# remove the folds column now that it's not needed anymore
remove_fold_column = lambda embedding, label, keys: (embedding, label)

train_ds = train_ds.map(remove_fold_column)
val_ds = val_ds.map(remove_fold_column)
test_ds = test_ds.map(remove_fold_column)

train_ds = train_ds.cache().shuffle(1000).batch(32).prefetch(tf.data.AUTOTUNE)
val_ds = val_ds.cache().batch(32).prefetch(tf.data.AUTOTUNE)
test_ds = test_ds.cache().batch(32).prefetch(tf.data.AUTOTUNE)

# Create, compile and train the new model

Now that the dataset has been prepared, the model is defined, compiled and trained using the training and validation datasets.

### Creation of the new model
The first block of code defines a simple neural network model using TensorFlow's Keras API:
- tf.keras.Sequential: This is used to sequentially group a linear stack of layers into a TensorFlow Keras model.
- tf.keras.layers.Input: This specifies the input shape that the model will expect. Each input will be a 1D tensor (vector) with 1024 elements, which represents the embedding extracted from audio data by the YAMNet model. The "dtype" argument specifies that these elements are floating-point numbers.
- tf.keras.layers.Dense(512, activation='relu'): This is a fully connected layer (also known as a dense layer) with 512 neurons. The "activation='relu'" argument specifies that the Rectified Linear Unit (ReLU) function should be used as the activation function for each neuron in this layer.
- tf.keras.layers.Dense(8): This is another dense layer with 8 neurons, corresponding to the number of classes in the dataset (our classes 'xime', 'jp', 'jano', 'rich', and four others). Since this is the output layer and no activation function is specified, it implies that this model will output the raw scores (logits) for each class, which are typically passed through a softmax function during inference to obtain probabilities.
- name='my_model': This names the model "my_model", which can be useful for referencing the model later on.
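
The size of this network follows directly from the layer widths: a dense layer has one weight per input-unit pair plus one bias per unit. The short sketch below works through that arithmetic; `dense_params` is a hypothetical helper introduced only for illustration.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units


hidden = dense_params(1024, 512)  # embedding -> hidden layer
output = dense_params(512, 8)     # hidden layer -> one logit per speaker
total = hidden + output
print(hidden, output, total)  # 524800 4104 528904
```

These figures match the parameter counts reported by "my_model.summary()".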

### Compiling and training
The model is then compiled and trained using the training and validation datasets.
- my_model.compile(): This method configures the model for training.
- loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True): The loss function is set to Sparse Categorical Crossentropy, a standard loss for multi-class classification when the labels are integers. The "from_logits=True" argument indicates that the outputs of the model are raw logits (not normalized to probabilities by a softmax function).
- optimizer="adam": The optimizer is set to Adam, which adjusts the weights during training to minimize the loss.
- metrics=['accuracy']: The metric for evaluation is set to accuracy, which is the fraction of correctly classified instances among the total number of instances.
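
To show what the loss measures, here is a minimal pure-Python sketch of sparse categorical cross-entropy computed from logits for a single example. It is an illustration of the formula, not the Keras implementation; the function names and the logit values are made up.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def sparse_categorical_crossentropy(logits, label):
    """Negative log-probability of the true class, given raw logits."""
    return -math.log(softmax(logits)[label])


logits = [2.0, 0.5, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # made-up scores, 8 classes
loss_correct = sparse_categorical_crossentropy(logits, label=0)
loss_wrong = sparse_categorical_crossentropy(logits, label=2)
print(loss_correct < loss_wrong)  # True
```

The loss is small when the true class has the highest logit and grows as the model assigns it less probability, which is what training pushes down.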


After training, the model's performance is evaluated on the test dataset.
- loss, accuracy = my_model.evaluate(test_ds): The evaluate method computes the loss and accuracy metrics for the dataset provided.

In [None]:
# create new model
my_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1024,), dtype=tf.float32,
                          name='input_embedding'),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(8)  # xime, jp, jano, rich, and YouTubers
], name='my_model')

my_model.summary()

Model: "my_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_14 (Dense)            (None, 512)               524800    
                                                                 
 dense_15 (Dense)            (None, 8)                 4104      
                                                                 
Total params: 528,904
Trainable params: 528,904
Non-trainable params: 0
_________________________________________________________________


In [None]:
my_model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                 optimizer="adam",
                 metrics=['accuracy'])

In [None]:
history = my_model.fit(train_ds,
                       epochs=30,
                       validation_data=val_ds)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


In [None]:
loss, accuracy = my_model.evaluate(test_ds)

print(f"Loss: {loss}")
print(f"Accuracy: {accuracy*100:.2f}%")

Loss: 1.0290833711624146
Accuracy: 86.69%


# Final test of our new model

The function predict_audio_class is designed to take an audio file as input and predict which class the audio belongs to:
- "testing_wav_data = load_wav_16k_mono(audio_file)": This line calls the load_wav_16k_mono function to load the WAV file specified by audio_file, converting it to a mono-channel (single channel) audio with a sample rate of 16 kHz. The audio data is returned as a tensor.
- "scores, embeddings, spectrogram = yamnet_model(testing_wav_data)": Here, the yamnet_model is used to process the audio data and it returns three values:
  - scores: These are YAMNet's own per-frame prediction scores for each of the 521 AudioSet classes. They are not used by our classifier.
  - embeddings: These are the 1024-dimensional embeddings extracted from the audio. They serve as a compact representation of the audio's features.
  - spectrogram: This is the mel-spectrogram used by YAMNet as part of its feature extraction process.
- "result = my_model(embeddings).numpy()": The extracted embeddings are then passed to my_model to get the predictions. The output result is converted to a NumPy array for easier manipulation.
- "inferred_class = classes[result.mean(axis=0).argmax()]": YAMNet produces one embedding per audio frame, so "result" holds one row of logits per frame. This line averages those rows across the time dimension (axis=0) to get a single score per class for the whole file, then finds the index of the highest score, which corresponds to the most likely class. This index is used to look up the class label in the classes list.
- "return f"{audio_file} audiofile is most similar to {inferred_class}."": When finished, the function returns a formatted string stating which class the audio file is most similar to, based on the model's inference.
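
The averaging step can be sketched with plain Python lists to show why it helps. `infer_class` is a hypothetical stand-in for the last two lines of the prediction function, and the logits below are made-up values for a file with three frames.

```python
classes = ["jp", "xime", "jano", "rich", "hele", "gaby", "fede", "faro"]


def infer_class(frame_logits):
    """Average logits over frames, then pick the highest-scoring class."""
    n_frames = len(frame_logits)
    mean_logits = [sum(column) / n_frames for column in zip(*frame_logits)]
    best_index = max(range(len(mean_logits)), key=mean_logits.__getitem__)
    return classes[best_index]


# Rows are frames, columns are the eight classes.
frame_logits = [
    [0.2, 0.1, 2.5, 0.0, -1.0, 0.3, 0.0, 0.1],
    [1.9, 0.0, 1.5, 0.2, -0.5, 0.1, 0.0, 0.0],
    [0.1, 0.4, 3.0, 0.1, -0.8, 0.0, 0.2, 0.3],
]
print(infer_class(frame_logits))  # jano
```

In this toy example the second frame on its own would favor "jp", but averaging over all frames lets the file-level prediction follow the majority of the evidence.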

The prediction function is applied to the dataset of test files to infer the class.
The test_files list contains the file paths to the audio files that we want to classify. The for loop iterates through this list, and for each path in test_files, it calls the predict_audio_class function. This function processes the audio file and predicts which class it belongs to. The prediction result is then printed to the console.

Finally, the trained model is saved to Google Drive for later use.

In [None]:
def predict_audio_class(audio_file):
    testing_wav_data = load_wav_16k_mono(audio_file)
    scores, embeddings, spectrogram = yamnet_model(testing_wav_data)
    result = my_model(embeddings).numpy()

    inferred_class = classes[result.mean(axis=0).argmax()]
    return f"{audio_file} audiofile is most similar to {inferred_class}."

In [None]:
test_files = [
    '/content/drive/MyDrive/SelectedTopics/project2/test/janoT.wav',
    '/content/drive/MyDrive/SelectedTopics/project2/test/ximeT.wav',
    '/content/drive/MyDrive/SelectedTopics/project2/test/jpT.wav',
    '/content/drive/MyDrive/SelectedTopics/project2/test/richT.wav'
]

for test in test_files:
  print(predict_audio_class(test))

/content/drive/MyDrive/SelectedTopics/project2/test/janoT.wav audiofile is most similar to jano.
/content/drive/MyDrive/SelectedTopics/project2/test/ximeT.wav audiofile is most similar to xime.
/content/drive/MyDrive/SelectedTopics/project2/test/jpT.wav audiofile is most similar to jp.
/content/drive/MyDrive/SelectedTopics/project2/test/richT.wav audiofile is most similar to rich.


In [None]:
# Define the path where you want to save the model in your Google Drive
model_save_path = '/content/drive/MyDrive/SelectedTopics/project2/yamnet-final.h5'

# Save the Keras model
my_model.save(model_save_path)

## Conclusion
In this project, we developed an audio classification model capable of identifying individual speakers from a dataset of voice recordings. Utilizing TensorFlow and the YAMNet model from TensorFlow Hub, we processed audio files to extract meaningful embeddings, which served as feature representations for our classification task.

The dataset was segmented into 1-second audio clips to standardize the input and focus on short-term audio features. These segments were then used to train a neural network model made of dense layers applied to the YAMNet embeddings. The model was evaluated on its accuracy in classifying data held out from training: it reached 86.69% accuracy on the test split and identified the correct speaker for all four separate test recordings.