# Sound classification with YAMNet
----------------------------------

YAMNet is a deep net that predicts 521 audio event [classes](https://github.com/tensorflow/models/blob/master/research/audioset/yamnet/yamnet_class_map.csv) from the [AudioSet-YouTube corpus](http://g.co/audioset) it was trained on. It employs the
[Mobilenet_v1](https://arxiv.org/pdf/1704.04861.pdf) depthwise-separable
convolution architecture.

In simple words:

YAMNet is a pre-trained deep neural network that can predict audio events from 521 classes, such as laughter, barking, or a siren. It was developed by Google AI and is available on TensorFlow Hub.

YAMNet is based on the MobileNetV1 depthwise-separable convolution architecture, an efficient design originally built for image classification on mobile devices; YAMNet applies it to spectrogram patches of the audio. It is trained on the AudioSet corpus, a large dataset of over 2 million audio clips from YouTube videos.
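
To see why a depthwise-separable design is considered efficient, the weight counts of a standard convolution and a separable one can be compared. This is a minimal sketch that ignores biases; the layer sizes (3x3 kernel, 64 input channels, 128 output channels) are illustrative assumptions and are not taken from YAMNet's actual configuration.

```python
def standard_conv_params(kernel, c_in, c_out):
  """Weights in a standard k x k convolution (biases ignored)."""
  return kernel * kernel * c_in * c_out

def separable_conv_params(kernel, c_in, c_out):
  """Weights in a depthwise k x k convolution plus a 1 x 1 pointwise convolution."""
  depthwise = kernel * kernel * c_in   # one k x k filter per input channel
  pointwise = c_in * c_out             # 1 x 1 convolution that mixes channels
  return depthwise + pointwise

# Illustrative layer sizes (assumed, not YAMNet's real ones).
standard = standard_conv_params(3, 64, 128)
separable = separable_conv_params(3, 64, 128)
print(standard, separable, round(standard / separable, 1))
```

For these assumed sizes the separable layer needs roughly eight times fewer weights than the standard one.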

YAMNet can be used for a variety of tasks, such as:

* Sound event detection: YAMNet can be used to automatically detect the sound events that are present in an audio recording. This can be useful for tasks such as annotating recordings with the sounds they contain or identifying the source of a noise complaint.
* Audio tagging: YAMNet can be used to automatically tag audio recordings with the corresponding sound events. This can be useful for tasks such as organizing audio files or creating playlists.
* Feature extraction: YAMNet is a classifier and cannot synthesize audio, but the embeddings it returns alongside the scores can be used as input features for other audio models.

In [None]:
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import csv

import matplotlib.pyplot as plt
from IPython.display import Audio
import scipy.signal  # needed by ensure_sample_rate (scipy.signal.resample)
from scipy.io import wavfile

**Definitions**

----------------------------------------
**TensorFlow** is a popular open-source software library used for building and training machine learning models. It provides a framework that simplifies the process of developing and deploying machine learning algorithms. TensorFlow allows users to define and manipulate mathematical operations with multidimensional arrays called tensors, which are the fundamental building blocks of models.

-------------------------------------------
**TensorFlow Hub** is a repository of pre-trained machine learning models, together with the `tensorflow_hub` library used to load them. It lets developers reuse existing models and their learned knowledge for specific tasks without having to train from scratch.

---------------------------------------------------
**NumPy** is a popular Python library that stands for "Numerical Python." It provides a powerful set of tools and functions for working with arrays and matrices, making it an essential library for scientific computing and data analysis, particularly in machine learning. NumPy allows you to perform various mathematical operations on arrays efficiently, such as mathematical calculations, linear algebra, random number generation, and reshaping data. It provides a convenient and optimized way to handle large datasets and perform computations on them.

-------------------------------------------------------
**CSV**, which stands for "Comma-Separated Values," is a simple file format used to store tabular data, such as spreadsheets or database tables. It represents data in plain text: each line of the file is a row, and the values within a row are separated by commas. CSV files are commonly used in machine learning for storing and importing datasets; in Python they are read with the built-in `csv` module.

-------------------------------------------------------
**matplotlib.pyplot** is a module within the matplotlib library that provides a collection of functions for creating visualizations, such as plots, charts, and graphs, in Python. It allows you to display and customize data in a visually appealing manner.

-----------------------------------

**IPython.display** is a module that provides a set of functions to enhance the display of output within the IPython environment or Jupyter Notebook. It offers various capabilities to render and format different types of content, such as images, videos, audio, HTML, Markdown etc.

--------------------------------------------
**scipy.io** is a module within the SciPy library that provides functions for reading and writing data in various file formats, including MATLAB files, NetCDF files, and WAV files. This notebook uses its `wavfile` submodule to load WAV audio into a NumPy array.

----------------------------

### Load the Model from TensorFlow Hub.

Note: to read the documentation just follow the model's [url](https://tfhub.dev/google/yamnet/1)

In [None]:
# Load the model.
model = hub.load('https://tfhub.dev/google/yamnet/1')

The labels file is loaded from the model's assets; its path is given by `model.class_map_path()`.
You will load it into the `class_names` variable.

In [None]:
# Load the class names from the class map CSV that ships with the model.
def class_names_from_csv(class_map_csv_text):
  """Returns list of class names corresponding to score vector."""
  class_names = []
  with tf.io.gfile.GFile(class_map_csv_text) as csvfile:
    #tf.io.gfile.GFile(class_map_csv_text) is a function call in TensorFlow that creates a file object for reading the contents
    #of the specified file, class_map_csv_text. It is part of the tf.io.gfile module, which provides a file system interface
    #compatible with multiple storage systems, such as local files, Google Cloud Storage, and
    #Hadoop Distributed File System (HDFS).

    reader = csv.DictReader(csvfile)
    print("the rows -")
    for row in reader:
      class_names.append(row['display_name'])
      print(row)
  print("==============================================")
  print("list class_names")
  print(class_names)
  return class_names

class_map_path = model.class_map_path().numpy()
class_names = class_names_from_csv(class_map_path)
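
The CSV parsing step can be tried without downloading the model. This is a minimal sketch that reads from an in-memory string instead of the model's asset file; the `display_name` column is the one used above, while the other column names and all row values are made up for illustration.

```python
import csv
import io

# Made-up rows; only the 'display_name' column name comes from the notebook.
sample_csv = (
    "index,mid,display_name\n"
    "0,mid_0,Speech\n"
    "1,mid_1,\"Child speech, kid speaking\"\n"
)

def class_names_from_text(csv_text):
  """Returns the display_name column as a list, in file order."""
  reader = csv.DictReader(io.StringIO(csv_text))
  return [row['display_name'] for row in reader]

names = class_names_from_text(sample_csv)
print(names)
```

The second row contains a comma inside a quoted field, which is why the `csv` module is a safer choice than splitting each line on commas.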



Add a function that checks whether the loaded audio has the proper sample rate (16 kHz) and resamples it if not; audio at a different sample rate would degrade the model's results.

In [None]:
def ensure_sample_rate(original_sample_rate, waveform,
                       desired_sample_rate=16000):
  """Resample waveform if required."""
  if original_sample_rate != desired_sample_rate:
    desired_length = int(round(float(len(waveform)) /
                               original_sample_rate * desired_sample_rate))
    waveform = scipy.signal.resample(waveform, desired_length)
  return desired_sample_rate, waveform

This function takes three arguments:

1] original_sample_rate: The original sample rate of the waveform.

2] waveform: The waveform data.

3] desired_sample_rate: The desired sample rate of the waveform (16000 Hz by default).

The function first checks whether the original sample rate differs from the desired sample rate. If it does, the function resamples the waveform to the desired sample rate; the new number of samples is the old number scaled by the ratio of the two rates. Resampling is the process of converting a waveform from one sample rate to another.

The function then returns the desired sample rate and the resampled waveform.
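
The length calculation inside `ensure_sample_rate` can be checked on its own. This is a small sketch that repeats only the rounding arithmetic; `resampled_length` is a hypothetical helper name and the clip lengths are arbitrary examples.

```python
def resampled_length(num_samples, original_sample_rate, desired_sample_rate=16000):
  """Number of samples after resampling, using the same rounding as ensure_sample_rate."""
  return int(round(float(num_samples) / original_sample_rate * desired_sample_rate))

# A 2-second clip recorded at 44.1 kHz has 88200 samples.
print(resampled_length(88200, 44100))
# A clip that is already at 16 kHz keeps its length.
print(resampled_length(16000, 16000))
```

The duration of the clip stays the same; only the number of samples used to represent it changes.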

Key concepts:

Sample rate: The number of audio samples captured per second, measured in Hz or kHz. At 16 kHz, one second of audio consists of 16,000 samples.

Waveform: The sequence of amplitude values of a sound over time; in this notebook it is a NumPy array with one value per sample. Plotting it gives the familiar graph of a sound wave.

Resample: To resample a waveform is to convert it from one sample rate to another.
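
To make these concepts concrete, the sketch below builds a short waveform by hand. The 440 Hz tone and 0.5 s duration are arbitrary choices, and `sine_waveform` is a hypothetical helper that is not part of the notebook.

```python
import math

def sine_waveform(frequency_hz, duration_s, sample_rate=16000):
  """Returns a list of amplitude samples in [-1.0, 1.0] for a pure tone."""
  num_samples = int(round(duration_s * sample_rate))
  return [math.sin(2 * math.pi * frequency_hz * n / sample_rate)
          for n in range(num_samples)]

tone = sine_waveform(440, 0.5)
print(len(tone))            # number of samples in the waveform
print(len(tone) / 16000)    # duration in seconds = samples / sample rate
```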

## Downloading and preparing the sound file

Here you will download a WAV file and listen to it.
If you already have a file available, upload it to Colab and use it instead.

Note: The expected audio file should be a mono wav file at 16kHz sample rate.
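
If your own file is stereo, `scipy.io.wavfile.read` returns a 2-D array of shape `(num_samples, num_channels)`. One common way to get mono audio is to average the channels; the sketch below assumes that approach, and `to_mono` is a hypothetical helper that is not part of the original notebook.

```python
import numpy as np

def to_mono(wav_data):
  """Averages the channels of a (num_samples, num_channels) array.

  Mono (1-D) input is returned unchanged.
  """
  if wav_data.ndim == 2:
    return wav_data.mean(axis=1)
  return wav_data

# Tiny made-up stereo signal: 3 samples, 2 channels.
stereo = np.array([[100, 300], [-200, 0], [50, 50]], dtype=np.int16)
mono = to_mono(stereo)
print(mono.shape, mono)
```

Note that averaging returns floating-point values, so the dtype is no longer `int16` afterwards.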

In [None]:
!curl -O https://storage.googleapis.com/audioset/speech_whistling2.wav

In [None]:
!curl -O https://storage.googleapis.com/audioset/miaow_16k.wav

In [None]:
# wav_file_name = 'speech_whistling2.wav'
wav_file_name = 'miaow_16k.wav'
sample_rate, wav_data = wavfile.read(wav_file_name)
sample_rate, wav_data = ensure_sample_rate(sample_rate, wav_data)

'''
wavfile.read returns two values:
sample_rate: The sample rate of the WAV file, i.e. the number of times per second that the audio signal is sampled.
wav_data: The audio data, a NumPy array of numbers representing the amplitude of the audio signal at each sample.
ensure_sample_rate then resamples the audio to 16,000 Hz if needed, which is the sample rate YAMNet expects.
'''

# Show some basic information about the audio.
duration = len(wav_data)/sample_rate
print(f'Sample rate: {sample_rate} Hz')
print(f'Total duration: {duration:.2f}s')
print(f'Size of the input: {len(wav_data)}')

# Listening to the wav file.
print(' ')
Audio(wav_data, rate=sample_rate)


The `wav_data` needs to be normalized to values in `[-1.0, 1.0]` (as stated in the model's [documentation](https://tfhub.dev/google/yamnet/1)).

In [None]:
waveform = wav_data / tf.int16.max
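
The effect of this normalization can be seen on a tiny array without TensorFlow. This is a minimal sketch that assumes 16-bit PCM input; `normalize_int16` is a hypothetical helper and the sample values are made up.

```python
import numpy as np

INT16_MAX = np.iinfo(np.int16).max  # 32767, the same value as tf.int16.max

def normalize_int16(wav_data):
  """Scales 16-bit PCM samples to floats in roughly [-1.0, 1.0]."""
  return wav_data.astype(np.float32) / INT16_MAX

pcm = np.array([0, 16384, 32767, -32767], dtype=np.int16)
normalized = normalize_int16(pcm)
print(normalized)
```

The most negative `int16` value, -32768, maps to slightly below -1.0 with this divisor.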

## Executing the Model

Now the easy part: with the data prepared, you just call the model and get the scores, embeddings, and spectrogram.

The score is the main result you will use.
The spectrogram you will use to do some visualizations later.

In [None]:
# Run the model, check the output.
scores, embeddings, spectrogram = model(waveform)

In [None]:
scores_np = scores.numpy()
print('scores array - \n',scores_np)
spectrogram_np = spectrogram.numpy()
inferred_class = class_names[scores_np.mean(axis=0).argmax()]
print(f'\nThe main sound is: {inferred_class}')
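
The aggregation in the cell above (mean over frames, then argmax over classes) can be illustrated on a toy score matrix. The class names and score values below are made up for the example; real YAMNet scores have one column per class in the class map.

```python
import numpy as np

# Toy scores: 3 frames x 4 classes (values and class names are made up).
toy_class_names = ['Speech', 'Cat', 'Meow', 'Silence']
toy_scores = np.array([
    [0.10, 0.60, 0.70, 0.05],
    [0.05, 0.80, 0.40, 0.02],
    [0.20, 0.70, 0.30, 0.10],
])

mean_scores = toy_scores.mean(axis=0)    # average each class over all frames
top_index = int(mean_scores.argmax())    # class with the best average score
print(f'The main sound is: {toy_class_names[top_index]}')
```

In this toy data the first frame alone would pick a different class, which shows why the scores are averaged across frames before the top class is chosen.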

## Visualization

YAMNet also returns some additional information that we can use for visualization.
Let's take a look at the waveform, the spectrogram, and the top classes inferred.

In [None]:
plt.figure(figsize=(10, 6))

# Plot the waveform.
plt.subplot(3, 1, 1)
plt.plot(waveform)
plt.xlim([0, len(waveform)])

# Plot the log-mel spectrogram (returned by the model).
'''A log-mel spectrogram is a visual representation of the frequency content of a sound over time. It is created by taking
short-time Fourier transforms of the sound signal, mapping each resulting spectrum through a mel filterbank, and then taking the logarithm.'''
plt.subplot(3, 1, 2)
plt.imshow(spectrogram_np.T, aspect='auto', interpolation='nearest', origin='lower')

# Plot and label the model output scores for the top-scoring classes.
mean_scores = np.mean(scores, axis=0)
top_n = 10
top_class_indices = np.argsort(mean_scores)[::-1][:top_n]
plt.subplot(3, 1, 3)
plt.imshow(scores_np[:, top_class_indices].T, aspect='auto', interpolation='nearest', cmap='gray_r')

# patch_padding = (PATCH_WINDOW_SECONDS / 2) / PATCH_HOP_SECONDS
# values from the model documentation
patch_padding = (0.025 / 2) / 0.01
plt.xlim([-patch_padding-0.5, scores.shape[0] + patch_padding-0.5])
# Label the top_N classes.
yticks = range(0, top_n, 1)
plt.yticks(yticks, [class_names[top_class_indices[x]] for x in yticks])
_ = plt.ylim(-0.5 + np.array([top_n, 0]))


Explanation of the above code (by Bard):


The code above plots three different visualizations of a sound: the waveform, the log-mel spectrogram, and the model output scores for the top-scoring classes.

```
plt.figure(figsize=(10, 6))
```

This line of code creates a new figure with a width of 10 inches and a height of 6 inches.

```
# Plot the waveform.
plt.subplot(3, 1, 1)
plt.plot(waveform)
plt.xlim([0, len(waveform)])
```

This code creates the first subplot and plots the waveform. The figure is divided into a grid of 3 rows and 1 column, and subplot(3, 1, 1) selects the first cell. The waveform is plotted using the plot() function, and the x-axis is limited to the range from 0 to the length of the waveform.

```
# Plot the log-mel spectrogram (returned by the model).
plt.subplot(3, 1, 2)
plt.imshow(spectrogram_np.T, aspect='auto', interpolation='nearest', origin='lower')
```

This code creates the second subplot in the same 3-row, 1-column grid and plots the log-mel spectrogram using the imshow() function. The spectrogram is transposed (.T) so that time runs along the x-axis. The aspect argument is set to auto so that the image stretches to fill the subplot instead of forcing square pixels. The interpolation argument is set to nearest to use nearest-neighbor interpolation when displaying the spectrogram. The origin argument is set to lower so that the first row of the array (the lowest frequency band) is drawn at the bottom.

```
# Plot and label the model output scores for the top-scoring classes.
mean_scores = np.mean(scores, axis=0)
top_n = 10
top_class_indices = np.argsort(mean_scores)[::-1][:top_n]
plt.subplot(3, 1, 3)
plt.imshow(scores_np[:, top_class_indices].T, aspect='auto', interpolation='nearest', cmap='gray_r')

# patch_padding = (PATCH_WINDOW_SECONDS / 2) / PATCH_HOP_SECONDS
# values from the model documentation
patch_padding = (0.025 / 2) / 0.01
plt.xlim([-patch_padding-0.5, scores.shape[0] + patch_padding-0.5])
# Label the top_N classes.
yticks = range(0, top_n, 1)
plt.yticks(yticks, [class_names[top_class_indices[x]] for x in yticks])
_ = plt.ylim(-0.5 + np.array([top_n, 0]))
```

This code creates the third subplot in the same 3-row, 1-column grid and plots the model output scores for the top-scoring classes using the imshow() function. The classes are chosen by averaging the scores over all frames and keeping the top_n classes with the highest mean. The aspect argument is set to auto so that the image stretches to fill the subplot instead of forcing square pixels. The interpolation argument is set to nearest to use nearest-neighbor interpolation when displaying the scores. The cmap argument is set to gray_r, a reversed grayscale colormap in which higher scores appear darker.
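
The top-class selection used for this subplot can be tried on a small array. The mean scores below are made up, and only three classes are kept instead of ten so the result is easy to follow.

```python
import numpy as np

# Made-up mean scores for six hypothetical classes.
mean_scores = np.array([0.05, 0.70, 0.10, 0.45, 0.02, 0.30])
top_n = 3

# argsort sorts in ascending order, so reverse it and keep the first top_n.
top_class_indices = np.argsort(mean_scores)[::-1][:top_n]
print(top_class_indices.tolist())
```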

The patch_padding variable is calculated as the half-width of the patch window divided by the patch hop size. This value is used to set the x-limits of the subplot so that the patches are centered in the subplot.
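
The padding arithmetic can be worked through in isolation. This sketch reuses the two values from the plotting code; `score_axis_limits` is a hypothetical helper name, and 13 frames is an arbitrary example.

```python
# Values used in the plotting code above.
PATCH_WINDOW_SECONDS = 0.025
PATCH_HOP_SECONDS = 0.01

def score_axis_limits(num_frames):
  """Returns the (left, right) x-limits used for the scores subplot."""
  patch_padding = (PATCH_WINDOW_SECONDS / 2) / PATCH_HOP_SECONDS
  return -patch_padding - 0.5, num_frames + patch_padding - 0.5

left, right = score_axis_limits(13)  # 13 frames is an arbitrary example
print(left, right)
```

With these values the padding is 1.25 frames on each side, so the x-axis extends slightly beyond the first and last score frames.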

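The yticks variable holds the tick positions 0 to top_n - 1, one per row of the image. The yticks() function then labels each position with the corresponding class name. The ylim() call sets the y-range so that the top-scoring class is at the top of the plot, and assigning its result to _ suppresses the output in the notebook.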

---------------------------------
In summary, the code above creates a figure with three subplots. The first subplot plots the waveform of a sound. The second subplot plots the log-mel spectrogram of the sound. The third subplot plots the model output scores for the top-scoring classes.

The waveform is a graph of the sound's amplitude over time. The amplitude is a measure of the loudness of the sound.

The log-mel spectrogram is a graph of the sound's frequency content over time. The frequency content is a measure of the different pitches that are present in the sound.
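
The "mel" part of the name refers to a perceptual frequency scale. The sketch below uses one widely used Hz-to-mel formula; which exact variant YAMNet uses is not stated in this notebook, so treat the constants as an assumption.

```python
import math

def hz_to_mel(frequency_hz):
  """Converts Hz to mels with the common 1127 * ln(1 + f / 700) formula."""
  return 1127.0 * math.log(1.0 + frequency_hz / 700.0)

def mel_to_hz(mel):
  """Inverse of hz_to_mel."""
  return 700.0 * (math.exp(mel / 1127.0) - 1.0)

# Equal steps in Hz cover fewer mels as frequency rises.
low_step = hz_to_mel(1000) - hz_to_mel(500)
high_step = hz_to_mel(7000) - hz_to_mel(6500)
print(round(low_step, 1), round(high_step, 1))
```

The same 500 Hz step spans far fewer mels at high frequencies than at low ones, which mirrors how human hearing resolves pitch.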

The model output scores are a measure of how likely it is that the sound belongs to each of the different classes.


Here are some of the things you can learn from the plots:

* The waveform can tell you about the overall structure of the sound. For example, evenly spaced peaks would indicate a regular beat, which often suggests music.
* The log-mel spectrogram can tell you about the different frequencies that are present in the sound. For example, energy concentrated in the upper rows would indicate a high-pitched sound.
* The model output scores can tell you which classes the sound is most likely to belong to. For example, if the classes "piano" and "guitar" had the highest scores, the sound would likely be a piece of music that features both instruments.