# Transfer learning with YAMNet for cough sound classification

[YAMNet](https://tfhub.dev/google/yamnet/1) is a pre-trained deep neural network that can predict audio events from [521 classes](https://github.com/tensorflow/models/blob/master/research/audioset/yamnet/yamnet_class_map.csv), such as laughter, barking, or a siren. 

In this tutorial you will learn how to:

- Load and use the YAMNet model for inference.
- Load a saved model, built on the YAMNet embeddings, that classifies cough and speech sounds.
- Evaluate the model on a labeled test set.


## Import TensorFlow and other libraries


Start by installing [TensorFlow I/O](https://www.tensorflow.org/io), which will make it easier for you to load audio files off disk.

In [None]:
!pip install tensorflow_io
# !pip install tensorflow==1.15.2

In [None]:
import os

from IPython import display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_io as tfio
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau, ModelCheckpoint, CSVLogger, EarlyStopping
import warnings

## About YAMNet

[YAMNet](https://github.com/tensorflow/models/tree/master/research/audioset/yamnet) is a pre-trained neural network that employs the [MobileNetV1](https://arxiv.org/abs/1704.04861) depthwise-separable convolution architecture. It can use an audio waveform as input and make independent predictions for each of the 521 audio events from the [AudioSet](http://g.co/audioset) corpus.

Internally, the model extracts "frames" from the audio signal and processes batches of these frames. This version of the model uses frames that are 0.96 seconds long and extracts one frame every 0.48 seconds, so consecutive frames overlap by half.
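To make the framing arithmetic concrete, here is a minimal sketch that lists the start times of the full frames that fit in a clip. The helper name `approx_frame_starts` is hypothetical, and the calculation deliberately ignores any padding YAMNet applies internally, so the real model may return a different number of frames for the same clip.

```python
SAMPLE_RATE = 16_000    # Hz, the input rate YAMNet expects
FRAME_SECONDS = 0.96    # frame length stated in the text
HOP_SECONDS = 0.48      # one frame every 0.48 s


def approx_frame_starts(num_samples):
    """Return start times (in seconds) of every full frame that fits.

    Simplification (assumption): no padding is applied to the waveform.
    """
    frame_len = round(FRAME_SECONDS * SAMPLE_RATE)  # 15360 samples
    hop_len = round(HOP_SECONDS * SAMPLE_RATE)      # 7680 samples
    if num_samples < frame_len:
        return []
    num_frames = 1 + (num_samples - frame_len) // hop_len
    return [i * hop_len / SAMPLE_RATE for i in range(num_frames)]


starts = approx_frame_starts(3 * SAMPLE_RATE)  # a 3-second clip
print(len(starts), starts)
```

Under these assumptions a 3-second clip yields five overlapping frames.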

The model accepts a 1-D float32 Tensor or NumPy array containing a waveform of arbitrary length, represented as single-channel (mono) 16 kHz samples in the range `[-1.0, +1.0]`. This tutorial contains code to help you convert WAV files into the supported format.
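As an illustration of the expected input format, the sketch below converts 16-bit PCM samples held in a NumPy array to mono float32 values in `[-1.0, +1.0]`. The function name `pcm16_to_float32` is hypothetical, and resampling to 16 kHz is not covered here; the `load_wav_16k_mono` helper later in this tutorial handles decoding and resampling with TensorFlow instead.

```python
import numpy as np


def pcm16_to_float32(pcm):
    """Scale int16 PCM samples to float32 in [-1.0, 1.0] and mix down to mono.

    Assumes `pcm` has shape (samples,) or (samples, channels).
    """
    if pcm.dtype != np.int16:
        raise TypeError("expected int16 PCM samples")
    audio = pcm.astype(np.float32) / 32768.0  # int16 range is [-32768, 32767]
    if audio.ndim == 2:
        audio = audio.mean(axis=1)            # average the channels
    return audio.astype(np.float32)


pcm = np.array([0, 16384, -32768, 32767], dtype=np.int16)
waveform = pcm16_to_float32(pcm)
print(waveform.dtype, waveform.min(), waveform.max())
```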

The model returns three outputs: the class scores, the embeddings (which you will use for transfer learning), and the log mel [spectrogram](https://www.tensorflow.org/tutorials/audio/simple_audio#spectrogram). You can find more details on the [YAMNet page on TensorFlow Hub](https://tfhub.dev/google/yamnet/1).
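The class scores are produced per frame, with shape `(num_frames, 521)`, so a clip-level prediction needs some aggregation. A minimal sketch, using a random stand-in array rather than real model output, averages the scores over frames and takes the highest-scoring class. The frame count and the boosted class index are made up for illustration.

```python
import numpy as np

NUM_FRAMES = 5      # hypothetical number of frames in a clip
NUM_CLASSES = 521   # number of AudioSet classes YAMNet predicts

# Stand-in for the `scores` output of YAMNet: one row of class scores per frame.
rng = np.random.default_rng(0)
scores = rng.random((NUM_FRAMES, NUM_CLASSES)).astype(np.float32)
scores[:, 42] += 1.0  # make one (arbitrary) class dominant in every frame

# Average over the frame axis, then pick the best class for the whole clip.
clip_scores = scores.mean(axis=0)
top_class = int(np.argmax(clip_scores))
print(clip_scores.shape, top_class)
```

With real model output, `top_class` would index into the list of class names loaded from the class map.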

One specific use of YAMNet is as a high-level feature extractor, through its 1,024-dimensional embedding output. The embeddings from the base (YAMNet) model serve as input features for a shallower model consisting of one hidden `tf.keras.layers.Dense` layer. Such a network can be trained for audio classification on a small amount of labeled data, _without_ training end-to-end. (For a similar approach applied to images, see [transfer learning for image classification with TensorFlow Hub](https://www.tensorflow.org/tutorials/images/transfer_learning_with_hub).)
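To show the shapes involved, here is a NumPy sketch of the computation such a classifier head performs on a batch of embeddings: one hidden dense layer followed by an output layer. It is not the trained model used in this tutorial. The hidden width of 512, the ReLU activation, and the random weights are assumptions chosen for illustration only.

```python
import numpy as np

EMBEDDING_DIM = 1024  # YAMNet embedding size
HIDDEN_UNITS = 512    # assumption: the hidden width is not given in this tutorial
NUM_CLASSES = 2       # for example, Speech and Cough

rng = np.random.default_rng(0)
# Random, untrained weights: this only illustrates the data flow.
w1 = rng.normal(scale=0.01, size=(EMBEDDING_DIM, HIDDEN_UNITS))
b1 = np.zeros(HIDDEN_UNITS)
w2 = rng.normal(scale=0.01, size=(HIDDEN_UNITS, NUM_CLASSES))
b2 = np.zeros(NUM_CLASSES)


def classifier_head(embeddings):
    """Map (num_frames, 1024) embeddings to (num_frames, 2) logits."""
    hidden = np.maximum(embeddings @ w1 + b1, 0.0)  # dense layer + ReLU (assumed)
    return hidden @ w2 + b2                         # output layer, raw logits


embeddings = rng.random((5, EMBEDDING_DIM))  # stand-in for 5 YAMNet frames
logits = classifier_head(embeddings)
print(logits.shape)
```

Only the small head would need training; the YAMNet weights that produce the embeddings stay fixed.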

First, you will test the model and see the results of classifying audio. You will then construct the data pre-processing pipeline.

### Loading YAMNet from TensorFlow Hub

You are going to use a pre-trained YAMNet from [TensorFlow Hub](https://tfhub.dev/) to extract the embeddings from the sound files.

Loading a model from TensorFlow Hub is straightforward: choose the model, copy its URL, and use the `load` function.

Note: to read the documentation of the model, use the model URL in your browser.

In [None]:
yamnet_model_handle = 'https://tfhub.dev/google/yamnet/1'
yamnet_model = hub.load(yamnet_model_handle)

You will need a function to load audio files, which will also be used later when working with the test data. (Learn more about reading audio files and their labels in [Simple audio recognition](https://www.tensorflow.org/tutorials/audio/simple_audio#reading_audio_files_and_their_labels).)

Note: The `wav` tensor returned by `load_wav_16k_mono` is already normalized to values in the `[-1.0, 1.0]` range (for more information, go to [YAMNet's documentation on TF Hub](https://tfhub.dev/google/yamnet/1)).

In [None]:
# Utility functions for loading audio files and making sure the sample rate is correct.

@tf.function
def load_wav_16k_mono(filename):
    """ Load a WAV file, convert it to a float tensor, resample to 16 kHz single-channel audio. """
    file_contents = tf.io.read_file(filename)
    wav, sample_rate = tf.audio.decode_wav(
          file_contents,
          desired_channels=1)
    wav = tf.squeeze(wav, axis=-1)
    sample_rate = tf.cast(sample_rate, dtype=tf.int64)
    wav = tfio.audio.resample(wav, rate_in=sample_rate, rate_out=16000)
    return wav

### Load the class mapping

It's important to load the class names that YAMNet is able to recognize. The mapping file is present at `yamnet_model.class_map_path()` in the CSV format.

In [None]:
class_map_path = yamnet_model.class_map_path().numpy().decode('utf-8')
class_names = list(pd.read_csv(class_map_path)['display_name'])

for name in class_names[:20]:
  print(name)
print('...')

### Settings: Explore the data


In [None]:
pos_label = 'Cough'
neg_label = 'Speech'
base_data_path = '../input/privacy-aware-cough-event-detection/eval/eval/'
# base_data_path= '../input/privacy-aware-cough-event-detection/soundscapes/soundscapes/'
saved_model_path = '../input/privacy-aware-cough-event-detection/pretrained_'+pos_label+'_detection_3/pretrained_'+pos_label+'_detection_3'
# saved_model_path = '../input/privacy-aware-cough-event-detection/pretrained_'+pos_label+'_detection_with_speech/pretrained_'+pos_label+'_detection_with_speech'

### Filter the data

Once the data is loaded into a `DataFrame`, apply some transformations:

- Filter out rows and use only the selected classes: `Cough` and `Speech` (the non-cough class). If you want to use any other classes, this is where you can choose them.
- Amend the filename to have the full path. This will make loading easier later.
- Change targets to be within a specific range. In this example, `Speech` will be `0`, and `Cough` will become `1`.

In [None]:
# cd_csv = base_data_path+pos_label+'_Detection_eval.csv'
cd_csv = '../input/privacy-aware-cough-event-detection/'+'Cough_Detection_eval.csv'
pd_data = pd.read_csv(cd_csv)
pd_data.head()

In [None]:
my_classes = [neg_label, pos_label]
map_class_to_id = {neg_label:0, pos_label:1}

filtered_pd = pd_data[pd_data.label.isin(my_classes)]

class_id = filtered_pd['label'].apply(lambda name: map_class_to_id[name])
filtered_pd = filtered_pd.assign(target=class_id)

full_path = filtered_pd['name'].apply(lambda row: os.path.join(base_data_path, row))
filtered_pd = filtered_pd.assign(name=full_path)


filtered_pd.head(10)

### Load the saved model

Load the pre-trained cough classifier from `saved_model_path`.

In [None]:
reloaded_model = tf.saved_model.load(saved_model_path)

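And for the final test: given some sound data, does your model return the correct result? The next cell runs the model on every file in the test set and reports recall, precision, F1, and accuracy, treating `Cough` as the positive class.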

In [None]:
test_df = filtered_pd.copy()
divide_by = 1  # Increase to evaluate only the first 1/divide_by of the rows.
num_rows = int(len(test_df) / divide_by)

# Totals of actual positives (P) and negatives (N) among the evaluated rows.
labels = test_df['label'].iloc[:num_rows]
P = int((labels == pos_label).sum())
N = num_rows - P

# Confusion-matrix counts, with `pos_label` as the positive class.
TP = 0
TN = 0
FP = 0
FN = 0

for i in range(num_rows):
  testing_wav_file_name = test_df['name'].iloc[i]
  testing_wav_label = test_df['label'].iloc[i]
  testing_wav_data = load_wav_16k_mono(testing_wav_file_name)
  reloaded_results = reloaded_model(testing_wav_data)
  prediction = my_classes[tf.argmax(reloaded_results)]
  if prediction == testing_wav_label:
    if prediction == pos_label:
      TP += 1
    else:
      TN += 1
  else:
    if prediction == pos_label:
      FP += 1
    else:
      FN += 1

# Each metric stays at 0 when its denominator is zero.
recall = 0
precision = 0
f1 = 0
accuracy = 0
if P > 0:
  recall = TP / P
if TP + FP > 0:
  precision = TP / (TP + FP)
if precision + recall > 0:
  f1 = (2 * precision * recall) / (precision + recall)
if P + N > 0:
  accuracy = (TP + TN) / (P + N)

print("TP, P, TN, N:", TP, P, TN, N)
print("Recall:", recall, "\n")
print("Precision:", precision, "\n")
print("F1:", f1, "\n")
print("Accuracy:", accuracy, "\n")
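The four metrics printed above follow the standard confusion-matrix definitions. As a worked example, the sketch below wraps them in a small helper (the name `binary_metrics` is hypothetical) and applies it to made-up counts, not to results from this dataset.

```python
def binary_metrics(tp, tn, fp, fn):
    """Return recall, precision, F1 and accuracy from confusion-matrix counts.

    Each metric falls back to 0.0 when its denominator is zero.
    """
    positives = tp + fn
    predicted_positives = tp + fp
    total = tp + tn + fp + fn
    recall = tp / positives if positives else 0.0
    precision = tp / predicted_positives if predicted_positives else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"recall": recall, "precision": precision,
            "f1": f1, "accuracy": accuracy}


# Made-up counts for illustration: 10 actual positives, 10 actual negatives.
metrics = binary_metrics(tp=8, tn=7, fp=3, fn=2)
print(metrics)
```

Here recall is 8/10, precision is 8/11, F1 is 16/21, and accuracy is 15/20.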

If you want to try your new model on a serving setup, you can use the 'serving_default' signature.

In [None]:
# serving_results = reloaded_model.signatures['serving_default'](testing_wav_data)
# predicted_class = my_classes[tf.argmax(serving_results['classifier'])]
# print(f'The main sound is: {predicted_class}')


## (Optional) Some more testing

The model is ready.

Let's compare it to YAMNet on the test dataset.

In [None]:
# test_pd = filtered_pd
# row = test_pd.sample(1)
# filename = row['name'].item()
# print(filename)
# waveform = load_wav_16k_mono(filename)
# print(f'Waveform values: {waveform}')
# _ = plt.plot(waveform)

# display.Audio(waveform, rate=16000)

In [None]:
# # Run the model, check the output.
# scores, embeddings, spectrogram = yamnet_model(waveform)
# class_scores = tf.reduce_mean(scores, axis=0)
# top_class = tf.argmax(class_scores)
# inferred_class = class_names[top_class]
# top_score = class_scores[top_class]
# print(f'[YAMNet] The main sound is: {inferred_class} ({top_score})')

# reloaded_results = reloaded_model(waveform)
# your_top_class = tf.argmax(reloaded_results)
# your_inferred_class = my_classes[your_top_class]
# class_probabilities = tf.nn.softmax(reloaded_results, axis=-1)
# your_top_score = class_probabilities[your_top_class]
# print(f'[Your model] The main sound is: {your_inferred_class} ({your_top_score})')

## Next steps

You have used a YAMNet-based model to classify cough and speech sounds. With the same idea and a different dataset you can try, for example, building an [acoustic identifier of birds](https://www.kaggle.com/c/birdclef-2021/) based on their singing.

Share your project with the TensorFlow team on social media!
