---
Title : Application_VGGish  
Author : Dmitrašinović Théotime  
Date : 25/10/2023  
**Goal** :  
1. Apply the pretrained YAMNet model to obtain predictions for ambient sounds

  
---

# YAMNet PreTrained Model

## GitHub Explanation

[GitHub link](https://github.com/tensorflow/models/tree/master/research/audioset/yamnet)


YAMNet is a pretrained deep net that predicts [521 audio event classes](https://github.com/tensorflow/models/blob/master/research/audioset/yamnet/yamnet_class_map.csv) based on the [AudioSet-YouTube corpus](https://research.google.com/audioset/), and employs the [Mobilenet_v1](https://arxiv.org/pdf/1704.04861.pdf) depthwise-separable convolution architecture.

YAMNet also requires downloading the following data file:

[YAMNet model weights](https://storage.googleapis.com/audioset/yamnet.h5) in Keras saved weights in HDF5 format.

### Install dependencies

In [None]:
!pip install numpy resampy tensorflow soundfile

### Installation

In [None]:
# Clone TensorFlow models repo into a 'models' directory.
!git clone https://github.com/tensorflow/models.git
import os
os.chdir("models/research/audioset/yamnet/")
# Download data file into same directory as code.
!curl -O https://storage.googleapis.com/audioset/yamnet.h5



In [None]:
# Installation ready, let's test it.
!python yamnet_test.py

### Usage

You can run the model over existing soundfiles using `inference.py`:

```python inference.py input_sound.wav```  

The code will report the top-5 highest-scoring classes averaged over all the frames of the input. You can access greater detail by modifying the example code in inference.py.  
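
The top-5 report described above can be sketched as follows. This is a minimal illustration on a toy score matrix, not the code of `inference.py` itself; the function name `top_k_classes` and the three class names are assumptions made for the example.

```python
import numpy as np

def top_k_classes(scores, class_names, k=5):
    """Average per-frame scores over time and return the k best (name, score) pairs."""
    # scores has shape (time_frames, num_classes).
    prediction = np.mean(scores, axis=0)
    # argsort is ascending, so reverse it to get the highest scores first.
    top = np.argsort(prediction)[::-1][:k]
    return [(class_names[i], float(prediction[i])) for i in top]

# Toy example: 2 frames, 3 hypothetical classes.
scores = np.array([[0.1, 0.8, 0.3],
                   [0.3, 0.6, 0.5]])
names = ["Speech", "Music", "Dog"]
result = top_k_classes(scores, names, k=2)
print(result)
```

With the real model, `scores` would be the first output of the YAMNet call and `class_names` the list read from `yamnet_class_map.csv`.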

See the jupyter notebook `yamnet_visualization.ipynb` for an example of displaying the per-frame model output scores.

### About the Model

The YAMNet code layout is as follows:

- `yamnet.py`: Model definition in Keras.
- `params.py`: Hyperparameters. You can usefully modify PATCH_HOP_SECONDS.
- `features.py`: Audio feature extraction helpers.
- `inference.py`: Example code to classify input wav files.
- `yamnet_test.py`: Simple test of YAMNet installation.
- `inferenceALL.py`: Retrieves the full score vector rather than only the top 10 classes.

### Input: Audio Features

See `features.py`.  

As with our previous release [VGGish](https://github.com/tensorflow/models/tree/master/research/audioset/vggish), YAMNet was trained with audio features computed as follows:  

- All audio is resampled to 16 kHz mono.  
- A spectrogram is computed using magnitudes of the Short-Time Fourier Transform with a window size of 25 ms, a window hop of 10 ms, and a periodic Hann window.  
- A mel spectrogram is computed by mapping the spectrogram to 64 mel bins covering the range 125-7500 Hz.  
- A stabilized log mel spectrogram is computed by applying log(mel-spectrum + 0.001) where the offset is used to avoid taking a logarithm of zero.  
- These features are then framed into 50%-overlapping examples of 0.96 seconds, where each example covers 64 mel bands and 96 frames of 10 ms each.  

These 96x64 patches are then fed into the Mobilenet_v1 model to yield a 3x2 array of activations for 1024 kernels at the top of the convolution. These are averaged to give a 1024-dimension embedding, then put through a single logistic layer to get the 521 per-class output scores corresponding to the 960 ms input waveform segment. (Because of the window framing, you need at least 975 ms of input waveform to get the first frame of output scores.)
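
The framing arithmetic above can be checked with a short calculation. The sketch below uses only the figures quoted in this section (16 kHz, 25 ms window, 10 ms hop, 96-frame patches with 50% overlap); the constant and function names are assumptions for illustration, and it ignores any padding the actual `features.py` may apply to short inputs.

```python
SAMPLE_RATE = 16000          # Hz
STFT_WINDOW_SECONDS = 0.025  # 25 ms analysis window
STFT_HOP_SECONDS = 0.010     # 10 ms hop between windows
PATCH_FRAMES = 96            # frames per 0.96 s example
PATCH_HOP_FRAMES = 48        # 50% overlap between examples

WINDOW = int(round(SAMPLE_RATE * STFT_WINDOW_SECONDS))  # 400 samples
HOP = int(round(SAMPLE_RATE * STFT_HOP_SECONDS))        # 160 samples

def min_samples_for_one_patch():
    """Samples needed for 96 full STFT frames: one window plus 95 hops."""
    return WINDOW + (PATCH_FRAMES - 1) * HOP

def frame_counts(num_samples):
    """Return (number of STFT frames, number of 96-frame patches), without padding."""
    if num_samples < WINDOW:
        return 0, 0
    num_frames = 1 + (num_samples - WINDOW) // HOP
    if num_frames < PATCH_FRAMES:
        return num_frames, 0
    num_patches = 1 + (num_frames - PATCH_FRAMES) // PATCH_HOP_FRAMES
    return num_frames, num_patches

print(min_samples_for_one_patch() / SAMPLE_RATE)  # 0.975 seconds
print(frame_counts(10 * SAMPLE_RATE))             # frames and patches for 10 s of audio
```

The first value reproduces the 975 ms minimum mentioned above: 400 samples for the first window plus 95 hops of 160 samples gives 15600 samples.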

### Class vocabulary

The file `yamnet_class_map.csv` describes the audio event classes associated with each of the 521 outputs of the network. Its format is: `index,mid,display_name`  

where `index` is the model output index (0..520), `mid` is the machine identifier for that class (e.g. /m/09x0r), and `display_name` is a human-readable description of the class (e.g. Speech).  
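
As an illustration of this format, the sketch below parses a class map with Python's standard `csv` module. It works on a small in-memory sample rather than the real file: the first data row is the example given above, while the second row is hypothetical and only shows that quoted display names containing commas are handled. The function name `read_class_map` is an assumption, not part of the YAMNet code.

```python
import csv
import io

# In-memory sample; the second data row is made up for illustration.
SAMPLE_CSV = '''index,mid,display_name
0,/m/09x0r,Speech
1,/m/xxxxx,"Hypothetical class, with comma"
'''

def read_class_map(csv_file):
    """Return a dict mapping output index -> (mid, display_name)."""
    reader = csv.DictReader(csv_file)
    return {int(row["index"]): (row["mid"], row["display_name"]) for row in reader}

class_map = read_class_map(io.StringIO(SAMPLE_CSV))
print(class_map[0])
```

To read the real file, the same function could be called on `open('yamnet_class_map.csv', newline='')`.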

The original Audioset data release had 527 classes. This model drops six of them on the recommendation of our Fairness reviewers to avoid potentially offensive mislabelings. We dropped the gendered versions (Male/Female) of Speech and Singing. We also dropped Battle cry and Funny music.

In [None]:
import yamnet as yamnet_model  # yamnet.py from the cloned repo (current directory)

yamnet_classes = yamnet_model.class_names('yamnet_class_map.csv')
yamnet_classes

## Application

In [None]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
from os import listdir
from os.path import isfile, join

Mounted at /content/gdrive


In [None]:
!pwd

/content/models/research/audioset/yamnet


In [None]:
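import os
import pickle

def save_embeddings(all_embe, folder_path, version):
  """Checkpoint the predictions and the list of processed ids, then drop the previous checkpoint."""
  # Save all predictions (dict: file id -> score vector).
  with open(folder_path + 'Predictions_' + str(version) + '.pkl', 'wb') as ff:
    pickle.dump(all_embe, ff)
  # Save the list of processed ids.
  with open(folder_path + 'log_' + str(version) + '.pkl', 'wb') as fp:
    pickle.dump(list(all_embe.keys()), fp)
  # Remove the previous checkpoint. This fails harmlessly when it does not exist
  # (OSError) or when `version` is not an integer (TypeError).
  try:
    os.remove(folder_path + 'Predictions_' + str(version - 1) + '.pkl')
    os.remove(folder_path + 'log_' + str(version - 1) + '.pkl')
  except (OSError, TypeError):
    print("could not delete the old backups")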

In [None]:
!pip install pydub

In [None]:
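import numpy as np
from pydub import AudioSegment

def scinderAudio(AudioPath, fileName, maxi = 4000000):
  """Split an MP3 into parts of at most `maxi` milliseconds and return the output folder."""
  OutPath = "/content/gdrive/MyDrive/Projet_Multimedia/download/Audio_Temp/"
  sound = AudioSegment.from_mp3(AudioPath+fileName)
  # len(sound) is the duration in milliseconds; bordures holds the cut points.
  bordures = [i*maxi for i in range(int(np.ceil(len(sound) / maxi)))] + [len(sound)]
  for s in range(len(bordures)-1):
    # Each part is named after the original file plus its end position in ms.
    sound[bordures[s]:bordures[s+1]].export(OutPath+fileName[:-4]+str(bordures[s+1])+".mp3", format="mp3")
  print(fileName, " split into", len(bordures)-1, "parts")
  return OutPath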
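
The cut points computed in `scinderAudio` can be checked independently of pydub. The sketch below reproduces only the boundary arithmetic on plain integers (durations in milliseconds); the helper name `segment_bounds` is an assumption introduced for this example.

```python
import math

def segment_bounds(total_ms, max_ms=4_000_000):
    """Return (start, end) pairs in ms, each part lasting at most max_ms."""
    # Start of every part: 0, max_ms, 2*max_ms, ... below total_ms.
    starts = [i * max_ms for i in range(math.ceil(total_ms / max_ms))]
    # Append the end of the recording as the final cut point.
    cuts = starts + [total_ms]
    return list(zip(cuts[:-1], cuts[1:]))

# A 10,000,000 ms recording is cut into two full parts and one shorter part.
print(segment_bounds(10_000_000))
# A recording shorter than max_ms stays in a single part.
print(segment_bounds(3_000_000))
```

With the default `maxi` of 4,000,000 ms, a file is only cut into several parts when it lasts longer than roughly 66 minutes, which is consistent with most files in the run below being split into a single part.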

In [None]:
# Modified version of inference.py
import os
import time

import numpy as np
import resampy
import soundfile as sf
import tensorflow as tf
import librosa
from tqdm.notebook import tqdm_notebook

import params as yamnet_params
import yamnet as yamnet_model

def main(AudioFolderPath, logs):
  # Collect the audio files.
  AudioFileName = [fi for fi in listdir(AudioFolderPath) if isfile(join(AudioFolderPath, fi))]
  AudioFilePath = [AudioFolderPath + afn for afn in AudioFileName]

  params = yamnet_params.Params()
  yamnet = yamnet_model.yamnet_frames_model(params)
  yamnet.load_weights('yamnet.h5')
  yamnet_classes = yamnet_model.class_names('yamnet_class_map.csv')

  all_preds = {}

  for f, file_name in tqdm_notebook(enumerate(AudioFilePath), desc="Audio files processed"):
    if AudioFileName[f][:-4] in logs:
      # Already processed in a previous session.
      time.sleep(0.01)
    else:

      # If the file is too large (over 40,000,000 bytes), split it and process each part.
      if os.path.getsize(AudioFolderPath + AudioFileName[f]) > 40000000:
        OutPath = scinderAudio(AudioFolderPath, AudioFileName[f])
        files = [fi for fi in listdir(OutPath) if isfile(join(OutPath, fi))]
        embes = []
        for afile in files:
          # Get the audio signal. librosa.load already returns mono float32 samples
          # in [-1.0, +1.0] at the requested rate, so the int16 rescaling
          # (division by 32768) must not be applied here.
          audioPath = OutPath + afile
          waveform, sr = librosa.load(audioPath, sr=16000)
          # Predict YAMNet classes.
          scores, embeddings, spectrogram = yamnet(waveform)
          # Scores is a matrix of (time_frames, num_classes) classifier scores.
          # Average them along time to get an overall classifier output for the part.
          prediction = np.mean(scores, axis=0)
          embes.append(prediction)
          os.remove(OutPath+afile)
        # Keep, for each class, the highest score over all parts.
        all_preds[AudioFileName[f][:-4]] = np.max(embes, axis=0)
      else:

        # Decode the audio file as 16-bit integers.
        wav_data, sr = sf.read(file_name, dtype=np.int16)
        assert wav_data.dtype == np.int16, 'Bad sample type: %r' % wav_data.dtype
        waveform = wav_data / 32768.0  # Convert to [-1.0, +1.0]
        waveform = waveform.astype('float32')

        # Convert to mono and the sample rate expected by YAMNet.
        if len(waveform.shape) > 1:
          waveform = np.mean(waveform, axis=1)
        if sr != params.sample_rate:
          waveform = resampy.resample(waveform, sr, params.sample_rate)

        # Predict YAMNet classes.
        scores, embeddings, spectrogram = yamnet(waveform)
        # Scores is a matrix of (time_frames, num_classes) classifier scores.
        # Average them along time to get an overall classifier output for the clip.
        prediction = np.mean(scores, axis=0)
        # Report the highest-scoring classes and their scores.
        #top5_i = np.argsort(prediction)[::-1]#[:5]
        all_preds[AudioFileName[f][:-4]] = prediction
        # Checkpoint the predictions collected so far.
        save_embeddings(all_preds, AudioEmbeddingsPath, f)
  return all_preds
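
The resume logic in `main` (skipping files whose id already appears in `logs`) can be isolated as a small helper. The sketch below is one possible way to write it, not the notebook's own code: the name `pending_files` is hypothetical, and it uses `os.path.splitext` instead of the `[:-4]` slice so that extensions of any length are handled.

```python
import os

def pending_files(file_names, done_ids):
    """Return the file names whose id (name without extension) is not in done_ids."""
    done = set(done_ids)  # set lookup is O(1), unlike scanning a list for each file
    return [name for name in file_names
            if os.path.splitext(name)[0] not in done]

# Toy example; "clip.flac" is a made-up name to show a 4-letter extension.
names = ["-1wVXK5FKVO0.mp3", "-z2BgjH_CtIA.mp3", "clip.flac"]
todo = pending_files(names, ["-z2BgjH_CtIA"])
print(todo)
```

Filtering the list up front also removes the need to sleep on already processed files inside the loop.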

In [None]:
AudioEmbeddingsPath = "/content/gdrive/MyDrive/Projet_Multimedia/download/Audio_Embeddings/YAMNet/"
AudioPath = "/content/gdrive/MyDrive/Projet_Multimedia/download/Audio/"

In [None]:
with open(AudioEmbeddingsPath + 'log_ok_555.pkl', "rb") as fp:
    logs1 = pickle.load(fp)
with open(AudioEmbeddingsPath + 'log_ok_941.pkl', "rb") as fp:
    logs2 = pickle.load(fp)
with open(AudioEmbeddingsPath + 'log_ok_1429.pkl', "rb") as fp:
    logs3 = pickle.load(fp)
logs = logs1 + logs2 + logs3
len(logs)

1430

In [None]:
logs2[-1]

'-z2BgjH_CtIA'

In [None]:
AudioFileName = [f for f in listdir(AudioPath) if isfile(join(AudioPath, f))]

In [None]:
AudioFileName[942]

'-1wVXK5FKVO0.mp3'

In [None]:
pred = main(AudioPath, logs)

Audio files processed: 0it [00:00, ?it/s]

could not delete the old backups
-3qaKevyVuS4.mp3  split into 1 parts
could not delete the old backups
-nd9Cen7REwM.mp3  split into 2 parts
could not delete the old backups
-e_t6_zCrwz0.mp3  split into 1 parts
could not delete the old backups
-JAAQ2FnnkX8.mp3  split into 1 parts
could not delete the old backups
-KsREXvSMe9c.mp3  split into 1 parts
could not delete the old backups
--jeILZA-hDE.mp3  split into 1 parts
could not delete the old backups
-euTUvnCixSk.mp3  split into 1 parts
could not delete the old backups
-L8hM2kbw2Ik.mp3  split into 1 parts
could not delete the old backups


## Verifying and merging all computation sessions

In [None]:
with open(AudioEmbeddingsPath + 'log_ok_555.pkl', "rb") as fp:
    logs1 = pickle.load(fp)
with open(AudioEmbeddingsPath + 'log_ok_941.pkl', "rb") as fp:
    logs2 = pickle.load(fp)
with open(AudioEmbeddingsPath + 'log_ok_1429.pkl', "rb") as fp:
    logs3 = pickle.load(fp)
with open(AudioEmbeddingsPath + 'log_ok_2279.pkl', "rb") as fp:
    logs4 = pickle.load(fp)
logs = logs1 + logs2 + logs3 + logs4
len(logs)

2280

In [None]:
with open(AudioEmbeddingsPath + 'Predictions_ok_555.pkl', "rb") as fp:
    embe1 = pickle.load(fp)
with open(AudioEmbeddingsPath + 'Predictions_ok_941.pkl', "rb") as fp:
    embe2 = pickle.load(fp)
with open(AudioEmbeddingsPath + 'Predictions_ok_1429.pkl', "rb") as fp:
    embe3 = pickle.load(fp)
with open(AudioEmbeddingsPath + 'Predictions_ok_2279.pkl', "rb") as fp:
    embe4 = pickle.load(fp)
embes = {**embe1, **embe2, **embe3, **embe4}

save_embeddings(embes, AudioEmbeddingsPath, "ALL_predictions_YAMNet")
len(embes)

could not delete the old backups


2280
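
Comparing `len(logs)` with `len(embes)` only checks the totals. A slightly stricter check, sketched below on toy data, also reports ids that appear in several sessions and ids that were logged but have no prediction. The function name `merge_sessions` and the toy ids are assumptions for illustration.

```python
def merge_sessions(session_preds, session_logs):
    """Merge per-session prediction dicts and cross-check them against the logs."""
    merged = {}
    duplicates = []
    for preds in session_preds:
        for key, value in preds.items():
            if key in merged:
                duplicates.append(key)  # a later session overrides an earlier one
            merged[key] = value
    # Ids that were logged as processed but have no prediction after merging.
    logged = [key for log in session_logs for key in log]
    missing = sorted(set(logged) - set(merged))
    return merged, duplicates, missing

# Toy data: "b" appears in both sessions, "d" is logged but never predicted.
s1 = {"a": [0.1], "b": [0.2]}
s2 = {"b": [0.3], "c": [0.4]}
merged, duplicates, missing = merge_sessions([s1, s2], [["a", "b"], ["b", "c", "d"]])
print(len(merged), duplicates, missing)
```

In the notebook, the arguments would be `[embe1, embe2, embe3, embe4]` and `[logs1, logs2, logs3, logs4]`; empty `duplicates` and `missing` lists would confirm that the sessions do not overlap and that nothing was lost.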

In [None]:
# save_embeddings prefixes the file name with 'Predictions_'.
with open(AudioEmbeddingsPath + 'Predictions_ALL_predictions_YAMNet.pkl', "rb") as fp:
    embes = pickle.load(fp)

In [None]:
len(embes)

2280