# Audio Spectrogram Transformer to classify audio

### Contents
0. Set-up environment
1. Load audio
2. Prepare audio for the model
3. Load model
4. Run model

Source: https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer

<a href="https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/AST/Inference_with_the_Audio_Spectogram_Transformer_to_classify_audio.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 0. Set-up environment

First, we install 🤗 Transformers (from source) and torchaudio.

In [1]:
!pip install -q git+https://github.com/huggingface/transformers.git

In [6]:
!pip install torchaudio

Collecting torchaudio
  Downloading torchaudio-2.0.2-cp310-cp310-macosx_10_9_x86_64.whl (3.9 MB)
Collecting torch==2.0.1
  Downloading torch-2.0.1-cp310-none-macosx_10_9_x86_64.whl (143.4 MB)
Installing collected packages: torch, torchaudio
  Attempting uninstall: torch
    Found existing installation: torch 2.0.0
    Uninstalling torch-2.0.0:
      Successfully uninstalled torch-2.0.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.15.1 requires torch==2.0.0, but you have torch 2.0.1 which is incompatible.
Successfully installed torch-2.0.1 torchaudio-

## 1. Load audio

Let's load some audio on which we'd like to test the model.

Check the file on the Hugging Face Hub: https://huggingface.co/datasets/nielsr/audio-spectogram-transformer-checkpoint

In [2]:
# experimental code to load a file from Amsterdam Urban Sounds dataset
from huggingface_hub import hf_hub_download
import IPython

filepath = hf_hub_download(repo_id="UrbanSounds/AmsterdamSounds",
                           filename="A_RGracht-0004_181256.wav",
                           repo_type="dataset")

IPython.display.Audio(filepath)  # inline playback did not work with a .flac file

## 2. Prepare audio for the model (using feature extractor)

We can prepare the audio using ASTFeatureExtractor, which turns the raw waveform into a tensor of shape (batch_size, time_dimension, frequency_dimension). This representation is also known as a spectrogram.
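
To make the time dimension concrete, here is a small back-of-the-envelope sketch. It assumes Kaldi-style framing with a 25 ms window and a 10 ms hop at 16 kHz, 128 mel bins, and a maximum length of 1024 frames. These constants should match the defaults of ASTFeatureExtractor, but they are assumptions here and worth checking against your installed version.

```python
# Assumed framing constants (not taken from this notebook; verify against your version).
SAMPLING_RATE = 16_000      # Hz expected by the feature extractor
WINDOW_MS, HOP_MS = 25, 10  # Kaldi-style frame length and frame shift
NUM_MEL_BINS = 128          # frequency_dimension
MAX_LENGTH = 1024           # time_dimension after padding or truncation


def num_frames(num_samples: int) -> int:
    """Number of full analysis windows that fit in the clip (no edge padding)."""
    window = SAMPLING_RATE * WINDOW_MS // 1000  # 400 samples
    hop = SAMPLING_RATE * HOP_MS // 1000        # 160 samples
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // hop


frames = num_frames(10 * SAMPLING_RATE)   # a 10-second clip
padding = max(0, MAX_LENGTH - frames)     # frames added to reach MAX_LENGTH
print(frames, padding, (1, MAX_LENGTH, NUM_MEL_BINS))
```

Under these assumptions a 10-second clip fills most of the 1024 time steps, and longer clips would be truncated.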

In [3]:
from transformers import ASTFeatureExtractor

feature_extractor = ASTFeatureExtractor()

In [18]:
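import torchaudio

# torchaudio.load returns a (channels, samples) tensor plus the file's native sampling rate
waveform, sampling_rate = torchaudio.load(filepath)
# squeeze() only drops size-1 dimensions, so a stereo clip stays 2-D: (2, samples)
waveform = waveform.squeeze().numpy()

waveform.shape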

(2, 480000)

In [19]:
print(waveform)
print(sampling_rate)

[[ 0.00363159  0.00396729  0.00457764 ... -0.01593018 -0.01599121
   0.        ]
 [-0.00299072 -0.00213623 -0.00091553 ... -0.00512695 -0.00500488
   0.        ]]
48000


In [7]:
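# NOTE: the clip was loaded at 48000 Hz with 2 channels (see the output above), but it is
# passed here unchanged and declared as 16000 Hz. The 2-D array is treated as a batch of
# two clips, which is why the first dimension of the shape printed below is 2.
inputs = feature_extractor(waveform, sampling_rate=16000, padding="max_length", return_tensors="pt")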
input_values = inputs.input_values
print(input_values.shape)

torch.Size([2, 1024, 128])


## 3. Load model

Next we load one of the models that the AST authors released from the [hub](https://huggingface.co/models?other=audio-spectrogram-transformer).

The checkpoint used here was fine-tuned on AudioSet, a large multi-label benchmark for audio classification.

In [8]:
from transformers import AutoModelForAudioClassification

model = AutoModelForAudioClassification.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")

In [9]:
# print the model architecture
print(model)

ASTForAudioClassification(
  (audio_spectrogram_transformer): ASTModel(
    (embeddings): ASTEmbeddings(
      (patch_embeddings): ASTPatchEmbeddings(
        (projection): Conv2d(1, 768, kernel_size=(16, 16), stride=(10, 10))
      )
      (dropout): Dropout(p=0.0, inplace=False)
    )
    (encoder): ASTEncoder(
      (layer): ModuleList(
        (0-11): 12 x ASTLayer(
          (attention): ASTAttention(
            (attention): ASTSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
            (output): ASTSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
          )
          (intermediate): ASTIntermediate(
            (de
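
The Conv2d projection in the printout has a 16x16 kernel and a stride of 10, so neighbouring patches overlap. As a rough sketch, the number of patch tokens for a 1024x128 spectrogram follows from the usual output-size formula for an unpadded convolution (the printout shows no padding). The two extra tokens are an assumption: as far as we know, AST prepends one classification token and one distillation token.

```python
def conv_output_size(size: int, kernel: int, stride: int) -> int:
    """Output length of an unpadded convolution along one axis."""
    return (size - kernel) // stride + 1


TIME_FRAMES, MEL_BINS = 1024, 128  # spectrogram shape produced by the feature extractor
KERNEL, STRIDE = 16, 10            # from the Conv2d line in the printout

time_patches = conv_output_size(TIME_FRAMES, KERNEL, STRIDE)
freq_patches = conv_output_size(MEL_BINS, KERNEL, STRIDE)
num_patches = time_patches * freq_patches
sequence_length = num_patches + 2  # assumed: one [CLS] token plus one distillation token
print(time_patches, freq_patches, num_patches, sequence_length)
```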

In [10]:
# print the mapping from class index to label
print(model.config.id2label)

{0: 'Speech', 1: 'Male speech, man speaking', 2: 'Female speech, woman speaking', 3: 'Child speech, kid speaking', 4: 'Conversation', 5: 'Narration, monologue', 6: 'Babbling', 7: 'Speech synthesizer', 8: 'Shout', 9: 'Bellow', 10: 'Whoop', 11: 'Yell', 12: 'Battle cry', 13: 'Children shouting', 14: 'Screaming', 15: 'Whispering', 16: 'Laughter', 17: 'Baby laughter', 18: 'Giggle', 19: 'Snicker', 20: 'Belly laugh', 21: 'Chuckle, chortle', 22: 'Crying, sobbing', 23: 'Baby cry, infant cry', 24: 'Whimper', 25: 'Wail, moan', 26: 'Sigh', 27: 'Singing', 28: 'Choir', 29: 'Yodeling', 30: 'Chant', 31: 'Mantra', 32: 'Male singing', 33: 'Female singing', 34: 'Child singing', 35: 'Synthetic singing', 36: 'Rapping', 37: 'Humming', 38: 'Groan', 39: 'Grunt', 40: 'Whistling', 41: 'Breathing', 42: 'Wheeze', 43: 'Snoring', 44: 'Gasp', 45: 'Pant', 46: 'Snort', 47: 'Cough', 48: 'Throat clearing', 49: 'Sneeze', 50: 'Sniff', 51: 'Run', 52: 'Shuffle', 53: 'Walk, footsteps', 54: 'Chewing, mastication', 55: 'Biting

## 4. Run model

Next, let's forward the audio through the model. We take the argmax of the model's logits to get the predicted class index, then use model.config.id2label to turn that index back into a label.
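
Because AudioSet clips can carry several labels at once, it is often more informative to look at the few highest-scoring classes than at a single argmax. A minimal sketch follows, using made-up logits and only the first four entries of id2label in place of the real model output. Applying a sigmoid to obtain independent per-class scores is an assumption about how a multi-label checkpoint is best read.

```python
import torch

# Hypothetical stand-ins for outputs.logits and model.config.id2label.
logits = torch.tensor([[2.0, -1.0, 0.5, -3.0]])  # shape: (batch_size, num_classes)
id2label = {
    0: "Speech",
    1: "Male speech, man speaking",
    2: "Female speech, woman speaking",
    3: "Child speech, kid speaking",
}

scores = torch.sigmoid(logits)         # independent per-class scores in (0, 1)
top = torch.topk(scores, k=2, dim=-1)  # best k classes for every clip in the batch

top_labels = [
    [(id2label[i], round(s, 3)) for i, s in zip(idx_row, score_row)]
    for idx_row, score_row in zip(top.indices.tolist(), top.values.tolist())
]
print(top_labels)
```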

In [11]:
import torch

# inference only, so gradient tracking is switched off
with torch.no_grad():
    outputs = model(input_values)

In [12]:
# .item() requires a single-element tensor, so this line assumes a batch of exactly one clip
predicted_class_idx = outputs.logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])

RuntimeError: a Tensor with 2 elements cannot be converted to Scalar
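
The error comes from calling .item() on a tensor that holds one prediction per row: the input batch has two rows here, so argmax(-1) returns two indices. A sketch of a batch-safe variant follows, using made-up logits and a three-entry id2label rather than the real model output.

```python
import torch

# Hypothetical stand-ins for outputs.logits and model.config.id2label.
logits = torch.tensor([[0.1, 2.3, -1.0],
                       [1.7, 0.2, -0.5]])  # two clips, three classes
id2label = {
    0: "Speech",
    1: "Male speech, man speaking",
    2: "Female speech, woman speaking",
}

predicted_ids = logits.argmax(dim=-1).tolist()  # one class index per clip
predicted_labels = [id2label[i] for i in predicted_ids]
for clip, label in enumerate(predicted_labels):
    print(f"Clip {clip}: predicted class: {label}")
```

The same loop works for a batch of one, so it can replace the .item() call regardless of how many clips are passed in.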

In [8]:
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[ -0.8760,  -7.0043,  -8.6603,  -8.7062,  -8.7342,  -7.5784, -11.6310,
          -8.9365, -11.0352, -11.4663, -10.3499, -12.8677, -12.1096, -13.2591,
         -11.7077, -10.2867,  -9.6017, -11.8425, -11.6257, -11.3271, -11.4826,
         -10.9645, -11.3571, -11.5905,  -9.6362, -12.3283, -10.8791,  -7.6673,
          -8.3218, -12.5256,  -9.7643,  -8.5246,  -8.9107,  -9.7567, -10.6851,
         -11.6329, -10.8608,  -9.7865, -11.2953, -10.2231, -11.0701,  -9.4346,
         -11.2001,  -9.1306, -11.0114, -10.6504, -10.0189, -10.6329, -10.6451,
         -10.8483, -11.0439, -10.2059, -11.1721,  -9.9780, -10.2870, -11.2638,
         -11.5230,  -9.7743, -11.5319, -10.8476, -10.1974, -11.7021, -11.3919,
         -11.0658,  -9.8445, -11.1825, -11.4833, -11.1436, -11.9012,  -9.9602,
          -9.9852, -12.5412,  -7.0751,  -8.4316,  -8.8403, -11.3575, -10.8831,
         -11.8821, -11.1314, -10.5175, -10.7800,  -9.5904, -11.3637, -11.1587,
         