## Overview
### Due February 15th

Instructions:

Assignments are to be turned in individually, although you may collaborate with one classmate.

After completing the assignment, please submit this notebook.

Upload your completed assignment to Moodle.

## Audio Field Exploration

In this assignment we explore a recent paper and the code published alongside it. The objective is to become familiar with current research in the audio field by choosing a paper from paperswithcode.com, summarizing its key findings, and implementing its models and metrics.

### Question 1 [10 points]
#### Paper Selection and Justification

Choose a paper from paperswithcode.com in the audio field (https://paperswithcode.com/area/audio). Explain why you chose this paper (10 points).

```Your Answer Here```

**Paper Selection:**

The paper selected for this assignment is "CNN Architectures for Large-Scale Audio Classification" by Hershey et al. (2017). It explores the use of Convolutional Neural Networks (CNNs) for classifying audio on a large scale, using a dataset of 70 million training videos with over 30,000 labels.


**This paper was chosen for several compelling reasons:**

* **Relevance to the Field:** Large-scale audio classification is a crucial task in various applications like content analysis, environmental monitoring, and speech recognition. This paper directly addresses this important area, making it relevant to current research trends.

* **Novel Approach:** The paper investigates the effectiveness of different CNN architectures for audio classification, drawing inspiration from their success in image recognition. Applying CNNs to audio presents a novel approach with potential for significant advancements in the field.

* **Extensive Experimentation:** The authors conduct thorough experiments on a massive dataset, providing valuable insights into the scalability and performance of different CNN models for large-scale audio classification. This rigorous evaluation strengthens the credibility of their findings.

* **Impactful Results:** The paper demonstrates that CNN architectures can achieve competitive results for audio classification, even surpassing traditional methods in some cases. This has significant implications for the future development of audio processing techniques.

* **Open Access and Code Availability:** The paper is publicly available on arXiv, and the authors have released associated code. This facilitates further exploration and replication of their work, promoting transparency and collaboration in the field.

In summary, this paper's focus on large-scale audio classification, innovative use of CNNs, extensive experimentation, impactful results, and open access make it an excellent choice for this assignment. It provides valuable insights into the potential of CNNs for advancing the field of audio processing and offers a solid foundation for further exploration and research.

### Question 2 [10 points]
#### Paper Summary

Summarize the paper briefly, including its main contributions and key results.



```Your Answer Here```

**Paper Summary:**

This paper explores the use of Convolutional Neural Networks (CNNs) for large-scale audio classification. The authors trained various CNN architectures, including AlexNet, VGG, Inception, and ResNet, on a massive dataset of 70 million training videos with over 30,000 labels.

**Main Contributions:**

* **Adaptation of CNNs for Audio:** The paper demonstrates the effectiveness of adapting CNN architectures, originally designed for image recognition, to the task of audio classification.
* **Large-Scale Experimentation:** The authors conducted extensive experiments on a massive dataset, providing valuable insights into the scalability and performance of CNNs for large-scale audio classification.
* **Performance Benchmarking:** The paper compares the performance of different CNN architectures on the same large-scale video-level classification task (the YouTube-100M dataset), giving a like-for-like benchmark of the models.
* **Acoustic Event Detection:** The authors demonstrate the applicability of the learned embeddings for Acoustic Event Detection (AED) on the AudioSet dataset, achieving significant improvements over baseline methods.

**Key Results:**

* **CNNs Achieve Competitive Performance:** CNN architectures, particularly ResNet and Inception, achieved competitive results for audio classification, surpassing traditional methods in some cases.
* **Larger Training Sets Improve Performance:** The authors observed that increasing the size of the training set generally led to improved performance.
* **Transfer Learning is Effective:** Embeddings learned from the video-level classification task were found to be effective for AED, showcasing the benefits of transfer learning.
* **CNNs Outperform Traditional Methods:** The models using embeddings from the CNN classifiers outperformed a baseline model using raw features on the AudioSet AED classification task.

In essence, this paper demonstrates the potential of CNNs for large-scale audio classification and provides valuable insights into their performance and scalability. The results highlight the effectiveness of adapting CNN architectures from image recognition to audio tasks and the benefits of using large-scale datasets for training.

### Question 3 [20 points]
#### Metrics Explanation
Explain the metrics used in the paper, including their purpose and how they are calculated.

```Your Answer Here```

The paper primarily uses two metrics to evaluate the performance of the CNN architectures for audio classification:

**Mean Average Precision (mAP):**

* **Purpose:** mAP is a widely used metric in information retrieval and object detection tasks. In this paper, it's used to assess the overall performance of the CNN models in classifying audio events across a large number of classes. A higher mAP indicates better performance, reflecting the model's ability to accurately rank relevant audio events higher in its predictions.
* **Calculation:**
  * Predictions and Ground Truth: The models generate predictions for each audio segment, assigning probabilities to different audio event classes. These predictions are compared to the ground truth labels, which indicate the actual classes present in the audio.
  * Ranking: For each class, the audio segments are ranked based on their predicted probabilities for that class.
  * Precision and Recall at Each Rank: Precision and recall are calculated at each position in the ranked list. Precision measures the accuracy of the predictions up to that point, while recall measures the proportion of relevant audio events retrieved.
  * Average Precision (AP): AP is calculated for each class by averaging the precision values at each rank where a relevant audio event is found. This provides a measure of the model's performance for that specific class.
  * Mean Average Precision (mAP): Finally, mAP is calculated by averaging the AP values across all classes, providing a single overall performance metric for the model.
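
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's evaluation code: the function names and the toy scores are hypothetical, and returning 0.0 for a class with no positive examples is an assumed convention.

```python
def average_precision(scores, labels):
    """AP for one class: mean of precision@k over the ranks k that hold a positive."""
    # Rank the segments by descending predicted score for this class.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)  # precision at this rank
    # Assumed convention: a class without positives contributes 0.0.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(scores_by_class, labels_by_class):
    """mAP: unweighted mean of the per-class AP values."""
    aps = [average_precision(s, l) for s, l in zip(scores_by_class, labels_by_class)]
    return sum(aps) / len(aps)


# Toy data (made up): 2 classes, 4 audio segments each.
scores_by_class = [
    [0.9, 0.8, 0.3, 0.1],
    [0.2, 0.7, 0.6, 0.4],
]
labels_by_class = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
toy_map = mean_average_precision(scores_by_class, labels_by_class)
print(f"mAP on toy data: {toy_map:.4f}")
```

In both toy classes the positives land at ranks 1 and 3, so each AP is (1/1 + 2/3) / 2 and the mAP is the same value.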

**d-prime:**

* **Purpose:** d-prime is a signal detection theory metric that measures the sensitivity of a classifier in distinguishing between two classes (in this case, the presence or absence of a specific audio event). It quantifies the separation between the distributions of the classifier's outputs for the two classes. A higher d-prime indicates better discrimination ability.
* **Calculation:**
  * Distributions: The distributions of the classifier's outputs (e.g., probabilities or scores) are estimated for the two classes.
  * Means and Standard Deviations: The means and standard deviations of these distributions are calculated.
  * d-prime Formula: d-prime is then calculated using the following formula: d-prime = (mean of signal distribution - mean of noise distribution) / sqrt((variance of signal distribution + variance of noise distribution) / 2), where "signal" refers to the presence of the audio event and "noise" refers to its absence.
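
As a rough sketch, the formula above can be computed directly from two lists of classifier scores. The second function shows a commonly used alternative that derives d-prime from ROC AUC as sqrt(2) times the inverse normal CDF of the AUC; that relation assumes equal-variance Gaussian score distributions. The function names and sample values are hypothetical, and using the population variance rather than the sample variance is an assumption.

```python
from statistics import NormalDist, mean, pvariance


def d_prime_from_scores(signal_scores, noise_scores):
    """d-prime from score samples: mean difference over the pooled standard deviation."""
    mu_signal, mu_noise = mean(signal_scores), mean(noise_scores)
    # Assumption: population variance; swap in statistics.variance for the sample estimate.
    var_signal, var_noise = pvariance(signal_scores), pvariance(noise_scores)
    return (mu_signal - mu_noise) / (((var_signal + var_noise) / 2) ** 0.5)


def d_prime_from_auc(auc):
    """d-prime from ROC AUC, assuming equal-variance Gaussian score distributions.

    The AUC must lie strictly between 0 and 1, otherwise inv_cdf raises an error.
    """
    return (2 ** 0.5) * NormalDist().inv_cdf(auc)


# Made-up scores for segments where the event is present (signal) or absent (noise).
signal = [2.0, 4.0]
noise = [0.0, 2.0]
print("d-prime from scores:", d_prime_from_scores(signal, noise))
print("d-prime at chance-level AUC:", d_prime_from_auc(0.5))
```

A chance-level classifier (AUC of 0.5) gives a d-prime of zero, and d-prime grows as the two score distributions move apart.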

**Important Points in the Paper's Context:**

* The paper uses these metrics to compare different CNN architectures (such as AlexNet, VGG, Inception, and ResNet) on its large-scale YouTube-100M video classification dataset, and it uses AudioSet for the acoustic event detection experiments.
* It reports both mAP and d-prime to provide a comprehensive assessment of the models' abilities to classify and discriminate between audio events.
* The paper emphasizes the importance of using large-scale datasets and appropriate evaluation metrics for advancing research in audio classification.


### Question 4 [20 points]
#### Code Implementation
Clone the code associated with the paper into a Colab notebook and load the models.

```Your Answer Here```

In [2]:
"""Your Code Here"""
!git clone https://github.com/harritaylor/torchvggish.git

Cloning into 'torchvggish'...
remote: Enumerating objects: 209, done.
remote: Counting objects: 100% (37/37), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 209 (delta 29), reused 24 (delta 24), pack-reused 172 (from 1)
Receiving objects: 100% (209/209), 328.85 KiB | 2.63 MiB/s, done.
Resolving deltas: 100% (95/95), done.


In [3]:
%cd torchvggish

/content/torchvggish


In [4]:
import os
import requests

In [5]:
def download_vggish_weights(output_dir='./'):
    # Check if weights file already exists
    weights_file = os.path.join(output_dir, 'vggish_model.ckpt')
    if os.path.isfile(weights_file):
        print(f'Weights file already exists at {weights_file}. Skipping download.')
        return

    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)

    # Download the weights file
    url = 'https://storage.googleapis.com/audioset/vggish_model.ckpt'
    print(f'Downloading weights file from {url} to {weights_file}...')
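    response = requests.get(url, stream=True, timeout=60)
    response.raise_for_status()  # Fail early instead of saving an HTTP error page
    with open(weights_file, 'wb') as f:
        # Stream in 1 MiB chunks so the whole file is never held in memory
        for chunk in response.iter_content(chunk_size=1024 * 1024):
            if chunk:
                f.write(chunk)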
    print('Download complete.')

In [6]:
download_vggish_weights(output_dir='./vggish')

Downloading weights file from https://storage.googleapis.com/audioset/vggish_model.ckpt to ./vggish/vggish_model.ckpt...
Download complete.


In [7]:
!pip install resampy
import torch

# Load the vggish model using torch.hub
model = torch.hub.load('harritaylor/torchvggish', 'vggish')
model.eval()

Collecting resampy
  Downloading resampy-0.4.3-py3-none-any.whl.metadata (3.0 kB)
Downloading resampy-0.4.3-py3-none-any.whl (3.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.1/3.1 MB 27.7 MB/s eta 0:00:00
Installing collected packages: resampy
Successfully installed resampy-0.4.3


Downloading: "https://github.com/harritaylor/torchvggish/zipball/master" to /root/.cache/torch/hub/master.zip
Downloading: "https://github.com/harritaylor/torchvggish/releases/download/v0.1/vggish-10086976.pth" to /root/.cache/torch/hub/checkpoints/vggish-10086976.pth
100%|██████████| 275M/275M [00:00<00:00, 300MB/s]
Downloading: "https://github.com/harritaylor/torchvggish/releases/download/v0.1/vggish_pca_params-970ea276.pth" to /root/.cache/torch/hub/checkpoints/vggish_pca_params-970ea276.pth
100%|██████████| 177k/177k [00:00<00:00, 5.61MB/s]


VGGish(
  (features): Sequential(
    (0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace=True)
    (8): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace=True)
    (10): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (11): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (12): ReLU(inplace=True)
    (13): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (14): ReLU(inplace=True)
    (15): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False

In [8]:
import urllib.request

# Define the URL and filename of the audio file
url, filename = ("http://soundbible.com/grab.php?id=1698&type=wav", "bus_chatter.wav")

# Download the audio file (Python 3: urlretrieve lives in urllib.request)
urllib.request.urlretrieve(url, filename)

In [9]:
model.forward(filename)

tensor([[158.,  24., 142.,  ..., 218., 117., 255.],
        [162.,  37., 149.,  ..., 165.,   0., 255.],
        [160.,  30., 143.,  ..., 192., 191., 255.],
        ...,
        [157.,  37., 148.,  ..., 138., 189., 255.],
        [162.,  37., 155.,  ..., 156.,  14., 255.],
        [166.,  18., 156.,  ..., 172., 136., 255.]],
       grad_fn=<SqueezeBackward0>)

In [10]:
embeddings = model.forward(filename)

### Question 5 [20 points]
#### Model Evaluation
Run the models on different data from the paper (either from other datasets or by creating your own data) and report the results. For resources on where to find datasets please refer to the lecture notes.

In [11]:
"""Your Code Here"""
!wget https://github.com/karolpiczak/ESC-50/archive/master.zip -O esc50.zip


--2025-02-22 09:01:56--  https://github.com/karolpiczak/ESC-50/archive/master.zip
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/karolpiczak/ESC-50/zip/refs/heads/master [following]
--2025-02-22 09:01:56--  https://codeload.github.com/karolpiczak/ESC-50/zip/refs/heads/master
Resolving codeload.github.com (codeload.github.com)... 140.82.114.9
Connecting to codeload.github.com (codeload.github.com)|140.82.114.9|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘esc50.zip’

esc50.zip               [    <=>             ] 615.78M  13.2MB/s    in 43s     

2025-02-22 09:02:40 (14.3 MB/s) - ‘esc50.zip’ saved [645695005]



In [12]:
!unzip esc50.zip

Archive:  esc50.zip
33c8ce9eb2cf0b1c2f8bcf322eb349b6be34dbb6
   creating: ESC-50-master/
   creating: ESC-50-master/.circleci/
  inflating: ESC-50-master/.circleci/config.yml  
   creating: ESC-50-master/.github/
  inflating: ESC-50-master/.github/stale.yml  
 extracting: ESC-50-master/.gitignore  
  inflating: ESC-50-master/LICENSE   
  inflating: ESC-50-master/README.md  
   creating: ESC-50-master/audio/
  inflating: ESC-50-master/audio/1-100032-A-0.wav  
  inflating: ESC-50-master/audio/1-100038-A-14.wav  
  inflating: ESC-50-master/audio/1-100210-A-36.wav  
  inflating: ESC-50-master/audio/1-100210-B-36.wav  
  inflating: ESC-50-master/audio/1-101296-A-19.wav  
  inflating: ESC-50-master/audio/1-101296-B-19.wav  
  inflating: ESC-50-master/audio/1-101336-A-30.wav  
  inflating: ESC-50-master/audio/1-101404-A-34.wav  
  inflating: ESC-50-master/audio/1-103298-A-9.wav  
  inflating: ESC-50-master/audio/1-103995-A-30.wav  
  inflating: ESC-50-master/audio/1-103999-A-30.wav  
  inflat

In [13]:
!mv ESC-50-master ESC-50

In [14]:
dataset_path = "/content/torchvggish/ESC-50"

In [15]:
import pandas as pd

metadata_file = os.path.join(dataset_path, 'meta/esc50.csv')
metadata = pd.read_csv(metadata_file)

In [22]:
import os
import tempfile

import librosa
import soundfile as sf

def extract_embeddings(audio_file):
    """Return the VGGish embeddings for one audio file, or None if it cannot be loaded."""
    # Load the audio, resampled to 16 kHz mono
    try:
        waveform, sr = librosa.load(audio_file, sr=16000, mono=True, res_type='kaiser_fast')
    except Exception as e:
        print(f"Error loading audio file: {audio_file}. Error: {e}")
        return None  # Return None if loading fails

    # Reserve a temporary WAV path; the model is called with a file path here
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as temp_wav:
        temp_wav_path = temp_wav.name

    try:
        # Write the resampled audio, then run VGGish on the temporary file
        sf.write(temp_wav_path, waveform, sr)
        with torch.no_grad():
            embeddings = model(temp_wav_path)
    finally:
        # Remove the temporary file even if writing or inference fails
        os.remove(temp_wav_path)

    return embeddings.numpy()

In [23]:
embeddings_list = []
labels_list = []

for index, row in metadata.iterrows():
    audio_file = os.path.join(dataset_path, 'audio', row['filename'])

    embeddings = extract_embeddings(audio_file)
    if embeddings is not None:
        embeddings_list.append(embeddings)
        labels_list.append(row['category'])

In [24]:
print(f"Number of embeddings: {len(embeddings_list)}")
print(f"Number of labels: {len(labels_list)}")

Number of embeddings: 2000
Number of labels: 2000


In [25]:
print("First 5 embeddings:")
for i in range(5):  # Print the first 5 embeddings
    print(embeddings_list[i])

print("\nFirst 5 labels:")
for i in range(5):  # Print the first 5 labels
    print(labels_list[i])

First 5 embeddings:
[[175.   8. 147.  97. 215.  76.  77. 130. 148. 175. 151.  51. 120. 147.
  101.  35. 110. 238. 197. 194.   0. 208.  99. 178.  70. 152. 150. 186.
  124.  68. 193. 131. 106.  82. 140. 131. 139. 121. 184. 136. 116. 175.
   32. 144. 123. 164. 133.  92. 124. 110. 127.  94. 129.  78.  86. 151.
  126. 106. 113. 206. 131. 122. 121.  97. 174. 168.  82. 124. 144. 135.
  146. 113. 127. 188. 138. 155. 132. 138.  94. 168. 100. 194. 153. 131.
   92. 102. 138. 202. 147.  89.  63. 145.  96. 113. 111. 172.  89.  51.
  122. 211. 207. 122. 120. 114. 106. 183.  84. 115. 171. 130. 179. 136.
   68. 137.  72.  56. 147. 184.  49. 217. 102. 116. 101.  84. 171. 148.
   69. 255.]
 [175.   8. 147.  97. 215.  76.  77. 130. 148. 175. 151.  51. 120. 147.
  101.  35. 110. 238. 197. 194.   0. 208.  99. 178.  70. 152. 150. 186.
  124.  68. 193. 131. 106.  82. 140. 131. 139. 121. 184. 136. 116. 175.
   32. 144. 123. 164. 133.  92. 124. 110. 127.  94. 129.  78.  86. 151.
  126. 106. 113. 206. 131. 122.

In [26]:
print("Shape of the first embedding:", embeddings_list[0].shape)

Shape of the first embedding: (5, 128)
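
The shape (5, 128) is consistent with VGGish producing one 128-dimensional embedding per 0.96 s of audio, since ESC-50 clips are 5 s long. The sketch below reproduces that count under stated assumptions: the constants are the standard VGGish framing parameters (16 kHz audio, 25 ms STFT window, 10 ms hop, non-overlapping 0.96 s examples), and `expected_num_examples` is a hypothetical helper, not a function from the torchvggish repository.

```python
# Assumed standard VGGish framing parameters; check vggish_params in the cloned repo.
SAMPLE_RATE = 16000
STFT_WINDOW_S = 0.025
STFT_HOP_S = 0.010
EXAMPLE_WINDOW_S = 0.96
EXAMPLE_HOP_S = 0.96


def expected_num_examples(duration_s):
    """Estimate how many 128-d embeddings VGGish yields for a clip of this length."""
    num_samples = int(round(duration_s * SAMPLE_RATE))
    window = int(round(STFT_WINDOW_S * SAMPLE_RATE))  # 400 samples
    hop = int(round(STFT_HOP_S * SAMPLE_RATE))        # 160 samples
    if num_samples < window:
        return 0
    # Number of log-mel spectrogram frames (no padding of the final partial frame).
    num_spec_frames = 1 + (num_samples - window) // hop

    features_per_second = 1.0 / STFT_HOP_S
    example_window = int(round(EXAMPLE_WINDOW_S * features_per_second))  # 96 frames
    example_hop = int(round(EXAMPLE_HOP_S * features_per_second))        # 96 frames
    if num_spec_frames < example_window:
        return 0
    # Number of 0.96 s examples cut from the spectrogram.
    return 1 + (num_spec_frames - example_window) // example_hop


print(expected_num_examples(5.0))  # ESC-50 clips are 5 seconds long
```

Because every ESC-50 clip has the same length, every embedding has the same shape, which is what makes simple flattening into a fixed-size feature vector possible.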


In [30]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assuming embeddings_list is a list of 2D embeddings
# Reshape embeddings_array to 2D by flattening each embedding
embeddings_array = np.array([emb.flatten() for emb in embeddings_list])
labels_array = np.array(labels_list)

# Now embeddings_array should have shape (num_samples, num_features)
print("Shape of embeddings_array after flattening:", embeddings_array.shape)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    embeddings_array, labels_array, test_size=0.2, random_state=42
)

# Create and train the SVM classifier
classifier = SVC(kernel="linear")
classifier.fit(X_train, y_train)

Shape of embeddings_array after flattening: (2000, 640)


In [31]:
from sklearn.metrics import accuracy_score, classification_report

y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(classification_report(y_test, y_pred))

Accuracy: 0.695
                  precision    recall  f1-score   support

        airplane       0.50      0.62      0.56         8
       breathing       0.50      0.75      0.60         8
  brushing_teeth       0.70      1.00      0.82         7
     can_opening       0.47      0.88      0.61         8
        car_horn       0.29      0.67      0.40         3
             cat       0.86      0.50      0.63        12
        chainsaw       0.75      0.75      0.75         4
  chirping_birds       0.86      1.00      0.92         6
    church_bells       1.00      1.00      1.00         8
        clapping       0.90      0.82      0.86        11
     clock_alarm       0.75      0.75      0.75         8
      clock_tick       0.20      0.40      0.27         5
        coughing       0.44      0.40      0.42        10
             cow       0.75      0.69      0.72        13
  crackling_fire       1.00      0.44      0.62         9
        crickets       0.71      0.83      0.77        

### Question 6 [20 points]

Please record a 2-5 minute video of yourself explaining questions 1 to 5.

If you are shy (or are having a bad hair day), you can use filters to augment or cover your face. Please submit the video as a public Google Drive URL.