## Real-Time Sound Classification using PyTorch


This project aims to develop an artificial intelligence system capable of classifying live sounds using the PyTorch framework. The goal is to build a simple yet effective sound classification model that can accurately identify different types of sounds in real-time.

In [1]:
from MODELS.ai_0014_MARK1 import SoundClassifier_MARK1
from MODELS.ai_0014_MARK2 import SoundClassifier_MARK2
from MODELS.ai_0014_MARK3 import SoundClassifier_MARK3
from GPU_torch import GPU

In [2]:
device = GPU()
print("current device : ",device)
print(device)

Apple device detected
Activating Apple Silicon GPU
current device :  mps
mps


In [3]:
##IDK what is wrong with that up cell
import torch
import subprocess

def GPU():
    if torch.cuda.is_available() == True:
        device = 'cuda'
        templist = [1, 2, 3]
        templist = torch.FloatTensor(templist).to(device)
        print("Cuda torch working : ", end="")
        print(templist.is_cuda)
        print("current device no. : ", end="")
        print(torch.cuda.current_device())
        print("GPU device count : ", end="")
        print(torch.cuda.device_count())
        print("GPU name : ", end="")
        print(torch.cuda.get_device_name(0))
        print("device : ", device)
        # Execute the nvidia-smi command using subprocess
        try:
            output = subprocess.check_output(['nvidia-smi']).decode('utf-8')
            print("nvidia-smi output:")
            print(output)
        except (subprocess.CalledProcessError, FileNotFoundError) as e:
            print("Error executing nvidia-smi command:", str(e))
    elif torch.backends.mps.is_available() == True:
        print("Apple device detected\nActivating Apple Silicon GPU")
        device = torch.device("mps")
    else:
        print("cant use gpu , activating cpu")
        device = 'cpu'

    return device
device = GPU()
print(device)

Apple device detected
Activating Apple Silicon GPU
mps


In [4]:
'''SEED Everything'''
import torch
import random
import numpy as np
def seed_everything(SEED=42):
    random.seed(SEED)
    np.random.seed(SEED)
    torch.manual_seed(SEED)
    torch.cuda.manual_seed(SEED)
    torch.cuda.manual_seed_all(SEED)
    torch.backends.cudnn.benchmark = True # keep True if all the input have same size.
SEED=42
seed_everything(SEED=SEED)

In [5]:
try:
    import soundata

    dataset = soundata.initialize('urbansound8k')
    #dataset.download()  # download the dataset
    #dataset.validate()  # validate that all the expected files are there

    example_clip = dataset.choice_clip()  # choose a random example clip
    print(example_clip)  # see the available data
except:
    print("SKIP")

Clip(
  audio_path="/Users/cafalena/sound_datasets/urbansound8k/audio/fold3/65750-3-3-48.wav",
  clip_id="65750-3-3-48",
  audio: The clip's audio
            * np.ndarray - audio signal
            * float - sample rate,
  class_id: The clip's class id.
            * int - integer representation of the class label (0-9). See Dataset Info in the documentation for mapping,
  class_label: The clip's class label.
            * str - string class name: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, street_music,
  fold: The clip's fold.
            * int - fold number (1-10) to which this clip is allocated. Use these folds for cross validation,
  freesound_end_time: The clip's end time in Freesound.
            * float - end time in seconds of the clip in the original freesound recording,
  freesound_id: The clip's Freesound ID.
            * str - ID of the freesound.org recording from which this clip was taken,
  freesound_sta

In [6]:
import pandas as pd
import torchaudio
import torchaudio.transforms as transforms
import torch
from torch.utils.data import Dataset
from sklearn.model_selection import train_test_split
import os
import torch
import torchaudio
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2Processor
import soundfile as sf
import pandas as pd
from torch.utils.data import Dataset, DataLoader


class UrbanSoundDataset(Dataset):
    def __init__(self, csv_file, file_path, folderList, transform=None, target_length=None):
        self.file_path = file_path
        self.file_labels = (csv_file)
        self.file_names = [str(self.file_labels.iloc[i, 0]) for i in folderList.index]
        self.labels = self.file_labels.loc[folderList.index, "classID"].values
        self.folders = [str(self.file_labels.iloc[i, 5]) for i in folderList.index]
        self.transform = transform
        self.target_length = target_length

    def __getitem__(self, index):
        # format the file path and load the file
        path = self.file_path + 'fold' + self.folders[index] + '/' + self.file_names[index]
        sound = torchaudio.load(path)
        soundData = sound[0][0, :]

        # pad/truncate soundData to target_length
        if self.target_length:
            if len(soundData) < self.target_length:
                padding = torch.zeros(self.target_length - len(soundData))
                soundData = torch.cat((soundData, padding))
            elif len(soundData) > self.target_length:
                soundData = soundData[:self.target_length]

        soundData = soundData.unsqueeze(0)

        # apply transformations
        if self.transform is not None:
            soundData = self.transform(soundData)

        return soundData, self.labels[index]

    def __len__(self):
        return len(self.file_names)



  from .autonotebook import tqdm as notebook_tqdm


In [7]:
class UrbanSound8KDataset2(Dataset):
    def __init__(self, csv_file, file_path, processor, sample_rate, seconds=None):
        self.annotations = csv_file
        self.file_path = file_path
        self.processor = processor
        self.sample_rate = sample_rate
        self.resampler = torchaudio.transforms.Resample(orig_freq=44100, new_freq=self.sample_rate)
        self.seconds = seconds * self.sample_rate if seconds else self.sample_rate * 5

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, index):
        audio_file_path = os.path.join(self.file_path, 'fold' + str(self.annotations.iloc[index]['fold']), self.annotations.iloc[index]['slice_file_name'])
        label = self.annotations.iloc[index, 6]
        waveform, sample_rate = torchaudio.load(audio_file_path)
        
        # Convert to mono by averaging channels
        waveform = waveform.mean(dim=0)
        
        # pad/truncate waveform to target_length
        if self.seconds:
            if waveform.shape[0] < self.seconds:
                padding = torch.zeros(self.seconds - waveform.shape[0])
                waveform = torch.cat((waveform, padding))
            elif waveform.shape[0] > self.seconds:
                waveform = waveform[:self.seconds]

        # Resample from 44.1kHz to 16kHz
        waveform = self.resampler(waveform)
        
        # Now truncating to 4 seconds of audio (64000 samples at 16000Hz)
        inputs = self.processor(waveform, sampling_rate=self.sample_rate, max_length=self.seconds, return_tensors="pt", padding=True, truncation=True)
        
        return inputs.input_values[0], torch.tensor(label)


In [8]:
#wav2

#PHARA

BATCH = 16

# Load the dataset
try:
    file_path = '/Users/cafalena/sound_datasets/urbansound8k/UrbanSound8K/audio/'
except:
    file_path = 'C:/Users/PC/AppData/@FOLDER/@Project/UrbanSound8K/audio/'
try:
    csv_file = pd.read_csv('/Users/cafalena/sound_datasets/urbansound8k/UrbanSound8K/metadata/UrbanSound8K.csv')
except:
    csv_file = pd.read_csv('C:/Users/PC/AppData/@FOLDER/@Project/UrbanSound8K/metadata/UrbanSound8K.csv')



# Create datasets
from sklearn.model_selection import train_test_split

model = Wav2Vec2ForSequenceClassification.from_pretrained("facebook/wav2vec2-base-960h", num_labels=10)  # num_labels는 레이블의 개수에 따라 수정해야 합니다.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

#this model is trained with 16,000 Hz sample rate
train_dataset = UrbanSound8KDataset2(csv_file=csv_file, file_path=file_path, processor=processor,sample_rate=16000)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH, shuffle=True)


# Now, `train_data` is your training set (70% of total),
# `val_set` is your validation set (15% of total), and
# `test_data` is your testing set (15% of total).


Some weights of the model checkpoint at facebook/wav2vec2-base-960h were not used when initializing Wav2Vec2ForSequenceClassification: ['lm_head.bias', 'lm_head.weight']
- This IS expected if you are initializing Wav2Vec2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['projector.bias', 'wav2vec2.masked_spec_embed', 'classifier.bias', 'classifier.weight', 'projector.weight']
You should probably TRAIN this model on a down-stream task to be 

In [9]:
#PHARA

BATCH = 16

In [10]:
from tqdm import tqdm
device = 'cpu'
# Hyperparameters

os.environ['PYTORCH_ENABLE_MPS_FALLBACK'] = '1'


EPOCHS = 100
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
loss_fn = torch.nn.CrossEntropyLoss()

model = model.to(device)
model = model.train()

for epoch in range(EPOCHS):
    print(f'Epoch {epoch + 1}/{EPOCHS}')
    print('-' * 10)
    losses = []
    correct_predictions = 0

    for data in tqdm(train_loader):
        input_values = data[0].to(device)
        labels = data[1].to(device)

        outputs = model(input_values, labels=labels)
        _, preds = torch.max(outputs.logits, dim=1)
        loss = loss_fn(outputs.logits, labels)

        correct_predictions += torch.sum(preds == labels)
        losses.append(loss.item())

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    train_acc = correct_predictions.double() / len(train_loader.dataset)
    train_loss = np.mean(losses)
    print(f'Train loss {train_loss} accuracy {train_acc}')

Epoch 1/100
----------


100%|██████████| 546/546 [1:07:10<00:00,  7.38s/it]


Train loss 2.266715331828638 accuracy 0.11051305542830966
Epoch 2/100
----------


100%|██████████| 546/546 [1:07:33<00:00,  7.42s/it]


Train loss 2.263129065761636 accuracy 0.11200183234081539
Epoch 3/100
----------


100%|██████████| 546/546 [1:05:52<00:00,  7.24s/it]


Train loss 2.2629002917817225 accuracy 0.11268896014658726
Epoch 4/100
----------


100%|██████████| 546/546 [1:02:16<00:00,  6.84s/it]


Train loss 2.262000119293129 accuracy 0.11268896014658726
Epoch 5/100
----------


100%|██████████| 546/546 [1:02:43<00:00,  6.89s/it]


Train loss 2.2613277365436484 accuracy 0.1152084287677508
Epoch 6/100
----------


100%|██████████| 546/546 [1:02:25<00:00,  6.86s/it]


Train loss 2.2610191189762436 accuracy 0.11555199267063673
Epoch 7/100
----------


100%|██████████| 546/546 [1:02:31<00:00,  6.87s/it]


Train loss 2.2613978272392634 accuracy 0.10799358680714613
Epoch 8/100
----------


100%|██████████| 546/546 [1:01:58<00:00,  6.81s/it]


Train loss 2.261728421235696 accuracy 0.11268896014658726
Epoch 9/100
----------


100%|██████████| 546/546 [1:02:26<00:00,  6.86s/it]


Train loss 2.261409415430202 accuracy 0.11108566193311956
Epoch 10/100
----------


100%|██████████| 546/546 [1:01:56<00:00,  6.81s/it]


Train loss 2.2605679074486535 accuracy 0.11509390746678883
Epoch 11/100
----------


100%|██████████| 546/546 [1:02:46<00:00,  6.90s/it]


Train loss 2.2598534045201957 accuracy 0.11933119560238205
Epoch 12/100
----------


100%|██████████| 546/546 [1:02:03<00:00,  6.82s/it]


Train loss 2.2613695628477104 accuracy 0.10902427851580394
Epoch 13/100
----------


100%|██████████| 546/546 [1:01:42<00:00,  6.78s/it]


Train loss 2.2610873036332184 accuracy 0.11612459917544664
Epoch 14/100
----------


100%|██████████| 546/546 [1:02:17<00:00,  6.85s/it]


Train loss 2.260849258838556 accuracy 0.10994044892349977
Epoch 15/100
----------


100%|██████████| 546/546 [1:06:18<00:00,  7.29s/it]


Train loss 2.260920618479942 accuracy 0.11268896014658726
Epoch 16/100
----------


100%|██████████| 546/546 [1:07:38<00:00,  7.43s/it]


Train loss 2.2606189919042063 accuracy 0.10570316078790655
Epoch 17/100
----------


 27%|██▋       | 148/546 [17:35<47:19,  7.13s/it] 


KeyboardInterrupt: 

## Train/Validation/Test

In [None]:
# Create transforms


#Resampleing for lower data (it could be used, or not. But not too low or too high)
import torchaudio.transforms as T

# Create transforms
transform = torch.nn.Sequential(
    T.Resample(orig_freq=44100, new_freq=16000),
    T.MFCC(sample_rate=16000),
)


# Load the dataset
try:
    file_path = '/Users/cafalena/sound_datasets/urbansound8k/UrbanSound8K/audio/'
except:
    file_path = 'C:/Users/PC/AppData/@FOLDER/@Project/UrbanSound8K/audio/'
try:
    csv_file = pd.read_csv('/Users/cafalena/sound_datasets/urbansound8k/UrbanSound8K/metadata/UrbanSound8K.csv')
except:
    csv_file = pd.read_csv('C:/Users/PC/AppData/@FOLDER/@Project/UrbanSound8K/metadata/UrbanSound8K.csv')



# Create datasets
from sklearn.model_selection import train_test_split
train_set, temp_set = train_test_split(csv_file, test_size=0.3, random_state=42)
val_set, test_set = train_test_split(temp_set, test_size=0.5, random_state=42)

target_length = 4 * 44100  # 4 seconds

train_data = UrbanSoundDataset(csv_file=csv_file, file_path=file_path, 
                                folderList=train_set['fold'], 
                                transform=transform, 
                                target_length=target_length)
val_data = UrbanSoundDataset(csv_file=csv_file, file_path=file_path, 
                                folderList=val_set['fold'] , 
                                transform=transform, 
                                target_length=target_length)
test_data = UrbanSoundDataset(csv_file=csv_file, file_path=file_path, 
                                folderList=test_set['fold'], 
                                transform=transform, 
                                target_length=target_length)


# Create dataloaders
import torch.utils.data as data
train_loader = data.DataLoader(train_data, batch_size=BATCH, shuffle=True, drop_last=True)
val_loader = data.DataLoader(val_data, batch_size=BATCH, shuffle=True)
test_loader = data.DataLoader(test_data, batch_size=BATCH, shuffle=True)

# Now, `train_data` is your training set (70% of total),
# `val_set` is your validation set (15% of total), and
# `test_data` is your testing set (15% of total).




### MODEL

In [None]:
import torch.nn as nn
import torch.nn.functional as F

class SOUND_CNN(nn.Module):
    def __init__(self):
        super(SOUND_CNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=5, stride=1, padding=2)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=2)
        self.fc1 = nn.Linear(64 * 10 * 110, 1000)
        self.fc2 = nn.Linear(1000, 10) # the output layer should match the number of classes, which is 10 for UrbanSound8K dataset

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        #print(x.shape)
        x = self.pool(F.relu(self.conv2(x)))
        #print(x.shape)
        x = x.view(-1, 64 * 10 * 110)
        #print(x.shape)
        x = F.relu(self.fc1(x))
        #print(x.shape)
        x = self.fc2(x)
        #print(x.shape)
        
        return x


In [None]:
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2Processor
import torch.optim as optim
from tqdm import tqdm


model = Wav2Vec2ForSequenceClassification.from_pretrained("facebook/wav2vec2-base-960h", num_labels=10).to(device)
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    running_loss = 0.0
    for i, data in tqdm(enumerate(train_loader, 0)):
        inputs, labels = data[0].to(device), data[1].to(device)

        # Reshape inputs to 1D tensor
        inputs = inputs.view(inputs.shape[0], -1)

        # Process inputs with processor
        inputs = processor(inputs, sampling_rate=16000, return_tensors="pt", padding=True, truncation=True, max_length=64000, return_attention_mask=True)
        input_values = inputs.input_values.to(device)
        attention_mask = inputs.attention_mask.to(device)

        optimizer.zero_grad()

        outputs = model(input_values, attention_mask=attention_mask)
        loss = criterion(outputs.logits, labels)
        loss.backward()
        optimizer.step()


        running_loss += loss.item()
        if i % 2000 == 1999:
            print('[%d, %5d] loss: %.3f' %
                    (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print('Finished Training')


Some weights of the model checkpoint at facebook/wav2vec2-base-960h were not used when initializing Wav2Vec2ForSequenceClassification: ['lm_head.bias', 'lm_head.weight']
- This IS expected if you are initializing Wav2Vec2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['projector.bias', 'classifier.bias', 'wav2vec2.masked_spec_embed', 'projector.weight', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be 

KeyboardInterrupt: 

### Dataset

### LR result

LR : 100  Epoch [1/10], Loss: 123043.2521
LR : 100  Epoch [2/10], Loss: 7324.8045
LR : 100  Epoch [3/10], Loss: 461.2951
LR : 100  Epoch [4/10], Loss: 100.8706
LR : 100  Epoch [5/10], Loss: 65.5783
LR : 100  Epoch [6/10], Loss: 346.9864
LR : 100  Epoch [7/10], Loss: 63.1574
LR : 100  Epoch [8/10], Loss: 56.8437
LR : 100  Epoch [9/10], Loss: 61.7810
LR : 100  Epoch [10/10], Loss: 68.3983
LR : 10  Epoch [1/10], Loss: 1080.9213
LR : 10  Epoch [2/10], Loss: 35.1927
LR : 10  Epoch [3/10], Loss: 4.9769
LR : 10  Epoch [4/10], Loss: 2.4054
LR : 10  Epoch [5/10], Loss: 3.5182
LR : 10  Epoch [6/10], Loss: 3.4323
LR : 10  Epoch [7/10], Loss: 3.0489
LR : 10  Epoch [8/10], Loss: 5.0180
LR : 10  Epoch [9/10], Loss: 2.3545
LR : 10  Epoch [10/10], Loss: 2.4156
LR : 1  Epoch [1/10], Loss: 13.9965
LR : 1  Epoch [2/10], Loss: 2.9778
LR : 1  Epoch [3/10], Loss: 2.3417
LR : 1  Epoch [4/10], Loss: 2.3440
LR : 1  Epoch [5/10], Loss: 2.3113
LR : 1  Epoch [6/10], Loss: 2.3460
LR : 1  Epoch [7/10], Loss: 2.2775
LR : 1  Epoch [8/10], Loss: 2.2869
LR : 1  Epoch [9/10], Loss: 2.2935
LR : 1  Epoch [10/10], Loss: 2.2680
LR : 0.1  Epoch [1/10], Loss: 3.1002
LR : 0.1  Epoch [2/10], Loss: 2.2569
LR : 0.1  Epoch [3/10], Loss: 2.2446
LR : 0.1  Epoch [4/10], Loss: 2.2292
LR : 0.1  Epoch [5/10], Loss: 2.2173
LR : 0.1  Epoch [6/10], Loss: 2.2035
LR : 0.1  Epoch [7/10], Loss: 2.1767
LR : 0.1  Epoch [8/10], Loss: 2.1629
LR : 0.1  Epoch [9/10], Loss: 2.146


In [None]:
import torch.optim as optim
from tqdm import tqdm

model = SOUND_CNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# train the network
for epoch in range(10):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in tqdm(enumerate(train_loader, 0)):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data[0].to(device), data[1].to(device)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                    (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print('Finished Training')


191it [02:07,  1.50it/s]
191it [02:11,  1.46it/s]
191it [01:52,  1.70it/s]
191it [01:50,  1.73it/s]
191it [01:47,  1.77it/s]
191it [02:04,  1.54it/s]
191it [01:48,  1.75it/s]
191it [01:47,  1.78it/s]
191it [01:53,  1.69it/s]
191it [01:51,  1.72it/s]

Finished Training





In [None]:
correct = 0
total = 0
with torch.no_grad():
    for data in test_loader:
        sounds, labels = data[0].to(device), data[1].to(device)
        outputs = model(sounds)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the test images: %d %%' % (100 * correct / total))


Accuracy of the network on the test images: 11 %


In [None]:





#LR_list = [100,10,1,1e-1,1e-2,1e-3,1e-5,1e-7,1e-10]
#paremeters

from tqdm import tqdm


LR_list = [1e-2]
NB_EPOCH = 3
num_classes = 10  # for UrbanSound8K dataset, there are 10 classes

#audio_dir = "C:/Users/PC/AppData/@FOLDER/@Project/UrbanSound8K/audio"
audio_dir = "/Users/cafalena/sound_datasets/urbansound8k/UrbanSound8K/audio"
#Saving the best model
min_loss = float('inf')

for LR in LR_list:

    #// we have to change that it uses many lr 
    model = SOUND_CNN().to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    model.train()
    model.to(device)
    for epoch in range(NB_EPOCH):
        running_loss = 0.0
        for data, target in tqdm(train_loader): # tqdm
        #for data, target in (train_loader): #no tqdm
            data = data.unsqueeze(1)
            data = data.to(device)
            target = target.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target.squeeze())
            loss.backward()
            optimizer.step()
            running_loss += loss.item() * data.size(0)
        epoch_loss = running_loss / len(train_set)
        print('LR : {}  Epoch [{}/{}], Loss: {:.4f}'.format(LR,epoch+1, NB_EPOCH, epoch_loss))
        # Save the BEST model if the current epoch loss is less than the minimum loss so far
#------------------------------- Valdation
        # Validation phase
        model.eval()
        val_running_loss = 0.0
        with torch.no_grad():
            for data, target in val_loader:
                data = data.unsqueeze(1)
                data = data.to(device)
                target = target.to(device)
                output = model(data)
                loss = criterion(output, target.squeeze())
                val_running_loss += loss.item() * data.size(0)
        val_epoch_loss = val_running_loss / len(val_set)
        print('LR : {}  Epoch [{}/{}], Validation Loss: {:.4f}'.format(LR, epoch+1, NB_EPOCH, val_epoch_loss))

        # Save the BEST model if the current epoch validation loss is less than the minimum loss so far
        if val_epoch_loss < min_loss:
            print("Saving the best model with validation loss: {:.4f}".format(val_epoch_loss))
            min_loss = val_epoch_loss
            torch.save(model.state_dict(), "MARK2_best.pth")

  0%|          | 0/191 [00:00<?, ?it/s]


RuntimeError: Expected 3D (unbatched) or 4D (batched) input to conv2d, but got input of size: [32, 1, 1, 40, 442]

## Test

In [None]:
model.load_state_dict(torch.load("MARK2_best.pth"))
model.eval()

#I dont know why, but If I use GPU(mps) an error accurs
device = 'cpu'
model.to('cpu')
#model.to(device)

with torch.no_grad():
    correct = 0
    total = 0
    for data, target in tqdm(test_loader):
        #data = data.unsqueeze(0)
        data = data.to(device)
        target = target.to(device)
        output = model(data)
        _, predicted = torch.max(output.data,1)##IMPORTANT if your model dont use batchnorm, use max(output.data,0) instead
        total += target.size(0)
        correct += (predicted == target.squeeze()).sum().item()
    print('Accuracy of the model on the validation set: {:.2f}%'.format(100 * correct / total))

1
2
3
4
5


100%|██████████| 219/219 [01:08<00:00,  3.20it/s]

Accuracy of the model on the validation set: 12.81%





In [None]:
# BATCHSIZE = 64
# from tqdm import tqdm

# try:
#     testdataset = pd.read_csv('/Users/cafalena/sound_datasets/urbansound8k/UrbanSound8K/metadata/UrbanSound8K.csv')
# except:
#     testdataset = pd.read_csv("C:/Users/PC/AppData/@FOLDER/@Project/UrbanSound8K/metadata/UrbanSound8K.csv")
    
# num_classes = 10  # for UrbanSound8K dataset, there are 10 classes
# input_size = 44100 * 4  # 44100Hz * 4 seconds

# test_loader = UrbanSoundDataset(testdataset)
# train_loader = torch.utils.data.DataLoader(test_loader, batch_size=BATCHSIZE, shuffle=True)
# model = SoundClassifier_MARK2(input_size, num_classes)
# model.load_state_dict(torch.load("best_model.pth"))
# model.eval()

# #model.to('cpu')
# model.to(device)
# with torch.no_grad():
#     print(4)
#     correct = 0
#     total = 0
#     for data, target in tqdm(train_loader):
#         data = data.unsqueeze(0)
#         data = data.to(device)
#         target = target.to(device)
        
#         output = model(data)
#         _, predicted = torch.max(output.data,1)##IMPORTANT if your model dont use batchnorm, use max(output.data,0) instead
#         total += target.size(0)
#         correct += (predicted == target.squeeze()).sum().item()
#     print('Accuracy of the model on the validation set: {:.2f}%'.format(100 * correct / total))

#### Diary

**5/14 1300**: I have resolved the folder location issue, but now I am facing a new problem. The folder location and the sound file do not match. It's strange because the folder and the CSV files are fine. However, the `audio_path` is pointing in the wrong direction.

**5/14 1310**: I noticed that the folder and the file name were slightly off, which indicates that it's not entirely random. So, I decided to avoid using split and shuffle. Surprisingly, it worked. It seems like the split function was causing the problem, but I will keep monitoring the situation. Although the location error still persists, I tried specifying the complete location path as "/Users/cafalena/sound_datasets/urbansound8k/UrbanSound8K/". This resolved the issue, and I observed that it recognized multiple sound files. However, now I am facing a tensor problem. I need to address this next.

**5/14 1752**: I was planning to use a CNN (Convolutional Neural Network), but I realized that sound waves are 1-dimensional. I'm struggling to figure out how to utilize a CNN with sound. Therefore, for now, I will stick with an NN (Neural Network). Once I successfully implement the NN, I can revisit using a CNN. Additionally, I need to work on the accuracy and test code sections.  

**5/15 1530** I solved the problem with NN models matmul (matix muliply) problem. The problem was about the batch norm and dim (if the dim is 1 more, you need unsqueeze please check .shape())  

**5/17 1915** I attempted to download the YouTube AudioSet, but encountered several issues and was unsuccessful. As an alternative, I downloaded another dataset from AIHUB. However, I am uncertain if I will utilize it due to the extensive processing required. To make progress at this moment, I need to focus on implementing the Convolutional Neural Network (CNN) and signal processing.