## Real-Time Sound Classification using PyTorch


This project aims to develop an artificial intelligence system capable of classifying live sounds using the PyTorch framework. The goal is to build a simple yet effective sound classification model that can accurately identify different types of sounds in real-time.

In [2]:
from MODELS.ai_0014_MARK1 import SoundClassifier_MARK1
from MODELS.ai_0014_MARK2 import SoundClassifier_MARK2
from MODELS.ai_0014_MARK3 import SoundClassifier_MARK3
from GPU_torch import GPU

In [3]:
device = GPU()
print("current device : ",device)
print(device)

Apple device detected
Activating Apple Silicon GPU
current device :  mps
mps


In [4]:
# I don't know what is wrong with the cell above
import torch
import subprocess

def GPU():
    if torch.cuda.is_available():
        device = 'cuda'
        templist = [1, 2, 3]
        templist = torch.FloatTensor(templist).to(device)
        print("Cuda torch working : ", end="")
        print(templist.is_cuda)
        print("current device no. : ", end="")
        print(torch.cuda.current_device())
        print("GPU device count : ", end="")
        print(torch.cuda.device_count())
        print("GPU name : ", end="")
        print(torch.cuda.get_device_name(0))
        print("device : ", device)
        # Execute the nvidia-smi command using subprocess
        try:
            output = subprocess.check_output(['nvidia-smi']).decode('utf-8')
            print("nvidia-smi output:")
            print(output)
        except (subprocess.CalledProcessError, FileNotFoundError) as e:
            print("Error executing nvidia-smi command:", str(e))
    elif torch.backends.mps.is_available():
        print("Apple device detected\nActivating Apple Silicon GPU")
        device = torch.device("mps")
    else:
        print("Can't use GPU, activating CPU")
        device = 'cpu'

    return device
device = GPU()
print(device)

Apple device detected
Activating Apple Silicon GPU
mps


In [5]:
'''SEED Everything'''
import torch
import random
import numpy as np
def seed_everything(SEED=42):
    random.seed(SEED)
    np.random.seed(SEED)
    torch.manual_seed(SEED)
    torch.cuda.manual_seed(SEED)
    torch.cuda.manual_seed_all(SEED)
    torch.backends.cudnn.benchmark = True  # keep True if all inputs have the same size; this favors speed over exact reproducibility
SEED=42
seed_everything(SEED=SEED)

In [6]:
try:
    import soundata

    dataset = soundata.initialize('urbansound8k')
    #dataset.download()  # download the dataset
    #dataset.validate()  # validate that all the expected files are there

    example_clip = dataset.choice_clip()  # choose a random example clip
    print(example_clip)  # see the available data
except Exception as e:  # e.g. soundata is not installed or the dataset is missing
    print("SKIP:", e)

Clip(
  audio_path="/Users/cafalena/sound_datasets/urbansound8k/audio/fold3/65750-3-3-48.wav",
  clip_id="65750-3-3-48",
  audio: The clip's audio
            * np.ndarray - audio signal
            * float - sample rate,
  class_id: The clip's class id.
            * int - integer representation of the class label (0-9). See Dataset Info in the documentation for mapping,
  class_label: The clip's class label.
            * str - string class name: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, street_music,
  fold: The clip's fold.
            * int - fold number (1-10) to which this clip is allocated. Use these folds for cross validation,
  freesound_end_time: The clip's end time in Freesound.
            * float - end time in seconds of the clip in the original freesound recording,
  freesound_id: The clip's Freesound ID.
            * str - ID of the freesound.org recording from which this clip was taken,
  freesound_sta

In [7]:
import matplotlib.pyplot as plt
import os
import torch
import librosa
import numpy as np
import pandas as pd
from pydub import AudioSegment

class UrbanSoundDataset(torch.utils.data.Dataset):  
    def __init__(self, annotations, audio_dir):  
        if isinstance(annotations, pd.DataFrame):  
            self.annotations = annotations  
        else:
            self.annotations = pd.read_csv(annotations)
        self.audio_dir = audio_dir  # add this line to save audio directory

    def __len__(self):  
        return len(self.annotations)  

    def __getitem__(self, index):
        row = self.annotations.loc[index]
        audio_path = os.path.join(self.audio_dir, "fold" + str(row['fold']), row['slice_file_name'])
        class_id = int(row['classID'])
        audio, _ = librosa.load(audio_path, sr=44100, mono=True)  # resample to a standard sampling rate

        # Handle variable-length audio files
        fixed_length = 44100 * 4  # 4 seconds, so a batch has shape (batch, 44100*4)
        if len(audio) < fixed_length:
            audio = np.pad(audio, (0, fixed_length - len(audio)))  # pad with zeros (zeros are silence)
        elif len(audio) > fixed_length:
            audio = audio[:fixed_length]  # trim to fixed length
        # Return the raw waveform; no Mel spectrogram transform is applied yet
        return torch.from_numpy(audio), torch.tensor([class_id])



#dataset = UrbanSoundDataset()
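
The padding and trimming step in `__getitem__` can be checked in isolation, without loading any audio. The sketch below is a minimal NumPy-only version of that step; the helper name `fix_length` and the synthetic clips are hypothetical and are not part of the notebook.

```python
import numpy as np

def fix_length(audio, target_len):
    """Zero-pad or trim a 1-D waveform so it has exactly target_len samples."""
    if len(audio) < target_len:
        return np.pad(audio, (0, target_len - len(audio)))  # zeros are silence
    return audio[:target_len]

sr = 44100
short_clip = np.ones(sr * 1, dtype=np.float32)  # 1 second of dummy audio
long_clip = np.ones(sr * 6, dtype=np.float32)   # 6 seconds of dummy audio

padded = fix_length(short_clip, sr * 4)
trimmed = fix_length(long_clip, sr * 4)
print(padded.shape, trimmed.shape)  # (176400,) (176400,)
```

Both clips come out with 176,400 samples, which is the `input_size` the model is built with.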


## Preprocessing (not applied)

In [8]:
# Preprocessing: to be applied later

def calc_fft(y, rate):
    n = len(y)
    freq = np.fft.rfftfreq(n, d=1/rate)
    Y = abs(np.fft.rfft(y)/n)
    return Y, freq

def plot_signal_fft(signal, rate):
    fig, axs = plt.subplots(2, 1, figsize=(20, 10))
    axs[0].plot(signal)
    axs[0].set_title('Signal')
    Y, freq = calc_fft(signal, rate)
    axs[1].plot(freq, Y)
    axs[1].set_title('FFT')
    plt.show()

def calc_spectrogram(signal, rate):
    n_fft = 2048
    hop_length = 512
    spectrogram = librosa.stft(signal, n_fft=n_fft, hop_length=hop_length)
    spectrogram = np.abs(spectrogram)
    log_spectrogram = librosa.amplitude_to_db(spectrogram)
    return log_spectrogram

def plot_spectrogram(signal, rate):
    log_spectrogram = calc_spectrogram(signal, rate)
    fig, axs = plt.subplots(1, 1, figsize=(20, 10))
    axs.imshow(log_spectrogram, aspect='auto', origin='lower', cmap='jet')
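
A quick way to sanity-check `calc_fft` is to feed it a signal whose spectrum is known in advance. The sketch below repeats the function so it runs on its own and uses a synthetic 440 Hz sine as input; the tone, its one-second duration, and the variable names are assumptions for illustration only.

```python
import numpy as np

def calc_fft(y, rate):
    n = len(y)
    freq = np.fft.rfftfreq(n, d=1 / rate)  # frequency of each bin in Hz
    Y = abs(np.fft.rfft(y) / n)            # magnitude, normalized by length
    return Y, freq

rate = 44100
t = np.arange(rate) / rate                 # exactly 1 second of sample times
tone = np.sin(2 * np.pi * 440.0 * t)       # 440 Hz test tone

Y, freq = calc_fft(tone, rate)
peak_hz = freq[np.argmax(Y)]
print(round(float(peak_hz)))  # 440
```

The peak height is about 0.5 rather than 1 because dividing by `n` splits the energy of a unit-amplitude sine between its positive and negative frequencies, and `rfft` keeps only the positive half.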


## Train/Validation/Test

In [9]:


from sklearn.model_selection import train_test_split

# Load the dataset
try:
    csvdataset = pd.read_csv('/Users/cafalena/sound_datasets/urbansound8k/UrbanSound8K/metadata/UrbanSound8K.csv')
except FileNotFoundError:
    # Fall back to the Windows path when the macOS path does not exist
    csvdataset = pd.read_csv('C:/Users/PC/AppData/@FOLDER/@Project/UrbanSound8K/metadata/UrbanSound8K.csv')



# Split the dataset into 80% training and 20% temporary
train_data, temp = train_test_split(csvdataset, test_size=0.2, random_state=42)

# Reset index
train_data = train_data.reset_index(drop=True)

# Split the temporary set into 50% validation and 50% testing
validation_data, test_data = train_test_split(temp, test_size=0.5, random_state=42)

# Reset index
validation_data = validation_data.reset_index(drop=True)
test_data = test_data.reset_index(drop=True)



# Now, `train_data` is your training set (80% of total), 
# `validation_data` is your validation set (10% of total), and 
# `test_data` is your testing set (10% of total).
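
The `reset_index(drop=True)` calls matter because `UrbanSoundDataset.__getitem__` reads rows with `.loc[index]`, which looks rows up by label, while the `DataLoader` asks for positions `0..len-1`. The sketch below illustrates the mismatch on a tiny made-up frame; the file names and the hand-picked subset stand in for a shuffled split and are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "slice_file_name": ["a.wav", "b.wav", "c.wav", "d.wav"],
    "fold": [1, 2, 3, 4],
})

# Stand-in for a shuffled split: keep original rows 3 and 1, in that order.
subset = df.iloc[[3, 1]]

# .loc is label-based, and label 0 did not survive the split.
try:
    subset.loc[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# After reset_index(drop=True) the labels are 0..len-1 again.
clean = subset.reset_index(drop=True)
first_file = clean.loc[0, "slice_file_name"]
print(lookup_failed, first_file)  # True d.wav
```

Without the reset, an index requested by the loader either raises a `KeyError` or silently returns a row that belongs to a different split.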

### Dataset

In [10]:

BATCHSIZE = 4


audio_dir = "/Users/cafalena/sound_datasets/urbansound8k/UrbanSound8K/audio"

train_set = UrbanSoundDataset(train_data, audio_dir)
val_set = UrbanSoundDataset(validation_data, audio_dir)
test_set = UrbanSoundDataset(test_data, audio_dir)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=BATCHSIZE, shuffle=True, drop_last=True)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=BATCHSIZE, shuffle=False)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=BATCHSIZE, shuffle=False)
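
The number of batches per epoch follows directly from the split sizes and `BATCHSIZE`. The arithmetic below is a sketch under two assumptions: that the metadata CSV has 8,732 rows, and that `train_test_split` rounds the test share up. The helper names are hypothetical.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test), rounding the test share up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test

def num_batches(n_samples, batch_size, drop_last=False):
    """Batches per epoch for a loader with the given settings."""
    if drop_last:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)

N_ROWS = 8732  # assumed row count of UrbanSound8K.csv
n_train, n_temp = split_sizes(N_ROWS, 0.2)
n_val, n_test = split_sizes(n_temp, 0.5)

print(n_train, n_val, n_test)                   # 6985 873 874
print(num_batches(n_train, 4, drop_last=True))  # 1746
print(num_batches(n_val, 4))                    # 219
print(num_batches(n_test, 4))                   # 219
```

These counts agree with the progress bars in the training and test cells (1746 and 219 steps). Note that 873 is not divisible by 4, so the last validation batch holds a single sample.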



### LR result

LR : 100  Epoch [1/10], Loss: 123043.2521
LR : 100  Epoch [2/10], Loss: 7324.8045
LR : 100  Epoch [3/10], Loss: 461.2951
LR : 100  Epoch [4/10], Loss: 100.8706
LR : 100  Epoch [5/10], Loss: 65.5783
LR : 100  Epoch [6/10], Loss: 346.9864
LR : 100  Epoch [7/10], Loss: 63.1574
LR : 100  Epoch [8/10], Loss: 56.8437
LR : 100  Epoch [9/10], Loss: 61.7810
LR : 100  Epoch [10/10], Loss: 68.3983
LR : 10  Epoch [1/10], Loss: 1080.9213
LR : 10  Epoch [2/10], Loss: 35.1927
LR : 10  Epoch [3/10], Loss: 4.9769
LR : 10  Epoch [4/10], Loss: 2.4054
LR : 10  Epoch [5/10], Loss: 3.5182
LR : 10  Epoch [6/10], Loss: 3.4323
LR : 10  Epoch [7/10], Loss: 3.0489
LR : 10  Epoch [8/10], Loss: 5.0180
LR : 10  Epoch [9/10], Loss: 2.3545
LR : 10  Epoch [10/10], Loss: 2.4156
LR : 1  Epoch [1/10], Loss: 13.9965
LR : 1  Epoch [2/10], Loss: 2.9778
LR : 1  Epoch [3/10], Loss: 2.3417
LR : 1  Epoch [4/10], Loss: 2.3440
LR : 1  Epoch [5/10], Loss: 2.3113
LR : 1  Epoch [6/10], Loss: 2.3460
LR : 1  Epoch [7/10], Loss: 2.2775
LR : 1  Epoch [8/10], Loss: 2.2869
LR : 1  Epoch [9/10], Loss: 2.2935
LR : 1  Epoch [10/10], Loss: 2.2680
LR : 0.1  Epoch [1/10], Loss: 3.1002
LR : 0.1  Epoch [2/10], Loss: 2.2569
LR : 0.1  Epoch [3/10], Loss: 2.2446
LR : 0.1  Epoch [4/10], Loss: 2.2292
LR : 0.1  Epoch [5/10], Loss: 2.2173
LR : 0.1  Epoch [6/10], Loss: 2.2035
LR : 0.1  Epoch [7/10], Loss: 2.1767
LR : 0.1  Epoch [8/10], Loss: 2.1629
LR : 0.1  Epoch [9/10], Loss: 2.146
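
A useful reference point for reading this log is the loss of a model that guesses uniformly. For cross-entropy over `num_classes` classes that value is `ln(num_classes)`. The short calculation below assumes 10 roughly balanced classes.

```python
import math

num_classes = 10
chance_loss = math.log(num_classes)  # cross-entropy of a uniform prediction
chance_accuracy = 1.0 / num_classes

print(f"{chance_loss:.4f}")      # 2.3026
print(f"{chance_accuracy:.0%}")  # 10%
```

Read against this baseline, the runs with LR 100 and 10 diverge, the run with LR 1 settles near chance level at about 2.3, and only the run with LR 0.1 keeps decreasing below it.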


In [11]:
#LR_list = [100,10,1,1e-1,1e-2,1e-3,1e-5,1e-7,1e-10]
# parameters

from tqdm import tqdm


LR_list = [1e-2]
NB_EPOCH = 3
num_classes = 10  # for UrbanSound8K dataset, there are 10 classes
input_size = 44100 * 4  # 44100Hz * 4 seconds
#audio_dir = "C:/Users/PC/AppData/@FOLDER/@Project/UrbanSound8K/audio"
audio_dir = "/Users/cafalena/sound_datasets/urbansound8k/UrbanSound8K/audio"
#Saving the best model
min_loss = float('inf')

for LR in LR_list:

    # TODO: change this so that it sweeps over several learning rates
    model = SoundClassifier_MARK2(input_size, num_classes)
    optimizer = torch.optim.Adam(model.parameters(), lr=LR)
    criterion = torch.nn.CrossEntropyLoss()

    model.to(device)
    for epoch in range(NB_EPOCH):
        model.train()  # re-enter training mode each epoch, because validation switches to eval()
        running_loss = 0.0
        for data, target in tqdm(train_loader): # tqdm
        #for data, target in (train_loader): #no tqdm
            data = data.unsqueeze(1)
            data = data.to(device)
            target = target.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target.squeeze(1))  # squeeze only the label dim so a batch of one stays 1-D
            loss.backward()
            optimizer.step()
            running_loss += loss.item() * data.size(0)
        epoch_loss = running_loss / len(train_set)
        print('LR : {}  Epoch [{}/{}], Loss: {:.4f}'.format(LR,epoch+1, NB_EPOCH, epoch_loss))
        # ------------------------------- Validation phase
        model.eval()
        val_running_loss = 0.0
        with torch.no_grad():
            for data, target in val_loader:
                data = data.unsqueeze(1)
                data = data.to(device)
                target = target.to(device)
                output = model(data)
                loss = criterion(output, target.squeeze(1))  # squeeze only the label dim so a batch of one stays 1-D
                val_running_loss += loss.item() * data.size(0)
        val_epoch_loss = val_running_loss / len(val_set)
        print('LR : {}  Epoch [{}/{}], Validation Loss: {:.4f}'.format(LR, epoch+1, NB_EPOCH, val_epoch_loss))

        # Save the BEST model if the current epoch validation loss is less than the minimum loss so far
        if val_epoch_loss < min_loss:
            print("Saving the best model with validation loss: {:.4f}".format(val_epoch_loss))
            min_loss = val_epoch_loss
            torch.save(model.state_dict(), "MARK2_best.pth")

  0%|          | 0/1746 [00:04<?, ?it/s]


KeyboardInterrupt: 

## Test

In [15]:
model.load_state_dict(torch.load("MARK2_best.pth"))
model.eval()

# I don't know why, but an error occurs if I use the GPU (mps), so fall back to the CPU
device = 'cpu'
model.to('cpu')
#model.to(device)

with torch.no_grad():
    correct = 0
    total = 0
    for data, target in tqdm(test_loader):
        #data = data.unsqueeze(0)
        data = data.to(device)
        target = target.to(device)
        output = model(data)
        # IMPORTANT: if your model doesn't use batchnorm, use torch.max(output.data, 0) instead
        _, predicted = torch.max(output.data, 1)
        total += target.size(0)
        correct += (predicted == target.squeeze()).sum().item()
    print('Accuracy of the model on the test set: {:.2f}%'.format(100 * correct / total))

1
2
3
4
5


100%|██████████| 219/219 [01:08<00:00,  3.20it/s]

Accuracy of the model on the test set: 12.81%





In [None]:
# BATCHSIZE = 64
# from tqdm import tqdm

# try:
#     testdataset = pd.read_csv('/Users/cafalena/sound_datasets/urbansound8k/UrbanSound8K/metadata/UrbanSound8K.csv')
# except:
#     testdataset = pd.read_csv("C:/Users/PC/AppData/@FOLDER/@Project/UrbanSound8K/metadata/UrbanSound8K.csv")
    
# num_classes = 10  # for UrbanSound8K dataset, there are 10 classes
# input_size = 44100 * 4  # 44100Hz * 4 seconds

# test_loader = UrbanSoundDataset(testdataset)
# train_loader = torch.utils.data.DataLoader(test_loader, batch_size=BATCHSIZE, shuffle=True)
# model = SoundClassifier_MARK2(input_size, num_classes)
# model.load_state_dict(torch.load("best_model.pth"))
# model.eval()

# #model.to('cpu')
# model.to(device)
# with torch.no_grad():
#     print(4)
#     correct = 0
#     total = 0
#     for data, target in tqdm(train_loader):
#         data = data.unsqueeze(0)
#         data = data.to(device)
#         target = target.to(device)
        
#         output = model(data)
#         _, predicted = torch.max(output.data,1)##IMPORTANT if your model dont use batchnorm, use max(output.data,0) instead
#         total += target.size(0)
#         correct += (predicted == target.squeeze()).sum().item()
#     print('Accuracy of the model on the validation set: {:.2f}%'.format(100 * correct / total))

#### Diary

**5/14 1300**: I have resolved the folder location issue, but now I am facing a new problem. The folder location and the sound file do not match. It's strange because the folder and the CSV files are fine. However, the `audio_path` is pointing to the wrong location.

**5/14 1310**: I noticed that the folder and the file name were slightly off, which indicates that it's not entirely random. So, I decided to avoid using split and shuffle. Surprisingly, it worked. It seems like the split function was causing the problem, but I will keep monitoring the situation. Although the location error still persists, I tried specifying the complete location path as "/Users/cafalena/sound_datasets/urbansound8k/UrbanSound8K/". This resolved the issue, and I observed that it recognized multiple sound files. However, now I am facing a tensor problem. I need to address this next.

**5/14 1752**: I was planning to use a CNN (Convolutional Neural Network), but I realized that sound waves are 1-dimensional. I'm struggling to figure out how to utilize a CNN with sound. Therefore, for now, I will stick with an NN (Neural Network). Once I successfully implement the NN, I can revisit using a CNN. Additionally, I need to work on the accuracy and test code sections.  
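
A raw waveform is 1-dimensional, but `torch.nn.Conv1d` convolves along exactly that axis, so a CNN can consume the waveform directly once a channel dimension is added. The following is only a sketch under assumptions: the class name `TinyConv1dClassifier` and all layer sizes are hypothetical and are not taken from the MARK models.

```python
import torch
import torch.nn as nn

class TinyConv1dClassifier(nn.Module):  # hypothetical, for illustration only
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=80, stride=16),  # wide first kernel on raw audio
            nn.BatchNorm1d(8),
            nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(8, 16, kernel_size=3, padding=1),
            nn.BatchNorm1d(16),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                     # collapse the time axis
        )
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, x):
        x = self.features(x)                    # (batch, 16, 1)
        return self.classifier(x.squeeze(-1))   # (batch, num_classes)

model = TinyConv1dClassifier()
model.eval()
with torch.no_grad():
    logits = model(torch.zeros(2, 1, 44100 * 4))  # (batch, channel, samples)
print(logits.shape)  # torch.Size([2, 10])
```

The other common route is to convert the waveform into a 2-D spectrogram first and apply a 2-D CNN to it, which is where the preprocessing functions above would come in.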

**5/15 1530**: I solved the matmul (matrix multiplication) problem in the NN model. The cause was the batch norm layer and the tensor dimensions: if the model expects one more dimension than the tensor has, you need `unsqueeze`. Check `.shape` to confirm.  
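
The shape bookkeeping behind that fix can be sketched with dummy tensors. The sizes below mirror this notebook (batch of 4, 4 seconds at 44.1 kHz); the label values are made up.

```python
import torch

batch = torch.zeros(4, 44100 * 4)            # loader output: (batch, samples)
with_channel = batch.unsqueeze(1)            # (batch, 1, samples)

target = torch.tensor([[3], [0], [7], [1]])  # loader output: (batch, 1)
labels = target.squeeze(1)                   # (batch,)

# With a single-sample batch, a bare squeeze() also removes the batch dim.
single = torch.tensor([[3]])
print(tuple(with_channel.shape), tuple(labels.shape))              # (4, 1, 176400) (4,)
print(tuple(single.squeeze().shape), tuple(single.squeeze(1).shape))  # () (1,)
```

Printing `.shape` before and after each step is usually the fastest way to locate a dimension mismatch.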

**5/17 1915**: I attempted to download the YouTube AudioSet, but encountered several issues and was unsuccessful. As an alternative, I downloaded another dataset from AIHUB. However, I am uncertain if I will utilize it due to the extensive processing required. To make progress at this moment, I need to focus on implementing the Convolutional Neural Network (CNN) and signal processing.