## **CNN and Fine-Tuning with WiSARD and HDC**

#### **Green A.I. Assignment**
#### **Prof.:** Leandro Santiago
#### **Team:** Alessandra Gomes, Camila Alves, Camila Rocha and Sandy Cabral
#### **In this Notebook:** ####
- Implementation of the CNN from the paper "Emotion Recognition in Instrumental Music Using AI" in **PyTorch**. The paper is available at https://sol.sbc.org.br/index.php/bracis/article/view/33627 and its repository at https://github.com/Camila-Ferr/Emotional-Mapping
- Dataset available at: https://drive.google.com/drive/folders/19wQom3nrmOlWYhUS5Fjy150zbFaIJe_T?usp=drive_link
- Implementation of Fine-Tuning with WiSARD and HDC
- Comparison of metrics




# **Part 1: CNN Implementation in PyTorch**

The original implementation of the CNN from the paper "Emotion Recognition in Instrumental Music Using AI" was written in TensorFlow. To run the fine-tuning proposed in this work, the code was rewritten in PyTorch.

In [None]:
# imports

import os
import torch
import numpy as np

import torch.nn.functional as F

from torch import Tensor, nn, optim
from torchvision import transforms, models
from torch.utils.data import TensorDataset, DataLoader

from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

from PIL import Image

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# update with the path that matches your personal Drive
spectograms_path = '/content/.../Spectograms'

for file in os.listdir(spectograms_path):
    print(file)

happy
romantic
dramatic
aggressive
sad


In [None]:
# loads and prepares the images for training
def load_images_from_path_pytorch(path, label):
    images = []
    labels = []
    transform = transforms.Compose([
        # resize to the desired size
        transforms.Resize((224, 224)),
        # convert to a tensor and scale pixel values to [0, 1]
        transforms.ToTensor(),
        # standard ImageNet normalization
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])
    ])

    for file in os.listdir(path):
        if file.endswith('.png'):
            img_path = os.path.join(path, file)
            img = Image.open(img_path).convert('RGB')
            img_tensor = transform(img)
            images.append(img_tensor)
            labels.append(label)

    images_tensor = torch.stack(images)
    return images_tensor, labels


In [None]:
emotions = ["aggressive", "dramatic", "happy", "romantic", "sad"]

x_CNN_Aut = []
y_CNN_Aut = []

# load the images and labels and store them in lists
for emotion in emotions:
    images, labels = load_images_from_path_pytorch(f"{spectograms_path}/{emotion}", emotion)
    x_CNN_Aut.extend(images)
    y_CNN_Aut.extend(labels)

# convert the lists to numpy arrays (the image tensors are stacked first)
x_CNN_Aut = torch.stack(x_CNN_Aut).numpy()
y_CNN_Aut = np.array(y_CNN_Aut)

# encode the target class labels as integers
label_encoder = LabelEncoder()
y_CNN_Aut_encoded = label_encoder.fit_transform(y_CNN_Aut)

# split the dataset into training and test sets
x_train_CNN_Aut, x_test_CNN_Aut, y_train_CNN_Aut, y_test_CNN_Aut = train_test_split(x_CNN_Aut, y_CNN_Aut_encoded, stratify=y_CNN_Aut_encoded, test_size=0.2, random_state=0)

# no extra scaling is applied here: ToTensor already maps pixels to [0, 1]
# and Normalize standardizes them with the ImageNet statistics

# total number of classes in the problem
num_classes = len(emotions)

# wrap the training and test sets for use with DataLoader
x_train_tensor = torch.tensor(x_train_CNN_Aut, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train_CNN_Aut, dtype=torch.long)
x_test_tensor = torch.tensor(x_test_CNN_Aut, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test_CNN_Aut, dtype=torch.long)

train_dataset = TensorDataset(x_train_tensor, y_train_tensor)
test_dataset = TensorDataset(x_test_tensor, y_test_tensor)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

In [None]:
# hyperparameters used in the paper
units = 512
dropout = 0.30000000000000004
learning_rate = 0.0003209523930194174

In [None]:
vgg = models.vgg16(pretrained=True)

# freeze the convolutional layers
for param in vgg.features.parameters():
    param.requires_grad = False

# customize the VGG classifier
num_features = vgg.classifier[0].in_features
vgg.classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(num_features, units),
    nn.ReLU(),
    nn.Dropout(dropout),
    nn.Linear(units, num_classes)
)

vgg = vgg.to(device)

# optimizer and loss
optimizer = optim.Adam(vgg.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()

# training loop
vgg.train()
for epoch in range(30):
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = vgg(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

# evaluation
vgg.eval()
all_preds = []
all_labels = []
with torch.no_grad():
    for images, labels in test_loader:
        images = images.to(device)
        outputs = vgg(images)
        preds = torch.argmax(outputs, dim=1).cpu().numpy()
        all_preds.extend(preds)
        all_labels.extend(labels.numpy())

Downloading: "https://download.pytorch.org/models/vgg16-397923af.pth" to /root/.cache/torch/hub/checkpoints/vgg16-397923af.pth
100%|██████████| 528M/528M [00:10<00:00, 54.0MB/s]


In [None]:
# metrics obtained by the model trained according to the paper
print(classification_report(all_labels, all_preds, target_names=emotions))

              precision    recall  f1-score   support

  aggressive       0.94      0.96      0.95       100
    dramatic       0.93      0.91      0.92       100
       happy       1.00      0.95      0.97       100
    romantic       0.86      0.88      0.87       100
         sad       0.91      0.94      0.93       100

    accuracy                           0.93       500
   macro avg       0.93      0.93      0.93       500
weighted avg       0.93      0.93      0.93       500



In [None]:
# uncomment if needed

# save the model
#torch.save(vgg.state_dict(), 'vgg16_CNN_model_weights.pth')

# **Part 2: Fine-Tuning**

## **Last Convolutional Layer**

The standard VGG16 architecture is organized into two main blocks: features and classifier. The features block contains the convolutional and pooling layers, and the classifier block contains the model's fully connected layers. Since the features block is responsible for extracting the visual characteristics of the image, it is the one used in the fine-tuning implemented in this work.

Printing the features block of the VGG16 model shows that the last convolutional layer is the Conv2d at position 28, followed by a ReLU activation and a MaxPool pooling layer.

In [None]:
vgg.features

Sequential(
  (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): ReLU(inplace=True)
  (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (3): ReLU(inplace=True)
  (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (6): ReLU(inplace=True)
  (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (8): ReLU(inplace=True)
  (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (11): ReLU(inplace=True)
  (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (13): ReLU(inplace=True)
  (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (15): ReLU(inplace=True)
  (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (17): Conv2d(256, 512, kernel_si
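
Since the printed output stops partway through the block, the position of the last convolution can be cross-checked without loading the model. The sketch below rebuilds the layer order from the VGG16 channel configuration and locates the last `Conv2d`. The list `VGG16_CFG` is assumed to mirror the layout torchvision uses for `vgg16().features`, and the helper `layer_names` is a hypothetical name introduced only for this illustration.

```python
# Channel configuration of VGG16: numbers are conv output channels, "M" is a max-pool.
# Assumption: this mirrors the layout torchvision uses to build vgg16().features.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]


def layer_names(cfg):
    """Expand the configuration into the flat layer order of the features block."""
    names = []
    for item in cfg:
        if item == "M":
            names.append("MaxPool2d")
        else:
            # every convolution is followed by a ReLU
            names.extend(["Conv2d", "ReLU"])
    return names


names = layer_names(VGG16_CFG)
last_conv = max(i for i, name in enumerate(names) if name == "Conv2d")
print(last_conv, names[last_conv + 1:])
```

Under that assumption the last convolution sits at index 28 and is followed only by a ReLU and a MaxPool2d, which agrees with the description of the features block.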

In [None]:
# Uncomment to start execution from the saved model file

# load the model
#model_path = '/content/.../vgg16_CNN_model_weights.pth'

#model = models.vgg16()
#model.load_state_dict(torch.load(model_path, weights_only=True), strict=False)
#model.eval()

#vgg = models.vgg16(pretrained=True)
#for param in vgg.features.parameters():
#    param.requires_grad = False

#vgg = vgg.to(device)
# evaluation mode, without training the convolutional layers
#vgg.eval()

In [None]:
# extracts the feature vector from the convolutional block
# x: batch of images
def extract_features(x):
    with torch.no_grad():
        # runs the whole features block, i.e. the last convolutional layer
        # followed by its ReLU and MaxPool
        x = vgg.features(x)
        # global spatial average, reducing each map to 512x1x1
        x = F.adaptive_avg_pool2d(x, (1, 1))
        # flatten the feature vector to 1D per image
        features = torch.flatten(x, 1)
    return features
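
To make the pooling step concrete, here is a minimal sketch in plain Python of what a global spatial average does to the feature maps of one image. The helper `global_avg_pool` and the toy 2-channel input are hypothetical; the notebook itself relies on `F.adaptive_avg_pool2d`.

```python
def global_avg_pool(feature_maps):
    """Average each channel's H x W map down to one number.

    feature_maps: list of channels, each a list of rows (C x H x W).
    Returns a flat list with C values, one per channel.
    """
    pooled = []
    for channel in feature_maps:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


# toy example: 2 channels of size 2x2 (VGG16 yields 512 channels per image)
toy = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[0.0, 0.0], [2.0, 2.0]],
]
print(global_avg_pool(toy))
```

Each channel collapses to a single value, so 512 channels give one 512-dimensional vector per image.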

## **Fine-Tuning with WiSARD**

In [None]:
# install torchwnn
!pip install torchwnn


In [None]:
from torchwnn.classifiers import Wisard
from torchwnn.encoding import Thermometer

In [None]:
# feature extraction for the images in the training set
all_features = []
all_labels = []

for images, labels in train_loader:
    images = images.to(device)
    with torch.no_grad():
        features = extract_features(images)
    # list of tensors
    all_features.append(features.cpu())
    all_labels.append(labels.cpu())

# feature extraction for the images in the test set
all_features_t = []
all_labels_t = []

for images, labels in test_loader:
    images = images.to(device)
    with torch.no_grad():
        features = extract_features(images)
    # list of tensors
    all_features_t.append(features.cpu())
    all_labels_t.append(labels.cpu())

In [None]:
# concatenate the batches of the training data
X_train_tensor = torch.cat(all_features, dim=0)
y_train_tensor = torch.cat(all_labels, dim=0)

print(X_train_tensor.shape)

# concatenate the batches of the test data
X_test_tensor = torch.cat(all_features_t, dim=0)
y_test_tensor = torch.cat(all_labels_t, dim=0)

print(X_test_tensor.shape)

torch.Size([2000, 512])
torch.Size([500, 512])


In [None]:
# thermometer encoding
bits_encoding = 16
encoding = Thermometer(bits_encoding).fit(X_train_tensor)
X_train_bin = encoding.binarize(X_train_tensor).flatten(start_dim=1)

# note: a second encoder is fit on the test set here; reusing `encoding`
# (fit on the training set) would binarize both sets with the same thresholds
encoding2 = Thermometer(bits_encoding).fit(X_test_tensor)
X_test_bin = encoding2.binarize(X_test_tensor).flatten(start_dim=1)
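
As an illustration of the idea behind thermometer encoding, the sketch below uses plain Python with evenly spaced thresholds. This is an assumption made for the example: `fit_thresholds` and `thermometer` are hypothetical helpers, and the `Thermometer` class from torchwnn may choose its thresholds differently.

```python
def fit_thresholds(values, bits):
    """Evenly spaced thresholds strictly between the min and max of the fitted values."""
    low, high = min(values), max(values)
    step = (high - low) / (bits + 1)
    return [low + step * (k + 1) for k in range(bits)]


def thermometer(value, thresholds):
    """One bit per threshold: 1 if the value exceeds that threshold, else 0."""
    return [1 if value > t else 0 for t in thresholds]


train_values = [0.0, 1.0, 2.0, 5.0]                 # toy feature column
thresholds = fit_thresholds(train_values, bits=4)    # fitted once, on training data

# the same thresholds are reused for any later value, including unseen ones
print(thermometer(0.5, thresholds))
print(thermometer(2.5, thresholds))
print(thermometer(9.0, thresholds))
```

Larger values switch on more leading bits. Because the thresholds are fitted once and then reused, equal feature values always map to the same bit pattern.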

In [None]:
# train the WiSARD model and predict

entry_size = X_train_bin.shape[1]
tuple_size = 32

y_train_tensor = y_train_tensor.to(torch.int32)
X_train_bin = X_train_bin.to(torch.int32)

model_w = Wisard(entry_size, num_classes, tuple_size)
model_w.fit(X_train_bin, y_train_tensor)

predictions_w = model_w.predict(X_test_bin)
predictions_w

tensor([3, 1, 2, 0, 3, 2, 1, 4, 1, 3, 2, 0, 0, 2, 3, 2, 3, 4, 2, 1, 0, 2, 0, 1,
        2, 1, 3, 3, 3, 1, 4, 0, 4, 0, 1, 0, 0, 4, 4, 0, 1, 2, 2, 2, 1, 4, 4, 2,
        4, 0, 3, 1, 2, 0, 1, 3, 1, 2, 3, 1, 3, 4, 4, 1, 3, 0, 1, 0, 0, 2, 0, 1,
        2, 3, 3, 1, 1, 0, 1, 3, 4, 3, 4, 2, 0, 2, 4, 4, 0, 4, 2, 3, 1, 0, 4, 2,
        1, 4, 0, 2, 0, 1, 2, 0, 1, 1, 1, 1, 0, 3, 2, 3, 1, 4, 3, 3, 3, 2, 0, 0,
        0, 3, 2, 0, 3, 4, 4, 4, 4, 3, 0, 4, 1, 3, 1, 0, 4, 3, 1, 0, 2, 0, 3, 4,
        3, 1, 0, 1, 0, 3, 0, 2, 4, 2, 0, 2, 3, 4, 4, 0, 2, 0, 0, 1, 3, 0, 4, 2,
        2, 3, 3, 4, 1, 3, 4, 4, 4, 0, 4, 1, 4, 3, 0, 1, 2, 2, 0, 2, 0, 0, 4, 1,
        3, 0, 4, 1, 1, 3, 4, 1, 2, 4, 4, 3, 3, 0, 0, 2, 4, 3, 3, 0, 2, 0, 2, 0,
        0, 1, 2, 4, 4, 3, 2, 3, 3, 0, 0, 0, 2, 1, 2, 0, 3, 2, 1, 0, 0, 4, 0, 3,
        3, 2, 4, 0, 1, 1, 3, 1, 2, 1, 0, 0, 3, 3, 0, 2, 0, 0, 4, 3, 4, 0, 0, 2,
        0, 4, 0, 1, 4, 4, 2, 0, 0, 2, 4, 2, 4, 0, 3, 3, 3, 2, 1, 2, 0, 3, 0, 0,
        4, 4, 3, 2, 0, 3, 1, 2, 2, 3, 4,

In [None]:
# generate the classification report
print(classification_report(y_test_tensor, predictions_w, target_names=emotions))

              precision    recall  f1-score   support

  aggressive       0.51      0.58      0.54       100
    dramatic       0.50      0.45      0.47       100
       happy       0.63      0.63      0.63       100
    romantic       0.44      0.48      0.46       100
         sad       0.58      0.51      0.54       100

    accuracy                           0.53       500
   macro avg       0.53      0.53      0.53       500
weighted avg       0.53      0.53      0.53       500
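
For readers unfamiliar with WiSARD, the following is a toy sketch of the mechanism in plain Python. It is not the torchwnn implementation: the class `TinyWisard`, its seed parameter and its set-based RAMs are assumptions made for illustration. Each class owns one discriminator, each discriminator is a list of RAMs, and each RAM is addressed by a tuple of input bits.

```python
import random


class TinyWisard:
    """Toy WiSARD: one discriminator per class, each a list of RAMs (sets of seen addresses)."""

    def __init__(self, entry_size, n_classes, tuple_size, seed=0):
        assert entry_size % tuple_size == 0, "toy version: no padding of the last tuple"
        order = list(range(entry_size))
        random.Random(seed).shuffle(order)  # fixed pseudo-random bit mapping
        self.tuples = [order[i:i + tuple_size] for i in range(0, entry_size, tuple_size)]
        self.rams = [[set() for _ in self.tuples] for _ in range(n_classes)]

    def _addresses(self, bits):
        # each tuple of input bits forms the address looked up in one RAM
        return [tuple(bits[i] for i in idx) for idx in self.tuples]

    def fit(self, samples, labels):
        for bits, label in zip(samples, labels):
            for ram, addr in zip(self.rams[label], self._addresses(bits)):
                ram.add(addr)  # mark this address as seen for the class

    def predict(self, samples):
        preds = []
        for bits in samples:
            addrs = self._addresses(bits)
            # score of a class = number of its RAMs that have seen the address
            scores = [sum(a in ram for ram, a in zip(rams, addrs)) for rams in self.rams]
            preds.append(scores.index(max(scores)))
        return preds


patterns = [[1, 1, 1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1, 1, 1]]
model = TinyWisard(entry_size=8, n_classes=2, tuple_size=2)
model.fit(patterns, [0, 1])
print(model.predict(patterns))
```

In this sketch the tuple size sets how many bits address each RAM. Larger tuples make each RAM more selective, which is the trade-off behind the choice of `tuple_size` in the notebook.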



## **Fine-Tuning with HDC**

In [None]:
# torchhd is published on PyPI as torch-hd; binhd pulls it in as a dependency
!pip install binhd

Collecting binhd
  Downloading binhd-1.0.0a0-py3-none-any.whl.metadata (3.4 kB)
Collecting torch-hd (from binhd)
  Downloading torch_hd-5.8.4-py3-none-any.whl.metadata (10 kB)
Downloading binhd-1.0.0a0-py3-none-any.whl (14 kB)
Downloading torch_hd-5.8.4-py3-none-any.whl (360 kB)
Installing collected packages: torch-hd, binhd
Successfully installed binhd-1.0.0a0 torch-hd-5.8.4


In [None]:
import torchhd
from torchhd import embeddings

from binhd.classifiers import BinHD
from binhd.embeddings import ScatterCode

In [None]:
# hyperparameter definitions

dimension = 1000
num_levels = 100

min_val = X_train_tensor.min().item()
max_val = X_train_tensor.max().item()
print(f"Feature min: {min_val}, max: {max_val}")

Feature min: 0.0, max: 5.7164740562438965


In [None]:
# RecordEncoder class, modified to handle running on cpu or cuda
class RecordEncoder(nn.Module):
    def __init__(self, out_features, size, levels, low, high, device=None):
        super(RecordEncoder, self).__init__()
        self.device = device or torch.device('cuda' if torch.cuda.is_available() else 'cpu')

        self.position = embeddings.Random(size, out_features, vsa="BSC", dtype=torch.uint8)
        self.value = ScatterCode(levels, out_features, low=low, high=high)

    def forward(self, x):
        # move the input to the device
        x = x.to(self.device)

        sample_hv = torchhd.bind(self.position.weight, self.value(x))
        sample_hv = torchhd.multiset(sample_hv)
        return sample_hv
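
The record-based encoding used by `RecordEncoder` can be pictured with a small plain-Python sketch: each feature position has its own binary hypervector, it is bound (XOR) to the hypervector of the feature's quantized level, and the bound vectors are bundled by majority vote. All helper names here are hypothetical. Random level vectors stand in for `ScatterCode`, whose construction is not reproduced, and the tie-breaking rule is an assumption of this sketch.

```python
import random


def random_hv(dim, rng):
    """Random binary hypervector of length dim."""
    return [rng.randint(0, 1) for _ in range(dim)]


def bind(a, b):
    """Binary binding: element-wise XOR."""
    return [x ^ y for x, y in zip(a, b)]


def bundle(hvs):
    """Majority vote per dimension (ties resolved to 0 in this sketch)."""
    half = len(hvs) / 2
    return [1 if sum(col) > half else 0 for col in zip(*hvs)]


def record_encode(levels_per_feature, positions, level_hvs):
    """Bind each feature's position with the hypervector of its level, then bundle."""
    bound = [bind(positions[i], level_hvs[lvl]) for i, lvl in enumerate(levels_per_feature)]
    return bundle(bound)


rng = random.Random(0)
dim, n_features, n_levels = 16, 3, 4
positions = [random_hv(dim, rng) for _ in range(n_features)]
level_hvs = [random_hv(dim, rng) for _ in range(n_levels)]  # stand-in for ScatterCode

sample_hv = record_encode([0, 3, 1], positions, level_hvs)
print(len(sample_hv))
```

XOR binding is its own inverse, so binding a result with the same position vector again recovers the level vector. This is what lets position and value be combined without losing either.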

In [None]:
# Record-based encoding
record_encode = RecordEncoder(dimension, X_train_tensor.shape[1], num_levels, min_val, max_val, device=device)
record_encode = record_encode.to(device)

X_train_tensor = X_train_tensor.to(device)
y_train_tensor = y_train_tensor.to(device)

X_test_tensor = X_test_tensor.to(device)
y_test_tensor = y_test_tensor.to(device)

In [None]:
# device-related adjustments to the BinHD class, since some tensors created by the class were stored on the cpu
class BinHD2(nn.Module):
    def __init__(
        self,
        n_dimensions: int,
        n_classes: int,
        *,
        epochs: int = 30,
        device: torch.device = None,
    ) -> None:
        super().__init__()

        self.device = device or torch.device('cpu')

        self.n_dimensions = n_dimensions
        self.n_classes = n_classes
        self.epochs = epochs
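        # note: int8 only holds values in [-128, 127], so these per-class counters
        # can wrap around once enough samples are accumulated
        self.classes_counter = torch.empty((n_classes, n_dimensions), device=self.device, dtype=torch.int8)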
        self.classes_hv = None
        self.reset_parameters()

    def reset_parameters(self) -> None:
        nn.init.zeros_(self.classes_counter)

    def fit(self, input: Tensor, target: Tensor):
        input = input.to(self.device)
        target = target.to(self.device)
        input = 2 * input - 1
        self.classes_counter.index_add_(0, target, input)
        self.classes_hv = self.classes_counter.clamp(min=0, max=1)

    def fit_adapt(self, input: Tensor, target: Tensor):
        for _ in range(self.epochs):
            self.adapt(input, target)

    def adapt(self, input: Tensor, target: Tensor):
        input = input.to(self.device)
        target = target.to(self.device)

        pred = self.predict(input)
        is_wrong = target != pred

        if is_wrong.sum().item() == 0:
            return

        input = input[is_wrong]
        input = 2 * input - 1
        target = target[is_wrong]
        pred = pred[is_wrong]

        self.classes_counter.index_add_(0, target, input, alpha=1)
        self.classes_counter.index_add_(0, pred, input, alpha=-1)
        self.classes_hv = torch.where(self.classes_counter >= 0, 1, 0)

    def forward(self, samples: Tensor) -> Tensor:
        samples = samples.to(self.device)
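        # note: a Hamming distance can be as large as n_dimensions, which does not
        # fit in int8 when n_dimensions > 127
        response = torch.empty((self.n_classes, samples.shape[0]), dtype=torch.int8, device=self.device)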

        for i in range(self.n_classes):
            response[i] = torch.sum(torch.bitwise_xor(samples, self.classes_hv[i]), dim=1)

        return response.transpose_(0, 1)

    def predict(self, samples: Tensor) -> Tensor:
        samples = samples.to(self.device)
        return torch.argmin(self(samples), dim=-1)
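
The prediction rule in `forward` and `predict` is a nearest-class search under Hamming distance. The plain-Python sketch below shows that rule and, separately, what happens to a distance once it is stored in a signed 8-bit slot. The helpers `hamming`, `predict` and `as_int8` are hypothetical, and the wrap-around behavior shown is that of a C-style cast; whether the int8 tensors in `BinHD2` behave the same way is an assumption worth checking.

```python
import ctypes


def hamming(a, b):
    """Number of positions where two binary vectors differ."""
    return sum(x ^ y for x, y in zip(a, b))


def predict(sample, class_hvs):
    """Index of the class hypervector closest to the sample."""
    distances = [hamming(sample, hv) for hv in class_hvs]
    return distances.index(min(distances))


def as_int8(value):
    """Value after being stored in a signed 8-bit slot (wraps around like a C cast)."""
    return ctypes.c_int8(value).value


class_hvs = [[0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1]]
print(predict([1, 1, 0, 1, 1, 1], class_hvs))

# With dimension = 1000 a distance can reach 1000, outside the int8 range.
print(as_int8(100), as_int8(500), as_int8(1000))
```

If the same wrap-around applies to the tensors above, a wider integer type such as `torch.int32` for the distance buffer and the class counters would be one way to rule it out as a source of error.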


In [None]:
model_bhd = BinHD2(dimension, num_classes)
model_bhd.to(device)

with torch.no_grad():

    X_train_record = record_encode(X_train_tensor).to(device)
    X_test_record = record_encode(X_test_tensor).to(device)

    #X_train_record.is_cuda = True
    #y_train_tensor.is_cuda = True
    model_bhd.fit(X_train_record,y_train_tensor)

    predictions_bhd = model_bhd.predict(X_test_record)

In [None]:
# generate the classification report
print(classification_report(y_test_tensor.cpu(), predictions_bhd.cpu(), target_names=emotions))

              precision    recall  f1-score   support

  aggressive       0.19      0.18      0.19       100
    dramatic       0.21      0.54      0.30       100
       happy       0.20      0.09      0.12       100
    romantic       0.21      0.20      0.21       100
         sad       0.25      0.01      0.02       100

    accuracy                           0.20       500
   macro avg       0.21      0.20      0.17       500
weighted avg       0.21      0.20      0.17       500



# **Part 3: Comparative Analysis**

<table>
<tr>
<td><b>Model</b></td>
<td><b>Overall Accuracy</b></td>
<td><b>Mean Precision</b></td>
<td><b>Mean Recall</b></td>
<td><b>Mean F1-Score</b></td>
</tr>
<tr>
<td><b>CNN</b></td>
<td>93.00%</td>
<td>92.80%</td>
<td>92.80%</td>
<td>92.80%</td>
</tr>
<tr>
<td><b>WiSARD</b></td>
<td>53.00%</td>
<td>53.20%</td>
<td>53.00%</td>
<td>52.80%</td>
</tr>
<tr>
<td><b>BinHD</b></td>
<td>20.00%</td>
<td>21.20%</td>
<td>20.40%</td>
<td>16.80%</td>
</tr>
</table>
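
The mean values in the table are unweighted averages of the per-class figures. As a cross-check, the sketch below recomputes them from the per-class precision, recall and F1 values printed in the three classification reports; the dictionary layout and the helper `macro_percent` are only illustrative.

```python
# per-class values copied from the classification reports
# (class order: aggressive, dramatic, happy, romantic, sad)
reports = {
    "CNN": {
        "precision": [0.94, 0.93, 1.00, 0.86, 0.91],
        "recall": [0.96, 0.91, 0.95, 0.88, 0.94],
        "f1": [0.95, 0.92, 0.97, 0.87, 0.93],
    },
    "WiSARD": {
        "precision": [0.51, 0.50, 0.63, 0.44, 0.58],
        "recall": [0.58, 0.45, 0.63, 0.48, 0.51],
        "f1": [0.54, 0.47, 0.63, 0.46, 0.54],
    },
    "BinHD": {
        "precision": [0.19, 0.21, 0.20, 0.21, 0.25],
        "recall": [0.18, 0.54, 0.09, 0.20, 0.01],
        "f1": [0.19, 0.30, 0.12, 0.21, 0.02],
    },
}


def macro_percent(values):
    """Unweighted mean of the per-class values, as a percentage with 2 decimals."""
    return round(100 * sum(values) / len(values), 2)


summary = {
    model: {metric: macro_percent(vals) for metric, vals in metrics.items()}
    for model, metrics in reports.items()
}
for model, row in summary.items():
    print(model, row)
```

Because the per-class values are already rounded to two decimals in the reports, these averages can differ slightly from ones computed on the raw predictions.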

From the comparison table, the following can be observed in general:

**CNN:** The best-performing model. It achieved the highest values on every metric, indicating very good performance on the task of classifying spectrogram images.

**WiSARD:** A model with middling performance, with metrics significantly below the CNN's, at around 53%.

**BinHD:** The worst-performing model, with very low metrics, indicating that it failed to generalize well enough to carry out the classification task.

More specifically:

The **CNN model** performs best on the "happy" and "aggressive" classes, with all metrics above 92%. In addition, the high F1-Score values indicate a good balance between precision and recall.

The middling performance of the **WiSARD model**, with precision and recall at around 53%, indicates that the model captures some relevant characteristics, but not enough to generalize well.

The very low metrics of the **BinHD model** (20% accuracy, the level expected from random guessing among five balanced classes) indicate that the model may not have captured the characteristics of the data and therefore cannot properly distinguish the five classes of the task. This problem may be due to flaws in the training setup.

