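- A Causal Dilated Convolution Autoencoder is used to obtain song embeddings that reflect melodic similarity.

- The goal of this model is to compress mel-spectrogram data into a song embedding in which songs with similar melodies lie close together.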

- The model is structured as follows:
    - Mel-Spectrogram -> Causal Dilated Convolution Encoder -> Layer -> Tanh -> Layer -> Causal Dilated Convolution Decoder -> Reconstructed Mel-Spectrogram
    - The Causal Dilated Convolution Encoder uses causal padding and dilated convolutions so that even a small number of layers can effectively learn the t - 1 causality (it compresses the data by learning from past frames).
    - Layer -> Tanh -> Layer is designed to represent the Mel-Spectrogram as a 128-dimensional mel embedding whose values lie between -1 and 1.
    - The Causal Dilated Convolution Decoder uses a reverse (time-flip) function, causal padding, and dilated convolutions so that even a small number of layers can effectively learn the t + 1 causality (it reconstructs the past by learning from future frames).
    - The model weights are updated based on the MSELoss between the Mel-Spectrogram and the Reconstructed Mel-Spectrogram.


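- After training, different-language versions of the same song (for example, the Japanese and Korean versions) come out as each other's most similar songs, which suggests that the model captures melodic similarity.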
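
The causal padding and dilated convolution described above can be illustrated with a small NumPy sketch. This is a single-channel re-implementation written only for illustration, not the PyTorch layers used in this notebook; the function name `causal_dilated_conv1d` and the toy signal are assumptions. It pads only on the left by `(kernel_size - 1) * dilation`, so the output keeps the input length and each output frame depends only on the current and past frames.

```python
import numpy as np

def causal_dilated_conv1d(x, weights, dilation):
    """Single-channel causal dilated convolution (cross-correlation).

    x:       1-D array of length T
    weights: 1-D array of length K (oldest tap first)
    Returns an array of length T where y[t] depends only on
    x[t], x[t - dilation], ..., x[t - (K - 1) * dilation].
    """
    k = len(weights)
    left_pad = (k - 1) * dilation                      # pad the past side only
    padded = np.concatenate([np.zeros(left_pad), x])
    y = np.zeros(len(x))
    for t in range(len(x)):
        for j in range(k):
            y[t] += weights[j] * padded[t + j * dilation]
    return y

rng = np.random.default_rng(0)
x = rng.normal(size=20)
w = np.array([0.5, -1.0, 2.0])
y = causal_dilated_conv1d(x, w, dilation=4)

# Perturb one "future" frame and recompute.
x_future = x.copy()
x_future[15] += 10.0
y_future = causal_dilated_conv1d(x_future, w, dilation=4)

same_before = np.allclose(y[:15], y_future[:15])       # outputs before frame 15 are untouched
changed_at = not np.isclose(y[15], y_future[15])       # frame 15 itself does change
print(len(y), same_before, changed_at)
```

Flipping the time axis before and after such a layer turns "depends only on the past" into "depends only on the future", which is the idea behind the decoder's reverse step.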

In [None]:
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
import random

import torch
import torch.nn as nn
import torch.nn.functional as F

import math
import pickle
import gc

import warnings
warnings.filterwarnings(action='ignore')

data_dir = '/content/drive/MyDrive/제 13회 투빅스 컨퍼런스 음악추천/Data/'
batch_data_dir = '/content/drive/MyDrive/제 13회 투빅스 컨퍼런스 음악추천/Batch_Data/'
model_dir = '/content/drive/MyDrive/제 13회 투빅스 컨퍼런스 음악추천/Model/'

# Data check

In [None]:
# Load the song metadata
song_meta_df = pd.read_json(data_dir + 'song_meta_data_v3.json')
song_meta_df = song_meta_df.sort_values('id')
song_meta_df = song_meta_df.reset_index(drop = True)
song_meta_df['song_embedding_idx'] = song_meta_df.index

In [None]:
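# Load the list of per-batch data files
import re
import os
file_list = os.listdir(batch_data_dir)

def get_int(x):
    # Keep only the digits of a file name, e.g. '12.npy' -> 12
    x = int(re.sub('[^0-9]', '', x))
    return x

# Skip names without digits (e.g. a previously saved 'mel_embeding_val.npy'), which would make int('') fail
file_list = sorted([get_int(x) for x in file_list if re.search('[0-9]', x)], reverse = True)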

In [None]:
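batch_name = file_list[0]  # largest batch number, since file_list is sorted in descending order
batch_li = [i for i in range(1, batch_name + 1)]
random.Random(22).shuffle(batch_li)  # fixed seed so the split is reproducible

train_batch_li = batch_li[:-100]  # all but the last 100 shuffled batches
val_batch_li = batch_li[-100:]    # 100 batches held out for validation
test_batch_li = sorted(batch_li)  # every batch in order, used later to extract embeddings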
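
The split above can be checked in isolation with a short sketch. The helper name `split_batches` and the batch count of 500 are hypothetical and chosen only for illustration; the seed and the 100-batch validation hold-out follow the cell above.

```python
import random

def split_batches(n_batches, n_val=100, seed=22):
    """Shuffle batch numbers 1..n_batches with a fixed seed and hold out the last n_val."""
    ids = list(range(1, n_batches + 1))
    random.Random(seed).shuffle(ids)      # private RNG, so the global random state is untouched
    return ids[:-n_val], ids[-n_val:], sorted(ids)

train_ids, val_ids, all_ids = split_batches(n_batches=500)   # 500 is a made-up batch count
print(len(train_ids), len(val_ids), len(all_ids))
```

Note that the third list contains every batch, including the training ones. In this notebook it is used to extract embeddings for all songs, not as a held-out test set.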

# Training settings

In [None]:
epochs = 500
batch_size = 128

In [None]:
# Use the GPU when available, otherwise fall back to the CPU
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(DEVICE)

In [None]:
def train(model, train_loader):
    # train_loader is a list of batch numbers; '<number>.npy' holds one pre-built mel batch.
    # optimizer and criterion are taken from the notebook's global scope.
    model.train()
    train_loss = 0

    for batch_name in tqdm(train_loader):
        mel = np.load(batch_data_dir + f'{batch_name}.npy')

        mel = torch.FloatTensor(mel).to(DEVICE)

        optimizer.zero_grad()

        encode, output = model(mel)

        # Reconstruction loss between the input and the reconstructed mel-spectrogram
        loss = criterion(output, mel)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()

    # Mean loss per batch
    train_loss /= (len(train_loader))

    return train_loss

def val(model, train_loader):
    # Same loop as train(), but without gradient tracking or weight updates.
    # Here train_loader is the list of validation batch numbers.
    model.eval()
    val_loss = 0

    with torch.no_grad():
        for batch_name in tqdm(train_loader):
            mel = np.load(batch_data_dir + f'{batch_name}.npy')

            mel = torch.FloatTensor(mel).to(DEVICE)

            encode, output = model(mel)

            loss = criterion(output, mel)

            val_loss += loss.item()

    val_loss /= (len(train_loader))

    return val_loss


def get_mel_embeding(model, train_loader):
    # Run the encoder over every listed batch and collect the 128-dimensional embeddings.
    model.eval()
    mel_embeding_li = []
    with torch.no_grad():
        for batch_name in tqdm(train_loader):
            mel = np.load(batch_data_dir + f'{batch_name}.npy')

            mel = torch.FloatTensor(mel).to(DEVICE)
            encode, output = model(mel)
            mel_embeding_li.append(encode.detach().cpu().numpy())

    return mel_embeding_li

In [None]:
# encoder <- predicts the future from past data (learns the t-1 causality)
# decoder <- predicts the past from future data (learns the t+1 causality)
class TimeAutoEncoder(nn.Module):
    def __init__(self):
        super(TimeAutoEncoder, self).__init__()
        self.conv1 = nn.Sequential(
            nn.Conv1d(in_channels = 48, out_channels = 512, kernel_size = 3, stride = 1, padding = 0, dilation = 1),
            nn.BatchNorm1d(512),
            nn.ReLU(),
        )

        self.conv2 = nn.Sequential(
            nn.Conv1d(in_channels = 512, out_channels = 256, kernel_size = 3, stride = 1, padding = 0, dilation = 2),
            nn.BatchNorm1d(256),
            nn.ReLU(),
        )

        self.conv3 = nn.Sequential(
            nn.Conv1d(in_channels = 256, out_channels = 128, kernel_size = 3, stride = 1, padding = 0, dilation = 4),
            nn.BatchNorm1d(128),
            nn.ReLU(),
        )

        self.conv4 = nn.Sequential(
            nn.Conv1d(in_channels = 128, out_channels = 64, kernel_size = 3, stride = 1, padding = 0, dilation = 8),
            nn.BatchNorm1d(64),
            nn.ReLU(),
        )

        self.conv5 = nn.Sequential(
            nn.Conv1d(in_channels = 64, out_channels = 32, kernel_size = 3, stride = 1, padding = 0, dilation = 16),
            nn.BatchNorm1d(32),
            nn.ReLU(),
        )

        self.conv6 = nn.Sequential(
            nn.Conv1d(in_channels = 32, out_channels = 16, kernel_size = 3, stride = 1, padding = 0, dilation = 32),
            nn.BatchNorm1d(16),
            nn.ReLU(),
        )

        self.conv7 = nn.Sequential(
            nn.Conv1d(in_channels = 16, out_channels = 8, kernel_size = 3, stride = 1, padding = 0, dilation = 64),
            nn.BatchNorm1d(8),
            nn.ReLU(),
        )

        self.encoder_fc = nn.Sequential(
            nn.Linear(8 * 1876, 128),
            nn.BatchNorm1d(128),
            nn.Tanh(),
        )
        
        self.decoder_fc = nn.Sequential(
            nn.Linear(128, 8 * 1876),
            nn.ReLU(),
        )

        self.t_conv1 = nn.Sequential(
            # nn.ConvTranspose1d(in_channels = 8, out_channels = 16, kernel_size  = 3, stride = 1, dilation=62),
            nn.Conv1d(in_channels = 8, out_channels = 16, kernel_size = 3, stride = 1, padding = 0, dilation = 64),
            nn.BatchNorm1d(16),
            nn.ReLU(),
        )

        self.t_conv2 = nn.Sequential(
            # nn.ConvTranspose1d(in_channels = 16, out_channels = 32, kernel_size  = 3, stride = 1, dilation = 30),
            nn.Conv1d(in_channels = 16, out_channels = 32, kernel_size = 3, stride = 1, padding = 0, dilation = 32),
            nn.BatchNorm1d(32),
            nn.ReLU(),
        )

        self.t_conv3 = nn.Sequential(
            # nn.ConvTranspose1d(in_channels = 32, out_channels = 64, kernel_size  = 3, stride = 1, dilation=14),
            nn.Conv1d(in_channels = 32, out_channels = 64, kernel_size = 3, stride = 1, padding = 0, dilation = 16),
            nn.BatchNorm1d(64),
            nn.ReLU(),
        )

        self.t_conv4 = nn.Sequential(
            # nn.ConvTranspose1d(in_channels = 64, out_channels = 128, kernel_size  = 3, stride = 1, dilation = 6),
            nn.Conv1d(in_channels = 64, out_channels = 128, kernel_size = 3, stride = 1, padding = 0, dilation = 8),
            nn.BatchNorm1d(128),
            nn.ReLU(),
        )

        self.t_conv5 = nn.Sequential(
            # nn.ConvTranspose1d(in_channels = 128, out_channels = 256, kernel_size  = 3, stride = 1, dilation=2),
            nn.Conv1d(in_channels = 128, out_channels = 256, kernel_size = 3, stride = 1, padding = 0, dilation = 4),
            nn.BatchNorm1d(256),
            nn.ReLU(),
        )

        self.t_conv6 = nn.Sequential(
            # nn.ConvTranspose1d(in_channels = 256, out_channels = 512, kernel_size  = 3, stride = 1, dilation = 1),
            nn.Conv1d(in_channels = 256, out_channels = 512, kernel_size = 3, stride = 1, padding = 0, dilation = 2),
            nn.BatchNorm1d(512),
            nn.ReLU(),
        )

        self.t_conv7 = nn.Sequential(
            # nn.ConvTranspose1d(in_channels = 512, out_channels = 48, kernel_size  = 3, stride = 1, dilation= 1),
            nn.Conv1d(in_channels = 512, out_channels = 48, kernel_size = 3, stride = 1, padding = 0, dilation = 1)
        )

    def forward(self, mel_spec):
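        # Left-pad the time axis by (kernel_size - 1) * dilation so each convolution is causal and keeps the length at 1876
        x = F.pad(mel_spec, pad = (2, 0, 0, 0))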
        x = self.conv1(x)
        # print(x.shape)
        x = F.pad(x, pad = (4, 0, 0, 0))
        x = self.conv2(x)
        # print(x.shape)
        x = F.pad(x, pad = (8, 0, 0, 0))
        x = self.conv3(x)
        # print(x.shape)
        x = F.pad(x, pad = (16, 0, 0, 0))
        x = self.conv4(x)
        # print(x.shape)
        x = F.pad(x, pad = (32, 0, 0, 0))
        x = self.conv5(x)
        # print(x.shape)
        x = F.pad(x, pad = (64, 0, 0, 0))
        x = self.conv6(x)
        # print(x.shape)
        x = F.pad(x, pad = (128, 0, 0, 0))
        x = self.conv7(x)
        # print(x.shape)
        encode = self.encoder_fc(x.view(-1, 8 * 1876))

        # print('decode')
        x = self.decoder_fc(encode)
        x = x.view(-1, 8, 1876)
        # Reverse the time axis so the causal decoder convolutions run from the future toward the past
        x = torch.flip(x, dims = [2])
        x = F.pad(x, pad = (128, 0, 0, 0))
        x = self.t_conv1(x)
        # print(x.shape)
        x = F.pad(x, pad = (64, 0, 0, 0))
        x = self.t_conv2(x)
        # print(x.shape)
        x = F.pad(x, pad = (32, 0, 0, 0))
        x = self.t_conv3(x)
        # print(x.shape)
        x = F.pad(x, pad = (16, 0, 0, 0))
        x = self.t_conv4(x)
        # print(x.shape)
        x = F.pad(x, pad = (8, 0, 0, 0))
        x = self.t_conv5(x)
        # print(x.shape)
        x = F.pad(x, pad = (4, 0, 0, 0))
        x = self.t_conv6(x)
        # print(x.shape)
        x = F.pad(x, pad = (2, 0, 0, 0))
        x = self.t_conv7(x)
        # print(x.shape)
        # Reverse the time axis again to restore the original frame order
        x = torch.flip(x, dims = [2])
        
        return encode, x
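
The padding values used in `forward` follow directly from the kernel size and the dilations, and the arithmetic can be checked without PyTorch. The sketch below is plain Python; the helper `conv_out_len` is a hypothetical name, and the numbers (kernel size 3, dilations 1 to 64, 48 mel bins, 1876 frames, 128-dimensional embedding) are read off the class above.

```python
kernel_size = 3
dilations = [1, 2, 4, 8, 16, 32, 64]      # conv1 .. conv7

# Left padding that makes each layer causal and length-preserving.
left_pads = [(kernel_size - 1) * d for d in dilations]

# Number of past frames (including the current one) that one output frame of conv7 can see.
receptive_field = 1 + sum(left_pads)

def conv_out_len(length, pad_left, k, d):
    """Output length of a stride-1 Conv1d with no built-in padding after left-padding the input."""
    return length + pad_left - d * (k - 1)

seq_len = 1876
for d, p in zip(dilations, left_pads):
    seq_len = conv_out_len(seq_len, p, kernel_size, d)

input_values = 48 * 1876                  # mel bins x frames per song
flattened = 8 * 1876                      # conv7 output fed to encoder_fc
compression = input_values / 128          # input values per embedding dimension

print(left_pads, receptive_field, seq_len, flattened, compression)
```

With these settings the seven layers see 255 frames while the sequence length stays at 1876, and each 90,048-value spectrogram is compressed to 128 numbers. The decoder applies the same paddings in the opposite order.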

# Training

In [None]:
model = TimeAutoEncoder().to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)
criterion = nn.MSELoss()

# print(model)

In [None]:
from torchsummary import summary as summary_

summary_(model, (48, 1876),batch_size = batch_size)

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv1d-1           [128, 512, 1876]          74,240
       BatchNorm1d-2           [128, 512, 1876]           1,024
              ReLU-3           [128, 512, 1876]               0
            Conv1d-4           [128, 256, 1876]         393,472
       BatchNorm1d-5           [128, 256, 1876]             512
              ReLU-6           [128, 256, 1876]               0
            Conv1d-7           [128, 128, 1876]          98,432
       BatchNorm1d-8           [128, 128, 1876]             256
              ReLU-9           [128, 128, 1876]               0
           Conv1d-10            [128, 64, 1876]          24,640
      BatchNorm1d-11            [128, 64, 1876]             128
             ReLU-12            [128, 64, 1876]               0
           Conv1d-13            [128, 32, 1876]           6,176
      BatchNorm1d-14            [128, 3

In [None]:
import time

min_loss = float('inf')  # best validation loss seen so far

for epoch in range(1, epochs + 1):
    start = time.time()
    random.shuffle(train_batch_li)
    train_loss = train(model = model, train_loader = train_batch_li)
    val_loss = val(model = model, train_loader = val_batch_li)
    end = time.time()

    print(f'EPOCH:{epoch}, Train Loss:{train_loss}, Val Loss:{val_loss}, Elapsed time (s): {end - start}')
    # Save a checkpoint only when the validation loss improves
    if val_loss < min_loss:
        min_loss = val_loss
        torch.save(model.state_dict(), model_dir + 'TimeAutoEncoder_val.pt')
        print('Model saved')

# Save mel_embedding

In [None]:
model = TimeAutoEncoder().to(DEVICE)
model.load_state_dict(torch.load(model_dir + 'TimeAutoEncoder_val.pt', map_location = DEVICE))

In [None]:
mel_embeding_li = get_mel_embeding(model = model, train_loader = test_batch_li)
mel_embeding = np.concatenate(mel_embeding_li, axis = 0)

In [None]:
mel_embeding.shape

In [None]:
np.save(batch_data_dir + 'mel_embeding_val.npy', mel_embeding)
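
Once the embedding matrix is saved, similar songs can be looked up by comparing rows. The notebook does not state which similarity measure was used, so the sketch below assumes cosine similarity; the function `top_k_similar` and the random stand-in embeddings are hypothetical and serve only to make the example runnable without the real data.

```python
import numpy as np

def top_k_similar(embeddings, query_idx, k=5):
    """Return indices and cosine similarities of the k rows closest to embeddings[query_idx]."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed[query_idx]
    sims[query_idx] = -np.inf             # exclude the query song itself
    top = np.argsort(-sims)[:k]
    return top, sims[top]

# Stand-in for mel_embeding: 100 songs x 128 dimensions with values in (-1, 1).
rng = np.random.default_rng(0)
fake = np.tanh(rng.normal(size=(100, 128)))
# Plant a near-duplicate of song 3 at row 7, mimicking another version of the same song.
fake[7] = fake[3] + 0.01 * rng.normal(size=128)

idx, scores = top_k_similar(fake, query_idx=3, k=5)
print(idx, scores)
```

With the real data, the returned row indices correspond to `song_embedding_idx` in `song_meta_df`, provided the batches were built in the same order as that index.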