<a href="https://colab.research.google.com/github/Stone-bridge-NLP/BERT/blob/main/GenreClassification_BERT_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Demo of Genre Classification using BERT embedding

Hongik University 2021 NLP team project  
JunHyeon Kwon

Hugging Face usage is adapted from this tutorial:  
https://colab.research.google.com/github/pytorch/pytorch.github.io/blob/master/assets/hub/huggingface_pytorch-transformers.ipynb

# Setting Environment

In [1]:
%%bash
# required packages to use BERT via hub models
pip install tqdm boto3 requests regex sentencepiece sacremoses

Collecting boto3
  Downloading boto3-1.20.19-py3-none-any.whl (131 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting botocore<1.24.0,>=1.23.19
  Downloading botocore-1.23.19-py3-none-any.whl (8.4 MB)
Collecting jmespath<1.0.0,>=0.7.1
  Downloading jmespath-0.10.0-py2.py3-none-any.whl (24 kB)
Collecting s3transfer<0.6.0,>=0.5.0
  Downloading s3transfer-0.5.0-py3-none-any.whl (79 kB)
Collecting urllib3<1.27,>=1.25.4
  Downloading urllib3-1.26.7-py2.py3-none-any.whl (138 kB)
  Downloading urllib3-1.25.11-py2.py3-none-any.whl (127 kB)
Installing collected packages: urllib3, jmespath, botocore, s3transfer, sentencepiece, sacremoses, boto3
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.24.3
    Uninstalling urllib3-1.24.3:
      Successfully uninstalled urllib3-1.24.3
Successfully installed boto

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.


In [2]:
# additional packages required (to avoid error, not mentioned in the tutorial)
!pip install huggingface_hub
!pip install tokenizers

Collecting huggingface_hub
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
[?25l[K     |█████▎                          | 10 kB 22.8 MB/s eta 0:00:01[K     |██████████▋                     | 20 kB 25.8 MB/s eta 0:00:01[K     |███████████████▉                | 30 kB 25.5 MB/s eta 0:00:01[K     |█████████████████████▏          | 40 kB 18.7 MB/s eta 0:00:01[K     |██████████████████████████▌     | 51 kB 9.5 MB/s eta 0:00:01[K     |███████████████████████████████▊| 61 kB 9.4 MB/s eta 0:00:01[K     |████████████████████████████████| 61 kB 451 kB/s 
Installing collected packages: huggingface-hub
Successfully installed huggingface-hub-0.2.1
Collecting tokenizers
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
     |████████████████████████████████| 3.3 MB 7.6 MB/s
Installing collected packages: tokenizers
Successfully installed tokenizers-0.10.3


In [3]:
# clone my github repo to import utils.py
!git clone https://github.com/Stone-bridge-NLP/BERT.git
%cp /content/BERT/utils.py /content/utils.py

Cloning into 'BERT'...
remote: Enumerating objects: 56, done.
remote: Counting objects: 100% (56/56), done.
remote: Compressing objects: 100% (55/55), done.
remote: Total 56 (delta 17), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (56/56), done.


In [4]:
# download the saved model checkpoint and the test dataset to local disk

# https://drive.google.com/file/d/1PWrQeJ7bu1OAufGshDBO8e3qdtUy35xA/view?usp=sharing
!gdown --id 1PWrQeJ7bu1OAufGshDBO8e3qdtUy35xA
SAVE_FILENAME = 'checkpoint.pth'
# https://drive.google.com/file/d/168qGvi5w4Wwgu5QTpPoLkZzoJNAgyc6b/view?usp=sharing
!gdown --id 168qGvi5w4Wwgu5QTpPoLkZzoJNAgyc6b
TEST_FILENAME = 'preprocessed_test_data.csv'

Downloading...
From: https://drive.google.com/uc?id=1PWrQeJ7bu1OAufGshDBO8e3qdtUy35xA
To: /content/checkpoint.pth
100% 27.8M/27.8M [00:00<00:00, 170MB/s]
Downloading...
From: https://drive.google.com/uc?id=168qGvi5w4Wwgu5QTpPoLkZzoJNAgyc6b
To: /content/preprocessed_test_data.csv
100% 9.63M/9.63M [00:00<00:00, 85.1MB/s]


In [5]:
import pandas as pd
import numpy as np
import random
import torch
import torch.nn as nn
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import utils

# Define model and other classes

In [6]:
# Custom subclass of torch.utils.data.Dataset
# Tokenization and integer labeling happen here
# Shuffling and batching are handled by torch.utils.data.DataLoader
class LyricsAndGenreDataset(Dataset):
  def __init__(self, dataframe, tokenizer, num_sentence):
    self.df = dataframe
    self.tk = tokenizer
    self.num_sentence = num_sentence
    self.genre_name2id = {
        'Electronic':0, 
        'Country':1, 
        'R&B':2, 
        'Jazz':3, 
        'Indie':4, 
        'Pop':5, 
        'Folk':6, 
        'Metal':7, 
        'Hip-Hop':8, 
        'Rock':9}

  def __len__(self):
    return len(self.df)
  
  def __getitem__(self, idx):
    if torch.is_tensor(idx):
      idx = idx.tolist()

    genre = self.genre_name2id[self.df['Genre'][idx]]
    lyric = [self.df['Lyrics'][idx]]

    with torch.no_grad():
      indexed_tokens = self.tk.batch_encode_plus(
            lyric, add_special_tokens=True, padding= 'max_length', 
            max_length=2**9*self.num_sentence, truncation=True)
      
      tk_tensor = torch.tensor(indexed_tokens['input_ids']).view(-1,2**9)
      sg_tensor = torch.tensor(indexed_tokens['token_type_ids']).view(-1,2**9)
      at_tensor = torch.tensor(indexed_tokens['attention_mask']).view(-1,2**9)

    return genre, tk_tensor, sg_tensor, at_tensor
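
The padding and reshaping done in `__getitem__` can be illustrated without a tokenizer. The sketch below is a plain-Python analogue, not the code the notebook runs: `CHUNK = 4` stands in for `2**9`, `PAD_ID` and `pad_and_chunk` are hypothetical names, and special-token handling is ignored.

```python
# Plain-Python analogue of padding to CHUNK * num_sentence ids and then
# reshaping with .view(-1, CHUNK). CHUNK is 4 here instead of 512.
CHUNK = 4
PAD_ID = 0  # hypothetical padding id for this toy example


def pad_and_chunk(token_ids, num_sentence):
    max_length = CHUNK * num_sentence
    clipped = token_ids[:max_length]              # mimics truncation=True
    n_pad = max_length - len(clipped)
    padded = clipped + [PAD_ID] * n_pad           # mimics padding='max_length'
    mask = [1] * len(clipped) + [0] * n_pad       # attention mask
    rows = [padded[i:i + CHUNK] for i in range(0, max_length, CHUNK)]
    mask_rows = [mask[i:i + CHUNK] for i in range(0, max_length, CHUNK)]
    return rows, mask_rows


rows, mask_rows = pad_and_chunk([101, 7, 8, 9, 10, 102], num_sentence=2)
print(rows)
print(mask_rows)
```

Each row corresponds to one 512-token window that BERT sees separately; with `num_sentences = 1`, as in this demo, there is a single row per song.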

In [7]:
# classifier model
# LSTM layers are stacked manually so that hidden_size shrinks gradually
# one FC layer is attached at the end
# ====================================
# Param seq_len holds, for each song in the batch, the index of its last
# non-padding token. Many songs end well before 512 tokens, leaving a long
# run of padding at the end, which might make it hard for the LSTM to
# extract useful information from the sequence. Using seq_len, forward()
# picks the output at that time step and feeds it to the FC layer.
class TextLSTM(nn.Module):
  def __init__(self, input_size, hidden_size, n_class):
    super(TextLSTM, self).__init__()

    self.hidden_size = hidden_size

    self.lstm1 = nn.LSTM(
              input_size=input_size,
              hidden_size=hidden_size*5,
              num_layers=1,
              dropout=0,
              batch_first=True)
    
    self.lstm2 = nn.LSTM(
              input_size=hidden_size*5,
              hidden_size=hidden_size*4,
              num_layers=1,
              dropout=0,
              batch_first=True)

    self.lstm3 = nn.LSTM(
              input_size=hidden_size*4,
              hidden_size=hidden_size*2,
              num_layers=1,
              dropout=0,
              batch_first=True)
    
    self.lstm4 = nn.LSTM(
              input_size=hidden_size*2,
              hidden_size=hidden_size,
              num_layers=1,
              dropout=0,
              batch_first=True)

    self.dense = nn.Sequential(
        nn.ReLU(),
        nn.Linear(hidden_size, n_class),
        nn.Softmax(dim=1))

  def forward(self, X, seq_len):
    # X of shape N,L,Hin
    # hidden_and_cell zeros by default
    # outputs of shape N,L,Hout
    outputs = X
    outputs, hidden_and_cell = self.lstm1(outputs)
    outputs, hidden_and_cell = self.lstm2(outputs)
    outputs, hidden_and_cell = self.lstm3(outputs)
    outputs, hidden_and_cell = self.lstm4(outputs)
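    # broadcast the last-token index to shape N,1,Hout
    # (use view(-1,...) instead of relying on a global batch_size)
    seq_len = torch.tile(seq_len.view(-1,1,1),(1,1,self.hidden_size))
    outputs = torch.gather(outputs,1,seq_len)  # shape N,1,Hout
    outputs = outputs[:,-1]  # drop the time dimension, shape N,Hout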
    return self.dense(outputs) # return of shape N,n_class
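
How `torch.gather` selects the output at the last non-padding time step can be checked on a toy tensor. This is a minimal sketch with made-up shapes and values; it uses `expand` where the model uses `torch.tile`, which should give the same index tensor here.

```python
import torch

# Toy LSTM-style output: 2 sequences, 4 time steps, hidden size 3.
# Step t of sample n is filled with the value 10*n + t.
N, L, H = 2, 4, 3
outputs = torch.tensor(
    [[[10.0 * n + t] * H for t in range(L)] for n in range(N)])

# Attention masks: sample 0 has 2 real tokens, sample 1 has 4.
attention_mask = torch.tensor([[1, 1, 0, 0],
                               [1, 1, 1, 1]])

# Index of the last non-padding step (mask sum minus one).
last_idx = attention_mask.sum(dim=1) - 1

# gather along dim=1 needs an index of shape (N, 1, H).
index = last_idx.view(-1, 1, 1).expand(-1, 1, H)
last_step = torch.gather(outputs, 1, index).squeeze(1)  # shape (N, H)

print(last_step)
```

Sample 0 yields the values stored at step 1 and sample 1 those at step 3, so padding positions never reach the FC layer.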

# Load and Run

In [8]:
#### hyperparameters ####
batch_size = 128

# fixed parameters
hidden_size = 128
num_sentences = 1
v_dim = 768
n_genre = 10
genre_id2name = ['Electronic', 'Country', 'R&B', 'Jazz', 'Indie', 'Pop', 'Folk', 'Metal', 'Hip-Hop', 'Rock']
genre_name2id = {'Electronic':0, 'Country':1, 'R&B':2, 'Jazz':3, 'Indie':4, 'Pop':5, 'Folk':6, 'Metal':7, 'Hip-Hop':8, 'Rock':9}
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [10]:
## load the dataset and the saved model checkpoint

# load pretrained BERT tokenizer and bare BERT model
tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-base-cased')
bert_embedding = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-cased').to(device)

# load test dataset
test_dataset = pd.read_csv('./'+TEST_FILENAME)
print(test_dataset['Genre'].value_counts())
print(len(test_dataset))

# declare torch.utils.data.Dataset
test_set = LyricsAndGenreDataset(test_dataset,tokenizer,num_sentences)

# test data loader
test_loader = DataLoader(test_set,batch_size=batch_size, shuffle=True, 
                          num_workers=0, drop_last=True)

# model
lstm_classifier = TextLSTM(v_dim, hidden_size, n_genre).to(device)

# load model if possible
try:
  cp = torch.load(SAVE_FILENAME, map_location=device)  # also works on CPU-only runtimes
  epoch_start= cp['current_epoch']+1
  lstm_classifier.load_state_dict(cp['model'])
  print(f'\nsavefile from {SAVE_FILENAME} loaded')
except FileNotFoundError:
  print(f'\nNo such savefile {SAVE_FILENAME}')

# print summary
print(lstm_classifier)
print(sum(p.numel() for p in lstm_classifier.parameters() if p.requires_grad))

Using cache found in /root/.cache/torch/hub/huggingface_pytorch-transformers_master
Using cache found in /root/.cache/torch/hub/huggingface_pytorch-transformers_master
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassificati

Rock          1410
Pop           1110
Hip-Hop        960
Country        810
Metal          810
Electronic     659
Jazz           659
Indie          510
R&B            509
Folk           495
Name: Genre, dtype: int64
7932

savefile from checkpoint.pth loaded
TextLSTM(
  (lstm1): LSTM(768, 640, batch_first=True)
  (lstm2): LSTM(640, 512, batch_first=True)
  (lstm3): LSTM(512, 256, batch_first=True)
  (lstm4): LSTM(256, 128, batch_first=True)
  (dense): Sequential(
    (0): ReLU()
    (1): Linear(in_features=128, out_features=10, bias=True)
    (2): Softmax(dim=1)
  )
)
6960394
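
The printed parameter count can be reproduced by hand. The sketch below assumes the standard single-layer, unidirectional `nn.LSTM` layout (four gates, each with input weights, recurrent weights, and two bias vectors); `lstm_params` and `linear_params` are hypothetical helper names.

```python
def lstm_params(input_size, hidden_size):
    # 4 gates x (input weights + recurrent weights + two bias vectors)
    return 4 * (input_size * hidden_size
                + hidden_size * hidden_size
                + 2 * hidden_size)


def linear_params(in_features, out_features):
    # weight matrix plus bias vector
    return in_features * out_features + out_features


# (input_size, hidden_size) of lstm1..lstm4 as printed in the summary
layers = [(768, 640), (640, 512), (512, 256), (256, 128)]
total = sum(lstm_params(i, h) for i, h in layers) + linear_params(128, 10)
print(total)  # 6960394
```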


In [11]:
# test the model with test dataset
lstm_classifier.eval()
c_mat = np.zeros((10,4)) # per-genre one-vs-rest confusion counts: TP, FP, FN, TN
f1 = []
with torch.no_grad():
  for b, batch in enumerate(test_loader):
    label_batch = batch[0].to(device)
    tk_batch = batch[1].to(device)
    sg_batch = batch[2].to(device)
    at_batch = batch[3].to(device)

    # index of the last non-padding token per song (attention mask sum minus one)
    seq_len = at_batch.sum(dim=(1,2)) - 1

    embedding = bert_embedding(
        tk_batch.view(-1,2**9), 
        token_type_ids= sg_batch.view(-1,2**9),
        attention_mask=at_batch.view(-1,2**9))

    embedded_tokens = embedding[0].view(batch_size,2**9*num_sentences,-1)

    output = lstm_classifier(embedded_tokens, seq_len)

    pred = torch.argmax(output, dim=1)

    acc = float(torch.sum(pred == label_batch))/batch_size
    print(f'\rbatch [{b}/{len(test_loader)}] acc: {acc}', end='\t')

    # build confusion matrix
    for i in range(10):
      c_mat[i,0] += int(torch.sum((pred == i)*(label_batch == i)))
      c_mat[i,1] += int(torch.sum((pred == i)*(label_batch != i)))
      c_mat[i,2] += int(torch.sum((pred != i)*(label_batch == i)))
      c_mat[i,3] += int(torch.sum((pred != i)*(label_batch != i)))


# calculate precision, recall and f1-score
precision = [c[0]/(c[0]+c[1]) if c[0] != 0 else 0 for c in c_mat]
recall = [c[0]/(c[0]+c[2]) if c[0] != 0 else 0 for c in c_mat]
f1 = [2*p*r/(p+r) if p*r != 0 else 0 for p, r in zip(precision,recall)]

batch [60/61] acc: 0.359375	
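
The per-genre counting in the loop above treats each genre as a one-vs-rest problem. A plain-Python sketch on made-up predictions and labels (three classes, five samples; `one_vs_rest_counts` is a hypothetical name) shows what ends up in each row:

```python
def one_vs_rest_counts(pred, label, n_class):
    # One row per class: [TP, FP, FN, TN]
    table = [[0, 0, 0, 0] for _ in range(n_class)]
    for p, y in zip(pred, label):
        for c in range(n_class):
            if p == c and y == c:
                table[c][0] += 1   # true positive
            elif p == c:
                table[c][1] += 1   # false positive
            elif y == c:
                table[c][2] += 1   # false negative
            else:
                table[c][3] += 1   # true negative
    return table


pred = [0, 1, 1, 2, 0]
label = [0, 1, 2, 2, 1]
table = one_vs_rest_counts(pred, label, n_class=3)
for row in table:
    print(row)
```

Every row sums to the number of evaluated samples, which is a quick sanity check for the real table as well.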

# Result

In [12]:
# show confusion matrix
print('confusion matrix. TP, FP, FN, TN')
for g, c in zip(genre_id2name,c_mat):
  print('%-15s'%(g), c)

confusion matrix. TP, FP, FN, TN
Electronic      [ 171.  764.  480. 6393.]
Country         [ 253.  401.  543. 6611.]
R&B             [ 113.  550.  386. 6759.]
Jazz            [ 278.  829.  373. 6328.]
Indie           [ 132.  804.  370. 6502.]
Pop             [ 111.  232.  987. 6478.]
Folk            [ 194.  698.  291. 6625.]
Metal           [ 580.  673.  213. 6342.]
Hip-Hop         [ 723.  154.  216. 6715.]
Rock            [  36.  112. 1358. 6302.]
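
As a cross-check, precision, recall, and F1 can be recomputed directly from the rows printed above, using the same formulas as the evaluation cell. The dictionary below simply copies those rows; `prf` is a hypothetical helper name. Overall accuracy follows as the sum of true positives divided by the number of evaluated songs.

```python
# Rows copied from the printed table: (TP, FP, FN, TN)
rows = {
    'Electronic': (171, 764, 480, 6393),
    'Country':    (253, 401, 543, 6611),
    'R&B':        (113, 550, 386, 6759),
    'Jazz':       (278, 829, 373, 6328),
    'Indie':      (132, 804, 370, 6502),
    'Pop':        (111, 232, 987, 6478),
    'Folk':       (194, 698, 291, 6625),
    'Metal':      (580, 673, 213, 6342),
    'Hip-Hop':    (723, 154, 216, 6715),
    'Rock':       (36, 112, 1358, 6302),
}


def prf(tp, fp, fn):
    # Same zero-guarded definitions as in the evaluation cell
    precision = tp / (tp + fp) if tp else 0.0
    recall = tp / (tp + fn) if tp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


scores = {g: prf(tp, fp, fn) for g, (tp, fp, fn, tn) in rows.items()}
n_samples = sum(rows['Rock'])  # every row sums to the evaluated song count
accuracy = sum(tp for tp, _, _, _ in rows.values()) / n_samples

for g, (p, r, f) in scores.items():
    print('%-12s P=%.3f R=%.3f F1=%.3f' % (g, p, r, f))
print('songs evaluated:', n_samples, 'accuracy: %.4f' % accuracy)
```

Note that 7808 songs are evaluated rather than 7932, because the loader uses `drop_last=True` with a batch size of 128.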


In [15]:
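# compare the test f1 score with an input-independent baseline:
# a predictor that flags each genre with probability 0.5 has
# precision = class prior p and recall = 0.5, so f1 = 2*p*0.5/(0.5+p)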
test_dataset = pd.read_csv('./'+TEST_FILENAME)

P = [n/len(test_dataset) for n in test_dataset['Genre'].value_counts()]
f1_score = {n:2*p*0.5/(0.5+p) for n, p in zip(test_dataset['Genre'].value_counts().index, P)}
print('%-15s %11s   %11s'%('Genre', 'f1 test', 'f1 at least'))
for i, g in enumerate(genre_id2name):
  print('%-15s %-2.9f   %-2.9f'%(g, f1[i], f1_score[g]))

print('-'*43)
print('%-15s %-2.9f   %-2.9f'%('average',np.mean(f1),sum(f1_score.values())/10))

Genre               f1 test   f1 at least
Electronic      0.215636822   0.142486486
Country         0.348965517   0.169597990
R&B             0.194492255   0.113743017
Jazz            0.316268487   0.142486486
Indie           0.183588317   0.113941019
Pop             0.154059681   0.218676123
Folk            0.281771968   0.110961668
Metal           0.566959922   0.169597990
Hip-Hop         0.796255507   0.194884287
Rock            0.046692607   0.262276786
-------------------------------------------
average         0.310469108   0.163865185
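
The "f1 at least" column follows from the formula `2*p*0.5/(0.5+p)`. One way to read it, stated here as an interpretation rather than the author's wording, is as the expected F1 of a predictor that ignores the lyrics and flags each genre with a fixed probability. The sketch below recomputes a few entries from the genre counts printed earlier; `coin_flip_f1` and `flag_rate` are hypothetical names.

```python
def coin_flip_f1(prior, flag_rate=0.5):
    # For an input-independent predictor, precision equals the class prior
    # and recall equals the rate at which the class is flagged.
    precision, recall = prior, flag_rate
    return 2 * precision * recall / (precision + recall)


# Genre counts taken from the value_counts() output of the test set
counts = {'Rock': 1410, 'Pop': 1110, 'Hip-Hop': 960, 'Folk': 495}
total = 7932

baseline = {g: coin_flip_f1(n / total) for g, n in counts.items()}
for g, f in baseline.items():
    print('%-10s %.9f' % (g, f))
```

Passing `flag_rate=0.1` would instead model a uniform guess over the ten genres, which gives a lower baseline for every class.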
