This notebook demonstrates the implementation of deep learning models to classify genre of a given set of movies.

Training Dataset: train.csv

Test Dataset: test.csv

To make life easy, ​word embeddings are already extracted using BERT 

The goal is to train 3 different deep learning models far as this task is concerned,<br><br>
- Model1: Train at least 1 CNN on the BERT embeddings<br>
- Model2: Train any model of your choice on the BERT embeddings
- Model3: Train any other model of your choice + different embeddings other than BERT

and evaluate the predictions on the test dataset


### Accessing Google Drive from Google Colab

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


Setting path variables for easier imports in google colab

In [2]:
%cd /content/gdrive/MyDrive/colab_notebooks/genre_classification/
import sys
sys.path.insert(0,'/content/gdrive/MyDrive/colab_notebooks/genre_classification')

/content/gdrive/MyDrive/colab_notebooks/genre_classification


### Install, Import required libraries

If you are working in google colab, run this cell, else run the following one

In [3]:
# Install Transformer library for models and tokenizer

!pip install -U -q transformers 

In [4]:
# Importing the required libraries
import numpy as np
import pandas as pd

from tqdm import tqdm

import transformers

import torch
from torch import nn
import torch.nn.functional as F
from torch.autograd import Variable
from torch.utils.data import Dataset, DataLoader

from sklearn.preprocessing import MultiLabelBinarizer

### Load, Read the Data

In [5]:
# Read data from .csv file
train_df = pd.read_csv('data/train.csv')
test_df = pd.read_csv('data/test.csv')

In [6]:
# Print the shape of train and test dataframe(s)
print("Training dataset has {} rows and {} columns".format(train_df.shape[0], train_df.shape[1]))
print("Test dataset has {} rows and {} columns".format(test_df.shape[0], test_df.shape[1]))

Training dataset has 3054 rows and 8 columns
Test dataset has 3054 rows and 7 columns


In [7]:
# Display first 5 rows of training dataset
train_df.head()

Unnamed: 0,type,title,director,cast,country,rating,description,genres
0,Movie,The Ryan White Story,John Herzfeld,"Judith Light, Lukas Haas, Michael Bowen, Nikki...",United States,TV-PG,After contracting HIV from a tainted blood tre...,Drama
1,Movie,Mumbai Cha Raja,Manjeet Singh,"Rahul Bairagi, Arbaaz Khan, Tejas Parvatkar, D...",India,TV-14,"This coming-of-age tale follows Rahul, a young...","Drama, International"
2,Movie,Soekarno,Hanung Bramantyo,"Ario Bayu, Lukman Sardi, Maudy Koesnaedi, Tant...",Indonesia,TV-MA,This biographical drama about Indonesia's firs...,"Drama, International"
3,Movie,The Young Offenders,Peter Foott,"Alex Murphy, Chris Walley, Hilary Rose, Domini...",Ireland,TV-MA,"Never ones to think things through, two Irish ...","Comedy, International"
4,Movie,The King,David Michôd,"Timothée Chalamet, Joel Edgerton, Robert Patti...",,R,Wayward Prince Hal must turn from carouser to ...,Drama


In [20]:
# Display first 5 rows of test dataset
test_df.head()

Unnamed: 0,type,title,director,cast,country,rating,description
0,Movie,The Bill Murray Stories: Life Lessons Learned ...,Tommy Avallone,"Tommy Avallone, Bill Murray, Joel Murray, Pete...",United States,TV-MA,This documentary highlights spontaneous encoun...
1,Movie,The Short Game,Josh Greenbaum,"Sky Sudberry, Allan Kournikova, Jed Dy, Zamoku...",United States,PG,"They are fiercely competitive athletes, determ..."
2,Movie,The Bad Batch,Ana Lily Amirpour,"Suki Waterhouse, Jason Momoa, Keanu Reeves, Ji...",United States,R,"Banished to a wasteland of undesirables, a you..."
3,TV Show,The Twilight Zone (Original Series),,Rod Serling,United States,TV-PG,"Hosted by creator Rod Serling, this groundbrea..."
4,Movie,World Trade Center,Oliver Stone,"Nicolas Cage, Michael Peña, Maggie Gyllenhaal,...",United States,PG-13,"Working under treacherous conditions, an army ..."


In [9]:
# Features of training dataframe
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3054 entries, 0 to 3053
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   type         3054 non-null   object
 1   title        3054 non-null   object
 2   director     2112 non-null   object
 3   cast         2767 non-null   object
 4   country      2838 non-null   object
 5   rating       3053 non-null   object
 6   description  3054 non-null   object
 7   genres       3054 non-null   object
dtypes: object(8)
memory usage: 191.0+ KB


In [10]:
# Features of test dataframe
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3054 entries, 0 to 3053
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   type         3054 non-null   object
 1   title        3054 non-null   object
 2   director     2106 non-null   object
 3   cast         2791 non-null   object
 4   country      2831 non-null   object
 5   rating       3052 non-null   object
 6   description  3054 non-null   object
dtypes: object(7)
memory usage: 167.1+ KB


- For predicting genre, we will take into account `title` and `description` features only for obvious reasons and these don't have any missing values records/observations.
- Based on the nature of the dataset, this is a multi-label classification task
- Features: `title`, `description`
- Label: `genres`

We will now segregate `genres` for every observations followed by converting them into one-hot encoded matrices

In [11]:
# Split individual genres
train_df["genres"] = train_df.genres.apply(lambda x: [i.strip() for i in x.split(",")])

# Initialize multi-label transformer
mlb = MultiLabelBinarizer()

# Create a new dataframe 'ohe_labels' having one-hot encoded representation
ohe_labels = pd.DataFrame(mlb.fit_transform(train_df.genres), columns=mlb.classes_)

# Merge 'train_df' and 'ohe_labels'
train_df = pd.concat((train_df, ohe_labels), axis=1)

# Display the different genres available
print("These are the labels exising in the dataset\n", mlb.classes_)

These are the labels exising in the dataset
 ['Action' 'Anime' 'Comedy' 'Documentaries' 'Drama' 'Horror'
 'International' 'Kids' 'Romantic' 'Sci-Fi & Fantasy' 'Stand-up'
 'Thriller']


In [12]:
# Display first two rows of 'train_df' post transformation
train_df.head(2)

Unnamed: 0,type,title,director,cast,country,rating,description,genres,Action,Anime,Comedy,Documentaries,Drama,Horror,International,Kids,Romantic,Sci-Fi & Fantasy,Stand-up,Thriller
0,Movie,The Ryan White Story,John Herzfeld,"Judith Light, Lukas Haas, Michael Bowen, Nikki...",United States,TV-PG,After contracting HIV from a tainted blood tre...,[Drama],0,0,0,0,1,0,0,0,0,0,0,0
1,Movie,Mumbai Cha Raja,Manjeet Singh,"Rahul Bairagi, Arbaaz Khan, Tejas Parvatkar, D...",India,TV-14,"This coming-of-age tale follows Rahul, a young...","[Drama, International]",0,0,0,0,1,0,1,0,0,0,0,0


### Model 1: CNN Model with BERT pre-trained tokenizers

**Define the Architecture**

In [13]:
class simpleCNN(nn.Module):
    """Model that uses CNN and with pretrained-model- distil-bert model
     ARGS: 
          embed_dim: embedding dimension is 768 for BERT models
          class_num: total number of labels
          kernel_num: Number of kernels in cnn
          kernel_size: list of kernels of sizes for different kernel
          dropout: dropout to be used for regularization 
     Attributes:
          bert_model: BERT model with pretrianed weights for transfer learning
          convs1: list of cnn model with respective kernel size and dimensions
          fc: final linear/dense layer with class_num output
     Abreviations:
     N: Batch size
     Ci: input Channel size, it is 1 here 
     W: words size
     D: embedding size
     Co: output channel  
     Ks: Kernel size
     C: classes
    """
    def __init__(self, embed_dim, class_num, kernel_num, kernel_sizes, dropout):
        super(simpleCNN, self).__init__()
        self.bert_model = transformers.BertModel.from_pretrained(pretrained_weights)
        #pytorch uses special-list comprehension dedicated to its model classes
        self.convs1 = nn.ModuleList([nn.Conv2d(1, kernel_num, (K, embed_dim)) for K in kernel_sizes])
        self.dropout = nn.Dropout(dropout)
        self.fc1 = nn.Linear(len(kernel_sizes) * kernel_num, class_num) 
        
    def forward(self, ids,att,token):
        x_ = self.bert_model(ids,att,token)[0] 
        x = x_.unsqueeze(1)  # (N, Ci, W, D) 
        x = [F.relu(conv(x)).squeeze(3) for conv in self.convs1]  # [(N, Co, W), ...]*len(Ks) 
        x = [F.max_pool1d(i, i.size(2)).squeeze(2) for i in x]  # [(N, Co), ...]*len(Ks)
        x = torch.cat(x, 1)
        x = self.dropout(x)  # (N, len(Ks)*Co)
        logit = self.fc1(x)  # (N, C)
        return logit 

**Dataset Module** 

In [14]:
from torch.utils.data import Dataset,DataLoader
tokenizer_class = transformers.BertTokenizer
pretrained_weights='distilbert-base-uncased'
class GenreDataset(Dataset):
  """Dataset class with pretrained-embeddings from the distil-bert model.
     ARGS: 
          description: total list of text description 
          title: total list of titles
          labels: one-hot encoded labels 
     Attributes:
          tokenizer: Tokenizing the text and embedding with ids.
          max_seq: maximum number of word to consider and truncate 
                   if there are more or pad with [0] they are less.
          
     Abreviations:
     N: Batch size
     Ci: input Channel size, it is 1 here 
     W: words size
     D: embedding size
     Co: output channel  
     Ks: Kernel size
     C: classes
  """
  def __init__(self,description,title,labels):
    self.title = title
    self.description = description
    self.labels = labels 
    self.max_seq = 250 #shouldn't be > 512 for BERT
    self.tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
  
  def __len__(self):
    return len(self.title)
  
  def __getitem__(self,idx): 
    #convert each text input to string without gaps and join them before tokenizing
    title = "".join(self.title[idx].split())
    description = "".join(self.description[idx].split()) 
    labels = self.labels[idx,:]
    inputs = self.tokenizer(title + description, add_special_tokens=True,truncation=True,max_length=self.max_seq)
    # here 'input_ids' implies the token numbers given by the embeddings
    input_ids = inputs["input_ids"]
    # 'token_type_ids' usually helpful if we are using separate two text data rather than single text data 
    token_type_ids = inputs["token_type_ids"]
    # 'attention_mask' will have 1:attending word and 0:padded word
    attention_mask = inputs["attention_mask"]
    #here padding with [0] if the tokens are less than the max_seq
    input_ids = input_ids + [0] * (self.max_seq - len(input_ids))
    token_type_ids = token_type_ids + [0] * (self.max_seq - len(token_type_ids))
    attention_mask = attention_mask + [0] * (self.max_seq - len(attention_mask))
    return {
        "input_ids": torch.tensor(input_ids,dtype=torch.long),
        "token_type_ids": torch.tensor(token_type_ids,dtype=torch.long),
        "attention_mask": torch.tensor(attention_mask,dtype=torch.long),
        "labels": torch.tensor(labels,dtype=torch.float)
    } 



**Defining Hyperparameters**

In [15]:
# Instantiating the cnn-bert model with arguments 
embed_dim = 768
class_num = len(mlb.classes_)
kernel_num = 3
kernel_sizes = [2, 3, 4]
dropout = 0.5

# Post running this cell model will be downloaded from web (ignore errors)
model = simpleCNN(
    embed_dim=embed_dim,
    class_num=class_num,
    kernel_num=kernel_num,
    kernel_sizes=kernel_sizes,
    dropout=dropout
)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing BertModel: ['distilbert.embeddings.word_embeddings.weight', 'distilbert.embeddings.position_embeddings.weight', 'distilbert.embeddings.LayerNorm.weight', 'distilbert.embeddings.LayerNorm.bias', 'distilbert.transformer.layer.0.attention.q_lin.weight', 'distilbert.transformer.layer.0.attention.q_lin.bias', 'distilbert.transformer.layer.0.attention.k_lin.weight', 'distilbert.transformer.layer.0.attention.k_lin.bias', 'distilbert.transformer.layer.0.attention.v_lin.weight', 'distilbert.transformer.layer.0.attention.v_lin.bias', 'distilbert.transformer.layer.0.attention.out_lin.weight', 'distilbert.transformer.layer.0.attention.out_lin.bias', 'distilbert.transformer.layer.0.sa_layer_norm.weight', 'distilbert.transformer.layer.0.sa_layer_norm.bias', 'distilbert.transformer.layer.0.ffn.lin1.weight', 'distilbert.transformer.layer.0.ffn.lin1.bias', 'distilbert.transformer.layer.0.ffn.lin2.weight', 'd

**Define train, val, test functions**

In [16]:

# Training function 
def train_fn(dataloader,model,optimizer,device):
  # setting the supplied to training mode 
  model.train() 
  losses = []
  f_output = []
  f_target = [] 
  # loading the batchwise data to the model 
  for d in dataloader:
    in_ids,token_ids,att_mask,targets = d["input_ids"].to(device),d["token_type_ids"].to(device),d["attention_mask"].to(device),d["labels"].to(device)
    # setting optimizer to to zero 
    optimizer.zero_grad()
    # forward propogation 
    outs = model(in_ids,token_ids,att_mask)
    loss = loss_fn(outs,targets)
    # backpropogation 
    loss.backward()
    # weights update 
    optimizer.step()
    # getting loss,targets,output values as return 
    losses.append(loss.cpu().detach())
    # final layer values with sigmoid to get values between 0 and 1
    outs = torch.sigmoid(outs)
    f_output.extend(outs.cpu().detach().numpy())
    f_target.extend(targets.cpu().detach().numpy())
  return f_output,f_target,np.sum(losses)/len(dataloader) 

def validation_fn(dataloader,model,device):
    model.eval()
    losses = []
    f_output = []
    f_target = []
    # keeping the no-gradient on  
    with torch.no_grad():
      # loading the batchwise data to the model
      for d in dataloader:
        in_ids,token_ids,att_mask,targets = d["input_ids"].to(device),d["token_type_ids"].to(device),d["attention_mask"].to(device),d["labels"].to(device)
        # forward propogation 
        outs = model(in_ids,token_ids,att_mask)
        loss = loss_fn(outs,targets)
        # final layer with sigmoid to get values between 0 and 1
        outs = torch.sigmoid(outs)
        # getting loss,targets,output values as return 
        losses.append(loss.cpu().detach())
        f_output.extend(outs.cpu().detach().numpy())
        f_target.extend(targets.cpu().detach().numpy())
    return f_output,f_target,np.sum(losses)/len(dataloader) 

def test_fn(dataloader,model,device):
    model.eval()
    f_output = []
    # keeping the no-gradient on  
    with torch.no_grad():
      # loading the batchwise data to the model
      for d in dataloader:
        in_ids,token_ids,att_mask = d["input_ids"].to(device),d["token_type_ids"].to(device),d["attention_mask"].to(device)
        # forward propogation 
        outs = model(in_ids,token_ids,att_mask)
        # final layer with sigmoid to get values between 0 and 1
        outs = torch.sigmoid(outs)
        # getting loss,targets,output values as return 
        f_output.extend(outs.cpu().detach().numpy())
    return f_output 

# This loss combines a Sigmoid layer and the BCELoss in one single class. 
def loss_fn(out,target):
  return nn.BCEWithLogitsLoss()(out,target)

**Train Model**

In [None]:
# define number of epochs
epochs = 10
from sklearn.model_selection import train_test_split
from eval_metric import evaluate_results
from sklearn.metrics import accuracy_score

# splitting the train and test dataset 
train,val = train_test_split(train_df,test_size=0.1)

# instantiating dataset class with supplied data and labels 
t_dataset = GenreDataset(train["description"].values,train["title"].values,train[mlb.classes_].values)
# dataset and batchwise loader from pytorch 
t_loader = DataLoader(t_dataset,batch_size=32,shuffle=True,num_workers=4)

v_dataset = GenreDataset(val["description"].values,val["title"].values,val[mlb.classes_].values)
v_loader = DataLoader(v_dataset,batch_size=32,shuffle=False,num_workers=4)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
#optimizer methodology used with very low learning rate 
optimizer = torch.optim.Adam(model.parameters(),lr=1e-5)
# loading the model to device for getting the weights to device 
model.to(device)
best_loss = np.inf 
print("TRAINING STARTED...") 

for e in range(epochs):
  t_out,t_target,train_loss = train_fn(t_loader,model,optimizer,device)
  v_out,v_target,val_loss = validation_fn(v_loader,model,device)  
  # here evaluation results from the given file 
  acc,f1,fpr,fnr = evaluate_results(np.array(t_target),np.round(t_out))
  acc2,f1_2,fpr2,fnr2 = evaluate_results(np.array(v_target),np.round(v_out))
  print(f"{e+1}-")
  if val_loss < best_loss:
    torch.save(model.state_dict(),"model_bert.pth")
    best_loss = val_loss 
  # uncomment below code to get the accuracy score directly from sklearn 
  #print((accuracy_score(np.array(t_target),np.round(t_out)),accuracy_score(np.array(v_target),np.round(v_out))))

  print("train_loss:", round(train_loss,3),"train_accuracy:",round(acc,3), "train_f1:", f1, "train_fpr:", fpr, "train_fnr:", fnr) 
  print("val_loss:",round(val_loss,3),"val_accuracy:",round(acc2,3), "val_f1:", f1_2, "val_fpr:", fpr2, "val_fnr:", fnr2)


TRAINING STARTED...
1-
train_loss: 0.6 train_accuracy: 0.003 train_f1: [0.1384820239680426, 0.0588235294117647, 0.26220735785953175, 0.18352941176470586, 0.18521284540702015, 0.07969151670951158, 0.2624384909786769, 0.0, 0.0, 0.0, 0.10328638497652581, 0.10964083175803402] train_fpr: [0.15713698066639242, 0.26905829596412556, 0.28865979381443296, 0.3155402496771416, 0.12450028555111364, 0.23596792668957617, 0.14904552129221732, 0.0012432656444260257, 0.0, 0.0, 0.4085483249903735, 0.7083820662768031] train_fnr: [0.8359621451104101, 0.6666666666666666, 0.7243319268635724, 0.7247058823529412, 0.8756268806419257, 0.7596899224806202, 0.8268398268398268, 1.0, 1.0, 1.0, 0.5629139072847682, 0.366120218579235]
val_loss: 0.572 val_accuracy: 0.003 val_f1: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1398176291793313] val_fpr: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0] val_fnr: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0]
2-
train_loss: 0.603 train_acc

**Inference - Model1**

In [18]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [22]:
from torch.utils.data import Dataset,DataLoader
tokenizer_class = transformers.BertTokenizer
pretrained_weights='distilbert-base-uncased'

class GenreTestDataset(Dataset):
  """Dataset class with pretrained-embeddings from the distil-bert model
     ARGS: 
          description: total list of text description 
          title: total list of titles
          labels: one-hot encoded labels 
     Attributes:
          tokenizer: Tokenizing the text and embedding with ids.
          max_seq: maximum number of word to consider and truncate 
                   if there are more or pad with [0] they are less.
          
     Abreviations:
     N: Batch size
     Ci: input Channel size, it is 1 here 
     W: words size
     D: embedding size
     Co: output channel  
     Ks: Kernel size
     C: classes
  """
  def __init__(self,description,title):
    self.title = title
    self.description = description
    self.max_seq = 250
    self.tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
  
  def __len__(self):
    return len(self.title)
  
  def __getitem__(self,idx): 
    # convert each text input to  string without gaps and join they before tokenizing
    title = "".join(self.title[idx].split(" "))
    description = "".join(self.description[idx].split(" ")) 
    inputs = self.tokenizer(title + description, add_special_tokens=True,truncation=True,max_length=self.max_seq)
    # here input_ids means the token numbers given by the embeddings
    input_ids = inputs["input_ids"]
    # token_type_ids ususally helpful if we are using seperate two text data rather than as single text data 
    token_type_ids = inputs["token_type_ids"]
    # attention_mask will have 1:attending word and 0:padded word
    attention_mask = inputs["attention_mask"]
    # here padding with [0] if the tokens are less than the max_seq
    input_ids = input_ids + [0] * (self.max_seq - len(input_ids))
    token_type_ids = token_type_ids + [0] * (self.max_seq - len(token_type_ids))
    attention_mask = attention_mask + [0] * (self.max_seq - len(attention_mask))
    return {
        "input_ids": torch.tensor(input_ids,dtype=torch.long),
        "token_type_ids": torch.tensor(token_type_ids,dtype=torch.long),
        "attention_mask": torch.tensor(attention_mask,dtype=torch.long)
    }
#test_df = pd.read_csv("dataset/test.csv") 
test_dataset = GenreTestDataset(test_df["description"].values,test_df["title"].values)
test_loader = DataLoader(test_dataset,batch_size=32,shuffle=False,num_workers=4)
#Instantiating the cnn-bert model with arguments 
embed_dim = 768
class_num = len(mlb.classes_)
kernel_num = 3
kernel_sizes = [2, 3, 4]
dropout = 0.5

#after running this cell model will be downloaded from web (ignore errors)
model = simpleCNN(
    embed_dim=embed_dim,
    class_num=class_num,
    kernel_num=kernel_num,
    kernel_sizes=kernel_sizes,
    dropout=dropout)
model.load_state_dict(torch.load("model_bert.pth", map_location=device))
model.to(device) 
model.eval()
test_results = test_fn(test_loader,model,device)

# this function will take list of sigmoid values as input, round it to 1/0
# gets the indices of value 1 or get the index of maximum sigmoid value and return corresponding labels.
def get_labels(result_labels):
  result_indices = np.where(np.round(result_labels)==1.0)[1]
  if len(result_indices)==0:
    genre = mlb.classes_[np.argmax(result_labels)]
    return genre
  else:
    genres = mlb.classes_[np.where(np.round(result_labels)==1.0)[1]]
    return genres

test_df["genres"] = test_results
test_df["genres"] = test_df["genres"].apply(lambda x: get_labels([x]))
test_df.to_csv("test_result_cnn.csv",index=False)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing BertModel: ['distilbert.embeddings.word_embeddings.weight', 'distilbert.embeddings.position_embeddings.weight', 'distilbert.embeddings.LayerNorm.weight', 'distilbert.embeddings.LayerNorm.bias', 'distilbert.transformer.layer.0.attention.q_lin.weight', 'distilbert.transformer.layer.0.attention.q_lin.bias', 'distilbert.transformer.layer.0.attention.k_lin.weight', 'distilbert.transformer.layer.0.attention.k_lin.bias', 'distilbert.transformer.layer.0.attention.v_lin.weight', 'distilbert.transformer.layer.0.attention.v_lin.bias', 'distilbert.transformer.layer.0.attention.out_lin.weight', 'distilbert.transformer.layer.0.attention.out_lin.bias', 'distilbert.transformer.layer.0.sa_layer_norm.weight', 'distilbert.transformer.layer.0.sa_layer_norm.bias', 'distilbert.transformer.layer.0.ffn.lin1.weight', 'distilbert.transformer.layer.0.ffn.lin1.bias', 'distilbert.transformer.layer.0.ffn.lin2.weight', 'd

In [62]:
model = simpleCNN(
    embed_dim=embed_dim,
    class_num=class_num,
    kernel_num=kernel_num,
    kernel_sizes=kernel_sizes,
    dropout=dropout)
#device = "cpu"
model.load_state_dict(torch.load("model_bert.pth",map_location=device))
model.to(device) 


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing BertModel: ['distilbert.embeddings.word_embeddings.weight', 'distilbert.embeddings.position_embeddings.weight', 'distilbert.embeddings.LayerNorm.weight', 'distilbert.embeddings.LayerNorm.bias', 'distilbert.transformer.layer.0.attention.q_lin.weight', 'distilbert.transformer.layer.0.attention.q_lin.bias', 'distilbert.transformer.layer.0.attention.k_lin.weight', 'distilbert.transformer.layer.0.attention.k_lin.bias', 'distilbert.transformer.layer.0.attention.v_lin.weight', 'distilbert.transformer.layer.0.attention.v_lin.bias', 'distilbert.transformer.layer.0.attention.out_lin.weight', 'distilbert.transformer.layer.0.attention.out_lin.bias', 'distilbert.transformer.layer.0.sa_layer_norm.weight', 'distilbert.transformer.layer.0.sa_layer_norm.bias', 'distilbert.transformer.layer.0.ffn.lin1.weight', 'distilbert.transformer.layer.0.ffn.lin1.bias', 'distilbert.transformer.layer.0.ffn.lin2.weight', 'd

<All keys matched successfully>

In [64]:
device

device(type='cuda')

**Prediction - Model1**

In [None]:
description = "Set nearly a decade after the finale of the original series, this revival follows Lorelai, Rory and Emily Gilmore through four seasons of change."
title = "Gilmore Girls: A Year in the Life"
test = GenreTestDataset([description,],[title,])
test_loader = DataLoader(test,batch_size=1,shuffle=False)
result_labels = test_fn(test_loader,model,device) 
#showing the result which has higest confidence in genres
result_indices = np.where(np.round(result_labels)==1.0)[1]
if len(result_indices)==0:
  print(mlb.classes_[np.argmax(result_labels)])
else:
  print(mlb.classes_[np.where(np.round(result_labels)==1.0)[1]])

['Thriller']


### Model 2: LSTM Model with BERT pre-trained tokenizers

**Define the Architecture**

In [24]:
class GenreLSTM(nn.Module):
    
    # define all the layers used in model
    def __init__(self,embedding_dim, hidden_dim, classes, n_layers=2, 
                 bidirectional=True, dropout=0.2):
        
        #Constructor
        super().__init__()          
        
        self.bert_model = transformers.BertModel.from_pretrained(pretrained_weights)
        #lstm layer
        self.lstm = nn.LSTM(embedding_dim, 
                           hidden_dim, 
                           num_layers=n_layers, 
                           bidirectional=bidirectional, 
                           dropout=dropout,
                           batch_first=True)
        
        #dense layer 
        self.fc = nn.Linear(hidden_dim * 2, classes)

        
    def forward(self, ids,att,token):
        
        # text = [batch size,sent_length]
        embedded = self.bert_model(ids,att,token)[0]
        # embedded = [batch size, sent_len, emb dim]

        packed_output, (hidden, cell) = self.lstm(embedded)
        #hidden = [batch size, num layers * num directions,hid dim]
        #cell = [batch size, num layers * num directions,hid dim]
        
        #concat the final forward and backward hidden state
        hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)
                
        #hidden = [batch size, hid dim * num directions]
        outputs=self.fc(hidden)
        
        return outputs

**Train Model**

In [None]:
epochs = 15
from sklearn.model_selection import train_test_split 
from eval_metric import evaluate_results
from sklearn.metrics import accuracy_score

train,val = train_test_split(train_df,test_size=0.1)

t_dataset = GenreDataset(train["description"].values,train["title"].values,train[mlb.classes_].values)
t_loader = DataLoader(t_dataset,batch_size=32,shuffle=True,num_workers=4)

v_dataset = GenreDataset(val["description"].values,val["title"].values,val[mlb.classes_].values)
v_loader = DataLoader(v_dataset,batch_size=32,shuffle=False,num_workers=4)

model = GenreLSTM(768,100,len(mlb.classes_))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
optimizer = torch.optim.Adam(model.parameters(),lr=1e-5)
best_loss = np.inf 
model.to(device) 
print("TRAINING STARTED...")

for e in range(epochs):
  t_out,t_target,train_loss = train_fn(t_loader,model,optimizer,device)
  v_out,v_target,val_loss = validation_fn(v_loader,model,device)  
  acc,f1,fpr,fnr = evaluate_results(np.array(t_target),np.round(t_out))
  acc2,f1_2,fpr2,fnr2 = evaluate_results(np.array(v_target),np.round(v_out))
  print(f"{e+1}-")
  #save the model as .pth if the loss is less than the earlier epochs 
  if val_loss < best_loss:
    torch.save(model.state_dict(),"model_lstm_bert.pth")
    best_loss = val_loss
  #print((accuracy_score(np.array(t_target),np.round(t_out)),accuracy_score(np.array(v_target),np.round(v_out))))
  print("train_loss:", round(train_loss,3),"train_accuracy:",round(acc,3), "train_f1:", f1, "train_fpr:", fpr, "train_fnr:", fnr) 
  print("val_loss:",round(val_loss,3),"val_accuracy:",round(acc2,3), "val_f1:", f1_2, "val_fpr:", fpr2, "val_fnr:", fnr2)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing BertModel: ['distilbert.embeddings.word_embeddings.weight', 'distilbert.embeddings.position_embeddings.weight', 'distilbert.embeddings.LayerNorm.weight', 'distilbert.embeddings.LayerNorm.bias', 'distilbert.transformer.layer.0.attention.q_lin.weight', 'distilbert.transformer.layer.0.attention.q_lin.bias', 'distilbert.transformer.layer.0.attention.k_lin.weight', 'distilbert.transformer.layer.0.attention.k_lin.bias', 'distilbert.transformer.layer.0.attention.v_lin.weight', 'distilbert.transformer.layer.0.attention.v_lin.bias', 'distilbert.transformer.layer.0.attention.out_lin.weight', 'distilbert.transformer.layer.0.attention.out_lin.bias', 'distilbert.transformer.layer.0.sa_layer_norm.weight', 'distilbert.transformer.layer.0.sa_layer_norm.bias', 'distilbert.transformer.layer.0.ffn.lin1.weight', 'distilbert.transformer.layer.0.ffn.lin1.bias', 'distilbert.transformer.layer.0.ffn.lin2.weight', 'd

TRAINING STARTED...
1-
train_loss: 0.556 train_accuracy: 0.026 train_f1: [0.04700854700854701, 0.0, 0.10032715376226828, 0.0683111954459203, 0.0019821605550049554, 0.03636363636363636, 0.6567926455566904, 0.011049723756906075, 0.037234042553191495, 0.0, 0.011049723756906077, 0.015325670498084292] train_fpr: [0.05720164609053498, 0.003727171077152441, 0.08170254403131115, 0.03905579399141631, 0.003436426116838488, 0.010711553175210406, 0.912267657992565, 0.009128630705394191, 0.025399426464563703, 0.0, 0.011166730843280709, 0.02887241513850956] train_fnr: [0.9654088050314465, 1.0, 0.9346590909090909, 0.9569377990430622, 0.999001996007984, 0.9776119402985075, 0.08339272986457591, 0.9940828402366864, 0.9771986970684039, 1.0, 0.9933774834437086, 0.9891891891891892]
val_loss: 0.496 val_accuracy: 0.02 val_f1: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6695652173913044, 0.0, 0.0, 0.0, 0.0, 0.0] val_fpr: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0] val_fnr: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0,

**Inference - Model2**

In [25]:
from torch.utils.data import Dataset,DataLoader
tokenizer_class = transformers.BertTokenizer
pretrained_weights='distilbert-base-uncased'
class LSTMTestDataset(Dataset):
  """Test Dataset class with pretrained-embeddings from the distil-bert model.
     ARGS: 
          description: total list of text description 
          title: total list of titles
          labels: one-hot encoded labels 
     Attributes:
          tokenizer: Tokenizing the text and embedding with ids.
          max_seq: maximum number of word to consider and truncate 
                   if there are more or pad with [0] they are less.
          
     Abreviations:
     N: Batch size
     Ci: input Channel size, it is 1 here 
     W: words size
     D: embedding size
     Co: output channel  
     Ks: Kernel size
     C: classes
  """
  def __init__(self,description,title):
    self.title = title
    self.description = description
    self.max_seq = 250
    self.tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
  
  def __len__(self):
    return len(self.title)
  
  def __getitem__(self,idx): 
    #convert each text input to  string without gaps and join they before tokenizing. 
    title = "".join(self.title[idx].split(" "))
    description = "".join(self.description[idx].split(" ")) 
    inputs = self.tokenizer(title + description, add_special_tokens=True,truncation=True,max_length=self.max_seq)
    # here input_ids means the token numbers given by the embeddings
    input_ids = inputs["input_ids"]
    # token_type_ids ususally helpful if we are using seperate two text data rather than as single text data 
    token_type_ids = inputs["token_type_ids"]
    # attention_mask will have 1:attending word and 0:padded word
    attention_mask = inputs["attention_mask"]
    #here padding with [0] if the tokens are less than the max_seq
    input_ids = input_ids + [0] * (self.max_seq - len(input_ids))
    token_type_ids = token_type_ids + [0] * (self.max_seq - len(token_type_ids))
    attention_mask = attention_mask + [0] * (self.max_seq - len(attention_mask))
    return {
        "input_ids": torch.tensor(input_ids,dtype=torch.long),
        "token_type_ids": torch.tensor(token_type_ids,dtype=torch.long),
        "attention_mask": torch.tensor(attention_mask,dtype=torch.long)
    }
#test = pd.read_csv("dataset/test.csv") 
test_dataset = LSTMTestDataset(test_df["description"].values,test_df["title"].values)
test_loader = DataLoader(test_dataset,batch_size=32,shuffle=False,num_workers=4)
#Instantiating the cnn-bert model with arguments 
#instantiate the model
model = GenreLSTM(768,100,len(mlb.classes_))
# load the model to instantiated model
model.load_state_dict(torch.load("model_lstm_bert.pth"))
model.to(device) 
test_results = test_fn(test_loader,model,device)

# this function will take list of sigmoid values as input, round it to 1/0
# gets the indices of value 1 or get the index of maximum sigmoid value and return corresponding labels.
def get_labels(result_labels):
  result_indices = np.where(np.round(result_labels)==1.0)[1]
  if len(result_indices)==0:
    genre = mlb.classes_[np.argmax(result_labels)]
    return genre
  else:
    genres = mlb.classes_[np.where(np.round(result_labels)==1.0)[1]]
    return genres

test_df["genres"] = test_results
test_df["genres"] = test_df["genres"].apply(lambda x: get_labels([x]))
test_df.to_csv("test_result_lstm.csv",index=False)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing BertModel: ['distilbert.embeddings.word_embeddings.weight', 'distilbert.embeddings.position_embeddings.weight', 'distilbert.embeddings.LayerNorm.weight', 'distilbert.embeddings.LayerNorm.bias', 'distilbert.transformer.layer.0.attention.q_lin.weight', 'distilbert.transformer.layer.0.attention.q_lin.bias', 'distilbert.transformer.layer.0.attention.k_lin.weight', 'distilbert.transformer.layer.0.attention.k_lin.bias', 'distilbert.transformer.layer.0.attention.v_lin.weight', 'distilbert.transformer.layer.0.attention.v_lin.bias', 'distilbert.transformer.layer.0.attention.out_lin.weight', 'distilbert.transformer.layer.0.attention.out_lin.bias', 'distilbert.transformer.layer.0.sa_layer_norm.weight', 'distilbert.transformer.layer.0.sa_layer_norm.bias', 'distilbert.transformer.layer.0.ffn.lin1.weight', 'distilbert.transformer.layer.0.ffn.lin1.bias', 'distilbert.transformer.layer.0.ffn.lin2.weight', 'd

**Prediction - Model2**

In [None]:
# get the description and title as string  
description = "After contracting HIV from a tainted blood treatment, teenaged hemophiliac Ryan White is forced to fight for his right to attend public school."
title = "The Ryan White Story"
test = GenreTestDataset([description,],[title,])
test_loader = DataLoader(test,batch_size=1,shuffle=False)
result_labels = test_fn(test_loader,model,device)
# show the results from the labels-names, which are rounded to 1
result_indices = np.where(np.round(result_labels)==1.0)[1]
if len(result_indices)==0:
  print(mlb.classes_[np.argmax(result_labels)])
else:
  print(mlb.classes_[np.where(np.round(result_labels)==1.0)[1]])

['International']


### Model 3: GPT-2 model with LSTM using GPT-2 pre-trained tokenizers

**Dataset Module**

In [27]:
from transformers import GPT2Tokenizer

class GPTDataset(Dataset):
  """ This is dataset module for gpt and 
      everything is similar to the earliar dataset modules seen
      Except the tokenizer and token_ids are not used 
      Tokenizer used here is GPT2  
  """
  def __init__(self,description,title,labels):
    self.title = title
    self.description = description
    self.labels = labels 
    self.max_seq = 250
    self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
  
  def __len__(self):
    return len(self.title)
  
  def __getitem__(self,idx): 
    title = "".join(self.title[idx].split())
    description = "".join(self.description[idx].split()) 
    labels = self.labels[idx,:]
    inputs = self.tokenizer(title+description, add_special_tokens=True,truncation=True,max_length=self.max_seq)
    input_ids = inputs["input_ids"]
    token_type_ids = [0,]
    attention_mask = inputs["attention_mask"]
    input_ids = input_ids + [0] * (self.max_seq - len(input_ids))
    token_type_ids = token_type_ids + [0] * (self.max_seq - len(token_type_ids))
    attention_mask = attention_mask + [0] * (self.max_seq - len(attention_mask))
    return {
        "input_ids": torch.tensor(input_ids,dtype=torch.long),
        "token_type_ids": torch.tensor(token_type_ids,dtype=torch.long),
        "attention_mask": torch.tensor(attention_mask,dtype=torch.long),
        "labels": torch.tensor(labels,dtype=torch.float)
    } 

**Define the Architecture**

In [28]:
from transformers import GPT2Tokenizer, GPT2Model
class GPTModel(nn.Module):
    """Module with GPT2 model and pretrained from transformers-huggingface.
     ARGS: 
          embedding_dim: embedding dimention is 768 even for gpt models
          class_num: ouput label classes 
          hidden_dim: hidden dimentions for lstm models
    """
    # define all the layers used in model
    def __init__(self,embedding_dim,hidden_dim, classes):
        
        # Constructor
        super().__init__()          
        # pre-trained gpt2 model, transfer learning 
        self.model = GPT2Model.from_pretrained('gpt2')  
        # lstm layer with 2-layers 
        self.lstm = nn.LSTM(embedding_dim, 
                           hidden_dim, 
                           num_layers=2, 
                           bidirectional=True, 
                           dropout=0.2,
                           batch_first=True)     
        # dense layer
        self.fc = nn.Linear(hidden_dim*2, classes)

        
    def forward(self, in_ids,token,att_mask):
        
        #text = [batch size,sent_length]
        embedded = self.model(input_ids=in_ids,attention_mask=att_mask)[0]              
        packed_output, (hidden, cell) = self.lstm(embedded)
        #hidden = [batch size, num layers * num directions,hid dim]
        #cell = [batch size, num layers * num directions,hid dim]
        
        #concat the final forward and backward hidden state
        hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1) 
                
        #hidden = [batch size, hid dim * num directions]
        outputs=self.fc(hidden)
        return outputs 

**Train Model**

In [None]:
epochs = 15
from sklearn.model_selection import train_test_split
from eval_metric import evaluate_results
from sklearn.metrics import accuracy_score

# splitting the train and test dataset 
train,val = train_test_split(train_df,test_size=0.1)

# instantiating dataset class with supplied data and labels 
t_dataset = GPTDataset(train["description"].values,train["title"].values,train[mlb.classes_].values)
# dataset and batchwise loader from pytorch 
t_loader = DataLoader(t_dataset,batch_size=16,shuffle=True,num_workers=4)

v_dataset = GPTDataset(val["description"].values,val["title"].values,val[mlb.classes_].values)
v_loader = DataLoader(v_dataset,batch_size=16,shuffle=False,num_workers=4)

model = GPTModel(768,100,len(mlb.classes_))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
#optimizer methodoly used with very low learning rate 
optimizer = torch.optim.Adam(model.parameters(),lr=1e-5)
# loadin the model to device for getting the weights to device 
model.to(device) 
best_loss = np.inf
print("TRAINING STARTED...")

for e in range(epochs):
  t_out,t_target,train_loss = train_fn(t_loader,model,optimizer,device)
  v_out,v_target,val_loss = validation_fn(v_loader,model,device)  
  # here evaluation results from the given file 
  acc,f1,fpr,fnr = evaluate_results(np.array(t_target),np.round(t_out))
  acc2,f1_2,fpr2,fnr2 = evaluate_results(np.array(v_target),np.round(v_out))
  print(f"{e+1}-")
  if val_loss < best_loss:
    torch.save(model.state_dict(),"model_gpt.pth")
    best_loss = val_loss
  #uncomment below code to get the accuracy score directly from sklearn 
  #print((accuracy_score(np.array(t_target),np.round(t_out)),accuracy_score(np.array(v_target),np.round(v_out))))
  print("train_loss:", round(train_loss,3),"train_accuracy:",round(acc,3), "train_f1:", f1, "train_fpr:", fpr, "train_fnr:", fnr) 
  print("val_loss:",round(val_loss,3),"val_accuracy:",round(acc2,3), "val_f1:", f1_2, "val_fpr:", fpr2, "val_fnr:", fnr2)

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=1042301.0, style=ProgressStyle(descript…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=456318.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=665.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=548118077.0, style=ProgressStyle(descri…




Some weights of GPT2Model were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


TRAINING STARTED...
1-
train_loss: 0.62 train_accuracy: 0.012 train_f1: [0.0060790273556231, 0.03978779840848806, 0.0027322404371584695, 0.004830917874396135, 0.09573542210617929, 0.0, 0.5763779527559056, 0.0, 0.16132858837485173, 0.02586206896551724, 0.013157894736842106, 0.03902439024390244] train_fpr: [0.00288421920065925, 0.25121133060007456, 0.0009905894006934125, 0.0004280821917808219, 0.04723502304147465, 0.0, 0.6347305389221557, 0.00041203131437989287, 0.19271685761047463, 0.040015243902439025, 0.0007695267410542517, 0.005466614603670442] train_fnr: [0.9968847352024922, 0.7692307692307693, 0.9986282578875172, 0.9975728155339806, 0.9456521739130435, 1.0, 0.3519830028328612, 1.0, 0.7763157894736842, 0.9758064516129032, 0.9932885906040269, 0.9786096256684492]
val_loss: 0.539 val_accuracy: 0.026 val_f1: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6267281105990783, 0.0, 0.0, 0.0, 0.0, 0.0] val_fpr: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.9503105590062112, 0.0, 0.0, 0.0, 0.0, 0.0] val_fnr: [1.0, 1.0, 

**Inference - Model3**

In [29]:
from transformers import GPT2Tokenizer

class GPTTestDataset(Dataset):
  """ This is dataset module for gpt and 
      everything is similar to the earliar dataset modules seen. 
      Except the tokenizer and token_ids are not used 
      Tokenizer used here is GPT2  
  """
  def __init__(self,description,title):
    self.title = title
    self.description = description
    self.max_seq = 250
    self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
  
  def __len__(self):
    return len(self.title)
  
  def __getitem__(self,idx): 
    title = "".join(self.title[idx].split())
    description = "".join(self.description[idx].split()) 
    inputs = self.tokenizer(title+description, add_special_tokens=True,truncation=True,max_length=self.max_seq)
    input_ids = inputs["input_ids"]
    token_type_ids = [0,] 
    attention_mask = inputs["attention_mask"]
    input_ids = input_ids + [0] * (self.max_seq - len(input_ids))
    token_type_ids = token_type_ids + [0] * (self.max_seq - len(token_type_ids))
    attention_mask = attention_mask + [0] * (self.max_seq - len(attention_mask))
    return {
        "input_ids": torch.tensor(input_ids,dtype=torch.long),
        "token_type_ids": torch.tensor(token_type_ids,dtype=torch.long),
        "attention_mask": torch.tensor(attention_mask,dtype=torch.long),
    } 
#test = pd.read_csv("dataset/test.csv") 
test_dataset = GPTTestDataset(test_df["description"].values,test_df["title"].values)
test_loader = DataLoader(test_dataset,batch_size=32,shuffle=False,num_workers=4)
#Instantiating the cnn-bert model with arguments 
model = GPTModel(768,100,len(mlb.classes_))
model.load_state_dict(torch.load("model_gpt.pth"))
model.to(device) 
test_results = test_fn(test_loader,model,device) 

# this function will take list of sigmoid values as input, round it to 1/0
# gets the indices of value 1 or get the index of maximum sigmoid value and return corresponding labels.
def get_labels(result_labels):
  result_indices = np.where(np.round(result_labels)==1.0)[1]
  if len(result_indices)==0:
    genre = mlb.classes_[np.argmax(result_labels)]
    return genre
  else:
    genres = mlb.classes_[np.where(np.round(result_labels)==1.0)[1]]
    return genres

test_df["genres"] = test_results
test_df["genres"] = test_df["genres"].apply(lambda x: get_labels([x]))
test_df.to_csv("test_result_gpt.csv",index=False)

Some weights of GPT2Model were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Prediction - Model3**

In [None]:
# get the description and title as string  
description = "Banished to a wasteland of undesirables, a young woman struggles to find her feet among a drug-soaked desert society and an enclave of cannibals."
title = "The Bad Batch"
test = GPTTestDataset([description,],[title,])
test_loader = DataLoader(test,batch_size=1,shuffle=False)
result_labels = test_fn(test_loader,model,device)
# show the results from the labels-names, which are rounded to 1
result_indices = np.where(np.round(result_labels)==1.0)[1]
if len(result_indices)==0:
  print(mlb.classes_[np.argmax(result_labels)])
else:
  print(mlb.classes_[np.where(np.round(result_labels)==1.0)[1]])

['Drama' 'International']


### Conlcusion

Overall deep learning models didn't perform well on the given dataset. This could be attributed to the fact that the dataset size was pretty small.

Also, Model 3 based on accuracy performed better compared to other models. This is possibly because GPT-2 tokenizer is pre-trained on vast set of resources while BERT tokenizer is limited to Wikipedia only.